Indexing Methods for Web Archives

Avishek ANAND
Max-Planck-Institut für Informatik - D5
Friday, 6 September 2013
Numerous recent efforts to digitize previously published content and to preserve born-digital content have led to the widespread growth of large text repositories. Web archives are such continuously growing text collections, containing versions of documents that span long time periods. They present many opportunities for historical, cultural and political analyses. Consequently, there is a growing need for tools that can efficiently access and search them.

In this work, we are interested in indexing methods that support text-search workloads over web archives, such as time-travel queries and phrase queries. To this end, we make the following contributions:

• Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return the document versions that existed at the specified time in the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup. We also propose index-maintenance approaches which scale to such continuously growing collections.

• We develop a query-optimization technique for time-travel queries, called partition selection, which maximizes recall at any given query-execution stage.

• We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
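
To make the semantics of a time-travel query concrete, the following is a minimal sketch of a naive baseline, not the index sharding or partition selection methods proposed in this work. It assumes each posting carries a validity interval [begin, end) for one document version, and that time points are encoded as integers of the form YYYYMM. The names Posting, TimeTravelIndex, add_version and query are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: str
    begin: int  # first time point at which the version is valid (inclusive)
    end: int    # time point at which the version is superseded (exclusive)


class TimeTravelIndex:
    """Naive inverted index whose postings carry validity intervals (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add_version(self, doc_id, text, begin, end):
        # One posting per distinct term of this document version.
        for term in set(text.lower().split()):
            self.postings[term].append(Posting(doc_id, begin, end))

    def query(self, keywords, q_begin, q_end):
        # Return versions that contain every keyword and whose validity
        # interval overlaps the query interval [q_begin, q_end).
        result = None
        for term in keywords.lower().split():
            matches = {
                p for p in self.postings.get(term, [])
                if p.begin < q_end and q_begin < p.end
            }
            result = matches if result is None else result & matches
        return result or set()


# Assumed toy data: two versions of d1 and one version of d2.
idx = TimeTravelIndex()
idx.add_version("d1", "mpii saarland informatics", 200801, 200912)
idx.add_version("d1", "mpii saarland university campus", 200912, 201306)
idx.add_version("d2", "saarland tourism", 200901, 201001)

# "mpii saarland" @ [06/2009], encoded as the interval [200906, 200907)
hits = idx.query("mpii saarland", 200906, 200907)
```

In this sketch every posting of a term is scanned and filtered by time, regardless of the query interval. Avoiding that kind of wasted access without replicating postings is the problem the index organization described above is aimed at.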

We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.


