In this work, we study indexing methods that support text-search workloads over web archives, such as time-travel queries and phrase queries. To this end, we make the following contributions:
• Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return the document versions that existed at the specified time in the past. We introduce a novel index-organization strategy, called index sharding, that supports time-travel queries efficiently without incurring additional index-size blowup. We also propose index-maintenance approaches that scale to continuously growing web archives.
• We develop query-optimization techniques for time-travel queries, called partition selection, that maximize recall at any given stage of query execution.
• We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.
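To make the semantics of a time-travel query concrete, the following minimal sketch evaluates a conjunctive keyword query with a temporal predicate over a toy in-memory index. It illustrates only the query semantics, not index sharding or partition selection. The `Posting` layout, the half-open validity interval, the integer `YYYYMM` timestamps, the function name `time_travel_query`, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One document version in a term's posting list (hypothetical layout)."""
    doc_id: str
    begin: int  # first month (YYYYMM) in which this version is valid, inclusive
    end: int    # month (YYYYMM) in which this version is superseded, exclusive


def time_travel_query(index, terms, t_begin, t_end):
    """Return the versions that contain every term and overlap [t_begin, t_end)."""
    result = None
    for term in terms:
        # Keep only versions whose validity interval overlaps the query interval.
        matches = {
            (p.doc_id, p.begin, p.end)
            for p in index.get(term, [])
            if p.begin < t_end and t_begin < p.end
        }
        # Conjunctive semantics: a version must appear under every query term.
        result = matches if result is None else result & matches
    return sorted(result or [])


# Toy index: term -> posting list of document versions (made-up data).
toy_index = {
    "mpii": [
        Posting("d1", 200901, 200908),
        Posting("d2", 200810, 200905),
    ],
    "saarland": [
        Posting("d1", 200901, 200908),
        Posting("d2", 200810, 200905),
        Posting("d3", 200906, 200912),
    ],
}

# "mpii saarland" @ [06/2009], expressed as the half-open interval [200906, 200907).
hits = time_travel_query(toy_index, ["mpii", "saarland"], 200906, 200907)
print(hits)
```

In this toy data, only the version of `d1` contains both terms and is valid in June 2009; `d2` was superseded before the query interval, and `d3` lacks the term "mpii".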