Campus Event Calendar

Event Entry

What and Who

Indexing Methods for Web Archives

Avishek ANAND
Max-Planck-Institut für Informatik - D5
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience

Date, Time and Location

Friday, 6 September 2013
60 Minutes
E1 4


There have been numerous recent efforts to digitize previously published content and to preserve born-digital content, leading to the widespread growth of large text repositories. Web archives are one such kind of continuously growing text collection: they contain versions of documents spanning long time periods. Web archives present many opportunities for historical, cultural and political analyses. Consequently, there is a growing need for tools that can efficiently access and search them.

In this work, we are interested in indexing methods that support text-search workloads over web archives, such as time-travel queries and phrase queries. To this end, we make the following contributions:

• Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return versions of documents in the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup. We also propose index-maintenance approaches which scale to such continuously growing collections.

• We develop a query-optimization technique for time-travel queries, called partition selection, which maximizes recall at any given query-execution stage.

• We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
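
To make the notion of a time-travel query concrete, the following is a minimal sketch of a naive baseline: an inverted index whose postings carry a validity interval, filtered by the query time. It is not the index sharding or partition selection approach from the talk. The names Posting, build_index and time_travel_query, the YYYYMM integer encoding of months, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One document version (hypothetical structure for this sketch)."""
    doc_id: str
    begin: int  # first month (YYYYMM) in which this version was valid
    end: int    # first month in which it was no longer valid (exclusive)


def build_index(versions):
    """Build a term -> postings map from (doc_id, begin, end, text) tuples."""
    index = defaultdict(list)
    for doc_id, begin, end, text in versions:
        # Index each distinct term of the version once.
        for term in set(text.lower().split()):
            index[term].append(Posting(doc_id, begin, end))
    return index


def time_travel_query(index, keywords, t):
    """Return the versions valid at time t that contain every keyword."""
    result = None
    for term in keywords:
        # Naive step: scan the whole posting list and keep versions alive at t.
        alive = {p for p in index.get(term, []) if p.begin <= t < p.end}
        result = alive if result is None else result & alive
    return result or set()


# Made-up sample data: two versions of d1 and one version of d2.
versions = [
    ("d1", 200801, 200906, "mpii saarland research"),
    ("d1", 200906, 201301, "mpii saarland informatics campus"),
    ("d2", 200901, 201001, "saarland university"),
]
index = build_index(versions)

# Corresponds to the query "mpii saarland" @ [06/2009].
hits = time_travel_query(index, ["mpii", "saarland"], 200906)
```

The sketch scans every posting of a term regardless of the query time, which is wasteful on long-lived collections; avoiding this kind of work without inflating the index is the problem the contributions above address.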

We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.


Petra Schaaf

Petra Schaaf, 08/28/2013 09:47 -- Created document.