Thesis

Doctoral dissertation | @PhdThesis{Anand2013, ... | Doktorarbeit

Anand, Avishek

Indexing Methods for Web Archives

Universität des Saarlandes, September, 2013, 179 pages
Universität des Saarlandes
Saarbrücken

There have been numerous efforts recently to digitize previously published content and to preserve born-digital content, leading to the widespread growth of large text repositories. Web archives are such continuously growing text collections, containing versions of documents that span long time periods. Web archives present many opportunities for historical, cultural and political analyses. Consequently, there is a growing need for tools that can efficiently access and search them.
In this work, we are interested in indexing methods for supporting text-search workloads over web archives, such as time-travel queries and phrase queries. To this end, we make the following contributions:
Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return versions of documents in the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup.
We also propose index-maintenance approaches which scale to such continuously growing collections.
We develop a query-optimization technique for time-travel queries, called partition selection, which maximizes recall at any given query-execution stage.
We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.
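The semantics of a time-travel query can be illustrated with a toy sketch. It is not the index sharding or partition selection method of the thesis; it only shows what a keyword query with a temporal predicate returns. It assumes each document version carries a validity interval [begin, end) encoded as YYYYMM integers, and all names (Posting, build_index, time_travel_query) are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    doc_id: str
    begin: int  # first time point at which the version is valid (inclusive)
    end: int    # time point at which the version stops being valid (exclusive)


def build_index(versions):
    """Build a term -> postings map from (doc_id, begin, end, text) tuples."""
    index = defaultdict(list)
    for doc_id, begin, end, text in versions:
        for term in set(text.lower().split()):
            index[term].append(Posting(doc_id, begin, end))
    return index


def time_travel_query(index, terms, t):
    """Return (doc_id, begin) of every version valid at time t that contains all terms."""
    result = None
    for term in terms:
        hits = {
            (p.doc_id, p.begin)
            for p in index.get(term.lower(), [])
            if p.begin <= t < p.end
        }
        result = hits if result is None else result & hits
    return sorted(result or [])


# Illustrative data only (assumption): time points are YYYYMM integers.
versions = [
    ("d1", 200801, 200905, "mpii saarland informatics"),
    ("d1", 200905, 201001, "mpii saarland web archives"),
    ("d2", 200901, 201201, "saarland university"),
]
index = build_index(versions)

# "mpii saarland" @ [06/2009]
print(time_travel_query(index, ["mpii", "saarland"], 200906))  # [('d1', 200905)]
```

In this sketch every posting list is scanned in full and filtered by time, which is the cost that a temporally organized index aims to avoid.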

Public
Dr.-Ing. Klaus Berberich
Prof. Dr.-Ing. Kjetil Nørvåg
Prof. Dr.-Ing. Gerhard Weikum
Completed
6 September 2013
Max-Planck-Institut für Informatik
IMPRS-CS
Expert
MPII WWW Server, MPII FTP Server, MPG publications list, university publications list, working group publication list, Fachbeirat


BibTeX Entry:
@PHDTHESIS{Anand2013,
AUTHOR = {Anand, Avishek},
TITLE = {Indexing Methods for Web Archives},
PUBLISHER = {Universit{\"a}t des Saarlandes},
SCHOOL = {Universit{\"a}t des Saarlandes},
YEAR = {2013},
TYPE = {Doctoral dissertation},
PAGES = {179},
ADDRESS = {Saarbr{\"u}cken},
MONTH = {September},
}





Entry last modified by Stephanie Jörg, 01/20/2015
Edit History

Editor(s): [Library]
Created: 09/13/2013 01:28:18 PM

Revision  Editor          Edit Date
1         Stephanie Jörg  01/14/2014 01:00:42 PM
0         Aaron Alsancak  09/13/2013 01:36:55 PM