MPI-I-2007-5-002
A time machine for text search
Berberich, Klaus and Bedathur, Srikanta and Neumann, Thomas and Weikum, Gerhard
July 2007, 39 pages.
Status: available - back from printing
Text search over temporally versioned document collections such as
web archives has received little attention as a research problem.
As a consequence, there is no scalable and principled solution to
search such a collection as of a specified time t. In this work,
we address this shortcoming and propose an efficient solution for
time-travel text search by extending the inverted file index
to make it ready for temporal search. We introduce approximate
temporal coalescing as a tunable method to reduce the index size
without significantly affecting the quality of results. In order to
further improve the performance of time-travel queries, we introduce
two principled techniques to trade off index size against query
performance. These techniques can be formulated as optimization
problems that can be solved to near-optimality. Finally, our
approach is evaluated in a comprehensive series of experiments on
two large-scale real-world datasets. Results unequivocally show
that our methods make it possible to build an efficient "time
machine" scalable to large versioned text collections.
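
The two core ideas of the abstract, an inverted index whose postings carry a validity time interval and approximate temporal coalescing, can be illustrated with a minimal sketch. It is not taken from the report: the `Posting` layout, the greedy merge rule, the relative error bound `eps`, and the names `coalesce` and `search_as_of` are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Posting:
    """Hypothetical posting layout: one document version in a term's list."""
    doc_id: str
    score: float  # payload, e.g. a term-frequency-based score
    begin: int    # start of validity (inclusive)
    end: int      # end of validity (exclusive)


def coalesce(postings: List[Posting], eps: float) -> List[Posting]:
    """Greedily merge temporally adjacent postings of the same document.

    Assumed rule: a posting joins the current run if its score differs from
    the run's representative score (the first score in the run) by at most
    a relative error of eps. A larger eps yields a smaller index at the
    cost of less exact scores.
    """
    merged: List[Posting] = []
    for p in sorted(postings, key=lambda q: (q.doc_id, q.begin)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.doc_id == p.doc_id
            and last.end == p.begin
            and abs(p.score - last.score) <= eps * abs(last.score)
        ):
            last.end = p.end  # extend the run, keep the representative score
        else:
            merged.append(Posting(p.doc_id, p.score, p.begin, p.end))
    return merged


def search_as_of(index: Dict[str, List[Posting]], term: str, t: int) -> List[Posting]:
    """Return the postings of `term` whose validity interval contains time t."""
    return [p for p in index.get(term, []) if p.begin <= t < p.end]
```

In this sketch, setting `eps = 0` merges only adjacent versions with identical scores, while larger values shrink the posting lists further; a time-travel query then scans a term's list and keeps only the postings valid at time t.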
- Attachment: MPI-2007-5-002.pdf (288 KBytes)
URL to this document: https://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2007-5-002
BibTeX
@TECHREPORT{BerberichBedathurNeumannWeikum2007,
AUTHOR = {Berberich, Klaus and Bedathur, Srikanta and Neumann, Thomas and Weikum, Gerhard},
TITLE = {A time machine for text search},
TYPE = {Research Report},
INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
NUMBER = {MPI-I-2007-5-002},
MONTH = {July},
YEAR = {2007},
ISSN = {0946-011X},
}