Max-Planck-Institut für Informatik


A time machine for text search

Berberich, Klaus and Bedathur, Srikanta and Neumann, Thomas and Weikum, Gerhard

MPI-I-2007-5-002. July 2007, 39 pages. | Status: available - back from printing

Abstract in LaTeX format:
Text search over temporally versioned document collections such as
web archives has received little attention as a research problem.
As a consequence, there is no scalable and principled solution to
search such a collection as of a specified time t. In this work,
we address this shortcoming and propose an efficient solution for
time-travel text search by extending the inverted file index
to make it ready for temporal search. We introduce approximate
temporal coalescing as a tunable method to reduce the index size
without significantly affecting the quality of results. In order to
further improve the performance of time-travel queries, we introduce
two principled techniques that trade off index size against query
performance. These techniques can be formulated as optimization
problems that can be solved to near-optimality. Finally, our
approach is evaluated in a comprehensive series of experiments on
two large-scale real-world datasets. Results unequivocally show
that our methods make it possible to build an efficient "time
machine" scalable to large versioned text collections.
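
The abstract does not spell out the index layout, so the following Python sketch is only an illustration under stated assumptions, not the report's implementation. It assumes each posting carries a validity interval [begin, end) and a numeric score, that a query "as of time t" returns the postings whose interval contains t, and that approximate temporal coalescing merges temporally adjacent postings of the same document when their scores differ by at most a relative error eps. The names Posting, coalesce, TimeTravelIndex, and eps are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: str
    score: float  # payload, e.g. a term-frequency-based score (assumption)
    begin: int    # first time point at which this version is valid
    end: int      # first time point at which it is no longer valid


def coalesce(postings, eps):
    """Greedily merge temporally adjacent postings of ONE document.

    A posting joins the current run if it starts exactly where the run
    ends and the run's representative score approximates its own score
    within relative error eps. eps = 0 merges only identical scores.
    """
    merged = []
    for p in sorted(postings, key=lambda p: p.begin):
        last = merged[-1] if merged else None
        if (last is not None and last.end == p.begin
                and abs(p.score - last.score) <= eps * abs(p.score)):
            last.end = p.end  # extend the interval, keep the run's score
        else:
            merged.append(Posting(p.doc_id, p.score, p.begin, p.end))
    return merged


class TimeTravelIndex:
    """Toy inverted index whose postings carry validity intervals."""

    def __init__(self, eps=0.0):
        self.eps = eps
        self.lists = defaultdict(list)  # term -> list of Posting

    def add_version(self, doc_id, term_scores, begin, end):
        # One posting per term for the document version valid in [begin, end).
        for term, score in term_scores.items():
            self.lists[term].append(Posting(doc_id, score, begin, end))

    def finalize(self):
        # Coalesce each term's postings separately for every document.
        for term in list(self.lists):
            by_doc = defaultdict(list)
            for p in self.lists[term]:
                by_doc[p.doc_id].append(p)
            self.lists[term] = [q for ps in by_doc.values()
                                for q in coalesce(ps, self.eps)]

    def search(self, term, t):
        # Return (doc_id, score) pairs valid at time t, best score first.
        hits = [(p.doc_id, p.score) for p in self.lists.get(term, [])
                if p.begin <= t < p.end]
        return sorted(hits, key=lambda h: -h[1])
```

A larger eps yields fewer, longer postings at the cost of less exact scores, which is one way to read the "tunable" size reduction the abstract mentions. The linear scan in search is kept for brevity; the techniques the report proposes for speeding up time-travel queries are not modeled here.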

To download this research report, please select the type of document that best fits your needs.
Attachment Size(s):
MPI-2007-5-002.pdf, 288 KBytes
BibTeX:
  AUTHOR = {Berberich, Klaus and Bedathur, Srikanta and Neumann, Thomas and Weikum, Gerhard},
  TITLE = {A time machine for text search},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2007-5-002},
  MONTH = {July},
  YEAR = {2007},
  ISSN = {0946-011X},