MPI-INF Logo
MPI-INF/SWS Research Reports 1991-2021

2. Number - All Departments

MPI-I-2007-5-002

A time machine for text search

Berberich, Klaus and Bedathur, Srikanta and Neumann, Thomas and Weikum, Gerhard

July 2007, 39 pages.

.
Status: available - back from printing

Text search over temporally versioned document collections such as web archives has received little attention as a research problem. As a consequence, there is no scalable and principled solution to search such a collection as of a specified time t. In this work, we address this shortcoming and propose an efficient solution for time-travel text search by extending the inverted file index to make it ready for temporal search. We introduce approximate temporal coalescing as a tunable method to reduce the index size without significantly affecting the quality of results. In order to further improve the performance of time-travel queries, we introduce two principled techniques to trade off index size for its performance. These techniques can be formulated as optimization problems that can be solved to near-optimality. Finally, our approach is evaluated in a comprehensive series of experiments on two large-scale real-world datasets. Results unequivocally show that our methods make it possible to build an efficient "time machine" scalable to large versioned text collections.

  • MPI-2007-5-002.pdf
  • Attachement: MPI-2007-5-002.pdf (288 KBytes)

URL to this document: https://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2007-5-002

Hide details for BibTeXBibTeX
@TECHREPORT{BerberichBedathurNeumannWeikum2007,
  AUTHOR = {Berberich, Klaus and Bedathur, Srikanta and Neumann, Thomas and Weikum, Gerhard},
  TITLE = {A time machine for text search},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2007-5-002},
  MONTH = {July},
  YEAR = {2007},
  ISSN = {0946-011X},
}