MPI-I-2010-5-003
Efficient temporal keyword queries over versioned text
Anand, Avishek and Bedathur, Srikanta and Berberich, Klaus and Schenkel, Ralf
November 2010, 39 pages.
Status: available - back from printing
Modern text analytics applications operate on large volumes of
temporal text data such as Web archives, newspaper archives, blogs,
wikis, and micro-blogs. In these settings, searching and mining
need to use constraints on the time dimension in addition to
keyword constraints. A natural approach to address such queries is
using an inverted index whose entries are enriched with valid-time
intervals. It has been shown that these indexes have to be
partitioned along time in order to achieve efficiency. However, when
the temporal predicate corresponds to a long time range requiring
the processing of multiple partitions, naive query processing
incurs a high cost from reading redundant entries across partitions.
We present a framework for efficient approximate processing of
keyword queries over a temporally partitioned inverted index which
minimizes this overhead, thus speeding up query processing. By using
a small synopsis for each partition we identify partitions that
maximize the number of final non-redundant results, and schedule
them for processing early on. Our approach aims to balance the
estimated gains in the final result recall against the cost of index
reading required. We present practical algorithms for the resulting
optimization problem of index partition selection. Our experiments
with three diverse, large-scale text archives reveal that our
proposed approach can provide close to 80% result recall even when
only about half the index is allowed to be read.
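
The partition selection problem described above can be illustrated with a small sketch. It is not the report's algorithm: it assumes a simple greedy benefit/cost heuristic, models reading cost as the number of postings in a partition, and uses exact posting sets where the report uses small per-partition synopses to estimate overlap. The names `Partition` and `select_partitions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One temporal partition of a term's posting list (hypothetical model)."""
    begin: int            # first time point covered, inclusive
    end: int              # end of the covered range, exclusive
    postings: frozenset   # ids of the document versions stored here

    @property
    def cost(self) -> int:
        # Assumption: reading cost equals the number of postings scanned.
        return len(self.postings)


def select_partitions(partitions, query_begin, query_end, budget):
    """Greedily pick partitions with the best new-results-per-cost ratio.

    Exact sets stand in for the synopses; a real system would only
    estimate how many unseen results a partition contributes.
    """
    # Only partitions overlapping the query time range are candidates.
    remaining = [p for p in partitions
                 if p.begin < query_end and query_begin < p.end and p.cost > 0]
    seen, chosen, spent = set(), [], 0
    while True:
        best, best_ratio = None, 0.0
        for p in remaining:
            if spent + p.cost > budget:
                continue  # reading this partition would exceed the budget
            ratio = len(p.postings - seen) / p.cost
            if ratio > best_ratio:
                best, best_ratio = p, ratio
        if best is None:
            break  # nothing affordable adds a new result
        remaining.remove(best)
        chosen.append(best)
        seen |= best.postings
        spent += best.cost
    return chosen, seen, spent


# Versions 3-6 span partition boundaries, so they are stored more than once.
parts = [
    Partition(0, 10, frozenset({1, 2, 3, 4})),
    Partition(10, 20, frozenset({3, 4, 5, 6})),
    Partition(20, 30, frozenset({5, 6, 7})),
    Partition(30, 40, frozenset({8})),
]
chosen, results, spent = select_partitions(parts, 0, 30, budget=8)
print([(p.begin, p.end) for p in chosen], sorted(results), spent)
```

In this toy input the middle partition holds only entries that are also stored in its neighbours, so skipping it loses no results while reading 7 of the 11 postings that overlap the query range. With estimated rather than exact overlaps, the same trade-off yields approximate recall.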
URL to this document: https://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-003
BibTeX
@TECHREPORT{AnandBedathurBerberichSchenkel2010,
AUTHOR = {Anand, Avishek and Bedathur, Srikanta and Berberich, Klaus and Schenkel, Ralf},
TITLE = {Efficient temporal keyword queries over versioned text},
TYPE = {Research Report},
INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
NUMBER = {MPI-I-2010-5-003},
MONTH = {November},
YEAR = {2010},
ISSN = {0946-011X},
}