MPI-INF/SWS Research Reports 1991-2021


MPI-I-2006-5-001

Overlap-aware global df estimation in distributed information retrieval systems

Bender, Matthias and Michel, Sebastian and Weikum, Gerhard and Triantafillou, Peter

January 2006, 25 pages.

Status: available - back from printing

Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures about the entire document collection as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably, global document frequencies of the individual terms, would be highly beneficial for ranking global search results that are compiled from different peers. This paper develops an efficient and scalable method for estimating global document frequencies in a large-scale, highly dynamic P2P network with autonomous peers. The main difficulty that is addressed in this paper is that the local collections of different peers may arbitrarily overlap, as many peers may choose to gather popular documents that fall into their specific interest profile. Our method is based on hash sketches as an underlying technique for compact data synopses, and exploits specific properties of hash sketches for duplicate elimination in the counting process. We report on experiments with real Web data that demonstrate the accuracy of our estimation method and also the benefit for better search result ranking.

  • Attachment: MPI-I-2006-5-001.pdf (570 KBytes)

URL to this document: https://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-001

BibTeX
@TECHREPORT{BenderMichelWeikumTriantafilou2006,
  AUTHOR = {Bender, Matthias and Michel, Sebastian and Weikum, Gerhard and Triantafillou, Peter},
  TITLE = {Overlap-aware global df estimation in distributed information retrieval systems},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2006-5-001},
  MONTH = {January},
  YEAR = {2006},
  ISSN = {0946-011X},
}