Max-Planck-Institut für Informatik


Overlap-aware global df estimation in distributed information retrieval systems

Bender, Matthias and Michel, Sebastian and Weikum, Gerhard and Triantafilou, Peter

MPI-I-2006-5-001. January 2006, 25 pages. | Status: available - back from printing

Abstract in LaTeX format:
Peer-to-Peer (P2P) search engines and other forms of distributed
information retrieval (IR) are gaining momentum. Unlike in centralized
IR, it is difficult and expensive to compute statistical measures about
the entire document collection as it is widely distributed across many
computers in a highly dynamic network. On the other hand, such
network-wide statistics, most notably, global document frequencies of
the individual terms, would be highly beneficial for ranking global
search results that are compiled from different peers.
This paper develops an efficient and scalable method for estimating
global document frequencies in a large-scale, highly dynamic P2P network
with autonomous peers. The main difficulty that is addressed in this
paper is that the local collections of different peers
may arbitrarily overlap, as many peers may choose to gather popular
documents that fall into their specific interest profile.
Our method is based on hash sketches as an underlying technique for
compact data synopses, and exploits specific properties of hash sketches
for duplicate elimination in the counting process.
We report on experiments with real Web data that demonstrate the
accuracy of our estimation method and its benefit for search
result ranking.
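
The duplicate elimination that the abstract attributes to hash sketches can be illustrated with a minimal Python sketch. It is not the report's implementation: it assumes Flajolet-Martin style sketches with stochastic averaging, and the function names, the SHA-1 hash, and the parameters NUM_BITS and NUM_MAPS are illustrative choices. The point it shows is that the union of two sketches is a bitwise OR, so a document held by several peers sets the same bit each time and is counted only once.

```python
import hashlib

NUM_BITS = 32   # bits per bitmap (assumed, not taken from the report)
NUM_MAPS = 64   # bitmaps averaged per sketch (assumed)
PHI = 0.77351   # Flajolet-Martin correction constant


def _hash(doc_id: str) -> int:
    """Map a document id to a 64-bit pseudo-random integer."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def _rho(x: int) -> int:
    """Index of the least-significant 1 bit of x, capped at NUM_BITS - 1."""
    if x == 0:
        return NUM_BITS - 1
    return min((x & -x).bit_length() - 1, NUM_BITS - 1)


def make_sketch(doc_ids):
    """Build a sketch for the documents of one peer that contain a term."""
    maps = [0] * NUM_MAPS
    for doc_id in doc_ids:
        h = _hash(doc_id)
        # Low bits pick the bitmap, the remaining bits pick the position.
        maps[h % NUM_MAPS] |= 1 << _rho(h // NUM_MAPS)
    return maps


def union_sketches(a, b):
    """Sketch of the union of two collections: bitwise OR per bitmap."""
    return [x | y for x, y in zip(a, b)]


def estimate(sketch):
    """Estimate the number of distinct documents behind a sketch."""
    total = 0
    for bitmap in sketch:
        r = 0
        while bitmap & (1 << r):   # position of the lowest zero bit
            r += 1
        total += r
    return len(sketch) / PHI * 2 ** (total / len(sketch))


if __name__ == "__main__":
    # Two hypothetical peers whose collections overlap in 3000 documents.
    peer_a = {f"doc{i}" for i in range(6000)}
    peer_b = {f"doc{i}" for i in range(3000, 9000)}
    merged = union_sketches(make_sketch(peer_a), make_sketch(peer_b))
    print("naive sum of local df:", len(peer_a) + len(peer_b))
    print("overlap-aware estimate:", round(estimate(merged)))
```

Under these assumptions the merged sketch is identical to the sketch a single peer would build over the combined collection, which is why summing local document frequencies overcounts while the sketch-based estimate does not. The estimate is approximate; its error shrinks as NUM_MAPS grows.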

Attachment: MPI-I-2006-5-001.pdf (570 KBytes)
BibTeX:
  AUTHOR = {Bender, Matthias and Michel, Sebastian and Weikum, Gerhard and Triantafilou, Peter},
  TITLE = {Overlap-aware global df estimation in distributed information retrieval systems},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2006-5-001},
  MONTH = {January},
  YEAR = {2006},
  ISSN = {0946-011X},