Max-Planck-Institut für Informatik
max planck institut
informatik
mpii logo Minerva of the Max Planck Society
 

MPI-I-2012-5-003

Computing n-gram statistics in MapReduce

Berberich, Klaus and Bedathur, Srikanta

MPI-I-2012-5-003. 2012, 39 pages. | Status: available - back from printing | Next --> Entry | Previous <-- Entry

Abstract in LaTeX format:
Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-\sigma that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
Acknowledgement:
References to related material:

To download this research report, please select the type of document that fits best your needs.Attachement Size(s):
MPI-I-2012-5-003.pdfMPI-I-2012-5-003.pdf393 KBytes
Please note: If you don't have a viewer for PostScript on your platform, try to install GhostScript and GhostView
URL to this document: http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2012-5-003
Hide details for BibTeXBibTeX
@TECHREPORT{BerberichBedathur2012,
  AUTHOR = {Berberich, Klaus and Bedathur, Srikanta},
  TITLE = {Computing n-gram statistics in MapReduce},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2012-5-003},
  YEAR = {2012},
  ISSN = {0946-011X},
}