MPI-I-2009-5-006

Scalable phrase mining for ad-hoc text analytics

Bedathur, Srikanta and Berberich, Klaus and Dittrich, Jens and Mamoulis, Nikos and Weikum, Gerhard

April 2009, 41 pages.

Status: available - back from printing

Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset but relatively infrequent in the overall corpus. The ad-hoc subset may be derived by means of a keyword query against the corpus, or by focusing on a particular time period. We investigate alternative definitions of phrase interestingness, based on the probability of phrase occurrences. We develop preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases on ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.
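To make the notion of "frequent in the subset but relatively infrequent in the overall corpus" concrete, the following is a minimal brute-force sketch in Python. It is not the report's method: the abstract does not spell out the interestingness definitions or the indexing techniques, so the score used here (the ratio of a phrase's document-level occurrence probability in the subset to that in the whole corpus) is only one assumed definition, and the names `ngrams`, `doc_frequencies`, `top_k_phrases`, and `min_support` are hypothetical.

```python
from collections import Counter
import heapq


def ngrams(tokens, max_len):
    """Yield every contiguous word sequence of length 2..max_len."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def doc_frequencies(docs, max_len):
    """Count, for each phrase, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() ensures a phrase is counted at most once per document
        df.update(set(ngrams(doc.lower().split(), max_len)))
    return df


def top_k_phrases(corpus, subset_ids, k=3, max_len=3, min_support=2):
    """Return the k highest-scoring (score, phrase) pairs for an ad-hoc subset.

    Assumed score: P(phrase | subset) / P(phrase | corpus), with both
    probabilities estimated from document frequencies.
    """
    subset = [corpus[i] for i in subset_ids]
    df_corpus = doc_frequencies(corpus, max_len)
    df_subset = doc_frequencies(subset, max_len)

    scored = []
    for phrase, count in df_subset.items():
        if count < min_support:  # skip phrases too rare in the subset
            continue
        p_subset = count / len(subset)
        # the subset is drawn from the corpus, so df_corpus[phrase] >= count > 0
        p_corpus = df_corpus[phrase] / len(corpus)
        scored.append((p_subset / p_corpus, " ".join(phrase)))
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    corpus = [
        "the central bank raised interest rates",
        "the central bank cut interest rates",
        "the central station was closed",
        "football season ended early",
    ]
    # Ad-hoc subset, e.g. the documents matching the keyword query "bank"
    for score, phrase in top_k_phrases(corpus, subset_ids=[0, 1]):
        print(f"{score:.2f}  {phrase}")
```

This sketch recomputes all phrase counts for every subset, which would presumably be too slow at the scale the report targets; the preprocessing and indexing methods mentioned in the abstract address exactly that cost.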

Attachment: mpi-i-2009-5-006.pdf (347 KBytes)

URL to this document: https://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2009-5-006

BibTeX

@TECHREPORT{BedathurBerberichDittrichMamoulisWeikum2009,
  AUTHOR = {Bedathur, Srikanta and Berberich, Klaus and Dittrich, Jens and Mamoulis, Nikos and Weikum, Gerhard},
  TITLE = {Scalable phrase mining for ad-hoc text analytics},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2009-5-006},
  MONTH = {April},
  YEAR = {2009},
  ISSN = {0946-011X},
}
