MPI-I-2009-5-006
Scalable phrase mining for ad-hoc text analytics
Bedathur, Srikanta and Berberich, Klaus and Dittrich, Jens and Mamoulis, Nikos and Weikum, Gerhard
April 2009, 41 pages.
Status: available - back from printing
Large text corpora with news, customer mail and reports, or Web 2.0 contributions
offer a great potential for enhancing business-intelligence applications. We
propose a framework for performing text analytics on such data in a versatile,
efficient, and scalable manner. While much of the prior literature has emphasized
mining keywords or tags in blogs or social-tagging communities, we emphasize
the analysis of interesting phrases. These include named entities, important
quotations, market slogans, and other multi-word phrases that are prominent in a
dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset
but relatively infrequent in the overall corpus. The ad-hoc subset may be derived
by means of a keyword query against the corpus, or by focusing on a particular
time period. We investigate alternative definitions of phrase interestingness, based
on the probability of phrase occurrences. We develop preprocessing and indexing
methods for phrases, paired with new search techniques for the top-k most
interesting phrases on ad-hoc subsets of the corpus. Our framework is evaluated using
a large-scale real-world corpus of New York Times news articles.
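
As an illustration of the idea of a phrase being "frequent in the subset but relatively infrequent in the overall corpus", the following is a minimal brute-force sketch. It assumes one possible interestingness score, the ratio of a phrase's document-level occurrence probability in the ad-hoc subset to that in the whole corpus. The function names, the whitespace tokenization, and the choice of score are assumptions made for this example; the report investigates alternative definitions and relies on preprocessing and indexing rather than the full scan shown here.

```python
from collections import Counter
import heapq


def ngrams(tokens, max_len):
    """Yield every contiguous word sequence of length 2..max_len."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def doc_frequencies(docs, max_len=3):
    """Count, for each phrase, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        # A set ensures each phrase is counted at most once per document.
        df.update(set(ngrams(text.lower().split(), max_len)))
    return df


def top_k_interesting(corpus, subset_ids, k=5, max_len=3):
    """Return the k phrases with the highest subset-to-corpus probability ratio.

    Hypothetical scoring (an assumption of this sketch):
        score(p) = P(p occurs in a subset doc) / P(p occurs in a corpus doc)
    """
    subset = [corpus[i] for i in subset_ids]
    df_all = doc_frequencies(corpus, max_len)
    df_sub = doc_frequencies(subset, max_len)

    def score(phrase):
        p_sub = df_sub[phrase] / len(subset)
        p_all = df_all[phrase] / len(corpus)  # nonzero: subset is part of corpus
        return p_sub / p_all

    # Ties on the ratio are broken in favor of phrases seen in more subset docs.
    ranked = heapq.nlargest(k, df_sub, key=lambda p: (score(p), df_sub[p]))
    return [(" ".join(p), score(p)) for p in ranked]
```

Here `subset_ids` stands in for the result of a keyword query or a time-period filter. A phrase that occurs throughout the corpus scores close to 1, while a phrase concentrated in the subset scores higher.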
Attachment: mpi-i-2009-5-006.pdf (347 KBytes)
URL to this document: https://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2009-5-006
BibTeX
@TECHREPORT{BedathurBerberichDittrichMamoulisWeikum2009,
AUTHOR = {Bedathur, Srikanta and Berberich, Klaus and Dittrich, Jens and Mamoulis, Nikos and Weikum, Gerhard},
TITLE = {Scalable phrase mining for ad-hoc text analytics},
TYPE = {Research Report},
INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
NUMBER = {MPI-I-2009-5-006},
MONTH = {April},
YEAR = {2009},
ISSN = {0946-011X},
}