Campus Event Calendar

Event Entry

New for: D2, D3

What and Who

PhD Application Talk: Distributed analytics over web archives

Yagiz Kargin
PhD Application Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
MPI Audience
English

Date, Time and Location

Monday, 25 July 2011
09:00
120 Minutes
E1 4
024
Saarbrücken

Abstract

Text analytics plays a key role in exploring interesting information in text collections. Frequent phrase mining, a special case of text analytics, is an important analytical task motivated by the need for knowledge about frequent phrases in various areas of computer science, such as information retrieval and machine translation. However, it has to be conducted on increasingly large-scale data. Distributed approaches such as MapReduce, which is designed mainly to work on vast amounts of text, can be utilized in this case. The problem we address is that finding frequent phrases through naive counting, even in MapReduce, is time-consuming, because the data to be processed grows much larger when phrases rather than single words are considered. We present partitioned approximate counting of phrases, a fast way to retrieve most of the frequent phrases, together with their counts, from a collection to enable analysis of its content.

As part of this work, we propose a technique, partitioned in-mapper combining, which enables us to aggregate data in memory correctly even when the data to be aggregated is larger than the available memory. Experiments on the New York Times Annotated Corpus, which contains roughly 2 million documents, show that our approach runs at least 2 times faster than the naive approach. It obtains more than 90% of the frequent phrases with high precision. Moreover, it finds all highly frequent phrases exactly, along with their accurate counts. Furthermore, with a quick second pass over the data, we provide most of the frequent phrases with their true counts, while still being faster than the naive approach.
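
For readers unfamiliar with the setting, the following is a minimal single-process sketch of phrase counting with the standard in-mapper combining pattern, in which the mapper aggregates counts in a bounded in-memory buffer and flushes partial counts whenever the buffer fills up. It is not the partitioned technique presented in the talk, whose details the abstract does not give. The function names, the `max_len` phrase-length limit, the `max_entries` buffer limit and the `min_support` threshold are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, max_len):
    """Yield every phrase (as a tuple of tokens) of length 1..max_len."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            yield tuple(tokens[start:start + length])


def map_with_in_mapper_combining(docs, max_len=3, max_entries=100_000):
    """Emit (phrase, partial_count) pairs, combining counts in memory.

    The buffer is flushed whenever it holds max_entries distinct phrases,
    so memory stays bounded even if the input has many more phrases.
    """
    buffer = Counter()
    for doc in docs:
        for phrase in ngrams(doc.split(), max_len):
            buffer[phrase] += 1
            if len(buffer) >= max_entries:
                # Copy before clearing so the emitted pairs are stable.
                yield from list(buffer.items())
                buffer.clear()
    yield from list(buffer.items())


def reduce_counts(pairs, min_support):
    """Sum partial counts and keep phrases seen at least min_support times."""
    totals = Counter()
    for phrase, count in pairs:
        totals[phrase] += count
    return {p: c for p, c in totals.items() if c >= min_support}


docs = ["a b a b", "a b"]
frequent = reduce_counts(
    map_with_in_mapper_combining(docs, max_len=2, max_entries=2),
    min_support=2,
)
```

Flushing early only splits a phrase's count into several partial counts; the reducer sums them, so the final totals do not depend on the buffer size. The sketch also shows why phrases inflate the data: a document of n tokens yields close to n times `max_len` phrases instead of n words.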

Contact

IMPRS-CS
-1803

Tags, Category, Keywords and additional notes

Please note: The talks will take place in random order!

Heike Przybyl, 07/21/2011 12:08
Heike Przybyl, 07/21/2011 12:04 -- Created document.