Campus Event Calendar: Yagiz Kargin (07/25/2011 in E1 4/024)


Event Entry

New for: D2, D3

What and Who

PhD Application Talk: Distributed analytics over web archives

Yagiz Kargin

PhD Application Talk

AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI

MPI Audience

English


Date, Time and Location

Monday, 25 July 2011

09:00

120 Minutes

E1 4

024

Saarbrücken

Abstract

Text analytics plays a key role in exploring interesting information in text collections. Frequent phrase mining, a special case of text analytics, is an important analytical task, motivated by the need for knowledge about frequent phrases in various areas of computer science, such as information retrieval and machine translation. However, it has to be conducted on increasingly large-scale data. Distributed approaches such as MapReduce, which is mainly designed to work on vast amounts of text, can be utilized in this case. The problem we address is that finding frequent phrases through naive counting, even in MapReduce, is a time-consuming task, because the data to be processed grows much larger once phrases are considered. In our work, we present partitioned approximate counting of phrases, a fast way to retrieve most of the frequent phrases, together with their counts, from a collection to enable analysis of its content.

As part of this, we propose a technique, partitioned in-mapper combining, which enables us to aggregate data in memory correctly even when the data to be aggregated is larger than the available memory. Experiments on the New York Times Annotated Corpus, which contains roughly 2 million documents, show that our approach runs at least twice as fast as the naive approach. It obtains more than 90% of the frequent phrases with high precision. Moreover, it is able to find all highly frequent phrases exactly, along with their accurate counts. Furthermore, with a quick second pass over the data, we provide most of the frequent phrases with their true counts, while still being faster than the naive approach.
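For readers unfamiliar with the pattern, the following is a minimal Python sketch of generic in-mapper combining with a bounded buffer, applied to phrase (word n-gram) counting. It is not the partitioned variant proposed in the talk, whose details the abstract does not give. All names (ngrams, map_with_in_mapper_combining, reduce_counts) and the parameters max_len, max_keys and min_support are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, max_len):
    """Yield every phrase (word n-gram) of length 1..max_len."""
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            yield " ".join(tokens[i:i + n])


def map_with_in_mapper_combining(documents, max_len=3, max_keys=100_000):
    """Emit (phrase, partial_count) pairs, combining counts in memory.

    max_keys is an assumed cap on the number of distinct phrases held
    at once. When the cap is reached, the buffer is emitted and emptied,
    so memory stays bounded at the cost of emitting partial counts.
    """
    buffer = Counter()
    for doc in documents:
        for phrase in ngrams(doc.split(), max_len):
            buffer[phrase] += 1
            if len(buffer) >= max_keys:
                yield from buffer.items()
                buffer.clear()
    # Emit whatever is left after the last document.
    yield from buffer.items()


def reduce_counts(pairs, min_support=2):
    """Sum partial counts per phrase and keep the frequent ones."""
    totals = defaultdict(int)
    for phrase, count in pairs:
        totals[phrase] += count
    return {p: c for p, c in totals.items() if c >= min_support}
```

Because the reducer sums the partial counts, the totals do not depend on how often the mapper flushes; a smaller max_keys only increases the number of emitted pairs.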

Contact

IMPRS-CS

-1803



Tags, Category, Keywords and additional notes

Note:

Please note: The talks will take place in random order!

Heike Przybyl, 07/21/2011 12:08
Heike Przybyl, 07/21/2011 12:04 -- Created document.
