Campus Event Calendar: Bibek Paudel (05/07/2012 in E1 4/024)

Event Entry

What and Who

Redundancy Control in Web Archives

Bibek Paudel

Fachrichtung Informatik - Saarbrücken

PhD Application Talk

AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI

Public Audience

English

Date, Time and Location

Monday, 7 May 2012

09:20

75 Minutes

E1 4

024

Saarbrücken

Abstract

Large-scale text collections such as web archives evolve over time. However, adding new documents does not always add novel content; it also introduces content that is copied, enriched, or recompiled from existing documents. Such collections are therefore characterized by a large amount of redundant content. Redundant documents waste storage space, make content analysis difficult, and decrease the quality of search results. Although existing duplicate detection systems can identify content shared across documents, they do not accommodate differing user requirements and application scenarios. We would like to give users control over how redundancy is defined. In this work, we propose a solution that systematically removes documents from the collection when their content is sufficiently covered by other documents and other user-specified conditions are met. This problem is challenging because the scale of the data makes detecting redundancy inefficient. Our solution exploits the inherent properties of documents in the collection to detect redundancy efficiently, and we use the MapReduce framework to tackle the problem of scale. Tested on real web archive datasets, our method was able to efficiently identify redundant documents. A keyword search on the original collection and the redundancy-controlled collection validates the effectiveness of our approach.
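
The idea of removing a document once its content is "sufficiently covered" by other documents can be illustrated with a small single-machine sketch. This is not the speaker's method, which the abstract does not specify; it assumes word-level shingles as the unit of content, a user-chosen coverage threshold as the redundancy definition, and a longest-first processing order. The names `shingles`, `find_redundant`, `threshold`, and `k` are hypothetical, and the MapReduce distribution mentioned above is left out entirely.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word k-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield one shingle, or none if empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def find_redundant(docs, threshold=0.8, k=3):
    """Return ids of documents whose shingles are covered by kept documents.

    docs maps document id -> text. Longer documents are visited first so
    that a short copy is dropped in favour of the richer original; this
    ordering is an assumption of the sketch, not taken from the abstract.
    """
    kept = {}        # doc id -> shingle set of the documents kept so far
    redundant = []
    order = sorted(docs, key=lambda d: (-len(docs[d]), d))
    for doc_id in order:
        sh = shingles(docs[doc_id], k)
        covered = set()
        for other in kept.values():
            covered |= sh & other
        # An empty document is treated as fully covered.
        coverage = len(covered) / len(sh) if sh else 1.0
        if coverage >= threshold:
            redundant.append(doc_id)
        else:
            kept[doc_id] = sh
    return redundant


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "web archives evolve over time and grow very large",
}
print(find_redundant(docs, threshold=0.8))
```

In this sketch the user's control over the redundancy definition is reduced to two parameters: `threshold` sets how much of a document must be covered, and `k` sets how strictly text must match. Comparing every document against all kept documents is quadratic in the worst case, which is the kind of cost the abstract says must be avoided at web-archive scale.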

Contact

IMPRS Office Team

9325 1800

Marc Schmitt, 05/04/2012 13:35 -- Created document.
