Campus Event Calendar

Event Entry

What and Who

Redundancy Control in Web Archives

Bibek Paudel
Fachrichtung Informatik - Saarbrücken
PhD Application Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Monday, 7 May 2012
09:20
75 Minutes
E1 4
024
Saarbrücken

Abstract

Large-scale text collections such as web archives evolve over time. However, new documents do not always add novel content; they also introduce content that is copied, enriched, or recompiled from existing documents. As a result, such collections contain a lot of redundant content. Redundant documents waste storage space, make content analysis difficult, and decrease the quality of search results. Although existing duplicate-detection systems can identify content shared across documents, they do not accommodate differing user requirements and application scenarios. We would like to give users control over how redundancy is defined. In this work, we propose a solution that systematically removes from the collection those documents whose content is sufficiently covered by other documents and for which other user-specified conditions are met. The problem is challenging because the scale of the data makes detecting redundancy inefficient. Our solution exploits inherent properties of the documents in the collection to detect redundancy efficiently, and we use the MapReduce framework to tackle the problem of scale. Tested on real web archive datasets, our method efficiently identified redundant documents. A keyword search on both the original collection and the redundancy-controlled collection validates the effectiveness of our approach.
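The idea of removing documents whose content is "sufficiently covered" by others can be illustrated with a small single-machine sketch. This is not the method presented in the talk: the use of word shingles, the coverage threshold, and the names `shingles`, `coverage`, and `remove_redundant` are all assumptions made for illustration, and the sketch does not use MapReduce.

```python
from typing import Dict, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a text (assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def coverage(doc: Set[Shingle], others: Set[Shingle]) -> float:
    """Fraction of a document's shingles that also occur in other documents."""
    if not doc:
        return 0.0
    return len(doc & others) / len(doc)


def remove_redundant(docs: Dict[str, str],
                     threshold: float = 0.8,
                     k: int = 3) -> Dict[str, str]:
    """Greedily drop documents covered at least `threshold` by the kept ones.

    `threshold` stands in for a user-specified redundancy condition
    (hypothetical parameter, not taken from the talk).
    """
    sh = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    kept = set(docs)
    # Visit shorter documents first so a short copy is removed
    # before the longer document that contains it.
    for doc_id in sorted(docs, key=lambda d: (len(sh[d]), d)):
        others: Set[Shingle] = set()
        for other_id in kept:
            if other_id != doc_id:
                others |= sh[other_id]
        if coverage(sh[doc_id], others) >= threshold:
            kept.discard(doc_id)
    return {d: text for d, text in docs.items() if d in kept}


if __name__ == "__main__":
    archive = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog near the river bank",
        "c": "web archives evolve over time",
    }
    print(sorted(remove_redundant(archive)))
```

Because coverage is recomputed against the documents still kept, two identical documents never remove each other: the first one visited is dropped and the second is retained. Comparing every document with the union of all others is exactly the step that becomes costly at web-archive scale.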

Contact

IMPRS Office Team
9325 1800

Marc Schmitt, 05/04/2012 13:35 -- Created document.