Campus Event Calendar

Event Entry

New for: D1, D2, D3, D4, D5

What and Who

Models and Methods for Web Archive Crawling

Dimitar DENEV
Max-Planck-Institut für Informatik - D5
Promotionskolloquium (doctoral colloquium)
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Monday, 20 August 2012
15:00
60 Minutes
E1 4
024
Saarbrücken

Abstract

Web archives offer a rich and plentiful source of information to researchers, analysts, and legal experts. For this purpose, they gather Web sites as the sites change over time. To maintain high standards of data quality, Web archives would have to collect all versions of the Web sites. Due to limited resources and technical constraints, this is not possible. Therefore, Web archives consist of versions archived at various time points with no guarantee of mutual consistency. This thesis presents a model for assessing the data quality in Web archives as well as a family of crawling strategies yielding high-quality captures. We distinguish between single-visit crawling strategies for exploratory purposes and visit-revisit crawling strategies for evidentiary purposes. Single-visit strategies download every page exactly once, aiming for an “undistorted” capture of the ever-changing Web. We express the quality of the resulting capture with the “blur” quality measure. In contrast, visit-revisit strategies download every page twice. The initial downloads of all pages form the visit phase of the crawling strategy. The second downloads are grouped together in the revisit phase. These two phases enable us to check which pages changed during the crawling process. Thus, we can identify the pages that are consistent with each other. The quality of the visit-revisit captures is expressed by the “coherence” measure. Quality-conscious strategies are based on predictions of the change behaviour of individual pages. We model the Web site dynamics by Poisson processes with page-specific change rates. Furthermore, we show that these rates can be statistically predicted. Finally, we propose visualization techniques for exploring the quality of the resulting Web archives. A fully functional prototype demonstrates the practical viability of our approach.
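
To illustrate the Poisson change model mentioned in the abstract, here is a minimal sketch in Python. It relies only on a standard property of Poisson processes: a page with change rate λ stays unchanged over an interval of length Δ with probability exp(-λΔ). The function names, the schedule format, and the example rates are hypothetical and are not taken from the thesis; in particular, the sum computed here is only an illustrative quantity and is not claimed to be the thesis's “coherence” measure.

```python
import math


def prob_unchanged(change_rate, visit_time, revisit_time):
    """Probability that a page does not change between its visit and revisit.

    Assumes changes follow a Poisson process with the given rate
    (changes per time unit). Hypothetical helper, not from the thesis.
    """
    if change_rate < 0:
        raise ValueError("change_rate must be non-negative")
    if revisit_time < visit_time:
        raise ValueError("revisit_time must not precede visit_time")
    return math.exp(-change_rate * (revisit_time - visit_time))


def expected_unchanged_pages(schedule):
    """Expected number of pages left unchanged between visit and revisit.

    `schedule` is an iterable of (change_rate, visit_time, revisit_time)
    tuples, one per page (an assumed format for this sketch).
    """
    return sum(prob_unchanged(rate, v, r) for rate, v, r in schedule)


# Example with made-up rates: one slowly and one quickly changing page.
example_schedule = [
    (0.01, 0.0, 10.0),  # slow page, long gap between visit and revisit
    (0.50, 4.0, 6.0),   # fast page, short gap between visit and revisit
]
print(expected_unchanged_pages(example_schedule))
```

Under this simple model, a page with a high change rate is more likely to stay unchanged when its visit and revisit are scheduled close together, which gives some intuition for why change-rate predictions matter when ordering downloads.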

Contact

Petra Schaaf
9325-5000

Petra Schaaf, 08/20/2012 09:54
Ellen Fries, 08/14/2012 10:05 -- Created document.