Proceedings Article, Paper
@InProceedings
Contribution to Conference Proceedings, Workshop


Author, Editor

Author(s):

Spaniol, Marc
Denev, Dimitar
Mazeika, Arturas
Weikum, Gerhard
Senellart, Pierre

Not MPG Author(s):

Senellart, Pierre

Editor(s):





BibTeX cite key*:

Spaniol-WICOW09

Title, Booktitle

Title*:

Data Quality in Web Archiving


p19-spaniolA.pdf (657.6 KB)

Booktitle*:

WICOW'09 : proceedings of the 3rd Workshop on Information Credibility on the Web

Event, URLs

URL of the conference:

http://www.dl.kuis.kyoto-u.ac.jp/wicow3/

URL for downloading the paper:

http://www.dl.kuis.kyoto-u.ac.jp/wicow3/papers/p19-spaniolA.pdf

Event Address*:

Madrid, Spain

Language:

English

Event Date*
(no longer used):


Organization:


Event Start Date:

20 April 2009

Event End Date:

20 April 2009

Publisher

Name*:

ACM

URL:


Address*:

New York, NY

Type:


Vol, No, Year, pp.

Series:


Volume:


Number:


Month:

April

Pages:

19-26

Year*:

2009

VG Wort Pages:

8

ISBN/ISSN:

978-1-60558-488-1

Sequence Number:


DOI:

10.1145/1526993.1526999



Note, Abstract, ©


(LaTeX) Abstract:

Web archives preserve the history of Web sites and have high long-term value for
media and business analysts. Such archives are maintained by periodically re-crawling
entire Web sites of interest.
From an archivist's point of view, the ideal way to ensure the highest possible data quality
of the archive would be to ``freeze'' the complete contents of an entire Web site for the time span
of crawling and capturing the site. Of course, this is infeasible in practice.
To comply with the politeness specification of a Web site, the crawler needs to pause
between successive HTTP requests in order to avoid unduly high load on the site's HTTP server.
As a consequence, capturing a large Web site may span hours or even days, which increases the risk that contents collected so far are incoherent
with the parts that are still to be crawled.
This paper introduces a model for identifying coherent sections of an archive and, thus,
measuring the data quality in Web archiving.
Additionally, we present a crawling strategy that aims to ensure archive coherence by
minimizing the diffusion of Web site captures.
Preliminary experiments demonstrate the usefulness of the model and the effectiveness of the strategy.

Keywords:

Web Archiving, Data Quality, Temporal Coherence



Download
Access Level:

Public

Correlation

MPG Unit:

Max-Planck-Institut für Informatik



MPG Subunit:

Databases and Information Systems Group

Research Context:

LiWA

Appearance:

MPII WWW Server, MPII FTP Server, MPG publications list, university publications list, working group publication list, Fachbeirat, VG Wort



BibTeX Entry:

@INPROCEEDINGS{Spaniol-WICOW09,
  AUTHOR = {Spaniol, Marc and Denev, Dimitar and Mazeika, Arturas and Weikum, Gerhard and Senellart, Pierre},
  TITLE = {Data Quality in Web Archiving},
  BOOKTITLE = {WICOW'09 : proceedings of the 3rd Workshop on Information Credibility on the Web},
  PUBLISHER = {ACM},
  YEAR = {2009},
  PAGES = {19--26},
  ADDRESS = {Madrid, Spain},
  MONTH = {April},
  ISBN = {978-1-60558-488-1},
  DOI = {10.1145/1526993.1526999},
}


Entry last modified by Anja Becker, 03/17/2011
Edit History

Created by [Library], 04/15/2009 03:19:28 PM

Revision 6: Anja Becker, 17.03.2011 16:13:02
Revision 5: Anja Becker, 23.03.2010 14:28:14
Revision 4: Marc Spaniol, 20.01.2010 14:58:00
Revision 3: Marc Spaniol, 20.01.2010 14:39:03
Revision 2: Marc Spaniol, 04/15/2009 03:46:51 PM