Campus Event Calendar: Martin Theobald (04/01/2009 in E1 4/024)

Event Entry

What and Who

SpotSigs - Robust and Efficient Near-Duplicate Detection in Large Web Collections

Martin Theobald

Max-Planck-Institut für Informatik - D5

Senior Researcher Series

AG 1, AG 3, AG 4, AG 5, SWS, RG1

AG Audience

English


Date, Time and Location

Wednesday, 1 April 2009

16:00

60 Minutes

E1 4

024

Saarbrücken

Abstract

The talk reviews our recent SIGIR paper on robust and efficient near-duplicate detection in large web collections, work motivated by a collaboration with political scientists in the context of the Stanford WebBase project. Near-duplicate detection for web pages is particularly challenging in two ways. First, typical web pages are often interspersed with nondescript content such as advertisements or navigational banners, which can easily lead simple content extraction techniques astray. Second, detecting all pairs of similar documents requires an efficient clustering algorithm that avoids the inherent quadratic complexity of pairwise document comparisons over the entire collection.

Consequently, the contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of web pages that would otherwise distract pure n-gram-based approaches such as shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Our experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as shingling or I-Match, and execution times up to a factor of 3 faster than Locality Sensitive Hashing (LSH), on a demonstrative “Gold Set” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
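
The signature idea in the first contribution can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the antecedent and stopword lists, the function names (spot_signatures, jaccard, may_match), the chain length of 2, and the use of multiset Jaccard similarity are all assumptions made for illustration. The size filter at the end only hints at how a similarity threshold lets a matcher skip pairs; it does not reproduce the partitioning or index pruning mentioned in the abstract.

```python
import re
from collections import Counter

# Hypothetical antecedent list; the abstract does not say which stopwords are used.
ANTECEDENTS = {"a", "an", "the", "is", "was", "there", "said"}
# Hypothetical stopword list, used to skip non-content words while building chains.
STOPWORDS = ANTECEDENTS | {"of", "to", "in", "and", "on", "for", "that", "it", "by", "at"}


def spot_signatures(text, chain_length=2):
    """Return a multiset of (antecedent, content-term chain) signatures."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = Counter()
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        # Collect the next `chain_length` content terms after the antecedent.
        chain = []
        j = i + 1
        while j < len(tokens) and len(chain) < chain_length:
            if tokens[j] not in STOPWORDS:
                chain.append(tokens[j])
            j += 1
        if len(chain) == chain_length:
            signatures[(token, tuple(chain))] += 1
    return signatures


def jaccard(a, b):
    """Multiset Jaccard similarity: sum of min counts over sum of max counts."""
    keys = a.keys() | b.keys()
    union = sum(max(a[k], b[k]) for k in keys)
    if union == 0:
        return 0.0
    return sum(min(a[k], b[k]) for k in keys) / union


def may_match(a, b, threshold):
    """Cheap size filter: Jaccard cannot exceed smaller size / larger size."""
    size_a, size_b = sum(a.values()), sum(b.values())
    if size_a == 0 or size_b == 0:
        return False
    return min(size_a, size_b) / max(size_a, size_b) >= threshold


if __name__ == "__main__":
    article = "The president said that the economy is growing fast."
    page_1 = article + " Click here to subscribe!"
    page_2 = "BREAKING NEWS | Home | Sports. " + article + " Ads by Foo."
    sigs_1, sigs_2 = spot_signatures(page_1), spot_signatures(page_2)
    if may_match(sigs_1, sigs_2, threshold=0.5):
        print(jaccard(sigs_1, sigs_2))
```

In this toy input the navigation and advertising text contains no antecedents, so it contributes no signatures and the two pages yield the same signature multiset. Real pages are noisier, and the choice of antecedents and chain length would need tuning against the collection.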

Contact

Jennifer Müller

900

Jennifer Müller, 03/05/2009 13:30
Jennifer Müller, 03/04/2009 08:30 -- Created document.
