Campus Event Calendar

Event Entry

What and Who

IBEX: Id-Based Entity Extraction

Aliaksandr Talaika
Fachrichtung Informatik - Saarbrücken
PhD Application Talk

Master of Science
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Monday, 10 February 2014
10:50
90 Minutes
E1 4
024
Saarbrücken

Abstract

Several academic and industrial projects have started extracting entities from the Web. In this thesis, we show that a certain subclass of entities, namely those that have unique identifiers, can be extracted at large scale and with high precision from Web data. This applies most notably to commercial products, but also to email addresses, scientific publications, chemical substances, and a wide variety of other entities. By making systematic use of the identifiers, our algorithm can bypass page segmentation, complex named entity recognition, and table alignment. Our method can extract millions of items, each disambiguated to a canonical entity, with a precision of 73-96%. This yields a database of unique entities at Web scale. It allows us to compute detailed statistics on the presence of commercial products, people, and other objects on the Internet.
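
To illustrate the general idea of identifier-based extraction, here is a minimal sketch in Python. It is not the algorithm presented in the talk. It assumes, purely for illustration, that product identifiers are 13-digit GTIN-13/EAN-13 codes, finds candidate digit runs with a regular expression, keeps only those whose check digit is valid, and groups pages by identifier. The names CANDIDATE, is_valid_gtin13, extract_ids and group_pages_by_id are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a standalone run of exactly 13 digits is a candidate GTIN-13.
CANDIDATE = re.compile(r"(?<!\d)\d{13}(?!\d)")


def is_valid_gtin13(code: str) -> bool:
    """Check the standard GTIN-13 check digit (weights 1, 3, 1, 3, ...)."""
    digits = [int(c) for c in code]
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def extract_ids(text: str) -> list[str]:
    """Return all checksum-valid 13-digit identifiers found in the text."""
    return [m.group() for m in CANDIDATE.finditer(text)
            if is_valid_gtin13(m.group())]


def group_pages_by_id(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each identifier to the pages that mention it.

    The identifier itself serves as the canonical key, so mentions on
    different pages are merged without any further disambiguation.
    """
    index = defaultdict(list)
    for url, text in pages.items():
        for ident in set(extract_ids(text)):
            index[ident].append(url)
    return dict(index)


# Hypothetical page contents, used only to exercise the sketch.
pages = {
    "shop-a": "Pen, EAN 4006381333931, in stock",
    "shop-b": "Order no. 4006381333932; also sold as 4006381333931",
}
index = group_pages_by_id(pages)
```

In this sketch the checksum plays the role of a cheap precision filter: the second number on "shop-b" differs in its last digit, fails validation, and is dropped, while both pages end up under the same identifier.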


Aaron Alsancak, 02/06/2014 10:38 -- Created document.