Campus Event Calendar

Event Entry

What and Who

Benjamin Habegger

String and Tree Pattern Generalization for n-Ary Information Extraction from the Web.
Max-Planck-Institut für Informatik - AG 5
Talk
AG 1, AG 2, AG 3, AG 4, AG 5  
AG Audience

Date, Time and Location

Wednesday, 7 September 2005
14:00
90 Minutes
46.1 - MPII
024
Saarbrücken

Abstract

Currently, data from online sources is given in a presentational format (HTML), which makes it difficult to use in an automated process. The problem of information extraction from the Web consists in building patterns, based on presentational clues, that extract information for a specific task from a specific source of information. Our approach to information extraction is to use machine learning techniques to build extraction patterns. While the problem of unary extraction (i.e. learning patterns that extract lists of single items) has been studied extensively, few works consider the problem of n-ary extraction (i.e. extracting tuples of items). In this talk we will present pattern generalization for n-ary information extraction from the Web. HTML documents can be considered either as strings or as trees. In the first part, a string-based approach to pattern generalization will be presented. It is based on extracting the contexts of the desired information and generalizing them into patterns. With few examples, and without decomposing the examples, this method makes it possible to build n-ary patterns directly. A thorough evaluation of this technique on different Web sources has been carried out, showing its efficiency on real-world sources. In the second part, our ongoing work on tree-pattern generalization will be presented.

Contact

Jens Graupmann

Adriana Davidescu, 09/01/2005 09:35
Adriana Davidescu, 08/23/2005 12:14 -- Created document.