MPI-INF Logo
Campus Event Calendar

Event Entry

New for: D2, D3

What and Who

TASM: Top-k Approximate Subtree Matching

Dr. Nikolaus Augsten
Free University of Bozen-Bolzano
Talk

Nikolaus Augsten is an assistant professor in computer science at the
Free University of Bozen-Bolzano, Italy. He received his PhD from
Aalborg University, Denmark, in 2008. His research interests include
data-centric applications in database and information systems with a
particular focus on approximate matching techniques for complex data
structures, efficient index structures for distance computations, and
similarity search in massive data collections.
http://www.inf.unibz.it/~augsten/
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Tuesday, 25 January 2011
13:00
60 Minutes
E 1 7 - MMCI
0.01
Saarbrücken

Abstract

With the popularity of XML repositories has come the need for
efficient techniques to match XML trees based on their similarity.
This talk presents our work on the Top-k Approximate Subtree Matching
(TASM) problem: finding the k best matches of a small query tree,
e.g., an article represented as an XML tree with 15 nodes, in a large
document tree, e.g., the DBLP online bibliography with 26M nodes,
using the canonical tree edit distance as a similarity measure between
subtrees. Evaluating the tree edit distance for large XML trees is
difficult: the best known algorithms have cubic runtime and quadratic
space complexity, and, thus, do not scale.

Our solution is TASM-postorder, a memory-efficient and scalable TASM
algorithm. We prove an upper-bound for the maximum subtree size for
which the tree edit distance needs to be evaluated. The upper bound
depends on the query and is independent of the document size and
structure. A core problem is to efficiently prune subtrees that are
above this size threshold. We develop an algorithm based on the prefix
ring buffer that allows us to prune all subtrees above the threshold
in a single postorder scan of the document. The size of the prefix
ring buffer is linear in the threshold. As a result, the space
complexity of TASM-postorder depends only on k and the query size, and
the runtime of TASM-postorder is linear in the size of the
document. Our experimental evaluation on large synthetic and real XML
documents confirms our analytic results.

This work received the Best Paper Award at the IEEE International
Conference on Data Engineering (ICDE 2010).

Contact

Martin Theobald
--email hidden
passcode not visible
logged in users only

Petra Schaaf, 01/21/2011 10:47 -- Created document.