Campus Event Calendar

Event Entry

What and Who

Estimating the Selectivity of Approximate String Queries

Arturas Mazeika
Uni Bozen
Talk
AG 5, RG2  
AG Audience
English

Date, Time and Location

Thursday, 29 November 2007
11:30
60 Minutes
E1 4
433
Saarbrücken

Abstract

Approximate queries on string data are important because such data is
prevalent in databases and is subject to varying conventions and
errors. We present the VSol estimator, a novel technique for estimating
the selectivity of approximate string queries. The VSol estimator is
based on inverse strings and makes the performance of the selectivity
estimator independent of the number of strings. To get inverse strings,
we decompose all database strings into overlapping substrings of length
q (q-grams) and then associate each q-gram with its inverse string:
the IDs of all strings that contain the q-gram.  We use signatures to
compress inverse strings, and clustering to group similar signatures.
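
The decomposition described above can be sketched in a few lines of
Python. This is only an illustration of q-grams and inverse strings,
not the VSol implementation: the function names, the use of list
positions as string IDs, and the absence of any padding at string
boundaries are assumptions made for the example, and the signature
and clustering steps are not shown.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping substrings of length q of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverse_strings(strings, q):
    """Map each q-gram to the sorted IDs of the strings that contain it.

    Assumption for this sketch: a string's ID is its position in the list.
    """
    inverse = defaultdict(set)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            inverse[gram].add(string_id)
    return {gram: sorted(ids) for gram, ids in inverse.items()}


# Hypothetical toy database of short strings.
database = ["smith", "smyth", "smithe"]
inverse = build_inverse_strings(database, q=2)
```

Here inverse["mi"] lists only the strings that contain the 2-gram
"mi", while a q-gram shared by every string, such as "sm", maps to
all three IDs.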

We study our technique analytically and experimentally. The space
complexity of our estimator depends only on the number of neighborhoods
in the database and the desired estimation error. The time to estimate
the selectivity is independent of the number of database strings and
linear in the length of the query string. We give a detailed empirical
performance evaluation of our solution on synthetic and real-world
datasets. We show that VSol is effective for large skewed databases of
short strings.

The talk is based on the paper that Divesh Srivastava, Nick Koudas,
Mike Boehlen and I published this year in TODS.

Contact

Gerhard Weikum
500

Petra Schaaf, 11/23/2007 13:50
Petra Schaaf, 11/23/2007 13:46 -- Created document.