What and Who
Title:Estimating the Selectivity of Approximate String Queries
Speaker:Arturas Mazeika
coming from:Uni Bozen
Event Type:Talk
Level:AG Audience
Date, Time and Location
Date:Thursday, 29 November 2007
Duration:60 Minutes
Building:E1 4
Approximate queries on string data are important, due to the prevalence
of such data in databases and various conventions and errors in string
data. We present the VSol estimator, a novel technique for estimating
the selectivity of approximate string queries. The VSol estimator is
based on inverse strings and makes the performance of the selectivity
estimator independent of the number of strings.  To get inverse strings,
we decompose all database strings into overlapping substrings of length
q (q-grams) and then associate each q-gram with its inverse string:
the IDs of all strings that contain the q-gram.  We use signatures to
compress inverse strings, and clustering to group similar signatures.
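
The decomposition step described above can be illustrated with a minimal Python sketch. It covers only q-gram extraction and the construction of inverse strings, not the signature compression or clustering. The function names, the sample strings, the choice q = 2, and the absence of any padding characters are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping substrings of length q of s (no padding assumed)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def inverse_strings(strings, q):
    """Map each q-gram to the sorted IDs of all strings that contain it.

    String IDs are simply list positions here (an illustrative assumption).
    """
    index = defaultdict(set)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(string_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


# Hypothetical toy database of short strings.
database = ["smith", "smyth", "smithe"]
index = inverse_strings(database, q=2)

for gram in sorted(index):
    print(gram, index[gram])
```

In this toy database every string contains the 2-gram "sm", so its inverse string lists all three IDs, while "my" occurs only in "smyth". Since the strings in one inverse string share a q-gram, similar inverse strings are natural candidates for the compression and grouping that the abstract mentions.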

We study our technique analytically and experimentally.  The space
complexity of our estimator only depends on the number of neighborhoods
in the database and the desired estimation error.  The time to estimate
the selectivity is independent of the number of database strings
and linear in the length of the query string.  We give a detailed
empirical performance evaluation of our solution for synthetic and
real-world datasets. We show that VSol is effective for large skewed
databases of short strings.

The talk is based on the paper that Divesh Srivastava, Nick Koudas,
Mike Boehlen and I published this year in TODS.
Name(s):Gerhard Weikum
