MPI-INF Logo
Campus Event Calendar

Event Entry

New for: D3

What and Who

The Similarity Join: A Powerful Database Primitive for High Performance Data Mining

Christian Böhm
LMU München
MPI-Kolloquium
AG 1, AG 2, AG 3, AG 4  
AG Audience

Date, Time and Location

Wednesday, 30 May 2001
17:40
45 Minutes
45 - FR 6.2
I
Saarbrücken

Abstract

Larger and larger amounts of data are collected and stored in

databases, increasing the need for efficient and effective
analysis methods to make use of the information contained
implicitly in the data. The extraction of such potentially useful
information is called data mining.

In the presentation, it is shown that numerous data mining methods
such as density based clustering, k-means clustering, outlier
detection, or k-nearest neighbor classification can be based on
the similarity join as a database primitive. By such a
reformulation, the identical result can be achieved at a
drastically improved efficiency.

Therefore, the similarity join becomes an important basic
operation of multimedia databases. For a given set of feature
vectors, the similairty join determines those object pairs which
are most similar according to some appropriate similairty measure,
in SQL style SELECT * FROM P1,P2 WHERE distance (P1,P2) <= epsilon

The structure of the talk is as follows: In the first part we show
how typical algorithms of data analysis and data mining can be
reformulated such that they are exclusively based on the
similarity join. According to several example applications, we
demonstrate the enormous performance potential of this database
primitive.

The second part of the presentation is dedicated to the develop-
ment of efficient algorithms for the computation of the similarity
join. First we introduce a new cost model for index based
similairty join algorithms. Starting from this cost model, we
develop an innovative index architecture which takes into account
that similarity join algorithms require a separate optimization of
CPU and I/O cost.

Next, we develop a similarity join algorithm which is particularly
suited for massive data sets. It is based on a particular sort
order for high dimensional data. Finally, we propose a technique
for the reduction of CPU cost which is universally applicable in
index based and non-index-based similarity join methods.

A perspective on future research directions in the area of
database primitives for similarity search, data analysis, and data
mining concludes the talk.

Contact

Christoph Storb
-900
--email hidden
passcode not visible
logged in users only

Tags, Category, Keywords and additional notes

Applicant for group leader of independant research group