MPI-INF Logo
Campus Event Calendar

Event Entry

What and Who

Finding Near Neighbors Through Cluster Pruning

Mauro Sozio
La Sapienza University, Rome
Talk
AG 5  
Expert Audience
English

Date, Time and Location

Tuesday, 17 October 2006
11:00
45 Minutes
E1 4
433
Saarbrücken

Abstract

Finding near(est) neighbors is a classic, difficult problem in
data management and retrieval, with applications in text and image
search in finding similar objects and matching patterns.
Here we study {\em cluster pruning}, an extremely simple randomized
technique.  During preprocessing we randomly choose a subset of data
points to be {\em leaders}; the remaining data points are partitioned by
which leader is the closest.  For query processing, we find the
leader(s) closest to the query point.  We then seek the nearest
neighbors for the query point among only the points in the clusters of
the closest leader(s).  Recursion may be used in both preprocessing and
in search.  Such schemes seek approximate nearest neighbors that are
``almost as good'' as the nearest neighbors.  How good are these
approximations and how much do they save in computation?

Our contributions are: (1) we quantify metrics that allow us to study
the tradeoff between the computation at query time and the quality of
the approximate nearest neighbors; (2) we give rigorous theoretical
analysis of our schemes, under natural generative processes
(generalizing Gaussian mixtures) for the data points; (3) Experiments on
both synthetic data from such generative processes, as well as on real
data from a document corpus, confirm that we save orders of magnitude in
query processing cost at modest compromises in the quality of retrieved
points. In particular,  we show that p-spheres, a state-of-the-art
solution, is outperformed by our simple scheme whether the data points
are stored in main or in external memory.

Contact

Gerhard Weikum
--email hidden
passcode not visible
logged in users only

Petra Schaaf, 10/16/2006 08:46
Petra Schaaf, 10/13/2006 13:57 -- Created document.