MPI-INF Logo
Campus Event Calendar

Event Entry

What and Who

PhD Application Talk: Efficient Large-Scale Clustering of Spelling Variants, with Applications to Error-Tolerant Text Search

Marjan Celikik
Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, RG2  
MPI Audience
English

Date, Time and Location

Monday, 29 October 2007
08:00
480 Minutes
E1 4
024
Saarbrücken

Abstract

In this thesis, the following spelling variants clustering problem is
considered: Given a list of distinct words, called lexicon, compute (possibly
overlapping) clusters of words which are spelling variants of each other. We are
looking for algorithms that are both efficient and accurate. Accuracy is
measured with respect to human judgment, e.g., a cluster is 100% accurate if it
contains all true spelling variants of the unique correct word it contains and
no other words, as judged by a human.

We have sifted the large body of literature on approximate string searching and
spelling correction problem for its applicability to our problem. We have
combined various ideas from previous approaches to two new algorithms, with two
distinctly different trade-offs between efficiency and accuracy. We have
analyzed both algorithms and tested them experimentally on a variety of test
collections, which were chosen to exhibit the whole spectrum of spelling errors
as they occur in practice (human-made, OCR-induced, garbage). Our largest
lexicon, containing roughly 25 million words, can be processed in half an hour
on a single machine. The accuracies we obtain range from 88% - 95%. We show that
previous approaches, if directly applied to our problem, are either
significantly slower or significantly less accurate or both.

Our spelling variants clustering problem arises naturally in the context of
search engine spelling correction of the following kind: For a given query,
return not only documents matching the query words exactly but also those
matching their spelling variants. This is inverse to the well-known "did you
mean: ..." web search engine feature, where the error tolerance is on the side
of the query, and not on the side of the documents. We have integrated our
algorithms with the CompleteSearch engine, and show that this feature can be
achieved without significant blowup in either index size or query processing time.

Contact

IMPRS-CS
-225
--email hidden
passcode not visible
logged in users only

Tags, Category, Keywords and additional notes

Please note: The talks will take place in random order!

Katja Schönborn, 10/26/2007 14:18
Katja Schönborn, 10/23/2007 19:21
Katja Schönborn, 10/18/2007 14:04 -- Created document.