MPI-INF Logo
Campus Event Calendar

Event Entry

What and Who

Fast and accurate clustering-based spelling corrections in search engines

Marjan Celikik
IMPRS
Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, RG2  
Public Audience
English

Date, Time and Location

Wednesday, 13 June 2007
13:00
-- Not specified --
E1 4
024
Saarbrücken

Abstract

Nowadays search engines have become the primary means of accessing information. However, many studies show misspelled words are very common in queries to these systems, and even worse, they abound in the indexed documents themselves, especially if one deals with OCR-ed text. As a consequence, the results from the search are incorrect or provide inconclusive information.

This problem can be viewed from two different aspects: misspellings in the search query and misspellings in the documents. The first aspect of the problem can be solved efficiently by some user intervention, i.e. the user gets some feedback from the results, being able to change the query. However, the second problem cannot be solved by the user itself, i.e. one cannot possibly correct or even be aware of the misspellings in the potential results; meaning that the retrieval is incomplete even if the query is perfect. This way one may inherently miss important information.

In this master thesis we concentrate on the second aspect of the problem. Furthermore, the challenge becomes even bigger as we deal with huge amount of data on one hand, and want to have fast user response on the other, i.e. the spelling correction should not affect the retrieval time crucial for a search engine. Our aim is also retrieving the results from the collection as complete as possible and as accurate as possible.

We present a novel approach which involves efficiently preprocessing the dictionary of a collection in a form such that the search engine does not involve any preprocessing effort at query time. Our solution is based on word clustering, i.e. producing clusters which contain all possible misspellings of a given correct word. Once the clusters are computed they are used by the search engine (in this case the CompleteSearch engine) to index all relevant documents per cluster instead of per word.

One cannot hope on classical clustering algorithms, since of the nature of the problem and the amount of the input data, the clustering might take unreasonably long time and still give meaningless results. The latter sets the two primary goals in this thesis: highly efficient and accurate clustering in the same time.

Contact

IMPRS
227
--email hidden
passcode not visible
logged in users only

Andrea Primm, 06/11/2007 14:28
Andrea Primm, 06/08/2007 13:46 -- Created document.