Campus Event Calendar: Marjan Celikik (06/13/2007 in E1 4/024)

Campus Event Calendar

Campus Event Calendar:
- All Upcoming:
  - only for D1
  - only for D2
  - only for INET
  - only for D4
  - only for D5
  - only for D6
  - only for RG1
  - Mailing Lists
  - by Speaker
  - by Type
  - by Category
  - by Title
  - Calendar
  - RSS Feed
- History of Events:

Event Entry

What and Who

Fast and accurate clustering-based spelling corrections in search engines

Marjan Celikik

IMPRS

Talk

AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, RG2

Public Audience

English

Note: We use this to send email in the morning.

Date, Time and Location

Wednesday, 13 June 2007

13:00

-- Not specified --

E1 4

024

Saarbrücken

Abstract

Nowadays search engines have become the primary means of accessing information. However, many studies show misspelled words are very common in queries to these systems, and even worse, they abound in the indexed documents themselves, especially if one deals with OCR-ed text. As a consequence, the results from the search are incorrect or provide inconclusive information.

This problem can be viewed from two different aspects: misspellings in the search query and misspellings in the documents. The first aspect of the problem can be solved efficiently by some user intervention, i.e. the user gets some feedback from the results, being able to change the query. However, the second problem cannot be solved by the user itself, i.e. one cannot possibly correct or even be aware of the misspellings in the potential results; meaning that the retrieval is incomplete even if the query is perfect. This way one may inherently miss important information.

In this master thesis we concentrate on the second aspect of the problem. Furthermore, the challenge becomes even bigger as we deal with huge amount of data on one hand, and want to have fast user response on the other, i.e. the spelling correction should not affect the retrieval time crucial for a search engine. Our aim is also retrieving the results from the collection as complete as possible and as accurate as possible.

We present a novel approach which involves efficiently preprocessing the dictionary of a collection in a form such that the search engine does not involve any preprocessing effort at query time. Our solution is based on word clustering, i.e. producing clusters which contain all possible misspellings of a given correct word. Once the clusters are computed they are used by the search engine (in this case the CompleteSearch engine) to index all relevant documents per cluster instead of per word.

One cannot hope on classical clustering algorithms, since of the nature of the problem and the amount of the input data, the clustering might take unreasonably long time and still give meaningless results. The latter sets the two primary goals in this thesis: highly efficient and accurate clustering in the same time.

Contact

IMPRS

227

--email hidden

System used:

Meeting URL:

Meeting ID:

Passcode:

passcode not visible

Code Visible for:

logged in users only

Andrea Primm, 06/11/2007 14:28
Andrea Primm, 06/08/2007 13:46 -- Created document.

Imprint / Impressum | Data Protection / Datenschutzhinweis