This problem can be viewed from two different aspects: misspellings in the search query and misspellings in the documents. The first aspect of the problem can be solved efficiently by some user intervention, i.e. the user gets some feedback from the results, being able to change the query. However, the second problem cannot be solved by the user itself, i.e. one cannot possibly correct or even be aware of the misspellings in the potential results; meaning that the retrieval is incomplete even if the query is perfect. This way one may inherently miss important information.
In this master thesis we concentrate on the second aspect of the problem. Furthermore, the challenge becomes even bigger as we deal with huge amount of data on one hand, and want to have fast user response on the other, i.e. the spelling correction should not affect the retrieval time crucial for a search engine. Our aim is also retrieving the results from the collection as complete as possible and as accurate as possible.
We present a novel approach which involves efficiently preprocessing the dictionary of a collection in a form such that the search engine does not involve any preprocessing effort at query time. Our solution is based on word clustering, i.e. producing clusters which contain all possible misspellings of a given correct word. Once the clusters are computed they are used by the search engine (in this case the CompleteSearch engine) to index all relevant documents per cluster instead of per word.
One cannot hope on classical clustering algorithms, since of the nature of the problem and the amount of the input data, the clustering might take unreasonably long time and still give meaningless results. The latter sets the two primary goals in this thesis: highly efficient and accurate clustering in the same time.