Full-text indices provide fast string search over huge text collections. Traditionally, the main challenges for such indices have been their space consumption and construction time.
This thesis implements a file-based indexing framework for the TopX search engine. The framework constructs an inverted index, and its parallelized indexer scales to huge text collections.
The framework supports several index layouts with content-based and proximity scoring functions. To reduce the required disk space, we employ static index pruning techniques with quality guarantees. For a given keyword, the corresponding inverted list can be fetched with only two sequential disk I/O operations.
Our experiments on the TREC Terabyte Track "GOV2" collection (426 GB) showed that indices with BM25 and proximity scores can be constructed in 190 hours, using disk space of only about 31% of the original collection size.
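The two-read lookup mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual file layout: the directory file, the postings file, the record formats `DIR_ENTRY` and `POSTING`, the in-memory sorted term list, and the function names `build_index` and `fetch_list` are all assumptions made for illustration. The first read fetches a fixed-width directory record for the keyword; the second reads the whole inverted list in one contiguous pass.

```python
import bisect
import struct

# Hypothetical on-disk record formats (assumptions, not the thesis's layout).
DIR_ENTRY = struct.Struct("<QI")  # byte offset into postings file, posting count
POSTING = struct.Struct("<If")    # document id, score (32-bit float)


def build_index(postings_by_term, dir_path, postings_path):
    """Write the directory and postings files; return the sorted term list."""
    terms = sorted(postings_by_term)
    with open(dir_path, "wb") as dir_file, open(postings_path, "wb") as post_file:
        for term in terms:
            # Assumption: each list is stored in descending score order.
            plist = sorted(postings_by_term[term], key=lambda p: -p[1])
            dir_file.write(DIR_ENTRY.pack(post_file.tell(), len(plist)))
            for doc_id, score in plist:
                post_file.write(POSTING.pack(doc_id, score))
    return terms


def fetch_list(term, terms, dir_path, postings_path):
    """Return the inverted list for `term` using two disk reads."""
    # The term-to-slot mapping is resolved in memory, without disk access.
    slot = bisect.bisect_left(terms, term)
    if slot == len(terms) or terms[slot] != term:
        return []
    # Read 1: the fixed-width directory record for this term.
    with open(dir_path, "rb") as dir_file:
        dir_file.seek(slot * DIR_ENTRY.size)
        offset, count = DIR_ENTRY.unpack(dir_file.read(DIR_ENTRY.size))
    # Read 2: the entire inverted list, stored contiguously.
    with open(postings_path, "rb") as post_file:
        post_file.seek(offset)
        raw = post_file.read(count * POSTING.size)
    return [POSTING.unpack_from(raw, i * POSTING.size) for i in range(count)]
```

Storing each list contiguously is what keeps the second read sequential; the cost is that the sorted term list (or some equivalent term-to-slot mapping) has to fit in memory.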
Access Level:
Internal
Referee, Status
1. Referee:
Gerhard Weikum
2. Referee:
Holger Bast
Supervisor:
Ralf Schenkel
Status:
Completed
Date of the Kolloquium:
16 January 2009
Correlation
MPG Unit:
Max-Planck-Institut für Informatik
MPG Subunit:
IMPRS-CS
Audience:
Expert
Appearance:
MPII WWW Server, MPII FTP Server, MPG publications list, university publications list, working group publication list, Fachbeirat, VG Wort
BibTeX Entry:
@MASTERSTHESIS{Kasradze2008,
AUTHOR = {Kasradze, Levan},
TITLE = {Implementation of a File-based Indexing Framework for the TopX Search Engine},
SCHOOL = {Universit{\"a}t des Saarlandes},
YEAR = {2008},
MONTH = {May},
}
Entry last modified by Stephanie Jörg, 01/16/2009