This thesis implements a file-based indexing framework for the TopX search engine. The framework constructs an inverted index. We have implemented a parallelized indexer that scales to very large text collections. The framework supports several index layouts with content-based and proximity scoring functions. To reduce the required disk space, we employ static index pruning techniques with quality guarantees. For a given keyword, we can fetch the corresponding inverted list with only two sequential disk I/O operations.
Our experiments on the TREC Terabyte Track “GOV2” collection (426 GB) showed that indices with BM25 and proximity scores can be constructed in 190 hours, using disk space amounting to only 31% of the original collection size.
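To make the two-read lookup concrete, the following is a minimal sketch of one way a file-based index could serve a keyword with two contiguous reads: one read of a fixed-size directory bucket, and one read of the inverted list it points to. It is an illustration under stated assumptions, not the layout used by the framework. The two-file split, the hashed bucket directory, the record formats, and the names `build` and `lookup` are all hypothetical.

```python
import struct
import zlib

# All constants and record layouts below are assumptions made for illustration.
NUM_BUCKETS = 64        # number of fixed-size buckets in the directory file
SLOTS_PER_BUCKET = 4    # directory entries per bucket
TERM_BYTES = 16         # terms are stored zero-padded to this width
ENTRY = struct.Struct(f"<{TERM_BYTES}sQI")  # term, list offset, posting count
POSTING = struct.Struct("<If")              # document id, score
BUCKET_BYTES = ENTRY.size * SLOTS_PER_BUCKET


def bucket_of(term):
    """Map a term to a directory bucket with a deterministic hash."""
    return zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS


def build(index, dir_path, data_path):
    """Write each inverted list contiguously, then the bucket directory."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    with open(data_path, "wb") as data:
        for term in sorted(index):
            key = term.encode("utf-8")
            if len(key) > TERM_BYTES:
                raise ValueError("term too long for this sketch")
            offset = data.tell()
            for doc_id, score in index[term]:
                data.write(POSTING.pack(doc_id, score))
            buckets[bucket_of(term)].append((key, offset, len(index[term])))
    with open(dir_path, "wb") as directory:
        for entries in buckets:
            if len(entries) > SLOTS_PER_BUCKET:
                raise ValueError("bucket overflow is not handled in this sketch")
            block = b"".join(ENTRY.pack(*entry) for entry in entries)
            directory.write(block.ljust(BUCKET_BYTES, b"\0"))


def lookup(term, dir_path, data_path):
    """Return the (doc_id, score) list for a term using two contiguous reads."""
    key = term.encode("utf-8").ljust(TERM_BYTES, b"\0")
    with open(dir_path, "rb") as directory:
        directory.seek(bucket_of(term) * BUCKET_BYTES)
        block = directory.read(BUCKET_BYTES)           # read 1: directory bucket
    for stored, offset, count in ENTRY.iter_unpack(block):
        if stored == key:
            with open(data_path, "rb") as data:
                data.seek(offset)
                raw = data.read(count * POSTING.size)  # read 2: inverted list
            return list(POSTING.iter_unpack(raw))
    return []  # term not in the index; only the directory read was issued
```

Because every bucket has the same size, the position of the directory read follows directly from the hash of the term, and because each inverted list is stored contiguously, the second read needs only the offset and length found in the bucket. A real index would additionally need overflow handling for full buckets and a compressed posting format, both of which this sketch omits.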