MPI-INF Logo
Campus Event Calendar

Event Entry

What and Who

Space-Efficient Management of String Databases by Reusing Common Characters

Sourav Dutta
International Max Planck Research School for Computer Science - IMPRS
PhD Application Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Monday, 8 October 2012
09:00
60 Minutes
E1 4
024
Saarbrücken

Abstract

With the advent of scanning devices and optical character recognition tools, efficient storage and management of large sets of text data has emerged as a fundamental requirement for a large number of modern day applications including dictionary and thesaurus storage, indexing of sequence databases, search engine applications, and XML information mining.

To such applications, we propose a novel data structure, SEManTIKs, for space efficient storage and management of large text and sequence databases. The structure uses bit vectors that reuse the storage space for common triplets and, therefore, has low space requirements as compared to the existing trie-based structures. In addition to exact string search operation, SEManTIKs also efficiently handles prefix and suffix search queries. We also propose an extension of the structure for handling substring searches, albeit with an increase in the storage requirements. This extension is important in comparison to the trie-like and compressed dictionary-based methods that are unable to handle such queries efficiently.
We perform several experiments to show that SEManTIKs outperforms the existing structures by nearly a factor of two in terms of space requirements, while the various query times are either better or comparable.

Contact

--email hidden
passcode not visible
logged in users only

Marc Schmitt, 10/05/2012 16:08 -- Created document.