MPI-INF Logo
Campus Event Calendar

Event Entry

What and Who

New Tools for Text Indexing and Beyond: Substring Complexity and String Synchronizing Sets

Tomasz Kociumaka
Max-Planck-Institut für Informatik - D1
Joint Lecture Series
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Wednesday, 7 February 2024
12:15
60 Minutes
E1 5
002
Saarbrücken

Abstract

In recent decades, the volume of sequential data processed in genomics and other disciplines has exploded. This trend escalates the challenge of designing text indexes – concise data structures that allow answering various queries efficiently. For example, pattern-matching queries ask if the text (a long string modeling the dataset) contains an occurrence of a given pattern (a short string provided at query time). The suffix array is a classic index for pattern matching and many other queries, but modern applications demand space-efficient alternatives like the FM index. Still, even these data structures struggle with massive highly repetitive datasets, such as human genome collections. This scenario requires indexes whose size is comparable to the best text compression methods.

The talk will introduce two tools originating from my text indexing research. Substring complexity is a well-behaved measure of text repetitiveness that aids in comparing the performance of compression algorithms. String synchronizing sets enable consistent selection of small subsets of text substrings. Recently, these tools became the core of the asymptotically smallest compressed text index with suffix-array functionality. Furthermore, they find applications beyond text indexing: in combinatorics on words, quantum string algorithms, and more.

Contact

Jennifer Müller
+49 681 9325 2900
--email hidden

Virtual Meeting Details

Zoom
997 1565 5535
passcode not visible
logged in users only

Jennifer Müller, 10/04/2023 10:44 -- Created document.