New for: D3, D4
Alberto Apostolico
Purdue University and Universita' di Padova
The identification of strings that are, by some measure, redundant or rare
in the context of larger sequences is variously pursued in order to
compress data, unveil structure, infer minimal or compact descriptions,
extract and classify features, etc. In the emerging field of bio-sequence
analysis, unusually frequent or rare strings have been variously implicated
in biological functions and mechanisms. In these domains,
tables for storing the number of occurrences in a string of substrings
of (or up to) a given length need to be computed, stored, and tested against
expectations. This talk concentrates on the computational cost involved
in these tasks within some basic probabilistic models.
We discuss algorithms and a software tool called Verbumculus for detecting
and displaying all substrings of a given sequence
of $n$ symbols that deviate from expected frequency. The main theoretical
result is that under several accepted probabilistic models
the search for such over- or underrepresented words may be restricted
to $O(n)$ candidates.
Joint work with M. Bock, S. Lonardi, X.Xu