ECCB 2002 Poster

Title: Linguistic Complexity Profiles of Prokaryotic Genomic Sequences Assist in Detection and Classification of Terminators
P64
Hosid, Sergey; Bolshoy, Alexander

Bolshoy@research.haifa.ac.il
Genome Diversity Center at the Institute of Evolution, University of Haifa, Haifa, ISRAEL

Motivation: It has become accepted in computational biology to treat biological sequences as linear texts and to study them using linguistic techniques. One of the major features of genomic DNA text, distinguishing it from texts in most natural or artificial languages, is its high repetitiveness. Variation in the repetitiveness of genomic texts reflects the presence of different biologically important messages. We demonstrate that potential transcriptional terminators may be discovered by following variation in repetitiveness, through the construction of typical patterns of complexity distribution surrounding the 3'-ends of predicted genes.
Genomic sequences can be analyzed as linear texts. One fundamental characteristic of a linear text is its complexity, which can be defined by methods based on either Kolmogorov complexity or Shannon entropy. A simpler way to measure sequence complexity is through the richness of its vocabulary: how uniformly the different subwords of length k (k-grams) appear in the sequence, for all possible values of k. Trifonov first introduced this notion, known as linguistic complexity [1]. We have previously used it to enhance the nucleosomal pattern [2], and a modified version of it for an overall description of complete genomes [3]. In our recent work [4], we presented a method for fast calculation of the linguistic complexity of DNA sequences. Our program uses suffix trees [5] to compute the number of subwords present in genomic sequences, which allows linguistic complexity to be calculated in time linear in genome size.
The major goal of that project was to study patterns of sequence complexity around the flanks of coding sequences. Complexity profiles were constructed for all available completely sequenced prokaryotic genomes. We noted that a major feature of the profiles of A+T-rich prokaryotic genomes is the presence of relatively simple regions of about 50 bp immediately before the start of translation and immediately after the end of the CDS. Simple regions appear in the complexity plot as minima. In some prokaryotes, especially hyperthermophiles, these minima are sharply pronounced. Further investigation of recurrent elements led us to conclude that mononucleotide runs of adenines and thymines are the main contributors to this drop in complexity. AAA and TTT are overrepresented in the 3' gene flanks. These repeats substantially decrease the level of sequence complexity and, moreover, are distributed nonrandomly.
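To make the notion of a complexity profile concrete, the following is a minimal sketch in Python. It assumes one common formulation of linguistic complexity: the number of distinct subwords of every length found in a window, divided by the maximum number possible for a window of that length over a four-letter alphabet. It counts subwords naively with sets, not with the suffix-tree algorithm of [4], so it is quadratic in window length and suitable only for short windows. The function names, window size, and step are illustrative assumptions and are not taken from the original program.

```python
def linguistic_complexity(seq, alphabet_size=4):
    """Ratio of observed distinct subwords to the maximum possible number.

    Assumed formulation: sum over all k of distinct k-grams, divided by
    sum over all k of min(alphabet_size**k, n - k + 1).
    Naive set-based counting (not the suffix-tree method of [4]).
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    observed = 0
    maximum = 0
    for k in range(1, n + 1):
        # Distinct k-grams actually present in the sequence.
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # At most alphabet_size**k different words, and at most n-k+1 positions.
        maximum += min(alphabet_size ** k, n - k + 1)
    return observed / maximum


def complexity_profile(seq, window=40, step=1):
    """Slide a window along seq; return (start, complexity) pairs."""
    return [
        (start, linguistic_complexity(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # A mononucleotide run scores far lower than a mixed window.
    print(linguistic_complexity("ACGT"))
    print(linguistic_complexity("AAAA"))
    print(complexity_profile("ACGTAAAA", window=4, step=4))
```

In such a profile, runs of A or T downstream of a stop codon would show up as minima, which is the qualitative behavior described above; the actual window size and normalization used for the published profiles may differ.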
We speculate that these low-complexity zones downstream of the ends of coding sequences point to the locations of transcription terminators. It is commonly held that an RNA secondary structure in the nascent RNA, followed by a trail of U residues, is necessary and sufficient to terminate transcription. Using the software Genome Scanner for Terminators [6], Unniraman et al. identified putative terminators based on the prediction of a stable hairpin in the vicinity of a stop codon. Washio et al. performed a similar analysis earlier [7]. Both articles noted that in many prokaryotic genomes the proportion of putative hairpin terminators is negligible, which suggests other mechanisms of termination. Analysis of positional autocorrelation functions for TT and AA dinucleotides reveals a period of about 10.5 bp. We speculate that DNA curvature is involved in slowing down the RNA polymerase and assists in the termination of transcription.
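The periodicity analysis mentioned above can be illustrated with a simple positional autocorrelation. The abstract does not specify the exact function used, so the sketch below assumes the simplest variant: for each distance (lag), count the pairs of positions at that distance that both start an AA or TT dinucleotide. The function name and parameters are hypothetical.

```python
def positional_autocorrelation(seq, max_lag=30, dinucs=("AA", "TT")):
    """Count co-occurrences of the given dinucleotides at each lag.

    Returns a dict mapping lag (1..max_lag) to the number of position
    pairs (i, i + lag) where both positions start one of `dinucs`.
    This is an assumed, unnormalized definition for illustration only.
    """
    n = len(seq) - 1  # number of dinucleotide start positions
    hits = [seq[i:i + 2] in dinucs for i in range(n)]
    counts = {}
    for lag in range(1, max_lag + 1):
        counts[lag] = sum(
            1 for i in range(n - lag) if hits[i] and hits[i + lag]
        )
    return counts


if __name__ == "__main__":
    # Synthetic sequence with an AA every 10 bp.
    synthetic = ("AA" + "C" * 8) * 5
    result = positional_autocorrelation(synthetic, max_lag=30)
    print(max(result, key=result.get))
```

On real flanking sequences, a non-integer period such as 10.5 bp would be expected to appear as elevated counts around lags 10 and 11 and near their multiples, not as a single sharp peak; a normalized version would also correct for the overall frequency of AA and TT.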
Thus, the combined use of a few linguistic analysis tools has putatively identified novel features of prokaryotic terminators.
[1] Trifonov, E.N. Making Sense of the Human Genome. In Structure & Methods, Vol. 1 (eds. Sarma, R.H. & Sarma, M.H.) 69-77 (Adenine Press, Albany, 1990).
[2] Bolshoy, A., Shapiro, K., Trifonov, E.N. & Ioshikhes, I. Enhancement of the nucleosomal pattern in sequences of lower complexity. Nucleic Acids Res. 25, 3248-3254 (1997).
[3] Gabrielian, A.E. & Bolshoy, A. Sequence complexity and DNA curvature. Computers & Chemistry 23, 263-274 (1999).
[4] Troyanskaya, O.G., Arbell, O., Koren, Y., Landau, G.M. & Bolshoy, A. Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity. Bioinformatics 18, 679-688 (2002).
[5] Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. 534 pp. (Cambridge University Press, Cambridge, 1997).
[6] Unniraman, S., Prakash, R. & Nagaraja, V. Conserved economics of transcription termination in eubacteria. Nucleic Acids Res. 30, 675-684 (2002).
[7] Washio, T., Sasayama, J. & Tomita, M. Analysis of complete genomes suggests that many prokaryotes do not rely on hairpin formation in transcription termination. Nucleic Acids Res. 26, 5456-5463 (1998).