ECCB 2002 Poster

Title: Statistical analysis of consistent repeating patterns in a set of 16S rDNA sequences
P128
Raje, D.V.; Purohit, H.J.; Singh, R.N.

hemantdrd@hotmail.com
National Environmental Engineering Research Institute, Nehru Marg, Nagpur - 440020, India

Analyzing genetic information in terms of patterns of nucleotides or amino acids, and relating the findings to the structure or function of a gene, has been an active field of research over the last decade. Such studies focus primarily on patterns of interest in sequences, followed by rigorous statistical analysis of their occurrences and inference about their possible structural or functional relevance in a biological system. Statistical significance and biological significance are not synonymous, but the ability to distinguish what is likely to occur by chance from what is unlikely to occur by chance is important in this context and may help identify sequence features that demand experimental verification. Several studies along these lines have been reported in the literature.

One typical kind of pattern observed in DNA sequences is the repeating pattern of nucleotide strings. In a set of homologous sequences, it is quite likely that some repeating patterns are conserved across the set, not only in their occurrence but also in their distribution within the sequences. The present work aims to identify such conserved (consistent) repeating patterns of maximal length in a set of related sequences, and to study the statistical significance of their occurrences along with the significance of their locations and spacings in these sequences. It is of interest to know whether the occurrences of consistent repeats in these sequences have arisen simply by chance. If not, inferences from the statistical analysis of selected repeats, coupled with the experimenter's knowledge, may lead to a better understanding of the molecular mechanisms. This work, however, addresses only the statistical aspects of the selected patterns.

The analysis begins with the identification of repeating patterns in an input sequence. We have developed a program, Repeat Tuple Search, to determine the consistent repeating patterns across homologous sequences; it works efficiently for short sequences of length less than 2 kb (www.ebi.ac.uk/~lijnzaad/RepeatTupleSearch). The program accepts sequences one by one in a simple text format and stores data on the repeating patterns and their separating distances for each input sequence. The collected data are then processed to obtain, for each repeating pattern, the number of sequences (frequency) in which it appears under the constant separating distance criterion. The patterns with the highest frequencies are considered the most consistent repeating patterns.
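The search described above can be illustrated with a minimal Python sketch. It is not the Repeat Tuple Search program itself: the function names (`repeated_tuples`, `repeat_distances`, `consistency`) are hypothetical, the pattern length `k` is fixed rather than extended to maximal length, and the "constant separating distance" criterion is assumed to mean that the same pattern recurs at the same distance between consecutive occurrences in different sequences.

```python
from collections import defaultdict


def repeated_tuples(seq, k):
    """Map each k-tuple that occurs at least twice in seq to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {word: pos for word, pos in positions.items() if len(pos) >= 2}


def repeat_distances(seq, k):
    """Return the set of (pattern, distance) pairs for consecutive occurrences."""
    pairs = set()
    for word, pos in repeated_tuples(seq, k).items():
        for a, b in zip(pos, pos[1:]):
            pairs.add((word, b - a))
    return pairs


def consistency(seqs, k):
    """Count in how many sequences each (pattern, distance) pair appears."""
    freq = defaultdict(int)
    for seq in seqs:
        # A set is used per sequence, so each sequence is counted once per pair.
        for pair in repeat_distances(seq, k):
            freq[pair] += 1
    return dict(freq)


# Toy input (made-up sequences, not 16S rDNA data).
toy = ["ACGTTTACGT", "GGACGTAAACGT", "ACGTACGA"]
print(consistency(toy, 4))  # {('ACGT', 6): 2}
```

In the toy input, the tuple ACGT recurs at a distance of 6 in two of the three sequences, so it would rank as the most consistent repeat under the assumptions stated above.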

The statistical significance of consistent repeating patterns in a set of sequences was assessed through the Z-statistic, used as a Gaussian approximation to the binomial distribution. Two approaches are presented to assess the statistical significance of spacings between the repeating patterns in natural sequences. The first is based on the probability distribution of minimal and maximal intervals (spacings), as suggested by Karlin and Brendel (1992); the second is based on the probability distribution of any interval formed by the repeating patterns. Properties such as the significance of clumping and the regularity of pattern dispersion in a sequence were analyzed using these approaches. The entire analysis was carried out for the selected consistent repeating patterns observed in four hundred 16S rDNA sequences belonging to forty different bacterial genera. Nine consistent repeating patterns of size greater than six were identified and analyzed for the significance of their occurrence in a sequence. The dispersion of these repeating patterns was studied for regularity in these sequences. The identified patterns have statistically significant features in a set of homologous 16S rDNA sequences and hence may have biological relevance.
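As a rough illustration of the two kinds of test mentioned above, the following Python sketch computes a Z-statistic from the Gaussian approximation to a binomial count, and the tail probability of the minimal spacing. The null models are assumptions made for illustration and are not necessarily those used in this work: the word probability is taken as the product of independent base frequencies, overlaps between occurrences are ignored, and the minimal-spacing formula is the standard result for n points placed uniformly at random on a unit interval (n + 1 spacings, including both ends). All function names are hypothetical.

```python
import math


def word_probability(word, base_freq):
    """Probability of a word under an independent-bases model (assumption)."""
    prob = 1.0
    for base in word:
        prob *= base_freq[base]
    return prob


def z_statistic(observed, n_trials, p):
    """Z = (X - np) / sqrt(np(1 - p)): Gaussian approximation to Binomial(n, p)."""
    mean = n_trials * p
    sd = math.sqrt(n_trials * p * (1.0 - p))
    return (observed - mean) / sd


def upper_tail_p(z):
    """One-sided p-value P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def min_spacing_tail(n_points, x):
    """P(minimal spacing > x) for n uniform points on [0, 1]: (1 - (n + 1)x)^n."""
    if x <= 0.0:
        return 1.0
    if x >= 1.0 / (n_points + 1):
        return 0.0
    return (1.0 - (n_points + 1) * x) ** n_points


# Hypothetical numbers: a 7-mer seen 5 times in a 1500-base sequence.
freq = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = word_probability("ACGTACG", freq)
z = z_statistic(5, 1500 - 7 + 1, p)
print(z, upper_tail_p(z))
```

A large positive Z suggests over-representation relative to the assumed null model, while a small value of `min_spacing_tail` for the observed minimal spacing (expressed as a fraction of the sequence length) suggests that the occurrences are more evenly spread than expected by chance.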
[1] Bouthinon, D. and Soldano, H. (1999), A new method to predict the consensus secondary structure of a set of unaligned RNA sequences. Bioinformatics, 15: 785-798.
[2] Brendel, V., Bucher, P., Nourbakhsh, I.R., Blaisdell, B.E. and Karlin, S. (1992), Methods and algorithms for statistical analysis of protein sequences. Proc. Natl. Acad. Sci. USA, 89: 2002-2006.
[3] Califano, A. (2000), SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics, 16: 341-357.
[4] Karlin, S. and Brendel, V. (1992), Chance and statistical significance in protein and DNA sequence analysis. Science, 257: 39-49.
[5] Karlin, S., Brendel, V. and Bucher, P. (1992), Significant similarity and dissimilarity in homologous proteins. Mol. Biol. Evol., 9: 152-167.
[6] Pesole, G., Liuni, S. and D'Souza, M. (2000), PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance. Bioinformatics, 16: 439-450.
[7] Kurtz, S. and Schleiermacher, C. (1999), REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics, 15: 426-427.
[8] Purohit, H.J. (2002), Development of tracking tools for selected bacteria using 16S rDNA sequences: An approach based on repeating tuples and probabilities of dinucleotides. Post-Genome Knowledge Discovery Program, IMS, National University of Singapore, Singapore (http://www.ims.nus.edu.sg/Programs/genome/part1w.htm).
[9] Waterman, M.S. (1995), Introduction to Computational Biology. Chapman & Hall/CRC, USA.
[10] Reinert, G., Schbath, S. and Waterman, M.S. (2000), Probabilistic and statistical properties of words. J. Comp. Biol., 7: 1-46.