ECCB 2002 Poster

Title: Probability occurrence of structured motifs and application to candidates promoters
P135
Robin, Stéphane; Daudin, Jean-Jacques; Richard, Hugues; Sagot, Marie-France; Schbath, Sophie

Sophie.Schbath@jouy.inra.fr
INRA, Unité Mathématique, Informatique & Génome, Jouy-en-Josas, FRANCE

The problem of extracting, from a set of nucleic acid sequences, motifs that may have a biological function has occupied biologists, statisticians and computer scientists for many years. The motifs that algorithms can handle are becoming increasingly sophisticated. This makes it possible to start addressing possible cooperative effects between binding sites involved in the same biological process, such as transcription. Yet it is important to be able to evaluate accurately how unexpected such motifs are, given a model for the sequences.
The complex motifs that are of interest here are the so-called "structured motifs" (see Marsan and Sagot, 2000). They are motifs that are composed of two ordered parts, called "boxes", separated by a distance which may take any value inside an interval. Each box may exhibit a different degree of conservation, i.e. a different maximum number of substitutions against the corresponding part in the motif. This paper proposes a first method for addressing the problem of calculating the probability of occurrence of such motifs under a Markov model.
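
As an illustration of the definition above, the following minimal Python sketch tests whether a structured motif occurs at a given position of a sequence. The names `Box`, `StructuredMotif`, `occurs_at` and `occurrence_positions` are hypothetical and do not come from the authors' software. The sketch also assumes that the distance between boxes counts the letters lying strictly between the end of the first box and the start of the second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    pattern: str    # consensus word of the box
    max_subs: int   # maximum number of substitutions allowed (assumed per box)


@dataclass(frozen=True)
class StructuredMotif:
    first: Box
    second: Box
    d_min: int      # smallest allowed gap between the two boxes
    d_max: int      # largest allowed gap between the two boxes


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two words of equal length."""
    return sum(x != y for x, y in zip(a, b))


def box_matches(seq: str, start: int, box: Box) -> bool:
    """True if the window of seq starting at `start` matches the box."""
    window = seq[start:start + len(box.pattern)]
    if len(window) != len(box.pattern):
        return False  # the window runs past the end of the sequence
    return hamming(window, box.pattern) <= box.max_subs


def occurs_at(seq: str, pos: int, motif: StructuredMotif) -> bool:
    """True if the first box starts at `pos` and the second box follows
    at some gap d with d_min <= d <= d_max."""
    if not box_matches(seq, pos, motif.first):
        return False
    after_first = pos + len(motif.first.pattern)
    return any(
        box_matches(seq, after_first + d, motif.second)
        for d in range(motif.d_min, motif.d_max + 1)
    )


def occurrence_positions(seq: str, motif: StructuredMotif) -> list[int]:
    """All 0-based start positions at which the motif occurs."""
    return [i for i in range(len(seq)) if occurs_at(seq, i, motif)]
```

For example, a motif built as `StructuredMotif(Box("TTGACA", 1), Box("TATAAT", 1), 16, 18)` (illustrative consensus words, not values taken from the poster) would be reported wherever a near-copy of the first word is followed, 16 to 18 letters later, by a near-copy of the second.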

Using the exact distributions of the waiting time before a given word occurs in a random sequence and of the distance between successive occurrences of multiple words (Robin and Daudin, 1999, 2001), we calculate the exact probability for a structured motif m to occur at a given position in a random sequence. The second step consists in calculating the probability for the structured motif m to occur at least once in a given sequence. Since this probability depends strongly on the overlapping structure of the motif, which cannot be completely described here, we use a geometric approximation of order 1 that appears to be very accurate in the applications. This approximation requires calculating a non-trivial exact probability: the conditional probability that m occurs at position i given that it has not occurred at position (i-1). Finally, we obtain the statistical significance of N sequences containing the given motif.
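
The last two steps can be sketched numerically as follows. This is one plausible reading of the approach, not the authors' implementation. It assumes that the probability of no occurrence over K admissible positions is approximated by (1 - mu)(1 - a)^(K-1), where mu is the probability that m occurs at a given position and a is the conditional probability that m occurs at position i given that it has not occurred at position (i-1). It further assumes that the sequences are independent and share the same occurrence probability, so that the number of sequences containing m is binomial. The function and parameter names are hypothetical.

```python
import math


def prob_at_least_once(mu: float, a: float, n_positions: int) -> float:
    """Order-1 geometric approximation (assumed form) of the probability
    that the motif occurs at least once among n_positions positions.

    mu: probability of an occurrence at a given position (0 <= mu < 1)
    a:  probability of an occurrence at position i given none at i-1
        (0 <= a < 1)
    """
    if n_positions <= 0:
        return 0.0
    # Work in log space: log P(no occurrence) = log(1-mu) + (K-1) log(1-a)
    log_none = math.log1p(-mu) + (n_positions - 1) * math.log1p(-a)
    return -math.expm1(log_none)  # 1 - exp(log_none), accurate for small values


def binomial_upper_tail(n_seqs: int, p: float, k_obs: int) -> float:
    """P(X >= k_obs) for X ~ Binomial(n_seqs, p): the p-value of seeing
    the motif in at least k_obs of n_seqs independent sequences."""
    return sum(
        math.comb(n_seqs, k) * p**k * (1.0 - p) ** (n_seqs - k)
        for k in range(k_obs, n_seqs + 1)
    )
```

A call such as `binomial_upper_tail(n_seqs, prob_at_least_once(mu, a, n_positions), k_obs)` then gives a significance level for a motif found in `k_obs` sequences. The values of `mu` and `a` have to come from the exact calculations described above, which this sketch does not reproduce.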

As an application, we considered the dataset of Helmann (1995), which contains 131 non-coding sequences of length 100 bp located just upstream of genes in B. subtilis. For these genes, the transcription start point or, more rarely, the promoter itself has been experimentally determined. A total of 564 structured motifs occurring in at least 4% of the sequences with at most one substitution error have been extracted (Marsan and Sagot, 2000). We then determined their statistical significance. The most exceptional motifs (with a p-value less than 10^-6) seem to agree with the known consensus for the site at -35 and the TATA-box (see Record et al., 1996).
[1] Helmann, J. D. (1995). Compilation and analysis of Bacillus subtilis sigma A-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucl. Acids Res. 23, 2351-2360.
[2] Marsan, L. and Sagot, M.-F. (2000a). Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comp. Biol. 7, 345-362.
[3] Record, M. T., Reznikoff, W. S., Craig, M. L., McQuade, K. L. and Schlax, P. J. (1996). Escherichia coli RNA polymerase sigma70 promoters, and the kinetics of the steps of transcription initiation. (F. C. Neidhardt, ed.), volume 1. ASM Press.
[4] Robin, S. and Daudin, J.-J. (1999). Exact distribution of word occurrences in a random sequence of letters. J. Appl. Prob. 36, 179-193.
[5] Robin, S. and Daudin, J.-J. (2001). Exact distribution of the distances between any occurrences of a set of words. Ann. Inst. Statist. Math. 53 (4), 895-905.