ECCB 2002 Poster

Title: AGenDA: A WWW Server for Gene Recognition by Comparative Sequence Analysis
P161
Taher, Leila (1); Rinner, Oliver (2); Garg, Saurabh (1); Sczyrba, Alexander (3); Brudno, Mike (4); Batzoglou, Serafim (5); Morgenstern, Burkhard (1)

ltaher@techfak.uni-bielefeld.de
(1) International Graduate School for Bioinformatics and Genome Research, and (3) Faculty of Technology, Research Group in Practical Computer Science, University of Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany; (2) GSF Research Center, MIPS / Institute of Bioinformatics, Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany; (4) 110 Gates Building, (5) 138 Gates Building, Computer Science Department, Stanford University, Stanford, CA 94305.

Completing the genome sequence of an organism is only the first step toward understanding it. The next is to annotate the sequence, in particular the coding exons of each gene, as well as non-coding exons and regulatory sequences. To automate this search, several gene prediction algorithms have been developed (Claverie, 1997; Stormo, 2000). These programs are traditionally categorized as either ab initio or extrinsic methods. The former, such as GENSCAN (Burge and Karlin, 1997), base their predictions solely on the DNA sequence itself, using statistical models of gene structure. Most ab initio methods correctly identify about 50% of the real exons (Burset and Guigo, 1996). Extrinsic methods, such as GENEWISE (Birney and Durbin, 1997) and PROCRUSTES (Gelfand et al., 1996), instead exploit homology with genes that have already been characterized, looking for similarities between the query sequence and sequences in protein databases. As a result, their accuracy depends strongly on the presence of homologous genes in those databases.

Finally, a relatively new class of gene-prediction algorithms is based on information extracted by aligning genomic sequences from two related species, especially mouse and human. This approach relies on the fact that functional regions tend to be conserved during evolution, while non-functional regions are more variable. Comparative sequence analysis has already proved a powerful approach for identifying functional elements in genomic sequences (Ansari-Lari et al., 1998; Batzoglou et al., 2000; Gottgens et al., 2000; Loots et al., 2000; Blayo et al., in press).

The gene prediction program that we are developing is called AGenDA (Alignment-based Gene-Detection Algorithm). It exploits homology between two evolutionarily related genomes together with intrinsic information contained in the sequences. The current version of the program searches for evolutionarily conserved regions using the DIALIGN program (Morgenstern et al., 1996; Morgenstern, 1999) and integrates this information with splice sites identified using standard probabilistic models (Salzberg, 1997). This produces a list of potential exons, from which the program builds complete gene structures using a combinatorial optimization approach. AGenDA has an important advantage over more traditional gene-prediction programs: except for splice site detection, it does not depend on statistical content measures. Therefore, it can be applied to genome sequences from newly sequenced organisms for which no training data exist, provided syntenic sequences are available from a reference species at an appropriate evolutionary distance (sufficient divergence of non-coding genomic regions but preservation of the coding exons).
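The step from a list of potential exons to a gene structure can be illustrated with a small sketch. The abstract only states that a combinatorial optimization approach is used, so the following is an assumption for illustration, not AGenDA's actual procedure: it treats the problem as weighted interval scheduling and picks the highest-scoring set of non-overlapping candidate exons by dynamic programming. The `Exon` type, its `score` field, and the function name `best_chain` are hypothetical, and biological constraints such as reading-frame compatibility between consecutive exons are ignored.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 0-based, inclusive
    end: int      # exclusive
    score: float  # hypothetical score, e.g. conservation plus splice-site quality


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the best total score and one optimal chain of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)   # best[i]: optimum using only the first i exons
    take = [False] * (n + 1) # take[i]: whether exon i is used in that optimum
    prev = [0] * (n + 1)     # prev[i]: prefix length compatible with exon i
    for i, exon in enumerate(ordered, start=1):
        # Count the earlier exons that end at or before this exon's start.
        p = bisect_right(ends, exon.start, 0, i - 1)
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, p
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain
```

For instance, `best_chain([Exon(0, 100, 5.0), Exon(50, 150, 8.0), Exon(150, 300, 4.0), Exon(100, 140, 2.0)])` selects the exons at 50-150 and 150-300, because their combined score exceeds that of the three shorter mutually compatible candidates.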

We developed a WWW server to make our program easily available to the genome research community. The user enters a pair of potentially related genomic sequences, such as from human and mouse, and the system automatically performs the following series of steps on them. First, an alignment of the two sequences is calculated using DIALIGN. To speed up this process, we use the program CHAOS (Brudno and Morgenstern, 2002; http://www.stanford.edu/~brudno/chaos/), which rapidly finds homologous segments. These similarities are in turn used to anchor the DIALIGN alignment procedure, reducing search space and running time. Second, we apply AGenDA to the sequence homologies identified by DIALIGN. The software compiles a list of candidate exons, which are delimited by recognized conserved splice sites and start/stop codons. Third, the system calculates optimal gene models based on these candidate exons. Finally, the output is returned to the user via e-mail.
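The candidate-exon step can be pictured with a deliberately simplified sketch. It is an illustration under stated assumptions rather than the server's code: conserved regions are given as plain coordinate pairs on one sequence, splice sites are recognized only by the consensus dinucleotides AG (acceptor) and GT (donor) instead of the probabilistic models used by AGenDA, and start/stop codons and the second sequence are ignored. The function name and the `flank` and `min_len` parameters are hypothetical.

```python
from typing import List, Tuple


def candidate_internal_exons(
    seq: str,
    conserved: List[Tuple[int, int]],
    flank: int = 10,    # hypothetical: how far outside a conserved region to look
    min_len: int = 30,  # hypothetical: shortest exon accepted
) -> List[Tuple[int, int]]:
    """List (start, end) pairs, end exclusive, of candidate internal exons.

    A candidate starts right after an 'AG' and ends right before a 'GT',
    and must overlap one of the conserved regions.
    """
    seq = seq.upper()
    candidates = set()
    for c_start, c_end in conserved:
        lo = max(0, c_start - flank)
        hi = min(len(seq), c_end + flank)
        # Exon start s is immediately preceded by the acceptor consensus AG.
        starts = [s for s in range(max(lo, 2), hi) if seq[s - 2:s] == "AG"]
        # Exon end e (exclusive) is immediately followed by the donor consensus GT.
        ends = [e for e in range(lo, hi + 1) if seq[e:e + 2] == "GT"]
        for s in starts:
            for e in ends:
                if e - s >= min_len and s < c_end and e > c_start:
                    candidates.add((s, e))
    return sorted(candidates)
```

A list produced this way would typically contain many overlapping alternatives for the same conserved region; choosing among them is the job of the subsequent gene-model step.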

Because intrinsic and comparative methods are based on different information sources, a natural way of improving gene prediction accuracy is to combine the predictive power of both approaches. Comparative sequence analysis could help to improve the selectivity of more traditional methods in finding candidate exons. Conversely, statistical models could help to discriminate between coding and non-coding conservation in syntenic regions (Korf et al., 2001; Korf et al., 2002). In this sense, AGenDA is a first step in the design of a tool that optimally combines these different kinds of information.

So far, our test results (Rinner and Morgenstern, 2002) have been promising and are comparable to those of GENSCAN, one of the best gene prediction programs available, when benchmarked on the same data set. However, as expected for two programs that are based on different types of input information, AGenDA can detect genes that GENSCAN cannot, and vice versa. In this sense, our approach should be a valuable addition to other, more conventional programs. Based on this first success, we are now exploring ways of combining intrinsic and genome-comparison-derived information to make the best use of both, and consequently to increase the sensitivity and specificity of the algorithm.
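For readers unfamiliar with how sensitivity and specificity are usually measured in gene prediction benchmarks, the following minimal sketch computes both at the exon level, using the convention common in this field (as in Burset and Guigo, 1996): sensitivity is the fraction of annotated exons predicted exactly, and specificity is the fraction of predicted exons that are exactly correct. The function name and the representation of exons as (start, end) pairs are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]


def exon_level_accuracy(
    predicted: Iterable[Interval], annotated: Iterable[Interval]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) counting only exact exon matches."""
    pred = set(predicted)
    true = set(annotated)
    tp = len(pred & true)                    # exons with both boundaries correct
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity
```

With four annotated exons and three predictions of which two match exactly, this yields a sensitivity of 0.5 and a specificity of 2/3.
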
[1] Ansari-Lari, M. A., Oeltjen, J. C., Schwartz, S., Zhang, Z., Muzny, D. M., Lu, J., Gorrell, J. H., Chinault, A. C., Belmont, J. W., Miller, W. and Gibbs, R. A. (1998). Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res. 8, 29-40.
[2] Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B. and Lander, E. S. (2000). Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res. 10, 950-958.
[3] Birney, E. and Durbin, R. (1997). Dynamite: A flexible code generating language for dynamic programming methods used in sequence comparison. In: Proceedings ISMB 5, pp. 56-64.
[4] Blayo, P., Rouze, P. and Sagot, M. F. (2002). Orphan gene finding - An exon assembly approach. Theoretical Computer Science, in press.
[5] Burge, C. and Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.
[6] Burset, M. and Guigo, R. (1996). Evaluation of gene structure prediction programs. Genomics 34, 353-367.
[7] Claverie, J.-M. (1997). Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735-1744.
[8] Gelfand, M. S., Mironov, A. A. and Pevzner, P. A. (1996). Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. U S A 93, 9061-9066.
[9] Gottgens, B., Barton, L. M., Gilbert, J. G. R., Bench, A. J., Sanchez, M. J., Bahn, S., Mistry, S., Grafham, D., McMurray, A., Vaudin, M., Amaya, E., Bentley, D. R. and Green, A. R. (2000). Analysis of vertebrate SCL loci identifies conserved enhancers. Nat. Biotechnol. 18, 181-186.
[10] Korf, I., Flicek, P., Duan, D. and Brent, M. R. (2001). Integrating genomic homology into gene structure prediction. Bioinformatics 17, S140-S148.
[11] Loots, G. G., Locksley, R. M., Blankespoor, C. M., Wang, Z. E., Miller, W., Rubin, E. M. and Frazer, K. A. (2000). Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288, 136-140.
[12] Brudno, M. and Morgenstern, B. (2002). Fast and sensitive alignment of large genomic sequences. Proceedings IEEE Computer Society Bioinformatics Conference, in press.
[13] Morgenstern, B. (1999). DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211-218.
[14] Morgenstern, B., Dress, A. and Werner, T. (1996). Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. U S A 93, 12098-12103.
[15] Rinner, O. and Morgenstern, B. (2002). AGenDA: Gene prediction by comparative sequence analysis. In Silico Biology 2, 0018.
[16] Salzberg, S. L. (1997). A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput. Appl. Biosci. 13, 365-376.
[17] Stormo, G. D. (2000). Gene-finding approaches for eukaryotes. Genome Res. 10, 394-397.