ECCB 2002 Poster

Title: CodeProbe - coding frame detection tool
P87
Kumanduri, Vasudev; Schweizer, Patrick

chari@ipk-gatersleben.de

CodeProbe is an algorithm for identifying the protein coding frame of low-quality DNA sequences, especially expressed sequence tags (ESTs). The raw EST sequences are processed to establish the coding probability of the different frame fractions of each EST, and the results are reported according to the user's requirements. A test set of 375 randomly chosen, annotated ESTs was defined as the basis for testing, with a stringent threshold (E-value of 10^-50 or lower).
Compared against a simple open reading frame (ORF) based algorithm, CodeProbe predictions achieve higher sensitivity and specificity, although the improvement is modest. The advantage of CodeProbe becomes more evident for poor-quality ESTs containing many frame shifts. This was achieved mainly by searching for typical peptide signatures and patterns in addition to analyzing the length of the fragment. More recently, promising results in the prediction of coding frames have been obtained by analyzing them against codon usage. Work is underway to combine these different analyses in the tool and to report a single computed result based on all of them. The results so far have been encouraging, with the tool reaching more than 90% performance and efficiency. With further enhancements, CodeProbe could become a useful tool for the discovery of pioneer peptide sequences that have no significant match in the protein database, and for in silico generation of tryptic digests of the corresponding unknown proteins. This will be especially useful for proteomics approaches in species where only EST sequence information is available.
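As an illustration of the in silico tryptic digest mentioned above, the following is a minimal sketch that is not part of CodeProbe. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P); the function name `tryptic_digest` is hypothetical.

```python
def tryptic_digest(protein):
    """Split a protein into peptides using a simple trypsin rule.

    Assumption: cleavage occurs after K or R, except when followed by P.
    Missed cleavages and other exceptions are not modelled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep the C-terminal remainder, if any.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("MAKRPLGKAA"))
```

For a predicted coding frame, the translated peptide would be passed to this function to obtain the expected tryptic fragments.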

Methodology
The coding probability of a given frame fraction, as calculated by CodeProbe, was initially based on the analysis of four different parameters. The tool translates each EST into all six reading frames. Each frame is then broken into frame fractions at the stop codons present in that frame. The fractions are then analysed for signatures, patterns and length. The algorithm analyses these parameters in a stepwise fashion, allocates a score to each of them, and finally computes the coding probability for that particular frame fraction on an incremental percentile basis.
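The first two steps described above (six-frame translation and splitting at stop codons) can be sketched as follows. This is a minimal illustration, not the CodeProbe implementation: it assumes the standard genetic code, all function names are hypothetical, and the signature, pattern and length scoring is not shown because the abstract does not specify it.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    # Codons with ambiguous bases (e.g. N) are translated as X.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate a DNA sequence into all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frame_fractions(protein):
    """Break a translated frame into fractions at stop codons (*)."""
    return [fragment for fragment in protein.split("*") if fragment]


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAAGGG").items():
        print(name, protein, frame_fractions(protein))
```

Each fraction returned by `frame_fractions` would then be the unit that receives a score in the later analysis steps.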

For the codon usage analysis, by contrast, the frames are not broken into fractions; instead, a graph is generated showing the frequency of codon usage. Based on this graph the frames can be broken down if required. We used the codon usage approach to differentiate the ESTs of Blumeria graminis from those of Hordeum vulgare in infected Hordeum vulgare epidermis tissue. The results were encouraging.
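One way to use codon usage to tell two organisms apart is a log-odds score between two codon frequency tables. The sketch below is an assumption about how such a comparison could be done, not the method used in CodeProbe; the function names, the pseudocount, and the toy training sequences are all hypothetical.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_counts(dna, offset=0):
    """Count codons in one reading frame, starting at the given offset."""
    dna = dna.upper()
    return Counter(dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))


def codon_frequencies(training_seqs, pseudocount=1.0):
    """Relative codon frequencies from in-frame training sequences.

    A pseudocount keeps unseen codons from getting zero probability.
    """
    counts = Counter()
    for seq in training_seqs:
        counts.update(codon_counts(seq))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def log_odds(dna, freq_a, freq_b, offset=0):
    """Positive score: codon usage looks more like A; negative: more like B."""
    score = 0.0
    for codon, n in codon_counts(dna, offset).items():
        if codon in freq_a and codon in freq_b:  # skip ambiguous codons
            score += n * math.log(freq_a[codon] / freq_b[codon])
    return score


if __name__ == "__main__":
    # Toy training data, for illustration only.
    freq_a = codon_frequencies(["GCCGCCGCC"])
    freq_b = codon_frequencies(["GCTGCTGCT"])
    print(log_odds("GCCGCC", freq_a, freq_b))
```

In practice the two frequency tables would be built from known coding sequences of each organism, and the score could be computed for each of the three offsets on both strands.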

The next step towards completing the tool is to combine the codon usage analysis with the analysis of the other parameters described above. Once completed, the tool can be used for coding region prediction fine-tuned to individual organisms, which should give higher specificity.