ECCB 2002 Poster

Title: Toucan: A Workbench for Regulatory Sequence Analysis
P3
Aerts, Stein; Staes, Mik; Coessens, Bert; Thijs, Gert; Moreau, Yves; De Moor, Bart

stein.aerts@esat.kuleuven.ac.be
Department of Electrical Engineering (ESAT-SCD), KULeuven, Belgium

The identification and interpretation of regulatory systems in eukaryotic genomes remains a major challenge. DNA microarrays and other functional genomics technologies often yield sets of coexpressed and possibly coregulated genes. Many researchers therefore reach the point of analyzing such gene sets to find common cis-regulatory elements in the non-coding regions (promoters, enhancers, silencers, 5'UTR, introns, 3'UTR). Several approaches are in use: sequence elements are located by scoring the sequences with position weight matrices (PWMs) of known transcription factor (TF) binding sites; new patterns are detected with Gibbs sampling techniques (Thijs et al., 2002); elements are clustered to find high local densities of TF binding sites (Berman et al., 2002); phylogenetic footprinting is applied to find regulatory regions that are conserved between species (Wasserman et al., 2000); and combinations of these approaches are also used (Loots et al., 2002). For most of these techniques, the necessary tools are publicly available. However, the research domain is extensive, and not every biologist has access to the bioinformatics expertise needed to integrate several tools and web applications, so biologists might benefit from a simple tool to perform their regulatory sequence analysis on prokaryotic and eukaryotic genomes. Furthermore, if such analyses are to be carried out on a large scale, the efficient retrieval of promoter sequences is essential. This task is now becoming more straightforward for organisms with fully sequenced genomes: by querying genomic databases like Ensembl for a gene and walking up- or downstream from it, the intergenic regions can be retrieved.
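The idea of walking upstream from a gene can be illustrated with a minimal Python sketch. It assumes the chromosome sequence is already available as a string and that gene coordinates are 1-based and inclusive on the forward strand. The function names and the fixed-length cut-off are hypothetical and are not taken from Toucan or from the Ensembl API.

```python
# Translation table for complementing DNA bases (case is preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chromosome, gene_start, gene_end, strand, length):
    """Return up to `length` bp located 5' of a gene.

    gene_start and gene_end are 1-based, inclusive, forward-strand
    coordinates. The result is oriented 5'->3' relative to the gene,
    so minus-strand genes are reverse-complemented. The region is
    clipped at the chromosome ends.
    """
    if strand == "+":
        begin = max(0, gene_start - 1 - length)
        return chromosome[begin:gene_start - 1]
    if strand == "-":
        end = min(len(chromosome), gene_end + length)
        return reverse_complement(chromosome[gene_end:end])
    raise ValueError("strand must be '+' or '-'")
```

This sketch cuts a fixed number of base pairs and does not stop at a neighbouring gene; retrieving a true intergenic region would additionally require the coordinates of the adjacent gene.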
We present here a Java application, built on top of the BioJava library, that allows the user to construct a gene set by importing sequences from local or online sources; to visualize, manipulate, cut, and export them; to annotate them with putative transcription factor binding sites using a web service; and to perform a statistical analysis to select over-represented sites.
A gene list can be constructed either from local sequence files or from identifiers. In the latter case, the complete sequence or only the 5' upstream regions, together with their annotation and external database identifiers, are retrieved automatically from either the Ensembl or the EMBL database. To help with the identification of promoter or enhancer regions, the prediction of CpG islands is included, and it is possible to import the results of external prediction or alignment tools. Eponine TSS (Down and Hubbard, 2002) is well suited for the prediction of transcription start sites, and its GFF output can be applied directly to the active gene set. Upstream sequences of orthologs can be aligned using tools like Bayesaligner (Wasserman et al., 2000), AVID/VISTA (Mayor et al., 2000), DNA Block Aligner (DBA) (Jareborg et al., 1999), or PipMaker (Schwartz et al., 2000). We provide an online GFF Toolbox to convert the alignment outputs of these programs to GFF. Once the sequences have been annotated with this GFF output, the similar parts in the upstream sequences of the orthologs can be selected to construct a new sequence set for the analysis of regulatory elements.
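The abstract does not state how CpG islands are predicted. As an illustration only, the following Python sketch applies the widely used Gardiner-Garden and Frommer style criteria (GC fraction of at least 0.5 and an observed/expected CpG ratio of at least 0.6 in a 200 bp window) and reports qualifying windows as tab-separated GFF-style lines. The thresholds, the function names, and the `cpg_sketch` source label are assumptions, not Toucan's implementation.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected ratio: (CpG count * N) / (C count * G count).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def cpg_windows_as_gff(seq_id, seq, window=200, step=50,
                       min_gc=0.5, min_ratio=0.6):
    """Slide a window over seq and emit one GFF-style line per hit."""
    lines = []
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc >= min_gc and ratio >= min_ratio:
            # GFF coordinates are 1-based and inclusive.
            fields = [seq_id, "cpg_sketch", "CpG_island",
                      str(start + 1), str(start + window),
                      f"{ratio:.2f}", ".", ".", "."]
            lines.append("\t".join(fields))
    return lines
```

Overlapping windows are reported separately here; a fuller version would merge them into single islands before writing the annotation.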
A resulting set of sequences can be annotated with IUPAC consensus sequences or with PWMs of known transcription factors. For the latter we use MotifScanner, a tool developed in-house (http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html). The implemented algorithm uses a prior probability of finding a positive hit and a background model. The background model helps to reduce the number of false positive, biologically non-functional predictions. The user can transparently use this tool as a web service from within Toucan, and the resulting GFF-formatted output can be applied to the currently active sequence set. The PWM databases and background models that are used by the service reside on the server and can be selected in the client. Because we use web services, we are able to add more services that operate on FASTA-formatted sequence files, and to link with bioinformatics service registries in the future. This will help to improve the interoperability among visualization tools, algorithms and data providers for gene regulation bioinformatics (Stein, 2002).
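To make the roles of the prior and the background model concrete, here is a simplified Python sketch of PWM scanning. It is not the MotifScanner algorithm: it assumes a zero-order (single-nucleotide) background, scans the forward strand only, and reports a window when the posterior odds favour the motif, that is, when the log-odds score exceeds log((1 - prior) / prior). All names and the toy matrix are hypothetical.

```python
import math


def consensus_pwm(consensus, match=0.85):
    """Build a toy PWM that favours `consensus` (no zero probabilities)."""
    other = (1 - match) / 3
    return [{b: (match if b == c else other) for b in "ACGT"}
            for c in consensus]


def scan_pwm(seq, pwm, background, prior):
    """Return (start, site, log-odds score) for windows favouring the motif.

    pwm is a list of dicts mapping base -> probability, one per position;
    background maps base -> probability; prior is the assumed probability
    that any given window is a true site. Starts are 0-based.
    """
    threshold = math.log((1 - prior) / prior)
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width].upper()
        if any(base not in background for base in site):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(math.log(col[base] / background[base])
                    for col, base in zip(pwm, site))
        if score > threshold:
            hits.append((start, site, score))
    return hits


UNIFORM_BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TOY_PWM = consensus_pwm("TATA")
```

For example, `scan_pwm("GGTATAGG", TOY_PWM, UNIFORM_BACKGROUND, prior=0.01)` reports only the exact TATA match at 0-based position 2. Lowering the prior raises the threshold, and a background that is rich in the motif's bases lowers the score of chance matches, which is how both parameters reduce false positives in this simplified setting.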
A binomial distribution model is used to assign each feature a p-value and a significance score, based on its occurrence in the sequence set relative to its expected frequency (Van Helden et al., 1998). The expected frequency of a feature can be approximated by calculating its actual frequency, expressed as occurrences per bp, in another sequence set or in a general (genome-wide) reference set (e.g. all promoters in the Eukaryotic Promoter Database). At our website we provide several expected-frequency files, calculated from the Eukaryotic Promoter Database and/or from the upstream regions of random gene subsets from complete genomes.
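The binomial test described above can be sketched in Python as follows. The sketch assumes that the number of trials is the total length of the sequence set in bp, that the success probability is the expected frequency per bp, and that the significance score is -log10 of the p-value multiplied by the number of features tested, a convention in the spirit of Van Helden et al. (1998). The exact definitions used by Toucan are not given in this abstract, so these choices and all function names are assumptions.

```python
import math


def expected_frequency(occurrences, total_bp):
    """Expected frequency of a feature, as occurrences per bp."""
    return occurrences / total_bp


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(1.0, total)


def significance(p_value, n_features):
    """Assumed score: -log10(p-value * number of features tested)."""
    return -math.log10(p_value * n_features)
```

A typical use would be to compute `p = expected_frequency(...)` from a reference set, then `binomial_tail(observed, total_bp_in_gene_set, p)` for each feature, and rank the features by their significance score.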
In summary, Toucan provides a simple and integrated environment for gene regulation bioinformatics. Starting only from gene identifiers, it is currently able to retrieve, visualize, annotate, and analyze promoter sequences of coregulated genes. By adding more web services in the future, it is expected to grow towards an interface to many sequence-based bioinformatics algorithms.
Availability: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html
[1] B. P. Berman, Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A, 99(2):757-762, Jan 2002.
[2] T.A. Down and T.J.P. Hubbard. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res, 12(3):458-461, Mar 2002.
[3] N. Jareborg, E. Birney, and R. Durbin. Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res, 9(9):815-824, Sep 1999.
[4] G.G. Loots, I. Ovcharenko, L. Pachter, I. Dubchak, and E.M. Rubin. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res, 12(5):832-839, May 2002.
[5] C. Mayor, M. Brudno, J. R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. S. Pachter, and I. Dubchak. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16(11):1046-1047, Nov 2000.
[6] S. Schwartz, Z. Zhang, K. A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. PipMaker - a web server for aligning two genomic DNA sequences. Genome Res, 10(4):577-586, Apr 2000.
[7] L. Stein. Creating a bioinformatics nation. Nature, 417(6885):119-120, May 2002.
[8] G. Thijs, K. Marchal, M. Lescot, S. R, B. De Moor, P. Rouze, and Y. Moreau. A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol, 9(2):447-464, 2002.
[9] J. Van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol, 281(5):827-842, Sep 1998.
[10] W. W. Wasserman, M. Palumbo, W. Thompson, J. W. Fickett, and C. E. Lawrence. Human-mouse genome comparisons to locate regulatory sites. Nat Genet, 26(2):225-228, Oct 2000.