ECCB 2002 Poster

Title: From putative promoter sequence to genomic context: biological data collection on the web using a generic application (Xprom).
P27
Devignes, Marie-Dominique (1); Norsa, Yvan (1); Smaïl, Malika (1); Collet, Philippe (2); Domenjoud, Lionel (2)

devignes@loria.fr
(1) LORIA, CNRS-INRIA-University Henri Poincaré
(2) UPRES 3446, University Henri Poincaré, Nancy, FRANCE.

Retrieving and integrating information from biological sources scattered across the web is a crucial problem today. Answering a biological question often requires querying several databases. Cross-referencing between databases is increasingly common and allows efficient browsing. However, this type of search is time-consuming and sometimes leads to disorientation and/or cognitive overload. Conflicting results are often retrieved. Because source contents change with each source update, monitoring or alert systems are needed. We have addressed these problems by proposing models for querying web sources and for describing information retrieval scenarios for a precise biological question. An automatic application that collects mapping data on genes of interest, as well as information about co-localized orphan pathologies, has already been tested [1, 2]. We present here the extension of our proposals to another biological question and the design of a generic application for collecting data on the web.

Peroxisome proliferators (PP) are various natural and chemical compounds that exert pleiotropic effects on the cell. Occupational exposure to them may induce cancer. Their effects on lipid metabolism also suggest that they could play a role in the mechanisms of multifactorial chronic diseases such as obesity and osteoarthritis. Identifying the genes that are controlled by PP is one approach to understanding these complex diseases. Sequence elements (PPRE: PP Response Elements) and transcription factors (PPAR: PP-activated receptors) are known. Experiments were therefore carried out to isolate short human DNA sequences (about 200 bp) that bind to PPAR. About one hundred sequences have been produced. The availability of the complete human genome sequence makes it possible to place these sequences of interest in their genomic context. The presence and nature of genes located in the vicinity can thus be checked by exploring the surrounding genome sequence. Such genes become candidates for possible control by the isolated sequences. A literature survey and exploration of gene network databases may then provide arguments in favor of such hypotheses, and experiments can be designed to verify them.

Automatic analysis of the isolated sequences has proved necessary for at least two reasons: (i) their number, and (ii) the need to update the data regularly. New assemblies of the human genome sequence are produced nearly every three months. A multi-step scenario (fig. 1: scenarioXpromPPRE) has been designed that starts with raw sequence data (about 200 bp) and ends with a schematic picture of its genomic context (as provided by the Genome Browser at UCSC; http://genome.ucsc.edu/cgi-bin/hgGateway). Four steps have been defined: sequence preparation, sequence annotation, genomic contig identification, and genomic contig annotation. Each step may be composed of one or more substeps. A substep is defined either as the execution of a local treatment (T) or as the querying of distant web resources (R). For example, the step called "SEQUENCE ANNOTATION" involves two substeps: one (T2) is a search for PPRE subsequences using a locally developed method; the other (R1) is a search for repeated sequences using the RepeatMasker program at Washington University. Additional substeps, such as a search for response elements other than PPRE or for the presence of an ORF, could be included in this step. In the scenario, output data may serve as input data for subsequent substeps. All output data are stored in an XML structured document. Result files of each treatment and/or query are also stored.
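
The step and substep organisation described above can be pictured with a small data model. The sketch below is purely illustrative: the class names, the data labels (prepared_sequence, ppre_hits, repeats) and the helper function are assumptions and are not taken from Xprom. Only the four step names and the substeps T2 and R1 come from the text; the substeps of the other steps are not described here and are left empty.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Substep:
    name: str                      # identifier used in the scenario, e.g. "T2"
    kind: str                      # "T" = local treatment, "R" = distant web resource
    description: str
    inputs: List[str] = field(default_factory=list)   # hypothetical data labels
    outputs: List[str] = field(default_factory=list)  # hypothetical data labels


@dataclass
class Step:
    name: str
    substeps: List[Substep] = field(default_factory=list)


# Only the SEQUENCE ANNOTATION substeps are named in the text;
# the other steps are left without substeps on purpose.
scenario = [
    Step("SEQUENCE PREPARATION"),
    Step("SEQUENCE ANNOTATION", [
        Substep("T2", "T", "search for PPRE subsequences (local method)",
                inputs=["prepared_sequence"], outputs=["ppre_hits"]),
        Substep("R1", "R", "search for repeated sequences (RepeatMasker)",
                inputs=["prepared_sequence"], outputs=["repeats"]),
    ]),
    Step("GENOMIC CONTIG IDENTIFICATION"),
    Step("GENOMIC CONTIG ANNOTATION"),
]


def execution_order(steps: List[Step]) -> List[Tuple[str, str]]:
    """Flatten the scenario into the sequential list of (step, substep) names."""
    return [(step.name, sub.name) for step in steps for sub in step.substeps]
```

Calling execution_order(scenario) yields the substeps in the order in which they would be run, which is the kind of sequential list an execution engine needs before it can pass outputs of one substep to the next.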

The great diversity of biological web resources and the different strategies used by biologists to answer a given query make it difficult to predict all possible scenarios at once. A generic application (Xprom) has thus been designed and developed to allow any possible scenario to be implemented. Xprom is based on a generic scenario model (XML DTD) that describes the following characteristics for each substep of the scenario:
- substep name
- input data
- url of distant source or address of program
- syntax for formulating the query
- parameters for the query or for calling the program
- output data filtered from result documents using regular expressions.
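
To make the list above concrete, here is a minimal sketch of what a scenario description for one substep might look like, and how it could be read. The element and attribute names, the placeholder URL (example.org), the parameter and the regular expression are all hypothetical; the actual Xprom DTD is not reproduced in this abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical scenario document: every element name, the URL and the regex
# are illustrative assumptions, not the real Xprom DTD.
SCENARIO_XML = r"""
<scenario name="scenarioXpromPPRE">
  <step name="SEQUENCE ANNOTATION">
    <substep name="R1" type="R">
      <input>prepared_sequence</input>
      <source>http://repeatmasker.example.org/cgi-bin/query</source>
      <syntax>{source}?sequence={sequence}&amp;species={species}</syntax>
      <param name="species">human</param>
      <output name="repeat_count" regex="(\d+) repeats found"/>
    </substep>
  </step>
</scenario>
"""


def load_substeps(xml_text):
    """Return one dictionary per substep, holding the six listed characteristics."""
    root = ET.fromstring(xml_text.strip())
    substeps = []
    for step in root.findall("step"):
        for sub in step.findall("substep"):
            substeps.append({
                "step": step.get("name"),
                "name": sub.get("name"),
                "type": sub.get("type"),
                "inputs": [e.text for e in sub.findall("input")],
                "source": sub.findtext("source"),
                "syntax": sub.findtext("syntax"),
                "params": {p.get("name"): p.text for p in sub.findall("param")},
                "outputs": {o.get("name"): o.get("regex")
                            for o in sub.findall("output")},
            })
    return substeps
```

Keeping the query syntax and the output filters in the description, rather than in program code, is what would let a new web source be added by editing the scenario document alone.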

The Xprom application is composed of two parts. The configuration module provides a graphical user interface for entering the specified scenario, from which an XML document is constructed. The analysis module then turns this XML file into an application that establishes the sequential list of steps and substeps of the scenario and executes each substep according to the following scheme (for a distant resource query):
- query formulation
- query submission
- analysis of returned document
- filtering of data
- saving into an XML document
- iteration on multiple input data or going to next substep.
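
The execution scheme above can be sketched as a single loop. This is not the Xprom code: the dictionary layout, the function names, the placeholder URL and the regular expression are assumptions made for illustration, and the query submission is delegated to a caller-supplied fetch function so that the sketch runs without network access.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def run_substep(substep, input_values, fetch):
    """Run one distant-resource substep over several input values.

    `fetch` is any callable that takes a URL and returns the result
    document as text (an assumption of this sketch).
    """
    session = ET.Element("substep", name=substep["name"])
    for value in input_values:                     # iteration on multiple input data
        # query formulation
        query = dict(substep["params"], **{substep["input_param"]: value})
        url = substep["source"] + "?" + urlencode(query)
        # query submission
        document = fetch(url)
        # analysis of returned document and filtering of data
        run = ET.SubElement(session, "run", input=value)
        for name, pattern in substep["outputs"].items():
            match = re.search(pattern, document)
            out = ET.SubElement(run, "output", name=name)
            out.text = match.group(1) if match else ""
    # saving into an XML document (returned as text here)
    return ET.tostring(session, encoding="unicode")


# Hypothetical substep description and a stand-in for the web source.
substep = {
    "name": "R1",
    "source": "http://repeatmasker.example.org/cgi-bin/query",  # placeholder URL
    "input_param": "sequence",
    "params": {"species": "human"},
    "outputs": {"repeat_count": r"(\d+) repeats found"},
}


def fake_fetch(url):
    """Pretend web source that always returns the same result page."""
    return "<html>3 repeats found</html>"


session_xml = run_substep(substep, ["ACGT", "GGCC"], fake_fetch)
```

In a real run, fake_fetch would be replaced by a function that actually submits the query, and the returned XML fragment would be appended to the session document before moving on to the next substep.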

A second generic DTD has been defined to structure the session document. It reflects the execution of the scenario and records the characteristics and retrieved outputs for each substep. Conversion of the XML session document into another format specified according to some biologically oriented DTD or schema is then possible through XSL transformation.

It appears from our first analyses that, in addition to automating the analysis of large numbers of sequences, the application may prove very useful for comparing various sources and/or scenarios for answering specific biological questions.

[1] Devignes M-D, Schaaff A and Smaïl M (1999) Querying heterogenous databases: a user-oriented system for collecting and structuring genome information. Poster, ISMB'99 Heidelberg, 6-10 August 1999.
[2] Devignes M-D, Schaaf A and Smaïl M (2002) Collecte et intégration de données biologiques hétérogènes sur le web - Xmap : application dans le domaine de la cartographie du génome humain. Ingénierie des Systèmes d'Information (in press).