ECCB 2002 Poster

Title: Seqan: a modular sequence analysis system
P25
De Rijk, Peter; Glassée, Wim; Weckx, Stefan; Del-Favero, Jurgen; Van Broeckhoven, Christine

derijkp@uia.ua.ac.be
Department of Molecular Genetics (VIB), University of Antwerp (UIA)

We needed a local, automated system for analyzing the data that is becoming available through public and local high-throughput sequencing. Seqan was developed to address this need. It is a modular sequence analysis system in which the modules run in parallel on our Linux cluster.

The system is built on a relational database that is accessed through an object-relational mapping layer written in the scripting language Tcl. The database has basic classes representing sequences (seq) and features (ft) linked to a sequence. Sequence features that result from different analyses can have different types of associated data, so they are stored as subclasses of the class feature. Although the sequences themselves could be stored as BLOBs under transaction control in the database, they are currently kept in separate files for performance reasons. Sequences can be imported from different sources in different formats. For example, public sequences and their features can be imported in EMBL or GenBank format, including all reference and cross-reference information. Private sequences can be imported in FASTA format.
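The storage layout described above can be sketched as follows. This is a minimal illustration in Python with SQLite, not the Tcl mapping layer used by Seqan. Only the class names seq and ft come from the text; the ft_repeat subclass table, all column names, and the helper functions are assumptions made for the example.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical schema: one table per class, with a subclass table that
# shares the primary key of its parent row in ft.
SCHEMA = """
CREATE TABLE seq (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    file_path TEXT NOT NULL          -- residues live in a file, not in a BLOB
);
CREATE TABLE ft (
    id        INTEGER PRIMARY KEY,
    seq_id    INTEGER NOT NULL REFERENCES seq(id),
    type      TEXT NOT NULL,         -- names the subclass table with extra data
    start_pos INTEGER NOT NULL,
    end_pos   INTEGER NOT NULL,
    strand    TEXT NOT NULL CHECK (strand IN ('+', '-'))
);
CREATE TABLE ft_repeat (             -- assumed example of a feature subclass
    ft_id        INTEGER PRIMARY KEY REFERENCES ft(id),
    repeat_class TEXT
);
"""


def read_fasta(text):
    """Yield (name, residues) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def import_fasta(db, text, seq_dir):
    """Register each sequence in the database and write its residues to a file."""
    ids = []
    for name, residues in read_fasta(text):
        cur = db.execute("INSERT INTO seq (name, file_path) VALUES (?, '')", (name,))
        seq_id = cur.lastrowid
        path = Path(seq_dir) / f"seq{seq_id}.txt"
        path.write_text(residues)
        db.execute("UPDATE seq SET file_path = ? WHERE id = ?", (str(path), seq_id))
        ids.append(seq_id)
    return ids


def add_repeat(db, seq_id, start, end, strand, repeat_class):
    """Store a repeat feature: a generic ft row plus its subclass-specific row."""
    cur = db.execute(
        "INSERT INTO ft (seq_id, type, start_pos, end_pos, strand) "
        "VALUES (?, 'repeat', ?, ?, ?)",
        (seq_id, start, end, strand),
    )
    db.execute(
        "INSERT INTO ft_repeat (ft_id, repeat_class) VALUES (?, ?)",
        (cur.lastrowid, repeat_class),
    )
    return cur.lastrowid
```

Keeping the common fields in ft and the analysis-specific fields in a separate table is one common way to map subclasses onto a relational database; the poster does not say which mapping strategy Seqan actually uses.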

Different analyses are controlled by scripts called run-adaptors: they extract the information needed for an analysis from the database, convert it into the right format, and start the analysis. The results are imported by an import-adaptor, which parses the result files and puts the data in the appropriate places in the database. Adaptors have currently been developed for a number of analyses that are available for local execution. These include general sequence analysis (e.g. CpG island detection), repeat detection (RepeatMasker, sputnik), database searching (BLAST, SNPs) and gene prediction (genscan, geneid). Each run-adaptor can be executed on a different node of the cluster, and some adaptors can be parallelized even further (e.g. BLAST over different databases).
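The division of labor between a run-adaptor and an import-adaptor can be sketched as follows, again in Python rather than the Tcl used by Seqan. Everything here is an assumption made for illustration: the in-memory store stands in for the database, a thread pool stands in for the cluster nodes, and toy_cpg_program is a simplified stand-in for an external analysis program (it flags windows with GC content above 50% and a CpG observed/expected ratio above 0.6), not the CpG island detector used on the poster.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_cpg_program(fasta_text, window=200, step=100):
    """Stand-in for an external tool: FASTA text in, tab-separated hits out."""
    header, *lines = fasta_text.splitlines()
    name = header[1:]
    residues = "".join(lines).upper()
    out = []
    for start in range(0, max(len(residues) - window, 0) + 1, step):
        chunk = residues[start:start + window]
        c, g = chunk.count("C"), chunk.count("G")
        if c == 0 or g == 0:
            continue
        gc = (c + g) / len(chunk)
        obs_exp = chunk.count("CG") * len(chunk) / (c * g)
        if gc > 0.5 and obs_exp > 0.6:
            # Report 1-based, inclusive coordinates.
            out.append(f"{name}\t{start + 1}\t{start + len(chunk)}\t{round(obs_exp, 2)}")
    return "\n".join(out)


def run_adaptor(store, seq_id):
    """Extract one sequence, convert it to the tool's input format, start the analysis."""
    fasta = f">{seq_id}\n{store['seq'][seq_id]}\n"
    return toy_cpg_program(fasta)


def import_adaptor(store, result_text):
    """Parse the result text and file each hit as a feature."""
    for line in result_text.splitlines():
        seq_id, start, end, score = line.split("\t")
        store["ft"].append({
            "seq": seq_id, "type": "cpg",
            "start": int(start), "end": int(end), "score": float(score),
        })


def run_all(store, workers=4):
    """Run one run-adaptor per sequence concurrently, then import results serially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda sid: run_adaptor(store, sid), sorted(store["seq"])))
    for text in results:
        import_adaptor(store, text)
    return store["ft"]
```

Importing in a single thread after all runs have finished keeps the writes to the shared store simple; a real deployment would instead rely on the database's own transaction handling.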

A system for visualizing the results has also been developed. The X-axis represents the position in the sequence. Features are drawn as labeled blocks on a number of horizontal lines, placed over the line for the forward direction or under it for the reverse direction. The features shown on each line are determined by a database query. Using a script, the color, thickness and label of a block can be derived from the data stored about the feature in the database, e.g. a darker gray block for better-scoring predictions. All data about a feature can be requested with a mouse click. Feature data can also be queried directly in the database or exported to tables for further analysis.
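The mapping from feature data to block appearance can be sketched as follows. This is an illustrative Python example, not Seqan's display script: the function names, the dictionary keys, the pixel width, and the linear score-to-gray scale are all assumptions.

```python
def gray_for_score(score, lo, hi):
    """Map a score to a hex gray value; higher scores give darker blocks."""
    if hi <= lo:
        frac = 1.0
    else:
        # Clamp the normalized score to the range 0..1.
        frac = min(max((score - lo) / (hi - lo), 0.0), 1.0)
    level = round(220 * (1.0 - frac))  # 220 is light gray, 0 is black
    return f"#{level:02x}{level:02x}{level:02x}"


def layout_blocks(features, seq_length, width_px=800, lo=0.0, hi=1.0):
    """Turn features into drawable blocks along an X-axis of sequence positions."""
    scale = width_px / seq_length
    blocks = []
    for ft in features:
        blocks.append({
            "x0": (ft["start"] - 1) * scale,   # 1-based start to pixel offset
            "x1": ft["end"] * scale,
            "side": "over" if ft["strand"] == "+" else "under",
            "fill": gray_for_score(ft["score"], lo, hi),
            "label": ft["label"],
        })
    return blocks
```

Each returned block carries enough information to be drawn as a rectangle by any canvas toolkit; the choice of toolkit is left open here.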