ECCB 2002 Poster

Title: SWISS-PROT Format Developments
P119
Phan, Isabelle; Redaschi, Nicole; Roland, Pascal; Runte, Kai; Jain, Eric; Bairoch, Amos

iphan@isb-sib.ch, redaschi@isb-sib.ch
Swiss Institute of Bioinformatics

SWISS-PROT [1] is a curated, non-redundant protein sequence database that provides high-quality annotation and is integrated with a large number of other biological databases. It is supplemented by TrEMBL [1], a computer-annotated database that contains translations of all coding sequences in the EMBL Nucleotide Sequence Database [2] that are not yet in SWISS-PROT.

Both data sets are currently maintained and distributed as flat files, in a format that is described in the SWISS-PROT User Manual (http://www.expasy.org/sprot/userman.html) and referred to here as "the flat file format".
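
As a rough illustration of the flat file format: each line of an entry starts with a two-character line code followed by the data, and an entry is terminated by a "//" line. The Python sketch below groups the lines of one entry by line code. The sample entry is abbreviated and made up (the accession number P00000 and the entry name are placeholders), and the function name group_by_line_code is hypothetical.

```python
from collections import defaultdict

# Abbreviated, made-up entry for illustration only; real entries carry
# many more line types, and P00000 is a placeholder accession number.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN  STANDARD;      PRT;   120 AA.
AC   P00000;
DE   Example protein (illustrative only).
OS   Homo sapiens (Human).
//
"""


def group_by_line_code(entry_text):
    """Group the data part of each line under its two-character line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # entry terminator
        if len(line) < 2 or line.startswith(" "):
            continue  # blank or sequence data line; skipped in this sketch
        # Columns 1-2 hold the line code, the data starts at column 6.
        code, data = line[:2], line[5:]
        grouped[code].append(data)
    return dict(grouped)


fields = group_by_line_code(SAMPLE_ENTRY)
print(fields["AC"])
```

Splitting by line code is the easy part; the content within a line (for example the individual fields of the ID line) still has to be picked apart by further, line-specific rules.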

An inherent problem of flat file data banks is that their maintenance becomes increasingly difficult as they grow in size and as more people become involved in the production of the data. We are therefore in the process of porting the production of SWISS-PROT and TrEMBL to a Relational Database Management System.

Another problem is the SWISS-PROT and TrEMBL file format itself: although this historic format was developed with both human and machine readability in mind, its syntax is often too loosely defined to allow easy fine-grained parsing. Since many existing applications rely on the flat file format, format changes are made very rarely, and adding new types of data is difficult. To overcome these shortcomings, a new file format based on the Extensible Markup Language (XML) is being developed: the SWISS-PROT Markup Language (SP-ML).
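
To make the parsing argument concrete, the Python sketch below extracts the same organism names once from explicit markup and once from a flat file OS line. The XML element and attribute names are invented for illustration and are not taken from SP-ML, which this abstract does not specify; the accession number and both function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element and attribute names are invented for
# illustration and are NOT taken from the SP-ML specification.
SAMPLE_XML = """\
<entry>
  <accession>P00000</accession>
  <organism>
    <name type="scientific">Homo sapiens</name>
    <name type="common">Human</name>
  </organism>
</entry>
"""

# The same organism information as it might appear on a flat file OS line.
FLAT_OS_LINE = "OS   Homo sapiens (Human)."


def organism_names_from_xml(xml_text):
    """Each name is its own element, so no text splitting is needed."""
    root = ET.fromstring(xml_text)
    return {n.get("type"): n.text for n in root.findall("./organism/name")}


def organism_names_from_flat(line):
    """Relies on the convention 'Scientific name (Common name).'.

    This simple split breaks as soon as a name itself contains
    parentheses or the line carries additional synonyms.
    """
    data = line[5:].rstrip(".")
    scientific, _, rest = data.partition(" (")
    return {"scientific": scientific, "common": rest.rstrip(")")}


print(organism_names_from_xml(SAMPLE_XML))
print(organism_names_from_flat(FLAT_OS_LINE))
```

Both functions return the same result for this simple case, but only the flat file version depends on punctuation conventions inside free text.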

To develop a good representation of the data in both XML and a relational schema, we first designed a conceptual data model that describes the structure and constraints present in the data. We chose the Unified Modeling Language (UML) notation for this project, since it is a widely accepted standard for object modeling. This model forms the basis for the design of both the XML schema and the relational schema.

[1] Bairoch A., Apweiler R.; "The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000.", Nucl. Acids Res. 28:45-48(2000).
[2] Stoesser G., Baker W., van den Broek A., Camon E., Garcia-Pastor M., Kanz C., Kulikova T., Leinonen R., Lin Q., Lombard V., Lopez R., Redaschi N., Stoehr P., Tuli MA., Tzouvara K., Vaughan R.; "The EMBL Nucleotide Sequence Database", Nucl. Acids Res. 30(1):21-26(2002).