ECCB 2002 Poster

Title: TrEMBL protein sequence database: a guide to the proteomic world
P178
Williams, Allyson; Martin, Maria Jesus; O'Donovan, Claire; Barrell, Daniel; Fedotov, Alexander; Apweiler, Rolf

allyson@ebi.ac.uk, martin@ebi.ac.uk
EMBL - EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

TrEMBL (http://www.ebi.ac.uk/trembl) is a computer-annotated Protein Sequence Database supplementing the SWISS-PROT Protein Knowledgebase. It was created in 1996 to cope with the increasing rate of sequence generation from genome sequencing projects, allowing these sequences to be made publicly available as quickly as possible without diluting the high-quality annotation found in SWISS-PROT. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ Nucleotide Sequence Database, as well as protein sequences extracted from the literature or submitted but not yet integrated into SWISS-PROT. The exponential growth of TrEMBL and its population by ever greater numbers of raw sequences present a considerable challenge, which we have addressed in different ways. Redundancy checks merge identical sequences in the database, and an automatic annotation system attaches functional information to predicted sequences. This system combines a manually curated database of rules with automatically generated decision trees and is regularly applied to TrEMBL.
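
The idea behind merging identical sequences can be illustrated with a minimal Python sketch. It is not the TrEMBL production procedure: the function name, the (accession, sequence) input layout and the accession strings are all hypothetical, and a real redundancy check may well apply further criteria (for example the source organism) that are not modelled here.

```python
from collections import OrderedDict


def merge_identical(entries):
    """Group entries that share exactly the same sequence.

    `entries` is an iterable of (accession, sequence) pairs. This layout is
    an assumption for illustration, not TrEMBL's internal schema.
    Returns a list of (accessions, sequence) tuples in first-seen order.
    """
    merged = OrderedDict()
    for accession, sequence in entries:
        key = sequence.upper()  # ignore letter case when comparing sequences
        merged.setdefault(key, []).append(accession)
    return [(accessions, seq) for seq, accessions in merged.items()]


# Made-up accessions and sequences, for illustration only.
entries = [
    ("ACC0001", "MKTAYIAKQR"),
    ("ACC0002", "MSDNELLQ"),
    ("ACC0003", "mktayiakqr"),
]
merged = merge_identical(entries)
```

Here the first and third entries collapse into one record that carries both accessions, while the second entry is left on its own.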

TrEMBL also provides a high level of integration with other databases. There are many specialised databases providing relevant data for a particular protein. By cross-referencing to these databases we provide a well-connected entry point for all knowledge about a particular sequence. InterPro (http://www.ebi.ac.uk/interpro), an integrated documentation resource for protein families, domains and functional sites, is used to link TrEMBL entries to the pattern and cluster sequence databases PROSITE, PRINTS, Pfam, ProDom, TIGRFAM and SMART. TrEMBL also maintains regularly updated links to a variety of other databases, including MGD, HSSP and FlyBase. SWISS-PROT and TrEMBL are used in turn to build other resources, including the Proteome Analysis Database, created to analyse and classify all proteins encoded by complete genomes, and the International Protein Index (IPI), which together with Ensembl and RefSeq provides a cross-referenced human dataset with high coverage and low redundancy. TrEMBL places special emphasis on improving the annotation of sequences from genome projects, such as the human, Drosophila and microbial genomes.
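
As an illustration of how such cross-references might be read by a program, the Python sketch below collects them from a flat-file entry. It assumes the SWISS-PROT style layout in which each cross-reference sits on a line beginning with the code DR, with fields separated by semicolons and a closing full stop; the function name is hypothetical and the entry text and identifiers are invented placeholders, not real database records.

```python
def parse_cross_references(entry_text):
    """Collect database cross-references from one flat-file entry.

    Assumes lines of the form 'DR   Database; Identifier; ...' and returns
    a dict mapping each database name to its list of identifiers.
    """
    xrefs = {}
    for line in entry_text.splitlines():
        if not line.startswith("DR "):
            continue  # only DR lines carry cross-references
        body = line[2:].strip().rstrip(".")
        fields = [field.strip() for field in body.split(";")]
        if len(fields) < 2:
            continue  # skip malformed lines that lack an identifier
        database, identifier = fields[0], fields[1]
        xrefs.setdefault(database, []).append(identifier)
    return xrefs


# Invented entry fragment; the identifiers are placeholders.
example_entry = "\n".join([
    "ID   EXAMPLE_ENTRY",
    "DR   InterPro; IPR000000; Example_domain.",
    "DR   Pfam; PF00000; Example; 1.",
    "DR   InterPro; IPR999999; Another_example.",
    "SQ   SEQUENCE",
])
xrefs = parse_cross_references(example_entry)
```

Grouping by database name makes it straightforward to follow, for example, every InterPro link of an entry in a single step.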

The diverse sources of information in TrEMBL entries are flagged by evidence tags, allowing users to see where data items came from and enabling SWISS-PROT staff to automatically update data if the underlying evidence changes. These evidence tags will be made public with the first XML version of TrEMBL. The XML version (SP-ML) for the entirety of SPTR should be available in the near future, with the first draft release available now at http://www.ebi.ac.uk/swissprot/SP-ML/.

Weekly updates of TrEMBL are distributed as part of the complete non-redundant protein sequence collection SPTR, which is available via FTP (ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/) and the Sequence Retrieval System (SRS) (http://srs.ebi.ac.uk/). The EBI also offers a range of services (http://www.ebi.ac.uk/Tools/) for running Smith-Waterman, FASTA and BLAST sequence similarity searches. An ORACLE version of the database is provided upon request.