ECCB 2002 Poster

Title: The Proteome Analysis Database in 2002
P125
Pruess, Manuela; Kanapin, Alexander; Karavidopoulou, Youla; Kersey, Paul; Kriventseva, Evgenia; Mittard, Virginie; Mulder, Nicola; Phan, Isabelle; Apweiler, Rolf

mpr@ebi.ac.uk
EMBL Outstation - The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom

The EBI Proteome Analysis database (http://www.ebi.ac.uk/proteome/) [1], developed in 2000, provides a tool for the in silico analysis of proteins and of whole proteomes. Tools like this have become increasingly important because the various sequencing projects are producing a growing volume of raw sequence data, and the field of proteomics is expanding rapidly.
The Proteome Analysis database has been set up to provide comprehensive statistical analyses of the predicted proteomes of fully sequenced organisms. The analysis is compiled using the InterPro database [2] of protein families, domains and functional sites; the CluSTr database [3], which offers an automatic classification of proteins into groups of related proteins; and, as a recent addition, GO Slim, part of the Gene Ontology (GO) project [4], which describes genes and gene products according to molecular function, biological process and cellular component. The analysis is performed on non-redundant complete proteome sets of SWISS-PROT and TrEMBL [5] entries, spanning archaea, bacteria and eukaryotes.

The statistical analysis for each proteome comprises Top 30 and Top 200 hits (which list the 30 and 200 InterPro entries, respectively, with the highest number of protein matches in the reference proteome), the 15 most common families (the 15 InterPro entries of type 'family' with the largest number of protein matches, shown with their match counts), the 15 most common domains, the 15 most common repeats, the Top 30 proteins with the highest occurrence of different InterPro hits, a list of singletons (proteins for which no related sequences were found in the proteome at the lowest protein similarity level studied, Z-score = 10), the 30 biggest clusters (listed with the number of proteins in each cluster and the InterPro-based functional classification of those proteins), clusters without InterPro links, and clusters without HSSP links. Moreover, precomputed comparisons with appropriately selected proteomes are provided.
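The ranking statistics described above can be illustrated with a minimal Python sketch. It is not the database's actual implementation: the input format (a list of protein/InterPro-entry pairs), the function names and the identifiers `P1`, `IPR_A`, etc. are all hypothetical placeholders chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input: (protein accession, InterPro entry) match pairs.
# The identifiers are placeholders, not real accessions.
matches = [
    ("P1", "IPR_A"),
    ("P1", "IPR_B"),
    ("P2", "IPR_A"),
    ("P3", "IPR_A"),
    ("P3", "IPR_C"),
    ("P3", "IPR_B"),
]


def top_entries(matches, n):
    """Rank InterPro entries by the number of distinct proteins they match."""
    proteins_per_entry = defaultdict(set)
    for protein, entry in matches:
        proteins_per_entry[entry].add(protein)
    counts = Counter({e: len(p) for e, p in proteins_per_entry.items()})
    return counts.most_common(n)


def top_proteins(matches, n):
    """Rank proteins by the number of different InterPro entries matching them."""
    entries_per_protein = defaultdict(set)
    for protein, entry in matches:
        entries_per_protein[protein].add(entry)
    counts = Counter({p: len(e) for p, e in entries_per_protein.items()})
    return counts.most_common(n)


print(top_entries(matches, 2))
print(top_proteins(matches, 1))
```

With `n` set to 30 or 200, `top_entries` would correspond to the Top 30 and Top 200 hit lists; restricting the input to entries of type 'family', 'domain' or 'repeat' before ranking would give the three "15 most common" lists.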
Structural information is also presented, including primary, secondary and tertiary structure information, as well as protein length distribution, for each proteome. The number of proteins with secondary structure homology to known structures is estimated using the well-established Homology-derived Secondary Structure of Proteins (HSSP) [6] method, which relies on the primary sequence alignment of proteins to experimentally determined structures in the Protein Data Bank (PDB) [7]. Information about protein length distribution and amino acid composition is represented graphically.
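As a rough illustration of the two graphically represented statistics, the sketch below computes a binned protein length distribution and an overall amino acid composition for a toy proteome. The sequences, the bin size and the function names are assumptions made for this example only; the database's own procedure is not described here.

```python
from collections import Counter

# Hypothetical toy proteome: accession -> amino acid sequence.
proteome = {
    "P1": "MKVLA",
    "P2": "MAAGK",
    "P3": "MKTAYIAKQRQ",
}


def length_histogram(proteome, bin_size):
    """Count proteins per length bin; each key is the lower bound of a bin."""
    hist = Counter()
    for seq in proteome.values():
        hist[(len(seq) // bin_size) * bin_size] += 1
    return dict(sorted(hist.items()))


def amino_acid_composition(proteome):
    """Fraction of each residue over all sequences in the proteome."""
    counts = Counter()
    for seq in proteome.values():
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}


print(length_histogram(proteome, 10))
print(amino_acid_composition(proteome))
```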
As of August 2002, the Proteome Analysis database contains proteome sets for 89 organisms (8 eukaryotes, 12 archaea, and 69 bacteria); complete proteome analysis is available for 83 of these.

In addition, the Proteome Analysis project enables users to perform their own interactive proteome comparisons between any combination of organisms in the database, to run a FASTA similarity search against a complete proteome, and to download a proteome set or a list of InterPro matches for a given organism. Furthermore, the International Protein Index (IPI) provides a top-level guide to the main databases that describe the human and mouse proteomes.
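One simple way to think about a proteome comparison is as set operations over the InterPro entries found in each organism. The sketch below is only an assumption about how such a comparison could be expressed, not the method used by the database; the organism labels, entry identifiers and the `compare_proteomes` function are hypothetical.

```python
def compare_proteomes(entries_by_organism):
    """Return the entries shared by all organisms and those unique to each.

    entries_by_organism maps an organism label to the set of InterPro
    entries with at least one match in that organism's proteome.
    """
    if not entries_by_organism:
        return set(), {}
    shared = set.intersection(*entries_by_organism.values())
    unique = {}
    for organism, entries in entries_by_organism.items():
        # Union of the entries seen in every other organism.
        others = set().union(
            *(e for o, e in entries_by_organism.items() if o != organism)
        )
        unique[organism] = entries - others
    return shared, unique


# Hypothetical placeholder data for three organisms.
example = {
    "org_a": {"IPR_A", "IPR_B", "IPR_C"},
    "org_b": {"IPR_A", "IPR_B"},
    "org_c": {"IPR_A", "IPR_D"},
}
shared, unique = compare_proteomes(example)
print(shared)
print(unique)
```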
[1] Apweiler R. et al. (2001). Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res. 29, 44-48.
[2] Apweiler R. et al. (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37-40.
[3] Kriventseva E.V. et al. (2001). CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res. 29, 33-36.
[4] Ashburner M. et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29.
[5] Bairoch A. and Apweiler R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45-48.
[6] Holm L. and Sander C. (1999). Protein folds and families: sequence and structure alignments. Nucleic Acids Res. 27, 244-247.
[7] Berman H.M. et al. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235-242.