ECCB 2002 Poster

Title: HAMAP: High Quality Automated Annotation of Microbial Proteomes
P44
Gattiker, Alexandre; Coudert, Elisabeth; Michoud, Karine; Rivoire, Catherine; Auchincloss, Andrea; Lima, Tania; Lachaize, Corinne; Pagni(1), Marco; Bairoch, Amos

hamap-project@isb-sib.ch, gattiker@isb-sib.ch
SWISS-PROT Group, Swiss Institute of Bioinformatics, 1 rue Michel Servet, CH-1211 Geneva 4, Switzerland, and (1) Swiss Institute of Bioinformatics and Swiss Institute for Experimental Cancer Research (ISREC), CH-1066 Epalinges/Lausanne, Switzerland

In July 1995, the complete genomic sequence of a bacterium, Haemophilus influenzae, became available. That of an archaeon, Methanococcus jannaschii, quickly followed. This was the prelude to a flood of microbial genome sequences. Today more than 75 of these genomes are available in public databases. Collectively they encode almost 200,000 different protein sequences. And this is only the beginning! Such a large number of sequences makes classical manual annotation an intractable task.

We are therefore developing, in the framework of the SWISS-PROT knowledgebase [1], a project that aims to automatically annotate a significant percentage of the proteins originating from microbial genome sequencing projects. This project differs from the many existing automatic annotation systems in that it neither hunts for distant similarity nor aims to annotate all potential proteins originating from a microbial genome. Rather, it is designed to deal specifically with two subsets of bacterial and archaeal proteins:

1) It automatically annotates proteins that have no recognizable similarity to any other microbial or non-microbial protein (these are generally called "ORFans"). This task mainly involves the automatic recognition and annotation of features such as signal sequences, transmembrane domains, coiled-coil regions, inteins, etc., as well as exclusion rules that combine the results of different prediction methods in a biologically sensible way.

2) The most challenging part of the project aims to automatically annotate proteins that belong to well-defined families or subfamilies. In most cases, these are well-characterized protein families for which it is possible, using software tools, to automatically build a SWISS-PROT entry of a quality identical to that produced manually by an expert annotator. To do this we are building, for each well-defined (sub)family, a rule system that describes the level and extent of annotation that can be assigned by similarity to a manually annotated prototype entry. Each rule system also includes a carefully edited multiple alignment of the family, which is used both for the automated generation of identification profiles and for the propagation of sequence features by similarity to a template sequence. Currently, 650 such families are used for automatic annotation.
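
As an illustration of propagating a sequence feature by similarity to a template, the following Python sketch maps a single annotated site from a template sequence onto a target through a gapped alignment. It is not HAMAP code: the function names, the aligned rows and the rule that the target must carry the same residue in the same alignment column are all assumptions made for this example.

```python
from typing import Optional


def column_of_residue(aligned: str, position: int) -> int:
    """Return the 0-based alignment column holding the 1-based residue `position`."""
    seen = 0
    for column, char in enumerate(aligned):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"position {position} is beyond the end of the sequence")


def residue_at_column(aligned: str, column: int) -> Optional[int]:
    """Return the 1-based residue position at `column`, or None if it is a gap."""
    if aligned[column] == "-":
        return None
    return sum(1 for char in aligned[: column + 1] if char != "-")


def propagate_site(template: str, target: str, position: int,
                   expected: str) -> Optional[int]:
    """Map a template site onto the target; return None if it cannot be transferred.

    Hypothetical rule: the site is transferred only when the target has the
    expected residue in the same alignment column as the template.
    """
    if len(template) != len(target):
        raise ValueError("aligned sequences must have the same length")
    column = column_of_residue(template, position)
    if template[column] != expected:
        raise ValueError("template does not carry the expected residue")
    if target[column] != expected:
        return None
    return residue_at_column(target, column)


# Hypothetical aligned rows, as they might appear in a family alignment.
TEMPLATE = "MK-CHAGT"
TARGET_A = "MKLCH-GT"   # conserved Cys, shifted by one insertion
TARGET_B = "MK-SHAGT"   # Cys replaced by Ser: the site is not transferred

site_a = propagate_site(TEMPLATE, TARGET_A, 3, "C")
site_b = propagate_site(TEMPLATE, TARGET_B, 3, "C")
print(site_a, site_b)
```

In this toy case the site at template position 3 lands on position 4 of the first target because of the insertion, and is withheld from the second target because the residue differs.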

In both cases described above, the aim is to annotate proteins to the highest level of quality. The programs are specifically designed to track down "eccentric" proteins. Among the peculiarities they recognize are: size discrepancy, absence or divergence of regions involved in activity or binding (to metals, nucleotides, etc.), presence of paralogs, inconsistencies with the biological context (e.g. a protein that belongs to a pathway apparently absent from a particular organism), etc. Such "problematic" proteins are not annotated automatically but are flagged for further analysis by SWISS-PROT expert annotators. This allows the annotators to concentrate on the proteins that really need careful manual annotation. Four bacterial proteomes have been completely (or virtually completely) annotated to SWISS-PROT standards.
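
The kind of screening described above can be sketched in Python as a set of checks that either return no problems (the protein may be annotated automatically) or a list of reasons to flag it for an annotator. This is a minimal sketch under stated assumptions: the `FamilyRule` fields, the thresholds and the messages are hypothetical, and for simplicity the required residues are given directly as positions in the candidate sequence rather than derived from a family alignment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FamilyRule:
    """Hypothetical per-family expectations used for sanity checks."""
    min_length: int
    max_length: int
    # 1-based position in the candidate -> residue expected at that position
    required_residues: Dict[int, str] = field(default_factory=dict)


def find_problems(sequence: str, rule: FamilyRule,
                  copies_in_genome: int) -> List[str]:
    """Return the reasons a protein should be flagged; an empty list means none."""
    problems = []
    # Size discrepancy relative to the range expected for the family.
    if not rule.min_length <= len(sequence) <= rule.max_length:
        problems.append("size discrepancy")
    # Absence or divergence of residues involved in activity or binding.
    for position, residue in sorted(rule.required_residues.items()):
        if position > len(sequence) or sequence[position - 1] != residue:
            problems.append(f"missing {residue}{position}")
    # More than one family member in the same genome suggests paralogs.
    if copies_in_genome > 1:
        problems.append("possible paralogs")
    return problems


# Hypothetical rule and candidate sequences.
rule = FamilyRule(min_length=8, max_length=12,
                  required_residues={4: "C", 5: "H"})

clean = find_problems("MKLCHAGTWW", rule, copies_in_genome=1)
flagged = find_problems("MKLSH", rule, copies_in_genome=2)
print(clean)
print(flagged)
```

Returning the full list of reasons, rather than stopping at the first failure, gives the annotator the complete picture of why a protein was withheld from automatic annotation.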

In addition to bacterial and archaeal proteomes, HAMAP is also used in the annotation of organelle proteomes (chloroplasts, and soon mitochondria). The technologies and expertise that are developed thanks to the HAMAP project are also important for future automatic annotation projects for eukaryotic proteomes.

The HAMAP project is developed in collaboration with other groups of the SIB as well as with the groups of Francois Rechenmann and Alain Viari (Grenoble), Laurent Duret and Guy Perriere (Lyon), Claudine Medigue (Paris/Evry), Chantal Abergel and Cedric Notredame (Marseille).

The proteins annotated by HAMAP are seamlessly and continuously integrated into SWISS-PROT. Additional information about HAMAP can be found at http://www.expasy.org/sprot/hamap/.
[1] Bairoch A., Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45-48(2000).