ECCB 2002 Poster

Title: Unifying Data Mining Tools for Automated Annotation on TrEMBL
P129
Rakow, Astrid; Hackmann, Andre; Kretschmann, Ernst; O'Rourke, John; Apweiler, Rolf

arakow@ebi.ac.uk, ahackman@ebi.ac.uk
EMBL Outstation - European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

The TrEMBL database, a supplement to the SWISS-PROT database, contains over 670,000 partially annotated protein entries [1]. Because the number of submissions is currently increasing exponentially, a limited number of curators cannot hope to process all the data. Curators focus on annotating particular protein entries, or groups of entries, that have comparatively high priority for being moved to SWISS-PROT. Consequently, the backlog of partially annotated entries in TrEMBL grows daily.

Automated annotation procedures are used to increase the data content of TrEMBL entries, thereby supporting the annotation process and presenting more information to users of the database. A system has been developed that integrates annotation instructions, provided they conform to the Predictive Model Markup Language standard (PMML, cf. http://www.dmg.org). This step was necessary to unify various existing data mining tools and to create an interface that is open to future developments. The ultimate goal was to reduce the application times of those tools whose run times had outgrown a feasible time frame.

The first data mining application to be integrated successfully is the RuleBase system [2], which consists of a set of approximately 500 manually curated rules. It has been applied to TrEMBL since early 2000 and touches around 25% of all entries. The new rule application system reduced the run time of RuleBase by some 80%. As a next step, the Spearmint system [3] is to be incorporated. Its set of decision trees is generated automatically using the C4.5 algorithm, and it offers an advantage in coverage (around 70% of all TrEMBL entries are touched).

Because automatically added annotation often contradicts basic biological exclusions (e.g. bacteria do not have a nucleus, so annotation predictions linked to nuclei should be avoided), the Xanthippe project was created. It consists of a set of manually and automatically created exclusion rules that are applied as a post-processing step after automated annotation has been performed.
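The post-processing idea can be illustrated with a minimal Python sketch. It is not the Xanthippe implementation: the `ExclusionRule` class, the `apply_exclusions` function, and the (topic, value) representation of an annotation are all assumptions made for illustration. Only the bacteria/nucleus example comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExclusionRule:
    """Hypothetical exclusion rule: entries from `taxon` must not carry
    an annotation with this topic and value."""
    taxon: str
    topic: str
    value: str


def apply_exclusions(lineage, annotations, rules):
    """Split predicted annotations into (kept, rejected).

    lineage     -- list of taxon names for the entry (assumed layout)
    annotations -- list of (topic, value) tuples predicted for the entry
    rules       -- iterable of ExclusionRule objects
    """
    kept, rejected = [], []
    for topic, value in annotations:
        violated = any(
            rule.taxon in lineage
            and rule.topic == topic
            and rule.value.lower() == value.lower()
            for rule in rules
        )
        (rejected if violated else kept).append((topic, value))
    return kept, rejected


# The example from the text: bacteria have no nucleus.
RULES = [ExclusionRule("Bacteria", "SUBCELLULAR LOCATION", "Nuclear")]

predicted = [("SUBCELLULAR LOCATION", "Nuclear"), ("FUNCTION", "DNA-binding")]
kept, rejected = apply_exclusions(["Bacteria", "Proteobacteria"], predicted, RULES)
```

Running the filter after prediction, rather than inside each prediction tool, means that a single rule set can guard the output of every annotation system.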

All three systems have been translated to the PMML standard, and a preliminary PMML parser has been developed that translates PMML documents into an object representation. Using this representation, annotation actions can easily be executed on TrEMBL entries. The resulting system is more reliable, extensible and maintainable than the separate data mining tools would have allowed.
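As a rough illustration of translating a PMML document into an object representation, the Python sketch below parses a small decision tree and applies it to an entry. It is not the parser described above. The fragment is a simplified, hypothetical `TreeModel` (no namespace, data dictionary or mining schema, so it is not schema-valid PMML), and the field name `signature`, the scores, and the `TreeNode`, `load_tree` and `predict` names are all assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified, hypothetical PMML fragment; all names and values are made up.
PMML_FRAGMENT = """
<PMML version="2.0">
  <TreeModel modelName="hypothetical_keyword_tree" functionName="classification">
    <Node score="no annotation">
      <True/>
      <Node score="KW: hypothetical keyword A">
        <SimplePredicate field="signature" operator="equal" value="SIG_A"/>
      </Node>
      <Node score="KW: hypothetical keyword B">
        <SimplePredicate field="signature" operator="equal" value="SIG_B"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""


@dataclass
class TreeNode:
    score: str
    field_name: Optional[str] = None   # None means the node always matches
    value: Optional[str] = None
    children: List["TreeNode"] = field(default_factory=list)

    def matches(self, entry):
        return self.field_name is None or entry.get(self.field_name) == self.value


def parse_node(elem):
    """Recursively turn a <Node> element into a TreeNode object."""
    node = TreeNode(score=elem.get("score"))
    pred = elem.find("SimplePredicate")
    if pred is not None:
        if pred.get("operator") != "equal":
            raise ValueError("this sketch only supports operator='equal'")
        node.field_name = pred.get("field")
        node.value = pred.get("value")
    node.children = [parse_node(child) for child in elem.findall("Node")]
    return node


def load_tree(pmml_text):
    root = ET.fromstring(pmml_text.strip())
    return parse_node(root.find("TreeModel/Node"))


def predict(node, entry):
    """Descend into the first matching child; return the deepest node's score."""
    for child in node.children:
        if child.matches(entry):
            return predict(child, entry)
    return node.score


tree = load_tree(PMML_FRAGMENT)
annotation = predict(tree, {"signature": "SIG_A"})
```

Once the model is held as plain objects, applying it to an entry is a simple tree walk, and other model types could be added as further classes without changing the calling code.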

[1] Bairoch A., Apweiler R.; "The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000.", Nucl. Acids Res. 28:45-48(2000).
[2] Fleischmann W., Moeller S., Gateau A., Apweiler R.; "A novel method for automatic functional annotation of proteins.", Bioinformatics 15:228-233(1999).
[3] Kretschmann E., Fleischmann W., Apweiler R.; "Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT.", Bioinformatics 17:920-926(2001).