ECCB 2002 Poster

Title: Hierarchical Machine Learning and Knowledge Discovery in Characterising Protein Families
P162
Tan, Aik Choon; Gilbert, David

actan@brc.dcs.gla.ac.uk, drg@brc.dcs.gla.ac.uk
Bioinformatics Research Centre, Department of Computer Science, University of Glasgow

Biology has rapidly become a data-rich, information-hungry science because of recent technologies for massive data generation (Lathrop, 2001). The resulting experimental data are stored in distributed databases across the web, which differ in their schemata, information representation, structure and retrieval systems. This heterogeneity makes it difficult for human experts to retrieve useful patterns or knowledge across these databases. One current trend in bioinformatics is therefore to design and implement automatic yet intelligent approaches that help the user extract useful biological information from these databases. Machine learning is one such approach; it has been widely applied in bioinformatics with considerable success.

One of the current research trends in machine learning applied to bioinformatics is to combine several sophisticated learning algorithms in order to increase a classifier's predictive accuracy and its explanatory power. When learning from large and diverse data sets (e.g. biological databases), it is important to produce a rule set that encapsulates the information from all the different sources. The classifiers used to characterise and/or classify the data must be both accurate and easily understood by the human expert. Most methods in bioinformatics, however, concentrate on the accuracy of the classifiers (e.g. patterns or rules) and less on their comprehensibility.

The aim of this research is to develop a novel approach for inducing invariant relationships between distributed biological data sources using knowledge discovery and hierarchical machine learning techniques. Specifically, our objective is to produce patterns that characterise a protein family while improving the explanatory power of the rules (Tan et al., 2002).

In outline, the learning and discovery process of our system is as follows:
Divide: Induce an individual pattern for each database in parallel using a Level 0 machine learning approach.
Merge: Merge the patterns from Level 0 with additional background knowledge, and convert them into an input set for the Level 1 learner.
Conquer: Induce invariant relationships between the various patterns using the Level 1 learner.
Test: Test and validate the Level 1 output.
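
The four steps above can be illustrated with a minimal Python sketch. It is not the system described in this poster: a toy one-rule learner stands in for both the Level 0 and Level 1 learners, whose actual algorithms are not specified in this outline, and every function name, attribute name and data value below is hypothetical and made up for illustration. The sketch also assumes that row i in every source describes the same protein, and it runs the Level 0 learners sequentially, although they are independent and could run in parallel.

```python
from collections import Counter, defaultdict


def learn_one_rule(rows, labels):
    """Toy stand-in learner (an assumption, not the authors' method):
    pick the single attribute whose value -> majority-label table
    makes the fewest errors on the training rows."""
    best = None
    for attr in rows[0]:
        table = defaultdict(Counter)
        for row, label in zip(rows, labels):
            table[row[attr]][label] += 1
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return best[1], best[2], default


def apply_rule(model, row):
    attr, rule, default = model
    return rule.get(row[attr], default)


def divide(sources, labels):
    """Level 0: induce one pattern per data source, independently."""
    return {name: learn_one_rule(rows, labels) for name, rows in sources.items()}


def merge(level0, sources, background):
    """Combine Level 0 outputs with background knowledge into Level 1 input rows."""
    merged = []
    for i, extra in enumerate(background):
        row = {name + "_pattern": apply_rule(model, sources[name][i])
               for name, model in level0.items()}
        row.update(extra)
        merged.append(row)
    return merged


def conquer(merged, labels):
    """Level 1: induce a rule over the merged Level 0 patterns."""
    return learn_one_rule(merged, labels)


def validate(level1, merged, labels):
    """Fraction of proteins whose Level 1 prediction matches the label."""
    hits = sum(apply_rule(level1, row) == label for row, label in zip(merged, labels))
    return hits / len(labels)


# Hypothetical training data: two sources describing the same four proteins.
train_sources = {
    "sequence": [{"motif": "present"}, {"motif": "present"},
                 {"motif": "absent"}, {"motif": "absent"}],
    "structure": [{"fold": "alpha_beta"}, {"fold": "alpha_beta"},
                  {"fold": "other"}, {"fold": "alpha_beta"}],
}
train_background = [{"binds_cofactor": "yes"}, {"binds_cofactor": "yes"},
                    {"binds_cofactor": "no"}, {"binds_cofactor": "no"}]
train_labels = ["member", "member", "non-member", "non-member"]

# Hypothetical held-out proteins for the test-and-validate step.
test_sources = {
    "sequence": [{"motif": "present"}, {"motif": "absent"}],
    "structure": [{"fold": "alpha_beta"}, {"fold": "other"}],
}
test_background = [{"binds_cofactor": "yes"}, {"binds_cofactor": "no"}]
test_labels = ["member", "non-member"]

level0 = divide(train_sources, train_labels)
level1 = conquer(merge(level0, train_sources, train_background), train_labels)
accuracy = validate(level1, merge(level0, test_sources, test_background), test_labels)
print("Level 1 attribute:", level1[0], "| held-out accuracy:", accuracy)
```

The Level 1 learner in this sketch never sees the raw databases, only the Level 0 patterns and the background attributes, which is what makes the scheme hierarchical. A propositional one-rule learner cannot express relations between patterns, so it only approximates the "invariant relationships" step.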

We have applied this approach to classify and characterise several super-fold protein families by learning over sequence, structural, topological and functional databases. In this poster, we present the preliminary results obtained from this analysis.

[1] Lathrop, R. (2001). Intelligent systems in biology: why the excitement? IEEE Intelligent Systems, 16: 8-13.
[2] Tan, A.C., Gilbert, D. and Tuson, A. (2002). Characterisation of FAD-family folds using machine learning approach. In Proceedings of the International Conference on Bioinformatics, InCOB 2002.
[3] Dym, O. and Eisenberg, D. (2001). Protein Science, 10: 1712-1728.
[4] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
[5] Muggleton, S. and Firth, J. (2001). CProgol4.4: a tutorial introduction. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining, pages 160-188. Springer-Verlag.
[6] Westhead, D.R., Slidel, T.W.F., Flores, T.P.J. and Thornton, J.M. (1999). Protein Science, 8: 897-904.