ECCB 2002 Poster

Title: Mapping sequence properties to biological classification by machine learning algorithms
P74
Kanapin, Alexander; Soinov, Lev; Krestyaninova, Maria

alex@ebi.ac.uk
EMBL-EBI

The wide spectrum of protein sequence analysis methods, such as HMMs, fingerprints and clustering, makes it possible to divide the full variety of known and predicted protein sequences into groups according to sequence similarity.
The number of such data resources increases continuously, which raises the problem of their consistency with each other. The need to integrate different protein sequence analysis methods led to the InterPro project [1]. Its progress so far has shown that the more information is accumulated in a single, non-redundant resource, the more precise and consistent the resulting classification of proteins becomes. Thus, the need for a system that maps all known properties of a particular sequence to a definite biological class becomes obvious.
InterPro allows every SwissProt protein to be characterized in terms of so-called "signatures". Because InterPro uses a variety of methods, the word "signature" designates a profile, fingerprint or HMM by a single term. The combination of signatures, described by features such as the positional relationships of signatures, the list of specific signatures, etc., is unique for each protein. At the same time, SwissProt curators have manually characterized each protein sequence with specific terms (e.g. keywords). Given a set of sequences described in these two different ways, we can use it as the training set for an Artificial Intelligence (AI) system that establishes a correspondence between the two types of protein sequence description.
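As an illustration of what such a training set might look like, the sketch below encodes each protein as a binary presence/absence vector over signatures and derives a yes/no label for one target term. This is a minimal sketch, not the representation used in this work: the identifiers (prot_A, SIG_1, KW_X) are made-up placeholders rather than real InterPro or SwissProt records, the function name build_training_set is hypothetical, and positional relationships between signatures are ignored.

```python
from typing import Dict, List, Set, Tuple

# Toy, made-up annotations (placeholders, not real database records).
# Protein -> set of signatures it hits.
signatures: Dict[str, Set[str]] = {
    "prot_A": {"SIG_1", "SIG_2"},
    "prot_B": {"SIG_2"},
    "prot_C": {"SIG_3"},
}
# Protein -> set of curated terms (e.g. keywords).
keywords: Dict[str, Set[str]] = {
    "prot_A": {"KW_X"},
    "prot_B": {"KW_X", "KW_Y"},
    "prot_C": set(),
}


def build_training_set(
    signatures: Dict[str, Set[str]],
    keywords: Dict[str, Set[str]],
    target_keyword: str,
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Return (feature names, binary feature matrix, binary labels).

    Assumption: one feature per signature, 1 if the protein hits it.
    The label is 1 if the protein carries target_keyword, else 0.
    """
    feature_names = sorted(set().union(*signatures.values()))
    proteins = sorted(signatures)
    X = [
        [1 if name in signatures[p] else 0 for name in feature_names]
        for p in proteins
    ]
    y = [1 if target_keyword in keywords.get(p, set()) else 0 for p in proteins]
    return feature_names, X, y


feature_names, X, y = build_training_set(signatures, keywords, "KW_X")
```

Each target term yields its own label vector over the same feature matrix, so one such data set would be built per term.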
We mapped InterPro features to SwissProt terms using supervised classification methods. In machine learning terms, the mapping problem is formulated as follows: we create classifiers that, based on given correct examples, assign the proper SwissProt term to an uncharacterised protein sequence using the set of InterPro features inherent to it. Whether or not to assign a SwissProt term to a given sequence is a two-class classification problem. The pool of correct examples forming the training set for the AI system is highly imbalanced: the number of sequences that match a given SwissProt term (for example, a keyword) is significantly smaller than the total number of sequences in SwissProt. We therefore used filtering techniques for feature subset selection and Receiver Operating Characteristic (ROC) analysis to find optimal misclassification costs in cost-sensitive classification [2]. Finally, the classifiers are created in the form of decision rules and can be used to classify any uncharacterised protein sequence that hits at least one InterPro signature.
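The interplay between ROC analysis and misclassification costs on an imbalanced two-class problem can be sketched as follows. This is an illustrative simplification, not the procedure used in this work: it assumes a classifier that outputs a numeric score per sequence, takes the two costs as given, and picks the score threshold with the lowest total cost. The function names and the toy scores and labels are hypothetical, and both classes are assumed to be present.

```python
from typing import List, Tuple


def confusion_by_threshold(
    scores: List[float], labels: List[int]
) -> List[Tuple[float, int, int, int, int]]:
    """For each distinct score t (descending), predict positive when
    score >= t and return (t, TP, FP, FN, TN)."""
    rows = []
    for t in sorted(set(scores), reverse=True):
        tp = fp = fn = tn = 0
        for s, y in zip(scores, labels):
            predicted_positive = s >= t
            if predicted_positive and y == 1:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif y == 1:
                fn += 1
            else:
                tn += 1
        rows.append((t, tp, fp, fn, tn))
    return rows


def roc_curve(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """ROC points as (false positive rate, true positive rate)."""
    return [
        (fp / (fp + tn), tp / (tp + fn))
        for _, tp, fp, fn, tn in confusion_by_threshold(scores, labels)
    ]


def cheapest_threshold(
    scores: List[float], labels: List[int], cost_fp: float, cost_fn: float
) -> Tuple[float, float]:
    """Return (threshold, total cost) minimising cost_fp*FP + cost_fn*FN.
    Ties are resolved in favour of the highest threshold."""
    best_t, best_cost = None, None
    for t, tp, fp, fn, tn in confusion_by_threshold(scores, labels):
        cost = cost_fp * fp + cost_fn * fn
        if best_cost is None or cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy imbalanced data: 2 positives among 8 sequences (made-up values).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0, 0, 0]

# Penalising missed positives more heavily lowers the chosen threshold.
fn_heavy = cheapest_threshold(scores, labels, cost_fp=1, cost_fn=5)
fp_heavy = cheapest_threshold(scores, labels, cost_fp=5, cost_fn=1)
```

On this toy data, the choice of cost ratio moves the operating point along the ROC curve: a higher cost for false negatives selects a more permissive threshold that recovers both positives at the price of one false positive.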
The resulting system will substantially simplify and improve quality control procedures and the integration of new member databases into InterPro. It will also increase the accuracy of InterPro functional prediction by evaluating its significance level using common statistical methods and the created classification. Once the AI system is created, we will have an additional instrument for making the results of manual annotation consistent. We expect that similar interactive systems based on the supervised learning approach will become an essential analytical part of a variety of integrated biological resources.
[1] Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001 Jan 1;29(1):37-40
[2] Witten I, Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, ISBN 1-55860-552-5, 1999.