ECCB 2002 Poster sorted by: Author | Number

Next | Previous poster (in order of the view you have selected)

Title: Feature Subset Selection for Splice Site Prediction by Estimation of Distribution Algorithms
P138
Saeys, Yvan; Degroeve, Sven; Aeyels, Dirk; Van de Peer, Yves; Rouzé, Pierre

yvan.saeys@rug.ac.be
Department of Plant Systems Biology, Ghent University, Flanders Interuniversity Institute of Biotechnology (VIB), K.L. Ledeganckstraat 35, Ghent, 9000, Belgium

For a lot of biological processes, it is still not clear which elements contribute to the observed behaviour. One example is the occurrence of splice sites in sequences, an important characteristic in the context of gene finding. Due to the completion of the Arabidopsis Genome project, much data became available, allowing the use of supervised learning methods to automate the process of splice site prediction. As it is not clear which features are relevant for an accurate splice site prediction these learning methods are usually provided with many features, assuming that this will increase the probability of including relevant information.

Since not all features are important, and some of the features might be correlated, there is a need to search for a relevant subset of features that maximizes the predictive accuracy of the classification system. Traditional Feature Subset Selection methods are sequential and are based on a greedy heuristic. Sequential Forward Elimination (SFE) starts with the empty feature set and iteratively adds features, while Sequential Backward Elimination (SBE) starts with the full feature set and iteratively discards features.

Recently, Estimation of Distribution Algorithms (EDAs) (Muehlenbein and Paass, 1996) emerged as a more general framework of evolutionary algorithms. Instead of using the traditional crossover and mutation operators to create the new population, a more statistical approach is used to estimate the distribution of the parameters from a selected group of individuals. Creation of the new population is then performed by sampling individuals from the estimated distribution. EDAs have proven to outperform the standard genetic algorithms in many problems where multiple dependencies among parameters exist, and they usually need fewer fitness evaluations to obtain good solutions.

We combined the use of Estimation of Distribution Algorithms with a wrapper approach for feature subset selection. Much time is devoted in analysing the fitness of each individual. For each individual a new model has to be trained, and this model has to be evaluated on a test set, requiring a fast classification algorithm. In our experiments we used the Naive Bayes Classifier, a fast classifier based on the assumption that all parameters are independent.

The results of our experiments show that it clearly outperforms the traditional sequential methods for FSS. Furthermore the wrapper based approach described in (Degroeve et al., 2002) can be combined with the EDA based approach, yielding better solutions. This can be seen in Fig. 1 where a simple EDA approach (EDA-UMDA) is compared to SBE with the Naive Bayes Classifier (SBE-NBM) and a Support Vector Machine with polynomial kernel (PSVM). The results in this figure represent feature subset selection for acceptor prediction. Training was done on a balanced training set (1000 positive and 1000 negative examples), and testing was performed on an unbalanced test set of independent genes (281 positives and 7505 negatives).

(For the figures see web representation of the poster abstract)
[1] Sven Degroeve, Bernard De Baets, Yves Van de Peer and Pierre Rouze (2002), Feature Subset Selection for Splice Site Prediction. Bioinformatics (in press).
[2] H. Muehlenbein and G. Paass (1996), From recombination of genes to the estimation of distributions. Binary parameters. Lecture Notes in Computer Science 1411: Parallel Problem Solving from Nature, PPSN IV, (pp. 178-187).