ECCB 2002 Poster


Title: Simple and Efficient Secondary Structure Prediction
P48
Guimarães, Katia; Melo, Jeane; Cavalcanti, George

katia@cin.ufpe.br, jcbdm@cin.ufpe.br, gdcc@cin.ufpe.br
Center of Informatics, Federal University of Pernambuco

Introduction
Machine-learning techniques, such as artificial neural networks, have been applied to the prediction of protein secondary structure over the last twenty years [6, 8, 4, 9, 2, 1].
Different architectures, algorithms, and inputs have been explored to obtain better prediction accuracy.

In many cases, test and training sets are developed separately for each application, making it difficult to compare results. Recently, Cuff and Barton [3] reported a comparative study of secondary structure predictors that use multiple sequence methods. The same training sets were applied to each predictor: RS126, developed by Rost and Sander [11], and CB396, proposed by Cuff and Barton in that work. The maximum average Q3 prediction accuracy was 72.9% for the CB396 dataset and 74.8% for RS126, both obtained by a combination of the four predictors analyzed. The improvement over other methods is approximately 1%, but the computational effort is significantly greater.

In this work we report results comparable to those obtained by Cuff and Barton, using, however, a simpler method. Combining only three sequence-to-structure networks, we obtained a Q3 accuracy of 74.13% for the RS126 dataset. The tests with the CB396 dataset are still in progress, but the partial results (approximately 74.8% for the tests performed so far) indicate that we may obtain an accuracy even better than that reported by Cuff and Barton. The training algorithm used was RPROP, an efficient variation of backpropagation [7]. As input data we used the PSI-BLAST profile [4], which improved the results by about 0.5%.


Methods
We used three networks with 30, 35, and 40 nodes in the hidden layer. The input layer consists of the columns of the PSI-BLAST profile, and the sequences are scanned with windows of size 13. The output layer consists of three nodes, one for each secondary structure element (helix, strand, and coil). The three reference secondary structure states for each protein in the databases were produced by the DSSP algorithm [5].
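The architecture above can be sketched as a simple feed-forward pass; this is an illustrative reconstruction, not the actual implementation, and assumes 20 profile columns per residue and a tanh hidden layer:

```python
import numpy as np

# Sketch of one sequence-to-structure network: a window of 13 residues,
# each encoded by a 20-column PSI-BLAST profile row; a hidden layer of
# 30, 35, or 40 nodes; 3 outputs for helix (H), strand (E), and coil (C).
WINDOW, PROFILE_COLS, N_CLASSES = 13, 20, 3

def init_network(n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 0.1, (WINDOW * PROFILE_COLS, n_hidden))
    w2 = rng.normal(0, 0.1, (n_hidden, N_CLASSES))
    return w1, w2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(window_profile, w1, w2):
    """Forward pass: window_profile has shape (13, 20)."""
    x = window_profile.reshape(-1)   # flatten the window into one vector
    h = np.tanh(x @ w1)              # hidden layer activations
    return softmax(h @ w2)           # class probabilities for H, E, C

w1, w2 = init_network(30)
probs = predict(np.zeros((WINDOW, PROFILE_COLS)), w1, w2)
```

Each residue is classified by sliding this 13-residue window along the sequence, centered on the residue being predicted.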

To obtain the training and test sets, each database was divided into seven parts. The RS126 set follows the same partition proposed by Riis and Krogh [8], and the CB396 set was divided randomly into sets of proportional sizes. A distinguishing point in relation to other works is the use of the RPROP algorithm for training the networks.
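The random partition of CB396 into seven folds of proportional size could be done as follows (an illustrative sketch; the seed and helper name are assumptions):

```python
import random

def seven_fold_split(proteins, seed=42):
    """Randomly partition a collection of proteins into seven folds of
    near-equal size, as described for the CB396 set (illustrative)."""
    items = list(proteins)
    random.Random(seed).shuffle(items)
    return [items[i::7] for i in range(7)]

folds = seven_fold_split(range(396))
# Each fold in turn serves as the test set; the other six are used for training.
```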

RPROP (Resilient PROPagation) is a learning scheme for training neural networks based on a local adaptation strategy. In contrast to other algorithms, such as standard backpropagation, it uses only the sign of the partial derivative to perform learning. It has further advantages: the number of epochs and the computational effort are reduced in comparison with the original gradient-descent procedure, and the algorithm is robust with respect to the choice of its initial parameters.
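The core of the update rule can be sketched as follows; this is a minimal sketch after [7] (the RPROP- variant), with the standard default constants, not the implementation used in this work:

```python
import numpy as np

def rprop_step(grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_max=50.0, step_min=1e-6):
    """One RPROP update for an array of weights: the per-weight step size
    grows when the gradient keeps its sign and shrinks when it flips;
    only the SIGN of the gradient sets the update direction."""
    sign_change = grad * prev_grad
    # Same sign as last epoch: accelerate; opposite sign: back off.
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # After a sign flip, skip the update on the next iteration.
    grad = np.where(sign_change < 0, 0.0, grad)
    delta_w = -np.sign(grad) * step
    return delta_w, grad, step
```

Because the step size is adapted per weight and independently of the gradient magnitude, training is far less sensitive to the initial learning-rate choice than plain gradient descent.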

Combination of classifiers, a technique used in the best predictors available today, is applied in our work. The objective of combining classifiers is to obtain better results from the joint effort of the individual classifiers. It is necessary to ensure that each network converges to a different local minimum; we achieve this by using a different number of nodes in the hidden layer of each network.

The classification rules used were: voting, product, maximum, minimum, and average. In the voting rule, the class with the greatest frequency, i.e., the majority of the votes, prevails. In the other rules, we use the softmax function to normalize the outputs of the networks.
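The five combination rules can be sketched as below, assuming each network already produces softmax-normalized scores for the three classes (H, E, C); the function name and layout are illustrative:

```python
import numpy as np

def combine(outputs, rule):
    """Combine the outputs of several networks (one row per network, one
    column per class H/E/C) with one of the five rules; returns the
    index of the winning class."""
    outputs = np.asarray(outputs)
    if rule == "voting":
        # Each network votes for its top class; the most-voted class wins.
        votes = np.bincount(outputs.argmax(axis=1), minlength=outputs.shape[1])
        return votes.argmax()
    # The remaining rules reduce the per-class scores across networks.
    scores = {"product": outputs.prod(axis=0),
              "maximum": outputs.max(axis=0),
              "minimum": outputs.min(axis=0),
              "average": outputs.mean(axis=0)}[rule]
    return scores.argmax()

nets = [[0.7, 0.2, 0.1],   # hypothetical outputs of the 30-node network
        [0.6, 0.3, 0.1],   # 35-node network
        [0.2, 0.5, 0.3]]   # 40-node network
winner = combine(nets, "product")
```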


Results
The average Q3 accuracy for the RS126 database was 71.76% for the network with 30 nodes in the hidden layer, 71.31% for that with 35 nodes, and 71.16% for that with 40 nodes. The accuracy increases by about 3% when these networks are combined, reaching 74.13% with the product rule and 74.10% with the average rule.
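For reference, the Q3 measure reported above is simply the per-residue three-state accuracy; a minimal illustration:

```python
def q3(predicted, observed):
    """Q3 = percentage of residues whose predicted state (H, E, or C)
    matches the DSSP-assigned reference state."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

score = q3("HHHECC", "HHEECC")  # 5 of 6 residues agree
```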

This result motivated us to try a similar test with the CB396 database. The tests are still being performed, but the first results give an average Q3 accuracy of 74.8% for the product and average rules, so we believe that the final result will be at least comparable with the 72.9% reported by Cuff and Barton on the same dataset.

References
[1] Pollastri, G., Przybylski, D., Rost, B. and Baldi, P., Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles, Proteins, 47:228--235, 2002.
[2] Baldi, P. and Brunak, S., Bioinformatics: The Machine Learning Approach, 2001.
[3] Cuff J. A. and Barton G. J., Evaluation and improvement of multiple sequence methods for protein secondary structure prediction, PROTEINS: Structure, Function and Genetics, 34:508-519, 1999.
[4] Jones, D. T., Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol., 292:195--202, 1999.
[5] Kabsch, W. and Sander, C., A dictionary of protein secondary structure, Biopolymers, 22:2577--2637, 1983.
[6] Qian, N. and Sejnowski, T. J., Predicting the secondary structure of globular proteins using neural network models, J. Mol. Biol., 202:865--884, 1988.
[7] Riedmiller, M., and Braun, H., A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm, Proc. ICNN, 586--591, 1993.
[8] Riis, S. K. and Krogh, A., Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments, J. Comp. Biol., 3:163--183, 1996.
[9] Rost, B., Review: Protein secondary structure prediction continues to rise, J. Struct. Biol., 134:204--218, 2001.
[10] Rost, B. and Sander, C., Third generation prediction of secondary structure, Prot. Struct. Predict.: Methods and Protocols, 71--95, 2000.
[11] Rost, B. and Sander, C., Prediction of protein secondary structure at better than 70% accuracy, J. Mol. Biol., 232:584--599, 1993.