Wistrand, Markus; Sonnhammer, Erik L.L. - Choosing transition prior to maximize accuracy in homologue searches using Hidden Markov Models

ECCB 2002 Poster

Title: Choosing transition prior to maximize accuracy in homologue searches using Hidden Markov Models (P181)
Wistrand, Markus; Sonnhammer, Erik L.L. (Markus.Wistrand@cgb.ki.se), Center for Genomics and Bioinformatics, Karolinska Institute, Solna, Sweden

Protein homology search methods are central to sequence analysis and are becoming ever more important as the number of sequenced genomes increases. Currently, a significant fraction (20-50%) of the predicted genes in a genome cannot be assigned a function based on homology, even with the best current techniques. It therefore seems worthwhile to try to improve the existing computational methods.

Of the existing methodologies, Hidden Markov Models (HMMs) (Krogh et al., 1994; Hughey and Krogh, 1996; Eddy, 1998) have been shown to have the greatest potential for genome annotation and novel discoveries (Park et al., 1998; Karplus et al., 1998). HMMs are probabilistic models composed of interconnected states, in this case match, delete and insert states. Symbol emission probabilities are associated with the states, and transition probabilities with the connections between them. Building an HMM from an existing multiple alignment essentially corresponds to estimating all these probabilities as posterior probabilities, combining prior probabilities with event counts from the alignment columns in a Bayesian approach. The prior has two functions: 1) to avoid zero probabilities in the model and 2) to regularize the estimates so that the model does not overfit the training data.
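As a concrete illustration of combining counts with a prior, the sketch below computes posterior mean transition probabilities under a single Dirichlet prior: each observed count is increased by its pseudocount and the result is normalized. The function name, the counts and the pseudocount values are hypothetical and chosen only for illustration; they are not HMMER's defaults.

```python
def posterior_mean(counts, alphas):
    """Posterior mean probabilities given observed counts and Dirichlet
    pseudocounts (alphas): p_i = (c_i + a_i) / (sum(c) + sum(a))."""
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    total = sum(counts) + sum(alphas)
    return [(c + a) / total for c, a in zip(counts, alphas)]


# Hypothetical transition counts out of one match state:
# match->match, match->insert, match->delete
counts = [8, 0, 2]
# Illustrative pseudocounts (an assumption, not taken from HMMER)
alphas = [0.7, 0.1, 0.2]

probs = posterior_mean(counts, alphas)
# The unobserved match->insert transition still gets a non-zero probability.
```

Larger pseudocounts pull the estimates towards the prior mean, which is the regularizing effect mentioned above; smaller ones let the observed counts dominate.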

We have investigated the influence of the transition prior on model selectivity using HMMER (Eddy, 2001), one of the most widely used HMM packages. There are three types of transitions: from the match state, from the delete state and from the insert state. HMMER uses a Dirichlet density to model each of the transition priors.

Priors are most often estimated using maximum-likelihood techniques on a large data set. We looked instead at the actual number of misclassifications made by the HMM when using a specific transition prior. To this end, we constructed sets of training sequences and remote homologues using data from the seed alignments of families in Pfam 7.1 (Bateman et al., 2002). The basic principle was that test and training sequences should be homologues, but no test sequence should have more than 20% sequence identity to any of the training sequences.
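The 20% identity criterion can be sketched as a simple filter over rows of a seed alignment. This is only an illustration under stated assumptions: identity is taken here as the fraction of identical residues among columns where neither sequence has a gap, which may differ from the definition actually used, and the function names are hypothetical.

```python
GAP_CHARS = "-."


def percent_identity(row_a, row_b):
    """Percent identity between two rows of the same alignment, counted over
    columns where neither row has a gap (an assumed definition)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def is_remote(test_row, training_rows, max_identity=20.0):
    """True if the test row is at most max_identity percent identical
    to every training row."""
    return all(percent_identity(test_row, t) <= max_identity
               for t in training_rows)
```

A candidate test sequence would be kept only if `is_remote` returns True against the full training set of its family.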

After this procedure, 378 families were left, each divided into test and training sequences. Each family was associated with a file of noise sequences, based on the median length of the family's test sequences. Our approach was then to systematically change the parameters of one Dirichlet density at a time. For each setting of the transition prior, HMMs were built from the training sequences, and the test and noise sequences were aligned to them. A threshold was set for each family so as to minimize the number of false positives and false negatives. Adding the resulting errors together, we obtained what we call the Minimum Error Rate Sum (MERS), a measure of accuracy for that specific transition prior (lower is better).
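The threshold selection and summation can be sketched as follows. This is a minimal illustration, assuming that a sequence counts as a hit when its score is at or above the threshold and that the per-family minimum error counts are summed over all families; the function names and scores are hypothetical.

```python
def minimum_errors(test_scores, noise_scores):
    """Smallest false-negative + false-positive count over all thresholds,
    where a sequence is a hit if its score >= threshold."""
    # Only observed scores (plus "reject everything") can change the counts.
    candidates = sorted(set(test_scores) | set(noise_scores))
    candidates.append(float("inf"))
    return min(
        sum(1 for s in test_scores if s < t)       # false negatives
        + sum(1 for s in noise_scores if s >= t)   # false positives
        for t in candidates
    )


def mers(families):
    """Sum of per-family minimum error counts (assumed reading of MERS)."""
    return sum(minimum_errors(test, noise) for test, noise in families)


# Hypothetical scores for two families: (test scores, noise scores)
families = [
    ([50.0, 30.0, 12.0], [15.0, 5.0, 3.0]),  # one unavoidable error
    ([40.0, 35.0], [10.0, 8.0]),             # perfectly separable
]
total = mers(families)
```

Repeating this calculation for every setting of a transition prior gives one MERS value per setting, which is what allows the settings to be compared directly.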

Doing this systematically, we got a good picture of how model accuracy changed as we varied the transition prior. This was compared to the MERS obtained with a prior estimated by a maximum-likelihood approach, which turned out to have a more than 25% higher (worse) MERS than the best prior.

How can the poor performance of the maximum-likelihood prior be explained? Most importantly, judging from our experiments, HMMs built using the maximum-likelihood prior have a high probability of modelling sequences with long deletions. This adds to their sensitivity for remote homologues, but the drawback is that they also detect many noise sequences. The extra sensitivity is thus outweighed by lower selectivity, and taken together this leads to low accuracy.

[1] Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M. and Sonnhammer, E.L. 2002. The Pfam Protein Family Database. Nucleic Acids Research 30(1), 276-280.
[2] Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14, 755-763.
[3] Eddy, S.R. 2001. HMMER: Profile hidden Markov models for biological sequence analysis (http://hmmer.wustl.edu/).
[4] Hughey, R. and Krogh, A. 1996. Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 12, 95-107.
[5] Karplus, K., Barrett, C. and Hughey, R. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846-856.
[6] Krogh, A., Brown, M., Mian, I.S., Sjölander, K. and Haussler, D. 1994. Hidden Markov Models in computational biology: Applications to protein modelling. J. Mol. Biol. 235, 1501-1531.