ECCB 2002 Poster sorted by: Author | Number

Next | Previous poster (in order of the view you have selected)

Title: A hidden Markov model for progressive multiple alignment
P96
Löytynoja, Ari; Milinkovitch, Michel

aloytyno@ulb.ac.be
Unit of Evolutionary Genetics, Free University of Brussels (ULB), cp300, Institute of Molecular Biology and Medicine, rue Jeener & Brachet 12, B-6041 Gosselies, Belgium

Introduction
Many of the current alignment programs base on the progressive alignment method of Feng and Doolittle (1987), and iterate the dynamic programming algorithm to build a complete multiple alignment. These methods do not guarantee a globally optimal multiple alignment nor provide any measure of solution's reliability.
A hidden Markov model (HMM) is an extension of basic Markov process, and suits well for a probabilistic description of complex systems. Durbin et al. (1998) defined a HMM for a pairwise sequence alignment with the affine-gap-score method. We have further developed their approach and generalized the HMMs for computationally-efficient production of probabilistic multiple alignments. Our method does not promise the correct alignment but it provides the posterior probability of the solution as a measure of alignment's quality. We also propose an alternative for the sequence profile alignment, and show that an algorithmically simple method, that does not implement any predetermined heuristics, can perform as well as or better than the traditional method.

Methods and results
A multiple alignment is created by progressively performing pairwise alignments at the nodes of the guide tree in order of decreasing similarity. All the nodes are represented by a single sequence but, to allow character uncertainty at the internal nodes of the alignment tree, the sequence sites are described as vectors of character probabilities. At terminal nodes (present sequences) the observed character is given the probability 1 as all the others are set to 0; at internal nodes (ancestral sequences) characters have intermediate probabilities that sum to 1. A "gap" is considered as an extra character.
All the pairwise alignments are performed with the same set of dynamic programming algorithms. An alignment match score corresponds the probability of evolution from a parent site to its two child sites, and is summed over all possible characters combinations. An alignment gap is not different from a match except that one of the child sites is known to be a "gap" for sure. The computation of a match/gap probability also defines the ancestral (parent) sequence by setting the relative probability for each character at the site. This ancestral sequence can then be further aligned with other ancestral (internal) or present (terminal) sequences, and profiles with character weighting are not needed. As a "gap" character has a non-zero probability at ancestral sequences, aligning more gaps at the location of an already existing gap is beneficial, and the algorithm gathers gaps at the same columns without local adjustment of gap penalties.
The HMM provides an algorithm that resolves the most probable state path (best solution) but can also sample sub-optimal alignments according to their posterior probabilities. Repeating the alignment procedure on the best path but randomly breaking the ties (equally good local solutions), or sampling sub-optimal pairwise alignments to build the multiple alignment may escape from a local optimum and give a globally better result. Two other algorithms compute the full probability of the alignment and the posterior probability of the alignment sites being aligned correctly. As the posterior probability reflects alignment's quality, it allows easy detection of problematic alignment regions, or removal of low-quality sites before a further analysis.
The performance of the new method was tested with artificial data sets (20 sequences, ~500 nucleotides) for which the true alignment was known. Depending on the evolutionary distances between the sequences and the length of the alignment gaps, the probabilistic method performed as well as or better than ClustalW (Thompson, Higgins & Gibson 1994). Sampling sub-optimal solutions could often improve the alignment, although detecting the best alignment can be difficult as the probability score and the correctness of the alignment did not correlate. Incorrectly aligned sites had in average lower posterior probability than the correct ones, and filtering sites with a low posterior probability removed most of the incorrect sites without losing too many correct ones.
The novel algorithms are implemented in the software "ProAlign" that performs the alignment of nucleotide sequences, and provides a GUI to view the quality (posterior probabilities) of the solution and to filter the low-quality regions before exporting the data to other software. ProAlign also offers a front-end to iterate the multiple alignment procedure by either choosing the best path or sampling the path from the posterior probabilities, as well as a command line interface to automate some of the tasks. ProAlign is written in Java, and runs on Linux, Mac OSX and Windows. It is freely downloadable at http://dbm.ulb.ac.be/ueg/ProAlign/.
[1] Durbin R, Eddy S, Krogh A & Mitchison G 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , Cambridge University Press, Cambridge, UK.
[2] Feng DF & Doolittle RF 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351-60.
[3] Thompson JD and Higgins DG & Gibson TJ 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-80.