Guinand, Frédéric;Parmentier, Gilles;Trystram, Denis - Integration of Multiple Alignment and Phylogeny Reconstruction

ECCB 2002 Poster sorted by: Author | Number

Next | Previous poster (in order of the view you have selected)

Title: Integration of Multiple Alignment and Phylogeny Reconstruction	P50
Guinand, Frédéric; Parmentier, Gilles; Trystram, Denis Gilles.Parmentier@imag.fr ID-IMAG

This work deals with the problem of reconstructing phylogenies from a nucleotide sequences dataset. Contrary to the usual methods, our approach is not based on a multiple alignment of the sequences, but considers simultaneously phylogeny reconstruction and multiple alignment. We propose a generic scheme to reconstruct phylogenies which is designed such that the user can wether computes large-scale phylogenies or small and precise ones. Some experiments are discussed for actual biological datasets.

Introduction
Phylogeny is one of the most challenging problem in biocomputing. It consists in reconstructing the tree of evolution for a set of species[2]. Most existing phylogenetic analysis are based on a pre-computed multiple alignment. This alignment is computed using a software such as TCoffee [3] or Clustal [4] and sometimes biologists modify by hand the computed alignment. The main drawback to this approach is the way the used software computes the multiple alignment. Mainly, they are based on progressive multiple alignment algorithms [1]. To guide this progressive multiple alignment, a tree, called a cladogram, is computed. It is important to notice here that such a tree is not exactly a phylogenetic tree, due to the inadequation of the data. In this case, the phylogenetic analysis may be skewed due to this cladogram. In our scheme, we try to tackle this bias using an iterative approach.

1. A generic scheme to integrate multiple alignment and phylogeny reconstruction

In order to be able to use different approaches among the existing ones, we choose to design a generic scheme for phylogeny reconstruction. This scheme is an iterative one, starting with a distance matrix computation. The principle consists in finding some sets of close sequences. Then, from these sets, we compute some clusters of taxa. Then we update the distance matrix and we iterate.
More precisely, to compute the distance matrix an all-to-all pairwise comparison has to be performed. This matrix may be a distance matrix or similarity matrix, depending on the used method. For the sake of clarity, we will restrict the presentation to distance matrix.
Next, we look for close sequence sets, according to this matrix. Two policies have been implemented (described in the next section).Next stage is the most time consuming one. It consists in processing the sets in order to find relevant clusters. These clusters will not change until the end of the algorithm. Here, a cluster consists in both a topology and a multiple alignment.

Last step is an update of the distance matrix, according to the clusters. This update may be done using different policies, like mean average over the distances between the grouped taxa or using multiple alignment score between the new clusters and the other elements of the matrix. This method allow us to evaluate different multiple alignments and different phylogenies.

2.Results

In order to test our scheme, we have implemented two variants of the neighborhoods search. The first one computes a neighborhood for each sequence. The neighborhood for sequence "i" is the set of the its "k" closest sequences (k is a fixed parameter). Now, for each neighborhood, we compute all the possible trees and we keep the best ones. From all these trees, we compute th clusters. Here, a cluster is defined as a set of elements such that each time it appears in a tree, it forms a subtree.The second one constructs some neighborhoods for which their components are the closest sequences related to each other.In both cases, special attention must be made to matrix update. n fact,it seems that our scheme performs well to find clusters in the first step. After the first iteration, the quality the results is with the adequation of the matrice update procedure.We present a result for a twelve-taxa tree. This dataset is from TreeBase [5], file M312.NX. As we explained before, from this file, we only use sequences information, not multiple-alignment information.

3.Conclusion

In this work, we proposed a generic scheme for simultaneous phylogeny reconstruction and multiple alignment. Our method is based on an iterative procedure, searching for relevant clusters at each step. Preliminary tests show good results. Further investigations will be done using better multiple alignment heuristics. We plan to implement a parallel version of this scheme in order to adress very large phylogenies.

[1] Feng D. and Doolittle R., Progressive Multiple Alignment as a Prerequisite to Correct Phylogenetic Trees, J. Miol. Biol., 1987, Vol.25, p.351-360.
[2] D. Hillis and C. Moritz and B.Marble, Molecular Systematics, Sunderland, 1996, MA: Sinauer, 2nd edition.
[3] Notredame C. and Higgins D.G. and Heringa J., T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment, J. Mol. Biol., 2000, Vol. 302, p.205-217.
[4] Thompson J. and Higgins D. and Gibson T., CLUSTAL W: Improving the Sensitivity of Progressive Multiple Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice, Nucleic Acids Res., 1994, Vol. 22, p.4673-4690.
[5] TreeBase, A Database of Phylogenetic Knowledge, http://www.treebase.org/treebase/.