ECCB 2002 Poster sorted by: Author | Number

Next | Previous poster (in order of the view you have selected)

Title: Advantages of a global optimization approach for the cluster analysis of gene expression data
P110
Möller, Ulrich; Thies, Frank

UMoeller@pmail.hki-jena.de
Hans Knöll Institute for Natural Products Research, Jena, Germany

INTRODUCTION
Clustering is a basic technique for an exploratory data analysis and, as such, it has been shown to be useful for the investigation of unknown structure in the field of bioinformatics. Although a single "best" clustering tool for all related applications does not seem to exist [1], some methods are more often used than others. The K-means algorithm has been applied many times, e.g., to gene expression data, where [2] is one of the frequently cited studies. In a comparison of several methods, K-means generated clusters with better structural quality, more consistent with the biological background information. However, K-means was relatively sensitive to noise perturbation in the data [1]. Another problem is that K-means solutions depend on the initial configuration. (K-means belongs to the clustering schemes based on function optimization and represents a calculus for local optimization. [3]). Although generally known, the optimization problem is often neglected in practical studies. This may be unjustified. It has been demonstrated for biological data, recorded with different techniques, that rerunning K-means with varying (e.g. random) initial configuration may yield quite different clustering partitions of the same data sets [4, 5]. Therefore, it may be potentially useful to consider global optimization also for the clustering of gene expression data or other types of data in this field.

METHODS
It is assumed here that the approach of distance-based partitioning clustering is considered promising, and the choices of the metric and other algorithmic parameters are appropriate. Under these circumstances, the potential benefit of global against local optimization was investigated by comparing the results of several clustering algorithms: (1) the common K-means (KM) algorithm, (2) the fuzzy K-means (FCM) algorithm, (3) the Random Seach among Centroids (RSC) algorithm [6], and (4) the Stochastic Relaxation with Decoder perturbation (SRD) [7]. KM and FKM represent deterministic hill-descent algorithms, once the initial state has been defined. RSC and SRD are stochastic optimization tools, and have been constructed for a gradient-free global minimization of the typical cost function of partitioning cluster analysis. In this respect, the RSC and SRD algorithms may be assigned to one class of methods together with the so-called simulated annealing (SA), see e.g. [8], and genetic algorithms. However, in comparison with these methods, the RSC and SRD can be considered to be computationally more efficient [6, 7].

RESULTS
Among others we analyzed freely available data sets, such as that of acute myeloid and lymphoblastic leukemia, reported in a study of Golub et al. (1999), and that of the yeast cell cycle data provided by Cho et al. (1998). The latter data set is useful, because it has established itself as a de facto standard for the assessment of newly developed clustering algorithms [8]. Furthermore, measures of the quality, reliability, and efficiency of the clustering solutions are presented.

DISCUSSION
It is demonstrated that clustering, based upon a local optimization strategy, yielded a variety of partitions, not only with different values of the objective function, but also with different cluster membership tables. This is shown to yield ambiguous, and potentially inappropriate, conclusions about the real clustering structure of gene expression patterns. We make aware of the fact that - like in [4, 5] - clustering according to a global optimization approach may improve the basis for the interpretation of the results with respect to their biological context.
[1] Chen et al.: Evaluation and comparison of clustering algorithms in analyzing ES cell gene expression data. Statistica Sinica 12 (2002) 241-262.
[2] Tavazoie et al.: Systematic determination of genetic network architecture. Nature Genetics 22 (1999) 281-285.
[3] Theodoridis and Koutroumbas: Pattern Recognition, Academic Press, San Diego, 1998.
[4] Möller et al.: Verbesserte Strukturierung von vorsegmentierten EEG-Abschnitten durch Clusterbildung mit globaler Optimierung. Eine methodische Studie. Z. EEG-EMG 27 (1996) 105-110.
[5] Möller et al.: Pitfalls in the clustering of neuroimage data and improvements by global optimization strategies. NeuroImage 14 (2001) 206-218.
[6] Möller et al.: An efficient vector quantizer providing globally optimal solutions. IEEE Transactions on Signal Processing 46 (1998) 2515-2529.
[7] Zeger et al.: Globally optimal vector quantizer design by stochastic relaxation. IEEE Transactions on Signal Processing 40 (1992) 310-322.
[8] Lukashin and Fuchs: Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 17 (2001) 405-414.