ECCB 2002 Poster

Title: Searching for biologically meaningful groupings in microarray clustering data
P141
Schacherer, Frank; Scheel, Hartmut; Hofmann, Kay

frank.schacherer@memorec.com
Bioinformatics Group, MEMOREC Stoffel GmbH, Stöckheimer Weg 1, D-50829 Köln, Germany

A usual first step in the interpretation of large amounts of microarray data is the clustering of genes by their expression properties. Several unsupervised clustering methods are in common use, including K-means clustering, hierarchical clustering, self-organizing maps and others. Of particular importance are hierarchical clustering schemes, which arrange the data in a tree-like structure in which genes with similar expression patterns occupy neighboring 'leaves' of the tree. The major advantage of this method is that it allows the data to be analyzed at different levels of 'granularity': it is possible to look at various 'family sizes' defined by various degrees of expression similarity. The algorithms used for hierarchical clustering are largely the same as those used for distance-based phylogenetic reconstruction from sequence data, but restricted to methods that are fast enough to deal with large numbers of nodes.
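The idea of reading one tree at several levels of granularity can be illustrated with a minimal sketch. It assumes Euclidean distance and average linkage, neither of which is specified in the abstract, and the gene names, expression values and thresholds are invented for illustration. The naive merge loop is only suitable for toy data, not for real arrays.

```python
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Naive agglomerative clustering (average linkage, an assumption here).

    Returns the merges in the order they happen, each as
    (members_a, members_b, distance). N genes yield N-1 merges.
    """
    clusters = [frozenset([name]) for name in profiles]
    merges = []

    def dist(pair):
        a, b = pair
        total = sum(euclidean(profiles[g], profiles[h]) for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=dist)
        merges.append((a, b, dist((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

def cut_tree(profiles, merges, threshold):
    """Groups obtained by applying only the merges at or below `threshold`."""
    groups = {frozenset([name]) for name in profiles}
    for a, b, d in merges:
        if d <= threshold:
            groups -= {a, b}
            groups.add(a | b)
    return sorted(sorted(g) for g in groups)

# Hypothetical expression profiles (three conditions per gene).
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [5.2, 4.9, 5.1],
    "geneE": [9.0, 0.0, 1.0],
}

merges = average_linkage(profiles)
fine = cut_tree(profiles, merges, 1.0)     # tight families only
coarse = cut_tree(profiles, merges, 6.0)   # looser similarity, larger families
print(fine)
print(coarse)
```

Raising the threshold merges the small, tight families into larger ones, which is the sense in which one tree supports several 'family sizes'.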
Provisions for testing the statistical significance of individual 'clades', e.g. by bootstrapping, are commonplace in phylogenetic reconstruction but are not normally used in expression clustering. Without such a filter, every internal node of the tree is a candidate cluster, so a microarray experiment with N genes results in N-1 clusters. Clearly, not all of these clusters can be scrutinized manually for their biological relevance. We present a method for the automatic re-annotation of hierarchical clustering data by looking for statistically significant 'enrichment' of biological properties in co-expression clusters. The biological properties used for the re-annotation can be heterogeneous: any classification scheme for genes or gene products is possible. Interesting applications include data from biological pathways, multi-protein complexes, subcellular localization, or pre-established co-expression data coming from other experiments. The significance of the cluster-wide enrichment of the biological properties is assessed by methods of inferential statistics, e.g. by Fisher's exact test. We will demonstrate in a number of example applications how this approach can be used to gain biological insight into co-expression patterns.
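As an illustration of the enrichment test, the following minimal sketch computes a one-sided Fisher's exact test for a single cluster and a single biological property, using the hypergeometric upper tail. It is not the authors' implementation: the function name, the gene identifiers and the annotation set are hypothetical, and the one-sided form is an assumption, since the abstract does not say which variant is used.

```python
from math import comb

def enrichment_p_value(cluster, annotated, universe):
    """One-sided Fisher's exact test for over-representation.

    Returns the probability of seeing at least as many annotated genes
    in a random gene set of the same size as `cluster`, drawn from
    `universe` (hypergeometric upper tail).
    """
    cluster = cluster & universe
    annotated = annotated & universe
    n_total = len(universe)             # all genes on the array
    n_annot = len(annotated)            # genes carrying the property
    n_clust = len(cluster)              # genes in the cluster
    n_hits = len(cluster & annotated)   # annotated genes inside the cluster

    tail = sum(
        comb(n_annot, i) * comb(n_total - n_annot, n_clust - i)
        for i in range(n_hits, min(n_annot, n_clust) + 1)
    )
    return tail / comb(n_total, n_clust)

# Hypothetical data: 20 genes, 5 of them annotated with some pathway.
universe = {f"g{i}" for i in range(20)}
annotated = {"g0", "g1", "g2", "g3", "g4"}
cluster = {"g0", "g1", "g2", "g10"}    # 3 of the 4 members are annotated

p = enrichment_p_value(cluster, annotated, universe)
print(round(p, 4))  # 0.032
```

In the setting described above, such a test would be repeated for each of the N-1 clusters and each property of interest, and clusters with small p-values would be flagged for closer inspection.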