ECCB 2002 Poster

Title: Improved gene selection for classification of microarrays
P70
Jaeger, Jochen; Sengupta, Rimli; Ruzzo, Walter L.

jaeger@molgen.mpg.de
Department of Computer Science & Engineering, University of Washington; Max Planck Institute for Molecular Genetics, Berlin

We present improved techniques for evaluating and selecting informative genes from microarray data used for tasks such as tumor/normal tissue classification. We compare five existing gene selection methods and evaluate their performance using support vector machines and leave-one-out cross-validation. We then propose new methods that can drastically improve classification.
Typically, informative genes are selected in rank order according to a statistical test score such as a t-test p-value. A problem with this approach is that it may lead to the selection of many highly correlated genes. In general, we expect high correlation to have a meaningful biological explanation. If, for example, genes A and B are in the same pathway, they may be similarly regulated and therefore have similar expression profiles. If gene A has a good test score, it is highly likely that gene B will as well. Hence a typical feature selection scheme is likely to include both genes in a classifier, yet the pair provides little additional information compared to either gene alone.
We could, of course, simply select more genes in order to capture all relevant ones. For classification tasks, however, more is not necessarily better, for several reasons. First, including more genes increases the computational cost of classification. More seriously, it can skew the classification result if, for example, many more genes are involved in one pathway than in others, since the classifier may give undue weight to the multiple redundant genes selected from the larger pathway. Furthermore, in the common case where the number of selected genes must be severely limited for budgetary or other reasons, selecting several individually informative but redundant genes from the majority pathway is likely to preclude the inclusion of representative genes from important minority pathways, adversely affecting overall classifier accuracy.
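The redundancy problem described above can be illustrated with a small sketch. It assumes a toy expression table invented for this example (the gene names, values, and labels are hypothetical, not taken from the datasets used in this work) and ranks genes by the magnitude of a Welch t-statistic rather than by a p-value, which is a simplification.

```python
from math import sqrt
from statistics import mean, variance

def t_score(group0, group1):
    """Absolute Welch t-statistic between two groups of expression values."""
    se = sqrt(variance(group0) / len(group0) + variance(group1) / len(group1))
    return abs(mean(group0) - mean(group1)) / se

def pearson(x, y):
    """Pearson correlation of two expression profiles across samples."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical toy data: 4 samples of class 0 followed by 4 samples of class 1.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "geneB": [2.0, 2.3, 1.9, 2.2, 4.1, 4.2, 3.8, 4.1],  # tracks geneA closely
    "geneC": [5.0, 4.0, 5.5, 4.5, 2.5, 3.5, 2.0, 3.0],  # informative, noisier
    "geneD": [1.0, 3.0, 2.0, 4.0, 2.5, 1.5, 3.5, 2.2],  # uninformative
}

def gene_score(gene):
    """Score one gene by how well it separates the two classes."""
    values = expr[gene]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    return t_score(g0, g1)

# Standard rank-order selection: take the best-scoring genes.
ranked = sorted(expr, key=gene_score, reverse=True)
top_two = ranked[:2]

# The two selected genes are almost perfectly correlated with each other.
redundancy = pearson(expr[top_two[0]], expr[top_two[1]])
print(top_two, round(redundancy, 3))
```

On this toy table the two top-ranked genes are geneA and geneB, whose profiles are nearly identical, so a two-gene budget is spent on what is effectively one signal while geneC is left out.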
Consequently, selecting a small number of genes that collectively allow high classification accuracy is an important goal and the focus of our work. Our approach is to first find similar genes, group them, and then select informative genes from these groups to avoid redundancy. In this work we compare classification based on five different test statistics (Fisher, Golub, Wilcoxon, TNoM, and t-test) on three publicly available datasets: Golub (47 ALL and 25 AML leukemia samples), Notterman (18 tumor and 18 normal samples), and Alon (40 adenocarcinoma and 22 normal samples). We propose two algorithms based on clustering and one based on correlated groups to find similar genes. The performance of the feature selection is measured using support vector machines and leave-one-out cross-validation scores. The result is that, for any fixed number of informative genes to retrieve, the methods outlined here identify sets of genes that are in almost all cases stronger predictors than sets found by standard methods. This should be of significant value for diagnostic purposes as well as for guiding further exploration of the underlying biology.
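As a rough illustration of the group-then-select idea, the sketch below greedily walks the ranked gene list and skips any gene whose profile is highly correlated with one already chosen. This is not the clustering or correlated-group algorithm of the poster, whose details are not given here; the function name `select_nonredundant`, the signed-correlation threshold of 0.9, the Welch t-statistic score, and the toy data are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance

def t_score(group0, group1):
    """Absolute Welch t-statistic between two groups of expression values."""
    se = sqrt(variance(group0) / len(group0) + variance(group1) / len(group1))
    return abs(mean(group0) - mean(group1)) / se

def pearson(x, y):
    """Pearson correlation of two expression profiles across samples."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def select_nonredundant(expr, labels, k, threshold=0.9):
    """Hypothetical helper: pick up to k genes in score order, skipping any
    gene whose signed correlation with an already chosen gene reaches the
    threshold (it is treated as belonging to that gene's group)."""
    def score(gene):
        values = expr[gene]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        return t_score(g0, g1)

    chosen = []
    for gene in sorted(expr, key=score, reverse=True):
        if all(pearson(expr[gene], expr[c]) < threshold for c in chosen):
            chosen.append(gene)  # first (best-scoring) member of a new group
        if len(chosen) == k:
            break
    return chosen

# Hypothetical toy data: 4 samples of class 0 followed by 4 samples of class 1.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "geneB": [2.0, 2.3, 1.9, 2.2, 4.1, 4.2, 3.8, 4.1],  # tracks geneA closely
    "geneC": [5.0, 4.0, 5.5, 4.5, 2.5, 3.5, 2.0, 3.0],  # informative, noisier
    "geneD": [1.0, 3.0, 2.0, 4.0, 2.5, 1.5, 3.5, 2.2],  # uninformative
}

selected = select_nonredundant(expr, labels, k=2)
print(selected)
```

With a budget of two genes, this sketch keeps geneA, skips geneB as a near-duplicate, and takes geneC instead. Signed correlation is used here so that anti-correlated genes count as distinct groups; whether that is appropriate depends on the data, and an evaluation such as leave-one-out cross-validation with a classifier would still be needed to check that the selected set actually predicts better.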
[1] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. RECOMB, (2000).
[2] J.L. Devore, Probability and Statistics for Engineering and the Sciences, 4th edition, Duxbury Press, (1995); Own unpublished results.
[3] P.J. Park, M. Pagano, M. Bonetti: A nonparametric scoring algorithm for identifying informative genes from microarray data. PSB:52-63, (2001).
[4] C.M. Bishop: Neural Networks for Pattern Recognition, Oxford University Press, (1995)
[5] T.R. Golub, D.K. Slonim, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, (1999).
[6] I. Lonnstedt and T. P. Speed. Replicated Microarray Data. Statistica Sinica, (2002).
[7] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs, Advances in Neural Information Processing Systems 13. MIT Press, (2001).
[8] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack and A.J. Levine: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96:6745-6750, (1999).
[9] Y.H. Yang, M.J. Buckley, S. Dudoit, and T.P. Speed: Comparison of methods for image analysis on cDNA microarray data. Technical report, (2000).
[10] Y. H. Yang, S. Dudoit, P. Luu and T. P. Speed. Normalization for cDNA Microarray Data. SPIE BiOS, (2001).
[11] C. Cortes and V. N. Vapnik. Support vector networks. Machine Learning, 20:273-297, (1995).
[12] http://www.cs.columbia.edu/~noble/svm/doc/
[13] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters", Journal of Cybernetics, 3:32-57, (1973).
[14] D.A. Notterman, U. Alon, A.J. Sierk, A.J. Levine: Transcriptional Gene Expression Profiles of Colorectal Adenoma, Adenocarcinoma and Normal Tissue Examined by Oligonucleotide Arrays, Cancer Research 61:3124-3130, (2001).
[15] C.E. Metz, Basic principles of ROC analysis, Seminars in Nuclear Medicine, Vol 8, No. 4, 283-298, (1978).
[16] G. Schwarz: Estimating the dimension of a model. Annals of Statistics, 6:461-464 (1978).
[17] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery and W. L. Ruzzo, Model-based clustering and data transformation for gene expression data, Bioinformatics 17:977-987 (2001).