Hoffmann, Martin;Lau, Stephan;von Eggeling, Ferdinand;Junker, Kerstin;Guthke, Reinhard - Supervised Classification of Pathological States in Gene and Protein Expression Data

ECCB 2002 Poster

Title: Supervised Classification of Pathological States in Gene and Protein Expression Data (Poster P61)
Hoffmann, Martin; Lau, Stephan; von Eggeling, Ferdinand; Junker, Kerstin; Guthke, Reinhard
MHoffman@pmail.hki-jena.de
Hans Knöll Institute for Natural Products Research, Jena; Friedrich Schiller University, Institute for Human Genetics and Anthropology, Jena; Friedrich Schiller University, Clinic of Urology, Jena

We assess the ability of different classification algorithms to identify pathological states in gene and protein expression data. The two datasets used are the leukaemia gene expression data published by Golub et al. [1] and a protein expression dataset [2]. The algorithms tested comprise variants of nearest neighbour classifiers (unweighted and weighted kNN), support vector machines (SVM) with linear, polynomial and radial basis function (RBF) kernels [3,4] and a binary tree classifier [5].
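To make the distinction between unweighted and weighted kNN concrete, the following is a minimal sketch, not the implementation used for the poster. The inverse-distance weighting, the Euclidean metric, the function names and the toy values are all assumptions chosen for illustration; the abstract does not specify the weighting scheme.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3, weighted=False):
    """Classify `query` by a vote among its k nearest training samples.

    Unweighted: every neighbour contributes one vote.
    Weighted (assumed scheme): each neighbour contributes 1 / distance,
    so closer samples have more influence.
    """
    neighbours = sorted(
        (euclidean(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        votes[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Toy profiles with two features per sample; the values are made up.
train_x = [(0.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
train_y = ["ALL", "AML", "AML"]
query = (1.0, 0.0)

unweighted = knn_predict(train_x, train_y, query, k=3, weighted=False)
weighted = knn_predict(train_x, train_y, query, k=3, weighted=True)
```

In this toy case the unweighted vote follows the two more distant AML samples, while the weighted vote follows the single close ALL sample, which is the kind of difference the two variants can show.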

The leukaemia dataset contains the gene expression profiles (6817 monitored genes) of 47 acute lymphoblastic leukaemia (ALL) and 25 acute myeloid leukaemia (AML) patients, obtained from either bone marrow samples (62 patients) or peripheral blood samples (10 patients). The protein expression data (43 monitored proteins) originate from 3 different renal tissue locations (normal, peripheral tumour and central tumour) of 8 renal cell carcinoma (RCC) patients and were obtained by SELDI-TOF-MS experiments. The protein dataset is treated as 23 independent tissue samples (one of the 24 samples is missing).

We extend the analysis of Golub and co-workers (leukaemia data) in two ways. First, instead of choosing only one learning data set of 38 out of 72 patients to build a classifier, we evaluate 10000 randomly chosen learning data sets to obtain a more reliable assessment of classification performance. Second, we investigate how far the number of informative genes used for classification can be reduced, in order to identify the most significant diagnostic gene subset. The same strategy is also applied to the protein expression data (three-tissue classification problem) to search for protein patterns relevant to the progression of renal cell carcinoma.
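The repeated random partitioning described above can be sketched as follows. This is an illustrative outline under stated assumptions: the data are synthetic Gaussian stand-ins (not the Golub data), a 1-nearest-neighbour rule stands in for the classifiers tested, the names `nearest_neighbour` and `repeated_split_accuracy` are hypothetical, and only 200 repeats are run here rather than 10000 to keep the example quick.

```python
import random

def nearest_neighbour(train, query):
    """Return the label of the training sample closest to `query` (1-NN)."""
    closest = min(
        train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query))
    )
    return closest[1]

def repeated_split_accuracy(samples, n_learn, n_repeats, seed=0):
    """Mean and minimum test-set accuracy over random learning/test partitions."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        learn, test = shuffled[:n_learn], shuffled[n_learn:]
        correct = sum(nearest_neighbour(learn, x) == y for x, y in test)
        rates.append(correct / len(test))
    return sum(rates) / len(rates), min(rates)

# Synthetic stand-in: 47 "ALL" and 25 "AML" samples with two features each.
rng = random.Random(1)
samples = [((rng.gauss(0, 1), rng.gauss(0, 1)), "ALL") for _ in range(47)]
samples += [((rng.gauss(3, 1), rng.gauss(3, 1)), "AML") for _ in range(25)]

# 38 learning samples leave 34 test samples per partition, as in the text.
mean_rate, worst_rate = repeated_split_accuracy(samples, n_learn=38, n_repeats=200)
```

Reporting both the mean and the worst partition gives a sense of how much a single fixed learning/test split can over- or under-state performance.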

For the gene expression data we find that Golub's choice of learning and test sets was slightly sub-optimal: the 72 patients can be partitioned into learning and test sets (38 and 34 patients respectively) that give a 100% classification rate in cross validation. The unweighted nearest neighbour classifier and SVMs with linear and RBF kernels perform comparably. On average, about 33 out of 34 samples are classified correctly using all 50 genes pre-selected by Golub et al. The same result can be obtained with only two genes, cyclin D3 (CCND3) and cystatin C (CST3). Furthermore, the subset of selected genes strongly determines which patients are systematically misclassified.

The results for the three-class protein expression problem are encouraging as well. With the unweighted nearest neighbour method, 20 out of 23 samples are classified correctly in cross validation, using only 4 or 5 selected proteins. Which proteins give the best result depends, however, to some extent on the scaling of the data and on whether a log-transformation is applied. Classification without cross validation error is obtained for a two-class problem (normal vs. peripheral tumour) using a linear SVM and 2 selected proteins, and for the three-class problem (normal, peripheral and central tumour) using a polynomial SVM and 6 selected proteins.
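Why scaling and log-transformation matter for a nearest neighbour classifier can be illustrated with a small sketch. It assumes leave-one-out cross validation and a 1-nearest-neighbour rule (the abstract does not state which cross validation scheme was used), and the intensity values and function names are invented for the example. For brevity the scaling statistics are computed on all samples before cross validation.

```python
import math

def log_transform(rows):
    """Natural log of every value (assumes strictly positive intensities)."""
    return [[math.log(v) for v in row] for row in rows]

def standardise(rows):
    """Scale each feature (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def loo_correct(rows, labels):
    """Count correct predictions in leave-one-out cross validation (1-NN)."""
    correct = 0
    for i, query in enumerate(rows):
        others = [(sum((a - b) ** 2 for a, b in zip(r, query)), labels[j])
                  for j, r in enumerate(rows) if j != i]
        correct += min(others)[1] == labels[i]
    return correct

# Made-up intensities: feature 1 is large but uninformative,
# feature 2 is small but separates the two classes.
rows = [[100, 1.0], [900, 1.1], [500, 0.9],
        [300, 2.0], [700, 2.1], [1100, 1.9]]
labels = ["normal"] * 3 + ["tumour"] * 3

raw_correct = loo_correct(rows, labels)
scaled_correct = loo_correct(standardise(rows), labels)
log_correct = loo_correct(log_transform(rows), labels)
```

On these toy values the raw data give 0 of 6 correct because the large uninformative feature dominates the distance, whereas the standardised data give 6 of 6. The same feature subset can therefore look poor or excellent depending on preprocessing.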

In summary, simple nearest neighbour classifiers perform as well as support vector machines on Golub's leukaemia problem but lag somewhat behind SVMs on the renal cell carcinoma problem.

[1] T. R. Golub et al.: SCIENCE 286, 1999.
[2] D. Woetzel et al.: Applying Data Mining Methods to SELDI-TOF Analysed Renal Cell Carcinoma Samples to Identify Tumor Markers, Poster, ECCB 2002, submitted.
[3] N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines, Cambridge UP, 2000.
[4] http://eewww.eng.ohio-state.edu/~maj/osu_svm/.
[5] J. R. Quinlan: C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.