Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

June, 2000
Report Number: 
Persi Diaconis & Steven N. Evans
J. Comb. Theory A. 98,175-191 (2002)

A reliable and precise classification of tumors is essential for successful treatment of cancer. cDNA microarrays and high-density oligonucleotide chips are novel biotechnologies which are being used increasingly in cancer research. By allowing the monitoring of expression levels for thousands of genes simultaneously, such techniques may lead to a more complete understanding of the molecular variations among tumors and hence to a finer and more informative classification. The ability to successfully distinguish between tumor classes (already known or yet to be discovered) using gene expression data is an important aspect of this novel approach to cancer classification.

In this paper, we compare the performance of different discrimination methods for the classification of tumors based on gene expression data. These methods include: nearest neighbor classifiers, linear discriminant analysis, and classification trees. In our comparison, we also consider recent machine learning approaches such as bagging and boosting. We investigate the use of prediction votes to assess the confidence of each prediction. The methods are applied to datasets from three recently published cancer gene expression studies.
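
As a rough illustration of how prediction votes can accompany an aggregated classifier, the following minimal Python sketch bags a 1-nearest-neighbor classifier and reports the fraction of bootstrap votes won by the predicted class. This is a sketch under stated assumptions, not the procedure used in the report: the function names, the toy two-feature data, the choice of squared Euclidean distance, and the definition of the vote as the winning class's share of the bootstrap predictions are all illustrative.

```python
import random
from collections import Counter


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training sample closest to x (squared Euclidean distance)."""
    best_index = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best_index]


def bagged_prediction_with_vote(train_X, train_y, x, n_bootstrap=50, seed=0):
    """Bagged 1-NN prediction plus the share of votes won by the predicted class.

    Assumption: the "prediction vote" is taken to be the proportion of
    bootstrap classifiers that agree with the aggregated prediction.
    """
    rng = random.Random(seed)
    n = len(train_X)
    votes = Counter()
    for _ in range(n_bootstrap):
        # Draw a bootstrap sample of the training set (with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        boot_X = [train_X[i] for i in idx]
        boot_y = [train_y[i] for i in idx]
        votes[nearest_neighbor_predict(boot_X, boot_y, x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_bootstrap


# Hypothetical toy data: two "genes" measured on six tumors from classes A and B.
train_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
train_y = ["A", "A", "A", "B", "B", "B"]

label_a, vote_a = bagged_prediction_with_vote(train_X, train_y, [0.1, 0.1])
label_b, vote_b = bagged_prediction_with_vote(train_X, train_y, [2.1, 2.0])
```

A vote close to 1 indicates that nearly all bootstrap classifiers agree on the predicted class, while a vote near 0.5 (for two classes) flags a prediction that deserves less confidence.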
