Using Random Forest to Learn Imbalanced Data

July, 2004
Report Number: 
Chao Chen, Andy Liaw and Leo Breiman

In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based on cost-sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall, false positive rate and false negative rate, $F$-measure, and weighted accuracy are computed. Both methods are shown to improve the prediction accuracy of the minority class and to perform favorably compared with existing algorithms.
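As a rough illustration of the metrics named in the abstract, the following sketch computes them from the four counts of a binary confusion matrix, treating the minority class as the positive class. The function name `imbalance_metrics`, the example counts, and the `beta` parameter are assumptions made for illustration. In particular, weighted accuracy is taken here to be a convex combination of the true positive rate and the true negative rate, with equal weights by default; the report itself may define or weight it differently.

```python
def imbalance_metrics(tp, fn, fp, tn, beta=0.5):
    """Compute common imbalanced-classification metrics from confusion counts.

    tp, fn: minority (positive) cases predicted correctly / incorrectly.
    fp, tn: majority (negative) cases predicted incorrectly / correctly.
    beta:   assumed weight on the true positive rate in weighted accuracy.
    """
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)       # true positive rate
    tnr = ratio(tn, tn + fp)          # true negative rate
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * recall, precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": 1.0 - tnr if (tn + fp) else 0.0,
        "false_negative_rate": 1.0 - recall if (tp + fn) else 0.0,
        "f_measure": f_measure,
        # Assumed form: beta * TPR + (1 - beta) * TNR.
        "weighted_accuracy": beta * recall + (1.0 - beta) * tnr,
    }


if __name__ == "__main__":
    # Hypothetical counts: 50 minority cases and 950 majority cases.
    for name, value in imbalance_metrics(tp=40, fn=10, fp=60, tn=890).items():
        print(f"{name}: {value:.4f}")
```

Overall accuracy is omitted on purpose: with a 5% minority class, a classifier that always predicts the majority class would score 95% while having zero recall, which is why the per-class rates above are the more informative quantities.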
