Distribution-balanced stratified cross-validation for accuracy estimation

2000 · Journal of Experimental & Theoretical Artificial Intelligence · Cited by 237
Topics: Machine Learning and Data Classification; Imbalanced Data Classification Techniques; Data Mining Algorithms and Applications

Abstract

Cross-validation is often applied in machine learning research to estimate the accuracy of classifiers. In this work, we propose an extension to this method, called distribution-balanced stratified cross-validation (DBSCV), which improves the quality of the estimate by providing balanced intraclass distributions when partitioning a data set into multiple folds. We have tested DBSCV on nine real-world and three artificial domains using the C4.5 decision tree classifier. The results show that DBSCV performs better (has smaller biases) than regular stratified cross-validation in most cases, especially when the number of folds is small. The analysis and experiments based on the three artificial data sets also reveal that DBSCV is particularly effective when multiple intraclass clusters exist in a data set.

Keywords: cross-validation; machine learning research; true accuracy; classifier
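The abstract does not spell out the partitioning procedure, so the following is only a minimal sketch of one way to obtain balanced intraclass distributions, not the authors' exact algorithm. It assumes numeric feature vectors and squared Euclidean distance: within each class it starts from a random instance, repeatedly hops to the nearest not-yet-assigned instance of the same class, and deals the visited instances to the folds in round-robin order, so that neighbouring instances (and hence each intraclass cluster) are spread across folds. The function name `distribution_balanced_folds` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length numeric vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def distribution_balanced_folds(X, y, k, seed=0):
    """Return a list giving a fold index (0..k-1) for every instance.

    Assumption (not taken from the abstract): instances of one class are
    visited along a nearest-neighbour chain and dealt to folds in turn.
    """
    rng = random.Random(seed)

    # Group instance indices by class label (the stratification step).
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    fold_of = [None] * len(X)
    next_fold = 0  # carried across classes to keep overall fold sizes even

    for label in sorted(by_class):
        remaining = set(by_class[label])
        current = rng.choice(sorted(remaining))  # random starting instance
        while True:
            fold_of[current] = next_fold
            next_fold = (next_fold + 1) % k
            remaining.discard(current)
            if not remaining:
                break
            # Hop to the closest unassigned instance of the same class;
            # ties are broken by index to keep the result deterministic.
            current = min(
                remaining, key=lambda j: (sq_dist(X[current], X[j]), j)
            )
    return fold_of
```

Calling `distribution_balanced_folds(X, y, k=2)` on a data set whose classes each contain two tight, well-separated clusters should place about half of every cluster in each fold, whereas ordinary stratification only guarantees that the class proportions match. The nearest-neighbour search here is quadratic in the class size, which is acceptable for a sketch but would need an index structure for large data sets.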