Wednesday, April 27, 2011

Unbalanced data sets in machine learning

http://www.openstarts.units.it/dspace/bitstream/10077/4002/1/Menardi%20Torelli%20DEAMS%20WPS2.pdf

It has been widely reported that the class imbalance heavily compromises the process of learning, because the model tends to focus on the prevalent class and to ignore the rare events (Japkowicz and Stephen, 2002).

However, unless the classes are perfectly separable (Hand and Vinciotti, 2003) or the complexity of the problem is low (Japkowicz and Stephen, 2002),
neglecting the imbalance has serious consequences, both for estimating the model and for measuring the accuracy of the estimated model.

What typically happens in such a situation is that standard classifiers tend to be overwhelmed by the prevalent class and ignore the rare examples.

Fixes:
1. A first approach modifies the classifier itself to compensate for the imbalance. It is generally applied to classifiers whose training optimizes some function of overall accuracy, for example by attaching a higher misclassification cost to errors on the rare class.
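A common instance of this classifier-level fix is inverse-frequency class weighting (what scikit-learn calls `class_weight='balanced'`). The sketch below is mine, not from the paper, but the weighting formula `n_samples / (n_classes * n_c)` is the standard one:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = n_samples / (n_classes * n_c).

    Each training example of class c is weighted by w_c in the loss,
    so errors on the rare class cost proportionally more.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

# 95 majority examples, 5 rare ones
y = [0] * 95 + [1] * 5
w = balanced_class_weights(y)
print(w)  # the rare class (1) receives a much larger weight than class 0
```

With a 95:5 split the rare class ends up weighted 19 times more heavily, which is exactly the inverse of the class ratio.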
2. Solutions at the data level alter the class distribution to obtain a more balanced sample, by oversampling the rare class or undersampling the prevalent one. Altering the class distribution of the training data helps learning with highly skewed datasets because it effectively imposes non-uniform misclassification costs.
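The simplest data-level scheme is random oversampling: duplicate rare-class examples (sampled with replacement) until the classes are balanced. A minimal sketch, assuming a binary problem and list-shaped data (the function name and signature are mine, for illustration):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority examples at random until both classes are equal-sized."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    # Sample with replacement to fill the gap between the two class counts.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    Xs, ys = zip(*combined)
    return list(Xs), list(ys)

X = list(range(100))
y = [0] * 95 + [1] * 5
Xb, yb = random_oversample(X, y, minority_label=1)
print(yb.count(0), yb.count(1))  # both classes now have 95 examples
```

Duplicating each rare example roughly k times is equivalent to charging k times more for misclassifying it, which is the cost link the paper points out; random undersampling of the prevalent class achieves the same balance by discarding data instead.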
