Class Imbalance
See Classification.
1. Abstract
- refers to an uneven number of instances per class in a dataset
- an imbalanced dataset can cause problems at several stages of the pipeline, so extra care may be needed for it to run as intended
- the minority class may have too few examples for its characteristics to be learned, and splitting without stratification can yield training/validation sets whose class distributions differ from that of the full dataset
- evaluation metrics may also be misleading if some classes are under-represented: this calls for a family of class-balanced loss functions/metrics that give fair consideration to all classes during training/evaluation
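The last two points can be made concrete with a small sketch. It uses only the Python standard library; the helper names (`stratified_split`, `balanced_accuracy`) and the 90/10 toy labels are assumptions for illustration, not taken from any particular library. It splits each class at the same ratio, then compares plain accuracy with balanced accuracy (the mean of per-class recall) for a classifier that always predicts the majority class.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

labels = [0] * 90 + [1] * 10            # toy 90/10 imbalance (hypothetical data)
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_labels = [labels[i] for i in test_idx]
print(Counter(test_labels))             # class ratio is preserved in the test set

always_majority = [0] * len(labels)     # a classifier that ignores the minority
print(accuracy(labels, always_majority))
print(balanced_accuracy(labels, always_majority))
```

On this toy data the always-majority classifier gets a high plain accuracy while its balanced accuracy stays at chance level for two classes, which is the kind of gap a class-balanced metric is meant to expose.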
2. Solutions
One may deal with the issue by adjusting either the training data or the algorithm itself.
2.1. Algorithm Oriented
- weighting the rarer classes more heavily during training can compensate for their under-representation
- some algorithms, such as random forests and gradient boosting, tend to handle minority classes better than others and may serve as good quick baselines
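As a minimal sketch of the weighting idea, the snippet below derives inverse-frequency class weights using the formula n_samples / (n_classes * class_count). This is the same heuristic scikit-learn documents for `class_weight="balanced"`; the function name `balanced_class_weights` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

labels = [0] * 90 + [1] * 10             # toy 90/10 imbalance (hypothetical data)
weights = balanced_class_weights(labels)

# Per-sample weights, e.g. to multiply into a per-example loss term.
sample_weights = [weights[y] for y in labels]

# With these weights every class contributes the same total weight.
total_per_class = Counter()
for y, w in zip(labels, sample_weights):
    total_per_class[y] += w
print(weights)
print(dict(total_per_class))
```

Many training APIs accept either a per-class weight mapping or per-sample weights; which one applies depends on the library in use.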
2.2. Data Set Oriented
2.2.1. Oversampling
- replicating data points from the minority class, with or without minor tweaks
- two common oversampling methods:
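Independently of any specific named method, the plain replication variant (no tweaks) can be sketched as below. This is a simplified illustration using only the standard library; the function name `random_oversample` and the toy data are assumptions, and the sketch simply draws minority samples with replacement until every class matches the largest one.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement; k == 0 for the largest class adds nothing.
        extra = rng.choices(pool, k=target - n)
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels

samples = list(range(100))               # stand-in feature values (hypothetical)
labels = [0] * 90 + [1] * 10
over_x, over_y = random_oversample(samples, labels)
print(Counter(over_y))
```

Oversampling is normally applied to the training split only, so that duplicated points do not leak into validation data.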
2.2.2. Undersampling
- keeping only a randomly selected subset of the over-represented classes to balance the dataset, for cases where duplicating data points is undesirable
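Random undersampling can be sketched in the same spirit. The function name `random_undersample` and the toy data are assumptions for illustration; the sketch samples without replacement, so no data point is duplicated, and shrinks every class to the size of the smallest one.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        keep.extend(rng.sample(idx, target))   # sampling without replacement
    keep.sort()                                # preserve the original ordering
    return [samples[i] for i in keep], [labels[i] for i in keep]

samples = list(range(100))               # stand-in feature values (hypothetical)
labels = [0] * 90 + [1] * 10
under_x, under_y = random_undersample(samples, labels)
print(Counter(under_y))
```

The trade-off is that most majority-class examples are discarded, which may matter when the dataset is small to begin with.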