Class Imbalance

See Classification.

1. Abstract

  • refers to the imbalance of the number of instances for each class in a data set
  • an imbalanced dataset can cause problems at several stages of the pipeline, so extra care may be needed for each stage to behave as expected
  • a rare class may have too few instances for a model to learn its characteristics, and the training/validation sets may not follow the overall class distribution unless the split is stratified
  • evaluation metrics may also be misleading if some classes are under-represented: this calls for a family of class-balanced loss functions/metrics that give fair consideration to all classes during training and evaluation
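
The last two points can be illustrated with a minimal sketch in plain Python. The helper names `stratified_split` and `balanced_accuracy` are hypothetical and written only for this example; the 95/5 label distribution is made-up toy data. The sketch shows a per-class split that keeps the class ratio in both parts, and how plain accuracy flatters a classifier that always predicts the majority class while the mean per-class recall does not.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices class by class so both parts keep roughly the same class ratio.

    Assumption of this sketch: every class has at least two instances,
    otherwise a class could end up only in the test part.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Take at least one instance of every class into the test part.
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data: 95 instances of class 0 and 5 of class 1.
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = Counter(labels[i] for i in test_idx)

# A degenerate classifier that always predicts the majority class.
always_majority = [0] * len(labels)
plain_acc = sum(t == p for t, p in zip(labels, always_majority)) / len(labels)
bal_acc = balanced_accuracy(labels, always_majority)
```

On this toy data the plain accuracy looks high although the minority class is never predicted, whereas the balanced variant drops to chance level for two classes.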

2. Solutions

One may deal with the issue by modifying either the training data or the learning algorithm itself.

2.1. Algorithm Oriented

  • weighting the rarer class more heavily during training can compensate for its under-representation
    • some algorithms (e.g. random forests and gradient boosting) tend to cope with minority classes better than others and may serve as good quick baselines
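
As a rough sketch of class weighting, the snippet below derives inverse-frequency weights and applies them to a binary log loss. The weighting formula n / (k * n_c) is one common heuristic and an assumption here, not something the notes prescribe; the function names and the 90/10 toy labels are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n / (k * n_c) for each class c."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_log_loss(y_true, prob_positive, weights):
    """Binary cross-entropy where each instance is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, prob_positive):
        w = weights[y]
        total += -w * math.log(p if y == 1 else 1.0 - p)
        weight_sum += w
    return total / weight_sum


# Toy data: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# A model that assigns probability 0.1 to the positive class for every instance.
probs = [0.1] * len(labels)
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
```

With these weights both classes contribute the same total weight, so a model that ignores the minority class is penalised more strongly than under the unweighted loss.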

2.2. Data Set Oriented

2.2.1. Oversampling
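
One simple form, assumed here because the notes do not name a specific method, is random oversampling: instances of the under-represented classes are duplicated at random until every class matches the largest one. The function `random_oversample` and the toy data are hypothetical and only sketch the idea.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority instances until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement to fill the gap up to the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data: 10 instances of class 0 and 2 of class 1.
samples = list(range(12))
labels = [0] * 10 + [1] * 2
over_x, over_y = random_oversample(samples, labels)
```

Resampling of this kind would normally be applied to the training split only, so that duplicated instances do not leak into validation data.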

2.2.2. Undersampling

  • proceed with only a randomly selected subset of the instances of the over-represented classes, which balances the dataset without the duplication that oversampling introduces, at the cost of discarding data
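
A minimal sketch of random undersampling follows, assuming that every class is reduced to the size of the smallest one. The function `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Sample without replacement, so no instance appears twice.
        for x in rng.sample(xs, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data: 10 instances of class 0 and 2 of class 1.
samples = list(range(12))
labels = [0] * 10 + [1] * 2
under_x, under_y = random_undersample(samples, labels)
```

The result is balanced and free of duplicates, but most of the majority class is thrown away, which can hurt when the dataset is small to begin with.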