Class Imbalance

1. Abstract

Class imbalance refers to an uneven number of instances per class in a dataset.
An imbalanced dataset can cause problems at different stages of the pipeline, so extra care may be needed for the pipeline to behave as intended.
It may leave too little data to learn the characteristics of a specific class and, if the dataset is split without stratification, produce training and validation sets whose class distributions differ from that of the full dataset.
Evaluation metrics may also be misleading if some classes are under-represented: this calls for a family of class-balanced loss functions and metrics that give fair consideration to all classes during training and evaluation.
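
To make the two points above concrete, here is a minimal sketch of a stratified split and of balanced accuracy (the mean of per-class recall). The function names `stratified_split` and `balanced_accuracy` and the 90/10 toy labels are assumptions made for illustration, not part of these notes; libraries such as scikit-learn offer equivalent functionality (for example the `stratify` argument of `train_test_split`).

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so each class is split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        if len(indices) > 1:
            # At least one sample per class on each side of the split.
            n_test = min(len(indices) - 1,
                         max(1, round(len(indices) * test_fraction)))
        else:
            n_test = 0  # a single sample can only go to the training side
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data: 90 majority-class samples (0) and 10 minority-class samples (1).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(labels)
plain_accuracy = sum(t == p for t, p in zip(labels, y_pred)) / len(labels)
```

On this toy data the always-majority classifier looks strong under plain accuracy but scores no better than chance under balanced accuracy, which is the kind of misleading result the notes warn about.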

One may deal with the issue by adjusting either the training data or the algorithm itself.

Weighting the rarer class more heavily during training can compensate for its under-representation.
- Some algorithms, such as random forests and gradient boosting, tend to handle minority classes better than others and may serve as good quick baselines.
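
As a rough illustration of class weighting, the sketch below derives inverse-frequency weights and plugs them into a weighted cross-entropy loss. The weighting formula `n_samples / (n_classes * count)` is one common heuristic and is assumed here, as are the helper names `inverse_frequency_weights` and `weighted_cross_entropy`; other weighting schemes are possible.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(y_true, probs, weights):
    """Weighted mean of -log p(true class).

    probs[i] is a dict mapping each class to its predicted probability
    for sample i.
    """
    total = sum(-weights[y] * math.log(p[y]) for y, p in zip(y_true, probs))
    return total / sum(weights[y] for y in y_true)


# Toy data: 90 majority-class samples (0) and 10 minority-class samples (1).
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)

# With these weights both classes contribute the same total weight to the loss.
total_weight_per_class = {
    c: weights[c] * labels.count(c) for c in weights
}
```

With this choice every class carries the same total weight, so errors on the ten minority samples matter as much in aggregate as errors on the ninety majority samples.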

Oversampling replicates data points from the minority class, with or without minor modifications.
- Two common oversampling methods:
  1. Synthetic minority oversampling technique (SMOTE)
  2. Adaptive synthetic sampling method (ADASYN)
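
The core idea behind SMOTE can be sketched as interpolation between a minority point and one of its nearest minority neighbours. The code below is a simplified illustration under that assumption, not a reference implementation; the name `smote_like` and the parameter `k` are hypothetical. ADASYN builds on the same interpolation but generates more synthetic points around minority samples that are surrounded by majority-class neighbours, which this sketch does not attempt. In practice, the imbalanced-learn package provides `SMOTE` and `ADASYN` classes.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    minority: list of equal-length tuples of floats (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region already occupied by the minority class rather than being exact copies.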

Undersampling keeps only a randomly selected subset of the over-represented classes, which balances the dataset without duplicating any data points.
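
A minimal sketch of random undersampling follows, assuming the goal is to shrink every class to the size of the smallest one. The helper name `random_undersample` and the toy data are assumptions for illustration; other target sizes are equally valid.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    target = min(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # rng.sample draws without replacement, so no sample is duplicated.
        for x in rng.sample(xs, target):
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data: 90 majority-class samples (0) and 10 minority-class samples (1).
samples = list(range(100))
labels = [0] * 90 + [1] * 10
balanced_samples, balanced_labels = random_undersample(samples, labels)
```

The trade-off is that the discarded majority-class samples are no longer available for training, so this approach suits cases where the majority class is large enough to spare them.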