Regularization
1. Basics
1.1. Purpose
Prevent overfitting by adding a penalty to the model's complexity.
1.2. Techniques
Common forms of regularization include L1 (Lasso) and L2 (Ridge) regularization, which add penalties based on the absolute and squared values of model parameters, respectively.
1.2.1. L1 (Lasso)
- leads to a sparse model, with many parameters driven exactly to zero due to the shape of the penalty's contours with respect to the parameters.
- can therefore be used as a feature selector: to find out which features are essential, which also increases model interpretability (see the sketch below)
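A minimal sketch of L1-based feature selection, assuming scikit-learn and a synthetic dataset where only two of the ten features carry signal (the alpha value is an illustrative assumption):

```python
# L1 (Lasso) as a feature selector: a sketch, not a definitive recipe.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 10 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)  # only 2 matter

lasso = Lasso(alpha=0.1)                # alpha is the regularization strength (lambda)
lasso.fit(X, y)

# Irrelevant coefficients are driven exactly to zero; the non-zero ones mark useful features.
selected = np.flatnonzero(lasso.coef_)
print("non-zero coefficients at indices:", selected)
```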
1.2.2. L2 (Ridge)
- shrinks parameter magnitudes toward zero without forcing them to be exactly zero
- aka weight decay
- the derivative of the squared L2 penalty is a term linear in the weights, which is added to the loss gradient at each Stochastic Gradient Descent (SGD) step, effectively decaying the weights (see the sketch below)
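A minimal sketch of one SGD step with an L2 penalty, in plain NumPy; the parameter values, learning rate, and lambda below are illustrative assumptions:

```python
# One SGD step with an L2 (weight decay) penalty.
import numpy as np

w = np.array([0.5, -1.2, 2.0])            # current parameters
grad_loss = np.array([0.1, -0.3, 0.2])    # gradient of the data loss w.r.t. w
lr, lam = 0.01, 0.1                       # learning rate and regularization strength

# Gradient of (lam / 2) * ||w||^2 is simply lam * w, a term linear in the weights.
w = w - lr * (grad_loss + lam * w)
# Equivalent view: w is first shrunk by a factor (1 - lr * lam), i.e. "weight decay",
# and then updated with the usual loss gradient.
```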
1.2.3. Elastic Net
- combination of L1 and L2
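A minimal sketch of Elastic Net with scikit-learn; the alpha and l1_ratio values are illustrative assumptions (l1_ratio blends the L1 and L2 penalties):

```python
# Elastic Net mixes the L1 and L2 penalties; shown here via scikit-learn.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# l1_ratio=1.0 would be pure Lasso, l1_ratio=0.0 pure Ridge; 0.5 blends the two.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)
```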
1.2.4. Batch Norm
1.2.5. Dropout
1.2.6. Data Augmentation
1.3. Effect
Regularization encourages the model to have smaller parameter values, reducing its complexity and preventing it from fitting noise in the training data.
1.4. Bias-Variance Trade-off
It introduces a trade-off between fitting the training data well and having a simpler model that generalizes better to new, unseen data.
1.5. Lambda (Hyperparameter)
The strength of regularization is controlled by a hyperparameter (lambda or alpha) that can be tuned to achieve the desired balance between bias and variance.
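A minimal sketch of tuning this hyperparameter by cross-validation, assuming scikit-learn's RidgeCV; the candidate alpha grid and synthetic data are illustrative assumptions:

```python
# Selecting the regularization strength by cross-validation.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# RidgeCV fits the model for each candidate alpha and keeps the best cross-validated one.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
model.fit(X, y)
print("selected alpha:", model.alpha_)
```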
1.6. Benefits
It helps improve model stability, reduce overfitting, and make models more robust in real-world scenarios.
2. Specific to Neural Networks
2.1. Dropout
- randomly switch off a certain fraction of the neurons (each with a given probability) at every training iteration
- the higher that probability, the stronger the regularizing effect
- this can be achieved using a dropout layer (see the sketch below)
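A minimal sketch of a dropout layer in PyTorch; the architecture and dropout probability are illustrative assumptions:

```python
# A small network with a dropout layer between two computational layers.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is switched off with probability 0.5
    nn.Linear(128, 10),
)

model.train()            # dropout is active only in training mode
x = torch.randn(32, 64)
out = model(x)

model.eval()             # at evaluation time dropout is disabled
out_eval = model(x)      # (PyTorch rescales activations during training instead)
```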
2.2. BatchNorm
- standardizes the outputs of a layer over each mini-batch before the units of the subsequent layer receive them as input
- results in faster and more stable training
- not specifically a regularization technique but often exhibits regularization effects
- generally does no harm; as a rule of thumb, it can be used by default
- implemented as a BatchNorm layer between two computational layers (see the sketch below)
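A minimal sketch of a BatchNorm layer placed between two computational layers in PyTorch; the layer sizes are illustrative assumptions:

```python
# BatchNorm inserted between a linear layer and its activation.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),  # standardizes the 128 activations over the mini-batch
    nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(32, 64)   # batch of 32 examples
out = model(x)
```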
2.3. Data Augmentation
- usually with images
- involves creating synthetic examples by applying various transformations to the original images (see the sketch after this list), such as:
- slight zoom in/out
- rotating
- flipping
- darkening
- and so on
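A minimal sketch of such transformations with torchvision; the specific transforms and parameter values are illustrative assumptions:

```python
# A typical image-augmentation pipeline built from torchvision transforms.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # slight zoom in/out
    transforms.RandomRotation(degrees=15),                # rotating
    transforms.RandomHorizontalFlip(),                    # flipping
    transforms.ColorJitter(brightness=0.3),               # darkening/brightening
    transforms.ToTensor(),
])

# Typically passed to a dataset, e.g.:
# dataset = torchvision.datasets.ImageFolder("path/to/images", transform=augment)
```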
2.4. Early Stopping
- overfitting due to prolonged training can result in degrading validation performance
- a common practice is to checkpoint every certain number of epochs and keep the best model so far based on validation performance (see the sketch below)
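A minimal sketch of checkpoint-based early stopping; train_one_epoch, evaluate, model, and the data loaders are hypothetical placeholders for the real training code:

```python
# Early stopping by tracking the best validation loss and restoring that checkpoint.
import copy

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(100):
    train_one_epoch(model, train_loader)          # hypothetical training helper
    val_loss = evaluate(model, val_loader)        # hypothetical validation helper

    if val_loss < best_val:                       # new best checkpoint
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                # stop once validation stops improving
            break

model.load_state_dict(best_state)                 # restore the best model so far
```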