Regularization

1. Basics

1.1. Purpose

Prevent overfitting by adding a penalty on model complexity to the training objective.

1.2. Techniques

Common forms of regularization include L1 (Lasso) and L2 (Ridge) regularization, which add penalties based on the absolute and squared values of model parameters, respectively.
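
As a minimal NumPy sketch of how the two penalties are formed and added to the loss (function name and values are illustrative, not from any particular library):

    import numpy as np

    def penalized_loss(w, data_loss, lam, kind="l2"):
        """Add an L1 or L2 penalty on the parameter vector w to a precomputed data loss."""
        if kind == "l1":
            penalty = lam * np.sum(np.abs(w))   # sum of absolute values (Lasso)
        else:
            penalty = lam * np.sum(w ** 2)      # sum of squares (Ridge)
        return data_loss + penalty

    w = np.array([0.5, -1.2, 3.0])
    print(penalized_loss(w, data_loss=0.8, lam=0.1, kind="l1"))  # 0.8 + 0.1 * 4.7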

1.2.1. L1 (Lasso)

  • leads to a sparse model, with many parameters driven exactly to zero, because the contours of the L1 penalty have corners on the coordinate axes.
  • can therefore be used for feature selection: identifying which features are essential and improving model interpretability (see the sketch below).
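
A minimal scikit-learn sketch of this sparsity effect (the synthetic dataset and alpha value are illustrative):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Regression problem where only a few of the 20 features are truly informative.
    X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                           noise=5.0, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)

    # Many coefficients are driven exactly to zero; the survivors act as selected features.
    selected = [i for i, c in enumerate(lasso.coef_) if c != 0.0]
    print("non-zero coefficients:", selected)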

1.2.2. L2 (Ridge)

  • shrinks parameter magnitudes toward zero, but typically without setting them exactly to zero.
  • also known as weight decay, the form in which it is usually applied when training neural networks (see the sketch below).
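
A hedged sketch of the same penalty in two guises: an explicit Ridge fit in scikit-learn, and the weight_decay argument of a PyTorch optimizer (hyperparameter values are illustrative):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                           noise=5.0, random_state=0)

    # Coefficients are shrunk toward zero but typically stay non-zero, unlike Lasso.
    ridge = Ridge(alpha=1.0).fit(X, y)
    print(ridge.coef_[:5])

    # In neural-network training the equivalent penalty is usually applied as weight decay,
    # e.g. torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)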

1.2.3. Elastic Net

  • a weighted combination of the L1 and L2 penalties (see the sketch below).
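
A minimal scikit-learn sketch (alpha and l1_ratio values are illustrative):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet

    X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                           noise=5.0, random_state=0)

    # alpha controls the overall strength; l1_ratio blends the two penalties
    # (l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge).
    enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
    print(enet.coef_)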

1.2.4. Batch Norm

  • see Section 2.2 below.

1.2.5. Dropout

  • see Section 2.1 below.

1.3. Effect

Regularization encourages the model to have smaller parameter values, reducing its complexity and preventing it from fitting noise in the training data.

1.4. Bias-Variance Trade-off

It introduces a trade-off between fitting the training data well and having a simpler model that generalizes better to new, unseen data.

1.5. Lambda (Hyperparameter)

The strength of regularization is controlled by a hyperparameter (lambda or alpha) that can be tuned to achieve the desired balance between bias and variance.
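
One illustrative way to tune this hyperparameter is cross-validation over a grid of strengths, e.g. with scikit-learn's RidgeCV (the alpha grid and fold count are assumptions):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import RidgeCV

    X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                           noise=5.0, random_state=0)

    # Try several regularization strengths and keep the one with the best
    # cross-validated score, trading bias against variance.
    alphas = np.logspace(-3, 3, 13)
    model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
    print("chosen alpha:", model.alpha_)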

1.6. Benefits

It helps improve model stability, reduce overfitting, and make models more robust in real-world scenarios.

2. Specific to Neural Networks

2.1. Dropout

  • randomly switch off a fraction of the neurons (each with a given probability) at every training iteration.
  • the higher this dropout probability, the stronger the regularizing effect.
  • implemented as a dropout layer (see the sketch below).
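
A minimal PyTorch sketch (layer sizes and the dropout probability are illustrative):

    import torch
    import torch.nn as nn

    # A small MLP with dropout after the hidden activation; p is the fraction
    # of units randomly zeroed at each training step.
    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(256, 10),
    )

    model.train()                     # dropout is active during training
    out = model(torch.randn(32, 784))

    model.eval()                      # dropout is disabled at inference time
    out = model(torch.randn(32, 784))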

2.2. BatchNorm

  • standardizes the outputs of a layer over each mini-batch before the units of the subsequent layer receive them as input.
  • results in faster and more stable training.
  • not specifically a regularization technique, but often exhibits a regularizing effect.
  • rarely harmful, so a common rule of thumb is to use it by default.
  • implemented as a BatchNorm layer between two computational layers (see the sketch below).
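
A minimal PyTorch sketch with a BatchNorm layer between two fully connected layers (layer sizes are illustrative):

    import torch
    import torch.nn as nn

    # BatchNorm1d standardizes the hidden activations over the mini-batch,
    # then applies a learnable scale and shift, before the next layer sees them.
    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.BatchNorm1d(256),
        nn.ReLU(),
        nn.Linear(256, 10),
    )

    out = model(torch.randn(32, 784))
    print(out.shape)  # torch.Size([32, 10])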

2.3. Data Augmentation

  • usually with images
  • involves creating synthetic examples by applying various transformations to the originals (see the sketch after this list), such as:
    • slight zoom in/out
    • rotating
    • flipping
    • darkening
    • and so on
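
A hedged sketch using torchvision transforms (assuming torchvision is available; the specific transforms and magnitudes are illustrative):

    from torchvision import transforms

    # Each training image gets a random combination of these transformations,
    # producing slightly different synthetic examples every epoch.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # slight zoom in
        transforms.RandomRotation(degrees=15),                # rotating
        transforms.RandomHorizontalFlip(),                    # flipping
        transforms.ColorJitter(brightness=0.2),               # darkening / brightening
        transforms.ToTensor(),
    ])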

2.4. Early Stopping

  • overfitting due to prolonged training can result in lower validation performance.
  • a common practice is to checkpoint every few epochs and keep the best model so far based on validation performance (see the sketch below).
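
A minimal sketch of such a loop, assuming a PyTorch-style model and user-supplied train_one_epoch / evaluate callables (all names and the patience value are placeholders):

    import copy

    def train_with_early_stopping(model, train_one_epoch, evaluate,
                                  max_epochs=100, patience=10):
        """Keep the best weights seen on validation and stop once validation
        performance has not improved for `patience` consecutive epochs."""
        best_score, best_state, epochs_without_improvement = float("-inf"), None, 0
        for epoch in range(max_epochs):
            train_one_epoch(model)
            score = evaluate(model)            # e.g. validation accuracy
            if score > best_score:
                best_score = score
                best_state = copy.deepcopy(model.state_dict())  # checkpoint best so far
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                      # stop early; training has started to overfit
        model.load_state_dict(best_state)      # restore the best checkpoint
        return model
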
Tags::ml:ai: