Gradient Descent
- an iterative optimization algorithm used to minimize a function (see loss).
- speaking briefly:
- we start at a random point on the parameter-space vs loss contour
- then we step down the hyper-hill, trying to avoid getting stuck in local hyper-troughs, and repeat until we reach a (usually just satisfactory) hyper-valley and can report convergence (a code sketch of this loop follows the list)
- note that we actually can't see the hyper-hill and need to calculate the loss every time we step somewhere, akin to hiking in the dark.
- the parameter-space step size is controllable via a hyper-parameter -> the learning rate
- when the optimization criterion is convex, we're sure to find the global minimum, whereas with complex (non-convex) contours we may have to settle for a local one.
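A minimal sketch of this loop in Python (the quadratic loss, learning rate of 0.1, and step count are arbitrary choices for illustration):

```python
import numpy as np

def loss(w):
    # A simple convex "hyper-hill": f(w) = ||w - 3||^2, minimized at w = [3, 3].
    return np.sum((w - 3.0) ** 2)

def grad(w):
    # Analytic gradient of the loss above.
    return 2.0 * (w - 3.0)

learning_rate = 0.1            # the step-size hyper-parameter
w = np.random.randn(2)         # start at a random point in parameter space

for step in range(100):
    w -= learning_rate * grad(w)   # step "down the hyper-hill"

print(w, loss(w))              # w ends up close to [3, 3], loss close to 0
```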
1. Improvements
1.1. Stochastic Gradient Descent (SGD)
Computing the loss and its gradient over the entire training set for every update can be very slow; computing them on stochastically selected smaller batches instead leads to the idea of stochastic gradient descent.
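A minimal sketch of mini-batch SGD on synthetic linear-regression data (the data, batch size of 32, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # synthetic features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)    # noisy linear targets

w = np.zeros(5)
learning_rate, batch_size = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(X))               # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error, estimated on the mini-batch only
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= learning_rate * grad

print(w - true_w)   # the gap should be small after a few epochs
```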
1.2. Adagrad
This scales the learning rate individually for each parameter (ADAptive GRADient algorithm) according to the history of its gradients.
- as a consequence, parameters that have accumulated large gradients get a smaller effective learning rate, and parameters with small gradients get a larger one.
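A minimal sketch of the Adagrad update rule on a toy quadratic (the learning rate of 0.5 and the epsilon constant are illustrative defaults):

```python
import numpy as np

w = np.array([5.0, -3.0])
grad_sq_sum = np.zeros_like(w)       # per-parameter sum of squared gradients
learning_rate, eps = 0.5, 1e-8

for step in range(200):
    grad = 2.0 * w                   # gradient of f(w) = ||w||^2
    grad_sq_sum += grad ** 2         # accumulate the gradient history
    # Parameters with a large gradient history get a smaller effective step
    w -= learning_rate * grad / (np.sqrt(grad_sq_sum) + eps)

print(w)                             # approaches [0, 0]
```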
1.3. Momentum
Accelerates SGD by retaining a sense of past gradients to impart some inertia to the optimization process
- helps damp oscillations and makes the updates move more consistently toward the minimum
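A minimal sketch of the momentum update on an ill-conditioned quadratic (the "ravine" shape, learning rate, and momentum coefficient of 0.9 are illustrative choices):

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 10 * w0^2 + w1^2, an elongated "ravine"
    # where plain gradient descent tends to oscillate across the steep axis.
    return np.array([20.0 * w[0], 2.0 * w[1]])

w = np.array([1.0, 1.0])
velocity = np.zeros_like(w)
learning_rate, momentum = 0.01, 0.9

for step in range(300):
    velocity = momentum * velocity - learning_rate * grad(w)  # retain past gradients
    w += velocity                                             # move along the accumulated velocity

print(w)   # approaches [0, 0], with the oscillations damped by the inertia
```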
1.4. RMSProp (Root Mean Square Propagation):
RMSProp (like Adagrad) adapts the learning rate for each parameter based on the past gradients, helping to stabilize and speed up the training process.
- note that RMSProp improves on Adagrad by using an exponentially decaying average of squared gradients instead of an ever-growing sum, which deals with the diminishing learning rate issue.
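A minimal sketch of the RMSProp update on a toy quadratic (the decay rate of 0.9 and epsilon are commonly used defaults; the learning rate is illustrative):

```python
import numpy as np

w = np.array([1.0, -0.5])
sq_avg = np.zeros_like(w)            # exponentially decaying average of squared gradients
learning_rate, decay, eps = 0.01, 0.9, 1e-8

for step in range(300):
    grad = 2.0 * w                   # gradient of f(w) = ||w||^2
    # Unlike Adagrad's ever-growing sum, the decaying average keeps the
    # effective learning rate from shrinking towards zero over time.
    sq_avg = decay * sq_avg + (1 - decay) * grad ** 2
    w -= learning_rate * grad / (np.sqrt(sq_avg) + eps)

print(w)   # close to [0, 0]; a small jitter on the order of the learning rate remains
```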
1.5. Adam (Adaptive Moment Estimation):
Adam combines the benefits of momentum and RMSProp, using both past gradients and their magnitudes to adjust learning rates, making it a versatile and efficient optimization algorithm.
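A minimal sketch of the Adam update on a toy quadratic (the beta values of 0.9/0.999 and epsilon follow the usual defaults; the learning rate is illustrative):

```python
import numpy as np

w = np.array([5.0, -3.0])
m = np.zeros_like(w)   # first moment: decaying mean of gradients (the momentum part)
v = np.zeros_like(w)   # second moment: decaying mean of squared gradients (the RMSProp part)
learning_rate, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for step in range(1, 1001):
    grad = 2.0 * w                       # gradient of f(w) = ||w||^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)      # bias correction for the zero-initialized moments
    v_hat = v / (1 - beta2 ** step)
    w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(w)   # close to [0, 0]
```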