4 Gradient Descent
4.1 Batch Gradient Descent
[i] Batch (full) Gradient Descent uses the whole training set at every step => very slow on large training sets
[i] Gradient of the cost function: \[\nabla_{\theta} J(\theta) = \frac{1}{m} X^{T}(X\theta - y)\]
[i] Scales well with the number of features, but poorly with the number of training instances (see the sketch below)
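A minimal NumPy sketch of the batch update loop, assuming linear regression with the gradient formula above; the function name, `eta`, and `n_iterations` are illustrative choices, not fixed by these notes:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iterations=1000):
    """Batch GD for linear regression; X must already include the bias column."""
    m, n = X.shape
    theta = np.random.randn(n, 1)                      # random initialization
    for _ in range(n_iterations):
        gradients = (1 / m) * X.T @ (X @ theta - y)    # gradient over the full training set
        theta = theta - eta * gradients                # step against the gradient
    return theta

# toy usage: y ≈ 4 + 3x plus noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]                      # add bias term x0 = 1
theta_hat = batch_gradient_descent(X_b, y)
```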
4.2 Stochastic Gradient Descent
- Picks a random instance at every step (not per epoch) and computes the gradient on that single instance
- Can run as an out-of-core algorithm (only one instance needs to be in memory at a time)
- The cost function decreases erratically; even after reaching the neighborhood of the global minimum it keeps bouncing around instead of settling
- The randomness can help it jump out of local minima
- Final weights are good but not optimal => improve by gradually reducing the learning rate over time (a learning schedule)
- Random sampling may pick some instances several times and skip others => improve by shuffling the training set each epoch so every instance gets picked (see the sketch below)
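A minimal sketch of SGD with a simple learning schedule, assuming the same linear-regression setting as above; `learning_schedule`, `t0`, and `t1` are illustrative hyperparameters:

```python
import numpy as np

def stochastic_gradient_descent(X, y, n_epochs=50, t0=5, t1=50):
    """SGD for linear regression with a decaying learning rate."""
    m, n = X.shape
    theta = np.random.randn(n, 1)

    def learning_schedule(t):
        return t0 / (t + t1)                        # learning rate shrinks as training progresses

    for epoch in range(n_epochs):
        indices = np.random.permutation(m)          # shuffle so every instance is seen each epoch
        for i, idx in enumerate(indices):
            xi = X[idx:idx + 1]                     # a single instance
            yi = y[idx:idx + 1]
            gradients = xi.T @ (xi @ theta - yi)    # gradient on that one instance
            eta = learning_schedule(epoch * m + i)
            theta = theta - eta * gradients
    return theta
```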
4.3 Mini-Batch Gradient Descent
- Computes the gradient on small random subsets of instances called mini-batches (gets a performance boost from GPUs)
- Progress is less erratic than with SGD, so it ends up a bit closer to the minimum (see the sketch below)
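A minimal mini-batch sketch under the same linear-regression assumptions; `batch_size`, `eta`, and `n_epochs` are illustrative:

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=20, n_epochs=50, eta=0.05):
    """Mini-batch GD for linear regression; X must already include the bias column."""
    m, n = X.shape
    theta = np.random.randn(n, 1)
    for _ in range(n_epochs):
        indices = np.random.permutation(m)                  # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]       # one small random subset
            xb, yb = X[batch], y[batch]
            gradients = (1 / len(batch)) * xb.T @ (xb @ theta - yb)
            theta = theta - eta * gradients
    return theta
```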