4  Gradient Descent

4.1 Batch Gradient Descent

  • [i] Full Gradient Descent: uses the whole training set to compute the gradient at every step => terribly slow on large training sets

  • [i] Gradient: \[\nabla = \frac{1}{m} X^{T}(X\theta - y)\]

  • Scales well with the number of features (minimal sketch below)
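
A minimal NumPy sketch of one possible Batch Gradient Descent loop on a linear-regression cost, using the gradient formula above; the toy data, learning rate `eta`, and iteration count are illustrative assumptions, not values from these notes.

```python
import numpy as np

# Toy linear data (assumed for illustration): y ≈ 4 + 3x + noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]          # prepend a bias column x0 = 1

eta = 0.1                                # learning rate (assumed)
n_iterations = 1000
theta = rng.standard_normal((2, 1))      # random initialization

for _ in range(n_iterations):
    # full-batch gradient from the formula above: (1/m) * X^T (X·theta - y)
    gradients = (1 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients

print(theta)                             # ends up close to [[4.], [3.]]
```

Every iteration multiplies by the full `X_b`, which is why this approach slows down as the number of instances grows, while remaining fine as the number of features grows.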

4.2 Stochastic Gradient Descent

  • Picks a random instance at every step (not once per epoch) and computes the gradient on that single instance
  • Out-of-core algorithm: can train on datasets too large to fit in memory
  • Cost function is erratic: it keeps bouncing around even once it gets near the global minimum
    • Can jump out of local minima
    • Final weights are good but not optimal => improve by gradually reducing the learning rate (a learning schedule; see the sketch after this list)
    • Randomness => improve by shuffling the training set each epoch so every instance gets picked
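
The two improvements above (a learning schedule and per-epoch shuffling) can be combined in a short sketch like the one below; the schedule constants `t0`, `t1` and the epoch count are assumptions for illustration, and the toy data is the same as in the batch GD sketch.

```python
import numpy as np

# Same toy linear data as in the batch GD sketch (assumed for illustration)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
t0, t1 = 5, 50                           # learning-schedule hyperparameters (assumed)

def learning_schedule(t):
    # gradually reduce the learning rate as training progresses
    return t0 / (t + t1)

theta = rng.standard_normal((2, 1))      # random initialization

for epoch in range(n_epochs):
    indices = rng.permutation(m)         # shuffle so every instance gets picked each epoch
    for i, idx in enumerate(indices):
        xi = X_b[idx:idx + 1]            # one random instance per step
        yi = y[idx:idx + 1]
        gradient = xi.T @ (xi @ theta - yi)        # gradient on a single instance
        eta = learning_schedule(epoch * m + i)
        theta -= eta * gradient

print(theta)                             # good weights, but they bounce around the optimum
```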

4.3 Mini-Batch Gradient Descent

  • Computes the gradient on small random sets of instances called mini-batches (gets a performance boost from hardware optimized for matrix operations, such as GPUs)
  • Progress is less erratic than with SGD, so it ends up walking closer to the minimum (see the sketch below)
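
A sketch of Mini-Batch Gradient Descent under the same assumptions (toy data, batch size, and learning rate chosen for illustration); each update averages the gradient over a small random subset instead of one instance or the full set.

```python
import numpy as np

# Same toy linear data (assumed for illustration)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

batch_size = 20                          # mini-batch size (assumed)
eta = 0.1                                # learning rate (assumed)
n_epochs = 100

theta = rng.standard_normal((2, 1))      # random initialization

for epoch in range(n_epochs):
    indices = rng.permutation(m)         # shuffle, then walk through mini-batches
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # gradient averaged over the mini-batch: (1/|batch|) * X^T (X·theta - y)
        gradients = (1 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
        theta -= eta * gradients

print(theta)                             # less noisy path than pure SGD
```

The per-batch matrix operations are the part that GPUs and similar hardware can accelerate.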