4  Gradient Descent

4.1 Batch Gradient Descent

  • [i] Full Gradient Descent: uses the whole training set to compute the gradient at every step => terribly slow on large training sets

  • [i] Gradient: \[\nabla = \frac{1}{m} X^{T}(X\theta - y)\]

  • Scales well with the number of features (minimal sketch below)
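
A minimal NumPy sketch of one possible Batch Gradient Descent loop on a linear-regression cost, using the gradient formula above; the toy data, learning rate `eta`, and iteration count are illustrative assumptions, not values from these notes.

```python
import numpy as np

# Toy linear data (assumed for illustration): y ≈ 4 + 3x + noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]          # prepend a bias column x0 = 1

eta = 0.1                                # learning rate (assumed)
n_iterations = 1000
theta = rng.standard_normal((2, 1))      # random initialization

for _ in range(n_iterations):
    # full-batch gradient from the formula above: (1/m) * X^T (X·theta - y)
    gradients = (1 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients

print(theta)                             # ends up close to [[4.], [3.]]
```

Every iteration multiplies by the full `X_b`, which is why this approach slows down as the number of instances grows, while remaining fine as the number of features grows.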

4.2 Stochastic Gradient Descent

  • Picks a random instance at every step (not once per epoch) and computes the gradient on that single instance
  • Out-of-core algorithm: can train on datasets too large to fit in memory
  • Cost function is erratic: it keeps bouncing around even once it gets near the global minimum
    • Can jump out of local minima
    • Final weights are good but not optimal => improve by gradually reducing the learning rate (a learning schedule; see the sketch after this list)
    • Randomness => improve by shuffling the training set each epoch so every instance gets picked
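
The two improvements above (a learning schedule and per-epoch shuffling) can be combined in a short sketch like the one below; the schedule constants `t0`, `t1` and the epoch count are assumptions for illustration, and the toy data is the same as in the batch GD sketch.

```python
import numpy as np

# Same toy linear data as in the batch GD sketch (assumed for illustration)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
t0, t1 = 5, 50                           # learning-schedule hyperparameters (assumed)

def learning_schedule(t):
    # gradually reduce the learning rate as training progresses
    return t0 / (t + t1)

theta = rng.standard_normal((2, 1))      # random initialization

for epoch in range(n_epochs):
    indices = rng.permutation(m)         # shuffle so every instance gets picked each epoch
    for i, idx in enumerate(indices):
        xi = X_b[idx:idx + 1]            # one random instance per step
        yi = y[idx:idx + 1]
        gradient = xi.T @ (xi @ theta - yi)        # gradient on a single instance
        eta = learning_schedule(epoch * m + i)
        theta -= eta * gradient

print(theta)                             # good weights, but they bounce around the optimum
```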

4.3 Mini-Batch Gradient Descent

  • Computes the gradient on small random sets of instances called mini-batches (gets a performance boost from hardware optimized for matrix operations, such as GPUs)
  • Progress is less erratic than with SGD, so it ends up walking closer to the minimum (see the sketch below)
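
A sketch of Mini-Batch Gradient Descent under the same assumptions (toy data, batch size, and learning rate chosen for illustration); each update averages the gradient over a small random subset instead of one instance or the full set.

```python
import numpy as np

# Same toy linear data (assumed for illustration)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]

batch_size = 20                          # mini-batch size (assumed)
eta = 0.1                                # learning rate (assumed)
n_epochs = 100

theta = rng.standard_normal((2, 1))      # random initialization

for epoch in range(n_epochs):
    indices = rng.permutation(m)         # shuffle, then walk through mini-batches
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # gradient averaged over the mini-batch: (1/|batch|) * X^T (X·theta - y)
        gradients = (1 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
        theta -= eta * gradients

print(theta)                             # less noisy path than pure SGD
```

The per-batch matrix operations are the part that GPUs and similar hardware can accelerate.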