A meticulous, step-by-step guide to optimization — with interactive playgrounds, rigorous mathematics, and practical wisdom from the trenches.
The mountain analogy is all you need to understand optimization forever.
Imagine you're blindfolded on a mountainside. You feel the slope and step downhill. That's gradient descent — simple but powerful.
GD doesn't guarantee the global minimum. In deep learning's non-convex landscape, it finds local minima, which, remarkably, are often nearly as good as the global minimum in over-parameterized networks: the sheer abundance of parameters smooths the landscape.
Six steps that power every neural network in production today.
Initialize θ₀ randomly (Xavier/He initialization)
Forward pass: ŷ = f(x; θ) — compute predictions
Compute loss: J(θ) = L(ŷ, y) — measure how wrong we are
Backward pass: ∇θJ via backpropagation
Update: θ ← θ − α · ∇θJ — take a step downhill
Repeat until convergence
Model: ŷ = wx + b | Data: (1,2), (2,4), (3,6) | w=0.5, b=0.1, α=0.01
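The six-step recipe above, applied to this toy linear model, can be sketched in a few lines of plain Python. The data, initial `w` and `b`, and learning rate are the ones listed above; the gradients are worked out by hand here, which is what backpropagation automates for deep nets:

```python
# Gradient descent for the model y_hat = w*x + b on the data above.
data = [(1, 2), (2, 4), (3, 6)]    # the true relation is y = 2x
w, b, alpha = 0.5, 0.1, 0.01       # initial parameters and learning rate from above

for step in range(5000):
    n = len(data)
    # Forward pass + loss: MSE, J = (1/n) * sum((w*x + b - y)^2)
    # Backward pass: dJ/dw = (2/n) * sum((y_hat - y) * x), dJ/db = (2/n) * sum(y_hat - y)
    grad_w = sum(2 * ((w * x + b) - y) * x for x, y in data) / n
    grad_b = sum(2 * ((w * x + b) - y) for x, y in data) / n
    # Update: step downhill along the negative gradient
    w -= alpha * grad_w
    b -= alpha * grad_b

# w converges toward 2, b toward 0
```

Run it and the loop recovers the true relation y = 2x to within rounding.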
Too small: convergence is painfully slow, training gets stuck in shallow local minima, and compute is wasted.
Too large: overshoots the minimum, oscillates wildly, or diverges completely. Loss goes to infinity.
Schedules: Step Decay: α = α₀ × 0.1 every 30 epochs · Cosine Annealing: α_t = α_min + ½(α_max − α_min)(1 + cos(πt/T)) · Warmup + Decay: the Transformer standard · Cyclical LR: oscillate between bounds
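The schedules above fit in a few small functions. A minimal sketch (the function names are mine, not any library's API):

```python
import math

def step_decay(alpha0, epoch, drop=0.1, every=30):
    """Step decay: multiply the LR by `drop` every `every` epochs."""
    return alpha0 * (drop ** (epoch // every))

def cosine_annealing(t, T, alpha_min=0.0, alpha_max=1.0):
    """Cosine annealing: alpha_t = alpha_min + 0.5*(alpha_max - alpha_min)*(1 + cos(pi*t/T))."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))

def warmup_cosine(t, T, warmup, alpha_max):
    """Linear warmup for `warmup` steps, then cosine decay to zero
    (the Transformer-style warmup + decay recipe)."""
    if t < warmup:
        return alpha_max * t / warmup
    return cosine_annealing(t - warmup, T - warmup, 0.0, alpha_max)
```

Each takes the current step (or epoch) and returns the learning rate to use, so any of them can be dropped into the update loop unchanged.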
From full-batch to stochastic — each with trade-offs.
Batch sizes: 32, 64, 128, 256 (powers of 2 align well with GPU memory hardware). This is the common industry default.
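A minimal sketch of how one epoch gets split into mini-batches: shuffle the indices, then walk through them in fixed-size slices (the helper name `minibatches` is hypothetical, not a library function):

```python
import random

def minibatches(xs, ys, batch_size=32, seed=0):
    """Shuffle once per epoch, then yield consecutive slices of `batch_size` examples."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        chunk = idx[start:start + batch_size]
        yield [xs[i] for i in chunk], [ys[i] for i in chunk]

xs = list(range(100))
ys = [2 * x for x in xs]
batches = list(minibatches(xs, ys, batch_size=32))
# 100 examples at batch size 32 -> batches of 32, 32, 32, and a final 4
```

The gradient step then runs once per yielded batch, trading the full-batch gradient's accuracy for many cheap, noisy updates.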
Modern optimizers that power every LLM in production.
Nesterov Variant: "Look ahead" before computing gradient → faster convergence. Standard for CNN training.
Flaw: G grows forever → learning rate shrinks to zero → training stalls.
Hinton's fix for AdaGrad — proposed in a Coursera lecture, never formally published!
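The flaw and the fix are easiest to see side by side. A scalar sketch (not a library implementation): AdaGrad's accumulator is a running sum, RMSProp swaps it for an exponential moving average so old gradients decay away:

```python
def adagrad_step(theta, g, G, alpha=0.01, eps=1e-8):
    """AdaGrad: G accumulates squared gradients forever, so the effective LR only shrinks."""
    G = G + g * g
    theta = theta - alpha * g / ((G ** 0.5) + eps)
    return theta, G

def rmsprop_step(theta, g, G, alpha=0.001, beta=0.9, eps=1e-8):
    """RMSProp: an exponential moving average of squared gradients instead of a running sum."""
    G = beta * G + (1 - beta) * g * g
    theta = theta - alpha * g / ((G ** 0.5) + eps)
    return theta, G

# With a constant gradient of 1.0, AdaGrad's accumulator grows without bound
# while RMSProp's settles near 1:
G_a, G_r = 0.0, 0.0
for _ in range(1000):
    _, G_a = adagrad_step(0.0, 1.0, G_a)
    _, G_r = rmsprop_step(0.0, 1.0, G_r)
```

After 1000 identical gradients, AdaGrad's G has reached 1000 (and keeps climbing), while RMSProp's has plateaued at 1, so its effective learning rate stays usable.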
Defaults: α=0.001, β₁=0.9, β₂=0.999, ε=10⁻⁸
Default optimizer for Transformers (GPT, BERT, LLaMA). The key insight: decouple weight decay from the adaptive gradient updates.
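A scalar sketch of the AdamW update using the defaults listed above (the function name is mine). The last line is the decoupling: weight decay is applied directly to the weight, outside the adaptive gradient machinery, and setting it to zero recovers plain Adam:

```python
def adamw_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single parameter. t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA (adaptive scale)
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero-initialized EMAs
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (v_hat ** 0.5 + eps)
    theta = theta - alpha * weight_decay * theta   # decoupled weight decay (the AdamW change)
    return theta, m, v

# First plain-Adam step (weight_decay=0) with gradient 1.0 moves theta by about -alpha:
theta, m, v = adamw_step(0.0, 1.0, 0.0, 0.0, t=1, weight_decay=0.0)
```

Because bias correction makes the first-step ratio m̂/√v̂ equal to the gradient's sign, Adam's very first step has magnitude close to α regardless of the gradient's scale.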
Lion: sign-based momentum. Uses only the sign of gradients, not their magnitudes, which dramatically reduces optimizer memory.
Sophia: a second-order optimizer, roughly 2× faster than Adam for LLM pre-training by approximating Hessian information.
Critical techniques that make the difference between a model that trains and one that doesn't.
Without scaling, gradients are dominated by large-magnitude features. The loss landscape becomes an elongated ellipse, making GD oscillate. Always normalize inputs to zero mean, unit variance — or use Min-Max to [0,1]. This single step can speed up training 10×.
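A minimal standardization sketch in pure Python (in practice you would reach for something like sklearn's `StandardScaler`, which also remembers the training statistics for use at inference time):

```python
def standardize(columns):
    """Scale each feature column to zero mean, unit variance (population std)."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = var ** 0.5 or 1.0            # guard against constant features (std == 0)
        scaled.append([(x - mean) / std for x in col])
    return scaled

# A feature in the thousands next to one in [0, 1] - after scaling, both
# contribute gradients of comparable magnitude:
cols = standardize([[1000.0, 2000.0, 3000.0], [0.1, 0.2, 0.3]])
```

After scaling, both columns have zero mean and unit variance, so neither dominates the loss surface.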
Normalize activations within each layer during training. Reduces internal covariate shift, allows higher learning rates, and acts as regularization. Used in nearly every modern CNN.
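A one-feature sketch of the training-time computation. Real BatchNorm also tracks running mean/variance for inference and learns γ and β by gradient descent; here they are fixed arguments:

```python
def batchnorm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time BatchNorm for one feature: normalize across the batch,
    then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

out = batchnorm_forward([1.0, 2.0, 3.0, 4.0])
# Output has (approximately) zero mean and unit variance within the batch.
```

The eps term keeps the division stable when a batch's variance is near zero.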
Can't fit batch size 256 in GPU memory? Run 8 forward passes with batch 32, accumulate gradients, then update once. Same math, less memory. Essential for training large models on consumer GPUs.
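The "same math" claim can be checked directly on a toy model (the helper name and model here are illustrative): averaging 8 micro-batch gradients of size 32 reproduces the batch-256 gradient, because the mean over equal-sized chunks equals the mean over the whole batch:

```python
def mse_grad_w(batch, w):
    """Gradient of MSE loss for y_hat = w*x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 2.0 * x) for x in range(1, 257)]   # 256 toy examples
w = 0.5

full = mse_grad_w(data, w)        # one batch-256 gradient (needs memory for all 256)

accum = 0.0
for i in range(0, 256, 32):       # 8 forward/backward passes at batch 32
    accum += mse_grad_w(data[i:i + 32], w)
grad = accum / 8                  # average, then do ONE optimizer update with `grad`
```

The only cost is wall-clock time: 8 sequential passes instead of 1 parallel one.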
Standard momentum computes the gradient at the current position. Nesterov's insight: compute the gradient at the "look-ahead" position where momentum is about to take us. This anticipation prevents overshooting and converges faster.
NAG plus a proper LR schedule is the go-to for computer vision: SGD with momentum (often the Nesterov variant) trained ResNet, VGG, and most ImageNet winners.
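A scalar sketch of the look-ahead update (one common formulation; deep-learning frameworks implement an algebraically equivalent rearrangement):

```python
def nesterov_step(theta, v, grad_fn, alpha=0.01, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point theta + mu*v,
    i.e. where momentum is about to carry us, not where we currently stand."""
    v = mu * v - alpha * grad_fn(theta + mu * v)
    return theta + v, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5:
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, lambda x: 2 * x)
```

Swapping `grad_fn(theta + mu * v)` for `grad_fn(theta)` gives classical momentum; the look-ahead version damps the overshoot-and-correct oscillations near the minimum.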
When gradients explode (common in RNNs and Transformers), clipping rescales the gradient vector so its norm never exceeds a threshold. Without it, a single bad batch can destroy your model.
Rule of thumb: max_norm = 1.0 for Transformers, 5.0 for RNNs. Monitor gradient norms — if they spike, you need clipping.
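Clipping by global norm fits in a few lines; this is a sketch of the standard technique, not any framework's API:

```python
def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm never exceeds max_norm.
    Direction is preserved; only the magnitude is capped."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grads]
    return grads

clipped = clip_by_norm([3.0, 4.0])    # norm 5 -> rescaled to norm 1
```

Gradients already under the threshold pass through untouched, so well-behaved batches are unaffected.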
Watch the ball roll downhill on f(x) = x⁴ − 3x³ + 2. Tweak learning rate and starting position.
Rosenbrock function — the classic test for optimizers. Watch the path navigate the narrow valley.
SGD vs Momentum vs RMSProp vs Adam — who reaches the minimum first?
Watch gradient descent navigate a 3D loss landscape. Drag to rotate, scroll to zoom.
Visualize how different LR schedules decay over training. Toggle each to compare.
Copy-paste ready implementations for every optimizer and technique discussed.
Local minima: most are nearly as good as the global minimum in high dimensions; over-parameterized nets smooth the landscape.
Saddle points: more common than local minima in high dimensions. Momentum and adaptive methods escape them.
Vanishing gradients: deep networks lose signal. Fix: ResNets (skip connections), BatchNorm, gradient clipping, ReLU.
Plateaus: flat loss regions stall training. LR warmup + cosine decay + patience helps break through.
Watch SGD get stuck on a saddle point (left) while Momentum escapes it (right).
Exploding gradients: gradient norms grow exponentially through layers (common in RNNs) and the loss suddenly jumps to NaN. Signs: loss spikes, NaN in weights, gradient norm > 1000.
Fix: Gradient clipping, proper initialization (He/Xavier), LSTM/GRU gates, skip connections.
Sharp minima (high curvature) generalize poorly — small perturbations cause large loss increases. Flat minima (low curvature) are robust. This is why:
• Smaller batch sizes → more noise → find flatter minima → better generalization
• Large batch sizes → less noise → converge to sharp minima → tend to overfit
• SAM optimizer (2020) explicitly seeks flat minima by optimizing worst-case loss in a neighborhood
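A 1-D toy sketch of SAM's two-step update (names and constants are illustrative; real SAM perturbs the full weight vector and costs an extra forward/backward pass per step): first climb to the approximate worst point within radius ρ, then apply that point's gradient to the original weights:

```python
def sam_step(theta, grad_fn, alpha=0.1, rho=0.05):
    """One SAM step: ascend to the (approximate) worst-case neighbor within
    radius rho, then update the UNPERTURBED weights with its gradient."""
    g = grad_fn(theta)
    eps = rho * g / (abs(g) or 1.0)   # ascent direction scaled to radius rho
    g_sharp = grad_fn(theta + eps)    # gradient at the worst-case neighbor
    return theta - alpha * g_sharp    # step the original weights

# On f(x) = x^2 the iterates still head toward the minimum,
# but each step "looks uphill" first:
theta = 1.0
for _ in range(100):
    theta = sam_step(theta, lambda x: 2 * x)
```

On a sharp minimum the look-uphill gradient is much larger than the local one, so SAM is pushed away; on a flat minimum the two gradients nearly agree and the update proceeds as usual.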
Start with AdamW (lr = 3e-4). It works for almost everything.
Warmup the first 5–10% of training steps.
Gradient clipping with max_norm = 1.0 prevents explosions.
Monitor gradient norms — they reveal training health.
Run an LR range test to find the optimal learning rate.
Smaller batch sizes = better generalization (implicit regularization).
Weight decay λ = 0.01–0.1. Always use it.
| Optimizer | Year | Adaptive | Momentum | Extra State | Best For |
|---|---|---|---|---|---|
| Batch GD | 1847 | — | — | 0× | Small convex problems |
| SGD | 1951 | — | — | 0× | Online learning |
| SGD + Momentum | 1964 | — | ✓ | 1× | CNNs (ResNet, etc.) |
| AdaGrad | 2011 | ✓ | — | 1× | Sparse / NLP |
| RMSProp | 2012 | ✓ | — | 1× | RNNs |
| Adam | 2014 | ✓ | ✓ | 2× | Default / General |
| AdamW | 2019 | ✓ | ✓ | 2× | Transformers (GPT, BERT) |
| Lion | 2023 | Sign | ✓ | 1× | Memory-constrained |
| Sophia | 2023 | 2nd-order | ✓ | 2× | LLM pre-training speed |