Gradient descent is the workhorse behind modern machine learning. In this guide we go slow and logical: first the goal and the idea of a loss, then the gradient and the famous one-line update, then safe learning-rate choices with tiny numeric examples and visuals. Only after that do we introduce practical variants (mini-batches, momentum, Adam) with clear definitions and code you can run.
Plain English. We want the model to make fewer mistakes. We turn “mistakes” into a number called the loss. Lower is better.
Idea. The gradient points in the direction of steepest increase. So we walk the opposite way to reduce the loss as quickly as possible.
At step \(t\), with learning rate \(\eta\), take a small step opposite the gradient of the loss \(f\):
\[
w_{t+1} \;=\; w_t - \eta\,\nabla f(w_t).
\]
Why minus? Gradients point uphill; we want downhill.
Heuristic that works: start small, increase until the loss gets bouncy, then back off a bit.
def f(w): return (w - 3)**2        # a 1D bowl with its minimum at w = 3
def grad(w): return 2 * (w - 3)    # its derivative

w, eta = 10.0, 0.1
for step in range(12):
    w -= eta * grad(w)             # the gradient descent update
    print(f"step {step:2d}: w={w:6.3f} loss={f(w):7.4f}")
Try \(\eta=1.2\) to see divergence—useful for intuition.
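To see why: for this quadratic the update gives \(w_{t+1}-3=(1-2\eta)(w_t-3)\), so with \(\eta=1.2\) each step multiplies the distance to the minimum by \(|1-2\cdot 1.2|=1.4\), and the iterates blow up.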
Reading the map. Lines are equal-loss contours; arrows point down (opposite gradient). In narrow valleys, steps can zig-zag—later we’ll fix this with momentum.
When features have very different scales (e.g., age vs income), the loss “bowl” is stretched—some directions are steep, others flat. Gradient descent then zig-zags and needs a tiny \(\eta\). Standardizing inputs (zero mean, unit variance) makes the bowl rounder and training smoother.
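A minimal sketch of the effect, using a hand-picked 2D quadratic (curvatures 25 and 1) as a stand-in for an unscaled feature pair; once the bowl is round, as standardization aims for, a roughly 10x larger \(\eta\) is stable:

import numpy as np

def run(curvatures, eta, steps=30):
    """Run plain gradient descent on f(w) = 0.5 * sum(c * w**2) and return the final loss."""
    w = np.array([1.0, 1.0])                    # start away from the optimum at the origin
    for _ in range(steps):
        w -= eta * curvatures * w               # the gradient of this quadratic is c * w
    return 0.5 * np.sum(curvatures * w**2)

# Stretched bowl (curvatures 25 and 1): eta must stay below 2/25, so the flat direction crawls.
print("stretched, eta=0.07:", run(np.array([25.0, 1.0]), eta=0.07))
# Round bowl: a ~10x larger eta is stable and the loss drops far faster in the same 30 steps.
print("round,     eta=0.7 :", run(np.array([1.0, 1.0]), eta=0.7))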
Full-batch gradient descent uses all \(n\) examples every step—accurate but slow. Stochastic GD uses a single example—fast but very noisy. Mini-batch GD (e.g., 64–256 samples) balances speed and stability.
The noise in the mini-batch gradient estimate \(g_t\) can even help escape saddle points and poor local minima.
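As a concrete (if minimal) sketch, here is one epoch of mini-batch SGD; the function name and the batch-gradient helper `grad(w, Xb, yb)` are illustrative, not from any library:

import numpy as np

def sgd_epoch(w, X, y, grad, eta=0.1, batch_size=64):
    """One pass over the data, updating w on each shuffled mini-batch."""
    idx = np.random.permutation(len(X))          # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        g = grad(w, X[batch], y[batch])          # noisy but cheap estimate of the full gradient
        w = w - eta * g
    return w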
Intuition. If you’ve been moving east, don’t instantly stop unless the slope forces you. Momentum damps zig-zags in narrow valleys.
Nesterov momentum peeks ahead a little before taking a step—often slightly better in practice.
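A minimal sketch of both updates in their standard textbook form (heavy-ball momentum and the "lookahead" formulation of Nesterov); `velocity`, `mu`, and the `grad` callable are illustrative names:

def momentum_step(w, velocity, grad, eta=0.1, mu=0.9):
    """Heavy-ball momentum: keep a decaying running direction and step along it."""
    velocity = mu * velocity - eta * grad(w)
    return w + velocity, velocity

def nesterov_step(w, velocity, grad, eta=0.1, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the 'peeked-ahead' point."""
    lookahead = w + mu * velocity
    velocity = mu * velocity - eta * grad(lookahead)
    return w + velocity, velocity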
Idea. Keep a moving average of gradients (like momentum) and also of squared gradients to scale each parameter’s step. Parameters with noisy gradients get smaller steps.
Popular defaults: \(\beta_1=0.9,\;\beta_2=0.999,\;\epsilon=10^{-8},\;\eta=10^{-3}\).
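A minimal sketch of a single Adam update with those defaults; the state `m`, `v`, and the step counter `t` (starting at 1) are carried between calls, and the names are illustrative:

import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of g and g**2, bias-corrected, per-parameter step size."""
    m = beta1 * m + (1 - beta1) * g          # first moment (like momentum)
    v = beta2 * v + (1 - beta2) * g**2       # second moment (squared gradients)
    m_hat = m / (1 - beta1**t)               # bias correction; t starts at 1
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v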
# Worked example: logistic regression trained with plain gradient descent
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# 1) Toy data
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)
scaler = StandardScaler()
X = scaler.fit_transform(X)
X = np.c_[np.ones(len(X)), X]   # add bias column of ones

# 2) Model & helpers
w = np.zeros(X.shape[1])

def sigmoid(z): return 1/(1+np.exp(-z))

def loss(w):
    z = X @ w
    p = sigmoid(z)
    eps = 1e-9                  # avoid log(0)
    return -np.mean(y*np.log(p+eps) + (1-y)*np.log(1-p+eps))

def grad(w):
    z = X @ w
    p = sigmoid(z)
    return X.T @ (p - y) / len(y)

# 3) Train
eta = 0.2
for step in range(800):
    w -= eta*grad(w)
    if step % 100 == 0:
        print(f"step {step:4d} loss={loss(w):.4f}")
print("weights:", w.round(3))
What you should see. Loss drops quickly then stabilizes; if it oscillates, lower \(\eta\); if it barely moves, increase \(\eta\) a bit.
If the gradient is Lipschitz-continuous with constant \(L\) and \(0<\eta<2/L\), each step decreases the loss unless you’re at a flat optimum:
\[
f(w_{t+1}) \;\le\; f(w_t) - \eta\Bigl(1-\tfrac{\eta L}{2}\Bigr)\,\|\nabla f(w_t)\|^2 .
\]
For a 1D quadratic \(f(w)=\tfrac{a}{2}(w-w^\star)^2\), the update is \(w_{t+1}-w^\star=(1-\eta a)(w_t-w^\star)\). Convergence requires \(|1-\eta a|<1\Rightarrow 0<\eta<2/a\).
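A quick numeric check on the earlier 1D example \(f(w)=(w-3)^2\), which is the \(a=2\), \(w^\star=3\) case, so with \(\eta=0.1\) the error should shrink by the factor \(1-\eta a=0.8\) on every step:

# f(w) = (w-3)**2 is the a = 2 case; each step should multiply the error by 1 - eta*a = 0.8.
eta, a, w_star = 0.1, 2.0, 3.0
w = 10.0
for step in range(3):
    w_next = w - eta * a * (w - w_star)             # one gradient step on the quadratic
    print(step, (w_next - w_star) / (w - w_star))   # prints ~0.8 every time
    w = w_next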
# Pseudo-code for a robust loop
init θ
optimizer = Adam(η=1e-3)            # or SGD+momentum once you’re comfortable
for epoch in range(E):
    for (x, y) in dataloader:       # mini-batches
        y_hat = model(x, θ)
        L = loss(y_hat, y)
        g = ∇θ L                    # backprop computes this in frameworks
        θ = optimizer.update(θ, g)
    # evaluate on validation set
    # adjust η with a scheduler if needed
    # early stop if validation loss worsens for K epochs
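For a concrete version, here is a minimal sketch assuming PyTorch is available; the synthetic data, the tiny model, and the patience of 5 epochs are illustrative choices, and for brevity it "validates" on the training data where a held-out split belongs:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative setup: 2 input features, binary labels made from a simple rule.
X = torch.randn(400, 2)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam with the defaults above

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for xb, yb in loader:                       # mini-batches
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                         # backprop computes the gradient
        optimizer.step()                        # the optimizer applies the update

    model.eval()
    with torch.no_grad():                       # "validation" on the training set here,
        val_loss = loss_fn(model(X), y).item()  # purely for brevity

    if val_loss < best_val:                     # early-stopping bookkeeping
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # stop after K bad epochs
            break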
Use an LR finder: start tiny, increase \(\eta\) exponentially, plot loss vs \(\eta\), pick the largest value before the curve turns up, then divide by 3–10.
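A toy sketch of that sweep on the 1D quadratic from earlier; a real LR finder increases \(\eta\) during a single pass over real mini-batches, but the loss-vs-\(\eta\) curve shows the same qualitative turn-up:

import numpy as np

def f(w): return (w - 3)**2
def grad(w): return 2 * (w - 3)

# Sweep eta geometrically, run a few steps from the same start, record the final loss.
for eta in 1e-4 * 10 ** np.arange(0, 5, 0.5):       # eta from 1e-4 up to ~3
    w = 10.0
    for _ in range(10):
        w -= eta * grad(w)
    print(f"eta={eta:10.5f}  loss={f(w):12.4g}")    # the loss turns up once eta is too large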
Adam or SGD with momentum? Adam is great for noisy/sparse gradients and fast starts. For some vision tasks, well-tuned SGD+momentum can match or beat Adam’s final accuracy. Try Adam first, then compare.
Why did my loss become NaN? Usually a too-large \(\eta\) or invalid math (e.g., log(0)). Lower \(\eta\), standardize inputs, add gradient clipping (e.g., clip the global \(\ell_2\) norm to 1), and add a small eps where needed.
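A minimal NumPy sketch of clipping the global \(\ell_2\) norm (frameworks ship their own utilities for this; the function name here is illustrative):

import numpy as np

def clip_by_global_norm(g, max_norm=1.0):
    """Rescale g so its l2 norm is at most max_norm; leave it untouched otherwise."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

# Usage inside the training loop: w -= eta * clip_by_global_norm(grad(w))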
Does batch size matter? Yes: larger batches give stabler gradients but may need warmup and schedule tweaks; smaller batches add helpful noise and can generalize well. Start with 64–256.