Overfitting is when a model memorizes noise instead of learning patterns. L2 regularization (ridge / weight decay) fixes a surprising amount of it with one idea: gently penalize large weights. In this guide we take it slow: a kid-level analogy, tiny numeric examples, the exact math, the geometry, the Bayesian story, how to pick λ, and working Python code in numpy, scikit-learn, and PyTorch.
Plain English. Overfitting = great on training, poor on new data. It happens when the model uses very large, twitchy weights to chase noise.
Fix. Add a small penalty for large weights; the model still fits the data, but prefers simpler settings unless data begs otherwise.
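We add the squared size of the weights to the loss. A common definition (nice for gradients) uses a 1/2 factor:
\[ \mathcal{L}_\lambda(\mathbf{w}) = \mathcal{L}_{\text{data}}(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2 \]
Ordinary Least Squares (OLS). With design matrix \(X\in\mathbb{R}^{n\times p}\) and targets \(\mathbf{y}\in\mathbb{R}^n\):
\[ \mathcal{L}_{\text{data}}(\mathbf{w}) = \frac{1}{2n}\|X\mathbf{w}-\mathbf{y}\|^2 \]
Ridge (OLS + L2). Add \(\frac{\lambda}{2}\|\mathbf{w}\|^2\):
\[ \mathcal{L}_\lambda(\mathbf{w}) = \frac{1}{2n}\|X\mathbf{w}-\mathbf{y}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2 \]
Setting the gradient to zero gives the closed form:
\[ \hat{\mathbf{w}} = (X^\top X + n\lambda I)^{-1}X^\top\mathbf{y} \]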
If you define OLS without the \(1/n\), the formula becomes \((X^\top X+\lambda I)^{-1}X^\top y\). Both are common; we’ll keep the \(n\lambda\) version consistent with the \(1/(2n)\) normalization above.
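The closed form with the \(n\lambda\) term can be checked numerically. The following is a minimal numpy sketch on synthetic data; the helper name `ridge_closed_form` and the data are made up for illustration. It solves the linear system, then confirms that the gradient of the ridge loss is essentially zero at the solution.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + n*lam*I) w = X^T y, matching the 1/(2n) loss normalization."""
    n, p = X.shape
    A = X.T @ X + n * lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y)   # solve() is more stable than an explicit inverse

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 0.1
w_ridge = ridge_closed_form(X, y, lam)
w_ols = ridge_closed_form(X, y, 0.0)     # lam = 0 recovers OLS

# Gradient of (1/2n)||Xw - y||^2 + (lam/2)||w||^2 at the ridge solution
grad = X.T @ (X @ w_ridge - y) / len(y) + lam * w_ridge

print("ridge weights:", w_ridge)
print("max |gradient|:", np.abs(grad).max())
print("norms (OLS, ridge):", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The ridge weight vector always has a smaller norm than the OLS one, which is the shrinkage the penalty is meant to produce.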
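Gradient descent with learning rate \(\eta\) on \(\mathcal{L}_\lambda\) gives:
\[ \mathbf{w} \leftarrow \mathbf{w} - \eta\big(\nabla\mathcal{L}_{\text{data}}(\mathbf{w}) + \lambda\mathbf{w}\big) = (1-\eta\lambda)\,\mathbf{w} - \eta\,\nabla\mathcal{L}_{\text{data}}(\mathbf{w}) \]
where \(\mathcal{L}_{\text{data}}\) is the loss without the penalty.
Key effect. Every step multiplies \(\mathbf{w}\) by \((1-\eta\lambda)\) before applying the data gradient. That’s why L2 in optimizers is often called weight decay.
In optimizer notation: w ← w − η·(∇data) − η·wd·w, where wd is the weight decay value (it plays the role of λ).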
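To see the decay update in action, here is a small numpy sketch, assuming the \(1/(2n)\)-normalized squared loss and synthetic data. It runs plain gradient descent in the "shrink, then step" form and compares the result with the closed-form ridge solution.

```python
import numpy as np

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + 0.2 * rng.normal(size=n)

lam, eta = 0.05, 0.1
w = np.zeros(p)
for _ in range(2000):
    data_grad = X.T @ (X @ w - y) / n             # gradient of (1/2n)||Xw - y||^2
    w = (1.0 - eta * lam) * w - eta * data_grad   # decay the weights, then take the data step

# Closed-form ridge solution with the same normalization
w_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

print("gradient descent:", w)
print("closed form:     ", w_closed)
```

For plain gradient descent the two routes agree, so "L2 penalty" and "weight decay" describe the same procedure in this setting.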
OLS loss contours are ellipses in weight space; the L2 constraint \(\|\mathbf{w}\|^2\le c\) is a circle (sphere in higher-D). The constrained optimum is where the smallest ellipse first touches the circle, i.e. the point of tangency.
L2 vs L1. L1 is a diamond and loves corners → exact zeros (sparsity). L2 is a circle and shrinks everything smoothly (rarely exact zero).
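The corner-versus-circle difference can be made concrete in the special case of an orthonormal design (\(X^\top X = I\)), where both penalties have simple closed forms. The sketch below assumes the un-normalized loss \(\frac12\|\mathbf{y}-X\mathbf{w}\|^2\) and uses hypothetical OLS coefficients: ridge rescales every coefficient, while the L1 penalty soft-thresholds them.

```python
import numpy as np

w_ols = np.array([3.0, 0.4, -0.2, -1.5])   # hypothetical OLS coefficients
lam = 0.5

# With X^T X = I:
#   ridge: argmin 1/2||y - Xw||^2 + (lam/2)||w||^2  ->  w_ols / (1 + lam)
#   lasso: argmin 1/2||y - Xw||^2 + lam*||w||_1     ->  soft-threshold(w_ols, lam)
w_l2 = w_ols / (1.0 + lam)
w_l1 = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

print("L2:", w_l2)
print("L1:", w_l1)
print("exact zeros  L2:", int(np.sum(w_l2 == 0)), " L1:", int(np.sum(w_l1 == 0)))
```

Coefficients smaller in magnitude than `lam` are set exactly to zero by L1, whereas L2 only scales them down.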
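Assume noise \( \varepsilon\sim\mathcal{N}(0,\sigma^2I) \) and a prior \( \mathbf{w}\sim\mathcal{N}(\mathbf{0},\tau^2 I) \). Maximizing the posterior (MAP) is the same as minimizing a ridge objective with \( \lambda=\sigma^2/\tau^2 \):
\[ \hat{\mathbf{w}}_{\text{MAP}} = \arg\min_{\mathbf{w}} \; \frac{1}{2}\|\mathbf{y}-X\mathbf{w}\|^2 + \frac{\sigma^2}{2\tau^2}\|\mathbf{w}\|^2 \]
This is the sum-of-squares form (no \(1/n\)); under the \(1/(2n)\) normalization the same estimate corresponds to \(\lambda=\sigma^2/(n\tau^2)\).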
Interpretation. Larger noise \(\sigma^2\) or stronger belief that weights are small (small \(\tau^2\)) → bigger \(\lambda\).
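The MAP and ridge connection can be verified numerically. This sketch assumes values for \(\sigma\) and \(\tau\) and uses synthetic data. It computes the mode of the Gaussian posterior directly from its precision matrix and compares it with ridge in the sum-of-squares form using \(\lambda=\sigma^2/\tau^2\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
sigma, tau = 0.5, 2.0                      # assumed noise std and prior std
X = rng.normal(size=(n, p))
w_true = tau * rng.normal(size=p)          # weights drawn from the prior
y = X @ w_true + sigma * rng.normal(size=n)

# Posterior mode (equal to the posterior mean, since the posterior is Gaussian)
precision = X.T @ X / sigma**2 + np.eye(p) / tau**2
w_map = np.linalg.solve(precision, X.T @ y / sigma**2)

# Ridge in the sum-of-squares form with lam = sigma^2 / tau^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("MAP:  ", w_map)
print("ridge:", w_ridge)
```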
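Let \(X=U\Sigma V^\top\) with singular values \(\sigma_1\ge\dots\ge\sigma_r\) (these are singular values of \(X\), not the noise level \(\sigma\) from the Bayesian view). In this basis, and with the \(n\lambda\) convention, ridge scales the OLS component along the \(j\)-th singular direction by:
\[ \frac{\sigma_j^2}{\sigma_j^2 + n\lambda} \]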
Small \(\sigma_j\) (ill-conditioned directions) get shrunk most → better stability and less overfitting.
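A short numpy sketch of the SVD view follows, assuming the \(n\lambda\) convention so that the shrinkage factor is \(\sigma_j^2/(\sigma_j^2+n\lambda)\). The data is synthetic, with one nearly collinear column to create a small singular value. The ridge solution is assembled directly from the SVD and compared with the closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=n)   # nearly collinear column
y = rng.normal(size=n)

lam = 0.1
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is sorted largest first
shrink = s**2 / (s**2 + n * lam)                  # per-direction shrinkage factor

# Ridge solution from the SVD: V diag(s / (s^2 + n*lam)) U^T y
w_svd = Vt.T @ ((s / (s**2 + n * lam)) * (U.T @ y))
w_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

print("singular values:  ", s)
print("shrinkage factors:", shrink)
print("solutions match:  ", np.allclose(w_svd, w_closed))
```

The factor for the nearly collinear direction comes out close to zero, while well-determined directions keep a factor close to one.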
| Method | How | Good starting point |
|---|---|---|
| Cross-validation | Scan λ on a log grid; pick best validation score | \(10^{-4}\) … \(10^{1}\) |
| RidgeCV | Built-in scikit-learn cross-validated ridge | Decades grid, e.g., [1e-4, …, 10] |
| Empirical Bayes | Maximize marginal likelihood (advanced) | Use when you know noise priors |
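The cross-validation row of the table can be sketched in plain numpy. The helpers `ridge_fit` and `cv_mse` are hypothetical names, the data is synthetic and already standardized by construction, and the folds are simple contiguous blocks. A real workflow would typically shuffle the rows and fit an intercept as well.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge with the n*lam convention."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Mean validation MSE over k contiguous folds (rows assumed pre-shuffled)."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for val_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[val_idx] = False                    # hold this fold out
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 20))                     # few samples, many features
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=60)

lams = np.logspace(-4, 1, 11)                     # decades grid, 1e-4 ... 1e1
scores = [cv_mse(X, y, lam) for lam in lams]
best = lams[int(np.argmin(scores))]
print("best lambda:", best)
```

Scanning on a log grid matters because the useful range of λ spans several orders of magnitude; a linear grid would spend almost all of its points at the large end.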
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.pipeline import make_pipeline

# 1) Data: noisy sine
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 80)[:, None]
y = np.sin(X).ravel() + 0.5 * rng.normal(size=80)

# 2) Overfit baseline: degree-15 OLS (no regularization)
#    Same features and scaling as the ridge pipelines, so only the penalty differs.
ols = make_pipeline(PolynomialFeatures(15, include_bias=False),
                    StandardScaler(),
                    LinearRegression())
ols.fit(X, y)

# 3) Ridge with a fixed alpha
#    Note: scikit-learn's Ridge minimizes ||y - Xw||^2 + alpha*||w||^2,
#    so alpha corresponds to n*lambda in the notation used above.
ridge = make_pipeline(PolynomialFeatures(15, include_bias=False),
                      StandardScaler(),  # scale features -> crucial
                      Ridge(alpha=10.0, fit_intercept=True))
ridge.fit(X, y)

# 4) RidgeCV to pick alpha automatically (across decades)
alphas = np.logspace(-4, 1, 20)
ridgecv = make_pipeline(PolynomialFeatures(15, include_bias=False),
                        StandardScaler(),
                        RidgeCV(alphas=alphas))
ridgecv.fit(X, y)
print("Best alpha from CV:", ridgecv.named_steps['ridgecv'].alpha_)

# 5) Plot
Xp = np.linspace(0, 10, 400)[:, None]
plt.figure(figsize=(8, 4))
plt.scatter(X, y, s=15, c='k', label='data')
plt.plot(Xp, ols.predict(Xp), 'r--', label='deg-15 OLS (overfit)')
plt.plot(Xp, ridge.predict(Xp), 'b-', label='deg-15 Ridge (α=10)')
plt.plot(Xp, ridgecv.predict(Xp), 'g-', label='RidgeCV (best α)')
plt.legend()
plt.tight_layout()
plt.show()
What to look for. OLS wiggles between points; Ridge yields a smoother curve; RidgeCV often lands on a value similar to what you’d choose by eye.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Simple dataset: y = 2x + noise
N = 256
x = torch.linspace(-3, 3, N).unsqueeze(1)
y = 2 * x + 0.7 * torch.randn_like(x)

# Tiny model
model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

# Parameter groups: decay weights, DON'T decay biases or norms
# (this model has no norm layers; their parameters would also go in no_decay)
decay, no_decay = [], []
for name, p in model.named_parameters():
    if not p.requires_grad:
        continue
    if name.endswith(".bias"):
        no_decay.append(p)
    else:
        decay.append(p)

optimizer = torch.optim.AdamW([
    {'params': decay, 'weight_decay': 1e-2},
    {'params': no_decay, 'weight_decay': 0.0},
], lr=1e-3)

loss_fn = nn.MSELoss()
for step in range(2000):
    optimizer.zero_grad()
    pred = model(x)
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()
    if step % 400 == 0:
        print(f"step {step:4d} loss={loss.item():.4f}")
Why groups? Decaying biases/LayerNorm often hurts performance. Most training recipes decay only the “real capacity” weights (linear/conv kernels).
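A related point about the AdamW optimizer used above: with adaptive optimizers, adding an L2 term to the gradient is not the same as decaying the weights directly. The numpy sketch below is a simplified illustration, not PyTorch's implementation. It uses only the very first Adam step, where the bias-corrected moments reduce to the gradient and its square, and the weights and gradients are hypothetical values.

```python
import numpy as np

def adam_first_step(w, grad, lr=0.1, eps=1e-8):
    """First Adam step: bias-corrected moments equal grad and grad**2."""
    return w - lr * grad / (np.sqrt(grad**2) + eps)

w = np.array([5.0, 0.5])          # one large and one small weight
g = np.array([0.1, 0.1])          # hypothetical data gradient
lr, wd = 0.1, 0.1

# (a) L2 penalty folded into the gradient, then normalized by Adam
w_l2 = adam_first_step(w, g + wd * w, lr)

# (b) Decoupled decay (AdamW-style): shrink the weights, then a plain Adam step
w_decoupled = adam_first_step(w * (1.0 - lr * wd), g, lr)

print("L2 in gradient: ", w_l2, " moved by", w - w_l2)
print("decoupled decay:", w_decoupled, " moved by", w - w_decoupled)
```

In variant (a) both weights move by about `lr` regardless of their size, because Adam's normalization absorbs the penalty term. In variant (b) the large weight is additionally shrunk in proportion to its magnitude, which is the behavior the weight-decay update describes.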
Standardize features first: put StandardScaler() before ridge.
L2 also works for logistic regression. Add \(\frac{\lambda}{2}\|w\|^2\) to the logistic loss; it improves stability and generalization.
The bias (intercept) usually should not be regularized. In scikit-learn's ridge, that’s handled automatically; in PyTorch, exclude bias/LayerNorm parameters from weight decay via parameter groups (see code above).
L1 and L2 serve different goals. L2 = stability and smooth shrinkage (keeps most features). L1 = sparsity (feature selection). Elastic Net mixes both.
To pick λ, start with cross-validation over decades (\(10^{-4}\) … \(10^{1}\)), standardize features, and let RidgeCV pick the winner.
I remember a project early in my career where a model performed brilliantly on training data but fell apart completely in production. I spent days checking for data leakage, re-examining feature engineering, and questioning the entire pipeline, only to discover that adding a simple L2 penalty to the loss function solved the problem almost entirely. That experience fundamentally changed how I approach model development. Now, regularization is never an afterthought; it is part of my baseline configuration from the very first experiment.
One of the most common mistakes I see practitioners make is treating the regularization strength lambda as a minor hyperparameter to set once and forget. In reality, the optimal lambda depends heavily on the scale of your features, the size of your dataset, and even the specific optimizer you are using. I always standardize features before applying L2 regularization, and I use cross-validation across a wide logarithmic range to find the right strength. Another subtle point that catches people off guard: in modern deep learning, the distinction between L2 regularization and weight decay actually matters when you are using adaptive optimizers like Adam.
If there is one takeaway I hope readers get from this article, it is that regularization is not just a mathematical trick to make equations nicer. It encodes a genuine prior belief that simpler models generalize better, and developing an intuition for when and how to apply it is one of the most valuable skills a machine learning practitioner can build.