What is Regularization?

L2 Regularization (Ridge / Weight Decay) — A Beginner-Friendly Deep Dive

Overfitting is when a model memorizes noise instead of learning patterns. L2 regularization (ridge / weight decay) fixes a surprising amount of it with one idea: gently penalize large weights. In this guide we take it slow: a kid-level analogy, tiny numeric examples, the exact math, the geometry, the Bayesian story, how to pick λ, and working Python code in numpy, scikit-learn, and PyTorch.

Explain Like I’m 5: Imagine each model weight is a knob tied to a little rubber band that pulls it toward zero. If the knob goes too far, the band pulls back. This keeps the whole model from doing wild zig-zags, so it behaves better on new data.

1) Why do we need L2?


[Figure: a model overfitting the training data vs. the smoother fit obtained with regularization]

Plain English. Overfitting = great on training, poor on new data. It happens when the model uses very large, twitchy weights to chase noise.

Fix. Add a small penalty for large weights; the model still fits the data but prefers simpler settings unless the data strongly demands otherwise.

Visual intuition: Without L2, a high-degree polynomial can wiggle through every point. With L2, it smooths out and follows the trend.

2) What L2 regularization actually is


We add the squared size of the weights to the loss. A common definition (convenient for gradients) includes a factor of 1/2:

$$ \boxed{ \mathcal{L}_\lambda(\mathbf{w}) = \mathcal{L}_\text{data}(\mathbf{w}) \;+\; \frac{\lambda}{2}\,\|\mathbf{w}\|_2^2 } $$
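To make the formula concrete, here is a minimal NumPy sketch of that penalized loss for a linear model, using the \(1/(2n)\) data-loss convention of the next section (the names `l2_penalized_loss`, `w`, and `lam` are just illustrative):

import numpy as np

def l2_penalized_loss(w, X, y, lam):
    """Data loss (1/(2n) * squared error) plus the (lambda/2) * ||w||^2 penalty."""
    residual = y - X @ w
    data_loss = 0.5 * np.mean(residual ** 2)   # (1/(2n)) * ||y - Xw||^2
    penalty = 0.5 * lam * np.sum(w ** 2)       # (lambda/2) * ||w||_2^2
    return data_loss + penalty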

3) From OLS to Ridge: the step-by-step math


Ordinary Least Squares (OLS). With design matrix \(X\in\mathbb{R}^{n\times p}\) and targets \(\mathbf{y}\in\mathbb{R}^n\):

$$ \mathcal{L}_\text{OLS}(\mathbf{w})=\frac{1}{2n}\,\|\mathbf{y}-X\mathbf{w}\|_2^2 \quad\Rightarrow\quad \nabla \mathcal{L}_\text{OLS}=\frac{1}{n}X^\top(X\mathbf{w}-\mathbf{y}). $$

Ridge (OLS + L2). Add \(\frac{\lambda}{2}\|\mathbf{w}\|^2\):

$$ \mathcal{L}_\lambda(\mathbf{w})=\frac{1}{2n}\,\|\mathbf{y}-X\mathbf{w}\|^2+\frac{\lambda}{2}\|\mathbf{w}\|^2, \quad \nabla \mathcal{L}_\lambda=\frac{1}{n}X^\top(X\mathbf{w}-\mathbf{y})+\lambda\mathbf{w}. $$

Setting the gradient to zero gives the closed form:

$$ \boxed{\;\hat{\mathbf{w}}_\text{ridge}=(X^\top X+n\lambda I)^{-1}X^\top\mathbf{y}\;} $$

If you define OLS without the \(1/n\), the formula becomes \((X^\top X+\lambda I)^{-1}X^\top y\). Both are common; we’ll keep the \(n\lambda\) version consistent with the \(1/(2n)\) normalization above.
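A minimal NumPy implementation of that closed form (the helper name `ridge_closed_form` is ours; we keep the \(n\lambda\) convention above):

import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + n*lam*I) w = X^T y, matching the 1/(2n) loss convention."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)  # solve() is more stable than an explicit inverse

# Quick check: lam = 0 recovers OLS (when X^T X is invertible); lam > 0 shrinks the weights
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_closed_form(X, y, lam=0.0))
print(ridge_closed_form(X, y, lam=0.1))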


4) Gradient update & “weight decay” (why the weights shrink)


Gradient descent with learning rate \(\eta\) on \(\mathcal{L}_\lambda\) gives:

$$ \mathbf{w}\leftarrow \mathbf{w}-\eta\left(\frac{1}{n}X^\top(X\mathbf{w}-\mathbf{y})+\lambda\mathbf{w}\right) \;=\; (1-\eta\lambda)\,\mathbf{w} \;-\; \eta\cdot \frac{1}{n}X^\top(X\mathbf{w}-\mathbf{y}). $$

Key effect. Every step multiplies \(\mathbf{w}\) by \((1-\eta\lambda)\). That’s why L2 in optimizers is often called weight decay.

PyTorch/AdamW note. AdamW decouples this decay from the adaptive gradient step, so the shrinkage stays clean: w ← w − η·(Adam update from ∇data) − η·wd·w, where wd is the weight_decay value (playing the role of λ).
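To feel the decay factor in code, here is a bare-bones gradient-descent loop written in the decoupled "shrink, then step" form (a sketch with made-up data and illustrative `eta`/`lam` values, not an optimizer implementation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(2)
eta, lam, n = 0.1, 0.5, len(y)
for step in range(200):
    grad_data = X.T @ (X @ w - y) / n            # gradient of the data loss only
    w = (1 - eta * lam) * w - eta * grad_data    # multiply by (1 - eta*lam), then take the data step
print(w)   # noticeably smaller than the OLS fit (roughly [3.0, -1.0])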

5) Geometry: ellipse meets circle (why solutions get “rounded”)


OLS loss contours are ellipses in weight space; the L2 constraint \(\|\mathbf{w}\|^2\le c\) is a circle (sphere in higher-D). The optimum is where the smallest ellipse first touches the circle—tangency.

[Figure: the OLS loss ellipse touching the L2 constraint circle at the tangent point]

L2 vs L1. L1 is a diamond and loves corners → exact zeros (sparsity). L2 is a circle and shrinks everything smoothly (rarely exact zero).
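You can watch that corner-seeking behaviour directly by comparing scikit-learn's Ridge and Lasso on the same toy data (illustrative data where only two of ten features matter; the alpha values are arbitrary):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)        # only the first two features are informative
y = X @ true_w + 0.5 * rng.normal(size=200)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # everything shrunk a little, nothing exactly zero
print(Lasso(alpha=0.5).fit(X, y).coef_)   # most coefficients land at exactly 0.0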


6) Bayesian view: Gaussian prior ⇒ ridge


Assume noise \( \varepsilon\sim\mathcal{N}(0,\sigma^2I) \) and a prior \( \mathbf{w}\sim\mathcal{N}(\mathbf{0},\tau^2 I) \). Maximizing the posterior (MAP) yields ridge with \( \lambda=\sigma^2/\tau^2 \):

$$ \min_{\mathbf{w}} \frac{1}{2\sigma^2}\|\mathbf{y}-X\mathbf{w}\|^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|^2 \quad\Longleftrightarrow\quad \lambda=\frac{\sigma^2}{\tau^2}. $$

Interpretation. Larger noise \(\sigma^2\) or stronger belief that weights are small (small \(\tau^2\)) → bigger \(\lambda\).
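If you would rather see it than derive it, here is a small numerical check that the Gaussian posterior mean equals the (no-\(1/n\)) ridge solution with \(\lambda=\sigma^2/\tau^2\) (the values of `sigma2` and `tau2` are made up):

import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 0.25, 4.0                      # assumed noise variance and prior variance
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + np.sqrt(sigma2) * rng.normal(size=100)

lam = sigma2 / tau2                           # lambda = sigma^2 / tau^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP / posterior mean: (X^T X / sigma^2 + I / tau^2)^{-1} X^T y / sigma^2
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(3) / tau2, X.T @ y / sigma2)
print(np.allclose(w_ridge, w_map))            # True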


7) SVD shrinkage: how ridge dampens weak directions


Let \(X=U\Sigma V^\top\) with singular values \(\sigma_1\ge\dots\ge\sigma_r>0\). In this basis the ridge solution is \(\hat{\mathbf{w}}_\text{ridge}=V\,\mathrm{diag}\!\big(\tfrac{\sigma_j}{\sigma_j^2+n\lambda}\big)U^\top\mathbf{y}\), so relative to OLS (which divides each component of \(U^\top\mathbf{y}\) by \(\sigma_j\)) every component is multiplied by:

$$ \text{shrink factor } \;\; \frac{\sigma_j^2}{\sigma_j^2+n\lambda} \quad\in(0,1). $$

Small \(\sigma_j\) (ill-conditioned directions) get shrunk most → better stability and less overfitting.
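A quick NumPy check that the SVD view and the closed form agree (illustrative random data; \(n\lambda\) convention as above):

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 4, 0.1
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s ** 2 + n * lam)) * (U.T @ y))          # shrink each singular direction
w_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
print(np.allclose(w_svd, w_closed))                            # True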


8) Choosing λ (practical and painless)


| Method | How | Good starting point |
| --- | --- | --- |
| Cross-validation | Scan λ on a log grid; pick the best validation score | \(10^{-4}\) … \(10^{1}\) |
| RidgeCV | Built-in scikit-learn cross-validated ridge | Decades grid, e.g., [1e-4, …, 10] |
| Empirical Bayes | Maximize marginal likelihood (advanced) | Use when you know noise priors |
Always standardize features first. Otherwise λ penalizes features unevenly because of scale differences.
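A hand-rolled version of the first row (scan a log grid with cross-validation) fits on one screen; the grid, pipeline, and toy data below are just one reasonable setup:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=120)

alphas = np.logspace(-4, 1, 20)                                   # decades grid
scores = [cross_val_score(make_pipeline(StandardScaler(), Ridge(alpha=a)),
                          X, y, cv=5).mean()                      # mean validation R^2
          for a in alphas]
print("best alpha:", alphas[int(np.argmax(scores))])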

9) Tiny numeric examples (feel the effect)


1D toy. Fit \(y\approx w x\) to two points: (x,y) = (1, 2) and (3, 2). OLS solution: \(w_\text{OLS}=\frac{\sum x_i y_i}{\sum x_i^2}=\frac{1\cdot2+3\cdot2}{1^2+3^2}=\frac{8}{10}=0.8\). Ridge with λ=2 (and our \(1/(2n)\) convention → \(n\lambda=4\)): \[ w_\text{ridge}=\frac{\sum x_i y_i}{\sum x_i^2+n\lambda}=\frac{8}{10+4}=0.571\ldots \] The ridge weight is smaller → predictions are less extreme.
Update step feeling. Suppose \(w=1.0\), \(\eta=0.1\), \(\lambda=0.5\). The decay factor is \(1-\eta\lambda=1-0.05=0.95\). Even before the gradient term, the weight shrinks to \(0.95\). With repeated steps, weights that the data doesn’t strongly support will keep shrinking.
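Both toy numbers are easy to reproduce in a few lines of plain Python:

x, y = [1.0, 3.0], [2.0, 2.0]
n, lam = len(x), 2.0
sxy = sum(xi * yi for xi, yi in zip(x, y))   # 8
sxx = sum(xi * xi for xi in x)               # 10
print(sxy / sxx)                             # 0.8      (OLS)
print(sxy / (sxx + n * lam))                 # 0.571... (ridge, n*lambda = 4)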

10) Hands-on (scikit-learn): overfit polynomial vs ridge/RidgeCV


import numpy as np, matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.pipeline import make_pipeline

# 1) Data: noisy sine
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 80)[:, None]
y = np.sin(X).ravel() + 0.5 * rng.normal(size=80)

# 2) Overfit baseline: degree-15 OLS (no regularization)
ols = make_pipeline(PolynomialFeatures(15, include_bias=False),
                    StandardScaler(),    # not regularization; just tames the huge x^15-scale features
                    LinearRegression())
ols.fit(X, y)

# 3) Ridge with a fixed alpha
ridge = make_pipeline(PolynomialFeatures(15, include_bias=False),
                      StandardScaler(),        # scale features -> crucial
                      Ridge(alpha=10.0, fit_intercept=True))
ridge.fit(X, y)

# 4) RidgeCV to pick alpha automatically (across decades)
alphas = np.logspace(-4, 1, 20)
ridgecv = make_pipeline(PolynomialFeatures(15, include_bias=False),
                        StandardScaler(),
                        RidgeCV(alphas=alphas))
ridgecv.fit(X, y)
print("Best alpha from CV:", ridgecv.named_steps['ridgecv'].alpha_)

# 5) Plot
Xp = np.linspace(0, 10, 400)[:, None]
plt.figure(figsize=(8,4))
plt.scatter(X, y, s=15, c='k', label='data')
plt.plot(Xp, ols.predict(Xp),    'r--', label='deg-15 OLS (overfit)')
plt.plot(Xp, ridge.predict(Xp),  'b-',  label='deg-15 Ridge (α=10)')
plt.plot(Xp, ridgecv.predict(Xp),'g-',  label='RidgeCV (best α)')
plt.legend(); plt.tight_layout(); plt.show()

What to look for. OLS wiggles between points; Ridge yields a smoother curve; RidgeCV often lands on a value similar to what you’d choose by eye.


11) Hands-on (PyTorch): weight decay with proper parameter groups


import torch, torch.nn as nn
torch.manual_seed(0)

# Simple dataset: y = 2x + noise
N = 256
x = torch.linspace(-3, 3, N).unsqueeze(1)
y = 2*x + 0.7*torch.randn_like(x)

# Tiny model
model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

# Parameter groups: decay weights, DON'T decay biases or norms
decay, no_decay = [], []
for name, p in model.named_parameters():
    if p.requires_grad:
        if name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)

optim = torch.optim.AdamW([
    {'params': decay,    'weight_decay': 1e-2},
    {'params': no_decay, 'weight_decay': 0.0}
], lr=1e-3)

loss_fn = nn.MSELoss()

for step in range(2000):
    optim.zero_grad()
    pred = model(x)
    loss = loss_fn(pred, y)
    loss.backward()
    optim.step()
    if step % 400 == 0:
        print(f"step {step:4d}  loss={loss.item():.4f}")

Why groups? Decaying biases/LayerNorm often hurts performance. Most training recipes decay only the “real capacity” weights (linear/conv kernels).


12) Common pitfalls & quick fixes


Rule of thumb: If validation error is noisy/unstable, try a slightly larger λ. If it’s stable but high, reduce λ (or increase model capacity).

13) Watch: StatQuest on Ridge




14) Mini-Glossary



15) References & Further Reading


  1. Hoerl & Kennard (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics.
  2. Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning, Ch. 3 & 7.
  3. Bishop (2006). Pattern Recognition and Machine Learning, Ch. 3 & 7.
  4. Loshchilov & Hutter (2019). Decoupled Weight Decay Regularization (AdamW).
  5. Wikipedia: Ridge regression, Tikhonov regularization, Bias–variance trade-off.

16) FAQ


Q1. Should I use L2 for classification (logistic regression)?

Yes. Add \(\frac{\lambda}{2}\|w\|^2\) to the logistic loss. It improves stability and generalization.
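In scikit-learn this is built in: LogisticRegression applies an L2 penalty by default, controlled by C, the inverse of the regularization strength (smaller C ≈ larger λ). A tiny sketch with made-up data:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

weak = LogisticRegression(penalty='l2', C=10.0, max_iter=1000).fit(X, y)    # light regularization
strong = LogisticRegression(penalty='l2', C=0.01, max_iter=1000).fit(X, y)  # heavy regularization
print(np.abs(weak.coef_).sum(), np.abs(strong.coef_).sum())                 # weights shrink as C drops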

Q2. Do I regularize the bias?

Usually no. In scikit-learn, that’s handled automatically; in PyTorch, exclude bias/LayerNorm from weight decay via parameter groups (see code above).

Q3. Is L2 better than L1?

Different goals. L2 = stability and smooth shrinkage (keeps most features). L1 = sparsity (feature selection). Elastic Net mixes both.

Q4. How large should λ be?

Start with cross-validation over decades (\(10^{-4}\) … \(10^{1}\)), standardize features, and let RidgeCV pick the winner.