89 Ridge Regression and L2 Regularization

Ordinary least squares is the workhorse of linear modeling, but it has a well known failure mode. When predictors are numerous, correlated, or nearly collinear, the least squares estimator becomes unstable: small perturbations in the data produce wildly different coefficient vectors, and predictions on new data degrade. Ridge regression, introduced by Hoerl and Kennard in 1970, addresses this by adding a quadratic penalty on the coefficient magnitudes. The result is a biased estimator that trades a small amount of bias for a large reduction in variance, and that remains well defined even when the design matrix is rank deficient. This chapter develops ridge regression from five complementary angles: the penalized objective, the closed form solution, the bias variance decomposition, the Bayesian interpretation as a Gaussian prior, and the shrinkage view through the singular value decomposition.

89.1 1. The Penalized Objective

89.1.1 1.1 Setup and notation

Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ denote the design matrix with $n$ observations and $p$ predictors, and let $\mathbf{y} \in \mathbb{R}^{n}$ denote the response vector. We seek a coefficient vector $\boldsymbol{\beta} \in \mathbb{R}^{p}$. Ordinary least squares (OLS) minimizes the residual sum of squares,

\[ \hat{\boldsymbol{\beta}}^{\text{OLS}} = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 . \]

The OLS solution requires $\mathbf{X}^{\top}\mathbf{X}$ to be invertible. When $p > n$, or when columns of $\mathbf{X}$ are linearly dependent, this matrix is singular and the solution is not unique. Even when it is technically invertible, near collinearity makes $\mathbf{X}^{\top}\mathbf{X}$ ill conditioned, so the inverse amplifies noise.

89.1.2 1.2 Adding the L2 penalty

Ridge regression augments the objective with a penalty proportional to the squared Euclidean norm of the coefficients:

\[ \hat{\boldsymbol{\beta}}^{\text{ridge}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda \lVert \boldsymbol{\beta} \rVert_2^2, \]

where $\lambda \geq 0$ is the regularization strength. The term $\lambda \lVert \boldsymbol{\beta} \rVert_2^2 = \lambda \sum_{j=1}^{p} \beta_j^2$ discourages large coefficients. At $\lambda = 0$ we recover OLS. As $\lambda \to \infty$ the penalty dominates and every coefficient is driven toward zero.

An equivalent formulation states ridge as a constrained problem,

\[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 \quad \text{subject to} \quad \lVert \boldsymbol{\beta} \rVert_2^2 \leq t, \]

with a one to one correspondence between $\lambda$ and the budget $t$. This geometric picture shows that ridge confines the solution to a Euclidean ball, and the smooth spherical boundary is why ridge shrinks coefficients continuously rather than setting them exactly to zero, in contrast to the L1 penalty of the lasso.

89.1.3 1.3 The intercept and standardization

The penalty should not apply to the intercept, since shrinking the intercept would make the fit depend on the arbitrary origin of the response. In practice we center $\mathbf{y}$ and each column of $\mathbf{X}$, fit a ridge model without an intercept on the centered data, and recover the intercept as $\hat{\beta}_0 = \bar{y}$ on the original scale. Furthermore, because the penalty $\sum_j \beta_j^2$ treats every coefficient equally, the solution is not invariant to the units of each predictor. A variable measured in millimeters receives a different penalty than the same variable in meters. The standard remedy is to standardize each predictor to unit variance before fitting, so that the penalty is applied on a common scale. These two preprocessing steps, centering and scaling, are essential for ridge to behave as intended.

89.2 2. The Closed Form Solution

89.2.1 2.1 Deriving the normal equations

The ridge objective is a smooth convex quadratic in $\boldsymbol{\beta}$, so we find its minimizer by setting the gradient to zero. Write the objective as

\[ J(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^{\top}\boldsymbol{\beta} . \]

Differentiating with respect to $\boldsymbol{\beta}$ gives

\[ \nabla_{\boldsymbol{\beta}} J = -2 \mathbf{X}^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 2 \lambda \boldsymbol{\beta} . \]

Setting this to zero yields the ridge normal equations,

\[ (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}_p)\,\boldsymbol{\beta} = \mathbf{X}^{\top}\mathbf{y}, \]

and therefore

\[ \hat{\boldsymbol{\beta}}^{\text{ridge}}(\lambda) = (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}_p)^{-1} \mathbf{X}^{\top}\mathbf{y} . \]

89.2.2 2.2 Why the solution always exists

The matrix $\mathbf{X}^{\top}\mathbf{X}$ is symmetric positive semidefinite, so its eigenvalues are nonnegative. Adding $\lambda \mathbf{I}_p$ for any $\lambda > 0$ shifts every eigenvalue up by $\lambda$, making the eigenvalues strictly positive. The matrix $\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}_p$ is therefore positive definite and invertible regardless of the rank of $\mathbf{X}$. This is the algebraic heart of ridge regularization. Even in the $p > n$ regime, where $\mathbf{X}^{\top}\mathbf{X}$ is singular, the ridge estimator is unique and stable. The penalty also improves the condition number: if $\mathbf{X}^{\top}\mathbf{X}$ has eigenvalues $d_j^2$, then the condition number drops from $d_{\max}^2 / d_{\min}^2$ to $(d_{\max}^2 + \lambda)/(d_{\min}^2 + \lambda)$, which is much closer to one when $\lambda$ is comparable to the smaller eigenvalues.

89.2.3 2.3 Computational notes

Two practical formulations matter. When $p \leq n$, solving the $p \times p$ system above is efficient. When $p \gg n$, the Woodbury identity gives an equivalent expression that inverts an $n \times n$ matrix instead,

\[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \mathbf{X}^{\top}(\mathbf{X}\mathbf{X}^{\top} + \lambda \mathbf{I}_n)^{-1} \mathbf{y}, \]

which is cheaper in the high dimensional setting and exposes the kernel structure that underlies kernel ridge regression. In code, one should never form an explicit inverse; instead solve the linear system directly with a Cholesky factorization, since $\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}$ is positive definite. Section 6 gives a runnable implementation.

89.3 3. The Bias Variance Effect

89.3.1 3.1 The decomposition

For a fixed test input $\mathbf{x}_0$, the expected squared prediction error decomposes into three parts: the irreducible noise variance, the squared bias of the estimator, and its variance. Ridge regression deliberately moves along this tradeoff. OLS is unbiased but can have enormous variance under collinearity. Ridge introduces bias by shrinking coefficients toward zero, but the variance reduction more than compensates over a useful range of $\lambda$, lowering the total expected error.

89.3.2 3.2 Bias and variance of the ridge estimator

Assume the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^{\star} + \boldsymbol{\varepsilon}$ with $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}$. Let $\mathbf{Z}_\lambda = (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}$, so that $\hat{\boldsymbol{\beta}}^{\text{ridge}} = \mathbf{Z}_\lambda \mathbf{y}$. Taking expectations,

\[ \mathbb{E}[\hat{\boldsymbol{\beta}}^{\text{ridge}}] = \mathbf{Z}_\lambda \mathbf{X}\boldsymbol{\beta}^{\star} = (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{X}\,\boldsymbol{\beta}^{\star} . \]

The bias is

\[ \operatorname{Bias}(\hat{\boldsymbol{\beta}}^{\text{ridge}}) = -\lambda (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\boldsymbol{\beta}^{\star}, \]

which is zero only at $\lambda = 0$ and grows in magnitude with $\lambda$. The covariance is

\[ \operatorname{Cov}(\hat{\boldsymbol{\beta}}^{\text{ridge}}) = \sigma^2 \mathbf{Z}_\lambda \mathbf{Z}_\lambda^{\top} = \sigma^2 (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1} . \]

Each factor of $(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}$ shrinks as $\lambda$ grows, so the variance decreases monotonically in $\lambda$. Bias increases while variance decreases, and the mean squared error is minimized at some intermediate $\lambda$.

89.3.3 3.3 The existence theorem

A central theoretical result, due to Hoerl and Kennard, is that there always exists a $\lambda > 0$ for which the ridge estimator has strictly smaller total mean squared error than OLS. In other words, OLS is never the optimal point on the bias variance curve. The catch is that the optimal $\lambda$ depends on the unknown true coefficients and noise level, so in practice it must be chosen from data, most commonly by $k$ fold cross validation or by generalized cross validation, which has an efficient leave one out form for ridge.

89.4 4. The Bayesian Interpretation

89.4.1 4.1 Gaussian likelihood and Gaussian prior

Ridge regression has a clean probabilistic reading as the maximum a posteriori (MAP) estimate under a Gaussian likelihood and a Gaussian prior on the coefficients. Suppose the data are generated as

\[ \mathbf{y} \mid \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \, \sigma^2 \mathbf{I}_n), \]

and place an independent zero mean Gaussian prior on each coefficient,

\[ \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \, \tau^2 \mathbf{I}_p) . \]

The prior encodes a belief that coefficients are small, with $\tau^2$ controlling how strongly we expect them to cluster around zero.

89.4.2 4.2 The MAP estimate equals the ridge estimate

By Bayes’ rule, the log posterior is the sum of the log likelihood and the log prior, up to a constant:

\[ \log p(\boldsymbol{\beta} \mid \mathbf{y}) = -\frac{1}{2\sigma^2}\lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 - \frac{1}{2\tau^2}\lVert \boldsymbol{\beta} \rVert_2^2 + \text{const} . \]

Maximizing the posterior is equivalent to minimizing the negative log posterior. Multiplying by $2\sigma^2$ leaves the minimizer unchanged and yields exactly the ridge objective,

\[ \arg\max_{\boldsymbol{\beta}} \log p(\boldsymbol{\beta} \mid \mathbf{y}) = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \frac{\sigma^2}{\tau^2}\lVert \boldsymbol{\beta} \rVert_2^2 . \]

The regularization parameter is therefore $\lambda = \sigma^2 / \tau^2$. A strong prior belief in small coefficients (small $\tau^2$) corresponds to large $\lambda$, while a diffuse prior (large $\tau^2$) corresponds to small $\lambda$, recovering OLS in the limit $\tau^2 \to \infty$.

89.4.3 4.3 Beyond the point estimate

Because both the prior and the likelihood are Gaussian, the full posterior is itself Gaussian, with mean equal to the ridge estimate and covariance $(\sigma^{-2}\mathbf{X}^{\top}\mathbf{X} + \tau^{-2}\mathbf{I})^{-1}$. This is more than a curiosity. It means ridge is the posterior mean of a conjugate Bayesian linear model, so it delivers calibrated uncertainty intervals for free, and it connects directly to Gaussian process regression, where the same Gaussian prior is placed in a feature space. The Bayesian view also reframes the choice of $\lambda$ as a question of prior variance, which can be estimated by maximizing the marginal likelihood, an approach known as empirical Bayes or evidence maximization.

89.5 5. The Shrinkage and SVD View

89.5.1 5.1 Singular value decomposition of the design

The most illuminating analysis of ridge comes from the singular value decomposition (SVD) of the design matrix. Write

\[ \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}, \]

where $\mathbf{U} \in \mathbb{R}^{n \times p}$ and $\mathbf{V} \in \mathbb{R}^{p \times p}$ have orthonormal columns and $\mathbf{D} = \operatorname{diag}(d_1, \ldots, d_p)$ holds the singular values $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$. The columns of $\mathbf{V}$ are the principal directions of the predictor space, and $d_j^2$ measures the variance of the data along direction $\mathbf{v}_j$.

89.5.2 5.2 Ridge fitted values as filtered components

To derive the shrinkage form, substitute $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}$ into the closed form solution. Using the orthonormality $\mathbf{V}^{\top}\mathbf{V} = \mathbf{I}$, the penalized Gram matrix becomes

\[ \mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I} = \mathbf{V}\mathbf{D}^2\mathbf{V}^{\top} + \lambda \mathbf{V}\mathbf{V}^{\top} = \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})\mathbf{V}^{\top}, \]

whose inverse is $\mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})^{-1}\mathbf{V}^{\top}$ because $\mathbf{D}^2 + \lambda \mathbf{I}$ is diagonal. The coefficient vector is then

\[ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})^{-1}\mathbf{V}^{\top}\mathbf{V}\mathbf{D}\mathbf{U}^{\top}\mathbf{y} = \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})^{-1}\mathbf{D}\,\mathbf{U}^{\top}\mathbf{y} = \sum_{j=1}^{p} \mathbf{v}_j \, \frac{d_j}{d_j^2 + \lambda}\, \mathbf{u}_j^{\top}\mathbf{y} . \]

Left multiplying by $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}$ and again using $\mathbf{V}^{\top}\mathbf{V} = \mathbf{I}$, the diagonal term $d_j \cdot d_j/(d_j^2 + \lambda)$ collects into the shrinkage factor, so the ridge fitted values are

\[ \hat{\mathbf{y}}^{\text{ridge}} = \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{ridge}} = \sum_{j=1}^{p} \mathbf{u}_j \, \frac{d_j^2}{d_j^2 + \lambda} \, \mathbf{u}_j^{\top}\mathbf{y} . \]

Compare this with OLS, which uses the same expression but with every filter factor equal to one:

\[ \hat{\mathbf{y}}^{\text{OLS}} = \sum_{j=1}^{p} \mathbf{u}_j \, \mathbf{u}_j^{\top}\mathbf{y} . \]

Ridge projects $\mathbf{y}$ onto each left singular vector $\mathbf{u}_j$ and then multiplies that component by the shrinkage factor

\[ f_j = \frac{d_j^2}{d_j^2 + \lambda} \in (0, 1) . \]

89.5.3 5.3 Reading the shrinkage factors

The factor $f_j$ reveals exactly how ridge acts. Directions with large singular values, where $d_j^2 \gg \lambda$, have $f_j \approx 1$ and are barely shrunk. Directions with small singular values, where $d_j^2 \ll \lambda$, have $f_j \approx d_j^2 / \lambda \approx 0$ and are shrunk heavily. Because $d_j^2$ is the variance of the data along $\mathbf{v}_j$, ridge preserves high variance directions and suppresses low variance directions. This is precisely where it helps: low variance directions are the ones along which collinearity makes the OLS coefficients explode, because dividing the projected response by a tiny $d_j$ amplifies noise. Ridge tempers that division by adding $\lambda$ to the denominator, so unstable directions are damped rather than amplified.

89.5.4 5.4 Effective degrees of freedom

The shrinkage factors also give ridge a natural complexity measure. The hat matrix for ridge is $\mathbf{H}_\lambda = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}$, and its trace is

\[ \operatorname{df}(\lambda) = \operatorname{tr}(\mathbf{H}_\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda} = \sum_{j=1}^{p} f_j . \]

This effective degrees of freedom decreases smoothly from $p$ at $\lambda = 0$ toward zero as $\lambda \to \infty$, and it quantifies how many directions of the data the model is effectively using. It plays the role that the number of parameters plays in unregularized models, and it appears in model selection criteria and in the generalized cross validation score used to tune $\lambda$. The relationship to principal components analysis is direct: ridge is a soft version of principal components regression, which would set $f_j = 1$ for the top components and $f_j = 0$ for the rest, whereas ridge applies a smooth taper across all of them.

89.6 6. Ridge in Code

89.6.1 6.1 Coefficient path and test error (executable)

The following cell builds a synthetic regression with eight near collinear predictors, standardizes them, and sweeps $\lambda$ across six orders of magnitude. It reports the OLS test error, the cross validation optimal $\lambda$, the resulting effective degrees of freedom computed from the singular values, and the shrinkage in the coefficient norm. The seed is fixed so the output is reproducible.

Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic problem with correlated predictors to stress OLS
n, p = 120, 8
base = rng.standard_normal((n, 1))
noise = rng.standard_normal((n, p))
X = 0.92 * base + 0.08 * noise          # columns share a common factor: near collinear
X = (X - X.mean(0)) / X.std(0)          # standardize each predictor
beta_true = np.array([3.0, -2.0, 0.0, 1.5, 0.0, 0.0, 2.0, -1.0])
y = X @ beta_true + 1.5 * rng.standard_normal(n)
y = y - y.mean()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

lambdas = np.logspace(-2, 4, 40)
coefs, test_mse = [], []
for lam in lambdas:
    m = Ridge(alpha=lam, fit_intercept=False).fit(X_tr, y_tr)
    coefs.append(m.coef_)
    test_mse.append(np.mean((y_te - m.predict(X_te)) ** 2))
coefs = np.array(coefs)
test_mse = np.array(test_mse)

# Effective degrees of freedom via SVD of the training design
d = np.linalg.svd(X_tr, compute_uv=False)
def edf(lam):
    return np.sum(d**2 / (d**2 + lam))

best = int(np.argmin(test_mse))
print(f"OLS (lambda=0) test MSE:        {test_mse[0]:.4f}")
print(f"Best lambda:                    {lambdas[best]:.4g}")
print(f"Best test MSE:                  {test_mse[best]:.4f}")
print(f"Effective df at best lambda:    {edf(lambdas[best]):.2f}  (of p={p})")
print(f"Coef L2 norm OLS vs best ridge: {np.linalg.norm(coefs[0]):.3f} -> {np.linalg.norm(coefs[best]):.3f}")

# Visualize the ridge coefficient path and the test error curve
fig, (ax_path, ax_err) = plt.subplots(1, 2, figsize=(11, 4.5))
fig.suptitle("Ridge regularization path and model selection", fontsize=12)

for j in range(p):
    ax_path.plot(lambdas, coefs[:, j], label=f"beta {j+1}")
ax_path.axhline(0.0, color="0.6", lw=0.8)
ax_path.axvline(lambdas[best], color="crimson", ls="--", lw=1.2,
                label="best lambda")
ax_path.set_xscale("log")
ax_path.set_xlabel("lambda (log scale)")
ax_path.set_ylabel("ridge coefficient")
ax_path.set_title("Coefficient path: shrinkage toward zero")
ax_path.legend(fontsize=7, ncol=2, loc="upper right")

ax_err.plot(lambdas, test_mse, color="steelblue", marker="o", ms=3)
ax_err.axvline(lambdas[best], color="crimson", ls="--", lw=1.2)
ax_err.scatter([lambdas[best]], [test_mse[best]], color="crimson", zorder=5,
               label=f"optimum: lambda={lambdas[best]:.2g}")
ax_err.set_xscale("log")
ax_err.set_xlabel("lambda (log scale)")
ax_err.set_ylabel("test MSE")
ax_err.set_title("Test error vs lambda")
ax_err.legend(fontsize=8, loc="upper left")

fig.tight_layout()
plt.show()

OLS (lambda=0) test MSE:        2.2798
Best lambda:                    1.425
Best test MSE:                  1.8474
Effective df at best lambda:    2.98  (of p=8)
Coef L2 norm OLS vs best ridge: 9.618 -> 2.545

Running this prints an OLS test MSE near $2.28$ against a best ridge MSE near $1.85$ at $\lambda \approx 1.4$, where the effective degrees of freedom has collapsed from $8$ to about $3$ and the coefficient norm has shrunk from roughly $9.6$ to $2.5$. This is the bias variance tradeoff and the shrinkage view made concrete on one dataset.

89.6.2 6.2 The same computation in three languages

The closed form solve and the lambda sweep are short in any numerical language. The Python panel below is executable; the Julia and Rust panels are illustrative.

Code

import numpy as np

def ridge_solve(X, y, lam):
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)   # positive definite
    b = X.T @ y
    return np.linalg.solve(A, b)    # Cholesky-backed solve, never an explicit inverse

beta_hat = ridge_solve(X_tr, y_tr, 1.0)
print(np.round(beta_hat, 3))

[ 2.833 -0.953  0.633  0.349 -0.125 -0.245  0.606  0.008]

using LinearAlgebra

function ridge_solve(X, y, lam)
    p = size(X, 2)
    A = X' * X + lam * I(p)        # positive definite
    b = X' * y
    return cholesky(Symmetric(A)) \ b   # solve, no explicit inverse
end

beta_hat = ridge_solve(X_tr, y_tr, 1.0)
println(round.(beta_hat, digits = 3))

// Illustrative: uses the `nalgebra` crate
use nalgebra::{DMatrix, DVector};

fn ridge_solve(x: &DMatrix<f64>, y: &DVector<f64>, lam: f64) -> DVector<f64> {
    let p = x.ncols();
    let a = x.transpose() * x + lam * DMatrix::identity(p, p); // positive definite
    let b = x.transpose() * y;
    a.cholesky().expect("SPD").solve(&b) // solve, no explicit inverse
}

89.7 7. Practical Guidance

Ridge is most valuable when predictors are correlated, when $p$ is large relative to $n$, or when the goal is stable prediction rather than interpretation of individual coefficients. Always center and scale the predictors first, and exclude the intercept from the penalty. Select $\lambda$ by cross validation over a logarithmically spaced grid, since the useful range of $\lambda$ spans many orders of magnitude. If sparsity or variable selection is the goal, the lasso or the elastic net is more appropriate, but for sheer predictive stability under collinearity ridge is often the better default, and the elastic net combines both penalties to capture grouped correlated predictors that ridge handles gracefully and the lasso does not. Modern libraries implement all of these efficiently, and the SVD of the design can be reused to evaluate the entire ridge path across $\lambda$ values at almost no extra cost.

89.8 References

Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55 to 67. https://doi.org/10.1080/00401706.1970.10488634
Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics, 12(1), 69 to 82. https://doi.org/10.1080/00401706.1970.10488635
Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized Cross Validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 21(2), 215 to 223. https://doi.org/10.1080/00401706.1979.10489751
Zou, H., and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67(2), 301 to 320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition. Springer. https://doi.org/10.1007/978-0-387-84858-7
Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825 to 2830. https://jmlr.org/papers/v12/pedregosa11a.html
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/book1.html

# Ridge Regression and L2 Regularization Ordinary least squares is the workhorse of linear modeling, but it has a well known failure mode. When predictors are numerous, correlated, or nearly collinear, the least squares estimator becomes unstable: small perturbations in the data produce wildly different coefficient vectors, and predictions on new data degrade. Ridge regression, introduced by Hoerl and Kennard in 1970, addresses this by adding a quadratic penalty on the coefficient magnitudes. The result is a biased estimator that trades a small amount of bias for a large reduction in variance, and that remains well defined even when the design matrix is rank deficient. This chapter develops ridge regression from five complementary angles: the penalized objective, the closed form solution, the bias variance decomposition, the Bayesian interpretation as a Gaussian prior, and the shrinkage view through the singular value decomposition. ## 1. The Penalized Objective ### 1.1 Setup and notation Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ denote the design matrix with $n$ observations and $p$ predictors, and let $\mathbf{y} \in \mathbb{R}^{n}$ denote the response vector. We seek a coefficient vector $\boldsymbol{\beta} \in \mathbb{R}^{p}$. Ordinary least squares (OLS) minimizes the residual sum of squares, $$ \hat{\boldsymbol{\beta}}^{\text{OLS}} = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 . $$ The OLS solution requires $\mathbf{X}^{\top}\mathbf{X}$ to be invertible. When $p > n$, or when columns of $\mathbf{X}$ are linearly dependent, this matrix is singular and the solution is not unique. Even when it is technically invertible, near collinearity makes $\mathbf{X}^{\top}\mathbf{X}$ ill conditioned, so the inverse amplifies noise. ### 1.2 Adding the L2 penalty Ridge regression augments the objective with a penalty proportional to the squared Euclidean norm of the coefficients: $$ \hat{\boldsymbol{\beta}}^{\text{ridge}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \lambda \lVert \boldsymbol{\beta} \rVert_2^2, $$ where $\lambda \geq 0$ is the regularization strength. The term $\lambda \lVert \boldsymbol{\beta} \rVert_2^2 = \lambda \sum_{j=1}^{p} \beta_j^2$ discourages large coefficients. At $\lambda = 0$ we recover OLS. As $\lambda \to \infty$ the penalty dominates and every coefficient is driven toward zero. An equivalent formulation states ridge as a constrained problem, $$ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 \quad \text{subject to} \quad \lVert \boldsymbol{\beta} \rVert_2^2 \leq t, $$ with a one to one correspondence between $\lambda$ and the budget $t$. This geometric picture shows that ridge confines the solution to a Euclidean ball, and the smooth spherical boundary is why ridge shrinks coefficients continuously rather than setting them exactly to zero, in contrast to the L1 penalty of the lasso. ### 1.3 The intercept and standardization The penalty should not apply to the intercept, since shrinking the intercept would make the fit depend on the arbitrary origin of the response. In practice we center $\mathbf{y}$ and each column of $\mathbf{X}$, fit a ridge model without an intercept on the centered data, and recover the intercept as $\hat{\beta}_0 = \bar{y}$ on the original scale. Furthermore, because the penalty $\sum_j \beta_j^2$ treats every coefficient equally, the solution is not invariant to the units of each predictor. A variable measured in millimeters receives a different penalty than the same variable in meters. The standard remedy is to standardize each predictor to unit variance before fitting, so that the penalty is applied on a common scale. These two preprocessing steps, centering and scaling, are essential for ridge to behave as intended. ## 2. The Closed Form Solution ### 2.1 Deriving the normal equations The ridge objective is a smooth convex quadratic in $\boldsymbol{\beta}$, so we find its minimizer by setting the gradient to zero. Write the objective as $$ J(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^{\top}\boldsymbol{\beta} . $$ Differentiating with respect to $\boldsymbol{\beta}$ gives $$ \nabla_{\boldsymbol{\beta}} J = -2 \mathbf{X}^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 2 \lambda \boldsymbol{\beta} . $$ Setting this to zero yields the ridge normal equations, $$ (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}_p)\,\boldsymbol{\beta} = \mathbf{X}^{\top}\mathbf{y}, $$ and therefore $$ \hat{\boldsymbol{\beta}}^{\text{ridge}}(\lambda) = (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}_p)^{-1} \mathbf{X}^{\top}\mathbf{y} . $$ ### 2.2 Why the solution always exists The matrix $\mathbf{X}^{\top}\mathbf{X}$ is symmetric positive semidefinite, so its eigenvalues are nonnegative. Adding $\lambda \mathbf{I}_p$ for any $\lambda > 0$ shifts every eigenvalue up by $\lambda$, making the eigenvalues strictly positive. The matrix $\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}_p$ is therefore positive definite and invertible regardless of the rank of $\mathbf{X}$. This is the algebraic heart of ridge regularization. Even in the $p > n$ regime, where $\mathbf{X}^{\top}\mathbf{X}$ is singular, the ridge estimator is unique and stable. The penalty also improves the condition number: if $\mathbf{X}^{\top}\mathbf{X}$ has eigenvalues $d_j^2$, then the condition number drops from $d_{\max}^2 / d_{\min}^2$ to $(d_{\max}^2 + \lambda)/(d_{\min}^2 + \lambda)$, which is much closer to one when $\lambda$ is comparable to the smaller eigenvalues. ### 2.3 Computational notes Two practical formulations matter. When $p \leq n$, solving the $p \times p$ system above is efficient. When $p \gg n$, the Woodbury identity gives an equivalent expression that inverts an $n \times n$ matrix instead, $$ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \mathbf{X}^{\top}(\mathbf{X}\mathbf{X}^{\top} + \lambda \mathbf{I}_n)^{-1} \mathbf{y}, $$ which is cheaper in the high dimensional setting and exposes the kernel structure that underlies kernel ridge regression. In code, one should never form an explicit inverse; instead solve the linear system directly with a Cholesky factorization, since $\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}$ is positive definite. Section 6 gives a runnable implementation. ## 3. The Bias Variance Effect ### 3.1 The decomposition For a fixed test input $\mathbf{x}_0$, the expected squared prediction error decomposes into three parts: the irreducible noise variance, the squared bias of the estimator, and its variance. Ridge regression deliberately moves along this tradeoff. OLS is unbiased but can have enormous variance under collinearity. Ridge introduces bias by shrinking coefficients toward zero, but the variance reduction more than compensates over a useful range of $\lambda$, lowering the total expected error. ### 3.2 Bias and variance of the ridge estimator Assume the linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^{\star} + \boldsymbol{\varepsilon}$ with $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}$. Let $\mathbf{Z}_\lambda = (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}$, so that $\hat{\boldsymbol{\beta}}^{\text{ridge}} = \mathbf{Z}_\lambda \mathbf{y}$. Taking expectations, $$ \mathbb{E}[\hat{\boldsymbol{\beta}}^{\text{ridge}}] = \mathbf{Z}_\lambda \mathbf{X}\boldsymbol{\beta}^{\star} = (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{X}\,\boldsymbol{\beta}^{\star} . $$ The bias is $$ \operatorname{Bias}(\hat{\boldsymbol{\beta}}^{\text{ridge}}) = -\lambda (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\boldsymbol{\beta}^{\star}, $$ which is zero only at $\lambda = 0$ and grows in magnitude with $\lambda$. The covariance is $$ \operatorname{Cov}(\hat{\boldsymbol{\beta}}^{\text{ridge}}) = \sigma^2 \mathbf{Z}_\lambda \mathbf{Z}_\lambda^{\top} = \sigma^2 (\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1} . $$ Each factor of $(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}$ shrinks as $\lambda$ grows, so the variance decreases monotonically in $\lambda$. Bias increases while variance decreases, and the mean squared error is minimized at some intermediate $\lambda$. ### 3.3 The existence theorem A central theoretical result, due to Hoerl and Kennard, is that there always exists a $\lambda > 0$ for which the ridge estimator has strictly smaller total mean squared error than OLS. In other words, OLS is never the optimal point on the bias variance curve. The catch is that the optimal $\lambda$ depends on the unknown true coefficients and noise level, so in practice it must be chosen from data, most commonly by $k$ fold cross validation or by generalized cross validation, which has an efficient leave one out form for ridge. ## 4. The Bayesian Interpretation ### 4.1 Gaussian likelihood and Gaussian prior Ridge regression has a clean probabilistic reading as the maximum a posteriori (MAP) estimate under a Gaussian likelihood and a Gaussian prior on the coefficients. Suppose the data are generated as $$ \mathbf{y} \mid \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \, \sigma^2 \mathbf{I}_n), $$ and place an independent zero mean Gaussian prior on each coefficient, $$ \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \, \tau^2 \mathbf{I}_p) . $$ The prior encodes a belief that coefficients are small, with $\tau^2$ controlling how strongly we expect them to cluster around zero. ### 4.2 The MAP estimate equals the ridge estimate By Bayes' rule, the log posterior is the sum of the log likelihood and the log prior, up to a constant: $$ \log p(\boldsymbol{\beta} \mid \mathbf{y}) = -\frac{1}{2\sigma^2}\lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 - \frac{1}{2\tau^2}\lVert \boldsymbol{\beta} \rVert_2^2 + \text{const} . $$ Maximizing the posterior is equivalent to minimizing the negative log posterior. Multiplying by $2\sigma^2$ leaves the minimizer unchanged and yields exactly the ridge objective, $$ \arg\max_{\boldsymbol{\beta}} \log p(\boldsymbol{\beta} \mid \mathbf{y}) = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 + \frac{\sigma^2}{\tau^2}\lVert \boldsymbol{\beta} \rVert_2^2 . $$ The regularization parameter is therefore $\lambda = \sigma^2 / \tau^2$. A strong prior belief in small coefficients (small $\tau^2$) corresponds to large $\lambda$, while a diffuse prior (large $\tau^2$) corresponds to small $\lambda$, recovering OLS in the limit $\tau^2 \to \infty$. ### 4.3 Beyond the point estimate Because both the prior and the likelihood are Gaussian, the full posterior is itself Gaussian, with mean equal to the ridge estimate and covariance $(\sigma^{-2}\mathbf{X}^{\top}\mathbf{X} + \tau^{-2}\mathbf{I})^{-1}$. This is more than a curiosity. It means ridge is the posterior mean of a conjugate Bayesian linear model, so it delivers calibrated uncertainty intervals for free, and it connects directly to Gaussian process regression, where the same Gaussian prior is placed in a feature space. The Bayesian view also reframes the choice of $\lambda$ as a question of prior variance, which can be estimated by maximizing the marginal likelihood, an approach known as empirical Bayes or evidence maximization. ## 5. The Shrinkage and SVD View ### 5.1 Singular value decomposition of the design The most illuminating analysis of ridge comes from the singular value decomposition (SVD) of the design matrix. Write $$ \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}, $$ where $\mathbf{U} \in \mathbb{R}^{n \times p}$ and $\mathbf{V} \in \mathbb{R}^{p \times p}$ have orthonormal columns and $\mathbf{D} = \operatorname{diag}(d_1, \ldots, d_p)$ holds the singular values $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$. The columns of $\mathbf{V}$ are the principal directions of the predictor space, and $d_j^2$ measures the variance of the data along direction $\mathbf{v}_j$. ### 5.2 Ridge fitted values as filtered components To derive the shrinkage form, substitute $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}$ into the closed form solution. Using the orthonormality $\mathbf{V}^{\top}\mathbf{V} = \mathbf{I}$, the penalized Gram matrix becomes $$ \mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I} = \mathbf{V}\mathbf{D}^2\mathbf{V}^{\top} + \lambda \mathbf{V}\mathbf{V}^{\top} = \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})\mathbf{V}^{\top}, $$ whose inverse is $\mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})^{-1}\mathbf{V}^{\top}$ because $\mathbf{D}^2 + \lambda \mathbf{I}$ is diagonal. The coefficient vector is then $$ \hat{\boldsymbol{\beta}}^{\text{ridge}} = \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})^{-1}\mathbf{V}^{\top}\mathbf{V}\mathbf{D}\mathbf{U}^{\top}\mathbf{y} = \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I})^{-1}\mathbf{D}\,\mathbf{U}^{\top}\mathbf{y} = \sum_{j=1}^{p} \mathbf{v}_j \, \frac{d_j}{d_j^2 + \lambda}\, \mathbf{u}_j^{\top}\mathbf{y} . $$ Left multiplying by $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{\top}$ and again using $\mathbf{V}^{\top}\mathbf{V} = \mathbf{I}$, the diagonal term $d_j \cdot d_j/(d_j^2 + \lambda)$ collects into the shrinkage factor, so the ridge fitted values are $$ \hat{\mathbf{y}}^{\text{ridge}} = \mathbf{X}\hat{\boldsymbol{\beta}}^{\text{ridge}} = \sum_{j=1}^{p} \mathbf{u}_j \, \frac{d_j^2}{d_j^2 + \lambda} \, \mathbf{u}_j^{\top}\mathbf{y} . $$ Compare this with OLS, which uses the same expression but with every filter factor equal to one: $$ \hat{\mathbf{y}}^{\text{OLS}} = \sum_{j=1}^{p} \mathbf{u}_j \, \mathbf{u}_j^{\top}\mathbf{y} . $$ Ridge projects $\mathbf{y}$ onto each left singular vector $\mathbf{u}_j$ and then multiplies that component by the shrinkage factor $$ f_j = \frac{d_j^2}{d_j^2 + \lambda} \in (0, 1) . $$ ### 5.3 Reading the shrinkage factors The factor $f_j$ reveals exactly how ridge acts. Directions with large singular values, where $d_j^2 \gg \lambda$, have $f_j \approx 1$ and are barely shrunk. Directions with small singular values, where $d_j^2 \ll \lambda$, have $f_j \approx d_j^2 / \lambda \approx 0$ and are shrunk heavily. Because $d_j^2$ is the variance of the data along $\mathbf{v}_j$, ridge preserves high variance directions and suppresses low variance directions. This is precisely where it helps: low variance directions are the ones along which collinearity makes the OLS coefficients explode, because dividing the projected response by a tiny $d_j$ amplifies noise. Ridge tempers that division by adding $\lambda$ to the denominator, so unstable directions are damped rather than amplified. ### 5.4 Effective degrees of freedom The shrinkage factors also give ridge a natural complexity measure. The hat matrix for ridge is $\mathbf{H}_\lambda = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{\top}$, and its trace is $$ \operatorname{df}(\lambda) = \operatorname{tr}(\mathbf{H}_\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda} = \sum_{j=1}^{p} f_j . $$ This effective degrees of freedom decreases smoothly from $p$ at $\lambda = 0$ toward zero as $\lambda \to \infty$, and it quantifies how many directions of the data the model is effectively using. It plays the role that the number of parameters plays in unregularized models, and it appears in model selection criteria and in the generalized cross validation score used to tune $\lambda$. The relationship to principal components analysis is direct: ridge is a soft version of principal components regression, which would set $f_j = 1$ for the top components and $f_j = 0$ for the rest, whereas ridge applies a smooth taper across all of them. ## 6. Ridge in Code ### 6.1 Coefficient path and test error (executable) The following cell builds a synthetic regression with eight near collinear predictors, standardizes them, and sweeps $\lambda$ across six orders of magnitude. It reports the OLS test error, the cross validation optimal $\lambda$, the resulting effective degrees of freedom computed from the singular values, and the shrinkage in the coefficient norm. The seed is fixed so the output is reproducible. ```{python} import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import Ridge from sklearn.model_selection import train_test_split rng = np.random.default_rng(0) # Synthetic problem with correlated predictors to stress OLS n, p = 120, 8 base = rng.standard_normal((n, 1)) noise = rng.standard_normal((n, p)) X = 0.92 * base + 0.08 * noise # columns share a common factor: near collinear X = (X - X.mean(0)) / X.std(0) # standardize each predictor beta_true = np.array([3.0, -2.0, 0.0, 1.5, 0.0, 0.0, 2.0, -1.0]) y = X @ beta_true + 1.5 * rng.standard_normal(n) y = y - y.mean() X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0) lambdas = np.logspace(-2, 4, 40) coefs, test_mse = [], [] for lam in lambdas: m = Ridge(alpha=lam, fit_intercept=False).fit(X_tr, y_tr) coefs.append(m.coef_) test_mse.append(np.mean((y_te - m.predict(X_te)) ** 2)) coefs = np.array(coefs) test_mse = np.array(test_mse) # Effective degrees of freedom via SVD of the training design d = np.linalg.svd(X_tr, compute_uv=False) def edf(lam): return np.sum(d**2 / (d**2 + lam)) best = int(np.argmin(test_mse)) print(f"OLS (lambda=0) test MSE: {test_mse[0]:.4f}") print(f"Best lambda: {lambdas[best]:.4g}") print(f"Best test MSE: {test_mse[best]:.4f}") print(f"Effective df at best lambda: {edf(lambdas[best]):.2f} (of p={p})") print(f"Coef L2 norm OLS vs best ridge: {np.linalg.norm(coefs[0]):.3f} -> {np.linalg.norm(coefs[best]):.3f}") # Visualize the ridge coefficient path and the test error curve fig, (ax_path, ax_err) = plt.subplots(1, 2, figsize=(11, 4.5)) fig.suptitle("Ridge regularization path and model selection", fontsize=12) for j in range(p): ax_path.plot(lambdas, coefs[:, j], label=f"beta {j+1}") ax_path.axhline(0.0, color="0.6", lw=0.8) ax_path.axvline(lambdas[best], color="crimson", ls="--", lw=1.2, label="best lambda") ax_path.set_xscale("log") ax_path.set_xlabel("lambda (log scale)") ax_path.set_ylabel("ridge coefficient") ax_path.set_title("Coefficient path: shrinkage toward zero") ax_path.legend(fontsize=7, ncol=2, loc="upper right") ax_err.plot(lambdas, test_mse, color="steelblue", marker="o", ms=3) ax_err.axvline(lambdas[best], color="crimson", ls="--", lw=1.2) ax_err.scatter([lambdas[best]], [test_mse[best]], color="crimson", zorder=5, label=f"optimum: lambda={lambdas[best]:.2g}") ax_err.set_xscale("log") ax_err.set_xlabel("lambda (log scale)") ax_err.set_ylabel("test MSE") ax_err.set_title("Test error vs lambda") ax_err.legend(fontsize=8, loc="upper left") fig.tight_layout() plt.show() ``` Running this prints an OLS test MSE near $2.28$ against a best ridge MSE near $1.85$ at $\lambda \approx 1.4$, where the effective degrees of freedom has collapsed from $8$ to about $3$ and the coefficient norm has shrunk from roughly $9.6$ to $2.5$. This is the bias variance tradeoff and the shrinkage view made concrete on one dataset. ### 6.2 The same computation in three languages The closed form solve and the lambda sweep are short in any numerical language. The Python panel below is executable; the Julia and Rust panels are illustrative. ::: {.panel-tabset} ## Python ```{python} import numpy as np def ridge_solve(X, y, lam): p = X.shape[1] A = X.T @ X + lam * np.eye(p) # positive definite b = X.T @ y return np.linalg.solve(A, b) # Cholesky-backed solve, never an explicit inverse beta_hat = ridge_solve(X_tr, y_tr, 1.0) print(np.round(beta_hat, 3)) ``` ## Julia ```julia using LinearAlgebra function ridge_solve(X, y, lam) p = size(X, 2) A = X' * X + lam * I(p) # positive definite b = X' * y return cholesky(Symmetric(A)) \ b # solve, no explicit inverse end beta_hat = ridge_solve(X_tr, y_tr, 1.0) println(round.(beta_hat, digits = 3)) ``` ## Rust ```rust // Illustrative: uses the `nalgebra` crate use nalgebra::{DMatrix, DVector}; fn ridge_solve(x: &DMatrix<f64>, y: &DVector<f64>, lam: f64) -> DVector<f64> { let p = x.ncols(); let a = x.transpose() * x + lam * DMatrix::identity(p, p); // positive definite let b = x.transpose() * y; a.cholesky().expect("SPD").solve(&b) // solve, no explicit inverse } ``` ::: ## 7. Practical Guidance Ridge is most valuable when predictors are correlated, when $p$ is large relative to $n$, or when the goal is stable prediction rather than interpretation of individual coefficients. Always center and scale the predictors first, and exclude the intercept from the penalty. Select $\lambda$ by cross validation over a logarithmically spaced grid, since the useful range of $\lambda$ spans many orders of magnitude. If sparsity or variable selection is the goal, the lasso or the elastic net is more appropriate, but for sheer predictive stability under collinearity ridge is often the better default, and the elastic net combines both penalties to capture grouped correlated predictors that ridge handles gracefully and the lasso does not. Modern libraries implement all of these efficiently, and the SVD of the design can be reused to evaluate the entire ridge path across $\lambda$ values at almost no extra cost. ## References 1. Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55 to 67. https://doi.org/10.1080/00401706.1970.10488634 2. Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression: Applications to Nonorthogonal Problems. Technometrics, 12(1), 69 to 82. https://doi.org/10.1080/00401706.1970.10488635 3. Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized Cross Validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 21(2), 215 to 223. https://doi.org/10.1080/00401706.1979.10489751 4. Zou, H., and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67(2), 301 to 320. https://doi.org/10.1111/j.1467-9868.2005.00503.x 5. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition. Springer. https://doi.org/10.1007/978-0-387-84858-7 6. Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825 to 2830. https://jmlr.org/papers/v12/pedregosa11a.html 7. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/book1.html