88 Polynomial and Basis-Function Regression

Linear regression is often introduced as a tool for fitting straight lines, which makes it easy to dismiss as too rigid for the curved, nonlinear relationships that pervade real data. This view is a misreading. The word “linear” in linear regression refers to linearity in the parameters, not in the inputs. Once we accept that distinction, an entire family of expressive models opens up. We can transform the raw inputs through fixed nonlinear functions, called basis functions, and then fit a model that is linear in the transformed coordinates. Polynomials are the canonical first example, but the same machinery extends to splines, radial bases, and Fourier features. This chapter develops the theory and practice of basis-function regression, with polynomials as the running thread, and confronts head on the central difficulty these models introduce: as flexibility grows, the danger of overfitting grows with it.

88.1 1. From Linear Models to Basis Functions

88.1.1 1.1 Linearity in the parameters

Consider a scalar input $x$ and a target $y$. Ordinary linear regression posits $y \approx \beta_0 + \beta_1 x$. The polynomial extension simply augments the input with powers of $x$:

\[ y \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d . \]

This is still a linear model in the statistical sense, because the prediction is a linear combination of the parameters $\beta_0, \ldots, \beta_d$. The nonlinearity lives entirely in the fixed transformation of $x$. More generally, we choose a set of basis functions $\phi_0(x), \phi_1(x), \ldots, \phi_{M-1}(x)$ and write

\[ f(x) = \sum_{j=0}^{M-1} \beta_j \, \phi_j(x) = \boldsymbol{\phi}(x)^\top \boldsymbol{\beta} . \]

For polynomial regression, $\phi_j(x) = x^j$. For a Fourier basis, the $\phi_j$ are sines and cosines. For radial basis functions, each $\phi_j(x) = \exp(-\|x - c_j\|^2 / 2\ell^2)$ is a bump centered at some location $c_j$. The estimation problem is identical in every case once we form the design matrix.

88.1.2 1.2 The design matrix and least squares

Given $n$ observations, stack the basis evaluations into a design matrix $\boldsymbol{\Phi} \in \mathbb{R}^{n \times M}$ whose row $i$ is $\boldsymbol{\phi}(x_i)^\top$. The ordinary least squares estimate minimizes the residual sum of squares:

\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \| \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\beta} \|_2^2 = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}, \]

assuming $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is invertible. The fitted values are $\hat{\mathbf{y}} = \boldsymbol{\Phi} \hat{\boldsymbol{\beta}} = \mathbf{H}\mathbf{y}$, where $\mathbf{H} = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top$ is the hat matrix, an orthogonal projection onto the column space of $\boldsymbol{\Phi}$. Its trace, $\operatorname{tr}(\mathbf{H}) = M$, counts the effective degrees of freedom of the unregularized fit. This quantity will reappear as the knob that governs the bias-variance tradeoff.

The crucial conceptual point is that no part of the estimation machinery knows or cares that the columns of $\boldsymbol{\Phi}$ are powers of $x$. We have reduced a nonlinear curve-fitting problem to a linear algebra problem of exactly the same shape as multiple linear regression.

# Conceptual construction of a degree-d polynomial design matrix.
# The companion library packages exactly this as polynomial_design + fit_ridge
# (see Section 7); the explicit form below shows there is no hidden machinery.
Phi = np.column_stack([x**j for j in range(d + 1)])
beta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]

88.2 2. Polynomial Features and Their Pathologies

88.2.1 2.1 Multivariate polynomial expansion

With a single input the polynomial basis is just the sequence of powers. With $p$ inputs, a degree $d$ polynomial includes all monomials $x_1^{a_1} x_2^{a_2} \cdots x_p^{a_p}$ with $\sum_k a_k \le d$. The number of such terms is $\binom{p+d}{d}$, which explodes combinatorially. For $p = 10$ inputs and degree $d = 3$ we already have $286$ features, and at degree $5$ we exceed three thousand. This rapid growth is the first practical warning sign: polynomial expansion in many dimensions is rarely tractable, which is one reason splines and localized bases dominate in practice.

88.2.2 2.2 Ill conditioning of the raw power basis

The naive monomial basis $1, x, x^2, \ldots, x^d$ is numerically treacherous. On an interval such as $[1, 2]$ the columns $x^j$ become nearly collinear as $j$ grows, because high powers all look like steeply rising curves. The matrix $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ then has a condition number that grows exponentially in $d$, and the least squares solution becomes wildly sensitive to rounding and noise. The fix is to use an orthogonal polynomial basis, such as Legendre or Chebyshev polynomials, or to orthogonalize the columns numerically. These produce identical fitted values to the monomial basis in exact arithmetic but vastly better numerical stability.

# Prefer an orthogonal polynomial basis over raw powers
from numpy.polynomial import chebyshev
coef = chebyshev.chebfit(x, y, deg=d)

A second standard precaution is to center and scale the inputs to roughly $[-1, 1]$ before forming powers, which keeps the magnitudes of $x^j$ comparable and further tames the conditioning.

88.2.3 2.3 Runge’s phenomenon

Even with perfect arithmetic, high-degree global polynomials misbehave. Runge’s classic example fits the smooth function $1/(1 + 25x^2)$ on $[-1, 1]$ with polynomials interpolating equally spaced points. As the degree increases, the interpolant does not converge; instead it develops violent oscillations near the endpoints whose amplitude grows without bound. This is not a numerical artifact but a genuine property of global polynomial interpolation on uniform grids. Runge’s phenomenon is the analytic heart of why high-degree polynomials overfit: a single global function with many degrees of freedom must contort itself to pass through or near every point, and those contortions appear as oscillation. The lesson motivates two remedies pursued later in the chapter, namely localized basis functions (splines) and regularization.

88.3 3. The Bias-Variance Tradeoff with Degree

88.3.1 3.1 Decomposition of expected error

Suppose the data are generated by $y = g(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. For an estimator $\hat{f}$ trained on a random sample, the expected squared error at a fixed test point $x_0$ decomposes exactly as

\[ \mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big] = \underbrace{\big(g(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible}} . \]

The polynomial degree $d$ is the lever that trades these terms against one another. A low degree gives a rigid model that cannot follow the true curvature, so bias dominates and the model underfits. A high degree gives a flexible model whose coefficients are estimated from finite noisy data, so its predictions swing from sample to sample and variance dominates, the model overfits. The irreducible term $\sigma^2$ is a floor that no model can cross.

88.3.2 3.2 Variance grows with effective degrees of freedom

The variance term has a clean form in the basis-function setting. Under homoscedastic noise the covariance of the fitted values is $\sigma^2 \mathbf{H}$, and summing the pointwise variances over the training inputs gives

\[ \sum_{i=1}^n \operatorname{Var}\big(\hat{f}(x_i)\big) = \sigma^2 \operatorname{tr}(\mathbf{H}) = \sigma^2 M . \]

The total in-sample variance is exactly proportional to the number of basis functions $M$, which for a degree $d$ polynomial is $d + 1$. Each parameter we add buys flexibility at a fixed price in variance. This is the quantitative content of the tradeoff: increasing $d$ reduces bias by enlarging the space of representable functions but inflates variance linearly. The optimal degree minimizes the sum, and it depends on the sample size $n$, the noise level $\sigma^2$, and the smoothness of $g$.

88.3.3 3.3 Selecting the degree

Because the optimum cannot be read off the training error, which decreases monotonically with $d$, degree selection must use a quantity that estimates out-of-sample error. The standard tool is $k$-fold cross-validation: partition the data, fit each candidate degree on the training folds, and average the held-out error. The degree minimizing cross-validated error is chosen. Information criteria such as AIC and BIC offer a cheaper alternative by penalizing the fitted log-likelihood with a term proportional to $M$; BIC’s heavier penalty $M \log n$ tends to select simpler models than AIC’s $2M$. In modern practice, cross-validation is preferred when computationally feasible because it makes the fewest assumptions about the noise distribution.

# Degree selection by cross-validated error
for d in candidate_degrees:
    cv_err[d] = mean(cross_val_score(make_poly_pipeline(d), X, y))
best_d = argmin(cv_err)

88.4 4. Splines and Piecewise Polynomials

88.4.1 4.1 The case for piecewise construction

Runge’s phenomenon teaches that a single global polynomial is the wrong object when the data are wiggly or numerous. The remedy is to abandon one global function and instead stitch together low-degree polynomials over local regions. We partition the input range with a set of breakpoints called knots, $t_1 < t_2 < \cdots < t_K$, and fit a separate polynomial on each subinterval. Without further constraints this produces a discontinuous, ragged fit, so we impose smoothness conditions at the knots. The result keeps the local adaptivity of piecewise fitting while presenting a globally smooth curve.

88.4.2 4.2 Cubic and natural cubic splines

A cubic spline uses degree three polynomials on each subinterval and requires that the function and its first and second derivatives are continuous across every knot. Cubic degree is the lowest that allows continuous curvature, which is why it is the default in applications; the eye perceives discontinuities in the second derivative as visible kinks. A cubic spline with $K$ interior knots has $K + 4$ degrees of freedom. Near the boundaries, an unconstrained cubic spline can still behave erratically because data are sparse there. A natural cubic spline adds the constraint that the function is linear beyond the boundary knots, that is, the second derivative vanishes at the extremes. This removes four degrees of freedom and dramatically stabilizes the tails, trading a small amount of flexibility for a large gain in reliability exactly where global polynomials fail worst.

88.4.3 4.3 The B-spline basis

Splines fit cleanly into the basis-function framework. The space of splines of a given degree and knot sequence is spanned by a set of B-spline basis functions, each of which is nonzero only over a few adjacent subintervals. This local support is the source of two practical virtues. First, the design matrix is banded and well conditioned, free of the global ill conditioning that plagues the monomial basis. Second, moving or adding a knot affects the fit only locally, so a spike in one region does not corrupt the curve elsewhere, which is precisely the failure mode of Runge oscillation. Given a B-spline design matrix $\boldsymbol{\Phi}$, fitting proceeds by the same least squares solve as before.

88.4.4 4.4 Knot placement and regression splines

The flexibility of a regression spline is controlled by the number and placement of knots. More knots mean more degrees of freedom, more flexibility, and the familiar march toward variance. A common heuristic places knots at quantiles of the observed inputs so that each subinterval contains a comparable number of points, concentrating flexibility where data are dense. The number of knots is then chosen by cross-validation, exactly as the polynomial degree was. Regression splines thus inherit the bias-variance structure of polynomials but partition the difficulty locally, which is why a moderate spline routinely outperforms a high-degree global polynomial on the same data.

88.5 5. Regularized Polynomial Fits

88.5.1 5.1 Penalizing coefficient magnitude

Rather than restrict flexibility by limiting the number of basis functions, we can include many basis functions and then penalize the coefficients to discourage overfitting. Ridge regression, also called Tikhonov regularization, augments the least squares objective with a squared $\ell_2$ penalty:

\[ \hat{\boldsymbol{\beta}}_\lambda = \arg\min_{\boldsymbol{\beta}} \; \| \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\beta} \|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2, \qquad \hat{\boldsymbol{\beta}}_\lambda = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}. \]

The penalty parameter $\lambda \ge 0$ tunes the strength of shrinkage. At $\lambda = 0$ we recover ordinary least squares; as $\lambda \to \infty$ the coefficients are driven toward zero and the fit flattens. Crucially, ridge also cures the ill conditioning of Section 2.2: adding $\lambda \mathbf{I}$ shifts every eigenvalue of $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ upward by $\lambda$, guaranteeing invertibility and bounding the condition number. A high-degree polynomial that is hopeless under OLS often becomes well behaved under even mild ridge shrinkage.

88.5.2 5.2 Effective degrees of freedom and the continuous tradeoff

Regularization converts the discrete choice of degree into a continuous dial. The effective degrees of freedom of a ridge fit are

\[ \mathrm{df}(\lambda) = \operatorname{tr}\big(\boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1}\boldsymbol{\Phi}^\top\big) = \sum_{j} \frac{s_j^2}{s_j^2 + \lambda}, \]

where the $s_j$ are the singular values of $\boldsymbol{\Phi}$. As $\lambda$ ranges from $0$ to $\infty$, $\mathrm{df}(\lambda)$ decreases smoothly from $M$ to $0$. Where unregularized polynomials adjusted variance by integer jumps in $M$, ridge sweeps the same axis continuously, letting us land precisely on the bias-variance minimum. The penalty introduces bias by shrinking coefficients, but the variance reduction usually more than compensates, which is why a regularized high-degree fit frequently beats the best unregularized degree.

88.5.3 5.3 Sparsity and the smoothing spline connection

Replacing the $\ell_2$ penalty with an $\ell_1$ penalty gives the lasso, $\|\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1$, which drives some coefficients exactly to zero and performs basis selection. With a rich basis this yields a parsimonious model that keeps only the terms the data support. A related and elegant idea is the smoothing spline, which penalizes the integrated squared second derivative of the fit:

\[ \arg\min_{f} \; \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2 \, dx . \]

Remarkably, the minimizer over all twice-differentiable functions is a natural cubic spline with knots at every distinct data point. The penalty, not the knot count, controls smoothness, so we sidestep knot placement entirely and tune a single $\lambda$ by cross-validation. The smoothing spline thus unifies the two strands of this chapter, the spline construction of Section 4 and the regularization of Section 5, into one estimator.

88.5.4 5.4 Choosing the penalty

The penalty $\lambda$ is selected the same way the degree was, by cross-validation. Generalized cross-validation offers an efficient shortcut that approximates leave-one-out error using the trace of the hat matrix, avoiding repeated refitting. The practical workflow that emerges is consistent across every model in this chapter: choose a flexible basis, control its effective complexity with a tunable parameter, and set that parameter by estimating out-of-sample error. The basis functions change, but the discipline does not.

88.6 6. Practical Guidance

When a relationship is mildly curved and the data are modest, a low-degree polynomial on centered and scaled inputs, fit through an orthogonal basis, is simple and effective. When the relationship is wiggly or the sample is large, prefer splines, which localize flexibility and avoid Runge oscillation; natural cubic splines or smoothing splines are reliable defaults. When you want many features without committing to a degree, use a generous basis with ridge or lasso regularization and tune the penalty by cross-validation. Across all three routes, never trust training error to choose complexity, always center and scale inputs to protect conditioning, and remember that the irreducible noise $\sigma^2$ sets a hard floor that no amount of flexibility can breach. Basis-function regression is powerful precisely because it decouples the choice of expressive features from the linear estimation that follows, but that same expressiveness is what makes principled complexity control non-negotiable.

88.7 7. Reference Implementation

The companion libraries ship a small, tested basis-regression estimator in all three languages of this book: the Python package aiinaction, the Julia package AIInAction, and the Rust crate aiinaction. Each exposes the same surface: polynomial_design and rbf_design build a design matrix $\boldsymbol{\Phi}$, fit_ridge solves $(\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \lambda\mathbf{I})\boldsymbol{\beta} = \boldsymbol{\Phi}^\top\mathbf{y}$, predict evaluates $\boldsymbol{\Phi}\boldsymbol{\beta}$, and effective_dof reports $\mathrm{df}(\lambda) = \sum_j s_j^2/(s_j^2 + \lambda)$. The convenience BasisRegression type ties a basis and a penalty together with fit and predict. All three implementations are verified against the same numeric fixtures, so the coefficients below are identical to the last digit across languages.

The running example places five points exactly on the parabola $y = 1 + 2x - x^2$. A degree-two ordinary least squares fit therefore recovers the coefficients $[1, 2, -1]$ exactly, the fitted values reproduce the targets, and the effective degrees of freedom equal the three basis functions. Adding a ridge penalty $\lambda = 1$ shrinks the coefficients and lowers the effective degrees of freedom below three, the continuous complexity dial of Section 5.2 in action.

Code

from aiinaction.ch083_basis_regression import (
    polynomial_design, fit_ridge, predict, effective_dof, BasisRegression,
)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-7.0, -2.0, 1.0, 2.0, 1.0]  # lies exactly on 1 + 2x - x^2

phi = polynomial_design(x, degree=2)
beta_ols = fit_ridge(phi, y, penalty=0.0)
print("OLS coefficients :", [round(float(b), 6) for b in beta_ols])
print("Fitted values    :", [round(float(v), 6) for v in predict(phi, beta_ols)])
print("df(lambda=0)     :", round(effective_dof(phi, 0.0), 6))

beta_ridge = fit_ridge(phi, y, penalty=1.0)
print("Ridge coefficients:", [round(float(b), 6) for b in beta_ridge])
print("df(lambda=1)      :", round(effective_dof(phi, 1.0), 6))

# The convenience estimator: fit then extrapolate to x = 3.
model = BasisRegression(degree=2, penalty=0.0).fit(x, y)
print("Prediction at x=3 :", round(float(model.predict([3.0])[0]), 6))

OLS coefficients : [1.0, 2.0, -1.0]
Fitted values    : [-7.0, -2.0, 1.0, 2.0, 1.0]
df(lambda=0)     : 3.0
Ridge coefficients: [0.590909, 1.818182, -0.854545]
df(lambda=1)      : 2.536364
Prediction at x=3 : -2.0

using AIInAction.Ch083BasisRegression

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-7.0, -2.0, 1.0, 2.0, 1.0]  # lies exactly on 1 + 2x - x^2

phi = polynomial_design(x, 2)
beta_ols = fit_ridge(phi, y, 0.0)
println("OLS coefficients : ", round.(beta_ols, digits=6))
println("Fitted values    : ", round.(predict(phi, beta_ols), digits=6))
println("df(lambda=0)     : ", round(effective_dof(phi, 0.0), digits=6))

beta_ridge = fit_ridge(phi, y, 1.0)
println("Ridge coefficients: ", round.(beta_ridge, digits=6))
println("df(lambda=1)      : ", round(effective_dof(phi, 1.0), digits=6))

model = BasisRegression(degree=2, penalty=0.0)
Ch083BasisRegression.fit!(model, x, y)
println("Prediction at x=3 : ", round(predict(model, [3.0])[1], digits=6))

use aiinaction::ch083_basis_regression::{
    polynomial_design, fit_ridge, predict, effective_dof,
};

fn main() {
    let x = [-2.0, -1.0, 0.0, 1.0, 2.0];
    let y = [-7.0, -2.0, 1.0, 2.0, 1.0]; // lies exactly on 1 + 2x - x^2

    let phi = polynomial_design(&x, 2).unwrap();
    let beta_ols = fit_ridge(&phi, &y, 0.0).unwrap();
    println!("OLS coefficients : {:?}", beta_ols);       // [1.0, 2.0, -1.0]
    println!("Fitted values    : {:?}", predict(&phi, &beta_ols).unwrap());
    println!("df(lambda=0)     : {}", effective_dof(&phi, 0.0).unwrap()); // 3.0

    let beta_ridge = fit_ridge(&phi, &y, 1.0).unwrap();
    println!("Ridge coefficients: {:?}", beta_ridge);
    println!("df(lambda=1)      : {}", effective_dof(&phi, 1.0).unwrap());
}

The Python tab executes when the book is rendered; the Julia and Rust tabs are the verbatim, parity-tested equivalents. Install the package for your language and the snippets run as written. Readers who want to go deeper can swap polynomial_design for rbf_design to fit the Gaussian radial basis of Section 1.1 with the identical fit_ridge solve.

88.8 References

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd edition. Springer, 2009. https://hastie.su.domains/ElemStatLearn/
James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning, 2nd edition. Springer, 2021. https://www.statlearning.com/
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
de Boor, C. A Practical Guide to Splines, revised edition. Springer, 2001. https://link.springer.com/book/9780387953663
Wahba, G. Spline Models for Observational Data. SIAM, 1990. https://epubs.siam.org/doi/book/10.1137/1.9781611970128
Trefethen, L. N. Approximation Theory and Approximation Practice. SIAM, 2013. https://people.maths.ox.ac.uk/trefethen/ATAP/
Runge, C. Uber empirische Funktionen und die Interpolation zwischen aquidistanten Ordinaten. Zeitschrift fur Mathematik und Physik, 1901. https://archive.org/details/zeitschriftfrma12runggoog
Hoerl, A. E., and Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 1970. https://www.tandfonline.com/doi/abs/10.1080/00401706.1970.10488634
Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B, 1996. https://www.jstor.org/stable/2346178
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011. https://jmlr.org/papers/v12/pedregosa11a.html

# Polynomial and Basis-Function Regression Linear regression is often introduced as a tool for fitting straight lines, which makes it easy to dismiss as too rigid for the curved, nonlinear relationships that pervade real data. This view is a misreading. The word "linear" in linear regression refers to linearity in the parameters, not in the inputs. Once we accept that distinction, an entire family of expressive models opens up. We can transform the raw inputs through fixed nonlinear functions, called basis functions, and then fit a model that is linear in the transformed coordinates. Polynomials are the canonical first example, but the same machinery extends to splines, radial bases, and Fourier features. This chapter develops the theory and practice of basis-function regression, with polynomials as the running thread, and confronts head on the central difficulty these models introduce: as flexibility grows, the danger of overfitting grows with it. ## 1. From Linear Models to Basis Functions ### 1.1 Linearity in the parameters Consider a scalar input $x$ and a target $y$. Ordinary linear regression posits $y \approx \beta_0 + \beta_1 x$. The polynomial extension simply augments the input with powers of $x$: $$ y \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d . $$ This is still a linear model in the statistical sense, because the prediction is a linear combination of the parameters $\beta_0, \ldots, \beta_d$. The nonlinearity lives entirely in the fixed transformation of $x$. More generally, we choose a set of basis functions $\phi_0(x), \phi_1(x), \ldots, \phi_{M-1}(x)$ and write $$ f(x) = \sum_{j=0}^{M-1} \beta_j \, \phi_j(x) = \boldsymbol{\phi}(x)^\top \boldsymbol{\beta} . $$ For polynomial regression, $\phi_j(x) = x^j$. For a Fourier basis, the $\phi_j$ are sines and cosines. For radial basis functions, each $\phi_j(x) = \exp(-\|x - c_j\|^2 / 2\ell^2)$ is a bump centered at some location $c_j$. The estimation problem is identical in every case once we form the design matrix. ### 1.2 The design matrix and least squares Given $n$ observations, stack the basis evaluations into a design matrix $\boldsymbol{\Phi} \in \mathbb{R}^{n \times M}$ whose row $i$ is $\boldsymbol{\phi}(x_i)^\top$. The ordinary least squares estimate minimizes the residual sum of squares: $$ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \| \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\beta} \|_2^2 = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}, $$ assuming $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is invertible. The fitted values are $\hat{\mathbf{y}} = \boldsymbol{\Phi} \hat{\boldsymbol{\beta}} = \mathbf{H}\mathbf{y}$, where $\mathbf{H} = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top$ is the hat matrix, an orthogonal projection onto the column space of $\boldsymbol{\Phi}$. Its trace, $\operatorname{tr}(\mathbf{H}) = M$, counts the effective degrees of freedom of the unregularized fit. This quantity will reappear as the knob that governs the bias-variance tradeoff. The crucial conceptual point is that no part of the estimation machinery knows or cares that the columns of $\boldsymbol{\Phi}$ are powers of $x$. We have reduced a nonlinear curve-fitting problem to a linear algebra problem of exactly the same shape as multiple linear regression. ```python # Conceptual construction of a degree-d polynomial design matrix. # The companion library packages exactly this as polynomial_design + fit_ridge # (see Section 7); the explicit form below shows there is no hidden machinery. Phi = np.column_stack([x**j for j in range(d + 1)]) beta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0] ``` ## 2. Polynomial Features and Their Pathologies ### 2.1 Multivariate polynomial expansion With a single input the polynomial basis is just the sequence of powers. With $p$ inputs, a degree $d$ polynomial includes all monomials $x_1^{a_1} x_2^{a_2} \cdots x_p^{a_p}$ with $\sum_k a_k \le d$. The number of such terms is $\binom{p+d}{d}$, which explodes combinatorially. For $p = 10$ inputs and degree $d = 3$ we already have $286$ features, and at degree $5$ we exceed three thousand. This rapid growth is the first practical warning sign: polynomial expansion in many dimensions is rarely tractable, which is one reason splines and localized bases dominate in practice. ### 2.2 Ill conditioning of the raw power basis The naive monomial basis $1, x, x^2, \ldots, x^d$ is numerically treacherous. On an interval such as $[1, 2]$ the columns $x^j$ become nearly collinear as $j$ grows, because high powers all look like steeply rising curves. The matrix $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ then has a condition number that grows exponentially in $d$, and the least squares solution becomes wildly sensitive to rounding and noise. The fix is to use an orthogonal polynomial basis, such as Legendre or Chebyshev polynomials, or to orthogonalize the columns numerically. These produce identical fitted values to the monomial basis in exact arithmetic but vastly better numerical stability. ```python # Prefer an orthogonal polynomial basis over raw powers from numpy.polynomial import chebyshev coef = chebyshev.chebfit(x, y, deg=d) ``` A second standard precaution is to center and scale the inputs to roughly $[-1, 1]$ before forming powers, which keeps the magnitudes of $x^j$ comparable and further tames the conditioning. ### 2.3 Runge's phenomenon Even with perfect arithmetic, high-degree global polynomials misbehave. Runge's classic example fits the smooth function $1/(1 + 25x^2)$ on $[-1, 1]$ with polynomials interpolating equally spaced points. As the degree increases, the interpolant does not converge; instead it develops violent oscillations near the endpoints whose amplitude grows without bound. This is not a numerical artifact but a genuine property of global polynomial interpolation on uniform grids. Runge's phenomenon is the analytic heart of why high-degree polynomials overfit: a single global function with many degrees of freedom must contort itself to pass through or near every point, and those contortions appear as oscillation. The lesson motivates two remedies pursued later in the chapter, namely localized basis functions (splines) and regularization. ## 3. The Bias-Variance Tradeoff with Degree ### 3.1 Decomposition of expected error Suppose the data are generated by $y = g(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. For an estimator $\hat{f}$ trained on a random sample, the expected squared error at a fixed test point $x_0$ decomposes exactly as $$ \mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big] = \underbrace{\big(g(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible}} . $$ The polynomial degree $d$ is the lever that trades these terms against one another. A low degree gives a rigid model that cannot follow the true curvature, so bias dominates and the model underfits. A high degree gives a flexible model whose coefficients are estimated from finite noisy data, so its predictions swing from sample to sample and variance dominates, the model overfits. The irreducible term $\sigma^2$ is a floor that no model can cross. ### 3.2 Variance grows with effective degrees of freedom The variance term has a clean form in the basis-function setting. Under homoscedastic noise the covariance of the fitted values is $\sigma^2 \mathbf{H}$, and summing the pointwise variances over the training inputs gives $$ \sum_{i=1}^n \operatorname{Var}\big(\hat{f}(x_i)\big) = \sigma^2 \operatorname{tr}(\mathbf{H}) = \sigma^2 M . $$ The total in-sample variance is exactly proportional to the number of basis functions $M$, which for a degree $d$ polynomial is $d + 1$. Each parameter we add buys flexibility at a fixed price in variance. This is the quantitative content of the tradeoff: increasing $d$ reduces bias by enlarging the space of representable functions but inflates variance linearly. The optimal degree minimizes the sum, and it depends on the sample size $n$, the noise level $\sigma^2$, and the smoothness of $g$. ### 3.3 Selecting the degree Because the optimum cannot be read off the training error, which decreases monotonically with $d$, degree selection must use a quantity that estimates out-of-sample error. The standard tool is $k$-fold cross-validation: partition the data, fit each candidate degree on the training folds, and average the held-out error. The degree minimizing cross-validated error is chosen. Information criteria such as AIC and BIC offer a cheaper alternative by penalizing the fitted log-likelihood with a term proportional to $M$; BIC's heavier penalty $M \log n$ tends to select simpler models than AIC's $2M$. In modern practice, cross-validation is preferred when computationally feasible because it makes the fewest assumptions about the noise distribution. ```python # Degree selection by cross-validated error for d in candidate_degrees: cv_err[d] = mean(cross_val_score(make_poly_pipeline(d), X, y)) best_d = argmin(cv_err) ``` ## 4. Splines and Piecewise Polynomials ### 4.1 The case for piecewise construction Runge's phenomenon teaches that a single global polynomial is the wrong object when the data are wiggly or numerous. The remedy is to abandon one global function and instead stitch together low-degree polynomials over local regions. We partition the input range with a set of breakpoints called knots, $t_1 < t_2 < \cdots < t_K$, and fit a separate polynomial on each subinterval. Without further constraints this produces a discontinuous, ragged fit, so we impose smoothness conditions at the knots. The result keeps the local adaptivity of piecewise fitting while presenting a globally smooth curve. ### 4.2 Cubic and natural cubic splines A cubic spline uses degree three polynomials on each subinterval and requires that the function and its first and second derivatives are continuous across every knot. Cubic degree is the lowest that allows continuous curvature, which is why it is the default in applications; the eye perceives discontinuities in the second derivative as visible kinks. A cubic spline with $K$ interior knots has $K + 4$ degrees of freedom. Near the boundaries, an unconstrained cubic spline can still behave erratically because data are sparse there. A natural cubic spline adds the constraint that the function is linear beyond the boundary knots, that is, the second derivative vanishes at the extremes. This removes four degrees of freedom and dramatically stabilizes the tails, trading a small amount of flexibility for a large gain in reliability exactly where global polynomials fail worst. ### 4.3 The B-spline basis Splines fit cleanly into the basis-function framework. The space of splines of a given degree and knot sequence is spanned by a set of B-spline basis functions, each of which is nonzero only over a few adjacent subintervals. This local support is the source of two practical virtues. First, the design matrix is banded and well conditioned, free of the global ill conditioning that plagues the monomial basis. Second, moving or adding a knot affects the fit only locally, so a spike in one region does not corrupt the curve elsewhere, which is precisely the failure mode of Runge oscillation. Given a B-spline design matrix $\boldsymbol{\Phi}$, fitting proceeds by the same least squares solve as before. ### 4.4 Knot placement and regression splines The flexibility of a regression spline is controlled by the number and placement of knots. More knots mean more degrees of freedom, more flexibility, and the familiar march toward variance. A common heuristic places knots at quantiles of the observed inputs so that each subinterval contains a comparable number of points, concentrating flexibility where data are dense. The number of knots is then chosen by cross-validation, exactly as the polynomial degree was. Regression splines thus inherit the bias-variance structure of polynomials but partition the difficulty locally, which is why a moderate spline routinely outperforms a high-degree global polynomial on the same data. ## 5. Regularized Polynomial Fits ### 5.1 Penalizing coefficient magnitude Rather than restrict flexibility by limiting the number of basis functions, we can include many basis functions and then penalize the coefficients to discourage overfitting. Ridge regression, also called Tikhonov regularization, augments the least squares objective with a squared $\ell_2$ penalty: $$ \hat{\boldsymbol{\beta}}_\lambda = \arg\min_{\boldsymbol{\beta}} \; \| \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\beta} \|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2, \qquad \hat{\boldsymbol{\beta}}_\lambda = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}. $$ The penalty parameter $\lambda \ge 0$ tunes the strength of shrinkage. At $\lambda = 0$ we recover ordinary least squares; as $\lambda \to \infty$ the coefficients are driven toward zero and the fit flattens. Crucially, ridge also cures the ill conditioning of Section 2.2: adding $\lambda \mathbf{I}$ shifts every eigenvalue of $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ upward by $\lambda$, guaranteeing invertibility and bounding the condition number. A high-degree polynomial that is hopeless under OLS often becomes well behaved under even mild ridge shrinkage. ### 5.2 Effective degrees of freedom and the continuous tradeoff Regularization converts the discrete choice of degree into a continuous dial. The effective degrees of freedom of a ridge fit are $$ \mathrm{df}(\lambda) = \operatorname{tr}\big(\boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1}\boldsymbol{\Phi}^\top\big) = \sum_{j} \frac{s_j^2}{s_j^2 + \lambda}, $$ where the $s_j$ are the singular values of $\boldsymbol{\Phi}$. As $\lambda$ ranges from $0$ to $\infty$, $\mathrm{df}(\lambda)$ decreases smoothly from $M$ to $0$. Where unregularized polynomials adjusted variance by integer jumps in $M$, ridge sweeps the same axis continuously, letting us land precisely on the bias-variance minimum. The penalty introduces bias by shrinking coefficients, but the variance reduction usually more than compensates, which is why a regularized high-degree fit frequently beats the best unregularized degree. ### 5.3 Sparsity and the smoothing spline connection Replacing the $\ell_2$ penalty with an $\ell_1$ penalty gives the lasso, $\|\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1$, which drives some coefficients exactly to zero and performs basis selection. With a rich basis this yields a parsimonious model that keeps only the terms the data support. A related and elegant idea is the smoothing spline, which penalizes the integrated squared second derivative of the fit: $$ \arg\min_{f} \; \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2 \, dx . $$ Remarkably, the minimizer over all twice-differentiable functions is a natural cubic spline with knots at every distinct data point. The penalty, not the knot count, controls smoothness, so we sidestep knot placement entirely and tune a single $\lambda$ by cross-validation. The smoothing spline thus unifies the two strands of this chapter, the spline construction of Section 4 and the regularization of Section 5, into one estimator. ### 5.4 Choosing the penalty The penalty $\lambda$ is selected the same way the degree was, by cross-validation. Generalized cross-validation offers an efficient shortcut that approximates leave-one-out error using the trace of the hat matrix, avoiding repeated refitting. The practical workflow that emerges is consistent across every model in this chapter: choose a flexible basis, control its effective complexity with a tunable parameter, and set that parameter by estimating out-of-sample error. The basis functions change, but the discipline does not. ## 6. Practical Guidance When a relationship is mildly curved and the data are modest, a low-degree polynomial on centered and scaled inputs, fit through an orthogonal basis, is simple and effective. When the relationship is wiggly or the sample is large, prefer splines, which localize flexibility and avoid Runge oscillation; natural cubic splines or smoothing splines are reliable defaults. When you want many features without committing to a degree, use a generous basis with ridge or lasso regularization and tune the penalty by cross-validation. Across all three routes, never trust training error to choose complexity, always center and scale inputs to protect conditioning, and remember that the irreducible noise $\sigma^2$ sets a hard floor that no amount of flexibility can breach. Basis-function regression is powerful precisely because it decouples the choice of expressive features from the linear estimation that follows, but that same expressiveness is what makes principled complexity control non-negotiable. ## 7. Reference Implementation The companion libraries ship a small, tested basis-regression estimator in all three languages of this book: the Python package `aiinaction`, the Julia package `AIInAction`, and the Rust crate `aiinaction`. Each exposes the same surface: `polynomial_design` and `rbf_design` build a design matrix $\boldsymbol{\Phi}$, `fit_ridge` solves $(\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \lambda\mathbf{I})\boldsymbol{\beta} = \boldsymbol{\Phi}^\top\mathbf{y}$, `predict` evaluates $\boldsymbol{\Phi}\boldsymbol{\beta}$, and `effective_dof` reports $\mathrm{df}(\lambda) = \sum_j s_j^2/(s_j^2 + \lambda)$. The convenience `BasisRegression` type ties a basis and a penalty together with `fit` and `predict`. All three implementations are verified against the same numeric fixtures, so the coefficients below are identical to the last digit across languages. The running example places five points exactly on the parabola $y = 1 + 2x - x^2$. A degree-two ordinary least squares fit therefore recovers the coefficients $[1, 2, -1]$ exactly, the fitted values reproduce the targets, and the effective degrees of freedom equal the three basis functions. Adding a ridge penalty $\lambda = 1$ shrinks the coefficients and lowers the effective degrees of freedom below three, the continuous complexity dial of Section 5.2 in action. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch083_basis_regression import ( polynomial_design, fit_ridge, predict, effective_dof, BasisRegression, ) x = [-2.0, -1.0, 0.0, 1.0, 2.0] y = [-7.0, -2.0, 1.0, 2.0, 1.0] # lies exactly on 1 + 2x - x^2 phi = polynomial_design(x, degree=2) beta_ols = fit_ridge(phi, y, penalty=0.0) print("OLS coefficients :", [round(float(b), 6) for b in beta_ols]) print("Fitted values :", [round(float(v), 6) for v in predict(phi, beta_ols)]) print("df(lambda=0) :", round(effective_dof(phi, 0.0), 6)) beta_ridge = fit_ridge(phi, y, penalty=1.0) print("Ridge coefficients:", [round(float(b), 6) for b in beta_ridge]) print("df(lambda=1) :", round(effective_dof(phi, 1.0), 6)) # The convenience estimator: fit then extrapolate to x = 3. model = BasisRegression(degree=2, penalty=0.0).fit(x, y) print("Prediction at x=3 :", round(float(model.predict([3.0])[0]), 6)) ``` ## Julia ```julia using AIInAction.Ch083BasisRegression x = [-2.0, -1.0, 0.0, 1.0, 2.0] y = [-7.0, -2.0, 1.0, 2.0, 1.0] # lies exactly on 1 + 2x - x^2 phi = polynomial_design(x, 2) beta_ols = fit_ridge(phi, y, 0.0) println("OLS coefficients : ", round.(beta_ols, digits=6)) println("Fitted values : ", round.(predict(phi, beta_ols), digits=6)) println("df(lambda=0) : ", round(effective_dof(phi, 0.0), digits=6)) beta_ridge = fit_ridge(phi, y, 1.0) println("Ridge coefficients: ", round.(beta_ridge, digits=6)) println("df(lambda=1) : ", round(effective_dof(phi, 1.0), digits=6)) model = BasisRegression(degree=2, penalty=0.0) Ch083BasisRegression.fit!(model, x, y) println("Prediction at x=3 : ", round(predict(model, [3.0])[1], digits=6)) ``` ## Rust ```rust use aiinaction::ch083_basis_regression::{ polynomial_design, fit_ridge, predict, effective_dof, }; fn main() { let x = [-2.0, -1.0, 0.0, 1.0, 2.0]; let y = [-7.0, -2.0, 1.0, 2.0, 1.0]; // lies exactly on 1 + 2x - x^2 let phi = polynomial_design(&x, 2).unwrap(); let beta_ols = fit_ridge(&phi, &y, 0.0).unwrap(); println!("OLS coefficients : {:?}", beta_ols); // [1.0, 2.0, -1.0] println!("Fitted values : {:?}", predict(&phi, &beta_ols).unwrap()); println!("df(lambda=0) : {}", effective_dof(&phi, 0.0).unwrap()); // 3.0 let beta_ridge = fit_ridge(&phi, &y, 1.0).unwrap(); println!("Ridge coefficients: {:?}", beta_ridge); println!("df(lambda=1) : {}", effective_dof(&phi, 1.0).unwrap()); } ``` ::: The Python tab executes when the book is rendered; the Julia and Rust tabs are the verbatim, parity-tested equivalents. Install the package for your language and the snippets run as written. Readers who want to go deeper can swap `polynomial_design` for `rbf_design` to fit the Gaussian radial basis of Section 1.1 with the identical `fit_ridge` solve. ## References 1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd edition. Springer, 2009. https://hastie.su.domains/ElemStatLearn/ 2. James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning, 2nd edition. Springer, 2021. https://www.statlearning.com/ 3. Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/ 4. de Boor, C. A Practical Guide to Splines, revised edition. Springer, 2001. https://link.springer.com/book/9780387953663 5. Wahba, G. Spline Models for Observational Data. SIAM, 1990. https://epubs.siam.org/doi/book/10.1137/1.9781611970128 6. Trefethen, L. N. Approximation Theory and Approximation Practice. SIAM, 2013. https://people.maths.ox.ac.uk/trefethen/ATAP/ 7. Runge, C. Uber empirische Funktionen und die Interpolation zwischen aquidistanten Ordinaten. Zeitschrift fur Mathematik und Physik, 1901. https://archive.org/details/zeitschriftfrma12runggoog 8. Hoerl, A. E., and Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 1970. https://www.tandfonline.com/doi/abs/10.1080/00401706.1970.10488634 9. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B, 1996. https://www.jstor.org/stable/2346178 10. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011. https://jmlr.org/papers/v12/pedregosa11a.html