87 Linear Regression Extensions

Ordinary least squares (OLS) is the workhorse of applied statistics and a baseline for nearly every supervised learning problem. Yet the assumptions that justify OLS, namely homoscedastic and uncorrelated errors, Gaussian noise, and a strictly linear conditional mean, fail routinely in practice. Heteroscedasticity, correlated measurements, heavy tailed contamination, asymmetric loss, and nonlinear structure all break the optimality of the plain least squares estimator. This chapter develops a family of extensions that relax each assumption in turn while retaining the computational and interpretive tractability that makes linear models attractive. We cover weighted least squares, generalized least squares, robust regression including Huber loss and RANSAC, quantile regression, and basis function expansion. Throughout, the unifying theme is that we are still fitting a model that is linear in its parameters, even when it is nonlinear in the inputs or fitted under a non-Gaussian loss.

87.1 1. The Least Squares Baseline and Its Assumptions

87.1.1 1.1 The OLS estimator

Let $X \in \mathbb{R}^{n \times p}$ be a design matrix with rows $x_i^\top$, and let $y \in \mathbb{R}^n$ be the response. The linear model posits

\[ y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \quad \operatorname{Cov}(\varepsilon) = \sigma^2 I. \]

OLS minimizes the residual sum of squares $\lVert y - X\beta \rVert_2^2$, yielding the normal equations $X^\top X \hat\beta = X^\top y$ and the closed form

\[ \hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y. \]

Under the stated assumptions the Gauss Markov theorem guarantees that $\hat\beta_{\text{OLS}}$ is the best linear unbiased estimator (BLUE): among all estimators linear in $y$ and unbiased for $\beta$, it has minimum variance. If in addition $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then OLS coincides with the maximum likelihood estimator and is efficient among all unbiased estimators.

87.1.2 1.2 Where the assumptions fail

Three failure modes motivate the rest of this chapter. First, the error covariance may not be $\sigma^2 I$: variances may differ across observations (heteroscedasticity) or errors may be correlated (as in time series or spatial data). Second, the noise may be heavy tailed or contaminated by outliers, so that the squared loss, which grows quadratically, lets a handful of bad points dominate the fit. Third, the conditional mean $\mathbb{E}[y \mid x]$ may be nonlinear in $x$, or we may care about a conditional quantile rather than the mean. Each extension below targets one of these failures while preserving linearity in the parameters.

87.2 2. Weighted Least Squares

87.2.1 2.1 Motivation and estimator

Suppose the errors remain uncorrelated but are heteroscedastic, $\operatorname{Cov}(\varepsilon) = \sigma^2 W^{-1}$ where $W = \operatorname{diag}(w_1, \dots, w_n)$ is a known positive diagonal matrix. Observations with large variance carry less information and should be downweighted. Weighted least squares (WLS) minimizes the weighted residual sum of squares

\[ \hat\beta_{\text{WLS}} = \arg\min_\beta \sum_{i=1}^n w_i \,(y_i - x_i^\top \beta)^2 = \arg\min_\beta (y - X\beta)^\top W (y - X\beta). \]

Differentiating gives the weighted normal equations and the solution

\[ \hat\beta_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y. \]

When $w_i \propto 1/\operatorname{Var}(\varepsilon_i)$, WLS is BLUE and equals the maximum likelihood estimator under independent Gaussian errors with the given variances.

87.2.2 2.2 Choosing the weights

In rare cases the variances are known from the measurement process, for example when each $y_i$ is itself an average of $m_i$ replicate measurements, in which case $w_i = m_i$. More often the variance structure is estimated. A common approach posits a variance model such as $\operatorname{Var}(\varepsilon_i) = \sigma^2 g(x_i)$ for some known function $g$, then sets $w_i = 1/g(x_i)$. When $g$ itself is unknown, feasible generalized least squares (Section 3.3) iterates between estimating the mean and estimating the variance from squared residuals.

# Conceptual WLS via reweighting
W   = diag(weights)               # weights ~ 1 / estimated variance
XtW = X.T @ W
beta = solve(XtW @ X, XtW @ y)

87.2.3 2.3 Connection to transformation

WLS is equivalent to OLS on transformed data. Define $\tilde X = W^{1/2} X$ and $\tilde y = W^{1/2} y$. Then the WLS objective becomes $\lVert \tilde y - \tilde X \beta \rVert_2^2$, an ordinary least squares problem. This whitening view, scaling each row by $\sqrt{w_i}$ so the transformed errors are homoscedastic, generalizes directly to the correlated case below.

87.3 3. Generalized Least Squares

87.3.1 3.1 The general error covariance

Generalized least squares (GLS) handles an arbitrary known positive definite error covariance,

\[ \operatorname{Cov}(\varepsilon) = \sigma^2 \Omega, \qquad \Omega \succ 0, \]

where $\Omega$ may have nonzero off diagonal entries representing correlated errors. The GLS estimator minimizes the Mahalanobis style objective

\[ \hat\beta_{\text{GLS}} = \arg\min_\beta (y - X\beta)^\top \Omega^{-1} (y - X\beta), \]

with closed form solution

\[ \hat\beta_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y. \]

WLS is the special case $\Omega = W^{-1}$ with $W$ diagonal, and OLS is the further special case $\Omega = I$. Under the GLS assumptions, $\hat\beta_{\text{GLS}}$ is BLUE; ignoring the correlation and using OLS yields an estimator that remains unbiased but is inefficient and produces invalid standard errors.

87.3.2 3.2 The whitening interpretation

Because $\Omega$ is positive definite it admits a Cholesky factorization $\Omega = L L^\top$. Premultiplying the model by $L^{-1}$ gives

\[ L^{-1} y = L^{-1} X \beta + L^{-1}\varepsilon, \]

and the transformed error $L^{-1}\varepsilon$ has covariance $\sigma^2 L^{-1}\Omega L^{-\top} = \sigma^2 I$. Thus GLS is exactly OLS applied to the whitened data $(L^{-1}X, L^{-1}y)$. This factorization is also the practical recipe: rather than forming $\Omega^{-1}$ explicitly, one solves triangular systems, which is numerically stable and avoids an expensive matrix inverse.

87.3.3 3.3 Feasible GLS

In applications $\Omega$ is rarely known. Feasible GLS (FGLS) replaces it with an estimate $\hat\Omega$ obtained from a model of the dependence. The canonical example is an autoregressive error structure for time series, $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$, where the lag one correlation $\rho$ is estimated from OLS residuals and used to build $\hat\Omega$. The procedure iterates: fit OLS, estimate the covariance parameters from residuals, refit by GLS with $\hat\Omega$, and repeat until convergence. FGLS is consistent and asymptotically efficient when the covariance model is correctly specified, but a misspecified $\hat\Omega$ can do more harm than good, so robust (heteroscedasticity and autocorrelation consistent) standard errors are often preferred when the dependence structure is uncertain.

87.4 4. Robust Regression

87.4.1 4.1 The fragility of squared loss

The squared loss $\rho(r) = r^2$ has an unbounded derivative $\psi(r) = 2r$, called the influence function. A single observation with a large residual can therefore exert arbitrarily large leverage on $\hat\beta$. The breakdown point of OLS, the smallest fraction of contaminated observations that can drive the estimate to infinity, is $0$: one bad point suffices. Robust regression replaces the squared loss with one that limits the influence of large residuals.

87.4.2 4.2 M-estimators and the Huber loss

An M-estimator generalizes least squares by minimizing $\sum_i \rho(r_i / \hat\sigma)$ for a loss function $\rho$ that is less sensitive to outliers, where $\hat\sigma$ is a robust scale estimate. The Huber loss is the canonical choice:

\[ \rho_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & |r| \le \delta, \\[4pt] \delta \left( |r| - \tfrac{1}{2}\delta \right) & |r| > \delta. \end{cases} \]

It is quadratic for small residuals, preserving efficiency under Gaussian noise, and linear for large residuals, capping their influence. Its derivative $\psi_\delta(r) = \operatorname{clip}(r, -\delta, \delta)$ is bounded, which is precisely what robustness requires. The tuning constant $\delta$ trades efficiency against robustness; $\delta = 1.345\,\hat\sigma$ delivers about 95 percent of the efficiency of OLS at the Gaussian model while bounding influence.

Huber estimation has no closed form but is readily solved by iteratively reweighted least squares (IRLS). At each iteration one forms weights $w_i = \psi_\delta(r_i)/r_i$ and performs a weighted least squares step:

beta = ols(X, y)
repeat:
    r  = y - X @ beta
    s  = 1.4826 * median(abs(r - median(r)))   # robust scale (MAD)
    w  = clip(s*delta / abs(r), max=1.0)        # Huber weights
    beta = wls(X, y, weights=w)
until converged

The Huber estimator has good local robustness but its breakdown point is still low because the linear tail does not fully reject extreme outliers. Redescending losses such as Tukey’s biweight, whose $\psi$ returns to zero for very large residuals, reject gross outliers entirely at the cost of a nonconvex objective and sensitivity to initialization.

87.4.3 4.2.1 Why IRLS solves the Huber problem

The IRLS recipe is not a heuristic; it is the exact stationarity condition of the M-estimation objective rewritten as a weighted least squares problem. Minimizing $J(\beta) = \sum_i \rho_\delta(r_i)$ with $r_i = y_i - x_i^\top\beta$ requires the gradient to vanish:

\[ \nabla_\beta J = -\sum_{i=1}^n \psi_\delta(r_i)\, x_i = 0, \qquad \psi_\delta(r) = \rho_\delta'(r) = \operatorname{clip}(r, -\delta, \delta). \]

The key algebraic trick is to write the bounded score as a residual times a data-dependent weight. Define

\[ w_i \;=\; \frac{\psi_\delta(r_i)}{r_i} \;=\; \min\!\left(1, \frac{\delta}{|r_i|}\right), \]

so that $\psi_\delta(r_i) = w_i\, r_i$ exactly (for $|r_i| \le \delta$ the weight is one and the term is quadratic; for $|r_i| > \delta$ the weight shrinks like $1/|r_i|$ so the contribution is capped at $\delta$). Substituting turns the stationarity condition into

\[ \sum_{i=1}^n w_i\, r_i\, x_i = 0 \quad\Longleftrightarrow\quad X^\top W (y - X\beta) = 0 \quad\Longleftrightarrow\quad X^\top W X\,\beta = X^\top W y, \]

which is precisely the weighted normal equations with $W = \operatorname{diag}(w_1,\dots,w_n)$. Because the weights depend on $\beta$ through the residuals, we iterate: fix the weights from the current residuals, solve the linear WLS subproblem, recompute residuals and weights, and repeat. Since the Huber loss is convex with a Lipschitz score, this fixed-point iteration is a special case of a majorize-minimize scheme and converges to the global minimizer. In practice the residuals are first divided by a robust scale estimate $\hat\sigma$ (we use the MAD, $1.4826 \cdot \operatorname{median}_i |r_i - \operatorname{median}(r)|$) so that $\delta$ is expressed in standard-deviation units and the procedure is equivariant under rescaling of $y$.

The same template yields quantile regression: the pinball loss $\rho_\tau$ has subgradient $\psi_\tau(r) = \tau - \mathbb{1}[r<0]$, and writing $w_i = (\tau \text{ or } 1-\tau)/|r_i|$ recovers a reweighted least squares iteration whose fixed point is the conditional-quantile fit (a small floor $\varepsilon$ on $|r_i|$ keeps the weights finite at near-zero residuals).

87.4.4 4.3 RANSAC

Random Sample Consensus (RANSAC) takes a different, combinatorial route to robustness and achieves a high breakdown point even when outliers are abundant. Rather than downweighting outliers, it seeks the largest subset of inliers consistent with a model. The algorithm repeatedly samples a minimal subset of points, just enough to fit the model, fits on that subset, and counts how many of the remaining points fall within a residual threshold of the fit. The fit with the largest consensus set wins, and a final estimate is computed on all its inliers.

best = None
for k in range(num_iterations):
    sample = random_minimal_subset(data, size=p)
    model  = fit(sample)
    inliers = [i for i in data if residual(model, i) < threshold]
    if best is None or len(inliers) > len(best.inliers):
        best = (model, inliers)
beta = fit(best.inliers)   # refit on all inliers

The number of iterations needed to draw at least one clean sample with confidence $1 - p_{\text{fail}}$ is

\[ N = \frac{\log(p_{\text{fail}})}{\log\left(1 - (1 - \epsilon)^s\right)}, \]

where $\epsilon$ is the outlier fraction and $s$ is the minimal sample size. RANSAC tolerates outlier fractions well above 50 percent, which makes it the standard tool in computer vision for tasks like homography and fundamental matrix estimation. Its weaknesses are the three user choices it requires, the threshold, the iteration count, and the minimal sample size, and the fact that it is randomized, so repeated runs can differ. Robust scale based variants such as MSAC and MLESAC refine the binary inlier count into a smoother score to improve stability.

87.4.5 4.4 Choosing among robust methods

Huber M-estimation is the right default when contamination is mild and the bulk of the data is approximately Gaussian; it is efficient, convex, and fast. RANSAC dominates when the outlier fraction is large or when outliers are structured, as in vision and robotics. For heavy contamination where one still wants an estimator with a high breakdown point and good asymptotic behavior, least trimmed squares, which minimizes the sum of the smallest $h$ squared residuals, offers a 50 percent breakdown point and is often used to seed an M-estimation refinement.

87.5 5. Quantile Regression

87.5.1 5.1 Beyond the conditional mean

OLS models the conditional mean $\mathbb{E}[y \mid x]$. Quantile regression instead models a conditional quantile $Q_\tau(y \mid x)$ for a chosen level $\tau \in (0,1)$, where $Q_{0.5}$ is the conditional median. This is valuable when the spread or shape of the response distribution varies with $x$ (heteroscedasticity that is itself of interest), when the tails matter more than the center (value at risk in finance, growth charts in medicine), or when robustness to outliers in $y$ is desired, since the median is far less sensitive than the mean.

87.5.2 5.2 The pinball loss

The key device is the asymmetric absolute loss, also called the pinball or check loss,

\[ \rho_\tau(r) = r\,(\tau - \mathbb{1}[r < 0]) = \begin{cases} \tau\, r & r \ge 0, \\ (\tau - 1)\, r & r < 0. \end{cases} \]

For $\tau = 0.5$ this is half the absolute loss, whose minimizer is the median. For general $\tau$ the asymmetry penalizes under prediction and over prediction differently, and the population minimizer of $\mathbb{E}[\rho_\tau(y - q)]$ is exactly the $\tau$ quantile. The conditional quantile estimator solves

\[ \hat\beta_\tau = \arg\min_\beta \sum_{i=1}^n \rho_\tau\!\left(y_i - x_i^\top \beta\right). \]

87.5.3 5.3 Computation and inference

The objective is piecewise linear and convex but not differentiable at zero residuals, so it is cast as a linear program and solved with interior point or simplex methods; for large data, smoothed or gradient based approximations of the pinball loss are common. Fitting several values of $\tau$ traces out the full conditional distribution and exposes how covariate effects differ across the response range, for example a covariate that shifts the median modestly but stretches the upper tail sharply. Inference on $\hat\beta_\tau$ typically relies on the asymptotic normality of quantile regression coefficients, with standard errors estimated by bootstrap or by kernel based estimates of the sparsity function, because the asymptotic variance depends on the conditional density of the errors at the quantile of interest.

# Pinball loss for level tau, suitable for a gradient based solver
def pinball(residual, tau):
    return max(tau * residual, (tau - 1) * residual)

A practical nuisance is quantile crossing: separately fitted curves for different $\tau$ can intersect, violating monotonicity of quantiles. Joint estimation or post hoc rearrangement restores the ordering.

87.6 6. Basis Function Expansion

87.6.1 6.1 Linear in parameters, nonlinear in inputs

The final extension relaxes linearity in the inputs while keeping linearity in the parameters, which is all that the least squares machinery actually requires. We replace each input $x$ by a vector of basis functions $\phi(x) = (\phi_1(x), \dots, \phi_M(x))^\top$ and fit

\[ f(x) = \sum_{m=1}^{M} \beta_m \,\phi_m(x) = \beta^\top \phi(x). \]

Because $f$ is linear in $\beta$, the OLS, WLS, GLS, robust, and quantile estimators all apply unchanged after substituting the expanded design matrix $\Phi$ with rows $\phi(x_i)^\top$. The nonlinearity lives entirely in the fixed, prechosen feature map.

87.6.2 6.2 Common bases

Several basis families recur in practice. Polynomial bases $\phi_m(x) = x^{m}$ are simple but globally unstable: high degree polynomials oscillate wildly near the boundary (Runge’s phenomenon), and a local change in the data perturbs the fit everywhere. Piecewise polynomial bases fix this by acting locally. B-splines, built from piecewise cubic pieces joined smoothly at knots, are the standard choice; they have local support, so each coefficient affects only a bounded region, and they give a numerically well conditioned design matrix. Natural cubic splines add the constraint that the fit is linear beyond the boundary knots, reducing variance in the tails where data are sparse. Radial basis functions $\phi_m(x) = \exp(-\lVert x - c_m \rVert^2 / 2\ell^2)$ place bumps at centers $c_m$ and are convenient in multiple dimensions. Fourier bases $\{\sin(k x), \cos(k x)\}$ suit periodic signals.

87.6.3 6.3 Controlling complexity

The number and placement of basis functions controls the bias variance tradeoff. Too few basis functions underfit; too many overfit and inflate variance. Two strategies manage this. The first chooses the basis size by cross validation or an information criterion. The second, more elegant, uses a rich basis but penalizes roughness, leading to smoothing splines and penalized regression splines (P-splines). The penalized objective is

\[ \hat\beta = \arg\min_\beta \; \lVert y - \Phi\beta \rVert_2^2 + \lambda\, \beta^\top \Pi \beta, \]

where $\Pi$ encodes a roughness penalty (for splines, an integrated squared second derivative) and $\lambda \ge 0$ tunes smoothness. This is exactly a ridge type problem with closed form solution $\hat\beta = (\Phi^\top \Phi + \lambda \Pi)^{-1}\Phi^\top y$, and $\lambda$ is selected by generalized cross validation. The penalty decouples the modeling decision (use many basis functions for flexibility) from the complexity decision (let $\lambda$ control effective degrees of freedom), which is why penalized splines underpin generalized additive models.

87.6.4 6.4 The curse of dimensionality

Basis expansion scales poorly with input dimension: a tensor product of $M$ functions per axis yields $M^d$ basis functions in $d$ dimensions. Additive models sidestep this by summing univariate expansions, $f(x) = \sum_{j} f_j(x_j)$ with each $f_j$ a spline, trading the ability to model interactions for tractability. This additive structure, fitted by penalized least squares, is the bridge from classical linear models to the generalized additive models widely used in interpretable machine learning.

87.7 7. Synthesis

The extensions in this chapter form a coherent toolkit organized by which OLS assumption they relax. Weighted and generalized least squares address the error covariance, handling heteroscedasticity and correlation respectively, and both reduce to OLS after whitening. Robust regression, through Huber M-estimation and RANSAC, addresses heavy tailed contamination by bounding or rejecting the influence of outliers. Quantile regression replaces the conditional mean with conditional quantiles via the asymmetric pinball loss, exposing the full shape of the response distribution. Basis function expansion relaxes linearity in the inputs while keeping linearity in the parameters, so the entire estimation apparatus carries over. Crucially these tools compose: one can fit a robust quantile spline under a generalized error covariance, because each modification touches a different and orthogonal part of the modeling pipeline. The enduring lesson is that the linear model is far more flexible than the OLS textbook suggests, and that understanding which assumption a given dataset violates points directly to the right extension.

87.8 8. Implementation

The companion libraries ship a small, dependency-light implementation of these estimators that is identical across Python, Julia, and Rust. The module ch082_robust_regression exposes fit_ols, fit_wls, fit_gls, fit_huber (the core Huber IRLS routine), fit_quantile, and a vandermonde basis expansion, all built on a shared weighted-normal-equations engine and a from-scratch Gaussian-elimination solver. They are part of the installable aiinaction (Python), AIInAction (Julia), and aiinaction (Rust) packages, and the cross-language parity tests assert that all three agree to within floating-point tolerance on the fixtures used below.

The example contaminates a clean line $y = 1 + 2x$ with a single gross outlier. Ordinary least squares is dragged badly toward the bad point, while the Huber fit recovers the underlying line almost exactly, illustrating the bounded-influence property derived in Section 4.2.1.

Code

from aiinaction.ch082_robust_regression import fit_ols, fit_huber

# Clean line y = 1 + 2x for x = 0..5, with index 3 contaminated (7 -> 20).
X = [[1.0, float(i)] for i in range(6)]
y = [1.0, 3.0, 5.0, 20.0, 9.0, 11.0]

ols = fit_ols(X, y)
huber = fit_huber(X, y, delta=1.345)

print(f"OLS   coefficients : intercept={ols[0]:.4f}, slope={ols[1]:.4f}")
print(f"Huber coefficients : intercept={huber.coef[0]:.4f}, slope={huber.coef[1]:.4f}")
print(f"Huber converged in {huber.n_iter} iterations (scale={huber.scale:.2e})")

OLS   coefficients : intercept=2.2381, slope=2.3714
Huber coefficients : intercept=1.0000, slope=2.0000
Huber converged in 19 iterations (scale=8.44e-11)

using AIInAction.Ch082RobustRegression

# Clean line y = 1 + 2x for x = 0..5, with index 3 contaminated (7 -> 20).
X = [[1.0, Float64(i)] for i in 0:5]
y = [1.0, 3.0, 5.0, 20.0, 9.0, 11.0]

ols   = fit_ols(X, y)
huber = fit_huber(X, y; delta=1.345)

println("OLS   coefficients : intercept=$(round(ols[1], digits=4)), slope=$(round(ols[2], digits=4))")
println("Huber coefficients : intercept=$(round(huber.coef[1], digits=4)), slope=$(round(huber.coef[2], digits=4))")
println("Huber converged in $(huber.n_iter) iterations")
# OLS   coefficients : intercept=2.2381, slope=2.3714
# Huber coefficients : intercept=1.0,    slope=2.0

use aiinaction::ch082_robust_regression::{fit_ols, fit_huber};

fn main() {
    // Clean line y = 1 + 2x for x = 0..5, with index 3 contaminated (7 -> 20).
    let x: Vec<Vec<f64>> = (0..6).map(|i| vec![1.0, i as f64]).collect();
    let y = vec![1.0, 3.0, 5.0, 20.0, 9.0, 11.0];

    let ols = fit_ols(&x, &y).unwrap();
    let huber = fit_huber(&x, &y, 1.345, 100, 1e-10).unwrap();

    println!("OLS   coefficients : intercept={:.4}, slope={:.4}", ols[0], ols[1]);
    println!("Huber coefficients : intercept={:.4}, slope={:.4}", huber.coef[0], huber.coef[1]);
    println!("Huber converged in {} iterations", huber.n_iter);
    // OLS   coefficients : intercept=2.2381, slope=2.3714
    // Huber coefficients : intercept=1.0000, slope=2.0000
}

The same fit_wls, fit_gls, and fit_quantile entry points cover the heteroscedastic, correlated-error, and conditional-quantile cases, and feeding a vandermonde design matrix into any estimator fits a model that is nonlinear in the inputs while remaining linear in the coefficients (Section 6).

87.9 References

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. https://hastie.su.domains/ElemStatLearn/
Huber, P. J. Robust Estimation of a Location Parameter. Annals of Mathematical Statistics, 1964. https://doi.org/10.1214/aoms/1177703732
Fischler, M. A., and Bolles, R. C. Random Sample Consensus: A Paradigm for Model Fitting. Communications of the ACM, 1981. https://doi.org/10.1145/358669.358692
Koenker, R., and Bassett, G. Regression Quantiles. Econometrica, 1978. https://doi.org/10.2307/1913643
Koenker, R. Quantile Regression. Cambridge University Press, 2005. https://doi.org/10.1017/CBO9780511754098
Aitken, A. C. On Least Squares and Linear Combination of Observations. Proceedings of the Royal Society of Edinburgh, 1936. https://doi.org/10.1017/S0370164600014346
de Boor, C. A Practical Guide to Splines. Springer, 2001. https://link.springer.com/book/9780387953663
Wood, S. N. Generalized Additive Models: An Introduction with R, 2nd ed. CRC Press, 2017. https://doi.org/10.1201/9781315370279
Eilers, P. H. C., and Marx, B. D. Flexible Smoothing with B-splines and Penalties. Statistical Science, 1996. https://doi.org/10.1214/ss/1038425655
Rousseeuw, P. J., and Leroy, A. M. Robust Regression and Outlier Detection. Wiley, 1987. https://doi.org/10.1002/0471725382
Greene, W. H. Econometric Analysis, 8th ed. Pearson, 2018. https://www.pearson.com/en-us/subject-catalog/p/econometric-analysis/P200000003741

# Linear Regression Extensions Ordinary least squares (OLS) is the workhorse of applied statistics and a baseline for nearly every supervised learning problem. Yet the assumptions that justify OLS, namely homoscedastic and uncorrelated errors, Gaussian noise, and a strictly linear conditional mean, fail routinely in practice. Heteroscedasticity, correlated measurements, heavy tailed contamination, asymmetric loss, and nonlinear structure all break the optimality of the plain least squares estimator. This chapter develops a family of extensions that relax each assumption in turn while retaining the computational and interpretive tractability that makes linear models attractive. We cover weighted least squares, generalized least squares, robust regression including Huber loss and RANSAC, quantile regression, and basis function expansion. Throughout, the unifying theme is that we are still fitting a model that is linear in its parameters, even when it is nonlinear in the inputs or fitted under a non-Gaussian loss. ## 1. The Least Squares Baseline and Its Assumptions ### 1.1 The OLS estimator Let $X \in \mathbb{R}^{n \times p}$ be a design matrix with rows $x_i^\top$, and let $y \in \mathbb{R}^n$ be the response. The linear model posits $$ y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \quad \operatorname{Cov}(\varepsilon) = \sigma^2 I. $$ OLS minimizes the residual sum of squares $\lVert y - X\beta \rVert_2^2$, yielding the normal equations $X^\top X \hat\beta = X^\top y$ and the closed form $$ \hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y. $$ Under the stated assumptions the Gauss Markov theorem guarantees that $\hat\beta_{\text{OLS}}$ is the best linear unbiased estimator (BLUE): among all estimators linear in $y$ and unbiased for $\beta$, it has minimum variance. If in addition $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then OLS coincides with the maximum likelihood estimator and is efficient among all unbiased estimators. ### 1.2 Where the assumptions fail Three failure modes motivate the rest of this chapter. First, the error covariance may not be $\sigma^2 I$: variances may differ across observations (heteroscedasticity) or errors may be correlated (as in time series or spatial data). Second, the noise may be heavy tailed or contaminated by outliers, so that the squared loss, which grows quadratically, lets a handful of bad points dominate the fit. Third, the conditional mean $\mathbb{E}[y \mid x]$ may be nonlinear in $x$, or we may care about a conditional quantile rather than the mean. Each extension below targets one of these failures while preserving linearity in the parameters. ## 2. Weighted Least Squares ### 2.1 Motivation and estimator Suppose the errors remain uncorrelated but are heteroscedastic, $\operatorname{Cov}(\varepsilon) = \sigma^2 W^{-1}$ where $W = \operatorname{diag}(w_1, \dots, w_n)$ is a known positive diagonal matrix. Observations with large variance carry less information and should be downweighted. Weighted least squares (WLS) minimizes the weighted residual sum of squares $$ \hat\beta_{\text{WLS}} = \arg\min_\beta \sum_{i=1}^n w_i \,(y_i - x_i^\top \beta)^2 = \arg\min_\beta (y - X\beta)^\top W (y - X\beta). $$ Differentiating gives the weighted normal equations and the solution $$ \hat\beta_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y. $$ When $w_i \propto 1/\operatorname{Var}(\varepsilon_i)$, WLS is BLUE and equals the maximum likelihood estimator under independent Gaussian errors with the given variances. ### 2.2 Choosing the weights In rare cases the variances are known from the measurement process, for example when each $y_i$ is itself an average of $m_i$ replicate measurements, in which case $w_i = m_i$. More often the variance structure is estimated. A common approach posits a variance model such as $\operatorname{Var}(\varepsilon_i) = \sigma^2 g(x_i)$ for some known function $g$, then sets $w_i = 1/g(x_i)$. When $g$ itself is unknown, feasible generalized least squares (Section 3.3) iterates between estimating the mean and estimating the variance from squared residuals. ```text # Conceptual WLS via reweighting W = diag(weights) # weights ~ 1 / estimated variance XtW = X.T @ W beta = solve(XtW @ X, XtW @ y) ``` ### 2.3 Connection to transformation WLS is equivalent to OLS on transformed data. Define $\tilde X = W^{1/2} X$ and $\tilde y = W^{1/2} y$. Then the WLS objective becomes $\lVert \tilde y - \tilde X \beta \rVert_2^2$, an ordinary least squares problem. This whitening view, scaling each row by $\sqrt{w_i}$ so the transformed errors are homoscedastic, generalizes directly to the correlated case below. ## 3. Generalized Least Squares ### 3.1 The general error covariance Generalized least squares (GLS) handles an arbitrary known positive definite error covariance, $$ \operatorname{Cov}(\varepsilon) = \sigma^2 \Omega, \qquad \Omega \succ 0, $$ where $\Omega$ may have nonzero off diagonal entries representing correlated errors. The GLS estimator minimizes the Mahalanobis style objective $$ \hat\beta_{\text{GLS}} = \arg\min_\beta (y - X\beta)^\top \Omega^{-1} (y - X\beta), $$ with closed form solution $$ \hat\beta_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y. $$ WLS is the special case $\Omega = W^{-1}$ with $W$ diagonal, and OLS is the further special case $\Omega = I$. Under the GLS assumptions, $\hat\beta_{\text{GLS}}$ is BLUE; ignoring the correlation and using OLS yields an estimator that remains unbiased but is inefficient and produces invalid standard errors. ### 3.2 The whitening interpretation Because $\Omega$ is positive definite it admits a Cholesky factorization $\Omega = L L^\top$. Premultiplying the model by $L^{-1}$ gives $$ L^{-1} y = L^{-1} X \beta + L^{-1}\varepsilon, $$ and the transformed error $L^{-1}\varepsilon$ has covariance $\sigma^2 L^{-1}\Omega L^{-\top} = \sigma^2 I$. Thus GLS is exactly OLS applied to the whitened data $(L^{-1}X, L^{-1}y)$. This factorization is also the practical recipe: rather than forming $\Omega^{-1}$ explicitly, one solves triangular systems, which is numerically stable and avoids an expensive matrix inverse. ### 3.3 Feasible GLS In applications $\Omega$ is rarely known. Feasible GLS (FGLS) replaces it with an estimate $\hat\Omega$ obtained from a model of the dependence. The canonical example is an autoregressive error structure for time series, $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$, where the lag one correlation $\rho$ is estimated from OLS residuals and used to build $\hat\Omega$. The procedure iterates: fit OLS, estimate the covariance parameters from residuals, refit by GLS with $\hat\Omega$, and repeat until convergence. FGLS is consistent and asymptotically efficient when the covariance model is correctly specified, but a misspecified $\hat\Omega$ can do more harm than good, so robust (heteroscedasticity and autocorrelation consistent) standard errors are often preferred when the dependence structure is uncertain. ## 4. Robust Regression ### 4.1 The fragility of squared loss The squared loss $\rho(r) = r^2$ has an unbounded derivative $\psi(r) = 2r$, called the influence function. A single observation with a large residual can therefore exert arbitrarily large leverage on $\hat\beta$. The breakdown point of OLS, the smallest fraction of contaminated observations that can drive the estimate to infinity, is $0$: one bad point suffices. Robust regression replaces the squared loss with one that limits the influence of large residuals. ### 4.2 M-estimators and the Huber loss An M-estimator generalizes least squares by minimizing $\sum_i \rho(r_i / \hat\sigma)$ for a loss function $\rho$ that is less sensitive to outliers, where $\hat\sigma$ is a robust scale estimate. The Huber loss is the canonical choice: $$ \rho_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & |r| \le \delta, \\[4pt] \delta \left( |r| - \tfrac{1}{2}\delta \right) & |r| > \delta. \end{cases} $$ It is quadratic for small residuals, preserving efficiency under Gaussian noise, and linear for large residuals, capping their influence. Its derivative $\psi_\delta(r) = \operatorname{clip}(r, -\delta, \delta)$ is bounded, which is precisely what robustness requires. The tuning constant $\delta$ trades efficiency against robustness; $\delta = 1.345\,\hat\sigma$ delivers about 95 percent of the efficiency of OLS at the Gaussian model while bounding influence. Huber estimation has no closed form but is readily solved by iteratively reweighted least squares (IRLS). At each iteration one forms weights $w_i = \psi_\delta(r_i)/r_i$ and performs a weighted least squares step: ```text beta = ols(X, y) repeat: r = y - X @ beta s = 1.4826 * median(abs(r - median(r))) # robust scale (MAD) w = clip(s*delta / abs(r), max=1.0) # Huber weights beta = wls(X, y, weights=w) until converged ``` The Huber estimator has good local robustness but its breakdown point is still low because the linear tail does not fully reject extreme outliers. Redescending losses such as Tukey's biweight, whose $\psi$ returns to zero for very large residuals, reject gross outliers entirely at the cost of a nonconvex objective and sensitivity to initialization. ### 4.2.1 Why IRLS solves the Huber problem The IRLS recipe is not a heuristic; it is the exact stationarity condition of the M-estimation objective rewritten as a weighted least squares problem. Minimizing $J(\beta) = \sum_i \rho_\delta(r_i)$ with $r_i = y_i - x_i^\top\beta$ requires the gradient to vanish: $$ \nabla_\beta J = -\sum_{i=1}^n \psi_\delta(r_i)\, x_i = 0, \qquad \psi_\delta(r) = \rho_\delta'(r) = \operatorname{clip}(r, -\delta, \delta). $$ The key algebraic trick is to write the bounded score as a residual times a data-dependent weight. Define $$ w_i \;=\; \frac{\psi_\delta(r_i)}{r_i} \;=\; \min\!\left(1, \frac{\delta}{|r_i|}\right), $$ so that $\psi_\delta(r_i) = w_i\, r_i$ exactly (for $|r_i| \le \delta$ the weight is one and the term is quadratic; for $|r_i| > \delta$ the weight shrinks like $1/|r_i|$ so the contribution is capped at $\delta$). Substituting turns the stationarity condition into $$ \sum_{i=1}^n w_i\, r_i\, x_i = 0 \quad\Longleftrightarrow\quad X^\top W (y - X\beta) = 0 \quad\Longleftrightarrow\quad X^\top W X\,\beta = X^\top W y, $$ which is precisely the weighted normal equations with $W = \operatorname{diag}(w_1,\dots,w_n)$. Because the weights depend on $\beta$ through the residuals, we iterate: fix the weights from the current residuals, solve the linear WLS subproblem, recompute residuals and weights, and repeat. Since the Huber loss is convex with a Lipschitz score, this fixed-point iteration is a special case of a majorize-minimize scheme and converges to the global minimizer. In practice the residuals are first divided by a robust scale estimate $\hat\sigma$ (we use the MAD, $1.4826 \cdot \operatorname{median}_i |r_i - \operatorname{median}(r)|$) so that $\delta$ is expressed in standard-deviation units and the procedure is equivariant under rescaling of $y$. The same template yields quantile regression: the pinball loss $\rho_\tau$ has subgradient $\psi_\tau(r) = \tau - \mathbb{1}[r<0]$, and writing $w_i = (\tau \text{ or } 1-\tau)/|r_i|$ recovers a reweighted least squares iteration whose fixed point is the conditional-quantile fit (a small floor $\varepsilon$ on $|r_i|$ keeps the weights finite at near-zero residuals). ### 4.3 RANSAC Random Sample Consensus (RANSAC) takes a different, combinatorial route to robustness and achieves a high breakdown point even when outliers are abundant. Rather than downweighting outliers, it seeks the largest subset of inliers consistent with a model. The algorithm repeatedly samples a minimal subset of points, just enough to fit the model, fits on that subset, and counts how many of the remaining points fall within a residual threshold of the fit. The fit with the largest consensus set wins, and a final estimate is computed on all its inliers. ```text best = None for k in range(num_iterations): sample = random_minimal_subset(data, size=p) model = fit(sample) inliers = [i for i in data if residual(model, i) < threshold] if best is None or len(inliers) > len(best.inliers): best = (model, inliers) beta = fit(best.inliers) # refit on all inliers ``` The number of iterations needed to draw at least one clean sample with confidence $1 - p_{\text{fail}}$ is $$ N = \frac{\log(p_{\text{fail}})}{\log\left(1 - (1 - \epsilon)^s\right)}, $$ where $\epsilon$ is the outlier fraction and $s$ is the minimal sample size. RANSAC tolerates outlier fractions well above 50 percent, which makes it the standard tool in computer vision for tasks like homography and fundamental matrix estimation. Its weaknesses are the three user choices it requires, the threshold, the iteration count, and the minimal sample size, and the fact that it is randomized, so repeated runs can differ. Robust scale based variants such as MSAC and MLESAC refine the binary inlier count into a smoother score to improve stability. ### 4.4 Choosing among robust methods Huber M-estimation is the right default when contamination is mild and the bulk of the data is approximately Gaussian; it is efficient, convex, and fast. RANSAC dominates when the outlier fraction is large or when outliers are structured, as in vision and robotics. For heavy contamination where one still wants an estimator with a high breakdown point and good asymptotic behavior, least trimmed squares, which minimizes the sum of the smallest $h$ squared residuals, offers a 50 percent breakdown point and is often used to seed an M-estimation refinement. ## 5. Quantile Regression ### 5.1 Beyond the conditional mean OLS models the conditional mean $\mathbb{E}[y \mid x]$. Quantile regression instead models a conditional quantile $Q_\tau(y \mid x)$ for a chosen level $\tau \in (0,1)$, where $Q_{0.5}$ is the conditional median. This is valuable when the spread or shape of the response distribution varies with $x$ (heteroscedasticity that is itself of interest), when the tails matter more than the center (value at risk in finance, growth charts in medicine), or when robustness to outliers in $y$ is desired, since the median is far less sensitive than the mean. ### 5.2 The pinball loss The key device is the asymmetric absolute loss, also called the pinball or check loss, $$ \rho_\tau(r) = r\,(\tau - \mathbb{1}[r < 0]) = \begin{cases} \tau\, r & r \ge 0, \\ (\tau - 1)\, r & r < 0. \end{cases} $$ For $\tau = 0.5$ this is half the absolute loss, whose minimizer is the median. For general $\tau$ the asymmetry penalizes under prediction and over prediction differently, and the population minimizer of $\mathbb{E}[\rho_\tau(y - q)]$ is exactly the $\tau$ quantile. The conditional quantile estimator solves $$ \hat\beta_\tau = \arg\min_\beta \sum_{i=1}^n \rho_\tau\!\left(y_i - x_i^\top \beta\right). $$ ### 5.3 Computation and inference The objective is piecewise linear and convex but not differentiable at zero residuals, so it is cast as a linear program and solved with interior point or simplex methods; for large data, smoothed or gradient based approximations of the pinball loss are common. Fitting several values of $\tau$ traces out the full conditional distribution and exposes how covariate effects differ across the response range, for example a covariate that shifts the median modestly but stretches the upper tail sharply. Inference on $\hat\beta_\tau$ typically relies on the asymptotic normality of quantile regression coefficients, with standard errors estimated by bootstrap or by kernel based estimates of the sparsity function, because the asymptotic variance depends on the conditional density of the errors at the quantile of interest. ```text # Pinball loss for level tau, suitable for a gradient based solver def pinball(residual, tau): return max(tau * residual, (tau - 1) * residual) ``` A practical nuisance is quantile crossing: separately fitted curves for different $\tau$ can intersect, violating monotonicity of quantiles. Joint estimation or post hoc rearrangement restores the ordering. ## 6. Basis Function Expansion ### 6.1 Linear in parameters, nonlinear in inputs The final extension relaxes linearity in the inputs while keeping linearity in the parameters, which is all that the least squares machinery actually requires. We replace each input $x$ by a vector of basis functions $\phi(x) = (\phi_1(x), \dots, \phi_M(x))^\top$ and fit $$ f(x) = \sum_{m=1}^{M} \beta_m \,\phi_m(x) = \beta^\top \phi(x). $$ Because $f$ is linear in $\beta$, the OLS, WLS, GLS, robust, and quantile estimators all apply unchanged after substituting the expanded design matrix $\Phi$ with rows $\phi(x_i)^\top$. The nonlinearity lives entirely in the fixed, prechosen feature map. ### 6.2 Common bases Several basis families recur in practice. Polynomial bases $\phi_m(x) = x^{m}$ are simple but globally unstable: high degree polynomials oscillate wildly near the boundary (Runge's phenomenon), and a local change in the data perturbs the fit everywhere. Piecewise polynomial bases fix this by acting locally. B-splines, built from piecewise cubic pieces joined smoothly at knots, are the standard choice; they have local support, so each coefficient affects only a bounded region, and they give a numerically well conditioned design matrix. Natural cubic splines add the constraint that the fit is linear beyond the boundary knots, reducing variance in the tails where data are sparse. Radial basis functions $\phi_m(x) = \exp(-\lVert x - c_m \rVert^2 / 2\ell^2)$ place bumps at centers $c_m$ and are convenient in multiple dimensions. Fourier bases $\{\sin(k x), \cos(k x)\}$ suit periodic signals. ### 6.3 Controlling complexity The number and placement of basis functions controls the bias variance tradeoff. Too few basis functions underfit; too many overfit and inflate variance. Two strategies manage this. The first chooses the basis size by cross validation or an information criterion. The second, more elegant, uses a rich basis but penalizes roughness, leading to smoothing splines and penalized regression splines (P-splines). The penalized objective is $$ \hat\beta = \arg\min_\beta \; \lVert y - \Phi\beta \rVert_2^2 + \lambda\, \beta^\top \Pi \beta, $$ where $\Pi$ encodes a roughness penalty (for splines, an integrated squared second derivative) and $\lambda \ge 0$ tunes smoothness. This is exactly a ridge type problem with closed form solution $\hat\beta = (\Phi^\top \Phi + \lambda \Pi)^{-1}\Phi^\top y$, and $\lambda$ is selected by generalized cross validation. The penalty decouples the modeling decision (use many basis functions for flexibility) from the complexity decision (let $\lambda$ control effective degrees of freedom), which is why penalized splines underpin generalized additive models. ### 6.4 The curse of dimensionality Basis expansion scales poorly with input dimension: a tensor product of $M$ functions per axis yields $M^d$ basis functions in $d$ dimensions. Additive models sidestep this by summing univariate expansions, $f(x) = \sum_{j} f_j(x_j)$ with each $f_j$ a spline, trading the ability to model interactions for tractability. This additive structure, fitted by penalized least squares, is the bridge from classical linear models to the generalized additive models widely used in interpretable machine learning. ## 7. Synthesis The extensions in this chapter form a coherent toolkit organized by which OLS assumption they relax. Weighted and generalized least squares address the error covariance, handling heteroscedasticity and correlation respectively, and both reduce to OLS after whitening. Robust regression, through Huber M-estimation and RANSAC, addresses heavy tailed contamination by bounding or rejecting the influence of outliers. Quantile regression replaces the conditional mean with conditional quantiles via the asymmetric pinball loss, exposing the full shape of the response distribution. Basis function expansion relaxes linearity in the inputs while keeping linearity in the parameters, so the entire estimation apparatus carries over. Crucially these tools compose: one can fit a robust quantile spline under a generalized error covariance, because each modification touches a different and orthogonal part of the modeling pipeline. The enduring lesson is that the linear model is far more flexible than the OLS textbook suggests, and that understanding which assumption a given dataset violates points directly to the right extension. ## 8. Implementation The companion libraries ship a small, dependency-light implementation of these estimators that is identical across Python, Julia, and Rust. The module `ch082_robust_regression` exposes `fit_ols`, `fit_wls`, `fit_gls`, `fit_huber` (the core Huber IRLS routine), `fit_quantile`, and a `vandermonde` basis expansion, all built on a shared weighted-normal-equations engine and a from-scratch Gaussian-elimination solver. They are part of the installable `aiinaction` (Python), `AIInAction` (Julia), and `aiinaction` (Rust) packages, and the cross-language parity tests assert that all three agree to within floating-point tolerance on the fixtures used below. The example contaminates a clean line $y = 1 + 2x$ with a single gross outlier. Ordinary least squares is dragged badly toward the bad point, while the Huber fit recovers the underlying line almost exactly, illustrating the bounded-influence property derived in Section 4.2.1. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch082_robust_regression import fit_ols, fit_huber # Clean line y = 1 + 2x for x = 0..5, with index 3 contaminated (7 -> 20). X = [[1.0, float(i)] for i in range(6)] y = [1.0, 3.0, 5.0, 20.0, 9.0, 11.0] ols = fit_ols(X, y) huber = fit_huber(X, y, delta=1.345) print(f"OLS coefficients : intercept={ols[0]:.4f}, slope={ols[1]:.4f}") print(f"Huber coefficients : intercept={huber.coef[0]:.4f}, slope={huber.coef[1]:.4f}") print(f"Huber converged in {huber.n_iter} iterations (scale={huber.scale:.2e})") ``` ## Julia ```julia using AIInAction.Ch082RobustRegression # Clean line y = 1 + 2x for x = 0..5, with index 3 contaminated (7 -> 20). X = [[1.0, Float64(i)] for i in 0:5] y = [1.0, 3.0, 5.0, 20.0, 9.0, 11.0] ols = fit_ols(X, y) huber = fit_huber(X, y; delta=1.345) println("OLS coefficients : intercept=$(round(ols[1], digits=4)), slope=$(round(ols[2], digits=4))") println("Huber coefficients : intercept=$(round(huber.coef[1], digits=4)), slope=$(round(huber.coef[2], digits=4))") println("Huber converged in $(huber.n_iter) iterations") # OLS coefficients : intercept=2.2381, slope=2.3714 # Huber coefficients : intercept=1.0, slope=2.0 ``` ## Rust ```rust use aiinaction::ch082_robust_regression::{fit_ols, fit_huber}; fn main() { // Clean line y = 1 + 2x for x = 0..5, with index 3 contaminated (7 -> 20). let x: Vec<Vec<f64>> = (0..6).map(|i| vec![1.0, i as f64]).collect(); let y = vec![1.0, 3.0, 5.0, 20.0, 9.0, 11.0]; let ols = fit_ols(&x, &y).unwrap(); let huber = fit_huber(&x, &y, 1.345, 100, 1e-10).unwrap(); println!("OLS coefficients : intercept={:.4}, slope={:.4}", ols[0], ols[1]); println!("Huber coefficients : intercept={:.4}, slope={:.4}", huber.coef[0], huber.coef[1]); println!("Huber converged in {} iterations", huber.n_iter); // OLS coefficients : intercept=2.2381, slope=2.3714 // Huber coefficients : intercept=1.0000, slope=2.0000 } ``` ::: The same `fit_wls`, `fit_gls`, and `fit_quantile` entry points cover the heteroscedastic, correlated-error, and conditional-quantile cases, and feeding a `vandermonde` design matrix into any estimator fits a model that is nonlinear in the inputs while remaining linear in the coefficients (Section 6). ## References 1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. https://hastie.su.domains/ElemStatLearn/ 2. Huber, P. J. Robust Estimation of a Location Parameter. Annals of Mathematical Statistics, 1964. https://doi.org/10.1214/aoms/1177703732 3. Fischler, M. A., and Bolles, R. C. Random Sample Consensus: A Paradigm for Model Fitting. Communications of the ACM, 1981. https://doi.org/10.1145/358669.358692 4. Koenker, R., and Bassett, G. Regression Quantiles. Econometrica, 1978. https://doi.org/10.2307/1913643 5. Koenker, R. Quantile Regression. Cambridge University Press, 2005. https://doi.org/10.1017/CBO9780511754098 6. Aitken, A. C. On Least Squares and Linear Combination of Observations. Proceedings of the Royal Society of Edinburgh, 1936. https://doi.org/10.1017/S0370164600014346 7. de Boor, C. A Practical Guide to Splines. Springer, 2001. https://link.springer.com/book/9780387953663 8. Wood, S. N. Generalized Additive Models: An Introduction with R, 2nd ed. CRC Press, 2017. https://doi.org/10.1201/9781315370279 9. Eilers, P. H. C., and Marx, B. D. Flexible Smoothing with B-splines and Penalties. Statistical Science, 1996. https://doi.org/10.1214/ss/1038425655 10. Rousseeuw, P. J., and Leroy, A. M. Robust Regression and Outlier Detection. Wiley, 1987. https://doi.org/10.1002/0471725382 11. Greene, W. H. Econometric Analysis, 8th ed. Pearson, 2018. https://www.pearson.com/en-us/subject-catalog/p/econometric-analysis/P200000003741