87  Linear Regression Extensions

Ordinary least squares (OLS) is the workhorse of applied statistics and a baseline for nearly every supervised learning problem. Yet the assumptions that justify OLS, namely homoscedastic and uncorrelated errors, Gaussian noise, and a strictly linear conditional mean, fail routinely in practice. Heteroscedasticity, correlated measurements, heavy tailed contamination, asymmetric loss, and nonlinear structure all break the optimality of the plain least squares estimator. This chapter develops a family of extensions that relax each assumption in turn while retaining the computational and interpretive tractability that makes linear models attractive. We cover weighted least squares, generalized least squares, robust regression including Huber loss and RANSAC, quantile regression, and basis function expansion. Throughout, the unifying theme is that we are still fitting a model that is linear in its parameters, even when it is nonlinear in the inputs or fitted under a non-Gaussian loss.

87.1 1. The Least Squares Baseline and Its Assumptions

87.1.1 1.1 The OLS estimator

Let \(X \in \mathbb{R}^{n \times p}\) be a design matrix with rows \(x_i^\top\), and let \(y \in \mathbb{R}^n\) be the response. The linear model posits

\[ y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \quad \operatorname{Cov}(\varepsilon) = \sigma^2 I. \]

OLS minimizes the residual sum of squares \(\lVert y - X\beta \rVert_2^2\), yielding the normal equations \(X^\top X \hat\beta = X^\top y\) and the closed form

\[ \hat\beta_{\text{OLS}} = (X^\top X)^{-1} X^\top y. \]

Under the stated assumptions the Gauss Markov theorem guarantees that \(\hat\beta_{\text{OLS}}\) is the best linear unbiased estimator (BLUE): among all estimators linear in \(y\) and unbiased for \(\beta\), it has minimum variance. If in addition \(\varepsilon \sim \mathcal{N}(0, \sigma^2 I)\), then OLS coincides with the maximum likelihood estimator and is efficient among all unbiased estimators.

87.1.2 1.2 Where the assumptions fail

Three failure modes motivate the rest of this chapter. First, the error covariance may not be \(\sigma^2 I\): variances may differ across observations (heteroscedasticity) or errors may be correlated (as in time series or spatial data). Second, the noise may be heavy tailed or contaminated by outliers, so that the squared loss, which grows quadratically, lets a handful of bad points dominate the fit. Third, the conditional mean \(\mathbb{E}[y \mid x]\) may be nonlinear in \(x\), or we may care about a conditional quantile rather than the mean. Each extension below targets one of these failures while preserving linearity in the parameters.

87.2 2. Weighted Least Squares

87.2.1 2.1 Motivation and estimator

Suppose the errors remain uncorrelated but are heteroscedastic, \(\operatorname{Cov}(\varepsilon) = \sigma^2 W^{-1}\) where \(W = \operatorname{diag}(w_1, \dots, w_n)\) is a known positive diagonal matrix. Observations with large variance carry less information and should be downweighted. Weighted least squares (WLS) minimizes the weighted residual sum of squares

\[ \hat\beta_{\text{WLS}} = \arg\min_\beta \sum_{i=1}^n w_i \,(y_i - x_i^\top \beta)^2 = \arg\min_\beta (y - X\beta)^\top W (y - X\beta). \]

Differentiating gives the weighted normal equations and the solution

\[ \hat\beta_{\text{WLS}} = (X^\top W X)^{-1} X^\top W y. \]

When \(w_i \propto 1/\operatorname{Var}(\varepsilon_i)\), WLS is BLUE and equals the maximum likelihood estimator under independent Gaussian errors with the given variances.

87.2.2 2.2 Choosing the weights

In rare cases the variances are known from the measurement process, for example when each \(y_i\) is itself an average of \(m_i\) replicate measurements, in which case \(w_i = m_i\). More often the variance structure is estimated. A common approach posits a variance model such as \(\operatorname{Var}(\varepsilon_i) = \sigma^2 g(x_i)\) for some known function \(g\), then sets \(w_i = 1/g(x_i)\). When \(g\) itself is unknown, feasible generalized least squares (Section 3.3) iterates between estimating the mean and estimating the variance from squared residuals.

# Conceptual WLS via reweighting
W   = diag(weights)               # weights ~ 1 / estimated variance
XtW = X.T @ W
beta = solve(XtW @ X, XtW @ y)

87.2.3 2.3 Connection to transformation

WLS is equivalent to OLS on transformed data. Define \(\tilde X = W^{1/2} X\) and \(\tilde y = W^{1/2} y\). Then the WLS objective becomes \(\lVert \tilde y - \tilde X \beta \rVert_2^2\), an ordinary least squares problem. This whitening view, scaling each row by \(\sqrt{w_i}\) so the transformed errors are homoscedastic, generalizes directly to the correlated case below.

87.3 3. Generalized Least Squares

87.3.1 3.1 The general error covariance

Generalized least squares (GLS) handles an arbitrary known positive definite error covariance,

\[ \operatorname{Cov}(\varepsilon) = \sigma^2 \Omega, \qquad \Omega \succ 0, \]

where \(\Omega\) may have nonzero off diagonal entries representing correlated errors. The GLS estimator minimizes the Mahalanobis style objective

\[ \hat\beta_{\text{GLS}} = \arg\min_\beta (y - X\beta)^\top \Omega^{-1} (y - X\beta), \]

with closed form solution

\[ \hat\beta_{\text{GLS}} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y. \]

WLS is the special case \(\Omega = W^{-1}\) with \(W\) diagonal, and OLS is the further special case \(\Omega = I\). Under the GLS assumptions, \(\hat\beta_{\text{GLS}}\) is BLUE; ignoring the correlation and using OLS yields an estimator that remains unbiased but is inefficient and produces invalid standard errors.

87.3.2 3.2 The whitening interpretation

Because \(\Omega\) is positive definite it admits a Cholesky factorization \(\Omega = L L^\top\). Premultiplying the model by \(L^{-1}\) gives

\[ L^{-1} y = L^{-1} X \beta + L^{-1}\varepsilon, \]

and the transformed error \(L^{-1}\varepsilon\) has covariance \(\sigma^2 L^{-1}\Omega L^{-\top} = \sigma^2 I\). Thus GLS is exactly OLS applied to the whitened data \((L^{-1}X, L^{-1}y)\). This factorization is also the practical recipe: rather than forming \(\Omega^{-1}\) explicitly, one solves triangular systems, which is numerically stable and avoids an expensive matrix inverse.

87.3.3 3.3 Feasible GLS

In applications \(\Omega\) is rarely known. Feasible GLS (FGLS) replaces it with an estimate \(\hat\Omega\) obtained from a model of the dependence. The canonical example is an autoregressive error structure for time series, \(\varepsilon_t = \rho \varepsilon_{t-1} + u_t\), where the lag one correlation \(\rho\) is estimated from OLS residuals and used to build \(\hat\Omega\). The procedure iterates: fit OLS, estimate the covariance parameters from residuals, refit by GLS with \(\hat\Omega\), and repeat until convergence. FGLS is consistent and asymptotically efficient when the covariance model is correctly specified, but a misspecified \(\hat\Omega\) can do more harm than good, so robust (heteroscedasticity and autocorrelation consistent) standard errors are often preferred when the dependence structure is uncertain.

87.4 4. Robust Regression

87.4.1 4.1 The fragility of squared loss

The squared loss \(\rho(r) = r^2\) has an unbounded derivative \(\psi(r) = 2r\), called the influence function. A single observation with a large residual can therefore exert arbitrarily large leverage on \(\hat\beta\). The breakdown point of OLS, the smallest fraction of contaminated observations that can drive the estimate to infinity, is \(0\): one bad point suffices. Robust regression replaces the squared loss with one that limits the influence of large residuals.

87.4.2 4.2 M-estimators and the Huber loss

An M-estimator generalizes least squares by minimizing \(\sum_i \rho(r_i / \hat\sigma)\) for a loss function \(\rho\) that is less sensitive to outliers, where \(\hat\sigma\) is a robust scale estimate. The Huber loss is the canonical choice:

\[ \rho_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & |r| \le \delta, \\[4pt] \delta \left( |r| - \tfrac{1}{2}\delta \right) & |r| > \delta. \end{cases} \]

It is quadratic for small residuals, preserving efficiency under Gaussian noise, and linear for large residuals, capping their influence. Its derivative \(\psi_\delta(r) = \operatorname{clip}(r, -\delta, \delta)\) is bounded, which is precisely what robustness requires. The tuning constant \(\delta\) trades efficiency against robustness; \(\delta = 1.345\,\hat\sigma\) delivers about 95 percent of the efficiency of OLS at the Gaussian model while bounding influence.

Huber estimation has no closed form but is readily solved by iteratively reweighted least squares (IRLS). At each iteration one forms weights \(w_i = \psi_\delta(r_i)/r_i\) and performs a weighted least squares step:

beta = ols(X, y)
repeat:
    r  = y - X @ beta
    s  = 1.4826 * median(abs(r - median(r)))   # robust scale (MAD)
    w  = clip(s*delta / abs(r), max=1.0)        # Huber weights
    beta = wls(X, y, weights=w)
until converged

The Huber estimator has good local robustness but its breakdown point is still low because the linear tail does not fully reject extreme outliers. Redescending losses such as Tukey’s biweight, whose \(\psi\) returns to zero for very large residuals, reject gross outliers entirely at the cost of a nonconvex objective and sensitivity to initialization.

87.4.3 4.3 RANSAC

Random Sample Consensus (RANSAC) takes a different, combinatorial route to robustness and achieves a high breakdown point even when outliers are abundant. Rather than downweighting outliers, it seeks the largest subset of inliers consistent with a model. The algorithm repeatedly samples a minimal subset of points, just enough to fit the model, fits on that subset, and counts how many of the remaining points fall within a residual threshold of the fit. The fit with the largest consensus set wins, and a final estimate is computed on all its inliers.

best = None
for k in range(num_iterations):
    sample = random_minimal_subset(data, size=p)
    model  = fit(sample)
    inliers = [i for i in data if residual(model, i) < threshold]
    if best is None or len(inliers) > len(best.inliers):
        best = (model, inliers)
beta = fit(best.inliers)   # refit on all inliers

The number of iterations needed to draw at least one clean sample with confidence \(1 - p_{\text{fail}}\) is

\[ N = \frac{\log(p_{\text{fail}})}{\log\left(1 - (1 - \epsilon)^s\right)}, \]

where \(\epsilon\) is the outlier fraction and \(s\) is the minimal sample size. RANSAC tolerates outlier fractions well above 50 percent, which makes it the standard tool in computer vision for tasks like homography and fundamental matrix estimation. Its weaknesses are the three user choices it requires, the threshold, the iteration count, and the minimal sample size, and the fact that it is randomized, so repeated runs can differ. Robust scale based variants such as MSAC and MLESAC refine the binary inlier count into a smoother score to improve stability.

87.4.4 4.4 Choosing among robust methods

Huber M-estimation is the right default when contamination is mild and the bulk of the data is approximately Gaussian; it is efficient, convex, and fast. RANSAC dominates when the outlier fraction is large or when outliers are structured, as in vision and robotics. For heavy contamination where one still wants an estimator with a high breakdown point and good asymptotic behavior, least trimmed squares, which minimizes the sum of the smallest \(h\) squared residuals, offers a 50 percent breakdown point and is often used to seed an M-estimation refinement.

87.5 5. Quantile Regression

87.5.1 5.1 Beyond the conditional mean

OLS models the conditional mean \(\mathbb{E}[y \mid x]\). Quantile regression instead models a conditional quantile \(Q_\tau(y \mid x)\) for a chosen level \(\tau \in (0,1)\), where \(Q_{0.5}\) is the conditional median. This is valuable when the spread or shape of the response distribution varies with \(x\) (heteroscedasticity that is itself of interest), when the tails matter more than the center (value at risk in finance, growth charts in medicine), or when robustness to outliers in \(y\) is desired, since the median is far less sensitive than the mean.

87.5.2 5.2 The pinball loss

The key device is the asymmetric absolute loss, also called the pinball or check loss,

\[ \rho_\tau(r) = r\,(\tau - \mathbb{1}[r < 0]) = \begin{cases} \tau\, r & r \ge 0, \\ (\tau - 1)\, r & r < 0. \end{cases} \]

For \(\tau = 0.5\) this is half the absolute loss, whose minimizer is the median. For general \(\tau\) the asymmetry penalizes under prediction and over prediction differently, and the population minimizer of \(\mathbb{E}[\rho_\tau(y - q)]\) is exactly the \(\tau\) quantile. The conditional quantile estimator solves

\[ \hat\beta_\tau = \arg\min_\beta \sum_{i=1}^n \rho_\tau\!\left(y_i - x_i^\top \beta\right). \]

87.5.3 5.3 Computation and inference

The objective is piecewise linear and convex but not differentiable at zero residuals, so it is cast as a linear program and solved with interior point or simplex methods; for large data, smoothed or gradient based approximations of the pinball loss are common. Fitting several values of \(\tau\) traces out the full conditional distribution and exposes how covariate effects differ across the response range, for example a covariate that shifts the median modestly but stretches the upper tail sharply. Inference on \(\hat\beta_\tau\) typically relies on the asymptotic normality of quantile regression coefficients, with standard errors estimated by bootstrap or by kernel based estimates of the sparsity function, because the asymptotic variance depends on the conditional density of the errors at the quantile of interest.

# Pinball loss for level tau, suitable for a gradient based solver
def pinball(residual, tau):
    return max(tau * residual, (tau - 1) * residual)

A practical nuisance is quantile crossing: separately fitted curves for different \(\tau\) can intersect, violating monotonicity of quantiles. Joint estimation or post hoc rearrangement restores the ordering.

87.6 6. Basis Function Expansion

87.6.1 6.1 Linear in parameters, nonlinear in inputs

The final extension relaxes linearity in the inputs while keeping linearity in the parameters, which is all that the least squares machinery actually requires. We replace each input \(x\) by a vector of basis functions \(\phi(x) = (\phi_1(x), \dots, \phi_M(x))^\top\) and fit

\[ f(x) = \sum_{m=1}^{M} \beta_m \,\phi_m(x) = \beta^\top \phi(x). \]

Because \(f\) is linear in \(\beta\), the OLS, WLS, GLS, robust, and quantile estimators all apply unchanged after substituting the expanded design matrix \(\Phi\) with rows \(\phi(x_i)^\top\). The nonlinearity lives entirely in the fixed, prechosen feature map.

87.6.2 6.2 Common bases

Several basis families recur in practice. Polynomial bases \(\phi_m(x) = x^{m}\) are simple but globally unstable: high degree polynomials oscillate wildly near the boundary (Runge’s phenomenon), and a local change in the data perturbs the fit everywhere. Piecewise polynomial bases fix this by acting locally. B-splines, built from piecewise cubic pieces joined smoothly at knots, are the standard choice; they have local support, so each coefficient affects only a bounded region, and they give a numerically well conditioned design matrix. Natural cubic splines add the constraint that the fit is linear beyond the boundary knots, reducing variance in the tails where data are sparse. Radial basis functions \(\phi_m(x) = \exp(-\lVert x - c_m \rVert^2 / 2\ell^2)\) place bumps at centers \(c_m\) and are convenient in multiple dimensions. Fourier bases \(\{\sin(k x), \cos(k x)\}\) suit periodic signals.

87.6.3 6.3 Controlling complexity

The number and placement of basis functions controls the bias variance tradeoff. Too few basis functions underfit; too many overfit and inflate variance. Two strategies manage this. The first chooses the basis size by cross validation or an information criterion. The second, more elegant, uses a rich basis but penalizes roughness, leading to smoothing splines and penalized regression splines (P-splines). The penalized objective is

\[ \hat\beta = \arg\min_\beta \; \lVert y - \Phi\beta \rVert_2^2 + \lambda\, \beta^\top \Pi \beta, \]

where \(\Pi\) encodes a roughness penalty (for splines, an integrated squared second derivative) and \(\lambda \ge 0\) tunes smoothness. This is exactly a ridge type problem with closed form solution \(\hat\beta = (\Phi^\top \Phi + \lambda \Pi)^{-1}\Phi^\top y\), and \(\lambda\) is selected by generalized cross validation. The penalty decouples the modeling decision (use many basis functions for flexibility) from the complexity decision (let \(\lambda\) control effective degrees of freedom), which is why penalized splines underpin generalized additive models.

87.6.4 6.4 The curse of dimensionality

Basis expansion scales poorly with input dimension: a tensor product of \(M\) functions per axis yields \(M^d\) basis functions in \(d\) dimensions. Additive models sidestep this by summing univariate expansions, \(f(x) = \sum_{j} f_j(x_j)\) with each \(f_j\) a spline, trading the ability to model interactions for tractability. This additive structure, fitted by penalized least squares, is the bridge from classical linear models to the generalized additive models widely used in interpretable machine learning.

87.7 7. Synthesis

The extensions in this chapter form a coherent toolkit organized by which OLS assumption they relax. Weighted and generalized least squares address the error covariance, handling heteroscedasticity and correlation respectively, and both reduce to OLS after whitening. Robust regression, through Huber M-estimation and RANSAC, addresses heavy tailed contamination by bounding or rejecting the influence of outliers. Quantile regression replaces the conditional mean with conditional quantiles via the asymmetric pinball loss, exposing the full shape of the response distribution. Basis function expansion relaxes linearity in the inputs while keeping linearity in the parameters, so the entire estimation apparatus carries over. Crucially these tools compose: one can fit a robust quantile spline under a generalized error covariance, because each modification touches a different and orthogonal part of the modeling pipeline. The enduring lesson is that the linear model is far more flexible than the OLS textbook suggests, and that understanding which assumption a given dataset violates points directly to the right extension.

87.8 References

  1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed. Springer, 2009. https://hastie.su.domains/ElemStatLearn/
  2. Huber, P. J. Robust Estimation of a Location Parameter. Annals of Mathematical Statistics, 1964. https://doi.org/10.1214/aoms/1177703732
  3. Fischler, M. A., and Bolles, R. C. Random Sample Consensus: A Paradigm for Model Fitting. Communications of the ACM, 1981. https://doi.org/10.1145/358669.358692
  4. Koenker, R., and Bassett, G. Regression Quantiles. Econometrica, 1978. https://doi.org/10.2307/1913643
  5. Koenker, R. Quantile Regression. Cambridge University Press, 2005. https://doi.org/10.1017/CBO9780511754098
  6. Aitken, A. C. On Least Squares and Linear Combination of Observations. Proceedings of the Royal Society of Edinburgh, 1936. https://doi.org/10.1017/S0370164600014346
  7. de Boor, C. A Practical Guide to Splines. Springer, 2001. https://link.springer.com/book/9780387953663
  8. Wood, S. N. Generalized Additive Models: An Introduction with R, 2nd ed. CRC Press, 2017. https://doi.org/10.1201/9781315370279
  9. Eilers, P. H. C., and Marx, B. D. Flexible Smoothing with B-splines and Penalties. Statistical Science, 1996. https://doi.org/10.1214/ss/1038425655
  10. Rousseeuw, P. J., and Leroy, A. M. Robust Regression and Outlier Detection. Wiley, 1987. https://doi.org/10.1002/0471725382
  11. Greene, W. H. Econometric Analysis, 8th ed. Pearson, 2018. https://www.pearson.com/en-us/subject-catalog/p/econometric-analysis/P200000003741