163  R-Squared and Adjusted R-Squared

The coefficient of determination, written \(R^2\), is among the most widely reported and most frequently misunderstood quantities in applied statistics and machine learning. It promises a single number summarizing how well a model explains the variation in a response variable. That promise is genuine but narrow. This chapter develops \(R^2\) from its algebraic foundations, examines its geometric and statistical meaning, introduces adjusted \(R^2\) as a partial remedy for one of its defects, and catalogs the situations in which a high \(R^2\) signals nothing useful or actively misleads.

163.1 1. Definition and Decomposition

163.1.1 1.1 The Sums of Squares

Let \(y_1, \dots, y_n\) be observed responses with mean \(\bar{y} = \frac{1}{n}\sum_i y_i\), and let \(\hat{y}_i\) be the fitted values produced by a model. Define three sums of squares:

\[ \text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad \text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 . \]

SST is the total sum of squares, the total variability of the response around its mean. SSE is the residual (error) sum of squares, the variability left unexplained by the model. SSR is the regression sum of squares, the variability captured by the fitted values.

For ordinary least squares (OLS) with an intercept term, these three quantities satisfy the exact identity

\[ \text{SST} = \text{SSR} + \text{SSE}. \]

This decomposition is the backbone of the entire construction. It holds because the OLS residual vector is orthogonal to the column space of the design matrix, which includes the constant vector. The cross term \(2\sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)\) vanishes precisely because of that orthogonality. When the model lacks an intercept, or when fitted values come from a method other than OLS, the identity can fail. In that case \(\text{SSR}/\text{SST}\) and \(1 - \text{SSE}/\text{SST}\) need no longer agree, and that failure has consequences we revisit in Sections 2.2 and 5.6.
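The identity and its failure without an intercept can be checked numerically. The following is a minimal sketch on synthetic data; the data-generating process, the seed, and the helper names `sums_of_squares` and `ols_fitted` are assumptions made for illustration.

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) as defined in Section 1.1."""
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)
    ssr = np.sum((y_hat - y_bar) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    return sst, ssr, sse

def ols_fitted(X, y):
    """Least-squares fitted values; the columns of X are used exactly as given."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Synthetic data (an assumption for illustration): y = 5 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 5.0 + 2.0 * x + rng.normal(size=100)

X_with = np.column_stack([np.ones_like(x), x])  # intercept and slope
X_without = x.reshape(-1, 1)                    # slope only, no intercept

sst, ssr, sse = sums_of_squares(y, ols_fitted(X_with, y))
gap_with = sst - (ssr + sse)        # zero up to floating-point rounding

sst0, ssr0, sse0 = sums_of_squares(y, ols_fitted(X_without, y))
gap_without = sst0 - (ssr0 + sse0)  # generally far from zero

print(f"with intercept:    SST - (SSR + SSE) = {gap_with:.2e}")
print(f"without intercept: SST - (SSR + SSE) = {gap_without:.2e}")
```

In the no-intercept fit the residuals no longer sum to zero, so the cross term survives and the three sums of squares stop adding up.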

163.1.2 1.2 The Coefficient of Determination

Given the decomposition, \(R^2\) is defined as the fraction of total variability explained:

\[ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} . \]

When the identity \(\text{SST} = \text{SSR} + \text{SSE}\) holds and \(\text{SST} > 0\), the two expressions agree and, because sums of squares are nonnegative, \(R^2 \in [0, 1]\). A value of \(1\) means SSE is zero, so the model reproduces every observation exactly. A value of \(0\) means SSR is zero, so the model does no better than the constant predictor \(\hat{y}_i = \bar{y}\).
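The two endpoint cases can be reproduced with a few lines of plain Python. This is a small sketch using the \(1 - \text{SSE}/\text{SST}\) form; the function name `r_squared` and the four response values are made up for illustration.

```python
def r_squared(y, y_hat):
    """R^2 computed as 1 - SSE/SST for paired observations and fitted values."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - sse / sst

y = [1.0, 2.0, 4.0, 7.0]       # illustrative responses, mean 3.5

perfect = r_squared(y, y)             # fitted values equal the data: 1.0
mean_only = r_squared(y, [3.5] * 4)   # constant predictor at the mean: 0.0

print(perfect, mean_only)
```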

163.2 2. Interpretations

163.2.1 2.1 Variance Explained

The textbook reading of \(R^2\) is “the proportion of variance in the response explained by the predictors.” This is accurate for OLS with an intercept, where SST, SSR, and SSE are proportional to sample variances. It is worth stressing that the word “explained” is mechanical, not causal. A predictor can raise \(R^2\) while having no causal relationship to the response, a point developed in Section 5.

163.2.2 2.2 The Correlation Interpretation

For simple linear regression of \(y\) on a single predictor \(x\), \(R^2\) equals the square of the Pearson correlation coefficient between \(x\) and \(y\):

\[ R^2 = r_{xy}^2, \qquad r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}} . \]

More generally, in multiple regression \(R^2\) equals the squared correlation between the observed \(y_i\) and the fitted \(\hat{y}_i\):

\[ R^2 = \big(\text{corr}(y, \hat{y})\big)^2 . \]

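This identity holds for OLS with an intercept. The squared correlation is often the safest way to think about \(R^2\), because it always lies in \([0, 1]\) and remains meaningful even when the additive sum of squares decomposition is shaky, although in those cases it no longer equals \(1 - \text{SSE}/\text{SST}\). It also clarifies that \(R^2\) measures the strength of linear association between predictions and truth, nothing more: the squared correlation is unchanged when every prediction is shifted by a constant or rescaled by a nonzero factor, so it cannot detect systematic bias in \(\hat{y}\).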
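A short numerical check of the multiple-regression identity is given below, together with one caveat: the squared correlation ignores a constant shift in the predictions, whereas \(1 - \text{SSE}/\text{SST}\) does not. The data, coefficients, and seed are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic multiple regression with three predictors (illustrative values).
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

# OLS with an intercept.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

sst = np.sum((y - y.mean()) ** 2)
r2_ss = 1.0 - np.sum((y - y_hat) ** 2) / sst    # sums-of-squares definition
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2      # squared correlation

# Shift every prediction by a constant: the correlation does not change,
# but the sums-of-squares version collapses.
shifted = y_hat + 10.0
r2_corr_shifted = np.corrcoef(y, shifted)[0, 1] ** 2
r2_ss_shifted = 1.0 - np.sum((y - shifted) ** 2) / sst

print(f"OLS fit:     1 - SSE/SST = {r2_ss:.4f}, corr^2 = {r2_corr:.4f}")
print(f"shifted fit: 1 - SSE/SST = {r2_ss_shifted:.4f}, corr^2 = {r2_corr_shifted:.4f}")
```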

163.2.3 2.3 Geometric View

Collect the centered response in a vector and view fitting as projection onto the subspace spanned by the centered predictors. Then

\[ R^2 = \cos^2 \theta, \]

where \(\theta\) is the angle between the centered response vector and its projection. A small angle means the predictors nearly align with the response and \(R^2\) approaches \(1\). An angle near \(90^\circ\) means the predictors are nearly orthogonal to the response and \(R^2\) approaches \(0\). This geometry makes the orthogonality of residuals and fit visually obvious.
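The angle can be computed directly. The sketch below, on synthetic data with assumed coefficients and seed, centers the response and the fitted values, takes the cosine of the angle between them, and compares its square with the sums-of-squares value.

```python
import numpy as np

# Synthetic data with two predictors (illustrative values).
rng = np.random.default_rng(2)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 3.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(size=n)

# OLS with an intercept.
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

# Centered response and centered fitted values (the projection).
yc = y - y.mean()
fc = y_hat - y_hat.mean()

cos_theta = (yc @ fc) / (np.linalg.norm(yc) * np.linalg.norm(fc))
r2_geom = cos_theta ** 2
r2_ss = 1.0 - np.sum((y - y_hat) ** 2) / (yc @ yc)
theta_deg = np.degrees(np.arccos(cos_theta))

print(f"theta = {theta_deg:.1f} degrees, cos^2(theta) = {r2_geom:.4f}, R^2 = {r2_ss:.4f}")
```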

163.3 3. Why \(R^2\) Never Decreases When You Add Predictors

A structural defect of \(R^2\) is that it is monotone nondecreasing in the number of predictors. Adding any regressor to an OLS model, even one filled with random noise, cannot increase SSE and therefore cannot decrease \(R^2\).

The reason is that the smaller model is nested inside the larger one. Least squares minimizes SSE over a larger parameter space when a column is added, and the minimum over a larger set cannot exceed the minimum over a subset. Setting the new coefficient to zero recovers the old fit, so the optimizer can only match or improve it.

add a column of pure noise to X
SSE can only stay equal or shrink
therefore R^2 can only stay equal or grow
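The schematic above can be turned into a small experiment. In this sketch (synthetic data, an assumed seed, and a hypothetical helper `r_squared`), noise columns are appended one at a time and the in-sample \(R^2\) is recorded after each addition.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 for OLS of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# One genuine predictor (illustrative data-generating process).
rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=(n, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)

X = x
history = [r_squared(X, y)]
for _ in range(20):
    X = np.column_stack([X, rng.normal(size=n)])  # append a pure-noise column
    history.append(r_squared(X, y))

print(f"R^2 with 1 real predictor:      {history[0]:.4f}")
print(f"R^2 after 20 noise columns too: {history[-1]:.4f}")
```

Every entry in `history` is at least as large as the one before it, even though none of the added columns has any relationship to the response.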

The practical danger is overfitting. With \(n\) observations, a model with \(p = n - 1\) linearly independent predictors plus an intercept has \(n\) free parameters and can interpolate the data, driving SSE to zero and \(R^2\) to one, even when the predictors are pure noise and the model predicts future data no better, and often worse, than the sample mean. \(R^2\) rewards complexity regardless of whether that complexity reflects signal or noise. This is exactly the failure that adjusted \(R^2\) tries to address.
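The interpolation case is easy to reproduce. The sketch below fits \(p = n - 1\) noise predictors to a pure-noise response and then scores the fit on fresh data, measuring the fresh-data error against the training mean. All sizes, the seed, and the helper names `fit_ols` and `predict` are assumptions for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients with the intercept stored first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

rng = np.random.default_rng(4)
n, p = 12, 11                          # p = n - 1 predictors plus an intercept
X_train = rng.normal(size=(n, p))      # predictors are pure noise
y_train = rng.normal(size=n)           # response is pure noise too
beta = fit_ols(X_train, y_train)

resid = y_train - predict(beta, X_train)
r2_in = 1.0 - (resid @ resid) / np.sum((y_train - y_train.mean()) ** 2)

# Fresh data from the same (signal-free) process.
X_new = rng.normal(size=(1000, p))
y_new = rng.normal(size=1000)
err = y_new - predict(beta, X_new)
r2_out = 1.0 - (err @ err) / np.sum((y_new - y_train.mean()) ** 2)

print(f"in-sample R^2:  {r2_in:.6f}")
print(f"fresh-data R^2: {r2_out:.2f}")
```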

163.4 4. Adjusted \(R^2\)

163.4.1 4.1 Definition

Adjusted \(R^2\) penalizes the inclusion of predictors by replacing raw sums of squares with their degrees-of-freedom-corrected counterparts. With \(n\) observations and \(p\) predictors (excluding the intercept):

\[ R^2_{\text{adj}} = 1 - \frac{\text{SSE} / (n - p - 1)}{\text{SST} / (n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} . \]

The numerator \(\text{SSE}/(n - p - 1)\) is an unbiased estimate of the residual variance \(\sigma^2\) when the linear model is correctly specified, and the denominator \(\text{SST}/(n - 1)\) is the usual unbiased estimate of the variance of \(y\). So adjusted \(R^2\) can be read as one minus the ratio of two variance estimates.
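Both forms of the definition are simple enough to compute by hand. The following sketch implements each and checks that they agree; the function names and the numbers (\(n = 50\), \(p = 5\), \(\text{SSE} = 70\), \(\text{SST} = 100\)) are made up for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 from raw R^2, n observations, p predictors (no intercept counted)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for the residual degrees of freedom")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def adjusted_r_squared_from_ss(sse, sst, n, p):
    """The same quantity written as one minus a ratio of variance estimates."""
    return 1.0 - (sse / (n - p - 1)) / (sst / (n - 1))

# Illustrative numbers: SSE = 70, SST = 100, so raw R^2 = 0.30.
adj = adjusted_r_squared(0.30, n=50, p=5)                  # about 0.2205
adj_ss = adjusted_r_squared_from_ss(70.0, 100.0, n=50, p=5)

print(f"adjusted R^2 = {adj:.4f} (raw R^2 = 0.30)")
```

Here the multiplier is \(49/44 \approx 1.114\), so the unexplained share rises from \(0.70\) to about \(0.78\).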

163.4.2 4.2 Behavior

The correction factor \(\frac{n-1}{n-p-1}\) exceeds one whenever \(p \ge 1\) and grows as \(p\) approaches \(n\). Adding a predictor changes \(R^2_{\text{adj}}\) in two competing ways: it can decrease the \((1 - R^2)\) term by reducing SSE, but it also increases the multiplier. A new predictor raises \(R^2_{\text{adj}}\) only if it reduces SSE by more than the penalty for spending a degree of freedom. For OLS this comparison has an exact form: adding a single predictor raises adjusted \(R^2\) if and only if that predictor’s \(t\) statistic exceeds one in absolute value (equivalently, its \(F\) statistic exceeds one), a far weaker bar than conventional significance thresholds.

Consequently \(R^2_{\text{adj}} \le R^2\) always, with equality only when \(p = 0\) or \(R^2 = 1\), and \(R^2_{\text{adj}}\) can be negative, which happens when the model fits worse than the constant predictor after accounting for the parameters spent. Rearranging the definition shows that this occurs exactly when \(R^2 < p/(n - 1)\). A negative adjusted \(R^2\) is a clear signal that the predictors carry essentially no useful information.

new predictor helps SSE a lot  -> adjusted R^2 rises
new predictor barely helps SSE -> adjusted R^2 falls
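The \(|t| > 1\) criterion can be checked by simulation. In the sketch below, a candidate noise predictor is added to a one-predictor model many times, and each time the direction of change in adjusted \(R^2\) is compared with whether the candidate’s \(t\) statistic exceeds one in absolute value. The data, the seed, and the helper `fit` are assumptions for illustration; the standard errors use the classical homoskedastic OLS formula.

```python
import numpy as np

def fit(X, y):
    """OLS with intercept; returns (adjusted R^2, t statistic of the last column)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    k = A.shape[1]                              # k = p + 1 parameters
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    sigma2 = sse / (n - k)                      # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)       # classical coefficient covariance
    t_last = beta[-1] / np.sqrt(cov[-1, -1])
    adj = 1.0 - (sse / (n - k)) / (sst / (n - 1))
    return adj, t_last

rng = np.random.default_rng(5)
n = 80
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
base_adj, _ = fit(x.reshape(-1, 1), y)

agree = []
for _ in range(200):
    z = rng.normal(size=n)                      # candidate predictor, pure noise
    adj, t = fit(np.column_stack([x, z]), y)
    agree.append((adj > base_adj) == (abs(t) > 1.0))

print(f"criterion and adjusted R^2 agree in {sum(agree)} of {len(agree)} trials")
```

The two conditions coincide in every trial, which is what one expects if the criterion is an algebraic property of OLS rather than an approximation.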

163.4.3 4.3 What Adjusted \(R^2\) Does and Does Not Fix

Adjusted \(R^2\) corrects the naive monotonicity, making it a more honest in-sample criterion for comparing models with different numbers of predictors. It is, however, a weak penalty compared with criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), and it is not designed to estimate out-of-sample performance. For genuine generalization assessment, cross-validated error or a held-out test set remains the standard. Adjusted \(R^2\) should be understood as a refinement of an in-sample descriptive statistic, not a substitute for predictive validation.

163.5 5. Why a High \(R^2\) Is Not Always Good

163.5.1 5.1 It Says Nothing About Causation or Correctness

A high \(R^2\) measures linear association between fit and response. It does not certify that the model is correctly specified, that the relationship is causal, or that the predictors were measured without confounding. A regression of ice cream sales on drowning deaths can show a high \(R^2\) driven entirely by the lurking variable of summer temperature. The number is a measure of fit, not of truth.
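The ice cream example can be imitated with synthetic data. In this sketch, temperature drives both series and neither affects the other; every number, the seed, and the helper `ols` are invented for illustration and are not real measurements.

```python
import numpy as np

def ols(columns, y):
    """OLS of y on the given columns plus an intercept; returns (beta, R^2)."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2

# Synthetic world: temperature is the only cause of both variables.
rng = np.random.default_rng(6)
n = 365
temp = rng.normal(loc=20.0, scale=8.0, size=n)
ice_cream = 50.0 + 5.0 * temp + rng.normal(scale=10.0, size=n)
drownings = 2.0 + 0.3 * temp + rng.normal(scale=1.0, size=n)

# Naive regression of ice cream sales on drowning deaths.
naive_beta, naive_r2 = ols([drownings], ice_cream)

# Same regression with the lurking variable included.
ctrl_beta, ctrl_r2 = ols([drownings, temp], ice_cream)

print(f"naive:      R^2 = {naive_r2:.3f}, drownings coefficient = {naive_beta[1]:.2f}")
print(f"controlled: R^2 = {ctrl_r2:.3f}, drownings coefficient = {ctrl_beta[1]:.2f}")
```

The naive fit reports a high \(R^2\) and a large coefficient on drownings; once temperature enters the model, that coefficient shrinks toward zero.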

163.5.2 5.2 Anscombe’s Quartet and Hidden Misspecification

Anscombe’s quartet is a celebrated set of four datasets that share nearly identical means, variances, regression lines, and \(R^2\) values of about \(0.67\), yet look entirely different when plotted. One is genuinely linear, one is curved, one is linear but for a single outlier, and one is dominated by a single high leverage point. The lesson is that \(R^2\) cannot detect nonlinearity, outliers, or leverage. A high \(R^2\) accompanied by a patterned residual plot indicates a misspecified model, regardless of how impressive the headline number looks.
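The near-identical \(R^2\) values can be recomputed from the quartet’s data. The values below are transcribed from Anscombe (1973); since each dataset is a simple regression, \(R^2\) is the squared Pearson correlation.

```python
import numpy as np

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r2 = {}
for name, (x, y) in quartet.items():
    # Simple linear regression: R^2 equals the squared correlation of x and y.
    r2[name] = np.corrcoef(x, y)[0, 1] ** 2
    print(f"dataset {name}: R^2 = {r2[name]:.3f}")
```

All four values land within rounding of each other, which is why only a plot of the data or of the residuals separates the four cases.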

163.5.3 5.3 Spurious Regression in Time Series

When two independent nonstationary time series, such as random walks, are regressed on each other, the resulting \(R^2\) is frequently large even though there is no relationship whatsoever. This phenomenon, known as spurious regression, arises because the usual asymptotic theory for OLS assumes stationarity. The algebraic sum of squares decomposition still holds, but with independent random walks \(R^2\) does not shrink toward zero as the sample grows; it converges instead to a nondegenerate random variable, so large values remain common at any sample size. With trending data, a high \(R^2\) can be entirely an artifact of trends present in both series. Differencing the series, modeling the dynamics explicitly, or testing for cointegration is required before any \(R^2\) from levels can be trusted.
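A small Monte Carlo experiment illustrates the effect. The sketch below generates pairs of independent Gaussian random walks, computes \(R^2\) for the regression in levels and again after first differencing, and averages over replications. The series length, the number of replications, and the seed are arbitrary choices for illustration.

```python
import numpy as np

def r2_simple(x, y):
    """R^2 of a simple regression with intercept: the squared correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(7)
n, reps = 200, 500
levels, diffs = [], []
for _ in range(reps):
    # Two independent random walks: cumulative sums of independent noise.
    x = np.cumsum(rng.normal(size=n))
    y = np.cumsum(rng.normal(size=n))
    levels.append(r2_simple(x, y))                    # regression in levels
    diffs.append(r2_simple(np.diff(x), np.diff(y)))   # after differencing

mean_levels = float(np.mean(levels))
mean_diffs = float(np.mean(diffs))
share_large = float(np.mean(np.array(levels) > 0.5))

print(f"mean R^2 in levels:      {mean_levels:.3f}")
print(f"mean R^2 in differences: {mean_diffs:.3f}")
print(f"share of level regressions with R^2 > 0.5: {share_large:.2f}")
```

The series are unrelated by construction, so any sizable \(R^2\) in levels is produced by the nonstationarity alone; differencing removes it.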

163.5.4 5.4 The Scale and Context Dependence of “High”

There is no universal threshold separating a good \(R^2\) from a bad one. In tightly controlled physical experiments, an \(R^2\) below \(0.99\) may indicate a problem. In cross-sectional social science or financial return modeling, an \(R^2\) of \(0.10\) can represent a genuine and valuable finding. The expected magnitude depends on the noise inherent in the domain. Judging a model by an absolute \(R^2\) cutoff ignores this, and chasing a higher number can push a modeler toward overfitting or toward discarding a correct but low-signal model.

163.5.5 5.5 Low \(R^2\) Is Not Always Bad

The mirror image of the previous point deserves its own statement. A correctly specified model in a high-noise environment will have a low \(R^2\) and still produce unbiased coefficient estimates, valid inference, and useful predictions of the conditional mean. If the goal is to estimate the effect of a predictor rather than to predict individual outcomes, the standard errors and coefficient estimates matter, and \(R^2\) may be almost irrelevant.
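This situation is easy to simulate. In the sketch below, the model is correctly specified with a true slope of \(2\), but the noise standard deviation is \(10\), so \(R^2\) is small while the slope is still recovered with a tight standard error. The sample size, noise level, and seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000
true_slope = 2.0
x = rng.normal(size=n)
y = 1.0 + true_slope * x + rng.normal(scale=10.0, size=n)   # high-noise setting

# OLS with an intercept.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Classical standard error of the slope in simple regression.
sigma2 = (resid @ resid) / (n - 2)
se_slope = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

print(f"R^2 = {r2:.3f}")
print(f"estimated slope = {beta[1]:.3f} (true {true_slope}), standard error = {se_slope:.3f}")
```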

163.5.6 5.6 Out-of-Sample \(R^2\) Can Be Negative

When \(R^2\) is computed on data not used for fitting, using

\[ R^2_{\text{oos}} = 1 - \frac{\sum_{i \in \text{test}} (y_i - \hat{y}_i)^2}{\sum_{i \in \text{test}} (y_i - \bar{y}_{\text{train}})^2}, \]

it can fall below zero. A negative out-of-sample \(R^2\) means the model predicts worse on new data than simply using the training mean. This is one of the most informative diagnostics available, because the in-sample \(R^2\) of an OLS fit with an intercept can never reveal it. A model with high in-sample \(R^2\) and negative out-of-sample \(R^2\) has almost certainly overfit, unless the test data come from a different distribution than the training data.
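The formula translates directly into code. In this sketch the function name `r2_out_of_sample` and all numbers are made up for illustration; note that the baseline in the denominator is the training mean, not the test mean.

```python
def r2_out_of_sample(y_test, y_pred, y_train_mean):
    """Out-of-sample R^2 with the training mean as the baseline predictor."""
    sse = sum((y - f) ** 2 for y, f in zip(y_test, y_pred))
    sst = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - sse / sst

# Illustrative numbers only.
y_train_mean = 4.0
y_test = [3.0, 4.0, 5.0, 6.0]
good_pred = [3.2, 3.9, 5.1, 5.8]   # close to the test responses
bad_pred = [6.0, 1.0, 8.0, 2.0]    # worse than predicting 4.0 everywhere

r2_good = r2_out_of_sample(y_test, good_pred, y_train_mean)
r2_bad = r2_out_of_sample(y_test, bad_pred, y_train_mean)

print(f"good predictions: R^2_oos = {r2_good:.3f}")
print(f"bad predictions:  R^2_oos = {r2_bad:.3f}")
```

For the bad predictions the squared error is \(43\) against a baseline of \(6\), giving \(1 - 43/6 \approx -6.17\).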

163.6 6. Practical Guidance

Report \(R^2\) alongside, not instead of, residual diagnostics and out-of-sample error. Treat the squared correlation between \(y\) and \(\hat{y}\) as the most durable interpretation. When comparing models of differing complexity, prefer adjusted \(R^2\) over raw \(R^2\), and prefer cross-validation or information criteria over both when the goal is generalization. Always inspect residual plots, because Anscombe’s quartet shows that a scalar summary cannot replace them. Be especially skeptical of high \(R^2\) values from time series in levels, from models with many predictors relative to observations, and from any setting where predictors and response might share an unmeasured common cause. The coefficient of determination is a useful first glance at fit, and a dangerous final word on model quality.

163.7 References

  1. Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17-21. https://www.jstor.org/stable/2682899
  2. Draper, N. R., and Smith, H. (1998). Applied Regression Analysis, 3rd ed. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/9781118625590
  3. Granger, C. W. J., and Newbold, P. (1974). Spurious Regressions in Econometrics. Journal of Econometrics, 2(2), 111-120. https://doi.org/10.1016/0304-4076(74)90034-7
  4. James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. Springer. https://www.statlearning.com
  5. Kvalseth, T. O. (1985). Cautionary Note about \(R^2\). The American Statistician, 39(4), 279-285. https://doi.org/10.1080/00031305.1985.10479448
  6. Wikipedia contributors. Coefficient of Determination. https://en.wikipedia.org/wiki/Coefficient_of_determination