163 R-Squared and Adjusted R-Squared

The coefficient of determination, written $R^2$, is among the most widely reported and most frequently misunderstood quantities in applied statistics and machine learning. It promises a single number summarizing how well a model explains the variation in a response variable. That promise is genuine but narrow. This chapter develops $R^2$ from its algebraic foundations, examines its geometric and statistical meaning, introduces adjusted $R^2$ as a partial remedy for one of its defects, and catalogs the situations in which a high $R^2$ signals nothing useful or actively misleads.

163.1 1. Definition and Decomposition

163.1.1 1.1 The Sums of Squares

Let $y_1, \dots, y_n$ be observed responses with mean $\bar{y} = \frac{1}{n}\sum_i y_i$, and let $\hat{y}_i$ be the fitted values produced by a model. Define three sums of squares:

\[ \text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad \text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 . \]

SST is the total sum of squares, the total variability of the response around its mean. SSE is the residual (error) sum of squares, the variability left unexplained by the model. SSR is the regression sum of squares, the variability captured by the fitted values. Some texts write SSR for the residual sum and SSReg for the regression sum, so always confirm which convention a source uses before comparing formulas.

For ordinary least squares (OLS) with an intercept term, these three quantities satisfy the exact identity

\[ \text{SST} = \text{SSR} + \text{SSE}. \]

This decomposition is the backbone of the entire construction. It holds because the OLS residual vector is orthogonal to the column space of the design matrix, which includes the constant vector. Writing $e_i = y_i - \hat{y}_i$ for the residuals, the expansion of SST is

\[ \sum_i (y_i - \bar{y})^2 = \sum_i \big( (\hat{y}_i - \bar{y}) + e_i \big)^2 = \text{SSR} + \text{SSE} + 2\sum_i (\hat{y}_i - \bar{y})\, e_i . \]

The cross term $2\sum_i (\hat{y}_i - \bar{y})\, e_i$ vanishes precisely because of orthogonality. The OLS normal equations force $\sum_i \hat{y}_i e_i = 0$ (residuals orthogonal to the fit) and, when an intercept is present, $\sum_i e_i = 0$ (residuals sum to zero), so $\sum_i \bar{y}\, e_i = \bar{y}\sum_i e_i = 0$ as well. Both pieces of the cross term are zero, and the clean partition follows. When the model lacks an intercept, or when fitted values come from a method other than OLS, the identity can fail, and that failure has consequences we revisit in Section 4 and Section 5.
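The role of the intercept can be checked numerically. The following is a minimal sketch on made-up data (the values of `x` and `y` and the helper names are illustrative assumptions, not part of the chapter's examples): it fits one least-squares line with an intercept and one forced through the origin, then measures how far SST is from SSR plus SSE in each case.

```python
def sums_of_squares(y, fitted):
    """Return (SST, SSR, SSE) for responses y and fitted values."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssr, sse


def fit_with_intercept(x, y):
    """Least-squares line y = a + b*x, returned as fitted values."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi for xi in x]


def fit_through_origin(x, y):
    """Least-squares line y = b*x with no intercept."""
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return [slope * xi for xi in x]


# Made-up data, roughly y = 4 + x plus small noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.1, 5.9, 7.2, 7.8, 9.1, 9.9]

sst_a, ssr_a, sse_a = sums_of_squares(y, fit_with_intercept(x, y))
sst_b, ssr_b, sse_b = sums_of_squares(y, fit_through_origin(x, y))

# The gap SST - (SSR + SSE) equals the cross term from the expansion above.
gap_with_intercept = sst_a - (ssr_a + sse_a)
gap_through_origin = sst_b - (ssr_b + sse_b)

print("gap with intercept:", gap_with_intercept)
print("gap through origin:", gap_through_origin)
```

With the intercept the gap is zero up to rounding. Without it the residuals no longer sum to zero, so the $\bar{y}\sum_i e_i$ piece of the cross term survives and the partition breaks.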

The partition is easiest to hold in the mind as a single diagram.

```{mermaid}
flowchart TD
    SST["Total variation SST: spread of y around its mean"]
    SSR["Explained variation SSR: spread of fitted values around the mean"]
    SSE["Unexplained variation SSE: spread of residuals"]
    SST --> SSR
    SST --> SSE
    SSR --> RSQ["R squared equals SSR divided by SST"]
    SSE --> RSQ
```

163.1.2 1.2 The Coefficient of Determination

Given the decomposition, $R^2$ is defined as the fraction of total variability explained:

\[ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} . \]

When the identity $\text{SST} = \text{SSR} + \text{SSE}$ holds, $R^2 \in [0, 1]$, because every sum of squares is nonnegative. A value of $1$ means SSE is zero, so the model reproduces every observation exactly. A value of $0$ means SSR is zero, so the model does no better than the constant predictor $\hat{y}_i = \bar{y}$.

The right-hand form $1 - \text{SSE}/\text{SST}$ is the more fundamental definition, and the only one that should be used outside textbook OLS. It compares the model against a fixed baseline, the constant predictor $\bar{y}$, and asks what fraction of the baseline’s squared error the model removes. This framing extends cleanly to settings where the SST equals SSR plus SSE identity breaks, which is exactly why Section 4 and the out-of-sample discussion in Section 5 lean on it.

163.1.3 1.3 A Small Worked Example

Numbers fix the idea. Take five observations and a set of fitted values.

$i$	$y_i$	$\hat{y}_i$	$y_i - \bar{y}$	$\hat{y}_i - \bar{y}$	$y_i - \hat{y}_i$
1	2	2.2	$-4$	$-3.8$	$-0.2$
2	4	3.6	$-2$	$-2.4$	$0.4$
3	5	5.0	$-1$	$-1.0$	$0.0$
4	7	6.4	$1$	$0.4$	$0.6$
5	12	12.8	$6$	$6.8$	$-0.8$

The mean is $\bar{y} = 30/5 = 6$. Computing each column,

\[ \text{SST} = (-4)^2 + (-2)^2 + (-1)^2 + 1^2 + 6^2 = 58, \]

\[ \text{SSE} = (-0.2)^2 + 0.4^2 + 0.0^2 + 0.6^2 + (-0.8)^2 = 0.04 + 0.16 + 0 + 0.36 + 0.64 = 1.20, \]

\[ \text{SSR} = (-3.8)^2 + (-2.4)^2 + (-1.0)^2 + 0.4^2 + 6.8^2 = 14.44 + 5.76 + 1.0 + 0.16 + 46.24 = 67.6 . \]

These fitted values do not come from OLS (they were chosen to illustrate the arithmetic), so the identity does not hold exactly here: $\text{SSR} + \text{SSE} = 68.8 \ne 58 = \text{SST}$. The two definitions of $R^2$ therefore disagree. The ratio form gives $\text{SSR}/\text{SST} = 67.6/58 = 1.166$, an impossible value above one, while the error form gives $1 - \text{SSE}/\text{SST} = 1 - 1.20/58 = 0.979$. This is precisely why the error form is preferred: it remains a sensible “fraction of baseline error removed” even when the additive partition is violated. Had these been genuine OLS fits, the cross term would vanish, SSR and SSE would sum to SST, and the two formulas would agree exactly.
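The arithmetic of this example is easy to reproduce. The sketch below uses only the five observations and fitted values from the table; the variable names are illustrative.

```python
# Observations and fitted values from the worked example.
y = [2, 4, 5, 7, 12]
fitted = [2.2, 3.6, 5.0, 6.4, 12.8]

y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr = sum((fi - y_bar) ** 2 for fi in fitted)

ratio_form = ssr / sst        # SSR / SST, only valid when SST = SSR + SSE
error_form = 1 - sse / sst    # 1 - SSE / SST, valid for any predictions

print(f"SST={sst:.2f}  SSE={sse:.2f}  SSR={ssr:.2f}")
print(f"ratio form={ratio_form:.3f}  error form={error_form:.3f}")
```

The two forms disagree here because the fitted values were not produced by OLS, which is the situation in which only the error form should be reported.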

163.2 2. Interpretations

163.2.1 2.1 Variance Explained

The textbook reading of $R^2$ is “the proportion of variance in the response explained by the predictors.” This is accurate for OLS with an intercept, where SST, SSR, and SSE are proportional to sample variances. It is worth stressing that the word “explained” is mechanical, not causal. A predictor can raise $R^2$ while having no causal relationship to the response, a point developed in Section 5.

163.2.2 2.2 The Correlation Interpretation

For simple linear regression of $y$ on a single predictor $x$, $R^2$ equals the square of the Pearson correlation coefficient between $x$ and $y$:

\[ R^2 = r_{xy}^2, \qquad r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}} . \]

More generally, in multiple regression $R^2$ equals the squared correlation between the observed $y_i$ and the fitted $\hat{y}_i$:

\[ R^2 = \big(\text{corr}(y, \hat{y})\big)^2 . \]

For OLS with an intercept this identity is exact, and it clarifies that $R^2$ measures the strength of linear association between predictions and truth, nothing more. The squared correlation is often a safe way to think about $R^2$ because it remains well defined even when the additive sum of squares decomposition fails, but outside OLS with an intercept it no longer equals the error form $1 - \text{SSE}/\text{SST}$, and the two can diverge sharply. One reason is that squaring a correlation discards its sign, so $\text{corr}(y,\hat{y})^2$ is blind to systematic direction errors. A model whose predictions are perfectly but negatively correlated with the truth would have $\text{corr}(y,\hat{y})^2 = 1$, even though it gets the direction of every deviation backwards. Correlation is also unchanged by shifting or rescaling the predictions, so it cannot see bias either. OLS fits with an intercept never produce these failures, but methods that compute $R^2$ from externally supplied predictions can.
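A short numerical check makes the sign issue concrete. This is a sketch on made-up data with illustrative helper names: it confirms that the error form, $r_{xy}^2$, and $\text{corr}(y,\hat{y})^2$ agree for a least-squares line, then reflects the predictions around $\bar{y}$ so that every deviation points the wrong way.

```python
def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def error_form_r2(y, pred):
    """R^2 computed as 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Made-up data and a least-squares line with intercept.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.1, 5.9, 7.2, 7.8, 9.1, 9.9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
fitted = [y_bar + slope * (xi - x_bar) for xi in x]

r2_ols = error_form_r2(y, fitted)
corr_xy_sq = pearson(x, y) ** 2
corr_fit_sq = pearson(y, fitted) ** 2

# Reflect the predictions around the mean: every deviation now has the wrong sign.
flipped = [2 * y_bar - fi for fi in fitted]
corr_flipped_sq = pearson(y, flipped) ** 2
r2_flipped = error_form_r2(y, flipped)

print("OLS:     error form", r2_ols, " corr^2", corr_fit_sq)
print("flipped: error form", r2_flipped, " corr^2", corr_flipped_sq)
```

The squared correlation is unchanged by the reflection, while the error form turns negative. When predictions are supplied from outside an OLS fit, the two numbers answer different questions.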

163.2.3 2.3 Geometric View

Collect the centered response in a vector and view fitting as projection onto the subspace spanned by the centered predictors. Then

\[ R^2 = \cos^2 \theta, \]

where $\theta$ is the angle between the centered response vector and its projection. A small angle means the predictors nearly align with the response and $R^2$ approaches $1$. An angle near $90^\circ$ means the predictors are nearly orthogonal to the response and $R^2$ approaches $0$. This geometry makes the orthogonality of residuals and fit visually obvious: the residual vector is the leg of a right triangle perpendicular to the fitted-value leg, the centered response is the hypotenuse, and $R^2 = \cos^2\theta$ is the Pythagorean statement $\text{SST} = \text{SSR} + \text{SSE}$ rewritten as a ratio of squared lengths.

Formally, write the fit as $\hat{y} = H y$ where $H = X(X^\top X)^{-1} X^\top$ is the hat (projection) matrix onto the column space of the design matrix $X$. The hat matrix is symmetric and idempotent ($H^2 = H$), the defining properties of an orthogonal projection. The residual vector $y - \hat{y} = (I - H) y$ lives in the orthogonal complement, which is what makes the cross term vanish and forces $R^2$ into $[0,1]$. The trace of $H$ equals the number of fitted parameters, the fact that drives the degrees-of-freedom accounting in Section 4.
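These properties can be verified directly for a small design. The sketch below assumes a simple regression with design matrix $X = [1, x]$ and made-up data, builds $H = X(X^\top X)^{-1}X^\top$ with a hand-written $2 \times 2$ inverse, and checks symmetry, idempotence, the trace, and the $\cos^2\theta$ reading of $R^2$. Variable names are illustrative.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.1, 5.9, 7.2, 7.8, 9.1, 9.9]
n = len(x)

# Design matrix with an intercept column, and the 2x2 inverse of X^T X.
X = [[1.0, xi] for xi in x]
sum_x = sum(x)
sum_xx = sum(xi * xi for xi in x)
det = n * sum_xx - sum_x * sum_x
inv = [[sum_xx / det, -sum_x / det],
       [-sum_x / det, n / det]]


def hat_entry(i, j):
    """Entry (i, j) of H = X (X^T X)^{-1} X^T."""
    return sum(X[i][a] * inv[a][b] * X[j][b] for a in range(2) for b in range(2))


H = [[hat_entry(i, j) for j in range(n)] for i in range(n)]
H2 = [[sum(H[i][k] * H[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

trace_H = sum(H[i][i] for i in range(n))
max_asymmetry = max(abs(H[i][j] - H[j][i]) for i in range(n) for j in range(n))
max_idempotency_gap = max(abs(H2[i][j] - H[i][j]) for i in range(n) for j in range(n))

# Fitted values are H y; compare cos^2(theta) with 1 - SSE/SST.
fitted = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
y_bar = sum(y) / n
yc = [yi - y_bar for yi in y]          # centered response
fc = [fi - y_bar for fi in fitted]     # centered fit
dot = sum(a * b for a, b in zip(yc, fc))
cos_sq = dot ** 2 / (sum(a * a for a in yc) * sum(b * b for b in fc))
r2 = 1 - sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / sum(a * a for a in yc)

print("trace(H) =", trace_H)
print("max |H^2 - H| =", max_idempotency_gap)
print("cos^2(theta) =", cos_sq, " R^2 =", r2)
```

The trace comes out as two, one for the intercept and one for the slope.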

163.3 3. Why $R^2$ Never Decreases When You Add Predictors

A structural defect of $R^2$ is that it is monotone nondecreasing in the number of predictors. Adding any regressor to an OLS model, even one filled with random noise, cannot increase SSE and therefore cannot decrease $R^2$.

The reason is that the smaller model is nested inside the larger one. Least squares minimizes SSE over a larger parameter space when a column is added, and the minimum over a larger set cannot exceed the minimum over a subset. Setting the new coefficient to zero recovers the old fit, so the optimizer can only match or improve it.

add a column of pure noise to X
SSE can only stay equal or shrink
therefore R^2 can only stay equal or grow

The effect is not merely that noise cannot hurt; on any finite sample, noise predictors help by a predictable amount. Suppose none of the $p$ predictors is related to $y$ and the model includes an intercept. Then the expected value of $R^2$ is approximately $p/(n-1)$ (exactly so under independent Gaussian errors), never zero. A model with $20$ junk predictors and $40$ observations will therefore, on average, report an $R^2$ near $20/39 \approx 0.51$ from nothing but chance alignment. This is the quantitative core of the overfitting warning.
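A small simulation illustrates both the monotonicity and the chance-level value. The sketch below is one possible setup, not a definitive experiment: the helper `r2_ols` (an illustrative name) computes $R^2$ for OLS with an intercept by Gram-Schmidt orthogonalization of the centered columns, and the response and all predictors are independent standard normal draws, so every predictor is pure noise.

```python
import random


def r2_ols(columns, y):
    """R^2 of OLS of y on the given columns plus an intercept.

    Centering handles the intercept; modified Gram-Schmidt builds an
    orthonormal basis for the centered columns and peels each direction
    off the residual in turn.
    """
    n = len(y)
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    sst = sum(r * r for r in resid)
    basis = []
    for col in columns:
        c_bar = sum(col) / n
        v = [ci - c_bar for ci in col]
        for q in basis:
            coef = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm < 1e-12:        # column adds no new direction
            continue
        q = [vi / norm for vi in v]
        basis.append(q)
        coef = sum(ri * qi for ri, qi in zip(resid, q))
        resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return 1.0 - sum(r * r for r in resid) / sst


rng = random.Random(0)          # fixed seed so the run is repeatable
n, p, reps = 40, 20, 200

# Average R^2 when every predictor is noise.
r2_values = []
for _ in range(reps):
    y = [rng.gauss(0, 1) for _ in range(n)]
    noise = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    r2_values.append(r2_ols(noise, y))
mean_r2 = sum(r2_values) / reps

# R^2 along one path of nested models: 0, 1, ..., p noise predictors.
y = [rng.gauss(0, 1) for _ in range(n)]
noise = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
path = [r2_ols(noise[:k], y) for k in range(p + 1)]
is_monotone = all(b >= a - 1e-12 for a, b in zip(path, path[1:]))

print("mean R^2 over replications:", mean_r2)
print("p / (n - 1):               ", p / (n - 1))
print("R^2 never decreased along the nested path:", is_monotone)
```

The average should land close to $p/(n-1)$, with the exact figure depending on the seed and the number of replications.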

The practical danger is overfitting. With $n$ observations, a model with $p = n - 1$ linearly independent predictors plus an intercept has as many free parameters as data points. It can interpolate the data, driving SSE to zero and $R^2$ to one, while predicting future data no better than chance. $R^2$ rewards complexity regardless of whether that complexity reflects signal or noise. This is exactly the failure that adjusted $R^2$ tries to address.

163.4 4. Adjusted $R^2$

163.4.1 4.1 Definition

Adjusted $R^2$ penalizes the inclusion of predictors by replacing raw sums of squares with their degrees-of-freedom-corrected counterparts. With $n$ observations and $p$ predictors (excluding the intercept):

\[ R^2_{\text{adj}} = 1 - \frac{\text{SSE} / (n - p - 1)}{\text{SST} / (n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} . \]

The numerator $\text{SSE}/(n - p - 1)$ is an unbiased estimate of the residual variance $\sigma^2$, and the denominator $\text{SST}/(n - 1)$ is the usual unbiased estimate of the variance of $y$. So adjusted $R^2$ can be read as one minus the ratio of two variance estimates. The quantity $n - p - 1$ is the residual degrees of freedom, the sample size minus the number of fitted parameters (the $p$ slopes plus one intercept), and it matches the trace argument from Section 2.3.
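Both forms of the definition translate directly into code. The following is a minimal sketch with illustrative function names; it also evaluates the formula at $R^2 = p/(n-1)$ for $n = 40$ and $p = 20$.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 from R^2, sample size n, and p predictors (intercept excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for positive residual degrees of freedom")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def adjusted_r2_from_ss(sse, sst, n, p):
    """Same quantity written as one minus a ratio of two variance estimates."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for positive residual degrees of freedom")
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))


n, p = 40, 20
r2_chance = p / (n - 1)

adj_at_chance = adjusted_r2(r2_chance, n, p)
adj_at_half = adjusted_r2(0.5, n, p)

# The two forms agree for any SSE, SST with R^2 = 1 - SSE/SST.
sse, sst = 30.0, 100.0
form_a = adjusted_r2(1 - sse / sst, n, p)
form_b = adjusted_r2_from_ss(sse, sst, n, p)

print("adjusted R^2 at R^2 = p/(n-1):", adj_at_chance)
print("adjusted R^2 at R^2 = 0.5:    ", adj_at_half)
```

Under these assumptions an $R^2$ of exactly $p/(n-1)$ maps to an adjusted value of zero, and anything below it maps to a negative value.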

163.4.2 4.2 Behavior

The correction factor $\frac{n-1}{n-p-1}$ exceeds one and grows as $p$ approaches $n$. Adding a predictor changes $R^2_{\text{adj}}$ in two competing ways: it can decrease the $(1 - R^2)$ term by reducing SSE, but it also increases the multiplier. A new predictor raises $R^2_{\text{adj}}$ only if it reduces SSE by more than the penalty for spending a degree of freedom.

This rule has a precise form. Adding one predictor raises $R^2_{\text{adj}}$ if and only if the partial $F$ statistic for that predictor exceeds one, which for a single coefficient is equivalent to its squared $t$ statistic exceeding one, that is $|t| > 1$. This is a far weaker bar than the conventional significance threshold of roughly $|t| > 2$. Adjusted $R^2$ will therefore happily retain predictors that are nowhere near statistically significant, which is one reason it is a mild rather than aggressive penalty.

Consequently $R^2_{\text{adj}} \le R^2$ always, with equality only when $p = 0$ or $R^2 = 1$, and $R^2_{\text{adj}}$ can be negative. From the formula in Section 4.1, this happens exactly when $R^2 < p/(n-1)$, that is, when the model fits worse than the constant predictor after accounting for the parameters spent. A negative adjusted $R^2$ is a clear signal that the predictors carry essentially no useful information.

new predictor's |t| > 1   -> adjusted R^2 rises
new predictor's |t| < 1   -> adjusted R^2 falls
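The rule can be checked empirically. The sketch below is one way to do it, under stated assumptions: the data are simulated, the helper `sse_ols` (an illustrative name) computes the residual sum of squares by Gram-Schmidt orthogonalization, and the squared $t$ statistic of the added predictor is obtained as the partial $F$ statistic, which is the equivalence stated above.

```python
import random


def sse_ols(columns, y):
    """Residual sum of squares of OLS of y on the columns plus an intercept."""
    n = len(y)
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    basis = []
    for col in columns:
        c_bar = sum(col) / n
        v = [ci - c_bar for ci in col]
        for q in basis:
            coef = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm < 1e-12:
            continue
        q = [vi / norm for vi in v]
        basis.append(q)
        coef = sum(ri * qi for ri, qi in zip(resid, q))
        resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return sum(r * r for r in resid)


def adj_r2(sse, sst, n, p):
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))


rng = random.Random(7)
n = 30
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]   # one real predictor
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)

sse_old = sse_ols([x], y)
adj_old = adj_r2(sse_old, sst, n, 1)

results = []
for _ in range(20):
    z = [rng.gauss(0, 1) for _ in range(n)]          # candidate noise predictor
    sse_new = sse_ols([x, z], y)
    # Partial F for one added coefficient; equals the squared t statistic of z.
    partial_f = (sse_old - sse_new) / (sse_new / (n - 2 - 1))
    adj_new = adj_r2(sse_new, sst, n, 2)
    results.append((partial_f, adj_new > adj_old))

rule_holds = all((f > 1) == rose for f, rose in results)
print("candidates that raised adjusted R^2:", sum(rose for _, rose in results))
print("rise coincides with t^2 > 1 in every case:", rule_holds)
```

How many noise candidates clear the bar depends on the seed, but the agreement between the two criteria is algebraic and does not.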

163.4.3 4.3 What Adjusted $R^2$ Does and Does Not Fix

Adjusted $R^2$ corrects the naive monotonicity, making it a more honest in-sample criterion for comparing models with different numbers of predictors. It is, however, a weak penalty compared with criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), and it is not designed to estimate out-of-sample performance. Roughly, adjusted $R^2$ keeps a predictor at the $|t|>1$ bar, AIC at about $|t|>\sqrt{2} \approx 1.4$, and BIC at about $|t|>\sqrt{\log n}$, a threshold that grows with the sample size. For all but the smallest samples, the three therefore impose increasingly strict penalties on complexity. For genuine generalization assessment, cross-validated error or a held-out test set remains the standard. Adjusted $R^2$ should be understood as a refinement of an in-sample descriptive statistic, not a substitute for predictive validation.

163.4.4 4.4 Predictive $R^2$ and Cross-Validated $R^2$

Two further variants close the gap toward generalization while keeping the $R^2$ scale. The predicted residual sum of squares (PRESS) replaces each residual with its leave-one-out counterpart $y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the prediction for observation $i$ from a model fit without that observation. For OLS this has the closed form $y_i - \hat{y}_{(i)} = e_i / (1 - h_{ii})$, where $h_{ii}$ is the $i$th diagonal of the hat matrix, so PRESS costs nothing beyond a single fit. The predictive $R^2$ is then

\[ R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{\text{SST}}, \qquad \text{PRESS} = \sum_i \left( \frac{e_i}{1 - h_{ii}} \right)^2 . \]

A $k$-fold cross-validated $R^2$ generalizes the same idea to arbitrary models by averaging held-out squared error across folds. Both can be negative, both penalize overfitting directly rather than through a degrees-of-freedom proxy, and both are preferable to adjusted $R^2$ when the question is how the model will behave on new data.
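The leave-one-out shortcut is worth verifying once by brute force. The sketch below assumes a simple regression on made-up data, for which the leverage has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, and compares PRESS from the shortcut with PRESS from $n$ explicit refits. Names are illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data with one point far out in x (high leverage).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 10.5]
n = len(x)

a, b = fit_line(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Shortcut: leave-one-out residual is e_i / (1 - h_ii).
press_shortcut = sum((e / (1 - h)) ** 2 for e, h in zip(resid, leverage))

# Brute force: refit without observation i, then predict it.
press_loop = 0.0
for i in range(n):
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    press_loop += (y[i] - (a_i + b_i * x[i])) ** 2

y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum(e * e for e in resid)
r2 = 1 - sse / sst
r2_pred = 1 - press_shortcut / sst

print("PRESS (shortcut):", press_shortcut)
print("PRESS (refits):  ", press_loop)
print("R^2:", r2, " predictive R^2:", r2_pred)
```

Because each leave-one-out residual is at least as large in magnitude as the ordinary residual, predictive $R^2$ never exceeds the in-sample $R^2$.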

163.5 5. Why a High $R^2$ Is Not Always Good

163.5.1 5.1 It Says Nothing About Causation or Correctness

A high $R^2$ measures linear association between fit and response. It does not certify that the model is correctly specified, that the relationship is causal, or that the estimated relationship is free of confounding. A regression of ice cream sales on drowning deaths can show a high $R^2$ driven entirely by the lurking variable of summer temperature. The number is a measure of fit, not of truth.

163.5.2 5.2 Anscombe’s Quartet and Hidden Misspecification

Anscombe’s quartet is a celebrated set of four datasets that share nearly identical means, variances, regression lines, and $R^2$ values of about $0.67$, yet look entirely different when plotted. One is genuinely linear, one is curved, one is linear but for a single outlier, and one is dominated by a single high leverage point. The lesson is that $R^2$ cannot detect nonlinearity, outliers, or leverage. A high $R^2$ accompanied by a patterned residual plot indicates a misspecified model, regardless of how impressive the headline number looks. The Datasaurus Dozen extends the same warning to twelve dramatically different scatterplots that share summary statistics, reinforcing that no scalar can substitute for plotting the data.
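The quartet is small enough to check by hand or in a few lines of code. The sketch below hard-codes the four datasets as published by Anscombe (1973) and computes $R^2$ for each simple regression as the squared Pearson correlation; the helper name is illustrative.

```python
def r_squared(x, y):
    """R^2 of the simple regression of y on x (squared Pearson correlation)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# Anscombe's quartet: datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}
r2_by_dataset = {name: r_squared(x, y) for name, (x, y) in quartet.items()}

for name, value in r2_by_dataset.items():
    print(f"dataset {name}: R^2 = {value:.3f}")
```

The four values are essentially indistinguishable, which is the point: only a scatterplot or a residual plot separates the datasets.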

163.5.3 5.3 Spurious Regression in Time Series

When two independent nonstationary time series, such as random walks, are regressed on each other, the resulting $R^2$ is frequently large even though there is no relationship whatsoever. This phenomenon, known as spurious regression, arises because the usual sampling theory for regression statistics assumes stationarity. The sum of squares decomposition still holds algebraically, but with independent random walks $R^2$ does not settle near zero as the sample grows. With trending data, a high $R^2$ can be entirely an artifact of shared trends. Differencing the series, modeling the dynamics explicitly, or testing for cointegration is required before any $R^2$ from levels can be trusted.
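The effect is easy to reproduce by simulation. The sketch below is illustrative only: it generates pairs of independent Gaussian random walks (the series length, number of replications, and seed are arbitrary choices), regresses one on the other in levels and in first differences, and averages the resulting $R^2$ values.

```python
import random


def r_squared(x, y):
    """R^2 of the simple regression of y on x (squared Pearson correlation)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


def random_walk(n, rng):
    """Cumulative sum of n independent standard normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def difference(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(1)
n_obs, reps = 200, 200

r2_levels, r2_diffs = [], []
for _ in range(reps):
    a = random_walk(n_obs, rng)
    b = random_walk(n_obs, rng)      # independent of a by construction
    r2_levels.append(r_squared(a, b))
    r2_diffs.append(r_squared(difference(a), difference(b)))

mean_r2_levels = sum(r2_levels) / reps
mean_r2_diffs = sum(r2_diffs) / reps
share_above_half = sum(v > 0.5 for v in r2_levels) / reps

print("mean R^2, levels:     ", mean_r2_levels)
print("mean R^2, differences:", mean_r2_diffs)
print("share of level regressions with R^2 > 0.5:", share_above_half)
```

In levels the average $R^2$ is far from zero even though the series are unrelated, while in differences it collapses to roughly the chance level for a single noise predictor.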

163.5.4 5.4 The Scale and Context Dependence of “High”

There is no universal threshold separating a good $R^2$ from a bad one. In tightly controlled physical experiments, an $R^2$ below $0.99$ may indicate a problem. In cross-sectional social science or financial return modeling, an $R^2$ of $0.10$ can represent a genuine and valuable finding. The expected magnitude depends on the noise inherent in the domain. Judging a model by an absolute $R^2$ cutoff ignores this, and chasing a higher number can push a modeler toward overfitting or toward discarding a correct but low-signal model.

163.5.5 5.5 Low $R^2$ Is Not Always Bad

The mirror image of the previous point deserves its own statement. A correctly specified model in a high-noise environment will have a low $R^2$ and still produce unbiased coefficient estimates, valid inference, and useful predictions of the conditional mean. If the goal is to estimate the effect of a predictor rather than to predict individual outcomes, the standard errors and coefficient estimates matter, and $R^2$ may be almost irrelevant.

163.5.6 5.6 Out-of-Sample $R^2$ Can Be Negative

When $R^2$ is computed on data not used for fitting, using

\[ R^2_{\text{oos}} = 1 - \frac{\sum_{i \in \text{test}} (y_i - \hat{y}_i)^2}{\sum_{i \in \text{test}} (y_i - \bar{y}_{\text{train}})^2}, \]

it can fall below zero. A negative out-of-sample $R^2$ means the model predicts worse on new data than simply using the training mean. This is one of the most informative diagnostics available, because the in-sample $R^2$ can never reveal it. A model with high in-sample $R^2$ and negative out-of-sample $R^2$ has overfit. Note the deliberate choice of baseline: the denominator centers on the training mean $\bar{y}_{\text{train}}$, not the test mean, because the test mean would not be known at prediction time. Using the test mean instead defines a different quantity that can only be lower or equal, because the test mean is the best constant predictor on the test set and so makes the baseline harder to beat. Library defaults differ on this point (scikit-learn's r2_score, for example, centers on the mean of the responses it is given), so report which baseline you used.
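The choice of baseline can change the reported number substantially. The sketch below uses made-up training responses, test responses, and two hypothetical sets of test predictions (none of them come from a fitted model in this chapter) to compute the out-of-sample $R^2$ against each baseline.

```python
def r2_against_baseline(y_test, predictions, baseline):
    """1 - SSE / (squared error of a constant baseline) on the test set."""
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y_test, predictions))
    baseline_error = sum((yi - baseline) ** 2 for yi in y_test)
    return 1 - sse / baseline_error


# Made-up data: the test responses sit well above the training responses.
y_train = [3.0, 4.0, 5.0, 6.0, 7.0]
y_test = [8.0, 9.0, 10.0, 11.0]

train_mean = sum(y_train) / len(y_train)
test_mean = sum(y_test) / len(y_test)

# Hypothetical predictions from a reasonable model and from an overfit one.
good_predictions = [7.0, 9.5, 9.0, 12.0]
overfit_predictions = [2.0, 15.0, 4.0, 16.0]

r2_train_baseline = r2_against_baseline(y_test, good_predictions, train_mean)
r2_test_baseline = r2_against_baseline(y_test, good_predictions, test_mean)
r2_overfit = r2_against_baseline(y_test, overfit_predictions, train_mean)

print("good model, training-mean baseline:", r2_train_baseline)
print("good model, test-mean baseline:    ", r2_test_baseline)
print("overfit model, training-mean baseline:", r2_overfit)
```

The same predictions score very differently under the two baselines when the test distribution has shifted, and the overfit predictions score below zero.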

163.5.7 5.7 $R^2$ Outside Linear Regression

The plain coefficient of determination is built for continuous responses fit by least squares. For other model families the construction must be adapted, and the adaptations are not interchangeable. For logistic regression and other generalized linear models, McFadden’s pseudo-$R^2$, defined as $1 - \log L_{\text{full}} / \log L_{\text{null}}$ from the fitted and null log-likelihoods, is a common analogue, but its numerical range is not comparable to ordinary $R^2$ and values around $0.2$ to $0.4$ already indicate a strong fit. Nagelkerke and Cox-Snell offer alternative scalings. The practical rule is to never compare a pseudo-$R^2$ from one model family against an ordinary $R^2$ from another, because they measure different things on different scales.
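McFadden's definition needs only two log-likelihoods. The sketch below assumes a binary response and a vector of fitted probabilities that are hypothetical (they are supplied by hand, not estimated by maximum likelihood); the null model predicts the sample proportion for every observation. Function names are illustrative.

```python
import math


def bernoulli_log_likelihood(y, probs):
    """Log-likelihood of binary outcomes y under predicted probabilities."""
    return sum(math.log(p) if yi == 1 else math.log(1 - p)
               for yi, p in zip(y, probs))


def mcfadden_r2(y, fitted_probs):
    """McFadden pseudo-R^2: 1 - logL_full / logL_null."""
    base_rate = sum(y) / len(y)                 # intercept-only model
    ll_null = bernoulli_log_likelihood(y, [base_rate] * len(y))
    ll_full = bernoulli_log_likelihood(y, fitted_probs)
    return 1 - ll_full / ll_null


# Made-up outcomes and hypothetical fitted probabilities.
y = [0, 0, 0, 1, 0, 1, 1, 1]
fitted_probs = [0.10, 0.20, 0.30, 0.40, 0.35, 0.70, 0.80, 0.90]

pseudo_r2 = mcfadden_r2(y, fitted_probs)
pseudo_r2_null = mcfadden_r2(y, [sum(y) / len(y)] * len(y))

print("McFadden pseudo-R^2:", pseudo_r2)
print("pseudo-R^2 of the null model itself:", pseudo_r2_null)
```

The statistic is zero when the model matches the null and approaches one only as every fitted probability approaches the observed outcome, so its scale is not the variance-explained scale of ordinary $R^2$.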

163.6 6. Practical Guidance

Report $R^2$ alongside, not instead of, residual diagnostics and out-of-sample error. Treat the squared correlation between $y$ and $\hat{y}$, and the error form $1 - \text{SSE}/\text{SST}$, as the most durable interpretations. When comparing models of differing complexity, prefer adjusted $R^2$ over raw $R^2$, and prefer cross-validation, predictive $R^2$, or information criteria over both when the goal is generalization. Always inspect residual plots, because Anscombe’s quartet shows that no scalar summary can replace them. Be especially skeptical of high $R^2$ values from time series in levels, from models with many predictors relative to observations, and from any setting where the predictors might be downstream of confounders. The coefficient of determination is a useful first glance at fit, and a dangerous final word on model quality.

For practitioners, every quantity in this chapter is available in mature open-source tooling. In Python, numpy and pandas compute the sums of squares directly, scikit-learn provides r2_score (which uses the error form and can return negative values) along with cross-validation utilities, and statsmodels reports both $R^2$ and adjusted $R^2$ in its OLS summary and exposes leave-one-out (PRESS) residuals through its influence diagnostics. In R, the base lm summary reports both $R^2$ and adjusted $R^2$, and packages such as caret and boot handle cross-validated variants. None of this requires proprietary software.

163.7 References

Allen, D. M. (1974). The Relationship Between Variable Selection and Data Augmentation and a Method for Prediction. Technometrics, 16(1), 125-127. https://doi.org/10.1080/00401706.1974.10489157
Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17-21. https://doi.org/10.1080/00031305.1973.10478966
Draper, N. R., and Smith, H. (1998). Applied Regression Analysis, 3rd ed. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/9781118625590
Granger, C. W. J., and Newbold, P. (1974). Spurious Regressions in Econometrics. Journal of Econometrics, 2(2), 111-120. https://doi.org/10.1016/0304-4076(74)90034-7
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. Springer. https://doi.org/10.1007/978-1-0716-1418-1
Kvalseth, T. O. (1985). Cautionary Note about $R^2$. The American Statistician, 39(4), 279-285. https://doi.org/10.1080/00031305.1985.10479448
McFadden, D. (1974). Conditional Logit Analysis of Qualitative Choice Behavior. In P. Zarembka (Ed.), Frontiers in Econometrics (pp. 105-142). Academic Press.
Nagelkerke, N. J. D. (1991). A Note on a General Definition of the Coefficient of Determination. Biometrika, 78(3), 691-692. https://doi.org/10.1093/biomet/78.3.691

# R-Squared and Adjusted R-Squared The coefficient of determination, written $R^2$, is among the most widely reported and most frequently misunderstood quantities in applied statistics and machine learning. It promises a single number summarizing how well a model explains the variation in a response variable. That promise is genuine but narrow. This chapter develops $R^2$ from its algebraic foundations, examines its geometric and statistical meaning, introduces adjusted $R^2$ as a partial remedy for one of its defects, and catalogs the situations in which a high $R^2$ signals nothing useful or actively misleads. ## 1. Definition and Decomposition ### 1.1 The Sums of Squares Let $y_1, \dots, y_n$ be observed responses with mean $\bar{y} = \frac{1}{n}\sum_i y_i$, and let $\hat{y}_i$ be the fitted values produced by a model. Define three sums of squares: $$ \text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad \text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 . $$ SST is the total sum of squares, the total variability of the response around its mean. SSE is the residual (error) sum of squares, the variability left unexplained by the model. SSR is the regression sum of squares, the variability captured by the fitted values. Some texts write SSR for the residual sum and SSReg for the regression sum, so always confirm which convention a source uses before comparing formulas. For ordinary least squares (OLS) with an intercept term, these three quantities satisfy the exact identity $$ \text{SST} = \text{SSR} + \text{SSE}. $$ This decomposition is the backbone of the entire construction. It holds because the OLS residual vector is orthogonal to the column space of the design matrix, which includes the constant vector. 
Writing $e_i = y_i - \hat{y}_i$ for the residuals, the expansion of SST is $$ \sum_i (y_i - \bar{y})^2 = \sum_i \big( (\hat{y}_i - \bar{y}) + e_i \big)^2 = \text{SSR} + \text{SSE} + 2\sum_i (\hat{y}_i - \bar{y})\, e_i . $$ The cross term $2\sum_i (\hat{y}_i - \bar{y})\, e_i$ vanishes precisely because of orthogonality. The OLS normal equations force $\sum_i \hat{y}_i e_i = 0$ (residuals orthogonal to the fit) and, when an intercept is present, $\sum_i e_i = 0$ (residuals sum to zero), so $\sum_i \bar{y}\, e_i = \bar{y}\sum_i e_i = 0$ as well. Both pieces of the cross term are zero, and the clean partition follows. When the model lacks an intercept, or when fitted values come from a method other than OLS, the identity can fail, and that failure has consequences we revisit in Section 4 and Section 5. The partition is easiest to hold in the mind as a single diagram. ```{mermaid} flowchart TD SST["Total variation SST: spread of y around its mean"] SSR["Explained variation SSR: spread of fitted values around the mean"] SSE["Unexplained variation SSE: spread of residuals"] SST --> SSR SST --> SSE SSR --> RSQ["R squared equals SSR divided by SST"] SSE --> RSQ ``` ### 1.2 The Coefficient of Determination Given the decomposition, $R^2$ is defined as the fraction of total variability explained: $$ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} . $$ When the identity $\text{SST} = \text{SSR} + \text{SSE}$ holds and all sums are nonnegative, $R^2 \in [0, 1]$. A value of $1$ means SSE is zero, so the model reproduces every observation exactly. A value of $0$ means SSR is zero, so the model does no better than the constant predictor $\hat{y}_i = \bar{y}$. The right-hand form $1 - \text{SSE}/\text{SST}$ is the more fundamental definition, and the only one that should be used outside textbook OLS. It compares the model against a fixed baseline, the constant predictor $\bar{y}$, and asks what fraction of the baseline's squared error the model removes. 
This framing extends cleanly to settings where the SST equals SSR plus SSE identity breaks, which is exactly why Section 4 and the out-of-sample discussion in Section 5 lean on it. ### 1.3 A Small Worked Example Numbers fix the idea. Take five observations and a fitted line. | $i$ | $y_i$ | $\hat{y}_i$ | $y_i - \bar{y}$ | $\hat{y}_i - \bar{y}$ | $y_i - \hat{y}_i$ | |----|------|------------|----------------|----------------------|-------------------| | 1 | 2 | 2.2 | $-4$ | $-3.8$ | $-0.2$ | | 2 | 4 | 3.6 | $-2$ | $-2.4$ | $0.4$ | | 3 | 5 | 5.0 | $-1$ | $-1.0$ | $0.0$ | | 4 | 7 | 6.4 | $1$ | $0.4$ | $0.6$ | | 5 | 12 | 12.8 | $6$ | $6.8$ | $-0.8$ | The mean is $\bar{y} = 30/5 = 6$. Computing each column, $$ \text{SST} = (-4)^2 + (-2)^2 + (-1)^2 + 1^2 + 6^2 = 58, $$ $$ \text{SSE} = (-0.2)^2 + 0.4^2 + 0.0^2 + 0.6^2 + (-0.8)^2 = 0.04 + 0.16 + 0 + 0.36 + 0.64 = 1.20, $$ $$ \text{SSR} = (-3.8)^2 + (-2.4)^2 + (-1.0)^2 + 0.4^2 + 6.8^2 = 14.44 + 5.76 + 1.0 + 0.16 + 46.24 = 67.6 . $$ These fitted values do not come from OLS (they were chosen to illustrate the arithmetic), so the identity does not hold exactly here: $\text{SSR} + \text{SSE} = 68.8 \ne 58 = \text{SST}$. The two definitions of $R^2$ therefore disagree. The ratio form gives $\text{SSR}/\text{SST} = 67.6/58 = 1.166$, an impossible value above one, while the error form gives $1 - \text{SSE}/\text{SST} = 1 - 1.20/58 = 0.979$. This is precisely why the error form is preferred: it remains a sensible "fraction of baseline error removed" even when the additive partition is violated. Had these been genuine OLS fits, the cross term would vanish, SSR and SSE would sum to SST, and the two formulas would agree exactly. ## 2. Interpretations ### 2.1 Variance Explained The textbook reading of $R^2$ is "the proportion of variance in the response explained by the predictors." This is accurate for OLS with an intercept, where SST, SSR, and SSE are proportional to sample variances. 
It is worth stressing that the word "explained" is mechanical, not causal. A predictor can raise $R^2$ while having no causal relationship to the response, a point developed in Section 5. ### 2.2 The Correlation Interpretation For simple linear regression of $y$ on a single predictor $x$, $R^2$ equals the square of the Pearson correlation coefficient between $x$ and $y$: $$ R^2 = r_{xy}^2, \qquad r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}} . $$ More generally, in multiple regression $R^2$ equals the squared correlation between the observed $y_i$ and the fitted $\hat{y}_i$: $$ R^2 = \big(\text{corr}(y, \hat{y})\big)^2 . $$ This identity is robust and is often the safest way to think about $R^2$, because it remains meaningful even when the additive sum of squares decomposition is shaky. It also clarifies that $R^2$ measures the strength of linear association between predictions and truth, nothing more. Note one consequence: because squaring a correlation discards its sign, $R^2$ is blind to systematic direction errors. A model whose predictions are perfectly but negatively correlated with the truth would have $\text{corr}(y,\hat{y})^2 = 1$, even though it gets the direction of every deviation backwards. In practice OLS fits never produce this, but methods that compute $R^2$ from externally supplied predictions can. ### 2.3 Geometric View Collect the centered response in a vector and view fitting as projection onto the subspace spanned by the centered predictors. Then $$ R^2 = \cos^2 \theta, $$ where $\theta$ is the angle between the centered response vector and its projection. A small angle means the predictors nearly align with the response and $R^2$ approaches $1$. An angle near $90^\circ$ means the predictors are nearly orthogonal to the response and $R^2$ approaches $0$. 
This geometry makes the orthogonality of residuals and fit visually obvious: the residual vector is the leg of a right triangle perpendicular to the fitted-value leg, the centered response is the hypotenuse, and $R^2 = \cos^2\theta$ is the Pythagorean statement $\text{SST} = \text{SSR} + \text{SSE}$ rewritten as a ratio of squared lengths. Formally, write the fit as $\hat{y} = H y$ where $H = X(X^\top X)^{-1} X^\top$ is the hat (projection) matrix onto the column space of the design matrix $X$. The hat matrix is symmetric and idempotent ($H^2 = H$), the defining properties of an orthogonal projection. The residual vector $y - \hat{y} = (I - H) y$ lives in the orthogonal complement, which is what makes the cross term vanish and forces $R^2$ into $[0,1]$. The trace of $H$ equals the number of fitted parameters, the fact that drives the degrees-of-freedom accounting in Section 4. ## 3. Why $R^2$ Never Decreases When You Add Predictors A structural defect of $R^2$ is that it is monotone nondecreasing in the number of predictors. Adding any regressor to an OLS model, even one filled with random noise, cannot increase SSE and therefore cannot decrease $R^2$. The reason is that the smaller model is nested inside the larger one. Least squares minimizes SSE over a larger parameter space when a column is added, and the minimum over a larger set cannot exceed the minimum over a subset. Setting the new coefficient to zero recovers the old fit, so the optimizer can only match or improve it. ```text add a column of pure noise to X SSE can only stay equal or shrink therefore R^2 can only stay equal or grow ``` The effect is not merely that noise cannot hurt; on any finite sample, noise predictors help by a predictable amount. If the $p$ added regressors are pure noise, independent of $y$, the expected increase in $R^2$ is approximately $p/(n-1)$. 
So a model with $20$ junk predictors and $40$ observations will, on average, report an $R^2$ near $0.5$ from nothing but chance alignment. More starkly, with $p$ predictors plus an intercept, the expected value of $R^2$ under a true null (no predictor relates to $y$) is roughly $p/(n-1)$, never zero. This is the quantitative core of the overfitting warning. The practical danger is overfitting. With $p$ predictors and $n$ observations, a model with $p = n - 1$ free parameters plus an intercept can interpolate the data, driving SSE to zero and $R^2$ to one, while predicting future data no better than random. $R^2$ rewards complexity regardless of whether that complexity reflects signal or noise. This is exactly the failure that adjusted $R^2$ tries to address. ## 4. Adjusted $R^2$ ### 4.1 Definition Adjusted $R^2$ penalizes the inclusion of predictors by replacing raw sums of squares with their degrees-of-freedom-corrected counterparts. With $n$ observations and $p$ predictors (excluding the intercept): $$ R^2_{\text{adj}} = 1 - \frac{\text{SSE} / (n - p - 1)}{\text{SST} / (n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} . $$ The numerator $\text{SSE}/(n - p - 1)$ is an unbiased estimate of the residual variance $\sigma^2$, and the denominator $\text{SST}/(n - 1)$ is the usual unbiased estimate of the variance of $y$. So adjusted $R^2$ can be read as one minus the ratio of two variance estimates. The quantity $n - p - 1$ is the residual degrees of freedom, the sample size minus the number of fitted parameters (the $p$ slopes plus one intercept), and it matches the trace argument from Section 2.3. ### 4.2 Behavior The correction factor $\frac{n-1}{n-p-1}$ exceeds one and grows as $p$ approaches $n$. Adding a predictor changes $R^2_{\text{adj}}$ in two competing ways: it can decrease the $(1 - R^2)$ term by reducing SSE, but it also increases the multiplier. 
A new predictor raises $R^2_{\text{adj}}$ only if it reduces SSE by more than the penalty for spending a degree of freedom.

This rule has a precise form. Adding one predictor raises $R^2_{\text{adj}}$ if and only if the partial $F$ statistic for that predictor exceeds one, which for a single coefficient is equivalent to its squared $t$ statistic exceeding one, that is $|t| > 1$. This is a far weaker bar than the conventional significance threshold of roughly $|t| > 2$. Adjusted $R^2$ will therefore happily retain predictors that are nowhere near statistically significant, which is one reason it is a mild rather than aggressive penalty.

Because the correction factor is at least one, $R^2_{\text{adj}} \le R^2$ always. $R^2_{\text{adj}}$ can also be negative: this happens whenever $R^2 < p/(n-1)$, that is, when the model fits worse than the constant predictor after accounting for the parameters spent. A negative adjusted $R^2$ is a clear signal that the predictors carry essentially no useful information.

```text
new predictor's |t| > 1  ->  adjusted R^2 rises
new predictor's |t| < 1  ->  adjusted R^2 falls
```

### 4.3 What Adjusted $R^2$ Does and Does Not Fix

Adjusted $R^2$ corrects the naive monotonicity, making it a more honest in-sample criterion for comparing models with different numbers of predictors. It is, however, a weak penalty compared with criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), and it is not designed to estimate out-of-sample performance. Roughly, adjusted $R^2$ keeps a predictor at the $|t|>1$ bar, AIC at about $|t|>1.4$, and BIC at about $|t| > \sqrt{\log n}$, a threshold that grows with the sample size, so for all but the smallest samples the three impose increasingly strict penalties on complexity.

For genuine generalization assessment, cross-validated error or a held-out test set remains the standard. Adjusted $R^2$ should be understood as a refinement of an in-sample descriptive statistic, not a substitute for predictive validation.
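
The definition in Section 4.1 and the $|t| > 1$ rule in Section 4.2 can be checked together. The sketch below assumes synthetic Gaussian data with one real predictor; `ols_summary` is a hypothetical helper written for this example, and the seed, sample size, and coefficients are arbitrary.

```python
import numpy as np

def ols_summary(X, y):
    """Fit OLS (X must include an intercept column).

    Returns R^2, adjusted R^2, and the t statistic of the last column's coefficient.
    """
    n, k = X.shape                          # k = p + 1 fitted parameters
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - k)) / (sst / (n - 1))
    sigma2 = sse / (n - k)                  # unbiased residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # estimated covariance of the coefficients
    t_last = float(beta[-1] / np.sqrt(cov[-1, -1]))
    return r2, r2_adj, t_last

rng = np.random.default_rng(42)
n = 60
x1 = rng.standard_normal(n)
y = 1.0 + 2.0 * x1 + rng.standard_normal(n)
base = np.column_stack([np.ones(n), x1])
r2_base, adj_base, _ = ols_summary(base, y)

# Add one candidate noise column at a time and compare the sign of the change
# in adjusted R^2 with whether |t| of the new coefficient exceeds one.
results = []
for _ in range(10):
    candidate = rng.standard_normal(n)
    r2_new, adj_new, t_new = ols_summary(np.column_stack([base, candidate]), y)
    results.append((abs(t_new), adj_new - adj_base, r2_new - r2_base))
    print(f"|t| = {abs(t_new):.2f}   change in adjusted R^2 = {adj_new - adj_base:+.5f}")
```

Each printed line should show a positive change exactly when $|t| > 1$, while the raw $R^2$ change stored in `results` is never negative.
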
### 4.4 Predictive $R^2$ and Cross-Validated $R^2$

Two further variants close the gap toward generalization while keeping the $R^2$ scale. The predicted residual sum of squares (PRESS) replaces each residual with its leave-one-out counterpart $y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the prediction for observation $i$ from a model fit without that observation. For OLS this has the closed form $y_i - \hat{y}_{(i)} = e_i / (1 - h_{ii})$, where $h_{ii}$ is the $i$th diagonal of the hat matrix, so PRESS requires no refitting beyond the single original fit. The predictive $R^2$ is then

$$
R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{\text{SST}}, \qquad \text{PRESS} = \sum_i \left( \frac{e_i}{1 - h_{ii}} \right)^2 .
$$

A $k$-fold cross-validated $R^2$ generalizes the same idea to arbitrary models by averaging held-out squared error across folds. Both can be negative, both penalize overfitting directly rather than through a degrees-of-freedom proxy, and both are preferable to adjusted $R^2$ when the question is how the model will behave on new data.

## 5. Why a High $R^2$ Is Not Always Good

### 5.1 It Says Nothing About Causation or Correctness

A high $R^2$ measures linear association between fit and response. It does not certify that the model is correctly specified, that the relationship is causal, or that the estimated relationship is free of confounding. A regression of ice cream sales on drowning deaths can show a high $R^2$ driven entirely by the lurking variable of summer temperature. The number is a measure of fit, not of truth.

### 5.2 Anscombe's Quartet and Hidden Misspecification

Anscombe's quartet is a celebrated set of four datasets that share nearly identical means, variances, regression lines, and $R^2$ values of about $0.67$, yet look entirely different when plotted. One is genuinely linear, one is curved, one is linear but for a single outlier, and one is dominated by a single high-leverage point.
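
The shared fit statistic is easy to reproduce. The sketch below uses the quartet values as transcribed from Anscombe (1973); the transcription and the helper `simple_r2` are assumptions of this example and should be checked against the published table before reuse.

```python
import numpy as np

# Anscombe's quartet; datasets I-III share the same x values.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                           6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                            6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                         5.25, 12.50, 5.56, 7.91, 6.89])),
}

def simple_r2(x, y):
    """R^2 of the least-squares line of y on x, in the error form 1 - SSE/SST."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

r2_values = {name: simple_r2(x, y) for name, (x, y) in quartet.items()}
for name, r2 in r2_values.items():
    print(f"dataset {name}: R^2 = {r2:.3f}")
```

All four values should come out at about $0.67$, even though a scatterplot of each dataset tells a different story.
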
The lesson is that $R^2$ cannot detect nonlinearity, outliers, or leverage. A high $R^2$ accompanied by a patterned residual plot indicates a misspecified model, regardless of how impressive the headline number looks. The Datasaurus Dozen extends the same warning to twelve dramatically different scatterplots that share nearly identical summary statistics, reinforcing that no scalar can substitute for plotting the data.

### 5.3 Spurious Regression in Time Series

When two independent nonstationary time series, such as random walks, are regressed on each other, the resulting $R^2$ is frequently large even though there is no relationship whatsoever. This phenomenon, known as spurious regression, does not break the sum of squares decomposition, which is purely algebraic. It arises because the usual asymptotic theory behind regression inference assumes stationarity, so with trending data the conventional interpretation of $R^2$ and the $t$ statistics fails. With such data, a high $R^2$ can be entirely an artifact of shared trends. Differencing the series, modeling the dynamics explicitly, or testing for cointegration is required before any $R^2$ from levels can be trusted.

### 5.4 The Scale and Context Dependence of "High"

There is no universal threshold separating a good $R^2$ from a bad one. In tightly controlled physical experiments, an $R^2$ below $0.99$ may indicate a problem. In cross-sectional social science or financial return modeling, an $R^2$ of $0.10$ can represent a genuine and valuable finding. The expected magnitude depends on the noise inherent in the domain. Judging a model by an absolute $R^2$ cutoff ignores this, and chasing a higher number can push a modeler toward overfitting or toward discarding a correct but low-signal model.

### 5.5 Low $R^2$ Is Not Always Bad

The mirror image of the previous point deserves its own statement. A correctly specified model in a high-noise environment will have a low $R^2$ and still produce unbiased coefficient estimates, valid inference, and useful predictions of the conditional mean.
If the goal is to estimate the effect of a predictor rather than to predict individual outcomes, the standard errors and coefficient estimates matter, and $R^2$ may be almost irrelevant.

### 5.6 Out-of-Sample $R^2$ Can Be Negative

When $R^2$ is computed on data not used for fitting, using

$$
R^2_{\text{oos}} = 1 - \frac{\sum_{i \in \text{test}} (y_i - \hat{y}_i)^2}{\sum_{i \in \text{test}} (y_i - \bar{y}_{\text{train}})^2},
$$

it can fall below zero. A negative out-of-sample $R^2$ means the model predicts worse on new data than simply using the training mean. This is one of the most informative diagnostics available, because the in-sample $R^2$ can never reveal it. A model with high in-sample $R^2$ and negative out-of-sample $R^2$ has overfit.

Note the deliberate choice of baseline: the denominator centers on the training mean $\bar{y}_{\text{train}}$, not the test mean, because the test mean would not be known at prediction time. Using the test mean instead defines a different quantity, and because the test mean minimizes the denominator, that version is never larger than the training-mean version. Report which baseline you used.

### 5.7 $R^2$ Outside Linear Regression

The plain coefficient of determination is built for continuous responses fit by least squares. For other model families the construction must be adapted, and the adaptations are not interchangeable. For logistic regression and other generalized linear models, McFadden's pseudo-$R^2$, defined as $1 - \log L_{\text{full}} / \log L_{\text{null}}$ from the fitted and null log-likelihoods, is a common analogue, but its numerical range is not comparable to ordinary $R^2$ and values around $0.2$ to $0.4$ already indicate a strong fit. Nagelkerke and Cox-Snell offer alternative scalings. The practical rule is to never compare a pseudo-$R^2$ from one model family against an ordinary $R^2$ from another, because they measure different things on different scales.

## 6. Practical Guidance

Report $R^2$ alongside, not instead of, residual diagnostics and out-of-sample error.
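
As one hedged illustration of reporting these quantities side by side, the sketch below computes in-sample $R^2$, the PRESS-based predictive $R^2$ of Section 4.4, and the out-of-sample $R^2$ of Section 5.6 with the training-mean baseline. It assumes synthetic Gaussian data in which only one of fifteen predictors matters; the helper names, seed, and sample sizes are hypothetical choices for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients; X must include an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predictive_r2(X, y):
    """1 - PRESS/SST using the leave-one-out shortcut e_i / (1 - h_ii)."""
    resid = y - X @ fit_ols(X, y)
    # Diagonal of the hat matrix H = X (X'X)^{-1} X'.
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    press = float(((resid / (1.0 - h)) ** 2).sum())
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - press / sst

def out_of_sample_r2(y_test, y_pred, baseline_mean):
    """Out-of-sample R^2 relative to a constant baseline prediction."""
    sse = float(((y_test - y_pred) ** 2).sum())
    sst = float(((y_test - baseline_mean) ** 2).sum())
    return 1.0 - sse / sst

rng = np.random.default_rng(7)
n_train, n_test, p = 30, 200, 15
n_all = n_train + n_test
X_all = np.column_stack([np.ones(n_all), rng.standard_normal((n_all, p))])
y_all = 1.0 + 0.5 * X_all[:, 1] + rng.standard_normal(n_all)  # one real predictor
X_tr, X_te = X_all[:n_train], X_all[n_train:]
y_tr, y_te = y_all[:n_train], y_all[n_train:]

beta = fit_ols(X_tr, y_tr)
resid = y_tr - X_tr @ beta
r2_in = 1.0 - float(resid @ resid) / float(((y_tr - y_tr.mean()) ** 2).sum())
r2_pred = predictive_r2(X_tr, y_tr)
r2_oos = out_of_sample_r2(y_te, X_te @ beta, y_tr.mean())
r2_oos_test_mean = out_of_sample_r2(y_te, X_te @ beta, y_te.mean())

print(f"in-sample R^2:                    {r2_in:.3f}")
print(f"predictive R^2 (PRESS):           {r2_pred:.3f}")
print(f"out-of-sample R^2 (train mean):   {r2_oos:.3f}")
print(f"out-of-sample R^2 (test mean):    {r2_oos_test_mean:.3f}")
```

The predictive $R^2$ can never exceed the in-sample value, because each leave-one-out residual is at least as large in magnitude as the ordinary residual, and the test-mean variant can never exceed the training-mean variant.
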
Treat the squared correlation between $y$ and $\hat{y}$, and the error form $1 - \text{SSE}/\text{SST}$, as the most durable interpretations. When comparing models of differing complexity, prefer adjusted $R^2$ over raw $R^2$, and prefer cross-validation, predictive $R^2$, or information criteria over both when the goal is generalization. Always inspect residual plots, because Anscombe's quartet demonstrates that no scalar summary can replace them. Be especially skeptical of high $R^2$ values from time series in levels, from models with many predictors relative to observations, and from any setting where the predictors might be downstream of confounders. The coefficient of determination is a useful first glance at fit, and a dangerous final word on model quality.

For practitioners, every quantity in this chapter is available in mature open-source tooling. In Python, `numpy` and `pandas` compute the sums of squares directly, `scikit-learn` provides `r2_score` (which uses the error form and can return negative values) along with cross-validation utilities, and `statsmodels` reports both $R^2$ and adjusted $R^2$ in its OLS summary and exposes PRESS-type leave-one-out residuals through its influence diagnostics. In R, the base `lm` summary reports both, and packages such as `caret` and `boot` handle cross-validated variants. None of this requires proprietary software.

## References

1. Anscombe, F. J. (1973). Graphs in Statistical Analysis. *The American Statistician*, 27(1), 17-21. https://doi.org/10.1080/00031305.1973.10478966
2. Draper, N. R., and Smith, H. (1998). *Applied Regression Analysis*, 3rd ed. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/9781118625590
3. Granger, C. W. J., and Newbold, P. (1974). Spurious Regressions in Econometrics. *Journal of Econometrics*, 2(2), 111-120. https://doi.org/10.1016/0304-4076(74)90034-7
4. James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). *An Introduction to Statistical Learning*, 2nd ed. Springer. https://doi.org/10.1007/978-1-0716-1418-1
5. Kvalseth, T. O. (1985). Cautionary Note about $R^2$. *The American Statistician*, 39(4), 279-285. https://doi.org/10.1080/00031305.1985.10479448
6. McFadden, D. (1974). Conditional Logit Analysis of Qualitative Choice Behavior. In P. Zarembka (Ed.), *Frontiers in Econometrics* (pp. 105-142). Academic Press.
7. Nagelkerke, N. J. D. (1991). A Note on a General Definition of the Coefficient of Determination. *Biometrika*, 78(3), 691-692. https://doi.org/10.1093/biomet/78.3.691
8. Allen, D. M. (1974). The Relationship Between Variable Selection and Data Augmentation and a Method for Prediction. *Technometrics*, 16(1), 125-127. https://doi.org/10.1080/00401706.1974.10489157

| $i$ | $y_i$ | $\hat{y}_i$ | $y_i - \bar{y}$ | $\hat{y}_i - \bar{y}$ | $y_i - \hat{y}_i$ |
|---|---|---|---|---|---|
| 1 | 2 | 2.2 | $-4$ | $-3.8$ | $-0.2$ |
| 2 | 4 | 3.6 | $-2$ | $-2.4$ | $0.4$ |
| 3 | 5 | 5.0 | $-1$ | $-1.0$ | $0.0$ |
| 4 | 7 | 6.4 | $1$ | $0.4$ | $0.6$ |
| 5 | 12 | 12.8 | $6$ | $6.8$ | $-0.8$ |
