162 Regression Metrics: MSE, RMSE, and MAE
Regression models predict continuous quantities, and the quality of those predictions is summarized through scalar error metrics. Three metrics dominate practice: mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). They look superficially similar, yet they encode different assumptions about what counts as a serious mistake. Choosing among them is not a stylistic preference. The choice determines which model wins a comparison, which predictions a fitted model favors, and how robust the resulting estimator is to contaminated data. This chapter develops the definitions rigorously, examines the units and statistical interpretation of each metric, analyzes their differing sensitivity to outliers, and introduces the Huber loss as a principled compromise.
162.1 1. Notation and Setup
Consider a dataset of \(n\) observations with targets \(y_1, \dots, y_n \in \mathbb{R}\) and corresponding model predictions \(\hat{y}_1, \dots, \hat{y}_n\). The per-observation residual is
\[ r_i = y_i - \hat{y}_i . \]
A regression metric is a function that maps the vector of residuals to a single non-negative number, where zero indicates perfect prediction. Two roles must be kept distinct. A metric can serve as an evaluation criterion, computed on a held-out test set to report how well a fixed model performs. It can also serve as a training loss, the objective minimized during fitting. The same algebraic form often fills both roles, but the consequences differ, and we will be careful to say which role is in play.
Throughout, we treat the predictions as fixed when computing a test metric and as functions of model parameters when discussing optimization.
162.2 2. Mean Absolute Error
162.2.1 2.1 Definition and Units
The mean absolute error is the average magnitude of the residuals:
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert . \]
MAE inherits the units of the target variable. If \(y\) is measured in dollars, MAE is reported in dollars, and a value of \(42\) means that predictions miss by forty-two dollars on average, in absolute terms. This direct interpretability is one of the strongest practical arguments for MAE. A stakeholder with no statistical training can read the number and understand it.
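The definition translates directly into a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `mean_absolute_error` and the dollar figures are illustrative assumptions, not values from any dataset discussed here.

```python
def mean_absolute_error(y_true, y_pred):
    """Average magnitude of the residuals y_i - y_hat_i."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions, both measured in dollars.
prices = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 230.0, 240.0]

# Absolute residuals are 10, 10, 30, 10, so the average is 60 / 4.
mae = mean_absolute_error(prices, predicted)
print(f"MAE = {mae:.2f} dollars")  # MAE = 15.00 dollars
```

The result carries the same unit as the inputs, which is why it can be reported without further transformation.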
162.2.2 2.2 The Optimal Constant Predictor
A clarifying exercise is to ask which constant \(c\) minimizes a metric when we predict \(\hat{y}_i = c\) for all \(i\). For MAE the objective is \(\frac{1}{n}\sum_i \lvert y_i - c \rvert\). The objective has a kink at each \(c = y_i\); away from those points its derivative with respect to \(c\) is
\[ \frac{\partial}{\partial c} \sum_{i=1}^{n} \lvert y_i - c \rvert = \sum_{i=1}^{n} \operatorname{sign}(c - y_i), \]
which vanishes when the number of targets above \(c\) equals the number below. The minimizer is therefore the median of the targets (for even \(n\), any value between the two middle targets attains the minimum). More generally, fitting a model under absolute loss produces predictions of the conditional median of \(y\) given the features. This is the foundation of median regression and, with asymmetric weights, of quantile regression.
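The claim that the median is the optimal constant can be checked numerically. The sketch below scans a grid of candidate constants for a small, made-up target vector with one extreme value; the helper `mae_of_constant` and the grid spacing are assumptions chosen for illustration.

```python
import statistics


def mae_of_constant(targets, c):
    """MAE obtained by predicting the same constant c for every target."""
    return sum(abs(y - c) for y in targets) / len(targets)


# Hypothetical targets with one extreme value.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]
median = statistics.median(targets)
mean = statistics.mean(targets)

# Candidate constants 0.0, 0.5, 1.0, ..., 100.0.
candidates = [k * 0.5 for k in range(0, 201)]
best = min(candidates, key=lambda c: mae_of_constant(targets, c))

print(f"grid minimizer = {best}, median = {median}, mean = {mean}")
print(f"MAE at median = {mae_of_constant(targets, median):.2f}")
print(f"MAE at mean   = {mae_of_constant(targets, mean):.2f}")
```

On this data the grid search lands on the median, and the mean, dragged upward by the extreme value, gives a noticeably worse MAE.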
162.2.3 2.3 Robustness
Because each residual enters linearly, a single grossly wrong observation contributes in proportion to its error, not its square. MAE is therefore far less sensitive to extreme points than squared-error metrics: a few large residuals shift the score, but they do not dominate it the way they dominate a sum of squares. This robustness is the central reason to prefer MAE when the data contain anomalies that you do not want the model to chase.
162.3 3. Mean Squared Error
162.3.1 3.1 Definition and Units
The mean squared error averages the squared residuals:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 . \]
Squaring has two immediate effects. First, the metric is expressed in the squared units of the target. If \(y\) is in dollars, MSE is in dollars squared, a quantity with no intuitive meaning. Second, squaring is convex and smooth, so MSE is differentiable everywhere, which makes it convenient for gradient-based optimization.
162.3.2 3.2 The Optimal Constant Predictor
Minimizing \(\frac{1}{n}\sum_i (y_i - c)^2\) over the constant \(c\) gives a clean closed form. Setting the derivative to zero,
\[ \frac{\partial}{\partial c} \sum_{i=1}^{n} (y_i - c)^2 = -2 \sum_{i=1}^{n} (y_i - c) = 0 \quad\Longrightarrow\quad c = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}. \]
The minimizer is the mean. Fitting a model under squared loss produces predictions of the conditional mean of \(y\) given the features, which connects squared error to least squares estimation and, through the Gaussian likelihood, to maximum likelihood estimation under additive normal noise.
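A quick numerical check makes the result concrete. Expanding the square gives the identity \(\frac{1}{n}\sum_i (y_i - c)^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2 + (c - \bar{y})^2\), so any constant other than the mean pays an excess of exactly \((c - \bar{y})^2\). The sketch below verifies this on made-up targets; the helper `mse_of_constant` is an illustrative name.

```python
import statistics


def mse_of_constant(targets, c):
    """MSE obtained by predicting the same constant c for every target."""
    return sum((y - c) ** 2 for y in targets) / len(targets)


# Hypothetical targets whose mean is 5.0.
targets = [2.0, 4.0, 4.0, 6.0, 9.0]
y_bar = statistics.mean(targets)
mse_at_mean = mse_of_constant(targets, y_bar)

for c in [3.0, 4.0, 5.0, 6.5]:
    excess = mse_of_constant(targets, c) - mse_at_mean
    # The excess over the minimum should match (c - mean)^2.
    print(f"c = {c:4.1f}  excess MSE = {excess:5.2f}  (c - mean)^2 = {(c - y_bar) ** 2:5.2f}")
```

The two printed columns agree for every candidate, and the excess is zero only at \(c = \bar{y}\).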
162.3.3 3.3 Sensitivity to Large Errors
A residual of magnitude \(10\) contributes \(100\) to the sum, while a residual of magnitude \(1\) contributes only \(1\). The quadratic weighting means large errors dominate the metric. During training, the optimizer will sacrifice accuracy on many easy points to reduce one large residual, because shrinking a big error yields an outsized reduction in the loss. This is exactly the behavior you want when large errors are disproportionately costly, and exactly the behavior you want to avoid when large residuals are corrupt measurements.
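To see how strongly one large residual dominates, it helps to compute its share of the total loss under each weighting. The residuals below are invented for illustration: nine of magnitude one and a single residual of magnitude ten.

```python
# Hypothetical residuals: nine small errors and one large one.
residuals = [1.0] * 9 + [10.0]

abs_total = sum(abs(r) for r in residuals)  # 9 * 1 + 10 = 19
sq_total = sum(r * r for r in residuals)    # 9 * 1 + 100 = 109

largest = max(residuals, key=abs)
share_abs = abs(largest) / abs_total        # share of the absolute-error sum
share_sq = largest ** 2 / sq_total          # share of the squared-error sum

print(f"share of absolute loss: {share_abs:.1%}")
print(f"share of squared loss:  {share_sq:.1%}")
```

One observation out of ten accounts for roughly half of the absolute loss but more than nine tenths of the squared loss, which is why a squared-loss optimizer concentrates its effort there.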
162.4 4. Root Mean Squared Error
162.4.1 4.1 Definition and Units
The root mean squared error is the square root of MSE:
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \]
Taking the square root restores the original units of the target. RMSE is therefore reported in dollars rather than dollars squared, recovering the interpretability that MSE lost. RMSE is the quadratic mean, or root mean square, of the residuals, and it can be read as a typical error magnitude under quadratic weighting.
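The unit change is easy to see in code. This is a minimal standard-library sketch; the function names and the dollar figures are illustrative assumptions.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals (units: target units squared)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of MSE (units: same as the target)."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical targets and predictions, both measured in dollars.
prices = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 230.0, 240.0]

mse = mean_squared_error(prices, predicted)        # (100 + 100 + 900 + 100) / 4
rmse = root_mean_squared_error(prices, predicted)

print(f"MSE  = {mse:.2f} dollars squared")  # MSE  = 300.00 dollars squared
print(f"RMSE = {rmse:.2f} dollars")         # RMSE = 17.32 dollars
```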
162.4.2 4.2 Relationship to MSE and MAE
Because the square root is monotonic, RMSE and MSE always rank models identically. Any model with lower MSE also has lower RMSE, so for the purpose of model selection they are interchangeable, and the choice between them is purely about reporting units. RMSE is preferred for communication; MSE is often preferred inside optimizers because its gradient is simpler and avoids the square root.
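RMSE and MAE relate through a standard inequality. The lower bound follows from the power mean or Cauchy-Schwarz inequality, and the upper bound from the fact that a sum of squares cannot exceed the square of the sum of magnitudes: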
\[ \text{MAE} \le \text{RMSE} \le \sqrt{n} \cdot \text{MAE}, \]
with equality on the left when all residual magnitudes are identical and on the right when at most one residual is nonzero. The gap between MAE and RMSE is itself diagnostic. When RMSE substantially exceeds MAE, the residual distribution has heavy tails or a few large errors, because RMSE inflates whenever the spread of error magnitudes is large. A ratio \(\text{RMSE}/\text{MAE}\) near one indicates uniform error sizes.
residuals: [1, 1, 1, 1, 6]
MAE = (1+1+1+1+6)/5 = 2.0
RMSE = sqrt((1+1+1+1+36)/5) = sqrt(8.0) ≈ 2.83
The single large residual pulls RMSE well above MAE.
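The worked numbers above, and the two bounds of the inequality, can be reproduced with a short script. The residual vectors other than the worked example are made up to hit the two equality cases; `mae` and `rmse` here operate directly on residuals rather than on targets and predictions.

```python
import math


def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)


def rmse(residuals):
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


cases = {
    "worked example": [1, 1, 1, 1, 6],
    "equal magnitudes": [3, -3, 3, -3],    # lower bound is attained
    "one nonzero residual": [0, 0, 0, 5],  # upper bound is attained
}

for name, res in cases.items():
    m, q = mae(res), rmse(res)
    upper = math.sqrt(len(res)) * m
    # MAE <= RMSE <= sqrt(n) * MAE should hold in every case.
    print(f"{name:22s} MAE = {m:.3f}  RMSE = {q:.3f}  sqrt(n)*MAE = {upper:.3f}  ratio = {q / m:.3f}")
```

The ratio column is the diagnostic discussed above: it equals one when all magnitudes agree and grows toward \(\sqrt{n}\) as the error concentrates in a single observation.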
162.5 5. Sensitivity to Outliers
162.5.1 5.1 Influence of a Single Point
The differing outlier behavior of these metrics can be made precise by asking how the metric changes as one residual \(r\) grows. Under absolute loss the contribution is \(\lvert r \rvert\), so its derivative with respect to \(r\) has constant magnitude one for every \(r \neq 0\). The influence of a point is bounded no matter how far it lies from the prediction. Under squared loss the contribution is \(r^2\), whose derivative \(2r\) grows without bound. A single outlier exerts unbounded influence on a fit that minimizes MSE or RMSE.
In the language of robust statistics, the influence function of squared loss is unbounded while that of absolute loss is bounded. The breakdown point of the mean, the estimator implied by squared loss, is zero in the large-sample limit, because a single observation sent to infinity drags the mean to infinity. The median, implied by absolute loss, has a breakdown point of one half, the highest possible. Just under half the data can be arbitrarily corrupted before the median becomes arbitrary.
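The breakdown behavior can be observed directly by corrupting a growing number of observations. The sketch below uses five invented clean values and replaces the last \(k\) of them with a gross outlier; the values and the outlier size are arbitrary choices for illustration.

```python
import statistics

# Hypothetical clean sample and an arbitrarily large corrupt value.
clean = [10.0, 11.0, 12.0, 13.0, 14.0]
outlier = 1e6

results = {}
for k in range(0, 4):
    # Replace the last k observations with the outlier.
    corrupted = clean[: len(clean) - k] + [outlier] * k
    results[k] = (statistics.mean(corrupted), statistics.median(corrupted))
    print(f"corrupted points = {k}  mean = {results[k][0]:12.1f}  median = {results[k][1]:12.1f}")
```

The mean is ruined by the first corrupted point, while the median holds at 12 until the corrupted points form a majority of the sample (three of five).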
162.5.2 5.2 Practical Consequences
These properties translate into concrete modeling decisions.
Use squared loss, hence MSE or RMSE, when large errors are genuinely more harmful than small ones and the data are clean. Forecasting electricity demand, where a large shortfall causes a blackout, is a setting where penalizing big misses quadratically reflects real costs. Squared loss is also the natural choice when the noise is approximately Gaussian, since minimizing it then coincides with maximum likelihood.
Use absolute loss, hence MAE, when the data contain outliers you do not trust, when you care about typical performance rather than worst-case performance, or when the target distribution is skewed and you want to estimate a median rather than a mean. Sensor data with occasional spurious spikes, or financial data with rare extreme events that should not steer the model, are typical cases.
A subtle point is that a model trained to minimize one loss will not generally minimize another. A model fit by least squares is tuned to the conditional mean and may report a poor MAE, while a model fit by least absolute deviations is tuned to the conditional median and may report a poor RMSE. When possible, train and evaluate under the same metric, or at least understand the mismatch you are accepting.
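The mismatch is visible even for the simplest possible models. In the sketch below, two constant predictors stand in for a least squares fit (the mean) and a least absolute deviations fit (the median) on a small, invented, skewed target vector; real models would show the same pattern conditionally on the features.

```python
import math
import statistics


def mae(targets, c):
    return sum(abs(y - c) for y in targets) / len(targets)


def rmse(targets, c):
    return math.sqrt(sum((y - c) ** 2 for y in targets) / len(targets))


# Hypothetical skewed targets.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]

predictors = {
    "mean (squared-loss fit)": statistics.mean(targets),
    "median (absolute-loss fit)": statistics.median(targets),
}

scores = {}
for name, c in predictors.items():
    scores[name] = (mae(targets, c), rmse(targets, c))
    print(f"{name:28s} prediction = {c:5.1f}  MAE = {scores[name][0]:6.2f}  RMSE = {scores[name][1]:6.2f}")
```

Each predictor wins on the metric it was fit to and loses on the other, so a leaderboard's winner depends on which metric is reported.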
162.6 6. The Huber Loss as a Compromise
162.6.1 6.1 Motivation
Squared loss is smooth and efficient when the noise is light-tailed, but it is fragile against outliers. Absolute loss is robust, but it is not differentiable at zero, and its gradient has the same magnitude for every nonzero residual, so the optimizer receives no signal that a moderately larger error deserves a larger correction. The Huber loss, introduced by Peter Huber in 1964, interpolates between the two. It behaves quadratically for small residuals, capturing the efficiency of squared loss, and linearly for large residuals, capturing the robustness of absolute loss.
162.6.2 6.2 Definition
With a threshold parameter \(\delta > 0\), the Huber loss for a residual \(r = y - \hat{y}\) is
\[ L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } \lvert r \rvert \le \delta, \\[4pt] \delta \left( \lvert r \rvert - \tfrac{1}{2}\delta \right) & \text{if } \lvert r \rvert > \delta. \end{cases} \]
The two pieces are constructed to meet smoothly. At \(\lvert r \rvert = \delta\) both pieces equal \(\tfrac{1}{2}\delta^2\), so the function is continuous, and their first derivatives both equal \(\delta\) in magnitude, so the function is continuously differentiable. Expanding the linear branch gives \(\delta \lvert r \rvert - \tfrac{1}{2}\delta^2\), and the constant \(-\tfrac{1}{2}\delta^2\) is precisely what enforces this matching.
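The matching conditions can be confirmed numerically. The sketch below implements the piecewise definition and compares the two branches, and their one-sided slopes, at the threshold; the function name `huber_loss`, the choice \(\delta = 1.5\), and the step size `eps` are illustrative assumptions.

```python
def huber_loss(r, delta):
    """Huber loss for a single residual r with threshold delta > 0."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    a = abs(r)
    if a <= delta:
        return 0.5 * a * a            # quadratic branch
    return delta * (a - 0.5 * delta)  # linear branch


delta = 1.5
eps = 1e-6

# Value of each branch formula evaluated exactly at |r| = delta.
quadratic_value = 0.5 * delta ** 2
linear_value = delta * (delta - 0.5 * delta)

# One-sided finite-difference slopes just below and just above the threshold.
left_slope = (huber_loss(delta, delta) - huber_loss(delta - eps, delta)) / eps
right_slope = (huber_loss(delta + eps, delta) - huber_loss(delta, delta)) / eps

print(f"branch values at threshold: {quadratic_value} and {linear_value}")
print(f"slopes around threshold:    {left_slope:.6f} and {right_slope:.6f}")
```

Both branch values coincide and both slopes approach \(\delta\), consistent with a continuously differentiable join.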
162.6.3 6.3 Gradient and Interpretation
The derivative with respect to the residual is the clipped, or winsorized, residual:
\[ \frac{\partial L_\delta}{\partial r} = \begin{cases} r & \text{if } \lvert r \rvert \le \delta, \\[4pt] \delta \cdot \operatorname{sign}(r) & \text{if } \lvert r \rvert > \delta. \end{cases} \]
This is the heart of the method. For small residuals the gradient is \(r\), identical to squared loss, so well-behaved points are fit efficiently. For large residuals the gradient saturates at \(\pm \delta\), so an outlier can never push the optimizer harder than a point exactly at the threshold. The influence of any single observation is bounded by \(\delta\), which is what gives Huber regression its robustness while retaining smooth, gradient-friendly behavior everywhere.
delta = 1.5
residual r = 0.5 -> quadratic branch, loss = 0.5*0.25 = 0.125
residual r = 5.0 -> linear branch, loss = 1.5*(5 - 0.75) = 6.375
A pure squared loss would charge 0.5*25 = 12.5 for r = 5.0.
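The clipped-residual form of the gradient can be written in one line and checked against a finite-difference approximation of the loss. This is a sketch under the same assumptions as the worked numbers above (\(\delta = 1.5\)); the names `huber_loss` and `huber_grad` and the step size `h` are illustrative.

```python
def huber_loss(r, delta):
    """Huber loss for a single residual r with threshold delta > 0."""
    a = abs(r)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def huber_grad(r, delta):
    """Derivative of the Huber loss with respect to r: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))


delta = 1.5
h = 1e-6
checks = []
for r in [-5.0, -1.5, -0.5, 0.0, 0.5, 1.5, 5.0]:
    # Central finite difference as an independent check on the analytic gradient.
    numeric = (huber_loss(r + h, delta) - huber_loss(r - h, delta)) / (2 * h)
    checks.append((r, huber_grad(r, delta), numeric))
    print(f"r = {r:5.1f}  clipped gradient = {huber_grad(r, delta):5.2f}  finite difference = {numeric:8.5f}")
```

The gradient tracks \(r\) inside the threshold and stays pinned at \(\pm 1.5\) outside it, so the residual of 5.0 pulls no harder than one of 1.5.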
162.6.4 6.4 Choosing the Threshold
The parameter \(\delta\) sets the boundary between what counts as an ordinary residual and what counts as an outlier. As \(\delta \to \infty\) the quadratic branch covers all residuals and Huber loss reduces to squared loss. As \(\delta \to 0\) the linear branch dominates and, after rescaling, it approaches absolute loss. Intermediate values blend the two regimes.
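Both limits can be illustrated with extreme threshold values. In the sketch below, a very large \(\delta\) reproduces the half squared loss \(\tfrac{1}{2}r^2\), and a very small \(\delta\), after dividing the loss by \(\delta\), reproduces \(\lvert r \rvert\) up to an error of at most \(\tfrac{1}{2}\delta\). The residual values and the two thresholds are arbitrary illustrative choices.

```python
def huber_loss(r, delta):
    """Huber loss for a single residual r with threshold delta > 0."""
    a = abs(r)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


residuals = [-4.0, -0.5, 0.0, 2.0, 7.0]
large_delta = 1e6   # every residual falls in the quadratic branch
small_delta = 1e-6  # every nonzero residual falls in the linear branch

for r in residuals:
    as_squared = huber_loss(r, large_delta)
    as_absolute = huber_loss(r, small_delta) / small_delta  # rescale by 1 / delta
    print(f"r = {r:5.1f}  large-delta loss = {as_squared:6.3f} (0.5 r^2 = {0.5 * r * r:6.3f})  "
          f"rescaled small-delta loss = {as_absolute:.6f} (|r| = {abs(r):.1f})")
```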
A common default scales \(\delta\) to the noise level of the residuals, for instance setting it to a robust estimate of their standard deviation. A widely cited choice is \(\delta = 1.345 \hat{\sigma}\), which yields roughly 95 percent of the statistical efficiency of least squares when the noise is truly Gaussian, while still bounding the influence of gross outliers. Cross-validation over a small grid of \(\delta\) values is a reliable practical alternative when no good noise estimate is available.
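One way to obtain a robust \(\hat{\sigma}\) is the median absolute deviation (MAD) multiplied by 1.4826, the factor that makes it consistent for the standard deviation under Gaussian noise. The text above does not prescribe a particular estimator, so the MAD is an assumption of this sketch, as are the function name `huber_delta_from_residuals` and the residual values.

```python
import statistics


def huber_delta_from_residuals(residuals, efficiency_constant=1.345):
    """Return (delta, sigma_hat) with sigma_hat = 1.4826 * MAD (an assumed estimator)."""
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    sigma_hat = 1.4826 * mad  # MAD rescaled to estimate a Gaussian standard deviation
    return efficiency_constant * sigma_hat, sigma_hat


# Hypothetical residuals from a preliminary fit, with one gross outlier.
residuals = [-1.2, 0.4, -0.3, 0.9, 0.1, -0.7, 15.0]

delta, sigma_hat = huber_delta_from_residuals(residuals)
classical_sd = statistics.stdev(residuals)

print(f"robust sigma estimate = {sigma_hat:.3f}, delta = {delta:.3f}")
print(f"classical standard deviation = {classical_sd:.3f}")
```

The classical standard deviation is inflated by the outlier, which would push the threshold high enough to treat the outlier as ordinary; the MAD-based scale is not affected in the same way.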
162.7 7. Choosing a Metric in Practice
The decision can be reduced to a few questions. Are large errors disproportionately costly in the real problem, and are the data clean? If so, prefer RMSE for reporting and squared loss for training, since the quadratic penalty matches the cost structure. Do the data contain outliers or corrupt measurements that the model should not chase, or is a typical-case summary what stakeholders want? If so, prefer MAE and absolute loss, accepting that you are estimating a median. Do you want the efficiency of squared loss on the bulk of the data while remaining protected against a minority of extreme points? If so, train with Huber loss and tune \(\delta\) to the noise scale.
For reporting, RMSE and MAE are frequently presented together precisely because their ratio reveals the shape of the error distribution. A large gap signals heavy tails worth investigating before trusting any single number. No metric is universally correct; each is the right answer to a specific question about which errors matter and how much.
162.8 8. Summary
MSE penalizes residuals quadratically, is minimized by the conditional mean, lives in squared units, and is highly sensitive to outliers. RMSE is its square root, restoring the original units and a typical error interpretation while ranking models identically to MSE. MAE penalizes residuals linearly, is minimized by the conditional median, lives in the original units, and is robust to outliers. The Huber loss combines the quadratic behavior of MSE near zero with the linear behavior of MAE in the tails, governed by a threshold \(\delta\), yielding an estimator that is both efficient and robust. Sound practice begins by deciding which errors matter, then selecting the metric whose mathematics encodes that judgment.
162.9 References
- Huber, P. J. (1964). Robust Estimation of a Location Parameter. Annals of Mathematical Statistics, 35(1), 73-101. https://doi.org/10.1214/aoms/1177703732
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. https://hastie.su.domains/ElemStatLearn/
- Willmott, C. J., and Matsuura, K. (2005). Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30(1), 79-82. https://doi.org/10.3354/cr030079
- Chai, T., and Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)? Geoscientific Model Development, 7(3), 1247-1250. https://doi.org/10.5194/gmd-7-1247-2014
- scikit-learn developers. Metrics and scoring: regression metrics. https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
- Koenker, R., and Hallock, K. F. (2001). Quantile Regression. Journal of Economic Perspectives, 15(4), 143-156. https://doi.org/10.1257/jep.15.4.143