164 MAPE and Beyond: Percentage Errors and Probabilistic Forecast Scoring
Regression and forecasting systems are judged by error metrics, and the choice of metric quietly encodes the loss function the business actually cares about. Absolute and squared errors live in the units of the target, which makes them hard to compare across series with different scales. Percentage errors promise a scale-free alternative, and the Mean Absolute Percentage Error (MAPE) became the default in demand planning, finance, and energy forecasting for exactly this reason. This chapter develops the percentage error family rigorously, exposes the pathologies of MAPE, then moves to scaled errors (MASE) and to proper scoring rules for probabilistic forecasts through the quantile (pinball) loss. The goal is a practitioner who can pick a metric on purpose rather than by habit.
164.1 1. Percentage Errors and MAPE
164.1.1 1.1 Definition
Let \(y_t\) denote the actual value at time \(t\) and \(\hat{y}_t\) the forecast, for \(t = 1, \dots, n\). The absolute percentage error is
\[ \text{APE}_t = \left| \frac{y_t - \hat{y}_t}{y_t} \right|, \]
and the Mean Absolute Percentage Error averages it:
\[ \text{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|. \]
The factor of 100 expresses MAPE as a percentage. Its appeal is interpretability and scale independence. A MAPE of 8 percent means the same thing for a product selling 10 units a week and one selling 10,000 units a week, so MAPE can be averaged across a heterogeneous catalog. This is precisely what a planner wants when reporting a single accuracy number to management.
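The definition translates directly into code. The following is a minimal sketch in plain Python; the function name `mape` and the choice to raise an error on zero actuals are assumptions made for illustration, not part of any particular library.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Raises ValueError on a zero actual, where the metric is undefined.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ape = [abs((y - f) / y) for y, f in zip(y_true, y_pred)]
    return 100.0 * sum(ape) / len(ape)


# The same relative miss at two very different scales (illustrative numbers).
small_scale = mape([10, 10], [9, 11])
large_scale = mape([10_000, 10_000], [9_000, 11_000])
print(small_scale, large_scale)
```

Both calls report the same percentage even though the absolute errors differ by a factor of 1,000, which is the scale independence described above.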
164.1.2 1.2 The divide-by-zero and small-denominator problem
MAPE divides by the actual value \(y_t\). When \(y_t = 0\) the term is undefined, and when \(y_t\) is merely small the term explodes. Intermittent demand series, which contain many zero periods, make MAPE unusable or wildly inflated. Even a single near-zero actual can dominate the average. Suppose \(y_t = 1\) and \(\hat{y}_t = 11\). The contribution is \(1000\) percent, drowning out hundreds of well-behaved terms. There is no honest patch. Common workarounds such as adding a constant to the denominator, dropping zero periods, or clipping large terms all change the quantity being measured and break comparability across datasets.
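To see how much one small actual can matter, the sketch below mixes 99 well-behaved periods with the single term from the example (\(y_t = 1\), \(\hat{y}_t = 11\)). The 99 periods, each with an actual of 100 and a forecast of 105, are made-up numbers chosen only for illustration.

```python
# 99 ordinary periods (5 percent error each) plus one near-zero actual.
actuals = [100.0] * 99 + [1.0]
forecasts = [105.0] * 99 + [11.0]

# Absolute percentage error of every period, in percent.
ape_percent = [100.0 * abs(y - f) / y for y, f in zip(actuals, forecasts)]

mape_all = sum(ape_percent) / len(ape_percent)
mape_without_last = sum(ape_percent[:-1]) / (len(ape_percent) - 1)
share_of_last = ape_percent[-1] / sum(ape_percent)

print(f"MAPE with the small actual:    {mape_all:.2f}%")
print(f"MAPE without it:               {mape_without_last:.2f}%")
print(f"Share of total error from it:  {share_of_last:.0%}")
```

One period out of a hundred roughly triples the reported MAPE and accounts for about two thirds of the summed percentage error.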
164.1.3 1.3 Asymmetry: MAPE penalizes over-forecasts more than under-forecasts
MAPE is not symmetric in the sign of the error, and this bias is structural rather than incidental. Consider \(y_t = 100\). An over-forecast of \(\hat{y}_t = 150\) gives an APE of 50 percent. A symmetric under-forecast of \(\hat{y}_t = 50\) also gives 50 percent. So far so balanced. But the bound differs. An under-forecast can produce at most a 100 percent error (when \(\hat{y}_t = 0\)), whereas an over-forecast is unbounded above as \(\hat{y}_t \to \infty\). The asymmetry has a sharp consequence: minimizing expected APE does not target the mean.
For a fixed actual distribution, the forecast that minimizes expected absolute percentage error is the weighted median where each outcome \(y\) is weighted by \(1/y\). Because small actuals receive large weights, the optimal point forecast under MAPE is pulled below the median of the data. A model tuned to minimize MAPE will systematically under-forecast. In demand planning this manifests as chronic stockouts, because the metric rewards the forecaster for shading predictions downward. Practitioners who report MAPE while optimizing a squared loss are then surprised that the two disagree about which model is best.
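The weighted-median result can be checked numerically. The sketch below assumes five equally likely actuals (illustrative numbers) and searches an integer grid for the forecast that minimizes expected APE; the helper names `expected_ape` and `weighted_median` are hypothetical and introduced only for this example.

```python
import statistics


def expected_ape(forecast, outcomes):
    """Expected absolute percentage error when all outcomes are equally likely."""
    return sum(abs(y - forecast) / y for y in outcomes) / len(outcomes)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(weights)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


outcomes = [10, 20, 30, 40, 100]          # illustrative, equally likely actuals
weights = [1.0 / y for y in outcomes]     # the 1/y weights implied by APE

# Brute-force search over integer forecasts from 1 to 100.
best = min(range(1, 101), key=lambda f: expected_ape(f, outcomes))
wmed = weighted_median(outcomes, weights)

print("APE-optimal forecast:", best)
print("1/y-weighted median: ", wmed)
print("plain median:        ", statistics.median(outcomes))
print("mean:                ", statistics.mean(outcomes))
```

On these numbers the grid search and the weighted median agree, and both sit below the plain median and well below the mean, which is the downward pull described above.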
164.2 2. Symmetric MAPE (SMAPE)
164.2.1 2.1 Motivation and definition
SMAPE was proposed to repair the asymmetry by putting a symmetric quantity in the denominator. A widely used form is
\[ \text{SMAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|)/2}. \]
By normalizing against the average magnitude of actual and forecast, SMAPE no longer divides by the actual alone, so an over-forecast and an under-forecast of equal absolute size relative to that average are treated more evenly. The denominator is also less likely to be exactly zero, since it vanishes only when both \(y_t\) and \(\hat{y}_t\) are zero.
164.2.2 2.2 SMAPE is not actually symmetric, and its range is awkward
The name oversells the cure. With \(y_t\) fixed, SMAPE is still not symmetric in \(\hat{y}_t\). Take \(y_t = 100\). For \(\hat{y}_t = 150\) the term is \(50 / 125 = 0.4\). For \(\hat{y}_t = 50\) the term is \(50 / 75 \approx 0.667\). The under-forecast is penalized more heavily here, the opposite tilt from MAPE, because the forecast enters the denominator and a smaller forecast shrinks it. SMAPE therefore trades one bias for another rather than removing bias.
The version above ranges from 0 to 200 percent, which surprises readers who expect a percentage to cap at 100. An alternative definition omits the division by 2 in the denominator and caps at 100 percent, so the literature contains at least two incompatible formulas. When reporting SMAPE you must state which one you used. SMAPE also remains undefined when both values are zero, which can happen for intermittent series during genuinely idle periods. The practical verdict is that SMAPE is an improvement on MAPE for some asymmetry problems but is not a clean, interpretable, or canonical metric.
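Both points can be reproduced in a few lines. The sketch below implements the 0 to 200 percent form given above and derives the 0 to 100 percent variant from it; the function names are hypothetical, and raising an error when both values are zero is one possible convention, not a standard.

```python
def smape_0_200(y_true, y_pred):
    """SMAPE with denominator (|y| + |f|) / 2, ranging from 0 to 200 percent."""
    terms = []
    for y, f in zip(y_true, y_pred):
        denom = (abs(y) + abs(f)) / 2.0
        if denom == 0:
            raise ValueError("SMAPE is undefined when actual and forecast are both zero")
        terms.append(abs(y - f) / denom)
    return 100.0 * sum(terms) / len(terms)


def smape_0_100(y_true, y_pred):
    """Variant with denominator |y| + |f|, ranging from 0 to 100 percent."""
    return smape_0_200(y_true, y_pred) / 2.0


over = smape_0_200([100], [150])     # over-forecast by 50
under = smape_0_200([100], [50])     # under-forecast by 50
worst = smape_0_200([100], [0])      # forecast of zero hits the upper bound

print(over, under, worst)
print(smape_0_100([100], [150]), smape_0_100([100], [50]))
```

The same absolute miss of 50 scores 40 percent as an over-forecast and about 66.7 percent as an under-forecast, and the two variants differ by exactly a factor of two, so a reported SMAPE is uninterpretable without the formula.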
164.3 3. Scaled Errors and MASE
164.3.1 3.1 The idea of scaling by a naive benchmark
The Mean Absolute Scaled Error (MASE), introduced by Hyndman and Koehler, sidesteps the division-by-actual problem entirely. Instead of normalizing each error by a value that can be zero, it normalizes the average absolute error of the model by the average absolute error of a naive benchmark computed over the training set. The benchmark is the in-sample one-step naive forecast, which simply predicts the previous observation.
For a non-seasonal series with a training set of length \(T\), define the in-sample naive scaling factor
\[ Q = \frac{1}{T-1} \sum_{t=2}^{T} |y_t - y_{t-1}|, \]
which is the mean absolute error of the random walk forecast on the training data. The MASE over an evaluation set of \(n\) points is
\[ \text{MASE} = \frac{1}{n} \sum_{j=1}^{n} \frac{|y_j - \hat{y}_j|}{Q}. \]
For seasonal data with period \(m\), the scaling uses the seasonal naive error \(|y_t - y_{t-m}|\) averaged over the training set, so the benchmark is the obvious seasonal carry forward.
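Written out in the notation above, and assuming the average runs over the \(T - m\) training points that have a seasonal predecessor (the convention used in Forecasting: Principles and Practice; the symbol \(Q_m\) is introduced here only for illustration), the seasonal scaling factor is
\[ Q_m = \frac{1}{T-m} \sum_{t=m+1}^{T} |y_t - y_{t-m}|, \]
which reduces to \(Q\) when \(m = 1\). Seasonal MASE then replaces \(Q\) with \(Q_m\) in the formula above.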
164.3.2 3.2 Why MASE behaves well
MASE has several properties that the percentage metrics lack. The scaling factor \(Q\) is a single number computed once from the training data, so individual zero or small actuals never appear in a denominator and cannot blow up. It is scale-free, since the numerator and denominator share the units of \(y\), which makes MASE averageable across series of different magnitudes, just as MAPE was supposed to be. It is symmetric in over- and under-forecasts because it is built on absolute errors with no actual in the denominator.
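The interpretation is clean and benchmarked. A MASE of 1 means the model's average absolute error on the evaluation set equals the average absolute error of the naive forecast on the training set. A MASE below 1 means the model beats that naive baseline, and above 1 means it is worse, which is a damning and immediately legible verdict. MASE is defined as long as the training series is not constant, since \(Q = 0\) only when every consecutive pair of training points is equal (for seasonal scaling, every pair of points \(m\) periods apart). For these reasons scaled errors feature prominently in the M competitions: MASE was one of the two components, alongside sMAPE, of the overall weighted average (OWA) that ranked entries in M4, and the M5 accuracy competition used a squared-error relative, the weighted root mean squared scaled error (WRMSSE). MASE is also the default recommendation in much of the modern forecasting literature.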
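# MASE sketch (y_train, y_test, y_hat assumed to be NumPy arrays; m = 1 if non-seasonal)
import numpy as np
Q = np.mean(np.abs(y_train[m:] - y_train[:-m]))  # in-sample (seasonal) naive MAE
mase = np.mean(np.abs(y_test - y_hat)) / Q       # undefined when Q == 0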
164.4 4. Probabilistic Forecasts and the Pinball Loss
164.4.1 4.1 From point forecasts to quantiles
Point error metrics summarize a forecast by a single number and therefore ignore uncertainty. Modern systems increasingly emit a full predictive distribution, or at least a set of quantiles, because downstream decisions such as safety stock and capacity reservation depend on tails rather than means. The right metric must score the whole distribution and reward calibrated uncertainty, not just central accuracy.
The fundamental tool is the quantile loss, also called the pinball loss. For a target quantile level \(\tau \in (0,1)\), with quantile forecast \(q\) and realized value \(y\), the pinball loss is
\[ L_\tau(y, q) = \begin{cases} \tau \, (y - q) & \text{if } y \ge q, \\[4pt] (1 - \tau)\,(q - y) & \text{if } y < q. \end{cases} \]
Equivalently, \(L_\tau(y, q) = \max\big(\tau (y - q),\, (\tau - 1)(y - q)\big)\). The loss is piecewise linear with a kink at \(y = q\), and the two slopes encode an asymmetric penalty.
164.4.2 4.2 Why the asymmetry is the point
For a high quantile such as \(\tau = 0.9\), an under-prediction (the realized \(y\) lands above the forecast \(q\)) is penalized with weight \(\tau = 0.9\), while an over-prediction is penalized with weight \(1 - \tau = 0.1\). This deliberately makes the forecaster reluctant to set \(q\) too low, which is exactly the behavior you want from a 90th percentile estimate. The minimizer of the expected pinball loss is the true \(\tau\) quantile of the predictive distribution:
\[ q^\star_\tau = \arg\min_{q} \; \mathbb{E}_{y}\big[ L_\tau(y, q) \big] = F^{-1}_y(\tau), \]
where \(F_y\) is the cumulative distribution function of the target. This makes the pinball loss a consistent (proper) scoring function for quantiles, meaning an honest forecaster minimizes expected loss by reporting the true quantile and cannot game the metric. At \(\tau = 0.5\) the pinball loss reduces to half the absolute error, so the median forecast is its optimizer, which connects the quantile world back to familiar mean absolute error.
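The quantile-minimizer property is easy to check by brute force. The sketch below assumes 99 equally likely actuals (the integers 1 to 99, chosen only for illustration) and searches for the forecast with the lowest mean pinball loss at three levels; `mean_pinball` and `empirical_quantile` are hypothetical helper names.

```python
import math


def pinball(y, q, tau):
    """Pinball loss for one realized value y, quantile forecast q, level tau."""
    e = y - q
    return tau * e if e >= 0 else (tau - 1) * e


def mean_pinball(q, outcomes, tau):
    """Expected pinball loss of forecast q when all outcomes are equally likely."""
    return sum(pinball(y, q, tau) for y in outcomes) / len(outcomes)


def empirical_quantile(outcomes, tau):
    """Smallest observed value whose empirical CDF is at least tau."""
    ordered = sorted(outcomes)
    index = math.ceil(tau * len(ordered)) - 1
    return ordered[max(index, 0)]


outcomes = list(range(1, 100))   # 99 equally likely actuals: 1, 2, ..., 99

results = {}
for tau in (0.1, 0.5, 0.9):
    # Brute-force search over the observed values as candidate forecasts.
    best_q = min(outcomes, key=lambda q: mean_pinball(q, outcomes, tau))
    results[tau] = (best_q, empirical_quantile(outcomes, tau))
    print(f"tau={tau}: loss-minimizing q={best_q}, empirical quantile={results[tau][1]}")
```

At each level the loss-minimizing forecast coincides with the empirical quantile, and at \(\tau = 0.5\) it is the median.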
164.4.3 4.3 Averaging over quantiles and the CRPS
A single quantile level scores only one slice of the distribution. To score a set of quantile levels \(\{\tau_1, \dots, \tau_K\}\) over \(n\) observations, average the pinball loss over both levels and observations:
\[ \text{QL} = \frac{1}{n K} \sum_{j=1}^{n} \sum_{k=1}^{K} L_{\tau_k}\big(y_j, \hat{q}_{\tau_k, j}\big). \]
A scaled and weighted variant of this averaged quantile loss, the weighted scaled pinball loss (WSPL), was the probabilistic scoring metric in the M5 uncertainty competition. As the grid of quantile levels becomes dense and uniform, the average pinball loss converges to one half of the Continuous Ranked Probability Score (CRPS):
\[ \text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big( F(z) - \mathbb{1}\{y \le z\} \big)^2 \, dz = 2 \int_0^1 L_\tau\big(y, F^{-1}(\tau)\big)\, d\tau. \]
The CRPS is a strictly proper scoring rule for the full predictive distribution and, like the pinball loss, is reported in the units of the target. It generalizes mean absolute error to distributions: if the forecast is a point mass, CRPS collapses to the absolute error. This is why quantile loss and CRPS dominate probabilistic forecast evaluation.
# Pinball loss for one (y, q, tau)
def pinball(y, q, tau):
    e = y - q
    return tau * e if e >= 0 else (tau - 1) * e
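The averaged quantile loss and its link to the CRPS can be sketched along the same lines. The example below assumes a Gaussian predictive distribution, for which the CRPS has a known closed form, and compares that value with twice the average pinball loss on a grid of 1,000 midpoint levels. The function names, the grid size, and the numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist


def pinball(y, q, tau):
    e = y - q
    return tau * e if e >= 0 else (tau - 1) * e


def average_quantile_loss(y_values, quantile_forecasts, taus):
    """QL: mean pinball loss over observations j and quantile levels k.

    quantile_forecasts[j][k] is the forecast of level taus[k] for y_values[j].
    """
    total = 0.0
    for y, row in zip(y_values, quantile_forecasts):
        for tau, q in zip(taus, row):
            total += pinball(y, q, tau)
    return total / (len(y_values) * len(taus))


def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at outcome y."""
    z = (y - mu) / sigma
    std = NormalDist()
    return sigma * (z * (2 * std.cdf(z) - 1) + 2 * std.pdf(z) - 1 / math.sqrt(math.pi))


# Dense, uniform grid of K midpoint levels in (0, 1).
K = 1000
taus = [(k + 0.5) / K for k in range(K)]

mu, sigma, y = 10.0, 2.0, 11.5            # illustrative numbers
forecast = NormalDist(mu, sigma)
row = [forecast.inv_cdf(tau) for tau in taus]

crps_from_quantiles = 2 * average_quantile_loss([y], [row], taus)
crps_exact = crps_gaussian(y, mu, sigma)

# Point-mass forecast at 7 scored against y = 10: every quantile equals 7.
crps_point_mass = 2 * average_quantile_loss([10.0], [[7.0] * K], taus)

print(crps_from_quantiles, crps_exact, crps_point_mass)
```

The grid-based value should agree closely with the closed form, and the point-mass case returns the absolute error of 3, matching the claim that CRPS generalizes mean absolute error.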
164.5 5. Choosing a Metric in Practice
The metric should mirror the decision it informs. For comparing models on a single, strictly positive, non-intermittent series, MAPE remains tolerable and communicates well to non-technical stakeholders, but be aware that it rewards under-forecasting. For heterogeneous portfolios and for series with zeros or small values, prefer MASE, which is scale-free, robust to zeros, and benchmarked against a naive baseline with an interpretation everyone understands. Treat SMAPE with caution and always state which formula you used, since its symmetry claim does not hold and its range is non-standard.
When the output is a distribution or a set of quantiles, point metrics are simply the wrong tool. Score the forecast with the pinball loss at the quantile levels that matter to the decision, average across levels for an overall picture, and use CRPS when you want a single proper score for the entire distribution. A useful discipline is to report at least two metrics, one scale-free point metric such as MASE and one probabilistic score such as averaged pinball loss, so that neither central accuracy nor calibrated uncertainty is silently ignored. The metric is not a postscript to modeling. It is the objective made visible, and choosing it deliberately is part of building the system.
164.6 References
- Hyndman, R. J., and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679 to 688. https://doi.org/10.1016/j.ijforecast.2006.03.001
- Hyndman, R. J., and Athanasopoulos, G. Forecasting: Principles and Practice (3rd ed.), chapter on evaluating forecast accuracy. https://otexts.com/fpp3/accuracy.html
- Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4), 527 to 529. https://doi.org/10.1016/0169-2070(93)90079-3
- Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). The M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting, 38(4), 1346 to 1364. https://doi.org/10.1016/j.ijforecast.2021.11.013
- Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). The M5 uncertainty competition: Results, findings and conclusions. International Journal of Forecasting, 38(4), 1365 to 1385. https://doi.org/10.1016/j.ijforecast.2021.10.009
- Gneiting, T., and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359 to 378. https://doi.org/10.1198/016214506000001437
- Koenker, R., and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33 to 50. https://doi.org/10.2307/1913643
- Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494), 746 to 762. https://doi.org/10.1198/jasa.2011.r10138