164 MAPE and Beyond: Percentage Errors and Probabilistic Forecast Scoring

Regression and forecasting systems are judged by error metrics, and the choice of metric quietly encodes the loss function the business actually cares about. Absolute and squared errors live in the units of the target, which makes them hard to compare across series with different scales. Percentage errors promise a scale free alternative, and the Mean Absolute Percentage Error (MAPE) became the default in demand planning, finance, and energy forecasting for exactly this reason. This chapter develops the percentage error family rigorously, exposes the pathologies of MAPE, then moves to scaled errors (MASE) and to proper scoring rules for probabilistic forecasts through the quantile (pinball) loss. The goal is a practitioner who can pick a metric on purpose rather than by habit.

A useful mental frame runs through the whole chapter. Every accuracy metric implicitly answers two questions. First, what statistic of the predictive distribution does it reward, the mean, the median, or some quantile? A metric is consistent for a statistic when reporting that statistic minimizes the expected score, and it is proper when an honest forecaster cannot lower the expected score by misreporting (references 6 and 8). Second, how does it handle scale, so that errors can be pooled across series of wildly different magnitudes? MAPE, SMAPE, MASE, the pinball loss, and the CRPS are best understood as different answers to these two questions rather than as interchangeable summaries of goodness.

164.1 1. Percentage Errors and MAPE

164.1.1 1.1 Definition

Let $y_t$ denote the actual value at time $t$ and $\hat{y}_t$ the forecast, for $t = 1, \dots, n$. The absolute percentage error is

\[ \text{APE}_t = \left| \frac{y_t - \hat{y}_t}{y_t} \right|, \]

and the Mean Absolute Percentage Error averages it:

\[ \text{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|. \]

The factor of 100 expresses MAPE as a percentage. Its appeal is interpretability and scale independence. A MAPE of 8 percent means the same thing for a product selling 10 units a week and one selling 10,000 units a week, so MAPE can be averaged across a heterogeneous catalog. This is precisely what a planner wants when reporting a single accuracy number to management.

Two structural features deserve to be named at the outset, because they drive everything in the rest of this section. First, the error enters only through the relative deviation $r_t = (\hat{y}_t - y_t)/y_t$, so MAPE is invariant to multiplying an entire series by a constant but is not invariant to shifting it. Second, the actual value $y_t$ sits alone in the denominator. That single design choice is the source of both the divide-by-zero pathology in Section 1.2 and the optimization bias in Section 1.3.
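The first structural feature can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name `mape` and the sample numbers are illustrative assumptions rather than part of any library, and the code assumes every actual is nonzero.

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error in percent. Assumes no actual is zero."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

# Made-up actuals and forecasts, for illustration only.
y = np.array([100.0, 120.0, 80.0, 110.0])
y_hat = np.array([90.0, 130.0, 85.0, 100.0])

base = mape(y, y_hat)
scaled = mape(1000 * y, 1000 * y_hat)   # multiply the whole series by a constant
shifted = mape(y + 500, y_hat + 500)    # add a constant to the whole series

print(f"base={base:.3f}%  scaled={scaled:.3f}%  shifted={shifted:.3f}%")
```

The base and scaled values agree, while the shifted series reports a much smaller MAPE for the same absolute errors, because the denominators grew and the numerators did not.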

164.1.2 1.2 The divide-by-zero and small-denominator problem

MAPE divides by the actual value $y_t$. When $y_t = 0$ the term is undefined, and when $y_t$ is merely small the term explodes. Intermittent demand series, which contain many zero periods, make MAPE unusable or wildly inflated. Even a single near-zero actual can dominate the average. Suppose $y_t = 1$ and $\hat{y}_t = 11$. The contribution is $1000$ percent, drowning out hundreds of well behaved terms. There is no honest patch. Common workarounds such as adding a constant to the denominator, dropping zero periods, or clipping large terms all change the quantity being measured and break comparability across datasets.
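To put a number on how one small actual dominates the average, take a hypothetical evaluation set of 100 periods in which 99 terms each have an APE of 5 percent and the remaining term is the $y_t = 1$, $\hat{y}_t = 11$ case above. The counts and the 5 percent figure are assumptions chosen only for illustration.

\[ \text{MAPE} = \frac{99 \times 5 + 1 \times 1000}{100} = \frac{1495}{100} = 14.95. \]

A single period roughly triples the reported error, from 5 percent to about 15 percent, even though 99 of the 100 forecasts are unchanged.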

164.1.3 1.3 Asymmetry: MAPE penalizes over-forecasts more than under-forecasts

MAPE is not symmetric in the sign of the error, and this bias is structural rather than incidental. Consider $y_t = 100$. An over-forecast of $\hat{y}_t = 150$ gives an APE of 50 percent. A symmetric under-forecast of $\hat{y}_t = 50$ also gives 50 percent. So far so balanced. But the bound differs. An under-forecast can produce at most a 100 percent error (when $\hat{y}_t = 0$), whereas an over-forecast is unbounded above as $\hat{y}_t \to \infty$. The asymmetry has a sharp consequence: minimizing expected APE does not target the mean.

For a fixed actual distribution, the forecast that minimizes expected absolute percentage error is the weighted median where each outcome $y$ is weighted by $1/y$. Because small actuals receive large weights, the optimal point forecast under MAPE is pulled below the median of the data. A model tuned to minimize MAPE will systematically under-forecast. In demand planning this manifests as chronic stockouts, because the metric rewards the forecaster for shading predictions downward. Practitioners who report MAPE while optimizing a squared loss are then surprised that the two disagree about which model is best.

Derivation of the weighted-median optimizer. Treat the forecast $q$ as a constant and the actual $Y > 0$ as a random variable with density $f$. The expected absolute percentage error is

\[ R(q) = \mathbb{E}\!\left[\frac{|Y - q|}{Y}\right] = \int_0^\infty \frac{|y - q|}{y}\, f(y)\, dy. \]

Split the integral at $q$ and differentiate with respect to $q$. For $y > q$ the integrand is $(y-q)/y$ with derivative $-1/y$, and for $y < q$ the integrand is $(q-y)/y$ with derivative $+1/y$. The boundary terms cancel because the integrand is continuous at $y=q$, so

\[ R'(q) = \int_0^q \frac{1}{y} f(y)\, dy - \int_q^\infty \frac{1}{y} f(y)\, dy. \]

Setting $R'(q^\star) = 0$ requires the $1/y$-weighted probability mass below $q^\star$ to equal the mass above it:

\[ \int_0^{q^\star} \frac{f(y)}{y}\, dy = \int_{q^\star}^\infty \frac{f(y)}{y}\, dy. \]

This is exactly the median of the distribution whose density is proportional to $f(y)/y$, the original distribution reweighted by $1/y$. Since $1/y$ is decreasing, the reweighting shifts probability mass toward small $y$, so $q^\star$ falls below the ordinary median of $Y$. The bias is not a finite-sample artifact; it is a property of the population minimizer. By contrast, squared error is minimized at the mean and absolute error at the median, so reporting MAPE while training on either of those losses optimizes a different target than the one being scored.
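The derivation can be sanity-checked on simulated data. The sketch below assumes lognormal actuals purely for illustration; `expected_ape` and `weighted_median` are hypothetical helper names. It compares a brute-force minimizer of the sample risk against the $1/y$-weighted median and the ordinary median.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption for illustration: strictly positive, right-skewed "actuals".
y = rng.lognormal(mean=3.0, sigma=0.8, size=2_000)

def expected_ape(q, y):
    """Sample version of R(q) = E[|Y - q| / Y] for a constant forecast q."""
    return np.mean(np.abs(y - q) / y)

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

# The sample risk is convex and piecewise linear in q, so a minimizer
# sits at one of the observed values; search over those candidates.
candidates = np.sort(y)
risks = np.array([expected_ape(q, y) for q in candidates])
q_brute = candidates[risks.argmin()]

q_wmed = weighted_median(y, 1.0 / y)

print("brute-force minimizer :", q_brute)
print("1/y-weighted median   :", q_wmed)
print("ordinary median       :", np.median(y))
```

Under these assumptions the brute-force minimizer and the weighted median coincide up to floating-point ties, and both sit well below the ordinary median, which is the downward bias described above.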

164.2 2. Symmetric MAPE (SMAPE)

164.2.1 2.1 Motivation and definition

SMAPE was proposed to repair the asymmetry by putting a symmetric quantity in the denominator. A widely used form is

\[ \text{SMAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|)/2}. \]

By normalizing against the average magnitude of actual and forecast, SMAPE no longer divides by the actual alone, so an over-forecast and an under-forecast of equal absolute size relative to that average are treated more evenly. The denominator is also less likely to be exactly zero, since it vanishes only when both $y_t$ and $\hat{y}_t$ are zero.

164.2.2 2.2 SMAPE is not actually symmetric, and its range is awkward

The name oversells the cure. With $y_t$ fixed, SMAPE is still not symmetric in $\hat{y}_t$. Take $y_t = 100$. For $\hat{y}_t = 150$ the term is $50 / 125 = 0.4$. For $\hat{y}_t = 50$ the term is $50 / 75 \approx 0.667$. The under-forecast is penalized more heavily here, the opposite tilt from MAPE, because the forecast enters the denominator and a smaller forecast shrinks it. SMAPE therefore trades one bias for another rather than removing bias.

The version above ranges from 0 to 200 percent, which surprises readers who expect a percentage to cap at 100. An alternative definition omits the division by 2 in the denominator and caps at 100 percent, so the literature contains at least two incompatible formulas. When reporting SMAPE you must state which one you used. SMAPE also remains undefined when both values are zero, which can happen for intermittent series during genuinely idle periods. The practical verdict is that SMAPE is an improvement on MAPE for some asymmetry problems but is not a clean, interpretable, or canonical metric.

164.2.3 2.3 A worked example contrasting the asymmetries

A single small table makes the competing biases concrete. Hold the actual at $y = 100$ and vary the forecast across a symmetric pair of over- and under-predictions of equal absolute size.

Forecast $\hat{y}$	Abs. error	APE (MAPE term)	SMAPE term (the form above)
50 (under)	50	$50/100 = 50\%$	$50/75 \approx 66.7\%$
150 (over)	50	$50/100 = 50\%$	$50/125 = 40.0\%$
100 (exact)	0	$0\%$	$0\%$

The absolute error is blind to sign, treating both misses identically. MAPE here also reports equal terms, because both forecasts are 50 away from the same denominator, but the equality is fragile: it holds only because we fixed the actual rather than averaging over a distribution of actuals, which is where the $1/y$ weighting bites. SMAPE breaks the tie in the opposite direction, punishing the under-forecast more because the smaller forecast shrinks its denominator. Neither percentage metric is the symmetric, sign-agnostic measure that absolute error already provides; they merely relocate the asymmetry.
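The table entries can be reproduced with a few lines of Python. This is a sketch only: `ape_term` and `smape_term` are hypothetical helper names, and the `halve_denominator` flag is an assumed way of switching between the two SMAPE conventions mentioned in Section 2.2.

```python
def ape_term(y, y_hat):
    """Single-period absolute percentage error, in percent."""
    return 100.0 * abs(y - y_hat) / abs(y)

def smape_term(y, y_hat, halve_denominator=True):
    """Single-period SMAPE term, in percent.

    halve_denominator=True gives the 0 to 200 percent form used above;
    False gives the alternative form that caps at 100 percent.
    """
    denom = abs(y) + abs(y_hat)
    if halve_denominator:
        denom /= 2.0
    if denom == 0:
        return float("nan")  # undefined when actual and forecast are both zero
    return 100.0 * abs(y - y_hat) / denom

y = 100
for y_hat in (50, 150, 100):
    print(y_hat, abs(y - y_hat), ape_term(y, y_hat), round(smape_term(y, y_hat), 1))

# The two conventions differ by a factor of two, including at the extremes.
print(smape_term(100, 0), smape_term(100, 0, halve_denominator=False))
```

The last line shows the two upper bounds, which is why a reported SMAPE is uninterpretable unless the formula is stated.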

164.3 3. Scaled Errors and MASE

164.3.1 3.1 The idea of scaling by a naive benchmark

The Mean Absolute Scaled Error (MASE), introduced by Hyndman and Koehler, sidesteps the division-by-actual problem entirely. Instead of normalizing each error by a value that can be zero, it normalizes the average absolute error of the model by the average absolute error of a naive benchmark computed over the training set. The benchmark is the in-sample one step naive forecast, which simply predicts the previous observation.

For a non seasonal series with training set of length $T$, define the in-sample naive scaling factor

\[ Q = \frac{1}{T-1} \sum_{t=2}^{T} |y_t - y_{t-1}|, \]

which is the mean absolute error of the random walk forecast on the training data. The MASE over an evaluation set of $n$ points is

\[ \text{MASE} = \frac{1}{n} \sum_{j=1}^{n} \frac{|y_j - \hat{y}_j|}{Q}. \]

For seasonal data with period $m$, the scaling uses the seasonal naive error $|y_t - y_{t-m}|$ averaged over the training set, so the benchmark is the obvious seasonal carry forward.

164.3.2 3.2 Why MASE behaves well

MASE has several properties that the percentage metrics lack. The scaling factor $Q$ is a single number computed once from the training data, so individual zero or small actuals never appear in a denominator and cannot blow up. It is scale free, since the numerator and denominator share the units of $y$, which makes MASE averageable across series of different magnitudes just like MAPE was supposed to be. It is symmetric in over and under forecasts because it is built on absolute errors with no actual in the denominator.

The interpretation is clean and benchmarked. A MASE of 1 means the model has the same average absolute error as the in-sample naive forecast. A MASE below 1 means the model beats the naive baseline, and above 1 means it is worse, which is a damning and immediately legible verdict. MASE is defined as long as the scaling factor is nonzero: in the non-seasonal case $Q = 0$ only when the training series is constant, and in the seasonal case only when the series repeats exactly with period $m$. Scaled errors also anchor the recent M competitions: MASE was one of the two measures combined in the M4 overall score (alongside SMAPE), the M5 accuracy competition used a squared-error relative of it (WRMSSE), and MASE is the default recommendation in much of the modern forecasting literature.

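# MASE with NumPy; y_train, y_test, y_hat are 1-D float arrays
import numpy as np
m = 1  # seasonal period; m = 1 if non-seasonal
Q = np.mean(np.abs(y_train[m:] - y_train[:-m]))  # in-sample (seasonal) naive MAE
mase = np.mean(np.abs(y_test - y_hat)) / Q       # undefined if Q == 0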
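A small worked example, with made-up numbers, shows the arithmetic. Take a non-seasonal training series $10, 12, 11, 15$ and two evaluation points with actuals $14, 16$ and forecasts $15, 13$.

\[ Q = \frac{|12 - 10| + |11 - 12| + |15 - 11|}{3} = \frac{7}{3} \approx 2.33, \qquad \text{MASE} = \frac{1}{2} \cdot \frac{|14 - 15| + |16 - 13|}{7/3} = \frac{2}{7/3} = \frac{6}{7} \approx 0.86. \]

Under these assumed numbers the model's average absolute error is about 86 percent of the in-sample naive error, so it beats the benchmark.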

164.4 4. Probabilistic Forecasts and the Pinball Loss

164.4.1 4.1 From point forecasts to quantiles

Point error metrics summarize a forecast by a single number and therefore ignore uncertainty. Modern systems increasingly emit a full predictive distribution, or at least a set of quantiles, because downstream decisions such as safety stock and capacity reservation depend on tails rather than means. The right metric must score the whole distribution and reward calibrated uncertainty, not just central accuracy.

The fundamental tool is the quantile loss, also called the pinball loss. For a target quantile level $\tau \in (0,1)$, with quantile forecast $q$ and realized value $y$, the pinball loss is

\[ L_\tau(y, q) = \begin{cases} \tau \, (y - q) & \text{if } y \ge q, \\[4pt] (1 - \tau)\,(q - y) & \text{if } y < q. \end{cases} \]

Equivalently, $L_\tau(y, q) = \max\big(\tau (y - q),\, (\tau - 1)(y - q)\big)$. The loss is piecewise linear with a kink at $y = q$, and the two slopes encode an asymmetric penalty.

164.4.2 4.2 Why the asymmetry is the point

For a high quantile such as $\tau = 0.9$, an under-prediction (the realized $y$ lands above the forecast $q$) is penalized with weight $\tau = 0.9$, while an over-prediction is penalized with weight $1 - \tau = 0.1$. This deliberately makes the forecaster reluctant to set $q$ too low, which is exactly the behavior you want from a 90th percentile estimate. The minimizer of the expected pinball loss is the true $\tau$ quantile of the predictive distribution:

\[ q^\star_\tau = \arg\min_{q} \; \mathbb{E}_{y}\big[ L_\tau(y, q) \big] = F^{-1}_y(\tau), \]

where $F_y$ is the cumulative distribution function of the target. This makes the pinball loss a consistent (proper) scoring function for quantiles, meaning an honest forecaster minimizes expected loss by reporting the true quantile and cannot game the metric. At $\tau = 0.5$ the pinball loss reduces to half the absolute error, so the median forecast is its optimizer, which connects the quantile world back to familiar mean absolute error.

Derivation of the quantile optimizer. Write the expected loss as a function of the forecast $q$ and split the expectation at the kink:

\[ \mathbb{E}[L_\tau(Y, q)] = \tau \int_q^\infty (y - q)\, f(y)\, dy + (1-\tau) \int_{-\infty}^q (q - y)\, f(y)\, dy. \]

Differentiating with respect to $q$ and using Leibniz’s rule (the boundary terms vanish because the integrand is zero at $y = q$) gives

\[ \frac{d}{dq}\,\mathbb{E}[L_\tau(Y, q)] = -\tau\,\big(1 - F_y(q)\big) + (1-\tau)\,F_y(q) = F_y(q) - \tau. \]

Setting this to zero yields $F_y(q^\star) = \tau$, that is $q^\star = F_y^{-1}(\tau)$. The second derivative is $f(q) \ge 0$, so the stationary point is a minimum, and it is unique wherever the density is positive. The slope $F_y(q) - \tau$ also explains the mechanism directly: as long as less than a fraction $\tau$ of the mass lies below the current forecast (equivalently, more than $1 - \tau$ lies above it), raising the forecast lowers expected loss, and the process halts exactly at the $\tau$ quantile. This is the population-level reason the asymmetric kink works.
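The same conclusion can be checked empirically. The sketch below draws an illustrative skewed sample (the gamma distribution and its parameters are assumptions), minimizes the average pinball loss by brute force over the observed values, and compares the result with the empirical quantile from `np.quantile`.

```python
import numpy as np

def pinball(y, q, tau):
    """Vectorized pinball loss for observations y and a scalar forecast q."""
    e = y - q
    return np.where(e >= 0, tau * e, (tau - 1) * e)

rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=10.0, size=2_000)  # illustrative skewed target
tau = 0.9

# The average loss is convex and piecewise linear in q, so it is enough
# to search over the observed values themselves.
candidates = np.sort(y)
mean_loss = np.array([pinball(y, q, tau).mean() for q in candidates])
q_best = candidates[mean_loss.argmin()]

q_emp = np.quantile(y, tau)

print("loss-minimizing forecast:", q_best)
print("empirical 0.9 quantile  :", q_emp)
print("share of y at or below q:", np.mean(y <= q_best))
```

The minimizer lands next to the empirical 0.9 quantile, with about 90 percent of the sample at or below it, which is the sample analogue of $F_y(q^\star) = \tau$.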

164.4.3 4.3 Averaging over quantiles and the CRPS

A single quantile level scores only one slice of the distribution. To score a set of quantile levels $\{\tau_1, \dots, \tau_K\}$ over $n$ observations, average the pinball loss over both levels and observations:

\[ \text{QL} = \frac{1}{n K} \sum_{j=1}^{n} \sum_{k=1}^{K} L_{\tau_k}\big(y_j, \hat{q}_{\tau_k, j}\big). \]

A weighted, scaled variant of this averaged quantile loss (the weighted scaled pinball loss) was the scoring metric in the M5 uncertainty competition (reference 5). As the grid of quantile levels becomes dense and uniform, the average pinball loss converges to one half of the Continuous Ranked Probability Score (CRPS), which has two equivalent forms:

\[ \text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big( F(z) - \mathbb{1}\{y \le z\} \big)^2 \, dz = 2 \int_0^1 L_\tau\big(y, F^{-1}(\tau)\big)\, d\tau. \]

The CRPS is a strictly proper scoring rule for the full predictive distribution and, like the pinball loss, is reported in the units of the target. It generalizes mean absolute error to distributions: if the forecast is a point mass, CRPS collapses to the absolute error. This is why quantile loss and CRPS dominate probabilistic forecast evaluation.

# Pinball loss for one (y, q, tau)
def pinball(y, q, tau):
    e = y - q
    return tau * e if e >= 0 else (tau - 1) * e
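The link between the averaged pinball loss and the CRPS can be checked numerically. The sketch below is an illustration under stated assumptions: `crps_from_quantiles` is a hypothetical helper that averages the pinball loss over a uniform midpoint grid of levels and doubles it, the Uniform(0, 1) forecast is chosen because its quantile function is simply $F^{-1}(\tau) = \tau$, and the closed form $(y^3 + (1-y)^3)/3$ comes from evaluating the integral form of the CRPS by hand for that forecast with $y \in [0, 1]$.

```python
import numpy as np

def pinball(y, q, tau):
    """Pinball loss; y, q and tau broadcast against each other."""
    e = np.asarray(y, dtype=float) - np.asarray(q, dtype=float)
    return np.maximum(tau * e, (tau - 1) * e)

def crps_from_quantiles(y, quantile_fn, n_levels=2_000):
    """Approximate CRPS as twice the average pinball loss on a uniform grid."""
    taus = (np.arange(n_levels) + 0.5) / n_levels   # midpoints of (0, 1)
    return 2.0 * pinball(y, quantile_fn(taus), taus).mean()

y_obs = 0.3

# Forecast 1: Uniform(0, 1), whose quantile function is the identity.
crps_uniform = crps_from_quantiles(y_obs, lambda t: t)
closed_form = (y_obs**3 + (1 - y_obs) ** 3) / 3

# Forecast 2: a point mass at c, whose quantile function is constant.
c = 1.0
crps_point = crps_from_quantiles(y_obs, lambda t: np.full_like(t, c))

print("uniform forecast:", crps_uniform, "closed form:", closed_form)
print("point forecast  :", crps_point, "absolute error:", abs(y_obs - c))
```

The first comparison illustrates the quantile representation of the CRPS, and the second the collapse to absolute error for a point-mass forecast. Averaging the same `pinball` call over several observations as well as levels gives the QL defined above.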

164.5 5. Choosing a Metric in Practice

The following table summarizes the family along the two axes from the introduction, the statistic each metric rewards and how it copes with scale and zeros.

Metric	Rewards (optimal point forecast)	Scale handling	Robust to $y=0$?	Proper / consistent?
MAE	Median of $Y$	Units of $y$, not comparable across scales	Yes	Consistent for the median
RMSE	Mean of $Y$	Units of $y$, not comparable across scales	Yes	Consistent for the mean
MAPE	$1/y$-weighted median (below the median)	Scale free	No, denominator is $y_t$	Biased low; not mean-consistent
SMAPE	No clean statistic	Scale free, range 0 to 200 percent	Only if not both zero	No, residual asymmetry
MASE	Median (numerator is absolute error)	Scale free via naive benchmark	Yes, $Q$ is a single constant	Consistent for the median, benchmarked
Pinball $L_\tau$	$\tau$ quantile of $Y$	Units of $y$	Yes	Proper for the $\tau$ quantile
CRPS	Full distribution	Units of $y$	Yes	Strictly proper for the distribution

The decision flow below routes a problem to the right metric based on the shape of the output and the data.

flowchart TD
    A["Start: what does the model output"] --> B{"Point forecast or full distribution"}
    B -->|"Distribution or quantiles"| C["Use pinball loss at decision-relevant levels"]
    C --> D["Use CRPS for one proper score over the whole distribution"]
    B -->|"Point forecast"| E{"Series have zeros or very small values"}
    E -->|"Yes"| F["Use MASE, scale free and zero robust"]
    E -->|"No"| G{"Pooling across many series of different scales"}
    G -->|"Yes"| F
    G -->|"No"| H["MAPE is tolerable but biases toward under-forecasting"]

The metric should mirror the decision it informs. For comparing models on a single, strictly positive, non-intermittent series, MAPE remains tolerable and communicates well to non technical stakeholders, but be aware it rewards under-forecasting. For heterogeneous portfolios and for series with zeros or small values, prefer MASE, which is scale free, robust to zeros, and benchmarked against a naive baseline with an interpretation everyone understands. Treat SMAPE with caution and always state which formula you used, since its symmetry claim does not hold and its range is non standard.

When the output is a distribution or a set of quantiles, point metrics are simply the wrong tool. Score the forecast with the pinball loss at the quantile levels that matter to the decision, average across levels for an overall picture, and use CRPS when you want a single proper score for the entire distribution. A useful discipline is to report at least two metrics, one scale free point metric such as MASE and one probabilistic score such as averaged pinball loss, so that neither central accuracy nor calibrated uncertainty is silently ignored. The metric is not a postscript to modeling. It is the objective made visible, and choosing it deliberately is part of building the system.

164.6 References

1. Hyndman, R. J., and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679 to 688. https://doi.org/10.1016/j.ijforecast.2006.03.001
2. Hyndman, R. J., and Athanasopoulos, G. Forecasting: Principles and Practice (3rd ed.), chapter on evaluating forecast accuracy. https://otexts.com/fpp3/accuracy.html
3. Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4), 527 to 529. https://doi.org/10.1016/0169-2070(93)90079-3
4. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). The M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting, 38(4), 1346 to 1364. https://doi.org/10.1016/j.ijforecast.2021.11.013
5. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). The M5 uncertainty competition: Results, findings and conclusions. International Journal of Forecasting, 38(4), 1365 to 1385. https://doi.org/10.1016/j.ijforecast.2021.10.009
6. Gneiting, T., and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359 to 378. https://doi.org/10.1198/016214506000001437
7. Koenker, R., and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33 to 50. https://doi.org/10.2307/1913643
8. Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494), 746 to 762. https://doi.org/10.1198/jasa.2011.r10138

# MAPE and Beyond: Percentage Errors and Probabilistic Forecast Scoring Regression and forecasting systems are judged by error metrics, and the choice of metric quietly encodes the loss function the business actually cares about. Absolute and squared errors live in the units of the target, which makes them hard to compare across series with different scales. Percentage errors promise a scale free alternative, and the Mean Absolute Percentage Error (MAPE) became the default in demand planning, finance, and energy forecasting for exactly this reason. This chapter develops the percentage error family rigorously, exposes the pathologies of MAPE, then moves to scaled errors (MASE) and to proper scoring rules for probabilistic forecasts through the quantile (pinball) loss. The goal is a practitioner who can pick a metric on purpose rather than by habit. A useful mental frame runs through the whole chapter. Every accuracy metric implicitly answers two questions. First, what statistic of the predictive distribution does it reward, the mean, the median, or some quantile? A metric is consistent for a statistic when reporting that statistic minimizes the expected score, and it is proper when an honest forecaster cannot lower the expected score by misreporting (references 6 and 8). Second, how does it handle scale, so that errors can be pooled across series of wildly different magnitudes? MAPE, SMAPE, MASE, the pinball loss, and the CRPS are best understood as different answers to these two questions rather than as interchangeable summaries of goodness. ## 1. Percentage Errors and MAPE ### 1.1 Definition Let $y_t$ denote the actual value at time $t$ and $\hat{y}_t$ the forecast, for $t = 1, \dots, n$. The absolute percentage error is $$ \text{APE}_t = \left| \frac{y_t - \hat{y}_t}{y_t} \right|, $$ and the Mean Absolute Percentage Error averages it: $$ \text{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|. 
$$ The factor of 100 expresses MAPE as a percentage. Its appeal is interpretability and scale independence. A MAPE of 8 percent means the same thing for a product selling 10 units a week and one selling 10,000 units a week, so MAPE can be averaged across a heterogeneous catalog. This is precisely what a planner wants when reporting a single accuracy number to management. Two structural features deserve to be named at the outset, because they drive everything in the rest of this section. First, the error enters only through the relative deviation $r_t = (\hat{y}_t - y_t)/y_t$, so MAPE is invariant to multiplying an entire series by a constant but is not invariant to shifting it. Second, the actual value $y_t$ sits alone in the denominator. That single design choice is the source of both the divide-by-zero pathology in Section 1.2 and the optimization bias in Section 1.3. ### 1.2 The divide-by-zero and small-denominator problem MAPE divides by the actual value $y_t$. When $y_t = 0$ the term is undefined, and when $y_t$ is merely small the term explodes. Intermittent demand series, which contain many zero periods, make MAPE unusable or wildly inflated. Even a single near-zero actual can dominate the average. Suppose $y_t = 1$ and $\hat{y}_t = 11$. The contribution is $1000$ percent, drowning out hundreds of well behaved terms. There is no honest patch. Common workarounds such as adding a constant to the denominator, dropping zero periods, or clipping large terms all change the quantity being measured and break comparability across datasets. ### 1.3 Asymmetry: MAPE penalizes over-forecasts more than under-forecasts MAPE is not symmetric in the sign of the error, and this bias is structural rather than incidental. Consider $y_t = 100$. An over-forecast of $\hat{y}_t = 150$ gives an APE of 50 percent. A symmetric under-forecast of $\hat{y}_t = 50$ also gives 50 percent. So far so balanced. But the bound differs. 
An under-forecast can produce at most a 100 percent error (when $\hat{y}_t = 0$), whereas an over-forecast is unbounded above as $\hat{y}_t \to \infty$. The asymmetry has a sharp consequence: minimizing expected APE does not target the mean. For a fixed actual distribution, the forecast that minimizes expected absolute percentage error is the weighted median where each outcome $y$ is weighted by $1/y$. Because small actuals receive large weights, the optimal point forecast under MAPE is pulled below the median of the data. A model tuned to minimize MAPE will systematically under-forecast. In demand planning this manifests as chronic stockouts, because the metric rewards the forecaster for shading predictions downward. Practitioners who report MAPE while optimizing a squared loss are then surprised that the two disagree about which model is best. **Derivation of the weighted-median optimizer.** Treat the forecast $q$ as a constant and the actual $Y > 0$ as a random variable with density $f$. The expected absolute percentage error is $$ R(q) = \mathbb{E}\!\left[\frac{|Y - q|}{Y}\right] = \int_0^\infty \frac{|y - q|}{y}\, f(y)\, dy. $$ Split the integral at $q$ and differentiate with respect to $q$. For $y > q$ the integrand is $(y-q)/y$ with derivative $-1/y$, and for $y < q$ the integrand is $(q-y)/y$ with derivative $+1/y$. The boundary terms cancel because the integrand is continuous at $y=q$, so $$ R'(q) = \int_0^q \frac{1}{y} f(y)\, dy - \int_q^\infty \frac{1}{y} f(y)\, dy. $$ Setting $R'(q^\star) = 0$ requires the $1/y$-weighted probability mass below $q^\star$ to equal the mass above it: $$ \int_0^{q^\star} \frac{f(y)}{y}\, dy = \int_{q^\star}^\infty \frac{f(y)}{y}\, dy. $$ This is exactly the median of the distribution whose density is proportional to $f(y)/y$, the original distribution reweighted by $1/y$. Since $1/y$ is decreasing, the reweighting shifts probability mass toward small $y$, so $q^\star$ falls below the ordinary median of $Y$. 
The bias is not a finite-sample artifact; it is a property of the population minimizer. By contrast, squared error is minimized at the mean and absolute error at the median, so reporting MAPE while training on either of those losses optimizes a different target than the one being scored. ## 2. Symmetric MAPE (SMAPE) ### 2.1 Motivation and definition SMAPE was proposed to repair the asymmetry by putting a symmetric quantity in the denominator. A widely used form is $$ \text{SMAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|)/2}. $$ By normalizing against the average magnitude of actual and forecast, SMAPE no longer divides by the actual alone, so an over-forecast and an under-forecast of equal absolute size relative to that average are treated more evenly. The denominator is also less likely to be exactly zero, since it vanishes only when both $y_t$ and $\hat{y}_t$ are zero. ### 2.2 SMAPE is not actually symmetric, and its range is awkward The name oversells the cure. With $y_t$ fixed, SMAPE is still not symmetric in $\hat{y}_t$. Take $y_t = 100$. For $\hat{y}_t = 150$ the term is $50 / 125 = 0.4$. For $\hat{y}_t = 50$ the term is $50 / 75 \approx 0.667$. The under-forecast is penalized more heavily here, the opposite tilt from MAPE, because the forecast enters the denominator and a smaller forecast shrinks it. SMAPE therefore trades one bias for another rather than removing bias. The version above ranges from 0 to 200 percent, which surprises readers who expect a percentage to cap at 100. An alternative definition omits the division by 2 in the denominator and caps at 100 percent, so the literature contains at least two incompatible formulas. When reporting SMAPE you must state which one you used. SMAPE also remains undefined when both values are zero, which can happen for intermittent series during genuinely idle periods. 
The practical verdict is that SMAPE is an improvement on MAPE for some asymmetry problems but is not a clean, interpretable, or canonical metric. ### 2.3 A worked example contrasting the asymmetries A single small table makes the competing biases concrete. Hold the actual at $y = 100$ and vary the forecast across a symmetric pair of over- and under-predictions of equal absolute size. | Forecast $\hat{y}$ | Abs. error | APE (MAPE term) | SMAPE term (the form above) | |---:|---:|---:|---:| | 50 (under) | 50 | $50/100 = 50\%$ | $50/75 \approx 66.7\%$ | | 150 (over) | 50 | $50/100 = 50\%$ | $50/125 = 40.0\%$ | | 100 (exact) | 0 | $0\%$ | $0\%$ | The absolute error is blind to sign, treating both misses identically. MAPE here also reports equal terms, because both forecasts are 50 away from the same denominator, but the equality is fragile: it holds only because we fixed the actual rather than averaging over a distribution of actuals, which is where the $1/y$ weighting bites. SMAPE breaks the tie in the opposite direction, punishing the under-forecast more because the smaller forecast shrinks its denominator. Neither percentage metric is the symmetric, sign-agnostic measure that absolute error already provides; they merely relocate the asymmetry. ## 3. Scaled Errors and MASE ### 3.1 The idea of scaling by a naive benchmark The Mean Absolute Scaled Error (MASE), introduced by Hyndman and Koehler, sidesteps the division-by-actual problem entirely. Instead of normalizing each error by a value that can be zero, it normalizes the average absolute error of the model by the average absolute error of a naive benchmark computed over the training set. The benchmark is the in-sample one step naive forecast, which simply predicts the previous observation. 
For a non seasonal series with training set of length $T$, define the in-sample naive scaling factor $$ Q = \frac{1}{T-1} \sum_{t=2}^{T} |y_t - y_{t-1}|, $$ which is the mean absolute error of the random walk forecast on the training data. The MASE over an evaluation set of $n$ points is $$ \text{MASE} = \frac{1}{n} \sum_{j=1}^{n} \frac{|y_j - \hat{y}_j|}{Q}. $$ For seasonal data with period $m$, the scaling uses the seasonal naive error $|y_t - y_{t-m}|$ averaged over the training set, so the benchmark is the obvious seasonal carry forward. ### 3.2 Why MASE behaves well MASE has several properties that the percentage metrics lack. The scaling factor $Q$ is a single number computed once from the training data, so individual zero or small actuals never appear in a denominator and cannot blow up. It is scale free, since the numerator and denominator share the units of $y$, which makes MASE averageable across series of different magnitudes just like MAPE was supposed to be. It is symmetric in over and under forecasts because it is built on absolute errors with no actual in the denominator. The interpretation is clean and benchmarked. A MASE of 1 means the model has the same average absolute error as the in-sample naive forecast. A MASE below 1 means the model beats the naive baseline, and above 1 means it is worse, which is a damning and immediately legible verdict. MASE is defined as long as the training series is not constant, since $Q = 0$ only when every consecutive pair of training points is equal. For these reasons MASE was the headline metric in the M4 and M5 forecasting competitions and is the default recommendation in much of the modern forecasting literature. ```text # Pseudocode for MASE Q = mean(|y_train[t] - y_train[t - m]|) # m = 1 if non-seasonal mase = mean(|y_test - y_hat|) / Q ``` ## 4. 
Probabilistic Forecasts and the Pinball Loss ### 4.1 From point forecasts to quantiles Point error metrics summarize a forecast by a single number and therefore ignore uncertainty. Modern systems increasingly emit a full predictive distribution, or at least a set of quantiles, because downstream decisions such as safety stock and capacity reservation depend on tails rather than means. The right metric must score the whole distribution and reward calibrated uncertainty, not just central accuracy. The fundamental tool is the quantile loss, also called the pinball loss. For a target quantile level $\tau \in (0,1)$, with quantile forecast $q$ and realized value $y$, the pinball loss is $$ L_\tau(y, q) = \begin{cases} \tau \, (y - q) & \text{if } y \ge q, \\[4pt] (1 - \tau)\,(q - y) & \text{if } y < q. \end{cases} $$ Equivalently, $L_\tau(y, q) = \max\big(\tau (y - q),\, (\tau - 1)(y - q)\big)$. The loss is piecewise linear with a kink at $y = q$, and the two slopes encode an asymmetric penalty. ### 4.2 Why the asymmetry is the point For a high quantile such as $\tau = 0.9$, an under-prediction (the realized $y$ lands above the forecast $q$) is penalized with weight $\tau = 0.9$, while an over-prediction is penalized with weight $1 - \tau = 0.1$. This deliberately makes the forecaster reluctant to set $q$ too low, which is exactly the behavior you want from a 90th percentile estimate. The minimizer of the expected pinball loss is the true $\tau$ quantile of the predictive distribution: $$ q^\star_\tau = \arg\min_{q} \; \mathbb{E}_{y}\big[ L_\tau(y, q) \big] = F^{-1}_y(\tau), $$ where $F_y$ is the cumulative distribution function of the target. This makes the pinball loss a consistent (proper) scoring function for quantiles, meaning an honest forecaster minimizes expected loss by reporting the true quantile and cannot game the metric. 
At $\tau = 0.5$ the pinball loss reduces to half the absolute error, so the median forecast is its optimizer, which connects the quantile world back to familiar mean absolute error.

**Derivation of the quantile optimizer.** Assume $Y$ has density $f$ and a finite mean. Write the expected loss as a function of the forecast $q$ and split the expectation at the kink:

$$
\mathbb{E}[L_\tau(Y, q)] = \tau \int_q^\infty (y - q)\, f(y)\, dy + (1-\tau) \int_{-\infty}^q (q - y)\, f(y)\, dy.
$$

Differentiating with respect to $q$ and using Leibniz's rule (the boundary terms vanish because the integrand is zero at $y = q$) gives

$$
\frac{d}{dq}\,\mathbb{E}[L_\tau(Y, q)] = -\tau\,\big(1 - F_y(q)\big) + (1-\tau)\,F_y(q) = F_y(q) - \tau.
$$

Setting this to zero yields $F_y(q^\star) = \tau$, that is $q^\star = F_y^{-1}(\tau)$. The second derivative is $f(q) \ge 0$, so the stationary point is a minimum, and it is unique wherever the density is positive. The slope $F_y(q) - \tau$ also explains the mechanism directly: as long as less than a fraction $\tau$ of the mass lies below the current forecast (equivalently, more than $1 - \tau$ lies above it), raising the forecast lowers expected loss, and the process halts exactly at the $\tau$ quantile. This is the population-level reason the asymmetric kink works.

### 4.3 Averaging over quantiles and the CRPS

A single quantile level scores only one slice of the distribution. To score a set of quantile levels $\{\tau_1, \dots, \tau_K\}$ over $n$ observations, average the pinball loss over both levels and observations:

$$
\text{QL} = \frac{1}{n K} \sum_{j=1}^{n} \sum_{k=1}^{K} L_{\tau_k}\big(y_j, \hat{q}_{\tau_k, j}\big).
$$

A scaled and weighted variant of this averaged quantile loss, the weighted scaled pinball loss, was the scoring metric in the M5 uncertainty competition (reference 5). As the grid of quantile levels becomes dense and uniform, the average pinball loss converges, up to a factor of 2, to the Continuous Ranked Probability Score (CRPS):

$$
\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big( F(z) - \mathbb{1}\{y \le z\} \big)^2 \, dz = 2 \int_0^1 L_\tau\big(y, F^{-1}(\tau)\big)\, d\tau.
$$

The CRPS is a strictly proper scoring rule for the full predictive distribution (reference 6) and, like the pinball loss, is reported in the units of the target. It generalizes mean absolute error to distributions: if the forecast is a point mass, CRPS collapses to the absolute error. This is why quantile loss and CRPS dominate probabilistic forecast evaluation.

```python
# Pinball loss for one (y, q, tau)
def pinball(y, q, tau):
    e = y - q  # positive when the forecast is too low
    return tau * e if e >= 0 else (tau - 1) * e
```

## 5. Choosing a Metric in Practice

The following table summarizes the family along the two axes from the introduction, the statistic each metric rewards and how it copes with scale and zeros.

| Metric | Rewards (optimal point forecast) | Scale handling | Robust to $y=0$? | Proper / consistent? |
|---|---|---|---|---|
| MAE | Median of $Y$ | Units of $y$, not comparable across scales | Yes | Consistent for the median |
| RMSE | Mean of $Y$ | Units of $y$, not comparable across scales | Yes | Consistent for the mean |
| MAPE | $1/y$-weighted median (below the median) | Scale free | No, denominator is $y_t$ | Biased low; not mean-consistent |
| SMAPE | No clean statistic | Scale free, range 0 to 200 percent | Only if not both zero | No, residual asymmetry |
| MASE | Median (numerator is absolute error) | Scale free via naive benchmark | Yes, $Q$ is a single constant | Consistent for the median, benchmarked |
| Pinball $L_\tau$ | $\tau$ quantile of $Y$ | Units of $y$ | Yes | Consistent for the $\tau$ quantile |
| CRPS | Full distribution | Units of $y$ | Yes | Strictly proper for the distribution |

The decision flow below routes a problem to the right metric based on the shape of the output and the data.
```{mermaid}
flowchart TD
    A["Start: what does the model output"] --> B{"Point forecast or full distribution"}
    B -->|"Distribution or quantiles"| C["Use pinball loss at decision-relevant levels"]
    C --> D["Use CRPS for one proper score over the whole distribution"]
    B -->|"Point forecast"| E{"Series have zeros or very small values"}
    E -->|"Yes"| F["Use MASE, scale free and zero robust"]
    E -->|"No"| G{"Pooling across many series of different scales"}
    G -->|"Yes"| F
    G -->|"No"| H["MAPE is tolerable but biases toward under-forecasting"]
```

The metric should mirror the decision it informs. For comparing models on a single, strictly positive, non-intermittent series, MAPE remains tolerable and communicates well to non technical stakeholders, but be aware it rewards under-forecasting. For heterogeneous portfolios and for series with zeros or small values, prefer MASE, which is scale free, robust to zeros, and benchmarked against a naive baseline with an interpretation everyone understands. Treat SMAPE with caution and always state which formula you used, since its symmetry claim does not hold and its range is non standard.

When the output is a distribution or a set of quantiles, point metrics are simply the wrong tool. Score the forecast with the pinball loss at the quantile levels that matter to the decision, average across levels for an overall picture, and use CRPS when you want a single proper score for the entire distribution. A useful discipline is to report at least two metrics, one scale free point metric such as MASE and one probabilistic score such as averaged pinball loss, so that neither central accuracy nor calibrated uncertainty is silently ignored.

The metric is not a postscript to modeling. It is the objective made visible, and choosing it deliberately is part of building the system.

## References

1. Hyndman, R. J., and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679 to 688. https://doi.org/10.1016/j.ijforecast.2006.03.001
2. Hyndman, R. J., and Athanasopoulos, G. Forecasting: Principles and Practice (3rd ed.), chapter on evaluating forecast accuracy. https://otexts.com/fpp3/accuracy.html
3. Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4), 527 to 529. https://doi.org/10.1016/0169-2070(93)90079-3
4. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). The M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting, 38(4), 1346 to 1364. https://doi.org/10.1016/j.ijforecast.2021.11.013
5. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). The M5 uncertainty competition: Results, findings and conclusions. International Journal of Forecasting, 38(4), 1365 to 1385. https://doi.org/10.1016/j.ijforecast.2021.10.009
6. Gneiting, T., and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359 to 378. https://doi.org/10.1198/016214506000001437
7. Koenker, R., and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33 to 50. https://doi.org/10.2307/1913643
8. Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494), 746 to 762. https://doi.org/10.1198/jasa.2011.r10138