165  Probabilistic Metrics and Proper Scoring Rules

Most machine learning systems do not merely predict; they assert degrees of belief. A weather model says rain is 70 percent likely, a credit model assigns a 3 percent default probability, a language model places a distribution over the next token. To evaluate such systems we need metrics that grade the entire probability distribution rather than a single hard decision. This chapter develops the theory of proper scoring rules, derives the two workhorse classification scores (log loss and the Brier score), extends the framework to real-valued forecasts through the continuous ranked probability score, and explains the central reason these metrics matter: they make honesty the optimal strategy.

165.1 1. From Decisions to Distributions

A classifier that outputs only a label can be scored with accuracy, precision, or recall. These metrics discard information. Two models that both predict “rain” may differ sharply: one assigns probability \(0.51\), the other \(0.99\). If it does not rain, the second model was far more wrong, yet a threshold-based metric treats them identically. Probabilistic metrics retain this gradation.
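To make the gradation concrete, the following minimal sketch scores the two forecasts above on a day that turned out dry. The function names and the \(0.5\) decision threshold are illustrative assumptions, not part of any library.

```python
import math

def thresholded_error(p_rain, rained, threshold=0.5):
    """0/1 error of the hard decision obtained by thresholding p_rain."""
    predicted_rain = p_rain >= threshold
    return int(predicted_rain != rained)

def log_loss(p_rain, rained):
    """Negative log probability assigned to what actually happened."""
    return -math.log(p_rain if rained else 1.0 - p_rain)

def squared_error(p_rain, rained):
    """Squared distance between the forecast and the 0/1 outcome."""
    return (p_rain - float(rained)) ** 2

rained = False  # the day turned out dry
for p in (0.51, 0.99):
    print(p, thresholded_error(p, rained),
          round(log_loss(p, rained), 3), round(squared_error(p, rained), 3))
```

Both forecasts incur the same thresholded error, while the logarithmic and squared penalties are several times larger for the \(0.99\) forecast.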

Formally, let the outcome be a random variable \(Y\) with realized value \(y\), and let a forecaster issue a predictive distribution \(P\) (or a density \(p\), or for binary outcomes a probability \(\hat{p}\)). A scoring rule is a function \(S(P, y)\) giving the reward (or, by convention here, the loss) when the forecast was \(P\) and the outcome turned out to be \(y\). Lower loss is better. The quality of a forecaster is summarized by the expected score under the true data generating distribution \(Q\):

\[ S(P, Q) = \mathbb{E}_{Y \sim Q}\big[ S(P, Y) \big]. \]

The forecaster controls \(P\); nature controls \(Q\). The key design question is how to choose \(S\) so that a rational forecaster who wants to minimize expected loss is driven to report \(P = Q\).

165.2 2. Proper Scoring Rules

165.2.1 2.1 Definition

A scoring rule \(S\) is proper if reporting the true distribution is never worse in expectation than reporting any other distribution:

\[ S(Q, Q) \le S(P, Q) \quad \text{for all } P. \]

It is strictly proper if equality holds only when \(P = Q\). Propriety is exactly the property that removes any incentive to misrepresent beliefs. If a forecaster privately believes the true distribution is \(Q\), a strictly proper rule guarantees that their unique expected loss minimizing report is \(Q\) itself. Section 6 returns to why this matters operationally.

165.2.2 2.2 The Decomposition into Calibration and Sharpness

Every proper score can be understood through two complementary properties of a forecaster. Calibration asks whether stated probabilities match empirical frequencies: among all days a model said 70 percent, did it rain about 70 percent of the time? Sharpness (or refinement) asks how concentrated the forecasts are; a model that always predicts the base rate is perfectly calibrated but useless. The guiding principle, due to Gneiting and colleagues, is to maximize sharpness subject to calibration. Proper scores reward both simultaneously, which is why they cannot be gamed by a forecaster who only fixes one.

A useful algebraic fact is that the expected score of any proper rule decomposes as

\[ \mathbb{E}[S(P, Y)] = \underbrace{\text{Uncertainty}}_{H(Q)} - \underbrace{\text{Resolution}}_{\ge 0} + \underbrace{\text{Reliability}}_{\ge 0}, \]

where uncertainty depends only on the underlying problem, resolution rewards forecasts that separate outcomes, and reliability penalizes miscalibration. We make this concrete for the Brier score in Section 4.

165.3 3. Logarithmic Score and Log Loss

165.3.1 3.1 Definition

The logarithmic score assigns loss equal to the negative log probability the forecast placed on the realized outcome. For a categorical outcome over classes \(1, \dots, K\) with forecast vector \(\hat{p} = (\hat{p}_1, \dots, \hat{p}_K)\) and realized class \(y\),

\[ S_{\log}(\hat{p}, y) = -\log \hat{p}_y . \]

Averaged over a dataset of \(n\) examples with one hot labels \(y_{i,k}\), this is the familiar cross entropy or log loss:

\[ \text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log \hat{p}_{i,k}. \]

For binary outcomes with label \(y \in \{0,1\}\) and predicted positive probability \(\hat{p}\), this reduces to the binary cross entropy

\[ S_{\log}(\hat{p}, y) = -\big[\, y \log \hat{p} + (1-y)\log(1-\hat{p}) \,\big]. \]
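As a minimal sketch of these definitions, the functions below compute the multiclass and binary forms and confirm that they agree when \(K = 2\). The function names are illustrative, classes are indexed from \(0\) rather than \(1\), and the inputs are assumed to be strictly between \(0\) and \(1\) (no clipping is applied here).

```python
import math

def multiclass_log_loss(probs, labels):
    """Mean of -log(probability assigned to the realized class).

    probs  : list of length-K probability vectors, one per example
    labels : list of realized class indices in 0..K-1
    """
    total = 0.0
    for p_vec, y in zip(probs, labels):
        total += -math.log(p_vec[y])
    return total / len(labels)

def binary_log_loss(p_pos, labels):
    """Binary cross entropy for positive-class probabilities p_pos."""
    total = 0.0
    for p, y in zip(p_pos, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Hypothetical forecasts and outcomes.
p_pos = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
as_vectors = [[1 - p, p] for p in p_pos]  # class 0 first, class 1 second

print(binary_log_loss(p_pos, labels), multiclass_log_loss(as_vectors, labels))
```

The one-hot double sum in the dataset formula collapses to a single lookup `p_vec[y]`, because only the realized class has \(y_{i,k} = 1\).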

165.3.2 3.2 Properness

To see that the log score is strictly proper, suppose the true class probabilities are \(q = (q_1, \dots, q_K)\) and the forecaster reports \(p\). The expected score is

\[ \mathbb{E}_{Y \sim q}\big[-\log p_Y\big] = -\sum_{k} q_k \log p_k = H(q) + D_{\mathrm{KL}}(q \,\|\, p), \]

where \(H(q) = -\sum_k q_k \log q_k\) is the Shannon entropy and \(D_{\mathrm{KL}}(q \| p) \ge 0\) is the Kullback-Leibler divergence. The entropy term does not depend on the report \(p\). The divergence is nonnegative and equals zero only when \(p = q\). Hence the expected log loss is minimized uniquely at \(p = q\), which is the definition of strict propriety. This identity also explains why the log score is the natural loss for maximum likelihood estimation: minimizing average log loss on a sample maximizes the likelihood, and in expectation it minimizes the KL divergence from the truth to the model.
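The identity can be checked numerically. The sketch below uses hypothetical values for \(q\) and for a misreport \(p\); the helper names are illustrative.

```python
import math

def expected_log_loss(q, p):
    """Expected log loss -sum_k q_k log p_k when truth is q and report is p."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

def entropy(q):
    """Shannon entropy H(q) in nats."""
    return -sum(qk * math.log(qk) for qk in q if qk > 0)

def kl_divergence(q, p):
    """Kullback-Leibler divergence D_KL(q || p) in nats."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

q = [0.7, 0.2, 0.1]   # hypothetical true class probabilities
p = [0.5, 0.3, 0.2]   # a hypothetical misreport

lhs = expected_log_loss(q, p)
rhs = entropy(q) + kl_divergence(q, p)
print(lhs, rhs)                       # the two sides of the identity
print(expected_log_loss(q, q), lhs)   # truthful report vs. misreport
```

The truthful report attains the entropy \(H(q)\), and the misreport pays exactly the divergence on top of it.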

165.3.3 3.3 Properties and Cautions

The log score is local: it depends only on the probability assigned to the outcome that actually occurred, ignoring how mass was spread over the other classes. When there are more than two possible outcomes it is, up to affine transformation, the only smooth local proper score, a result of considerable theoretical weight. Its defining practical hazard is that it is unbounded. If a model assigns probability zero to an event that then occurs, the loss is infinite. A single confident mistake can dominate an entire evaluation, so practitioners clip predictions away from \(0\) and \(1\) or apply smoothing.

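# binary log loss with clipping; assumes NumPy, p and y are equal-length arrays
import numpy as np

def clipped_log_loss(p, y, eps=1e-15):
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # keep both logs finite
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))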

165.4 4. The Brier Score

165.4.1 4.1 Definition

The Brier score, introduced by Glenn Brier in 1950 for weather forecasting, is the mean squared error between predicted probabilities and outcomes. For binary outcomes,

\[ \text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2 . \]

For the multiclass case with \(K\) classes and one hot labels it generalizes to

\[ \text{BS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} (\hat{p}_{i,k} - y_{i,k})^2 . \]

It is bounded in \([0, 1]\) for the binary form and in \([0, 2]\) for the multiclass form, which makes it more robust to overconfident errors than the log score.
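A minimal sketch of both forms follows, with illustrative function names and classes indexed from \(0\). It also shows why the two bounds differ: with \(K = 2\) the multiclass form counts the squared error once per class, so it equals twice the binary form.

```python
def brier_binary(p_pos, labels):
    """Mean squared error between positive-class probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(p_pos, labels)) / len(labels)

def brier_multiclass(probs, labels):
    """Mean over examples of the squared distance to the one-hot label."""
    total = 0.0
    for p_vec, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2
                     for k, pk in enumerate(p_vec))
    return total / len(labels)

# Hypothetical forecasts and outcomes.
p_pos = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
as_vectors = [[1 - p, p] for p in p_pos]  # class 0 first, class 1 second

print(brier_binary(p_pos, labels))            # binary form
print(brier_multiclass(as_vectors, labels))   # twice the binary form when K = 2
```

The worst cases are a fully confident wrong forecast: `brier_binary([1.0], [0])` reaches \(1\), and `brier_multiclass([[1.0, 0.0, 0.0]], [1])` reaches \(2\).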

165.4.2 4.2 Properness

The Brier score is strictly proper. With true probability \(q\) for a binary event and report \(p\), the expected score is

\[ \mathbb{E}_{Y \sim q}\big[(p - Y)^2\big] = q(1-p)^2 + (1-q)p^2 . \]

Differentiating with respect to \(p\) and setting the derivative to zero,

\[ \frac{d}{dp}\Big[q(1-p)^2 + (1-q)p^2\Big] = -2q(1-p) + 2(1-q)p = 2(p - q) = 0, \]

gives the unique minimizer \(p = q\). The second derivative is \(2 > 0\), confirming a strict minimum. So, as with the log score, truthful reporting is optimal.
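The same conclusion can be seen without calculus. Expanding the expected score gives \((p - q)^2 + q(1 - q)\), so any misreport pays its squared distance from \(q\) on top of the outcome variance. The sketch below checks this on a grid of candidate reports; the value of \(q\) and the grid resolution are arbitrary illustrative choices.

```python
def expected_brier(p, q):
    """E[(p - Y)^2] for Y ~ Bernoulli(q)."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

q = 0.3                                   # hypothetical true probability
grid = [i / 1000 for i in range(1001)]    # candidate reports in [0, 1]
best = min(grid, key=lambda p: expected_brier(p, q))

print(best)                                # grid minimizer
print(expected_brier(q, q), q * (1 - q))   # minimum equals the outcome variance
print(expected_brier(0.5, q) - expected_brier(q, q), (0.5 - q) ** 2)  # excess loss of hedging to 0.5
```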

165.4.3 4.3 The Murphy Decomposition

A celebrated result of Allan Murphy decomposes the Brier score into interpretable terms. Partition the \(n\) forecasts into bins indexed by \(k\) (here \(k\) indexes bins, not classes), where bin \(k\) contains \(n_k\) predictions sharing forecast value \(\bar{p}_k\) and observed frequency \(\bar{o}_k\), and let \(\bar{o}\) be the overall base rate. Then

\[ \text{BS} = \underbrace{\frac{1}{n}\sum_k n_k (\bar{p}_k - \bar{o}_k)^2}_{\text{Reliability}} - \underbrace{\frac{1}{n}\sum_k n_k (\bar{o}_k - \bar{o})^2}_{\text{Resolution}} + \underbrace{\bar{o}(1-\bar{o})}_{\text{Uncertainty}}. \]

Reliability measures calibration error and we want it small. Resolution measures how much the binned outcome frequencies deviate from the base rate and we want it large. Uncertainty is the irreducible variance of the outcome, fixed by the problem. This decomposition operationalizes the calibration versus sharpness tradeoff from Section 2.2 and connects directly to reliability diagrams, where one plots \(\bar{o}_k\) against \(\bar{p}_k\) and looks for points near the diagonal.
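The decomposition can be verified directly. The sketch below groups forecasts by their exact value, so every prediction in a bin shares the same \(\bar{p}_k\) and the identity holds exactly; with coarser bins that pool different forecast values it holds only approximately. The data and the function name are hypothetical.

```python
from collections import defaultdict

def murphy_decomposition(p_pos, labels):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    Forecasts are grouped by their exact value, one bin per distinct value.
    """
    n = len(labels)
    base_rate = sum(labels) / n
    bins = defaultdict(list)
    for p, y in zip(p_pos, labels):
        bins[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p_k, ys in bins.items():
        n_k = len(ys)
        o_k = sum(ys) / n_k                        # observed frequency in bin k
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)

# Hypothetical forecasts taking three distinct values.
p_pos  = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
labels = [0,   0,   0,   1,   0,   1,   1,   1,   1,   0]

rel, res, unc = murphy_decomposition(p_pos, labels)
brier = sum((p - y) ** 2 for p, y in zip(p_pos, labels)) / len(labels)
print(brier, rel - res + unc)   # the two numbers agree
```

The pairs \((\bar{p}_k, \bar{o}_k)\) computed inside the loop are exactly the points one would plot on a reliability diagram.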

165.4.4 4.4 Log Loss versus Brier in Practice

Both are strictly proper, so both incentivize honesty, but they weight errors differently. The log score grows without bound as confidence in a wrong answer increases, making it the more sensitive instrument for detecting and punishing overconfidence; it is the standard objective for training neural classifiers and language models. The Brier score, being bounded and quadratic, is gentler, more stable under outliers, and often preferred when reporting calibration to stakeholders or comparing well calibrated models. A common recommendation is to optimize log loss during training and report both log loss and Brier score, alongside a reliability diagram, at evaluation time.

165.5 5. The Continuous Ranked Probability Score

165.5.1 5.1 Motivation

Log loss and Brier score grade distributions over discrete labels. Many forecasts are real valued: tomorrow’s temperature, a delivery time, a demand quantity. For these we issue a full predictive distribution with cumulative distribution function \(F\) and want a proper score that rewards both calibration and sharpness while respecting distance on the real line. The continuous ranked probability score (CRPS) does this.

165.5.2 5.2 Definition

Given a predictive CDF \(F\) and a realized scalar outcome \(y\), the CRPS is the integrated squared difference between the forecast CDF and the step function at the observation:

\[ \text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big( F(x) - \mathbb{1}\{x \ge y\} \big)^2 \, dx . \]

The integrand compares the predicted cumulative probability at each threshold \(x\) to the ideal CDF, which jumps from \(0\) to \(1\) at the true value \(y\). CRPS can be read as the Brier score for the binary event “outcome at most \(x\),” integrated over all thresholds \(x\). Because the Brier score is proper at every threshold, the integral is proper too; it is strictly proper relative to forecast distributions with a finite first moment.

165.5.3 5.3 A Closed Form and Its Interpretation

An equivalent and computationally convenient expression uses two independent draws \(X, X'\) from the forecast distribution \(F\):

\[ \text{CRPS}(F, y) = \mathbb{E}_F |X - y| - \tfrac{1}{2}\, \mathbb{E}_F |X - X'| . \]

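The first term is the expected distance between a forecast draw and the observation; it is small when the forecast mass sits near \(y\) and grows with both bias and spread. The second term, half the expected absolute difference between two forecast draws, enters with a negative sign, so it credits back part of the spread penalty already charged by the first term. Without it, expected loss would be minimized by collapsing the forecast onto a single point (the median of the believed distribution), which would abandon calibration; with it, unwarranted spread is still penalized on net, but an honest statement of uncertainty is not. The two terms together enforce the sharpness subject to calibration principle. A pleasant property follows: if the forecast collapses to a point mass at a single value \(x\), the second term vanishes and CRPS reduces to the absolute error \(|x - y|\). CRPS is therefore reported in the same units as the outcome, which aids interpretation, and it generalizes mean absolute error to probabilistic forecasts.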

For a Gaussian forecast \(\mathcal{N}(\mu, \sigma^2)\) there is a closed form. With \(z = (y - \mu)/\sigma\) and \(\phi, \Phi\) the standard normal density and CDF,

\[ \text{CRPS}\big(\mathcal{N}(\mu,\sigma^2), y\big) = \sigma\left[ z\big(2\Phi(z) - 1\big) + 2\phi(z) - \frac{1}{\sqrt{\pi}} \right]. \]
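As a sanity check, the closed form can be compared with a direct numerical evaluation of the integral definition from Section 5.2. The sketch below uses only the standard library; the forecast parameters, integration window, and step count are arbitrary illustrative choices, and the midpoint rule is a simple stand-in for a proper quadrature routine.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of the forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0)
                    + 2.0 * normal_pdf(z) - 1.0 / math.sqrt(math.pi))

def crps_by_integration(mu, sigma, y, half_width=12.0, steps=100_000):
    """Midpoint-rule approximation of the integral definition of CRPS."""
    lo = min(mu, y) - half_width * sigma
    hi = max(mu, y) + half_width * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        step = 1.0 if x >= y else 0.0          # ideal CDF: jumps at the observation
        total += (normal_cdf((x - mu) / sigma) - step) ** 2
    return total * dx

mu, sigma, y = 20.0, 2.0, 23.0   # hypothetical temperature forecast and outcome
print(crps_gaussian(mu, sigma, y), crps_by_integration(mu, sigma, y))
```

The two values agree to within the discretization error, and both are smaller than the absolute error \(|y - \mu| = 3\) of the point forecast at the mean. As \(\sigma \to 0\) the closed form approaches that absolute error.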

When a closed form is unavailable, the ensemble (sample based) estimator below is standard.

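# ensemble CRPS from samples x_1..x_m of F and observation y; assumes NumPy
import numpy as np

def crps_ensemble(x, y):
    x = np.asarray(x, dtype=float)
    term1 = np.mean(np.abs(x - y))                          # mean over the m samples
    term2 = np.mean(np.abs(x[:, None] - x[None, :])) / 2    # mean over all m*m ordered pairs
    return term1 - term2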

165.6 6. Why Proper Scoring Rules Incentivize Honesty

The recurring algebra of Sections 3 and 4 is not a coincidence; it is the entire point. A proper scoring rule is engineered so that the report minimizing a forecaster’s own expected loss is precisely their true belief.

Consider a forecaster who privately believes the outcome probability is \(q\) but contemplates reporting some other value \(p\). Their subjectively expected loss is \(S(p, q) = \mathbb{E}_{Y \sim q}[S(p, Y)]\). Under a strictly proper rule this is uniquely minimized at \(p = q\). Any attempt to shade the forecast, to hedge toward \(0.5\) to look cautious, or to exaggerate toward \(0\) or \(1\) to look decisive, raises the forecaster’s own expected penalty. There is no separate enforcement mechanism: the metric is self enforcing because misreporting is self defeating.

This stands in sharp contrast to metrics that are not strictly proper. Accuracy is proper but not strictly so: every report on the same side of the decision threshold earns the same expected accuracy, so a model can report \(0\) or \(1\) regardless of its true uncertainty without losing anything, which destroys all probabilistic information. Mean absolute error on probabilities, \(|\hat{p} - y|\), is outright improper; its expected value \(q|1 - p| + (1-q)|p| = q + (1 - 2q)p\) is minimized at the boundary, \(p = 0\) when \(q < 0.5\) and \(p = 1\) when \(q > 0.5\), actively rewarding overconfident reports. Using such metrics to train or select probabilistic models silently encourages dishonest, miscalibrated outputs.
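Both failure modes are easy to reproduce. In the sketch below, with hypothetical true probabilities \(0.3\) and \(0.7\) and an assumed decision threshold of \(0.5\), the report minimizing expected absolute error sits at the boundary, and thresholded accuracy cannot tell the truthful report from the extreme one.

```python
def expected_absolute_error(p, q):
    """E|p - Y| for Y ~ Bernoulli(q) and a report p in [0, 1]."""
    return q * abs(1 - p) + (1 - q) * abs(p)

def expected_accuracy(p, q, threshold=0.5):
    """Probability that the thresholded decision matches Y ~ Bernoulli(q)."""
    return q if p >= threshold else 1 - q

grid = [i / 100 for i in range(101)]      # candidate reports in [0, 1]
for q in (0.3, 0.7):                      # hypothetical true probabilities
    best_mae = min(grid, key=lambda p: expected_absolute_error(p, q))
    extreme = float(round(q))             # 0.0 or 1.0, whichever is nearer
    same_accuracy = expected_accuracy(q, q) == expected_accuracy(extreme, q)
    print(q, best_mae, same_accuracy)
```

Under either metric, a forecaster loses nothing (or even gains) by replacing an honest \(0.7\) with a flat \(1\).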

The honesty guarantee carries three practical consequences. First, in model training, minimizing a proper score (typically log loss) over a flexible model class drives the fitted conditional probabilities toward the true conditionals, the foundation of probabilistic machine learning. Second, in model selection and evaluation, proper scores let us rank competing forecasters without fear that the winner merely exploited the metric. Third, in elicitation, when humans or markets are paid according to a proper score, the payment structure makes truthful probability reporting the rational strategy, which is why proper scoring rules underpin forecasting tournaments and prediction markets.

Two cautions temper the theory. Propriety is an expectation level property: it guarantees the right incentive on average under the true distribution, not that any single realized score is meaningful in isolation. And propriety says nothing about which proper rule to use; the choice among log loss, Brier, CRPS, and others should follow from how one wishes to weight calibration, tail behavior, boundedness, and units, as discussed throughout this chapter.

165.7 7. Summary

Probabilistic metrics evaluate full predictive distributions rather than hard decisions. Proper scoring rules are those for which truthful reporting minimizes expected loss, and strictly proper rules make truth the unique optimum. The logarithmic score yields log loss and cross entropy, is local and intimately tied to KL divergence and maximum likelihood, but is unbounded and unforgiving of confident errors. The Brier score is a bounded quadratic alternative whose Murphy decomposition exposes calibration, resolution, and intrinsic uncertainty. The continuous ranked probability score extends proper scoring to real valued forecasts, reduces to mean absolute error for point forecasts, and is reported in the outcome’s own units. Across all of them, propriety is the unifying design principle: it is what lets a number on a leaderboard stand in for honest, well calibrated belief.

165.8 References

  1. Gneiting, T. and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359 to 378. https://doi.org/10.1198/016214506000001437
  2. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1 to 3. https://doi.org/10.1175/1520-0493(1950)078%3C0001:VOFEIT%3E2.0.CO;2
  3. Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12(4), 595 to 600. https://doi.org/10.1175/1520-0450(1973)012%3C0595:ANVPOT%3E2.0.CO;2
  4. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. Journal of the Royal Statistical Society Series B, 69(2), 243 to 268. https://doi.org/10.1111/j.1467-9868.2007.00587.x
  5. Matheson, J. E. and Winkler, R. L. (1976). Scoring Rules for Continuous Probability Distributions. Management Science, 22(10), 1087 to 1096. https://doi.org/10.1287/mnsc.22.10.1087
  6. Gneiting, T. and Katzfuss, M. (2014). Probabilistic Forecasting. Annual Review of Statistics and Its Application, 1, 125 to 151. https://doi.org/10.1146/annurev-statistics-062713-085831
  7. Bröcker, J. (2009). Reliability, Sufficiency, and the Decomposition of Proper Scores. Quarterly Journal of the Royal Meteorological Society, 135(643), 1512 to 1519. https://doi.org/10.1002/qj.456
  8. Dawid, A. P. and Musio, M. (2014). Theory and Applications of Proper Scoring Rules. Metron, 72(2), 169 to 183. https://doi.org/10.1007/s40300-014-0039-y