165 Probabilistic Metrics and Proper Scoring Rules

Most machine learning systems do not merely predict; they assert degrees of belief. A weather model says rain is 70 percent likely, a credit model assigns a 3 percent default probability, a language model places a distribution over the next token. To evaluate such systems we need metrics that grade the entire probability distribution rather than a single hard decision. This chapter develops the theory of proper scoring rules, derives the two workhorse classification scores (log loss and the Brier score), extends the framework to real-valued forecasts through the continuous ranked probability score, and explains the central reason these metrics matter: they make honesty the optimal strategy.

165.1 1. From Decisions to Distributions

A classifier that outputs only a label can be scored with accuracy, precision, or recall. These metrics discard information. Two models that both predict “rain” may differ sharply: one assigns probability $0.51$, the other $0.99$. If it does not rain, the second model was far more wrong, yet a threshold based metric treats them identically. Probabilistic metrics retain this gradation.

Formally, let the outcome be a random variable $Y$ with realized value $y$, and let a forecaster issue a predictive distribution $P$ (or a density $p$, or for binary outcomes a probability $\hat{p}$). A scoring rule is a function $S(P, y)$ giving the reward (or, by convention here, the loss) when the forecast was $P$ and the outcome turned out to be $y$. Lower loss is better. The quality of a forecaster is summarized by the expected score under the true data generating distribution $Q$:

\[ S(P, Q) = \mathbb{E}_{Y \sim Q}\big[ S(P, Y) \big]. \]

The forecaster controls $P$; nature controls $Q$. The key design question is how to choose $S$ so that a rational forecaster who wants to minimize expected loss is driven to report $P = Q$.

The landscape of probabilistic metrics, and the place of proper scoring rules within it, is summarized below.

flowchart TD
    A["Forecast quality"] --> B["Hard decision metrics"]
    A --> C["Probabilistic metrics"]
    B --> B1["Accuracy, precision, recall"]
    C --> D["Proper scoring rules"]
    C --> E["Calibration diagnostics"]
    D --> D1["Log score (cross entropy)"]
    D --> D2["Brier score"]
    D --> D3["CRPS for real valued outcomes"]
    E --> E1["Reliability diagram"]
    E --> E2["Expected calibration error"]

Hard decision metrics answer “was the label right.” Proper scoring rules answer “was the stated belief honest and well resolved.” Calibration diagnostics are visual or summary companions to the scores, not substitutes for them.

165.2 2. Proper Scoring Rules

165.2.1 2.1 Definition

A scoring rule $S$ is proper if reporting the true distribution is never worse in expectation than reporting any other distribution:

\[ S(Q, Q) \le S(P, Q) \quad \text{for all } P. \]

It is strictly proper if equality holds only when $P = Q$. Propriety is exactly the property that removes any incentive to misrepresent beliefs. If a forecaster privately believes the true distribution is $Q$, a strictly proper rule guarantees that their unique expected loss minimizing report is $Q$ itself. Section 6 returns to why this matters operationally.

There is a clean geometric way to see propriety. Define the expected score function $G(Q) = S(Q, Q)$, the loss a perfectly truthful forecaster expects to incur against nature $Q$. A standard result of Gneiting and Raftery (reference 1), restated here for losses rather than rewards, says that a regular scoring rule is proper if and only if $G$ is concave and, for each fixed report $P$, the map $Q \mapsto S(P, Q)$ is a supporting affine function of $G$ at the point $P$. In words, the truthful report sits on the curve $G$ itself, while every untruthful report lies on a tangent line that can only sit at or above it. For the log score $G(Q) = H(Q)$ is the Shannon entropy, and for the Brier score $G(Q) = \sum_k q_k(1 - q_k)$ is the Gini index; both are concave. The gap between the tangent line and the curve, $S(P, Q) - S(Q, Q)$, is precisely a Bregman divergence between $Q$ and $P$, which recovers the Kullback Leibler and squared error penalties seen later in this chapter.
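The geometric claim can be checked numerically. The sketch below covers only the binary case and uses the loss convention of this chapter; the helper names (`expected_log`, `expected_brier`, `kl`) are illustrative and do not come from any library. It compares the gap between the expected score of a report $p$ and that of the truthful report $q$ against the KL divergence (log score) and the squared distance (Brier score).

```python
import math

def expected_log(p, q):
    """Expected log loss of report p when the event has true probability q."""
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

def expected_brier(p, q):
    """Expected binary Brier loss of report p under true probability q."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

def kl(q, p):
    """Kullback Leibler divergence between Bernoulli(q) and Bernoulli(p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

q = 0.3  # assumed true probability, chosen only for illustration
for p in (0.1, 0.3, 0.5, 0.9):
    gap_log = expected_log(p, q) - expected_log(q, q)
    gap_brier = expected_brier(p, q) - expected_brier(q, q)
    print(f"p={p:.1f}  log gap={gap_log:.4f}  KL={kl(q, p):.4f}  "
          f"Brier gap={gap_brier:.4f}  (p-q)^2={(p - q) ** 2:.4f}")
```

In each row the two log columns should agree, as should the two Brier columns, and every gap is zero only at $p = q$.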

165.2.2 2.2 The Decomposition into Calibration and Sharpness

Every proper score can be understood through two complementary properties of a forecaster. Calibration asks whether stated probabilities match empirical frequencies: among all days a model said 70 percent, did it rain about 70 percent of the time? Sharpness (or refinement) asks how concentrated the forecasts are; a model that always predicts the base rate is perfectly calibrated but useless. The guiding principle, due to Gneiting and colleagues, is to maximize sharpness subject to calibration. Proper scores reward both simultaneously, which is why they cannot be gamed by a forecaster who only fixes one.

A useful algebraic fact, established in general by Bröcker (2009), is that the expected score of any proper rule decomposes as

\[ \mathbb{E}[S(P, Y)] = \underbrace{\text{Uncertainty}}_{G(Q)} - \underbrace{\text{Resolution}}_{\ge 0} + \underbrace{\text{Reliability}}_{\ge 0}, \]

where uncertainty is the expected score function $G$ of Section 2.1 evaluated at the marginal outcome distribution (the Shannon entropy $H$ in the case of the log score) and depends only on the underlying problem, resolution rewards forecasts that separate outcomes, and reliability penalizes miscalibration. We make this concrete for the Brier score in Section 4.

165.3 3. Logarithmic Score and Log Loss

165.3.1 3.1 Definition

The logarithmic score assigns loss equal to the negative log probability the forecast placed on the realized outcome. For a categorical outcome over classes $1, \dots, K$ with forecast vector $\hat{p} = (\hat{p}_1, \dots, \hat{p}_K)$ and realized class $y$,

\[ S_{\log}(\hat{p}, y) = -\log \hat{p}_y . \]

Averaged over a dataset of $n$ examples with one hot labels $y_{i,k}$, this is the familiar cross entropy or log loss:

\[ \text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log \hat{p}_{i,k}. \]

For binary outcomes with label $y \in \{0,1\}$ and predicted positive probability $\hat{p}$, this reduces to the binary cross entropy

\[ S_{\log}(\hat{p}, y) = -\big[\, y \log \hat{p} + (1-y)\log(1-\hat{p}) \,\big]. \]
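As a minimal sketch of these definitions in plain Python, the two functions below compute the multiclass and binary forms. The function names and the toy inputs are assumptions made for illustration, and no clipping is applied, so a probability of exactly zero on the realized class raises an error instead of returning infinity.

```python
import math

def log_loss(probs, labels):
    """Mean negative log probability assigned to the realized class.

    probs  : list of per-example probability vectors (each sums to 1)
    labels : list of realized class indices, 0-based
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def binary_log_loss(p_hat, y):
    """Binary cross entropy for predicted positive probabilities p_hat and 0/1 labels y."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for pi, yi in zip(p_hat, y)) / len(y)

# Two examples: the first is a positive, the second a negative.
multiclass = log_loss([[0.2, 0.8], [0.9, 0.1]], [1, 0])
binary = binary_log_loss([0.8, 0.1], [1, 0])
print(multiclass, binary)
```

With $K = 2$ the two calls describe the same forecasts, so they should return the same number, which is the reduction stated above.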

165.3.2 3.2 Properness

To see that the log score is strictly proper, suppose the true class probabilities are $q = (q_1, \dots, q_K)$ and the forecaster reports $p$. The expected score is

\[ \mathbb{E}_{Y \sim q}\big[-\log p_Y\big] = -\sum_{k} q_k \log p_k = H(q) + D_{\mathrm{KL}}(q \,\|\, p), \]

where $H(q) = -\sum_k q_k \log q_k$ is the Shannon entropy and $D_{\mathrm{KL}}(q \| p) \ge 0$ is the Kullback Leibler divergence. The entropy term does not depend on the report $p$. The divergence is nonnegative and equals zero only when $p = q$. Hence the expected log loss is minimized uniquely at $p = q$, which is the definition of strict propriety. This identity also explains why the log score is the natural loss for maximum likelihood estimation: minimizing log loss is minimizing KL divergence to the truth.
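For completeness, the decomposition follows in one step by adding and subtracting $\sum_k q_k \log q_k$:

\[ -\sum_k q_k \log p_k = -\sum_k q_k \log q_k + \sum_k q_k \log \frac{q_k}{p_k} = H(q) + D_{\mathrm{KL}}(q \,\|\, p). \]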

165.3.3 3.3 Properties and Cautions

The log score is local: it depends only on the probability assigned to the outcome that actually occurred, ignoring how mass was spread over the other classes. Up to affine transformation it is the only smooth local proper score for categorical outcomes with three or more classes, a result of considerable theoretical weight (the characterization usually credited to McCarthy and to Bernardo). Locality is a feature when the structure among the outcomes that did not occur is meaningless, and a drawback when near misses should be penalized less than far misses, which is exactly the gap that CRPS fills for ordered outcomes.

Its defining practical hazard is unboundedness. If a model assigns probability zero to an event that then occurs, the loss is infinite. A single confident mistake can dominate an entire evaluation, so practitioners clip predictions away from $0$ and $1$ or apply smoothing.

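import numpy as np

def clipped_log_loss(y, p, eps=1e-15):
    """Mean binary log loss, with predictions clipped into [eps, 1 - eps]."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # keeps log() finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))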

165.4 4. The Brier Score

165.4.1 4.1 Definition

The Brier score, introduced by Glenn Brier in 1950 for weather forecasting, is the mean squared error between predicted probabilities and outcomes. For binary outcomes,

\[ \text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2 . \]

For the multiclass case with $K$ classes and one hot labels it generalizes to

\[ \text{BS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} (\hat{p}_{i,k} - y_{i,k})^2 . \]

It is bounded in $[0, 1]$ for the binary form and in $[0, 2]$ for the multiclass form, which makes it more robust to overconfident errors than the log score.
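A short sketch of both forms in plain Python follows; the function names and inputs are illustrative assumptions. It also shows a detail that is easy to miss when comparing numbers across tools: for $K = 2$ the multiclass form counts the error on both classes, so it equals twice the binary form.

```python
def brier_binary(p_hat, y):
    """Mean squared error between predicted positive probabilities and 0/1 labels."""
    return sum((p - yi) ** 2 for p, yi in zip(p_hat, y)) / len(y)

def brier_multiclass(probs, labels):
    """Mean over examples of the squared distance between the forecast vector
    and the one hot encoding of the realized class (0-based index)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)

binary = brier_binary([0.8, 0.1], [1, 0])
two_class = brier_multiclass([[0.2, 0.8], [0.9, 0.1]], [1, 0])
worst_case = brier_multiclass([[1.0, 0.0]], [1])  # all mass on the wrong class
print(binary, two_class, worst_case)
```

The last call puts all mass on the wrong class and attains the upper bound of the multiclass range.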

165.4.2 4.2 Properness

The Brier score is strictly proper. With true probability $q$ for a binary event and report $p$, the expected score is

\[ \mathbb{E}_{Y \sim q}\big[(p - Y)^2\big] = q(1-p)^2 + (1-q)p^2 . \]

Differentiating with respect to $p$ and setting the derivative to zero,

\[ \frac{d}{dp}\Big[q(1-p)^2 + (1-q)p^2\Big] = -2q(1-p) + 2(1-q)p = 2(p - q) = 0, \]

gives the unique minimizer $p = q$. The second derivative is $2 > 0$, confirming a strict minimum. So, as with the log score, truthful reporting is optimal.

165.4.3 4.3 The Murphy Decomposition

A celebrated result of Allan Murphy decomposes the Brier score into interpretable terms. Partition the $n$ forecasts into $B$ bins, where bin $k$ contains $n_k$ predictions sharing the forecast value $\bar{p}_k$ and having observed event frequency $\bar{o}_k$, and let $\bar{o}$ be the overall base rate. Then

\[ \text{BS} = \underbrace{\frac{1}{n}\sum_k n_k (\bar{p}_k - \bar{o}_k)^2}_{\text{Reliability}} - \underbrace{\frac{1}{n}\sum_k n_k (\bar{o}_k - \bar{o})^2}_{\text{Resolution}} + \underbrace{\bar{o}(1-\bar{o})}_{\text{Uncertainty}}. \]

Reliability measures calibration error and we want it small. Resolution measures how much the binned outcome frequencies deviate from the base rate and we want it large. Uncertainty is the irreducible variance of the outcome, fixed by the problem. This decomposition operationalizes the calibration versus sharpness tradeoff from Section 2.2 and connects directly to reliability diagrams, where one plots $\bar{o}_k$ against $\bar{p}_k$ and looks for points near the diagonal.
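The decomposition can be verified on a toy dataset. The sketch below is an illustration under one stated assumption: forecasts are grouped by their exact value, so every prediction in a bin really does share the same $\bar{p}_k$ and the identity holds exactly. The function name and the data are made up for the example.

```python
from collections import defaultdict

def murphy_decomposition(p_hat, y):
    """Return (reliability, resolution, uncertainty) for binary forecasts,
    binning forecasts by their exact value."""
    n = len(y)
    base_rate = sum(y) / n
    bins = defaultdict(list)              # forecast value -> outcomes in that bin
    for p, yi in zip(p_hat, y):
        bins[p].append(yi)
    reliability = sum(len(o) * (p - sum(o) / len(o)) ** 2 for p, o in bins.items()) / n
    resolution = sum(len(o) * (sum(o) / len(o) - base_rate) ** 2 for o in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

p_hat = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
y     = [0,   0,   0,   1,   0,   1,   1,   1,   1,   0]

rel, res, unc = murphy_decomposition(p_hat, y)
bs = sum((p - yi) ** 2 for p, yi in zip(p_hat, y)) / len(y)
print(f"reliability={rel:.3f} resolution={res:.3f} uncertainty={unc:.3f}")
print(f"rel - res + unc = {rel - res + unc:.3f}, Brier score = {bs:.3f}")
```

When continuous forecasts are grouped into interval bins and represented by the bin average, forecasts within a bin no longer coincide and the three terms sum to the Brier score only approximately.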

165.4.4 4.4 Log Loss versus Brier in Practice

Both are strictly proper, so both incentivize honesty, but they weight errors differently. The log score grows without bound as confidence in a wrong answer increases, making it the more sensitive instrument for detecting and punishing overconfidence; it is the standard objective for training neural classifiers and language models. The Brier score, being bounded and quadratic, is gentler, more stable under outliers, and often preferred when reporting calibration to stakeholders or comparing well calibrated models. A common recommendation is to optimize log loss during training and report both log loss and Brier score, alongside a reliability diagram, at evaluation time.

165.4.5 4.5 A Worked Comparison

The difference in temperament between the two scores is easiest to feel numerically. Consider three forecasters facing a single binary event that turns out to be $y = 1$.

| Forecast $\hat{p}$ | Description | Log loss $-\log \hat{p}$ | Brier $(\hat{p}-1)^2$ |
|---|---|---|---|
| $0.90$ | confident and correct | $0.105$ | $0.010$ |
| $0.50$ | maximally uncertain | $0.693$ | $0.250$ |
| $0.01$ | confident and wrong | $4.605$ | $0.980$ |

Both scores agree on the ranking and both reward the correct confident forecast. The instructive contrast is the bottom row. Moving from a fence sitting $0.50$ to a confidently wrong $0.01$ multiplies the log loss by roughly $6.6$ but the Brier score by only $3.9$, and as $\hat{p} \to 0$ the log loss diverges to infinity while the Brier penalty saturates at $1$. This is the bounded versus unbounded distinction made concrete. If a single egregiously overconfident prediction should be allowed to dominate the leaderboard, prefer log loss; if it should not, prefer Brier. Averaging the per example losses over a test set gives the reported metric in each case.
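The numbers in this comparison are easy to reproduce. The snippet below is a small sketch in plain Python, with a hypothetical helper `per_example_losses`, that recomputes the per example losses for the three forecasts and the two growth ratios quoted above.

```python
import math

def per_example_losses(p_hat):
    """Log loss and Brier loss for one forecast p_hat when the event occurs (y = 1)."""
    return -math.log(p_hat), (p_hat - 1.0) ** 2

rows = {p: per_example_losses(p) for p in (0.90, 0.50, 0.01)}
for p, (log_loss, brier) in rows.items():
    print(f"p_hat={p:.2f}  log loss={log_loss:.3f}  Brier={brier:.3f}")

# Growth of each loss when moving from the fence sitter to the confidently wrong forecast.
log_ratio = rows[0.01][0] / rows[0.50][0]
brier_ratio = rows[0.01][1] / rows[0.50][1]
print(f"log loss ratio={log_ratio:.1f}  Brier ratio={brier_ratio:.1f}")
```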

165.5 5. The Continuous Ranked Probability Score

165.5.1 5.1 Motivation

Log loss and Brier score grade distributions over discrete labels. Many forecasts are real valued: tomorrow’s temperature, a delivery time, a demand quantity. For these we issue a full predictive distribution with cumulative distribution function $F$ and want a proper score that rewards both calibration and sharpness while respecting distance on the real line. The continuous ranked probability score (CRPS) does this.

165.5.2 5.2 Definition

Given a predictive CDF $F$ and a realized scalar outcome $y$, the CRPS is the integrated squared difference between the forecast CDF and the step function at the observation:

\[ \text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big( F(x) - \mathbb{1}\{x \ge y\} \big)^2 \, dx . \]

The integrand compares the predicted cumulative probability at each threshold $x$ to the ideal CDF, which jumps from $0$ to $1$ at the true value $y$. CRPS can be read as the Brier score for the event “outcome at most $x$,” integrated over all thresholds $x$. This makes its properness inherited from the properness of the Brier score at every threshold.

165.5.3 5.3 A Closed Form and Its Interpretation

An equivalent and computationally convenient expression uses two independent draws $X, X'$ from the forecast distribution $F$:

\[ \text{CRPS}(F, y) = \mathbb{E}_F |X - y| - \tfrac{1}{2}\, \mathbb{E}_F |X - X'| . \]

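The first term measures how far the forecast mass sits from the observation; it grows with both bias and spread. The second term, half the expected absolute difference between two independent draws from the forecast, is subtracted, so it credits back the part of the first term that is due purely to the forecast's own spread. Without it, expected loss would be minimized by a point forecast at the median and honest uncertainty would be penalized. The net effect is that spread pays off only where it is needed to cover the outcome, which is the sharpness subject to calibration principle in action. A pleasant property follows: if the forecast collapses to a point mass at a single value $x_0$, the second term vanishes and CRPS reduces to the absolute error $|x_0 - y|$. CRPS is therefore reported in the same units as the outcome, which aids interpretation, and it generalizes mean absolute error to probabilistic forecasts.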

For a Gaussian forecast $\mathcal{N}(\mu, \sigma^2)$ there is a closed form. With $z = (y - \mu)/\sigma$ and $\phi, \Phi$ the standard normal density and CDF,

\[ \text{CRPS}\big(\mathcal{N}(\mu,\sigma^2), y\big) = \sigma\left[ z\big(2\Phi(z) - 1\big) + 2\phi(z) - \frac{1}{\sqrt{\pi}} \right]. \]
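As a sanity check, the closed form can be compared with a direct numerical integration of the definition in Section 5.2. The sketch below uses only the standard library; the function names, the integration window of 12 standard deviations, and the step count are assumptions chosen for illustration, not tuned values.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def crps_gaussian(mu, sigma, y):
    """Closed form CRPS of the forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm_cdf(z) - 1) + 2 * norm_pdf(z) - 1 / math.sqrt(math.pi))

def crps_by_integration(mu, sigma, y, half_width=12.0, steps=100_000):
    """Midpoint rule approximation of the integral of (F(x) - 1{x >= y})^2."""
    lo = min(mu, y) - half_width * sigma
    hi = max(mu, y) + half_width * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        step = 1.0 if x >= y else 0.0
        total += (norm_cdf((x - mu) / sigma) - step) ** 2
    return total * dx

print(crps_gaussian(1.0, 2.0, 2.5), crps_by_integration(1.0, 2.0, 2.5))
```

The two printed values should agree to several decimal places; the residual difference comes from the discretization around the jump at $y$.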

When a closed form is unavailable, the ensemble (sample based) estimator below is standard.

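import numpy as np

def crps_ensemble(x, y):
    """CRPS estimate from samples x_1..x_m of the forecast F and a scalar observation y."""
    x = np.asarray(x, dtype=float)
    term1 = np.mean(np.abs(x - y))                        # estimates E|X - y|
    term2 = np.mean(np.abs(x[:, None] - x[None, :])) / 2  # E|X - X'| / 2 over all m^2 pairs
    return term1 - term2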
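The pairwise term in the ensemble estimator costs $O(m^2)$ operations. A common implementation choice, sketched here in plain Python under hypothetical function names, is to sort the samples and use the identity $\sum_{j,k} |x_j - x_k| = 2 \sum_{i=1}^{m} (2i - m - 1)\, x_{(i)}$, where $x_{(i)}$ is the $i$-th smallest sample. This gives the same value in $O(m \log m)$.

```python
def crps_pairwise(samples, y):
    """Ensemble CRPS with the spread term computed over all m^2 ordered pairs."""
    m = len(samples)
    term1 = sum(abs(x - y) for x in samples) / m
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2

def crps_sorted(samples, y):
    """Same estimator, with the spread term computed from the sorted samples."""
    m = len(samples)
    xs = sorted(samples)
    term1 = sum(abs(x - y) for x in xs) / m
    # (1 / (2 m^2)) * sum over ordered pairs |x_j - x_k| = (1 / m^2) * sum_i (2i - m - 1) x_(i)
    term2 = sum((2 * i - m - 1) * x for i, x in enumerate(xs, start=1)) / (m * m)
    return term1 - term2

samples, y = [1.0, 3.0, 2.0, 7.0], 2.5
print(crps_pairwise(samples, y), crps_sorted(samples, y))
print(crps_pairwise([4.0], 1.5))  # a single sample behaves like a point forecast
```

The last line illustrates the point mass property: with one sample the spread term is zero and the estimator returns the absolute error.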

165.6 6. Why Proper Scoring Rules Incentivize Honesty

The recurring algebra of Sections 3 and 4 is not a coincidence; it is the entire point. A proper scoring rule is engineered so that the report minimizing a forecaster’s own expected loss is precisely their true belief.

Consider a forecaster who privately believes the outcome probability is $q$ but contemplates reporting some other value $p$. Their subjectively expected loss is $S(p, q) = \mathbb{E}_{Y \sim q}[S(p, Y)]$. Under a strictly proper rule this is uniquely minimized at $p = q$. Any attempt to shade the forecast, to hedge toward $0.5$ to look cautious, or to exaggerate toward $0$ or $1$ to look decisive, raises the forecaster’s own expected penalty. There is no separate enforcement mechanism: the metric is self enforcing because misreporting is self defeating.

This stands in sharp contrast to metrics that are not strictly proper. Accuracy is proper but not strictly proper for probabilities: it depends on the report only through the side of the threshold on which it falls, so a model can maximize accuracy by reporting $0$ or $1$ regardless of its true uncertainty, which destroys all probabilistic information. Mean absolute error on probabilities, $|\hat{p} - y|$, is outright improper; its expected value $q|1 - p| + (1-q)|p| = q + (1 - 2q)p$ is minimized at the boundary, at $p = 0$ when $q < 0.5$ and at $p = 1$ when $q > 0.5$, again rewarding overconfident reports. Using such metrics to train or select probabilistic models silently encourages dishonest, miscalibrated outputs.
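The contrast can be seen with a brute force search over reports. The sketch below, with illustrative function names and a belief of $q = 0.7$ chosen arbitrarily, finds the report on a grid that minimizes the expected loss under each rule.

```python
import math

def argmin_report(expected_loss, q, grid_size=1001):
    """Grid search for the report p in [0, 1] that minimizes expected loss under belief q."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda p: expected_loss(p, q))

def exp_brier(p, q):
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

def exp_log(p, q):
    if p <= 0.0 or p >= 1.0:
        return math.inf  # infinite expected log loss at the boundary when 0 < q < 1
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

def exp_abs(p, q):
    return q * abs(1 - p) + (1 - q) * abs(p)

q = 0.7
best = {name: argmin_report(f, q)
        for name, f in (("brier", exp_brier), ("log", exp_log), ("abs", exp_abs))}
print(best)
```

Under the Brier and log scores the best report coincides with the belief, while under absolute error the search is pushed to the boundary of the interval.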

The honesty guarantee carries three practical consequences. First, in model training, minimizing a proper score (typically log loss) over a flexible model class drives the fitted conditional probabilities toward the true conditionals, the foundation of probabilistic machine learning. Second, in model selection and evaluation, proper scores let us rank competing forecasters without fear that the winner merely exploited the metric. Third, in elicitation, when humans or markets are paid according to a proper score, the payment structure makes truthful probability reporting the rational strategy, which is why proper scoring rules underpin forecasting tournaments and prediction markets.

Two cautions temper the theory. Propriety is an expectation level property: it guarantees the right incentive on average under the true distribution, not that any single realized score is meaningful in isolation. And propriety says nothing about which proper rule to use; the choice among log loss, Brier, CRPS, and others should follow from how one wishes to weight calibration, tail behavior, boundedness, and units, as discussed throughout this chapter.

165.7 7. Choosing a Score: When to Use What, and Common Pitfalls

The three scores are not competitors so much as instruments tuned to different outcome types and reporting goals.

| Situation | Recommended score | Reason |
|---|---|---|
| Training neural classifiers or language models | Log loss | Differentiable, unbounded, matches maximum likelihood |
| Reporting calibration to stakeholders | Brier score plus reliability diagram | Bounded, decomposes into reliability and resolution |
| Comparing well calibrated, similar models | Brier score | Stable under outliers, less dominated by single errors |
| Real valued or ordered forecasts | CRPS | Respects distance on the line, reported in outcome units |
| Auditing overconfidence | Log loss | Diverges on confident errors, surfacing them sharply |

A few recurring pitfalls are worth naming explicitly.

- Treating accuracy as a probability metric. Thresholding at $0.5$ and counting hits discards calibration entirely and is not strictly proper, as Section 6 discusses. Report a strictly proper score alongside any accuracy figure.
- Averaging log loss without clipping. One probability of exactly $0$ on a realized event makes the mean infinite. Clip to $[\varepsilon, 1-\varepsilon]$ or smooth, and disclose the clipping constant, since it affects the number.
- Reading a single realized score as quality. Propriety is an expectation level guarantee. A good forecaster can still incur a large loss on an unlucky draw. Compare scores across many examples and, where possible, with confidence intervals from resampling.
- Comparing scores across different datasets or base rates. The uncertainty term in the Murphy decomposition depends on the base rate, so raw scores are not portable. Use skill scores, which normalize against a reference forecast such as the climatological base rate, when comparing across problems.
- Confusing calibration with skill. A model that always predicts the base rate is perfectly calibrated and worthless. Sharpness, captured by the resolution term, is what separates useful forecasts from trivial ones.
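Two of the remedies named in this list, skill scores and resampling intervals, can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the function names, the toy data, the choice of the sample base rate as the reference forecast, and the percentile bootstrap with 2000 resamples.

```python
import random

def brier(p_hat, y):
    return sum((p - yi) ** 2 for p, yi in zip(p_hat, y)) / len(y)

def brier_skill_score(p_hat, y):
    """1 - BS / BS_ref, where the reference always forecasts the sample base rate.
    Undefined (division by zero) if all outcomes in y are identical."""
    base_rate = sum(y) / len(y)
    bs_ref = brier([base_rate] * len(y), y)
    return 1.0 - brier(p_hat, y) / bs_ref

def bootstrap_ci(p_hat, y, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric, resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([p_hat[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

p_hat = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
y     = [0,   0,   0,   1,   0,   1,   1,   1,   1,   0]

skill = brier_skill_score(p_hat, y)
lo, hi = bootstrap_ci(p_hat, y, brier)
print(f"Brier skill score = {skill:.3f}, 95% interval for Brier score = ({lo:.3f}, {hi:.3f})")
```

A skill score of zero means no improvement over always forecasting the base rate, and positive values mean improvement. The interval is computed for the raw Brier score only, because on a dataset this small a resample can contain a single outcome class, which leaves the skill score undefined.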

Mature open source tooling covers all three scores. The scikit-learn library provides log_loss and brier_score_loss for scoring and calibration_curve for reliability diagrams; the properscoring package and the scoringrules package implement CRPS with both the closed form Gaussian and the ensemble estimators.

165.8 8. Summary

Probabilistic metrics evaluate full predictive distributions rather than hard decisions. Proper scoring rules are those for which truthful reporting minimizes expected loss, and strictly proper rules make truth the unique optimum. The logarithmic score yields log loss and cross entropy, is local and intimately tied to KL divergence and maximum likelihood, but is unbounded and unforgiving of confident errors. The Brier score is a bounded quadratic alternative whose Murphy decomposition exposes calibration, resolution, and intrinsic uncertainty. The continuous ranked probability score extends proper scoring to real valued forecasts, reduces to mean absolute error for point forecasts, and is reported in the outcome’s own units. Across all of them, propriety is the unifying design principle: it is what lets a number on a leaderboard stand in for honest, well calibrated belief.

165.9 References

1. Gneiting, T. and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359 to 378. https://doi.org/10.1198/016214506000001437
2. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1 to 3. https://doi.org/10.1175/1520-0493(1950)078%3C0001:VOFEIT%3E2.0.CO;2
3. Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12(4), 595 to 600. https://doi.org/10.1175/1520-0450(1973)012%3C0595:ANVPOT%3E2.0.CO;2
4. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. Journal of the Royal Statistical Society Series B, 69(2), 243 to 268. https://doi.org/10.1111/j.1467-9868.2007.00587.x
5. Matheson, J. E. and Winkler, R. L. (1976). Scoring Rules for Continuous Probability Distributions. Management Science, 22(10), 1087 to 1096. https://doi.org/10.1287/mnsc.22.10.1087
6. Gneiting, T. and Katzfuss, M. (2014). Probabilistic Forecasting. Annual Review of Statistics and Its Application, 1, 125 to 151. https://doi.org/10.1146/annurev-statistics-062713-085831
7. Bröcker, J. (2009). Reliability, Sufficiency, and the Decomposition of Proper Scores. Quarterly Journal of the Royal Meteorological Society, 135(643), 1512 to 1519. https://doi.org/10.1002/qj.456
8. Dawid, A. P. and Musio, M. (2014). Theory and Applications of Proper Scoring Rules. Metron, 72(2), 169 to 183. https://doi.org/10.1007/s40300-014-0039-y

# Probabilistic Metrics and Proper Scoring Rules Most machine learning systems do not merely predict; they assert degrees of belief. A weather model says rain is 70 percent likely, a credit model assigns a 3 percent default probability, a language model places a distribution over the next token. To evaluate such systems we need metrics that grade the entire probability distribution rather than a single hard decision. This chapter develops the theory of proper scoring rules, derives the two workhorse classification scores (log loss and the Brier score), extends the framework to real-valued forecasts through the continuous ranked probability score, and explains the central reason these metrics matter: they make honesty the optimal strategy. ## 1. From Decisions to Distributions A classifier that outputs only a label can be scored with accuracy, precision, or recall. These metrics discard information. Two models that both predict "rain" may differ sharply: one assigns probability $0.51$, the other $0.99$. If it does not rain, the second model was far more wrong, yet a threshold based metric treats them identically. Probabilistic metrics retain this gradation. Formally, let the outcome be a random variable $Y$ with realized value $y$, and let a forecaster issue a predictive distribution $P$ (or a density $p$, or for binary outcomes a probability $\hat{p}$). A **scoring rule** is a function $S(P, y)$ giving the reward (or, by convention here, the loss) when the forecast was $P$ and the outcome turned out to be $y$. Lower loss is better. The quality of a forecaster is summarized by the expected score under the true data generating distribution $Q$: $$ S(P, Q) = \mathbb{E}_{Y \sim Q}\big[ S(P, Y) \big]. $$ The forecaster controls $P$; nature controls $Q$. The key design question is how to choose $S$ so that a rational forecaster who wants to minimize expected loss is driven to report $P = Q$. 
The landscape of probabilistic metrics, and the place of proper scoring rules within it, is summarized below. ```{mermaid} flowchart TD A["Forecast quality"] --> B["Hard decision metrics"] A --> C["Probabilistic metrics"] B --> B1["Accuracy, precision, recall"] C --> D["Proper scoring rules"] C --> E["Calibration diagnostics"] D --> D1["Log score (cross entropy)"] D --> D2["Brier score"] D --> D3["CRPS for real valued outcomes"] E --> E1["Reliability diagram"] E --> E2["Expected calibration error"] ``` Hard decision metrics answer "was the label right." Proper scoring rules answer "was the stated belief honest and well resolved." Calibration diagnostics are visual or summary companions to the scores, not substitutes for them. ## 2. Proper Scoring Rules ### 2.1 Definition A scoring rule $S$ is **proper** if reporting the true distribution is never worse in expectation than reporting any other distribution: $$ S(Q, Q) \le S(P, Q) \quad \text{for all } P. $$ It is **strictly proper** if equality holds only when $P = Q$. Propriety is exactly the property that removes any incentive to misrepresent beliefs. If a forecaster privately believes the true distribution is $Q$, a strictly proper rule guarantees that their unique expected loss minimizing report is $Q$ itself. Section 6 returns to why this matters operationally. There is a clean geometric way to see propriety. Define the **expected score function** $G(Q) = S(Q, Q)$, the loss a perfectly truthful forecaster expects to incur against nature $Q$. A standard result of Gneiting and Raftery (reference 1) states that a regular scoring rule is proper if and only if $G$ is concave and $S(P, Q)$ is a supporting affine function of $G$ at the point $P$. In words, the truthful report sits on the function $G$ itself, while every untruthful report lies on a tangent line that can only sit at or above it. 
For the log score $G(Q) = -H(Q)$ is the negative entropy, and for the Brier score $G(Q) = -\sum_k q_k(1 - q_k)$ is the negative Gini index. The gap between the tangent line and the curve is precisely a Bregman divergence from $Q$ to $P$, which recovers the Kullback Leibler and squared error penalties seen later in this chapter. ### 2.2 The Decomposition into Calibration and Sharpness Every proper score can be understood through two complementary properties of a forecaster. **Calibration** asks whether stated probabilities match empirical frequencies: among all days a model said 70 percent, did it rain about 70 percent of the time? **Sharpness** (or refinement) asks how concentrated the forecasts are; a model that always predicts the base rate is perfectly calibrated but useless. The guiding principle, due to Gneiting and colleagues, is to maximize sharpness subject to calibration. Proper scores reward both simultaneously, which is why they cannot be gamed by a forecaster who only fixes one. A useful algebraic fact is that the expected score of any proper rule decomposes as $$ \mathbb{E}[S(P, Y)] = \underbrace{\text{Uncertainty}}_{H(Q)} - \underbrace{\text{Resolution}}_{\ge 0} + \underbrace{\text{Reliability}}_{\ge 0}, $$ where uncertainty depends only on the underlying problem, resolution rewards forecasts that separate outcomes, and reliability penalizes miscalibration. We make this concrete for the Brier score in Section 4. ## 3. Logarithmic Score and Log Loss ### 3.1 Definition The **logarithmic score** assigns loss equal to the negative log probability the forecast placed on the realized outcome. For a categorical outcome over classes $1, \dots, K$ with forecast vector $\hat{p} = (\hat{p}_1, \dots, \hat{p}_K)$ and realized class $y$, $$ S_{\log}(\hat{p}, y) = -\log \hat{p}_y . 
$$ Averaged over a dataset of $n$ examples with one hot labels $y_{i,k}$, this is the familiar **cross entropy** or **log loss**: $$ \text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log \hat{p}_{i,k}. $$ For binary outcomes with label $y \in \{0,1\}$ and predicted positive probability $\hat{p}$, this reduces to the binary cross entropy $$ S_{\log}(\hat{p}, y) = -\big[\, y \log \hat{p} + (1-y)\log(1-\hat{p}) \,\big]. $$ ### 3.2 Properness To see that the log score is strictly proper, suppose the true class probabilities are $q = (q_1, \dots, q_K)$ and the forecaster reports $p$. The expected score is $$ \mathbb{E}_{Y \sim q}\big[-\log p_Y\big] = -\sum_{k} q_k \log p_k = H(q) + D_{\mathrm{KL}}(q \,\|\, p), $$ where $H(q) = -\sum_k q_k \log q_k$ is the Shannon entropy and $D_{\mathrm{KL}}(q \| p) \ge 0$ is the Kullback Leibler divergence. The entropy term does not depend on the report $p$. The divergence is nonnegative and equals zero only when $p = q$. Hence the expected log loss is minimized uniquely at $p = q$, which is the definition of strict propriety. This identity also explains why the log score is the natural loss for maximum likelihood estimation: minimizing log loss is minimizing KL divergence to the truth. ### 3.3 Properties and Cautions The log score is **local**: it depends only on the probability assigned to the outcome that actually occurred, ignoring how mass was spread over the other classes. Up to affine transformation it is the only smooth local proper score for categorical outcomes with three or more classes, a result of considerable theoretical weight (the Bernstein and McCarthy characterization). Locality is a feature when the off outcome structure is meaningless, and a drawback when near misses should be penalized less than far misses, which is exactly the gap that CRPS fills for ordered outcomes. Its defining practical hazard is unboundedness. 
If a model assigns probability zero to an event that then occurs, the loss is infinite. A single confident mistake can dominate an entire evaluation, so practitioners clip predictions away from $0$ and $1$ or apply smoothing. ```text # conceptual, not executable eps = 1e-15 p = clip(p, eps, 1 - eps) loss = -mean( y*log(p) + (1-y)*log(1-p) ) ``` ## 4. The Brier Score ### 4.1 Definition The **Brier score**, introduced by Glenn Brier in 1950 for weather forecasting, is the mean squared error between predicted probabilities and outcomes. For binary outcomes, $$ \text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2 . $$ For the multiclass case with $K$ classes and one hot labels it generalizes to $$ \text{BS} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} (\hat{p}_{i,k} - y_{i,k})^2 . $$ It is bounded in $[0, 1]$ for the binary form and in $[0, 2]$ for the multiclass form, which makes it more robust to overconfident errors than the log score. ### 4.2 Properness The Brier score is strictly proper. With true probability $q$ for a binary event and report $p$, the expected score is $$ \mathbb{E}_{Y \sim q}\big[(p - Y)^2\big] = q(1-p)^2 + (1-q)p^2 . $$ Differentiating with respect to $p$ and setting the derivative to zero, $$ \frac{d}{dp}\Big[q(1-p)^2 + (1-q)p^2\Big] = -2q(1-p) + 2(1-q)p = 2(p - q) = 0, $$ gives the unique minimizer $p = q$. The second derivative is $2 > 0$, confirming a strict minimum. So, as with the log score, truthful reporting is optimal. ### 4.3 The Murphy Decomposition A celebrated result of Allan Murphy decomposes the Brier score into interpretable terms. Partition forecasts into $K$ bins where bin $k$ contains $n_k$ predictions sharing forecast value $\bar{p}_k$ and observed frequency $\bar{o}_k$, and let $\bar{o}$ be the overall base rate. 
Then $$ \text{BS} = \underbrace{\frac{1}{n}\sum_k n_k (\bar{p}_k - \bar{o}_k)^2}_{\text{Reliability}} - \underbrace{\frac{1}{n}\sum_k n_k (\bar{o}_k - \bar{o})^2}_{\text{Resolution}} + \underbrace{\bar{o}(1-\bar{o})}_{\text{Uncertainty}}. $$ Reliability measures calibration error and we want it small. Resolution measures how much the binned outcome frequencies deviate from the base rate and we want it large. Uncertainty is the irreducible variance of the outcome, fixed by the problem. This decomposition operationalizes the calibration versus sharpness tradeoff from Section 2.2 and connects directly to reliability diagrams, where one plots $\bar{o}_k$ against $\bar{p}_k$ and looks for points near the diagonal. ### 4.4 Log Loss versus Brier in Practice Both are strictly proper, so both incentivize honesty, but they weight errors differently. The log score grows without bound as confidence in a wrong answer increases, making it the more sensitive instrument for detecting and punishing overconfidence; it is the standard objective for training neural classifiers and language models. The Brier score, being bounded and quadratic, is gentler, more stable under outliers, and often preferred when reporting calibration to stakeholders or comparing well calibrated models. A common recommendation is to optimize log loss during training and report both log loss and Brier score, alongside a reliability diagram, at evaluation time. ### 4.5 A Worked Comparison The difference in temperament between the two scores is easiest to feel numerically. Consider three forecasters facing a single binary event that turns out to be $y = 1$. | Forecast $\hat{p}$ | Description | Log loss $-\log \hat{p}$ | Brier $(\hat{p}-1)^2$ | |---|---|---|---| | $0.90$ | confident and correct | $0.105$ | $0.010$ | | $0.50$ | maximally uncertain | $0.693$ | $0.250$ | | $0.01$ | confident and wrong | $4.605$ | $0.980$ | Both scores agree on the ranking and both reward the correct confident forecast. 
The instructive contrast is the bottom row. Moving from a fence-sitting $0.50$ to a confidently wrong $0.01$ multiplies the log loss by roughly $6.6$ but the Brier score by only $3.9$, and as $\hat{p} \to 0$ the log loss diverges to infinity while the Brier penalty saturates at $1$. This is the bounded versus unbounded distinction made concrete. If a single egregiously overconfident prediction should be allowed to dominate the leaderboard, prefer log loss; if it should not, prefer Brier. Averaging the per-example losses over a test set gives the reported metric in each case.

## 5. The Continuous Ranked Probability Score

### 5.1 Motivation

Log loss and Brier score grade distributions over discrete labels. Many forecasts are real-valued: tomorrow's temperature, a delivery time, a demand quantity. For these we issue a full predictive distribution with cumulative distribution function $F$ and want a proper score that rewards both calibration and sharpness while respecting distance on the real line. The **continuous ranked probability score** (CRPS) does this.

### 5.2 Definition

Given a predictive CDF $F$ and a realized scalar outcome $y$, the CRPS is the integrated squared difference between the forecast CDF and the step function at the observation:

$$
\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big( F(x) - \mathbb{1}\{x \ge y\} \big)^2 \, dx .
$$

The integrand compares the predicted cumulative probability at each threshold $x$ to the ideal CDF, which jumps from $0$ to $1$ at the true value $y$. CRPS can be read as the Brier score for the event "outcome at most $x$," integrated over all thresholds $x$. Its properness is therefore inherited from the properness of the Brier score at every threshold.

### 5.3 A Closed Form and Its Interpretation

An equivalent and computationally convenient expression uses two independent draws $X, X'$ from the forecast distribution $F$:

$$
\text{CRPS}(F, y) = \mathbb{E}_F |X - y| - \tfrac{1}{2}\, \mathbb{E}_F |X - X'| .
$$

The first term rewards forecasts whose mass sits near the observation, and it grows when the forecast is needlessly diffuse. The second term, half the expected absolute spread within the forecast, is subtracted: it offsets the spread that an honest forecast must carry, so the score does not push the forecaster to collapse to a point. The two terms together enforce the principle of sharpness subject to calibration. A pleasant property follows: if the forecast collapses to a point mass at a single value $x_0$, the spread term vanishes and CRPS reduces to the absolute error $|x_0 - y|$. CRPS is therefore reported in the same units as the outcome, which aids interpretation, and it generalizes mean absolute error to probabilistic forecasts.

For a Gaussian forecast $\mathcal{N}(\mu, \sigma^2)$ there is a closed form. With $z = (y - \mu)/\sigma$ and $\phi, \Phi$ the standard normal density and CDF,

$$
\text{CRPS}\big(\mathcal{N}(\mu,\sigma^2), y\big) = \sigma\left[ z\big(2\Phi(z) - 1\big) + 2\phi(z) - \frac{1}{\sqrt{\pi}} \right].
$$

When a closed form is unavailable, the ensemble (sample-based) estimator below is standard.

```python
import numpy as np

def crps_ensemble(samples, y):
    """Sample-based CRPS from draws x_1..x_m of F and a scalar observation y."""
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))                          # E|X - y| over the m samples
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # E|X - X'| / 2 over all m^2 pairs
    return term1 - term2
```

## 6. Why Proper Scoring Rules Incentivize Honesty

The recurring algebra of Sections 3 and 4 is not a coincidence; it is the entire point. A proper scoring rule is engineered so that the report minimizing a forecaster's own expected loss is precisely their true belief.

Consider a forecaster who privately believes the outcome probability is $q$ but contemplates reporting some other value $p$. Their subjectively expected loss is $S(p, q) = \mathbb{E}_{Y \sim q}[S(p, Y)]$. Under a strictly proper rule this is uniquely minimized at $p = q$. Any attempt to shade the forecast, to hedge toward $0.5$ to look cautious, or to exaggerate toward $0$ or $1$ to look decisive, raises the forecaster's own expected penalty. There is no separate enforcement mechanism: the metric is self-enforcing because misreporting is self-defeating.
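
The self-enforcing property can be checked numerically. The sketch below assumes a hypothetical private belief $q = 0.7$ and a grid of candidate reports (both illustrative choices, not from the text), and evaluates the expected Brier loss and expected log loss of a binary forecast at each report.

```python
import numpy as np

def expected_brier(p, q):
    """Expected Brier loss of report p when the event has true probability q."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

def expected_log_loss(p, q):
    """Expected log loss of report p when the event has true probability q."""
    return -(q * np.log(p) + (1 - q) * np.log(1 - p))

q = 0.7                                  # hypothetical private belief
reports = np.linspace(0.01, 0.99, 99)    # candidate reports on a 0.01 grid

# Report that minimizes each expected loss on the grid.
best_brier = reports[np.argmin(expected_brier(reports, q))]
best_log = reports[np.argmin(expected_log_loss(reports, q))]

# Extra expected loss incurred by hedging the report to 0.5.
hedge_cost_brier = expected_brier(0.5, q) - expected_brier(q, q)
hedge_cost_log = expected_log_loss(0.5, q) - expected_log_loss(q, q)

print(best_brier, best_log, hedge_cost_brier, hedge_cost_log)
```

On this grid both minimizers land on the report $0.7$, and hedging to $0.5$ carries a strictly positive cost under both rules. For the Brier score that cost equals $(0.5 - 0.7)^2 = 0.04$.
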
This stands in sharp contrast to improper metrics. **Accuracy** is improper for probabilities: a model can maximize accuracy by reporting $0$ or $1$ regardless of its true uncertainty, which destroys all probabilistic information. **Mean absolute error on probabilities**, $|\hat{p} - y|$, is also improper. For $p \in [0, 1]$ its expected value $q|1 - p| + (1-q)|p| = q + (1 - 2q)p$ is linear in $p$, so it is minimized at the boundary: at $p = 0$ when $q < 0.5$ and at $p = 1$ when $q > 0.5$, again rewarding overconfident reports. Using such metrics to train or select probabilistic models silently encourages dishonest, miscalibrated outputs.

The honesty guarantee carries three practical consequences. First, in **model training**, minimizing a proper score (typically log loss) over a flexible model class drives the fitted conditional probabilities toward the true conditionals, the foundation of probabilistic machine learning. Second, in **model selection and evaluation**, proper scores let us rank competing forecasters without fear that the winner merely exploited the metric. Third, in **elicitation**, when humans or markets are paid according to a proper score, the payment structure makes truthful probability reporting the rational strategy, which is why proper scoring rules underpin forecasting tournaments and prediction markets.

Two cautions temper the theory. Propriety is an expectation-level property: it guarantees the right incentive on average under the true distribution, not that any single realized score is meaningful in isolation. And propriety says nothing about which proper rule to use; the choice among log loss, Brier, CRPS, and others should follow from how one wishes to weight calibration, tail behavior, boundedness, and units, as discussed throughout this chapter.

## 7. Choosing a Score: When to Use What, and Common Pitfalls

The three scores are not competitors so much as instruments tuned to different outcome types and reporting goals.


| Situation | Recommended score | Reason |
|---|---|---|
| Training neural classifiers or language models | Log loss | Differentiable, unbounded, matches maximum likelihood |
| Reporting calibration to stakeholders | Brier score plus reliability diagram | Bounded, decomposes into reliability and resolution |
| Comparing well-calibrated, similar models | Brier score | Stable under outliers, less dominated by single errors |
| Real-valued or ordered forecasts | CRPS | Respects distance on the line, reported in outcome units |
| Auditing overconfidence | Log loss | Diverges on confident errors, surfacing them sharply |

A few recurring pitfalls are worth naming explicitly.

- **Treating accuracy as a probability metric.** Thresholding at $0.5$ and counting hits discards calibration entirely and is improper, as Section 6 shows. Report a proper score alongside any accuracy figure.
- **Averaging log loss without clipping.** One probability of exactly $0$ on a realized event makes the mean infinite. Clip to $[\varepsilon, 1-\varepsilon]$ or smooth, and disclose the clipping constant, since it affects the number.
- **Reading a single realized score as quality.** Propriety is an expectation-level guarantee. A good forecaster can still incur a large loss on an unlucky draw. Compare scores across many examples and, where possible, with confidence intervals from resampling.
- **Comparing scores across different datasets or base rates.** The uncertainty term in the Murphy decomposition depends on the base rate, so raw scores are not portable. Use skill scores, which normalize against a reference forecast such as the climatological base rate, when comparing across problems.
- **Confusing calibration with skill.** A model that always predicts the base rate is perfectly calibrated and worthless. Sharpness, captured by the resolution term, is what separates useful forecasts from trivial ones.

Mature open-source tooling covers all three scores.
The scikit-learn library provides `log_loss` and `brier_score_loss`; the `properscoring` package and the `scoringrules` package implement CRPS with both the closed-form Gaussian and the ensemble estimators; and scikit-learn's `calibration_curve` computes the binned points for reliability diagrams.

## 8. Summary

Probabilistic metrics evaluate full predictive distributions rather than hard decisions. Proper scoring rules are those for which truthful reporting minimizes expected loss, and strictly proper rules make truth the unique optimum. The logarithmic score yields log loss and cross-entropy, is local and intimately tied to KL divergence and maximum likelihood, but is unbounded and unforgiving of confident errors. The Brier score is a bounded quadratic alternative whose Murphy decomposition exposes calibration, resolution, and intrinsic uncertainty. The continuous ranked probability score extends proper scoring to real-valued forecasts, reduces to mean absolute error for point forecasts, and is reported in the outcome's own units. Across all of them, propriety is the unifying design principle: it is what lets a number on a leaderboard stand in for honest, well-calibrated belief.

## References

1. Gneiting, T. and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359–378. https://doi.org/10.1198/016214506000001437
2. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1–3. https://doi.org/10.1175/1520-0493(1950)078%3C0001:VOFEIT%3E2.0.CO;2
3. Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12(4), 595–600. https://doi.org/10.1175/1520-0450(1973)012%3C0595:ANVPOT%3E2.0.CO;2
4. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. Journal of the Royal Statistical Society Series B, 69(2), 243–268.
   https://doi.org/10.1111/j.1467-9868.2007.00587.x
5. Matheson, J. E. and Winkler, R. L. (1976). Scoring Rules for Continuous Probability Distributions. Management Science, 22(10), 1087–1096. https://doi.org/10.1287/mnsc.22.10.1087
6. Gneiting, T. and Katzfuss, M. (2014). Probabilistic Forecasting. Annual Review of Statistics and Its Application, 1, 125–151. https://doi.org/10.1146/annurev-statistics-062713-085831
7. Bröcker, J. (2009). Reliability, Sufficiency, and the Decomposition of Proper Scores. Quarterly Journal of the Royal Meteorological Society, 135(643), 1512–1519. https://doi.org/10.1002/qj.456
8. Dawid, A. P. and Musio, M. (2014). Theory and Applications of Proper Scoring Rules. Metron, 72(2), 169–183. https://doi.org/10.1007/s40300-014-0039-y
