166  Calibration Metrics

A probabilistic classifier does more than rank instances or pick a winning label. It emits numbers that purport to be probabilities, and downstream decisions often treat those numbers literally. A medical triage system that flags a patient as having a \(0.9\) probability of sepsis invites a different response than one reporting \(0.55\). Selective prediction, cost-sensitive thresholding, ensembling, and human-in-the-loop review all assume that a reported confidence of \(p\) behaves like a genuine frequency. Calibration is the formal study of whether that assumption holds, and calibration metrics are the instruments that quantify the gap between claimed and realized confidence.

This chapter develops the measurement of calibration for classification models. We define calibration precisely, construct reliability diagrams, derive the expected and maximum calibration errors with their estimators, examine the tension between calibration and sharpness through the lens of proper scoring rules, and close with the practical pitfalls that make calibration estimation deceptively subtle.

166.1 1. Defining Calibration

166.1.1 1.1 Perfect Calibration

Let \((X, Y)\) be a random pair with features \(X\) and label \(Y \in \{1, \dots, K\}\). A probabilistic classifier produces a vector \(f(X) \in \Delta^{K-1}\) on the probability simplex, where \(f_k(X)\) is the predicted probability of class \(k\). The model is perfectly calibrated when its predicted probabilities match the conditional class frequencies:

\[ \mathbb{P}\big(Y = k \mid f_k(X) = p\big) = p \quad \text{for all } k \text{ and all } p \in [0, 1]. \]

The interpretation is operational. Among all inputs for which the model claims probability \(p\) for class \(k\), a fraction exactly \(p\) truly belong to class \(k\). Calibration is a property of the joint distribution of predictions and outcomes, not of any single prediction, so it can only be assessed over a population.

166.1.2 1.2 Confidence Calibration

Full multiclass calibration as defined above is demanding because it constrains the entire predicted vector. In practice much of the literature studies the weaker notion of confidence calibration, which concerns only the top prediction. Let \(\hat{Y} = \arg\max_k f_k(X)\) be the predicted label and \(\hat{P} = \max_k f_k(X)\) be the associated confidence. Confidence calibration requires

\[ \mathbb{P}\big(\hat{Y} = Y \mid \hat{P} = p\big) = p \quad \text{for all } p \in [0, 1]. \]

This reduces a \(K\)-class problem to a single scalar score \(\hat{P}\) paired with a binary correctness event \(\mathbb{1}[\hat{Y} = Y]\). Most reliability diagrams and calibration error estimates used in deep learning evaluate this confidence notion, and it is the focus of the metrics below. The distinction matters: a model can be confidence calibrated while remaining poorly calibrated on non-top classes, which is why classwise and multiclass variants exist.
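The reduction from probability vectors to confidence and correctness pairs can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function name `top_label_reduction` and the toy inputs are assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def top_label_reduction(
    probs: Sequence[Sequence[float]], labels: Sequence[int]
) -> Tuple[List[float], List[int]]:
    """Reduce K-class probability vectors to (confidence, correctness) pairs."""
    confidences: List[float] = []
    correct: List[int] = []
    for p, y in zip(probs, labels):
        # argmax over classes; ties are broken in favour of the lowest index
        y_hat = max(range(len(p)), key=p.__getitem__)
        confidences.append(p[y_hat])      # P-hat = max_k f_k(x)
        correct.append(int(y_hat == y))   # indicator 1[Y-hat = Y]
    return confidences, correct


# Toy example with K = 3 classes (hypothetical values).
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
labels = [0, 2, 2]
confidences, correct = top_label_reduction(probs, labels)
print(confidences, correct)
```

Everything that follows in the confidence-calibration setting operates on these two lists only; the non-top entries of each probability vector are discarded.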

166.2 2. Reliability Diagrams

166.2.1 2.1 Construction

The reliability diagram is the canonical visualization of calibration. The procedure partitions the confidence interval \([0, 1]\) into \(M\) bins and compares, within each bin, the average confidence against the empirical accuracy.

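Given a held-out sample \(\{(x_i, y_i)\}_{i=1}^{n}\), let \(\hat{y}_i\) and \(\hat{p}_i\) denote the predicted label and confidence for \(x_i\), and define the bin boundaries, commonly the equal-width partition \(B_m = \big(\tfrac{m-1}{M}, \tfrac{m}{M}\big]\). Assign each prediction to the bin containing its confidence \(\hat{p}_i\); with a slight abuse of notation, \(B_m\) also denotes the set of indices assigned to that bin. Within bin \(B_m\) compute the accuracy and the average confidence: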

\[ \mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}[\hat{y}_i = y_i], \qquad \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i . \]

The diagram plots \(\mathrm{acc}(B_m)\) against \(\mathrm{conf}(B_m)\). Perfect calibration places every bin on the diagonal \(y = x\). Bins below the diagonal indicate overconfidence, where accuracy lags the claimed confidence; bins above it indicate underconfidence. The signed gap \(\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\) summarizes the direction and magnitude of miscalibration in each bin.

flowchart LR
    C["confidence conf of B_m"] --> D["diagonal y equals x, perfect calibration"]
    A["accuracy acc of B_m"] --> D
    D --> U["above diagonal, underconfident"]
    D --> O["below diagonal, overconfident"]
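The construction above can be sketched as a function that returns the per-bin statistics a plotting routine would consume. It assumes equal-width bins and skips empty bins; the name `reliability_bins` and the sample values are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def reliability_bins(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> List[Dict[str, float]]:
    """Per-bin count, accuracy, mean confidence and signed gap (equal-width bins)."""
    counts = [0] * n_bins
    hits = [0.0] * n_bins
    conf_sums = [0.0] * n_bins
    for p, c in zip(confidences, correct):
        # Bin m covers ((m-1)/M, m/M]; ceil gives the 1-based index, clamped so
        # that p = 0 falls in the first bin. Values lying exactly on a bin edge
        # can land on either side because of floating-point rounding.
        m = min(max(math.ceil(p * n_bins), 1), n_bins) - 1
        counts[m] += 1
        hits[m] += c
        conf_sums[m] += p
    rows = []
    for m in range(n_bins):
        if counts[m] == 0:
            continue  # an empty bin has no defined accuracy
        acc = hits[m] / counts[m]
        conf = conf_sums[m] / counts[m]
        rows.append(
            {"bin": m + 1, "count": counts[m], "acc": acc, "conf": conf, "gap": acc - conf}
        )
    return rows


# Hypothetical confidences and correctness indicators.
confidences = [0.95, 0.92, 0.97, 0.91, 0.65, 0.62]
correct = [1, 1, 0, 1, 1, 0]
rows = reliability_bins(confidences, correct)
for row in rows:
    print(row)
```

A negative `gap` marks a bin below the diagonal (overconfident) and a positive `gap` marks one above it (underconfident). The `count` field is what the occupancy histogram alongside the diagram would display.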

166.2.2 2.2 Reading the Diagram

A reliability diagram conveys more than a single number. The pattern of deviations reveals the structure of the miscalibration. Modern overparameterized neural networks typically show a characteristic bow below the diagonal at high confidence, the empirical signature of systematic overconfidence reported by Guo and colleagues. The diagram also exposes where the model concentrates its mass: a histogram of bin counts displayed alongside the diagram shows that for confident classifiers most samples fall in the rightmost bins, so deviations there dominate any aggregate error even when low-confidence bins look ragged.

166.3 3. Expected Calibration Error

166.3.1 3.1 The Population Quantity

To collapse the reliability diagram into a scalar, we measure the expected discrepancy between confidence and accuracy. The population expected calibration error for confidence calibration is

\[ \mathrm{ECE} = \mathbb{E}_{\hat{P}}\Big[\big| \mathbb{P}(\hat{Y} = Y \mid \hat{P}) - \hat{P} \big|\Big]. \]

This is the average absolute gap between the confidence level and the true accuracy at that confidence level, weighted by how often each confidence level occurs. It is zero exactly when the model is confidence calibrated.

166.3.2 3.2 The Binned Estimator

The conditional probability inside the expectation cannot be evaluated pointwise from finite data, so the standard estimator replaces it with the binned approximation. Using the bins of the reliability diagram,

\[ \widehat{\mathrm{ECE}} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big| . \]

Each bin contributes its absolute calibration gap weighted by the fraction of samples it contains. The metric lies in \([0, 1]\), with smaller values indicating better calibration.

ECE estimator
  for each bin m in 1..M:
      gap_m = | acc(B_m) - conf(B_m) |
      w_m   = count(B_m) / n
  ECE = sum_m w_m * gap_m
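A runnable counterpart to the pseudocode, written as a sketch under the assumption of equal-width bins with empty bins contributing nothing. The function name and the toy data are assumptions made for illustration.

```python
import math
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: occupancy-weighted mean of |acc(B_m) - conf(B_m)|."""
    n = len(confidences)
    counts = [0] * n_bins
    hits = [0.0] * n_bins
    conf_sums = [0.0] * n_bins
    for p, c in zip(confidences, correct):
        # 0-based index of the bin ((m-1)/M, m/M] that contains p
        m = min(max(math.ceil(p * n_bins), 1), n_bins) - 1
        counts[m] += 1
        hits[m] += c
        conf_sums[m] += p
    ece = 0.0
    for m in range(n_bins):
        if counts[m] == 0:
            continue  # empty bins carry zero weight
        gap = abs(hits[m] / counts[m] - conf_sums[m] / counts[m])
        ece += (counts[m] / n) * gap
    return ece


# Hypothetical data: four predictions near 0.94 (3 correct), two near 0.63 (1 correct).
confidences = [0.95, 0.92, 0.97, 0.91, 0.65, 0.62]
correct = [1, 1, 0, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

With these inputs the top bin has gap \(|0.75 - 0.9375| = 0.1875\) and weight \(4/6\), the lower bin has gap \(|0.5 - 0.635| = 0.135\) and weight \(2/6\), so the estimate is \(0.125 + 0.045 = 0.17\).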

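A general \(L^p\) form raises each bin gap to the \(p\)-th power before averaging and takes the \(p\)-th root of the result, \(\widehat{\mathrm{ECE}}_p = \big(\sum_{m=1}^{M} \tfrac{|B_m|}{n} \, | \mathrm{acc}(B_m) - \mathrm{conf}(B_m) |^p\big)^{1/p}\), giving the \(\mathrm{ECE}_p\) family. The case \(p = 2\) penalizes large bin gaps more heavily, and its square corresponds to the binned calibration component of the squared error. The choice \(p = 1\) remains the most reported.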

166.4 4. Maximum Calibration Error

166.4.1 4.1 Definition

ECE averages over bins and can mask a single severely miscalibrated region. In high-stakes settings the worst-case gap is the relevant quantity. The maximum calibration error reports the largest deviation across bins:

\[ \widehat{\mathrm{MCE}} = \max_{m \in \{1, \dots, M\}} \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big| . \]

The population analogue is the supremum of the calibration gap over confidence values, \(\mathrm{MCE} = \sup_{p} | \mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) - p |\). MCE answers a different question from ECE. A safety-critical system that must never overstate its confidence by more than a tolerance cares about MCE; a system optimizing aggregate decision quality cares about ECE. The two should be reported together because a model can have low ECE yet alarming MCE in a sparsely populated bin.

166.4.2 4.2 Sensitivity to Sparse Bins

MCE inherits a weakness from binning: a bin holding only a handful of samples produces a high-variance accuracy estimate, and that noisy estimate can dominate the maximum. Reporting MCE alongside per-bin counts, or restricting the maximum to bins above a minimum occupancy, mitigates spurious worst-case readings driven by sampling noise rather than genuine miscalibration.
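The effect of a minimum-occupancy guard can be illustrated with a small sketch. The `min_count` parameter, the function name, and the data are assumptions introduced here; a suitable threshold depends on the sample size and is not prescribed by the text.

```python
import math
from typing import Sequence


def max_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
    min_count: int = 1,
) -> float:
    """Binned MCE over equal-width bins holding at least `min_count` samples."""
    counts = [0] * n_bins
    hits = [0.0] * n_bins
    conf_sums = [0.0] * n_bins
    for p, c in zip(confidences, correct):
        m = min(max(math.ceil(p * n_bins), 1), n_bins) - 1
        counts[m] += 1
        hits[m] += c
        conf_sums[m] += p
    gaps = [
        abs(hits[m] / counts[m] - conf_sums[m] / counts[m])
        for m in range(n_bins)
        if counts[m] >= max(min_count, 1)  # never divide by an empty bin
    ]
    # NaN signals that no bin met the occupancy requirement.
    return max(gaps, default=float("nan"))


# Eight confident predictions (7 correct) plus one lone, wrong prediction at 0.55.
confidences = [0.95] * 8 + [0.55]
correct = [1] * 7 + [0] + [0]
mce_all = max_calibration_error(confidences, correct)
mce_guarded = max_calibration_error(confidences, correct, min_count=5)
print(round(mce_all, 3), round(mce_guarded, 3))
```

The unguarded value of 0.55 comes entirely from a bin with a single sample, while the guarded value of 0.075 reflects the only bin with enough data to support an accuracy estimate. Reporting both, together with the bin counts, lets a reader judge how much of the worst-case reading is sampling noise.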

166.5 5. Calibration and Sharpness

166.5.1 5.1 Why Calibration Alone Is Insufficient

Calibration is necessary but not sufficient for a useful forecaster. Consider the constant predictor that outputs the marginal base rate \(\bar{p} = \mathbb{P}(Y = 1)\) for every input in a binary task. This predictor is perfectly calibrated: among all instances, exactly a fraction \(\bar{p}\) are positive, and it claimed \(\bar{p}\) for all of them. Yet it is useless, because it never discriminates between instances. It achieves calibration by refusing to be informative.

The missing ingredient is sharpness, the property that predictions concentrate near \(0\) and \(1\) rather than clustering at the base rate. A common measure of sharpness is the variance of the predicted probabilities, \(\mathrm{Var}(\hat{P})\), which is a property of the forecasts alone, independent of the outcomes. The guiding principle, articulated by Gneiting, Balabdaoui, and Raftery, is to maximize sharpness subject to calibration. We want predictions that are both confident and correct, not merely correct on average.
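A small numeric illustration makes the contrast concrete. The population below is hypothetical: 100 binary instances with base rate 0.3, scored once by the constant base-rate predictor and once by a forecaster that issues two distinct, well-calibrated probabilities. All names are assumptions for this sketch.

```python
from collections import defaultdict
from statistics import pvariance
from typing import Sequence


def max_calibration_gap(forecasts: Sequence[float], labels: Sequence[int]) -> float:
    """Largest |observed frequency - forecast| over the distinct forecast values."""
    groups = defaultdict(list)
    for p, y in zip(forecasts, labels):
        groups[p].append(y)
    return max(abs(sum(ys) / len(ys) - p) for p, ys in groups.items())


def brier_score(forecasts: Sequence[float], labels: Sequence[int]) -> float:
    return sum((p - y) ** 2 for p, y in zip(forecasts, labels)) / len(labels)


# 20 instances of which 18 are positive, then 80 instances of which 12 are positive.
labels = [1] * 18 + [0] * 2 + [1] * 12 + [0] * 68
constant = [0.3] * 100              # base-rate predictor
sharp = [0.9] * 20 + [0.15] * 80    # calibrated and informative

gap_const = max_calibration_gap(constant, labels)
gap_sharp = max_calibration_gap(sharp, labels)
var_const = pvariance(constant)     # sharpness as Var(P-hat)
var_sharp = pvariance(sharp)
brier_const = brier_score(constant, labels)
brier_sharp = brier_score(sharp, labels)
print(gap_const, var_const, round(brier_const, 3))
print(gap_sharp, round(var_sharp, 3), round(brier_sharp, 3))
```

Both forecasters have a zero calibration gap, yet the constant predictor has zero variance and a Brier score of 0.21, while the sharper forecaster has variance 0.09 and a Brier score of 0.12. Calibration metrics cannot tell the two apart; sharpness and a proper score can.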

166.5.2 5.2 The Calibration-Refinement Decomposition

The trade-off is made rigorous by decomposing a proper scoring rule. For the Brier score in the binary case, with prediction \(\hat{p}_i\) and label \(y_i \in \{0, 1\}\),

\[ \mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2 . \]

Murphy’s decomposition expresses the expected Brier score, after grouping predictions by their value, as

\[ \mathbb{E}[\mathrm{BS}] = \underbrace{\mathbb{E}\big[(\hat{P} - \bar{y}_{\hat{P}})^2\big]}_{\text{calibration (reliability)}} \;-\; \underbrace{\mathbb{E}\big[(\bar{y}_{\hat{P}} - \bar{y})^2\big]}_{\text{resolution}} \;+\; \underbrace{\bar{y}(1 - \bar{y})}_{\text{uncertainty}}, \]

where \(\bar{y}_{\hat{P}}\) is the conditional outcome frequency given the prediction and \(\bar{y}\) is the overall base rate. The calibration term, called reliability in the forecasting literature, is the expected squared calibration gap; it is zero exactly when the model is calibrated. The resolution term measures how far the conditional outcome frequencies spread away from the base rate; it enters with a negative sign, so increasing it lowers the score. The final term is the uncertainty of the label distribution, which no forecaster can change. Refinement is uncertainty minus resolution, so the Brier score equals calibration plus refinement, and higher resolution means a smaller refinement loss. Resolution is tied to sharpness: for a calibrated forecaster \(\bar{y}_{\hat{P}} = \hat{P}\), and the resolution equals \(\mathrm{Var}(\hat{P})\). The decomposition shows why optimizing a proper scoring rule does not reduce to optimizing calibration: a model can trade a small calibration penalty for a large resolution gain. This is also why post-hoc recalibration methods that apply a strictly monotone map, which changes calibration while leaving the grouping of predictions and hence the resolution untouched, can improve the Brier score and the log loss without retraining.
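The identity can be checked numerically when predictions are grouped by their exact value, in which case it holds without approximation. The sketch below uses hypothetical data and names the three terms `calibration`, `resolution` (the subtracted middle term, following Murphy's terminology), and `uncertainty` (the base-rate term).

```python
from collections import defaultdict
from typing import Sequence, Tuple


def murphy_decomposition(
    forecasts: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (calibration, resolution, uncertainty), grouping by forecast value."""
    n = len(labels)
    base_rate = sum(labels) / n
    groups = defaultdict(list)
    for p, y in zip(forecasts, labels):
        groups[p].append(y)
    calibration = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)                        # conditional outcome frequency
        calibration += len(ys) / n * (p - freq) ** 2
        resolution += len(ys) / n * (freq - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return calibration, resolution, uncertainty


# Ten forecasts of 0.8 (6 positives) and ten forecasts of 0.3 (2 positives).
forecasts = [0.8] * 10 + [0.3] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

calibration, resolution, uncertainty = murphy_decomposition(forecasts, labels)
brier = sum((p - y) ** 2 for p, y in zip(forecasts, labels)) / len(labels)
print(round(calibration, 4), round(resolution, 4), round(uncertainty, 4), round(brier, 4))
```

Here the terms are 0.025, 0.04, and 0.24, and \(0.025 - 0.04 + 0.24 = 0.225\) matches the directly computed Brier score. When forecasts are binned rather than grouped by exact value, the identity holds only approximately because within-bin variation of the forecasts adds further terms.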

166.5.3 5.3 Proper Scoring Rules as Joint Measures

A scoring rule \(S(\hat{p}, y)\) is proper when reporting the true conditional probability minimizes the expected score, and strictly proper when the true probability is the unique minimizer. The Brier score and the negative log likelihood (log loss) are both strictly proper and therefore reward calibration and sharpness simultaneously. Calibration metrics such as ECE isolate one component of these joint measures. They are diagnostic: together with the reliability diagram they show where the forecasts deviate and, through the signed bin gaps, in which direction. A proper scoring rule instead gives a single optimization-friendly summary that a calibrated-but-unsharp model cannot game.
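Propriety can be checked numerically for a single event. The sketch below assumes a true positive probability of 0.7 and searches a grid of reported probabilities for the minimizer of the expected score. The absolute error is included as a standard example of a rule that is not proper; it is not discussed in the text above and serves only as a contrast.

```python
import math
from typing import Callable


def expected_score(rule: Callable[[float, int], float], q: float, r: float) -> float:
    """Expected score of reporting r when the positive class occurs with probability q."""
    return q * rule(r, 1) + (1.0 - q) * rule(r, 0)


def brier(r: float, y: int) -> float:
    return (r - y) ** 2


def log_loss(r: float, y: int) -> float:
    return -math.log(r if y == 1 else 1.0 - r)


def absolute_error(r: float, y: int) -> float:
    return abs(r - y)


q = 0.7                                   # assumed true probability
grid = [i / 100 for i in range(1, 100)]   # candidate reports 0.01 .. 0.99
best = {
    rule.__name__: min(grid, key=lambda r: expected_score(rule, q, r))
    for rule in (brier, log_loss, absolute_error)
}
print(best)
```

The Brier score and the log loss are both minimized by the honest report \(0.7\), whereas the absolute error is minimized at the edge of the grid: it rewards rounding the forecast toward the more likely outcome, which destroys calibration.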

166.6 6. Measuring Calibration in Practice

166.6.1 6.1 The Binning Bias

The binned ECE estimator is biased, and the bias depends on the number of bins. With too few bins, opposing errors within a wide bin cancel: a region that is overconfident at one end and underconfident at the other can average to a near-zero gap, understating the true miscalibration. With too many bins, each bin holds few samples, the accuracy estimates become noisy, and the estimator systematically overstates calibration error because absolute deviations of noisy estimates are positive in expectation. The estimator is therefore not consistent for the population ECE at a fixed bin count, and naive comparisons of ECE across models with different confidence distributions can mislead.
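Both directions of the bias can be reproduced with synthetic data. The sketch below is an illustration under stated assumptions: the first data set is constructed so that opposing errors cancel inside a wide bin, and the second is simulated to be perfectly calibrated, so its population ECE is zero and any positive estimate is pure estimation error. The seed and sample sizes are arbitrary.

```python
import math
import random
from typing import Sequence


def ece(confidences: Sequence[float], correct: Sequence[int], n_bins: int) -> float:
    """Equal-width binned ECE; empty bins contribute nothing."""
    n = len(confidences)
    counts = [0] * n_bins
    hits = [0.0] * n_bins
    conf_sums = [0.0] * n_bins
    for p, c in zip(confidences, correct):
        m = min(max(math.ceil(p * n_bins), 1), n_bins) - 1
        counts[m] += 1
        hits[m] += c
        conf_sums[m] += p
    return sum(
        counts[m] / n * abs(hits[m] / counts[m] - conf_sums[m] / counts[m])
        for m in range(n_bins)
        if counts[m] > 0
    )


# Too few bins: confidence 0.6 is underconfident (accuracy 0.8) and confidence
# 0.9 is overconfident (accuracy 0.7); inside one wide bin the errors cancel.
conf_a = [0.6] * 50 + [0.9] * 50
corr_a = [1] * 40 + [0] * 10 + [1] * 35 + [0] * 15
ece_coarse = ece(conf_a, corr_a, n_bins=2)   # both values share the bin (0.5, 1]
ece_fine = ece(conf_a, corr_a, n_bins=10)    # the two values are separated

# Too many bins: a perfectly calibrated simulation whose population ECE is 0.
rng = random.Random(0)
conf_b = [rng.uniform(0.5, 1.0) for _ in range(2000)]
corr_b = [int(rng.random() < p) for p in conf_b]
ece_few = ece(conf_b, corr_b, n_bins=10)
ece_many = ece(conf_b, corr_b, n_bins=500)

print(round(ece_coarse, 4), round(ece_fine, 4))
print(round(ece_few, 4), round(ece_many, 4))
```

In the first case the coarse estimate is essentially zero while the finer binning recovers the true gap of 0.2. In the second case the 500-bin estimate is far larger than the 10-bin estimate even though the model is calibrated by construction, because each narrow bin holds only a handful of samples.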

166.6.2 6.2 Binning Schemes

Two binning schemes dominate. Equal-width binning splits \([0, 1]\) into intervals of equal length and is simple but leaves high-confidence bins overloaded and low-confidence bins nearly empty for typical neural networks. Equal-mass (adaptive) binning chooses boundaries so that each bin holds the same number of samples, stabilizing per-bin variance and giving every region comparable statistical weight. Adaptive binning generally yields lower-variance estimates and is preferred when the confidence distribution is highly skewed, though it makes the reliability diagram’s horizontal axis nonuniform and slightly harder to read.
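Equal-mass binning can be sketched by sorting the predictions and cutting the sorted order into chunks of near-equal size. This is one simple way to implement it, and the function name and data are assumptions; tied confidence values may be split across adjacent bins in this version.

```python
from typing import List, Sequence, Tuple


def equal_mass_ece(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> Tuple[float, List[int]]:
    """ECE with equal-mass bins; also returns the per-bin sample counts."""
    n = len(confidences)
    order = sorted(range(n), key=lambda i: confidences[i])
    ece = 0.0
    counts: List[int] = []
    for m in range(n_bins):
        # Integer arithmetic spreads any remainder across bins as evenly as possible.
        idx = order[m * n // n_bins:(m + 1) * n // n_bins]
        if not idx:
            continue  # only happens when n_bins exceeds the sample size
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
        counts.append(len(idx))
    return ece, counts


# Hypothetical skewed sample: most confidences sit above 0.9.
pairs = [
    (0.55, 0), (0.65, 1), (0.75, 1), (0.85, 1),
    (0.92, 1), (0.93, 1), (0.94, 0), (0.95, 1),
    (0.96, 1), (0.97, 1), (0.98, 1), (0.99, 1),
]
confidences = [p for p, _ in pairs]
correct = [c for _, c in pairs]
ece_mass, counts = equal_mass_ece(confidences, correct, n_bins=3)
print(round(ece_mass, 4), counts)
```

With ten equal-width bins the same sample would place eight of twelve predictions in the top bin and one each in four others. Equal-mass binning gives every bin four samples, at the cost of bins whose widths differ.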

166.6.3 6.3 Binning-Free and Debiased Estimators

To escape the bin-count dependence, several alternatives have been proposed. Kernel density estimates of the calibration gap, used in the KDE-ECE, replace hard bins with smooth kernels and a bandwidth parameter. The maximum mean calibration error and other kernel embeddings recast calibration as a distance in a reproducing kernel Hilbert space. The smooth ECE of Blasiok and Nakkiran provides a binning-free estimator with a single continuous smoothing scale and accompanying consistency guarantees. Debiased estimators of the squared ECE subtract an estimate of the binning-induced bias. None of these is uniformly dominant, so reporting the estimator, its parameters, and ideally a bootstrap confidence interval is essential for reproducibility.
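A percentile bootstrap interval for the binned ECE can be sketched as follows. This is a minimal version under stated assumptions: the number of resamples, the seed, and the simulated data (a model that is overconfident by 0.1 throughout) are arbitrary, and the naive percentile interval inherits the bias of the underlying estimator rather than correcting it.

```python
import math
import random
from typing import Sequence, Tuple


def ece(confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Equal-width binned ECE; empty bins contribute nothing."""
    n = len(confidences)
    counts = [0] * n_bins
    hits = [0.0] * n_bins
    conf_sums = [0.0] * n_bins
    for p, c in zip(confidences, correct):
        m = min(max(math.ceil(p * n_bins), 1), n_bins) - 1
        counts[m] += 1
        hits[m] += c
        conf_sums[m] += p
    return sum(
        counts[m] / n * abs(hits[m] / counts[m] - conf_sums[m] / counts[m])
        for m in range(n_bins)
        if counts[m] > 0
    )


def bootstrap_ece_ci(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
    n_boot: int = 200,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval: resample (confidence, correct) pairs with replacement."""
    rng = random.Random(seed)
    n = len(confidences)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(ece([confidences[i] for i in idx], [correct[i] for i in idx], n_bins))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Simulated model whose accuracy trails its confidence by 0.1 everywhere.
data_rng = random.Random(1)
confidences = [data_rng.uniform(0.5, 1.0) for _ in range(500)]
correct = [int(data_rng.random() < p - 0.1) for p in confidences]

point = ece(confidences, correct)
lo, hi = bootstrap_ece_ci(confidences, correct)
print(round(point, 3), (round(lo, 3), round(hi, 3)))
```

The interval width gives a reader a sense of how much two reported ECE values must differ before the difference is worth interpreting.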

166.6.4 6.4 Evaluation Protocol

Calibration must be measured on data not used to fit either the model or any recalibration map. Calibration on the training set is meaningless because flexible models can memorize. The standard protocol fits a recalibration transform, such as temperature scaling, on a held-out validation split and evaluates ECE, MCE, and a proper score on a separate test split. Temperature scaling, which divides the logits by a single learned scalar \(T > 0\) before the softmax, is the canonical baseline: it preserves the argmax and hence accuracy, changing only how confident the model is. A value \(T > 1\) softens the confidences and \(T < 1\) sharpens them.

flowchart LR
    A["train model on train split"] --> B["fit recalibration such as temperature T on validation split"]
    B --> C["on test split, compute"]
    C --> D["reliability diagram with bin counts"]
    C --> E["ECE_1, MCE, Brier, log loss"]
    C --> F["bootstrap CI for ECE"]
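The recalibration step of this protocol can be sketched for temperature scaling. The version below fits \(T\) by a grid search over the validation negative log likelihood, which is a simplification chosen for clarity (gradient-based optimizers are a common alternative). The logits, labels, grid range, and function names are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(
    logits: Sequence[Sequence[float]], labels: Sequence[int], temperature: float
) -> float:
    """Mean negative log likelihood, computed via log-sum-exp to avoid log(0)."""
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / temperature for v in z]
        shift = max(scaled)
        log_norm = shift + math.log(sum(math.exp(v - shift) for v in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)


def fit_temperature(
    val_logits: Sequence[Sequence[float]], val_labels: Sequence[int]
) -> float:
    """Pick T from a log-spaced grid on [0.1, 10]; the grid contains T = 1 exactly."""
    grid = [10 ** (k / 50) for k in range(-50, 51)]
    return min(grid, key=lambda t: mean_nll(val_logits, val_labels, t))


# Hypothetical validation logits: confident everywhere, wrong on two of six.
val_logits = [
    [4.0, 0.0, 0.0], [0.0, 5.0, 1.0], [3.0, 0.0, 0.5],
    [0.0, 0.5, 4.0], [6.0, 1.0, 0.0], [0.0, 4.0, 0.5],
]
val_labels = [0, 1, 1, 2, 0, 2]

T = fit_temperature(val_logits, val_labels)
nll_before = mean_nll(val_logits, val_labels, 1.0)
nll_after = mean_nll(val_logits, val_labels, T)
print(round(T, 3), round(nll_before, 3), round(nll_after, 3))
```

Because the example model is overconfident, the fitted temperature exceeds one and the validation log loss drops. In a full evaluation the fitted `T` would then be frozen and applied to the test-split logits before computing the reliability diagram, ECE, MCE, and proper scores.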

166.6.5 6.5 Reporting Recommendations

A defensible calibration report includes the reliability diagram with bin occupancy, the ECE with its binning scheme and bin count, the MCE with a minimum-occupancy guard, at least one strictly proper score such as the Brier score or log loss to capture sharpness, and a bootstrap confidence interval on the calibration error. Because ECE values are not comparable across different binning choices, any cross-model claim must hold the estimator fixed. Reporting calibration without sharpness, or a single calibration number without its estimator and uncertainty, invites the exact misinterpretations that calibration analysis is meant to prevent.

166.7 7. Summary

Calibration asks whether a model’s stated probabilities behave like real frequencies. Reliability diagrams visualize the answer; expected and maximum calibration errors summarize it as average-case and worst-case gaps. Both are estimated through binning, which introduces a bias that pulls in opposite directions as the bin count changes, motivating adaptive and binning-free estimators. Calibration alone is satisfied by the uninformative base-rate predictor, so it must be paired with sharpness, and proper scoring rules such as the Brier score formalize the joint objective through the calibration-refinement decomposition. Sound practice evaluates on held-out data, reports the estimator and its uncertainty, and never separates calibration from the sharpness that makes a forecaster useful.

166.8 References

  1. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. International Conference on Machine Learning. https://arxiv.org/abs/1706.04599
  2. Naeini, M. P., Cooper, G. F., and Hauskrecht, M. (2015). Obtaining Well Calibrated Probabilities Using Bayesian Binning. AAAI Conference on Artificial Intelligence. https://ojs.aaai.org/index.php/AAAI/article/view/9602
  3. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. Journal of the Royal Statistical Society Series B. https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2007.00587.x
  4. Gneiting, T., and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association. https://www.tandfonline.com/doi/abs/10.1198/016214506000001437
  5. Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology. https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml
  6. Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., and Tran, D. (2019). Measuring Calibration in Deep Learning. CVPR Workshops. https://arxiv.org/abs/1904.01685
  7. Kumar, A., Liang, P., and Ma, T. (2019). Verified Uncertainty Calibration. Neural Information Processing Systems. https://arxiv.org/abs/1909.10155
  8. Blasiok, J., and Nakkiran, P. (2023). Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing. International Conference on Learning Representations. https://arxiv.org/abs/2309.12236
  9. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review. https://journals.ametsoc.org/view/journals/mwre/78/1/1520-0493_1950_078_0001_vofeit_2_0_co_2.xml
  10. Kull, M., Perello-Nieto, M., Kangsepp, M., Silva Filho, T., Song, H., and Flach, P. (2019). Beyond Temperature Scaling: Dirichlet Calibration. Neural Information Processing Systems. https://arxiv.org/abs/1910.12656