173  Confidence Intervals for Model Performance

173.1 1. Why a Single Metric Is Not Enough

When a model achieves 91.3% accuracy on a held-out test set, the natural temptation is to treat that number as a property of the model. It is not. It is a property of the model evaluated on one particular finite sample drawn from some underlying distribution. Had we drawn a different test set of the same size from the same population, we would almost certainly have observed a different number. The reported metric is therefore a realization of a random variable, and a point estimate without a measure of dispersion conceals how much that realization could have varied.

This matters for three practical reasons. First, model selection decisions are frequently made on differences of a percentage point or less, and such differences may lie entirely within the noise floor of the evaluation. Second, regulatory and scientific reporting standards increasingly demand quantified uncertainty rather than bare scores. Third, deployment risk assessment depends on the plausible range of performance, not on a single optimistic estimate.

Formally, let the test set consist of \(n\) independent and identically distributed examples drawn from a distribution \(\mathcal{D}\). A metric such as accuracy is a statistic \(\hat{\theta}_n\) that estimates a population quantity \(\theta = \mathbb{E}_{\mathcal{D}}[\,\cdot\,]\). A confidence interval at level \(1 - \alpha\) is a data-dependent interval \([L, U]\) such that, under repeated sampling of test sets,

\[ \Pr\big(L \le \theta \le U\big) \ge 1 - \alpha . \]

The interval is random; the parameter \(\theta\) is fixed. The frequentist guarantee is about coverage across hypothetical replications, not about the probability that any single computed interval contains \(\theta\). This distinction is subtle but governs how the interval should be interpreted and reported.

A useful mental model is that the width of the interval scales like \(1/\sqrt{n}\). Halving the width of an interval therefore requires roughly quadrupling the test set. Teams that obsess over a fourth significant figure of accuracy on a thousand-example test set are, in effect, reading tea leaves.
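To make the scaling concrete, the sketch below computes the half-width of a normal-approximation 95% interval at several test-set sizes. The accuracy of 0.9 is a made-up illustrative value and `half_width` is a hypothetical helper name, not something defined elsewhere in this chapter.

```python
from math import sqrt

def half_width(p, n, z=1.96):
    """Half-width of a normal-approximation interval for a proportion p on n examples."""
    return z * sqrt(p * (1 - p) / n)

# Illustrative accuracy of 0.9 (an assumption); each fourfold increase in n
# halves the half-width.
for n in (1_000, 4_000, 16_000):
    print(n, round(half_width(0.9, n), 4))
```

At a thousand examples the half-width is close to two percentage points, so digits beyond the first decimal place of a percentage carry no information.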

173.2 2. Analytic Intervals for Accuracy

Accuracy is the mean of a Bernoulli indicator: each test example is either classified correctly (\(1\)) or not (\(0\)). If \(\hat{p}\) is the observed accuracy and \(n\) the test size, then \(n\hat{p}\) is a binomial count, and the entire machinery of binomial proportion intervals applies.

173.2.1 2.1 The Wald Interval and Its Failure Modes

The textbook interval invokes the central limit theorem to approximate the sampling distribution of \(\hat{p}\) as Gaussian:

\[ \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \]

where \(z_{1-\alpha/2} = 1.96\) for a 95% interval. This Wald interval is ubiquitous because it is trivial to compute, and it is wrong in exactly the situations practitioners care about most. When \(\hat{p}\) is close to \(0\) or \(1\), which is precisely the regime of strong models, the normal approximation degrades badly. The interval can extend below \(0\) or above \(1\), and its actual coverage can fall well under the nominal \(95\%\). For a model at \(99\%\) accuracy on \(200\) examples, the Wald interval is roughly \(0.990 \pm 0.014\), so its upper limit exceeds \(1\) and the interval is essentially meaningless.

The variance estimate \(\hat{p}(1-\hat{p})/n\) also collapses to zero as \(\hat{p} \to 1\), which absurdly implies near-perfect certainty exactly when the data are most sparse in the minority outcome.
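Both failure modes can be reproduced in a few lines. This is a minimal sketch; `wald_interval` is a hypothetical helper name, and the counts (198 and 200 correct out of 200) are illustrative.

```python
from math import sqrt

def wald_interval(k, n, z=1.96):
    """Wald interval for k successes out of n trials (deliberately not clipped to [0, 1])."""
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# 99% accuracy on 200 examples: the upper limit lands above 1.
lo, hi = wald_interval(198, 200)
print(lo, hi)

# Every example correct: the variance estimate is zero, so the interval has zero width.
lo_all, hi_all = wald_interval(200, 200)
print(lo_all, hi_all)
```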

173.2.2 2.2 The Wilson Score Interval

A far better default inverts the score test, which evaluates the variance at the hypothesized \(p\) rather than at the observed proportion \(\hat{p}\). The Wilson interval is the set of \(p\) values not rejected by that test, yielding

\[ \frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}, \]

with \(z = z_{1-\alpha/2}\). The Wilson interval is always contained in \([0, 1]\), behaves sensibly at the boundaries, and maintains close-to-nominal coverage even for small \(n\) and extreme \(\hat{p}\). It should be the default reporting interval for accuracy, precision, recall, and any other proportion-based metric.

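# Wilson 95% interval; p_hat (observed accuracy) and n (test size) are assumed defined
from math import sqrt
z = 1.96
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
ci = (center - half, center + half)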
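The coverage claims for Wald and Wilson can be checked exactly rather than by simulation, because for a fixed true \(p\) the number of correct predictions is binomial. The sketch below sums the binomial probability of every outcome whose interval contains the true value. The setting (\(p = 0.99\), \(n = 200\)) and all function names are assumptions chosen for illustration.

```python
from math import comb, sqrt

Z = 1.96  # 95% two-sided normal quantile

def wald(k, n):
    p = k / n
    h = Z * sqrt(p * (1 - p) / n)
    return p - h, p + h

def wilson(k, n):
    p = k / n
    denom = 1 + Z**2 / n
    center = (p + Z**2 / (2 * n)) / denom
    h = Z * sqrt(p * (1 - p) / n + Z**2 / (4 * n**2)) / denom
    return center - h, center + h

def exact_coverage(interval, p_true, n):
    """Probability, under Binomial(n, p_true), that the interval contains p_true."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p_true <= hi:
            total += comb(n, k) * p_true**k * (1 - p_true)**(n - k)
    return total

wald_cov = exact_coverage(wald, 0.99, 200)
wilson_cov = exact_coverage(wilson, 0.99, 200)
print(f"Wald: {wald_cov:.3f}  Wilson: {wilson_cov:.3f}")
```

In this setting the Wald interval misses the true value whenever all 200 predictions are correct, which alone costs it more than ten points of coverage, while Wilson stays close to the nominal level.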

173.2.3 2.3 Clopper-Pearson and Exactness

When strict coverage guarantees are required, the Clopper-Pearson interval inverts the exact binomial test using beta-distribution quantiles. It guarantees coverage of at least \(1 - \alpha\) for every true \(p\), but because the binomial is discrete, it is conservative: actual coverage often exceeds the nominal level, producing intervals wider than necessary. Clopper-Pearson is appropriate when undercoverage is unacceptable, for example in safety-critical certification, whereas Wilson is the better all-purpose choice when honest average coverage is the goal.
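A minimal sketch of the exact interval follows. Instead of beta quantiles it solves the two defining binomial tail equations by bisection, which gives the same limits and needs only the standard library. The function names and the example counts are assumptions for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05, iters=60):
    """Exact interval for k successes in n trials via bisection on the binomial tails."""
    def solve(g):
        # g is increasing in p on [0, 1] with g(0) < 0 < g(1); find its root.
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: P(X >= k | p) = alpha / 2.  Upper limit: P(X <= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

lo, hi = clopper_pearson(198, 200)     # 99% accuracy on 200 examples
lo10, hi10 = clopper_pearson(10, 10)   # all correct on a tiny test set
print(lo, hi)
print(lo10, hi10)
```

When every prediction is correct the lower limit has the closed form \((\alpha/2)^{1/n}\), which is a convenient sanity check on any implementation.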

173.2.4 2.4 Choosing Among Them

For routine reporting, Wilson is the recommended default. Reserve Clopper-Pearson for conservative guarantees and avoid Wald except as a rough mental approximation when \(n\) is large and \(\hat{p}\) is near \(0.5\). None of these intervals account for the fact that examples may be correlated, for instance when multiple test cases share a document or a patient. In clustered settings the effective sample size is smaller than \(n\), and all of the intervals above will be too narrow unless a cluster-aware variance estimate or a clustered bootstrap is used.
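One way to respect clustering is to resample whole clusters. The sketch below builds a percentile interval for accuracy that way. The toy data (20 clusters of 5 examples, perfectly correlated within a cluster) and the function name are assumptions for illustration only.

```python
import random

def cluster_bootstrap_accuracy(correct, cluster_ids, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for accuracy, resampling whole clusters with replacement."""
    rng = random.Random(seed)
    clusters = {}
    for c, cid in zip(correct, cluster_ids):
        clusters.setdefault(cid, []).append(c)
    groups = list(clusters.values())
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(groups) for _ in groups]   # draw clusters, not examples
        hits = sum(sum(g) for g in sample)
        total = sum(len(g) for g in sample)
        stats.append(hits / total)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data: 16 clusters entirely correct, 4 clusters entirely wrong (accuracy 0.8).
correct = [1] * 80 + [0] * 20
cluster_ids = [i // 5 for i in range(100)]
lo, hi = cluster_bootstrap_accuracy(correct, cluster_ids)
print(lo, hi)
```

Because the outcomes within each toy cluster are identical, the data behave like 20 observations rather than 100, and the resulting interval is much wider than any of the binomial intervals computed with \(n = 100\).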

173.3 3. Intervals for AUC

The area under the receiver operating characteristic curve summarizes ranking quality across all thresholds. Unlike accuracy, AUC is not a simple mean of independent Bernoulli trials, so its uncertainty requires more care.

173.3.1 3.1 AUC as a Probability and the Mann-Whitney Connection

The AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example:

\[ \text{AUC} = \Pr\big(s(X^{+}) > s(X^{-})\big) . \]

Its empirical estimator is the normalized Mann-Whitney \(U\) statistic, computed over all \(n_{+} n_{-}\) positive-negative pairs. This pairwise structure means the variance depends not only on the counts \(n_{+}\) and \(n_{-}\) but also on how scores are distributed.
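The pairwise definition translates directly into code. The sketch below is the quadratic-time version, written for clarity rather than speed. Counting tied pairs as one half is a common convention assumed here, and the scores are made up.

```python
def empirical_auc(scores_pos, scores_neg):
    """Fraction of positive-negative pairs ranked correctly; ties count one half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

pos = [0.9, 0.8, 0.4]   # scores of positive examples (illustrative)
neg = [0.7, 0.3]        # scores of negative examples (illustrative)
print(empirical_auc(pos, neg))   # 5 of the 6 pairs are ordered correctly
```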

173.3.2 3.2 The Hanley-McNeil Analytic Interval

Hanley and McNeil derived a widely used variance estimate for AUC, denoted \(A\):

\[ \widehat{\mathrm{Var}}(A) = \frac{A(1-A) + (n_{+}-1)(Q_1 - A^2) + (n_{-}-1)(Q_2 - A^2)}{n_{+} n_{-}}, \]

where \(Q_1 = A/(2 - A)\) and \(Q_2 = 2A^2/(1 + A)\) under a common exponential approximation. A Gaussian interval is then \(A \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(A)}\). The \(Q_1\) and \(Q_2\) approximations assume a particular score distribution and can be inaccurate when that assumption is violated, so the analytic interval is best treated as a quick estimate rather than a definitive one.
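The formula above can be evaluated as follows. This is a sketch under the same exponential approximation; the function name and the example values (AUC of 0.85 with 50 positives and 50 negatives) are illustrative assumptions.

```python
from math import sqrt

def hanley_mcneil_interval(auc, n_pos, n_neg, z=1.96):
    """Hanley-McNeil variance and Gaussian interval for an observed AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    half = z * sqrt(var)
    return var, (auc - half, auc + half)   # not clipped to [0, 1]

var, ci = hanley_mcneil_interval(0.85, 50, 50)
print(var, ci)
```

Like the Wald interval, this Gaussian interval is symmetric and can cross \(1\) when the AUC is high and the sample is small.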

173.3.3 3.3 The DeLong Method

The DeLong method provides a nonparametric variance estimator based on the structural components of the \(U\) statistic, the so-called placement values. It does not assume a parametric score distribution and yields asymptotically correct intervals. Crucially, DeLong extends to the comparison of two correlated AUCs, for instance two models evaluated on the same test set, by estimating the covariance between their statistics. This makes it the standard analytic tool when comparing classifiers on shared data.

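# DeLong variance via placement values; scores_pos, scores_neg are assumed 1-D NumPy arrays
import numpy as np
psi = (scores_pos[:, None] > scores_neg[None, :]) + 0.5 * (scores_pos[:, None] == scores_neg[None, :])
V10, V01 = psi.mean(axis=1), psi.mean(axis=0)   # per-positive and per-negative components
auc = psi.mean()
var_auc = V10.var(ddof=1) / len(V10) + V01.var(ddof=1) / len(V01)
half = 1.96 * np.sqrt(var_auc)
ci = (auc - half, auc + half)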
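The correlated comparison can be sketched in pure Python. The variance of the AUC difference equals the variance of the per-example placement differences, which folds the covariance term in automatically. The function names and toy scores are assumptions; with this few examples the asymptotic interval is only a demonstration of the mechanics.

```python
from math import sqrt
from statistics import mean, variance

def psi(x, y):
    """Pair kernel: 1 if the positive outranks the negative, 0.5 on a tie, else 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0

def placements(pos, neg):
    v10 = [mean(psi(p, q) for q in neg) for p in pos]   # one value per positive
    v01 = [mean(psi(p, q) for p in pos) for q in neg]   # one value per negative
    return v10, v01

def delong_difference(pos_a, neg_a, pos_b, neg_b, z=1.96):
    """Interval on AUC_A - AUC_B; pos_a[i] and pos_b[i] must score the same example."""
    v10_a, v01_a = placements(pos_a, neg_a)
    v10_b, v01_b = placements(pos_b, neg_b)
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    diff = mean(v10_a) - mean(v10_b)                    # AUC_A - AUC_B
    var = variance(d10) / len(d10) + variance(d01) / len(d01)
    half = z * sqrt(var)
    return diff, (diff - half, diff + half)

# Toy scores from two models on the same four positives and four negatives.
pos_a, neg_a = [0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.65, 0.2]
pos_b, neg_b = [0.8, 0.5, 0.6, 0.3], [0.4, 0.55, 0.7, 0.1]
diff, ci = delong_difference(pos_a, neg_a, pos_b, neg_b)
print(diff, ci)
```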

173.3.4 3.4 Bootstrap Intervals as a General Fallback

When the metric is complex, or when assumptions behind analytic formulas are doubtful, the bootstrap offers a distribution-free alternative. Resample the test set with replacement \(B\) times, recompute the metric on each resample, and form an interval from the resulting distribution. The percentile interval takes the empirical \(\alpha/2\) and \(1 - \alpha/2\) quantiles of the bootstrap replicates. The bias-corrected and accelerated (BCa) variant adjusts for bias and skew and generally yields better coverage at modest extra cost.

For AUC and for any metric involving ranking, stratified resampling that preserves the number of positives and negatives is important, since the metric is undefined or unstable when a resample contains only one class. The bootstrap is also the most natural way to handle clustered data: resample whole clusters rather than individual examples to respect the dependence structure.
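A stratified percentile bootstrap for AUC might look like the sketch below. Positives and negatives are resampled separately, so every replicate keeps the original class counts. The synthetic Gaussian scores, the replicate count, and the function names are assumptions for illustration; the BCa adjustment is not implemented here.

```python
import random

def auc(pos, neg):
    """Empirical AUC over all positive-negative pairs, ties counted as one half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def stratified_bootstrap_auc(pos, neg, n_boot=500, alpha=0.05, seed=0):
    """Percentile interval for AUC, resampling each class separately."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        bp = rng.choices(pos, k=len(pos))   # resample positives among positives
        bn = rng.choices(neg, k=len(neg))   # resample negatives among negatives
        reps.append(auc(bp, bn))
    reps.sort()
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic scores: positives shifted upward by one standard deviation.
data_rng = random.Random(1)
pos = [data_rng.gauss(1.0, 1.0) for _ in range(40)]
neg = [data_rng.gauss(0.0, 1.0) for _ in range(60)]
point = auc(pos, neg)
lo, hi = stratified_bootstrap_auc(pos, neg)
print(point, (lo, hi))
```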

A caution: the bootstrap's justification is asymptotic, so it does not rescue a test set that is too small. With very few positives, say fewer than \(20\), both analytic and bootstrap intervals become unreliable, and the honest move is to report the small sample size prominently and resist over-interpreting the point estimate.

173.4 4. Reporting Uncertainty in Model Comparison

The most consequential use of confidence intervals is deciding whether model \(A\) truly outperforms model \(B\). Here a common and serious error is to compare two separately computed intervals by eye and conclude there is no difference because they overlap.

173.4.1 4.1 The Overlap Fallacy

Overlapping confidence intervals do not imply a non-significant difference. Two intervals can overlap substantially while the difference between the estimates is statistically significant. The correct object of inference is the interval on the difference \(\theta_A - \theta_B\), not the two marginal intervals. The variance of a difference depends on the covariance between the two estimates, and ignoring that covariance discards exactly the information that makes paired comparison powerful.

173.4.2 4.2 Paired Evaluation on the Same Test Set

When both models are evaluated on the same examples, their errors are correlated, and a paired analysis is both correct and far more sensitive. For accuracy, McNemar’s test focuses on discordant examples: let \(b\) be the count where \(A\) is correct and \(B\) is wrong, and \(c\) the reverse. The remaining examples, where both models are correct or both are wrong, carry no information about the difference. The test statistic is

\[ \chi^2 = \frac{(b - c)^2}{b + c}, \]

which is approximately chi-squared with one degree of freedom under the null hypothesis of equal accuracy, and a confidence interval for the accuracy difference can be built on the paired proportions. The intuition is that if both models fail on the same hard examples, those examples tell us nothing about which model is better.
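A minimal sketch of the paired analysis follows. It uses the statistic above without a continuity correction, obtains the p-value from the chi-squared distribution with one degree of freedom, and adds a Wald-type interval for the paired difference \((b - c)/n\). The counts are made up, the function name is hypothetical, and for small \(b + c\) an exact binomial test would be the safer choice.

```python
from math import erfc, sqrt

def mcnemar(b, c, n, z=1.96):
    """McNemar statistic, p-value, and Wald interval for the paired accuracy difference.

    b: A correct and B wrong; c: A wrong and B correct; n: total test examples.
    Assumes b + c > 0.
    """
    chi2 = (b - c) ** 2 / (b + c)
    p_value = erfc(sqrt(chi2 / 2))            # chi-squared upper tail, one degree of freedom
    diff = (b - c) / n                        # accuracy(A) - accuracy(B)
    se = sqrt(b + c - (b - c) ** 2 / n) / n   # paired-proportion standard error
    return chi2, p_value, (diff - z * se, diff + z * se)

chi2, p_value, ci = mcnemar(b=30, c=15, n=1000)
print(chi2, p_value, ci)
```

Note that only 45 of the 1000 examples enter the test statistic; the 955 concordant examples affect only the scale of the difference.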

For AUC, the DeLong test for two correlated curves plays the analogous role, using the estimated covariance to form an interval on \(\text{AUC}_A - \text{AUC}_B\). The paired bootstrap is the general-purpose alternative: on each resample, recompute both metrics and record their difference, then take quantiles of the difference distribution. Because both models see the same resample, the shared variation cancels and the interval on the difference is appropriately tight.

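# Paired bootstrap on a difference; metric(model, idx), model_a, model_b, n are assumed defined
import numpy as np
rng = np.random.default_rng(0)
n_boot = 2000                                   # number of bootstrap replicates (B in the text)
diff = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)            # same indices for both models
    diff[b] = metric(model_a, idx) - metric(model_b, idx)
ci_diff = np.quantile(diff, [0.025, 0.975])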

If the interval on the difference excludes zero, the comparison is significant at the corresponding level; if it straddles zero, the data do not support a confident ranking.

173.4.3 4.3 Multiple Comparisons and Selection Effects

Benchmarks rarely compare two models in isolation. Leaderboards rank dozens, and each pairwise comparison is an opportunity for a false positive. Reporting one nominal \(95\%\) interval per comparison all but guarantees that some apparent winners are noise. When many models or many test slices are compared, control the family-wise error rate or the false discovery rate, for example with a Bonferroni or Benjamini-Hochberg adjustment, and widen the intervals accordingly. A related and underappreciated hazard is the winner’s curse: the model that tops a leaderboard tends to have benefited from favorable noise, so its test score is an optimistically biased estimate of its true performance. The model selected as best should be re-evaluated on fresh data before its headline number is trusted.
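The two adjustments named above can be sketched as follows. The p-values are invented for illustration and the function names are hypothetical; the Benjamini-Hochberg version shown is the standard step-up procedure.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * q / m; controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

p_values = [0.001, 0.012, 0.02, 0.045, 0.30]   # illustrative comparison p-values
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(bonf)   # only the smallest p-value survives
print(bh)     # the three smallest survive
```

Bonferroni is the stricter of the two, which is the price of controlling the chance of any false positive rather than the expected fraction of false positives.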

173.4.4 4.4 A Reporting Checklist

Sound reporting practice can be summarized compactly. State the test set size and the number of positives and negatives, since these determine the achievable precision. Report a confidence interval, not just a point estimate, and name the method used to compute it. For comparisons, report the interval on the difference rather than two separate intervals, and use a paired method when the models share a test set. Disclose any clustering or dependence in the data and account for it through cluster-aware variance or a clustered bootstrap. Finally, if multiple models or slices were compared, state the correction applied. None of these steps is expensive, and together they convert a brittle single number into a defensible scientific claim.

The discipline of attaching uncertainty to every reported metric does more than satisfy reviewers. It changes how teams reason about progress: a model that is one point better, but whose interval on the difference straddles zero, is not a confirmed improvement, and treating it as one wastes engineering effort chasing noise. Confidence intervals are the instrument that distinguishes signal from sampling variation, and they belong in every evaluation report.

173.5 References

  1. Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association. https://www.tandfonline.com/doi/abs/10.1080/01621459.1927.10502953
  2. Brown, L. D., Cai, T. T., DasGupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science. https://projecteuclid.org/journals/statistical-science/volume-16/issue-2/Interval-Estimation-for-a-Binomial-Proportion/10.1214/ss/1009213286.full
  3. Clopper, C. J., Pearson, E. S. (1934). The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika. https://academic.oup.com/biomet/article-abstract/26/4/404/291538
  4. Hanley, J. A., McNeil, B. J. (1982). The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology. https://pubs.rsna.org/doi/10.1148/radiology.143.1.7063747
  5. DeLong, E. R., DeLong, D. M., Clarke-Pearson, D. L. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves. Biometrics. https://www.jstor.org/stable/2531595
  6. Efron, B., Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall. https://www.taylorfrancis.com/books/mono/10.1201/9780429246593/introduction-bootstrap-bradley-efron-robert-tibshirani
  7. Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation. https://direct.mit.edu/neco/article/10/7/1895/6224
  8. Benjamini, Y., Hochberg, Y. (1995). Controlling the False Discovery Rate. Journal of the Royal Statistical Society Series B. https://www.jstor.org/stable/2346101