173 Confidence Intervals for Model Performance

173.1 1. Why a Single Metric Is Not Enough

When a model achieves 91.3% accuracy on a held-out test set, the natural temptation is to treat that number as a property of the model. It is not. It is a property of the model evaluated on one particular finite sample drawn from some underlying distribution. Had we drawn a different test set of the same size from the same population, we would almost certainly have observed a different number. The reported metric is therefore a realization of a random variable, and a point estimate without a measure of dispersion conceals how much that realization could have varied.

This matters for three practical reasons. First, model selection decisions are frequently made on differences of a percentage point or less, and such differences may lie entirely within the noise floor of the evaluation. Second, regulatory and scientific reporting standards increasingly demand quantified uncertainty rather than bare scores. Third, deployment risk assessment depends on the plausible range of performance, not on a single optimistic estimate.

Formally, let the test set consist of $n$ independent and identically distributed examples drawn from a distribution $\mathcal{D}$. A metric such as accuracy is a statistic $\hat{\theta}_n$ that estimates a population quantity $\theta = \mathbb{E}_{\mathcal{D}}[\,\cdot\,]$. A confidence interval at level $1 - \alpha$ is a data-dependent interval $[L, U]$ such that, under repeated sampling of test sets,

\[ \Pr\big(L \le \theta \le U\big) \ge 1 - \alpha . \]

The interval is random; the parameter $\theta$ is fixed. The frequentist guarantee is about coverage across hypothetical replications, not about the probability that any single computed interval contains $\theta$. This distinction is subtle but governs how the interval should be interpreted and reported.

A useful mental model is that the width of the interval scales like $1/\sqrt{n}$. Halving the width of the interval therefore requires roughly quadrupling the test set. Teams that obsess over a fourth significant figure of accuracy on a thousand-example test set are, in effect, reading tea leaves.

To make the scale concrete, the standard error of a proportion is largest at $\hat{p} = 0.5$, where $\sqrt{\hat{p}(1-\hat{p})} = 0.5$. The half-width of a $95\%$ Wald interval is then $1.96 \times 0.5 / \sqrt{n} \approx 0.98/\sqrt{n}$. A test set of $n = 100$ buys a half-width near $\pm 0.098$, $n = 1{,}000$ tightens it to about $\pm 0.031$, and reaching $\pm 0.01$ requires roughly $n = 9{,}600$. This is the worst case; near the boundaries the interval is narrower, but it is also where the symmetric Wald form breaks down, as the next section explains.
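
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the worst-case Wald half-width only; the function names `worst_case_half_width` and `n_for_half_width` are illustrative and not part of any library.

```python
from math import ceil, sqrt

Z95 = 1.96  # two-sided 95% normal quantile used in the text


def worst_case_half_width(n: int, z: float = Z95) -> float:
    """Wald half-width at p_hat = 0.5, the widest case for a proportion."""
    return z * 0.5 / sqrt(n)


def n_for_half_width(target: float, z: float = Z95) -> int:
    """Smallest n whose worst-case Wald half-width is at most `target`."""
    return ceil((z * 0.5 / target) ** 2)


for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: half-width about {worst_case_half_width(n):.3f}")
print("n needed for +/- 0.01:", n_for_half_width(0.01))
```

Because the required $n$ grows with the inverse square of the target half-width, each extra decimal place of precision costs a hundredfold increase in test examples.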

173.1.1 1.1 Sources of Uncertainty

The intervals in this chapter quantify one specific source of variation, the random draw of the test set, while holding the trained model fixed. This is the right object when the question is “how well does this deployed model generalize.” It is not the only source of variation in a typical machine learning pipeline, and conflating the sources leads to misleading claims.

Source	What varies	What it answers
Test sampling	The held-out examples	How precisely is this fixed model’s score known
Training randomness	Seeds, initialization, data order	How stable is the training procedure
Train/test split	Which examples are held out	How much does the estimate depend on the partition

The classical intervals here address only the first row. Variation from retraining under different seeds, or from different cross-validation folds, is a separate and often larger quantity, and it must be estimated by actually repeating training, not by a binomial formula. A single train/test split cannot reveal split-induced variance at all. Cross-validation addresses that, but its folds overlap and share training data, so naive variance estimates across folds are optimistically narrow. Bengio and Grandvalet (reference 9) show that no unbiased estimator of the variance of the cross-validation estimate exists in general, which is why repeated independent test sets remain the gold standard when they are affordable.

173.2 2. Analytic Intervals for Accuracy

Accuracy is the mean of a Bernoulli indicator: each test example is either classified correctly ($1$) or not ($0$). If $\hat{p}$ is the observed accuracy and $n$ the test size, then $n\hat{p}$ is a binomial count, and the entire machinery of binomial proportion intervals applies.

173.2.1 2.1 The Wald Interval and Its Failure Modes

The textbook interval invokes the central limit theorem to approximate the sampling distribution of $\hat{p}$ as Gaussian:

\[ \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \]

where $z_{1-\alpha/2} = 1.96$ for a 95% interval. This Wald interval is ubiquitous because it is trivial to compute, and it is wrong in exactly the situations practitioners care about most. When $\hat{p}$ is close to $0$ or $1$, which is precisely the regime of strong models, the normal approximation degrades badly. The interval can extend below $0$ or above $1$, and its actual coverage can fall well under the nominal $95\%$. For a model at $99\%$ accuracy on $200$ examples, the Wald interval is approximately $[0.976, 1.004]$: the upper endpoint is an impossible accuracy, and the symmetric form ignores that only two errors were observed.

The variance estimate $\hat{p}(1-\hat{p})/n$ also collapses to zero as $\hat{p} \to 1$, which absurdly implies near-perfect certainty exactly when the data are most sparse in the minority outcome. The degenerate case is stark: if a model is correct on all $n$ examples, then $\hat{p} = 1$, the estimated variance is $0$, and the Wald interval is the single point $[1, 1]$, asserting with certainty that the model never errs. No finite sample can justify that claim. Brown, Cai, and DasGupta (reference 2) document that even away from the boundary the Wald coverage oscillates erratically and often falls well below the nominal level, and that this behavior persists at sample sizes far larger than the usual textbook rules of thumb suggest.
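
Both failure modes are easy to see numerically. The sketch below implements the Wald formula exactly as written above, without any clipping; `wald_interval` is an illustrative helper, not a library function.

```python
from math import sqrt


def wald_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Unclipped Wald interval for k successes in n trials."""
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


print(wald_interval(198, 200))  # upper endpoint lies above 1
print(wald_interval(200, 200))  # collapses to the single point 1.0
```

Clipping the endpoints to $[0, 1]$ hides the first symptom but does nothing for coverage, and it cannot repair the zero-width interval at perfect accuracy.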

173.2.2 2.2 The Wilson Score Interval

A far better default inverts the score test rather than plugging in the observed proportion as the variance. The Wilson interval solves for the set of $p$ values not rejected by the test, yielding

\[ \frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}, \]

with $z = z_{1-\alpha/2}$. The Wilson interval is always contained in $[0, 1]$, behaves sensibly at the boundaries, and maintains close-to-nominal coverage even for small $n$ and extreme $\hat{p}$. It should be the default reporting interval for accuracy, precision, recall, and any other proportion-based metric.

The formula rewards a little interpretation. The center is not $\hat{p}$ but a shrinkage of $\hat{p}$ toward $1/2$: rewriting the center as $(n\hat{p} + z^2/2)/(n + z^2)$ shows that the interval behaves as though $z^2/2$ successful and $z^2/2$ failed pseudo-observations were added to the data. At $95\%$ confidence $z^2 \approx 3.84$, so this is close to adding two successes and two failures, which is exactly the rounding that produces the simpler Agresti-Coull interval. That pseudo-count is why Wilson never collapses to a point at the boundary: even when $\hat{p} = 1$, the center is pulled inward and the half-width stays strictly positive.

A concrete case makes the contrast with Wald vivid. Suppose a classifier is correct on $196$ of $200$ test examples, so $\hat{p} = 0.98$. The Wald half-width is $1.96\sqrt{0.98 \times 0.02 / 200} \approx 0.0194$, giving the interval $[0.961, 0.999]$, which is symmetric and crowds against the boundary. The Wilson interval for the same data is approximately $[0.950, 0.992]$: shifted downward, asymmetric, with a longer reach toward smaller $p$, correctly reflecting that with only four errors observed the true error rate could plausibly be appreciably higher than $2\%$ but cannot be much lower. If instead the model were correct on all $200$ examples, Wald would report $[1, 1]$ while Wilson reports roughly $[0.981, 1]$, preserving an honest lower bound. The so-called rule of three is a handy memory aid here: when zero failures occur in $n$ trials, an approximate $95\%$ upper bound on the failure rate is $3/n$, so $0$ errors in $200$ trials is consistent with a true error rate as high as about $1.5\%$.

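# Wilson 95% interval for k correct out of n (here 196 of 200)
from math import sqrt
k, n, z = 196, 200, 1.96
p_hat = k / n
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
ci = (center - half, center + half)  # approximately (0.950, 0.992)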
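
The coverage claims can be checked without simulation, because for a fixed true $p$ the coverage is a finite sum over the binomial outcomes. The following sketch enumerates every possible count at $n = 200$ and $p = 0.98$ and adds up the probability of the outcomes whose interval contains $p$. The helper names are illustrative, and the choice of $n$ and $p$ simply mirrors the worked example.

```python
from math import comb, sqrt

Z = 1.96


def wald(k, n):
    p = k / n
    h = Z * sqrt(p * (1 - p) / n)
    return p - h, p + h


def wilson(k, n):
    p = k / n
    denom = 1 + Z**2 / n
    center = (p + Z**2 / (2 * n)) / denom
    h = Z * sqrt(p * (1 - p) / n + Z**2 / (4 * n**2)) / denom
    return center - h, center + h


def exact_coverage(interval, n, p_true):
    """Probability, under Binomial(n, p_true), that the interval contains p_true."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p_true <= hi:
            total += comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
    return total


n, p_true = 200, 0.98
cov_wald = exact_coverage(wald, n, p_true)
cov_wilson = exact_coverage(wilson, n, p_true)
print(f"Wald   coverage at p = {p_true}: {cov_wald:.3f}")
print(f"Wilson coverage at p = {p_true}: {cov_wilson:.3f}")
```

At this particular $p$ the Wald coverage comes out clearly below $95\%$ and below Wilson's. Sweeping `p_true` over a grid shows the oscillation with $p$ that makes any single-point check only indicative.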

173.2.3 2.3 Clopper-Pearson and Exactness

When strict coverage guarantees are required, the Clopper-Pearson interval inverts the exact binomial test using beta-distribution quantiles. It guarantees coverage of at least $1 - \alpha$ for every true $p$, but because the binomial is discrete, it is conservative: actual coverage often exceeds the nominal level, producing intervals wider than necessary. Clopper-Pearson is appropriate when undercoverage is unacceptable, for example in safety-critical certification, whereas Wilson is the better all-purpose choice when honest average coverage is the goal.
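
As an illustration of what inverting the exact binomial test means, the sketch below finds the Clopper-Pearson limits by bisection on the binomial tail probabilities, using only the standard library. It is equivalent in principle to the beta-quantile form but is written for transparency, not speed; the helper names are illustrative, and in practice a tested library routine is preferable.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=60):
    """Root of a function that is decreasing in p on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided interval for k successes in n trials."""
    # Lower limit: the p at which P(X >= k) equals alpha/2 (0 when k == 0).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper limit: the p at which P(X <= k) equals alpha/2 (1 when k == n).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


print(clopper_pearson(196, 200))
print(clopper_pearson(200, 200))
```

When all $n$ trials succeed, the lower limit has the closed form $(\alpha/2)^{1/n}$, which gives a convenient check on the bisection.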

173.2.4 2.4 Choosing Among Them

For routine reporting, Wilson is the recommended default. Reserve Clopper-Pearson for conservative guarantees and avoid Wald except as a rough mental approximation when $n$ is large and $\hat{p}$ is near $0.5$. The following diagram summarizes the choice.

flowchart TD
    A["Proportion metric (accuracy, precision, recall)"] --> B{"Need guaranteed coverage at or above 1 minus alpha"}
    B -->|"Yes (safety certification)"| C["Clopper-Pearson exact interval"]
    B -->|"No (honest average coverage)"| D{"Are test examples independent"}
    D -->|"Yes"| E["Wilson score interval (default)"]
    D -->|"No (shared document or patient)"| F["Clustered bootstrap or cluster-aware variance"]

None of the closed-form intervals account for the fact that examples may be correlated, for instance when multiple test cases share a document, a patient, or a user session. In clustered settings the effective sample size is smaller than $n$. A useful approximation is the design effect $\mathrm{DEFF} = 1 + (\bar{m} - 1)\rho$, where $\bar{m}$ is the average cluster size and $\rho$ is the intra-cluster correlation; the effective sample size is $n_{\mathrm{eff}} = n / \mathrm{DEFF}$. With ten correlated sentences per document and even a modest $\rho = 0.3$, the design effect is about $3.7$, so a nominal $1{,}000$-example test set carries the information of roughly $270$ independent examples. Treating it as $1{,}000$ makes every interval above too narrow by a factor near $\sqrt{3.7} \approx 1.9$. The honest remedy is a cluster-aware variance estimate or a clustered bootstrap that resamples whole clusters.
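
The design-effect arithmetic is simple enough to keep as a helper. This sketch assumes equal-sized clusters and a known intra-cluster correlation, which in practice must itself be estimated; the function names are illustrative.

```python
from math import sqrt


def design_effect(mean_cluster_size: float, icc: float) -> float:
    """DEFF = 1 + (m_bar - 1) * rho."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_sample_size(n: int, mean_cluster_size: float, icc: float) -> float:
    """n_eff = n / DEFF."""
    return n / design_effect(mean_cluster_size, icc)


deff = design_effect(10, 0.3)
n_eff = effective_sample_size(1_000, 10, 0.3)
inflation = sqrt(deff)  # factor by which naive interval widths are too narrow
print(f"DEFF = {deff:.2f}, n_eff = {n_eff:.0f}, width inflation = {inflation:.2f}")
```

Substituting `n_eff` for $n$ in the Wilson formula gives a rough adjusted interval, though it is only an approximation and not a replacement for a cluster-aware method.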

173.3 3. Intervals for AUC

The area under the receiver operating characteristic curve summarizes ranking quality across all thresholds. Unlike accuracy, AUC is not a simple mean of independent Bernoulli trials, so its uncertainty requires more care.

173.3.1 3.1 AUC as a Probability and the Mann-Whitney Connection

The AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example:

\[ \text{AUC} = \Pr\big(s(X^{+}) > s(X^{-})\big) . \]

Its empirical estimator is the normalized Mann-Whitney $U$ statistic, computed over all $n_{+} n_{-}$ positive-negative pairs, with tied pairs conventionally counted as one half. This pairwise structure means the variance depends not only on the counts $n_{+}$ and $n_{-}$ but also on how scores are distributed.
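
The pairwise definition translates directly into code. The sketch below is a deliberately naive $O(n_{+} n_{-})$ implementation that counts ties as one half, which is the usual convention but an assumption here; the scores are made up for illustration.

```python
def empirical_auc(scores_pos, scores_neg):
    """Fraction of positive-negative pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


pos = [0.9, 0.8, 0.4]       # hypothetical scores of positive examples
neg = [0.7, 0.3, 0.2, 0.1]  # hypothetical scores of negative examples
auc = empirical_auc(pos, neg)
print(f"AUC = {auc:.4f}")   # 11 of the 12 pairs are ordered correctly
```

Only the positive scored $0.4$ is outranked, and only by the negative scored $0.7$, so the estimate is $11/12$.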

173.3.2 3.2 The Hanley-McNeil Analytic Interval

Hanley and McNeil derived a widely used variance estimate for AUC, denoted $A$:

\[ \widehat{\mathrm{Var}}(A) = \frac{A(1-A) + (n_{+}-1)(Q_1 - A^2) + (n_{-}-1)(Q_2 - A^2)}{n_{+} n_{-}}, \]

where $Q_1 = A/(2 - A)$ and $Q_2 = 2A^2/(1 + A)$ under a common exponential approximation. A Gaussian interval is then $A \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(A)}$. The $Q_1$ and $Q_2$ approximations assume a particular score distribution and can be inaccurate when that assumption is violated, so the analytic interval is best treated as a quick estimate rather than a definitive one.
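
A direct transcription of the Hanley-McNeil formula follows. The clipping of the endpoints to $[0, 1]$ is an addition made here for tidiness and is not part of the formula above, and the example AUC and class counts are hypothetical.

```python
from math import sqrt


def hanley_mcneil_interval(auc, n_pos, n_neg, z=1.96):
    """Gaussian interval for AUC using the Hanley-McNeil variance estimate."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    half = z * sqrt(var)
    # Clipping to [0, 1] is an assumption added here, not part of the formula.
    return max(0.0, auc - half), min(1.0, auc + half)


lo, hi = hanley_mcneil_interval(0.85, n_pos=50, n_neg=150)
print(f"AUC 0.85, 95% interval about ({lo:.3f}, {hi:.3f})")
```

The interval depends on both class counts, and the smaller class usually dominates the width, so adding negatives alone yields diminishing returns.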

173.3.3 3.3 The DeLong Method

The DeLong method provides a nonparametric variance estimator based on the structural components of the $U$ statistic, the so-called placement values. It does not assume a parametric score distribution and yields asymptotically correct intervals. Crucially, DeLong extends to the comparison of two correlated AUCs, for instance two models evaluated on the same test set, by estimating the covariance between their statistics. This makes it the standard analytic tool when comparing classifiers on shared data.

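# DeLong variance for a single AUC (NumPy sketch; ties count as 1/2)
import numpy as np
pos, neg = np.asarray(scores_pos, float), np.asarray(scores_neg, float)
psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
V10, V01 = psi.mean(axis=1), psi.mean(axis=0)  # placement values per example
auc = psi.mean()
var_auc = V10.var(ddof=1) / len(pos) + V01.var(ddof=1) / len(neg)
ci = (auc - 1.96 * np.sqrt(var_auc), auc + 1.96 * np.sqrt(var_auc))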

173.3.4 3.4 Bootstrap Intervals as a General Fallback

When the metric is complex, or when assumptions behind analytic formulas are doubtful, the bootstrap offers a distribution-free alternative. The principle is to treat the empirical distribution $\hat{F}_n$ of the test set as a stand-in for the unknown $\mathcal{D}$, and to estimate the sampling variability of $\hat{\theta}_n$ by repeatedly drawing samples of size $n$ from $\hat{F}_n$. Concretely, resample the test set with replacement $B$ times, recompute the metric $\hat{\theta}^{*}_b$ on each resample, and form an interval from the resulting distribution $\{\hat{\theta}^{*}_1, \dots, \hat{\theta}^{*}_B\}$.

The percentile interval takes the empirical $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap replicates directly. It is simple and transformation-respecting, but it can miscover when the estimator is biased or when its sampling distribution is skewed, both common for ranking metrics. The bias-corrected and accelerated (BCa) interval of Efron and Tibshirani (reference 6) corrects this. It computes a bias-correction term $\hat{z}_0 = \Phi^{-1}\!\big(\#\{\hat{\theta}^{*}_b < \hat{\theta}_n\}/B\big)$, the normal quantile of the fraction of replicates below the observed estimate, and an acceleration term $\hat{a}$ estimated from the jackknife skewness of the statistic. It then reads off adjusted percentiles

\[ \alpha_1 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right), \]

and the analogous $\alpha_2$ for the upper end, taking the $\alpha_1$ and $\alpha_2$ quantiles of the replicates. When $\hat{z}_0 = 0$ and $\hat{a} = 0$ this reduces exactly to the percentile interval, so BCa is a strict refinement. Choose $B$ large enough that the quantiles are stable: $B = 1{,}000$ suffices for a rough interval, while $B \ge 10{,}000$ is prudent for the tail-sensitive BCa endpoints.
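
To make the BCa recipe concrete, here is a standard-library sketch that follows the steps above for a generic statistic. It uses the strict-inequality bias correction as written in the text, so it assumes a statistic with essentially no ties among replicates; discrete metrics such as accuracy need explicit tie handling. The function name, the default `B`, and the synthetic loss data are illustrative assumptions, and a tested library implementation is preferable in practice.

```python
import random
from statistics import NormalDist


def bca_interval(data, stat, B=2000, alpha=0.05, seed=0):
    """BCa bootstrap interval for stat(data); data is a list of i.i.d. values."""
    rng = random.Random(seed)
    nd = NormalDist()
    n = len(data)
    theta = stat(data)

    # Bootstrap replicates, sorted so quantiles can be read off by index.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(B))

    # Bias correction: normal quantile of the fraction of replicates below theta.
    frac = sum(r < theta for r in reps) / B
    frac = min(max(frac, 1 / (B + 1)), B / (B + 1))  # keep inv_cdf finite
    z0 = nd.inv_cdf(frac)

    # Acceleration from the jackknife (leave-one-out) values.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = sum(jack) / n
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted(z):
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    a1 = adjusted(nd.inv_cdf(alpha / 2))
    a2 = adjusted(nd.inv_cdf(1 - alpha / 2))
    lo = reps[min(B - 1, int(a1 * B))]
    hi = reps[min(B - 1, int(a2 * B))]
    return lo, hi


def mean(xs):
    return sum(xs) / len(xs)


data_rng = random.Random(1)
# Hypothetical right-skewed per-example losses, used only for illustration.
losses = [data_rng.expovariate(1.0) for _ in range(100)]
lo, hi = bca_interval(losses, mean)
print(f"mean loss {mean(losses):.3f}, 95% BCa interval ({lo:.3f}, {hi:.3f})")
```

For a right-skewed statistic like this one, both adjusted percentiles shift upward relative to $2.5\%$ and $97.5\%$, which is the asymmetry the plain percentile interval misses.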

For AUC and for any metric involving ranking, stratified resampling that preserves the number of positives and negatives is important, since the metric is undefined or unstable when a resample contains only one class. The bootstrap is also the most natural way to handle clustered data: resample whole clusters rather than individual examples to respect the dependence structure.
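
A stratified percentile bootstrap for AUC can be sketched as follows. Positives and negatives are resampled separately, so every resample keeps the original class counts. The data are synthetic, and the helper names, seed, and `B = 1000` are illustrative choices.

```python
import random


def empirical_auc(pos, neg):
    """Fraction of positive-negative pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for sp in pos:
        for sn in neg:
            wins += 1.0 if sp > sn else 0.5 if sp == sn else 0.0
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_auc(pos, neg, B=1000, alpha=0.05, seed=0):
    """Percentile interval for AUC, resampling within each class."""
    rng = random.Random(seed)
    reps = sorted(
        empirical_auc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(B)
    )
    lo = reps[int(alpha / 2 * (B - 1))]
    hi = reps[int((1 - alpha / 2) * (B - 1))]
    return lo, hi


data_rng = random.Random(1)
pos = [data_rng.gauss(1.0, 1.0) for _ in range(40)]  # synthetic positive scores
neg = [data_rng.gauss(0.0, 1.0) for _ in range(60)]  # synthetic negative scores
auc_hat = empirical_auc(pos, neg)
lo, hi = stratified_bootstrap_auc(pos, neg)
print(f"AUC {auc_hat:.3f}, 95% percentile interval ({lo:.3f}, {hi:.3f})")
```

For clustered data the same loop would draw cluster identifiers with replacement and pool the examples of each drawn cluster, instead of drawing individual scores.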

A caution: the bootstrap's guarantees are themselves asymptotic. With very few positives, say fewer than $20$, both analytic and bootstrap intervals become unreliable, and the honest move is to report the small sample size prominently and resist over-interpreting the point estimate.

173.4 4. Reporting Uncertainty in Model Comparison

The most consequential use of confidence intervals is deciding whether model $A$ truly outperforms model $B$. Here a common and serious error is to compare the two models' separate intervals by eye and conclude there is no difference because they overlap.

173.4.1 4.1 The Overlap Fallacy

Overlapping confidence intervals do not imply a non-significant difference. Two intervals can overlap substantially while the difference between the estimates is statistically significant. The correct object of inference is the interval on the difference $\theta_A - \theta_B$, not the two marginal intervals. The variance of a difference depends on the covariance between the two estimates, and ignoring that covariance discards exactly the information that makes paired comparison powerful.

173.4.2 4.2 Paired Evaluation on the Same Test Set

When both models are evaluated on the same examples, their errors are correlated, and a paired analysis is both correct and far more sensitive. For accuracy, McNemar’s test focuses on discordant examples: let $b$ be the count where $A$ is correct and $B$ is wrong, and $c$ the reverse. The other examples, where both agree, carry no information about the difference. The test statistic is

\[ \chi^2 = \frac{(b - c)^2}{b + c}, \]

which under the null hypothesis of equal accuracy is approximately chi-squared with one degree of freedom. When $b + c$ is small, say below $25$, the continuity-corrected form $\big(|b - c| - 1\big)^2/(b + c)$ or an exact binomial test on $b$ out of $b + c$ is preferred. A confidence interval for the accuracy difference $\hat{p}_A - \hat{p}_B = (b - c)/n$ can be built on the paired proportions; the variance of the difference is $\big((b + c) - (b - c)^2/n\big)/n^2$, which is smaller than the unpaired variance precisely because the concordant examples drop out. The intuition is that if both models fail on the same hard examples, those examples tell us nothing about which model is better, so conditioning on the discordant pairs is what makes the paired test more powerful. Dietterich (reference 7) found McNemar to be among the few tests with acceptable Type I error for comparing classifiers on a single test set.

A small worked instance: on $n = 1{,}000$ examples suppose model $A$ is right and $B$ wrong on $b = 45$ cases, while $B$ is right and $A$ wrong on $c = 25$. The raw accuracies differ by only $(45 - 25)/1000 = 0.020$, that is $2.0$ percentage points, which an unpaired eyeball comparison might dismiss. The McNemar statistic is $(45 - 25)^2 / (45 + 25) = 400/70 \approx 5.71$, exceeding the critical value of $3.84$ for a chi-squared test with one degree of freedom at the $5\%$ level, so the difference is significant. The shared $930$ concordant examples never enter the calculation, which is exactly why the paired test sees a signal the marginal intervals would hide.
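
The worked instance can be reproduced with the standard library alone. The sketch computes the uncorrected McNemar statistic, its chi-squared p-value (for one degree of freedom the survival function equals $\operatorname{erfc}(\sqrt{x/2})$), an exact two-sided binomial p-value on the discordant pairs, and the Wald-type interval on the paired difference using the variance given above. The function name and return layout are illustrative.

```python
from math import comb, erfc, sqrt


def mcnemar_summary(b: int, c: int, n: int, z: float = 1.96):
    """McNemar statistic, p-values, and a Wald interval on the paired difference."""
    chi2 = (b - c) ** 2 / (b + c)
    p_chi2 = erfc(sqrt(chi2 / 2))  # survival function of chi-squared with 1 df

    # Exact two-sided binomial test: b successes out of b + c at p = 1/2.
    m = b + c
    tail = sum(comb(m, i) for i in range(min(b, c) + 1)) / 2**m
    p_exact = min(1.0, 2 * tail)

    diff = (b - c) / n
    se = sqrt((b + c) - (b - c) ** 2 / n) / n
    return chi2, p_chi2, p_exact, (diff - z * se, diff + z * se)


chi2, p_chi2, p_exact, ci = mcnemar_summary(b=45, c=25, n=1_000)
print(f"chi2 = {chi2:.2f}, p (chi2) = {p_chi2:.4f}, p (exact) = {p_exact:.4f}")
print(f"difference interval: ({ci[0]:.4f}, {ci[1]:.4f})")
```

Here the interval on the difference excludes zero, in agreement with the test, although its lower end sits close to zero and the evidence should be read accordingly.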

For AUC, the DeLong test for two correlated curves plays the analogous role, using the estimated covariance to form an interval on $\text{AUC}_A - \text{AUC}_B$. The paired bootstrap is the general purpose alternative: on each resample, recompute both metrics and record their difference, then take quantiles of the difference distribution. Because both models see the same resample, the shared variation cancels and the interval on the difference is appropriately tight.

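# Paired bootstrap on a difference (NumPy sketch)
# Assumes arrays y, pred_a, pred_b of length n and a callable metric(y, pred)
import numpy as np
rng = np.random.default_rng(0)
B, n = 10_000, len(y)
diff = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, size=n)  # same indices for both models
    diff[i] = metric(y[idx], pred_a[idx]) - metric(y[idx], pred_b[idx])
ci_diff = np.quantile(diff, [0.025, 0.975])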

If the interval on the difference excludes zero, the comparison is significant at the corresponding level; if it straddles zero, the data do not support a confident ranking.

173.4.3 4.3 Multiple Comparisons and Selection Effects

Benchmarks rarely compare two models in isolation. Leaderboards rank dozens, and each pairwise comparison is an opportunity for a false positive. Reporting one nominal $95\%$ interval per comparison makes it very likely that some apparent winners are noise. When many models or many test slices are compared, control the family-wise error rate or the false discovery rate, for example with a Bonferroni or Benjamini-Hochberg adjustment, and widen the intervals accordingly. A related and underappreciated hazard is the winner’s curse: the model that tops a leaderboard tends to have benefited from favorable noise, so its test score is an optimistically biased estimate of its true performance. The model selected as best should be re-evaluated on fresh data before its headline number is trusted.
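
The two adjustments named above differ in what they control, and a small example shows how differently they behave. The sketch implements Bonferroni and the Benjamini-Hochberg step-up procedure on a hypothetical list of p-values, one per model comparison; the function names and the p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject p_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Step-up procedure controlling the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank  # largest rank whose p-value clears its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


p_vals = [0.001, 0.012, 0.025, 0.038, 0.20]  # hypothetical comparison p-values
print("Bonferroni:        ", bonferroni(p_vals))
print("Benjamini-Hochberg:", benjamini_hochberg(p_vals))
```

On these values Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the four smallest, reflecting the weaker error criterion it controls.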

173.4.4 4.4 A Reporting Checklist

Sound reporting practice can be summarized compactly. State the test set size and the number of positives and negatives, since these determine the achievable precision. Report a confidence interval, not just a point estimate, and name the method used to compute it. For comparisons, report the interval on the difference rather than two separate intervals, and use a paired method when the models share a test set. Disclose any clustering or dependence in the data and account for it through cluster-aware variance or a clustered bootstrap. Finally, if multiple models or slices were compared, state the correction applied. None of these steps is expensive, and together they convert a brittle single number into a defensible scientific claim.

173.4.5 4.5 Common Pitfalls

A handful of mistakes account for most misreported uncertainty in practice.

Reading two marginal intervals for overlap instead of forming the interval on the difference. This is the overlap fallacy of section 4.1 and it both loses power and can mislead.
Using the Wald interval near the boundary, where strong models live, and reporting endpoints outside $[0, 1]$ or a degenerate point at perfect accuracy.
Treating clustered examples as independent, which makes every interval too narrow by roughly the square root of the design effect.
Bootstrapping a ranking metric without stratifying by class, so that some resamples contain a single class and the metric is undefined or wildly unstable.
Quoting an interval to more significant figures than the test size supports, for example $\pm 0.001$ on a few hundred examples.
Forgetting the multiplicity correction after scanning many models or slices, and then trusting the winner’s headline number without re-evaluation on fresh data.

173.4.6 4.6 Open-Source Tooling

Every method in this chapter is available in mature, free, open-source libraries, so there is no reason to hand-roll error-prone code. In Python, statsmodels.stats.proportion.proportion_confint computes Wilson, Clopper-Pearson, Agresti-Coull, and Wald intervals, and proportions_ztest and statsmodels.stats.contingency_tables.mcnemar cover the comparison tests. The scipy.stats module provides bootstrap, which implements the percentile and BCa intervals directly, along with binomtest for exact binomial inference. For ROC analysis, scikit-learn supplies roc_auc_score, and the DeLong covariance estimator is available in small, auditable open-source implementations that pair naturally with it. Preferring these well-tested tools over bespoke formulas eliminates a large class of off-by-one and boundary bugs.

The discipline of attaching uncertainty to every reported metric does more than satisfy reviewers. It changes how teams reason about progress: a model that is one point better, but whose interval on the difference straddles zero, is not a confirmed improvement, and treating it as one wastes engineering effort chasing noise. Confidence intervals are the instrument that distinguishes signal from sampling variation, and they belong in every evaluation report.

173.5 References

1. Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association. https://www.tandfonline.com/doi/abs/10.1080/01621459.1927.10502953
2. Brown, L. D., Cai, T. T., DasGupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science. https://projecteuclid.org/journals/statistical-science/volume-16/issue-2/Interval-Estimation-for-a-Binomial-Proportion/10.1214/ss/1009213286.full
3. Clopper, C. J., Pearson, E. S. (1934). The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika. https://academic.oup.com/biomet/article-abstract/26/4/404/291538
4. Hanley, J. A., McNeil, B. J. (1982). The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology. https://pubs.rsna.org/doi/10.1148/radiology.143.1.7063747
5. DeLong, E. R., DeLong, D. M., Clarke-Pearson, D. L. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves. Biometrics. https://www.jstor.org/stable/2531595
6. Efron, B., Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall. https://www.taylorfrancis.com/books/mono/10.1201/9780429246593/introduction-bootstrap-bradley-efron-robert-tibshirani
7. Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation. https://direct.mit.edu/neco/article/10/7/1895/6224
8. Benjamini, Y., Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B. https://www.jstor.org/stable/2346101
9. Bengio, Y., Grandvalet, Y. (2004). No Unbiased Estimator of the Variance of K-Fold Cross-Validation. Journal of Machine Learning Research, 5, 1089-1105. https://www.jmlr.org/papers/v5/grandvalet04a.html

# Confidence Intervals for Model Performance ## 1. Why a Single Metric Is Not Enough When a model achieves 91.3% accuracy on a held-out test set, the natural temptation is to treat that number as a property of the model. It is not. It is a property of the model evaluated on one particular finite sample drawn from some underlying distribution. Had we drawn a different test set of the same size from the same population, we would almost certainly have observed a different number. The reported metric is therefore a realization of a random variable, and a point estimate without a measure of dispersion conceals how much that realization could have varied. This matters for three practical reasons. First, model selection decisions are frequently made on differences of a percentage point or less, and such differences may lie entirely within the noise floor of the evaluation. Second, regulatory and scientific reporting standards increasingly demand quantified uncertainty rather than bare scores. Third, deployment risk assessment depends on the plausible range of performance, not on a single optimistic estimate. Formally, let the test set consist of $n$ independent and identically distributed examples drawn from a distribution $\mathcal{D}$. A metric such as accuracy is a statistic $\hat{\theta}_n$ that estimates a population quantity $\theta = \mathbb{E}_{\mathcal{D}}[\,\cdot\,]$. A confidence interval at level $1 - \alpha$ is a data-dependent interval $[L, U]$ such that, under repeated sampling of test sets, $$ \Pr\big(L \le \theta \le U\big) \ge 1 - \alpha . $$ The interval is random; the parameter $\theta$ is fixed. The frequentist guarantee is about coverage across hypothetical replications, not about the probability that any single computed interval contains $\theta$. This distinction is subtle but governs how the interval should be interpreted and reported. A useful mental model is that the width of the interval scales like $1/\sqrt{n}$. 
Doubling confidence in a score requires roughly quadrupling the test set. Teams that obsess over a fourth significant figure of accuracy on a thousand-example test set are, in effect, reading tea leaves. To make the scale concrete, the standard error of a proportion is largest at $\hat{p} = 0.5$, where $\sqrt{\hat{p}(1-\hat{p})} = 0.5$. The half-width of a $95\%$ Wald interval is then $1.96 \times 0.5 / \sqrt{n} \approx 0.98/\sqrt{n}$. A test set of $n = 100$ buys a half-width near $\pm 0.098$, $n = 1{,}000$ tightens it to about $\pm 0.031$, and reaching $\pm 0.01$ requires roughly $n = 9{,}600$. This is the worst case; near the boundaries the interval is narrower, but it is also where the symmetric Wald form breaks down, as the next section explains. ### 1.1 Sources of Uncertainty The intervals in this chapter quantify one specific source of variation, the random draw of the test set, while holding the trained model fixed. This is the right object when the question is "how well does this deployed model generalize." It is not the only source of variation in a typical machine learning pipeline, and conflating the sources leads to misleading claims. | Source | What varies | What it answers | |---|---|---| | Test sampling | The held-out examples | How precisely is this fixed model's score known | | Training randomness | Seeds, initialization, data order | How stable is the training procedure | | Train/test split | Which examples are held out | How much does the estimate depend on the partition | The classical intervals here address only the first row. Variation from retraining under different seeds, or from different cross-validation folds, is a separate and often larger quantity, and it must be estimated by actually repeating training, not by a binomial formula. A single train/test split cannot reveal split-induced variance at all. 
Cross-validation addresses that, but its folds overlap and share training data, so naive variance estimates across folds are optimistically narrow. Bengio and Grandvalet (reference 9) show that no unbiased estimator of the variance of the cross-validation estimate exists in general, which is why repeated independent test sets remain the gold standard when they are affordable. ## 2. Analytic Intervals for Accuracy Accuracy is the mean of a Bernoulli indicator: each test example is either classified correctly ($1$) or not ($0$). If $\hat{p}$ is the observed accuracy and $n$ the test size, then $n\hat{p}$ is a binomial count, and the entire machinery of binomial proportion intervals applies. ### 2.1 The Wald Interval and Its Failure Modes The textbook interval invokes the central limit theorem to approximate the sampling distribution of $\hat{p}$ as Gaussian: $$ \hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, $$ where $z_{1-\alpha/2} = 1.96$ for a 95% interval. This Wald interval is ubiquitous because it is trivial to compute, and it is wrong in exactly the situations practitioners care about most. When $\hat{p}$ is close to $0$ or $1$, which is precisely the regime of strong models, the normal approximation degrades badly. The interval can extend below $0$ or above $1$, and its actual coverage can fall well under the nominal $95\%$. For a model at $99\%$ accuracy on $200$ examples, the Wald interval is essentially meaningless. The variance estimate $\hat{p}(1-\hat{p})/n$ also collapses to zero as $\hat{p} \to 1$, which absurdly implies near-perfect certainty exactly when the data are most sparse in the minority outcome. The degenerate case is stark: if a model is correct on all $n$ examples, then $\hat{p} = 1$, the estimated variance is $0$, and the Wald interval is the single point $[1, 1]$, asserting with certainty that the model never errs. No finite sample can justify that claim. 
Brown, Cai, and DasGupta (reference 2) document that even away from the boundary the Wald coverage oscillates well below the nominal level, and that the deficiency does not vanish as $n$ grows. ### 2.2 The Wilson Score Interval A far better default inverts the score test rather than plugging in the observed proportion as the variance. The Wilson interval solves for the set of $p$ values not rejected by the test, yielding $$ \frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}, $$ with $z = z_{1-\alpha/2}$. The Wilson interval is always contained in $[0, 1]$, behaves sensibly at the boundaries, and maintains close-to-nominal coverage even for small $n$ and extreme $\hat{p}$. It should be the default reporting interval for accuracy, precision, recall, and any other proportion-based metric. The formula rewards a little interpretation. The center is not $\hat{p}$ but a shrinkage of $\hat{p}$ toward $1/2$: rewriting the numerator as $(n\hat{p} + z^2/2)/(n + z^2)$ shows that the interval behaves as though $z^2/2$ successful and $z^2/2$ failed pseudo-observations were added to the data. At $95\%$ confidence $z^2 \approx 3.84$, so this is close to adding two successes and two failures, which is exactly the rounding that produces the simpler Agresti-Coull interval. That pseudo-count is why Wilson never collapses to a point at the boundary: even when $\hat{p} = 1$, the center is pulled inward and the half-width stays strictly positive. A concrete case makes the contrast with Wald vivid. Suppose a classifier is correct on $196$ of $200$ test examples, so $\hat{p} = 0.98$. The Wald half-width is $1.96\sqrt{0.98 \times 0.02 / 200} \approx 0.0194$, giving the interval $[0.961, 0.999]$, which is symmetric and crowds against the boundary. 
The Wilson interval for the same data is approximately $[0.950, 0.992]$: shifted downward, asymmetric, with a longer reach toward smaller $p$, correctly reflecting that with only four errors observed the true error rate could plausibly be appreciably higher than $2\%$ but cannot be much lower. If instead the model were correct on all $200$ examples, Wald would report $[1, 1]$ while Wilson reports roughly $[0.981, 1]$, preserving an honest lower bound. The so-called rule of three is a handy memory aid here: when zero failures occur in $n$ trials, an approximate upper bound on the failure rate is $3/n$, so $0$ errors in $200$ trials is consistent with a true error rate as high as about $1.5\%$.

```python
# Wilson 95% interval (example values: 196 correct out of 200)
from math import sqrt

n, p_hat, z = 200, 196 / 200, 1.96
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
ci = (center - half, center + half)  # approximately (0.950, 0.992)
```

### 2.3 Clopper-Pearson and Exactness

When strict coverage guarantees are required, the Clopper-Pearson interval inverts the exact binomial test using beta-distribution quantiles. It guarantees coverage of at least $1 - \alpha$ for every true $p$, but because the binomial is discrete, it is conservative: actual coverage often exceeds the nominal level, producing intervals wider than necessary. Clopper-Pearson is appropriate when undercoverage is unacceptable, for example in safety-critical certification, whereas Wilson is the better all-purpose choice when honest average coverage is the goal.

### 2.4 Choosing Among Them

For routine reporting, Wilson is the recommended default. Reserve Clopper-Pearson for conservative guarantees and avoid Wald except as a rough mental approximation when $n$ is large and $\hat{p}$ is near $0.5$. The following diagram summarizes the choice.
```{mermaid}
flowchart TD
    A["Proportion metric (accuracy, precision, recall)"] --> B{"Need guaranteed coverage at or above 1 minus alpha"}
    B -->|"Yes (safety certification)"| C["Clopper-Pearson exact interval"]
    B -->|"No (honest average coverage)"| D{"Are test examples independent"}
    D -->|"Yes"| E["Wilson score interval (default)"]
    D -->|"No (shared document or patient)"| F["Clustered bootstrap or cluster-aware variance"]
```

None of the closed-form intervals account for the fact that examples may be correlated, for instance when multiple test cases share a document, a patient, or a user session. In clustered settings the effective sample size is smaller than $n$. A useful approximation is the design effect $\mathrm{DEFF} = 1 + (\bar{m} - 1)\rho$, where $\bar{m}$ is the average cluster size and $\rho$ is the intra-cluster correlation; the effective sample size is $n_{\mathrm{eff}} = n / \mathrm{DEFF}$. With ten correlated sentences per document and even a modest $\rho = 0.3$, the design effect is about $3.7$, so a nominal $1{,}000$-example test set carries the information of roughly $270$ independent examples. Treating it as $1{,}000$ makes every interval above too narrow by a factor near $\sqrt{3.7} \approx 1.9$. The honest remedy is a cluster-aware variance estimate or a clustered bootstrap that resamples whole clusters.

## 3. Intervals for AUC

The area under the receiver operating characteristic curve summarizes ranking quality across all thresholds. Unlike accuracy, AUC is not a simple mean of independent Bernoulli trials, so its uncertainty requires more care.

### 3.1 AUC as a Probability and the Mann-Whitney Connection

The AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example:

$$
\text{AUC} = \Pr\big(s(X^{+}) > s(X^{-})\big) .
$$

Its empirical estimator is the normalized Mann-Whitney $U$ statistic, computed over all $n_{+} n_{-}$ positive-negative pairs.
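
The pairwise definition can be made concrete with a brute-force sketch. The function name `auc_pairwise` and the scores below are hypothetical, and counting a tied pair as one half is an assumed convention (the usual one for the Mann-Whitney statistic) that the probability statement above does not spell out. The double loop costs $n_{+} n_{-}$ comparisons, so it is meant for illustration rather than for large test sets.

```python
def auc_pairwise(scores_pos, scores_neg):
    """Empirical AUC: share of positive-negative pairs ranked correctly.

    A tied pair contributes 1/2 (assumed convention).
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores, for illustration only.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 pairs are ordered correctly (0.4 vs 0.7 is the one miss).
auc = auc_pairwise(pos, neg)
print(f"AUC = {auc:.4f}")
```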
This pairwise structure means the variance depends not only on the counts $n_{+}$ and $n_{-}$ but also on how scores are distributed.

### 3.2 The Hanley-McNeil Analytic Interval

Hanley and McNeil derived a widely used variance estimate for the AUC, written $A$ in what follows:

$$
\widehat{\mathrm{Var}}(A) = \frac{A(1-A) + (n_{+}-1)(Q_1 - A^2) + (n_{-}-1)(Q_2 - A^2)}{n_{+} n_{-}},
$$

where $Q_1 = A/(2 - A)$ and $Q_2 = 2A^2/(1 + A)$ under a common exponential approximation. A Gaussian interval is then $A \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(A)}$. The $Q_1$ and $Q_2$ approximations assume a particular score distribution and can be inaccurate when that assumption is violated, so the analytic interval is best treated as a quick estimate rather than a definitive one.

### 3.3 The DeLong Method

The DeLong method provides a nonparametric variance estimator based on the structural components of the $U$ statistic, the so-called placement values. It does not assume a parametric score distribution and yields asymptotically correct intervals. Crucially, DeLong extends to the comparison of two correlated AUCs, for instance two models evaluated on the same test set, by estimating the covariance between their statistics. This makes it the standard analytic tool when comparing classifiers on shared data.

```python
# DeLong interval for a single AUC (sketch)
import numpy as np

def delong_ci(pos, neg, z=1.96):  # pos, neg: 1-D NumPy arrays of scores
    psi = (pos[:, None] > neg) + 0.5 * (pos[:, None] == neg)  # pair kernel, ties 1/2
    V10, V01 = psi.mean(axis=1), psi.mean(axis=0)  # per-example components
    var_auc = V10.var(ddof=1) / len(pos) + V01.var(ddof=1) / len(neg)
    return psi.mean() - z * np.sqrt(var_auc), psi.mean() + z * np.sqrt(var_auc)
```

### 3.4 Bootstrap Intervals as a General Fallback

When the metric is complex, or when assumptions behind analytic formulas are doubtful, the bootstrap offers a distribution-free alternative. The principle is to treat the empirical distribution $\hat{F}_n$ of the test set as a stand-in for the unknown $\mathcal{D}$, and to estimate the sampling variability of $\hat{\theta}_n$ by repeatedly drawing samples of size $n$ from $\hat{F}_n$.
Concretely, resample the test set with replacement $B$ times, recompute the metric $\hat{\theta}^{*}_b$ on each resample, and form an interval from the resulting distribution $\{\hat{\theta}^{*}_1, \dots, \hat{\theta}^{*}_B\}$.

The percentile interval takes the empirical $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap replicates directly. It is simple and transformation-respecting, but it can miscover when the estimator is biased or when its sampling distribution is skewed, both common for ranking metrics.

The bias-corrected and accelerated (BCa) interval of Efron and Tibshirani (reference 6) corrects this. It computes a bias-correction term $\hat{z}_0 = \Phi^{-1}\!\big(\#\{\hat{\theta}^{*}_b < \hat{\theta}_n\}/B\big)$, the normal quantile of the fraction of replicates below the observed estimate, and an acceleration term $\hat{a}$ estimated from the jackknife skewness of the statistic. It then reads off adjusted percentiles

$$
\alpha_1 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right),
$$

and the analogous $\alpha_2$ for the upper end, obtained by replacing $z_{\alpha/2}$ with $z_{1-\alpha/2}$, taking the $\alpha_1$ and $\alpha_2$ quantiles of the replicates. When $\hat{z}_0 = 0$ and $\hat{a} = 0$ this reduces exactly to the percentile interval, so BCa is a strict refinement.

Choose $B$ large enough that the quantiles are stable: $B = 1{,}000$ suffices for a rough interval, while $B \ge 10{,}000$ is prudent for the tail-sensitive BCa endpoints. For AUC and for any metric involving ranking, stratified resampling that preserves the number of positives and negatives is important, since the metric is undefined or unstable when a resample contains only one class. The bootstrap is also the most natural way to handle clustered data: resample whole clusters rather than individual examples to respect the dependence structure.

A caution: the bootstrap's guarantees are asymptotic.
With very few positives, say fewer than $20$, both analytic and bootstrap intervals become unreliable, and the honest move is to report the small sample size prominently and resist over-interpreting the point estimate.

## 4. Reporting Uncertainty in Model Comparison

The most consequential use of confidence intervals is deciding whether model $A$ truly outperforms model $B$. Here a common and serious error is to compare the two models' separate intervals by eye and conclude there is no difference because they overlap.

### 4.1 The Overlap Fallacy

Overlapping confidence intervals do not imply a non-significant difference. Two intervals can overlap substantially while the difference between the estimates is statistically significant. The correct object of inference is the interval on the difference $\theta_A - \theta_B$, not the two marginal intervals. The variance of a difference depends on the covariance between the two estimates, and ignoring that covariance discards exactly the information that makes paired comparison powerful.

### 4.2 Paired Evaluation on the Same Test Set

When both models are evaluated on the same examples, their errors are correlated, and a paired analysis is both correct and far more sensitive. For accuracy, McNemar's test focuses on discordant examples: let $b$ be the count where $A$ is correct and $B$ is wrong, and $c$ the reverse. The other examples, where both agree, carry no information about the difference. The test statistic is

$$
\chi^2 = \frac{(b - c)^2}{b + c},
$$

which under the null hypothesis of equal accuracy is approximately chi-squared with one degree of freedom. When $b + c$ is small, say below $25$, the continuity-corrected form $\big(|b - c| - 1\big)^2/(b + c)$ or an exact binomial test on $b$ out of $b + c$ is preferred.
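
The three variants of the test can be sketched with the standard library alone. The function names and the discordant counts below are hypothetical. Two choices in the sketch are assumptions rather than part of the text above: the continuity-corrected numerator is floored at zero so that $b = c$ gives a statistic of $0$, and the exact test doubles the smaller binomial tail to obtain a two-sided p-value.

```python
from math import comb, erfc, sqrt

def mcnemar(b, c, continuity=False):
    """McNemar statistic and chi-squared (1 df) p-value from discordant counts."""
    if b + c == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    gap = abs(b - c)
    if continuity:
        gap = max(gap - 1, 0)  # assumed floor so that b == c yields 0
    stat = gap ** 2 / (b + c)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))

def mcnemar_exact(b, c):
    """Two-sided exact binomial p-value for b successes in b + c trials, p = 1/2."""
    m, k = b + c, min(b, c)
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)

# Hypothetical small-sample case: b + c = 17 is below 25, so the corrected
# or exact forms are the ones to report.
b, c = 12, 5
stat, p = mcnemar(b, c)
stat_cc, p_cc = mcnemar(b, c, continuity=True)
p_exact = mcnemar_exact(b, c)
print(f"uncorrected: {stat:.3f} (p={p:.3f}), corrected: {stat_cc:.3f} "
      f"(p={p_cc:.3f}), exact p={p_exact:.3f}")
```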
A confidence interval for the accuracy difference $\hat{p}_A - \hat{p}_B = (b - c)/n$ can be built on the paired proportions; the variance of the difference is $\big((b + c) - (b - c)^2/n\big)/n^2$, which is smaller than the unpaired variance precisely because the concordant examples drop out. The intuition is that if both models fail on the same hard examples, those examples tell us nothing about which model is better, so conditioning on the discordant pairs is what makes the paired test more powerful. Dietterich (reference 7) found McNemar to be among the few tests with acceptable Type I error for comparing classifiers on a single test set.

A small worked instance: on $n = 1{,}000$ examples suppose model $A$ is right and $B$ wrong on $b = 45$ cases, while $B$ is right and $A$ wrong on $c = 25$. The raw accuracies differ by only $(45 - 25)/1000 = 0.02$, that is $2.0$ percentage points, which an unpaired eyeball comparison might dismiss. The McNemar statistic is $(45 - 25)^2 / (45 + 25) = 400/70 \approx 5.71$, exceeding the $95\%$ critical value of $3.84$, so the difference is significant. The shared $930$ concordant examples never enter the calculation, which is exactly why the paired test sees a signal the marginal intervals would hide.

For AUC, the DeLong test for two correlated curves plays the analogous role, using the estimated covariance to form an interval on $\text{AUC}_A - \text{AUC}_B$. The paired bootstrap is the general-purpose alternative: on each resample, recompute both metrics and record their difference, then take quantiles of the difference distribution. Because both models see the same resample, the shared variation cancels and the interval on the difference is appropriately tight.
```python
# Paired bootstrap on a difference (sketch; metric(y_true, y_pred) returns a float)
import numpy as np

def paired_bootstrap_ci(y, pred_a, pred_b, metric, B=10_000, seed=0):
    rng = np.random.default_rng(seed)
    diff = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, len(y), size=len(y))  # same indices for both models
        diff[b] = metric(y[idx], pred_a[idx]) - metric(y[idx], pred_b[idx])
    return np.quantile(diff, [0.025, 0.975])
```

If the interval on the difference excludes zero, the comparison is significant at the corresponding level; if it straddles zero, the data do not support a confident ranking.

### 4.3 Multiple Comparisons and Selection Effects

Benchmarks rarely compare two models in isolation. Leaderboards rank dozens, and each pairwise comparison is an opportunity for a false positive. Reporting one nominal $95\%$ interval per comparison makes it likely that some apparent winners are noise. When many models or many test slices are compared, control the family-wise error rate or the false discovery rate, for example with a Bonferroni or Benjamini-Hochberg adjustment, and widen the intervals accordingly.

A related and underappreciated hazard is the winner's curse: the model that tops a leaderboard tends to have benefited from favorable noise, so its test score is an optimistically biased estimate of its true performance. The model selected as best should be re-evaluated on fresh data before its headline number is trusted.

### 4.4 A Reporting Checklist

Sound reporting practice can be summarized compactly. State the test set size and the number of positives and negatives, since these determine the achievable precision. Report a confidence interval, not just a point estimate, and name the method used to compute it. For comparisons, report the interval on the difference rather than two separate intervals, and use a paired method when the models share a test set. Disclose any clustering or dependence in the data and account for it through cluster-aware variance or a clustered bootstrap. Finally, if multiple models or slices were compared, state the correction applied. None of these steps is expensive, and together they convert a brittle single number into a defensible scientific claim.
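
The last checklist item, the multiplicity correction of section 4.3, is short enough to sketch in full. The code below is a minimal plain-Python illustration of the Bonferroni and Benjamini-Hochberg decision rules; the function names and the p-values are hypothetical, and the sketch returns only reject decisions, not adjusted p-values or widened intervals.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up rule: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= k * alpha / m (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from five pairwise model comparisons.
p_vals = [0.001, 0.012, 0.025, 0.045, 0.20]
bonf = bonferroni(p_vals)
bh = benjamini_hochberg(p_vals)
print("Bonferroni:", bonf)
print("Benjamini-Hochberg:", bh)
```

On these illustrative inputs Bonferroni keeps only the strongest comparison while Benjamini-Hochberg keeps three, which reflects the usual trade: the family-wise criterion is stricter, the false discovery rate criterion retains more power when many comparisons are made.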
### 4.5 Common Pitfalls

A handful of mistakes account for most misreported uncertainty in practice.

- Reading two marginal intervals for overlap instead of forming the interval on the difference. This is the overlap fallacy of section 4.1 and it both loses power and can mislead.
- Using the Wald interval near the boundary, where strong models live, and reporting endpoints outside $[0, 1]$ or a degenerate point at perfect accuracy.
- Treating clustered examples as independent, which makes every interval too narrow by roughly the square root of the design effect.
- Bootstrapping a ranking metric without stratifying by class, so that some resamples contain a single class and the metric is undefined or wildly unstable.
- Quoting an interval to more significant figures than the test size supports, for example $\pm 0.001$ on a few hundred examples.
- Forgetting the multiplicity correction after scanning many models or slices, and then trusting the winner's headline number without re-evaluation on fresh data.

### 4.6 Open-Source Tooling

Every method in this chapter is available in mature, free, open-source libraries, so there is no reason to hand-roll error-prone code. In Python, `statsmodels.stats.proportion.proportion_confint` computes Wilson, Clopper-Pearson, Agresti-Coull, and Wald intervals, and `proportions_ztest` and `statsmodels.stats.contingency_tables.mcnemar` cover the comparison tests. The `scipy.stats` module provides `bootstrap`, which implements the percentile and BCa intervals directly, along with `binomtest` for exact binomial inference. For ROC analysis, `scikit-learn` supplies `roc_auc_score`, and the DeLong covariance estimator is available in small, auditable open-source implementations that pair naturally with it. Preferring these well-tested tools over bespoke formulas eliminates a large class of off-by-one and boundary bugs.

The discipline of attaching uncertainty to every reported metric does more than satisfy reviewers.
It changes how teams reason about progress: a model that is one point better, but whose interval on the difference straddles zero, is not a confirmed improvement, and treating it as one wastes engineering effort chasing noise. Confidence intervals are the instrument that distinguishes signal from sampling variation, and they belong in every evaluation report.

## References

1. Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association. https://www.tandfonline.com/doi/abs/10.1080/01621459.1927.10502953
2. Brown, L. D., Cai, T. T., DasGupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science. https://projecteuclid.org/journals/statistical-science/volume-16/issue-2/Interval-Estimation-for-a-Binomial-Proportion/10.1214/ss/1009213286.full
3. Clopper, C. J., Pearson, E. S. (1934). The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika. https://academic.oup.com/biomet/article-abstract/26/4/404/291538
4. Hanley, J. A., McNeil, B. J. (1982). The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology. https://pubs.rsna.org/doi/10.1148/radiology.143.1.7063747
5. DeLong, E. R., DeLong, D. M., Clarke-Pearson, D. L. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves. Biometrics. https://www.jstor.org/stable/2531595
6. Efron, B., Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall. https://www.taylorfrancis.com/books/mono/10.1201/9780429246593/introduction-bootstrap-bradley-efron-robert-tibshirani
7. Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation. https://direct.mit.edu/neco/article/10/7/1895/6224
8. Benjamini, Y., Hochberg, Y. (1995). Controlling the False Discovery Rate. Journal of the Royal Statistical Society Series B. https://www.jstor.org/stable/2346101
9. Bengio, Y., Grandvalet, Y. (2004). No Unbiased Estimator of the Variance of K-Fold Cross-Validation. Journal of Machine Learning Research, 5, 1089-1105. https://www.jmlr.org/papers/v5/grandvalet04a.html