174 Bootstrapping for Evaluation
When we report that a classifier achieves an accuracy of 0.91 or that a retrieval system reaches a mean reciprocal rank of 0.63, we are reporting a single number computed from one finite test set. That number is a point estimate of a quantity we cannot observe directly: the performance the system would attain on the full population of inputs it will eventually face. The bootstrap is the most general tool we have for quantifying how much that point estimate would wobble if we had drawn a different test set. This chapter develops the bootstrap principle, the construction of percentile and bias corrected and accelerated (BCa) confidence intervals, the out-of-bag and 0.632 estimators that recycle bootstrap resamples into model assessment, and the practical machinery of bootstrapping whole metric distributions for machine learning evaluation.
174.1 1. The Bootstrap Principle
174.1.1 1.1 The plug-in idea
Suppose our test data \(x_1, \dots, x_n\) are drawn independently from an unknown distribution \(F\). A metric of interest is a functional \(\theta = t(F)\), for example the population accuracy or the population area under the ROC curve. We estimate it with the plug-in estimator \(\hat\theta = t(\hat F_n)\), where \(\hat F_n\) is the empirical distribution that places mass \(1/n\) on each observed point. The trouble is that \(\hat\theta\) is a random quantity, and we want its sampling distribution so we can attach an interval to it.
The bootstrap principle is a single substitution: since we do not know \(F\), we use \(\hat F_n\) in its place. Where the true sampling distribution comes from drawing \(n\) points from \(F\), the bootstrap sampling distribution comes from drawing \(n\) points from \(\hat F_n\). Drawing from \(\hat F_n\) is exactly sampling \(n\) items with replacement from the observed data. Efron’s insight in 1979 was that this substitution, although it looks circular, is justified whenever \(\hat F_n\) converges to \(F\) and the functional \(t\) is smooth enough that small perturbations of its argument produce small perturbations of its value.
174.1.2 1.2 The Monte Carlo algorithm
We almost never compute the bootstrap distribution analytically. Instead we approximate it by Monte Carlo:
for b in 1..B:
    sample x*_1..x*_n with replacement from x_1..x_n
    theta*_b = metric(x*_1..x*_n)
return {theta*_1, ..., theta*_B}
The collection \(\{\hat\theta^*_b\}_{b=1}^B\) approximates the sampling distribution of \(\hat\theta\). From it we read off the bootstrap standard error as the sample standard deviation of the replicates,
\[ \widehat{\operatorname{se}}_{\mathrm{boot}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \left(\hat\theta^*_b - \bar{\theta}^*\right)^2}, \qquad \bar{\theta}^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b . \]
Two sources of error are now in play. There is statistical error, which depends on \(n\) and shrinks as the test set grows, and there is Monte Carlo error, which depends on \(B\) and shrinks as we draw more resamples. We control the second freely. For standard errors \(B\) in the hundreds suffices, but for the tail quantiles needed by confidence intervals we want \(B\) in the low thousands, since a \(95\%\) interval depends on the behavior of the empirical distribution near its \(2.5\)th and \(97.5\)th percentiles where samples are sparse.
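As a minimal sketch of the Monte Carlo loop and the standard error formula above, the following Python uses only the standard library. The test set (200 items, 182 of them correct), the function name `bootstrap_replicates`, and the seed are illustrative assumptions rather than part of the method.

```python
import random
import statistics

def bootstrap_replicates(data, metric, B=1000, seed=0):
    """Return B bootstrap replicates of metric(data)."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(B):
        # Draw n items with replacement, i.e. sample from the empirical distribution.
        resample = rng.choices(data, k=n)
        reps.append(metric(resample))
    return reps

# Hypothetical test set: 1 = correct prediction, 0 = error (accuracy 0.91).
correct = [1] * 182 + [0] * 18
accuracy = statistics.fmean

reps = bootstrap_replicates(correct, accuracy, B=1000, seed=0)
se_boot = statistics.stdev(reps)  # sample standard deviation, divisor B - 1
print(f"accuracy = {accuracy(correct):.3f}, bootstrap SE = {se_boot:.4f}")
```

For this binary metric the result can be sanity checked against the binomial formula \(\sqrt{0.91 \cdot 0.09 / 200} \approx 0.020\); the bootstrap earns its keep on metrics that have no such closed form.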
174.1.3 1.3 When the bootstrap works and when it fails
The bootstrap is consistent for smooth functionals of the distribution: means, variances, correlations, and most differentiable risk metrics. It struggles or fails for non-smooth functionals. The sample maximum is the textbook failure, because the maximum of a resample can never exceed the maximum of the original data, so the bootstrap distribution is degenerate on one side. Heavy tailed data with infinite variance and parameters on the boundary of their space are further cases where the naive bootstrap is unreliable. In evaluation work the most common pitfall is not non-smoothness but dependence: if test examples are correlated, for instance multiple turns from the same conversation or several patches from the same image, resampling individual rows breaks the dependence structure and understates the variance. The remedy is to resample the independent unit, the conversation or the image, using a block or cluster bootstrap rather than a row level one.
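The effect of resampling the wrong unit can be seen in a small sketch. The data here are hypothetical and deliberately extreme: 40 conversations of 5 turns each, with every turn in a conversation sharing the same outcome. The helper name `cluster_bootstrap` is likewise an assumption made for illustration.

```python
import random
import statistics

def cluster_bootstrap(groups, metric, B=1000, seed=0):
    """groups: dict mapping cluster id -> list of per-example scores."""
    rng = random.Random(seed)
    ids = list(groups)
    reps = []
    for _ in range(B):
        drawn = rng.choices(ids, k=len(ids))          # resample clusters, not rows
        rows = [s for g in drawn for s in groups[g]]  # each drawn cluster stays whole
        reps.append(metric(rows))
    return reps

# Hypothetical data: 30 conversations fully correct, 10 fully wrong.
groups = {c: [1] * 5 for c in range(30)}
groups.update({c: [0] * 5 for c in range(30, 40)})
rows = [s for g in groups.values() for s in g]

row_rng = random.Random(0)
row_reps = [statistics.fmean(row_rng.choices(rows, k=len(rows))) for _ in range(1000)]
cluster_reps = cluster_bootstrap(groups, statistics.fmean, B=1000, seed=0)

row_se = statistics.stdev(row_reps)
cluster_se = statistics.stdev(cluster_reps)
print(f"row-level SE = {row_se:.4f}, cluster-level SE = {cluster_se:.4f}")
```

With perfect within-cluster correlation there are effectively 40 independent observations rather than 200, so the cluster-level standard error should come out roughly \(\sqrt 5\) times the row-level one. Real data sit somewhere between this case and full independence.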
174.2 2. Percentile and BCa Intervals
174.2.1 2.1 The percentile interval
The most direct way to turn bootstrap replicates into a confidence interval is to read off the empirical quantiles. Let \(\hat\theta^*_{(\alpha)}\) denote the \(\alpha\) quantile of the replicate distribution. The percentile interval at level \(1 - 2\alpha\) is
\[ \left[\, \hat\theta^*_{(\alpha)}, \; \hat\theta^*_{(1-\alpha)} \,\right]. \]
For a \(95\%\) interval we take the \(2.5\)th and \(97.5\)th percentiles of the \(B\) replicates. The percentile interval is appealing because it is transformation respecting: if we had bootstrapped a monotone function of \(\theta\) instead, the interval would simply be the transformed interval. It also automatically lands inside the natural range of the metric, so an interval for accuracy never strays below zero or above one. Its weakness is that it assumes the bootstrap distribution is approximately unbiased and symmetric on some transformed scale, and it makes no correction when that assumption is violated.
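A minimal sketch of the percentile interval follows, again on a hypothetical accuracy test set. The linear-interpolation `quantile` helper is one of several reasonable quantile conventions and is an assumption of this example, not something the method prescribes.

```python
import random
import statistics

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def percentile_interval(reps, alpha=0.025):
    """Percentile interval at level 1 - 2*alpha from bootstrap replicates."""
    s = sorted(reps)
    return quantile(s, alpha), quantile(s, 1 - alpha)

# Hypothetical test set: 200 items, 182 correct.
rng = random.Random(0)
correct = [1] * 182 + [0] * 18
reps = [statistics.fmean(rng.choices(correct, k=len(correct))) for _ in range(2000)]

lower, upper = percentile_interval(reps, alpha=0.025)
print(f"95% percentile interval for accuracy: [{lower:.3f}, {upper:.3f}]")
```

Because every replicate is itself an accuracy, both endpoints necessarily fall in \([0, 1]\), which illustrates the range-respecting property described above.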
174.2.2 2.2 Bias and acceleration
The BCa interval, also from Efron, repairs the percentile interval with two corrections. The first is a bias correction \(\hat z_0\) that measures the median bias of the bootstrap distribution. We compute it from the fraction of replicates below the original estimate,
\[ \hat z_0 = \Phi^{-1}\!\left( \frac{\#\{\hat\theta^*_b < \hat\theta\}}{B} \right), \]
where \(\Phi^{-1}\) is the standard normal quantile function. If exactly half the replicates fall below \(\hat\theta\) then \(\hat z_0 = 0\) and there is no median bias.
The second correction is an acceleration \(\hat a\) that measures how fast the standard error of \(\hat\theta\) changes with the underlying value, capturing skewness. It is computed from the jackknife. Let \(\hat\theta_{(i)}\) be the estimate with the \(i\)th point deleted and \(\hat\theta_{(\cdot)}\) their mean. Then
\[ \hat a = \frac{\sum_{i=1}^n \left(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\right)^3}{6 \left[\sum_{i=1}^n \left(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\right)^2\right]^{3/2}}. \]
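The two correction terms can be computed directly from their definitions. The sketch below assumes the data are held in a Python list and that `metric` maps a list to a number. The clamping of the fraction away from 0 and 1 is an added safeguard of this example, not part of the definition.

```python
import statistics
from statistics import NormalDist

def bias_correction(reps, theta_hat):
    """z0: normal quantile of the fraction of replicates below the estimate."""
    frac = sum(r < theta_hat for r in reps) / len(reps)
    # Clamp so inv_cdf stays finite if all or none of the replicates fall below.
    eps = 1.0 / (len(reps) + 1)
    return NormalDist().inv_cdf(min(max(frac, eps), 1 - eps))

def jackknife_acceleration(data, metric):
    """a-hat from the leave-one-out estimates, following the formula above."""
    n = len(data)
    loo = [metric(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = statistics.fmean(loo)
    num = sum((mean_loo - t) ** 3 for t in loo)
    den = 6.0 * sum((mean_loo - t) ** 2 for t in loo) ** 1.5
    return num / den if den > 0 else 0.0

# Hypothetical right-skewed sample (for example, latencies in seconds).
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 15, 40]
symmetric = [1, 2, 3, 4, 5]
a_skewed = jackknife_acceleration(skewed, statistics.fmean)
a_symmetric = jackknife_acceleration(symmetric, statistics.fmean)
print(f"a-hat (skewed) = {a_skewed:.4f}, a-hat (symmetric) = {a_symmetric:.4f}")
```

For the mean, the leave-one-out deviations are proportional to \(x_i - \bar x\), so \(\hat a\) is positive for right-skewed data and zero for symmetric data. Note that the jackknife costs \(n\) extra metric evaluations, and that for discrete metrics many replicates may tie with \(\hat\theta\), which the strict inequality in \(\hat z_0\) counts as not below.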
174.2.3 2.3 Assembling the BCa endpoints
With \(\hat z_0\) and \(\hat a\) in hand, BCa replaces the fixed percentiles \(\alpha\) and \(1 - \alpha\) with adjusted ones. For a target tail probability \(\alpha\) the adjusted quantile is
\[ \alpha_1 = \Phi\!\left( \hat z_0 + \frac{\hat z_0 + z_\alpha}{1 - \hat a (\hat z_0 + z_\alpha)} \right), \]
where \(z_\alpha = \Phi^{-1}(\alpha)\), and analogously \(\alpha_2\) for the upper endpoint using \(z_{1-\alpha}\) in place of \(z_\alpha\). The interval is then \([\hat\theta^*_{(\alpha_1)}, \hat\theta^*_{(\alpha_2)}]\). When both corrections vanish, \(\alpha_1 = \alpha\) and \(\alpha_2 = 1 - \alpha\), so BCa collapses to the percentile interval. BCa is second order accurate, meaning its one-sided coverage error shrinks like \(1/n\) rather than the \(1/\sqrt n\) of the percentile and standard normal intervals, and it is the recommended default when the cost of the extra jackknife pass is affordable. The following sketch shows the structure.
z0 = Phi_inv(mean(theta_star < theta_hat))
a = jackknife_acceleration(data, metric)
endpoints = []
for tail in (alpha, 1 - alpha):
    z = Phi_inv(tail)
    adj = Phi(z0 + (z0 + z) / (1 - a*(z0 + z)))
    endpoints.append(quantile(theta_star, adj))
A practical caution: BCa can behave poorly when \(B\) is too small, because the adjusted quantiles may push into the extreme tail where only a handful of replicates live. Use at least a few thousand resamples, and be alert when \(\hat a\) or \(\hat z_0\) is large, since that signals a skewed or biased metric where reporting the interval alongside the replicate histogram is wise.
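To see how far the adjusted levels can move into the tail, the following sketch evaluates the BCa level formula for given corrections. The function name `bca_levels` and the particular values \(\hat z_0 = 0.1\), \(\hat a = 0.05\) are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def bca_levels(z0, a, alpha=0.025):
    """Adjusted quantile levels (alpha_1, alpha_2) for a BCa interval."""
    nd = NormalDist()
    levels = []
    for tail in (alpha, 1 - alpha):
        z = nd.inv_cdf(tail)
        levels.append(nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z))))
    return tuple(levels)

plain = bca_levels(0.0, 0.0)      # no correction: the ordinary percentile levels
shifted = bca_levels(0.1, 0.05)   # hypothetical corrections move both levels upward
print("no correction:  ", plain)
print("with correction:", shifted)
```

With these modest corrections the upper level moves from 0.975 to roughly 0.99, so with \(B = 2000\) fewer than twenty replicates lie beyond the upper endpoint. That is why the caution above about small \(B\) matters more for BCa than for the plain percentile interval.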
174.3 3. Out-of-Bag and the 0.632 Bootstrap
174.3.1 3.1 Resampling for model assessment
The bootstrap can also estimate prediction error, not just the variance of a fixed metric. Here each bootstrap resample becomes a training set, and the question is how to use the resamples to estimate how well a model trained on \(n\) points will generalize. The naive idea of training on a resample and testing on the full original data is badly optimistic, because about \(63.2\%\) of the original points appear in any given resample and the model has effectively seen them.
The key combinatorial fact is that the probability a particular observation is omitted from a bootstrap sample of size \(n\) is
\[ \left(1 - \frac{1}{n}\right)^n \longrightarrow e^{-1} \approx 0.368 \]
as \(n\) grows. So on average about \(36.8\%\) of the data are left out of each resample. These omitted points are out-of-bag (OOB), and they form a ready made held out set for the model trained on that resample.
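The limit is easy to check numerically. The sketch below evaluates the omission probability for a few values of \(n\) and compares it with a simulated out-of-bag fraction; the sample size and number of resamples in the simulation are arbitrary choices.

```python
import math
import random

def omit_probability(n):
    """Probability that a given observation is absent from a size-n resample."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000):
    print(n, round(omit_probability(n), 4))
print("limit:", round(math.exp(-1), 4))

# Simulated check: average fraction of points left out over many resamples.
rng = random.Random(0)
n, B = 500, 200
oob_fractions = []
for _ in range(B):
    in_bag = set(rng.choices(range(n), k=n))
    oob_fractions.append(1 - len(in_bag) / n)
mean_oob = sum(oob_fractions) / B
print("simulated OOB fraction:", round(mean_oob, 4))
```

Convergence is quick: already at \(n = 10\) the omission probability is about 0.349, so the 0.368 rule of thumb is reasonable even for small test sets.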
174.3.2 3.2 The out-of-bag estimator
For each observation \(i\), collect the bootstrap samples in which \(i\) was out-of-bag, predict \(i\) with each corresponding model, and aggregate. The leave-one-out bootstrap error is
\[ \widehat{\mathrm{Err}}^{(1)} = \frac{1}{n} \sum_{i=1}^n \frac{1}{|C_i|} \sum_{b \in C_i} L\!\left(y_i, \hat f^{*b}(x_i)\right), \]
where \(C_i\) is the set of resamples not containing \(i\) and \(L\) is the loss. This estimator is the foundation of the out-of-bag error reported by random forests, where it comes for free as a byproduct of the bagging procedure and gives an honest generalization estimate without a separate validation split.
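The estimator can be sketched generically, given a `fit` function that returns a predictor. Everything concrete below is an assumption for illustration: the one-dimensional two-class data, the toy 1-nearest-neighbor learner `fit_one_nn`, and the choice to skip any point that never lands out-of-bag.

```python
import random
import statistics

def loo_bootstrap_error(X, y, fit, loss, B=100, seed=0):
    """Leave-one-out bootstrap error: average loss on out-of-bag points."""
    rng = random.Random(seed)
    n = len(X)
    losses = [[] for _ in range(n)]   # losses[i] holds one loss per resample in C_i
    for _ in range(B):
        idx = rng.choices(range(n), k=n)
        in_bag = set(idx)
        predict = fit([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:
                losses[i].append(loss(y[i], predict(X[i])))
    # Average within each point first, then across points that were ever out-of-bag.
    return statistics.fmean([statistics.fmean(l) for l in losses if l])

def fit_one_nn(X_train, y_train):
    """Hypothetical toy learner: 1-nearest-neighbor on scalar features."""
    def predict(x):
        j = min(range(len(X_train)), key=lambda k: abs(X_train[k] - x))
        return y_train[j]
    return predict

def zero_one(y_true, y_pred):
    return float(y_true != y_pred)

# Hypothetical data: two overlapping Gaussian classes on the real line.
data_rng = random.Random(1)
X = [data_rng.gauss(0, 1) for _ in range(60)] + [data_rng.gauss(2, 1) for _ in range(60)]
y = [0] * 60 + [1] * 60

err1 = loo_bootstrap_error(X, y, fit_one_nn, zero_one, B=100, seed=0)
predict_full = fit_one_nn(X, y)
err_bar = statistics.fmean([zero_one(t, predict_full(x)) for x, t in zip(X, y)])
print(f"resubstitution error = {err_bar:.3f}, leave-one-out bootstrap error = {err1:.3f}")
```

The resubstitution error of 1-nearest-neighbor is zero because every training point is its own nearest neighbor, while the out-of-bag estimate is not fooled in this way.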
174.3.3 3.3 The 0.632 and 0.632+ corrections
The leave-one-out bootstrap error is biased upward, because each model is trained on the distinct elements of a resample, roughly \(0.632 n\) effective points, which is fewer than the full \(n\) and therefore a weaker learner. Efron proposed averaging it with the optimistic resubstitution error \(\overline{\mathrm{err}}\), the training error on the full data, so that the two biases roughly cancel:
\[ \widehat{\mathrm{Err}}^{(0.632)} = 0.368 \, \overline{\mathrm{err}} + 0.632 \, \widehat{\mathrm{Err}}^{(1)} . \]
The weight \(0.632 = 1 - e^{-1}\) is the expected fraction of distinct points in a resample, which is why the estimator carries that name. This blend works well when overfitting is mild but is itself optimistic for severe overfitters, such as a one nearest neighbor classifier whose resubstitution error is zero. The 0.632+ estimator addresses this by letting the weight depend on the relative overfitting rate \(R\), defined through the no-information error rate \(\gamma\) that the model would achieve if features and labels were independent:
\[ \hat w = \frac{0.632}{1 - 0.368\,\hat R}, \qquad \hat R = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat\gamma - \overline{\mathrm{err}}}. \]
The final estimator is \(\widehat{\mathrm{Err}}^{(0.632+)} = (1 - \hat w)\,\overline{\mathrm{err}} + \hat w\,\widehat{\mathrm{Err}}^{(1)}\). When overfitting is absent, \(\hat R = 0\) and the weight returns to \(0.632\); when overfitting is total, the weight rises toward one and the estimator leans on the honest out-of-bag term. In modern practice cross validation has largely displaced the 0.632 family for routine error estimation, but the OOB idea remains central to ensemble methods and the 0.632 correction is still valuable for small samples where every data point counts.
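The arithmetic of both corrections fits in a few lines. In the sketch below, \(\hat\gamma\) for 0-1 loss is estimated as \(\sum_k \hat p_k (1 - \hat q_k)\), where \(\hat p_k\) and \(\hat q_k\) are the observed and predicted proportions of class \(k\). Clamping \(\hat R\) to \([0, 1]\) is an assumption of this example intended to keep the weight well behaved; the function names are illustrative.

```python
def no_information_rate(y_true, y_pred):
    """gamma-hat for 0-1 loss: sum over classes of p_k * (1 - q_k)."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    return sum((y_true.count(k) / n) * (1 - y_pred.count(k) / n) for k in classes)

def err_632(err_bar, err1):
    """Fixed-weight blend of resubstitution and leave-one-out bootstrap error."""
    return 0.368 * err_bar + 0.632 * err1

def err_632_plus(err_bar, err1, gamma):
    """Blend whose weight grows with the relative overfitting rate R."""
    if gamma > err_bar and err1 > err_bar:
        R = min((err1 - err_bar) / (gamma - err_bar), 1.0)  # clamp: sketch assumption
    else:
        R = 0.0
    w = 0.632 / (1 - 0.368 * R)
    return (1 - w) * err_bar + w * err1

# Hypothetical severe overfitter: zero training error, 25% out-of-bag error.
gamma_hat = no_information_rate([0, 0, 1, 1], [0, 1, 1, 1])
print("gamma-hat:", gamma_hat)
print("0.632 :", round(err_632(0.0, 0.25), 4))
print("0.632+:", round(err_632_plus(0.0, 0.25, gamma_hat), 4))
```

Here \(\hat R = 0.5\), so the 0.632+ weight rises to about 0.77 and the estimate moves from 0.158 toward the out-of-bag error, which is the intended behavior for a model whose training error is uninformative.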
174.4 4. Bootstrapping Metric Distributions
174.4.1 4.1 Beyond a single scalar
Evaluation rarely reduces to one number. We often want the full sampling distribution of a metric, the joint behavior of several metrics, or a confidence interval on the difference between two systems. The bootstrap handles all of these uniformly, because each resample yields a complete recomputation of whatever metric we care about, however nonlinear.
Consider comparing system \(A\) and system \(B\) on the same test set. We resample test items, and on each resample we recompute both metrics on the identical resampled items, recording the difference \(\Delta^*_b = m_A^* - m_B^*\). This paired bootstrap preserves the correlation between the two systems’ performance, which sharpens the interval relative to treating the systems independently. If the \(1 - 2\alpha\) percentile interval for \(\{\Delta^*_b\}\) excludes zero, we have evidence that the systems differ at that level. The same construction gives intervals for ratios, for differences of F1 scores, or for any contrast that would be awkward to handle with a closed form variance.
for b in 1..B:
    idx = sample_with_replacement(range(n), n)
    delta_star[b] = metric_A(data[idx]) - metric_B(data[idx])
ci = (quantile(delta_star, 0.025), quantile(delta_star, 0.975))
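A runnable version of the paired bootstrap, assuming the metric is a mean of per-item scores, might look like the following. The per-item correctness vectors are hypothetical (400 shared items, accuracies 0.90 and 0.80), and the interval uses plain order statistics rather than interpolated quantiles.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, B=2000, alpha=0.025, seed=0):
    """Percentile interval for mean(scores_a) - mean(scores_b) on shared items."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(B):
        idx = rng.choices(range(n), k=n)   # one index draw, reused for both systems
        m_a = sum(scores_a[i] for i in idx) / n
        m_b = sum(scores_b[i] for i in idx) / n
        deltas.append(m_a - m_b)
    deltas.sort()
    # Order statistics as simple quantile estimates (no interpolation).
    return deltas[int(alpha * B)], deltas[int((1 - alpha) * B) - 1]

# Hypothetical outcomes: both right on 300 items, only A on 60, only B on 20, neither on 20.
scores_a = [1] * 300 + [1] * 60 + [0] * 20 + [0] * 20
scores_b = [1] * 300 + [0] * 60 + [1] * 20 + [0] * 20

ci_low, ci_high = paired_bootstrap_ci(scores_a, scores_b)
print(f"95% interval for accuracy difference: [{ci_low:.3f}, {ci_high:.3f}]")
```

The essential detail is that a single index vector is drawn per replicate and applied to both systems. For metrics that are not simple means, such as F1, the two `sum(...) / n` lines would be replaced by a full recomputation of the metric on the resampled items.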
174.4.2 4.2 Stratified and grouped resampling
Real evaluation sets have structure that plain resampling ignores. When a metric is computed over imbalanced classes, an unstratified resample can occasionally contain no positives, making metrics like recall or AUC undefined and metrics like precision wildly unstable. Stratified resampling fixes the count drawn from each class to its observed value, keeping every replicate well defined and reducing variance. When examples are clustered, as with multiple questions per document or several utterances per speaker, we resample whole clusters to respect the dependence, as noted in section 1.3. The unit of resampling should always match the unit of independence in the data generating process; choosing it wrongly is the single most common way bootstrap intervals in machine learning come out too narrow.
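Stratified resampling amounts to drawing indices separately within each class. The sketch below stratifies by true label on a hypothetical test set with 4 positives among 200 items and uses recall as the metric, since recall is the quantity that becomes undefined when a resample has no positives. The helper names are illustrative.

```python
import math
import random

def stratified_indices(labels, rng):
    """Resample indices within each class, keeping every class count fixed."""
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    idx = []
    for members in by_class.values():
        idx.extend(rng.choices(members, k=len(members)))
    return idx

def recall(y_true, y_pred):
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else float("nan")

# Hypothetical imbalanced test set: 4 positives, 3 of them detected.
y_true = [1] * 4 + [0] * 196
y_pred = [1, 1, 1, 0] + [0] * 196
rng = random.Random(0)

plain_undefined = 0
strat_recalls = []
for _ in range(2000):
    plain = rng.choices(range(200), k=200)
    if math.isnan(recall([y_true[i] for i in plain], [y_pred[i] for i in plain])):
        plain_undefined += 1   # this resample drew no positives at all
    strat = stratified_indices(y_true, rng)
    strat_recalls.append(recall([y_true[i] for i in strat], [y_pred[i] for i in strat]))

print("plain resamples with undefined recall:", plain_undefined)
print("stratified resamples with undefined recall:",
      sum(math.isnan(r) for r in strat_recalls))
```

A plain resample of this set has no positives with probability \(0.98^{200} \approx 0.018\), so a few dozen of the 2000 plain replicates are expected to be undefined, while the stratified replicates always contain exactly four positives.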
174.4.3 4.3 Reporting and interpretation
A bootstrap interval is a statement about sampling variability under repeated draws of the test set from the same distribution. It does not capture distribution shift, annotation error, or bias in the test set itself, and it should never be presented as if it did. Three habits make bootstrap reporting trustworthy. First, fix and report the random seed and the number of resamples \(B\) so the interval is reproducible. Second, prefer BCa or studentized intervals when the metric is skewed or bounded near an endpoint, and fall back to percentile intervals when the jackknife for acceleration is too expensive. Third, when comparing systems, bootstrap the difference directly rather than checking whether two separate intervals overlap. The overlap check is conservative and can mislead: two intervals can overlap even when the interval for the paired difference excludes zero. Plotting the replicate histogram alongside the reported interval is the cheapest diagnostic available and routinely exposes multimodality, boundary effects, or degeneracy that a bare interval would hide.
174.4.4 4.4 Cost and scale
The computational cost of the bootstrap is \(B\) times the cost of evaluating the metric once. For cheap metrics on modest test sets this is trivial, but for expensive evaluations, such as scoring large language model outputs with a judge model, recomputing the metric thousands of times is prohibitive. Two devices help. The first is to cache the per example scores once and resample indices into the cached score vector, so each resample costs an \(O(n)\) aggregation rather than a full re-evaluation. This works whenever the metric is a function of fixed per example quantities, which covers accuracy, mean scores, and many ranking metrics. The second is the Bayesian bootstrap, a weighted scheme in the multiplier family, which replaces integer resample counts with continuous Dirichlet weights. Because every example keeps a positive weight in every replicate, it avoids degenerate resamples and gives a smoother replicate distribution, provided the metric can be written as a weighted aggregate, and it shares the same caching trick. With per example caching, even a heavy judge based metric admits a full bootstrap distribution at the price of one evaluation pass plus a few seconds of resampling arithmetic.
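Both ideas combine naturally: score each example once, then reweight the cached scores. The sketch below draws Dirichlet(1, ..., 1) weights by normalizing independent exponential draws, which is a standard construction. The cached scores are simulated stand-ins for judge outputs, and the function name is an assumption of this example.

```python
import random
import statistics

def bayesian_bootstrap_means(scores, B=1000, seed=0):
    """Replicates of the weighted mean under Dirichlet(1, ..., 1) weights."""
    rng = random.Random(seed)
    n = len(scores)
    reps = []
    for _ in range(B):
        # Normalized Exp(1) draws form a Dirichlet(1, ..., 1) weight vector.
        g = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(g)
        reps.append(sum(w * s for w, s in zip(g, scores)) / total)
    return reps

# Hypothetical cache: per-example judge scores in [0, 1], computed once.
cache_rng = random.Random(1)
scores = [cache_rng.betavariate(5, 2) for _ in range(300)]

reps = bayesian_bootstrap_means(scores, B=1000, seed=0)
se_bayes = statistics.stdev(reps)
se_classic = statistics.pstdev(scores) / len(scores) ** 0.5
print(f"mean score = {statistics.fmean(scores):.3f}")
print(f"Bayesian bootstrap SE = {se_bayes:.4f}, plug-in SE = {se_classic:.4f}")
```

The expensive judge is never called inside the loop; each replicate is a weighted average over the cached vector. For the mean, the Bayesian bootstrap standard error should agree closely with the usual plug-in standard error, which gives a quick consistency check before applying the same weights to less tractable metrics.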
174.5 5. References
- Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, 7(1), 1 to 26. https://doi.org/10.1214/aos/1176344552
- Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593
- Efron, B. (1987). Better Bootstrap Confidence Intervals. Journal of the American Statistical Association, 82(397), 171 to 185. https://doi.org/10.1080/01621459.1987.10478410
- Efron, B., and Tibshirani, R. (1997). Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association, 92(438), 548 to 560. https://doi.org/10.1080/01621459.1997.10474007
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5 to 32. https://doi.org/10.1023/A:1010933404324
- DiCiccio, T. J., and Efron, B. (1996). Bootstrap Confidence Intervals. Statistical Science, 11(3), 189 to 228. https://doi.org/10.1214/ss/1032280214
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
- Rubin, D. B. (1981). The Bayesian Bootstrap. Annals of Statistics, 9(1), 130 to 134. https://doi.org/10.1214/aos/1176345338
- Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. https://doi.org/10.1017/CBO9780511802843