174 Bootstrapping for Evaluation

When we report that a classifier achieves an accuracy of 0.91 or that a retrieval system reaches a mean reciprocal rank of 0.63, we are reporting a single number computed from one finite test set. That number is a point estimate of a quantity we cannot observe directly: the performance the system would attain on the full population of inputs it will eventually face. The bootstrap is the most general tool we have for quantifying how much that point estimate would wobble if we had drawn a different test set. This chapter develops the bootstrap principle, the construction of percentile and bias-corrected and accelerated (BCa) confidence intervals, the out-of-bag and 0.632 estimators that recycle bootstrap resamples into model assessment, and the practical machinery of bootstrapping whole metric distributions for machine learning evaluation.

174.1 1. The Bootstrap Principle

174.1.1 1.1 The plug-in idea

Suppose our test data $x_1, \dots, x_n$ are drawn independently from an unknown distribution $F$. A metric of interest is a functional $\theta = t(F)$, for example the population accuracy or the population area under the ROC curve. We estimate it with the plug-in estimator $\hat\theta = t(\hat F_n)$, where $\hat F_n$ is the empirical distribution that places mass $1/n$ on each observed point. The trouble is that $\hat\theta$ is a random quantity, and we want its sampling distribution so we can attach an interval to it.

The bootstrap principle is a single substitution: since we do not know $F$, we use $\hat F_n$ in its place. Where the true sampling distribution comes from repeatedly drawing $n$ points from $F$, the bootstrap sampling distribution comes from repeatedly drawing $n$ points from $\hat F_n$. Drawing from $\hat F_n$ is exactly sampling $n$ items with replacement from the observed data. Efron’s insight in 1979 was that this substitution, although it looks circular, is justified whenever $\hat F_n$ converges to $F$ and the functional $t$ is smooth enough that small perturbations of its argument produce small perturbations of its value.

174.1.2 1.2 The Monte Carlo algorithm

We almost never compute the bootstrap distribution analytically. Instead we approximate it by Monte Carlo:

for b in 1..B:
    sample x*_1..x*_n with replacement from x_1..x_n
    theta*_b = metric(x*_1..x*_n)
return {theta*_1, ..., theta*_B}

The collection $\{\hat\theta^*_b\}_{b=1}^B$ approximates the sampling distribution of $\hat\theta$. From it we read off the bootstrap standard error as the sample standard deviation of the replicates,

\[ \widehat{\operatorname{se}}_{\mathrm{boot}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \left(\hat\theta^*_b - \bar{\theta}^*\right)^2}, \qquad \bar{\theta}^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b . \]

Two sources of error are now in play. There is statistical error, which depends on $n$ and shrinks as the test set grows, and there is Monte Carlo error, which depends on $B$ and shrinks as we draw more resamples. We control the second freely. For standard errors $B$ in the hundreds suffices, but for the tail quantiles needed by confidence intervals we want $B$ in the low thousands, since a $95\%$ interval depends on the behavior of the empirical distribution near its $2.5$th and $97.5$th percentiles where samples are sparse.
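
The algorithm and the standard error formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `bootstrap_replicates`, the toy data, and the default `n_resamples=2000` are assumptions made for the example, not part of any library discussed in this chapter.

```python
import random
import statistics

def bootstrap_replicates(data, metric, n_resamples=2000, seed=0):
    """Return n_resamples values of metric, each computed on one resample."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        # Draw n items with replacement from the observed data.
        resample = rng.choices(data, k=n)
        replicates.append(metric(resample))
    return replicates

# Hypothetical per-item correctness indicators (1 = correct): accuracy 0.8.
correct = [1] * 160 + [0] * 40
reps = bootstrap_replicates(correct, statistics.fmean)

se_boot = statistics.stdev(reps)            # sample std dev, divisor B - 1
se_binomial = (0.8 * 0.2 / 200) ** 0.5      # closed form, for comparison only
print(f"bootstrap se = {se_boot:.4f}, binomial se = {se_binomial:.4f}")
```

For accuracy the closed-form binomial standard error is available, so the two numbers should agree up to Monte Carlo error; the bootstrap earns its keep on metrics where no such formula exists.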

174.1.3 1.3 When the bootstrap works and when it fails

The bootstrap is consistent for smooth functionals of the distribution: means, variances, correlations, and most differentiable risk metrics. It struggles or fails for non-smooth functionals. The sample maximum is the textbook failure: the maximum of a resample can never exceed the maximum of the original data, and it equals the observed maximum with probability $1 - (1 - 1/n)^n \approx 0.632$, so the bootstrap distribution piles a point mass on one value and is degenerate on one side. Heavy-tailed data with infinite variance and parameters on the boundary of their space are further cases where the naive bootstrap is unreliable. In evaluation work the most common pitfall is not non-smoothness but dependence: if test examples are correlated, for instance multiple turns from the same conversation or several patches from the same image, resampling individual rows breaks the dependence structure and understates the variance. The remedy is to resample the independent unit, the conversation or the image, using a block or cluster bootstrap rather than a row-level one.
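
The sample-maximum failure is easy to see numerically. The sketch below, on a hypothetical Uniform(0, 1) sample, counts how often the resample maximum coincides with the observed maximum. For continuous data the true sampling distribution of the maximum has no such atom, which is the sense in which the bootstrap distribution is degenerate. The sample size and number of resamples are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)
n = 200
data = [rng.random() for _ in range(n)]   # hypothetical Uniform(0, 1) sample
sample_max = max(data)

B = 5000
# A resample contains the observed maximum unless all n draws miss it.
hits = sum(max(rng.choices(data, k=n)) == sample_max for _ in range(B))
frac = hits / B
expected = 1 - (1 - 1 / n) ** n
print(f"P*(max* = max) ~ {frac:.3f}, theory {expected:.3f}")
```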

174.2 2. Percentile and BCa Intervals

174.2.1 2.1 The percentile interval

The most direct way to turn bootstrap replicates into a confidence interval is to read off the empirical quantiles. Let $\hat\theta^*_{(\alpha)}$ denote the $\alpha$ quantile of the replicate distribution. The percentile interval at level $1 - 2\alpha$ is

\[ \left[\, \hat\theta^*_{(\alpha)}, \; \hat\theta^*_{(1-\alpha)} \,\right]. \]

For a $95\%$ interval we take the $2.5$th and $97.5$th percentiles of the $B$ replicates. The percentile interval is appealing because it is transformation respecting: if we had bootstrapped a monotone function of $\theta$ instead, the interval would simply be the transformed interval. It also automatically lands inside the natural range of the metric, so an interval for accuracy never strays below zero or above one. Its weakness is that it assumes the bootstrap distribution is approximately unbiased and symmetric on some transformed scale, and it makes no correction when that assumption is violated.
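
A short sketch makes the transformation-respecting property concrete. It assumes the quantiles are taken as order statistics of the sorted replicates (interpolating quantile rules satisfy the property only approximately). The helper `percentile_interval` and the exponential toy data are illustrative assumptions.

```python
import math
import random
import statistics

def percentile_interval(replicates, alpha=0.025):
    """Order-statistic percentile interval at level 1 - 2*alpha."""
    s = sorted(replicates)
    B = len(s)
    lo = s[max(math.ceil(alpha * B) - 1, 0)]
    hi = s[min(math.ceil((1 - alpha) * B) - 1, B - 1)]
    return lo, hi

rng = random.Random(1)
data = [rng.expovariate(1.0) for _ in range(100)]   # hypothetical positive scores
reps = [statistics.fmean(rng.choices(data, k=len(data))) for _ in range(2000)]

lo, hi = percentile_interval(reps)
# Bootstrapping log(theta) instead: a monotone map preserves the sort order,
# so the endpoints are simply the logs of the original endpoints.
log_lo, log_hi = percentile_interval([math.log(r) for r in reps])
print(lo, hi, math.exp(log_lo), math.exp(log_hi))
```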

174.2.2 2.2 Bias and acceleration

The BCa interval, also from Efron, repairs the percentile interval with two corrections. The first is a bias correction $\hat z_0$ that measures the median bias of the bootstrap distribution. We compute it from the fraction of replicates below the original estimate,

\[ \hat z_0 = \Phi^{-1}\!\left( \frac{\#\{\hat\theta^*_b < \hat\theta\}}{B} \right), \]

where $\Phi^{-1}$ is the standard normal quantile function. If exactly half the replicates fall below $\hat\theta$ then $\hat z_0 = 0$ and there is no median bias.

The second correction is an acceleration $\hat a$ that measures how fast the standard error of $\hat\theta$ changes with the underlying value, capturing skewness. It is computed from the jackknife. Let $\hat\theta_{(i)}$ be the estimate with the $i$th point deleted and $\hat\theta_{(\cdot)}$ their mean. Then

\[ \hat a = \frac{\sum_{i=1}^n \left(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\right)^3}{6 \left[\sum_{i=1}^n \left(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\right)^2\right]^{3/2}}. \]
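
Both corrections are a few lines each. The following sketch uses `statistics.NormalDist` from the Python standard library for $\Phi^{-1}$; the function names and the toy samples are assumptions made for illustration. The jackknife here recomputes a generic metric $n$ times, which is the straightforward but not the cheapest route.

```python
import statistics
from statistics import NormalDist

def bias_correction(replicates, theta_hat):
    """z0: normal quantile of the fraction of replicates below theta_hat."""
    frac = sum(r < theta_hat for r in replicates) / len(replicates)
    return NormalDist().inv_cdf(frac)   # raises if frac is exactly 0 or 1

def jackknife_acceleration(data, metric):
    """a-hat from the leave-one-out estimates of a generic metric."""
    loo = [metric(data[:i] + data[i + 1:]) for i in range(len(data))]
    center = statistics.fmean(loo)
    num = sum((center - t) ** 3 for t in loo)
    den = 6.0 * sum((center - t) ** 2 for t in loo) ** 1.5
    return num / den

# Half the replicates below the estimate: no median bias.
z0_demo = bias_correction([1.0, 2.0, 3.0, 4.0], theta_hat=2.5)

# A right-skewed sample gives positive acceleration for the mean,
# a symmetric one gives (numerically) zero.
a_skewed = jackknife_acceleration([1.0, 2.0, 2.0, 3.0, 4.0, 20.0], statistics.fmean)
a_symmetric = jackknife_acceleration([1.0, 2.0, 3.0, 4.0, 5.0], statistics.fmean)
print(z0_demo, a_skewed, a_symmetric)
```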

174.2.3 2.3 Assembling the BCa endpoints

With $\hat z_0$ and $\hat a$ in hand, BCa replaces the fixed percentiles $\alpha$ and $1 - \alpha$ with adjusted ones. For a target tail probability $\alpha$ the adjusted quantile is

\[ \alpha_1 = \Phi\!\left( \hat z_0 + \frac{\hat z_0 + z_\alpha}{1 - \hat a (\hat z_0 + z_\alpha)} \right), \]

where $z_\alpha = \Phi^{-1}(\alpha)$. The upper level $\alpha_2$ is defined analogously, with $z_{1-\alpha}$ in place of $z_\alpha$. The interval is then $[\hat\theta^*_{(\alpha_1)}, \hat\theta^*_{(\alpha_2)}]$. When both corrections vanish, $\alpha_1 = \alpha$ and $\alpha_2 = 1 - \alpha$, so BCa collapses to the percentile interval. BCa is second-order accurate, meaning its one-sided coverage error shrinks like $1/n$ rather than the $1/\sqrt n$ of the percentile and standard normal intervals, and it is the recommended default when the cost of the extra jackknife pass is affordable. The following sketch shows the structure.

z0 = Phi_inv(mean(theta_star < theta_hat))
a  = jackknife_acceleration(data, metric)
endpoints = []
for tail in (alpha, 1 - alpha):
    z = Phi_inv(tail)
    adj = Phi(z0 + (z0 + z) / (1 - a*(z0 + z)))
    endpoints.append(quantile(theta_star, adj))

A practical caution: BCa can behave poorly when $B$ is too small, because the adjusted quantiles may push into the extreme tail where only a handful of replicates live. Use at least a few thousand resamples, and be alert when $\hat a$ or $\hat z_0$ is large, since that signals a skewed or biased metric where reporting the interval alongside the replicate histogram is wise.
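
The sketch in this section leaves `Phi`, `Phi_inv`, `quantile`, and `jackknife_acceleration` unspecified. One way to fill them in, using only the Python standard library, is shown below. Treat it as a teaching sketch under stated assumptions: the function name `bca_interval`, the order-statistic quantile rule, the clamping of the adjusted index, and the toy data are all choices made for this example.

```python
import math
import random
import statistics
from statistics import NormalDist

def bca_interval(data, metric, n_resamples=4000, alpha=0.025, seed=0):
    """BCa interval at level 1 - 2*alpha for metric(data)."""
    rng = random.Random(seed)
    n = len(data)
    nd = NormalDist()
    theta_hat = metric(data)
    reps = sorted(metric(rng.choices(data, k=n)) for _ in range(n_resamples))

    # Bias correction from the fraction of replicates below the estimate.
    z0 = nd.inv_cdf(sum(r < theta_hat for r in reps) / n_resamples)

    # Acceleration from the leave-one-out jackknife.
    loo = [metric(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.fmean(loo)
    num = sum((center - t) ** 3 for t in loo)
    den = 6.0 * sum((center - t) ** 2 for t in loo) ** 1.5
    a = num / den if den > 0 else 0.0

    endpoints = []
    for tail in (alpha, 1 - alpha):
        z = nd.inv_cdf(tail)
        adj = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        # Order-statistic quantile, clamped to the available replicates.
        k = min(max(math.ceil(adj * n_resamples) - 1, 0), n_resamples - 1)
        endpoints.append(reps[k])
    return theta_hat, z0, a, tuple(endpoints)

# Hypothetical right-skewed scores.
data = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 13.0, 21.0, 34.0]
theta_hat, z0, a, (lo, hi) = bca_interval(data, statistics.fmean)
print(f"estimate={theta_hat:.3f} z0={z0:.3f} a={a:.3f} BCa=[{lo:.3f}, {hi:.3f}]")
```

Printing `z0` and `a` next to the interval is a cheap way to follow the caution above: large values flag a skewed or biased replicate distribution.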

174.3 3. Out-of-Bag and the 0.632 Bootstrap

174.3.1 3.1 Resampling for model assessment

The bootstrap can also estimate prediction error, not just the variance of a fixed metric. Here each bootstrap resample becomes a training set, and the question is how to use the resamples to estimate how well a model trained on $n$ points will generalize. The naive idea of training on a resample and testing on the same original data is badly optimistic, because roughly two thirds of the data appear in any given resample and the model has effectively seen them.

The key combinatorial fact is that the probability a particular observation is omitted from a bootstrap sample of size $n$ is

\[ \left(1 - \frac{1}{n}\right)^n \longrightarrow e^{-1} \approx 0.368 \]

as $n$ grows. So on average about $36.8\%$ of the data are left out of each resample. These omitted points are out-of-bag (OOB), and they form a ready made held out set for the model trained on that resample.
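
A quick numerical check of this limit, together with a simulation of the out-of-bag fraction, can be written as follows. The sample size and number of simulated resamples are arbitrary illustrative choices.

```python
import random

def omit_probability(n):
    """Exact probability that a given item is absent from a size-n resample."""
    return (1 - 1 / n) ** n

for n in (5, 20, 100, 1000):
    print(n, round(omit_probability(n), 4))

# Empirical check: average out-of-bag fraction over simulated resamples.
rng = random.Random(0)
n, B = 500, 400
oob_fraction = sum(
    1 - len(set(rng.choices(range(n), k=n))) / n for _ in range(B)
) / B
print(f"mean OOB fraction = {oob_fraction:.4f}")
```

The convergence is fast: already at moderate $n$ the exact probability is close to $e^{-1}$, which is why $0.368$ and $0.632$ are used as constants regardless of sample size.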

174.3.2 3.2 The out-of-bag estimator

For each observation $i$, collect the bootstrap samples in which $i$ was out-of-bag, predict $i$ with each corresponding model, and aggregate. The leave-one-out bootstrap error is

\[ \widehat{\mathrm{Err}}^{(1)} = \frac{1}{n} \sum_{i=1}^n \frac{1}{|C_i|} \sum_{b \in C_i} L\!\left(y_i, \hat f^{*b}(x_i)\right), \]

where $C_i$ is the set of resamples not containing $i$ and $L$ is the loss. This estimator is the foundation of the out-of-bag error reported by random forests, where it comes for free as a byproduct of the bagging procedure and gives an honest generalization estimate without a separate validation split.
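
The estimator can be sketched end to end with a deliberately simple model. Everything named here is an assumption for illustration: a one-dimensional nearest-centroid classifier stands in for $\hat f$, the loss is 0-1, and the data are two hypothetical Gaussian classes. Items that never fall out-of-bag are skipped, which matters only when the number of resamples is very small.

```python
import random

def fit_centroids(points):
    """Toy nearest-centroid model: mean feature value per class."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def oob_error(data, n_resamples=200, seed=0):
    """Leave-one-out bootstrap error Err^(1) under 0-1 loss."""
    rng = random.Random(seed)
    n = len(data)
    losses = [[] for _ in range(n)]          # losses[i] collects L over C_i
    for _ in range(n_resamples):
        idx = rng.choices(range(n), k=n)
        model = fit_centroids([data[j] for j in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:              # item i is out-of-bag for this model
                x, y = data[i]
                losses[i].append(float(predict(model, x) != y))
    # Average within each item first, then across items with non-empty C_i.
    per_item = [sum(l) / len(l) for l in losses if l]
    return sum(per_item) / len(per_item)

gen = random.Random(1)
data = [(gen.gauss(0.0, 1.0), 0) for _ in range(50)] + \
       [(gen.gauss(2.0, 1.0), 1) for _ in range(50)]
err1 = oob_error(data)
print(f"Err(1) = {err1:.3f}")
```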

174.3.3 3.3 The 0.632 and 0.632+ corrections

The leave-one-out bootstrap error is biased upward, because each model is trained on the distinct elements of a resample, roughly $0.632 n$ effective points, which is fewer than the full $n$ and therefore yields a weaker learner. Efron and Tibshirani proposed averaging it with the optimistic resubstitution error $\overline{\mathrm{err}}$, the training error on the full data, to cancel the biases:

\[ \widehat{\mathrm{Err}}^{(0.632)} = 0.368 \, \overline{\mathrm{err}} + 0.632 \, \widehat{\mathrm{Err}}^{(1)} . \]

The weight $0.632 = 1 - e^{-1}$ is the expected fraction of distinct points in a resample, which is why the estimator carries that name. This blend works well when overfitting is mild but is itself optimistic for severe overfitters, such as a one nearest neighbor classifier whose resubstitution error is zero. The 0.632+ estimator addresses this by letting the weight depend on the relative overfitting rate $R$, defined through the no-information error rate $\gamma$ that the model would achieve if features and labels were independent:

\[ \hat w = \frac{0.632}{1 - 0.368\,\hat R}, \qquad \hat R = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat\gamma - \overline{\mathrm{err}}}. \]

The final estimator is $\widehat{\mathrm{Err}}^{(0.632+)} = (1 - \hat w)\,\overline{\mathrm{err}} + \hat w\,\widehat{\mathrm{Err}}^{(1)}$. When overfitting is absent, $\hat R = 0$ and the weight returns to $0.632$; when overfitting is total, the weight rises toward one and the estimator leans on the honest out-of-bag term. In modern practice cross validation has largely displaced the 0.632 family for routine error estimation, but the OOB idea remains central to ensemble methods and the 0.632 correction is still valuable for small samples where every data point counts.
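
The two blends are plain arithmetic once $\overline{\mathrm{err}}$, $\widehat{\mathrm{Err}}^{(1)}$, and $\hat\gamma$ are in hand. The sketch below implements the formulas above; the function names are hypothetical, and the guards that keep $\hat R$ in $[0, 1]$ are a defensive assumption of this example rather than something stated in the text.

```python
def err_632(err_train, err_oob):
    """Fixed-weight 0.632 blend of resubstitution and out-of-bag error."""
    return 0.368 * err_train + 0.632 * err_oob

def err_632_plus(err_train, err_oob, gamma):
    """0.632+ estimate; gamma is the no-information error rate."""
    if gamma <= err_train:
        r = 0.0                               # guard (assumption): R undefined
    else:
        r = (err_oob - err_train) / (gamma - err_train)
        r = min(max(r, 0.0), 1.0)             # guard (assumption): keep R in [0, 1]
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * err_train + w * err_oob

# A 1-NN-like case: zero training error, OOB error at the no-information rate.
plain = err_632(0.0, 0.5)
plus = err_632_plus(0.0, 0.5, gamma=0.5)
# No overfitting: both errors equal, so the blend leaves them unchanged.
flat = err_632_plus(0.2, 0.2, gamma=0.5)
print(f"0.632: {plain:.3f}  0.632+: {plus:.3f}  no overfit: {flat:.3f}")
```

In the first case the plain blend reports $0.316$ for a model that is no better than chance, while 0.632+ puts all its weight on the out-of-bag term and returns $0.5$.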

174.4 4. Bootstrapping Metric Distributions

174.4.1 4.1 Beyond a single scalar

Evaluation rarely reduces to one number. We often want the full sampling distribution of a metric, the joint behavior of several metrics, or a confidence interval on the difference between two systems. The bootstrap handles all of these uniformly, because each resample yields a complete recomputation of whatever metric we care about, however nonlinear.

Consider comparing system $A$ and system $B$ on the same test set. We resample test items, and on each resample we recompute both metrics on the identical resampled items, recording the difference $\Delta^*_b = m_A^* - m_B^*$. This paired bootstrap preserves the correlation between the two systems’ performance, which sharpens the interval relative to treating the systems independently. If the $1 - 2\alpha$ percentile interval for $\{\Delta^*_b\}$ excludes zero, we have evidence that the systems differ at that level. The same construction gives intervals for ratios, for differences of F1 scores, or for any contrast that would be awkward to handle with a closed form variance.

for b in 1..B:
    idx = sample_with_replacement(range(n), n)
    delta_star[b] = metric_A(data[idx]) - metric_B(data[idx])
ci = (quantile(delta_star, 0.025), quantile(delta_star, 0.975))
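
The loop above can be made concrete for the common case where both metrics are means of cached per-item scores. The sketch below is one possible rendering in standard-library Python; the function name, the order-statistic quantile rule, and the synthetic correctness vectors are assumptions for illustration.

```python
import math
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=4000, alpha=0.025, seed=0):
    """Percentile CI for mean(scores_a) - mean(scores_b) on shared test items."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = []
    for _ in range(n_resamples):
        idx = rng.choices(range(n), k=n)      # one draw, reused for both systems
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[max(math.ceil(alpha * n_resamples) - 1, 0)]
    hi = deltas[min(math.ceil((1 - alpha) * n_resamples) - 1, n_resamples - 1)]
    return lo, hi

# Hypothetical 300 items: 204 both right, 36 only A right, 12 only B right, 48 both wrong.
scores_a = [1] * 204 + [1] * 36 + [0] * 12 + [0] * 48   # accuracy 0.80
scores_b = [1] * 204 + [0] * 36 + [1] * 12 + [0] * 48   # accuracy 0.72
lo, hi = paired_bootstrap_ci(scores_a, scores_b)
print(f"95% CI for accuracy difference: [{lo:.3f}, {hi:.3f}]")
```

Because the two systems agree on most items, the per-item differences are mostly zero and the interval for the difference is much tighter than the two separate accuracy intervals would suggest.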

174.4.2 4.2 Stratified and grouped resampling

Real evaluation sets have structure that plain resampling ignores. When a metric is computed over imbalanced classes, an unstratified resample can occasionally contain no positives, making metrics like precision undefined or wildly unstable. Stratified resampling fixes the count drawn from each class to its observed value, keeping every replicate well defined and reducing variance. When examples are clustered, as with multiple questions per document or several utterances per speaker, we resample whole clusters to respect the dependence, as noted in section 1.3. The unit of resampling should always match the unit of independence in the data generating process; choosing it wrongly is the single most common way bootstrap intervals in machine learning come out too narrow.
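
Both schemes only change how the index vector is drawn; the metric is then recomputed on the selected rows exactly as before. A minimal sketch, with hypothetical helper names and toy labels and group ids:

```python
import random
from collections import Counter, defaultdict

def stratified_indices(labels, rng):
    """Resample within each class, keeping every class count fixed."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    idx = []
    for members in by_class.values():
        idx.extend(rng.choices(members, k=len(members)))
    return idx

def cluster_indices(groups, rng):
    """Resample whole groups (e.g. documents), keeping all rows of each draw."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    keys = list(by_group)
    idx = []
    for g in rng.choices(keys, k=len(keys)):
        idx.extend(by_group[g])
    return idx

rng = random.Random(0)
labels = [1] * 5 + [0] * 95                 # hypothetical imbalanced classes
s_idx = stratified_indices(labels, rng)
n_pos = sum(labels[i] for i in s_idx)       # equals the observed count of positives

groups = [i // 4 for i in range(40)]        # hypothetical: 10 documents, 4 rows each
c_idx = cluster_indices(groups, rng)
rows_per_group = Counter(groups[i] for i in c_idx)
print(n_pos, dict(rows_per_group))
```

With unequal cluster sizes the cluster resample no longer has exactly $n$ rows, which is expected: the number of independent units, not the number of rows, is what stays fixed.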

174.4.3 4.3 Reporting and interpretation

A bootstrap interval is a statement about sampling variability under repeated draws of the test set from the same distribution. It does not capture distribution shift, annotation error, or bias in the test set itself, and it should never be presented as if it did. Three habits make bootstrap reporting trustworthy. First, fix and report the random seed and the number of resamples $B$ so the interval is reproducible. Second, prefer BCa or studentized intervals when the metric is skewed or bounded near an endpoint, and fall back to percentile intervals when the jackknife for acceleration is too expensive. Third, when comparing systems, bootstrap the difference directly rather than checking whether two separate intervals overlap. The overlap check is a conservative and sometimes misleading proxy for a significant difference: two intervals can overlap even when the paired difference is clearly nonzero, because separate intervals ignore the correlation between systems scored on the same items. Plotting the replicate histogram alongside the reported interval is the cheapest diagnostic available and routinely exposes multimodality, boundary effects, or degeneracy that a bare interval would hide.

174.4.4 4.4 Cost and scale

The computational cost of the bootstrap is $B$ times the cost of evaluating the metric once. For cheap metrics on modest test sets this is trivial, but for expensive evaluations, such as scoring large language model outputs with a judge model, recomputing the metric thousands of times is prohibitive. Two economies help. The first is to cache the per-example scores once and resample indices into the cached score vector, so each resample costs an $O(n)$ aggregation rather than a full re-evaluation. This works whenever the metric is a function of fixed per-example quantities, which covers accuracy, mean scores, and many ranking metrics. The second is the multiplier or Bayesian bootstrap, which replaces integer resample counts with continuous random weights (flat Dirichlet weights in the Bayesian bootstrap), typically gives a smoother replicate distribution for smooth functionals, and shares the same caching trick. With per-example caching, even a heavy judge-based metric admits a full bootstrap distribution at the price of one evaluation pass plus a few seconds of resampling arithmetic.
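
Both economies operate on the same cached score vector. The sketch below contrasts the ordinary bootstrap of a mean with a Bayesian bootstrap whose flat Dirichlet weights are generated by normalizing independent Exp(1) draws. Function names and the synthetic judge scores are assumptions for illustration.

```python
import random
import statistics

def resample_means(scores, n_resamples, rng):
    """Ordinary bootstrap of the mean over cached per-item scores."""
    n = len(scores)
    return [sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)]

def bayesian_bootstrap_means(scores, n_resamples, rng):
    """Weighted means under flat Dirichlet weights (normalized Exp(1) draws)."""
    out = []
    for _ in range(n_resamples):
        g = [rng.expovariate(1.0) for _ in scores]
        total = sum(g)
        out.append(sum(w * s for w, s in zip(g, scores)) / total)
    return out

rng = random.Random(0)
# Hypothetical cached judge scores on a 1-5 scale, computed once per item.
scores = [rng.choice([1, 2, 3, 4, 5]) for _ in range(200)]

boot = resample_means(scores, 2000, rng)
bayes = bayesian_bootstrap_means(scores, 2000, rng)
sd_boot, sd_bayes = statistics.stdev(boot), statistics.stdev(bayes)
print(f"bootstrap se = {sd_boot:.4f}, Bayesian bootstrap sd = {sd_bayes:.4f}")
```

For a mean the two spreads should be close; the weighted version additionally never produces a replicate in which some item, or some whole class, is entirely absent.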

174.5 5. A Reference Implementation

The aiinaction companion library ships a from-scratch bootstrap for the mean of a one-dimensional sample, exposing both the percentile and BCa intervals discussed above. Three design choices make it a faithful teaching implementation rather than a black box. First, the resampling uses a fully specified 64-bit linear congruential generator with Lemire’s multiplicative index map, so the exact resamples, and hence every replicate and interval endpoint, are reproducible bit-for-bit across Python, Julia, and Rust given the same seed. Second, the standard normal quantile $\Phi^{-1}$ is Acklam’s rational approximation and the CDF $\Phi$ is built on a self-contained error function, so the BCa quantile adjustments need no external statistics dependency. Third, the acceleration $\hat a$ is computed from the exact leave-one-out jackknife of the mean, using the identity $\hat\theta_{(i)} = (\sum_j x_j - x_i)/(n-1)$ so the whole jackknife costs $O(n)$.
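
The jackknife identity in the third point can be checked independently of the library. The following standalone sketch is not the library's code; it compares the direct leave-one-out means with the one-pass identity on a small sample and then forms $\hat a$ from the formula in section 2.2.

```python
import math

scores = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0, 9.0]
n, total = len(scores), sum(scores)

# Direct jackknife: recompute the mean n times, O(n^2) work in total.
loo_direct = [sum(scores[:i] + scores[i + 1:]) / (n - 1) for i in range(n)]
# Identity from the text: one subtraction per point, O(n) work in total.
loo_fast = [(total - x) / (n - 1) for x in scores]
same = all(math.isclose(d, f) for d, f in zip(loo_direct, loo_fast))

center = sum(loo_fast) / n
num = sum((center - t) ** 3 for t in loo_fast)
den = 6.0 * sum((center - t) ** 2 for t in loo_fast) ** 1.5
a_hat = num / den
print(same, round(a_hat, 4))
```

For the mean, $\hat\theta_{(\cdot)} - \hat\theta_{(i)} = (x_i - \bar x)/(n-1)$, so $\hat a$ is a scaled sample skewness and is positive for a right-skewed sample such as this one.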

The example below bootstraps a small set of per-example scores, the kind of cached metric values described in section 4.4, and reports the point estimate, the bootstrap standard error, and both intervals.


from aiinaction.ch169_bootstrap import bootstrap_mean_ci

# Cached per-example scores (e.g. per-item accuracy or judge scores).
scores = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0, 9.0]

perc = bootstrap_mean_ci(scores, n_resamples=500, alpha=0.025,
                         method="percentile", seed=12345)
bca = bootstrap_mean_ci(scores, n_resamples=500, alpha=0.025,
                        method="bca", seed=12345)

print(f"estimate        = {perc.estimate:.4f}")
print(f"standard error  = {perc.standard_error:.4f}")
print(f"95% percentile  = [{perc.ci_low:.4f}, {perc.ci_high:.4f}]")
print(f"95% BCa         = [{bca.ci_low:.4f}, {bca.ci_high:.4f}]")

estimate        = 14.7500
standard error  = 4.4018
95% percentile  = [7.4344, 24.5062]
95% BCa         = [8.5505, 26.8397]

using AIInAction.Ch169Bootstrap

scores = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0, 9.0]

perc = bootstrap_mean_ci(scores; n_resamples=500, alpha=0.025,
                         method="percentile", seed=12345)
bca = bootstrap_mean_ci(scores; n_resamples=500, alpha=0.025,
                        method="bca", seed=12345)

println("estimate        = ", round(perc.estimate; digits=4))
println("standard error  = ", round(perc.standard_error; digits=4))
println("95% percentile  = [", round(perc.ci_low; digits=4), ", ",
        round(perc.ci_high; digits=4), "]")
println("95% BCa         = [", round(bca.ci_low; digits=4), ", ",
        round(bca.ci_high; digits=4), "]")

use aiinaction::ch169_bootstrap::bootstrap_mean_ci;

fn main() {
    let scores = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0, 9.0];

    let perc = bootstrap_mean_ci(&scores, 500, 0.025, "percentile", 12345).unwrap();
    let bca = bootstrap_mean_ci(&scores, 500, 0.025, "bca", 12345).unwrap();

    println!("estimate        = {:.4}", perc.estimate);
    println!("standard error  = {:.4}", perc.standard_error);
    println!("95% percentile  = [{:.4}, {:.4}]", perc.ci_low, perc.ci_high);
    println!("95% BCa         = [{:.4}, {:.4}]", bca.ci_low, bca.ci_high);
}

All three print the same numbers: a point estimate of $14.75$, a bootstrap standard error near $4.40$, a percentile interval of about $[7.43, 24.51]$, and a BCa interval of about $[8.55, 26.84]$. The BCa interval sits noticeably higher than the percentile one here, because this small heavy-tailed sample (the value $42$ is an outlier) makes the bootstrap distribution right-skewed, exactly the regime where the bias and acceleration corrections earn their keep. Note that the example uses only $B = 500$ resamples; for reported results, the guidance in section 2.3 of at least a few thousand resamples applies.

174.6 6. References

Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, 7(1), 1 to 26. https://doi.org/10.1214/aos/1176344552
Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593
Efron, B. (1987). Better Bootstrap Confidence Intervals. Journal of the American Statistical Association, 82(397), 171 to 185. https://doi.org/10.1080/01621459.1987.10478410
Efron, B., and Tibshirani, R. (1997). Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association, 92(438), 548 to 560. https://doi.org/10.1080/01621459.1997.10474007
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5 to 32. https://doi.org/10.1023/A:1010933404324
DiCiccio, T. J., and Efron, B. (1996). Bootstrap Confidence Intervals. Statistical Science, 11(3), 189 to 228. https://doi.org/10.1214/ss/1032280214
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
Rubin, D. B. (1981). The Bayesian Bootstrap. Annals of Statistics, 9(1), 130 to 134. https://doi.org/10.1214/aos/1176345338
Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. https://doi.org/10.1017/CBO9780511802843

# Bootstrapping for Evaluation When we report that a classifier achieves an accuracy of 0.91 or that a retrieval system reaches a mean reciprocal rank of 0.63, we are reporting a single number computed from one finite test set. That number is a point estimate of a quantity we cannot observe directly: the performance the system would attain on the full population of inputs it will eventually face. The bootstrap is the most general tool we have for quantifying how much that point estimate would wobble if we had drawn a different test set. This chapter develops the bootstrap principle, the construction of percentile and bias corrected and accelerated (BCa) confidence intervals, the out-of-bag and 0.632 estimators that recycle bootstrap resamples into model assessment, and the practical machinery of bootstrapping whole metric distributions for machine learning evaluation. ## 1. The Bootstrap Principle ### 1.1 The plug-in idea Suppose our test data $x_1, \dots, x_n$ are drawn independently from an unknown distribution $F$. A metric of interest is a functional $\theta = t(F)$, for example the population accuracy or the population area under the ROC curve. We estimate it with the plug-in estimator $\hat\theta = t(\hat F_n)$, where $\hat F_n$ is the empirical distribution that places mass $1/n$ on each observed point. The trouble is that $\hat\theta$ is a random quantity, and we want its sampling distribution so we can attach an interval to it. The bootstrap principle is a single substitution: since we do not know $F$, we use $\hat F_n$ in its place. Where the true sampling distribution comes from resampling $n$ points from $F$, the bootstrap sampling distribution comes from resampling $n$ points from $\hat F_n$. Resampling from $\hat F_n$ is exactly sampling $n$ items with replacement from the observed data. 
Efron's insight in 1979 was that this substitution, although it looks circular, is justified whenever $\hat F_n$ converges to $F$ and the functional $t$ is smooth enough that small perturbations of its argument produce small perturbations of its value. ### 1.2 The Monte Carlo algorithm We almost never compute the bootstrap distribution analytically. Instead we approximate it by Monte Carlo: ```text for b in 1..B: sample x*_1..x*_n with replacement from x_1..x_n theta*_b = metric(x*_1..x*_n) return {theta*_1, ..., theta*_B} ``` The collection $\{\hat\theta^*_b\}_{b=1}^B$ approximates the sampling distribution of $\hat\theta$. From it we read off the bootstrap standard error as the sample standard deviation of the replicates, $$ \widehat{\operatorname{se}}_{\mathrm{boot}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \left(\hat\theta^*_b - \bar{\theta}^*\right)^2}, \qquad \bar{\theta}^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b . $$ Two sources of error are now in play. There is statistical error, which depends on $n$ and shrinks as the test set grows, and there is Monte Carlo error, which depends on $B$ and shrinks as we draw more resamples. We control the second freely. For standard errors $B$ in the hundreds suffices, but for the tail quantiles needed by confidence intervals we want $B$ in the low thousands, since a $95\%$ interval depends on the behavior of the empirical distribution near its $2.5$th and $97.5$th percentiles where samples are sparse. ### 1.3 When the bootstrap works and when it fails The bootstrap is consistent for smooth functionals of the distribution: means, variances, correlations, and most differentiable risk metrics. It struggles or fails for non-smooth functionals. The sample maximum is the textbook failure, because the maximum of a resample can never exceed the maximum of the original data, so the bootstrap distribution is degenerate on one side. 
Heavy tailed data with infinite variance and parameters on the boundary of their space are further cases where the naive bootstrap is unreliable. In evaluation work the most common pitfall is not non-smoothness but dependence: if test examples are correlated, for instance multiple turns from the same conversation or several patches from the same image, resampling individual rows breaks the dependence structure and understates the variance. The remedy is to resample the independent unit, the conversation or the image, using a block or cluster bootstrap rather than a row level one. ## 2. Percentile and BCa Intervals ### 2.1 The percentile interval The most direct way to turn bootstrap replicates into a confidence interval is to read off the empirical quantiles. Let $\hat\theta^*_{(\alpha)}$ denote the $\alpha$ quantile of the replicate distribution. The percentile interval at level $1 - 2\alpha$ is $$ \left[\, \hat\theta^*_{(\alpha)}, \; \hat\theta^*_{(1-\alpha)} \,\right]. $$ For a $95\%$ interval we take the $2.5$th and $97.5$th percentiles of the $B$ replicates. The percentile interval is appealing because it is transformation respecting: if we had bootstrapped a monotone function of $\theta$ instead, the interval would simply be the transformed interval. It also automatically lands inside the natural range of the metric, so an interval for accuracy never strays below zero or above one. Its weakness is that it assumes the bootstrap distribution is approximately unbiased and symmetric on some transformed scale, and it makes no correction when that assumption is violated. ### 2.2 Bias and acceleration The BCa interval, also from Efron, repairs the percentile interval with two corrections. The first is a bias correction $\hat z_0$ that measures the median bias of the bootstrap distribution. 
We compute it from the fraction of replicates below the original estimate, $$ \hat z_0 = \Phi^{-1}\!\left( \frac{\#\{\hat\theta^*_b < \hat\theta\}}{B} \right), $$ where $\Phi^{-1}$ is the standard normal quantile function. If exactly half the replicates fall below $\hat\theta$ then $\hat z_0 = 0$ and there is no median bias. The second correction is an acceleration $\hat a$ that measures how fast the standard error of $\hat\theta$ changes with the underlying value, capturing skewness. It is computed from the jackknife. Let $\hat\theta_{(i)}$ be the estimate with the $i$th point deleted and $\hat\theta_{(\cdot)}$ their mean. Then $$ \hat a = \frac{\sum_{i=1}^n \left(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\right)^3}{6 \left[\sum_{i=1}^n \left(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\right)^2\right]^{3/2}}. $$ ### 2.3 Assembling the BCa endpoints With $\hat z_0$ and $\hat a$ in hand, BCa replaces the fixed percentiles $\alpha$ and $1 - \alpha$ with adjusted ones. For a target tail probability $\alpha$ the adjusted quantile is $$ \alpha_1 = \Phi\!\left( \hat z_0 + \frac{\hat z_0 + z_\alpha}{1 - \hat a (\hat z_0 + z_\alpha)} \right), $$ where $z_\alpha = \Phi^{-1}(\alpha)$, and analogously for the upper endpoint using $z_{1-\alpha}$. The interval is then $[\hat\theta^*_{(\alpha_1)}, \hat\theta^*_{(\alpha_2)}]$. When both corrections vanish, $\alpha_1 = \alpha$ and BCa collapses to the percentile interval. BCa is second order accurate, meaning its coverage error shrinks like $1/n$ rather than the $1/\sqrt n$ of the percentile and standard normal intervals, and it is the recommended default when the cost of the extra jackknife pass is affordable. The following sketch shows the structure. 
```text z0 = Phi_inv(mean(theta_star < theta_hat)) a = jackknife_acceleration(data, metric) for tail in (alpha, 1 - alpha): z = Phi_inv(tail) adj = Phi(z0 + (z0 + z) / (1 - a*(z0 + z))) endpoints.append(quantile(theta_star, adj)) ``` A practical caution: BCa can behave poorly when $B$ is too small, because the adjusted quantiles may push into the extreme tail where only a handful of replicates live. Use at least a few thousand resamples, and be alert when $\hat a$ or $\hat z_0$ is large, since that signals a skewed or biased metric where reporting the interval alongside the replicate histogram is wise. ## 3. Out-of-Bag and the 0.632 Bootstrap ### 3.1 Resampling for model assessment The bootstrap can also estimate prediction error, not just the variance of a fixed metric. Here each bootstrap resample becomes a training set, and the question is how to use the resamples to estimate how well a model trained on $n$ points will generalize. The naive idea of training on a resample and testing on the same original data is badly optimistic, because roughly two thirds of the data appear in any given resample and the model has effectively seen them. The key combinatorial fact is that the probability a particular observation is omitted from a bootstrap sample of size $n$ is $$ \left(1 - \frac{1}{n}\right)^n \longrightarrow e^{-1} \approx 0.368 $$ as $n$ grows. So on average about $36.8\%$ of the data are left out of each resample. These omitted points are out-of-bag (OOB), and they form a ready made held out set for the model trained on that resample. ### 3.2 The out-of-bag estimator For each observation $i$, collect the bootstrap samples in which $i$ was out-of-bag, predict $i$ with each corresponding model, and aggregate. The leave-one-out bootstrap error is $$ \widehat{\mathrm{Err}}^{(1)} = \frac{1}{n} \sum_{i=1}^n \frac{1}{|C_i|} \sum_{b \in C_i} L\!\left(y_i, \hat f^{*b}(x_i)\right), $$ where $C_i$ is the set of resamples not containing $i$ and $L$ is the loss. 
This estimator is the foundation of the out-of-bag error reported by random forests, where it comes for free as a byproduct of the bagging procedure and gives an honest generalization estimate without a separate validation split. ### 3.3 The 0.632 and 0.632+ corrections The leave-one-out bootstrap error is biased upward, because each model is trained on the distinct elements of a resample, roughly $0.632 n$ effective points, which is fewer than the full $n$ and therefore a weaker learner. Efron and Tibshirani proposed averaging it with the optimistic resubstitution error $\overline{\mathrm{err}}$, the training error on the full data, to cancel the biases: $$ \widehat{\mathrm{Err}}^{(0.632)} = 0.368 \, \overline{\mathrm{err}} + 0.632 \, \widehat{\mathrm{Err}}^{(1)} . $$ The weight $0.632 = 1 - e^{-1}$ is the expected fraction of distinct points in a resample, which is why the estimator carries that name. This blend works well when overfitting is mild but is itself optimistic for severe overfitters, such as a one nearest neighbor classifier whose resubstitution error is zero. The 0.632+ estimator addresses this by letting the weight depend on the relative overfitting rate $R$, defined through the no-information error rate $\gamma$ that the model would achieve if features and labels were independent: $$ \hat w = \frac{0.632}{1 - 0.368\,\hat R}, \qquad \hat R = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat\gamma - \overline{\mathrm{err}}}. $$ The final estimator is $\widehat{\mathrm{Err}}^{(0.632+)} = (1 - \hat w)\,\overline{\mathrm{err}} + \hat w\,\widehat{\mathrm{Err}}^{(1)}$. When overfitting is absent, $\hat R = 0$ and the weight returns to $0.632$; when overfitting is total, the weight rises toward one and the estimator leans on the honest out-of-bag term. 
In modern practice cross validation has largely displaced the 0.632 family for routine error estimation, but the OOB idea remains central to ensemble methods and the 0.632 correction is still valuable for small samples where every data point counts. ## 4. Bootstrapping Metric Distributions ### 4.1 Beyond a single scalar Evaluation rarely reduces to one number. We often want the full sampling distribution of a metric, the joint behavior of several metrics, or a confidence interval on the difference between two systems. The bootstrap handles all of these uniformly, because each resample yields a complete recomputation of whatever metric we care about, however nonlinear. Consider comparing system $A$ and system $B$ on the same test set. We resample test items, and on each resample we recompute both metrics on the identical resampled items, recording the difference $\Delta^*_b = m_A^* - m_B^*$. This paired bootstrap preserves the correlation between the two systems' performance, which sharpens the interval relative to treating the systems independently. If the $1 - 2\alpha$ percentile interval for $\{\Delta^*_b\}$ excludes zero, we have evidence that the systems differ at that level. The same construction gives intervals for ratios, for differences of F1 scores, or for any contrast that would be awkward to handle with a closed form variance. ```text for b in 1..B: idx = sample_with_replacement(range(n), n) delta_star[b] = metric_A(data[idx]) - metric_B(data[idx]) ci = (quantile(delta_star, 0.025), quantile(delta_star, 0.975)) ``` ### 4.2 Stratified and grouped resampling Real evaluation sets have structure that plain resampling ignores. When a metric is computed over imbalanced classes, an unstratified resample can occasionally contain no positives, making metrics like precision undefined or wildly unstable. Stratified resampling fixes the count drawn from each class to its observed value, keeping every replicate well defined and reducing variance. 
When examples are clustered, as with multiple questions per document or several utterances per speaker, we resample whole clusters to respect the dependence, as noted in section 1.3. The unit of resampling should always match the unit of independence in the data generating process; choosing it wrongly is the single most common way bootstrap intervals in machine learning come out too narrow.

### 4.3 Reporting and interpretation

A bootstrap interval is a statement about sampling variability under repeated draws of the test set from the same distribution. It does not capture distribution shift, annotation error, or bias in the test set itself, and it should never be presented as if it did.

Three habits make bootstrap reporting trustworthy. First, fix and report the random seed and the number of resamples $B$ so the interval is reproducible. Second, prefer BCa or studentized intervals when the metric is skewed or bounded near an endpoint, and fall back to percentile intervals when the jackknife for acceleration is too expensive. Third, when comparing systems, bootstrap the difference directly rather than checking whether two separate intervals overlap, because non-overlapping intervals are a conservative and sometimes misleading proxy for a significant difference.

Plotting the replicate histogram alongside the reported interval is the cheapest diagnostic available and routinely exposes multimodality, boundary effects, or degeneracy that a bare interval would hide.

### 4.4 Cost and scale

The computational cost of the bootstrap is $B$ times the cost of evaluating the metric once. For cheap metrics on modest test sets this is trivial, but for expensive evaluations, such as scoring large language model outputs with a judge model, recomputing the metric thousands of times is prohibitive. Two economies help.
The first is to cache the per-example scores once and resample indices into the cached score vector, so each resample costs an $O(n)$ aggregation rather than a full re-evaluation. This works whenever the metric is a function of fixed per-example quantities, which covers accuracy, mean scores, and many ranking metrics. The second is the multiplier or Bayesian bootstrap, which replaces integer resample counts with continuous Dirichlet weights and can reduce variance for smooth functionals while sharing the same caching trick. With per-example caching, even a heavy judge-based metric admits a full bootstrap distribution at the price of one evaluation pass plus a few seconds of resampling arithmetic.

## 5. A Reference Implementation

The `aiinaction` companion library ships a from-scratch bootstrap for the mean of a one-dimensional sample, exposing both the percentile and BCa intervals discussed above. Three design choices make it a faithful teaching implementation rather than a black box.

First, the resampling uses a fully specified 64-bit linear congruential generator with Lemire's multiplicative index map, so the exact resamples, and hence every replicate and interval endpoint, are reproducible bit-for-bit across Python, Julia, and Rust given the same seed. Second, the standard normal quantile $\Phi^{-1}$ is Acklam's rational approximation and the CDF $\Phi$ is built on a self-contained error function, so the BCa quantile adjustments need no external statistics dependency. Third, the acceleration $\hat a$ is computed from the exact leave-one-out jackknife of the mean, using the identity $\hat\theta_{(i)} = (\sum_j x_j - x_i)/(n-1)$ so the whole jackknife costs $O(n)$.

The example below bootstraps a small set of per-example scores, the kind of cached metric values described in section 4.4, and reports the point estimate, the bootstrap standard error, and both intervals.

::: {.panel-tabset}

## Python

```{python}
from aiinaction.ch169_bootstrap import bootstrap_mean_ci

# Cached per-example scores (e.g. per-item accuracy or judge scores).
scores = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0, 9.0]

perc = bootstrap_mean_ci(scores, n_resamples=500, alpha=0.025,
                         method="percentile", seed=12345)
bca = bootstrap_mean_ci(scores, n_resamples=500, alpha=0.025,
                        method="bca", seed=12345)

print(f"estimate       = {perc.estimate:.4f}")
print(f"standard error = {perc.standard_error:.4f}")
print(f"95% percentile = [{perc.ci_low:.4f}, {perc.ci_high:.4f}]")
print(f"95% BCa        = [{bca.ci_low:.4f}, {bca.ci_high:.4f}]")
```

## Julia

```julia
using AIInAction.Ch169Bootstrap

scores = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0, 9.0]

perc = bootstrap_mean_ci(scores; n_resamples=500, alpha=0.025,
                         method="percentile", seed=12345)
bca = bootstrap_mean_ci(scores; n_resamples=500, alpha=0.025,
                        method="bca", seed=12345)

println("estimate       = ", round(perc.estimate; digits=4))
println("standard error = ", round(perc.standard_error; digits=4))
println("95% percentile = [", round(perc.ci_low; digits=4), ", ",
        round(perc.ci_high; digits=4), "]")
println("95% BCa        = [", round(bca.ci_low; digits=4), ", ",
        round(bca.ci_high; digits=4), "]")
```

## Rust

```rust
use aiinaction::ch169_bootstrap::bootstrap_mean_ci;

fn main() {
    let scores = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0, 9.0];

    let perc = bootstrap_mean_ci(&scores, 500, 0.025, "percentile", 12345).unwrap();
    let bca = bootstrap_mean_ci(&scores, 500, 0.025, "bca", 12345).unwrap();

    println!("estimate       = {:.4}", perc.estimate);
    println!("standard error = {:.4}", perc.standard_error);
    println!("95% percentile = [{:.4}, {:.4}]", perc.ci_low, perc.ci_high);
    println!("95% BCa        = [{:.4}, {:.4}]", bca.ci_low, bca.ci_high);
}
```

:::

All three print the same numbers: a point estimate of $14.75$, a bootstrap standard error near $4.40$, a percentile interval of about $[7.43, 24.51]$, and a BCa interval of about $[8.55, 26.84]$.
The BCa interval sits noticeably higher than the percentile one here, because this small heavy-tailed sample (the value $42$ is an outlier) makes the bootstrap distribution right-skewed, exactly the regime where the bias and acceleration corrections earn their keep.

## 6. References

1. Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, 7(1), 1 to 26. https://doi.org/10.1214/aos/1176344552
2. Efron, B., and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593
3. Efron, B. (1987). Better Bootstrap Confidence Intervals. Journal of the American Statistical Association, 82(397), 171 to 185. https://doi.org/10.1080/01621459.1987.10478410
4. Efron, B., and Tibshirani, R. (1997). Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association, 92(438), 548 to 560. https://doi.org/10.1080/01621459.1997.10474007
5. Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5 to 32. https://doi.org/10.1023/A:1010933404324
6. DiCiccio, T. J., and Efron, B. (1996). Bootstrap Confidence Intervals. Statistical Science, 11(3), 189 to 228. https://doi.org/10.1214/ss/1032280214
7. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
8. Rubin, D. B. (1981). The Bayesian Bootstrap. Annals of Statistics, 9(1), 130 to 134. https://doi.org/10.1214/aos/1176345338
9. Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. https://doi.org/10.1017/CBO9780511802843