171 Multiple Comparisons and Corrections
171.1 1. Introduction
Modern machine learning practice is saturated with statistical tests. We tune dozens of hyperparameters, compare many models on the same benchmark, screen thousands of features for association with a label, and probe deep networks with hundreds of interpretability hypotheses. Each of these activities involves not one statistical decision but many. When many decisions are made together, the probability that at least one of them is wrong by chance grows quickly, and naive significance thresholds become misleading. This is the multiple comparisons problem.
This chapter develops the problem rigorously and presents the standard remedies. We define the family-wise error rate and the false discovery rate, derive the Bonferroni and Holm procedures for controlling the former, present the Benjamini-Hochberg procedure for controlling the latter, and discuss how these ideas apply when comparing many models. The goal is to give a working data scientist both the formal guarantees and the practical judgment needed to report honest results.
171.2 2. The Multiple Testing Problem
171.2.1 2.1 Setup and Notation
Suppose we test \(m\) null hypotheses \(H_1, \dots, H_m\) simultaneously. For each hypothesis \(H_i\) we compute a \(p\)-value \(p_i\), the probability under the null of observing a test statistic at least as extreme as the one observed. A single test is calibrated so that if \(H_i\) is true, then \(p_i\) is uniformly distributed on \([0,1]\), which gives
\[ \Pr(p_i \le \alpha \mid H_i \text{ true}) = \alpha . \]
Rejecting \(H_i\) when \(p_i \le \alpha\) therefore caps the chance of a false positive on that one test at the chosen level \(\alpha\), commonly \(0.05\).
171.2.2 2.2 Why the Problem Arises
The guarantee above is per test. Consider \(m\) independent tests where every null is true. The probability that we make no false rejection is \((1-\alpha)^m\), so the probability of at least one false rejection is
\[ \Pr(\text{at least one false positive}) = 1 - (1-\alpha)^m . \]
With \(\alpha = 0.05\) and \(m = 20\), this equals \(1 - 0.95^{20} \approx 0.64\). With \(m = 100\) it exceeds \(0.99\). In other words, if you run a hundred tests at the \(0.05\) level on pure noise, you are almost guaranteed to find something that looks significant. The label “significant” loses its meaning once it is detached from the number of comparisons that produced it.
This is not a contrived scenario. A feature screen across a genome, a grid search over a learning rate and a regularization coefficient, or an ablation study with twenty variants all generate large \(m\). The danger is acute when the search is informal, because the analyst may never count how many comparisons were really made.
171.2.3 2.3 The Confusion Matrix of Decisions
It helps to organize outcomes across the \(m\) tests. Let \(m_0\) be the number of true nulls and \(m_1 = m - m_0\) the number of true alternatives. After applying a decision rule, define the counts in the table below.
| Declared non-significant | Declared significant | Total | |
|---|---|---|---|
| True null | \(U\) | \(V\) | \(m_0\) |
| True alternative | \(T\) | \(S\) | \(m_1\) |
| Total | \(m - R\) | \(R\) | \(m\) |
Here \(R = V + S\) is the number of rejections, \(V\) is the number of false discoveries (type I errors), and \(T\) is the number of missed true effects (type II errors). The quantities \(V\), \(S\), \(T\), \(U\) are unobserved random variables, since we do not know which nulls are true. Error rate definitions are statements about the distributions of these counts.
171.3 3. The Family-Wise Error Rate
171.3.1 3.1 Definition
The family-wise error rate (FWER) is the probability of making at least one false rejection across the whole family of tests:
\[ \mathrm{FWER} = \Pr(V \ge 1) . \]
A procedure controls the FWER at level \(\alpha\) if \(\mathrm{FWER} \le \alpha\) regardless of which and how many nulls are true. This is a stringent guarantee: it treats even a single false positive among thousands as a failure. It is the right target when any false claim is costly, for example when each rejection triggers an expensive confirmatory experiment or a published assertion that others will build on.
171.3.2 3.2 The Bonferroni Correction
The simplest FWER procedure is the Bonferroni correction. Reject \(H_i\) whenever
\[ p_i \le \frac{\alpha}{m} . \]
The proof of control uses Boole’s inequality, which requires no assumption about dependence among the tests. Let \(I_0\) index the true nulls. Then
\[ \mathrm{FWER} = \Pr\!\left( \bigcup_{i \in I_0} \left\{ p_i \le \tfrac{\alpha}{m} \right\} \right) \le \sum_{i \in I_0} \Pr\!\left( p_i \le \tfrac{\alpha}{m} \right) = m_0 \cdot \frac{\alpha}{m} \le \alpha . \]
The inequality \(m_0 \le m\) gives the final bound. Because Boole’s inequality never assumes independence, Bonferroni control holds under arbitrary dependence among the \(p\)-values, which makes it remarkably robust.
The cost is power. Dividing the threshold by \(m\) makes each individual test much harder to pass, so genuine effects of modest size are frequently missed. When \(m\) is large and the tests are positively correlated, Bonferroni is conservative, meaning the true FWER sits well below \(\alpha\) and we sacrifice detections we did not need to.
171.3.3 3.3 The Holm Correction
Holm’s step-down procedure controls the FWER under the same arbitrary-dependence conditions as Bonferroni but is uniformly more powerful, so it should be preferred whenever Bonferroni would be used. Sort the \(p\)-values in ascending order,
\[ p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}, \]
with \(H_{(i)}\) the corresponding hypothesis. Find the smallest index \(k\) such that
\[ p_{(k)} > \frac{\alpha}{m - k + 1} . \]
Reject all hypotheses \(H_{(1)}, \dots, H_{(k-1)}\) and retain the rest. If no such \(k\) exists, reject every hypothesis.
The logic is sequential. The smallest \(p\)-value faces the full Bonferroni threshold \(\alpha/m\). Once it is rejected, only \(m-1\) hypotheses remain in contention, so the next is tested at \(\alpha/(m-1)\), and so on. The thresholds relax as we proceed, which is why Holm rejects at least as much as Bonferroni and often more. The proof that FWER \(\le \alpha\) again rests on Boole’s inequality applied to the true nulls at the step where the first true null is considered.
holm(p[1..m], alpha):
order = argsort(p) ascending
for j = 1..m:
threshold = alpha / (m - j + 1)
if p[order[j]] > threshold:
reject order[1..j-1]; stop
reject all
171.4 4. The False Discovery Rate
171.4.1 4.1 Motivation
Controlling the FWER is often too strict for exploratory analysis. If we screen \(20{,}000\) genes or rank \(5{,}000\) candidate features, we do not need certainty that every single discovery is real. We need the list of discoveries to be mostly correct, tolerating a small known fraction of false positives in exchange for far greater power. This shift in goal motivates the false discovery rate.
171.4.2 4.2 Definition
Define the false discovery proportion as the fraction of rejections that are false,
\[ \mathrm{FDP} = \frac{V}{\max(R, 1)}, \]
where the maximum in the denominator handles the case \(R = 0\) by setting the proportion to zero. The false discovery rate is its expectation,
\[ \mathrm{FDR} = \mathbb{E}\!\left[ \mathrm{FDP} \right] = \mathbb{E}\!\left[ \frac{V}{\max(R, 1)} \right] . \]
A procedure controls the FDR at level \(q\) if \(\mathrm{FDR} \le q\). Note the relationship to FWER: when all nulls are true, every rejection is false, so \(\mathrm{FDP} \in \{0,1\}\) and \(\mathrm{FDR} = \Pr(V \ge 1) = \mathrm{FWER}\). In that boundary case the two criteria coincide. When some alternatives are true, FDR control is strictly more permissive, because true discoveries inflate \(R\) in the denominator and dilute the proportion of errors.
171.4.3 4.3 The Benjamini-Hochberg Procedure
The Benjamini-Hochberg (BH) procedure controls the FDR at a target level \(q\). Sort the \(p\)-values ascending as before. Find the largest index \(k\) such that
\[ p_{(k)} \le \frac{k}{m} \, q . \]
Reject all hypotheses \(H_{(1)}, \dots, H_{(k)}\). If no index satisfies the inequality, reject nothing.
Geometrically, plot the sorted \(p\)-values against their ranks and overlay the line through the origin with slope \(q/m\). BH finds the rightmost point at or below the line and rejects everything up to and including it. The procedure is a step-up rule: it starts from the largest \(p\)-value and moves down until it finds a value that clears the threshold.
benjamini_hochberg(p[1..m], q):
order = argsort(p) ascending
k = 0
for i = 1..m:
if p[order[i]] <= (i / m) * q:
k = i
reject order[1..k]
Benjamini and Hochberg proved that under independence of the test statistics this procedure satisfies
\[ \mathrm{FDR} = \frac{m_0}{m} \, q \le q . \]
The factor \(m_0 / m\) shows the procedure is slightly conservative when some alternatives are true, since the realized FDR is below the nominal \(q\). The guarantee also holds under a condition called positive regression dependence, which covers many practical cases such as one-sided tests with positively correlated statistics. Under arbitrary dependence the bound degrades by the harmonic factor \(H_m = \sum_{i=1}^m 1/i \approx \ln m + 0.577\), giving the Benjamini-Yekutieli variant that uses the threshold \(\frac{k}{m H_m} q\).
171.4.4 4.4 An Adjusted p-Value View
Both Holm and BH can be expressed as adjusted \(p\)-values, which is how most software reports them. An adjusted \(p\)-value \(\tilde p_i\) is constructed so that rejecting whenever \(\tilde p_i \le \alpha\) reproduces the procedure. For BH the adjusted values are
\[ \tilde p_{(i)} = \min_{j \ge i} \, \min\!\left( \frac{m}{j} \, p_{(j)}, \; 1 \right), \]
with the outer minimum enforcing monotonicity. Reporting adjusted \(p\)-values lets a reader apply any threshold they like without rerunning the procedure.
171.5 5. Comparing Many Models
171.5.1 5.1 The Problem in Machine Learning
Model selection is a multiple comparisons problem in disguise. Suppose we evaluate \(m\) models on a shared validation set and pick the one with the best score. The maximum of \(m\) noisy estimates is biased upward, so the winner’s validation score systematically overstates its true performance. This is the optimism of selection, and it grows with \(m\). Reporting the selected model’s validation score as if it were an unbiased estimate is a form of the multiple comparisons error, and the remedy is a held-out test set untouched by selection, or nested cross-validation.
When the question is whether one model is genuinely better than another, rather than which is best, we run pairwise significance tests. With \(m\) models there are \(\binom{m}{2}\) pairwise comparisons, and the FWER across them inflates exactly as in Section 2.2.
171.5.2 5.2 Statistical Tests for Classifier Comparison
A common protocol follows Demsar. To compare several classifiers across multiple datasets, first run the Friedman test, a non-parametric analog of repeated-measures ANOVA on the ranks of the classifiers per dataset. The Friedman statistic tests the global null that all classifiers perform equally. If it rejects, proceed to post-hoc pairwise comparisons.
The post-hoc step is where corrections enter. When comparing all classifiers against one control, the Bonferroni-Dunn test adjusts the critical difference for the number of comparisons. When comparing all pairs, the Nemenyi test uses a studentized range statistic that already accounts for the multiplicity. If instead we have explicit pairwise \(p\)-values, applying Holm across the \(\binom{m}{2}\) comparisons gives FWER control with more power than Bonferroni.
compare_classifiers(scores[datasets, models]):
run Friedman test on per-dataset ranks
if not significant: stop, no differences claimed
compute pairwise p-values
adjust with Holm (FWER) or BH (FDR)
report adjusted significant pairs
171.5.3 5.3 Practical Guidance
Choose the error criterion before looking at results, and choose it from the cost structure of mistakes. Use FWER control, via Holm, when each false claim is individually expensive or will be confirmed downstream, such as the small set of models you intend to deploy or report in a paper. Use FDR control, via BH, for exploratory screens where a controlled fraction of false leads is acceptable, such as ranking thousands of features or interpretability hypotheses.
Count every comparison, including the informal ones. The most common failure is not a wrong formula but an undercounted \(m\): hyperparameters swept by hand, datasets revisited after a peek, or significance tests added until one passes. A pre-registered analysis plan, a sequestered test set, and an honest accounting of how many comparisons were truly made are worth more than any single correction. The corrections in this chapter are necessary, but they only work on the comparisons you admit to having run.
171.6 6. Summary
Running many tests at a fixed per-test level guarantees false positives. The family-wise error rate caps the chance of any false positive and is controlled by Bonferroni or, preferably, by the uniformly more powerful Holm procedure, both valid under arbitrary dependence. The false discovery rate caps the expected fraction of false positives among discoveries and is controlled by Benjamini-Hochberg under independence or positive dependence, trading a small known error fraction for substantial power. Model comparison is a multiple testing problem too, vulnerable to selection optimism and to inflated error across pairwise tests, and it should be handled with held-out evaluation and appropriate post-hoc corrections. The single most important practice is to count every comparison honestly and to fix the error criterion before seeing the data.
171.7 References
- Bonferroni, C. E. “Teoria statistica delle classi e calcolo delle probabilita.” Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 1936. https://en.wikipedia.org/wiki/Bonferroni_correction
- Holm, S. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics, 6(2), 65-70, 1979. https://www.jstor.org/stable/4615733
- Benjamini, Y., and Hochberg, Y. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B, 57(1), 289-300, 1995. https://www.jstor.org/stable/2346101
- Benjamini, Y., and Yekutieli, D. “The Control of the False Discovery Rate in Multiple Testing under Dependency.” Annals of Statistics, 29(4), 1165-1188, 2001. https://www.jstor.org/stable/2674075
- Demsar, J. “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7, 1-30, 2006. https://www.jmlr.org/papers/v7/demsar06a.html
- Dudoit, S., and van der Laan, M. J. “Multiple Testing Procedures with Applications to Genomics.” Springer, 2008. https://link.springer.com/book/10.1007/978-0-387-49317-6
- Wasserman, L. “All of Statistics: A Concise Course in Statistical Inference.” Springer, 2004. https://link.springer.com/book/10.1007/978-0-387-21736-9