171 Multiple Comparisons and Corrections

171.1 1. Introduction

Modern machine learning practice is saturated with statistical tests. We tune dozens of hyperparameters, compare many models on the same benchmark, screen thousands of features for association with a label, and probe deep networks with hundreds of interpretability hypotheses. Each of these activities involves not one statistical decision but many. When many decisions are made together, the probability that at least one of them is wrong by chance grows quickly, and naive significance thresholds become misleading. This is the multiple comparisons problem.

This chapter develops the problem rigorously and presents the standard remedies. We define the family-wise error rate and the false discovery rate, derive the Bonferroni and Holm procedures for controlling the former, present the Benjamini-Hochberg procedure for controlling the latter, and discuss how these ideas apply when comparing many models. The goal is to give a working data scientist both the formal guarantees and the practical judgment needed to report honest results.

The conceptual map below shows how the pieces fit together. A family of tests produces a family of $p$-values, the analyst chooses an error criterion from the cost of mistakes, and the criterion selects a procedure.

flowchart TD
    A["Family of m hypothesis tests"] --> B["m p-values"]
    B --> C{"Which error criterion"}
    C -->|"any false positive is costly"| D["Control FWER"]
    C -->|"a known false fraction is tolerable"| E["Control FDR"]
    D --> F["Holm step-down (preferred) or Bonferroni"]
    E --> G["Benjamini-Hochberg (independence or PRDS)"]
    E --> H["Benjamini-Yekutieli (arbitrary dependence)"]
    F --> I["Report adjusted p-values"]
    G --> I
    H --> I

171.2 2. The Multiple Testing Problem

171.2.1 2.1 Setup and Notation

Suppose we test $m$ null hypotheses $H_1, \dots, H_m$ simultaneously. For each hypothesis $H_i$ we compute a $p$-value $p_i$, the probability under the null of observing a test statistic at least as extreme as the one observed. A single test is calibrated so that if $H_i$ is true, then $p_i$ is uniformly distributed on $[0,1]$, which gives

\[ \Pr(p_i \le \alpha \mid H_i \text{ true}) = \alpha . \]

Rejecting $H_i$ when $p_i \le \alpha$ therefore caps the chance of a false positive on that one test at the chosen level $\alpha$, commonly $0.05$.

171.2.2 2.2 Why the Problem Arises

The guarantee above is per test. Consider $m$ independent tests where every null is true. The probability that we make no false rejection is $(1-\alpha)^m$, so the probability of at least one false rejection is

\[ \Pr(\text{at least one false positive}) = 1 - (1-\alpha)^m . \]

With $\alpha = 0.05$ and $m = 20$, this equals $1 - 0.95^{20} \approx 0.64$. With $m = 100$ it exceeds $0.99$. In other words, if you run a hundred tests at the $0.05$ level on pure noise, you are almost guaranteed to find something that looks significant. The label “significant” loses its meaning once it is detached from the number of comparisons that produced it.

This is not a contrived scenario. A feature screen across a genome, a grid search over a learning rate and a regularization coefficient, or an ablation study with twenty variants all generate large $m$. The danger is acute when the search is informal, because the analyst may never count how many comparisons were really made.

171.2.3 2.3 The Confusion Matrix of Decisions

It helps to organize outcomes across the $m$ tests. Let $m_0$ be the number of true nulls and $m_1 = m - m_0$ the number of true alternatives. After applying a decision rule, define the counts in the table below.

	Declared non-significant	Declared significant	Total
True null	$U$	$V$	$m_0$
True alternative	$T$	$S$	$m_1$
Total	$m - R$	$R$	$m$

Here $R = V + S$ is the number of rejections, $V$ is the number of false discoveries (type I errors), and $T$ is the number of missed true effects (type II errors). Among these counts only $R$ and $m - R$ are observable, because we see which hypotheses were rejected but not which of those rejections were correct. The quantities $V$, $S$, $T$, $U$ are unobserved random variables, since we do not know which nulls are true. Error rate definitions are statements about the distributions of these counts, and the art of a multiple-testing procedure is to bound an unobservable quantity such as $\mathbb{E}[V/R]$ using only the observable $p$-values.

This vocabulary lets us name the targets precisely. The per-comparison error rate is $\mathbb{E}[V]/m_0$, the average false positive rate of an individual test. The family-wise error rate is $\Pr(V \ge 1)$. The false discovery rate is $\mathbb{E}[V/\max(R,1)]$. These three sit on a spectrum from least to most permissive, and choosing among them is the central decision of this chapter.

171.3 3. The Family-Wise Error Rate

171.3.1 3.1 Definition

The family-wise error rate (FWER) is the probability of making at least one false rejection across the whole family of tests:

\[ \mathrm{FWER} = \Pr(V \ge 1) . \]

A procedure controls the FWER at level $\alpha$ if $\mathrm{FWER} \le \alpha$ regardless of which and how many nulls are true. This is a stringent guarantee: it treats even a single false positive among thousands as a failure. It is the right target when any false claim is costly, for example when each rejection triggers an expensive confirmatory experiment or a published assertion that others will build on.

171.3.2 3.2 The Bonferroni Correction

The simplest FWER procedure is the Bonferroni correction. Reject $H_i$ whenever

\[ p_i \le \frac{\alpha}{m} . \]

The proof of control uses Boole’s inequality, which requires no assumption about dependence among the tests. Let $I_0$ index the true nulls. Then

\[ \mathrm{FWER} = \Pr\!\left( \bigcup_{i \in I_0} \left\{ p_i \le \tfrac{\alpha}{m} \right\} \right) \le \sum_{i \in I_0} \Pr\!\left( p_i \le \tfrac{\alpha}{m} \right) = m_0 \cdot \frac{\alpha}{m} \le \alpha . \]

The inequality $m_0 \le m$ gives the final bound. Because Boole’s inequality never assumes independence, Bonferroni control holds under arbitrary dependence among the $p$-values, which makes it remarkably robust.

The cost is power. Dividing the threshold by $m$ makes each individual test much harder to pass, so genuine effects of modest size are frequently missed. When $m$ is large and the tests are positively correlated, Bonferroni is conservative, meaning the true FWER sits well below $\alpha$ and we sacrifice detections we did not need to.

171.3.3 3.3 The Holm Correction

Holm’s step-down procedure controls the FWER under the same arbitrary-dependence conditions as Bonferroni but is uniformly more powerful, so it should be preferred whenever Bonferroni would be used. Sort the $p$-values in ascending order,

\[ p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}, \]

with $H_{(i)}$ the corresponding hypothesis. Find the smallest index $k$ such that

\[ p_{(k)} > \frac{\alpha}{m - k + 1} . \]

Reject all hypotheses $H_{(1)}, \dots, H_{(k-1)}$ and retain the rest. If no such $k$ exists, reject every hypothesis.

The logic is sequential. The smallest $p$-value faces the full Bonferroni threshold $\alpha/m$. Once it is rejected, only $m-1$ hypotheses remain in contention, so the next is tested at $\alpha/(m-1)$, and so on. The thresholds relax as we proceed, which is why Holm rejects at least as much as Bonferroni and often more. The proof that FWER $\le \alpha$ again rests on Boole’s inequality applied to the true nulls at the step where the first true null is considered.

holm(p[1..m], alpha):
    order = argsort(p) ascending
    for j = 1..m:
        threshold = alpha / (m - j + 1)
        if p[order[j]] > threshold:
            reject order[1..j-1]; stop
    reject all

To see why Holm controls the FWER, suppose at least one true null is rejected and let $H_{(\ell)}$ be the first true null in sorted order, with the true nulls indexed by $I_0$. For Holm to reject it, every preceding step must have passed its threshold, so in particular $p_{(\ell)} \le \alpha / (m - \ell + 1)$. Because the $\ell - 1$ hypotheses before it are all false nulls, the count of true nulls at or after position $\ell$ is exactly $m_0$, and one checks that $m - \ell + 1 \ge m_0$. Hence rejecting a true null requires the smallest true-null $p$-value to fall below $\alpha / m_0$. By Boole’s inequality applied to the $m_0$ true nulls,

\[ \Pr\!\left( \min_{i \in I_0} p_i \le \frac{\alpha}{m_0} \right) \le \sum_{i \in I_0} \Pr\!\left( p_i \le \frac{\alpha}{m_0} \right) = m_0 \cdot \frac{\alpha}{m_0} = \alpha, \]

so $\mathrm{FWER} \le \alpha$ under arbitrary dependence.

171.3.4 3.4 A Worked Example

Suppose an ablation study yields $m = 5$ raw $p$-values, which we sort as

\[ p_{(1)} = 0.005, \quad p_{(2)} = 0.011, \quad p_{(3)} = 0.02, \quad p_{(4)} = 0.04, \quad p_{(5)} = 0.5, \]

and we want to control errors at level $0.05$. The three procedures act differently.

Bonferroni compares every $p$-value to $\alpha/m = 0.05/5 = 0.01$. Only $p_{(1)} = 0.005$ passes, so Bonferroni makes one rejection.

Holm walks down the sorted list with relaxing thresholds $\alpha/(m-j+1)$ equal to $0.01, 0.0125, 0.0167, 0.025, 0.05$. We have $p_{(1)} = 0.005 \le 0.01$ and $p_{(2)} = 0.011 \le 0.0125$, both passing, but $p_{(3)} = 0.02 > 0.0167$, so Holm stops and rejects the first two hypotheses. Holm recovers a discovery that Bonferroni missed, illustrating its uniform improvement at no cost in assumptions.

Benjamini-Hochberg at $q = 0.05$ compares $p_{(k)}$ to $(k/m)q$, namely $0.01, 0.02, 0.03, 0.04, 0.05$. Scanning from the largest, $p_{(5)} = 0.5 > 0.05$ and $p_{(4)} = 0.04 \le 0.04$, so the largest passing index is $k = 4$ and BH rejects the first four hypotheses. The contrast is stark: Bonferroni finds one effect, Holm two, BH four, in exact order of increasing permissiveness.

171.4 4. The False Discovery Rate

171.4.1 4.1 Motivation

Controlling the FWER is often too strict for exploratory analysis. If we screen $20{,}000$ genes or rank $5{,}000$ candidate features, we do not need certainty that every single discovery is real. We need the list of discoveries to be mostly correct, tolerating a small known fraction of false positives in exchange for far greater power. This shift in goal motivates the false discovery rate.

171.4.2 4.2 Definition

Define the false discovery proportion as the fraction of rejections that are false,

\[ \mathrm{FDP} = \frac{V}{\max(R, 1)}, \]

where the maximum in the denominator handles the case $R = 0$ by setting the proportion to zero. The false discovery rate is its expectation,

\[ \mathrm{FDR} = \mathbb{E}\!\left[ \mathrm{FDP} \right] = \mathbb{E}\!\left[ \frac{V}{\max(R, 1)} \right] . \]

A procedure controls the FDR at level $q$ if $\mathrm{FDR} \le q$. Note the relationship to FWER: when all nulls are true, every rejection is false, so $\mathrm{FDP} \in \{0,1\}$ and $\mathrm{FDR} = \Pr(V \ge 1) = \mathrm{FWER}$. In that boundary case the two criteria coincide. When some alternatives are true, FDR control is strictly more permissive, because true discoveries inflate $R$ in the denominator and dilute the proportion of errors.

171.4.3 4.3 The Benjamini-Hochberg Procedure

The Benjamini-Hochberg (BH) procedure controls the FDR at a target level $q$. Sort the $p$-values ascending as before. Find the largest index $k$ such that

\[ p_{(k)} \le \frac{k}{m} \, q . \]

Reject all hypotheses $H_{(1)}, \dots, H_{(k)}$. If no index satisfies the inequality, reject nothing.

Geometrically, plot the sorted $p$-values against their ranks and overlay the line through the origin with slope $q/m$. BH finds the rightmost point at or below the line and rejects everything up to and including it. The procedure is a step-up rule: it starts from the largest $p$-value and moves down until it finds a value that clears the threshold.

benjamini_hochberg(p[1..m], q):
    order = argsort(p) ascending
    k = 0
    for i = 1..m:
        if p[order[i]] <= (i / m) * q:
            k = i
    reject order[1..k]

Benjamini and Hochberg proved that under independence of the test statistics this procedure satisfies

\[ \mathrm{FDR} = \frac{m_0}{m} \, q \le q . \]

The factor $m_0 / m$ shows the procedure is slightly conservative when some alternatives are true, since the realized FDR is below the nominal $q$. If a reliable estimate $\hat m_0$ of the number of true nulls were available, replacing $m$ by $\hat m_0$ in the threshold would recover the lost power; this is the idea behind adaptive and Storey-style procedures.

The guarantee also holds under a condition called positive regression dependence on the subset of true nulls (PRDS), proved by Benjamini and Yekutieli. Informally, PRDS holds when the true-null statistics tend to move together with any test statistic in a monotone way, which covers many practical cases such as one-sided tests with non-negatively correlated Gaussian statistics. Under arbitrary dependence, with no structure assumed at all, the bound degrades by the harmonic factor $H_m = \sum_{i=1}^m 1/i \approx \ln m + 0.577$, giving the Benjamini-Yekutieli variant that uses the threshold $\frac{k}{m H_m} q$. For $m = 1000$ the harmonic factor is roughly $7.5$, a substantial penalty, so one prefers plain BH whenever positive dependence can be argued and reserves Benjamini-Yekutieli for genuinely adversarial dependence.

171.4.4 4.4 An Adjusted p-Value View

Both Holm and BH can be expressed as adjusted $p$-values, which is how most software reports them. An adjusted $p$-value $\tilde p_i$ is constructed so that rejecting whenever $\tilde p_i \le \alpha$ reproduces the procedure. For BH the adjusted values are

\[ \tilde p_{(i)} = \min_{j \ge i} \, \min\!\left( \frac{m}{j} \, p_{(j)}, \; 1 \right), \]

with the outer minimum over $j \ge i$ enforcing the monotonicity that adjusted $p$-values must respect their original ordering. Holm has the analogous step-down form

\[ \tilde p_{(i)} = \max_{j \le i} \, \min\!\left( (m - j + 1)\, p_{(j)}, \; 1 \right), \]

where the running maximum guarantees the adjusted values are non-decreasing in rank. Reporting adjusted $p$-values lets a reader apply any threshold they like without rerunning the procedure. In the free open-source ecosystem these are one call away: statsmodels.stats.multitest.multipletests in Python implements Bonferroni, Holm, BH, and Benjamini-Yekutieli, p.adjust is built into base R, and MultipleTesting.jl covers the same family in Julia. Reaching for these tested implementations is safer than hand-coding the loop, where off-by-one and tie-handling mistakes are common.

171.5 5. Comparing Many Models

171.5.1 5.1 The Problem in Machine Learning

Model selection is a multiple comparisons problem in disguise. Suppose we evaluate $m$ models on a shared validation set and pick the one with the best score. The maximum of $m$ noisy estimates is biased upward, so the winner’s validation score systematically overstates its true performance. This is the optimism of selection, and it grows with $m$. Reporting the selected model’s validation score as if it were an unbiased estimate is a form of the multiple comparisons error, and the remedy is a held-out test set untouched by selection, or nested cross-validation.

When the question is whether one model is genuinely better than another, rather than which is best, we run pairwise significance tests. With $m$ models there are $\binom{m}{2}$ pairwise comparisons, and the FWER across them inflates exactly as in Section 2.2.

171.5.2 5.2 Statistical Tests for Classifier Comparison

A common protocol follows Demsar. To compare several classifiers across multiple datasets, first run the Friedman test, a non-parametric analog of repeated-measures ANOVA on the ranks of the classifiers per dataset. The Friedman statistic tests the global null that all classifiers perform equally. If it rejects, proceed to post-hoc pairwise comparisons.

The post-hoc step is where corrections enter. When comparing all classifiers against one control, the Bonferroni-Dunn test adjusts the critical difference for the number of comparisons. When comparing all pairs, the Nemenyi test uses a studentized range statistic that already accounts for the multiplicity. If instead we have explicit pairwise $p$-values, applying Holm across the $\binom{m}{2}$ comparisons gives FWER control with more power than Bonferroni.

compare_classifiers(scores[datasets, models]):
    run Friedman test on per-dataset ranks
    if not significant: stop, no differences claimed
    compute pairwise p-values
    adjust with Holm (FWER) or BH (FDR)
    report adjusted significant pairs

171.5.3 5.3 Practical Guidance

Choose the error criterion before looking at results, and choose it from the cost structure of mistakes. Use FWER control, via Holm, when each false claim is individually expensive or will be confirmed downstream, such as the small set of models you intend to deploy or report in a paper. Use FDR control, via BH, for exploratory screens where a controlled fraction of false leads is acceptable, such as ranking thousands of features or interpretability hypotheses.

Count every comparison, including the informal ones. The most common failure is not a wrong formula but an undercounted $m$: hyperparameters swept by hand, datasets revisited after a peek, or significance tests added until one passes. A pre-registered analysis plan, a sequestered test set, and an honest accounting of how many comparisons were truly made are worth more than any single correction. The corrections in this chapter are necessary, but they only work on the comparisons you admit to having run.

171.5.4 5.4 Pitfalls and Common Mistakes

A few errors recur often enough to name explicitly.

Undercounting the family. Every $p$-value you computed belongs in $m$, not only the ones you chose to report. Selecting the best result and then correcting for only that result reintroduces exactly the bias the correction was meant to remove.
Applying BH and then reading the surviving $p$-values as if they were unadjusted. The whole point of correction is that the threshold, or equivalently the adjusted $p$-value, is what carries the guarantee, not the raw value.
Treating FDR control as if it controlled the FWER. BH bounds the expected fraction of false discoveries, not the probability of any false discovery. On a list of two hundred discoveries at $q = 0.1$, expecting around twenty false ones is the design, not a defect.
Using plain BH under negative or unknown dependence and assuming the bound still holds. When dependence cannot be argued to be positive, switch to Benjamini-Yekutieli or report the dependence assumption openly.
Confusing selection optimism with a testing error. Reporting the best of many validation scores overstates performance even if no formal test was run. The fix is a held-out test set or nested cross-validation, not a $p$-value adjustment.
Mixing one-sided and two-sided $p$-values, or $p$-values from differently calibrated tests, in a single family. The procedures assume each $p_i$ is a valid (super-uniform under the null) $p$-value on the same scale.

171.6 6. Summary

Running many tests at a fixed per-test level guarantees false positives. The family-wise error rate caps the chance of any false positive and is controlled by Bonferroni or, preferably, by the uniformly more powerful Holm procedure, both valid under arbitrary dependence. The false discovery rate caps the expected fraction of false positives among discoveries and is controlled by Benjamini-Hochberg under independence or positive dependence, trading a small known error fraction for substantial power. Model comparison is a multiple testing problem too, vulnerable to selection optimism and to inflated error across pairwise tests, and it should be handled with held-out evaluation and appropriate post-hoc corrections. The single most important practice is to count every comparison honestly and to fix the error criterion before seeing the data.

171.7 References

Bonferroni, C. E. “Teoria statistica delle classi e calcolo delle probabilita.” Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 1936. https://en.wikipedia.org/wiki/Bonferroni_correction
Holm, S. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics, 6(2), 65-70, 1979. https://www.jstor.org/stable/4615733
Benjamini, Y., and Hochberg, Y. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B, 57(1), 289-300, 1995. https://www.jstor.org/stable/2346101
Benjamini, Y., and Yekutieli, D. “The Control of the False Discovery Rate in Multiple Testing under Dependency.” Annals of Statistics, 29(4), 1165-1188, 2001. https://www.jstor.org/stable/2674075
Demsar, J. “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research, 7, 1-30, 2006. https://www.jmlr.org/papers/v7/demsar06a.html
Dudoit, S., and van der Laan, M. J. “Multiple Testing Procedures with Applications to Genomics.” Springer, 2008. https://link.springer.com/book/10.1007/978-0-387-49317-6
Wasserman, L. “All of Statistics: A Concise Course in Statistical Inference.” Springer, 2004. https://link.springer.com/book/10.1007/978-0-387-21736-9

# Multiple Comparisons and Corrections ## 1. Introduction Modern machine learning practice is saturated with statistical tests. We tune dozens of hyperparameters, compare many models on the same benchmark, screen thousands of features for association with a label, and probe deep networks with hundreds of interpretability hypotheses. Each of these activities involves not one statistical decision but many. When many decisions are made together, the probability that at least one of them is wrong by chance grows quickly, and naive significance thresholds become misleading. This is the multiple comparisons problem. This chapter develops the problem rigorously and presents the standard remedies. We define the family-wise error rate and the false discovery rate, derive the Bonferroni and Holm procedures for controlling the former, present the Benjamini-Hochberg procedure for controlling the latter, and discuss how these ideas apply when comparing many models. The goal is to give a working data scientist both the formal guarantees and the practical judgment needed to report honest results. The conceptual map below shows how the pieces fit together. A family of tests produces a family of $p$-values, the analyst chooses an error criterion from the cost of mistakes, and the criterion selects a procedure. ```{mermaid} flowchart TD A["Family of m hypothesis tests"] --> B["m p-values"] B --> C{"Which error criterion"} C -->|"any false positive is costly"| D["Control FWER"] C -->|"a known false fraction is tolerable"| E["Control FDR"] D --> F["Holm step-down (preferred) or Bonferroni"] E --> G["Benjamini-Hochberg (independence or PRDS)"] E --> H["Benjamini-Yekutieli (arbitrary dependence)"] F --> I["Report adjusted p-values"] G --> I H --> I ``` ## 2. The Multiple Testing Problem ### 2.1 Setup and Notation Suppose we test $m$ null hypotheses $H_1, \dots, H_m$ simultaneously. For each hypothesis $H_i$ we compute a $p$-value $p_i$, the probability under the null of observing a test statistic at least as extreme as the one observed. A single test is calibrated so that if $H_i$ is true, then $p_i$ is uniformly distributed on $[0,1]$, which gives $$ \Pr(p_i \le \alpha \mid H_i \text{ true}) = \alpha . $$ Rejecting $H_i$ when $p_i \le \alpha$ therefore caps the chance of a false positive on that one test at the chosen level $\alpha$, commonly $0.05$. ### 2.2 Why the Problem Arises The guarantee above is per test. Consider $m$ independent tests where every null is true. The probability that we make no false rejection is $(1-\alpha)^m$, so the probability of at least one false rejection is $$ \Pr(\text{at least one false positive}) = 1 - (1-\alpha)^m . $$ With $\alpha = 0.05$ and $m = 20$, this equals $1 - 0.95^{20} \approx 0.64$. With $m = 100$ it exceeds $0.99$. In other words, if you run a hundred tests at the $0.05$ level on pure noise, you are almost guaranteed to find something that looks significant. The label "significant" loses its meaning once it is detached from the number of comparisons that produced it. This is not a contrived scenario. A feature screen across a genome, a grid search over a learning rate and a regularization coefficient, or an ablation study with twenty variants all generate large $m$. The danger is acute when the search is informal, because the analyst may never count how many comparisons were really made. ### 2.3 The Confusion Matrix of Decisions It helps to organize outcomes across the $m$ tests. Let $m_0$ be the number of true nulls and $m_1 = m - m_0$ the number of true alternatives. After applying a decision rule, define the counts in the table below. | | Declared non-significant | Declared significant | Total | |---------------------|--------------------------|----------------------|-------| | True null | $U$ | $V$ | $m_0$ | | True alternative | $T$ | $S$ | $m_1$ | | Total | $m - R$ | $R$ | $m$ | Here $R = V + S$ is the number of rejections, $V$ is the number of false discoveries (type I errors), and $T$ is the number of missed true effects (type II errors). Among these counts only $R$ and $m - R$ are observable, because we see which hypotheses were rejected but not which of those rejections were correct. The quantities $V$, $S$, $T$, $U$ are unobserved random variables, since we do not know which nulls are true. Error rate definitions are statements about the distributions of these counts, and the art of a multiple-testing procedure is to bound an unobservable quantity such as $\mathbb{E}[V/R]$ using only the observable $p$-values. This vocabulary lets us name the targets precisely. The per-comparison error rate is $\mathbb{E}[V]/m_0$, the average false positive rate of an individual test. The family-wise error rate is $\Pr(V \ge 1)$. The false discovery rate is $\mathbb{E}[V/\max(R,1)]$. These three sit on a spectrum from least to most permissive, and choosing among them is the central decision of this chapter. ## 3. The Family-Wise Error Rate ### 3.1 Definition The family-wise error rate (FWER) is the probability of making at least one false rejection across the whole family of tests: $$ \mathrm{FWER} = \Pr(V \ge 1) . $$ A procedure controls the FWER at level $\alpha$ if $\mathrm{FWER} \le \alpha$ regardless of which and how many nulls are true. This is a stringent guarantee: it treats even a single false positive among thousands as a failure. It is the right target when any false claim is costly, for example when each rejection triggers an expensive confirmatory experiment or a published assertion that others will build on. ### 3.2 The Bonferroni Correction The simplest FWER procedure is the Bonferroni correction. Reject $H_i$ whenever $$ p_i \le \frac{\alpha}{m} . $$ The proof of control uses Boole's inequality, which requires no assumption about dependence among the tests. Let $I_0$ index the true nulls. Then $$ \mathrm{FWER} = \Pr\!\left( \bigcup_{i \in I_0} \left\{ p_i \le \tfrac{\alpha}{m} \right\} \right) \le \sum_{i \in I_0} \Pr\!\left( p_i \le \tfrac{\alpha}{m} \right) = m_0 \cdot \frac{\alpha}{m} \le \alpha . $$ The inequality $m_0 \le m$ gives the final bound. Because Boole's inequality never assumes independence, Bonferroni control holds under arbitrary dependence among the $p$-values, which makes it remarkably robust. The cost is power. Dividing the threshold by $m$ makes each individual test much harder to pass, so genuine effects of modest size are frequently missed. When $m$ is large and the tests are positively correlated, Bonferroni is conservative, meaning the true FWER sits well below $\alpha$ and we sacrifice detections we did not need to. ### 3.3 The Holm Correction Holm's step-down procedure controls the FWER under the same arbitrary-dependence conditions as Bonferroni but is uniformly more powerful, so it should be preferred whenever Bonferroni would be used. Sort the $p$-values in ascending order, $$ p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}, $$ with $H_{(i)}$ the corresponding hypothesis. Find the smallest index $k$ such that $$ p_{(k)} > \frac{\alpha}{m - k + 1} . $$ Reject all hypotheses $H_{(1)}, \dots, H_{(k-1)}$ and retain the rest. If no such $k$ exists, reject every hypothesis. The logic is sequential. The smallest $p$-value faces the full Bonferroni threshold $\alpha/m$. Once it is rejected, only $m-1$ hypotheses remain in contention, so the next is tested at $\alpha/(m-1)$, and so on. The thresholds relax as we proceed, which is why Holm rejects at least as much as Bonferroni and often more. The proof that FWER $\le \alpha$ again rests on Boole's inequality applied to the true nulls at the step where the first true null is considered. ```text holm(p[1..m], alpha): order = argsort(p) ascending for j = 1..m: threshold = alpha / (m - j + 1) if p[order[j]] > threshold: reject order[1..j-1]; stop reject all ``` To see why Holm controls the FWER, suppose at least one true null is rejected and let $H_{(\ell)}$ be the first true null in sorted order, with the true nulls indexed by $I_0$. For Holm to reject it, every preceding step must have passed its threshold, so in particular $p_{(\ell)} \le \alpha / (m - \ell + 1)$. Because the $\ell - 1$ hypotheses before it are all false nulls, the count of true nulls at or after position $\ell$ is exactly $m_0$, and one checks that $m - \ell + 1 \ge m_0$. Hence rejecting a true null requires the smallest true-null $p$-value to fall below $\alpha / m_0$. By Boole's inequality applied to the $m_0$ true nulls, $$ \Pr\!\left( \min_{i \in I_0} p_i \le \frac{\alpha}{m_0} \right) \le \sum_{i \in I_0} \Pr\!\left( p_i \le \frac{\alpha}{m_0} \right) = m_0 \cdot \frac{\alpha}{m_0} = \alpha, $$ so $\mathrm{FWER} \le \alpha$ under arbitrary dependence. ### 3.4 A Worked Example Suppose an ablation study yields $m = 5$ raw $p$-values, which we sort as $$ p_{(1)} = 0.005, \quad p_{(2)} = 0.011, \quad p_{(3)} = 0.02, \quad p_{(4)} = 0.04, \quad p_{(5)} = 0.5, $$ and we want to control errors at level $0.05$. The three procedures act differently. Bonferroni compares every $p$-value to $\alpha/m = 0.05/5 = 0.01$. Only $p_{(1)} = 0.005$ passes, so Bonferroni makes one rejection. Holm walks down the sorted list with relaxing thresholds $\alpha/(m-j+1)$ equal to $0.01, 0.0125, 0.0167, 0.025, 0.05$. We have $p_{(1)} = 0.005 \le 0.01$ and $p_{(2)} = 0.011 \le 0.0125$, both passing, but $p_{(3)} = 0.02 > 0.0167$, so Holm stops and rejects the first two hypotheses. Holm recovers a discovery that Bonferroni missed, illustrating its uniform improvement at no cost in assumptions. Benjamini-Hochberg at $q = 0.05$ compares $p_{(k)}$ to $(k/m)q$, namely $0.01, 0.02, 0.03, 0.04, 0.05$. Scanning from the largest, $p_{(5)} = 0.5 > 0.05$ and $p_{(4)} = 0.04 \le 0.04$, so the largest passing index is $k = 4$ and BH rejects the first four hypotheses. The contrast is stark: Bonferroni finds one effect, Holm two, BH four, in exact order of increasing permissiveness. ## 4. The False Discovery Rate ### 4.1 Motivation Controlling the FWER is often too strict for exploratory analysis. If we screen $20{,}000$ genes or rank $5{,}000$ candidate features, we do not need certainty that every single discovery is real. We need the list of discoveries to be mostly correct, tolerating a small known fraction of false positives in exchange for far greater power. This shift in goal motivates the false discovery rate. ### 4.2 Definition Define the false discovery proportion as the fraction of rejections that are false, $$ \mathrm{FDP} = \frac{V}{\max(R, 1)}, $$ where the maximum in the denominator handles the case $R = 0$ by setting the proportion to zero. The false discovery rate is its expectation, $$ \mathrm{FDR} = \mathbb{E}\!\left[ \mathrm{FDP} \right] = \mathbb{E}\!\left[ \frac{V}{\max(R, 1)} \right] . $$ A procedure controls the FDR at level $q$ if $\mathrm{FDR} \le q$. Note the relationship to FWER: when all nulls are true, every rejection is false, so $\mathrm{FDP} \in \{0,1\}$ and $\mathrm{FDR} = \Pr(V \ge 1) = \mathrm{FWER}$. In that boundary case the two criteria coincide. When some alternatives are true, FDR control is strictly more permissive, because true discoveries inflate $R$ in the denominator and dilute the proportion of errors. ### 4.3 The Benjamini-Hochberg Procedure The Benjamini-Hochberg (BH) procedure controls the FDR at a target level $q$. Sort the $p$-values ascending as before. Find the largest index $k$ such that $$ p_{(k)} \le \frac{k}{m} \, q . $$ Reject all hypotheses $H_{(1)}, \dots, H_{(k)}$. If no index satisfies the inequality, reject nothing. Geometrically, plot the sorted $p$-values against their ranks and overlay the line through the origin with slope $q/m$. BH finds the rightmost point at or below the line and rejects everything up to and including it. The procedure is a step-up rule: it starts from the largest $p$-value and moves down until it finds a value that clears the threshold. ```text benjamini_hochberg(p[1..m], q): order = argsort(p) ascending k = 0 for i = 1..m: if p[order[i]] <= (i / m) * q: k = i reject order[1..k] ``` Benjamini and Hochberg proved that under independence of the test statistics this procedure satisfies $$ \mathrm{FDR} = \frac{m_0}{m} \, q \le q . $$ The factor $m_0 / m$ shows the procedure is slightly conservative when some alternatives are true, since the realized FDR is below the nominal $q$. If a reliable estimate $\hat m_0$ of the number of true nulls were available, replacing $m$ by $\hat m_0$ in the threshold would recover the lost power; this is the idea behind adaptive and Storey-style procedures. The guarantee also holds under a condition called positive regression dependence on the subset of true nulls (PRDS), proved by Benjamini and Yekutieli. Informally, PRDS holds when the true-null statistics tend to move together with any test statistic in a monotone way, which covers many practical cases such as one-sided tests with non-negatively correlated Gaussian statistics. Under arbitrary dependence, with no structure assumed at all, the bound degrades by the harmonic factor $H_m = \sum_{i=1}^m 1/i \approx \ln m + 0.577$, giving the Benjamini-Yekutieli variant that uses the threshold $\frac{k}{m H_m} q$. For $m = 1000$ the harmonic factor is roughly $7.5$, a substantial penalty, so one prefers plain BH whenever positive dependence can be argued and reserves Benjamini-Yekutieli for genuinely adversarial dependence. ### 4.4 An Adjusted p-Value View Both Holm and BH can be expressed as adjusted $p$-values, which is how most software reports them. An adjusted $p$-value $\tilde p_i$ is constructed so that rejecting whenever $\tilde p_i \le \alpha$ reproduces the procedure. For BH the adjusted values are $$ \tilde p_{(i)} = \min_{j \ge i} \, \min\!\left( \frac{m}{j} \, p_{(j)}, \; 1 \right), $$ with the outer minimum over $j \ge i$ enforcing the monotonicity that adjusted $p$-values must respect their original ordering. Holm has the analogous step-down form $$ \tilde p_{(i)} = \max_{j \le i} \, \min\!\left( (m - j + 1)\, p_{(j)}, \; 1 \right), $$ where the running maximum guarantees the adjusted values are non-decreasing in rank. Reporting adjusted $p$-values lets a reader apply any threshold they like without rerunning the procedure. In the free open-source ecosystem these are one call away: `statsmodels.stats.multitest.multipletests` in Python implements Bonferroni, Holm, BH, and Benjamini-Yekutieli, `p.adjust` is built into base R, and `MultipleTesting.jl` covers the same family in Julia. Reaching for these tested implementations is safer than hand-coding the loop, where off-by-one and tie-handling mistakes are common. ## 5. Comparing Many Models ### 5.1 The Problem in Machine Learning Model selection is a multiple comparisons problem in disguise. Suppose we evaluate $m$ models on a shared validation set and pick the one with the best score. The maximum of $m$ noisy estimates is biased upward, so the winner's validation score systematically overstates its true performance. This is the optimism of selection, and it grows with $m$. Reporting the selected model's validation score as if it were an unbiased estimate is a form of the multiple comparisons error, and the remedy is a held-out test set untouched by selection, or nested cross-validation. When the question is whether one model is genuinely better than another, rather than which is best, we run pairwise significance tests. With $m$ models there are $\binom{m}{2}$ pairwise comparisons, and the FWER across them inflates exactly as in Section 2.2. ### 5.2 Statistical Tests for Classifier Comparison A common protocol follows Demsar. To compare several classifiers across multiple datasets, first run the Friedman test, a non-parametric analog of repeated-measures ANOVA on the ranks of the classifiers per dataset. The Friedman statistic tests the global null that all classifiers perform equally. If it rejects, proceed to post-hoc pairwise comparisons. The post-hoc step is where corrections enter. When comparing all classifiers against one control, the Bonferroni-Dunn test adjusts the critical difference for the number of comparisons. When comparing all pairs, the Nemenyi test uses a studentized range statistic that already accounts for the multiplicity. If instead we have explicit pairwise $p$-values, applying Holm across the $\binom{m}{2}$ comparisons gives FWER control with more power than Bonferroni. ```text compare_classifiers(scores[datasets, models]): run Friedman test on per-dataset ranks if not significant: stop, no differences claimed compute pairwise p-values adjust with Holm (FWER) or BH (FDR) report adjusted significant pairs ``` ### 5.3 Practical Guidance Choose the error criterion before looking at results, and choose it from the cost structure of mistakes. Use FWER control, via Holm, when each false claim is individually expensive or will be confirmed downstream, such as the small set of models you intend to deploy or report in a paper. Use FDR control, via BH, for exploratory screens where a controlled fraction of false leads is acceptable, such as ranking thousands of features or interpretability hypotheses. Count every comparison, including the informal ones. The most common failure is not a wrong formula but an undercounted $m$: hyperparameters swept by hand, datasets revisited after a peek, or significance tests added until one passes. A pre-registered analysis plan, a sequestered test set, and an honest accounting of how many comparisons were truly made are worth more than any single correction. The corrections in this chapter are necessary, but they only work on the comparisons you admit to having run. ### 5.4 Pitfalls and Common Mistakes A few errors recur often enough to name explicitly. - Undercounting the family. Every $p$-value you computed belongs in $m$, not only the ones you chose to report. Selecting the best result and then correcting for only that result reintroduces exactly the bias the correction was meant to remove. - Applying BH and then reading the surviving $p$-values as if they were unadjusted. The whole point of correction is that the threshold, or equivalently the adjusted $p$-value, is what carries the guarantee, not the raw value. - Treating FDR control as if it controlled the FWER. BH bounds the expected fraction of false discoveries, not the probability of any false discovery. On a list of two hundred discoveries at $q = 0.1$, expecting around twenty false ones is the design, not a defect. - Using plain BH under negative or unknown dependence and assuming the bound still holds. When dependence cannot be argued to be positive, switch to Benjamini-Yekutieli or report the dependence assumption openly. - Confusing selection optimism with a testing error. Reporting the best of many validation scores overstates performance even if no formal test was run. The fix is a held-out test set or nested cross-validation, not a $p$-value adjustment. - Mixing one-sided and two-sided $p$-values, or $p$-values from differently calibrated tests, in a single family. The procedures assume each $p_i$ is a valid (super-uniform under the null) $p$-value on the same scale. ## 6. Summary Running many tests at a fixed per-test level guarantees false positives. The family-wise error rate caps the chance of any false positive and is controlled by Bonferroni or, preferably, by the uniformly more powerful Holm procedure, both valid under arbitrary dependence. The false discovery rate caps the expected fraction of false positives among discoveries and is controlled by Benjamini-Hochberg under independence or positive dependence, trading a small known error fraction for substantial power. Model comparison is a multiple testing problem too, vulnerable to selection optimism and to inflated error across pairwise tests, and it should be handled with held-out evaluation and appropriate post-hoc corrections. The single most important practice is to count every comparison honestly and to fix the error criterion before seeing the data. ## References 1. Bonferroni, C. E. "Teoria statistica delle classi e calcolo delle probabilita." Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 1936. https://en.wikipedia.org/wiki/Bonferroni_correction 2. Holm, S. "A Simple Sequentially Rejective Multiple Test Procedure." Scandinavian Journal of Statistics, 6(2), 65-70, 1979. https://www.jstor.org/stable/4615733 3. Benjamini, Y., and Hochberg, Y. "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society, Series B, 57(1), 289-300, 1995. https://www.jstor.org/stable/2346101 4. Benjamini, Y., and Yekutieli, D. "The Control of the False Discovery Rate in Multiple Testing under Dependency." Annals of Statistics, 29(4), 1165-1188, 2001. https://www.jstor.org/stable/2674075 5. Demsar, J. "Statistical Comparisons of Classifiers over Multiple Data Sets." Journal of Machine Learning Research, 7, 1-30, 2006. https://www.jmlr.org/papers/v7/demsar06a.html 6. Dudoit, S., and van der Laan, M. J. "Multiple Testing Procedures with Applications to Genomics." Springer, 2008. https://link.springer.com/book/10.1007/978-0-387-49317-6 7. Wasserman, L. "All of Statistics: A Concise Course in Statistical Inference." Springer, 2004. https://link.springer.com/book/10.1007/978-0-387-21736-9

	Declared non-significant	Declared significant	Total
True null	\(U\)	\(V\)	\(m_0\)
True alternative	\(T\)	\(S\)	\(m_1\)
Total	\(m - R\)	\(R\)	\(m\)