172 Effect Sizes in Machine Learning

172.1 1. Introduction

Modern machine learning research lives and dies by comparison. A new architecture beats a baseline, a tuned optimizer edges out the default, a fairness intervention reduces disparity. The standard ritual for backing such claims is the significance test, which produces a $p$ value and an accompanying verdict of “significant” or “not significant.” Yet a $p$ value answers a narrow question. It tells us how surprising the observed data would be if there were truly no difference between systems. It does not tell us how large the difference is, whether that difference matters in deployment, or whether a practitioner should adopt the new method. Those questions belong to the domain of effect sizes.

An effect size is a quantitative measure of the magnitude of a phenomenon, expressed on a scale that is interpretable and, ideally, comparable across studies. In machine learning the phenomenon is usually the performance gap between two models, the strength of association between a design choice and an outcome, or the size of a treatment effect in an A/B test. This chapter argues that effect sizes deserve a permanent place beside $p$ values in any empirical ML report. We develop the conceptual distinction between statistical and practical significance, define the standardized effect size measures most relevant to ML with their estimators and properties, work through a numerical example, and offer concrete reporting guidance.

A useful organizing principle, due to Cohen and refined by many since, is that every claim of an effect has two ingredients that significance testing fatally entangles: the signal (how large the effect is) and the resolution (how precisely we measured it). The $p$ value is a function of both. An effect size isolates the signal, and a confidence interval restores the resolution as a separate, visible quantity. Keeping the two apart is the whole game.

172.2 2. Statistical Significance Is Not Practical Significance

172.2.1 2.1 What a p value actually controls

Consider comparing two classifiers $A$ and $B$ over $n$ paired test instances. Let $D_i$ be the per instance accuracy difference, with population mean $\mu_D$ and standard deviation $\sigma_D$. The paired $t$ statistic is

\[ t = \frac{\bar{D}}{s_D / \sqrt{n}}, \]

where $\bar{D}$ is the sample mean difference and $s_D$ the sample standard deviation. The $p$ value is the probability, under the null hypothesis $\mu_D = 0$, of observing a statistic at least as extreme as $t$. Crucially, the denominator shrinks as $n$ grows. For any fixed nonzero $\mu_D$, no matter how trivially small, $|t|$ grows without bound as $n \to \infty$, and the $p$ value tends to zero.

It is worth stating plainly what the $p$ value does and does not control. Under the null it is a probability statement about hypothetical data, not about the hypothesis. It is not the probability that the null is true, not the probability that the result is a fluke, and not one minus the probability that the finding replicates. The ASA statement on $p$ values (Wasserstein and Lazar 2016) devotes itself almost entirely to dispelling these misreadings, and the entanglement of magnitude with sample size is the most consequential of them for ML.

172.2.2 2.2 The large sample trap

This dependence on sample size is the heart of the problem. With the enormous evaluation sets common in ML, where $n$ can reach millions of tokens or images, almost any nonzero difference becomes statistically significant. A model that is better by $0.01$ accuracy points and a model that is better by $5$ accuracy points can both yield $p < 10^{-6}$. Significance certifies that the effect is detectable; it says nothing about whether the effect is large enough to justify the engineering cost, latency increase, or carbon footprint of switching systems.
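A minimal sketch of the trap, assuming a paired design and the large sample normal approximation to the $t$ distribution. The function name `approx_p_value` is illustrative and not taken from any library. Holding the standardized difference fixed and growing $n$ drives the two sided $p$ value toward zero.

```python
from math import erfc, sqrt


def approx_p_value(std_diff: float, n: int) -> float:
    """Two sided p value for a standardized mean difference observed over n pairs.

    Assumption: n is large enough that the t statistic is approximately
    standard normal under the null.
    """
    z = abs(std_diff) * sqrt(n)      # the test statistic grows like sqrt(n)
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)); erfc stays accurate in the far tail
    return erfc(z / sqrt(2.0))
```

Calling `approx_p_value(0.01, n)` for `n` equal to one hundred, ten thousand, and one million shows the same tiny standardized difference moving from clearly nonsignificant to overwhelmingly significant without the effect itself changing.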

The converse failure is equally real. A genuinely large improvement evaluated on a small or noisy benchmark may fail to reach significance, producing a false sense that “nothing is happening.” Statistical significance conflates the size of an effect with the precision of its estimate. Effect sizes separate the two.

flowchart TD
    Q["Two systems compared"] --> P{"p value small?"}
    P -->|yes| ES1{"Effect size meaningful?"}
    P -->|no| ES2{"Interval rules out a meaningful effect?"}
    ES1 -->|yes| A["Adopt: real and worthwhile"]
    ES1 -->|no| B["Detectable but trivial: do not ship"]
    ES2 -->|yes| C["Genuine equivalence: stop chasing"]
    ES2 -->|no| D["Underpowered: collect more data"]

The diagram captures the four quadrants that a $p$ value alone collapses into two. The two off diagonal cells, “detectable but trivial” and “underpowered,” where the significance verdict and the practical verdict disagree, are exactly the failures that effect size reporting exposes.

172.3 3. Cohen’s d and Standardized Mean Differences

172.3.1 3.1 Definition

The most widely used standardized effect size for a difference in means is Cohen’s $d$. For two independent groups with means $\bar{x}_1, \bar{x}_2$ and a pooled standard deviation $s_p$,

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}. \]

The quantity $d$ expresses the gap between groups in units of standard deviation. A value of $d = 0.5$ means the group means differ by half a standard deviation. Because $d$ is dimensionless, it is comparable across experiments that use different metrics or scales, which is exactly what raw mean differences and $p$ values are not.

For paired designs, such as evaluating two models on the same test items, the natural analogue is the standardized mean of the differences,

\[ d_z = \frac{\bar{D}}{s_D}, \]

which relates to the paired $t$ statistic by $t = d_z \sqrt{n}$. This identity makes the large sample trap explicit. The $t$ statistic blends a scale free effect size $d_z$ with a sample size factor $\sqrt{n}$, so a tiny $d_z$ can still produce an overwhelming $t$. Note that $d_z$ uses the standard deviation of the differences $s_D$, which absorbs the correlation between paired scores; it is therefore not directly comparable to the independent groups $d$ unless that correlation is accounted for, a subtlety that trips up meta analyses that mix paired and unpaired designs.
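The identity $t = d_z \sqrt{n}$ can be checked numerically. The sketch below uses only the Python standard library; the function name `paired_effect` and any scores passed to it are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_effect(scores_a, scores_b):
    """Return (d_z, t) for two models scored on the same items or seeds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_bar = mean(diffs)
    s_d = stdev(diffs)               # sample SD of the differences (n - 1 denominator)
    d_z = d_bar / s_d                # scale free effect size
    t = d_bar / (s_d / sqrt(n))      # paired t statistic
    return d_z, t
```

When both conditions have the same standard deviation $s$ and the paired scores have correlation $r$, $s_D = s\sqrt{2(1 - r)}$, so $d_z = d / \sqrt{2(1 - r)}$. This is the conversion needed before a paired $d_z$ is set beside an independent groups $d$.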

172.3.2 3.2 Interpreting magnitude

Cohen proposed rough conventions of $0.2$ (small), $0.5$ (medium), and $0.8$ (large), but these were offered reluctantly and are not laws of nature. In ML the meaningful threshold is domain dependent. A $d$ of $0.2$ in a click through rate experiment serving billions of impressions may translate to substantial revenue, while a $d$ of $0.8$ on a toy benchmark may be irrelevant to production. Effect sizes should always be interpreted against a context specific notion of what difference matters, sometimes formalized as a smallest effect size of interest (SESOI). The SESOI is set from the cost structure of the decision, not from a textbook table, and fixing it before data collection is what turns an effect size from a description into a decision criterion.

A second interpretive aid is the translation of $d$ into an overlap or a probability. Under approximately normal, equal variance assumptions, $d$ maps directly to the probability of superiority through $A = \Phi(d / \sqrt{2})$, where $\Phi$ is the standard normal cumulative distribution function. Thus $d = 0.5$ corresponds to roughly a $64\%$ chance that a random draw from the better group exceeds a random draw from the worse one, a statement many readers find more intuitive than “half a standard deviation.”
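As a small illustration of the mapping $A = \Phi(d / \sqrt{2})$, the following sketch uses the standard library's `NormalDist`. The function name is an assumption made for this example, and the result is only meaningful under the normal, equal variance assumptions stated above.

```python
from math import sqrt
from statistics import NormalDist


def superiority_from_d(d: float) -> float:
    """Probability of superiority implied by d for normal, equal variance scores."""
    return NormalDist().cdf(d / sqrt(2.0))
```

For $d = 0.5$ this returns approximately $0.64$, matching the figure quoted in the text, and for $d = 0$ it returns exactly $0.5$.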

172.3.3 3.3 Small sample bias

Cohen’s $d$ is positively biased for small samples, because the sample standard deviation in the denominator underestimates the population value in expectation. Hedges proposed a correction factor that yields an approximately unbiased estimator, often called Hedges’ $g$ (Hedges 1981),

\[ g = d \cdot \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right). \]

The correction is negligible for large $n$ but matters when comparing models across a handful of random seeds, a regime that is common and under reported in deep learning. With five seeds per model the correction factor is $1 - 3/31 \approx 0.90$, so the uncorrected $d$ overstates the effect by roughly $10\%$, enough to nudge a borderline “medium” effect over a reporting threshold. When the seed count is small, report $g$ rather than $d$.

The estimator below makes the pooled standard deviation explicit. It is an illustrative sketch that uses NumPy for the array helpers; a maintained open source implementation is `pingouin.compute_effsize`.

```python
# Cohen's d for two independent groups (illustrative)
import numpy as np


def cohens_d(x1, x2):
    n1, n2 = len(x1), len(x2)
    # Sample variances; ddof=1 gives the n - 1 denominator
    s1, s2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    # Pooled SD weights each variance by its degrees of freedom
    sp = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / sp
```
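Hedges' correction is a one line change on top of Cohen's $d$. The sketch below is self contained and uses only the standard library; `hedges_g` is an illustrative name, and the correction factor is the approximation given in the formula above rather than the exact gamma function form.

```python
from math import sqrt
from statistics import mean, variance


def hedges_g(x1, x2):
    """Cohen's d multiplied by the approximate small sample correction factor."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    d = (mean(x1) - mean(x2)) / sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return d * correction
```

With five scores per group the factor is $1 - 3/31$, so $g$ is about $10\%$ smaller than $d$; with hundreds of scores per group the two are practically identical.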

172.4 4. A Family of Effect Sizes for ML

172.4.1 4.1 Beyond mean differences

Cohen’s $d$ assumes roughly normal, equal variance data, which performance metrics often violate. Per item accuracy is Bernoulli, calibration error is bounded and skewed, and ranking metrics are discrete. Several alternatives are useful and more robust.

The probability of superiority, also called the common language effect size or $A$ statistic, is the chance that a randomly drawn score from model $A$ exceeds one from model $B$,

\[ A = P(X_A > X_B) + \tfrac{1}{2} P(X_A = X_B). \]

It is closely tied to the Mann Whitney $U$ statistic through $A = U / (n_A n_B)$ and is robust to non normality (McGraw and Wong 1992). A value of $0.5$ denotes no difference, and values approaching $1$ denote near total dominance of one model. Because $A$ is a probability it needs no scale assumption and survives monotone transformations of the metric, which makes it a safe default when the distribution of scores is unknown or clearly non normal.
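A direct, assumption free estimate of $A$ counts wins and ties over every cross pair of scores. The sketch below is a plain Python illustration with a hypothetical function name; its numerator is the Mann Whitney $U$ count for model $A$, with ties scored as one half.

```python
def probability_of_superiority(scores_a, scores_b):
    """Estimate A = P(X_A > X_B) + 0.5 * P(X_A = X_B) over all cross pairs."""
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5          # a tie counts as half a win
    return wins / (len(scores_a) * len(scores_b))
```

The double loop costs $n_A n_B$ comparisons, which is negligible for seed level comparisons; for per item scores on large test sets a rank based computation would be the more practical route.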

For association between two categorical variables, such as a design choice and a success or failure outcome, the odds ratio and Cramer’s $V$ quantify effect magnitude. Cramer’s $V$ rescales the chi squared statistic to the interval $[0,1]$,

\[ V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}, \]

for an $r \times c$ contingency table, giving a sample size free measure of association. For correlation between continuous quantities, Pearson’s $r$ or the coefficient of determination $r^2$ serve directly as effect sizes, since they are already scale free; $r^2$ has the clean reading of the fraction of variance explained.
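The formula for Cramer's $V$ can be sketched without any statistics library. The version below computes the chi squared statistic by hand, applies no continuity correction, and assumes every row and column total is positive; the function name and the list of rows input format are choices made for this example.

```python
from math import sqrt


def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals) - 1, len(col_totals) - 1)
    return sqrt(chi2 / (n * k))
```

A table in which the design choice perfectly predicts the outcome gives $V = 1$, and a table with identical rows gives $V = 0$, regardless of how many observations the table contains.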

The table summarizes the family and when each is appropriate.

| Setting | Effect size | Range | Robust to non normality |
|---|---|---|---|
| Difference in means, two groups | Cohen’s $d$, Hedges’ $g$ | unbounded | no |
| Paired model comparison | $d_z$ | unbounded | no |
| Stochastic dominance of scores | Probability of superiority $A$ | $[0,1]$ | yes |
| Categorical association | Cramer’s $V$, odds ratio | $V \in [0,1]$ | yes |
| Continuous association | $r$, $r^2$ | $r \in [-1,1]$ | partial |

172.4.2 4.2 Confidence intervals as effect size reporting

An effect size point estimate is incomplete without an interval that conveys its uncertainty. A $95\%$ confidence interval on $d$ or on the raw performance gap communicates both magnitude and precision in one object. When the interval for a difference excludes zero, it conveys the same information as $p < 0.05$, but it additionally shows the plausible range of the effect. Reporting the interval $[0.3, 0.9]$ for a standardized difference is far more informative than reporting “$p = 0.01$.” For ML metrics with awkward sampling distributions, bootstrap confidence intervals on the effect size are a practical default, and the bias corrected and accelerated (BCa) bootstrap is the recommended variant when the statistic is skewed.

Equivalence testing turns the interval into an explicit decision rule. To claim two systems are practically the same, one shows that the confidence interval for their difference lies entirely inside the SESOI band $[-\Delta, \Delta]$. This is the two one sided tests (TOST) procedure (Lakens 2013). Because each one sided test is run at level $\alpha$, the matching interval is the $1 - 2\alpha$ interval, for example a $90\%$ interval when $\alpha = 0.05$. It is the principled way to support a null conclusion, which an ordinary nonsignificant $p$ value can never do.

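```python
# Percentile bootstrap CI for a paired accuracy gap (illustrative)
import numpy as np


def bootstrap_gap_ci(diffs, B=10000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)  # seeded so the interval is reproducible
    diffs = np.asarray(diffs)
    # Resample the paired differences with replacement and record each mean
    stats = [rng.choice(diffs, size=diffs.size, replace=True).mean() for _ in range(B)]
    lo = np.quantile(stats, alpha / 2)
    hi = np.quantile(stats, 1 - alpha / 2)
    return lo, hi
```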
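The equivalence rule described above can be sketched as an interval check. A TOST at level $\alpha$ is equivalent to asking whether the $1 - 2\alpha$ interval sits inside $[-\Delta, \Delta]$. The version below builds that interval with a normal approximation, an assumption that is only reasonable for moderately large samples, and the name `equivalent_within` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def equivalent_within(diffs, delta, alpha=0.05):
    """Return (verdict, (lo, hi)) for a TOST style equivalence check.

    verdict is True when the (1 - 2*alpha) interval for the mean paired
    difference lies strictly inside [-delta, delta].
    """
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)
    z = NormalDist().inv_cdf(1.0 - alpha)    # one sided critical value
    center = mean(diffs)
    lo, hi = center - z * se, center + z * se
    return (-delta < lo) and (hi < delta), (lo, hi)
```

The band half width `delta` is the SESOI and has to come from the decision at hand; the function only reports whether the data are precise enough to place the difference inside it.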

172.5 5. A Worked Example

Suppose model $A$ and model $B$ are each trained with $n = 10$ random seeds and evaluated on a fixed held out set. The per seed top one accuracies (percent) are

\[ A: 81.2,\ 80.8,\ 81.5,\ 80.9,\ 81.7,\ 81.0,\ 81.3,\ 80.6,\ 81.4,\ 81.1, \] \[ B: 80.4,\ 80.1,\ 80.7,\ 79.9,\ 80.5,\ 80.0,\ 80.6,\ 79.8,\ 80.3,\ 80.2. \]

The means are $\bar{x}_A = 81.15$ and $\bar{x}_B = 80.25$, a raw gap of $0.90$ accuracy points. The sample standard deviations are approximately $s_A \approx 0.34$ and $s_B \approx 0.30$, so the pooled standard deviation is

\[ s_p = \sqrt{\frac{9(0.34^2) + 9(0.30^2)}{18}} \approx 0.32. \]

Cohen’s $d = 0.90 / 0.32 \approx 2.8$, a very large standardized effect: the two seed distributions barely overlap. The small sample correction gives $g \approx 2.8 \times (1 - 3/(4 \cdot 20 - 9)) \approx 2.7$, a modest shrinkage. Translating through $A = \Phi(d/\sqrt{2}) = \Phi(1.98) \approx 0.98$, a randomly chosen $A$ seed beats a randomly chosen $B$ seed about $98\%$ of the time.

Now the contrast that motivates the whole chapter. A two sample $t$ test on these numbers yields a tiny $p$ value, but so would a scenario in which the gap were only $0.05$ points and the number of seeds were large enough to resolve it against the same seed variance. The $p$ value cannot distinguish the two. The effect size and its interval can: here the magnitude is unambiguously large and the seed distributions are well separated, whereas a $0.05$ point gap with the same spread would surface as $d \approx 0.16$ with a narrow interval sitting just above zero, telling the practitioner not to bother. Had the $0.05$ point gap instead come with correspondingly tiny seed variance, $d$ would stay large and only the raw gap would reveal that the improvement is negligible, which is why the standardized and raw effects belong together. Reporting $d \approx 2.8$, $95\%$ CI roughly $[1.6, 4.0]$, alongside the $0.90$ point raw gap, communicates everything the decision needs.
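The arithmetic of this example can be reproduced with the standard library. The sketch below recomputes the gap, pooled standard deviation, $d$, $g$, and the probability of superiority from the listed accuracies. The interval for $d$ uses a common large sample approximation to its standard error, which is an assumption of this sketch rather than a method stated in the text.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

A = [81.2, 80.8, 81.5, 80.9, 81.7, 81.0, 81.3, 80.6, 81.4, 81.1]
B = [80.4, 80.1, 80.7, 79.9, 80.5, 80.0, 80.6, 79.8, 80.3, 80.2]


def summarize(x1, x2):
    """Raw gap, pooled SD, Cohen's d, Hedges' g, and an approximate 95% CI for d."""
    n1, n2 = len(x1), len(x2)
    gap = mean(x1) - mean(x2)
    sp = sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2))
    d = gap / sp
    g = d * (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0))
    # Assumed approximation: Var(d) ~ (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
    se_d = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.975)
    return {"gap": gap, "sp": sp, "d": d, "g": g, "ci_d": (d - z * se_d, d + z * se_d)}


result = summarize(A, B)
superiority = NormalDist().cdf(result["d"] / sqrt(2.0))
```

The values in `result` agree with the rounded figures quoted in the example, including an interval for $d$ of roughly $[1.6, 4.0]$.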

172.6 6. Effect Sizes in Common ML Settings

172.6.1 6.1 Variance in benchmarks and model comparison

Deep learning results vary across random seeds, data orderings, augmentation randomness, and hardware nondeterminism. A responsible comparison treats each seed as a replication and computes an effect size over the resulting distribution of scores, rather than comparing single runs. Bouthillier and colleagues show that ignoring these sources of variance leads to comparisons that do not replicate, and they advocate randomizing all of them and reporting variance honestly (Bouthillier et al. 2021).

It helps to decompose the observed score of a single run as

\[ X = \mu_{\text{model}} + \varepsilon_{\text{seed}} + \varepsilon_{\text{eval}}, \]

where $\mu_{\text{model}}$ is the quantity of interest, $\varepsilon_{\text{seed}}$ is the run to run training variability, and $\varepsilon_{\text{eval}}$ is the sampling noise from a finite test set. Comparing two single runs estimates $\mu_A - \mu_B$ with the full $\varepsilon_{\text{seed}}$ variance still attached, which is why single seed leaderboards are so brittle. Averaging over seeds shrinks the seed component and is what makes a stable effect size estimate possible. When aggregating across multiple benchmark tasks, standardized effect sizes allow a single summary because they remove the differing units and difficulty of each task. This is the same logic that underlies meta analysis, where heterogeneous studies are combined on a common standardized scale.
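A small simulation makes the benefit of averaging over seeds concrete. Everything here is hypothetical: the true gap, the seed standard deviation, and the Gaussian noise model are assumptions chosen for illustration, and evaluation noise is left out so that only the seed component is visible.

```python
import random
from statistics import mean, pvariance


def simulated_gap_estimates(k, true_gap=0.5, seed_sd=0.3, trials=2000, rng_seed=0):
    """Simulate estimates of mu_A - mu_B when each model is averaged over k seeds."""
    rng = random.Random(rng_seed)
    estimates = []
    for _ in range(trials):
        a = mean(true_gap + rng.gauss(0.0, seed_sd) for _ in range(k))
        b = mean(rng.gauss(0.0, seed_sd) for _ in range(k))
        estimates.append(a - b)
    return estimates


single_run_var = pvariance(simulated_gap_estimates(k=1))
ten_seed_var = pvariance(simulated_gap_estimates(k=10))
```

Under these assumptions the variance of the estimated gap is $2\sigma_{\text{seed}}^2 / k$, so moving from one seed to ten per model should cut it by about a factor of ten.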

For comparing classifiers across many data sets, Demšar laid out the recommended nonparametric machinery, the Friedman test with Nemenyi post hoc analysis (Demšar 2006). A Bayesian alternative reports the posterior probability that one model is practically better, worse, or equivalent to another within a region of practical equivalence, which fuses effect size thinking with inference directly (Benavoli et al. 2017).

172.6.2 6.2 A/B testing and online experiments

In production experimentation the effect size is typically the lift in a business metric, expressed as an absolute or relative difference with a confidence interval. Here practical significance is paramount, because shipping a change carries cost. Teams define a minimum detectable effect in advance, which sets the required sample size through a power analysis, and they refuse to ship changes whose interval, although excluding zero, falls below the practically meaningful threshold (Kohavi, Tang, and Xu 2020).
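The shipping rule described here can be sketched for a conversion rate metric. The function below is an illustration under stated assumptions: a normal approximation interval for the difference of two proportions, an absolute threshold `min_effect`, and three hypothetical verdict labels that are not taken from any experimentation platform.

```python
from math import sqrt
from statistics import NormalDist


def ab_lift(conv_control, n_control, conv_treat, n_treat, min_effect, level=0.95):
    """Absolute lift in conversion rate, its interval, and a verdict against min_effect."""
    p_c = conv_control / n_control
    p_t = conv_treat / n_treat
    lift = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lo, hi = lift - z * se, lift + z * se
    if lo >= min_effect:
        verdict = "ship"            # the whole interval clears the practical threshold
    elif hi < min_effect:
        verdict = "trivial"         # even the optimistic end is below the threshold
    else:
        verdict = "inconclusive"    # the interval straddles the threshold
    return lift, (lo, hi), verdict
```

With one million users per arm and conversion rates of $1.00\%$ and $1.03\%$, the interval excludes zero, yet a threshold of $0.1$ percentage points (`min_effect=0.001`) yields the verdict `"trivial"`: detectable, but not worth shipping.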

172.6.3 6.3 Fairness and disparity

When measuring whether a model treats groups differently, the effect size is the standardized gap in error rates, selection rates, or calibration between groups. A statistically significant disparity on a huge audit set may be operationally negligible, while a large standardized disparity on a small subgroup may demand action despite weak significance. Effect sizes give regulators and practitioners a magnitude to reason about rather than a binary verdict, and the same SESOI discipline applies: the threshold for an actionable disparity should be fixed by policy before the audit, not discovered afterward.

172.7 7. Power, Sample Size, and Planning

Effect sizes are not only a reporting tool; they drive experimental design. Statistical power, the probability of detecting a true effect of a given size, depends jointly on the effect size $\delta$, the sample size $n$, and the chosen significance level $\alpha$. For a two sample $z$ approximation, the required per group size to achieve power $1 - \beta$ is

\[ n \approx \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}, \]

where $\delta$ is the standardized effect size of interest. The inverse square dependence on $\delta$ shows that detecting small effects is expensive: halving the effect you wish to detect quadruples the required sample. Planning an evaluation around the smallest effect worth detecting, rather than collecting as much data as possible and reporting whatever turns out significant, is the disciplined path. It also guards against underpowered studies, whose significant results are systematically inflated (a selection effect sometimes called the winner’s curse or type M, for magnitude, error) and frequently fail to replicate.
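The sample size formula translates directly into a few lines of code. This sketch uses the standard library's `NormalDist` for the normal quantiles and rounds up to a whole number of observations; the function name and default arguments are choices made for this example.

```python
from math import ceil
from statistics import NormalDist


def per_group_n(delta, alpha=0.05, power=0.80):
    """Per group sample size from the two sample z approximation."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # two sided critical value
    z_beta = NormalDist().inv_cdf(power)                # quantile for the target power
    return ceil(2.0 * (z_alpha + z_beta) ** 2 / delta ** 2)
```

For a medium effect of $\delta = 0.5$ at $\alpha = 0.05$ and $80\%$ power this gives $63$ per group, and for $\delta = 0.25$ it gives $252$, the fourfold increase that the inverse square law predicts.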

172.8 8. Reporting Effect Sizes Alongside p Values

The goal is not to abolish $p$ values but to contextualize them. A complete empirical claim in ML should report, at minimum, the point estimate of the effect, a confidence interval, the sample size, and the $p$ value if a test is performed. The American Statistical Association urges authors to present effect sizes and intervals as standard practice rather than significance alone (Wasserstein and Lazar 2016).

A concrete template for a model comparison reads as follows. Model $A$ improved top one accuracy over model $B$ by $1.8$ points, $95\%$ CI $[0.6, 3.0]$, standardized $d_z = 0.70$, across $20$ seeds, $p = 0.005$. This single sentence tells the reader the magnitude, the uncertainty, the replication count, and the detectability. Compare that with the impoverished alternative, “$A$ significantly outperformed $B$ ($p < 0.05$),” which leaves every practical question unanswered.

172.8.1 8.1 When to use which effect size, and the pitfalls

Choosing the right measure is mostly about the data type. Use $d$ or $g$ for approximately normal scores, switching to $g$ whenever the seed or sample count is small. Use the probability of superiority when scores are skewed, bounded, or ordinal. Use Cramer’s $V$ or the odds ratio for categorical outcomes, and $r$ or $r^2$ for continuous associations. Whatever the choice, attach a confidence interval.

A handful of pitfalls recur often enough to enumerate.

Treating Cohen’s thresholds as universal. The $0.2 / 0.5 / 0.8$ labels are field conventions, not deployment criteria. Always anchor interpretation to a SESOI derived from cost.
Mixing paired and unpaired effect sizes. $d_z$ and the independent groups $d$ live on different scales because $d_z$ absorbs the pairing correlation. Do not pool them in a meta analysis without converting.
Reporting only the winners. When many comparisons are run, adjust for multiplicity and report effect sizes for all of them, not only the significant ones, to avoid the selection bias that turns noise into apparent discovery.
Standardizing against an unstable denominator. With few seeds the pooled standard deviation is itself noisy, so $d$ inherits that noise; a tight interval on the raw gap can be more trustworthy than a wide one on $d$. Report both the raw and standardized effect.
Confusing relative and absolute lift. A $50\%$ relative improvement on a rare event can be a negligible absolute change. State which you mean and prefer absolute differences for decisions.
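The text does not name a specific multiplicity adjustment. As one common choice, the sketch below implements the Holm step down procedure, which controls the family wise error rate without assuming independence between tests. The function name is illustrative.

```python
def holm_adjust(p_values):
    """Holm step down adjusted p values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices from smallest p to largest
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)         # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted
```

Reporting the adjusted $p$ values next to the effect size and interval for every comparison, including the unremarkable ones, keeps the full set of results visible rather than only the winners.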

Three habits make reporting trustworthy. First, fix the smallest effect of interest before seeing results, so that significance is judged against a meaningful bar. Second, prefer intervals to point verdicts, because intervals expose precision. Third, when a null conclusion is the goal, use equivalence testing rather than a nonsignificant $p$ value, since absence of evidence is not evidence of absence.

172.9 9. Conclusion

Statistical significance and practical significance are different questions, and conflating them has produced a literature where vanishingly small differences are declared victories and genuinely large effects are dismissed for want of data. Effect sizes, whether Cohen’s $d$, Hedges’ $g$, the probability of superiority, Cramer’s $V$, or a simple metric lift with a confidence interval, restore magnitude to the center of empirical reasoning. The standardized measures are scale free and comparable across studies, and all of them can be interpreted directly against the cost of a deployment decision. The practice this chapter recommends is simple to state and powerful in effect. Always report how large, how uncertain, and only then how surprising.

172.10 References

# Effect Sizes in Machine Learning ## 1. Introduction Modern machine learning research lives and dies by comparison. A new architecture beats a baseline, a tuned optimizer edges out the default, a fairness intervention reduces disparity. The standard ritual for backing such claims is the significance test, which produces a $p$ value and an accompanying verdict of "significant" or "not significant." Yet a $p$ value answers a narrow question. It tells us how surprising the observed data would be if there were truly no difference between systems. It does not tell us how large the difference is, whether that difference matters in deployment, or whether a practitioner should adopt the new method. Those questions belong to the domain of effect sizes. An effect size is a quantitative measure of the magnitude of a phenomenon, expressed on a scale that is interpretable and, ideally, comparable across studies. In machine learning the phenomenon is usually the performance gap between two models, the strength of association between a design choice and an outcome, or the size of a treatment effect in an A/B test. This chapter argues that effect sizes deserve a permanent place beside $p$ values in any empirical ML report. We develop the conceptual distinction between statistical and practical significance, define the standardized effect size measures most relevant to ML with their estimators and properties, work through a numerical example, and offer concrete reporting guidance. A useful organizing principle, due to Cohen and refined by many since, is that every claim of an effect has two ingredients that significance testing fatally entangles: the **signal** (how large the effect is) and the **resolution** (how precisely we measured it). The $p$ value is a function of both. An effect size isolates the signal, and a confidence interval restores the resolution as a separate, visible quantity. Keeping the two apart is the whole game. ## 2. 
Statistical Significance Is Not Practical Significance ### 2.1 What a p value actually controls Consider comparing two classifiers $A$ and $B$ over $n$ paired test instances. Let $D_i$ be the per instance accuracy difference, with population mean $\mu_D$ and standard deviation $\sigma_D$. The paired $t$ statistic is $$ t = \frac{\bar{D}}{s_D / \sqrt{n}}, $$ where $\bar{D}$ is the sample mean difference and $s_D$ the sample standard deviation. The $p$ value is the probability, under the null hypothesis $\mu_D = 0$, of observing a statistic at least as extreme as $t$. Crucially, the denominator shrinks as $n$ grows. For any fixed nonzero $\mu_D$, no matter how trivially small, $|t|$ grows without bound as $n \to \infty$, and the $p$ value tends to zero. It is worth stating plainly what the $p$ value does and does not control. Under the null it is a probability statement about hypothetical data, not about the hypothesis. It is **not** the probability that the null is true, **not** the probability that the result is a fluke, and **not** one minus the probability that the finding replicates. The ASA statement on $p$ values [@wasserstein2016asa] devotes itself almost entirely to dispelling these misreadings, and the entanglement of magnitude with sample size is the most consequential of them for ML. ### 2.2 The large sample trap This dependence on sample size is the heart of the problem. With the enormous evaluation sets common in ML, where $n$ can reach millions of tokens or images, almost any nonzero difference becomes statistically significant. A model that is better by $0.01$ accuracy points and a model that is better by $5$ accuracy points can both yield $p < 10^{-6}$. Significance certifies that the effect is detectable; it says nothing about whether the effect is large enough to justify the engineering cost, latency increase, or carbon footprint of switching systems. The converse failure is equally real. 
A genuinely large improvement evaluated on a small or noisy benchmark may fail to reach significance, producing a false sense that "nothing is happening." Statistical significance conflates the size of an effect with the precision of its estimate. Effect sizes separate the two. ```{mermaid} flowchart TD Q["Two systems compared"] --> P{"p value small?"} P -->|yes| ES1{"Effect size meaningful?"} P -->|no| ES2{"Interval rules out a meaningful effect?"} ES1 -->|yes| A["Adopt: real and worthwhile"] ES1 -->|no| B["Detectable but trivial: do not ship"] ES2 -->|yes| C["Genuine equivalence: stop chasing"] ES2 -->|no| D["Underpowered: collect more data"] ``` The diagram captures the four quadrants that a $p$ value alone collapses into two. The two diagonal cells, "detectable but trivial" and "underpowered," are exactly the failures that effect size reporting exposes. ## 3. Cohen's d and Standardized Mean Differences ### 3.1 Definition The most widely used standardized effect size for a difference in means is Cohen's $d$. For two independent groups with means $\bar{x}_1, \bar{x}_2$ and a pooled standard deviation $s_p$, $$ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}. $$ The quantity $d$ expresses the gap between groups in units of standard deviation. A value of $d = 0.5$ means the group means differ by half a standard deviation. Because $d$ is dimensionless, it is comparable across experiments that use different metrics or scales, which is exactly what raw mean differences and $p$ values are not. For paired designs, such as evaluating two models on the same test items, the natural analogue is the standardized mean of the differences, $$ d_z = \frac{\bar{D}}{s_D}, $$ which relates to the paired $t$ statistic by $t = d_z \sqrt{n}$. This identity makes the large sample trap explicit. 
The $t$ statistic blends a scale free effect size $d_z$ with a sample size factor $\sqrt{n}$, so a tiny $d_z$ can still produce an overwhelming $t$. Note that $d_z$ uses the standard deviation of the **differences** $s_D$, which absorbs the correlation between paired scores; it is therefore not directly comparable to the independent groups $d$ unless that correlation is accounted for, a subtlety that trips up meta analyses that mix paired and unpaired designs. ### 3.2 Interpreting magnitude Cohen proposed rough conventions of $0.2$ (small), $0.5$ (medium), and $0.8$ (large), but these were offered reluctantly and are not laws of nature. In ML the meaningful threshold is domain dependent. A $d$ of $0.2$ in a click through rate experiment serving billions of impressions may translate to substantial revenue, while a $d$ of $0.8$ on a toy benchmark may be irrelevant to production. Effect sizes should always be interpreted against a context specific notion of what difference matters, sometimes formalized as a **smallest effect size of interest** (SESOI). The SESOI is set from the cost structure of the decision, not from a textbook table, and fixing it before data collection is what turns an effect size from a description into a decision criterion. A second interpretive aid is the translation of $d$ into an overlap or a probability. Under approximately normal, equal variance assumptions, $d$ maps directly to the probability of superiority through $A = \Phi(d / \sqrt{2})$, where $\Phi$ is the standard normal cumulative distribution function. Thus $d = 0.5$ corresponds to roughly a $64\%$ chance that a random draw from the better group exceeds a random draw from the worse one, a statement many readers find more intuitive than "half a standard deviation." ### 3.3 Small sample bias Cohen's $d$ is positively biased for small samples, because the sample standard deviation in the denominator underestimates the population value in expectation. 
Hedges proposed a correction factor that yields an approximately unbiased estimator, often called Hedges' $g$ [@hedges1981distribution], $$ g = d \cdot \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right). $$ The correction is negligible for large $n$ but matters when comparing models across a handful of random seeds, a regime that is common and under reported in deep learning. With five seeds per model the bias inflates $d$ by several percent, enough to nudge a borderline "medium" effect over a reporting threshold. When the seed count is small, report $g$ rather than $d$. The estimator below makes the pooled standard deviation explicit. It is illustrative and assumes standard array helpers; the mature open source path in practice is `pingouin.compute_effsize` or `scipy.stats`, both freely available. ```python # Cohen's d for two independent groups (illustrative, not executable) def cohens_d(x1, x2): n1, n2 = len(x1), len(x2) s1, s2 = var(x1, ddof=1), var(x2, ddof=1) sp = sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)) return (mean(x1) - mean(x2)) / sp ``` ## 4. A Family of Effect Sizes for ML ### 4.1 Beyond mean differences Cohen's $d$ assumes roughly normal, equal variance data, which performance metrics often violate. Per item accuracy is Bernoulli, calibration error is bounded and skewed, and ranking metrics are discrete. Several alternatives are useful and more robust. The **probability of superiority**, also called the common language effect size or $A$ statistic, is the chance that a randomly drawn score from model $A$ exceeds one from model $B$, $$ A = P(X_A > X_B) + \tfrac{1}{2} P(X_A = X_B). $$ It is closely tied to the Mann Whitney $U$ statistic through $A = U / (n_A n_B)$ and is robust to non normality [@mcgraw1992common]. A value of $0.5$ denotes no difference, and values approaching $1$ denote near total dominance of one model. 
Because $A$ is a probability it needs no scale assumption and survives monotone transformations of the metric, which makes it a safe default when the distribution of scores is unknown or clearly non normal. For association between two categorical variables, such as a design choice and a success or failure outcome, the **odds ratio** and **Cramer's $V$** quantify effect magnitude. Cramer's $V$ rescales the chi squared statistic to the interval $[0,1]$, $$ V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}, $$ for an $r \times c$ contingency table, giving a sample size free measure of association. For correlation between continuous quantities, Pearson's $r$ or the coefficient of determination $r^2$ serve directly as effect sizes, since they are already scale free; $r^2$ has the clean reading of the fraction of variance explained. The table summarizes the family and when each is appropriate. | Setting | Effect size | Range | Robust to non normality | |---|---|---|---| | Difference in means, two groups | Cohen's $d$, Hedges' $g$ | unbounded | no | | Paired model comparison | $d_z$ | unbounded | no | | Stochastic dominance of scores | Probability of superiority $A$ | $[0,1]$ | yes | | Categorical association | Cramer's $V$, odds ratio | $V \in [0,1]$ | yes | | Continuous association | $r$, $r^2$ | $r \in [-1,1]$ | partial | ### 4.2 Confidence intervals as effect size reporting An effect size point estimate is incomplete without an interval that conveys its uncertainty. A $95\%$ confidence interval on $d$ or on the raw performance gap communicates both magnitude and precision in one object. When the interval for a difference excludes zero, it conveys the same information as $p < 0.05$, but it additionally shows the plausible range of the effect. Reporting the interval $[0.3, 0.9]$ for a standardized difference is far more informative than reporting "$p = 0.01$." 
For ML metrics with awkward sampling distributions, bootstrap confidence intervals on the effect size are a practical default, and the bias corrected and accelerated (BCa) bootstrap is the recommended variant when the statistic is skewed.

Equivalence testing turns the interval into an explicit decision rule. To claim two systems are practically the same, one shows that the confidence interval for their difference lies entirely inside the SESOI band $[-\Delta, \Delta]$. This is the two one sided tests (TOST) procedure [@lakens2013calculating], and it is the principled way to support a null conclusion, which an ordinary nonsignificant $p$ value can never do.

```python
# Percentile bootstrap CI for a paired accuracy gap (illustrative sketch using NumPy)
import numpy as np

def bootstrap_gap_ci(diffs, B=10000, alpha=0.05, seed=0):
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the per-pair differences with replacement, B times, and average each resample.
    idx = rng.integers(0, len(diffs), size=(B, len(diffs)))
    stats = diffs[idx].mean(axis=1)
    lo = np.quantile(stats, alpha / 2)
    hi = np.quantile(stats, 1 - alpha / 2)
    return lo, hi
```

## 5. A Worked Example

Suppose model $A$ and model $B$ are each trained with $n = 10$ random seeds and evaluated on a fixed held out set. The per seed top one accuracies (percent) are

$$
A: 81.2,\ 80.8,\ 81.5,\ 80.9,\ 81.7,\ 81.0,\ 81.3,\ 80.6,\ 81.4,\ 81.1,
$$

$$
B: 80.4,\ 80.1,\ 80.7,\ 79.9,\ 80.5,\ 80.0,\ 80.6,\ 79.8,\ 80.3,\ 80.2.
$$

The means are $\bar{x}_A = 81.15$ and $\bar{x}_B = 80.25$, a raw gap of $0.90$ accuracy points. The sample standard deviations are approximately $s_A \approx 0.34$ and $s_B \approx 0.30$, so the pooled standard deviation is

$$
s_p = \sqrt{\frac{9(0.34^2) + 9(0.30^2)}{18}} \approx 0.32.
$$

Cohen's $d = 0.90 / 0.32 \approx 2.8$, a very large standardized effect: the two seed distributions barely overlap. The small sample correction gives $g \approx 2.8 \times (1 - 3/(4 \cdot 20 - 9)) \approx 2.7$, a modest shrinkage. Translating through $A = \Phi(d/\sqrt{2}) = \Phi(1.98) \approx 0.98$, a conversion that assumes normal scores with equal variance, a randomly chosen $A$ seed beats a randomly chosen $B$ seed about $98\%$ of the time.

Now the contrast that motivates the whole chapter.
A two sample $t$ test on these numbers yields a tiny $p$ value, but so would a scenario in which the gap was $0.05$ points with correspondingly tiny seed variance. The $p$ value cannot distinguish the two, and neither can $d$ on its own, because at a fixed seed count the $t$ statistic is just a rescaling of $d$. The raw gap and its interval can: here the improvement is $0.90$ accuracy points and the seed distributions are well separated, whereas a $0.05$ point gap would surface as an interval hugging zero on the accuracy scale, inside any plausible SESOI band, telling the practitioner not to bother. Reporting $d \approx 2.8$, $95\%$ CI roughly $[1.6, 4.0]$, alongside the $0.90$ point raw gap, communicates everything the decision needs.

## 6. Effect Sizes in Common ML Settings

### 6.1 Variance in benchmarks and model comparison

Deep learning results vary across random seeds, data orderings, augmentation randomness, and hardware nondeterminism. A responsible comparison treats each seed as a replication and computes an effect size over the resulting distribution of scores, rather than comparing single runs. Bouthillier and colleagues show that ignoring these sources of variance leads to comparisons that do not replicate, and they advocate randomizing all of them and reporting variance honestly [@bouthillier2021accounting].

It helps to decompose the observed score of a single run as

$$
X = \mu_{\text{model}} + \varepsilon_{\text{seed}} + \varepsilon_{\text{eval}},
$$

where $\mu_{\text{model}}$ is the quantity of interest, $\varepsilon_{\text{seed}}$ is the run to run training variability, and $\varepsilon_{\text{eval}}$ is the sampling noise from a finite test set. Comparing two single runs estimates $\mu_A - \mu_B$ with the full $\varepsilon_{\text{seed}}$ variance still attached, which is why single seed leaderboards are so brittle. Averaging over seeds shrinks the seed component and is what makes a stable effect size estimate possible.
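
The decomposition above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real benchmark: the function names, means, and noise levels are hypothetical, both noise terms are taken to be Gaussian, and the evaluation noise is drawn once per model per trial (as if each model were scored on one finite test set) and independently across the two models.

```python
import random
import statistics


def estimated_gap(n_seeds, rng, mu_a=81.0, mu_b=80.5, sd_seed=0.4, sd_eval=0.1):
    """Estimate mu_A - mu_B from n_seeds runs per model under X = mu + eps_seed + eps_eval."""
    means = []
    for mu in (mu_a, mu_b):
        # Assumption: one evaluation-noise draw per model, shared by all of its runs.
        eps_eval = rng.gauss(0.0, sd_eval)
        # Seed noise is redrawn for every run, so averaging over runs shrinks it.
        runs = [mu + rng.gauss(0.0, sd_seed) + eps_eval for _ in range(n_seeds)]
        means.append(statistics.fmean(runs))
    return means[0] - means[1]


def gap_spread(n_seeds, n_trials=2000, seed=0):
    """Standard deviation of the estimated gap over repeated simulated experiments."""
    rng = random.Random(seed)
    gaps = [estimated_gap(n_seeds, rng) for _ in range(n_trials)]
    return statistics.stdev(gaps)


for n in (1, 5, 20):
    print(f"seeds per model = {n:2d}, spread of estimated gap = {gap_spread(n):.3f}")
```

Under these assumptions the spread of the estimated gap falls as the seed count grows and then flattens toward a floor set by the evaluation noise, which averaging over seeds cannot remove.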
When aggregating across multiple benchmark tasks, standardized effect sizes allow a single summary because they remove the differing units and difficulty of each task. This is the same logic that underlies meta analysis, where heterogeneous studies are combined on a common standardized scale. For comparing classifiers across many data sets, the recommended nonparametric machinery, the Friedman test with Nemenyi post hoc analysis, was laid out by Demsar [@demsar2006statistical]. A Bayesian alternative reports the posterior probability that one model is practically better, worse, or equivalent to another within a region of practical equivalence, which fuses effect size thinking with inference directly [@benavoli2017time].

### 6.2 A/B testing and online experiments

In production experimentation the effect size is typically the lift in a business metric, expressed as an absolute or relative difference with a confidence interval. Here practical significance is paramount, because shipping a change carries cost. Teams define a minimum detectable effect in advance, which sets the required sample size through a power analysis, and they refuse to ship changes whose interval, although excluding zero, falls below the practically meaningful threshold [@kohavi2020trustworthy].

### 6.3 Fairness and disparity

When measuring whether a model treats groups differently, the effect size is the standardized gap in error rates, selection rates, or calibration between groups. A statistically significant disparity on a huge audit set may be operationally negligible, while a large standardized disparity on a small subgroup may demand action despite weak significance. Effect sizes give regulators and practitioners a magnitude to reason about rather than a binary verdict, and the same SESOI discipline applies: the threshold for an actionable disparity should be fixed by policy before the audit, not discovered afterward.

## 7. Power, Sample Size, and Planning

Effect sizes are not only a reporting tool; they drive experimental design. Statistical power, the probability of detecting a true effect of a given size, depends jointly on the effect size $\delta$, the sample size $n$, and the chosen significance level $\alpha$. For a two sample $z$ approximation, the required per group size to achieve power $1 - \beta$ is

$$
n \approx \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2},
$$

where $\delta$ is the standardized effect size of interest. The inverse square dependence on $\delta$ shows that detecting small effects is expensive: halving the effect you wish to detect quadruples the required sample.

Planning an evaluation around the smallest effect worth detecting, rather than collecting as much data as possible and reporting whatever turns out significant, is the disciplined path. It also guards against underpowered studies, whose significant results tend to be inflated (a selection effect sometimes called the winner's curse or type M, for magnitude, error) and frequently fail to replicate.

## 8. Reporting Effect Sizes Alongside p Values

The goal is not to abolish $p$ values but to contextualize them. A complete empirical claim in ML should report, at minimum, the point estimate of the effect, a confidence interval, the sample size, and the $p$ value if a test is performed. The American Statistical Association urges authors to present effect sizes and intervals as standard practice rather than significance alone [@wasserstein2016asa].

A concrete template for a model comparison reads as follows.

Model $A$ improved top one accuracy over model $B$ by $1.8$ points, $95\%$ CI $[0.6, 3.0]$, standardized $d_z = 0.70$, across $20$ seeds, $p = 0.005$.

This single sentence tells the reader the magnitude, the uncertainty, the replication count, and the detectability.
Compare that with the impoverished alternative, "$A$ significantly outperformed $B$ ($p < 0.05$)," which leaves every practical question unanswered.

### 8.1 When to use which effect size, and the pitfalls

Choosing the right measure is mostly about the data type. Use $d$ or $g$ for approximately normal scores, switching to $g$ whenever the seed or sample count is small. Use the probability of superiority when scores are skewed, bounded, or ordinal. Use Cramer's $V$ or the odds ratio for categorical outcomes, and $r$ or $r^2$ for continuous associations. Whatever the choice, attach a confidence interval.

A handful of pitfalls recur often enough to enumerate.

- **Treating Cohen's thresholds as universal.** The $0.2 / 0.5 / 0.8$ labels are field conventions, not deployment criteria. Always anchor interpretation to a SESOI derived from cost.
- **Mixing paired and unpaired effect sizes.** $d_z$ and the independent groups $d$ live on different scales because $d_z$ absorbs the pairing correlation. Do not pool them in a meta analysis without converting.
- **Reporting only the winners.** When many comparisons are run, adjust for multiplicity and report effect sizes for all of them, not only the significant ones, to avoid the selection bias that turns noise into apparent discovery.
- **Standardizing against an unstable denominator.** With few seeds the pooled standard deviation is itself noisy, so $d$ inherits that noise; a tight interval on the raw gap can be more trustworthy than a wide one on $d$. Report both the raw and standardized effect.
- **Confusing relative and absolute lift.** A $50\%$ relative improvement on a rare event can be a negligible absolute change. State which you mean and prefer absolute differences for decisions.

Three habits make reporting trustworthy. First, fix the smallest effect of interest before seeing results, so that significance is judged against a meaningful bar. Second, prefer intervals to point verdicts, because intervals expose precision.
Third, when a null conclusion is the goal, use equivalence testing rather than a nonsignificant $p$ value, since absence of evidence is not evidence of absence.

## 9. Conclusion

Statistical significance and practical significance are different questions, and conflating them has produced a literature where vanishingly small differences are declared victories and genuinely large effects are dismissed for want of data. Effect sizes, whether Cohen's $d$, Hedges' $g$, the probability of superiority, Cramer's $V$, or a simple metric lift with a confidence interval, restore magnitude to the center of empirical reasoning. They are scale free, comparable across studies, and directly interpretable against the cost of a deployment decision.

The practice this chapter recommends is simple to state and powerful in effect. Always report how large, how uncertain, and only then how surprising.

## References

::: {#refs}
:::