172  Effect Sizes in Machine Learning

172.1 1. Introduction

Modern machine learning research lives and dies by comparison. A new architecture beats a baseline, a tuned optimizer edges out the default, a fairness intervention reduces disparity. The standard ritual for backing such claims is the significance test, which produces a \(p\) value and an accompanying verdict of “significant” or “not significant.” Yet a \(p\) value answers a narrow question. It tells us how surprising the observed data would be if there were truly no difference between systems. It does not tell us how large the difference is, whether that difference matters in deployment, or whether a practitioner should adopt the new method. Those questions belong to the domain of effect sizes.

An effect size is a quantitative measure of the magnitude of a phenomenon, expressed on a scale that is interpretable and, ideally, comparable across studies. In machine learning the phenomenon is usually the performance gap between two models, the strength of association between a design choice and an outcome, or the size of a treatment effect in an A/B test. This chapter argues that effect sizes deserve a permanent place beside \(p\) values in any empirical ML report. We develop the conceptual distinction between statistical and practical significance, define the standardized effect size measures most relevant to ML, and offer concrete reporting guidance.

172.2 2. Statistical Significance Is Not Practical Significance

172.2.1 2.1 What a p value actually controls

Consider comparing two classifiers \(A\) and \(B\) over \(n\) paired test instances. Let \(D_i\) be the per instance accuracy difference, with population mean \(\mu_D\) and standard deviation \(\sigma_D\). The paired \(t\) statistic is

\[ t = \frac{\bar{D}}{s_D / \sqrt{n}}, \]

where \(\bar{D}\) is the sample mean difference and \(s_D\) the sample standard deviation. The \(p\) value is the probability, under the null hypothesis \(\mu_D = 0\), of observing a statistic at least as extreme as \(t\). Crucially, the denominator shrinks as \(n\) grows. For any fixed nonzero \(\mu_D\), no matter how trivially small, \(|t|\) grows without bound as \(n \to \infty\), and the \(p\) value tends to zero.

172.2.2 2.2 The large sample trap

This dependence on sample size is the heart of the problem. With the enormous evaluation sets common in ML, where \(n\) can reach millions of tokens or images, almost any nonzero difference becomes statistically significant. A model that is better by \(0.01\) accuracy points and a model that is better by \(5\) accuracy points can both yield \(p < 10^{-6}\). Significance certifies that the effect is detectable; it says nothing about whether the effect is large enough to justify the engineering cost, latency increase, or carbon footprint of switching systems.

The converse failure is equally real. A genuinely large improvement evaluated on a small or noisy benchmark may fail to reach significance, producing a false sense that “nothing is happening.” Statistical significance conflates the size of an effect with the precision of its estimate. Effect sizes separate the two.
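Both failure modes can be reproduced with a few lines of arithmetic. The sketch below assumes a normal approximation to the paired \(t\) statistic, which is accurate for large \(n\) and only rough for small \(n\). The helper name `paired_test` and all numbers are illustrative assumptions, not results from any benchmark.

```python
import math
from statistics import NormalDist

def paired_test(mean_diff, sd_diff, n):
    """Return (t, two-sided p) for a paired comparison.

    Assumption of this sketch: the p value uses a normal approximation
    to the t distribution, which is only accurate for large n.
    """
    t = mean_diff / (sd_diff / math.sqrt(n))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p

# A tiny gap (0.5% of a standard deviation) on ever larger test sets.
for n in (10_000, 1_000_000, 100_000_000):
    t, p = paired_test(mean_diff=0.005, sd_diff=1.0, n=n)
    print(f"n={n:>11,d}  t={t:6.2f}  p={p:.3g}")

# A large gap (0.8 standard deviations) on only five paired items.
t_small, p_small = paired_test(mean_diff=0.8, sd_diff=1.0, n=5)
print(f"n=5  t={t_small:.2f}  p={p_small:.3g}")
```

The standardized gap never changes inside the loop, yet the verdict flips from "not significant" to overwhelmingly significant purely because \(n\) grows. The five-item comparison misses the conventional \(0.05\) bar despite a far larger effect.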

172.3 3. Cohen’s d and Standardized Mean Differences

172.3.1 3.1 Definition

The most widely used standardized effect size for a difference in means is Cohen’s \(d\). For two independent groups with means \(\bar{x}_1, \bar{x}_2\) and a pooled standard deviation \(s_p\),

\[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}. \]

The quantity \(d\) expresses the gap between groups in units of standard deviation. A value of \(d = 0.5\) means the group means differ by half a standard deviation. Because \(d\) is dimensionless, it is comparable across experiments that use different metrics or scales, which is exactly what raw mean differences and \(p\) values are not.

For paired designs, such as evaluating two models on the same test items, the natural analogue is the standardized mean of the differences,

\[ d_z = \frac{\bar{D}}{s_D}, \]

which relates to the paired \(t\) statistic by \(t = d_z \sqrt{n}\). This identity makes the large sample trap explicit. The \(t\) statistic blends a scale free effect size \(d_z\) with a sample size factor \(\sqrt{n}\), so a tiny \(d_z\) can still produce an overwhelming \(t\).
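A minimal sketch of the paired effect size and its link to the \(t\) statistic follows. The per-item scores are hypothetical, and `paired_dz` is an illustrative helper name rather than a library function.

```python
import math
from statistics import mean, stdev

def paired_dz(scores_a, scores_b):
    """Standardized mean of paired differences, d_z = mean(D) / sd(D)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / stdev(diffs)  # stdev uses the n - 1 denominator

# Hypothetical per-item scores for two models on the same six items.
model_a = [0.82, 0.79, 0.91, 0.75, 0.88, 0.80]
model_b = [0.80, 0.78, 0.86, 0.76, 0.85, 0.79]

d_z = paired_dz(model_a, model_b)
t = d_z * math.sqrt(len(model_a))  # the identity t = d_z * sqrt(n)
print(f"d_z = {d_z:.3f}, t = {t:.3f}")
```

Evaluating the same two models on more items would leave \(d_z\) roughly where it is while multiplying \(t\) by the square root of the increase in \(n\).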

172.3.2 3.2 Interpreting magnitude

Cohen proposed rough conventions of \(0.2\) (small), \(0.5\) (medium), and \(0.8\) (large), but these were offered reluctantly and are not laws of nature. In ML the meaningful threshold is domain-dependent. A \(d\) of \(0.2\) in a click-through rate experiment serving billions of impressions may translate to substantial revenue, while a \(d\) of \(0.8\) on a toy benchmark may be irrelevant to production. Effect sizes should always be interpreted against a context-specific notion of what difference matters, sometimes formalized as a smallest effect size of interest.

172.3.3 3.3 Small sample bias

Cohen’s \(d\) is biased upward in magnitude when samples are small: on average it overstates the population effect. Hedges proposed a correction factor that yields an approximately unbiased estimator, often called Hedges’ \(g\),

\[ g = d \cdot \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right). \]

The correction is negligible for large \(n\) but matters when comparing models across a handful of random seeds, a regime that is common and underreported in deep learning.

# Cohen's d for two independent groups (pooled standard deviation)
import numpy as np

def cohens_d(x1, x2):
    n1, n2 = len(x1), len(x2)
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)  # unbiased sample variances
    sp = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))  # pooled SD
    return (np.mean(x1) - np.mean(x2)) / sp
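The Hedges correction of Section 3.3 can be sketched in the same way. The version below uses only the Python standard library and restates Cohen's \(d\) so that it runs on its own. The seed-level accuracies are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(x1, x2):
    """Cohen's d with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / math.sqrt(pooled_var)

def hedges_g(x1, x2):
    """Cohen's d multiplied by the small-sample correction factor."""
    n1, n2 = len(x1), len(x2)
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return cohens_d(x1, x2) * correction

# Hypothetical accuracies from five seeds per model.
new_model = [0.912, 0.905, 0.918, 0.909, 0.915]
baseline = [0.901, 0.907, 0.899, 0.904, 0.896]
print(f"d = {cohens_d(new_model, baseline):.3f}, g = {hedges_g(new_model, baseline):.3f}")
```

With five seeds per model the correction factor is \(1 - 3/31 \approx 0.903\), so \(g\) is roughly ten percent smaller than \(d\). That shrinkage is large enough to change how a borderline result reads.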

172.4 4. A Family of Effect Sizes for ML

172.4.1 4.1 Beyond mean differences

Cohen’s \(d\) assumes roughly normal, equal variance data, which performance metrics often violate. Several alternatives are useful.

The probability of superiority, also called the common language effect size or \(A\) statistic, is the chance that a randomly drawn score from model \(A\) exceeds one from model \(B\),

\[ A = P(X_A > X_B) + \tfrac{1}{2} P(X_A = X_B). \]

It is closely tied to the Mann-Whitney \(U\) statistic through \(A = U / (n_A n_B)\), where \(U\) counts the pairs in which the score from \(A\) wins, with ties counted as one half, and it is robust to non-normality. A value of \(0.5\) denotes no difference, and values approaching \(1\) denote near total dominance of model \(A\).
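The definition translates directly into a count over all pairs of scores. The brute-force sketch below costs \(n_A n_B\) comparisons, which is fine for seed-level data but not for millions of items. The function name and the scores are illustrative assumptions.

```python
def prob_superiority(scores_a, scores_b):
    """A = P(X_A > X_B) + 0.5 * P(X_A = X_B), estimated over all pairs."""
    wins = sum(1 for a in scores_a for b in scores_b if a > b)
    ties = sum(1 for a in scores_a for b in scores_b if a == b)
    return (wins + 0.5 * ties) / (len(scores_a) * len(scores_b))

# Hypothetical scores from four runs of each model.
scores_a = [0.91, 0.88, 0.93, 0.90]
scores_b = [0.89, 0.88, 0.87, 0.90]
print(prob_superiority(scores_a, scores_b))  # 0.8125
```

Of the sixteen pairs, twelve are wins for \(A\) and two are ties, giving \((12 + 1)/16 = 0.8125\).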

For association between two categorical variables, such as a design choice and a success or failure outcome, the odds ratio and Cramér’s \(V\) quantify effect magnitude. For correlation between continuous quantities, Pearson’s \(r\) or the coefficient of determination \(r^2\) serve directly as effect sizes, since they are already scale-free.
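For a contingency table of counts, both categorical measures take a few lines. The sketch below computes Cramér’s \(V\) from the uncorrected chi-square statistic and assumes that no cell or margin is zero. The table of counts is hypothetical.

```python
import math

def odds_ratio(table):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]; assumes no zero cells."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def cramers_v(table):
    """Cramér's V from the chi-square statistic of an r x c count table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = sum(
        (table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(len(rows))
        for j in range(len(cols))
    )
    return math.sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))

# Hypothetical counts: rows = with / without data augmentation,
# columns = run converged / run diverged.
table = [[45, 5], [30, 20]]
print(odds_ratio(table), round(cramers_v(table), 3))
```

Here the odds of convergence are six times higher with augmentation, and \(V \approx 0.35\) summarizes the same association on a zero-to-one scale.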

172.4.2 4.2 Confidence intervals as effect size reporting

An effect size point estimate is incomplete without an interval that conveys its uncertainty. A \(95\%\) confidence interval on \(d\) or on the raw performance gap communicates both magnitude and precision in one object. When a \(95\%\) interval for a difference excludes zero, the corresponding two-sided test rejects at the \(5\%\) level, so the interval carries the same verdict as \(p < 0.05\) while additionally showing the plausible range of the effect. Reporting the interval \([0.3, 0.9]\) for a standardized difference is far more informative than reporting “\(p = 0.01\).” For ML metrics with awkward sampling distributions, bootstrap confidence intervals on the effect size are a practical default.

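# Bootstrap percentile CI for a paired accuracy gap
import numpy as np

def bootstrap_gap_ci(diffs, B=10000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    # Resample the paired differences with replacement and record each mean
    stats = [rng.choice(diffs, size=diffs.size, replace=True).mean() for _ in range(B)]
    lo = np.quantile(stats, alpha / 2)
    hi = np.quantile(stats, 1 - alpha / 2)
    return lo, hi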

172.5 5. Effect Sizes in Common ML Settings

172.5.1 5.1 Model comparison across seeds and tasks

Deep learning results vary across random seeds, data orderings, and hardware. A responsible comparison treats each seed as a replication and computes an effect size over the resulting distribution of scores, rather than comparing single runs. When aggregating across multiple benchmark tasks, standardized effect sizes allow a single summary because they remove the differing units and difficulty of each task. This is the same logic that underlies meta-analysis, where heterogeneous studies are combined on a common standardized scale.
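As a sketch of this aggregation, the code below computes one Cohen's \(d\) per task from seed-level scores and then averages across tasks. The task names and scores are hypothetical. The unweighted mean is a simplifying assumption: a full meta-analysis would weight each task by the precision of its estimate.

```python
import math
from statistics import mean, variance

def cohens_d(x1, x2):
    """Cohen's d with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / math.sqrt(pooled_var)

# Hypothetical results: task -> (scores of A, scores of B), one score per seed.
# The three tasks use different metrics on different scales.
results = {
    "task_accuracy": ([0.81, 0.83, 0.80, 0.84], [0.79, 0.80, 0.80, 0.81]),
    "task_f1": ([61.0, 63.5, 62.2, 64.1], [60.2, 61.0, 62.5, 61.9]),
    "task_bleu": ([27.1, 26.8, 27.9, 27.4], [26.9, 26.1, 27.2, 27.5]),
}

per_task = {name: cohens_d(a, b) for name, (a, b) in results.items()}
for name, d in per_task.items():
    print(f"{name}: d = {d:.2f}")
print(f"unweighted mean d = {mean(per_task.values()):.2f}")
```

Because \(d\) divides by the spread across seeds, rescaling a task's metric, say from fractions to percentages, leaves its contribution to the summary unchanged.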

172.5.2 5.2 A/B testing and online experiments

In production experimentation the effect size is typically the lift in a business metric, expressed as an absolute or relative difference with a confidence interval. Here practical significance is paramount, because shipping a change carries cost. Teams define a minimum detectable effect in advance, which sets the required sample size through a power analysis, and they refuse to ship changes whose interval, although excluding zero, falls below the practically meaningful threshold.
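The shipping rule can be written as a comparison between the interval and a threshold fixed in advance. The sketch below uses a Wald interval for the absolute lift in a conversion rate. The three-way decision rule, the threshold, and the traffic numbers are illustrative assumptions rather than a standard taken from any particular experimentation platform.

```python
import math
from statistics import NormalDist

def lift_ci(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Wald interval for the absolute lift in conversion rate (treatment - control)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lift = p_t - p_c
    return lift, lift - z * se, lift + z * se

def decide(lo, hi, min_effect):
    """Compare the interval with a practical threshold fixed in advance."""
    if lo >= min_effect:
        return "ship"
    if hi < min_effect:
        return "do not ship: effect below practical threshold"
    return "inconclusive: collect more data"

# Hypothetical experiment with two million users per arm.
lift, lo, hi = lift_ci(conv_c=200_000, n_c=2_000_000, conv_t=202_000, n_t=2_000_000)
print(f"lift = {lift:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
print(decide(lo, hi, min_effect=0.005))
```

In this example the interval excludes zero, so the lift is statistically significant, yet its upper end sits well below the half-point threshold and the change is not shipped.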

172.5.3 5.3 Fairness and disparity

When measuring whether a model treats groups differently, the effect size is the standardized gap in error rates, selection rates, or calibration between groups. A statistically significant disparity on a huge audit set may be operationally negligible, while a large standardized disparity on a small subgroup may demand action despite weak significance. Effect sizes give regulators and practitioners a magnitude to reason about rather than a binary verdict.

172.6 6. Power, Sample Size, and Planning

Effect sizes are not only a reporting tool; they drive experimental design. Statistical power, the probability of detecting a true effect of a given size, depends jointly on the effect size \(\delta\), the sample size \(n\), and the chosen significance level \(\alpha\). For a two-sided test under a two-sample \(z\) approximation, the required per-group size to achieve power \(1 - \beta\) is

\[ n \approx \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}, \]

where \(\delta\) is the standardized effect size of interest. The inverse square dependence on \(\delta\) shows that detecting small effects is expensive. Planning an evaluation around the smallest effect worth detecting, rather than collecting as much data as possible and reporting whatever turns out significant, is the disciplined path. It also guards against the underpowered studies whose significant results are inflated and frequently fail to replicate.
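The formula is easy to evaluate with the standard normal quantile function. The sketch below rounds up to the next whole observation. The defaults of \(\alpha = 0.05\) and power \(0.80\) are common conventions assumed here, not values prescribed by this chapter.

```python
import math
from statistics import NormalDist

def n_per_group(delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2)

for delta in (0.8, 0.5, 0.2, 0.05):
    print(f"delta = {delta:<4}  n per group = {n_per_group(delta)}")
```

Under these defaults a medium effect of \(\delta = 0.5\) needs 63 observations per group, while a small effect of \(\delta = 0.2\) needs 393. Halving \(\delta\) roughly quadruples the cost.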

172.7 7. Reporting Effect Sizes Alongside p Values

The goal is not to abolish \(p\) values but to contextualize them. A complete empirical claim in ML should report, at minimum, the point estimate of the effect, a confidence interval, the sample size, and the \(p\) value if a test is performed. The American Statistical Association and the American Psychological Association both now urge authors to present effect sizes and intervals as standard practice rather than significance alone.

A concrete template for a model comparison reads as follows. Model \(A\) improved top-1 accuracy over model \(B\) by \(1.8\) points, \(95\%\) CI \([0.6, 3.0]\), standardized \(d_z = 0.70\), across \(20\) seeds, \(p \approx 0.005\). The figures are mutually consistent, since \(t = d_z \sqrt{n} \approx 3.1\) on \(19\) degrees of freedom. This single sentence tells the reader the magnitude, the uncertainty, the replication count, and the detectability. Compare that with the impoverished alternative, “\(A\) significantly outperformed \(B\) (\(p < 0.05\)),” which leaves every practical question unanswered.

Three habits make reporting trustworthy. First, fix the smallest effect of interest before seeing results, so that significance is judged against a meaningful bar. Second, prefer intervals to point verdicts, because intervals expose precision. Third, when many comparisons are run, adjust for multiplicity and report effect sizes for all of them, not only the survivors, to avoid the selection bias that turns noise into apparent discovery.
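The third habit can be sketched with the Holm step-down procedure, which is one common multiplicity adjustment among several and is chosen here only for illustration. The comparison names, effect sizes, and raw \(p\) values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

# Hypothetical comparisons: (name, effect size, raw p value).
comparisons = [
    ("lr_warmup", 0.61, 0.003),
    ("label_smooth", 0.35, 0.020),
    ("mixup", 0.30, 0.030),
    ("dropout_0.2", 0.05, 0.600),
]
adj = holm_adjust([p for _, _, p in comparisons])
for (name, d, p), p_adj in zip(comparisons, adj):
    print(f"{name:<13} d = {d:.2f}  p = {p:.3f}  p_holm = {p_adj:.3f}")
```

Every comparison appears in the output with its effect size, including those that do not survive adjustment, so the reader sees the whole distribution of effects rather than a filtered subset.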

172.8 8. Conclusion

Statistical significance and practical significance are different questions, and conflating them has produced a literature where vanishingly small differences are declared victories and genuinely large effects are dismissed for want of data. Effect sizes, whether Cohen’s \(d\), Hedges’ \(g\), the probability of superiority, or a simple metric lift with a confidence interval, restore magnitude to the center of empirical reasoning. The standardized measures are scale-free and comparable across studies, and all of them can be interpreted directly against the cost of a deployment decision. The practice this chapter recommends is simple to state and powerful in effect. Always report how large, how uncertain, and only then how surprising.

172.9 References

  1. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  2. Wasserstein, R. L., and Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2). https://doi.org/10.1080/00031305.2016.1154108
  3. Sullivan, G. M., and Feinn, R. (2012). Using Effect Size, or Why the P Value Is Not Enough. Journal of Graduate Medical Education, 4(3). https://doi.org/10.4300/JGME-D-12-00156.1
  4. Hedges, L. V. (1981). Distribution Theory for Glass’s Estimator of Effect Size and Related Estimators. Journal of Educational Statistics, 6(2). https://doi.org/10.3102/10769986006002107
  5. Lakens, D. (2013). Calculating and Reporting Effect Sizes to Facilitate Cumulative Science. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00863
  6. Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7. https://www.jmlr.org/papers/v7/demsar06a.html
  7. Bouthillier, X., et al. (2021). Accounting for Variance in Machine Learning Benchmarks. Proceedings of Machine Learning and Systems (MLSys). https://proceedings.mlsys.org/paper/2021/hash/cfecdb276f634854f3ef915e2e980c31-Abstract.html
  8. McGraw, K. O., and Wong, S. P. (1992). A Common Language Effect Size Statistic. Psychological Bulletin, 111(2). https://doi.org/10.1037/0033-2909.111.2.361
  9. Kohavi, R., Tang, D., and Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge University Press. https://experimentguide.com