9 The Scientific Method in AI Research

9.1 1. Introduction: Machine Learning as an Empirical Science

Machine learning occupies an unusual position among the sciences. It borrows the deductive machinery of mathematics, the engineering culture of computer systems, and the empirical posture of the natural sciences, yet it is reducible to none of them. A theorem about the convergence of stochastic gradient descent tells us little about whether a particular transformer will generalize to a new distribution of documents. The decisive questions in modern AI research are answered not by proof but by measurement: does this method achieve a lower error than that one, under conditions we can defend as fair, with a margin we can defend as real?

This chapter argues that machine learning is best understood as an empirical science, and that the discipline of empirical science (controlled comparison, falsifiable claims, honest accounting of uncertainty) is the single most important intellectual asset a practitioner can cultivate. The argument matters because the field has repeatedly mistaken impressive engineering for scientific knowledge. Models that top leaderboards have turned out to exploit annotation artifacts (1). Reported gains have evaporated under reproduction (2). Comparisons that looked decisive have collapsed once baselines were tuned with equal care (3). These failures are not the result of fraud or incompetence. They are the predictable consequence of doing experimental science without the safeguards that experimental science has spent four centuries developing.

We proceed from the philosophy of empirical claims to the concrete machinery of the field: baselines and ablations, the partitioning of data, reproducibility, benchmark design, statistical inference, and the catalogue of methodological errors that recur with depressing regularity. Throughout, the unifying idea is the one drawn in Figure 9.1: a claim earns belief only by surviving a sequence of deliberate opportunities to fail.

flowchart TD
    A["Operationalize a falsifiable hypothesis"] --> B["Build strong matched baselines"]
    B --> C["Train with multiple seeds"]
    C --> D["Select and tune on validation data"]
    D --> E["Run ablations to attribute the effect"]
    E --> F["Evaluate once on held-out test data"]
    F --> G["Quantify uncertainty and effect size"]
    G --> H{"Survives every check"}
    H -->|"No"| A
    H -->|"Yes"| I["Report with code, seeds, and configs"]

Figure 9.1: The empirical loop in machine learning research. Each stage is a deliberate opportunity for a claim to fail.

9.2 2. Hypotheses and Falsifiability in Machine Learning

9.2.1 2.1 What a Hypothesis Looks Like in ML

The Popperian criterion holds that a scientific claim must be falsifiable: there must exist some observable outcome that, if it occurred, would count as evidence against the claim (4). In machine learning a well-formed hypothesis is rarely “our model is good.” It is a conditional, comparative, and quantitative statement: “adding a recurrence mechanism to architecture A reduces perplexity on long-context language modeling relative to A without recurrence, holding parameter count and training data fixed.” This formulation specifies the intervention (the recurrence mechanism), the metric (perplexity), the comparison (A with versus without), and the controlled variables (parameters, data). Each of these can be wrong, and each can be checked.

It helps to write the hypothesis as a statement about an unknown population quantity. Let $\theta_A$ and $\theta_B$ denote the expected test metric of methods $A$ and $B$, where the expectation is taken over the randomness of training (initialization, data order, augmentation, nondeterministic hardware) and over the draw of the test set from the target distribution. The scientific claim is a statement about the sign and size of the estimand

\[ \Delta \;=\; \theta_B - \theta_A , \]

for example “$\Delta < 0$ for a loss metric” (method $B$ improves on $A$). A single training run produces only a noisy estimate $\hat\Delta$ of $\Delta$, and the entire apparatus of the rest of this chapter exists to keep the gap between $\hat\Delta$ and $\Delta$ honest.

9.2.2 2.2 The Drift Toward Unfalsifiable Claims

Much of the rhetoric surrounding large models trends toward the unfalsifiable. Statements such as “the model understands language” or “the system reasons” resist refutation because no agreed measurement attaches to them. A productive research culture replaces such claims with operational proxies: a model “reasons” to the extent that it solves a held-out set of multi-step problems whose surface form differs from anything in training. The proxy is imperfect, and saying so is part of the science, but it is falsifiable, and that is what makes it useful. The practitioner’s habit should be to translate every grand claim into a measurable one before believing it, including their own claims.

9.3 3. Baselines and Ablations: The Core of Controlled Comparison

9.3.1 3.1 Why Baselines Carry the Argument

A result in isolation conveys almost no information. Knowing that a model reaches 92 percent accuracy is meaningless until we know what a simpler approach achieves on the same task. The baseline is the control condition, and the strength of an empirical claim is bounded by the strength of the baseline it defeats. A recurring pathology in published work is the weak baseline: the proposed method is tuned extensively while the comparison is taken untuned from an old paper, or implemented carelessly. When researchers have revisited such comparisons with equally tuned baselines, the reported advantage of elaborate methods has frequently shrunk or vanished (3, 5).

The discipline here is symmetric effort. Every hour of hyperparameter search, every architectural refinement, and every data-cleaning step applied to the proposed method must be matched, as nearly as possible, for the baseline. A simple, well-tuned baseline that the new method genuinely beats is far more persuasive than an exotic competitor that the new method beats by accident of unequal effort. A useful rule of thumb is the hierarchy of baselines: a trivial baseline (the majority-class predictor, or a constant), a classical baseline (logistic regression, gradient-boosted trees, a nearest-neighbor retriever), and the strongest published prior method, each tuned with the same budget as the proposed method. A new method should be measured against all three, because beating only the trivial baseline establishes almost nothing.
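The bottom rung of the hierarchy is cheap enough to compute before anything else. The following is a minimal sketch, using hypothetical toy labels and a hypothetical helper name `majority_class_accuracy`, of how the trivial baseline sets the floor against which a headline accuracy should be read.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)

# Hypothetical, imbalanced toy labels (not taken from any real dataset).
train_labels = ["neg"] * 90 + ["pos"] * 10
test_labels = ["neg"] * 45 + ["pos"] * 5

floor = majority_class_accuracy(train_labels, test_labels)
print(f"trivial baseline accuracy: {floor:.2f}")
```

On data this imbalanced the constant predictor already reaches 90 percent, so a model reporting 92 percent has improved on the floor by only two points.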

9.3.2 3.2 Ablations as Causal Attribution

An ablation study removes or alters one component of a system at a time to isolate that component’s contribution. It is the closest thing machine learning has to a controlled experiment in the laboratory sense. If a method combines a new loss function, a new data augmentation, and a new optimizer, the headline result tells us only that the combination works. The ablation tells us which ingredient mattered. Without it, the field accumulates complicated recipes whose active ingredients are unknown, and subsequent researchers inherit cargo cult components that contribute nothing. A good ablation answers the counterfactual: if this piece were absent, what would happen, all else equal?

Two cautions sharpen the practice. First, components can interact, so the effect of removing one piece may depend on the presence of another. A single-knockout ablation (remove one piece from the full system) and a single-addition ablation (add one piece to the bare baseline) can disagree, and the disagreement is itself informative about interactions. Second, every ablation cell is a measurement subject to the same seed variance as the headline number, so an ablation table built from single runs can mislead exactly as a single headline number can. Ablations deserve the same uncertainty quantification described in Section 9.7.
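The disagreement between the two ablation styles is easy to see on a small table. The sketch below assumes a hypothetical system with two components, `new_loss` and `new_aug`, and invented scores chosen so that the components help only in combination; in practice each cell would be a mean over several seeds rather than a single number.

```python
# Hypothetical validation scores for every on/off combination of two components.
scores = {
    frozenset(): 80.0,
    frozenset({"new_loss"}): 80.2,
    frozenset({"new_aug"}): 80.1,
    frozenset({"new_loss", "new_aug"}): 83.0,
}
components = ["new_loss", "new_aug"]
full = frozenset(components)
bare = frozenset()

def knockout_effect(component):
    """Score lost when one component is removed from the full system."""
    return scores[full] - scores[full - {component}]

def addition_effect(component):
    """Score gained when one component is added to the bare baseline."""
    return scores[bare | {component}] - scores[bare]

for c in components:
    print(f"{c}: knockout {knockout_effect(c):+.1f}, addition {addition_effect(c):+.1f}")
```

With these invented numbers the knockout view credits each component with nearly three points while the addition view credits each with a fraction of a point, which is the signature of an interaction.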

9.4 4. Train, Validation, and Test Discipline

9.4.1 4.1 The Three-Way Partition and What Each Set Is For

The partition of data into training, validation, and test sets encodes a simple epistemic principle: a claim about generalization can only be tested on data that played no role in producing the model. The training set fits parameters. The validation set guides choices made by the researcher, including architecture, hyperparameters, early stopping, and model selection. The test set estimates performance on genuinely unseen data, and it can serve that purpose only if it is consulted once, at the end, after all decisions are frozen (6).
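As a minimal sketch of the partition, assuming examples that are independent and identically distributed (grouped or time-ordered data would need a different scheme), the hypothetical helper `three_way_split` below shuffles once with a fixed seed and cuts the data into three disjoint sets. The fractions and the seed are illustrative choices.

```python
import random

def three_way_split(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Integers stand in for real examples here.
train, val, test = three_way_split(range(1000), seed=13)
print(len(train), len(val), len(test))
```

Recording the seed alongside the split makes the partition itself reproducible, so that later runs cannot silently move test examples into training.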

9.4.2 4.2 How the Test Set Leaks, and the Mathematics of Optimistic Selection

The test set’s protective value decays with every glance. If a researcher evaluates many variants on the test set and reports the best, the reported number is an optimistic estimate, because the maximum over noisy measurements is biased upward. This is test-set tuning, and it is one of the most common ways that honest researchers fool themselves.

The bias is not a vague worry; it can be quantified. Suppose a researcher evaluates $k$ models on the test set, and each model’s measured score is its true score plus independent noise. Model the measured scores as $X_1, \dots, X_k$, drawn independently from a distribution with mean $\mu$ and standard deviation $\sigma$, where $\mu$ is the true performance level and $\sigma$ captures finite-test-set and training noise. Reporting the best means reporting $M_k = \max_i X_i$. The expected reported score exceeds the truth, and the gap grows with the number of variants tried:

\[ \mathbb{E}[M_k] - \mu \;\ge\; 0, \qquad \text{and is increasing in } k . \]

For the Gaussian case $X_i \sim \mathcal{N}(\mu, \sigma^2)$, the expected maximum is bounded above, and for large $k$ approximated to leading order, by

\[ \mathbb{E}[M_k] \;\le\; \mu + \sigma\sqrt{2 \ln k}. \]

The selection bias therefore grows like $\sqrt{\ln k}$. The bound evaluates to about $2.1\sigma$ for $k = 10$ and $3.0\sigma$ for $k = 100$; the exact Gaussian expectations are somewhat smaller, roughly $1.5\sigma$ and $2.5\sigma$. If $\sigma$ (the run-to-run noise) is comparable to the differences researchers cite as evidence, then a few rounds of test-set peeking can manufacture an entirely spurious “improvement.” The validation set exists precisely to absorb this selection pressure so the test set need not. The discipline is uncomfortable but non-negotiable: decisions are made on validation data, and the test set is touched once.
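A quick simulation can make the growth of the selection bias concrete. The sketch below, assuming standard normal noise ($\mu = 0$, $\sigma = 1$) and a hypothetical helper `expected_max_gaussian`, estimates $\mathbb{E}[M_k]$ by Monte Carlo and prints it next to $\sqrt{2 \ln k}$. Because that expression is an asymptotic upper bound on the Gaussian maximum, the simulated values should sit somewhat below it at small $k$.

```python
import math
import random
import statistics

def expected_max_gaussian(k, trials=10000, seed=0):
    """Monte Carlo estimate of E[max of k independent standard normal draws]."""
    rng = random.Random(seed)
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(k)) for _ in range(trials)]
    return statistics.fmean(maxima)

for k in (1, 10, 100):
    simulated = expected_max_gaussian(k)
    bound = math.sqrt(2 * math.log(k))
    print(f"k={k:>3}: simulated bias {simulated:.2f} sigma, sqrt(2 ln k) = {bound:.2f}")
```

Every model in the simulation has the same true score, so any positive number printed is pure selection bias.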

When a benchmark is reused by thousands of researchers over years, even a community that individually respects this rule can collectively overfit the public test set, because the same $\max$-over-many-attempts dynamic operates at the level of the field rather than the individual. This adaptive-overfitting concern motivates the practice of constructing fresh test distributions to recheck conclusions, as in the reconstruction of new ImageNet test sets, where absolute accuracies dropped but the ranking of models was largely preserved, suggesting genuine progress under inflated absolute numbers (7).

9.5 5. Reproducibility and the Reproducibility Crisis

9.5.1 5.1 Degrees of Reproducibility

It is useful to distinguish reproducibility (the same team or others obtaining consistent results with the same code and data) from replicability (independent teams reaching the same conclusion with their own implementations). The former checks that a result is real given the artifacts; the latter checks that the conclusion is robust to the inevitable variation in how science is done. Machine learning has documented failures of both. Studies attempting to reproduce reinforcement learning results found that performance depended heavily on random seeds, undocumented code details, and hyperparameters not reported in the original papers (2). Surveys of recommendation systems and information retrieval found that many proposed neural methods failed to beat properly tuned classical baselines once reproduced carefully (5).

9.5.2 5.2 Sources of Irreproducibility

The causes are mundane and therefore fixable. Unreported hyperparameters, undisclosed preprocessing, non-deterministic hardware behavior, missing random seeds, selective reporting of favorable runs, and dependence on private data all break the chain from claim to verification. The remedies are equally mundane: release code and configuration, specify the computing environment, fix and report seeds, document data provenance and splits, and report the full distribution of outcomes rather than a single lucky number. Reproducibility checklists adopted by major conferences formalize these expectations and have measurably improved reporting practices (8).

The open-source ecosystem supplies mature tooling for each remedy, and using it costs little. Version control (Git) pins the exact code; environment specifications (a requirements.txt or environment.yml, or a container image) pin the dependency graph; experiment trackers and data versioning tools record the configuration and metrics of every run; and a fixed, reported seed plus deterministic flags makes a single run reproducible bit for bit where the hardware allows. None of this is exotic, and all of it is free. The barrier is discipline, not cost.
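A minimal sketch of the seed-and-record habit, using only the Python standard library, might look as follows. The helper name `start_run` and the configuration keys are hypothetical. Numerical libraries and deep learning frameworks keep their own random state and determinism settings, which would need to be seeded and recorded in the same way.

```python
import json
import platform
import random
import sys

def start_run(config, seed):
    """Seed the stdlib RNG and return a record describing the run."""
    random.seed(seed)
    return {
        "seed": seed,
        "config": config,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

# Hypothetical configuration values, for illustration only.
record = start_run({"learning_rate": 3e-4, "batch_size": 64}, seed=1234)
first_draw = random.random()
print(json.dumps(record, indent=2, sort_keys=True))
```

Writing this record next to the metrics of every run means a surprising number can later be traced to the exact configuration and seed that produced it.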

9.6 6. Benchmark Design and Goodhart’s Law

9.6.1 6.1 The Function and the Failure of Benchmarks

Shared benchmarks have driven much of the field’s progress by making competing methods comparable on common ground. A benchmark coordinates a community, focuses effort, and renders claims checkable. Yet the very property that makes a benchmark useful, its role as a fixed target, makes it vulnerable to Goodhart’s law: when a measure becomes a target, it ceases to be a good measure (9). Optimization pressure flows to whatever the metric rewards, including shortcuts that satisfy the metric without delivering the underlying capability the metric was meant to track.

9.6.2 6.2 Shortcuts and Artifacts

The literature is full of cases where models achieved high benchmark scores by exploiting spurious correlations rather than solving the intended task. Natural language inference models learned that the presence of negation words predicted a label, independent of meaning (1). Visual question answering systems answered without looking at the image, exploiting language priors in the question distribution (10). These are not model failures so much as benchmark failures: the dataset permitted a shortcut, and gradient descent, being an excellent shortcut finder, took it. Robust benchmark design therefore demands adversarial scrutiny of the data itself, construction of challenge sets that defeat known shortcuts, and periodic retirement of saturated benchmarks in favor of harder successors.

A practical diagnostic follows from the shortcut framing: if a model with the input partially or wholly removed still scores far above chance, the benchmark contains a shortcut. A visual question answering model that answers well from the question alone, or a natural language inference model that classifies from the hypothesis alone, has revealed an artifact in the data rather than a capability in the model. Running such input-ablation probes before trusting a benchmark is cheap insurance.
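The probe can be sketched on a toy inference dataset. Everything below is hypothetical: the six examples are invented and deliberately built to contain a negation artifact, and `hypothesis_only_predict` is a hand-written rule standing in for a trained partial-input model. The point is only the comparison between partial-input accuracy and the majority-class rate.

```python
from collections import Counter

NEGATION_WORDS = {"no", "not", "nobody", "never"}

def hypothesis_only_predict(hypothesis):
    """Guess the label from the hypothesis alone, ignoring the premise."""
    words = set(hypothesis.lower().replace(".", "").split())
    return "contradiction" if words & NEGATION_WORDS else "entailment"

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical toy data: (premise, hypothesis, label).
data = [
    ("A dog runs in a park.", "An animal is outside.", "entailment"),
    ("A dog runs in a park.", "The dog is not moving.", "contradiction"),
    ("Two men cook dinner.", "People are preparing food.", "entailment"),
    ("Two men cook dinner.", "Nobody is cooking.", "contradiction"),
    ("A child reads a book.", "A child is reading.", "entailment"),
    ("A child reads a book.", "The child never reads.", "contradiction"),
]
labels = [label for _, _, label in data]
partial_acc = accuracy([hypothesis_only_predict(h) for _, h, _ in data], labels)
chance = Counter(labels).most_common(1)[0][1] / len(labels)
print(f"hypothesis-only accuracy {partial_acc:.2f} vs majority-class rate {chance:.2f}")
```

A large gap between the two numbers, as in this contrived case, indicates that the labels are predictable without the premise and that the dataset rather than the model deserves scrutiny.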

9.7 7. Statistical Significance and Variance Across Seeds

9.7.1 7.1 A Single Number Is Not a Result

A deep network’s outcome depends on random initialization, data ordering, augmentation sampling, and non-deterministic parallel computation. Re-running the identical configuration with a different random seed can shift the reported metric by an amount comparable to the differences researchers cite as evidence of method superiority. Reporting a single run is therefore reporting a single sample from a distribution while pretending to report the distribution’s mean. The minimal honest practice is to train each configuration multiple times with different seeds and report a central tendency together with a measure of spread, such as the mean and standard deviation or, better, a confidence interval (11).
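As a minimal sketch of that reporting habit, the hypothetical helper `summarize_runs` below reduces a list of per-seed scores to a mean, a sample standard deviation, and a standard error. The five accuracies are invented for illustration.

```python
import statistics

def summarize_runs(scores):
    """Mean, sample standard deviation, and standard error across seeds."""
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    return {"n": len(scores), "mean": mean, "std": std, "sem": std / len(scores) ** 0.5}

# Hypothetical accuracies from five seeds of one configuration.
accuracies = [91.2, 90.7, 91.5, 90.9, 91.1]
summary = summarize_runs(accuracies)
print(f"{summary['mean']:.2f} +/- {summary['std']:.2f} (n={summary['n']})")
```

Reporting the number of seeds alongside the spread lets a reader judge how much weight the mean can bear.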

It is worth being explicit about where the variance comes from, because the sources call for different remedies. The total variance of a reported metric decomposes, to a first approximation, into independent contributions:

\[ \operatorname{Var}(\hat\theta) \;\approx\; \underbrace{\sigma^2_{\text{init}}}_{\text{weight initialization}} + \underbrace{\sigma^2_{\text{order}}}_{\text{data ordering and sampling}} + \underbrace{\sigma^2_{\text{hw}}}_{\text{nondeterministic compute}} + \underbrace{\sigma^2_{\text{data}}}_{\text{finite test set}} . \]

Fixing a single seed collapses the first two terms, and the third as well when deterministic computation is enforced, but does not remove them from reality; it merely hides them, and a method that wins only at one seed has not been shown to win. The finite-test-set term $\sigma^2_{\text{data}}$ persists no matter how the model is trained and shrinks only with a larger or more representative test set. Estimating these components, rather than reporting one number, is what turns an anecdote into a measurement (11).

9.7.2 7.2 Comparing Distributions, Not Points

Once each method is represented by a distribution of outcomes, comparison becomes a statistical question rather than a reading of two numbers. The estimand is $\Delta = \theta_B - \theta_A$ from Section 9.2.1. With $n$ paired runs (same seeds for both methods), a paired comparison of the per-run differences $d_i = X^B_i - X^A_i$ gives an estimate $\bar d$ and a confidence interval

\[ \bar d \;\pm\; t_{1-\alpha/2,\,n-1}\,\frac{s_d}{\sqrt{n}}, \]

where $s_d$ is the sample standard deviation of the differences. Pairing on seeds removes the shared initialization-and-ordering variance from $s_d$ and so tightens the interval, which is why paired designs are preferred when feasible. When distributional assumptions are doubtful or several datasets are compared at once, nonparametric alternatives such as the Wilcoxon signed-rank test and the Friedman test with appropriate post hoc procedures are the standard tools (12).
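The benefit of pairing can be checked numerically. The sketch below uses invented scores in which both methods rise and fall together across seeds, and compares the standard error of the mean paired difference with the standard error obtained by treating the two samples as independent. The helper names are hypothetical.

```python
import statistics

def standard_error_paired(a_scores, b_scores):
    """Standard error of the mean per-seed difference (paired design)."""
    diffs = [b - a for a, b in zip(a_scores, b_scores)]
    return statistics.stdev(diffs) / len(diffs) ** 0.5

def standard_error_unpaired(a_scores, b_scores):
    """Standard error of the difference in means, ignoring the pairing."""
    var_a = statistics.variance(a_scores) / len(a_scores)
    var_b = statistics.variance(b_scores) / len(b_scores)
    return (var_a + var_b) ** 0.5

# Hypothetical scores: index i uses the same seed for both methods.
a_scores = [90.1, 91.4, 89.8, 91.0, 90.5]
b_scores = [90.5, 91.9, 90.1, 91.3, 91.0]

print(f"paired SE:   {standard_error_paired(a_scores, b_scores):.3f}")
print(f"unpaired SE: {standard_error_unpaired(a_scores, b_scores):.3f}")
```

When the seed-level fluctuations are shared, as they are in these invented numbers, the paired standard error is several times smaller, and the interval narrows accordingly.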

Two cautions apply. First, statistical significance is not practical significance: with enough runs a trivial difference becomes significant, so the effect size $\Delta$ and its confidence interval matter as much as any $p$-value. Second, multiple comparisons inflate false positives. If a method is tested against a baseline on $m$ independent benchmarks, each at significance level $\alpha$, the probability of at least one spurious “win” by chance is $1 - (1-\alpha)^m$, which for $\alpha = 0.05$ and $m = 20$ is about $0.64$. A Bonferroni correction tests each comparison at $\alpha/m$ to hold the family-wise error rate at $\alpha$; less conservative procedures control the false discovery rate instead. The goal is not ritual hypothesis testing but an honest account of whether an observed gap could plausibly be noise.
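The family-wise arithmetic is short enough to verify directly. The sketch below assumes independent tests, as the text does, and uses hypothetical helper names.

```python
def family_wise_error(alpha, m):
    """P(at least one false positive) over m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_level(alpha, m):
    """Per-comparison level that keeps the family-wise rate at or below alpha."""
    return alpha / m

alpha, m = 0.05, 20
print(f"uncorrected family-wise rate: {family_wise_error(alpha, m):.3f}")
print(f"per-test level after Bonferroni: {bonferroni_level(alpha, m):.4f}")
print(f"corrected family-wise rate: {family_wise_error(bonferroni_level(alpha, m), m):.3f}")
```

The corrected rate lands slightly below $\alpha$ rather than exactly on it, which is the sense in which Bonferroni is conservative.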

9.7.3 7.3 A Worked Example

Suppose method $B$ is claimed to beat baseline $A$ on a classification task. The authors report a single run each: $A$ scores $91.0$ percent and $B$ scores $91.8$ percent, and they conclude $B$ is better by $0.8$ points. A careful reviewer asks for the distribution.

Re-running both methods with $n = 10$ seeds each, paired, yields per-seed accuracy differences $d_i = X^B_i - X^A_i$ with sample mean $\bar d = 0.30$ points and sample standard deviation $s_d = 0.50$ points. The standard error is $s_d / \sqrt{n} = 0.50 / \sqrt{10} \approx 0.158$ points. The two-sided $95$ percent interval uses $t_{0.975, 9} \approx 2.262$, giving

\[ 0.30 \;\pm\; 2.262 \times 0.158 \;=\; 0.30 \pm 0.36 \;=\; [-0.06,\; 0.66] \text{ points}. \]

The interval includes zero, so at the $5$ percent level the data do not support a claim that $B$ beats $A$, even though the point estimate is positive and the original single-run gap of $0.8$ points looked convincing. The single-run comparison was dominated by seed noise: it happened to pair a lucky $B$ with an unlucky $A$. The same arithmetic run with $\bar d = 0.80$ and $s_d = 0.30$ would instead give $[0.59, 1.01]$, an interval comfortably above zero, and the claim would stand. The discipline does not predetermine the verdict; it simply forces the verdict to depend on the spread as well as the gap. (The numbers in this example are illustrative.)
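The arithmetic of both scenarios can be reproduced from the summary statistics alone. The sketch below takes the critical value $t_{0.975,9} \approx 2.262$ from the text as a constant rather than computing it, and the helper name `t_interval_from_summary` is hypothetical.

```python
import math

def t_interval_from_summary(mean_diff, sd_diff, n, t_crit):
    """Two-sided interval: mean_diff +/- t_crit * sd_diff / sqrt(n)."""
    half_width = t_crit * sd_diff / math.sqrt(n)
    return mean_diff - half_width, mean_diff + half_width

T_CRIT_95_DF9 = 2.262  # critical value quoted in the text for n = 10

for mean_diff, sd_diff in [(0.30, 0.50), (0.80, 0.30)]:
    low, high = t_interval_from_summary(mean_diff, sd_diff, 10, T_CRIT_95_DF9)
    verdict = "excludes zero" if low > 0 or high < 0 else "includes zero"
    print(f"mean {mean_diff:.2f}, sd {sd_diff:.2f}: [{low:.2f}, {high:.2f}] {verdict}")
```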

9.8 8. Fair Comparison

Fairness in comparison means that competing methods differ only in the dimension under study. The confounds are numerous. Differences in parameter count, training compute, data quantity, data quality, tokenization, hyperparameter search budget, and even software framework can all masquerade as method effects. A new architecture that is given more parameters or a longer training schedule than its baseline has not been shown to be better architecture; it has been shown to consume more resources. Compute-matched and parameter-matched comparisons isolate the variable of interest. Equally important is equal tuning budget: the proposed method and every baseline should receive comparable hyperparameter optimization, ideally under an explicit and reported search protocol, so that the comparison reflects the methods rather than the researcher’s differential investment of effort.

A concrete way to make tuning budget explicit is to report performance as a function of the number of hyperparameter configurations tried, rather than as a single tuned number. A method that reaches a good score after one configuration is more valuable, and more honestly compared, than one that reaches the same score only after hundreds, and a curve of best-so-far performance against search budget exposes that difference where a single tuned number conceals it.
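The best-so-far curve is a running maximum over the validation scores in the order the configurations were tried. The sketch below uses invented scores for two hypothetical methods; a fuller analysis would average the curve over random orderings of the trials, which this sketch does not attempt.

```python
from itertools import accumulate

def best_so_far(validation_scores):
    """Running maximum of validation scores in the order configs were tried."""
    return list(accumulate(validation_scores, max))

# Hypothetical validation accuracies, one per hyperparameter configuration tried.
method_a = [88.0, 88.4, 88.3, 88.5, 88.5]
method_b = [71.0, 80.2, 86.9, 88.1, 88.5]

for name, scores in [("A", method_a), ("B", method_b)]:
    print(name, best_so_far(scores))
```

Both methods end at the same tuned score, but the curves show that the first reaches it almost immediately while the second needs the full budget.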

9.9 9. Common Methodological Errors

9.9.1 9.1 Data Leakage

Data leakage occurs when information from outside the training set, especially from the test set, contaminates the training process, producing inflated and irreproducible performance. It takes many forms: preprocessing statistics such as normalization constants computed over the full dataset before splitting, duplicate or near-duplicate examples shared across splits, temporal leakage in which future information predicts the past, and feature leakage in which a predictor encodes the target. A broad review across scientific fields that adopted machine learning found leakage to be a pervasive cause of overoptimistic and non-replicable results (13). The defense is to define splits first and to ensure that every fitted quantity, including preprocessing, is derived only from training data. In a cross-validation setting this means the entire preprocessing pipeline (imputation, scaling, feature selection, resampling) must be fitted inside each fold rather than once over the whole dataset, a discipline that mature pipeline abstractions in open-source libraries are designed to enforce.
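The rule that every fitted quantity comes from training data can be illustrated with the simplest preprocessing step, standardization. The sketch below uses a hypothetical one-dimensional feature and hypothetical helper names; it fits the mean and standard deviation on the training split only and then applies them unchanged to the test split.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate mean and standard deviation from training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_standardizer(values, mean, std):
    """Apply previously fitted statistics; never refit on validation or test."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values; the test split has a shifted range.
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
test_feature = [10.0, 11.0]

mean, std = fit_standardizer(train_feature)
train_scaled = apply_standardizer(train_feature, mean, std)
test_scaled = apply_standardizer(test_feature, mean, std)

# The leaky alternative would compute the statistics over train + test together.
leaky_mean, _ = fit_standardizer(train_feature + test_feature)
print(f"train-only mean {mean:.2f}, leaky mean {leaky_mean:.2f}")
```

The gap between the two means is information about the test split that the leaky pipeline would have passed into training.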

9.9.2 9.2 Test-Set Tuning and Cherry-Picking

Two related errors corrupt the inference from result to claim. Test-set tuning, analyzed quantitatively in Section 9.4.2, lets repeated evaluation on held-out data quietly turn it into a second validation set, biasing the estimate upward by as much as $\sigma\sqrt{2\ln k}$ after $k$ peeks under the Gaussian model used there. Cherry-picking selects favorable outcomes after the fact: the best of many seeds, the subset of benchmarks where the method wins, the qualitative examples that flatter the system. Both errors share a structure, namely selection after observing results, and both are countered by pre-registration of the evaluation plan, reporting of all runs and all benchmarks attempted, and separation of exploratory analysis from confirmatory claims. The exploratory phase is where hypotheses are generated by looking; the confirmatory phase tests a frozen hypothesis on untouched data, and only the latter supports a published claim.

9.9.3 9.3 The Garden of Forking Paths

Even without conscious dishonesty, the sheer number of defensible analysis choices (which metric, which preprocessing, which subset, which statistical test) creates a garden of forking paths in which a researcher exploring flexibly will eventually find an apparently significant result (14). The cumulative effect is a literature biased toward positive findings. Guarding against it requires committing to analysis decisions before seeing outcomes, reporting the decisions that were considered, and treating any post hoc discovery as a hypothesis to be confirmed later rather than a conclusion already established.

9.9.4 9.4 When to Relax the Rules, and the Pitfalls of Each

The strict regime described here is calibrated for confirmatory claims, the kind that appear in a paper’s headline table and that others will build on. Not all work is confirmatory. During exploration, peeking, flexible analysis, and chasing the best of many seeds are exactly how good hypotheses are found; the error is not the peeking but the failure to relabel the result as exploratory and to confirm it later on untouched data. The practical pitfalls cluster into a short list worth keeping in view: tuning on the test set, comparing against a weak or untuned baseline, reporting a single seed, ignoring multiple comparisons, leaking preprocessing across the split, and reporting statistical significance without effect size. Each has a one-line antidote, namely tune on validation only, match the baseline’s effort, report a distribution, correct for the number of comparisons, fit preprocessing inside the split, and report an effect size with its interval. None of these is intellectually difficult. All of them are easy to skip under deadline pressure, which is exactly why naming them as a checklist is useful.

9.10 10. Conclusion

The methods surveyed here share a single underlying commitment: to subject every claim, especially one’s own, to conditions under which it could fail. Strong baselines create the opportunity to fail by comparison. Ablations create the opportunity to fail by attribution. Held-out test data creates the opportunity to fail at generalization. Multiple seeds and statistical tests create the opportunity to fail by chance. Reproducibility creates the opportunity to fail under independent scrutiny. A result that survives all of these is worth believing precisely because it had so many chances to be exposed as noise, artifact, or wishful thinking.

Machine learning will continue to advance through engineering ingenuity, but its claim to be a science rests on the discipline of empirical self-skepticism. The practitioner who internalizes that discipline produces fewer headlines and more knowledge, and over time it is the knowledge that compounds.

9.11 References

1. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation Artifacts in Natural Language Inference Data. NAACL. https://aclanthology.org/N18-2017/
2. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep Reinforcement Learning that Matters. AAAI. https://ojs.aaai.org/index.php/AAAI/article/view/11694
3. Melis, G., Dyer, C., and Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. ICLR. https://openreview.net/forum?id=ByJHuTgA-
4. Popper, K. (1959). The Logic of Scientific Discovery. Routledge. https://www.routledge.com/The-Logic-of-Scientific-Discovery/Popper/p/book/9780415278447
5. Dacrema, M. F., Cremonesi, P., and Jannach, D. (2019). Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. RecSys. https://dl.acm.org/doi/10.1145/3298689.3347058
6. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
7. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? ICML. https://proceedings.mlr.press/v97/recht19a.html
8. Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Larochelle, H. (2021). Improving Reproducibility in Machine Learning Research. Journal of Machine Learning Research, 22(164). https://jmlr.org/papers/v22/20-303.html
9. Strathern, M. (1997). Improving Ratings: Audit in the British University System. European Review, 5(3). https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1234-981X(199707)5:3%3C305::AID-EURO184%3E3.0.CO;2-4
10. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR. https://openaccess.thecvf.com/content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_paper.html
11. Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. https://proceedings.mlsys.org/paper_files/paper/2021/hash/cfecdb276f634854f3ef915e2e980c31-Abstract.html
12. Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7. https://jmlr.org/papers/v7/demsar06a.html
13. Kapoor, S., and Narayanan, A. (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9). https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9
14. Gelman, A., and Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6). https://www.americanscientist.org/article/the-statistical-crisis-in-science

# The Scientific Method in AI Research ## 1. Introduction: Machine Learning as an Empirical Science Machine learning occupies an unusual position among the sciences. It borrows the deductive machinery of mathematics, the engineering culture of computer systems, and the empirical posture of the natural sciences, yet it is reducible to none of them. A theorem about the convergence of stochastic gradient descent tells us little about whether a particular transformer will generalize to a new distribution of documents. The decisive questions in modern AI research are answered not by proof but by measurement: does this method achieve a lower error than that one, under conditions we can defend as fair, with a margin we can defend as real? This chapter argues that machine learning is best understood as an empirical science, and that the discipline of empirical science (controlled comparison, falsifiable claims, honest accounting of uncertainty) is the single most important intellectual asset a practitioner can cultivate. The argument matters because the field has repeatedly mistaken impressive engineering for scientific knowledge. Models that top leaderboards have turned out to exploit annotation artifacts (1). Reported gains have evaporated under reproduction (2). Comparisons that looked decisive have collapsed once baselines were tuned with equal care (3). These failures are not the result of fraud or incompetence. They are the predictable consequence of doing experimental science without the safeguards that experimental science has spent four centuries developing. We proceed from the philosophy of empirical claims to the concrete machinery of the field: baselines and ablations, the partitioning of data, reproducibility, benchmark design, statistical inference, and the catalogue of methodological errors that recur with depressing regularity. 
Throughout, the unifying idea is the one drawn in @fig-loop: a claim earns belief only by surviving a sequence of deliberate opportunities to fail. ```{mermaid} %%| label: fig-loop %%| fig-cap: "The empirical loop in machine learning research. Each stage is a deliberate opportunity for a claim to fail." flowchart TD A["Operationalize a falsifiable hypothesis"] --> B["Build strong matched baselines"] B --> C["Train with multiple seeds"] C --> D["Select and tune on validation data"] D --> E["Run ablations to attribute the effect"] E --> F["Evaluate once on held-out test data"] F --> G["Quantify uncertainty and effect size"] G --> H{"Survives every check"} H -->|"No"| A H -->|"Yes"| I["Report with code, seeds, and configs"] ``` ## 2. Hypotheses and Falsifiability in Machine Learning ### 2.1 What a Hypothesis Looks Like in ML The Popperian criterion holds that a scientific claim must be falsifiable: there must exist some observable outcome that, if it occurred, would count as evidence against the claim (4). In machine learning a well-formed hypothesis is rarely "our model is good." It is a conditional, comparative, and quantitative statement: "adding a recurrence mechanism to architecture A reduces perplexity on long-context language modeling relative to A without recurrence, holding parameter count and training data fixed." This formulation specifies the intervention (the recurrence mechanism), the metric (perplexity), the comparison (A with versus without), and the controlled variables (parameters, data). Each of these can be wrong, and each can be checked. It helps to write the hypothesis as a statement about an unknown population quantity. Let $\theta_A$ and $\theta_B$ denote the expected test metric of methods $A$ and $B$, where the expectation is taken over the randomness of training (initialization, data order, augmentation, nondeterministic hardware) and over the draw of the test set from the target distribution. 
The scientific claim is a statement about the sign and size of the *estimand*

$$
\Delta \;=\; \theta_B - \theta_A ,
$$

for example "$\Delta < 0$ for a loss metric" (method $B$ improves on $A$). A single training run produces only a noisy estimate $\hat\Delta$ of $\Delta$, and the entire apparatus of the rest of this chapter exists to keep the gap between $\hat\Delta$ and $\Delta$ honest.

### 2.2 The Drift Toward Unfalsifiable Claims

Much of the rhetoric surrounding large models trends toward the unfalsifiable. Statements such as "the model understands language" or "the system reasons" resist refutation because no agreed measurement attaches to them. A productive research culture replaces such claims with operational proxies: a model "reasons" to the extent that it solves a held-out set of multi-step problems whose surface form differs from anything in training. The proxy is imperfect, and saying so is part of the science, but it is falsifiable, and that is what makes it useful. The practitioner's habit should be to translate every grand claim into a measurable one before believing it, including their own claims.

## 3. Baselines and Ablations: The Core of Controlled Comparison

### 3.1 Why Baselines Carry the Argument

A result in isolation conveys almost no information. Knowing that a model reaches 92 percent accuracy is meaningless until we know what a simpler approach achieves on the same task. The baseline is the control condition, and the strength of an empirical claim is bounded by the strength of the baseline it defeats.

A recurring pathology in published work is the weak baseline: the proposed method is tuned extensively while the comparison is taken untuned from an old paper, or implemented carelessly. When researchers have revisited such comparisons with equally tuned baselines, the reported advantage of elaborate methods has frequently shrunk or vanished (3, 5).

The discipline here is symmetric effort.
Every hour of hyperparameter search, every architectural refinement, and every data-cleaning step applied to the proposed method must be matched, as nearly as possible, for the baseline. A simple, well-tuned baseline that the new method genuinely beats is far more persuasive than an exotic competitor that the new method beats by accident of unequal effort.

A useful rule of thumb is the hierarchy of baselines: a trivial baseline (the majority-class predictor, or a constant), a classical baseline (logistic regression, gradient-boosted trees, a nearest-neighbor retriever), and the strongest published prior method, each tuned with the same budget as the proposed method. A new method should be measured against all three, because beating only the trivial baseline establishes almost nothing.

### 3.2 Ablations as Causal Attribution

An ablation study removes or alters one component of a system at a time to isolate that component's contribution. It is the closest thing machine learning has to a controlled experiment in the laboratory sense. If a method combines a new loss function, a new data augmentation, and a new optimizer, the headline result tells us only that the combination works. The ablation tells us which ingredient mattered. Without it, the field accumulates complicated recipes whose active ingredients are unknown, and subsequent researchers inherit cargo cult components that contribute nothing. A good ablation answers the counterfactual: if this piece were absent, what would happen, all else equal?

Two cautions sharpen the practice. First, components can interact, so the effect of removing one piece may depend on the presence of another. A single-knockout ablation (remove one piece from the full system) and a single-addition ablation (add one piece to the bare baseline) can disagree, and the disagreement is itself informative about interactions.
Second, every ablation cell is a measurement subject to the same seed variance as the headline number, so an ablation table built from single runs can mislead exactly as a single headline number can. Ablations deserve the same uncertainty quantification described in @sec-variance.

## 4. Train, Validation, and Test Discipline

### 4.1 The Three-Way Partition and What Each Set Is For

The partition of data into training, validation, and test sets encodes a simple epistemic principle: a claim about generalization can only be tested on data that played no role in producing the model. The training set fits parameters. The validation set guides choices made by the researcher, including architecture, hyperparameters, early stopping, and model selection. The test set estimates performance on genuinely unseen data, and it can serve that purpose only if it is consulted once, at the end, after all decisions are frozen (6).

### 4.2 How the Test Set Leaks, and the Mathematics of Optimistic Selection {#sec-leak}

The test set's protective value decays with every glance. If a researcher evaluates many variants on the test set and reports the best, the reported number is an optimistic estimate, because the maximum over noisy measurements is biased upward. This is test-set tuning, and it is one of the most common ways that honest researchers fool themselves.

The bias is not a vague worry; it can be quantified. Suppose a researcher evaluates $k$ models on the test set, and each model's measured score is its true score plus independent noise. Model the measured scores as $X_1, \dots, X_k$, drawn independently from a distribution with mean $\mu$ and standard deviation $\sigma$, where $\mu$ is the true performance level and $\sigma$ captures finite-test-set and training noise. Reporting the best means reporting $M_k = \max_i X_i$.
The expected reported score exceeds the truth, and the gap grows with the number of variants tried:

$$
\mathbb{E}[M_k] - \mu \;\ge\; 0, \qquad \text{and is increasing in } k .
$$

For the Gaussian case $X_i \sim \mathcal{N}(\mu, \sigma^2)$, the expected maximum is approximated to leading order for large $k$ by

$$
\mathbb{E}[M_k] \;\approx\; \mu + \sigma\sqrt{2 \ln k}.
$$

The selection bias therefore grows like $\sqrt{\ln k}$. The right-hand side is also an upper bound on $\mathbb{E}[M_k]$ for every $k$, so it overstates the bias when $k$ is small: it gives roughly $2.1\sigma$ for $10$ variants and $3.0\sigma$ for $100$, whereas the exact Gaussian values are about $1.5\sigma$ and $2.5\sigma$. Either way the inflation is large. If $\sigma$ (the run-to-run noise) is comparable to the differences researchers cite as evidence, then a few rounds of test-set peeking can manufacture an entirely spurious "improvement." The validation set exists precisely to absorb this selection pressure so the test set need not.

The discipline is uncomfortable but non-negotiable: decisions are made on validation data, and the test set is touched once. When a benchmark is reused by thousands of researchers over years, even a community that individually respects this rule can collectively overfit the public test set, because the same $\max$-over-many-attempts dynamic operates at the level of the field rather than the individual. This adaptive-overfitting concern motivates the practice of constructing fresh test distributions to recheck conclusions, as in the reconstruction of new ImageNet test sets, where absolute accuracies dropped but the *ranking* of models was largely preserved, suggesting genuine progress under inflated absolute numbers (7).

## 5. Reproducibility and the Reproducibility Crisis

### 5.1 Degrees of Reproducibility

It is useful to distinguish reproducibility (the same team or others obtaining consistent results with the same code and data) from replicability (independent teams reaching the same conclusion with their own implementations).
The former checks that a result is real given the artifacts; the latter checks that the conclusion is robust to the inevitable variation in how science is done. Machine learning has documented failures of both. Studies attempting to reproduce reinforcement learning results found that performance depended heavily on random seeds, undocumented code details, and hyperparameters not reported in the original papers (2). Surveys of recommendation systems and information retrieval found that many proposed neural methods failed to beat properly tuned classical baselines once reproduced carefully (5).

### 5.2 Sources of Irreproducibility

The causes are mundane and therefore fixable. Unreported hyperparameters, undisclosed preprocessing, non-deterministic hardware behavior, missing random seeds, selective reporting of favorable runs, and dependence on private data all break the chain from claim to verification. The remedies are equally mundane: release code and configuration, specify the computing environment, fix and report seeds, document data provenance and splits, and report the full distribution of outcomes rather than a single lucky number. Reproducibility checklists adopted by major conferences formalize these expectations and have measurably improved reporting practices (8).

The open-source ecosystem supplies mature tooling for each remedy, and using it costs little. Version control (Git) pins the exact code; environment specifications (a `requirements.txt` or `environment.yml`, or a container image) pin the dependency graph; experiment trackers and data versioning tools record the configuration and metrics of every run; and a fixed, reported seed plus deterministic flags makes a single run reproducible bit for bit where the hardware allows. None of this is exotic, and all of it is free. The barrier is discipline, not cost.

## 6. Benchmark Design and Goodhart's Law

### 6.1 The Function and the Failure of Benchmarks

Shared benchmarks have driven much of the field's progress by making competing methods comparable on common ground. A benchmark coordinates a community, focuses effort, and renders claims checkable. Yet the very property that makes a benchmark useful, its role as a fixed target, makes it vulnerable to Goodhart's law: when a measure becomes a target, it ceases to be a good measure (9). Optimization pressure flows to whatever the metric rewards, including shortcuts that satisfy the metric without delivering the underlying capability the metric was meant to track.

### 6.2 Shortcuts and Artifacts

The literature is full of cases where models achieved high benchmark scores by exploiting spurious correlations rather than solving the intended task. Natural language inference models learned that negation words in the hypothesis predicted the contradiction label, independent of meaning (1). Visual question answering systems answered without looking at the image, exploiting language priors in the question distribution (10). These are not model failures so much as benchmark failures: the dataset permitted a shortcut, and gradient descent, being an excellent shortcut finder, took it. Robust benchmark design therefore demands adversarial scrutiny of the data itself, construction of challenge sets that defeat known shortcuts, and periodic retirement of saturated benchmarks in favor of harder successors.

A practical diagnostic follows from the shortcut framing: if a model with the input partially or wholly removed still scores far above chance, the benchmark contains a shortcut. A visual question answering model that answers well from the question alone, or a natural language inference model that classifies from the hypothesis alone, has revealed an artifact in the data rather than a capability in the model. Running such input-ablation probes before trusting a benchmark is cheap insurance.

## 7. Statistical Significance and Variance Across Seeds {#sec-variance}

### 7.1 A Single Number Is Not a Result

A deep network's outcome depends on random initialization, data ordering, augmentation sampling, and non-deterministic parallel computation. Re-running the identical configuration with a different random seed can shift the reported metric by an amount comparable to the differences researchers cite as evidence of method superiority. Reporting a single run is therefore reporting a single sample from a distribution while pretending to report the distribution's mean. The minimal honest practice is to train each configuration multiple times with different seeds and report a central tendency together with a measure of spread, such as the mean and standard deviation or, better, a confidence interval (11).

It is worth being explicit about where the variance comes from, because the sources call for different remedies. The total variance of a reported metric decomposes, to a first approximation, into independent contributions:

$$
\operatorname{Var}(\hat\theta) \;\approx\; \underbrace{\sigma^2_{\text{init}}}_{\text{weight initialization}} + \underbrace{\sigma^2_{\text{order}}}_{\text{data ordering and sampling}} + \underbrace{\sigma^2_{\text{hw}}}_{\text{nondeterministic compute}} + \underbrace{\sigma^2_{\text{data}}}_{\text{finite test set}} .
$$

Fixing a single seed, together with deterministic kernels where the hardware allows, collapses the first three terms in one's own runs but does not remove them from reality; it merely hides them, and a method that wins only at one seed has not been shown to win. The finite-test-set term $\sigma^2_{\text{data}}$ persists no matter how the model is trained and shrinks only with a larger or more representative test set. Estimating these components, rather than reporting one number, is what turns an anecdote into a measurement (11).
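
The reporting practice described in this subsection can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed procedure: the function name `summarize_runs`, its parameters, and the accuracy values are hypothetical, and the percentile bootstrap interval is a stand-in chosen here because it needs no external library. With only a handful of seeds any interval is rough, and the $t$-based interval is an equally reasonable choice.

```python
import random
import statistics


def summarize_runs(scores, n_boot=2000, level=0.95, rng_seed=0):
    """Summarize one configuration's metric across independent seeds.

    Returns the mean, sample standard deviation, standard error, and a
    percentile bootstrap interval for the mean (an assumption of this
    sketch, not the only valid interval).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")

    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    sem = std / n ** 0.5            # standard error of the mean

    # Resample the runs with replacement and record each resample's mean.
    rng = random.Random(rng_seed)
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "sem": sem,
        "ci": (boot_means[lo_idx], boot_means[hi_idx]),
    }


# Hypothetical test accuracies (percent) from eight seeds of one configuration.
accuracies = [91.2, 90.7, 91.5, 90.9, 91.1, 91.4, 90.6, 91.0]
summary = summarize_runs(accuracies)
print(f"n={summary['n']}  mean={summary['mean']:.2f}  std={summary['std']:.2f}")
print(f"95% bootstrap interval for the mean: "
      f"[{summary['ci'][0]:.2f}, {summary['ci'][1]:.2f}]")
```

Reporting the whole dictionary, rather than the single best entry of `accuracies`, is the point: the spread travels with the mean.
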

### 7.2 Comparing Distributions, Not Points

Once each method is represented by a distribution of outcomes, comparison becomes a statistical question rather than a reading of two numbers. The estimand is $\Delta = \theta_B - \theta_A$ from Section 2.1. With $n$ paired runs (same seeds for both methods), a paired comparison of the per-run differences $d_i = X^B_i - X^A_i$ gives an estimate $\bar d$ and a confidence interval

$$
\bar d \;\pm\; t_{1-\alpha/2,\,n-1}\,\frac{s_d}{\sqrt{n}},
$$

where $s_d$ is the sample standard deviation of the differences. To the extent that both methods respond alike to a given seed, pairing on seeds removes the shared initialization-and-ordering variance from $s_d$ and so tightens the interval, which is why paired designs are preferred when feasible. When distributional assumptions are doubtful or several datasets are compared at once, nonparametric alternatives such as the Wilcoxon signed-rank test and the Friedman test with appropriate post hoc procedures are the standard tools (12).

Two cautions apply. First, statistical significance is not practical significance: with enough runs a trivial difference becomes significant, so the effect size $\Delta$ and its confidence interval matter as much as any $p$-value. Second, multiple comparisons inflate false positives. If a method is tested against a baseline on $m$ independent benchmarks, each at significance level $\alpha$, the probability of at least one spurious "win" by chance is $1 - (1-\alpha)^m$, which for $\alpha = 0.05$ and $m = 20$ is about $0.64$. A Bonferroni correction tests each comparison at $\alpha/m$ to hold the family-wise error rate at $\alpha$; less conservative procedures control the false discovery rate instead. The goal is not ritual hypothesis testing but an honest account of whether an observed gap could plausibly be noise.

### 7.3 A Worked Example

Suppose method $B$ is claimed to beat baseline $A$ on a classification task.
The authors report a single run each: $A$ scores $91.0$ percent and $B$ scores $91.8$ percent, and they conclude $B$ is better by $0.8$ points. A careful reviewer asks for the distribution. Re-running both methods with $n = 10$ seeds each, paired, yields per-seed accuracy differences $d_i = X^B_i - X^A_i$ with sample mean $\bar d = 0.30$ points and sample standard deviation $s_d = 0.50$ points. The standard error is $s_d / \sqrt{n} = 0.50 / \sqrt{10} \approx 0.158$ points. The two-sided $95$ percent interval uses $t_{0.975, 9} \approx 2.262$, giving

$$
0.30 \;\pm\; 2.262 \times 0.158 \;=\; 0.30 \pm 0.36 \;=\; [-0.06,\; 0.66] \text{ points}.
$$

The interval includes zero, so at the $5$ percent level the data do not support a claim that $B$ beats $A$, even though the point estimate is positive and the original single-run gap of $0.8$ points looked convincing. The single-run comparison was dominated by seed noise: it happened to pair a lucky $B$ with an unlucky $A$. The same arithmetic run with $\bar d = 0.80$ and $s_d = 0.30$ would instead give $[0.59, 1.01]$, an interval comfortably above zero, and the claim would stand. The discipline does not predetermine the verdict; it simply forces the verdict to depend on the spread as well as the gap. (The numbers in this example are illustrative.)

## 8. Fair Comparison

Fairness in comparison means that competing methods differ only in the dimension under study. The confounds are numerous. Differences in parameter count, training compute, data quantity, data quality, tokenization, hyperparameter search budget, and even software framework can all masquerade as method effects. A new architecture that is given more parameters or a longer training schedule than its baseline has not been shown to be better architecture; it has been shown to consume more resources. Compute-matched and parameter-matched comparisons isolate the variable of interest.
Equally important is equal tuning budget: the proposed method and every baseline should receive comparable hyperparameter optimization, ideally under an explicit and reported search protocol, so that the comparison reflects the methods rather than the researcher's differential investment of effort.

A concrete way to make tuning budget explicit is to report performance as a function of the number of hyperparameter configurations tried, rather than as a single tuned number. A method that reaches a good score after one configuration is more valuable, and more honestly compared, than one that reaches the same score only after hundreds, and a curve of best-so-far performance against search budget exposes that difference where a single tuned number conceals it.

## 9. Common Methodological Errors

### 9.1 Data Leakage

Data leakage occurs when information from outside the training set, especially from the test set, contaminates the training process, producing inflated and irreproducible performance. It takes many forms: preprocessing statistics such as normalization constants computed over the full dataset before splitting, duplicate or near-duplicate examples shared across splits, temporal leakage in which future information predicts the past, and feature leakage in which a predictor encodes the target. A broad review across scientific fields that adopted machine learning found leakage to be a pervasive cause of overoptimistic and non-replicable results (13).

The defense is to define splits first and to ensure that every fitted quantity, including preprocessing, is derived only from training data. In a cross-validation setting this means the entire preprocessing pipeline (imputation, scaling, feature selection, resampling) must be fitted inside each fold rather than once over the whole dataset, a discipline that mature pipeline abstractions in open-source libraries are designed to enforce.
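
The fit-inside-each-fold rule can be made concrete with a small standard-library sketch. It assumes a single numeric feature and plain standardization as the only preprocessing step; the helper names (`k_fold_indices`, `fit_scaler`, `transform`) and the toy data are hypothetical, chosen so that one extreme value sits in the last fold. It illustrates the principle only and does not stand in for any particular library's pipeline abstraction.

```python
import statistics


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for contiguous k-fold splits."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_scaler(values):
    """Estimate standardization constants from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero spread
    return mean, std


def transform(values, scaler):
    """Apply previously fitted standardization constants."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Toy feature column; the outlier (100.0) falls in the final fold.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]

# LEAKY: constants computed once over all rows, held-out rows included.
leaky_scaler = fit_scaler(feature)

folds = []
for train_idx, test_idx in k_fold_indices(len(feature), k=5):
    train = [feature[i] for i in train_idx]
    test = [feature[i] for i in test_idx]
    scaler = fit_scaler(train)  # CORRECT: fitted on this fold's training rows
    folds.append({
        "test_idx": test_idx,
        "scaler": scaler,
        "test_scaled": transform(test, scaler),
    })

print("leaky mean:", leaky_scaler[0])
print("last-fold training-only mean:", folds[-1]["scaler"][0])
```

In the last fold the training-only mean is 4.5, while the leaky mean is 14.5 because it has already seen the held-out outlier. That gap is information about the test rows entering the fitted pipeline before evaluation.
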

### 9.2 Test-Set Tuning and Cherry-Picking

Two related errors corrupt the inference from result to claim. Test-set tuning, analyzed quantitatively in @sec-leak, lets repeated evaluation on held-out data quietly turn it into a second validation set, biasing the estimate upward by up to roughly $\sigma\sqrt{2\ln k}$ after $k$ peeks. Cherry-picking selects favorable outcomes after the fact: the best of many seeds, the subset of benchmarks where the method wins, the qualitative examples that flatter the system. Both errors share a structure, namely selection after observing results, and both are countered by pre-registration of the evaluation plan, reporting of all runs and all benchmarks attempted, and separation of exploratory analysis from confirmatory claims. The exploratory phase is where hypotheses are generated by looking; the confirmatory phase tests a frozen hypothesis on untouched data, and only the latter supports a published claim.

### 9.3 The Garden of Forking Paths

Even without conscious dishonesty, the sheer number of defensible analysis choices (which metric, which preprocessing, which subset, which statistical test) creates a garden of forking paths in which a researcher exploring flexibly will eventually find an apparently significant result (14). The cumulative effect is a literature biased toward positive findings. Guarding against it requires committing to analysis decisions before seeing outcomes, reporting the decisions that were considered, and treating any post hoc discovery as a hypothesis to be confirmed later rather than a conclusion already established.

### 9.4 When to Relax the Rules, and the Pitfalls of Each

The strict regime described here is calibrated for confirmatory claims, the kind that appear in a paper's headline table and that others will build on. Not all work is confirmatory.
During exploration, peeking, flexible analysis, and chasing the best of many seeds are exactly how good hypotheses are found; the error is not the peeking but the failure to relabel the result as exploratory and to confirm it later on untouched data.

The practical pitfalls cluster into a short list worth keeping in view: tuning on the test set, comparing against a weak or untuned baseline, reporting a single seed, ignoring multiple comparisons, leaking preprocessing across the split, and reporting statistical significance without effect size. Each has a one-line antidote, namely tune on validation only, match the baseline's effort, report a distribution, correct for the number of comparisons, fit preprocessing inside the split, and report an effect size with its interval. None of these is intellectually difficult. All of them are easy to skip under deadline pressure, which is exactly why naming them as a checklist is useful.

## 10. Conclusion

The methods surveyed here share a single underlying commitment: to subject every claim, especially one's own, to conditions under which it could fail. Strong baselines create the opportunity to fail by comparison. Ablations create the opportunity to fail by attribution. Held-out test data creates the opportunity to fail at generalization. Multiple seeds and statistical tests create the opportunity to fail by chance. Reproducibility creates the opportunity to fail under independent scrutiny. A result that survives all of these is worth believing precisely because it had so many chances to be exposed as noise, artifact, or wishful thinking.

Machine learning will continue to advance through engineering ingenuity, but its claim to be a science rests on the discipline of empirical self-skepticism. The practitioner who internalizes that discipline produces fewer headlines and more knowledge, and over time it is the knowledge that compounds.

## References

1. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation Artifacts in Natural Language Inference Data. NAACL. https://aclanthology.org/N18-2017/
2. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep Reinforcement Learning that Matters. AAAI. https://ojs.aaai.org/index.php/AAAI/article/view/11694
3. Melis, G., Dyer, C., and Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. ICLR. https://openreview.net/forum?id=ByJHuTgA-
4. Popper, K. (1959). The Logic of Scientific Discovery. Routledge. https://www.routledge.com/The-Logic-of-Scientific-Discovery/Popper/p/book/9780415278447
5. Dacrema, M. F., Cremonesi, P., and Jannach, D. (2019). Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. RecSys. https://dl.acm.org/doi/10.1145/3298689.3347058
6. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
7. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? ICML. https://proceedings.mlr.press/v97/recht19a.html
8. Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Larochelle, H. (2021). Improving Reproducibility in Machine Learning Research. Journal of Machine Learning Research, 22(164). https://jmlr.org/papers/v22/20-303.html
9. Strathern, M. (1997). Improving Ratings: Audit in the British University System. European Review, 5(3). https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1234-981X(199707)5:3%3C305::AID-EURO184%3E3.0.CO;2-4
10. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR. https://openaccess.thecvf.com/content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_paper.html
11. Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. https://proceedings.mlsys.org/paper_files/paper/2021/hash/cfecdb276f634854f3ef915e2e980c31-Abstract.html
12. Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7. https://jmlr.org/papers/v7/demsar06a.html
13. Kapoor, S., and Narayanan, A. (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9). https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9
14. Gelman, A., and Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6). https://www.americanscientist.org/article/the-statistical-crisis-in-science