9  The Scientific Method in AI Research

9.1 1. Introduction: Machine Learning as an Empirical Science

Machine learning occupies an unusual position among the sciences. It borrows the deductive machinery of mathematics, the engineering culture of computer systems, and the empirical posture of the natural sciences, yet it is reducible to none of them. A theorem about the convergence of stochastic gradient descent tells us little about whether a particular transformer will generalize to a new distribution of documents. The decisive questions in modern AI research are answered not by proof but by measurement: does this method achieve a lower error than that one, under conditions we can defend as fair, with a margin we can defend as real?

This chapter argues that machine learning is best understood as an empirical science, and that the discipline of empirical science (controlled comparison, falsifiable claims, honest accounting of uncertainty) is the single most important intellectual asset a practitioner can cultivate. The argument matters because the field has repeatedly mistaken impressive engineering for scientific knowledge. Models that top leaderboards have turned out to exploit annotation artifacts (1). Reported gains have evaporated under reproduction (2). Comparisons that looked decisive have collapsed once baselines were tuned with equal care (3). These failures are not the result of fraud or incompetence. They are the predictable consequence of doing experimental science without the safeguards that experimental science has spent four centuries developing.

We proceed from the philosophy of empirical claims to the concrete machinery of the field: baselines and ablations, the partitioning of data, reproducibility, benchmark design, statistical inference, and the catalogue of methodological errors that recur with depressing regularity.

9.2 2. Hypotheses and Falsifiability in Machine Learning

9.2.1 2.1 What a Hypothesis Looks Like in ML

The Popperian criterion holds that a scientific claim must be falsifiable: there must exist some observable outcome that, if it occurred, would count as evidence against the claim (4). In machine learning a well-formed hypothesis is rarely “our model is good.” It is a conditional, comparative, and quantitative statement: “adding a recurrence mechanism to architecture A reduces perplexity on long-context language modeling relative to A without recurrence, holding parameter count and training data fixed.” This formulation specifies the intervention (the recurrence mechanism), the metric (perplexity), the comparison (A with versus without), and the controlled variables (parameters, data). Each of these can be wrong, and each can be checked.

9.2.2 2.2 The Drift Toward Unfalsifiable Claims

Much of the rhetoric surrounding large models trends toward the unfalsifiable. Statements such as “the model understands language” or “the system reasons” resist refutation because no agreed measurement attaches to them. A productive research culture replaces such claims with operational proxies: a model “reasons” to the extent that it solves a held-out set of multi-step problems whose surface form differs from anything in training. The proxy is imperfect, and saying so is part of the science, but it is falsifiable, and that is what makes it useful. The practitioner’s habit should be to translate every grand claim into a measurable one before believing it, including their own claims.

9.3 3. Baselines and Ablations: The Core of Controlled Comparison

9.3.1 3.1 Why Baselines Carry the Argument

A result in isolation conveys almost no information. A report that a model reaches 92 percent accuracy means little until we know what a simpler approach achieves on the same task: if 90 percent of the examples belong to one class, a predictor that always outputs that class is already within two points. The baseline is the control condition, and the strength of an empirical claim is bounded by the strength of the baseline it defeats. A recurring pathology in published work is the weak baseline: the proposed method is tuned extensively while the comparison is taken untuned from an old paper, or implemented carelessly. When researchers have revisited such comparisons with equally tuned baselines, the reported advantage of elaborate methods has frequently shrunk or vanished (3, 5).

The discipline here is symmetric effort. Every hour of hyperparameter search, every architectural refinement, and every data-cleaning step applied to the proposed method must be matched, as nearly as possible, for the baseline. A simple, well-tuned baseline that the new method genuinely beats is far more persuasive than an exotic competitor that the new method beats by accident of unequal effort.

9.3.2 3.2 Ablations as Causal Attribution

An ablation study removes or alters one component of a system at a time to isolate that component’s contribution. It is the closest thing machine learning has to a controlled experiment in the laboratory sense. If a method combines a new loss function, a new data augmentation, and a new optimizer, the headline result tells us only that the combination works. The ablation tells us which ingredient mattered. Without it, the field accumulates complicated recipes whose active ingredients are unknown, and subsequent researchers inherit cargo-cult components that contribute nothing. A good ablation answers the counterfactual: if this piece were absent, what would happen, all else equal?
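The leave-one-out pattern described above can be sketched in a few lines. This is a minimal illustration, not a prescribed protocol: the function name `leave_one_out_ablation`, the component names, and the `toy_evaluate` function (a stand-in for a real train-and-evaluate run, with made-up scores) are all assumptions introduced here.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def leave_one_out_ablation(
    components: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """For each component, return full-system score minus score without it."""
    full = frozenset(components)
    full_score = evaluate(full)
    return {c: full_score - evaluate(full - {c}) for c in sorted(full)}


def toy_evaluate(enabled: FrozenSet[str]) -> float:
    """Hypothetical stand-in for training and validating one configuration."""
    score = 0.70
    if "new_loss" in enabled:
        score += 0.05
    if "new_augmentation" in enabled:
        score += 0.02
    # "new_optimizer" contributes nothing in this toy setup.
    return score


contributions = leave_one_out_ablation(
    ["new_loss", "new_augmentation", "new_optimizer"], toy_evaluate
)
for name, delta in contributions.items():
    print(f"{name}: {delta:+.3f}")
```

Removing one component at a time does not reveal interactions between components; where those are suspected, evaluating additional subsets is the natural extension. In a real study each call to `evaluate` would itself be repeated over several seeds.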

9.4 4. Train, Validation, and Test Discipline

9.4.1 4.1 The Three-Way Partition and What Each Set Is For

The partition of data into training, validation, and test sets encodes a simple epistemic principle: a claim about generalization can only be tested on data that played no role in producing the model. The training set fits parameters. The validation set guides choices made by the researcher, including architecture, hyperparameters, early stopping, and model selection. The test set estimates performance on genuinely unseen data, and it can serve that purpose only if it is consulted once, at the end, after all decisions are frozen (6).
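A seeded three-way split might look like the following sketch. The function name `three_way_split` and its default fractions are assumptions chosen for illustration, and a uniform random split is appropriate only when examples are independent; grouped or time-ordered data call for group-wise or chronological splits instead.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(
    examples: Sequence[T],
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed and cut into train, validation, test."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must sum to less than 1")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # local RNG: the split is repeatable
    n_test = int(len(indices) * test_fraction)
    n_val = int(len(indices) * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return (
        [examples[i] for i in train_idx],
        [examples[i] for i in val_idx],
        [examples[i] for i in test_idx],
    )


train, val, test = three_way_split(list(range(1000)))
print(len(train), len(val), len(test))
```

Recording the seed and the resulting index lists alongside the results lets others rebuild exactly the same partition.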

9.4.2 4.2 How the Test Set Leaks

The test set’s protective value decays with every glance. If a researcher evaluates ten variants on the test set and reports the best, the reported number is an optimistic estimate, because the maximum over noisy measurements is biased upward. This is test-set tuning, and it is one of the most common ways that honest researchers fool themselves. The validation set exists precisely to absorb this selection pressure so the test set need not. The discipline is uncomfortable but non-negotiable: decisions are made on validation data, and the test set is touched once. When a benchmark is reused by thousands of researchers over years, even a community that individually respects this rule can collectively overfit the public test set, which motivates the practice of constructing fresh test distributions to recheck conclusions (7).
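The upward bias of the maximum can be seen in a small simulation. The sketch below assumes, purely for illustration, that all variants have the same true accuracy and that each test-set measurement adds independent Gaussian noise; the function name and default values are hypothetical.

```python
import random
import statistics


def best_of_k_bias(
    true_accuracy: float = 0.80,
    noise_sd: float = 0.01,
    k: int = 10,
    trials: int = 20_000,
    seed: int = 0,
) -> float:
    """Average amount by which the best of k noisy scores exceeds the truth."""
    rng = random.Random(seed)
    best_scores = [
        max(rng.gauss(true_accuracy, noise_sd) for _ in range(k))
        for _ in range(trials)
    ]
    return statistics.fmean(best_scores) - true_accuracy


bias = best_of_k_bias()
print(f"average optimism from reporting the best of 10: {bias:.4f}")
```

Under these assumptions the best of ten variants overstates the truth by roughly one and a half noise standard deviations, even though no variant is actually better than any other.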

9.5 5. Reproducibility and the Reproducibility Crisis

9.5.1 5.1 Degrees of Reproducibility

It is useful to distinguish reproducibility (the same team or others obtaining consistent results with the same code and data) from replicability (independent teams reaching the same conclusion with their own implementations). The former checks that a result is real given the artifacts; the latter checks that the conclusion is robust to the inevitable variation in how science is done. Machine learning has documented failures of both. Studies attempting to reproduce reinforcement learning results found that performance depended heavily on random seeds, undocumented code details, and hyperparameters not reported in the original papers (2). Surveys of recommendation systems and information retrieval found that many proposed neural methods failed to beat properly tuned classical baselines once reproduced carefully (5).

9.5.2 5.2 Sources of Irreproducibility

The causes are mundane and therefore fixable. Unreported hyperparameters, undisclosed preprocessing, non-deterministic hardware behavior, missing random seeds, selective reporting of favorable runs, and dependence on private data all break the chain from claim to verification. The remedies are equally mundane: release code and configuration, specify the computing environment, fix and report seeds, document data provenance and splits, and report the full distribution of outcomes rather than a single lucky number. Reproducibility checklists adopted by major conferences formalize these expectations and have measurably improved reporting practices (8).
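Several of these remedies amount to writing things down at the start of a run. The sketch below is one possible shape for such a record, using only the Python standard library; the function name `start_run` and the configuration keys are assumptions, and a real project would also seed and log whatever numerical libraries it uses.

```python
import json
import platform
import random
import sys
from typing import Any, Dict


def start_run(config: Dict[str, Any], seed: int) -> str:
    """Seed the stdlib RNG and return a JSON manifest describing the run."""
    # This seeds only Python's built-in generator; other libraries keep
    # their own random state and must be seeded separately.
    random.seed(seed)
    manifest = {
        "config": config,
        "seed": seed,
        "python_version": sys.version,
        "platform": platform.platform(),
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


manifest_json = start_run({"learning_rate": 3e-4, "batch_size": 64}, seed=1234)
print(manifest_json)
```

Saving the manifest next to the results ties every reported number to the settings that produced it.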

9.6 6. Benchmark Design and Goodhart’s Law

9.6.1 6.1 The Function and the Failure of Benchmarks

Shared benchmarks have driven much of the field’s progress by making competing methods comparable on common ground. A benchmark coordinates a community, focuses effort, and renders claims checkable. Yet the very property that makes a benchmark useful, its role as a fixed target, makes it vulnerable to Goodhart’s law: when a measure becomes a target, it ceases to be a good measure (9). Optimization pressure flows to whatever the metric rewards, including shortcuts that satisfy the metric without delivering the underlying capability the metric was meant to track.

9.6.2 6.2 Shortcuts and Artifacts

The literature is full of cases where models achieved high benchmark scores by exploiting spurious correlations rather than solving the intended task. Natural language inference models learned that the presence of negation words in the hypothesis predicted the contradiction label, regardless of the premise (1). Visual question answering systems answered without looking at the image, exploiting language priors in the question distribution (10). These are not model failures so much as benchmark failures: the dataset permitted a shortcut, and gradient descent, being an excellent shortcut finder, took it. Robust benchmark design therefore demands adversarial scrutiny of the data itself, construction of challenge sets that defeat known shortcuts, and periodic retirement of saturated benchmarks in favor of harder successors.
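One simple form of scrutiny is to ask how well single tokens predict the label on their own. The sketch below is a crude illustration of that idea, not the analysis used in the cited work: the function name `cue_statistics` is an assumption, and the six sentences are invented toy data, not drawn from any real benchmark.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def cue_statistics(
    examples: List[Tuple[str, str]], min_count: int = 2
) -> Dict[str, Tuple[str, float, int]]:
    """Map each token to (most common label, share of that label, count)."""
    label_counts: Dict[str, Counter] = defaultdict(Counter)
    for text, label in examples:
        for token in set(text.lower().split()):  # count each token once per example
            label_counts[token][label] += 1
    stats = {}
    for token, counts in label_counts.items():
        total = sum(counts.values())
        if total >= min_count:
            label, top = counts.most_common(1)[0]
            stats[token] = (label, top / total, total)
    return stats


# Invented toy hypotheses, for illustration only.
toy_hypotheses = [
    ("nobody is sleeping", "contradiction"),
    ("the man is not outside", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a person is outside", "entailment"),
    ("the man is sleeping", "neutral"),
    ("a person is not sleeping", "contradiction"),
]
stats = cue_statistics(toy_hypotheses)
for token in ("nobody", "not", "is"):
    print(token, stats[token])
```

A token whose presence almost always coincides with one label, across many examples, is a candidate shortcut worth probing with a challenge set. On a real dataset the counts would need to be large enough to rule out coincidence.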

9.7 7. Statistical Significance and Variance Across Seeds

9.7.1 7.1 A Single Number Is Not a Result

A deep network’s outcome depends on random initialization, data ordering, augmentation sampling, and non-deterministic parallel computation. Re-running the identical configuration with a different random seed can shift the reported metric by an amount comparable to the differences researchers cite as evidence of method superiority. Reporting a single run is therefore reporting a single sample from a distribution while pretending to report the distribution’s mean. The minimal honest practice is to train each configuration multiple times with different seeds and report a central tendency together with a measure of spread, such as the mean and standard deviation or, better, a confidence interval (11).
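A report of this kind can be produced with the standard library alone. The sketch below uses a percentile bootstrap for the interval, which is one choice among several and is crude with very few runs; the function name `summarize_runs` and the five scores are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def summarize_runs(
    scores: Sequence[float],
    n_resamples: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, Tuple[float, float]]:
    """Return mean, sample standard deviation, and a bootstrap CI of the mean."""
    rng = random.Random(seed)
    # Resample the runs with replacement and record the mean of each resample.
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    interval = (resampled_means[lower_idx], resampled_means[upper_idx])
    return statistics.fmean(scores), statistics.stdev(scores), interval


# Hypothetical accuracies from five seeds of one configuration.
seed_scores = [0.912, 0.905, 0.921, 0.899, 0.915]
mean, sd, (low, high) = summarize_runs(seed_scores)
print(f"mean={mean:.4f} sd={sd:.4f} 95% CI=({low:.4f}, {high:.4f})")
```

With only a handful of seeds the interval is itself noisy, which is an argument for more runs rather than for omitting the interval.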

9.7.2 7.2 Comparing Distributions, Not Points

Once each method is represented by a distribution of outcomes, comparison becomes a statistical question rather than a reading of two numbers. Appropriate tools include paired tests when configurations share seeds, and nonparametric tests when distributional assumptions are doubtful (12). Two cautions apply. First, statistical significance is not practical significance: with enough runs a trivial difference becomes significant, so effect size matters as much as the p-value. Second, multiple comparisons inflate false positives, so testing many variants against a baseline requires correction. The goal is not ritual hypothesis testing but an honest account of whether an observed gap could plausibly be noise.
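As one concrete instance of a paired nonparametric test, the sketch below implements an exact sign-flip permutation test on per-seed differences, together with a Bonferroni correction for several comparisons. Both are illustrative choices rather than the tests the cited work prescribes, and the function names and scores are assumptions.

```python
import itertools
import statistics
from typing import List, Sequence


def paired_sign_flip_test(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact p-value for the mean paired difference between a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    extreme = 0
    total = 0
    # Under the null, each paired difference is equally likely to have either
    # sign. Enumerating all 2**n sign patterns is feasible only for small n.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(statistics.fmean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical scores for two methods trained with the same six seeds.
method_a = [0.82, 0.84, 0.81, 0.85, 0.83, 0.86]
method_b = [0.80, 0.81, 0.80, 0.82, 0.81, 0.83]
p_value = paired_sign_flip_test(method_a, method_b)
print(f"p = {p_value:.5f}")
```

The smallest two-sided p-value this test can return with n pairs is 2 / 2**n, so with five seeds it can never fall below 0.0625. The number of runs therefore limits what can be claimed, whatever the size of the observed gap.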

9.8 8. Fair Comparison

Fairness in comparison means that competing methods differ only in the dimension under study. The confounds are numerous. Differences in parameter count, training compute, data quantity, data quality, tokenization, hyperparameter search budget, and even software framework can all masquerade as method effects. A new architecture that is given more parameters or a longer training schedule than its baseline has not been shown to be better architecture; it has been shown to consume more resources. Compute-matched and parameter-matched comparisons isolate the variable of interest. Equally important is equal tuning budget: the proposed method and every baseline should receive comparable hyperparameter optimization, ideally under an explicit and reported search protocol, so that the comparison reflects the methods rather than the researcher’s differential investment of effort.
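An explicit search protocol with an equal budget can be as simple as the sketch below, which runs the same random search, with the same number of trials, for the baseline and the proposed method. The function `random_search`, the search space, and the two toy objective functions are hypothetical stand-ins for real training runs scored on validation data.

```python
import random
from typing import Callable, Dict, Tuple

SearchSpace = Dict[str, Tuple[float, float]]
Config = Dict[str, float]


def random_search(
    train_and_validate: Callable[[Config], float],
    space: SearchSpace,
    budget: int,
    seed: int,
) -> Tuple[float, Config]:
    """Sample `budget` configurations uniformly and keep the best one."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), {}
    for _ in range(budget):
        config = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = train_and_validate(config)  # validation score, never test
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


def toy_baseline(config: Config) -> float:
    """Hypothetical validation score of the baseline."""
    return 0.80 - (config["learning_rate"] - 0.3) ** 2


def toy_proposed(config: Config) -> float:
    """Hypothetical validation score of the proposed method."""
    return 0.82 - (config["learning_rate"] - 0.6) ** 2


SPACE = {"learning_rate": (0.0, 1.0)}
BUDGET = 50  # identical for both methods, and reported alongside the results
baseline_best, _ = random_search(toy_baseline, SPACE, BUDGET, seed=0)
proposed_best, _ = random_search(toy_proposed, SPACE, BUDGET, seed=0)
print(f"baseline={baseline_best:.4f} proposed={proposed_best:.4f}")
```

The point of the sketch is the shared `SPACE` and `BUDGET`: whatever search procedure is used, both methods pass through it on equal terms.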

9.9 9. Common Methodological Errors

9.9.1 9.1 Data Leakage

Data leakage occurs when information from outside the training set, especially from the test set, contaminates the training process, producing inflated and irreproducible performance. It takes many forms: preprocessing statistics such as normalization constants computed over the full dataset before splitting, duplicate or near-duplicate examples shared across splits, temporal leakage in which information from the future is used to predict the past, and feature leakage in which a predictor encodes the target. A broad review across scientific fields that adopted machine learning found leakage to be a pervasive cause of overoptimistic and non-replicable results (13). The defense is to define splits first and to ensure that every fitted quantity, including preprocessing, is derived only from training data.
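Two of these defenses are easy to show in code: fitting normalization constants on the training split only, and checking for examples shared across splits. The sketch below is a minimal illustration with hypothetical function names; it detects exact duplicates only, and near-duplicates would need fuzzy matching.

```python
import statistics
from typing import Hashable, List, Sequence, Set, Tuple


def fit_standardizer(train_values: Sequence[float]) -> Tuple[float, float]:
    """Compute mean and standard deviation from the training split alone."""
    mean = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values)
    if sd == 0:
        raise ValueError("training feature is constant; cannot standardize")
    return mean, sd


def apply_standardizer(
    values: Sequence[float], mean: float, sd: float
) -> List[float]:
    """Transform any split using constants that were fitted on training data."""
    return [(v - mean) / sd for v in values]


def shared_examples(
    train: Sequence[Hashable], test: Sequence[Hashable]
) -> Set[Hashable]:
    """Return examples that appear verbatim in both splits."""
    return set(train) & set(test)


train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [10.0]
mean, sd = fit_standardizer(train_x)          # the test split is never consulted
test_scaled = apply_standardizer(test_x, mean, sd)
overlap = shared_examples(["a b", "c d"], ["c d", "e f"])
print(test_scaled, overlap)
```

The leaky alternative would compute `mean` and `sd` over `train_x + test_x`, letting the test point influence how the training data are scaled.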

9.9.2 9.2 Test-Set Tuning and Cherry-Picking

Two related errors corrupt the inference from result to claim. Test-set tuning, discussed in Section 4, lets repeated evaluation on held-out data quietly turn it into a second validation set, biasing the estimate upward. Cherry-picking selects favorable outcomes after the fact: the best of many seeds, the subset of benchmarks where the method wins, the qualitative examples that flatter the system. Both errors share a structure, namely selection after observing results, and both are countered by pre-registration of the evaluation plan, reporting of all runs and all benchmarks attempted, and separation of exploratory analysis from confirmatory claims. The exploratory phase is where hypotheses are generated by looking; the confirmatory phase tests a frozen hypothesis on untouched data, and only the latter supports a published claim.

9.9.3 9.3 The Garden of Forking Paths

Even without conscious dishonesty, the sheer number of defensible analysis choices (which metric, which preprocessing, which subset, which statistical test) creates a garden of forking paths in which a researcher exploring flexibly will eventually find an apparently significant result (14). The cumulative effect is a literature biased toward positive findings. Guarding against it requires committing to analysis decisions before seeing outcomes, reporting the decisions that were considered, and treating any post hoc discovery as a hypothesis to be confirmed later rather than a conclusion already established.
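The arithmetic behind this effect is short. The sketch below assumes, for illustration only, that each analysis path is an independent test at the same significance level on data with no real effect; analysis choices made on the same data are correlated, so the figures illustrate the trend rather than estimate any particular study's error rate. The function name is hypothetical.

```python
def chance_of_false_positive(n_analyses: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null tests is significant."""
    return 1.0 - (1.0 - alpha) ** n_analyses


for n in (1, 5, 14, 20):
    print(f"{n:2d} analyses: {chance_of_false_positive(n):.3f}")
```

Under these assumptions, fourteen analyses at the 0.05 level already give better than even odds of at least one spurious finding.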

9.10 10. Conclusion

The methods surveyed here share a single underlying commitment: to subject every claim, especially one’s own, to conditions under which it could fail. Strong baselines create the opportunity to fail by comparison. Ablations create the opportunity to fail by attribution. Held-out test data creates the opportunity to fail at generalization. Multiple seeds and statistical tests create the opportunity to fail by chance. Reproducibility creates the opportunity to fail under independent scrutiny. A result that survives all of these is worth believing precisely because it had so many chances to be exposed as noise, artifact, or wishful thinking.

Machine learning will continue to advance through engineering ingenuity, but its claim to be a science rests on the discipline of empirical self-skepticism. The practitioner who internalizes that discipline produces fewer headlines and more knowledge, and over time it is the knowledge that compounds.

9.11 References

  1. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., and Smith, N. A. (2018). Annotation Artifacts in Natural Language Inference Data. NAACL. https://aclanthology.org/N18-2017/

  2. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep Reinforcement Learning that Matters. AAAI. https://ojs.aaai.org/index.php/AAAI/article/view/11694

  3. Melis, G., Dyer, C., and Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. ICLR. https://openreview.net/forum?id=ByJHuTgA-

  4. Popper, K. (1959). The Logic of Scientific Discovery. Routledge. https://www.routledge.com/The-Logic-of-Scientific-Discovery/Popper/p/book/9780415278447

  5. Dacrema, M. F., Cremonesi, P., and Jannach, D. (2019). Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. RecSys. https://dl.acm.org/doi/10.1145/3298689.3347058

  6. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/

  7. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? ICML. https://proceedings.mlr.press/v97/recht19a.html

  8. Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Larochelle, H. (2021). Improving Reproducibility in Machine Learning Research. Journal of Machine Learning Research, 22(164). https://jmlr.org/papers/v22/20-303.html

  9. Strathern, M. (1997). Improving Ratings: Audit in the British University System. European Review, 5(3). https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1234-981X(199707)5:3%3C305::AID-EURO184%3E3.0.CO;2-4

  10. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR. https://openaccess.thecvf.com/content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_paper.html

  11. Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. https://proceedings.mlsys.org/paper_files/paper/2021/hash/cfecdb276f634854f3ef915e2e980c31-Abstract.html

  12. Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7. https://jmlr.org/papers/v7/demsar06a.html

  13. Kapoor, S., and Narayanan, A. (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9). https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9

  14. Gelman, A., and Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6). https://www.americanscientist.org/article/the-statistical-crisis-in-science