51 The Primacy of Data
Machine learning inverts the classical model of software. In traditional programming, a human author specifies the rules and the computer applies them to inputs to produce outputs. In machine learning, the human supplies inputs and desired outputs, and the computer infers the rules. The data is not a passive resource consumed by the algorithm. The data is the specification. Whatever pattern, bias, gap, or noise lives in the training set becomes part of the learned function. This chapter argues that data is the true foundation of all machine learning, examines the recent shift toward data-centric thinking, formalizes the old maxim of garbage in, garbage out, and explores how the quality and quantity of data jointly determine what a model can and cannot learn.
51.1 1. Why Data Is the Foundation
51.1.1 1.1 Learning as inference from examples
A supervised learning problem assumes an unknown target function \(f: \mathcal{X} \to \mathcal{Y}\) that maps inputs to outputs. We never observe \(f\) directly. We observe a finite sample \(D = \{(x_i, y_i)\}_{i=1}^{n}\), drawn (we hope) from the same distribution \(\mathcal{D}\) that will generate future inputs. The learning algorithm searches a hypothesis space \(\mathcal{H}\) for a function \(h\) that approximates \(f\) well on \(D\), in the hope that low error on the sample implies low error on the distribution.
This framing makes the dependence on data explicit. The algorithm can only ever know the target through the sample. If the sample misrepresents the distribution, the best possible hypothesis still inherits that misrepresentation. The model is a compression of its training data, and no optimizer, however powerful, can recover information that the data never contained.
51.1.2 1.2 The bias-variance view of data
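The expected error of a learned model decomposes into three parts: bias, variance, and irreducible noise. For squared loss, with labels \(y = f(x) + \varepsilon\), noise variance \(\sigma^2\), and the expectation taken over training samples and label noise, the expected error at a point \(x\) can be written as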
\[ \mathbb{E}\big[(y - h(x))^2\big] = \underbrace{\big(\mathbb{E}[h(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(h(x) - \mathbb{E}[h(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}. \]
Two of these three terms are controlled by data. Variance shrinks as the sample grows, because a larger sample pins down the estimate more tightly. The irreducible noise \(\sigma^2\) is a property of how the labels were generated, so noisy or inconsistent labeling raises the floor on achievable error no matter how much data you collect. Only bias is primarily a property of the model class. The lesson is that two of the three sources of error are addressed by collecting more data or by collecting cleaner data, not by changing the architecture.
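A small simulation can make the decomposition concrete. The sketch below is illustrative only: the target \(f(x) = x^2\), the constant predictor, the noise level, and every function name are assumptions chosen for this example. It refits a deliberately too-simple model on many fresh training sets and reports the squared bias and the variance of its prediction at \(x_0 = 1\).

```python
import random
import statistics

def f(x):
    # Hypothetical target function, chosen only for illustration.
    return x * x

def fit_constant(n, sigma, rng):
    # "Training": draw n noisy examples with x uniform on [0, 1] and
    # return the mean label. The model class is a single constant.
    ys = [f(rng.random()) + rng.gauss(0.0, sigma) for _ in range(n)]
    return sum(ys) / n

def bias_variance(n, sigma=0.3, x0=1.0, trials=500, seed=0):
    # Refit on `trials` independent training sets of size n and
    # measure how the prediction at x0 behaves across refits.
    rng = random.Random(seed)
    preds = [fit_constant(n, sigma, rng) for _ in range(trials)]
    mean_pred = statistics.fmean(preds)
    bias_sq = (mean_pred - f(x0)) ** 2
    variance = statistics.pvariance(preds, mu=mean_pred)
    return bias_sq, variance

results = {n: bias_variance(n) for n in (10, 100, 1000)}
for n, (bias_sq, variance) in results.items():
    print(f"n={n:5d}  bias^2={bias_sq:.3f}  variance={variance:.5f}")
```

Under these assumptions the variance falls roughly as \(1/n\), while the squared bias stays near \(4/9\), because a constant model cannot represent the curve however much data it sees.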
51.1.3 1.3 The model is downstream of the data
It is tempting to treat the model architecture as the seat of intelligence and the data as fuel. The opposite framing is more accurate. The architecture defines a space of possible functions, and the data selects one of them. A transformer trained on medical records becomes a clinical model. The same transformer trained on legal filings becomes a legal model. The weights differ entirely, and that difference is authored by the data. When practitioners say a model has learned a spurious correlation, they are really saying the data contained that correlation and the model, being faithful, reproduced it.
51.2 2. The Shift Toward Data-Centric Thinking
51.2.1 2.1 From model-centric to data-centric
For much of the last two decades, progress in machine learning was measured by architectural innovation. Benchmarks held the data fixed and invited researchers to compete on models. This model-centric paradigm produced enormous advances, but it also created a blind spot. On many real-world problems, the marginal return from a new architecture is small compared to the return from fixing the dataset.
Andrew Ng and others have argued for a data-centric paradigm in which the model and code are held fixed and the data is systematically improved [1]. The shift is partly cultural and partly economic. As pretrained models and standard architectures became commodities, the differentiator moved to the data that nobody else has and that nobody else has cleaned.
51.2.2 2.2 Data work is the real work
Surveys of practitioners consistently report that data preparation consumes the majority of project time. Collection, cleaning, labeling, deduplication, and validation dominate the calendar, while model training is often a small fraction. This is not a sign of immature tooling. It reflects the fact that the hard part of machine learning is turning the messy world into a faithful sample.
A useful reframing is to treat the dataset as a versioned artifact with the same rigor applied to code. Datasets should be tested, reviewed, and held to acceptance criteria.
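
    import pandas as pd

    # Placeholder label set; replace with the labels your task actually uses.
    VALID_LABELS = {"positive", "negative"}

    # Treat data quality as a gate, not an afterthought.
    def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
        # Every label must belong to the agreed label set.
        assert df["label"].isin(VALID_LABELS).all(), "unknown label found"
        # Every input must be non-empty text.
        assert df["text"].str.len().gt(0).all(), "empty input found"
        # Exact duplicates of the text field must stay under 1% of the batch.
        dup_rate = df.duplicated(subset=["text"]).mean()
        assert dup_rate < 0.01, f"duplicate rate too high: {dup_rate:.3f}"
        return df

51.2.3 2.3 Why this matters more as models scale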
Large pretrained models amplify rather than reduce the importance of data. A foundation model trained on a web-scale corpus inherits the composition of that corpus, including its demographic skew, its factual errors, and its toxic fragments. Fine-tuning and alignment then depend on small, carefully curated datasets whose quality has outsized influence on behavior. The work on instruction tuning and reinforcement learning from human feedback showed that a relatively small set of high-quality human demonstrations and preferences can reshape a model’s behavior dramatically [2]. The leverage of data did not disappear with scale. It moved.
51.3 3. Garbage In, Garbage Out
51.3.1 3.1 The maxim made precise
Garbage in, garbage out is a slogan, but it has a formal core. A learning algorithm minimizes a loss defined with respect to the training labels. If those labels are systematically wrong, the algorithm faithfully minimizes the wrong objective. Consider label noise modeled as a probability \(\rho\) that any given label is flipped. The classifier that is Bayes-optimal for the corrupted distribution coincides with the one for the clean distribution only in special cases, such as symmetric noise with \(\rho < 1/2\). When flip rates depend on the class or on the input, the optimal decision boundary shifts unless the noise is modeled explicitly. The model does not detect that the labels are garbage. It treats them as ground truth.
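For binary labels the effect of flipping can be worked out directly. The derivation below is a sketch under assumptions that go beyond the text above: labels lie in \(\{0, 1\}\), \(\eta(x) = P(y = 1 \mid x)\) denotes the clean posterior, and flips are independent of \(x\) given the true label, with rate \(\rho_1\) for true positives and \(\rho_0\) for true negatives, where \(\rho_0 + \rho_1 < 1\). The posterior of the noisy label is then
\[ \tilde{\eta}(x) = (1 - \rho_1)\,\eta(x) + \rho_0\,\big(1 - \eta(x)\big) = \rho_0 + (1 - \rho_0 - \rho_1)\,\eta(x). \]
A classifier that is optimal for the noisy labels predicts the positive class when \(\tilde{\eta}(x) > \tfrac{1}{2}\), which is equivalent to
\[ \eta(x) > \frac{\tfrac{1}{2} - \rho_0}{1 - \rho_0 - \rho_1}. \]
When \(\rho_0 = \rho_1 = \rho < \tfrac{1}{2}\), the right-hand side equals \(\tfrac{1}{2}\), so the decision rule is unchanged, although more data is needed to find it. When the two rates differ, the threshold moves away from \(\tfrac{1}{2}\) and the learned boundary is shifted.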
51.3.2 3.2 Categories of garbage
Data quality problems are not monolithic. It helps to name the common failure modes:
- Label noise: incorrect or inconsistent annotations, often from rushed or ambiguous labeling guidelines.
- Sampling bias: the training distribution differs from the deployment distribution, so the model optimizes for a world it will not encounter.
- Leakage: information available at training time that will not be available at prediction time, producing optimistic offline metrics that collapse in production.
- Spurious correlations: features that predict the label in the sample but have no causal relationship, such as a watermark that happens to co-occur with a class.
- Duplication and contamination: repeated records that distort the effective distribution, or test examples that leak into training and inflate reported performance.
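Of the failure modes listed above, duplication and contamination are the easiest to check mechanically. The following is a minimal sketch, assuming text inputs and exact matching after light normalization; the function names and the toy data are hypothetical, and near-duplicates (paraphrases, small edits) would need a fuzzier comparison than this one.

```python
import hashlib

def fingerprint(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def contamination_report(train_texts, test_texts):
    # Assumes both lists are non-empty.
    train_prints = [fingerprint(t) for t in train_texts]
    unique_train = set(train_prints)
    # Test examples whose fingerprint also appears in the training set.
    leaked = [t for t in test_texts if fingerprint(t) in unique_train]
    return {
        "train_duplicate_rate": 1.0 - len(unique_train) / len(train_prints),
        "test_contamination_rate": len(leaked) / len(test_texts),
        "leaked_examples": leaked,
    }

train = ["The cat sat.", "the  cat sat.", "A dog barked.", "Birds sing."]
test = ["A dog barked.", "Fish swim."]
report = contamination_report(train, test)
print(report)
```

In this toy data one of the four training rows is a duplicate and one of the two test rows also appears in training, which is exactly the kind of overlap that inflates reported performance.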
51.3.3 3.3 Garbage is often invisible at training time
The insidious property of bad data is that the training metrics frequently look excellent. A model that exploits a spurious correlation or a leaked feature will report high accuracy on a validation set drawn from the same flawed source. The error surfaces only at deployment, when the correlation breaks or the leaked feature vanishes. This is why data validation, slice-based evaluation, and audits of the data-generating process matter more than a single aggregate score. A well-known illustration comes from medical imaging, where models learned to detect hospital-specific markers and scanner artifacts rather than disease, achieving strong test numbers while learning the wrong thing [3].
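Slice-based evaluation is simple to sketch. The example below assumes each evaluation record carries a label, a prediction, and a metadata field to slice on; the field name `hospital` and the counts are invented for illustration and are not taken from the study cited above.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    # Count total and correct predictions separately for each slice value.
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        s = r[slice_key]
        totals[s] += 1
        correct[s] += int(r["pred"] == r["label"])
    return {s: correct[s] / totals[s] for s in totals}

# Hypothetical evaluation set: 90 records from site A, 10 from site B.
records = (
    [{"hospital": "A", "label": 1, "pred": 1}] * 88
    + [{"hospital": "A", "label": 1, "pred": 0}] * 2
    + [{"hospital": "B", "label": 1, "pred": 1}] * 4
    + [{"hospital": "B", "label": 1, "pred": 0}] * 6
)

overall = sum(r["pred"] == r["label"] for r in records) / len(records)
by_slice = accuracy_by_slice(records, "hospital")
print(f"overall accuracy: {overall:.2f}")
for name, acc in sorted(by_slice.items()):
    print(f"  hospital {name}: {acc:.2f}")
```

Here the aggregate accuracy of 0.92 hides a slice on which the model is right only 40 percent of the time, which is the pattern a single score cannot reveal.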
51.3.4 3.4 Cleaning beats collecting, sometimes
When labels are noisy, adding more noisy labels can be less effective than relabeling a subset correctly. The benchmark literature has documented pervasive label errors even in canonical test sets, and correcting them changes which models appear to be best [4]. The practical implication is that a budget spent on careful relabeling of the most uncertain or most influential examples can yield more improvement than the same budget spent on naive collection.
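One simple way to decide which examples to relabel first is to rank them by how strongly a trained model disagrees with the label they were given. The sketch below is a heuristic under stated assumptions, not the method of the cited benchmark study: it assumes out-of-sample class probabilities are available for every example, and all names and numbers are hypothetical.

```python
def relabel_priority(examples, budget):
    """Return ids of the `budget` examples whose given label looks least plausible.

    examples: list of (example_id, given_label, predicted_probs) where
    predicted_probs maps each class name to a model probability.
    """
    def suspicion(ex):
        _, given_label, probs = ex
        # Low model probability for the given label means high suspicion.
        return 1.0 - probs.get(given_label, 0.0)

    ranked = sorted(examples, key=suspicion, reverse=True)
    return [ex[0] for ex in ranked[:budget]]

examples = [
    ("a", "cat", {"cat": 0.95, "dog": 0.05}),
    ("b", "cat", {"cat": 0.10, "dog": 0.90}),
    ("c", "dog", {"cat": 0.40, "dog": 0.60}),
    ("d", "dog", {"cat": 0.85, "dog": 0.15}),
]
to_review = relabel_priority(examples, budget=2)
print(to_review)
```

The ranking only nominates candidates for human review. A confident model can itself be wrong, so the flagged labels should be checked rather than overwritten automatically.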
51.4 4. How Quality and Quantity Shape What Models Can Learn
51.4.1 4.1 The role of quantity
More data reduces variance and lets a model fit finer structure without overfitting. The relationship is often regular enough to be described by a power law. Empirical scaling studies show that test loss \(L\) falls with dataset size \(N\) approximately as
\[ L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha}, \]
where \(L_\infty\) is the irreducible loss, \(N_c\) is a constant, and \(\alpha\) is a small positive exponent [5]. The exponent being small is itself a lesson: doubling data yields a predictable but diminishing improvement, so quantity alone faces sharply rising costs at the frontier.
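The arithmetic behind that diminishing return is easy to check. The sketch below evaluates the power law for an assumed exponent of \(\alpha = 0.095\) and hypothetical constants \(L_\infty = 1.0\) and \(N_c = 10^6\); these values are illustrative choices for this example, not figures taken from the text.

```python
def scaling_loss(n, l_inf, n_c, alpha):
    # Loss predicted by L(N) = L_inf + (N_c / N) ** alpha.
    return l_inf + (n_c / n) ** alpha

def data_multiplier(target_fraction, alpha):
    # Factor by which N must grow so that the reducible loss,
    # (N_c / N) ** alpha, shrinks to `target_fraction` of its current value.
    return target_fraction ** (-1.0 / alpha)

ALPHA = 0.095           # assumed exponent, for illustration only
L_INF, N_C = 1.0, 1e6   # hypothetical constants

n = 1e8
reducible_before = scaling_loss(n, L_INF, N_C, ALPHA) - L_INF
reducible_after = scaling_loss(2 * n, L_INF, N_C, ALPHA) - L_INF
doubling_ratio = reducible_after / reducible_before
halving_factor = data_multiplier(0.5, ALPHA)

print(f"doubling N multiplies the reducible loss by {doubling_ratio:.3f}")
print(f"halving the reducible loss needs about {halving_factor:,.0f}x more data")
```

With this exponent, doubling the dataset removes only about 6 percent of the reducible loss, and halving it requires more than a thousand times as much data.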
51.4.2 4.2 Quantity cannot fix the wrong distribution
Scaling improves performance only within the distribution the data represents. If the sample is biased, more of the same biased data converges to a confident wrong answer. Formally, the law of large numbers guarantees convergence to the expectation under the sampling distribution, not under the distribution you actually care about. A facial analysis system trained predominantly on lighter-skinned faces will not become fair simply by adding more lighter-skinned faces. The landmark audit of commercial systems found error rates far higher for darker-skinned women, a disparity widely attributed to unrepresentative training and benchmark data [6].
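The convergence argument can be seen in a few lines of simulation. The sketch below assumes a population split evenly between two groups with different positive rates, and a collection process that draws 90 percent of its examples from one group; all the rates and names are hypothetical.

```python
import random

def biased_estimate(n, p_group_a=0.9, rate_a=0.2, rate_b=0.8, seed=0):
    # Estimate the positive rate from n examples collected by a process
    # that over-samples group A relative to its true 50% share.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        rate = rate_a if rng.random() < p_group_a else rate_b
        hits += rng.random() < rate
    return hits / n

true_rate = 0.5 * 0.2 + 0.5 * 0.8       # rate in the population we care about
sampling_rate = 0.9 * 0.2 + 0.1 * 0.8   # rate under the biased collection process

estimates = {n: biased_estimate(n) for n in (100, 10_000, 200_000)}
for n, est in estimates.items():
    print(f"n={n:7d}  estimate={est:.3f}")
print(f"sampling expectation={sampling_rate:.2f}  population value={true_rate:.2f}")
```

As the sample grows, the estimate settles near 0.26, the expectation under the collection process, and moves no closer to the population value of 0.5.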
51.4.3 4.3 The role of quality and the data budget trade-off
Quality and quantity interact. There is growing evidence that careful filtering and deduplication of a large corpus can match or beat training on a larger unfiltered corpus, because low-quality and duplicated examples waste capacity and can actively harm generalization [7]. Deduplication in particular reduces memorization and improves the efficiency of every training step. The emerging consensus is that the right question is not simply how much data, but how much high-quality, diverse, non-redundant data.
A simple way to think about the trade-off is to weight examples by an estimate of their cleanliness or informativeness:

    # Prefer informative, clean examples over raw volume.
    def score_example(ex):
        quality = ex.annotator_agreement       # in [0, 1]; higher is cleaner
        novelty = 1.0 - ex.near_duplicate_sim  # in [0, 1]; higher is less redundant
        return quality * novelty

    def curate(pool, budget):
        # Keep the `budget` highest-scoring examples from the candidate pool.
        return sorted(pool, key=score_example, reverse=True)[:budget]

51.4.4 4.4 What data fundamentally cannot provide
Some limits are not about volume or cleanliness but about information content. A model cannot learn a distinction that the features never encode. If two classes are genuinely indistinguishable given the available inputs, the irreducible noise term \(\sigma^2\) is positive and no dataset removes it. Likewise, observational data alone cannot in general identify causal effects, because correlation underdetermines causation without assumptions or interventions [8]. Recognizing these ceilings prevents the common error of trying to solve a measurement or design problem by collecting yet more rows.
51.4.5 4.5 Diversity and coverage
Beyond raw quality and quantity sits coverage. A dataset of one million near-identical examples carries little more information than a few hundred. What matters for generalization is whether the data spans the regions of input space the model will face, including the rare and adversarial corners. This is why curated evaluation sets deliberately include hard slices and edge cases, and why active learning, which selects the most informative examples to label next, can outperform random collection at equal cost. Coverage is the bridge between quantity and quality: it asks not only how many examples and how clean, but whether they represent the full problem.
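Uncertainty sampling is one common form of the active learning mentioned above. The sketch below is a minimal version, assuming the current model can output class probabilities for each unlabeled example; the identifiers and probabilities are hypothetical, and other selection criteria (diversity, expected model change) are not shown.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability classes contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, k):
    # pool_probs maps an unlabeled example id to the model's class probabilities.
    # Pick the k examples the model is least sure about.
    ranked = sorted(pool_probs, key=lambda i: entropy(pool_probs[i]), reverse=True)
    return ranked[:k]

pool_probs = {
    "u1": [0.98, 0.02],
    "u2": [0.55, 0.45],
    "u3": [0.70, 0.30],
    "u4": [0.50, 0.50],
}
chosen = select_for_labeling(pool_probs, k=2)
print(chosen)
```

Examples the model already classifies confidently, such as `u1`, are left unlabeled, so the labeling budget goes to the regions of input space where the current data says least.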
51.5 5. Practical Principles
The arguments above converge on a small set of working principles for practitioners.
First, instrument the data-generating process, not just the model. Know how each label was produced, by whom, and under what guidelines, because that process is the real source of your supervision.
Second, measure quality directly. Track inter-annotator agreement, duplication rates, distribution shift between training and serving, and slice-level performance, and treat regressions in these as seriously as a failing unit test.
Third, spend the marginal dollar where the bias-variance decomposition says it will help. If variance dominates, collect more representative data. If the noise floor dominates, fix the labels. If bias dominates, change the model. The decomposition is a budgeting tool, not just a theoretical curiosity.
Fourth, prefer curation to accumulation once the data is large. Deduplicate, filter, and balance before reaching for a larger crawl, because capacity spent memorizing redundant or low-quality examples is capacity not spent learning the signal.
Fifth, respect the ceilings. When a problem is limited by information content or by the absence of interventions, no quantity of data closes the gap, and the honest move is to redesign the measurement or the experiment rather than scale the dataset.
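Inter-annotator agreement, named in the second principle, can be tracked with a chance-corrected statistic such as Cohen's kappa. The sketch below is a plain-Python version for two annotators labeling the same items; the annotations are made up for illustration, and a maintained library implementation would normally be preferable in production.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Agreement between two annotators, corrected for chance agreement.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

ann_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
ann_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(ann_1, ann_2)
print(f"kappa = {kappa:.2f}")
```

The two annotators agree on 8 of 10 items, but half of that agreement would be expected by chance, so kappa comes out at 0.6 rather than 0.8. Tracking this number per labeling batch is one way to catch a drifting guideline early.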
51.6 6. Conclusion
Machine learning is the discipline of turning data into functions, and so the data is not an input to the real work. The data is the work. Architectures define what is learnable in principle, but the dataset decides what is learned in fact. Garbage in, garbage out is not folklore; it is a direct consequence of how loss functions and estimators behave when their supervision is corrupted. Quantity buys reduced variance along a diminishing power-law curve, quality and coverage decide whether that variance reduction points toward the right target, and certain ceilings remain that no volume of data can breach. The shift toward data-centric thinking is the field maturing into this recognition. The teams that win are not usually the ones with the cleverest architecture. They are the ones with the cleanest, most representative, and most thoughtfully curated data.
51.7 References
1. Andrew Ng, “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI.” DeepLearning.AI, 2021. https://www.deeplearning.ai/the-batch/a-chat-with-andrew-on-mlops-from-model-centric-to-data-centric-ai/
2. Long Ouyang et al., “Training language models to follow instructions with human feedback.” NeurIPS, 2022. https://arxiv.org/abs/2203.02155
3. John R. Zech et al., “Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.” PLOS Medicine, 2018. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683
4. Curtis G. Northcutt, Anish Athalye, and Jonas Mueller, “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2103.14749
5. Jared Kaplan et al., “Scaling Laws for Neural Language Models.” 2020. https://arxiv.org/abs/2001.08361
6. Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of Machine Learning Research, 2018. https://proceedings.mlr.press/v81/buolamwini18a.html
7. Katherine Lee et al., “Deduplicating Training Data Makes Language Models Better.” ACL, 2022. https://arxiv.org/abs/2107.06499
8. Judea Pearl, “Causality: Models, Reasoning, and Inference.” Cambridge University Press, 2nd edition, 2009. https://bayes.cs.ucla.edu/BOOK-2K/