51 The Primacy of Data

Machine learning inverts the classical model of software. In traditional programming, a human author specifies the rules and the computer applies them to inputs to produce outputs. In machine learning, the human supplies inputs and desired outputs, and the computer infers the rules. The data is not a passive resource consumed by the algorithm. The data is the specification. Whatever pattern, bias, gap, or noise lives in the training set becomes part of the learned function. This chapter argues that data is the true foundation of all machine learning, examines the recent shift toward data-centric thinking, formalizes the old maxim of garbage in, garbage out, and explores how the quality and quantity of data jointly determine what a model can and cannot learn.

The argument proceeds in four movements. First, a formal account of why the data, and not the architecture, is the object that fixes what a model becomes. Second, the historical and economic shift from a model-centric to a data-centric practice. Third, a precise treatment of how corrupted supervision corrupts the learned function. Fourth, an analysis of how quantity, quality, and coverage jointly bound what is learnable, including ceilings that no volume of data can breach. A reader who internalizes one idea should internalize this one: the model is a faithful image of its data, and faithfulness to bad data is indistinguishable, at training time, from competence.

The relationships among the central quantities can be summarized in advance.

flowchart TD
    A["World and data-generating process"] --> B["Training sample D"]
    B --> C["Hypothesis space H"]
    C --> D["Learned function h"]
    B --> E["Quantity controls variance"]
    B --> F["Quality controls noise floor"]
    B --> G["Coverage controls in-distribution scope"]
    E --> D
    F --> D
    G --> D
    D --> H["Behavior at deployment"]

51.1 1. Why Data Is the Foundation

51.1.1 1.1 Learning as inference from examples

A supervised learning problem assumes a joint distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ and an unknown target relationship between inputs and outputs. We never observe $\mathcal{D}$ directly. We observe a finite sample $D = \{(x_i, y_i)\}_{i=1}^{n}$, drawn (we hope) from the same distribution that will generate future inputs and labels. The learning algorithm searches a hypothesis space $\mathcal{H}$ for a function $h$ that approximates the target well on $D$, in the hope that low error on the sample implies low error on the distribution.

Make the goal precise. For a loss $\ell$, define the population risk (the quantity we truly care about) and the empirical risk (the only quantity we can compute) as

\[ R(h) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(h(x), y)\big], \qquad \hat{R}_D(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i). \]

Empirical risk minimization returns $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_D(h)$. Two facts about this object govern everything that follows. First, $\hat{R}_D$ is the only signal the optimizer receives, so it can do no better than the sample $D$ permits. Second, for any fixed $h$, $\hat{R}_D(h)$ is an unbiased estimate of $R(h)$ only when the sample is drawn from the same $\mathcal{D}$ that defines $R$. Break that assumption, by sampling from a different distribution or by corrupting the labels, and the optimizer continues to minimize $\hat{R}_D$ diligently while $R$ drifts out of reach. The algorithm has no way to notice.

This framing makes the dependence on data explicit. The algorithm can only ever know the target through the sample. If the sample misrepresents the distribution, the best possible hypothesis still inherits that misrepresentation. The model is a compression of its training data, and no optimizer, however powerful, can recover information that the data never contained. This is an instance of a more general principle sometimes called the data-processing inequality: post-processing cannot create information about the target that the input did not already carry.
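A small simulation can make the gap between empirical and population risk concrete. The sketch below is illustrative only: the one-dimensional task, the fixed classifier `h`, and the two sampling schemes are assumptions introduced here, not part of the formal setup. It scores the same classifier on a sample drawn from the deployment distribution and on a sample drawn from a truncated one.

```python
import random

def true_label(x):
    # Assumed ground truth: positive when x exceeds 0.6.
    return 1 if x > 0.6 else 0

def h(x):
    # A fixed, slightly wrong classifier: its threshold sits at 0.5.
    return 1 if x > 0.5 else 0

def empirical_risk(xs):
    # Zero-one loss averaged over the sample.
    return sum(h(x) != true_label(x) for x in xs) / len(xs)

rng = random.Random(0)
n = 20_000

# Sample from the deployment distribution: x ~ Uniform(0, 1).
faithful = [rng.random() for _ in range(n)]
# Biased sample: only the region x < 0.5 was ever collected.
biased = [rng.uniform(0.0, 0.5) for _ in range(n)]

risk_faithful = empirical_risk(faithful)
risk_biased = empirical_risk(biased)
print(f"faithful sample: {risk_faithful:.3f}")
print(f"biased sample:   {risk_biased:.3f}")
```

Under these assumptions the population risk is 0.1, the probability mass between the two thresholds. The faithful sample estimates it closely, while the biased sample contains no point in that band and therefore reports zero error for a classifier that is wrong one time in ten.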

51.1.2 1.2 The bias-variance view of data

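The expected error of a learned model decomposes into three parts: bias, variance, and irreducible noise. For squared loss, write $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and take expectations over both the training sample and the noise. The expected error at a point $x$ can then be written as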

\[ \mathbb{E}\big[(y - h(x))^2\big] = \underbrace{\big(\mathbb{E}[h(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(h(x) - \mathbb{E}[h(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}. \]

Two of these three terms are controlled by data. Variance shrinks as the sample grows, because a larger sample pins down the estimate more tightly. The irreducible noise $\sigma^2$ is a property of how the labels were generated, so noisy or inconsistent labeling raises the floor on achievable error no matter how much data you collect. Only bias is primarily a property of the model class. The lesson is that two of the three sources of error are addressed by collecting more data or by collecting cleaner data, not by changing the architecture.
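The claim that sample size moves variance but not bias can be checked with a short simulation. This is a minimal sketch under stated assumptions: the target $f(x) = x^2$, the noise level, the evaluation point `x0`, and the deliberately crude model class (a constant predictor equal to the mean training label) are all choices made here for illustration.

```python
import random

def f(x):
    # Assumed true regression function.
    return x * x

def fit_constant(n, sigma, rng):
    # "Train" the constant model: predict the mean of n noisy labels.
    ys = [f(rng.random()) + rng.gauss(0.0, sigma) for _ in range(n)]
    return sum(ys) / n

def bias_variance(n, x0=0.9, sigma=0.3, trials=1000, seed=0):
    # Refit on many independent training sets and measure the spread
    # of the prediction at the single point x0.
    rng = random.Random(seed)
    preds = [fit_constant(n, sigma, rng) for _ in range(trials)]
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance, sigma ** 2

for n in (10, 100, 1000):
    bias_sq, variance, noise = bias_variance(n)
    print(f"n={n:5d}  bias^2={bias_sq:.4f}  variance={variance:.5f}  noise={noise:.2f}")
```

In this setup the squared bias stays near $(1/3 - 0.81)^2 \approx 0.227$ however large $n$ becomes, because the constant model cannot represent the target, while the variance falls roughly in proportion to $1/n$. The noise term is fixed by the label-generating process and does not respond to either lever.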

51.1.3 1.3 The model is downstream of the data

It is tempting to treat the model architecture as the seat of intelligence and the data as fuel. The opposite framing is more accurate. The architecture defines a space of possible functions, and the data selects one of them. A transformer trained on medical records becomes a clinical model. The same transformer trained on legal filings becomes a legal model. The weights differ entirely, and that difference is authored by the data. When practitioners say a model has learned a spurious correlation, they are really saying the data contained that correlation and the model, being faithful, reproduced it.

51.2 2. The Shift Toward Data-Centric Thinking

51.2.1 2.1 From model-centric to data-centric

For much of the last two decades, progress in machine learning was measured by architectural innovation. Benchmarks held the data fixed and invited researchers to compete on models. This model-centric paradigm produced enormous advances, but it also created a blind spot. On many real-world problems, the marginal return from a new architecture is small compared to the return from fixing the dataset.

Andrew Ng and others have argued for a data-centric paradigm in which the model and code are held fixed and the data is systematically improved [1]. The shift is partly cultural and partly economic. As pretrained models and standard architectures became commodities, the differentiator moved to the data that nobody else has and that nobody else has cleaned.

51.2.2 2.2 Data work is the real work

Surveys of practitioners consistently report that data preparation consumes the majority of project time. Collection, cleaning, labeling, deduplication, and validation dominate the calendar, while model training is often a small fraction. This is not a sign of immature tooling. It reflects the fact that the hard part of machine learning is turning the messy world into a faithful sample.

A useful reframing is to treat the dataset as a versioned artifact with the same rigor applied to code. Datasets should be tested, reviewed, and held to acceptance criteria.

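# Treat data quality as a gate, not an afterthought.
import pandas as pd

VALID_LABELS = {"positive", "negative"}  # hypothetical label set; replace with your own

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    # Every label must come from the agreed vocabulary.
    assert df["label"].isin(VALID_LABELS).all(), "unknown label found"
    # Missing text yields a NaN length, which also fails this check.
    assert df["text"].str.len().gt(0).all(), "empty input found"
    # Fraction of rows whose text repeats an earlier row.
    dup_rate = df.duplicated(subset=["text"]).mean()
    assert dup_rate < 0.01, f"duplicate rate too high: {dup_rate:.3f}"
    return df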

51.2.3 2.3 Why this matters more as models scale

Large pretrained models amplify rather than reduce the importance of data. A foundation model trained on a web-scale corpus inherits the composition of that corpus, including its demographic skew, its factual errors, and its toxic fragments. Fine-tuning and alignment then depend on small, carefully curated datasets whose quality has outsized influence on behavior. The work on instruction tuning and reinforcement learning from human feedback showed that a relatively small set of high-quality human demonstrations and preferences can reshape a model’s behavior dramatically [2]. The leverage of data did not disappear with scale. It moved.

51.3 3. Garbage In, Garbage Out

51.3.1 3.1 The maxim made precise

Garbage in, garbage out is a slogan, but it has a formal core. A learning algorithm minimizes a loss defined with respect to the training labels. If those labels are systematically wrong, the algorithm faithfully minimizes the wrong objective.

Make this concrete for binary classification. Let the clean label be $y \in \{0, 1\}$ with clean posterior $\eta(x) = \Pr(y = 1 \mid x)$. Suppose each observed label is flipped independently with class-dependent rates $\rho_0 = \Pr(\tilde{y}=1 \mid y=0)$ and $\rho_1 = \Pr(\tilde{y}=0 \mid y=1)$. The posterior the model actually sees is

\[ \tilde{\eta}(x) = \Pr(\tilde{y}=1 \mid x) = (1-\rho_1)\,\eta(x) + \rho_0\,\big(1-\eta(x)\big). \]

In the symmetric case $\rho_0 = \rho_1 = \rho < \tfrac12$, this simplifies to $\tilde{\eta}(x) = (1-2\rho)\,\eta(x) + \rho$, which is a strictly increasing affine function of $\eta(x)$. Because the threshold $\eta(x) = \tfrac12$ maps to $\tilde{\eta}(x) = \tfrac12$, the decision boundary of the Bayes classifier is preserved: symmetric noise shrinks the margin and inflates the irreducible loss, but it does not move the optimal boundary. The classifier still aims at the right target, just with less confidence and a higher error floor.

The asymmetric case $\rho_0 \neq \rho_1$ is qualitatively different. Assuming $\rho_0 + \rho_1 < 1$, the point where $\tilde{\eta}(x) = \tfrac12$ corresponds to $\eta(x) = \tfrac{1/2 - \rho_0}{1 - \rho_0 - \rho_1} \neq \tfrac12$, so naive minimization on corrupted labels produces a biased boundary that systematically over-predicts the class whose labels are flipped less often, which is the class that absorbs the other's mislabeled examples. The model does not detect that the labels are garbage. It treats them as ground truth and shifts its boundary accordingly. Recovering the clean-optimal classifier then requires knowing or estimating the noise rates and correcting for them, for example through loss correction or surrogate losses that are provably robust to a known noise transition matrix [9]. The lesson is that not all garbage is equal: noise that is uniform across classes degrades gracefully, while structured, label-correlated noise biases the very thing you are trying to learn.
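The two cases can be checked numerically. The sketch below simply evaluates the formulas above; the function names are introduced here for illustration, and the condition $\rho_0 + \rho_1 < 1$ is assumed so that the noisy posterior remains increasing in the clean one.

```python
def noisy_posterior(eta, rho0, rho1):
    # Pr(noisy label = 1 | x) given the clean posterior eta = Pr(y = 1 | x).
    return (1.0 - rho1) * eta + rho0 * (1.0 - eta)

def clean_threshold(rho0, rho1):
    # Clean posterior at which the noisy posterior crosses 1/2.
    if rho0 + rho1 >= 1.0:
        raise ValueError("requires rho0 + rho1 < 1")
    return (0.5 - rho0) / (1.0 - rho0 - rho1)

# Symmetric noise: the boundary stays at eta = 1/2.
print(clean_threshold(0.1, 0.1))

# Asymmetric noise: 30% of true negatives are labeled positive,
# true positives are never flipped. The boundary moves below 1/2,
# so the naive classifier over-predicts class 1.
print(clean_threshold(0.3, 0.0))
```

With $\rho_0 = 0.3$ and $\rho_1 = 0$, a classifier fit to the noisy labels predicts the positive class whenever the clean posterior exceeds about 0.29 rather than 0.5, so every input with clean posterior between those values is assigned to the wrong class.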

51.3.2 3.2 Categories of garbage

Data quality problems are not monolithic. It helps to name the common failure modes:

Label noise: incorrect or inconsistent annotations, often from rushed or ambiguous labeling guidelines.
Sampling bias: the training distribution differs from the deployment distribution, so the model optimizes for a world it will not encounter.
Leakage: information available at training time that will not be available at prediction time, producing optimistic offline metrics that collapse in production.
Spurious correlations: features that predict the label in the sample but have no causal relationship, such as a watermark that happens to co-occur with a class.
Duplication and contamination: repeated records that distort the effective distribution, or test examples that leak into training and inflate reported performance.
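Of these failure modes, contamination is among the easiest to test for mechanically. The following is a minimal sketch, assuming text inputs and a deliberately simple normalization (lowercasing and whitespace collapsing); the helper names are hypothetical. It catches only exact matches after normalization, so near-duplicates and paraphrases would need a fuzzier method.

```python
import re

def normalize(text):
    # Lowercase and collapse runs of whitespace so trivial
    # formatting differences do not hide a duplicate.
    return re.sub(r"\s+", " ", text.strip().lower())

def contamination_rate(train_texts, test_texts):
    # Fraction of test examples whose normalized text also appears in training.
    if not test_texts:
        return 0.0
    train_keys = {normalize(t) for t in train_texts}
    hits = sum(normalize(t) in train_keys for t in test_texts)
    return hits / len(test_texts)

train = ["The cat sat.", "Dogs bark loudly"]
test = ["the  cat sat.", "Birds sing"]
print(contamination_rate(train, test))
```

A nonzero rate on a held-out set is a signal to rebuild the split before trusting any reported metric.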

51.3.3 3.3 Garbage is often invisible at training time

The insidious property of bad data is that the training metrics frequently look excellent. A model that exploits a spurious correlation or a leaked feature will report high accuracy on a validation set drawn from the same flawed source. The error surfaces only at deployment, when the correlation breaks or the leaked feature vanishes. This is why data validation, slice-based evaluation, and audits of the data-generating process matter more than a single aggregate score. A well-known illustration comes from medical imaging, where models learned to detect hospital-specific markers and scanner artifacts rather than disease, achieving strong test numbers while learning the wrong thing [3].
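Slice-based evaluation is simple to sketch. The example below assumes each evaluation record carries a slice identifier (here a hypothetical hospital name) alongside the true and predicted labels; the data is invented purely to show how an acceptable aggregate can hide a failing slice.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    # records: iterable of (slice_name, y_true, y_pred) tuples.
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        total[slice_name] += 1
        correct[slice_name] += int(y_true == y_pred)
    return {name: correct[name] / total[name] for name in total}

# Invented evaluation records: one large slice, one small slice.
records = (
    [("hospital_a", 1, 1)] * 95 + [("hospital_a", 1, 0)] * 5
    + [("hospital_b", 1, 1)] * 12 + [("hospital_b", 1, 0)] * 8
)

overall = sum(y == p for _, y, p in records) / len(records)
per_slice = accuracy_by_slice(records)
print(f"overall: {overall:.3f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.3f}")
```

Here the aggregate accuracy is about 0.89, which looks healthy, while the smaller slice sits at 0.60. A single headline number would not reveal that.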

51.3.4 3.4 Cleaning beats collecting, sometimes

When labels are noisy, adding more noisy labels can be less effective than relabeling a subset correctly. The benchmark literature has documented pervasive label errors even in canonical test sets, and correcting them changes which models appear to be best [4]. The practical implication is that a budget spent on careful relabeling of the most uncertain or most influential examples can yield more improvement than the same budget spent on naive collection.

51.4 4. How Quality and Quantity Shape What Models Can Learn

51.4.1 4.1 The role of quantity

More data reduces variance and lets a model fit finer structure without overfitting. The relationship is often regular enough to be described by a power law. Empirical scaling studies show that test loss $L$ falls with dataset size $N$ approximately as

\[ L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha}, \]

where $L_\infty$ is the irreducible loss, $N_c$ is a constant, and $\alpha$ is a small positive exponent [5]. The exponent being small is itself a lesson: doubling data yields a predictable but diminishing improvement, so quantity alone faces sharply rising costs at the frontier.
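The cost implied by a small exponent is easy to compute from the formula above. In this sketch the constants `l_inf`, `n_c`, and `alpha` are placeholder values chosen for illustration, not fitted values from any study. Solving $(N_c / (kN))^\alpha = \tfrac12 (N_c / N)^\alpha$ for $k$ gives the data multiplier needed to halve the reducible part of the loss.

```python
def power_law_loss(n, l_inf, n_c, alpha):
    # L(N) = L_inf + (N_c / N)^alpha
    return l_inf + (n_c / n) ** alpha

def data_multiplier_to_halve(alpha):
    # Factor k such that the reducible term at k*N is half its value at N:
    # k = 2^(1 / alpha).
    return 2.0 ** (1.0 / alpha)

# Placeholder constants, for illustration only.
l_inf, n_c, alpha = 1.7, 5e13, 0.1

for n in (1e9, 2e9, 4e9):
    print(f"N={n:.0e}  L={power_law_loss(n, l_inf, n_c, alpha):.4f}")

print(f"multiplier to halve the reducible loss: {data_multiplier_to_halve(alpha):.0f}x")
```

With an exponent of 0.1, halving the reducible loss takes roughly a thousand times more data, which is the sense in which quantity alone faces sharply rising costs.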

51.4.2 4.2 Quantity cannot fix the wrong distribution

Scaling improves performance only within the distribution the data represents. If the sample is biased, more of the same biased data converges to a confident wrong answer. Formally, the law of large numbers guarantees that $\hat{R}_D(h)$ converges to the expectation under the sampling distribution $\mathcal{D}_{\text{train}}$, not under the deployment distribution $\mathcal{D}_{\text{test}}$ you actually care about. When these differ, the gap does not close with more data.

The gap can be made quantitative. If training and deployment share the same conditional $\Pr(y \mid x)$ but differ in the input marginal (the covariate-shift setting), the deployment risk relates to a reweighted training risk through the density ratio $w(x) = p_{\text{test}}(x) / p_{\text{train}}(x)$:

\[ R_{\text{test}}(h) = \mathbb{E}_{x \sim \mathcal{D}_{\text{train}}}\big[\, w(x)\, \mathbb{E}[\ell \mid x] \,\big]. \]

This identity carries a hard lesson. Wherever the training density is zero but the deployment density is not, $w(x)$ is infinite and the identity breaks down: no amount of in-support data informs the model about regions it never saw. Reweighting can correct for under-representation only where there is some representation to reweight. A region with no examples is a region the model is guessing about.
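A discrete example makes both halves of the argument checkable by hand. The sketch below assumes inputs fall into a few named regions with known probabilities and known conditional expected loss per region; all numbers and function names are invented for illustration. It compares the deployment risk computed directly with the risk recovered by reweighting the training distribution.

```python
def direct_test_risk(p_test, cond_loss):
    # Expected loss under the deployment distribution.
    return sum(p * cond_loss[x] for x, p in p_test.items())

def reweighted_train_risk(p_train, p_test, cond_loss):
    # E_train[w(x) * E[loss | x]] with w(x) = p_test(x) / p_train(x),
    # summed only over regions the training data actually covers.
    risk = 0.0
    for x, p_tr in p_train.items():
        if p_tr == 0.0:
            continue
        w = p_test.get(x, 0.0) / p_tr
        risk += p_tr * w * cond_loss[x]
    return risk

def uncovered_mass(p_train, p_test):
    # Deployment probability sitting where training density is zero.
    return sum(p for x, p in p_test.items() if p_train.get(x, 0.0) == 0.0)

cond_loss = {"a": 0.05, "b": 0.30, "c": 0.50}
p_train = {"a": 0.7, "b": 0.3}

# Case 1: deployment stays inside the training support.
p_test_in = {"a": 0.2, "b": 0.8}
print(direct_test_risk(p_test_in, cond_loss),
      reweighted_train_risk(p_train, p_test_in, cond_loss))

# Case 2: deployment puts 30% of its mass on an unseen region "c".
p_test_out = {"a": 0.2, "b": 0.5, "c": 0.3}
print(direct_test_risk(p_test_out, cond_loss),
      reweighted_train_risk(p_train, p_test_out, cond_loss),
      uncovered_mass(p_train, p_test_out))
```

In the first case the two quantities agree, so reweighting recovers the deployment risk. In the second, the reweighted estimate silently omits the contribution of region `c`, and the uncovered mass reports how much of deployment the training data says nothing about.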

This is the formal heart of why a facial analysis system trained predominantly on lighter-skinned faces does not become fair simply by adding more lighter-skinned faces. The landmark audit of commercial gender-classification systems found error rates far higher for darker-skinned women than for lighter-skinned men, a disparity widely attributed to unrepresentative training and benchmark data [6]. The fix is not more data; it is more data from the under-represented region.

51.4.3 4.3 The role of quality and the data budget trade-off

Quality and quantity interact. There is growing evidence that careful filtering and deduplication of a large corpus can match or beat training on a larger unfiltered corpus, because low-quality and duplicated examples waste capacity and can actively harm generalization [7]. Deduplication in particular reduces memorization and improves the efficiency of every training step. The emerging consensus is that the right question is not simply how much data, but how much high-quality, diverse, non-redundant data.

A simple way to think about the trade-off is to weight examples by an estimate of their cleanliness or informativeness:

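# Prefer informative, clean examples over raw volume.
# `pool` (candidate examples) and `budget` (number to keep) are assumed to be defined by the caller.
def score_example(ex):
    quality = ex.annotator_agreement      # higher is cleaner
    novelty = 1.0 - ex.near_duplicate_sim # higher is less redundant
    return quality * novelty

# Keep the `budget` highest-scoring examples.
curated = sorted(pool, key=score_example, reverse=True)[:budget]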

51.4.4 4.4 What data fundamentally cannot provide

Some limits are not about volume or cleanliness but about information content. A model cannot learn a distinction that the features never encode. If two classes are genuinely indistinguishable given the available inputs, the irreducible noise term $\sigma^2$ is positive and no dataset removes it. Likewise, observational data alone cannot in general identify causal effects, because correlation underdetermines causation without assumptions or interventions [8]. Recognizing these ceilings prevents the common error of trying to solve a measurement or design problem by collecting yet more rows.

51.4.5 4.5 Diversity and coverage

Beyond raw quality and quantity sits coverage. A dataset of one million near-identical examples carries little more information than a few hundred. What matters for generalization is whether the data spans the regions of input space the model will face, including the rare and adversarial corners. This is why curated evaluation sets deliberately include hard slices and edge cases, and why active learning, which selects the most informative examples to label next, can outperform random collection at equal cost. Coverage is the bridge between quantity and quality: it asks not only how many examples and how clean, but whether they represent the full problem.
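One common active-learning heuristic, uncertainty sampling, can be sketched in a few lines. It is offered here as one illustrative selection rule among many, not as the method the text has in mind; the function name is hypothetical, and the inputs are assumed to be a binary classifier's predicted probabilities for the positive class on an unlabeled pool.

```python
def select_most_uncertain(probabilities, k):
    # Rank unlabeled examples by how close the predicted P(y = 1)
    # is to 0.5 and return the indices of the k closest.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]

# Predicted probabilities on a small unlabeled pool (invented values).
pool_probs = [0.98, 0.52, 0.10, 0.45, 0.70]
print(select_most_uncertain(pool_probs, k=2))
```

The rule spends the labeling budget on examples near the current decision boundary, where a new label is most likely to change the model, rather than on examples the model already classifies confidently.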

51.4.6 4.6 A worked example: when relabeling beats collecting

The three forces (quantity, quality, coverage) can be weighed numerically with the bias-variance decomposition as a budget tool. Consider a binary task where a model trained on $n = 10{,}000$ examples reaches 12% test error. Suppose careful auditing reveals two facts: the labels carry symmetric noise at rate $\rho = 0.10$, and an estimated 4 percentage points of the error are variance from insufficient data, with the remainder bias plus the noise-inflated floor.

Symmetric noise at $\rho = 0.10$ does not move the decision boundary (Section 3.1), but it raises the achievable error floor and, more importantly, slows learning by injecting wrong gradients. A back-of-envelope comparison of two equal-cost interventions, each costing the price of labeling 10,000 examples, makes the trade-off vivid.

Collect: double the data to 20,000 noisy examples. Under a power law $L(N) \approx L_\infty + (N_c/N)^\alpha$, doubling $N$ multiplies the reducible term by $2^{-\alpha}$. Even with a generous exponent of $\alpha \approx 0.4$ this factor is about 0.76, cutting the 4-point variance contribution to roughly 3 points; with smaller exponents the gain is smaller still. The noise floor is untouched. Net improvement: about 1 point at best.
Clean: spend the same budget relabeling the original 10,000 examples, driving $\rho$ from 0.10 toward 0.02. This lowers the noise floor and removes the systematic drag of wrong gradients, often recovering several points at once while leaving the data quantity unchanged.

The qualitative conclusion is robust to the exact numbers: when the error budget is dominated by the noise floor rather than by variance, cleaning dominates collecting, because the power-law return on quantity is shallow while the return on removing label noise is direct. The decomposition tells you which lever to pull before you spend the money.
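The sensitivity of the Collect option to the exponent can be tabulated directly. This sketch assumes, as the example does, that the 4-point variance contribution behaves like the reducible term of the power law; the exponents tried are illustrative choices, not measured values.

```python
def variance_after_doubling(variance_points, alpha):
    # Doubling N scales the reducible term by 2^(-alpha).
    return variance_points * 2.0 ** (-alpha)

variance_points = 4.0  # variance share of the error, in percentage points

for alpha in (0.1, 0.2, 0.4):
    remaining = variance_after_doubling(variance_points, alpha)
    gain = variance_points - remaining
    print(f"alpha={alpha:.1f}  remaining={remaining:.2f}  gain={gain:.2f} points")
```

Across this range the gain from doubling stays at or below about one point, so any cleaning effort expected to recover more than that is the better use of the same budget.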

51.5 5. Practical Principles

The arguments above converge on a small set of working principles for practitioners.

First, instrument the data-generating process, not just the model. Know how each label was produced, by whom, and under what guidelines, because that process is the real source of your supervision.

Second, measure quality directly. Track inter-annotator agreement, duplication rates, distribution shift between training and serving, and slice-level performance, and treat regressions in these as seriously as a failing unit test.

Third, spend the marginal dollar where the bias-variance decomposition says it will help. If variance dominates, collect more representative data. If the noise floor dominates, fix the labels. If bias dominates, change the model. The decomposition is a budgeting tool, not just a theoretical curiosity.

Fourth, prefer curation to accumulation once the data is large. Deduplicate, filter, and balance before reaching for a larger crawl, because capacity spent memorizing redundant or low-quality examples is capacity not spent learning the signal.

Fifth, respect the ceilings. When a problem is limited by information content or by the absence of interventions, no quantity of data closes the gap, and the honest move is to redesign the measurement or the experiment rather than scale the dataset.

51.6 6. Conclusion

Machine learning is the discipline of turning data into functions, and so the data is not an input to the real work. The data is the work. Architectures define what is learnable in principle, but the dataset decides what is learned in fact. Garbage in, garbage out is not folklore; it is a direct consequence of how loss functions and estimators behave when their supervision is corrupted. Quantity buys reduced variance along a diminishing power-law curve, quality and coverage decide whether that variance reduction points toward the right target, and certain ceilings remain that no volume of data can breach. The shift toward data-centric thinking is the field maturing into this recognition. The teams that win are not usually the ones with the cleverest architecture. They are the ones with the cleanest, most representative, and most thoughtfully curated data.

51.7 References

[1] Andrew Ng, “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI.” DeepLearning.AI, 2021. https://www.deeplearning.ai/the-batch/a-chat-with-andrew-on-mlops-from-model-centric-to-data-centric-ai/
[2] Long Ouyang et al., “Training language models to follow instructions with human feedback.” NeurIPS, 2022. https://arxiv.org/abs/2203.02155
[3] John R. Zech et al., “Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.” PLOS Medicine, 2018. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683
[4] Curtis G. Northcutt, Anish Athalye, and Jonas Mueller, “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks.” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2103.14749
[5] Jared Kaplan et al., “Scaling Laws for Neural Language Models.” 2020. https://arxiv.org/abs/2001.08361
[6] Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of Machine Learning Research, 2018. https://proceedings.mlr.press/v81/buolamwini18a.html
[7] Katherine Lee et al., “Deduplicating Training Data Makes Language Models Better.” ACL, 2022. https://arxiv.org/abs/2107.06499
[8] Judea Pearl, “Causality: Models, Reasoning, and Inference.” Cambridge University Press, 2nd edition, 2009. https://bayes.cs.ucla.edu/BOOK-2K/
[9] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu, “Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://doi.org/10.1109/CVPR.2017.240

# The Primacy of Data Machine learning inverts the classical model of software. In traditional programming, a human author specifies the rules and the computer applies them to inputs to produce outputs. In machine learning, the human supplies inputs and desired outputs, and the computer infers the rules. The data is not a passive resource consumed by the algorithm. The data is the specification. Whatever pattern, bias, gap, or noise lives in the training set becomes part of the learned function. This chapter argues that data is the true foundation of all machine learning, examines the recent shift toward data-centric thinking, formalizes the old maxim of garbage in, garbage out, and explores how the quality and quantity of data jointly determine what a model can and cannot learn. The argument proceeds in four movements. First, a formal account of why the data, and not the architecture, is the object that fixes what a model becomes. Second, the historical and economic shift from a model-centric to a data-centric practice. Third, a precise treatment of how corrupted supervision corrupts the learned function. Fourth, an analysis of how quantity, quality, and coverage jointly bound what is learnable, including ceilings that no volume of data can breach. A reader who internalizes one idea should internalize this one: the model is a faithful image of its data, and faithfulness to bad data is indistinguishable, at training time, from competence. The relationships among the central quantities can be summarized in advance. ```{mermaid} flowchart TD A["World and data-generating process"] --> B["Training sample D"] B --> C["Hypothesis space H"] C --> D["Learned function h"] B --> E["Quantity controls variance"] B --> F["Quality controls noise floor"] B --> G["Coverage controls in-distribution scope"] E --> D F --> D G --> D D --> H["Behavior at deployment"] ``` ## 1. 
Why Data Is the Foundation ### 1.1 Learning as inference from examples A supervised learning problem assumes a joint distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ and an unknown target relationship between inputs and outputs. We never observe $\mathcal{D}$ directly. We observe a finite sample $D = \{(x_i, y_i)\}_{i=1}^{n}$, drawn (we hope) from the same distribution that will generate future inputs and labels. The learning algorithm searches a hypothesis space $\mathcal{H}$ for a function $h$ that approximates the target well on $D$, in the hope that low error on the sample implies low error on the distribution. Make the goal precise. For a loss $\ell$, define the **population risk** (the quantity we truly care about) and the **empirical risk** (the only quantity we can compute) as $$ R(h) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(h(x), y)\big], \qquad \hat{R}_D(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i). $$ Empirical risk minimization returns $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_D(h)$. Two facts about this object govern everything that follows. First, $\hat{R}_D$ is the *only* signal the optimizer receives, so it can do no better than the sample $D$ permits. Second, $\hat{R}_D$ is an unbiased estimate of $R$ *only when the sample is drawn from the same $\mathcal{D}$ that defines $R$*. Break that assumption, by sampling from a different distribution or by corrupting the labels, and the optimizer continues to minimize $\hat{R}_D$ diligently while $R$ drifts out of reach. The algorithm has no way to notice. This framing makes the dependence on data explicit. The algorithm can only ever know the target through the sample. If the sample misrepresents the distribution, the best possible hypothesis still inherits that misrepresentation. The model is a compression of its training data, and no optimizer, however powerful, can recover information that the data never contained. 
This is an instance of a more general principle sometimes called the data-processing inequality: post-processing cannot create information about the target that the input did not already carry. ### 1.2 The bias-variance view of data The expected error of a learned model decomposes into three parts: bias, variance, and irreducible noise. For squared loss, the expected error at a point can be written as $$ \mathbb{E}\big[(y - h(x))^2\big] = \underbrace{\big(\mathbb{E}[h(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(h(x) - \mathbb{E}[h(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}. $$ Two of these three terms are controlled by data. Variance shrinks as the sample grows, because a larger sample pins down the estimate more tightly. The irreducible noise $\sigma^2$ is a property of how the labels were generated, so noisy or inconsistent labeling raises the floor on achievable error no matter how much data you collect. Only bias is primarily a property of the model class. The lesson is that two of the three sources of error are addressed by collecting more data or by collecting cleaner data, not by changing the architecture. ### 1.3 The model is downstream of the data It is tempting to treat the model architecture as the seat of intelligence and the data as fuel. The opposite framing is more accurate. The architecture defines a space of possible functions, and the data selects one of them. A transformer trained on medical records becomes a clinical model. The same transformer trained on legal filings becomes a legal model. The weights differ entirely, and that difference is authored by the data. When practitioners say a model has learned a spurious correlation, they are really saying the data contained that correlation and the model, being faithful, reproduced it. ## 2. 
The Shift Toward Data-Centric Thinking ### 2.1 From model-centric to data-centric For much of the last two decades, progress in machine learning was measured by architectural innovation. Benchmarks held the data fixed and invited researchers to compete on models. This model-centric paradigm produced enormous advances, but it also created a blind spot. On many real-world problems, the marginal return from a new architecture is small compared to the return from fixing the dataset. Andrew Ng and others have argued for a data-centric paradigm in which the model and code are held fixed and the data is systematically improved [1]. The shift is partly cultural and partly economic. As pretrained models and standard architectures became commodities, the differentiator moved to the data that nobody else has and that nobody else has cleaned. ### 2.2 Data work is the real work Surveys of practitioners consistently report that data preparation consumes the majority of project time. Collection, cleaning, labeling, deduplication, and validation dominate the calendar, while model training is often a small fraction. This is not a sign of immature tooling. It reflects the fact that the hard part of machine learning is turning the messy world into a faithful sample. A useful reframing is to treat the dataset as a versioned artifact with the same rigor applied to code. Datasets should be tested, reviewed, and held to acceptance criteria. ```python # Treat data quality as a gate, not an afterthought. def validate_batch(df): assert df["label"].isin(VALID_LABELS).all(), "unknown label found" assert df["text"].str.len().gt(0).all(), "empty input found" dup_rate = df.duplicated(subset=["text"]).mean() assert dup_rate < 0.01, f"duplicate rate too high: {dup_rate:.3f}" return df ``` ### 2.3 Why this matters more as models scale Large pretrained models amplify rather than reduce the importance of data. 
A foundation model trained on a web-scale corpus inherits the composition of that corpus, including its demographic skew, its factual errors, and its toxic fragments. Fine-tuning and alignment then depend on small, carefully curated datasets whose quality has outsized influence on behavior. The work on instruction tuning and reinforcement learning from human feedback showed that a relatively small set of high-quality human demonstrations and preferences can reshape a model's behavior dramatically [2]. The leverage of data did not disappear with scale. It moved. ## 3. Garbage In, Garbage Out ### 3.1 The maxim made precise Garbage in, garbage out is a slogan, but it has a formal core. A learning algorithm minimizes a loss defined with respect to the training labels. If those labels are systematically wrong, the algorithm faithfully minimizes the wrong objective. Make this concrete for binary classification. Let the clean label be $y \in \{0, 1\}$ with clean posterior $\eta(x) = \Pr(y = 1 \mid x)$. Suppose each observed label is flipped independently with class-dependent rates $\rho_0 = \Pr(\tilde{y}=1 \mid y=0)$ and $\rho_1 = \Pr(\tilde{y}=0 \mid y=1)$. The posterior the model actually sees is $$ \tilde{\eta}(x) = \Pr(\tilde{y}=1 \mid x) = (1-\rho_1)\,\eta(x) + \rho_0\,\big(1-\eta(x)\big). $$ In the **symmetric** case $\rho_0 = \rho_1 = \rho < \tfrac12$, this simplifies to $\tilde{\eta}(x) = (1-2\rho)\,\eta(x) + \rho$, which is a strictly increasing affine function of $\eta(x)$. Because the threshold $\eta(x) = \tfrac12$ maps to $\tilde{\eta}(x) = \tfrac12$, the *decision boundary* of the Bayes classifier is preserved: symmetric noise shrinks the margin and inflates the irreducible loss, but it does not move the optimal boundary. The classifier still aims at the right target, just with less confidence and a higher error floor. The **asymmetric** case $\rho_0 \neq \rho_1$ is qualitatively different. 
Now the point where $\tilde{\eta}(x) = \tfrac12$ corresponds to $\eta(x) = \tfrac{1/2 - \rho_0}{1 - \rho_0 - \rho_1} \neq \tfrac12$ (assuming $\rho_0 + \rho_1 < 1$), so naive minimization on corrupted labels produces a *biased* boundary that systematically over-predicts the class toward which labels are more often flipped, that is, the class whose own labels are less corrupted. The model does not detect that the labels are garbage. It treats them as ground truth and shifts its boundary accordingly. Recovering the clean-optimal classifier then requires knowing or estimating the noise rates and correcting for them, for example through loss correction or surrogate losses that are provably robust to a known noise transition matrix [9]. The lesson is that not all garbage is equal: noise that is uniform across classes degrades gracefully, while structured, label-correlated noise biases the very thing you are trying to learn.

### 3.2 Categories of garbage

Data quality problems are not monolithic. It helps to name the common failure modes:

- Label noise: incorrect or inconsistent annotations, often from rushed or ambiguous labeling guidelines.
- Sampling bias: the training distribution differs from the deployment distribution, so the model optimizes for a world it will not encounter.
- Leakage: information available at training time that will not be available at prediction time, producing optimistic offline metrics that collapse in production.
- Spurious correlations: features that predict the label in the sample but have no causal relationship, such as a watermark that happens to co-occur with a class.
- Duplication and contamination: repeated records that distort the effective distribution, or test examples that leak into training and inflate reported performance.

### 3.3 Garbage is often invisible at training time

The insidious property of bad data is that the training metrics frequently look excellent. A model that exploits a spurious correlation or a leaked feature will report high accuracy on a validation set drawn from the same flawed source.
The error surfaces only at deployment, when the correlation breaks or the leaked feature vanishes. This is why data validation, slice-based evaluation, and audits of the data-generating process matter more than a single aggregate score. A well-known illustration comes from medical imaging, where models learned to detect hospital-specific markers and scanner artifacts rather than disease, achieving strong test numbers while learning the wrong thing [3].

### 3.4 Cleaning beats collecting, sometimes

When labels are noisy, adding more noisy labels can be less effective than relabeling a subset correctly. The benchmark literature has documented pervasive label errors even in canonical test sets, and correcting them changes which models appear to be best [4]. The practical implication is that a budget spent on careful relabeling of the most uncertain or most influential examples can yield more improvement than the same budget spent on naive collection.

## 4. How Quality and Quantity Shape What Models Can Learn

### 4.1 The role of quantity

More data reduces variance and lets a model fit finer structure without overfitting. The relationship is often regular enough to be described by a power law. Empirical scaling studies show that test loss $L$ falls with dataset size $N$ approximately as

$$
L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha},
$$

where $L_\infty$ is the irreducible loss, $N_c$ is a constant, and $\alpha$ is a small positive exponent [5]. The exponent being small is itself a lesson: doubling data yields a predictable but diminishing improvement, so quantity alone faces sharply rising costs at the frontier.

### 4.2 Quantity cannot fix the wrong distribution

Scaling improves performance only within the distribution the data represents. If the sample is biased, more of the same biased data converges to a confident wrong answer.
Formally, the law of large numbers guarantees that $\hat{R}_D(h)$ converges to the expectation under the *sampling* distribution $\mathcal{D}_{\text{train}}$, not under the *deployment* distribution $\mathcal{D}_{\text{test}}$ you actually care about. When these differ, the gap does not close with more data.

The gap can be made quantitative. If training and deployment share the same conditional $\Pr(y \mid x)$ but differ in the input marginal (the covariate-shift setting), the deployment risk relates to a reweighted training risk through the density ratio $w(x) = p_{\text{test}}(x) / p_{\text{train}}(x)$:

$$
R_{\text{test}}(h) = \mathbb{E}_{x \sim \mathcal{D}_{\text{train}}}\big[\, w(x)\, \mathbb{E}[\ell \mid x] \,\big].
$$

This identity carries a hard lesson. Wherever the training data has *zero* density but the deployment data does not, $w(x)$ is infinite and the reweighted expectation is undefined: there is no amount of in-support data that informs the model about regions it never saw. Reweighting can correct for under-representation only where there is *some* representation to reweight. A region with no examples is a region the model is guessing about.

This is the formal heart of why a facial analysis system trained predominantly on lighter-skinned faces does not become fair simply by adding more lighter-skinned faces. The landmark audit of commercial systems found error rates far higher for darker-skinned women, a disparity widely attributed to unrepresentative training data [6]. The fix is not more data; it is more data *from the under-represented region*.

### 4.3 The role of quality and the data budget trade-off

Quality and quantity interact. There is growing evidence that careful filtering and deduplication of a large corpus can match or beat training on a larger unfiltered corpus, because low-quality and duplicated examples waste capacity and can actively harm generalization [7]. Deduplication in particular reduces memorization and improves the efficiency of every training step.
The emerging consensus is that the right question is not simply how much data, but how much high-quality, diverse, non-redundant data.

A simple way to think about the trade-off is to weight examples by an estimate of their cleanliness or informativeness:

```python
# Prefer informative, clean examples over raw volume.
# Assumes each example exposes `annotator_agreement` and `near_duplicate_sim`,
# both in [0, 1], and that `pool` and `budget` are supplied by the caller.
def score_example(ex):
    quality = ex.annotator_agreement          # higher is cleaner
    novelty = 1.0 - ex.near_duplicate_sim     # higher is less redundant
    return quality * novelty

curated = sorted(pool, key=score_example, reverse=True)[:budget]
```

### 4.4 What data fundamentally cannot provide

Some limits are not about volume or cleanliness but about information content. A model cannot learn a distinction that the features never encode. If two classes are genuinely indistinguishable given the available inputs, the irreducible noise term $\sigma^2$ is positive and no dataset removes it. Likewise, observational data alone cannot in general identify causal effects, because correlation underdetermines causation without assumptions or interventions [8]. Recognizing these ceilings prevents the common error of trying to solve a measurement or design problem by collecting yet more rows.

### 4.5 Diversity and coverage

Beyond raw quality and quantity sits coverage. A dataset of one million near-identical examples carries little more information than a few hundred. What matters for generalization is whether the data spans the regions of input space the model will face, including the rare and adversarial corners. This is why curated evaluation sets deliberately include hard slices and edge cases, and why active learning, which selects the most informative examples to label next, can outperform random collection at equal cost. Coverage is the bridge between quantity and quality: it asks not only how many examples and how clean, but whether they represent the full problem.
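
The idea of selecting the most informative examples to label next can be illustrated with uncertainty sampling, one common active-learning heuristic. The sketch below is a minimal illustration and not a prescribed method: it assumes a binary classifier that returns a probability for the positive class, and the names `select_for_labeling` and `toy_predict_proba`, along with the numeric pool, are hypothetical stand-ins introduced only for this example.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def select_for_labeling(unlabeled, predict_proba, budget):
    """Return the `budget` items whose predicted probability is most uncertain."""
    ranked = sorted(
        unlabeled,
        key=lambda x: binary_entropy(predict_proba(x)),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical stand-in for a trained model: a logistic curve on one feature.
def toy_predict_proba(x):
    return 1.0 / (1.0 + math.exp(-x))

pool = [-4.0, -1.0, -0.1, 0.2, 3.0, 5.0]
chosen = select_for_labeling(pool, toy_predict_proba, budget=2)
print(chosen)  # selects [-0.1, 0.2], the points nearest the decision boundary
```

Uncertainty alone tends to concentrate the labeling budget near the current boundary, so in practice it is often paired with a diversity criterion to keep coverage of the rest of the input space.
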
### 4.6 A worked example: when relabeling beats collecting

The three forces (quantity, quality, coverage) can be weighed numerically with the bias-variance decomposition as a budget tool. Consider a binary task where a model trained on $n = 10{,}000$ examples reaches 12% test error. Suppose careful auditing reveals two facts: the labels carry symmetric noise at rate $\rho = 0.10$, and an estimated 4 percentage points of the error are variance from insufficient data, with the remainder bias plus the noise-inflated floor.

Symmetric noise at $\rho = 0.10$ does not move the decision boundary (Section 3.1), but it raises the achievable error floor and, more importantly, slows learning by injecting wrong gradients. A back-of-envelope comparison of two equal-cost interventions, each costing the price of labeling 10,000 examples, makes the trade-off vivid.

- **Collect**: double the data to 20,000 noisy examples. Under a power law $L(N) \approx L_\infty + (N_c/N)^\alpha$ with a typical small exponent, doubling $N$ removes only a fraction of the variance term, perhaps cutting the 4-point variance contribution to roughly 3 points. The noise floor is untouched. Net improvement: about 1 point.
- **Clean**: spend the same budget relabeling the original 10,000 examples, driving $\rho$ from 0.10 toward 0.02. This lowers the noise floor and removes the systematic drag of wrong gradients, often recovering several points at once while leaving the data quantity unchanged.

The qualitative conclusion is robust to the exact numbers: when the error budget is dominated by the noise floor rather than by variance, cleaning dominates collecting, because the power-law return on quantity is shallow while the return on removing label noise is direct. The decomposition tells you which lever to pull before you spend the money.

## 5. Practical Principles

The arguments above converge on a small set of working principles for practitioners.

First, instrument the data-generating process, not just the model. Know how each label was produced, by whom, and under what guidelines, because that process is the real source of your supervision.

Second, measure quality directly. Track inter-annotator agreement, duplication rates, distribution shift between training and serving, and slice-level performance, and treat regressions in these as seriously as a failing unit test.

Third, spend the marginal dollar where the bias-variance decomposition says it will help. If variance dominates, collect more representative data. If the noise floor dominates, fix the labels. If bias dominates, change the model. The decomposition is a budgeting tool, not just a theoretical curiosity.

Fourth, prefer curation to accumulation once the data is large. Deduplicate, filter, and balance before reaching for a larger crawl, because capacity spent memorizing redundant or low-quality examples is capacity not spent learning the signal.

Fifth, respect the ceilings. When a problem is limited by information content or by the absence of interventions, no quantity of data closes the gap, and the honest move is to redesign the measurement or the experiment rather than scale the dataset.

## 6. Conclusion

Machine learning is the discipline of turning data into functions, and so the data is not an input to the real work. The data is the work. Architectures define what is learnable in principle, but the dataset decides what is learned in fact. Garbage in, garbage out is not folklore; it is a direct consequence of how loss functions and estimators behave when their supervision is corrupted. Quantity buys reduced variance along a diminishing power-law curve, quality and coverage decide whether that variance reduction points toward the right target, and certain ceilings remain that no volume of data can breach. The shift toward data-centric thinking is the field maturing into this recognition.
The teams that win are not usually the ones with the cleverest architecture. They are the ones with the cleanest, most representative, and most thoughtfully curated data.

## References

1. Andrew Ng, "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI." DeepLearning.AI, 2021. https://www.deeplearning.ai/the-batch/a-chat-with-andrew-on-mlops-from-model-centric-to-data-centric-ai/
2. Long Ouyang et al., "Training language models to follow instructions with human feedback." NeurIPS, 2022. https://arxiv.org/abs/2203.02155
3. John R. Zech et al., "Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study." PLOS Medicine, 2018. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683
4. Curtis G. Northcutt, Anish Athalye, and Jonas Mueller, "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2103.14749
5. Jared Kaplan et al., "Scaling Laws for Neural Language Models." 2020. https://arxiv.org/abs/2001.08361
6. Joy Buolamwini and Timnit Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." Proceedings of Machine Learning Research, 2018. https://proceedings.mlr.press/v81/buolamwini18a.html
7. Katherine Lee et al., "Deduplicating Training Data Makes Language Models Better." ACL, 2022. https://arxiv.org/abs/2107.06499
8. Judea Pearl, "Causality: Models, Reasoning, and Inference." Cambridge University Press, 2nd edition, 2009. https://bayes.cs.ucla.edu/BOOK-2K/
9. Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu, "Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://doi.org/10.1109/CVPR.2017.240