14 The Data-Centric AI Paradigm

14.1 1. Introduction

For most of the modern history of machine learning, progress was narrated as a story about models. Each year brought a new architecture, a deeper network, a cleverer attention mechanism, or a larger parameter count, and the benchmark leaderboards moved accordingly. In this telling, the training data was treated as a fixed substrate: a static benchmark such as ImageNet or SQuAD that researchers competed against while holding the data constant and iterating on the algorithm. The data-centric AI paradigm inverts this convention. It argues that, for a large and practically important class of problems, the most reliable lever for improving system performance is not the model but the data, and that the discipline of systematically engineering data deserves the same rigor, tooling, and intellectual respect long reserved for model design.

This chapter develops that argument. We examine the framing popularized by Andrew Ng (Section 2), the empirical and theoretical reasons that data quality frequently dominates model choice (Section 3), the concrete techniques of systematic data improvement (Section 4), the documentation practices that make data legible and accountable (Section 5), the central role of data in the scaling of foundation models (Section 6), the promise and hazards of synthetic data (Section 7), and finally the organizational discipline of treating data as a first-class engineering artifact (Section 8).

14.2 2. From Model-Centric to Data-Centric AI

14.2.1 2.1 The model-centric default

The model-centric workflow can be stated compactly. Fix a dataset, then search over architectures, hyperparameters, and optimization strategies until validation performance plateaus. Academic benchmarking institutionalized this loop: a shared, frozen dataset becomes the arena, and the only permitted variable is the learning algorithm. This convention had real benefits. Holding data fixed makes results comparable across papers and isolates algorithmic contributions. The cost, however, is that it trained a generation of practitioners to believe that the data is simply given, when in deployed systems the data is the thing most under the practitioner’s control and most in need of attention.

14.2.2 2.2 Ng’s reframing

Andrew Ng crystallized the alternative in a widely circulated 2021 talk and campaign, “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI” [1]. His central provocation was a thought experiment: hold the model code fixed and improve the data instead. In one frequently cited steel-defect inspection example, a baseline system sat around 76 percent accuracy. A model-centric team iterating on architectures produced essentially no improvement, while a data-centric team that systematically improved label consistency and cleaned the dataset raised performance by double-digit margins [1]. Ng’s slogan, that machine learning practitioners should treat data as code and adopt for data the same version control, quality assurance, and iteration loops that software engineering uses for code, captures the cultural shift more than any single number.

14.2.3 2.3 Defining the paradigm

Data-centric AI is the discipline of systematically engineering the data used to build an AI system, rather than treating data as an immutable input. It does not claim that models are unimportant. It claims that, once a reasonable model is selected, the marginal return on engineering effort is usually higher when invested in the data, and that this investment should be principled, measured, and repeatable rather than ad hoc. A useful survey by Zha and colleagues organizes the field into three goals: training data development, inference data development, and data maintenance [2].

Definition: data-centric AI

Fix a model class and learning algorithm $\mathcal{A}$. Let $D$ denote a dataset drawn from a process the practitioner can influence through collection, labeling, cleaning, augmentation, and selection. Model-centric work searches over $\mathcal{A}$ with $D$ held fixed; data-centric work searches over $D$ with $\mathcal{A}$ held fixed. Formally, where the model-centric objective is \[ \min_{\mathcal{A}} \; \mathcal{L}\big(\mathcal{A}(D),\, D_{\text{test}}\big), \] the data-centric objective is \[ \min_{D \,\in\, \mathcal{D}} \; \mathcal{L}\big(\mathcal{A}(D),\, D_{\text{test}}\big), \] where $\mathcal{D}$ is the feasible set of datasets reachable under a labeling and curation budget. The paradigm asserts that for many deployed systems the second minimization has more accessible slack than the first.

14.3 3. Why Data Quality Often Matters More Than Model Choice

14.3.1 3.1 The plateau of architectural returns

On many applied problems the space of competent models has become crowded and close. A gradient-boosted tree, a well-tuned multilayer perceptron, and a fine-tuned transformer often land within a few points of one another, while the gap between a noisy dataset and a clean one can be far larger. When the architecture frontier flattens, the data frontier is where the slack lives.

14.3.2 3.2 Label noise propagates

Supervised learning treats labels as ground truth, so systematic labeling errors become systematic model errors. A striking demonstration came from Northcutt, Athalye, and Mueller, who audited ten of the most cited machine learning test sets, including ImageNet, MNIST, and several NLP corpora, and estimated an average label error rate of at least 3.3 percent, with ImageNet’s validation set containing more than 2,900 errors [3]. They further showed that these errors can reorder model rankings: lower-capacity models sometimes outperform higher-capacity ones once the test set is corrected, meaning that benchmark-driven conclusions about which model is “best” can be artifacts of label noise [3].

14.3.3 3.3 The garbage-in dynamic and long tails

Real deployments rarely fail on the bulk of common inputs. They fail on rare but consequential subpopulations: an unusual lighting condition, a dialect, a minority class, a sensor placed differently from the training fleet. These long-tail and distribution-shift failures are data problems by nature. No amount of architectural cleverness conjures information about a subpopulation that the training data underrepresents. Improving coverage, balance, and label fidelity for those slices is the only durable fix.

14.3.4 3.4 A small-data corollary

For small datasets, the case is even sharper. With only a few hundred examples, a single mislabeled instance can shift a decision boundary measurably, and consistency of labeling becomes decisive. Ng’s observation is that two annotators disagreeing on edge cases (is a scratch a defect or surface texture?) inject noise that no model can resolve, because the noise lives in the supervision signal itself [1].

14.4 4. Systematic Data Improvement

The practical core of data-centric AI is a set of disciplined techniques for raising data quality. The unifying principle is the iteration loop: measure, diagnose, intervene on the data, and remeasure, mirroring the test-driven loop of software engineering.

14.4.1 4.1 Label consistency

Label consistency is the degree to which annotators apply the same label to equivalent examples. It is improved by writing precise labeling instructions, measuring inter-annotator agreement, surfacing disagreements, and resolving them by refining the definition rather than averaging the confusion. A productive tactic is to find examples where annotators disagree, adjudicate them, and feed the resolution back into the instructions so the ambiguity does not recur. Consistency frequently matters more than raw label volume: a smaller, consistently labeled set can beat a larger, noisier one.

The standard quantitative instrument is Cohen’s kappa, which corrects raw agreement for chance. With $p_o$ the observed proportion of examples on which two annotators agree and $p_e$ the proportion of agreement expected if each annotator labeled at random according to their own empirical class frequencies, kappa is \[ \kappa = \frac{p_o - p_e}{1 - p_e}. \] The numerator strips out the agreement attributable to chance and the denominator rescales so that $\kappa = 1$ denotes perfect agreement, $\kappa = 0$ denotes agreement no better than chance, and $\kappa < 0$ denotes systematic disagreement. The subtraction of $p_e$ matters precisely because high raw agreement is cheap when one class dominates: if two annotators each mark about 95 percent of examples “negative” independently of the content, they still agree on roughly $0.95^2 + 0.05^2 \approx 0.905$ of examples while conveying no information, and because $p_e$ is also about 0.905, kappa correctly reports a value near zero. By common convention $\kappa$ above roughly 0.8 is treated as strong agreement, though the threshold that matters is the one your task can tolerate. A low kappa is a signal to fix the rubric, not to collect more labels, because more labels at low consistency simply add more noise to the supervision signal.
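The kappa formula is short enough to compute directly. The sketch below is a minimal standard-library implementation; the two annotator label lists are hypothetical data invented for illustration, chosen so that raw agreement looks high (92 percent) while kappa is only moderate.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Return (p_o, p_e, kappa) for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # p_o: observed proportion of items on which the annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance, from each annotator's own
    # empirical class frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in classes) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical annotations: 100 items, 90 percent "ok" for both annotators,
# but the annotators disagree on 8 of the items involving a "defect" label.
annotator_a = ["ok"] * 90 + ["defect"] * 10
annotator_b = ["ok"] * 86 + ["defect"] * 4 + ["ok"] * 4 + ["defect"] * 6

p_o, p_e, kappa = cohens_kappa(annotator_a, annotator_b)
print(f"p_o={p_o:.2f}  p_e={p_e:.2f}  kappa={kappa:.2f}")
```

With these invented labels the annotators agree on 92 of 100 items, yet chance alone would produce 82 percent agreement, so kappa lands near 0.56. Almost all of the disagreement sits on the rare class, which is the situation the rubric needs to address.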

14.4.2 4.2 Data cleaning

Data cleaning encompasses the detection and correction of label errors, duplicates, corrupted samples, mislabeled outliers, and leakage between train and test partitions. Confident learning, the method behind the open-source Cleanlab toolkit, estimates the joint distribution between observed (possibly noisy) labels and latent true labels, then ranks examples by their probability of being mislabeled so that human reviewers can audit the most suspect cases first [3], [4]. This converts cleaning from an undirected chore into a prioritized, model-assisted workflow.

The mechanism is worth stating precisely. Let $\tilde{y}$ be the observed (noisy) label and $y^*$ the unknown true label, each ranging over $m$ classes, and let a trained model supply out-of-sample predicted probabilities $\hat{p}(\,\tilde{y} = j \mid x\,)$. Confident learning forms the confident joint $C_{\tilde{y}, y^*}$, an $m \times m$ count matrix. An example labeled class $i$ is counted toward cell $(i, j)$ when the model’s predicted probability for class $j$ exceeds a self-computed, per-class threshold \[ t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \,\in\, X_{\tilde{y}=j}} \hat{p}(\,\tilde{y} = j \mid x\,), \] that is, the average predicted confidence among examples carrying label $j$. Using a per-class average threshold rather than a single global cutoff makes the method robust to class imbalance and to a model that is systematically over- or under-confident on particular classes. Normalizing and calibrating $C$ yields an estimate of the joint distribution $Q_{\tilde{y}, y^*}$; its off-diagonal mass is an estimate of how many labels of each kind are wrong, and the individual examples in off-diagonal cells are the prioritized audit queue. The output is therefore not just a global noise rate but a ranked list of the specific instances most likely to be mislabeled, which is exactly what a human reviewer needs.
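The thresholding and counting steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Cleanlab implementation: it omits the calibration step, treats "exceeds" as "at or above", breaks ties among confident classes by taking the most probable one, and ranks suspects by the model's probability for the given label. The labels and probabilities are hypothetical.

```python
def confident_joint(labels, probs, num_classes):
    """Return (thresholds, joint, suspects) for noisy labels and
    out-of-sample predicted probabilities (one row per example)."""
    # t_j: mean predicted probability of class j among examples labeled j.
    thresholds = []
    for j in range(num_classes):
        scores = [p[j] for y, p in zip(labels, probs) if y == j]
        if not scores:
            raise ValueError(f"class {j} has no labeled examples")
        thresholds.append(sum(scores) / len(scores))

    joint = [[0] * num_classes for _ in range(num_classes)]
    suspects = []
    for idx, (y, p) in enumerate(zip(labels, probs)):
        # Classes whose probability clears their own per-class threshold.
        confident = [j for j in range(num_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # nothing is confidently predicted; leave uncounted
        j_star = max(confident, key=lambda j: p[j])
        joint[y][j_star] += 1  # row: observed label, column: likely true label
        if j_star != y:
            suspects.append((p[y], idx))
    suspects.sort()  # lowest probability for the given label comes first
    return thresholds, joint, [idx for _, idx in suspects]


# Hypothetical two-class data: example 2 is labeled 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.90, 0.10], [0.80, 0.20], [0.15, 0.85],
         [0.10, 0.90], [0.30, 0.70], [0.25, 0.75]]

thresholds, joint, suspects = confident_joint(labels, probs, num_classes=2)
print("thresholds:", [round(t, 3) for t in thresholds])
print("confident joint:", joint)
print("audit queue (example indices):", suspects)
```

The off-diagonal entry of the count matrix corresponds to example 2, which is the only item sent to the audit queue. Examples 4 and 5 fall below every threshold and are simply left uncounted, which is how the method avoids flagging items on which the model is merely unsure.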

14.4.3 4.3 Data augmentation

Augmentation synthesizes additional training examples by applying label-preserving transformations to existing data: cropping, rotation, and color jitter for images, synonym replacement and back-translation for text, time-warping for audio. Augmentation expands effective dataset size, encodes useful invariances, and improves robustness to the variations the model will meet in deployment. Its discipline lies in matching the transformations to genuine sources of variation rather than introducing distortions that would never occur in production.
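As a concrete illustration of label-preserving transformations, the following sketch applies a horizontal flip and a brightness shift to a tiny grayscale image stored as nested lists. The function names, the pixel values, and the choice of transformations are assumptions made for the example; whether a flip really preserves the label depends on the task (it would not for digits or text).

```python
import random


def hflip(image):
    """Mirror a grayscale image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def brightness_jitter(image, rng, max_delta=20):
    """Shift all pixels by one random offset, clamped to the 0..255 range."""
    delta = rng.randint(-max_delta, max_delta)
    return [[min(255, max(0, px + delta)) for px in row] for row in image]


def augment(dataset, rng, copies=2):
    """Return the original (image, label) pairs plus `copies` variants of each.
    Labels are carried over unchanged, so every transformation used here
    must be label-preserving for the task at hand."""
    out = []
    for image, label in dataset:
        out.append((image, label))
        for _ in range(copies):
            variant = hflip(image) if rng.random() < 0.5 else image
            out.append((brightness_jitter(variant, rng), label))
    return out


# Hypothetical 2x3 image of a surface defect.
image = [[0, 100, 250], [10, 20, 30]]
augmented = augment([(image, "defect")], random.Random(0), copies=3)
print(len(augmented), "examples, labels:", {label for _, label in augmented})
```

Keeping the brightness range small reflects the discipline described above: the offsets should resemble lighting variation the camera will actually see, not arbitrary distortion.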

14.4.4 4.4 Slicing and error analysis

Aggregate accuracy hides where a model fails. Slice-based analysis partitions the evaluation set into meaningful subpopulations (by demographic, device, geography, class, or input length) and reports performance per slice, so that a model with strong overall accuracy but a catastrophic failure on one slice is caught. The Overton and Snorkel line of work formalized programmatic slicing, and tools such as slice-based learning let teams declare slices of interest and monitor them as first-class metrics [5]. Error analysis then closes the loop: practitioners manually inspect a sample of errors, tag them by category, and tally which categories dominate. The largest category points to the data intervention with the highest expected return, whether that is collecting more examples of a slice, relabeling an ambiguous category, or fixing a systematic annotation rule.
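A per-slice report takes very little code. The sketch below assumes evaluation records are dictionaries with hypothetical `device`, `label`, and `pred` fields; the counts are invented so that a strong aggregate number conceals a failing slice.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Return {slice value: accuracy} for records carrying label and pred."""
    totals, correct = defaultdict(int), defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        totals[slice_value] += 1
        correct[slice_value] += int(record["label"] == record["pred"])
    return {s: correct[s] / totals[s] for s in totals}


def worst_slice(per_slice):
    """Name of the slice with the lowest accuracy."""
    return min(per_slice, key=per_slice.get)


# Hypothetical evaluation set: 90 images from line_a, 10 from line_b.
records = (
    [{"device": "line_a", "label": 1, "pred": 1}] * 87
    + [{"device": "line_a", "label": 1, "pred": 0}] * 3
    + [{"device": "line_b", "label": 1, "pred": 1}] * 4
    + [{"device": "line_b", "label": 1, "pred": 0}] * 6
)

overall = sum(r["label"] == r["pred"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "device")
print(f"overall={overall:.2f}", per_slice, "worst:", worst_slice(per_slice))
```

Here the aggregate accuracy is 0.91 while the `line_b` slice sits at 0.40, which is the kind of gap a single headline metric hides.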

14.4.5 4.5 The iteration loop in practice

These techniques compose into a single loop. Train a baseline, run slice-based evaluation, perform error analysis on the worst slices, hypothesize a data defect, intervene (relabel, clean, augment, or collect), and retrain. Crucially, each intervention is measured against held-out data so that the team learns which data changes actually help, building the same empirical feedback culture that unit tests provide for code.

flowchart LR
    A["Train baseline model"] --> B["Slice-based evaluation"]
    B --> C["Error analysis on worst slices"]
    C --> D["Hypothesize a data defect"]
    D --> E["Intervene: relabel, clean, augment, collect"]
    E --> F["Retrain and measure on held-out data"]
    F --> B

The loop is deliberately closed on held-out measurement rather than on intuition. A team can spend a sprint relabeling a category it believed was broken and discover that held-out accuracy did not move, which is itself valuable information that redirects effort to the next-largest error bucket.

14.4.6 4.6 A worked example: where to spend the next week

Consider a binary defect-inspection model with 90 percent overall accuracy on a balanced 1,000-example validation set, so 100 errors remain. Suppose error analysis tags those 100 errors into categories and counts them: 55 are “ambiguous scratch versus surface texture,” 30 are “blurry image,” and 15 are scattered miscellaneous. The data-centric question is not “which architecture next” but “which single intervention removes the most error.”

The ambiguous-scratch category is both the largest bucket and the one most diagnostic of a labeling-definition problem rather than a modeling problem. If two annotators independently labeled a sample of those cases and produced $\kappa = 0.35$, the supervision signal itself is inconsistent, and no model can be more decisive than its labels. The payoff of rewriting the scratch-versus-texture rubric, re-adjudicating the ambiguous cases, and propagating the corrected definition is therefore bounded above by 55 recovered errors, well beyond the ceiling of 30 for a deblurring augmentation campaign and beyond what any plausible architecture swap is likely to deliver. The discipline is to rank candidate interventions by expected error reduction (bucket size weighted by the share of that bucket the intervention can realistically fix) and to act on the top of that ranked list, then remeasure. This is the same prioritization logic as confident learning, applied at the level of error categories rather than individual instances.
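The ranking rule (bucket size times the share an intervention can realistically fix) can be written down directly. In the sketch below the bucket counts come from the worked example, but the fixable shares of 0.6, 0.5, and 0.1 are hypothetical guesses inserted only to make the arithmetic concrete; in practice they would come from a pilot relabeling or a small experiment.

```python
def rank_interventions(buckets):
    """buckets maps an intervention name to (errors in bucket, fixable share).
    Returns (name, expected errors removed) pairs, best first."""
    ranked = [(name, count * share) for name, (count, share) in buckets.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Counts from the worked example; fixable shares are assumed, not measured.
buckets = {
    "rewrite scratch-vs-texture rubric": (55, 0.6),
    "deblurring augmentation": (30, 0.5),
    "miscellaneous fixes": (15, 0.1),
}

for name, expected in rank_interventions(buckets):
    print(f"{expected:5.1f} expected errors removed: {name}")
```

Under these assumed shares the rubric rewrite is expected to remove about 33 errors against 15 for deblurring, so it remains the first intervention even when only part of its bucket is recoverable.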

14.5 5. Data Documentation

Systematic data work demands that data be documented, because undocumented data is unaccountable data. Two complementary artifacts have become standard.

14.5.1 5.1 Datasheets for datasets

Gebru and colleagues proposed datasheets for datasets, modeled on the datasheets that accompany electronic components [6]. A datasheet answers a structured set of questions across the dataset lifecycle: motivation (why was it created, by whom, funded how), composition (what does each instance represent, are there sensitive attributes, is anything missing), collection process (how was data gathered, was consent obtained), preprocessing and cleaning, recommended and discouraged uses, distribution, and maintenance. The goal is to make assumptions and limitations explicit so that downstream users can judge fitness for their purpose and avoid misuse.
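One way to keep a datasheet from drifting out of date is to store it as structured data next to the dataset and check it for gaps. The sketch below is an assumption about how a team might do this, not a format defined in [6]: the field names simply mirror the lifecycle questions listed above, and the example entries describe a hypothetical inspection dataset.

```python
from dataclasses import dataclass, asdict


@dataclass
class Datasheet:
    """Sections mirror the datasheet lifecycle questions; names are illustrative."""
    motivation: str
    composition: str
    collection_process: str
    preprocessing: str
    recommended_uses: list
    discouraged_uses: list
    distribution: str
    maintenance: str

    def missing_sections(self):
        """Names of sections left empty, for use as a pre-release check."""
        return [name for name, value in asdict(self).items() if not value]


# Hypothetical entries for an internal defect-inspection dataset.
sheet = Datasheet(
    motivation="Train and evaluate surface-defect classifiers for one plant.",
    composition="One grayscale image per inspected part, labeled ok or defect.",
    collection_process="Captured by fixed line cameras during normal operation.",
    preprocessing="Cropped to the part outline; duplicates removed.",
    recommended_uses=["defect classification on the same camera setup"],
    discouraged_uses=["other plants or camera models without re-validation"],
    distribution="Internal use only.",
    maintenance="",  # no owner recorded yet
)
print("missing sections:", sheet.missing_sections())
```

Running the check as part of a release process turns an unanswered question, here the missing maintenance owner, into a visible failure rather than a silent omission.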

14.5.2 5.2 Data cards and model cards

Google’s Data Cards Playbook extended this idea into a more visual, stakeholder-oriented template aimed at summarizing the salient facts of a dataset for both technical and non-technical audiences [7]. The closely related model cards proposal by Mitchell and colleagues documents trained models, reporting intended use, evaluation across demographic and environmental slices, and ethical considerations [8]. Together, datasheets and data cards document the inputs while model cards document the outputs, and the combination supports auditing, reproducibility, and regulatory compliance.

14.5.3 5.3 Documentation as governance

Beyond transparency, documentation is a governance mechanism. It records provenance and licensing, flags personally identifiable or sensitive content, and assigns ownership and maintenance responsibilities. As datasets grow and circulate, this metadata is what lets organizations trace a model’s behavior back to a data decision and answer the question that increasingly arrives from regulators and customers alike: where did this data come from, and was it appropriate to use?

14.6 6. Data in Foundation Models and the Scaling of Data

14.6.1 6.1 Scaling laws elevate data

The foundation model era reframed data as a primary axis of capability rather than a backdrop. Kaplan and colleagues first described smooth power-law relationships between model performance and scale [9]. The Chinchilla study by Hoffmann and colleagues then corrected the field’s emphasis: for a fixed compute budget, many large models had been substantially undertrained, and optimal performance requires scaling model size and training tokens roughly in proportion [10].

The Chinchilla analysis fit the final pretraining loss as a function of parameter count $N$ and training tokens $D$ (here $D$ counts tokens rather than naming the dataset as in Section 2.3), \[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \] where $E$ is an irreducible term reflecting the entropy of natural text, the second term captures finite model capacity, and the third captures finite data. The two reducible terms make the role of data explicit: holding $D$ fixed and growing $N$ drives only the $A / N^{\alpha}$ term toward zero, so a model starved of tokens has a loss floor of $E + B / D^{\beta}$ that no amount of added capacity can break through. Minimizing $L$ subject to the compute constraint $C \approx 6 N D$ (the approximate floating-point cost of training a dense transformer) yields optimal $N$ and $D$ that both grow as power laws in the budget, and with the fitted exponents the optimal allocation scales the two roughly in proportion. The empirical headline was that the compute-optimal token-to-parameter ratio sits near 20 tokens per parameter, far above what several earlier flagship models used [10]. The practical lesson was blunt. Data is not a free input you saturate early; it is a scarce resource that must scale alongside parameters, which placed dataset construction at the center of frontier model development.
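Two consequences of the parametric form can be checked numerically. The sketch below first splits a compute budget using $C \approx 6ND$ and a ratio of 20 tokens per parameter, then shows the loss flattening at $E + B / D^{\beta}$ when tokens are held fixed and parameters grow. The budget is an illustrative figure, and the constants passed to `parametric_loss` are round placeholders chosen for readability, not the fitted values reported in [10].

```python
import math


def compute_optimal_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~ 6*N*D and D = r*N, so N = sqrt(C / (6r))."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def parametric_loss(n_params, n_tokens, E=1.5, A=400.0, B=400.0,
                    alpha=0.3, beta=0.3):
    """L(N, D) = E + A / N^alpha + B / D^beta with placeholder constants."""
    return E + A / n_params**alpha + B / n_tokens**beta


# Illustrative budget of 5.88e23 FLOPs.
n_opt, d_opt = compute_optimal_allocation(5.88e23)
print(f"N ~ {n_opt:.2e} parameters, D ~ {d_opt:.2e} tokens")

# Fix the token count and grow the model: the loss approaches a data floor.
fixed_tokens = 1e9
data_floor = 1.5 + 400.0 / fixed_tokens**0.3
for n in (1e8, 1e10, 1e12, 1e15):
    print(f"N={n:.0e}  loss={parametric_loss(n, fixed_tokens):.3f}  "
          f"floor={data_floor:.3f}")
```

The first calculation returns about 70 billion parameters and 1.4 trillion tokens for that budget. The second shows each thousandfold increase in parameters buying less, with the gap to the floor closing but the floor itself unmoved.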

14.6.2 6.2 Data quality and curation at scale

Quantity is necessary but not sufficient. The mounting evidence is that careful curation of web-scale corpora improves models more than indiscriminate accumulation. Lee and colleagues showed that deduplicating training data reduces memorization and improves efficiency [11]. Filtering for quality, deduplication, decontamination against evaluation sets, and balanced sampling across domains and languages have become decisive engineering activities. Projects such as the FineWeb corpus documented that aggressive, principled filtering of common-crawl data yields better models than larger but dirtier alternatives, demonstrating data-centric thinking at trillion-token scale [12].
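Two of these curation steps, deduplication and decontamination, can be illustrated at toy scale. The sketch below uses exact matching on normalized text and word n-gram overlap; it is a deliberately simple stand-in, since web-scale pipelines rely on approximate and substring-level matching that this example does not attempt. The documents, the evaluation question, and the 5-gram window are all hypothetical.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def word_ngrams(text, n):
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(docs, eval_texts, n=5):
    """Drop any document sharing a word n-gram with the evaluation texts."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= word_ngrams(text, n)
    return [doc for doc in docs if not (word_ngrams(doc, n) & eval_grams)]


# Hypothetical corpus and evaluation item.
corpus = [
    "The cat sat on the mat.",
    "the  cat sat on the mat.",
    "What is the capital of France? Paris is the answer.",
    "Gradient boosting builds trees sequentially.",
]
eval_set = ["What is the capital of France?"]

clean = decontaminate(deduplicate(corpus), eval_set)
print(clean)
```

The second document is removed as a whitespace and casing variant of the first, and the third is removed because it reproduces an evaluation question, leaving two documents.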

14.6.3 6.3 The looming data wall

Because the highest-quality public text is finite, researchers have begun to forecast a “data wall.” Villalobos and colleagues estimated that the stock of high-quality human-generated public text could be effectively exhausted by training runs sometime between roughly 2026 and 2032, depending on usage intensity [13]. This projection sharpens the strategic value of data efficiency, curation, and alternative sources, including synthetic data.

14.7 7. Synthetic Data

14.7.1 7.1 Motivation and methods

Synthetic data is data generated by a model or simulator rather than collected from the world. It addresses several pressures at once: scarcity in long-tail or privacy-sensitive domains, the cost of human annotation, and the approaching limits of natural text. Methods range from physics-based simulation and procedural generation to generative models and, increasingly, the use of strong language models to produce instruction-tuning and preference data. Instruction-following advances such as Self-Instruct showed that a model can bootstrap a large, diverse instruction dataset from a small seed, dramatically lowering annotation cost [14].

14.7.2 7.2 Distillation and self-improvement

A dominant contemporary pattern is distillation, in which a capable teacher model generates training targets for a smaller student, transferring competence at a fraction of the data-collection cost. Related self-improvement loops have a model generate candidate solutions, filter them for correctness using verifiers or self-consistency, and train on the survivors. These pipelines are now standard in producing reasoning and coding datasets, and they are a direct response to the data wall: when human data runs short, well-filtered machine-generated data extends the supply.
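The generate, verify, and keep pattern can be shown with a toy task where correctness is cheap to check. In the sketch below `noisy_solver` is a hypothetical stand-in for a generating model that sometimes answers addition problems incorrectly, and `verifier` plays the role of an exact checker; neither corresponds to any particular system.

```python
import random


def noisy_solver(a, b, rng, error_rate=0.3):
    """Hypothetical stand-in for a model: adds a and b, sometimes wrongly."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-2, -1, 1, 2])
    return answer


def verifier(a, b, answer):
    """Exact checker for the toy task."""
    return a + b == answer


def build_verified_dataset(problems, rng, samples_per_problem=4):
    """Sample candidates per problem and keep the first one that verifies."""
    kept = []
    for a, b in problems:
        for _ in range(samples_per_problem):
            candidate = noisy_solver(a, b, rng)
            if verifier(a, b, candidate):
                kept.append({
                    "a": a, "b": b,
                    "prompt": f"{a} + {b} = ?",
                    "answer": candidate,
                    "source": "synthetic",  # provenance travels with the row
                })
                break
    return kept


problems = [(i, i + 1) for i in range(20)]
dataset = build_verified_dataset(problems, random.Random(0))
print(f"kept {len(dataset)} verified examples out of {len(problems)} problems")
```

Every surviving row is correct by construction, and problems for which no sample verifies are dropped rather than included with a wrong answer. Tagging each row with its source keeps the synthetic share of the final training set measurable.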

14.7.3 7.3 Risks: bias amplification and model collapse

Synthetic data is not a free lunch. It can amplify the biases and blind spots of the generating model, and it raises the danger of model collapse. Shumailov and colleagues demonstrated that models trained recursively on their own outputs progressively lose information about the tails of the original distribution, degrading until the model forgets rare events entirely [15].

The intuition is sharp even in the simplest case. Suppose each generation estimates a distribution from a finite sample drawn from the previous generation’s model, then the next generation samples from that estimate, and so on. Even with an unbiased estimator at each step, finite sampling injects variance, and because rare events are sampled rarely, the tails of the distribution are the first to be lost. Consider a one-dimensional Gaussian whose mean and variance are re-estimated from $n$ samples at each generation: the chain of estimated variances is a nonnegative supermartingale (a martingale when the variance estimator is unbiased) that converges to zero almost surely, because each generation multiplies the variance by a random factor whose logarithm has negative mean. Successive generations therefore contract toward a spike and the diversity of the original distribution evaporates. The lesson generalizes: recursively training on model output without an anchor of real data is a low-pass filter on the data distribution, attenuating exactly the rare and tail behavior that downstream users care most about. The mitigation that the literature consistently supports is to anchor synthetic data with real data, to filter and verify generated examples rigorously, and to maintain provenance so that the proportion and lineage of synthetic content remain known quantities. Synthetic data, in short, is a powerful instrument that demands the same quality discipline as collected data, not an exemption from it.
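The Gaussian thought experiment is easy to simulate. The sketch below refits a mean and standard deviation from a small sample at each generation and then samples the next generation from the refit; the sample size of 10, the 1,000 generations, and the random seed are arbitrary choices made for illustration.

```python
import random
import statistics


def simulate_collapse(generations, n_samples, seed=0):
    """Refit a Gaussian to n_samples drawn from the previous fit, repeatedly.
    Returns the estimated variance at each generation, starting from 1.0."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    variances = [sigma**2]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)  # uses the unbiased sample variance
        variances.append(sigma**2)
    return variances


variances = simulate_collapse(generations=1000, n_samples=10)
for generation in (0, 10, 100, 1000):
    print(f"generation {generation:4d}: variance = {variances[generation]:.3e}")
```

No step in the loop is biased toward a smaller variance, yet the accumulated sampling noise still drives the fitted distribution toward a point. Larger samples per generation slow the contraction but do not remove it, which is why the mitigation is to keep real data in the mix rather than to sample more from the model.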

14.8 8. The Practical Discipline of Improving Data

14.8.1 8.1 Data as a first-class artifact

The throughline of this chapter is that data deserves the engineering rigor we grant to code. That means version control for datasets, with mature open-source tools such as DVC and lakeFS tracking data alongside code so that experiments are reproducible. It means continuous evaluation, where a regression in a slice metric blocks a release just as a failing unit test would. It means clear ownership, where a person or team is accountable for the quality of each dataset, and documentation that travels with the data. The supporting toolchain is overwhelmingly free and open: Cleanlab for label-error detection, Great Expectations and pandera for data validation and schema contracts, and DVC for content-addressed dataset versioning.
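The two ideas behind that toolchain, a content fingerprint that changes whenever the data changes and a schema contract that fails loudly, can be sketched with the standard library alone. The code below is an illustration of the concepts and does not reproduce how DVC, pandera, or Great Expectations work internally; the schema, the allowed labels, and the rows are hypothetical.

```python
import hashlib
import json

SCHEMA = {"image_id": str, "label": str}   # hypothetical schema contract
ALLOWED_LABELS = {"ok", "defect"}


def dataset_fingerprint(rows):
    """Hash the serialized rows so any edit yields a new version identifier.
    Row order matters here; sort the rows first if it should not."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def validate(rows):
    """Return a list of human-readable contract violations (empty if clean)."""
    problems, seen_ids = [], set()
    for i, row in enumerate(rows):
        for field_name, expected_type in SCHEMA.items():
            if not isinstance(row.get(field_name), expected_type):
                problems.append(
                    f"row {i}: {field_name} missing or not {expected_type.__name__}")
        label = row.get("label")
        if isinstance(label, str) and label not in ALLOWED_LABELS:
            problems.append(f"row {i}: unknown label {label!r}")
        image_id = row.get("image_id")
        if isinstance(image_id, str):
            if image_id in seen_ids:
                problems.append(f"row {i}: duplicate image_id {image_id!r}")
            seen_ids.add(image_id)
    return problems


good_rows = [{"image_id": "a1", "label": "ok"},
             {"image_id": "a2", "label": "defect"}]
bad_rows = good_rows + [{"image_id": "a1", "label": "scratch?"}]

print("version:", dataset_fingerprint(good_rows)[:12])
print("violations:", validate(bad_rows))
```

Recording the fingerprint alongside each experiment ties a result to the exact data it used, and running the validation in continuous integration lets a contract violation block a release in the same way a failing unit test does.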

14.8.2 8.2 Process, benchmarks, and incentives

The data-centric movement has also produced infrastructure to reward data work. The DataPerf and DataComp benchmarks invert the usual competition: the model is held fixed and entrants compete on constructing the best training set, making data quality the measured objective [16]. Such benchmarks legitimize data engineering as a research contribution and provide shared yardsticks for techniques like filtering and selection.

14.8.3 8.3 When to be data-centric

Data-centric methods are not universally optimal. For genuinely novel modeling problems, or when the data is already clean and abundant and the model class is the bottleneck, architectural work may yield more. The mature stance is therefore diagnostic rather than dogmatic: use error analysis to locate the bottleneck, and direct effort where the evidence points. In the large space of applied, deployed systems built on imperfect real-world data, that evidence points to the data far more often than the model-centric tradition assumed. Recognizing this, and building the tooling, documentation, and culture to act on it, is the contribution of the data-centric AI paradigm.

When to use and common pitfalls

Reach for data-centric methods first when accuracy has plateaued across several reasonable model classes, when errors concentrate in identifiable slices, or when the dataset is small and label consistency is in doubt. Prefer model-centric work when labels are already clean and abundant and a controlled architecture change moves the held-out metric.

Recurring pitfalls:

Tuning on the test set through cleaning. Repeatedly cleaning until a fixed test set improves leaks the test set into the development loop. Hold out a final evaluation set that no cleaning pass touches.
Cleaning away genuine hard cases. Confident learning flags hard but correctly labeled examples alongside true errors. Route flagged items to human review rather than deleting them automatically, or the model loses exactly the difficult cases it must learn.
Optimizing aggregate accuracy while a critical slice regresses. Always track per-slice metrics so a gain on the head does not hide a loss on a small but consequential tail.
Treating synthetic data as exempt from quality control. Unanchored, unverified synthetic data invites bias amplification and model collapse. Mix in real data, verify generated examples, and track provenance.

14.9 9. Conclusion

The data-centric paradigm does not overturn the achievements of model-centric research; it completes them. Models and data are complementary levers, and the engineering question is always which lever returns the most for a given problem. What the paradigm corrects is a long-standing asymmetry of attention, in which sophisticated tooling and culture grew up around models while data was left to ad hoc, undocumented, and often invisible labor. By insisting that data be measured, cleaned, sliced, documented, versioned, and improved through disciplined iteration, data-centric AI brings the neglected half of the system into the light. As foundation models press against the limits of available human data and lean increasingly on curation and synthesis, that discipline moves from a useful practice to a defining one.

14.10 References

[1] A. Ng, “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI,” DeepLearning.AI, 2021. https://www.youtube.com/watch?v=06-AZXmwHjo

[2] D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, Z. Jiang, S. Zhong, and X. Hu, “Data-centric Artificial Intelligence: A Survey,” ACM Computing Surveys, 2023. https://arxiv.org/abs/2303.10158

[3] C. G. Northcutt, A. Athalye, and J. Mueller, “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks,” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2103.14749

[4] C. G. Northcutt, L. Jiang, and I. L. Chuang, “Confident Learning: Estimating Uncertainty in Dataset Labels,” Journal of Artificial Intelligence Research, vol. 70, 2021. https://arxiv.org/abs/1911.00068

[5] V. S. Chen, S. Wu, Z. Weng, A. Ratner, and C. Ré, “Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices,” NeurIPS, 2019. https://arxiv.org/abs/1909.06349

[6] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, “Datasheets for Datasets,” Communications of the ACM, vol. 64, no. 12, 2021. https://arxiv.org/abs/1803.09010

[7] M. Pushkarna, A. Zaldivar, and O. Kjartansson, “Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI,” ACM FAccT, 2022. https://arxiv.org/abs/2204.01075

[8] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model Cards for Model Reporting,” ACM FAT*, 2019. https://arxiv.org/abs/1810.03993

[9] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling Laws for Neural Language Models,” 2020. https://arxiv.org/abs/2001.08361

[10] J. Hoffmann, S. Borgeaud, A. Mensch, et al., “Training Compute-Optimal Large Language Models,” NeurIPS, 2022. https://arxiv.org/abs/2203.15556

[11] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, “Deduplicating Training Data Makes Language Models Better,” ACL, 2022. https://arxiv.org/abs/2107.06499

[12] G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. von Werra, and T. Wolf, “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale,” NeurIPS Datasets and Benchmarks, 2024. https://arxiv.org/abs/2406.17557

[13] P. Villalobos, A. Ho, J. Sevilla, T. Besiroglu, L. Heim, and M. Hobbhahn, “Will We Run Out of Data? Limits of LLM Scaling Based on Human-generated Data,” ICML, 2024. https://arxiv.org/abs/2211.04325

[14] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-Instruct: Aligning Language Models with Self-Generated Instructions,” ACL, 2023. https://arxiv.org/abs/2212.10560

[15] I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, “AI Models Collapse When Trained on Recursively Generated Data,” Nature, vol. 631, 2024. https://www.nature.com/articles/s41586-024-07566-y

[16] S. Y. Gadre, G. Ilharco, A. Fang, et al., “DataComp: In Search of the Next Generation of Multimodal Datasets,” NeurIPS Datasets and Benchmarks, 2023. https://arxiv.org/abs/2304.14108

### 3.3 The garbage-in dynamic and long tails Real deployments rarely fail on the bulk of common inputs. They fail on rare but consequential subpopulations: an unusual lighting condition, a dialect, a minority class, a sensor placed differently from the training fleet. These long-tail and distribution-shift failures are data problems by nature. No amount of architectural cleverness conjures information about a subpopulation that the training data underrepresents. Improving coverage, balance, and label fidelity for those slices is the only durable fix. ### 3.4 A small-data corollary For small datasets, the case is even sharper. With only a few hundred examples, a single mislabeled instance can shift a decision boundary measurably, and consistency of labeling becomes decisive. Ng's observation is that two annotators disagreeing on edge cases (is a scratch a defect or surface texture?) inject noise that no model can resolve, because the noise lives in the supervision signal itself [1]. ## 4. Systematic Data Improvement The practical core of data-centric AI is a set of disciplined techniques for raising data quality. The unifying principle is the iteration loop: measure, diagnose, intervene on the data, and remeasure, mirroring the test-driven loop of software engineering. ### 4.1 Label consistency Label consistency is the degree to which annotators apply the same label to equivalent examples. It is improved by writing precise labeling instructions, measuring inter-annotator agreement, surfacing disagreements, and resolving them by refining the definition rather than averaging the confusion. A productive tactic is to find examples where annotators disagree, adjudicate them, and feed the resolution back into the instructions so the ambiguity does not recur. Consistency frequently matters more than raw label volume: a smaller, consistently labeled set can beat a larger, noisier one. 
The standard quantitative instrument is Cohen's kappa, which corrects raw agreement for chance. With $p_o$ the observed proportion of examples on which two annotators agree and $p_e$ the proportion expected if both labeled at random with their empirical class frequencies, kappa is $$ \kappa = \frac{p_o - p_e}{1 - p_e}. $$ The numerator strips out the agreement attributable to chance and the denominator rescales so that $\kappa = 1$ denotes perfect agreement, $\kappa = 0$ denotes agreement no better than chance, and $\kappa < 0$ denotes systematic disagreement. The subtraction of $p_e$ matters precisely because high raw agreement is cheap when one class dominates: if 95 percent of examples are negatives, two annotators who both always guess "negative" reach $p_o = 0.95$ while learning nothing, and kappa correctly reports a value near zero. By common convention $\kappa$ above roughly 0.8 is treated as strong agreement, though the threshold that matters is the one your task can tolerate. A low kappa is a signal to fix the rubric, not to collect more labels, because more labels at low consistency simply add more noise to the supervision signal. ### 4.2 Data cleaning Data cleaning encompasses the detection and correction of label errors, duplicates, corrupted samples, mislabeled outliers, and leakage between train and test partitions. Confident learning, the method behind the open-source Cleanlab toolkit, estimates the joint distribution between observed (possibly noisy) labels and latent true labels, then ranks examples by their probability of being mislabeled so that human reviewers can audit the most suspect cases first [3], [4]. This converts cleaning from an undirected chore into a prioritized, model-assisted workflow. The mechanism is worth stating precisely. 
Let $\tilde{y}$ be the observed (noisy) label and $y^*$ the unknown true label, each ranging over $m$ classes, and let a trained model supply out-of-sample predicted probabilities $\hat{p}(\,\tilde{y} = j \mid x\,)$. Confident learning forms the confident joint $C_{\tilde{y}, y^*}$, an $m \times m$ count matrix. An example labeled class $i$ is counted toward cell $(i, j)$ when the model's predicted probability for class $j$ exceeds a self-computed, per-class threshold $$ t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \,\in\, X_{\tilde{y}=j}} \hat{p}(\,\tilde{y} = j \mid x\,), $$ that is, the average predicted confidence among examples carrying label $j$. Using a per-class average threshold rather than a single global cutoff makes the method robust to class imbalance and to a model that is systematically over- or under-confident on particular classes. Normalizing and calibrating $C$ yields an estimate of the joint distribution $Q_{\tilde{y}, y^*}$; its off-diagonal mass is an estimate of how many labels of each kind are wrong, and the individual examples in off-diagonal cells are the prioritized audit queue. The output is therefore not just a global noise rate but a ranked list of the specific instances most likely to be mislabeled, which is exactly what a human reviewer needs. ### 4.3 Data augmentation Augmentation synthesizes additional training examples by applying label-preserving transformations to existing data: cropping, rotation, and color jitter for images, synonym replacement and back-translation for text, time-warping for audio. Augmentation expands effective dataset size, encodes useful invariances, and improves robustness to the variations the model will meet in deployment. Its discipline lies in matching the transformations to genuine sources of variation rather than introducing distortions that would never occur in production. ### 4.4 Slicing and error analysis Aggregate accuracy hides where a model fails. 
Slice-based analysis partitions the evaluation set into meaningful subpopulations (by demographic, device, geography, class, or input length) and reports performance per slice, so that a model with strong overall accuracy but a catastrophic failure on one slice is caught. The Overton and Snorkel line of work formalized programmatic slicing, and tools such as slice-based learning let teams declare slices of interest and monitor them as first-class metrics [5]. Error analysis then closes the loop: practitioners manually inspect a sample of errors, tag them by category, and tally which categories dominate. The largest category points to the data intervention with the highest expected return, whether that is collecting more examples of a slice, relabeling an ambiguous category, or fixing a systematic annotation rule. ### 4.5 The iteration loop in practice These techniques compose into a single loop. Train a baseline, run slice-based evaluation, perform error analysis on the worst slices, hypothesize a data defect, intervene (relabel, clean, augment, or collect), and retrain. Crucially, each intervention is measured against held-out data so that the team learns which data changes actually help, building the same empirical feedback culture that unit tests provide for code. ```{mermaid} flowchart LR A["Train baseline model"] --> B["Slice-based evaluation"] B --> C["Error analysis on worst slices"] C --> D["Hypothesize a data defect"] D --> E["Intervene: relabel, clean, augment, collect"] E --> F["Retrain and measure on held-out data"] F --> B ``` The loop is deliberately closed on held-out measurement rather than on intuition. A team can spend a sprint relabeling a category it believed was broken and discover that held-out accuracy did not move, which is itself valuable information that redirects effort to the next-largest error bucket. 
### 4.6 A worked example: where to spend the next week Consider a binary defect-inspection model with 90 percent overall accuracy on a balanced 1,000-example validation set, so 100 errors remain. Suppose error analysis tags those 100 errors into categories and counts them: 55 are "ambiguous scratch versus surface texture," 30 are "blurry image," and 15 are scattered miscellaneous. The data-centric question is not "which architecture next" but "which single intervention removes the most error." The ambiguous-scratch category is both the largest bucket and the one most diagnostic of a labeling-definition problem rather than a modeling problem. If two annotators independently labeled a sample of those cases and produced $\kappa = 0.35$, the supervision signal itself is inconsistent, and no model can be more decisive than its labels. The expected payoff of rewriting the scratch-versus-texture rubric, re-adjudicating the ambiguous cases, and propagating the corrected definition is therefore an upper bound of roughly 55 recovered errors, far exceeding the 30 from a deblurring augmentation campaign or any plausible architecture swap. The discipline is to rank candidate interventions by expected error reduction (bucket size weighted by the share of that bucket the intervention can realistically fix) and to act on the top of that ranked list, then remeasure. This is the same prioritization logic as confident learning, applied at the level of error categories rather than individual instances. ## 5. Data Documentation Systematic data work demands that data be documented, because undocumented data is unaccountable data. Two complementary artifacts have become standard. ### 5.1 Datasheets for datasets Gebru and colleagues proposed datasheets for datasets, modeled on the datasheets that accompany electronic components [6]. 
A datasheet answers a structured set of questions across the dataset lifecycle: motivation (why was it created, by whom, funded how), composition (what does each instance represent, are there sensitive attributes, is anything missing), collection process (how was data gathered, was consent obtained), preprocessing and cleaning, recommended and discouraged uses, distribution, and maintenance. The goal is to make assumptions and limitations explicit so that downstream users can judge fitness for their purpose and avoid misuse. ### 5.2 Data cards and model cards Google's Data Cards Playbook extended this idea into a more visual, stakeholder-oriented template aimed at summarizing the salient facts of a dataset for both technical and non-technical audiences [7]. The closely related model cards proposal by Mitchell and colleagues documents trained models, reporting intended use, evaluation across demographic and environmental slices, and ethical considerations [8]. Together, datasheets and data cards document the inputs while model cards document the outputs, and the combination supports auditing, reproducibility, and regulatory compliance. ### 5.3 Documentation as governance Beyond transparency, documentation is a governance mechanism. It records provenance and licensing, flags personally identifiable or sensitive content, and assigns ownership and maintenance responsibilities. As datasets grow and circulate, this metadata is what lets organizations trace a model's behavior back to a data decision and answer the question that increasingly arrives from regulators and customers alike: where did this data come from, and was it appropriate to use? ## 6. Data in Foundation Models and the Scaling of Data ### 6.1 Scaling laws elevate data The foundation model era reframed data as a primary axis of capability rather than a backdrop. Kaplan and colleagues first described smooth power-law relationships between model performance and scale [9]. 
The Chinchilla study by Hoffmann and colleagues then corrected the field's emphasis: for a fixed compute budget, many large models had been substantially undertrained, and optimal performance requires scaling model size and training tokens roughly in proportion [10]. The Chinchilla analysis fit the final pretraining loss as a function of parameter count $N$ and training tokens $D$, $$ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, $$ where $E$ is an irreducible term reflecting the entropy of natural text, the second term captures finite model capacity, and the third captures finite data. The two reducible terms make the role of data explicit: holding $N$ fixed and growing $D$ drives only the $B / D^{\beta}$ term toward zero, so a model starved of tokens has a loss floor it cannot pass no matter how long it trains. Minimizing $L$ subject to the compute constraint $C \approx 6 N D$ (the approximate floating-point cost of training a dense transformer) yields optimal $N$ and $D$ that both grow as power laws in the budget, and with the fitted exponents the optimal allocation scales the two roughly in proportion. The empirical headline was that the compute-optimal token-to-parameter ratio sits near 20 tokens per parameter, far above what several earlier flagship models used [10]. The practical lesson was blunt. Data is not a free input you saturate early; it is a scarce resource that must scale alongside parameters, which placed dataset construction at the center of frontier model development. ### 6.2 Data quality and curation at scale Quantity is necessary but not sufficient. The mounting evidence is that careful curation of web-scale corpora improves models more than indiscriminate accumulation. Lee and colleagues showed that deduplicating training data reduces memorization and improves efficiency [11]. 
Filtering for quality, deduplication, decontamination against evaluation sets, and balanced sampling across domains and languages have become decisive engineering activities. Projects such as the FineWeb corpus documented that aggressive, principled filtering of common-crawl data yields better models than larger but dirtier alternatives, demonstrating data-centric thinking at trillion-token scale [12]. ### 6.3 The looming data wall Because the highest-quality public text is finite, researchers have begun to forecast a "data wall." Villalobos and colleagues estimated that the stock of high-quality human-generated public text could be effectively exhausted by training runs sometime between roughly 2026 and 2032, depending on usage intensity [13]. This projection sharpens the strategic value of data efficiency, curation, and alternative sources, including synthetic data. ## 7. Synthetic Data ### 7.1 Motivation and methods Synthetic data is data generated by a model or simulator rather than collected from the world. It addresses several pressures at once: scarcity in long-tail or privacy-sensitive domains, the cost of human annotation, and the approaching limits of natural text. Methods range from physics-based simulation and procedural generation to generative models and, increasingly, the use of strong language models to produce instruction-tuning and preference data. Instruction-following advances such as Self-Instruct showed that a model can bootstrap a large, diverse instruction dataset from a small seed, dramatically lowering annotation cost [14]. ### 7.2 Distillation and self-improvement A dominant contemporary pattern is distillation, in which a capable teacher model generates training targets for a smaller student, transferring competence at a fraction of the data-collection cost. Related self-improvement loops have a model generate candidate solutions, filter them for correctness using verifiers or self-consistency, and train on the survivors. 
These pipelines are now standard in producing reasoning and coding datasets, and they are a direct response to the data wall: when human data runs short, well-filtered machine-generated data extends the supply. ### 7.3 Risks: bias amplification and model collapse Synthetic data is not a free lunch. It can amplify the biases and blind spots of the generating model, and it raises the danger of model collapse. Shumailov and colleagues demonstrated that models trained recursively on their own outputs progressively lose information about the tails of the original distribution, degrading until the model forgets rare events entirely [15]. The intuition is sharp even in the simplest case. Suppose each generation estimates a distribution from a finite sample drawn from the previous generation's model, then the next generation samples from that estimate, and so on. Even with an unbiased estimator at each step, finite sampling injects variance, and because rare events are sampled rarely, the tails of the distribution are the first to be lost. Consider a one-dimensional Gaussian whose mean and variance are re-estimated from $n$ samples at each generation: the chain of estimated variances is a supermartingale that drifts toward zero, so successive generations contract toward a spike and the diversity of the original distribution evaporates. The lesson generalizes: recursively training on model output without an anchor of real data is a low-pass filter on the data distribution, attenuating exactly the rare and tail behavior that downstream users care most about. The mitigation that the literature consistently supports is to anchor synthetic data with real data, to filter and verify generated examples rigorously, and to maintain provenance so that the proportion and lineage of synthetic content remain known quantities. Synthetic data, in short, is a powerful instrument that demands the same quality discipline as collected data, not an exemption from it. ## 8. 
The Practical Discipline of Improving Data ### 8.1 Data as a first-class artifact The throughline of this chapter is that data deserves the engineering rigor we grant to code. That means version control for datasets, with mature open-source tools such as DVC and lakeFS tracking data alongside code so that experiments are reproducible. It means continuous evaluation, where a regression in a slice metric blocks a release just as a failing unit test would. It means clear ownership, where a person or team is accountable for the quality of each dataset, and documentation that travels with the data. The supporting toolchain is overwhelmingly free and open: Cleanlab for label-error detection, Great Expectations and pandera for data validation and schema contracts, and DVC for content-addressed dataset versioning. ### 8.2 Process, benchmarks, and incentives The data-centric movement has also produced infrastructure to reward data work. The DataPerf and DataComp benchmarks invert the usual competition: the model is held fixed and entrants compete on constructing the best training set, making data quality the measured objective [16]. Such benchmarks legitimize data engineering as a research contribution and provide shared yardsticks for techniques like filtering and selection. ### 8.3 When to be data-centric Data-centric methods are not universally optimal. For genuinely novel modeling problems, or when the data is already clean and abundant and the model class is the bottleneck, architectural work may yield more. The mature stance is therefore diagnostic rather than dogmatic: use error analysis to locate the bottleneck, and direct effort where the evidence points. In the large space of applied, deployed systems built on imperfect real-world data, that evidence points to the data far more often than the model-centric tradition assumed. Recognizing this, and building the tooling, documentation, and culture to act on it, is the contribution of the data-centric AI paradigm. 
::: {.callout-warning title="When to use and common pitfalls"} Reach for data-centric methods first when accuracy has plateaued across several reasonable model classes, when errors concentrate in identifiable slices, or when the dataset is small and label consistency is in doubt. Prefer model-centric work when labels are already clean and abundant and a controlled architecture change moves the held-out metric. Recurring pitfalls: - **Tuning on the test set through cleaning.** Repeatedly cleaning until a fixed test set improves leaks the test set into the development loop. Hold out a final evaluation set that no cleaning pass touches. - **Cleaning away genuine hard cases.** Confident learning flags hard but correctly labeled examples alongside true errors. Route flagged items to human review rather than deleting them automatically, or the model loses exactly the difficult cases it must learn. - **Optimizing aggregate accuracy while a critical slice regresses.** Always track per-slice metrics so a gain on the head does not hide a loss on a small but consequential tail. - **Treating synthetic data as exempt from quality control.** Unanchored, unverified synthetic data invites bias amplification and model collapse. Mix in real data, verify generated examples, and track provenance. ::: ## 9. Conclusion The data-centric paradigm does not overturn the achievements of model-centric research; it completes them. Models and data are complementary levers, and the engineering question is always which lever returns the most for a given problem. What the paradigm corrects is a long-standing asymmetry of attention, in which sophisticated tooling and culture grew up around models while data was left to ad hoc, undocumented, and often invisible labor. By insisting that data be measured, cleaned, sliced, documented, versioned, and improved through disciplined iteration, data-centric AI brings the neglected half of the system into the light. 
As foundation models press against the limits of available human data and lean increasingly on curation and synthesis, that discipline moves from a useful practice to a defining one. ## References [1] A. Ng, "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI," DeepLearning.AI, 2021. https://www.youtube.com/watch?v=06-AZXmwHjo [2] D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, Z. Jiang, S. Zhong, and X. Hu, "Data-centric Artificial Intelligence: A Survey," ACM Computing Surveys, 2023. https://arxiv.org/abs/2303.10158 [3] C. G. Northcutt, A. Athalye, and J. Mueller, "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks," NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2103.14749 [4] C. G. Northcutt, L. Jiang, and I. L. Chuang, "Confident Learning: Estimating Uncertainty in Dataset Labels," Journal of Artificial Intelligence Research, vol. 70, 2021. https://arxiv.org/abs/1911.00068 [5] V. Chen, S. Wu, A. J. Ratner, J. Weng, and C. Re, "Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices," NeurIPS, 2019. https://arxiv.org/abs/1909.06349 [6] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daume III, and K. Crawford, "Datasheets for Datasets," Communications of the ACM, vol. 64, no. 12, 2021. https://arxiv.org/abs/1803.09010 [7] M. Pushkarna, A. Zaldivar, and O. Kjartansson, "Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI," ACM FAccT, 2022. https://arxiv.org/abs/2204.01075 [8] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, "Model Cards for Model Reporting," ACM FAT*, 2019. https://arxiv.org/abs/1810.03993 [9] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling Laws for Neural Language Models," 2020. https://arxiv.org/abs/2001.08361 [10] J. Hoffmann, S. Borgeaud, A. 
Mensch, et al., "Training Compute-Optimal Large Language Models," NeurIPS, 2022. https://arxiv.org/abs/2203.15556 [11] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, "Deduplicating Training Data Makes Language Models Better," ACL, 2022. https://arxiv.org/abs/2107.06499 [12] G. Penedo, H. Kydlicek, L. von Werra, and T. Wolf, "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale," NeurIPS Datasets and Benchmarks, 2024. https://arxiv.org/abs/2406.17557 [13] P. Villalobos, A. Ho, J. Sevilla, T. Besiroglu, L. Heim, and M. Hobbhahn, "Will We Run Out of Data? Limits of LLM Scaling Based on Human-generated Data," ICML, 2024. https://arxiv.org/abs/2211.04325 [14] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, "Self-Instruct: Aligning Language Models with Self-Generated Instructions," ACL, 2023. https://arxiv.org/abs/2212.10560 [15] I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, "AI Models Collapse When Trained on Recursively Generated Data," Nature, vol. 631, 2024. https://www.nature.com/articles/s41586-024-07566-y [16] S. Y. Gadre, G. Ilharco, A. Fang, et al., "DataComp: In Search of the Next Generation of Multimodal Datasets," NeurIPS Datasets and Benchmarks, 2023. https://arxiv.org/abs/2304.14108