14 The Data-Centric AI Paradigm
14.1 1. Introduction
For most of the modern history of machine learning, progress was narrated as a story about models. Each year brought a new architecture, a deeper network, a cleverer attention mechanism, or a larger parameter count, and the benchmark leaderboards moved accordingly. In this telling, the training data was treated as a fixed substrate: a static benchmark such as ImageNet or SQuAD that researchers competed against while holding the data constant and iterating on the algorithm. The data-centric AI paradigm inverts this convention. It argues that, for a large and practically important class of problems, the most reliable lever for improving system performance is not the model but the data, and that the discipline of systematically engineering data deserves the same rigor, tooling, and intellectual respect long reserved for model design.
This chapter develops that argument. We examine the framing popularized by Andrew Ng (Section 2), the empirical and theoretical reasons that data quality frequently dominates model choice (Section 3), the concrete techniques of systematic data improvement (Section 4), the documentation practices that make data legible and accountable (Section 5), the central role of data in the scaling of foundation models (Section 6), the promise and hazards of synthetic data (Section 7), and finally the organizational discipline of treating data as a first-class engineering artifact (Section 8).
14.2 2. From Model-Centric to Data-Centric AI
14.2.1 2.1 The model-centric default
The model-centric workflow can be stated compactly. Fix a dataset, then search over architectures, hyperparameters, and optimization strategies until validation performance plateaus. Academic benchmarking institutionalized this loop: a shared, frozen dataset becomes the arena, and the only permitted variable is the learning algorithm. This convention had real benefits. Holding data fixed makes results comparable across papers and isolates algorithmic contributions. The cost, however, is that it trained a generation of practitioners to treat the data as simply given, when in deployed systems the data is often the thing most under the practitioner’s control and most in need of attention.
14.2.2 2.2 Ng’s reframing
Andrew Ng crystallized the alternative in a widely circulated 2021 talk and campaign, “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI” [1]. His central provocation was a thought experiment: hold the model code fixed and improve the data instead. In one frequently cited steel-defect inspection example, a baseline system sat around 76 percent accuracy. A model-centric team iterating on architectures produced essentially no improvement, while a data-centric team that systematically improved label consistency and cleaned the dataset raised accuracy by a double-digit margin [1]. Ng’s broader message is that practitioners should treat data like code, giving it the same version control, quality assurance, and iteration loops that software engineering gives code. That message captures the cultural shift better than any single number.
14.2.3 2.3 Defining the paradigm
Data-centric AI is the discipline of systematically engineering the data used to build an AI system, rather than treating data as an immutable input. It does not claim that models are unimportant. It claims that, once a reasonable model is selected, the marginal return on engineering effort is usually higher when invested in the data, and that this investment should be principled, measured, and repeatable rather than ad hoc. A useful survey by Zha and colleagues organizes the field into three goals: training data development, inference data development, and data maintenance [2].
14.3 3. Why Data Quality Often Matters More Than Model Choice
14.3.1 3.1 The plateau of architectural returns
On many applied problems the space of competent models has become crowded and close. A gradient-boosted tree, a well-tuned multilayer perceptron, and a fine-tuned transformer often land within a few points of one another, while the gap between a noisy dataset and a clean one can be far larger. When the architecture frontier flattens, the data frontier is where the slack lives.
14.3.2 3.2 Label noise propagates
Supervised learning treats labels as ground truth, so systematic labeling errors become systematic model errors. A striking demonstration came from Northcutt, Athalye, and Mueller, who audited ten of the most cited machine learning test sets, including ImageNet, MNIST, and several NLP corpora, and estimated an average label error rate of at least 3.3 percent, with ImageNet’s validation set containing more than 2,900 errors [3]. They further showed that these errors can reorder model rankings: lower-capacity models sometimes outperform higher-capacity ones once the test set is corrected, meaning that benchmark-driven conclusions about which model is “best” can be artifacts of label noise [3].
14.3.3 3.3 The garbage-in dynamic and long tails
Real deployments rarely fail on the bulk of common inputs. They fail on rare but consequential subpopulations: an unusual lighting condition, a dialect, a minority class, a sensor placed differently from the training fleet. These long-tail and distribution-shift failures are data problems by nature. No amount of architectural cleverness conjures information about a subpopulation that the training data underrepresents. Improving coverage, balance, and label fidelity for those slices is the only durable fix.
14.3.4 3.4 A small-data corollary
For small datasets, the case is even sharper. With only a few hundred examples, a single mislabeled instance can shift a decision boundary measurably, and consistency of labeling becomes decisive. Ng’s observation is that two annotators disagreeing on edge cases (is a scratch a defect or surface texture?) inject noise that no model can resolve, because the noise lives in the supervision signal itself [1].
14.4 4. Systematic Data Improvement
The practical core of data-centric AI is a set of disciplined techniques for raising data quality. The unifying principle is the iteration loop: measure, diagnose, intervene on the data, and remeasure, mirroring the test-driven loop of software engineering.
14.4.1 4.1 Label consistency
Label consistency is the degree to which annotators apply the same label to equivalent examples. It is improved by writing precise labeling instructions, measuring inter-annotator agreement (for example with Cohen’s kappa), surfacing disagreements, and resolving them by refining the definition rather than averaging the confusion. A productive tactic is to find examples where annotators disagree, adjudicate them, and feed the resolution back into the instructions so the ambiguity does not recur. Consistency frequently matters more than raw label volume: a smaller, consistently labeled set can beat a larger, noisier one.
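As a concrete illustration of measuring agreement, the following is a minimal sketch of Cohen’s kappa for two annotators, written in plain Python. The function names, the toy "defect"/"ok" labels, and the convention of returning 1.0 when chance agreement is already total are assumptions made for this example, not part of any particular toolkit.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_chance == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


def disagreements(item_ids, labels_a, labels_b):
    """Items to send to adjudication: those the annotators labeled differently."""
    return [i for i, a, b in zip(item_ids, labels_a, labels_b) if a != b]


# Hypothetical annotations for eight inspection images.
annotator_1 = ["defect", "defect", "ok", "ok", "defect", "ok", "ok", "defect"]
annotator_2 = ["defect", "ok", "ok", "ok", "defect", "ok", "defect", "defect"]

kappa = cohens_kappa(annotator_1, annotator_2)
to_adjudicate = disagreements(range(8), annotator_1, annotator_2)
```

Here the annotators agree on six of eight items, but because half of that agreement is expected by chance, kappa is 0.5. The two disputed items (indices 1 and 6) are the ones worth adjudicating and folding back into the labeling instructions.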
14.4.2 4.2 Data cleaning
Data cleaning encompasses the detection and correction of label errors, duplicates, corrupted samples, mislabeled outliers, and leakage between train and test partitions. Confident learning, the method behind the Cleanlab toolkit, estimates the joint distribution between observed (possibly noisy) labels and latent true labels, then ranks examples by their probability of being mislabeled so that human reviewers can audit the most suspect cases first [3], [4]. This converts cleaning from an undirected chore into a prioritized, model-assisted workflow.
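The prioritization idea can be sketched in a few lines. The code below is a deliberately simplified illustration in the spirit of confident learning, not the Cleanlab API and not the full method: it uses per-class confidence thresholds to flag examples whose given label disagrees with a confidently predicted class, then ranks them by the model’s probability for the given label. It omits the calibrated estimate of the joint distribution. The function name and the toy probabilities are assumptions, and the predicted probabilities are assumed to come from out-of-sample (for example, cross-validated) predictions.

```python
def rank_label_issues(pred_probs, labels):
    """Return (index, suggested_class) pairs, most suspect first.

    pred_probs: list of per-class probability lists, one per example.
    labels: list of given (possibly noisy) integer labels.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class j
    # over the examples that were labeled j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, labels) if y == j]
        thresholds.append(sum(probs_j) / len(probs_j) if probs_j else 1.0)

    suspects = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes the model predicts at least as confidently as is typical.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue
        likely = max(confident, key=lambda j: p[j])
        if likely != y:
            # Sort key: probability assigned to the given label (low = suspect).
            suspects.append((p[y], i, likely))

    suspects.sort()
    return [(i, likely) for _, i, likely in suspects]


# Hypothetical two-class example: indices 4 and 5 look mislabeled.
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
              [0.1, 0.9], [0.15, 0.85], [0.7, 0.3]]
labels = [0, 0, 1, 1, 0, 1]

review_queue = rank_label_issues(pred_probs, labels)
```

The output is a review queue rather than an automatic correction: a human auditor inspects the flagged examples in order and decides whether each label is actually wrong.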
14.4.3 4.3 Data augmentation
Augmentation synthesizes additional training examples by applying label-preserving transformations to existing data: cropping, rotation, and color jitter for images, synonym replacement and back-translation for text, time-warping for audio. Augmentation expands effective dataset size, encodes useful invariances, and improves robustness to the variations the model will meet in deployment. Its discipline lies in matching the transformations to genuine sources of variation rather than introducing distortions that would never occur in production.
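A minimal sketch of label-preserving image augmentation follows, using nested lists as stand-in images so that it has no dependencies. The function names, the crop size, and the flip probability are illustrative assumptions. Whether a given transformation really preserves the label depends on the task: a horizontal flip is usually harmless for surface-defect images but would corrupt labels for digits or text.

```python
import random


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(dataset, rng, flip_prob=0.5, crop=(3, 3)):
    """Produce one augmented view per (image, label) pair."""
    out = []
    for image, label in dataset:
        view = random_crop(image, crop[0], crop[1], rng)
        if rng.random() < flip_prob:
            view = horizontal_flip(view)
        out.append((view, label))  # the label is carried over unchanged
    return out


# Hypothetical 4x4 single-channel image labeled "defect".
dataset = [([[0, 1, 2, 3],
             [4, 5, 6, 7],
             [8, 9, 10, 11],
             [12, 13, 14, 15]], "defect")]

augmented = augment(dataset, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which matters when comparing data interventions across training runs.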
14.4.4 4.4 Slicing and error analysis
Aggregate accuracy hides where a model fails. Slice-based analysis partitions the evaluation set into meaningful subpopulations (by demographic, device, geography, class, or input length) and reports performance per slice, so that a model with strong overall accuracy but a catastrophic failure on one slice is caught. The Overton and Snorkel line of work formalized programmatic slicing, and slice-based learning lets teams declare slices of interest and monitor them as first-class metrics [5]. Error analysis then closes the loop: practitioners manually inspect a sample of errors, tag them by category, and tally which categories dominate. The largest category points to the data intervention with the highest expected return, whether that is collecting more examples of a slice, relabeling an ambiguous category, or fixing a systematic annotation rule.
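The two steps described above, per-slice reporting and tallying tagged errors, can be sketched as follows. The record layout (dictionaries with "label", "pred", a slice attribute, and an optional "error_tag" added during manual review) is an assumption made for this example.

```python
from collections import Counter, defaultdict


def accuracy_by_slice(records, slice_key):
    """Accuracy computed separately for each value of records[i][slice_key]."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in records:
        s = r[slice_key]
        totals[s] += 1
        correct[s] += int(r["label"] == r["pred"])
    return {s: correct[s] / totals[s] for s in totals}


def tally_error_tags(records):
    """Count manually assigned error tags among misclassified records."""
    tags = Counter(
        r["error_tag"]
        for r in records
        if r["label"] != r["pred"] and r.get("error_tag")
    )
    return tags.most_common()


# Hypothetical evaluation records, sliced by capture device.
records = [
    {"device": "phone", "label": 1, "pred": 1},
    {"device": "phone", "label": 0, "pred": 0},
    {"device": "phone", "label": 1, "pred": 1},
    {"device": "phone", "label": 0, "pred": 0},
    {"device": "tablet", "label": 1, "pred": 0, "error_tag": "low light"},
    {"device": "tablet", "label": 1, "pred": 0, "error_tag": "low light"},
    {"device": "tablet", "label": 0, "pred": 1, "error_tag": "ambiguous label"},
    {"device": "tablet", "label": 0, "pred": 0},
]

overall = sum(r["label"] == r["pred"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "device")
top_errors = tally_error_tags(records)
```

In this toy data the overall accuracy of 0.625 conceals a perfect "phone" slice and a "tablet" slice at 0.25, and the tally points to low-light images as the first data defect to address.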
14.4.5 4.5 The iteration loop in practice
These techniques compose into a single loop. Train a baseline, run slice-based evaluation, perform error analysis on the worst slices, hypothesize a data defect, intervene (relabel, clean, augment, or collect), and retrain. Crucially, each intervention is measured against held-out data so that the team learns which data changes actually help, building the same empirical feedback culture that unit tests provide for code.
14.5 5. Data Documentation
Systematic data work demands that data be documented, because undocumented data is unaccountable data. Two complementary artifacts have become standard.
14.5.1 5.1 Datasheets for datasets
Gebru and colleagues proposed datasheets for datasets, modeled on the datasheets that accompany electronic components [6]. A datasheet answers a structured set of questions across the dataset lifecycle: motivation (why was it created, by whom, funded how), composition (what does each instance represent, are there sensitive attributes, is anything missing), collection process (how was data gathered, was consent obtained), preprocessing and cleaning, recommended and discouraged uses, distribution, and maintenance. The goal is to make assumptions and limitations explicit so that downstream users can judge fitness for their purpose and avoid misuse.
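One lightweight way to keep such a document next to the data is to represent it as a structured record that can be checked for completeness. The sketch below is an assumption-laden illustration, not a format defined by the datasheets proposal: the class name, field names, and single free-text field per section are hypothetical simplifications of the much longer question lists in [6].

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """One free-text answer per lifecycle section (a simplification)."""
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self):
        """Names of sections that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical, partially completed datasheet.
sheet = Datasheet(
    motivation="Built to benchmark surface-defect detection.",
    composition="Grayscale images of steel sheets, one label per image.",
)
todo = sheet.missing_sections()
```

A check such as `missing_sections()` can run in continuous integration so that a dataset cannot be published with unanswered sections.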
14.5.2 5.2 Data cards and model cards
Google’s Data Cards Playbook extended this idea into a more visual, stakeholder-oriented template aimed at summarizing the salient facts of a dataset for both technical and non-technical audiences [7]. The closely related model cards proposal by Mitchell and colleagues documents trained models, reporting intended use, evaluation across demographic and environmental slices, and ethical considerations [8]. Together, datasheets and data cards document the data that goes into a system while model cards document the trained model that comes out of it, and the combination supports auditing, reproducibility, and regulatory compliance.
14.5.3 5.3 Documentation as governance
Beyond transparency, documentation is a governance mechanism. It records provenance and licensing, flags personally identifiable or sensitive content, and assigns ownership and maintenance responsibilities. As datasets grow and circulate, this metadata is what lets organizations trace a model’s behavior back to a data decision and answer the question that increasingly arrives from regulators and customers alike: where did this data come from, and was it appropriate to use?
14.6 6. Data in Foundation Models and the Scaling of Data
14.6.1 6.1 Scaling laws elevate data
The foundation model era reframed data as a primary axis of capability rather than a backdrop. Kaplan and colleagues first described smooth power-law relationships between model performance and scale [9]. The Chinchilla study by Hoffmann and colleagues then corrected the field’s emphasis: for a fixed compute budget, many large models had been substantially undertrained, and optimal performance requires scaling model size and training tokens roughly in proportion [10]. The practical lesson was blunt. Data is not a free input you saturate early; it is a scarce resource that must scale alongside parameters, which placed dataset construction at the center of frontier model development.
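To make the proportional-scaling result concrete, the sketch below splits a compute budget between parameters and tokens. It rests on two assumptions that are not stated in this chapter: the widely used approximation that training a dense transformer costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), and the commonly quoted rule of thumb of roughly 20 training tokens per parameter. Both constants are illustrative; they are not the fitted values from [10].

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumed rule of thumb, not a fitted constant


def compute_optimal_split(compute_flops):
    """Return (n_params, n_tokens) for a training budget in FLOPs.

    Solves C = 6 * N * D together with D = 20 * N, which gives
    N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# A budget of 1.2e22 FLOPs gives about 1e10 parameters and 2e11 tokens.
n_small, d_small = compute_optimal_split(1.2e22)
# Quadrupling the budget doubles both quantities.
n_large, d_large = compute_optimal_split(4.8e22)
```

Under these assumptions both quantities grow with the square root of compute, which is what "scaling roughly in proportion" means in practice: a fourfold compute increase calls for twice the parameters and twice the data.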
14.6.2 6.2 Data quality and curation at scale
Quantity is necessary but not sufficient. Mounting evidence suggests that careful curation of web-scale corpora improves models more than indiscriminate accumulation. Lee and colleagues showed that deduplicating training data reduces memorization and improves training efficiency [11]. Quality filtering, deduplication, decontamination against evaluation sets, and balanced sampling across domains and languages have become decisive engineering activities. The FineWeb project documented that aggressive, principled filtering of Common Crawl data yields better models than larger but dirtier alternatives, demonstrating data-centric thinking at trillion-token scale [12].
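Two of these curation steps, deduplication and decontamination, can be sketched in miniature. The code below is a simplified illustration under stated assumptions: it removes only exact duplicates after whitespace and case normalization (near-duplicate detection is out of scope), and it drops any document sharing a word 5-gram with an evaluation text. The function names and the n-gram length are choices made for this example, not the procedures used in [11] or [12].

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def ngrams(text, n):
    """Set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(docs, eval_texts, n=5):
    """Drop documents that share any word n-gram with an evaluation text."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [d for d in docs if not (ngrams(d, n) & eval_grams)]


# Hypothetical corpus and evaluation question.
corpus = [
    "The cat sat on the mat.",
    "the  cat sat on the MAT.",
    "Question: what is the capital of France? Answer: Paris",
    "A clean unrelated document about steel inspection.",
]
eval_set = ["What is the capital of France?"]

unique_docs = deduplicate(corpus)
clean_docs = decontaminate(unique_docs, eval_set)
```

At web scale the same logic is typically implemented with approximate or streaming data structures rather than in-memory sets, but the contract is the same: each step takes a corpus and returns a smaller, cleaner one.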
14.6.3 6.3 The looming data wall
Because the highest-quality public text is finite, researchers have begun to forecast a “data wall.” Villalobos and colleagues estimated that the stock of high-quality human-generated public text could be effectively exhausted by training runs sometime between roughly 2026 and 2032, depending on usage intensity [13]. This projection sharpens the strategic value of data efficiency, curation, and alternative sources, including synthetic data.
14.7 7. Synthetic Data
14.7.1 7.1 Motivation and methods
Synthetic data is data generated by a model or simulator rather than collected from the world. It addresses several pressures at once: scarcity in long-tail or privacy-sensitive domains, the cost of human annotation, and the approaching limits of natural text. Methods range from physics-based simulation and procedural generation to generative models and, increasingly, the use of strong language models to produce instruction-tuning and preference data. Instruction-following advances such as Self-Instruct showed that a model can bootstrap a large, diverse instruction dataset from a small seed, dramatically lowering annotation cost [14].
14.7.2 7.2 Distillation and self-improvement
A dominant contemporary pattern is distillation, in which a capable teacher model generates training targets for a smaller student, transferring competence at a fraction of the data-collection cost. Related self-improvement loops have a model generate candidate solutions, filter them for correctness using verifiers or self-consistency, and train on the survivors. These pipelines are now standard in producing reasoning and coding datasets, and they are a direct response to the data wall: when human data runs short, well-filtered machine-generated data extends the supply.
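The generate-filter-train pattern can be sketched without any model at all. In the code below, `toy_generate` is a hypothetical stand-in for sampling from a language model, and the arithmetic prompts, the agreement threshold, and the record layout are all assumptions made for illustration. The filter keeps an example either when a verifier accepts a sample or, absent a verifier, when enough samples agree on the same answer.

```python
from collections import Counter


def self_consistency_filter(samples, min_agreement=0.6):
    """Return the majority answer if enough samples agree, else None."""
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    return answer if votes / len(samples) >= min_agreement else None


def build_synthetic_set(prompts, generate, verifier=None, k=5, min_agreement=0.6):
    """Sample k candidates per prompt and keep only filtered survivors."""
    kept = []
    for prompt in prompts:
        samples = [generate(prompt, i) for i in range(k)]
        if verifier is not None:
            verified = [s for s in samples if verifier(prompt, s)]
            answer = verified[0] if verified else None
        else:
            answer = self_consistency_filter(samples, min_agreement)
        if answer is not None:
            # Record provenance so synthetic examples stay identifiable.
            kept.append({"prompt": prompt, "answer": answer, "source": "synthetic"})
    return kept


def toy_generate(prompt, i):
    """Hypothetical stand-in for a model: reliable on small sums only."""
    a, b = (int(x) for x in prompt.split("+"))
    if a >= 10:
        return a + b + i  # "hard" prompts: every sample differs
    return a + b + (1 if i == 2 else 0)  # "easy" prompts: one bad sample


def exact_verifier(prompt, answer):
    """Checks an answer exactly; real verifiers might run unit tests instead."""
    return answer == sum(int(x) for x in prompt.split("+"))


prompts = ["2+3", "12+7"]
by_vote = build_synthetic_set(prompts, toy_generate)
by_verifier = build_synthetic_set(prompts, toy_generate, verifier=exact_verifier)
```

With voting alone, the second prompt is discarded because its samples never agree. With a verifier, the one correct sample is enough to keep it. This is why verifiable domains such as code and mathematics are especially amenable to this pattern.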
14.7.3 7.3 Risks: bias amplification and model collapse
Synthetic data is not a free lunch. It can amplify the biases and blind spots of the generating model, and it raises the danger of model collapse. Shumailov and colleagues demonstrated that models trained recursively on their own outputs progressively lose information about the tails of the original distribution, degrading until the model forgets rare events entirely [15]. The mitigation that the literature consistently supports is to anchor synthetic data with real data, to filter and verify generated examples rigorously, and to maintain provenance so that the proportion and lineage of synthetic content remain known quantities. Synthetic data, in short, is a powerful instrument that demands the same quality discipline as collected data, not an exemption from it.
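The tail-loss mechanism can be illustrated with a toy simulation that is far simpler than the experiments in [15]. Here each "generation" is just a finite sample drawn from the empirical category frequencies of the previous generation. The function name, the category names, and the sample sizes are assumptions for illustration. The point of the sketch is structural: once a rare category draws zero samples, no later generation can recover it.

```python
import random
from collections import Counter


def resample_generations(initial_counts, sample_size, generations, seed=0):
    """Track how many categories survive repeated self-resampling.

    Each generation draws sample_size items from the empirical
    distribution of the previous generation, with no fresh real data.
    Returns the number of surviving categories after each generation.
    """
    rng = random.Random(seed)
    counts = dict(initial_counts)
    support_sizes = [sum(1 for c in counts.values() if c > 0)]
    for _ in range(generations):
        categories = [k for k, c in counts.items() if c > 0]
        weights = [counts[k] for k in categories]
        draws = rng.choices(categories, weights=weights, k=sample_size)
        counts = dict(Counter(draws))  # categories with zero draws vanish
        support_sizes.append(len(counts))
    return support_sizes


# Hypothetical distribution with a 1 percent tail category.
sizes = resample_generations(
    {"common": 90, "uncommon": 9, "rare": 1},
    sample_size=50,
    generations=30,
)
```

The sequence of support sizes can only stay level or shrink, because nothing in the loop reintroduces a lost category. Mixing a share of real data into every generation is what breaks this one-way ratchet, which is the intuition behind anchoring synthetic data with real data.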
14.8 8. The Practical Discipline of Improving Data
14.8.1 8.1 Data as a first-class artifact
The throughline of this chapter is that data deserves the engineering rigor we grant to code. That means version control for datasets, with tools such as DVC tracking data alongside code so that experiments are reproducible. It means continuous evaluation, where a regression in a slice metric blocks a release just as a failing unit test would. It means clear ownership, where a person or team is accountable for the quality of each dataset, and documentation that travels with the data.
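Two of these practices can be sketched in a few lines: a content fingerprint that changes whenever the dataset changes, and a release gate that fails when any slice regresses. This is not how DVC works internally. The function names, the JSON-based hashing, and the one-point tolerance are assumptions made for this illustration.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Order-independent SHA-256 over JSON-serializable records."""
    # Sort records by their canonical JSON form so row order does not matter.
    canonical = sorted(json.dumps(r, sort_keys=True) for r in records)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def slice_regressions(baseline, candidate, tolerance=0.01):
    """Slices whose metric dropped by more than tolerance, or went missing.

    baseline and candidate map slice name -> metric (higher is better).
    Returns {slice: (baseline_value, candidate_value)} for each failure.
    """
    failures = {}
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is None or new < old - tolerance:
            failures[name] = (old, new)
    return failures


# Hypothetical dataset versions and slice metrics.
v1 = [{"id": 1, "label": "ok"}, {"id": 2, "label": "defect"}]
v1_shuffled = [v1[1], v1[0]]
v2 = [{"id": 1, "label": "ok"}, {"id": 2, "label": "ok"}]  # one label changed

same_data = dataset_fingerprint(v1) == dataset_fingerprint(v1_shuffled)
changed_data = dataset_fingerprint(v1) != dataset_fingerprint(v2)

blocked = slice_regressions(
    {"phone": 0.95, "tablet": 0.90},
    {"phone": 0.96, "tablet": 0.80},
)
```

Logging the fingerprint with every experiment ties each result to the exact data that produced it, and a non-empty `blocked` dictionary plays the same role as a failing unit test: the release stops until the slice is fixed or the regression is explicitly accepted.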
14.8.2 8.2 Process, benchmarks, and incentives
The data-centric movement has also produced infrastructure to reward data work. Benchmarks such as DataPerf and DataComp [16] invert the usual competition: the model is held fixed and entrants compete on constructing the best training set, making data quality the measured objective. Such benchmarks legitimize data engineering as a research contribution and provide shared yardsticks for techniques like filtering and selection.
14.8.3 8.3 When to be data-centric
Data-centric methods are not universally optimal. For genuinely novel modeling problems, or when the data is already clean and abundant and the model class is the bottleneck, architectural work may yield more. The mature stance is therefore diagnostic rather than dogmatic: use error analysis to locate the bottleneck, and direct effort where the evidence points. In the large space of applied, deployed systems built on imperfect real-world data, that evidence points to the data far more often than the model-centric tradition assumed. Recognizing this, and building the tooling, documentation, and culture to act on it, is the contribution of the data-centric AI paradigm.
14.9 9. Conclusion
The data-centric paradigm does not overturn the achievements of model-centric research; it completes them. Models and data are complementary levers, and the engineering question is always which lever returns the most for a given problem. What the paradigm corrects is a long-standing asymmetry of attention, in which sophisticated tooling and culture grew up around models while data was left to ad hoc, undocumented, and often invisible labor. By insisting that data be measured, cleaned, sliced, documented, versioned, and improved through disciplined iteration, data-centric AI brings the neglected half of the system into the light. As foundation models press against the limits of available human data and lean increasingly on curation and synthesis, that discipline moves from a useful practice to a defining one.
14.10 References
[1] A. Ng, “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI,” DeepLearning.AI, 2021. https://www.youtube.com/watch?v=06-AZXmwHjo
[2] D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, Z. Jiang, S. Zhong, and X. Hu, “Data-centric Artificial Intelligence: A Survey,” ACM Computing Surveys, 2023. https://arxiv.org/abs/2303.10158
[3] C. G. Northcutt, A. Athalye, and J. Mueller, “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks,” NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2103.14749
[4] C. G. Northcutt, L. Jiang, and I. L. Chuang, “Confident Learning: Estimating Uncertainty in Dataset Labels,” Journal of Artificial Intelligence Research, vol. 70, 2021. https://arxiv.org/abs/1911.00068
[5] V. S. Chen, S. Wu, Z. Weng, A. Ratner, and C. Ré, “Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices,” NeurIPS, 2019. https://arxiv.org/abs/1909.06349
[6] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, “Datasheets for Datasets,” Communications of the ACM, vol. 64, no. 12, 2021. https://arxiv.org/abs/1803.09010
[7] M. Pushkarna, A. Zaldivar, and O. Kjartansson, “Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI,” ACM FAccT, 2022. https://arxiv.org/abs/2204.01075
[8] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model Cards for Model Reporting,” ACM FAT*, 2019. https://arxiv.org/abs/1810.03993
[9] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling Laws for Neural Language Models,” 2020. https://arxiv.org/abs/2001.08361
[10] J. Hoffmann, S. Borgeaud, A. Mensch, et al., “Training Compute-Optimal Large Language Models,” NeurIPS, 2022. https://arxiv.org/abs/2203.15556
[11] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, “Deduplicating Training Data Makes Language Models Better,” ACL, 2022. https://arxiv.org/abs/2107.06499
[12] G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. von Werra, and T. Wolf, “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale,” NeurIPS Datasets and Benchmarks, 2024. https://arxiv.org/abs/2406.17557
[13] P. Villalobos, A. Ho, J. Sevilla, T. Besiroglu, L. Heim, and M. Hobbhahn, “Will We Run Out of Data? Limits of LLM Scaling Based on Human-generated Data,” ICML, 2024. https://arxiv.org/abs/2211.04325
[14] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-Instruct: Aligning Language Models with Self-Generated Instructions,” ACL, 2023. https://arxiv.org/abs/2212.10560
[15] I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, “AI Models Collapse When Trained on Recursively Generated Data,” Nature, vol. 631, 2024. https://www.nature.com/articles/s41586-024-07566-y
[16] S. Y. Gadre, G. Ilharco, A. Fang, et al., “DataComp: In Search of the Next Generation of Multimodal Datasets,” NeurIPS Datasets and Benchmarks, 2023. https://arxiv.org/abs/2304.14108