79  Data Documentation: Datasheets, Data Cards, and Data Statements

A trained model is a function of its data. When practitioners cannot answer basic questions about where a dataset came from, who is represented in it, how it was labeled, and what it was meant for, they cannot reason about the behavior of the systems they build on top of it. Data documentation is the discipline of recording these answers in a structured, durable, and discoverable form. This chapter surveys the three most influential documentation frameworks in machine learning practice, namely Datasheets for Datasets, Data Cards, and Data Statements for natural language processing. It then distills the common content categories they share, and argues that documentation is most valuable not as a compliance artifact but as an instrument of accountability.

79.1 1. Why Document Data

79.1.1 1.1 The accountability gap

Machine learning pipelines have a tendency to launder responsibility. A dataset is scraped or purchased, passed through several teams, filtered and augmented, and finally used to train a model whose failures surface months later in production. By that point the people who can explain the data are gone, the collection scripts have rotted, and the only record of intent is a folder name. Documentation closes this gap by attaching a persistent, human-readable account to the dataset itself, so that the answers travel with the artifact rather than living in the heads of people who have moved on.

79.1.2 1.2 The downstream harms of undocumented data

Several well-known failures trace directly to undocumented or poorly documented data. Commercial gender classifiers were found to perform far worse on darker-skinned women than on lighter-skinned men, and the benchmark sets then in common use were overwhelmingly composed of lighter-skinned faces, a composition fact that had not been surfaced to users (Buolamwini and Gebru, 2018). Large image datasets assembled by automated scraping were later found to contain non-consensual and offensive content, problems that careful collection documentation would have flagged or prevented (Birhane and Prabhu, 2021). In each case the data carried assumptions that propagated silently into models. Documentation does not by itself remove these assumptions, but it makes them visible and therefore contestable.

79.1.3 1.3 Documentation as scientific hygiene

Beyond harm prevention, documentation serves reproducibility. A result that depends on a particular preprocessing choice, sampling frame, or annotation guideline cannot be reproduced or fairly compared if those choices are unrecorded. The reproducibility crisis in empirical machine learning is partly a documentation crisis. When the dataset is a moving target with no version, no provenance, and no description of its splits, benchmark numbers lose their meaning.

79.2 2. Datasheets for Datasets

79.2.1 2.1 Origins and analogy

Datasheets for Datasets, proposed by Gebru and colleagues (2021), borrows its central metaphor from electronics. Every electronic component ships with a datasheet that specifies operating conditions, tolerances, and recommended uses. A capacitor rated for one voltage is not silently substituted for another. The authors argue that datasets, which are the components from which models are assembled, deserve the same treatment. The proposal is deliberately a set of questions rather than a rigid schema, because the goal is to prompt reflection by dataset creators rather than to mechanize it.

79.2.2 2.2 Structure

A datasheet is organized around the lifecycle of a dataset. It poses questions grouped into seven sections: motivation; composition; collection process; preprocessing, cleaning, and labeling; uses; distribution; and maintenance. The questions are answered in prose by the people who created the dataset, ideally during creation rather than retrospectively. The motivation section asks why the dataset was created and who funded it. The composition section asks what the instances represent, how many there are, whether any data is missing, and whether the dataset relates to people. The collection section asks how the data was acquired and whether consent was obtained. Later sections cover how the data was transformed, what it has been and could be used for, how it is distributed and licensed, and who will maintain it.
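
The seven sections can be held in a small data structure so that unanswered parts are easy to spot. The following is a minimal sketch, not part of the datasheet proposal itself: the class name `Datasheet`, the method names, and the example dataset are all hypothetical, and the answers remain free prose as the framework intends.

```python
from dataclasses import dataclass, field

# Section names follow the seven groups listed above. The identifiers
# themselves are illustrative assumptions, not an official schema.
SECTIONS = (
    "motivation",
    "composition",
    "collection_process",
    "preprocessing_cleaning_labeling",
    "uses",
    "distribution",
    "maintenance",
)


@dataclass
class Datasheet:
    dataset_name: str
    # Maps a section name to {question: prose answer}.
    answers: dict[str, dict[str, str]] = field(default_factory=dict)

    def answer(self, section: str, question: str, text: str) -> None:
        """Record a prose answer; whitespace-only text counts as empty."""
        if section not in SECTIONS:
            raise ValueError(f"unknown section: {section}")
        self.answers.setdefault(section, {})[question] = text.strip()

    def unanswered_sections(self) -> list[str]:
        """Return sections with no non-empty answer, in lifecycle order."""
        return [
            s for s in SECTIONS
            if not any(self.answers.get(s, {}).values())
        ]


# Hypothetical usage with made-up content.
sheet = Datasheet("toy-reviews")
sheet.answer("motivation", "Why was the dataset created?",
             "To study sentiment in short product reviews.")
sheet.answer("collection_process", "Was consent obtained?", "   ")
print(sheet.unanswered_sections())
```

A structure like this only tracks whether something was written. It cannot judge whether the prose is honest or adequate, which is the limit discussed in the next subsection.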

79.2.3 2.3 Strengths and limits

The strength of the datasheet is its generality. The same template works for tabular data, images, text, and sensor logs because the questions concern the data lifecycle rather than any particular modality. This generality is also a limit. A long prose document answering several dozen questions is expensive to produce and easy to skim. There is no machine-readable contract, so two datasheets can be equally complete and yet structurally incomparable. The framework relies on the diligence and honesty of its authors, and it offers no mechanism to verify that the prose matches the data.

79.3 3. Data Cards

79.3.1 3.1 Toward structured transparency

Data Cards, introduced by Pushkarna, Zaldivar, and Kjartansson (2022) at Google, respond to the comparability problem. Where a datasheet is a questionnaire answered in prose, a Data Card is a structured summary built from reusable blocks. Each block addresses a specific question and is designed to be consistent across datasets, so that a reader can scan many cards and compare the same field across them. The framework is opinionated about format precisely so that documentation becomes navigable at scale.
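
The comparability argument can be made concrete with a toy model of blocks. This sketch assumes a much simpler block layout than real Data Card templates use: the `Block` fields, the `compare` function, and the two example cards are hypothetical and serve only to show why a shared key makes cards scannable side by side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    # Assumed layout: a stable key, the question the block answers,
    # and the author's answer. Real templates carry more structure.
    key: str
    question: str
    answer: str


def compare(cards: dict[str, list[Block]], key: str) -> dict[str, str]:
    """Collect the answer to one block across many cards.

    Cards that lack the block are reported explicitly, so that a gap
    stays visible instead of being silently dropped.
    """
    out = {}
    for name, blocks in cards.items():
        by_key = {b.key: b.answer for b in blocks}
        out[name] = by_key.get(key, "<missing>")
    return out


# Made-up cards for illustration.
cards = {
    "reviews-v1": [
        Block("sampling", "How was the data sampled?",
              "Uniform over 2019 orders"),
    ],
    "forum-v2": [
        Block("license", "What license applies?", "CC BY 4.0"),
    ],
}
print(compare(cards, "sampling"))
```

Reporting a missing block explicitly reflects the framework's emphasis on surfacing what is hard to see.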

79.3.2 3.2 The OFTEn framework and agentized content

The authors propose that content be organized so that it answers questions across the data lifecycle, captured by the mnemonic OFTEn, which stands for Origins, Factuals, Transformations, Experience, and n=1 example. A central contribution is the idea of agentized documentation, which recognizes that different readers need different information. A policy reviewer, a model developer, and an affected member of the public ask different questions of the same dataset. Data Cards encourage authors to write content for these distinct agents rather than a single undifferentiated audience. The framework also stresses that a good card surfaces what is hard to see, such as known gaps, sampling decisions, and the reasoning behind transformations, rather than only the easy descriptive statistics.

79.3.3 3.3 Operational emphasis

Data Cards were developed alongside tooling and templates intended for production use inside a large organization. This operational grounding shows in the attention to who produces and who consumes the card, to how cards are reviewed, and to how completeness can be assessed. The lesson for practitioners is that documentation succeeds when it is treated as a deliverable with owners and review gates, not as an afterthought appended once the dataset ships.

79.4 4. Data Statements for NLP

79.4.1 4.1 Language is not neutral

Data Statements, proposed by Bender and Friedman (2018), focus on natural language processing, where a particular hazard dominates. A language technology trained on text from one population, register, or dialect will encode the characteristics of that population and may fail or discriminate when applied to another. The authors argue that the field systematically under-reports the provenance of its language data, which makes it impossible to reason about generalization or bias. A data statement is a characterization of a dataset that lets users understand for whom and for what a system is likely to work.

79.4.2 4.2 Schema

A data statement records the curation rationale, that is, why these particular texts were selected. It records the language variety, including dialect and register, specified precisely rather than as a vague label like English. It records speaker demographics and annotator demographics, because both groups shape the data, the speakers through what they wrote and the annotators through how they labeled it. It records the speech situation, the text characteristics, the recording quality, and any provenance for data drawn from other sources. The emphasis on annotator demographics is distinctive and important. Labels are not facts about the world but judgments made by particular people, and who those people are affects what the labels mean.

79.4.3 4.3 Generalization and bias

The payoff of a data statement is the ability to predict where a model will fail. If a sentiment classifier was trained on product reviews written by one demographic and labeled by another, a data statement makes the mismatch visible before deployment in a different context. Bender and Friedman distinguish between a long-form statement intended for documentation and a short-form statement suitable for a paper or model release, acknowledging the practical tension between completeness and the cost of producing it.
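
The idea of making a mismatch visible before deployment can be sketched as a field-by-field comparison between the context a dataset was drawn from and the context a system is headed for. This is a deliberately crude illustration: the `LanguageContext` fields are a simplified, hypothetical subset of the schema, the example values are invented, and exact string comparison stands in for what would in practice be a human judgment.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LanguageContext:
    # Simplified stand-ins for schema elements; a real data statement
    # describes each of these in prose.
    language_variety: str
    speaker_demographics: str
    speech_situation: str
    text_characteristics: str


def mismatches(training: LanguageContext, deployment: LanguageContext):
    """Map each differing field to its (training, deployment) values."""
    t, d = asdict(training), asdict(deployment)
    return {k: (t[k], d[k]) for k in t if t[k] != d[k]}


# Invented contexts for illustration only.
train = LanguageContext(
    language_variety="en-US, informal written",
    speaker_demographics="adult online shoppers",
    speech_situation="asynchronous written reviews",
    text_characteristics="product reviews",
)
deploy = LanguageContext(
    language_variety="en-US, informal written",
    speaker_demographics="teenage forum users",
    speech_situation="asynchronous written posts",
    text_characteristics="gaming discussion",
)
for name, (was, now) in mismatches(train, deploy).items():
    print(f"{name}: trained on {was!r}, deploying to {now!r}")
```

The comparison is only possible because both contexts were written down, which is the point of the data statement.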

79.5 5. What to Document

The three frameworks differ in format and emphasis but converge on a shared set of content categories. The following subsections describe what belongs in each.

79.5.1 5.1 Motivation

Record why the dataset exists. State the task or research question it was built to support, the specific gap it was meant to fill, who created it, and who funded the work. Motivation is the lens through which every later decision should be read. A dataset assembled to study one phenomenon often carries sampling and labeling choices that make it unsuitable for another, and stating the original purpose warns future users away from misuse.

79.5.2 5.2 Composition

Describe what the dataset contains. Specify what a single instance represents, how many instances there are, and what fields each carries. State whether the dataset is a sample of some larger population and, if so, how the sample was drawn and whether it is representative. Disclose missing data, label distributions, and any class imbalance. When the data concerns people, document which subpopulations are present and in what proportions, since composition along demographic lines is frequently the difference between a fair system and a discriminatory one. Note any sensitive content and any confidentiality constraints.
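
Several of these composition facts can be computed directly from the data and pasted into the documentation. The sketch below assumes the dataset is a list of dictionaries with one label field and one subpopulation field. The function name, the field names, and the toy rows are hypothetical, and shares are computed over rows where the value is present.

```python
from collections import Counter


def _is_missing(value):
    return value is None or value == ""


def _shares(counter):
    """Convert counts to proportions of the non-missing total."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()} if total else {}


def composition_summary(rows, label_field, group_field):
    """Summarize size, missing values, label balance, and group shares.

    Only fields present in a row are checked for missing values; a key
    that is absent altogether is not counted here.
    """
    missing, labels, groups = Counter(), Counter(), Counter()
    for row in rows:
        for name, value in row.items():
            if _is_missing(value):
                missing[name] += 1
        if not _is_missing(row.get(label_field)):
            labels[row[label_field]] += 1
        if not _is_missing(row.get(group_field)):
            groups[row[group_field]] += 1
    return {
        "instances": len(rows),
        "missing_by_field": dict(missing),
        "label_share": _shares(labels),
        "group_share": _shares(groups),
    }


# Toy rows, invented for illustration.
rows = [
    {"text": "good", "label": "pos", "dialect": "A"},
    {"text": "bad", "label": "neg", "dialect": "A"},
    {"text": "fine", "label": "pos", "dialect": "B"},
    {"text": "meh", "label": None, "dialect": "A"},
]
summary = composition_summary(rows, "label", "dialect")
print(summary)
```

Numbers like these describe what is in the dataset. Whether the sample is representative of the intended population still has to be argued in prose.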

79.5.3 5.3 Collection

Explain how the data was obtained. Describe the mechanism, whether direct observation, survey, scraping, sensors, or purchase from a third party. Record the time frame of collection, because data ages and a model trained on stale data drifts from the world it is meant to serve. For data about people, document whether they were aware of the collection, whether they consented, whether they could withdraw, and whether any ethical review took place. Collection is the category where consent and provenance live, and it is where the most serious harms originate when it is neglected.

79.5.4 5.4 Preprocessing, cleaning, and labeling

Document every transformation between the raw data and the released dataset. Record filtering rules, deduplication, normalization, tokenization, and any discarded records along with the reason for discarding them. Preserve or describe access to the raw data when possible, since a transformation that looks innocuous can encode a consequential bias. For labeling, record who the annotators were, what instructions they followed, how disagreements were resolved, and what inter-annotator agreement was achieved. Labels inherit the perspectives and incentives of the people and processes that produced them, and a model can only ever be as coherent as its labels.
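
Inter-annotator agreement can be reported with several statistics. As one illustration, the sketch below computes Cohen's kappa for two annotators, which corrects the observed agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The choice of kappa is an assumption made for this example; the text above does not prescribe a particular measure, and the label lists are invented.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels from two hypothetical annotators.
ann_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(ann_1, ann_2), 3))
```

Here the annotators agree on four of six items, but half of that agreement is expected by chance, so kappa is about 0.333. Reporting the statistic alongside the annotation instructions lets readers judge how much weight the labels can bear.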

79.5.5 5.5 Uses

State what the dataset has already been used for and what it is suitable for. Equally important, state what it should not be used for. Identify aspects of composition or collection that could cause harm or unfair treatment if the dataset were applied to a task it was not designed for. This is the category that turns documentation from a description into guidance. A clear statement of intended and unintended uses gives downstream practitioners a standard against which to judge their own plans.

79.5.6 5.6 Distribution and maintenance

Record how the dataset is distributed, under what license, and subject to what terms or restrictions. Assign it a version and a citation so that results can be tied to an exact artifact. Name the party responsible for maintenance, state how errors can be reported, and describe how and whether the dataset will be updated, corrected, or retired. A dataset without an owner and a version is a liability, because it cannot be corrected when a flaw is found and cannot be cited reliably when a result depends on it.
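
One way to tie results to an exact artifact is to publish a content digest alongside the version string. The sketch below uses SHA-256 from Python's standard `hashlib` module. The record layout, the function names, and the example values, including the maintainer address, are hypothetical, and file contents are passed in memory to keep the example self-contained.

```python
import hashlib


def dataset_digest(files):
    """SHA-256 over file names and contents, independent of dict order."""
    outer = hashlib.sha256()
    for name in sorted(files):
        outer.update(name.encode("utf-8") + b"\0")
        outer.update(hashlib.sha256(files[name]).digest())
    return outer.hexdigest()


def release_record(name, version, license_id, maintainer, files):
    """Bundle the distribution facts a citation needs (assumed layout)."""
    return {
        "name": name,
        "version": version,
        "license": license_id,
        "maintainer": maintainer,
        "sha256": dataset_digest(files),
    }


# Invented files and metadata for illustration.
files = {
    "train.csv": b"text,label\ngood,pos\n",
    "test.csv": b"text,label\nbad,neg\n",
}
record = release_record(
    "toy-reviews", "1.0.0", "CC-BY-4.0", "data-team@example.org", files
)
print(record["version"], record["sha256"][:12])
```

Any change to a file changes the digest, so a paper that cites both the version and the digest can be checked against the artifact a reader actually downloaded.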

79.6 6. Documentation as a Tool for Accountability

79.6.1 6.1 From description to obligation

The frameworks share a quiet ambition. By asking creators to write down their decisions, they convert tacit choices into a record that others can inspect, question, and contest. A documented sampling decision can be challenged. An undocumented one cannot, because no one knows it was made. In this sense documentation distributes power. It moves knowledge about a dataset out of the small group that built it and into the larger community that depends on it, including auditors, regulators, and the people the data describes.

79.6.2 6.2 Process over artifact

The deepest insight across this literature is that the value lies in the act of documenting, not only in the document. Answering the composition and collection questions forces creators to confront gaps and biases while there is still time to address them. Gebru and colleagues stress that datasheets are meant to encourage reflection during creation. A datasheet written honestly at the right moment changes the dataset, because the author notices, for example, that an entire demographic is absent and decides to collect more. The artifact is a byproduct of a better process.

79.6.3 6.3 Regulatory and organizational momentum

Documentation is moving from voluntary good practice toward expectation and requirement. Data protection regimes that grant rights over personal data presuppose that an organization knows what data it holds and where it came from, which is exactly what collection and composition documentation captures. Regulation of high-risk AI systems, notably the European Union's AI Act, attaches data governance and technical documentation obligations to training data, including its collection and origin. Inside organizations, dataset documentation increasingly functions as a review gate, where a dataset cannot be promoted to production use until its card or datasheet is complete and approved. This shift gives documentation teeth it historically lacked.

79.6.4 6.4 Limits and honest expectations

Documentation is necessary but not sufficient. A complete datasheet can describe a dataset that should never have been collected. A polished Data Card can present a biased dataset attractively. The frameworks depend on the candor of their authors and offer little protection against motivated misrepresentation. Producing good documentation also costs real effort, and without organizational support that effort is the first thing cut under deadline pressure. The honest position is that documentation is a powerful enabling condition for accountability rather than a guarantee of it. It makes scrutiny possible. Whether scrutiny actually happens depends on incentives, culture, and enforcement that lie outside the document.

79.7 7. Practical Guidance

79.7.1 7.1 Choosing a framework

Practitioners should match the framework to context. For general-purpose datasets across many modalities, Datasheets for Datasets offers a thorough and widely understood question set. For organizations documenting many datasets that must be compared and reviewed at scale, Data Cards provide structure and tooling. For language data, Data Statements supply the demographic and provenance detail that NLP generalization demands. These choices are not exclusive. A team can answer datasheet questions using card-style structured blocks, and a language dataset can adopt the data statement schema within a broader datasheet.

79.7.2 7.2 Making documentation durable

Documentation that lives in a forgotten document decays. Store the documentation with the dataset, version the two together, and treat updates to the data as requiring updates to the record. Assign an owner. Make completeness a precondition for release. Write for the specific readers who will use the dataset rather than a generic audience, and surface the uncomfortable facts about gaps and limitations prominently rather than burying them. The test of good documentation is simple. A capable newcomer should be able to read it and correctly predict how the dataset will and will not serve a proposed use.
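
Several of these practices can be enforced mechanically at release time. The following sketch checks that an owner is named, that the content categories from Section 5 are filled in, and that the documentation describes the current version of the data. The field names, the `data_digest` key, and the digest strings are hypothetical, and a real gate would also involve human review of what the fields say.

```python
# Assumed required fields: an owner plus the content categories
# discussed in Section 5, and a prominent limitations entry.
REQUIRED = (
    "owner",
    "motivation",
    "composition",
    "collection",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
    "known_limitations",
)


def release_blockers(doc, current_digest):
    """Return reasons the dataset may not be released; empty means pass."""
    blockers = []
    for key in REQUIRED:
        if not str(doc.get(key, "")).strip():
            blockers.append(f"missing: {key}")
    # The record must describe the data as it exists now.
    if doc.get("data_digest") != current_digest:
        blockers.append("stale: documentation describes a different data version")
    return blockers


# Invented documentation record for illustration.
doc = {key: "documented" for key in REQUIRED}
doc["data_digest"] = "abc123"

print(release_blockers(doc, "abc123"))   # passes: no blockers
print(release_blockers(doc, "def456"))   # data changed, record did not
```

A check like this can confirm that documentation exists and is current. It cannot confirm that a capable newcomer would understand it, which remains the real test.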

79.8 References

  1. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM, 64(12). https://dl.acm.org/doi/10.1145/3458723

  2. Pushkarna, M., Zaldivar, A., and Kjartansson, O. (2022). Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://dl.acm.org/doi/10.1145/3531146.3533231

  3. Bender, E. M., and Friedman, B. (2018). Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6. https://aclanthology.org/Q18-1041/

  4. Buolamwini, J., and Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research, 81. https://proceedings.mlr.press/v81/buolamwini18a.html

  5. Birhane, A., and Prabhu, V. U. (2021). Large Image Datasets: A Pyrrhic Win for Computer Vision? IEEE Winter Conference on Applications of Computer Vision (WACV). https://ieeexplore.ieee.org/document/9423393

  6. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. (2019). Model Cards for Model Reporting. ACM Conference on Fairness, Accountability, and Transparency (FAT*). https://dl.acm.org/doi/10.1145/3287560.3287596

  7. Holland, S., Hosny, A., Newman, S., Joseph, J., and Chmielinski, K. (2018). The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards. https://arxiv.org/abs/1805.03677

  8. Hutchinson, B., Smart, A., Hanna, A., Denton, E., Greer, C., Kjartansson, O., Barnes, P., and Mitchell, M. (2021). Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure. ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://dl.acm.org/doi/10.1145/3442188.3445918