79 Data Documentation: Datasheets, Data Cards, and Data Statements

A trained model is a function of its data. When practitioners cannot answer basic questions about where a dataset came from, who is represented in it, how it was labeled, and what it was meant for, they cannot reason about the behavior of the systems they build on top of it. Data documentation is the discipline of recording these answers in a structured, durable, and discoverable form. This chapter surveys the three most influential documentation frameworks in machine learning practice, namely Datasheets for Datasets, Data Cards, and Data Statements for natural language processing. It then distills the common content categories they share, develops a small formal model of what documentation actually buys a user, works a concrete example, and argues that documentation is most valuable not as a compliance artifact but as an instrument of accountability.

Definitions

A dataset is a finite collection of instances drawn from some process, together with any labels, metadata, and structure imposed on them. A datasheet (in the generic sense used throughout this chapter) is a structured, human-readable document that records the provenance, composition, intended use, and stewardship of a dataset, and that is versioned and distributed together with the dataset it describes. Provenance is the recorded history of where data came from and how it was transformed, sufficient to trace any released instance back toward its origin. A data statement and a Data Card are two specific framework instances of the generic datasheet idea, with different formats and emphases.

79.1 1. Why Document Data

79.1.1 1.1 The accountability gap

Machine learning pipelines have a tendency to launder responsibility. A dataset is scraped or purchased, passed through several teams, filtered and augmented, and finally used to train a model whose failures surface months later in production. By that point the people who can explain the data are gone, the collection scripts have rotted, and the only record of intent is a folder name. Documentation closes this gap by attaching a persistent, human-readable account to the dataset itself, so that the answers travel with the artifact rather than living in the heads of people who have moved on.

79.1.2 1.2 The downstream harms of undocumented data

Several well-known failures trace directly to undocumented or poorly documented data. Facial analysis systems performed far worse on darker-skinned women because their training and benchmark sets were overwhelmingly composed of lighter-skinned faces, a composition fact that was never surfaced to users (Buolamwini and Gebru, 2018). Large image datasets assembled by automated scraping were later found to contain non-consensual and offensive content, problems that careful collection documentation would have flagged or prevented (Birhane and Prabhu, 2021). In each case the data carried assumptions that propagated silently into models. Documentation does not by itself remove these assumptions, but it makes them visible and therefore contestable.

79.1.3 1.3 Documentation as scientific hygiene

Beyond harm prevention, documentation serves reproducibility. A result that depends on a particular preprocessing choice, sampling frame, or annotation guideline cannot be reproduced or fairly compared if those choices are unrecorded. The reproducibility crisis in empirical machine learning is partly a documentation crisis. When the dataset is a moving target with no version, no provenance, and no description of its splits, benchmark numbers lose their meaning.

79.2 2. Datasheets for Datasets

79.2.1 2.1 Origins and analogy

Datasheets for Datasets, proposed by Gebru and colleagues (2021), borrows its central metaphor from electronics. Every electronic component ships with a datasheet that specifies operating conditions, tolerances, and recommended uses. A capacitor rated for one voltage is not silently substituted for another. The authors argue that datasets, which are the components from which models are assembled, deserve the same treatment. The proposal is deliberately a set of questions rather than a rigid schema, because the goal is to prompt reflection by dataset creators rather than to mechanize it.

79.2.2 2.2 Structure

A datasheet is organized around the lifecycle of a dataset. It poses questions grouped into motivation, composition, collection process, preprocessing and cleaning and labeling, uses, distribution, and maintenance. The questions are answered in prose by the people who created the dataset, ideally during creation rather than retrospectively. The motivation section asks why the dataset was created and who funded it. The composition section asks what the instances represent, how many there are, whether any data is missing, and whether the dataset relates to people. The collection section asks how the data was acquired and whether consent was obtained. Later sections cover how the data was transformed, what it has been and could be used for, how it is distributed and licensed, and who will maintain it.

The seven sections trace the path a dataset travels from intent to retirement. Reading them in order reconstructs the decisions that shaped the artifact.

flowchart LR
    A["Motivation"] --> B["Composition"]
    B --> C["Collection"]
    C --> D["Preprocessing and labeling"]
    D --> E["Uses"]
    E --> F["Distribution"]
    F --> G["Maintenance"]
    G -.->|"errors and updates feed back"| B

The dotted return edge matters. Documentation is not write-once. When a flaw is found or the data is corrected, the maintenance section feeds back into composition, and the version must advance so that downstream results can be matched to the exact state of the data they used.
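One way to make "the exact state of the data" checkable is a content fingerprint recorded in the datasheet next to the version number. The following is a minimal sketch under stated assumptions: the records are small JSON-serializable dictionaries, and the function name `dataset_fingerprint` and the toy rows are illustrative, not part of any datasheet framework.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest over a canonical serialization of the records."""
    digest = hashlib.sha256()
    # Serialize each record with sorted keys, then sort the serialized rows,
    # so that neither key order nor row order changes the fingerprint.
    for row in sorted(json.dumps(r, sort_keys=True) for r in records):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical toy data: version 1 of a tiny dataset.
v1 = [
    {"id": 1, "text": "great product", "label": "positive"},
    {"id": 2, "text": "arrived late", "label": "negative"},
]
# A maintenance fix: one label corrected after an error report.
v2 = [
    {"id": 1, "text": "great product", "label": "positive"},
    {"id": 2, "text": "arrived late", "label": "neutral"},
]

same_after_shuffle = dataset_fingerprint(v1) == dataset_fingerprint(list(reversed(v1)))
same_after_fix = dataset_fingerprint(v1) == dataset_fingerprint(v2)
print(same_after_shuffle)  # True: row order is ignored
print(same_after_fix)      # False: a corrected label changes the fingerprint
```

Storing the digest alongside the version lets a reader confirm that the data in hand is the data the document describes. For large datasets a per-file hash manifest would be the more practical variant of the same idea.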

79.2.3 2.3 Strengths and limits

The strength of the datasheet is its generality. The same template works for tabular data, images, text, and sensor logs because the questions concern the data lifecycle rather than any particular modality. This generality is also a limit. A long prose document answering several dozen questions is expensive to produce and easy to skim. There is no machine-readable contract, so two datasheets can be equally complete and yet structurally incomparable. The framework relies on the diligence and honesty of its authors, and it offers no mechanism to verify that the prose matches the data.

79.3 3. Data Cards

79.3.1 3.1 Toward structured transparency

Data Cards, introduced by Pushkarna, Zaldivar, and Kjartansson (2022) at Google, respond to the comparability problem. Where a datasheet is a questionnaire answered in prose, a Data Card is a structured summary built from reusable blocks. Each block addresses a specific question and is designed to be consistent across datasets, so that a reader can scan many cards and compare the same field across them. The framework is opinionated about format precisely so that documentation becomes navigable at scale.

79.3.2 3.2 The OFTEn framework and agentized content

The authors propose that content be organized so that it answers questions across the data lifecycle, captured by the mnemonic OFTEn: Origins, Factuals, Transformations, Experience, and n=1 example, the last being a concrete sample instance from the dataset. A central contribution is the idea of agentized documentation, which recognizes that different readers need different information. A policy reviewer, a model developer, and an affected member of the public ask different questions of the same dataset. Data Cards encourage authors to write content for these distinct agents rather than a single undifferentiated audience. The framework also stresses that a good card surfaces what is hard to see, such as known gaps, sampling decisions, and the reasoning behind transformations, rather than only the easy descriptive statistics.

79.3.3 3.3 Operational emphasis

Data Cards were developed alongside tooling and templates intended for production use inside a large organization. This operational grounding shows in the attention to who produces and who consumes the card, to how cards are reviewed, and to how completeness can be assessed. The lesson for practitioners is that documentation succeeds when it is treated as a deliverable with owners and review gates, not as an afterthought appended once the dataset ships.

79.4 4. Data Statements for NLP

79.4.1 4.1 Language is not neutral

Data Statements, proposed by Bender and Friedman (2018), focus on natural language processing, where a particular hazard dominates. A language technology trained on text from one population, register, or dialect will encode the characteristics of that population and may fail or discriminate when applied to another. The authors argue that the field systematically under-reports the provenance of its language data, which makes it impossible to reason about generalization or bias. A data statement is a characterization of a dataset that lets users understand for whom and for what a system is likely to work.

79.4.2 4.2 Schema

A data statement records the curation rationale, that is why these particular texts were selected. It records the language variety, including dialect and register, specified precisely rather than as a vague label like English. It records speaker demographics and annotator demographics, because both groups shape the data, the speakers through what they wrote and the annotators through how they labeled it. It records the speech situation, the text characteristics, the recording and quality details, and any provenance for data drawn from other sources. The emphasis on annotator demographics is distinctive and important. Labels are not facts about the world but judgments made by particular people, and who those people are affects what the labels mean.
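The schema elements listed above can be held in a small typed record so that empty entries are easy to spot. This is only a sketch: the class name `DataStatement`, the field names, and the helper `unanswered` are assumptions chosen to mirror the prose, not an official serialization of Bender and Friedman's schema.

```python
from dataclasses import dataclass, fields


@dataclass
class DataStatement:
    # Field names are illustrative; they follow the elements named in the text.
    curation_rationale: str = ""
    language_variety: str = ""
    speaker_demographics: str = ""
    annotator_demographics: str = ""
    speech_situation: str = ""
    text_characteristics: str = ""
    recording_quality: str = ""
    provenance: str = ""


def unanswered(statement):
    """List the field names whose value is empty or whitespace only."""
    return [f.name for f in fields(statement) if not getattr(statement, f.name).strip()]


# Hypothetical draft with only two elements filled in so far.
draft = DataStatement(
    curation_rationale="Reviews selected to cover a range of product categories.",
    language_variety="Informal written English, product-review genre.",
)
print(unanswered(draft))
```

A check like this only reports that an element is blank. Whether a filled-in element is precise enough, for example a language variety more specific than "English", still needs a human reviewer.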

79.4.3 4.3 Generalization and bias

The payoff of a data statement is the ability to predict where a model will fail. If a sentiment classifier was trained on product reviews written by one demographic and labeled by another, a data statement makes the mismatch visible before deployment in a different context. Bender and Friedman distinguish a long form statement intended for documentation and a short form suitable for a paper or model release, acknowledging the practical tension between completeness and the cost of producing it.

79.4.4 4.4 Comparing the three frameworks

The frameworks are siblings, not rivals. They differ mainly in format, primary audience, and the modality they were designed around.

Dimension	Datasheets for Datasets	Data Cards	Data Statements
Primary form	Prose answers to a fixed question set	Structured reusable blocks	Schema of provenance and demographic fields
Optimized for	Generality across modalities	Comparability and review at scale	Language data and generalization
Distinctive idea	Lifecycle questionnaire	Agentized content for distinct readers	Speaker and annotator demographics
Machine readability	Low	Moderate to high	Low to moderate
Best fit	Single dataset, any modality	Many datasets, one organization	NLP corpora and benchmarks

The last row of any such table can mislead if read as a verdict. All three frameworks share the same underlying content categories developed in the next section. A team can answer the datasheet questions using Data Card style blocks and embed a data statement for the language portion, producing a single document that is general, comparable, and demographically precise at once.

79.5 5. What to Document

The three frameworks differ in format and emphasis but converge on a shared set of content categories. The following subsections describe what belongs in each.

79.5.1 5.1 Motivation

Record why the dataset exists. State the task or research question it was built to support, the specific gap it was meant to fill, who created it, and who funded the work. Motivation is the lens through which every later decision should be read. A dataset assembled to study one phenomenon often carries sampling and labeling choices that make it unsuitable for another, and stating the original purpose warns future users away from misuse.

79.5.2 5.2 Composition

Describe what the dataset contains. Specify what a single instance represents, how many instances there are, and what fields each carries. State whether the dataset is a sample of some larger population and, if so, how the sample was drawn and whether it is representative. Disclose missing data, label distributions, and any class imbalance. When the data concerns people, document which subpopulations are present and in what proportions, since composition along demographic lines is frequently the difference between a fair system and a discriminatory one. Note any sensitive content and any confidentiality constraints.
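The label distribution and subgroup proportions described above are cheap to compute and worth pasting into the composition section verbatim. A minimal sketch, assuming each instance is a dictionary with hypothetical `label` and `subgroup` keys:

```python
from collections import Counter


def proportions(values):
    """Map each distinct value to its share of the total (shares sum to 1)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot summarize an empty column")
    return {value: count / total for value, count in counts.items()}


# Hypothetical toy rows; a missing subgroup is kept as None rather than dropped,
# so that missingness itself shows up in the summary.
rows = [
    {"label": "positive", "subgroup": "A"},
    {"label": "positive", "subgroup": "A"},
    {"label": "negative", "subgroup": "A"},
    {"label": "neutral", "subgroup": None},
]

label_shares = proportions(row["label"] for row in rows)
subgroup_shares = proportions(row["subgroup"] for row in rows)
print(label_shares)
print(subgroup_shares)
```

Reporting the share of missing values next to the observed subgroups follows the advice to disclose missing data instead of hiding it inside the denominators.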

79.5.3 5.3 Collection

Explain how the data was obtained. Describe the mechanism, whether direct observation, survey, scraping, sensors, or purchase from a third party. Record the time frame of collection, because data ages and a model trained on stale data drifts from the world it is meant to serve. For data about people, document whether they were aware of the collection, whether they consented, whether they could withdraw, and whether any ethical review took place. Collection is the category where consent and provenance live, and it is where the most serious harms originate when it is neglected.

79.5.4 5.4 Preprocessing, cleaning, and labeling

Document every transformation between the raw data and the released dataset. Record filtering rules, deduplication, normalization, tokenization, and any discarded records along with the reason for discarding them. Preserve or describe access to the raw data when possible, since a transformation that looks innocuous can encode a consequential bias. For labeling, record who the annotators were, what instructions they followed, how disagreements were resolved, and what inter-annotator agreement was achieved. Labels inherit the perspectives and incentives of the people and processes that produced them, and a model can only ever be as coherent as its labels.
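Inter-annotator agreement can be reported with several statistics. One common choice when more than two annotators label each item is Fleiss' kappa, sketched below in plain Python. The input layout (one row per item, one column per category, each cell counting annotators) and the toy table are assumptions for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j. Every item must have
    the same number of annotators."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_observed = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement computed from the overall category proportions.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(s * s for s in shares)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical table: 4 items, 5 annotators, 3 categories.
toy = [[5, 0, 0], [3, 2, 0], [0, 4, 1], [1, 1, 3]]
print(round(fleiss_kappa(toy), 3))
```

Whatever statistic is used, the datasheet should name it along with the number of annotators per item, since the same data can yield noticeably different values under different agreement measures.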

79.5.5 5.5 Uses

State what the dataset has already been used for and what it is suitable for. Equally important, state what it should not be used for. Identify aspects of composition or collection that could cause harm or unfair treatment if the dataset were applied to a task it was not designed for. This is the category that turns documentation from a description into guidance. A clear statement of intended and unintended uses gives downstream practitioners a standard against which to judge their own plans.

79.5.6 5.6 Distribution and maintenance

Record how the dataset is distributed, under what license, and subject to what terms or restrictions. Assign it a version and a citation so that results can be tied to an exact artifact. Name the party responsible for maintenance, state how errors can be reported, and describe how and whether the dataset will be updated, corrected, or retired. A dataset without an owner and a version is a liability, because it cannot be corrected when a flaw is found and cannot be cited reliably when a result depends on it.

79.6 6. A Formal View of What Documentation Buys

Documentation is usually discussed in qualitative terms, but its core function admits a precise statement. The central problem documentation addresses is dataset shift, the mismatch between the distribution a dataset was drawn from and the distribution on which a model is later deployed.

79.6.1 6.1 The deployment mismatch

Let $P_{\text{train}}(x, y)$ be the joint distribution over inputs $x$ and labels $y$ that produced the dataset, and let $P_{\text{deploy}}(x, y)$ be the distribution at the point of use. A model fit to the first incurs additional error at the second to the extent that the two differ. Two classical decompositions name the ways they can differ (Moreno-Torres et al., 2012). Under covariate shift the input marginal changes while the conditional is stable, $P_{\text{train}}(x) \neq P_{\text{deploy}}(x)$ but $P_{\text{train}}(y \mid x) = P_{\text{deploy}}(y \mid x)$. Under concept shift the labeling relationship itself changes, $P_{\text{train}}(y \mid x) \neq P_{\text{deploy}}(y \mid x)$.

The practical point is that neither kind of shift can be diagnosed from the training data alone. A practitioner holding only the instance matrix cannot tell whether the sampling frame excluded a subpopulation, or whether the labels encode the judgments of annotators whose criteria differ from the deployment context. These facts live in the collection and labeling history, which is exactly what the composition, collection, and labeling categories of a datasheet record. Documentation does not eliminate shift. It makes the relevant components of $P_{\text{train}}$ legible so that a user can estimate, before deployment, whether the gap to $P_{\text{deploy}}$ is tolerable.
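A back-of-the-envelope calculation shows how documented composition supports this kind of estimate. The sketch below assumes a simplified form of covariate shift in which only the mix of subgroups changes, while accuracy within each subgroup stays the same. All numbers and subgroup names are hypothetical.

```python
def expected_accuracy(mix, per_group_accuracy):
    """Overall accuracy as the mix-weighted average of per-subgroup accuracy."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("subgroup shares must sum to 1")
    return sum(share * per_group_accuracy[group] for group, share in mix.items())


# Hypothetical per-subgroup accuracy, assumed stable between training and deployment.
accuracy = {"variety_A": 0.92, "variety_B": 0.70}

# Subgroup shares: the first would come from the datasheet's composition section,
# the second from what is known about the deployment population.
train_mix = {"variety_A": 0.9, "variety_B": 0.1}
deploy_mix = {"variety_A": 0.4, "variety_B": 0.6}

train_estimate = expected_accuracy(train_mix, accuracy)
deploy_estimate = expected_accuracy(deploy_mix, accuracy)
print(round(train_estimate, 3))   # 0.898
print(round(deploy_estimate, 3))  # 0.788
```

The eleven-point drop comes entirely from the change in mix, which is invisible in the held-out score computed on data drawn from the training distribution. Without the documented shares there would be nothing to plug into the calculation.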

79.6.2 6.2 Documentation as a sufficient statistic for fitness-of-use

It is useful to frame a datasheet as an attempt at a sufficient description for the question that actually matters to a downstream user: is this dataset fit for my intended use $u$? Let $D$ be the dataset and $\text{Doc}$ its documentation. The ideal is that the decision to adopt or reject $D$ for use $u$ depends on $D$ only through $\text{Doc}$,

\[ \text{fit}(D, u) \;\approx\; g\big(\text{Doc}(D),\, u\big), \]

meaning a capable reader can predict fitness from the document without re-auditing the raw data. No real datasheet achieves this exactly, because prose is lossy and authors omit what they did not think to record. The quality of a datasheet is precisely how close it comes to this ideal: how rarely a reader who trusts the document is later surprised by a property of the data that the document failed to surface. This reframes the practical advice in Section 8 as a single objective, namely minimizing the surprises that survive a careful reading.

79.6.3 6.3 A measure of completeness

Comparability across documents invites a crude but useful quantity. Fix a checklist of $n$ required fields for the chosen framework. Let $r_i \in \{0, 1\}$ indicate whether field $i$ is answered substantively rather than left blank or marked unknown, and let $w_i \ge 0$ weight the field by how consequential its omission is, with collection consent and demographic composition weighted heavily and incidental metadata weighted lightly. A weighted completeness score is

\[ C \;=\; \frac{\sum_{i=1}^{n} w_i\, r_i}{\sum_{i=1}^{n} w_i} \;\in\; [0, 1]. \]

This number is a coverage indicator, not a quality guarantee. A document can score $C = 1$ and still misrepresent the data, since $r_i$ records only that a field was answered, not that the answer is true. Used honestly, $C$ supports review gates of the form “no dataset is promoted to production with $C$ below a threshold on the high-weight fields,” which is how completeness acquires teeth inside an organization. Used dishonestly, it becomes a target that invites box-ticking, an instance of the general failure where a measure that becomes a target ceases to measure well.
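The score and the review gate described above can be sketched in a few lines. The field names, weights, and the `min_weight` cutoff that defines "high-weight" are illustrative assumptions, not values prescribed by any of the frameworks.

```python
def completeness(fields):
    """Weighted completeness C for a mapping name -> (weight w_i, answered r_i)."""
    total = sum(weight for weight, _ in fields.values())
    if total <= 0:
        raise ValueError("at least one field must have positive weight")
    return sum(weight for weight, answered in fields.values() if answered) / total


def passes_gate(fields, min_weight, threshold):
    """Apply the threshold to C computed over the high-weight fields only."""
    heavy = {name: wr for name, wr in fields.items() if wr[0] >= min_weight}
    return completeness(heavy) >= threshold


# Hypothetical checklist: (weight, answered substantively?)
checklist = {
    "collection_consent": (5, False),
    "demographic_composition": (5, True),
    "license": (2, True),
    "file_format": (1, True),
}

score = completeness(checklist)          # 8 / 13
promoted = passes_gate(checklist, min_weight=5, threshold=1.0)
print(round(score, 3), promoted)
```

Here the overall score looks respectable while the gate still blocks promotion, because the unanswered field is one of the consequential ones. As the text notes, neither number says anything about whether the answers are true.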

79.6.4 6.4 Worked example: a sentiment dataset

Consider a dataset $D$ of ten thousand short texts labeled positive, negative, or neutral, intended to train a customer-support sentiment classifier. Suppose the datasheet records the following. The texts were scraped from public product reviews on a single retail platform between January and March of one year. The reviewers skew toward one national English variety. The labels were assigned by five annotators recruited through one crowd platform, working from a two-page guideline, with a measured inter-annotator agreement (mean pairwise Cohen’s kappa) of around 0.62 and disagreements resolved by majority vote. Neutral is the minority class at roughly twelve percent of instances.

A reader can now reason about fitness without touching the raw data. The intended deployment is incoming support tickets, which are complaints, written in several English varieties, arriving year round. The documentation surfaces three concrete mismatches. First, a register and genre gap, since product reviews are not support tickets, a covariate shift in $P(x)$. Second, a dialect gap between the reviewer population and the ticket population, a further covariate shift that the data alone would hide. Third, a concept concern, because the annotation guideline was written for reviews and may draw the neutral boundary differently than a support team would, a possible shift in $P(y \mid x)$. The moderate kappa warns that even the recorded labels carry real disagreement, so differences of a few accuracy points between models may reflect label noise rather than real improvement. None of these conclusions required re-auditing the data. They follow from the document, which is the entire point of writing it.

Had the same dataset shipped with only its instances and a folder name, every one of these inferences would have required reverse engineering the collection process, if it were possible at all. That difference, between a user who can predict failure in an afternoon and a user who discovers it in production, is the value documentation delivers.
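The reasoning in this example is an informal instance of the function $g(\text{Doc}(D), u)$ from Section 6.2. A minimal sketch of the same three checks over structured fields follows. The dictionary keys, the placeholder variety names, and the function `fitness_concerns` are all hypothetical, and a real review would weigh far more than three fields.

```python
def fitness_concerns(doc, use):
    """Compare documented properties with an intended use; return the concerns found."""
    concerns = []
    if doc["genre"] != use["genre"]:
        concerns.append(
            f"genre gap: collected from {doc['genre']}, deployed on {use['genre']}"
        )
    uncovered = use["language_varieties"] - doc["language_varieties"]
    if uncovered:
        concerns.append(f"dialect gap: no coverage of {sorted(uncovered)}")
    if doc["guideline_written_for"] != use["genre"]:
        concerns.append(
            "concept concern: annotation guideline was written for "
            f"{doc['guideline_written_for']}"
        )
    return concerns


# Documented properties of the sentiment dataset (variety names are placeholders).
doc = {
    "genre": "product reviews",
    "language_varieties": {"variety_A"},
    "guideline_written_for": "product reviews",
}
# Intended use: support tickets in several English varieties.
use = {
    "genre": "support tickets",
    "language_varieties": {"variety_A", "variety_B", "variety_C"},
}

concerns = fitness_concerns(doc, use)
for concern in concerns:
    print(concern)
```

The point of the sketch is that every input comes from the document and the deployment plan. None of it requires opening the raw data.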

79.7 7. Documentation as a Tool for Accountability

79.7.1 7.1 From description to obligation

The frameworks share a quiet ambition. By asking creators to write down their decisions, they convert tacit choices into a record that others can inspect, question, and contest. A documented sampling decision can be challenged. An undocumented one cannot, because no one knows it was made. In this sense documentation distributes power. It moves knowledge about a dataset out of the small group that built it and into the larger community that depends on it, including auditors, regulators, and the people the data describes.

79.7.2 7.2 Process over artifact

The deepest insight across this literature is that the value lies in the act of documenting, not only in the document. Answering the composition and collection questions forces creators to confront gaps and biases while there is still time to address them. Gebru and colleagues stress that datasheets are meant to encourage reflection during creation. A datasheet written honestly at the right moment changes the dataset, because the author notices, for example, that an entire demographic is absent and decides to collect more. The artifact is a byproduct of a better process.

79.7.3 7.3 Regulatory and organizational momentum

Documentation is moving from voluntary good practice toward expectation and requirement. Data protection regimes that grant rights over personal data presuppose that an organization knows what data it holds and where it came from, which is exactly what collection and composition documentation captures. Emerging regulation of high-risk AI systems contemplates obligations to document training data and its provenance. Inside organizations, dataset documentation increasingly functions as a review gate, where a dataset cannot be promoted to production use until its card or datasheet is complete and approved. This shift gives documentation teeth it historically lacked.

79.7.4 7.4 Limits and honest expectations

Documentation is necessary but not sufficient. A complete datasheet can describe a dataset that should never have been collected. A polished Data Card can present a biased dataset attractively. The frameworks depend on the candor of their authors and offer little protection against motivated misrepresentation. Producing good documentation also costs real effort, and without organizational support that effort is the first thing cut under deadline pressure. The honest position is that documentation is a powerful enabling condition for accountability rather than a guarantee of it. It makes scrutiny possible. Whether scrutiny actually happens depends on incentives, culture, and enforcement that lie outside the document.

79.8 8. Practical Guidance

79.8.1 8.1 Choosing a framework

Practitioners should match the framework to context. For general-purpose datasets across many modalities, Datasheets for Datasets offers a thorough and widely understood question set. For organizations documenting many datasets that must be compared and reviewed at scale, Data Cards provide structure and tooling. For language data, Data Statements supply the demographic and provenance detail that NLP generalization demands. These choices are not exclusive. A team can answer datasheet questions using card-style structured blocks, and a language dataset can adopt the data statement schema within a broader datasheet.

79.8.2 8.2 Making documentation durable

Documentation that lives in a forgotten document decays. Store the documentation with the dataset, version the two together, and treat updates to the data as requiring updates to the record. Assign an owner. Make completeness a precondition for release. Write for the specific readers who will use the dataset rather than a generic audience, and surface the uncomfortable facts about gaps and limitations prominently rather than burying them. The test of good documentation is simple. A capable newcomer should be able to read it and correctly predict how the dataset will and will not serve a proposed use.

Tooling should reduce the friction of meeting this standard rather than add ceremony. Mature open-source options exist for every step. Croissant, a metadata format for machine learning datasets standardized through MLCommons, lets a datasheet’s structured fields travel with the data in a machine-readable form. The Hugging Face Hub renders a dataset card from a versioned Markdown file checked in beside the data, so the document and the artifact move together by default. Plain text formats under version control, whether Markdown, YAML, or JSON, are preferable to a slide or a wiki page precisely because they diff, review, and version like code.
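As an illustration of the plain-text approach, the sketch below writes a datasheet as JSON beside the data and refuses to write one that lacks an owner, a version, or a limitations entry. The file name `datasheet.json`, the list of required keys, and the example values are assumptions. This is not the Croissant schema or the Hugging Face card format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical minimum set of fields that must be filled in before release.
REQUIRED_KEYS = ("version", "owner", "license", "motivation", "known_limitations")


def save_datasheet(directory, sheet):
    """Write the datasheet next to the data, rejecting incomplete records."""
    missing = [key for key in REQUIRED_KEYS if not sheet.get(key)]
    if missing:
        raise ValueError(f"datasheet is missing required fields: {missing}")
    path = Path(directory) / "datasheet.json"
    # Sorted keys and fixed indentation keep version-control diffs small and stable.
    path.write_text(json.dumps(sheet, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path


# Hypothetical datasheet content for illustration only.
sheet = {
    "version": "1.1.0",
    "owner": "data-stewards@example.org",
    "license": "CC-BY-4.0",
    "motivation": "Train a customer-support sentiment classifier.",
    "known_limitations": ["Single platform", "One English variety dominates"],
}

with tempfile.TemporaryDirectory() as data_dir:
    written = save_datasheet(data_dir, sheet)
    reloaded = json.loads(written.read_text(encoding="utf-8"))

print(reloaded == sheet)  # True: the record round-trips unchanged
```

Running a check like this in the release pipeline is one way to make completeness a precondition for release without adding a separate approval tool.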

When to invest, and where it goes wrong

When the effort clearly pays off. Document heavily when the dataset will outlive its creators’ tenure, when it concerns people, when it will be reused across teams or released publicly, or when it feeds a system subject to audit or regulation. The cost of documentation is paid mostly up front, with modest upkeep as the data changes; the cost of an undocumented dataset is paid repeatedly by every downstream user who must reverse engineer it.

Common pitfalls.

Retrospective documentation. Writing the datasheet months after collection loses the very knowledge it was meant to capture. Document during creation, while the decisions are fresh and still changeable.
Completeness theater. Optimizing the completeness score $C$ rather than the truth of the answers. A field marked answered is not a field answered honestly.
Burying the limitations. Putting gaps and biases in a footnote while foregrounding flattering statistics. The uncomfortable facts are the ones a user most needs.
Vague language varieties and demographics. Writing “English” where “United States English, informal register, product-review genre” is what a reader actually needs to predict generalization.
Orphaned documents. A datasheet with no owner and no link to a dataset version cannot be corrected when a flaw is found and cannot be cited reliably when a result depends on it.

79.9 References

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM, 64(12). https://dl.acm.org/doi/10.1145/3458723
Pushkarna, M., Zaldivar, A., and Kjartansson, O. (2022). Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://dl.acm.org/doi/10.1145/3531146.3533231
Bender, E. M., and Friedman, B. (2018). Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6. https://aclanthology.org/Q18-1041/
Buolamwini, J., and Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research, 81. https://proceedings.mlr.press/v81/buolamwini18a.html
Birhane, A., and Prabhu, V. U. (2021). Large Image Datasets: A Pyrrhic Win for Computer Vision? IEEE Winter Conference on Applications of Computer Vision (WACV). https://ieeexplore.ieee.org/document/9423393
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. (2019). Model Cards for Model Reporting. ACM Conference on Fairness, Accountability, and Transparency (FAT*). https://dl.acm.org/doi/10.1145/3287560.3287596
Holland, S., Hosny, A., Newman, S., Joseph, J., and Chmielinski, K. (2018). The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards. https://arxiv.org/abs/1805.03677
Hutchinson, B., Smart, A., Hanna, A., Denton, E., Greer, C., Kjartansson, O., Barnes, P., and Mitchell, M. (2021). Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure. ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://dl.acm.org/doi/10.1145/3442188.3445918
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., and Herrera, F. (2012). A Unifying View on Dataset Shift in Classification. Pattern Recognition, 45(1). https://doi.org/10.1016/j.patcog.2011.06.019

# Data Documentation: Datasheets, Data Cards, and Data Statements A trained model is a function of its data. When practitioners cannot answer basic questions about where a dataset came from, who is represented in it, how it was labeled, and what it was meant for, they cannot reason about the behavior of the systems they build on top of it. Data documentation is the discipline of recording these answers in a structured, durable, and discoverable form. This chapter surveys the three most influential documentation frameworks in machine learning practice, namely Datasheets for Datasets, Data Cards, and Data Statements for natural language processing. It then distills the common content categories they share, develops a small formal model of what documentation actually buys a user, works a concrete example, and argues that documentation is most valuable not as a compliance artifact but as an instrument of accountability. ::: callout-note ## Definitions A **dataset** is a finite collection of instances drawn from some process, together with any labels, metadata, and structure imposed on them. A **datasheet** (in the generic sense used throughout this chapter) is a structured, human-readable document that records the provenance, composition, intended use, and stewardship of a dataset, and that is versioned and distributed together with the dataset it describes. **Provenance** is the recorded history of where data came from and how it was transformed, sufficient to trace any released instance back toward its origin. A **data statement** and a **Data Card** are two specific framework instances of the generic datasheet idea, with different formats and emphases. ::: ## 1. Why Document Data ### 1.1 The accountability gap Machine learning pipelines have a tendency to launder responsibility. A dataset is scraped or purchased, passed through several teams, filtered and augmented, and finally used to train a model whose failures surface months later in production. 
By that point the people who can explain the data are gone, the collection scripts have rotted, and the only record of intent is a folder name. Documentation closes this gap by attaching a persistent, human-readable account to the dataset itself, so that the answers travel with the artifact rather than living in the heads of people who have moved on. ### 1.2 The downstream harms of undocumented data Several well-known failures trace directly to undocumented or poorly documented data. Facial analysis systems performed far worse on darker-skinned women because their training and benchmark sets were overwhelmingly composed of lighter-skinned faces, a composition fact that was never surfaced to users (Buolamwini and Gebru, 2018). Large image datasets assembled by automated scraping were later found to contain non-consensual and offensive content, problems that careful collection documentation would have flagged or prevented (Birhane and Prabhu, 2021). In each case the data carried assumptions that propagated silently into models. Documentation does not by itself remove these assumptions, but it makes them visible and therefore contestable. ### 1.3 Documentation as scientific hygiene Beyond harm prevention, documentation serves reproducibility. A result that depends on a particular preprocessing choice, sampling frame, or annotation guideline cannot be reproduced or fairly compared if those choices are unrecorded. The reproducibility crisis in empirical machine learning is partly a documentation crisis. When the dataset is a moving target with no version, no provenance, and no description of its splits, benchmark numbers lose their meaning. ## 2. Datasheets for Datasets ### 2.1 Origins and analogy Datasheets for Datasets, proposed by Gebru and colleagues (2021), borrows its central metaphor from electronics. Every electronic component ships with a datasheet that specifies operating conditions, tolerances, and recommended uses. 
A capacitor rated for one voltage is not silently substituted for another. The authors argue that datasets, which are the components from which models are assembled, deserve the same treatment. The proposal is deliberately a set of questions rather than a rigid schema, because the goal is to prompt reflection by dataset creators rather than to mechanize it.

### 2.2 Structure

A datasheet is organized around the lifecycle of a dataset. It poses questions grouped into motivation, composition, collection process, preprocessing and cleaning and labeling, uses, distribution, and maintenance. The questions are answered in prose by the people who created the dataset, ideally during creation rather than retrospectively.

The motivation section asks why the dataset was created and who funded it. The composition section asks what the instances represent, how many there are, whether any data is missing, and whether the dataset relates to people. The collection section asks how the data was acquired and whether consent was obtained. Later sections cover how the data was transformed, what it has been and could be used for, how it is distributed and licensed, and who will maintain it.

The seven sections trace the path a dataset travels from intent to retirement. Reading them in order reconstructs the decisions that shaped the artifact.

```{mermaid}
flowchart LR
  A["Motivation"] --> B["Composition"]
  B --> C["Collection"]
  C --> D["Preprocessing and labeling"]
  D --> E["Uses"]
  E --> F["Distribution"]
  F --> G["Maintenance"]
  G -.->|"errors and updates feed back"| B
```

The dotted return edge matters. Documentation is not write-once. When a flaw is found or the data is corrected, the maintenance section feeds back into composition, and the version must advance so that downstream results can be matched to the exact state of the data they used.

### 2.3 Strengths and limits

The strength of the datasheet is its generality.
The same template works for tabular data, images, text, and sensor logs because the questions concern the data lifecycle rather than any particular modality. This generality is also a limit. A long prose document answering several dozen questions is expensive to produce and easy to skim. There is no machine-readable contract, so two datasheets can be equally complete and yet structurally incomparable. The framework relies on the diligence and honesty of its authors, and it offers no mechanism to verify that the prose matches the data.

## 3. Data Cards

### 3.1 Toward structured transparency

Data Cards, introduced by Pushkarna, Zaldivar, and Kjartansson (2022) at Google, respond to the comparability problem. Where a datasheet is a questionnaire answered in prose, a Data Card is a structured summary built from reusable blocks. Each block addresses a specific question and is designed to be consistent across datasets, so that a reader can scan many cards and compare the same field across them. The framework is opinionated about format precisely so that documentation becomes navigable at scale.

### 3.2 The OFTEn framework and agentized content

The authors propose that content be organized so that it answers questions across the data lifecycle, captured by the mnemonic OFTEn: Origins, Factuals, Transformations, Experience, and an n=1 example. A central contribution is the idea of agentized documentation, which recognizes that different readers need different information. A policy reviewer, a model developer, and an affected member of the public ask different questions of the same dataset. Data Cards encourage authors to write content for these distinct agents rather than a single undifferentiated audience. The framework also stresses that a good card surfaces what is hard to see, such as known gaps, sampling decisions, and the reasoning behind transformations, rather than only the easy descriptive statistics.
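
The idea of reusable blocks written for distinct readers can be made concrete with a small sketch. The Python below is illustrative only and is not the Data Cards tooling or template: the names `CardBlock`, `DataCard`, and `compare_block`, the audience labels, and the example datasets are all hypothetical. It shows one way a block could hold a fixed question with per-audience answers, and how the same block could then be read across several cards.

```python
from dataclasses import dataclass, field


@dataclass
class CardBlock:
    """One reusable block: a fixed question with answers per reader type."""
    question: str
    answers: dict[str, str] = field(default_factory=dict)  # audience -> answer

    def answer_for(self, audience: str) -> str:
        # Prefer the audience-specific answer, then the general one,
        # and make a missing answer explicit instead of silent.
        return self.answers.get(
            audience, self.answers.get("general", "NOT DOCUMENTED")
        )


@dataclass
class DataCard:
    """A card is a named, versioned collection of blocks (hypothetical layout)."""
    dataset: str
    version: str
    blocks: dict[str, CardBlock] = field(default_factory=dict)  # name -> block


def compare_block(cards: list[DataCard], name: str, audience: str) -> dict[str, str]:
    """Read the same block across many cards, as seen by one audience."""
    out = {}
    for card in cards:
        block = card.blocks.get(name)
        key = f"{card.dataset}@{card.version}"
        out[key] = block.answer_for(audience) if block else "NOT DOCUMENTED"
    return out


# Hypothetical example cards, not real datasets.
reviews = DataCard("reviews", "1.2.0", {
    "sampling": CardBlock(
        "How were instances sampled?",
        {
            "general": "Scraped from one retail platform.",
            "policy_reviewer": "Public reviews only; no account data retained.",
        },
    ),
})
tickets = DataCard("tickets", "0.3.1")  # sampling block never written

result = compare_block([reviews, tickets], "sampling", "policy_reviewer")
for dataset, answer in result.items():
    print(f"{dataset}: {answer}")
```

Because every card uses the same block names, a missing block shows up as an explicit gap in the comparison rather than as an absence the reader has to notice.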

### 3.3 Operational emphasis

Data Cards were developed alongside tooling and templates intended for production use inside a large organization. This operational grounding shows in the attention to who produces and who consumes the card, to how cards are reviewed, and to how completeness can be assessed. The lesson for practitioners is that documentation succeeds when it is treated as a deliverable with owners and review gates, not as an afterthought appended once the dataset ships.

## 4. Data Statements for NLP

### 4.1 Language is not neutral

Data Statements, proposed by Bender and Friedman (2018), focus on natural language processing, where a particular hazard dominates. A language technology trained on text from one population, register, or dialect will encode the characteristics of that population and may fail or discriminate when applied to another. The authors argue that the field systematically under-reports the provenance of its language data, which makes it impossible to reason about generalization or bias. A data statement is a characterization of a dataset that lets users understand for whom and for what a system is likely to work.

### 4.2 Schema

A data statement records the curation rationale, that is, why these particular texts were selected. It records the language variety, including dialect and register, specified precisely rather than as a vague label like English. It records speaker demographics and annotator demographics, because both groups shape the data, the speakers through what they wrote and the annotators through how they labeled it. It records the speech situation, the text characteristics, the recording quality, and any provenance for data drawn from other sources. The emphasis on annotator demographics is distinctive and important. Labels are not facts about the world but judgments made by particular people, and who those people are affects what the labels mean.
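
The schema above maps naturally onto a record type. The following Python is a minimal sketch under stated assumptions, not an official serialization of data statements: the class `DataStatement`, its field names, and the helpers `missing_fields` and `is_vague_variety` are hypothetical, and the vagueness check is a deliberately crude heuristic (a one-word label such as "English" names a language, not a variety).

```python
from dataclasses import dataclass, fields


@dataclass
class DataStatement:
    """Hypothetical record with one free-text field per schema element."""
    curation_rationale: str = ""
    language_variety: str = ""
    speaker_demographics: str = ""
    annotator_demographics: str = ""
    speech_situation: str = ""
    text_characteristics: str = ""
    recording_quality: str = ""
    provenance: str = ""


def missing_fields(stmt: DataStatement) -> list[str]:
    """Names of schema elements left empty or whitespace-only."""
    return [f.name for f in fields(stmt) if not getattr(stmt, f.name).strip()]


def is_vague_variety(variety: str) -> bool:
    """Crude heuristic (an assumption): fewer than two words is too vague."""
    return len(variety.replace(",", " ").split()) < 2


# Hypothetical, partially completed statement.
stmt = DataStatement(
    curation_rationale="Product reviews chosen to study sentiment.",
    language_variety="English",
    annotator_demographics="Five crowd workers; dialects not collected.",
)

missing = missing_fields(stmt)
print("missing:", missing)
print("vague variety:", is_vague_variety(stmt.language_variety))
```

A check like this only confirms that a field was filled in and is not trivially short. Whether the speaker and annotator descriptions are accurate still has to be judged by a person.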

### 4.3 Generalization and bias

The payoff of a data statement is the ability to predict where a model will fail. If a sentiment classifier was trained on product reviews written by one demographic and labeled by another, a data statement makes the mismatch visible before deployment in a different context. Bender and Friedman distinguish a long form statement intended for documentation and a short form suitable for a paper or model release, acknowledging the practical tension between completeness and the cost of producing it.

### 4.4 Comparing the three frameworks

The frameworks are siblings, not rivals. They differ mainly in format, primary audience, and the modality they were designed around.

| Dimension | Datasheets for Datasets | Data Cards | Data Statements |
|---|---|---|---|
| Primary form | Prose answers to a fixed question set | Structured reusable blocks | Schema of provenance and demographic fields |
| Optimized for | Generality across modalities | Comparability and review at scale | Language data and generalization |
| Distinctive idea | Lifecycle questionnaire | Agentized content for distinct readers | Speaker and annotator demographics |
| Machine readability | Low | Moderate to high | Low to moderate |
| Best fit | Single dataset, any modality | Many datasets, one organization | NLP corpora and benchmarks |

The last row of any such table can mislead if read as a verdict. All three frameworks share the same underlying content categories developed in the next section. A team can answer the datasheet questions using Data Card style blocks and embed a data statement for the language portion, producing a single document that is general, comparable, and demographically precise at once.

## 5. What to Document

The three frameworks differ in format and emphasis but converge on a shared set of content categories. The following subsections describe what belongs in each.

### 5.1 Motivation

Record why the dataset exists.
State the task or research question it was built to support, the specific gap it was meant to fill, who created it, and who funded the work. Motivation is the lens through which every later decision should be read. A dataset assembled to study one phenomenon often carries sampling and labeling choices that make it unsuitable for another, and stating the original purpose warns future users away from misuse.

### 5.2 Composition

Describe what the dataset contains. Specify what a single instance represents, how many instances there are, and what fields each carries. State whether the dataset is a sample of some larger population and, if so, how the sample was drawn and whether it is representative. Disclose missing data, label distributions, and any class imbalance. When the data concerns people, document which subpopulations are present and in what proportions, since composition along demographic lines is frequently the difference between a fair system and a discriminatory one. Note any sensitive content and any confidentiality constraints.

### 5.3 Collection

Explain how the data was obtained. Describe the mechanism, whether direct observation, survey, scraping, sensors, or purchase from a third party. Record the time frame of collection, because data ages and a model trained on stale data drifts from the world it is meant to serve. For data about people, document whether they were aware of the collection, whether they consented, whether they could withdraw, and whether any ethical review took place. Collection is the category where consent and provenance live, and it is where the most serious harms originate when it is neglected.

### 5.4 Preprocessing, cleaning, and labeling

Document every transformation between the raw data and the released dataset. Record filtering rules, deduplication, normalization, tokenization, and any discarded records along with the reason for discarding them.
Preserve or describe access to the raw data when possible, since a transformation that looks innocuous can encode a consequential bias. For labeling, record who the annotators were, what instructions they followed, how disagreements were resolved, and what inter-annotator agreement was achieved. Labels inherit the perspectives and incentives of the people and processes that produced them, and a model can only ever be as coherent as its labels.

### 5.5 Uses

State what the dataset has already been used for and what it is suitable for. Equally important, state what it should not be used for. Identify aspects of composition or collection that could cause harm or unfair treatment if the dataset were applied to a task it was not designed for. This is the category that turns documentation from a description into guidance. A clear statement of intended and unintended uses gives downstream practitioners a standard against which to judge their own plans.

### 5.6 Distribution and maintenance

Record how the dataset is distributed, under what license, and subject to what terms or restrictions. Assign it a version and a citation so that results can be tied to an exact artifact. Name the party responsible for maintenance, state how errors can be reported, and describe how and whether the dataset will be updated, corrected, or retired. A dataset without an owner and a version is a liability, because it cannot be corrected when a flaw is found and cannot be cited reliably when a result depends on it.

## 6. A Formal View of What Documentation Buys

Documentation is usually discussed in qualitative terms, but its core function admits a precise statement. The central problem documentation addresses is **dataset shift**, the mismatch between the distribution a dataset was drawn from and the distribution on which a model is later deployed.

### 6.1 The deployment mismatch

Let $P_{\text{train}}(x, y)$ be the joint distribution over inputs $x$ and labels $y$ that produced the dataset, and let $P_{\text{deploy}}(x, y)$ be the distribution at the point of use. A model fit to the first incurs additional error at the second to the extent that the two differ. Two classical decompositions name the ways they can differ (Moreno-Torres et al., 2012). Under **covariate shift** the input marginal changes while the conditional is stable, $P_{\text{train}}(x) \neq P_{\text{deploy}}(x)$ but $P_{\text{train}}(y \mid x) = P_{\text{deploy}}(y \mid x)$. Under **concept shift** the labeling relationship itself changes, $P_{\text{train}}(y \mid x) \neq P_{\text{deploy}}(y \mid x)$.

The practical point is that **neither kind of shift is observable from the training data alone**. A practitioner holding only the instance matrix cannot tell whether the sampling frame excluded a subpopulation, or whether the labels encode the judgments of annotators whose criteria differ from the deployment context. These facts live in the collection and labeling history, which is exactly what the composition, collection, and labeling categories of a datasheet record. Documentation does not eliminate shift. It makes the relevant components of $P_{\text{train}}$ legible so that a user can estimate, before deployment, whether the gap to $P_{\text{deploy}}$ is tolerable.

### 6.2 Documentation as a sufficient statistic for fitness-of-use

It is useful to frame a datasheet as an attempt at a **sufficient description** for the question that actually matters to a downstream user: is this dataset fit for my intended use $u$? Let $D$ be the dataset and $\text{Doc}$ its documentation. The ideal is that the decision to adopt or reject $D$ for use $u$ depends on $D$ only through $\text{Doc}$,

$$
\text{fit}(D, u) \;\approx\; g\big(\text{Doc}(D),\, u\big),
$$

meaning a capable reader can predict fitness from the document without re-auditing the raw data.
No real datasheet achieves this exactly, because prose is lossy and authors omit what they did not think to record. The quality of a datasheet is precisely how close it comes to this ideal: how rarely a reader who trusts the document is later surprised by a property of the data that the document failed to surface. This reframes the practical advice in Section 8 as a single objective, namely minimizing the surprises that survive a careful reading.

### 6.3 A measure of completeness

Comparability across documents invites a crude but useful quantity. Fix a checklist of $n$ required fields for the chosen framework. Let $r_i \in \{0, 1\}$ indicate whether field $i$ is answered substantively rather than left blank or marked unknown, and let $w_i \ge 0$, not all zero, weight the field by how consequential its omission is, with collection consent and demographic composition weighted heavily and incidental metadata weighted lightly. A weighted completeness score is

$$
C \;=\; \frac{\sum_{i=1}^{n} w_i\, r_i}{\sum_{i=1}^{n} w_i} \;\in\; [0, 1].
$$

This number is a coverage indicator, not a quality guarantee. A document can score $C = 1$ and still misrepresent the data, since $r_i$ records only that a field was answered, not that the answer is true. Used honestly, $C$ supports review gates of the form "no dataset is promoted to production with $C$ below a threshold on the high-weight fields," which is how completeness acquires teeth inside an organization. Used dishonestly, it becomes a target that invites box-ticking, an instance of Goodhart's law, the general failure where a measure that becomes a target ceases to measure well.

### 6.4 Worked example: a sentiment dataset

Consider a dataset $D$ of ten thousand short texts labeled positive, negative, or neutral, intended to train a customer-support sentiment classifier. Suppose the datasheet records the following. The texts were scraped from public product reviews on a single retail platform between January and March of one year.
The reviewers skew toward one national English variety. The labels were assigned by five annotators recruited through one crowd platform, working from a two-page guideline, with a measured inter-annotator agreement of Fleiss' kappa around 0.62 and disagreements resolved by majority vote. Neutral is the minority class at roughly twelve percent of instances.

A reader can now reason about fitness without touching the raw data. The intended deployment is incoming support tickets, which are mostly complaints, written in several English varieties, arriving year round. The documentation surfaces three concrete mismatches. First, a register and genre gap, since product reviews are not support tickets, a covariate shift in $P(x)$. Second, a dialect gap between the reviewer population and the ticket population, a further covariate shift that the data alone would hide. Third, a concept concern, because the annotation guideline was written for reviews and may draw the neutral boundary differently than a support team would, a possible shift in $P(y \mid x)$. The kappa of about 0.62 warns that even the recorded labels carry real disagreement, so differences of a few accuracy points may be noise rather than signal.

None of these conclusions required re-auditing the data. They follow from the document, which is the entire point of writing it. Had the same dataset shipped with only its instances and a folder name, every one of these inferences would have required reverse engineering the collection process, if it were possible at all. That difference, between a user who can predict failure in an afternoon and a user who discovers it in production, is the value documentation delivers.

## 7. Documentation as a Tool for Accountability

### 7.1 From description to obligation

The frameworks share a quiet ambition. By asking creators to write down their decisions, they convert tacit choices into a record that others can inspect, question, and contest. A documented sampling decision can be challenged.
An undocumented one cannot, because no one knows it was made. In this sense documentation distributes power. It moves knowledge about a dataset out of the small group that built it and into the larger community that depends on it, including auditors, regulators, and the people the data describes.

### 7.2 Process over artifact

The deepest insight across this literature is that the value lies in the act of documenting, not only in the document. Answering the composition and collection questions forces creators to confront gaps and biases while there is still time to address them. Gebru and colleagues stress that datasheets are meant to encourage reflection during creation. A datasheet written honestly at the right moment changes the dataset, because the author notices, for example, that an entire demographic is absent and decides to collect more. The artifact is a byproduct of a better process.

### 7.3 Regulatory and organizational momentum

Documentation is moving from voluntary good practice toward expectation and requirement. Data protection regimes that grant rights over personal data presuppose that an organization knows what data it holds and where it came from, which is exactly what collection and composition documentation captures. Emerging regulation of high-risk AI systems contemplates obligations to document training data and its provenance. Inside organizations, dataset documentation increasingly functions as a review gate, where a dataset cannot be promoted to production use until its card or datasheet is complete and approved. This shift gives documentation teeth it historically lacked.

### 7.4 Limits and honest expectations

Documentation is necessary but not sufficient. A complete datasheet can describe a dataset that should never have been collected. A polished Data Card can present a biased dataset attractively. The frameworks depend on the candor of their authors and offer little protection against motivated misrepresentation.
Producing good documentation also costs real effort, and without organizational support that effort is the first thing cut under deadline pressure. The honest position is that documentation is a powerful enabling condition for accountability rather than a guarantee of it. It makes scrutiny possible. Whether scrutiny actually happens depends on incentives, culture, and enforcement that lie outside the document.

## 8. Practical Guidance

### 8.1 Choosing a framework

Practitioners should match the framework to context. For general-purpose datasets across many modalities, Datasheets for Datasets offers a thorough and widely understood question set. For organizations documenting many datasets that must be compared and reviewed at scale, Data Cards provide structure and tooling. For language data, Data Statements supply the demographic and provenance detail that NLP generalization demands. These choices are not exclusive. A team can answer datasheet questions using card-style structured blocks, and a language dataset can adopt the data statement schema within a broader datasheet.

### 8.2 Making documentation durable

Documentation that lives in a forgotten document decays. Store the documentation with the dataset, version the two together, and treat updates to the data as requiring updates to the record. Assign an owner. Make completeness a precondition for release. Write for the specific readers who will use the dataset rather than a generic audience, and surface the uncomfortable facts about gaps and limitations prominently rather than burying them. The test of good documentation is simple. A capable newcomer should be able to read it and correctly predict how the dataset will and will not serve a proposed use.

Tooling should reduce the friction of meeting this standard rather than add ceremony. Open-source options exist for several of these steps.
Croissant, a metadata format for machine learning datasets standardized through MLCommons, lets a datasheet's structured fields travel with the data in a machine-readable form. The Hugging Face Hub renders a dataset card from a versioned Markdown file checked in beside the data, so the document and the artifact move together by default. Plain text formats under version control, whether Markdown, YAML, or JSON, are preferable to a slide or a wiki page precisely because they diff, review, and version like code.

::: callout-tip
## When to invest, and where it goes wrong

**When the effort clearly pays off.** Document heavily when the dataset will outlive its creators' tenure, when it concerns people, when it will be reused across teams or released publicly, or when it feeds a system subject to audit or regulation. The cost of documentation is paid largely up front, with modest upkeep at each new version; the cost of an undocumented dataset is paid repeatedly by every downstream user who must reverse engineer it.

**Common pitfalls.**

- *Retrospective documentation.* Writing the datasheet months after collection loses the very knowledge it was meant to capture. Document during creation, while the decisions are fresh and still changeable.
- *Completeness theater.* Optimizing the completeness score $C$ rather than the truth of the answers. A field marked answered is not a field answered honestly.
- *Burying the limitations.* Putting gaps and biases in a footnote while foregrounding flattering statistics. The uncomfortable facts are the ones a user most needs.
- *Vague language varieties and demographics.* Writing "English" where "United States English, informal register, product-review genre" is what a reader actually needs to predict generalization.
- *Orphaned documents.* A datasheet with no owner and no link to a dataset version cannot be corrected when a flaw is found and cannot be cited reliably when a result depends on it.
:::

## References

1. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM, 64(12). https://dl.acm.org/doi/10.1145/3458723
2. Pushkarna, M., Zaldivar, A., and Kjartansson, O. (2022). Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://dl.acm.org/doi/10.1145/3531146.3533231
3. Bender, E. M., and Friedman, B. (2018). Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics, 6. https://aclanthology.org/Q18-1041/
4. Buolamwini, J., and Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research, 81. https://proceedings.mlr.press/v81/buolamwini18a.html
5. Birhane, A., and Prabhu, V. U. (2021). Large Image Datasets: A Pyrrhic Win for Computer Vision? IEEE Winter Conference on Applications of Computer Vision (WACV). https://ieeexplore.ieee.org/document/9423393
6. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. (2019). Model Cards for Model Reporting. ACM Conference on Fairness, Accountability, and Transparency (FAT*). https://dl.acm.org/doi/10.1145/3287560.3287596
7. Holland, S., Hosny, A., Newman, S., Joseph, J., and Chmielinski, K. (2018). The Dataset Nutrition Label: A Framework to Drive Higher Data Quality Standards. https://arxiv.org/abs/1805.03677
8. Hutchinson, B., Smart, A., Hanna, A., Denton, E., Greer, C., Kjartansson, O., Barnes, P., and Mitchell, M. (2021). Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure. ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://dl.acm.org/doi/10.1145/3442188.3445918
9. Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., and Herrera, F. (2012). A Unifying View on Dataset Shift in Classification. Pattern Recognition, 45(1). https://doi.org/10.1016/j.patcog.2011.06.019