80  Data Ethics and Governance

Modern machine learning systems are only as trustworthy as the data that shapes them. A model trained on consented, representative, and well-governed data can serve people fairly; a model trained on scraped, skewed, or carelessly retained data can entrench harm at scale. This chapter examines the ethical and governance practices that surround datasets across their full lifecycle, from collection through deletion. It treats data ethics not as a compliance afterthought but as an engineering discipline with concrete artifacts, roles, and checkpoints. The aim is to give practitioners a rigorous and balanced foundation: principled enough to reason about hard cases, practical enough to operationalize in a production pipeline.

80.1 1. Foundations of Data Ethics

80.1.1 1.1 Why Data Ethics Matters

Data is not a neutral resource. Every dataset encodes choices about what was measured, who was included, how categories were defined, and which signals were discarded. Those choices propagate through model training and become operational policy when a system makes predictions about real people. A credit model that learns from historical lending decisions inherits the discrimination embedded in that history. A clinical model trained predominantly on one population may underperform on others. Because machine learning generalizes patterns, it also generalizes the assumptions and inequities baked into its inputs.

Data ethics asks practitioners to take responsibility for these downstream effects at the point where data is gathered and curated. The discipline draws on several normative traditions. Consequentialist reasoning weighs aggregate benefits and harms. Deontological reasoning grounds duties such as honoring consent and respecting autonomy regardless of outcome. Justice-based reasoning focuses on the fair distribution of benefits and burdens across groups. In practice these lenses are complementary rather than competing, and mature governance programs invoke all of them.

80.1.2 1.2 The Data Lifecycle as an Ethical Surface

It helps to view ethics as spanning the entire data lifecycle: collection, labeling, storage, processing, model training, deployment, and eventual deletion. Each stage introduces distinct risks. Collection raises questions of consent and lawful basis. Labeling raises questions of annotator working conditions and category validity. Storage raises questions of security and retention. Deployment raises questions of fairness, contestability, and feedback loops. Treating any single stage in isolation produces blind spots. A dataset collected with impeccable consent can still cause harm if it is retained indefinitely or repurposed for an incompatible use. Governance therefore operates as a continuous practice rather than a one-time gate.

80.3 3. Fairness and Representation in Datasets

80.3.1 3.1 How Bias Enters Data

Bias is not a single phenomenon but a family of failure modes that enter at different points. Historical bias reflects inequities present in the world that the data faithfully records, such as occupational segregation visible in employment records. Representation bias arises when the sampling process underrepresents some groups, as when a dataset of faces drawn from one region underrepresents others. Measurement bias arises when the chosen proxy imperfectly captures the target concept, as when arrest data is used as a proxy for crime even though arrests reflect policing patterns. Aggregation bias arises when a single model is applied across groups for whom distinct models would be more appropriate. Naming the specific mechanism matters because each calls for a different remedy.

80.3.2 3.2 Defining Fairness

Fairness has no single technical definition, and several plausible definitions are mutually incompatible. Group fairness criteria such as demographic parity require similar rates of positive predictions across groups. Equalized odds requires similar true positive and false positive rates across groups. Calibration requires that a predicted score mean the same thing regardless of group membership. A well-known impossibility result shows that calibration and equalized error rates generally cannot both hold when base rates differ across groups, except in degenerate cases such as a perfect predictor. This means practitioners must choose which fairness property to prioritize, and that choice is a value judgment rather than a purely technical one.
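To make the group criteria concrete, the following Python sketch computes, for each group, the selection rate (the quantity demographic parity compares) and the true positive and false positive rates (the quantities equalized odds compares). The function names `group_rates` and `max_gap` and the toy labels are illustrative assumptions rather than any library's API.

```python
from collections import defaultdict


def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the rate is undefined."""
    return numerator / denominator if denominator else None


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate, and false positive rate.

    y_true and y_pred are sequences of 0/1 labels; groups holds a group key
    for each example.
    """
    counts = defaultdict(lambda: {"n": 0, "sel": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["sel"] += pred
        if truth == 1:
            c["pos"] += 1
            c["tp"] += pred
        else:
            c["neg"] += 1
            c["fp"] += pred
    return {
        group: {
            "selection_rate": _ratio(c["sel"], c["n"]),
            "tpr": _ratio(c["tp"], c["pos"]),
            "fpr": _ratio(c["fp"], c["neg"]),
        }
        for group, c in counts.items()
    }


def max_gap(rates, key):
    """Largest difference between any two groups on one rate."""
    values = [r[key] for r in rates.values() if r[key] is not None]
    return max(values) - min(values)


# Toy data (hypothetical): two groups of four examples each.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
print(max_gap(rates, "selection_rate"))  # demographic parity gap: 0.25
print(max_gap(rates, "tpr"))             # equalized odds, TPR gap: 0.5
print(max_gap(rates, "fpr"))             # equalized odds, FPR gap: 0.5
```

A gap of zero on one criterion says nothing about the others, which is why the choice of criterion has to be made and justified explicitly.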

Individual fairness offers a complementary lens, demanding that similar individuals receive similar treatment, though it requires a contested definition of similarity. Because no criterion is universally correct, fairness work should begin by articulating who could be harmed, in what way, and which definition best protects against that specific harm. Stating the choice explicitly, with its tradeoffs, is more honest than presenting any single metric as the answer.

80.3.3 3.3 Representation and Documentation

Representation begins with knowing the composition of a dataset. Many datasets ship with no description of their demographic makeup, making it impossible to assess whether a group is underrepresented. Documentation practices address this gap. Datasheets for Datasets, proposed by Gebru and colleagues, prompt creators to record motivation, composition, collection process, recommended uses, and known limitations. Data Statements play a similar role for language data. These artifacts do not by themselves remove bias, but they make it visible and reviewable, and they create accountability by attaching authorship to data decisions.
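One way to make such documentation reviewable is to store it as a structured record that tooling can check for completeness. The sketch below is a minimal illustration in Python: the `Datasheet` class, its field names, and the example text are assumptions loosely based on the sections listed above, not the full question set from the Datasheets for Datasets proposal.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Minimal dataset documentation record (illustrative field set)."""

    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    recommended_uses: str = ""
    known_limitations: str = ""

    def missing_sections(self):
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical, partially completed datasheet.
sheet = Datasheet(
    motivation="Benchmark speech recognition across accents.",
    composition="Audio clips with the speaker's region recorded per clip.",
    collection_process="Volunteers recorded prompts after opt-in consent.",
)
print(sheet.missing_sections())  # ['recommended_uses', 'known_limitations']
```

A release pipeline could refuse to publish a dataset while `missing_sections()` is non-empty, although a completeness check says nothing about whether the answers are accurate.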

Improving representation can involve targeted collection to fill gaps, reweighting to correct skew, or constraining the deployment context to populations the data supports. Each approach has limits. Targeted collection of sensitive group data can itself raise privacy concerns. Reweighting cannot manufacture signal that was never measured. The honest position is that some datasets simply should not be used for some purposes, and recognizing that boundary is part of fairness work.
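As a rough illustration of reweighting and of its limit, the Python sketch below assigns each record a weight equal to its group's target share divided by its observed share. The function name `reweight` and the toy groups are assumptions made for the example. Note that the function refuses to proceed when a target group has no records at all, because no weight can stand in for data that was never collected.

```python
from collections import Counter


def reweight(groups, target_shares):
    """Return one weight per record so weighted group shares match target_shares.

    groups is a sequence of group keys, one per record. target_shares maps
    each group key to its desired share of the total (shares sum to 1).
    """
    n = len(groups)
    observed = {g: count / n for g, count in Counter(groups).items()}

    unmeasured = sorted(g for g, share in target_shares.items()
                        if share > 0 and g not in observed)
    if unmeasured:
        # Reweighting cannot create signal for a group with zero records.
        raise ValueError(f"no records for groups: {unmeasured}")

    unexpected = sorted(set(observed) - set(target_shares))
    if unexpected:
        raise ValueError(f"no target share given for groups: {unexpected}")

    return [target_shares[g] / observed[g] for g in groups]


# Toy data (hypothetical): group "a" is 80% of records, target is 50/50.
groups = ["a"] * 8 + ["b"] * 2
weights = reweight(groups, {"a": 0.5, "b": 0.5})
print(weights[0], weights[-1])  # about 0.625 for "a" records, 2.5 for "b" records
```

Large weights on a handful of records, as for group "b" here, also increase variance, so a heavily skewed sample remains a weak basis for conclusions about the small group even after reweighting.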

80.4 4. Data Minimization

80.4.1 4.1 The Principle

Data minimization holds that one should collect and retain only the data necessary for a specified, legitimate purpose. The principle inverts the default impulse of large-scale data practice, which is to gather everything in case it proves useful later. Minimization reduces risk along several dimensions at once. Less data collected means fewer individuals exposed. Less data retained means a smaller attack surface in the event of a breach. Tighter purpose limitation means less opportunity for function creep, where data gathered for one aim is quietly repurposed for another.

80.4.2 4.2 Minimization in Practice

Operationalizing minimization requires asking, for each field collected, whether it is genuinely required for the stated purpose. Often the answer is no, and the field persists only out of habit. Several techniques support the principle. Collecting aggregates instead of individual records, sampling rather than capturing exhaustively, and computing on device so that raw data never leaves the user’s hardware all reduce exposure. Retention schedules enforce deletion after the data’s purpose is fulfilled, converting minimization from a one-time decision into an ongoing discipline.
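The per-field question and the retention schedule can both be expressed as small, testable rules. The following Python sketch assumes a hypothetical purpose named `churn_model`, an illustrative allowlist of fields, and a one-year retention period; none of these values come from a regulation or from this chapter.

```python
from datetime import date, timedelta

# Hypothetical policy: fields each purpose may keep, and how long.
ALLOWED_FIELDS = {
    "churn_model": {"account_age_days", "plan", "monthly_usage"},
}
RETENTION = {
    "churn_model": timedelta(days=365),
}


def minimize(record, purpose):
    """Keep only the fields allowlisted for this purpose; drop everything else."""
    allowed = ALLOWED_FIELDS[purpose]
    return {key: value for key, value in record.items() if key in allowed}


def is_expired(collected_on, purpose, today):
    """True when the record has outlived the retention period for its purpose."""
    return today - collected_on > RETENTION[purpose]


raw = {
    "email": "user@example.com",
    "plan": "pro",
    "monthly_usage": 42,
    "account_age_days": 300,
}
print(minimize(raw, "churn_model"))
# {'plan': 'pro', 'monthly_usage': 42, 'account_age_days': 300}
print(is_expired(date(2023, 1, 1), "churn_model", today=date(2024, 6, 1)))  # True
```

Using an allowlist rather than a blocklist means a newly added field is dropped by default until someone justifies keeping it, which matches the direction of the principle.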

Minimization sits in tension with the machine learning culture of large datasets, where more data frequently improves performance. The tension is real but often overstated. Beyond a point, additional data yields diminishing returns, and the marginal records frequently carry the highest privacy risk relative to their value. A disciplined team treats data volume as a cost to be justified rather than a good to be maximized, and weighs the modest accuracy gains of extra collection against the concentrated risk it creates.

80.5 5. Regulatory Context

80.5.1 5.1 The General Data Protection Regulation

The General Data Protection Regulation, which took effect across the European Union in 2018, is widely regarded as the most influential data protection law and is a useful reference point even for organizations outside its jurisdiction. It applies not only to organizations established in the EU but also to those elsewhere that offer goods or services to individuals in the EU or monitor their behavior, giving it broad extraterritorial reach. Its core principles map closely onto the ethical commitments discussed above: lawfulness, fairness, and transparency; purpose limitation; data minimization; accuracy; storage limitation; integrity and confidentiality; and accountability.

The regulation requires a lawful basis for processing, of which consent is only one. Others include contractual necessity, legal obligation, and legitimate interests, the last of which requires balancing the controller’s interests against the rights of the individual. It grants individuals rights including access, rectification, erasure, portability, and objection, along with protections concerning certain automated decision making. It distinguishes the data controller, who determines the purposes and means of processing, from the data processor, who acts on the controller’s behalf, and assigns obligations accordingly.

80.5.2 5.2 Obligations of Particular Relevance to Machine Learning

Several GDPR provisions bear directly on model development. The principle of purpose limitation constrains repurposing data collected for one aim to train a model for another, a common practice that often lacks a clear lawful basis. The storage limitation principle conflicts with the impulse to retain training data indefinitely. The right to erasure raises difficult questions when an individual’s data has already influenced a trained model, since deleting a record from a database does not remove its influence from learned parameters. Provisions on automated decision making give individuals a right not to be subject to solely automated decisions with significant effects, with safeguards including meaningful information about the logic involved.
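The erasure problem is easier to manage when an organization can at least answer which datasets hold a person's records and which models were trained on them. The Python sketch below is one hypothetical way to keep that index; the class `LineageIndex`, its method names, and the identifiers in the example are assumptions. It only identifies affected models. It does not remove a record's influence from learned parameters, and whether retraining is required remains a legal and policy judgment.

```python
from collections import defaultdict


class LineageIndex:
    """Tracks which datasets contain a subject and which models used each dataset."""

    def __init__(self):
        self._datasets_by_subject = defaultdict(set)
        self._models_by_dataset = defaultdict(set)

    def record_subject(self, subject_id, dataset):
        """Note that a dataset holds records about this subject."""
        self._datasets_by_subject[subject_id].add(dataset)

    def record_training(self, model, dataset):
        """Note that a model was trained on this dataset."""
        self._models_by_dataset[dataset].add(model)

    def erase(self, subject_id):
        """Forget the subject in the index.

        Returns (datasets to purge, models trained on those datasets),
        both sorted for stable output.
        """
        datasets = self._datasets_by_subject.pop(subject_id, set())
        models = set()
        for dataset in datasets:
            models |= self._models_by_dataset.get(dataset, set())
        return sorted(datasets), sorted(models)


index = LineageIndex()
index.record_subject("subject-17", "signups_2023")
index.record_training("churn_v2", "signups_2023")
index.record_training("churn_v3", "signups_2023")

print(index.erase("subject-17"))
# (['signups_2023'], ['churn_v2', 'churn_v3'])
```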

A Data Protection Impact Assessment is required for processing likely to result in a high risk to individuals, which frequently includes large-scale profiling and novel uses of sensitive data. The assessment forces a structured analysis of necessity, proportionality, and mitigation before processing begins, making it a natural integration point for the ethical review this chapter advocates.

80.5.3 5.3 The Wider Regulatory Landscape

GDPR is not alone. The California Consumer Privacy Act, as amended by the California Privacy Rights Act, establishes rights for California residents. Sector-specific laws such as the United States Health Insurance Portability and Accountability Act govern health data, and the Children’s Online Privacy Protection Act governs data collected online from children under 13. The EU Artificial Intelligence Act, adopted in 2024, layers obligations specific to AI systems on top of data protection law, with requirements for data governance and quality in high-risk applications. Many jurisdictions are enacting their own regimes, producing a patchwork that any global system must navigate. The practical lesson is that designing to a high principled standard, rather than the minimum of any single jurisdiction, is both more robust and easier to maintain than chasing each law separately.

80.6 6. Governance Roles and Policies

80.6.1 6.1 Roles and Accountability

Governance fails when responsibility is diffuse. Effective programs assign clear roles. A Data Protection Officer, required under GDPR in certain cases, provides independent oversight of compliance. Data owners take accountability for specific datasets, including their quality and appropriate use. Data stewards handle day-to-day curation and metadata. A privacy or ethics review board evaluates higher-risk proposals. Engineers and data scientists remain responsible for the systems they build rather than delegating ethical judgment entirely to a separate function. The principle threaded through these roles is accountability: for every dataset and model there should be an identifiable person or body answerable for its conduct.

80.6.2 6.2 Policies and Processes

Roles need processes to act through. Core artifacts include a data classification scheme that sorts data by sensitivity and attaches handling requirements to each tier; access controls that enforce least privilege so people reach only the data their work requires; retention and deletion schedules; and audit logging that records who accessed what and when. A data catalog with lineage tracking allows an organization to answer basic questions, such as where a given field originated and which models consumed it, that are otherwise surprisingly hard to answer at scale.
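How classification, least privilege, and audit logging fit together can be sketched in a few lines of Python. The tier names, the `AccessGate` class, and the rule that only public data is readable without an explicit grant are simplifying assumptions for illustration; a production system would rely on its platform's access control and logging facilities instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Sensitivity(IntEnum):
    """Illustrative classification tiers, ordered by sensitivity."""

    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


@dataclass
class AccessGate:
    tiers: dict    # dataset name -> Sensitivity
    grants: set    # explicit (user, dataset) pairs
    audit_log: list = field(default_factory=list)

    def request(self, user, dataset):
        """Decide one access request and record it, whether allowed or denied."""
        tier = self.tiers[dataset]
        # Least privilege: anything above PUBLIC needs an explicit grant.
        allowed = tier == Sensitivity.PUBLIC or (user, dataset) in self.grants
        self.audit_log.append({
            "user": user,
            "dataset": dataset,
            "tier": tier.name,
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return allowed


gate = AccessGate(
    tiers={"clinical_notes": Sensitivity.RESTRICTED, "public_docs": Sensitivity.PUBLIC},
    grants={("ana", "clinical_notes")},
)
print(gate.request("ana", "clinical_notes"))  # True
print(gate.request("bo", "clinical_notes"))   # False
print(len(gate.audit_log))                    # 2
```

Logging denied requests as well as granted ones matters, because repeated denials are often the first visible sign of a misconfigured role or an attempted misuse.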

Review processes embed ethics into the development workflow. A data use proposal describes intended purpose, lawful basis, affected populations, and risks, and is reviewed before collection proceeds. An impact assessment, building on the GDPR model, documents necessity and mitigation for higher-risk work. These processes should be proportionate, with lightweight paths for routine low-risk work and deeper scrutiny reserved for novel or sensitive uses, so that governance accelerates rather than obstructs responsible projects.
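Proportionate review can be made explicit by recording each proposal in a fixed shape and routing it by risk. In the Python sketch below, the `DataUseProposal` fields mirror the items named above, while the two risk flags, the routing rule, and the path names are hypothetical choices an organization would replace with its own criteria.

```python
from dataclasses import dataclass, field


@dataclass
class DataUseProposal:
    purpose: str
    lawful_basis: str
    affected_populations: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    uses_sensitive_data: bool = False  # hypothetical risk flag
    novel_use: bool = False            # hypothetical risk flag


def review_path(proposal):
    """Route a proposal to a review path proportionate to its risk."""
    if not (proposal.purpose.strip() and proposal.lawful_basis.strip()):
        return "incomplete"    # cannot be reviewed until the basics are stated
    if proposal.uses_sensitive_data or proposal.novel_use:
        return "full_review"   # deeper scrutiny, for example an impact assessment
    return "lightweight"       # routine low-risk work


routine = DataUseProposal(
    purpose="Weekly dashboard of aggregate sign-up counts",
    lawful_basis="legitimate interests",
)
sensitive = DataUseProposal(
    purpose="Train a triage model on clinical notes",
    lawful_basis="consent",
    affected_populations=["patients"],
    risks=["re-identification"],
    uses_sensitive_data=True,
)
print(review_path(routine))    # lightweight
print(review_path(sensitive))  # full_review
```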

80.6.3 6.3 Culture and Incentives

Documents and boards are necessary but not sufficient. Governance lives or dies on culture and incentives. If teams are rewarded solely for shipping speed and model accuracy, ethical considerations will be treated as obstacles. If review boards lack authority to halt a project, their findings become advisory noise. Sustainable governance requires that ethical review carry real weight, that practitioners have channels to raise concerns without penalty, and that leadership treats data stewardship as a measure of quality rather than a tax on productivity. The most rigorous policy is inert without the organizational will to enforce it.

80.7 7. Dataset Harms and Mitigation

80.7.1 7.1 A Taxonomy of Harms

Datasets can cause several distinct kinds of harm. Allocative harms occur when a system distributes resources or opportunities unfairly, as when a hiring model systematically screens out qualified candidates from a group. Representational harms occur when systems reinforce demeaning stereotypes or render groups invisible, as when an image tagging system mislabels people or a search system returns degrading results for certain identities. Privacy harms occur through exposure, re-identification, or inference of sensitive attributes a person never disclosed. Quality-of-service harms occur when a system simply works less well for some groups, such as speech recognition that performs poorly for certain accents.

Two further harms deserve emphasis. Feedback loops occur when a model’s outputs shape the future data it is trained on, amplifying initial bias over time. The dynamic is well documented in predictive policing: patrols concentrate where past arrests occurred, which generates more arrests there. Aggregation and surveillance harms occur when the mere existence of a large dataset enables monitoring and control beyond any individual prediction, a structural harm that no single accurate output offsets.
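A deliberately simplified simulation shows how such a loop can run away. In the Python sketch below, two districts have identical true incident rates, but the patrol is always sent to whichever district has more recorded incidents, and incidents are only recorded where the patrol goes. The district names, starting counts, and the greedy allocation rule are assumptions chosen for clarity; this is a toy model, not the model analyzed in the predictive policing literature.

```python
def simulate_patrols(recorded, true_rate, days):
    """Send one patrol per day to the district with the most recorded incidents.

    Only the patrolled district adds to its record, at its true daily rate.
    Returns the recorded counts after the given number of days.
    """
    recorded = dict(recorded)  # copy so the caller's data is not mutated
    for _ in range(days):
        target = max(recorded, key=recorded.get)
        recorded[target] += true_rate[target]
    return recorded


# Hypothetical start: a one-incident difference in the historical record.
start = {"north": 11, "south": 10}
rates = {"north": 1.0, "south": 1.0}  # identical underlying rates

result = simulate_patrols(start, rates, days=100)
print(result)  # {'north': 111.0, 'south': 10}

north_share = result["north"] / sum(result.values())
print(round(north_share, 3))  # 0.917
```

The recorded share for the north district rises from about 52% to about 92% even though nothing about the districts differs, because the data now measures where patrols were sent rather than where incidents occur.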

80.7.2 7.2 Mitigation Strategies

Mitigation begins upstream, because harms are far cheaper to prevent at collection than to patch after deployment. Documentation through datasheets and data statements surfaces limitations early. Representative collection and careful proxy selection reduce bias at its source. Privacy-preserving techniques such as differential privacy and on-device computation reduce exposure. Fairness evaluation, disaggregated by group so that aggregate metrics do not mask disparities for minorities, detects quality-of-service gaps before they reach users.
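As one concrete instance of a privacy-preserving technique, the Python sketch below releases a count with Laplace noise, the textbook mechanism for differential privacy on counting queries. A count changes by at most 1 when one person is added or removed, so noise with scale 1/epsilon is used. The function names and the toy ages are assumptions, and the code is for illustration only: naive floating-point noise sampling has known weaknesses, so real deployments should use a vetted differential privacy library.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Count the values matching predicate, plus Laplace noise of scale 1/epsilon.

    A counting query has sensitivity 1, so this is the standard Laplace
    mechanism for an epsilon-differentially-private count.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)           # seeded only so the example is repeatable
ages = [23, 35, 41, 67, 70, 18]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 65, epsilon=1.0, rng=rng)
print(noisy)  # the true count is 2; the released value is 2 plus random noise
```

Smaller values of epsilon add more noise and give stronger protection, and every additional release against the same data spends more of the overall privacy budget.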

Downstream mitigations matter too. Continuous monitoring detects drift and emerging disparities after deployment. Human review and contestability mechanisms give affected individuals a route to challenge decisions, which both corrects errors and respects autonomy. Red teaming probes for failure modes that ordinary testing misses. Deletion and retraining processes, though imperfect, attempt to honor erasure requests. No single technique suffices, and the credible posture is defense in depth: layered safeguards across the lifecycle, paired with the humility to restrict or abandon a use when the data cannot support it safely.

80.7.3 7.3 The Limits of Technical Fixes

A recurring temptation is to treat dataset harms as engineering problems with engineering solutions. Some are, but many are not. A debiasing algorithm cannot resolve a disagreement about which notion of fairness is correct, because that is a normative question. A privacy technique cannot make an inappropriate use appropriate. When a dataset reflects unjust social conditions, technical correction can obscure rather than address the underlying injustice, and may lend a false veneer of objectivity to a contested decision. The mature practitioner pairs technical skill with the judgment to recognize when a problem exceeds technical means and requires a different decision: narrowing the scope, involving affected communities, or declining the use entirely.

80.8 8. Conclusion

Data ethics and governance are not peripheral compliance chores but central to building systems that deserve trust. The threads of this chapter reinforce one another. Consent and privacy protect individual autonomy. Fairness and representation protect groups from inheriting historical injustice. Minimization reduces the surface of risk. Regulation such as GDPR codifies a baseline that principled practice should exceed. Roles and policies translate intention into reliable action. And a clear understanding of dataset harms, with layered mitigation, closes the loop from principle to outcome.

The unifying lesson is that ethical data practice is continuous and accountable. It spans the full lifecycle, it assigns responsibility to identifiable people, and it retains the humility to recognize the limits of technical remedy. Practitioners who internalize this stance build systems that are not only more lawful but more worthy of the trust that users place in them.

80.9 References

  1. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM. https://dl.acm.org/doi/10.1145/3458723

  2. Bender, E. M., and Friedman, B. (2018). Data Statements for Natural Language Processing. Transactions of the Association for Computational Linguistics. https://aclanthology.org/Q18-1041/

  3. Nissenbaum, H. (2010). Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press. https://www.sup.org/books/title/?id=8862

  4. Dwork, C., and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf

  5. Suresh, H., and Guttag, J. (2021). A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. https://dl.acm.org/doi/10.1145/3465416.3483305

  6. Chouldechova, A. (2017). Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data. https://www.liebertpub.com/doi/10.1089/big.2016.0047

  7. Barocas, S., Hardt, M., and Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://fairmlbook.org/

  8. European Parliament and Council (2016). Regulation (EU) 2016/679 (General Data Protection Regulation). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2016/679/oj

  9. European Parliament and Council (2024). Regulation (EU) 2024/1689 (Artificial Intelligence Act). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  10. Crawford, K. (2017). The Trouble with Bias. Keynote, Conference on Neural Information Processing Systems. https://www.youtube.com/watch?v=fMym_BKWQzk

  11. Buolamwini, J., and Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/buolamwini18a.html

  12. Ensign, D., Friedler, S. A., Neville, S., Scheidegger, C., and Venkatasubramanian, S. (2018). Runaway Feedback Loops in Predictive Policing. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/ensign18a.html