80 Data Ethics and Governance

Modern machine learning systems are only as trustworthy as the data that shapes them. A model trained on consented, representative, and well governed data can serve people fairly; a model trained on scraped, skewed, or carelessly retained data can entrench harm at scale. This chapter examines the ethical and governance practices that surround datasets across their full lifecycle, from collection through deletion. It treats data ethics not as a compliance afterthought but as an engineering discipline with concrete artifacts, roles, and checkpoints. The aim is to give practitioners a rigorous and balanced foundation: principled enough to reason about hard cases, practical enough to operationalize in a production pipeline.

80.1 1. Foundations of Data Ethics

80.1.1 1.1 Why Data Ethics Matters

Data is not a neutral resource. Every dataset encodes choices about what was measured, who was included, how categories were defined, and which signals were discarded. Those choices propagate through model training and become operational policy when a system makes predictions about real people. A credit model that learns from historical lending decisions inherits the discrimination embedded in that history. A clinical model trained predominantly on one population may underperform on others. Because machine learning generalizes patterns, it also generalizes the assumptions and inequities baked into its inputs.

Data ethics asks practitioners to take responsibility for these downstream effects at the point where data is gathered and curated. The discipline draws on several normative traditions. Consequentialist reasoning weighs aggregate benefits and harms. Deontological reasoning grounds duties such as honoring consent and respecting autonomy regardless of outcome. Justice based reasoning focuses on the fair distribution of benefits and burdens across groups. In practice these lenses are complementary rather than competing, and mature governance programs invoke all of them.

80.1.2 1.2 The Data Lifecycle as an Ethical Surface

It helps to view ethics as spanning the entire data lifecycle: collection, labeling, storage, processing, model training, deployment, and eventual deletion. Each stage introduces distinct risks. Collection raises questions of consent and lawful basis. Labeling raises questions of annotator working conditions and category validity. Storage raises questions of security and retention. Deployment raises questions of fairness, contestability, and feedback loops. Treating any single stage in isolation produces blind spots. A dataset collected with impeccable consent can still cause harm if it is retained indefinitely or repurposed for an incompatible use. Governance therefore operates as a continuous practice rather than a one time gate.

The diagram below traces the lifecycle and attaches the dominant ethical concerns to each stage. Reading it left to right makes the continuity concrete: an obligation accepted at collection, such as honoring a later withdrawal of consent, must still be discharged at deletion, many stages later.

```{mermaid}
flowchart LR
  A["Collection: consent, lawful basis"] --> B["Labeling: annotator welfare, category validity"]
  B --> C["Storage: security, retention limits"]
  C --> D["Processing: minimization, purpose limitation"]
  D --> E["Training: representation, fairness"]
  E --> F["Deployment: contestability, monitoring"]
  F --> G["Deletion: erasure, retraining"]
  G -.->|feedback shapes future collection| A
```

The dashed return edge marks a structural risk discussed later in the chapter: a deployed model can shape the data it is next trained on, so the lifecycle is a loop rather than a line.

80.3 3. Fairness and Representation in Datasets

80.3.1 3.1 How Bias Enters Data

Bias is not a single phenomenon but a family of failure modes that enter at different points. Historical bias reflects inequities present in the world that the data faithfully records, such as occupational segregation visible in employment records. Representation bias arises when the sampling process underrepresents some groups, as when a dataset of faces drawn from one region underrepresents others. Measurement bias arises when the chosen proxy imperfectly captures the target concept, as when arrest data is used as a proxy for crime even though arrests reflect policing patterns. Aggregation bias arises when a single model is applied across groups for whom distinct models would be more appropriate. Naming the specific mechanism matters because each calls for a different remedy.

80.3.2 3.2 Defining Fairness

Fairness has no single technical definition, and several plausible definitions are mutually incompatible. To compare them precisely, let $A$ denote a protected attribute (group membership), $Y \in \{0,1\}$ the true outcome, $R$ a predictor’s risk score, and $\hat{Y}$ the binary decision. The common group criteria are then conditional independence statements:

Demographic (statistical) parity: $\hat{Y} \perp A$, that is $\Pr[\hat{Y} = 1 \mid A = a]$ is equal across groups. It equalizes selection rates but ignores whether the decision is correct.
Equalized odds: $\hat{Y} \perp A \mid Y$, that is the true positive rate $\Pr[\hat{Y}=1 \mid Y=1, A=a]$ and the false positive rate $\Pr[\hat{Y}=1 \mid Y=0, A=a]$ are equal across groups. It equalizes error rates conditional on the truth.
Calibration (sufficiency): $Y \perp A \mid R$, that is $\Pr[Y = 1 \mid R = r, A = a]$ is equal across groups, so a score of $r$ means the same thing for everyone.
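
The three criteria above can be checked from a confusion matrix computed separately for each group. The following is a minimal sketch, not a reference implementation: the function name `group_rates` and the toy data are illustrative assumptions, and because it works on binary decisions rather than a continuous score, it reports positive predictive value as a coarse stand-in for calibration.

```python
from collections import defaultdict


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def group_rates(y_true, y_pred, groups):
    """Per-group rates behind the three criteria, for binary labels and decisions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, y_hat, a in zip(y_true, y_pred, groups):
        # "t"/"f": was the decision correct?  "p"/"n": was the decision positive?
        cell = ("t" if y == y_hat else "f") + ("p" if y_hat == 1 else "n")
        counts[a][cell] += 1

    report = {}
    for a, c in counts.items():
        report[a] = {
            # Demographic parity compares selection rates.
            "selection_rate": _ratio(c["tp"] + c["fp"], sum(c.values())),
            # Equalized odds compares both of the next two rates.
            "tpr": _ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": _ratio(c["fp"], c["fp"] + c["tn"]),
            # Binary-decision analogue of calibration (sufficiency).
            "ppv": _ratio(c["tp"], c["tp"] + c["fp"]),
        }
    return report


# Toy data (hypothetical): four people in each of two groups.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = group_rates(y_true, y_pred, groups)
```

In this toy data both groups have a selection rate of 0.5 and a positive predictive value of 0.5, yet their true positive and false positive rates differ, so demographic parity holds while equalized odds fails. Reporting all of the rates side by side keeps that disagreement visible instead of hiding it behind one metric.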

An impossibility result. Suppose the base rates differ across groups, $\Pr[Y=1 \mid A=0] \neq \Pr[Y=1 \mid A=1]$, and the classifier is not perfect. Then no classifier can simultaneously satisfy calibration and equalized odds; more sharply, it cannot equalize both the false positive rate and the false negative rate across groups while staying calibrated, except in the degenerate cases of equal base rates or perfect prediction (references 6 and 14). The intuition is short. Calibration ties each score to a fixed probability of $Y=1$. When base rates differ, matching those probabilities forces the score distributions to differ between groups, and once the distributions differ, any single threshold produces different error rates on each side. The conflict is not an artifact of a particular algorithm; it is a property of the statistics. This is why practitioners must choose which fairness property to prioritize, and that choice is a value judgment rather than a purely technical one.
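
For a binary decision the conflict can be written as a single identity, following reference 6. The notation here is introduced for this derivation only: within one group let $p = \Pr[Y=1]$ be the base rate, $\mathrm{PPV} = \Pr[Y=1 \mid \hat{Y}=1]$ the positive predictive value, and $\mathrm{FPR}$ and $\mathrm{FNR}$ the false positive and false negative rates. The fraction of the group that is a true positive is $p\,(1-\mathrm{FNR})$ and the fraction that is a false positive is $(1-p)\,\mathrm{FPR}$, so

$$
\mathrm{PPV} = \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}}
\quad\Longrightarrow\quad
\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr).
$$

If two groups share the same $\mathrm{PPV}$ and the same $\mathrm{FNR}$ but have different base rates $p$, the right-hand side differs between them, so their false positive rates must differ. Equalizing all three quantities is possible only when the base rates coincide or when prediction is perfect, in which case $\mathrm{PPV} = 1$ and $\mathrm{FPR} = 0$ for both groups.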

Individual fairness offers a complementary lens, demanding that similar individuals receive similar treatment. Formally it asks the prediction map to be Lipschitz with respect to a task specific similarity metric $d$ on individuals: $\lVert M(x) - M(x') \rVert \le L \, d(x, x')$ for all pairs $x, x'$ (reference 15). Its difficulty is that constructing a defensible $d$ is itself a contested value judgment, since the metric must encode which differences between people are legitimate grounds for different treatment. Because no criterion is universally correct, fairness work should begin by articulating who could be harmed, in what way, and which definition best protects against that specific harm. Stating the choice explicitly, with its tradeoffs, is more honest than presenting any single metric as the answer.
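
The Lipschitz condition can be audited empirically on a sample of individuals once a metric has been chosen. The sketch below is illustrative only: the function `lipschitz_violations`, the feature layout, and the scaled distance `d` are hypothetical, and the choice of `d` is exactly the contested value judgment described above, not something the code can settle.

```python
from itertools import combinations


def lipschitz_violations(points, scores, distance, lipschitz_constant=1.0):
    """Return index pairs (i, j) with |M(x_i) - M(x_j)| > L * d(x_i, x_j)."""
    violations = []
    for i, j in combinations(range(len(points)), 2):
        gap = abs(scores[i] - scores[j])
        if gap > lipschitz_constant * distance(points[i], points[j]):
            violations.append((i, j))
    return violations


def d(x, y):
    """Hypothetical similarity metric: income difference (in thousands,
    scaled by 100) plus the absolute difference in debt ratio."""
    return abs(x[0] - y[0]) / 100 + abs(x[1] - y[1])


# Hypothetical applicants as (income in thousands, debt ratio) with model scores.
applicants = [(50, 0.30), (51, 0.30), (90, 0.10)]
scores = [0.40, 0.80, 0.10]
flagged_pairs = lipschitz_violations(applicants, scores, d)
```

Here the first two applicants are nearly identical under `d` yet receive scores 0.40 and 0.80, so the pair `(0, 1)` is reported. A pairwise scan is quadratic in the sample size, which is acceptable for an audit sample but not for a full population.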

80.3.3 3.3 A Worked Example of the Fairness Tradeoff

A small numerical example shows the tension in concrete terms. Consider a risk score used to flag loan applicants, evaluated on two groups of 1000 people each. Suppose the score is calibrated in its high risk band: among everyone the model flags as “high risk”, 30 percent actually default, in both groups. The groups differ in base rate. In group $A$, 100 of the 1000 applicants would default; in group $B$, 300 would.

Because group $B$ has the higher true default rate, a score that keeps this 30 percent meaning places more of group $B$ in the high risk band. Suppose the decision threshold flags 200 people in group $A$ and 500 in group $B$ as high risk. The flagged sets then contain $200 \times 0.3 = 60$ and $500 \times 0.3 = 150$ defaulters respectively, so the flag means the same thing in both groups by construction. Now examine the error rates on the people who would not default:

Group $A$: of $900$ non-defaulters, $200 \times 0.7 = 140$ are flagged, a false positive rate of $140 / 900 \approx 0.156$.
Group $B$: of $700$ non-defaulters, $500 \times 0.7 = 350$ are flagged, a false positive rate of $350 / 700 = 0.5$.

A non-defaulter in group $B$ is more than three times as likely to be wrongly flagged as one in group $A$, even though a high risk flag carries exactly the same 30 percent default rate in both groups. The example pins down only the flagged band: the unflagged applicants default at $40 / 800 = 5$ percent in group $A$ and $150 / 500 = 30$ percent in group $B$, so strictly it demonstrates equal positive predictive value rather than calibration at every score. The conflict is driven by the differing base rates, not by the threshold: moving the threshold to close the false positive gap while keeping the flag's meaning fixed opens a gap in the false negative rate instead. The lesson is not that one group is being treated unfairly in some absolute sense; it is that “calibrated” and “equal false positive rates” are different commitments, and a designer who has not chosen between them has implicitly chosen one anyway. This is exactly the dynamic documented in audits of recidivism risk tools (reference 6).
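
The arithmetic of this example is easy to reproduce, and doing so also surfaces the quantities the text leaves implicit, such as the false negative rate and the default rate among unflagged applicants. The helper `band_metrics` below is a hypothetical name for a minimal sketch that takes the group size, the number of defaulters, the number flagged, and the default rate among the flagged.

```python
def band_metrics(n, defaulters, flagged, flagged_default_rate):
    """Confusion-matrix rates for one group under a single high risk flag."""
    true_pos = round(flagged * flagged_default_rate)  # flagged and would default
    false_pos = flagged - true_pos                    # flagged but would repay
    false_neg = defaulters - true_pos                 # defaulters who were not flagged
    true_neg = n - defaulters - false_pos             # repayers who were not flagged
    return {
        "fpr": false_pos / (false_pos + true_neg),
        "fnr": false_neg / (false_neg + true_pos),
        "ppv": true_pos / flagged,
        "default_rate_unflagged": false_neg / (false_neg + true_neg),
    }


group_a = band_metrics(n=1000, defaulters=100, flagged=200, flagged_default_rate=0.3)
group_b = band_metrics(n=1000, defaulters=300, flagged=500, flagged_default_rate=0.3)

for name, metrics in (("A", group_a), ("B", group_b)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Both groups come out with a positive predictive value of 0.3, while the false positive rates are about 0.156 and 0.5, matching the figures above. The same run shows false negative rates of 0.4 and 0.5, a reminder that error rates differ on both sides of the decision.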

80.3.4 3.4 Representation and Documentation

Representation begins with knowing the composition of a dataset. Many datasets ship with no description of their demographic makeup, making it impossible to assess whether a group is underrepresented. Documentation practices address this gap. Datasheets for Datasets, proposed by Gebru and colleagues, prompt creators to record motivation, composition, collection process, recommended uses, and known limitations. Data Statements play a similar role for language data. These artifacts do not by themselves remove bias, but they make it visible and reviewable, and they create accountability by attaching authorship to data decisions.

Improving representation can involve targeted collection to fill gaps, reweighting to correct skew, or constraining the deployment context to populations the data supports. Each approach has limits. Targeted collection of sensitive group data can itself raise privacy concerns. Reweighting cannot manufacture signal that was never measured. The honest position is that some datasets simply should not be used for some purposes, and recognizing that boundary is part of fairness work.
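
Reweighting, the second remedy above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function `reweight` and the target shares are hypothetical, and the weights only rescale records that exist. The explicit error for a group with no records makes concrete the point that reweighting cannot manufacture signal that was never measured.

```python
from collections import Counter


def reweight(groups, target_share):
    """Per-record weights so that weighted group shares match target_share."""
    observed = Counter(groups)
    missing = set(target_share) - set(observed)
    if missing:
        # No weight can compensate for a group that was never sampled.
        raise ValueError(f"no records for groups: {sorted(missing)}")
    n = len(groups)
    # Weight = desired share / observed share, applied to every record in the group.
    return [target_share[g] * n / observed[g] for g in groups]


# Hypothetical sample: group "x" is overrepresented relative to a 50/50 target.
sample = ["x"] * 8 + ["y"] * 2
weights = reweight(sample, {"x": 0.5, "y": 0.5})
```

Each record from the underrepresented group receives weight 2.5 and each record from the overrepresented group 0.625, so the weighted total still equals the sample size. Large weights on a handful of records increase variance, which is one reason reweighting is a partial remedy and not a substitute for targeted collection.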

80.4 4. Data Minimization

80.4.1 4.1 The Principle

Data minimization holds that one should collect and retain only the data necessary for a specified, legitimate purpose. The principle inverts the default impulse of large scale data practice, which is to gather everything in case it proves useful later. Minimization reduces risk along several dimensions at once. Less data collected means fewer individuals exposed. Less data retained means a smaller attack surface in the event of a breach. Tighter purpose limitation means less opportunity for function creep, where data gathered for one aim is quietly repurposed for another.

80.4.2 4.2 Minimization in Practice

Operationalizing minimization requires asking, for each field collected, whether it is genuinely required for the stated purpose. Often the answer is no, and the field persists only out of habit. Several techniques support the principle: collecting aggregates instead of individual records, sampling rather than capturing exhaustively, and computing on device so that raw data never leaves the user’s hardware all reduce exposure. Retention schedules enforce deletion after the data’s purpose is fulfilled, converting minimization from a one time decision into an ongoing discipline.
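
Field-level minimization and retention can both be expressed as small, testable checks keyed by purpose. The sketch below assumes a hypothetical purpose registry; the purpose name, field names, and the one year retention period are placeholders chosen for illustration, not recommendations.

```python
from datetime import date, timedelta

# Hypothetical registry: which fields each purpose may keep, and for how long.
ALLOWED_FIELDS = {"fraud_detection": {"account_id", "amount", "timestamp"}}
RETENTION = {"fraud_detection": timedelta(days=365)}


def minimize(record, purpose):
    """Drop every field that the stated purpose does not require."""
    allowed = ALLOWED_FIELDS[purpose]
    return {key: value for key, value in record.items() if key in allowed}


def is_expired(collected_on, purpose, today):
    """True once a record has outlived the retention period for its purpose."""
    return today - collected_on > RETENTION[purpose]


raw = {
    "account_id": 7,
    "amount": 12.5,
    "timestamp": "2024-01-01T00:00:00",
    "birth_date": "1990-01-01",  # not needed for this purpose, so it is dropped
}
stored = minimize(raw, "fraud_detection")
```

An allowlist fails closed: a newly added field is discarded until someone justifies it for a purpose, which matches the stance that each field must earn its place. A scheduled job that deletes every record for which `is_expired` returns true is one way to turn the retention schedule into an ongoing discipline.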

Minimization sits in tension with the machine learning culture of large datasets, where more data frequently improves performance. The tension is real but often overstated, and it can be framed precisely. Empirically, generalization error tends to fall with dataset size $n$ following a power law, roughly $\text{error}(n) \approx \alpha\, n^{-\beta} + \gamma$ with a small exponent $\beta$ and an irreducible floor $\gamma$. The marginal accuracy gain $\lvert d\,\text{error}/dn \rvert$ therefore shrinks as $n^{-(\beta + 1)}$, so each additional record buys less than the one before. Privacy risk, by contrast, does not diminish: every record added is one more individual exposed in a breach and one more person whose erasure rights must later be honored, so the expected harm grows at least linearly in $n$. Plotting marginal benefit against marginal risk gives a crossover point past which collection is net negative. A disciplined team treats data volume as a cost to be justified rather than a good to be maximized, and weighs the modest, diminishing accuracy gains of extra collection against the linearly accumulating risk it creates.
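
The crossover point can be computed in closed form under two simplifying assumptions that go beyond the text: each unit of error reduction is worth a fixed amount `value_per_error_unit`, and each additional record carries a constant marginal risk `risk_per_record` (the linear case of the "at least linearly" growth described above). Setting marginal benefit equal to marginal risk gives $n^{*} = (v\,\alpha\,\beta / c)^{1/(\beta+1)}$. All numeric values below are hypothetical.

```python
def marginal_error_reduction(n, alpha, beta):
    """|d error / dn| for error(n) = alpha * n**(-beta) + gamma."""
    return alpha * beta * n ** -(beta + 1)


def crossover_size(alpha, beta, value_per_error_unit, risk_per_record):
    """Dataset size at which marginal benefit equals marginal risk."""
    return (value_per_error_unit * alpha * beta / risk_per_record) ** (1 / (beta + 1))


# Hypothetical parameters, chosen only to make the crossover visible.
ALPHA, BETA = 1.0, 0.5
VALUE, RISK = 1000.0, 0.001

n_star = crossover_size(ALPHA, BETA, VALUE, RISK)
benefit_before = VALUE * marginal_error_reduction(n_star / 2, ALPHA, BETA)
benefit_after = VALUE * marginal_error_reduction(n_star * 2, ALPHA, BETA)
```

With these placeholder values the crossover falls at roughly 6,300 records. The floor $\gamma$ does not appear because it vanishes in the derivative. If marginal risk grows with $n$ instead of staying constant, the crossover arrives earlier, so the constant-risk case is the most generous estimate of how much data is worth collecting.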

80.5 5. Regulatory Context

80.5.1 5.1 The General Data Protection Regulation

The General Data Protection Regulation, which has applied across the European Union since May 2018, is widely treated as the reference point for data protection law, even by organizations outside its jurisdiction. It covers organizations established in the EU and also reaches those outside it when they offer goods or services to, or monitor the behavior of, individuals in the EU, giving it broad extraterritorial reach. Its core principles map closely onto the ethical commitments discussed above: lawfulness, fairness, and transparency; purpose limitation; data minimization; accuracy; storage limitation; integrity and confidentiality; and accountability.

The regulation requires a lawful basis for processing, of which consent is only one. Others include contractual necessity, legal obligation, and legitimate interests, the last of which requires balancing the controller’s interests against the rights of the individual. It grants individuals rights including access, rectification, erasure, portability, and objection, along with protections against certain solely automated decisions. It distinguishes the data controller, who determines the purposes and means of processing, from the data processor, who acts on the controller’s behalf, and assigns obligations accordingly.

80.5.2 5.2 Obligations of Particular Relevance to Machine Learning

Several GDPR provisions bear directly on model development. The principle of purpose limitation constrains repurposing data collected for one aim to train a model for another, a common practice that often lacks a clear lawful basis. The storage limitation principle conflicts with the impulse to retain training data indefinitely. The right to erasure raises difficult questions when an individual’s data has already influenced a trained model, since deleting a record from a database does not remove its influence from learned parameters. Provisions on automated decision making give individuals a right not to be subject to solely automated decisions with significant effects, with safeguards including meaningful information about the logic involved.

A Data Protection Impact Assessment is required for processing likely to result in high risk to individuals, which frequently includes large scale profiling and novel uses of sensitive data. The assessment forces a structured analysis of necessity, proportionality, and mitigation before processing begins, making it a natural integration point for the ethical review this chapter advocates.

80.5.3 5.3 The Wider Regulatory Landscape

GDPR is not alone. The California Consumer Privacy Act, as amended by the California Privacy Rights Act, establishes rights for California residents. Sector specific laws such as the United States Health Insurance Portability and Accountability Act govern health data, and the Children’s Online Privacy Protection Act governs data collected online from children under 13. The EU Artificial Intelligence Act, adopted in 2024, layers obligations specific to AI systems on top of data protection law, with requirements for data governance and quality in high risk applications. Many jurisdictions are enacting their own regimes, producing a patchwork that any global system must navigate. The practical lesson is that designing to a high principled standard, rather than the minimum of any single jurisdiction, is both more robust and easier to maintain than chasing each law separately.

80.6 6. Governance Roles and Policies

80.6.1 6.1 Roles and Accountability

Governance fails when responsibility is diffuse. Effective programs assign clear roles. A Data Protection Officer, required under GDPR in certain cases, provides independent oversight of compliance. Data owners take accountability for specific datasets, including their quality and appropriate use. Data stewards handle day to day curation and metadata. A privacy or ethics review board evaluates higher risk proposals. Engineers and data scientists remain responsible for the systems they build rather than delegating ethical judgment entirely to a separate function. The principle threaded through these roles is accountability: for every dataset and model there should be an identifiable person or body answerable for its conduct.

80.6.2 6.2 Policies and Processes

Roles need processes to act through. Core artifacts include a data classification scheme that sorts data by sensitivity and attaches handling requirements to each tier; access controls that enforce least privilege so people reach only the data their work requires; retention and deletion schedules; and audit logging that records who accessed what and when. A data catalog with lineage tracking allows an organization to answer basic questions, such as where a given field originated and which models consumed it, that are otherwise surprisingly hard to answer at scale.
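
A classification scheme, a least privilege check, and an audit log fit together naturally in code. The following is a minimal sketch under stated assumptions: the four tier names, the `AccessGate` class, and the idea of a per-user clearance level are hypothetical simplifications, and a real deployment would derive access from role and purpose and not from a single clearance value.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical sensitivity tiers, ordered from least to most restricted.
TIERS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass
class AccessGate:
    """Grants access only at or below the caller's clearance and logs every attempt."""

    audit_log: list = field(default_factory=list)

    def request(self, user, clearance, dataset, tier):
        granted = TIERS[clearance] >= TIERS[tier]
        # Denied attempts are logged too: they are often the most informative entries.
        self.audit_log.append(
            {
                "when": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "dataset": dataset,
                "tier": tier,
                "granted": granted,
            }
        )
        return granted


gate = AccessGate()
first = gate.request("ana", "internal", "claims_2023", "restricted")
second = gate.request("raj", "restricted", "claims_2023", "restricted")
```

Routing every read through one gate is what makes the log trustworthy: the question of who accessed what and when can be answered from `audit_log` alone, with no need to reconstruct it from scattered application logs.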

Review processes embed ethics into the development workflow. A data use proposal describes intended purpose, lawful basis, affected populations, and risks, and is reviewed before collection proceeds. An impact assessment, building on the GDPR model, documents necessity and mitigation for higher risk work. These processes should be proportionate, with lightweight paths for routine low risk work and deeper scrutiny reserved for novel or sensitive uses, so that governance accelerates rather than obstructs responsible projects.
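
Proportionate review can be made explicit by recording a proposal as structured data and routing it on a few risk signals. The sketch below is one possible shape, not a prescribed process: the `DataUseProposal` fields mirror the elements listed above, while the three risk flags and the names of the review paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataUseProposal:
    purpose: str
    lawful_basis: str
    affected_populations: tuple
    uses_special_categories: bool = False
    novel_use: bool = False
    large_scale_profiling: bool = False


def review_path(proposal):
    """Route a proposal to a review depth proportionate to its risk signals."""
    if not proposal.purpose or not proposal.lawful_basis:
        return "returned: purpose and lawful basis are required"
    high_risk = (
        proposal.uses_special_categories
        or proposal.novel_use
        or proposal.large_scale_profiling
    )
    return "full review with impact assessment" if high_risk else "lightweight review"


routine = DataUseProposal("capacity planning", "legitimate interests", ("customers",))
sensitive = DataUseProposal(
    "triage model", "consent", ("patients",), uses_special_categories=True
)
```

Encoding the routing rule keeps the lightweight path genuinely lightweight and makes the criteria for deeper scrutiny visible and open to challenge, which supports the goal of governance that accelerates responsible projects.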

80.6.3 6.3 Culture and Incentives

Documents and boards are necessary but not sufficient. Governance lives or dies on culture and incentives. If teams are rewarded solely for shipping speed and model accuracy, ethical considerations will be treated as obstacles. If review boards lack authority to halt a project, their findings become advisory noise. Sustainable governance requires that ethical review carry real weight, that practitioners have channels to raise concerns without penalty, and that leadership treats data stewardship as a measure of quality rather than a tax on productivity. The most rigorous policy is inert without the organizational will to enforce it.

80.7 7. Dataset Harms and Mitigation

80.7.1 7.1 A Taxonomy of Harms

Datasets can cause several distinct kinds of harm. Allocative harms occur when a system distributes resources or opportunities unfairly, as when a hiring model systematically screens out qualified candidates from a group. Representational harms occur when systems reinforce demeaning stereotypes or render groups invisible, as when an image tagging system mislabels people or a search system returns degrading results for certain identities. Privacy harms occur through exposure, re-identification, or inference of sensitive attributes a person never disclosed. Quality of service harms occur when a system simply works less well for some groups, such as speech recognition that performs poorly for certain accents.

Two further harms deserve emphasis. Feedback loops occur when a model’s outputs shape the future data it is trained on, amplifying initial bias over time, a dynamic well documented in predictive policing where patrols concentrate where past arrests occurred, generating more arrests there. Aggregation and surveillance harms occur when the mere existence of a large dataset enables monitoring and control beyond any individual prediction, a structural harm that no single accurate output offsets.
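
A toy simulation shows how a feedback loop can amplify a small initial difference. This is a deliberately simplified, deterministic sketch and not the model analyzed in reference 12: it assumes a single patrol sent each day to whichever district has the most recorded arrests, identical true incident rates in both districts, and arrests recorded only where the patrol goes. The function name and all numbers are hypothetical.

```python
def simulate_patrols(initial_arrests, true_rate, days):
    """Greedy allocation: patrol the district with the most recorded arrests."""
    arrests = list(initial_arrests)
    for _ in range(days):
        target = max(range(len(arrests)), key=lambda i: arrests[i])
        # Incidents are only observed, and so only recorded, where the patrol is.
        arrests[target] += true_rate[target]
    return arrests


start = [11, 10]   # nearly equal historical counts
rates = [1, 1]     # identical true rates in both districts
final = simulate_patrols(start, rates, days=100)

share_before = start[0] / sum(start)
share_after = final[0] / sum(final)
```

The first district begins with about 52 percent of recorded arrests and ends with about 92 percent, although the underlying rates are identical. The recorded data diverges from reality because the model's own decisions determine where observations are made, which is the defining feature of the loop.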

80.7.2 7.2 Mitigation Strategies

Mitigation begins upstream, because harms are far cheaper to prevent at collection than to patch after deployment. Documentation through datasheets and data statements surfaces limitations early. Representative collection and careful proxy selection reduce bias at its source. Privacy preserving techniques such as differential privacy and on device computation reduce exposure. Fairness evaluation, disaggregated by group so that aggregate metrics do not mask disparities for minorities, detects quality of service gaps before they reach users.
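
Disaggregated evaluation takes very little code, which makes its absence hard to justify. The sketch below computes accuracy overall and per group; the function name and the synthetic data are assumptions for illustration, and accuracy could be swapped for any metric relevant to the harm in question.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall accuracy, {group: accuracy})."""
    hits, totals = defaultdict(int), defaultdict(int)
    for y, y_hat, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(y == y_hat)
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {g: hits[g] / totals[g] for g in totals}


# Synthetic data: 90 majority records (87 correct) and 10 minority records (5 correct).
groups = ["majority"] * 90 + ["minority"] * 10
y_true = [1] * 100
y_pred = [1] * 87 + [0] * 3 + [1] * 5 + [0] * 5

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
```

The aggregate accuracy is 0.92, a figure that would pass many release gates, while accuracy for the smaller group is 0.5. Small groups also produce noisy estimates, so per-group figures are best reported together with their sample sizes.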

Downstream mitigations matter too. Continuous monitoring detects drift and emerging disparities after deployment. Human review and contestability mechanisms give affected individuals a route to challenge decisions, which both corrects errors and respects autonomy. Red teaming probes for failure modes that ordinary testing misses. Deletion and retraining processes, though imperfect, attempt to honor erasure requests. No single technique suffices, and the credible posture is defense in depth: layered safeguards across the lifecycle, paired with the humility to restrict or abandon a use when the data cannot support it safely.

80.7.3 7.3 The Limits of Technical Fixes

A recurring temptation is to treat dataset harms as engineering problems with engineering solutions. Some are, but many are not. A debiasing algorithm cannot resolve a disagreement about which notion of fairness is correct, because that is a normative question. A privacy technique cannot make an inappropriate use appropriate. When a dataset reflects unjust social conditions, technical correction can obscure rather than address the underlying injustice, and may lend a false veneer of objectivity to a contested decision. The mature practitioner pairs technical skill with the judgment to recognize when a problem exceeds technical means and requires a different decision: narrowing the scope, involving affected communities, or declining the use entirely.

80.8 8. Conclusion

Data ethics and governance are not peripheral compliance chores but central to building systems that deserve trust. The threads of this chapter reinforce one another. Consent and privacy protect individual autonomy. Fairness and representation protect groups from inheriting historical injustice. Minimization reduces the surface of risk. Regulation such as GDPR codifies a baseline that principled practice should exceed. Roles and policies translate intention into reliable action. And a clear understanding of dataset harms, with layered mitigation, closes the loop from principle to outcome.

The unifying lesson is that ethical data practice is continuous and accountable. It spans the full lifecycle, it assigns responsibility to identifiable people, and it retains the humility to recognize the limits of technical remedy. Practitioners who internalize this stance build systems that are not only more lawful but more worthy of the trust that users place in them.

80.9 References

1. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM. https://dl.acm.org/doi/10.1145/3458723
2. Bender, E. M., and Friedman, B. (2018). Data Statements for Natural Language Processing. Transactions of the Association for Computational Linguistics. https://aclanthology.org/Q18-1041/
3. Nissenbaum, H. (2010). Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press. https://www.sup.org/books/title/?id=8862
4. Dwork, C., and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
5. Suresh, H., and Guttag, J. (2021). A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. https://dl.acm.org/doi/10.1145/3465416.3483305
6. Chouldechova, A. (2017). Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data. https://www.liebertpub.com/doi/10.1089/big.2016.0047
7. Barocas, S., Hardt, M., and Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://fairmlbook.org/
8. European Parliament and Council (2016). Regulation (EU) 2016/679 (General Data Protection Regulation). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2016/679/oj
9. European Parliament and Council (2024). Regulation (EU) 2024/1689 (Artificial Intelligence Act). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
10. Crawford, K. (2017). The Trouble with Bias. Keynote, Conference on Neural Information Processing Systems. https://www.youtube.com/watch?v=fMym_BKWQzk
11. Buolamwini, J., and Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/buolamwini18a.html
12. Ensign, D., Friedler, S. A., Neville, S., Scheidegger, C., and Venkatasubramanian, S. (2018). Runaway Feedback Loops in Predictive Policing. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/ensign18a.html
13. Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data. https://dl.acm.org/doi/10.1145/1217299.1217302
14. Kleinberg, J., Mullainathan, S., and Raghavan, M. (2017). Inherent Trade-Offs in the Fair Determination of Risk Scores. Innovations in Theoretical Computer Science (ITCS). https://doi.org/10.4230/LIPIcs.ITCS.2017.43
15. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. (2012). Fairness Through Awareness. Innovations in Theoretical Computer Science (ITCS). https://dl.acm.org/doi/10.1145/2090236.2090255

## 2. Consent and Privacy

### 2.1 The Meaning of Informed Consent

Consent is the mechanism by which individuals exercise autonomy over information about themselves.
For consent to be ethically meaningful it must be informed, freely given, specific, and revocable. Informed means the person understands what data is collected and to what end. Freely given means consent is not coerced or bundled as a precondition for an unrelated service. Specific means consent covers a defined purpose rather than an open ended grant. Revocable means the person can withdraw and have that withdrawal honored downstream.

Many data practices fall short of this standard. Lengthy terms of service that no one reads do not produce genuine understanding. Pre ticked boxes do not produce a free choice. Consent obtained for one purpose, such as providing a mapping service, does not extend to an unrelated purpose, such as training a face recognition model. The gap between the legal fiction of consent and its ethical substance is one of the central tensions in data governance, and large scale web scraping for model training has sharpened it considerably, since the individuals whose data appears in training corpora typically never agreed to that use.

### 2.2 Privacy Beyond Consent

Privacy is broader than consent. Even data that someone shared willingly in one context can cause harm when moved into another. Helen Nissenbaum's framework of contextual integrity captures this insight: information flows carry norms tied to the context in which they occur, and a privacy violation is best understood as a breach of those context relative norms rather than simply the disclosure of secret facts. A medical disclosure to a physician carries different expectations than the same fact surfaced in an advertising profile.

Technical measures complement consent in protecting privacy. De-identification removes direct identifiers, though it is fragile because auxiliary data often permits re-identification. Differential privacy offers a rigorous mathematical guarantee by bounding how much any single individual's record can influence an output, typically by adding calibrated noise.
K-anonymity and its successors aim to make each record indistinguishable from a group of others. None of these techniques is a complete solution on its own, and each trades some utility for protection. The practitioner's task is to select safeguards proportionate to the sensitivity of the data and the risk of the use.

To make these guarantees precise it helps to state the two most important ones formally.

**Differential privacy.** A randomized mechanism $M$ that maps a dataset to an output satisfies $(\varepsilon, \delta)$-differential privacy if, for every pair of datasets $D$ and $D'$ that differ in the record of a single individual, and for every measurable set of outputs $S$,

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] \;+\; \delta .
$$

The parameter $\varepsilon$ (the privacy budget) bounds the multiplicative change in the output distribution attributable to any one person; smaller $\varepsilon$ means stronger privacy. The additive term $\delta$ allows a small failure probability and is typically chosen to be cryptographically negligible, and in any case well below $1/n$ for a dataset of $n$ records. The guarantee is worst case over all individuals and all auxiliary information an adversary might hold, which is what makes it robust where ad hoc de-identification is not.

A standard way to achieve pure $\varepsilon$-differential privacy ($\delta = 0$) for a numeric query $f$ is the Laplace mechanism, which adds noise scaled to the query's global sensitivity $\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1$, the largest change one record can cause, with the maximum taken over pairs of datasets that differ in a single record:

$$
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right).
$$

Two properties make the definition practical. Differential privacy is closed under post-processing: no function applied to $M(D)$ without touching the raw data can weaken the guarantee.
Differential privacy also composes: running mechanisms with budgets $\varepsilon_1$ and $\varepsilon_2$ on the same data yields at most $(\varepsilon_1 + \varepsilon_2)$-differential privacy, so a fixed total budget must be apportioned across all queries and training steps that touch the data (reference 4). This composition accounting is the discipline that prevents privacy from leaking away one query at a time.

**K-anonymity.** A release satisfies $k$-anonymity if, on the quasi-identifiers (attributes such as ZIP code, birth date, and sex that are not unique alone but can identify in combination), every record is identical to at least $k - 1$ others, so each individual hides in an equivalence class of size at least $k$. K-anonymity resists naive linkage but has known weaknesses: if every record in an equivalence class shares the same sensitive value, membership in the class still reveals that value (a homogeneity attack). The refinements $\ell$-diversity, which requires at least $\ell$ well represented sensitive values per class, and $t$-closeness, which requires each class's sensitive distribution to stay within distance $t$ of the overall distribution, were introduced to close these gaps (reference 13). Unlike differential privacy, these are syntactic properties of a particular release and do not by themselves bound what an adversary with side information can learn.

**When to use which, and what goes wrong.** The choice among these tools follows the threat model. Use differential privacy when you publish statistics or train models on sensitive records and must defend against adversaries with unknown side information; its cost is utility loss at small $\varepsilon$ and the bookkeeping of a finite budget across all releases.
Use $k$-anonymity and its refinements when you release a static microdata table and the quasi-identifiers are well understood; their cost is brittleness against side information and the curse of dimensionality, since high dimensional records rarely share enough attributes to form large equivalence classes.

The recurring pitfall across all of them is treating de-identification as binary. Linkage attacks have re-identified individuals in supposedly anonymous medical, mobility, and recommendation datasets by joining them against public records, which is why a removed name is not the same as a protected person. A second pitfall is forgetting composition: each additional release or query spends part of a privacy budget that, once exhausted, cannot be replenished by adding more noise after the fact.

### 2.3 Special Categories and Vulnerable Populations

Certain data demands heightened care. Health records, biometric identifiers, sexual orientation, religious belief, and political affiliation can expose people to discrimination or worse if mishandled. Data about children warrants additional protection because children cannot meaningfully consent. Data about marginalized groups carries elevated risk because those groups already face disproportionate surveillance and disadvantage. Ethical practice defaults toward restriction for these categories, collecting them only with strong justification and protecting them with the strongest available controls.

## 3. Fairness and Representation in Datasets

### 3.1 How Bias Enters Data

Bias is not a single phenomenon but a family of failure modes that enter at different points. Historical bias reflects inequities present in the world that the data faithfully records, such as occupational segregation visible in employment records. Representation bias arises when the sampling process underrepresents some groups, as when a dataset of faces drawn from one region underrepresents others.
Measurement bias arises when the chosen proxy imperfectly captures the target concept, as when arrest data is used as a proxy for crime even though arrests reflect policing patterns. Aggregation bias arises when a single model is applied across groups for whom distinct models would be more appropriate. Naming the specific mechanism matters because each calls for a different remedy.

### 3.2 Defining Fairness

Fairness has no single technical definition, and several plausible definitions are mutually incompatible. To compare them precisely, let $A$ denote a protected attribute (group membership), $Y \in \{0,1\}$ the true outcome, $R$ a predictor's risk score, and $\hat{Y}$ the binary decision. The common group criteria are then conditional independence statements:

- **Demographic (statistical) parity**: $\hat{Y} \perp A$, that is $\Pr[\hat{Y} = 1 \mid A = a]$ is equal across groups. It equalizes selection rates but ignores whether the decision is correct.
- **Equalized odds**: $\hat{Y} \perp A \mid Y$, that is the true positive rate $\Pr[\hat{Y}=1 \mid Y=1, A=a]$ and the false positive rate $\Pr[\hat{Y}=1 \mid Y=0, A=a]$ are equal across groups. It equalizes error rates conditional on the truth.
- **Calibration (sufficiency)**: $Y \perp A \mid R$, that is $\Pr[Y = 1 \mid R = r, A = a]$ is equal across groups, so a score of $r$ means the same thing for everyone.

**An impossibility result.** Suppose the base rates differ across groups, $\Pr[Y=1 \mid A=0] \neq \Pr[Y=1 \mid A=1]$, and the classifier is not perfect. Then no classifier can simultaneously satisfy calibration and equalized odds; more sharply, it cannot equalize both the false positive rate and the false negative rate across groups while staying calibrated, except in the degenerate cases of equal base rates or perfect prediction (references 6 and 14). The intuition is short. Calibration ties each score to a fixed probability of $Y=1$.
When base rates differ, matching those probabilities forces the score distributions to differ between groups, and once the distributions differ, any single threshold produces different error rates on each side. The conflict is not an artifact of a particular algorithm; it is a property of the statistics. This is why practitioners must choose which fairness property to prioritize, and that choice is a value judgment rather than a purely technical one.

Individual fairness offers a complementary lens, demanding that similar individuals receive similar treatment. Formally it asks the prediction map to be Lipschitz with respect to a task specific similarity metric $d$ on individuals: $\lVert M(x) - M(x') \rVert \le L \, d(x, x')$ for all pairs $x, x'$ (reference 15). Its difficulty is that constructing a defensible $d$ is itself a contested value judgment, since the metric must encode which differences between people are legitimate grounds for different treatment.

Because no criterion is universally correct, fairness work should begin by articulating who could be harmed, in what way, and which definition best protects against that specific harm. Stating the choice explicitly, with its tradeoffs, is more honest than presenting any single metric as the answer.

### 3.3 A Worked Example of the Fairness Tradeoff

A small numerical example shows the tension in concrete terms. Consider a risk score used to flag loan applicants, evaluated on two groups of 1000 people each. Suppose the high risk band is calibrated in the same way for both groups: among everyone the model assigns a "high risk" score, 30 percent actually default, in either group. The groups differ only in base rate. In group $A$, 100 of the 1000 applicants would default; in group $B$, 300 would. Because the model is calibrated and group $B$ has the higher true default rate, the score concentrates more of group $B$ in the high risk band. Suppose the decision threshold flags 200 people in group $A$ and 500 in group $B$ as high risk.
Calibration of the high risk band holds by construction (30 percent of each flagged set defaults). Now examine the error rates on the people who would not default:

- Group $A$: of $900$ non-defaulters, $200 \times 0.7 = 140$ are flagged, a false positive rate of $140 / 900 \approx 0.156$.
- Group $B$: of $700$ non-defaulters, $500 \times 0.7 = 350$ are flagged, a false positive rate of $350 / 700 = 0.5$.

A non-defaulter in group $B$ is more than three times as likely to be wrongly flagged as one in group $A$, even though a high risk flag carries exactly the same 30 percent default rate in both groups. There is no threshold that removes this gap without breaking calibration, because the gap is driven by the differing base rates, not by the threshold. The lesson is not that one group is being treated unfairly in some absolute sense; it is that "calibrated" and "equal false positive rates" are different commitments, and a designer who has not chosen between them has implicitly chosen one anyway. This is exactly the dynamic documented in audits of recidivism risk tools (reference 6).

### 3.4 Representation and Documentation

Representation begins with knowing the composition of a dataset. Many datasets ship with no description of their demographic makeup, making it impossible to assess whether a group is underrepresented. Documentation practices address this gap. Datasheets for Datasets, proposed by Gebru and colleagues, prompt creators to record motivation, composition, collection process, recommended uses, and known limitations. Data Statements play a similar role for language data. These artifacts do not by themselves remove bias, but they make it visible and reviewable, and they create accountability by attaching authorship to data decisions.

Improving representation can involve targeted collection to fill gaps, reweighting to correct skew, or constraining the deployment context to populations the data supports. Each approach has limits.
Targeted collection of sensitive group data can itself raise privacy concerns. Reweighting cannot manufacture signal that was never measured. The honest position is that some datasets simply should not be used for some purposes, and recognizing that boundary is part of fairness work.

## 4. Data Minimization

### 4.1 The Principle

Data minimization holds that one should collect and retain only the data necessary for a specified, legitimate purpose. The principle inverts the default impulse of large scale data practice, which is to gather everything in case it proves useful later. Minimization reduces risk along several dimensions at once. Less data collected means fewer individuals exposed. Less data retained means a smaller attack surface in the event of a breach. Tighter purpose limitation means less opportunity for function creep, where data gathered for one aim is quietly repurposed for another.

### 4.2 Minimization in Practice

Operationalizing minimization requires asking, for each field collected, whether it is genuinely required for the stated purpose. Often the answer is no, and the field persists only out of habit. Techniques support the principle. Collecting aggregates instead of individual records, sampling rather than capturing exhaustively, and computing on device so that raw data never leaves the user's hardware all reduce exposure. Retention schedules enforce deletion after the data's purpose is fulfilled, converting minimization from a one time decision into an ongoing discipline.

Minimization sits in tension with the machine learning culture of large datasets, where more data frequently improves performance. The tension is real but often overstated, and it can be framed precisely. Empirically, generalization error tends to fall with dataset size $n$ following a power law, roughly $\text{error}(n) \approx \alpha\, n^{-\beta} + \gamma$ with a small exponent $\beta$ and an irreducible floor $\gamma$.
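
As a short check on the power law fit above, and treating $n$ as continuous (an approximation, since record counts are integers), differentiating with respect to $n$ gives the marginal change in error from one more record:

$$
\frac{d\,\text{error}}{dn} \;\approx\; -\alpha\beta\, n^{-(\beta + 1)},
\qquad
\left\lvert \frac{d\,\text{error}}{dn} \right\rvert \;\approx\; \alpha\beta\, n^{-(\beta + 1)} .
$$

For illustration only, with a hypothetical exponent $\beta = 0.3$, doubling $n$ multiplies this marginal gain by $2^{-1.3} \approx 0.41$.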
The marginal accuracy gain $\lvert d\,\text{error}/dn \rvert$ therefore shrinks as $n^{-(\beta + 1)}$, so each additional record buys less than the one before. Privacy risk, by contrast, does not diminish: every record added is one more individual exposed in a breach and one more person whose erasure rights must later be honored, so the expected harm grows at least linearly in $n$. Plotting marginal benefit against marginal risk gives a crossover point past which collection is net negative. A disciplined team treats data volume as a cost to be justified rather than a good to be maximized, and weighs the modest, diminishing accuracy gains of extra collection against the linearly accumulating risk it creates.

## 5. Regulatory Context

### 5.1 The General Data Protection Regulation

The General Data Protection Regulation, which took effect across the European Union in 2018, is the most influential data protection law and a useful reference point even for organizations outside its jurisdiction. It applies to the processing of personal data of individuals in the EU regardless of where the processor is located, giving it broad extraterritorial reach. Its core principles map closely onto the ethical commitments discussed above: lawfulness, fairness, and transparency; purpose limitation; data minimization; accuracy; storage limitation; integrity and confidentiality; and accountability.

The regulation requires a lawful basis for processing, of which consent is only one. Others include contractual necessity, legal obligation, and legitimate interests, the last of which requires balancing the processor's interests against the rights of the individual. It grants individuals rights including access, rectification, erasure, portability, and the right to object to certain automated decision making.
It distinguishes the data controller, who determines the purposes and means of processing, from the data processor, who acts on the controller's behalf, and assigns obligations accordingly.

### 5.2 Obligations of Particular Relevance to Machine Learning

Several GDPR provisions bear directly on model development. The principle of purpose limitation constrains repurposing data collected for one aim to train a model for another, a common practice that often lacks a clear lawful basis. The storage limitation principle conflicts with the impulse to retain training data indefinitely. The right to erasure raises difficult questions when an individual's data has already influenced a trained model, since deleting a record from a database does not remove its influence from learned parameters. Provisions on automated decision making give individuals a right not to be subject to solely automated decisions with significant effects, with safeguards including meaningful information about the logic involved.

A Data Protection Impact Assessment is required for processing likely to result in high risk to individuals, which frequently includes large scale profiling and novel uses of sensitive data. The assessment forces a structured analysis of necessity, proportionality, and mitigation before processing begins, making it a natural integration point for the ethical review this chapter advocates.

### 5.3 The Wider Regulatory Landscape

GDPR is not alone. The California Consumer Privacy Act, as amended by the California Privacy Rights Act, establishes rights for California residents. Sector specific United States laws such as the Health Insurance Portability and Accountability Act govern health data, and the Children's Online Privacy Protection Act governs data collected online from children under 13. The EU Artificial Intelligence Act, adopted in 2024, layers obligations specific to AI systems on top of data protection law, with requirements for data governance and quality in high risk applications.
Many jurisdictions are enacting their own regimes, producing a patchwork that any global system must navigate. The practical lesson is that designing to a high principled standard, rather than the minimum of any single jurisdiction, is both more robust and easier to maintain than chasing each law separately.

## 6. Governance Roles and Policies

### 6.1 Roles and Accountability

Governance fails when responsibility is diffuse. Effective programs assign clear roles. A Data Protection Officer, required under GDPR in certain cases, provides independent oversight of compliance. Data owners take accountability for specific datasets, including their quality and appropriate use. Data stewards handle day to day curation and metadata. A privacy or ethics review board evaluates higher risk proposals. Engineers and data scientists remain responsible for the systems they build rather than delegating ethical judgment entirely to a separate function. The principle threaded through these roles is accountability: for every dataset and model there should be an identifiable person or body answerable for its conduct.

### 6.2 Policies and Processes

Roles need processes to act through. Core artifacts include a data classification scheme that sorts data by sensitivity and attaches handling requirements to each tier; access controls that enforce least privilege so people reach only the data their work requires; retention and deletion schedules; and audit logging that records who accessed what and when. A data catalog with lineage tracking allows an organization to answer basic questions, such as where a given field originated and which models consumed it, that are otherwise surprisingly hard to answer at scale.

Review processes embed ethics into the development workflow. A data use proposal describes intended purpose, lawful basis, affected populations, and risks, and is reviewed before collection proceeds.
An impact assessment, building on the GDPR model, documents necessity and mitigation for higher risk work. These processes should be proportionate, with lightweight paths for routine low risk work and deeper scrutiny reserved for novel or sensitive uses, so that governance accelerates rather than obstructs responsible projects.

### 6.3 Culture and Incentives

Documents and boards are necessary but not sufficient. Governance lives or dies on culture and incentives. If teams are rewarded solely for shipping speed and model accuracy, ethical considerations will be treated as obstacles. If review boards lack authority to halt a project, their findings become advisory noise. Sustainable governance requires that ethical review carry real weight, that practitioners have channels to raise concerns without penalty, and that leadership treats data stewardship as a measure of quality rather than a tax on productivity. The most rigorous policy is inert without the organizational will to enforce it.

## 7. Dataset Harms and Mitigation

### 7.1 A Taxonomy of Harms

Datasets can cause several distinct kinds of harm. Allocative harms occur when a system distributes resources or opportunities unfairly, as when a hiring model systematically screens out qualified candidates from a group. Representational harms occur when systems reinforce demeaning stereotypes or render groups invisible, as when an image tagging system mislabels people or a search system returns degrading results for certain identities. Privacy harms occur through exposure, re-identification, or inference of sensitive attributes a person never disclosed. Quality of service harms occur when a system simply works less well for some groups, such as speech recognition that performs poorly for certain accents.

Two further harms deserve emphasis.
Feedback loops occur when a model's outputs shape the future data it is trained on, amplifying initial bias over time, a dynamic well documented in predictive policing where patrols concentrate where past arrests occurred, generating more arrests there. Aggregation and surveillance harms occur when the mere existence of a large dataset enables monitoring and control beyond any individual prediction, a structural harm that no single accurate output offsets.

### 7.2 Mitigation Strategies

Mitigation begins upstream, because harms are far cheaper to prevent at collection than to patch after deployment. Documentation through datasheets and data statements surfaces limitations early. Representative collection and careful proxy selection reduce bias at its source. Privacy preserving techniques such as differential privacy and on device computation reduce exposure. Fairness evaluation, disaggregated by group so that aggregate metrics do not mask disparities for minorities, detects quality of service gaps before they reach users.

Downstream mitigations matter too. Continuous monitoring detects drift and emerging disparities after deployment. Human review and contestability mechanisms give affected individuals a route to challenge decisions, which both corrects errors and respects autonomy. Red teaming probes for failure modes that ordinary testing misses. Deletion and retraining processes, though imperfect, attempt to honor erasure requests. No single technique suffices, and the credible posture is defense in depth: layered safeguards across the lifecycle, paired with the humility to restrict or abandon a use when the data cannot support it safely.

### 7.3 The Limits of Technical Fixes

A recurring temptation is to treat dataset harms as engineering problems with engineering solutions. Some are, but many are not. A debiasing algorithm cannot resolve a disagreement about which notion of fairness is correct, because that is a normative question.
A privacy technique cannot make an inappropriate use appropriate. When a dataset reflects unjust social conditions, technical correction can obscure rather than address the underlying injustice, and may lend a false veneer of objectivity to a contested decision. The mature practitioner pairs technical skill with the judgment to recognize when a problem exceeds technical means and requires a different decision: narrowing the scope, involving affected communities, or declining the use entirely.

## 8. Conclusion

Data ethics and governance are not peripheral compliance chores but central to building systems that deserve trust. The threads of this chapter reinforce one another. Consent and privacy protect individual autonomy. Fairness and representation protect groups from inheriting historical injustice. Minimization reduces the surface of risk. Regulation such as GDPR codifies a baseline that principled practice should exceed. Roles and policies translate intention into reliable action. And a clear understanding of dataset harms, with layered mitigation, closes the loop from principle to outcome.

The unifying lesson is that ethical data practice is continuous and accountable. It spans the full lifecycle, it assigns responsibility to identifiable people, and it retains the humility to recognize the limits of technical remedy. Practitioners who internalize this stance build systems that are not only more lawful but more worthy of the trust that users place in them.

## References

1. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM. https://dl.acm.org/doi/10.1145/3458723
2. Bender, E. M., and Friedman, B. (2018). Data Statements for Natural Language Processing. Transactions of the Association for Computational Linguistics. https://aclanthology.org/Q18-1041/
3. Nissenbaum, H. (2010). Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press. https://www.sup.org/books/title/?id=8862
4. Dwork, C., and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
5. Suresh, H., and Guttag, J. (2021). A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. https://dl.acm.org/doi/10.1145/3465416.3483305
6. Chouldechova, A. (2017). Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data. https://www.liebertpub.com/doi/10.1089/big.2016.0047
7. Barocas, S., Hardt, M., and Narayanan, A. (2023). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. https://fairmlbook.org/
8. European Parliament and Council (2016). Regulation (EU) 2016/679 (General Data Protection Regulation). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2016/679/oj
9. European Parliament and Council (2024). Regulation (EU) 2024/1689 (Artificial Intelligence Act). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
10. Crawford, K. (2017). The Trouble with Bias. Keynote, Conference on Neural Information Processing Systems. https://www.youtube.com/watch?v=fMym_BKWQzk
11. Buolamwini, J., and Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/buolamwini18a.html
12. Ensign, D., Friedler, S. A., Neville, S., Scheidegger, C., and Venkatasubramanian, S. (2018). Runaway Feedback Loops in Predictive Policing. Proceedings of Machine Learning Research. https://proceedings.mlr.press/v81/ensign18a.html
13. Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data. https://dl.acm.org/doi/10.1145/1217299.1217302
14. Kleinberg, J., Mullainathan, S., and Raghavan, M. (2017). Inherent Trade-Offs in the Fair Determination of Risk Scores. Innovations in Theoretical Computer Science (ITCS). https://doi.org/10.4230/LIPIcs.ITCS.2017.43
15. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. (2012). Fairness Through Awareness. Innovations in Theoretical Computer Science (ITCS). https://dl.acm.org/doi/10.1145/2090236.2090255