53 Data Collection Strategies

Every machine learning system rests on a dataset, and every dataset is the product of a collection process that imposed structure, made omissions, and introduced bias long before any model touched the data. Practitioners often treat data as a fixed input and reserve their energy for architecture and tuning, yet the collection design usually determines the ceiling on what any model can achieve. A model can only learn the distribution it is shown. If that distribution differs from the one encountered at deployment, no amount of regularization or scale will close the gap. This chapter develops a disciplined approach to designing data collection so that the resulting dataset supports valid inference and reliable deployment.

53.1 1. Framing Collection as a Design Problem

53.1.1 1.1 The target population and the deployment distribution

The first task is to define the population about which we wish to draw conclusions or on which the model will act. In survey statistics this is the target population. In machine learning the analogous object is the deployment distribution, the distribution of inputs the model will actually receive in production. Denote the deployment distribution over inputs and labels by $p_{\text{dep}}(x, y)$ and the distribution from which we collect training data by $p_{\text{col}}(x, y)$. The central goal of collection design is to make these two distributions agree, or to understand precisely how they differ so the difference can be corrected.

When $p_{\text{col}} = p_{\text{dep}}$, a model trained to minimize expected loss on collected data also minimizes expected loss at deployment. When they differ, we have dataset shift, and the risk we minimize on collected data is, in general, no longer the risk we care about:

\[ R_{\text{dep}}(f) = \mathbb{E}_{(x,y) \sim p_{\text{dep}}}[\ell(f(x), y)] \neq \mathbb{E}_{(x,y) \sim p_{\text{col}}}[\ell(f(x), y)] = R_{\text{col}}(f). \]

Much of what follows is about controlling this inequality at the point of collection, which is far cheaper and more reliable than correcting it after the fact.

It is worth stating precisely what dataset shift can mean, because the type of shift dictates which remedies are available. Factorize the joint distribution two ways, $p(x, y) = p(y \mid x)\,p(x) = p(x \mid y)\,p(y)$. The standard taxonomy follows Quiñonero-Candela et al. (reference 5).

Covariate shift. The input marginal changes, $p_{\text{col}}(x) \neq p_{\text{dep}}(x)$, while the labeling mechanism is stable, $p_{\text{col}}(y \mid x) = p_{\text{dep}}(y \mid x)$. This is the benign case: it is correctable by reweighting (Section 3.2) provided the supports overlap.
Prior probability shift. The label marginal changes, $p_{\text{col}}(y) \neq p_{\text{dep}}(y)$, while $p(x \mid y)$ is stable. Common in classification when class balance differs between collection and deployment.
Concept shift. The conditional $p(y \mid x)$ itself changes. This is the dangerous case, because the relationship the model learned is no longer the one it faces, and no reweighting of inputs can repair it. Fresh labels are required.

A simple support condition clarifies what reweighting can and cannot do. Reweighting corrects the marginal mismatch only on the region where the collection distribution places mass. If $p_{\text{dep}}$ assigns positive probability to a region where $p_{\text{col}}(x) = 0$, the density ratio $p_{\text{dep}}(x) / p_{\text{col}}(x)$ is undefined there and the deployment risk on that region is simply unobservable from the collected data. This support coverage requirement, namely $\operatorname{supp}(p_{\text{dep}}) \subseteq \operatorname{supp}(p_{\text{col}})$, is the formal counterpart of the coverage discussion in Section 1.2, and it is a property of the collection design, not of any later correction.
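When the input space is summarized by discrete strata, the support coverage requirement can be checked mechanically before any reweighting is attempted. The following is a minimal sketch under that assumption; the stratum names and the `uncovered_strata` helper are hypothetical illustrations, not part of any library.

```python
from collections import Counter


def uncovered_strata(collected_labels, deployment_mix):
    """Return strata with positive deployment mass but zero collected examples."""
    counts = Counter(collected_labels)  # missing keys count as 0
    return sorted(
        stratum
        for stratum, mass in deployment_mix.items()
        if mass > 0 and counts[stratum] == 0
    )


# Hypothetical device strata: deployment expects three, collection reached two.
collected = ["desktop"] * 900 + ["mobile"] * 100
deployment_mix = {"desktop": 0.5, "mobile": 0.3, "tablet": 0.2}

missing = uncovered_strata(collected, deployment_mix)
print(missing)  # strata on which the density ratio is undefined
```

Any stratum this check returns cannot be repaired by weights; it has to be collected.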

53.1.2 1.2 Sampling frame and coverage

Between the abstract target population and the concrete sample sits the sampling frame, the operational list or mechanism from which units are actually drawn. A frame for a customer churn model might be the set of accounts in the production database. A frame for a speech recognition system might be whatever audio a particular set of microphones captured. Coverage error arises when the frame omits part of the target population or includes units that do not belong to it.

Coverage failures are insidious because they are invisible in the collected data. If your frame for a medical imaging model contains only scans from a single hospital’s scanner model, the dataset will look complete and internally consistent, yet it silently excludes the imaging characteristics of every other scanner. The model will appear to perform well in evaluation, because evaluation data inherits the same coverage gap, and then degrade in deployment. The discipline here is to enumerate, before collection, the subpopulations that exist in the deployment distribution and to verify that the frame can reach each of them.

The relationship among these populations is a nested chain, and naming each gap is the first step to closing it.

flowchart TD
    A["Target population (who we want conclusions about)"] -->|"coverage error"| B["Sampling frame (who the mechanism can reach)"]
    B -->|"sampling error"| C["Selected sample (who we drew)"]
    C -->|"nonresponse and selection bias"| D["Observed sample (who we actually recorded)"]
    D -->|"labeling and missingness"| E["Analyzed sample (who has usable labels)"]

Figure 53.1: From the target population to the analyzed sample, each arrow can introduce a distinct error.

The chain makes the accounting concrete. The analyzed sample, the data a model actually trains on, can differ from the target population through four compounding mechanisms, each with its own remedy: coverage error is fixed by widening the frame, sampling error by larger or stratified samples, selection bias by randomization or weighting, and label gaps by deliberate annotation of the missing region. A dataset that looks adequate at the bottom of the chain may be unrepresentative for reasons introduced at any link above it.

53.2 2. Data Sources and Their Biases

53.2.1 2.1 A taxonomy of sources

Data sources fall into several broad categories, each with a characteristic bias profile.

Instrumented production logs. Cheap, large, and continuously refreshed, but they reflect only the behavior of existing users interacting with an existing system. They carry strong feedback loop bias, since the current model shapes the very data used to train its successor.
Operational records and administrative data. Created for business or legal purposes rather than for modeling. They are often complete within their scope but encode the categories and decisions of the originating process, not necessarily the ones you care about.
Sensors and devices. Subject to calibration drift, hardware heterogeneity, and placement effects. A frame defined by a particular device population rarely matches the device population at deployment.
Surveys and elicited data. Allow controlled probability sampling but suffer nonresponse and self-report bias.
Web scrapes and public corpora. Vast and convenient, but their composition reflects who publishes online, which overrepresents some languages, regions, and viewpoints and underrepresents others.
Purchased or partner data. Opaque provenance. The collection process is someone else’s, so its biases are inherited without being documented.

53.2.2 2.2 The convenience trap

Almost all of these sources are tempting precisely because they are convenient. The danger is that convenience and representativeness are usually in tension. The data that is easiest to obtain is rarely the data that matches the deployment distribution. The professional habit worth cultivating is to ask, for any candidate source, the single question: which units of the target population can never appear in this source, and why. That question surfaces coverage and selection problems before they are baked into a model.

53.3 3. Sampling Design

53.3.1 3.1 Probability sampling

A probability sample is one in which every unit in the frame has a known, nonzero probability of selection $\pi_i > 0$. This property is what licenses statistical inference, because it lets us reweight the sample to estimate population quantities without bias. The basic schemes form a hierarchy of increasing structure.

Simple random sampling gives every unit equal probability $\pi_i = n/N$. It is unbiased and simple but can leave small but important subgroups underrepresented by chance.

Stratified sampling partitions the frame into strata and samples within each. It guarantees coverage of every stratum and reduces variance when strata are internally homogeneous. The estimator for a population mean is

\[ \hat{\mu}_{\text{str}} = \sum_{h=1}^{H} \frac{N_h}{N} \, \bar{y}_h, \]

where $N_h$ is the size of stratum $h$ and $\bar{y}_h$ its sample mean. Writing $W_h = N_h/N$ for the stratum weight and $\sigma_h^2$ for the within-stratum variance, and ignoring the finite population correction (a good approximation when each sampling fraction $n_h/N_h$ is small), the variance of the stratified mean is

\[ \operatorname{Var}(\hat{\mu}_{\text{str}}) = \sum_{h=1}^{H} W_h^2 \, \frac{\sigma_h^2}{n_h}, \]

with $n_h$ the sample size in stratum $h$. Compare this to the variance of a simple random sample of the same total size $n$, which by the analysis-of-variance decomposition carries both within-stratum and between-stratum components. Under proportional allocation, stratification removes the between-stratum term from the sampling variance entirely. The benefit is therefore largest precisely when strata differ from one another, that is, when the between-stratum variance is large. This is the mathematical reason stratification is the most useful single technique for machine learning collection: it lets the designer force representation of rare but important regions of the input space and, at the same time, lowers estimator variance whenever those regions behave differently from the bulk.
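The variance comparison can be made concrete with a small calculation. The sketch below assumes a hypothetical two-stratum population (a 90% bulk stratum and a 10% rare stratum whose mean differs), proportional allocation, and no finite population correction; all numbers and function names are illustrative.

```python
def stratified_variance(weights, variances, allocation):
    """Variance of the stratified mean: sum of W_h^2 * sigma_h^2 / n_h."""
    return sum(w * w * v / n for w, v, n in zip(weights, variances, allocation))


def srs_variance(weights, means, variances, n):
    """Variance of a simple-random-sample mean via the ANOVA decomposition."""
    grand_mean = sum(w * m for w, m in zip(weights, means))
    within = sum(w * v for w, v in zip(weights, variances))
    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means))
    return (within + between) / n


W = [0.9, 0.1]          # stratum weights W_h
mu = [0.0, 10.0]        # stratum means (hypothetical)
sigma2 = [1.0, 1.0]     # within-stratum variances
n = 100
proportional = [90, 10]  # n_h = n * W_h

var_str = stratified_variance(W, sigma2, proportional)
var_srs = srs_variance(W, mu, sigma2, n)
print(var_str, var_srs)
```

With these assumed numbers the between-stratum component is nine times the within-stratum component, so the stratified design has one tenth of the simple-random-sample variance.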

The allocation $n_h$ is itself a design choice. Proportional allocation sets $n_h \propto N_h$ and reproduces the population mix. Neyman allocation sets $n_h \propto N_h \sigma_h$, putting more samples where a stratum is both large and internally variable, which minimizes $\operatorname{Var}(\hat{\mu}_{\text{str}})$ for a fixed total budget (Cochran, reference 1). For modeling, a third allocation often dominates: deliberately oversample rare, decision-critical strata beyond their population share, then correct the induced distortion with weights at training and evaluation time (Section 6.2).
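The two classical allocations can be compared directly. This is a minimal sketch with hypothetical stratum sizes and standard deviations; it returns real-valued allocations and leaves rounding to integers, minimum per-stratum counts, and the finite population correction aside.

```python
def proportional_allocation(sizes, n):
    """n_h proportional to N_h."""
    total = sum(sizes)
    return [n * size / total for size in sizes]


def neyman_allocation(sizes, sds, n):
    """n_h proportional to N_h * sigma_h."""
    products = [size * sd for size, sd in zip(sizes, sds)]
    total = sum(products)
    return [n * p / total for p in products]


def stratified_variance(sizes, sds, allocation):
    """Variance of the stratified mean for a given allocation."""
    N = sum(sizes)
    return sum(
        (size / N) ** 2 * sd ** 2 / n_h
        for size, sd, n_h in zip(sizes, sds, allocation)
    )


sizes = [9000, 1000]  # N_h: a large bulk stratum and a small rare one
sds = [1.0, 9.0]      # sigma_h: the rare stratum is far more variable
n = 100

prop = proportional_allocation(sizes, n)  # 90 and 10
neym = neyman_allocation(sizes, sds, n)   # 50 and 50

var_prop = stratified_variance(sizes, sds, prop)
var_neym = stratified_variance(sizes, sds, neym)
print(prop, var_prop)
print(neym, var_neym)
```

Here the rare stratum is small but noisy, so Neyman allocation moves half the budget into it and lowers the variance relative to the proportional design.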

Cluster sampling selects groups of units, such as all transactions from a sampled set of stores. It is cheaper when units are naturally grouped but increases variance through intracluster correlation.

Systematic sampling takes every $k$th unit from an ordered frame. It is convenient but dangerous when the ordering has periodicity that aligns with $k$.
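The periodicity hazard is easy to reproduce. The sketch below assumes a hypothetical daily metric with a spike every seventh day and compares sampling intervals that do and do not align with that period.

```python
from statistics import fmean


def systematic_sample(frame, k, start=0):
    """Take every k-th unit of an ordered frame, beginning at index `start`."""
    return frame[start::k]


# Hypothetical ordered frame: 700 days, with a spike on every 7th day.
frame = [100 if day % 7 == 0 else 10 for day in range(700)]

true_mean = fmean(frame)
aligned = fmean(systematic_sample(frame, k=7, start=0))     # hits the spike every time
misaligned = fmean(systematic_sample(frame, k=7, start=3))  # never hits the spike
coprime = fmean(systematic_sample(frame, k=5, start=0))     # cycles through all weekdays

print(true_mean, aligned, misaligned, coprime)
```

With $k = 7$ the estimate depends entirely on the starting offset, while an interval that shares no factor with the period visits every day of the week equally often.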

53.3.2 3.2 Inverse probability weighting

When selection probabilities are unequal but known, unbiased estimation of a population total is recovered by weighting each unit by the inverse of its inclusion probability. This is the Horvitz-Thompson estimator (Horvitz and Thompson, reference 3):

\[ \hat{T} = \sum_{i \in S} \frac{y_i}{\pi_i}. \]
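The estimator above can be checked by simulation. The sketch below assumes a hypothetical frame of 200 units in which larger units are five times more likely to be selected, uses independent (Poisson) selection, and averages over repeated draws; the population and the seed are illustrative choices.

```python
import random


def horvitz_thompson_total(sample):
    """sample: iterable of (y_i, pi_i) pairs for the selected units."""
    return sum(y / pi for y, pi in sample)


rng = random.Random(0)

# Hypothetical frame: y_i = i, with inclusion probability 0.1 or 0.5.
population = [(float(i), 0.1 if i <= 100 else 0.5) for i in range(1, 201)]
true_total = sum(y for y, _ in population)

ht_estimates, naive_estimates = [], []
for _ in range(1000):
    # Poisson sampling: each unit is included independently with probability pi_i.
    sample = [(y, pi) for y, pi in population if rng.random() < pi]
    ht_estimates.append(horvitz_thompson_total(sample))
    # Naive alternative: unweighted sample mean scaled up to the frame size.
    naive_mean = sum(y for y, _ in sample) / len(sample)
    naive_estimates.append(naive_mean * len(population))

ht_avg = sum(ht_estimates) / len(ht_estimates)
naive_avg = sum(naive_estimates) / len(naive_estimates)
print(true_total, ht_avg, naive_avg)
```

The weighted estimate centers on the true total, while the unweighted one overshoots because the frequently selected units are also the large ones.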

In machine learning this same logic justifies importance weighting of training examples when the collection distribution is known to differ from the deployment distribution. If we can estimate the density ratio $w(x) = p_{\text{dep}}(x) / p_{\text{col}}(x)$, then the reweighted empirical risk

\[ \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} w(x_i) \, \ell(f(x_i), y_i) \]

is a consistent estimator of the deployment risk under covariate shift, the assumption that $p(y \mid x)$ is stable while $p(x)$ changes (Sugiyama and Kawanabe, reference 6). The catch is that high-variance weights inflate the variance of the estimate, so reweighting is a corrective of last resort rather than a substitute for sound collection.
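A toy case in which everything is known in closed form shows the reweighted risk at work. The sketch assumes $p_{\text{col}} = \mathcal{N}(0, 1)$, $p_{\text{dep}} = \mathcal{N}(1, 1)$, a stable relationship $y = x$, and the constant predictor $f(x) = 0$ with squared loss, so that $R_{\text{col}} = 1$ and $R_{\text{dep}} = 2$. In practice the density ratio must be estimated, which this example deliberately sidesteps.

```python
import math
import random


def density_ratio(x):
    """w(x) = N(x; 1, 1) / N(x; 0, 1) = exp(x - 1/2) for this toy setup."""
    return math.exp(x - 0.5)


def loss(x):
    """Squared loss of the constant predictor f(x) = 0 when y = x."""
    return x * x


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # drawn from p_col

unweighted = sum(loss(x) for x in xs) / len(xs)                     # estimates R_col
weighted = sum(density_ratio(x) * loss(x) for x in xs) / len(xs)    # estimates R_dep
print(unweighted, weighted)
```

The unweighted average should land near 1 and the weighted average near 2, with visibly more sampling noise in the weighted estimate.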

How badly the weights degrade an estimate is captured by the effective sample size,

\[ n_{\text{eff}} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}, \]

which ranges from $1$, when a single example carries all the weight, to $n$, when the weights are uniform. A reweighted dataset of $n$ raw examples behaves statistically like an unweighted one of size $n_{\text{eff}}$. When $n_{\text{eff}}$ collapses to a small fraction of $n$, it is a quantitative warning that the collection and deployment distributions overlap too little for reweighting to rescue, and that the right response is to collect from the underrepresented region rather than to weight ever harder. Truncating or clipping extreme weights trades a controlled amount of bias for a large reduction in variance and is standard practice, but it does not change the underlying diagnosis that the support overlap is poor.
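The effective sample size and the effect of clipping take only a few lines to compute. The weights and the clipping threshold below are hypothetical.

```python
def effective_sample_size(weights):
    """(sum w)^2 / sum(w^2): between 1 and len(weights) for nonnegative weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def clip_weights(weights, max_weight):
    """Truncate extreme weights, trading some bias for lower variance."""
    return [min(w, max_weight) for w in weights]


uniform = [1.0] * 10
dominated = [1.0, 1.0, 1.0, 1.0, 20.0]  # one example carries most of the weight

n_eff_uniform = effective_sample_size(uniform)
n_eff_raw = effective_sample_size(dominated)
n_eff_clipped = effective_sample_size(clip_weights(dominated, max_weight=5.0))
print(n_eff_uniform, n_eff_raw, n_eff_clipped)
```

Clipping raises the effective sample size here, but five examples still behave like fewer than three, which is the signal to collect more data rather than to tune the threshold.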

53.3.3 3.3 Convenience sampling and why it fails

A convenience sample draws whatever units are readily available, with no known selection probabilities. Volunteer panels, the first thousand users to opt in, and an unfiltered web scrape are all convenience samples. The fatal property is that $\pi_i$ is unknown and, worse, often correlated with the very quantities being modeled. Because the selection probabilities cannot be recovered, no reweighting can remove the bias in general. The sample may still be useful, but inference from it is conditional on an untestable assumption that the selected units resemble the unselected ones on the outcome of interest. Many failures of deployed models trace back, on inspection, to a convenience sample that was treated as if it were representative.

53.4 4. Selection Bias and Its Mechanisms

53.4.1 4.1 A formal view

Selection bias occurs when the probability that a unit enters the dataset depends on variables that also relate to the outcome. Introduce a selection indicator $S \in \{0, 1\}$, equal to one when a unit is observed. We collect data from the conditional distribution $p(x, y \mid S = 1)$, but we want to reason about the marginal $p(x, y)$. These coincide only when selection is independent of the relevant variables.

The cleanest case is selection on observables, where $S \perp y \mid x$. Here selection depends only on covariates we record, and inverse probability weighting can correct it. The dangerous case is selection on unobservables, where selection depends on the outcome itself even after conditioning on recorded covariates. No reweighting from the observed data can fix this, because the information needed to estimate the weights is precisely what selection removed.
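The contrast between the two cases can be simulated. The sketch below assumes a binary covariate $x$, a binary outcome $y$ with population mean $0.4$, and an analyst who records only $x$ and therefore weights by the selection rate within each $x$ group. All probabilities are hypothetical.

```python
import random


def hajek_mean(records):
    """Self-normalized inverse-probability-weighted mean of (y, weight) pairs."""
    return sum(y * w for y, w in records) / sum(w for _, w in records)


def simulate(select_prob, n=100_000, seed=0):
    """Return (naive mean, IPW mean) of y among the selected units."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        x = rng.random() < 0.5
        y = rng.random() < (0.6 if x else 0.2)
        s = rng.random() < select_prob(x, y)
        units.append((x, y, s))
    # Estimate P(S = 1 | x) from the frame: all an analyst who sees only x can do.
    rate = {}
    for value in (False, True):
        group = [s for x, _, s in units if x == value]
        rate[value] = sum(group) / len(group)
    selected = [(y, 1.0 / rate[x]) for x, y, s in units if s]
    naive = sum(y for y, _ in selected) / len(selected)
    return naive, hajek_mean(selected)


# Population mean of y is 0.5 * 0.2 + 0.5 * 0.6 = 0.4 in both scenarios.
naive_obs, ipw_obs = simulate(lambda x, y: 0.9 if x else 0.3)      # depends on x only
naive_unobs, ipw_unobs = simulate(lambda x, y: 0.9 if y else 0.3)  # depends on y itself
print(naive_obs, ipw_obs)
print(naive_unobs, ipw_unobs)
```

Weighting recovers the population mean when selection runs through $x$, and remains biased when selection runs through $y$, even though the same weighting recipe is applied in both runs.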

53.4.2 4.2 Common patterns

Several recurring mechanisms deserve names because recognizing them early prevents expensive mistakes.

Survivorship bias. Only units that persisted are observed. Modeling equipment failures from machines still in service omits those that already failed and were removed.
Self-selection. Units choose whether to appear, and the choice relates to the outcome. Customers who respond to a satisfaction survey are not a random slice of customers.
Label availability bias. Labels exist only for a subset, and that subset is not random. Loan repayment outcomes are observed only for applicants who were approved, so the training data systematically excludes the rejected region of input space.
Feedback loops. The deployed model determines which inputs are seen, which determines the next training set. A recommender trained only on items it already surfaces never learns about items it never shows.

53.4.3 4.3 Designing against selection bias

The structural remedy is to introduce randomization into the collection process wherever feasible. A small exploration budget, in which a fraction of decisions are made randomly rather than by the current policy, breaks feedback loops and creates a probability sample of the region the model would otherwise never observe. In the lending example, occasionally approving a randomized subset of marginal applicants, within ethical and regulatory limits, generates the counterfactual labels needed to model the rejected region. The cost of this exploration is real, but it is the price of an unbiased view of the input space.

Collection policy with exploration:
  with probability 1 - epsilon:
      act according to current model
  with probability epsilon:
      act randomly and log the outcome
  # the epsilon-fraction is a probability sample
  # of the space the model would otherwise censor
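A runnable version of the policy above might look like the following sketch. It assumes a binary approve or reject decision and interprets "act randomly" as a fair coin flip; the `decide` function and the logged propensity field are hypothetical names. Logging the propensity of the action taken is what later allows the explored outcomes to be inverse probability weighted.

```python
import random


def decide(policy_approves, epsilon, rng):
    """Approve or reject with epsilon-exploration; return (approved, propensity).

    The propensity is the probability of the action actually taken.
    """
    # With probability 1 - epsilon follow the policy; with probability epsilon
    # flip a fair coin (an assumption about what "act randomly" means here).
    p_approve = (1 - epsilon) * float(policy_approves) + epsilon * 0.5
    approved = rng.random() < p_approve
    return approved, (p_approve if approved else 1.0 - p_approve)


rng = random.Random(0)
epsilon = 0.2

# Applicants the current policy would reject: without exploration, none are observed.
log = [decide(policy_approves=False, epsilon=epsilon, rng=rng) for _ in range(20_000)]
explored = [propensity for approved, propensity in log if approved]
explored_fraction = len(explored) / len(log)
print(explored_fraction)
```

Every policy-rejected applicant now has approval probability $\epsilon / 2 > 0$, so the approved subset is a probability sample of the region the policy would otherwise censor.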

53.4.4 4.4 Worked example: the censored lending region

A concrete instance makes the cost of selection on unobservables tangible. Suppose a lender approves applicants whose model-estimated default probability is below a threshold and observes repayment only for the approved. Let the approved fraction be $40\%$ and suppose, hypothetically, that the true default rate is $5\%$ among approved applicants and $30\%$ among rejected applicants. A naive model retrained on observed outcomes sees only the approved region, so it estimates a portfolio default rate of $5\%$. The true rate across all applicants is

\[ 0.40 \times 0.05 + 0.60 \times 0.30 = 0.20, \]

a fourfold understatement. No reweighting of the observed approvals recovers the $30\%$ figure, because the rejected region contributes zero observations: the weight $1/\pi_i$ is infinite where $\pi_i = 0$, which is exactly the support-coverage failure of Section 1.1. The only fix that adds information is to generate observations in the censored region. Approving a small randomized fraction $\epsilon$ of otherwise-rejected applicants yields, after the loan matures, an unbiased estimate of the rejected-region default rate at the cost of the expected losses on that $\epsilon$-fraction. The numbers above are illustrative, but the structure is general: when selection depends on the outcome, only randomization buys back the missing information, and its price is the loss incurred on the explored decisions.
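The arithmetic of the example, together with a rough price for exploration, can be laid out explicitly. The rates are the hypothetical ones from the text; the applicant count of 10,000 and the exploration rate $\epsilon = 0.05$ are additional assumptions introduced only for this sketch.

```python
import math

approved_frac = 0.40
default_approved = 0.05
default_rejected = 0.30  # hypothetical, and unobservable without exploration

naive_rate = default_approved
true_rate = approved_frac * default_approved + (1 - approved_frac) * default_rejected
understatement = true_rate / naive_rate

# Price of exploration under the assumed volume and exploration rate.
n_applicants = 10_000
epsilon = 0.05
n_rejected = round(n_applicants * (1 - approved_frac))
n_explored = round(epsilon * n_rejected)
expected_defaults = n_explored * default_rejected
std_error = math.sqrt(default_rejected * (1 - default_rejected) / n_explored)

print(true_rate, understatement)
print(n_explored, expected_defaults, std_error)
```

Under these assumptions, 300 explored loans with about 90 expected defaults buy an estimate of the rejected-region default rate with a standard error of roughly 2.6 percentage points.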

53.5 5. Labeling Strategies

53.5.1 5.1 Sources of labels

Labels can come from human annotation, from organic outcomes recorded by the system, from weak or programmatic heuristics, or from existing models. Each trades cost against quality. Organic outcome labels, such as whether a user clicked or a part failed, are cheap and aligned with the real objective but are subject to the availability bias of Section 4. Human labels are expensive but can be elicited for any region of input space, including the rare and the counterfactual.

53.5.2 5.2 Annotation quality and agreement

Human labels are noisy, and the noise is rarely uniform. The standard tool for quantifying labeler consistency is inter annotator agreement corrected for chance. Cohen’s kappa for two annotators is

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where $p_o$ is observed agreement and $p_e$ is agreement expected by chance (Cohen, reference 10). The coefficient equals $1$ at perfect agreement, $0$ when annotators agree only as often as independent random labeling would predict, and can go negative under systematic disagreement. The chance correction matters because raw agreement $p_o$ is inflated whenever one class dominates: two annotators who each label nearly every item as the majority class agree on most items yet convey almost no information, and $\kappa$ correctly reports a value near zero. For more than two annotators the analogous chance-corrected statistic is Fleiss’ kappa, and for ordinal labels a weighted kappa penalizes far-apart disagreements more than adjacent ones. Low agreement signals an ambiguous labeling task, an inadequate guideline, or insufficient annotator training, and it caps the accuracy any model can reach, since the labels themselves are the ceiling. Investment in clear annotation guidelines, adjudication of disagreements, and gold-standard test items usually yields more model improvement per dollar than additional raw volume.
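The effect of class imbalance on raw agreement can be seen in a small computation. The sketch below implements the two-annotator formula directly; the label lists are hypothetical, and returning NaN when $p_e = 1$ is a choice made here for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label rates.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical skewed task: each annotator flags 5 of 100 items, never the same ones.
a = ["pos"] * 5 + ["neg"] * 95
b = ["neg"] * 5 + ["pos"] * 5 + ["neg"] * 90

raw_agreement = sum(x == y for x, y in zip(a, b)) / len(a)
kappa = cohen_kappa(a, b)
print(raw_agreement, kappa)
```

Raw agreement is 90% here, yet the annotators never agree on a positive item, and kappa comes out slightly below zero.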

53.5.3 5.3 Allocating the labeling budget

Labeling budget is finite, so its allocation is a sampling design in its own right. Three principles guide it. First, allocate disproportionate effort to rare classes and to decision boundaries, since uniform labeling wastes effort on regions where the model is already confident. Active learning formalizes this by selecting for annotation the examples whose labels are expected to be most informative, often those near the current decision boundary. Second, preserve a held-out, randomly labeled evaluation set drawn to match the deployment distribution, because an evaluation set distorted by active learning no longer estimates deployment performance. Third, label redundantly where ambiguity is high and singly where it is low, spending agreement measurement only where it matters.
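The first two principles can be combined in one small planning step: reserve a uniformly random evaluation set first, then spend the active learning budget on the most uncertain of the remaining items. This is a minimal sketch assuming a binary classifier whose scores are probabilities and an unlabeled pool that itself reflects the deployment distribution; `plan_labeling` and its parameters are hypothetical.

```python
import random


def plan_labeling(scores, eval_size, active_size, seed=0):
    """Split item indices into a random evaluation set and an uncertainty-sampled batch."""
    rng = random.Random(seed)
    indices = list(range(len(scores)))
    # Evaluation items are drawn first and uniformly, so they stay representative.
    eval_ids = set(rng.sample(indices, eval_size))
    remaining = [i for i in indices if i not in eval_ids]
    # Uncertainty sampling: scores closest to 0.5 come first.
    remaining.sort(key=lambda i: abs(scores[i] - 0.5))
    return sorted(eval_ids), remaining[:active_size]


scores = [i / 99 for i in range(100)]  # hypothetical model outputs in [0, 1]
eval_ids, active_ids = plan_labeling(scores, eval_size=20, active_size=10)
print(eval_ids)
print(active_ids)
```

Keeping the two sets disjoint and drawing the evaluation set before any model-driven selection is what protects the performance estimate from the selection effect of active learning.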

53.5.4 5.4 Weak supervision

When expert labels are scarce, weak supervision combines many noisy labeling sources, such as heuristics, knowledge bases, and patterns, into probabilistic labels by modeling each source’s accuracy and correlations. This trades label noise for label volume and can be effective, but it introduces a new dependency. The biases of the labeling heuristics propagate into the labels, so the same scrutiny applied to data sources must be applied to label sources.

53.6 6. Matching Collection to the Deployment Distribution

53.6.1 6.1 Specifying the deployment distribution

Designing collection to match deployment begins with making the deployment distribution explicit. This means writing down the axes along which inputs vary in production and the expected mass along each axis: the mix of device types, geographies, languages, time of day, user segments, and input difficulty. This specification is a falsifiable artifact. It can be checked against production telemetry once the system is live, and discrepancies become collection requirements for the next iteration.

53.6.2 6.2 Stratifying collection to the target mix

With the deployment mix specified, stratified collection becomes the natural tool. Define strata along the axes that matter and set sampling rates so the collected mix matches the target mix, or deliberately oversample rare strata and reweight, accepting the variance cost in exchange for guaranteed coverage. Oversampling rare but consequential strata, such as fraud cases or safety critical edge cases, is usually correct even though it distorts the training mix, because the alternative is a model that has seen too few examples to learn the rare behavior at all. The distortion is then corrected with class weights in the loss or with inverse probability weighting in evaluation.
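The correction weights follow directly from the two mixes: each stratum is weighted by its target share divided by its collected share. The sketch below assumes hypothetical fraud and legitimate strata, with fraud oversampled to 20% of the collected data against a 1% target.

```python
from collections import Counter


def stratum_weights(collected_strata, target_mix):
    """Weight per stratum: target share divided by collected share."""
    n = len(collected_strata)
    counts = Counter(collected_strata)
    return {h: target_mix[h] / (counts[h] / n) for h in counts}


collected = ["fraud"] * 200 + ["legit"] * 800   # oversampled rare stratum
target_mix = {"fraud": 0.01, "legit": 0.99}     # assumed deployment mix

weights = stratum_weights(collected, target_mix)
total_weight = sum(weights[h] for h in collected)
weighted_fraud_share = sum(weights[h] for h in collected if h == "fraud") / total_weight
print(weights, weighted_fraud_share)
```

After weighting, fraud again accounts for 1% of the total weight, while the model has still seen 200 fraud examples rather than 10.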

53.6.3 6.3 Temporal matching and drift

The deployment distribution is not static. Concept drift, in which $p(y \mid x)$ changes over time, and covariate drift, in which $p(x)$ changes, both erode a model trained on data collected from a fixed historical window. Collection strategy must therefore include a refresh cadence and a temporally held-out evaluation set, validating on data strictly later than the training window to estimate how performance decays between retraining cycles. A model evaluated only on a random split of historical data systematically overstates its deployment performance because it never confronts drift.
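A temporally held-out split is simple to implement once every record carries a timestamp. The sketch below assumes records are (timestamp, payload) pairs; the dates and the cutoff are hypothetical.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on records up to the cutoff; evaluate on records strictly after it."""
    train = [r for r in records if r[0] <= cutoff]
    evaluation = [r for r in records if r[0] > cutoff]
    return train, evaluation


# Hypothetical monthly records for one year.
records = [(date(2024, month, 1), f"example-{month}") for month in range(1, 13)]
train, evaluation = temporal_split(records, cutoff=date(2024, 9, 30))

print(len(train), len(evaluation))
```

Unlike a random split, no evaluation record predates any training record, so the evaluation score includes whatever drift occurred after the cutoff.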

53.6.4 6.4 Monitoring and closing the loop

Finally, collection is not a one-time event but a standing process. Production monitoring should compare the live input distribution against the collection specification and raise an alarm when a stratum’s mass shifts beyond tolerance. A widely used operational measure is the population stability index, which bins a feature and compares the collected proportion $c_h$ in bin $h$ against the live proportion $\ell_h$,

\[ \text{PSI} = \sum_{h=1}^{H} (\ell_h - c_h) \, \ln \frac{\ell_h}{c_h}. \]

The PSI is the symmetrized Kullback-Leibler divergence between the two binned distributions, that is, the sum of the two directed divergences, and is zero exactly when they match. A common industry convention treats values below $0.1$ as negligible drift, $0.1$ to $0.25$ as moderate drift worth investigating, and above $0.25$ as a major shift demanding action, though the right thresholds are domain specific and should be calibrated against observed performance decay rather than adopted blindly. A formal two-sample test, such as a Kolmogorov-Smirnov test on a continuous feature or a chi-squared test on a categorical one, turns the same comparison into a hypothesis test with a controllable false alarm rate. When the alarm fires, the response is a targeted collection campaign for the drifted stratum, closing the loop between deployment and collection. The systems that remain reliable in production are those whose data collection is treated as a continuous control problem rather than a finished task.
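The index and its relation to the Kullback-Leibler divergence can be verified in a few lines. The sketch below takes binned proportions as input; flooring proportions at a small constant so that empty bins do not produce a logarithm of zero is an assumed convention, and the example proportions are hypothetical.

```python
import math


def population_stability_index(collected, live, floor=1e-6):
    """PSI between two binned distributions given as per-bin proportions."""
    if len(collected) != len(live):
        raise ValueError("both distributions must use the same bins")
    psi = 0.0
    for c, l in zip(collected, live):
        c, l = max(c, floor), max(l, floor)  # guard against empty bins
        psi += (l - c) * math.log(l / c)
    return psi


def kl_divergence(p, q):
    """Directed Kullback-Leibler divergence for strictly positive proportions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


collected = [0.5, 0.5]
live = [0.6, 0.4]

psi = population_stability_index(collected, live)
symmetrized_kl = kl_divergence(live, collected) + kl_divergence(collected, live)
print(psi, symmetrized_kl)
```

The two printed values coincide, and by the convention quoted above a PSI of about 0.04 would count as negligible drift.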

53.7 7. When to Use Which Strategy, and Common Pitfalls

The strategies above are not interchangeable, and matching them to the situation prevents the most common failures.

Prefer probability sampling whenever you control the frame. If you can enumerate units and assign selection probabilities, do so. The cost is modest and it preserves the ability to make unbiased population statements later. Reserve convenience sampling for exploration and prototyping, and never let a convenience sample silently become the production training set.
Reach for stratification when the deployment distribution has identifiable, consequential subpopulations, especially rare ones. It is the cheapest insurance against a model that has never seen an important region. The pitfall is choosing strata that are easy to measure rather than strata that drive the outcome.
Use reweighting only for covariate shift with good support overlap. Check the effective sample size before trusting a reweighted estimate. If $n_{\text{eff}}$ is a small fraction of $n$, the honest conclusion is that you need more data from the underrepresented region, not heavier weights. Reweighting cannot touch concept shift.
Introduce randomized exploration when selection depends on the outcome. This is the only remedy for selection on unobservables, and the only way to learn the censored region. The pitfall is omitting it because the short-term cost is visible while the long-term bias is not.
Hold out the evaluation set in time and draw it to match deployment. A random split of historical data, or an evaluation set distorted by active learning, will overstate deployment performance. This is the single most common reason a model looks strong offline and disappoints in production.

A recurring meta-pitfall deserves emphasis: treating data as found rather than designed. The convenience of an existing log or scrape exerts constant pressure to skip the design questions, and the resulting gaps surface only after deployment.

On tooling, mature open-source libraries support each stage of this discipline. The Python scientific stack (numpy, pandas, and scikit-learn) covers stratified splitting, class weighting, and resampling. Imbalanced-learn adds principled resampling for skewed class distributions. For weak supervision, Snorkel models the accuracies and correlations of noisy labeling sources. For drift monitoring, open-source libraries such as Evidently and Alibi Detect provide drift metrics and two-sample drift tests of the kind discussed in Section 6.4, so that monitoring becomes largely a matter of configuration rather than bespoke code. Documenting the resulting dataset with a datasheet (reference 9) records the provenance, coverage, and known biases so that downstream users inherit the design decisions rather than rediscovering their consequences.

53.8 8. A Practical Checklist

The principles above reduce to a sequence of questions to ask before committing to a collection plan.

1. What is the deployment distribution p_dep(x, y)?
   Write down the axes of variation and their target mix.
2. What is the sampling frame? Which subpopulations does it omit?
3. Is selection a probability sample or a convenience sample?
   If convenience, what untestable assumption am I relying on?
4. Does selection depend on the outcome (selection on unobservables)?
   If so, can I introduce randomized exploration to break it?
5. Where do labels come from, and what is their bias and noise?
6. Are rare, consequential strata oversampled and then reweighted?
7. Is the evaluation set drawn to match deployment and held out in time?
8. Is there a monitor comparing live inputs to the collection spec?

Working through these questions does not guarantee a perfect dataset, because perfect representativeness is rarely attainable. What it guarantees is that the gaps between the collected and deployment distributions are known, documented, and accounted for, rather than discovered through a production failure. That awareness is the difference between a dataset that supports valid inference and one that merely looks like data.

53.9 References

1. Cochran, W. G. Sampling Techniques, 3rd ed. Wiley, 1977. https://www.wiley.com/en-us/Sampling+Techniques%2C+3rd+Edition-p-9780471162407
2. Lohr, S. L. Sampling: Design and Analysis, 3rd ed. CRC Press, 2021. https://www.routledge.com/Sampling-Design-and-Analysis/Lohr/p/book/9780367279509
3. Horvitz, D. G., and Thompson, D. J. “A Generalization of Sampling Without Replacement From a Finite Universe.” Journal of the American Statistical Association, 1952. https://www.jstor.org/stable/2280784
4. Heckman, J. J. “Sample Selection Bias as a Specification Error.” Econometrica, 1979. https://www.jstor.org/stable/1912352
5. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. MIT Press, 2009. https://mitpress.mit.edu/9780262170055/dataset-shift-in-machine-learning/
6. Sugiyama, M., and Kawanabe, M. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012. https://mitpress.mit.edu/9780262017091/
7. Settles, B. “Active Learning Literature Survey.” Computer Sciences Technical Report, University of Wisconsin-Madison, 2009. https://minds.wisconsin.edu/handle/1793/60660
8. Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. “Snorkel: Rapid Training Data Creation with Weak Supervision.” VLDB, 2017. https://www.vldb.org/pvldb/vol11/p269-ratner.pdf
9. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. “Datasheets for Datasets.” Communications of the ACM, 2021. https://dl.acm.org/doi/10.1145/3458723
10. Cohen, J. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement, 1960. https://journals.sagepub.com/doi/10.1177/001316446002000104

# Data Collection Strategies Every machine learning system rests on a dataset, and every dataset is the product of a collection process that imposed structure, made omissions, and introduced bias long before any model touched the data. Practitioners often treat data as a fixed input and reserve their energy for architecture and tuning, yet the collection design usually determines the ceiling on what any model can achieve. A model can only learn the distribution it is shown. If that distribution differs from the one encountered at deployment, no amount of regularization or scale will close the gap. This chapter develops a disciplined approach to designing data collection so that the resulting dataset supports valid inference and reliable deployment. ## 1. Framing Collection as a Design Problem ### 1.1 The target population and the deployment distribution The first task is to define the population about which we wish to draw conclusions or on which the model will act. In survey statistics this is the **target population**. In machine learning the analogous object is the **deployment distribution**, the distribution of inputs the model will actually receive in production. Denote the deployment distribution over inputs and labels by $p_{\text{dep}}(x, y)$ and the distribution from which we collect training data by $p_{\text{col}}(x, y)$. The central goal of collection design is to make these two distributions agree, or to understand precisely how they differ so the difference can be corrected. When $p_{\text{col}} = p_{\text{dep}}$, a model trained to minimize expected loss on collected data also minimizes expected loss at deployment. When they differ, we have **dataset shift**, and the empirical risk we minimize is a biased estimate of the risk we care about: $$ R_{\text{dep}}(f) = \mathbb{E}_{(x,y) \sim p_{\text{dep}}}[\ell(f(x), y)] \neq \mathbb{E}_{(x,y) \sim p_{\text{col}}}[\ell(f(x), y)] = R_{\text{col}}(f). 
$$

Much of what follows is about controlling this inequality at the point of collection, which is far cheaper and more reliable than correcting it after the fact.

It is worth stating precisely what dataset shift can mean, because the type of shift dictates which remedies are available. Factorize the joint distribution two ways, $p(x, y) = p(y \mid x)\,p(x) = p(x \mid y)\,p(y)$. The standard taxonomy follows Quiñonero-Candela et al. (reference 5).

- **Covariate shift.** The input marginal changes, $p_{\text{col}}(x) \neq p_{\text{dep}}(x)$, while the labeling mechanism is stable, $p_{\text{col}}(y \mid x) = p_{\text{dep}}(y \mid x)$. This is the benign case: it is correctable by reweighting (Section 3.2) provided the supports overlap.
- **Prior probability shift.** The label marginal changes, $p_{\text{col}}(y) \neq p_{\text{dep}}(y)$, while $p(x \mid y)$ is stable. This is common in classification when class balance differs between collection and deployment.
- **Concept shift.** The conditional $p(y \mid x)$ itself changes. This is the dangerous case, because the relationship the model learned is no longer the one it faces, and no reweighting of inputs can repair it. Fresh labels are required.

A simple support condition clarifies what reweighting can and cannot do. Reweighting corrects the marginal mismatch only on the region where the collection distribution places mass. If $p_{\text{dep}}$ assigns positive probability to a region where $p_{\text{col}}(x) = 0$, the density ratio is undefined and the deployment risk on that region is simply unobservable from the collected data. This **support coverage** requirement, namely $\operatorname{supp}(p_{\text{dep}}) \subseteq \operatorname{supp}(p_{\text{col}})$, is the formal counterpart of the coverage discussion in Section 1.2, and it is a property of the collection design, not of any later correction.
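
Once inputs are binned into discrete strata, the support coverage requirement can be checked mechanically. The sketch below is a minimal illustration, assuming that a sample (or specification) of deployment strata is available. The function name `uncovered_strata` and the stratum labels are hypothetical, chosen only for this example.

```python
from collections import Counter

def uncovered_strata(collected, deployment):
    """Return each deployment stratum that never appears in the collected
    data, mapped to its share of the deployment sample."""
    col_counts = Counter(collected)
    dep_counts = Counter(deployment)
    total = sum(dep_counts.values())
    # A stratum with zero collected mass violates supp(p_dep) ⊆ supp(p_col):
    # no reweighting of the collected data can recover it.
    return {s: c / total for s, c in dep_counts.items() if col_counts[s] == 0}

# Hypothetical stratum labels, for illustration only.
collected = ["stratum_a"] * 90 + ["stratum_b"] * 10
deployment = ["stratum_a"] * 50 + ["stratum_b"] * 30 + ["stratum_c"] * 20

gaps = uncovered_strata(collected, deployment)
print(gaps)  # {'stratum_c': 0.2}
```

A non-empty result is a collection requirement rather than a modeling problem: the listed strata need new data before any correction can apply.
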

### 1.2 Sampling frame and coverage

Between the abstract target population and the concrete sample sits the **sampling frame**, the operational list or mechanism from which units are actually drawn. A frame for a customer churn model might be the set of accounts in the production database. A frame for a speech recognition system might be whatever audio a particular set of microphones captured. **Coverage error** arises when the frame omits part of the target population or includes units that do not belong to it.

Coverage failures are insidious because they are invisible in the collected data. If your frame for a medical imaging model contains only scans from a single hospital's scanner model, the dataset will look complete and internally consistent, yet it silently excludes the imaging characteristics of every other scanner. The model will appear to perform well in evaluation, because evaluation data inherits the same coverage gap, and then degrade in deployment. The discipline here is to enumerate, before collection, the subpopulations that exist in the deployment distribution and to verify that the frame can reach each of them.

The relationship among these populations is a nested chain, and naming each gap is the first step to closing it.

```{mermaid}
%%| label: fig-frame-chain
%%| fig-cap: "From the target population to the analyzed sample, each arrow can introduce a distinct error."
flowchart TD
    A["Target population (who we want conclusions about)"] -->|"coverage error"| B["Sampling frame (who the mechanism can reach)"]
    B -->|"sampling error"| C["Selected sample (who we drew)"]
    C -->|"nonresponse and selection bias"| D["Observed sample (who we actually recorded)"]
    D -->|"labeling and missingness"| E["Analyzed sample (who has usable labels)"]
```

The chain makes the accounting concrete.
The analyzed sample, the data a model actually trains on, can differ from the target population through four compounding mechanisms, each with its own remedy: coverage error is fixed by widening the frame, sampling error by larger or stratified samples, selection bias by randomization or weighting, and label gaps by deliberate annotation of the missing region. A dataset that looks adequate at the bottom of the chain may be unrepresentative for reasons introduced at any link above it.

## 2. Data Sources and Their Biases

### 2.1 A taxonomy of sources

Data sources fall into several broad categories, each with a characteristic bias profile.

- **Instrumented production logs.** Cheap, large, and continuously refreshed, but they reflect only the behavior of existing users interacting with an existing system. They carry strong **feedback loop bias**, since the current model shapes the very data used to train its successor.
- **Operational records and administrative data.** Created for business or legal purposes rather than for modeling. They are often complete within their scope but encode the categories and decisions of the originating process, not necessarily the ones you care about.
- **Sensors and devices.** Subject to calibration drift, hardware heterogeneity, and placement effects. A frame defined by a particular device population rarely matches the device population at deployment.
- **Surveys and elicited data.** Allow controlled probability sampling but suffer nonresponse and self-report bias.
- **Web scrapes and public corpora.** Vast and convenient, but their composition reflects who publishes online, which overrepresents some languages, regions, and viewpoints and underrepresents others.
- **Purchased or partner data.** Opaque provenance. The collection process is someone else's, so its biases are inherited without being documented.

### 2.2 The convenience trap

Almost all of these sources are tempting precisely because they are convenient.
The danger is that convenience and representativeness are usually in tension. The data that is easiest to obtain is rarely the data that matches the deployment distribution. The professional habit worth cultivating is to ask, for any candidate source, a single question: which units of the target population can never appear in this source, and why? That question surfaces coverage and selection problems before they are baked into a model.

## 3. Sampling Design

### 3.1 Probability sampling

A **probability sample** is one in which every unit in the frame has a known, nonzero probability of selection $\pi_i > 0$. This property is what licenses statistical inference, because it lets us reweight the sample to estimate population quantities without bias. The basic schemes form a hierarchy of increasing structure.

**Simple random sampling** gives every unit equal probability $\pi_i = n/N$. It is unbiased and easy to implement, but it can leave small yet important subgroups underrepresented by chance.

**Stratified sampling** partitions the frame into strata and samples within each. It guarantees coverage of every stratum and reduces variance when strata are internally homogeneous. The estimator for a population mean is

$$
\hat{\mu}_{\text{str}} = \sum_{h=1}^{H} \frac{N_h}{N} \, \bar{y}_h,
$$

where $N_h$ is the size of stratum $h$ and $\bar{y}_h$ its sample mean. Writing $W_h = N_h/N$ for the stratum weight and $\sigma_h^2$ for the within-stratum variance, the variance of the stratified mean under sampling without replacement is approximately (ignoring the finite population correction)

$$
\operatorname{Var}(\hat{\mu}_{\text{str}}) \approx \sum_{h=1}^{H} W_h^2 \, \frac{\sigma_h^2}{n_h},
$$

with $n_h$ the sample size in stratum $h$. Compare this to the variance of a simple random sample of the same total size $n$, which by the analysis-of-variance decomposition carries both within-stratum and between-stratum components. Under proportional allocation, stratification removes the between-stratum term from the sampling variance entirely.
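
A small numerical comparison makes the variance decomposition concrete. The sketch below assumes a hypothetical two-stratum population (the weights, means, and variances are invented for illustration), ignores the finite population correction, and uses proportional allocation. The function names are likewise illustrative.

```python
def stratified_variance(weights, variances, sizes):
    """Variance of the stratified mean: sum_h W_h^2 * sigma_h^2 / n_h
    (finite population correction ignored)."""
    return sum(w * w * v / n for w, v, n in zip(weights, variances, sizes))

def srs_variance(weights, means, variances, n):
    """Approximate variance of a simple-random-sample mean of size n:
    (within-stratum + between-stratum variance) / n."""
    mu = sum(w * m for w, m in zip(weights, means))
    within = sum(w * v for w, v in zip(weights, variances))
    between = sum(w * (m - mu) ** 2 for w, m in zip(weights, means))
    return (within + between) / n

# Hypothetical population: a large bulk stratum and a rare, very different one.
weights = [0.9, 0.1]      # W_h
means = [10.0, 50.0]      # stratum means
variances = [4.0, 4.0]    # sigma_h^2
n = 100
sizes = [90, 10]          # proportional allocation, n_h = W_h * n

var_str = stratified_variance(weights, variances, sizes)
var_srs = srs_variance(weights, means, variances, n)
print(f"stratified: {var_str:.4f}, simple random: {var_srs:.4f}")
```

With these assumed numbers almost all of the simple-random-sample variance comes from the between-stratum term, which the stratified design does not pay.
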
The benefit is therefore largest precisely when strata differ from one another, that is, when the between-stratum variance is large. This is the mathematical reason stratification is the most useful single technique for machine learning collection: it lets the designer force representation of rare but important regions of the input space and, at the same time, lowers estimator variance whenever those regions behave differently from the bulk.

The allocation $n_h$ is itself a design choice. **Proportional allocation** sets $n_h \propto N_h$ and reproduces the population mix. **Neyman allocation** sets $n_h \propto N_h \sigma_h$, putting more samples where a stratum is both large and internally variable, which minimizes $\operatorname{Var}(\hat{\mu}_{\text{str}})$ for a fixed total sample size (Cochran, reference 1). For modeling, a third allocation often dominates: deliberately oversample rare, decision-critical strata beyond their population share, then correct the induced distortion with weights at training and evaluation time (Section 6.2).

**Cluster sampling** selects groups of units, such as all transactions from a sampled set of stores. It is cheaper when units are naturally grouped but increases variance through intracluster correlation.

**Systematic sampling** takes every $k$th unit from an ordered frame. It is convenient but dangerous when the ordering has periodicity that aligns with $k$.

### 3.2 Inverse probability weighting

When selection probabilities are unequal but known, unbiased estimation of a population total is recovered by weighting each unit by the inverse of its inclusion probability, the **Horvitz-Thompson** estimator (Horvitz and Thompson, reference 3):

$$
\hat{T} = \sum_{i \in S} \frac{y_i}{\pi_i}.
$$

In machine learning this same logic justifies **importance weighting** of training examples when the collection distribution is known to differ from the deployment distribution.
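
The Horvitz-Thompson estimator above is short enough to write out directly. This is a minimal sketch: the function name and the sample values are hypothetical, and the inclusion probabilities are assumed to be known exactly from the sampling design.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total as sum_i y_i / pi_i over the sampled units."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("every inclusion probability must lie in (0, 1]")
    return sum(y / p for y, p in zip(values, inclusion_probs))

# Hypothetical sample: two units from an oversampled rare stratum (pi = 0.5)
# and two from the bulk (pi = 0.25). Each unit stands in for 1 / pi units.
values = [10.0, 12.0, 3.0, 5.0]
inclusion_probs = [0.5, 0.5, 0.25, 0.25]

total_estimate = horvitz_thompson_total(values, inclusion_probs)
print(total_estimate)  # 76.0
```

The guard against $\pi_i \le 0$ reflects the requirement that every unit have a nonzero selection probability; a unit that could never be sampled has no finite weight.
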
If we can estimate the density ratio $w(x) = p_{\text{dep}}(x) / p_{\text{col}}(x)$, then the reweighted empirical risk

$$
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} w(x_i) \, \ell(f(x_i), y_i)
$$

is a consistent estimator of the deployment risk under **covariate shift**, the assumption that $p(y \mid x)$ is stable while $p(x)$ changes (Sugiyama and Kawanabe, reference 6). The catch is that high-variance weights inflate the variance of the estimate, so reweighting is a corrective of last resort rather than a substitute for sound collection.

How badly the weights degrade an estimate is captured by the **effective sample size**,

$$
n_{\text{eff}} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2},
$$

where $w_i = w(x_i)$. It ranges from $1$, when a single example carries all the weight, to $n$, when the weights are uniform. A reweighted dataset of $n$ raw examples behaves, to a rough approximation, like an unweighted one of size $n_{\text{eff}}$. When $n_{\text{eff}}$ collapses to a small fraction of $n$, it is a quantitative warning that the collection and deployment distributions overlap too little for reweighting to rescue the estimate, and that the right response is to collect from the underrepresented region rather than to weight ever harder. Truncating or clipping extreme weights trades a controlled amount of bias for a large reduction in variance and is standard practice, but it does not change the underlying diagnosis that the support overlap is poor.

### 3.3 Convenience sampling and why it fails

A **convenience sample** draws whatever units are readily available, with no known selection probabilities. Volunteer panels, the first thousand users to opt in, and an unfiltered web scrape are all convenience samples. The fatal property is that $\pi_i$ is unknown and, worse, often correlated with the very quantities being modeled. Because the selection probabilities cannot be recovered, no reweighting can remove the bias in general.
The sample may still be useful, but inference from it is conditional on an untestable assumption that the selected units resemble the unselected ones on the outcome of interest. Many failures of deployed models trace back, on inspection, to a convenience sample that was treated as if it were representative.

## 4. Selection Bias and Its Mechanisms

### 4.1 A formal view

**Selection bias** occurs when the probability that a unit enters the dataset depends on variables that also relate to the outcome. Introduce a selection indicator $S \in \{0, 1\}$, equal to one when a unit is observed. We collect data from the conditional distribution $p(x, y \mid S = 1)$, but we want to reason about the marginal $p(x, y)$. These coincide only when selection is independent of the relevant variables.

The cleanest case is **selection on observables**, where $S \perp y \mid x$. Here selection depends only on covariates we record, and inverse probability weighting can correct it. The dangerous case is **selection on unobservables**, where selection depends on the outcome itself even after conditioning on recorded covariates. No reweighting from the observed data can fix this, because the information needed to estimate the weights is precisely what selection removed.

### 4.2 Common patterns

Several recurring mechanisms deserve names because recognizing them early prevents expensive mistakes.

- **Survivorship bias.** Only units that persisted are observed. Modeling equipment failures from machines still in service omits those that already failed and were removed.
- **Self-selection.** Units choose whether to appear, and the choice relates to the outcome. Customers who respond to a satisfaction survey are not a random slice of customers.
- **Label availability bias.** Labels exist only for a subset, and that subset is not random.
  Loan repayment outcomes are observed only for applicants who were approved, so the training data systematically excludes the rejected region of input space.
- **Feedback loops.** The deployed model determines which inputs are seen, which determines the next training set. A recommender trained only on items it already surfaces never learns about items it never shows.

### 4.3 Designing against selection bias

The structural remedy is to introduce randomization into the collection process wherever feasible. A small **exploration** budget, in which a fraction of decisions are made randomly rather than by the current policy, breaks feedback loops and creates a probability sample of the region the model would otherwise never observe. In the lending example, occasionally approving a randomized subset of marginal applicants, within ethical and regulatory limits, generates the counterfactual labels needed to model the rejected region. The cost of this exploration is real, but it is the price of an unbiased view of the input space.

```text
Collection policy with exploration:
  with probability 1 - epsilon:  act according to current model
  with probability epsilon:      act randomly and log the outcome
  # the epsilon-fraction is a probability sample
  # of the space the model would otherwise censor
```

### 4.4 Worked example: the censored lending region

A concrete instance makes the cost of selection on unobservables tangible. Suppose a lender approves applicants whose model-estimated default probability is below a threshold and observes repayment only for the approved. Let the approved fraction be $40\%$ and suppose, hypothetically, that the true default rate is $5\%$ among approved applicants and $30\%$ among rejected applicants. A naive model retrained on observed outcomes sees only the approved region, so it estimates a portfolio default rate of $5\%$. The true rate across all applicants is

$$
0.40 \times 0.05 + 0.60 \times 0.30 = 0.20,
$$

a fourfold understatement.
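
The arithmetic can be checked with a short simulation. The sketch below is illustrative only: it hard-codes the hypothetical rates from the example, and the function name, the exploration rate `epsilon = 0.05`, and the applicant count are assumptions made for the demonstration. It contrasts the naive estimate, built from approved loans alone, with an estimate that also uses a randomized exploration sample of the kind described in Section 4.3.

```python
import random

def simulate_lending(n_applicants, epsilon, seed=0):
    """Simulate the hypothetical lender: 40% of applicants are approved
    (5% default rate), 60% are rejected (30% default rate)."""
    rng = random.Random(seed)
    approved = []  # outcomes observed under the current policy
    explored = []  # outcomes observed only through randomized exploration
    for _ in range(n_applicants):
        would_approve = rng.random() < 0.40
        defaulted = rng.random() < (0.05 if would_approve else 0.30)
        if would_approve:
            approved.append(defaulted)
        elif rng.random() < epsilon:
            # Randomly approve a small fraction of otherwise-rejected applicants.
            explored.append(defaulted)
    naive_rate = sum(approved) / len(approved)
    rejected_rate = sum(explored) / len(explored)
    share_approved = len(approved) / n_applicants
    corrected_rate = (share_approved * naive_rate
                      + (1 - share_approved) * rejected_rate)
    return naive_rate, rejected_rate, corrected_rate

true_rate = 0.40 * 0.05 + 0.60 * 0.30
naive, rejected, corrected = simulate_lending(200_000, epsilon=0.05)
print(f"true {true_rate:.3f}  naive {naive:.3f}  "
      f"rejected (explored) {rejected:.3f}  corrected {corrected:.3f}")
```

The naive estimate stays near the approved-region rate however many approved loans are observed. Only the explored sample carries information about the rejected region.
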
No reweighting of the observed approvals recovers the $30\%$ figure, because the rejected region contributes zero observations: the weight $1/\pi_i$ is undefined where $\pi_i = 0$, which is exactly the support-coverage failure of Section 1.1. The only fix that adds information is to generate observations in the censored region. Approving a small randomized fraction $\epsilon$ of otherwise-rejected applicants yields, after the loan matures, an unbiased estimate of the rejected-region default rate at the cost of the expected losses on that $\epsilon$-fraction. The numbers above are illustrative, but the structure is general: when selection depends on the outcome, only randomization buys back the missing information, and its price is the loss incurred on the explored decisions.

## 5. Labeling Strategies

### 5.1 Sources of labels

Labels can come from human annotation, from organic outcomes recorded by the system, from weak or programmatic heuristics, or from existing models. Each trades cost against quality. Organic outcome labels, such as whether a user clicked or a part failed, are cheap and aligned with the real objective but are subject to the label availability bias of Section 4.2. Human labels are expensive but can be elicited for any region of input space, including the rare and the counterfactual.

### 5.2 Annotation quality and agreement

Human labels are noisy, and the noise is rarely uniform. The standard tool for quantifying labeler consistency is inter-annotator agreement corrected for chance. **Cohen's kappa** for two annotators is

$$
\kappa = \frac{p_o - p_e}{1 - p_e},
$$

where $p_o$ is observed agreement and $p_e$ is agreement expected by chance (Cohen, reference 10). The coefficient equals $1$ at perfect agreement, $0$ when annotators agree only as often as independent random labeling would predict, and can go negative under systematic disagreement.
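
The kappa formula can be computed directly from two label sequences. The sketch below is a minimal illustration: the function name and the toy labels are hypothetical, and $p_e$ is taken, as in Cohen's definition, from the product of each annotator's marginal label frequencies.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over classes of the product of marginal rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in classes)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for ten items: the annotators disagree on items 3 and 4.
annotator_a = ["pos"] * 4 + ["neg"] * 6
annotator_b = ["pos"] * 3 + ["neg", "pos"] + ["neg"] * 5

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")  # p_o = 0.8, p_e = 0.52
```

Here raw agreement is $0.8$, but more than half of it is expected by chance, so the chance-corrected value is noticeably lower.
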
The chance correction matters because raw agreement $p_o$ is inflated whenever one class dominates: two annotators who both label nearly every item as the majority class agree almost perfectly yet convey little information, and $\kappa$ correctly reports a value near zero. For more than two annotators the analogous chance-corrected statistic is Fleiss' kappa, and for ordinal labels a weighted kappa penalizes far-apart disagreements more than adjacent ones.

Low agreement signals an ambiguous labeling task, an inadequate guideline, or insufficient annotator training, and it caps the accuracy any model can reach, since the labels themselves are the ceiling. Investment in clear annotation guidelines, adjudication of disagreements, and gold-standard test items usually yields more model improvement per dollar than additional raw volume.

### 5.3 Allocating the labeling budget

Labeling budget is finite, so its allocation is a sampling design in its own right. Three principles guide it.

First, allocate disproportionate effort to rare classes and to decision boundaries, since uniform labeling wastes effort on regions where the model is already confident. **Active learning** formalizes this by selecting for annotation the examples whose labels are expected to be most informative, often those near the current decision boundary (Settles, reference 7).

Second, preserve a held-out evaluation set whose examples are selected for labeling at random and drawn to match the deployment distribution, because an evaluation set distorted by active learning no longer estimates deployment performance.

Third, label redundantly where ambiguity is high and singly where it is low, spending agreement measurement only where it matters.

### 5.4 Weak supervision

When expert labels are scarce, **weak supervision** combines many noisy labeling sources, such as heuristics, knowledge bases, and patterns, into probabilistic labels by modeling each source's accuracy and correlations.
This accepts noisier labels in exchange for greater label volume and can be effective, but it introduces a new dependency. The biases of the labeling heuristics propagate into the labels, so the same scrutiny applied to data sources must be applied to label sources.

## 6. Matching Collection to the Deployment Distribution

### 6.1 Specifying the deployment distribution

Designing collection to match deployment begins with making the deployment distribution explicit. This means writing down the axes along which inputs vary in production and the expected mass along each axis: the mix of device types, geographies, languages, time of day, user segments, and input difficulty. This specification is a falsifiable artifact. It can be checked against production telemetry once the system is live, and discrepancies become collection requirements for the next iteration.

### 6.2 Stratifying collection to the target mix

With the deployment mix specified, stratified collection becomes the natural tool. Define strata along the axes that matter and set sampling rates so the collected mix matches the target mix, or deliberately oversample rare strata and reweight, accepting the variance cost in exchange for guaranteed coverage. Oversampling rare but consequential strata, such as fraud cases or safety-critical edge cases, is usually correct even though it distorts the training mix, because the alternative is a model that has seen too few examples to learn the rare behavior at all. The distortion is then corrected with class weights in the loss or with inverse probability weighting in evaluation.

### 6.3 Temporal matching and drift

The deployment distribution is not static. **Concept drift**, in which $p(y \mid x)$ changes over time, and **covariate drift**, in which $p(x)$ changes, both erode a model collected from a fixed historical window.
Collection strategy must therefore include a refresh cadence and a temporally held-out evaluation set, validating on data strictly later than the training window to estimate how performance decays between retraining cycles. A model evaluated only on a random split of historical data tends to overstate its deployment performance because it never confronts drift.

### 6.4 Monitoring and closing the loop

Finally, collection is not a one-time event but a standing process. Production monitoring should compare the live input distribution against the collection specification and raise an alarm when a stratum's mass shifts beyond tolerance. A widely used operational measure is the **population stability index**, which bins a feature and compares the collected proportion $c_h$ in bin $h$ against the live proportion $\ell_h$ (here $\ell_h$ denotes a proportion, not the loss),

$$
\text{PSI} = \sum_{h=1}^{H} (\ell_h - c_h) \, \ln \frac{\ell_h}{c_h}.
$$

The PSI is the symmetrized Kullback-Leibler divergence between the two binned distributions, that is, the sum of the two directed divergences, and is zero exactly when they match. A common industry convention treats values below $0.1$ as negligible drift, $0.1$ to $0.25$ as moderate drift worth investigating, and above $0.25$ as a major shift demanding action, though the right thresholds are domain specific and should be calibrated against observed performance decay rather than adopted blindly. A formal two-sample test, such as a Kolmogorov-Smirnov test on a continuous feature or a chi-squared test on a categorical one, turns the same comparison into a hypothesis test with a controllable false alarm rate.

When the alarm fires, the response is a targeted collection campaign for the drifted stratum, closing the loop between deployment and collection. The systems that remain reliable in production are those whose data collection is treated as a continuous control problem rather than a finished task.

## 7. When to Use Which Strategy, and Common Pitfalls

The strategies above are not interchangeable, and matching them to the situation prevents the most common failures.

- **Prefer probability sampling whenever you control the frame.** If you can enumerate units and assign selection probabilities, do so. The cost is modest and it preserves the ability to make unbiased population statements later. Reserve convenience sampling for exploration and prototyping, and never let a convenience sample silently become the production training set.
- **Reach for stratification when the deployment distribution has identifiable, consequential subpopulations**, especially rare ones. It is the cheapest insurance against a model that has never seen an important region. The pitfall is choosing strata that are easy to measure rather than strata that drive the outcome.
- **Use reweighting only for covariate shift with good support overlap.** Check the effective sample size before trusting a reweighted estimate. If $n_{\text{eff}}$ is a small fraction of $n$, the honest conclusion is that you need more data from the underrepresented region, not heavier weights. Reweighting cannot touch concept shift.
- **Introduce randomized exploration when selection depends on the outcome.** This is the only remedy for selection on unobservables that does not lean on untestable modeling assumptions, such as those of parametric selection models (Heckman, reference 4), and the only way to observe the censored region directly. The pitfall is omitting it because the short-term cost is visible while the long-term bias is not.
- **Hold out the evaluation set in time and draw it to match deployment.** A random split of historical data, or an evaluation set distorted by active learning, will overstate deployment performance. This is one of the most common reasons a model looks strong offline and disappoints in production.

A recurring meta-pitfall deserves emphasis: treating data as found rather than designed.
The convenience of an existing log or scrape exerts constant pressure to skip the design questions, and the resulting gaps surface only after deployment.

On tooling, mature libraries support each stage of this discipline. The Python scientific stack (`numpy`, `pandas`, and `scikit-learn`) covers stratified splitting, class weighting, and resampling. `imbalanced-learn` adds principled resampling for skewed class distributions. For weak supervision, Snorkel (reference 8) models the accuracies and correlations of noisy labeling sources. For drift monitoring, libraries such as Evidently and Alibi Detect provide ready-made drift statistics and two-sample drift tests of the kind discussed in Section 6.4, so that monitoring becomes largely a matter of configuration rather than bespoke code. Documenting the resulting dataset with a datasheet (reference 9) records the provenance, coverage, and known biases so that downstream users inherit the design decisions rather than rediscovering their consequences.

## 8. A Practical Checklist

The principles above reduce to a sequence of questions to ask before committing to a collection plan.

```text
1. What is the deployment distribution p_dep(x, y)?
   Write down the axes of variation and their target mix.
2. What is the sampling frame? Which subpopulations does it omit?
3. Is the sample a probability sample or a convenience sample?
   If convenience, what untestable assumption am I relying on?
4. Does selection depend on the outcome (selection on unobservables)?
   If so, can I introduce randomized exploration to break it?
5. Where do labels come from, and what is their bias and noise?
6. Are rare, consequential strata oversampled and then reweighted?
7. Is the evaluation set drawn to match deployment and held out in time?
8. Is there a monitor comparing live inputs to the collection spec?
```

Working through these questions does not guarantee a perfect dataset, because perfect representativeness is rarely attainable.
What it guarantees is that the gaps between the collected and deployment distributions are known, documented, and accounted for, rather than discovered through a production failure. That awareness is the difference between a dataset that supports valid inference and one that merely looks like data.

## References

1. Cochran, W. G. *Sampling Techniques*, 3rd ed. Wiley, 1977. https://www.wiley.com/en-us/Sampling+Techniques%2C+3rd+Edition-p-9780471162407
2. Lohr, S. L. *Sampling: Design and Analysis*, 3rd ed. CRC Press, 2021. https://www.routledge.com/Sampling-Design-and-Analysis/Lohr/p/book/9780367279509
3. Horvitz, D. G., and Thompson, D. J. "A Generalization of Sampling Without Replacement From a Finite Universe." *Journal of the American Statistical Association*, 1952. https://www.jstor.org/stable/2280784
4. Heckman, J. J. "Sample Selection Bias as a Specification Error." *Econometrica*, 1979. https://www.jstor.org/stable/1912352
5. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. *Dataset Shift in Machine Learning*. MIT Press, 2009. https://mitpress.mit.edu/9780262170055/dataset-shift-in-machine-learning/
6. Sugiyama, M., and Kawanabe, M. *Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation*. MIT Press, 2012. https://mitpress.mit.edu/9780262017091/
7. Settles, B. "Active Learning Literature Survey." Computer Sciences Technical Report, University of Wisconsin-Madison, 2009. https://minds.wisconsin.edu/handle/1793/60660
8. Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. "Snorkel: Rapid Training Data Creation with Weak Supervision." *VLDB*, 2017. https://www.vldb.org/pvldb/vol11/p269-ratner.pdf
9. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. "Datasheets for Datasets." *Communications of the ACM*, 2021. https://dl.acm.org/doi/10.1145/3458723
10. Cohen, J. "A Coefficient of Agreement for Nominal Scales." *Educational and Psychological Measurement*, 1960. https://journals.sagepub.com/doi/10.1177/001316446002000104