53 Data Collection Strategies
Every machine learning system rests on a dataset, and every dataset is the product of a collection process that imposed structure, made omissions, and introduced bias long before any model touched the data. Practitioners often treat data as a fixed input and reserve their energy for architecture and tuning, yet the collection design usually determines the ceiling on what any model can achieve. A model can only learn the distribution it is shown. If that distribution differs from the one encountered at deployment, no amount of regularization or scale will close the gap. This chapter develops a disciplined approach to designing data collection so that the resulting dataset supports valid inference and reliable deployment.
53.1 1. Framing Collection as a Design Problem
53.1.1 1.1 The target population and the deployment distribution
The first task is to define the population about which we wish to draw conclusions or on which the model will act. In survey statistics this is the target population. In machine learning the analogous object is the deployment distribution, the distribution of inputs the model will actually receive in production. Denote the deployment distribution over inputs and labels by \(p_{\text{dep}}(x, y)\) and the distribution from which we collect training data by \(p_{\text{col}}(x, y)\). The central goal of collection design is to make these two distributions agree, or to understand precisely how they differ so the difference can be corrected.
When \(p_{\text{col}} = p_{\text{dep}}\), a model trained to minimize expected loss on collected data also minimizes expected loss at deployment. When they differ, we have dataset shift, and the empirical risk we minimize is in general a biased estimate of the risk we care about:
\[ R_{\text{dep}}(f) = \mathbb{E}_{(x,y) \sim p_{\text{dep}}}[\ell(f(x), y)] \neq \mathbb{E}_{(x,y) \sim p_{\text{col}}}[\ell(f(x), y)] = R_{\text{col}}(f). \]
Much of what follows is about controlling this inequality at the point of collection, which is far cheaper and more reliable than correcting it after the fact.
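A small numerical illustration may make the gap concrete. The sketch below assumes a fixed classifier whose error rate differs between two hypothetical groups, and compares its expected 0-1 loss under an assumed collection mix and an assumed deployment mix. All names and numbers are invented for illustration.

```python
# Hypothetical per-group error rates of a fixed classifier f (0-1 loss).
error_rate = {"group_a": 0.05, "group_b": 0.30}

# Assumed group proportions under collection and under deployment.
p_col = {"group_a": 0.9, "group_b": 0.1}
p_dep = {"group_a": 0.5, "group_b": 0.5}


def expected_risk(mix, error_rate):
    """Expected 0-1 loss of f when groups occur with the given proportions."""
    return sum(mix[g] * error_rate[g] for g in mix)


r_col = expected_risk(p_col, error_rate)  # risk measured on collected data
r_dep = expected_risk(p_dep, error_rate)  # risk incurred at deployment
print(f"R_col = {r_col:.3f}, R_dep = {r_dep:.3f}")
```

With these assumed numbers the collected-data risk is 0.075 while the deployment risk is 0.175, even though the classifier itself is unchanged.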
53.1.2 1.2 Sampling frame and coverage
Between the abstract target population and the concrete sample sits the sampling frame, the operational list or mechanism from which units are actually drawn. A frame for a customer churn model might be the set of accounts in the production database. A frame for a speech recognition system might be whatever audio a particular set of microphones captured. Coverage error arises when the frame omits part of the target population or includes units that do not belong to it.
Coverage failures are insidious because they are invisible in the collected data. If your frame for a medical imaging model contains only scans from a single hospital’s scanner model, the dataset will look complete and internally consistent, yet it silently excludes the imaging characteristics of every other scanner. The model will appear to perform well in evaluation, because evaluation data inherits the same coverage gap, and then degrade in deployment. The discipline here is to enumerate, before collection, the subpopulations that exist in the deployment distribution and to verify that the frame can reach each of them.
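The enumeration step can be turned into a mechanical check. The following is a minimal sketch, assuming the frame is available as a list of records and that the required subpopulations have been written down by hand. The helper `coverage_gaps`, the field name `scanner`, and the vendor names are hypothetical.

```python
def coverage_gaps(required, frame_units, key):
    """Return the required subpopulations that have no unit in the frame."""
    present = {key(unit) for unit in frame_units}
    return sorted(set(required) - present)


# Subpopulations expected at deployment (assumed, written down before collection).
required_scanners = ["vendor_a", "vendor_b", "vendor_c"]

# A toy frame: every unit the collection process could actually draw.
frame = [
    {"scan_id": 1, "scanner": "vendor_a"},
    {"scan_id": 2, "scanner": "vendor_a"},
    {"scan_id": 3, "scanner": "vendor_b"},
]

gaps = coverage_gaps(required_scanners, frame, key=lambda u: u["scanner"])
print("Subpopulations the frame cannot reach:", gaps)
```

A check like this only detects gaps along axes someone thought to list, so it complements rather than replaces the enumeration itself.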
53.2 2. Data Sources and Their Biases
53.2.1 2.1 A taxonomy of sources
Data sources fall into several broad categories, each with a characteristic bias profile.
- Instrumented production logs. Cheap, large, and continuously refreshed, but they reflect only the behavior of existing users interacting with an existing system. They carry strong feedback-loop bias, since the current model shapes the very data used to train its successor.
- Operational records and administrative data. Created for business or legal purposes rather than for modeling. They are often complete within their scope but encode the categories and decisions of the originating process, not necessarily the ones you care about.
- Sensors and devices. Subject to calibration drift, hardware heterogeneity, and placement effects. A frame defined by a particular device population rarely matches the device population at deployment.
- Surveys and elicited data. Allow controlled probability sampling but suffer nonresponse and self-report bias.
- Web scrapes and public corpora. Vast and convenient, but their composition reflects who publishes online, which overrepresents some languages, regions, and viewpoints and underrepresents others.
- Purchased or partner data. Opaque provenance. The collection process is someone else’s, so its biases are inherited without being documented.
53.2.2 2.2 The convenience trap
Almost all of these sources are tempting precisely because they are convenient. The danger is that convenience and representativeness are usually in tension. The data that is easiest to obtain is rarely the data that matches the deployment distribution. The professional habit worth cultivating is to ask, for any candidate source, the single question: which units of the target population can never appear in this source, and why. That question surfaces coverage and selection problems before they are baked into a model.
53.3 3. Sampling Design
53.3.1 3.1 Probability sampling
A probability sample is one in which every unit in the frame has a known, nonzero probability of selection \(\pi_i > 0\). This property is what licenses statistical inference, because it lets us reweight the sample to estimate population quantities without bias. The basic schemes form a hierarchy of increasing structure.
Simple random sampling gives every unit equal probability \(\pi_i = n/N\). It is unbiased and easy to implement, but it can leave small yet important subgroups underrepresented by chance.
Stratified sampling partitions the frame into strata and samples within each. It guarantees coverage of every stratum and reduces variance when strata are internally homogeneous. The estimator for a population mean is
\[ \hat{\mu} = \sum_{h=1}^{H} \frac{N_h}{N} \, \bar{y}_h, \]
where \(N_h\) is the size of stratum \(h\) and \(\bar{y}_h\) its sample mean. Stratification is the most useful single technique for machine learning collection because it lets the designer force representation of rare but important regions of the input space.
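As a worked instance of the stratified estimator, the sketch below computes \(\hat{\mu}\) for two strata and compares it with the naive pooled mean of the same sample. The stratum sizes and sample values are made up for illustration.

```python
def stratified_mean(strata):
    """Stratified estimate of a population mean.

    strata: list of (N_h, sample_values) pairs, where N_h is the stratum's
    size in the frame and sample_values are the observations drawn from it.
    """
    total_size = sum(n_h for n_h, _ in strata)
    return sum(
        (n_h / total_size) * (sum(ys) / len(ys))
        for n_h, ys in strata
    )


# A common stratum (900 units) and a rare one (100 units), with the rare
# stratum deliberately sampled at a higher rate.
strata = [
    (900, [1.0, 2.0, 3.0]),
    (100, [10.0, 12.0]),
]

mu_hat = stratified_mean(strata)
pooled = sum(y for _, ys in strata for y in ys) / sum(len(ys) for _, ys in strata)
print(f"stratified = {mu_hat:.2f}, pooled = {pooled:.2f}")
```

Here the stratified estimate is 2.9, whereas the pooled mean of 5.6 is pulled upward because the rare stratum is overrepresented in the sample.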
Cluster sampling selects groups of units, such as all transactions from a sampled set of stores. It is cheaper when units are naturally grouped but increases variance through intracluster correlation.
Systematic sampling takes every \(k\)th unit from an ordered frame. It is convenient but dangerous when the ordering has periodicity that aligns with \(k\).
53.3.2 3.2 Inverse probability weighting
When selection probabilities are unequal but known, unbiased estimation is recovered by weighting each unit by the inverse of its inclusion probability, the Horvitz-Thompson estimator:
\[ \hat{T} = \sum_{i \in S} \frac{y_i}{\pi_i}. \]
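A direct transcription of this estimator is short. The sketch below assumes each sampled unit arrives with its inclusion probability attached. The sample values are invented.

```python
def horvitz_thompson_total(sample):
    """Estimate a population total from (y_i, pi_i) pairs.

    Each sampled value is divided by its inclusion probability, so a unit
    drawn with probability 0.1 stands in for ten units of the population.
    """
    total = 0.0
    for y, pi in sample:
        if not 0.0 < pi <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        total += y / pi
    return total


# Three sampled units with unequal, known inclusion probabilities.
sample = [(10.0, 0.5), (4.0, 0.1), (6.0, 0.2)]
t_hat = horvitz_thompson_total(sample)  # 10/0.5 + 4/0.1 + 6/0.2
print(f"estimated total = {t_hat:.1f}")
```

The guard on \(\pi_i\) reflects the requirement that every unit have a known, nonzero selection probability. Without it the estimator is undefined.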
In machine learning this same logic justifies importance weighting of training examples when the collection distribution is known to differ from the deployment distribution. If we can estimate the density ratio \(w(x) = p_{\text{dep}}(x) / p_{\text{col}}(x)\), then the reweighted empirical risk
\[ \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} w(x_i) \, \ell(f(x_i), y_i) \]
is a consistent estimator of the deployment risk under covariate shift, the assumption that \(p(y \mid x)\) is stable while \(p(x)\) changes, provided \(p_{\text{col}}(x) > 0\) wherever \(p_{\text{dep}}(x) > 0\). The catch is that high-variance weights inflate the variance of the estimate, so reweighting is a corrective of last resort rather than a substitute for sound collection.
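The reweighted risk and its variance cost can be seen in a toy calculation. The sketch below assumes the density ratio is known exactly for two hypothetical groups, which is rarely true in practice. It also reports Kish's effective sample size, \((\sum_i w_i)^2 / \sum_i w_i^2\), as one common way to quantify how much uneven weights shrink the usable sample.

```python
def weighted_risk(losses, weights):
    """Importance-weighted empirical risk: (1/n) * sum_i w(x_i) * loss_i."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


def effective_sample_size(weights):
    """Kish's effective sample size; it shrinks as weights become uneven."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Assumed group proportions; the density ratio is known only by construction.
p_col = {"a": 0.8, "b": 0.2}
p_dep = {"a": 0.5, "b": 0.5}
density_ratio = {g: p_dep[g] / p_col[g] for g in p_col}  # w(x) per group

# Ten collected examples: group membership and the model's 0-1 loss on each.
groups = ["a"] * 8 + ["b"] * 2
losses = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
weights = [density_ratio[g] for g in groups]

unweighted = sum(losses) / len(losses)
reweighted = weighted_risk(losses, weights)
ess = effective_sample_size(weights)
print(f"unweighted = {unweighted}, reweighted = {reweighted}, ESS = {ess}")
```

In this example the reweighted risk is 0.3125 against an unweighted 0.2, and the ten examples carry the information of roughly 6.4 equally weighted ones.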
53.3.3 3.3 Convenience sampling and why it fails
A convenience sample draws whatever units are readily available, with no known selection probabilities. Volunteer panels, the first thousand users to opt in, and an unfiltered web scrape are all convenience samples. The fatal property is that \(\pi_i\) is unknown and, worse, often correlated with the very quantities being modeled. Because the selection probabilities cannot be recovered, no reweighting can remove the bias in general. The sample may still be useful, but inference from it is conditional on an untestable assumption that the selected units resemble the unselected ones on the outcome of interest. Many published failures of deployed models trace back, on inspection, to a convenience sample that was treated as if it were representative.
53.4 4. Selection Bias and Its Mechanisms
53.4.1 4.1 A formal view
Selection bias occurs when the probability that a unit enters the dataset depends on variables that also relate to the outcome. Introduce a selection indicator \(S \in \{0, 1\}\), equal to one when a unit is observed. We collect data from the conditional distribution \(p(x, y \mid S = 1)\), but we want to reason about the unconditional distribution \(p(x, y)\). These coincide only when selection is independent of the relevant variables.
The cleanest case is selection on observables, where \(S \perp y \mid x\). Here selection depends only on covariates we record, and inverse probability weighting can correct it. The dangerous case is selection on unobservables, where selection depends on the outcome itself even after conditioning on recorded covariates. No reweighting from the observed data can fix this, because the information needed to estimate the weights is precisely what selection removed.
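The selection-on-observables case can be simulated directly. The sketch below assumes a binary covariate, a selection probability that depends only on that covariate, and selection probabilities that are known exactly. It compares the naive mean of the outcome among selected units with a self-normalized inverse probability weighted mean. Every number is illustrative.

```python
import random

rng = random.Random(0)

# Assumed data-generating process (all numbers are illustrative).
P_Y_GIVEN_X = {0: 0.2, 1: 0.8}  # p(y = 1 | x)
P_S_GIVEN_X = {0: 0.1, 1: 0.9}  # p(S = 1 | x): selection depends on x only

observed = []
for _ in range(20000):
    x = rng.randint(0, 1)  # p(x = 1) = 0.5, so the true mean of y is 0.5
    y = 1 if rng.random() < P_Y_GIVEN_X[x] else 0
    if rng.random() < P_S_GIVEN_X[x]:  # the unit enters the dataset
        observed.append((x, y))

# The naive mean over selected units ignores how they were selected.
naive_mean = sum(y for _, y in observed) / len(observed)

# Self-normalized inverse probability weighting with weights 1 / p(S = 1 | x).
weights = [1.0 / P_S_GIVEN_X[x] for x, _ in observed]
ipw_mean = sum(w * y for w, (_, y) in zip(weights, observed)) / sum(weights)

print(f"naive = {naive_mean:.3f}, IPW = {ipw_mean:.3f}, truth = 0.500")
```

If selection instead depended on \(y\) directly, the weights \(1 / p(S = 1 \mid x)\) would no longer be the right ones, and nothing in the observed data would reveal that.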
53.4.2 4.2 Common patterns
Several recurring mechanisms deserve names because recognizing them early prevents expensive mistakes.
- Survivorship bias. Only units that persisted are observed. Modeling equipment failures from machines still in service omits those that already failed and were removed.
- Self-selection. Units choose whether to appear, and the choice relates to the outcome. Customers who respond to a satisfaction survey are not a random slice of customers.
- Label availability bias. Labels exist only for a subset, and that subset is not random. Loan repayment outcomes are observed only for applicants who were approved, so the training data systematically excludes the rejected region of input space.
- Feedback loops. The deployed model determines which inputs are seen, which determines the next training set. A recommender trained only on items it already surfaces never learns about items it never shows.
53.4.3 4.3 Designing against selection bias
The structural remedy is to introduce randomization into the collection process wherever feasible. A small exploration budget, in which a fraction of decisions are made randomly rather than by the current policy, breaks feedback loops and creates a probability sample of the region the model would otherwise never observe. In the lending example, occasionally approving a randomized subset of marginal applicants, within ethical and regulatory limits, generates the counterfactual labels needed to model the rejected region. The cost of this exploration is real, but it is the price of an unbiased view of the input space.
Collection policy with exploration:
    with probability 1 - epsilon:
        act according to current model
    with probability epsilon:
        act randomly and log the outcome
    # the epsilon-fraction is a probability sample
    # of the space the model would otherwise censor
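The policy above can be sketched in Python as follows. This is a minimal illustration, assuming a finite action set and uniform random exploration. The function `collect_decision` and the action names are hypothetical. The sketch also records the probability with which the policy chose each action, since that propensity is what later permits inverse probability weighting.

```python
import random


def collect_decision(model_action, actions, epsilon, rng):
    """Return (action, propensity, explored) for one decision.

    With probability epsilon the action is drawn uniformly at random;
    otherwise the current model's action is used.
    """
    explored = rng.random() < epsilon
    action = rng.choice(actions) if explored else model_action
    # Probability that this policy selects `action`, counting both branches.
    propensity = epsilon / len(actions)
    if action == model_action:
        propensity += 1.0 - epsilon
    return action, propensity, explored


rng = random.Random(0)
log = [
    collect_decision("reject", ["approve", "reject"], epsilon=0.1, rng=rng)
    for _ in range(10000)
]
explore_fraction = sum(explored for _, _, explored in log) / len(log)
print(f"fraction of randomized decisions: {explore_fraction:.3f}")
```

Logging the propensity alongside the outcome is a design choice rather than a requirement of the policy, but without it the explored decisions cannot be reweighted later.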
53.5 5. Labeling Strategies
53.5.1 5.1 Sources of labels
Labels can come from human annotation, from organic outcomes recorded by the system, from weak or programmatic heuristics, or from existing models. Each trades cost against quality. Organic outcome labels, such as whether a user clicked or a part failed, are cheap and aligned with the real objective but are subject to the availability bias of Section 4. Human labels are expensive but can be elicited for any region of input space, including the rare and the counterfactual.
53.5.2 5.2 Annotation quality and agreement
Human labels are noisy, and the noise is rarely uniform. The standard tool for quantifying labeler consistency is inter-annotator agreement corrected for chance. Cohen’s kappa for two annotators is
\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]
where \(p_o\) is observed agreement and \(p_e\) is agreement expected by chance. Low agreement signals an ambiguous labeling task, an inadequate guideline, or insufficient annotator training, and it caps the accuracy any model can reach, since the labels themselves are the ceiling. Investment in clear annotation guidelines, adjudication of disagreements, and gold-standard test items usually yields more model improvement per dollar than additional raw volume.
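The formula can be computed from two annotators' label lists with the standard library alone. In the sketch below, \(p_e\) is taken as the sum over classes of the product of the two annotators' marginal label frequencies, which is the usual definition for Cohen's kappa. The example labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal frequencies, summed over classes.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


annotator_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

In this example the annotators agree on 8 of 10 items and chance agreement is 0.5, giving \(\kappa = 0.6\), noticeably lower than the raw agreement of 0.8.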
53.5.3 5.3 Allocating the labeling budget
Labeling budget is finite, so its allocation is a sampling design in its own right. Three principles guide it. First, allocate disproportionate effort to rare classes and to decision boundaries, since uniform labeling wastes effort on regions where the model is already confident. Active learning formalizes this by selecting for annotation the examples whose labels are expected to be most informative, often those near the current decision boundary. Second, preserve a held-out, randomly sampled evaluation set drawn to match the deployment distribution, because an evaluation set distorted by active learning no longer estimates deployment performance. Third, label redundantly where ambiguity is high and singly where it is low, spending agreement measurement only where it matters.
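The first two principles can be combined in a few lines. The sketch below assumes a binary classifier that outputs a probability for each unlabeled item and that the unlabeled pool is itself drawn from the deployment distribution. It reserves a random evaluation set first, then applies uncertainty sampling, one simple active learning heuristic, to what remains. The function name and the toy probabilities are hypothetical.

```python
import random


def allocate_labels(pool_ids, predicted_prob, n_eval, n_active, rng):
    """Split a labeling budget into a random eval set and an active set."""
    if n_eval + n_active > len(pool_ids):
        raise ValueError("budget exceeds pool size")
    # Reserve the evaluation set by simple random sampling, before any
    # model-driven selection can distort it.
    eval_ids = rng.sample(pool_ids, n_eval)
    held_out = set(eval_ids)
    candidates = [i for i in pool_ids if i not in held_out]
    # Uncertainty sampling: predictions closest to 0.5 come first.
    candidates.sort(key=lambda i: abs(predicted_prob[i] - 0.5))
    return eval_ids, candidates[:n_active]


rng = random.Random(0)
pool_ids = list(range(100))
predicted_prob = {i: i / 99 for i in pool_ids}  # toy model scores
eval_ids, active_ids = allocate_labels(pool_ids, predicted_prob, 20, 10, rng)
print("evaluation items:", sorted(eval_ids))
print("actively selected items:", sorted(active_ids))
```

Reserving the evaluation set before the active selection is the point of the ordering: the two sets are disjoint, and the evaluation set remains a random sample of the pool.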
53.5.4 5.4 Weak supervision
When expert labels are scarce, weak supervision combines many noisy labeling sources, such as heuristics, knowledge bases, and patterns, into probabilistic labels by modeling each source’s accuracy and correlations. This trades label noise for label volume and can be effective, but it introduces a new dependency. The biases of the labeling heuristics propagate into the labels, so the same scrutiny applied to data sources must be applied to label sources.
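A deliberately simplified sketch of the idea follows. It is not the label model of any particular system. It assumes binary labels, a uniform class prior, labeling sources that are conditionally independent given the true label, abstentions that carry no information, and source accuracies that are already known. Under those assumptions the posterior is a naive Bayes combination of the votes. Modeling correlations between sources, as the text describes, would require a richer model.

```python
import math


def probabilistic_label(votes, accuracies):
    """Posterior P(y = +1 | votes) under the stated naive Bayes assumptions.

    votes: one entry per source, +1, -1, or 0 for abstain.
    accuracies: assumed probability that each source is right when it votes.
    """
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        if not 0.0 < acc < 1.0:
            raise ValueError("accuracies must lie strictly between 0 and 1")
        # An abstention (vote == 0) contributes nothing to the log-odds.
        log_odds += vote * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Three hypothetical heuristics: one reliable, two weak and disagreeing.
p_positive = probabilistic_label(votes=[1, 1, -1], accuracies=[0.9, 0.6, 0.6])
p_abstain = probabilistic_label(votes=[0, 0, 0], accuracies=[0.9, 0.6, 0.6])
print(f"P(y=+1) = {p_positive:.2f}, all abstain = {p_abstain:.2f}")
```

In this example the two weak sources cancel, so the probabilistic label equals the reliable source's accuracy, 0.9. If the assumed accuracies are wrong, that error flows straight into the labels, which is the dependency the paragraph above warns about.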
53.6 6. Matching Collection to the Deployment Distribution
53.6.1 6.1 Specifying the deployment distribution
Designing collection to match deployment begins with making the deployment distribution explicit. This means writing down the axes along which inputs vary in production and the expected mass along each axis: the mix of device types, geographies, languages, time of day, user segments, and input difficulty. This specification is a falsifiable artifact. It can be checked against production telemetry once the system is live, and discrepancies become collection requirements for the next iteration.
53.6.2 6.2 Stratifying collection to the target mix
With the deployment mix specified, stratified collection becomes the natural tool. Define strata along the axes that matter and set sampling rates so the collected mix matches the target mix, or deliberately oversample rare strata and reweight, accepting the variance cost in exchange for guaranteed coverage. Oversampling rare but consequential strata, such as fraud cases or safety-critical edge cases, is usually correct even though it distorts the training mix, because the alternative is a model that has seen too few examples to learn the rare behavior at all. The distortion is then corrected with class weights in the loss or with inverse probability weighting in evaluation.
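The correction step amounts to weighting each stratum by the ratio of its target share to its collected share. The sketch below assumes a two-stratum fraud example with an invented target mix and invented per-stratum results, and uses the weights to recover a deployment-matched accuracy from an oversampled evaluation set.

```python
def stratum_weights(target_mix, collected_counts):
    """Weight per stratum: target share divided by collected share."""
    n = sum(collected_counts.values())
    weights = {}
    for stratum, target_share in target_mix.items():
        count = collected_counts.get(stratum, 0)
        if count == 0:
            # No reweighting can repair a stratum that was never collected.
            raise ValueError(f"stratum {stratum!r} has no collected units")
        weights[stratum] = target_share / (count / n)
    return weights


def weighted_accuracy(correct_counts, collected_counts, weights):
    """Accuracy with each unit weighted by its stratum weight."""
    num = sum(weights[h] * correct_counts[h] for h in weights)
    den = sum(weights[h] * collected_counts[h] for h in weights)
    return num / den


target_mix = {"fraud": 0.01, "legit": 0.99}     # assumed deployment mix
collected_counts = {"fraud": 500, "legit": 500}  # fraud heavily oversampled
correct_counts = {"fraud": 400, "legit": 495}    # hypothetical model results

weights = stratum_weights(target_mix, collected_counts)
raw_acc = sum(correct_counts.values()) / sum(collected_counts.values())
dep_acc = weighted_accuracy(correct_counts, collected_counts, weights)
print(f"raw accuracy = {raw_acc:.4f}, deployment-weighted = {dep_acc:.4f}")
```

With these numbers the raw accuracy on the oversampled set is 0.895, while the deployment-weighted accuracy is 0.9881. Neither figure alone shows the 80 percent accuracy on the fraud stratum, which is why per-stratum reporting is worth keeping alongside the weighted total.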
53.6.3 6.3 Temporal matching and drift
The deployment distribution is not static. Concept drift, in which \(p(y \mid x)\) changes over time, and covariate drift, in which \(p(x)\) changes, both erode a model collected from a fixed historical window. Collection strategy must therefore include a refresh cadence and a temporally held-out evaluation set, validating on data strictly later than the training window to estimate how performance decays between retraining cycles. A model evaluated only on a random split of historical data systematically overstates its deployment performance because it never confronts drift.
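A temporal split is simple to implement once each record carries a timestamp. The sketch below assumes records are dictionaries with a `timestamp` field, a hypothetical name, and places everything before a cutoff date in training and everything on or after it in evaluation.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on records before the cutoff, evaluate on the rest."""
    train = [r for r in records if r["timestamp"] < cutoff]
    test = [r for r in records if r["timestamp"] >= cutoff]
    return train, test


# One toy record per month of a year.
records = [{"timestamp": date(2023, m, 1), "x": m} for m in range(1, 13)]
train, test = temporal_split(records, cutoff=date(2023, 10, 1))

latest_train = max(r["timestamp"] for r in train)
earliest_test = min(r["timestamp"] for r in test)
print(f"train through {latest_train}, evaluate from {earliest_test}")
```

Unlike a random split, this guarantees that no evaluation record predates any training record, so the measured performance includes whatever drift occurred across the cutoff.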
53.6.4 6.4 Monitoring and closing the loop
Finally, collection is not a one-time event but a standing process. Production monitoring should compare the live input distribution against the collection specification and raise an alarm when a stratum’s mass shifts beyond tolerance. A simple population stability measure or a two-sample test on incoming features against the training features turns the abstract goal of matching distributions into an operational signal. When the alarm fires, the response is a targeted collection campaign for the drifted stratum, closing the loop between deployment and collection. The systems that remain reliable in production are those whose data collection is treated as a continuous control problem rather than a finished task.
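One common population stability measure is the population stability index, \(\sum_c (a_c - e_c) \ln(a_c / e_c)\), where \(e_c\) and \(a_c\) are the expected and live shares of category \(c\). The sketch below computes it for a categorical feature from raw counts. The small `floor` guarding against empty categories and the alarm threshold `TOLERANCE` are assumptions chosen for illustration, not recommended values.

```python
import math


def population_stability_index(expected_counts, actual_counts, floor=1e-6):
    """PSI between a reference and a live categorical distribution."""
    categories = set(expected_counts) | set(actual_counts)
    n_exp = sum(expected_counts.values())
    n_act = sum(actual_counts.values())
    psi = 0.0
    for c in categories:
        # Floor the shares so an empty category does not produce log(0).
        e = max(expected_counts.get(c, 0) / n_exp, floor)
        a = max(actual_counts.get(c, 0) / n_act, floor)
        psi += (a - e) * math.log(a / e)
    return psi


TOLERANCE = 0.1  # hypothetical alarm threshold; tune per feature

training_devices = {"mobile": 500, "desktop": 500}  # collection-time mix
live_devices = {"mobile": 70, "desktop": 30}        # recent production mix

psi = population_stability_index(training_devices, live_devices)
alarm = psi > TOLERANCE
print(f"PSI = {psi:.4f}, alarm = {alarm}")
```

Each term of the sum is non-negative, so the index is zero only when the two mixes match. Which category drove the alarm can be read off from the individual terms, which is what points the follow-up collection campaign at a specific stratum.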
53.7 7. A Practical Checklist
The principles above reduce to a sequence of questions to ask before committing to a collection plan.
1. What is the deployment distribution p_dep(x, y)?
   Write down the axes of variation and their target mix.
2. What is the sampling frame? Which subpopulations does it omit?
3. Is selection a probability sample or a convenience sample?
   If convenience, what untestable assumption am I relying on?
4. Does selection depend on the outcome (selection on unobservables)?
   If so, can I introduce randomized exploration to break it?
5. Where do labels come from, and what is their bias and noise?
6. Are rare, consequential strata oversampled and then reweighted?
7. Is the evaluation set drawn to match deployment and held out in time?
8. Is there a monitor comparing live inputs to the collection spec?
Working through these questions does not guarantee a perfect dataset, because perfect representativeness is rarely attainable. What it guarantees is that the gaps between the collected and deployment distributions are known, documented, and accounted for, rather than discovered through a production failure. That awareness is the difference between a dataset that supports valid inference and one that merely looks like data.
53.8 References
- Cochran, W. G. Sampling Techniques, 3rd ed. Wiley, 1977. https://www.wiley.com/en-us/Sampling+Techniques%2C+3rd+Edition-p-9780471162407
- Lohr, S. L. Sampling: Design and Analysis, 3rd ed. CRC Press, 2021. https://www.routledge.com/Sampling-Design-and-Analysis/Lohr/p/book/9780367279509
- Horvitz, D. G., and Thompson, D. J. “A Generalization of Sampling Without Replacement From a Finite Universe.” Journal of the American Statistical Association, 1952. https://www.jstor.org/stable/2280784
- Heckman, J. J. “Sample Selection Bias as a Specification Error.” Econometrica, 1979. https://www.jstor.org/stable/1912352
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. MIT Press, 2009. https://mitpress.mit.edu/9780262170055/dataset-shift-in-machine-learning/
- Sugiyama, M., and Kawanabe, M. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012. https://mitpress.mit.edu/9780262017091/
- Settles, B. “Active Learning Literature Survey.” Computer Sciences Technical Report, University of Wisconsin-Madison, 2009. https://minds.wisconsin.edu/handle/1793/60660
- Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. “Snorkel: Rapid Training Data Creation with Weak Supervision.” VLDB, 2017. https://www.vldb.org/pvldb/vol11/p269-ratner.pdf
- Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. “Datasheets for Datasets.” Communications of the ACM, 2021. https://dl.acm.org/doi/10.1145/3458723
- Cohen, J. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement, 1960. https://journals.sagepub.com/doi/10.1177/001316446002000104