74 Data Augmentation Principles

Data augmentation is one of the most reliable levers in applied machine learning. By generating additional training examples through transformations of existing data, practitioners routinely improve generalization without collecting a single new label. Yet augmentation is often treated as a bag of tricks rather than a principled component of the learning system. This chapter develops the conceptual foundations. We frame augmentation as a way to encode invariances into a model, examine its role as a regularizer, formalize what it means for a transform to preserve labels, connect supervised augmentation to consistency regularization in the semi-supervised setting, and conclude with a sober account of when augmentation helps and when it quietly hurts.

The throughline is a single idea. Augmentation is a way of writing down what we already believe about a task, namely that certain changes to an input should not change its label, and of turning that belief into training signal. Seen this way, the choice of transformations is not a tuning detail but a modeling decision on par with the choice of architecture or loss. The chapter is organized to make that decision explicit and to expose the precise assumption (label preservation under a transformation family) on which every benefit of augmentation rests.

flowchart TD
    A["Task symmetries we believe in"] --> B["Transformation family T"]
    B --> C["Augmentation distribution q(theta)"]
    C --> D["Augmented risk over labeled data"]
    C --> E["Consistency loss over unlabeled data"]
    D --> F["Model that is approximately invariant"]
    E --> F
    G["Label preservation assumption"] --> D
    G --> E
    F --> H["Better generalization if T matches the truth"]

74.1 1. Augmentation as Encoding Invariances

74.1.1 1.1 The invariance hypothesis

Most prediction tasks come with a built-in symmetry. A cat remains a cat when the image is flipped horizontally, shifted by a few pixels, or brightened slightly. A spoken word retains its meaning under small changes in pitch or background noise. A sentence keeps its sentiment when a word is swapped for a synonym. These symmetries express prior knowledge about the task that the raw training data only partially reveals.

Formally, suppose we have an input space $\mathcal{X}$, a label space $\mathcal{Y}$, and a family of transformations $\mathcal{T} = \{t_\theta : \mathcal{X} \to \mathcal{X}\}$ indexed by parameters $\theta$. The task is said to be invariant under $\mathcal{T}$ if the true conditional distribution satisfies

\[ p(y \mid x) = p(y \mid t_\theta(x)) \quad \text{for all } t_\theta \in \mathcal{T}. \]

We would like our learned predictor $f_\phi$ to respect this property, that is, $f_\phi(x) \approx f_\phi(t_\theta(x))$. Data augmentation is the empirical mechanism for instilling that behavior. Rather than hard-wiring the invariance into the architecture, which is possible but restrictive, we sample transformations during training and ask the model to produce consistent outputs.

It is worth distinguishing two related notions precisely, because the literature uses them loosely. A predictor $f_\phi : \mathcal{X} \to \mathcal{Y}$ is invariant under $\mathcal{T}$ if $f_\phi(t_\theta(x)) = f_\phi(x)$ for all $x$ and all $t_\theta \in \mathcal{T}$, so the output does not move when the input is transformed. A feature map $g_\phi : \mathcal{X} \to \mathbb{R}^d$ is equivariant under $\mathcal{T}$ if there is a corresponding transformation $\rho_\theta$ on the feature space with $g_\phi(t_\theta(x)) = \rho_\theta(g_\phi(x))$, so the output transforms in a predictable, structured way rather than staying fixed. Classification typically wants invariance of the final decision, while intermediate representations are often most useful when they are equivariant. Augmentation primarily targets invariance of the head, but by training the whole network under transformed inputs it tends to encourage approximately equivariant intermediate features as a side effect.
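The distinction can be checked numerically on a toy case. The sketch below assumes one-dimensional signals stored as Python lists and takes the transformation family to be cyclic shifts; the helper names (`shift`, `circular_smooth`, `total`) are illustrative choices, not part of any library.

```python
def shift(x, k):
    """Cyclic shift t_k: move every entry k positions to the right."""
    n = len(x)
    return [x[(i - k) % n] for i in range(n)]

def circular_smooth(x):
    """Feature map g: 3-tap circular moving average."""
    n = len(x)
    return [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]

def total(x):
    """Predictor f: global sum pooling over positions."""
    return sum(x)

x = [0.0, 1.0, 4.0, 2.0, 0.0, 3.0]
k = 2

# Equivariance: g(t_k(x)) equals rho_k(g(x)), where rho_k is the same shift
# applied in feature space.
lhs = circular_smooth(shift(x, k))
rhs = shift(circular_smooth(x), k)
equivariant = all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))

# Invariance: f(t_k(x)) equals f(x), so the output does not move at all.
invariant = abs(total(shift(x, k)) - total(x)) < 1e-12
```

The smoothing filter moves its output along with the input, while the pooled sum ignores the shift entirely, which mirrors the usual pattern of equivariant intermediate features feeding an invariant head.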

Two qualifiers keep the invariance hypothesis honest. First, real symmetries are usually approximate and bounded. The label of a handwritten digit is invariant to small rotations but not to a rotation by one hundred eighty degrees that turns a 6 into a 9; the safe set of $\theta$ is a neighborhood of the identity, not the whole group. Second, invariance is a property of the labeling, not of perception. A transform is admissible exactly when it preserves the conditional $p(y \mid x)$, which is a claim about the task we have defined, not about whether a human finds the transformed input natural.

74.1.2 1.2 Augmentation as expanding the training distribution

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ be the labeled dataset drawn from a distribution $p_{\text{data}}$. Augmentation replaces this finite sample with an enlarged, effectively infinite distribution. Each example is paired with a distribution over transformations $q(\theta)$, and the augmented risk becomes

\[ R_{\text{aug}}(\phi) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\theta \sim q} \big[ \ell\big(f_\phi(t_\theta(x_i)), y_i\big) \big]. \]

The expectation over $\theta$ is the heart of the matter. We are no longer fitting the model to $n$ points but to a tube of inputs surrounding each point, traced out by the transformation family. If the invariance hypothesis holds, every input in that tube genuinely carries label $y_i$, so the augmented risk is an unbiased estimate of risk on a richer distribution than the one we sampled.
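As a rough illustration, the expectation in $R_{\text{aug}}$ can be estimated by sampling. The sketch below is a Monte Carlo estimator under assumed toy ingredients: scalar inputs, additive jitter as the transform, and a threshold model. All of these, and the function name `augmented_risk`, are hypothetical.

```python
import random

def augmented_risk(data, model, loss, transform, sample_theta, num_samples=16, seed=0):
    """Monte Carlo estimate of R_aug: for each (x_i, y_i), average the loss
    over num_samples transforms drawn from q(theta), then average over i."""
    rng = random.Random(seed)
    total = 0.0
    for x, y in data:
        draws = (loss(model(transform(x, sample_theta(rng))), y)
                 for _ in range(num_samples))
        total += sum(draws) / num_samples
    return total / len(data)

# Toy setup (all hypothetical): scalar inputs, additive jitter, threshold model.
data = [(-1.0, 0.0), (1.0, 1.0), (0.05, 1.0)]
model = lambda x: 1.0 if x > 0 else 0.0
loss = lambda pred, y: (pred - y) ** 2
transform = lambda x, theta: x + theta

# q(theta) concentrated at zero recovers plain empirical risk.
erm_risk = augmented_risk(data, model, loss, transform, lambda rng: 0.0)
# q(theta) uniform on [-0.1, 0.1] evaluates a tube around each point.
aug_risk = augmented_risk(data, model, loss, transform,
                          lambda rng: rng.uniform(-0.1, 0.1))
```

The third point sits close to the model's threshold, so part of its tube can fall on the other side; the augmented risk can register that sensitivity while the plain empirical risk cannot. In practice, training loops usually draw a single fresh $\theta$ per example per step rather than averaging several draws at once.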

This construction has a precise name. It is vicinal risk minimization (Chapelle et al. 2001): the empirical distribution that places a point mass at each $(x_i, y_i)$ is replaced by a vicinity distribution that smears probability over a neighborhood of each example. Standard empirical risk minimization is the degenerate case where the vicinity is a Dirac spike. Augmentation chooses a vicinity by choosing $q(\theta)$, and the augmented risk $R_{\text{aug}}$ is exactly the empirical risk under that vicinal distribution. Framing augmentation this way clarifies the bias and variance tradeoff. A wider vicinity (stronger augmentation) lowers the variance of the risk estimate because it averages over more directions, but it raises bias whenever the vicinity leaks outside the true label region, which is precisely the label preservation failure of Section 3.

A useful mental model is that augmentation injects domain knowledge that the model would otherwise have to discover from data. A convolutional network can in principle learn approximate translation invariance from raw pixels, but it needs many examples to do so. Supplying shifted copies short-circuits that learning and frees capacity for the genuinely discriminative features.

74.1.3 1.3 The geometry of orbits

The set of points reachable from a single $x$ under the transformation family is its orbit, $\mathcal{O}(x) = \{ t_\theta(x) : t_\theta \in \mathcal{T} \}$. When the transformations form a group, the orbits partition the input space into equivalence classes, and an invariant predictor is constant on each orbit. Augmentation encourages the model to collapse each orbit to a single decision, which reduces the effective dimensionality of the function the model must learn.

There is a clean way to see why averaging over an orbit cannot hurt and often helps. Suppose $\mathcal{T}$ is a finite group $G$ acting on inputs, and define the symmetrized predictor by averaging a base predictor $h$ over the group,

\[ \bar{h}(x) = \frac{1}{|G|} \sum_{g \in G} h(g \cdot x). \]

By construction $\bar{h}$ is exactly invariant, since relabeling the sum by $g \mapsto g'g$ leaves it unchanged. If the task is truly invariant, so the optimal predictor is constant on orbits, then for a loss that is convex in the prediction, Jensen’s inequality applied pointwise gives $\ell(\bar{h}(x), y) \le \tfrac{1}{|G|}\sum_{g} \ell(h(g \cdot x), y)$, and taking expectations yields $R(\bar{h}) \le \tfrac{1}{|G|}\sum_{g} R(h \circ g) = R(h)$, where the last equality uses that the data distribution is itself invariant. In words, projecting onto the invariant function class never increases risk under these assumptions, and for a strictly convex loss it strictly decreases risk whenever $h$ disagrees across an orbit on a set of positive probability. Augmentation is the stochastic counterpart of this averaging: instead of summing over all of $G$ at inference, we sample group elements during training and push the learned $f_\phi$ toward the invariant subspace. The benefit is the variance reduction of replacing one orbit sample by an average, and the catch is that the argument assumes the data distribution is genuinely $G$ invariant, which is exactly label preservation restated.
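The symmetrization argument can be verified on a small example. The sketch below assumes the two-element group $G = \{\text{identity}, \text{flip}\}$ acting on short lists, a deliberately asymmetric linear base predictor, squared loss, and a toy dataset closed under $G$ so that the empirical distribution is itself invariant. The weights and data values are made up for illustration.

```python
def flip(x):
    """Group action: reverse the list (a stand-in for a horizontal flip)."""
    return x[::-1]

GROUP = [lambda x: x, flip]  # G = {identity, flip}, so |G| = 2

def h(x):
    """Base predictor: linear, and deliberately not flip-invariant."""
    weights = [0.9, 0.1, -0.3]
    return sum(w * v for w, v in zip(weights, x))

def h_bar(x):
    """Symmetrized predictor: average h over the group."""
    return sum(h(g(x)) for g in GROUP) / len(GROUP)

def risk(predictor, data):
    """Empirical risk under squared loss, which is convex in the prediction."""
    return sum((predictor(x) - y) ** 2 for x, y in data) / len(data)

# Close a small base set under G so the empirical distribution is G-invariant.
base = [([1.0, 0.0, 2.0], 0.5), ([0.0, 3.0, 1.0], 0.2)]
data = [(g(x), y) for x, y in base for g in GROUP]

risk_base = risk(h, data)
risk_symmetrized = risk(h_bar, data)
is_invariant = all(abs(h_bar(flip(x)) - h_bar(x)) < 1e-12 for x, _ in data)
```

Because `h` gives different outputs on an input and its flip, averaging removes that disagreement, and the Jensen bound guarantees `risk_symmetrized` does not exceed `risk_base` on this invariant dataset.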

This view also clarifies a failure mode. If two classes share parts of their orbits under the chosen transformations, so $\mathcal{O}(x_i) \cap \mathcal{O}(x_j) \neq \varnothing$ for $y_i \neq y_j$, then augmentation forces the model to assign the same label to genuinely different inputs, and no invariant predictor can separate the classes. We return to this in the worked example of Section 3.4 and to its practical consequences in Section 5.
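The orbit collision is easy to reproduce on tiny bitmaps. The sketch below assumes hand-drawn 3 by 5 glyphs for a 6 and a 9 (the exact pixel patterns are made up) and compares orbits under two candidate transformation families.

```python
def rotate_180(img):
    """Rotate a bitmap, stored as a tuple of row strings, by 180 degrees."""
    return tuple(row[::-1] for row in reversed(img))

def identity(img):
    return img

def orbit(img, transforms):
    """Set of images reachable from img under the given transforms."""
    return {t(img) for t in transforms}

SIX = (
    "###",
    "#..",
    "###",
    "#.#",
    "###",
)
NINE = (
    "###",
    "#.#",
    "###",
    "..#",
    "###",
)

safe_family = [identity]
unsafe_family = [identity, rotate_180]

# Non-empty intersection means the two classes share an orbit point.
collision_safe = orbit(SIX, safe_family) & orbit(NINE, safe_family)
collision_unsafe = orbit(SIX, unsafe_family) & orbit(NINE, unsafe_family)
```

With the half-turn included, the two orbits coincide, so any predictor that is invariant under this family must give the 6 and the 9 the same output.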

74.2 2. Augmentation as Regularization

74.2.1 2.1 Why augmentation reduces variance

Regularization, broadly, is any modification to a learning procedure that reduces the variance of the fitted model at the cost of some bias. Augmentation fits this description cleanly. By training on a continuum of perturbed inputs, the model cannot exploit idiosyncratic pixel level or token level patterns that happen to correlate with labels in the finite sample. Those spurious patterns are washed out by the transformation noise, while the stable, invariant structure survives averaging.

Consider the bias variance decomposition of expected test error. A model with high capacity trained on a small dataset typically sits in the high variance regime, memorizing training points. Augmentation enlarges the support of the training distribution, so the same model now interpolates over a smoother target. Empirically this manifests as a smaller gap between training and validation accuracy.

74.2.2 2.2 The connection to explicit penalties

Augmentation with small perturbations can be shown to approximate an explicit regularization penalty on the model. Consider additive noise augmentation $t_\theta(x) = x + \theta$ with $\theta \sim \mathcal{N}(0, \sigma^2 I)$, so $\mathbb{E}[\theta] = 0$ and $\mathbb{E}[\theta\theta^\top] = \sigma^2 I$, and a squared error loss $\ell(f, y) = \tfrac{1}{2}(f - y)^2$ with scalar output. Taylor expanding the model around $x$,

\[ f_\phi(x + \theta) = f_\phi(x) + \nabla_x f_\phi(x)^\top \theta + \tfrac{1}{2}\theta^\top H(x)\,\theta + O(\|\theta\|^3), \]

where $H$ is the input Hessian, and substituting into the loss and taking the expectation over $\theta$, the linear term vanishes because $\mathbb{E}[\theta] = 0$, and the quadratic term contributes through $\mathbb{E}[\theta\theta^\top] = \sigma^2 I$. Collecting terms,

\[ \mathbb{E}_\theta\big[\ell(f_\phi(x+\theta), y)\big] \approx \ell(f_\phi(x), y) + \frac{\sigma^2}{2}\, \|\nabla_x f_\phi(x)\|^2 + \frac{\sigma^2}{2}\, r(x)\,\mathrm{tr}\,H(x) + O(\sigma^4), \]

where $r(x) = f_\phi(x) - y$ is the residual. The leading penalty is a Tikhonov style term on the gradient of the model with respect to its input (Bishop 1995). The Hessian term is weighted by the residual, so near a good fit, where residuals are small, it is negligible and the gradient norm penalty dominates. In words, noise augmentation discourages the function from changing rapidly near training points, which is exactly the smoothness prior we associate with good generalization. Richer transformations such as rotations or crops do not reduce to such a clean closed form, but the qualitative effect is the same. Linearizing $t_\theta(x) \approx x + \theta\, \partial_\theta t_\theta(x)|_{\theta=0}$ shows that the induced penalty weights the directional derivative of $f_\phi$ along the tangent directions of the transformation family, penalizing sensitivity to exactly the directions that the family declares label irrelevant while leaving discriminative directions unconstrained.
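The second-order expansion can be sanity checked by simulation. The sketch below assumes a toy model $f(x) = w^\top x + \tfrac{c}{2}\|x\|^2$, chosen because its input gradient ($w + c\,x$) and Hessian ($c\,I$) are available in closed form; the particular numbers are arbitrary.

```python
import random

def f(x, w, c):
    """Toy model: f(x) = w.x + (c/2)||x||^2, so grad = w + c*x and H = c*I."""
    return sum(wi * xi for wi, xi in zip(w, x)) + 0.5 * c * sum(xi * xi for xi in x)

def sq_loss(pred, y):
    return 0.5 * (pred - y) ** 2

x, y = [1.0, -0.5], 0.3
w, c, sigma = [0.7, -0.2], 0.4, 0.1

# Left-hand side: Monte Carlo average of the loss under Gaussian input noise.
rng = random.Random(0)
n = 100_000
noisy = sum(
    sq_loss(f([xi + rng.gauss(0.0, sigma) for xi in x], w, c), y)
    for _ in range(n)
) / n

# Right-hand side: clean loss + gradient penalty + residual-weighted Hessian term.
grad = [wi + c * xi for wi, xi in zip(w, x)]
residual = f(x, w, c) - y
trace_h = c * len(x)
clean = sq_loss(f(x, w, c), y)
approx = (clean
          + 0.5 * sigma ** 2 * sum(g * g for g in grad)
          + 0.5 * sigma ** 2 * residual * trace_h)
```

For this model the neglected remainder is of order $\sigma^4$, so `noisy` and `approx` should agree up to Monte Carlo error, and both should sit visibly above `clean` by roughly the size of the penalty terms.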

74.2.3 2.3 Interaction with other regularizers

Augmentation is not a substitute for weight decay, dropout, or early stopping; it is complementary, and it operates on a different axis. Weight decay constrains the parameters, dropout constrains co-adaptation of units, and augmentation constrains the function along data manifold directions. Because these mechanisms target different sources of overfitting, stacking them usually helps, though the marginal benefit of each typically diminishes as others are added. A practical consequence is that the optimal augmentation strength interacts with the optimal weight decay and learning rate, so these hyperparameters should be tuned jointly rather than in isolation.

74.3 3. Label Preserving Transforms

74.3.1 3.1 The label preservation requirement

The entire justification in Section 1 rests on a single assumption: the transformation does not change the label. A transform $t_\theta$ is label preserving for example $(x, y)$ if $y$ remains the correct target for $t_\theta(x)$. When this holds, augmented examples are valid training signal. When it fails, augmentation injects label noise, and the model is trained to make confident but wrong predictions.

The requirement is not a property of the transform alone; it is a property of the transform together with the task and the data. Horizontal flipping preserves the label “cat” but destroys the label in a digit recognition task, since a flipped 2 is not a 2 and a flipped 6 may resemble nothing valid. Color jitter is harmless for object recognition but catastrophic for a task that classifies flowers by color. The first discipline of augmentation is therefore to enumerate, for the specific task, which symmetries the labels actually respect.

74.3.2 3.2 A taxonomy of common transforms

The following sketch organizes widely used transforms by modality and by the invariance they encode.

Vision
  geometric:   flip, rotate, crop, scale, translate, shear
  photometric: brightness, contrast, hue, saturation, blur, noise
  occlusion:   cutout, random erasing
  mixing:      mixup, cutmix (note: these change the label)

Text
  lexical:     synonym replacement, random insertion/deletion
  structural:  back translation, paraphrase generation
  embedding:   word/token dropout, noise in embedding space

Audio
  temporal:    time stretch, time shift, time masking
  spectral:    pitch shift, frequency masking, additive noise

The mixing methods in the vision list deserve a flag. Mixup forms a convex combination $\tilde{x} = \lambda x_i + (1-\lambda) x_j$ and a matching soft label $\tilde{y} = \lambda y_i + (1-\lambda) y_j$. This is not label preserving in the strict sense; it deliberately interpolates labels. It works not by encoding an invariance but by enforcing linear behavior between examples, a distinct regularizing mechanism worth keeping conceptually separate.
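A minimal sketch of the mixup combination follows, assuming inputs and one-hot labels are plain Python lists. Drawing $\lambda$ from a $\mathrm{Beta}(\alpha, \alpha)$ distribution follows the mixup paper (Zhang et al. 2018); the value $\alpha = 0.2$ and the example vectors are illustrative assumptions.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Return a convex combination of two examples and of their labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mix = lambda a, b: [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]
    return mix(x_i, x_j), mix(y_i, y_j), lam

rng = random.Random(0)
x_tilde, y_tilde, lam = mixup(
    [0.0, 1.0, 2.0], [1.0, 0.0],   # example i with one-hot label for class 0
    [4.0, 3.0, 2.0], [0.0, 1.0],   # example j with one-hot label for class 1
    alpha=0.2, rng=rng,
)
```

The soft label `y_tilde` still sums to one, but it no longer matches the label of either parent, which is the sense in which mixup is not label preserving.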

74.3.3 3.3 Class conditional and instance conditional validity

Label preservation can hold for some classes and not others, or for some instances and not others. Rotation by ninety degrees is safe for most natural objects but flips the meaning of an arrow, a clock, or text. The safe range of an augmentation is frequently narrower than practitioners assume. A robust practice is to make the augmentation distribution conditional where necessary, applying aggressive rotations only to classes known to be rotation invariant and restricting the range elsewhere.

The deeper point is that the augmentation policy is part of the model specification. Choosing $q(\theta)$ is choosing a prior over the function class, and a mismatched prior degrades performance just as a poorly chosen architecture does.

74.3.4 3.4 A worked example: rotation on two tasks

To make label preservation concrete, contrast the same transform on two tasks. Let the transform be rotation by an angle $\alpha$, with augmentation distribution $q(\alpha)$ uniform on a range $[-\beta, \beta]$.

On a microscopy task that classifies cell types, the imaging geometry has no preferred orientation. The conditional $p(y \mid x)$ is genuinely invariant to any rotation, so the orbit of every image stays inside a single class. Here the safe range is the full circle, $\beta = 180^\circ$, and large $\beta$ is not just permissible but beneficial, because it averages over a symmetry the data truly has and shrinks the effective hypothesis space exactly as the symmetrization argument of Section 1.3 predicts.

On a handwritten digit task the same transform behaves very differently. Small rotations preserve labels: a 7 tilted by ten degrees is still a 7. But the safe range is narrow and class dependent. A 6 rotated near one hundred eighty degrees becomes a 9, so the orbit of a 6 collides with the orbit of a 9, the intersection condition $\mathcal{O}(x_6) \cap \mathcal{O}(x_9) \neq \varnothing$ from Section 1.3 is met, and any invariant predictor must confuse the two. A 4 and a 7 admit wider safe ranges than a 6 or a 9. The correct policy is therefore a small global $\beta$ (often cited as roughly ten to fifteen degrees in practice) or, better, a class conditional range. The contrast is the entire lesson of the section in one transform. The admissibility of an augmentation is a joint property of transform, task, and even class, never of the transform alone.
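One way to express a class conditional range in code is a per-class lookup for $\beta$. The sketch below is only an illustration: the limits in `ROTATION_LIMIT_DEG` and `DEFAULT_LIMIT_DEG` are placeholder values, not recommendations, and the function name is hypothetical.

```python
import random

# Hypothetical per-class rotation limits in degrees; the numbers are placeholders.
ROTATION_LIMIT_DEG = {"6": 10.0, "9": 10.0}
DEFAULT_LIMIT_DEG = 15.0

def sample_rotation(label, rng, limits=ROTATION_LIMIT_DEG, default=DEFAULT_LIMIT_DEG):
    """Draw alpha ~ Uniform[-beta, beta], with beta chosen per class."""
    beta = limits.get(label, default)
    return rng.uniform(-beta, beta)

rng = random.Random(0)
angles_for_6 = [sample_rotation("6", rng) for _ in range(1000)]
angles_for_4 = [sample_rotation("4", rng) for _ in range(1000)]

# Microscopy-style task: a single limit for every class, covering the full circle.
angles_for_cell = [
    sample_rotation("cell", rng, limits={}, default=180.0) for _ in range(1000)
]
```

The same sampler covers both tasks; only the table of limits changes, which keeps the task-specific assumption in one visible place.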

74.4 4. The Connection to Consistency Regularization

74.4.1 4.1 From supervised augmentation to unlabeled consistency

In the supervised setting we used augmentation to expand labeled examples. The same invariance can be exploited on unlabeled data, where it becomes the engine of much of modern semi-supervised learning. The key observation is that enforcing the invariance hypothesis $p(y \mid x) = p(y \mid t_\theta(x))$ on the model does not require knowing $y$. The requirement that predictions agree across transformed views is a statement about the model’s outputs alone, and we can enforce it directly.

Consistency regularization adds a term that penalizes the model for producing different predictions on two augmentations of the same input. For an unlabeled example $u$ and two transformations $t, t'$, a typical loss is

\[ \mathcal{L}_{\text{cons}}(u) = d\big(f_\phi(t(u)),\, f_\phi(t'(u))\big), \]

where $d$ is a divergence such as squared distance between probability vectors or a cross entropy with one side treated as a fixed target. Minimizing this term pushes the decision boundary into low density regions, because it forces predictions to be stable across the perturbations the augmentation family generates.
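A small sketch of this loss with $d$ taken to be the squared distance between probability vectors is given below. The two toy models and the cyclic-shift view are assumptions made for illustration; no gradients are computed here.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def consistency_loss(model, u, t, t_prime):
    """Squared distance between predicted probabilities on two views of u."""
    p = softmax(model(t(u)))
    q = softmax(model(t_prime(u)))
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical views: the identity and a cyclic shift of a 1-D signal.
identity = lambda u: u
shift_one = lambda u: u[-1:] + u[:-1]

# Two toy two-class models: one reads a single position, one pools over positions.
position_model = lambda u: [u[0], -u[0]]
pooled_model = lambda u: [sum(u), -sum(u)]

u = [2.0, 0.0, -1.0, 0.5]
loss_position = consistency_loss(position_model, u, identity, shift_one)
loss_pooled = consistency_loss(pooled_model, u, identity, shift_one)
```

The pooled model is already invariant to the shift and incurs no penalty, while the position-sensitive model is penalized, which is the pressure the consistency term applies during training.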

The link to the gradient penalty of Section 2.2 is direct. If $t$ is the identity and $t'$ adds small noise $\theta$, then $\mathcal{L}_{\text{cons}}$ with squared distance is approximately $\|\nabla_x f_\phi(x)^\top \theta\|^2$, whose expectation over $\theta$ is again a Tikhonov term on the input gradient. The difference from supervised augmentation is the anchor. Supervised augmentation pins the smoothed function to the known label $y_i$, while consistency regularization pins it only to the model’s own output on a neighboring view. Consistency therefore enforces smoothness without supplying direction, which is why it is almost always paired either with a labeled loss or with a confidence filtered pseudo label that injects directional signal, as in Section 4.2.

74.4.2 4.2 Weak and strong augmentation

The most effective recent methods use augmentations of two different strengths. A weak augmentation, such as a small shift and flip, produces a prediction trusted enough to serve as a pseudo label. A strong augmentation, such as heavy color distortion plus cutout, produces an input the model must match to that pseudo label. The asymmetry is deliberate: the weak view yields a reliable target, and the strong view supplies a hard learning signal.

weak_view   = weak_augment(u)
strong_view = strong_augment(u)

pseudo_label = stop_gradient( predict(weak_view) )      # treat as fixed target
if max(pseudo_label) > tau:                             # confidence threshold
    loss = cross_entropy(predict(strong_view), pseudo_label)
else:
    loss = 0.0                                          # low confidence: example is skipped

The confidence threshold $\tau$ is essential. Without it, low confidence pseudo labels inject noise and the model can drift, reinforcing its own errors in a feedback loop. The threshold filters the unlabeled set down to examples where the weak view is already confident, which keeps the pseudo labels approximately correct early in training and lets coverage expand as the model improves.
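The thresholded loss can be made concrete for a single unlabeled example. The sketch below assumes a hard (argmax) pseudo label, which is one common choice, and uses a toy linear model with scaling functions standing in for the weak and strong augmentations; the default `tau=0.95` is likewise an assumption. Pure Python has no gradients, so the stop-gradient is implicit.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pseudo_label_loss(model, u, weak_augment, strong_augment, tau=0.95):
    """Return (loss, used) for one unlabeled example u."""
    # In an autodiff framework this target would be detached (stop_gradient).
    weak_probs = softmax(model(weak_augment(u)))
    confidence = max(weak_probs)
    if confidence <= tau:
        return 0.0, False  # filtered out: contributes no training signal
    target = weak_probs.index(confidence)  # hard pseudo label (assumption)
    strong_probs = softmax(model(strong_augment(u)))
    return -math.log(strong_probs[target]), True  # cross entropy vs. hard target

# Hypothetical toy pieces: a linear two-class model and scaling "augmentations".
model = lambda u: [3.0 * u[0], -3.0 * u[0]]
weak = lambda u: [0.95 * v for v in u]
strong = lambda u: [0.4 * v for v in u]

confident = pseudo_label_loss(model, [1.0], weak, strong)
uncertain = pseudo_label_loss(model, [0.1], weak, strong)
```

The first example clears the threshold on its weak view and produces a positive loss on the strong view; the second is filtered out and contributes nothing until the model becomes more confident on it.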

74.4.3 4.3 Why augmentation strength controls the regularizer

Consistency regularization is only as good as the augmentations underneath it. If the transformations are too weak, the two views are nearly identical, the consistency loss is trivially small, and no useful signal flows. If the transformations are too strong and break label preservation, the model is asked to match predictions across inputs that genuinely have different labels, which corrupts learning. The same label preservation discipline from Section 3 reappears here, now with higher stakes, because there is no ground truth label to anchor the example. Augmentation design and consistency regularization are therefore two faces of the same underlying assumption about task invariance.

74.4.4 4.4 The shared invariance principle

It is worth stating the unifying view explicitly. Supervised augmentation, mixup style interpolation, and unlabeled consistency are all mechanisms for encoding a prior that the function should vary slowly along certain directions in input space. Supervised augmentation anchors that prior to known labels. Consistency regularization anchors it to the model’s own confident predictions. Both succeed exactly when the chosen transformation family aligns with the true symmetries of the task, and both fail in the same way when it does not.

74.5 5. When Augmentation Helps and When It Hurts

74.5.1 5.1 Regimes where augmentation helps most

Augmentation delivers the largest gains when data is scarce relative to model capacity. In the small data regime the variance reduction described in Section 2 dominates, and each valid synthetic example meaningfully enlarges the effective training set. Gains also tend to be large when the deployment distribution contains the very variations the augmentations simulate, since the model is then trained on something close to the test conditions. A network trained with random crops and color jitter is better prepared for photographs taken under varied framing and lighting because it has effectively seen such variation.

A third favorable regime is when a known invariance is strong and clean. Tasks with well understood symmetries, such as rotational invariance in microscopy or translation invariance in audio event detection, reward augmentation that targets those symmetries precisely.

74.5.2 5.2 Diminishing and negative returns

As the labeled dataset grows, the marginal value of augmentation shrinks. With abundant data the model can learn the relevant invariances from real examples, and synthetic ones add little. In the very large data regime augmentation may even slow convergence by making each example harder, without a compensating generalization benefit.

Augmentation hurts outright in several identifiable situations. The first is label corruption, discussed in Section 3, where the transform changes the true label and the model is trained on wrong targets. The second is distribution shift introduced by augmentation itself: if the augmentation produces inputs that never occur at test time, the model wastes capacity on an irrelevant region of input space and may underperform on the real distribution. Heavy, unrealistic distortions are a common culprit. The third is interaction with class imbalance and spurious features, where an augmentation that is benign for the majority class destroys the signal that distinguishes a minority class, as in the color dependent flower example.

74.5.3 5.3 Diagnosing augmentation problems

A practical workflow treats the augmentation policy as a tunable component subject to validation. The following heuristics catch most failures.

First, sanity check label preservation by eye. Sample augmented inputs and confirm that a human would still assign the original label. This trivial step catches the most damaging errors and costs minutes.

Second, sweep augmentation strength and watch the validation curve. The signature of excessive augmentation is a training accuracy that stays low while validation accuracy also fails to rise, indicating that the model cannot even fit the corrupted targets. The signature of insufficient augmentation is a large train-to-validation gap that augmentation was supposed to close.

Third, monitor per class metrics, not just aggregate accuracy. An augmentation that helps overall while quietly destroying one class will hide behind a strong average. Class conditional evaluation surfaces these regressions.
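The third heuristic can be sketched in a few lines. The predictions below are made-up toy values, chosen only to show how an aggregate gain can mask a per class regression; the function names are illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Made-up toy predictions: class "a" is the majority, class "b" the minority.
y_true = ["a"] * 8 + ["b"] * 2
pred_baseline = ["a"] * 6 + ["b"] * 2 + ["b"] * 2    # a: 6/8 correct, b: 2/2 correct
pred_augmented = ["a"] * 8 + ["a"] * 1 + ["b"] * 1   # a: 8/8 correct, b: 1/2 correct

overall_gain = accuracy(y_true, pred_augmented) - accuracy(y_true, pred_baseline)
by_class_baseline = per_class_accuracy(y_true, pred_baseline)
by_class_augmented = per_class_accuracy(y_true, pred_augmented)
```

Here the aggregate accuracy improves while the minority class loses half of its accuracy, which is the pattern that only a class conditional report reveals.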

74.5.4 5.4 Learned and adaptive augmentation

Because hand tuning augmentation policies is laborious and task specific, a body of work searches for policies automatically. Methods learn or sample which transformations and magnitudes to apply, optimizing a validation objective rather than fixing the policy by hand. Reduced search formulations make this tractable by limiting the number of free hyperparameters, often to a global magnitude and a count of transforms to apply. The principled takeaway is unchanged: the search is effective only within a transformation family that respects the task’s invariances. Automation tunes the policy; it does not absolve the practitioner from supplying transforms that preserve labels.
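A reduced search space of this kind can be sketched as a policy with just two knobs. The code below is an illustration in that spirit rather than any published method's implementation: the three transforms act on a 1-D list, are hypothetical, and are written so that magnitude zero is the identity.

```python
import random

# Hypothetical transforms on a 1-D signal; each takes a magnitude m in [0, 1]
# and reduces to the identity at m = 0.
def brighten(x, m):
    return [v + m for v in x]

def contrast(x, m):
    return [v * (1.0 + m) for v in x]

def shift(x, m):
    k = round(m * (len(x) - 1))
    return x[-k:] + x[:-k] if k else list(x)

TRANSFORMS = [brighten, contrast, shift]

def make_policy(num_ops, magnitude, rng, transforms=TRANSFORMS):
    """Policy with two free hyperparameters: how many ops, and how strong."""
    def apply(x):
        # Sample num_ops transforms uniformly, with replacement, and chain them.
        for op in rng.choices(transforms, k=num_ops):
            x = op(x, magnitude)
        return x
    return apply

rng = random.Random(0)
x = [0.0, 1.0, 2.0, 3.0]
unchanged = make_policy(num_ops=2, magnitude=0.0, rng=rng)(x)
augmented = make_policy(num_ops=2, magnitude=0.5, rng=rng)(x)
```

Tuning then amounts to a small grid over `num_ops` and `magnitude`, scored on validation data. The list `TRANSFORMS` is where the label preservation judgment lives, and no amount of search over the two knobs can repair a transform that does not belong there.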

74.5.5 5.5 Practical guidance

Treat augmentation as a prior, not a default. Begin by enumerating the symmetries the task genuinely possesses, choose transforms that encode exactly those, and start with conservative magnitudes. Validate label preservation directly, tune strength jointly with the other regularizers, and evaluate per class. When operating on unlabeled data, recognize that consistency regularization inherits every assumption baked into the augmentation policy and amplifies the cost of getting it wrong. Used with this discipline, augmentation is among the highest return interventions available; used carelessly, it is a silent source of label noise and distribution shift.

74.6 References

Shorten, C., and Khoshgoftaar, T. M. “A survey on Image Data Augmentation for Deep Learning.” Journal of Big Data, 2019. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0
Bishop, C. M. “Training with Noise is Equivalent to Tikhonov Regularization.” Neural Computation, 1995. https://direct.mit.edu/neco/article/7/1/108/5828
Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. “mixup: Beyond Empirical Risk Minimization.” ICLR, 2018. https://arxiv.org/abs/1710.09412
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.” ICCV, 2019. https://arxiv.org/abs/1905.04899
DeVries, T., and Taylor, G. W. “Improved Regularization of Convolutional Neural Networks with Cutout.” 2017. https://arxiv.org/abs/1708.04552
Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. “FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence.” NeurIPS, 2020. https://arxiv.org/abs/2001.07685
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. “MixMatch: A Holistic Approach to Semi-Supervised Learning.” NeurIPS, 2019. https://arxiv.org/abs/1905.02249
Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. “Unsupervised Data Augmentation for Consistency Training.” NeurIPS, 2020. https://arxiv.org/abs/1904.12848
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. “AutoAugment: Learning Augmentation Strategies from Data.” CVPR, 2019. https://arxiv.org/abs/1805.09501
Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. “RandAugment: Practical Automated Data Augmentation with a Reduced Search Space.” NeurIPS, 2020. https://arxiv.org/abs/1909.13719
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.” Interspeech, 2019. https://arxiv.org/abs/1904.08779
Wei, J., and Zou, K. “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks.” EMNLP, 2019. https://arxiv.org/abs/1901.11196
Chapelle, O., Weston, J., Bottou, L., and Vapnik, V. “Vicinal Risk Minimization.” Advances in Neural Information Processing Systems (NeurIPS), 2001. https://proceedings.neurips.cc/paper/2000/hash/ba9a56ce0a9bfa26e8ed9e10b2cc8f46-Abstract.html

# Data Augmentation Principles Data augmentation is one of the most reliable levers in applied machine learning. By generating additional training examples through transformations of existing data, practitioners routinely improve generalization without collecting a single new label. Yet augmentation is often treated as a bag of tricks rather than a principled component of the learning system. This chapter develops the conceptual foundations. We frame augmentation as a way to encode invariances into a model, examine its role as a regularizer, formalize what it means for a transform to preserve labels, connect supervised augmentation to consistency regularization in the semi-supervised setting, and conclude with a sober account of when augmentation helps and when it quietly hurts. The throughline is a single idea. Augmentation is a way of writing down what we already believe about a task, namely that certain changes to an input should not change its label, and of turning that belief into training signal. Seen this way, the choice of transformations is not a tuning detail but a modeling decision on par with the choice of architecture or loss. The chapter is organized to make that decision explicit and to expose the precise assumption (label preservation under a transformation family) on which every benefit of augmentation rests. ```{mermaid} flowchart TD A["Task symmetries we believe in"] --> B["Transformation family T"] B --> C["Augmentation distribution q(theta)"] C --> D["Augmented risk over labeled data"] C --> E["Consistency loss over unlabeled data"] D --> F["Model that is approximately invariant"] E --> F G["Label preservation assumption"] --> D G --> E F --> H["Better generalization if T matches the truth"] ``` ## 1. Augmentation as Encoding Invariances ### 1.1 The invariance hypothesis Most prediction tasks come with a built in symmetry. A cat remains a cat when the image is flipped horizontally, shifted by a few pixels, or brightened slightly. 
A spoken word retains its meaning under small changes in pitch or background noise. A sentence keeps its sentiment when a word is swapped for a synonym. These symmetries express prior knowledge about the task that the raw training data only partially reveals. Formally, suppose we have an input space $\mathcal{X}$, a label space $\mathcal{Y}$, and a family of transformations $\mathcal{T} = \{t_\theta : \mathcal{X} \to \mathcal{X}\}$ indexed by parameters $\theta$. The task is said to be **invariant** under $\mathcal{T}$ if the true conditional distribution satisfies $$ p(y \mid x) = p(y \mid t_\theta(x)) \quad \text{for all } t_\theta \in \mathcal{T}. $$ We would like our learned predictor $f_\phi$ to respect this property, that is $f_\phi(x) \approx f_\phi(t_\theta(x))$. Data augmentation is the empirical mechanism for instilling that behavior. Rather than hard wiring the invariance into the architecture, which is possible but restrictive, we sample transformations during training and ask the model to produce consistent outputs. It is worth distinguishing two related notions precisely, because the literature uses them loosely. A predictor $f_\phi : \mathcal{X} \to \mathcal{Y}$ is **invariant** under $\mathcal{T}$ if $f_\phi(t_\theta(x)) = f_\phi(x)$ for all $x$ and all $t_\theta \in \mathcal{T}$, so the output does not move when the input is transformed. A feature map $g_\phi : \mathcal{X} \to \mathbb{R}^d$ is **equivariant** under $\mathcal{T}$ if there is a corresponding transformation $\rho_\theta$ on the feature space with $g_\phi(t_\theta(x)) = \rho_\theta(g_\phi(x))$, so the output transforms in a predictable, structured way rather than staying fixed. Classification typically wants invariance of the final decision, while intermediate representations are often most useful when they are equivariant. 
Augmentation primarily targets invariance of the head, but by training the whole network under transformed inputs it tends to encourage approximately equivariant intermediate features as a side effect.

Two qualifiers keep the invariance hypothesis honest. First, real symmetries are usually approximate and bounded. The label of a digit image is invariant to small rotations but not to a one hundred eighty degree rotation that turns a 6 into a 9; the safe set of $\theta$ is a neighborhood of the identity, not the whole group. Second, invariance is a property of the *labeling*, not of perception. A transform is admissible exactly when it preserves the conditional $p(y \mid x)$, which is a claim about the task we have defined, not about whether a human finds the transformed input natural.

### 1.2 Augmentation as expanding the training distribution

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ be the labeled dataset drawn from a distribution $p_{\text{data}}$. Augmentation replaces this finite sample with an enlarged, effectively infinite distribution. Each example is paired with a distribution over transformations $q(\theta)$, and the augmented risk becomes

$$
R_{\text{aug}}(\phi) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{\theta \sim q} \big[ \ell\big(f_\phi(t_\theta(x_i)), y_i\big) \big].
$$

The expectation over $\theta$ is the heart of the matter. We are no longer fitting the model to $n$ points but to a tube of inputs surrounding each point, traced out by the transformation family. If the invariance hypothesis holds, every input in that tube genuinely carries label $y_i$, so the augmented risk is an unbiased estimate of risk on a richer distribution than the one we sampled.

This construction has a precise name. It is **vicinal risk minimization** (Chapelle et al. 2001): the empirical distribution that places a point mass at each $(x_i, y_i)$ is replaced by a vicinity distribution that smears probability over a neighborhood of each example.
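
The expectation inside $R_{\text{aug}}$ is rarely available in closed form, so in practice it is estimated by sampling. The following is a minimal sketch of that Monte Carlo estimate, not a reference implementation. The one-dimensional inputs, the linear `model`, the uniform shift family with half-width `width`, and the toy `data` are all assumptions made for illustration, and the toy shows only the mechanics of the expectation, not a task where shifts are truly label preserving.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def squared_loss(pred: float, target: float) -> float:
    return 0.5 * (pred - target) ** 2


def empirical_risk(data: List[Example], model: Callable[[float], float]) -> float:
    """Plain empirical risk: a point mass at every training example."""
    return sum(squared_loss(model(x), y) for x, y in data) / len(data)


def augmented_risk(
    data: List[Example],
    model: Callable[[float], float],
    width: float,
    n_draws: int = 2000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of R_aug for t_theta(x) = x + theta,
    with theta ~ Uniform(-width, width) (an assumed augmentation family)."""
    rng = random.Random(seed)
    total = 0.0
    for x, y in data:
        inner = 0.0
        for _ in range(n_draws):
            theta = rng.uniform(-width, width)
            inner += squared_loss(model(x + theta), y)
        total += inner / n_draws  # inner expectation over theta for this example
    return total / len(data)      # outer average over the n examples


def model(x: float) -> float:
    # Hypothetical fixed predictor with slope 2.
    return 2.0 * x


data = [(0.0, 0.1), (1.0, 2.3), (2.0, 3.8), (3.0, 6.2)]

r_erm = empirical_risk(data, model)
r_zero = augmented_risk(data, model, width=0.0)
r_aug = augmented_risk(data, model, width=0.3)

print(f"empirical risk:             {r_erm:.4f}")
print(f"augmented risk, width 0.0:  {r_zero:.4f}")
print(f"augmented risk, width 0.3:  {r_aug:.4f}")
```

With `width=0.0` every draw is the identity transform, so the estimate coincides with the empirical risk. For this linear toy model the widened vicinity adds roughly $a^2 w^2 / 6 = 0.06$ to the risk (slope $a = 2$, half-width $w = 0.3$), which is the price the model pays for being sensitive along the augmented direction.
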
Standard empirical risk minimization is the degenerate case where the vicinity is a Dirac spike. Augmentation chooses a vicinity by choosing $q(\theta)$, and the augmented risk $R_{\text{aug}}$ is exactly the empirical risk under that vicinal distribution. Framing augmentation this way clarifies the bias and variance tradeoff. A wider vicinity (stronger augmentation) lowers the variance of the risk estimate because it averages over more directions, but it raises bias whenever the vicinity leaks outside the true label region, which is precisely the label preservation failure of Section 3.

A useful mental model is that augmentation injects domain knowledge that the model would otherwise have to discover from data. A convolutional network can in principle learn approximate translation invariance from raw pixels, but it needs many examples to do so. Supplying shifted copies short circuits that learning and frees capacity for the genuinely discriminative features.

### 1.3 The geometry of orbits

The set of points reachable from a single $x$ under the transformation family is its **orbit**, $\mathcal{O}(x) = \{ t_\theta(x) : t_\theta \in \mathcal{T} \}$. When the transformations form a group, the orbits partition the input space into equivalence classes, and an invariant predictor is constant on each orbit. Augmentation encourages the model to collapse each orbit to a single decision, which reduces the effective dimensionality of the function the model must learn.

There is a clean way to see why averaging over an orbit cannot hurt and often helps. Suppose $\mathcal{T}$ is a finite group $G$ acting on inputs, and define the **symmetrized predictor** by averaging a base predictor $h$ over the group,

$$
\bar{h}(x) = \frac{1}{|G|} \sum_{g \in G} h(g \cdot x).
$$

By construction $\bar{h}$ is exactly invariant: evaluating at $g' \cdot x$ gives $\tfrac{1}{|G|}\sum_{g} h\big((g g') \cdot x\big)$, and relabeling the sum by $g \mapsto g g'$ leaves it unchanged.
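
The invariance of the symmetrized predictor is easy to check numerically. The sketch below assumes a concrete group for illustration, the cyclic shifts of a length four vector, and a hypothetical linear base predictor `h` with arbitrary `WEIGHTS`; neither choice comes from the text above.

```python
from typing import Callable, List, Tuple

Vector = Tuple[float, ...]


def cyclic_shifts(x: Vector) -> List[Vector]:
    """Orbit of x under the cyclic group that rotates coordinates."""
    return [x[k:] + x[:k] for k in range(len(x))]


def symmetrize(h: Callable[[Vector], float]) -> Callable[[Vector], float]:
    """Return h_bar(x) = (1/|G|) * sum over g of h(g . x)."""
    def h_bar(x: Vector) -> float:
        orbit = cyclic_shifts(x)
        return sum(h(z) for z in orbit) / len(orbit)
    return h_bar


# Hypothetical base predictor: a linear score that is NOT shift invariant.
WEIGHTS = (0.5, -1.0, 2.0, 0.25)


def h(x: Vector) -> float:
    return sum(w * v for w, v in zip(WEIGHTS, x))


h_bar = symmetrize(h)

x = (1.0, 2.0, 3.0, 4.0)
shifted = x[1:] + x[:1]  # one group element applied to x

base_gap = abs(h(x) - h(shifted))
sym_gap = abs(h_bar(x) - h_bar(shifted))

print(f"base predictor gap:        {base_gap:.6f}")
print(f"symmetrized predictor gap: {sym_gap:.6f}")
```

The base predictor changes its output when the input is shifted, while the averaged predictor does not, because both evaluations sum `h` over the same orbit in a different order.
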
If the task is truly invariant, so the optimal predictor is constant on orbits, then for a convex loss Jensen's inequality gives $R(\bar{h}) \le \tfrac{1}{|G|}\sum_{g} R(h \circ g) = R(h)$, where the last equality uses that the data distribution is itself invariant. In words, projecting onto the invariant function class never increases risk under these assumptions, and for a strictly convex loss it strictly decreases risk whenever $h$ disagrees across orbits with positive probability. Augmentation is the stochastic counterpart of this averaging: instead of summing over all of $G$ at inference, we sample group elements during training and push the learned $f_\phi$ toward the invariant subspace. The benefit is the variance reduction of replacing one orbit sample by an average, and the catch is that the argument assumes the data distribution is genuinely $G$ invariant, which is exactly label preservation restated.

This view also clarifies a failure mode. If two classes share parts of their orbits under the chosen transformations, so $\mathcal{O}(x_i) \cap \mathcal{O}(x_j) \neq \varnothing$ for $y_i \neq y_j$, then augmentation forces the model to assign the same label to genuinely different inputs, and no invariant predictor can separate the classes. We return to this in Section 5.

## 2. Augmentation as Regularization

### 2.1 Why augmentation reduces variance

Regularization, broadly, is any modification to a learning procedure that reduces the variance of the fitted model at the cost of some bias. Augmentation fits this description cleanly. By training on a continuum of perturbed inputs, the model cannot exploit idiosyncratic pixel level or token level patterns that happen to correlate with labels in the finite sample. Those spurious patterns are washed out by the transformation noise, while the stable, invariant structure survives averaging.

Consider the bias variance decomposition of expected test error.
A model with high capacity trained on a small dataset typically sits in the high variance regime, memorizing training points. Augmentation enlarges the support of the training distribution, so the same model now interpolates over a smoother target. Empirically this manifests as a smaller gap between training and validation accuracy.

### 2.2 The connection to explicit penalties

Augmentation with small perturbations can be shown to approximate an explicit regularization penalty on the model. Consider additive noise augmentation $t_\theta(x) = x + \theta$ with $\theta \sim \mathcal{N}(0, \sigma^2 I)$, so $\mathbb{E}[\theta] = 0$ and $\mathbb{E}[\theta\theta^\top] = \sigma^2 I$, and a squared error loss $\ell(f, y) = \tfrac{1}{2}(f - y)^2$ with scalar output. Taylor expanding the model around $x$ gives

$$
f_\phi(x + \theta) = f_\phi(x) + \nabla_x f_\phi(x)^\top \theta + \tfrac{1}{2}\theta^\top H(x)\,\theta + O(\|\theta\|^3),
$$

where $H$ is the input Hessian. Substituting this expansion into the loss and taking the expectation over $\theta$, the terms linear in $\theta$ vanish because $\mathbb{E}[\theta] = 0$, and the terms quadratic in $\theta$ contribute through $\mathbb{E}[\theta\theta^\top] = \sigma^2 I$. Collecting terms,

$$
\mathbb{E}_\theta\big[\ell(f_\phi(x+\theta), y)\big] \approx \ell(f_\phi(x), y) + \frac{\sigma^2}{2}\, \|\nabla_x f_\phi(x)\|^2 + \frac{\sigma^2}{2}\, r(x)\,\mathrm{tr}\,H(x) + O(\sigma^4),
$$

where $r(x) = f_\phi(x) - y$ is the residual. The leading penalty is a Tikhonov style term on the gradient of the model with respect to its input (Bishop 1995). The Hessian term is weighted by the residual, so near a good fit, where residuals are small, it is negligible and the gradient norm penalty dominates. In words, noise augmentation discourages the function from changing rapidly near training points, which is exactly the smoothness prior we associate with good generalization.
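
The expansion above can be checked numerically. The sketch below compares a Monte Carlo estimate of the noisy loss with the second order approximation for a hypothetical smooth model $f(x) = \sin(x_0) + x_0 x_1$, chosen only because its gradient and Hessian trace are easy to write by hand. The evaluation point, target, and noise scale are likewise assumptions.

```python
import math
import random


def f(x0: float, x1: float) -> float:
    """Hypothetical smooth scalar model used only for this check."""
    return math.sin(x0) + x0 * x1


def grad_sq_norm(x0: float, x1: float) -> float:
    """Squared norm of the input gradient (cos(x0) + x1, x0)."""
    g0 = math.cos(x0) + x1
    g1 = x0
    return g0 * g0 + g1 * g1


def hessian_trace(x0: float, x1: float) -> float:
    """Trace of the input Hessian: d2f/dx0^2 = -sin(x0), d2f/dx1^2 = 0."""
    return -math.sin(x0)


def noisy_loss_mc(x0: float, x1: float, y: float, sigma: float,
                  n_samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E_theta[ 0.5 * (f(x + theta) - y)^2 ]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t0 = rng.gauss(0.0, sigma)
        t1 = rng.gauss(0.0, sigma)
        total += 0.5 * (f(x0 + t0, x1 + t1) - y) ** 2
    return total / n_samples


def noisy_loss_taylor(x0: float, x1: float, y: float, sigma: float) -> float:
    """Clean loss plus the gradient penalty and the residual-weighted Hessian term."""
    r = f(x0, x1) - y
    clean = 0.5 * r * r
    grad_penalty = 0.5 * sigma ** 2 * grad_sq_norm(x0, x1)
    hess_term = 0.5 * sigma ** 2 * r * hessian_trace(x0, x1)
    return clean + grad_penalty + hess_term


x0, x1, y, sigma = 0.3, -0.7, 0.5, 0.1
clean_loss = 0.5 * (f(x0, x1) - y) ** 2
mc = noisy_loss_mc(x0, x1, y, sigma)
approx = noisy_loss_taylor(x0, x1, y, sigma)

print(f"clean loss:           {clean_loss:.6f}")
print(f"Monte Carlo estimate: {mc:.6f}")
print(f"Taylor approximation: {approx:.6f}")
```

At this noise scale the Monte Carlo estimate should sit much closer to the Taylor approximation than to the clean loss, with the remaining difference coming from sampling error and the neglected $O(\sigma^4)$ terms.
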
Richer transformations such as rotations or crops do not reduce to such a clean closed form, but the qualitative effect is the same. Linearizing $t_\theta(x) \approx x + \theta\, \partial_\theta t_\theta(x)|_{\theta=0}$ shows that the induced penalty weights the directional derivative of $f_\phi$ along the *tangent directions* of the transformation family, penalizing sensitivity to exactly the directions that the family declares label irrelevant while leaving discriminative directions unconstrained.

### 2.3 Interaction with other regularizers

Augmentation is not a substitute for weight decay, dropout, or early stopping; it is complementary, and it operates on a different axis. Weight decay constrains the parameters, dropout constrains co adaptation of units, and augmentation constrains the function along data manifold directions. Because these mechanisms target different sources of overfitting, stacking them usually helps, though the marginal benefit of each typically diminishes as others are added. A practical consequence is that the optimal augmentation strength interacts with the optimal weight decay and learning rate, so these hyperparameters should be tuned jointly rather than in isolation.

## 3. Label Preserving Transforms

### 3.1 The label preservation requirement

The entire justification in Section 1 rests on a single assumption: the transformation does not change the label. A transform $t_\theta$ is **label preserving** for example $(x, y)$ if $y$ remains the correct target for $t_\theta(x)$. When this holds, augmented examples are valid training signal. When it fails, augmentation injects label noise, and the model is trained to make confident but wrong predictions.

The requirement is not a property of the transform alone; it is a property of the transform together with the task and the data.
Horizontal flipping preserves the label "cat" but destroys the label in a digit recognition task, since a flipped 2 is not a 2 and a flipped 6 may resemble nothing valid. Color jitter is harmless for object recognition but catastrophic for a task that classifies flowers by color. The first discipline of augmentation is therefore to enumerate, for the specific task, which symmetries the labels actually respect.

### 3.2 A taxonomy of common transforms

The following sketch organizes widely used transforms by modality and by the invariance they encode.

```text
Vision
  geometric:    flip, rotate, crop, scale, translate, shear
  photometric:  brightness, contrast, hue, saturation, blur, noise
  occlusion:    cutout, random erasing
  mixing:       mixup, cutmix   (note: these change the label)

Text
  lexical:      synonym replacement, random insertion/deletion
  structural:   back translation, paraphrase generation
  embedding:    word/token dropout, noise in embedding space

Audio
  temporal:     time stretch, time shift, time masking
  spectral:     pitch shift, frequency masking, additive noise
```

The mixing methods in the vision list deserve a flag. Mixup forms a convex combination $\tilde{x} = \lambda x_i + (1-\lambda) x_j$ and a matching soft label $\tilde{y} = \lambda y_i + (1-\lambda) y_j$. This is not label preserving in the strict sense; it deliberately interpolates labels. It works not by encoding an invariance but by enforcing linear behavior between examples, a distinct regularizing mechanism worth keeping conceptually separate.

### 3.3 Class conditional and instance conditional validity

Label preservation can hold for some classes and not others, or for some instances and not others. Rotation by ninety degrees is safe for most natural objects but flips the meaning of an arrow, a clock, or text. The safe range of an augmentation is frequently narrower than practitioners assume.
A robust practice is to make the augmentation distribution conditional where necessary, applying aggressive rotations only to classes known to be rotation invariant and restricting the range elsewhere.

The deeper point is that the augmentation policy is part of the model specification. Choosing $q(\theta)$ is choosing a prior over the function class, and a mismatched prior degrades performance just as a poorly chosen architecture does.

### 3.4 A worked example: rotation on two tasks

To make label preservation concrete, contrast the same transform on two tasks. Let the transform be rotation by an angle $\alpha$, with augmentation distribution $q(\alpha)$ uniform on a range $[-\beta, \beta]$.

On a microscopy task that classifies cell types, the imaging geometry has no preferred orientation. The conditional $p(y \mid x)$ is genuinely invariant to any rotation, so the orbit of every image stays inside a single class. Here the safe range is the full circle, $\beta = 180^\circ$, and large $\beta$ is not just permissible but beneficial, because it averages over a symmetry the data truly has and shrinks the effective hypothesis space exactly as the symmetrization argument of Section 1.3 predicts.

On a handwritten digit task the same transform behaves very differently. Small rotations preserve labels: a 7 tilted by ten degrees is still a 7. But the safe range is narrow and class dependent. A 6 rotated near one hundred eighty degrees becomes a 9, so the orbit of a 6 collides with the orbit of a 9, the intersection condition $\mathcal{O}(x_6) \cap \mathcal{O}(x_9) \neq \varnothing$ from Section 1.3 is met, and any invariant predictor must confuse the two. A 4 and a 7 admit wider safe ranges than a 6 or a 9. The correct policy is therefore a small global $\beta$ (often cited as roughly ten to fifteen degrees in practice) or, better, a class conditional range.

The contrast is the entire lesson of the section in one transform.
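
A class conditional rotation range of the kind described for the digit task can be sketched as a lookup from label to half-range $\beta$, followed by a uniform draw on $[-\beta, \beta]$. The per class values in `SAFE_BETA_DEG` and the fallback `DEFAULT_BETA_DEG` are hypothetical numbers chosen for illustration, not measured safe ranges.

```python
import random
from typing import Dict

# Hypothetical per-class half-ranges in degrees. Classes whose orbits collide
# under rotation (6 and 9) get a narrower range than classes that do not.
SAFE_BETA_DEG: Dict[int, float] = {4: 25.0, 7: 25.0, 6: 10.0, 9: 10.0}
DEFAULT_BETA_DEG = 15.0  # assumed global fallback for unlisted classes


def sample_rotation_deg(label: int, rng: random.Random) -> float:
    """Draw alpha ~ Uniform(-beta, beta) with beta chosen by class label."""
    beta = SAFE_BETA_DEG.get(label, DEFAULT_BETA_DEG)
    return rng.uniform(-beta, beta)


rng = random.Random(0)
angles_6 = [sample_rotation_deg(6, rng) for _ in range(1000)]
angles_4 = [sample_rotation_deg(4, rng) for _ in range(1000)]
angles_3 = [sample_rotation_deg(3, rng) for _ in range(1000)]

print(f"largest |angle| for class 6: {max(abs(a) for a in angles_6):.2f}")
print(f"largest |angle| for class 4: {max(abs(a) for a in angles_4):.2f}")
print(f"largest |angle| for class 3: {max(abs(a) for a in angles_3):.2f}")
```

The sampled angle would then be passed to whatever rotation routine the pipeline uses. Keeping the policy as data rather than code makes it straightforward to tighten a single class after a per class regression is found.
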
The admissibility of an augmentation is a joint property of transform, task, and even class, never of the transform alone.

## 4. The Connection to Consistency Regularization

### 4.1 From supervised augmentation to unlabeled consistency

In the supervised setting we used augmentation to expand labeled examples. The same invariance can be exploited on unlabeled data, where it becomes the engine of much of modern semi-supervised learning. The key observation is that the invariance hypothesis $p(y \mid x) = p(y \mid t_\theta(x))$ does not require knowing $y$. Its counterpart for the predictor, $f_\phi(x) \approx f_\phi(t_\theta(x))$, is a statement about the model's outputs alone, and we can enforce it directly.

**Consistency regularization** adds a term that penalizes the model for producing different predictions on two augmentations of the same input. For an unlabeled example $u$ and two transformations $t, t'$, a typical loss is

$$
\mathcal{L}_{\text{cons}}(u) = d\big(f_\phi(t(u)),\, f_\phi(t'(u))\big),
$$

where $d$ is a divergence such as squared distance between probability vectors or a cross entropy with one side treated as a fixed target. Minimizing this term pushes the decision boundary into low density regions, because it forces predictions to be stable across the perturbations the augmentation family generates.

The link to the gradient penalty of Section 2.2 is direct. If $t$ is the identity and $t'$ adds small noise $\theta$, then $\mathcal{L}_{\text{cons}}$ with squared distance is approximately $\|\nabla_u f_\phi(u)^\top \theta\|^2$, whose expectation over $\theta$ is again a Tikhonov term on the input gradient. The difference from supervised augmentation is the anchor. Supervised augmentation pins the smoothed function to the known label $y_i$, while consistency regularization pins it only to the model's own output on a neighboring view.
Consistency therefore enforces *smoothness* without supplying *direction*, which is why it is almost always paired either with a labeled loss or with a confidence filtered pseudo label that injects directional signal, as in Section 4.2.

### 4.2 Weak and strong augmentation

The most effective recent methods use augmentations of two different strengths. A **weak** augmentation, such as a small shift and flip, produces a prediction trusted enough to serve as a pseudo label. A **strong** augmentation, such as heavy color distortion plus cutout, produces an input the model must match to that pseudo label. The asymmetry is deliberate: the weak view yields a reliable target, and the strong view supplies a hard learning signal.

```text
weak_view    = weak_augment(u)
strong_view  = strong_augment(u)
pseudo_label = stop_gradient( predict(weak_view) )   # treat as fixed target
if max(pseudo_label) > tau:                          # confidence threshold
    loss = cross_entropy(predict(strong_view), pseudo_label)
```

The confidence threshold $\tau$ is essential. Without it, low confidence pseudo labels inject noise and the model can drift, reinforcing its own errors in a feedback loop. The threshold filters the unlabeled set down to examples where the weak view is already confident, which keeps the pseudo labels approximately correct early in training and lets coverage expand as the model improves.

### 4.3 Why augmentation strength controls the regularizer

Consistency regularization is only as good as the augmentations underneath it. If the transformations are too weak, the two views are nearly identical, the consistency loss is trivially small, and no useful signal flows. If the transformations are too strong and break label preservation, the model is asked to match predictions across inputs that genuinely have different labels, which corrupts learning. The same label preservation discipline from Section 3 reappears here, now with higher stakes, because there is no ground truth label to anchor the example.
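
The weak and strong pseudo labeling rule of Section 4.2 can be made concrete on raw logits. The sketch below is one possible reading of the pseudocode, not a definitive implementation: it assumes the model's outputs for the two views are already available as lists of logits, hardens the pseudo label with an argmax (a common choice, and an assumption here), and has no autodiff, so the stop-gradient is only noted in a comment. The function name `pseudo_label_loss` and the example logits are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def pseudo_label_loss(weak_logits: List[float],
                      strong_logits: List[float],
                      tau: float) -> Tuple[float, bool]:
    """Return (loss, used) for one unlabeled example.

    The weak view supplies the target and the strong view is trained to match
    it. In an autodiff framework the weak branch would be wrapped in a
    stop-gradient so that it acts as a fixed target.
    """
    weak_probs = softmax(weak_logits)
    confidence = max(weak_probs)
    if confidence <= tau:
        return 0.0, False  # below threshold: the example contributes nothing
    target = weak_probs.index(confidence)  # hard pseudo label via argmax (assumption)
    strong_probs = softmax(strong_logits)
    return -math.log(strong_probs[target]), True


tau = 0.95
confident_loss, confident_used = pseudo_label_loss([4.0, 0.0, 0.0], [1.0, 0.5, 0.0], tau)
uncertain_loss, uncertain_used = pseudo_label_loss([1.0, 0.8, 0.5], [1.0, 0.5, 0.0], tau)

print(f"confident weak view: used={confident_used}, loss={confident_loss:.4f}")
print(f"uncertain weak view: used={uncertain_used}, loss={uncertain_loss:.4f}")
```

Only the first example passes the threshold and produces a nonzero cross entropy. The second is masked out, which is the filtering behavior that keeps noisy pseudo labels from feeding back into training.
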
Augmentation design and consistency regularization are therefore two faces of the same underlying assumption about task invariance.

### 4.4 The shared invariance principle

It is worth stating the unifying view explicitly. Supervised augmentation, mixup style interpolation, and unlabeled consistency are all mechanisms for encoding a prior that the function should vary slowly along certain directions in input space. Supervised augmentation anchors that prior to known labels. Consistency regularization anchors it to the model's own confident predictions. Both succeed exactly when the chosen transformation family aligns with the true symmetries of the task, and both fail in the same way when it does not.

## 5. When Augmentation Helps and When It Hurts

### 5.1 Regimes where augmentation helps most

Augmentation delivers the largest gains when data is scarce relative to model capacity. In the small data regime the variance reduction described in Section 2 dominates, and each valid synthetic example meaningfully enlarges the effective training set.

Gains also tend to be large when the deployment distribution contains the very variations the augmentations simulate, since the model is then trained on something close to the test conditions. A network trained with random crops and color jitter is better prepared for photographs taken under varied framing and lighting because it has effectively seen such variation.

A third favorable regime is when a known invariance is strong and clean. Tasks with well understood symmetries, such as rotational invariance in microscopy or translation invariance in audio event detection, reward augmentation that targets those symmetries precisely.

### 5.2 Diminishing and negative returns

As the labeled dataset grows, the marginal value of augmentation shrinks. With abundant data the model can learn the relevant invariances from real examples, and synthetic ones add little.
In the very large data regime augmentation may even slow convergence by making each example harder, without a compensating generalization benefit.

Augmentation hurts outright in several identifiable situations. The first is **label corruption**, discussed in Section 3, where the transform changes the true label and the model is trained on wrong targets. The second is **distribution shift introduced by augmentation itself**: if the augmentation produces inputs that never occur at test time, the model wastes capacity on an irrelevant region of input space and may underperform on the real distribution. Heavy, unrealistic distortions are a common culprit. The third is **interaction with class imbalance and spurious features**, where an augmentation that is benign for the majority class destroys the signal that distinguishes a minority class, as in the color dependent flower example.

### 5.3 Diagnosing augmentation problems

A practical workflow treats the augmentation policy as a tunable component subject to validation. The following heuristics catch most failures.

First, sanity check label preservation by eye. Sample augmented inputs and confirm that a human would still assign the original label. This trivial step catches the most damaging errors and costs minutes.

Second, sweep augmentation strength and watch the validation curve. The signature of excessive augmentation is a training accuracy that stays low while validation accuracy also fails to rise, indicating that the model cannot even fit the corrupted targets. The signature of insufficient augmentation is a large train to validation gap that augmentation was supposed to close.

Third, monitor per class metrics, not just aggregate accuracy. An augmentation that helps overall while quietly destroying one class will hide behind a strong average. Class conditional evaluation surfaces these regressions.
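
The third heuristic is simple to automate. The sketch below compares per class accuracy between a baseline run and an augmented run and flags classes that regressed by more than a margin. The label lists are fabricated toy data for illustration, and the helper names and the `margin` value are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true: List[int], y_pred: List[int]) -> Dict[int, float]:
    """Accuracy computed separately over the examples of each true class."""
    correct: Dict[int, int] = defaultdict(int)
    count: Dict[int, int] = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        count[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / count[c] for c in count}


def regressed_classes(base: Dict[int, float], aug: Dict[int, float],
                      margin: float = 0.05) -> List[int]:
    """Classes whose accuracy dropped by more than `margin` under augmentation."""
    return sorted(c for c in base if base[c] - aug.get(c, 0.0) > margin)


# Toy validation labels: class 2 is the minority class.
y_true    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
pred_base = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 2, 2, 2, 2]
pred_aug  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 1, 0]

overall_base = accuracy(y_true, pred_base)
overall_aug = accuracy(y_true, pred_aug)
flagged = regressed_classes(per_class_accuracy(y_true, pred_base),
                            per_class_accuracy(y_true, pred_aug))

print(f"overall accuracy: baseline={overall_base:.4f}, augmented={overall_aug:.4f}")
print(f"classes that regressed: {flagged}")
```

In this toy comparison the aggregate accuracy rises from 0.75 to 0.8125 while class 2 falls from 1.0 to 0.25, which is exactly the kind of regression that an average conceals.
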
### 5.4 Learned and adaptive augmentation

Because hand tuning augmentation policies is laborious and task specific, a body of work searches for policies automatically. Methods learn or sample which transformations and magnitudes to apply, optimizing a validation objective rather than fixing the policy by hand. Reduced search formulations make this tractable by limiting the number of free hyperparameters, often to a global magnitude and a count of transforms to apply.

The principled takeaway is unchanged: the search is effective only within a transformation family that respects the task's invariances. Automation tunes the policy; it does not absolve the practitioner from supplying transforms that preserve labels.

### 5.5 Practical guidance

Treat augmentation as a prior, not a default. Begin by enumerating the symmetries the task genuinely possesses, choose transforms that encode exactly those, and start with conservative magnitudes. Validate label preservation directly, tune strength jointly with the other regularizers, and evaluate per class. When operating on unlabeled data, recognize that consistency regularization inherits every assumption baked into the augmentation policy and amplifies the cost of getting it wrong. Used with this discipline, augmentation is among the highest return interventions available; used carelessly, it is a silent source of label noise and distribution shift.

## References

1. Shorten, C., and Khoshgoftaar, T. M. "A survey on Image Data Augmentation for Deep Learning." Journal of Big Data, 2019. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0
2. Bishop, C. M. "Training with Noise is Equivalent to Tikhonov Regularization." Neural Computation, 1995. https://direct.mit.edu/neco/article/7/1/108/5828
3. Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. "mixup: Beyond Empirical Risk Minimization." ICLR, 2018. https://arxiv.org/abs/1710.09412
4. Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features." ICCV, 2019. https://arxiv.org/abs/1905.04899
5. DeVries, T., and Taylor, G. W. "Improved Regularization of Convolutional Neural Networks with Cutout." 2017. https://arxiv.org/abs/1708.04552
6. Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence." NeurIPS, 2020. https://arxiv.org/abs/2001.07685
7. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. "MixMatch: A Holistic Approach to Semi-Supervised Learning." NeurIPS, 2019. https://arxiv.org/abs/1905.02249
8. Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. "Unsupervised Data Augmentation for Consistency Training." NeurIPS, 2020. https://arxiv.org/abs/1904.12848
9. Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. "AutoAugment: Learning Augmentation Strategies from Data." CVPR, 2019. https://arxiv.org/abs/1805.09501
10. Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. "RandAugment: Practical Automated Data Augmentation with a Reduced Search Space." NeurIPS Workshop, 2020. https://arxiv.org/abs/1909.13719
11. Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." Interspeech, 2019. https://arxiv.org/abs/1904.08779
12. Wei, J., and Zou, K. "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks." EMNLP, 2019. https://arxiv.org/abs/1901.11196
13. Chapelle, O., Weston, J., Bottou, L., and Vapnik, V. "Vicinal Risk Minimization." Advances in Neural Information Processing Systems (NeurIPS), 2001. https://proceedings.neurips.cc/paper/2000/hash/ba9a56ce0a9bfa26e8ed9e10b2cc8f46-Abstract.html