210 Other Regularization Techniques for Neural Networks
Regularization is the collection of strategies that reduce the gap between training error and test error, trading a small amount of bias for a large reduction in variance. Dropout and explicit parameter norm penalties are covered elsewhere in this book. This chapter treats the complementary techniques that practitioners reach for most often in modern deep learning: weight decay, early stopping, label smoothing, data augmentation viewed as a regularizer, mixup, and stochastic depth. Each method constrains the hypothesis class or injects structured noise into the optimization, and each admits a precise mathematical characterization that explains when and why it helps.
210.1 1. Weight Decay
Weight decay shrinks parameters toward the origin at every update. In its classical form the update rule for parameter vector \(\theta\) with learning rate \(\eta\) is
\[ \theta_{t+1} = (1 - \eta \lambda)\, \theta_t - \eta\, \nabla_\theta \mathcal{L}(\theta_t), \]
where \(\lambda > 0\) is the decay coefficient. The multiplicative factor \((1 - \eta \lambda)\) pulls each weight a fixed fraction toward zero before the gradient step is applied.
210.1.1 1.1 Relationship to L2 Regularization
For plain stochastic gradient descent, weight decay is algebraically identical to adding an \(L_2\) penalty to the loss. Consider the penalized objective
\[ \tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \lVert \theta \rVert_2^2 . \]
Its gradient is \(\nabla \mathcal{L}(\theta) + \lambda \theta\), so a gradient step yields \(\theta_{t+1} = \theta_t - \eta(\nabla \mathcal{L} + \lambda \theta_t) = (1 - \eta\lambda)\theta_t - \eta \nabla \mathcal{L}\), recovering the decay rule. The equivalence breaks for adaptive optimizers such as Adam: when the penalty is added to the loss, its gradient \(\lambda \theta\) passes through the same per-coordinate rescaling (division by the square root of the second-moment estimate) as the loss gradient, so coordinates with historically large gradients are decayed less than the nominal rate. The decoupled variant AdamW restores the intended behavior by applying the shrinkage directly to the weights rather than routing it through the adaptive preconditioner.
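The SGD equivalence can be checked numerically. The sketch below is illustrative only: the toy quadratic loss, its `target` vector, and the function names are assumptions introduced for the example, not part of the chapter.

```python
import numpy as np

def grad_loss(theta):
    # Gradient of a toy loss L(theta) = 0.5 * ||theta - target||^2 (illustrative only).
    target = np.array([1.0, -2.0, 0.5])
    return theta - target

def sgd_l2_step(theta, lr, lam):
    # Gradient step on the penalized objective L + (lam / 2) * ||theta||^2.
    return theta - lr * (grad_loss(theta) + lam * theta)

def sgd_weight_decay_step(theta, lr, lam):
    # Shrink by (1 - lr * lam), then step along the unpenalized gradient.
    return (1.0 - lr * lam) * theta - lr * grad_loss(theta)

theta0 = np.array([0.3, 0.7, -1.2])
a = sgd_l2_step(theta0, lr=0.1, lam=0.01)
b = sgd_weight_decay_step(theta0, lr=0.1, lam=0.01)
print(np.allclose(a, b))  # the two updates coincide for plain SGD
```

The agreement is exact up to floating-point rounding because the two expressions are algebraic rearrangements of each other; no such identity holds once the gradient is passed through an adaptive preconditioner.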
210.1.2 1.2 Effect on the Loss Landscape
A quadratic approximation of the unpenalized loss around its minimum \(\theta^\ast\) has a Hessian \(H\) with eigendecomposition \(H = Q \Lambda Q^\top\). The penalized minimizer rescales the component of \(\theta^\ast\) along the \(i\)-th eigenvector by the factor \(\Lambda_i / (\Lambda_i + \lambda)\). Directions with small curvature \(\Lambda_i \ll \lambda\) are strongly contracted, while high-curvature directions with \(\Lambda_i \gg \lambda\) are nearly untouched. Weight decay therefore preferentially suppresses parameter components that the data does not constrain, acting as a soft form of dimensionality reduction.
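The per-direction shrinkage factor can be verified on a small quadratic. This is a minimal sketch under the quadratic-loss assumption; the random Hessian, the dimension, and the value of `lam` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)      # symmetric positive definite toy Hessian
theta_star = rng.normal(size=4)    # unpenalized minimizer
lam = 0.5                          # decay coefficient

# Minimizer of 0.5 * (th - th*)^T H (th - th*) + (lam / 2) * ||th||^2,
# obtained from the stationarity condition (H + lam * I) th = H th*.
theta_pen = np.linalg.solve(H + lam * np.eye(4), H @ theta_star)

# Compare against the predicted rescaling in the eigenbasis of H.
eigvals, Q = np.linalg.eigh(H)     # eigenvalues in ascending order
shrink = eigvals / (eigvals + lam)
coords_pen = Q.T @ theta_pen
coords_pred = shrink * (Q.T @ theta_star)

print(np.allclose(coords_pen, coords_pred))
print(np.round(shrink, 3))         # low-curvature directions shrink the most
```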
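```python
# Decoupled weight decay (AdamW style). `adam_step` is a placeholder for
# the Adam update direction computed from the unpenalized gradient.
theta = (1 - lr * wd) * theta         # shrink the weights directly
theta = theta - lr * adam_step(grad)  # then apply the adaptive step
```
210.2 2. Early Stopping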
Early stopping halts training when performance on a held-out validation set stops improving. A patience parameter \(p\) specifies how many evaluations may pass without improvement before training terminates, and the parameters from the best validation checkpoint are restored.
210.2.1 2.1 Early Stopping as Implicit Regularization
For a quadratic loss optimized by gradient descent starting from \(\theta_0 = 0\), the iterate after \(t\) steps along eigendirection \(i\) is
\[ \theta_t^{(i)} = \left(1 - (1 - \eta \Lambda_i)^t\right) \theta^{\ast (i)} . \]
High-curvature directions converge quickly, low-curvature directions slowly, so stopping at finite \(t\) leaves the slow directions only partially fit. When \(\eta \Lambda_i \ll 1\) and \(\Lambda_i \ll \lambda\), the shrinkage factor \(1 - (1 - \eta \Lambda_i)^t\) approximates the \(L_2\) factor \(\Lambda_i / (\Lambda_i + \lambda)\) under the identification \(\lambda \approx 1 / (\eta t)\). Training for fewer iterations thus acts much like imposing a stronger weight decay penalty, which is why early stopping is sometimes called a regularizer that costs nothing to evaluate.
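The correspondence can be inspected numerically. The sketch below runs gradient descent independently along a few eigendirections and compares the resulting shrinkage with the \(L_2\) factor at \(\lambda = 1/(\eta t)\); the curvature values, step size, and step count are illustrative assumptions.

```python
import numpy as np

eta, t = 0.01, 100
curvatures = np.array([0.01, 0.1, 1.0, 10.0])  # eigenvalues Lambda_i (illustrative)
theta_star = 1.0                                # per-direction optimum

# Gradient descent on 0.5 * Lambda * (theta - theta_star)^2, starting from 0.
theta = np.zeros_like(curvatures)
for _ in range(t):
    theta = theta - eta * curvatures * (theta - theta_star)

gd_factor = 1.0 - (1.0 - eta * curvatures) ** t  # closed form from the text
lam = 1.0 / (eta * t)                            # matched decay coefficient
l2_factor = curvatures / (curvatures + lam)

print(np.allclose(theta / theta_star, gd_factor))
print(np.round(gd_factor, 3))
print(np.round(l2_factor, 3))
```

The two factors agree closely for the lowest-curvature direction and drift apart at intermediate curvature, so the identification is best read as a first-order correspondence rather than an exact equality.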
210.2.2 2.2 Practical Considerations
Early stopping consumes data because the validation split cannot also serve as training data, though a final retraining pass on the union of train and validation sets can recover that budget. The chief advantage is that a single training run sweeps the effective regularization strength along the optimization trajectory, so the practitioner obtains the regularization path for free rather than running a separate experiment per value of \(\lambda\).
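```python
import copy

# Runs after each validation pass; `best`, `wait`, and `ckpt` persist across passes.
if val_loss < best - min_delta:
    # Improvement: reset the patience counter and snapshot the model.
    best, wait, ckpt = val_loss, 0, copy.deepcopy(model)
else:
    wait += 1
    if wait >= patience:
        stop()  # placeholder: end training, then restore ckpt
```
210.3 3. Label Smoothing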
Hard one-hot targets push the network to drive the correct logit toward \(+\infty\) relative to the others, encouraging overconfident and poorly calibrated predictions. Label smoothing softens the target distribution. For \(K\) classes and smoothing strength \(\epsilon\), the target probability for class \(k\), given true class \(y\), becomes
\[ q_k = (1 - \epsilon)\,\mathbb{1}[k = y] + \frac{\epsilon}{K}, \]
so every class receives a small probability mass \(\epsilon / K\) and the true class retains \(1 - \epsilon + \epsilon / K\).
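A minimal sketch of the smoothed target construction follows. The function name `smooth_labels` and the example values are assumptions introduced for illustration.

```python
import numpy as np

def smooth_labels(y, num_classes, eps):
    """Return smoothed targets q of shape (len(y), num_classes).

    y is an integer array of true class indices.
    """
    one_hot = np.eye(num_classes)[y]
    return (1.0 - eps) * one_hot + eps / num_classes

q = smooth_labels(np.array([2, 0]), num_classes=4, eps=0.1)
print(q)
print(q.sum(axis=1))  # each row remains a probability distribution
```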
210.3.1 3.1 Gradient and Calibration Effects
The cross entropy with smoothed targets is \(-\sum_k q_k \log p_k\), where \(p_k\) is the softmax output. Minimizing it drives the logit of the correct class toward a finite gap above the others rather than toward infinity. Concretely, the optimal logit difference between the true class and any other class is a finite value determined by \(\epsilon\) and \(K\), which prevents the saturation that hard labels induce. The empirical consequence is often markedly better calibration: the model’s predicted confidences align more closely with observed accuracies, and expected calibration error drops.
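The finite gap can be made explicit. The short derivation below assumes the model is flexible enough to match the smoothed target exactly on a given example, and writes \(z_k\) for the logits (a symbol introduced here, not used above). Cross entropy \(-\sum_k q_k \log p_k\) is minimized over distributions at \(p = q\), and since \(p_k \propto e^{z_k}\),

\[ z_y - z_k = \log\frac{p_y}{p_k} = \log\frac{1 - \epsilon + \epsilon/K}{\epsilon/K} = \log\!\left(1 + \frac{K(1-\epsilon)}{\epsilon}\right), \qquad k \neq y. \]

For \(K = 10\) and \(\epsilon = 0.1\) the gap is \(\log 91 \approx 4.51\). As \(\epsilon \to 0\) the gap diverges, recovering the hard-label behavior.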
210.3.2 3.2 Representation Geometry
Smoothing also reshapes the penultimate layer. With hard labels, examples of a class can spread arbitrarily far along the direction of their class weight vector. Smoothing encourages examples of the same class to cluster in tight groups that are equidistant from the templates of other classes, producing more compact within-class representations. A documented caveat is that this tighter clustering can erase fine-grained relational information between classes, which degrades knowledge distillation when a label-smoothed network is used as the teacher.
210.4 4. Data Augmentation as Regularization
Data augmentation enlarges the effective training set by applying label preserving transformations \(g \in \mathcal{G}\) to inputs. For images these include random crops, horizontal flips, color jitter, rotations, and learned policies such as AutoAugment and RandAugment.
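Two of the listed transformations, the horizontal flip and the random crop, can be sketched in a few lines. The function name, the reflect padding, and the 50% flip probability are illustrative assumptions rather than a prescribed policy.

```python
import numpy as np

def random_flip_and_crop(img, pad, rng):
    """Randomly flip an (H, W, C) image left-right, then take a random
    H x W crop from a copy padded by `pad` pixels on each side."""
    h, w, _ = img.shape
    if rng.random() < 0.5:
        img = img[:, ::-1, :]  # horizontal flip
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)   # crop offset in [0, 2 * pad]
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
out = random_flip_and_crop(img, pad=4, rng=rng)
print(out.shape)
```

Because a fresh transformation is drawn on every call, each epoch presents the network with a different view of the same labeled example.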
210.4.1 4.1 Vicinal Risk and Invariance
Augmentation can be framed through the vicinal risk minimization principle. Rather than minimizing empirical risk over the raw points, one minimizes risk over a vicinity distribution \(\nu\) around each training example,
\[ \mathcal{R}_{\nu}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\tilde{x} \sim \nu(\cdot \mid x_i)}\,\ell\big(f(\tilde{x}), y_i\big). \]
When the vicinity is generated by transformations \(g\) that humans regard as label preserving, the objective implicitly penalizes sensitivity of \(f\) to those transformations. In the small noise limit this is equivalent to a Tikhonov penalty on the Jacobian of \(f\) in the directions spanned by \(\mathcal{G}\), encoding the desired invariances directly into the function rather than the parameters.
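The link between vicinal risk and a Jacobian penalty can be seen in the simplest case. The sketch below is an illustration under strong assumptions: a linear model, squared loss, and an isotropic Gaussian vicinity standing in for the transformation set \(\mathcal{G}\). In that setting the vicinal risk equals the empirical risk plus \(\sigma^2 \lVert w \rVert^2\) exactly, and the Monte Carlo estimate should land close to it. All names are hypothetical.

```python
import numpy as np

def vicinal_risk(f, loss, xs, ys, sample_vicinity, n_samples, rng):
    """Monte Carlo estimate of the vicinal risk over a dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        losses = [loss(f(sample_vicinity(x, rng)), y) for _ in range(n_samples)]
        total += np.mean(losses)
    return total / len(xs)

rng = np.random.default_rng(0)
w = np.array([1.0, -1.0, 1.0])   # weights of a linear model; its Jacobian is w
sigma = 0.1                      # scale of the Gaussian vicinity
xs = rng.normal(size=(5, 3))
ys = xs @ w + 0.5                # every residual is 0.5, so empirical risk is 0.25

def f(x):
    return x @ w

def sq_loss(pred, y):
    return (pred - y) ** 2

def gaussian_vicinity(x, r):
    return x + sigma * r.normal(size=x.shape)

estimate = vicinal_risk(f, sq_loss, xs, ys, gaussian_vicinity, 4000, rng)
predicted = 0.25 + sigma ** 2 * np.sum(w ** 2)  # empirical risk + Tikhonov term
print(estimate, predicted)
```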
210.4.2 4.2 Practical Notes
Augmentation strength must match dataset size and model capacity. Aggressive policies help large models on small datasets but can cause underfitting when overused. Transformations must also respect task semantics: a vertical flip destroys the label of a digit while preserving that of an aerial scene.
210.5 5. Mixup
Mixup is a vicinal method that constructs synthetic training examples from convex combinations of pairs. Given two examples \((x_i, y_i)\) and \((x_j, y_j)\) with one-hot labels, it forms
\[ \tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \]
with \(\lambda \sim \mathrm{Beta}(\alpha, \alpha)\) for a hyperparameter \(\alpha > 0\). Training proceeds on the interpolated pairs.
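A batch-level sketch of the construction follows. Pairing each example with a randomly permuted partner from the same batch and sharing one \(\lambda\) across the batch are common implementation choices assumed here, not something the definition requires; the function name is hypothetical.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix each example with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)        # one mixing coefficient per batch
    perm = rng.permutation(len(x))      # partner indices
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                  # 8 examples, 5 features
y = np.eye(3)[rng.integers(0, 3, size=8)]    # one-hot labels over 3 classes
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
print(y_mix.sum(axis=1))  # mixed labels still sum to one
```

Small values of \(\alpha\) concentrate \(\lambda\) near 0 or 1, so most synthetic points stay close to a real example; larger values produce stronger interpolation.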
210.5.1 5.1 Why Mixup Regularizes
By training on points between examples, mixup encourages the model to behave approximately linearly in the space between training samples. This linear interpolation prior reduces oscillation outside the training distribution and yields smoother decision boundaries. A first-order analysis shows that mixup is approximately equivalent to a data-dependent regularizer that penalizes the curvature of the loss and shrinks the model’s confidence away from the data manifold, which can improve both generalization and robustness to corrupted inputs and adversarial perturbations.
210.5.2 5.2 Variants and Calibration
CutMix replaces the pixel-level blend with a spatial paste, cutting a rectangular patch from one image into another and mixing the labels in proportion to the patch area, which preserves local image statistics that pixel blending destroys. Manifold mixup applies the interpolation at a randomly chosen hidden layer rather than the input, smoothing representations deeper in the network. Mixup-trained models are also notably better calibrated than their hard-label counterparts, complementing the calibration benefits of label smoothing.
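The CutMix paste can be sketched as follows. The box sampling (side lengths proportional to \(\sqrt{1 - \lambda}\), a uniformly random center, clipping at the image border) is an assumption based on a common reading of the CutMix reference, and the function name is hypothetical. The label weight is recomputed from the area actually pasted, since clipping can shrink the box.

```python
import numpy as np

def cutmix_pair(img_a, img_b, y_a, y_b, alpha, rng):
    """Paste a random rectangle of img_b into img_a and mix one-hot labels
    in proportion to the pasted area. Images have shape (H, W, C)."""
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)
    cut = np.sqrt(1.0 - lam)                    # box side as a fraction of the image
    cut_h, cut_w = int(h * cut), int(w * cut)
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    top, bottom = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    left, right = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    mixed = img_a.copy()
    mixed[top:bottom, left:right] = img_b[top:bottom, left:right]
    # Weight of img_a's label is the fraction of pixels it still owns.
    lam_adj = 1.0 - (bottom - top) * (right - left) / (h * w)
    y_mix = lam_adj * y_a + (1.0 - lam_adj) * y_b
    return mixed, y_mix, lam_adj

rng = np.random.default_rng(0)
img_a, img_b = np.zeros((32, 32, 3)), np.ones((32, 32, 3))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed, y_mix, lam_adj = cutmix_pair(img_a, img_b, y_a, y_b, alpha=1.0, rng=rng)
print(mixed.mean(), 1.0 - lam_adj)  # pasted fraction matches the label weight
```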
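```python
# Input mixup with the loss written as a blend of two cross-entropy terms.
# `beta`, `ce`, and `model` are placeholders (e.g. numpy.random.beta for `beta`).
lam = beta(alpha, alpha)
x = lam * x_i + (1 - lam) * x_j
out = model(x)  # single forward pass on the mixed input
loss = lam * ce(out, y_i) + (1 - lam) * ce(out, y_j)
```
210.6 6. Stochastic Depth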
Stochastic depth regularizes very deep residual networks by randomly dropping entire residual blocks during training. A network with residual blocks \(f_\ell\) computes \(h_\ell = h_{\ell-1} + f_\ell(h_{\ell-1})\) in the standard case. Under stochastic depth a Bernoulli gate \(b_\ell \sim \mathrm{Bernoulli}(p_\ell)\) multiplies each block,
\[ h_\ell = h_{\ell-1} + b_\ell\, f_\ell(h_{\ell-1}), \]
so when \(b_\ell = 0\) the block is skipped and the signal passes through the identity branch unchanged.
210.6.1 6.1 Survival Schedule and Test Time Behavior
The survival probability \(p_\ell\) is typically annealed linearly with depth, \(p_\ell = 1 - \frac{\ell}{L}(1 - p_L)\), so shallow blocks survive almost always and deep blocks are dropped most often. The expected depth of the trained network is \(\sum_\ell p_\ell\), substantially less than the nominal depth \(L\), which shortens gradient paths and speeds training. At test time all blocks are retained and each output is scaled by its survival probability, \(h_\ell = h_{\ell-1} + p_\ell f_\ell(h_{\ell-1})\), so the expected contribution matches the training distribution.
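The schedule and the two forward modes can be sketched together. The function names are hypothetical, and the constant toy blocks in the usage example are chosen only so that the expected training output is easy to compare with the test-time output.

```python
import numpy as np

def survival_probs(num_blocks, p_last):
    """Linear schedule p_l = 1 - (l / L) * (1 - p_L) for l = 1..L."""
    ell = np.arange(1, num_blocks + 1)
    return 1.0 - (ell / num_blocks) * (1.0 - p_last)

def forward(h, blocks, probs, training, rng=None):
    """Residual forward pass with stochastic depth."""
    for f, p in zip(blocks, probs):
        if training:
            if rng.random() < p:   # gate b_l = 1 with probability p_l
                h = h + f(h)
            # otherwise the block is skipped and h passes through unchanged
        else:
            h = h + p * f(h)       # test time: keep every block, scale by p_l
    return h

probs = survival_probs(num_blocks=10, p_last=0.5)
print(probs.sum())  # expected number of active blocks during training

# Toy blocks that each add a constant 1.0, so outputs count contributions.
blocks = [lambda h: 1.0] * 10
eval_out = forward(0.0, blocks, probs, training=False)
train_out = forward(0.0, blocks, probs, training=True, rng=np.random.default_rng(0))
print(eval_out, train_out)
```

With these toy blocks the test-time output equals \(\sum_\ell p_\ell\), which is also the mean of the training-time output over many random gate draws.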
210.6.2 6.2 Ensemble Interpretation
Because each training step samples a random subset of active blocks, stochastic depth trains an implicit ensemble of networks of varying depths that share weights, analogous to the way dropout trains an ensemble of subnetworks. The shorter effective paths also mitigate vanishing gradients, which is why stochastic depth enabled the training of residual networks beyond a thousand layers and remains a standard ingredient in modern vision transformers, where the same block-dropping mechanism is applied to transformer layers.
210.7 7. Choosing and Combining Techniques
These methods are largely complementary and are routinely stacked. A typical image classification recipe combines decoupled weight decay, RandAugment, mixup or CutMix, label smoothing, and stochastic depth simultaneously, with early stopping as a safety net. The guiding principle is that weight decay and early stopping constrain the parameters, augmentation and mixup constrain the function through the data, label smoothing constrains the output distribution, and stochastic depth constrains the architecture during optimization. Because each acts on a different facet of the model, their regularization effects tend to add rather than conflict, though strengths should be tuned jointly since aggressive use of several at once can tip a model from overfitting into underfitting.
210.8 References
- Loshchilov, I., and Hutter, F. Decoupled Weight Decay Regularization. ICLR 2019. https://arxiv.org/abs/1711.05101
- Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 7: Regularization for Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/contents/regularization.html
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception Architecture for Computer Vision. CVPR 2016. https://arxiv.org/abs/1512.00567
- Mueller, R., Kornblith, S., and Hinton, G. When Does Label Smoothing Help? NeurIPS 2019. https://arxiv.org/abs/1906.02629
- Zhang, H., Cisse, M., Dauphin, Y., and Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. ICLR 2018. https://arxiv.org/abs/1710.09412
- Yun, S., Han, D., Oh, S., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. ICCV 2019. https://arxiv.org/abs/1905.04899
- Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Deep Networks with Stochastic Depth. ECCV 2016. https://arxiv.org/abs/1603.09382
- Cubuk, E., Zoph, B., Shlens, J., and Le, Q. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. CVPRW 2020. https://arxiv.org/abs/1909.13719
- Verma, V., Lamb, A., Beckham, C., et al. Manifold Mixup: Better Representations by Interpolating Hidden States. ICML 2019. https://arxiv.org/abs/1806.05236