212  Skip Connections and Residual Learning

Depth is one of the most powerful levers in deep learning. Stacking more layers expands the hypothesis space a network can represent, and many landmark results in vision, language, and speech were unlocked by simply going deeper. Yet for years depth was also a trap. Beyond a certain point, adding layers made networks harder to train and worse on both training and test data. Skip connections, and the residual learning framework built around them, resolved this tension and became a structural primitive that now appears in nearly every large model. This chapter develops the degradation problem that motivated residual learning, the mechanics of the residual block, a gradient-flow argument for why identity shortcuts help, and the dense connectivity pattern that generalizes the idea.

212.1 1. The Degradation Problem

212.1.1 1.1 Depth Should Not Hurt, But It Did

Consider a network \(f_L\) with \(L\) layers that achieves some training error. Now construct a deeper network \(f_{L+k}\) by appending \(k\) extra layers. There is a simple existence argument that the deeper network should be at least as good as the shallower one: set the first \(L\) layers equal to those of \(f_L\), and make each of the \(k\) new layers compute the identity map. The deeper network then reproduces the shallower network exactly, so its training error cannot be higher. By this reasoning depth should never degrade training performance.
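The existence argument can be checked numerically on a toy network. The sketch below is dependency-free Python, not a training recipe; the helper names (`dense_relu`, `forward`, `identity`) and the small weight matrices are illustrative assumptions. It relies on one detail the argument leaves implicit: when every layer ends in a ReLU, activations are non-negative, so an appended layer with identity weights followed by a ReLU passes them through unchanged.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def dense_relu(weights, v):
    """One plain layer: matrix-vector product followed by ReLU."""
    return relu([sum(w * a for w, a in zip(row, v)) for row in weights])

def forward(layers, x):
    """Apply a list of weight matrices as a plain (no-skip) stack."""
    for weights in layers:
        x = dense_relu(weights, x)
    return x

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical shallow network f_L with L = 2 layers on 3-dimensional inputs.
shallow = [
    [[0.5, -1.0, 0.2], [0.3, 0.8, -0.4], [-0.6, 0.1, 0.9]],
    [[1.0, 0.5, -0.3], [-0.2, 0.7, 0.4], [0.6, -0.9, 0.1]],
]
# Deeper network f_{L+k}: the same first L layers plus k = 4 identity layers.
deep = shallow + [identity(3) for _ in range(4)]

x = [1.0, -2.0, 0.5]
out_shallow = forward(shallow, x)
out_deep = forward(deep, x)
print(out_shallow, out_deep)  # the two outputs coincide
```

The construction shows only that such a solution exists inside the deeper network. Whether gradient descent finds it is a separate question.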

Empirically, the opposite was observed. He and colleagues documented that plain stacked convolutional networks reached a point where deeper variants had higher training error than shallower ones, even though the shallow solution was embeddable in the deep architecture. A 56-layer plain network underperformed a 20-layer plain network on the training set of CIFAR-10. This is the degradation problem: accuracy saturates and then declines as depth grows, and the decline shows up in training error, not just test error.

212.1.2 1.2 Why It Is Not Overfitting

The distinction matters. Overfitting produces low training error and high test error. Degradation produces high training error, so it cannot be cured by regularization or more data. It is an optimization pathology, not a generalization pathology. Two contributing factors are worth naming.

First, optimizers struggle to drive a stack of nonlinear layers toward the identity map. A weight layer followed by a nonlinearity has no built-in bias toward identity: with weights initialized near zero its output is close to zero rather than close to its input, and asking many such layers to jointly approximate identity is a poorly conditioned target. Second, even with normalization techniques that tame the variance of activations and gradients, very deep plain networks present loss surfaces whose curvature and conditioning make first-order optimization slow. The conclusion the field reached was that the difficulty lay in the way the function was parameterized, and that a reparameterization could make the same function class far easier to fit.

212.2 2. The Residual Block

212.2.1 2.1 Reformulating the Target

The residual learning idea is a change of variables. Instead of asking a block of layers to learn a desired underlying mapping \(\mathcal{H}(x)\) directly, we ask it to learn the residual

\[ \mathcal{F}(x) = \mathcal{H}(x) - x, \]

and then recover the target by adding the input back:

\[ \mathcal{H}(x) = \mathcal{F}(x) + x. \]

Concretely, a residual block computes

\[ y = \mathcal{F}(x, \{W_i\}) + x, \]

where \(\mathcal{F}\) is a small stack of weight layers, typically two or three convolutions or linear maps with nonlinearities between them. The term \(x\) added at the output is the skip connection or identity shortcut. A canonical two-layer block is

\[ \mathcal{F}(x) = W_2 \, \sigma(W_1 x), \]

with \(\sigma\) a nonlinearity such as the rectified linear unit, after which \(y = \mathcal{F}(x) + x\) passes through a final nonlinearity.

The reformulation does not change what functions can be represented, since \(\mathcal{F}(x) + x\) can express anything \(\mathcal{H}(x)\) can. What changes is the inductive bias of the parameterization. If the optimal mapping for a block is close to the identity, the optimizer only needs to push \(\mathcal{F}\) toward zero, which is easy: small or zero weights already give a near-identity block. The hard target from Section 1, learning identity through a stack of nonlinear layers, becomes the easy target of learning a small perturbation around identity.
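A minimal sketch of the two-layer block makes the "doing nothing is easy" point concrete. It is plain Python on small vectors, it omits the final nonlinearity, and the function names and weight values are assumptions chosen for illustration. With all-zero weights the residual block returns its input exactly, while the same two layers without a shortcut return zero.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(weights, v):
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_branch(w1, w2, x):
    """F(x) = W2 * relu(W1 x), the two-layer residual function."""
    return matvec(w2, relu(matvec(w1, x)))

def residual_block(w1, w2, x):
    """y = F(x) + x, before any final nonlinearity."""
    return [f + a for f, a in zip(residual_branch(w1, w2, x), x)]

def plain_block(w1, w2, x):
    """The same two layers with no shortcut."""
    return residual_branch(w1, w2, x)

x = [1.5, -0.7]
zeros = [[0.0, 0.0], [0.0, 0.0]]
small = [[1e-3, -1e-3], [1e-3, 1e-3]]  # hypothetical near-zero weights

y_res_zero = residual_block(zeros, zeros, x)   # exactly x
y_plain_zero = plain_block(zeros, zeros, x)    # all zeros
y_res_small = residual_block(small, small, x)  # a small perturbation of x
max_deviation = max(abs(a - b) for a, b in zip(y_res_small, x))
```

Near-zero weights are a typical starting region for optimization, so the residual parameterization starts close to the identity while the plain one starts close to the zero map.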

212.2.2 2.2 Matching Dimensions

The addition \(\mathcal{F}(x) + x\) requires that \(\mathcal{F}(x)\) and \(x\) have the same shape. When a block changes the number of channels or the spatial resolution, the shortcut must be adapted. The common choices are a linear projection \(W_s x\) on the shortcut to match dimensions,

\[ y = \mathcal{F}(x, \{W_i\}) + W_s x, \]

or zero padding of the extra channels combined with strided subsampling. The projection variant adds parameters but is otherwise minimal, and identity shortcuts are preferred wherever shapes already agree because they add no parameters and no computation.
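The shape constraint can be illustrated with a small sketch in plain Python on vectors. The function name `shortcut_block`, the optional `w_s` argument, and the weights are assumptions for illustration, and convolutional details such as strides are left out. The block uses the identity shortcut when shapes agree and a projection \(W_s x\) otherwise.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(weights, v):
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def shortcut_block(w1, w2, x, w_s=None):
    """y = F(x) + x when shapes agree, or F(x) + W_s x with a projection."""
    f = matvec(w2, relu(matvec(w1, x)))
    skip = x if w_s is None else matvec(w_s, x)
    if len(skip) != len(f):
        raise ValueError("shortcut and residual branch have different shapes")
    return [a + b for a, b in zip(f, skip)]

x = [1.0, 2.0]                                # 2 input features
w1 = [[1.0, 0.0], [0.0, 1.0]]                 # 2 -> 2
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 2 -> 3, so F(x) has 3 features
w_s = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # 2 -> 3 projection on the shortcut

y = shortcut_block(w1, w2, x, w_s)            # [2.0, 4.0, 3.0]
```

The particular `w_s` used here copies the input and fills the extra feature with zero, so in this vector setting it coincides with the zero-padding option. A learned projection would replace those fixed entries with trainable ones.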

212.2.3 2.3 The Bottleneck Variant

For very deep networks, a bottleneck block reduces compute. It uses three layers: a \(1 \times 1\) convolution that reduces channel dimension, a \(3 \times 3\) convolution at the reduced dimension, and a \(1 \times 1\) convolution that restores dimension, with the identity shortcut wrapping the trio. The bottleneck keeps the expensive spatial convolution narrow, which is what made networks of 50, 101, and 152 layers practical.

input x
  |---------------------------.
  v                           |
[1x1 conv, reduce]            | identity
[3x3 conv]                    |
[1x1 conv, restore]           |
  v                           |
  (+)<------------------------'
  |
 ReLU
  v
output
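A rough weight count shows why keeping the \(3 \times 3\) convolution narrow matters. The sketch below ignores biases and normalization parameters, and the widths (256 channels reduced to 64) are assumed values chosen for illustration rather than taken from the text above.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution, biases ignored."""
    return kernel * kernel * c_in * c_out

def bottleneck_weights(width, reduced):
    """1x1 reduce, 3x3 at the reduced width, 1x1 restore."""
    return (conv_weights(1, width, reduced)
            + conv_weights(3, reduced, reduced)
            + conv_weights(1, reduced, width))

def two_conv_weights(width):
    """Two 3x3 convolutions at full width, for comparison."""
    return 2 * conv_weights(3, width, width)

width, reduced = 256, 64  # illustrative assumption
bottleneck = bottleneck_weights(width, reduced)  # 69632
full_width = two_conv_weights(width)             # 1179648
ratio = full_width / bottleneck                  # roughly 17x
```

Multiply-accumulate counts scale the same way at a fixed spatial resolution, since each weight is applied once per output position.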

212.3 3. Why Identity Shortcuts Help Gradients Flow

212.3.1 3.1 The Vanishing and Exploding Gradient Picture

Backpropagation through a deep plain network multiplies Jacobians layer by layer. If \(z_l\) denotes the activation at layer \(l\) and the loss is \(\mathcal{L}\), then the gradient with respect to an early activation \(z_l\) is a product of Jacobians of all later layers:

\[ \frac{\partial \mathcal{L}}{\partial z_l} = \frac{\partial \mathcal{L}}{\partial z_L} \prod_{i=l}^{L-1} \frac{\partial z_{i+1}}{\partial z_i}. \]

When the spectral norms of these Jacobian factors are consistently below one, the product shrinks geometrically and the gradient reaching early layers vanishes. When they are consistently above one, the product explodes. Either failure mode stalls learning in the layers furthest from the loss, and the problem worsens with depth because the number of factors grows with the number of layers between an activation and the loss.
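One way to make "shrinks geometrically" precise is a norm bound. The following is a sketch under two added assumptions: the Jacobians are square, and a single constant \(\rho\) bounds their spectral norms. The symbols \(J_i\) and \(\rho\) are notation introduced here. Writing \(J_i = \partial z_{i+1} / \partial z_i\), submultiplicativity of the spectral norm gives

\[ \left\| \frac{\partial \mathcal{L}}{\partial z_l} \right\| \le \left\| \frac{\partial \mathcal{L}}{\partial z_L} \right\| \prod_{i=l}^{L-1} \| J_i \|_2 \le \rho^{\,L-l} \left\| \frac{\partial \mathcal{L}}{\partial z_L} \right\| \quad \text{whenever } \| J_i \|_2 \le \rho < 1 \text{ for all } i. \]

Conversely, if every \(J_i\) has smallest singular value at least \(\rho > 1\), the same quantity is bounded below by \(\rho^{\,L-l} \left\| \partial \mathcal{L} / \partial z_L \right\|\), which is the exploding case. With \(\rho = 0.9\) and \(L - l = 50\), the upper bound is already about \(5 \times 10^{-3}\) times the gradient at the output.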

212.3.2 3.2 The Additive Identity Term

Residual connections change the structure of this product. For a residual block \(z_{l+1} = z_l + \mathcal{F}(z_l, W_l)\), the layer Jacobian is

\[ \frac{\partial z_{l+1}}{\partial z_l} = I + \frac{\partial \mathcal{F}(z_l, W_l)}{\partial z_l}. \]

The identity matrix \(I\) is the structural contribution of the skip connection. Unrolling across a stack of residual blocks, the activation at a deep layer \(L\) relates to a shallow layer \(l\) additively:

\[ z_L = z_l + \sum_{i=l}^{L-1} \mathcal{F}(z_i, W_i). \]

This additive form, highlighted in the analysis of residual mappings by He and colleagues, is the key to gradient flow. Differentiating it gives

\[ \frac{\partial \mathcal{L}}{\partial z_l} = \frac{\partial \mathcal{L}}{\partial z_L} \left( I + \frac{\partial}{\partial z_l} \sum_{i=l}^{L-1} \mathcal{F}(z_i, W_i) \right). \]

The crucial feature is the standalone term \(\frac{\partial \mathcal{L}}{\partial z_L}\), which arrives at layer \(l\) untouched by any chain of multiplications. The gradient from the loss propagates back to every layer along the identity path without attenuation, regardless of depth. The residual term adds a correction on top, but it is extremely unlikely to cancel the identity term for a whole minibatch, so the total gradient rarely vanishes. In a plain network the analogous expression is a bare product with no protected additive term, which is why depth alone could starve early layers of signal.
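The identity-plus-correction structure can be verified on a scalar toy model. The sketch below assumes a residual function \(\mathcal{F}(z, w) = w \tanh(z)\) and a handful of hypothetical weights, both chosen only for illustration. It compares the chain-rule product of \(1 + \partial \mathcal{F} / \partial z\) factors against a finite-difference derivative, checks that the result equals one plus the derivative of the summed residuals, and contrasts it with the bare product from the same layers without shortcuts.

```python
import math

def residual_step(z, w):
    """One scalar residual block: z + F(z, w) with F = w * tanh(z)."""
    return z + w * math.tanh(z)

def forward(z, weights):
    for w in weights:
        z = residual_step(z, w)
    return z

def residual_jacobian(z, weights):
    """Chain rule for the residual stack: product of (1 + dF/dz)."""
    prod = 1.0
    for w in weights:
        prod *= 1.0 + w * (1.0 - math.tanh(z) ** 2)
        z = residual_step(z, w)
    return prod

def plain_jacobian(z, weights):
    """Same layers without shortcuts: z -> w * tanh(z), product of dF/dz."""
    prod = 1.0
    for w in weights:
        prod *= w * (1.0 - math.tanh(z) ** 2)
        z = w * math.tanh(z)
    return prod

def residual_sum(z, weights):
    """sum_i F(z_i, W_i), viewed as a function of the starting activation."""
    return forward(z, weights) - z

weights = [0.1, -0.05, 0.2, 0.03, -0.1]  # hypothetical values
z_l, h = 0.4, 1e-6

analytic = residual_jacobian(z_l, weights)
numeric = (forward(z_l + h, weights) - forward(z_l - h, weights)) / (2 * h)
correction = (residual_sum(z_l + h, weights)
              - residual_sum(z_l - h, weights)) / (2 * h)
plain = plain_jacobian(z_l, weights)
```

Here `analytic`, `numeric`, and `1 + correction` agree to within finite-difference error and stay of order one, while `plain` is bounded by the product of the weight magnitudes and is several orders of magnitude smaller.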

212.3.3 3.3 A Linear Sanity Check

A linear special case makes the contrast vivid. Suppose each plain layer multiplies by a scalar \(a\), so the end-to-end map over \(L\) layers is \(a^L\) and the gradient with respect to the weight of any single layer scales as \(a^{L-1}\), the product of the other \(L - 1\) factors. For \(a = 0.9\) and \(L = 100\) the gradient factor is \(0.9^{99} \approx 3 \times 10^{-5}\), small enough to stall learning. Now make each layer residual, multiplying by \(1 + b\) with \(b\) a small learned scalar. The end-to-end map is \((1+b)^L\), and near \(b = 0\) the per-layer Jacobian is \(1 + b \approx 1\), so the product stays close to one as long as \(|b| L\) remains small, rather than collapsing. The skip connection re-centers the multiplicative dynamics around unity, which is exactly where signals neither vanish nor explode.
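The arithmetic of the scalar example is easy to reproduce. The sketch below uses hypothetical function names and two assumed values of \(b\) to show both the benign case and the caveat that the residual form only helps while \(b\) stays small.

```python
def plain_gradient_factor(a, depth):
    """Product of the other depth - 1 scalar weights: a ** (depth - 1)."""
    return a ** (depth - 1)

def residual_gradient_factor(b, depth):
    """Same quantity when each layer multiplies by 1 + b."""
    return (1.0 + b) ** (depth - 1)

depth = 100
plain = plain_gradient_factor(0.9, depth)              # about 3e-5
res_small = residual_gradient_factor(-0.001, depth)    # about 0.91
res_large = residual_gradient_factor(0.1, depth)       # above 1e4

print(f"plain    a = 0.9     : {plain:.3e}")
print(f"residual b = -0.001  : {res_small:.3f}")
print(f"residual b = 0.1     : {res_large:.3e}")
```

A plain layer with \(a = 0.9\) and a residual layer with \(b = -0.1\) are the same scalar map. The difference is where initialization and small weights place the parameter: near \(a = 0\) in one case and near \(1 + b = 1\) in the other.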

212.3.4 3.4 Interpretation as Iterative Refinement

The additive unrolling in Section 3.2 also suggests a useful mental model. A deep residual network can be read as a sequence of small refinements applied to a representation that is carried forward largely intact. Each block nudges the representation by \(\mathcal{F}(z_i, W_i)\) rather than rebuilding it from scratch. This view connects residual networks to the unrolling of iterative algorithms and explains an empirical curiosity: removing or reordering individual residual blocks at test time often degrades accuracy only gradually, consistent with many shallow corrective paths rather than one brittle deep computation. It also clarifies why placing the nonlinearity and normalization so that the identity path stays clean, the so-called pre-activation arrangement, tends to improve very deep training. In pre-activation blocks the shortcut carries the unmodified signal and the entire transformation lives on the residual branch, keeping the additive identity path free of intervening nonlinearities.
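The effect of block ordering on the identity path can be seen in a stripped-down sketch. It omits normalization entirely, uses a zero residual branch as a stand-in so that only the shortcut path is visible, and the function names are assumptions for illustration. In the original ordering the ReLU after the addition clips the carried signal, whereas in the pre-activation ordering the signal passes through every block untouched.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def post_activation_block(branch, x):
    """Original ordering: the ReLU is applied after the addition."""
    return relu([f + a for f, a in zip(branch(x), x)])

def pre_activation_block(branch, x):
    """Pre-activation ordering: the nonlinearity lives on the branch only."""
    return [f + a for f, a in zip(branch(relu(x)), x)]

def zero_branch(v):
    """Stand-in residual branch that contributes nothing."""
    return [0.0 for _ in v]

x = [1.0, -2.0, 0.5]
post, pre = x, x
for _ in range(10):
    post = post_activation_block(zero_branch, post)
    pre = pre_activation_block(zero_branch, pre)

# post == [1.0, 0.0, 0.5]: the negative component was clipped on the shortcut
# pre  == [1.0, -2.0, 0.5]: the shortcut carried the input unchanged
```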

212.4 4. Dense Connections

212.4.1 4.1 From Additive Shortcuts to Concatenative Reuse

Residual networks connect each block to its immediate predecessor through addition. Densely connected networks generalize the reuse of earlier features by connecting every layer to every subsequent layer through concatenation. In a dense block, layer \(\ell\) receives the feature maps of all preceding layers as input:

\[ x_\ell = H_\ell\!\left( [\, x_0, x_1, \ldots, x_{\ell-1} \,] \right), \]

where \([\cdot]\) denotes concatenation along the channel axis and \(H_\ell\) is a composite function, typically batch normalization, a nonlinearity, and a convolution. A block of \(L\) layers therefore contains \(\frac{L(L+1)}{2}\) direct connections rather than the \(L\) connections of a plain stack.
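The connectivity pattern and the connection count can be sketched in a few lines of plain Python. Feature maps are represented as flat lists of channel values, and `toy_layer` is a hypothetical stand-in for \(H_\ell\) that bears no relation to a real batch-norm and convolution composite. The point is only the bookkeeping of who reads from whom.

```python
def toy_layer(growth):
    """Stand-in for H: returns `growth` channels, each the ReLU of the input mean."""
    def h(channels):
        mean = sum(channels) / len(channels)
        return [max(0.0, mean)] * growth
    return h

def dense_block(x0, layer_fns):
    """Each layer reads the concatenation of x_0 and all earlier layer outputs."""
    features = [x0]   # x_0, x_1, ..., one entry per source
    fan_in = []       # number of sources feeding each layer
    for h in layer_fns:
        fan_in.append(len(features))
        concatenated = [c for fmap in features for c in fmap]
        features.append(h(concatenated))
    return features, fan_in

def direct_connections(num_layers):
    """Closed form L(L+1)/2 for a dense block of L layers."""
    return num_layers * (num_layers + 1) // 2

num_layers, growth = 5, 2
x0 = [1.0, -0.5, 2.0]
features, fan_in = dense_block(x0, [toy_layer(growth) for _ in range(num_layers)])

# fan_in == [1, 2, 3, 4, 5], and sum(fan_in) == direct_connections(5) == 15
```

Layer \(\ell\) has \(\ell\) incoming connections, so the total is \(1 + 2 + \cdots + L\), which is where the closed form comes from.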

212.4.2 4.2 Growth Rate and Parameter Efficiency

Because inputs accumulate through concatenation, each layer needs to produce only a small number of new feature maps. This number is the growth rate \(k\): if the input to the block has \(k_0\) channels, then layer \(\ell\) sees \(k_0 + k(\ell - 1)\) channels. A small growth rate, often a few dozen channels, keeps the network narrow while still giving later layers access to all earlier features. Dense connectivity is strikingly parameter efficient for this reason. Each layer adds only its own thin slice of new features, and because every layer can read the collective knowledge of the block, the network avoids relearning redundant representations. Reported results showed dense networks matching or exceeding the accuracy of residual networks with substantially fewer parameters.
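The channel bookkeeping is simple enough to tabulate. The sketch below assumes each \(H_\ell\) is a single \(3 \times 3\) convolution producing \(k\) maps, ignores biases and normalization, and uses assumed values \(k_0 = 24\), \(k = 12\), and six layers. The comparison with a constant-width plain stack is a crude illustration of the arithmetic, not a benchmark.

```python
def input_channels(k0, k, layer):
    """Channels seen by layer `layer` (1-indexed): k0 + k * (layer - 1)."""
    return k0 + k * (layer - 1)

def block_output_channels(k0, k, num_layers):
    """Width of the concatenated output after all layers."""
    return k0 + k * num_layers

def dense_conv_weights(k0, k, num_layers):
    """Total 3x3 weights if each layer maps its full input to k new maps."""
    return sum(9 * input_channels(k0, k, layer) * k
               for layer in range(1, num_layers + 1))

def plain_conv_weights(width, num_layers):
    """Total 3x3 weights of a plain stack at constant width."""
    return num_layers * 9 * width * width

k0, k, num_layers = 24, 12, 6  # illustrative assumption
seen = [input_channels(k0, k, layer) for layer in range(1, num_layers + 1)]
out_width = block_output_channels(k0, k, num_layers)   # 96
dense_total = dense_conv_weights(k0, k, num_layers)    # 34992
plain_total = plain_conv_weights(out_width, num_layers)  # 497664
```

Here `seen` is `[24, 36, 48, 60, 72, 84]`. The dense block reaches a 96-channel representation with far fewer convolution weights than six full-width layers, because each layer only has to produce 12 new maps.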

212.4.3 4.3 Gradient Flow in Dense Networks

The gradient argument carries over with a concatenative twist. Every layer has a direct path to the loss through the layers it feeds, and because connections are formed by concatenation rather than summation, the signal from each source layer remains identifiable rather than blended. Each layer thus receives gradients along many short paths, and the implicit deep supervision this creates is part of why dense networks train well. Transitions between dense blocks use pooling and \(1 \times 1\) convolutions to reduce the channel count, preventing the concatenated width from growing without bound across the full network.

212.4.4 4.4 Choosing Between Additive and Concatenative Connectivity

Additive and concatenative shortcuts embody different trade-offs. Addition preserves a fixed channel width, costs nothing in extra memory at the join, and biases each block toward small perturbations of its input, which suits the iterative-refinement reading. Concatenation preserves earlier features verbatim and lets later layers select among them, at the cost of growing channel counts and higher activation memory. In practice residual addition dominates very large models, where its constant width and clean identity path scale gracefully, while dense concatenation remains attractive when parameter and feature efficiency are the priority. Both descend from the same insight: give the optimizer a direct, low-resistance route between distant layers, and depth becomes an asset rather than a liability.
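The width trade-off can be made concrete with a short sketch. Feature maps are flat lists of channel values, and the width of 64, growth of 12, and four joins are assumed numbers for illustration.

```python
def additive_join(x, f):
    """Residual-style join: element-wise sum, width unchanged."""
    if len(x) != len(f):
        raise ValueError("addition requires matching widths")
    return [a + b for a, b in zip(x, f)]

def concat_join(x, f):
    """Dense-style join: earlier channels are kept verbatim, width grows."""
    return x + f

width, growth, steps = 64, 12, 4  # illustrative assumption
x_add = [0.0] * width
x_cat = [0.0] * width
add_widths, cat_widths = [], []
for _ in range(steps):
    x_add = additive_join(x_add, [1.0] * width)   # branch must match the width
    x_cat = concat_join(x_cat, [1.0] * growth)    # branch adds `growth` channels
    add_widths.append(len(x_add))
    cat_widths.append(len(x_cat))

# add_widths == [64, 64, 64, 64]
# cat_widths == [76, 88, 100, 112]
```

After the joins, the first 64 entries of `x_cat` are still the original values, while `x_add` holds only the blended sum. That is the sense in which concatenation keeps sources identifiable and pays for it in activation memory.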

212.5 5. Summary

The degradation problem revealed that depth, by itself, can make optimization harder even when a deeper network provably contains a good shallow solution. Residual learning reframes each block to model a residual around the identity, so that doing nothing is the easy default and the optimizer learns only the needed correction. The identity shortcut inserts an additive term into the backpropagation product, creating an unattenuated path for gradients to reach every layer and re-centering the network’s multiplicative dynamics near unity. Dense connectivity pushes feature reuse further by concatenating all earlier outputs, trading width for parameter efficiency and many short gradient paths. Together these patterns turned very deep networks from a curiosity into the default, and the skip connection is now a structural assumption rather than a design choice.

212.6 References

  1. He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. CVPR, 2016. https://arxiv.org/abs/1512.03385
  2. He, K., Zhang, X., Ren, S., and Sun, J. Identity Mappings in Deep Residual Networks. ECCV, 2016. https://arxiv.org/abs/1603.05027
  3. Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely Connected Convolutional Networks. CVPR, 2017. https://arxiv.org/abs/1608.06993
  4. Veit, A., Wilber, M., and Belongie, S. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NeurIPS, 2016. https://arxiv.org/abs/1605.06431
  5. Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway Networks. arXiv, 2015. https://arxiv.org/abs/1505.00387
  6. Glorot, X., and Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a.html
  7. Ioffe, S., and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015. https://arxiv.org/abs/1502.03167