211  Vanishing and Exploding Gradients

Training deep and recurrent neural networks by gradient descent depends on the assumption that a useful learning signal can travel from the loss at the output back to parameters in the earliest layers. When that signal shrinks toward zero or grows without bound as it propagates, optimization stalls or diverges. These two failure modes, known as the vanishing gradient problem and the exploding gradient problem, were the central obstacle to training very deep architectures for much of the field’s history. This chapter develops the mathematics of why they arise, explains the roles of activation functions and weight initialization, and surveys the practical and architectural remedies that make modern deep learning feasible.

211.1 1. The Mechanics of Backpropagation Through Depth

Consider a feedforward network with \(L\) layers. Layer \(\ell\) computes a preactivation \(z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}\) followed by an elementwise nonlinearity \(a^{(\ell)} = \phi(z^{(\ell)})\), with \(a^{(0)} = x\) the input. Let \(\mathcal{L}\) be a scalar loss. Backpropagation computes the error signal \(\delta^{(\ell)} = \partial \mathcal{L} / \partial z^{(\ell)}\) by the recursion

\[ \delta^{(\ell)} = \left( W^{(\ell+1)} \right)^{\top} \delta^{(\ell+1)} \odot \phi'\!\left(z^{(\ell)}\right), \]

where \(\odot\) is the Hadamard product. The gradient with respect to the weights of layer \(\ell\) is \(\partial \mathcal{L} / \partial W^{(\ell)} = \delta^{(\ell)} \left(a^{(\ell-1)}\right)^{\top}\).

Unrolling the recursion shows that the error signal at an early layer is a product of many Jacobian factors. Writing \(D^{(\ell)} = \operatorname{diag}\!\left(\phi'(z^{(\ell)})\right)\), so that the recursion reads \(\delta^{(\ell)} = D^{(\ell)} \left(W^{(\ell+1)}\right)^{\top} \delta^{(\ell+1)}\), the signal that reaches layer \(\ell\) from the output layer \(L\) is

\[ \delta^{(\ell)} = \left( \prod_{k=\ell}^{L-1} D^{(k)} \left(W^{(k+1)}\right)^{\top} \right) \delta^{(L)}_{\text{local}}, \]

where the factors are ordered left to right by increasing \(k\) and \(\delta^{(L)}_{\text{local}} = \partial \mathcal{L} / \partial z^{(L)}\) is the error signal computed directly at the output layer.

The magnitude of \(\delta^{(\ell)}\) is therefore governed by a product of \(L - \ell\) Jacobian factors. Products of many matrices tend to behave either like a contraction, collapsing toward zero, or like an expansion, blowing up, unless their multiplicative factors are carefully balanced near one. This is the structural origin of both problems.
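The decay of the error signal with depth can be observed directly in a small experiment. The following is a minimal sketch in pure Python, assuming a square tanh network without biases, Gaussian weights, and a random vector standing in for the output error; the function name `layer_error_norms` and all sizes are illustrative choices, not part of the chapter's formal development.

```python
import math
import random


def layer_error_norms(depth, width, weight_std, seed=0):
    """Backpropagate a random error signal through a tanh network.

    Returns the Euclidean norms of the error signals, ordered from the
    first layer to the last layer.
    """
    rng = random.Random(seed)
    # weights[l] plays the role of W^(l+1); biases are omitted for brevity.
    weights = [[[rng.gauss(0.0, weight_std) for _ in range(width)]
                for _ in range(width)] for _ in range(depth)]

    # Forward pass: store the preactivations of every layer.
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    zs = []
    for W in weights:
        z = [sum(w * x for w, x in zip(row, a)) for row in W]
        zs.append(z)
        a = [math.tanh(v) for v in z]

    # Backward pass: delta_l = (W_{l+1})^T delta_{l+1} * tanh'(z_l).
    delta = [rng.gauss(0.0, 1.0) for _ in range(width)]  # stand-in output error
    norms = [math.sqrt(sum(d * d for d in delta))]
    for l in range(depth - 2, -1, -1):
        W_next = weights[l + 1]
        back = [sum(W_next[i][j] * delta[i] for i in range(width))
                for j in range(width)]
        delta = [b * (1.0 - math.tanh(z) ** 2) for b, z in zip(back, zs[l])]
        norms.append(math.sqrt(sum(d * d for d in delta)))
    return norms[::-1]


if __name__ == "__main__":
    for layer, norm in enumerate(layer_error_norms(20, 32, 0.05), start=1):
        print(f"layer {layer:2d}: {norm:.3e}")
```

With a weight standard deviation of 0.05 and width 32, each backward step shrinks the signal by a factor of roughly \(\sqrt{32} \times 0.05 \approx 0.28\) before the tanh slope is applied, so the printed norms fall by many orders of magnitude between the last layer and the first.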

211.2 2. A Quantitative Account of the Two Regimes

211.2.1 2.1 Scalar intuition

A single neuron chained through \(L\) layers makes the mechanism transparent. Suppose each layer multiplies the backpropagated signal by a scalar factor \(w \, \phi'(z)\). After \(L\) layers the signal scales by \(\left(w \, \phi'(z)\right)^{L}\). If \(|w \, \phi'(z)| < 1\), the factor decays geometrically and the gradient vanishes; if \(|w \, \phi'(z)| > 1\), it grows geometrically and the gradient explodes. Only the knife edge at exactly one preserves the signal, and random products do not sit on that edge by accident.
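The geometric behavior is easy to verify numerically. The sketch below uses a hypothetical helper, `scalar_gradient_scale`, and three illustrative per-layer factors to evaluate \(|w \, \phi'(z)|^{L}\) for a chain of 100 layers.

```python
def scalar_gradient_scale(factor: float, depth: int) -> float:
    """Return |factor| ** depth, the scaling of a signal after `depth` layers."""
    return abs(factor) ** depth


if __name__ == "__main__":
    for factor in (0.9, 1.0, 1.1):
        print(f"factor {factor}: scale {scalar_gradient_scale(factor, 100):.3e}")
```

At depth 100, a factor of 0.9 shrinks the signal to about \(2.7 \times 10^{-5}\), while a factor of 1.1 amplifies it to about \(1.4 \times 10^{4}\); a ten percent deviation from one in either direction is enough to change the gradient by four to five orders of magnitude.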

211.2.2 2.2 Spectral and norm bounds

The matrix case generalizes the scalar one through operator norms and singular values. Using submultiplicativity,

\[ \left\| \delta^{(\ell)} \right\| \le \left( \prod_{k=\ell}^{L-1} \left\| D^{(k)} \right\| \, \left\| W^{(k+1)} \right\| \right) \left\| \delta^{(L)}_{\text{local}} \right\|, \]

where the norm of a matrix denotes its operator norm, that is, its largest singular value. If every \(\|W^{(k)}\| \le \gamma\) and every activation derivative is bounded by \(\beta = \sup_z |\phi'(z)|\), then \(\|\delta^{(\ell)}\| \le (\gamma \beta)^{L-\ell} \, \|\delta^{(L)}_{\text{local}}\|\). When \(\gamma \beta < 1\) this upper bound decays exponentially in depth, which guarantees vanishing. Conversely, a complementary lower bound built from the smallest singular value shows that when the smallest singular value times the minimal activation slope exceeds one, the signal must grow exponentially. Between these two extremes the bounds are inconclusive, but the product \(\gamma \beta\) relative to one remains the single most important quantity to reason about.
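As a rough worked example, assume sigmoid activations, so that \(\beta = 1/4\), and weight matrices whose largest singular value is at most \(\gamma = 2\); both numbers are illustrative. Ten layers of backpropagation then give

\[ \frac{\left\| \delta^{(\ell)} \right\|}{\left\| \delta^{(L)}_{\text{local}} \right\|} \le (\gamma \beta)^{L-\ell} = \left( 2 \cdot \tfrac{1}{4} \right)^{10} = 2^{-10} \approx 9.8 \times 10^{-4}, \]

so the signal reaching the early layer is at most about one thousandth of the output signal. Only once \(\gamma\) reaches 4 does this bound stop forcing decay.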

211.2.3 2.3 The recurrent case

Recurrent networks make the problem acute because the same weight matrix is reused at every time step. A vanilla recurrent network has hidden state \(h_t = \phi(W h_{t-1} + U x_t)\). Backpropagation through time over a sequence of length \(T\) multiplies one Jacobian \(J_k = D_k W\) per time step, where \(D_k = \operatorname{diag}\!\left(\phi'(W h_{k-1} + U x_k)\right)\) changes from step to step but \(W\) is identical in every factor. The sensitivity of a late state to an early state is

\[ \frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} D_k W . \]

Because \(W\) is shared, the relevant quantity is its spectral radius \(\rho(W)\). Building on the analysis of Bengio and colleagues, who showed that storing information robustly over long horizons forces gradients to decay [1], Pascanu and colleagues showed that \(\rho(W) < 1/\beta\), with \(\beta\) the bound on the activation slope (so \(\rho(W) < 1\) for \(\tanh\)), is a sufficient condition for gradients to vanish, while \(\rho(W) > 1/\beta\) is a necessary condition for them to explode [5]. Over sequences hundreds or thousands of steps long, even tiny deviations from this threshold compound into enormous or negligible factors, which is why vanilla recurrent networks struggle to learn long range dependencies.
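The role of the spectral radius can be illustrated with a toy linear recurrence, in which \(\phi\) is the identity and every \(D_k\) is the identity matrix. The sketch below assumes \(W\) is \(\rho\) times a \(2 \times 2\) rotation, an illustrative choice for which the spectral radius and both singular values equal \(\rho\); the helper names and the angle are hypothetical.

```python
import math


def matmul2(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def bptt_sensitivity_norm(rho, steps, theta=0.3):
    """Frobenius norm of W ** steps for W = rho * rotation(theta).

    In a linear recurrence this equals the norm of dh_T / dh_t
    with T - t = steps.
    """
    c, s = math.cos(theta), math.sin(theta)
    W = [[rho * c, -rho * s], [rho * s, rho * c]]
    J = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        J = matmul2(J, W)
    return math.sqrt(sum(x * x for row in J for x in row))


if __name__ == "__main__":
    for rho in (0.95, 1.0, 1.05):
        print(f"rho = {rho}: norm after 200 steps = "
              f"{bptt_sensitivity_norm(rho, 200):.3e}")
```

For this particular matrix the norm is exactly \(\sqrt{2}\,\rho^{\text{steps}}\), so a five percent change in \(\rho\) on either side of one separates a sensitivity near \(10^{-5}\) from one above \(10^{4}\) after 200 steps. A saturating nonlinearity would only shrink these values further.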

211.3 3. The Role of Activation Functions

The activation derivative \(\phi'\) enters every factor of the product, so its typical magnitude directly controls the regime.

211.3.1 3.1 Saturating nonlinearities

The logistic sigmoid \(\sigma(z) = 1/(1 + e^{-z})\) has derivative \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), which peaks at \(0.25\) when \(z = 0\) and falls toward zero as \(|z|\) grows. The hyperbolic tangent has derivative \(\tanh'(z) = 1 - \tanh^2(z)\), peaking at one but also collapsing in the saturated tails. Two consequences follow. First, even in the best case the sigmoid derivative attenuates the signal by a factor of at least four per layer, so unless the weights compensate, the gradient through a stack of sigmoids vanishes quickly. Second, once a unit saturates, its derivative is nearly zero and it stops passing any signal, a state from which it rarely recovers.
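These slopes can be tabulated in a few lines. The sketch below evaluates both derivatives at a handful of illustrative preactivations and computes the best-case attenuation from the sigmoid slope alone over ten layers; the function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z: float) -> float:
    """Derivative of tanh: 1 - tanh(z) ** 2."""
    return 1.0 - math.tanh(z) ** 2


if __name__ == "__main__":
    for z in (0.0, 2.0, 5.0):
        print(f"z = {z}: sigmoid' = {sigmoid_grad(z):.4f}, "
              f"tanh' = {tanh_grad(z):.4f}")
    # Best case for ten stacked sigmoids: every unit sits exactly at z = 0.
    print("0.25 ** 10 =", 0.25 ** 10)
```

Even with every unit at its most responsive point, ten sigmoid layers contribute a factor of \(0.25^{10} \approx 9.5 \times 10^{-7}\) from the activation slopes alone.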

211.3.2 3.2 Piecewise linear nonlinearities

The rectified linear unit \(\operatorname{ReLU}(z) = \max(0, z)\) has derivative equal to one for positive preactivations and zero otherwise. On the active path the derivative is exactly one, so \(\phi'\) no longer shrinks the signal multiplicatively, which is the main reason ReLU networks train at far greater depth than sigmoid networks. The price is the dead unit phenomenon: a neuron whose preactivation is negative for all inputs receives no gradient and never updates. Variants address this by giving the negative branch a nonzero slope. The leaky ReLU uses \(\phi(z) = \max(\alpha z, z)\) with a small fixed \(\alpha\), the parametric ReLU learns \(\alpha\), and the exponential linear unit and GELU provide smooth alternatives with similar gradient preserving behavior. The shared theme is keeping \(|\phi'|\) near one over the operating range.
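The difference between the plain and leaky rectifier shows up most clearly in their derivatives. The following sketch, with hypothetical function names and the common but arbitrary choice \(\alpha = 0.01\), also includes a simple check for the dead unit condition described above.

```python
def relu_grad(z: float) -> float:
    """Slope of ReLU: one on the active side, zero otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """Leaky ReLU for 0 < alpha < 1: max(alpha * z, z)."""
    return max(alpha * z, z)


def leaky_relu_grad(z: float, alpha: float = 0.01) -> float:
    """Slope of leaky ReLU: one on the active side, alpha otherwise."""
    return 1.0 if z > 0 else alpha


def is_dead(preactivations) -> bool:
    """A ReLU unit is dead on a dataset if it is never active on it."""
    return all(relu_grad(z) == 0.0 for z in preactivations)


if __name__ == "__main__":
    zs = [-3.0, -0.5, -0.1]  # preactivations of one unit across a dataset
    print("dead under ReLU:", is_dead(zs))
    print("leaky slopes:", [leaky_relu_grad(z) for z in zs])
```

The unit in this example passes no gradient under ReLU, whereas the leaky variant still passes a fraction \(\alpha\) of the signal, which is what allows the unit to recover.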

211.4 4. Weight Initialization

If activation slopes are kept near one, the remaining control variable is the scale of the weights, fixed at initialization. The goal is to choose the initial distribution so that the variance of activations in the forward pass and the variance of gradients in the backward pass are preserved across layers.

211.4.1 4.1 Variance propagation

Assume weights are drawn independently with mean zero and variance \(\operatorname{Var}(W)\), inputs are normalized, and the nonlinearity is roughly linear near the origin. For a layer with \(n_{\text{in}}\) inputs, the variance of a preactivation is approximately \(n_{\text{in}} \operatorname{Var}(W) \operatorname{Var}(a)\). To keep \(\operatorname{Var}(z) = \operatorname{Var}(a)\) across the forward pass we need \(n_{\text{in}} \operatorname{Var}(W) = 1\). Symmetrically, preserving gradient variance in the backward pass requires \(n_{\text{out}} \operatorname{Var}(W) = 1\).
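The forward condition can be checked by simulation. The sketch below is a Monte Carlo estimate under the same assumptions as the derivation (independent zero-mean Gaussian weights, unit-variance inputs, no nonlinearity); the function name, sample count, and layer size are illustrative.

```python
import random
import statistics


def preactivation_variance(n_in, weight_var, samples=2000, seed=0):
    """Estimate Var(z) for z = sum_j W_j * a_j with Var(a_j) = 1."""
    rng = random.Random(seed)
    weight_std = weight_var ** 0.5
    zs = []
    for _ in range(samples):
        z = sum(rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0)
                for _ in range(n_in))
        zs.append(z)
    return statistics.pvariance(zs)


if __name__ == "__main__":
    n_in = 100
    for weight_var in (0.25 / n_in, 1.0 / n_in, 4.0 / n_in):
        print(f"n_in * Var(W) = {n_in * weight_var:.2f}: "
              f"Var(z) is about {preactivation_variance(n_in, weight_var):.3f}")
```

The estimate tracks \(n_{\text{in}} \operatorname{Var}(W)\), so only the choice \(\operatorname{Var}(W) = 1/n_{\text{in}}\) leaves the variance unchanged from one layer to the next; any other choice compounds geometrically with depth.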

211.4.2 4.2 Xavier and He schemes

Glorot and Bengio reconciled the two conditions by replacing the fan in and the fan out with their average, giving the Xavier initialization

\[ \operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}, \]

which is appropriate for symmetric activations such as \(\tanh\) [2]. Rectifiers zero out half of their inputs in expectation, halving the variance per layer, so He and colleagues corrected the factor to

\[ \operatorname{Var}(W) = \frac{2}{n_{\text{in}}}, \]

which restores signal scale for ReLU networks and enabled the training of very deep convolutional models [3]. A useful refinement for deep linear and ReLU stacks is orthogonal initialization, which sets every singular value of each weight matrix to one. In a deep linear network the Jacobian product is then an exact isometry at the start of training, a condition known as dynamical isometry and associated with stable signal propagation [4]; with ReLU the activation masks still enter the product, so the isometry no longer holds exactly.

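# Initialization scale (NumPy sketch; the layer sizes are illustrative)
import numpy as np
fan_in, fan_out = 512, 256
he_std = np.sqrt(2.0 / fan_in)                          # ReLU family
xavier_std = np.sqrt(2.0 / (fan_in + fan_out))          # tanh, sigmoid
W_he = he_std * np.random.randn(fan_out, fan_in)        # He-initialized weight matrix
Q, _ = np.linalg.qr(np.random.randn(fan_in, fan_out))   # Q has orthonormal columns
W_orth = 1.0 * Q.T                                      # orthogonal init, gain = 1.0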

Initialization can place a network in a healthy regime, but it cannot guarantee the network stays there as weights drift during training. That motivates mechanisms that act throughout optimization.

211.5 5. Gradient Clipping

Exploding gradients can be addressed directly at the moment they occur. Gradient clipping rescales the gradient whenever its norm exceeds a threshold \(\tau\). Given the full gradient \(g\), norm clipping applies

\[ \hat{g} = \begin{cases} g & \text{if } \|g\| \le \tau, \\[4pt] \tau \, \dfrac{g}{\|g\|} & \text{if } \|g\| > \tau. \end{cases} \]

Because the rescaling preserves direction and only shortens the step, the update still points downhill while its magnitude is bounded, which prevents a single sharp region of the loss surface from launching the parameters far away. Pascanu and colleagues introduced this technique precisely for recurrent training, where the loss landscape contains steep walls that a normal step would overshoot [5]. Clipping treats the symptom of explosion rather than its cause, and it does nothing for vanishing gradients, so it is best understood as a stabilizer applied alongside the structural fixes below.

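# Norm clipping (NumPy sketch; grads is a list of per-parameter gradient arrays)
import numpy as np
def clip_by_global_norm(grads, tau, eps=1e-6):
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > tau:
        scale = tau / (total_norm + eps)
        grads = [g * scale for g in grads]  # same factor for every parameter gradient
    return grads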

211.6 6. Normalization Layers

Normalization layers keep the distribution of activations stable as training proceeds, which indirectly conditions the gradient flow. Batch normalization standardizes each feature across the minibatch,

\[ \hat{z} = \frac{z - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y = \gamma \hat{z} + \beta, \]

with learnable scale \(\gamma\) and shift \(\beta\) [6]; these symbols follow the usual batch normalization notation and are unrelated to the norm bounds \(\gamma\) and \(\beta\) used earlier in the chapter. By holding preactivation statistics near unit variance, batch normalization keeps activation slopes in their responsive range and reduces the dependence of each layer’s gradient on the scale of distant layers, which permits higher learning rates and deeper stacks. Layer normalization computes the same standardization across features within a single example rather than across the batch, which removes the dependence on batch size and makes it the normalization of choice in recurrent and transformer architectures [7]. These methods do not eliminate the underlying product of Jacobians, but they prevent the runaway drift of activation scale that would otherwise push the network into a vanishing or exploding regime.
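The two normalizations differ only in the axis over which statistics are computed. The sketch below makes this concrete on a minibatch stored as a list of rows, one row per example. It is a simplified illustration under stated assumptions: it covers training-mode statistics only, omits the running averages that batch normalization uses at inference, and uses a single scalar scale and shift where practical layers learn one per feature. All function names are hypothetical.

```python
import math


def standardize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature (column) across the minibatch (rows)."""
    columns = [standardize(list(col), eps) for col in zip(*batch)]
    return [[gamma * v + beta for v in row] for row in zip(*columns)]


def layer_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each example (row) across its own features."""
    return [[gamma * v + beta for v in standardize(row, eps)] for row in batch]


if __name__ == "__main__":
    batch = [[1.0, 10.0], [3.0, 30.0]]  # two examples, two features
    print("batch norm:", batch_norm(batch))
    print("layer norm:", layer_norm(batch))
```

Because `layer_norm` never looks across rows, its output for an example is the same whether the batch holds one example or a thousand, which is the property that makes it attractive for recurrent and transformer models.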

211.7 7. Architectural Fixes

The most durable solutions change the function being differentiated so that a short, well conditioned path always exists for the gradient.

211.7.1 7.1 Residual connections

A residual block computes \(a^{(\ell)} = a^{(\ell-1)} + F(a^{(\ell-1)})\), adding the input back to the transformed output [8]. The Jacobian of the block is \(I + \partial F / \partial a^{(\ell-1)}\), so the backward recursion becomes

\[ \frac{\partial \mathcal{L}}{\partial a^{(\ell-1)}} = \frac{\partial \mathcal{L}}{\partial a^{(\ell)}} \left( I + \frac{\partial F}{\partial a^{(\ell-1)}} \right). \]

Expanding the product across many blocks yields a sum that includes the identity term, which carries the gradient unattenuated regardless of how small the residual branches are. The identity path is the reason networks hundreds and even thousands of layers deep can be optimized, and it is the structural backbone of modern convolutional and transformer models.
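A scalar caricature shows how much the identity term matters. In the sketch below each block's branch Jacobian \(\partial F / \partial a\) is replaced by a single number, an assumption made purely for illustration, and the backward gain of a plain chain is compared with that of a residual chain; the function names are hypothetical.

```python
def plain_chain_gain(branch_slopes):
    """Backward gain of a plain stack: the product of the branch slopes."""
    gain = 1.0
    for s in branch_slopes:
        gain *= s
    return gain


def residual_chain_gain(branch_slopes):
    """Backward gain of a residual stack: the product of (1 + slope)."""
    gain = 1.0
    for s in branch_slopes:
        gain *= 1.0 + s
    return gain


if __name__ == "__main__":
    slopes = [0.01] * 50  # fifty blocks with very weak branches
    print("plain   :", plain_chain_gain(slopes))
    print("residual:", residual_chain_gain(slopes))
```

With fifty branches of slope 0.01, the plain chain has gain \(10^{-100}\), while the residual chain has gain \(1.01^{50} \approx 1.64\): the gradient passes through essentially intact even though every branch is nearly inactive.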

211.7.2 7.2 Gated recurrent architectures

Recurrent architectures solve their version of the problem with additive memory and multiplicative gates. The long short term memory network maintains a cell state \(c_t\) updated by

\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \]

where the forget gate \(f_t\), input gate \(i_t\), and a candidate \(\tilde{c}_t\) are functions of the input and previous hidden state [9]. The recurrence in \(c_t\) is additive rather than a repeated matrix multiplication, so when the forget gate stays near one the cell state forms a near identity path through time, an analogue of the residual connection that lets gradients persist over long horizons. This mechanism, the constant error carousel, is what allows the long short term memory network and the simpler gated recurrent unit to learn dependencies across hundreds of time steps that defeat vanilla recurrence.
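Along the direct cell-state path, the sensitivity \(\partial c_T / \partial c_t\) is simply the product of the intervening forget gates. The sketch below illustrates this for a single scalar cell, under the simplifying assumption that the gate values are fixed constants; in a real long short term memory network the gates depend on the previous hidden state, which adds further gradient paths not modeled here. The function names are hypothetical.

```python
def run_cell(c0, forget_gates, input_gates, candidates):
    """Apply c_t = f_t * c_{t-1} + i_t * g_t for a scalar cell state."""
    c = c0
    for f, i, g in zip(forget_gates, input_gates, candidates):
        c = f * c + i * g
    return c


def cell_path_gain(forget_gates):
    """Sensitivity of the final cell state to the initial one: prod of f_t."""
    gain = 1.0
    for f in forget_gates:
        gain *= f
    return gain


if __name__ == "__main__":
    steps = 100
    for f in (0.5, 0.9, 0.99):
        print(f"forget gate {f}: gain over {steps} steps = "
              f"{cell_path_gain([f] * steps):.3e}")
```

A forget gate held at 0.99 retains a gain of about 0.37 over 100 steps, whereas a gate at 0.5 reduces it below \(10^{-30}\), which is why keeping the forget gate near one is what preserves long range gradients.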

211.7.3 7.3 Attention and the elimination of depth in time

Self attention sidesteps the temporal product entirely by connecting every position to every other position in a single layer [10]. The path length between two tokens is constant rather than proportional to their distance, so a gradient between distant positions traverses a fixed and small number of operations. Removing the long multiplicative chain of recurrence is a large part of why attention based models train stably on very long sequences, and it explains the shift away from recurrent architectures for sequence modeling.

211.8 8. Practical Diagnosis and Synthesis

In practice the regime of a network is diagnosed by monitoring gradient norms per layer. Norms that shrink by orders of magnitude from output to input indicate vanishing gradients, while norms that spike or produce numerical overflow indicate explosion. The remedies compose rather than compete. A modern recipe combines a non-saturating activation with derivative near one, variance preserving initialization, a normalization layer, residual or gated paths for signal flow, and gradient clipping as a safety net against rare spikes. Each addresses a different factor in the product of Jacobians: the activation controls \(\phi'\), initialization sets the initial \(\|W\|\), normalization holds activation scale steady through training, architecture inserts an identity path that bypasses the product, and clipping caps the worst case step. Together they convert the once formidable barrier of training deep and recurrent networks into routine engineering, and understanding which factor each one controls is what allows a practitioner to diagnose and repair a network that fails to learn.
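The monitoring step can be reduced to a small rule of thumb. The sketch below classifies a list of per-layer gradient norms, ordered from the input layer to the output layer; the function name and both thresholds are hypothetical and would need tuning for any real model.

```python
import math


def diagnose_gradient_norms(norms, ratio_threshold=1e3, explode_threshold=1e3):
    """Classify per-layer gradient norms, ordered from input to output.

    Returns "exploding", "vanishing", or "healthy". The thresholds are
    illustrative defaults, not established constants.
    """
    if any(math.isnan(n) or math.isinf(n) for n in norms):
        return "exploding"  # overflow has already occurred
    if max(norms) > explode_threshold:
        return "exploding"
    # The input-side norm is orders of magnitude below the output-side norm.
    if norms[0] * ratio_threshold < norms[-1]:
        return "vanishing"
    return "healthy"


if __name__ == "__main__":
    print(diagnose_gradient_norms([1e-9, 1e-6, 1e-3, 1.0]))
    print(diagnose_gradient_norms([1.0, 2.0, float("inf")]))
    print(diagnose_gradient_norms([0.5, 0.7, 0.6]))
```

In a training loop the norms would be recorded every few hundred steps, since a network that starts in a healthy regime can drift out of it as the weights change.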

211.9 References

  1. Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, 1994. https://doi.org/10.1109/72.279181
  2. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a.html
  3. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” ICCV, 2015. https://arxiv.org/abs/1502.01852
  4. A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” ICLR, 2014. https://arxiv.org/abs/1312.6120
  5. R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” ICML, 2013. https://arxiv.org/abs/1211.5063
  6. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015. https://arxiv.org/abs/1502.03167
  7. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint, 2016. https://arxiv.org/abs/1607.06450
  8. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, 2016. https://arxiv.org/abs/1512.03385
  9. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
  10. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, 2017. https://arxiv.org/abs/1706.03762