211 Vanishing and Exploding Gradients

Training deep and recurrent neural networks by gradient descent depends on the assumption that a useful learning signal can travel from the loss at the output back to parameters in the earliest layers. When that signal shrinks toward zero or grows without bound as it propagates, optimization stalls or diverges. These two failure modes, known as the vanishing gradient problem and the exploding gradient problem, were the central obstacle to training very deep architectures for much of the field’s history. This chapter develops the mathematics of why they arise, explains the roles of activation functions and weight initialization, and surveys the practical and architectural remedies that make modern deep learning feasible.

The unifying object behind both failure modes is a single quantity: the product of layerwise Jacobians that the chain rule assembles as it carries the error signal backward. Everything in this chapter, the choice of activation, the scale of the initial weights, normalization, residual paths, and gating, can be read as an attempt to keep the multiplicative factors of that product near one so the signal neither collapses nor diverges. We make that reading explicit by stating a precise condition for vanishing and explosion, working a small numerical example, and then organizing the remedies by which factor of the product each one controls.

Definitions

Vanishing gradients. The backpropagated error signal $\delta^{(\ell)}$ at an early layer $\ell$ decays toward zero as the depth $L - \ell$ grows, so the gradient with respect to early parameters is negligible and those parameters effectively stop learning.

Exploding gradients. The signal $\delta^{(\ell)}$ grows without bound as $L - \ell$ grows, producing huge or numerically overflowing updates that destabilize or diverge the optimization.

Both are properties of the same product of Jacobians; they are the two ways a long matrix product can fail to be approximately norm preserving.

211.1 1. The Mechanics of Backpropagation Through Depth

Consider a feedforward network with $L$ layers. Layer $\ell$ computes a preactivation $z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}$ followed by an elementwise nonlinearity $a^{(\ell)} = \phi(z^{(\ell)})$, with $a^{(0)} = x$ the input. Let $\mathcal{L}$ be a scalar loss. Backpropagation computes the error signal $\delta^{(\ell)} = \partial \mathcal{L} / \partial z^{(\ell)}$ by the recursion

\[ \delta^{(\ell)} = \left( W^{(\ell+1)} \right)^{\top} \delta^{(\ell+1)} \odot \phi'\!\left(z^{(\ell)}\right), \]

where $\odot$ is the Hadamard product. The gradient with respect to the weights of layer $\ell$ is $\partial \mathcal{L} / \partial W^{(\ell)} = \delta^{(\ell)} \left(a^{(\ell-1)}\right)^{\top}$.

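Unrolling the recursion shows that the error signal at an early layer is a product of many Jacobian factors. Writing $D^{(k)} = \operatorname{diag}\!\left(\phi'(z^{(k)})\right)$, each step of the recursion reads $\delta^{(k)} = D^{(k)} \left(W^{(k+1)}\right)^{\top} \delta^{(k+1)}$, so the signal that reaches layer $\ell$ from layer $L$ is

\[ \delta^{(\ell)} = \left( \prod_{k=\ell}^{L-1} D^{(k)} \left(W^{(k+1)}\right)^{\top} \right) \delta^{(L)}_{\text{local}}, \]

where the factors are ordered with $k$ increasing from left to right and $\delta^{(L)}_{\text{local}} = \partial \mathcal{L} / \partial z^{(L)}$ is the error signal computed directly at the output layer.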

The magnitude of $\delta^{(\ell)}$ is therefore governed by the product of $L - \ell$ matrices. Products of many matrices tend to behave either like a contraction, collapsing toward zero, or like an expansion, blowing up, unless their multiplicative factors are carefully balanced near one. This is the structural origin of both problems.
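As a rough numerical illustration of the recursion, the following NumPy sketch pushes a random error signal backward through a bias-free $\tanh$ network and prints the norm of $\delta^{(1)}$ next to that of $\delta^{(L)}$. The width, depth, seed, weight scales, and the helper names `forward` and `backward_deltas` are assumptions chosen for the demonstration, not part of the derivation.

```python
import numpy as np

def forward(weights, x):
    """Run a bias-free tanh network and return the preactivations z^(1), ..., z^(L)."""
    zs, a = [], x
    for W in weights:
        z = W @ a
        zs.append(z)
        a = np.tanh(z)
    return zs

def backward_deltas(weights, zs, delta_out):
    """Apply delta^(l) = (W^(l+1))^T delta^(l+1) * phi'(z^(l)) for l = L-1 down to 1.

    weights[i] holds W^(i+1) and zs[i] holds z^(i+1); the result is
    [delta^(1), ..., delta^(L)] with delta^(L) = delta_out.
    """
    deltas = [delta_out]
    for i in range(len(weights) - 2, -1, -1):
        dphi = 1.0 - np.tanh(zs[i]) ** 2                  # tanh'(z^(l))
        deltas.insert(0, (weights[i + 1].T @ deltas[0]) * dphi)
    return deltas

rng = np.random.default_rng(0)
width, depth = 64, 30
x = rng.normal(size=width)
delta_out = rng.normal(size=width)

for scale in (0.5, 1.0, 2.0):  # multiplier on the 1/sqrt(width) weight scale
    weights = [scale * rng.normal(size=(width, width)) / np.sqrt(width)
               for _ in range(depth)]
    deltas = backward_deltas(weights, forward(weights, x), delta_out)
    first, last = np.linalg.norm(deltas[0]), np.linalg.norm(deltas[-1])
    print(f"scale={scale}: ||delta^(1)|| = {first:.3e}, ||delta^(L)|| = {last:.3e}")
```

With the smallest scale the early-layer norm should come out many orders of magnitude below the output norm; the exact figures depend on the seed.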

The following diagram traces a single error signal as it flows backward through the stack. Each arrow multiplies the signal by one Jacobian factor, and the cumulative product is what determines the regime.

flowchart RL
  L["Loss gradient at output"] -->|"times J_L"| dL["delta at layer L"]
  dL -->|"times J_L minus 1"| dmid["delta at middle layers"]
  dmid -->|"times J ell plus 1"| de["delta at layer ell"]
  de --> note["Product of L minus ell Jacobians sets the magnitude"]

211.2 2. A Quantitative Account of the Two Regimes

211.2.1 2.1 Scalar intuition

A single neuron chained through $L$ layers makes the mechanism transparent. Suppose each layer multiplies the backpropagated signal by a scalar factor $w \, \phi'(z)$. After $L$ layers the signal scales by $\left(w \, \phi'(z)\right)^{L}$. If $|w \, \phi'(z)| < 1$, the factor decays geometrically and the gradient vanishes; if $|w \, \phi'(z)| > 1$, it grows geometrically and the gradient explodes. Only the knife edge at exactly one preserves the signal, and random products do not sit on that edge by accident.

211.2.2 2.2 Spectral and norm bounds

The matrix case generalizes the scalar one through operator norms and singular values. Using submultiplicativity,

\[ \left\| \delta^{(\ell)} \right\| \le \left( \prod_{k=\ell}^{L-1} \left\| D^{(k)} \right\| \, \left\| W^{(k+1)} \right\| \right) \left\| \delta^{(L)}_{\text{local}} \right\|, \]

where $\|W\|$ denotes the spectral norm, that is, the largest singular value. If every $\|W^{(k)}\| \le \gamma$ and every activation derivative is bounded by $\beta = \sup_z |\phi'(z)|$, then $\|\delta^{(\ell)}\| \le (\gamma \beta)^{L-\ell} \, \|\delta^{(L)}_{\text{local}}\|$. When $\gamma \beta < 1$ this upper bound decays exponentially in the depth $L - \ell$, which guarantees vanishing. A complementary lower bound built from the smallest singular values shows that when the smallest singular value of every weight matrix times the minimal activation slope exceeds one, the signal must grow exponentially. The product $\gamma \beta$ relative to one is the single most important quantity to reason about.
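The upper bound can be checked numerically. The sketch below is a minimal illustration under assumed settings: random Gaussian weights, slopes drawn uniformly from $[0, 1]$ as a stand-in for $\phi'(z)$, and a hypothetical helper name `backward_norm_and_bound`. It tracks the realized norm alongside the running product of spectral norms.

```python
import numpy as np

def backward_norm_and_bound(n=16, steps=12, seed=0):
    """Propagate a random signal backward and track the submultiplicative bound."""
    rng = np.random.default_rng(seed)
    delta = rng.normal(size=n)
    bound = np.linalg.norm(delta)
    for _ in range(steps):
        W = rng.normal(size=(n, n)) / np.sqrt(n)
        slopes = rng.uniform(0.0, 1.0, size=n)        # stand-in for phi'(z), so beta = 1
        delta = slopes * (W.T @ delta)                # one backward step: D W^T delta
        bound *= slopes.max() * np.linalg.norm(W, 2)  # ||D|| * ||W||, both spectral norms
    return np.linalg.norm(delta), bound

actual, bound = backward_norm_and_bound()
print(f"||delta|| = {actual:.3e} <= bound = {bound:.3e}")
```

The bound is typically far from tight, which is why it certifies vanishing when $\gamma \beta < 1$ but says little when $\gamma \beta > 1$.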

211.2.3 2.3 The recurrent case

Recurrent networks make the problem acute because the same weight matrix is reused at every time step. A vanilla recurrent network has hidden state $h_t = \phi(W h_{t-1} + U x_t)$. Backpropagation through time over a sequence of length $T$ multiplies one Jacobian $J_k = D_k W$ per step, where $D_k = \operatorname{diag}\!\left(\phi'(W h_{k-1} + U x_k)\right)$ changes from step to step but $W$ does not. The sensitivity of a late state, and hence of any loss computed from it, to an early state is

\[ \frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} D_k W . \]

Because $W$ is shared, the relevant quantity is its spectral radius $\rho(W)$, the largest absolute value among its eigenvalues. The following statement makes the threshold precise.

Spectral condition for vanishing and explosion (Bengio et al., 1994)

Let $h_t = \phi(W h_{t-1} + U x_t)$ with $\beta = \sup_z |\phi'(z)|$ the maximal slope of the activation. Then:

If $\rho(W) < 1/\beta$, the factor $\big\| \prod_{k} D_k W \big\|$ contracts to zero as the horizon grows, so gradients vanish. This is a sufficient condition.
Gradients can explode only if $\rho(W) > 1/\beta$. This is a necessary condition for explosion, not a guarantee of it.

For bounded saturating activations $\beta$ is a constant ($\beta = 1$ for $\tanh$, $\beta = 1/4$ for the logistic sigmoid), so the boundary collapses to the familiar $\rho(W) = 1$ for $\tanh$ recurrence [1].

Proof sketch. Each backward factor is $D_k W$ with $\|D_k\| \le \beta$, so by submultiplicativity the norm of the product over $T - t$ steps is at most $(\beta \, \|W\|)^{T-t}$, which tends to zero geometrically whenever $\beta \, \|W\| < 1$. Stated this way the bound involves the largest singular value $\|W\|$, which is never smaller than $\rho(W)$; the two coincide for normal $W$, and in the linear case with diagonalizable $W$ the product is controlled by $\rho(W)^{T-t}$ up to the condition number of the eigenbasis. The explosion claim is the contrapositive: if the product is to grow, the contraction condition must fail. The condition is only necessary because saturation drives the entries of $D_k$ well below $\beta$, which can suppress growth even when $W$ is large. $\square$

Sequences thousands of steps long amplify even tiny deviations from one into enormous or negligible factors, which is why vanilla recurrent networks struggle to learn long range dependencies.
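To see the threshold at work, the following sketch rescales a random matrix to a chosen spectral radius, runs a $\tanh$ recurrence, and measures the spectral norm of the accumulated Jacobian product. Replacing $U x_t$ by small Gaussian noise, the state size, the horizon, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def rescale_to_spectral_radius(W, rho):
    """Scale W so that its largest eigenvalue modulus equals rho."""
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

def jacobian_product_norm(W, hs):
    """Spectral norm of prod_k D_k W for a tanh recurrence, D_k = diag(1 - h_k^2)."""
    J = np.eye(W.shape[0])
    for h in hs:                      # later factors multiply from the left
        J = (np.diag(1.0 - h ** 2) @ W) @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
n, T = 32, 60
base = rng.normal(size=(n, n))
for rho in (0.8, 1.0, 1.5):
    W = rescale_to_spectral_radius(base, rho)
    h, hs = np.zeros(n), []
    for _ in range(T):
        h = np.tanh(W @ h + 0.1 * rng.normal(size=n))  # noise stands in for U x_t
        hs.append(h)
    print(f"rho = {rho}: ||dh_T/dh_0|| = {jacobian_product_norm(W, hs):.3e}")
```

Passing all-zero states makes every $D_k$ the identity and recovers the linear case, where the product is exactly $W^{T}$ and its norm is at least $\rho(W)^{T}$.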

Worked example: how fast the signal moves

Take a $\tanh$ recurrence ($\beta = 1$) whose backward factors each contribute an effective scalar gain $r$ per step. After $T$ steps the signal scales by $r^{T}$. The numbers are stark.

Per-step gain $r$	After $T = 10$	After $T = 100$
$0.9$ (mild contraction)	$0.35$	$2.7 \times 10^{-5}$
$0.99$ (near critical)	$0.90$	$0.37$
$1.01$ (near critical)	$1.10$	$2.7$
$1.1$ (mild expansion)	$2.6$	$1.4 \times 10^{4}$

A gain of $0.9$, which looks harmless over a few steps, has wiped out more than four orders of magnitude of signal after a hundred steps, while a gain of $1.1$ inflates it by about four orders of magnitude. Only the near critical gains of $0.99$ and $1.01$ keep the signal usable across a long horizon, and they sit on a knife edge that drifting weights rarely hold. This single table is the whole problem in miniature: the regime is set by whether the per-step gain sits just below, at, or just above one, and the effect compounds exponentially in the horizon.
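The entries of the table follow from a single exponentiation, and a few lines of Python reproduce them. The function name `cumulative_gain` is a label chosen here for illustration.

```python
def cumulative_gain(r, T):
    """Factor by which a signal is scaled after T steps with per-step gain r."""
    return r ** T

for r in (0.9, 0.99, 1.01, 1.1):
    print(f"r = {r:<4}  T=10: {cumulative_gain(r, 10):.2f}  "
          f"T=100: {cumulative_gain(r, 100):.1e}")
```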

211.3 3. The Role of Activation Functions

The activation derivative $\phi'$ enters every factor of the product, so its typical magnitude directly controls the regime.

211.3.1 3.1 Saturating nonlinearities

The logistic sigmoid $\sigma(z) = 1/(1 + e^{-z})$ has derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which peaks at $0.25$ when $z = 0$ and falls toward zero as $|z|$ grows. The hyperbolic tangent has derivative $\tanh'(z) = 1 - \tanh^2(z)$, peaking at one but also collapsing in the saturated tails. Two consequences follow. First, even in the best case the sigmoid derivative shrinks the signal by a factor of at least four per layer, so unless the weights are large enough to compensate, the gradient through a stack of sigmoids vanishes quickly. Second, once a unit saturates, its derivative is nearly zero and it passes almost no signal, so the weights feeding it receive almost no update and the unit tends to stay saturated.
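A short standard-library sketch makes the two derivatives concrete. The sample points and the ten-layer depth are arbitrary choices for illustration, and the last line ignores the weights entirely.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 2.0, 5.0):
    print(f"z = {z}: sigmoid' = {sigmoid_prime(z):.4f}, tanh' = {tanh_prime(z):.4f}")

# Best-case attenuation from the activation slopes alone in a 10-layer sigmoid stack.
print(f"0.25 ** 10 = {sigmoid_prime(0.0) ** 10:.2e}")
```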

211.3.2 3.2 Piecewise linear nonlinearities

The rectified linear unit $\operatorname{ReLU}(z) = \max(0, z)$ has derivative equal to one for positive preactivations and zero otherwise. On the active path the derivative is exactly one, so $\phi'$ no longer shrinks the signal multiplicatively, which is the main reason ReLU networks train at far greater depth than sigmoid networks. The price is the dead unit phenomenon: a neuron whose preactivation is negative for all inputs receives no gradient and never updates. Variants address this by giving the negative branch a nonzero slope. The leaky ReLU uses $\phi(z) = \max(\alpha z, z)$ with a small fixed $\alpha$, the parametric ReLU learns $\alpha$, and the exponential linear unit and GELU provide smooth alternatives with similar gradient preserving behavior. The shared theme is keeping $|\phi'|$ near one over the operating range.
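The contrast between the active path, a dead unit, and the leaky variant can be sketched by multiplying activation slopes along a single path. The preactivation values, the value of `alpha`, and the helper `path_gain` are assumptions for illustration; weights are left out so that only $\phi'$ is visible.

```python
def relu_prime(z):
    """Derivative of max(0, z); the value at z = 0 is set to 0 by convention here."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_prime(z, alpha=0.01):
    """Derivative of max(alpha * z, z) for 0 < alpha < 1."""
    return 1.0 if z > 0 else alpha

def path_gain(preactivations, dphi):
    """Product of activation slopes along one path through the layers."""
    gain = 1.0
    for z in preactivations:
        gain *= dphi(z)
    return gain

active = [0.5] * 50                # a path on which every unit is active
one_dead = [0.5] * 49 + [-0.5]     # the same path with one inactive unit
print(path_gain(active, relu_prime))          # slopes of exactly one leave the signal intact
print(path_gain(one_dead, relu_prime))        # a single inactive ReLU blocks the path
print(path_gain(one_dead, leaky_relu_prime))  # the leaky variant lets a fraction alpha through
```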

211.4 4. Weight Initialization

If activation slopes are kept near one, the remaining control variable is the scale of the weights, fixed at initialization. The goal is to choose the initial distribution so that the variance of activations in the forward pass and the variance of gradients in the backward pass are preserved across layers.

211.4.1 4.1 Variance propagation

Assume weights are drawn independently with mean zero and variance $\operatorname{Var}(W)$, inputs are normalized, and the nonlinearity is roughly linear near the origin. For a layer with $n_{\text{in}}$ inputs, the variance of a preactivation is approximately $n_{\text{in}} \operatorname{Var}(W) \operatorname{Var}(a)$. To keep $\operatorname{Var}(z) = \operatorname{Var}(a)$ across the forward pass we need $n_{\text{in}} \operatorname{Var}(W) = 1$. Symmetrically, preserving gradient variance in the backward pass requires $n_{\text{out}} \operatorname{Var}(W) = 1$.
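The forward condition is easy to check empirically. The sketch below assumes a purely linear layer with Gaussian weights and unit-variance Gaussian inputs; the layer sizes, sample count, and function name are illustrative choices.

```python
import numpy as np

def preactivation_variance(n_in, var_w, n_out=256, samples=2000, seed=0):
    """Empirical Var(z) for z = W a with zero-mean weights and unit-variance inputs."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(var_w), size=(n_out, n_in))
    a = rng.normal(0.0, 1.0, size=(n_in, samples))
    return float((W @ a).var())

n_in = 256
for var_w in (0.5 / n_in, 1.0 / n_in, 2.0 / n_in):
    measured = preactivation_variance(n_in, var_w)
    print(f"n_in * Var(W) = {n_in * var_w:.1f}  ->  measured Var(z) = {measured:.3f}")
```

Only the middle setting, $n_{\text{in}} \operatorname{Var}(W) = 1$, should leave the variance close to that of the input; the other two halve or double it at every layer.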

211.4.2 4.2 Xavier and He schemes

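Glorot and Bengio reconciled the two conditions by replacing the fan in with the average of fan in and fan out, which is the same as taking the harmonic mean of the two target variances $1/n_{\text{in}}$ and $1/n_{\text{out}}$, giving the Xavier initialization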

\[ \operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}, \]

which is appropriate for symmetric activations such as $\tanh$ [2]. Rectifiers zero out half of their inputs in expectation, halving the variance per layer, so He and colleagues corrected the factor to

\[ \operatorname{Var}(W) = \frac{2}{n_{\text{in}}}, \]

which restores signal scale for ReLU networks and enabled the training of very deep convolutional models [3]. A useful refinement is orthogonal initialization, which sets the singular values of each weight matrix to one; in a deep linear stack the Jacobian product is then an exact isometry at the start of training [4]. The stronger goal of making the entire input-output Jacobian of a nonlinear network close to an isometry, with all of its singular values concentrated near one rather than only its norm controlled, is called dynamical isometry. Mean field analysis shows it can be achieved for very deep networks by combining orthogonal weights with carefully chosen nonlinearities, and this approach has been used to train vanilla convolutional networks with ten thousand layers, without normalization or skip connections [11].

# Initialization scale (illustration, not executable)
He:      std = sqrt(2 / fan_in)            # ReLU family
Xavier:  std = sqrt(2 / (fan_in + fan_out)) # tanh, sigmoid
Orthogonal: W = orthonormal matrix scaled by gain
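The three scales can be sketched in NumPy as follows. This is a minimal illustration, not any framework's implementation: the function names, the `(fan_out, fan_in)` weight layout, and the QR-based construction of the orthogonal matrix are assumptions made here.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """Gaussian weights with Var(W) = 2 / fan_in (ReLU family)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_normal(fan_in, fan_out, rng):
    """Gaussian weights with Var(W) = 2 / (fan_in + fan_out) (tanh, sigmoid)."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

def orthogonal(fan_in, fan_out, rng, gain=1.0):
    """Semi-orthogonal weights: every singular value equals gain."""
    rows, cols = fan_out, fan_in
    A = rng.normal(size=(max(rows, cols), min(rows, cols)))
    Q, R = np.linalg.qr(A)              # Q has orthonormal columns
    Q = Q * np.sign(np.diag(R))         # remove the sign ambiguity of the QR factors
    return gain * (Q if rows >= cols else Q.T)

rng = np.random.default_rng(0)
print(he_normal(512, 256, rng).std())       # close to sqrt(2 / 512)
print(xavier_normal(512, 256, rng).std())   # close to sqrt(2 / 768)
print(np.linalg.svd(orthogonal(512, 256, rng), compute_uv=False)[:3])
```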

Initialization can place a network in a healthy regime, but it cannot guarantee the network stays there as weights drift during training. That motivates mechanisms that act throughout optimization.

211.5 5. Gradient Clipping

Exploding gradients can be addressed directly at the moment they occur. Gradient clipping rescales the gradient whenever its norm exceeds a threshold $\tau$. Given the full gradient $g$, norm clipping applies

\[ \hat{g} = \begin{cases} g & \text{if } \|g\| \le \tau, \\[4pt] \tau \, \dfrac{g}{\|g\|} & \text{if } \|g\| > \tau. \end{cases} \]

Because the rescaling preserves direction and only shortens the step, the update still points downhill while its magnitude is bounded, which prevents a single sharp region of the loss surface from launching the parameters far away. Pascanu and colleagues introduced this technique precisely for recurrent training, where the loss landscape contains steep walls that a normal step would overshoot [5]. Clipping treats the symptom of explosion rather than its cause, and it does nothing for vanishing gradients, so it is best understood as a stabilizer applied alongside the structural fixes below.

# Norm clipping (illustration, not executable)
total_norm = norm(concat(all gradients))
if total_norm > tau:
    scale = tau / (total_norm + eps)
    g = g * scale   # for every parameter gradient
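A runnable version of the same rule, written against plain NumPy arrays, might look like the sketch below. The function name, the returned pre-clipping norm, and the default `eps` are choices made for illustration rather than a particular framework's interface.

```python
import numpy as np

def clip_by_global_norm(grads, tau, eps=1e-6):
    """Rescale a list of gradient arrays so their joint L2 norm is at most tau.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm <= tau:
        return grads, total_norm
    scale = tau / (total_norm + eps)    # eps guards against division by a tiny norm
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, tau=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

The norm is computed over all parameters jointly, so every gradient is multiplied by the same factor and the overall direction is preserved.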

211.6 6. Normalization Layers

Normalization layers keep the distribution of activations stable as training proceeds, which indirectly conditions the gradient flow. Batch normalization standardizes each feature across the minibatch,

\[ \hat{z} = \frac{z - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y = \gamma \hat{z} + \beta, \]

with learnable scale $\gamma$ and shift $\beta$ [6]. These are the conventional batch normalization symbols and are unrelated to the bounds $\gamma$ and $\beta$ used for the weight norm and activation slope earlier in the chapter. By holding preactivation statistics near unit variance, batch normalization keeps activation slopes in their responsive range and reduces the dependence of each layer’s gradient on the scale of distant layers, which permits higher learning rates and deeper stacks. Layer normalization computes the same standardization across features within a single example rather than across the batch, which removes the dependence on batch size and makes it the normalization of choice in recurrent and transformer architectures [7]. These methods do not eliminate the underlying product of Jacobians, but they prevent the runaway drift of activation scale that would otherwise push the network into a vanishing or exploding regime.
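The only difference between the two normalizations is the axis over which the statistics are taken, which a short NumPy sketch makes explicit. It covers the training-time forward computation only; the running averages that batch normalization uses at inference, and the function names themselves, are outside what the text specifies and are assumptions here.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """z has shape (batch, features); statistics are taken over the batch axis."""
    mu = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def layer_norm(z, gamma, beta, eps=1e-5):
    """Same standardization, with statistics taken over the features of each example."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
z = rng.normal(3.0, 5.0, size=(128, 16))       # badly scaled preactivations
gamma, beta = np.ones(16), np.zeros(16)
print(batch_norm(z, gamma, beta).std(axis=0)[:3])   # per-feature spread after BN
print(layer_norm(z, gamma, beta).std(axis=1)[:3])   # per-example spread after LN
```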

211.7 7. Architectural Fixes

The most durable solutions change the function being differentiated so that a short, well conditioned path always exists for the gradient.

211.7.1 7.1 Residual connections

A residual block computes $a^{(\ell)} = a^{(\ell-1)} + F(a^{(\ell-1)})$, adding the input back to the transformed output [8]. The Jacobian of the block is $I + \partial F / \partial a^{(\ell-1)}$, so the backward recursion becomes

\[ \frac{\partial \mathcal{L}}{\partial a^{(\ell-1)}} = \frac{\partial \mathcal{L}}{\partial a^{(\ell)}} \left( I + \frac{\partial F}{\partial a^{(\ell-1)}} \right). \]

Expanding the product across many blocks yields a sum that includes the identity term, which carries the gradient unattenuated regardless of how small the residual branches are. The identity path is the reason networks hundreds and even thousands of layers deep can be optimized, and it is the structural backbone of modern convolutional and transformer models.
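The effect of the identity term shows up clearly when the branch Jacobians are small. The sketch below compares a plain product of random branch Jacobians with the matching product of residual-block Jacobians $I + \partial F / \partial a$. The branch Jacobians are random stand-ins rather than derivatives of a trained $F$, and the sizes, scale, and function name are assumptions for illustration.

```python
import numpy as np

def stacked_jacobian_norms(n=32, depth=50, branch_scale=0.01, seed=0):
    """Spectral norms of a plain product of branch Jacobians and of the
    corresponding product of residual-block Jacobians I + dF/da."""
    rng = np.random.default_rng(seed)
    identity = np.eye(n)
    plain, residual = identity.copy(), identity.copy()
    for _ in range(depth):
        J_F = branch_scale * rng.normal(size=(n, n)) / np.sqrt(n)  # small dF/da
        plain = J_F @ plain                      # stack without skip connections
        residual = (identity + J_F) @ residual   # stack of residual blocks
    return np.linalg.norm(plain, 2), np.linalg.norm(residual, 2)

plain_norm, residual_norm = stacked_jacobian_norms()
print(f"plain: {plain_norm:.3e}   residual: {residual_norm:.3e}")
```

The plain product collapses toward zero while the residual product stays of order one, because each residual factor is a small perturbation of the identity.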

211.7.2 7.2 Gated recurrent architectures

Recurrent architectures solve their version of the problem with additive memory and multiplicative gates. The long short term memory network maintains a cell state $c_t$ updated by

\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \]

where the forget gate $f_t$, input gate $i_t$, and a candidate $\tilde{c}_t$ are functions of the input and previous hidden state [9]. The recurrence in $c_t$ is additive rather than a repeated matrix multiplication: along the direct path the Jacobian is $\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)$, ignoring the indirect dependence of the gates on earlier states. When the forget gate stays near one the cell state therefore forms a near identity path through time, an analogue of the residual connection that lets gradients persist over long horizons. This mechanism, the constant error carousel, is what allows the long short term memory network to learn dependencies across hundreds of time steps that defeat vanilla recurrence. The simpler gated recurrent unit, which merges the cell and hidden state and uses two gates instead of three, relies on the same kind of gated additive update [12].
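Along the direct cell-state path the gradient is scaled by the elementwise product of the forget gates, which the following sketch evaluates for constant gate values. Holding the gates constant, ignoring their dependence on earlier states, and the function name are simplifying assumptions; in a trained network the gates vary with the input.

```python
import numpy as np

def cell_state_path_gain(forget_gates):
    """Direct-path derivative dc_T/dc_t for gates of shape (steps, units):
    the elementwise product of the forget gates along the way."""
    return np.prod(forget_gates, axis=0)

T, units = 200, 4
for f in (0.999, 0.99, 0.9):
    gates = np.full((T, units), f)
    print(f"f = {f}: gain over {T} steps = {cell_state_path_gain(gates)[0]:.3e}")
```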

211.7.3 7.3 Attention and the elimination of depth in time

Self attention sidesteps the temporal product entirely by connecting every position to every other position in a single layer [10]. The path length between two tokens is constant rather than proportional to their distance, so a gradient between distant positions traverses a fixed and small number of operations. Removing the long multiplicative chain of recurrence is a large part of why attention based models train stably on very long sequences, and it explains the shift away from recurrent architectures for sequence modeling.

211.8 8. Practical Diagnosis and Synthesis

In practice the regime of a network is diagnosed by monitoring gradient norms per layer. Norms that shrink by orders of magnitude from output to input signal vanishing, while norms that spike or produce numerical overflow signal explosion. The remedies compose rather than compete. A modern recipe combines a non-saturating activation with derivative near one, variance preserving initialization, a normalization layer, residual or gated paths for signal flow, and gradient clipping as a safety net against rare spikes. Each addresses a different factor in the product of Jacobians: the activation controls $\phi'$, initialization sets the initial $\|W\|$, normalization holds activation scale steady through training, architecture inserts an identity path that bypasses the product, and clipping caps the worst case step.
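A per-layer check of this kind can be automated with a few lines. The sketch below is one possible heuristic, not a standard rule: the threshold of three orders of magnitude, the returned labels, and the function name `diagnose` are all assumptions, and the input is a list of gradient norms ordered from the input layer to the output layer.

```python
import math

def diagnose(layer_grad_norms, ratio=1e3):
    """Classify gradient flow from per-layer norms ordered input -> output."""
    if any(not math.isfinite(g) for g in layer_grad_norms):
        return "exploding (non-finite gradient)"
    first, last = layer_grad_norms[0], layer_grad_norms[-1]
    if first * ratio < last:
        return "vanishing toward the input"
    if first > last * ratio:
        return "exploding toward the input"
    return "balanced"

print(diagnose([1e-9, 1e-5, 1e-1]))
print(diagnose([1e6, 1e2, 1.0]))
print(diagnose([0.5, 0.8, 1.0]))
```

Comparing only the first and last layers is deliberately crude; logging the full sequence of norms shows where in the stack the signal degrades.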

211.8.1 8.1 Which factor does each remedy control?

The table below makes the division of labor explicit. Reading it is the fastest way to map an observed symptom to the right intervention.

Remedy	Factor of the product it controls	Helps vanishing	Helps explosion
Non-saturating activation (ReLU family)	activation slope $\phi'$	yes	no
Variance preserving init (He, Xavier, orthogonal)	initial weight scale $\|W\|$	yes	yes (at start)
Normalization (batch, layer)	activation scale during training	yes	yes
Residual or gated path	inserts an identity term in the Jacobian	yes	no
Gradient clipping	caps the realized step norm	no	yes

211.8.2 8.2 When to use what, and pitfalls

Reach for residual connections and normalization first in feedforward and convolutional stacks; they are the highest leverage changes and are nearly always worth their modest cost. For recurrence, prefer gated cells or, where the task allows, replace recurrence with attention to remove the temporal product entirely. Add gradient clipping whenever you train recurrent models or see occasional loss spikes, since it is cheap insurance against the rare steep wall.

Several pitfalls recur. Gradient clipping fixes only explosion; a network that clips constantly is signaling a deeper conditioning problem, not solving one, and clipping does nothing for vanishing. Batch normalization couples examples within a minibatch, which degrades with very small batches and interacts subtly with recurrence, which is why layer normalization dominates in sequence and transformer models. He initialization assumes a ReLU style nonlinearity; pairing it with a saturating activation, or pairing Xavier with ReLU, leaves the variance miscalibrated by roughly a factor of two per layer (exactly two when fan in equals fan out). Residual blocks help only if the residual branch is initialized small enough that the identity term dominates early; a residual branch initialized at full scale can still place the network in an unstable regime. Finally, monitoring a single global gradient norm hides the diagnosis: vanishing and explosion are spatial phenomena across depth, so log per-layer norms to see where in the stack the signal degrades. Mature open-source frameworks such as PyTorch, JAX with Flax, and TensorFlow provide these tools as standard components: He and Xavier initializers, batch and layer normalization, norm based gradient clipping, and residual building blocks. The practitioner’s task is therefore diagnosis and composition rather than implementation.

Together these methods convert the once formidable barrier of training deep and recurrent networks into routine engineering, and understanding which factor each one controls is what allows a practitioner to diagnose and repair a network that fails to learn.

211.9 References

[1] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, 1994. https://doi.org/10.1109/72.279181
[2] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a.html
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” ICCV, 2015. https://arxiv.org/abs/1502.01852
[4] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” ICLR, 2014. https://arxiv.org/abs/1312.6120
[5] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” ICML, 2013. https://arxiv.org/abs/1211.5063
[6] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” ICML, 2015. https://arxiv.org/abs/1502.03167
[7] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint, 2016. https://arxiv.org/abs/1607.06450
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, 2016. https://arxiv.org/abs/1512.03385
[9] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, 2017. https://arxiv.org/abs/1706.03762
[11] L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington, “Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks,” ICML, 2018. https://arxiv.org/abs/1806.05393
[12] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” EMNLP, 2014. https://doi.org/10.3115/v1/D14-1179

# Vanishing and Exploding Gradients Training deep and recurrent neural networks by gradient descent depends on the assumption that a useful learning signal can travel from the loss at the output back to parameters in the earliest layers. When that signal shrinks toward zero or grows without bound as it propagates, optimization stalls or diverges. These two failure modes, known as the vanishing gradient problem and the exploding gradient problem, were the central obstacle to training very deep architectures for much of the field's history. This chapter develops the mathematics of why they arise, explains the roles of activation functions and weight initialization, and surveys the practical and architectural remedies that make modern deep learning feasible. The unifying object behind both failure modes is a single quantity: the product of layerwise Jacobians that the chain rule assembles as it carries the error signal backward. Everything in this chapter, the choice of activation, the scale of the initial weights, normalization, residual paths, and gating, can be read as an attempt to keep the multiplicative factors of that product near one so the signal neither collapses nor diverges. We make that reading explicit by stating a precise condition for vanishing and explosion, working a small numerical example, and then organizing the remedies by which factor of the product each one controls. ::: {.callout-note} ## Definitions **Vanishing gradients.** The backpropagated error signal $\delta^{(\ell)}$ at an early layer $\ell$ decays toward zero as the depth $L - \ell$ grows, so the gradient with respect to early parameters is negligible and those parameters effectively stop learning. **Exploding gradients.** The signal $\delta^{(\ell)}$ grows without bound as $L - \ell$ grows, producing huge or numerically overflowing updates that destabilize or diverge the optimization. 
Both are properties of the same product of Jacobians; they are the two ways a long matrix product can fail to be approximately norm preserving. ::: ## 1. The Mechanics of Backpropagation Through Depth Consider a feedforward network with $L$ layers. Layer $\ell$ computes a preactivation $z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}$ followed by an elementwise nonlinearity $a^{(\ell)} = \phi(z^{(\ell)})$, with $a^{(0)} = x$ the input. Let $\mathcal{L}$ be a scalar loss. Backpropagation computes the error signal $\delta^{(\ell)} = \partial \mathcal{L} / \partial z^{(\ell)}$ by the recursion $$ \delta^{(\ell)} = \left( W^{(\ell+1)} \right)^{\top} \delta^{(\ell+1)} \odot \phi'\!\left(z^{(\ell)}\right), $$ where $\odot$ is the Hadamard product. The gradient with respect to the weights of layer $\ell$ is $\partial \mathcal{L} / \partial W^{(\ell)} = \delta^{(\ell)} \left(a^{(\ell-1)}\right)^{\top}$. Unrolling the recursion shows that the error signal at an early layer is a product of many Jacobian factors. Writing $D^{(\ell)} = \operatorname{diag}\!\left(\phi'(z^{(\ell)})\right)$, the signal that reaches layer $\ell$ from layer $L$ is $$ \delta^{(\ell)} = \left( \prod_{k=\ell+1}^{L} D^{(k)} \left(W^{(k)}\right)^{\top} \right) \delta^{(L)}_{\text{local}}. $$ The magnitude of $\delta^{(\ell)}$ is therefore governed by the product of $L - \ell$ matrices. Products of many matrices tend to behave either like a contraction, collapsing toward zero, or like an expansion, blowing up, unless their multiplicative factors are carefully balanced near one. This is the structural origin of both problems. The following diagram traces a single error signal as it flows backward through the stack. Each arrow multiplies the signal by one Jacobian factor, and the cumulative product is what determines the regime. 
```{mermaid} flowchart RL L["Loss gradient at output"] -->|"times J_L"| dL["delta at layer L"] dL -->|"times J_L minus 1"| dmid["delta at middle layers"] dmid -->|"times J ell plus 1"| de["delta at layer ell"] de --> note["Product of L minus ell Jacobians sets the magnitude"] ``` ## 2. A Quantitative Account of the Two Regimes ### 2.1 Scalar intuition A single neuron chained through $L$ layers makes the mechanism transparent. Suppose each layer multiplies the backpropagated signal by a scalar factor $w \, \phi'(z)$. After $L$ layers the signal scales by $\left(w \, \phi'(z)\right)^{L}$. If $|w \, \phi'(z)| < 1$, the factor decays geometrically and the gradient vanishes; if $|w \, \phi'(z)| > 1$, it grows geometrically and the gradient explodes. Only the knife edge at exactly one preserves the signal, and random products do not sit on that edge by accident. ### 2.2 Spectral and norm bounds The matrix case generalizes the scalar one through operator norms and singular values. Using submultiplicativity, $$ \left\| \delta^{(\ell)} \right\| \le \left( \prod_{k=\ell+1}^{L} \left\| D^{(k)} \right\| \, \left\| W^{(k)} \right\| \right) \left\| \delta^{(L)}_{\text{local}} \right\|, $$ where $\|W\|$ denotes the largest singular value. If every $\|W^{(k)}\| \le \gamma$ and every activation derivative is bounded by $\beta = \sup_z |\phi'(z)|$, then the signal is bounded above by $(\gamma \beta)^{L-\ell}$. When $\gamma \beta < 1$ this upper bound decays exponentially, which guarantees vanishing. A complementary lower bound built from the smallest singular value shows that when the smallest singular value times the minimal activation slope exceeds one, the signal must grow exponentially. The product $\gamma \beta$ relative to one is the single most important quantity to reason about. ### 2.3 The recurrent case Recurrent networks make the problem acute because the same weight matrix is reused at every time step. 
A vanilla recurrent network has hidden state $h_t = \phi(W h_{t-1} + U x_t)$. Backpropagation through time over a sequence of length $T$ multiplies $T$ copies of the same Jacobian $J_t = D_t W$. The sensitivity of a late loss to an early state is $$ \frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} D_k W . $$ Because $W$ is shared, the relevant quantity is its spectral radius $\rho(W)$, the largest absolute value among its eigenvalues. The following statement makes the threshold precise. ::: {.callout-important} ## Spectral condition for vanishing and explosion (Bengio et al., 1994) Let $h_t = \phi(W h_{t-1} + U x_t)$ with $\beta = \sup_z |\phi'(z)|$ the maximal slope of the activation. Then: - If $\rho(W) < 1/\beta$, the factor $\big\| \prod_{k} D_k W \big\|$ contracts to zero as the horizon grows, so gradients **vanish**. This is a sufficient condition. - Gradients can **explode** only if $\rho(W) > 1/\beta$. This is a necessary condition for explosion, not a guarantee of it. For bounded saturating activations $\beta$ is a constant ($\beta = 1$ for $\tanh$, $\beta = 1/4$ for the logistic sigmoid), so the boundary collapses to the familiar $\rho(W) = 1$ for $\tanh$ recurrence [1]. ::: *Proof sketch.* Each backward factor is $D_k W$ with $\|D_k\| \le \beta$. By submultiplicativity the norm of the product is bounded by $(\beta \, \|W\|)^{T}$, and for a diagonalizable $W$ the operator norm is controlled by $\rho(W)$ up to the conditioning of its eigenbasis. When $\beta \, \rho(W) < 1$ this bound tends to zero geometrically, forcing vanishing. For explosion, align the signal with the leading eigenvector of $W$; the product then grows like $\rho(W)^{T}$ only if $\rho(W)$ exceeds $1/\beta$, so a larger spectral radius is necessary, while saturation of the $D_k$ factors can still suppress growth, which is why it is not sufficient. 
$\square$ Sequences thousands of steps long amplify even tiny deviations from one into enormous or negligible factors, which is why vanilla recurrent networks struggle to learn long range dependencies. ::: {.callout-tip} ## Worked example: how fast the signal moves Take a $\tanh$ recurrence ($\beta = 1$) whose backward factors each contribute an effective scalar gain $r$ per step. After $T$ steps the signal scales by $r^{T}$. The numbers are stark. | Per-step gain $r$ | After $T = 10$ | After $T = 100$ | |---|---|---| | $0.9$ (mild contraction) | $0.35$ | $2.7 \times 10^{-5}$ | | $0.99$ (near critical) | $0.90$ | $0.37$ | | $1.01$ (near critical) | $1.10$ | $2.7$ | | $1.1$ (mild expansion) | $2.6$ | $1.4 \times 10^{4}$ | A gain of $0.9$, which looks harmless over a few steps, has already destroyed five orders of magnitude of signal over a hundred steps, while a gain of $1.1$ inflates it by four. Only the near critical gains of $0.99$ and $1.01$ keep the signal usable across a long horizon, and they sit on a knife edge that drifting weights rarely hold. This single table is the whole problem in miniature: the regime is set by whether the per-step gain sits just below, at, or just above one, and the effect compounds exponentially in the horizon. ::: ## 3. The Role of Activation Functions The activation derivative $\phi'$ enters every factor of the product, so its typical magnitude directly controls the regime. ### 3.1 Saturating nonlinearities The logistic sigmoid $\sigma(z) = 1/(1 + e^{-z})$ has derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which peaks at $0.25$ when $z = 0$ and falls toward zero as $|z|$ grows. The hyperbolic tangent has derivative $\tanh'(z) = 1 - \tanh^2(z)$, peaking at one but also collapsing in the saturated tails. Two consequences follow. First, even in the best case the sigmoid attenuates the signal by a factor of at least four per layer, so a stack of sigmoids vanishes quickly. 
Second, once a unit saturates, its derivative is nearly zero and it stops passing any signal, a state from which it rarely recovers.

### 3.2 Piecewise linear nonlinearities

The rectified linear unit $\operatorname{ReLU}(z) = \max(0, z)$ has derivative equal to one for positive preactivations and zero otherwise. On the active path the derivative is exactly one, so $\phi'$ no longer shrinks the signal multiplicatively, which is the main reason ReLU networks train at far greater depth than sigmoid networks. The price is the dead unit phenomenon: a neuron whose preactivation is negative for all inputs receives no gradient and never updates.

Variants address this by giving the negative branch a nonzero slope. The leaky ReLU uses $\phi(z) = \max(\alpha z, z)$ with a small fixed $\alpha$, the parametric ReLU learns $\alpha$, and the exponential linear unit and GELU provide smooth alternatives with similar gradient preserving behavior. The shared theme is keeping $|\phi'|$ near one over the operating range.

## 4. Weight Initialization

If activation slopes are kept near one, the remaining control variable is the scale of the weights, fixed at initialization. The goal is to choose the initial distribution so that the variance of activations in the forward pass and the variance of gradients in the backward pass are preserved across layers.

### 4.1 Variance propagation

Assume weights are drawn independently with mean zero and variance $\operatorname{Var}(W)$, inputs are normalized, and the nonlinearity is roughly linear near the origin. For a layer with $n_{\text{in}}$ inputs, the variance of a preactivation is approximately $n_{\text{in}} \operatorname{Var}(W) \operatorname{Var}(a)$. To keep $\operatorname{Var}(z) = \operatorname{Var}(a)$ across the forward pass we need $n_{\text{in}} \operatorname{Var}(W) = 1$. Symmetrically, preserving gradient variance in the backward pass requires $n_{\text{out}} \operatorname{Var}(W) = 1$.
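
The forward condition $n_{\text{in}} \operatorname{Var}(W) = 1$ can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `variance_ratio`, the parameter `gain` (standing in for $n_{\text{in}} \operatorname{Var}(W)$), and the width, depth, and seed are all illustrative choices, and the layers are purely linear to match the "roughly linear near the origin" assumption.

```python
import math
import random


def variance_ratio(gain, width=64, depth=10, n_samples=16, seed=0):
    """Return Var(output) / Var(input) for a stack of random linear layers.

    Weights are i.i.d. Gaussian with Var(W) = gain / width, so `gain`
    plays the role of n_in * Var(W) in the text (hypothetical name).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(gain / width)
    # One square weight matrix per layer, shared by all input samples.
    layers = [
        [[rng.gauss(0.0, sigma) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]
    in_sq, out_sq = 0.0, 0.0
    for _ in range(n_samples):
        a = [rng.gauss(0.0, 1.0) for _ in range(width)]  # normalized input
        in_sq += sum(v * v for v in a)
        for w in layers:
            # Linear layer: z_i = sum_j W_ij a_j, with no nonlinearity.
            a = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) for row in w]
        out_sq += sum(v * v for v in a)
    # Inputs and outputs have mean zero, so mean squares estimate variances.
    return out_sq / in_sq


if __name__ == "__main__":
    for gain in (0.5, 1.0, 2.0):
        ratio = variance_ratio(gain)
        print(f"n_in * Var(W) = {gain}: variance ratio = {ratio:.3e}")
```

Because every call reuses the same seed, the three runs share the same underlying random draws, so the ratios for `gain = 0.5` and `gain = 2.0` differ from the `gain = 1.0` ratio by `gain ** depth`. That is the per-layer factor $n_{\text{in}} \operatorname{Var}(W)$ compounding across depth.
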

### 4.2 Xavier and He schemes

Glorot and Bengio reconciled the two conditions by replacing the fan in or fan out with their arithmetic mean, giving the Xavier initialization

$$
\operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}},
$$

which is appropriate for symmetric activations such as $\tanh$ [2]. Rectifiers zero out half of their inputs in expectation, halving the variance per layer, so He and colleagues corrected the factor to

$$
\operatorname{Var}(W) = \frac{2}{n_{\text{in}}},
$$

which restores signal scale for ReLU networks and enabled the training of very deep convolutional models [3].

A useful refinement for deep linear and ReLU stacks is orthogonal initialization, which sets the singular values of each weight matrix to one so the Jacobian product is an isometry at the start of training, a condition associated with the dynamical regime of stable signal propagation [4]. The stronger goal of making the entire input-output Jacobian close to an isometry, with all of its singular values concentrated near one rather than only its norm controlled, is called dynamical isometry. Mean field analysis shows it can be achieved for very deep networks by combining orthogonal weights with carefully chosen nonlinearities, which empirically permits training vanilla convolutional networks ten thousand layers deep without normalization or skip connections [11].

```text
# Initialization scale (illustration, not executable)
He:         std = sqrt(2 / fan_in)              # ReLU family
Xavier:     std = sqrt(2 / (fan_in + fan_out))  # tanh, sigmoid
Orthogonal: W = orthonormal matrix scaled by gain
```

Initialization can place a network in a healthy regime, but it cannot guarantee the network stays there as weights drift during training. That motivates mechanisms that act throughout optimization.

## 5. Gradient Clipping

Exploding gradients can be addressed directly at the moment they occur. Gradient clipping rescales the gradient whenever its norm exceeds a threshold $\tau$.
Given the full gradient $g$, norm clipping applies

$$
\hat{g} = \begin{cases} g & \text{if } \|g\| \le \tau, \\[4pt] \tau \, \dfrac{g}{\|g\|} & \text{if } \|g\| > \tau. \end{cases}
$$

Because the rescaling preserves direction and only shortens the step, the update still points downhill while its magnitude is bounded, which prevents a single sharp region of the loss surface from launching the parameters far away. Pascanu and colleagues introduced this technique precisely for recurrent training, where the loss landscape contains steep walls that a normal step would overshoot [5]. Clipping treats the symptom of explosion rather than its cause, and it does nothing for vanishing gradients, so it is best understood as a stabilizer applied alongside the structural fixes below.

```text
# Norm clipping (illustration, not executable)
total_norm = norm(concat(all gradients))
if total_norm > tau:
    scale = tau / (total_norm + eps)
    g = g * scale   # for every parameter gradient
```

## 6. Normalization Layers

Normalization layers keep the distribution of activations stable as training proceeds, which indirectly conditions the gradient flow. Batch normalization standardizes each feature across the minibatch,

$$
\hat{z} = \frac{z - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y = \gamma \hat{z} + \beta,
$$

with learnable scale $\gamma$ and shift $\beta$ [6]. Here $\beta$ is the conventional name for the shift and is unrelated to the activation slope bound $\beta$ used in the spectral condition. By holding preactivation statistics near unit variance, batch normalization keeps activation slopes in their responsive range and reduces the dependence of each layer's gradient on the scale of distant layers, which permits higher learning rates and deeper stacks. Layer normalization computes the same standardization across features within a single example rather than across the batch, which removes the dependence on batch size and makes it the normalization of choice in recurrent and transformer architectures [7].
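
The two normalizations apply the same formula and differ only in the axis over which the mean and variance are taken. The sketch below illustrates that difference under simplifying assumptions: it uses plain Python lists, computes training-time statistics only (no running averages for inference), and the function names `standardize`, `batch_norm`, and `layer_norm` are hypothetical rather than any library's API.

```python
import math


def standardize(values, eps=1e-5):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each feature (column) across the minibatch (rows)."""
    columns = [standardize(col, eps) for col in zip(*z)]
    # Transpose back to rows, then apply the per-feature scale and shift.
    return [
        [g * v + b for v, g, b in zip(row, gamma, beta)]
        for row in zip(*columns)
    ]


def layer_norm(z, gamma, beta, eps=1e-5):
    """Standardize each example (row) across its own features."""
    return [
        [g * v + b for v, g, b in zip(standardize(row, eps), gamma, beta)]
        for row in z
    ]


if __name__ == "__main__":
    z = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]  # 2 examples, 3 features
    ones, zeros = [1.0] * 3, [0.0] * 3
    print("batch norm:", batch_norm(z, ones, zeros))
    print("layer norm:", layer_norm(z, ones, zeros))
```

With a minibatch of one example, `batch_norm` has zero variance in every feature and returns only the shift, while `layer_norm` is unaffected, which illustrates the batch size dependence described above.
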
These methods do not eliminate the underlying product of Jacobians, but they prevent the runaway drift of activation scale that would otherwise push the network into a vanishing or exploding regime.

## 7. Architectural Fixes

The most durable solutions change the function being differentiated so that a short, well conditioned path always exists for the gradient.

### 7.1 Residual connections

A residual block computes $a^{(\ell)} = a^{(\ell-1)} + F(a^{(\ell-1)})$, adding the input back to the transformed output [8]. The Jacobian of the block is $I + \partial F / \partial a^{(\ell-1)}$, so the backward recursion becomes

$$
\frac{\partial \mathcal{L}}{\partial a^{(\ell-1)}} = \frac{\partial \mathcal{L}}{\partial a^{(\ell)}} \left( I + \frac{\partial F}{\partial a^{(\ell-1)}} \right).
$$

Expanding the product across many blocks yields a sum that includes the identity term, which carries the gradient unattenuated regardless of how small the residual branches are. The identity path is the reason networks hundreds and even thousands of layers deep can be optimized, and it is the structural backbone of modern convolutional and transformer models.

### 7.2 Gated recurrent units

Recurrent architectures solve their version of the problem with additive memory and multiplicative gates. The long short term memory network maintains a cell state $c_t$ updated by

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
$$

where the forget gate $f_t$, input gate $i_t$, and a candidate $\tilde{c}_t$ are functions of the input and previous hidden state [9]. The recurrence in $c_t$ is additive rather than a repeated matrix multiplication, so when the forget gate stays near one the cell state forms a near identity path through time, an analogue of the residual connection that lets gradients persist over long horizons.
This mechanism, the constant error carousel, is what allows the long short term memory network and the simpler gated recurrent unit, which merges the cell and hidden state and uses two gates instead of three, to learn dependencies across hundreds of time steps that defeat vanilla recurrence [12].

### 7.3 Attention and the elimination of depth in time

Self attention sidesteps the temporal product entirely by connecting every position to every other position in a single layer [10]. The path length between two tokens is constant rather than proportional to their distance, so a gradient between distant positions traverses a fixed and small number of operations. Removing the long multiplicative chain of recurrence is a large part of why attention based models train stably on very long sequences, and it explains the shift away from recurrent architectures for sequence modeling.

## 8. Practical Diagnosis and Synthesis

In practice the regime of a network is diagnosed by monitoring gradient norms per layer. Norms that shrink by orders of magnitude from output to input signal vanishing, while norms that spike or produce numerical overflow signal explosion.

The remedies compose rather than compete. A modern recipe combines a non-saturating activation with derivative near one, variance preserving initialization, a normalization layer, residual or gated paths for signal flow, and gradient clipping as a safety net against rare spikes. Each addresses a different factor in the product of Jacobians: the activation controls $\phi'$, initialization sets the initial $\|W\|$, normalization holds activation scale steady through training, architecture inserts an identity path that bypasses the product, and clipping caps the worst case step.

### 8.1 Which factor does each remedy control?

The table below makes the division of labor explicit. Reading it is the fastest way to map an observed symptom to the right intervention.

| Remedy | Factor of the product it controls | Helps vanishing | Helps explosion |
|---|---|---|---|
| Non-saturating activation (ReLU family) | activation slope $\phi'$ | yes | no |
| Variance preserving init (He, Xavier, orthogonal) | initial weight scale $\|W\|$ | yes | yes (at start) |
| Normalization (batch, layer) | activation scale during training | yes | yes |
| Residual or gated path | inserts an identity term in the Jacobian | yes | no |
| Gradient clipping | caps the realized step norm | no | yes |

### 8.2 When to use what, and pitfalls

Reach for residual connections and normalization first in feedforward and convolutional stacks; they are the highest leverage changes and are nearly always worth their modest cost. For recurrence, prefer gated cells or, where the task allows, replace recurrence with attention to remove the temporal product entirely. Add gradient clipping whenever you train recurrent models or see occasional loss spikes, since it is cheap insurance against the rare steep wall.

Several pitfalls recur. Gradient clipping fixes only explosion; a network that clips constantly is signaling a deeper conditioning problem, not solving one, and clipping does nothing for vanishing. Batch normalization couples examples within a minibatch, which degrades with very small batches and interacts subtly with recurrence, which is why layer normalization dominates in sequence and transformer models. He initialization assumes a ReLU style nonlinearity; pairing it with a saturating activation, or pairing Xavier with ReLU, leaves the variance miscalibrated by roughly a factor of two per layer when fan in and fan out are comparable. Residual blocks help only if the residual branch is initialized small enough that the identity term dominates early; a residual branch initialized at full scale can still place the network in an unstable regime.
Finally, monitoring a single global gradient norm hides the diagnosis: vanishing and explosion are spatial phenomena across depth, so log per-layer norms to see where in the stack the signal degrades.

Mature open-source frameworks such as PyTorch, JAX with Flax, and TensorFlow expose all of these tools (He and Xavier initializers, batch and layer normalization, norm based gradient clipping, and residual building blocks) as standard components, so the practitioner's task is diagnosis and composition rather than implementation. Together these methods convert the once formidable barrier of training deep and recurrent networks into routine engineering, and understanding which factor each one controls is what allows a practitioner to diagnose and repair a network that fails to learn.

## References

1. Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, 1994. https://doi.org/10.1109/72.279181
2. X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a.html
3. K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," ICCV, 2015. https://arxiv.org/abs/1502.01852
4. A. M. Saxe, J. L. McClelland, and S. Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks," ICLR, 2014. https://arxiv.org/abs/1312.6120
5. R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," ICML, 2013. https://arxiv.org/abs/1211.5063
6. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," ICML, 2015. https://arxiv.org/abs/1502.03167
7. J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint, 2016. https://arxiv.org/abs/1607.06450
8. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CVPR, 2016. https://arxiv.org/abs/1512.03385
9. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
10. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," NeurIPS, 2017. https://arxiv.org/abs/1706.03762
11. L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington, "Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks," ICML, 2018. https://arxiv.org/abs/1806.05393
12. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," EMNLP, 2014. https://doi.org/10.3115/v1/D14-1179
