204 Random Weight Initialization
Before a neural network can learn anything, its parameters must hold some value. The choice of those initial values, made before a single gradient step, exerts an outsized influence on whether training converges quickly, converges slowly, or fails outright. This chapter examines why initialization matters, the symmetry breaking problem that rules out constant initialization, the way the scale of random weights governs signal propagation through depth, and the characteristic failures that follow from naive choices.
204.1 1. Why Initialization Matters
Training a deep network is an iterative optimization over a high dimensional, nonconvex loss surface. Gradient descent does not search this surface globally. It begins at the point fixed by initialization and follows the local gradient from there. The starting point therefore selects which basin of attraction the optimizer can reach and shapes the conditioning of the early trajectory.
Three properties of the starting point matter most. First, the network must not begin in a degenerate configuration where many units compute identical functions, since such redundancy cannot be undone by gradient descent. Second, the magnitudes of activations and gradients must remain in a usable range as signals traverse many layers, neither shrinking toward zero nor swelling without bound. Third, the loss surface near the starting point should be well conditioned enough that a reasonable learning rate makes steady progress.
A useful way to see the leverage of initialization is to consider what the optimizer can and cannot repair. A poorly scaled initialization that drives activations into a saturated regime produces near zero gradients, and an optimizer that receives near zero gradients makes near zero progress. The pathology is self reinforcing. By contrast, a sensibly scaled random start places the network in a region where gradients are informative from the first step, and the remainder of training can proceed on the strength of the data.
204.2 2. The Symmetry Breaking Problem
204.2.1 2.1 Why Constant Initialization Fails
Suppose we initialize every weight in a layer to the same constant, the natural choice being zero. Consider a single hidden layer with input \(x \in \mathbb{R}^d\), weight matrix \(W \in \mathbb{R}^{h \times d}\), bias \(b\), and elementwise nonlinearity \(\phi\). The preactivation and activation of hidden unit \(i\) are
\[ z_i = \sum_{j=1}^{d} W_{ij} x_j + b_i, \qquad a_i = \phi(z_i). \]
If \(W_{ij} = c\) and \(b_i = c'\) for all \(i, j\), then every \(z_i\) is identical, hence every \(a_i\) is identical. The hidden units are indistinguishable. Worse, this property is preserved by gradient descent. During backpropagation the gradient with respect to \(W_{ij}\) is
\[ \frac{\partial \mathcal{L}}{\partial W_{ij}} = \delta_i \, x_j, \qquad \delta_i = \frac{\partial \mathcal{L}}{\partial z_i}. \]
Because the units share identical incoming weights and, if the next layer is initialized the same way, identical outgoing weights, the error signals \(\delta_i\) are also identical across \(i\). Every unit therefore receives the same update and remains equal to its neighbors after the step. The layer behaves as though it contained a single unit, no matter how wide it is. This is the symmetry breaking problem: identical units are stuck in a symmetric subspace from which deterministic, symmetric updates cannot escape.
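The argument can be checked numerically. The following is a minimal sketch, assuming a single hidden \(\tanh\) layer that feeds a scalar linear output trained with squared error; the function name `hidden_layer_grads`, the output weights `v`, and all numeric constants are illustrative choices rather than anything fixed by the text.

```python
import math
import random

def hidden_layer_grads(W, b, v, x, y):
    """Gradient of 0.5 * (y_hat - y)^2 w.r.t. W, where y_hat = sum_i v[i] * tanh(z_i)."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]
    a = [math.tanh(z_i) for z_i in z]
    err = sum(v_i * a_i for v_i, a_i in zip(v, a)) - y
    # delta_i = dL/dz_i = err * v_i * tanh'(z_i)
    delta = [err * v_i * (1.0 - a_i ** 2) for v_i, a_i in zip(v, a)]
    # dL/dW_ij = delta_i * x_j
    return [[d_i * x_j for x_j in x] for d_i in delta]

x = [0.5, -1.0, 2.0]   # arbitrary input
h = 4                  # number of hidden units

# Constant initialization: every row of the gradient is the same.
W_const = [[0.3] * len(x) for _ in range(h)]
grads_const = hidden_layer_grads(W_const, [0.1] * h, [0.7] * h, x, y=1.0)
rows_identical = all(row == grads_const[0] for row in grads_const)

# Random incoming weights: the rows of the gradient differ.
rng = random.Random(0)
W_rand = [[rng.gauss(0.0, 0.5) for _ in x] for _ in range(h)]
grads_rand = hidden_layer_grads(W_rand, [0.0] * h, [0.7] * h, x, y=1.0)
rows_differ = any(row != grads_rand[0] for row in grads_rand)

print("constant init, identical gradient rows:", rows_identical)
print("random init, gradient rows differ:", rows_differ)
```

Because every unit performs the same arithmetic on the same numbers under constant initialization, the gradient rows agree exactly, not merely approximately, so no number of update steps separates the units.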
204.2.2 2.2 Randomness as the Cure
The remedy is to break the symmetry by drawing the initial weights from a distribution so that distinct units begin life with distinct incoming weights. Random initialization from a continuous distribution ensures, with probability one, that no two rows of \(W\) are equal, so the corresponding gradients differ and units can specialize over the course of training. Symmetry breaking is the irreducible reason that initialization must contain randomness. Biases can safely be initialized to a constant such as zero, because the asymmetry already present in \(W\) is enough to differentiate the units.
A short sketch of the standard pattern:
    for each layer L:
        W[L] ~ Distribution(scale)   # random, breaks symmetry
        b[L] = 0                     # constant is fine
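The pattern can be written as runnable code. This is only a sketch: it assumes a zero mean Gaussian for the unspecified distribution, and the helper `init_layer`, the layer widths, and the placeholder `scale=0.1` are hypothetical choices made for illustration.

```python
import random

def init_layer(fan_in, fan_out, scale, rng):
    """Return (W, b): W has independent Gaussian entries with std `scale`; b is all zeros."""
    W = [[rng.gauss(0.0, scale) for _ in range(fan_in)] for _ in range(fan_out)]
    b = [0.0] * fan_out
    return W, b

rng = random.Random(42)
layer_sizes = [8, 16, 16, 4]   # hypothetical widths, input to output
params = [init_layer(n_in, n_out, scale=0.1, rng=rng)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# No two rows of any weight matrix coincide, so no two units start out identical.
distinct_rows = all(len({tuple(row) for row in W}) == len(W) for W, _ in params)
print("all units start distinct:", distinct_rows)
```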
The remaining question is not whether to use randomness but which distribution and, above all, what scale.
204.3 3. Scale and Signal Propagation
204.3.1 3.1 Variance Through One Layer
The scale of the random weights controls how the variance of activations changes from layer to layer. Treat the inputs and weights as random variables. Assume the components \(x_j\) are independent with mean zero and variance \(\operatorname{Var}(x)\), and the weights \(W_{ij}\) are independent, mean zero, with variance \(\sigma^2\), drawn independently of the inputs. For a single preactivation,
\[ \operatorname{Var}(z_i) = \sum_{j=1}^{d} \operatorname{Var}(W_{ij} x_j) = d \, \sigma^2 \operatorname{Var}(x), \]
where we used independence and zero means so that the variance of each product is the product of variances and cross terms vanish. The factor \(d\), the fan in of the layer, is the crux. If \(\sigma^2\) is held fixed without regard to \(d\), the variance is multiplied by \(d\sigma^2\) at every layer, so the scale of activations explodes when \(d\sigma^2 > 1\) and collapses when \(d\sigma^2 < 1\).
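The single layer relation is easy to check by simulation. The sketch below assumes Gaussian inputs and weights, although the derivation needs only zero means and independence; the function name `preactivation_variance` and the particular values of `d`, `sigma`, and `x_std` are arbitrary illustrations.

```python
import random
import statistics

def preactivation_variance(d, sigma, x_std, trials, seed=0):
    """Empirical variance of z = sum_j W_j * x_j, with fresh W and x on every trial."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        z = sum(rng.gauss(0.0, sigma) * rng.gauss(0.0, x_std) for _ in range(d))
        samples.append(z)
    return statistics.pvariance(samples)

d, sigma, x_std = 64, 0.2, 1.5
predicted = d * sigma ** 2 * x_std ** 2          # d * sigma^2 * Var(x)
measured = preactivation_variance(d, sigma, x_std, trials=5000)
print(f"predicted {predicted:.3f}, measured {measured:.3f}")
```

The measured value fluctuates around the prediction from run to run, with agreement tightening as `trials` grows.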
204.3.2 3.2 Compounding Through Depth
The single layer relation compounds. Let the network have \(L\) layers, each with fan in \(d\) and weight variance \(\sigma^2\), so that in a linearized regime every layer scales the variance by the same per layer gain \(g = d \sigma^2\). Ignoring the nonlinearity for intuition, the activation variance after \(L\) layers scales as
\[ \operatorname{Var}(z^{(L)}) \approx g^{L} \, \operatorname{Var}(x). \]
This is a geometric progression in depth. Three regimes follow:
- If \(g > 1\), activations grow exponentially with depth and the forward signal explodes.
- If \(g < 1\), activations shrink exponentially with depth and the forward signal vanishes.
- If \(g = 1\), the variance is preserved and signal magnitude stays stable across layers.
The same geometric law governs the backward pass. Gradients propagate through the transposes of the weight matrices, so the gradient variance scales as a product of per layer gains involving the fan out. When that product departs from one, gradients explode or vanish on the backward pass, which is the direct cause of stalled or unstable training in deep stacks.
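The three forward regimes can be reproduced with a small simulation. This is a sketch under simplifying assumptions: a purely linear stack of equal width layers, Gaussian weights, and a single random input. The function name `variance_after_depth` and the width and depth used are illustrative.

```python
import math
import random

def variance_after_depth(gain, d=64, depth=20, seed=0):
    """Push one random input through `depth` linear layers with Var(W_ij) = gain / d.

    Returns the ratio of the final mean squared activation to the initial one.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(gain / d)
    h = [rng.gauss(0.0, 1.0) for _ in range(d)]
    start = sum(v * v for v in h) / d
    for _ in range(depth):
        # Fresh d-by-d weight matrix per layer; no nonlinearity (linearized regime).
        h = [sum(rng.gauss(0.0, sigma) * v for v in h) for _ in range(d)]
    return sum(v * v for v in h) / d / start

depth = 20
for gain in (2.0, 1.0, 0.5):
    ratio = variance_after_depth(gain, depth=depth)
    print(f"g = {gain}: measured ratio {ratio:.3e}, predicted g^L = {gain ** depth:.3e}")
```

A single draw of the weights matches \(g^L\) only up to a random factor, since finite width adds noise at every layer, but the orders of magnitude separating the three regimes are unmistakable.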
204.3.3 3.3 The Variance Preserving Prescription
The design goal that emerges is to choose \(\sigma^2\) so that the per layer gain is close to one in both directions. For a linear activation, or one such as \(\tanh\) that is approximately linear with unit slope near zero, forward stability requires
\[ \sigma^2 = \frac{1}{d_{\text{in}}}, \]
so that \(d_{\text{in}} \sigma^2 = 1\). Backward stability by the same argument favors \(\sigma^2 = 1/d_{\text{out}}\). Since both cannot hold exactly unless fan in equals fan out, a common compromise balances the two,
\[ \sigma^2 = \frac{2}{d_{\text{in}} + d_{\text{out}}}. \]
This is the Xavier, or Glorot, prescription, derived from the requirement that activation and gradient variance be approximately preserved across layers [1]. For rectified linear units, which zero out the negative half of their input and thereby halve the second moment of a symmetric preactivation, the correction is to double the weight variance, giving
\[ \sigma^2 = \frac{2}{d_{\text{in}}}, \]
the He, or Kaiming, initialization [2]. The factor of two compensates for the half of the second moment \(\mathbb{E}[a^2]\) that is lost when \(\operatorname{ReLU}\) zeroes out the negative half of a symmetric distribution.
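The factor of two can be made explicit with a short calculation. The sketch assumes, beyond what the text states, that the preactivation \(z\) has a density \(p\) symmetric about zero, that biases are zero, and that the next layer's weights are zero mean and independent of the activations \(a = \operatorname{ReLU}(z)\). The prime on \(z'\) denotes a preactivation of the next layer and is notation introduced here.

\[ \mathbb{E}[a^2] = \int_{0}^{\infty} z^2 \, p(z) \, dz = \frac{1}{2} \int_{-\infty}^{\infty} z^2 \, p(z) \, dz = \frac{1}{2} \operatorname{Var}(z), \]
\[ \operatorname{Var}(z'_i) = \sum_{j=1}^{d_{\text{in}}} \mathbb{E}[W_{ij}^2] \, \mathbb{E}[a_j^2] = d_{\text{in}} \, \sigma^2 \cdot \frac{1}{2} \operatorname{Var}(z). \]

Requiring \(\operatorname{Var}(z'_i) = \operatorname{Var}(z)\) gives \(d_{\text{in}} \sigma^2 / 2 = 1\), that is, \(\sigma^2 = 2/d_{\text{in}}\).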
A compact statement of the rule:
# fan_in = number of inputs to the layer
std = sqrt(2.0 / fan_in) # He, for ReLU
W ~ Normal(mean=0, std=std)
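Both prescriptions fit in a few lines of runnable code. This is a sketch using only the Python standard library; the helper names `xavier_std`, `he_std`, and `sample_weights` and the layer shape are hypothetical, and the Gaussian is one common choice of distribution among several.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for the balanced Xavier/Glorot variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for the He/Kaiming variance 2 / fan_in (ReLU layers)."""
    return math.sqrt(2.0 / fan_in)

def sample_weights(fan_in, fan_out, std, rng):
    """fan_out-by-fan_in matrix of independent zero mean Gaussian weights."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
W = sample_weights(fan_in=256, fan_out=128, std=he_std(256), rng=rng)
print(f"He std for fan_in=256: {he_std(256):.5f}")
print(f"Xavier std for 256 -> 128: {xavier_std(256, 128):.5f}")
```

Deep learning frameworks typically ship equivalent initializers, so in practice a function like this is rarely written by hand; the sketch only makes the arithmetic visible.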
204.4 4. Naive Initialization Failures
It is instructive to catalog the failures that occur when these principles are ignored, since each maps to a recognizable training pathology.
204.4.1 4.1 All Zeros, or Any Constant
As established in section 2, constant initialization never breaks symmetry. The network trains as if it had one unit per layer and cannot represent functions that require diverse features. The visible symptom is a loss that decreases far less than expected and a model whose effective capacity is a tiny fraction of its parameter count.
204.4.2 4.2 Weights Too Large
Choosing \(\sigma\) large, for example sampling from a standard normal without any fan in correction, makes the per layer gain \(g = d \sigma^2 \gg 1\). Activations grow geometrically with depth. With saturating nonlinearities such as \(\tanh\) or the logistic sigmoid, the large preactivations push units into the flat tails where the derivative is nearly zero,
\[ \tanh'(z) = 1 - \tanh^2(z) \to 0 \quad \text{as } |z| \to \infty, \]
so backpropagated gradients are throttled to nearly nothing. The preactivations grow large while the backward signal vanishes, and the network freezes with saturated units. With unbounded nonlinearities such as \(\operatorname{ReLU}\), the activations themselves grow geometrically, and the activations and the loss can overflow to infinity on the first forward pass.
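A few evaluations show how quickly the \(\tanh\) derivative collapses. The sketch below assumes unit variance inputs, fan in 256, and standard normal weights, so that a typical first layer preactivation has magnitude \(\sqrt{d\sigma^2} = 16\); the function name `tanh_grad` and the sample points are illustrative.

```python
import math

def tanh_grad(z):
    """Derivative of tanh at z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# Preactivation standard deviation after one layer with unit variance inputs,
# fan in 256, and standard normal weights: sqrt(d * sigma^2) = sqrt(256) = 16.
typical_z = math.sqrt(256 * 1.0 ** 2)

for z in (0.0, 1.0, 4.0, typical_z):
    print(f"z = {z:5.1f}   tanh'(z) = {tanh_grad(z):.3e}")
```

At the typical preactivation scale the local derivative is many orders of magnitude below its value at the origin, and the backward pass multiplies by such a factor at every saturated layer.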
204.4.3 4.3 Weights Too Small
Choosing \(\sigma\) very small makes \(g \ll 1\). Activations contract geometrically toward zero as depth increases, and by the deepest layers the signal carries almost no information about the input. On the backward pass the gradients likewise shrink to insignificance, so the early layers receive essentially no learning signal. Training appears to stall: the loss plateaus near its initial value and the weights remain close to their starting point. This is one concrete mechanism of the vanishing gradient problem, traceable directly to scale rather than to any defect of the data.
204.4.4 4.4 Ignoring the Nonlinearity
Even a fan in scaling can fail if it is matched to the wrong nonlinearity. Applying the forward Xavier variance \(1/d_{\text{in}}\) to a deep \(\operatorname{ReLU}\) network undershoots, because \(\operatorname{ReLU}\) removes half of the signal's second moment at every layer. The shortfall compounds over depth, shrinking the signal by roughly \(2^{-L}\) after \(L\) layers, so very deep \(\operatorname{ReLU}\) stacks initialized this way exhibit strongly attenuated activations and slow training. This is why the He correction was needed to train the first very deep rectified networks [2].
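The mismatch is easy to observe in simulation. The following sketch assumes an equal width \(\operatorname{ReLU}\) stack with zero biases and Gaussian weights, and compares weight variance \(1/d_{\text{in}}\) against \(2/d_{\text{in}}\); the function name `relu_signal_ratio`, the width, and the depth are illustrative choices.

```python
import math
import random

def relu_signal_ratio(var_times_fan_in, d=64, depth=20, seed=0):
    """Ratio of final to initial mean squared activation in a deep ReLU stack.

    Each layer uses fresh zero mean Gaussian weights with
    Var(W_ij) = var_times_fan_in / d and zero biases.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(var_times_fan_in / d)
    h = [rng.gauss(0.0, 1.0) for _ in range(d)]
    start = sum(v * v for v in h) / d
    for _ in range(depth):
        z = [sum(rng.gauss(0.0, sigma) * v for v in h) for _ in range(d)]
        h = [max(0.0, z_i) for z_i in z]   # ReLU
    return sum(v * v for v in h) / d / start

xavier_like = relu_signal_ratio(1.0)   # sigma^2 = 1 / d_in
he = relu_signal_ratio(2.0)            # sigma^2 = 2 / d_in
print(f"1/d_in: {xavier_like:.3e}   2/d_in: {he:.3e}   2^-20 = {2.0 ** -20:.3e}")
```

Because both runs share a seed and \(\operatorname{ReLU}\) is positively homogeneous, the two results differ by a factor of \(2^{-20}\) up to rounding, which isolates the effect of the missing factor of two from the sampling noise.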
204.4.5 4.5 A Numerical Illustration
Consider a network of \(L = 50\) layers, each with fan in \(d = 256\), and weights drawn with \(\sigma = 0.1\). The per layer gain is \(g = d\sigma^2 = 256 \times 0.01 = 2.56\). After fifty layers the activation variance is multiplied by
\[ g^{L} = 2.56^{50} \approx 10^{20}, \]
a growth of roughly ten orders of magnitude in the standard deviation, enough to saturate any bounded nonlinearity and to overflow reduced precision arithmetic. Halving the standard deviation to \(\sigma = 0.05\) gives \(g = 0.64\) and a contraction by \(0.64^{50} \approx 10^{-10}\), an equally fatal collapse in the opposite direction. Only the narrow choice that places \(g\) near one, here \(\sigma = 1/\sqrt{256} = 0.0625\), keeps the signal alive across all fifty layers. The example underlines how sharply the outcome depends on a single scale parameter once depth is large.
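The arithmetic of the illustration can be reproduced directly. The sketch works in \(\log_{10}\) to keep the numbers readable; the function name `log10_variance_factor` is a hypothetical helper introduced for this example.

```python
import math

def log10_variance_factor(d, sigma, depth):
    """log10 of g^L, where g = d * sigma^2 is the per layer variance gain."""
    return depth * math.log10(d * sigma ** 2)

d, depth = 256, 50
for sigma in (0.1, 0.05, 1.0 / math.sqrt(d)):
    gain = d * sigma ** 2
    exponent = log10_variance_factor(d, sigma, depth)
    print(f"sigma = {sigma:.4f}   g = {gain:.2f}   g^L ~ 10^{exponent:.1f}")
```

A change of less than a factor of two in \(\sigma\) on either side of \(1/\sqrt{d}\) moves the exponent by roughly ten to twenty, which is the sensitivity the example is meant to convey.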
204.5 5. Summary
Initialization is the seed from which all subsequent learning grows. It must contain randomness so that units differentiate, since any constant initialization leaves the network trapped in a symmetric subspace that gradient descent cannot leave. It must be scaled so that the per layer variance gain stays near one, since the gain compounds geometrically with depth and any departure from unity drives the forward and backward signals to explode or vanish. The Xavier and He prescriptions encode exactly this variance preserving requirement, each tuned to its nonlinearity. The naive alternatives, whether constant, too large, too small, or mismatched to the activation function, each map to a distinct and predictable training failure. A principled random initialization does not by itself guarantee success, but it removes the failures that would otherwise make success impossible.
204.6 References
- [1] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010. https://proceedings.mlr.press/v9/glorot10a.html
- [2] He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. https://arxiv.org/abs/1502.01852
- [3] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 8: Optimization for Training Deep Models. MIT Press, 2016. https://www.deeplearningbook.org/
- [4] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. ICML 2013. https://proceedings.mlr.press/v28/sutskever13.html