204 Random Weight Initialization

Before a neural network can learn anything, its parameters must hold some value. The choice of those initial values, made before a single gradient step, exerts an outsized influence on whether training converges quickly, converges slowly, or fails outright. This chapter examines why initialization matters, the symmetry breaking problem that rules out constant initialization, the way the scale of random weights governs signal propagation through depth, and the characteristic failures that follow from naive choices.

The treatment is conceptual rather than algorithmic. There is no single procedure to package, but there is a small body of probability theory, centered on how variance propagates through a composition of linear maps and nonlinearities, that explains essentially every initialization rule in common use. Understanding that theory once is enough to reason about any new architecture.

204.1 1. Why Initialization Matters

Training a deep network is an iterative optimization over a high dimensional, nonconvex loss surface. Gradient descent does not search this surface globally. It begins at the point fixed by initialization and follows the local gradient from there. The starting point therefore determines which basins of attraction the optimizer can reach and shapes the conditioning of the early trajectory.

Three properties of the starting point matter most. First, the network must not begin in a degenerate configuration where many units compute identical functions, since such redundancy cannot be undone by gradient descent. Second, the magnitudes of activations and gradients must remain in a usable range as signals traverse many layers, neither shrinking toward zero nor swelling without bound. Third, the loss surface near the starting point should be well conditioned enough that a reasonable learning rate makes steady progress.

A useful way to see the leverage of initialization is to consider what the optimizer can and cannot repair. A poorly scaled initialization that drives activations into a saturated regime produces near zero gradients, and an optimizer that receives near zero gradients makes near zero progress. The pathology is self reinforcing. By contrast, a sensibly scaled random start places the network in a region where gradients are informative from the first step, and the remainder of training can proceed on the strength of the data.

It is worth stating plainly what initialization is not. It is not a substitute for the architectural and algorithmic tools that also stabilize training, such as normalization layers, residual connections, and adaptive optimizers. Those tools widen the range of initializations that succeed, but they do not eliminate the underlying variance bookkeeping. A network with batch normalization after every layer is far more forgiving of scale, yet it still requires symmetry breaking, and a residual network still benefits from initializing residual branches so that each block starts close to the identity. Initialization is the first line of defense, and the principles below remain the reference point even when other mechanisms share the load.

204.2 2. The Symmetry Breaking Problem

204.2.1 2.1 Why Constant Initialization Fails

Suppose we initialize every weight in a layer to the same constant, the natural choice being zero. Consider a single hidden layer with input $x \in \mathbb{R}^d$, weight matrix $W \in \mathbb{R}^{h \times d}$, bias $b$, and elementwise nonlinearity $\phi$. The preactivation and activation of hidden unit $i$ are

\[ z_i = \sum_{j=1}^{d} W_{ij} x_j + b_i, \qquad a_i = \phi(z_i). \]

If $W_{ij} = c$ and $b_i = c'$ for all $i, j$, then every $z_i$ is identical, hence every $a_i$ is identical. The hidden units are indistinguishable. Worse, this property is preserved by gradient descent. During backpropagation the gradient with respect to $W_{ij}$ is

\[ \frac{\partial \mathcal{L}}{\partial W_{ij}} = \delta_i \, x_j, \qquad \delta_i = \frac{\partial \mathcal{L}}{\partial z_i}. \]

Because the units share identical incoming weights and identical outgoing weights, the error signals $\delta_i$ are also identical across $i$. Every unit therefore receives the same update and remains equal to its neighbors after the step. The layer behaves as though it contained a single unit, no matter how wide it is.

We can make the invariance precise. Let $\pi$ be any permutation of the $h$ hidden units of a layer, acting by permuting the rows of $W$ and $b$ and the corresponding columns of the next layer’s weight matrix. The function computed by the network is invariant under $\pi$, and so the gradient field is equivariant: if a parameter configuration is fixed by $\pi$, its gradient is fixed by $\pi$ as well. A constant initialization is fixed by every such permutation. Full batch gradient descent, being a deterministic function of the gradient, therefore keeps the iterate in the permutation invariant subspace forever. That subspace is exactly the set of configurations in which all units of the layer are identical, so the effective width collapses to one.

Proposition (constant initialization is a trap)

Let the parameters of a layer be initialized so that all rows of $W$ are equal, all entries of $b$ are equal, and the units' outgoing weights (the corresponding columns of the next layer's weight matrix) are equal. Under full batch gradient descent with any shared learning rate, these equalities persist at every iteration. Consequently the layer can represent only functions of the form $x \mapsto u\,\phi(w^\top x + b_0)$ for a shared row $w$, scalar bias $b_0$, and shared output weight $u$ (up to the constant factor $h$, which can be absorbed into $u$), regardless of its nominal width $h$.

The argument is the permutation equivariance above: the symmetric subspace is invariant under the gradient map, and deterministic updates cannot leave an invariant subspace. Minibatch sampling does not break the symmetry either, because the gradient on any minibatch is itself permutation equivariant, so every stochastic update maps the symmetric subspace to itself. Only noise that treats units differently, such as independent dropout masks, can separate them.
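The trap can be observed directly on a toy problem. The sketch below is a minimal illustration, not a reference implementation: the three data points, the width, the constant `c`, and the function name `train_constant_init` are all hypothetical choices made for the example. It runs full batch gradient descent on a one hidden layer $\tanh$ network whose weights all start at the same constant and then checks whether the hidden units ever separate.

```python
import math

def train_constant_init(steps=5, lr=0.1, h=3, c=0.5):
    """Full batch gradient descent on a tiny tanh network with constant init."""
    # Toy data (hypothetical): two inputs, scalar target.
    data = [((1.0, 2.0), 1.0), ((-1.0, 0.5), -1.0), ((0.3, -0.7), 0.2)]
    d = 2
    W = [[c] * d for _ in range(h)]   # hidden weights, every entry equal
    b = [0.0] * h                     # hidden biases, every entry equal
    u = [c] * h                       # output weights, every entry equal

    for _ in range(steps):
        gW = [[0.0] * d for _ in range(h)]
        gb = [0.0] * h
        gu = [0.0] * h
        for x, y in data:
            z = [sum(W[i][j] * x[j] for j in range(d)) + b[i] for i in range(h)]
            a = [math.tanh(zi) for zi in z]
            err = sum(u[i] * a[i] for i in range(h)) - y   # d(0.5 * err^2) / d(output)
            for i in range(h):
                delta = err * u[i] * (1.0 - a[i] ** 2)     # dL/dz_i
                gu[i] += err * a[i]
                gb[i] += delta
                for j in range(d):
                    gW[i][j] += delta * x[j]               # dL/dW_ij = delta_i * x_j
        for i in range(h):
            u[i] -= lr * gu[i]
            b[i] -= lr * gb[i]
            for j in range(d):
                W[i][j] -= lr * gW[i][j]
    return W, b, u

W, b, u = train_constant_init()
rows_equal = all(row == W[0] for row in W)
print("rows of W still identical:", rows_equal)
print("first row after training:", W[0])
```

The weights do move away from their initial value, but every row moves by the same amount, so the comparison is exact rather than approximate.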

204.2.2 2.2 Randomness as the Cure

The remedy is to break the symmetry by drawing the initial weights from a continuous distribution so that distinct units begin life with distinct incoming weights. Random initialization ensures, with probability one, that no two rows of $W$ are equal, so the corresponding gradients differ and the units are free to specialize over the course of training. Symmetry breaking is the irreducible reason that initialization must contain randomness.

Biases can safely be initialized to a constant such as zero, because the asymmetry already present in the random $W$ is enough to differentiate the units. Setting biases to zero is in fact preferable to randomizing them, since a zero bias keeps the preactivation centered and avoids adding gratuitous variance to the signal budget analyzed in the next section. A short sketch of the standard pattern, with the random draw confined to the weights:

for each layer L:
    W[L] ~ Distribution(scale)   # random, breaks symmetry
    b[L] = 0                      # constant is fine, keeps signal centered
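A runnable version of this pattern might look as follows. It is a sketch using only the Python standard library; the helper name `init_layer` and the layer sizes are assumptions made for illustration, and the scale `std` is left as a free parameter because its choice is the subject of the next section.

```python
import random

def init_layer(fan_in, fan_out, std, rng):
    """Return (W, b): W has fan_out rows of fan_in Gaussian weights, b is all zeros."""
    W = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    b = [0.0] * fan_out   # a constant bias is fine, the random W already breaks symmetry
    return W, b

rng = random.Random(0)    # seeded only to make the example reproducible
W, b = init_layer(fan_in=4, fan_out=3, std=0.5, rng=rng)

# With a continuous distribution, no two rows coincide.
distinct_rows = len({tuple(row) for row in W}) == len(W)
print("all rows distinct:", distinct_rows)
```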

The remaining question is not whether to use randomness but which distribution and, above all, what scale. The distribution itself matters little. A zero mean Gaussian and a zero mean uniform of matched variance behave almost identically through the propagation analysis, because that analysis depends on the weights only through their second moment. The scale, by contrast, matters enormously, as the next section shows.

204.3 3. Scale and Signal Propagation

204.3.1 3.1 Variance Through One Layer

The scale of the random weights controls how the variance of activations changes from layer to layer. Treat the inputs and weights as random variables. Assume the components $x_j$ are independent with mean zero and common variance $\operatorname{Var}(x)$, and the weights $W_{ij}$ are independent, mean zero, with variance $\sigma^2$, drawn independently of the inputs. For a single preactivation, using that the variance of a sum of independent terms is the sum of the variances and that for independent zero mean factors $\operatorname{Var}(W_{ij} x_j) = \operatorname{Var}(W_{ij})\operatorname{Var}(x_j)$,

\[ \operatorname{Var}(z_i) = \sum_{j=1}^{d} \operatorname{Var}(W_{ij} x_j) = d \, \sigma^2 \operatorname{Var}(x). \]

The factor $d$, the fan in of the layer, is the crux. If $\sigma^2$ is held fixed, the variance is multiplied by $d\sigma^2$ at every layer, so the scale of activations explodes with depth when $d\sigma^2 > 1$ and collapses when $d\sigma^2 < 1$; widening the layers at fixed $\sigma^2$ only enlarges the factor. The bias, when nonzero, contributes an additive $\operatorname{Var}(b_i)$ to this expression, which is a further reason to keep it at zero so that the weight scale alone governs the budget.
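The single layer relation is easy to check by simulation. The following sketch assumes Gaussian inputs and weights (any zero mean distributions with the stated variances would do) and uses illustrative values for $d$, $\sigma$, and $\operatorname{Var}(x)$; the function name `preactivation_variance` is hypothetical.

```python
import random
import statistics

def preactivation_variance(d, sigma, var_x, trials, rng):
    """Empirical variance of z = sum_j W_j x_j over fresh draws of W and x."""
    sd_x = var_x ** 0.5
    samples = []
    for _ in range(trials):
        z = sum(rng.gauss(0.0, sigma) * rng.gauss(0.0, sd_x) for _ in range(d))
        samples.append(z)
    return statistics.pvariance(samples)

rng = random.Random(0)
d, sigma, var_x = 32, 0.2, 1.5
empirical = preactivation_variance(d, sigma, var_x, trials=10000, rng=rng)
predicted = d * sigma ** 2 * var_x   # the formula d * sigma^2 * Var(x)
print("empirical:", empirical, "predicted:", predicted)
```

The two numbers should agree to within sampling error, which shrinks as `trials` grows.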

204.3.2 3.2 Compounding Through Depth

The single layer relation compounds. Let the network have $L$ layers of common width $d$ and define the per layer gain $g = d \sigma^2$. Ignoring the nonlinearity for intuition, the activation variance after $L$ layers scales as

\[ \operatorname{Var}(z^{(L)}) \approx g^{L} \, \operatorname{Var}(x). \]

This is a geometric progression in depth, and the geometric nature is what makes initialization unforgiving. A multiplicative error per layer of even a few percent is harmless in a shallow network but catastrophic in a deep one, because it is raised to the power $L$. Three regimes follow:

If $g > 1$, activations grow exponentially with depth and the forward signal explodes.
If $g < 1$, activations shrink exponentially with depth and the forward signal vanishes.
If $g = 1$, the variance is preserved and signal magnitude stays stable across layers.

The same geometric law governs the backward pass. Gradients propagate through the transposes of the weight matrices, so the gradient variance scales as a product of per layer gains involving the fan out. When that product departs from one, gradients explode or vanish on the backward pass, which is the direct cause of stalled or unstable training in deep stacks. The forward and backward conditions are distinct, and a single weight scale cannot satisfy both unless fan in equals fan out, which motivates the compromise in the next subsection.
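The forward half of this argument can be sketched numerically. The code below pushes one random vector through a stack of purely linear random layers and compares the observed change in mean square with $g^L$. The depth, width, seed, and the name `forward_variance_ratio` are assumptions chosen to keep the example fast; at finite width the observed ratio fluctuates around the prediction, so only the order of magnitude should be compared.

```python
import math
import random

def forward_variance_ratio(gain, depth=20, width=64, seed=0):
    """Push one random vector through `depth` random linear layers.

    Each layer has i.i.d. N(0, gain / width) weights, so the per layer
    gain is g = d * sigma^2 = gain. Returns (mean square out) / (mean square in).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(gain / width)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    ms_in = sum(v * v for v in a) / width
    for _ in range(depth):
        # One output unit per fresh row of Gaussian weights.
        a = [sum(rng.gauss(0.0, sigma) * v for v in a) for _ in range(width)]
    ms_out = sum(v * v for v in a) / width
    return ms_out / ms_in

for g in (0.5, 1.0, 2.0):
    print("g =", g, "observed:", forward_variance_ratio(g), "predicted g^L:", g ** 20)
```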

The following diagram summarizes the three regimes and the single knob that selects among them.

flowchart TD
    A["Per layer gain g equals d times sigma squared"] --> B{"Value of g"}
    B -->|"g greater than 1"| C["Forward signal explodes, risk of overflow"]
    B -->|"g less than 1"| D["Forward signal vanishes, deep layers carry no information"]
    B -->|"g equals 1"| E["Variance preserved, stable training"]
    E --> F["Choose sigma squared to set g near one for the given nonlinearity"]

204.3.3 3.3 The Variance Preserving Prescription

The design goal that emerges is to choose $\sigma^2$ so that the per layer gain is close to one in both directions. For a linear network or an odd nonlinearity that behaves linearly near zero, forward stability requires

\[ \sigma^2 = \frac{1}{d_{\text{in}}}, \]

so that $d_{\text{in}} \sigma^2 = 1$. Backward stability by the same argument favors $\sigma^2 = 1/d_{\text{out}}$. Since both cannot hold exactly unless fan in equals fan out, a common compromise takes the harmonic mean of the two target variances, which amounts to averaging the fan in and fan out,

\[ \sigma^2 = \frac{2}{d_{\text{in}} + d_{\text{out}}}. \]

This is the Xavier, or Glorot, prescription, derived precisely from the requirement that activation and gradient variance be approximately preserved across layers [1]. It can be realized with either a Gaussian of this variance or, equivalently in second moment, a uniform distribution on $\pm\sqrt{6/(d_{\text{in}} + d_{\text{out}})}$, since a uniform variable on $\pm a$ has variance $a^2/3$.
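The equivalence of the two realizations in second moment can be checked in a few lines. This is a sketch with illustrative layer sizes; `xavier_uniform_bound` is a hypothetical helper name, not a library function.

```python
import math
import random
import statistics

def xavier_uniform_bound(fan_in, fan_out):
    """Half width a of U(-a, a) whose variance a^2 / 3 equals 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

fan_in, fan_out = 300, 100
a = xavier_uniform_bound(fan_in, fan_out)
target_var = 2.0 / (fan_in + fan_out)

rng = random.Random(0)
uniform_var = statistics.pvariance([rng.uniform(-a, a) for _ in range(50000)])
gauss_var = statistics.pvariance(
    [rng.gauss(0.0, math.sqrt(target_var)) for _ in range(50000)]
)
print("target:", target_var, "uniform:", uniform_var, "gaussian:", gauss_var)
```

Both sample variances should land close to the target, which is all the propagation analysis depends on.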

For rectified linear units, which discard the negative half of their input and thereby halve the second moment of the signal, the correction is to double the weight variance, giving

\[ \sigma^2 = \frac{2}{d_{\text{in}}}, \]

the He, or Kaiming, initialization [2]. The factor of two compensates exactly for the expected variance lost when $\operatorname{ReLU}$ zeroes out half of a symmetric distribution. The exactness is worth deriving once. If $z$ is symmetric about zero, then $\operatorname{ReLU}(z) = \max(0, z)$ satisfies

\[ \mathbb{E}\!\left[\operatorname{ReLU}(z)^2\right] = \mathbb{E}\!\left[z^2 \,\mathbf{1}\{z > 0\}\right] = \tfrac{1}{2}\,\mathbb{E}[z^2] = \tfrac{1}{2}\operatorname{Var}(z), \]

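because by symmetry the event $z > 0$ contributes exactly half of the second moment. The activation therefore carries half the second moment of its preactivation, and the second moment is what enters the next layer's variance, since $\operatorname{Var}(W_{ij} a_j) = \sigma^2\,\mathbb{E}[a_j^2]$ for zero mean weights independent of $a_j$. To restore unit gain across the linear map plus the nonlinearity we must double $\sigma^2$ relative to the linear case. A compact statement of the rule: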

# fan_in = number of inputs to the layer
std = sqrt(2.0 / fan_in)      # He, for ReLU
W ~ Normal(mean=0, std=std)
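The same rule as runnable Python, together with a Monte Carlo check of the factor of one half derived above. This is a sketch under the stated assumption that the preactivation is symmetric about zero (a Gaussian is used here); the names `he_normal` and `relu` are local helpers, not library calls.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Weights drawn i.i.d. from N(0, 2 / fan_in), one row per output unit."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def relu(z):
    return z if z > 0.0 else 0.0

rng = random.Random(0)

# Check E[ReLU(z)^2] = 0.5 * E[z^2] for a symmetric z (scale 1.7 is arbitrary).
zs = [rng.gauss(0.0, 1.7) for _ in range(100000)]
ratio = sum(relu(z) ** 2 for z in zs) / sum(z * z for z in zs)
print("fraction of second moment kept by ReLU:", ratio)

# Check that the sampled weights have variance close to 2 / fan_in.
W = he_normal(fan_in=256, fan_out=8, rng=rng)
flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)
print("sample variance:", sample_var, "target:", 2.0 / 256)
```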

More general nonlinearities are handled by a gain correction $\sigma^2 = c / d_{\text{in}}$, where $c$ compensates for the variance the activation removes. For ReLU $c = 2$, for the identity or odd functions near zero $c = 1$, and for variants such as leaky ReLU with negative slope $\alpha$ the correction is $c = 2/(1 + \alpha^2)$, which recovers $c = 2$ at $\alpha = 0$ and approaches $c = 1$ as $\alpha \to 1$. The principle is uniform: estimate how much variance the nonlinearity passes through, then choose the weight scale to cancel it.
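The leaky ReLU correction can be verified the same way: the activation passes a fraction $(1 + \alpha^2)/2$ of the second moment of a symmetric input, and $c$ is the reciprocal of that fraction. The sketch below assumes a standard Gaussian preactivation; the function names are hypothetical.

```python
import random

def variance_gain_correction(negative_slope):
    """c = 2 / (1 + alpha^2) for leaky ReLU with negative slope alpha."""
    return 2.0 / (1.0 + negative_slope ** 2)

def leaky_relu(z, negative_slope):
    return z if z > 0.0 else negative_slope * z

def passed_second_moment_fraction(negative_slope, n=50000, seed=0):
    """Monte Carlo estimate of E[leaky_relu(z)^2] / E[z^2] for z ~ N(0, 1)."""
    rng = random.Random(seed)
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    kept = sum(leaky_relu(z, negative_slope) ** 2 for z in zs)
    return kept / sum(z * z for z in zs)

for alpha in (0.0, 0.01, 0.2, 1.0):
    c = variance_gain_correction(alpha)
    kept = passed_second_moment_fraction(alpha)
    # c * kept should be close to one if the correction cancels the loss.
    print("alpha =", alpha, "c =", c, "c * fraction kept =", c * kept)
```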

204.4 4. Naive Initialization Failures

It is instructive to catalog the failures that occur when these principles are ignored, since each maps to a recognizable training pathology.

204.4.1 4.1 All Zeros, or Any Constant

As established in section 2, constant initialization never breaks symmetry. The network trains as if it had one unit per layer and cannot represent functions that require diverse features. The visible symptom is a loss that decreases far less than expected and a model whose effective capacity is a tiny fraction of its parameter count. A telltale diagnostic is that the hidden activations within a layer remain numerically identical across units after several steps.
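The diagnostic mentioned above is simple to automate. The helper below is a hypothetical sketch: it assumes the hidden activations of one layer have been collected as a list of rows, one row per example and one column per unit, and reports whether every unit produced the same value on every example.

```python
def units_are_identical(activations, tol=0.0):
    """True if, for every example, all hidden units output the same value.

    activations: list of rows, one row per example, one column per hidden unit.
    tol: allowed absolute difference (0.0 demands exact equality).
    """
    for row in activations:
        first = row[0]
        if any(abs(v - first) > tol for v in row[1:]):
            return False
    return True

# Symmetric layer: columns coincide on every example.
collapsed = [[0.31, 0.31, 0.31], [-0.82, -0.82, -0.82]]
# Healthy layer: units respond differently.
healthy = [[0.31, -0.12, 0.77], [-0.82, 0.40, 0.05]]

print(units_are_identical(collapsed), units_are_identical(healthy))
```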

204.4.2 4.2 Weights Too Large

Choosing $\sigma$ large, for example sampling from a standard normal without any fan in correction, makes the per layer gain $g = d \sigma^2 \gg 1$. Activations grow geometrically with depth. With saturating nonlinearities such as $\tanh$ or the logistic sigmoid, the large preactivations push units into the flat tails where the derivative is nearly zero,

\[ \tanh'(z) = 1 - \tanh^2(z) \to 0 \quad \text{as } |z| \to \infty, \]

so backpropagated gradients are throttled to nearly nothing. The forward signal explodes while the backward signal vanishes, and the network either diverges to numerical overflow or freezes with saturated units. With unbounded nonlinearities the activations and the loss can simply overflow to infinity on the first forward pass, surfacing as a NaN loss before any learning occurs.
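To put numbers on the saturation, the sketch below evaluates $\tanh'(z)$ at a few magnitudes. The value $z = 16$ is chosen as an illustration because it is the typical preactivation size, $\sqrt{256}$, when unit variance inputs meet standard normal weights at a fan in of 256; that pairing is an assumption of the example.

```python
import math

def tanh_derivative(z):
    """tanh'(z) = 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

for z in (0.0, 2.0, 5.0, 16.0):
    print("z =", z, "tanh'(z) =", tanh_derivative(z))
```

Each layer of saturated units multiplies the backpropagated gradient by a factor of this size, so a few such layers are enough to erase the signal.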

204.4.3 4.3 Weights Too Small

Choosing $\sigma$ very small makes $g \ll 1$. Activations contract geometrically toward zero as depth increases, and by the deepest layers the signal carries almost no information about the input. On the backward pass the gradients likewise shrink to insignificance, so the early layers receive essentially no learning signal. Training appears to stall: the loss plateaus near its initial value and the deep layers remain close to their starting point. This is one concrete mechanism of the vanishing gradient problem, traceable directly to scale rather than to any defect of the data [3].

204.4.4 4.4 Ignoring the Nonlinearity

Even a fan in scaling can fail if it is matched to the wrong nonlinearity. Applying the linear rule $\sigma^2 = 1/d_{\text{in}}$, to which the Xavier variance reduces when fan in equals fan out, to a deep $\operatorname{ReLU}$ network undershoots, because $\operatorname{ReLU}$ removes half the second moment at every layer. With gain $g = \tfrac{1}{2}$ per layer instead of one, the activation variance after $L$ layers is suppressed by a factor $2^{-L}$, so a fifty layer stack loses roughly fifteen orders of magnitude of variance ($2^{-50} \approx 8.9 \times 10^{-16}$), or between seven and eight orders of magnitude in typical activation size. This is why the He correction was needed to train the first very deep rectified networks [2]. The symmetric mistake, applying the He variance $2/d_{\text{in}}$ to a $\tanh$ network, overshoots and pushes units toward saturation.
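The mismatch is visible in a small simulation. The sketch below builds a stack of random linear layers followed by ReLU and compares the weight variance $1/d_{\text{in}}$ with $2/d_{\text{in}}$. The depth of 30 and width of 64 are assumptions chosen for speed, so the predicted suppression here is $2^{-30}$ rather than $2^{-50}$, and finite width adds noise around the prediction.

```python
import math
import random

def relu_stack_ratio(weight_var_numerator, depth=30, width=64, seed=0):
    """Mean square of the final activations divided by that of the input.

    Weights are i.i.d. N(0, weight_var_numerator / width); every layer is a
    random linear map followed by ReLU. Biases are zero.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(weight_var_numerator / width)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    ms_in = sum(v * v for v in a) / width
    for _ in range(depth):
        z = [sum(rng.gauss(0.0, sigma) * v for v in a) for _ in range(width)]
        a = [max(0.0, v) for v in z]
    return sum(v * v for v in a) / width / ms_in

ratio_linear_rule = relu_stack_ratio(1.0)   # sigma^2 = 1 / fan_in
ratio_he_rule = relu_stack_ratio(2.0)       # sigma^2 = 2 / fan_in
print("1/fan_in:", ratio_linear_rule)
print("2/fan_in:", ratio_he_rule)
print("predicted suppression 2^-30:", 2.0 ** -30)
```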

204.4.5 4.5 A Numerical Illustration

Consider a network of $L = 50$ layers, each with fan in $d = 256$, and weights drawn with $\sigma = 0.1$. The per layer gain is $g = d\sigma^2 = 256 \times 0.01 = 2.56$. After fifty layers the activation variance is multiplied by

\[ g^{L} = 2.56^{50} \approx 2.6 \times 10^{20}, \]

a growth of roughly ten orders of magnitude in typical activation size, far beyond the range of half precision arithmetic and enough to saturate any bounded nonlinearity. Halving the standard deviation to $\sigma = 0.05$ gives $g = 256 \times 0.0025 = 0.64$ and a contraction by $0.64^{50} \approx 2.0 \times 10^{-10}$, an equally fatal collapse in the opposite direction. Only the narrow choice that places $g$ near one, here $\sigma = 1/\sqrt{256} = 0.0625$ for a linear stack, keeps the signal alive across all fifty layers. The two failing values of $\sigma$ differ by a mere factor of two, yet their outcomes differ by thirty orders of magnitude, which is the clearest possible demonstration of how sharply the result depends on a single scale parameter once depth is large.

The table below collects the three cases for the same depth and width.

Standard deviation $\sigma$	Per layer gain $g = d\sigma^2$	Variance factor $g^{50}$	Outcome
$0.05$	$0.64$	$\approx 10^{-10}$	signal vanishes
$0.0625$	$1.00$	$1$	signal preserved
$0.1$	$2.56$	$\approx 10^{20}$	signal explodes
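The entries of the table follow from two lines of arithmetic, reproduced here as a sketch so the numbers can be recomputed for other depths and widths. The function name `variance_table` is hypothetical.

```python
def variance_table(d=256, depth=50, sigmas=(0.05, 0.0625, 0.1)):
    """Rows of (sigma, per layer gain g = d * sigma^2, variance factor g^depth)."""
    rows = []
    for sigma in sigmas:
        g = d * sigma ** 2
        rows.append((sigma, g, g ** depth))
    return rows

rows = variance_table()
for sigma, g, factor in rows:
    print(f"sigma={sigma:<7} g={g:<6.4g} g^50={factor:.3g}")
```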

204.5 5. When to Use What, and Pitfalls

A short practical perspective complements the theory.

Use He initialization with $\sigma^2 = 2/d_{\text{in}}$ for networks built on ReLU and its close relatives, which is the default for most modern convolutional and feedforward architectures.
Use Xavier initialization with $\sigma^2 = 2/(d_{\text{in}} + d_{\text{out}})$ for $\tanh$ and other odd, roughly linear near zero activations, and as a reasonable default when the activation is unknown.
Match the gain to the activation deliberately rather than by habit. The single most common scaling mistake is reusing a default tuned for a different nonlinearity, as in section 4.4.
Keep biases at zero unless a specific reason argues otherwise, so that the weight scale alone controls the variance budget.
Remember that normalization layers and residual connections relax these requirements but do not remove them. With a residual block $x \mapsto x + F(x)$, initializing the last layer of the branch $F$ to small or zero weights makes each block start near the identity, which keeps the gain near one through arbitrary depth and is a widely used stabilizer.
When a deep network refuses to learn, inspect the per layer activation and gradient statistics before changing the data or the optimizer. A monotone decay or growth of these statistics across layers is the fingerprint of a scale mismatch and points directly back to the initialization.
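The last check in the list can be sketched as a small helper. It assumes the root mean square of the activations (or gradients) has already been recorded for each layer, in order of depth; the function name `depth_trend` and the threshold `tol` are illustrative assumptions, not established conventions.

```python
import math

def depth_trend(layer_rms, tol=0.05):
    """Classify how a per layer statistic changes with depth.

    layer_rms: positive RMS values, one per layer, ordered from input to output.
    tol: threshold on the average change in log10(RMS) per layer (arbitrary).
    """
    logs = [math.log10(v) for v in layer_rms]
    slope = (logs[-1] - logs[0]) / (len(logs) - 1)
    if slope > tol:
        return "growing"
    if slope < -tol:
        return "decaying"
    return "stable"

print(depth_trend([1.0, 0.5, 0.25, 0.125]))   # halves at every layer
print(depth_trend([1.0, 2.0, 4.0, 8.0]))      # doubles at every layer
print(depth_trend([1.0, 1.02, 0.98, 1.01]))   # roughly constant
```

A result of "growing" or "decaying" on the forward activations or the backward gradients is the cue to revisit the weight scale before touching the data or the optimizer.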

Mature open source frameworks implement these schemes directly. In PyTorch the functions torch.nn.init.kaiming_normal_ and torch.nn.init.xavier_uniform_ apply the He and Xavier rules with the appropriate gain, and in JAX the flax.linen and jax.nn.initializers modules expose he_normal, glorot_uniform, and related initializers. Using these rather than hand rolled draws avoids off by a factor of two errors in the fan computation, which are easy to make and silent in their effects.

204.6 6. Summary

Initialization is the seed from which all subsequent learning grows. It must contain randomness so that units differentiate, since any constant initialization leaves the network trapped in a permutation symmetric subspace that gradient descent cannot leave. It must be scaled so that the per layer variance gain stays near one, since the gain compounds geometrically with depth and any departure from unity drives the forward and backward signals to explode or vanish. The Xavier and He prescriptions encode exactly this variance preserving requirement, each tuned to its nonlinearity through a gain factor that cancels the variance the activation removes. The naive alternatives, whether constant, too large, too small, or mismatched to the activation function, each map to a distinct and predictable training failure. A principled random initialization does not by itself guarantee success, but it removes the failures that would otherwise make success impossible.

204.7 References

[1] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010. https://proceedings.mlr.press/v9/glorot10a.html
[2] He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. https://doi.org/10.1109/ICCV.2015.123
[3] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 8: Optimization for Training Deep Models. MIT Press, 2016. https://www.deeplearningbook.org/
[4] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. ICML 2013. https://proceedings.mlr.press/v28/sutskever13.html

# Random Weight Initialization Before a neural network can learn anything, its parameters must hold some value. The choice of those initial values, made before a single gradient step, exerts an outsized influence on whether training converges quickly, converges slowly, or fails outright. This chapter examines why initialization matters, the symmetry breaking problem that rules out constant initialization, the way the scale of random weights governs signal propagation through depth, and the characteristic failures that follow from naive choices. The treatment is conceptual rather than algorithmic. There is no single procedure to package, but there is a small body of probability theory, centered on how variance propagates through a composition of linear maps and nonlinearities, that explains essentially every initialization rule in common use. Understanding that theory once is enough to reason about any new architecture. ## 1. Why Initialization Matters Training a deep network is an iterative optimization over a high dimensional, nonconvex loss surface. Gradient descent does not search this surface globally. It begins at the point fixed by initialization and follows local curvature from there. The starting point therefore selects which basin of attraction the optimizer can reach and shapes the conditioning of the early trajectory. Three properties of the starting point matter most. First, the network must not begin in a degenerate configuration where many units compute identical functions, since such redundancy cannot be undone by gradient descent. Second, the magnitudes of activations and gradients must remain in a usable range as signals traverse many layers, neither shrinking toward zero nor swelling without bound. Third, the loss surface near the starting point should be well conditioned enough that a reasonable learning rate makes steady progress. A useful way to see the leverage of initialization is to consider what the optimizer can and cannot repair. 
A poorly scaled initialization that drives activations into a saturated regime produces near zero gradients, and an optimizer that receives near zero gradients makes near zero progress. The pathology is self reinforcing. By contrast, a sensibly scaled random start places the network in a region where gradients are informative from the first step, and the remainder of training can proceed on the strength of the data. It is worth stating plainly what initialization is not. It is not a substitute for the architectural and algorithmic tools that also stabilize training, such as normalization layers, residual connections, and adaptive optimizers. Those tools widen the range of initializations that succeed, but they do not eliminate the underlying variance bookkeeping. A network with batch normalization after every layer is far more forgiving of scale, yet it still requires symmetry breaking, and a residual network still benefits from initializing residual branches so that each block starts close to the identity. Initialization is the first line of defense, and the principles below remain the reference point even when other mechanisms share the load. ## 2. The Symmetry Breaking Problem ### 2.1 Why Constant Initialization Fails Suppose we initialize every weight in a layer to the same constant, the natural choice being zero. Consider a single hidden layer with input $x \in \mathbb{R}^d$, weight matrix $W \in \mathbb{R}^{h \times d}$, bias $b$, and elementwise nonlinearity $\phi$. The preactivation and activation of hidden unit $i$ are $$ z_i = \sum_{j=1}^{d} W_{ij} x_j + b_i, \qquad a_i = \phi(z_i). $$ If $W_{ij} = c$ and $b_i = c'$ for all $i, j$, then every $z_i$ is identical, hence every $a_i$ is identical. The hidden units are indistinguishable. Worse, this property is preserved by gradient descent. 
During backpropagation the gradient with respect to $W_{ij}$ is $$ \frac{\partial \mathcal{L}}{\partial W_{ij}} = \delta_i \, x_j, \qquad \delta_i = \frac{\partial \mathcal{L}}{\partial z_i}. $$ Because the units share identical incoming weights and identical outgoing weights, the error signals $\delta_i$ are also identical across $i$. Every unit therefore receives the same update and remains equal to its neighbors after the step. The layer behaves as though it contained a single unit, no matter how wide it is. We can make the invariance precise. Let $\pi$ be any permutation of the $h$ hidden units of a layer, acting by permuting the rows of $W$ and $b$ and the corresponding columns of the next layer's weight matrix. The function computed by the network is invariant under $\pi$, and so the gradient field is equivariant: if a parameter configuration is fixed by $\pi$, its gradient is fixed by $\pi$ as well. A constant initialization is fixed by every such permutation. Full batch gradient descent, being a deterministic function of the gradient, therefore keeps the iterate in the permutation invariant subspace forever. That subspace is exactly the set of configurations in which all units of the layer are identical, so the effective width collapses to one. ::: {.callout-note} ## Proposition (constant initialization is a trap) Let the parameters of a layer be initialized so that all rows of $W$ are equal and all entries of $b$ are equal. Under full batch gradient descent with any shared learning rate, all rows of $W$ remain equal and all entries of $b$ remain equal at every iteration. Consequently the layer can represent only functions of the form $x \mapsto u\,\phi(w^\top x + b_0)$ for a shared row $w$, scalar bias $b_0$, and shared output weight $u$, regardless of its nominal width $h$. The argument is the permutation equivariance above: the symmetric subspace is invariant under the gradient map, and deterministic updates cannot leave an invariant subspace. 
Stochastic minibatch noise does not reliably break the symmetry either, because the expected update preserves it and the fluctuations are not designed to push consistently in any one asymmetric direction. ::: ### 2.2 Randomness as the Cure The remedy is to break the symmetry by drawing the initial weights from a continuous distribution so that distinct units begin life with distinct incoming weights. Random initialization ensures, with probability one, that no two rows of $W$ are equal, that the corresponding gradients differ, and that units specialize over the course of training. Symmetry breaking is the irreducible reason that initialization must contain randomness. Biases can safely be initialized to a constant such as zero, because the asymmetry already present in the random $W$ is enough to differentiate the units. Setting biases to zero is in fact preferable to randomizing them, since a zero bias keeps the preactivation centered and avoids adding gratuitous variance to the signal budget analyzed in the next section. A short sketch of the standard pattern, with the random draw confined to the weights: ```text for each layer L: W[L] ~ Distribution(scale) # random, breaks symmetry b[L] = 0 # constant is fine, keeps signal centered ``` The remaining question is not whether to use randomness but which distribution and, above all, what scale. The distribution itself matters little. A zero mean Gaussian and a zero mean uniform of matched variance behave almost identically through the propagation analysis, because that analysis depends on the weights only through their second moment. The scale, by contrast, matters enormously, as the next section shows. ## 3. Scale and Signal Propagation ### 3.1 Variance Through One Layer The scale of the random weights controls how the variance of activations changes from layer to layer. Treat the inputs and weights as random variables. 
Assume the components $x_j$ are independent with mean zero and common variance $\operatorname{Var}(x)$, and the weights $W_{ij}$ are independent, mean zero, with variance $\sigma^2$, drawn independently of the inputs. For a single preactivation, using that the variance of a sum of independent terms is the sum of the variances and that for independent zero mean factors $\operatorname{Var}(W_{ij} x_j) = \operatorname{Var}(W_{ij})\operatorname{Var}(x_j)$, $$ \operatorname{Var}(z_i) = \sum_{j=1}^{d} \operatorname{Var}(W_{ij} x_j) = d \, \sigma^2 \operatorname{Var}(x). $$ The factor $d$, the fan in of the layer, is the crux. If $\sigma^2$ is held fixed as the network is made wider or deeper, the variance is multiplied by $d$ at every layer and the scale of activations explodes or, for $d\sigma^2 < 1$, collapses. The bias, when nonzero, contributes an additive $\operatorname{Var}(b_i)$ to this expression, which is a further reason to keep it at zero so that the weight scale alone governs the budget. ### 3.2 Compounding Through Depth The single layer relation compounds. Let the network have $L$ layers with (in a linearized regime) approximately variance preserving or variance scaling behavior governed by a per layer gain $g = d \sigma^2$. Ignoring the nonlinearity for intuition, the activation variance after $L$ layers scales as $$ \operatorname{Var}(z^{(L)}) \approx g^{L} \, \operatorname{Var}(x). $$ This is a geometric progression in depth, and the geometric nature is what makes initialization unforgiving. A multiplicative error per layer of even a few percent is harmless in a shallow network but catastrophic in a deep one, because it is raised to the power $L$. Three regimes follow: - If $g > 1$, activations grow exponentially with depth and the forward signal explodes. - If $g < 1$, activations shrink exponentially with depth and the forward signal vanishes. - If $g = 1$, the variance is preserved and signal magnitude stays stable across layers. 
The same geometric law governs the backward pass. Gradients propagate through the transposes of the weight matrices, so the gradient variance scales as a product of per layer gains involving the fan out. When that product departs from one, gradients explode or vanish on the backward pass, which is the direct cause of stalled or unstable training in deep stacks. The forward and backward conditions are distinct, and a single weight scale cannot satisfy both unless fan in equals fan out, which motivates the compromise in the next subsection. The following diagram summarizes the three regimes and the single knob that selects among them. ```{mermaid} flowchart TD A["Per layer gain g equals d times sigma squared"] --> B{"Value of g"} B -->|"g greater than 1"| C["Forward signal explodes, risk of overflow"] B -->|"g less than 1"| D["Forward signal vanishes, deep layers carry no information"] B -->|"g equals 1"| E["Variance preserved, stable training"] E --> F["Choose sigma squared to set g near one for the given nonlinearity"] ``` ### 3.3 The Variance Preserving Prescription The design goal that emerges is to choose $\sigma^2$ so that the per layer gain is close to one in both directions. For a linear network or an odd nonlinearity that behaves linearly near zero, forward stability requires $$ \sigma^2 = \frac{1}{d_{\text{in}}}, $$ so that $d_{\text{in}} \sigma^2 = 1$. Backward stability by the same argument favors $\sigma^2 = 1/d_{\text{out}}$. Since both cannot hold exactly unless fan in equals fan out, a common compromise takes the harmonic style average of the two targets, $$ \sigma^2 = \frac{2}{d_{\text{in}} + d_{\text{out}}}. $$ This is the Xavier, or Glorot, prescription, derived precisely from the requirement that activation and gradient variance be approximately preserved across layers [1]. 
It can be realized with either a Gaussian of this variance or, equivalently in second moment, a uniform distribution on $\pm\sqrt{6/(d_{\text{in}} + d_{\text{out}})}$, since a uniform variable on $\pm a$ has variance $a^2/3$.

For rectified linear units, which discard the negative half of their input and thereby halve its second moment, the correction is to double the gain, giving

$$
\sigma^2 = \frac{2}{d_{\text{in}}},
$$

the He, or Kaiming, initialization [2]. The factor of two compensates exactly for the second moment lost when $\operatorname{ReLU}$ zeroes out half of a symmetric distribution. The exactness is worth deriving once. If $z$ is symmetric about zero, then $\operatorname{ReLU}(z) = \max(0, z)$ satisfies

$$
\mathbb{E}\!\left[\operatorname{ReLU}(z)^2\right] = \mathbb{E}\!\left[z^2 \,\mathbf{1}\{z > 0\}\right] = \tfrac{1}{2}\,\mathbb{E}[z^2] = \tfrac{1}{2}\operatorname{Var}(z),
$$

because by symmetry the event $z > 0$ contributes exactly half of the second moment. The activation therefore carries half the second moment of its preactivation. Because the next layer's weights have zero mean, it is this second moment, not the centered variance, that enters the next preactivation variance, so to restore unit gain across the linear map plus the nonlinearity we must double $\sigma^2$ relative to the linear case.

A compact statement of the rule:

```text
# fan_in = number of inputs to the layer
std = sqrt(2.0 / fan_in)   # He, for ReLU
W ~ Normal(mean=0, std=std)
```

More general nonlinearities are handled by a gain correction $\sigma^2 = c / d_{\text{in}}$, where $c$ compensates for the variance the activation removes. For ReLU $c = 2$, for the identity or odd functions that are linear near zero $c = 1$, and for variants such as leaky ReLU with negative slope $\alpha$ the correction is $c = 2/(1 + \alpha^2)$, which recovers $c = 2$ at $\alpha = 0$ and approaches $c = 1$ as $\alpha \to 1$. The principle is uniform: estimate how much variance the nonlinearity passes through, then choose the weight scale to cancel it.

## 4. Naive Initialization Failures

It is instructive to catalog the failures that occur when these principles are ignored, since each maps to a recognizable training pathology.

### 4.1 All Zeros, or Any Constant

As established in section 2, constant initialization never breaks symmetry. The network trains as if it had one unit per layer and cannot represent functions that require diverse features. The visible symptom is a loss that decreases far less than expected and a model whose effective capacity is a tiny fraction of its parameter count. A telltale diagnostic is that the hidden activations within a layer remain numerically identical across units after several steps.

### 4.2 Weights Too Large

Choosing $\sigma$ large, for example sampling from a standard normal without any fan in correction, makes the per layer gain $g = d \sigma^2 \gg 1$. In a linear or rectified stack, activations grow geometrically with depth. With saturating nonlinearities such as $\tanh$ or the logistic sigmoid, the outputs stay bounded, but the large preactivations push units into the flat tails where the derivative is nearly zero,

$$
\tanh'(z) = 1 - \tanh^2(z) \to 0 \quad \text{as } |z| \to \infty,
$$

so backpropagated gradients are throttled to nearly nothing. The preactivations are far too large while the backward signal vanishes, and the network freezes with saturated units. With unbounded nonlinearities the activations and the loss can overflow to infinity on the first forward pass, surfacing as an `inf` or `NaN` loss before any learning occurs.

### 4.3 Weights Too Small

Choosing $\sigma$ very small makes $g \ll 1$. Activations contract geometrically toward zero as depth increases, and by the deepest layers the signal carries almost no information about the input. On the backward pass the gradients likewise shrink to insignificance, so the early layers receive essentially no learning signal. Training appears to stall: the loss plateaus near its initial value and the deep layers remain close to their starting point.
This is one concrete mechanism of the vanishing gradient problem, traceable directly to scale rather than to any defect of the data [3].

### 4.4 Ignoring the Nonlinearity

Even a fan in scaling can fail if it is matched to the wrong nonlinearity. Applying the linear case variance $1/d_{\text{in}}$, which is what the Xavier rule reduces to when fan in equals fan out, to a deep $\operatorname{ReLU}$ network undershoots, because $\operatorname{ReLU}$ removes half of the second moment at every layer. With gain $g = \tfrac{1}{2}$ per layer instead of one, the activation variance after $L$ layers is suppressed by a factor $2^{-L}$, so a fifty layer stack loses roughly fifteen orders of magnitude of activation variance, since $2^{-50} \approx 9 \times 10^{-16}$. This is why the He correction was needed to train the first very deep rectified networks [2]. The converse mistake, applying the He variance $2/d_{\text{in}}$ to a $\tanh$ network, overshoots and pushes units toward saturation.

### 4.5 A Numerical Illustration

Consider a network of $L = 50$ layers, each with fan in $d = 256$, and weights drawn with $\sigma = 0.1$. The per layer gain is $g = d\sigma^2 = 256 \times 0.01 = 2.56$. After fifty layers the activation variance is multiplied by

$$
g^{L} = 2.56^{50} \approx 2.6 \times 10^{20},
$$

an explosion far beyond the range of half precision arithmetic and large enough to destabilize any loss computed from it. Halving the standard deviation to $\sigma = 0.05$ gives $g = 256 \times 0.0025 = 0.64$ and a contraction by $0.64^{50} \approx 2.0 \times 10^{-10}$, an equally fatal collapse in the opposite direction. Only the narrow choice that places $g$ near one, here $\sigma = 1/\sqrt{256} = 0.0625$ for a linear stack, keeps the signal alive across all fifty layers.

The two failing values of $\sigma$ differ by a mere factor of two, yet their outcomes differ by thirty orders of magnitude, which shows how sharply the result depends on a single scale parameter once depth is large. The table below collects the three cases for the same depth and width.


| Standard deviation $\sigma$ | Per layer gain $g = d\sigma^2$ | Variance factor $g^{50}$ | Outcome |
|---|---|---|---|
| $0.05$ | $0.64$ | $\approx 10^{-10}$ | signal vanishes |
| $0.0625$ | $1.00$ | $1$ | signal preserved |
| $0.1$ | $2.56$ | $\approx 10^{20}$ | signal explodes |

## 5. When to Use What, and Pitfalls

A short practical perspective complements the theory.

- Use He initialization with $\sigma^2 = 2/d_{\text{in}}$ for networks built on ReLU and its close relatives, which is the default for most modern convolutional and feedforward architectures.
- Use Xavier initialization with $\sigma^2 = 2/(d_{\text{in}} + d_{\text{out}})$ for $\tanh$ and other odd, roughly linear near zero activations, and as a reasonable default when the activation is unknown.
- Match the gain to the activation deliberately rather than by habit. The single most common scaling mistake is reusing a default tuned for a different nonlinearity, as in section 4.4.
- Keep biases at zero unless a specific reason argues otherwise, so that the weight scale alone controls the variance budget.
- Remember that normalization layers and residual connections relax these requirements but do not remove them. With a residual block $x \mapsto x + F(x)$, initializing the last layer of the branch $F$ to small or zero weights makes each block start near the identity, which keeps the gain near one through arbitrary depth and is a widely used stabilizer.
- When a deep network refuses to learn, inspect the per layer activation and gradient statistics before changing the data or the optimizer. A monotone decay or growth of these statistics across layers is the fingerprint of a scale mismatch and points directly back to the initialization.

Mature open source frameworks implement these schemes directly.
In PyTorch the functions `torch.nn.init.kaiming_normal_` and `torch.nn.init.xavier_uniform_` apply the He and Xavier rules with the appropriate gain, and in JAX the `jax.nn.initializers` module, which is also re-exported as `flax.linen.initializers`, exposes `he_normal`, `glorot_uniform`, and related initializers. Using these rather than hand rolled draws avoids off by a factor of two errors in the fan computation, which are easy to make and silent in their effects.

## 6. Summary

Initialization is the seed from which all subsequent learning grows. It must contain randomness so that units differentiate, since any constant initialization leaves the network trapped in a permutation symmetric subspace that gradient descent cannot leave. It must be scaled so that the per layer variance gain stays near one, since the gain compounds geometrically with depth and any departure from unity drives the forward and backward signals to explode or vanish. The Xavier and He prescriptions encode exactly this variance preserving requirement, each tuned to its nonlinearity through a gain factor that cancels the variance the activation removes. The naive alternatives, whether constant, too large, too small, or mismatched to the activation function, each map to a distinct and predictable training failure. A principled random initialization does not by itself guarantee success, but it removes the failures that would otherwise make success impossible.

## References

1. Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010. https://proceedings.mlr.press/v9/glorot10a.html
2. He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. https://doi.org/10.1109/ICCV.2015.123
3. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 8: Optimization for Training Deep Models. MIT Press, 2016. https://www.deeplearningbook.org/
4. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. ICML 2013. https://proceedings.mlr.press/v28/sutskever13.html
