205  Weight Initialization: Xavier and He

Training a deep network begins before the first gradient step. The initial values of the weights set the statistical regime in which signals propagate forward and gradients propagate backward. Choose them poorly and activations either collapse toward zero or saturate at the extremes of the nonlinearity, and the network either learns nothing or learns catastrophically slowly. Xavier (Glorot) initialization and He (Kaiming) initialization are the two canonical schemes that fix this problem by reasoning about variance. This chapter derives both from first principles, explains the roles of fan-in and fan-out, and distills the practical defaults that practitioners reach for today.

205.1 1. The Problem of Signal Propagation

Consider a feedforward network with layers indexed by \(l\). Each layer computes a pre-activation \(\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\) followed by an elementwise nonlinearity \(\mathbf{a}^{(l)} = \phi(\mathbf{z}^{(l)})\). The weight matrix \(W^{(l)}\) has shape \(n_l \times n_{l-1}\), where \(n_{l-1}\) is the number of inputs to the layer (the fan-in) and \(n_l\) is the number of outputs (the fan-out).

Suppose we initialize every weight independently from a distribution with mean zero and variance \(\sigma_W^2\). As a signal passes through many layers, the variance of the activations either grows or shrinks geometrically unless the per-layer scaling is chosen with care. If the variance grows, activations explode and saturating nonlinearities clip; if it shrinks, activations vanish and so do the gradients that depend on them. The goal of principled initialization is to keep the variance of activations roughly constant across the forward pass, and the variance of gradients roughly constant across the backward pass.

The qualitative target is simple to state. Let \(\text{Var}(a^{(l)})\) denote the variance of a typical activation in layer \(l\). We want

\[ \text{Var}(a^{(l)}) \approx \text{Var}(a^{(l-1)}) \quad \text{for all } l. \]

An analogous condition is imposed on the backpropagated error signal: its variance should also stay roughly constant from layer to layer. Achieving both simultaneously turns out to constrain the weight variance in terms of both fan-in and fan-out.
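To make the geometric growth or decay concrete, the following sketch assumes that every layer multiplies the activation variance by the same constant factor. The function name `variance_by_depth` and the chosen factors are illustrative only, not part of any library.

```python
def variance_by_depth(per_layer_factor, depth, initial_variance=1.0):
    """Return the variance after each of `depth` layers, assuming every layer
    multiplies the variance by the same constant `per_layer_factor`."""
    variances = []
    variance = initial_variance
    for _ in range(depth):
        variance *= per_layer_factor
        variances.append(variance)
    return variances


# A factor below one shrinks the signal, a factor above one inflates it,
# and only a factor of exactly one keeps it stable over 30 layers.
for factor in (0.5, 1.0, 2.0):
    final = variance_by_depth(factor, depth=30)[-1]
    print(f"factor={factor}: variance after 30 layers = {final:.3e}")
```

Even a modest per-layer mismatch compounds quickly, which is why the target is a factor as close to one as the architecture allows.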

205.2 2. Variance of a Linear Combination

The derivations rest on one elementary fact. Let \(X_1, \dots, X_n\) be independent random variables, each with mean zero and variance \(\sigma_X^2\), and let \(W_1, \dots, W_n\) be independent weights, also mean zero with variance \(\sigma_W^2\), independent of the \(X_i\). Consider the sum

\[ Z = \sum_{i=1}^{n} W_i X_i. \]

Because all terms have mean zero and are mutually independent,

\[ \text{Var}(Z) = \sum_{i=1}^{n} \text{Var}(W_i X_i) = \sum_{i=1}^{n} \text{Var}(W_i)\,\text{Var}(X_i) = n\, \sigma_W^2\, \sigma_X^2. \]

The middle step uses the identity that for independent zero-mean variables \(\text{Var}(WX) = \text{Var}(W)\text{Var}(X)\), which follows from \(\mathbb{E}[WX] = 0\) and \(\mathbb{E}[(WX)^2] = \mathbb{E}[W^2]\mathbb{E}[X^2]\).

This single equation, \(\text{Var}(Z) = n\, \sigma_W^2\, \sigma_X^2\), is the engine behind both Xavier and He. The factor \(n\) is the fan-in, and it is precisely the source of the variance blowup that we must counteract by choosing \(\sigma_W^2\) inversely proportional to \(n\).
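The identity can be checked numerically. The sketch below is a Monte Carlo estimate under the assumption that both the weights and the inputs are Gaussian (the identity itself only needs independence, zero mean, and finite variance); the sizes, standard deviations, and helper name are arbitrary choices for illustration.

```python
import random
import statistics


def sample_weighted_sum(n, sigma_w, sigma_x, rng):
    """Draw one Z = sum_i W_i * X_i with independent zero-mean Gaussian terms."""
    return sum(rng.gauss(0.0, sigma_w) * rng.gauss(0.0, sigma_x) for _ in range(n))


rng = random.Random(0)                      # fixed seed so the run is repeatable
n, sigma_w, sigma_x = 20, 0.5, 2.0          # illustrative fan-in and scales
samples = [sample_weighted_sum(n, sigma_w, sigma_x, rng) for _ in range(20000)]

empirical = statistics.pvariance(samples)   # sample estimate of Var(Z)
predicted = n * sigma_w**2 * sigma_x**2     # n * sigma_W^2 * sigma_X^2
print(f"empirical Var(Z) = {empirical:.2f}, predicted = {predicted:.2f}")
```

The two numbers should agree to within sampling noise, and rerunning with a larger `n` shows the estimate growing linearly with the fan-in.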

205.3 3. Xavier (Glorot) Initialization for tanh

Glorot and Bengio analyzed networks with symmetric, zero-centered activations such as \(\tanh\). Near the origin, \(\tanh(z) \approx z\), so to a first approximation the nonlinearity acts like the identity. Under this linear regime we treat \(\phi\) as having unit slope, which lets the variance recursion of Section 2 apply directly to activations.

205.3.1 3.1 The Forward Condition

Assume the inputs \(a^{(l-1)}\) to layer \(l\) are zero-mean with variance \(\text{Var}(a^{(l-1)})\), and that weights are drawn independently with variance \(\sigma_W^2\). Under the linear approximation \(a^{(l)} \approx z^{(l)}\), the result of Section 2 gives

\[ \text{Var}(a^{(l)}) = n_{l-1}\, \sigma_W^2\, \text{Var}(a^{(l-1)}). \]

To preserve variance forward, \(\text{Var}(a^{(l)}) = \text{Var}(a^{(l-1)})\), we need the multiplicative factor to equal one:

\[ n_{l-1}\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \sigma_W^2 = \frac{1}{n_{l-1}} = \frac{1}{n_{\text{in}}}. \]

205.3.2 3.2 The Backward Condition

Backpropagation sends an error signal \(\boldsymbol{\delta}^{(l)} = \partial \mathcal{L} / \partial \mathbf{z}^{(l)}\) through the transpose of the weight matrix. In the linear regime the gradient with respect to the previous layer’s pre-activation satisfies \(\boldsymbol{\delta}^{(l-1)} = (W^{(l)})^{\top} \boldsymbol{\delta}^{(l)}\). The dimension of the sum here is \(n_l\), the fan-out, because each input neuron receives error contributions from all \(n_l\) output neurons. Applying the same variance bookkeeping,

\[ \text{Var}(\delta^{(l-1)}) = n_l\, \sigma_W^2\, \text{Var}(\delta^{(l)}). \]

Preserving gradient variance backward requires

\[ n_l\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \sigma_W^2 = \frac{1}{n_l} = \frac{1}{n_{\text{out}}}. \]

205.3.3 3.3 The Compromise

The forward condition asks for \(\sigma_W^2 = 1/n_{\text{in}}\) and the backward condition asks for \(\sigma_W^2 = 1/n_{\text{out}}\). Unless the layer is square these are incompatible, so Glorot and Bengio proposed a compromise that averages the two denominators, which amounts to taking the harmonic mean of the two target variances:

\[ \boxed{\;\sigma_W^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}\;} \]

This is Xavier initialization. It does not satisfy either condition exactly, but it keeps both forward activations and backward gradients within a stable band, which is what matters in practice.
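A small calculation shows how far the compromise sits from each condition. The sketch below uses a hypothetical non-square layer with 300 inputs and 100 outputs and evaluates the forward factor \(n_{\text{in}}\sigma_W^2\) and the backward factor \(n_{\text{out}}\sigma_W^2\); the helper name is illustrative.

```python
def xavier_variance(fan_in, fan_out):
    """Glorot compromise: sigma_W^2 = 2 / (fan_in + fan_out)."""
    return 2.0 / (fan_in + fan_out)


fan_in, fan_out = 300, 100                  # illustrative non-square layer
sigma_w2 = xavier_variance(fan_in, fan_out)

forward_factor = fan_in * sigma_w2          # would be 1 under the forward condition
backward_factor = fan_out * sigma_w2        # would be 1 under the backward condition
print(f"forward factor = {forward_factor:.2f}, backward factor = {backward_factor:.2f}")
```

By construction the two factors always sum to two, so they sit on opposite sides of one and both equal one only when the layer is square.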

Two concrete distributions realize this variance. A zero-mean Gaussian uses standard deviation \(\sigma_W = \sqrt{2 / (n_{\text{in}} + n_{\text{out}})}\). A uniform distribution \(U(-r, r)\) has variance \(r^2 / 3\), so matching \(r^2/3 = 2/(n_{\text{in}}+n_{\text{out}})\) gives the familiar bound

\[ r = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \qquad W \sim U\!\left(-r,\, r\right). \]

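# Xavier (Glorot) variance for one layer (the 256 -> 128 layer is illustrative)
import math
import torch.nn as nn
layer = nn.Linear(256, 128)
fan_in, fan_out = layer.in_features, layer.out_features
std = math.sqrt(2.0 / (fan_in + fan_out))     # normal variant
bound = math.sqrt(6.0 / (fan_in + fan_out))   # uniform variant: U(-bound, bound)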

205.4 4. He (Kaiming) Initialization for ReLU

The linear approximation that justified Xavier breaks down for the rectified linear unit \(\phi(z) = \max(0, z)\). ReLU is not symmetric about zero, and it discards roughly half of its inputs by mapping all negative pre-activations to exactly zero. This halving of the active signal must be accounted for, and doing so changes the constant in the variance formula.

205.4.1 4.1 The Effect of the Rectifier

Suppose \(z^{(l)}\) is symmetric about zero, which holds when the weights are symmetric and the bias is zero. Then \(z^{(l)}\) is positive half the time and negative half the time. For the positive half ReLU acts as the identity, and for the negative half it outputs zero. We compute the second moment of the activation \(a = \max(0, z)\):

\[ \mathbb{E}[a^2] = \mathbb{E}[\max(0, z)^2] = \int_{0}^{\infty} z^2 p(z)\, dz = \tfrac{1}{2} \int_{-\infty}^{\infty} z^2 p(z)\, dz = \tfrac{1}{2}\,\mathbb{E}[z^2]. \]

The middle equality uses the symmetry of \(p(z)\): the integral over the positive half-line is exactly half the integral over the whole line. Since \(z\) has mean zero, \(\mathbb{E}[z^2] = \text{Var}(z)\), so

\[ \mathbb{E}[a^2] = \tfrac{1}{2}\,\text{Var}(z). \]

This factor of one half is the crux. ReLU passes only half the variance of its input, so to keep variance constant across layers we must inject a compensating factor of two into the weight variance.
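The factor of one half is easy to confirm by sampling. The sketch below assumes a Gaussian pre-activation, which is one convenient symmetric distribution rather than a requirement of the argument, and uses an arbitrary standard deviation.

```python
import random

rng = random.Random(0)                  # fixed seed for repeatability
sigma_z = 1.5                           # illustrative pre-activation std
z = [rng.gauss(0.0, sigma_z) for _ in range(200000)]

# Second moment of a = max(0, z), estimated from the samples.
relu_second_moment = sum(max(0.0, v) ** 2 for v in z) / len(z)
half_variance = 0.5 * sigma_z**2        # value predicted by the derivation
print(f"E[a^2] ~ {relu_second_moment:.3f}, Var(z)/2 = {half_variance:.3f}")
```

The estimate should land close to \(\tfrac{1}{2}\text{Var}(z)\), with the gap shrinking as the sample count grows.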

205.4.2 4.2 The Forward Derivation

Let \(n_{\text{in}} = n_{l-1}\) be the fan-in. The pre-activation variance is \(\text{Var}(z^{(l)}) = n_{l-1}\, \sigma_W^2\, \mathbb{E}[(a^{(l-1)})^2]\), where we use the second moment of the previous activation because ReLU outputs are not zero-mean. Substituting the rectifier relation \(\mathbb{E}[(a^{(l-1)})^2] = \tfrac{1}{2}\text{Var}(z^{(l-1)})\) gives

\[ \text{Var}(z^{(l)}) = n_{l-1}\, \sigma_W^2 \cdot \tfrac{1}{2}\, \text{Var}(z^{(l-1)}). \]

Demanding \(\text{Var}(z^{(l)}) = \text{Var}(z^{(l-1)})\) forces the multiplicative factor \(\tfrac{1}{2}\, n_{l-1}\, \sigma_W^2\) to equal one:

\[ \tfrac{1}{2}\, n_{l-1}\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \boxed{\;\sigma_W^2 = \frac{2}{n_{\text{in}}}\;} \]

This is He initialization. The numerator is \(2\) rather than \(1\), and that doubling is the direct mathematical consequence of ReLU killing half the signal. He and colleagues showed that this scaling allows very deep rectifier networks, including the thirty-layer models that motivated their work, to converge where Xavier initialization stalls.

For a Gaussian the standard deviation is \(\sigma_W = \sqrt{2/n_{\text{in}}}\), and for a uniform variant the bound is \(r = \sqrt{6/n_{\text{in}}}\).

# He (Kaiming) variance for ReLU (the 256 -> 128 layer is illustrative)
import math
import torch.nn as nn
layer = nn.Linear(256, 128)
fan_in = layer.in_features
std = math.sqrt(2.0 / fan_in)          # normal variant
bound = math.sqrt(6.0 / fan_in)        # uniform variant: U(-bound, bound)
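The effect of the extra factor of two can be seen in a small simulation. The sketch below pushes one random input through a stack of bias-free ReLU layers of equal width, once with \(\sigma_W^2 = 2/n\) and once with \(\sigma_W^2 = 1/n\) (which is what Xavier gives for a square layer). The width, depth, and function name are illustrative, and both runs deliberately share a seed so that they use the same underlying random draws.

```python
import random


def final_preactivation_power(width, depth, weight_variance, seed):
    """Push one random input through `depth` bias-free ReLU layers of equal
    width and return the mean squared pre-activation of the last layer."""
    rng = random.Random(seed)
    std = weight_variance ** 0.5
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a fresh Gaussian-weighted sum of the inputs.
        pre = [
            sum(rng.gauss(0.0, std) * a for a in activations)
            for _ in range(width)
        ]
        activations = [max(0.0, z) for z in pre]
    return sum(z * z for z in pre) / width


width, depth = 128, 20
he_power = final_preactivation_power(width, depth, 2.0 / width, seed=0)
xavier_power = final_preactivation_power(width, depth, 1.0 / width, seed=0)
print(f"sigma_W^2 = 2/n: {he_power:.3e}")
print(f"sigma_W^2 = 1/n: {xavier_power:.3e}")
```

With the \(2/n\) scaling the signal power stays within a modest factor of its starting value, while the \(1/n\) scaling loses roughly a factor of two per layer. This is only a forward-pass illustration on random weights, not a statement about training dynamics.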

205.4.3 4.3 Fan-In, Fan-Out, and Leaky Variants

He initialization can be anchored to either fan-in or fan-out. The fan-in mode preserves the variance of activations in the forward pass, while the fan-out mode preserves the variance of gradients in the backward pass. He and colleagues argue that either choice is sufficient: when one direction is preserved exactly, the other is rescaled only by a product of per-layer width ratios, which telescopes to a ratio of widths at the two ends of the network and therefore does not compound with depth in common architectures. Fan-in is the conventional default.

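For the leaky ReLU \(\phi(z) = \max(\alpha z, z)\) with a small slope \(\alpha \in [0, 1]\) on the negative side, the negative half is no longer discarded entirely. Repeating the second-moment calculation, the negative half now contributes \(\alpha^2\) times what the positive half does, so \(\mathbb{E}[a^2] = \tfrac{1}{2}(1 + \alpha^2)\,\text{Var}(z)\), and the variance-preserving choice becomes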

\[ \sigma_W^2 = \frac{2}{(1 + \alpha^2)\, n_{\text{in}}}. \]

When \(\alpha = 0\) this recovers standard He, and when \(\alpha = 1\) the unit is linear and the formula reduces to the Xavier-style \(1/n_{\text{in}}\). This generalization is exactly the gain parameter that deep learning libraries expose.
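The two limiting cases can be verified directly. The sketch below evaluates the leaky-ReLU formula for a few slopes; the function name and the fan-in of 512 are illustrative.

```python
def leaky_relu_weight_variance(fan_in, negative_slope):
    """sigma_W^2 = 2 / ((1 + alpha^2) * fan_in) for a leaky ReLU with slope alpha."""
    return 2.0 / ((1.0 + negative_slope**2) * fan_in)


fan_in = 512
for alpha in (0.0, 0.01, 0.2, 1.0):
    variance = leaky_relu_weight_variance(fan_in, alpha)
    print(f"alpha={alpha}: sigma_W^2 = {variance:.6f}")
```

For typical small slopes such as 0.01 the result is nearly indistinguishable from plain He initialization.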

205.5 5. Gain Factors and a Unified View

Both schemes can be written through a single template. Let \(g\) be a nonlinearity-dependent gain. Then

\[ \sigma_W^2 = \frac{g^2}{n}, \qquad g = \begin{cases} 1 & \text{linear, } \tanh \text{ (approx.)} \\[2pt] \sqrt{2} & \text{ReLU} \\[2pt] \sqrt{2/(1+\alpha^2)} & \text{leaky ReLU} \end{cases} \]

with \(n\) being fan-in, fan-out, or their average depending on whether we target the forward pass, the backward pass, or the Glorot compromise. In practice the recommended gain for \(\tanh\) is noticeably above one, with \(5/3\) a common choice (it is the value returned by calculate_gain('tanh') in PyTorch), because the curvature of \(\tanh\) away from the origin reduces its effective slope below unity; the gain compensates. Viewing initialization through the gain abstraction makes it clear that Xavier and He are not different philosophies but the same variance-preservation principle evaluated for different activation functions.
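The template translates into a few lines of code. The sketch below is a standalone approximation of the idea, not a copy of any library routine; the names `gain_for` and `init_std` and the mode strings are assumptions made for this example.

```python
import math


def gain_for(nonlinearity, negative_slope=0.0):
    """Gain values following the template above (tanh uses the 5/3 value)."""
    if nonlinearity == "linear":
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + negative_slope**2))
    raise ValueError(f"unknown nonlinearity: {nonlinearity}")


def init_std(gain, fan_in, fan_out, mode="fan_in"):
    """Standard deviation sqrt(g^2 / n), with n selected by `mode`."""
    n = {"fan_in": fan_in, "fan_out": fan_out, "average": (fan_in + fan_out) / 2}[mode]
    return gain / math.sqrt(n)


# He initialization: ReLU gain with fan-in.
he_std = init_std(gain_for("relu"), fan_in=256, fan_out=128)
# Xavier initialization: unit gain with the fan-in/fan-out average.
xavier_std = init_std(gain_for("linear"), fan_in=256, fan_out=128, mode="average")
print(f"He std = {he_std:.4f}, Xavier std = {xavier_std:.4f}")
```

Squaring the two results recovers \(2/n_{\text{in}}\) and \(2/(n_{\text{in}} + n_{\text{out}})\) respectively, which is the sense in which both schemes are instances of one rule.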

205.6 6. Practical Defaults

The following defaults reflect what works reliably in modern practice.

  • For networks built on ReLU and its close relatives such as leaky ReLU, ELU, or GELU, use He initialization with fan-in mode. This is the standard for convolutional and most feedforward architectures. In code this is kaiming_normal_ or kaiming_uniform_ with nonlinearity='relu'.
  • For networks with \(\tanh\), sigmoid, or other saturating activations, use Xavier initialization with the appropriate gain. This remains common in recurrent networks and in attention components that use \(\tanh\) gating.
  • Biases are almost always initialized to zero. A nonzero bias breaks the symmetry assumption used in the ReLU derivation, and there is rarely any benefit. One historical exception is initializing forget-gate biases of an LSTM to a small positive value.
  • Initialization variance is computed per layer from that layer’s own fan-in and fan-out, not from a global constant, because the whole point is to neutralize the layer-specific factor \(n\).
  • For convolutional layers the fan-in is the number of input channels times the spatial kernel size, \(C_{\text{in}} \cdot k_h \cdot k_w\), and the fan-out is \(C_{\text{out}} \cdot k_h \cdot k_w\). The same formulas apply once \(n\) is computed this way.

# PyTorch defaults (layer shapes are illustrative)
import torch.nn as nn
conv, linear = nn.Conv2d(3, 16, kernel_size=3), nn.Linear(64, 32)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
nn.init.xavier_uniform_(linear.weight, gain=nn.init.calculate_gain('tanh'))
nn.init.zeros_(conv.bias)
nn.init.zeros_(linear.bias)
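The convolutional fan computation from the list above can be written out explicitly. The sketch below assumes the weight tensor is laid out as \((C_{\text{out}}, C_{\text{in}}, k_h, k_w)\), as in PyTorch's Conv2d, and that the convolution is not grouped; the helper name `conv_fans` is hypothetical.

```python
import math


def conv_fans(weight_shape):
    """Return (fan_in, fan_out) for a conv weight shaped
    (C_out, C_in, k_1, ..., k_d), e.g. (C_out, C_in, k_h, k_w) in 2-D."""
    c_out, c_in, *kernel = weight_shape
    receptive_field = math.prod(kernel)      # k_h * k_w for a 2-D kernel
    return c_in * receptive_field, c_out * receptive_field


# 64 filters over 3 input channels with a 7x7 kernel.
fan_in, fan_out = conv_fans((64, 3, 7, 7))
he_std = math.sqrt(2.0 / fan_in)             # He normal std in fan-in mode
print(f"fan_in={fan_in}, fan_out={fan_out}, He std={he_std:.4f}")
```

Here the fan-in is 147 and the fan-out is 3136, so the fan-in and fan-out modes give noticeably different scales for the first layer of an image network.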

A final caveat is that normalization layers and residual connections have shifted some of the burden that initialization once carried alone. Batch normalization, layer normalization, and carefully scaled residual branches all stabilize variance during training, which makes networks more forgiving of imperfect initialization. Even so, the right starting variance still accelerates early training and remains essential in architectures without normalization, in very deep stacks, and whenever training must be reproducible and robust. The variance argument that produced Xavier and He continues to inform newer schemes, including the residual-aware and transformer-specific initializers that scale weights by depth.

205.7 7. Summary

The unifying idea is that initialization should preserve variance, both forward through the activations and backward through the gradients. The variance of a sum of \(n\) independent weighted signals scales as \(n\,\sigma_W^2\), so the weight variance must scale as \(1/n\). Xavier initialization applies this to symmetric activations and compromises between fan-in and fan-out with \(\sigma_W^2 = 2/(n_{\text{in}} + n_{\text{out}})\). He initialization adds a factor of two, \(\sigma_W^2 = 2/n_{\text{in}}\), to compensate for the half of the signal that ReLU discards. Picking the scheme that matches the nonlinearity is one of the cheapest and highest-leverage decisions in building a trainable deep network.

205.8 References

  1. Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS). https://proceedings.mlr.press/v9/glorot10a.html
  2. He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV). https://arxiv.org/abs/1502.01852
  3. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, Chapter 8: Optimization for Training Deep Models. MIT Press. https://www.deeplearningbook.org/
  4. Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6120
  5. PyTorch Documentation. torch.nn.init. https://pytorch.org/docs/stable/nn.init.html
  6. Zhang, H., Dauphin, Y. N., and Ma, T. (2019). Fixup Initialization: Residual Learning Without Normalization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1901.09321