205 Weight Initialization: Xavier and He

Training a deep network begins before the first gradient step. The initial values of the weights set the statistical regime in which signals propagate forward and gradients propagate backward. Choose them poorly and activations either collapse toward zero or saturate at the extremes of the nonlinearity, and the network either learns nothing or learns catastrophically slowly. Xavier (Glorot) initialization and He (Kaiming) initialization are the two canonical schemes that fix this problem by reasoning about variance. This chapter derives both from first principles, explains the roles of fan-in and fan-out, and distills the practical defaults that practitioners reach for today.

205.1 1. The Problem of Signal Propagation

Consider a feedforward network with layers indexed by $l$. Each layer computes a pre-activation $\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ followed by an elementwise nonlinearity $\mathbf{a}^{(l)} = \phi(\mathbf{z}^{(l)})$. The weight matrix $W^{(l)}$ has shape $n_l \times n_{l-1}$, where $n_{l-1}$ is the number of inputs to the layer (the fan-in) and $n_l$ is the number of outputs (the fan-out).

Suppose we initialize every weight independently from a distribution with mean zero and variance $\sigma_W^2$. As a signal passes through many layers, the variance of the activations either grows or shrinks geometrically unless the per-layer scaling is chosen with care. If the variance grows, activations explode and saturating nonlinearities clip; if it shrinks, activations vanish and so do the gradients that depend on them. The goal of principled initialization is to keep the variance of activations roughly constant across the forward pass, and the variance of gradients roughly constant across the backward pass.

The qualitative target is simple to state. Let $\text{Var}(a^{(l)})$ denote the variance of a typical activation in layer $l$. We want

\[ \text{Var}(a^{(l)}) \approx \text{Var}(a^{(l-1)}) \quad \text{for all } l. \]

An analogous condition is imposed on the backpropagated error signal. Satisfying both at once constrains the weight variance in terms of both fan-in and fan-out.
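The geometric growth or decay described above can be observed directly in a small simulation. The following is a minimal sketch, assuming a stack of square, bias-free linear layers with Gaussian weights; the helper name `forward_variance` and the chosen width, depth, and batch size are illustrative assumptions rather than anything prescribed by the chapter.

```python
import random
import statistics


def forward_variance(width, depth, weight_var, batch=32, seed=0):
    """Push unit-variance inputs through `depth` linear layers and
    return the empirical activation variance after each layer."""
    rng = random.Random(seed)
    std = weight_var ** 0.5
    # A batch of input vectors with independent N(0, 1) entries.
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    variances = []
    for _ in range(depth):
        # Fresh weight matrix for each layer, entries N(0, weight_var).
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        acts = [[sum(w * a for w, a in zip(row, x)) for row in W] for x in acts]
        flat = [v for x in acts for v in x]
        variances.append(statistics.pvariance(flat))
    return variances


width, depth = 32, 10
final_variances = {}
for label, weight_var in [("too small", 0.5 / width),
                          ("balanced", 1.0 / width),
                          ("too large", 2.0 / width)]:
    final_variances[label] = forward_variance(width, depth, weight_var)[-1]
    print(f"{label:9s} sigma_W^2 = {weight_var:.4f} -> "
          f"Var after {depth} layers = {final_variances[label]:.3e}")
```

With these settings the per-layer multiplier is 0.5, 1, or 2, so after ten layers the variance should sit near $2^{-10}$, near one, or near $2^{10}$ respectively, up to sampling noise.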

205.2 2. Variance of a Linear Combination

The derivations rest on one elementary fact. Let $X_1, \dots, X_n$ be independent random variables, each with mean zero and variance $\sigma_X^2$, and let $W_1, \dots, W_n$ be independent weights, also mean zero with variance $\sigma_W^2$, independent of the $X_i$. Consider the sum

\[ Z = \sum_{i=1}^{n} W_i X_i. \]

Because all terms have mean zero and are mutually independent,

\[ \text{Var}(Z) = \sum_{i=1}^{n} \text{Var}(W_i X_i) = \sum_{i=1}^{n} \text{Var}(W_i)\,\text{Var}(X_i) = n\, \sigma_W^2\, \sigma_X^2. \]

The middle step uses the identity that for independent zero-mean variables $\text{Var}(WX) = \text{Var}(W)\text{Var}(X)$, which follows from $\mathbb{E}[WX] = 0$ and $\mathbb{E}[(WX)^2] = \mathbb{E}[W^2]\mathbb{E}[X^2]$.

This single equation, $\text{Var}(Z) = n\, \sigma_W^2\, \sigma_X^2$, is the engine behind both Xavier and He. The factor $n$ is the fan-in, and it is precisely the source of the variance blowup that we must counteract by choosing $\sigma_W^2$ inversely proportional to $n$.
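The identity is easy to check numerically. The sketch below draws many independent copies of $Z$ and compares the sample variance with $n\,\sigma_W^2\,\sigma_X^2$; the Gaussian distributions and the particular values of $n$, $\sigma_W$, and $\sigma_X$ are arbitrary choices made for illustration.

```python
import random
import statistics


def sample_z(n, sigma_w, sigma_x, rng):
    """One draw of Z = sum_i W_i * X_i with independent zero-mean factors."""
    return sum(rng.gauss(0.0, sigma_w) * rng.gauss(0.0, sigma_x) for _ in range(n))


rng = random.Random(0)
n, sigma_w, sigma_x = 32, 0.2, 1.5

samples = [sample_z(n, sigma_w, sigma_x, rng) for _ in range(8000)]
empirical = statistics.pvariance(samples)
predicted = n * sigma_w ** 2 * sigma_x ** 2

print(f"empirical Var(Z) = {empirical:.3f}")
print(f"predicted n * sigma_W^2 * sigma_X^2 = {predicted:.3f}")
```

The two numbers should agree to within a few percent; the residual gap is Monte Carlo error and shrinks as the number of draws grows.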

205.3 3. Xavier (Glorot) Initialization for tanh

Glorot and Bengio analyzed networks with symmetric, zero-centered activations such as $\tanh$. Near the origin, $\tanh(z) \approx z$, so to a first approximation the nonlinearity acts like the identity. Under this linear regime we treat $\phi$ as having unit slope, which lets the variance recursion of Section 2 apply directly to activations.

205.3.1 3.1 The Forward Condition

Assume the inputs $a^{(l-1)}$ to layer $l$ are zero-mean with variance $\text{Var}(a^{(l-1)})$, and that weights are drawn independently with variance $\sigma_W^2$. Under the linear approximation $a^{(l)} \approx z^{(l)}$, the result of Section 2 gives

\[ \text{Var}(a^{(l)}) = n_{l-1}\, \sigma_W^2\, \text{Var}(a^{(l-1)}). \]

To preserve variance forward, $\text{Var}(a^{(l)}) = \text{Var}(a^{(l-1)})$, we need the multiplicative factor to equal one:

\[ n_{l-1}\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \sigma_W^2 = \frac{1}{n_{l-1}} = \frac{1}{n_{\text{in}}}. \]

205.3.2 3.2 The Backward Condition

Backpropagation sends an error signal $\boldsymbol{\delta}^{(l)} = \partial \mathcal{L} / \partial \mathbf{z}^{(l)}$ through the transpose of the weight matrix. In the linear regime the gradient with respect to the previous layer’s pre-activation satisfies $\boldsymbol{\delta}^{(l-1)} = (W^{(l)})^{\top} \boldsymbol{\delta}^{(l)}$. The dimension of the sum here is $n_l$, the fan-out, because each input neuron receives error contributions from all $n_l$ output neurons. Applying the same variance bookkeeping,

\[ \text{Var}(\delta^{(l-1)}) = n_l\, \sigma_W^2\, \text{Var}(\delta^{(l)}). \]

Preserving gradient variance backward requires

\[ n_l\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \sigma_W^2 = \frac{1}{n_l} = \frac{1}{n_{\text{out}}}. \]

205.3.3 3.3 The Compromise

The forward condition asks for $\sigma_W^2 = 1/n_{\text{in}}$ and the backward condition asks for $\sigma_W^2 = 1/n_{\text{out}}$. Unless the layer is square these are incompatible, so Glorot and Bengio proposed a compromise that replaces the denominator with the average of fan-in and fan-out, which is the same as taking the harmonic mean of the two target variances:

\[ \boxed{\;\sigma_W^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}\;} \]

This is Xavier initialization. It does not satisfy either condition exactly, but it keeps both forward activations and backward gradients within a stable band, which is what matters in practice.
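A short calculation makes the "stable band" concrete. The sketch below evaluates the forward multiplier $n_{\text{in}}\sigma_W^2$ and the backward multiplier $n_{\text{out}}\sigma_W^2$ for a non-square layer under the three candidate variances; the function name `propagation_factors` and the layer size 256 by 128 are illustrative assumptions.

```python
def propagation_factors(n_in, n_out, weight_var):
    """Return the (forward, backward) per-layer variance multipliers."""
    return n_in * weight_var, n_out * weight_var


n_in, n_out = 256, 128
candidates = {
    "forward-only": 1.0 / n_in,
    "backward-only": 1.0 / n_out,
    "Xavier": 2.0 / (n_in + n_out),
}

table = {}
for name, weight_var in candidates.items():
    table[name] = propagation_factors(n_in, n_out, weight_var)
    fwd, bwd = table[name]
    print(f"{name:13s} sigma_W^2 = {weight_var:.6f}  forward x{fwd:.3f}  backward x{bwd:.3f}")
```

For this layer the forward-only choice gives multipliers (1, 0.5), the backward-only choice gives (2, 1), and the Xavier compromise gives (4/3, 2/3): neither direction is exact, but neither is off by more than a third.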

Two concrete distributions realize this variance. A zero-mean Gaussian uses standard deviation $\sigma_W = \sqrt{2 / (n_{\text{in}} + n_{\text{out}})}$. A uniform distribution $U(-r, r)$ has variance $r^2 / 3$, so matching $r^2/3 = 2/(n_{\text{in}}+n_{\text{out}})$ gives the familiar bound

\[ r = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \qquad W \sim U\!\left(-r,\, r\right). \]

# Xavier (Glorot) scale, conceptual; `layer` stands for any fully connected layer
from math import sqrt
fan_in, fan_out = layer.in_features, layer.out_features
std = sqrt(2.0 / (fan_in + fan_out))     # normal variant: standard deviation
bound = sqrt(6.0 / (fan_in + fan_out))   # uniform variant: U(-bound, bound)
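To confirm that the Gaussian and uniform variants really target the same variance, one can sample both and compare. This is a small sketch using only the standard library; the layer size and the sample count (one value per weight) are arbitrary illustrative choices.

```python
import math
import random
import statistics

n_in, n_out = 256, 128
target_var = 2.0 / (n_in + n_out)
r = math.sqrt(6.0 / (n_in + n_out))   # uniform half-width, so that r^2 / 3 = target_var

rng = random.Random(0)
count = n_in * n_out
uniform_weights = [rng.uniform(-r, r) for _ in range(count)]
normal_weights = [rng.gauss(0.0, math.sqrt(target_var)) for _ in range(count)]

uniform_var = statistics.pvariance(uniform_weights)
normal_var = statistics.pvariance(normal_weights)

print(f"target variance  = {target_var:.6f}")
print(f"uniform variance = {uniform_var:.6f}")
print(f"normal variance  = {normal_var:.6f}")
```

Both sample variances should land close to $2/(n_{\text{in}}+n_{\text{out}})$; the two variants differ in the shape of the distribution, not in its scale.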

205.4 4. He (Kaiming) Initialization for ReLU

The linear approximation that justified Xavier breaks down for the rectified linear unit $\phi(z) = \max(0, z)$. ReLU is not symmetric about zero, and it discards roughly half of its inputs by mapping all negative pre-activations to exactly zero. This halving of the active signal must be accounted for, and doing so changes the constant in the variance formula.

205.4.1 4.1 The Effect of the Rectifier

Suppose $z^{(l)}$ is symmetric about zero, which holds when the weights are drawn from a distribution that is symmetric about zero and the bias is zero. Then $z^{(l)}$ is positive half the time and negative half the time. For the positive half ReLU acts as the identity, and for the negative half it outputs zero. We compute the second moment of the activation $a = \max(0, z)$:

\[ \mathbb{E}[a^2] = \mathbb{E}[\max(0, z)^2] = \int_{0}^{\infty} z^2 p(z)\, dz = \tfrac{1}{2} \int_{-\infty}^{\infty} z^2 p(z)\, dz = \tfrac{1}{2}\,\mathbb{E}[z^2]. \]

The middle equality uses the symmetry of $p(z)$: the integral over the positive half-line is exactly half the integral over the whole line. Since $z$ has mean zero, $\mathbb{E}[z^2] = \text{Var}(z)$, so

\[ \mathbb{E}[a^2] = \tfrac{1}{2}\,\text{Var}(z). \]

This factor of one half is the crux. ReLU passes on only half of the second moment of its input, so to keep the signal scale constant across layers we must inject a compensating factor of two into the weight variance.
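The halving can be verified by sampling. The sketch below assumes a Gaussian pre-activation, which is one convenient symmetric choice, and estimates both the second moment and the variance of the rectified output; the value of `sigma` is arbitrary.

```python
import random

rng = random.Random(0)
sigma = 1.7
z = [rng.gauss(0.0, sigma) for _ in range(100_000)]
a = [max(0.0, v) for v in z]          # ReLU applied elementwise

second_moment = sum(v * v for v in a) / len(a)
mean_a = sum(a) / len(a)
variance_a = second_moment - mean_a ** 2
half_var_z = 0.5 * sigma ** 2

print(f"E[a^2]       = {second_moment:.4f}")
print(f"Var(z) / 2   = {half_var_z:.4f}")
print(f"Var(a)       = {variance_a:.4f}  (smaller, because E[a] > 0)")
```

The quantity that is halved is the second moment $\mathbb{E}[a^2]$, not the variance of $a$: the rectified output has a positive mean, so its variance is strictly smaller than its second moment. This is why the next derivation works with $\mathbb{E}[a^2]$.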

205.4.2 4.2 The Forward Derivation

Let $n_{\text{in}} = n_{l-1}$ be the fan-in. The pre-activation variance is $\text{Var}(z^{(l)}) = n_{l-1}\, \sigma_W^2\, \mathbb{E}[(a^{(l-1)})^2]$, where we use the second moment of the previous activation because ReLU outputs are not zero-mean. Substituting the rectifier relation $\mathbb{E}[(a^{(l-1)})^2] = \tfrac{1}{2}\text{Var}(z^{(l-1)})$ gives

\[ \text{Var}(z^{(l)}) = n_{l-1}\, \sigma_W^2 \cdot \tfrac{1}{2}\, \text{Var}(z^{(l-1)}). \]

Demanding $\text{Var}(z^{(l)}) = \text{Var}(z^{(l-1)})$ forces the multiplicative factor to one:

\[ \tfrac{1}{2}\, n_{l-1}\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \boxed{\;\sigma_W^2 = \frac{2}{n_{\text{in}}}\;} \]

This is He initialization. The numerator is $2$ rather than $1$, and that doubling is the direct mathematical consequence of ReLU killing half the signal. He and colleagues showed that this scaling allows very deep rectifier networks, including the thirty-layer models that motivated their work, to converge where Xavier initialization stalls.

For a Gaussian the standard deviation is $\sigma_W = \sqrt{2/n_{\text{in}}}$, and for a uniform variant the bound is $r = \sqrt{6/n_{\text{in}}}$.

# He (Kaiming) scale for ReLU, conceptual; `layer` stands for any fully connected layer
from math import sqrt
fan_in = layer.in_features
std = sqrt(2.0 / fan_in)          # normal variant: standard deviation
bound = sqrt(6.0 / fan_in)        # uniform variant: U(-bound, bound)
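The effect of the factor of two shows up clearly in a deep rectifier stack. The following sketch pushes the same random inputs through ten bias-free ReLU layers twice, once with $\sigma_W^2 = 2/n_{\text{in}}$ and once with $\sigma_W^2 = 1/n_{\text{in}}$, and records the second moment of the pre-activations at each layer. The helper name `relu_second_moments` and the width, depth, and batch size are illustrative assumptions.

```python
import math
import random


def relu_second_moments(width, depth, weight_var, batch=16, seed=0):
    """Return E[z^2] of the pre-activations after each layer of a ReLU stack."""
    rng = random.Random(seed)
    std = math.sqrt(weight_var)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    moments = []
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        pre = [[sum(w * a for w, a in zip(row, x)) for row in W] for x in acts]
        flat = [v for x in pre for v in x]
        moments.append(sum(v * v for v in flat) / len(flat))
        acts = [[max(0.0, v) for v in x] for x in pre]   # rectify for the next layer
    return moments


width, depth = 64, 10
he = relu_second_moments(width, depth, 2.0 / width)
plain = relu_second_moments(width, depth, 1.0 / width)

for layer, (m_he, m_plain) in enumerate(zip(he, plain), start=1):
    print(f"layer {layer:2d}: He E[z^2] = {m_he:8.4f}   1/n E[z^2] = {m_plain:10.6f}")
```

Because both runs share a seed and ReLU is positively homogeneous, the two columns differ by exactly a factor of $2^{l}$ at layer $l$ (up to floating-point rounding): the $1/n_{\text{in}}$ column loses a factor of two per layer relative to the He column, which stays on the order of its starting value.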

205.4.3 4.3 Fan-In, Fan-Out, and Leaky Variants

He initialization can be anchored to either fan-in or fan-out. The fan-in mode preserves the variance of activations in the forward pass, while the fan-out mode preserves the variance of gradients in the backward pass. The choice rarely changes results dramatically: using the other mode leaves a per-layer factor equal to the ratio of that layer's fan-in to its fan-out, and when layers are chained these ratios largely cancel instead of compounding with depth. Fan-in is the conventional default.
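The claim that the mode rarely matters can be made concrete for a plain ReLU multilayer perceptron. In the sketch below, each layer's forward multiplier is $\tfrac{1}{2} n_{l-1} \sigma_W^2$; the helper name `forward_factors` and the list of layer widths are hypothetical choices for illustration.

```python
import math


def forward_factors(widths, mode):
    """Per-layer forward multipliers (1/2) * fan_in * sigma_W^2 for a ReLU MLP."""
    factors = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        fan = fan_in if mode == "fan_in" else fan_out
        weight_var = 2.0 / fan                   # He variance anchored to the chosen fan
        factors.append(0.5 * fan_in * weight_var)
    return factors


widths = [784, 512, 256, 128, 10]
product_fan_in = math.prod(forward_factors(widths, "fan_in"))
product_fan_out = math.prod(forward_factors(widths, "fan_out"))

print("fan_in  mode, total forward factor:", product_fan_in)
print("fan_out mode, total forward factor:", product_fan_out)
print("first width / last width          :", widths[0] / widths[-1])
```

In fan-in mode every factor is one. In fan-out mode each factor is $n_{l-1}/n_l$, and the product telescopes to the ratio of the first and last widths, a fixed constant that does not grow or shrink exponentially with the number of layers.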

For the leaky ReLU $\phi(z) = \max(\alpha z, z)$ with a small slope $0 \le \alpha \le 1$ on the negative side, the negative half is no longer discarded entirely. Repeating the second-moment calculation yields a correction, and the variance becomes

\[ \sigma_W^2 = \frac{2}{(1 + \alpha^2)\, n_{\text{in}}}. \]

When $\alpha = 0$ this recovers standard He, and when $\alpha = 1$ the unit is linear and the formula reduces to the Xavier-style $1/n_{\text{in}}$. This generalization is exactly the gain parameter that deep learning libraries expose.
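The second-moment calculation behind the leaky formula can be spelled out as a short worked derivation, under the same assumption as Section 4.1 that $p(z)$ is symmetric about zero and with $0 \le \alpha \le 1$. Splitting the integral at the origin, where the unit switches from slope $\alpha$ to slope one,

\[ \mathbb{E}[\phi(z)^2] = \int_{0}^{\infty} z^2 p(z)\, dz + \alpha^2 \int_{-\infty}^{0} z^2 p(z)\, dz = \tfrac{1}{2}(1 + \alpha^2)\,\mathbb{E}[z^2] = \tfrac{1}{2}(1 + \alpha^2)\,\text{Var}(z). \]

Using this in the forward recursion of Section 4.2 in place of the factor $\tfrac{1}{2}$ gives

\[ \text{Var}(z^{(l)}) = \tfrac{1}{2}(1 + \alpha^2)\, n_{l-1}\, \sigma_W^2\, \text{Var}(z^{(l-1)}), \]

and setting the multiplier to one yields $\sigma_W^2 = 2 / \big((1 + \alpha^2)\, n_{\text{in}}\big)$.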

205.5 5. Gain Factors and a Unified View

Both schemes can be written through a single template. Let $g$ be a nonlinearity-dependent gain. Then

\[ \sigma_W^2 = \frac{g^2}{n}, \qquad g = \begin{cases} 1 & \text{linear, } \tanh \text{ (approx.)} \\[2pt] \sqrt{2} & \text{ReLU} \\[2pt] \sqrt{2/(1+\alpha^2)} & \text{leaky ReLU} \end{cases} \]

with $n$ being fan-in, fan-out, or their average depending on whether we target the forward pass, the backward pass, or the Glorot compromise. The gain recommended in practice for $\tanh$ is noticeably above the linear-regime value of one, namely $5/3 \approx 1.67$, because the small but real curvature of $\tanh$ near useful operating points reduces its effective slope below unity; the gain compensates. Viewing initialization through the gain abstraction makes it clear that Xavier and He are not different philosophies but the same variance-preservation principle evaluated for different activation functions.
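The template translates into a few lines of code. The sketch below is not the API of any particular library: `gain_for` and `weight_std` are hypothetical helper names, the mode labels are made up for this example, and the $\tanh$ entry uses the practical value $5/3$ mentioned above rather than the linear-regime value of one.

```python
import math


def gain_for(nonlinearity, alpha=0.01):
    """Hypothetical helper returning the gain g for a nonlinearity."""
    if nonlinearity == "linear":
        return 1.0
    if nonlinearity == "tanh":
        return 5.0 / 3.0
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + alpha ** 2))
    raise ValueError(f"unknown nonlinearity: {nonlinearity!r}")


def weight_std(gain, fan_in, fan_out, mode="fan_in"):
    """Standard deviation g / sqrt(n), with n picked by `mode`."""
    n = {"fan_in": fan_in, "fan_out": fan_out, "fan_avg": (fan_in + fan_out) / 2}[mode]
    return gain / math.sqrt(n)


fan_in, fan_out = 256, 128
xavier_tanh_std = weight_std(gain_for("tanh"), fan_in, fan_out, mode="fan_avg")
he_relu_std = weight_std(gain_for("relu"), fan_in, fan_out, mode="fan_in")

print(f"Xavier(tanh) std = {xavier_tanh_std:.6f}")   # 0.120281
print(f"He(relu)     std = {he_relu_std:.6f}")       # 0.088388
```

Setting `alpha=1.0` in the leaky case returns a gain of one, matching the linear entry, and `alpha=0.0` returns $\sqrt{2}$, matching plain ReLU.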

205.6 6. Practical Defaults

The following defaults reflect what works reliably in modern practice.

- For networks built on ReLU and its close relatives such as leaky ReLU, ELU, or GELU, use He initialization with fan-in mode. This is the standard for convolutional and most feedforward architectures. In code this is kaiming_normal_ or kaiming_uniform_ with nonlinearity='relu'.
- For networks with $\tanh$, sigmoid, or other saturating activations, use Xavier initialization with the appropriate gain. The sigmoid is not zero-centered, so the derivation of Section 3 applies to it only approximately. Xavier remains common in recurrent networks and in attention components that use $\tanh$ gating.
- Biases are almost always initialized to zero. A nonzero bias breaks the symmetry assumption used in the ReLU derivation, and there is rarely any benefit. One historical exception is initializing forget-gate biases of an LSTM to a small positive value.
- Initialization variance is computed per layer from that layer’s own fan-in and fan-out, not from a global constant, because the whole point is to neutralize the layer-specific factor $n$.
- For convolutional layers the fan-in is the number of input channels times the spatial kernel size, $C_{\text{in}} \cdot k_h \cdot k_w$, and the fan-out is $C_{\text{out}} \cdot k_h \cdot k_w$. The same formulas apply once $n$ is computed this way.

# PyTorch defaults, conceptual; conv, linear, and layer stand for existing modules
import torch.nn as nn
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
nn.init.xavier_uniform_(linear.weight, gain=nn.init.calculate_gain('tanh'))
nn.init.zeros_(layer.bias)
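The convolutional fan computation from the list above can be sketched without any framework. The helper name `conv_fans` is hypothetical, and the weight layout `(C_out, C_in, k_h, k_w)` is an assumption made for this example.

```python
import math


def conv_fans(weight_shape):
    """Map a weight shape (C_out, C_in, *kernel) to (fan_in, fan_out)."""
    c_out, c_in, *kernel = weight_shape
    receptive_field = math.prod(kernel)      # k_h * k_w; equals 1 for a dense layer
    return c_in * receptive_field, c_out * receptive_field


# A 7x7 convolution from 3 input channels to 64 output channels.
conv_fan_in, conv_fan_out = conv_fans((64, 3, 7, 7))
print("conv fan_in, fan_out:", conv_fan_in, conv_fan_out)    # 147 3136

# A dense layer with shape (fan_out, fan_in) falls out as the special case.
dense_fan_in, dense_fan_out = conv_fans((128, 256))
print("dense fan_in, fan_out:", dense_fan_in, dense_fan_out)  # 256 128

he_std = math.sqrt(2.0 / conv_fan_in)
print(f"He std for the conv layer = {he_std:.6f}")
```

Treating the dense layer as a convolution with an empty kernel keeps a single code path for both cases, which is why the same formulas carry over unchanged.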

A final caveat is that normalization layers and residual connections have shifted some of the burden that initialization once carried alone. Batch normalization, layer normalization, and carefully scaled residual branches all stabilize variance during training, which makes networks more forgiving of imperfect initialization. Even so, the right starting variance still accelerates early training and remains essential in architectures without normalization, in very deep stacks, and whenever training must be reproducible and robust. The variance argument that produced Xavier and He continues to inform newer schemes, including the residual-aware and transformer-specific initializers that scale weights by depth.

205.7 7. Reference Implementation

The aiinaction companion library ships a small, dependency-free implementation of both schemes in Python, Julia, and Rust. It exposes two layers of API. The first computes the theoretical scale of a scheme: xavier_scale(fan_in, fan_out) and he_scale(fan) return the matching Gaussian standard deviation $\sigma_W$ and the half-width $r = \sigma_W\sqrt{3}$ of the equivalent uniform support $U(-r, r)$. The second layer samples an actual weight matrix of shape (fan_out, fan_in) via xavier_normal, xavier_uniform, he_normal, and he_uniform. The helper calculate_gain returns the nonlinearity-dependent gain $g$ discussed in Section 5.

Sampling is built on a self-contained, deterministic SplitMix64 generator feeding a Box-Muller transform, so a fixed seed yields the identical weight matrix across all three languages, not merely matching summary statistics. This makes the examples below reproducible bit-for-bit and lets the cross-language parity tests assert agreement to $10^{-9}$.
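For readers unfamiliar with the two building blocks, here is a minimal sketch of a SplitMix64 generator feeding a Box-Muller transform. It is not the companion library's implementation: the class and function names, the way 64-bit integers are mapped to floats, and the order in which the cosine and sine outputs are consumed are all assumptions, so this sketch is not expected to reproduce the library's numbers.

```python
import math

MASK64 = (1 << 64) - 1


class SplitMix64:
    """SplitMix64: a 64-bit counter passed through a fixed mixing function."""

    def __init__(self, seed):
        self.state = seed & MASK64

    def next_u64(self):
        self.state = (self.state + 0x9E3779B97F4A7C15) & MASK64
        z = self.state
        z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
        z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
        return z ^ (z >> 31)

    def next_unit(self):
        # Assumed mapping: keep the top 53 bits to get a float in [0, 1).
        return (self.next_u64() >> 11) * (1.0 / (1 << 53))


def standard_normals(rng, count):
    """Box-Muller: turn pairs of uniforms into pairs of N(0, 1) samples."""
    out = []
    while len(out) < count:
        u1 = 1.0 - rng.next_unit()           # in (0, 1], so log(u1) is finite
        u2 = rng.next_unit()
        radius = math.sqrt(-2.0 * math.log(u1))
        out.append(radius * math.cos(2.0 * math.pi * u2))
        out.append(radius * math.sin(2.0 * math.pi * u2))
    return out[:count]


def he_normal_sketch(fan_in, fan_out, seed):
    """Hypothetical sampler: a (fan_out, fan_in) matrix with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    flat = standard_normals(SplitMix64(seed), fan_in * fan_out)
    return [[std * flat[r * fan_in + c] for c in range(fan_in)] for r in range(fan_out)]


draws = standard_normals(SplitMix64(0), 20000)
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / len(draws)
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```

Because every step uses only integer arithmetic, `sqrt`, `log`, `cos`, and `sin`, the same recipe is straightforward to port to other languages, which is what makes seed-level reproducibility across implementations feasible.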

Python

import numpy as np
from aiinaction.ch200_weight_init import (
    calculate_gain, xavier_scale, he_scale, he_normal, xavier_normal,
)

fan_in, fan_out = 256, 128

# Theoretical scales for a tanh layer (Xavier) and a ReLU layer (He).
xs = xavier_scale(fan_in, fan_out, gain=calculate_gain("tanh"))
hs = he_scale(fan_in, gain=calculate_gain("relu"))
print(f"Xavier(tanh) std = {xs.std:.6f}, uniform bound = {xs.bound:.6f}")
print(f"He(relu)     std = {hs.std:.6f}, uniform bound = {hs.bound:.6f}")

# Sample a real He-initialized weight matrix and check the empirical variance
# against the 2 / fan_in target.
W = he_normal(fan_in, fan_out, seed=0)
print(f"W shape          = {W.shape}")
print(f"empirical var(W) = {W.var():.6f}")
print(f"target 2/fan_in  = {2.0 / fan_in:.6f}")

# The deterministic seed makes the first few weights reproducible everywhere.
print("first row, 3 weights:", np.round(xavier_normal(3, 2, seed=42)[0], 6))

Xavier(tanh) std = 0.120281, uniform bound = 0.208333
He(relu)     std = 0.088388, uniform bound = 0.153093
W shape          = (128, 256)
empirical var(W) = 0.007828
target 2/fan_in  = 0.007812
first row, 3 weights: [ 0.262292 -0.564078  1.093891]

Julia

using AIInAction.Ch200WeightInit

fan_in, fan_out = 256, 128

xs = xavier_scale(fan_in, fan_out; gain=calculate_gain("tanh"))
hs = he_scale(fan_in; gain=calculate_gain("relu"))
println("Xavier(tanh) std = ", round(xs.std; digits=6),
        ", uniform bound = ", round(xs.bound; digits=6))
println("He(relu)     std = ", round(hs.std; digits=6),
        ", uniform bound = ", round(hs.bound; digits=6))

W = he_normal(fan_in, fan_out; seed=0)          # (fan_out, fan_in)
target = 2.0 / fan_in
println("empirical var(W) = ", round(sum(abs2, W .- sum(W) / length(W)) / length(W); digits=6))
println("target 2/fan_in  = ", round(target; digits=6))

# Same deterministic seed -> same weights as Python and Rust.
println("first row: ", round.(xavier_normal(3, 2; seed=42)[1, :]; digits=6))

Rust

use aiinaction::ch200_weight_init::{
    calculate_gain, xavier_scale, he_scale, he_normal, xavier_normal, FanMode,
};

fn main() {
    let (fan_in, fan_out) = (256usize, 128usize);

    let xs = xavier_scale(fan_in, fan_out, calculate_gain("tanh", None).unwrap()).unwrap();
    let hs = he_scale(fan_in, calculate_gain("relu", None).unwrap()).unwrap();
    println!("Xavier(tanh) std = {:.6}, uniform bound = {:.6}", xs.std, xs.bound);
    println!("He(relu)     std = {:.6}, uniform bound = {:.6}", hs.std, hs.bound);

    // (fan_out, fan_in) He-initialized matrix with the ReLU gain sqrt(2).
    let w = he_normal(fan_in, fan_out, 2.0_f64.sqrt(), FanMode::FanIn, 0).unwrap();
    let flat: Vec<f64> = w.iter().flatten().copied().collect();
    let mean = flat.iter().sum::<f64>() / flat.len() as f64;
    let var = flat.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / flat.len() as f64;
    println!("empirical var(W) = {:.6}", var);
    println!("target 2/fan_in  = {:.6}", 2.0 / fan_in as f64);

    // Same seed -> identical weights across all three languages.
    let first = xavier_normal(3, 2, 1.0, 42).unwrap();
    println!("first row: {:?}", first[0]);
}

205.8 8. Summary

The unifying idea is that initialization should preserve variance, both forward through the activations and backward through the gradients. The variance of a sum of $n$ independent weighted signals scales as $n\,\sigma_W^2$, so the weight variance must scale as $1/n$. Xavier initialization applies this to symmetric activations and compromises between fan-in and fan-out with $\sigma_W^2 = 2/(n_{\text{in}} + n_{\text{out}})$. He initialization adds a factor of two, $\sigma_W^2 = 2/n_{\text{in}}$, to compensate for the half of the signal that ReLU discards. Picking the scheme that matches the nonlinearity is one of the cheapest and highest-leverage decisions in building a trainable deep network.

205.9 References

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS). https://proceedings.mlr.press/v9/glorot10a.html
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV). https://arxiv.org/abs/1502.01852
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, Chapter 8: Optimization for Training Deep Models. MIT Press. https://www.deeplearningbook.org/
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6120
PyTorch Documentation. torch.nn.init. https://pytorch.org/docs/stable/nn.init.html
Zhang, H., Dauphin, Y. N., and Ma, T. (2019). Fixup Initialization: Residual Learning Without Normalization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1901.09321

# Weight Initialization: Xavier and He Training a deep network begins before the first gradient step. The initial values of the weights set the statistical regime in which signals propagate forward and gradients propagate backward. Choose them poorly and activations either collapse toward zero or saturate at the extremes of the nonlinearity, and the network either learns nothing or learns catastrophically slowly. Xavier (Glorot) initialization and He (Kaiming) initialization are the two canonical schemes that fix this problem by reasoning about variance. This chapter derives both from first principles, explains the roles of fan-in and fan-out, and distills the practical defaults that practitioners reach for today. ## 1. The Problem of Signal Propagation Consider a feedforward network with layers indexed by $l$. Each layer computes a pre-activation $\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ followed by an elementwise nonlinearity $\mathbf{a}^{(l)} = \phi(\mathbf{z}^{(l)})$. The weight matrix $W^{(l)}$ has shape $n_l \times n_{l-1}$, where $n_{l-1}$ is the number of inputs to the layer (the fan-in) and $n_l$ is the number of outputs (the fan-out). Suppose we initialize every weight independently from a distribution with mean zero and variance $\sigma_W^2$. As a signal passes through many layers, the variance of the activations either grows or shrinks geometrically unless the per-layer scaling is chosen with care. If the variance grows, activations explode and saturating nonlinearities clip; if it shrinks, activations vanish and so do the gradients that depend on them. The goal of principled initialization is to keep the variance of activations roughly constant across the forward pass, and the variance of gradients roughly constant across the backward pass. The qualitative target is simple to state. Let $\text{Var}(a^{(l)})$ denote the variance of a typical activation in layer $l$. 
We want $$ \text{Var}(a^{(l)}) \approx \text{Var}(a^{(l-1)}) \quad \text{for all } l. $$ A symmetric condition holds for the backpropagated error signal. Achieving both simultaneously turns out to constrain the weight variance in terms of both fan-in and fan-out. ## 2. Variance of a Linear Combination The derivations rest on one elementary fact. Let $X_1, \dots, X_n$ be independent random variables, each with mean zero and variance $\sigma_X^2$, and let $W_1, \dots, W_n$ be independent weights, also mean zero with variance $\sigma_W^2$, independent of the $X_i$. Consider the sum $$ Z = \sum_{i=1}^{n} W_i X_i. $$ Because all terms have mean zero and are mutually independent, $$ \text{Var}(Z) = \sum_{i=1}^{n} \text{Var}(W_i X_i) = \sum_{i=1}^{n} \text{Var}(W_i)\,\text{Var}(X_i) = n\, \sigma_W^2\, \sigma_X^2. $$ The middle step uses the identity that for independent zero-mean variables $\text{Var}(WX) = \text{Var}(W)\text{Var}(X)$, which follows from $\mathbb{E}[WX] = 0$ and $\mathbb{E}[(WX)^2] = \mathbb{E}[W^2]\mathbb{E}[X^2]$. This single equation, $\text{Var}(Z) = n\, \sigma_W^2\, \sigma_X^2$, is the engine behind both Xavier and He. The factor $n$ is the fan-in, and it is precisely the source of the variance blowup that we must counteract by choosing $\sigma_W^2$ inversely proportional to $n$. ## 3. Xavier (Glorot) Initialization for tanh Glorot and Bengio analyzed networks with symmetric, zero-centered activations such as $\tanh$. Near the origin, $\tanh(z) \approx z$, so to a first approximation the nonlinearity acts like the identity. Under this linear regime we treat $\phi$ as having unit slope, which lets the variance recursion of Section 2 apply directly to activations. ### 3.1 The Forward Condition Assume the inputs $a^{(l-1)}$ to layer $l$ are zero-mean with variance $\text{Var}(a^{(l-1)})$, and that weights are drawn independently with variance $\sigma_W^2$. 
Under the linear approximation $a^{(l)} \approx z^{(l)}$, the result of Section 2 gives $$ \text{Var}(a^{(l)}) = n_{l-1}\, \sigma_W^2\, \text{Var}(a^{(l-1)}). $$ To preserve variance forward, $\text{Var}(a^{(l)}) = \text{Var}(a^{(l-1)})$, we need the multiplicative factor to equal one: $$ n_{l-1}\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \sigma_W^2 = \frac{1}{n_{l-1}} = \frac{1}{n_{\text{in}}}. $$ ### 3.2 The Backward Condition Backpropagation sends an error signal $\boldsymbol{\delta}^{(l)} = \partial \mathcal{L} / \partial \mathbf{z}^{(l)}$ through the transpose of the weight matrix. In the linear regime the gradient with respect to the previous layer's pre-activation satisfies $\boldsymbol{\delta}^{(l-1)} = (W^{(l)})^{\top} \boldsymbol{\delta}^{(l)}$. The dimension of the sum here is $n_l$, the fan-out, because each input neuron receives error contributions from all $n_l$ output neurons. Applying the same variance bookkeeping, $$ \text{Var}(\delta^{(l-1)}) = n_l\, \sigma_W^2\, \text{Var}(\delta^{(l)}). $$ Preserving gradient variance backward requires $$ n_l\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \sigma_W^2 = \frac{1}{n_l} = \frac{1}{n_{\text{out}}}. $$ ### 3.3 The Compromise The forward condition asks for $\sigma_W^2 = 1/n_{\text{in}}$ and the backward condition asks for $\sigma_W^2 = 1/n_{\text{out}}$. Unless the layer is square these are incompatible, so Glorot and Bengio proposed the harmonic-style compromise of averaging the two denominators: $$ \boxed{\;\sigma_W^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}\;} $$ This is Xavier initialization. It does not satisfy either condition exactly, but it keeps both forward activations and backward gradients within a stable band, which is what matters in practice. Two concrete distributions realize this variance. A zero-mean Gaussian uses standard deviation $\sigma_W = \sqrt{2 / (n_{\text{in}} + n_{\text{out}})}$. 
A uniform distribution $U(-r, r)$ has variance $r^2 / 3$, so matching $r^2/3 = 2/(n_{\text{in}}+n_{\text{out}})$ gives the familiar bound $$ r = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \qquad W \sim U\!\left(-r,\, r\right). $$ ```text # Xavier (Glorot) variance, conceptual fan_in, fan_out = layer.in_features, layer.out_features std = sqrt(2.0 / (fan_in + fan_out)) # normal variant bound = sqrt(6.0 / (fan_in + fan_out)) # uniform variant: U(-bound, bound) ``` ## 4. He (Kaiming) Initialization for ReLU The linear approximation that justified Xavier breaks down for the rectified linear unit $\phi(z) = \max(0, z)$. ReLU is not symmetric about zero, and it discards roughly half of its inputs by mapping all negative pre-activations to exactly zero. This halving of the active signal must be accounted for, and doing so changes the constant in the variance formula. ### 4.1 The Effect of the Rectifier Suppose $z^{(l)}$ is symmetric about zero, which holds when the weights are symmetric and the bias is zero. Then $z^{(l)}$ is positive half the time and negative half the time. For the positive half ReLU acts as the identity, and for the negative half it outputs zero. We compute the second moment of the activation $a = \max(0, z)$: $$ \mathbb{E}[a^2] = \mathbb{E}[\max(0, z)^2] = \int_{0}^{\infty} z^2 p(z)\, dz = \tfrac{1}{2} \int_{-\infty}^{\infty} z^2 p(z)\, dz = \tfrac{1}{2}\,\mathbb{E}[z^2]. $$ The middle equality uses the symmetry of $p(z)$: the integral over the positive half-line is exactly half the integral over the whole line. Since $z$ has mean zero, $\mathbb{E}[z^2] = \text{Var}(z)$, so $$ \mathbb{E}[a^2] = \tfrac{1}{2}\,\text{Var}(z). $$ This factor of one half is the crux. ReLU passes only half the variance of its input, so to keep variance constant across layers we must inject a compensating factor of two into the weight variance. ### 4.2 The Forward Derivation Let $n_{\text{in}} = n_{l-1}$ be the fan-in. 
The pre-activation variance is $\text{Var}(z^{(l)}) = n_{l-1}\, \sigma_W^2\, \mathbb{E}[(a^{(l-1)})^2]$, where we use the second moment of the previous activation because ReLU outputs are not zero-mean. Substituting the rectifier relation $\mathbb{E}[(a^{(l-1)})^2] = \tfrac{1}{2}\text{Var}(z^{(l-1)})$ gives $$ \text{Var}(z^{(l)}) = n_{l-1}\, \sigma_W^2 \cdot \tfrac{1}{2}\, \text{Var}(z^{(l-1)}). $$ Demanding $\text{Var}(z^{(l)}) = \text{Var}(z^{(l-1)})$ forces the bracketed factor to one: $$ \tfrac{1}{2}\, n_{l-1}\, \sigma_W^2 = 1 \quad \Longrightarrow \quad \boxed{\;\sigma_W^2 = \frac{2}{n_{\text{in}}}\;} $$ This is He initialization. The numerator is $2$ rather than $1$, and that doubling is the direct mathematical consequence of ReLU killing half the signal. He and colleagues showed that this scaling allows very deep rectifier networks, including the thirty-layer models that motivated their work, to converge where Xavier initialization stalls. For a Gaussian the standard deviation is $\sigma_W = \sqrt{2/n_{\text{in}}}$, and for a uniform variant the bound is $r = \sqrt{6/n_{\text{in}}}$. ```text # He (Kaiming) variance for ReLU, conceptual fan_in = layer.in_features std = sqrt(2.0 / fan_in) # normal variant bound = sqrt(6.0 / fan_in) # uniform variant: U(-bound, bound) ``` ### 4.3 Fan-In, Fan-Out, and Leaky Variants He initialization can be anchored to either fan-in or fan-out. The fan-in mode preserves the variance of activations in the forward pass, while the fan-out mode preserves the variance of gradients in the backward pass. The choice rarely changes results dramatically, because the per-layer factor introduced in one direction is absorbed elsewhere, but fan-in is the conventional default. For the leaky ReLU $\phi(z) = \max(\alpha z, z)$ with small negative slope $\alpha$, the negative half is no longer discarded entirely. 
Repeating the second-moment calculation yields a correction, and the variance becomes $$ \sigma_W^2 = \frac{2}{(1 + \alpha^2)\, n_{\text{in}}}. $$ When $\alpha = 0$ this recovers standard He, and when $\alpha = 1$ the unit is linear and the formula reduces to the Xavier-style $1/n_{\text{in}}$. This generalization is exactly the gain parameter that deep learning libraries expose. ## 5. Gain Factors and a Unified View Both schemes can be written through a single template. Let $g$ be a nonlinearity-dependent gain. Then $$ \sigma_W^2 = \frac{g^2}{n}, \qquad g = \begin{cases} 1 & \text{linear, } \tanh \text{ (approx.)} \\[2pt] \sqrt{2} & \text{ReLU} \\[2pt] \sqrt{2/(1+\alpha^2)} & \text{leaky ReLU} \end{cases} $$ with $n$ being fan-in, fan-out, or their average depending on whether we target the forward pass, the backward pass, or the Glorot compromise. The recommended gain for $\tanh$ is in fact slightly above one, around $5/3$, because the small but real curvature of $\tanh$ near useful operating points reduces its effective slope below unity; the gain compensates. Viewing initialization through the gain abstraction makes it clear that Xavier and He are not different philosophies but the same variance-preservation principle evaluated for different activation functions. ## 6. Practical Defaults The following defaults reflect what works reliably in modern practice. - For networks built on ReLU and its close relatives such as leaky ReLU, ELU, or GELU, use He initialization with fan-in mode. This is the standard for convolutional and most feedforward architectures. In code this is `kaiming_normal_` or `kaiming_uniform_` with `nonlinearity='relu'`. - For networks with $\tanh$, sigmoid, or other symmetric saturating activations, use Xavier initialization with the appropriate gain. This remains common in recurrent networks and in attention components that use $\tanh$ gating. - Biases are almost always initialized to zero. 
A nonzero bias breaks the symmetry assumption used in the ReLU derivation, and there is rarely any benefit. One historical exception is initializing forget-gate biases of an LSTM to a small positive value. - Initialization variance is computed per layer from that layer's own fan-in and fan-out, not from a global constant, because the whole point is to neutralize the layer-specific factor $n$. - For convolutional layers the fan-in is the number of input channels times the spatial kernel size, $C_{\text{in}} \cdot k_h \cdot k_w$, and the fan-out is $C_{\text{out}} \cdot k_h \cdot k_w$. The same formulas apply once $n$ is computed this way. ```text # PyTorch defaults, conceptual nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu') nn.init.xavier_uniform_(linear.weight, gain=nn.init.calculate_gain('tanh')) nn.init.zeros_(layer.bias) ``` A final caveat is that normalization layers and residual connections have shifted some of the burden that initialization once carried alone. Batch normalization, layer normalization, and carefully scaled residual branches all stabilize variance during training, which makes networks more forgiving of imperfect initialization. Even so, the right starting variance still accelerates early training and remains essential in architectures without normalization, in very deep stacks, and whenever training must be reproducible and robust. The variance argument that produced Xavier and He continues to inform newer schemes, including the residual-aware and transformer-specific initializers that scale weights by depth. ## 7. Reference Implementation The `aiinaction` companion library ships a small, dependency-free implementation of both schemes in Python, Julia, and Rust. It exposes two layers of API. 
The first computes the *theoretical* scale of a scheme: `xavier_scale(fan_in, fan_out)` and `he_scale(fan)` return the matching Gaussian standard deviation $\sigma_W$ and the half-width $r = \sigma_W\sqrt{3}$ of the equivalent uniform support $U(-r, r)$. The second layer *samples* an actual weight matrix of shape `(fan_out, fan_in)` via `xavier_normal`, `xavier_uniform`, `he_normal`, and `he_uniform`. The helper `calculate_gain` returns the nonlinearity-dependent gain $g$ discussed in Section 5.

Sampling is built on a self-contained, deterministic SplitMix64 generator feeding a Box-Muller transform, so a fixed seed yields the *identical* weight matrix across all three languages, not merely matching summary statistics. This makes the examples below reproducible bit-for-bit and lets the cross-language parity tests assert agreement to $10^{-9}$.

::: {.panel-tabset}

## Python

```{python}
import numpy as np
from aiinaction.ch200_weight_init import (
    calculate_gain, xavier_scale, he_scale, he_normal, xavier_normal,
)

fan_in, fan_out = 256, 128

# Theoretical scales for a tanh layer (Xavier) and a ReLU layer (He).
xs = xavier_scale(fan_in, fan_out, gain=calculate_gain("tanh"))
hs = he_scale(fan_in, gain=calculate_gain("relu"))
print(f"Xavier(tanh) std = {xs.std:.6f}, uniform bound = {xs.bound:.6f}")
print(f"He(relu) std = {hs.std:.6f}, uniform bound = {hs.bound:.6f}")

# Sample a real He-initialized weight matrix and check the empirical variance
# against the 2 / fan_in target.
W = he_normal(fan_in, fan_out, seed=0)
print(f"W shape = {W.shape}")
print(f"empirical var(W) = {W.var():.6f}")
print(f"target 2/fan_in = {2.0 / fan_in:.6f}")

# The deterministic seed makes the first few weights reproducible everywhere.
print("first row, 3 weights:", np.round(xavier_normal(3, 2, seed=42)[0], 6))
```

## Julia

```julia
using AIInAction.Ch200WeightInit

fan_in, fan_out = 256, 128

xs = xavier_scale(fan_in, fan_out; gain=calculate_gain("tanh"))
hs = he_scale(fan_in; gain=calculate_gain("relu"))
println("Xavier(tanh) std = ", round(xs.std; digits=6),
        ", uniform bound = ", round(xs.bound; digits=6))
println("He(relu) std = ", round(hs.std; digits=6),
        ", uniform bound = ", round(hs.bound; digits=6))

W = he_normal(fan_in, fan_out; seed=0)  # (fan_out, fan_in)
target = 2.0 / fan_in
println("empirical var(W) = ",
        round(sum(abs2, W .- sum(W) / length(W)) / length(W); digits=6))
println("target 2/fan_in = ", round(target; digits=6))

# Same deterministic seed -> same weights as Python and Rust.
println("first row: ", round.(xavier_normal(3, 2; seed=42)[1, :]; digits=6))
```

## Rust

```rust
use aiinaction::ch200_weight_init::{
    calculate_gain, xavier_scale, he_scale, he_normal, xavier_normal, FanMode,
};

fn main() {
    let (fan_in, fan_out) = (256usize, 128usize);

    let xs = xavier_scale(fan_in, fan_out, calculate_gain("tanh", None).unwrap()).unwrap();
    let hs = he_scale(fan_in, calculate_gain("relu", None).unwrap()).unwrap();
    println!("Xavier(tanh) std = {:.6}, uniform bound = {:.6}", xs.std, xs.bound);
    println!("He(relu) std = {:.6}, uniform bound = {:.6}", hs.std, hs.bound);

    // (fan_out, fan_in) He-initialized matrix with the ReLU gain sqrt(2).
    let w = he_normal(fan_in, fan_out, 2.0_f64.sqrt(), FanMode::FanIn, 0).unwrap();
    let flat: Vec<f64> = w.iter().flatten().copied().collect();
    let mean = flat.iter().sum::<f64>() / flat.len() as f64;
    let var = flat.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / flat.len() as f64;
    println!("empirical var(W) = {:.6}", var);
    println!("target 2/fan_in = {:.6}", 2.0 / fan_in as f64);

    // Same seed -> identical weights across all three languages.
    let first = xavier_normal(3, 2, 1.0, 42).unwrap();
    println!("first row: {:?}", first[0]);
}
```

:::

## 8. Summary

The unifying idea is that initialization should preserve variance, both forward through the activations and backward through the gradients. The variance of a sum of $n$ independent weighted signals scales as $n\,\sigma_W^2$, so the weight variance must scale as $1/n$. Xavier initialization applies this to symmetric activations and compromises between fan-in and fan-out with $\sigma_W^2 = 2/(n_{\text{in}} + n_{\text{out}})$. He initialization adds a factor of two, $\sigma_W^2 = 2/n_{\text{in}}$, to compensate for the half of the signal that ReLU discards. Picking the scheme that matches the nonlinearity is one of the cheapest and highest-leverage decisions in building a trainable deep network.

## References

1. Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS). https://proceedings.mlr.press/v9/glorot10a.html
2. He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV). https://arxiv.org/abs/1502.01852
3. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, Chapter 8: Optimization for Training Deep Models. MIT Press. https://www.deeplearningbook.org/
4. Saxe, A. M., McClelland, J. L., and Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6120
5. PyTorch Documentation. torch.nn.init. https://pytorch.org/docs/stable/nn.init.html
6. Zhang, H., Dauphin, Y. N., and Ma, T. (2019). Fixup Initialization: Residual Learning Without Normalization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1901.09321
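
As a closing numerical check on the variance argument in the summary, the following is a minimal sketch in plain NumPy. It does not use the `aiinaction` library; the helper `final_second_moment` and its parameters (`depth`, `width`, `batch`) are hypothetical names chosen for illustration. It pushes Gaussian inputs through a bias-free stack of square ReLU layers and compares the mean squared pre-activation of the last layer under the He variance $2/n_{\text{in}}$ and the Xavier variance $2/(n_{\text{in}} + n_{\text{out}})$. With equal fan-in and fan-out the Xavier variance is $1/n$, so the argument above predicts that the second moment is roughly halved at every ReLU layer, while under He it should stay of order one.

```python
import numpy as np


def final_second_moment(weight_var_fn, depth=20, width=512, batch=128, seed=0):
    """Mean squared pre-activation of the last layer of a bias-free ReLU stack.

    weight_var_fn(n_in, n_out) returns the weight variance to use per layer.
    All names here are illustrative and not part of any library API.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))  # unit-variance inputs
    z = a
    for _ in range(depth):
        std = np.sqrt(weight_var_fn(width, width))
        W = rng.normal(0.0, std, size=(width, width))  # shape (fan_out, fan_in)
        z = a @ W.T              # pre-activation
        a = np.maximum(z, 0.0)   # ReLU
    return float(np.mean(z ** 2))


# He: 2 / fan_in.  Xavier (Glorot compromise): 2 / (fan_in + fan_out).
he = final_second_moment(lambda n_in, n_out: 2.0 / n_in)
xavier = final_second_moment(lambda n_in, n_out: 2.0 / (n_in + n_out))

print(f"He     : E[z^2] at last layer = {he:.3e}")
print(f"Xavier : E[z^2] at last layer = {xavier:.3e}")
print(f"ratio He / Xavier             = {he / xavier:.3e}")
```

Because both runs share a seed and ReLU is positively homogeneous, the two networks differ only by a per-layer weight scale of $\sqrt{2}$, so the ratio isolates the factor of two per layer that He initialization restores. The absolute values depend on the seed and the chosen width.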