208 Group and Instance Normalization

Feature normalization is one of the load bearing ideas in modern deep learning. By rescaling intermediate activations to a stable distribution, normalization layers smooth the optimization landscape, reduce sensitivity to initialization, and permit larger learning rates. Batch Normalization launched this line of work, but its reliance on batch statistics creates failure modes that motivated a family of alternatives. This chapter develops a unified view of batch, layer, group, and instance normalization, derives where each sits along a single design axis, and explains why group normalization rescues small batch vision training while instance normalization underpins style transfer.

208.1 1. A Unified Formulation

Consider the activation tensor produced by a convolutional layer, with shape $(N, C, H, W)$, where $N$ indexes examples in a mini batch, $C$ indexes feature channels, and $H, W$ index spatial positions. Write a single scalar activation as $x_{nchw}$. Every normalization scheme in this chapter can be written as

\[ \hat{x}_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta, \]

where $i$ is shorthand for an index tuple $(n, c, h, w)$, the constant $\epsilon$ guards against division by zero, and $\gamma, \beta$ are learnable affine parameters typically defined per channel. The mean and variance are computed over a set $\mathcal{S}_i$ of activations:

\[ \mu_i = \frac{1}{|\mathcal{S}_i|} \sum_{j \in \mathcal{S}_i} x_j, \qquad \sigma_i^2 = \frac{1}{|\mathcal{S}_i|} \sum_{j \in \mathcal{S}_i} (x_j - \mu_i)^2 . \]

The only thing that distinguishes the four methods is the definition of $\mathcal{S}_i$, the pooling set over which statistics are aggregated. This is the central insight that makes the spectrum legible: each method answers the question of which activations share a normalizer.

Two formal properties, invariance to rescaling and invariance to shifting, hold for every choice of $\mathcal{S}_i$ and are worth stating precisely, because they explain why the family stabilizes training at all.

Proposition (scale and shift invariance of the normalized statistic)

Fix a pooling set $\mathcal{S}$ and let $\hat{x}$ denote the normalized output before the affine step, with $\epsilon = 0$. For any scalars $a > 0$ and $b$, replacing every activation $x_j \to a x_j + b$ for $j \in \mathcal{S}$ leaves $\hat{x}$ unchanged. For $a < 0$ the output is negated, $\hat{x} \to -\hat{x}$.

The proof is a one line computation. Under $x_j \to a x_j + b$ the pooled mean becomes $a\mu + b$ and the pooled standard deviation becomes $|a|\sigma$, so

\[ \frac{(a x_i + b) - (a\mu + b)}{|a|\sigma} = \frac{a (x_i - \mu)}{|a|\sigma} = \operatorname{sign}(a)\,\frac{x_i - \mu}{\sigma}, \]

which equals $\hat{x}_i$ when $a > 0$ and $-\hat{x}_i$ when $a < 0$. The practical reading is that the layer feeding a normalization layer cannot change the loss by rescaling its output by a positive factor or offsetting it uniformly inside the pooling set. With $\epsilon > 0$ the invariance is approximate, and it tightens as $\sigma^2$ grows relative to $\epsilon$. This decouples the optimization of upstream weight magnitude from the optimization of direction, which is consistent with the smoother loss surface reported by Santurkar et al. (2018).
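
The sign behavior in the computation above can be checked numerically. The following is a minimal NumPy sketch rather than anything from the original text: `standardize` is a hypothetical helper that normalizes a single pooling set with $\epsilon = 0$, and the scale $3$ and offset $5$ are arbitrary illustrative values.

```python
import numpy as np

def standardize(x, eps=0.0):
    """Center and scale one pooling set; eps=0 matches the proposition."""
    mu = x.mean()
    sigma = np.sqrt(x.var() + eps)  # population variance, as in the text
    return (x - mu) / sigma

rng = np.random.default_rng(0)
x = rng.normal(size=16)                 # activations in one pooling set
base = standardize(x)

same = standardize(3.0 * x + 5.0)       # positive scale: output unchanged
flipped = standardize(-3.0 * x + 5.0)   # negative scale: output negated

print(np.allclose(same, base))      # True
print(np.allclose(flipped, -base))  # True
```

A positive scale and any offset leave the output untouched, while a negative scale flips its sign, which is the $\operatorname{sign}(a)$ factor in the derivation.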

208.1.1 1.1 The Four Pooling Sets

Let $i = (n, c, h, w)$. The four schemes correspond to four choices.

Batch Norm pools across the batch and spatial dimensions, holding the channel fixed: $\mathcal{S}_i = \{ j : c_j = c \}$. Statistics depend on the whole batch, so $\mu$ and $\sigma$ have shape $(C,)$.
Layer Norm pools across all channels and spatial positions of a single example: $\mathcal{S}_i = \{ j : n_j = n \}$. Statistics have shape $(N,)$ and are computed independently per example.
Instance Norm pools across spatial positions only, holding both example and channel fixed: $\mathcal{S}_i = \{ j : n_j = n,\ c_j = c \}$. Statistics have shape $(N, C)$.
Group Norm partitions the $C$ channels into $G$ groups and pools across spatial positions and the channels within one group: $\mathcal{S}_i = \{ j : n_j = n,\ \lfloor c_j / (C/G) \rfloor = \lfloor c / (C/G) \rfloor \}$. Statistics have shape $(N, G)$.

A compact way to remember this: imagine the activation tensor as a stack of $(C, H, W)$ blocks, one per example. Batch Norm normalizes a slab that cuts across all examples at a fixed channel. Layer Norm normalizes an entire per example block. Instance Norm normalizes one channel within one block. Group Norm normalizes a contiguous band of channels within one block.

shape (N, C, H, W); reduce over the marked axes
BatchNorm   : reduce over (N, H, W)   per channel
LayerNorm   : reduce over (C, H, W)   per sample
InstanceNorm: reduce over (H, W)      per (sample, channel)
GroupNorm   : reduce over (H, W, C/G) per (sample, group)
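
The reductions listed above map directly onto array axes. The sketch below is an illustration in NumPy under assumed sizes ($N = 4$, $C = 6$, $H = W = 5$, $G = 3$, none of which come from the text). It computes only the pooled means and prints their shapes, which should agree with the shapes stated for each scheme.

```python
import numpy as np

N, C, H, W, G = 4, 6, 5, 5, 3   # illustrative sizes; G must divide C
x = np.random.default_rng(0).normal(size=(N, C, H, W))

stats = {
    "BatchNorm": x.mean(axis=(0, 2, 3)),
    "LayerNorm": x.mean(axis=(1, 2, 3)),
    "InstanceNorm": x.mean(axis=(2, 3)),
    # split C into G contiguous groups of C // G channels, then pool each group
    "GroupNorm": x.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4)),
}
for name, mu in stats.items():
    print(f"{name:12s} mean shape {mu.shape}")
```

The reshape places channel $c$ in group $\lfloor c / (C/G) \rfloor$, so the groups are the contiguous bands described in the definition of the Group Norm pooling set.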

The following diagram places the schemes on the axis that orders them, the size of the per example pooling set, with Batch Norm set apart as the only batch coupled member.

flowchart LR
    B["LayerNorm: pool over C, H, W per sample, G equals 1"] --> C["GroupNorm: pool over H, W and channels in a group"]
    C --> D["InstanceNorm: pool over H, W per channel, G equals C"]
    A["BatchNorm: pool over N, H, W per channel, batch coupled"]

The crucial structural difference is that Batch Norm is the only one of the four whose statistics couple different examples in the batch. The other three are computed within a single example and are therefore independent of batch size. This single fact explains most of the practical behavior that follows.

208.1.2 1.2 A Small Worked Example

Concrete numbers make the pooling sets tangible. Take a tiny tensor with $N = 1$ example, $C = 2$ channels, and a $2 \times 2$ spatial grid, so $H = W = 2$. Let the two channel maps be

\[ x_{0,0} = \begin{pmatrix} 1 & 3 \\ 5 & 7 \end{pmatrix}, \qquad x_{0,1} = \begin{pmatrix} 0 & 4 \\ 8 & 12 \end{pmatrix}. \]

Instance Norm normalizes each channel over its four spatial entries. Channel $0$ has mean $4$ and variance $5$, so its standard deviation is $\sqrt{5} \approx 2.236$. Channel $1$ has mean $6$ and variance $20$, so its standard deviation is $\sqrt{20} \approx 4.472$. The two channels are scaled by different amounts, $2.236$ and $4.472$, which is exactly the behavior that erases per channel contrast.

Layer Norm, or equivalently Group Norm with $G = 1$, pools all eight entries together. Their mean is $5$ and their variance is $\tfrac{1}{8}\sum (x - 5)^2 = 13.5$, giving one standard deviation $\sqrt{13.5} \approx 3.674$ applied to both channels. Because a single normalizer is shared, the larger spread of channel $1$ relative to channel $0$ survives the normalization, preserving cross channel scale information that Instance Norm discards. Group Norm with $G = 2$ on this tensor coincides with Instance Norm, since each group then holds one channel. This eight entry computation is the whole story scaled down: the only thing that changed between the methods was which entries entered the sums.
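
The arithmetic of this example can be reproduced in a few lines. The sketch below is an illustrative NumPy check, with the tensor laid out as $(N, C, H, W)$ and $\epsilon$ omitted for clarity. The ratio variables are hypothetical names introduced here to measure how much of the spread of channel $1$ relative to channel $0$ survives each normalization.

```python
import numpy as np

# the 1 x 2 x 2 x 2 tensor from the example, laid out as (N, C, H, W)
x = np.array([[[[1.0, 3.0], [5.0, 7.0]],
               [[0.0, 4.0], [8.0, 12.0]]]])

in_mean, in_var = x.mean(axis=(2, 3)), x.var(axis=(2, 3))        # per channel
ln_mean, ln_var = x.mean(axis=(1, 2, 3)), x.var(axis=(1, 2, 3))  # whole example
print(in_mean, in_var)   # means 4 and 6, variances 5 and 20
print(ln_mean, ln_var)   # mean 5, variance 13.5

x_in = (x - in_mean[:, :, None, None]) / np.sqrt(in_var[:, :, None, None])
x_ln = (x - ln_mean[:, None, None, None]) / np.sqrt(ln_var[:, None, None, None])

# spread of channel 1 relative to channel 0 after each normalization
ratio_in = x_in[0, 1].std() / x_in[0, 0].std()
ratio_ln = x_ln[0, 1].std() / x_ln[0, 0].std()
print(ratio_in, ratio_ln)   # approximately 1.0 and 2.0
```

Instance Norm equalizes the two channels, while the shared normalizer keeps the original factor of two between their standard deviations, $\sqrt{20} / \sqrt{5} = 2$.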

208.2 2. Why Batch Norm Breaks

Batch Norm estimates per channel mean and variance from the mini batch. With batch size $m$, the standard error of the mean estimate scales as $1/\sqrt{m}$, so the statistics become noisy when $m$ is small. Two distinct problems arise.

First, the normalization itself becomes inaccurate. When $m = 2$, the estimated $\sigma^2$ for a channel is computed from very few samples and fluctuates wildly from step to step. This injects noise into every downstream activation, and that noise compounds through depth.
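
A small simulation makes the $1/\sqrt{m}$ behavior concrete. This is a toy sketch under a stated assumption, not a model of any real network: each example contributes its own offset shared by all of its pixels, plus small pixel noise, so examples rather than pixels are the effectively independent samples. The function `batch_mean_spread` and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8
trials = 2000

def batch_mean_spread(m):
    """Std across trials of the per channel batch mean for batch size m."""
    # toy model (an assumption): one offset per example plus small pixel noise
    offsets = rng.normal(size=(trials, m, 1, 1))
    pixels = 0.1 * rng.normal(size=(trials, m, H, W))
    x = offsets + pixels
    return x.mean(axis=(1, 2, 3)).std()

spread = {m: batch_mean_spread(m) for m in (2, 8, 32)}
for m, s in spread.items():
    # second column is the 1 / sqrt(m) prediction
    print(m, round(float(s), 3), round(1 / np.sqrt(m), 3))
```

Under this assumption the spread of the batch mean tracks $1/\sqrt{m}$ closely, so going from $m = 32$ to $m = 2$ makes the statistic about four times noisier.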

Second, there is a train and test mismatch. At training time Batch Norm uses the current batch statistics, but at inference it uses running averages $\mu_{\text{run}}, \sigma^2_{\text{run}}$ accumulated during training. If the batch statistics are noisy or if the training distribution of batch statistics differs from a single example evaluation, the running estimates are biased and accuracy degrades.
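
The two code paths can be sketched as follows. `TinyBatchNorm` is a hypothetical, simplified class written for illustration: it omits the affine step, uses the population variance for the running estimate, and assumes an exponential moving average with momentum $0.1$. Framework implementations differ in such details.

```python
import numpy as np

class TinyBatchNorm:
    """Per channel batch norm for (N, C, H, W) inputs, without the affine step."""

    def __init__(self, channels, momentum=0.1, eps=1e-5):
        self.run_mean = np.zeros(channels)
        self.run_var = np.ones(channels)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=(0, 2, 3)), x.var(axis=(0, 2, 3))
            # exponential moving average of the batch statistics
            self.run_mean += self.momentum * (mean - self.run_mean)
            self.run_var += self.momentum * (var - self.run_var)
        else:
            mean, var = self.run_mean, self.run_var
        shape = (1, -1, 1, 1)
        return (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + self.eps)

rng = np.random.default_rng(0)
bn = TinyBatchNorm(channels=3)
for _ in range(200):                       # "training" with batch size 2
    bn(rng.normal(size=(2, 3, 4, 4)), training=True)

batch = rng.normal(size=(2, 3, 4, 4))
gap = np.abs(bn(batch, training=True) - bn(batch, training=False)).max()
print(gap)   # nonzero: the same input is normalized differently in the two modes
```

The gap is the train and test discrepancy in miniature: the same input passes through two different normalizers depending on the mode.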

These problems are acute precisely in the regimes that matter for high resolution vision. Detection, segmentation, and video models consume large inputs, so memory forces batch sizes of one or two per device. The error of Batch Norm rises sharply as $m$ falls below roughly eight. A model that trains well at $m = 32$ can lose several points of accuracy at $m = 2$ for no reason other than statistical noise in the normalizer.

A further subtlety is that Batch Norm makes the loss for one example depend on the other examples that happen to share its batch. This violates the usual independence assumption, complicates theoretical analysis, and can leak information across examples in ways that matter for tasks like contrastive learning and certain sequence models.
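
The coupling is easy to observe directly. In the sketch below, written in NumPy with illustrative shapes, the batch mates of example $0$ are replaced and the normalized output for example $0$ is compared before and after. `batch_norm` and `instance_norm` are hypothetical minimal helpers without affine parameters; Instance Norm stands in for any per example normalizer.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 5, 5))
x_swapped = x.copy()
x_swapped[1:] = rng.normal(loc=2.0, size=(3, 3, 5, 5))  # new batch mates, example 0 untouched

bn_changed = not np.allclose(batch_norm(x)[0], batch_norm(x_swapped)[0])
in_changed = not np.allclose(instance_norm(x)[0], instance_norm(x_swapped)[0])
print(bn_changed, in_changed)   # True False
```

Example $0$ is bit for bit identical in both batches, yet its Batch Norm output moves because the pooled statistics moved.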

208.3 3. Group Normalization

Group Norm, introduced by Wu and He, removes the batch dependence entirely while retaining the channel structure that vision models rely on. It divides the $C$ channels into $G$ groups and normalizes within each group, pooling over spatial locations and the channels in that group, for each example independently.

208.3.1 3.1 Definition and Special Cases

With $G$ groups, each group contains $C/G$ channels. The statistics for example $n$ and group $g$ are

\[ \mu_{ng} = \frac{1}{(C/G) H W} \sum_{c \in g} \sum_{h, w} x_{nchw}, \qquad \sigma_{ng}^2 = \frac{1}{(C/G) H W} \sum_{c \in g} \sum_{h, w} (x_{nchw} - \mu_{ng})^2 . \]

Group Norm interpolates between the two extremes of single example normalization. When $G = 1$, all channels form one group and Group Norm reduces to Layer Norm. When $G = C$, each channel is its own group and Group Norm reduces to Instance Norm. The interesting regime is intermediate, and a default of $G = 32$ groups works well across many architectures.

GroupNorm(G=1)   == LayerNorm
GroupNorm(G=C)   == InstanceNorm
GroupNorm(G=32)  == practical default for vision
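
The two identities can be verified numerically. The following NumPy sketch uses hypothetical helper functions without affine parameters and an arbitrary input of shape $(2, 8, 4, 4)$. It compares only the normalization step, so it says nothing about how a particular framework parameterizes $\gamma$ and $\beta$.

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    N, C, H, W = x.shape
    assert C % G == 0, "G must divide C"
    xg = x.reshape(N, G, C // G, H, W)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(2, 8, 4, 4))
print(np.allclose(group_norm(x, G=1), layer_norm(x)))     # True
print(np.allclose(group_norm(x, G=8), instance_norm(x)))  # True
```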

208.3.2 3.2 Why Grouping Channels Is Principled

Grouping is not an arbitrary trick. Channels in a convolutional layer are not independent. Classical features such as oriented edges at different frequencies, or color and texture filter banks, naturally form clusters of related responses. Group Norm respects this by sharing a normalizer within a group, which assumes that channels in the same group have comparable scale. Layer Norm goes further and assumes all channels share one distribution, which is often too strong for convolutional features because different channels can have genuinely different scales. Instance Norm goes to the opposite extreme and normalizes each channel separately, discarding all cross channel scale information. Group Norm occupies the productive middle ground.

208.3.3 3.3 Empirical Behavior

The defining property of Group Norm is that its accuracy is essentially flat as a function of batch size, because its computation never touches the batch axis. On ImageNet classification with a ResNet-50, Wu and He report that Group Norm is comparable to Batch Norm at typical batch sizes and has $10.6$ percentage points lower error at $m = 2$, where Batch Norm collapses. There is no running statistic and no train and test discrepancy, since the same per example computation is used at training and inference. The cost is that at large batch sizes Batch Norm sometimes retains a small edge, because the noise it injects acts as a mild regularizer that Group Norm lacks.

208.4 4. Instance Normalization

Instance Norm normalizes each channel of each example over its spatial extent. It was introduced by Ulyanov, Vedaldi, and Lempitsky in the context of feed forward style transfer, and the reason it helps there is illuminating.

208.4.1 4.1 Definition

For example $n$ and channel $c$,

\[ \mu_{nc} = \frac{1}{H W} \sum_{h, w} x_{nchw}, \qquad \sigma_{nc}^2 = \frac{1}{H W} \sum_{h, w} (x_{nchw} - \mu_{nc})^2 . \]

This is Group Norm with $G = C$. Each spatial feature map is individually centered and scaled.

208.4.2 4.2 The Connection to Style

The key observation in neural style transfer is that the artistic style of an image is captured largely by the statistics of feature activations, in particular by the means and the Gram matrices of channel responses. Two images in the same style have similar per channel feature statistics regardless of content. A style transfer network must therefore wash out the original contrast and intensity statistics of the content image so that the target style statistics can be imposed.

Instance Norm does exactly this. By removing the per channel mean and variance of each feature map, it discards instance specific contrast information that would otherwise bleed through and interfere with the target style. Empirically, swapping Batch Norm for Instance Norm in a style transfer generator produces markedly sharper and more faithful stylization and removes a wash of residual content contrast. The mechanism is that normalization to zero mean and unit variance per channel makes the network invariant to the global contrast of the input, which is precisely a property we want when re-rendering content in a new style.
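
The contrast invariance can be illustrated with a synthetic feature map. The sketch below is an assumption-laden toy rather than a style transfer experiment: it applies a positive gain and an offset to each channel of a random tensor, standing in for a change of contrast and brightness, and compares Instance Norm with Layer Norm. All helper names and constants are hypothetical, and $\epsilon = 0$ is used so the invariance is exact.

```python
import numpy as np

def instance_norm(x):
    mu = x.mean(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=(2, 3), keepdims=True))

def layer_norm(x):
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=(1, 2, 3), keepdims=True))

rng = np.random.default_rng(0)
content = rng.normal(size=(1, 3, 8, 8))
gain = np.array([0.5, 2.0, 4.0]).reshape(1, 3, 1, 1)     # per channel contrast
offset = np.array([1.0, -3.0, 0.2]).reshape(1, 3, 1, 1)  # per channel brightness
recontrasted = gain * content + offset

in_same = np.allclose(instance_norm(content), instance_norm(recontrasted))
ln_same = np.allclose(layer_norm(content), layer_norm(recontrasted))
print(in_same, ln_same)   # True False
```

Instance Norm maps both versions to the same features, while a normalizer shared across channels still sees the difference.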

208.4.3 4.3 Conditional Instance Norm and AdaIN

Instance Norm generalizes naturally to controllable generation. Conditional Instance Norm learns a separate affine pair $(\gamma_s, \beta_s)$ for each style $s$, so a single network can render many styles by selecting the affine parameters. Adaptive Instance Norm, or AdaIN, takes this further by computing the affine parameters directly from a style input:

\[ \text{AdaIN}(x, y) = \sigma(y)\, \frac{x - \mu(x)}{\sigma(x)} + \mu(y), \]

where $\mu(x), \sigma(x)$ are the per channel instance statistics of the content feature $x$ and $\mu(y), \sigma(y)$ are those of a style feature $y$. AdaIN normalizes the content to remove its own style and then re-imposes the style of $y$ through the affine step. This idea reappears in high quality image generators, where per layer affine parameters predicted from a latent code drive the synthesis of structure and texture at each resolution.
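
The formula translates almost line for line into array code. The following is a minimal NumPy sketch of the AdaIN operation on raw feature tensors, not the full method of Huang and Belongie, which also involves an encoder and a decoder. The function name `adain`, the added $\epsilon$, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """AdaIN on (N, C, H, W) features: impose the per channel mean and std of `style`."""
    mu_c = content.mean(axis=(2, 3), keepdims=True)
    sd_c = np.sqrt(content.var(axis=(2, 3), keepdims=True) + eps)
    mu_s = style.mean(axis=(2, 3), keepdims=True)
    sd_s = np.sqrt(style.var(axis=(2, 3), keepdims=True) + eps)
    return sd_s * (content - mu_c) / sd_c + mu_s

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(1, 4, 16, 16))
style = rng.normal(loc=3.0, scale=0.5, size=(1, 4, 8, 8))   # spatial size may differ

out = adain(content, style)
print(out.shape)
print(np.allclose(out.mean(axis=(2, 3)), style.mean(axis=(2, 3))))            # True
print(np.allclose(out.std(axis=(2, 3)), style.std(axis=(2, 3)), atol=1e-3))   # True
```

The output keeps the spatial layout of the content features but carries the per channel statistics of the style features, which is the sense in which the style is re-imposed through the affine step.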

208.5 5. The Spectrum and How to Choose

The four methods form a clean spectrum governed by how much the normalizer shares across examples and across channels.

| Method | Pooling set | Batch dependent | Cross channel pooling |
| --- | --- | --- | --- |
| Batch Norm | $(N, H, W)$ per channel | yes | no |
| Layer Norm | $(C, H, W)$ per sample | no | all channels |
| Group Norm | $(H, W, C/G)$ per (sample, group) | no | within group |
| Instance Norm | $(H, W)$ per (sample, channel) | no | no |

Reading the table, Group Norm with $G$ swept from $1$ to $C$ traces a continuous path from Layer Norm to Instance Norm, all of it free of batch dependence. Batch Norm stands apart as the only batch coupled scheme.

Practical guidance follows directly from this structure.

Use Batch Norm when the batch is large and stable, as in standard ImageNet classification with moderate to large batches. The regularizing noise often gives a small accuracy gain.
Use Group Norm when the batch is small or variable, as in detection, segmentation, and video, or when you want results that do not depend on batch size or device count. It is the safe default for memory constrained vision.
Use Layer Norm for sequence models and transformers, where the channel or feature axis is the natural unit and there is no spatial grid. Layer Norm is Group Norm with one group.
Use Instance Norm for style transfer and image to image translation, where removing per sample per channel contrast is the whole point.

A useful diagnostic is to ask whether the task wants to preserve or discard global per channel intensity. Classification wants to preserve discriminative scale information, which argues against Instance Norm. Stylization wants to discard input contrast, which argues for it. Group Norm gives a tunable knob between these regimes.

208.5.1 5.1 Pitfalls

A few failure modes recur often enough to call out.

The number of groups must divide the channel count. Choosing $G = 32$ on a layer with $C = 48$ channels is a configuration error. When channel counts vary across a network, a common robust choice is to fix the channels per group, for example $C/G = 16$, rather than fixing $G$, so that every layer whose channel count is a multiple of $16$ divides cleanly.
Group Norm degenerates on tiny spatial maps. With $G = C$ the pooling set is just the $H W$ spatial entries, so a $1 \times 1$ feature map leaves a single sample: the variance is zero and the normalized output collapses to zero before the affine step. Late stage feature maps in classifiers and the per token features in some architectures are exactly this regime, and Layer Norm or a small $G$ is safer there.
Do not mix the train and inference paths. The appeal of the single example methods is that the same computation runs at training and inference, with no running averages to maintain. Reintroducing a running statistic, or freezing a Batch Norm layer and a Group Norm layer inconsistently when fine tuning, silently recreates the train and test mismatch the switch was meant to avoid.
Instance Norm erases information by design. Using it in a classifier removes the per channel intensity that often carries class signal, which is why it underperforms there. Reach for it only when contrast removal is the goal.
The affine parameters still matter. Learning $\gamma$ and $\beta$ per channel restores representational capacity that the normalization removed. Omitting them, or sharing them too coarsely, can leave the normalized features unable to recover a needed scale.
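
The first two pitfalls can be reproduced in a few lines. In this NumPy sketch, `groups_for` is a hypothetical helper that derives $G$ from a fixed number of channels per group, and `group_norm` is a minimal implementation without affine parameters. Neither is taken from any library.

```python
import numpy as np

def groups_for(channels, channels_per_group=16):
    """Hypothetical helper: pick G from a fixed number of channels per group."""
    if channels % channels_per_group != 0:
        raise ValueError(f"{channels} channels do not split into groups of {channels_per_group}")
    return channels // channels_per_group

def group_norm(x, G, eps=1e-5):
    N, C, H, W = x.shape
    if C % G != 0:
        raise ValueError(f"G={G} does not divide C={C}")
    xg = x.reshape(N, G, C // G, H, W)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)

print(groups_for(48), groups_for(64))   # 3 4

x = np.random.default_rng(0).normal(size=(2, 48, 1, 1))   # a 1 x 1 feature map
collapsed = group_norm(x, G=48)   # G = C: every pooling set holds a single value
pooled = group_norm(x, G=3)       # 16 values per pooling set
print(np.abs(collapsed).max(), np.abs(pooled).max())   # first value is 0.0
```

Calling `group_norm(x, G=32)` on the same input raises a `ValueError`, which is the $C = 48$, $G = 32$ configuration error surfacing early instead of inside a reshape.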

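Mature open source frameworks ship all four layers directly, so none of this requires custom kernels. PyTorch provides nn.BatchNorm2d, nn.GroupNorm, nn.InstanceNorm2d, and nn.LayerNorm, and Flax and Haiku expose comparable layers, which makes swapping among them close to a one line change during ablation. The constructor arguments do differ: nn.GroupNorm takes the number of groups and the channel count, while nn.LayerNorm takes the shape of the trailing dimensions to normalize.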

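# one implementation, four behaviors (NumPy)
import numpy as np

def normalize(x, axes, gamma, beta, eps=1e-5):
    # pool the mean and the population variance over the given axes
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    # gamma and beta must broadcast against x, e.g. shape (1, C, 1, 1)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# x has shape (N, C, H, W)
# BatchNorm    : axes = (0, 2, 3)
# LayerNorm    : axes = (1, 2, 3)
# InstanceNorm : axes = (2, 3)
# GroupNorm    : reshape x to (N, G, C // G, H, W) and gamma, beta to
#                (1, G, C // G, 1, 1), use axes = (2, 3, 4), then reshape
#                the result back to (N, C, H, W)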

208.6 6. Theoretical Notes

Two properties unify the batch independent methods. First, each is invariant to a positive rescaling and a shift applied uniformly within a pooling set, which is the proposition stated in Section 1. Second, their Jacobian has a benign structure: the centering and scaling operations bound the effective gradient scale, which is one reason normalization stabilizes training.

208.6.1 6.1 The Backward Pass

The coupling that normalization introduces is easiest to see by differentiating through a single pooling set. Fix one set $\mathcal{S}$ of size $m = |\mathcal{S}|$, write the centered and scaled output as $\hat{x}_i = (x_i - \mu)/\sqrt{\sigma^2 + \epsilon}$, and let $g_i = \partial L / \partial \hat{x}_i$ be the incoming gradient. Standard differentiation through the shared $\mu$ and $\sigma^2$ gives

\[ \frac{\partial L}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \left( g_i - \frac{1}{m} \sum_{j \in \mathcal{S}} g_j - \hat{x}_i \cdot \frac{1}{m} \sum_{j \in \mathcal{S}} g_j \hat{x}_j \right). \]

The structure is informative. The first subtracted term is the mean of the incoming gradients, so a gradient component that is uniform across the pooling set is removed. The second subtracted term is the projection of the gradient onto the current normalized output, so a component aligned with $\hat{x}$ is also removed, exactly when $\epsilon = 0$ and up to a residual of relative size $\epsilon / (\sigma^2 + \epsilon)$ otherwise. In words, the backward pass orthogonalizes the gradient against the two directions that scaling and shifting the activations cannot affect, namely the constant direction and the $\hat{x}$ direction. This is the exact backward image of the forward invariance proposition, and it is why a normalization layer cannot pass gradient signal that would only renormalize its own input. The factor $1/\sqrt{\sigma^2 + \epsilon}$ in front shows that the layer also rescales the gradient by the inverse activation scale, which is the bounding effect that keeps gradient magnitudes well behaved through depth.
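
The closed form can be checked against finite differences. The sketch below is a NumPy illustration under assumptions chosen for convenience: a single pooling set of ten values, and a linear loss $L = \sum_i w_i \hat{x}_i$ so that the incoming gradient is simply $g = w$. The names `forward` and `backward` are hypothetical.

```python
import numpy as np

def forward(x, eps):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def backward(x, g, eps):
    """dL/dx from g = dL/dx_hat, using the closed form in the text."""
    x_hat = forward(x, eps)
    scale = 1.0 / np.sqrt(x.var() + eps)
    return scale * (g - g.mean() - x_hat * (g * x_hat).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=10)
w = rng.normal(size=10)   # L = sum(w * x_hat), so the incoming gradient is w
eps = 1e-5

analytic = backward(x, w, eps)

# central finite differences on L(x) = w . x_hat(x)
h = 1e-6
numeric = np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = h
    numeric[i] = (w @ forward(x + e, eps) - w @ forward(x - e, eps)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
print(abs(analytic.sum()))                # ~0: no component along the constant direction
print(abs(analytic @ forward(x, eps)))    # small, of order eps: almost none along x_hat
```

The last two numbers are the orthogonality statements from the paragraph above: the gradient with respect to the input has essentially no component along the constant direction or along $\hat{x}$.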

208.6.2 6.2 Bias Variance Tradeoff in the Pooling Set

The choice of pooling set is ultimately a bias variance tradeoff in statistics estimation. Larger pooling sets give lower variance estimates of $\mu$ and $\sigma$ but impose a stronger assumption that the pooled activations are exchangeable. Batch Norm has a large pooling set but pays with batch dependence. Instance Norm has the smallest pooling set, $H W$ samples, which is fine for large feature maps but unstable for small ones. Group Norm tunes the pooling set size through $G$, trading estimation variance against the validity of the within group exchangeability assumption, which is why an intermediate $G$ tends to win.

208.7 7. Summary

Batch, layer, group, and instance normalization are one algorithm with four choices of pooling set. Batch Norm couples examples and excels with large batches but degrades when batches are small. The three single example methods are immune to batch size. Group Norm interpolates between Layer Norm and Instance Norm and is the robust default for memory limited vision tasks. Instance Norm, the extreme of per channel single example normalization, discards input contrast and is the workhorse of style transfer, generalizing to conditional and adaptive variants that drive controllable generation. Choosing among them reduces to two questions: can you afford a large stable batch, and does your task want to preserve or erase per channel intensity.

208.8 References

Ioffe, S., and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015. https://arxiv.org/abs/1502.03167
Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer Normalization. 2016. https://arxiv.org/abs/1607.06450
Wu, Y., and He, K. Group Normalization. ECCV, 2018. https://arxiv.org/abs/1803.08494
Ulyanov, D., Vedaldi, A., and Lempitsky, V. Instance Normalization: The Missing Ingredient for Fast Stylization. 2016. https://arxiv.org/abs/1607.08022
Dumoulin, V., Shlens, J., and Kudlur, M. A Learned Representation For Artistic Style. ICLR, 2017. https://arxiv.org/abs/1610.07629
Huang, X., and Belongie, S. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. ICCV, 2017. https://arxiv.org/abs/1703.06868
Karras, T., Laine, S., and Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR, 2019. https://arxiv.org/abs/1812.04948
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How Does Batch Normalization Help Optimization? NeurIPS, 2018. https://arxiv.org/abs/1805.11604

# Group and Instance Normalization Feature normalization is one of the load bearing ideas in modern deep learning. By rescaling intermediate activations to a stable distribution, normalization layers smooth the optimization landscape, reduce sensitivity to initialization, and permit larger learning rates. Batch Normalization launched this line of work, but its reliance on batch statistics creates failure modes that motivated a family of alternatives. This chapter develops a unified view of batch, layer, group, and instance normalization, derives where each sits along a single design axis, and explains why group normalization rescues small batch vision training while instance normalization underpins style transfer. ## 1. A Unified Formulation Consider the activation tensor produced by a convolutional layer, with shape $(N, C, H, W)$, where $N$ indexes examples in a mini batch, $C$ indexes feature channels, and $H, W$ index spatial positions. Write a single scalar activation as $x_{nchw}$. Every normalization scheme in this chapter can be written as $$ \hat{x}_i = \frac{x_i - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta, $$ where $i$ is shorthand for an index tuple $(n, c, h, w)$, the constant $\epsilon$ guards against division by zero, and $\gamma, \beta$ are learnable affine parameters typically defined per channel. The mean and variance are computed over a set $\mathcal{S}_i$ of activations: $$ \mu_i = \frac{1}{|\mathcal{S}_i|} \sum_{j \in \mathcal{S}_i} x_j, \qquad \sigma_i^2 = \frac{1}{|\mathcal{S}_i|} \sum_{j \in \mathcal{S}_i} (x_j - \mu_i)^2 . $$ The only thing that distinguishes the four methods is the definition of $\mathcal{S}_i$, the pooling set over which statistics are aggregated. This is the central insight that makes the spectrum legible: each method answers the question of which activations share a normalizer. 
Two formal properties hold for every choice of $\mathcal{S}_i$ and are worth stating precisely, because they explain why the family stabilizes training at all. ::: {.callout-note title="Proposition (scale and shift invariance of the normalized statistic)"} Fix a pooling set $\mathcal{S}$ and let $\hat{x}$ denote the normalized output before the affine step, with $\epsilon = 0$. For any scalars $a \neq 0$ and $b$, replacing every activation $x_j \to a x_j + b$ for $j \in \mathcal{S}$ leaves $\hat{x}$ unchanged. ::: The proof is a one line computation. Under $x_j \to a x_j + b$ the pooled mean becomes $a\mu + b$ and the pooled standard deviation becomes $|a|\sigma$, so $$ \frac{(a x_i + b) - (a\mu + b)}{|a|\sigma} = \frac{a (x_i - \mu)}{|a|\sigma} = \operatorname{sign}(a)\,\frac{x_i - \mu}{\sigma}, $$ which equals $\hat{x}_i$ up to the sign that a nonzero $a$ cannot flip in practice, since deep networks place the normalizer after a linear map whose scale is absorbed harmlessly. The practical reading is that the layer feeding a normalization layer cannot change the loss by globally rescaling or offsetting its output inside the pooling set. This decouples the optimization of upstream weight magnitude from the optimization of direction, which is the mechanism behind the smoother loss surface reported by Santurkar et al. (reference 8). ### 1.1 The Four Pooling Sets Let $i = (n, c, h, w)$. The four schemes correspond to four choices. - Batch Norm pools across the batch and spatial dimensions, holding the channel fixed: $\mathcal{S}_i = \{ j : c_j = c \}$. Statistics depend on the whole batch, so $\mu$ and $\sigma$ have shape $(C,)$. - Layer Norm pools across all channels and spatial positions of a single example: $\mathcal{S}_i = \{ j : n_j = n \}$. Statistics have shape $(N,)$ and are computed independently per example. - Instance Norm pools across spatial positions only, holding both example and channel fixed: $\mathcal{S}_i = \{ j : n_j = n,\ c_j = c \}$. 
Statistics have shape $(N, C)$. - Group Norm partitions the $C$ channels into $G$ groups and pools across spatial positions and the channels within one group: $\mathcal{S}_i = \{ j : n_j = n,\ \lfloor c_j / (C/G) \rfloor = \lfloor c / (C/G) \rfloor \}$. Statistics have shape $(N, G)$. A compact way to remember this: imagine the activation tensor as a stack of $(C, H, W)$ blocks, one per example. Batch Norm normalizes a blue slab that cuts across all examples at a fixed channel. Layer Norm normalizes an entire per example block. Instance Norm normalizes one channel within one block. Group Norm normalizes a contiguous band of channels within one block. ```text shape (N, C, H, W); reduce over the marked axes BatchNorm : reduce over (N, H, W) per channel LayerNorm : reduce over (C, H, W) per sample InstanceNorm: reduce over (H, W) per (sample, channel) GroupNorm : reduce over (H, W, c/G) per (sample, group) ``` The following diagram places the schemes on the axis that orders them, the size of the per example pooling set, with Batch Norm set apart as the only batch coupled member. ```{mermaid} flowchart LR B["LayerNorm: pool over C, H, W per sample, G equals 1"] --> C["GroupNorm: pool over H, W and channels in a group"] C --> D["InstanceNorm: pool over H, W per channel, G equals C"] A["BatchNorm: pool over N, H, W per channel, batch coupled"] ``` The crucial structural difference is that Batch Norm is the only one of the four whose statistics couple different examples in the batch. The other three are computed within a single example and are therefore independent of batch size. This single fact explains most of the practical behavior that follows. ### 1.2 A Small Worked Example Concrete numbers make the pooling sets tangible. Take a tiny tensor with $N = 1$ example, $C = 2$ channels, and a $2 \times 2$ spatial grid, so $H = W = 2$. 
Let the two channel maps be $$ x_{0,0} = \begin{pmatrix} 1 & 3 \\ 5 & 7 \end{pmatrix}, \qquad x_{0,1} = \begin{pmatrix} 0 & 4 \\ 8 & 12 \end{pmatrix}. $$ Instance Norm normalizes each channel over its four spatial entries. Channel $0$ has mean $4$ and variance $5$, so its standard deviation is $\sqrt{5} \approx 2.236$. Channel $1$ has mean $6$ and variance $20$, so its standard deviation is $\sqrt{20} \approx 4.472$. The two channels are scaled by different amounts, $2.236$ and $4.472$, which is exactly the behavior that erases per channel contrast. Layer Norm, or equivalently Group Norm with $G = 1$, pools all eight entries together. Their mean is $5$ and their variance is $\tfrac{1}{8}\sum (x - 5)^2 = 13.5$, giving one standard deviation $\sqrt{13.5} \approx 3.674$ applied to both channels. Because a single normalizer is shared, the larger spread of channel $1$ relative to channel $0$ survives the normalization, preserving cross channel scale information that Instance Norm discards. Group Norm with $G = 2$ on this tensor coincides with Instance Norm, since each group then holds one channel. This four entry computation is the whole story scaled down: the only thing that changed between the methods was which entries entered the sums. ## 2. Why Batch Norm Breaks Batch Norm estimates per channel mean and variance from the mini batch. With batch size $m$, the standard error of the mean estimate scales as $1/\sqrt{m}$, so the statistics become noisy when $m$ is small. Two distinct problems arise. First, the normalization itself becomes inaccurate. When $m = 2$, the estimated $\sigma^2$ for a channel is computed from very few samples and fluctuates wildly from step to step. This injects noise into every downstream activation, and that noise compounds through depth. Second, there is a train and test mismatch. 
At training time Batch Norm uses the current batch statistics, but at inference it uses running averages $\mu_{\text{run}}, \sigma^2_{\text{run}}$ accumulated during training. If the batch statistics are noisy or if the training distribution of batch statistics differs from a single example evaluation, the running estimates are biased and accuracy degrades. These problems are acute precisely in the regimes that matter for high resolution vision. Detection, segmentation, and video models consume large inputs, so memory forces batch sizes of one or two per device. The error of Batch Norm rises sharply as $m$ falls below roughly eight. A model that trains well at $m = 32$ can lose several points of accuracy at $m = 2$ for no reason other than statistical noise in the normalizer. A further subtlety is that Batch Norm makes the loss for one example depend on the other examples that happen to share its batch. This violates the usual independence assumption, complicates theoretical analysis, and can leak information across examples in ways that matter for tasks like contrastive learning and certain sequence models. ## 3. Group Normalization Group Norm, introduced by Wu and He, removes the batch dependence entirely while retaining the channel structure that vision models rely on. It divides the $C$ channels into $G$ groups and normalizes within each group, pooling over spatial locations and the channels in that group, for each example independently. ### 3.1 Definition and Special Cases With $G$ groups, each group contains $C/G$ channels. The statistics for example $n$ and group $g$ are $$ \mu_{ng} = \frac{1}{(C/G) H W} \sum_{c \in g} \sum_{h, w} x_{nchw}, \qquad \sigma_{ng}^2 = \frac{1}{(C/G) H W} \sum_{c \in g} \sum_{h, w} (x_{nchw} - \mu_{ng})^2 . $$ Group Norm interpolates between the two extremes of single example normalization. When $G = 1$, all channels form one group and Group Norm reduces to Layer Norm. 
When $G = C$, each channel is its own group and Group Norm reduces to Instance Norm. The interesting regime is intermediate, and a default of $G = 32$ groups works well across many architectures.

```text
GroupNorm(G=1)  == LayerNorm
GroupNorm(G=C)  == InstanceNorm
GroupNorm(G=32) == practical default for vision
```

### 3.2 Why Grouping Channels Is Principled

Grouping is not an arbitrary trick. Channels in a convolutional layer are not independent. Classical features such as oriented edges at different frequencies, or color and texture filter banks, naturally form clusters of related responses. Group Norm respects this by sharing a normalizer within a group, which assumes that channels in the same group have comparable scale.

Layer Norm goes further and assumes all channels share one distribution, which is often too strong for convolutional features because different channels can have genuinely different scales. Instance Norm goes to the opposite extreme and normalizes each channel separately, discarding all cross channel scale information. Group Norm occupies the productive middle ground.

### 3.3 Empirical Behavior

The defining property of Group Norm is that its accuracy is essentially flat as a function of batch size, because its computation never touches the batch axis. On ImageNet classification with a ResNet-50, Group Norm matches Batch Norm at moderate batch sizes and substantially outperforms it at $m = 2$, where Batch Norm collapses. There is no running statistic and no train and test discrepancy, since the same per example computation is used at training and inference. The cost is that at large batch sizes Batch Norm sometimes retains a small edge, because the noise it injects acts as a mild regularizer that Group Norm lacks.

## 4. Instance Normalization

Instance Norm normalizes each channel of each example over its spatial extent.
It was introduced by Ulyanov, Vedaldi, and Lempitsky in the context of feed forward style transfer, and the reason it helps there is illuminating.

### 4.1 Definition

For example $n$ and channel $c$,

$$ \mu_{nc} = \frac{1}{H W} \sum_{h, w} x_{nchw}, \qquad \sigma_{nc}^2 = \frac{1}{H W} \sum_{h, w} (x_{nchw} - \mu_{nc})^2 . $$

This is Group Norm with $G = C$. Each spatial feature map is individually centered and scaled.

### 4.2 The Connection to Style

The key observation in neural style transfer is that the artistic style of an image is captured largely by the statistics of feature activations, in particular by the means and the Gram matrices of channel responses. Two images in the same style have similar per channel feature statistics regardless of content. A style transfer network must therefore wash out the original contrast and intensity statistics of the content image so that the target style statistics can be imposed.

Instance Norm does exactly this. By removing the per channel mean and variance of each feature map, it discards instance specific contrast information that would otherwise bleed through and interfere with the target style. Empirically, swapping Batch Norm for Instance Norm in a style transfer generator produces markedly sharper and more faithful stylization and removes a wash of residual content contrast. The mechanism is that normalization to zero mean and unit variance per channel makes the network invariant to the global contrast of the input, which is precisely a property we want when re-rendering content in a new style.

### 4.3 Conditional Instance Norm and AdaIN

Instance Norm generalizes naturally to controllable generation. Conditional Instance Norm learns a separate affine pair $(\gamma_s, \beta_s)$ for each style $s$, so a single network can render many styles by selecting the affine parameters.
Adaptive Instance Norm, or AdaIN, takes this further by computing the affine parameters directly from a style input:

$$ \text{AdaIN}(x, y) = \sigma(y)\, \frac{x - \mu(x)}{\sigma(x)} + \mu(y), $$

where $\mu(x), \sigma(x)$ are the per channel instance statistics of the content feature $x$ and $\mu(y), \sigma(y)$ are those of a style feature $y$. AdaIN normalizes the content to remove its own style and then re-imposes the style of $y$ through the affine step. This idea reappears in high quality image generators, where per layer affine parameters predicted from a latent code drive the synthesis of structure and texture at each resolution.

## 5. The Spectrum and How to Choose

The four methods form a clean spectrum governed by how much the normalizer shares across examples and across channels.

| Method | Pooling set | Batch dependent | Cross channel pooling |
|---|---|---|---|
| Batch Norm | $(N, H, W)$ per channel | yes | no |
| Layer Norm | $(C, H, W)$ per sample | no | all channels |
| Group Norm | $(H, W, C/G)$ per group | no | within group |
| Instance Norm | $(H, W)$ per channel | no | no |

Reading the table, Group Norm with $G$ swept from $1$ to $C$ traces a continuous path from Layer Norm to Instance Norm, all of it free of batch dependence. Batch Norm stands apart as the only batch coupled scheme.

Practical guidance follows directly from this structure.

- Use Batch Norm when the batch is large and stable, as in standard ImageNet classification with moderate to large batches. The regularizing noise often gives a small accuracy gain.
- Use Group Norm when the batch is small or variable, as in detection, segmentation, and video, or when you want results that do not depend on batch size or device count. It is the safe default for memory constrained vision.
- Use Layer Norm for sequence models and transformers, where the channel or feature axis is the natural unit and there is no spatial grid. Layer Norm is Group Norm with one group.
- Use Instance Norm for style transfer and image to image translation, where removing per sample per channel contrast is the whole point.

A useful diagnostic is to ask whether the task wants to preserve or discard global per channel intensity. Classification wants to preserve discriminative scale information, which argues against Instance Norm. Stylization wants to discard input contrast, which argues for it. Group Norm gives a tunable knob between these regimes.

### 5.1 Pitfalls

A few failure modes recur often enough to call out.

- The number of groups must divide the channel count. Choosing $G = 32$ on a layer with $C = 48$ channels is a configuration error. When channel counts vary across a network, a common robust choice is to fix the channels per group, for example $C/G = 16$, rather than fixing $G$, so that every layer divides cleanly.
- Group Norm degenerates on tiny spatial maps. With $G = C$ the pooling set is just the $H W$ spatial entries, so a $1 \times 1$ feature map leaves a single sample and the variance estimate becomes meaningless. Late stage feature maps in classifiers and the per token features in some architectures are exactly this regime, and Layer Norm or a small $G$ is safer there.
- Do not mix the train and inference paths. The appeal of the single example methods is that the same computation runs at training and inference, with no running averages to maintain. Reintroducing a running statistic, or freezing a Batch Norm layer and a Group Norm layer inconsistently when fine tuning, silently recreates the train and test mismatch the switch was meant to avoid.
- Instance Norm erases information by design. Using it in a classifier removes the per channel intensity that often carries class signal, which is why it underperforms there. Reach for it only when contrast removal is the goal.
- The affine parameters still matter. Setting $\gamma$ and $\beta$ per channel restores representational capacity that the normalization removed.
Omitting them, or sharing them too coarsely, can leave the normalized features unable to recover a needed scale.

Mature open source frameworks ship all four layers directly, so none of this requires custom kernels. PyTorch provides `nn.BatchNorm2d`, `nn.GroupNorm`, `nn.InstanceNorm2d`, and `nn.LayerNorm`, and Flax and Haiku expose the same set, which makes swapping among them a one line change during ablation.

```python
import numpy as np

# One implementation, four behaviors: only the pooling axes change.
def normalize(x, axes, gamma, beta, eps=1e-5):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)  # biased (population) variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# x has shape (N, C, H, W); gamma and beta must broadcast against x
# BatchNorm    : axes = (0, 2, 3)
# LayerNorm    : axes = (1, 2, 3)
# InstanceNorm : axes = (2, 3)
# GroupNorm    : reshape x to (N, G, C//G, H, W) and gamma, beta to
#                (1, G, C//G, 1, 1), use axes = (2, 3, 4), then reshape back
```

## 6. Theoretical Notes

Two properties unify the batch independent methods. First, they are invariant to per group affine transformations of the input within their pooling set, which is the proposition proved in Section 1. Second, their Jacobian has a benign structure: the centering and scaling operations bound the effective gradient scale, which is one reason normalization stabilizes training.

### 6.1 The Backward Pass

The coupling that normalization introduces is easiest to see by differentiating through a single pooling set. Fix one set $\mathcal{S}$ of size $m = |\mathcal{S}|$, write the centered and scaled output as $\hat{x}_i = (x_i - \mu)/\sqrt{\sigma^2 + \epsilon}$, and let $g_i = \partial L / \partial \hat{x}_i$ be the incoming gradient. Standard differentiation through the shared $\mu$ and $\sigma^2$ gives

$$ \frac{\partial L}{\partial x_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \left( g_i - \frac{1}{m} \sum_{j \in \mathcal{S}} g_j - \hat{x}_i \cdot \frac{1}{m} \sum_{j \in \mathcal{S}} g_j \hat{x}_j \right). $$

The structure is informative.
The first inner term subtracts the mean of the incoming gradients, so a gradient component that is uniform across the pooling set is removed. The second inner term subtracts the projection of the gradient onto the current normalized output, so a component aligned with $\hat{x}$ is also removed. In words, the backward pass orthogonalizes the gradient against the two directions along which shifting and scaling move the input without changing the output, namely the constant direction and the $\hat{x}$ direction. This is the exact backward image of the forward invariance proposition, and it is why a normalization layer cannot pass gradient signal that would only renormalize its own input. The factor $1/\sqrt{\sigma^2 + \epsilon}$ in front shows that the layer also rescales the gradient by the inverse activation scale, which is the bounding effect that keeps gradient magnitudes well behaved through depth.

### 6.2 Bias Variance Tradeoff in the Pooling Set

The choice of pooling set is ultimately a bias variance tradeoff in statistics estimation. Larger pooling sets give lower variance estimates of $\mu$ and $\sigma$ but impose a stronger assumption that the pooled activations are exchangeable. Batch Norm has a large pooling set but pays with batch dependence. Instance Norm has the smallest pooling set, $H W$ samples, which is fine for large feature maps but unstable for small ones. Group Norm tunes the pooling set size through $G$, trading estimation variance against the validity of the within group exchangeability assumption, which is why an intermediate $G$ tends to win.

## 7. Summary

Batch, layer, group, and instance normalization are one algorithm with four choices of pooling set. Batch Norm couples examples and excels with large batches but degrades when batches are small. The three single example methods are immune to batch size. Group Norm interpolates between Layer Norm and Instance Norm and is the robust default for memory limited vision tasks.
Instance Norm, the extreme of per channel single example normalization, discards input contrast and is the workhorse of style transfer, generalizing to conditional and adaptive variants that drive controllable generation. Choosing among them reduces to two questions: can you afford a large stable batch, and does your task want to preserve or erase per channel intensity.

## References

1. Ioffe, S., and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015. https://arxiv.org/abs/1502.03167
2. Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer Normalization. 2016. https://arxiv.org/abs/1607.06450
3. Wu, Y., and He, K. Group Normalization. ECCV, 2018. https://arxiv.org/abs/1803.08494
4. Ulyanov, D., Vedaldi, A., and Lempitsky, V. Instance Normalization: The Missing Ingredient for Fast Stylization. 2016. https://arxiv.org/abs/1607.08022
5. Dumoulin, V., Shlens, J., and Kudlur, M. A Learned Representation For Artistic Style. ICLR, 2017. https://arxiv.org/abs/1610.07629
6. Huang, X., and Belongie, S. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. ICCV, 2017. https://arxiv.org/abs/1703.06868
7. Karras, T., Laine, S., and Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR, 2019. https://arxiv.org/abs/1812.04948
8. Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How Does Batch Normalization Help Optimization? NeurIPS, 2018. https://arxiv.org/abs/1805.11604
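
As a closing numerical check, the worked example from Section 1 and the special cases $G = 1$ and $G = C$ from Section 3.1 can be reproduced in a few lines. The sketch below is illustrative only: it uses plain Python lists instead of a tensor library, and the names `group_stats` and `num_groups` are hypothetical helpers introduced here, not part of any framework. It assumes contiguous channel groups and the biased (population) variance used throughout the chapter.

```python
import math

def group_stats(x, num_groups):
    """Return a list of (mean, variance) pairs, one per group, for one example.

    x is a list of C channel maps, each a list of rows (H x W).
    Hypothetical helper: channels are split into contiguous groups and the
    biased (population) variance is pooled over all entries of each group.
    """
    num_channels = len(x)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must divide the channel count")
    per_group = num_channels // num_groups
    stats = []
    for g in range(num_groups):
        # Flatten every spatial entry of every channel in this group.
        pooled = [v
                  for channel in x[g * per_group:(g + 1) * per_group]
                  for row in channel
                  for v in row]
        mean = sum(pooled) / len(pooled)
        var = sum((v - mean) ** 2 for v in pooled) / len(pooled)
        stats.append((mean, var))
    return stats

# The two channel maps from the worked example in Section 1.
x = [[[1, 3], [5, 7]],
     [[0, 4], [8, 12]]]

instance_like = group_stats(x, num_groups=2)  # G = C: one channel per group
layer_like = group_stats(x, num_groups=1)     # G = 1: all channels pooled

print(instance_like)  # [(4.0, 5.0), (6.0, 20.0)]
print(layer_like)     # [(5.0, 13.5)]
print(round(math.sqrt(layer_like[0][1]), 3))  # 3.674
```

Only `num_groups` changes between the two calls, which mirrors the point of the chapter: the methods differ in which entries enter the sums, not in the arithmetic. Passing a group count that does not divide the channel count raises an error, matching the first pitfall in Section 5.1.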