187 Modern Activation Functions

Activation functions are the source of nonlinearity in deep networks. Without them, a stack of linear layers collapses into a single linear map, and the network loses all representational power beyond affine transformations. To see this concretely, compose two affine maps: $W_2(W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$, which is itself affine with weight $W_2 W_1$ and bias $W_2 b_1 + b_2$. A pointwise nonlinearity inserted between the layers breaks this collapse and is what makes the universal approximation property available.
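The collapse can be checked numerically. The sketch below uses small, arbitrary illustrative weights (they are not taken from any model) and plain Python lists; it compares the two-layer computation with the single collapsed affine map, then shows that inserting a ReLU between the layers breaks the equality for this input.

```python
# Two affine layers with no activation collapse into one affine map.
# All weights and the input are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Elementwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def relu_vec(v):
    """Apply ReLU to every entry of a vector."""
    return [max(0.0, t) for t in v]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[3.0, 0.0], [1.0, 1.0]], [-1.0, 2.0]
x = [2.0, -3.0]

# Layer by layer: W2 (W1 x + b1) + b2
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed: (W2 W1) x + (W2 b1 + b2)
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)

# Same two layers with a ReLU in between.
with_relu = add(matvec(W2, relu_vec(add(matvec(W1, x), b1))), b2)

print(stacked, collapsed, with_relu)
```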

For roughly a decade the rectified linear unit (ReLU) dominated practice because it is cheap, sparse, and largely avoids the saturation that crippled the sigmoid and hyperbolic tangent. The modern era, driven by very deep convolutional networks and especially by transformers, has shifted toward a family of smooth, self gated activations. This chapter develops the mathematics of GELU, Swish/SiLU, Mish, and the gated linear unit family (GLU, GeGLU, SwiGLU), states their key analytic properties precisely (smoothness, monotonicity, bounds, and limiting behavior), and explains the empirical and theoretical reasons these functions are chosen today.

To keep the comparisons crisp, we will repeatedly refer to four properties of a scalar activation $f: \mathbb{R} \to \mathbb{R}$.

Smoothness. Whether $f$ is continuously differentiable ($C^1$), and more generally $C^\infty$. ReLU is continuous but only piecewise differentiable, with a kink at the origin. All the modern gates below are $C^\infty$.
Monotonicity. Whether $f' \ge 0$ everywhere. ReLU and the logistic sigmoid are monotone. GELU, SiLU, Swish, and Mish are not: each decreases over part of the negative axis, dipping slightly below zero for moderately negative inputs before returning toward zero.
Boundedness. Whether the range of $f$ is bounded above, below, or both. The modern self gated activations are bounded below by a small negative constant and unbounded above, approaching the identity for large positive inputs.
Behavior at the limits. The asymptotics as $x \to +\infty$ and $x \to -\infty$ determine how the unit behaves on confidently positive and confidently negative pre activations.

187.1 1. From ReLU to Smooth Gates

The rectified linear unit is defined as

\[\mathrm{ReLU}(x) = \max(0, x).\]

Its appeal is computational and statistical. The gradient is exactly $1$ for positive inputs, so it does not vanish as depth grows, and it induces sparse activations because roughly half of the units output zero for centered inputs. The cost is the so called dying ReLU problem: a unit whose pre activation is pushed permanently negative receives zero gradient and never recovers, since

\[\frac{d}{dx}\mathrm{ReLU}(x) = \mathbb{1}[x > 0].\]

The function is also nonsmooth at the origin, and its hard cutoff discards all information about the magnitude of negative pre activations.

A natural response is to make the gate soft. Rather than multiplying $x$ by a hard indicator $\mathbb{1}[x > 0]$, we multiply $x$ by a smooth function that rises from $0$ to $1$. All of the pointwise activations discussed below share this template:

\[f(x) = x \cdot g(x),\]

where $g$ is a smooth gate valued (mostly) in $[0, 1]$. The differences lie in the choice of $g$. This self gating idea, where the unit decides how much of its own input to pass through, is the unifying theme of modern activations. ReLU itself fits the template with the discontinuous gate $g(x) = \mathbb{1}[x > 0]$, so the smooth gates can be read as continuous relaxations of the hard ReLU gate.

Differentiating the template by the product rule gives a derivative shared in form across the whole family,

\[f'(x) = g(x) + x\,g'(x),\]

which makes two facts immediate. First, $f$ is differentiable wherever $g$ is, and more generally $f$ inherits the smoothness class of $g$, so a $C^\infty$ gate yields a $C^\infty$ activation. Second, the term $x\,g'(x)$ is what creates non-monotonicity: for negative $x$ with $g'(x) > 0$ it is negative, so it subtracts from $g(x)$, and once $x$ is negative enough that $|x|\,g'(x) > g(x)$ it drives $f'(x)$ below zero. This single expression organizes most of the analysis that follows.
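As a sanity check on the product rule formula, the following sketch compares $g(x) + x\,g'(x)$ with a central finite difference of $x\,g(x)$. It uses the logistic sigmoid as the gate purely for illustration; the helper names (`gated`, `gated_prime`, `central_difference`) and the step size are assumptions of this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def gated(x, g):
    """The template f(x) = x * g(x)."""
    return x * g(x)

def gated_prime(x, g, g_prime):
    """f'(x) = g(x) + x * g'(x), from the product rule."""
    return g(x) + x * g_prime(x)

def central_difference(f, x, h=1e-5):
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

points = [-4.0, -2.0, -1.0, 0.0, 1.0, 3.0]
max_gap = max(
    abs(gated_prime(x, sigmoid, sigmoid_prime)
        - central_difference(lambda t: gated(t, sigmoid), x))
    for x in points
)
print("largest disagreement:", max_gap)

# At x = -2 the x * g'(x) term outweighs g(x), so the slope is negative.
print("slope at x = -2:", gated_prime(-2.0, sigmoid, sigmoid_prime))
```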

The next sections instantiate this template with three choices of $g$: the Gaussian CDF (GELU), the logistic sigmoid (SiLU and Swish), and a softplus driven tanh (Mish). The taxonomy below previews how the pieces relate.

flowchart TD
    A["Activation functions"] --> B["Hard gate (ReLU)"]
    A --> C["Smooth self gated, pointwise"]
    A --> D["Gated linear units, layer level"]
    C --> E["GELU: gate is Gaussian CDF"]
    C --> F["SiLU and Swish: gate is logistic sigmoid"]
    C --> G["Mish: gate is tanh of softplus"]
    D --> H["GLU: sigmoid gate on a projection"]
    D --> I["GeGLU: GELU gate"]
    D --> J["SwiGLU: SiLU gate"]

187.2 2. GELU

The Gaussian Error Linear Unit (GELU), introduced by Hendrycks and Gimpel, defines the gate as the cumulative distribution function (CDF) of a standard normal. Let $\Phi$ denote the standard Gaussian CDF. Then

\[\mathrm{GELU}(x) = x \, \Phi(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right].\]

187.2.1 2.1 Probabilistic Interpretation

GELU has a clean stochastic reading. Consider multiplying the input by a Bernoulli mask $m$ whose probability of being $1$ is $\Phi(x)$, that is, the probability that a standard normal variable $Z$ satisfies $Z \le x$. The expected output of this stochastic gate is

\[\mathbb{E}[m \cdot x] = x \, \Phi(x) = \mathrm{GELU}(x).\]

So GELU is the deterministic expectation of a data dependent dropout that keeps larger inputs more often. This connects it conceptually to both dropout and ReLU: as the input grows, the keep probability approaches $1$, and the function approaches the identity; as the input falls, the keep probability approaches $0$.
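The stochastic reading can be illustrated with a small Monte Carlo experiment. This is a sketch only: the sample count, the seed, and the test input $x = 0.8$ are arbitrary choices, and the mask is drawn by comparing a standard normal sample with $x$.

```python
import math
import random

def phi_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * phi_cdf(x)

def stochastic_gate_mean(x, n_samples, rng):
    """Average of m * x, where m = 1 if Z <= x for Z ~ N(0, 1) and m = 0 otherwise."""
    kept = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) <= x)
    return x * kept / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
x = 0.8
estimate = stochastic_gate_mean(x, 200_000, rng)
print("Monte Carlo estimate:", estimate)
print("x * Phi(x):          ", gelu(x))
```

The two printed values should agree up to sampling noise, which shrinks like $1/\sqrt{n}$ in the number of samples.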

187.2.2 2.2 Derivative and Analytic Properties

Differentiating the exact form $\mathrm{GELU}(x) = x\,\Phi(x)$ by the product rule, and using $\Phi'(x) = \phi(x)$ where $\phi(x) = \tfrac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the standard normal density, gives

\[\frac{d}{dx}\mathrm{GELU}(x) = \Phi(x) + x\,\phi(x).\]

This is smooth ($C^\infty$, since $\Phi$ and $\phi$ are), confirming GELU is infinitely differentiable, unlike ReLU. The function is non-monotone: the term $x\,\phi(x)$ is negative for $x < 0$, and it cancels $\Phi(x)$ exactly at $x \approx -0.75$, where the derivative vanishes. To the left of that point the derivative is negative, so GELU has a shallow dip whose global minimum, about $-0.17$, sits at $x \approx -0.75$. The limiting behavior is clean. As $x \to +\infty$, $\Phi(x) \to 1$ so $\mathrm{GELU}(x) - x \to 0$ (the function approaches the identity), and the derivative tends to $1$. As $x \to -\infty$, $\Phi(x) \to 0$ at a Gaussian rate, faster than any exponential, so $\mathrm{GELU}(x) \to 0$ from below and the derivative tends to $0$. GELU is therefore bounded below (by roughly $-0.17$) and unbounded above, the profile shared by the whole family.
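The location of the dip can be recovered by finding the root of the derivative $\Phi(x) + x\,\phi(x)$. The sketch below uses plain bisection on the bracket $[-2, 0]$, where the derivative changes sign; the bracket, the iteration count, and the helper name `bisect_root` are choices made for this illustration.

```python
import math

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * phi_cdf(x)

def gelu_prime(x):
    return phi_cdf(x) + x * phi_pdf(x)

def bisect_root(f, lo, hi, iters=60):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
    return 0.5 * (lo + hi)

x_min = bisect_root(gelu_prime, -2.0, 0.0)
print("stationary point:", x_min)
print("minimum value:   ", gelu(x_min))
```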

187.2.3 2.3 Approximations

The exact form requires the error function, which is somewhat expensive. Two common approximations appear in practice. The tanh approximation is

\[\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right]\right),\]

and the sigmoid approximation is $\mathrm{GELU}(x) \approx x\,\sigma(1.702\,x)$, where $\sigma$ is the logistic sigmoid. Modern hardware and libraries usually compute the exact erf form efficiently, so the approximations matter mainly for reproducing specific historical checkpoints. A subtle but real consequence is that BERT and GPT-2 era models were trained with the tanh approximation, so loading those weights into an exact GELU implementation introduces tiny numerical mismatches.

gelu_exact(x)   = x * 0.5 * (1 + erf(x / sqrt(2)))
gelu_tanh(x)    = 0.5 * x * (1 + tanh(0.79788 * (x + 0.044715 * x**3)))
gelu_sigmoid(x) = x * sigmoid(1.702 * x)
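The three formulas translate directly into runnable Python. The sketch below uses only the standard `math` module and measures how far each approximation strays from the exact form on a grid; the grid range $[-6, 6]$ and its spacing are arbitrary choices for illustration.

```python
import math

def gelu_exact(x):
    """Exact GELU via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation; sqrt(2 / pi) is the 0.79788 constant above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_sigmoid(x):
    """Sigmoid approximation x * sigmoid(1.702 x)."""
    return x / (1.0 + math.exp(-1.702 * x))

grid = [i / 100.0 for i in range(-600, 601)]
err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)

print("max |tanh approx - exact|:   ", err_tanh)
print("max |sigmoid approx - exact|:", err_sigmoid)
```

On this grid the tanh form tracks the exact function more closely than the sigmoid form, which is consistent with the tanh variant being the one used when fidelity to the exact GELU matters.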

The original Transformer used ReLU in its feedforward blocks; GELU became the default with BERT and the GPT family, which is why it remains ubiquitous in the encoder and decoder blocks of many large language models.

187.3 3. Swish and SiLU

The Sigmoid Linear Unit (SiLU) uses the logistic sigmoid as its gate:

\[\mathrm{SiLU}(x) = x \, \sigma(x) = \frac{x}{1 + e^{-x}}.\]

This function was proposed independently several times. Elfwing and colleagues described it as the SiLU in reinforcement learning, and Ramachandran, Zoph, and Le rediscovered it through an automated activation search and named it Swish, with a learnable or fixed parameter $\beta$:

\[\mathrm{Swish}_\beta(x) = x \, \sigma(\beta x).\]

When $\beta = 1$, Swish equals SiLU. As $\beta \to \infty$, the sigmoid approaches a step function and $\mathrm{Swish}_\beta(x) \to \mathrm{ReLU}(x)$. As $\beta \to 0$, the output tends to the linear function $x/2$. Thus $\beta$ interpolates smoothly between a linear unit and ReLU, which is part of why the search procedure favored it.
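Both limits are easy to observe numerically. The sketch below evaluates Swish at a large and a tiny value of $\beta$ on a handful of test points; the specific values ($\beta = 100$ and $\beta = 10^{-6}$) and the test points are illustrative choices, kept moderate so the plain sigmoid formula does not overflow.

```python
import math

def swish(x, beta):
    """Swish_beta(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def relu(x):
    return max(0.0, x)

xs = [-3.0, -0.5, 0.0, 0.5, 3.0]

# Large beta: the gate is nearly a step function, so Swish is close to ReLU.
gap_to_relu = max(abs(swish(x, 100.0) - relu(x)) for x in xs)

# Tiny beta: the gate is nearly the constant 1/2, so Swish is close to x / 2.
gap_to_half = max(abs(swish(x, 1e-6) - 0.5 * x) for x in xs)

print("max gap to ReLU at beta = 100: ", gap_to_relu)
print("max gap to x/2  at beta = 1e-6:", gap_to_half)
```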

187.3.1 3.1 Non-monotonicity and the Negative Dip

A defining property of SiLU, Swish, GELU, and Mish is that they are not monotone. SiLU has a global minimum near $x \approx -1.278$, where the output dips slightly below zero before climbing back toward the linear regime. The derivative is

\[\frac{d}{dx}\mathrm{SiLU}(x) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr).\]

This small negative region is believed to help optimization by preserving a little gradient signal for moderately negative pre activations, in contrast to ReLU which zeroes them completely. The bounded below, unbounded above shape also acts as a mild self regularizer.

The derivative deserves a careful read because it is what distinguishes the smooth gates from ReLU at the level of training dynamics. Setting $f'(x) = 0$ in $\sigma(x)\bigl(1 + x(1 - \sigma(x))\bigr)$ requires the bracket to vanish, since $\sigma(x) > 0$ everywhere; solving $1 + x(1 - \sigma(x)) = 0$ numerically gives the global minimum near $x \approx -1.278$, with minimum value $\mathrm{SiLU}(-1.278) \approx -0.278$. For confidently negative inputs the derivative does not snap to zero as it does for ReLU; it crosses zero at $x \approx -1.278$, stays slightly negative to the left of that point, and approaches zero from below as $x \to -\infty$. A unit sitting at a moderately negative pre activation, to the right of the minimum, therefore still receives a usable positive gradient and can climb back. This is the mechanism by which the smooth gates mitigate the dying ReLU failure mode.

Worked example: a single SiLU unit

Take a unit with pre activation $x = -1$. The logistic sigmoid is $\sigma(-1) = 1/(1 + e^{1}) \approx 0.2689$. The output is $\mathrm{SiLU}(-1) = (-1)(0.2689) \approx -0.2689$, slightly negative rather than the exact zero a ReLU would give. The derivative is

\[\sigma(-1)\bigl(1 + (-1)(1 - \sigma(-1))\bigr) = 0.2689\,\bigl(1 - 0.7311\bigr) \approx 0.2689 \times 0.2689 \approx 0.0723.\]

So the unit passes a small negative signal forward and, critically, a small positive gradient backward (about $0.07$). A ReLU at the same point would output exactly $0$ and backpropagate exactly $0$, contributing nothing to learning. Repeated across many units and many steps, that nonzero leakage is the difference between a unit that can recover and one that is permanently dead.
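The recovery argument can be made concrete with a toy experiment. The sketch below treats the pre activation itself as a single trainable scalar $b$, starts it at $b = -1$, and runs gradient descent on the squared error to a target output of $1$. Everything about the setup (the one-parameter unit, the target, the learning rate, the step count) is an assumption chosen for illustration, and the start point lies to the right of the SiLU minimum at $x \approx -1.278$; left of that point the slope changes sign and this sketch says nothing about that regime.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_prime(x):
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def relu(x):
    return max(0.0, x)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0

def train_bias(f, f_prime, b0, target, lr=0.1, steps=500):
    """Gradient descent on the loss (f(b) - target)^2 with respect to b."""
    b = b0
    for _ in range(steps):
        grad = 2.0 * (f(b) - target) * f_prime(b)
        b -= lr * grad
    return b

b_relu = train_bias(relu, relu_prime, b0=-1.0, target=1.0)
b_silu = train_bias(silu, silu_prime, b0=-1.0, target=1.0)

print("ReLU unit: b =", b_relu, " output =", relu(b_relu))
print("SiLU unit: b =", b_silu, " output =", silu(b_silu))
```

The ReLU unit receives a zero gradient at every step and never moves, while the SiLU unit climbs out of the negative region and fits the target.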

SiLU is the activation used throughout EfficientNet and many later vision backbones, and it appears in detection and segmentation networks. In language models its gated form, discussed below, is far more common than its plain form.

187.4 4. Mish

Mish, introduced by Misra, is closely related to SiLU but uses a softplus based gate:

\[\mathrm{Mish}(x) = x \, \tanh\!\bigl(\mathrm{softplus}(x)\bigr) = x \, \tanh\!\bigl(\ln(1 + e^{x})\bigr).\]

Like SiLU, Mish is smooth ($C^\infty$), non-monotone, bounded below (its global minimum is about $-0.31$ near $x \approx -1.19$), and unbounded above. The asymptotics mirror SiLU: as $x \to +\infty$, $\mathrm{softplus}(x) \to \infty$ so $\tanh(\mathrm{softplus}(x)) \to 1$ and $\mathrm{Mish}(x) \to x$; as $x \to -\infty$, $\mathrm{softplus}(x) \to 0^{+}$ so the gate and the output both approach $0$. Its negative dip is slightly deeper and its transition region a little wider than SiLU, which gives a marginally smoother loss landscape in some experiments. Empirically Mish showed gains on object detection benchmarks and became a popular choice in the YOLO line of detectors. The tradeoff is cost: evaluating an exponential, a logarithm, and a hyperbolic tangent makes Mish heavier than SiLU or even exact GELU, and the accuracy gains over SiLU are usually small and task dependent. For this reason Mish is common in computer vision but rare in the largest language models, where throughput pressure favors cheaper gates.

mish(x) = x * tanh(softplus(x))   # softplus(x) = log(1 + exp(x))
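A runnable version of the one-liner, written with the standard `math` module, is sketched below. It rewrites softplus so that the exponential never receives a large positive argument, then locates the dip with a coarse grid scan; the grid range and spacing are illustrative choices and limit the precision of the reported location.

```python
import math

def softplus(x):
    """log(1 + exp(x)), computed as max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

# Coarse scan of [-5, 0] with spacing 0.001 to find the lowest point.
grid = [i / 1000.0 for i in range(-5000, 1)]
x_min = min(grid, key=mish)

print("approximate minimum location:", x_min)
print("approximate minimum value:   ", mish(x_min))
```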

187.5 4.1 Comparison of the Pointwise Gates

The pointwise activations differ mainly in their gate, their cost, and the depth of their negative dip. The table consolidates the properties derived above. The three smooth gates are $C^\infty$ and bounded below by a small negative constant; ReLU is included as the nonsmooth baseline. All four are unbounded above and approach the identity as $x \to +\infty$; the approximate minimum locations and values are numerical.

Activation	Gate $g(x)$	Monotone	Approx. min location	Approx. min value	Relative cost
ReLU	$\mathbb{1}[x>0]$	yes	none (kink at 0)	$0$	lowest
GELU	$\Phi(x)$	no	$x \approx -0.75$	$-0.17$	low (exact erf)
SiLU / Swish	$\sigma(x)$	no	$x \approx -1.28$	$-0.28$	low
Mish	$\tanh(\mathrm{softplus}(x))$	no	$x \approx -1.19$	$-0.31$	high

The practical reading: GELU, SiLU, and Mish are close cousins. All three converge to ReLU as $|x|$ grows (GELU fastest, because the Gaussian tail of its gate decays faster than the exponential tails of the other two), and they differ mainly in the shape of the transition near the origin and the depth of the dip. The choice among them is therefore driven less by their analytic differences, which are small, than by cost and architectural lineage.

187.6 5. The Gated Linear Unit Family

The activations above are pointwise: each scalar input maps to a scalar output. The gated linear unit (GLU), introduced by Dauphin and colleagues for convolutional sequence modeling, is different. It splits a higher dimensional projection into two halves and uses one half to gate the other.

187.6.1 5.1 Definition

Let $x \in \mathbb{R}^{d}$ be the input to a layer, and let $W, V \in \mathbb{R}^{d \times d_f}$ with biases $b, c \in \mathbb{R}^{d_f}$. The GLU is

\[\mathrm{GLU}(x) = (xW + b) \odot \sigma(xV + c),\]

where $\odot$ is elementwise multiplication. The first linear branch produces content, and the second branch, squashed by a sigmoid, produces a multiplicative gate. Crucially the gate is computed from a learned projection of the whole input rather than from the content value itself, so it is a learned, input dependent mask rather than a fixed pointwise nonlinearity. This is what distinguishes a GLU from simply applying SiLU to a single projection.

187.6.2 5.2 Generalized Gates: GeGLU and SwiGLU

Shazeer generalized GLU by replacing the sigmoid gate with other activations, producing a family of variants. With an identity gate (the bilinear layer), a GELU gate, or a Swish/SiLU gate we obtain

\[\mathrm{Bilinear}(x) = (xW)\odot(xV),\]

\[\mathrm{GeGLU}(x) = (xW)\odot \mathrm{GELU}(xV),\]

\[\mathrm{SwiGLU}(x) = (xW)\odot \mathrm{SiLU}(xV).\]

(Biases are usually dropped in transformer implementations.) In a transformer the feedforward network (FFN) normally has the form

\[\mathrm{FFN}(x) = \phi(xW_1)\,W_2,\]

with $\phi$ a pointwise activation such as ReLU or GELU. The gated variant replaces this with

\[\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \bigl(\mathrm{SiLU}(xW_1)\odot(xV)\bigr)\,W_2,\]

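which now uses three weight matrices $W_1, V, W_2$ instead of two. Here the activated branch is written with $W_1$ and the linear branch with $V$; because $\odot$ is commutative, this is the same construction as in the definitions above with the two projections relabeled.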
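A minimal forward pass for the gated block is sketched below in plain Python, for a single input vector and without biases. The toy shapes ($d = 2$, hidden width $3$) and all weight values are arbitrary illustrations, and the simple `silu` here is not hardened against extreme inputs.

```python
import math

def silu(z):
    """SiLU(z) = z * sigmoid(z); adequate for the moderate values used here."""
    return z / (1.0 + math.exp(-z))

def vecmat(x, W):
    """Row vector x (length d) times matrix W (d rows, each a list of columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def ffn_swiglu(x, W1, V, W2):
    """(SiLU(x W1) * (x V)) W2, with * elementwise and no biases."""
    gate = [silu(z) for z in vecmat(x, W1)]          # activated branch
    content = vecmat(x, V)                           # linear branch
    hidden = [g * c for g, c in zip(gate, content)]  # elementwise product
    return vecmat(hidden, W2)                        # project back to width d

# Toy shapes: d = 2, hidden width 3. Values are arbitrary.
W1 = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
V = [[0.0, 1.0, 1.0], [1.0, -1.0, 0.5]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

y = ffn_swiglu([1.0, 2.0], W1, V, W2)
print(y)
```

In a real model the same computation runs as batched matrix multiplications, with the two input projections often fused into a single wider matrix; that fusion is an implementation detail and does not change the function computed.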

187.6.3 5.3 The Parameter Budget Adjustment

Because the gated FFN introduces a third matrix, a naive substitution increases parameters and compute by roughly fifty percent. To compare fairly at fixed cost, practitioners shrink the hidden dimension. The arithmetic is exact. A standard FFN with model width $d$ and hidden width $d_f$ holds two matrices, $W_1 \in \mathbb{R}^{d \times d_f}$ and $W_2 \in \mathbb{R}^{d_f \times d}$, for $2\,d\,d_f$ parameters (ignoring biases). The gated FFN holds three matrices $W_1, V \in \mathbb{R}^{d \times d_f'}$ and $W_2 \in \mathbb{R}^{d_f' \times d}$, for $3\,d\,d_f'$ parameters. Matching the budgets,

\[3\,d\,d_f' = 2\,d\,d_f \quad\Longrightarrow\quad d_f' = \tfrac{2}{3}\,d_f.\]

So the two projections of the gated FFN are each set to width $d_f' = \tfrac{2}{3} d_f$. This is why many open models report a feedforward hidden size close to $\tfrac{8}{3}$ of the model dimension rather than the classic factor of $4$: starting from $d_f = 4d$ and scaling by $\tfrac{2}{3}$ gives $d_f' = \tfrac{8}{3} d$. Under this matched budget, SwiGLU and GeGLU consistently improve perplexity over plain ReLU or GELU FFNs in the experiments reported by Shazeer and reproduced widely since. The improvement is modest per layer but reliable, and it compounds across many layers and large training runs.
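The budget arithmetic is short enough to script. In the sketch below, `matched_gated_width` takes the floor of $\tfrac{2}{3} d_f$ and then rounds up to a multiple of `multiple_of`; that rounding knob is a hypothetical parameter added here because implementations commonly align widths to hardware-friendly sizes, and it is not part of the derivation above.

```python
def plain_ffn_params(d, d_f):
    """Two matrices, d x d_f and d_f x d (biases ignored)."""
    return 2 * d * d_f

def gated_ffn_params(d, d_f_gated):
    """Three matrices, two of d x d_f' and one of d_f' x d (biases ignored)."""
    return 3 * d * d_f_gated

def matched_gated_width(d_f, multiple_of=1):
    """Floor of (2/3) * d_f, rounded up to a multiple of `multiple_of`.

    `multiple_of` is an illustrative alignment knob, not part of the derivation.
    """
    target = (2 * d_f) // 3
    return multiple_of * ((target + multiple_of - 1) // multiple_of)

d = 768
d_f = 4 * d                       # classic factor of 4
d_f_gated = matched_gated_width(d_f)

print("plain FFN parameters:", plain_ffn_params(d, d_f))
print("gated hidden width:  ", d_f_gated)
print("gated FFN parameters:", gated_ffn_params(d, d_f_gated))
```

For $d = 768$ the match is exact: $d_f = 3072$ gives $d_f' = 2048$, and both blocks hold 4,718,592 weights. When $d_f$ is not divisible by three, or when the width is rounded for alignment, the gated block ends up slightly below or above the plain budget.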

187.6.4 5.4 Why Gating Helps

Two informal explanations are commonly offered. First, the multiplicative interaction $(xW)\odot g(xV)$ gives the FFN a second order term in the input, increasing expressivity relative to the purely additive composition of a single projection and pointwise nonlinearity. Second, the gate can suppress or amplify individual channels conditionally, which acts as a soft, content dependent feature selector and improves gradient flow. Shazeer himself noted that the gains lack a clean theoretical justification and attributed their success, with characteristic candor, to divine benevolence. The empirical record nonetheless made these variants standard.

SwiGLU is now the feedforward activation in LLaMA and its descendants, PaLM, Mistral, Qwen, and most contemporary open weight large language models. GeGLU appears in T5 variants and several Google models. The choice between them is largely a matter of lineage and minor preference, since their measured differences are small.

187.7 6. How Activations Are Chosen Today

The selection of an activation function in modern practice is governed by a few practical considerations rather than by any single dominant theory.

187.7.1 6.1 Smoothness and Optimization

Smooth, non-monotone gates such as GELU and SiLU are generally reported to train more stably than the kinked ReLU, and they make dead units less likely because their gradient is nonzero almost everywhere on the negative axis. For very deep networks trained with large batch sizes and adaptive optimizers, this smoothness tends to translate into more stable training and slightly better final accuracy.

187.7.2 6.2 Compute and Memory

Throughput matters at scale. Exact GELU and SiLU are cheap and are well supported by fused kernels on GPUs and accelerators. Mish is more expensive and so is reserved for settings where its small accuracy edge justifies the cost, mostly in vision. For the gated variants, the dominant cost is the extra projection matrix, which is why the two thirds width adjustment is essential to keep comparisons honest.

187.7.3 6.3 Architecture and Lineage

In practice the activation is often inherited from a reference architecture. Convolutional vision backbones tend to use ReLU, SiLU, or Mish. Transformer language models almost universally use either GELU in plain FFNs or SwiGLU and GeGLU in gated FFNs. New models rarely re-derive the choice from scratch; they adopt what worked in the closest successful predecessor and tune from there.

187.7.4 6.4 A Practical Default

For a transformer trained today, the common recommendation is a SwiGLU feedforward block with the hidden width scaled by two thirds to hold the parameter budget fixed. For a plain pointwise activation, either GELU or SiLU is a safe default. For a convolutional vision model under tight latency budgets, ReLU or SiLU remain sensible. The marginal differences among the smooth gated families are small enough that data quality, model scale, normalization, and optimizer settings usually dominate the final result.

187.7.5 6.5 When to Use Each, and Common Pitfalls

The following guidance summarizes the tradeoffs in operational terms.

Use GELU for plain (non-gated) transformer feedforward blocks, especially when reproducing or fine tuning BERT or GPT-2 era models. Pitfall: the tanh and sigmoid approximations are not bit identical to the exact erf form, so mixing implementations when loading historical checkpoints introduces small numerical drift. Match the activation variant to the one the weights were trained with.
Use SiLU or Swish for convolutional vision backbones and as the gate inside SwiGLU. Pitfall: with a learnable $\beta$, the parameter can drift toward the ReLU limit ($\beta \to \infty$) or the linear limit ($\beta \to 0$) during training; if you do not need the extra flexibility, fixing $\beta = 1$ (plain SiLU) is simpler and usually as good.
Use Mish when a small vision accuracy gain justifies extra compute, as in some object detectors. Pitfall: it is the most expensive of the group (an exponential, a logarithm, and a tanh per element), so it is rarely worth it under throughput pressure, and almost never in large language models.
Use SwiGLU or GeGLU for the feedforward blocks of new large language models. Pitfall: forgetting the two thirds width adjustment silently inflates parameters and compute by about fifty percent, which both breaks fair comparisons and wastes budget. Always scale the hidden width when substituting a gated FFN for a plain one.

A pitfall shared across the smooth gates concerns numerical stability at extreme inputs. Naive implementations of the sigmoid or softplus can overflow for large $|x|$; mature open source frameworks (PyTorch, JAX, TensorFlow) provide fused, numerically stable kernels for GELU, SiLU, and softplus, so prefer the built in operators over hand rolled formulas in production code.
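To make the overflow concrete, the sketch below contrasts a naive scalar sigmoid with rearranged forms that only ever exponentiate a non-positive number. It is a pure Python illustration of the idea, not a description of how any particular framework implements its kernels; the function names are chosen for this example.

```python
import math

def naive_sigmoid(x):
    """Direct formula; math.exp(-x) overflows for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))

def stable_sigmoid(x):
    """Logistic sigmoid that only exponentiates a non-positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def stable_softplus(x):
    """log(1 + exp(x)) rewritten as max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

try:
    naive_sigmoid(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print("naive sigmoid overflowed:", naive_overflowed)
print("stable sigmoid at -1000: ", stable_sigmoid(-1000.0))
print("stable softplus at 1000: ", stable_softplus(1000.0))
```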

187.8 7. Summary

Modern activation functions are unified by the idea of soft self gating, where a unit multiplies its input by a smooth, learned, or input dependent gate instead of a hard threshold. GELU gates by the Gaussian CDF, SiLU and Swish gate by the sigmoid, and Mish gates by a softplus driven tanh. The gated linear unit family lifts this idea to the layer level, using a separate learned projection to gate a content projection, and its members GeGLU and SwiGLU now define the feedforward blocks of state of the art language models. The choice among them today is driven by smoothness for optimization, compute cost at scale, and architectural lineage, with the smooth gated variants the clear default for new work.

187.9 References

Hendrycks, D. and Gimpel, K. “Gaussian Error Linear Units (GELUs).” 2016. https://arxiv.org/abs/1606.08415
Elfwing, S., Uchibe, E., and Doya, K. “Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning.” 2017. https://arxiv.org/abs/1702.03118
Ramachandran, P., Zoph, B., and Le, Q. V. “Searching for Activation Functions.” 2017. https://arxiv.org/abs/1710.05941
Misra, D. “Mish: A Self Regularized Non-Monotonic Activation Function.” 2019. https://arxiv.org/abs/1908.08681
Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. “Language Modeling with Gated Convolutional Networks.” 2017. https://arxiv.org/abs/1612.08083
Shazeer, N. “GLU Variants Improve Transformer.” 2020. https://arxiv.org/abs/2002.05202
Vaswani, A. et al. “Attention Is All You Need.” 2017. https://arxiv.org/abs/1706.03762
Touvron, H. et al. “LLaMA: Open and Efficient Foundation Language Models.” 2023. https://arxiv.org/abs/2302.13971
Tan, M. and Le, Q. V. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” 2019. https://arxiv.org/abs/1905.11946
Nair, V. and Hinton, G. E. “Rectified Linear Units Improve Restricted Boltzmann Machines.” 2010. https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf

# Modern Activation Functions Activation functions are the source of nonlinearity in deep networks. Without them, a stack of linear layers collapses into a single linear map, and the network loses all representational power beyond affine transformations. To see this concretely, compose two affine maps: $W_2(W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$, which is itself affine with weight $W_2 W_1$ and bias $W_2 b_1 + b_2$. A pointwise nonlinearity inserted between the layers breaks this collapse and is what makes the universal approximation property available. For two decades the rectified linear unit (ReLU) dominated practice because it is cheap, sparse, and largely avoids the saturation that crippled the sigmoid and hyperbolic tangent. The modern era, driven by very deep convolutional networks and especially by transformers, has shifted toward a family of smooth, self gated activations. This chapter develops the mathematics of GELU, Swish/SiLU, Mish, and the gated linear unit family (GLU, GeGLU, SwiGLU), states their key analytic properties precisely (smoothness, monotonicity, bounds, and limiting behavior), and explains the empirical and theoretical reasons these functions are chosen today. To keep the comparisons crisp, we will repeatedly refer to four properties of a scalar activation $f: \mathbb{R} \to \mathbb{R}$. - **Smoothness.** Whether $f$ is continuously differentiable ($C^1$), and more generally $C^\infty$. ReLU is continuous but only piecewise differentiable, with a kink at the origin. All the modern gates below are $C^\infty$. - **Monotonicity.** Whether $f' \ge 0$ everywhere. ReLU and the logistic sigmoid are monotone. GELU, SiLU, Swish, and Mish are not: they dip below the linear trend for moderately negative inputs. - **Boundedness.** Whether the range of $f$ is bounded above, below, or both. 
The modern self gated activations are bounded below by a small negative constant and unbounded above, approaching the identity for large positive inputs. - **Behavior at the limits.** The asymptotics as $x \to +\infty$ and $x \to -\infty$ determine how the unit behaves on confidently positive and confidently negative pre activations. ## 1. From ReLU to Smooth Gates The rectified linear unit is defined as $$\mathrm{ReLU}(x) = \max(0, x).$$ Its appeal is computational and statistical. The gradient is exactly $1$ for positive inputs, so it does not vanish as depth grows, and it induces sparse activations because roughly half of the units output zero for centered inputs. The cost is the so called dying ReLU problem: a unit whose pre activation is pushed permanently negative receives zero gradient and never recovers, since $$\frac{d}{dx}\mathrm{ReLU}(x) = \mathbb{1}[x > 0].$$ The function is also nonsmooth at the origin, and its hard cutoff discards all information about the magnitude of negative pre activations. A natural response is to make the gate soft. Rather than multiplying $x$ by a hard indicator $\mathbb{1}[x > 0]$, we multiply $x$ by a smooth function that rises from $0$ to $1$. All of the pointwise activations discussed below share this template: $$f(x) = x \cdot g(x),$$ where $g$ is a smooth gate valued (mostly) in $[0, 1]$. The differences lie in the choice of $g$. This self gating idea, where the unit decides how much of its own input to pass through, is the unifying theme of modern activations. ReLU itself fits the template with the discontinuous gate $g(x) = \mathbb{1}[x > 0]$, so the smooth gates can be read as continuous relaxations of the hard ReLU gate. Differentiating the template by the product rule gives a derivative shared in form across the whole family, $$f'(x) = g(x) + x\,g'(x),$$ which makes two facts immediate. 
First, wherever the gate is bounded and its derivative is bounded, $f$ is differentiable, so smoothness of $f$ follows from smoothness of $g$. Second, the term $x\,g'(x)$ is what creates non-monotonicity: for negative $x$ with $g'(x) > 0$ it subtracts from $g(x)$, and for sufficiently negative $x$ it can drive $f'(x)$ below zero. This single expression organizes most of the analysis that follows. The next sections instantiate this template with four choices of $g$: the Gaussian CDF (GELU), the logistic sigmoid (SiLU and Swish), and a softplus driven tanh (Mish). The taxonomy below previews how the pieces relate. ```{mermaid} flowchart TD A["Activation functions"] --> B["Hard gate (ReLU)"] A --> C["Smooth self gated, pointwise"] A --> D["Gated linear units, layer level"] C --> E["GELU: gate is Gaussian CDF"] C --> F["SiLU and Swish: gate is logistic sigmoid"] C --> G["Mish: gate is tanh of softplus"] D --> H["GLU: sigmoid gate on a projection"] D --> I["GeGLU: GELU gate"] D --> J["SwiGLU: SiLU gate"] ``` ## 2. GELU The Gaussian Error Linear Unit (GELU), introduced by Hendrycks and Gimpel, defines the gate as the cumulative distribution function (CDF) of a standard normal. Let $\Phi$ denote the standard Gaussian CDF. Then $$\mathrm{GELU}(x) = x \, \Phi(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right].$$ ### 2.1 Probabilistic Interpretation GELU has a clean stochastic reading. Consider multiplying the input by a Bernoulli mask $m$ whose probability of being $1$ is $\Phi(x)$, that is, the probability that a standard normal variable $Z$ satisfies $Z \le x$. The expected output of this stochastic gate is $$\mathbb{E}[m \cdot x] = x \, \Phi(x) = \mathrm{GELU}(x).$$ So GELU is the deterministic expectation of a data dependent dropout that keeps larger inputs more often. 
This connects it conceptually to both dropout and ReLU: as the input grows, the keep probability approaches $1$, and the function approaches the identity; as the input falls, the keep probability approaches $0$. ### 2.2 Derivative and Analytic Properties Differentiating the exact form $\mathrm{GELU}(x) = x\,\Phi(x)$ by the product rule, and using $\Phi'(x) = \phi(x)$ where $\phi(x) = \tfrac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the standard normal density, gives $$\frac{d}{dx}\mathrm{GELU}(x) = \Phi(x) + x\,\phi(x).$$ This is smooth ($C^\infty$, since $\Phi$ and $\phi$ are), confirming GELU is infinitely differentiable, unlike ReLU. The function is non-monotone: the term $x\,\phi(x)$ is negative for $x < 0$ and large enough in magnitude near $x \approx -0.75$ to push the derivative below zero, producing a shallow negative dip with a global minimum value of about $-0.17$. The limiting behavior is clean. As $x \to +\infty$, $\Phi(x) \to 1$ so $\mathrm{GELU}(x) \to x$ (the identity), and the derivative tends to $1$. As $x \to -\infty$, $\Phi(x) \to 0$ exponentially fast, so $\mathrm{GELU}(x) \to 0^{-}$ from below and the derivative tends to $0$. GELU is therefore bounded below (by roughly $-0.17$) and unbounded above, exactly the bounded below, unbounded above profile shared by the family. ### 2.3 Approximations The exact form requires the error function, which is somewhat expensive. Two common approximations appear in practice. The tanh approximation is $$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right]\right),$$ and the sigmoid approximation is $\mathrm{GELU}(x) \approx x\,\sigma(1.702\,x)$, where $\sigma$ is the logistic sigmoid. Modern hardware and libraries usually compute the exact erf form efficiently, so the approximations matter mainly for reproducing specific historical checkpoints. 
A subtle but real consequence is that BERT and GPT-2 era models were trained with the tanh approximation, so loading those weights into an exact GELU implementation introduces tiny numerical mismatches. ```text gelu_exact(x) = x * 0.5 * (1 + erf(x / sqrt(2))) gelu_tanh(x) = 0.5 * x * (1 + tanh(0.79788 * (x + 0.044715 * x**3))) gelu_sigmoid(x) = x * sigmoid(1.702 * x) ``` GELU became the default activation in the original Transformer encoder variants, BERT, and the GPT family, which is why it remains ubiquitous in the encoder and decoder blocks of large language models. ## 3. Swish and SiLU The Sigmoid Linear Unit (SiLU) uses the logistic sigmoid as its gate: $$\mathrm{SiLU}(x) = x \, \sigma(x) = \frac{x}{1 + e^{-x}}.$$ This function was proposed independently several times. Elfwing and colleagues described it as the SiLU in reinforcement learning, and Ramachandran, Zoph, and Le rediscovered it through an automated activation search and named it Swish, with a learnable or fixed parameter $\beta$: $$\mathrm{Swish}_\beta(x) = x \, \sigma(\beta x).$$ When $\beta = 1$, Swish equals SiLU. As $\beta \to \infty$, the sigmoid approaches a step function and $\mathrm{Swish}_\beta(x) \to \mathrm{ReLU}(x)$. As $\beta \to 0$, the output tends to the linear function $x/2$. Thus $\beta$ interpolates smoothly between a linear unit and ReLU, which is part of why the search procedure favored it. ### 3.1 Non-monotonicity and the Negative Dip A defining property of SiLU, Swish, GELU, and Mish is that they are not monotone. SiLU has a global minimum near $x \approx -1.278$, where the output dips slightly below zero before climbing back toward the linear regime. The derivative is $$\frac{d}{dx}\mathrm{SiLU}(x) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr).$$ This small negative region is believed to help optimization by preserving a little gradient signal for moderately negative pre activations, in contrast to ReLU which zeroes them completely. 
The bounded below, unbounded above shape is also argued to act as a mild self regularizer.

The derivative deserves a careful read because it is what distinguishes the smooth gates from ReLU at the level of training dynamics. Setting $f'(x) = 0$ in $\sigma(x)\bigl(1 + x(1 - \sigma(x))\bigr)$ requires the bracket to vanish, since $\sigma(x) > 0$ everywhere; solving $1 + x(1 - \sigma(x)) = 0$ numerically gives the global minimum near $x \approx -1.278$, with minimum value $\mathrm{SiLU}(-1.278) \approx -0.278$. (At the minimum the condition rearranges to $x\,\sigma(x) = 1 + x$, which is why the minimum value is exactly one more than its location.) For negative inputs the derivative does not snap to zero as it does for ReLU: it stays positive down to $x \approx -1.278$, turns slightly negative below that point, and then decays smoothly toward zero from below as $x \to -\infty$. A unit sitting at a moderately negative pre activation therefore still receives a nonzero gradient and can move back toward the active regime. This is the mechanism by which the smooth gates are thought to sidestep the dying ReLU failure mode.

::: {.callout-note}
## Worked example: a single SiLU unit

Take a unit with pre activation $x = -1$. The logistic sigmoid is $\sigma(-1) = 1/(1 + e^{1}) \approx 0.2689$. The output is $\mathrm{SiLU}(-1) = (-1)(0.2689) \approx -0.2689$, slightly negative rather than the exact zero a ReLU would give. The derivative is

$$\sigma(-1)\bigl(1 + (-1)(1 - \sigma(-1))\bigr) = 0.2689\,\bigl(1 - 0.7311\bigr) \approx 0.2689 \times 0.2689 \approx 0.0723.$$

So the unit passes a small negative signal forward and, critically, a small positive gradient backward (about $0.07$). A ReLU at the same point would output exactly $0$ and backpropagate exactly $0$, contributing nothing to learning. Repeated across many units and many steps, that nonzero leakage can be the difference between a unit that recovers and one that is permanently dead.
:::

SiLU is the activation used throughout EfficientNet and many later vision backbones, and it appears in detection and segmentation networks. In language models its gated form, discussed below, is far more common than its plain form.

## 4. Mish

Mish, introduced by Misra, is closely related to SiLU but uses a softplus based gate:

$$\mathrm{Mish}(x) = x \, \tanh\!\bigl(\mathrm{softplus}(x)\bigr) = x \, \tanh\!\bigl(\ln(1 + e^{x})\bigr).$$

Like SiLU, Mish is smooth ($C^\infty$), non-monotone, bounded below (its global minimum is about $-0.31$ near $x \approx -1.19$), and unbounded above. The asymptotics mirror SiLU: as $x \to +\infty$, $\mathrm{softplus}(x) \to \infty$ so $\tanh(\mathrm{softplus}(x)) \to 1$ and $\mathrm{Mish}(x)$ approaches the identity; as $x \to -\infty$, $\mathrm{softplus}(x) \to 0^{+}$ so the gate and the output both approach $0$. Its negative dip is slightly deeper and its transition region a little wider than SiLU's, which gives a marginally smoother loss landscape in some experiments.

Empirically Mish showed gains on object detection benchmarks and became a popular choice in the YOLO line of detectors. The tradeoff is cost: evaluating an exponential, a logarithm, and a hyperbolic tangent makes Mish heavier than SiLU or even exact GELU, and the accuracy gains over SiLU are usually small and task dependent. For this reason Mish is common in computer vision but rare in the largest language models, where throughput pressure favors cheaper gates.

```text
mish(x) = x * tanh(softplus(x))    # softplus(x) = log(1 + exp(x))
```

### 4.1 Comparison of the Pointwise Gates

The pointwise activations differ mainly in their gate, their cost, and the depth of their negative dip. The table consolidates the properties derived above, with ReLU included as a baseline. The three smooth gates (GELU, SiLU/Swish, and Mish) are $C^\infty$, bounded below, unbounded above, and approach the identity as $x \to +\infty$; the approximate minimum locations and values are numerical.

| Activation | Gate $g(x)$ | Monotone | Approx. min location | Approx. min value | Relative cost |
|---|---|---|---|---|---|
| ReLU | $\mathbb{1}[x>0]$ | yes | none (kink at 0) | $0$ | lowest |
| GELU | $\Phi(x)$ | no | $x \approx -0.75$ | $-0.17$ | low (exact erf) |
| SiLU / Swish | $\sigma(x)$ | no | $x \approx -1.28$ | $-0.28$ | low |
| Mish | $\tanh(\mathrm{softplus}(x))$ | no | $x \approx -1.19$ | $-0.31$ | high |

The practical reading: GELU, SiLU, and Mish are close cousins whose curves converge for large $|x|$ (they agree to within about $0.15$ once $|x| \gtrsim 3$) and differ mainly in the shape of the transition near the origin and the depth of the dip. The choice among them is therefore driven less by their analytic differences, which are small, than by cost and architectural lineage.

## 5. The Gated Linear Unit Family

The activations above are pointwise: each scalar input maps to a scalar output. The gated linear unit (GLU), introduced by Dauphin and colleagues for convolutional sequence modeling, is different. It splits a higher dimensional projection into two halves and uses one half to gate the other.

### 5.1 Definition

Let $x \in \mathbb{R}^{d}$ be the input to a layer, treated as a row vector, and let $W, V \in \mathbb{R}^{d \times d_f}$ with biases $b, c \in \mathbb{R}^{d_f}$. The GLU is

$$\mathrm{GLU}(x) = (xW + b) \odot \sigma(xV + c),$$

where $\odot$ is elementwise multiplication. The first linear branch produces content, and the second branch, squashed by a sigmoid, produces a multiplicative gate. Crucially the gate is computed from a learned projection of the whole input rather than from the content value itself, so it is a learned, input dependent mask rather than a fixed pointwise nonlinearity. This is what distinguishes a GLU from simply applying SiLU to a single projection.

### 5.2 Generalized Gates: GeGLU and SwiGLU

Shazeer generalized GLU by replacing the sigmoid gate with other activations, producing a family of variants.
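That substitution can be sketched as a single function whose gate activation is a parameter. This is an illustrative sketch in plain Python, not code from the cited papers: the names `gated_unit` and `affine`, and the nested list layout for the matrices, are assumptions made for the example.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gelu(z: float) -> float:
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def silu(z: float) -> float:
    return z * sigmoid(z)


def identity(z: float) -> float:
    return z


def affine(x, w, b):
    """Row vector x (length d) times matrix w (d rows, d_f columns) plus bias b."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) + b[j] for j in range(len(b))]


def gated_unit(x, w, v, b, c, gate=sigmoid):
    """(xW + b) * gate(xV + c), elementwise; gate=sigmoid is the original GLU."""
    content = affine(x, w, b)
    gates = [gate(z) for z in affine(x, v, c)]
    return [u * g for u, g in zip(content, gates)]


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    zero = [0.0, 0.0]
    x = [1.0, 2.0]
    for name, gate in [("GLU", sigmoid), ("bilinear", identity),
                       ("GeGLU", gelu), ("SwiGLU", silu)]:
        print(name, gated_unit(x, eye, eye, zero, zero, gate=gate))
```

Passing the gate as an argument keeps the content branch and the gate branch visibly separate, which is the structural point of the family: only the nonlinearity applied to the second projection changes.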
With an identity gate (the bilinear variant), a GELU gate, or a Swish/SiLU gate we obtain

$$\mathrm{Bilinear}(x) = (xW)\odot(xV),$$

$$\mathrm{GeGLU}(x) = (xW)\odot \mathrm{GELU}(xV),$$

$$\mathrm{SwiGLU}(x) = (xW)\odot \mathrm{SiLU}(xV).$$

(Biases are usually dropped in transformer implementations.) In a transformer the feedforward network (FFN) normally has the form

$$\mathrm{FFN}(x) = \phi(xW_1)\,W_2,$$

with $\phi$ a pointwise activation such as ReLU or GELU. The gated variant replaces this with

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \bigl(\mathrm{SiLU}(xW_1)\odot(xV)\bigr)\,W_2,$$

which now uses three weight matrices $W_1, V, W_2$ instead of two. Here the activated branch is written with $W_1$ to parallel the plain FFN; because elementwise multiplication commutes, this is the same SwiGLU as above with the labels of the two projections swapped.

### 5.3 The Parameter Budget Adjustment

Because the gated FFN introduces a third matrix, a naive substitution increases FFN parameters and compute by roughly fifty percent. To compare fairly at fixed cost, practitioners shrink the hidden dimension. The arithmetic is exact. A standard FFN with model width $d$ and hidden width $d_f$ holds two matrices, $W_1 \in \mathbb{R}^{d \times d_f}$ and $W_2 \in \mathbb{R}^{d_f \times d}$, for $2\,d\,d_f$ parameters (ignoring biases). The gated FFN holds three matrices $W_1, V \in \mathbb{R}^{d \times d_f'}$ and $W_2 \in \mathbb{R}^{d_f' \times d}$, for $3\,d\,d_f'$ parameters. Matching the budgets,

$$3\,d\,d_f' = 2\,d\,d_f \quad\Longrightarrow\quad d_f' = \tfrac{2}{3}\,d_f.$$

So all three matrices of the gated FFN use the reduced hidden width $d_f' = \tfrac{2}{3} d_f$. This is why many open models report a feedforward hidden size close to $\tfrac{8}{3}$ of the model dimension rather than the classic factor of $4$: starting from $d_f = 4d$ and scaling by $\tfrac{2}{3}$ gives $d_f' = \tfrac{8}{3} d$.

Under this matched budget, SwiGLU and GeGLU consistently improve perplexity over plain ReLU or GELU FFNs in the experiments reported by Shazeer and reproduced widely since. The improvement is modest per layer but reliable, and it compounds across many layers and large training runs.
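The budget arithmetic and the gated block can be sketched together. The code below is a minimal illustration in plain Python under stated assumptions: the names `gated_hidden_width`, `param_counts`, `matvec`, and `swiglu_ffn` are hypothetical, matrices are nested lists, biases are omitted, and the width is simply rounded to the nearest integer (real models may round differently).

```python
import math


def silu(z: float) -> float:
    """SiLU with a numerically stable sigmoid."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def gated_hidden_width(d_model: int, expansion: int = 4) -> int:
    """d_f' = (2/3) * d_f, where d_f = expansion * d_model."""
    return round(2 * expansion * d_model / 3)


def param_counts(d_model: int, expansion: int = 4):
    """Return (plain FFN params, gated FFN params), ignoring biases."""
    d_f = expansion * d_model
    d_f_gated = gated_hidden_width(d_model, expansion)
    return 2 * d_model * d_f, 3 * d_model * d_f_gated


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def swiglu_ffn(x, w1, v, w2):
    """(SiLU(x W1) * (x V)) W2, with * taken elementwise."""
    gate = [silu(a) for a in matvec(x, w1)]
    content = matvec(x, v)
    hidden = [g * c for g, c in zip(gate, content)]
    return matvec(hidden, w2)


if __name__ == "__main__":
    print(gated_hidden_width(768))  # two thirds of 4 * 768
    print(param_counts(768))        # the two budgets should match
    print(swiglu_ffn([1.0, -2.0], [[1.0], [0.0]], [[0.0], [1.0]], [[2.0, 0.0]]))
```

When $4d$ is divisible by $3$ the two counts agree exactly; otherwise rounding leaves a small mismatch, which is one reason practical configurations tend to pick hidden widths that are convenient multiples.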
### 5.4 Why Gating Helps

Two informal explanations are commonly offered. First, the multiplicative interaction $(xW)\odot g(xV)$ gives the FFN a second order term in the input, increasing expressivity relative to the purely additive composition of a single projection and pointwise nonlinearity. Second, the gate can suppress or amplify individual channels conditionally, which acts as a soft, content dependent feature selector and may improve gradient flow. Shazeer himself noted that the gains lack a clean theoretical justification and attributed their success, with characteristic candor, to divine benevolence.

The empirical record nonetheless made these variants standard. SwiGLU is now the feedforward activation in LLaMA and its descendants, PaLM, Mistral, Qwen, and most contemporary open weight large language models. GeGLU appears in T5 variants and several Google models. The choice between them is largely a matter of lineage and minor preference, since their measured differences are small.

## 6. How Activations Are Chosen Today

The selection of an activation function in modern practice is governed by a few practical considerations rather than by any single dominant theory.

### 6.1 Smoothness and Optimization

Smooth, non-monotone gates such as GELU and SiLU are widely believed to produce better conditioned loss surfaces than the kinked ReLU, and they mitigate dead units because their gradient is nonzero at almost every negative input rather than identically zero. For very deep networks trained with large batch sizes and adaptive optimizers, this smoothness tends to translate into more stable training and slightly better final accuracy.

### 6.2 Compute and Memory

Throughput matters at scale. Exact GELU and SiLU are cheap and are well supported by fused kernels on GPUs and accelerators. Mish is more expensive and so is reserved for settings where its small accuracy edge justifies the cost, mostly in vision.
For the gated variants, the dominant cost is the extra projection matrix, which is why the two thirds width adjustment is essential to keep comparisons honest.

### 6.3 Architecture and Lineage

In practice the activation is often inherited from a reference architecture. Convolutional vision backbones tend to use ReLU, SiLU, or Mish. Transformer language models almost universally use either GELU in plain FFNs or SwiGLU and GeGLU in gated FFNs. New models rarely re-derive the choice from scratch; they adopt what worked in the closest successful predecessor and tune from there.

### 6.4 A Practical Default

For a transformer trained today, the common recommendation is a SwiGLU feedforward block with the hidden width scaled by two thirds to hold the parameter budget fixed. For a plain pointwise activation, GELU and SiLU are both safe defaults. For a convolutional vision model under tight latency budgets, ReLU or SiLU remain sensible. The marginal differences among the smooth gated families are small enough that data quality, model scale, normalization, and optimizer settings usually dominate the final result.

### 6.5 When to Use Each, and Common Pitfalls

The following guidance summarizes the tradeoffs in operational terms.

- **Use GELU** for plain (non-gated) transformer feedforward blocks, especially when reproducing or fine tuning BERT or GPT-2 era models. Pitfall: the tanh and sigmoid approximations are not bit identical to the exact erf form, so mixing implementations when loading historical checkpoints introduces small numerical drift. Match the activation variant to the one the weights were trained with.
- **Use SiLU or Swish** for convolutional vision backbones and as the gate inside SwiGLU. Pitfall: with a learnable $\beta$, the parameter can drift toward the ReLU limit ($\beta \to \infty$) or the linear limit ($\beta \to 0$) during training; if you do not need the extra flexibility, fixing $\beta = 1$ (plain SiLU) is simpler and usually as good.
- **Use Mish** when a small vision accuracy gain justifies extra compute, as in some object detectors. Pitfall: it is the most expensive of the group (an exponential, a logarithm, and a tanh per element), so it is rarely worth it under throughput pressure, and almost never in large language models.
- **Use SwiGLU or GeGLU** for the feedforward blocks of new large language models. Pitfall: forgetting the two thirds width adjustment silently inflates parameters and compute by about fifty percent, which both breaks fair comparisons and wastes budget. Always scale the hidden width when substituting a gated FFN for a plain one.

A pitfall shared across the smooth gates concerns numerical stability at extreme inputs. Naive implementations of the sigmoid or softplus can overflow for large $|x|$; mature open source frameworks (PyTorch, JAX, TensorFlow) provide fused, numerically stable kernels for GELU, SiLU, and softplus, so prefer the built in operators over hand rolled formulas in production code.

## 7. Summary

Modern activation functions are unified by the idea of soft self gating, where a unit multiplies its input by a smooth, learned, or input dependent gate instead of a hard threshold. GELU gates by the Gaussian CDF, SiLU and Swish gate by the sigmoid, and Mish gates by a softplus driven tanh. The gated linear unit family lifts this idea to the layer level, using a separate learned projection to gate a content projection, and its members GeGLU and SwiGLU now define the feedforward blocks of state of the art language models. The choice among them today is driven by smoothness for optimization, compute cost at scale, and architectural lineage, with the smooth gated variants the clear default for new work.

## References

1. Hendrycks, D. and Gimpel, K. "Gaussian Error Linear Units (GELUs)." 2016. https://arxiv.org/abs/1606.08415
2. Elfwing, S., Uchibe, E., and Doya, K. "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning." 2017. https://arxiv.org/abs/1702.03118
3. Ramachandran, P., Zoph, B., and Le, Q. V. "Searching for Activation Functions." 2017. https://arxiv.org/abs/1710.05941
4. Misra, D. "Mish: A Self Regularized Non-Monotonic Activation Function." 2019. https://arxiv.org/abs/1908.08681
5. Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. "Language Modeling with Gated Convolutional Networks." 2017. https://arxiv.org/abs/1612.08083
6. Shazeer, N. "GLU Variants Improve Transformer." 2020. https://arxiv.org/abs/2002.05202
7. Vaswani, A. et al. "Attention Is All You Need." 2017. https://arxiv.org/abs/1706.03762
8. Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." 2023. https://arxiv.org/abs/2302.13971
9. Tan, M. and Le, Q. V. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." 2019. https://arxiv.org/abs/1905.11946
10. Nair, V. and Hinton, G. E. "Rectified Linear Units Improve Restricted Boltzmann Machines." 2010. https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf