185 Sigmoid and Tanh Activations

The logistic sigmoid and the hyperbolic tangent are the two classical bounded nonlinearities that dominated neural network design from the 1980s through the early 2010s. Although the rectified linear family has since displaced them in the hidden layers of most feedforward and convolutional architectures, sigmoid and tanh remain indispensable in specific roles: probability outputs, gating mechanisms in recurrent and attention-based models, and any setting where a smooth saturating squashing function is required. Understanding their analytic properties, and in particular the way they fail, is foundational for diagnosing training pathologies and for appreciating why later designs took the shape they did.

185.1 1. Definitions and Analytic Properties

185.1.1 1.1 The Logistic Sigmoid

The logistic sigmoid maps the real line onto the open interval $(0, 1)$:

\[ \sigma(x) = \frac{1}{1 + e^{-x}}. \]

It is monotonically increasing, infinitely differentiable (it is real analytic), and antisymmetric about the point $(0, \tfrac{1}{2})$, meaning $\sigma(-x) = 1 - \sigma(x)$. As $x \to +\infty$ the output approaches $1$, and as $x \to -\infty$ it approaches $0$. The value at the origin is exactly $\tfrac{1}{2}$. Because its range coincides with the unit interval, $\sigma(x)$ is naturally read as a probability, which is the reason it serves as the output nonlinearity for binary classification and for the marginal probabilities in multilabel problems.

The sigmoid is the inverse of the logit (log-odds) function. If $p = \sigma(x)$ then

\[ x = \sigma^{-1}(p) = \ln\!\frac{p}{1 - p}, \qquad p \in (0, 1). \]

This inverse relation is the precise sense in which a network’s final pre-activation, the logit, is a log-odds score: the sigmoid exponentiates and normalizes that score into a probability. It also explains why sigmoid sits at the output of logistic regression, the simplest member of the generalized linear model family, where the linear predictor is exactly the logit of the response.
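The inverse relationship between the sigmoid and the logit can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `sigmoid` and `logit` are illustrative choices, and this plain `sigmoid` carries no overflow guard, so it is meant for moderate inputs only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid; adequate for moderate |x| (no overflow guard)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))


for x in (-3.0, 0.0, 1.5):
    p = sigmoid(x)
    # logit(p) should recover x up to floating-point rounding.
    print(f"x={x:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

The same loop also makes the point symmetry visible: the probabilities printed for $x$ and $-x$ sum to one.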

185.1.2 1.2 The Hyperbolic Tangent

The hyperbolic tangent maps the real line onto the open interval $(-1, 1)$:

\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \]

It is an odd function, $\tanh(-x) = -\tanh(x)$, and it passes through the origin. The two functions are not independent. Starting from $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ and dividing numerator and denominator by $e^{x}$ gives $\tanh(x) = (1 - e^{-2x})/(1 + e^{-2x})$. Since $\sigma(2x) = 1/(1 + e^{-2x})$, we have $2\,\sigma(2x) - 1 = \bigl(2 - (1 + e^{-2x})\bigr)/(1 + e^{-2x}) = (1 - e^{-2x})/(1 + e^{-2x})$, which is the identity

\[ \tanh(x) = 2\,\sigma(2x) - 1, \]

so tanh is a rescaled and shifted sigmoid. This identity is worth internalizing: every qualitative statement about one function has a direct counterpart for the other, and the practical differences between them reduce almost entirely to range and centering. Equivalently, $\sigma(x) = \tfrac{1}{2}\bigl(1 + \tanh(x/2)\bigr)$.
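As a quick sanity check of the identity, the sketch below compares $2\sigma(2x) - 1$ against the standard library's `math.tanh` on a grid of points. The helper name `tanh_via_sigmoid` and the grid are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh expressed as a rescaled and shifted sigmoid: 2*sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Largest absolute disagreement with math.tanh on the grid -5.0, -4.9, ..., 5.0.
grid = [i / 10.0 for i in range(-50, 51)]
max_err = max(abs(tanh_via_sigmoid(x) - math.tanh(x)) for x in grid)
print(f"max |2*sigmoid(2x) - 1 - tanh(x)| = {max_err:.2e}")
```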

185.1.3 1.3 Numerically Stable Evaluation

The textbook formula $1/(1 + e^{-x})$ overflows for large negative $x$, since $e^{-x}$ grows without bound. Mature numerical libraries avoid this with a sign-split branch that keeps every exponential argument nonpositive:

\[ \sigma(x) = \begin{cases} \dfrac{1}{1 + e^{-x}}, & x \ge 0,\\[2ex] \dfrac{e^{x}}{1 + e^{x}}, & x < 0. \end{cases} \]

Both branches are algebraically identical to the definition, but neither ever exponentiates a large positive number, so the computation stays in floating-point range. The same idea underlies the log-sigmoid used in loss functions, $\log \sigma(x) = -\operatorname{softplus}(-x)$ where $\operatorname{softplus}(z) = \log(1 + e^{z})$. The standard open-source libraries implement both directly (SciPy as scipy.special.expit and scipy.special.log_expit; PyTorch, JAX, and TensorFlow through their own sigmoid and log-sigmoid functions), so users rarely need to write the branch by hand.
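For readers who want to see the branch spelled out, here is a minimal pure-Python sketch of the sign-split sigmoid and a matching log-sigmoid. The function names are illustrative, and the softplus rewrite $\max(z, 0) + \log(1 + e^{-|z|})$ is one common stable form, not necessarily what any given library does internally.

```python
import math


def naive_sigmoid(x: float) -> float:
    """Textbook formula; math.exp(-x) raises OverflowError for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def stable_sigmoid(x: float) -> float:
    """Sign-split sigmoid: every exponential argument is nonpositive."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so e lies in [0, 1)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    """log(1 + e^z) rewritten as max(z, 0) + log1p(exp(-|z|)) to avoid overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_sigmoid(x: float) -> float:
    """log sigma(x) = -softplus(-x)."""
    return -softplus(-x)


print(stable_sigmoid(-1000.0), stable_sigmoid(1000.0))
print(log_sigmoid(-1000.0))
```

Note that `log(stable_sigmoid(-1000.0))` would fail because the sigmoid underflows to zero, whereas `log_sigmoid` returns a finite value; this is the practical reason to compute the log-sigmoid directly.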

185.2 2. Derivatives

185.2.1 2.1 Closed Form Expressions

The derivative of the sigmoid admits an unusually convenient self-referential form. Differentiating $\sigma(x) = (1 + e^{-x})^{-1}$ by the chain rule gives $\sigma'(x) = e^{-x}/(1 + e^{-x})^{2}$. Factoring this as $\sigma(x) \cdot e^{-x}/(1 + e^{-x})$ and recognizing $e^{-x}/(1 + e^{-x}) = 1 - \sigma(x)$ collapses it to

\[ \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr). \]

Once the forward activation $a = \sigma(x)$ has been computed, the gradient costs only a subtraction and a multiplication, with no further calls to the exponential. The tanh derivative is similarly compact:

\[ \frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x) = \operatorname{sech}^{2}(x). \]

Both expressions are central to backpropagation, where the local gradient of each unit multiplies the incoming error signal. The fact that the derivative is expressible in terms of the already cached output (not the raw pre-activation) is one reason these functions were so attractive in the era before automatic differentiation: the backward pass reuses the forward activation directly.
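The reuse of the cached forward output can be illustrated with a short sketch that computes each derivative from the activation alone and compares it against a central finite difference. The helper names and the step size `h` are assumptions chosen for the illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad_from_output(a: float) -> float:
    """sigma'(x) written purely in terms of the cached output a = sigma(x)."""
    return a * (1.0 - a)


def tanh_grad_from_output(t: float) -> float:
    """tanh'(x) written purely in terms of the cached output t = tanh(x)."""
    return 1.0 - t * t


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 0.7, 3.0):
    a, t = sigmoid(x), math.tanh(x)
    print(
        f"x={x:+.1f}  "
        f"sigmoid: {sigmoid_grad_from_output(a):.6f} vs {central_difference(sigmoid, x):.6f}  "
        f"tanh: {tanh_grad_from_output(t):.6f} vs {central_difference(math.tanh, x):.6f}"
    )
```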

185.2.2 2.2 Magnitude of the Slope

The crucial quantitative fact is how small these derivatives are. The sigmoid derivative attains its maximum at $x = 0$, where $\sigma(0) = \tfrac{1}{2}$ and therefore

\[ \sigma'(0) = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}. \]

So even at its steepest point the sigmoid passes through at most one quarter of the incoming gradient. The tanh derivative is more generous at the origin, reaching $1 - \tanh^{2}(0) = 1$. This larger maximum slope is one reason tanh is generally preferred over sigmoid for hidden units when a bounded activation is desired. Away from the origin both derivatives decay rapidly toward zero, which is the source of the difficulties described next. Note that $\sigma'(x) \le \tfrac14$ for all $x$, and as $a = \sigma(x)$ ranges over $(0,1)$ the product $a(1-a)$ is a downward parabola peaking at $a = \tfrac12$, so the gradient is largest exactly where the unit is most uncertain.

x        sigma'(x)     tanh'(x)
 0.0      0.2500        1.0000
 2.0      0.1050        0.0707
 4.0      0.0177        0.0013
 6.0      0.0025        0.0000
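The values in the table can be regenerated with a few lines of Python. This is a minimal sketch; `slope_table` is a hypothetical helper name, and the four sample points simply mirror the rows above.

```python
import math


def slope_table(xs):
    """Return (x, sigma'(x), tanh'(x)) for each x, using the cached-output forms."""
    rows = []
    for x in xs:
        s = 1.0 / (1.0 + math.exp(-x))
        t = math.tanh(x)
        rows.append((x, s * (1.0 - s), 1.0 - t * t))
    return rows


rows = slope_table([0.0, 2.0, 4.0, 6.0])
print("   x   sigma'(x)   tanh'(x)")
for x, d_sigmoid, d_tanh in rows:
    print(f"{x:4.1f}  {d_sigmoid:10.4f} {d_tanh:10.4f}")
```

Beyond the origin the tanh slope falls below the sigmoid slope, which is consistent with the factor of two in $\tanh(x) = 2\sigma(2x) - 1$.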

185.3 3. Saturation and the Vanishing Gradient

185.3.1 3.1 What Saturation Means

A unit is said to saturate when its pre-activation $x$ has large magnitude, placing the output near one of the asymptotes. In that regime the curve is nearly flat, so the local derivative is close to zero. For the sigmoid, once $|x|$ exceeds roughly $5$ the derivative has fallen below $0.007$; for tanh the collapse is even sharper because of the factor of two inside the equivalent logistic form. A saturated unit still produces a sensible forward output, but it has almost stopped responding to changes in its input, and it transmits almost no gradient backward. The decay is exponential: for large $|x|$, $\sigma'(x) \approx e^{-|x|}$ and $1 - \tanh^{2}(x) \approx 4 e^{-2|x|}$, so the backward signal through a saturated unit decays exponentially in the magnitude of the pre-activation.
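The exponential tails quoted above can be checked by forming the ratio of each derivative to its asymptotic approximation, which should approach one as $x$ grows. The sketch below is illustrative only; the function names and sample points are assumptions.

```python
import math


def sigmoid_slope(x: float) -> float:
    """sigma'(x); the slope is even in x, so |x| is used to keep exp() small."""
    s = 1.0 / (1.0 + math.exp(-abs(x)))
    return s * (1.0 - s)


def tanh_slope(x: float) -> float:
    t = math.tanh(x)
    return 1.0 - t * t


for x in (3.0, 5.0, 8.0, 10.0):
    ratio_sigmoid = sigmoid_slope(x) / math.exp(-x)
    ratio_tanh = tanh_slope(x) / (4.0 * math.exp(-2.0 * x))
    # Both ratios should tend to 1 as x increases.
    print(f"x={x:4.1f}  sigmoid ratio={ratio_sigmoid:.5f}  tanh ratio={ratio_tanh:.5f}")
```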

185.3.2 3.2 Propagation Through Depth

The damage compounds with depth. Backpropagation forms the gradient of the loss with respect to an early parameter as a product of per-layer Jacobians. Schematically, for a chain of $L$ layers the gradient magnitude scales like

\[ \left\|\frac{\partial \mathcal{L}}{\partial x^{(1)}}\right\| \;\sim\; \prod_{\ell=1}^{L} \bigl\|\,\mathrm{diag}\bigl(f'(x^{(\ell)})\bigr)\,W^{(\ell)}\,\bigr\|. \]

If each factor contributes a per-layer derivative bounded by $\tfrac{1}{4}$ (sigmoid) and the weight norms do not compensate, the product shrinks geometrically. After ten sigmoid layers the upper bound on the surviving gradient is on the order of $4^{-10} \approx 10^{-6}$. Early layers therefore receive a vanishingly small learning signal and update extremely slowly, while later layers train normally. This is the vanishing gradient problem, identified in Hochreiter’s 1991 thesis and analyzed in detail by Bengio, Simard, and Frasconi in 1994. It is the single most important reason deep stacks of sigmoid or tanh units were historically so hard to train.

The flow of signal forward and the decay of gradient backward can be pictured as two passes over the same stack.

flowchart LR
  X["input x"] --> A1["sigmoid layer 1"]
  A1 --> A2["sigmoid layer 2"]
  A2 --> A3["sigmoid layer L"]
  A3 --> Y["output"]
  Y -. "grad scaled by sigma prime" .-> A3
  A3 -. "grad x 1/4 or less" .-> A2
  A2 -. "grad x 1/4 or less" .-> A1
  A1 -. "tiny grad reaches input" .-> X

The solid arrows are the forward activations, which stay well defined at every depth. The dashed arrows are the backward gradients, each multiplied by a per-layer factor no larger than $\tfrac14$, so the signal reaching the first layer is exponentially attenuated in $L$.
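To make the geometric attenuation concrete, the following sketch backpropagates through a deliberately simplified scalar chain $a_\ell = \sigma(w\,a_{\ell-1})$ and accumulates the product of per-layer factors $w\,\sigma'(z_\ell)$. The scalar chain, the shared weight `weight`, and the starting value `a0` are all assumptions made for illustration; a real network has matrices in place of scalars.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradient(depth: int, weight: float = 1.0, a0: float = 0.5) -> float:
    """d a_depth / d a_0 for the scalar chain a_l = sigmoid(weight * a_{l-1})."""
    a = a0
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        grad *= weight * a * (1.0 - a)  # per-layer factor, at most weight / 4
    return grad


for depth in (1, 5, 10, 20):
    bound = 0.25 ** depth
    print(f"depth={depth:2d}  gradient={chain_gradient(depth):.3e}  bound 4^-L={bound:.3e}")
```

With unit weights the actual gradient sits below the $4^{-L}$ bound at every depth, because the units do not operate exactly at the origin.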

185.3.3 3.3 Why Initialization and Range Matter

Saturation is not inevitable; it depends on the distribution of pre-activations. If weights are initialized so that the variance of $x$ stays near unity, most units operate in the high-slope region near the origin. The Glorot and Bengio initialization of 2010 was derived precisely to keep activation and gradient variances stable across layers for tanh networks, and it substantially mitigated the symptom. Concretely, for a layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, the Glorot scheme draws weights with variance $2/(n_{\text{in}} + n_{\text{out}})$, the compromise that approximately preserves both the forward activation variance and the backward gradient variance. Saturating nonlinearities thus place a real burden on careful initialization and on input normalization in a way that the non-saturating rectified family does not.
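A minimal sketch of the Glorot scheme follows, using the uniform variant: sampling from $[-r, r]$ with $r = \sqrt{6/(n_{\text{in}} + n_{\text{out}})}$ gives variance $r^{2}/3 = 2/(n_{\text{in}} + n_{\text{out}})$. The function name `glorot_uniform`, the layer sizes, and the seed are assumptions for illustration; frameworks ship their own initializers.

```python
import math
import random


def glorot_uniform(n_in: int, n_out: int, rng: random.Random):
    """Sample an n_out x n_in weight matrix with variance 2 / (n_in + n_out)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
n_in, n_out = 256, 128
weights = glorot_uniform(n_in, n_out, rng)

# Compare the empirical variance against the target 2 / (n_in + n_out).
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
target = 2.0 / (n_in + n_out)
print(f"empirical variance={variance:.6f}  target={target:.6f}")
```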

185.4 4. The Zero-Centering Argument

185.4.1 4.1 Sigmoid Outputs Are Always Positive

A more subtle defect of the sigmoid concerns the sign of its outputs. Because $\sigma(x) \in (0, 1)$, every value fed from one layer into the next is strictly positive. Consider a weight $w_i$ feeding into a single downstream neuron with pre-activation $z = \sum_i w_i a_i + b$. During backpropagation the gradient with respect to that weight is

\[ \frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial z}\, a_i. \]

If all incoming activations $a_i$ are positive, then the sign of every weight gradient into that neuron is determined entirely by the single scalar $\partial \mathcal{L} / \partial z$. Consequently, for a single training example, all weights of that neuron must increase together or decrease together on a given step; averaging over a minibatch weakens this coupling but does not remove it. The weights cannot move in independent directions, which forces the optimizer to follow an inefficient zig-zag trajectory toward minima that would otherwise be reached more directly. Geometrically, the update direction on each step is confined to either the all-positive or the all-negative orthant of weight space, so a target that lies in a different orthant can only be reached by an alternating staircase of moves rather than a straight line.
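The sign coupling is easy to see on a toy example. In the sketch below, the activation values and the upstream gradient `dL_dz` are made-up numbers chosen only to show the pattern for a single training example.

```python
def weight_gradients(dL_dz: float, activations):
    """dL/dw_i = dL/dz * a_i for one neuron and one training example."""
    return [dL_dz * a for a in activations]


def signs(values):
    return [1 if v > 0 else -1 if v < 0 else 0 for v in values]


dL_dz = -0.3                       # hypothetical upstream gradient
sigmoid_inputs = [0.2, 0.7, 0.9]   # sigmoid outputs: all strictly positive
tanh_inputs = [-0.6, 0.4, 0.8]     # tanh outputs: mixed signs

print("sigmoid inputs:", signs(weight_gradients(dL_dz, sigmoid_inputs)))
print("tanh inputs:   ", signs(weight_gradients(dL_dz, tanh_inputs)))
```

With positive inputs every gradient shares the sign of `dL_dz`; with zero-centered inputs the signs differ across weights, so the update is no longer restricted to one orthant.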

185.4.2 4.2 How Tanh Resolves It

Because tanh is an odd function and produces both positive and negative outputs, the activations entering the next layer have a mean closer to zero. The per-weight gradients then carry mixed signs and the zig-zag effect is largely removed. This zero-centering property, articulated by LeCun and colleagues in their influential note on efficient backpropagation, is the primary reason tanh was the default hidden-layer choice throughout the 1990s and 2000s whenever a bounded nonlinearity was used. A related concern with activation statistics later motivated batch normalization and other mean-centering schemes, which restore a favorable activation distribution even when the nonlinearity itself does not.

185.5 5. The Sigmoid and Binary Cross-Entropy

The sigmoid earns its place at classification outputs because of an elegant cancellation with its natural loss. Let $z$ be the logit, $p = \sigma(z)$ the predicted probability, and $y \in \{0, 1\}$ the label. The binary cross-entropy loss is

\[ \mathcal{L} = -\bigl[\, y \log p + (1 - y) \log(1 - p)\,\bigr]. \]

Differentiating with respect to the logit and substituting $\sigma'(z) = p(1 - p)$ produces a striking simplification:

\[ \frac{\partial \mathcal{L}}{\partial z} = \left(-\frac{y}{p} + \frac{1 - y}{1 - p}\right) p(1 - p) = p - y. \]

The factor $p(1 - p)$ from the sigmoid derivative cancels the matching denominators in the loss derivative, leaving the residual $p - y$. Two consequences follow. First, the gradient with respect to the logit does not vanish merely because the sigmoid saturates: the small slope is exactly offset by the large loss gradient near the asymptotes, so a confidently wrong prediction still receives a gradient of magnitude close to $1$, and the gradient shrinks only as the prediction approaches the label. This is why the cross-entropy loss, not the squared error, is the natural partner for a sigmoid output: with squared error the $p(1-p)$ factor survives in the logit gradient, so a saturated output unit can stall even when it is wrong. Second, the result is the same clean residual that softmax with categorical cross-entropy yields in the multiclass case. In practice one fuses the sigmoid and the loss into a single numerically stable operation, the binary-cross-entropy-with-logits primitive provided by the major open-source frameworks, rather than applying them separately.

# conceptual pairing, not a runnable snippet
logit  -> sigmoid -> p in (0,1)
loss   = -[ y*log(p) + (1-y)*log(1-p) ]
dloss/dlogit = p - y
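The conceptual pairing above can be turned into a small runnable sketch. The fused form used here, $\mathcal{L} = \max(z, 0) - zy + \log(1 + e^{-|z|})$, follows from $\mathcal{L} = \operatorname{softplus}(z) - zy$; it is one common way to write a stable with-logits loss and is offered as an assumption about implementation style, not as the code of any particular framework. The function names are hypothetical.

```python
import math


def stable_sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(z: float, y: float) -> float:
    """Binary cross-entropy computed from the logit, never forming log(p) directly."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_grad_logit(z: float, y: float) -> float:
    """Analytic gradient dL/dz = p - y."""
    return stable_sigmoid(z) - y


# Check the analytic gradient against a central finite difference.
z, y, h = 0.8, 1.0, 1e-5
numeric = (bce_with_logits(z + h, y) - bce_with_logits(z - h, y)) / (2.0 * h)
print(f"analytic={bce_grad_logit(z, y):+.6f}  numeric={numeric:+.6f}")

# A confidently wrong prediction: the loss stays finite and the gradient is near 1.
print(bce_with_logits(1000.0, 0.0), bce_grad_logit(1000.0, 0.0))
```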

185.5.1 5.1 A Worked Single-Neuron Example

Consider one logistic unit with a single input feature, computing $p = \sigma(wx + b)$, trained with binary cross-entropy. Take $w = 0.5$, $b = 0$, a training pair $(x, y) = (2, 1)$, and learning rate $\eta = 0.1$. The forward pass gives the logit $z = 0.5 \cdot 2 + 0 = 1$ and the probability $p = \sigma(1) \approx 0.731$. The loss is $-\log(0.731) \approx 0.313$.

The backward pass uses the cancellation above. The logit gradient is $p - y = 0.731 - 1 = -0.269$. By the chain rule the parameter gradients are

\[ \frac{\partial \mathcal{L}}{\partial w} = (p - y)\,x = -0.269 \cdot 2 = -0.538, \qquad \frac{\partial \mathcal{L}}{\partial b} = p - y = -0.269. \]

A single gradient-descent step yields $w \leftarrow 0.5 - 0.1 \cdot (-0.538) = 0.554$ and $b \leftarrow 0 - 0.1 \cdot (-0.269) = 0.027$. Recomputing the forward pass gives $z = 0.554 \cdot 2 + 0.027 \approx 1.135$ and $p \approx 0.757$, closer to the target of $1$, with the loss falling to about $0.279$. The negative weight gradient pushed the logit up exactly because the prediction was below the label, and the magnitude was scaled by the input feature, the hallmark behavior of a logistic unit.
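The arithmetic of this worked example can be reproduced in a few lines. The sketch below uses the same numbers as the text; the variable names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Values taken from the worked example.
w, b = 0.5, 0.0
x, y = 2.0, 1.0
lr = 0.1

# Forward pass.
z = w * x + b
p = sigmoid(z)
loss = -math.log(p)          # y = 1, so only the log(p) term contributes

# Backward pass via the cancellation dL/dz = p - y.
dz = p - y
dw, db = dz * x, dz

# One gradient-descent step, then a second forward pass.
w_new, b_new = w - lr * dw, b - lr * db
p_new = sigmoid(w_new * x + b_new)
loss_new = -math.log(p_new)

print(f"p={p:.3f} loss={loss:.3f} dw={dw:.3f} db={db:.3f}")
print(f"w={w_new:.3f} b={b_new:.3f} p={p_new:.3f} loss={loss_new:.3f}")
```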

185.6 6. Where They Are Still Used

The retreat of sigmoid and tanh from generic hidden layers does not mean they are obsolete. They survive in roles where their bounded, smooth, probabilistic character is exactly what is needed.

185.6.1 6.1 Output Layers and Probabilities

The sigmoid remains the standard output nonlinearity for binary classification, where it converts a single logit into a probability, and for multilabel classification, where an independent sigmoid is applied to each class logit. It pairs naturally with the binary cross-entropy loss, whose gradient with respect to the logit simplifies to the difference between the predicted probability and the target, as derived in Section 5, and the two are normally fused into a single with-logits operation as described there.

185.6.2 6.2 Gating in Recurrent and Gated Architectures

Gated recurrent designs depend essentially on saturating nonlinearities. In the LSTM the input, forget, and output gates each use a sigmoid because a gate value near $0$ or $1$ acts as a soft binary switch that admits or blocks information, while the cell candidate and the exposed cell state use tanh to keep the signal bounded in $(-1, 1)$. The GRU follows the same pattern with its update and reset gates. The constant error carousel of the LSTM was designed specifically to bypass the vanishing gradient by giving the cell state an additive path through time, ungated in the original 1997 formulation and scaled by the forget gate in the now-standard variant, which is what allows these saturating gates to be used safely across long sequences. A common practical refinement is to initialize the forget-gate bias to a positive value so the gate starts near $1$, keeping the cell state’s memory path open early in training.
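The division of labor between sigmoid gates and tanh squashing can be sketched with a single scalar LSTM step in the now-standard form with a forget gate. Everything here is a simplification for illustration: the scalar state, the `params` layout of `(w_x, w_h, bias)` triples, and the specific numbers are assumptions, not a production implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Sign-split logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params[name] = (w_x, w_h, bias) for i, f, o, g."""
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("i"))      # input gate in (0, 1)
    f = sigmoid(pre("f"))      # forget gate in (0, 1)
    o = sigmoid(pre("o"))      # output gate in (0, 1)
    g = math.tanh(pre("g"))    # cell candidate in (-1, 1)
    c = f * c_prev + i * g     # additive cell-state update
    h = o * math.tanh(c)       # exposed state, bounded in (-1, 1)
    return h, c, (i, f, o)


# Hypothetical parameters: a large positive forget bias keeps the memory path open,
# and a large negative input bias blocks new content.
params = {
    "i": (0.5, 0.1, -10.0),
    "f": (0.5, 0.1, 10.0),
    "o": (0.5, 0.1, 0.0),
    "g": (1.0, 0.1, 0.0),
}
h, c, gates = lstm_step(x=0.5, h_prev=0.0, c_prev=2.0, params=params)
print(f"h={h:.4f}  c={c:.4f}  gates(i, f, o)={tuple(round(v, 4) for v in gates)}")
```

With the forget gate saturated near one and the input gate near zero, the cell state passes through almost unchanged, which is the behavior the positive forget-bias initialization is meant to encourage.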

185.6.3 6.3 Attention, Gating, and Smooth Approximations

Sigmoid gating reappears in modern Transformer variants. Gated linear unit blocks multiply one projection by a sigmoid or related gate of another, and several recent feedforward designs use such multiplicative gates to improve quality. The sigmoid also hides inside the smooth activations that did replace it in hidden layers. The SiLU, or swish, is defined as $x\,\sigma(x)$, and the GELU is well approximated by $x\,\sigma(1.702\,x)$, so the logistic curve persists as a smooth gate even in architectures that no longer expose it directly. Tanh, for its part, is used to bound the output of policy networks in continuous control, to squash regression targets into a fixed range, and as the nonlinearity in the closed-form approximation of GELU.
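The activations mentioned in this subsection are short enough to write out. In the sketch below, `gelu_exact` uses the definition $x\,\Phi(x)$ with the Gaussian CDF expressed through `math.erf`, and `gelu_tanh` uses the tanh approximation with the constants $\sqrt{2/\pi}$ and $0.044715$ from the GELU paper; neither formula is spelled out in the text above, so treat them as background supplied for the comparison. The function names and evaluation grid are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu(x: float) -> float:
    """SiLU / swish: x * sigma(x)."""
    return x * sigmoid(x)


def gelu_exact(x: float) -> float:
    """GELU as x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation: x * sigma(1.702 x)."""
    return x * sigmoid(1.702 * x)


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


grid = [i / 100.0 for i in range(-400, 401)]
err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)
err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
print(f"max error on [-4, 4]: sigmoid form={err_sigmoid:.4f}  tanh form={err_tanh:.5f}")
```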

185.7 7. When to Use, and Pitfalls

This section gives a short decision guide and lists the failure modes that recur in practice.

Use a sigmoid output for binary or multilabel classification, and always pair it with the fused with-logits loss rather than applying the sigmoid and then a separate cross-entropy. The separate form gives up the numerical stability of the fused operation and risks evaluating $\log 0$ when the output rounds to exactly $0$ or $1$.
Use tanh in preference to sigmoid for any bounded hidden activation, because its zero-centered range avoids the coupled-sign update pathology and its steeper slope at the origin passes more gradient.
Avoid stacking many sigmoid or tanh layers as a deep feedforward trunk. Prefer the rectified family, which does not saturate for positive inputs, combined with residual connections or normalization layers, which keep the gradient from attenuating geometrically with depth.
If you must use saturating units in depth, initialize with the Glorot scheme and normalize inputs so pre-activations sit near the high-slope region, and monitor the fraction of units whose outputs cluster near the asymptotes, a direct symptom of saturation.
For gates in recurrent and gated blocks the sigmoid is the right tool precisely because it saturates: a gate wants to commit toward fully open or fully closed. Here saturation is a feature, not a defect.
Watch for silent saturation at initialization: a too-large weight scale drives units into the flat regions before the first update, after which they receive almost no gradient and never recover. This presents as a loss that plateaus immediately and is best diagnosed by histogramming the activations.
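The monitoring advice above can be sketched as a small diagnostic that reports the fraction of activations lying close to either asymptote. The helper name `saturation_fraction`, the `margin` of 0.05, and the simulated Gaussian pre-activations are all assumptions for illustration; in a real model the activations would be collected from a forward pass.

```python
import math
import random


def saturation_fraction(activations, low: float, high: float, margin: float = 0.05) -> float:
    """Fraction of outputs within `margin` of the lower or upper asymptote."""
    saturated = sum(1 for a in activations if a < low + margin or a > high - margin)
    return saturated / len(activations)


rng = random.Random(0)
n = 2000

# Simulated tanh units: well-scaled pre-activations versus a too-large weight scale.
well_scaled = [math.tanh(rng.gauss(0.0, 1.0)) for _ in range(n)]
too_large = [math.tanh(rng.gauss(0.0, 10.0)) for _ in range(n)]

frac_small = saturation_fraction(well_scaled, low=-1.0, high=1.0)
frac_large = saturation_fraction(too_large, low=-1.0, high=1.0)
print(f"saturated fraction: std 1 -> {frac_small:.2f}, std 10 -> {frac_large:.2f}")
```

For sigmoid units the same function applies with `low=0.0` and `high=1.0`. A fraction that is already high before the first update points at the initialization scale rather than at the data.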

185.8 8. Summary

Sigmoid and tanh are smooth, bounded, monotone nonlinearities related by $\tanh(x) = 2\sigma(2x) - 1$, with the convenient derivatives $\sigma' = \sigma(1-\sigma)$ and $\tanh' = 1 - \tanh^{2}$. Their maximum slopes, $\tfrac{1}{4}$ and $1$ respectively, together with rapid saturation away from the origin, produce the vanishing gradient that makes deep stacks of either function hard to train. The strictly positive range of the sigmoid additionally couples the signs of weight updates and slows optimization, a defect that tanh’s zero-centering partly cures and that normalization layers later addressed in general. The sigmoid’s cancellation with binary cross-entropy, which reduces the output gradient to the clean residual $p - y$, explains its enduring place at classification heads. Despite losing the hidden layers of feedforward and convolutional networks to the rectified family, both functions remain central to probability outputs, to the gates of recurrent and gated architectures, and to the smooth gated activations of contemporary Transformers.

185.9 9. References

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991. https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf
Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994. https://doi.org/10.1109/72.279181
LeCun, Y., Bottou, L., Orr, G., and Müller, K.-R. Efficient BackProp. In Neural Networks: Tricks of the Trade, 1998. https://doi.org/10.1007/3-540-49430-8_2
Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a.html
Cho, K. et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP, 2014. https://doi.org/10.3115/v1/D14-1179
Hendrycks, D. and Gimpel, K. Gaussian Error Linear Units (GELUs). 2016. https://arxiv.org/abs/1606.08415
Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 2018. https://doi.org/10.1016/j.neunet.2017.12.012
Shazeer, N. GLU Variants Improve Transformer. 2020. https://arxiv.org/abs/2002.05202

# Sigmoid and Tanh Activations The logistic sigmoid and the hyperbolic tangent are the two classical bounded nonlinearities that dominated neural network design from the 1980s through the early 2010s. Although the rectified linear family has since displaced them in the hidden layers of most feedforward and convolutional architectures, sigmoid and tanh remain indispensable in specific roles: probability outputs, gating mechanisms in recurrent and attention based models, and any setting where a smooth saturating squashing function is required. Understanding their analytic properties, and in particular the way they fail, is foundational for diagnosing training pathologies and for appreciating why later designs took the shape they did. ## 1. Definitions and Analytic Properties ### 1.1 The Logistic Sigmoid The logistic sigmoid maps the real line onto the open interval $(0, 1)$: $$ \sigma(x) = \frac{1}{1 + e^{-x}}. $$ It is monotonically increasing, infinitely differentiable (it is real analytic), and antisymmetric about the point $(0, \tfrac{1}{2})$, meaning $\sigma(-x) = 1 - \sigma(x)$. As $x \to +\infty$ the output approaches $1$, and as $x \to -\infty$ it approaches $0$. The value at the origin is exactly $\tfrac{1}{2}$. Because its range coincides with the unit interval, $\sigma(x)$ is naturally read as a probability, which is the reason it serves as the output nonlinearity for binary classification and for the marginal probabilities in multilabel problems. The sigmoid is the inverse of the logit (log-odds) function. If $p = \sigma(x)$ then $$ x = \sigma^{-1}(p) = \ln\!\frac{p}{1 - p}, \qquad p \in (0, 1). $$ This inverse relation is the precise sense in which a network's final pre-activation, the logit, is a log-odds score: the sigmoid exponentiates and normalizes that score into a probability. 
It also explains why sigmoid sits at the output of logistic regression, the simplest member of the generalized linear model family, where the linear predictor is exactly the logit of the response. ### 1.2 The Hyperbolic Tangent The hyperbolic tangent maps the real line onto the open interval $(-1, 1)$: $$ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. $$ It is an odd function, $\tanh(-x) = -\tanh(x)$, and it passes through the origin. The two functions are not independent. Starting from $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ and dividing numerator and denominator by $e^{x}$ gives $\tanh(x) = (1 - e^{-2x})/(1 + e^{-2x})$; writing this over the common form of the sigmoid yields the identity $$ \tanh(x) = 2\,\sigma(2x) - 1, $$ so tanh is a rescaled and shifted sigmoid. This identity is worth internalizing: every qualitative statement about one function has a direct counterpart for the other, and the practical differences between them reduce almost entirely to range and centering. Equivalently, $\sigma(x) = \tfrac{1}{2}\bigl(1 + \tanh(x/2)\bigr)$. ### 1.3 Numerically Stable Evaluation The textbook formula $1/(1 + e^{-x})$ overflows for large negative $x$, since $e^{-x}$ grows without bound. Mature numerical libraries avoid this with a sign-split branch that keeps every exponential argument nonpositive: $$ \sigma(x) = \begin{cases} \dfrac{1}{1 + e^{-x}}, & x \ge 0,\\[2ex] \dfrac{e^{x}}{1 + e^{x}}, & x < 0. \end{cases} $$ Both branches are algebraically identical to the definition, but neither ever exponentiates a large positive number, so the computation stays in floating-point range. The same idea underlies the log-sigmoid used in loss functions, $\log \sigma(x) = -\operatorname{softplus}(-x)$ where $\operatorname{softplus}(z) = \log(1 + e^{z})$, which the standard open-source frameworks (NumPy and SciPy via `scipy.special.expit` and `log_expit`, PyTorch, JAX, TensorFlow) implement directly so that users rarely need to write the branch by hand. ## 2. 
Derivatives ### 2.1 Closed Form Expressions The derivative of the sigmoid admits an unusually convenient self referential form. Differentiating $\sigma(x) = (1 + e^{-x})^{-1}$ by the chain rule gives $\sigma'(x) = e^{-x}/(1 + e^{-x})^{2}$, and recognizing $e^{-x}/(1 + e^{-x}) = 1 - \sigma(x)$ collapses this to $$ \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr). $$ Once the forward activation $a = \sigma(x)$ has been computed, the gradient costs only a subtraction and a multiplication, with no further calls to the exponential. The tanh derivative is similarly compact: $$ \frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x) = \operatorname{sech}^{2}(x). $$ Both expressions are central to backpropagation, where the local gradient of each unit multiplies the incoming error signal. The fact that the derivative is expressible in terms of the already cached output (not the raw pre-activation) is one reason these functions were so attractive in the era before automatic differentiation: the backward pass reuses the forward activation directly. ### 2.2 Magnitude of the Slope The crucial quantitative fact is how small these derivatives are. The sigmoid derivative attains its maximum at $x = 0$, where $\sigma(0) = \tfrac{1}{2}$ and therefore $$ \sigma'(0) = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}. $$ So even at its steepest point the sigmoid passes through at most one quarter of the incoming gradient. The tanh derivative is more generous at the origin, reaching $1 - \tanh^{2}(0) = 1$. This larger maximum slope is one reason tanh is generally preferred over sigmoid for hidden units when a bounded activation is desired. Away from the origin both derivatives decay rapidly toward zero, which is the source of the difficulties described next. Note that $\sigma'(x) \le \tfrac14$ for all $x$, and as $a = \sigma(x)$ ranges over $(0,1)$ the product $a(1-a)$ is a downward parabola peaking at $a = \tfrac12$, so the gradient is largest exactly where the unit is most uncertain. 
```text x sigma'(x) tanh'(x) 0.0 0.2500 1.0000 2.0 0.1050 0.0707 4.0 0.0177 0.0013 6.0 0.0025 0.0000 ``` ## 3. Saturation and the Vanishing Gradient ### 3.1 What Saturation Means A unit is said to saturate when its pre-activation $x$ has large magnitude, placing the output near one of the asymptotes. In that regime the curve is nearly flat, so the local derivative is close to zero. For the sigmoid, once $|x|$ exceeds roughly $5$ the derivative has fallen below $0.007$; for tanh the collapse is even sharper because of the factor of two inside the equivalent logistic form. A saturated unit still produces a sensible forward output, but it has almost stopped responding to changes in its input, and it transmits almost no gradient backward. The decay is exponential: for large $x$, $\sigma'(x) \approx e^{-x}$ and $1 - \tanh^{2}(x) \approx 4 e^{-2x}$, so the backward signal through a saturated unit dies off with the magnitude of the pre-activation. ### 3.2 Propagation Through Depth The damage compounds with depth. Backpropagation forms the gradient of the loss with respect to an early parameter as a product of per-layer Jacobians. Schematically, for a chain of $L$ layers the gradient magnitude scales like $$ \left\|\frac{\partial \mathcal{L}}{\partial x^{(1)}}\right\| \;\sim\; \prod_{\ell=1}^{L} \bigl\|\,\mathrm{diag}\bigl(f'(x^{(\ell)})\bigr)\,W^{(\ell)}\,\bigr\|. $$ If each factor contributes a per-layer derivative bounded by $\tfrac{1}{4}$ (sigmoid) and the weight norms do not compensate, the product shrinks geometrically. After ten sigmoid layers the upper bound on the surviving gradient is on the order of $4^{-10} \approx 10^{-6}$. Early layers therefore receive a vanishingly small learning signal and update extremely slowly, while later layers train normally. This is the vanishing gradient problem, identified in Hochreiter's 1991 thesis and analyzed in detail by Bengio, Simard, and Frasconi in 1994. 
It is the single most important reason deep stacks of sigmoid or tanh units were historically so hard to train. The flow of signal forward and the decay of gradient backward can be pictured as two passes over the same stack. ```{mermaid} flowchart LR X["input x"] --> A1["sigmoid layer 1"] A1 --> A2["sigmoid layer 2"] A2 --> A3["sigmoid layer L"] A3 --> Y["output"] Y -. "grad scaled by sigma prime" .-> A3 A3 -. "grad x 1/4 or less" .-> A2 A2 -. "grad x 1/4 or less" .-> A1 A1 -. "tiny grad reaches input" .-> X ``` The solid arrows are the forward activations, which stay well defined at every depth. The dashed arrows are the backward gradients, each multiplied by a per-layer factor no larger than $\tfrac14$, so the signal reaching the first layer is exponentially attenuated in $L$. ### 3.3 Why Initialization and Range Matter Saturation is not inevitable; it depends on the distribution of pre-activations. If weights are initialized so that the variance of $x$ stays near unity, most units operate in the high-slope region near the origin. The Glorot and Bengio initialization of 2010 was derived precisely to keep activation and gradient variances stable across layers for tanh networks, and it substantially mitigated the symptom. Concretely, for a layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, the Glorot scheme draws weights with variance $2/(n_{\text{in}} + n_{\text{out}})$, the compromise that approximately preserves both the forward activation variance and the backward gradient variance. Saturating nonlinearities thus place a real burden on careful initialization and on input normalization in a way that the non-saturating rectified family does not. ## 4. The Zero-Centering Argument ### 4.1 Sigmoid Outputs Are Always Positive A more subtle defect of the sigmoid concerns the sign of its outputs. Because $\sigma(x) \in (0, 1)$, every value fed from one layer into the next is strictly positive. 
Consider a weight $w_i$ feeding into a single downstream neuron with pre-activation $z = \sum_i w_i a_i + b$. During backpropagation the gradient with respect to that weight is $$ \frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial z}\, a_i. $$ If all incoming activations $a_i$ are positive, then the sign of every weight gradient into that neuron is determined entirely by the single scalar $\partial \mathcal{L} / \partial z$. Consequently all weights of that neuron must increase together or decrease together on a given step. They cannot move in independent directions, which forces the optimizer to follow an inefficient zig-zag trajectory toward minima that would otherwise be reached more directly. Geometrically, the feasible update direction is confined to a single orthant of weight space per step, so a target that lies in a different orthant can only be reached by an alternating staircase of moves rather than a straight line. ### 4.2 How Tanh Resolves It Because tanh is symmetric about zero and produces both positive and negative outputs, the activations entering the next layer have a mean closer to zero. The per-weight gradients then carry mixed signs and the zig-zag effect is largely removed. This zero-centering property, articulated by LeCun and colleagues in their influential note on efficient backpropagation, is the primary reason tanh was the default hidden-layer choice throughout the 1990s and 2000s whenever a bounded nonlinearity was used. The same logic later motivated batch normalization and other mean-centering schemes, which restore a favorable activation distribution even when the nonlinearity itself does not. ## 5. The Sigmoid and Binary Cross-Entropy The sigmoid earns its place at classification outputs because of an elegant cancellation with its natural loss. Let $z$ be the logit, $p = \sigma(z)$ the predicted probability, and $y \in \{0, 1\}$ the label. 
The binary cross-entropy loss is

$$
\mathcal{L} = -\bigl[\, y \log p + (1 - y) \log(1 - p)\,\bigr].
$$

Differentiating with respect to the logit and substituting $\sigma'(z) = p(1 - p)$ produces a striking simplification:

$$
\frac{\partial \mathcal{L}}{\partial z} = \left(-\frac{y}{p} + \frac{1 - y}{1 - p}\right) p(1 - p) = p - y.
$$

The factor $p(1 - p)$ from the sigmoid derivative cancels the matching denominators in the loss derivative, leaving the residual $p - y$. Two consequences follow. First, the gradient at the output layer never saturates with respect to the loss-and-activation pair, even though the sigmoid alone does, because the small slope is exactly offset by the large loss gradient near the asymptotes. This is why the cross-entropy loss, not the squared error, is the correct partner for a sigmoid output: squared error reintroduces the $p(1-p)$ factor and lets the output unit stall. Second, the result is the same clean residual that softmax with categorical cross-entropy yields in the multiclass case. In practice one fuses the sigmoid and the loss into a single numerically stable operation, the binary-cross-entropy-with-logits primitive provided by every major open-source framework, rather than applying them separately.

```text
# conceptual pairing, not a runnable snippet
logit -> sigmoid -> p in (0,1)
loss = -[ y*log(p) + (1-y)*log(1-p) ]
dloss/dlogit = p - y
```

### 5.1 A Worked Single-Neuron Example

Consider one logistic unit with a single input feature, computing $p = \sigma(wx + b)$, trained with binary cross-entropy. Take $w = 0.5$, $b = 0$, a training pair $(x, y) = (2, 1)$, and learning rate $\eta = 0.1$.

The forward pass gives the logit $z = 0.5 \cdot 2 + 0 = 1$ and the probability $p = \sigma(1) \approx 0.731$. The loss is $-\log(0.731) \approx 0.313$.

The backward pass uses the cancellation above. The logit gradient is $p - y = 0.731 - 1 = -0.269$.
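
The forward pass and the logit gradient of this example can be checked with a few lines of plain Python. The sketch below is illustrative only: the helper names `sigmoid` and `bce_with_logits` are hypothetical, and the stable form $\max(z, 0) - zy + \log(1 + e^{-|z|})$ is one common algebraic rewrite of the fused loss, assumed here rather than taken from any particular framework. A central finite difference confirms that the derivative of the loss with respect to the logit matches $p - y$.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, branching on the sign of z to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(z: float, y: float) -> float:
    """Binary cross-entropy computed directly from the logit z.

    Uses max(z, 0) - z*y + log(1 + exp(-|z|)), which equals
    -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z) but never evaluates log(0).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Values from the worked example: w = 0.5, b = 0, (x, y) = (2, 1).
w, b = 0.5, 0.0
x, y = 2.0, 1.0

z = w * x + b                 # logit
p = sigmoid(z)                # predicted probability
loss = bce_with_logits(z, y)  # binary cross-entropy
dz = p - y                    # analytic gradient with respect to the logit

# Central finite difference of the loss with respect to the logit.
eps = 1e-6
dz_numeric = (bce_with_logits(z + eps, y) - bce_with_logits(z - eps, y)) / (2 * eps)

print(f"z = {z:.3f}, p = {p:.3f}, loss = {loss:.3f}")
print(f"analytic dL/dz = {dz:.3f}, numeric dL/dz = {dz_numeric:.3f}")
```

Because the loss is computed from the logit rather than from $p$, the same function stays finite for extreme logits where a naive $\log(1 - p)$ would hit $\log 0$.
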
By the chain rule the parameter gradients are

$$
\frac{\partial \mathcal{L}}{\partial w} = (p - y)\,x = -0.269 \cdot 2 = -0.538, \qquad \frac{\partial \mathcal{L}}{\partial b} = p - y = -0.269.
$$

A single gradient-descent step yields $w \leftarrow 0.5 - 0.1 \cdot (-0.538) = 0.554$ and $b \leftarrow 0 - 0.1 \cdot (-0.269) = 0.027$. Recomputing the forward pass gives $z = 0.554 \cdot 2 + 0.027 \approx 1.135$ and $p \approx 0.757$, closer to the target of $1$, with the loss falling to about $0.279$. The negative weight gradient pushed the logit up exactly because the prediction was below the label, and the magnitude was scaled by the input feature, the hallmark behavior of a logistic unit.

## 6. Where They Are Still Used

The retreat of sigmoid and tanh from generic hidden layers does not mean they are obsolete. They survive in roles where their bounded, smooth, probabilistic character is exactly what is needed.

### 6.1 Output Layers and Probabilities

The sigmoid remains the standard output nonlinearity for binary classification, where it converts a single logit into a probability, and for multilabel classification, where an independent sigmoid is applied to each class logit. It pairs naturally with the binary cross-entropy loss, whose gradient with respect to the logit simplifies to the difference between the predicted probability and the target, as derived in Section 5. In practice one fuses the sigmoid and the loss into a single numerically stable operation rather than applying them separately.

### 6.2 Gating in Recurrent and Gated Architectures

Gated recurrent designs depend essentially on saturating nonlinearities. In the LSTM the input, forget, and output gates each use a sigmoid because a gate value near $0$ or $1$ acts as a soft binary switch that admits or blocks information, while the cell candidate and the exposed cell state use tanh to keep the signal bounded in $(-1, 1)$. The GRU follows the same pattern with its update and reset gates.
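
As a rough illustration of this gate pattern, the following pure-Python sketch runs a scalar LSTM cell (hidden size 1) over a short sequence. The function name `lstm_step`, the parameter layout, and the weight values are hypothetical choices made for illustration and are not taken from any library; the update equations are the standard LSTM step with a forget gate.

```python
import math


def sigmoid(v: float) -> float:
    """Logistic sigmoid, branching on the sign of v to avoid overflow in exp."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell (illustrative sketch, hidden size 1)."""

    def pre(name):
        # Pre-activation: input weight, recurrent weight, and bias for one unit.
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))         # input gate, in (0, 1)
    f = sigmoid(pre("forget"))        # forget gate, in (0, 1)
    o = sigmoid(pre("output"))        # output gate, in (0, 1)
    g = math.tanh(pre("candidate"))   # cell candidate, in (-1, 1)

    c = f * c_prev + i * g            # additive cell-state update
    h = o * math.tanh(c)              # exposed state, bounded in (-1, 1)
    return h, c, (i, f, o, g)


# Hypothetical (w_x, w_h, bias) triples, chosen only to exercise the gates.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 0.0),
    "output": (0.4, -0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}

h, c = 0.0, 0.0
history = []
for x in [1.0, -2.0, 0.5, 3.0]:
    h, c, gates = lstm_step(x, h, c, params)
    history.append((h, c, gates))
    i, f, o, g = gates
    print(f"x={x:+.1f}  i={i:.3f}  f={f:.3f}  o={o:.3f}  g={g:+.3f}  c={c:+.3f}  h={h:+.3f}")
```

The sigmoid gates only ever scale a signal by a factor between $0$ and $1$, while the two tanh calls carry the signed content, which is the division of labor described above.
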
The constant error carousel of the LSTM was designed specifically to bypass the vanishing gradient by giving the cell state an additive path through time (ungated in the original design, and scaled by the forget gate in later variants), which is what allows these saturating gates to be used safely across long sequences. A common practical refinement is to initialize the forget-gate bias to a positive value so the gate starts near $1$, keeping the cell state's memory path open early in training.

### 6.3 Attention, Gating, and Smooth Approximations

Sigmoid gating reappears in modern Transformer variants. Gated linear unit blocks multiply one projection by a sigmoid or related gate of another, and several recent feedforward designs use such multiplicative gates to improve quality. The sigmoid also hides inside the smooth activations that did replace it in hidden layers. The SiLU, or swish, is defined as $x\,\sigma(x)$, and the GELU is well approximated by $x\,\sigma(1.702\,x)$, so the logistic curve persists as a smooth gate even in architectures that no longer expose it directly. Tanh, for its part, is used to bound the output of policy networks in continuous control, to squash regression targets into a fixed range, and as the nonlinearity in the closed-form approximation of GELU.

## 7. When to Use, and Pitfalls

What follows is a short decision guide and a list of the failure modes that recur in practice.

- Use a sigmoid output for binary or multilabel classification, and always pair it with the fused with-logits loss rather than applying the sigmoid and then a separate cross-entropy. The separate form loses the numerical stability and risks $\log 0$.
- Use tanh in preference to sigmoid for any bounded hidden activation, because its zero-centered range avoids the coupled-sign update pathology and its steeper slope at the origin passes more gradient.
- Avoid stacking many sigmoid or tanh layers as a deep feedforward trunk.
Prefer rectified activations, residual connections, or normalization layers, which avoid or counteract the geometric attenuation of the gradient with depth.
- If you must use saturating units in depth, initialize with the Glorot scheme and normalize inputs so pre-activations sit near the high-slope region, and monitor the fraction of units whose outputs cluster near the asymptotes, a direct symptom of saturation.
- For gates in recurrent and gated blocks the sigmoid is the right tool precisely because it saturates: a gate wants to commit toward fully open or fully closed. Here saturation is a feature, not a defect.
- Watch for silent saturation at initialization: a too-large weight scale drives units into the flat regions before the first update, after which they receive almost no gradient and may never recover. This presents as a loss that plateaus immediately and is best diagnosed by histogramming the activations.

## 8. Summary

Sigmoid and tanh are smooth, bounded, monotone nonlinearities related by $\tanh(x) = 2\sigma(2x) - 1$, with the convenient derivatives $\sigma' = \sigma(1-\sigma)$ and $\tanh' = 1 - \tanh^{2}$. Their maximum slopes, $\tfrac{1}{4}$ and $1$ respectively, together with rapid saturation away from the origin, produce the vanishing gradient that makes deep stacks of either function hard to train. The strictly positive range of the sigmoid additionally couples the signs of weight updates and slows optimization, a defect that tanh's zero-centering partly cures and that normalization layers later addressed in general. The sigmoid's cancellation with binary cross-entropy, which reduces the output gradient to the clean residual $p - y$, explains its enduring place at classification heads.
Despite losing the hidden layers of feedforward and convolutional networks to the rectified family, both functions remain central to probability outputs, to the gates of recurrent and gated architectures, and to the smooth gated activations of contemporary Transformers.

## 9. References

1. Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991. https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf
2. Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994. https://doi.org/10.1109/72.279181
3. LeCun, Y., Bottou, L., Orr, G., and Müller, K.-R. Efficient BackProp. In Neural Networks: Tricks of the Trade, 1998. https://doi.org/10.1007/3-540-49430-8_2
4. Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
5. Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a.html
6. Cho, K. et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP, 2014. https://doi.org/10.3115/v1/D14-1179
7. Hendrycks, D. and Gimpel, K. Gaussian Error Linear Units (GELUs). 2016. https://arxiv.org/abs/1606.08415
8. Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 2018. https://doi.org/10.1016/j.neunet.2017.12.012
9. Shazeer, N. GLU Variants Improve Transformer. 2020. https://arxiv.org/abs/2002.05202