185 Sigmoid and Tanh Activations
The logistic sigmoid and the hyperbolic tangent are the two classical bounded nonlinearities that dominated neural network design from the 1980s through the early 2010s. Although the rectified linear family has since displaced them in the hidden layers of most feedforward and convolutional architectures, sigmoid and tanh remain indispensable in specific roles: probability outputs, gating mechanisms in recurrent and attention-based models, and any setting where a smooth saturating squashing function is required. Understanding their analytic properties, and in particular the way they fail, is foundational for diagnosing training pathologies and for appreciating why later designs took the shape they did.
185.1 1. Definitions and Analytic Properties
185.1.1 1.1 The Logistic Sigmoid
The logistic sigmoid maps the real line onto the open interval \((0, 1)\):
\[ \sigma(x) = \frac{1}{1 + e^{-x}}. \]
It is monotonically increasing, infinitely differentiable, and antisymmetric about the point \((0, \tfrac{1}{2})\), meaning \(\sigma(-x) = 1 - \sigma(x)\). As \(x \to +\infty\) the output approaches \(1\), and as \(x \to -\infty\) it approaches \(0\). The value at the origin is exactly \(\tfrac{1}{2}\). Because its range coincides with the unit interval, \(\sigma(x)\) is naturally read as a probability, which is the reason it serves as the output nonlinearity for binary classification and for the marginal probabilities in multilabel problems.
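A direct transcription of the formula can overflow in floating point, because \(e^{-x}\) grows without bound as \(x \to -\infty\). The sketch below shows one common way around this in plain Python: branch on the sign of \(x\) so the exponential is only ever taken of a non-positive number. The function name `sigmoid` is our own choice for illustration, not taken from any particular library.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, evaluated without overflowing math.exp."""
    if x >= 0:
        # exp(-x) <= 1 here, so the denominator is safe.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)

# The value at the origin and the antisymmetry about (0, 1/2).
print(sigmoid(0.0))                       # 0.5
print(sigmoid(-2.0) + sigmoid(2.0))       # equals 1 up to rounding
print(sigmoid(-1000.0), sigmoid(1000.0))  # 0.0 1.0, no OverflowError
```

The two branches are algebraically identical; the split only controls which exponential is evaluated.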
185.1.2 1.2 The Hyperbolic Tangent
The hyperbolic tangent maps the real line onto the open interval \((-1, 1)\):
\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \]
It is an odd function, \(\tanh(-x) = -\tanh(x)\), and it passes through the origin. The two functions are not independent. A short algebraic manipulation gives the relation
\[ \tanh(x) = 2\,\sigma(2x) - 1, \]
so tanh is a rescaled and shifted sigmoid. This identity is worth internalizing: every qualitative statement about one function has a direct counterpart for the other, and the practical differences between them reduce almost entirely to range and centering.
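For completeness, the manipulation can be written out in one line. Multiplying the numerator and denominator of the tanh definition by \(e^{-x}\) and then splitting the fraction gives
\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - \frac{1 + e^{-2x}}{1 + e^{-2x}} = 2\,\sigma(2x) - 1. \]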
185.2 2. Derivatives
185.2.1 2.1 Closed Form Expressions
The derivative of the sigmoid admits an unusually convenient self-referential form:
\[ \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr). \]
Once the forward activation \(a = \sigma(x)\) has been computed, the gradient costs only a subtraction and a multiplication, with no further calls to the exponential. The tanh derivative is similarly compact:
\[ \frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x). \]
Both expressions are central to backpropagation, where the local gradient of each unit multiplies the incoming error signal.
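As a rough illustration of how these closed forms are used, the sketch below computes each derivative from the cached forward output alone and compares it with a central finite difference. The helper names (`sigmoid_grad_from_output`, `tanh_grad_from_output`, `central_difference`) are hypothetical and chosen only for this example.

```python
import math

def sigmoid(x: float) -> float:
    # Plain formula; adequate for the moderate inputs used here.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad_from_output(a: float) -> float:
    """sigma'(x) expressed through the cached output a = sigma(x)."""
    return a * (1.0 - a)

def tanh_grad_from_output(t: float) -> float:
    """tanh'(x) expressed through the cached output t = tanh(x)."""
    return 1.0 - t * t

def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -0.5, 0.0, 1.0, 2.5):
    a, t = sigmoid(x), math.tanh(x)
    err_s = abs(sigmoid_grad_from_output(a) - central_difference(sigmoid, x))
    err_t = abs(tanh_grad_from_output(t) - central_difference(math.tanh, x))
    print(f"x={x:+.1f}  sigmoid err={err_s:.1e}  tanh err={err_t:.1e}")
```

Note that neither gradient helper touches the exponential: both take only the value already produced by the forward pass.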
185.2.2 2.2 Magnitude of the Slope
The crucial quantitative fact is how small these derivatives are. The sigmoid derivative attains its maximum at \(x = 0\), where \(\sigma(0) = \tfrac{1}{2}\) and therefore
\[ \sigma'(0) = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}. \]
So even at its steepest point the sigmoid passes through at most one quarter of the incoming gradient. The tanh derivative is more generous at the origin, reaching \(1 - \tanh^{2}(0) = 1\). This larger maximum slope is one reason tanh is generally preferred over sigmoid for hidden units when a bounded activation is desired. Away from the origin both derivatives decay rapidly toward zero, which is the source of the difficulties described next.
| \(x\) | \(\sigma'(x)\) | \(\tanh'(x)\) |
|-------|----------------|---------------|
| 0.0   | 0.2500         | 1.0000        |
| 2.0   | 0.1050         | 0.0707        |
| 4.0   | 0.0177         | 0.0013        |
| 6.0   | 0.0025         | 0.0000        |
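The tabulated slopes can be regenerated with a few lines of standard-library Python. This is only a sketch for checking the numbers; the function names `sigmoid_grad` and `tanh_grad` are our own.

```python
import math

def sigmoid_grad(x: float) -> float:
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    a = 1.0 / (1.0 + math.exp(-x))
    return a * (1.0 - a)

def tanh_grad(x: float) -> float:
    """tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

print(f"{'x':>4} {'sigmoid_grad':>13} {'tanh_grad':>10}")
for x in (0.0, 2.0, 4.0, 6.0):
    print(f"{x:>4.1f} {sigmoid_grad(x):>13.4f} {tanh_grad(x):>10.4f}")
```

Although tanh starts four times steeper at the origin, its slope is already below the sigmoid's by \(x = 2\).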
185.3 3. Saturation and the Vanishing Gradient
185.3.1 3.1 What Saturation Means
A unit is said to saturate when its pre-activation \(x\) has large magnitude, placing the output near one of the asymptotes. In that regime the curve is nearly flat, so the local derivative is close to zero. For the sigmoid, once \(|x|\) exceeds roughly \(5\) the derivative has fallen below \(0.007\); for tanh the collapse is even sharper because of the factor of two inside the equivalent logistic form. A saturated unit still produces a sensible forward output, but it has almost stopped responding to changes in its input, and it transmits almost no gradient backward.
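The role of the factor of two can be made explicit by differentiating \(\tanh(x) = 2\,\sigma(2x) - 1\):
\[ \frac{d}{dx}\tanh(x) = 4\,\sigma'(2x) = 4\,\sigma(2x)\bigl(1 - \sigma(2x)\bigr). \]
For large \(|x|\) the sigmoid slope behaves like \(e^{-|x|}\), so the tanh slope behaves like \(4e^{-2|x|}\). At \(x = 5\) this gives roughly \(1.8 \times 10^{-4}\) for tanh against \(6.6 \times 10^{-3}\) for the sigmoid.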
185.3.2 3.2 Propagation Through Depth
The damage compounds with depth. Backpropagation forms the gradient of the loss with respect to an early parameter as a product of per-layer Jacobians. Schematically, for a chain of \(L\) layers the gradient magnitude scales like
\[ \left\|\frac{\partial \mathcal{L}}{\partial x^{(1)}}\right\| \;\sim\; \prod_{\ell=1}^{L} \bigl\|\,\mathrm{diag}\bigl(f'(x^{(\ell)})\bigr)\,W^{(\ell)}\,\bigr\|. \]
If each factor contributes a per-layer derivative bounded by \(\tfrac{1}{4}\) (sigmoid) and the weight norms do not compensate, the product shrinks geometrically. After ten sigmoid layers the upper bound on the surviving gradient is on the order of \(4^{-10} \approx 10^{-6}\). Early layers therefore receive a vanishingly small learning signal and update extremely slowly, while later layers train normally. This is the vanishing gradient problem, identified in Hochreiter’s 1991 thesis and analyzed in detail by Bengio, Simard, and Frasconi in 1994. It is the single most important reason deep stacks of sigmoid or tanh units were historically so hard to train.
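The geometric shrinkage is easy to see in a deliberately simplified setting. The sketch below assumes a scalar chain \(a_\ell = \sigma(w_\ell\, a_{\ell-1})\) with every weight set to one. This is a toy model chosen for illustration rather than a realistic network, and `chain_gradient` is a hypothetical helper.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def chain_gradient(x: float, weights: list[float]) -> float:
    """d(output)/d(input) for a scalar chain a_l = sigmoid(w_l * a_{l-1})."""
    grad, a = 1.0, x
    for w in weights:
        z = w * a
        a = sigmoid(z)
        # Chain rule: local slope sigma'(z) times the weight w.
        grad *= a * (1.0 - a) * w
    return grad

depth = 10
g = chain_gradient(0.0, [1.0] * depth)
print(f"gradient through {depth} layers: {g:.3e}")
print(f"upper bound 4**-{depth}:        {4.0 ** -depth:.3e}")
```

With unit weights the computed gradient can never exceed the \(4^{-L}\) bound. Larger weights can offset the shrinkage in principle, but they also push units toward saturation, which lowers the local slopes again.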
185.3.3 3.3 Why Initialization and Range Matter
Saturation is not inevitable; it depends on the distribution of pre-activations. If weights are initialized so that the variance of \(x\) stays near unity, most units operate in the high-slope region near the origin. The Glorot and Bengio initialization of 2010 was derived precisely to keep activation and gradient variances stable across layers for tanh networks, and it substantially mitigated the symptom. Saturating nonlinearities thus place a real burden on careful initialization and on input normalization in a way that the non-saturating rectified family does not.
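To give a feel for the effect, the following sketch draws one square layer with the Glorot uniform rule, \(W_{ij} \sim U(-r, r)\) with \(r = \sqrt{6 / (n_{\text{in}} + n_{\text{out}})}\), and compares the spread of the resulting pre-activations with a naive \(U(-1, 1)\) draw. The layer width, the seed, and the helper names are arbitrary assumptions made for this example.

```python
import math
import random

def glorot_uniform(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """Weights drawn from U(-r, r) with r = sqrt(6 / (fan_in + fan_out))."""
    r = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]

def variance(values: list[float]) -> float:
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def preactivation_variance(weights: list[list[float]], x: list[float]) -> float:
    """Variance over units of z = W x."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return variance(z)

rng = random.Random(0)   # fixed seed, purely for repeatability
n = 200                  # hypothetical layer width
x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unit-variance inputs

glorot = glorot_uniform(n, n, rng)
naive = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

var_glorot = preactivation_variance(glorot, x)
var_naive = preactivation_variance(naive, x)
print(f"Glorot init: Var(z) = {var_glorot:.2f}")
print(f"U(-1, 1):    Var(z) = {var_naive:.2f}")
```

With the Glorot scale the pre-activation variance should stay close to that of the input, whereas the naive draw inflates it by a factor on the order of the layer width, which drives most units deep into saturation.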
185.4 4. The Zero-Centering Argument
185.4.1 4.1 Sigmoid Outputs Are Always Positive
A more subtle defect of the sigmoid concerns the sign of its outputs. Because \(\sigma(x) \in (0, 1)\), every value fed from one layer into the next is strictly positive. Consider a weight \(w_i\) feeding into a single downstream neuron with pre-activation \(z = \sum_i w_i a_i + b\). During backpropagation the gradient with respect to that weight is
\[ \frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial \mathcal{L}}{\partial z}\, a_i. \]
If all incoming activations \(a_i\) are positive, then the sign of every weight gradient into that neuron is determined entirely by the single scalar \(\partial \mathcal{L} / \partial z\). Consequently all weights of that neuron must increase together or decrease together on a given step. They cannot move in independent directions, which forces the optimizer to follow an inefficient zig-zag trajectory toward minima that would otherwise be reached more directly.
185.4.2 4.2 How Tanh Resolves It
Because tanh is an odd function and produces both positive and negative outputs, the activations entering the next layer have a mean closer to zero. The per-weight gradients then carry mixed signs and the zig-zag effect is largely removed. This zero-centering property, articulated by LeCun and colleagues in their influential note on efficient backpropagation, is the primary reason tanh was the default hidden-layer choice throughout the 1990s and 2000s whenever a bounded nonlinearity was used. Batch normalization and other mean-centering schemes later achieved a similar effect, restoring a favorable activation distribution even when the nonlinearity itself does not.
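The sign argument can be checked numerically. The sketch below applies \(\partial \mathcal{L} / \partial w_i = (\partial \mathcal{L} / \partial z)\, a_i\) to one downstream neuron, once with sigmoid inputs and once with tanh inputs. The pre-activations and the upstream gradient are made-up values used only to show the pattern.

```python
import math

def weight_gradients(upstream: float, activations: list[float]) -> list[float]:
    """dL/dw_i = (dL/dz) * a_i for a single downstream neuron."""
    return [upstream * a for a in activations]

pre = [-2.0, -0.5, 0.3, 1.5]   # hypothetical pre-activations of the previous layer
sig_acts = [1.0 / (1.0 + math.exp(-p)) for p in pre]
tanh_acts = [math.tanh(p) for p in pre]

upstream = -0.7                # hypothetical dL/dz
sig_grads = weight_gradients(upstream, sig_acts)
tanh_grads = weight_gradients(upstream, tanh_acts)

print("sigmoid inputs:", [f"{g:+.3f}" for g in sig_grads])
print("tanh inputs:   ", [f"{g:+.3f}" for g in tanh_grads])
```

With sigmoid inputs every gradient component shares the sign of the upstream scalar; with tanh inputs the signs follow the signs of the individual activations.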
185.5 5. Where They Are Still Used
The retreat of sigmoid and tanh from generic hidden layers does not mean they are obsolete. They survive in roles where their bounded, smooth, probabilistic character is exactly what is needed.
185.5.1 5.1 Output Layers and Probabilities
The sigmoid remains the standard output nonlinearity for binary classification, where it converts a single logit into a probability estimate, and for multilabel classification, where an independent sigmoid is applied to each class logit. It pairs naturally with the binary cross-entropy loss, whose gradient with respect to the logit simplifies to the difference between the predicted probability and the target. In practice one fuses the sigmoid and the loss into a single numerically stable operation rather than applying them separately.
# conceptual pairing, not a runnable snippet
logit -> sigmoid -> p in (0,1)
loss = -[ y*log(p) + (1-y)*log(1-p) ]
dloss/dlogit = p - y
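A runnable counterpart to the pairing above might look as follows. It is a sketch of one common fused formulation, \(\max(z, 0) - z\,y + \log(1 + e^{-|z|})\), which is algebraically equal to the cross-entropy of \(\sigma(z)\) but never evaluates \(\log 0\). The function names are our own, not those of any specific framework.

```python
import math

def bce_with_logits(logit: float, y: float) -> float:
    """Binary cross-entropy computed directly from the logit.

    Equal to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit),
    but stays finite when the sigmoid saturates.
    """
    return max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bce_grad(logit: float, y: float) -> float:
    """dloss/dlogit = p - y."""
    return sigmoid(logit) - y

# Check the analytic gradient against a central difference.
z, y, h = 0.8, 1.0, 1e-6
numeric = (bce_with_logits(z + h, y) - bce_with_logits(z - h, y)) / (2 * h)
print(abs(numeric - bce_grad(z, y)) < 1e-6)   # True

# The fused form stays finite where the naive formula would hit log(0).
print(bce_with_logits(-800.0, 1.0))           # 800.0
```

The naive two-step version would compute \(p = \sigma(-800)\), which underflows to zero, and then fail on \(\log p\); the fused form returns the correct linear penalty instead.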
185.5.2 5.2 Gating in Recurrent and Gated Architectures
Gated recurrent designs depend essentially on saturating nonlinearities. In the LSTM the input, forget, and output gates each use a sigmoid because a gate value near \(0\) or \(1\) acts as a soft binary switch that admits or blocks information, while the cell candidate and the exposed cell state use tanh to keep the signal bounded in \((-1, 1)\). The GRU follows the same pattern with its update and reset gates. The constant error carousel of the LSTM was designed specifically to bypass the vanishing gradient by giving the cell state an additive path through time: a self-connection of fixed weight one in the original 1997 formulation, scaled elementwise by the forget gate in the later and now standard variant. That additive path is what allows these saturating gates to be used safely across long sequences.
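The division of labor between the two nonlinearities is easiest to see in a single LSTM step. The sketch below uses the standard formulation with a forget gate, reduced to one scalar unit for readability. The `GateParams` container and all parameter values are hypothetical and exist only to make the example runnable.

```python
import math
from dataclasses import dataclass

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

@dataclass
class GateParams:
    """Hypothetical scalar parameters: input weight, recurrent weight, bias."""
    w_x: float
    w_h: float
    b: float

    def pre(self, x: float, h: float) -> float:
        return self.w_x * x + self.w_h * h + self.b

def lstm_step(x, h_prev, c_prev, p_i, p_f, p_o, p_g):
    i = sigmoid(p_i.pre(x, h_prev))    # input gate in (0, 1)
    f = sigmoid(p_f.pre(x, h_prev))    # forget gate in (0, 1)
    o = sigmoid(p_o.pre(x, h_prev))    # output gate in (0, 1)
    g = math.tanh(p_g.pre(x, h_prev))  # cell candidate in (-1, 1)
    c = f * c_prev + i * g             # additive cell-state update
    h = o * math.tanh(c)               # exposed state, bounded by tanh
    return h, c

# Illustrative values only; a trained network would learn these.
params = [GateParams(0.5, 0.1, 0.0), GateParams(0.3, 0.2, 1.0),
          GateParams(0.4, -0.1, 0.0), GateParams(1.0, 0.5, 0.0)]
h, c = 0.0, 0.0
for x in (1.0, -0.5, 2.0):
    h, c = lstm_step(x, h, c, *params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The cell state `c` is updated by addition rather than by passing through a squashing function, so its gradient with respect to `c_prev` is simply the forget gate value `f`.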
185.5.3 5.3 Attention, Gating, and Smooth Approximations
Sigmoid gating reappears in modern Transformer variants. Gated linear unit (GLU) blocks multiply one linear projection elementwise by a sigmoid gate of another, and variants such as SwiGLU and GEGLU, which replace the sigmoid with SiLU or GELU, have been reported to improve quality in Transformer feedforward sublayers. The sigmoid also hides inside the smooth activations that did replace it in hidden layers. The SiLU, or swish, is defined as \(x\,\sigma(x)\), and the GELU is well approximated by \(x\,\sigma(1.702\,x)\), so the logistic curve persists as a smooth gate even in architectures that no longer expose it directly. Tanh, for its part, is used to bound the output of policy networks in continuous control, to squash regression targets into a fixed range, and as the nonlinearity in the tanh-based closed-form approximation of GELU.
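The relationships among these activations can be checked directly. The sketch below implements SiLU, the exact GELU via the error function, and the sigmoid and tanh approximations from the GELU paper, then measures the worst-case gap on a grid. The grid range and the function names are arbitrary choices made for this illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    """SiLU / swish: x * sigmoid(x)."""
    return x * sigmoid(x)

def gelu_exact(x: float) -> float:
    """GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation x * sigmoid(1.702 x)."""
    return x * sigmoid(1.702 * x)

def gelu_tanh(x: float) -> float:
    """Tanh approximation 0.5 x (1 + tanh(sqrt(2/pi) (x + 0.044715 x^3)))."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-600, 601)]
err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)
err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
print(f"max |sigmoid approx - exact| on [-6, 6]: {err_sigmoid:.4f}")
print(f"max |tanh approx - exact|    on [-6, 6]: {err_tanh:.4f}")
```

The sigmoid form is the cheaper of the two approximations and the tanh form the more accurate, which is the usual trade-off between them.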
185.6 6. Summary
Sigmoid and tanh are smooth, bounded, monotone nonlinearities related by \(\tanh(x) = 2\sigma(2x) - 1\), with the convenient derivatives \(\sigma' = \sigma(1-\sigma)\) and \(\tanh' = 1 - \tanh^{2}\). Their maximum slopes, \(\tfrac{1}{4}\) and \(1\) respectively, together with rapid saturation away from the origin, produce the vanishing gradient that makes deep stacks of either function hard to train. The strictly positive range of the sigmoid additionally couples the signs of weight updates and slows optimization, a defect that tanh’s zero-centering partly cures and that normalization layers later addressed in general. Despite losing the hidden layers of feedforward and convolutional networks to the rectified family, both functions remain central to probability outputs, to the gates of recurrent and gated architectures, and to the smooth gated activations of contemporary Transformers.
185.7 7. References
- Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991. https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf
- Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994. https://ieeexplore.ieee.org/document/279181
- LeCun, Y., Bottou, L., Orr, G., and Müller, K.-R. Efficient BackProp. In Neural Networks: Tricks of the Trade, 1998. https://link.springer.com/chapter/10.1007/3-540-49430-8_2
- Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 1997. https://www.bioinf.jku.at/publications/older/2604.pdf
- Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a.html
- Cho, K. et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP, 2014. https://arxiv.org/abs/1406.1078
- Hendrycks, D. and Gimpel, K. Gaussian Error Linear Units (GELUs). 2016. https://arxiv.org/abs/1606.08415
- Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. 2017. https://arxiv.org/abs/1702.03118
- Shazeer, N. GLU Variants Improve Transformer. 2020. https://arxiv.org/abs/2002.05202