94  Softmax Regression

Softmax regression generalizes binary logistic regression to problems with more than two mutually exclusive classes. It is the workhorse classifier that sits at the output of most modern neural network classifiers, from image models to large language models, where the final layer maps a vector of scores to a probability distribution over a vocabulary or a label set. This chapter develops the model from first principles, derives the cross-entropy loss and its gradient, establishes the equivalence with multinomial logistic regression, and treats two practical concerns: numerical stability, which determines whether an implementation works at all, and temperature, which controls how sharp the output distribution is.

94.1 1. The Softmax Function

94.1.1 1.1 Definition

Given a vector of real-valued scores, often called logits, \(\mathbf{z} = (z_1, z_2, \ldots, z_K) \in \mathbb{R}^K\), the softmax function produces a probability vector \(\boldsymbol{\sigma}(\mathbf{z}) \in \mathbb{R}^K\) whose \(k\)-th component is

\[ \sigma_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}. \]

Each output lies strictly in the open interval \((0, 1)\), and the components sum to one:

\[ \sum_{k=1}^{K} \sigma_k(\mathbf{z}) = \frac{\sum_{k=1}^{K} e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} = 1. \]

The function therefore maps an arbitrary point in \(\mathbb{R}^K\) onto the interior of the probability simplex \(\Delta^{K-1}\). The name reflects that softmax is a smooth, differentiable surrogate for the \(\arg\max\) operator: it concentrates mass on the largest logit without ever forcing the others to exactly zero, which keeps the function differentiable everywhere.

94.1.2 1.2 Properties

Several properties of softmax matter both for analysis and for implementation.

Shift invariance. Adding a constant \(c\) to every logit leaves the output unchanged. For any scalar \(c\),

\[ \sigma_k(\mathbf{z} + c\mathbf{1}) = \frac{e^{z_k + c}}{\sum_j e^{z_j + c}} = \frac{e^{c} e^{z_k}}{e^{c}\sum_j e^{z_j}} = \sigma_k(\mathbf{z}). \]

This single fact is the foundation of the numerical stabilization discussed in Section 4. It also reveals that softmax has one redundant degree of freedom: only differences between logits are identifiable, not their absolute level.

Monotonicity and order preservation. Because the exponential is strictly increasing, \(z_i > z_j\) implies \(\sigma_i(\mathbf{z}) > \sigma_j(\mathbf{z})\). Softmax never reorders the classes relative to their logits.

Saturation. As the gap between the largest logit and the rest grows, softmax approaches a one-hot vector. Conversely, when all logits are equal, the output is the uniform distribution \((1/K, \ldots, 1/K)\).

Non-injectivity. Owing to shift invariance, softmax is not injective on \(\mathbb{R}^K\). It becomes injective once we fix a reference, for example by constraining \(z_K = 0\), which is exactly what happens when the binary case collapses to a single logit.
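The properties above are easy to check numerically. The following is a minimal NumPy sketch, not a reference implementation; the helper name softmax, the example logits, and the shift of 7.3 are illustrative choices, and the helper already subtracts the maximum logit, which shift invariance makes harmless.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis, shifted by the max for stability."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([2.0, -1.0, 0.5, 3.0])    # illustrative logits
p = softmax(z)

# Shift invariance: adding a constant to every logit changes nothing.
shift_ok = np.allclose(softmax(z + 7.3), p)

# Order preservation: sorting by probability equals sorting by logit.
order_ok = np.array_equal(np.argsort(p), np.argsort(z))

# Equal logits give the uniform distribution.
uniform = softmax(np.zeros(4))

# Saturation: widening the gaps pushes the output toward one-hot.
saturated = softmax(10.0 * z)
```

Here shift_ok and order_ok should both be true, uniform should hold 1/4 in every entry, and saturated should place almost all of its mass on the last class.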

94.1.3 1.3 The Jacobian

The derivative of softmax is needed for backpropagation. Differentiating \(\sigma_i\) with respect to \(z_j\) gives a clean closed form. For the diagonal entries,

\[ \frac{\partial \sigma_i}{\partial z_i} = \sigma_i (1 - \sigma_i), \]

and for the off-diagonal entries,

\[ \frac{\partial \sigma_i}{\partial z_j} = -\sigma_i \sigma_j \quad (i \neq j). \]

Both cases combine into the compact expression

\[ \frac{\partial \sigma_i}{\partial z_j} = \sigma_i (\delta_{ij} - \sigma_j), \]

where \(\delta_{ij}\) is the Kronecker delta. In matrix form the Jacobian is \(\mathbf{J} = \operatorname{diag}(\boldsymbol{\sigma}) - \boldsymbol{\sigma}\boldsymbol{\sigma}^\top\). This matrix is symmetric and positive semidefinite, and it is singular because its rows sum to zero, a direct consequence of the shift invariance noted above.
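As a sanity check on the closed form, the sketch below builds the Jacobian from \(\operatorname{diag}(\boldsymbol{\sigma}) - \boldsymbol{\sigma}\boldsymbol{\sigma}^\top\) and compares it with central finite differences. The function names, the test logits, and the step size eps are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = d sigma_i / d z_j = sigma_i * (delta_ij - sigma_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, -0.5, 2.0])
J = softmax_jacobian(z)

# Central finite differences: column j holds d sigma / d z_j.
eps = 1e-6
J_fd = np.stack(
    [(softmax(z + eps * e_j) - softmax(z - eps * e_j)) / (2 * eps)
     for e_j in np.eye(len(z))],
    axis=1,
)

max_abs_err = np.abs(J - J_fd).max()
row_sums = J.sum(axis=1)               # approximately zero: J is singular
eigenvalues = np.linalg.eigvalsh(J)    # nonnegative, smallest approximately zero
```

The zero row sums and the zero smallest eigenvalue are the numerical footprint of shift invariance: moving along the all-ones direction does not change the output.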

94.2 2. The Softmax Regression Model

94.2.1 2.1 From features to class probabilities

Softmax regression is a linear model that feeds its scores through softmax. Given an input feature vector \(\mathbf{x} \in \mathbb{R}^d\) and \(K\) classes, the model maintains a weight matrix \(\mathbf{W} \in \mathbb{R}^{K \times d}\) and a bias vector \(\mathbf{b} \in \mathbb{R}^K\). The logits are the affine map

\[ \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}, \]

and the predicted distribution over classes is

\[ p(y = k \mid \mathbf{x}) = \sigma_k(\mathbf{z}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \mathbf{x} + b_j)}, \]

where \(\mathbf{w}_k\) is the \(k\)-th row of \(\mathbf{W}\). The decision rule predicts the class with the highest probability, equivalently the highest logit, so the decision boundaries between any pair of classes are linear hyperplanes in feature space.
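A forward pass of the model might look like the following sketch. The name predict_proba, the random parameters, and the sizes K, d, N are hypothetical and chosen only to make the shapes concrete.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_proba(X, W, b):
    """Class probabilities for the rows of X; W is (K, d), b is (K,)."""
    return softmax(X @ W.T + b)

rng = np.random.default_rng(0)
K, d, N = 3, 4, 5                      # illustrative sizes
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
X = rng.normal(size=(N, d))

P = predict_proba(X, W, b)             # shape (N, K), each row sums to one
y_hat = P.argmax(axis=1)               # decision rule: most probable class
```

Because softmax preserves order, y_hat equals the argmax of the raw logits X @ W.T + b, so prediction alone does not require evaluating the exponentials.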

94.2.2 2.2 Overparameterization

The shift invariance of softmax carries over to the parameters. Replacing every \(\mathbf{w}_k\) by \(\mathbf{w}_k + \mathbf{v}\) and every \(b_k\) by \(b_k + c\) for a shared \(\mathbf{v}\) and \(c\) adds the same quantity to all logits and leaves the predicted distribution unchanged. The model is therefore overparameterized by exactly one class worth of parameters. Two remedies are common. One can fix the parameters of a reference class to zero, which reduces the count to \(K-1\) weight vectors and recovers the classical statistical formulation. Alternatively one keeps the full \(K\) sets of parameters for symmetry and convenience and relies on weight regularization, such as an \(L_2\) penalty, to pin down a unique solution. Modern deep learning frameworks almost universally take the second route.
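The redundancy can be seen directly in a small experiment. In this sketch the parameters, the shared shift (v, c), and the choice of the last class as the reference are all arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
K, d = 4, 3
W, b = rng.normal(size=(K, d)), rng.normal(size=K)
x = rng.normal(size=d)

# Shared shift applied to every class: v is added to each row of W.
v, c = rng.normal(size=d), 2.5
p_original = softmax(W @ x + b)
p_shifted = softmax((W + v) @ x + (b + c))

# First remedy: subtract the last class's parameters so the reference is zero.
W_ref, b_ref = W - W[-1], b - b[-1]
p_reference = softmax(W_ref @ x + b_ref)
```

All three probability vectors should agree to floating-point precision, even though the three parameter sets differ.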

94.3 3. Cross-Entropy Loss and Its Gradient

94.3.1 3.1 Maximum likelihood and the loss

Suppose the true label for an example is encoded as a one-hot vector \(\mathbf{y} \in \{0,1\}^K\) with \(y_k = 1\) for the correct class. The likelihood the model assigns to that label is \(\prod_{k} \sigma_k(\mathbf{z})^{y_k}\). Taking the negative logarithm gives the cross-entropy loss for a single example:

\[ \mathcal{L}(\mathbf{z}, \mathbf{y}) = -\sum_{k=1}^{K} y_k \log \sigma_k(\mathbf{z}) = -\log \sigma_c(\mathbf{z}), \]

where \(c\) is the index of the true class. The loss measures the surprise of the model on the correct answer. It is zero only in the limit where the model places all mass on the true class, and it grows without bound as the predicted probability of the true class approaches zero. Averaging \(\mathcal{L}\) over a training set and minimizing it is precisely maximum likelihood estimation, since minimizing the average negative log likelihood maximizes the joint likelihood of the data.

94.3.2 3.2 Cross-entropy as a divergence

Cross-entropy has an information-theoretic reading. For a target distribution \(\mathbf{y}\) and a predicted distribution \(\mathbf{p} = \boldsymbol{\sigma}(\mathbf{z})\), the cross-entropy is \(H(\mathbf{y}, \mathbf{p}) = -\sum_k y_k \log p_k\). It decomposes as

\[ H(\mathbf{y}, \mathbf{p}) = H(\mathbf{y}) + D_{\mathrm{KL}}(\mathbf{y} \parallel \mathbf{p}), \]

where \(H(\mathbf{y})\) is the entropy of the target and \(D_{\mathrm{KL}}\) is the Kullback-Leibler divergence. When \(\mathbf{y}\) is one-hot its entropy is zero, so minimizing cross-entropy is identical to minimizing the KL divergence from the target to the prediction. This view also justifies soft targets, as used in label smoothing and knowledge distillation, where \(\mathbf{y}\) is a full distribution rather than a one-hot vector.
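The decomposition can be verified for both a one-hot and a soft target. The sketch below uses made-up distributions; the helper names are illustrative, and terms with \(y_k = 0\) are skipped under the usual convention \(0 \log 0 = 0\).

```python
import numpy as np

def cross_entropy(y, p):
    """H(y, p) = -sum_k y_k log p_k, skipping terms with y_k = 0."""
    mask = y > 0
    return -np.sum(y[mask] * np.log(p[mask]))

def entropy(y):
    return cross_entropy(y, y)

def kl_divergence(y, p):
    mask = y > 0
    return np.sum(y[mask] * np.log(y[mask] / p[mask]))

p = np.array([0.7, 0.2, 0.1])            # predicted distribution
y_hard = np.array([1.0, 0.0, 0.0])       # one-hot target
y_soft = np.array([0.9, 0.05, 0.05])     # smoothed target (illustrative values)

# H(y, p) = H(y) + KL(y || p) for both kinds of target.
decomposition_holds = all(
    np.isclose(cross_entropy(y, p), entropy(y) + kl_divergence(y, p))
    for y in (y_hard, y_soft)
)

loss_hard = cross_entropy(y_hard, p)     # equals -log(0.7)
```

For the one-hot target the entropy term vanishes, so the cross-entropy and the KL divergence coincide; for the soft target they differ by the constant \(H(\mathbf{y})\), which does not affect the gradient.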

94.3.3 3.3 The gradient with respect to logits

The signature elegance of the softmax cross-entropy pairing is its gradient. We want \(\partial \mathcal{L} / \partial z_j\). Writing \(\mathcal{L} = -\sum_k y_k \log \sigma_k\) and using the Jacobian from Section 1.3,

\[ \frac{\partial \mathcal{L}}{\partial z_j} = -\sum_{k} y_k \frac{1}{\sigma_k} \frac{\partial \sigma_k}{\partial z_j} = -\sum_k y_k \frac{1}{\sigma_k}\, \sigma_k (\delta_{kj} - \sigma_j). \]

Simplifying the sum,

\[ \frac{\partial \mathcal{L}}{\partial z_j} = -\sum_k y_k (\delta_{kj} - \sigma_j) = -y_j + \sigma_j \sum_k y_k = \sigma_j - y_j, \]

where the last step uses \(\sum_k y_k = 1\). In vector form,

\[ \nabla_{\mathbf{z}} \mathcal{L} = \boldsymbol{\sigma}(\mathbf{z}) - \mathbf{y} = \mathbf{p} - \mathbf{y}. \]

The gradient is simply the difference between the predicted distribution and the target. This result holds for soft targets too. The simplicity is not an accident: cross-entropy is the matching loss for the softmax link function, and pairing any generalized linear model’s canonical link with the negative log likelihood produces a residual of exactly this form.
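A finite-difference check makes the result concrete for both hard and soft targets. This is a sketch under illustrative assumptions: the logits, the targets, and the helper numerical_grad are invented for the example, and the loss is written as \(\log \sum_j e^{z_j} - \mathbf{y}^\top \mathbf{z}\), which equals \(-\sum_k y_k \log \sigma_k\) whenever \(\mathbf{y}\) sums to one.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    """Cross-entropy from logits for a target y that sums to one."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum()) - y @ z

def numerical_grad(f, z, eps=1e-6):
    """Central-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    for j in range(len(z)):
        step = np.zeros_like(z)
        step[j] = eps
        g[j] = (f(z + step) - f(z - step)) / (2 * eps)
    return g

z = np.array([0.3, -1.2, 2.0, 0.5])
targets = {
    "one-hot": np.array([0.0, 0.0, 1.0, 0.0]),
    "soft": np.array([0.1, 0.1, 0.7, 0.1]),
}

# Largest discrepancy between p - y and the numerical gradient, per target.
errors = {
    name: np.abs((softmax(z) - y) - numerical_grad(lambda v: loss(v, y), z)).max()
    for name, y in targets.items()
}
```

Both entries of errors should be at the level of finite-difference noise, far below the size of the gradient itself.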

94.3.4 3.4 Gradient with respect to parameters

Propagating through the affine map \(\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}\) by the chain rule gives the parameter gradients. For a single example,

\[ \nabla_{\mathbf{w}_k} \mathcal{L} = (\sigma_k - y_k)\, \mathbf{x}, \qquad \frac{\partial \mathcal{L}}{\partial b_k} = \sigma_k - y_k. \]

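Averaged over a minibatch of \(N\) examples, the weight gradient is \(\frac{1}{N}\sum_n (\mathbf{p}^{(n)} - \mathbf{y}^{(n)})(\mathbf{x}^{(n)})^\top\), a \(K \times d\) matrix matching the shape of \(\mathbf{W}\). Because the objective is convex in \((\mathbf{W}, \mathbf{b})\), gradient descent with a suitable step size converges toward the global minimum. Without regularization the minimizer is not unique, owing to the redundancy of Section 2.2, and on linearly separable data it is not attained at any finite parameter value; adding a strictly convex penalty such as \(L_2\) regularization on all parameters makes the minimizer unique and finite.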
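Put together, the gradients give a complete training loop. The following sketch runs full-batch gradient descent on a synthetic three-class problem; the data, the step size of 0.1, and the 200 iterations are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np

def loss_and_grads(W, b, X, Y):
    """Mean cross-entropy and its gradients; X is (N, d), Y is one-hot (N, K)."""
    Z = X @ W.T + b
    Z = Z - Z.max(axis=1, keepdims=True)          # max shift for stability
    log_P = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    loss = -np.mean(np.sum(Y * log_P, axis=1))
    R = (np.exp(log_P) - Y) / len(X)              # residuals p - y, averaged
    return loss, R.T @ X, R.sum(axis=0)           # loss, dL/dW (K, d), dL/db (K,)

# Synthetic data: three Gaussian blobs in the plane.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 3.0], [-3.0, -2.0], [3.0, -2.0]])
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(size=(300, 2))
Y = np.eye(3)[labels]

W, b = np.zeros((3, 2)), np.zeros(3)
initial_loss, _, _ = loss_and_grads(W, b, X, Y)   # log(3) at zero parameters
for _ in range(200):
    loss, gW, gb = loss_and_grads(W, b, X, Y)
    W -= 0.1 * gW
    b -= 0.1 * gb

accuracy = np.mean((X @ W.T + b).argmax(axis=1) == labels)
```

At zero parameters the model predicts the uniform distribution, so the initial loss is \(\log 3\); the loss should then decrease steadily as the linear boundaries settle between the blobs.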

94.3.5 3.5 Convexity

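The average cross-entropy loss is convex in the model parameters. Per example the loss is \(\operatorname{LSE}(\mathbf{z}) - z_c\), where \(\operatorname{LSE}(\mathbf{z}) = \log \sum_j e^{z_j}\) is the log-sum-exp function. The composition of the affine logit map with the convex log-sum-exp function is convex, and subtracting the linear term \(z_c\) for the correct class preserves convexity. Convexity guarantees that any local minimum is global, so first-order methods such as (stochastic) gradient descent, quasi-Newton methods such as L-BFGS, and Newton-type methods cannot stall in a poor local minimum and converge to the global optimum under their usual step-size conditions. This is a distinguishing property of softmax regression relative to the deep networks that embed it, whose overall objectives are non-convex.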
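Convexity can be spot-checked numerically, although such a check is only a sanity test and not a proof. The sketch below evaluates the defining inequality \(f(t\mathbf{a} + (1-t)\mathbf{b}) \le t f(\mathbf{a}) + (1-t) f(\mathbf{b})\) on random pairs of parameter vectors; the data, the sizes, and the flattened parameter layout are illustrative assumptions.

```python
import numpy as np

def mean_cross_entropy(theta, X, labels, K):
    """Mean loss as a function of the flattened parameters theta = [W, b]."""
    N, d = X.shape
    W, b = theta[:K * d].reshape(K, d), theta[K * d:]
    Z = X @ W.T + b
    m = Z.max(axis=1)
    lse = m + np.log(np.exp(Z - m[:, None]).sum(axis=1))
    return np.mean(lse - Z[np.arange(N), labels])

rng = np.random.default_rng(0)
N, d, K = 50, 3, 4
X = rng.normal(size=(N, d))
labels = rng.integers(0, K, size=N)

# Count violations of the convexity inequality along random segments.
violations = 0
for _ in range(1000):
    a, b = rng.normal(size=(2, K * d + K)) * 3.0
    t = rng.uniform()
    lhs = mean_cross_entropy(t * a + (1 - t) * b, X, labels, K)
    rhs = (t * mean_cross_entropy(a, X, labels, K)
           + (1 - t) * mean_cross_entropy(b, X, labels, K))
    violations += int(lhs > rhs + 1e-12)   # small slack for rounding error
```

No violations should be found, consistent with the argument above.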

94.4 4. Numerical Stability

94.4.1 4.1 The overflow and underflow problem

A naive evaluation of \(e^{z_k}\) overflows in floating point when any logit is large. In IEEE double precision \(e^{z}\) overflows to infinity for \(z \gtrsim 710\), and in single precision the threshold is near \(89\). Even when individual exponentials are finite, their sum can overflow, and very negative logits underflow to zero, which then poisons a subsequent logarithm with \(\log 0 = -\infty\). Logits of these magnitudes do occur in trained networks, and in half precision the overflow threshold drops to roughly \(11\), so unstabilized softmax is not merely fragile but unreliable in practice.
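These failure modes are easy to reproduce. The sketch below probes the overflow thresholds in NumPy and evaluates an unstabilized softmax on deliberately extreme logits; the specific test values are chosen for illustration, and warnings are suppressed only so the results can be inspected.

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    # Overflow thresholds for exp in double and single precision.
    double_ok, double_inf = np.exp(np.float64(709.0)), np.exp(np.float64(710.0))
    single_ok, single_inf = np.exp(np.float32(88.0)), np.exp(np.float32(89.0))

    # Naive softmax with one large logit: inf / inf yields nan.
    z = np.array([1000.0, 0.0, -1000.0])
    naive = np.exp(z) / np.exp(z).sum()

    # Very negative logit: the exponential underflows to 0 and its log is -inf.
    tiny = np.exp(-1000.0)
    log_tiny = np.log(tiny)
```

The first entry of naive comes out as nan rather than the correct value, which is essentially one, and a single nan of this kind typically propagates through every later computation that touches it.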

94.4.2 4.2 The max-subtraction trick

Shift invariance, established in Section 1.2, provides the fix at no cost to correctness. Let \(m = \max_k z_k\) and compute

\[ \sigma_k(\mathbf{z}) = \frac{e^{z_k - m}}{\sum_{j} e^{z_j - m}}. \]

After subtraction the largest exponent is \(0\), so the largest term is \(e^{0} = 1\) and no term overflows; the denominator is at least one, so no division by an underflowed sum occurs. Terms that underflow to zero are the genuinely negligible ones and do no harm.

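import numpy as np

def softmax(z):
    # Subtract the max along the class axis so the largest exponent is 0.
    m = z.max(axis=-1, keepdims=True)
    e = np.exp(z - m)
    # Normalize along the class axis so the probabilities sum to one.
    return e / e.sum(axis=-1, keepdims=True)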

94.4.3 4.3 The log-sum-exp trick and fused cross-entropy

For the loss we need \(\log \sigma_c(\mathbf{z}) = z_c - \log \sum_j e^{z_j}\). The second term is the log-sum-exp function, stabilized identically:

\[ \operatorname{LSE}(\mathbf{z}) = \log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}. \]

The cross-entropy loss then becomes

\[ \mathcal{L} = \operatorname{LSE}(\mathbf{z}) - z_c = m + \log \sum_j e^{z_j - m} - z_c. \]

Computing the loss directly from logits this way, rather than first forming probabilities and then taking their logarithm, avoids taking the logarithm of a probability that has underflowed to zero (giving an infinite loss) or rounded to exactly one (giving a loss of exactly zero). This is why production libraries provide a single fused operation, such as cross_entropy taking raw logits, instead of asking the user to compose a softmax with a separate log loss. The fused path is typically faster and far more accurate, and hand-written reimplementations that skip it are one of the most common sources of bugs.

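import numpy as np

def cross_entropy_from_logits(z, c):
    # z: logits with classes on the last axis; c: integer true-class index per row.
    m = z.max(axis=-1, keepdims=True)
    lse = m + np.log(np.exp(z - m).sum(axis=-1, keepdims=True))
    z_c = np.take_along_axis(z, np.expand_dims(c, -1), axis=-1)
    return (lse - z_c).squeeze(-1)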
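The difference between the two routes shows up as soon as the model is confidently wrong. In the sketch below, which uses illustrative logits and hypothetical helper names, even a stabilized softmax followed by a logarithm returns an infinite loss, while the fused computation returns the correct finite value.

```python
import numpy as np

def stable_softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fused_cross_entropy(z, c):
    """LSE(z) - z[c] for a single logit vector z and true class index c."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum()) - z[c]

z = np.array([800.0, 0.0, -5.0])   # model strongly prefers class 0
c = 1                              # but the true class is 1

with np.errstate(divide="ignore"):
    # Two-step route: the true-class probability underflows to exactly 0.
    two_step = -np.log(stable_softmax(z)[c])

fused = fused_cross_entropy(z, c)  # finite, approximately 800
```

The infinite two-step loss would also produce unusable gradients, whereas the fused value still yields the well-behaved gradient \(\mathbf{p} - \mathbf{y}\).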

94.5 5. Relationship to Multinomial Logistic Regression

94.5.1 5.1 The same model under two names

Softmax regression and multinomial logistic regression are the same model. The statistics literature derives multinomial logistic regression by choosing a reference category, say class \(K\), and modeling the log odds of every other class against it as linear:

\[ \log \frac{p(y = k \mid \mathbf{x})}{p(y = K \mid \mathbf{x})} = \boldsymbol{\beta}_k^\top \mathbf{x}, \qquad k = 1, \ldots, K-1. \]

Exponentiating and using the constraint that probabilities sum to one yields

\[ p(y = k \mid \mathbf{x}) = \frac{\exp(\boldsymbol{\beta}_k^\top \mathbf{x})}{1 + \sum_{j=1}^{K-1} \exp(\boldsymbol{\beta}_j^\top \mathbf{x})}, \]

which is exactly softmax with the reference class fixed at \(\boldsymbol{\beta}_K = \mathbf{0}\). The machine learning convention keeps all \(K\) parameter vectors and accepts the redundant degree of freedom discussed in Section 2.2. The two parameterizations describe identical distributions; they differ only in identifiability and notation.
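The equivalence can be confirmed by converting one parameterization into the other. In this sketch the weights are random and the bias is omitted for brevity; the conversion \(\boldsymbol{\beta}_k = \mathbf{w}_k - \mathbf{w}_K\) follows from taking log odds against the reference class.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, d = 4, 3
W = rng.normal(size=(K, d))     # machine learning convention: K weight vectors
x = rng.normal(size=d)

# Statistics convention: K-1 coefficient vectors relative to reference class K.
beta = W[:-1] - W[-1]
scores = np.exp(beta @ x)
p_reference = np.append(scores, 1.0) / (1.0 + scores.sum())

p_softmax = softmax(W @ x)
log_odds = np.log(p_softmax[:-1] / p_softmax[-1])   # equals beta @ x
```

The two probability vectors should match, and the log odds recovered from the softmax output should be linear in \(\mathbf{x}\) with coefficients beta.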

94.5.2 5.2 Reduction to binary logistic regression

When \(K = 2\) the model must reduce to ordinary logistic regression. Setting the reference class to \(2\), the probability of class \(1\) is

\[ p(y = 1 \mid \mathbf{x}) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma\!\left(z_1 - z_2\right), \]

which is the logistic sigmoid applied to the single effective logit \(z_1 - z_2\). The two-class softmax has one redundant parameter set; eliminating it recovers the familiar single-weight-vector logistic model exactly. Softmax regression is thus the faithful multiclass generalization of logistic regression, and cross-entropy generalizes the binary log loss.
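A short numerical check of the reduction, using randomly drawn logit pairs and arbitrary labels as illustrative inputs, compares both the probabilities and the losses of the two formulations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(3)
Z = rng.normal(size=(6, 2)) * 3.0         # six pairs of logits (z_1, z_2)
p_class1_softmax = softmax(Z)[:, 0]
p_class1_sigmoid = sigmoid(Z[:, 0] - Z[:, 1])

# Two-class cross-entropy versus binary log loss on the effective logit.
labels = np.array([0, 1, 0, 0, 1, 1])     # column index of the true class
ce = -np.log(softmax(Z)[np.arange(6), labels])
y1 = (labels == 0).astype(float)          # 1 when the true class is class 1
log_loss = -(y1 * np.log(p_class1_sigmoid)
             + (1 - y1) * np.log(1 - p_class1_sigmoid))
```

Both comparisons should agree to floating-point precision, which is the sense in which cross-entropy generalizes the binary log loss.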

94.5.3 5.3 Generalized linear model perspective

Both models are instances of a generalized linear model for a categorical (multinomial) response. The softmax is the inverse of the canonical link for the multinomial family, and the cross-entropy gradient \(\mathbf{p} - \mathbf{y}\) is the canonical residual that appears for every exponential-family GLM under its natural link. This unifying view explains why the gradient is so clean and why the loss is convex, and it places softmax regression in the same family as linear and Poisson regression.

94.6 6. Temperature

94.6.1 6.1 Definition and effect

A temperature parameter \(T > 0\) rescales the logits before the softmax:

\[ \sigma_k(\mathbf{z}; T) = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}. \]

Temperature controls how sharp or diffuse the output distribution is without changing the ranking of the classes. As \(T \to 0^+\), the distribution concentrates entirely on the highest logit (tied maxima share the mass equally) and softmax approaches a hard \(\arg\max\), a regime sometimes called low temperature or greedy. As \(T \to \infty\), the logits are flattened and the distribution approaches uniform, the high-temperature regime. Setting \(T = 1\) recovers ordinary softmax. Equivalently, dividing logits by \(T\) scales every pairwise log odds by \(1/T\), so temperature is a uniform sharpening or smoothing of confidence.
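The limiting behavior is visible even at moderate temperatures. The sketch below uses a hypothetical helper softmax_with_temperature and illustrative logits and temperatures.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T for T > 0, computed with the usual max shift."""
    s = np.asarray(z, dtype=float) / T
    e = np.exp(s - s.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
cold = softmax_with_temperature(z, T=0.05)    # nearly one-hot on index 0
plain = softmax_with_temperature(z, T=1.0)    # ordinary softmax
hot = softmax_with_temperature(z, T=100.0)    # nearly uniform

# Pairwise log odds scale by 1/T.
T = 4.0
p_T = softmax_with_temperature(z, T)
log_odds_ratio = np.log(p_T[0] / p_T[1]) / np.log(plain[0] / plain[1])
```

The value of log_odds_ratio should equal \(1/T = 0.25\), and the most probable class is the same in all three distributions.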

94.6.2 6.2 Uses in practice

Temperature appears in several distinct roles. In sampling from generative models, including the token distributions of large language models, temperature trades off determinism against diversity: low temperature yields safe, repetitive output while high temperature increases variety at the cost of coherence. In knowledge distillation, a high temperature softens the teacher’s distribution so that the relative magnitudes of the non-target logits, the so-called dark knowledge, carry usable gradient signal to the student; the loss is typically scaled by \(T^2\) to keep gradient magnitudes comparable across temperatures. In calibration, a single temperature fit on a held-out set, known as temperature scaling, adjusts an overconfident classifier’s probabilities to better match empirical accuracy without altering its predictions, since the \(\arg\max\) is invariant to \(T\).
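As one concrete instance of the calibration use, the sketch below fits a single temperature by grid search on held-out negative log likelihood. Everything about the data is a stated assumption: the held-out set is synthetic, with labels drawn from softmax(true_logits) and a simulated overconfident model that reports logits three times too large, and practical implementations usually use a proper one-dimensional optimizer instead of a grid.

```python
import numpy as np

def mean_nll(logits, labels, T):
    """Mean cross-entropy of softmax(logits / T) against integer labels."""
    S = logits / T
    m = S.max(axis=1)
    lse = m + np.log(np.exp(S - m[:, None]).sum(axis=1))
    return np.mean(lse - S[np.arange(len(labels)), labels])

# Synthetic held-out set (assumption): sample labels by inverse-CDF sampling.
rng = np.random.default_rng(0)
N, K = 5000, 5
true_logits = rng.normal(size=(N, K))
P = np.exp(true_logits)
P /= P.sum(axis=1, keepdims=True)
u = rng.uniform(size=(N, 1))
labels = np.minimum((P.cumsum(axis=1) < u).sum(axis=1), K - 1)

# Simulated overconfident model: logits inflated by a factor of 3.
overconfident_logits = 3.0 * true_logits

# Fit one temperature by grid search on held-out negative log likelihood.
grid = np.linspace(0.5, 6.0, 111)
nll = np.array([mean_nll(overconfident_logits, labels, T) for T in grid])
T_fit = grid[nll.argmin()]
```

In this setup the fitted temperature should land near the inflation factor of 3, and dividing by it lowers the held-out loss without changing any predicted class.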

94.6.3 6.3 Temperature and the gradient

Because \(z_k / T\) is just a rescaled logit, the gradient analysis of Section 3 carries through. With respect to the scaled logits the gradient is still \(\mathbf{p} - \mathbf{y}\), and with respect to the original logits an extra factor of \(1/T\) appears by the chain rule. During training one normally leaves \(T = 1\) and learns the logit scale through the weights themselves; temperature is most useful as a post hoc or inference-time control rather than a trained parameter, precisely because it can be tuned after the fact without retraining the underlying model.
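The extra factor of \(1/T\) can be confirmed with the same finite-difference approach used for the plain gradient. The logits, target, and temperature below are illustrative, and loss_T and grad_T are hypothetical helper names.

```python
import numpy as np

def loss_T(z, y, T):
    """Cross-entropy of softmax(z / T) against a target y that sums to one."""
    s = z / T
    m = s.max()
    return m + np.log(np.exp(s - m).sum()) - y @ s

def grad_T(z, y, T):
    """Analytic gradient with respect to the unscaled logits: (p - y) / T."""
    s = z / T
    p = np.exp(s - s.max())
    p /= p.sum()
    return (p - y) / T

z = np.array([1.5, -0.3, 0.8])
y = np.array([0.0, 1.0, 0.0])
T, eps = 2.5, 1e-6

# Central finite differences with respect to each original logit.
numeric = np.array([
    (loss_T(z + eps * e_j, y, T) - loss_T(z - eps * e_j, y, T)) / (2 * eps)
    for e_j in np.eye(len(z))
])
max_abs_err = np.abs(numeric - grad_T(z, y, T)).max()
```

The discrepancy max_abs_err should be negligible, confirming that a higher temperature shrinks the gradient reaching the logits in proportion to \(1/T\).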

94.7 7. Summary

Softmax regression is a linear, convex, probabilistic classifier whose output is a distribution over \(K\) classes. Its mathematics is unusually tidy: the softmax Jacobian is \(\operatorname{diag}(\boldsymbol{\sigma}) - \boldsymbol{\sigma}\boldsymbol{\sigma}^\top\), and pairing softmax with cross-entropy yields the gradient \(\mathbf{p} - \mathbf{y}\), a difference of distributions that drives both the convex parameter estimation here and the backpropagation through every deep network that uses a softmax head. The same model is known in statistics as multinomial logistic regression, reduces to logistic regression when \(K = 2\), and belongs to the exponential-family GLMs. Two practical levers complete the picture: the max-subtraction and log-sum-exp tricks make the computation numerically safe, and temperature offers a single knob to sharpen or smooth the output distribution for sampling, distillation, and calibration. Mastering these few ideas equips a practitioner to read, implement, and debug the classification layer that terminates the vast majority of contemporary models.
