189 Loss Functions for Classification in Neural Networks

Classification is the workhorse task of modern deep learning, and the choice of loss function determines what a network actually optimizes. This chapter develops the theory and numerical practice of the loss functions that dominate classification: the softmax cross-entropy for multiclass problems, the binary cross-entropy with logits for multilabel and binary problems, and the regularization technique of label smoothing. We pay close attention to numerical stability, because the naive mathematical forms of these objectives overflow and underflow in finite precision arithmetic, and the fused implementations used in practice differ substantially from the textbook equations.

189.1 1. From Probabilistic Modeling to Cross-Entropy

A classifier with parameters $\theta$ defines a conditional distribution $p_\theta(y \mid x)$ over labels $y$ given an input $x$. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ drawn from an unknown data distribution, maximum likelihood estimation seeks the parameters that maximize the probability of the observed labels. Taking logarithms and negating turns the product over examples into a sum to be minimized:

\[ \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i). \]

This is the average negative log likelihood, and it is identical to the empirical cross-entropy between the data distribution and the model. To see the connection, let $q_i$ denote the one-hot target distribution that places all mass on the true class $y_i$, and let $p_i$ denote the model distribution over classes. The cross-entropy is

\[ H(q_i, p_i) = -\sum_{k} q_i(k) \log p_i(k) = -\log p_i(y_i), \]

because $q_i$ is zero everywhere except at $k = y_i$. Minimizing cross-entropy is therefore equivalent to maximum likelihood. There is a complementary information-theoretic reading. Cross-entropy decomposes as

\[ H(q, p) = H(q) + D_{\mathrm{KL}}(q \,\|\, p), \]

where $H(q)$ is the entropy of the target distribution and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. Since $H(q)$ does not depend on $\theta$, minimizing cross-entropy is the same as minimizing the divergence $D_{\mathrm{KL}}(q \,\|\, p)$ of the model from the targets. That divergence is zero exactly when $p$ matches $q$, which formalizes the intuition that we want the model to reproduce the labeling.
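
The decomposition can be checked numerically. The sketch below is illustrative only: the distributions `q_soft`, `q_hard`, and `p` are made-up values, and the helper names are not from any library. For the one-hot target the entropy term vanishes, so cross-entropy and KL divergence coincide.

```python
import math

def entropy(q):
    """Shannon entropy H(q) in nats, skipping zero-probability terms."""
    return -sum(qk * math.log(qk) for qk in q if qk > 0)

def cross_entropy(q, p):
    """Cross-entropy H(q, p) in nats."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

def kl_divergence(q, p):
    """KL divergence D_KL(q || p) in nats."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

q_soft = [0.7, 0.2, 0.1]  # illustrative soft target
q_hard = [1.0, 0.0, 0.0]  # one-hot target
p = [0.5, 0.3, 0.2]       # illustrative model distribution

for name, q in [("soft", q_soft), ("one-hot", q_hard)]:
    lhs = cross_entropy(q, p)
    rhs = entropy(q) + kl_divergence(q, p)
    print(f"{name}: H(q,p) = {lhs:.6f}, H(q) + KL = {rhs:.6f}")
```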

189.2 2. The Softmax and Multiclass Cross-Entropy

189.2.1 2.1 The softmax link function

A neural network for $K$ class classification produces a vector of real valued scores $z = (z_1, \dots, z_K)$, called logits, which live in $\mathbb{R}^K$ and are unconstrained. The softmax function maps logits to a probability simplex:

\[ \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}. \]

The outputs are positive and sum to one, so they constitute a valid distribution. The softmax is invariant to adding a constant to every logit, since $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$. This shift invariance is both the source of a numerical trick and a reminder that the logits are identified only up to an additive constant.
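
A quick way to see the shift invariance is to add the same constant to every logit and compare the outputs. This is a minimal sketch using the textbook formula, which is adequate for the small logits chosen here; the constant `5.0` is arbitrary.

```python
import math

def softmax(z):
    """Textbook softmax; fine for the small logits used in this example."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]
shifted = [v + 5.0 for v in z]  # add the same constant to every logit

p = softmax(z)
p_shifted = softmax(shifted)

print([round(v, 4) for v in p])
print([round(v, 4) for v in p_shifted])
print("sum:", round(sum(p), 12))
```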

189.2.2 2.2 The loss and its gradient

Combining the softmax with cross-entropy, the per-example loss for true class $y$ is

\[ \ell(z, y) = -\log \mathrm{softmax}(z)_y = -z_y + \log \sum_{j=1}^{K} e^{z_j}. \]

The second term is the log-sum-exp function, written $\mathrm{LSE}(z)$. The loss is thus the gap between the log-sum-exp of all logits and the logit of the correct class. The gradient with respect to the logits is remarkably clean. Writing $p = \mathrm{softmax}(z)$,

\[ \frac{\partial \ell}{\partial z_k} = p_k - \mathbb{1}[k = y], \]

which in vector form is $p - q$, the difference between the predicted distribution and the one-hot target. This is the same elegant form that appears in logistic and linear regression with their canonical link functions, and it is no coincidence: the softmax is the canonical link for the categorical distribution in the exponential family, and the predicted-minus-target gradient is a general property of such models. Because the gradient is bounded in magnitude by one per coordinate and never saturates to zero when the prediction is wrong, the softmax cross-entropy supplies strong learning signal even when the model is confidently incorrect, which is one reason it is preferred over a mean squared error applied to softmax outputs.

189.2.3 2.3 Deriving the gradient

The clean predicted-minus-target form is worth deriving explicitly, because the cancellation that produces it is the reason the loss is so well behaved. Start from the Jacobian of the softmax. Writing $p_k = \mathrm{softmax}(z)_k$ and differentiating $p_k = e^{z_k} / \sum_j e^{z_j}$,

\[ \frac{\partial p_k}{\partial z_i} = p_k\,(\delta_{ki} - p_i), \]

where $\delta_{ki}$ is the Kronecker delta. The diagonal terms $i = k$ give $p_k(1 - p_k)$ and the off-diagonal terms give $-p_k p_i$, so the softmax Jacobian is $\mathrm{diag}(p) - p p^\top$, a symmetric positive semidefinite matrix. Now differentiate the loss $\ell = -\log p_y$ through this Jacobian:

\[ \frac{\partial \ell}{\partial z_i} = -\frac{1}{p_y}\frac{\partial p_y}{\partial z_i} = -\frac{1}{p_y}\,p_y\,(\delta_{yi} - p_i) = p_i - \delta_{yi} = p_i - q_i. \]

The factor $1/p_y$ cancels exactly against the $p_y$ from the Jacobian, which is precisely why the softmax and the logarithm are paired: the log undoes the normalization that the softmax introduces, leaving a residual that depends on the prediction error alone. The same calculation done directly on the fused form $\ell = \mathrm{LSE}(z) - z_y$ is shorter still, since $\partial \mathrm{LSE}/\partial z_i = e^{z_i}/\sum_j e^{z_j} = p_i$ and $\partial(-z_y)/\partial z_i = -\delta_{yi}$, giving $p_i - q_i$ immediately.
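
The result can be sanity-checked against central finite differences. The sketch below assumes nothing beyond the formulas above; the function names and the step size `h` are illustrative choices.

```python
import math

def lse(z):
    """Stabilized log-sum-exp."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def loss(z, y):
    """Fused softmax cross-entropy: LSE(z) - z[y]."""
    return lse(z) - z[y]

def analytic_grad(z, y):
    """Predicted-minus-target gradient p - q."""
    total = lse(z)
    return [math.exp(v - total) - (1.0 if k == y else 0.0)
            for k, v in enumerate(z)]

def numeric_grad(z, y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(len(z)):
        up, down = list(z), list(z)
        up[k] += h
        down[k] -= h
        grad.append((loss(up, y) - loss(down, y)) / (2 * h))
    return grad

z, y = [2.0, 1.0, 0.1], 0
g_analytic = analytic_grad(z, y)
g_numeric = numeric_grad(z, y)
print("analytic:", [round(v, 6) for v in g_analytic])
print("numeric: ", [round(v, 6) for v in g_numeric])
```

The two gradients should agree to within the finite-difference error, and the analytic gradient sums to zero because both $p$ and $q$ sum to one.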

For a batch of $N$ examples the averaged loss $\tfrac{1}{N}\sum_i \ell(z_i, y_i)$ has gradient $(p_i - q_i)/N$ with respect to the logits $z_i$ of example $i$; stacked over the batch, this is the matrix $(P - Q)/N$ whose rows are the scaled per-example residuals. Under label smoothing the only change is that the hard target $q$ becomes the smoothed $q'$, so the gradient is $(P - Q')/N$; every row still sums to zero because $p$ and $q'$ are both distributions, which is the discrete analogue of the constraint that gradients of a normalized objective are tangent to the simplex.

189.3 3. Numerical Stability of Log-Sum-Exp

189.3.1 3.1 Why the naive form fails

The expression $\log \sum_j e^{z_j}$ is dangerous in floating point. If any logit is large, say $z_j = 1000$, then $e^{z_j}$ overflows to infinity in IEEE 754 double precision, whose largest finite value is near $1.8 \times 10^{308}$, corresponding to an exponent argument of about $709$. In single precision the threshold is near $88$. Conversely, if all logits are very negative, every exponential underflows to zero, the sum is zero, and the logarithm returns negative infinity. Either way the result is a NaN or an infinity that poisons the backward pass.
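
Both failure modes are easy to reproduce in double precision. One caveat about this sketch: Python's `math.exp` raises `OverflowError` rather than returning infinity, and `math.log(0.0)` raises `ValueError`, whereas array libraries typically return `inf` and `-inf` silently. The helper `try_naive` exists only to report which failure occurs.

```python
import math

def naive_lse(z):
    """Textbook log-sum-exp with no stabilization."""
    return math.log(sum(math.exp(v) for v in z))

def try_naive(z):
    """Return the naive result, or a description of the failure."""
    try:
        return naive_lse(z)
    except (OverflowError, ValueError) as err:
        return f"{type(err).__name__}: {err}"

# Moderate logits: the naive form works.
print(try_naive([1.0, 2.0, 3.0]))
# exp(1000) exceeds the largest finite double.
print(try_naive([1000.0, 999.0, 998.0]))
# Every exponential underflows to 0.0, so the log is taken of zero.
print(try_naive([-1000.0, -999.0, -998.0]))
```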

189.3.2 3.2 The shift trick

The shift invariance of the softmax provides the cure. Let $m = \max_j z_j$. Then

\[ \mathrm{LSE}(z) = m + \log \sum_{j=1}^{K} e^{z_j - m}. \]

After subtracting the maximum, the largest exponent argument is zero, so the largest term is exactly one and cannot overflow. At least one term in the sum equals one, so the sum is at least one and its logarithm is finite and well defined. Smaller terms may underflow to zero, but those terms were negligible anyway, so the relative error is tiny. This stabilized log-sum-exp is the foundation of every production softmax implementation.

# stable log-sum-exp
m = max(z)
lse = m + log(sum(exp(z - m)))
loss = lse - z[y]
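
A runnable version of the pseudocode above, as a minimal pure-Python sketch (the function names are illustrative, and no input validation is attempted). The extreme logits that break the naive form now give the same loss as their shifted counterparts.

```python
import math

def logsumexp(z):
    """Stabilized log-sum-exp: subtract the max before exponentiating."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def softmax_cross_entropy(z, y):
    """Per-example loss LSE(z) - z[y] for logits z and true class index y."""
    return logsumexp(z) - z[y]

print(softmax_cross_entropy([2.0, 1.0, 0.1], 0))
print(softmax_cross_entropy([1000.0, 999.0, 998.0], 0))     # no overflow
print(softmax_cross_entropy([-1000.0, -999.0, -998.0], 2))  # no log(0)
```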

189.3.3 3.3 Fused softmax cross-entropy

Practitioners rarely compute softmax probabilities and then take their logarithm: the two-step route exponentiates, normalizes, and then undoes the exponentiation with a logarithm, and it loses precision (or returns $-\infty$) when a probability underflows toward zero. Instead the loss is computed directly from logits in a single fused kernel using the identity

\[ \log \mathrm{softmax}(z)_k = z_k - \mathrm{LSE}(z). \]

This is why deep learning libraries expose an operation that consumes raw logits rather than probabilities. In PyTorch the function torch.nn.functional.cross_entropy expects logits and internally applies a stabilized log-softmax; passing it already-normalized probabilities is a common and silent bug. The general rule is to keep the network output in logit space for as long as possible and to fold the softmax into the loss.
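
The precision argument can be made concrete with a two-class example in which one probability underflows. This is a sketch with illustrative helper names; the logits are chosen so that $e^{-800}$ is below the smallest representable double.

```python
import math

def softmax(z):
    """Stabilized softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(z):
    """Fused log-softmax: z_k - LSE(z), never forming a probability."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

z = [0.0, -800.0]
p = softmax(z)          # p[1] underflows to exactly 0.0
fused = log_softmax(z)  # fused[1] stays finite

print("softmax:    ", p)
print("log_softmax:", fused)
# Taking math.log(p[1]) here would raise ValueError, because p[1] == 0.0.
```

Even a stabilized softmax cannot rescue the two-step route: once the probability has rounded to zero, the information needed by the logarithm is gone.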

189.4 4. Binary and Multilabel Classification

189.4.1 4.1 The sigmoid and binary cross-entropy

When there are two classes, or when each of several labels is independently present or absent, the relevant link is the logistic sigmoid:

\[ \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(z) \in (0, 1). \]

The sigmoid is the two class special case of the softmax applied to the logit difference. For a binary target $y \in \{0, 1\}$ with predicted probability $\hat{p} = \sigma(z)$, the binary cross-entropy loss is

\[ \ell(z, y) = -\big[\, y \log \hat{p} + (1 - y) \log (1 - \hat{p}) \,\big]. \]

Multilabel classification, where an input may carry several labels at once, treats each of the $C$ outputs as an independent Bernoulli problem and sums the binary cross-entropy over labels. This differs fundamentally from multiclass softmax, which couples the outputs through a single normalization and enforces that exactly one class is present.
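
Two points from this subsection can be checked directly: the sigmoid of a logit difference equals the two-class softmax, and independent sigmoid outputs are not constrained to sum to one. The logits below are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    """Logistic sigmoid (adequate for the moderate logits used here)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    """Stabilized softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Two-class softmax versus sigmoid of the logit difference.
z0, z1 = 0.3, 1.7
p_softmax = softmax([z0, z1])[1]
p_sigmoid = sigmoid(z1 - z0)
print(p_softmax, p_sigmoid)

# Multilabel: each output is its own Bernoulli probability.
multilabel_logits = [2.0, 1.5, -3.0]
probs = [sigmoid(v) for v in multilabel_logits]
print([round(v, 4) for v in probs], "sum =", round(sum(probs), 4))
```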

189.4.2 4.2 Binary cross-entropy with logits

The same numerical hazards reappear. Computing $\sigma(z)$ and then its logarithm overflows when $z$ is large and negative, because $e^{-z}$ explodes, and it produces $\log 0$ when $\sigma(z)$ saturates to zero or one. The stable formulation substitutes the sigmoid and simplifies. Starting from the binary cross-entropy and writing it in terms of the logit $z$,

\[ \ell(z, y) = \max(z, 0) - z \cdot y + \log\!\big(1 + e^{-|z|}\big). \]

This rearrangement, used by the function commonly named binary cross-entropy with logits, never exponentiates a positive number, since the argument $-|z|$ is always nonpositive, so it cannot overflow, and the $\max(z, 0)$ term carries the large magnitude behavior exactly. The derivation uses the identity $\log(1 + e^{z}) = \max(z, 0) + \log(1 + e^{-|z|})$, which is the softplus function written in a stable way.

# stable binary cross-entropy from logit z and target y
loss = max(z, 0) - z * y + log(1 + exp(-abs(z)))

In PyTorch this is binary_cross_entropy_with_logits; in TensorFlow it is sigmoid_cross_entropy_with_logits. As with the multiclass case, the lesson is to pass logits, not probabilities, so that the library can apply the stable form.
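
The stable form can be compared against the textbook form in plain Python. This is a sketch rather than any library's implementation: the function names are illustrative, and `math.log1p` is used in place of `log(1 + ...)` as a small accuracy refinement that the formula above does not require.

```python
import math

def bce_with_logits(z, y):
    """Stable binary cross-entropy from logit z and target y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_naive(z, y):
    """Textbook form via the sigmoid; breaks down for large |z|."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The two forms agree for moderate logits.
for z in (-3.0, 0.5, 4.0):
    for y in (0, 1):
        print(z, y, round(bce_with_logits(z, y), 6), round(bce_naive(z, y), 6))

# Only the stable form survives extreme logits.
print(bce_with_logits(1000.0, 0))   # confidently wrong: loss equals |z|
print(bce_with_logits(-1000.0, 1))  # confidently wrong: loss equals |z|
print(bce_with_logits(1000.0, 1))   # confidently right: loss is zero
```

In this pure-Python setting `bce_naive(-1000.0, 1)` raises `OverflowError` and `bce_naive(1000.0, 0)` raises `ValueError` from `log(0)`, which are the two hazards described above.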

189.4.3 4.3 Class imbalance and weighting

Real classification problems are frequently imbalanced, with rare positives swamped by negatives. A standard remedy weights the positive term by a factor $w_+ > 1$:

\[ \ell(z, y) = -\big[\, w_+ \, y \log \hat{p} + (1 - y) \log (1 - \hat{p}) \,\big], \]

which rescales the gradient contribution of positive examples. A more aggressive alternative, the focal loss, multiplies the per example loss by $(1 - \hat{p}_t)^\gamma$, where $\hat{p}_t$ is the probability assigned to the true class. This factor shrinks the loss on easy, well classified examples and focuses optimization on hard ones, and it was introduced to train dense object detectors where the foreground to background ratio is extreme.
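
Both remedies can be sketched on top of a stable log-sigmoid. The code below is a simplified illustration: the function names and default values are assumptions, and the focal loss shown omits the additional class-balancing factor that the original paper combines with the modulating term.

```python
import math

def log_sigmoid(z):
    """Stable log(sigmoid(z)), equal to -softplus(-z)."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def weighted_bce(z, y, pos_weight=1.0):
    """BCE with the positive term scaled by pos_weight (w_+ in the text)."""
    # log(1 - sigmoid(z)) equals log_sigmoid(-z).
    return -(pos_weight * y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z))

def focal_loss(z, y, gamma=2.0):
    """Binary focal loss: (1 - p_t)^gamma times the cross-entropy."""
    log_pt = log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    pt = math.exp(log_pt)  # probability assigned to the true class
    return -((1.0 - pt) ** gamma) * log_pt

easy, hard = 4.0, -2.0  # logits for two positive examples (y = 1)
for name, z in [("easy", easy), ("hard", hard)]:
    ce = weighted_bce(z, 1)
    fl = focal_loss(z, 1)
    print(f"{name}: ce = {ce:.6f}, focal = {fl:.6f}, ratio = {fl / ce:.4f}")
```

With $\gamma = 0$ the focal loss reduces to plain cross-entropy, and the printed ratios show how strongly the easy example is down-weighted relative to the hard one.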

189.5 5. Label Smoothing

189.5.1 5.1 Motivation and definition

Hard one-hot targets push the model to drive the correct logit toward positive infinity relative to the others, because cross-entropy is minimized only in the limit of infinite confidence. This encourages overconfident predictions, large logit magnitudes, and poor calibration, where the predicted probability of the chosen class systematically exceeds its empirical accuracy. Label smoothing addresses this by replacing the one-hot target with a softened distribution that reserves a small amount of probability mass for the wrong classes. With smoothing parameter $\epsilon$ and $K$ classes, the smoothed target is

\[ q'(k) = (1 - \epsilon)\, \mathbb{1}[k = y] + \frac{\epsilon}{K}. \]

The true class receives $1 - \epsilon + \epsilon/K$ and every other class receives $\epsilon/K$. A typical value is $\epsilon = 0.1$. The training objective remains cross-entropy, now taken against $q'$:

\[ \ell_{\mathrm{LS}}(z, y) = -\sum_{k=1}^{K} q'(k) \log \mathrm{softmax}(z)_k. \]
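
The definition translates almost line for line into code. The following is a minimal per-example sketch with illustrative function names; it builds $q'$ explicitly, which is fine for small $K$ but is not how a fused implementation would do it.

```python
import math

def log_softmax(z):
    """Fused log-softmax: z_k - LSE(z)."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def smoothed_target(y, num_classes, eps):
    """q'(k) = (1 - eps) * 1[k == y] + eps / K."""
    return [(1.0 - eps) * (1.0 if k == y else 0.0) + eps / num_classes
            for k in range(num_classes)]

def label_smoothing_loss(z, y, eps):
    """Cross-entropy of the model against the smoothed target."""
    q = smoothed_target(y, len(z), eps)
    return -sum(qk * lp for qk, lp in zip(q, log_softmax(z)))

z, y = [2.0, 1.0, 0.1], 0
print([round(v, 4) for v in smoothed_target(y, 3, 0.1)])
print(round(label_smoothing_loss(z, y, 0.0), 6))  # hard-target loss
print(round(label_smoothing_loss(z, y, 0.1), 6))  # smoothed loss
```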

189.5.2 5.2 Effect on the optimum and on geometry

With smoothed targets the loss is no longer minimized by infinite logit gaps. Setting the gradient to zero, the optimal logits satisfy $\mathrm{softmax}(z)_k = q'(k)$, so the model is asked to predict the correct class with probability $1 - \epsilon + \epsilon/K$ rather than one, and the optimal logit gap between the correct and incorrect classes becomes a finite constant. This bounds the logit magnitudes and tends to improve calibration. Empirically, label smoothing also reshapes the learned representations: penultimate layer activations for examples of the same class cluster more tightly and at more equal distances from other class centroids, a geometric regularity that accompanies its accuracy and calibration benefits in image classification, machine translation, and speech recognition.
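
As a worked instance of the finite optimum, the ratio of two softmax outputs is $e^{z_y - z_k}$, so matching $q'$ fixes the gap between the correct logit and any incorrect one:

\[ z_y - z_k = \log \frac{q'(y)}{q'(k)} = \log \frac{1 - \epsilon + \epsilon/K}{\epsilon/K} = \log\!\Big(1 + \frac{K(1 - \epsilon)}{\epsilon}\Big), \qquad k \neq y. \]

For $\epsilon = 0.1$ this gives $\log 28 \approx 3.33$ when $K = 3$ and $\log 9001 \approx 9.11$ when $K = 1000$, so the target gap grows only logarithmically with the number of classes.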

189.5.3 5.3 A KL-divergence reading and caveats

Label smoothing can be viewed as adding a penalty that pulls the model toward the uniform distribution $u$. Decomposing the smoothed cross-entropy,

\[ \ell_{\mathrm{LS}} = (1 - \epsilon)\, H(q, p) + \epsilon \, H(u, p) = (1 - \epsilon)\, H(q, p) + \epsilon \big( D_{\mathrm{KL}}(u \,\|\, p) + H(u) \big), \]

so up to a constant the smoothing term is a KL divergence from the model to uniform, a confidence penalty that discourages peaked outputs. The technique is not universally beneficial. Because it deliberately removes information about relative incorrect class probabilities, label smoothing can degrade knowledge distillation, where a student is trained to match a teacher’s full soft distribution and therefore needs the very inter-class structure that smoothing erases. As with any regularizer, the smoothing strength is a hyperparameter to be validated rather than assumed.
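
The decomposition can be verified numerically for a single example. This sketch uses illustrative names and one row of logits; it computes the smoothed loss directly, as the mixture of two cross-entropies, and through the KL form with $H(u) = \log K$.

```python
import math

def log_softmax(z):
    """Fused log-softmax: z_k - LSE(z)."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy(q, log_p):
    """H(q, p) given the target q and the model's log-probabilities."""
    return -sum(qk * lp for qk, lp in zip(q, log_p))

z, y, eps = [0.5, 2.5, 0.3], 1, 0.1
K = len(z)
log_p = log_softmax(z)

one_hot = [1.0 if k == y else 0.0 for k in range(K)]
uniform = [1.0 / K] * K
smoothed = [(1 - eps) * q + eps * u for q, u in zip(one_hot, uniform)]

direct = cross_entropy(smoothed, log_p)
mixture = ((1 - eps) * cross_entropy(one_hot, log_p)
           + eps * cross_entropy(uniform, log_p))

# H(u, p) = D_KL(u || p) + H(u), with H(u) = log K.
kl_u_p = sum(u * (math.log(u) - lp) for u, lp in zip(uniform, log_p))
via_kl = (1 - eps) * cross_entropy(one_hot, log_p) + eps * (kl_u_p + math.log(K))

print(round(direct, 6), round(mixture, 6), round(via_kl, 6))
```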

189.6 6. Practical Guidance

A short checklist captures the operational consequences of the theory. Keep network outputs in logit space and let the loss function apply the softmax or sigmoid internally, so that the stabilized log-sum-exp and softplus forms are used. Choose softmax cross-entropy for mutually exclusive classes and summed binary cross-entropy with logits for independent multilabel outputs. Reach for positive weighting or focal loss when the class distribution is skewed. Apply label smoothing with a small $\epsilon$ such as $0.1$ to curb overconfidence and improve calibration, but reconsider it when training teachers for distillation. Finally, when a training run produces NaNs early, suspect an unstable hand rolled softmax before suspecting the data; when a run trains without errors but the loss stalls at an unexpectedly high value, check whether probabilities are being passed to a loss that expects logits.
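
The probabilities-instead-of-logits mistake is worth seeing once, because it fails silently. In the sketch below (illustrative names, a single confident and correct prediction) the softmax is applied twice. Since the inputs to the loss are then confined to $[0, 1]$, the loss cannot drop below $\log(1 + (K - 1)/e)$ however good the model is.

```python
import math

def softmax(z):
    """Stabilized softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(z, y):
    """Fused loss that expects raw logits: LSE(z) - z[y]."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[y]

logits = [12.0, -3.0, -5.0]  # confident, correct prediction for class 0
probs = softmax(logits)

correct = cross_entropy_from_logits(logits, 0)  # close to zero
buggy = cross_entropy_from_logits(probs, 0)     # probabilities treated as logits

# Lower bound on the loss when every input lies in [0, 1].
floor = math.log(1 + (len(logits) - 1) / math.e)

print(f"correct usage: {correct:.6f}")
print(f"double softmax: {buggy:.6f} (floor {floor:.6f})")
```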

189.7 7. A From-Scratch Implementation

The companion aiinaction libraries ship a small, stable softmax cross-entropy in all three languages of the book. The module ch184_softmax_ce exposes four functions that operate directly on a logit matrix of shape $(N, K)$: softmax and log_softmax (the stabilized link functions), cross_entropy_loss (the fused mean loss, with optional label_smoothing), and cross_entropy_grad (the predicted-minus-target gradient $(P - Q')/N$). Each call validates its inputs, subtracts the per-row maximum before exponentiating, and never takes the logarithm of a probability. The three implementations agree to floating-point tolerance on the shared fixtures below; the only surface difference is that Julia uses 1-based class indices while Python and Rust use 0-based ones.

The example below scores three samples over three classes. The gradient rows sum to zero, and the loss with label smoothing $\epsilon = 0.1$ is slightly larger than the hard-target loss, reflecting the confidence penalty.

from aiinaction.ch184_softmax_ce import (
    softmax, log_softmax, cross_entropy_loss, cross_entropy_grad,
)

# Three samples, three classes. Logits live in R^K and are unconstrained.
Z = [[2.0, 1.0, 0.1],
     [0.5, 2.5, 0.3],
     [1.0, 1.0, 1.0]]
labels = [0, 1, 2]  # 0-based true class per row

p = softmax(Z)
print("softmax row 0:        ", [round(v, 4) for v in p[0]])
print("rows sum to one:      ", [round(float(s), 6) for s in p.sum(axis=1)])

loss = cross_entropy_loss(Z, labels)
loss_ls = cross_entropy_loss(Z, labels, label_smoothing=0.1)
print(f"mean cross-entropy:    {loss:.6f}")
print(f"with label smoothing:  {loss_ls:.6f}")

g = cross_entropy_grad(Z, labels)
print("gradient (P - Q)/N:")
for row in g:
    print("  ", [round(float(v), 4) for v in row])
print("each gradient row sums to ~0:",
      [round(float(s), 12) for s in g.sum(axis=1)])

softmax row 0:         [np.float64(0.659), np.float64(0.2424), np.float64(0.0986)]
rows sum to one:       [1.0, 1.0, 1.0]
mean cross-entropy:    0.578564
with label smoothing:  0.657453
gradient (P - Q)/N:
   [-0.1137, 0.0808, 0.0329]
   [0.0362, -0.0658, 0.0296]
   [0.1111, 0.1111, -0.2222]
each gradient row sums to ~0: [0.0, -0.0, -0.0]

using AIInAction.Ch184SoftmaxCE

# Same fixture; Julia uses 1-based class indices.
Z = [2.0 1.0 0.1
     0.5 2.5 0.3
     1.0 1.0 1.0]
labels = [1, 2, 3]

p = softmax(Z)
println("softmax row 1: ", round.(p[1, :], digits=4))

loss    = cross_entropy_loss(Z, labels)
loss_ls = cross_entropy_loss(Z, labels; label_smoothing=0.1)
println("mean cross-entropy:   ", round(loss, digits=6))
println("with label smoothing: ", round(loss_ls, digits=6))

g = cross_entropy_grad(Z, labels)  # (P - Q)/N
println("gradient row sums: ", round.(vec(sum(g; dims=2)), digits=12))

use aiinaction::ch184_softmax_ce::{
    softmax, cross_entropy_loss, cross_entropy_grad,
};

fn main() {
    // Same fixture; Rust uses 0-based class indices.
    let z = vec![
        vec![2.0, 1.0, 0.1],
        vec![0.5, 2.5, 0.3],
        vec![1.0, 1.0, 1.0],
    ];
    let labels = [0usize, 1, 2];

    let p = softmax(&z).unwrap();
    println!("softmax row 0: {:?}", p[0]);

    let loss = cross_entropy_loss(&z, &labels, 0.0).unwrap();
    let loss_ls = cross_entropy_loss(&z, &labels, 0.1).unwrap();
    println!("mean cross-entropy:   {:.6}", loss);
    println!("with label smoothing: {:.6}", loss_ls);

    let g = cross_entropy_grad(&z, &labels, 0.0).unwrap(); // (P - Q)/N
    for row in &g {
        let s: f64 = row.iter().sum();
        println!("row sum {:.12}", s);
    }
}

189.8 8. Integer Labels versus One-Hot Encoding: Sparse versus Dense Cross-Entropy

The multiclass cross-entropy derived in Section 2 assumes the target $y$ is represented as a one-hot vector of length $K$. In practice, label encoding is a choice with numerical and memory implications.

189.8.1 8.1 The Two Representations

A one-hot target stores class $c$ as a vector $e_c \in \{0,1\}^K$ with a single non-zero entry. An integer target stores only the index $c \in \{0, 1, \ldots, K-1\}$. For $K = 1000$ classes, one-hot uses 1000 floats per example; the integer uses one. For language modeling with a vocabulary of 50,000 tokens, the one-hot tensor across a batch of 512 sequences of length 256 would have 512 × 256 × 50,000 ≈ 6.5 billion entries, about 26 GB in 32-bit floats, against roughly 131,000 integers for the index form.

The loss formula is identical in both cases. For class $c$ and predicted distribution $\hat{p}$,

\[ \mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} e_c(k)\, \log \hat{p}_k = -\log \hat{p}_c, \]

because $e_c$ zeros out every term except $k = c$. The integer representation exploits this sparsity directly: it indexes into the log-probability vector rather than materializing the full one-hot and computing a dot product. The math is the same; only the implementation differs.

189.8.2 8.2 Framework Implementations

PyTorch's two standard losses differ in the input they expect rather than in the label format:

nn.CrossEntropyLoss consumes raw logits together with integer labels of shape (N,), or (N, d_1, ..., d_k) for pixel-wise classification. It fuses log-softmax and NLL into one numerically stable operation. Recent versions also accept class-probability targets with the same shape as the logits.
nn.NLLLoss also consumes integer labels but expects log-probabilities as input (after F.log_softmax), giving manual control over the softmax.

Keras similarly offers two variants:

CategoricalCrossentropy expects one-hot targets of shape (N, K).
SparseCategoricalCrossentropy expects integer targets of shape (N,) and performs the same computation internally.

In both Keras losses, the from_logits parameter controls whether the model output is treated as raw logits (softmax applied internally, numerically stable) or as probabilities (softmax skipped). Having the model emit logits and setting from_logits=True is almost always preferable: it enables the fused log-sum-exp trick and avoids the redundant softmax → log round-trip.

import numpy as np

# Sparse (integer-label) and dense (one-hot) cross-entropy give the same value.
def log_softmax(z):
    # Subtract the row maximum before exponentiating (stabilized log-sum-exp).
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sparse_ce(logits, labels):
    # Index the log-probability of the true class directly.
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def dense_ce(logits, one_hot):
    # Dot each one-hot row with the log-probabilities.
    log_probs = log_softmax(logits)
    return -np.mean(np.sum(one_hot * log_probs, axis=-1))

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 5))   # batch=4, classes=5
labels = np.array([2, 0, 4, 1])        # integer labels
one_hot = np.eye(5)[labels]            # equivalent one-hot

print(f"sparse CE:  {sparse_ce(logits, labels):.6f}")
print(f"dense CE:   {dense_ce(logits, one_hot):.6f}")
print(f"difference: {abs(sparse_ce(logits, labels) - dense_ce(logits, one_hot)):.2e}")

sparse CE:  1.870146
dense CE:   1.870146
difference: 0.00e+00

189.8.3 8.3 When to Use Each

Use integer labels (sparse) when:

- The label vocabulary is large (NLP token prediction, image segmentation with many classes).
- Memory is a constraint.
- You are using PyTorch’s CrossEntropyLoss or Keras’s SparseCategoricalCrossentropy.

Use one-hot labels (dense) when:

- Labels are soft (label smoothing, knowledge distillation), requiring a non-integer distribution over classes. You cannot express $[0.9, 0.05, 0.05]$ as a single integer.
- The loss involves a sum over all classes, such as KL divergence between predicted and target distributions.
- The framework or loss function requires it explicitly.

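Label smoothing, discussed in Section 5, needs dense targets mathematically, because the smoothed label $q'(k) = (1-\epsilon)\,\mathbb{1}[k = c] + \epsilon/K$ is a distribution rather than a single index. In practice the smoothed distribution can still be built internally from integer labels and $\epsilon$, which is what the label_smoothing argument of cross_entropy_loss in Section 7 does. Distillation losses operate on arbitrary soft teacher distributions and genuinely require the dense form.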

189.8.4 8.4 The `reduction` Parameter

PyTorch's cross-entropy losses accept a reduction argument that controls how per-example losses are aggregated (Keras losses expose an analogous argument under different value names):

'mean' (default): divide the summed loss by the number of examples. Gradient scale is independent of batch size.
'sum': sum all per-example losses. Gradient scale grows with batch size; requires compensating learning rate adjustment.
'none': return the per-example loss vector. Useful for importance weighting: loss = ce(logits, labels, reduction='none'); weighted = (loss * weights).mean().

With a fixed batch size, 'mean' and 'sum' differ only by the constant factor $N$, which a learning rate adjustment can absorb. The choice matters more when batch sizes vary (variable-length sequences, online learning) or when sample-level weights are applied for class imbalance or curriculum learning.
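
The three reductions can be mimicked in a few lines of plain Python. This is an illustrative sketch, not a framework API: `per_example_ce`, `reduce_loss`, and the per-example `weights` are hypothetical names and values chosen for the example.

```python
import math

def per_example_ce(logits, labels):
    """Per-example softmax cross-entropy, i.e. reduction='none'."""
    losses = []
    for z, y in zip(logits, labels):
        m = max(z)
        lse = m + math.log(sum(math.exp(v - m) for v in z))
        losses.append(lse - z[y])
    return losses

def reduce_loss(losses, reduction="mean"):
    """Aggregate a list of per-example losses."""
    if reduction == "none":
        return losses
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unknown reduction: {reduction!r}")

logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.0, 1.0]]
labels = [0, 1, 2]
losses = per_example_ce(logits, labels)

mean_loss = reduce_loss(losses, "mean")
sum_loss = reduce_loss(losses, "sum")

# Importance weighting starts from the unreduced vector.
weights = [1.0, 1.0, 2.0]  # hypothetical per-example weights
weighted = sum(w * l for w, l in zip(weights, losses)) / len(losses)

print(f"mean = {mean_loss:.6f}, sum = {sum_loss:.6f}, weighted = {weighted:.6f}")
```

The sum is exactly $N$ times the mean, which is why switching between the two without rescaling the learning rate changes the effective step size.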

189.9 9. Summary

Cross-entropy is the maximum likelihood objective for classification, and its two principal forms, softmax cross-entropy for multiclass problems and binary cross-entropy with logits for binary and multilabel problems, both produce the clean predicted-minus-target gradient that drives stable learning. The mathematical expressions hide numerical landmines that the shift trick for log-sum-exp and the absolute value form of softplus defuse, which is why fused logit-consuming loss functions are the norm. The sparse/dense distinction between integer and one-hot labels is purely an implementation concern with identical mathematics, but the choice matters practically: use integer labels for large vocabularies and dense one-hot labels when soft targets are needed for label smoothing or distillation. Label smoothing trades a small amount of confidence for better calibration and more regular representations, at the cost of fine grained inter-class information that some downstream uses still require. Understanding these objectives at the level of their gradients and their floating point behavior is what separates a model that trains from one that diverges.

189.10 References

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, chapter 6. MIT Press, 2016. https://www.deeplearningbook.org/
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception Architecture for Computer Vision. CVPR, 2016. https://arxiv.org/abs/1512.00567
Mueller, R., Kornblith, S., and Hinton, G. When Does Label Smoothing Help? NeurIPS, 2019. https://arxiv.org/abs/1906.02629
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal Loss for Dense Object Detection. ICCV, 2017. https://arxiv.org/abs/1708.02002
PyTorch Documentation. torch.nn.functional.cross_entropy and binary_cross_entropy_with_logits. https://pytorch.org/docs/stable/nn.functional.html
Blanchard, P., Higham, D. J., and Higham, N. J. Accurately Computing the Log-Sum-Exp and Softmax Functions. IMA Journal of Numerical Analysis, 2021. https://doi.org/10.1093/imanum/draa038

# Loss Functions for Classification in Neural Networks Classification is the workhorse task of modern deep learning, and the choice of loss function determines what a network actually optimizes. This chapter develops the theory and numerical practice of the loss functions that dominate classification: the softmax cross-entropy for multiclass problems, the binary cross-entropy with logits for multilabel and binary problems, and the regularization technique of label smoothing. We pay close attention to numerical stability, because the naive mathematical forms of these objectives overflow and underflow in finite precision arithmetic, and the fused implementations used in practice differ substantially from the textbook equations. ## 1. From Probabilistic Modeling to Cross-Entropy A classifier with parameters $\theta$ defines a conditional distribution $p_\theta(y \mid x)$ over labels $y$ given an input $x$. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ drawn from an unknown data distribution, maximum likelihood estimation seeks the parameters that maximize the probability of the observed labels. Taking logarithms and negating turns the product over examples into a sum to be minimized: $$ \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i). $$ This is the average negative log likelihood, and it is identical to the empirical cross-entropy between the data distribution and the model. To see the connection, let $q_i$ denote the one-hot target distribution that places all mass on the true class $y_i$, and let $p_i$ denote the model distribution over classes. The cross-entropy is $$ H(q_i, p_i) = -\sum_{k} q_i(k) \log p_i(k) = -\log p_i(y_i), $$ because $q_i$ is zero everywhere except at $k = y_i$. Minimizing cross-entropy is therefore equivalent to maximum likelihood. There is a complementary information-theoretic reading. 
Cross-entropy decomposes as $$ H(q, p) = H(q) + D_{\mathrm{KL}}(q \,\|\, p), $$ where $H(q)$ is the entropy of the target distribution and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. Since $H(q)$ does not depend on $\theta$, minimizing cross-entropy minimizes the KL divergence from the model to the targets. The optimum is reached when $p$ matches $q$, which formalizes the intuition that we want the model to reproduce the labeling. ## 2. The Softmax and Multiclass Cross-Entropy ### 2.1 The softmax link function A neural network for $K$ class classification produces a vector of real valued scores $z = (z_1, \dots, z_K)$, called logits, which live in $\mathbb{R}^K$ and are unconstrained. The softmax function maps logits to a probability simplex: $$ \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}. $$ The outputs are positive and sum to one, so they constitute a valid distribution. The softmax is invariant to adding a constant to every logit, since $\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)$. This shift invariance is both the source of a numerical trick and a reminder that the logits are identified only up to an additive constant. ### 2.2 The loss and its gradient Combining the softmax with cross-entropy, the per-example loss for true class $y$ is $$ \ell(z, y) = -\log \mathrm{softmax}(z)_y = -z_y + \log \sum_{j=1}^{K} e^{z_j}. $$ The second term is the log-sum-exp function, written $\mathrm{LSE}(z)$. The loss is thus the gap between the log-sum-exp of all logits and the logit of the correct class. The gradient with respect to the logits is remarkably clean. Writing $p = \mathrm{softmax}(z)$, $$ \frac{\partial \ell}{\partial z_k} = p_k - \mathbb{1}[k = y], $$ which in vector form is $p - q$, the difference between the predicted distribution and the one-hot target. 
This is the same elegant form that appears in logistic and linear regression with their canonical link functions, and it is no coincidence: the softmax is the canonical link for the categorical distribution in the exponential family, and the predicted-minus-target gradient is a general property of such models. Because the gradient is bounded in magnitude by one per coordinate and never saturates to zero when the prediction is wrong, the softmax cross-entropy supplies strong learning signal even when the model is confidently incorrect, which is one reason it is preferred over a mean squared error applied to softmax outputs. ### 2.3 Deriving the gradient The clean predicted-minus-target form is worth deriving explicitly, because the cancellation that produces it is the reason the loss is so well behaved. Start from the Jacobian of the softmax. Writing $p_k = \mathrm{softmax}(z)_k$ and differentiating $p_k = e^{z_k} / \sum_j e^{z_j}$, $$ \frac{\partial p_k}{\partial z_i} = p_k\,(\delta_{ki} - p_i), $$ where $\delta_{ki}$ is the Kronecker delta. The diagonal terms $i = k$ give $p_k(1 - p_k)$ and the off-diagonal terms give $-p_k p_i$, so the softmax Jacobian is $\mathrm{diag}(p) - p p^\top$, a symmetric positive semidefinite matrix. Now differentiate the loss $\ell = -\log p_y$ through this Jacobian: $$ \frac{\partial \ell}{\partial z_i} = -\frac{1}{p_y}\frac{\partial p_y}{\partial z_i} = -\frac{1}{p_y}\,p_y\,(\delta_{yi} - p_i) = p_i - \delta_{yi} = p_i - q_i. $$ The factor $1/p_y$ cancels exactly against the $p_y$ from the Jacobian, which is precisely why the softmax and the logarithm are paired: the log undoes the normalization that the softmax introduces, leaving a residual that depends on the prediction error alone. 
The same calculation done directly on the fused form $\ell = \mathrm{LSE}(z) - z_y$ is shorter still, since $\partial \mathrm{LSE}/\partial z_i = e^{z_i}/\sum_j e^{z_j} = p_i$ and $\partial(-z_y)/\partial z_i = -\delta_{yi}$, giving $p_i - q_i$ immediately. For a batch of $N$ examples the averaged loss has gradient $\tfrac{1}{N}\sum_i (p_i - q_i)$ per example, that is the matrix $(P - Q)/N$ whose rows are the per-example residuals. Under label smoothing the only change is that the hard target $q$ becomes the smoothed $q'$, so the gradient is $(P - Q')/N$; every row still sums to zero because $p$ and $q'$ are both distributions, which is the discrete analogue of the constraint that gradients of a normalized objective are tangent to the simplex. ## 3. Numerical Stability of Log-Sum-Exp ### 3.1 Why the naive form fails The expression $\log \sum_j e^{z_j}$ is dangerous in floating point. If any logit is large, say $z_j = 1000$, then $e^{z_j}$ overflows to infinity in IEEE 754 double precision, whose largest finite value is near $1.8 \times 10^{308}$, corresponding to an exponent argument of about $709$. In single precision the threshold is near $88$. Conversely, if all logits are very negative, every exponential underflows to zero, the sum is zero, and the logarithm returns negative infinity. Either way the result is a NaN or an infinity that poisons the backward pass. ### 3.2 The shift trick The shift invariance of the softmax provides the cure. Let $m = \max_j z_j$. Then $$ \mathrm{LSE}(z) = m + \log \sum_{j=1}^{K} e^{z_j - m}. $$ After subtracting the maximum, the largest exponent argument is zero, so the largest term is exactly one and cannot overflow. At least one term in the sum equals one, so the sum is at least one and its logarithm is finite and well defined. Smaller terms may underflow to zero, but those terms were negligible anyway, so the relative error is tiny. This stabilized log-sum-exp is the foundation of every production softmax implementation. 

```text
# stable log-sum-exp
m = max(z)
lse = m + log(sum(exp(z - m)))
loss = lse - z[y]
```

### 3.3 Fused softmax cross-entropy

Practitioners rarely compute softmax probabilities and then take their logarithm, because the two-step $\log(\mathrm{softmax}(z))$ does redundant work and loses precision: a probability that underflows to zero becomes $-\infty$ under the logarithm. Instead the loss is computed directly from logits in a single fused kernel using the identity

$$
\log \mathrm{softmax}(z)_k = z_k - \mathrm{LSE}(z).
$$

This is why deep learning libraries expose an operation that consumes raw logits rather than probabilities. In PyTorch the function `torch.nn.functional.cross_entropy` expects logits and internally applies a stabilized log-softmax; passing it already-normalized probabilities is a common and silent bug. The general rule is to keep the network output in logit space for as long as possible and to fold the softmax into the loss.

## 4. Binary and Multilabel Classification

### 4.1 The sigmoid and binary cross-entropy

When there are two classes, or when each of several labels is independently present or absent, the relevant link is the logistic sigmoid:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(z) \in (0, 1).
$$

The sigmoid is the two-class special case of the softmax applied to the logit difference. For a binary target $y \in \{0, 1\}$ with predicted probability $\hat{p} = \sigma(z)$, the binary cross-entropy loss is

$$
\ell(z, y) = -\big[\, y \log \hat{p} + (1 - y) \log (1 - \hat{p}) \,\big].
$$

Multilabel classification, where an input may carry several labels at once, treats each of the $C$ outputs as an independent Bernoulli problem and sums the binary cross-entropy over labels. This differs fundamentally from multiclass softmax, which couples the outputs through a single normalization and enforces that exactly one class is present.

### 4.2 Binary cross-entropy with logits

The same numerical hazards reappear.
Computing $\sigma(z)$ and then its logarithm overflows when $z$ is large and negative, because $e^{-z}$ explodes, and it produces $\log 0$ when $\sigma(z)$ saturates to zero or one. The stable formulation substitutes the sigmoid and simplifies. Using $\log \sigma(z) = z - \log(1 + e^{z})$ and $\log(1 - \sigma(z)) = -\log(1 + e^{z})$, the binary cross-entropy becomes $\ell(z, y) = \log(1 + e^{z}) - z \cdot y$, and writing the softplus term in its stable form gives

$$
\ell(z, y) = \max(z, 0) - z \cdot y + \log\!\big(1 + e^{-|z|}\big).
$$

This rearrangement, used by the function commonly named binary cross-entropy with logits, never exponentiates a positive number, since the argument $-|z|$ is always nonpositive, so it cannot overflow, and the $\max(z, 0)$ term carries the large magnitude behavior exactly. The derivation uses the identity $\log(1 + e^{z}) = \max(z, 0) + \log(1 + e^{-|z|})$, which is the softplus function written in a stable way.

```text
# stable binary cross-entropy from logit z and target y
loss = max(z, 0) - z * y + log(1 + exp(-abs(z)))
```

In PyTorch this is `binary_cross_entropy_with_logits`; in TensorFlow it is `sigmoid_cross_entropy_with_logits`. As with the multiclass case, the lesson is to pass logits, not probabilities, so that the library can apply the stable form.

### 4.3 Class imbalance and weighting

Real classification problems are frequently imbalanced, with rare positives swamped by negatives. A standard remedy weights the positive term by a factor $w_+ > 1$:

$$
\ell(z, y) = -\big[\, w_+ \, y \log \hat{p} + (1 - y) \log (1 - \hat{p}) \,\big],
$$

which rescales the gradient contribution of positive examples. A more aggressive alternative, the focal loss, multiplies the per-example loss by $(1 - \hat{p}_t)^\gamma$, where $\hat{p}_t$ is the probability assigned to the true class and $\gamma \ge 0$ is a focusing parameter. This factor shrinks the loss on easy, well-classified examples and focuses optimization on hard ones, and it was introduced to train dense object detectors where the foreground to background ratio is extreme.

## 5. Label Smoothing

### 5.1 Motivation and definition

Hard one-hot targets push the model to drive the correct logit toward positive infinity relative to the others, because cross-entropy is minimized only in the limit of infinite confidence. This encourages overconfident predictions, large logit magnitudes, and poor calibration, where the predicted probability of the chosen class systematically exceeds its empirical accuracy. Label smoothing addresses this by replacing the one-hot target with a softened distribution that reserves a small amount of probability mass for the wrong classes. With smoothing parameter $\epsilon$ and $K$ classes, the smoothed target is

$$
q'(k) = (1 - \epsilon)\, \mathbb{1}[k = y] + \frac{\epsilon}{K}.
$$

The true class receives $1 - \epsilon + \epsilon/K$ and every other class receives $\epsilon/K$. A typical value is $\epsilon = 0.1$. The training objective remains cross-entropy, now taken against $q'$:

$$
\ell_{\mathrm{LS}}(z, y) = -\sum_{k=1}^{K} q'(k) \log \mathrm{softmax}(z)_k.
$$

### 5.2 Effect on the optimum and on geometry

With smoothed targets the loss is no longer minimized by infinite logit gaps. Setting the gradient to zero, the optimal logits satisfy $\mathrm{softmax}(z)_k = q'(k)$, so the model is asked to predict the correct class with probability $1 - \epsilon + \epsilon/K$ rather than one, and the optimal logit gap between the correct class and any incorrect class becomes the finite constant $\log\big(1 + K(1 - \epsilon)/\epsilon\big)$. This bounds the logit magnitudes and tends to improve calibration. Empirically, label smoothing also reshapes the learned representations: penultimate layer activations for examples of the same class cluster more tightly and at more equal distances from other class centroids, a geometric regularity that accompanies its accuracy and calibration benefits in image classification, machine translation, and speech recognition.
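
The claim about the finite optimum can be verified directly. The sketch below is a small NumPy illustration under assumed values ($K = 10$, $\epsilon = 0.1$, true class 3); the helper names `smoothed_target` and `smoothed_loss` are hypothetical and not part of any library. It builds $q'$, takes $z^\star = \log q'$ as one optimal logit vector (any constant shift of it is equally optimal), and checks that the gradient vanishes and the logit gap is finite.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def smoothed_target(y, K, eps):
    # q'(k) = (1 - eps) * 1[k == y] + eps / K
    q = np.full(K, eps / K)
    q[y] += 1.0 - eps
    return q

def smoothed_loss(z, y, eps):
    # Cross-entropy against q', computed from a stabilized log-softmax.
    m = np.max(z)
    log_p = z - (m + np.log(np.sum(np.exp(z - m))))
    return -np.sum(smoothed_target(y, len(z), eps) * log_p)

K, y, eps = 10, 3, 0.1           # illustrative values (assumptions)
q_s = smoothed_target(y, K, eps)

z_star = np.log(q_s)             # softmax(z_star) equals q' exactly
grad = softmax(z_star) - q_s     # predicted minus smoothed target

gap = z_star[y] - z_star[(y + 1) % K]
gap_formula = np.log(1.0 + K * (1.0 - eps) / eps)

# Pushing the true logit higher than the optimum increases the loss.
z_over = z_star.copy()
z_over[y] += 5.0

print("max |gradient| at optimum:", np.max(np.abs(grad)))
print("optimal logit gap:", gap, "formula:", gap_formula)
print("loss at optimum:", smoothed_loss(z_star, y, eps))
print("loss when overconfident:", smoothed_loss(z_over, y, eps))
```

With hard targets the second loss would be the smaller one; with smoothing, overshooting the optimal gap is penalized.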

### 5.3 A KL-divergence reading and caveats

Label smoothing can be viewed as adding a penalty that pulls the model toward the uniform distribution $u$. Decomposing the smoothed cross-entropy,

$$
\ell_{\mathrm{LS}} = (1 - \epsilon)\, H(q, p) + \epsilon \, H(u, p) = (1 - \epsilon)\, H(q, p) + \epsilon \big( D_{\mathrm{KL}}(u \,\|\, p) + H(u) \big),
$$

so up to a constant the smoothing term is a KL divergence from the model to uniform, a confidence penalty that discourages peaked outputs.

The technique is not universally beneficial. Because it deliberately removes information about relative incorrect class probabilities, label smoothing can degrade knowledge distillation, where a student is trained to match a teacher's full soft distribution and therefore needs the very inter-class structure that smoothing erases. As with any regularizer, the smoothing strength is a hyperparameter to be validated rather than assumed.

## 6. Practical Guidance

A short checklist captures the operational consequences of the theory.

- Keep network outputs in logit space and let the loss function apply the softmax or sigmoid internally, so that the stabilized log-sum-exp and softplus forms are used.
- Choose softmax cross-entropy for mutually exclusive classes and summed binary cross-entropy with logits for independent multilabel outputs.
- Reach for positive weighting or focal loss when the class distribution is skewed.
- Apply label smoothing with a small $\epsilon$ such as $0.1$ to curb overconfidence and improve calibration, but reconsider it when training teachers for distillation.
- When a training run produces NaNs early, suspect an unstable hand-rolled softmax or a loss fed probabilities instead of logits before suspecting the data.

## 7. A From-Scratch Implementation

The companion `aiinaction` libraries ship a small, stable softmax cross-entropy in all three languages of the book.
The module `ch184_softmax_ce` exposes four functions that operate directly on a logit matrix of shape $(N, K)$: `softmax` and `log_softmax` (the stabilized link functions), `cross_entropy_loss` (the fused mean loss, with optional `label_smoothing`), and `cross_entropy_grad` (the predicted-minus-target gradient $(P - Q')/N$). Each call validates its inputs, subtracts the per-row maximum before exponentiating, and never takes the logarithm of a probability. The three implementations agree to floating-point tolerance on the shared fixtures below; the only surface difference is that Julia uses 1-based class indices while Python and Rust use 0-based ones.

The example below scores three samples over three classes. The gradient rows sum to zero, and the loss with label smoothing $\epsilon = 0.1$ is slightly larger than the hard-target loss, reflecting the confidence penalty.

::: {.panel-tabset}

## Python

```{python}
from aiinaction.ch184_softmax_ce import (
    softmax, log_softmax, cross_entropy_loss, cross_entropy_grad,
)

# Three samples, three classes. Logits live in R^K and are unconstrained.
Z = [[2.0, 1.0, 0.1],
     [0.5, 2.5, 0.3],
     [1.0, 1.0, 1.0]]
labels = [0, 1, 2]  # 0-based true class per row

p = softmax(Z)
print("softmax row 0: ", [round(v, 4) for v in p[0]])
print("rows sum to one: ", [round(float(s), 6) for s in p.sum(axis=1)])

loss = cross_entropy_loss(Z, labels)
loss_ls = cross_entropy_loss(Z, labels, label_smoothing=0.1)
print(f"mean cross-entropy: {loss:.6f}")
print(f"with label smoothing: {loss_ls:.6f}")

g = cross_entropy_grad(Z, labels)
print("gradient (P - Q)/N:")
for row in g:
    print(" ", [round(float(v), 4) for v in row])
print("each gradient row sums to ~0:", [round(float(s), 12) for s in g.sum(axis=1)])
```

## Julia

```julia
using AIInAction.Ch184SoftmaxCE

# Same fixture; Julia uses 1-based class indices.
Z = [2.0 1.0 0.1
     0.5 2.5 0.3
     1.0 1.0 1.0]
labels = [1, 2, 3]

p = softmax(Z)
println("softmax row 1: ", round.(p[1, :], digits=4))

loss = cross_entropy_loss(Z, labels)
loss_ls = cross_entropy_loss(Z, labels; label_smoothing=0.1)
println("mean cross-entropy: ", round(loss, digits=6))
println("with label smoothing: ", round(loss_ls, digits=6))

g = cross_entropy_grad(Z, labels)  # (P - Q)/N
println("gradient row sums: ", round.(vec(sum(g; dims=2)), digits=12))
```

## Rust

```rust
use aiinaction::ch184_softmax_ce::{
    softmax, cross_entropy_loss, cross_entropy_grad,
};

fn main() {
    // Same fixture; Rust uses 0-based class indices.
    let z = vec![
        vec![2.0, 1.0, 0.1],
        vec![0.5, 2.5, 0.3],
        vec![1.0, 1.0, 1.0],
    ];
    let labels = [0usize, 1, 2];

    let p = softmax(&z).unwrap();
    println!("softmax row 0: {:?}", p[0]);

    let loss = cross_entropy_loss(&z, &labels, 0.0).unwrap();
    let loss_ls = cross_entropy_loss(&z, &labels, 0.1).unwrap();
    println!("mean cross-entropy: {:.6}", loss);
    println!("with label smoothing: {:.6}", loss_ls);

    let g = cross_entropy_grad(&z, &labels, 0.0).unwrap(); // (P - Q)/N
    for row in &g {
        let s: f64 = row.iter().sum();
        println!("row sum {:.12}", s);
    }
}
```

:::

## 8. Integer Labels versus One-Hot Encoding: Sparse versus Dense Cross-Entropy

The multiclass cross-entropy derived in Section 2 assumes the target $y$ is represented as a one-hot vector of length $K$. In practice, label encoding is a choice with numerical and memory implications.

### 8.1 The Two Representations

A one-hot target stores class $c$ as a vector $e_c \in \{0,1\}^K$ with a single non-zero entry. An integer target stores only the index $c \in \{0, 1, \ldots, K-1\}$. For $K = 1000$ classes, one-hot uses 1000 floats per example; the integer uses one. For language modeling with a vocabulary of 50,000 tokens, the one-hot tensor across a batch of 512 sequences of length 256 would be 512 × 256 × 50,000 ≈ 6.5 billion entries, roughly 26 GB in 32-bit floats for the targets alone.
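
The memory arithmetic is easy to reproduce. This short sketch uses the batch shape from the paragraph above; the element sizes (4-byte floats for one-hot targets, 8-byte integers for index targets) are assumptions about a typical setup, not a statement about any particular framework.

```python
# Shapes taken from the language-modeling example in the text.
batch, seq_len, vocab = 512, 256, 50_000

# Assumed element sizes: float32 one-hot targets, int64 index targets.
bytes_per_float32 = 4
bytes_per_int64 = 8

one_hot_entries = batch * seq_len * vocab
one_hot_gb = one_hot_entries * bytes_per_float32 / 1e9

index_entries = batch * seq_len
index_mb = index_entries * bytes_per_int64 / 1e6

print(f"one-hot entries: {one_hot_entries:,} ({one_hot_gb:.1f} GB)")
print(f"integer entries: {index_entries:,} ({index_mb:.2f} MB)")
print(f"ratio of entries: {one_hot_entries // index_entries:,}x")
```

The ratio of entry counts is exactly the vocabulary size, which is why the gap widens as the number of classes grows.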
The loss formula is identical in both cases. For class $c$ and predicted distribution $\hat{p}$,

$$
\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} e_c(k)\, \log \hat{p}_k = -\log \hat{p}_c,
$$

because $e_c$ zeros out every term except $k = c$. The integer representation exploits this sparsity directly: it indexes into the log-probability vector rather than materializing the full one-hot and computing a dot product. The math is the same; only the implementation differs.

### 8.2 Framework Implementations

PyTorch's two standard losses both take integer labels and differ in what they expect as input:

- `nn.CrossEntropyLoss` consumes **integer** labels of shape `(N,)`, or `(N, d_1, ..., d_K)` for pixel-wise classification (here `K` counts extra dimensions, not classes). It fuses softmax and NLL into one numerically stable operation. Recent PyTorch releases also accept class-probability targets with the same shape as the logits.
- `nn.NLLLoss` also consumes integer labels but expects **log-probabilities** as input (after `F.log_softmax`), giving manual control over the softmax.

Keras similarly offers two variants:

- `CategoricalCrossentropy` expects **one-hot** targets of shape `(N, K)`.
- `SparseCategoricalCrossentropy` expects **integer** targets of shape `(N,)` and performs the same computation internally.

In both Keras losses, the `from_logits` parameter controls whether the model output is treated as raw logits (apply softmax internally, numerically stable) or as probabilities (skip softmax). Setting `from_logits=True` is almost always preferable, provided the final layer emits raw logits: it enables the fused log-sum-exp trick and avoids the redundant `softmax → log` round-trip.

```{python}
import numpy as np

# Demonstrate SparseCategoricalCrossentropy vs CategoricalCrossentropy equivalence

def log_softmax(z):
    # Stabilized: subtract the row maximum before exponentiating,
    # and never take the log of a probability.
    m = z.max(axis=-1, keepdims=True)
    return z - m - np.log(np.exp(z - m).sum(axis=-1, keepdims=True))

def sparse_ce(logits, labels):
    # Index the log-probability of the true class directly.
    log_p = log_softmax(logits)
    return -np.mean(log_p[np.arange(len(labels)), labels])

def dense_ce(logits, one_hot):
    # Dot product of each one-hot row with the log-probabilities.
    log_p = log_softmax(logits)
    return -np.mean(np.sum(one_hot * log_p, axis=-1))

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 5))  # batch=4, classes=5
labels = np.array([2, 0, 4, 1])       # integer labels
one_hot = np.eye(5)[labels]           # equivalent one-hot

s = sparse_ce(logits, labels)
d = dense_ce(logits, one_hot)
print(f"sparse CE: {s:.6f}")
print(f"dense CE: {d:.6f}")
print(f"difference: {abs(s - d):.2e}")
```

### 8.3 When to Use Each

Use **integer labels** (sparse) when:

- The label vocabulary is large (NLP token prediction, image segmentation with many classes).
- Memory is a constraint.
- You are using PyTorch's `CrossEntropyLoss` or Keras's `SparseCategoricalCrossentropy`.

Use **one-hot labels** (dense) when:

- Labels are soft (label smoothing, knowledge distillation), requiring a non-integer distribution over classes. You cannot express $[0.9, 0.05, 0.05]$ as a single integer.
- The loss involves a sum over all classes, such as KL divergence between predicted and target distributions.
- The framework or loss function requires it explicitly.

Label smoothing, discussed in Section 5, is defined on dense targets because the smoothed labels $q'(k) = (1-\epsilon)\,\mathbb{1}[k = c] + \epsilon/K$ are not a valid integer encoding. A loss function can still accept integer labels and construct the smoothed distribution internally, as the `label_smoothing` argument of `cross_entropy_loss` in Section 7 does. Distillation losses operate on arbitrary soft teacher distributions and do require the dense form.
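
To illustrate the last point, the sketch below computes a label-smoothed loss two ways: from an explicit dense target matrix, and from integer labels alone using the decomposition $(1 - \epsilon)\,H(q, p) + \epsilon\,H(u, p)$ from Section 5.3. The function names and the random fixture are assumptions for illustration; this is not the companion library's implementation.

```python
import numpy as np

def log_softmax(z):
    # Stabilized log-softmax along the last axis.
    m = z.max(axis=-1, keepdims=True)
    return z - (m + np.log(np.exp(z - m).sum(axis=-1, keepdims=True)))

def dense_ce(logits, targets):
    # Cross-entropy against an arbitrary (N, K) target distribution.
    return -np.mean(np.sum(targets * log_softmax(logits), axis=-1))

def smoothed_ce_from_indices(logits, labels, eps):
    # Same loss without materializing the (N, K) target matrix.
    log_p = log_softmax(logits)
    nll = -log_p[np.arange(len(labels)), labels]  # H(q, p) per example
    uniform = -log_p.mean(axis=-1)                # H(u, p) per example
    return np.mean((1.0 - eps) * nll + eps * uniform)

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 5))   # batch=4, classes=5 (illustrative)
labels = np.array([2, 0, 4, 1])
eps = 0.1
K = logits.shape[1]

# Dense smoothed targets: q' = (1 - eps) * one_hot + eps / K
soft = (1.0 - eps) * np.eye(K)[labels] + eps / K

loss_dense = dense_ce(logits, soft)
loss_sparse = smoothed_ce_from_indices(logits, labels, eps)

print(f"dense smoothed CE:  {loss_dense:.6f}")
print(f"sparse smoothed CE: {loss_sparse:.6f}")
print(f"difference:         {abs(loss_dense - loss_sparse):.2e}")
```

The index-based route keeps the memory advantage of integer labels while producing the same value, which matters most when $K$ is large.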

### 8.4 The `reduction` Parameter

Both sparse and dense cross-entropy functions accept a `reduction` argument that controls how per-example losses are aggregated. The values below follow PyTorch's naming:

- `'mean'` (default): divide the summed loss by the number of examples. Gradient scale is independent of batch size.
- `'sum'`: sum all per-example losses. Gradient scale grows with batch size; requires compensating learning rate adjustment.
- `'none'`: return the per-example loss vector. Useful for importance weighting: `loss = ce(logits, labels, reduction='none'); weighted = (loss * weights).mean()`.

With fixed-size batches, `'mean'` and `'sum'` differ only by a constant factor that can be absorbed into the learning rate. The choice matters when batch sizes vary (variable-length sequences, online learning) or when sample-level weights are applied for class imbalance or curriculum learning.

## 9. Summary

Cross-entropy is the maximum likelihood objective for classification, and its two principal forms, softmax cross-entropy for multiclass problems and binary cross-entropy with logits for binary and multilabel problems, both produce the clean predicted-minus-target gradient that drives stable learning. The mathematical expressions hide numerical landmines that the shift trick for log-sum-exp and the absolute value form of softplus defuse, which is why fused logit-consuming loss functions are the norm. The sparse/dense distinction between integer and one-hot labels is purely an implementation concern with identical mathematics, but the choice matters practically: use integer labels for large vocabularies and dense labels when arbitrary soft targets are needed, as in distillation. Label smoothing trades a small amount of confidence for better calibration and more regular representations, at the cost of fine-grained inter-class information that some downstream uses still require.
Understanding these objectives at the level of their gradients and their floating point behavior is what separates a model that trains from one that diverges.

## References

1. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, chapter 6. MIT Press, 2016. https://www.deeplearningbook.org/
2. Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
3. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception Architecture for Computer Vision. CVPR, 2016. https://arxiv.org/abs/1512.00567
4. Mueller, R., Kornblith, S., and Hinton, G. When Does Label Smoothing Help? NeurIPS, 2019. https://arxiv.org/abs/1906.02629
5. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal Loss for Dense Object Detection. ICCV, 2017. https://arxiv.org/abs/1708.02002
6. PyTorch Documentation. torch.nn.functional.cross_entropy and binary_cross_entropy_with_logits. https://pytorch.org/docs/stable/nn.functional.html
7. Blanchard, P., Higham, D. J., and Higham, N. J. Accurately Computing the Log-Sum-Exp and Softmax Functions. IMA Journal of Numerical Analysis, 2021. https://doi.org/10.1093/imanum/draa038
