42 Cross-Entropy and the Cross-Entropy Loss

Cross-entropy is one of the load-bearing ideas of modern machine learning. It measures the cost of encoding data drawn from one distribution using a code optimized for another, and through that lens it becomes the natural objective for training classifiers and language models. This chapter develops cross-entropy from its information-theoretic roots, shows how it decomposes into entropy plus the Kullback-Leibler divergence, derives the cross-entropy loss as the negative log likelihood that maximum likelihood estimation minimizes for categorical data, and works through the practical machinery that makes it usable: the softmax function, its numerically stable implementation, and label smoothing.

42.1 1. Entropy and the Cost of Encoding

42.1.1 1.1 Shannon entropy

Let $p$ be a probability distribution over a finite set of outcomes $\mathcal{X} = \{x_1, \dots, x_K\}$. The Shannon entropy of $p$ is

\[ H(p) = -\sum_{k=1}^{K} p(x_k) \log p(x_k). \]

When the logarithm is base 2 the result is measured in bits, and when it is the natural logarithm the unit is nats. Machine learning almost always uses the natural logarithm because its derivative is simply $1/x$, so gradients carry no stray constant factors, and we adopt $\log = \ln$ throughout unless stated otherwise.

Entropy quantifies the average uncertainty in $p$. By Shannon’s source coding theorem, $H(p)$ is also the minimum expected number of nats per symbol needed to encode a stream of independent draws from $p$, achievable in the limit by an optimal code that assigns roughly $-\log p(x_k)$ nats to outcome $x_k$. Rare events get long codewords, common events get short ones, and the entropy is the weighted average codeword length.

The optimal codeword length follows from a short variational argument. We want to choose nonnegative codeword lengths $\ell_k$, treated as real numbers rather than integers, that minimize the expected length $\sum_k p(x_k) \ell_k$ subject to the Kraft constraint $\sum_k e^{-\ell_k} \le 1$ in nats. Forming the Lagrangian $\mathcal{J} = \sum_k p(x_k) \ell_k + \lambda \big(\sum_k e^{-\ell_k} - 1\big)$ and setting $\partial \mathcal{J} / \partial \ell_k = p(x_k) - \lambda e^{-\ell_k} = 0$ gives $\ell_k = -\log\!\big(p(x_k)/\lambda\big)$. Enforcing the constraint with equality, $\sum_k p(x_k)/\lambda = 1$, forces $\lambda = 1$, so the optimal length is $\ell_k^\star = -\log p(x_k)$ and the optimal expected length is exactly $\sum_k p(x_k) \ell_k^\star = H(p)$. Since $p(x_k) \le 1$, these lengths are automatically nonnegative. This is why $-\log p(x_k)$ is the natural measure of surprise.
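The argument can be checked numerically. The sketch below is a sanity check rather than a proof, and the distribution `p` is an arbitrary example chosen for illustration: the lengths $-\log p(x_k)$ meet the Kraft constraint with equality, and their expected value equals $H(p)$.

```python
import numpy as np

# Example distribution; the values are illustrative, not taken from the text.
p = np.array([0.5, 0.25, 0.125, 0.125])

# Optimal real-valued code lengths, in nats.
lengths = -np.log(p)

kraft_sum = np.sum(np.exp(-lengths))    # Kraft constraint, met with equality
expected_length = np.sum(p * lengths)   # average cost of the optimal code
entropy = -np.sum(p * np.log(p))        # H(p)

# Lengths that would be optimal for a uniform source also satisfy Kraft,
# but they cost more on average when the data actually follow p.
uniform_lengths = -np.log(np.full(p.size, 1.0 / p.size))
uniform_cost = np.sum(p * uniform_lengths)

print(f"Kraft sum            = {kraft_sum:.6f}")
print(f"expected length      = {expected_length:.6f} nats")
print(f"entropy H(p)         = {entropy:.6f} nats")
print(f"cost of uniform code = {uniform_cost:.6f} nats")
```

The last line previews the next subsection: a code built for the wrong distribution is still valid, but it is longer on average.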

42.1.2 1.2 Cross-entropy between two distributions

Suppose the data truly follow $p$, but we build our code as though they followed a different distribution $q$. We then spend $-\log q(x_k)$ nats on outcome $x_k$ instead of the optimal $-\log p(x_k)$. The expected cost under the true distribution is the cross-entropy

\[ H(p, q) = -\sum_{k=1}^{K} p(x_k) \log q(x_k) = \mathbb{E}_{x \sim p}\!\left[-\log q(x)\right]. \]

The cross-entropy is the average code length we actually pay when reality is $p$ and our beliefs are $q$. Three properties matter immediately. First, $H(p, q)$ is not symmetric; in general $H(p, q) \neq H(q, p)$. Second, if $q(x_k) = 0$ for some $x_k$ with $p(x_k) > 0$, the cross-entropy is infinite, which encodes the intuition that a model assigning zero probability to an event that actually occurs is infinitely surprised. Third, $H(p, p) = H(p)$, so cross-entropy reduces to ordinary entropy when the code matches the source.

The same variational argument that fixed the optimal code length also pins down cross-entropy as the cost of a mismatched code. If we build a code optimal for $q$, the codeword for outcome $x_k$ has length $\ell_k = -\log q(x_k)$. The expected length under the true source $p$ is then $\sum_k p(x_k) \ell_k = -\sum_k p(x_k) \log q(x_k) = H(p, q)$, recovering the definition above directly from coding considerations rather than by stipulation.
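The three properties listed above are easy to confirm on a small example. In this minimal sketch the helper name `cross_entropy` and the two distributions are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) in nats. Outcomes with p_k = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    with np.errstate(divide="ignore"):   # allow log(0) = -inf silently
        return -np.sum(p[support] * np.log(q[support]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

h_pq = cross_entropy(p, q)
h_qp = cross_entropy(q, p)
h_pp = cross_entropy(p, p)
entropy_p = -np.sum(p * np.log(p))

# A model that rules out an outcome which can actually occur.
q_zero = np.array([0.5, 0.5, 0.0])
h_zero = cross_entropy(p, q_zero)

print(f"H(p, q) = {h_pq:.4f}   H(q, p) = {h_qp:.4f}   (not symmetric)")
print(f"H(p, p) = {h_pp:.4f}   H(p)    = {entropy_p:.4f}")
print(f"H(p, q_zero) = {h_zero}")
```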

For continuous distributions with densities the sums become integrals, $H(p, q) = -\int p(x) \log q(x)\, dx$, but the discrete case is what classification needs.

42.2 2. Cross-Entropy, Entropy, and KL Divergence

42.2.1 2.1 The decomposition

The central identity of this chapter relates cross-entropy to entropy and the Kullback-Leibler divergence. The KL divergence from $q$ to $p$ is

\[ D_{\mathrm{KL}}(p \parallel q) = \sum_{k=1}^{K} p(x_k) \log \frac{p(x_k)}{q(x_k)}. \]

Expanding the logarithm of the ratio,

\[ D_{\mathrm{KL}}(p \parallel q) = \sum_k p(x_k) \log p(x_k) - \sum_k p(x_k) \log q(x_k) = -H(p) + H(p, q). \]

Rearranging gives the decomposition

\[ H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q). \]

In words, the total cost of encoding $p$ with a code built for $q$ splits into two parts: the irreducible cost $H(p)$ of the source itself, plus the penalty $D_{\mathrm{KL}}(p \parallel q)$ for the mismatch between $q$ and $p$. The KL term is the excess number of nats incurred by using the wrong code.
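As a quick check of the decomposition, the sketch below computes the three quantities independently for one illustrative pair of distributions (the values are assumptions) and confirms that they add up.

```python
import numpy as np

# Illustrative distributions over three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

entropy = -np.sum(p * np.log(p))          # H(p)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)
kl = np.sum(p * np.log(p / q))            # D_KL(p || q)

print(f"H(p)              = {entropy:.6f}")
print(f"D_KL(p || q)      = {kl:.6f}")
print(f"H(p) + D_KL       = {entropy + kl:.6f}")
print(f"H(p, q)           = {cross_entropy:.6f}")
```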

42.2.2 2.2 Why the penalty is nonnegative

Gibbs’ inequality states that $D_{\mathrm{KL}}(p \parallel q) \geq 0$, with equality if and only if $p = q$. The cleanest proof uses the concavity of the logarithm. Since $\log$ is concave, Jensen’s inequality gives

\[ -D_{\mathrm{KL}}(p \parallel q) = \sum_k p(x_k) \log \frac{q(x_k)}{p(x_k)} \leq \log \sum_k p(x_k) \frac{q(x_k)}{p(x_k)} = \log \sum_k q(x_k) = \log 1 = 0. \]

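Multiplying by $-1$ yields $D_{\mathrm{KL}}(p \parallel q) \geq 0$. Because $\log$ is strictly concave, equality holds only when the ratio $q(x_k)/p(x_k)$ is constant across all $k$ with $p(x_k) > 0$, and that constant must be 1 for the bound to be tight, which forces $q = p$. Outcomes with $p(x_k) = 0$ drop out of the sum under the convention $0 \log 0 = 0$; the $q$ mass on the remaining outcomes is then at most one, so the inequality still holds.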

A consequence relevant to optimization follows immediately. Since $H(p)$ does not depend on $q$,

\[ \arg\min_q H(p, q) = \arg\min_q D_{\mathrm{KL}}(p \parallel q) = p. \]

Minimizing cross-entropy over the model distribution $q$ is exactly equivalent to minimizing the KL divergence to the data distribution $p$. The entropy term $H(p)$ is a constant offset that the optimizer never sees. This is why we can speak of cross-entropy and KL divergence almost interchangeably when discussing training objectives, even though they differ by the fixed quantity $H(p)$.
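A brute-force illustration of this result, under the assumption that random draws from a flat Dirichlet are a reasonable way to sample candidate models $q$: every candidate has cross-entropy at least $H(p)$, and the best one found lies close to $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.7, 0.2, 0.1])             # illustrative data distribution
entropy = -np.sum(p * np.log(p))

# Candidate model distributions sampled uniformly from the simplex.
candidates = rng.dirichlet(np.ones(p.size), size=10_000)

# H(p, q) for every candidate q at once.
ce = -(np.log(candidates) @ p)
best = candidates[np.argmin(ce)]

print(f"H(p)                 = {entropy:.4f}")
print(f"smallest H(p, q)     = {ce.min():.4f}")
print(f"best candidate q     = {np.round(best, 3)}")
```

Random search is used here only to make the inequality visible; it is not how the minimizer would be found in practice.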

42.3 3. The Cross-Entropy Loss from Maximum Likelihood

42.3.1 3.1 Setup

Consider a supervised classification problem. We observe a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where each $x^{(i)}$ is an input and each $y^{(i)} \in \{1, \dots, K\}$ is a class label. A model with parameters $\theta$ produces a conditional distribution $q_\theta(y \mid x)$ over the $K$ classes. The standard estimation principle is maximum likelihood: choose $\theta$ to maximize the probability the model assigns to the observed labels.

42.3.2 3.2 From likelihood to loss

Assuming the examples are independent and identically distributed, the likelihood factorizes into a product of per example probabilities,

\[ \mathcal{L}_{\mathrm{lik}}(\theta) = \prod_{i=1}^{N} q_\theta\!\left(y^{(i)} \mid x^{(i)}\right). \]

This product of many numbers in $(0, 1)$ underflows quickly and is awkward to differentiate, so we take the logarithm. Because $\log$ is strictly increasing, the maximizer is unchanged, and the product becomes a sum,

\[ \ell(\theta) = \log \mathcal{L}_{\mathrm{lik}}(\theta) = \sum_{i=1}^{N} \log q_\theta\!\left(y^{(i)} \mid x^{(i)}\right). \]

It is worth making the categorical structure explicit. The model output is a vector $q_\theta(\cdot \mid x^{(i)})$ on the simplex, and the observed label selects one component. Writing $p^{(i)}_k = \mathbb{1}[y^{(i)} = k]$ for the one-hot encoding, the single observed probability can be written as a product over classes, $q_\theta(y^{(i)} \mid x^{(i)}) = \prod_{k=1}^{K} q_\theta(k \mid x^{(i)})^{\,p^{(i)}_k}$, which is exactly the categorical likelihood. Taking the log turns the exponent into the familiar sum $\sum_k p^{(i)}_k \log q_\theta(k \mid x^{(i)})$.

Maximizing $\ell(\theta)$ is the same as minimizing its negative, scaled by $1/N$ to make it an average rather than a sum. Define the average negative log likelihood

\[ \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log q_\theta\!\left(y^{(i)} \mid x^{(i)}\right). \]

This is the cross-entropy loss. To see why it deserves the name, encode each label as a one-hot vector $\mathbf{p}^{(i)}$ whose components are $p^{(i)}_k = \mathbb{1}[y^{(i)} = k]$. The per example cross-entropy between this empirical one-hot target and the model prediction is

\[ H\!\left(\mathbf{p}^{(i)}, q_\theta(\cdot \mid x^{(i)})\right) = -\sum_{k=1}^{K} p^{(i)}_k \log q_\theta\!\left(k \mid x^{(i)}\right) = -\log q_\theta\!\left(y^{(i)} \mid x^{(i)}\right), \]

because the one-hot vector zeroes out every term except the one for the true class. Averaging over the dataset recovers $\mathcal{L}(\theta)$ exactly. So minimizing the cross-entropy loss is exactly maximum likelihood estimation. It is also equivalent to minimizing the average KL divergence between the one-hot label distributions and the model predictions: a one-hot distribution has zero entropy, so for these targets the cross-entropy and the KL divergence coincide.
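The equivalence can be seen on a toy batch. In this sketch the model outputs `q` are random stand-ins for $q_\theta(\cdot \mid x^{(i)})$ (an assumption, since no model is defined here), and labels are 0-based as is usual in NumPy, whereas the text numbers classes from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 4

# Hypothetical model outputs: each row is a distribution over K classes.
q = rng.dirichlet(np.ones(K), size=N)
y = rng.integers(0, K, size=N)             # integer class labels, 0-based

# Average negative log likelihood: gather the true-class probabilities.
true_class_probs = q[np.arange(N), y]
nll = -np.mean(np.log(true_class_probs))

# Mean cross-entropy against one-hot targets.
one_hot = np.eye(K)[y]
mean_ce = np.mean(-np.sum(one_hot * np.log(q), axis=1))

# The raw likelihood is the product of the same probabilities.
likelihood = np.prod(true_class_probs)

print(f"average NLL        = {nll:.6f}")
print(f"mean one-hot CE    = {mean_ce:.6f}")
print(f"likelihood         = {likelihood:.3e}")
print(f"exp(-N * NLL)      = {np.exp(-N * nll):.3e}")
```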

42.3.3 3.3 The distributional reading

There is a tidy way to view the whole dataset at once. Let $\hat{p}$ denote the empirical distribution that places mass $1/N$ on each observed pair. Then

\[ \mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}}\!\left[-\log q_\theta(y \mid x)\right] = H\!\left(\hat{p}, q_\theta\right). \]

Training drives $q_\theta$ toward $\hat{p}$ in the KL sense. Because the empirical distribution converges to the true data distribution as $N$ grows, minimizing cross-entropy is a consistent way to approximate the true conditional distribution, subject to the capacity and inductive biases of the model class.

42.4 4. Binary and Categorical Cross-Entropy

42.4.1 4.1 Binary cross-entropy

When $K = 2$ the model needs only a single scalar output. Let the model produce $\hat{y} = q_\theta(y = 1 \mid x) \in (0, 1)$, the predicted probability of the positive class, and let $y \in \{0, 1\}$ be the label. The per example binary cross-entropy is

\[ \mathrm{BCE}(y, \hat{y}) = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big]. \]

This is the cross-entropy between the Bernoulli target distribution $(1 - y, y)$ and the predicted Bernoulli $(1 - \hat{y}, \hat{y})$. When $y = 1$ the loss is $-\log \hat{y}$, which vanishes as $\hat{y} \to 1$ and blows up as $\hat{y} \to 0$. The symmetric statement holds for $y = 0$.

The probability $\hat{y}$ is typically produced from a real valued logit $z$ by the logistic sigmoid $\sigma(z) = 1 / (1 + e^{-z})$. Substituting and simplifying gives a form expressed directly in the logit,

\[ \mathrm{BCE}(y, z) = \log\!\left(1 + e^{-z}\right) + (1 - y)\, z, \]

which is more numerically stable than computing $\sigma(z)$ first and then taking its logarithm, for the reasons discussed in Section 6, provided the $\log(1 + e^{-z})$ term is itself evaluated without overflow when $z$ is large and negative. Many libraries expose this fused logit form as a single primitive, often named something like binary_cross_entropy_with_logits.
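A minimal NumPy sketch of the logit form follows. The function names are illustrative, not any library's API. The term $\log(1 + e^{-z})$ is evaluated as $\max(-z, 0) + \log(1 + e^{-|z|})$, an algebraically equal rewrite that never exponentiates a large positive number.

```python
import numpy as np

def bce_naive(y, z):
    """Sigmoid first, then logarithms; breaks down for large |z|."""
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def bce_with_logits(y, z):
    """log(1 + exp(-z)) + (1 - y) * z, with the first term computed stably."""
    softplus_neg_z = np.maximum(-z, 0.0) + np.log1p(np.exp(-np.abs(z)))
    return softplus_neg_z + (1.0 - y) * z

y = np.array([1.0, 0.0, 1.0, 0.0])
z = np.array([2.0, -1.0, -800.0, 800.0])   # last two are confidently wrong

# Silence the overflow and log(0) warnings the naive version triggers.
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    naive = bce_naive(y, z)
stable = bce_with_logits(y, z)

print("naive :", naive)
print("stable:", stable)
```

For moderate logits the two agree. For the extreme logits the naive version returns infinity, while the logit form returns the finite loss, which here is essentially $|z|$.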

42.4.2 4.2 Categorical cross-entropy

For $K > 2$ classes the model outputs a full probability vector $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_K)$ with $\sum_k \hat{y}_k = 1$. Against a one-hot target $\mathbf{y}$, the categorical cross-entropy is

\[ \mathrm{CCE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c, \]

where $c$ is the index of the true class. If labels are stored as integers rather than one-hot vectors, the loss is often called sparse categorical cross-entropy, but the arithmetic is identical: gather the predicted probability of the correct class and take its negative logarithm. Binary cross-entropy is the special case $K = 2$ when the two class probabilities are written as $\hat{y}$ and $1 - \hat{y}$.
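The sketch below illustrates both remarks: the one-hot and integer-label forms give the same numbers, and the $K = 2$ categorical loss matches binary cross-entropy. Function names and probabilities are assumptions made for the example, and labels are 0-based.

```python
import numpy as np

def categorical_ce(y_onehot, y_hat):
    """Cross-entropy against one-hot targets, one value per row."""
    return -np.sum(y_onehot * np.log(y_hat), axis=-1)

def sparse_categorical_ce(labels, y_hat):
    """Same loss from integer labels: gather, then negative log."""
    rows = np.arange(len(labels))
    return -np.log(y_hat[rows, labels])

# Illustrative predictions for two examples over K = 3 classes.
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])
y_onehot = np.eye(3)[labels]

dense = categorical_ce(y_onehot, y_hat)
sparse = sparse_categorical_ce(labels, y_hat)
print("one-hot form :", dense)
print("integer form :", sparse)

# K = 2: categorical cross-entropy on (1 - p, p) equals binary cross-entropy.
p_pos, y = 0.3, 1
bce = -(y * np.log(p_pos) + (1 - y) * np.log(1 - p_pos))
cce = categorical_ce(np.eye(2)[[y]], np.array([[1 - p_pos, p_pos]]))[0]
print(f"BCE = {bce:.6f}   two-class CCE = {cce:.6f}")
```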

42.5 5. The Softmax Function

42.5.1 5.1 Definition and role

The model’s final linear layer produces a vector of real-valued logits $\mathbf{z} = (z_1, \dots, z_K)$ that range over all of $\mathbb{R}$. To interpret these as a probability distribution we apply the softmax function

\[ \mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}. \]

Each output lies in $(0, 1)$ and the outputs sum to one, so softmax maps an unconstrained logit vector onto the probability simplex. It is the multiclass generalization of the logistic sigmoid, and it is shift invariant: adding the same constant $c$ to every logit leaves the output unchanged, since the factor $e^c$ cancels between numerator and denominator. This invariance is both a modeling fact and, as we will see, the key to a stable implementation.
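These properties can be verified directly. The sketch below transcribes the definition literally, which is adequate for the moderate logits used here, and checks normalization, shift invariance, and the reduction to the sigmoid for two classes. The logit values are arbitrary.

```python
import numpy as np

def softmax(z):
    """Literal transcription of the definition; fine for moderate logits."""
    e = np.exp(z)
    return e / np.sum(e)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z = np.array([2.0, -1.0, 0.5])
probs = softmax(z)
shifted_probs = softmax(z + 3.0)           # same constant added to every logit

print("softmax(z)     :", np.round(probs, 6), " sum =", probs.sum())
print("softmax(z + 3) :", np.round(shifted_probs, 6))

# With two classes, the first softmax output is the sigmoid of the logit gap.
z2 = np.array([1.2, -0.4])
two_class = softmax(z2)
print(f"softmax(z2)[0] = {two_class[0]:.6f}   sigmoid(z2[0] - z2[1]) = {sigmoid(z2[0] - z2[1]):.6f}")
```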

42.5.2 5.2 The softmax cross-entropy gradient

Pairing softmax with cross-entropy yields a gradient of unusual simplicity, which is a large part of why the combination dominates classification. Let $\hat{y}_k = \mathrm{softmax}(\mathbf{z})_k$ and let the loss for a single example with true class $c$ be $L = -\log \hat{y}_c$. The derivative with respect to logit $z_k$ is

\[ \frac{\partial L}{\partial z_k} = \hat{y}_k - y_k, \]

where $y_k = \mathbb{1}[k = c]$. The gradient is simply the difference between the predicted distribution and the one-hot target. The derivation starts from the Jacobian of the softmax. Write $\hat{y}_i = e^{z_i} / S$ with $S = \sum_j e^{z_j}$. Differentiating the quotient with respect to $z_k$ requires two cases. When $i = k$,

\[ \frac{\partial \hat{y}_i}{\partial z_i} = \frac{e^{z_i} S - e^{z_i} e^{z_i}}{S^2} = \hat{y}_i - \hat{y}_i^2 = \hat{y}_i (1 - \hat{y}_i), \]

and when $i \neq k$ the numerator $e^{z_i}$ is constant in $z_k$ so only the denominator contributes,

\[ \frac{\partial \hat{y}_i}{\partial z_k} = -\frac{e^{z_i} e^{z_k}}{S^2} = -\hat{y}_i \hat{y}_k. \]

Both cases combine into the compact form

\[ \frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i (\delta_{ik} - \hat{y}_k), \]

with $\delta_{ik}$ the Kronecker delta. Now apply the chain rule to $L = -\log \hat{y}_c$. Only the $c$ component of $\hat{y}$ enters $L$, so

\[ \frac{\partial L}{\partial z_k} = -\frac{1}{\hat{y}_c} \frac{\partial \hat{y}_c}{\partial z_k} = -\frac{1}{\hat{y}_c} \hat{y}_c (\delta_{ck} - \hat{y}_k) = \hat{y}_k - \delta_{ck} = \hat{y}_k - y_k. \]

For a full one-hot target the same result follows from the general loss $L = -\sum_i y_i \log \hat{y}_i$, where $\partial L / \partial z_k = -\sum_i y_i (\delta_{ik} - \hat{y}_k) = -y_k + \hat{y}_k \sum_i y_i = \hat{y}_k - y_k$, using $\sum_i y_i = 1$. In vector form the gradient is the residual $\nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y}$.

The clean linear form means the error signal injected at the logits is exactly the residual between prediction and target, and there is no awkward division or saturating factor to weaken the gradient when the model is badly wrong. This is the property that makes softmax cross-entropy train reliably where a squared error on softmax outputs would suffer from vanishing gradients.
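To make the contrast concrete, the sketch below compares the two gradients for a single confidently wrong prediction. The logits are an illustrative assumption, and the squared error is taken as $\tfrac{1}{2}\lVert \hat{\mathbf{y}} - \mathbf{y} \rVert^2$, whose gradient is the residual multiplied by the softmax Jacobian derived above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([10.0, 0.0, 0.0])    # model is confident in class 0
y = np.array([0.0, 1.0, 0.0])     # but the true class is 1

y_hat = softmax(z)
residual = y_hat - y

# Cross-entropy: the gradient at the logits is the residual itself.
grad_ce = residual

# Squared error: the residual passes through the softmax Jacobian,
# J[i, k] = y_hat[i] * (delta_ik - y_hat[k]), which is nearly zero
# when the softmax output is saturated.
jacobian = np.diag(y_hat) - np.outer(y_hat, y_hat)
grad_mse = jacobian.T @ residual

ce_norm = np.linalg.norm(grad_ce)
mse_norm = np.linalg.norm(grad_mse)
print(f"||grad CE||  = {ce_norm:.4f}")
print(f"||grad MSE|| = {mse_norm:.2e}")
```

The cross-entropy gradient stays of order one, while the squared-error gradient is smaller by several orders of magnitude, even though the prediction is badly wrong in both cases.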

42.6 6. Numerically Stable Softmax and Log-Softmax

42.6.1 6.1 The overflow problem

The naive softmax computes $e^{z_k}$ for each logit. If any logit is large, say $z_k = 1000$, then $e^{z_k}$ overflows to infinity in floating point, and the ratio becomes the indeterminate $\infty / \infty$. If all logits are very negative, every exponential underflows to zero and the denominator becomes zero. Either way the result is a NaN or Inf that corrupts the rest of the computation.

42.6.2 6.2 The max subtraction trick

The shift invariance of softmax provides the fix. Let $m = \max_j z_j$ and subtract it from every logit before exponentiating:

\[ \mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k - m}}{\sum_{j=1}^{K} e^{z_j - m}}. \]

This is algebraically identical to the original because the common factor $e^{-m}$ cancels. Numerically it is far better behaved: the largest exponent is now $z_k - m = 0$, so the largest term is $e^0 = 1$ and nothing overflows. The smaller terms may underflow to zero, but that is harmless because they contribute negligibly to the sum.
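The failure and the fix can be reproduced in a few lines. This is a minimal sketch in double precision. The function names are illustrative, and the warnings that NumPy would normally emit for the naive version are suppressed on purpose.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / np.sum(e)

def softmax_stable(z):
    e = np.exp(z - np.max(z))     # largest exponent is exactly 0
    return e / np.sum(e)

big = np.array([1000.0, 999.0, 998.0])        # exp overflows to inf
small = np.array([-1000.0, -1001.0, -1002.0]) # exp underflows to 0

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_big = softmax_naive(big)       # inf / inf
    naive_small = softmax_naive(small)   # 0 / 0

stable_big = softmax_stable(big)
stable_small = softmax_stable(small)

print("naive,  large logits :", naive_big)
print("naive,  small logits :", naive_small)
print("stable, large logits :", np.round(stable_big, 6))
print("stable, small logits :", np.round(stable_small, 6))
```

Both logit vectors differ only by a constant shift, so the stable version returns the same distribution for each.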

42.6.3 6.3 Log-softmax and the log-sum-exp identity

In practice we rarely want to compute the probability and then take its logarithm as a separate step: when $\hat{y}_c$ is tiny it can underflow to zero, and $\log(\hat{y}_c)$ then returns $-\infty$ even though the true log probability is finite. Instead we compute the log probability directly. Taking the logarithm of the stable softmax,

\[ \log \mathrm{softmax}(\mathbf{z})_k = z_k - m - \log \sum_{j=1}^{K} e^{z_j - m}. \]

The term $m + \log \sum_j e^{z_j - m}$ is the numerically stable log-sum-exp of the logits. The identity behind it generalizes the max subtraction trick to the logarithm. For any shift $m$,

\[ \log \sum_{j=1}^{K} e^{z_j} = \log \sum_{j=1}^{K} e^{m} e^{z_j - m} = m + \log \sum_{j=1}^{K} e^{z_j - m}, \]

and choosing $m = \max_j z_j$ guarantees every exponent $z_j - m \le 0$, so each $e^{z_j - m} \in (0, 1]$ and the sum lies in $[1, K]$, safely inside the floating point range. The error analysis of Blanchard, Higham, and Higham shows this shifted formula is not just overflow safe but also has a small relative error bound, which is why it is the standard implementation. Because the cross-entropy loss for the true class is just $-\log \mathrm{softmax}(\mathbf{z})_c = -z_c + m + \log \sum_j e^{z_j - m}$, the entire loss can be computed with one stable expression and no intermediate probabilities. This is why deep learning libraries provide fused operations such as log_softmax and cross_entropy that take raw logits rather than probabilities, and why feeding already softmaxed values into such a loss is a common and subtle bug.
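A small example shows why the fused form matters even after the max subtraction trick has been applied. In this sketch (function names and logits are illustrative assumptions) the true class has a probability too small to represent in double precision, so the two-step computation loses it while the log-softmax keeps it.

```python
import numpy as np

def softmax_stable(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def log_softmax(z):
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([0.0, -800.0])
c = 1   # true class: probability about exp(-800), which underflows to 0

# Two steps: probability first, logarithm second.
with np.errstate(divide="ignore"):
    loss_two_step = -np.log(softmax_stable(z)[c])

# Fused: the log probability is computed directly from the logits.
loss_fused = -log_softmax(z)[c]

print(f"two-step loss = {loss_two_step}")
print(f"fused loss    = {loss_fused}")
```

The two-step loss is infinite and would poison any gradient computed from it; the fused loss is the correct finite value.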

The pattern generalizes to batches by computing the per row maximum and applying the same subtraction along the class axis. The implementations below demonstrate it concretely.

42.7 7. Implementations

The Python tab is executed when the book is rendered, so its output is real. The Julia and Rust tabs are illustrative and show the same numerically stable softmax cross-entropy in two other ecosystems.


import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def log_softmax(z):
    """Numerically stable log softmax along the last axis."""
    m = np.max(z, axis=-1, keepdims=True)
    shifted = z - m
    lse = m[..., 0] + np.log(np.sum(np.exp(shifted), axis=-1))
    return shifted - (lse[..., None] - m)

def softmax(z):
    return np.exp(log_softmax(z))

def cross_entropy_from_logits(z, y_idx):
    """Mean cross entropy for a batch of logits z and integer labels y_idx."""
    lsm = log_softmax(z)
    rows = np.arange(z.shape[0])
    return -np.mean(lsm[rows, y_idx])

# A small batch of logits and integer labels.
N, K = 4, 5
z = rng.normal(size=(N, K)) * 3.0
y_idx = rng.integers(0, K, size=N)

# Stability check: add a huge constant; the loss must be unchanged.
loss = cross_entropy_from_logits(z, y_idx)
loss_shifted = cross_entropy_from_logits(z + 1000.0, y_idx)
print(f"loss                 = {loss:.6f}")
print(f"loss (+1000 shift)   = {loss_shifted:.6f}")
print(f"shift invariant      = {np.allclose(loss, loss_shifted)}")

# Analytic gradient at the logits is (p - y).
probs = softmax(z)
y_onehot = np.zeros((N, K))
y_onehot[np.arange(N), y_idx] = 1.0
grad_analytic = (probs - y_onehot) / N

# Numerical gradient by central differences.
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(N):
    for k in range(K):
        zp = z.copy(); zp[i, k] += eps
        zm = z.copy(); zm[i, k] -= eps
        grad_numeric[i, k] = (
            cross_entropy_from_logits(zp, y_idx)
            - cross_entropy_from_logits(zm, y_idx)
        ) / (2 * eps)

max_abs_err = np.max(np.abs(grad_analytic - grad_numeric))
print(f"max |p - y vs numeric gradient| = {max_abs_err:.2e}")
print(f"gradient matches p - y          = {np.allclose(grad_analytic, grad_numeric, atol=1e-6)}")

# Loss vs predicted probability of the true class, for a binary problem.
p_true = np.linspace(1e-3, 1.0, 200)
loss_curve = -np.log(p_true)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(p_true, loss_curve, color="#2563eb", linewidth=2)
ax.set_xlabel("predicted probability of true class")
ax.set_ylabel("cross-entropy loss  (nats)")
ax.set_title("Cross-entropy loss vs predicted probability")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

loss                 = 2.891213
loss (+1000 shift)   = 2.891213
shift invariant      = True
max |p - y vs numeric gradient| = 2.48e-10
gradient matches p - y          = True

# Illustrative: numerically stable softmax cross-entropy in Julia.
function log_softmax(z::AbstractVector)
    m = maximum(z)
    shifted = z .- m
    lse = log(sum(exp, shifted))
    return shifted .- lse
end

function cross_entropy_from_logits(z::AbstractVector, c::Int)
    return -log_softmax(z)[c]
end

z = [1.0, 3.0, 0.5, 1000.0]   # last logit is huge; stays finite
c = 4
println("loss = ", cross_entropy_from_logits(z, c))

# Gradient is p - y, the residual between softmax and the one-hot target.
function ce_gradient(z::AbstractVector, c::Int)
    p = exp.(log_softmax(z))
    y = zeros(length(z)); y[c] = 1.0
    return p .- y
end
println("grad = ", ce_gradient(z, c))

// Illustrative: numerically stable softmax cross-entropy in Rust.
fn log_softmax(z: &[f64]) -> Vec<f64> {
    let m = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let shifted: Vec<f64> = z.iter().map(|&zi| zi - m).collect();
    let sum_exp: f64 = shifted.iter().map(|&s| s.exp()).sum();
    let lse = sum_exp.ln();
    shifted.iter().map(|&s| s - lse).collect()
}

fn cross_entropy_from_logits(z: &[f64], c: usize) -> f64 {
    -log_softmax(z)[c]
}

fn main() {
    let z = [1.0, 3.0, 0.5, 1000.0]; // huge logit stays finite
    let c = 3;
    let loss = cross_entropy_from_logits(&z, c);
    println!("loss = {loss}");

    // Gradient is p - y.
    let log_p = log_softmax(&z);
    let grad: Vec<f64> = log_p
        .iter()
        .enumerate()
        .map(|(k, &lp)| lp.exp() - if k == c { 1.0 } else { 0.0 })
        .collect();
    println!("grad = {grad:?}");
}

42.8 8. Label Smoothing

42.8.1 8.1 Motivation

The one-hot target asks the model to drive the true class probability to exactly one and all others to exactly zero. Achieving that requires the correct logit to grow without bound relative to the others, because softmax only reaches a hard zero or one in the limit of infinite logits. This pushes the model toward extreme, overconfident outputs, encourages large weight magnitudes, and tends to widen the gap between the largest logit and the rest in a way that generalizes poorly. The model becomes confidently wrong on inputs near the decision boundary and its predicted probabilities stop being well calibrated.

42.8.2 8.2 The smoothed target

Label smoothing, introduced in the context of the Inception architecture, replaces the hard one-hot target with a softened distribution. With smoothing parameter $\epsilon \in (0, 1)$, the target for class $k$ becomes

\[ y_k^{\mathrm{LS}} = (1 - \epsilon)\, y_k + \frac{\epsilon}{K}, \]

which assigns probability $1 - \epsilon + \epsilon/K$ to the true class and $\epsilon/K$ to each of the others. The loss is the cross-entropy against this softened target,

\[ \mathcal{L}^{\mathrm{LS}} = -\sum_{k=1}^{K} y_k^{\mathrm{LS}} \log \hat{y}_k. \]

A useful way to read this is as a mixture. The smoothed target is a convex combination of the one-hot label and the uniform distribution $u_k = 1/K$, namely $\mathbf{y}^{\mathrm{LS}} = (1 - \epsilon)\, \mathbf{y} + \epsilon\, \mathbf{u}$. Because cross-entropy is linear in its first argument, the loss splits cleanly:

\[ \mathcal{L}^{\mathrm{LS}} = -\sum_k \big[(1 - \epsilon) y_k + \epsilon u_k\big] \log \hat{y}_k = (1 - \epsilon)\, H(\mathbf{y}, \hat{\mathbf{y}}) + \epsilon\, H(\mathbf{u}, \hat{\mathbf{y}}). \]

The second term equals $\epsilon\big(H(\mathbf{u}) + D_{\mathrm{KL}}(\mathbf{u} \parallel \hat{\mathbf{y}})\big)$ by the chapter’s central decomposition, and since $H(\mathbf{u}) = \log K$ is constant, the smoothing contributes a $D_{\mathrm{KL}}(\mathbf{u} \parallel \hat{\mathbf{y}})$ penalty that is minimized when $\hat{\mathbf{y}}$ is uniform. Label smoothing therefore acts as a regularizer that pulls predictions toward uniform and penalizes overconfidence.
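The mixture reading can be checked numerically. In the sketch below the prediction `y_hat`, the class count, and the helper `H` are illustrative assumptions; the code confirms that the smoothed loss equals the weighted sum of the two cross-entropies and that the uniform term splits into $\log K$ plus a KL penalty.

```python
import numpy as np

def H(p, q):
    """Cross-entropy H(p, q) in nats; q is assumed strictly positive."""
    return -np.sum(p * np.log(q))

K, eps, c = 5, 0.1, 2
y = np.eye(K)[c]                    # one-hot target for true class c
u = np.full(K, 1.0 / K)             # uniform distribution
y_ls = (1.0 - eps) * y + eps * u    # smoothed target

# An illustrative model prediction.
y_hat = np.array([0.05, 0.10, 0.70, 0.10, 0.05])

loss_ls = H(y_ls, y_hat)
mixture = (1.0 - eps) * H(y, y_hat) + eps * H(u, y_hat)
kl_u = np.sum(u * np.log(u / y_hat))

print("smoothed target :", y_ls)
print(f"L_LS                      = {loss_ls:.6f}")
print(f"(1-eps) H(y,.) + eps H(u,.) = {mixture:.6f}")
print(f"H(u, y_hat) = {H(u, y_hat):.6f}   log K + KL = {np.log(K) + kl_u:.6f}")
```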

The effect on the logits is concrete. The gradient of $\mathcal{L}^{\mathrm{LS}}$ with respect to the logits is $\hat{\mathbf{y}} - \mathbf{y}^{\mathrm{LS}}$ by the same residual argument as before, so a stationary point requires $\hat{y}_k = y_k^{\mathrm{LS}}$ for every class. Inverting the softmax, the optimal logit gap between the true class $c$ and any other class $k$ satisfies

\[ z_c^\star - z_k^\star = \log \frac{y_c^{\mathrm{LS}}}{y_k^{\mathrm{LS}}} = \log \frac{(1 - \epsilon) + \epsilon/K}{\epsilon/K}, \]

which is finite for any $\epsilon > 0$. Contrast this with the hard target, where the gap must diverge to drive $\hat{y}_c \to 1$. The finite optimum is exactly why smoothing curbs runaway logit growth and keeps representations compact.
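For a sense of scale, the sketch below evaluates the optimal gap for the illustrative choice $K = 5$, $\epsilon = 0.1$ and confirms that logits with exactly that gap reproduce the smoothed target, so the gradient $\hat{\mathbf{y}} - \mathbf{y}^{\mathrm{LS}}$ vanishes. Setting the non-target logits to zero is an arbitrary choice that shift invariance permits.

```python
import numpy as np

K, eps, c = 5, 0.1, 0

# Smoothed target: eps / K everywhere, plus (1 - eps) on the true class.
y_ls = np.full(K, eps / K)
y_ls[c] += 1.0 - eps

# Optimal logit gap between the true class and any other class.
gap = np.log(y_ls[c] / (eps / K))

# Logits realizing that gap; the other logits are pinned at zero.
z_star = np.zeros(K)
z_star[c] = gap

e = np.exp(z_star - np.max(z_star))
y_hat = e / np.sum(e)
grad = y_hat - y_ls

print(f"optimal gap      = {gap:.4f} nats")
print("softmax(z_star)  :", np.round(y_hat, 4))
print("gradient         :", np.round(grad, 12))
```

With these values the gap is $\log 46$, a modest margin between the true class and every other class.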

42.8.3 8.3 Effects and trade-offs

Empirically, label smoothing has been reported to improve generalization and calibration across image classification, machine translation, and speech recognition, although the size of the benefit depends on the task and model. The analysis of learned representations by Müller, Kornblith, and Hinton shows that smoothing encourages examples of the same class to cluster tightly and at roughly equal distances from other class clusters, a geometric regularity that the hard target does not impose.

There are costs. Label smoothing erases some of the fine-grained information about relative similarities between the wrong classes. A teacher network trained with it can therefore be a worse source for knowledge distillation, because the student loses the very "dark knowledge" that distillation relies on. A typical value is $\epsilon = 0.1$, and like any regularizer its strength should be tuned to the dataset and model. When calibration and clean generalization matter more than transferring inter-class structure, a small amount of smoothing is usually a cheap and effective improvement.

42.9 9. Summary

Cross-entropy measures the expected cost of encoding data from $p$ using a code optimized for $q$, and it decomposes exactly as $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$. Because the entropy term is constant in the model, minimizing cross-entropy is minimizing KL divergence to the data. The cross-entropy loss is precisely the average negative log likelihood, so training a classifier by cross-entropy is maximum likelihood estimation. Binary and categorical cross-entropy are the Bernoulli and categorical instances of the same quantity. Softmax turns logits into probabilities with a gradient that reduces to the prediction minus the target, the max subtraction trick and the log-sum-exp identity make that computation numerically safe, and label smoothing softens the targets to curb overconfidence and improve calibration. Together these pieces form the default training objective for classification and language modeling.

42.10 References

Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Kullback, S., and Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79-86. https://doi.org/10.1214/aoms/1177729694
Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. https://doi.org/10.1002/047174882X
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. https://doi.org/10.1007/978-0-387-45528-0
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818-2826. https://doi.org/10.1109/CVPR.2016.308
Müller, R., Kornblith, S., and Hinton, G. (2019). When Does Label Smoothing Help? Advances in Neural Information Processing Systems (NeurIPS), 32, 4694-4703. https://doi.org/10.48550/arXiv.1906.02629
Blanchard, P., Higham, D. J., and Higham, N. J. (2021). Accurately Computing the Log-Sum-Exp and Softmax Functions. IMA Journal of Numerical Analysis, 41(4), 2311-2330. https://doi.org/10.1093/imanum/draa038
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning (ICML), 70, 1321-1330. https://doi.org/10.48550/arXiv.1706.04599
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/book1.html

# Cross-Entropy and the Cross-Entropy Loss Cross-entropy is one of the load-bearing ideas of modern machine learning. It measures the cost of encoding data drawn from one distribution using a code optimized for another, and through that lens it becomes the natural objective for training classifiers and language models. This chapter develops cross-entropy from its information-theoretic roots, shows how it decomposes into entropy plus the Kullback-Leibler divergence, derives the cross-entropy loss as the maximum likelihood estimator for categorical data, and works through the practical machinery that makes it usable: the softmax function, its numerically stable implementation, and label smoothing. ## 1. Entropy and the Cost of Encoding ### 1.1 Shannon entropy Let $p$ be a probability distribution over a finite set of outcomes $\mathcal{X} = \{x_1, \dots, x_K\}$. The Shannon entropy of $p$ is $$ H(p) = -\sum_{k=1}^{K} p(x_k) \log p(x_k). $$ When the logarithm is base 2 the result is measured in bits, and when it is the natural logarithm the unit is nats. Machine learning almost always uses the natural logarithm because it composes cleanly with gradient based optimization, so we adopt $\log = \ln$ throughout unless stated otherwise. Entropy quantifies the average uncertainty in $p$. By Shannon's source coding theorem, $H(p)$ is also the minimum expected number of nats per symbol needed to encode a stream of independent draws from $p$, achievable in the limit by an optimal code that assigns roughly $-\log p(x_k)$ nats to outcome $x_k$. Rare events get long codewords, common events get short ones, and the entropy is the weighted average codeword length. The optimal codeword length follows from a short variational argument. We want to choose nonnegative codeword lengths $\ell_k$ that minimize the expected length $\sum_k p(x_k) \ell_k$ subject to the Kraft constraint $\sum_k e^{-\ell_k} \le 1$ in nats. 
Forming the Lagrangian $\mathcal{J} = \sum_k p(x_k) \ell_k + \lambda \big(\sum_k e^{-\ell_k} - 1\big)$ and setting $\partial \mathcal{J} / \partial \ell_k = p(x_k) - \lambda e^{-\ell_k} = 0$ gives $\ell_k = -\log\!\big(p(x_k)/\lambda\big)$. Enforcing the constraint with equality forces $\lambda = 1$, so the optimal length is $\ell_k^\star = -\log p(x_k)$ and the optimal expected length is exactly $\sum_k p(x_k) \ell_k^\star = H(p)$. This is why $-\log p(x_k)$ is the natural unit of surprise. ### 1.2 Cross-entropy between two distributions Suppose the data truly follow $p$, but we build our code as though they followed a different distribution $q$. We then spend $-\log q(x_k)$ nats on outcome $x_k$ instead of the optimal $-\log p(x_k)$. The expected cost under the true distribution is the cross-entropy $$ H(p, q) = -\sum_{k=1}^{K} p(x_k) \log q(x_k) = \mathbb{E}_{x \sim p}\!\left[-\log q(x)\right]. $$ The cross-entropy is the average code length we actually pay when reality is $p$ and our beliefs are $q$. Three properties matter immediately. First, $H(p, q)$ is not symmetric; in general $H(p, q) \neq H(q, p)$. Second, if $q(x_k) = 0$ for some $x_k$ with $p(x_k) > 0$, the cross-entropy is infinite, which encodes the intuition that a model assigning zero probability to an event that actually occurs is infinitely surprised. Third, $H(p, p) = H(p)$, so cross-entropy reduces to ordinary entropy when the code matches the source. The same variational argument that fixed the optimal code length also pins down cross-entropy as the cost of a mismatched code. If we build a code optimal for $q$, the codeword for outcome $x_k$ has length $\ell_k = -\log q(x_k)$. The expected length under the true source $p$ is then $\sum_k p(x_k) \ell_k = -\sum_k p(x_k) \log q(x_k) = H(p, q)$, recovering the definition above directly from coding considerations rather than by stipulation. 
For continuous distributions with densities the sums become integrals, $H(p, q) = -\int p(x) \log q(x)\, dx$, but the discrete case is what classification needs.

## 2. Cross-Entropy, Entropy, and KL Divergence

### 2.1 The decomposition

The central identity of this chapter relates cross-entropy to entropy and the Kullback-Leibler divergence. The KL divergence from $q$ to $p$ is

$$
D_{\mathrm{KL}}(p \parallel q) = \sum_{k=1}^{K} p(x_k) \log \frac{p(x_k)}{q(x_k)}.
$$

Expanding the logarithm of the ratio,

$$
D_{\mathrm{KL}}(p \parallel q) = \sum_k p(x_k) \log p(x_k) - \sum_k p(x_k) \log q(x_k) = -H(p) + H(p, q).
$$

Rearranging gives the decomposition

$$
H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q).
$$

In words, the total cost of encoding $p$ with a code built for $q$ splits into two parts: the irreducible cost $H(p)$ of the source itself, plus the penalty $D_{\mathrm{KL}}(p \parallel q)$ for the mismatch between $q$ and $p$. The KL term is the excess number of nats incurred by using the wrong code.

### 2.2 Why the penalty is nonnegative

Gibbs' inequality states that $D_{\mathrm{KL}}(p \parallel q) \geq 0$, with equality if and only if $p = q$. The cleanest proof uses the concavity of the logarithm. Since $\log$ is concave, Jensen's inequality gives

$$
-D_{\mathrm{KL}}(p \parallel q) = \sum_k p(x_k) \log \frac{q(x_k)}{p(x_k)} \leq \log \sum_k p(x_k) \frac{q(x_k)}{p(x_k)} = \log \sum_k q(x_k) = \log 1 = 0.
$$

Multiplying by $-1$ yields $D_{\mathrm{KL}}(p \parallel q) \geq 0$. Because $\log$ is strictly concave, equality holds only when the ratio $q(x_k)/p(x_k)$ is constant across all $k$, which forces $q = p$.

A consequence relevant to optimization follows immediately. Since $H(p)$ does not depend on $q$,

$$
\arg\min_q H(p, q) = \arg\min_q D_{\mathrm{KL}}(p \parallel q) = p.
$$

Minimizing cross-entropy over the model distribution $q$ is exactly equivalent to minimizing the KL divergence to the data distribution $p$.
The entropy term $H(p)$ is a constant offset that the optimizer never sees. This is why we can speak of cross-entropy and KL divergence almost interchangeably when discussing training objectives, even though they differ by the fixed quantity $H(p)$.

## 3. The Cross-Entropy Loss from Maximum Likelihood

### 3.1 Setup

Consider a supervised classification problem. We observe a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where each $x^{(i)}$ is an input and each $y^{(i)} \in \{1, \dots, K\}$ is a class label. A model with parameters $\theta$ produces a conditional distribution $q_\theta(y \mid x)$ over the $K$ classes. The standard estimation principle is maximum likelihood: choose $\theta$ to maximize the probability the model assigns to the observed labels.

### 3.2 From likelihood to loss

Assuming the examples are independent and identically distributed, the likelihood factorizes into a product of per example probabilities,

$$
\mathcal{L}_{\mathrm{lik}}(\theta) = \prod_{i=1}^{N} q_\theta\!\left(y^{(i)} \mid x^{(i)}\right).
$$

This product of many numbers in $(0, 1)$ underflows quickly and is awkward to differentiate, so we take the logarithm. Because $\log$ is strictly increasing, the maximizer is unchanged, and the product becomes a sum,

$$
\ell(\theta) = \log \mathcal{L}_{\mathrm{lik}}(\theta) = \sum_{i=1}^{N} \log q_\theta\!\left(y^{(i)} \mid x^{(i)}\right).
$$

It is worth making the categorical structure explicit. The model output is a vector $q_\theta(\cdot \mid x^{(i)})$ on the simplex, and the observed label selects one component. Writing $p^{(i)}_k = \mathbb{1}[y^{(i)} = k]$ for the one-hot encoding, the single observed probability can be written as a product over classes, $q_\theta(y^{(i)} \mid x^{(i)}) = \prod_{k=1}^{K} q_\theta(k \mid x^{(i)})^{\,p^{(i)}_k}$, which is exactly the categorical likelihood. Taking the log turns the exponent into the familiar sum $\sum_k p^{(i)}_k \log q_\theta(k \mid x^{(i)})$.
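
Both points in this subsection, the underflow of the raw likelihood and the one-hot selection of a single log probability, are easy to see in a few lines. This is a small sketch using only the standard library; the per example probabilities in `probs`, the vector `q_vec`, and the label `y` are made-up values chosen for illustration, and the class index is 0-based here, unlike the 1-based labels in the text.

```python
import math

# Hypothetical per example probabilities q_theta(y_i | x_i) for N = 2000 examples.
probs = [0.4] * 2000

# Direct product of the likelihood factors: underflows in double precision.
likelihood = 1.0
for q in probs:
    likelihood *= q

# Sum of logs: the same information, comfortably representable.
log_likelihood = sum(math.log(q) for q in probs)

print(f"likelihood     = {likelihood}")
print(f"log-likelihood = {log_likelihood:.3f}")

# One-hot selection: sum_k p_k log q_k picks out log q of the observed class.
q_vec = [0.1, 0.6, 0.3]  # hypothetical model distribution over K = 3 classes
y = 1                    # observed class, 0-based index (assumption of this sketch)
one_hot = [1.0 if k == y else 0.0 for k in range(len(q_vec))]
selected = sum(pk * math.log(qk) for pk, qk in zip(one_hot, q_vec))
print(math.isclose(selected, math.log(q_vec[y])))
```

The product collapses to zero long before the log-likelihood becomes hard to represent, which is the practical reason training code works with sums of log probabilities.
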
Maximizing $\ell(\theta)$ is the same as minimizing its negative, scaled by $1/N$ to make it an average rather than a sum. Define the average negative log likelihood

$$
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log q_\theta\!\left(y^{(i)} \mid x^{(i)}\right).
$$

This is the cross-entropy loss. To see why it deserves the name, encode each label as a one-hot vector $\mathbf{p}^{(i)}$ whose components are $p^{(i)}_k = \mathbb{1}[y^{(i)} = k]$. The per example cross-entropy between this empirical one-hot target and the model prediction is

$$
H\!\left(\mathbf{p}^{(i)}, q_\theta(\cdot \mid x^{(i)})\right) = -\sum_{k=1}^{K} p^{(i)}_k \log q_\theta\!\left(k \mid x^{(i)}\right) = -\log q_\theta\!\left(y^{(i)} \mid x^{(i)}\right),
$$

because the one-hot vector zeroes out every term except the one for the true class. Averaging over the dataset recovers $\mathcal{L}(\theta)$ exactly. So minimizing the cross-entropy loss is maximum likelihood estimation. It is equivalently minimizing the average KL divergence between the empirical one-hot label distributions and the model, because a one-hot distribution has zero entropy and its cross-entropy and KL divergence therefore coincide.

### 3.3 The distributional reading

There is a tidy way to view the whole dataset at once. Let $\hat{p}$ denote the empirical distribution that places mass $1/N$ on each observed pair. Then

$$
\mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}}\!\left[-\log q_\theta(y \mid x)\right] = H\!\left(\hat{p}, q_\theta\right).
$$

Training drives $q_\theta$ toward $\hat{p}$ in the KL sense. Because the empirical distribution converges to the true data distribution as $N$ grows, minimizing cross-entropy is a consistent way to approximate the true conditional distribution, subject to the capacity and inductive biases of the model class.

## 4. Binary and Categorical Cross-Entropy

### 4.1 Binary cross-entropy

When $K = 2$ the model needs only a single scalar output. Let the model produce $\hat{y} = q_\theta(y = 1 \mid x) \in (0, 1)$, the predicted probability of the positive class, and let $y \in \{0, 1\}$ be the label.
The per example binary cross-entropy is

$$
\mathrm{BCE}(y, \hat{y}) = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big].
$$

This is the cross-entropy between the Bernoulli target distribution $(1 - y, y)$ and the predicted Bernoulli $(1 - \hat{y}, \hat{y})$. When $y = 1$ the loss is $-\log \hat{y}$, which vanishes as $\hat{y} \to 1$ and blows up as $\hat{y} \to 0$. The symmetric statement holds for $y = 0$.

The probability $\hat{y}$ is typically produced from a real valued logit $z$ by the logistic sigmoid $\sigma(z) = 1 / (1 + e^{-z})$. Substituting and simplifying gives a form expressed directly in the logit,

$$
\mathrm{BCE}(y, z) = \log\!\left(1 + e^{-z}\right) + (1 - y)\, z,
$$

which is more numerically stable than computing $\sigma(z)$ first and then taking its logarithm, for the reasons discussed in Section 6, provided the term $\log(1 + e^{-z})$ is itself evaluated stably, for example as $\max(-z, 0) + \log\!\left(1 + e^{-|z|}\right)$. Many libraries expose this fused logit form as a single primitive, often named something like `binary_cross_entropy_with_logits`.

### 4.2 Categorical cross-entropy

For $K > 2$ classes the model outputs a full probability vector $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_K)$ with $\sum_k \hat{y}_k = 1$. Against a one-hot target $\mathbf{y}$, the categorical cross-entropy is

$$
\mathrm{CCE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c,
$$

where $c$ is the index of the true class. If labels are stored as integers rather than one-hot vectors, the loss is often called sparse categorical cross-entropy, but the arithmetic is identical: gather the predicted probability of the correct class and take its negative logarithm. Binary cross-entropy is the special case $K = 2$ when the two class probabilities are written as $\hat{y}$ and $1 - \hat{y}$.

## 5. The Softmax Function

### 5.1 Definition and role

The model's final linear layer produces a vector of real valued logits $\mathbf{z} = (z_1, \dots, z_K)$ that range over all of $\mathbb{R}$.
To interpret these as a probability distribution we apply the softmax function

$$
\mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.
$$

Each output lies in $(0, 1)$ and the outputs sum to one, so softmax maps an unconstrained logit vector onto the probability simplex. It is the multiclass generalization of the logistic sigmoid, and it is shift invariant: adding the same constant $c$ to every logit leaves the output unchanged, since the factor $e^c$ cancels between numerator and denominator. This invariance is both a modeling fact and, as we will see, the key to a stable implementation.

### 5.2 The softmax cross-entropy gradient

Pairing softmax with cross-entropy yields a gradient of unusual simplicity, which is a large part of why the combination dominates classification. Let $\hat{y}_k = \mathrm{softmax}(\mathbf{z})_k$ and let the loss for a single example with true class $c$ be $L = -\log \hat{y}_c$. The derivative with respect to logit $z_k$ is

$$
\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k,
$$

where $y_k = \mathbb{1}[k = c]$. The gradient is simply the difference between the predicted distribution and the one-hot target.

The derivation starts from the Jacobian of the softmax. Write $\hat{y}_i = e^{z_i} / S$ with $S = \sum_j e^{z_j}$. Differentiating the quotient with respect to $z_k$ requires two cases. When $i = k$,

$$
\frac{\partial \hat{y}_i}{\partial z_i} = \frac{e^{z_i} S - e^{z_i} e^{z_i}}{S^2} = \hat{y}_i - \hat{y}_i^2 = \hat{y}_i (1 - \hat{y}_i),
$$

and when $i \neq k$ the numerator $e^{z_i}$ is constant in $z_k$ so only the denominator contributes,

$$
\frac{\partial \hat{y}_i}{\partial z_k} = -\frac{e^{z_i} e^{z_k}}{S^2} = -\hat{y}_i \hat{y}_k.
$$

Both cases combine into the compact form

$$
\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i (\delta_{ik} - \hat{y}_k),
$$

with $\delta_{ik}$ the Kronecker delta. Now apply the chain rule to $L = -\log \hat{y}_c$.
Only the $c$ component of $\hat{y}$ enters $L$, so

$$
\frac{\partial L}{\partial z_k} = -\frac{1}{\hat{y}_c} \frac{\partial \hat{y}_c}{\partial z_k} = -\frac{1}{\hat{y}_c} \hat{y}_c (\delta_{ck} - \hat{y}_k) = \hat{y}_k - \delta_{ck} = \hat{y}_k - y_k.
$$

For a full one-hot target the same result follows from the general loss $L = -\sum_i y_i \log \hat{y}_i$, where $\partial L / \partial z_k = -\sum_i y_i (\delta_{ik} - \hat{y}_k) = -y_k + \hat{y}_k \sum_i y_i = \hat{y}_k - y_k$, using $\sum_i y_i = 1$. In vector form the gradient is the residual $\nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y}$.

The clean linear form means the error signal injected at the logits is exactly the residual between prediction and target, and there is no awkward division or saturating factor to weaken the gradient when the model is badly wrong. This is the property that makes softmax cross-entropy train reliably where a squared error on softmax outputs would suffer from vanishing gradients.

## 6. Numerically Stable Softmax and Log-Softmax

### 6.1 The overflow problem

The naive softmax computes $e^{z_k}$ for each logit. If any logit is large, say $z_k = 1000$, then $e^{z_k}$ overflows to infinity in floating point, and the ratio becomes the indeterminate $\infty / \infty$. If all logits are very negative, every exponential underflows to zero and the denominator becomes zero. Either way the result is a `NaN` or `Inf` that corrupts the rest of the computation.

### 6.2 The max subtraction trick

The shift invariance of softmax provides the fix. Let $m = \max_j z_j$ and subtract it from every logit before exponentiating:

$$
\mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k - m}}{\sum_{j=1}^{K} e^{z_j - m}}.
$$

This is algebraically identical to the original because the common factor $e^{-m}$ cancels. Numerically it is far better behaved: the largest exponent is now $z_k - m = 0$, so the largest term is $e^0 = 1$ and nothing overflows.
The smaller terms may underflow to zero, but that is harmless because they contribute negligibly to the sum.

### 6.3 Log-softmax and the log-sum-exp identity

In practice we never want the probability followed by a logarithm, because computing $\log(\hat{y}_c)$ as a separate step reintroduces precision loss when $\hat{y}_c$ is tiny. Instead we compute the log probability directly. Taking the logarithm of the stable softmax,

$$
\log \mathrm{softmax}(\mathbf{z})_k = z_k - m - \log \sum_{j=1}^{K} e^{z_j - m}.
$$

The term $m + \log \sum_j e^{z_j - m}$ is the numerically stable log-sum-exp of the logits.

The identity behind it generalizes the max subtraction trick to the logarithm. For any shift $m$,

$$
\log \sum_{j=1}^{K} e^{z_j} = \log \sum_{j=1}^{K} e^{m} e^{z_j - m} = m + \log \sum_{j=1}^{K} e^{z_j - m},
$$

and choosing $m = \max_j z_j$ guarantees every exponent $z_j - m \le 0$, so each $e^{z_j - m} \in (0, 1]$ and the sum lies in $[1, K]$, safely inside the floating point range. The error analysis of Blanchard, Higham, and Higham shows this shifted formula is not just overflow safe but also has a small relative error bound, which is why it is the standard implementation.

Because the cross-entropy loss for the true class is just $-\log \mathrm{softmax}(\mathbf{z})_c = -z_c + m + \log \sum_j e^{z_j - m}$, the entire loss can be computed with one stable expression and no intermediate probabilities. This is why deep learning libraries provide fused operations such as `log_softmax` and `cross_entropy` that take raw logits rather than probabilities, and why feeding already softmaxed values into such a loss is a common and subtle bug. The pattern generalizes to batches by computing the per row maximum and applying the same subtraction along the class axis. The implementations below demonstrate it concretely.

## 7. Implementations

The Python tab is executed when the book is rendered, so its output is real.
The Julia and Rust tabs are illustrative and show the same numerically stable softmax cross-entropy in two other ecosystems.

::: {.panel-tabset}

## Python

```{python}
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def log_softmax(z):
    """Numerically stable log softmax along the last axis."""
    m = np.max(z, axis=-1, keepdims=True)
    shifted = z - m
    # Log-sum-exp of the shifted logits; the shift m cancels in the result.
    lse_shifted = np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
    return shifted - lse_shifted

def softmax(z):
    return np.exp(log_softmax(z))

def cross_entropy_from_logits(z, y_idx):
    """Mean cross entropy for a batch of logits z and integer labels y_idx."""
    lsm = log_softmax(z)
    rows = np.arange(z.shape[0])
    return -np.mean(lsm[rows, y_idx])

# A small batch of logits and integer labels.
N, K = 4, 5
z = rng.normal(size=(N, K)) * 3.0
y_idx = rng.integers(0, K, size=N)

# Stability check: add a huge constant; the loss must be unchanged.
loss = cross_entropy_from_logits(z, y_idx)
loss_shifted = cross_entropy_from_logits(z + 1000.0, y_idx)
print(f"loss = {loss:.6f}")
print(f"loss (+1000 shift) = {loss_shifted:.6f}")
print(f"shift invariant = {np.allclose(loss, loss_shifted)}")

# Analytic gradient at the logits is (p - y).
probs = softmax(z)
y_onehot = np.zeros((N, K))
y_onehot[np.arange(N), y_idx] = 1.0
grad_analytic = (probs - y_onehot) / N

# Numerical gradient by central differences.
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(N):
    for k in range(K):
        zp = z.copy()
        zp[i, k] += eps
        zm = z.copy()
        zm[i, k] -= eps
        grad_numeric[i, k] = (
            cross_entropy_from_logits(zp, y_idx)
            - cross_entropy_from_logits(zm, y_idx)
        ) / (2 * eps)

max_abs_err = np.max(np.abs(grad_analytic - grad_numeric))
print(f"max |p - y vs numeric gradient| = {max_abs_err:.2e}")
print(f"gradient matches p - y = {np.allclose(grad_analytic, grad_numeric, atol=1e-6)}")

# Loss vs predicted probability of the true class, for a binary problem.
p_true = np.linspace(1e-3, 1.0, 200)
loss_curve = -np.log(p_true)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(p_true, loss_curve, color="#2563eb", linewidth=2)
ax.set_xlabel("predicted probability of true class")
ax.set_ylabel("cross-entropy loss (nats)")
ax.set_title("Cross-entropy loss vs predicted probability")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

## Julia

```julia
# Illustrative: numerically stable softmax cross-entropy in Julia.
function log_softmax(z::AbstractVector)
    m = maximum(z)
    shifted = z .- m
    lse = log(sum(exp, shifted))
    return shifted .- lse
end

function cross_entropy_from_logits(z::AbstractVector, c::Int)
    return -log_softmax(z)[c]
end

z = [1.0, 3.0, 0.5, 1000.0]  # last logit is huge; stays finite
c = 4
println("loss = ", cross_entropy_from_logits(z, c))

# Gradient is p - y, the residual between softmax and the one-hot target.
function ce_gradient(z::AbstractVector, c::Int)
    p = exp.(log_softmax(z))
    y = zeros(length(z))
    y[c] = 1.0
    return p .- y
end

println("grad = ", ce_gradient(z, c))
```

## Rust

```rust
// Illustrative: numerically stable softmax cross-entropy in Rust.
fn log_softmax(z: &[f64]) -> Vec<f64> {
    let m = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let shifted: Vec<f64> = z.iter().map(|&zi| zi - m).collect();
    let sum_exp: f64 = shifted.iter().map(|&s| s.exp()).sum();
    let lse = sum_exp.ln();
    shifted.iter().map(|&s| s - lse).collect()
}

fn cross_entropy_from_logits(z: &[f64], c: usize) -> f64 {
    -log_softmax(z)[c]
}

fn main() {
    let z = [1.0, 3.0, 0.5, 1000.0]; // huge logit stays finite
    let c = 3;
    let loss = cross_entropy_from_logits(&z, c);
    println!("loss = {loss}");

    // Gradient is p - y.
    let log_p = log_softmax(&z);
    let grad: Vec<f64> = log_p
        .iter()
        .enumerate()
        .map(|(k, &lp)| lp.exp() - if k == c { 1.0 } else { 0.0 })
        .collect();
    println!("grad = {grad:?}");
}
```

:::

## 8. Label Smoothing

### 8.1 Motivation

The one-hot target asks the model to drive the true class probability to exactly one and all others to exactly zero. Achieving that requires the correct logit to grow without bound relative to the others, because softmax only reaches a hard zero or one in the limit of infinite logits. This pushes the model toward extreme, overconfident outputs, encourages large weight magnitudes, and tends to widen the gap between the largest logit and the rest in a way that generalizes poorly. The model becomes confidently wrong on inputs near the decision boundary and its predicted probabilities stop being well calibrated.

### 8.2 The smoothed target

Label smoothing, introduced in the context of the Inception architecture, replaces the hard one-hot target with a softened distribution. With smoothing parameter $\epsilon \in (0, 1)$, the target for class $k$ becomes

$$
y_k^{\mathrm{LS}} = (1 - \epsilon)\, y_k + \frac{\epsilon}{K},
$$

which assigns probability $1 - \epsilon + \epsilon/K$ to the true class and $\epsilon/K$ to each of the others. The loss is the cross-entropy against this softened target,

$$
\mathcal{L}^{\mathrm{LS}} = -\sum_{k=1}^{K} y_k^{\mathrm{LS}} \log \hat{y}_k.
$$

A useful way to read this is as a mixture. The smoothed target is a convex combination of the one-hot label and the uniform distribution $u_k = 1/K$, namely $\mathbf{y}^{\mathrm{LS}} = (1 - \epsilon)\, \mathbf{y} + \epsilon\, \mathbf{u}$. Because cross-entropy is linear in its first argument, the loss splits cleanly:

$$
\mathcal{L}^{\mathrm{LS}} = -\sum_k \big[(1 - \epsilon) y_k + \epsilon u_k\big] \log \hat{y}_k = (1 - \epsilon)\, H(\mathbf{y}, \hat{\mathbf{y}}) + \epsilon\, H(\mathbf{u}, \hat{\mathbf{y}}).
$$

The second term equals $\epsilon\big(H(\mathbf{u}) + D_{\mathrm{KL}}(\mathbf{u} \parallel \hat{\mathbf{y}})\big)$ by the chapter's central decomposition, and since $H(\mathbf{u}) = \log K$ is constant, the smoothing contributes a $D_{\mathrm{KL}}(\mathbf{u} \parallel \hat{\mathbf{y}})$ penalty that is minimized when $\hat{\mathbf{y}}$ is uniform. Label smoothing therefore acts as a regularizer that pulls predictions toward uniform and penalizes overconfidence.

The effect on the logits is concrete. The gradient of $\mathcal{L}^{\mathrm{LS}}$ with respect to the logits is $\hat{\mathbf{y}} - \mathbf{y}^{\mathrm{LS}}$ by the same residual argument as before, so a stationary point requires $\hat{y}_k = y_k^{\mathrm{LS}}$ for every class. Inverting the softmax, the optimal logit gap between the true class $c$ and any other class $k$ satisfies

$$
z_c^\star - z_k^\star = \log \frac{y_c^{\mathrm{LS}}}{y_k^{\mathrm{LS}}} = \log \frac{(1 - \epsilon) + \epsilon/K}{\epsilon/K},
$$

which is finite for any $\epsilon > 0$. Contrast this with the hard target, where the gap must diverge to drive $\hat{y}_c \to 1$. The finite optimum is exactly why smoothing curbs runaway logit growth and keeps representations compact.

### 8.3 Effects and trade-offs

Empirically, label smoothing improves generalization and produces better calibrated probabilities across image classification, machine translation, and speech recognition. With a smoothed target the optimal logits are finite rather than divergent, which curbs the runaway growth of the correct logit and keeps representations more compact. Analysis of the learned representations shows that smoothing encourages examples of the same class to cluster tightly and at roughly equal distances from other class clusters, a geometric regularity that the hard target does not impose.

There are costs.
Because label smoothing erases some of the fine grained information about relative similarities between the wrong classes, it can hurt when a teacher trained with label smoothing is then distilled into a student network, since the student loses the very dark knowledge that distillation relies on. A typical value is $\epsilon = 0.1$, and like any regularizer its strength should be tuned to the dataset and model. When calibration and clean generalization matter more than transferring inter class structure, a small amount of smoothing is usually a cheap and effective improvement.

## 9. Summary

Cross-entropy measures the expected cost of encoding data from $p$ using a code optimized for $q$, and it decomposes exactly as $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$. Because the entropy term is constant in the model, minimizing cross-entropy is minimizing KL divergence to the data. The cross-entropy loss is precisely the average negative log likelihood, so training a classifier by cross-entropy is maximum likelihood estimation. Binary and categorical cross-entropy are the Bernoulli and categorical instances of the same quantity. Softmax turns logits into probabilities, and paired with cross-entropy its gradient reduces to the prediction minus the target. The max subtraction trick and the log-sum-exp identity make that computation numerically safe, and label smoothing softens the targets to curb overconfidence and improve calibration. Together these pieces form the default training objective for classification and language modeling.

## References

1. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
2. Kullback, S., and Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79-86. https://doi.org/10.1214/aoms/1177729694
3. Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. https://doi.org/10.1002/047174882X
4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. https://doi.org/10.1007/978-0-387-45528-0
5. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
6. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818-2826. https://doi.org/10.1109/CVPR.2016.308
7. Müller, R., Kornblith, S., and Hinton, G. (2019). When Does Label Smoothing Help? Advances in Neural Information Processing Systems (NeurIPS), 32, 4694-4703. https://doi.org/10.48550/arXiv.1906.02629
8. Blanchard, P., Higham, D. J., and Higham, N. J. (2021). Accurately Computing the Log-Sum-Exp and Softmax Functions. IMA Journal of Numerical Analysis, 41(4), 2311-2330. https://doi.org/10.1093/imanum/draa038
9. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning (ICML), 70, 1321-1330. https://doi.org/10.48550/arXiv.1706.04599
10. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. https://probml.github.io/pml-book/book1.html