40 Information Theory and Entropy

Information theory is the mathematical study of how to quantify, store, and communicate information. It was founded almost single-handedly by Claude Shannon in his 1948 paper “A Mathematical Theory of Communication,” which introduced a precise way to measure the information content of a random source. The central object of that theory is entropy, a number that captures how uncertain we are about an outcome before we observe it, or equivalently how much we learn on average once we do. For machine learning practitioners, entropy is not a historical curiosity. It sits underneath cross entropy loss, mutual information objectives, decision tree splitting criteria, variational inference, and the very notion of how many bits a model needs to describe its data. This chapter builds entropy from first principles, develops the related quantities of joint and conditional entropy, explains the source coding theorem that gives entropy its operational meaning, extends the idea to continuous variables through differential entropy, and closes with why all of this matters when we train models.

40.1 1. Self-Information and Shannon Entropy

40.1.1 1.1 Quantifying surprise

Suppose a discrete random variable $X$ takes a value $x$ with probability $p(x)$. We want a function that measures how surprising or informative the event $X = x$ is. Intuitively, rare events are surprising and common events are not. If the sun rises tomorrow, an event of probability near one, we learn almost nothing. If a fair coin lands on its edge, an event of tiny probability, we learn a great deal.

Shannon argued that any reasonable measure of information should satisfy a few natural requirements. It should depend only on the probability of the event, not on what the event is. It should decrease as probability increases, so that more likely events carry less information. It should be continuous in the probability. And critically, the information of two independent events occurring together should be the sum of their individual informations. If you learn the result of two independent coin flips, you should gain twice the information of one flip.

The only function (up to a choice of base) that satisfies these constraints is the logarithm. We define the self-information, also called the surprisal, of the event $X = x$ as

\[ I(x) = -\log p(x) = \log \frac{1}{p(x)}. \]

We can make the uniqueness claim precise. Write the information of an event as a function $f$ of its probability alone, $I = f(p)$. The additivity requirement says that for two independent events with probabilities $p$ and $q$, the joint event of probability $pq$ carries information $f(pq) = f(p) + f(q)$. This is the Cauchy functional equation in multiplicative form. Substituting $p = e^{-u}$ and $q = e^{-v}$ and writing $g(u) = f(e^{-u})$ turns it into $g(u + v) = g(u) + g(v)$, the additive Cauchy equation. Under the continuity (or even mere monotonicity) requirement, the only solutions are linear, $g(u) = c\,u$, which means $f(p) = -c \log p$ for some constant $c > 0$. The sign of $c$ is fixed by the requirement that information decrease in $p$, and its magnitude only sets the base of the logarithm. Hence, up to the choice of unit, $I(x) = -\log p(x)$ is forced.

The additivity requirement is therefore exactly the property that the logarithm of a product is the sum of logarithms. For independent events $x$ and $y$, $p(x, y) = p(x) p(y)$, so $I(x, y) = -\log(p(x) p(y)) = I(x) + I(y)$.

The base of the logarithm sets the unit of information. Base 2 gives bits, which is the most common choice in computing and communication. Base $e$ gives nats, which appear naturally in calculus-heavy machine learning derivations. Base 10 gives dits or hartleys. One bit is the information gained from observing the outcome of a single fair coin flip, since $-\log_2(1/2) = 1$.

40.1.2 1.2 Defining Shannon entropy

Self-information describes a single outcome. To characterize the whole random variable, we average self-information over all possible outcomes, weighting each by its probability. This expectation is the Shannon entropy:

\[ H(X) = \mathbb{E}_{x \sim p}\left[I(x)\right] = -\sum_{x} p(x) \log p(x). \]

By convention $0 \log 0 = 0$, justified because $p \log p \to 0$ as $p \to 0$. Entropy is the expected surprise of the source, the average number of bits we are unsure about before we see the outcome.

Consider a Bernoulli variable with probability $p$ of success. Its entropy is the binary entropy function

\[ H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p). \]

This function is zero when $p = 0$ or $p = 1$, because the outcome is certain and there is nothing to learn. It reaches its maximum of exactly one bit at $p = 1/2$, the fair coin, where uncertainty is greatest. The shape of $H(p)$, a smooth hump peaking in the middle, is the canonical picture of entropy as uncertainty.

40.1.3 1.3 Key properties

Entropy is always nonnegative, $H(X) \geq 0$. To see it, note that for each outcome $p(x) \in [0, 1]$, so $\log p(x) \leq 0$ and the term $-p(x) \log p(x) \geq 0$. A sum of nonnegative terms is nonnegative. Equality $H(X) = 0$ holds if and only if every term vanishes, which requires each $p(x)$ to be $0$ or $1$. Since the probabilities sum to one, exactly one outcome has $p(x) = 1$, so the distribution is degenerate, placing all mass on a single outcome.

Among all distributions over an alphabet of $n$ symbols, the uniform distribution maximizes entropy, achieving $H = \log n$. We can prove this directly with Lagrange multipliers. Maximize $H(p) = -\sum_{i=1}^{n} p_i \log p_i$ subject to the normalization constraint $\sum_{i=1}^{n} p_i = 1$. Form the Lagrangian

\[ \mathcal{L}(p, \lambda) = -\sum_{i=1}^{n} p_i \log p_i + \lambda\left(\sum_{i=1}^{n} p_i - 1\right). \]

Differentiating with respect to a single $p_i$ and using $\frac{d}{dp_i}(p_i \log p_i) = \log p_i + 1$ gives

\[ \frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda = 0 \quad\Longrightarrow\quad p_i = e^{\lambda - 1}. \]

The stationary condition makes every $p_i$ equal to the same constant, independent of $i$. Imposing $\sum_i p_i = 1$ forces $p_i = 1/n$ for all $i$, the uniform distribution. Because $-p \log p$ is concave and the constraint set is convex, this critical point is the global maximum, and substituting back gives $H = -\sum_i \frac{1}{n} \log \frac{1}{n} = \log n$. This is the formal statement that uniformity is the state of maximum uncertainty, and it is the seed of the maximum entropy principle, which selects the least committed distribution consistent with known constraints.

Entropy also enjoys invariance under relabeling. Permuting the symbols leaves $H(X)$ unchanged because the formula depends only on the multiset of probabilities, not on the identities attached to them.

40.2 2. Joint and Conditional Entropy

40.2.1 2.1 Joint entropy

When two random variables $X$ and $Y$ are considered together, their joint entropy measures the total uncertainty in the pair. It is just the entropy of the joint distribution $p(x, y)$:

\[ H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y). \]

If $X$ and $Y$ are independent, the joint entropy decomposes additively, $H(X, Y) = H(X) + H(Y)$, recovering the intuition that independent sources of uncertainty simply add. When they are dependent, the joint entropy is strictly less than the sum, because knowing one variable tells us something about the other and reduces the combined uncertainty.

40.2.2 2.2 Conditional entropy

Conditional entropy quantifies the uncertainty that remains in $Y$ once we know $X$. For a fixed value $X = x$, the residual uncertainty is $H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x)$. Averaging over $x$ gives the conditional entropy:

\[ H(Y \mid X) = \sum_{x} p(x) \, H(Y \mid X = x) = -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x). \]

This is the expected number of additional bits needed to specify $Y$ after $X$ has been revealed. Conditioning can never increase entropy on average, a result stated as $H(Y \mid X) \leq H(Y)$, with equality if and only if $X$ and $Y$ are independent. Information cannot hurt: learning $X$ either reduces our uncertainty about $Y$ or leaves it unchanged. Note the qualifier “on average,” because for a specific value of $X$ the residual uncertainty $H(Y \mid X = x)$ can exceed $H(Y)$.

40.2.3 2.3 The chain rule

Joint and conditional entropy are tied together by the chain rule, which mirrors the chain rule of probability:

\[ H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y). \]

This identity is not an assumption; it follows from the definitions and the probability factorization $p(x, y) = p(x)\, p(y \mid x)$. Starting from the joint entropy and taking the logarithm of the product,

\[ \begin{aligned} H(X, Y) &= -\sum_{x}\sum_{y} p(x, y) \log p(x, y) \\ &= -\sum_{x}\sum_{y} p(x, y) \log\big(p(x)\, p(y \mid x)\big) \\ &= -\sum_{x}\sum_{y} p(x, y) \log p(x) \;-\; \sum_{x}\sum_{y} p(x, y) \log p(y \mid x). \end{aligned} \]

In the first term, summing $p(x, y)$ over $y$ collapses the joint to the marginal $p(x)$, so it becomes $-\sum_x p(x) \log p(x) = H(X)$. The second term is exactly the definition of $H(Y \mid X)$. Hence $H(X, Y) = H(X) + H(Y \mid X)$. Repeating the argument with the roles of $X$ and $Y$ swapped gives the symmetric form. The total uncertainty in the pair equals the uncertainty in the first variable plus the leftover uncertainty in the second given the first. The rule generalizes to any number of variables,

\[ H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}), \]

which is exactly the information-theoretic counterpart of factorizing a joint distribution into a product of conditionals. This decomposition is what makes autoregressive language models tractable: the entropy of a whole sequence is the sum of per-token conditional entropies.

40.2.4 2.4 Mutual information

The reduction in uncertainty about $Y$ from learning $X$ is the mutual information,

\[ I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y). \]

Mutual information is symmetric, nonnegative, and zero exactly when the variables are independent. It can be written as the Kullback-Leibler divergence between the joint distribution and the product of the marginals, $I(X; Y) = D_{\mathrm{KL}}(p(x, y) \,\|\, p(x) p(y))$, which makes precise the idea that mutual information measures how far two variables are from being independent. It is the workhorse quantity behind feature selection, representation learning objectives such as InfoNCE, and the information bottleneck framework.

40.2.5 2.5 A worked example

Concrete numbers make these quantities tangible. Consider two binary variables $X$ and $Y$ with the joint distribution

\[ \begin{array}{c|cc} p(x, y) & Y = 0 & Y = 1 \\ \hline X = 0 & 1/2 & 1/4 \\ X = 1 & 0 & 1/4 \end{array} \]

The marginals are $p(X = 0) = 1/2 + 1/4 = 3/4$ and $p(X = 1) = 1/4$, while $p(Y = 0) = 1/2$ and $p(Y = 1) = 1/2$. The marginal entropy of $X$ is

\[ H(X) = -\tfrac{3}{4}\log_2 \tfrac{3}{4} - \tfrac{1}{4}\log_2 \tfrac{1}{4} = 0.75 \times 0.415 + 0.25 \times 2 \approx 0.811 \text{ bits}, \]

and $Y$ is fair, so $H(Y) = 1$ bit. The joint entropy sums $-p \log_2 p$ over the three nonzero cells (recalling $0 \log 0 = 0$):

\[ H(X, Y) = -\tfrac{1}{2}\log_2 \tfrac{1}{2} - \tfrac{1}{4}\log_2 \tfrac{1}{4} - \tfrac{1}{4}\log_2 \tfrac{1}{4} = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits}. \]

The chain rule then gives the conditional entropy without a fresh summation, $H(Y \mid X) = H(X, Y) - H(X) = 1.5 - 0.811 = 0.689$ bits, and the mutual information is $I(X; Y) = H(Y) - H(Y \mid X) = 1 - 0.689 = 0.311$ bits. Knowing $X$ removes about a third of a bit of uncertainty about $Y$, which matches intuition: when $X = 1$ the value of $Y$ is certain ($Y = 1$), so observing $X$ is sometimes very informative.

40.3 3. The Source Coding Theorem

40.3.1 3.1 Entropy as a compression limit

So far entropy has been an abstract average. The source coding theorem, also called Shannon’s first theorem or the noiseless coding theorem, gives it concrete operational meaning: entropy is the fundamental limit on lossless compression. No matter how clever your encoding scheme, you cannot on average represent the output of a source using fewer bits than its entropy.

Formally, consider a source emitting symbols independently from distribution $p$. A uniquely decodable binary code assigns to each symbol $x$ a codeword of length $\ell(x)$ bits. The expected codeword length is $L = \sum_x p(x) \ell(x)$. The theorem states that any uniquely decodable code satisfies

\[ L \geq H(X), \]

and that there exist codes whose expected length comes within one bit of this bound, $L < H(X) + 1$. By encoding long blocks of $n$ symbols at a time, the per-symbol overhead shrinks to zero, so the expected length per symbol can be pushed arbitrarily close to $H(X)$.

40.3.2 3.2 Why the limit holds

The lower bound rests on the Kraft inequality, which says that the codeword lengths of any uniquely decodable code must satisfy $\sum_x 2^{-\ell(x)} \leq 1$. Combining this constraint with the nonnegativity of the Kullback-Leibler divergence yields $L \geq H(X)$ directly. The intuition is that an efficient code should assign short codewords to frequent symbols and long codewords to rare ones, and the optimal length for a symbol of probability $p(x)$ turns out to be $-\log_2 p(x)$, precisely its self-information. Averaging those ideal lengths gives the entropy.

40.3.3 3.3 Practical codes

Huffman coding constructs an optimal prefix code by greedily merging the two least probable symbols, and it achieves the best possible expected length among codes that assign an integer number of bits per symbol. Because of the integer constraint, Huffman codes can be up to one bit per symbol away from entropy. Arithmetic coding and the more recent asymmetric numeral systems sidestep the integer limitation by encoding entire messages as a single number, reaching the entropy bound far more tightly. These ideas live inside everyday tools such as gzip, PNG, and modern neural compression pipelines. The deep connection for machine learning is that a probabilistic model is itself a compressor: a good model assigns high probability to the data, which means short codewords, which means the data can be described in few bits. Compression and prediction are two views of the same problem.

40.4 4. Differential Entropy

40.4.1 4.1 Extending entropy to continuous variables

Real-valued quantities such as pixel intensities, embeddings, and network activations live on a continuum, so we need an analogue of entropy for continuous random variables. For a variable $X$ with probability density $f(x)$, the differential entropy is defined by replacing the sum with an integral:

\[ h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx. \]

As a worked example, take a Gaussian $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\big(-\frac{(x - \mu)^2}{2\sigma^2}\big)$ and compute its differential entropy in nats. The log density is $\log f(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$, so

\[ \begin{aligned} h(X) &= -\int f(x) \log f(x)\, dx \\ &= \tfrac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f(x)\, dx}_{=\,1} \;+\; \frac{1}{2\sigma^2}\underbrace{\int (x - \mu)^2 f(x)\, dx}_{=\,\sigma^2} \\ &= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2} = \tfrac{1}{2}\log(2\pi e\,\sigma^2). \end{aligned} \]

The second integral is just the variance, which equals $\sigma^2$ by definition, and the $\frac{1}{2}$ is folded into the logarithm using $\frac{1}{2} = \frac{1}{2}\log e$. The result $\frac{1}{2}\log(2\pi e\,\sigma^2)$ grows with the variance: a wider spread means more uncertainty. Among all continuous distributions with a fixed variance, the Gaussian is the one that maximizes differential entropy, a fact that helps explain why the normal distribution is such a natural default.

40.4.2 4.2 Differences from discrete entropy

Differential entropy looks like its discrete cousin, but it behaves differently and several intuitions break. It can be negative, as happens for a uniform distribution on an interval shorter than one. It is not invariant under a change of variables: rescaling $X$ by a factor $a$ shifts the differential entropy by $\log |a|$, because stretching the axis changes the density. And it does not represent an absolute number of bits, since digitizing a continuous value to finite precision requires infinitely many bits in the limit.

The reason for these quirks is that differential entropy is really the discrete entropy of a finely quantized version of $X$ minus a term that diverges as the quantization grows infinitely fine. The infinite part is discarded in the definition, leaving a quantity that is best understood as relative rather than absolute. Despite these caveats, differential entropy is well behaved in the differences and ratios that matter. Mutual information between continuous variables, defined through differential entropies, remains nonnegative and invariant under invertible transformations, which is why it is a dependable objective even when the underlying entropies are not.

40.5 5. Why Entropy Matters in Machine Learning

40.5.1 5.1 Cross entropy and the standard training loss

The most pervasive appearance of information theory in machine learning is the cross entropy loss used to train classifiers and language models. Given a true distribution $p$ and a model distribution $q$, the cross entropy is

\[ H(p, q) = -\sum_{x} p(x) \log q(x). \]

It measures the average number of bits needed to encode samples from $p$ using a code optimized for $q$. Cross entropy decomposes as $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\| \, q)$, where the Kullback-Leibler divergence is the extra cost of using the wrong distribution. Since $H(p)$ is fixed by the data, minimizing cross entropy is the same as minimizing the divergence between the model and the data, which is in turn equivalent to maximum likelihood estimation. Every time a network is trained with a softmax output and a negative log likelihood objective, it is being pushed to compress the training data, and its loss in nats is literally a description length.

40.5.2 5.2 Decision trees and feature selection

Classical machine learning uses entropy directly as a splitting criterion. Decision tree algorithms such as ID3 and C4.5 choose the feature that yields the largest information gain, defined as the reduction in label entropy after conditioning on that feature, $\mathrm{IG}(Y, X) = H(Y) - H(Y \mid X)$. This is exactly the mutual information between the feature and the label. Splits that most reduce uncertainty about the target are preferred, giving a principled and interpretable basis for greedy tree construction. The same mutual information criterion underlies filter methods for feature selection, which rank candidate features by how much they tell us about the target.

40.5.3 5.3 Representation learning and regularization

Modern representation learning leans heavily on entropy and mutual information. Contrastive methods maximize a lower bound on the mutual information between different views of the same input. The information bottleneck principle frames learning as finding a representation $T$ that is maximally informative about the label $Y$ while being maximally compressive about the input $X$, trading off $I(X; T)$ against $I(T; Y)$. Variational autoencoders include a Kullback-Leibler term that penalizes the divergence between the approximate posterior and a prior, an entropy-based regularizer that controls how much information the latent code carries. In reinforcement learning, maximum entropy methods add an entropy bonus to the objective so that policies stay stochastic and keep exploring, since a high-entropy policy hedges across actions rather than collapsing prematurely onto one.

40.5.4 5.4 Uncertainty, calibration, and beyond

Entropy gives a direct readout of a model’s uncertainty. The entropy of a classifier’s predicted distribution measures how confident it is: low entropy means a sharp, confident prediction, while high entropy means the model is hedging across classes. This signal drives active learning, where the system queries labels for the inputs it is most uncertain about, and it supports out-of-distribution detection and selective prediction, where low-confidence inputs are flagged for human review. Perplexity, the standard metric for language models, is simply two raised to the cross entropy, so reporting perplexity is reporting an exponentiated bit count. From the loss function that trains a network, to the criterion that grows a tree, to the metric that evaluates a language model, information theory supplies the common currency in which learning is measured. Entropy is the bridge between probability and bits, and understanding it turns many machine learning objectives from opaque formulas into statements about average surprise.

40.6 6. Computing Entropy in Practice

The definitions above are easy to turn into code. The cell below plots the binary entropy function across the full range of $p$, confirming that it peaks at exactly one bit when $p = 1/2$, and then estimates the entropy of an empirical distribution drawn from samples. The empirical entropy of a roughly uniform four-symbol source should sit just under the theoretical maximum of $\log_2 4 = 2$ bits.

Code

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)


def entropy(p, base=2):
    """Shannon entropy of a probability vector, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * (np.log(p) / np.log(base))))


def binary_entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    pm = p[mask]
    out[mask] = -(pm * np.log(pm) + (1 - pm) * np.log(1 - pm)) / np.log(base)
    return out


# Binary entropy curve.
ps = np.linspace(0, 1, 501)
Hb = binary_entropy(ps)
print("max binary entropy:", round(Hb.max(), 6),
      "bits at p =", round(ps[np.argmax(Hb)], 3))

# Empirical distribution from 10,000 samples over 4 symbols.
samples = rng.integers(0, 4, size=10000)
counts = np.bincount(samples, minlength=4)
emp = counts / counts.sum()
print("empirical probabilities:", np.round(emp, 4))
print("empirical entropy:", round(entropy(emp), 6), "bits")
print("uniform maximum:", round(np.log2(4), 6), "bits")

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(ps, Hb, color="C0", lw=2)
ax.axvline(0.5, ls="--", color="grey", alpha=0.7)
ax.set_xlabel("p (probability of success)")
ax.set_ylabel("H(p) in bits")
ax.set_title("Entropy of a Bernoulli variable")
fig.tight_layout()
plt.show()

max binary entropy: 1.0 bits at p = 0.5
empirical probabilities: [0.2534 0.2527 0.2486 0.2453]
empirical entropy: 1.999876 bits
uniform maximum: 2.0 bits

The same entropy computation is short in any numerical language. The tabs below show the discrete entropy function in three ecosystems; the Python tab is executed above, while the Julia and Rust versions are illustrative.

Code

import numpy as np


def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


print(round(entropy_bits([0.25, 0.25, 0.25, 0.25]), 6))
print(round(entropy_bits([0.5, 0.25, 0.25]), 6))

2.0
1.5

function entropy_bits(p::AbstractVector{<:Real})
    s = 0.0
    for pi in p
        if pi > 0
            s -= pi * log2(pi)
        end
    end
    return s
end

println(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0
println(entropy_bits([0.5, 0.25, 0.25]))         # 1.5

fn entropy_bits(p: &[f64]) -> f64 {
    p.iter()
        .filter(|&&pi| pi > 0.0)
        .map(|&pi| -pi * pi.log2())
        .sum()
}

fn main() {
    println!("{}", entropy_bits(&[0.25, 0.25, 0.25, 0.25])); // 2.0
    println!("{}", entropy_bits(&[0.5, 0.25, 0.25]));        // 1.5
}

40.7 References

Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. https://doi.org/10.1002/047174882X
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. ISBN 978-0521642989. https://www.inference.org.uk/itprnn/book.pdf
Tishby, N., Pereira, F. C., and Bialek, W. (1999). The Information Bottleneck Method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 368-377. https://doi.org/10.48550/arXiv.physics/0004057
Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9), 1098-1101. https://doi.org/10.1109/JRPROC.1952.273898
Tishby, N., and Zaslavsky, N. (2015). Deep Learning and the Information Bottleneck Principle. 2015 IEEE Information Theory Workshop (ITW), 1-5. https://doi.org/10.1109/ITW.2015.7133169
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, Chapter 3: Probability and Information Theory. MIT Press. ISBN 978-0262035613. https://www.deeplearningbook.org/contents/prob.html

# Information Theory and Entropy Information theory is the mathematical study of how to quantify, store, and communicate information. It was founded almost single-handedly by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication," which introduced a precise way to measure the information content of a random source. The central object of that theory is entropy, a number that captures how uncertain we are about an outcome before we observe it, or equivalently how much we learn on average once we do. For machine learning practitioners, entropy is not a historical curiosity. It sits underneath cross entropy loss, mutual information objectives, decision tree splitting criteria, variational inference, and the very notion of how many bits a model needs to describe its data. This chapter builds entropy from first principles, develops the related quantities of joint and conditional entropy, explains the source coding theorem that gives entropy its operational meaning, extends the idea to continuous variables through differential entropy, and closes with why all of this matters when we train models. ## 1. Self-Information and Shannon Entropy ### 1.1 Quantifying surprise Suppose a discrete random variable $X$ takes a value $x$ with probability $p(x)$. We want a function that measures how surprising or informative the event $X = x$ is. Intuitively, rare events are surprising and common events are not. If the sun rises tomorrow, an event of probability near one, we learn almost nothing. If a fair coin lands on its edge, an event of tiny probability, we learn a great deal. Shannon argued that any reasonable measure of information should satisfy a few natural requirements. It should depend only on the probability of the event, not on what the event is. It should decrease as probability increases, so that more likely events carry less information. It should be continuous in the probability. And critically, the information of two independent events occurring together should be the sum of their individual informations. If you learn the result of two independent coin flips, you should gain twice the information of one flip. The only function (up to a choice of base) that satisfies these constraints is the logarithm. We define the self-information, also called the surprisal, of the event $X = x$ as $$ I(x) = -\log p(x) = \log \frac{1}{p(x)}. $$ We can make the uniqueness claim precise. Write the information of an event as a function $f$ of its probability alone, $I = f(p)$. The additivity requirement says that for two independent events with probabilities $p$ and $q$, the joint event of probability $pq$ carries information $f(pq) = f(p) + f(q)$. This is the Cauchy functional equation in multiplicative form. Substituting $p = e^{-u}$ and $q = e^{-v}$ and writing $g(u) = f(e^{-u})$ turns it into $g(u + v) = g(u) + g(v)$, the additive Cauchy equation. Under the continuity (or even mere monotonicity) requirement, the only solutions are linear, $g(u) = c\,u$, which means $f(p) = -c \log p$ for some constant $c > 0$. The sign of $c$ is fixed by the requirement that information decrease in $p$, and its magnitude only sets the base of the logarithm. Hence, up to the choice of unit, $I(x) = -\log p(x)$ is forced. The additivity requirement is therefore exactly the property that the logarithm of a product is the sum of logarithms. For independent events $x$ and $y$, $p(x, y) = p(x) p(y)$, so $I(x, y) = -\log(p(x) p(y)) = I(x) + I(y)$. The base of the logarithm sets the unit of information. Base 2 gives bits, which is the most common choice in computing and communication. Base $e$ gives nats, which appear naturally in calculus-heavy machine learning derivations. Base 10 gives dits or hartleys. One bit is the information gained from observing the outcome of a single fair coin flip, since $-\log_2(1/2) = 1$. ### 1.2 Defining Shannon entropy Self-information describes a single outcome. To characterize the whole random variable, we average self-information over all possible outcomes, weighting each by its probability. This expectation is the Shannon entropy: $$ H(X) = \mathbb{E}_{x \sim p}\left[I(x)\right] = -\sum_{x} p(x) \log p(x). $$ By convention $0 \log 0 = 0$, justified because $p \log p \to 0$ as $p \to 0$. Entropy is the expected surprise of the source, the average number of bits we are unsure about before we see the outcome. Consider a Bernoulli variable with probability $p$ of success. Its entropy is the binary entropy function $$ H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p). $$ This function is zero when $p = 0$ or $p = 1$, because the outcome is certain and there is nothing to learn. It reaches its maximum of exactly one bit at $p = 1/2$, the fair coin, where uncertainty is greatest. The shape of $H(p)$, a smooth hump peaking in the middle, is the canonical picture of entropy as uncertainty. ### 1.3 Key properties Entropy is always nonnegative, $H(X) \geq 0$. To see it, note that for each outcome $p(x) \in [0, 1]$, so $\log p(x) \leq 0$ and the term $-p(x) \log p(x) \geq 0$. A sum of nonnegative terms is nonnegative. Equality $H(X) = 0$ holds if and only if every term vanishes, which requires each $p(x)$ to be $0$ or $1$. Since the probabilities sum to one, exactly one outcome has $p(x) = 1$, so the distribution is degenerate, placing all mass on a single outcome. Among all distributions over an alphabet of $n$ symbols, the uniform distribution maximizes entropy, achieving $H = \log n$. We can prove this directly with Lagrange multipliers. Maximize $H(p) = -\sum_{i=1}^{n} p_i \log p_i$ subject to the normalization constraint $\sum_{i=1}^{n} p_i = 1$. Form the Lagrangian $$ \mathcal{L}(p, \lambda) = -\sum_{i=1}^{n} p_i \log p_i + \lambda\left(\sum_{i=1}^{n} p_i - 1\right). $$ Differentiating with respect to a single $p_i$ and using $\frac{d}{dp_i}(p_i \log p_i) = \log p_i + 1$ gives $$ \frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda = 0 \quad\Longrightarrow\quad p_i = e^{\lambda - 1}. $$ The stationary condition makes every $p_i$ equal to the same constant, independent of $i$. Imposing $\sum_i p_i = 1$ forces $p_i = 1/n$ for all $i$, the uniform distribution. Because $-p \log p$ is concave and the constraint set is convex, this critical point is the global maximum, and substituting back gives $H = -\sum_i \frac{1}{n} \log \frac{1}{n} = \log n$. This is the formal statement that uniformity is the state of maximum uncertainty, and it is the seed of the maximum entropy principle, which selects the least committed distribution consistent with known constraints. Entropy also enjoys invariance under relabeling. Permuting the symbols leaves $H(X)$ unchanged because the formula depends only on the multiset of probabilities, not on the identities attached to them. ## 2. Joint and Conditional Entropy ### 2.1 Joint entropy When two random variables $X$ and $Y$ are considered together, their joint entropy measures the total uncertainty in the pair. It is just the entropy of the joint distribution $p(x, y)$: $$ H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y). $$ If $X$ and $Y$ are independent, the joint entropy decomposes additively, $H(X, Y) = H(X) + H(Y)$, recovering the intuition that independent sources of uncertainty simply add. When they are dependent, the joint entropy is strictly less than the sum, because knowing one variable tells us something about the other and reduces the combined uncertainty. ### 2.2 Conditional entropy Conditional entropy quantifies the uncertainty that remains in $Y$ once we know $X$. For a fixed value $X = x$, the residual uncertainty is $H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x)$. Averaging over $x$ gives the conditional entropy: $$ H(Y \mid X) = \sum_{x} p(x) \, H(Y \mid X = x) = -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x). $$ This is the expected number of additional bits needed to specify $Y$ after $X$ has been revealed. Conditioning can never increase entropy on average, a result stated as $H(Y \mid X) \leq H(Y)$, with equality if and only if $X$ and $Y$ are independent. Information cannot hurt: learning $X$ either reduces our uncertainty about $Y$ or leaves it unchanged. Note the qualifier "on average," because for a specific value of $X$ the residual uncertainty $H(Y \mid X = x)$ can exceed $H(Y)$. ### 2.3 The chain rule Joint and conditional entropy are tied together by the chain rule, which mirrors the chain rule of probability: $$ H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y). $$ This identity is not an assumption; it follows from the definitions and the probability factorization $p(x, y) = p(x)\, p(y \mid x)$. Starting from the joint entropy and taking the logarithm of the product, $$ \begin{aligned} H(X, Y) &= -\sum_{x}\sum_{y} p(x, y) \log p(x, y) \\ &= -\sum_{x}\sum_{y} p(x, y) \log\big(p(x)\, p(y \mid x)\big) \\ &= -\sum_{x}\sum_{y} p(x, y) \log p(x) \;-\; \sum_{x}\sum_{y} p(x, y) \log p(y \mid x). \end{aligned} $$ In the first term, summing $p(x, y)$ over $y$ collapses the joint to the marginal $p(x)$, so it becomes $-\sum_x p(x) \log p(x) = H(X)$. The second term is exactly the definition of $H(Y \mid X)$. Hence $H(X, Y) = H(X) + H(Y \mid X)$. Repeating the argument with the roles of $X$ and $Y$ swapped gives the symmetric form. The total uncertainty in the pair equals the uncertainty in the first variable plus the leftover uncertainty in the second given the first. The rule generalizes to any number of variables, $$ H(X_1, X_2, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}), $$ which is exactly the information-theoretic counterpart of factorizing a joint distribution into a product of conditionals. This decomposition is what makes autoregressive language models tractable: the entropy of a whole sequence is the sum of per-token conditional entropies. ### 2.4 Mutual information The reduction in uncertainty about $Y$ from learning $X$ is the mutual information, $$ I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y). $$ Mutual information is symmetric, nonnegative, and zero exactly when the variables are independent. It can be written as the Kullback-Leibler divergence between the joint distribution and the product of the marginals, $I(X; Y) = D_{\mathrm{KL}}(p(x, y) \,\|\, p(x) p(y))$, which makes precise the idea that mutual information measures how far two variables are from being independent. It is the workhorse quantity behind feature selection, representation learning objectives such as InfoNCE, and the information bottleneck framework. ### 2.5 A worked example Concrete numbers make these quantities tangible. Consider two binary variables $X$ and $Y$ with the joint distribution $$ \begin{array}{c|cc} p(x, y) & Y = 0 & Y = 1 \\ \hline X = 0 & 1/2 & 1/4 \\ X = 1 & 0 & 1/4 \end{array} $$ The marginals are $p(X = 0) = 1/2 + 1/4 = 3/4$ and $p(X = 1) = 1/4$, while $p(Y = 0) = 1/2$ and $p(Y = 1) = 1/2$. The marginal entropy of $X$ is $$ H(X) = -\tfrac{3}{4}\log_2 \tfrac{3}{4} - \tfrac{1}{4}\log_2 \tfrac{1}{4} = 0.75 \times 0.415 + 0.25 \times 2 \approx 0.811 \text{ bits}, $$ and $Y$ is fair, so $H(Y) = 1$ bit. The joint entropy sums $-p \log_2 p$ over the three nonzero cells (recalling $0 \log 0 = 0$): $$ H(X, Y) = -\tfrac{1}{2}\log_2 \tfrac{1}{2} - \tfrac{1}{4}\log_2 \tfrac{1}{4} - \tfrac{1}{4}\log_2 \tfrac{1}{4} = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits}. $$ The chain rule then gives the conditional entropy without a fresh summation, $H(Y \mid X) = H(X, Y) - H(X) = 1.5 - 0.811 = 0.689$ bits, and the mutual information is $I(X; Y) = H(Y) - H(Y \mid X) = 1 - 0.689 = 0.311$ bits. Knowing $X$ removes about a third of a bit of uncertainty about $Y$, which matches intuition: when $X = 1$ the value of $Y$ is certain ($Y = 1$), so observing $X$ is sometimes very informative. ## 3. The Source Coding Theorem ### 3.1 Entropy as a compression limit So far entropy has been an abstract average. The source coding theorem, also called Shannon's first theorem or the noiseless coding theorem, gives it concrete operational meaning: entropy is the fundamental limit on lossless compression. No matter how clever your encoding scheme, you cannot on average represent the output of a source using fewer bits than its entropy. Formally, consider a source emitting symbols independently from distribution $p$. A uniquely decodable binary code assigns to each symbol $x$ a codeword of length $\ell(x)$ bits. The expected codeword length is $L = \sum_x p(x) \ell(x)$. The theorem states that any uniquely decodable code satisfies $$ L \geq H(X), $$ and that there exist codes whose expected length comes within one bit of this bound, $L < H(X) + 1$. By encoding long blocks of $n$ symbols at a time, the per-symbol overhead shrinks to zero, so the expected length per symbol can be pushed arbitrarily close to $H(X)$. ### 3.2 Why the limit holds The lower bound rests on the Kraft inequality, which says that the codeword lengths of any uniquely decodable code must satisfy $\sum_x 2^{-\ell(x)} \leq 1$. Combining this constraint with the nonnegativity of the Kullback-Leibler divergence yields $L \geq H(X)$ directly. The intuition is that an efficient code should assign short codewords to frequent symbols and long codewords to rare ones, and the optimal length for a symbol of probability $p(x)$ turns out to be $-\log_2 p(x)$, precisely its self-information. Averaging those ideal lengths gives the entropy. ### 3.3 Practical codes Huffman coding constructs an optimal prefix code by greedily merging the two least probable symbols, and it achieves the best possible expected length among codes that assign an integer number of bits per symbol. Because of the integer constraint, Huffman codes can be up to one bit per symbol away from entropy. Arithmetic coding and the more recent asymmetric numeral systems sidestep the integer limitation by encoding entire messages as a single number, reaching the entropy bound far more tightly. These ideas live inside everyday tools such as gzip, PNG, and modern neural compression pipelines. The deep connection for machine learning is that a probabilistic model is itself a compressor: a good model assigns high probability to the data, which means short codewords, which means the data can be described in few bits. Compression and prediction are two views of the same problem. ## 4. Differential Entropy ### 4.1 Extending entropy to continuous variables Real-valued quantities such as pixel intensities, embeddings, and network activations live on a continuum, so we need an analogue of entropy for continuous random variables. For a variable $X$ with probability density $f(x)$, the differential entropy is defined by replacing the sum with an integral: $$ h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx. $$ As a worked example, take a Gaussian $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\big(-\frac{(x - \mu)^2}{2\sigma^2}\big)$ and compute its differential entropy in nats. The log density is $\log f(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$, so $$ \begin{aligned} h(X) &= -\int f(x) \log f(x)\, dx \\ &= \tfrac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f(x)\, dx}_{=\,1} \;+\; \frac{1}{2\sigma^2}\underbrace{\int (x - \mu)^2 f(x)\, dx}_{=\,\sigma^2} \\ &= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2} = \tfrac{1}{2}\log(2\pi e\,\sigma^2). \end{aligned} $$ The second integral is just the variance, which equals $\sigma^2$ by definition, and the $\frac{1}{2}$ is folded into the logarithm using $\frac{1}{2} = \frac{1}{2}\log e$. The result $\frac{1}{2}\log(2\pi e\,\sigma^2)$ grows with the variance: a wider spread means more uncertainty. Among all continuous distributions with a fixed variance, the Gaussian is the one that maximizes differential entropy, a fact that helps explain why the normal distribution is such a natural default. ### 4.2 Differences from discrete entropy Differential entropy looks like its discrete cousin, but it behaves differently and several intuitions break. It can be negative, as happens for a uniform distribution on an interval shorter than one. It is not invariant under a change of variables: rescaling $X$ by a factor $a$ shifts the differential entropy by $\log |a|$, because stretching the axis changes the density. And it does not represent an absolute number of bits, since digitizing a continuous value to finite precision requires infinitely many bits in the limit. The reason for these quirks is that differential entropy is really the discrete entropy of a finely quantized version of $X$ minus a term that diverges as the quantization grows infinitely fine. The infinite part is discarded in the definition, leaving a quantity that is best understood as relative rather than absolute. Despite these caveats, differential entropy is well behaved in the differences and ratios that matter. Mutual information between continuous variables, defined through differential entropies, remains nonnegative and invariant under invertible transformations, which is why it is a dependable objective even when the underlying entropies are not. ## 5. Why Entropy Matters in Machine Learning ### 5.1 Cross entropy and the standard training loss The most pervasive appearance of information theory in machine learning is the cross entropy loss used to train classifiers and language models. Given a true distribution $p$ and a model distribution $q$, the cross entropy is $$ H(p, q) = -\sum_{x} p(x) \log q(x). $$ It measures the average number of bits needed to encode samples from $p$ using a code optimized for $q$. Cross entropy decomposes as $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\| \, q)$, where the Kullback-Leibler divergence is the extra cost of using the wrong distribution. Since $H(p)$ is fixed by the data, minimizing cross entropy is the same as minimizing the divergence between the model and the data, which is in turn equivalent to maximum likelihood estimation. Every time a network is trained with a softmax output and a negative log likelihood objective, it is being pushed to compress the training data, and its loss in nats is literally a description length. ### 5.2 Decision trees and feature selection Classical machine learning uses entropy directly as a splitting criterion. Decision tree algorithms such as ID3 and C4.5 choose the feature that yields the largest information gain, defined as the reduction in label entropy after conditioning on that feature, $\mathrm{IG}(Y, X) = H(Y) - H(Y \mid X)$. This is exactly the mutual information between the feature and the label. Splits that most reduce uncertainty about the target are preferred, giving a principled and interpretable basis for greedy tree construction. The same mutual information criterion underlies filter methods for feature selection, which rank candidate features by how much they tell us about the target. ### 5.3 Representation learning and regularization Modern representation learning leans heavily on entropy and mutual information. Contrastive methods maximize a lower bound on the mutual information between different views of the same input. The information bottleneck principle frames learning as finding a representation $T$ that is maximally informative about the label $Y$ while being maximally compressive about the input $X$, trading off $I(X; T)$ against $I(T; Y)$. Variational autoencoders include a Kullback-Leibler term that penalizes the divergence between the approximate posterior and a prior, an entropy-based regularizer that controls how much information the latent code carries. In reinforcement learning, maximum entropy methods add an entropy bonus to the objective so that policies stay stochastic and keep exploring, since a high-entropy policy hedges across actions rather than collapsing prematurely onto one. ### 5.4 Uncertainty, calibration, and beyond Entropy gives a direct readout of a model's uncertainty. The entropy of a classifier's predicted distribution measures how confident it is: low entropy means a sharp, confident prediction, while high entropy means the model is hedging across classes. This signal drives active learning, where the system queries labels for the inputs it is most uncertain about, and it supports out-of-distribution detection and selective prediction, where low-confidence inputs are flagged for human review. Perplexity, the standard metric for language models, is simply two raised to the cross entropy, so reporting perplexity is reporting an exponentiated bit count. From the loss function that trains a network, to the criterion that grows a tree, to the metric that evaluates a language model, information theory supplies the common currency in which learning is measured. Entropy is the bridge between probability and bits, and understanding it turns many machine learning objectives from opaque formulas into statements about average surprise. ## 6. Computing Entropy in Practice The definitions above are easy to turn into code. The cell below plots the binary entropy function across the full range of $p$, confirming that it peaks at exactly one bit when $p = 1/2$, and then estimates the entropy of an empirical distribution drawn from samples. The empirical entropy of a roughly uniform four-symbol source should sit just under the theoretical maximum of $\log_2 4 = 2$ bits. ```{python} import numpy as np import matplotlib.pyplot as plt rng = np.random.default_rng(42) def entropy(p, base=2): """Shannon entropy of a probability vector, with 0 log 0 = 0.""" p = np.asarray(p, dtype=float) p = p[p > 0] return float(-np.sum(p * (np.log(p) / np.log(base)))) def binary_entropy(p, base=2): p = np.asarray(p, dtype=float) out = np.zeros_like(p) mask = (p > 0) & (p < 1) pm = p[mask] out[mask] = -(pm * np.log(pm) + (1 - pm) * np.log(1 - pm)) / np.log(base) return out # Binary entropy curve. ps = np.linspace(0, 1, 501) Hb = binary_entropy(ps) print("max binary entropy:", round(Hb.max(), 6), "bits at p =", round(ps[np.argmax(Hb)], 3)) # Empirical distribution from 10,000 samples over 4 symbols. samples = rng.integers(0, 4, size=10000) counts = np.bincount(samples, minlength=4) emp = counts / counts.sum() print("empirical probabilities:", np.round(emp, 4)) print("empirical entropy:", round(entropy(emp), 6), "bits") print("uniform maximum:", round(np.log2(4), 6), "bits") fig, ax = plt.subplots(figsize=(6, 4)) ax.plot(ps, Hb, color="C0", lw=2) ax.axvline(0.5, ls="--", color="grey", alpha=0.7) ax.set_xlabel("p (probability of success)") ax.set_ylabel("H(p) in bits") ax.set_title("Entropy of a Bernoulli variable") fig.tight_layout() plt.show() ``` The same entropy computation is short in any numerical language. The tabs below show the discrete entropy function in three ecosystems; the Python tab is executed above, while the Julia and Rust versions are illustrative. ::: {.panel-tabset} ## Python ```{python} import numpy as np def entropy_bits(p): p = np.asarray(p, dtype=float) p = p[p > 0] return float(-np.sum(p * np.log2(p))) print(round(entropy_bits([0.25, 0.25, 0.25, 0.25]), 6)) print(round(entropy_bits([0.5, 0.25, 0.25]), 6)) ``` ## Julia ```julia function entropy_bits(p::AbstractVector{<:Real}) s = 0.0 for pi in p if pi > 0 s -= pi * log2(pi) end end return s end println(entropy_bits([0.25, 0.25, 0.25, 0.25])) # 2.0 println(entropy_bits([0.5, 0.25, 0.25])) # 1.5 ``` ## Rust ```rust fn entropy_bits(p: &[f64]) -> f64 { p.iter() .filter(|&&pi| pi > 0.0) .map(|&pi| -pi * pi.log2()) .sum() } fn main() { println!("{}", entropy_bits(&[0.25, 0.25, 0.25, 0.25])); // 2.0 println!("{}", entropy_bits(&[0.5, 0.25, 0.25])); // 1.5 } ``` ::: ## References 1. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x 2. Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. https://doi.org/10.1002/047174882X 3. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. ISBN 978-0521642989. https://www.inference.org.uk/itprnn/book.pdf 4. Tishby, N., Pereira, F. C., and Bialek, W. (1999). The Information Bottleneck Method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 368-377. https://doi.org/10.48550/arXiv.physics/0004057 5. Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9), 1098-1101. https://doi.org/10.1109/JRPROC.1952.273898 6. Tishby, N., and Zaslavsky, N. (2015). Deep Learning and the Information Bottleneck Principle. 2015 IEEE Information Theory Workshop (ITW), 1-5. https://doi.org/10.1109/ITW.2015.7133169 7. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, Chapter 3: Probability and Information Theory. MIT Press. ISBN 978-0262035613. https://www.deeplearningbook.org/contents/prob.html