KL Divergence and Mutual Information
Information theory gives machine learning a precise vocabulary for talking about uncertainty, surprise, and the relationship between random variables. Two quantities sit at the center of this vocabulary: the Kullback-Leibler divergence, which measures how one probability distribution differs from another, and mutual information, which measures how much knowing one variable tells us about another. These objects appear throughout modern machine learning. They define the loss functions we minimize, the bounds we optimize in variational inference, and the objectives that shape learned representations. This chapter develops both quantities from first principles, examines their key properties, and traces the many roles they play in practice.
41.1 1. Relative Entropy and Its Properties
41.1.1 1.1 From Entropy to Relative Entropy
Recall that the Shannon entropy of a discrete distribution \(p\) over an alphabet \(\mathcal{X}\) is
\[H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),\]
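which measures the average surprise, or the expected number of nats (when using natural log) needed to encode a sample from \(p\) under an optimal code. The Kullback-Leibler divergence, also called relative entropy, asks a related but distinct question. Suppose the true distribution is \(p\), but we design our code as though the distribution were \(q\). How many extra nats do we pay on average for using the wrong model? That penalty is the KL divergence:
\[D_{\mathrm{KL}}(p \parallel q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right].\]
For continuous distributions with densities, the sum becomes an integral:
\[D_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.\]
The KL divergence can be decomposed as the difference between a cross entropy and an entropy:
\[D_{\mathrm{KL}}(p \parallel q) = \underbrace{-\sum_x p(x) \log q(x)}_{H(p, q)} - \underbrace{\left(-\sum_x p(x) \log p(x)\right)}_{H(p)}.\]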
This decomposition explains why minimizing cross entropy, the workhorse loss of classification and language modeling, is equivalent to minimizing KL divergence. When \(p\) is the fixed empirical data distribution, \(H(p)\) is a constant, so minimizing \(H(p, q)\) over the model \(q\) is exactly minimizing \(D_{\mathrm{KL}}(p \parallel q)\). Cross entropy training is KL minimization in disguise.
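The decomposition can be checked numerically. The sketch below is illustrative only: the helper names `entropy`, `cross_entropy`, and `kl_divergence` and the three-class distributions are assumptions chosen for the example, not part of the chapter.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; outcomes with p(x) = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy H(p, q) = -sum_x p(x) log q(x), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # q assigns zero mass where p has mass
            total -= pi * math.log(qi)
    return total

def kl_divergence(p, q):
    """D_KL(p || q) in nats; infinite if q(x) = 0 somewhere p(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

# Hypothetical three-class example (values chosen arbitrarily).
p = [0.7, 0.2, 0.1]  # fixed "data" distribution
q = [0.5, 0.3, 0.2]  # model prediction

gap = cross_entropy(p, q) - entropy(p)
print(f"H(p, q) - H(p) = {gap:.6f}")
print(f"D_KL(p || q)   = {kl_divergence(p, q):.6f}")
```

The two printed values agree up to floating-point rounding. Since \(H(p)\) does not depend on \(q\), any change to \(q\) moves the cross entropy and the KL divergence by the same amount.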
41.1.2 1.2 Non-negativity and the Gibbs Inequality
The single most important property of relative entropy is that it is never negative:
\[D_{\mathrm{KL}}(p \parallel q) \geq 0,\]
with equality if and only if \(p = q\) almost everywhere. This fact, sometimes called the Gibbs inequality, follows from the concavity of the logarithm. We give the full proof two ways, since both viewpoints recur throughout the chapter.
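Proof via Jensen’s inequality. Jensen’s inequality states that for a convex function \(\varphi\) and a random variable \(Z\), \(\mathbb{E}[\varphi(Z)] \geq \varphi(\mathbb{E}[Z])\). The function \(\varphi(t) = -\log t\) is strictly convex on \((0, \infty)\), since \(\varphi''(t) = 1/t^2 > 0\). Take \(Z = q(x)/p(x)\) with \(x \sim p\), restricting attention to the support of \(p\) where \(p(x) > 0\). Then
\[D_{\mathrm{KL}}(p \parallel q) = \mathbb{E}_{x \sim p}\!\left[-\log \frac{q(x)}{p(x)}\right] \geq -\log \mathbb{E}_{x \sim p}\!\left[\frac{q(x)}{p(x)}\right] = -\log \sum_{x : p(x) > 0} p(x) \frac{q(x)}{p(x)} = -\log \sum_{x : p(x) > 0} q(x).\]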
The remaining sum is \(\sum_{x : p(x) > 0} q(x) \leq \sum_{x} q(x) = 1\), so \(-\log\) of it is at least \(-\log 1 = 0\), giving \(D_{\mathrm{KL}}(p \parallel q) \geq 0\). Because \(-\log\) is strictly convex, equality in Jensen’s step requires the ratio \(q(x)/p(x)\) to be constant on the support of \(p\); combined with both distributions summing to one, that constant must be \(1\), and the support of \(q\) cannot exceed that of \(p\), so \(p = q\) everywhere. \(\blacksquare\)
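Proof via the inequality \(\log t \leq t - 1\). A self-contained alternative avoids Jensen entirely. For all \(t > 0\) we have \(\log t \leq t - 1\), with equality only at \(t = 1\) (the line \(t - 1\) is tangent to the concave curve \(\log t\) at that point). Apply this to \(t = q(x)/p(x)\):
\[-D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x) \log \frac{q(x)}{p(x)} \leq \sum_x p(x)\!\left(\frac{q(x)}{p(x)} - 1\right) = \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0.\]
Hence \(D_{\mathrm{KL}}(p \parallel q) \geq 0\). Equality forces \(q(x)/p(x) = 1\) wherever \(p(x) > 0\), that is \(p = q\). \(\blacksquare\)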
Non-negativity is what makes KL usable as a loss: driving it toward zero genuinely drives \(q\) toward \(p\). It also underpins the variational bounds we will meet later, where a KL term appears as a non-negative gap that we can bound or push to zero.
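A quick empirical check of the Gibbs inequality can complement the proofs. The sketch below draws random pairs of distributions and records the smallest divergence seen. The helper `random_distribution`, the seed, and the alphabet size are assumptions for illustration, and a finite sample of course proves nothing by itself.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL(p || q) in nats for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(rng, k):
    """Hypothetical helper: normalize k positive random weights."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the run is repeatable
smallest = math.inf
for _ in range(1000):
    p = random_distribution(rng, 5)
    q = random_distribution(rng, 5)
    smallest = min(smallest, kl_divergence(p, q))

print(f"smallest KL over 1000 random pairs: {smallest:.6f}")
print(f"KL of a distribution with itself:   {kl_divergence(p, p):.6f}")
```

Every sampled divergence is non-negative, and the divergence of a distribution from itself is exactly zero, matching the equality condition.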
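41.1.3 1.3 Asymmetry and Why It Matters
The KL divergence is not a metric. In general,
\[D_{\mathrm{KL}}(p \parallel q) \neq D_{\mathrm{KL}}(q \parallel p),\]
and it fails the triangle inequality as well. The asymmetry is not a defect to be patched over; it carries real meaning, and choosing which direction to minimize is a genuine modeling decision. Consider what happens to each direction when the distributions disagree.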
In \(D_{\mathrm{KL}}(p \parallel q)\) the expectation is taken under \(p\). Wherever \(p(x)\) is large, the term \(\log(p(x)/q(x))\) is weighted heavily, so \(q\) is penalized severely for assigning low probability to regions where \(p\) has mass. If \(p(x) > 0\) but \(q(x) \to 0\), the integrand diverges. This direction therefore demands that \(q\) cover all the mass of \(p\). Conversely, where \(p(x) = 0\) the contribution is zero regardless of \(q\), so \(q\) is free to place mass on regions \(p\) ignores.
In \(D_{\mathrm{KL}}(q \parallel p)\) the roles flip. The expectation is under \(q\), so \(q\) is penalized for placing mass where \(p\) is small. This direction pressures \(q\) to avoid the low-probability regions of \(p\) and to concentrate where \(p\) is large, even if that means ignoring some of \(p\)’s mass entirely.
When the two arguments come from the same family, the asymmetry can be modest. When \(q\) is constrained to a simpler family than \(p\), for example a unimodal Gaussian approximating a multimodal target, the two directions produce strikingly different solutions. This is the subject of the next section.
41.1.4 1.4 A Useful Special Case: Gaussians
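For two multivariate Gaussians the KL divergence has a closed form, which is why so many practical methods choose Gaussian approximating families. For \(p = \mathcal{N}(\mu_1, \Sigma_1)\) and \(q = \mathcal{N}(\mu_2, \Sigma_2)\) in \(d\) dimensions,
\[D_{\mathrm{KL}}(p \parallel q) = \frac{1}{2}\left[ \log \frac{\lvert \Sigma_2 \rvert}{\lvert \Sigma_1 \rvert} - d + \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right].\]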
In the common case where \(q\) is a standard normal \(\mathcal{N}(0, I)\), this collapses to a simple expression in the means and variances of \(p\), which is precisely the regularization term that appears in the variational autoencoder objective. The closed form means the KL term can be computed and differentiated exactly, with no sampling, a substantial practical advantage.
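As a minimal sketch of the closed form, the code below specializes it to diagonal covariances, where the determinant, trace, and quadratic form reduce to sums over coordinates. The restriction to diagonal covariances and the function names `kl_diag_gaussians` and `kl_to_standard_normal` are assumptions made to keep the example dependency-free. The full-covariance case would need a linear algebra library.

```python
import math

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """D_KL(N(mu1, diag(var1)) || N(mu2, diag(var2))) in nats.

    Per coordinate: log(var2/var1) - 1 + var1/var2 + (mu2 - mu1)^2 / var2,
    summed and halved, which is the closed form with diagonal covariances.
    """
    total = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        total += math.log(v2 / v1) - 1.0 + v1 / v2 + (m2 - m1) ** 2 / v2
    return 0.5 * total

def kl_to_standard_normal(mu, var):
    """Special case q = N(0, I): 0.5 * sum(var + mu^2 - 1 - log var)."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))

# Hypothetical two-dimensional example.
mu, var = [0.5, -1.0], [0.8, 1.5]
general = kl_diag_gaussians(mu, var, [0.0, 0.0], [1.0, 1.0])
special = kl_to_standard_normal(mu, var)
print(f"general formula with q = N(0, I): {general:.6f}")
print(f"standard-normal special case:     {special:.6f}")
```

The two values coincide, and both functions are built from elementary differentiable operations, which is what lets this term be used without sampling.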
41.1.5 1.5 A Worked Example
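To make the asymmetry concrete, take two coins. Let \(p = (0.5, 0.5)\) be a fair coin and \(q = (0.9, 0.1)\) a biased one, with the two outcomes labelled heads and tails. In nats,
\[D_{\mathrm{KL}}(p \parallel q) = 0.5 \log \frac{0.5}{0.9} + 0.5 \log \frac{0.5}{0.1} = 0.5(-0.5878) + 0.5(1.6094) = 0.5108.\]
Reversing the arguments gives
\[D_{\mathrm{KL}}(q \parallel p) = 0.9 \log \frac{0.9}{0.5} + 0.1 \log \frac{0.1}{0.5} = 0.9(0.5878) + 0.1(-1.6094) = 0.3681.\]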
The two numbers differ, \(0.5108 \neq 0.3681\), which demonstrates the asymmetry in a single line of arithmetic. Both are positive, as the Gibbs inequality guarantees, and both would be zero only if the two coins were identical. The forward value is larger here because \(p\) places substantial mass on the tails outcome where \(q\) assigns only \(0.1\), and the forward direction, weighted by \(p\), punishes that mismatch heavily. The Python cell below reproduces these numbers and then scales the idea up to a continuous, multimodal target.
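A minimal sketch reproducing the two coin values follows. It is a reconstruction for illustration rather than the chapter's original cell, and the helper name `kl_divergence` is an assumption.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # biased coin

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(f"D_KL(p || q) = {forward:.4f}")  # 0.5108
print(f"D_KL(q || p) = {reverse:.4f}")  # 0.3681
```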
41.2 2. Forward versus Reverse KL
41.2.1 2.1 Two Objectives, Two Behaviors
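Suppose we have a complicated target distribution \(p\) and we want to approximate it with a tractable distribution \(q_\theta\) drawn from some family. There are two natural objectives:
\[\text{forward KL:} \quad \min_\theta D_{\mathrm{KL}}(p \parallel q_\theta), \qquad \text{reverse KL:} \quad \min_\theta D_{\mathrm{KL}}(q_\theta \parallel p).\]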
These names follow the convention that \(p\), the fixed target, comes first in the forward case. The two objectives induce qualitatively different approximations, captured by the slogans mean-seeking and mode-seeking.
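41.2.2 2.2 Forward KL Is Mean-Seeking
The forward objective expands as
\[D_{\mathrm{KL}}(p \parallel q_\theta) = -\sum_x p(x) \log q_\theta(x) + \text{const},\]
so minimizing it is maximizing the expected log-likelihood of \(q_\theta\) under samples from \(p\). This is exactly maximum likelihood estimation, which is why forward KL is the implicit objective whenever we fit a model by maximizing likelihood on data.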
The penalty \(-p(x) \log q_\theta(x)\) becomes enormous wherever \(p(x)\) is large but \(q_\theta(x)\) is near zero. To avoid this, \(q_\theta\) must put probability mass everywhere \(p\) does. The approximation is forced to be inclusive, or zero-avoiding: it cannot leave any high-probability region of \(p\) uncovered. If \(p\) is bimodal and \(q_\theta\) is a single Gaussian, the forward solution stretches to span both modes, placing its mean in the low-density valley between them so as to cover both. The result averages over the modes, hence mean-seeking. The cost is that \(q_\theta\) may assign substantial probability to the gap where \(p\) has essentially no mass.
41.2.3 2.3 Reverse KL Is Mode-Seeking
The reverse objective expands as
\[D_{\mathrm{KL}}(q_\theta \parallel p) = \sum_x q_\theta(x) \log \frac{q_\theta(x)}{p(x)} = -H(q_\theta) - \mathbb{E}_{x \sim q_\theta}[\log p(x)].\]
Here the expectation is under \(q_\theta\), so the penalty bites wherever \(q_\theta\) puts mass on regions where \(p\) is small. The approximation is therefore exclusive, or zero-forcing: it is heavily penalized for spreading into the low-density valley, so it prefers to retreat onto a single mode of \(p\) and model that mode well. Faced with a bimodal target, a single Gaussian under reverse KL will typically lock onto one mode and ignore the other entirely. This is mode-seeking behavior. The \(-H(q_\theta)\) term simultaneously discourages \(q_\theta\) from collapsing to a point, balancing the mode-fitting against a preference for spread.
41.2.4 2.4 Practical Consequences
The choice of direction is dictated as much by tractability as by which failure mode we prefer. Forward KL requires the expectation \(\mathbb{E}_{x \sim p}[\cdot]\), which we can estimate whenever we have samples from \(p\), as in supervised learning where the data are samples from the true distribution. It does not require evaluating the normalized density of \(p\). Reverse KL requires the expectation \(\mathbb{E}_{x \sim q_\theta}[\cdot]\), which is convenient because we control \(q_\theta\) and can sample from it freely, but it requires evaluating \(\log p(x)\), typically up to a normalizing constant. This makes reverse KL the natural choice for variational inference, where \(p\) is an intractable posterior known only up to normalization.
The mode-seeking tendency of reverse KL has visible consequences in generative modeling. Approximations trained with reverse KL can be overconfident, collapsing onto a subset of the target’s modes, a phenomenon related to mode collapse in some generative setups. Forward KL approximations err in the opposite direction, hedging by covering mass the target does not have, which can produce blurry or overly diffuse samples. Neither is universally correct. Symmetrized alternatives exist, such as the Jensen-Shannon divergence
\[D_{\mathrm{JS}}(p, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \parallel m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \parallel m), \quad m = \tfrac{1}{2}(p + q),\]
which is symmetric and bounded, and which underlies the original generative adversarial network objective.
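The symmetry and boundedness of the Jensen-Shannon divergence can be seen in a small sketch. The function name `js_divergence` and the example distributions are assumptions for illustration. With natural logarithms the upper bound is \(\log 2\), reached when the two distributions have disjoint support.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # m(x) > 0 wherever p(x) > 0 or q(x) > 0, so both KL terms are finite.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p, q = [0.5, 0.5], [0.9, 0.1]
js_pq, js_qp = js_divergence(p, q), js_divergence(q, p)
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])

print(f"JS(p, q) = {js_pq:.6f}, JS(q, p) = {js_qp:.6f}")
print(f"JS for disjoint supports = {js_disjoint:.6f}, log 2 = {math.log(2):.6f}")
```

Unlike either KL direction, the value stays finite even when the supports do not overlap, because each argument is compared against the mixture rather than against the other distribution.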
41.3 3. Mutual Information
41.3.1 3.1 Definition and Interpretation
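Mutual information measures the statistical dependence between two random variables \(X\) and \(Y\). It is defined as the KL divergence between the joint distribution and the product of the marginals:
\[I(X; Y) = D_{\mathrm{KL}}\!\left(p(x, y) \parallel p(x)\, p(y)\right) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.\]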
Because mutual information is a KL divergence, its non-negativity is immediate: \(I(X; Y) \geq 0\), with equality if and only if \(p(x, y) = p(x) p(y)\), that is, if and only if \(X\) and \(Y\) are independent. Mutual information thus quantifies exactly how far two variables are from being independent. Unlike correlation, which captures only linear association, mutual information captures dependence of any form.
41.3.2 3.2 The Entropy Decomposition
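Mutual information connects cleanly to entropy. We can derive the identity \(I(X; Y) = H(X) - H(X \mid Y)\) directly from the definition. Start from the joint form and split the log of the ratio using \(p(x, y) = p(y) p(x \mid y)\):
\[I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)}.\]
Now break the logarithm of the quotient into a difference, \(\log p(x \mid y) - \log p(x)\), and treat the two pieces separately:
\[I(X; Y) = \sum_{x, y} p(x, y) \log p(x \mid y) - \sum_{x, y} p(x, y) \log p(x).\]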
The second sum marginalizes \(y\) out, since \(\sum_y p(x, y) = p(x)\), leaving \(\sum_x p(x) \log p(x) = -H(X)\). The first sum is, by definition, the negative conditional entropy \(-H(X \mid Y) = \sum_{x, y} p(x, y) \log p(x \mid y)\). Substituting both,
\[I(X; Y) = -H(X \mid Y) - \big(-H(X)\big) = H(X) - H(X \mid Y).\]
By the symmetry of the joint distribution in \(x\) and \(y\), the same steps with the roles swapped give \(I(X; Y) = H(Y) - H(Y \mid X)\). Substituting the chain rule \(H(X, Y) = H(X) + H(Y \mid X)\), that is \(H(Y \mid X) = H(X, Y) - H(X)\), into this expression yields the third form, \(I(X; Y) = H(X) + H(Y) - H(X, Y)\). Collecting the symmetric relations,
\[I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).\]
Read the first form aloud: mutual information is the reduction in our uncertainty about \(X\) that results from observing \(Y\). It is the average number of nats we learn about \(X\) from one observation of \(Y\). The symmetry \(I(X; Y) = I(Y; X)\) is manifest and reflects that information is mutual: \(Y\) tells us as much about \(X\) as \(X\) tells us about \(Y\). This decomposition makes mutual information a natural objective whenever we want a learned variable to be informative about something, or, by minimizing it, when we want one variable to reveal as little as possible about another.
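The three entropy forms can be verified against the KL definition on a small table. The joint distribution below is a hypothetical example, and each conditional entropy is computed directly from its definition rather than through the chain rule, so the agreement is a real check.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y); rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.05, 0.55]]

px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Definition: KL divergence between the joint and the product of marginals.
mi_kl = sum(pxy * math.log(pxy / (px[i] * py[j]))
            for i, row in enumerate(joint)
            for j, pxy in enumerate(row) if pxy > 0)

# Conditional entropies from their definitions.
h_x_given_y = -sum(pxy * math.log(pxy / py[j])
                   for row in joint for j, pxy in enumerate(row) if pxy > 0)
h_y_given_x = -sum(pxy * math.log(pxy / px[i])
                   for i, row in enumerate(joint) for pxy in row if pxy > 0)

h_x, h_y = entropy(px), entropy(py)
h_xy = entropy([pxy for row in joint for pxy in row])

form1 = h_x - h_x_given_y
form2 = h_y - h_y_given_x
form3 = h_x + h_y - h_xy
print(f"KL definition: {mi_kl:.6f}")
print(f"three forms:   {form1:.6f} {form2:.6f} {form3:.6f}")
```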
41.3.3 3.3 Estimating Mutual Information
Mutual information is notoriously hard to estimate in high dimensions, because it requires knowledge of the joint density and the marginals, which are exactly the quantities we lack in practice. A great deal of modern work therefore focuses on variational bounds that replace the intractable quantity with a tractable optimization. A foundational lower bound is the one underlying InfoNCE, the objective at the heart of contrastive representation learning. Given a critic function \(f(x, y)\) and \(K\) samples, one positive pair drawn from the joint and the rest from the marginal, the InfoNCE objective lower bounds mutual information:
\[I(X; Y) \geq \log K + \mathbb{E}\!\left[\log \frac{e^{f(x, y_1)}}{\sum_{j=1}^{K} e^{f(x, y_j)}}\right],\]
where \((x, y_1)\) is the positive pair and \(y_2, \ldots, y_K\) are the negatives drawn from the marginal.
Maximizing this bound trains the critic to assign high scores to genuinely paired inputs and low scores to mismatched ones. The bound is biased and saturates at \(\log K\), so the number of negative samples caps how much information can be detected, a practical reason contrastive methods favor large batches. Other estimators, such as the Donsker-Varadhan bound used by MINE, trade different biases against variance. The difficulty of estimating mutual information accurately is itself an active research theme.
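To see the \(\log K\) ceiling in action, the sketch below evaluates an InfoNCE-style estimate on correlated Gaussian pairs whose true mutual information is known in closed form, \(-\tfrac{1}{2}\log(1 - \rho^2)\). Several choices here are assumptions for illustration: the critic is set to the exact log density ratio instead of being learned, and the batch form is used, in which each \(x_i\) takes its own \(y_i\) as the positive and the other \(y_j\) in the batch as negatives.

```python
import math
import random

def critic(x, y, rho):
    """Assumed critic: log p(y|x) - log p(y) for a standard bivariate Gaussian."""
    resid = y - rho * x
    return (-0.5 * math.log(1 - rho**2)
            - resid**2 / (2 * (1 - rho**2)) + 0.5 * y**2)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def infonce_estimate(xs, ys, rho):
    """log K plus the mean log-softmax score of each positive within the batch."""
    k = len(xs)
    total = 0.0
    for i in range(k):
        scores = [critic(xs[i], ys[j], rho) for j in range(k)]
        total += scores[i] - logsumexp(scores)
    return math.log(k) + total / k

rng = random.Random(0)
rho, K = 0.9, 256
xs = [rng.gauss(0, 1) for _ in range(K)]
ys = [rho * x + math.sqrt(1 - rho**2) * rng.gauss(0, 1) for x in xs]

estimate = infonce_estimate(xs, ys, rho)
true_mi = -0.5 * math.log(1 - rho**2)
print(f"InfoNCE estimate: {estimate:.4f}")
print(f"true MI:          {true_mi:.4f}")
print(f"ceiling log K:    {math.log(K):.4f}")
```

Because the positive score always appears in the denominator, each log-softmax term is at most zero, so the estimate can never exceed \(\log K\) regardless of the critic. With strongly dependent variables and a small batch, that ceiling rather than the true dependence is what the estimate reports.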
41.4 4. Computing KL Divergence in Practice
The cleanest way to feel the difference between forward and reverse KL is to fit a single Gaussian to a stubbornly bimodal target and watch where it lands. The forward fit, which is mean-seeking, spreads to straddle both modes and parks its mean in the empty valley between them. The reverse fit, which is mode-seeking, abandons one mode and models the other faithfully. The cell below sets up a symmetric two-component target, minimizes each objective numerically, checks the closed-form Gaussian KL of Section 1.4 against a direct numerical integral, and plots the two solutions.
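The following is a minimal sketch of that experiment, not the chapter's original cell. It assumes a target \(0.5\,\mathcal{N}(-3, 1) + 0.5\,\mathcal{N}(3, 1)\), approximates each KL integral by a Riemann sum on a fixed grid, and minimizes by brute-force grid search over \((\mu, \sigma)\) so that only the standard library is needed. Plotting is omitted, and all names are illustrative.

```python
import math

def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_target(x):
    """Log density of the assumed target 0.5*N(-3, 1) + 0.5*N(3, 1)."""
    a, b = log_normal_pdf(x, -3.0, 1.0), log_normal_pdf(x, 3.0, 1.0)
    m = max(a, b)
    return math.log(0.5) + m + math.log(math.exp(a - m) + math.exp(b - m))

DX = 0.1
XS = [-12.0 + DX * i for i in range(241)]  # integration grid on [-12, 12]
LOG_P = [log_target(x) for x in XS]

def forward_kl(mu, sigma):
    """Riemann-sum estimate of D_KL(p || q) with q = N(mu, sigma^2)."""
    return DX * sum(math.exp(lp) * (lp - log_normal_pdf(x, mu, sigma))
                    for x, lp in zip(XS, LOG_P))

def reverse_kl(mu, sigma):
    """Riemann-sum estimate of D_KL(q || p) with q = N(mu, sigma^2)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal_pdf(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

def grid_argmin(objective):
    """Brute-force search: mu in [-4, 4] step 0.5, sigma in [0.5, 4] step 0.1."""
    candidates = [(-4.0 + 0.5 * i, 0.5 + 0.1 * j)
                  for i in range(17) for j in range(36)]
    return min(candidates, key=lambda ms: objective(*ms))

def kl_gauss_1d(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)), the d = 1 case."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu_fwd, sigma_fwd = grid_argmin(forward_kl)
mu_rev, sigma_rev = grid_argmin(reverse_kl)
print(f"forward KL fit: mu = {mu_fwd:+.2f}, sigma = {sigma_fwd:.2f}")
print(f"reverse KL fit: mu = {mu_rev:+.2f}, sigma = {sigma_rev:.2f}")

# Check the closed form against the same Riemann-sum integration.
numeric = DX * sum(math.exp(log_normal_pdf(x, 0.0, 1.0))
                   * (log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 1.0, 2.0))
                   for x in XS)
closed = kl_gauss_1d(0.0, 1.0, 1.0, 2.0)
print(f"numeric KL = {numeric:.4f}, closed form = {closed:.4f}")
```

Because the target is symmetric, the reverse objective has two equally good minimizers, one per mode, and which one the grid search returns depends only on floating-point tie-breaking. A gradient-based optimizer would instead pick the mode nearest its initialization.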
The forward solution returns a mean near zero with a wide standard deviation, a single broad bump that covers both target modes and assigns spurious mass to the valley. The reverse solution returns a mean near one mode with a narrow standard deviation, ignoring the other mode entirely. The numerical and closed-form Gaussian KL values agree to four decimals, confirming the formula of Section 1.4.
41.5 5. Applications in Machine Learning
41.5.1 5.1 Variational Inference and the ELBO
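The most pervasive appearance of KL divergence in modern machine learning is the evidence lower bound. Suppose we have a latent variable model \(p_\theta(x, z) = p_\theta(x \mid z) p(z)\) and we want to maximize the marginal likelihood \(\log p_\theta(x)\), which requires the intractable posterior \(p_\theta(z \mid x)\). We introduce a tractable variational distribution \(q_\phi(z \mid x)\) and write
\[\log p_\theta(x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{ELBO}} + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \parallel p_\theta(z \mid x)\right).\]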
Because the KL term is non-negative, the first term is a lower bound on the log-evidence, hence its name. Maximizing the ELBO does double duty: it pushes up a bound on the likelihood and, equivalently, drives \(q_\phi\) toward the true posterior by shrinking the reverse KL gap. The ELBO can be rearranged into a reconstruction term and a regularization term:
\[\text{ELBO} = \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \parallel p(z)\right).\]
This is precisely the variational autoencoder objective. The first term rewards faithful reconstruction; the second pulls the approximate posterior toward the prior \(p(z)\), usually a standard Gaussian, using the closed-form KL of Section 1.4. Note that the gap minimized here is a reverse KL, which is why variational inference inherits the mode-seeking behavior discussed earlier and can underestimate posterior uncertainty.
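The identity "log-evidence equals ELBO plus reverse KL gap" can be checked exactly on a toy model where everything is tractable. The sketch below assumes a one-dimensional conjugate model, \(p(z) = \mathcal{N}(0, 1)\) and \(p(x \mid z) = \mathcal{N}(z, 1)\), for which the marginal is \(\mathcal{N}(0, 2)\) and the posterior is \(\mathcal{N}(x/2, 1/2)\). This model and the function names are illustrative choices, not the chapter's.

```python
import math

def kl_gauss_1d(mu1, var1, mu2, var2):
    """Closed-form D_KL(N(mu1, var1) || N(mu2, var2)) in one dimension."""
    return 0.5 * (math.log(var2 / var1) - 1.0 + var1 / var2
                  + (mu2 - mu1)**2 / var2)

def elbo(x, m, s2):
    """ELBO for the assumed toy model with q(z|x) = N(m, s2)."""
    # E_q[log N(x; z, 1)] = -0.5*log(2*pi) - 0.5*((x - m)^2 + s2)
    reconstruction = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m)**2 + s2)
    regularizer = kl_gauss_1d(m, s2, 0.0, 1.0)  # KL to the prior N(0, 1)
    return reconstruction - regularizer

def log_evidence(x):
    """log p(x) for the exact marginal p(x) = N(0, 2) of the toy model."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0

x = 1.0
results = []
for m, s2 in [(0.0, 1.0), (0.3, 0.8), (x / 2, 0.5)]:
    gap = kl_gauss_1d(m, s2, x / 2, 0.5)  # reverse KL to the exact posterior
    results.append((elbo(x, m, s2), gap))
    print(f"q = N({m:.2f}, {s2:.2f}): ELBO = {elbo(x, m, s2):.4f}, "
          f"gap = {gap:.4f}, sum = {elbo(x, m, s2) + gap:.4f}")
print(f"log p(x) = {log_evidence(x):.4f}")
```

The sum is the same for every choice of \(q\), and the gap vanishes only for the last candidate, which is the exact posterior, so the bound is tight there.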
41.5.2 5.2 Representation Learning
Mutual information offers a principled objective for learning representations. The InfoMax principle holds that a good representation \(Z\) of an input \(X\) should retain as much information about \(X\) as possible, suggesting we maximize \(I(X; Z)\). Modern self-supervised methods refine this idea. Contrastive approaches maximize the mutual information between different views or augmentations of the same input, encouraging the encoder to capture the underlying content that the two views share while discarding view-specific nuisance. The InfoNCE bound of Section 3.3 is the workhorse here, and its connection to mutual information gives these methods a clear information-theoretic reading even when, in practice, much of their success also depends on architectural and optimization details.
Mutual information also clarifies what we want a representation not to contain. In fair and private representation learning, one explicitly minimizes the mutual information between the representation and a sensitive attribute \(S\), seeking features that are informative for the task but reveal little about \(S\). The same minimization principle appears in disentanglement, where penalizing the mutual information among latent dimensions encourages factors of variation to separate.
41.5.3 5.3 The Information Bottleneck
The information bottleneck principle unifies many of these ideas into a single objective. Given an input \(X\) and a target \(Y\), we seek a representation \(Z\) that is maximally informative about \(Y\) while being maximally compressive of \(X\). Formally, we minimize
\[\mathcal{L}_{\mathrm{IB}} = I(X; Z) - \beta\, I(Z; Y),\]
where \(\beta > 0\) trades compression against prediction. The first term squeezes the representation, discarding information about the input; the second term insists that what survives the squeeze remains predictive of the target. The bottleneck framing gives a sharp account of generalization: a representation that has forgotten the irrelevant details of \(X\) while keeping the label-relevant ones should generalize better, because it has stripped away the noise on which overfitting feeds.
The deep variational information bottleneck makes this objective trainable by replacing the intractable mutual information terms with variational bounds, producing a loss that looks remarkably like a regularized variational autoencoder. The compression term \(I(X; Z)\) is upper bounded by a KL divergence between the encoder and a prior, exactly as in the ELBO, while the prediction term \(I(Z; Y)\) is lower bounded by a decoder log-likelihood. The information bottleneck thus ties together everything in this chapter: relative entropy as a regularizer, mutual information as a measure of relevance, and variational bounds as the bridge from elegant principle to runnable code.
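For small discrete variables the bottleneck objective can be computed exactly, with no variational bounds. The sketch below is illustrative only: it assumes a hypothetical joint \(p(x, y)\) over four inputs and two labels, assumes \(Z\) depends on \(Y\) only through \(X\), and compares three hand-written encoders \(p(z \mid x)\).

```python
import math

def mutual_information(joint):
    """I(A; B) in nats from a joint table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pa[i] * pb[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

def ib_terms(p_xy, encoder):
    """Return (I(X;Z), I(Z;Y)) for encoder[x][z] = p(z|x)."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(encoder[0])
    px = [sum(row) for row in p_xy]
    p_xz = [[px[x] * encoder[x][z] for z in range(n_z)] for x in range(n_x)]
    # Markov chain Y - X - Z: p(z, y) = sum_x p(x, y) p(z|x).
    p_zy = [[sum(p_xy[x][y] * encoder[x][z] for x in range(n_x))
             for y in range(n_y)] for z in range(n_z)]
    return mutual_information(p_xz), mutual_information(p_zy)

def ib_objective(p_xy, encoder, beta):
    i_xz, i_zy = ib_terms(p_xy, encoder)
    return i_xz - beta * i_zy

# Hypothetical data: inputs 0 and 1 favor label 0, inputs 2 and 3 favor label 1.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]

identity = [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)]
merged = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # z = x // 2
constant = [[1.0] for _ in range(4)]

beta = 5.0
for name, enc in [("identity", identity), ("merged", merged), ("constant", constant)]:
    i_xz, i_zy = ib_terms(p_xy, enc)
    print(f"{name:9s} I(X;Z) = {i_xz:.4f}  I(Z;Y) = {i_zy:.4f}  "
          f"objective = {ib_objective(p_xy, enc, beta):.4f}")
```

The merged encoder keeps all the label-relevant information of the identity encoder at half the compression cost, so it attains the lowest objective of the three at this \(\beta\). The constant encoder compresses perfectly but predicts nothing.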
41.5.4 5.4 A Unifying View
Stepping back, KL divergence and mutual information recur because they answer the two questions that dominate learning. KL divergence asks how wrong our model is relative to the truth, and so it furnishes loss functions, regularizers, and the gaps in variational bounds. Mutual information asks how much one quantity tells us about another, and so it furnishes objectives for what a representation should keep and what it should discard. The asymmetry of KL forces a modeling choice between covering and concentrating; the non-negativity of both quantities is what makes them safe to optimize. Together they form a compact toolkit that reappears, lightly disguised, across supervised learning, generative modeling, and self-supervised representation learning.
41.6 References
Kullback, S., and Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79 to 86. https://doi.org/10.1214/aoms/1177729694
Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. https://doi.org/10.1002/047174882X
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379 to 423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Csiszar, I. (1975). I-Divergence Geometry of Probability Distributions and Minimization Problems. Annals of Probability, 3(1), 146 to 158. https://doi.org/10.1214/aop/1176996454
Kingma, D. P., and Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1312.6114
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859 to 877. https://doi.org/10.1080/01621459.2017.1285773
Tishby, N., Pereira, F. C., and Bialek, W. (1999). The Information Bottleneck Method. Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, 368 to 377. https://doi.org/10.48550/arXiv.physics/0004057
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2017). Deep Variational Information Bottleneck. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1612.00410
van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. https://doi.org/10.48550/arXiv.1807.03748
Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. (2018). Mutual Information Neural Estimation. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, 531 to 540. https://doi.org/10.48550/arXiv.1801.04062
Poole, B., Ozair, S., van den Oord, A., Alemi, A. A., and Tucker, G. (2019). On Variational Bounds of Mutual Information. Proceedings of the 36th International Conference on Machine Learning, PMLR 97, 5171 to 5180. https://doi.org/10.48550/arXiv.1905.06922
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems 27, 2672 to 2680. https://doi.org/10.48550/arXiv.1406.2661
Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. (2020). On Mutual Information Maximization for Representation Learning. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1907.13625
# KL Divergence and Mutual InformationInformation theory gives machine learning a precise vocabulary for talking about uncertainty, surprise, and the relationship between random variables. Two quantities sit at the center of this vocabulary: the Kullback-Leibler divergence, which measures how one probability distribution differs from another, and mutual information, which measures how much knowing one variable tells us about another. These objects appear throughout modern machine learning. They define the loss functions we minimize, the bounds we optimize in variational inference, and the objectives that shape learned representations. This chapter develops both quantities from first principles, examines their key properties, and traces the many roles they play in practice.## 1. Relative Entropy and Its Properties### 1.1 From Entropy to Relative EntropyRecall that the Shannon entropy of a discrete distribution $p$ over an alphabet $\mathcal{X}$ is$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$which measures the average surprise, or the expected number of nats (when using natural log) needed to encode a sample from $p$ under an optimal code. The Kullback-Leibler divergence, also called relative entropy, asks a related but distinct question. Suppose the true distribution is $p$, but we design our code as though the distribution were $q$. How many extra nats do we pay on average for using the wrong model? 
That penalty is the KL divergence:$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right].$$For continuous distributions with densities, the sum becomes an integral:$$D_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$The KL divergence can be decomposed as the difference between a cross entropy and an entropy:$$D_{\mathrm{KL}}(p \parallel q) = \underbrace{-\sum_x p(x) \log q(x)}_{H(p, q)} - \underbrace{\left(-\sum_x p(x) \log p(x)\right)}_{H(p)}.$$This decomposition explains why minimizing cross entropy, the workhorse loss of classification and language modeling, is equivalent to minimizing KL divergence. When $p$ is the fixed empirical data distribution, $H(p)$ is a constant, so minimizing $H(p, q)$ over the model $q$ is exactly minimizing $D_{\mathrm{KL}}(p \parallel q)$. Cross entropy training is KL minimization in disguise.### 1.2 Non-negativity and the Gibbs InequalityThe single most important property of relative entropy is that it is never negative:$$D_{\mathrm{KL}}(p \parallel q) \geq 0,$$with equality if and only if $p = q$ almost everywhere. This fact, sometimes called the Gibbs inequality, follows from the concavity of the logarithm. We give the full proof two ways, since both viewpoints recur throughout the chapter.**Proof via Jensen's inequality.** Jensen's inequality states that for a convex function $\varphi$ and a random variable $Z$, $\mathbb{E}[\varphi(Z)] \geq \varphi(\mathbb{E}[Z])$. The function $\varphi(t) = -\log t$ is strictly convex on $(0, \infty)$, since $\varphi''(t) = 1/t^2 > 0$. Take $Z = q(x)/p(x)$ with $x \sim p$, restricting attention to the support of $p$ where $p(x) > 0$. 
Then$$D_{\mathrm{KL}}(p \parallel q) = \mathbb{E}_{x \sim p}\!\left[-\log \frac{q(x)}{p(x)}\right] \geq -\log \mathbb{E}_{x \sim p}\!\left[\frac{q(x)}{p(x)}\right] = -\log \sum_{x : p(x) > 0} p(x) \frac{q(x)}{p(x)} = -\log \sum_{x : p(x) > 0} q(x).$$The remaining sum is $\sum_{x : p(x) > 0} q(x) \leq \sum_{x} q(x) = 1$, so $-\log$ of it is at least $-\log 1 = 0$, giving $D_{\mathrm{KL}}(p \parallel q) \geq 0$. Because $-\log$ is strictly convex, equality in Jensen's step requires the ratio $q(x)/p(x)$ to be constant on the support of $p$; combined with both distributions summing to one, that constant must be $1$, and the support of $q$ cannot exceed that of $p$, so $p = q$ everywhere. $\blacksquare$**Proof via the inequality $\log t \leq t - 1$.** A self-contained alternative avoids Jensen entirely. For all $t > 0$ we have $\log t \leq t - 1$, with equality only at $t = 1$ (the line $t - 1$ is tangent to the concave curve $\log t$ at that point). Apply this to $t = q(x)/p(x)$:$$-D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x) \log \frac{q(x)}{p(x)} \leq \sum_x p(x)\!\left(\frac{q(x)}{p(x)} - 1\right) = \sum_x q(x) - \sum_x p(x) = 1 - 1 = 0.$$Hence $D_{\mathrm{KL}}(p \parallel q) \geq 0$. Equality forces $q(x)/p(x) = 1$ wherever $p(x) > 0$, that is $p = q$. $\blacksquare$Non-negativity is what makes KL usable as a loss: driving it toward zero genuinely drives $q$ toward $p$. It also underpins the variational bounds we will meet later, where a KL term appears as a non-negative gap that we can bound or push to zero.### 1.3 Asymmetry and Why It MattersThe KL divergence is not a metric. In general,$$D_{\mathrm{KL}}(p \parallel q) \neq D_{\mathrm{KL}}(q \parallel p),$$and it fails the triangle inequality as well. The asymmetry is not a defect to be patched over; it carries real meaning, and choosing which direction to minimize is a genuine modeling decision. 
Consider what happens to each direction when the distributions disagree.In $D_{\mathrm{KL}}(p \parallel q)$ the expectation is taken under $p$. Wherever $p(x)$ is large, the term $\log(p(x)/q(x))$ is weighted heavily, so $q$ is penalized severely for assigning low probability to regions where $p$ has mass. If $p(x) > 0$ but $q(x) \to 0$, the integrand diverges. This direction therefore demands that $q$ cover all the mass of $p$. Conversely, where $p(x) = 0$ the contribution is zero regardless of $q$, so $q$ is free to place mass on regions $p$ ignores.In $D_{\mathrm{KL}}(q \parallel p)$ the roles flip. The expectation is under $q$, so $q$ is penalized for placing mass where $p$ is small. This direction pressures $q$ to avoid the low-probability regions of $p$ and to concentrate where $p$ is large, even if that means ignoring some of $p$'s mass entirely.When the two arguments come from the same family, the asymmetry can be modest. When $q$ is constrained to a simpler family than $p$, for example a unimodal Gaussian approximating a multimodal target, the two directions produce strikingly different solutions. This is the subject of the next section.### 1.4 A Useful Special Case: GaussiansFor two multivariate Gaussians the KL divergence has a closed form, which is why so many practical methods choose Gaussian approximating families. For $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$ in $d$ dimensions,$$D_{\mathrm{KL}}(p \parallel q) = \frac{1}{2}\left[ \log \frac{\lvert \Sigma_2 \rvert}{\lvert \Sigma_1 \rvert} - d + \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) \right].$$In the common case where $q$ is a standard normal $\mathcal{N}(0, I)$, this collapses to a simple expression in the means and variances of $p$, which is precisely the regularization term that appears in the variational autoencoder objective. 
The closed form means the KL term can be computed and differentiated exactly, with no sampling, a substantial practical advantage.### 1.5 A Worked ExampleTo make the asymmetry concrete, take two coins. Let $p = (0.5, 0.5)$ be a fair coin and $q = (0.9, 0.1)$ a biased one, with the two outcomes labelled heads and tails. In nats,$$D_{\mathrm{KL}}(p \parallel q) = 0.5 \log \frac{0.5}{0.9} + 0.5 \log \frac{0.5}{0.1} = 0.5(-0.5878) + 0.5(1.6094) = 0.5108.$$Reversing the arguments gives$$D_{\mathrm{KL}}(q \parallel p) = 0.9 \log \frac{0.9}{0.5} + 0.1 \log \frac{0.1}{0.5} = 0.9(0.5878) + 0.1(-1.6094) = 0.3681.$$The two numbers differ, $0.5108 \neq 0.3681$, which demonstrates the asymmetry in a single line of arithmetic. Both are positive, as the Gibbs inequality guarantees, and both would be zero only if the two coins were identical. The forward value is larger here because $p$ places substantial mass on the tails outcome where $q$ assigns only $0.1$, and the forward direction, weighted by $p$, punishes that mismatch heavily. The Python cell below reproduces these numbers and then scales the idea up to a continuous, multimodal target.## 2. Forward versus Reverse KL### 2.1 Two Objectives, Two BehaviorsSuppose we have a complicated target distribution $p$ and we want to approximate it with a tractable distribution $q_\theta$ drawn from some family. There are two natural objectives:$$\text{forward KL:} \quad \min_\theta D_{\mathrm{KL}}(p \parallel q_\theta), \qquad \text{reverse KL:} \quad \min_\theta D_{\mathrm{KL}}(q_\theta \parallel p).$$These names follow the convention that $p$, the fixed target, comes first in the forward case. 
The two objectives induce qualitatively different approximations, captured by the slogans mean-seeking and mode-seeking.### 2.2 Forward KL Is Mean-SeekingThe forward objective expands as$$D_{\mathrm{KL}}(p \parallel q_\theta) = -\sum_x p(x) \log q_\theta(x) + \text{const},$$so minimizing it is maximizing the expected log-likelihood of $q_\theta$ under samples from $p$. This is exactly maximum likelihood estimation, which is why forward KL is the implicit objective whenever we fit a model by maximizing likelihood on data.The penalty $-p(x) \log q_\theta(x)$ becomes enormous wherever $p(x)$ is large but $q_\theta(x)$ is near zero. To avoid this, $q_\theta$ must put probability mass everywhere $p$ does. The approximation is forced to be inclusive, or zero-avoiding: it cannot leave any high-probability region of $p$ uncovered. If $p$ is bimodal and $q_\theta$ is a single Gaussian, the forward solution stretches to span both modes, placing its mean in the low-density valley between them so as to cover both. The result averages over the modes, hence mean-seeking. The cost is that $q_\theta$ may assign substantial probability to the gap where $p$ has essentially no mass.### 2.3 Reverse KL Is Mode-SeekingThe reverse objective expands as$$D_{\mathrm{KL}}(q_\theta \parallel p) = \sum_x q_\theta(x) \log \frac{q_\theta(x)}{p(x)} = -H(q_\theta) - \mathbb{E}_{x \sim q_\theta}[\log p(x)].$$Here the expectation is under $q_\theta$, so the penalty bites wherever $q_\theta$ puts mass on regions where $p$ is small. The approximation is therefore exclusive, or zero-forcing: it is heavily penalized for spreading into the low-density valley, so it prefers to retreat onto a single mode of $p$ and model that mode well. Faced with a bimodal target, a single Gaussian under reverse KL will typically lock onto one mode and ignore the other entirely. This is mode-seeking behavior. 
The $-H(q_\theta)$ term simultaneously discourages $q_\theta$ from collapsing to a point, balancing the mode-fitting against a preference for spread.

### 2.4 Practical Consequences

The choice of direction is dictated as much by tractability as by which failure mode we prefer. Forward KL requires the expectation $\mathbb{E}_{x \sim p}[\cdot]$, which we can estimate whenever we have samples from $p$, as in supervised learning where the data are samples from the true distribution. It does not require evaluating the normalized density of $p$. Reverse KL requires the expectation $\mathbb{E}_{x \sim q_\theta}[\cdot]$, which is convenient because we control $q_\theta$ and can sample from it freely, but it requires evaluating $\log p(x)$, typically up to a normalizing constant. This makes reverse KL the natural choice for variational inference, where $p$ is an intractable posterior known only up to normalization.

The mode-seeking tendency of reverse KL has visible consequences in generative modeling. Approximations trained with reverse KL can be overconfident, collapsing onto a subset of the target's modes, a phenomenon related to mode collapse in some generative setups. Forward KL approximations err in the opposite direction, hedging by covering mass the target does not have, which can produce blurry or overly diffuse samples. Neither is universally correct. Symmetrized alternatives exist, such as the Jensen-Shannon divergence

$$D_{\mathrm{JS}}(p, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \parallel m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \parallel m), \quad m = \tfrac{1}{2}(p + q),$$

which is symmetric and bounded, and which underlies the original generative adversarial network objective.

## 3. Mutual Information

### 3.1 Definition and Interpretation

Mutual information measures the statistical dependence between two random variables $X$ and $Y$.
It is defined as the KL divergence between the joint distribution and the product of the marginals:

$$I(X; Y) = D_{\mathrm{KL}}\!\left(p(x, y) \parallel p(x)\, p(y)\right) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$

Because mutual information is a KL divergence, its non-negativity is immediate: $I(X; Y) \geq 0$, with equality if and only if $p(x, y) = p(x) p(y)$, that is, if and only if $X$ and $Y$ are independent. Mutual information thus quantifies exactly how far two variables are from being independent. Unlike correlation, which captures only linear association, mutual information captures dependence of any form.

### 3.2 The Entropy Decomposition

Mutual information connects cleanly to entropy. We can derive the identity $I(X; Y) = H(X) - H(X \mid Y)$ directly from the definition. Start from the joint form and split the log of the ratio using $p(x, y) = p(y) p(x \mid y)$:

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)}.$$

Now break the logarithm of the quotient into a difference, $\log p(x \mid y) - \log p(x)$, and treat the two pieces separately:

$$I(X; Y) = \sum_{x, y} p(x, y) \log p(x \mid y) - \sum_{x, y} p(x, y) \log p(x).$$

The second sum marginalizes $y$ out, since $\sum_y p(x, y) = p(x)$, leaving $\sum_x p(x) \log p(x) = -H(X)$. The first sum is, by definition, the negative conditional entropy $-H(X \mid Y) = \sum_{x, y} p(x, y) \log p(x \mid y)$. Substituting both,

$$I(X; Y) = -H(X \mid Y) - \big(-H(X)\big) = H(X) - H(X \mid Y).$$

By the symmetry of the joint distribution in $x$ and $y$, the same steps with the roles swapped give $I(X; Y) = H(Y) - H(Y \mid X)$. Substituting the chain rule $H(X, Y) = H(X) + H(Y \mid X)$, rearranged as $H(Y \mid X) = H(X, Y) - H(X)$, into this expression yields the third form, $I(X; Y) = H(X) + H(Y) - H(X, Y)$.
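
The identities derived above can be checked numerically on a small example. The sketch below is illustrative only: the $2 \times 3$ table `joint` is an arbitrary choice, and the helper names `entropy` and `mutual_information` are hypothetical, introduced here for the check. It compares the KL definition of $I(X; Y)$ with $H(X) - H(X \mid Y)$ and with $H(X) + H(Y) - H(X, Y)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(X;Y) computed as KL(p(x,y) || p(x)p(y)) for a 2-D joint table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

# Arbitrary illustrative joint over X (rows) and Y (columns); entries sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.05, 0.25]])

px = joint.sum(axis=1)
py = joint.sum(axis=0)
h_x, h_y, h_xy = entropy(px), entropy(py), entropy(joint)

# Conditional entropy H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), with p(x|y) = p(x,y)/p(y).
h_x_given_y = float(-np.sum(joint * np.log(joint / py)))

mi_kl = mutual_information(joint)
mi_cond = h_x - h_x_given_y
mi_joint = h_x + h_y - h_xy

print(f"KL form:              {mi_kl:.6f}")
print(f"H(X) - H(X|Y):        {mi_cond:.6f}")
print(f"H(X) + H(Y) - H(X,Y): {mi_joint:.6f}")
```

All three printed values should coincide up to floating-point error. Replacing `joint` with `np.outer(px, py)`, an independent joint with the same marginals, should drive every form to zero.
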
Collecting the symmetric relations,

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

Read the first form aloud: mutual information is the reduction in our uncertainty about $X$ that results from observing $Y$. It is the average number of nats we learn about $X$ from one observation of $Y$. The symmetry $I(X; Y) = I(Y; X)$ is manifest and reflects that information is mutual: $Y$ tells us as much about $X$ as $X$ tells us about $Y$. This decomposition makes mutual information a natural objective whenever we want a learned variable to be informative about something, or, by minimizing it, when we want one variable to reveal as little as possible about another.

### 3.3 Estimating Mutual Information

Mutual information is notoriously hard to estimate in high dimensions, because it requires knowledge of the joint density and the marginals, which are exactly the quantities we lack in practice. A great deal of modern work therefore focuses on variational bounds that replace the intractable quantity with a tractable optimization. A foundational lower bound is the one underlying InfoNCE, the objective at the heart of contrastive representation learning. Given a critic function $f(x, y)$ and $K$ candidates $y_1, \dots, y_K$, one of which is the positive $y$ paired with $x$ under the joint while the rest are drawn from the marginal, the InfoNCE objective lower bounds mutual information:

$$I(X; Y) \geq \mathbb{E}\!\left[ \log \frac{e^{f(x, y)}}{\frac{1}{K}\sum_{k=1}^{K} e^{f(x, y_k)}} \right].$$

Maximizing this bound trains the critic to assign high scores to genuinely paired inputs and low scores to mismatched ones. The bound is biased and saturates at $\log K$, because the sum in the denominator includes the positive and the ratio inside the logarithm can therefore never exceed $K$. The number of negative samples thus caps how much information can be detected, a practical reason contrastive methods favor large batches. Other estimators, such as the Donsker-Varadhan bound used by MINE, trade different biases against variance. The difficulty of estimating mutual information accurately is itself an active research theme.

## 4. Computing KL Divergence in Practice

The cleanest way to feel the difference between forward and reverse KL is to fit a single Gaussian to a stubbornly bimodal target and watch where it lands. The forward fit, which is mean-seeking, spreads to straddle both modes and parks its mean in the empty valley between them. The reverse fit, which is mode-seeking, abandons one mode and models the other faithfully. The cell below reproduces the coin example of Section 1.5, sets up a symmetric two-component target, minimizes each objective numerically, checks the closed-form Gaussian KL of Section 1.4 against a direct numerical integral, and plots the two solutions.

::: {.panel-tabset}

## Python

```{python}
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.stats import norm

# Discrete coin example from Section 1.5.
p_coin = np.array([0.5, 0.5])
q_coin = np.array([0.9, 0.1])
fwd = np.sum(p_coin * np.log(p_coin / q_coin))
rev = np.sum(q_coin * np.log(q_coin / p_coin))
print(f"coin KL(p||q)={fwd:.4f} KL(q||p)={rev:.4f}")

# Continuous bimodal target p as an equal mixture of two Gaussians,
# renormalized on the grid so that it integrates to one numerically.
xs = np.linspace(-8.0, 8.0, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -3.0, 1.0) + 0.5 * norm.pdf(xs, 3.0, 1.0)
p = p / (p.sum() * dx)
eps = 1e-12  # guards the logarithms against exact zeros


def gauss(m, s):
    """Gaussian density on the grid, renormalized to integrate to one."""
    q = norm.pdf(xs, m, s)
    return q / (q.sum() * dx)


def forward_kl(params):  # KL(p || q): mean-seeking
    m, log_s = params
    q = gauss(m, np.exp(log_s))
    return np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dx


def reverse_kl(params):  # KL(q || p): mode-seeking
    m, log_s = params
    q = gauss(m, np.exp(log_s))
    return np.sum(q * (np.log(q + eps) - np.log(p + eps))) * dx


# Optimize over (mean, log sd) so the standard deviation stays positive.
f_res = minimize(forward_kl, x0=[0.5, np.log(1.0)], method="Nelder-Mead")
r_res = minimize(reverse_kl, x0=[2.5, np.log(1.0)], method="Nelder-Mead")
mf, sf = f_res.x[0], np.exp(f_res.x[1])
mr, sr = r_res.x[0], np.exp(r_res.x[1])
print(f"forward fit: mean={mf:+.3f} sd={sf:.3f} KL={f_res.fun:.4f}")
print(f"reverse fit: mean={mr:+.3f} sd={sr:.3f} KL={r_res.fun:.4f}")


# Closed-form univariate Gaussian KL, checked against numerical integration.
def kl_gauss_closed(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5


a = gauss(0.0, 1.0)
num = np.sum(a * (np.log(a + eps) - np.log(gauss(1.0, 1.5) + eps))) * dx
print(f"Gaussian KL: numeric={num:.4f} closed-form={kl_gauss_closed(0, 1, 1, 1.5):.4f}")

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(xs, p, "k-", lw=2, label="target p (bimodal)")
ax.plot(xs, gauss(mf, sf), "C0--", lw=2, label="forward KL (mean-seeking)")
ax.plot(xs, gauss(mr, sr), "C3-.", lw=2, label="reverse KL (mode-seeking)")
ax.set_xlabel("x")
ax.set_ylabel("density")
ax.legend()
ax.set_title("Forward vs reverse KL fit to a bimodal target")
fig.tight_layout()
plt.show()
```

The forward solution returns a mean near zero with a wide standard deviation, a single broad bump that covers both target modes and assigns spurious mass to the valley. The reverse solution returns a mean near one mode with a narrow standard deviation, ignoring the other mode entirely.
The numerical and closed-form Gaussian KL values agree to four decimals, confirming the formula of Section 1.4.

## Julia

```julia
# Illustrative: discrete KL and a closed-form Gaussian KL check.

# Discrete KL in nats; terms with p(x) = 0 contribute nothing.
# (Loop variables are named px, qx to avoid shadowing Julia's constant pi.)
function kl(p, q)
    sum(px * log(px / qx) for (px, qx) in zip(p, q) if px > 0)
end

p_coin = [0.5, 0.5]
q_coin = [0.9, 0.1]
println("KL(p||q) = ", kl(p_coin, q_coin))
println("KL(q||p) = ", kl(q_coin, p_coin))

# Closed-form univariate Gaussian KL: N(m1,s1) relative to N(m2,s2).
kl_gauss(m1, s1, m2, s2) = log(s2 / s1) + (s1^2 + (m1 - m2)^2) / (2 * s2^2) - 0.5
println("Gaussian KL = ", kl_gauss(0.0, 1.0, 1.0, 1.5))
```

## Rust

```rust
// Illustrative: discrete KL divergence and closed-form Gaussian KL.

// Discrete KL in nats; terms with p(x) = 0 contribute nothing.
fn kl(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q.iter())
        .filter(|&(&pi, _)| pi > 0.0)
        .map(|(&pi, &qi)| pi * (pi / qi).ln())
        .sum()
}

// Closed-form univariate Gaussian KL: N(m1,s1) relative to N(m2,s2).
fn kl_gauss(m1: f64, s1: f64, m2: f64, s2: f64) -> f64 {
    (s2 / s1).ln() + (s1 * s1 + (m1 - m2).powi(2)) / (2.0 * s2 * s2) - 0.5
}

fn main() {
    let p = [0.5, 0.5];
    let q = [0.9, 0.1];
    println!("KL(p||q) = {:.4}", kl(&p, &q));
    println!("KL(q||p) = {:.4}", kl(&q, &p));
    println!("Gaussian KL = {:.4}", kl_gauss(0.0, 1.0, 1.0, 1.5));
}
```

:::

## 5. Roles in Machine Learning

### 5.1 Variational Inference and the ELBO

The most pervasive appearance of KL divergence in modern machine learning is the evidence lower bound. Suppose we have a latent variable model $p_\theta(x, z) = p_\theta(x \mid z) p(z)$ and we want to maximize the marginal likelihood $\log p_\theta(x)$, which requires the intractable posterior $p_\theta(z \mid x)$. We introduce a tractable variational distribution $q_\phi(z \mid x)$ and write

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{ELBO}} + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \parallel p_\theta(z \mid x)\right).$$

Because the KL term is non-negative, the first term is a lower bound on the log-evidence, hence its name.
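
The decomposition above can be verified exactly when the latent variable is discrete and small enough to enumerate. The following is a minimal sketch under stated assumptions: the three-state latent, the values in `prior` and `lik`, and the variational distribution `q` are all hypothetical numbers chosen for illustration, not taken from any model in the text.

```python
import numpy as np

# Hypothetical toy model: latent z in {0, 1, 2} and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])      # p(z)
lik = np.array([0.10, 0.60, 0.30])     # p(x | z) evaluated at the observed x
joint = prior * lik                    # p(x, z)
evidence = joint.sum()                 # p(x), tractable here by enumeration
posterior = joint / evidence           # p(z | x)

# Arbitrary variational distribution q(z | x); any strictly positive
# probability vector works for this check.
q = np.array([0.2, 0.5, 0.3])

# ELBO = E_q[log p(x, z) - log q(z | x)]
elbo = float(np.sum(q * (np.log(joint) - np.log(q))))
# Gap = KL(q(z | x) || p(z | x)), the reverse KL to the true posterior
kl_gap = float(np.sum(q * (np.log(q) - np.log(posterior))))

print(f"log p(x)  = {np.log(evidence):.6f}")
print(f"ELBO      = {elbo:.6f}")
print(f"KL gap    = {kl_gap:.6f}")
print(f"ELBO + KL = {elbo + kl_gap:.6f}")
```

The last line should match `log p(x)` up to floating-point error, and setting `q = posterior` should close the gap so that the ELBO equals the log-evidence.
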
Maximizing the ELBO does double duty: it pushes up a bound on the likelihood and, equivalently, drives $q_\phi$ toward the true posterior by shrinking the reverse KL gap. The ELBO can be rearranged into a reconstruction term and a regularization term:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \parallel p(z)\right).$$

This is precisely the variational autoencoder objective. The first term rewards faithful reconstruction; the second pulls the approximate posterior toward the prior $p(z)$, usually a standard Gaussian, using the closed-form KL of Section 1.4. Note that the gap minimized here is a reverse KL, which is why variational inference inherits the mode-seeking behavior discussed earlier and can underestimate posterior uncertainty.

### 5.2 Representation Learning

Mutual information offers a principled objective for learning representations. The InfoMax principle holds that a good representation $Z$ of an input $X$ should retain as much information about $X$ as possible, suggesting we maximize $I(X; Z)$. Modern self-supervised methods refine this idea. Contrastive approaches maximize the mutual information between different views or augmentations of the same input, encouraging the encoder to capture the underlying content that the two views share while discarding view-specific nuisance. The InfoNCE bound of Section 3.3 is the workhorse here, and its connection to mutual information gives these methods a clear information-theoretic reading even when, in practice, much of their success also depends on architectural and optimization details.

Mutual information also clarifies what we want a representation not to contain. In fair and private representation learning, one explicitly minimizes the mutual information between the representation and a sensitive attribute $S$, seeking features that are informative for the task but reveal little about $S$.
The same minimization principle appears in disentanglement, where penalizing the mutual information among latent dimensions encourages factors of variation to separate.

### 5.3 The Information Bottleneck

The information bottleneck principle unifies many of these ideas into a single objective. Given an input $X$ and a target $Y$, we seek a representation $Z$ that is maximally informative about $Y$ while being maximally compressive of $X$. Formally, we minimize

$$\mathcal{L}_{\mathrm{IB}} = I(X; Z) - \beta\, I(Z; Y),$$

where $\beta > 0$ trades compression against prediction. The first term squeezes the representation, discarding information about the input; the second term insists that what survives the squeeze remains predictive of the target. The bottleneck framing gives a sharp account of generalization: a representation that has forgotten the irrelevant details of $X$ while keeping the label-relevant ones should generalize better, because it has stripped away the noise on which overfitting feeds.

The deep variational information bottleneck makes this objective trainable by replacing the intractable mutual information terms with variational bounds, producing a loss that looks remarkably like a regularized variational autoencoder. The compression term $I(X; Z)$ is upper bounded by a KL divergence between the encoder and a prior, exactly as in the ELBO, while the prediction term $I(Z; Y)$ is lower bounded by a decoder log-likelihood. The information bottleneck thus ties together everything in this chapter: relative entropy as a regularizer, mutual information as a measure of relevance, and variational bounds as the bridge from elegant principle to runnable code.

### 5.4 A Unifying View

Stepping back, KL divergence and mutual information recur because they answer the two questions that dominate learning. KL divergence asks how wrong our model is relative to the truth, and so it furnishes loss functions, regularizers, and the gaps in variational bounds.
Mutual information asks how much one quantity tells us about another, and so it furnishes objectives for what a representation should keep and what it should discard. The asymmetry of KL forces a modeling choice between covering and concentrating; the non-negativity of both quantities is what makes them safe to optimize. Together they form a compact toolkit that reappears, lightly disguised, across supervised learning, generative modeling, and self-supervised representation learning.

## References

1. Kullback, S., and Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79 to 86. https://doi.org/10.1214/aoms/1177729694
2. Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. https://doi.org/10.1002/047174882X
3. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379 to 423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
4. Csiszar, I. (1975). I-Divergence Geometry of Probability Distributions and Minimization Problems. Annals of Probability, 3(1), 146 to 158. https://doi.org/10.1214/aop/1176996454
5. Kingma, D. P., and Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1312.6114
6. Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859 to 877. https://doi.org/10.1080/01621459.2017.1285773
7. Tishby, N., Pereira, F. C., and Bialek, W. (1999). The Information Bottleneck Method. Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, 368 to 377. https://doi.org/10.48550/arXiv.physics/0004057
8. Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2017). Deep Variational Information Bottleneck. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1612.00410
9. van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. https://doi.org/10.48550/arXiv.1807.03748
10. Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. (2018). Mutual Information Neural Estimation. Proceedings of the 35th International Conference on Machine Learning, PMLR 80, 531 to 540. https://doi.org/10.48550/arXiv.1801.04062
11. Poole, B., Ozair, S., van den Oord, A., Alemi, A. A., and Tucker, G. (2019). On Variational Bounds of Mutual Information. Proceedings of the 36th International Conference on Machine Learning, PMLR 97, 5171 to 5180. https://doi.org/10.48550/arXiv.1905.06922
12. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems 27, 2672 to 2680. https://doi.org/10.48550/arXiv.1406.2661
13. Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. (2020). On Mutual Information Maximization for Representation Learning. International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1907.13625