135 The Expectation-Maximization Algorithm
Many of the most useful probabilistic models in machine learning involve variables that are never observed. A clustering model assumes each data point belongs to some hidden group, but the group labels are missing. A topic model assumes each word arose from some latent theme, but the themes are unannotated. Whenever a model couples observed data to unobserved variables, the likelihood we actually care about, the marginal likelihood of the observed data, becomes a sum or integral over all configurations of the hidden variables. That marginalization usually destroys the clean structure that makes maximum likelihood estimation tractable in fully observed models. The Expectation-Maximization (EM) algorithm is the standard tool for restoring tractability. It is an iterative scheme that turns one hard optimization problem into a sequence of easy ones, and it does so with a guarantee that the observed-data likelihood never decreases.
This chapter develops EM from first principles. We start with the general framework and the evidence lower bound, derive the E and M steps as coordinate ascent on a single objective, prove the monotonic convergence property, and then work through Gaussian mixture models as the canonical concrete example.
135.1 1. The Latent Variable Setting
Let \(\mathbf{x}\) denote observed variables, \(\mathbf{z}\) denote latent (hidden) variables, and \(\boldsymbol{\theta}\) denote the model parameters we wish to estimate. The model specifies a joint distribution \(p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\). What we observe is only \(\mathbf{x}\), so the quantity we want to maximize is the marginal log-likelihood, also called the incomplete-data log-likelihood:
\[ \ell(\boldsymbol{\theta}) = \log p(\mathbf{x} \mid \boldsymbol{\theta}) = \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}). \]
For continuous \(\mathbf{z}\) the sum becomes an integral. The difficulty is the logarithm of a sum. Because the log does not pass through the summation, the gradient of \(\ell(\boldsymbol{\theta})\) couples all latent configurations together, and there is generally no closed-form maximizer even when the joint \(p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\) belongs to an exponential family with a trivial maximizer in the fully observed case.
It helps to contrast this with the complete-data log-likelihood, the quantity we could maximize if the latent variables were revealed to us:
\[ \ell_c(\boldsymbol{\theta}) = \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}). \]
Here the log sits directly on the joint, which often factorizes into terms that are individually easy to optimize. EM is, in essence, a principled way to optimize \(\ell(\boldsymbol{\theta})\) by repeatedly optimizing an expected version of \(\ell_c(\boldsymbol{\theta})\) where the expectation fills in the missing \(\mathbf{z}\).
135.2 2. The Evidence Lower Bound
The conceptual heart of EM is a lower bound on \(\ell(\boldsymbol{\theta})\) that holds for any choice of an auxiliary distribution over the latent variables. Introduce a distribution \(q(\mathbf{z})\), which we are free to pick. For any such \(q\) with support covering the latent space, we can write
\[ \log p(\mathbf{x} \mid \boldsymbol{\theta}) = \log \sum_{\mathbf{z}} q(\mathbf{z}) \, \frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}. \]
Because the logarithm is concave, Jensen’s inequality lets us move the log inside the expectation, which can only lower the value:
\[ \log p(\mathbf{x} \mid \boldsymbol{\theta}) \;\geq\; \sum_{\mathbf{z}} q(\mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})} \;=\; \mathcal{L}(q, \boldsymbol{\theta}). \]
The quantity \(\mathcal{L}(q, \boldsymbol{\theta})\) is the evidence lower bound, often abbreviated ELBO. It is a function of both the auxiliary distribution \(q\) and the parameters \(\boldsymbol{\theta}\).
135.2.1 2.1 The Exact Decomposition
Jensen’s inequality tells us the ELBO is a lower bound, but a more revealing identity tells us exactly how loose it is. Starting from the definition of conditional probability \(p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) \, p(\mathbf{x} \mid \boldsymbol{\theta})\) and substituting, one obtains the decomposition
\[ \log p(\mathbf{x} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\right), \]
where the Kullback-Leibler divergence is
\[ \mathrm{KL}\!\left(q \,\|\, p\right) = \sum_{\mathbf{z}} q(\mathbf{z}) \log \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})} \;\geq\; 0. \]
This identity is the single most important equation for understanding EM. The left side, the log-likelihood, does not depend on \(q\) at all. Therefore, for fixed \(\boldsymbol{\theta}\), the ELBO and the KL divergence trade off against each other and sum to a constant. The gap between the ELBO and the true log-likelihood is exactly the KL divergence between our chosen \(q\) and the true posterior \(p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\). Since KL divergence is nonnegative and equals zero if and only if its two arguments coincide, the bound is tight precisely when \(q\) equals the posterior.
135.3 3. EM as Coordinate Ascent
EM maximizes the ELBO \(\mathcal{L}(q, \boldsymbol{\theta})\) by alternating optimization over its two arguments. This is coordinate ascent on a single well-defined objective, which is the cleanest way to see why the algorithm works.
135.3.1 3.1 The E Step
Fix the current parameters at \(\boldsymbol{\theta}^{(t)}\) and maximize \(\mathcal{L}(q, \boldsymbol{\theta}^{(t)})\) over \(q\). From the decomposition, since \(\log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)})\) is fixed, maximizing the ELBO over \(q\) is equivalent to minimizing the KL divergence over \(q\). The minimum of a KL divergence is zero, achieved when
\[ q^{(t+1)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)}). \]
The E step therefore sets \(q\) equal to the posterior distribution over the latent variables under the current parameters. After the E step the bound is tight: \(\mathcal{L}(q^{(t+1)}, \boldsymbol{\theta}^{(t)}) = \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)})\). In practice we rarely need the full posterior as an object. We need only the expectations of the complete-data sufficient statistics under it, which is why the step is called Expectation.
135.3.2 3.2 The M Step
Now fix \(q = q^{(t+1)}\) and maximize \(\mathcal{L}(q^{(t+1)}, \boldsymbol{\theta})\) over \(\boldsymbol{\theta}\). Expand the ELBO:
\[ \mathcal{L}(q, \boldsymbol{\theta}) = \underbrace{\sum_{\mathbf{z}} q(\mathbf{z}) \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}_{Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})} \;-\; \underbrace{\sum_{\mathbf{z}} q(\mathbf{z}) \log q(\mathbf{z})}_{\text{entropy, independent of }\boldsymbol{\theta}}. \]
The second term, the entropy of \(q\), does not involve \(\boldsymbol{\theta}\), so maximizing the ELBO over \(\boldsymbol{\theta}\) reduces to maximizing the first term. With \(q\) set to the current posterior, that first term is the expected complete-data log-likelihood, conventionally written
\[ Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})} \big[ \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) \big]. \]
The M step computes
\[ \boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \; Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}). \]
The crucial payoff is that the expectation is taken over a fixed distribution, so the logarithm now sits on the joint \(p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\) rather than on a marginal sum. When the joint is in the exponential family, \(Q\) is a weighted sum of complete-data log-likelihood terms, and its maximizer often has the same closed form as ordinary maximum likelihood, with the hard latent assignments replaced by their posterior expectations.
The complete iteration in pseudocode:
initialize theta
repeat until convergence:
# E step: compute responsibilities / posterior expectations
q(z) = p(z | x, theta)
# M step: maximize expected complete-data log-likelihood
theta = argmax_theta E_q[ log p(x, z | theta) ]
135.4 4. Convergence Guarantees
The defining property of EM is that each iteration does not decrease the observed-data log-likelihood. We prove this directly.
135.4.1 4.1 Monotonic Ascent
Consider one full iteration from \(\boldsymbol{\theta}^{(t)}\) to \(\boldsymbol{\theta}^{(t+1)}\). Using the decomposition at the new parameters with the E-step distribution \(q^{(t+1)} = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})\):
\[ \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t+1)}) = \mathcal{L}(q^{(t+1)}, \boldsymbol{\theta}^{(t+1)}) + \mathrm{KL}\!\left(q^{(t+1)} \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t+1)})\right). \]
The KL term is nonnegative, so
\[ \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t+1)}) \;\geq\; \mathcal{L}(q^{(t+1)}, \boldsymbol{\theta}^{(t+1)}). \]
The M step chose \(\boldsymbol{\theta}^{(t+1)}\) to maximize \(\mathcal{L}(q^{(t+1)}, \cdot)\), so in particular it does at least as well as \(\boldsymbol{\theta}^{(t)}\):
\[ \mathcal{L}(q^{(t+1)}, \boldsymbol{\theta}^{(t+1)}) \;\geq\; \mathcal{L}(q^{(t+1)}, \boldsymbol{\theta}^{(t)}) = \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)}), \]
where the final equality holds because the E step made the bound tight. Chaining the two inequalities gives the central result:
\[ \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t+1)}) \;\geq\; \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)}). \]
The likelihood is monotonically nondecreasing. An equivalent and often quoted form requires only that the M step increase \(Q\) rather than maximize it, since \(Q(\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)})\) is enough to drive the chain. Algorithms that merely increase \(Q\) are called Generalized EM, and they retain the ascent property.
135.4.2 4.2 What Convergence Does and Does Not Mean
Monotonicity plus an upper bound on the likelihood (any proper likelihood is bounded above when the parameter space is well behaved) implies that the sequence of likelihood values converges. Under mild regularity conditions the parameter iterates converge to a stationary point of \(\ell(\boldsymbol{\theta})\), a point where the gradient vanishes. This is a local statement, not a global one. EM can converge to a local maximum or even a saddle point, and the fixed point reached depends on initialization. There is no guarantee of reaching the global maximizer, and for problems like Gaussian mixtures the likelihood surface is genuinely multimodal. The standard mitigation is to run EM from several random initializations and keep the solution with the highest likelihood, or to use a more informed initializer such as k-means for mixture models.
The convergence rate of EM is linear, with the speed governed by the fraction of information about \(\boldsymbol{\theta}\) that is missing because \(\mathbf{z}\) is hidden. When the latent variables carry little information, EM converges quickly. When the missing information is large, for example when mixture components overlap heavily, convergence can be painfully slow, and acceleration schemes or a switch to a quasi-Newton method near the optimum may be warranted.
135.5 5. EM for Gaussian Mixture Models
The Gaussian mixture model (GMM) is the canonical concrete instance of EM and the one most worth internalizing. It models a density as a convex combination of \(K\) Gaussian components.
135.5.1 5.1 The Model
We observe \(N\) data points \(\mathbf{x}_1, \ldots, \mathbf{x}_N\) in \(\mathbb{R}^d\). Each point is assumed to be generated by first drawing a component index from a categorical distribution, then sampling from the corresponding Gaussian. The latent variable \(z_n \in \{1, \ldots, K\}\) records which component generated point \(n\). The parameters are the mixing weights \(\pi_k\) with \(\sum_k \pi_k = 1\), the means \(\boldsymbol{\mu}_k\), and the covariances \(\boldsymbol{\Sigma}_k\). The marginal density of a single observation is
\[ p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \]
The incomplete-data log-likelihood is \(\ell(\boldsymbol{\theta}) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\), the log of a sum that resists direct maximization.
135.5.2 5.2 The E Step: Responsibilities
The latent posterior for a single point factorizes over points, and for point \(n\) it is a categorical distribution over the \(K\) components. The posterior probability that component \(k\) generated point \(n\) is called the responsibility, obtained by Bayes’ rule:
\[ \gamma_{nk} = p(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \]
Each \(\gamma_{nk}\) lies in \([0, 1]\) and the responsibilities for a point sum to one across components. They are the soft assignments that distinguish EM for mixtures from the hard assignments of k-means.
135.5.3 5.3 The M Step: Weighted Re-estimation
With the responsibilities fixed, the complete-data log-likelihood is
\[ \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}[z_n = k] \left( \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right), \]
and taking its expectation under the posterior replaces each indicator with \(\gamma_{nk}\). Maximizing the resulting \(Q\) function, subject to the constraint that the weights sum to one (handled with a Lagrange multiplier), yields closed-form updates. Define the effective number of points assigned to component \(k\) as \(N_k = \sum_{n=1}^{N} \gamma_{nk}\). Then
\[ \pi_k^{(t+1)} = \frac{N_k}{N}, \qquad \boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, \mathbf{x}_n, \]
\[ \boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^{\top}. \]
These are exactly the ordinary maximum likelihood estimates for a Gaussian, except every point contributes in proportion to its responsibility rather than being fully assigned to one component. The structural parallel to the fully observed estimator is the practical reason EM feels natural once the framework is in place.
# one EM iteration for a GMM
for each n, k: gamma[n,k] = pi[k]*N(x[n]; mu[k], Sigma[k])
gamma = gamma / row_sums(gamma) # E step (responsibilities)
N_k = column_sums(gamma)
pi = N_k / N # M step
mu[k] = (gamma[:,k] . X) / N_k[k]
Sigma[k] = weighted_covariance(X, gamma[:,k], mu[k])
135.5.4 5.4 Practical Pitfalls
Two failure modes recur in practice. The first is the singularity pathology: if a component’s mean coincides with a single data point and its covariance shrinks toward zero, the density at that point diverges and the likelihood becomes unbounded. The maximum likelihood objective itself is therefore degenerate for GMMs, and EM can fall into such a spike. Remedies include adding a small ridge term to each covariance, placing a prior on the covariances and running maximum a posteriori EM, or resetting any component whose weight or variance collapses. The second is sensitivity to initialization. Because the likelihood is multimodal, initializing the means with k-means centroids and running several restarts is standard. Selecting the number of components \(K\) is a separate model selection question, usually addressed with held-out likelihood, the Bayesian Information Criterion, or a Dirichlet process prior that lets the data favor a component count.
135.6 6. Connections and Extensions
EM is a template rather than a single algorithm. The same E and M structure underlies Baum-Welch training for hidden Markov models, factor analysis and probabilistic principal component analysis, mixtures of experts, and missing-data imputation in classical statistics. When the E step posterior is itself intractable, one replaces the exact expectation with an approximation: variational EM uses a restricted family for \(q\) and the ELBO becomes the variational objective, while Monte Carlo EM estimates the expectation by sampling. The ELBO derived in this chapter is the same object that powers variational autoencoders, where a neural network amortizes the E step by predicting \(q\) directly from each input. Viewed this way, EM is not a relic of classical statistics but the conceptual ancestor of a large fraction of modern probabilistic deep learning.
135.7 7. Summary
EM addresses maximum likelihood estimation when latent variables make the marginal likelihood intractable. It introduces an auxiliary distribution \(q\), builds the evidence lower bound, and recognizes that the gap between the bound and the true log-likelihood is exactly a KL divergence to the posterior. The E step closes that gap by setting \(q\) to the posterior, and the M step pushes the bound, and therefore the likelihood, upward by maximizing the expected complete-data log-likelihood. This coordinate ascent guarantees a monotonically nondecreasing likelihood and convergence to a stationary point, though only a local one. Gaussian mixtures make the abstraction concrete: the E step computes soft responsibilities and the M step re-estimates weights, means, and covariances as responsibility-weighted versions of the standard Gaussian estimators. Mastering this one example transfers directly to the broader family of latent variable models.
135.8 References
- Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1 to 38. https://www.jstor.org/stable/2984875
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 9. Springer. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
- Neal, R. M., and Hinton, G. E. (1998). A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. In Learning in Graphical Models, 355 to 368. https://www.cs.toronto.edu/~radford/ftp/emk.pdf
- McLachlan, G. J., and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd Edition. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/9780470191613
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective, Chapter 11. MIT Press. https://probml.github.io/pml-book/book0.html
- Wu, C. F. J. (1983). On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1), 95 to 103. https://www.jstor.org/stable/2240463