144 Non-Negative Matrix Factorization
Non-Negative Matrix Factorization (NMF) is a linear dimensionality reduction technique that approximates a non-negative data matrix as the product of two smaller non-negative matrices. The non-negativity constraint is not a mere technical convenience. It fundamentally changes the geometry of the factorization, forcing the learned components to combine only additively. This additivity is what gives NMF its signature property: a parts-based representation in which the whole is reconstructed by summing parts, never by subtracting them. This chapter develops the mathematics of NMF rigorously, derives the celebrated multiplicative update rules, examines the two canonical objective functions, and connects the method to its two most influential application domains, topic modeling and image decomposition.
144.1 1. The Factorization Problem
144.1.1 1.1 Setup and Notation
Let \(V \in \mathbb{R}_{\geq 0}^{n \times m}\) be a data matrix with non-negative entries. Each of the \(m\) columns is a data point living in \(\mathbb{R}_{\geq 0}^{n}\), and each of the \(n\) rows is a feature. NMF seeks two non-negative factor matrices
\[ W \in \mathbb{R}_{\geq 0}^{n \times r}, \qquad H \in \mathbb{R}_{\geq 0}^{r \times m}, \]
such that
\[ V \approx W H. \]
The integer \(r\) is the rank of the factorization, chosen so that \(r \ll \min(n, m)\). Because \((WH)\) has rank at most \(r\), the product is a low-rank approximation of \(V\). The total number of free parameters drops from \(nm\) to \(r(n + m)\), which is the source of the compression and of the statistical regularization that NMF provides.
A column of \(V\) is reconstructed as \(v_j \approx W h_j\), where \(h_j\) is the \(j\)-th column of \(H\). Writing \(W = [w_1, \dots, w_r]\) in terms of its columns, this reads
\[ v_j \approx \sum_{k=1}^{r} w_k \, h_{kj}. \]
Each data vector is therefore a non-negative combination of the \(r\) columns of \(W\). The columns \(w_k\) are the basis vectors, often called the dictionary atoms or the latent components, and the entries \(h_{kj}\) are the non-negative activation coefficients or encodings.
144.1.2 1.2 Why Non-Negativity Produces Parts
The contrast with Principal Component Analysis (PCA) and the Singular Value Decomposition (SVD) is instructive. Those methods also factor \(V\) into a product of smaller matrices, but they permit negative entries, and their basis vectors are constrained to be mutually orthogonal. As a result, individual PCA components are typically dense, holistic templates, and reconstruction relies on cancellation: a large positive contribution from one component is partly erased by a negative contribution from another.
NMF forbids subtraction. Since every \(w_k \geq 0\) and every \(h_{kj} \geq 0\), the only operation available for assembling \(v_j\) is addition. Components can only pile on top of one another. This restriction tends to drive the factorization toward sparse, localized parts, because a basis vector that is active for one data point but should be absent for another cannot be cancelled away; it must simply be set near zero in that encoding. The empirical consequence, demonstrated by Lee and Seung in their 1999 paper, is that NMF applied to face images recovers localized features such as eyes, noses, and mouths, whereas PCA on the same data recovers eigenfaces that look like ghostly whole faces.
It is worth stating the limits of this intuition precisely. Non-negativity makes a parts-based decomposition possible and often likely, but it does not guarantee it. Whether the recovered parts are unique or interpretable depends on the geometry of the data and on the rank \(r\). NMF is in general not unique: for any invertible non-negative matrix \(Q\) with non-negative inverse, \(WH = (WQ)(Q^{-1}H)\) is an equally valid factorization. The only such \(Q\) that preserve non-negativity of both factors are permutations and positive diagonal scalings, which is reassuring, but other non-negative ambiguities can arise when the data do not fill the non-negative cone tightly.
144.2 2. Objective Functions
NMF is posed as an optimization problem. We must specify a cost that measures how well \(WH\) approximates \(V\), then minimize it over the non-negative factors. Two objectives dominate practice, and each corresponds to a different statistical noise model.
144.2.1 2.1 The Squared Frobenius Norm
The most common cost is the squared Frobenius norm of the residual,
\[ \mathcal{D}_F(V \,\|\, WH) = \tfrac{1}{2}\, \| V - WH \|_F^2 = \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( V_{ij} - (WH)_{ij} \right)^2 . \]
Minimizing this objective is equivalent to maximum likelihood estimation under the assumption that the entries of \(V\) are corrupted by additive Gaussian noise of constant variance. The Frobenius objective is smooth and its gradient is simple, which makes it the default choice for image data and for any setting where the residual is plausibly symmetric around zero.
144.2.2 2.2 The Generalized Kullback-Leibler Divergence
When \(V\) records counts, such as the number of times a word appears in a document, a Gaussian model is a poor fit and a Poisson model is more appropriate. The corresponding cost is the generalized Kullback-Leibler (KL) divergence,
\[ \mathcal{D}_{KL}(V \,\|\, WH) = \sum_{i=1}^{n}\sum_{j=1}^{m} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right). \]
This is not the classical KL divergence between probability distributions, because \(V\) and \(WH\) need not sum to one. It is the divergence that arises as the negative log likelihood of a Poisson observation model, up to terms independent of the parameters. Both \(\mathcal{D}_F\) and \(\mathcal{D}_{KL}\) are non-negative, equal zero if and only if \(V = WH\), and are convex in \(W\) for fixed \(H\) and convex in \(H\) for fixed \(W\). Crucially, neither is jointly convex in \((W, H)\) together, so we cannot expect to find the global optimum. Both divergences are special cases of the broader family of \(\beta\)-divergences, with the Frobenius norm at \(\beta = 2\) and the KL divergence at \(\beta = 1\).
144.2.3 2.3 The Non-Convexity Consequence
Because the joint problem is non-convex, every practical NMF algorithm finds only a local minimum, and the result depends on initialization. Common strategies include random non-negative initialization with several restarts, and Non-Negative Double SVD (NNDSVD), which seeds \(W\) and \(H\) from the singular vectors of \(V\) with negative parts clipped. Multiple restarts followed by selection of the lowest cost solution is standard practice.
144.3 3. Multiplicative Update Rules
144.3.1 3.1 The Algorithm
Lee and Seung introduced a pair of remarkably simple iterative rules that monotonically decrease each objective. For the Frobenius cost, the updates are
\[ H \leftarrow H \odot \frac{W^\top V}{W^\top W H}, \qquad W \leftarrow W \odot \frac{V H^\top}{W H H^\top}, \]
where \(\odot\) denotes elementwise (Hadamard) multiplication and the division is elementwise. For the KL divergence, the updates become
\[ H_{kj} \leftarrow H_{kj} \, \frac{\sum_i W_{ik} \, V_{ij} / (WH)_{ij}}{\sum_i W_{ik}}, \qquad W_{ik} \leftarrow W_{ik} \, \frac{\sum_j H_{kj} \, V_{ij} / (WH)_{ij}}{\sum_j H_{kj}}. \]
Two properties make these rules attractive. First, every operation is a multiplication or a division of non-negative quantities, so if \(W\) and \(H\) start non-negative they remain non-negative at every iteration; the constraint is enforced automatically without any projection step. Second, each update can be shown to never increase the objective.
A schematic implementation makes the structure explicit.
initialize W, H with non-negative random values
repeat until convergence:
# Frobenius updates, epsilon guards the denominator
H = H * (Wt @ V) / (Wt @ W @ H + eps)
W = W * (V @ Ht) / (W @ H @ Ht + eps)
return W, H
144.3.2 3.2 Derivation from Gradient Descent
The multiplicative form is not arbitrary; it is a cleverly rescaled gradient descent. Consider the Frobenius objective as a function of \(H\) with \(W\) fixed. Its gradient is
\[ \nabla_H \mathcal{D}_F = W^\top W H - W^\top V . \]
A gradient descent step with elementwise step size \(\eta_{kj}\) reads \(H_{kj} \leftarrow H_{kj} - \eta_{kj} (\nabla_H \mathcal{D}_F)_{kj}\). Now choose the step size adaptively as
\[ \eta_{kj} = \frac{H_{kj}}{(W^\top W H)_{kj}} . \]
Substituting this choice yields
\[ H_{kj} \leftarrow H_{kj} - \frac{H_{kj}}{(W^\top W H)_{kj}} \left( (W^\top W H)_{kj} - (W^\top V)_{kj} \right) = H_{kj}\, \frac{(W^\top V)_{kj}}{(W^\top W H)_{kj}}, \]
which is exactly the multiplicative rule. The chosen step size is what converts an additive update, which could push entries negative, into a multiplicative one that cannot. The update for \(W\) follows by the same argument with the roles of the matrices transposed. At a fixed point, either \(H_{kj} = 0\) or the gradient component vanishes, which is precisely the Karush-Kuhn-Tucker stationarity condition for the non-negativity constrained problem.
144.3.3 3.3 Monotone Descent and the Auxiliary Function
The proof that the updates never increase the cost relies on the auxiliary function technique, a close cousin of the Expectation Maximization argument. A function \(G(h, h')\) is an auxiliary function for \(F(h)\) if
\[ G(h, h') \geq F(h) \quad \text{for all } h, h', \qquad G(h, h) = F(h). \]
If such a \(G\) exists, then defining \(h^{t+1} = \arg\min_h G(h, h^t)\) guarantees
\[ F(h^{t+1}) \leq G(h^{t+1}, h^{t}) \leq G(h^{t}, h^{t}) = F(h^{t}), \]
so \(F\) decreases monotonically. Lee and Seung constructed an explicit auxiliary function for each objective whose minimizer is exactly the multiplicative update. The construction for the KL case uses the convexity of the negative logarithm together with Jensen’s inequality, the same machinery that underlies EM. The upshot is a guarantee of monotone convergence of the objective value to a stationary point.
A caveat deserves emphasis. Monotone decrease of the cost does not by itself imply convergence of the iterates \((W, H)\) to a stationary point of the constrained problem, and early treatments overstated this. The standard remedy in production code is to add a small constant \(\epsilon\) to denominators to avoid division by zero, and modern solvers often prefer alternative optimizers such as block coordinate descent or alternating non-negative least squares, which enjoy stronger convergence guarantees and frequently run faster. The multiplicative rules remain the canonical pedagogical algorithm and a reliable baseline.
144.4 4. Application: Topic Modeling
144.4.1 4.1 The Bag of Words Matrix
In text mining, a corpus of \(m\) documents over a vocabulary of \(n\) terms is summarized by a term-document matrix \(V\), where \(V_{ij}\) is the count or the TF-IDF weight of term \(i\) in document \(j\). This matrix is non-negative by construction, which makes NMF a natural fit. Factoring \(V \approx W H\) yields an interpretation in which each of the \(r\) columns of \(W\) is a topic, namely a non-negative weighting over the vocabulary that indicates which words characterize that topic, and each column of \(H\) describes a document as a non-negative mixture of those topics.
144.4.2 4.2 Interpretation and Relation to Probabilistic Models
The parts-based property is exactly what makes NMF topics readable. Because a document is reconstructed by adding topics rather than by cancelling them, the high weight terms in a column \(w_k\) form a coherent, human-interpretable theme. Sorting each topic by descending term weight and reading off the top terms is the standard way to label topics.
There is a precise probabilistic counterpart. Minimizing the KL divergence NMF on a count matrix is closely related to Probabilistic Latent Semantic Analysis (PLSA), and both can be viewed as fitting a multinomial mixture. After normalizing the factors so that topics are distributions over words and document encodings are distributions over topics, KL-NMF and PLSA optimize equivalent likelihoods. This connects the linear algebra to the more elaborate Latent Dirichlet Allocation, which adds a Dirichlet prior over the topic mixtures. NMF can be seen as the maximum likelihood, prior-free skeleton beneath that Bayesian model, which explains why it is often faster to fit and easier to reason about, at the cost of the regularization a prior would supply.
# topic modeling sketch
V = tfidf(corpus) # n terms by m documents, non-negative
W, H = nmf(V, rank=r, loss="kullback-leibler")
for k in range(r):
top = argsort(W[:, k])[::-1][:10]
print("topic", k, [vocab[i] for i in top])
For text, the KL or the equivalent generalized divergence is usually preferred over the Frobenius norm because word counts are sparse and heavy tailed, conditions under which the Gaussian assumption behind the Frobenius cost is badly violated.
144.5 5. Application: Image Decomposition
144.5.1 5.1 Images as Non-Negative Vectors
Pixel intensities are non-negative, so a collection of grayscale images flattens naturally into a non-negative matrix \(V\), with one column per image and one row per pixel. NMF applied to a database of aligned face images recovers basis images \(w_k\) that are spatially localized: a basis vector lights up over the region of a mouth, another over an eyebrow, another over the bridge of the nose. A particular face is then encoded by activating the subset of parts it contains.
144.5.2 5.2 Sparse, Additive Reconstruction
This is the experiment that launched the field. Lee and Seung contrasted NMF with PCA and Vector Quantization (VQ) on the same face set. VQ forces each image to be represented by a single prototype, producing holistic templates. PCA produces orthogonal eigenfaces that mix positive and negative pixel values and reconstruct by cancellation. NMF alone yields the intuitive decomposition into additive parts, because the non-negativity of both the basis and the coefficients means that adding a part can only add intensity to the reconstruction, never remove it.
The same principle generalizes well beyond faces. NMF is used for hyperspectral unmixing, where each pixel spectrum is a non-negative mixture of pure material signatures called endmembers, and the abundance coefficients in \(H\) must be non-negative because a material cannot contribute a negative quantity. It also appears in audio source separation applied to the non-negative magnitude spectrogram, where the columns of \(W\) are spectral templates of individual sources and \(H\) gives their time-varying activations. In each case the physical quantity being modeled is inherently non-negative and additive, which is precisely the regime where NMF is the principled choice rather than a heuristic.
144.5.3 5.3 Encouraging Sparsity Explicitly
Although non-negativity tends to induce sparsity, the effect is not always strong enough, and the degree of localization can be controlled. A common extension augments the objective with an \(\ell_1\) penalty on the factors,
\[ \mathcal{D}_F(V \,\|\, WH) + \alpha \, \| H \|_1 + \beta \, \| W \|_1, \]
which pushes small coefficients to exactly zero and sharpens the parts-based interpretation. Hoyer’s sparse NMF formalizes this with an explicit, tunable sparsity measure, giving the practitioner direct control over how localized the learned components are. Regularization of this kind is now standard in production implementations of NMF.
144.6 6. Practical Considerations
Choosing the rank \(r\) is the central modeling decision and has no universally optimal answer. Practitioners examine the reconstruction error as a function of \(r\) and look for an elbow, use held-out imputation error on masked entries, or apply stability selection by checking how consistent the factors are across restarts. Preprocessing matters as well: features are often scaled so that no single high variance row dominates the Frobenius cost, and for text, TF-IDF weighting tempers the influence of frequent function words.
Finally, the choice of divergence should follow the noise model of the data rather than convenience. Use the Frobenius norm when the residuals are plausibly Gaussian, as with continuous, well-scaled measurements, and the KL or generalized divergence when the data are counts or otherwise Poisson-like, as with text and spectrograms. Aligning the objective with the data generating process is what turns NMF from a generic matrix approximation into a faithful, interpretable model.
144.7 7. Summary
NMF approximates a non-negative matrix by a product of two non-negative low-rank factors. The non-negativity constraint removes cancellation from the reconstruction, which favors sparse, localized, additive parts and yields representations that humans can interpret directly, whether as topics over a vocabulary or as localized features in an image. The Frobenius and KL objectives correspond to Gaussian and Poisson noise models respectively, and the multiplicative update rules provide a simple, constraint-preserving, monotone descent algorithm derived from rescaled gradient descent and justified by an auxiliary function argument. The method is non-convex and non-unique, so initialization and rank selection require care, but within the broad class of problems where the underlying quantities are genuinely non-negative and combine additively, NMF remains one of the most transparent and widely used tools for unsupervised representation learning.
144.8 References
Lee, D. D., and Seung, H. S. “Learning the parts of objects by non-negative matrix factorization.” Nature, 401, 788-791, 1999. https://www.nature.com/articles/44565
Lee, D. D., and Seung, H. S. “Algorithms for Non-negative Matrix Factorization.” Advances in Neural Information Processing Systems (NIPS), 2000. https://proceedings.neurips.cc/paper/2000/hash/f9d1152547c0bde01830b7e8bd60024c-Abstract.html
Hoyer, P. O. “Non-negative Matrix Factorization with Sparseness Constraints.” Journal of Machine Learning Research, 5, 1457-1469, 2004. https://www.jmlr.org/papers/v5/hoyer04a.html
Gillis, N. “The Why and How of Nonnegative Matrix Factorization.” In Regularization, Optimization, Kernels, and Support Vector Machines, 2014. https://arxiv.org/abs/1401.5226
Cichocki, A., Zdunek, R., Phan, A. H., and Amari, S. “Nonnegative Matrix and Tensor Factorizations.” Wiley, 2009. https://onlinelibrary.wiley.com/doi/book/10.1002/9780470747278
Fevotte, C., and Idier, J. “Algorithms for Nonnegative Matrix Factorization with the Beta-Divergence.” Neural Computation, 23(9), 2421-2456, 2011. https://arxiv.org/abs/1010.1763
Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P., and Plemmons, R. J. “Algorithms and applications for approximate nonnegative matrix factorization.” Computational Statistics and Data Analysis, 52(1), 155-173, 2007. https://www.sciencedirect.com/science/article/pii/S0167947306004191