136 Spectral Clustering
Spectral clustering is a family of methods that recast clustering as a problem in linear algebra. Rather than searching directly for compact groups in the original feature space, these methods build a similarity graph over the data, compute a small number of eigenvectors of a matrix associated with that graph, and then cluster the data in the resulting low dimensional embedding. The payoff is substantial. Spectral clustering recovers clusters with complicated, non convex shapes that defeat centroid based methods such as k-means, and it comes with a clean theoretical justification through the theory of graph cuts and the spectrum of the graph Laplacian.
This chapter develops the method from the ground up. We construct similarity graphs, define the unnormalized and normalized graph Laplacians and establish their basic properties, derive the eigenvector embedding from a relaxation of the normalized cut objective, state the canonical algorithms of Shi and Malik and of Ng, Jordan, and Weiss, and finally explain why the last step of every spectral clustering algorithm is an ordinary run of k-means.
136.1 1. From Data to a Similarity Graph
Spectral clustering begins by representing the data set \(\{x_1, \dots, x_n\}\) as an undirected weighted graph \(G = (V, E)\). Each data point becomes a vertex, and edge weights \(w_{ij} \ge 0\) encode pairwise similarity, with larger weights for points that should belong to the same cluster. The collection of weights forms a symmetric adjacency matrix \(W \in \mathbb{R}^{n \times n}\) with \(w_{ij} = w_{ji}\) and \(w_{ii} = 0\).
The choice of similarity function and graph topology matters as much as any later step. Three constructions are standard.
The \(\varepsilon\) neighborhood graph connects two points whenever their distance falls below a threshold \(\varepsilon\). Because distances inside the threshold are all roughly comparable, this graph is usually treated as unweighted.
The \(k\) nearest neighbor graph connects each vertex to its \(k\) closest neighbors. Since the neighbor relation is not symmetric, one symmetrizes by connecting \(i\) and \(j\) if either is among the other’s neighbors, or alternatively only if both are. The resulting graph is sparse and adapts to varying local density, which makes it the most common choice in practice.
The fully connected graph assigns a positive weight to every pair through a similarity kernel. The Gaussian kernel
\[ w_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \]
is the usual choice. The bandwidth \(\sigma\) controls how quickly similarity decays with distance and plays a role analogous to \(\varepsilon\). Because every entry is nonzero, this construction yields a dense matrix and scales poorly, so it is reserved for small problems.
The degree of vertex \(i\) is the total weight of edges incident to it,
\[ d_i = \sum_{j=1}^{n} w_{ij}, \]
and the degree matrix \(D\) is the diagonal matrix with entries \(d_1, \dots, d_n\). The matrices \(W\) and \(D\) are the only inputs the rest of the method needs.
136.2 2. Graph Laplacians
The central object of spectral clustering is the graph Laplacian. There are several variants, and keeping them straight is essential because the normalization changes both the algorithm and its theoretical guarantees.
136.2.1 2.1 The Unnormalized Laplacian
The unnormalized Laplacian is
\[ L = D - W. \]
Its defining property is a quadratic form identity. For any vector \(f \in \mathbb{R}^n\),
\[ f^\top L f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \, (f_i - f_j)^2. \]
The derivation is short. Write \(f^\top L f = f^\top D f - f^\top W f = \sum_i d_i f_i^2 - \sum_{i,j} w_{ij} f_i f_j\), substitute \(d_i = \sum_j w_{ij}\), and complete the square. This identity has immediate consequences.
First, \(L\) is symmetric and positive semidefinite, since the sum on the right is a nonnegative combination of squares. Its eigenvalues are therefore real and nonnegative, and we order them \(0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n\).
Second, the constant vector \(\mathbf{1} = (1, \dots, 1)^\top\) satisfies \(L \mathbf{1} = 0\), so \(\lambda_1 = 0\) always. The quadratic form vanishes exactly when \(f\) is constant on each connected component, because any edge with positive weight forces \(f_i = f_j\).
This gives the key structural result. The multiplicity of the eigenvalue \(0\) equals the number of connected components of the graph, and the corresponding eigenspace is spanned by the indicator vectors of those components. If the data formed perfectly separated clusters joined by no edges, the zero eigenvectors would already reveal the cluster membership exactly. Real data produces clusters that are connected but weakly so, and spectral clustering can be read as a perturbation of this ideal case: the small but nonzero eigenvalues and their eigenvectors carry an approximate version of the indicator structure.
136.2.2 2.2 The Normalized Laplacians
Two normalized variants account for differences in vertex degree. The symmetric normalized Laplacian is
\[ L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \]
and the random walk normalized Laplacian is
\[ L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W. \]
The two are tightly related. If \(u\) is an eigenvector of \(L_{\mathrm{sym}}\) with eigenvalue \(\lambda\), then \(w = D^{-1/2} u\) is an eigenvector of \(L_{\mathrm{rw}}\) with the same eigenvalue, and vice versa. They share a spectrum. The name of \(L_{\mathrm{rw}}\) comes from the random walk on the graph whose transition matrix is \(P = D^{-1} W\); the eigenvectors of \(L_{\mathrm{rw}}\) are exactly those of \(P\), and slowly mixing directions of the walk correspond to cluster structure.
The normalized Laplacians satisfy a generalized quadratic form identity. For \(L_{\mathrm{rw}}\), the eigenvector equation \(L_{\mathrm{rw}} u = \lambda u\) is equivalent to the generalized eigenproblem
\[ L u = \lambda D u. \]
Like \(L\), both normalized Laplacians are positive semidefinite, both have smallest eigenvalue \(0\), and the multiplicity of \(0\) again counts connected components. For \(L_{\mathrm{rw}}\) the zero eigenspace is spanned by the component indicator vectors, while for \(L_{\mathrm{sym}}\) it is spanned by \(D^{1/2}\) times those indicators.
When degrees vary widely the normalized Laplacians are strongly preferred, because they correspond to a cut objective that balances clusters by volume rather than by raw count. We turn to that objective now.
136.3 3. Graph Cuts and the Normalized Cut
Clustering on a graph can be phrased as cutting it into pieces while severing as little edge weight as possible. For two disjoint vertex sets \(A\) and \(B\), define
\[ \operatorname{cut}(A, B) = \sum_{i \in A, \, j \in B} w_{ij}. \]
To partition \(V\) into \(k\) groups \(A_1, \dots, A_k\), the natural objective is the total weight removed,
\[ \operatorname{cut}(A_1, \dots, A_k) = \frac{1}{2} \sum_{l=1}^{k} \operatorname{cut}(A_l, \overline{A_l}), \]
where \(\overline{A_l}\) is the complement of \(A_l\). Minimizing this directly is a poor idea. The minimum cut often isolates a single outlier vertex, since separating one weakly connected point removes almost no weight. We need an objective that rewards balanced partitions.
Two normalizations achieve this. Let \(|A|\) denote the number of vertices in \(A\) and let \(\operatorname{vol}(A) = \sum_{i \in A} d_i\) denote its volume, the total degree of its vertices. The ratio cut and the normalized cut are
\[ \operatorname{RatioCut}(A_1, \dots, A_k) = \sum_{l=1}^{k} \frac{\operatorname{cut}(A_l, \overline{A_l})}{|A_l|}, \qquad \operatorname{Ncut}(A_1, \dots, A_k) = \sum_{l=1}^{k} \frac{\operatorname{cut}(A_l, \overline{A_l})}{\operatorname{vol}(A_l)}. \]
Dividing by size or volume penalizes tiny clusters, because a small denominator inflates the objective. The ratio cut connects to the unnormalized Laplacian and the normalized cut connects to the normalized Laplacian. The normalized cut, introduced by Shi and Malik, is the more widely used of the two, and we develop it in detail.
136.3.1 3.1 The Normalized Cut as a Trace Problem
Consider the two cluster case first, with partition \(A\) and \(\overline{A}\). Define an indicator vector \(f \in \mathbb{R}^n\) by
\[ f_i = \begin{cases} \;\;\,\sqrt{\operatorname{vol}(\overline{A}) / \operatorname{vol}(A)} & i \in A, \\[4pt] -\sqrt{\operatorname{vol}(A) / \operatorname{vol}(\overline{A})} & i \in \overline{A}. \end{cases} \]
A direct computation using the quadratic form identity shows three facts: \(f^\top L f = \operatorname{vol}(V) \cdot \operatorname{Ncut}(A, \overline{A})\), the vector \(Df\) is orthogonal to the constant vector since \((Df)^\top \mathbf{1} = 0\), and \(f^\top D f = \operatorname{vol}(V)\). Minimizing the normalized cut over all partitions is therefore equivalent to
\[ \min_{f} \; f^\top L f \quad \text{subject to} \quad f^\top D f = \operatorname{vol}(V), \;\; Df \perp \mathbf{1}, \;\; f \text{ of the discrete form above}. \]
The discreteness constraint, that \(f\) takes only two values determined by the cluster volumes, is what makes the problem NP hard. The spectral relaxation simply drops it and allows \(f\) to range over all real vectors subject to the two algebraic constraints. Substituting \(g = D^{1/2} f\) converts the relaxed problem into a standard Rayleigh quotient for \(L_{\mathrm{sym}}\), whose solution by the Rayleigh-Ritz theorem is the eigenvector of \(L_{\mathrm{sym}}\) associated with the second smallest eigenvalue. Undoing the substitution, \(f\) is the second eigenvector of \(L_{\mathrm{rw}}\), that is, the generalized eigenvector solving \(L f = \lambda D f\).
136.3.2 3.2 The Multiway Relaxation
The \(k\) way case generalizes cleanly. Encode the partition with an indicator matrix \(H \in \mathbb{R}^{n \times k}\) whose columns are scaled indicators of the clusters, normalized so that \(H^\top D H = I\). One can show
\[ \operatorname{Ncut}(A_1, \dots, A_k) = \operatorname{Tr}(H^\top L H), \]
so minimizing the normalized cut is the trace minimization
\[ \min_{H} \; \operatorname{Tr}(H^\top L H) \quad \text{subject to} \quad H^\top D H = I, \;\; H \text{ encodes a partition}. \]
Relaxing by dropping the partition constraint and setting \(T = D^{1/2} H\) yields
\[ \min_{T^\top T = I} \; \operatorname{Tr}(T^\top L_{\mathrm{sym}} T). \]
By the trace minimization version of the Rayleigh-Ritz theorem, the optimal \(T\) has as its columns the \(k\) eigenvectors of \(L_{\mathrm{sym}}\) corresponding to its \(k\) smallest eigenvalues. The matrix \(H = D^{-1/2} T\) then collects the \(k\) smallest generalized eigenvectors. These eigenvectors are the embedding that the algorithm clusters in the final step.
136.4 4. The Eigenvector Embedding
The relaxation hands us a continuous matrix \(H \in \mathbb{R}^{n \times k}\) whose entries are not cluster indicators. To recover an actual partition we read the embedding row by row. Each vertex \(i\) is mapped to the point \(y_i \in \mathbb{R}^k\) given by the \(i\) th row of the eigenvector matrix,
\[ x_i \;\longmapsto\; y_i = (u_1(i), u_2(i), \dots, u_k(i)), \]
where \(u_1, \dots, u_k\) are the chosen eigenvectors. This is the spectral embedding. Its defining virtue is that it linearizes cluster geometry. In the original space the clusters may be interlocking rings or crescents, but in the embedding the points belonging to the same cluster collapse toward a common location, and a simple geometric clusterer separates them.
The reason traces back to the ideal case of Section 2. When the graph has \(k\) connected components, the \(k\) smallest eigenvectors are linear combinations of the component indicators, and every row \(y_i\) for vertices in the same component is identical, so the embedded points sit at exactly \(k\) distinct positions. When components are merged into a single connected graph with weak inter cluster edges, matrix perturbation theory, by way of the Davis-Kahan theorem, guarantees that the eigenvectors change only slightly. The embedded points no longer coincide but form \(k\) tight clusters whose separation is controlled by the spectral gap \(\lambda_{k+1} - \lambda_k\). A large gap signals a clean cluster structure and a stable embedding.
136.5 5. The Algorithms
We can now assemble the complete procedures. The two canonical versions differ only in which Laplacian they use and whether they renormalize the embedding.
The Shi and Malik algorithm uses the random walk Laplacian.
Input: similarity matrix W, number of clusters k
1. Compute D and L = D - W.
2. Solve the generalized eigenproblem L u = lambda D u
for the k smallest eigenvalues; collect u_1..u_k as columns of U.
3. Let y_i be the i-th row of U.
4. Cluster y_1..y_n into k groups with k-means.
5. Assign x_i to the cluster of y_i.
The Ng, Jordan, and Weiss algorithm uses the symmetric Laplacian and adds a row normalization step that projects the embedding onto the unit sphere.
Input: similarity matrix W, number of clusters k
1. Compute D and L_sym = I - D^{-1/2} W D^{-1/2}.
2. Find the k smallest eigenvectors of L_sym; stack as columns of U.
3. Form T by normalizing each row of U to unit length:
T_ij = U_ij / sqrt(sum_j U_ij^2).
4. Cluster the rows of T into k groups with k-means.
5. Assign x_i to the cluster of row i.
The row normalization in step 3 compensates for the \(D^{1/2}\) factor that the symmetric Laplacian introduces into its eigenvectors, restoring the clean cluster geometry that the random walk version has automatically. In practice the random walk formulation is often the safer default, because the symmetric version can distort the embedding when degrees are very uneven, although the renormalization mitigates this.
Several practical considerations govern the outcome. The number of clusters \(k\) can be guided by the eigengap heuristic: choose \(k\) so that \(\lambda_1, \dots, \lambda_k\) are small and \(\lambda_{k+1}\) is markedly larger. The similarity graph should be sparse for efficiency, so a \(k\) nearest neighbor graph is typical, and the kernel bandwidth must be tuned, since too small a \(\sigma\) fragments the graph and too large a \(\sigma\) washes out structure. Because only the smallest eigenvectors are needed, iterative solvers such as the Lanczos method exploit sparsity and avoid a full eigendecomposition, bringing the cost well below the \(O(n^3)\) of a dense solve.
136.6 6. The Connection to k-means
Every algorithm above ends by running k-means on the embedded rows, which can seem like an arbitrary bolt on. It is not. The relationship runs in both directions and clarifies what spectral clustering actually does.
In one direction, k-means is the right tool for the final step precisely because the spectral embedding was designed to make it succeed. The relaxation produces coordinates in which same cluster points are tightly grouped around distinct centers, and that grouping is exactly the structure k-means assumes. Spectral clustering can be understood as a change of representation, a nonlinear map from the input space into the embedding, followed by ordinary k-means in the new space. The hard work is the construction of the embedding; k-means merely reads off the answer. This also explains why the embedding step is indispensable. Running k-means on the raw features would fail on non convex clusters, but running it on the spectral coordinates succeeds because the geometry has been linearized.
In the other direction, there is a formal equivalence that reveals spectral clustering as a relaxation of k-means itself. The k-means objective, minimizing within cluster sum of squared distances, can be written as a trace maximization problem
\[ \max_{H} \; \operatorname{Tr}(H^\top X^\top X H) \quad \text{subject to} \quad H^\top H = I, \;\; H \text{ a normalized cluster indicator}, \]
where \(X\) holds the data and \(H\) is a scaled indicator matrix of the same kind that appeared in the normalized cut derivation. Dropping the discrete partition constraint gives a continuous trace maximization whose solution is the top eigenvectors of the Gram matrix \(X^\top X\). This is the same relaxation move that converted the normalized cut into an eigenproblem, applied to a different matrix. Spectral clustering substitutes the graph Laplacian for the Gram matrix, which replaces the global linear geometry of k-means with the local similarity geometry of the graph. Seen this way, spectral clustering and kernel k-means are two instances of one family of trace relaxations, and they coincide for particular choices of kernel and normalization.
The practical lesson is that spectral clustering inherits both the strengths and the weaknesses of k-means in its final step. It needs \(k\) specified in advance, it is sensitive to the initialization of the centroids, and it returns a hard partition. Running k-means several times with different seeds and keeping the lowest objective, or using k-means++ initialization, addresses the seeding sensitivity in exactly the way it would for plain k-means.
136.7 7. Summary
Spectral clustering converts a clustering problem into an eigenproblem. One builds a similarity graph, forms a graph Laplacian whose quadratic form measures the weight of cut edges, and computes the eigenvectors associated with the smallest eigenvalues. These eigenvectors arise as the solution to a continuous relaxation of the normalized cut, an objective that seeks balanced partitions of low cut weight. The eigenvectors define an embedding in which clusters become geometrically simple, and a final pass of k-means recovers the partition. The method handles non convex clusters that defeat centroid based approaches, rests on the well understood theory of graph Laplacians and matrix perturbation, and connects naturally to k-means as a member of the same family of trace relaxations. Its main costs are the choice of similarity graph and bandwidth, the need to fix the number of clusters, and the eigendecomposition, which sparsity and iterative solvers keep tractable.
136.8 References
- Ulrike von Luxburg, “A Tutorial on Spectral Clustering,” Statistics and Computing, 2007. https://arxiv.org/abs/0711.0189
- Jianbo Shi and Jitendra Malik, “Normalized Cuts and Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. https://people.eecs.berkeley.edu/~malik/papers/SM-ncut.pdf
- Andrew Y. Ng, Michael I. Jordan, and Yair Weiss, “On Spectral Clustering: Analysis and an Algorithm,” Advances in Neural Information Processing Systems, 2002. https://proceedings.neurips.cc/paper/2001/hash/801272ee79cfde7fa5960571fee36b9b-Abstract.html
- Fan R. K. Chung, “Spectral Graph Theory,” American Mathematical Society, 1997. https://mathweb.ucsd.edu/~fan/research/revised.html
- Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis, “Kernel k-means, Spectral Clustering and Normalized Cuts,” ACM SIGKDD, 2004. https://www.cs.utexas.edu/~inderjit/public_papers/kdd_spectral_kernelkmeans.pdf
- Chris Ding and Xiaofeng He, “K-means Clustering via Principal Component Analysis,” International Conference on Machine Learning, 2004. https://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf