148 Manifold Learning
High dimensional data rarely fills the space it nominally occupies. A collection of \(64 \times 64\) grayscale images of a single object rotating under controlled lighting lives, formally, in \(\mathbb{R}^{4096}\), yet the configurations actually realized trace out a low dimensional surface parameterized by a handful of physical factors such as pose angle and illumination direction. Manifold learning is the study of how to recover that low dimensional structure from samples, and of the geometric assumptions that make recovery possible. This chapter develops the manifold hypothesis, then treats three foundational algorithms, namely Isomap, locally linear embedding, and Laplacian eigenmaps, with particular attention to the distinction between geodesic and Euclidean distance that motivates the entire enterprise.
148.1 1. The Manifold Hypothesis
148.1.1 1.1 Statement and Geometric Setting
The manifold hypothesis asserts that real world high dimensional data, such as natural images, speech spectrograms, and sensor readings, concentrates near a low dimensional manifold \(\mathcal{M}\) embedded in the ambient space \(\mathbb{R}^D\). Formally we posit a smooth \(d\) dimensional Riemannian manifold \(\mathcal{M}\) with \(d \ll D\), an embedding \(\psi : \mathcal{M} \to \mathbb{R}^D\), and a probability distribution supported on (or near) the image \(\psi(\mathcal{M})\). The observed data are samples
\[ x_i = \psi(\theta_i) + \varepsilon_i, \qquad \theta_i \in \mathcal{M}, \quad i = 1, \dots, n, \]
where \(\theta_i\) are the underlying latent coordinates and \(\varepsilon_i\) is small ambient noise. The goal of manifold learning, also called nonlinear dimensionality reduction, is to recover coordinates \(y_i \in \mathbb{R}^d\) that faithfully represent the intrinsic configuration \(\theta_i\), ideally up to a rigid motion or a smooth reparameterization.
The intrinsic dimension \(d\) is the number of degrees of freedom that generated the data. Estimating \(d\) is itself a nontrivial problem, often approached through the scaling of the number of neighbors within radius \(r\), since for samples on a \(d\) manifold this count grows as \(r^d\).
148.1.2 1.2 Why Linear Methods Are Insufficient
Principal component analysis (PCA) finds the affine subspace minimizing reconstruction error and is optimal when \(\mathcal{M}\) is itself a linear subspace. When the manifold is curved, PCA fails in a quantifiable way. Consider the canonical “Swiss roll”, a two dimensional sheet rolled into a spiral in \(\mathbb{R}^3\). Two points on opposite arms of the spiral can be close in Euclidean distance yet far apart along the surface. PCA, which sees only Euclidean structure, will fold these distant regions together. The remedy is to respect distances measured along the manifold rather than through the ambient space, which leads directly to the notion of geodesic distance.
148.2 2. Geodesic versus Euclidean Distance
148.2.1 2.1 Definitions
The Euclidean (or chordal) distance between two ambient points is the straight line length
\[ d_{\text{euc}}(x_i, x_j) = \lVert x_i - x_j \rVert_2 . \]
The geodesic distance is the length of the shortest path that stays on the manifold. For a smooth curve \(\gamma : [0,1] \to \mathcal{M}\) with \(\gamma(0) = x_i\) and \(\gamma(1) = x_j\), the length is \(L(\gamma) = \int_0^1 \lVert \dot\gamma(t) \rVert \, dt\), and
\[ d_{\mathcal{M}}(x_i, x_j) = \inf_{\gamma} L(\gamma), \]
with the infimum taken over all such paths on \(\mathcal{M}\). Because admissible paths are constrained to the surface, \(d_{\mathcal{M}}(x_i, x_j) \ge d_{\text{euc}}(x_i, x_j)\) always, with equality only when the connecting segment lies entirely within the manifold.
148.2.2 2.2 The Local Isometry Principle
The key structural fact that makes nonlinear embedding tractable is that on a sufficiently small neighborhood the manifold looks flat. For nearby points the geodesic and Euclidean distances agree to first order,
\[ d_{\mathcal{M}}(x_i, x_j) = d_{\text{euc}}(x_i, x_j)\left(1 + O\!\left(d_{\text{euc}}^2 \cdot \kappa^2\right)\right), \]
where \(\kappa\) bounds the curvature. Globally the two distances diverge, but locally they coincide. Every algorithm in this chapter exploits this local isometry: trust Euclidean distances among near neighbors, and stitch those local pieces into a global representation. The algorithms differ in how the stitching is performed.
148.3 3. Isomap
148.3.1 3.1 Algorithm
Isomap (isometric feature mapping) operationalizes the local isometry principle by approximating geodesic distances with graph shortest paths, then applying classical multidimensional scaling (MDS) to the resulting distance matrix. The three steps are as follows.
1. Neighborhood graph: connect each point to its k
nearest neighbors (or all points within radius eps);
weight each edge by Euclidean distance.
2. Geodesics: compute all-pairs shortest path distances
D_G on the graph (Dijkstra or Floyd-Warshall).
3. Embedding: apply classical MDS to D_G to obtain
d-dimensional coordinates Y.
The intuition is that a short path through many local hops approximates the true geodesic, because each hop is short enough that Euclidean and geodesic length agree, and the concatenation of local segments tracks the curved surface.
148.3.2 3.2 Classical MDS Step
Given the squared geodesic distance matrix \(D_G^{(2)}\) with entries \([D_G]_{ij}^2\), classical MDS forms the doubly centered Gram matrix
\[ B = -\tfrac{1}{2}\, H\, D_G^{(2)}\, H, \qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \]
where \(H\) is the centering matrix. If the geodesic distances were exactly realizable in \(\mathbb{R}^d\), then \(B\) would be positive semidefinite of rank \(d\). The embedding takes the top \(d\) eigenpairs \(B = U \Lambda U^\top\) and sets
\[ Y = U_d \Lambda_d^{1/2}, \]
so that \(Y Y^\top\) best approximates \(B\) in Frobenius norm. The eigenvalue spectrum of \(B\) also furnishes a diagnostic for intrinsic dimension: a sharp drop after the \(d\)th eigenvalue suggests that \(d\) coordinates suffice.
148.3.3 3.3 Theoretical Guarantee and Limitations
Tenenbaum, de Silva, and Langford established a convergence result: under sampling that is sufficiently dense on a manifold that is isometric to a convex region of \(\mathbb{R}^d\), the graph distances \(D_G\) converge to the true geodesic distances as \(n \to \infty\), and the recovered embedding converges to the true parameterization up to a rigid motion.
The convexity requirement is the principal limitation. If the parameter domain has holes or is nonconvex, shortest paths must detour around the missing region, so graph distances overestimate true geodesics and the embedding is distorted. Isomap is also a global method: every point participates in the eigendecomposition of an \(n \times n\) matrix, which is costly, and the all-pairs shortest path computation scales poorly. Finally, the neighborhood graph is fragile. Too large a \(k\) creates “short circuit” edges that jump across folds of the manifold, collapsing the geodesic structure the method is designed to preserve.
148.4 4. Locally Linear Embedding
148.4.1 4.1 Motivation
Locally linear embedding (LLE) abandons explicit geodesics in favor of a purely local linear description. The premise is that each point and its neighbors lie on or near a locally linear patch of the manifold, so each point can be reconstructed as a weighted affine combination of its neighbors, and these reconstruction weights characterize the local geometry in a way that is invariant to rotation, rescaling, and translation. Preserving the weights therefore preserves local structure, and a global embedding that honors all local weights simultaneously unfolds the manifold.
148.4.2 4.2 Two Optimization Problems
LLE solves two least squares problems in sequence. First, fix the neighbors of each \(x_i\) and find weights \(W_{ij}\) minimizing reconstruction error,
\[ \min_{W} \sum_{i=1}^{n} \left\lVert x_i - \sum_{j \in \mathcal{N}(i)} W_{ij}\, x_j \right\rVert^2 \quad \text{subject to} \quad \sum_{j} W_{ij} = 1, \;\; W_{ij} = 0 \text{ if } j \notin \mathcal{N}(i). \]
The sum to one constraint enforces translation invariance, which together with the local geometry makes the weights invariant to rigid motions and uniform scaling. For each \(i\) the optimal weights have a closed form in terms of the local Gram (covariance) matrix \(C^{(i)}_{jk} = (x_i - x_j)^\top (x_i - x_k)\),
\[ W_{ij} = \frac{\sum_k (C^{(i)})^{-1}_{jk}}{\sum_{l,m} (C^{(i)})^{-1}_{lm}}, \]
with regularization of \(C^{(i)}\) when the number of neighbors exceeds the ambient dimension.
Second, fix the weights \(W\) and find low dimensional coordinates \(Y\) that are reconstructed by the same weights,
\[ \min_{Y} \sum_{i=1}^{n} \left\lVert y_i - \sum_{j} W_{ij}\, y_j \right\rVert^2 \quad \text{subject to} \quad \tfrac{1}{n}\sum_i y_i y_i^\top = I_d, \;\; \sum_i y_i = 0 . \]
The constraints fix the scale and remove translational and rotational degeneracy, ruling out the trivial solution \(Y = 0\).
148.4.3 4.3 Eigenvector Solution
Writing the objective as a quadratic form, define the sparse matrix
\[ M = (I - W)^\top (I - W). \]
The embedding cost is \(\operatorname{tr}(Y^\top M Y)\), and under the unit covariance constraint the minimizer is given by the eigenvectors of \(M\) associated with its smallest eigenvalues. The smallest eigenvalue is zero with eigenvector \(\mathbf{1}\), corresponding to a constant shift, which is discarded. The next \(d\) eigenvectors furnish the embedding coordinates.
1. Find k nearest neighbors of each point.
2. Solve a constrained least squares per point for W.
3. Form M = (I - W)^T (I - W).
4. Embedding = eigenvectors 2..(d+1) of M (smallest
nonzero eigenvalues).
LLE is attractive because \(M\) is sparse, the optimization is convex with a unique global solution, and there are no local minima. Its weaknesses are sensitivity to the neighborhood size \(k\) and to the regularization parameter, and a tendency to distort in regions of nonuniform sampling density.
148.5 5. Laplacian Eigenmaps
148.5.1 5.1 Construction
Laplacian eigenmaps take a spectral graph theory viewpoint. Build a weighted neighborhood graph and assign similarities by a heat kernel,
\[ W_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \quad \text{for } j \in \mathcal{N}(i), \qquad W_{ij} = 0 \text{ otherwise.} \]
Let \(G\) be the diagonal degree matrix with \(G_{ii} = \sum_j W_{ij}\) and define the graph Laplacian \(L = G - W\). The embedding minimizes a smoothness penalty that places connected points close together,
\[ \min_{Y} \; \tfrac{1}{2}\sum_{i,j} W_{ij}\, \lVert y_i - y_j \rVert^2 = \operatorname{tr}(Y^\top L Y) \quad \text{subject to} \quad Y^\top G Y = I . \]
A large weight \(W_{ij}\) imposes a heavy penalty whenever the images \(y_i\) and \(y_j\) drift apart, so the optimization pulls neighbors together while the normalization \(Y^\top G Y = I\) prevents collapse.
148.5.2 5.2 Generalized Eigenproblem
The solution comes from the generalized eigenvalue problem
\[ L\, v = \lambda\, G\, v . \]
Order the eigenvalues \(0 = \lambda_0 \le \lambda_1 \le \cdots\). The eigenvector \(v_0 = \mathbf{1}\) for \(\lambda_0 = 0\) is constant and is discarded; the embedding uses \(v_1, \dots, v_d\) as coordinate functions, so \(y_i = (v_1(i), \dots, v_d(i))\).
148.5.3 5.3 The Continuous Limit
The deep justification for Laplacian eigenmaps is that the graph Laplacian \(L\) converges, as sampling density increases and the kernel bandwidth \(\sigma\) shrinks appropriately, to the Laplace, Beltrami operator \(\Delta_{\mathcal{M}}\) on the manifold. The eigenfunctions of \(\Delta_{\mathcal{M}}\) form a natural orthonormal basis of smooth functions on \(\mathcal{M}\), analogous to Fourier modes, and the low frequency eigenfunctions provide an intrinsic coordinate system. Minimizing \(\operatorname{tr}(Y^\top L Y)\) is the discrete analogue of minimizing the Dirichlet energy \(\int_{\mathcal{M}} \lVert \nabla f \rVert^2\), which selects the smoothest nonconstant functions and thereby the most slowly varying intrinsic coordinates. This connection ties the algorithm to spectral clustering, where the same Laplacian eigenvectors reveal cluster structure, and to diffusion maps, which weight the eigenfunctions to define a multiscale diffusion distance.
148.6 6. Comparison and Practical Guidance
148.6.1 6.1 A Unifying View
All three methods share a common template: build a neighborhood graph encoding local Euclidean geometry, then solve an eigenproblem whose spectrum yields global coordinates. They differ in what local quantity they preserve.
| Method | Local quantity preserved | Distance notion | Matrix |
|---|---|---|---|
| Isomap | pairwise geodesic distances | global geodesic | dense Gram \(B\) |
| LLE | reconstruction weights | implicit, local linear | sparse \(M = (I-W)^\top(I-W)\) |
| Laplacian eigenmaps | neighbor proximity | diffusion proximity | sparse Laplacian \(L\) |
Isomap is a global isometry method that tries to preserve every geodesic, so it recovers the true scale of the manifold when assumptions hold but pays an \(O(n^3)\) price and is brittle to short circuits. LLE and Laplacian eigenmaps are local methods that yield sparse eigenproblems, scale better, and degrade more gracefully, but they recover coordinates only up to an unknown scaling and do not reproduce absolute geodesic distances.
148.6.2 6.2 Choosing Parameters and Reading Results
The neighborhood size \(k\) (or radius \(\varepsilon\)) is the dominant hyperparameter across all three methods, and it encodes the assumed scale of local linearity. Too small a value disconnects the graph and fragments the embedding; too large a value introduces short circuit edges that violate the local isometry assumption. A practical workflow runs the method across a range of \(k\) and inspects stability of the embedding and of the eigenvalue gap. Out of sample extension, namely embedding new points not in the training set, requires the Nystrom method or an explicit regression onto the learned coordinates, since the base algorithms only embed the given sample.
These classical methods remain conceptually central even though modern practice often favors \(t\)-SNE and UMAP for visualization or autoencoders for learned parametric maps. The eigenmap perspective, the manifold hypothesis, and the geodesic versus Euclidean distinction continue to underpin how we reason about representation learning in deep networks, where the same low dimensional structure is presumed to organize the geometry of learned feature spaces.
148.7 References
Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2319-2323. https://www.science.org/doi/10.1126/science.290.5500.2319
Roweis, S. T., and Saul, L. K. (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), 2323-2326. https://www.science.org/doi/10.1126/science.290.5500.2323
Belkin, M., and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6), 1373-1396. https://doi.org/10.1162/089976603321780317
Coifman, R. R., and Lafon, S. (2006). Diffusion Maps. Applied and Computational Harmonic Analysis, 21(1), 5-30. https://doi.org/10.1016/j.acha.2006.04.006
Saul, L. K., and Roweis, S. T. (2003). Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research, 4, 119-155. https://www.jmlr.org/papers/v4/saul03a.html
van der Maaten, L., Postma, E., and van den Herik, J. (2009). Dimensionality Reduction: A Comparative Review. Tilburg University Technical Report TiCC-TR 2009-005. https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
Fefferman, C., Mitter, S., and Narayanan, H. (2016). Testing the Manifold Hypothesis. Journal of the American Mathematical Society, 29(4), 983-1049. https://doi.org/10.1090/jams/852