148 Manifold Learning

High dimensional data rarely fills the space it nominally occupies. A collection of $64 \times 64$ grayscale images of a single object rotating under controlled lighting lives, formally, in $\mathbb{R}^{4096}$, yet the configurations actually realized trace out a low dimensional surface parameterized by a handful of physical factors such as pose angle and illumination direction. Manifold learning is the study of how to recover that low dimensional structure from samples, and of the geometric assumptions that make recovery possible. This chapter develops the manifold hypothesis, then treats three foundational algorithms, namely Isomap, locally linear embedding, and Laplacian eigenmaps, with particular attention to the distinction between geodesic and Euclidean distance that motivates the entire enterprise.

148.1 1. The Manifold Hypothesis

148.1.1 1.1 Statement and Geometric Setting

The manifold hypothesis asserts that real world high dimensional data, such as natural images, speech spectrograms, and sensor readings, concentrates near a low dimensional manifold $\mathcal{M}$ embedded in the ambient space $\mathbb{R}^D$. Formally we posit a smooth $d$ dimensional Riemannian manifold $\mathcal{M}$ with $d \ll D$, an embedding $\psi : \mathcal{M} \to \mathbb{R}^D$, and a probability distribution supported on (or near) the image $\psi(\mathcal{M})$. The observed data are samples

\[ x_i = \psi(\theta_i) + \varepsilon_i, \qquad \theta_i \in \mathcal{M}, \quad i = 1, \dots, n, \]

where $\theta_i$ are the underlying latent coordinates and $\varepsilon_i$ is small ambient noise. The goal of manifold learning, also called nonlinear dimensionality reduction, is to recover coordinates $y_i \in \mathbb{R}^d$ that faithfully represent the intrinsic configuration $\theta_i$, ideally up to a rigid motion or a smooth reparameterization.

The intrinsic dimension $d$ is the number of degrees of freedom that generated the data. Estimating $d$ is itself a nontrivial problem, often approached through the scaling of the number of neighbors within radius $r$, since for samples on a $d$ manifold this count grows as $r^d$.

148.1.2 1.2 Why Linear Methods Are Insufficient

Principal component analysis (PCA) finds the affine subspace minimizing reconstruction error and is optimal when $\mathcal{M}$ is itself a linear subspace. When the manifold is curved, PCA fails in a quantifiable way. Consider the canonical “Swiss roll”, a two dimensional sheet rolled into a spiral in $\mathbb{R}^3$. Two points on opposite arms of the spiral can be close in Euclidean distance yet far apart along the surface. PCA, which sees only Euclidean structure, will fold these distant regions together. The remedy is to respect distances measured along the manifold rather than through the ambient space, which leads directly to the notion of geodesic distance.

148.2 2. Geodesic versus Euclidean Distance

148.2.1 2.1 Definitions

The Euclidean (or chordal) distance between two ambient points is the straight line length

\[ d_{\text{euc}}(x_i, x_j) = \lVert x_i - x_j \rVert_2 . \]

The geodesic distance is the length of the shortest path that stays on the manifold. For a smooth curve $\gamma : [0,1] \to \mathcal{M}$ with $\gamma(0) = x_i$ and $\gamma(1) = x_j$, the length is $L(\gamma) = \int_0^1 \lVert \dot\gamma(t) \rVert \, dt$, and

\[ d_{\mathcal{M}}(x_i, x_j) = \inf_{\gamma} L(\gamma), \]

with the infimum taken over all such paths on $\mathcal{M}$. Because admissible paths are constrained to the surface, $d_{\mathcal{M}}(x_i, x_j) \ge d_{\text{euc}}(x_i, x_j)$ always, with equality only when the connecting segment lies entirely within the manifold.

148.2.2 2.2 The Local Isometry Principle

The key structural fact that makes nonlinear embedding tractable is that on a sufficiently small neighborhood the manifold looks flat. For nearby points the geodesic and Euclidean distances agree to first order,

\[ d_{\mathcal{M}}(x_i, x_j) = d_{\text{euc}}(x_i, x_j)\left(1 + O\!\left(d_{\text{euc}}^2 \cdot \kappa^2\right)\right), \]

where $\kappa$ bounds the curvature. Globally the two distances diverge, but locally they coincide. Every algorithm in this chapter exploits this local isometry: trust Euclidean distances among near neighbors, and stitch those local pieces into a global representation. The algorithms differ in how the stitching is performed.

148.3 3. Isomap

148.3.1 3.1 Algorithm

Isomap (isometric feature mapping) operationalizes the local isometry principle by approximating geodesic distances with graph shortest paths, then applying classical multidimensional scaling (MDS) to the resulting distance matrix. The three steps are as follows.

1. Neighborhood graph: connect each point to its k
   nearest neighbors (or all points within radius eps);
   weight each edge by Euclidean distance.
2. Geodesics: compute all-pairs shortest path distances
   D_G on the graph (Dijkstra or Floyd-Warshall).
3. Embedding: apply classical MDS to D_G to obtain
   d-dimensional coordinates Y.

The intuition is that a short path through many local hops approximates the true geodesic, because each hop is short enough that Euclidean and geodesic length agree, and the concatenation of local segments tracks the curved surface.

148.3.2 3.2 Classical MDS Step

Given the squared geodesic distance matrix $D_G^{(2)}$ with entries $[D_G]_{ij}^2$, classical MDS forms the doubly centered Gram matrix

\[ B = -\tfrac{1}{2}\, H\, D_G^{(2)}\, H, \qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \]

where $H$ is the centering matrix. If the geodesic distances were exactly realizable in $\mathbb{R}^d$, then $B$ would be positive semidefinite of rank $d$. The embedding takes the top $d$ eigenpairs $B = U \Lambda U^\top$ and sets

\[ Y = U_d \Lambda_d^{1/2}, \]

so that $Y Y^\top$ best approximates $B$ in Frobenius norm. The eigenvalue spectrum of $B$ also furnishes a diagnostic for intrinsic dimension: a sharp drop after the $d$th eigenvalue suggests that $d$ coordinates suffice.

148.3.3 3.3 Theoretical Guarantee and Limitations

Tenenbaum, de Silva, and Langford established a convergence result: under sampling that is sufficiently dense on a manifold that is isometric to a convex region of $\mathbb{R}^d$, the graph distances $D_G$ converge to the true geodesic distances as $n \to \infty$, and the recovered embedding converges to the true parameterization up to a rigid motion.

The convexity requirement is the principal limitation. If the parameter domain has holes or is nonconvex, shortest paths must detour around the missing region, so graph distances overestimate true geodesics and the embedding is distorted. Isomap is also a global method: every point participates in the eigendecomposition of an $n \times n$ matrix, which is costly, and the all-pairs shortest path computation scales poorly. Finally, the neighborhood graph is fragile. Too large a $k$ creates “short circuit” edges that jump across folds of the manifold, collapsing the geodesic structure the method is designed to preserve.

148.3.4 3.4 Why Double Centering Recovers Coordinates

It is worth spelling out exactly why the operator $B = -\tfrac12 H D^{(2)} H$ recovers an inner-product matrix from distances alone, because the same identity underlies every metric MDS variant. Suppose points $z_1, \dots, z_n \in \mathbb{R}^d$ realize the distances exactly, $D_{ij}^2 = \lVert z_i - z_j \rVert^2$. Expanding the square,

\[ D_{ij}^2 = \lVert z_i \rVert^2 + \lVert z_j \rVert^2 - 2\, z_i^\top z_j . \]

The first two terms depend on only one index each, so they are precisely the part annihilated by centering. Writing $G_{ij} = z_i^\top z_j$ for the Gram matrix of the (centered) points $\sum_i z_i = 0$, multiply $D^{(2)}$ on both sides by $H = I - \tfrac1n \mathbf{1}\mathbf{1}^\top$. Because $H \mathbf{1} = 0$, the rank-one row and column terms $\lVert z_i\rVert^2$ and $\lVert z_j\rVert^2$ vanish, leaving

\[ -\tfrac12 H D^{(2)} H = H G H = G , \]

the last equality holding because the $z_i$ are already centered. Thus $B = G$ exactly when the distances are Euclidean, and the eigendecomposition $B = U \Lambda U^\top$ yields coordinates $Y = U_d \Lambda_d^{1/2}$ whose Gram matrix $Y Y^\top$ is the rank-$d$ truncation of $G$ that is closest in Frobenius norm (Eckart, Young). When the geodesic distances are only approximately Euclidean, $B$ acquires small negative eigenvalues; clamping them to zero and keeping the top $d$ positive eigenpairs is the natural projection onto the cone of rank-$d$ Euclidean configurations.

148.3.5 3.5 Implementation

The reference implementation builds the symmetric $k$-nearest-neighbor graph, runs Floyd, Warshall to obtain all-pairs geodesics, and applies classical MDS via a symmetric eigensolver. The data below are six points tracing an open zig-zag in the plane: an intrinsically one dimensional path that Isomap should unfold onto a line, recovering the path order and the cumulative arc length rather than the misleading straight-line chord between the endpoints.

Code

from aiinaction.ch143_isomap import fit_isomap, pairwise_distances

# Six points on an open zig-zag path embedded in 2-D (intrinsic dimension 1).
X = [
    [0.0, 0.0],
    [1.0, 0.6],
    [2.2, 0.0],
    [3.2, 0.6],
    [4.6, 0.0],
    [6.2, 0.6],
]

result = fit_isomap(X, n_components=1, n_neighbors=2)

print("Embedding (1-D coordinates):")
for i, y in enumerate(result.embedding.ravel()):
    print(f"  point {i}: {y:+.4f}")

print(f"\nTop eigenvalue:        {result.eigenvalues[0]:.4f}")
chord = pairwise_distances(X)[0, 5]
geodesic = result.geodesic_distances[0, 5]
print(f"Chord    d(0,5):       {chord:.4f}")
print(f"Geodesic d(0,5):       {geodesic:.4f}  (longer, follows the path)")

Embedding (1-D coordinates):
  point 0: -2.9372
  point 1: -2.0650
  point 2: -0.7481
  point 3: +0.4240
  point 4: +1.9114
  point 5: +3.4150

Top eigenvalue:        28.9464
Chord    d(0,5):       6.2290
Geodesic d(0,5):       6.3662  (longer, follows the path)

using AIInAction.Ch143Isomap

# Six points on an open zig-zag path embedded in 2-D (intrinsic dimension 1).
X = [
    0.0 0.0
    1.0 0.6
    2.2 0.0
    3.2 0.6
    4.6 0.0
    6.2 0.6
]

result = fit_isomap(X; n_components=1, n_neighbors=2)

println("Embedding (1-D coordinates):")
for (i, y) in enumerate(result.embedding[:, 1])
    println("  point $(i - 1): $(round(y, digits=4))")
end

println("\nTop eigenvalue:  ", round(result.eigenvalues[1], digits=4))
chord = pairwise_distances(X)[1, 6]
geodesic = result.geodesic_distances[1, 6]
println("Chord    d(0,5): ", round(chord, digits=4))
println("Geodesic d(0,5): ", round(geodesic, digits=4), "  (longer, follows the path)")

use aiinaction::ch143_isomap::{fit_isomap, pairwise_distances, Matrix};

fn main() {
    // Six points on an open zig-zag path embedded in 2-D (intrinsic dimension 1).
    let x = Matrix::from_rows(&[
        vec![0.0, 0.0],
        vec![1.0, 0.6],
        vec![2.2, 0.0],
        vec![3.2, 0.6],
        vec![4.6, 0.0],
        vec![6.2, 0.6],
    ])
    .unwrap();

    let result = fit_isomap(&x, 1, 2).unwrap();

    println!("Embedding (1-D coordinates):");
    for (i, row) in result.embedding.iter().enumerate() {
        println!("  point {}: {:+.4}", i, row[0]);
    }

    println!("\nTop eigenvalue:  {:.4}", result.eigenvalues[0]);
    let chord = pairwise_distances(&x).unwrap()[5]; // entry (0, 5)
    let geodesic = result.geodesic_distances[5];
    println!("Chord    d(0,5): {:.4}", chord);
    println!("Geodesic d(0,5): {:.4}  (longer, follows the path)", geodesic);
}

The recovered coordinates are monotone along the path and centered at the origin, and the geodesic distance between the two endpoints exceeds their Euclidean chord, exactly the behavior that lets Isomap unroll a curved manifold where PCA would fold it flat.

148.4 4. Locally Linear Embedding

148.4.1 4.1 Motivation

Locally linear embedding (LLE) abandons explicit geodesics in favor of a purely local linear description. The premise is that each point and its neighbors lie on or near a locally linear patch of the manifold, so each point can be reconstructed as a weighted affine combination of its neighbors, and these reconstruction weights characterize the local geometry in a way that is invariant to rotation, rescaling, and translation. Preserving the weights therefore preserves local structure, and a global embedding that honors all local weights simultaneously unfolds the manifold.

148.4.2 4.2 Two Optimization Problems

LLE solves two least squares problems in sequence. First, fix the neighbors of each $x_i$ and find weights $W_{ij}$ minimizing reconstruction error,

\[ \min_{W} \sum_{i=1}^{n} \left\lVert x_i - \sum_{j \in \mathcal{N}(i)} W_{ij}\, x_j \right\rVert^2 \quad \text{subject to} \quad \sum_{j} W_{ij} = 1, \;\; W_{ij} = 0 \text{ if } j \notin \mathcal{N}(i). \]

The sum to one constraint enforces translation invariance, which together with the local geometry makes the weights invariant to rigid motions and uniform scaling. For each $i$ the optimal weights have a closed form in terms of the local Gram (covariance) matrix $C^{(i)}_{jk} = (x_i - x_j)^\top (x_i - x_k)$,

\[ W_{ij} = \frac{\sum_k (C^{(i)})^{-1}_{jk}}{\sum_{l,m} (C^{(i)})^{-1}_{lm}}, \]

with regularization of $C^{(i)}$ when the number of neighbors exceeds the ambient dimension.

Second, fix the weights $W$ and find low dimensional coordinates $Y$ that are reconstructed by the same weights,

\[ \min_{Y} \sum_{i=1}^{n} \left\lVert y_i - \sum_{j} W_{ij}\, y_j \right\rVert^2 \quad \text{subject to} \quad \tfrac{1}{n}\sum_i y_i y_i^\top = I_d, \;\; \sum_i y_i = 0 . \]

The constraints fix the scale and remove translational and rotational degeneracy, ruling out the trivial solution $Y = 0$.

148.4.3 4.3 Eigenvector Solution

Writing the objective as a quadratic form, define the sparse matrix

\[ M = (I - W)^\top (I - W). \]

The embedding cost is $\operatorname{tr}(Y^\top M Y)$, and under the unit covariance constraint the minimizer is given by the eigenvectors of $M$ associated with its smallest eigenvalues. The smallest eigenvalue is zero with eigenvector $\mathbf{1}$, corresponding to a constant shift, which is discarded. The next $d$ eigenvectors furnish the embedding coordinates.

1. Find k nearest neighbors of each point.
2. Solve a constrained least squares per point for W.
3. Form M = (I - W)^T (I - W).
4. Embedding = eigenvectors 2..(d+1) of M (smallest
   nonzero eigenvalues).

LLE is attractive because $M$ is sparse, the optimization is convex with a unique global solution, and there are no local minima. Its weaknesses are sensitivity to the neighborhood size $k$ and to the regularization parameter, and a tendency to distort in regions of nonuniform sampling density.

148.5 5. Laplacian Eigenmaps

148.5.1 5.1 Construction

Laplacian eigenmaps take a spectral graph theory viewpoint. Build a weighted neighborhood graph and assign similarities by a heat kernel,

\[ W_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \quad \text{for } j \in \mathcal{N}(i), \qquad W_{ij} = 0 \text{ otherwise.} \]

Let $G$ be the diagonal degree matrix with $G_{ii} = \sum_j W_{ij}$ and define the graph Laplacian $L = G - W$. The embedding minimizes a smoothness penalty that places connected points close together,

\[ \min_{Y} \; \tfrac{1}{2}\sum_{i,j} W_{ij}\, \lVert y_i - y_j \rVert^2 = \operatorname{tr}(Y^\top L Y) \quad \text{subject to} \quad Y^\top G Y = I . \]

A large weight $W_{ij}$ imposes a heavy penalty whenever the images $y_i$ and $y_j$ drift apart, so the optimization pulls neighbors together while the normalization $Y^\top G Y = I$ prevents collapse.

148.5.2 5.2 Generalized Eigenproblem

The solution comes from the generalized eigenvalue problem

\[ L\, v = \lambda\, G\, v . \]

Order the eigenvalues $0 = \lambda_0 \le \lambda_1 \le \cdots$. The eigenvector $v_0 = \mathbf{1}$ for $\lambda_0 = 0$ is constant and is discarded; the embedding uses $v_1, \dots, v_d$ as coordinate functions, so $y_i = (v_1(i), \dots, v_d(i))$.

148.5.3 5.3 The Continuous Limit

The deep justification for Laplacian eigenmaps is that the graph Laplacian $L$ converges, as sampling density increases and the kernel bandwidth $\sigma$ shrinks appropriately, to the Laplace, Beltrami operator $\Delta_{\mathcal{M}}$ on the manifold. The eigenfunctions of $\Delta_{\mathcal{M}}$ form a natural orthonormal basis of smooth functions on $\mathcal{M}$, analogous to Fourier modes, and the low frequency eigenfunctions provide an intrinsic coordinate system. Minimizing $\operatorname{tr}(Y^\top L Y)$ is the discrete analogue of minimizing the Dirichlet energy $\int_{\mathcal{M}} \lVert \nabla f \rVert^2$, which selects the smoothest nonconstant functions and thereby the most slowly varying intrinsic coordinates. This connection ties the algorithm to spectral clustering, where the same Laplacian eigenvectors reveal cluster structure, and to diffusion maps, which weight the eigenfunctions to define a multiscale diffusion distance.

148.6 6. Comparison and Practical Guidance

148.6.1 6.1 A Unifying View

All three methods share a common template: build a neighborhood graph encoding local Euclidean geometry, then solve an eigenproblem whose spectrum yields global coordinates. They differ in what local quantity they preserve.

Method	Local quantity preserved	Distance notion	Matrix
Isomap	pairwise geodesic distances	global geodesic	dense Gram $B$
LLE	reconstruction weights	implicit, local linear	sparse $M = (I-W)^\top(I-W)$
Laplacian eigenmaps	neighbor proximity	diffusion proximity	sparse Laplacian $L$

Isomap is a global isometry method that tries to preserve every geodesic, so it recovers the true scale of the manifold when assumptions hold but pays an $O(n^3)$ price and is brittle to short circuits. LLE and Laplacian eigenmaps are local methods that yield sparse eigenproblems, scale better, and degrade more gracefully, but they recover coordinates only up to an unknown scaling and do not reproduce absolute geodesic distances.

148.6.2 6.2 Choosing Parameters and Reading Results

The neighborhood size $k$ (or radius $\varepsilon$) is the dominant hyperparameter across all three methods, and it encodes the assumed scale of local linearity. Too small a value disconnects the graph and fragments the embedding; too large a value introduces short circuit edges that violate the local isometry assumption. A practical workflow runs the method across a range of $k$ and inspects stability of the embedding and of the eigenvalue gap. Out of sample extension, namely embedding new points not in the training set, requires the Nystrom method or an explicit regression onto the learned coordinates, since the base algorithms only embed the given sample.

These classical methods remain conceptually central even though modern practice often favors $t$-SNE and UMAP for visualization or autoencoders for learned parametric maps. The eigenmap perspective, the manifold hypothesis, and the geodesic versus Euclidean distinction continue to underpin how we reason about representation learning in deep networks, where the same low dimensional structure is presumed to organize the geometry of learned feature spaces.

148.7 References

Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2319-2323. https://www.science.org/doi/10.1126/science.290.5500.2319
Roweis, S. T., and Saul, L. K. (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), 2323-2326. https://www.science.org/doi/10.1126/science.290.5500.2323
Belkin, M., and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6), 1373-1396. https://doi.org/10.1162/089976603321780317
Coifman, R. R., and Lafon, S. (2006). Diffusion Maps. Applied and Computational Harmonic Analysis, 21(1), 5-30. https://doi.org/10.1016/j.acha.2006.04.006
Saul, L. K., and Roweis, S. T. (2003). Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research, 4, 119-155. https://www.jmlr.org/papers/v4/saul03a.html
van der Maaten, L., Postma, E., and van den Herik, J. (2009). Dimensionality Reduction: A Comparative Review. Tilburg University Technical Report TiCC-TR 2009-005. https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
Fefferman, C., Mitter, S., and Narayanan, H. (2016). Testing the Manifold Hypothesis. Journal of the American Mathematical Society, 29(4), 983-1049. https://doi.org/10.1090/jams/852

# Manifold Learning High dimensional data rarely fills the space it nominally occupies. A collection of $64 \times 64$ grayscale images of a single object rotating under controlled lighting lives, formally, in $\mathbb{R}^{4096}$, yet the configurations actually realized trace out a low dimensional surface parameterized by a handful of physical factors such as pose angle and illumination direction. Manifold learning is the study of how to recover that low dimensional structure from samples, and of the geometric assumptions that make recovery possible. This chapter develops the manifold hypothesis, then treats three foundational algorithms, namely Isomap, locally linear embedding, and Laplacian eigenmaps, with particular attention to the distinction between geodesic and Euclidean distance that motivates the entire enterprise. ## 1. The Manifold Hypothesis ### 1.1 Statement and Geometric Setting The manifold hypothesis asserts that real world high dimensional data, such as natural images, speech spectrograms, and sensor readings, concentrates near a low dimensional manifold $\mathcal{M}$ embedded in the ambient space $\mathbb{R}^D$. Formally we posit a smooth $d$ dimensional Riemannian manifold $\mathcal{M}$ with $d \ll D$, an embedding $\psi : \mathcal{M} \to \mathbb{R}^D$, and a probability distribution supported on (or near) the image $\psi(\mathcal{M})$. The observed data are samples $$ x_i = \psi(\theta_i) + \varepsilon_i, \qquad \theta_i \in \mathcal{M}, \quad i = 1, \dots, n, $$ where $\theta_i$ are the underlying latent coordinates and $\varepsilon_i$ is small ambient noise. The goal of manifold learning, also called nonlinear dimensionality reduction, is to recover coordinates $y_i \in \mathbb{R}^d$ that faithfully represent the intrinsic configuration $\theta_i$, ideally up to a rigid motion or a smooth reparameterization. The intrinsic dimension $d$ is the number of degrees of freedom that generated the data. Estimating $d$ is itself a nontrivial problem, often approached through the scaling of the number of neighbors within radius $r$, since for samples on a $d$ manifold this count grows as $r^d$. ### 1.2 Why Linear Methods Are Insufficient Principal component analysis (PCA) finds the affine subspace minimizing reconstruction error and is optimal when $\mathcal{M}$ is itself a linear subspace. When the manifold is curved, PCA fails in a quantifiable way. Consider the canonical "Swiss roll", a two dimensional sheet rolled into a spiral in $\mathbb{R}^3$. Two points on opposite arms of the spiral can be close in Euclidean distance yet far apart along the surface. PCA, which sees only Euclidean structure, will fold these distant regions together. The remedy is to respect distances measured along the manifold rather than through the ambient space, which leads directly to the notion of geodesic distance. ## 2. Geodesic versus Euclidean Distance ### 2.1 Definitions The Euclidean (or chordal) distance between two ambient points is the straight line length $$ d_{\text{euc}}(x_i, x_j) = \lVert x_i - x_j \rVert_2 . $$ The geodesic distance is the length of the shortest path that stays on the manifold. For a smooth curve $\gamma : [0,1] \to \mathcal{M}$ with $\gamma(0) = x_i$ and $\gamma(1) = x_j$, the length is $L(\gamma) = \int_0^1 \lVert \dot\gamma(t) \rVert \, dt$, and $$ d_{\mathcal{M}}(x_i, x_j) = \inf_{\gamma} L(\gamma), $$ with the infimum taken over all such paths on $\mathcal{M}$. Because admissible paths are constrained to the surface, $d_{\mathcal{M}}(x_i, x_j) \ge d_{\text{euc}}(x_i, x_j)$ always, with equality only when the connecting segment lies entirely within the manifold. ### 2.2 The Local Isometry Principle The key structural fact that makes nonlinear embedding tractable is that on a sufficiently small neighborhood the manifold looks flat. For nearby points the geodesic and Euclidean distances agree to first order, $$ d_{\mathcal{M}}(x_i, x_j) = d_{\text{euc}}(x_i, x_j)\left(1 + O\!\left(d_{\text{euc}}^2 \cdot \kappa^2\right)\right), $$ where $\kappa$ bounds the curvature. Globally the two distances diverge, but locally they coincide. Every algorithm in this chapter exploits this local isometry: trust Euclidean distances among near neighbors, and stitch those local pieces into a global representation. The algorithms differ in how the stitching is performed. ## 3. Isomap ### 3.1 Algorithm Isomap (isometric feature mapping) operationalizes the local isometry principle by approximating geodesic distances with graph shortest paths, then applying classical multidimensional scaling (MDS) to the resulting distance matrix. The three steps are as follows. ```text 1. Neighborhood graph: connect each point to its k nearest neighbors (or all points within radius eps); weight each edge by Euclidean distance. 2. Geodesics: compute all-pairs shortest path distances D_G on the graph (Dijkstra or Floyd-Warshall). 3. Embedding: apply classical MDS to D_G to obtain d-dimensional coordinates Y. ``` The intuition is that a short path through many local hops approximates the true geodesic, because each hop is short enough that Euclidean and geodesic length agree, and the concatenation of local segments tracks the curved surface. ### 3.2 Classical MDS Step Given the squared geodesic distance matrix $D_G^{(2)}$ with entries $[D_G]_{ij}^2$, classical MDS forms the doubly centered Gram matrix $$ B = -\tfrac{1}{2}\, H\, D_G^{(2)}\, H, \qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, $$ where $H$ is the centering matrix. If the geodesic distances were exactly realizable in $\mathbb{R}^d$, then $B$ would be positive semidefinite of rank $d$. The embedding takes the top $d$ eigenpairs $B = U \Lambda U^\top$ and sets $$ Y = U_d \Lambda_d^{1/2}, $$ so that $Y Y^\top$ best approximates $B$ in Frobenius norm. The eigenvalue spectrum of $B$ also furnishes a diagnostic for intrinsic dimension: a sharp drop after the $d$th eigenvalue suggests that $d$ coordinates suffice. ### 3.3 Theoretical Guarantee and Limitations Tenenbaum, de Silva, and Langford established a convergence result: under sampling that is sufficiently dense on a manifold that is isometric to a convex region of $\mathbb{R}^d$, the graph distances $D_G$ converge to the true geodesic distances as $n \to \infty$, and the recovered embedding converges to the true parameterization up to a rigid motion. The convexity requirement is the principal limitation. If the parameter domain has holes or is nonconvex, shortest paths must detour around the missing region, so graph distances overestimate true geodesics and the embedding is distorted. Isomap is also a global method: every point participates in the eigendecomposition of an $n \times n$ matrix, which is costly, and the all-pairs shortest path computation scales poorly. Finally, the neighborhood graph is fragile. Too large a $k$ creates "short circuit" edges that jump across folds of the manifold, collapsing the geodesic structure the method is designed to preserve. ### 3.4 Why Double Centering Recovers Coordinates It is worth spelling out exactly why the operator $B = -\tfrac12 H D^{(2)} H$ recovers an inner-product matrix from distances alone, because the same identity underlies every metric MDS variant. Suppose points $z_1, \dots, z_n \in \mathbb{R}^d$ realize the distances exactly, $D_{ij}^2 = \lVert z_i - z_j \rVert^2$. Expanding the square, $$ D_{ij}^2 = \lVert z_i \rVert^2 + \lVert z_j \rVert^2 - 2\, z_i^\top z_j . $$ The first two terms depend on only one index each, so they are precisely the part annihilated by centering. Writing $G_{ij} = z_i^\top z_j$ for the Gram matrix of the (centered) points $\sum_i z_i = 0$, multiply $D^{(2)}$ on both sides by $H = I - \tfrac1n \mathbf{1}\mathbf{1}^\top$. Because $H \mathbf{1} = 0$, the rank-one row and column terms $\lVert z_i\rVert^2$ and $\lVert z_j\rVert^2$ vanish, leaving $$ -\tfrac12 H D^{(2)} H = H G H = G , $$ the last equality holding because the $z_i$ are already centered. Thus $B = G$ exactly when the distances are Euclidean, and the eigendecomposition $B = U \Lambda U^\top$ yields coordinates $Y = U_d \Lambda_d^{1/2}$ whose Gram matrix $Y Y^\top$ is the rank-$d$ truncation of $G$ that is closest in Frobenius norm (Eckart, Young). When the geodesic distances are only approximately Euclidean, $B$ acquires small negative eigenvalues; clamping them to zero and keeping the top $d$ positive eigenpairs is the natural projection onto the cone of rank-$d$ Euclidean configurations. ### 3.5 Implementation The reference implementation builds the symmetric $k$-nearest-neighbor graph, runs Floyd, Warshall to obtain all-pairs geodesics, and applies classical MDS via a symmetric eigensolver. The data below are six points tracing an open zig-zag in the plane: an intrinsically one dimensional path that Isomap should unfold onto a line, recovering the path order and the cumulative arc length rather than the misleading straight-line chord between the endpoints. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch143_isomap import fit_isomap, pairwise_distances # Six points on an open zig-zag path embedded in 2-D (intrinsic dimension 1). X = [ [0.0, 0.0], [1.0, 0.6], [2.2, 0.0], [3.2, 0.6], [4.6, 0.0], [6.2, 0.6], ] result = fit_isomap(X, n_components=1, n_neighbors=2) print("Embedding (1-D coordinates):") for i, y in enumerate(result.embedding.ravel()): print(f" point {i}: {y:+.4f}") print(f"\nTop eigenvalue: {result.eigenvalues[0]:.4f}") chord = pairwise_distances(X)[0, 5] geodesic = result.geodesic_distances[0, 5] print(f"Chord d(0,5): {chord:.4f}") print(f"Geodesic d(0,5): {geodesic:.4f} (longer, follows the path)") ``` ## Julia ```julia using AIInAction.Ch143Isomap # Six points on an open zig-zag path embedded in 2-D (intrinsic dimension 1). X = [ 0.0 0.0 1.0 0.6 2.2 0.0 3.2 0.6 4.6 0.0 6.2 0.6 ] result = fit_isomap(X; n_components=1, n_neighbors=2) println("Embedding (1-D coordinates):") for (i, y) in enumerate(result.embedding[:, 1]) println(" point $(i - 1): $(round(y, digits=4))") end println("\nTop eigenvalue: ", round(result.eigenvalues[1], digits=4)) chord = pairwise_distances(X)[1, 6] geodesic = result.geodesic_distances[1, 6] println("Chord d(0,5): ", round(chord, digits=4)) println("Geodesic d(0,5): ", round(geodesic, digits=4), " (longer, follows the path)") ``` ## Rust ```rust use aiinaction::ch143_isomap::{fit_isomap, pairwise_distances, Matrix}; fn main() { // Six points on an open zig-zag path embedded in 2-D (intrinsic dimension 1). let x = Matrix::from_rows(&[ vec![0.0, 0.0], vec![1.0, 0.6], vec![2.2, 0.0], vec![3.2, 0.6], vec![4.6, 0.0], vec![6.2, 0.6], ]) .unwrap(); let result = fit_isomap(&x, 1, 2).unwrap(); println!("Embedding (1-D coordinates):"); for (i, row) in result.embedding.iter().enumerate() { println!(" point {}: {:+.4}", i, row[0]); } println!("\nTop eigenvalue: {:.4}", result.eigenvalues[0]); let chord = pairwise_distances(&x).unwrap()[5]; // entry (0, 5) let geodesic = result.geodesic_distances[5]; println!("Chord d(0,5): {:.4}", chord); println!("Geodesic d(0,5): {:.4} (longer, follows the path)", geodesic); } ``` ::: The recovered coordinates are monotone along the path and centered at the origin, and the geodesic distance between the two endpoints exceeds their Euclidean chord, exactly the behavior that lets Isomap unroll a curved manifold where PCA would fold it flat. ## 4. Locally Linear Embedding ### 4.1 Motivation Locally linear embedding (LLE) abandons explicit geodesics in favor of a purely local linear description. The premise is that each point and its neighbors lie on or near a locally linear patch of the manifold, so each point can be reconstructed as a weighted affine combination of its neighbors, and these reconstruction weights characterize the local geometry in a way that is invariant to rotation, rescaling, and translation. Preserving the weights therefore preserves local structure, and a global embedding that honors all local weights simultaneously unfolds the manifold. ### 4.2 Two Optimization Problems LLE solves two least squares problems in sequence. First, fix the neighbors of each $x_i$ and find weights $W_{ij}$ minimizing reconstruction error, $$ \min_{W} \sum_{i=1}^{n} \left\lVert x_i - \sum_{j \in \mathcal{N}(i)} W_{ij}\, x_j \right\rVert^2 \quad \text{subject to} \quad \sum_{j} W_{ij} = 1, \;\; W_{ij} = 0 \text{ if } j \notin \mathcal{N}(i). $$ The sum to one constraint enforces translation invariance, which together with the local geometry makes the weights invariant to rigid motions and uniform scaling. For each $i$ the optimal weights have a closed form in terms of the local Gram (covariance) matrix $C^{(i)}_{jk} = (x_i - x_j)^\top (x_i - x_k)$, $$ W_{ij} = \frac{\sum_k (C^{(i)})^{-1}_{jk}}{\sum_{l,m} (C^{(i)})^{-1}_{lm}}, $$ with regularization of $C^{(i)}$ when the number of neighbors exceeds the ambient dimension. Second, fix the weights $W$ and find low dimensional coordinates $Y$ that are reconstructed by the same weights, $$ \min_{Y} \sum_{i=1}^{n} \left\lVert y_i - \sum_{j} W_{ij}\, y_j \right\rVert^2 \quad \text{subject to} \quad \tfrac{1}{n}\sum_i y_i y_i^\top = I_d, \;\; \sum_i y_i = 0 . $$ The constraints fix the scale and remove translational and rotational degeneracy, ruling out the trivial solution $Y = 0$. ### 4.3 Eigenvector Solution Writing the objective as a quadratic form, define the sparse matrix $$ M = (I - W)^\top (I - W). $$ The embedding cost is $\operatorname{tr}(Y^\top M Y)$, and under the unit covariance constraint the minimizer is given by the eigenvectors of $M$ associated with its smallest eigenvalues. The smallest eigenvalue is zero with eigenvector $\mathbf{1}$, corresponding to a constant shift, which is discarded. The next $d$ eigenvectors furnish the embedding coordinates. ```text 1. Find k nearest neighbors of each point. 2. Solve a constrained least squares per point for W. 3. Form M = (I - W)^T (I - W). 4. Embedding = eigenvectors 2..(d+1) of M (smallest nonzero eigenvalues). ``` LLE is attractive because $M$ is sparse, the optimization is convex with a unique global solution, and there are no local minima. Its weaknesses are sensitivity to the neighborhood size $k$ and to the regularization parameter, and a tendency to distort in regions of nonuniform sampling density. ## 5. Laplacian Eigenmaps ### 5.1 Construction Laplacian eigenmaps take a spectral graph theory viewpoint. Build a weighted neighborhood graph and assign similarities by a heat kernel, $$ W_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \quad \text{for } j \in \mathcal{N}(i), \qquad W_{ij} = 0 \text{ otherwise.} $$ Let $G$ be the diagonal degree matrix with $G_{ii} = \sum_j W_{ij}$ and define the graph Laplacian $L = G - W$. The embedding minimizes a smoothness penalty that places connected points close together, $$ \min_{Y} \; \tfrac{1}{2}\sum_{i,j} W_{ij}\, \lVert y_i - y_j \rVert^2 = \operatorname{tr}(Y^\top L Y) \quad \text{subject to} \quad Y^\top G Y = I . $$ A large weight $W_{ij}$ imposes a heavy penalty whenever the images $y_i$ and $y_j$ drift apart, so the optimization pulls neighbors together while the normalization $Y^\top G Y = I$ prevents collapse. ### 5.2 Generalized Eigenproblem The solution comes from the generalized eigenvalue problem $$ L\, v = \lambda\, G\, v . $$ Order the eigenvalues $0 = \lambda_0 \le \lambda_1 \le \cdots$. The eigenvector $v_0 = \mathbf{1}$ for $\lambda_0 = 0$ is constant and is discarded; the embedding uses $v_1, \dots, v_d$ as coordinate functions, so $y_i = (v_1(i), \dots, v_d(i))$. ### 5.3 The Continuous Limit The deep justification for Laplacian eigenmaps is that the graph Laplacian $L$ converges, as sampling density increases and the kernel bandwidth $\sigma$ shrinks appropriately, to the Laplace, Beltrami operator $\Delta_{\mathcal{M}}$ on the manifold. The eigenfunctions of $\Delta_{\mathcal{M}}$ form a natural orthonormal basis of smooth functions on $\mathcal{M}$, analogous to Fourier modes, and the low frequency eigenfunctions provide an intrinsic coordinate system. Minimizing $\operatorname{tr}(Y^\top L Y)$ is the discrete analogue of minimizing the Dirichlet energy $\int_{\mathcal{M}} \lVert \nabla f \rVert^2$, which selects the smoothest nonconstant functions and thereby the most slowly varying intrinsic coordinates. This connection ties the algorithm to spectral clustering, where the same Laplacian eigenvectors reveal cluster structure, and to diffusion maps, which weight the eigenfunctions to define a multiscale diffusion distance. ## 6. Comparison and Practical Guidance ### 6.1 A Unifying View All three methods share a common template: build a neighborhood graph encoding local Euclidean geometry, then solve an eigenproblem whose spectrum yields global coordinates. They differ in what local quantity they preserve. | Method | Local quantity preserved | Distance notion | Matrix | |---|---|---|---| | Isomap | pairwise geodesic distances | global geodesic | dense Gram $B$ | | LLE | reconstruction weights | implicit, local linear | sparse $M = (I-W)^\top(I-W)$ | | Laplacian eigenmaps | neighbor proximity | diffusion proximity | sparse Laplacian $L$ | Isomap is a global isometry method that tries to preserve every geodesic, so it recovers the true scale of the manifold when assumptions hold but pays an $O(n^3)$ price and is brittle to short circuits. LLE and Laplacian eigenmaps are local methods that yield sparse eigenproblems, scale better, and degrade more gracefully, but they recover coordinates only up to an unknown scaling and do not reproduce absolute geodesic distances. ### 6.2 Choosing Parameters and Reading Results The neighborhood size $k$ (or radius $\varepsilon$) is the dominant hyperparameter across all three methods, and it encodes the assumed scale of local linearity. Too small a value disconnects the graph and fragments the embedding; too large a value introduces short circuit edges that violate the local isometry assumption. A practical workflow runs the method across a range of $k$ and inspects stability of the embedding and of the eigenvalue gap. Out of sample extension, namely embedding new points not in the training set, requires the Nystrom method or an explicit regression onto the learned coordinates, since the base algorithms only embed the given sample. These classical methods remain conceptually central even though modern practice often favors $t$-SNE and UMAP for visualization or autoencoders for learned parametric maps. The eigenmap perspective, the manifold hypothesis, and the geodesic versus Euclidean distinction continue to underpin how we reason about representation learning in deep networks, where the same low dimensional structure is presumed to organize the geometry of learned feature spaces. ## References 1. Tenenbaum, J. B., de Silva, V., and Langford, J. C. (2000). A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2319-2323. https://www.science.org/doi/10.1126/science.290.5500.2319 2. Roweis, S. T., and Saul, L. K. (2000). Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), 2323-2326. https://www.science.org/doi/10.1126/science.290.5500.2323 3. Belkin, M., and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6), 1373-1396. https://doi.org/10.1162/089976603321780317 4. Coifman, R. R., and Lafon, S. (2006). Diffusion Maps. Applied and Computational Harmonic Analysis, 21(1), 5-30. https://doi.org/10.1016/j.acha.2006.04.006 5. Saul, L. K., and Roweis, S. T. (2003). Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research, 4, 119-155. https://www.jmlr.org/papers/v4/saul03a.html 6. van der Maaten, L., Postma, E., and van den Herik, J. (2009). Dimensionality Reduction: A Comparative Review. Tilburg University Technical Report TiCC-TR 2009-005. https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf 7. Fefferman, C., Mitter, S., and Narayanan, H. (2016). Testing the Manifold Hypothesis. Journal of the American Mathematical Society, 29(4), 983-1049. https://doi.org/10.1090/jams/852