73  Nonlinear Dimensionality Reduction with t-SNE and UMAP

High dimensional data is the normal condition in modern machine learning. Word embeddings live in 300 to 4096 dimensions, single cell RNA sequencing produces expression vectors over 20000 genes, and the penultimate layer of a vision model emits feature vectors with thousands of components. We cannot see these spaces directly, and our geometric intuition, calibrated for two and three dimensions, fails badly above them. Nonlinear dimensionality reduction methods such as t-SNE and UMAP exist to bridge this gap. They take points in a high dimensional space and produce a two or three dimensional layout that a human can inspect. This chapter explains the manifold assumption that justifies these methods, the mechanics of t-SNE and UMAP, the hyperparameters that govern their behavior, the systematic ways their plots mislead, and the practical question of when to reach for them at all.

73.1 1. The Manifold Assumption

73.1.1 1.1 Why Linear Methods Are Not Enough

Principal component analysis (PCA) finds the linear subspace that captures maximal variance. If the data genuinely lives near a low dimensional hyperplane, PCA recovers it exactly and cheaply. The trouble is that real data rarely lies on a flat subspace. Consider points arranged on a two dimensional sheet that has been rolled up into a spiral in three dimensions, the classic Swiss roll. The intrinsic structure is two dimensional, but any linear projection collapses points that are far apart along the roll onto each other. PCA sees only the ambient variance and cannot unroll the sheet.

73.1.2 1.2 Manifolds and Intrinsic Dimension

The manifold assumption states that high dimensional data of interest concentrates near a low dimensional manifold \(\mathcal{M}\) embedded in the ambient space \(\mathbb{R}^D\). A \(d\) dimensional manifold is a space that looks locally like \(\mathbb{R}^d\) even though it may be curved globally. The number \(d\) is the intrinsic dimension, and the hope is that \(d \ll D\). Images of a single rotating object, for example, trace out a manifold whose intrinsic dimension equals the number of degrees of freedom of the rotation, not the number of pixels.

The practical consequence is that Euclidean distance in the ambient space is trustworthy only locally. Two points that are near each other in \(\mathbb{R}^D\) are probably near each other on \(\mathcal{M}\), but the straight line distance between two far apart points badly underestimates the geodesic distance measured along the manifold surface. Nonlinear methods exploit this by preserving local neighborhood structure and treating large distances as essentially unknown. This single design choice explains most of what follows, including the limitations.

73.2 2. How t-SNE Works

t-SNE, t-distributed Stochastic Neighbor Embedding, was introduced by van der Maaten and Hinton in 2008. Its strategy is to convert distances into probabilities, then find a low dimensional layout whose probabilities match.

73.2.1 2.1 Neighborhoods as Probability Distributions

For each high dimensional point \(x_i\), t-SNE defines a conditional probability that \(x_i\) would pick \(x_j\) as its neighbor, using a Gaussian kernel centered on \(x_i\):

\[ p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}. \]

The bandwidth \(\sigma_i\) is set separately for each point so that the effective number of neighbors, measured through the perplexity of the distribution, matches a user chosen value. This per point adaptivity is important. Dense regions get a small \(\sigma_i\) and sparse regions get a large one, so the method respects local density. The conditional probabilities are symmetrized into joint probabilities \(p_{ij} = (p_{j \mid i} + p_{i \mid j}) / 2N\).

73.2.2 2.2 The Low Dimensional Map and the Heavy Tail

In the low dimensional map, t-SNE places points \(y_i\) and measures their similarity with a Student t-distribution with one degree of freedom, a Cauchy kernel:

\[ q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}. \]

The heavy tail is the key innovation. A Gaussian in two dimensions cannot accommodate the volume of neighbors that exists in high dimensions, so moderately distant points would be forced unnaturally close, the crowding problem. The Student t tail gives distant points room to spread out, relieving the crowding.

73.2.3 2.3 The Objective and Its Optimization

t-SNE minimizes the Kullback Leibler divergence between the two distributions:

\[ \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \]

This objective is asymmetric in a revealing way. When \(p_{ij}\) is large but \(q_{ij}\) is small, meaning true neighbors are placed far apart, the penalty is severe. When \(p_{ij}\) is small but \(q_{ij}\) is large, meaning unrelated points are placed close, the penalty is mild. t-SNE therefore works hard to keep neighbors together and cares little about the placement of non neighbors. This is why local structure is faithful and global structure is not.

Optimization proceeds by gradient descent, typically with momentum and a trick called early exaggeration that multiplies the \(p_{ij}\) during the opening iterations to encourage tight, well separated clusters. Naive t-SNE costs \(O(N^2)\) per iteration. The Barnes Hut approximation reduces this to \(O(N \log N)\) by treating distant groups of points as single summarized masses, and the FFT based variant in openTSNE scales further.

1. compute pairwise affinities p_ij from input (set per-point sigma via perplexity)
2. initialize map points y_i (PCA initialization recommended)
3. for each iteration:
     compute low-dim affinities q_ij with Student-t kernel
     compute KL gradient, update y_i with momentum
     apply early exaggeration during the first ~250 steps

73.3 3. How UMAP Works

UMAP, Uniform Manifold Approximation and Projection, was introduced by McInnes, Healy, and Melville in 2018. It reaches similar visual results through a different theoretical route grounded in topology and fuzzy simplicial sets, but the practical algorithm has a clean intuition.

73.3.1 3.1 Building a Fuzzy Neighborhood Graph

UMAP first constructs a weighted \(k\) nearest neighbor graph. For each point it finds the \(k\) nearest neighbors and assigns each edge a membership strength that decays with distance:

\[ w(x_i, x_j) = \exp\left(-\frac{\max(0, d(x_i, x_j) - \rho_i)}{\sigma_i}\right), \]

where \(\rho_i\) is the distance to the nearest neighbor of \(x_i\) and \(\sigma_i\) is a per point normalizer. Subtracting \(\rho_i\) guarantees that every point is connected to its closest neighbor with full strength, which enforces local connectivity and reflects the assumption that data is uniformly distributed on the manifold. The directed edge weights are then symmetrized using a fuzzy union, \(w_{ij} = w_{i \to j} + w_{j \to i} - w_{i \to j} w_{j \to i}\).

73.3.2 3.2 Optimizing the Low Dimensional Layout

In the embedding space, UMAP models the probability that two points are connected with a smooth curve \((1 + a \lVert y_i - y_j \rVert^{2b})^{-1}\), where \(a\) and \(b\) are fit from the min_dist hyperparameter. It then minimizes a cross entropy between the high dimensional and low dimensional edge weights:

\[ \sum_{(i,j)} w_{ij} \log \frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - v_{ij}}. \]

The crucial difference from t-SNE is the second term. It carries a repulsive force for pairs that are not connected, whereas the t-SNE objective lacks an explicit repulsion of comparable form. UMAP optimizes this with stochastic gradient descent and negative sampling, attracting connected pairs along graph edges and pushing apart a small sample of random non edges each step. This sampling scheme, borrowed from the word embedding literature, is what gives UMAP its speed advantage and its tendency to preserve a little more global arrangement than t-SNE.

1. build approximate k-NN graph (via nearest-neighbor descent)
2. compute fuzzy edge weights with per-point rho and sigma
3. symmetrize via fuzzy union
4. initialize layout (spectral embedding of the graph)
5. SGD: attract along edges, repel via negative sampling

73.4 4. Hyperparameters That Matter

Neither method is parameter free, and the defaults are starting points, not answers. The single most common mistake is to run with defaults and read the output as ground truth.

73.4.1 4.1 Perplexity and n_neighbors

In t-SNE, perplexity sets the effective neighborhood size, with typical values between 5 and 50. Small perplexity emphasizes very local structure and fragments the data into many small islands. Large perplexity blends more neighbors and yields smoother, more connected layouts. The analogous knob in UMAP is n_neighbors, with a default near 15. Both parameters trade local detail against global coherence: low values reveal fine substructure but exaggerate fragmentation, high values reveal broad organization but wash out small clusters. A robust practice is to run several settings and trust only structure that persists across them.

73.4.2 4.2 min_dist, Learning Rate, and Iterations

UMAP’s min_dist controls how tightly points may pack in the embedding. Small values, near 0.0 to 0.1, produce dense clumps useful for cluster separation; larger values, near 0.5 to 0.99, spread points evenly and are better for seeing topology. In t-SNE the learning rate matters more than people expect. A learning rate that is too low leaves points stuck in a compressed ball; a common heuristic sets it to roughly \(N / 12\). Insufficient iterations produce maps that have not converged, often visible as a uniformly round blob.

73.4.3 4.3 Initialization and Reproducibility

Initialization has an outsized effect on global structure. Random initialization scrambles the relative positions of clusters between runs, while informative initialization, PCA for t-SNE and spectral embedding for UMAP, preserves more of the global arrangement and improves run to run stability. Both algorithms are stochastic, so a fixed random seed is necessary for reproducibility, and the absolute coordinates carry no meaning even with a seed fixed. Report your seed and parameters in any figure caption.

73.5 5. Common Misreadings of the Plots

The plots produced by these methods are seductive and routinely overinterpreted. The Distill article “How to Use t-SNE Effectively” demonstrates each of the following failure modes interactively, and they apply to UMAP as well.

73.5.1 5.1 Cluster Sizes Are Not Meaningful

Because t-SNE adapts its bandwidth to local density, it expands dense clusters and contracts sparse ones. The visual size of a blob therefore tells you nothing reliable about the variance or spread of that group in the original space. A tight cluster and a diffuse cluster can render at the same apparent size. Do not infer that one group is more variable than another from the area it occupies.

73.5.2 5.2 Distances Between Clusters Are Not Meaningful

Since the objectives penalize misplaced neighbors heavily and misplaced non neighbors lightly, the gaps between well separated clusters are largely arbitrary. Two clusters drawn far apart may be no more dissimilar than two drawn close together. UMAP preserves somewhat more global structure than t-SNE, but neither method licenses reading inter cluster distance as dissimilarity. If you need a faithful global metric, the embedding is the wrong tool.

73.5.3 5.3 Apparent Clusters May Be Noise

At small perplexity or low n_neighbors, both methods will carve random noise into clean looking clumps. The optimizer is good at producing visually separated groups whether or not separation exists in the data. The defense is to vary the neighborhood parameter and the seed, and to treat a cluster as real only if it survives. Validating apparent structure against independent labels, a clustering computed in the original space, or a known biological or semantic grouping is essential before drawing conclusions.

73.5.4 5.4 Topology Can Be Fictional

Filaments, loops, and bridges between clusters sometimes appear that have no counterpart in the data, and conversely genuine connected structure can be torn apart. UMAP’s topological grounding makes it somewhat more trustworthy for connectivity than t-SNE, but a published 2023 analysis by Chari and Pachter argued that single cell embeddings can distort distances so severely that the layouts retain little quantitative fidelity. The lesson is to confirm any structural claim with a method that does not depend on the two dimensional projection.

73.6 6. Visualization Versus Features

A recurring question is whether t-SNE or UMAP output should feed a downstream model as features rather than serve only as a picture.

73.6.1 6.1 Why the Output Is Usually Not Good Features

For visualization the answer leans strongly toward picture only. The embeddings are stochastic, are not defined by a stable function from input to output in standard t-SNE, distort distances by design, and depend sensitively on hyperparameters. A classifier trained on two dimensional t-SNE coordinates inherits all these instabilities, and the aggressive compression to two dimensions discards information that a model could use. As a rule, do not train production models on two or three dimensional embedding coordinates.

73.6.2 6.2 When Embedding as Features Can Be Justified

There are qualified exceptions. UMAP, unlike classic t-SNE, can embed into a moderate number of dimensions such as 10 to 50 rather than 2, and it provides a transform method that maps new points into an existing embedding, giving the out of sample capability that t-SNE lacks. In that regime UMAP can act as a nonlinear feature compressor, and it has been used as a preprocessing step before clustering, where reducing to an intermediate dimension can improve density based clustering such as HDBSCAN. Even here, validate that the compression helps the end metric rather than assuming it does, and prefer PCA or a learned autoencoder when a deterministic, invertible, or out of sample stable representation is required.

73.6.3 6.3 A Practical Decision Guide

Use t-SNE or UMAP when the goal is to look at the data, to sanity check that learned representations separate known classes, or to generate hypotheses about structure that you will then confirm by other means. Reach for PCA when you need a fast, deterministic, linear summary or a faithful sense of global variance, and for autoencoders or supervised embeddings when you need stable features for a model. Prefer UMAP over t-SNE when you have large datasets, need to embed new points, or want a slightly more trustworthy global arrangement; prefer t-SNE when fine grained local cluster structure is the priority and the dataset is modest. In all cases, run multiple hyperparameter settings, fix and report seeds, and never read cluster size, inter cluster distance, or apparent topology as quantitative fact without independent confirmation.

73.7 7. Summary

t-SNE and UMAP are powerful instruments for seeing high dimensional data, and their power rests on the manifold assumption that meaningful structure is locally Euclidean and globally curved. Both convert local neighborhoods into a graph or probability structure and then lay out points to preserve those neighborhoods, t-SNE through a KL divergence on Student t similarities and UMAP through a cross entropy on fuzzy graph memberships. They reward us with vivid pictures and punish us when we forget that the pictures distort sizes, distances, and sometimes topology. Treat their output as a hypothesis generating visualization, govern it with careful hyperparameter sweeps and fixed seeds, and look elsewhere when you need stable features or a faithful metric. Used with that discipline, they are among the most useful tools in the practitioner’s kit.

73.8 References

  1. van der Maaten, L., and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research. https://www.jmlr.org/papers/v9/vandermaaten08a.html
  2. McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426
  3. Wattenberg, M., Viegas, F., and Johnson, I. (2016). How to Use t-SNE Effectively. Distill. https://distill.pub/2016/misread-tsne/
  4. Coenen, A., and Pearce, A. Understanding UMAP. Google PAIR. https://pair-code.github.io/understanding-umap/
  5. UMAP documentation. https://umap-learn.readthedocs.io/en/latest/
  6. Poličar, P., Stražar, M., and Zupan, B. (2024). openTSNE: a modular Python library for t-SNE dimensionality reduction. Journal of Statistical Software. https://github.com/pavlin-policar/openTSNE
  7. Chari, T., and Pachter, L. (2023). The specious art of single-cell genomics. PLOS Computational Biology. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011288
  8. Kobak, D., and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications. https://www.nature.com/articles/s41467-019-13056-x