73 Nonlinear Dimensionality Reduction with t-SNE and UMAP

High dimensional data is the normal condition in modern machine learning. Word embeddings live in 300 to 4096 dimensions, single cell RNA sequencing produces expression vectors over 20000 genes, and the penultimate layer of a vision model emits feature vectors with thousands of components. We cannot see these spaces directly, and our geometric intuition, calibrated for two and three dimensions, fails badly above them. Nonlinear dimensionality reduction methods such as t-SNE and UMAP exist to bridge this gap. They take points in a high dimensional space and produce a two or three dimensional layout that a human can inspect. This chapter explains the manifold assumption that justifies these methods, the mechanics of t-SNE and UMAP, the hyperparameters that govern their behavior, the systematic ways their plots mislead, and the practical question of when to reach for them at all.

73.1 1. The Manifold Assumption

73.1.1 1.1 Why Linear Methods Are Not Enough

Principal component analysis (PCA) finds the linear subspace that captures maximal variance. If the data genuinely lives near a low dimensional hyperplane, PCA recovers it exactly and cheaply. The trouble is that real data rarely lies on a flat subspace. Consider points arranged on a two dimensional sheet that has been rolled up into a spiral in three dimensions, the classic Swiss roll. The intrinsic structure is two dimensional, but any linear projection collapses points that are far apart along the roll onto each other. PCA sees only the ambient variance and cannot unroll the sheet.

73.1.2 1.2 Manifolds and Intrinsic Dimension

The manifold assumption states that high dimensional data of interest concentrates near a low dimensional manifold $\mathcal{M}$ embedded in the ambient space $\mathbb{R}^D$. A $d$ dimensional manifold is a space that looks locally like $\mathbb{R}^d$ even though it may be curved globally. The number $d$ is the intrinsic dimension, and the hope is that $d \ll D$. Images of a single rotating object, for example, trace out a manifold whose intrinsic dimension equals the number of degrees of freedom of the rotation, not the number of pixels.

The practical consequence is that Euclidean distance in the ambient space is trustworthy only locally. Two points that are near each other in $\mathbb{R}^D$ are probably near each other on $\mathcal{M}$, but the straight line distance between two far apart points badly underestimates the geodesic distance measured along the manifold surface. Nonlinear methods exploit this by preserving local neighborhood structure and treating large distances as essentially unknown. This single design choice explains most of what follows, including the limitations.

73.2 2. How t-SNE Works

t-SNE, t-distributed Stochastic Neighbor Embedding, was introduced by van der Maaten and Hinton in 2008. Its strategy is to convert distances into probabilities, then find a low dimensional layout whose probabilities match.

73.2.1 2.1 Neighborhoods as Probability Distributions

For each high dimensional point $x_i$, t-SNE defines a conditional probability that $x_i$ would pick $x_j$ as its neighbor, using a Gaussian kernel centered on $x_i$:

\[ p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}. \]

The bandwidth $\sigma_i$ is set separately for each point so that the effective number of neighbors, measured through the perplexity of the distribution, matches a user chosen value. This per point adaptivity is important. Dense regions get a small $\sigma_i$ and sparse regions get a large one, so the method respects local density. The conditional probabilities are symmetrized into joint probabilities $p_{ij} = (p_{j \mid i} + p_{i \mid j}) / 2N$.

73.2.2 2.2 The Low Dimensional Map and the Heavy Tail

In the low dimensional map, t-SNE places points $y_i$ and measures their similarity with a Student t-distribution with one degree of freedom, a Cauchy kernel:

\[ q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}. \]

The heavy tail is the key innovation. A Gaussian in two dimensions cannot accommodate the volume of neighbors that exists in high dimensions, so moderately distant points would be forced unnaturally close, the crowding problem. The Student t tail gives distant points room to spread out, relieving the crowding.

73.2.3 2.3 The Objective and Its Optimization

t-SNE minimizes the Kullback Leibler divergence between the two distributions:

\[ \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \]

This objective is asymmetric in a revealing way. When $p_{ij}$ is large but $q_{ij}$ is small, meaning true neighbors are placed far apart, the penalty is severe. When $p_{ij}$ is small but $q_{ij}$ is large, meaning unrelated points are placed close, the penalty is mild. t-SNE therefore works hard to keep neighbors together and cares little about the placement of non neighbors. This is why local structure is faithful and global structure is not.

Optimization proceeds by gradient descent, typically with momentum and a trick called early exaggeration that multiplies the $p_{ij}$ during the opening iterations to encourage tight, well separated clusters. The gradient itself has a clean physical reading,

\[ \frac{\partial \mathrm{KL}}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\,\bigl(1 + \lVert y_i - y_j \rVert^2\bigr)^{-1} (y_i - y_j), \]

a sum of springs: pairs whose modeled closeness $q_{ij}$ falls short of their true affinity $p_{ij}$ attract, and pairs that are too close repel. Naive t-SNE costs $O(N^2)$ per iteration. The Barnes Hut approximation reduces this to $O(N \log N)$ by treating distant groups of points as single summarized masses, and the FFT based variant in openTSNE scales further.

1. compute pairwise affinities p_ij from input (set per-point sigma via perplexity)
2. initialize map points y_i (PCA initialization recommended)
3. for each iteration:
     compute low-dim affinities q_ij with Student-t kernel
     compute KL gradient, update y_i with momentum
     apply early exaggeration during the first ~250 steps

73.3 3. How UMAP Works

UMAP, Uniform Manifold Approximation and Projection, was introduced by McInnes, Healy, and Melville in 2018. It reaches similar visual results through a different theoretical route grounded in topology and fuzzy simplicial sets, but the practical algorithm has a clean intuition.

73.3.1 3.1 Building a Fuzzy Neighborhood Graph

UMAP first constructs a weighted $k$ nearest neighbor graph. For each point it finds the $k$ nearest neighbors and assigns each edge a membership strength that decays with distance:

\[ w(x_i, x_j) = \exp\left(-\frac{\max(0, d(x_i, x_j) - \rho_i)}{\sigma_i}\right), \]

where $\rho_i$ is the distance to the nearest neighbor of $x_i$ and $\sigma_i$ is a per point normalizer. Subtracting $\rho_i$ guarantees that every point is connected to its closest neighbor with full strength, which enforces local connectivity and reflects the assumption that data is uniformly distributed on the manifold. The directed edge weights are then symmetrized using a fuzzy union, $w_{ij} = w_{i \to j} + w_{j \to i} - w_{i \to j} w_{j \to i}$.

73.3.2 3.2 Optimizing the Low Dimensional Layout

In the embedding space, UMAP models the probability that two points are connected with a smooth curve $(1 + a \lVert y_i - y_j \rVert^{2b})^{-1}$, where $a$ and $b$ are fit from the min_dist hyperparameter. It then minimizes a cross entropy between the high dimensional and low dimensional edge weights:

\[ \sum_{(i,j)} w_{ij} \log \frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - v_{ij}}. \]

The crucial difference from t-SNE is the second term. It carries a repulsive force for pairs that are not connected, whereas the t-SNE objective lacks an explicit repulsion of comparable form. UMAP optimizes this with stochastic gradient descent and negative sampling, attracting connected pairs along graph edges and pushing apart a small sample of random non edges each step. This sampling scheme, borrowed from the word embedding literature, is what gives UMAP its speed advantage and its tendency to preserve a little more global arrangement than t-SNE.

1. build approximate k-NN graph (via nearest-neighbor descent)
2. compute fuzzy edge weights with per-point rho and sigma
3. symmetrize via fuzzy union
4. initialize layout (spectral embedding of the graph)
5. SGD: attract along edges, repel via negative sampling

73.3.3 3.3 Complexity and the Practical Choice Between Them

The two methods share a skeleton: build a local neighborhood structure, then optimize a low dimensional layout that preserves it. They differ in how each step scales. Modern t-SNE builds its affinities from approximate nearest neighbors in roughly $O(N \log N)$ and runs Barnes Hut or FFT accelerated gradient steps at the same per iteration cost. UMAP builds an approximate $k$-NN graph by nearest neighbor descent, also near $O(N \log N)$, and its negative sampling SGD touches only $O(N k)$ edges per epoch rather than every pair. In wall clock terms UMAP is usually the faster of the two on large datasets and converges in fewer passes, which is the main reason it became the default for million cell genomics atlases.

The differences that matter in day to day use follow directly from the objectives. UMAP’s cross entropy carries an explicit repulsion term, so it tends to preserve a little more of the coarse arrangement of clusters and to admit a transform method that places new points into a fixed embedding. Standard t-SNE has no such out of sample map and treats inter cluster distance as essentially free. As a first cut: reach for UMAP when the dataset is large, when new points must be embedded later, or when you want embeddings into a moderate dimension as a preprocessing step; reach for t-SNE when the dataset is modest and resolving fine local cluster structure is the priority. Neither is a metric preserving projection, a point Section 5 makes precise.

73.4 4. Hyperparameters That Matter

Neither method is parameter free, and the defaults are starting points, not answers. The single most common mistake is to run with defaults and read the output as ground truth.

73.4.1 4.1 Perplexity and n_neighbors

In t-SNE, perplexity sets the effective neighborhood size, with typical values between 5 and 50. Small perplexity emphasizes very local structure and fragments the data into many small islands. Large perplexity blends more neighbors and yields smoother, more connected layouts. The analogous knob in UMAP is n_neighbors, with a default near 15. Both parameters trade local detail against global coherence: low values reveal fine substructure but exaggerate fragmentation, high values reveal broad organization but wash out small clusters. A robust practice is to run several settings and trust only structure that persists across them.

73.4.2 4.2 min_dist, Learning Rate, and Iterations

UMAP’s min_dist controls how tightly points may pack in the embedding. Small values, near 0.0 to 0.1, produce dense clumps useful for cluster separation; larger values, near 0.5 to 0.99, spread points evenly and are better for seeing topology. In t-SNE the learning rate matters more than people expect. A learning rate that is too low leaves points stuck in a compressed ball; a common heuristic sets it to roughly $N / 12$. Insufficient iterations produce maps that have not converged, often visible as a uniformly round blob.

73.4.3 4.3 Initialization and Reproducibility

Initialization has an outsized effect on global structure. Random initialization scrambles the relative positions of clusters between runs, while informative initialization, PCA for t-SNE and spectral embedding for UMAP, preserves more of the global arrangement and improves run to run stability. Both algorithms are stochastic, so a fixed random seed is necessary for reproducibility, and the absolute coordinates carry no meaning even with a seed fixed. Report your seed and parameters in any figure caption.

73.5 5. Common Misreadings of the Plots

The plots produced by these methods are seductive and routinely overinterpreted. The Distill article “How to Use t-SNE Effectively” demonstrates each of the following failure modes interactively, and they apply to UMAP as well.

73.5.1 5.1 Cluster Sizes Are Not Meaningful

Because t-SNE adapts its bandwidth to local density, it expands dense clusters and contracts sparse ones. The visual size of a blob therefore tells you nothing reliable about the variance or spread of that group in the original space. A tight cluster and a diffuse cluster can render at the same apparent size. Do not infer that one group is more variable than another from the area it occupies.

73.5.2 5.2 Distances Between Clusters Are Not Meaningful

Since the objectives penalize misplaced neighbors heavily and misplaced non neighbors lightly, the gaps between well separated clusters are largely arbitrary. Two clusters drawn far apart may be no more dissimilar than two drawn close together. UMAP preserves somewhat more global structure than t-SNE, but neither method licenses reading inter cluster distance as dissimilarity. If you need a faithful global metric, the embedding is the wrong tool.

73.5.3 5.3 Apparent Clusters May Be Noise

At small perplexity or low n_neighbors, both methods will carve random noise into clean looking clumps. The optimizer is good at producing visually separated groups whether or not separation exists in the data. The defense is to vary the neighborhood parameter and the seed, and to treat a cluster as real only if it survives. Validating apparent structure against independent labels, a clustering computed in the original space, or a known biological or semantic grouping is essential before drawing conclusions.

73.5.4 5.4 Topology Can Be Fictional

Filaments, loops, and bridges between clusters sometimes appear that have no counterpart in the data, and conversely genuine connected structure can be torn apart. UMAP’s topological grounding makes it somewhat more trustworthy for connectivity than t-SNE, but a published 2023 analysis by Chari and Pachter argued that single cell embeddings can distort distances so severely that the layouts retain little quantitative fidelity. The lesson is to confirm any structural claim with a method that does not depend on the two dimensional projection.

73.6 6. Visualization Versus Features

A recurring question is whether t-SNE or UMAP output should feed a downstream model as features rather than serve only as a picture.

73.6.1 6.1 Why the Output Is Usually Not Good Features

For visualization the answer leans strongly toward picture only. The embeddings are stochastic, are not defined by a stable function from input to output in standard t-SNE, distort distances by design, and depend sensitively on hyperparameters. A classifier trained on two dimensional t-SNE coordinates inherits all these instabilities, and the aggressive compression to two dimensions discards information that a model could use. As a rule, do not train production models on two or three dimensional embedding coordinates.

73.6.2 6.2 When Embedding as Features Can Be Justified

There are qualified exceptions. UMAP, unlike classic t-SNE, can embed into a moderate number of dimensions such as 10 to 50 rather than 2, and it provides a transform method that maps new points into an existing embedding, giving the out of sample capability that t-SNE lacks. In that regime UMAP can act as a nonlinear feature compressor, and it has been used as a preprocessing step before clustering, where reducing to an intermediate dimension can improve density based clustering such as HDBSCAN. Even here, validate that the compression helps the end metric rather than assuming it does, and prefer PCA or a learned autoencoder when a deterministic, invertible, or out of sample stable representation is required.

73.6.3 6.3 A Practical Decision Guide

Use t-SNE or UMAP when the goal is to look at the data, to sanity check that learned representations separate known classes, or to generate hypotheses about structure that you will then confirm by other means. Reach for PCA when you need a fast, deterministic, linear summary or a faithful sense of global variance, and for autoencoders or supervised embeddings when you need stable features for a model. Prefer UMAP over t-SNE when you have large datasets, need to embed new points, or want a slightly more trustworthy global arrangement; prefer t-SNE when fine grained local cluster structure is the priority and the dataset is modest. In all cases, run multiple hyperparameter settings, fix and report seeds, and never read cluster size, inter cluster distance, or apparent topology as quantitative fact without independent confirmation.

73.7 7. Hands On with Mature Libraries

This chapter is library driven by design. Both algorithms are subtle to implement well, and the open source community has already produced battle tested, well optimized, freely licensed implementations. Reaching for them is the right call: scikit-learn’s sklearn.manifold.TSNE ships the Barnes Hut t-SNE with sensible init="pca" and learning_rate="auto" defaults, umap-learn provides the reference UMAP with its transform method for out of sample points, and openTSNE offers an FFT accelerated t-SNE with an approximate transform when you outgrow scikit-learn. All three are BSD licensed and pip installable. Writing your own from scratch is a fine learning exercise but a poor production decision.

The example below builds a small, self contained dataset, five Gaussian blobs of 600 points living in 50 dimensional space, where the ground truth cluster identity is known. Because the labels are known we can quantify what the picture would otherwise only suggest. We report two diagnostics. The silhouette score, computed in the two dimensional embedding against the true labels, measures how cleanly the known groups separate, near one for crisp separation. A local neighborhood overlap, the average fraction of each point’s ten nearest neighbors in the original 50-D space that remain among its ten nearest neighbors after embedding, measures local faithfulness. The modest overlap value is itself a lesson: even when clusters separate beautifully, the fine grained ordering of neighbors is heavily reshuffled by the squeeze to two dimensions, exactly the distortion Section 5 warns against. Everything is seeded for reproducibility.

Code

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
import umap

# Five well-separated Gaussian blobs in 50-D. The intrinsic structure is the
# cluster identity; the ambient dimension is too high to plot directly.
X, y = make_blobs(
    n_samples=600, n_features=50, centers=5,
    cluster_std=2.5, random_state=0,
)
print(f"data: {X.shape[0]} points in {X.shape[1]}-D, {len(np.unique(y))} clusters")

def local_overlap(X_high, X_low, k=10):
    """Mean fraction of each point's k nearest neighbors preserved by embedding."""
    idx_hi = NearestNeighbors(n_neighbors=k + 1).fit(X_high) \
        .kneighbors(return_distance=False)[:, 1:]
    idx_lo = NearestNeighbors(n_neighbors=k + 1).fit(X_low) \
        .kneighbors(return_distance=False)[:, 1:]
    return float(np.mean([len(set(a) & set(b)) / k
                          for a, b in zip(idx_hi, idx_lo)]))

# t-SNE: PCA init and auto learning rate are the recommended modern defaults.
tsne = TSNE(n_components=2, perplexity=30, init="pca",
            learning_rate="auto", max_iter=1000, random_state=42)
Z_tsne = tsne.fit_transform(X)

# UMAP: a fixed random_state makes the run deterministic (single-threaded).
reducer = umap.UMAP(n_components=2, n_neighbors=15,
                    min_dist=0.1, random_state=42)
Z_umap = reducer.fit_transform(X)

print(f"t-SNE embedding {Z_tsne.shape}, final KL divergence = {tsne.kl_divergence_:.4f}")
print(f"UMAP  embedding {Z_umap.shape}")
print(f"silhouette    t-SNE={silhouette_score(Z_tsne, y):.3f}  "
      f"UMAP={silhouette_score(Z_umap, y):.3f}")
print(f"10-NN overlap t-SNE={local_overlap(X, Z_tsne):.3f}  "
      f"UMAP={local_overlap(X, Z_umap):.3f}")

# Out-of-sample: UMAP can place new points; standard t-SNE cannot.
X_new, _ = make_blobs(n_samples=5, n_features=50, centers=5,
                      cluster_std=2.5, random_state=1)
Z_new = reducer.transform(X_new)
print(f"UMAP placed {Z_new.shape[0]} new points via transform "
      f"(t-SNE has no out-of-sample map)")

C:\Users\miken\github\ai_in_action\.venv\lib\site-packages\tqdm\auto.py:21: TqdmWarning:

IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

data: 600 points in 50-D, 5 clusters

C:\Users\miken\github\ai_in_action\.venv\lib\site-packages\umap\umap_.py:1952: UserWarning:

n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.

C:\Users\miken\github\ai_in_action\.venv\lib\site-packages\sklearn\manifold\_spectral_embedding.py:328: UserWarning:

Graph is not fully connected, spectral embedding may not work as expected.

t-SNE embedding (600, 2), final KL divergence = 0.9258
UMAP  embedding (600, 2)
silhouette    t-SNE=0.900  UMAP=0.939
10-NN overlap t-SNE=0.255  UMAP=0.203
UMAP placed 5 new points via transform (t-SNE has no out-of-sample map)

Both methods drive the silhouette near one, confirming that the five blobs separate cleanly in the picture. Yet the ten nearest neighbor overlap is far from one: the squeeze into two dimensions scrambles most of the fine grained local ordering even as it keeps the clusters apart. That gap between a crisp looking plot and a low neighborhood fidelity is the quantitative form of the warnings in Section 5, and it is why the embedding is a hypothesis generating picture rather than a faithful metric. The final line exercises UMAP’s transform, the out of sample capability that standard t-SNE simply does not have.

# ] add TSne UMAP Distances Clustering
using TSne, UMAP, Random

Random.seed!(0)
# Five blobs in 50-D: stack Gaussian clusters around random centers.
centers = [randn(50) .* 6 for _ in 1:5]
X = reduce(vcat, [hcat([c .+ 2.5 .* randn(50) for _ in 1:120]...)' for c in centers])
y = repeat(1:5, inner = 120)

# t-SNE: TSne.tsne(X, out_dims, reduce_dims, max_iter, perplexity)
Z_tsne = tsne(X, 2, 0, 1000, 30.0)

# UMAP: umap returns a (n_components x n_samples) matrix, so transpose.
Z_umap = umap(permutedims(X), 2; n_neighbors = 15, min_dist = 0.1)'

println("t-SNE embedding size: ", size(Z_tsne))
println("UMAP  embedding size: ", size(Z_umap))

# UMAP.jl exposes a model object whose transform places new points:
model = UMAP_(permutedims(X), 2; n_neighbors = 15, min_dist = 0.1)
Z_new = UMAP.transform(model, permutedims(X[1:5, :]))'
println("UMAP placed ", size(Z_new, 1), " new points via transform")

// Cargo.toml:  bhtsne = "0.5"
// bhtsne is the mature Barnes-Hut t-SNE crate. There is no mature UMAP
// crate in the Rust ecosystem; for UMAP in production, call umap-learn
// from Python (e.g. via PyO3) or use the Julia/Python tabs above.
use bhtsne::tSNE;

fn main() {
    // Five blobs in 50-D, flattened row-major into a Vec<f32>.
    let (n, d, k) = (600usize, 50usize, 5usize);
    let centers: Vec<Vec<f32>> = (0..k)
        .map(|c| (0..d).map(|j| ((c * 7 + j) as f32 % 11.0) - 5.0).collect())
        .collect();
    let mut data: Vec<f32> = Vec::with_capacity(n * d);
    for i in 0..n {
        let c = &centers[i % k];
        // Cheap deterministic jitter stands in for Gaussian noise.
        for j in 0..d {
            let jitter = (((i * 31 + j * 17) % 13) as f32 - 6.0) * 0.4;
            data.push(c[j] + jitter);
        }
    }

    let samples: Vec<&[f32]> = data.chunks(d).collect();
    let mut tsne = tSNE::new(&samples);
    tsne.embedding_dim(2)
        .perplexity(30.0)
        .epochs(1000)
        .barnes_hut(0.5, |a, b| {
            a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
        });

    let embedding = tsne.embedding(); // flat Vec<f32>, length n * 2
    println!("t-SNE embedding: {} points x 2 dims", embedding.len() / 2);
}

The three implementations agree on the workflow even though they live in different ecosystems. Python is the most mature path and the one to prefer in practice: scikit-learn and umap-learn are widely deployed, documented, and tested. Julia’s TSne.jl and UMAP.jl are clean native ports, and UMAP.jl exposes a transform like Python. Rust’s bhtsne crate is a solid Barnes Hut t-SNE, but the Rust ecosystem has no mature UMAP crate, so honest production advice there is to call umap-learn across the language boundary rather than reimplement it.

73.8 8. Summary

t-SNE and UMAP are powerful instruments for seeing high dimensional data, and their power rests on the manifold assumption that meaningful structure is locally Euclidean and globally curved. Both convert local neighborhoods into a graph or probability structure and then lay out points to preserve those neighborhoods, t-SNE through a KL divergence on Student t similarities and UMAP through a cross entropy on fuzzy graph memberships. They reward us with vivid pictures and punish us when we forget that the pictures distort sizes, distances, and sometimes topology. Treat their output as a hypothesis generating visualization, govern it with careful hyperparameter sweeps and fixed seeds, and look elsewhere when you need stable features or a faithful metric. Used with that discipline, they are among the most useful tools in the practitioner’s kit.

73.9 References

van der Maaten, L., and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research. https://www.jmlr.org/papers/v9/vandermaaten08a.html
McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426
Wattenberg, M., Viegas, F., and Johnson, I. (2016). How to Use t-SNE Effectively. Distill. https://distill.pub/2016/misread-tsne/
Coenen, A., and Pearce, A. Understanding UMAP. Google PAIR. https://pair-code.github.io/understanding-umap/
UMAP documentation. https://umap-learn.readthedocs.io/en/latest/
Poličar, P., Stražar, M., and Zupan, B. (2024). openTSNE: a modular Python library for t-SNE dimensionality reduction. Journal of Statistical Software. https://github.com/pavlin-policar/openTSNE
Chari, T., and Pachter, L. (2023). The specious art of single-cell genomics. PLOS Computational Biology. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011288
Kobak, D., and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications. https://www.nature.com/articles/s41467-019-13056-x

# Nonlinear Dimensionality Reduction with t-SNE and UMAP High dimensional data is the normal condition in modern machine learning. Word embeddings live in 300 to 4096 dimensions, single cell RNA sequencing produces expression vectors over 20000 genes, and the penultimate layer of a vision model emits feature vectors with thousands of components. We cannot see these spaces directly, and our geometric intuition, calibrated for two and three dimensions, fails badly above them. Nonlinear dimensionality reduction methods such as t-SNE and UMAP exist to bridge this gap. They take points in a high dimensional space and produce a two or three dimensional layout that a human can inspect. This chapter explains the manifold assumption that justifies these methods, the mechanics of t-SNE and UMAP, the hyperparameters that govern their behavior, the systematic ways their plots mislead, and the practical question of when to reach for them at all. ## 1. The Manifold Assumption ### 1.1 Why Linear Methods Are Not Enough Principal component analysis (PCA) finds the linear subspace that captures maximal variance. If the data genuinely lives near a low dimensional hyperplane, PCA recovers it exactly and cheaply. The trouble is that real data rarely lies on a flat subspace. Consider points arranged on a two dimensional sheet that has been rolled up into a spiral in three dimensions, the classic Swiss roll. The intrinsic structure is two dimensional, but any linear projection collapses points that are far apart along the roll onto each other. PCA sees only the ambient variance and cannot unroll the sheet. ### 1.2 Manifolds and Intrinsic Dimension The manifold assumption states that high dimensional data of interest concentrates near a low dimensional manifold $\mathcal{M}$ embedded in the ambient space $\mathbb{R}^D$. A $d$ dimensional manifold is a space that looks locally like $\mathbb{R}^d$ even though it may be curved globally. The number $d$ is the intrinsic dimension, and the hope is that $d \ll D$. Images of a single rotating object, for example, trace out a manifold whose intrinsic dimension equals the number of degrees of freedom of the rotation, not the number of pixels. The practical consequence is that Euclidean distance in the ambient space is trustworthy only locally. Two points that are near each other in $\mathbb{R}^D$ are probably near each other on $\mathcal{M}$, but the straight line distance between two far apart points badly underestimates the geodesic distance measured along the manifold surface. Nonlinear methods exploit this by preserving local neighborhood structure and treating large distances as essentially unknown. This single design choice explains most of what follows, including the limitations. ## 2. How t-SNE Works t-SNE, t-distributed Stochastic Neighbor Embedding, was introduced by van der Maaten and Hinton in 2008. Its strategy is to convert distances into probabilities, then find a low dimensional layout whose probabilities match. ### 2.1 Neighborhoods as Probability Distributions For each high dimensional point $x_i$, t-SNE defines a conditional probability that $x_i$ would pick $x_j$ as its neighbor, using a Gaussian kernel centered on $x_i$: $$ p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}. $$ The bandwidth $\sigma_i$ is set separately for each point so that the effective number of neighbors, measured through the perplexity of the distribution, matches a user chosen value. This per point adaptivity is important. Dense regions get a small $\sigma_i$ and sparse regions get a large one, so the method respects local density. The conditional probabilities are symmetrized into joint probabilities $p_{ij} = (p_{j \mid i} + p_{i \mid j}) / 2N$. ### 2.2 The Low Dimensional Map and the Heavy Tail In the low dimensional map, t-SNE places points $y_i$ and measures their similarity with a Student t-distribution with one degree of freedom, a Cauchy kernel: $$ q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}. $$ The heavy tail is the key innovation. A Gaussian in two dimensions cannot accommodate the volume of neighbors that exists in high dimensions, so moderately distant points would be forced unnaturally close, the crowding problem. The Student t tail gives distant points room to spread out, relieving the crowding. ### 2.3 The Objective and Its Optimization t-SNE minimizes the Kullback Leibler divergence between the two distributions: $$ \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. $$ This objective is asymmetric in a revealing way. When $p_{ij}$ is large but $q_{ij}$ is small, meaning true neighbors are placed far apart, the penalty is severe. When $p_{ij}$ is small but $q_{ij}$ is large, meaning unrelated points are placed close, the penalty is mild. t-SNE therefore works hard to keep neighbors together and cares little about the placement of non neighbors. This is why local structure is faithful and global structure is not. Optimization proceeds by gradient descent, typically with momentum and a trick called early exaggeration that multiplies the $p_{ij}$ during the opening iterations to encourage tight, well separated clusters. The gradient itself has a clean physical reading, $$ \frac{\partial \mathrm{KL}}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\,\bigl(1 + \lVert y_i - y_j \rVert^2\bigr)^{-1} (y_i - y_j), $$ a sum of springs: pairs whose modeled closeness $q_{ij}$ falls short of their true affinity $p_{ij}$ attract, and pairs that are too close repel. Naive t-SNE costs $O(N^2)$ per iteration. The Barnes Hut approximation reduces this to $O(N \log N)$ by treating distant groups of points as single summarized masses, and the FFT based variant in openTSNE scales further. ```text 1. compute pairwise affinities p_ij from input (set per-point sigma via perplexity) 2. initialize map points y_i (PCA initialization recommended) 3. for each iteration: compute low-dim affinities q_ij with Student-t kernel compute KL gradient, update y_i with momentum apply early exaggeration during the first ~250 steps ``` ## 3. How UMAP Works UMAP, Uniform Manifold Approximation and Projection, was introduced by McInnes, Healy, and Melville in 2018. It reaches similar visual results through a different theoretical route grounded in topology and fuzzy simplicial sets, but the practical algorithm has a clean intuition. ### 3.1 Building a Fuzzy Neighborhood Graph UMAP first constructs a weighted $k$ nearest neighbor graph. For each point it finds the $k$ nearest neighbors and assigns each edge a membership strength that decays with distance: $$ w(x_i, x_j) = \exp\left(-\frac{\max(0, d(x_i, x_j) - \rho_i)}{\sigma_i}\right), $$ where $\rho_i$ is the distance to the nearest neighbor of $x_i$ and $\sigma_i$ is a per point normalizer. Subtracting $\rho_i$ guarantees that every point is connected to its closest neighbor with full strength, which enforces local connectivity and reflects the assumption that data is uniformly distributed on the manifold. The directed edge weights are then symmetrized using a fuzzy union, $w_{ij} = w_{i \to j} + w_{j \to i} - w_{i \to j} w_{j \to i}$. ### 3.2 Optimizing the Low Dimensional Layout In the embedding space, UMAP models the probability that two points are connected with a smooth curve $(1 + a \lVert y_i - y_j \rVert^{2b})^{-1}$, where $a$ and $b$ are fit from the `min_dist` hyperparameter. It then minimizes a cross entropy between the high dimensional and low dimensional edge weights: $$ \sum_{(i,j)} w_{ij} \log \frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log \frac{1 - w_{ij}}{1 - v_{ij}}. $$ The crucial difference from t-SNE is the second term. It carries a repulsive force for pairs that are not connected, whereas the t-SNE objective lacks an explicit repulsion of comparable form. UMAP optimizes this with stochastic gradient descent and negative sampling, attracting connected pairs along graph edges and pushing apart a small sample of random non edges each step. This sampling scheme, borrowed from the word embedding literature, is what gives UMAP its speed advantage and its tendency to preserve a little more global arrangement than t-SNE. ```text 1. build approximate k-NN graph (via nearest-neighbor descent) 2. compute fuzzy edge weights with per-point rho and sigma 3. symmetrize via fuzzy union 4. initialize layout (spectral embedding of the graph) 5. SGD: attract along edges, repel via negative sampling ``` ### 3.3 Complexity and the Practical Choice Between Them The two methods share a skeleton: build a local neighborhood structure, then optimize a low dimensional layout that preserves it. They differ in how each step scales. Modern t-SNE builds its affinities from approximate nearest neighbors in roughly $O(N \log N)$ and runs Barnes Hut or FFT accelerated gradient steps at the same per iteration cost. UMAP builds an approximate $k$-NN graph by nearest neighbor descent, also near $O(N \log N)$, and its negative sampling SGD touches only $O(N k)$ edges per epoch rather than every pair. In wall clock terms UMAP is usually the faster of the two on large datasets and converges in fewer passes, which is the main reason it became the default for million cell genomics atlases. The differences that matter in day to day use follow directly from the objectives. UMAP's cross entropy carries an explicit repulsion term, so it tends to preserve a little more of the coarse arrangement of clusters and to admit a `transform` method that places new points into a fixed embedding. Standard t-SNE has no such out of sample map and treats inter cluster distance as essentially free. As a first cut: reach for UMAP when the dataset is large, when new points must be embedded later, or when you want embeddings into a moderate dimension as a preprocessing step; reach for t-SNE when the dataset is modest and resolving fine local cluster structure is the priority. Neither is a metric preserving projection, a point Section 5 makes precise. ## 4. Hyperparameters That Matter Neither method is parameter free, and the defaults are starting points, not answers. The single most common mistake is to run with defaults and read the output as ground truth. ### 4.1 Perplexity and n_neighbors In t-SNE, perplexity sets the effective neighborhood size, with typical values between 5 and 50. Small perplexity emphasizes very local structure and fragments the data into many small islands. Large perplexity blends more neighbors and yields smoother, more connected layouts. The analogous knob in UMAP is `n_neighbors`, with a default near 15. Both parameters trade local detail against global coherence: low values reveal fine substructure but exaggerate fragmentation, high values reveal broad organization but wash out small clusters. A robust practice is to run several settings and trust only structure that persists across them. ### 4.2 min_dist, Learning Rate, and Iterations UMAP's `min_dist` controls how tightly points may pack in the embedding. Small values, near 0.0 to 0.1, produce dense clumps useful for cluster separation; larger values, near 0.5 to 0.99, spread points evenly and are better for seeing topology. In t-SNE the learning rate matters more than people expect. A learning rate that is too low leaves points stuck in a compressed ball; a common heuristic sets it to roughly $N / 12$. Insufficient iterations produce maps that have not converged, often visible as a uniformly round blob. ### 4.3 Initialization and Reproducibility Initialization has an outsized effect on global structure. Random initialization scrambles the relative positions of clusters between runs, while informative initialization, PCA for t-SNE and spectral embedding for UMAP, preserves more of the global arrangement and improves run to run stability. Both algorithms are stochastic, so a fixed random seed is necessary for reproducibility, and the absolute coordinates carry no meaning even with a seed fixed. Report your seed and parameters in any figure caption. ## 5. Common Misreadings of the Plots The plots produced by these methods are seductive and routinely overinterpreted. The Distill article "How to Use t-SNE Effectively" demonstrates each of the following failure modes interactively, and they apply to UMAP as well. ### 5.1 Cluster Sizes Are Not Meaningful Because t-SNE adapts its bandwidth to local density, it expands dense clusters and contracts sparse ones. The visual size of a blob therefore tells you nothing reliable about the variance or spread of that group in the original space. A tight cluster and a diffuse cluster can render at the same apparent size. Do not infer that one group is more variable than another from the area it occupies. ### 5.2 Distances Between Clusters Are Not Meaningful Since the objectives penalize misplaced neighbors heavily and misplaced non neighbors lightly, the gaps between well separated clusters are largely arbitrary. Two clusters drawn far apart may be no more dissimilar than two drawn close together. UMAP preserves somewhat more global structure than t-SNE, but neither method licenses reading inter cluster distance as dissimilarity. If you need a faithful global metric, the embedding is the wrong tool. ### 5.3 Apparent Clusters May Be Noise At small perplexity or low `n_neighbors`, both methods will carve random noise into clean looking clumps. The optimizer is good at producing visually separated groups whether or not separation exists in the data. The defense is to vary the neighborhood parameter and the seed, and to treat a cluster as real only if it survives. Validating apparent structure against independent labels, a clustering computed in the original space, or a known biological or semantic grouping is essential before drawing conclusions. ### 5.4 Topology Can Be Fictional Filaments, loops, and bridges between clusters sometimes appear that have no counterpart in the data, and conversely genuine connected structure can be torn apart. UMAP's topological grounding makes it somewhat more trustworthy for connectivity than t-SNE, but a published 2023 analysis by Chari and Pachter argued that single cell embeddings can distort distances so severely that the layouts retain little quantitative fidelity. The lesson is to confirm any structural claim with a method that does not depend on the two dimensional projection. ## 6. Visualization Versus Features A recurring question is whether t-SNE or UMAP output should feed a downstream model as features rather than serve only as a picture. ### 6.1 Why the Output Is Usually Not Good Features For visualization the answer leans strongly toward picture only. The embeddings are stochastic, are not defined by a stable function from input to output in standard t-SNE, distort distances by design, and depend sensitively on hyperparameters. A classifier trained on two dimensional t-SNE coordinates inherits all these instabilities, and the aggressive compression to two dimensions discards information that a model could use. As a rule, do not train production models on two or three dimensional embedding coordinates. ### 6.2 When Embedding as Features Can Be Justified There are qualified exceptions. UMAP, unlike classic t-SNE, can embed into a moderate number of dimensions such as 10 to 50 rather than 2, and it provides a `transform` method that maps new points into an existing embedding, giving the out of sample capability that t-SNE lacks. In that regime UMAP can act as a nonlinear feature compressor, and it has been used as a preprocessing step before clustering, where reducing to an intermediate dimension can improve density based clustering such as HDBSCAN. Even here, validate that the compression helps the end metric rather than assuming it does, and prefer PCA or a learned autoencoder when a deterministic, invertible, or out of sample stable representation is required. ### 6.3 A Practical Decision Guide Use t-SNE or UMAP when the goal is to look at the data, to sanity check that learned representations separate known classes, or to generate hypotheses about structure that you will then confirm by other means. Reach for PCA when you need a fast, deterministic, linear summary or a faithful sense of global variance, and for autoencoders or supervised embeddings when you need stable features for a model. Prefer UMAP over t-SNE when you have large datasets, need to embed new points, or want a slightly more trustworthy global arrangement; prefer t-SNE when fine grained local cluster structure is the priority and the dataset is modest. In all cases, run multiple hyperparameter settings, fix and report seeds, and never read cluster size, inter cluster distance, or apparent topology as quantitative fact without independent confirmation. ## 7. Hands On with Mature Libraries This chapter is library driven by design. Both algorithms are subtle to implement well, and the open source community has already produced battle tested, well optimized, freely licensed implementations. Reaching for them is the right call: scikit-learn's `sklearn.manifold.TSNE` ships the Barnes Hut t-SNE with sensible `init="pca"` and `learning_rate="auto"` defaults, `umap-learn` provides the reference UMAP with its `transform` method for out of sample points, and `openTSNE` offers an FFT accelerated t-SNE with an approximate transform when you outgrow scikit-learn. All three are BSD licensed and pip installable. Writing your own from scratch is a fine learning exercise but a poor production decision. The example below builds a small, self contained dataset, five Gaussian blobs of 600 points living in 50 dimensional space, where the ground truth cluster identity is known. Because the labels are known we can quantify what the picture would otherwise only suggest. We report two diagnostics. The silhouette score, computed in the two dimensional embedding against the true labels, measures how cleanly the known groups separate, near one for crisp separation. A local neighborhood overlap, the average fraction of each point's ten nearest neighbors in the original 50-D space that remain among its ten nearest neighbors after embedding, measures local faithfulness. The modest overlap value is itself a lesson: even when clusters separate beautifully, the fine grained ordering of neighbors is heavily reshuffled by the squeeze to two dimensions, exactly the distortion Section 5 warns against. Everything is seeded for reproducibility. ::: {.panel-tabset} ## Python ```{python} import numpy as np from sklearn.datasets import make_blobs from sklearn.manifold import TSNE from sklearn.metrics import silhouette_score from sklearn.neighbors import NearestNeighbors import umap # Five well-separated Gaussian blobs in 50-D. The intrinsic structure is the # cluster identity; the ambient dimension is too high to plot directly. X, y = make_blobs( n_samples=600, n_features=50, centers=5, cluster_std=2.5, random_state=0, ) print(f"data: {X.shape[0]} points in {X.shape[1]}-D, {len(np.unique(y))} clusters") def local_overlap(X_high, X_low, k=10): """Mean fraction of each point's k nearest neighbors preserved by embedding.""" idx_hi = NearestNeighbors(n_neighbors=k + 1).fit(X_high) \ .kneighbors(return_distance=False)[:, 1:] idx_lo = NearestNeighbors(n_neighbors=k + 1).fit(X_low) \ .kneighbors(return_distance=False)[:, 1:] return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_hi, idx_lo)])) # t-SNE: PCA init and auto learning rate are the recommended modern defaults. tsne = TSNE(n_components=2, perplexity=30, init="pca", learning_rate="auto", max_iter=1000, random_state=42) Z_tsne = tsne.fit_transform(X) # UMAP: a fixed random_state makes the run deterministic (single-threaded). reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42) Z_umap = reducer.fit_transform(X) print(f"t-SNE embedding {Z_tsne.shape}, final KL divergence = {tsne.kl_divergence_:.4f}") print(f"UMAP embedding {Z_umap.shape}") print(f"silhouette t-SNE={silhouette_score(Z_tsne, y):.3f} " f"UMAP={silhouette_score(Z_umap, y):.3f}") print(f"10-NN overlap t-SNE={local_overlap(X, Z_tsne):.3f} " f"UMAP={local_overlap(X, Z_umap):.3f}") # Out-of-sample: UMAP can place new points; standard t-SNE cannot. X_new, _ = make_blobs(n_samples=5, n_features=50, centers=5, cluster_std=2.5, random_state=1) Z_new = reducer.transform(X_new) print(f"UMAP placed {Z_new.shape[0]} new points via transform " f"(t-SNE has no out-of-sample map)") ``` Both methods drive the silhouette near one, confirming that the five blobs separate cleanly in the picture. Yet the ten nearest neighbor overlap is far from one: the squeeze into two dimensions scrambles most of the fine grained local ordering even as it keeps the clusters apart. That gap between a crisp looking plot and a low neighborhood fidelity is the quantitative form of the warnings in Section 5, and it is why the embedding is a hypothesis generating picture rather than a faithful metric. The final line exercises UMAP's `transform`, the out of sample capability that standard t-SNE simply does not have. ## Julia ```julia # ] add TSne UMAP Distances Clustering using TSne, UMAP, Random Random.seed!(0) # Five blobs in 50-D: stack Gaussian clusters around random centers. centers = [randn(50) .* 6 for _ in 1:5] X = reduce(vcat, [hcat([c .+ 2.5 .* randn(50) for _ in 1:120]...)' for c in centers]) y = repeat(1:5, inner = 120) # t-SNE: TSne.tsne(X, out_dims, reduce_dims, max_iter, perplexity) Z_tsne = tsne(X, 2, 0, 1000, 30.0) # UMAP: umap returns a (n_components x n_samples) matrix, so transpose. Z_umap = umap(permutedims(X), 2; n_neighbors = 15, min_dist = 0.1)' println("t-SNE embedding size: ", size(Z_tsne)) println("UMAP embedding size: ", size(Z_umap)) # UMAP.jl exposes a model object whose transform places new points: model = UMAP_(permutedims(X), 2; n_neighbors = 15, min_dist = 0.1) Z_new = UMAP.transform(model, permutedims(X[1:5, :]))' println("UMAP placed ", size(Z_new, 1), " new points via transform") ``` ## Rust ```rust // Cargo.toml: bhtsne = "0.5" // bhtsne is the mature Barnes-Hut t-SNE crate. There is no mature UMAP // crate in the Rust ecosystem; for UMAP in production, call umap-learn // from Python (e.g. via PyO3) or use the Julia/Python tabs above. use bhtsne::tSNE; fn main() { // Five blobs in 50-D, flattened row-major into a Vec<f32>. let (n, d, k) = (600usize, 50usize, 5usize); let centers: Vec<Vec<f32>> = (0..k) .map(|c| (0..d).map(|j| ((c * 7 + j) as f32 % 11.0) - 5.0).collect()) .collect(); let mut data: Vec<f32> = Vec::with_capacity(n * d); for i in 0..n { let c = &centers[i % k]; // Cheap deterministic jitter stands in for Gaussian noise. for j in 0..d { let jitter = (((i * 31 + j * 17) % 13) as f32 - 6.0) * 0.4; data.push(c[j] + jitter); } } let samples: Vec<&[f32]> = data.chunks(d).collect(); let mut tsne = tSNE::new(&samples); tsne.embedding_dim(2) .perplexity(30.0) .epochs(1000) .barnes_hut(0.5, |a, b| { a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt() }); let embedding = tsne.embedding(); // flat Vec<f32>, length n * 2 println!("t-SNE embedding: {} points x 2 dims", embedding.len() / 2); } ``` ::: The three implementations agree on the workflow even though they live in different ecosystems. Python is the most mature path and the one to prefer in practice: scikit-learn and `umap-learn` are widely deployed, documented, and tested. Julia's `TSne.jl` and `UMAP.jl` are clean native ports, and `UMAP.jl` exposes a `transform` like Python. Rust's `bhtsne` crate is a solid Barnes Hut t-SNE, but the Rust ecosystem has no mature UMAP crate, so honest production advice there is to call `umap-learn` across the language boundary rather than reimplement it. ## 8. Summary t-SNE and UMAP are powerful instruments for seeing high dimensional data, and their power rests on the manifold assumption that meaningful structure is locally Euclidean and globally curved. Both convert local neighborhoods into a graph or probability structure and then lay out points to preserve those neighborhoods, t-SNE through a KL divergence on Student t similarities and UMAP through a cross entropy on fuzzy graph memberships. They reward us with vivid pictures and punish us when we forget that the pictures distort sizes, distances, and sometimes topology. Treat their output as a hypothesis generating visualization, govern it with careful hyperparameter sweeps and fixed seeds, and look elsewhere when you need stable features or a faithful metric. Used with that discipline, they are among the most useful tools in the practitioner's kit. ## References 1. van der Maaten, L., and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research. https://www.jmlr.org/papers/v9/vandermaaten08a.html 2. McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426 3. Wattenberg, M., Viegas, F., and Johnson, I. (2016). How to Use t-SNE Effectively. Distill. https://distill.pub/2016/misread-tsne/ 4. Coenen, A., and Pearce, A. Understanding UMAP. Google PAIR. https://pair-code.github.io/understanding-umap/ 5. UMAP documentation. https://umap-learn.readthedocs.io/en/latest/ 6. Poličar, P., Stražar, M., and Zupan, B. (2024). openTSNE: a modular Python library for t-SNE dimensionality reduction. Journal of Statistical Software. https://github.com/pavlin-policar/openTSNE 7. Chari, T., and Pachter, L. (2023). The specious art of single-cell genomics. PLOS Computational Biology. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011288 8. Kobak, D., and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications. https://www.nature.com/articles/s41467-019-13056-x