126  The Philosophy of Unsupervised Learning

Supervised learning begins with an answer key. Someone has already decided, for every training example, what the correct output should be, and the learner only has to discover a function that reproduces those answers and generalizes to new inputs. Unsupervised learning begins with no answer key at all. It receives a pile of observations \(x_1, x_2, \dots, x_n\) drawn from some unknown distribution and is asked to do something useful with them, where “useful” is left deliberately underspecified. This chapter argues that unsupervised learning is best understood not as a single task but as a family of attempts to recover the latent structure that generated the data, and that this framing explains both why it is foundational to modern machine learning and why it remains stubbornly hard to evaluate.

126.1 1. What It Means to Learn Without Labels

The defining feature of the unsupervised setting is the absence of a target variable. In supervised learning we model a conditional distribution \(p(y \mid x)\) or learn a map \(f: \mathcal{X} \to \mathcal{Y}\), and the loss measures disagreement with known targets \(y_i\). In unsupervised learning we have only \(\{x_i\}\), and the object of interest is the data distribution itself, \(p(x)\), or some structural property of it that we believe carries meaning.

This shift changes the nature of the problem in a deep way. Supervised learning has an external standard of correctness baked into every example. Unsupervised learning has to supply its own standard, and the choice of standard is a modeling assumption rather than a fact about the world. When we cluster, we assert that the data falls into groups. When we estimate density, we assert that some regions of input space are more typical than others. When we reduce dimension, we assert that the data lives near a lower dimensional surface embedded in a high dimensional ambient space. None of these assertions is guaranteed by the data. Each is a hypothesis about structure that the algorithm then tries to fit.

A useful way to organize the field is by the kind of structure being posited. The classical goals are clustering (discrete grouping), density estimation (a full probabilistic model of \(p(x)\)), dimensionality reduction (a low dimensional coordinate system), and representation learning (features that make downstream tasks easier). These overlap, and modern methods often pursue several at once, but the taxonomy is worth keeping because each goal comes with its own assumptions, its own algorithms, and its own evaluation headaches.

126.2 2. The Manifold Hypothesis and Why Structure Exists

Why should unlabeled data have any recoverable structure at all? The working answer in modern machine learning is the manifold hypothesis: real high dimensional data does not fill its ambient space uniformly but concentrates near a lower dimensional manifold. A \(256 \times 256\) color image lives nominally in a space of dimension \(256 \times 256 \times 3 = 196{,}608\), yet the set of images that look like natural photographs occupies a vanishingly thin sliver of that space. Almost every point in the ambient cube is television static, not a photograph.

If the manifold hypothesis holds, then the intrinsic dimension of the data is far smaller than the ambient dimension, and the goal of unsupervised learning can be restated as recovering the manifold and a coordinate system on it. Clustering corresponds to finding connected components or modes; dimensionality reduction corresponds to finding the manifold’s local coordinates; density estimation corresponds to modeling how probability mass is distributed across and around it. The hypothesis is empirical rather than provable, but it is strongly supported by the success of methods that exploit it, and it gives a unifying picture of what “structure” means.

The manifold view also clarifies why distance matters and why it is treacherous. Euclidean distance in the ambient space is a poor proxy for distance along the manifold. Two photographs can be close pixel by pixel yet semantically unrelated, while two semantically similar images can be far apart in raw pixels. Much of the craft of unsupervised learning is the search for a representation in which ordinary distances become meaningful.
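The gap between ambient and manifold distance can be made concrete with a toy example. The sketch below assumes a one dimensional manifold of our own choosing, a spiral drawn in the plane, and compares the straight line distance between two samples with the distance travelled along the curve. The spiral, the sample count, and the helper names are illustrative assumptions, not something the chapter prescribes.

```python
import numpy as np

# Toy 1-D manifold: an Archimedean spiral r = t embedded in the 2-D plane.
t = np.linspace(0.0, 4.0 * np.pi, 2001)          # intrinsic coordinate
points = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)

# Distance along the curve, approximated by summing short straight segments.
segment_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
arc_length = np.concatenate([[0.0], np.cumsum(segment_lengths)])

def ambient_distance(i, j):
    """Straight-line (Euclidean) distance in the ambient plane."""
    return float(np.linalg.norm(points[i] - points[j]))

def manifold_distance(i, j):
    """Distance travelled along the spiral between samples i and j."""
    return float(abs(arc_length[j] - arc_length[i]))

# Samples at t = pi and t = 3*pi sit on the same ray, one full turn apart.
i, j = 500, 1500
print("ambient: ", ambient_distance(i, j))
print("manifold:", manifold_distance(i, j))
```

The two samples are about 6.3 units apart in the plane but roughly 40 units apart along the spiral, so a method that trusts ambient distance would treat them as near neighbors even though the manifold says otherwise.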

126.3 3. Clustering: Positing Discrete Groups

Clustering assumes the data partitions into groups whose members are more similar to one another than to outsiders. The trouble is that “similar” and “group” admit many incompatible formalizations, and different formalizations give different clusters from the same data.

The \(k\)-means objective makes the assumptions explicit. Given \(k\), it seeks centroids \(\mu_1, \dots, \mu_k\) and an assignment minimizing within cluster squared distance,

\[ J = \sum_{i=1}^{n} \min_{j} \lVert x_i - \mu_j \rVert^2. \]

This silently assumes clusters are roughly spherical, comparable in size, and convex. Density based methods such as DBSCAN make a different assumption: clusters are high density regions separated by low density gaps, so they can recover elongated or nested shapes that \(k\)-means cannot, at the cost of sensitivity to a density threshold. Spectral clustering assumes the data forms a similarity graph whose natural cuts reveal the groups; it embeds each point using the eigenvectors of the graph Laplacian with the smallest eigenvalues and then clusters in that embedded space. Gaussian mixture models assume each cluster is a Gaussian and fit the mixture by maximum likelihood, typically with the EM algorithm, which generalizes \(k\)-means to elliptical, overlapping clusters with soft assignments.
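To see the \(k\)-means objective at work, here is a minimal sketch of Lloyd's heuristic, the usual alternating procedure for decreasing \(J\). It is an illustration under stated assumptions rather than a reference implementation: the synthetic two blob data set, the random initialization, and the function names are all ours.

```python
import numpy as np

def kmeans_objective(X, centroids):
    """J: sum over points of the squared distance to the nearest centroid."""
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())

def lloyd(X, k, n_iters=50, seed=0):
    """Alternate nearest-centroid assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = [kmeans_objective(X, centroids)]
    for _ in range(n_iters):
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
        history.append(kmeans_objective(X, centroids))
    return centroids, labels, history

# Synthetic data (an assumption for illustration): two round, equal-sized blobs,
# which is exactly the situation the objective is built for.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-3.0, 0.5, size=(100, 2)),
               rng.normal(+3.0, 0.5, size=(100, 2))])
centroids, labels, history = lloyd(X, k=2)
print("J at start:", history[0], "J at end:", history[-1])
```

Each iteration can only lower \(J\) or leave it unchanged, but the procedure finds a local optimum that depends on the initialization, and it says nothing about whether \(k = 2\) or squared Euclidean distance was the right choice to begin with.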

The unavoidable lesson is that the number of clusters and the notion of similarity are inputs, not outputs. A clustering algorithm answers the question “if the data were grouped in this way, what grouping fits best,” and the honesty of the result depends entirely on whether that way of grouping matches the phenomenon. This is the first concrete sign that unsupervised learning carries its assumptions in its objective function.

126.4 4. Density Estimation: Modeling the Whole Distribution

Density estimation is the most ambitious of the classical goals because it tries to model \(p(x)\) in full rather than summarizing it with groups or coordinates. A good density model can score how typical a new point is, generate fresh samples, detect anomalies as low probability regions, and serve as a prior in downstream inference.

The simplest approach, kernel density estimation, places a small bump at each data point,

\[ \hat{p}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right), \]

and immediately exposes the curse of dimensionality: as the dimension \(d\) grows, the number of points needed to fill space well enough for \(\hat{p}\) to be reliable grows exponentially. Pointwise density estimation is essentially hopeless in raw high dimensional spaces, which is exactly why the manifold hypothesis matters.

Modern generative models sidestep the curse by factoring the density or by parameterizing it indirectly. Autoregressive models write \(p(x) = \prod_t p(x_t \mid x_{<t})\) and predict one coordinate at a time. Normalizing flows transform a simple base density through invertible maps and track the change of variables exactly. Diffusion models learn to reverse a gradual noising process and, in doing so, learn the score \(\nabla_x \log p(x)\) rather than the density directly. The trade-offs differ: autoregressive models and flows keep an exact likelihood but pay for it with sequential evaluation or constrained invertible architectures, while diffusion models typically optimize a bound or a score matching objective instead of the exact likelihood in exchange for a flexible, scalable training procedure. All are attempts to make \(p(x)\) tractable where naive estimation fails.
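The kernel density estimator defined earlier in this section is short enough to write out directly. The sketch below assumes a standard Gaussian kernel for \(K\) and a hand picked bandwidth \(h\); both are illustrative choices, as are the function name and the synthetic one dimensional sample.

```python
import numpy as np

def gaussian_kde(query, data, h):
    """Evaluate p_hat at each query point using a standard Gaussian kernel K.

    query: (m, d) array, data: (n, d) array, h: bandwidth (positive scalar).
    """
    n, d = data.shape
    u = (query[:, None, :] - data[None, :, :]) / h              # shape (m, n, d)
    kernel = np.exp(-0.5 * (u ** 2).sum(axis=2)) / (2.0 * np.pi) ** (d / 2.0)
    return kernel.sum(axis=1) / (n * h ** d)

# One-dimensional sanity check: the estimate should integrate to about 1.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 1))
grid = np.linspace(-8.0, 8.0, 4001)[:, None]
density = gaussian_kde(grid, data, h=0.4)
total_mass = float(density.sum() * (grid[1, 0] - grid[0, 0]))   # Riemann sum
print("approximate integral of p_hat:", total_mass)
```

The same function accepts any \(d\), but the factor \(h^d\) in the normalization hints at the trouble: keeping each bump narrow enough to be informative while still covering the space requires a sample size that grows rapidly with \(d\).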

The connection to representation learning is that a model forced to assign high probability to real data and low probability to everything else must internalize the regularities that distinguish the two. The features it builds along the way are often more valuable than the density itself.

126.5 5. Dimensionality Reduction: Finding the Coordinates

Dimensionality reduction seeks a map from a high dimensional input to a low dimensional code that preserves the structure we care about. The classical method, principal component analysis, finds the linear subspace capturing maximal variance by taking the top eigenvectors of the covariance matrix. PCA is fast, has a closed form solution that attains the global optimum, and is interpretable, but it assumes the manifold is a flat linear subspace, which it usually is not.
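PCA as described here fits in a few lines. The following is a minimal sketch that assumes a synthetic data set with two latent factors embedded linearly in ten dimensions; the data, the noise level, and the function name are illustrative assumptions.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the top eigenvectors of its sample covariance matrix."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]                  # shape (d, n_components)
    codes = Xc @ components                         # low dimensional coordinates
    explained = eigvals[order] / eigvals.sum()      # variance fraction per component
    return codes, components, explained

# Synthetic data: 2 latent factors mixed linearly into 10 dimensions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))
codes, components, explained = pca(X, n_components=2)
print("variance explained by 2 components:", explained.sum())
```

Because this data really does lie near a flat two dimensional subspace, two components capture almost all of the variance. On data that curves, the same code still runs and still returns a subspace, which is precisely the risk.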

Nonlinear methods relax that assumption. Autoencoders learn an encoder \(g_\phi: \mathcal{X} \to \mathcal{Z}\) and decoder \(h_\theta: \mathcal{Z} \to \mathcal{X}\) trained to reconstruct the input through a narrow bottleneck,

\[ \min_{\theta, \phi} \; \frac{1}{n}\sum_{i=1}^{n} \lVert x_i - h_\theta(g_\phi(x_i)) \rVert^2, \]

so the bottleneck code \(z = g_\phi(x)\) must capture whatever is needed to reconstruct \(x\). Neighbor embedding methods such as t-SNE and UMAP take a different stance: they care only about preserving local neighborhoods for the purpose of visualization, deliberately distorting global geometry to make clusters legible in two dimensions. This is a crucial caveat. A t-SNE plot is a lens for inspection, not a faithful metric space, and distances or cluster sizes in such a plot should not be read literally.
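The autoencoder objective above can be illustrated with the simplest possible instance. The sketch below assumes a purely linear encoder and decoder trained by plain gradient descent on synthetic data; real autoencoders are nonlinear and trained with a deep learning framework, so the weights `W_enc` and `W_dec`, the learning rate, and the data are all illustrative stand-ins.

```python
import numpy as np

def train_linear_autoencoder(X, code_dim, lr=0.05, n_steps=1000, seed=0):
    """Minimize the mean squared reconstruction error through a bottleneck."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = 0.1 * rng.normal(size=(d, code_dim))    # encoder: z = x @ W_enc
    W_dec = 0.1 * rng.normal(size=(code_dim, d))    # decoder: x_hat = z @ W_dec
    losses = []
    for _ in range(n_steps):
        Z = X @ W_enc                               # bottleneck codes
        residual = Z @ W_dec - X                    # x_hat - x
        losses.append(float((residual ** 2).sum(axis=1).mean()))
        grad_dec = 2.0 / n * Z.T @ residual
        grad_enc = 2.0 / n * X.T @ (residual @ W_dec.T)
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec, losses

# Synthetic data: 2 latent factors placed on an orthonormal basis in 8 dimensions.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))
basis, _ = np.linalg.qr(rng.normal(size=(8, 2)))    # (8, 2), orthonormal columns
X = latent @ basis.T + 0.01 * rng.normal(size=(500, 8))
W_enc, W_dec, losses = train_linear_autoencoder(X, code_dim=2)
print("reconstruction error: start", losses[0], "end", losses[-1])
```

With a two dimensional bottleneck and data that has two underlying factors, the reconstruction error falls to roughly the noise level. Shrinking `code_dim` to 1 would force the model to discard one factor, which is the bottleneck doing its job.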

Here again the assumptions are doing the work. PCA assumes linearity, autoencoders assume reconstructability through a bottleneck, neighbor embeddings assume that local structure is what matters. Choose the wrong assumption and the reduced representation discards exactly the structure you needed.

126.6 6. Representation Learning: The Modern Center of Gravity

The goals described so far converge on a single modern objective: learn a representation, a function \(z = f(x)\) mapping raw inputs to vectors in which downstream problems become easy. A good representation makes semantically similar inputs nearby, disentangles factors of variation, and transfers across tasks. Representation quality, not reconstruction error or likelihood, is what practitioners ultimately care about, because the representation is what gets reused.

What makes a representation good is partly a matter of invariance and partly a matter of informativeness. We want features invariant to nuisances such as lighting, cropping, or word order, while remaining sensitive to the content that distinguishes one input from another. A representation that throws away everything is perfectly invariant and perfectly useless; a representation that keeps every pixel is perfectly informative and equally useless. The art is to keep the right information.

This reframing is what connects classical unsupervised learning to the deep learning era. The pretrained encoders that power modern systems are, in the end, unsupervised or self-supervised representation learners. The question of how to learn good representations without labels has become one of the central questions of the field, and self-supervised pretraining, treated in section 8, is the current best answer.

126.7 7. Why Unsupervised Learning Is Hard to Evaluate

The deepest difficulty in unsupervised learning is not optimization but evaluation. In supervised learning, held out accuracy gives an unambiguous, externally grounded score. In unsupervised learning there is no ground truth to compare against, because the whole premise is that nobody labeled the data. This has several consequences.

First, internal metrics measure self consistency, not correctness. Reconstruction error, log likelihood, and within cluster variance all quantify how well a model fits its own objective. A clustering can minimize within cluster distance beautifully while carving the data along axes nobody cares about. A density model can achieve excellent likelihood while producing samples that look wrong, because likelihood is dominated by getting the bulk of the distribution roughly right and is insensitive to perceptually important details.

Second, when external labels do exist for evaluation, they reintroduce a notion of correctness that the algorithm never optimized, so a mismatch may mean the algorithm failed or may mean the labels reflect one of many equally valid structures. Metrics such as adjusted Rand index or normalized mutual information compare a clustering to a reference labeling, but they presume that reference is the right one. The same photographs can be validly grouped by object, by scene, by color, or by photographer.
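The dependence on the reference labeling is easy to demonstrate. The sketch below implements the adjusted Rand index from its standard pair counting definition and scores one clustering against two different references. The six hypothetical photographs and their labels are invented purely for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(labels_a)
    if n < 2 or n != len(labels_b):
        raise ValueError("need two labelings of the same length, at least 2 items")
    cell_counts = Counter(zip(labels_a, labels_b))   # contingency table entries
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)            # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                        # degenerate: both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical photographs, each with two equally valid reference labelings.
by_object = ["cat", "cat", "dog", "dog", "cat", "dog"]
by_scene = ["indoor", "outdoor", "indoor", "outdoor", "outdoor", "indoor"]
clustering = [0, 0, 1, 1, 0, 1]                      # output of some algorithm

print("vs object labels:", adjusted_rand_index(clustering, by_object))
print("vs scene labels: ", adjusted_rand_index(clustering, by_scene))
```

The same clustering scores a perfect 1.0 against the object labels and slightly below zero against the scene labels. Nothing about the clustering changed between the two lines; only the choice of what counts as correct did.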

Third, evaluation is task relative. A representation that is excellent for one downstream task can be poor for another, so there is no single number that captures its quality. The honest practice, and the dominant practice today, is to evaluate representations extrinsically: freeze the learned features and measure how well a simple model trained on top of them performs on real labeled tasks, often via linear probing or transfer learning. This admits that “good structure” is ultimately defined by usefulness rather than by any intrinsic property the unsupervised method can see on its own.

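# The pattern that quietly powers most modern evaluation.
# Assumes a frozen `encoder` plus labeled train/test splits already exist.
from sklearn.linear_model import LogisticRegression

features_train = encoder(x_train)          # learned without labels
features_test = encoder(x_test)
clf = LogisticRegression(max_iter=1000)    # tiny supervised head
clf.fit(features_train, y_train)
score = clf.score(features_test, y_test)   # extrinsic, task-relative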

This evaluation gap is not a temporary nuisance to be engineered away. It is a structural feature of learning without a target, and recognizing it prevents a great deal of self deception.

126.8 8. Self-Supervised Pretraining: Unsupervised Learning’s Triumph

The most consequential development of the past decade is that the field found a way to manufacture supervision from unlabeled data. Self-supervised learning hides part of each example and trains the model to predict the hidden part from the visible part. No human labels are required, yet every example becomes a supervised problem with an automatically generated target. Philosophically this is still unsupervised learning, since no external annotation is used, but it inherits the machinery and stability of supervised training.
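The trick of manufacturing targets can be shown in a few lines. The sketch below assumes a tokenized sentence and a placeholder symbol named `MASK`; the masking probability and the function name are illustrative assumptions, and real systems use more elaborate corruption schemes.

```python
import random

MASK = "[MASK]"   # placeholder symbol; the name is an assumption of this sketch

def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Return (corrupted_input, targets) built from one unlabeled sequence.

    targets maps each masked position to the original token at that position.
    """
    rng = rng or random.Random()
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)        # hide this token from the model
            targets[position] = token     # and keep it as the answer
        else:
            corrupted.append(token)
    return corrupted, targets

sentence = "the learner must infer the hidden part from the visible part".split()
corrupted, targets = make_masked_example(sentence, mask_prob=0.3, rng=random.Random(0))
print(corrupted)
print(targets)
```

No annotator was involved, yet `corrupted` and `targets` form an ordinary supervised pair: the input is the visible part and the label is whatever was hidden.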

Two broad families dominate. Masked prediction removes part of the input and asks the model to reconstruct it. Masked language modeling trains a model to fill in blanked tokens, and masked image modeling does the analogous thing for image patches. The pretext task forces the model to understand context, syntax, and semantics well enough to recover what is missing. Contrastive and self-distillation methods take a different route: they create two augmented views of the same input and train the representation so that the two views agree, learning invariances directly. Contrastive methods additionally push views of different examples apart, whereas self-distillation methods drop explicit negatives and rely on other mechanisms, such as a slowly updated teacher network, to avoid collapsing to a constant output. A representative contrastive objective for a positive pair \((z_i, z_j)\) among negatives is

\[ \mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}, \]

where \(\text{sim}\) is cosine similarity and \(\tau\) is a temperature controlling how sharply positives are favored.
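The loss above translates almost symbol for symbol into code. The sketch below computes it for a single anchor \(z_i\) with positive \(z_j\), treating every other row of the batch as a negative. The batch contents, the temperature value, and the function name are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(z, i, j, tau=0.5):
    """Loss for anchor i with positive j; all other rows of z act as negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors, so dot = cosine
    logits = z @ z[i] / tau                            # sim(z_i, z_k) / tau for every k
    keep = np.arange(len(z)) != i                      # the denominator excludes k = i
    log_denominator = np.log(np.exp(logits[keep]).sum())
    return float(log_denominator - logits[j])          # -log(exp(pos) / sum)

# Toy batch: row 1 is a lightly perturbed copy of row 0, rows 2 and 3 are unrelated.
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
z = np.stack([anchor,
              anchor + 0.05 * rng.normal(size=8),      # augmented view of the same example
              rng.normal(size=8),                      # negatives
              rng.normal(size=8)])
loss_true_pair = contrastive_loss(z, i=0, j=1)
loss_wrong_pair = contrastive_loss(z, i=0, j=2)
print("true positive:", loss_true_pair, "mislabeled positive:", loss_wrong_pair)
```

The loss is small when the designated positive really is the nearest neighbor of the anchor and larger when it is not. Lowering `tau` sharpens the softmax, so the penalty for a misplaced positive grows.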

The reason this matters is scale. Labeled data is expensive and finite; unlabeled data is effectively unlimited. By converting the ocean of unlabeled text, images, audio, and code into self-supervised prediction problems, the field can train enormous models on enormous corpora and let them discover structure that no labeling budget could ever specify. The large pretrained models behind contemporary language and vision systems are the direct descendants of this idea. Next token prediction over web text is, in the framing of this chapter, density estimation over sequences whose learned internal representations turn out to encode a remarkable amount of world structure.
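The claim that next token prediction is density estimation over sequences can be checked on a toy scale. The sketch below assumes a bigram model, which conditions each token only on its immediate predecessor, a drastic simplification of the full autoregressive factorization used by large models. The three sentence corpus, the smoothing constant, and the start symbol `<s>` are illustrative assumptions, and the model has no end of sequence symbol, so it scores sequences of a given length only.

```python
from collections import Counter, defaultdict
from math import log

START = "<s>"   # hypothetical start-of-sequence symbol

def fit_bigram(corpus, vocab, alpha=1.0):
    """Estimate p(next | previous) from counts with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for sequence in corpus:
        for prev, nxt in zip([START] + sequence, sequence):
            counts[prev][nxt] += 1

    def prob(nxt, prev):
        total = sum(counts[prev].values())
        return (counts[prev][nxt] + alpha) / (total + alpha * len(vocab))

    return prob

def log_likelihood(sequence, prob):
    """log p(x) as a sum of next-token log probabilities (bigram chain rule)."""
    return sum(log(prob(nxt, prev))
               for prev, nxt in zip([START] + sequence, sequence))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = sorted({token for sequence in corpus for token in sequence})
prob = fit_bigram(corpus, vocab)

typical = log_likelihood(["the", "cat", "sat"], prob)
scrambled = log_likelihood(["sat", "cat", "the"], prob)
print("typical order:", typical, "scrambled order:", scrambled)
```

Training consisted of nothing but counting which token follows which, yet the result assigns a higher log probability to the familiar word order than to the scrambled one. That is a density model in miniature.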

This is the resolution of an old tension. For years unsupervised learning was admired in principle but underperformed supervised learning in practice, precisely because of the evaluation and objective problems discussed above. Self-supervised pretraining sidestepped the worst of those problems by giving the model a crisp, automatically scored pretext task while keeping the label free premise. The result is that learning structure without labels, once a niche concern, now sits at the foundation of the most capable systems in machine learning.

126.9 9. Practical Guidance and Closing Perspective

For a practitioner, several principles follow from this philosophy. State your structural assumption before choosing an algorithm, because the algorithm cannot discover structure of a kind it was never designed to look for. Match the method to the goal: use clustering when you genuinely believe in discrete groups, density estimation when you need to score typicality or generate, dimensionality reduction when you need compact coordinates, and representation learning when a downstream task is the real target. Treat visualization methods as inspection tools rather than measurements. Evaluate extrinsically whenever a downstream task exists, and be skeptical of any unsupervised result whose only justification is that it optimized its own internal objective well.

The broader lesson is that unsupervised learning is a disciplined way of encoding beliefs about how data is generated and then letting the data refine those beliefs. It is hard to evaluate precisely because it is ambitious: it tries to recover meaning that nobody wrote down. Its modern triumph through self-supervised pretraining does not abolish that difficulty so much as route around it, by inventing pretext tasks whose answers the data already contains. Understanding both the ambition and the difficulty is what separates principled use of these methods from cargo cult application.

126.10 References

  1. Bengio, Y., Courville, A., and Vincent, P. “Representation Learning: A Review and New Perspectives.” IEEE TPAMI, 2013. https://arxiv.org/abs/1206.5538
  2. Hastie, T., Tibshirani, R., and Friedman, J. “The Elements of Statistical Learning,” 2nd ed., Chapter 14 (Unsupervised Learning). Springer, 2009. https://hastie.su.domains/ElemStatLearn/
  3. van der Maaten, L., and Hinton, G. “Visualizing Data using t-SNE.” JMLR, 2008. https://www.jmlr.org/papers/v9/vandermaaten08a.html
  4. McInnes, L., Healy, J., and Melville, J. “UMAP: Uniform Manifold Approximation and Projection.” 2018. https://arxiv.org/abs/1802.03426
  5. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL, 2019. https://arxiv.org/abs/1810.04805
  6. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. “A Simple Framework for Contrastive Learning of Visual Representations” (SimCLR). ICML, 2020. https://arxiv.org/abs/2002.05709
  7. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. “Masked Autoencoders Are Scalable Vision Learners.” CVPR, 2022. https://arxiv.org/abs/2111.06377
  8. Ho, J., Jain, A., and Abbeel, P. “Denoising Diffusion Probabilistic Models.” NeurIPS, 2020. https://arxiv.org/abs/2006.11239
  9. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” (DBSCAN). KDD, 1996. https://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf
  10. Balestriero, R., et al. “A Cookbook of Self-Supervised Learning.” 2023. https://arxiv.org/abs/2304.12210