127 Clustering Fundamentals
Clustering is the canonical unsupervised learning task. Given a collection of objects and no labels, we want to partition or organize those objects into groups so that objects in the same group are more alike than objects in different groups. This chapter develops the conceptual machinery you need before studying any specific algorithm. We formalize the clustering problem, examine the similarity and distance measures that make “alike” precise, distinguish hard from soft assignments, survey the main algorithmic families, and confront the uncomfortable question of what makes a clustering good in the first place.
127.1 1. The Clustering Problem
127.1.1 1.1 A Working Definition
Let \(X = \{x_1, x_2, \ldots, x_n\}\) be a set of \(n\) objects, each typically represented as a feature vector \(x_i \in \mathbb{R}^d\). A clustering is a function that assigns each object to one or more groups. In the simplest case, a hard partitional clustering produces a partition
\[ C = \{C_1, C_2, \ldots, C_k\}, \qquad \bigcup_{j=1}^{k} C_j = X, \qquad C_a \cap C_b = \varnothing \text{ for } a \neq b, \]
where each \(C_j\) is a nonempty cluster. The number of clusters \(k\) may be fixed in advance, discovered by the algorithm, or left implicit in a hierarchy. Unlike classification, there is no target variable and no ground truth assignment to imitate. We are asked to impose structure rather than to recover a known mapping.
This freedom is also the central difficulty. The phrase “objects that are alike” presupposes a notion of likeness, and different notions yield different, equally defensible clusterings of the same data. Clustering is therefore not a single well posed optimization problem but a family of problems, each defined by a similarity measure, a model of what a cluster is, and an objective that scores candidate solutions.
127.1.2 1.2 Why It Is Hard
Three properties make clustering harder than supervised learning. First, the objective is often combinatorial. The number of ways to partition \(n\) objects into \(k\) nonempty groups is the Stirling number of the second kind \(S(n,k)\), which grows faster than exponentially, so exhaustive search is hopeless and most useful objectives are NP-hard to optimize exactly. Second, the problem is underdetermined: without labels there is no external signal to adjudicate between competing structures. Third, evaluation is circular, because the same assumptions that drive the algorithm tend to drive any internal quality score we compute afterward. These difficulties recur throughout the chapter and shape every design choice.
127.2 2. Similarity and Distance Measures
A clustering algorithm never sees the objects directly. It sees them through the lens of a proximity measure that quantifies how close two objects are. The choice of measure is the single most consequential modeling decision, often more important than the choice of algorithm.
127.2.1 2.1 Metrics and Their Axioms
A dissimilarity \(d : X \times X \to \mathbb{R}_{\geq 0}\) is a metric when it satisfies, for all \(x, y, z\),
\[ d(x,x) = 0, \quad d(x,y) = d(y,x), \quad d(x,y) \le d(x,z) + d(z,y). \]
The last condition is the triangle inequality, and it is what allows many algorithms to prune computation and to reason geometrically. Not every useful dissimilarity is a metric. Squared Euclidean distance, used implicitly by \(k\)-means, violates the triangle inequality, and many domain similarities are not even symmetric. Knowing which axioms hold tells you which algorithmic guarantees survive.
127.2.2 2.2 The Minkowski Family
For real valued vectors the workhorse family is the Minkowski distance of order \(p\),
\[ d_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^{p} \right)^{1/p}. \]
Setting \(p = 2\) recovers Euclidean distance, the straight line distance that underlies most geometric intuition. Setting \(p = 1\) gives the Manhattan or city block distance, which is more robust to outliers because it does not square deviations. The limit \(p \to \infty\) gives the Chebyshev distance \(\max_i |x_i - y_i|\). As \(p\) grows, large coordinate differences dominate; as \(p\) shrinks toward zero, the measure becomes sensitive to the count of differing coordinates rather than their magnitude.
127.2.3 2.3 Cosine, Correlation, and Angular Measures
When magnitude is uninformative and only direction matters, as with bag of words text vectors or many embedding spaces, cosine similarity is preferred,
\[ \cos(x, y) = \frac{x^\top y}{\lVert x \rVert \, \lVert y \rVert}, \qquad d_{\cos}(x,y) = 1 - \cos(x,y). \]
Two documents that use the same vocabulary in the same proportions are deemed similar even if one is ten times longer. Pearson correlation is cosine similarity applied to mean centered vectors, which makes it the natural choice when an additive offset per object should be ignored, as in gene expression profiles.
127.2.4 2.4 Measures for Non-Euclidean Data
Real data are frequently not points in \(\mathbb{R}^d\). Binary attributes call for the Jaccard coefficient, the ratio of the size of the intersection to the size of the union of two attribute sets,
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \]
Categorical data may use a simple matching coefficient or, for sequences, an edit distance such as Levenshtein. Probability distributions are compared with the Kullback to Leibler divergence or its symmetric Jensen to Shannon variant. Time series often demand dynamic time warping, which aligns sequences of different lengths before measuring residual disagreement. The lesson is invariant across cases: pick the measure that encodes your domain notion of similarity, then choose an algorithm that respects it.
127.2.5 2.5 Scaling, Weighting, and the Curse of Dimensionality
Because the Minkowski family sums over coordinates, features measured on large numeric scales silently dominate the distance. Standardizing each feature to zero mean and unit variance, or rescaling to a common range, is therefore a near mandatory preprocessing step whenever features are heterogeneous. The Mahalanobis distance
\[ d_M(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)} \]
generalizes this idea by using the inverse covariance matrix \(\Sigma^{-1}\) to whiten correlated features and equalize their contributions.
A subtler problem appears in high dimensions. As \(d\) grows, the contrast between the nearest and farthest neighbor of a query point shrinks, so that
\[ \frac{\max_i d(q, x_i) - \min_i d(q, x_i)}{\min_i d(q, x_i)} \to 0 \]
under broad conditions. Distances become nearly equal, and the very notion of a nearest cluster degrades. This curse of dimensionality is why dimensionality reduction, feature selection, and subspace clustering are routine companions of high dimensional clustering rather than optional extras.
127.3 3. Hard Versus Soft Clustering
127.3.1 3.1 Hard Assignment
A hard clustering assigns each object to exactly one cluster. We can encode it with an indicator matrix \(U \in \{0, 1\}^{n \times k}\) in which \(u_{ij} = 1\) when object \(i\) belongs to cluster \(j\), subject to \(\sum_{j} u_{ij} = 1\) for every \(i\). Hard assignments are simple to interpret and store, and they match problems where each object truly belongs to one category. Their weakness is brittleness: an object lying exactly between two clusters is forced into one, discarding the information that it was a borderline case.
127.3.2 3.2 Soft and Probabilistic Assignment
A soft or fuzzy clustering relaxes the indicator to a membership in \([0, 1]\), with the row constraint \(\sum_{j} u_{ij} = 1\) retained so that memberships behave like a distribution over clusters. Fuzzy \(c\)-means minimizes
\[ J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2, \qquad m > 1, \]
where the fuzzifier \(m\) controls how soft the assignment is, approaching hard clustering as \(m \to 1\). A probabilistic clustering goes further and treats the data as generated by a mixture model. In a Gaussian mixture, the posterior responsibility
\[ \gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)} \]
gives the probability that component \(j\) generated object \(i\). Soft memberships preserve uncertainty, support principled downstream reasoning, and degrade gracefully near boundaries, at the cost of more parameters and a heavier inference procedure.
hard: x_i -> cluster 2
soft: x_i -> [0.05, 0.60, 0.35] (membership per cluster)
127.3.3 3.3 Overlapping and Disjoint Structure
Hard and soft are not the only axis. Some applications need overlapping clusters in which an object genuinely belongs to several groups at once, such as a person in multiple social communities. This differs from soft clustering, where memberships are fractional but still sum to one. Overlapping models allow an object to have full membership in more than one cluster simultaneously, which calls for yet another family of methods.
127.4 4. The Main Families of Methods
No single algorithm dominates, because each family encodes a different answer to the question “what is a cluster?” Understanding the answer each family gives is the fastest route to choosing wisely.
127.4.1 4.1 Partitional and Centroid Based
Centroid methods define a cluster by a prototype and assign each object to the nearest prototype. The archetype is \(k\)-means, which seeks centers \(\{c_1, \ldots, c_k\}\) minimizing the within cluster sum of squares,
\[ \min_{C, \, c} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2. \]
Lloyd’s algorithm alternates assigning points to the nearest center and recomputing each center as the mean of its members. The method is fast and scalable but assumes clusters are convex, roughly equal in size, and isotropic, and it requires \(k\) in advance. The medoid variant, \(k\)-medoids, uses actual data points as centers and tolerates arbitrary dissimilarities, trading speed for robustness and generality.
127.4.2 4.2 Hierarchical
Hierarchical methods build a tree, the dendrogram, rather than a single flat partition. Agglomerative clustering starts with every object in its own cluster and repeatedly merges the two closest clusters until one remains; divisive clustering runs the reverse. The notion of closest between clusters is set by a linkage criterion such as single, complete, average, or Ward linkage, and the choice strongly shapes the result, with single linkage prone to chaining and Ward favoring compact spherical groups. A flat clustering is obtained by cutting the dendrogram at a chosen height. The output is informative and needs no preset \(k\), but the classic algorithms cost \(O(n^2)\) memory or more, which limits scale.
cut height ----.
|
+------+------+
+--+--+ +---+---+
A B C D
127.4.3 4.3 Density Based
Density methods define a cluster as a dense region of space separated from other dense regions by sparser space. DBSCAN grows clusters from core points that have at least a minimum number of neighbors within radius \(\varepsilon\), labels points reachable from cores as cluster members, and marks the rest as noise. Because clusters follow the contour of dense regions, the family recovers arbitrarily shaped clusters and is robust to outliers, neither of which centroid methods can do. The price is sensitivity to the density parameters and difficulty when clusters have very different densities, partly addressed by hierarchical successors such as OPTICS and HDBSCAN.
127.4.4 4.4 Model Based
Model based methods assume the data were generated by a probabilistic model, usually a mixture of distributions, and fit the model by maximum likelihood, typically with the expectation maximization algorithm. A Gaussian mixture yields soft memberships and can capture elliptical, differently sized, and differently oriented clusters through full covariance matrices, generalizing \(k\)-means, which is the limiting case with shared spherical covariance and hard assignment. The framework also offers principled model selection, since criteria such as the Bayesian information criterion can compare fits with different \(k\).
127.4.5 4.5 Graph and Spectral
When data are naturally relational, or when nonconvex structure defeats geometric methods, we build a similarity graph whose edges encode pairwise affinity and then partition the graph. Spectral clustering embeds the objects using the leading eigenvectors of the graph Laplacian and clusters in that embedding, which lets it separate intertwined, nonconvex shapes that \(k\)-means cannot. Community detection methods such as modularity optimization operate directly on networks. These approaches shine on manifolds and graphs but incur the cost of constructing and decomposing large affinity matrices.
127.5 5. What Makes a Clustering Good
127.5.1 5.1 The Absence of a Universal Criterion
There is no clustering objective that satisfies every reasonable requirement at once. Kleinberg’s impossibility result shows that three intuitive properties, scale invariance, richness, and consistency, cannot be simultaneously satisfied by any clustering function. The practical consequence is that “good” is always relative to a purpose. A clustering that serves customer segmentation may be useless for anomaly detection on the same data. State your purpose first, then evaluate against it.
127.5.2 5.2 Internal Validation
Internal indices score a clustering using only the data and the partition, with no external labels. They typically reward two competing virtues: cohesion, meaning objects within a cluster are close, and separation, meaning clusters are far apart. The silhouette coefficient of object \(i\) captures both,
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \]
where \(a(i)\) is the mean distance to other members of its own cluster and \(b(i)\) is the mean distance to the nearest other cluster. Values near one indicate a well placed object, values near zero a borderline one, and negative values a likely misassignment. Other indices include the Davies to Bouldin index, which is smaller when clusters are compact and well separated, and the Calinski to Harabasz ratio of between cluster to within cluster dispersion. Every internal index embeds its own definition of a cluster, so an index that rewards compactness will flatter centroid methods and penalize density based ones; never treat such a score as neutral ground truth.
127.5.3 5.3 External Validation
When reference labels exist, even if only on a sample, external indices compare the clustering to that reference. The Rand index counts the fraction of object pairs on which the two agree, either grouped together by both or separated by both, and the adjusted Rand index corrects this for chance agreement so that a random clustering scores near zero. Information theoretic measures such as normalized mutual information quantify how much knowing the cluster reduces uncertainty about the label. These measures are valuable but assume the reference labels encode the structure you care about, which is not always so.
127.5.4 5.4 Stability and Practical Judgment
A complementary view asks whether the structure is real or an artifact of one particular sample. Stability analysis perturbs the data, by resampling, subsampling, or adding noise, reclusters, and measures how much the assignments change. Clusterings that survive perturbation are more trustworthy, and stability is often used to choose \(k\). Beyond any number, the ultimate test is utility. Does the clustering compress the data into groups a domain expert recognizes, does it improve a downstream task, does it generate hypotheses that hold up? A clustering is good when it is useful for the purpose that motivated it, and the quantitative indices are instruments in service of that judgment rather than substitutes for it.
127.6 6. Summary
Clustering imposes structure on unlabeled data, and every step in that process is a modeling choice. The proximity measure encodes what similar means, and its selection and scaling matter more than the algorithm. Hard assignments are simple and interpretable, soft and probabilistic assignments preserve uncertainty, and overlapping models handle genuine multiple membership. The main families, centroid, hierarchical, density, model based, and spectral, each answer the question of what a cluster is differently, so the right family follows from the shape and meaning of your data. Finally, no universal definition of good clustering exists. We evaluate with internal indices, external indices when labels are available, and stability analysis, but the final arbiter is fitness for the purpose that prompted the analysis. Approached this way, clustering is less a search for the one true grouping than a disciplined conversation between assumptions, data, and goals.
127.7 References
- Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651 to 666. https://doi.org/10.1016/j.patrec.2009.09.011
- Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264 to 323. https://doi.org/10.1145/331499.331504
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
- Aggarwal, C. C., and Reddy, C. K. (Eds.). (2013). Data Clustering: Algorithms and Applications. CRC Press. https://www.charuaggarwal.net/clusterbook.pdf
- Ester, M., Kriegel, H. P., Sander, J., and Xu, X. (1996). A density based algorithm for discovering clusters in large spatial databases with noise. KDD-96, 226 to 231. https://cdn.aaai.org/KDD/1996/KDD96-037.pdf
- von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395 to 416. https://doi.org/10.1007/s11222-007-9033-z
- Kleinberg, J. (2002). An impossibility theorem for clustering. Advances in Neural Information Processing Systems 15. https://proceedings.neurips.cc/paper/2002/file/43e4e6a6f341e00671e123714de019a8-Paper.pdf
- Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53 to 65. https://doi.org/10.1016/0377-0427(87)90125-7
- Hubert, L., and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193 to 218. https://doi.org/10.1007/BF01908075
- Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is nearest neighbor meaningful? ICDT 1999, 217 to 235. https://doi.org/10.1007/3-540-49257-7_15