169  Clustering Evaluation Metrics

Clustering is unsupervised by definition, which makes its evaluation uniquely difficult. There is no ground truth label to predict, no loss to minimize against an external target, and frequently no agreement about what a “correct” partition even means. Yet practitioners must still answer concrete questions: How many clusters should we keep? Does algorithm A produce a better grouping than algorithm B? Is the structure we found real, or an artifact of the metric we happened to choose? This chapter develops the two principal families of clustering evaluation, internal and external validity, with their mathematical definitions, and then examines why no single number ever settles the matter.

169.1 1. The Evaluation Problem

Let \(X = \{x_1, \dots, x_n\}\) be a dataset of \(n\) points and let \(\mathcal{C} = \{C_1, \dots, C_K\}\) be a partition into \(K\) clusters, so that \(\bigcup_k C_k = X\) and \(C_j \cap C_k = \varnothing\) for \(j \neq k\). A clustering evaluation metric is a function that maps \(\mathcal{C}\) (and possibly auxiliary information) to a real number expressing quality.

We distinguish three regimes. Internal validity uses only \(X\) and \(\mathcal{C}\), scoring a partition by geometric or statistical properties such as compactness and separation. External validity compares \(\mathcal{C}\) to a reference partition \(\mathcal{P} = \{P_1, \dots, P_L\}\) supplied from outside, for example human labels. Relative validity uses an internal index to rank several candidate clusterings, typically to select \(K\). The boundary between internal and relative is mostly procedural, since relative criteria are internal indices evaluated across configurations.

A central caution frames everything that follows: a clustering objective and a clustering evaluation are different things, and optimizing one with respect to the other is circular. Scoring k-means output with an index that also rewards spherical, equal-variance clusters tells you little, because both encode the same bias.

169.2 2. Internal Validity

Internal indices formalize the intuition that good clusters are compact within and well separated between. They require a dissimilarity, usually Euclidean distance \(d(x_i, x_j) = \lVert x_i - x_j \rVert_2\), and inherit all of that metric’s assumptions about scale and geometry.

169.2.1 2.1 The Silhouette Coefficient

The silhouette assigns a score to each individual point. For point \(x_i\) in cluster \(C_k\), define the mean intra-cluster distance

\[ a(i) = \frac{1}{|C_k| - 1} \sum_{x_j \in C_k,\, j \neq i} d(x_i, x_j), \]

the average distance to its own neighbors, and the mean distance to the nearest competing cluster,

\[ b(i) = \min_{\ell \neq k} \frac{1}{|C_\ell|} \sum_{x_j \in C_\ell} d(x_i, x_j). \]

The silhouette of the point is

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1]. \]

A value near \(1\) means the point sits comfortably inside its cluster and far from any other; a value near \(0\) means it lies on a boundary; a negative value suggests it would fit a neighboring cluster better. Singleton clusters are conventionally assigned \(s(i) = 0\). The overall silhouette is the mean \(\bar{s} = \frac{1}{n} \sum_i s(i)\), and maximizing \(\bar{s}\) over \(K\) is a standard heuristic for choosing the number of clusters.

for each point i:
    a = mean distance to other points in same cluster
    b = min over other clusters of mean distance to that cluster
    s = (b - a) / max(a, b)
score = mean of s over all points

The silhouette is interpretable and bounded, but it carries a strong bias toward convex, roughly isotropic clusters, because both \(a\) and \(b\) are mean pairwise distances. On density-based or manifold-shaped clusters it can rank the geometrically correct partition below a worse one. It is also \(O(n^2)\) in the naive form, motivating sampled or simplified variants for large \(n\).

169.2.2 2.2 The Dunn Index

The Dunn index targets the worst case rather than the average. Let \(\delta(C_i, C_j)\) be the inter-cluster distance between two clusters, for instance the minimum pairwise distance \(\min_{x \in C_i, y \in C_j} d(x, y)\), and let \(\Delta(C_k)\) be the cluster diameter, for instance the maximum intra-cluster distance \(\max_{x, y \in C_k} d(x, y)\). The Dunn index is

\[ D = \frac{\min_{1 \le i < j \le K} \delta(C_i, C_j)}{\max_{1 \le k \le K} \Delta(C_k)}. \]

It is the ratio of the smallest separation to the largest spread, so larger is better and a high value certifies that even the closest pair of clusters is well separated relative to the loosest single cluster. Because both numerator and denominator are extrema, \(D\) is acutely sensitive to outliers: one stray point inflates a diameter or shrinks a gap and collapses the index. The choice of \(\delta\) and \(\Delta\) matters greatly, and generalized Dunn indices substitute average linkage or centroid distances to recover robustness.

169.2.3 2.3 Other Internal Criteria

Two further indices appear constantly. The Calinski-Harabasz index, or variance ratio criterion, is

\[ \mathrm{CH} = \frac{\operatorname{tr}(B_K)}{\operatorname{tr}(W_K)} \cdot \frac{n - K}{K - 1}, \]

where \(\operatorname{tr}(B_K)\) and \(\operatorname{tr}(W_K)\) are the traces of the between-cluster and within-cluster scatter matrices. Higher is better, and the \((n-K)/(K-1)\) factor penalizes excess clusters. The Davies-Bouldin index averages, over each cluster, the worst-case ratio of combined scatter to centroid separation against any other cluster,

\[ \mathrm{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(\mu_i, \mu_j)}, \]

where \(\sigma_i\) is the mean distance of points in \(C_i\) to their centroid \(\mu_i\). Here lower is better. Both share the silhouette’s preference for compact, separated, convex clusters, which is exactly why agreement among internal indices is reassuring but not conclusive.

169.3 3. External Validity

When a reference partition \(\mathcal{P}\) is available, we can ask how well the discovered \(\mathcal{C}\) recovers it. External indices do not require the cluster labels to match the reference labels by name, since clustering is invariant to relabeling. Instead they measure agreement on the more fundamental question of which pairs of points are grouped together.

169.3.1 3.1 The Pair-Counting View and the Rand Index

Consider every unordered pair of points. Each pair falls into one of four categories relative to \(\mathcal{C}\) and \(\mathcal{P}\):

  • \(a\): pairs together in both \(\mathcal{C}\) and \(\mathcal{P}\),
  • \(b\): pairs together in \(\mathcal{C}\) but separated in \(\mathcal{P}\),
  • \(c\): pairs separated in \(\mathcal{C}\) but together in \(\mathcal{P}\),
  • \(d\): pairs separated in both.

There are \(\binom{n}{2} = a + b + c + d\) pairs in total. The Rand index is the fraction of pairs the two partitions agree on,

\[ \mathrm{RI} = \frac{a + d}{a + b + c + d} \in [0, 1]. \]

It is symmetric and intuitive, but it has a serious flaw: two random partitions already score high, because the \(d\) term, agreements on separation, dominates when \(K\) and \(L\) are large. The expected value of \(\mathrm{RI}\) under random labeling is not zero and drifts with the number of clusters, which makes raw Rand scores incomparable across settings.

169.3.2 3.2 The Adjusted Rand Index

The fix is to subtract the expected agreement under a null model and normalize against the maximum. With \(n_{ij} = |C_i \cap P_j|\) the contingency counts, \(a_i = \sum_j n_{ij}\) the cluster sizes, and \(b_j = \sum_i n_{ij}\) the reference sizes, the adjusted Rand index is

\[ \mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}{\tfrac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}. \]

The null model assumes the hypergeometric distribution of cluster assignments given fixed marginals. By construction the expected value of ARI under random labeling is \(0\), perfect agreement is \(1\), and worse-than-random partitions go negative. This chance correction is what makes ARI the workhorse external index, comparable across datasets and values of \(K\).

169.3.3 3.3 Mutual Information and Its Normalizations

An information-theoretic alternative treats cluster membership as a random variable. Let \(U\) be the cluster label of a uniformly random point under \(\mathcal{C}\) and \(V\) its label under \(\mathcal{P}\). With \(p(i) = a_i / n\), \(p(j) = b_j / n\), and joint \(p(i, j) = n_{ij} / n\), the mutual information is

\[ \mathrm{MI}(U, V) = \sum_{i=1}^{K} \sum_{j=1}^{L} p(i, j) \log \frac{p(i, j)}{p(i)\, p(j)}. \]

MI is zero exactly when the two partitions are statistically independent and grows with shared information, but it is unbounded and increases as clusters are split, so it cannot be compared directly. Normalized mutual information divides by an average of the marginal entropies \(H(U)\) and \(H(V)\),

\[ \mathrm{NMI}(U, V) = \frac{\mathrm{MI}(U, V)}{\operatorname{mean}\{H(U),\, H(V)\}}, \]

with the geometric or arithmetic mean as the usual choice, yielding a score in \([0, 1]\). NMI is popular but, like the raw Rand index, is not corrected for chance: random partitions with many clusters score spuriously high. The adjusted mutual information applies the same expectation-subtraction logic as ARI,

\[ \mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - \mathbb{E}[\mathrm{MI}(U, V)]}{\operatorname{mean}\{H(U),\, H(V)\} - \mathbb{E}[\mathrm{MI}(U, V)]}, \]

restoring an expected value of \(0\) under independence. As a rule of thumb, prefer AMI when the reference has many small clusters and ARI when clusters are large and balanced.

169.3.4 3.4 V-Measure

The V-measure decomposes agreement into two complementary conditional entropies, mirroring precision and recall. Homogeneity asks whether each cluster contains members of a single reference class,

\[ h = 1 - \frac{H(V \mid U)}{H(V)}, \]

and completeness asks whether all members of a reference class are assigned to the same cluster,

\[ c = 1 - \frac{H(U \mid V)}{H(U)}. \]

The V-measure is their harmonic mean, optionally weighted by \(\beta\),

\[ V_\beta = \frac{(1 + \beta)\, h \, c}{\beta\, h + c}. \]

Its diagnostic value lies in the split: a clustering can be perfectly homogeneous yet badly incomplete if it shatters one true class into many clusters, and the two numbers tell you which failure mode you face. Like NMI, the V-measure is not chance-corrected, so its absolute magnitude is sensitive to \(K\).

169.3.5 3.5 The Fowlkes-Mallows Index

Returning to pair counting, the Fowlkes-Mallows index is the geometric mean of pairwise precision and recall,

\[ \mathrm{FM} = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}} = \frac{a}{\sqrt{(a + b)(a + c)}} \in [0, 1]. \]

Here \(a / (a + b)\) is the fraction of co-clustered pairs that are truly together, and \(a / (a + c)\) is the fraction of truly-together pairs that were co-clustered. FM ignores the \(d\) term entirely, so unlike the raw Rand index it does not inflate when most pairs are trivially separated, which makes it more informative for large \(K\). It too benefits from a chance-corrected reading, and its expected value under random labeling depends on the marginals.

build contingency table n_ij = |C_i ∩ P_j|
a = sum_ij C(n_ij, 2)             # pairs together in both
a+b = sum_i C(|C_i|, 2)           # pairs together in C
a+c = sum_j C(|P_j|, 2)           # pairs together in P
FM  = a / sqrt((a+b) * (a+c))

169.4 4. The Limits of Clustering Evaluation

The proliferation of indices is a symptom, not a solution. Several structural limits constrain what any of them can tell us.

No index is objective about \(K\). Internal indices have systematic preferences over the number of clusters. Calinski-Harabasz tends to favor fewer clusters, others favor more, and the silhouette can show several local maxima. Selecting \(K\) by maximizing one index merely encodes that index’s bias as a decision.

Internal indices smuggle in a cluster definition. Silhouette, Dunn, Davies-Bouldin, and Calinski-Harabasz all reward compact, separated, convex blobs. On data with elongated, nested, or density-varying structure, the geometrically and semantically correct clustering can score worse than an incorrect spherical one. There is no model-free notion of cluster quality, a point formalized by Kleinberg’s impossibility theorem, which shows that no clustering function can simultaneously satisfy scale invariance, richness, and consistency.

External indices answer a narrower question than they appear to. A high ARI against human labels confirms only that the clustering recovers that particular labeling, which may itself be one arbitrary view of the data among many valid ones. Genes, documents, or customers can be partitioned along several legitimate axes, and matching one reference says nothing about the others. The reference labels are also frequently noisy or coarse.

Chance correction is not optional. Raw Rand index, NMI, V-measure, and Fowlkes-Mallows all drift with the number and balance of clusters. Reporting them without their adjusted counterparts, or comparing them across experiments with different \(K\), produces conclusions that are artifacts of the metric. The expected-value baseline must be subtracted before scores are comparable.

Metrics disagree, and that is informative. Because each index encodes different assumptions, a robust conclusion is one that survives several of them: a pairwise index such as ARI, an information-theoretic index such as AMI, and where labels are absent, an internal index read with its biases in mind. When the indices conflict, the disagreement localizes the assumption that is doing the work, which is more honest than averaging them into a single composite. Sound practice also includes stability analysis under resampling and perturbation, and a qualitative inspection of the clusters themselves, because the final arbiter of a clustering is whether it is useful for the task that motivated it.

169.5 5. References

  1. Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65. https://doi.org/10.1016/0377-0427(87)90125-7
  2. Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3), 32-57. https://doi.org/10.1080/01969727308546046
  3. Hubert, L., and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218. https://doi.org/10.1007/BF01908075
  4. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846-850. https://doi.org/10.2307/2284239
  5. Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837-2854. https://www.jmlr.org/papers/v11/vinh10a.html
  6. Rosenberg, A., and Hirschberg, J. (2007). V-measure: a conditional entropy-based external cluster evaluation measure. Proceedings of EMNLP-CoNLL, 410-420. https://aclanthology.org/D07-1043/
  7. Fowlkes, E. B., and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383), 553-569. https://doi.org/10.2307/2288117
  8. Calinski, T., and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1), 1-27. https://doi.org/10.1080/03610927408827101
  9. Davies, D. L., and Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 224-227. https://doi.org/10.1109/TPAMI.1979.4766909
  10. Kleinberg, J. (2002). An impossibility theorem for clustering. Advances in Neural Information Processing Systems (NeurIPS), 15. https://papers.nips.cc/paper/2002/hash/43e4e6a6f341e00671e123714de019a8-Abstract.html