130  Agglomerative Hierarchical Clustering

Hierarchical clustering builds a nested family of partitions rather than a single flat assignment of points to groups. Instead of committing to a fixed number of clusters in advance, it produces a tree of clusterings that records how groups merge as we relax our notion of proximity. This chapter develops the agglomerative (bottom up) approach in depth: the merge procedure, the linkage criteria that define distance between clusters, the dendrogram that visualizes the result, the problem of cutting the tree into a usable partition, and the computational complexity that governs how large a dataset the method can handle.

130.1 1. The Agglomerative Schema

Agglomerative clustering begins with \(n\) singleton clusters, one per data point, and repeatedly merges the two closest clusters until a single cluster containing all points remains. The output is not one clustering but a sequence of \(n\) nested clusterings, indexed by the number of merges performed.

130.1.1 1.1 Formal Setup

Let \(X = \{x_1, \dots, x_n\}\) be a set of objects with a dissimilarity function \(d(x_i, x_j) \geq 0\). We do not require \(d\) to be a metric, although several linkage criteria behave best when it is. A clustering at level \(k\) is a partition \(\mathcal{C}_k = \{C_1, \dots, C_k\}\) of \(X\). The agglomerative process produces partitions \(\mathcal{C}_n, \mathcal{C}_{n-1}, \dots, \mathcal{C}_1\), where \(\mathcal{C}_n\) assigns each point to its own cluster and \(\mathcal{C}_1 = \{X\}\).

The key ingredient is a linkage function \(D(C_a, C_b)\) that extends the pointwise dissimilarity \(d\) to a dissimilarity between clusters. Given \(D\), the generic algorithm is:

initialize each point as its own cluster
compute D for all cluster pairs
while more than one cluster remains:
    find the pair (Ca, Cb) minimizing D
    merge Ca and Cb into a new cluster Cab
    record the merge and the merge height D(Ca, Cb)
    update D between Cab and all remaining clusters

The recorded merge heights are central. A well behaved linkage produces monotonically nondecreasing heights, meaning every merge occurs at a height at least as large as the merges that preceded it. This monotonicity is what makes the resulting tree drawable without crossings.

130.1.2 1.2 Why Nested Partitions Are Useful

A flat method such as k-means returns a single partition and forces the analyst to fix \(k\) beforehand. Hierarchical clustering defers that choice. The same computation yields a coarse two cluster view and a fine fifty cluster view simultaneously, which is valuable when the natural granularity is unknown or when the data genuinely has structure at multiple scales. Taxonomies, gene expression groupings, and document organization are classic settings where a nested view matches the domain better than a flat one.

130.2 2. Linkage Criteria

The choice of linkage function \(D\) determines the shape of the resulting tree more than any other decision. We describe the four most important criteria and their characteristic behavior.

130.2.1 2.1 Single Linkage

Single linkage defines the distance between two clusters as the smallest pairwise distance across them:

\[ D_{\text{single}}(C_a, C_b) = \min_{x \in C_a,\, y \in C_b} d(x, y). \]

Because only the nearest pair matters, single linkage can connect clusters through a chain of intermediate points. This gives it the ability to recover elongated or non convex shapes, and it is closely related to the minimum spanning tree of the data: the sequence of single linkage merges corresponds exactly to adding edges of the minimum spanning tree in increasing weight order. The drawback is the chaining effect, where a thin bridge of points links two otherwise distinct groups into one cluster, often undesirably.

130.2.2 2.2 Complete Linkage

Complete linkage uses the largest pairwise distance:

\[ D_{\text{complete}}(C_a, C_b) = \max_{x \in C_a,\, y \in C_b} d(x, y). \]

Here a merge is permitted only when every point in one cluster is reasonably close to every point in the other, since the controlling quantity is the worst case pair. The result is compact, roughly spherical clusters with small diameters. Complete linkage resists chaining but is sensitive to outliers, because a single distant point inflates the maximum and delays merges.

130.2.3 2.3 Average Linkage

Average linkage, often called UPGMA (unweighted pair group method with arithmetic mean), averages over all cross cluster pairs:

\[ D_{\text{average}}(C_a, C_b) = \frac{1}{|C_a|\,|C_b|} \sum_{x \in C_a} \sum_{y \in C_b} d(x, y). \]

This is a compromise between single and complete linkage. It is less prone to chaining than single linkage and less outlier sensitive than complete linkage, which makes it a common default for general purpose use, particularly in bioinformatics where UPGMA has a long history.

130.2.4 2.4 Ward Linkage

Ward linkage takes a variance minimizing view. Rather than working directly with pairwise distances, it merges the pair of clusters that produces the smallest increase in total within cluster sum of squares. Define the sum of squared deviations of a cluster about its centroid \(\mu_C\):

\[ \text{ESS}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2. \]

The merge cost is the increase in total ESS, which has the closed form

\[ \Delta(C_a, C_b) = \frac{|C_a|\,|C_b|}{|C_a| + |C_b|}\, \lVert \mu_{C_a} - \mu_{C_b} \rVert^2. \]

Ward tends to produce balanced, compact clusters of comparable size and is the linkage most analogous to k-means, since both minimize within cluster variance. It assumes a Euclidean geometry and should be paired with squared Euclidean distance for the closed form above to hold.

130.2.5 2.5 The Lance Williams Recurrence

A practical reason these four criteria are grouped together is that all of them obey the Lance Williams update formula, which expresses the distance from a newly merged cluster \(C_a \cup C_b\) to any other cluster \(C_k\) in terms of distances already computed:

\[ D(C_a \cup C_b, C_k) = \alpha_a D(C_a, C_k) + \alpha_b D(C_b, C_k) + \beta D(C_a, C_b) + \gamma \lvert D(C_a, C_k) - D(C_b, C_k) \rvert. \]

Each linkage corresponds to a particular setting of the coefficients. Writing \(n_a = |C_a|\), \(n_b = |C_b|\), \(n_k = |C_k|\):

Linkage \(\alpha_a\) \(\alpha_b\) \(\beta\) \(\gamma\)
Single \(1/2\) \(1/2\) \(0\) \(-1/2\)
Complete \(1/2\) \(1/2\) \(0\) \(+1/2\)
Average \(\frac{n_a}{n_a+n_b}\) \(\frac{n_b}{n_a+n_b}\) \(0\) \(0\)
Ward \(\frac{n_a+n_k}{n_a+n_b+n_k}\) \(\frac{n_b+n_k}{n_a+n_b+n_k}\) \(\frac{-n_k}{n_a+n_b+n_k}\) \(0\)

The recurrence is what makes an efficient implementation possible. Once we have the distance from each old cluster to its peers, the distance involving a merged cluster is a constant time arithmetic update rather than a fresh pass over the underlying points.

130.3 3. The Dendrogram

The dendrogram is the canonical visualization of a hierarchical clustering. It is a rooted binary tree whose leaves are the \(n\) data points and whose internal nodes are merges, drawn at a height equal to the linkage distance at which the merge occurred.

130.3.1 3.1 Reading a Dendrogram

Each leaf sits at height zero. When two clusters merge, a horizontal bar connects their subtrees at the merge height, and the new combined subtree continues upward. The vertical axis therefore measures dissimilarity: short bars indicate that two groups joined while still very similar, and tall bars indicate that the groups were dissimilar when forced together.

The horizontal ordering of leaves is not unique. At each internal node the two child subtrees can be swapped without changing the clustering, so the same dendrogram admits \(2^{n-1}\) leaf orderings. Implementations often apply an ordering heuristic to place similar leaves near one another, but no semantic meaning should be read into left versus right placement beyond what the heights encode.

130.3.2 3.2 Monotonicity and Inversions

A dendrogram is drawable without crossings only when merge heights are nondecreasing along every root to leaf path. Single, complete, average, and Ward linkage all guarantee this monotonicity. Some alternative criteria, notably centroid and median linkage, can violate it: a later merge may occur at a lower height than an earlier one, producing an inversion (also called a reversal) where a horizontal bar dips below the bars beneath it. Inversions make the tree hard to interpret and are one reason centroid and median linkage are used less often than the four criteria above.

130.3.3 3.3 Cophenetic Distance

The dendrogram induces its own distance on the data. The cophenetic distance between two points is the height of the merge at which they first fall into the same cluster. Comparing this cophenetic distance to the original pairwise distance, via the cophenetic correlation coefficient, quantifies how faithfully the tree preserves the input dissimilarities. A high cophenetic correlation suggests the chosen linkage captured the data geometry well; a low one warns that the tree imposed structure the data does not support.

130.4 4. Cutting the Tree

A dendrogram is a family of clusterings, not a single answer. To obtain a concrete partition we cut the tree, and the cutting strategy is where domain judgment re enters.

130.4.1 4.1 Cutting at a Fixed Height

The simplest cut is a horizontal line at height \(h\). Every branch crossed by the line becomes one cluster, gathering all leaves beneath that crossing. Lowering \(h\) yields more, smaller clusters; raising it yields fewer, larger ones. A natural place to cut is just below a large vertical gap, since a tall bar means the two groups it joins were dissimilar and probably should stay separate.

# conceptual: clusters from a height cut
labels = cut_tree_at_height(dendrogram, h=2.5)

130.4.2 4.2 Cutting for a Fixed Number of Clusters

Alternatively we fix the desired number of clusters \(k\) and cut at the height that produces exactly \(k\) branches. Because the merge heights are known, this is equivalent to stopping the agglomeration after \(n - k\) merges. This is convenient when downstream requirements dictate a specific \(k\), but it discards the height information that might suggest a more natural value.

130.4.3 4.3 Choosing Where to Cut

Several heuristics guide the cut. The largest gap rule cuts across the tallest unbroken vertical span, on the theory that the biggest jump in merge height separates genuinely distinct structure. Internal validity indices such as the silhouette coefficient can be evaluated across candidate cuts, selecting the \(k\) that maximizes cluster separation relative to cohesion. The inconsistency coefficient compares each merge height to the average and spread of heights below it, flagging merges that are anomalously tall as good cut points. No rule is universally correct; the cut encodes a decision about granularity that the data alone cannot make.

130.4.4 4.4 Non Horizontal Cuts

Nothing requires the cut to be a single height. A more flexible scheme allows different branches to be cut at different heights, which is appropriate when clusters have genuinely different scales. Dynamic tree cutting methods do exactly this, examining the shape of each branch and cutting it where its internal structure suggests, rather than imposing one global threshold. This addresses a real weakness of horizontal cuts: a single height that resolves a dense region well may over split a sparse one, and vice versa.

130.5 5. Complexity

The expressiveness of hierarchical clustering comes at a cost. Understanding that cost determines whether the method is feasible for a given dataset.

130.5.1 5.1 Time Complexity

A naive implementation recomputes all pairwise cluster distances at every step, costing \(O(n^3)\) time overall: there are \(O(n)\) merges, and each scans an \(O(n^2)\) distance structure. With the Lance Williams recurrence and a priority queue that maintains the nearest neighbor of each cluster, the cost drops to \(O(n^2 \log n)\) for general linkages.

For single linkage the minimum spanning tree connection yields an exact \(O(n^2)\) algorithm. A more general technique, the nearest neighbor chain algorithm, achieves \(O(n^2)\) time and \(O(n)\) working memory for any linkage that satisfies a reducibility property, which complete, average, and Ward linkage all do. The nearest neighbor chain walks from cluster to cluster following nearest neighbor links until it finds a reciprocal nearest neighbor pair, merges that pair, and resumes, guaranteeing that each distance is examined a bounded number of times.

The quadratic floor is fundamental for exact methods, because the algorithm must at minimum consider the \(\binom{n}{2}\) pairwise dissimilarities. This makes exact hierarchical clustering impractical for datasets much beyond tens of thousands of points without approximation.

130.5.2 5.2 Space Complexity

Storing the full pairwise distance matrix requires \(O(n^2)\) memory, which is usually the binding constraint in practice. At \(n = 10^5\) the matrix holds \(5 \times 10^9\) entries, well beyond ordinary memory budgets. The nearest neighbor chain algorithm reduces working memory to \(O(n)\) by recomputing distances on demand rather than materializing the matrix, trading arithmetic for storage, and is the standard route to scaling hierarchical clustering to larger \(n\).

130.5.3 5.3 Practical Mitigations

When \(n\) is too large for exact methods, common strategies include clustering a representative sample and assigning the rest by nearest cluster, running a fast flat method such as k-means to produce a few thousand micro clusters and then applying hierarchical clustering to those, or restricting merges to a sparse neighborhood graph so that only local distances are ever considered. Each sacrifices some exactness for tractability, and the right choice depends on whether the priority is the full tree or only a partition near a particular granularity.

130.6 6. Practical Guidance

A few recommendations consolidate the preceding material. Use Ward linkage with Euclidean distance as a default when clusters are expected to be compact and of similar size, since it behaves predictably and parallels k-means. Reach for single linkage only when elongated or irregular cluster shapes are expected, and watch for chaining. Prefer average or complete linkage when a balance between the extremes is wanted. Always inspect the dendrogram before cutting, since the merge heights reveal whether the data has clear structure or whether any cut would be arbitrary. Validate the chosen cut with an index such as the silhouette score or the cophenetic correlation rather than trusting a single visual impression. Finally, respect the quadratic cost: confirm that \(n\) is small enough for the full distance matrix, and switch to sampling or micro clustering before it is not.

130.7 References

  1. Lance, G. N., and Williams, W. T. “A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems.” The Computer Journal, 1967. https://doi.org/10.1093/comjnl/9.4.373
  2. Ward, J. H. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association, 1963. https://doi.org/10.1080/01621459.1963.10500845
  3. Murtagh, F., and Contreras, P. “Algorithms for Hierarchical Clustering: An Overview.” WIREs Data Mining and Knowledge Discovery, 2012. https://doi.org/10.1002/widm.53
  4. Murtagh, F., and Legendre, P. “Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion?” Journal of Classification, 2014. https://doi.org/10.1007/s00357-014-9161-z
  5. Mullner, D. “Modern Hierarchical, Agglomerative Clustering Algorithms.” arXiv preprint, 2011. https://arxiv.org/abs/1109.2378
  6. Langfelder, P., Zhang, B., and Horvath, S. “Defining Clusters from a Hierarchical Cluster Tree: The Dynamic Tree Cut Package for R.” Bioinformatics, 2008. https://doi.org/10.1093/bioinformatics/btm563
  7. SciPy Documentation. “Hierarchical Clustering (scipy.cluster.hierarchy).” https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
  8. scikit-learn Documentation. “Hierarchical Clustering.” https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering