130 Agglomerative Hierarchical Clustering

Hierarchical clustering builds a nested family of partitions rather than a single flat assignment of points to groups. Instead of committing to a fixed number of clusters in advance, it produces a tree of clusterings that records how groups merge as we relax our notion of proximity. This chapter develops the agglomerative (bottom up) approach in depth: the merge procedure, the linkage criteria that define distance between clusters, the dendrogram that visualizes the result, the problem of cutting the tree into a usable partition, and the computational complexity that governs how large a dataset the method can handle.

130.1 1. The Agglomerative Schema

Agglomerative clustering begins with $n$ singleton clusters, one per data point, and repeatedly merges the two closest clusters until a single cluster containing all points remains. The output is not one clustering but a sequence of $n$ nested clusterings, indexed by the number of merges performed.

130.1.1 1.1 Formal Setup

Let $X = \{x_1, \dots, x_n\}$ be a set of objects with a dissimilarity function $d(x_i, x_j) \geq 0$. We do not require $d$ to be a metric, although several linkage criteria behave best when it is. A clustering at level $k$ is a partition $\mathcal{C}_k = \{C_1, \dots, C_k\}$ of $X$. The agglomerative process produces partitions $\mathcal{C}_n, \mathcal{C}_{n-1}, \dots, \mathcal{C}_1$, where $\mathcal{C}_n$ assigns each point to its own cluster and $\mathcal{C}_1 = \{X\}$.

The key ingredient is a linkage function $D(C_a, C_b)$ that extends the pointwise dissimilarity $d$ to a dissimilarity between clusters. Given $D$, the generic algorithm is:

initialize each point as its own cluster
compute D for all cluster pairs
while more than one cluster remains:
    find the pair (Ca, Cb) minimizing D
    merge Ca and Cb into a new cluster Cab
    record the merge and the merge height D(Ca, Cb)
    update D between Cab and all remaining clusters

The recorded merge heights are central. A well behaved linkage produces monotonically nondecreasing heights, meaning every merge occurs at a height at least as large as the merges that preceded it. This monotonicity is what makes the resulting tree drawable without crossings.

130.1.2 1.2 Why Nested Partitions Are Useful

A flat method such as k-means returns a single partition and forces the analyst to fix $k$ beforehand. Hierarchical clustering defers that choice. The same computation yields a coarse two cluster view and a fine fifty cluster view simultaneously, which is valuable when the natural granularity is unknown or when the data genuinely has structure at multiple scales. Taxonomies, gene expression groupings, and document organization are classic settings where a nested view matches the domain better than a flat one.

130.2 2. Linkage Criteria

The choice of linkage function $D$ determines the shape of the resulting tree more than any other decision. We describe the four most important criteria and their characteristic behavior.

130.2.1 2.1 Single Linkage

Single linkage defines the distance between two clusters as the smallest pairwise distance across them:

\[ D_{\text{single}}(C_a, C_b) = \min_{x \in C_a,\, y \in C_b} d(x, y). \]

Because only the nearest pair matters, single linkage can connect clusters through a chain of intermediate points. This gives it the ability to recover elongated or non convex shapes, and it is closely related to the minimum spanning tree of the data: the sequence of single linkage merges corresponds exactly to adding edges of the minimum spanning tree in increasing weight order. The drawback is the chaining effect, where a thin bridge of points links two otherwise distinct groups into one cluster, often undesirably.

130.2.2 2.2 Complete Linkage

Complete linkage uses the largest pairwise distance:

\[ D_{\text{complete}}(C_a, C_b) = \max_{x \in C_a,\, y \in C_b} d(x, y). \]

Here a merge is permitted only when every point in one cluster is reasonably close to every point in the other, since the controlling quantity is the worst case pair. The result is compact, roughly spherical clusters with small diameters. Complete linkage resists chaining but is sensitive to outliers, because a single distant point inflates the maximum and delays merges.

130.2.3 2.3 Average Linkage

Average linkage, often called UPGMA (unweighted pair group method with arithmetic mean), averages over all cross cluster pairs:

\[ D_{\text{average}}(C_a, C_b) = \frac{1}{|C_a|\,|C_b|} \sum_{x \in C_a} \sum_{y \in C_b} d(x, y). \]

This is a compromise between single and complete linkage. It is less prone to chaining than single linkage and less outlier sensitive than complete linkage, which makes it a common default for general purpose use, particularly in bioinformatics where UPGMA has a long history.

130.2.4 2.4 Ward Linkage

Ward linkage takes a variance minimizing view. Rather than working directly with pairwise distances, it merges the pair of clusters that produces the smallest increase in total within cluster sum of squares. Define the sum of squared deviations of a cluster about its centroid $\mu_C$:

\[ \text{ESS}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2. \]

The merge cost is the increase in total ESS, which has the closed form

\[ \Delta(C_a, C_b) = \frac{|C_a|\,|C_b|}{|C_a| + |C_b|}\, \lVert \mu_{C_a} - \mu_{C_b} \rVert^2. \]

Ward tends to produce balanced, compact clusters of comparable size and is the linkage most analogous to k-means, since both minimize within cluster variance. It assumes a Euclidean geometry and should be paired with squared Euclidean distance for the closed form above to hold.

130.2.5 2.5 The Lance Williams Recurrence

A practical reason these four criteria are grouped together is that all of them obey the Lance Williams update formula, which expresses the distance from a newly merged cluster $C_a \cup C_b$ to any other cluster $C_k$ in terms of distances already computed:

\[ D(C_a \cup C_b, C_k) = \alpha_a D(C_a, C_k) + \alpha_b D(C_b, C_k) + \beta D(C_a, C_b) + \gamma \lvert D(C_a, C_k) - D(C_b, C_k) \rvert. \]

Each linkage corresponds to a particular setting of the coefficients. Writing $n_a = |C_a|$, $n_b = |C_b|$, $n_k = |C_k|$:

Linkage	$\alpha_a$	$\alpha_b$	$\beta$	$\gamma$
Single	$1/2$	$1/2$	$0$	$-1/2$
Complete	$1/2$	$1/2$	$0$	$+1/2$
Average	$\frac{n_a}{n_a+n_b}$	$\frac{n_b}{n_a+n_b}$	$0$	$0$
Ward	$\frac{n_a+n_k}{n_a+n_b+n_k}$	$\frac{n_b+n_k}{n_a+n_b+n_k}$	$\frac{-n_k}{n_a+n_b+n_k}$	$0$

The recurrence is what makes an efficient implementation possible. Once we have the distance from each old cluster to its peers, the distance involving a merged cluster is a constant time arithmetic update rather than a fresh pass over the underlying points.

130.3 3. The Dendrogram

The dendrogram is the canonical visualization of a hierarchical clustering. It is a rooted binary tree whose leaves are the $n$ data points and whose internal nodes are merges, drawn at a height equal to the linkage distance at which the merge occurred.

130.3.1 3.1 Reading a Dendrogram

Each leaf sits at height zero. When two clusters merge, a horizontal bar connects their subtrees at the merge height, and the new combined subtree continues upward. The vertical axis therefore measures dissimilarity: short bars indicate that two groups joined while still very similar, and tall bars indicate that the groups were dissimilar when forced together.

The horizontal ordering of leaves is not unique. At each internal node the two child subtrees can be swapped without changing the clustering, so the same dendrogram admits $2^{n-1}$ leaf orderings. Implementations often apply an ordering heuristic to place similar leaves near one another, but no semantic meaning should be read into left versus right placement beyond what the heights encode.

130.3.2 3.2 Monotonicity and Inversions

A dendrogram is drawable without crossings only when merge heights are nondecreasing along every root to leaf path. Single, complete, average, and Ward linkage all guarantee this monotonicity. Some alternative criteria, notably centroid and median linkage, can violate it: a later merge may occur at a lower height than an earlier one, producing an inversion (also called a reversal) where a horizontal bar dips below the bars beneath it. Inversions make the tree hard to interpret and are one reason centroid and median linkage are used less often than the four criteria above.

130.3.3 3.3 Cophenetic Distance

The dendrogram induces its own distance on the data. The cophenetic distance between two points is the height of the merge at which they first fall into the same cluster. Comparing this cophenetic distance to the original pairwise distance, via the cophenetic correlation coefficient, quantifies how faithfully the tree preserves the input dissimilarities. A high cophenetic correlation suggests the chosen linkage captured the data geometry well; a low one warns that the tree imposed structure the data does not support.

130.4 4. Cutting the Tree

A dendrogram is a family of clusterings, not a single answer. To obtain a concrete partition we cut the tree, and the cutting strategy is where domain judgment re enters.

130.4.1 4.1 Cutting at a Fixed Height

The simplest cut is a horizontal line at height $h$. Every branch crossed by the line becomes one cluster, gathering all leaves beneath that crossing. Lowering $h$ yields more, smaller clusters; raising it yields fewer, larger ones. A natural place to cut is just below a large vertical gap, since a tall bar means the two groups it joins were dissimilar and probably should stay separate.

# conceptual: clusters from a height cut
labels = cut_tree_at_height(dendrogram, h=2.5)

130.4.2 4.2 Cutting for a Fixed Number of Clusters

Alternatively we fix the desired number of clusters $k$ and cut at the height that produces exactly $k$ branches. Because the merge heights are known, this is equivalent to stopping the agglomeration after $n - k$ merges. This is convenient when downstream requirements dictate a specific $k$, but it discards the height information that might suggest a more natural value.

130.4.3 4.3 Choosing Where to Cut

Several heuristics guide the cut. The largest gap rule cuts across the tallest unbroken vertical span, on the theory that the biggest jump in merge height separates genuinely distinct structure. Internal validity indices such as the silhouette coefficient can be evaluated across candidate cuts, selecting the $k$ that maximizes cluster separation relative to cohesion. The inconsistency coefficient compares each merge height to the average and spread of heights below it, flagging merges that are anomalously tall as good cut points. No rule is universally correct; the cut encodes a decision about granularity that the data alone cannot make.

130.4.4 4.4 Non Horizontal Cuts

Nothing requires the cut to be a single height. A more flexible scheme allows different branches to be cut at different heights, which is appropriate when clusters have genuinely different scales. Dynamic tree cutting methods do exactly this, examining the shape of each branch and cutting it where its internal structure suggests, rather than imposing one global threshold. This addresses a real weakness of horizontal cuts: a single height that resolves a dense region well may over split a sparse one, and vice versa.

130.5 5. Complexity

The expressiveness of hierarchical clustering comes at a cost. Understanding that cost determines whether the method is feasible for a given dataset.

130.5.1 5.1 Time Complexity

A naive implementation recomputes all pairwise cluster distances at every step, costing $O(n^3)$ time overall: there are $O(n)$ merges, and each scans an $O(n^2)$ distance structure. With the Lance Williams recurrence and a priority queue that maintains the nearest neighbor of each cluster, the cost drops to $O(n^2 \log n)$ for general linkages.

For single linkage the minimum spanning tree connection yields an exact $O(n^2)$ algorithm. A more general technique, the nearest neighbor chain algorithm, achieves $O(n^2)$ time and $O(n)$ working memory for any linkage that satisfies a reducibility property, which complete, average, and Ward linkage all do. The nearest neighbor chain walks from cluster to cluster following nearest neighbor links until it finds a reciprocal nearest neighbor pair, merges that pair, and resumes, guaranteeing that each distance is examined a bounded number of times.

The quadratic floor is fundamental for exact methods, because the algorithm must at minimum consider the $\binom{n}{2}$ pairwise dissimilarities. This makes exact hierarchical clustering impractical for datasets much beyond tens of thousands of points without approximation.

130.5.2 5.2 Space Complexity

Storing the full pairwise distance matrix requires $O(n^2)$ memory, which is usually the binding constraint in practice. At $n = 10^5$ the matrix holds $5 \times 10^9$ entries, well beyond ordinary memory budgets. The nearest neighbor chain algorithm reduces working memory to $O(n)$ by recomputing distances on demand rather than materializing the matrix, trading arithmetic for storage, and is the standard route to scaling hierarchical clustering to larger $n$.

130.5.3 5.3 Practical Mitigations

When $n$ is too large for exact methods, common strategies include clustering a representative sample and assigning the rest by nearest cluster, running a fast flat method such as k-means to produce a few thousand micro clusters and then applying hierarchical clustering to those, or restricting merges to a sparse neighborhood graph so that only local distances are ever considered. Each sacrifices some exactness for tractability, and the right choice depends on whether the priority is the full tree or only a partition near a particular granularity.

130.6 6. A Worked Derivation: Ward From the Lance Williams Update

It is worth seeing exactly why the Ward merge cost reduces to the closed form quoted in Section 2.4 and how that closed form is consistent with the Lance Williams coefficients, since the two appear unrelated at first glance.

Start from the error sum of squares of a cluster, $\text{ESS}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2$ with centroid $\mu_C = \frac{1}{|C|}\sum_{x \in C} x$. A standard identity rewrites this without reference to the centroid:

\[ \text{ESS}(C) = \frac{1}{2|C|} \sum_{x \in C} \sum_{y \in C} \lVert x - y \rVert^2 . \]

When we merge $C_a$ and $C_b$, the change in total ESS is $\Delta = \text{ESS}(C_a \cup C_b) - \text{ESS}(C_a) - \text{ESS}(C_b)$. Substituting the centroid form of each term and simplifying, every within $C_a$ and within $C_b$ contribution cancels, leaving only the cross term, which collapses to

\[ \Delta(C_a, C_b) = \frac{n_a n_b}{n_a + n_b} \, \lVert \mu_{C_a} - \mu_{C_b} \rVert^2 , \]

the closed form of Section 2.4. Now define the working dissimilarity that the algorithm actually stores as $D(C_a, C_b) = \Delta(C_a, C_b)$, a squared quantity. Expanding $\Delta(C_a \cup C_b, C_k)$ in terms of $\Delta(C_a, C_k)$, $\Delta(C_b, C_k)$ and $\Delta(C_a, C_b)$ and matching coefficients yields exactly the Ward row of the Lance Williams table:

\[ \alpha_a = \frac{n_a + n_k}{n_a + n_b + n_k}, \quad \alpha_b = \frac{n_b + n_k}{n_a + n_b + n_k}, \quad \beta = \frac{-n_k}{n_a + n_b + n_k}, \quad \gamma = 0 . \]

This is why a correct Ward implementation runs the recurrence on squared distances and reports $\sqrt{\Delta}$ as the dendrogram height: the recurrence is exact only in the squared metric, while the displayed height is conventionally the Euclidean distance between the merged centroids scaled by the size factor. The reference implementation below follows precisely this rule, which is what lets it reproduce the SciPy linkage matrix to machine precision.

130.7 7. Reference Implementation

The companion libraries ship a dependency-free agglomerative clustering routine in all three languages. Each returns a linkage matrix in the SciPy row format [node_a, node_b, height, size]: original points carry ids 0 .. n-1, and the merge formed at step t is given id n + t. The same Lance Williams recurrence drives single, complete, average, and Ward linkage, and the three implementations agree to within 1e-9 on shared fixtures. Install the package for your language (aiinaction for Python, the AIInAction Julia package, and the aiinaction Rust crate) and the snippets below run as written.

We cluster five points, three near the origin and two near $(10, 10)$, and read off the merge structure and a two cluster cut.

Code

from aiinaction.ch125_agglomerative_clustering import (
    linkage_matrix,
    fcluster,
    cophenetic_distances,
)

points = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [10.0, 10.0],
    [10.0, 11.0],
]

Z = linkage_matrix(points, linkage="ward")
print("Ward linkage matrix [a, b, height, size]:")
for row in Z:
    a, b, h, size = row
    print(f"  {int(a)} + {int(b)}  height={h:.4f}  size={int(size)}")

print("Two-cluster cut:", fcluster(Z, n_clusters=2))

coph = cophenetic_distances(Z)
print(f"Cophenetic distance between point 0 and point 3: {coph[0][3]:.4f}")

Ward linkage matrix [a, b, height, size]:
  0 + 1  height=1.0000  size=2
  3 + 4  height=1.0000  size=2
  2 + 5  height=1.2910  size=3
  6 + 7  height=21.7332  size=5
Two-cluster cut: [0, 0, 0, 1, 1]
Cophenetic distance between point 0 and point 3: 21.7332

The first two merges (heights of 1) pull together the two tight groups; the final merge, at a much larger height, joins them only because the algorithm must. The two cluster cut recovers the obvious split, and the large cophenetic distance between a point near the origin and a point near $(10, 10)$ confirms that the tree kept them apart until the very end.

using AIInAction.Ch125AgglomerativeClustering

points = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [10.0, 10.0],
    [10.0, 11.0],
]

Z = linkage_matrix(points, "ward")
for row in Z
    println("  ", Int(row[1]), " + ", Int(row[2]),
            "  height=", round(row[3], digits=4), "  size=", Int(row[4]))
end

println("Two-cluster cut: ", fcluster(Z, 2))   # [0, 0, 0, 1, 1]

use aiinaction::ch125_agglomerative_clustering::{linkage_matrix, fcluster};

fn main() {
    let points = vec![
        vec![0.0, 0.0],
        vec![1.0, 0.0],
        vec![0.0, 1.0],
        vec![10.0, 10.0],
        vec![10.0, 11.0],
    ];

    let z = linkage_matrix(&points, "ward").unwrap();
    for row in &z {
        println!(
            "  {} + {}  height={:.4}  size={}",
            row[0] as usize, row[1] as usize, row[2], row[3] as usize
        );
    }

    // [0, 0, 0, 1, 1]
    println!("Two-cluster cut: {:?}", fcluster(&z, 2).unwrap());
}

All three print the identical linkage matrix and the same [0, 0, 0, 1, 1] cut, which is exactly the parity the shared test fixtures enforce.

130.8 8. Practical Guidance

A few recommendations consolidate the preceding material. Use Ward linkage with Euclidean distance as a default when clusters are expected to be compact and of similar size, since it behaves predictably and parallels k-means. Reach for single linkage only when elongated or irregular cluster shapes are expected, and watch for chaining. Prefer average or complete linkage when a balance between the extremes is wanted. Always inspect the dendrogram before cutting, since the merge heights reveal whether the data has clear structure or whether any cut would be arbitrary. Validate the chosen cut with an index such as the silhouette score or the cophenetic correlation rather than trusting a single visual impression. Finally, respect the quadratic cost: confirm that $n$ is small enough for the full distance matrix, and switch to sampling or micro clustering before it is not.

130.9 References

Lance, G. N., and Williams, W. T. “A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems.” The Computer Journal, 1967. https://doi.org/10.1093/comjnl/9.4.373
Ward, J. H. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association, 1963. https://doi.org/10.1080/01621459.1963.10500845
Murtagh, F., and Contreras, P. “Algorithms for Hierarchical Clustering: An Overview.” WIREs Data Mining and Knowledge Discovery, 2012. https://doi.org/10.1002/widm.53
Murtagh, F., and Legendre, P. “Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion?” Journal of Classification, 2014. https://doi.org/10.1007/s00357-014-9161-z
Mullner, D. “Modern Hierarchical, Agglomerative Clustering Algorithms.” arXiv preprint, 2011. https://arxiv.org/abs/1109.2378
Langfelder, P., Zhang, B., and Horvath, S. “Defining Clusters from a Hierarchical Cluster Tree: The Dynamic Tree Cut Package for R.” Bioinformatics, 2008. https://doi.org/10.1093/bioinformatics/btm563
SciPy Documentation. “Hierarchical Clustering (scipy.cluster.hierarchy).” https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
scikit-learn Documentation. “Hierarchical Clustering.” https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering

# Agglomerative Hierarchical Clustering Hierarchical clustering builds a nested family of partitions rather than a single flat assignment of points to groups. Instead of committing to a fixed number of clusters in advance, it produces a tree of clusterings that records how groups merge as we relax our notion of proximity. This chapter develops the agglomerative (bottom up) approach in depth: the merge procedure, the linkage criteria that define distance between clusters, the dendrogram that visualizes the result, the problem of cutting the tree into a usable partition, and the computational complexity that governs how large a dataset the method can handle. ## 1. The Agglomerative Schema Agglomerative clustering begins with $n$ singleton clusters, one per data point, and repeatedly merges the two closest clusters until a single cluster containing all points remains. The output is not one clustering but a sequence of $n$ nested clusterings, indexed by the number of merges performed. ### 1.1 Formal Setup Let $X = \{x_1, \dots, x_n\}$ be a set of objects with a dissimilarity function $d(x_i, x_j) \geq 0$. We do not require $d$ to be a metric, although several linkage criteria behave best when it is. A clustering at level $k$ is a partition $\mathcal{C}_k = \{C_1, \dots, C_k\}$ of $X$. The agglomerative process produces partitions $\mathcal{C}_n, \mathcal{C}_{n-1}, \dots, \mathcal{C}_1$, where $\mathcal{C}_n$ assigns each point to its own cluster and $\mathcal{C}_1 = \{X\}$. The key ingredient is a linkage function $D(C_a, C_b)$ that extends the pointwise dissimilarity $d$ to a dissimilarity between clusters. Given $D$, the generic algorithm is: ```text initialize each point as its own cluster compute D for all cluster pairs while more than one cluster remains: find the pair (Ca, Cb) minimizing D merge Ca and Cb into a new cluster Cab record the merge and the merge height D(Ca, Cb) update D between Cab and all remaining clusters ``` The recorded merge heights are central. A well behaved linkage produces monotonically nondecreasing heights, meaning every merge occurs at a height at least as large as the merges that preceded it. This monotonicity is what makes the resulting tree drawable without crossings. ### 1.2 Why Nested Partitions Are Useful A flat method such as k-means returns a single partition and forces the analyst to fix $k$ beforehand. Hierarchical clustering defers that choice. The same computation yields a coarse two cluster view and a fine fifty cluster view simultaneously, which is valuable when the natural granularity is unknown or when the data genuinely has structure at multiple scales. Taxonomies, gene expression groupings, and document organization are classic settings where a nested view matches the domain better than a flat one. ## 2. Linkage Criteria The choice of linkage function $D$ determines the shape of the resulting tree more than any other decision. We describe the four most important criteria and their characteristic behavior. ### 2.1 Single Linkage Single linkage defines the distance between two clusters as the smallest pairwise distance across them: $$ D_{\text{single}}(C_a, C_b) = \min_{x \in C_a,\, y \in C_b} d(x, y). $$ Because only the nearest pair matters, single linkage can connect clusters through a chain of intermediate points. This gives it the ability to recover elongated or non convex shapes, and it is closely related to the minimum spanning tree of the data: the sequence of single linkage merges corresponds exactly to adding edges of the minimum spanning tree in increasing weight order. The drawback is the chaining effect, where a thin bridge of points links two otherwise distinct groups into one cluster, often undesirably. ### 2.2 Complete Linkage Complete linkage uses the largest pairwise distance: $$ D_{\text{complete}}(C_a, C_b) = \max_{x \in C_a,\, y \in C_b} d(x, y). $$ Here a merge is permitted only when every point in one cluster is reasonably close to every point in the other, since the controlling quantity is the worst case pair. The result is compact, roughly spherical clusters with small diameters. Complete linkage resists chaining but is sensitive to outliers, because a single distant point inflates the maximum and delays merges. ### 2.3 Average Linkage Average linkage, often called UPGMA (unweighted pair group method with arithmetic mean), averages over all cross cluster pairs: $$ D_{\text{average}}(C_a, C_b) = \frac{1}{|C_a|\,|C_b|} \sum_{x \in C_a} \sum_{y \in C_b} d(x, y). $$ This is a compromise between single and complete linkage. It is less prone to chaining than single linkage and less outlier sensitive than complete linkage, which makes it a common default for general purpose use, particularly in bioinformatics where UPGMA has a long history. ### 2.4 Ward Linkage Ward linkage takes a variance minimizing view. Rather than working directly with pairwise distances, it merges the pair of clusters that produces the smallest increase in total within cluster sum of squares. Define the sum of squared deviations of a cluster about its centroid $\mu_C$: $$ \text{ESS}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2. $$ The merge cost is the increase in total ESS, which has the closed form $$ \Delta(C_a, C_b) = \frac{|C_a|\,|C_b|}{|C_a| + |C_b|}\, \lVert \mu_{C_a} - \mu_{C_b} \rVert^2. $$ Ward tends to produce balanced, compact clusters of comparable size and is the linkage most analogous to k-means, since both minimize within cluster variance. It assumes a Euclidean geometry and should be paired with squared Euclidean distance for the closed form above to hold. ### 2.5 The Lance Williams Recurrence A practical reason these four criteria are grouped together is that all of them obey the Lance Williams update formula, which expresses the distance from a newly merged cluster $C_a \cup C_b$ to any other cluster $C_k$ in terms of distances already computed: $$ D(C_a \cup C_b, C_k) = \alpha_a D(C_a, C_k) + \alpha_b D(C_b, C_k) + \beta D(C_a, C_b) + \gamma \lvert D(C_a, C_k) - D(C_b, C_k) \rvert. $$ Each linkage corresponds to a particular setting of the coefficients. Writing $n_a = |C_a|$, $n_b = |C_b|$, $n_k = |C_k|$: | Linkage | $\alpha_a$ | $\alpha_b$ | $\beta$ | $\gamma$ | |-----------|-----------|-----------|---------|----------| | Single | $1/2$ | $1/2$ | $0$ | $-1/2$ | | Complete | $1/2$ | $1/2$ | $0$ | $+1/2$ | | Average | $\frac{n_a}{n_a+n_b}$ | $\frac{n_b}{n_a+n_b}$ | $0$ | $0$ | | Ward | $\frac{n_a+n_k}{n_a+n_b+n_k}$ | $\frac{n_b+n_k}{n_a+n_b+n_k}$ | $\frac{-n_k}{n_a+n_b+n_k}$ | $0$ | The recurrence is what makes an efficient implementation possible. Once we have the distance from each old cluster to its peers, the distance involving a merged cluster is a constant time arithmetic update rather than a fresh pass over the underlying points. ## 3. The Dendrogram The dendrogram is the canonical visualization of a hierarchical clustering. It is a rooted binary tree whose leaves are the $n$ data points and whose internal nodes are merges, drawn at a height equal to the linkage distance at which the merge occurred. ### 3.1 Reading a Dendrogram Each leaf sits at height zero. When two clusters merge, a horizontal bar connects their subtrees at the merge height, and the new combined subtree continues upward. The vertical axis therefore measures dissimilarity: short bars indicate that two groups joined while still very similar, and tall bars indicate that the groups were dissimilar when forced together. The horizontal ordering of leaves is not unique. At each internal node the two child subtrees can be swapped without changing the clustering, so the same dendrogram admits $2^{n-1}$ leaf orderings. Implementations often apply an ordering heuristic to place similar leaves near one another, but no semantic meaning should be read into left versus right placement beyond what the heights encode. ### 3.2 Monotonicity and Inversions A dendrogram is drawable without crossings only when merge heights are nondecreasing along every root to leaf path. Single, complete, average, and Ward linkage all guarantee this monotonicity. Some alternative criteria, notably centroid and median linkage, can violate it: a later merge may occur at a lower height than an earlier one, producing an inversion (also called a reversal) where a horizontal bar dips below the bars beneath it. Inversions make the tree hard to interpret and are one reason centroid and median linkage are used less often than the four criteria above. ### 3.3 Cophenetic Distance The dendrogram induces its own distance on the data. The cophenetic distance between two points is the height of the merge at which they first fall into the same cluster. Comparing this cophenetic distance to the original pairwise distance, via the cophenetic correlation coefficient, quantifies how faithfully the tree preserves the input dissimilarities. A high cophenetic correlation suggests the chosen linkage captured the data geometry well; a low one warns that the tree imposed structure the data does not support. ## 4. Cutting the Tree A dendrogram is a family of clusterings, not a single answer. To obtain a concrete partition we cut the tree, and the cutting strategy is where domain judgment re enters. ### 4.1 Cutting at a Fixed Height The simplest cut is a horizontal line at height $h$. Every branch crossed by the line becomes one cluster, gathering all leaves beneath that crossing. Lowering $h$ yields more, smaller clusters; raising it yields fewer, larger ones. A natural place to cut is just below a large vertical gap, since a tall bar means the two groups it joins were dissimilar and probably should stay separate. ```text # conceptual: clusters from a height cut labels = cut_tree_at_height(dendrogram, h=2.5) ``` ### 4.2 Cutting for a Fixed Number of Clusters Alternatively we fix the desired number of clusters $k$ and cut at the height that produces exactly $k$ branches. Because the merge heights are known, this is equivalent to stopping the agglomeration after $n - k$ merges. This is convenient when downstream requirements dictate a specific $k$, but it discards the height information that might suggest a more natural value. ### 4.3 Choosing Where to Cut Several heuristics guide the cut. The largest gap rule cuts across the tallest unbroken vertical span, on the theory that the biggest jump in merge height separates genuinely distinct structure. Internal validity indices such as the silhouette coefficient can be evaluated across candidate cuts, selecting the $k$ that maximizes cluster separation relative to cohesion. The inconsistency coefficient compares each merge height to the average and spread of heights below it, flagging merges that are anomalously tall as good cut points. No rule is universally correct; the cut encodes a decision about granularity that the data alone cannot make. ### 4.4 Non Horizontal Cuts Nothing requires the cut to be a single height. A more flexible scheme allows different branches to be cut at different heights, which is appropriate when clusters have genuinely different scales. Dynamic tree cutting methods do exactly this, examining the shape of each branch and cutting it where its internal structure suggests, rather than imposing one global threshold. This addresses a real weakness of horizontal cuts: a single height that resolves a dense region well may over split a sparse one, and vice versa. ## 5. Complexity The expressiveness of hierarchical clustering comes at a cost. Understanding that cost determines whether the method is feasible for a given dataset. ### 5.1 Time Complexity A naive implementation recomputes all pairwise cluster distances at every step, costing $O(n^3)$ time overall: there are $O(n)$ merges, and each scans an $O(n^2)$ distance structure. With the Lance Williams recurrence and a priority queue that maintains the nearest neighbor of each cluster, the cost drops to $O(n^2 \log n)$ for general linkages. For single linkage the minimum spanning tree connection yields an exact $O(n^2)$ algorithm. A more general technique, the nearest neighbor chain algorithm, achieves $O(n^2)$ time and $O(n)$ working memory for any linkage that satisfies a reducibility property, which complete, average, and Ward linkage all do. The nearest neighbor chain walks from cluster to cluster following nearest neighbor links until it finds a reciprocal nearest neighbor pair, merges that pair, and resumes, guaranteeing that each distance is examined a bounded number of times. The quadratic floor is fundamental for exact methods, because the algorithm must at minimum consider the $\binom{n}{2}$ pairwise dissimilarities. This makes exact hierarchical clustering impractical for datasets much beyond tens of thousands of points without approximation. ### 5.2 Space Complexity Storing the full pairwise distance matrix requires $O(n^2)$ memory, which is usually the binding constraint in practice. At $n = 10^5$ the matrix holds $5 \times 10^9$ entries, well beyond ordinary memory budgets. The nearest neighbor chain algorithm reduces working memory to $O(n)$ by recomputing distances on demand rather than materializing the matrix, trading arithmetic for storage, and is the standard route to scaling hierarchical clustering to larger $n$. ### 5.3 Practical Mitigations When $n$ is too large for exact methods, common strategies include clustering a representative sample and assigning the rest by nearest cluster, running a fast flat method such as k-means to produce a few thousand micro clusters and then applying hierarchical clustering to those, or restricting merges to a sparse neighborhood graph so that only local distances are ever considered. Each sacrifices some exactness for tractability, and the right choice depends on whether the priority is the full tree or only a partition near a particular granularity. ## 6. A Worked Derivation: Ward From the Lance Williams Update It is worth seeing exactly why the Ward merge cost reduces to the closed form quoted in Section 2.4 and how that closed form is consistent with the Lance Williams coefficients, since the two appear unrelated at first glance. Start from the error sum of squares of a cluster, $\text{ESS}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2$ with centroid $\mu_C = \frac{1}{|C|}\sum_{x \in C} x$. A standard identity rewrites this without reference to the centroid: $$ \text{ESS}(C) = \frac{1}{2|C|} \sum_{x \in C} \sum_{y \in C} \lVert x - y \rVert^2 . $$ When we merge $C_a$ and $C_b$, the change in total ESS is $\Delta = \text{ESS}(C_a \cup C_b) - \text{ESS}(C_a) - \text{ESS}(C_b)$. Substituting the centroid form of each term and simplifying, every within $C_a$ and within $C_b$ contribution cancels, leaving only the cross term, which collapses to $$ \Delta(C_a, C_b) = \frac{n_a n_b}{n_a + n_b} \, \lVert \mu_{C_a} - \mu_{C_b} \rVert^2 , $$ the closed form of Section 2.4. Now define the working dissimilarity that the algorithm actually stores as $D(C_a, C_b) = \Delta(C_a, C_b)$, a squared quantity. Expanding $\Delta(C_a \cup C_b, C_k)$ in terms of $\Delta(C_a, C_k)$, $\Delta(C_b, C_k)$ and $\Delta(C_a, C_b)$ and matching coefficients yields exactly the Ward row of the Lance Williams table: $$ \alpha_a = \frac{n_a + n_k}{n_a + n_b + n_k}, \quad \alpha_b = \frac{n_b + n_k}{n_a + n_b + n_k}, \quad \beta = \frac{-n_k}{n_a + n_b + n_k}, \quad \gamma = 0 . $$ This is why a correct Ward implementation runs the recurrence on squared distances and reports $\sqrt{\Delta}$ as the dendrogram height: the recurrence is exact only in the squared metric, while the displayed height is conventionally the Euclidean distance between the merged centroids scaled by the size factor. The reference implementation below follows precisely this rule, which is what lets it reproduce the SciPy linkage matrix to machine precision. ## 7. Reference Implementation The companion libraries ship a dependency-free agglomerative clustering routine in all three languages. Each returns a linkage matrix in the SciPy row format `[node_a, node_b, height, size]`: original points carry ids `0 .. n-1`, and the merge formed at step `t` is given id `n + t`. The same Lance Williams recurrence drives single, complete, average, and Ward linkage, and the three implementations agree to within `1e-9` on shared fixtures. Install the package for your language ([`aiinaction`](https://pypi.org) for Python, the `AIInAction` Julia package, and the `aiinaction` Rust crate) and the snippets below run as written. We cluster five points, three near the origin and two near $(10, 10)$, and read off the merge structure and a two cluster cut. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch125_agglomerative_clustering import ( linkage_matrix, fcluster, cophenetic_distances, ) points = [ [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], ] Z = linkage_matrix(points, linkage="ward") print("Ward linkage matrix [a, b, height, size]:") for row in Z: a, b, h, size = row print(f" {int(a)} + {int(b)} height={h:.4f} size={int(size)}") print("Two-cluster cut:", fcluster(Z, n_clusters=2)) coph = cophenetic_distances(Z) print(f"Cophenetic distance between point 0 and point 3: {coph[0][3]:.4f}") ``` The first two merges (heights of 1) pull together the two tight groups; the final merge, at a much larger height, joins them only because the algorithm must. The two cluster cut recovers the obvious split, and the large cophenetic distance between a point near the origin and a point near $(10, 10)$ confirms that the tree kept them apart until the very end. ## Julia ```julia using AIInAction.Ch125AgglomerativeClustering points = [ [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], ] Z = linkage_matrix(points, "ward") for row in Z println(" ", Int(row[1]), " + ", Int(row[2]), " height=", round(row[3], digits=4), " size=", Int(row[4])) end println("Two-cluster cut: ", fcluster(Z, 2)) # [0, 0, 0, 1, 1] ``` ## Rust ```rust use aiinaction::ch125_agglomerative_clustering::{linkage_matrix, fcluster}; fn main() { let points = vec![ vec![0.0, 0.0], vec![1.0, 0.0], vec![0.0, 1.0], vec![10.0, 10.0], vec![10.0, 11.0], ]; let z = linkage_matrix(&points, "ward").unwrap(); for row in &z { println!( " {} + {} height={:.4} size={}", row[0] as usize, row[1] as usize, row[2], row[3] as usize ); } // [0, 0, 0, 1, 1] println!("Two-cluster cut: {:?}", fcluster(&z, 2).unwrap()); } ``` ::: All three print the identical linkage matrix and the same `[0, 0, 0, 1, 1]` cut, which is exactly the parity the shared test fixtures enforce. ## 8. Practical Guidance A few recommendations consolidate the preceding material. Use Ward linkage with Euclidean distance as a default when clusters are expected to be compact and of similar size, since it behaves predictably and parallels k-means. Reach for single linkage only when elongated or irregular cluster shapes are expected, and watch for chaining. Prefer average or complete linkage when a balance between the extremes is wanted. Always inspect the dendrogram before cutting, since the merge heights reveal whether the data has clear structure or whether any cut would be arbitrary. Validate the chosen cut with an index such as the silhouette score or the cophenetic correlation rather than trusting a single visual impression. Finally, respect the quadratic cost: confirm that $n$ is small enough for the full distance matrix, and switch to sampling or micro clustering before it is not. ## References 1. Lance, G. N., and Williams, W. T. "A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems." The Computer Journal, 1967. https://doi.org/10.1093/comjnl/9.4.373 2. Ward, J. H. "Hierarchical Grouping to Optimize an Objective Function." Journal of the American Statistical Association, 1963. https://doi.org/10.1080/01621459.1963.10500845 3. Murtagh, F., and Contreras, P. "Algorithms for Hierarchical Clustering: An Overview." WIREs Data Mining and Knowledge Discovery, 2012. https://doi.org/10.1002/widm.53 4. Murtagh, F., and Legendre, P. "Ward's Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward's Criterion?" Journal of Classification, 2014. https://doi.org/10.1007/s00357-014-9161-z 5. Mullner, D. "Modern Hierarchical, Agglomerative Clustering Algorithms." arXiv preprint, 2011. https://arxiv.org/abs/1109.2378 6. Langfelder, P., Zhang, B., and Horvath, S. "Defining Clusters from a Hierarchical Cluster Tree: The Dynamic Tree Cut Package for R." Bioinformatics, 2008. https://doi.org/10.1093/bioinformatics/btm563 7. SciPy Documentation. "Hierarchical Clustering (scipy.cluster.hierarchy)." https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html 8. scikit-learn Documentation. "Hierarchical Clustering." https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering

Linkage	\(\alpha_a\)	\(\alpha_b\)	\(\beta\)	\(\gamma\)
Single	\(1/2\)	\(1/2\)	\(0\)	\(-1/2\)
Complete	\(1/2\)	\(1/2\)	\(0\)	\(+1/2\)
Average	\(\frac{n_a}{n_a+n_b}\)	\(\frac{n_b}{n_a+n_b}\)	\(0\)	\(0\)
Ward	\(\frac{n_a+n_k}{n_a+n_b+n_k}\)	\(\frac{n_b+n_k}{n_a+n_b+n_k}\)	\(\frac{-n_k}{n_a+n_b+n_k}\)	\(0\)