131  Divisive Hierarchical Clustering

Hierarchical clustering builds a nested family of partitions rather than a single flat assignment. Two opposite strategies generate that nesting. Agglomerative methods start from singletons and merge upward. Divisive methods start from one all encompassing cluster and split downward. This chapter develops the divisive view in depth, with the classical DIANA algorithm as the concrete anchor, and contrasts it carefully with the agglomerative tradition that dominates most textbooks and library defaults.

131.1 1. Top-Down Clustering as a Recursive Partition

131.1.1 1.1 The dendrogram and its two directions

Let \(X = \{x_1, \dots, x_n\}\) be a set of objects with a dissimilarity function \(d(x_i, x_j) \ge 0\). A hierarchical clustering is a sequence of nested partitions \(P_0, P_1, \dots, P_{n-1}\) where \(P_0 = \{X\}\) is the single cluster containing everything and \(P_{n-1} = \{\{x_1\}, \dots, \{x_n\}\}\) is the set of singletons. Nesting means every cluster at one level is either preserved or split as we move toward the singleton end. The dendrogram is the tree that records this structure, with heights encoding the dissimilarity at which each split or merge occurs.

Agglomerative clustering reads this tree from the leaves up. Divisive clustering reads it from the root down. The divisive recipe is simple to state:

DIVISIVE(X):
    tree = {root: X}
    while some leaf cluster has more than 1 object:
        pick a leaf cluster C to split
        partition C into C_left and C_right
        attach C_left and C_right as children of C
    return tree

The two design choices buried in this skeleton are which cluster to split next and how to split it. Both are harder than they look, and the second is the reason divisive methods are less common than agglomerative ones.

131.1.2 1.2 Why splitting is combinatorially hard

To split a cluster \(C\) of size \(m\) optimally, in the sense of minimizing within cluster dissimilarity of the two resulting parts, you must in principle examine all \(2^{m-1} - 1\) nontrivial bipartitions. This is exponential, so a brute force optimal split is infeasible beyond tiny clusters. Compare this with agglomerative merging, where at each step you only evaluate \(\binom{k}{2}\) candidate merges among the current \(k\) clusters and pick the closest pair. The merge decision is a low order polynomial search; the optimal split decision is exponential. Practical divisive algorithms therefore replace the exact split with a clever heuristic. DIANA is the most influential such heuristic.

131.2 2. DIANA: Divisive Analysis

131.2.1 2.1 Macnaughton-Smith splitting

DIANA, introduced by Kaufman and Rousseeuw, sidesteps the exponential search using a procedure due to Macnaughton-Smith and colleagues. The idea is to grow a splinter group one object at a time, always moving the object that is most out of place relative to the rest of its current cluster.

Define, for an object \(i\) currently in cluster \(C\) with a candidate splinter set \(S \subset C\), the average dissimilarity to the objects remaining in the main group \(C \setminus S\) and the average dissimilarity to the splinter group \(S\). The procedure is:

  1. Start with \(S = \varnothing\) and the full cluster as the main group. For each object \(i\), compute its average dissimilarity to all other objects in the cluster. Move the object with the largest average to start \(S\). This is the object least similar to the rest, a natural seed for a breakaway faction.

  2. For every remaining object \(i\) in the main group, compute \[ D(i) = \frac{1}{|C \setminus S| - 1} \sum_{j \in C \setminus S,\, j \ne i} d(i,j) \;-\; \frac{1}{|S|} \sum_{j \in S} d(i,j). \] The first term is the average distance of \(i\) to the objects that would stay; the second is the average distance to the splinter group.

  3. If \(\max_i D(i) > 0\), move the maximizing object into \(S\). A positive value means the object is on balance closer to the splinter than to the remainder, so it belongs with the breakaways. Repeat step 2.

  4. When \(\max_i D(i) \le 0\), no object prefers the splinter group anymore. Stop. The cluster is split into \(S\) and \(C \setminus S\).

The Macnaughton-Smith heuristic builds one good bipartition in \(O(m^2)\) work per cluster rather than exploring \(2^{m-1}\) of them. It is greedy and not guaranteed optimal, but in practice it produces sensible, well separated splits.

131.2.2 2.2 Choosing which cluster to split

Having a split routine, DIANA still must decide which existing leaf to divide next. The standard rule selects the cluster with the largest diameter, where the diameter is the maximum pairwise dissimilarity inside the cluster: \[ \operatorname{diam}(C) = \max_{i,j \in C} d(i,j). \] Splitting the widest cluster first tends to produce a dendrogram whose heights decrease monotonically as you descend, which keeps the tree readable and the merge heights interpretable. The full DIANA loop is:

DIANA(X):
    clusters = {X}
    while any cluster has size > 1:
        C = cluster with the largest diameter
        S = macnaughton_smith_split(C)   # the splinter group
        replace C with S and (C \ S)
        record split height = diam(C)
    return dendrogram

131.2.3 2.3 Complexity and scaling

A single Macnaughton-Smith split of an \(m\) object cluster costs \(O(m^2)\). Splitting all the way down to singletons, in a roughly balanced tree, sums to \(O(n^2)\) to \(O(n^2 \log n)\) depending on how the splits balance, and DIANA needs the full \(n \times n\) dissimilarity matrix in memory at \(O(n^2)\) space. This is comparable to standard agglomerative implementations, which are typically \(O(n^2 \log n)\) time and \(O(n^2)\) space, with \(O(n^2)\) achievable for specific linkages such as single linkage via the minimum spanning tree and complete or average linkage via the nearest neighbor chain algorithm. Neither family scales comfortably past tens of thousands of objects without approximation, sampling, or a switch to a flat method such as \(k\)-means.

131.3 3. Comparison With Agglomerative Clustering

131.3.1 3.1 Global versus local decisions

The deepest conceptual difference is the altitude at which the first important decision is made. An agglomerative algorithm’s very first action is to merge the two single closest objects, a maximally local decision. It commits to fine grained structure before it has any view of the whole. A divisive algorithm’s first action is to cut the entire dataset into two parts, a maximally global decision. It commits to the coarse structure first and refines later.

This matters because hierarchical clustering decisions are irrevocable. An object placed in the wrong cluster early is never reconsidered. Agglomerative errors accumulate from the bottom and can propagate upward into the macro structure. Divisive errors are made at the top with full information about the data, which is exactly where you most want the major boundaries to be correct. When the question of interest is the small number of large, top level groups, the divisive approach allocates its most informed decision to the question you care about.

131.3.2 3.2 Linkage and the chaining problem

Agglomerative results depend strongly on the linkage function that defines the distance between clusters. Single linkage uses the minimum inter cluster distance and is prone to chaining, where a thin bridge of intermediate points fuses two otherwise distinct clusters. Complete linkage uses the maximum and tends to produce compact, equal sized clusters that can fracture elongated ones. Average linkage and Ward’s minimum variance criterion sit between these extremes. The practitioner must choose, and the choice changes the answer.

Divisive methods have their own analogous knob in the splitting criterion, but DIANA’s average dissimilarity based split is less vulnerable to single linkage style chaining, because a thin bridge does not change average dissimilarities much. The diameter based selection rule also directly targets the loosest cluster, which aligns the algorithm with the intuition that the biggest, most heterogeneous group is the one most in need of division.

131.3.3 3.3 A summary contrast

Aspect Agglomerative Divisive (DIANA)
Direction Bottom up, merge Top down, split
First decision Closest pair, local Whole set bipartition, global
Split or merge search \(\binom{k}{2}\) merges Heuristic, avoids \(2^{m-1}\)
Typical time \(O(n^2 \log n)\) \(O(n^2)\) to \(O(n^2 \log n)\)
Tuning knob Linkage choice Splitting and selection rule
Strength Fine local structure Correct coarse structure
Library support Ubiquitous Sparse

131.3.4 3.4 They do not produce the same tree

It is tempting to assume the two directions converge on the same dendrogram. They do not. Agglomerative and divisive procedures generally yield different hierarchies on the same data, because they optimize different things at different scales. There is no reason a greedy bottom up merge sequence should reconstruct the splits a greedy top down procedure would make. Treat them as distinct tools, not as two implementations of one idea.

131.4 4. A Practical Worked Sketch

Consider a small set of one dimensional values to make the splinter mechanism concrete: \(\{1, 2, 3, 20, 21, 22\}\) with \(d\) the absolute difference. DIANA begins with the whole set. Average dissimilarities to the rest are largest for the extreme points \(1\) and \(22\). Suppose \(22\) seeds the splinter. Then \(D(i)\) is positive for \(21\) and \(20\), which join the splinter, and negative for \(1, 2, 3\), which prefer to stay. The split is \(\{20, 21, 22\}\) against \(\{1, 2, 3\}\), the obvious answer, recovered in one pass without enumerating bipartitions. Each of those clusters is then split again into singletons by the same logic, producing a clean two level structure that mirrors the data’s geometry.

The lesson generalizes. When the data contains a small number of well separated macro groups, the first divisive cut tends to find the correct macro boundary immediately, because the seed and the positive \(D(i)\) objects naturally coalesce around the largest gap.

# Scikit-learn ships agglomerative clustering but not divisive.
# DIANA is available in R via cluster::diana, and in Python via
# third-party packages. The call shape is typically:
#   model = Diana(metric="euclidean")
#   labels = model.fit_predict(X)         # then cut the tree at k
# Always inspect the dendrogram heights before choosing a cut level.

131.5 5. When Divisive Clustering Helps

131.5.1 5.1 Few clusters, read from the top

If you only need a coarse partition, say two to ten clusters, divisive clustering is attractive because you can stop after a handful of splits. You never pay to build the entire tree down to singletons. Agglomerative methods, by contrast, conceptually start at the singleton end and must do most of their work before the top level structure emerges, although early stopping at the merge end is also possible. When the analysis question is “what are the few big groups,” top down division spends its budget and its best information exactly there.

131.5.2 5.2 Robustness to local noise and chaining

Because the early divisive decisions average over many objects, they are comparatively insensitive to individual outliers and thin connecting bridges. Single linkage agglomeration can be derailed by a single chain of intermediate points; DIANA’s averaged criterion shrugs off such bridges. In domains where measurement noise creates spurious near neighbors, the global character of the first cuts is a genuine advantage.

131.5.3 5.3 Interpretability of the major divisions

In taxonomy, document organization, and market segmentation, the top splits are often the deliverable. A divisive tree presents these as explicit, early, well justified bipartitions, each annotated with the splinter group that broke away. This narrative, “the population first divided along this axis, then each part subdivided,” is frequently more natural to explain to stakeholders than a long sequence of pairwise merges.

131.5.4 5.4 When to prefer agglomerative instead

Divisive clustering is not a universal upgrade. Prefer agglomerative methods when the fine grained structure matters most, when you want Ward’s variance minimizing behavior, when you need single linkage to trace genuinely elongated or manifold shaped clusters, or simply when you need a mature, well tested implementation. The agglomerative ecosystem is far richer, and for many flat clustering needs neither hierarchical family is the right tool; \(k\)-means, DBSCAN, or spectral methods may serve better. Choose divisive when the coarse, top level partition is the prize and you want your most informed decision spent on it.

131.6 6. Summary

Divisive hierarchical clustering inverts the familiar bottom up recipe, starting from one cluster and recursively splitting. The exponential cost of optimal splitting is tamed by heuristics, most notably the Macnaughton-Smith splinter procedure at the heart of DIANA, which grows a breakaway group object by object using averaged dissimilarities and selects the widest cluster to divide next. Relative to agglomerative clustering, the divisive approach makes its most consequential decisions early and globally, with full data in view, making it well suited when the goal is a small number of correct, interpretable, noise robust top level groups. It is underused mostly for practical reasons of library availability rather than any deep deficiency, and it deserves a place in the analyst’s toolkit alongside its more popular agglomerative cousin.

131.7 References

  1. Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990. https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316801
  2. Macnaughton-Smith, P., Williams, W. T., Dale, M. B., and Mockett, L. G. “Dissimilarity Analysis: A New Technique of Hierarchical Sub-Division.” Nature, 202, 1964. https://www.nature.com/articles/2021034a0
  3. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd ed., Chapter 14. Springer, 2009. https://hastie.su.domains/ElemStatLearn/
  4. cluster package documentation, diana function reference, CRAN. https://cran.r-project.org/web/packages/cluster/cluster.pdf
  5. Murtagh, F. and Contreras, P. “Algorithms for Hierarchical Clustering: An Overview.” WIREs Data Mining and Knowledge Discovery, 2012. https://onlinelibrary.wiley.com/doi/10.1002/widm.53
  6. scikit-learn developers. “Hierarchical Clustering” user guide. https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering