133  Density-Based Clustering with HDBSCAN and OPTICS

Density-based clustering treats clusters as regions of high observation density separated by regions of low density. This view, made popular by DBSCAN, frees the analyst from specifying the number of clusters in advance and allows clusters of arbitrary shape. Yet plain DBSCAN carries a fragile assumption: a single global density threshold governs the entire dataset. When clusters occur at genuinely different densities, no single threshold recovers all of them. OPTICS and HDBSCAN are the two principled answers to this problem. Both replace the flat threshold with a hierarchical, multi-scale notion of density. This chapter develops the reachability formalism behind OPTICS, the condensed-tree and stability formalism behind HDBSCAN, and the practical judgment needed to deploy them.

133.1 1. From a Global Threshold to a Density Hierarchy

133.1.1 1.1 The DBSCAN baseline and its failure mode

DBSCAN is parameterized by a radius \(\varepsilon\) and a minimum count minPts. A point \(p\) is a core point if its \(\varepsilon\)-neighborhood contains at least minPts points. Clusters grow by chaining core points that are within \(\varepsilon\) of one another, and points that fall in no cluster are labeled noise. The procedure is elegant, but the radius \(\varepsilon\) fixes one density scale for the whole space.

Consider two Gaussian blobs, one tight and one diffuse, plus background noise. A small \(\varepsilon\) tuned to the tight blob shatters the diffuse blob into fragments and noise. A large \(\varepsilon\) tuned to the diffuse blob merges the tight blob with nearby background. There is no \(\varepsilon\) that recovers both. This variable-density problem motivates a hierarchy: rather than commit to one density level, examine the cluster structure across all density levels at once.

133.1.2 1.2 Core distance and mutual reachability

Both algorithms begin with the same local density estimate. For a point \(p\) and a parameter \(k\) (the analog of minPts, often written min_samples), the core distance is the distance to the \(k\)-th nearest neighbor:

\[d_{\text{core}}(p) = \text{distance to the } k\text{-th nearest neighbor of } p.\]

A small core distance means \(p\) sits in a dense region. To compare points fairly, define the mutual reachability distance between \(p\) and \(q\):

\[d_{\text{mreach}}(p, q) = \max\big(d_{\text{core}}(p),\, d_{\text{core}}(q),\, d(p, q)\big).\]

This symmetric quantity inflates the raw distance whenever either endpoint lies in a sparse region. Its effect is to push sparse points away from one another while leaving dense points connected at their true separation. Mutual reachability is the metric on which the density hierarchy is built, and it is the single most important construct in this chapter.

133.2 2. OPTICS and the Reachability Plot

133.2.1 2.1 The ordering algorithm

OPTICS (Ordering Points To Identify the Clustering Structure) does not produce clusters directly. Instead it produces an ordering of the points together with a reachability distance for each, from which clusterings at every density level can be read off.

The algorithm processes points in a priority queue keyed by reachability distance. Starting from an arbitrary unprocessed point, it computes the core distance, then for each neighbor updates a tentative reachability distance equal to the mutual reachability to the current core. It always expands the point with the smallest current reachability, which causes the walk to stay inside a dense region until that region is exhausted before jumping across a sparse gap.

seed an unprocessed point, output it with reachability = undefined
maintain a priority queue of (reachability, point)
repeatedly:
  pop the point with smallest reachability, mark processed, output it
  if it is a core point, update queue reachabilities of its neighbors

The output is a sequence \(o_1, o_2, \ldots, o_n\) with reachability values \(r_1, r_2, \ldots, r_n\). Crucially, the parameter \(\varepsilon\) in OPTICS is only an upper bound used to prune the search for efficiency. It does not commit the analysis to one density scale.

133.2.2 2.2 Reading the reachability plot

Plotting the reachability values \(r_i\) in processing order produces the reachability plot, the signature visualization of OPTICS. The horizontal axis is the ordering; the vertical axis is reachability distance. Valleys correspond to clusters: while the walk stays in a dense region, reachability stays low. Peaks correspond to the sparse boundaries crossed when the walk jumps from one cluster to the next.

A nested valley, a small dip inside a larger dip, reveals a subcluster embedded in a coarser cluster. This is precisely the variable-density structure that defeats DBSCAN. To extract a flat DBSCAN-equivalent clustering, draw a horizontal cut at height \(\varepsilon'\) and treat each contiguous run below the line as a cluster. Different cut heights yield different clusterings, and the plot lets the analyst choose visually.

133.2.3 2.3 The xi method for variable-density extraction

A horizontal cut still imposes one global level. The \(\xi\) (xi) extraction method instead looks for steep relative changes in reachability. A downward region where reachability drops by at least a factor \(1 - \xi\) marks a cluster start, and a matching upward region marks its end. By keying on the steepness of transitions rather than their absolute height, \(\xi\) extraction adapts the effective threshold to each valley and recovers clusters at multiple densities from a single run. This is the OPTICS counterpart to what HDBSCAN achieves through stability.

133.3 3. HDBSCAN: Hierarchical Density-Based Clustering

HDBSCAN (Hierarchical DBSCAN) takes the hierarchical idea further. Rather than leave extraction to a visual cut, it builds the full hierarchy, condenses it, and selects clusters by an explicit optimization over stability. The pipeline has four stages.

133.3.1 3.1 Stage one: mutual reachability graph and minimum spanning tree

Treat the points as nodes of a complete weighted graph whose edge weights are the mutual reachability distances of Section 1.2. Compute its minimum spanning tree (MST). The MST is the cheapest set of edges connecting all points under the mutual reachability metric, and it encodes how the dataset knits together as the density threshold is relaxed.

weight(p, q) = d_mreach(p, q)
T = minimum_spanning_tree(complete_graph)

133.3.2 3.2 Stage two: the single-linkage hierarchy

Sort the MST edges by increasing weight and add them one at a time, merging the components they connect. This is single-linkage agglomerative clustering on the mutual reachability metric. The result is a dendrogram: at the bottom every point is its own cluster, and as the threshold \(\lambda\) rises (equivalently, as the distance threshold rises) components merge, until at the top everything is one cluster.

It is convenient to work in the inverse-distance scale \(\lambda = 1 / d_{\text{mreach}}\). A high \(\lambda\) corresponds to a strict, high-density threshold; a low \(\lambda\) to a permissive one. Each cluster in the hierarchy is born at some \(\lambda_{\text{birth}}\) and, if it splits or dissolves, dies at some \(\lambda_{\text{death}}\).

133.3.3 3.3 Stage three: condensing the tree

The raw single-linkage dendrogram has \(n - 1\) internal nodes, most of them spurious: a single point peeling off a cluster is not a meaningful split. HDBSCAN condenses the tree using the min_cluster_size parameter. Walking down from the root, at each split it asks whether each of the two children has at least min_cluster_size points.

  • If both children are large enough, this is a true split into two clusters.
  • If only one child is large enough, the smaller child is treated as points falling out of the parent, and the parent persists as the same cluster at a lower \(\lambda\).

The outcome is the condensed tree, a much smaller hierarchy whose nodes are genuine clusters, each spanning a range of \(\lambda\) and gradually shedding points as \(\lambda\) increases.

133.3.4 3.4 Stage four: stability and flat extraction

For each cluster \(C\) in the condensed tree, define its stability as the total persistence of its members across the \(\lambda\) range during which they belong to \(C\):

\[S(C) = \sum_{p \in C} \big(\lambda_p - \lambda_{\text{birth}}(C)\big),\]

where \(\lambda_p\) is the value at which point \(p\) leaves \(C\) (by falling out or by the cluster splitting) and \(\lambda_{\text{birth}}(C)\) is the value at which \(C\) first appears. A cluster that persists over a wide \(\lambda\) range with many members accumulates high stability. A cluster that fleetingly appears and immediately splits scores low.

Extraction is then a constrained optimization over the condensed tree. Process clusters from the leaves upward. If the sum of the stabilities of a cluster’s children exceeds the cluster’s own stability, keep the children and set the parent stability to that sum. Otherwise select the parent and discard the subtree beneath it. The selected clusters form an antichain: no selected cluster is an ancestor of another, so the result is a valid flat partition with a noise label for unselected points. This rule, the Excess of Mass criterion, lets HDBSCAN choose clusters at different density levels in different parts of the space, which is exactly what flat DBSCAN cannot do.

133.3.5 3.5 Soft clustering and outlier scores

Because each point has a \(\lambda\) at which it leaves its cluster, HDBSCAN yields more than hard labels. The membership strength of a point can be derived from how deep into its cluster’s \(\lambda\) range it survives, giving soft cluster probabilities. The library also exposes GLOSH outlier scores, which compare a point’s \(\lambda\) at departure to the maximum \(\lambda\) reached by its cluster, flagging points that detach early as likely outliers. These scores are valuable in anomaly detection, where the question is not which cluster a point belongs to but how firmly it belongs anywhere.

133.4 4. Parameters, Tuning, and Diagnostics

133.4.1 4.1 The two governing parameters

HDBSCAN exposes two parameters with distinct roles.

  • min_cluster_size is the smallest group of points you are willing to call a cluster. It controls the granularity of the condensed tree and is the parameter to reach for first. Raising it suppresses small clusters and pushes their points toward larger clusters or noise.
  • min_samples (the \(k\) of the core distance) controls how conservative the density estimate is. Larger values make core distances larger in sparse regions, which expands what counts as noise and yields tighter, more conservative clusters. When unset it defaults to min_cluster_size.

A useful mental model: min_cluster_size shapes what survives in the hierarchy, while min_samples shapes how aggressively the algorithm calls borderline points noise.

import hdbscan
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    cluster_selection_method="eom",
)
labels = clusterer.fit_predict(X)

133.4.2 4.2 Leaf versus excess-of-mass selection

The default Excess of Mass method favors larger, more stable clusters and can swallow fine substructure under a single broad cluster. Setting cluster_selection_method="leaf" instead selects the leaf clusters of the condensed tree, producing more numerous and homogeneous clusters. A related knob, cluster_selection_epsilon, merges clusters whose separation falls below a distance floor, which is helpful when EOM oversplits a region you regard as one group.

133.4.3 4.3 Validating density clusterings

Density clusters are non-convex, so centroid-based indices such as silhouette can mislead. The Density-Based Clustering Validation index (DBCV) is designed for this setting: it scores a clustering by contrasting within-cluster density against between-cluster separation using the same density-connectivity ideas, and it can fairly compare partitions with noise and arbitrary shapes. When ground truth exists, the Adjusted Rand Index or Adjusted Mutual Information remain appropriate. Always inspect the fraction of points labeled noise; an unexpectedly large noise fraction usually signals that min_samples is too high or the feature scaling is off.

133.4.4 4.4 Metric choice and dimensionality

Mutual reachability inherits whatever base metric you supply, so feature scaling and metric choice matter as much as the algorithm. In high dimensions, distance concentration erodes the contrast between core and non-core points, and density estimates degrade. A common and effective pipeline reduces dimensionality with UMAP, which preserves local density structure better than PCA for this purpose, and then runs HDBSCAN in the reduced space. This combination underlies many modern document and embedding clustering systems, including topic-modeling pipelines built on transformer embeddings.

133.5 5. Choosing Between OPTICS and HDBSCAN

The two methods share a foundation but differ in emphasis. OPTICS produces the reachability plot, an interpretable artifact that exposes the entire density structure for human inspection; this makes it attractive for exploratory analysis where the analyst wants to see and reason about nested structure before committing to a clustering. HDBSCAN automates the extraction step through stability optimization, producing a defensible flat clustering, soft memberships, and outlier scores with minimal manual judgment; this makes it attractive for production pipelines and for embedding in larger systems.

In terms of cost, both are dominated by nearest-neighbor computation. Naive implementations are \(O(n^2)\), but spatial index acceleration brings the practical complexity closer to \(O(n \log n)\) on low-dimensional or well-structured data, and HDBSCAN’s reference implementation uses space-partitioning trees and a boruvka-style MST to scale to hundreds of thousands of points. For very large datasets, approximate nearest-neighbor variants extend the reach further.

A practical default: when you need an interpretable map of multi-scale structure, reach for OPTICS and study its reachability plot. When you need an automatic, robust clustering with membership and outlier information, reach for HDBSCAN. In both cases the conceptual payoff is the same. By replacing a single global density threshold with a hierarchy over mutual reachability, these methods recover clusters of varying density and arbitrary shape that flat DBSCAN cannot, and they make the act of choosing a density level an explicit, inspectable decision rather than a hidden assumption.

133.6 6. Summary

Variable-density clustering requires abandoning the single-threshold view. The core-distance and mutual-reachability constructs turn local density into a metric on which a full hierarchy can be built. OPTICS linearizes that hierarchy into a reachability plot, where valleys are clusters and the \(\xi\) method extracts variable-density partitions. HDBSCAN materializes the hierarchy as a condensed tree and selects clusters by maximizing stability, then offers soft memberships and outlier scores as a bonus. Mastery of these methods rests on two ideas worth repeating: mutual reachability is the metric that equalizes density across the space, and cluster stability is the criterion that lets different parts of the space be clustered at different densities within one coherent result.

133.7 References

  1. Ankerst, M., Breunig, M., Kriegel, H.-P., and Sander, J. “OPTICS: Ordering Points To Identify the Clustering Structure.” ACM SIGMOD, 1999. https://dl.acm.org/doi/10.1145/304182.304187
  2. Campello, R. J. G. B., Moulavi, D., and Sander, J. “Density-Based Clustering Based on Hierarchical Density Estimates.” PAKDD, 2013. https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14
  3. Campello, R. J. G. B., Moulavi, D., Zimek, A., and Sander, J. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” ACM TKDD, 2015. https://dl.acm.org/doi/10.1145/2733381
  4. McInnes, L., Healy, J., and Astels, S. “hdbscan: Hierarchical density based clustering.” Journal of Open Source Software, 2017. https://joss.theoj.org/papers/10.21105/joss.00205
  5. Moulavi, D., Jaskowiak, P. A., Campello, R. J. G. B., Zimek, A., and Sander, J. “Density-Based Clustering Validation.” SIAM International Conference on Data Mining, 2014. https://epubs.siam.org/doi/10.1137/1.9781611973440.96
  6. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” KDD, 1996. https://dl.acm.org/doi/10.5555/3001460.3001507
  7. McInnes, L., and Healy, J. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” 2018. https://arxiv.org/abs/1802.03426
  8. The HDBSCAN Clustering Library documentation. https://hdbscan.readthedocs.io/