133 Density-Based Clustering with HDBSCAN and OPTICS

Density-based clustering treats clusters as regions of high observation density separated by regions of low density. This view, made popular by DBSCAN, frees the analyst from specifying the number of clusters in advance and allows clusters of arbitrary shape. Yet plain DBSCAN carries a fragile assumption: a single global density threshold governs the entire dataset. When clusters occur at genuinely different densities, no single threshold recovers all of them. OPTICS and HDBSCAN are the two principled answers to this problem. Both replace the flat threshold with a hierarchical, multi-scale notion of density. This chapter develops the reachability formalism behind OPTICS, the condensed-tree and stability formalism behind HDBSCAN, and the practical judgment needed to deploy them. It then demonstrates both algorithms with mature open-source tooling rather than a hand-rolled reimplementation, because for density clustering the correctness of the spatial-index and minimum-spanning-tree machinery is exactly what you do not want to get subtly wrong.

133.1 1. From a Global Threshold to a Density Hierarchy

133.1.1 1.1 The DBSCAN baseline and its failure mode

DBSCAN is parameterized by a radius $\varepsilon$ and a minimum count minPts. A point $p$ is a core point if its $\varepsilon$-neighborhood contains at least minPts points. Clusters grow by chaining core points that are within $\varepsilon$ of one another, and points that fall in no cluster are labeled noise. The procedure is elegant, but the radius $\varepsilon$ fixes one density scale for the whole space.

Consider two Gaussian blobs, one tight and one diffuse, plus background noise. A small $\varepsilon$ tuned to the tight blob shatters the diffuse blob into fragments and noise. A large $\varepsilon$ tuned to the diffuse blob merges the tight blob with nearby background. There is no $\varepsilon$ that recovers both. This variable-density problem motivates a hierarchy: rather than commit to one density level, examine the cluster structure across all density levels at once.

133.1.2 1.2 Core distance and mutual reachability

Both algorithms begin with the same local density estimate. For a point $p$ and a parameter $k$ (the analog of minPts, often written min_samples), the core distance is the distance to the $k$-th nearest neighbor:

\[d_{\text{core}}(p) = \text{distance from } p \text{ to its } k\text{-th nearest neighbor}.\]

A small core distance means $p$ sits in a dense region. To compare points fairly, define the mutual reachability distance between $p$ and $q$:

\[d_{\text{mreach}}(p, q) = \max\big(d_{\text{core}}(p),\, d_{\text{core}}(q),\, d(p, q)\big).\]

This symmetric quantity inflates the raw distance whenever either endpoint lies in a sparse region. Its effect is to push sparse points away from one another while leaving dense points connected at their true separation: two points sitting in the dense core keep their actual distance $d(p,q)$, but a point in a sparse halo is held at least its (large) core distance away from everything. Mutual reachability is the metric on which the density hierarchy is built, and it is the single most important construct in this chapter. Note that it is not a true metric, since the triangle inequality can fail, but it is symmetric and non-negative, which is all the spanning-tree construction needs.

133.2 2. OPTICS and the Reachability Plot

133.2.1 2.1 The ordering algorithm

OPTICS (Ordering Points To Identify the Clustering Structure) does not produce clusters directly. Instead it produces an ordering of the points together with a reachability distance for each, from which clusterings at every density level can be read off.

The algorithm processes points in a priority queue keyed by reachability distance. Starting from an arbitrary unprocessed point, it computes the core distance, then for each neighbor updates a tentative reachability distance equal to the mutual reachability to the current core. It always expands the point with the smallest current reachability, which causes the walk to stay inside a dense region until that region is exhausted before jumping across a sparse gap.

seed an unprocessed point, output it with reachability = undefined
maintain a priority queue of (reachability, point)
repeatedly:
  pop the point with smallest reachability, mark processed, output it
  if it is a core point, update queue reachabilities of its neighbors

The output is a sequence $o_1, o_2, \ldots, o_n$ with reachability values $r_1, r_2, \ldots, r_n$. Crucially, the parameter $\varepsilon$ in OPTICS is only an upper bound used to prune the search for efficiency. It does not commit the analysis to one density scale; setting it to infinity simply makes the algorithm exact at higher cost.

133.2.2 2.2 Reading the reachability plot

Plotting the reachability values $r_i$ in processing order produces the reachability plot, the signature visualization of OPTICS. The horizontal axis is the ordering; the vertical axis is reachability distance. Valleys correspond to clusters: while the walk stays in a dense region, reachability stays low. Peaks correspond to the sparse boundaries crossed when the walk jumps from one cluster to the next.

A nested valley, a small dip inside a larger dip, reveals a subcluster embedded in a coarser cluster. This is precisely the variable-density structure that defeats DBSCAN. To extract a flat DBSCAN-equivalent clustering, draw a horizontal cut at height $\varepsilon'$ and treat each contiguous run below the line as a cluster. Different cut heights yield different clusterings, and the plot lets the analyst choose visually. This horizontal-cut extraction is exactly equivalent to running DBSCAN with that $\varepsilon'$, which is why a single OPTICS run can replay every DBSCAN result without recomputation.

133.2.3 2.3 The xi method for variable-density extraction

A horizontal cut still imposes one global level. The $\xi$ (xi) extraction method instead looks for steep relative changes in reachability. A downward region where reachability drops by at least a factor $1 - \xi$ marks a cluster start, and a matching upward region marks its end. By keying on the steepness of transitions rather than their absolute height, $\xi$ extraction adapts the effective threshold to each valley and recovers clusters at multiple densities from a single run. This is the OPTICS counterpart to what HDBSCAN achieves through stability. In practice $\xi$ is delicate: it reacts to the relative slope of the reachability curve, so a value that cleanly separates well-isolated blobs can over-fragment or merge clusters whose densities differ only mildly. When the blobs touch or overlap, a plain DBSCAN-style horizontal cut on the same OPTICS ordering is often the steadier choice, a contrast the worked example below makes concrete.

133.3 3. HDBSCAN: Hierarchical Density-Based Clustering

HDBSCAN (Hierarchical DBSCAN) takes the hierarchical idea further. Rather than leave extraction to a visual cut, it builds the full hierarchy, condenses it, and selects clusters by an explicit optimization over stability. The pipeline has four stages.

133.3.1 3.1 Stage one: mutual reachability graph and minimum spanning tree

Treat the points as nodes of a complete weighted graph whose edge weights are the mutual reachability distances of Section 1.2. Compute its minimum spanning tree (MST). The MST is the cheapest set of edges connecting all points under the mutual reachability metric, and it encodes how the dataset knits together as the density threshold is relaxed.

weight(p, q) = d_mreach(p, q)
T = minimum_spanning_tree(complete_graph)

The complete graph is never materialized in a good implementation. A dual-tree Boruvka algorithm over space-partitioning trees finds the MST while touching only a small fraction of the $\binom{n}{2}$ possible edges, which is what makes HDBSCAN tractable on large data.

133.3.2 3.2 Stage two: the single-linkage hierarchy

Sort the MST edges by increasing weight and add them one at a time, merging the components they connect. This is single-linkage agglomerative clustering on the mutual reachability metric. The result is a dendrogram: at the bottom every point is its own cluster, and as the threshold rises (equivalently, as the distance threshold rises) components merge, until at the top everything is one cluster.

It is convenient to work in the inverse-distance scale $\lambda = 1 / d_{\text{mreach}}$. A high $\lambda$ corresponds to a strict, high-density threshold; a low $\lambda$ to a permissive one. Each cluster in the hierarchy is born at some $\lambda_{\text{birth}}$ and, if it splits or dissolves, dies at some $\lambda_{\text{death}}$.

133.3.3 3.3 Stage three: condensing the tree

The raw single-linkage dendrogram has $n - 1$ internal nodes, most of them spurious: a single point peeling off a cluster is not a meaningful split. HDBSCAN condenses the tree using the min_cluster_size parameter. Walking down from the root, at each split it asks whether each of the two children has at least min_cluster_size points.

If both children are large enough, this is a true split into two clusters.
If only one child is large enough, the smaller child is treated as points falling out of the parent, and the parent persists as the same cluster at a lower $\lambda$.

The outcome is the condensed tree, a much smaller hierarchy whose nodes are genuine clusters, each spanning a range of $\lambda$ and gradually shedding points as $\lambda$ increases.

133.3.4 3.4 Stage four: stability and flat extraction

For each cluster $C$ in the condensed tree, define its stability as the total persistence of its members across the $\lambda$ range during which they belong to $C$:

\[S(C) = \sum_{p \in C} \big(\lambda_p - \lambda_{\text{birth}}(C)\big),\]

where $\lambda_p$ is the value at which point $p$ leaves $C$ (by falling out or by the cluster splitting) and $\lambda_{\text{birth}}(C)$ is the value at which $C$ first appears. A cluster that persists over a wide $\lambda$ range with many members accumulates high stability. A cluster that fleetingly appears and immediately splits scores low.

Extraction is then a constrained optimization over the condensed tree. Process clusters from the leaves upward. If the sum of the stabilities of a cluster’s children exceeds the cluster’s own stability, keep the children and set the parent stability to that sum. Otherwise select the parent and discard the subtree beneath it. The selected clusters form an antichain: no selected cluster is an ancestor of another, so the result is a valid flat partition with a noise label for unselected points. This rule, the Excess of Mass criterion, lets HDBSCAN choose clusters at different density levels in different parts of the space, which is exactly what flat DBSCAN cannot do.

133.3.5 3.5 Soft clustering and outlier scores

Because each point has a $\lambda$ at which it leaves its cluster, HDBSCAN yields more than hard labels. The membership strength of a point can be derived from how deep into its cluster’s $\lambda$ range it survives, giving soft cluster probabilities. The reference library also exposes GLOSH outlier scores, which compare a point’s $\lambda$ at departure to the maximum $\lambda$ reached by its cluster, flagging points that detach early as likely outliers. These scores are valuable in anomaly detection, where the question is not which cluster a point belongs to but how firmly it belongs anywhere.

133.4 4. Parameters, Tuning, and Diagnostics

133.4.1 4.1 The two governing parameters

HDBSCAN exposes two parameters with distinct roles.

min_cluster_size is the smallest group of points you are willing to call a cluster. It controls the granularity of the condensed tree and is the parameter to reach for first. Raising it suppresses small clusters and pushes their points toward larger clusters or noise.
min_samples (the $k$ of the core distance) controls how conservative the density estimate is. Larger values make core distances larger in sparse regions, which expands what counts as noise and yields tighter, more conservative clusters. When unset it defaults to min_cluster_size.

A useful mental model: min_cluster_size shapes what survives in the hierarchy, while min_samples shapes how aggressively the algorithm calls borderline points noise.

133.4.2 4.2 Leaf versus excess-of-mass selection

The default Excess of Mass method favors larger, more stable clusters and can swallow fine substructure under a single broad cluster. Setting cluster_selection_method="leaf" instead selects the leaf clusters of the condensed tree, producing more numerous and homogeneous clusters. A related knob, cluster_selection_epsilon, merges clusters whose separation falls below a distance floor, which is helpful when EOM oversplits a region you regard as one group.

133.4.3 4.3 Validating density clusterings

Density clusters are non-convex, so centroid-based indices such as silhouette can mislead. The Density-Based Clustering Validation index (DBCV) is designed for this setting: it scores a clustering by contrasting within-cluster density against between-cluster separation using the same density-connectivity ideas, and it can fairly compare partitions with noise and arbitrary shapes. When ground truth exists, the Adjusted Rand Index or Adjusted Mutual Information remain appropriate. Always inspect the fraction of points labeled noise; an unexpectedly large noise fraction usually signals that min_samples is too high or the feature scaling is off.

133.4.4 4.4 Metric choice and dimensionality

Mutual reachability inherits whatever base metric you supply, so feature scaling and metric choice matter as much as the algorithm. In high dimensions, distance concentration erodes the contrast between core and non-core points, and density estimates degrade. A common and effective pipeline reduces dimensionality with UMAP, which preserves local density structure better than PCA for this purpose, and then runs HDBSCAN in the reduced space. This combination underlies many modern document and embedding clustering systems, including topic-modeling pipelines built on transformer embeddings.

133.5 5. Choosing Between OPTICS and HDBSCAN

The two methods share a foundation but differ in emphasis. OPTICS produces the reachability plot, an interpretable artifact that exposes the entire density structure for human inspection; this makes it attractive for exploratory analysis where the analyst wants to see and reason about nested structure before committing to a clustering. HDBSCAN automates the extraction step through stability optimization, producing a defensible flat clustering, soft memberships, and outlier scores with minimal manual judgment; this makes it attractive for production pipelines and for embedding in larger systems.

In terms of cost, both are dominated by nearest-neighbor computation. Naive implementations are $O(n^2)$, but spatial index acceleration brings the practical complexity closer to $O(n \log n)$ on low-dimensional or well-structured data, and HDBSCAN’s reference implementation uses space-partitioning trees and a Boruvka-style MST to scale to hundreds of thousands of points. For very large datasets, approximate nearest-neighbor variants extend the reach further. Memory is the other constraint: OPTICS stores the ordering and reachability arrays in $O(n)$, while a naive mutual-reachability matrix would be $O(n^2)$ and must be avoided in favor of tree-based MST construction.

A practical default: when you need an interpretable map of multi-scale structure, reach for OPTICS and study its reachability plot. When you need an automatic, robust clustering with membership and outlier information, reach for HDBSCAN. In both cases the conceptual payoff is the same. By replacing a single global density threshold with a hierarchy over mutual reachability, these methods recover clusters of varying density and arbitrary shape that flat DBSCAN cannot, and they make the act of choosing a density level an explicit, inspectable decision rather than a hidden assumption.

133.6 6. Practice: Library-Driven Density Clustering

Density clustering is a poor candidate for a from-scratch reimplementation: the difficult parts are the spatial-index nearest-neighbor queries, the Boruvka MST, and the condensed-tree bookkeeping, all of which mature libraries have implemented and battle-tested. The right engineering posture is to understand the algorithm deeply, as the preceding sections do, and then call a trusted tool.

The reference open-source tool is scikit-learn, which since version 1.3 ships both sklearn.cluster.HDBSCAN and sklearn.cluster.OPTICS in its core, with no extra dependency. The standalone hdbscan package by McInnes and collaborators remains the most feature-complete implementation (GLOSH outlier scores, soft membership vectors, DBCV, prediction on new points), and scikit-learn’s port covers the common cases including probabilities_ and the dbscan_clustering re-extraction method. For the exploratory reachability-plot workflow, scikit-learn’s OPTICS additionally exposes cluster_optics_dbscan, which replays any DBSCAN cut on a single ordering.

The running example is a deliberately variable-density scene: three Gaussian blobs with standard deviations spanning roughly a factor of five, plus uniform background noise. This is precisely the configuration that defeats a single-$\varepsilon$ DBSCAN and that the hierarchical methods are built to handle. The dataset is generated inline with make_blobs, with all seeds fixed, so the printed numbers are reproducible. The example fits HDBSCAN, fits OPTICS with the $\xi$ extractor, and also reads a single DBSCAN cut off the OPTICS ordering, then scores each against the known labels with the Adjusted Rand Index.

Code

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import HDBSCAN, OPTICS, cluster_optics_dbscan
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(42)

# Variable-density scene: one tight blob, one diffuse blob, one medium
# blob (std spans ~5x), plus uniform background noise that the methods
# should label as outliers (-1).
Xc, yc = make_blobs(
    n_samples=[150, 150, 150],
    centers=[[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]],
    cluster_std=[0.30, 1.40, 0.70],
    random_state=42,
)
noise = rng.uniform(low=-9, high=12, size=(60, 2))
X = np.vstack([Xc, noise])
y_true = np.concatenate([yc, np.full(len(noise), -1)])
print(f"dataset: X={X.shape}, true clusters=3 + {len(noise)} noise points")

# --- HDBSCAN: automatic stability-based extraction -------------------
hdb = HDBSCAN(min_cluster_size=25, min_samples=10, store_centers="centroid")
hdb_labels = hdb.fit_predict(X)
n_hdb = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)
n_noise = int((hdb_labels == -1).sum())
print("\n[HDBSCAN]")
print(f"  clusters found:        {n_hdb}")
print(f"  points labeled noise:  {n_noise}")
print(f"  ARI vs ground truth:   {adjusted_rand_score(y_true, hdb_labels):.3f}")
print(f"  centroids (rounded):\n{np.round(hdb.centroids_, 2)}")
# probabilities_ are soft membership strengths in [0, 1]; noise gets 0.
core = hdb_labels != -1
print(f"  mean membership prob (clustered): "
      f"{hdb.probabilities_[core].mean():.3f}")

# --- OPTICS: reachability ordering + xi extraction ------------------
opt = OPTICS(min_samples=10, xi=0.05, min_cluster_size=25)
opt_labels = opt.fit_predict(X)
n_opt = len(set(opt_labels)) - (1 if -1 in opt_labels else 0)
print("\n[OPTICS]")
print(f"  clusters (xi method):  {n_opt}")
print(f"  ARI vs ground truth:   {adjusted_rand_score(y_true, opt_labels):.3f}")

# The ordered reachabilities are exactly what the reachability plot draws.
reach = opt.reachability_[opt.ordering_]
finite = reach[np.isfinite(reach)]
print(f"  reachability (ordered): min={finite.min():.2f}, "
      f"median={np.median(finite):.2f}, max={finite.max():.2f}")

# A single global DBSCAN cut at eps=2.0, read off the same ordering for free.
db_labels = cluster_optics_dbscan(
    reachability=opt.reachability_,
    core_distances=opt.core_distances_,
    ordering=opt.ordering_,
    eps=2.0,
)
n_db = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print(f"  DBSCAN cut at eps=2.0:  {n_db} clusters, "
      f"ARI={adjusted_rand_score(y_true, db_labels):.3f}")

dataset: X=(510, 2), true clusters=3 + 60 noise points

[HDBSCAN]
  clusters found:        3
  points labeled noise:  31
  ARI vs ground truth:   0.896
  centroids (rounded):
[[ 6.05  5.77]
 [-6.01  6.07]
 [-0.03  0.01]]
  mean membership prob (clustered): 0.688

[OPTICS]
  clusters (xi method):  3
  ARI vs ground truth:   0.310
  reachability (ordered): min=0.09, median=0.34, max=4.95
  DBSCAN cut at eps=2.0:  3 clusters, ARI=0.936

HDBSCAN recovers all three blobs at their differing densities and scores a high ARI while quarantining most background points as noise, exactly the variable-density behavior a single-$\varepsilon$ DBSCAN cannot match. The centroids_ line confirms the cluster locations, and the mean membership probability shows that points sit firmly, not marginally, inside their clusters. The OPTICS block illustrates the tuning subtlety of Section 2.3: the $\xi$ extractor finds the right number of clusters but, because the diffuse blob’s reachability slope is gentle, it can mislabel boundary points and score a weaker ARI than the plain DBSCAN cut read off the very same ordering. The lesson is the chapter’s thesis in miniature: the ordering carries the full structure, and how you cut it is a deliberate, inspectable choice.

# Clustering.jl provides DBSCAN and OPTICS over a precomputed distance
# matrix or coordinates. It has no native HDBSCAN, so OPTICS plus an
# eps-cut is the closest equivalent in the Julia ecosystem.
using Clustering, Distances, Random

Random.seed!(42)

# Three blobs of differing spread plus uniform noise (columns = points).
c1 = 0.30 .* randn(2, 150) .+ [0.0, 0.0]
c2 = 1.40 .* randn(2, 150) .+ [6.0, 6.0]
c3 = 0.70 .* randn(2, 150) .+ [-6.0, 6.0]
noise = rand(2, 60) .* [21.0, 21.0] .+ [-9.0, -9.0]
X = hcat(c1, c2, c3, noise)

# OPTICS: build the ordering, then extract clusters with an eps-cut.
optres = optics(X, 10; metric=Euclidean())   # min_cluster_size = 10
clusts = extract_dbscan_clustering(optres, 2.0)  # eps = 2.0
println("OPTICS+eps clusters: ", nclusters(clusts))
println("assignments (first 10): ", assignments(clusts)[1:10])

# A direct DBSCAN for comparison on the same data.
dbres = dbscan(X, 2.0; min_neighbors=10, min_cluster_size=25)
println("DBSCAN clusters: ", nclusters(dbres))

// linfa-clustering provides DBSCAN and OPTICS via the `linfa` ML
// toolkit. It does NOT implement HDBSCAN, so OPTICS (with a manual
// reachability cut) is the closest mature equivalent in Rust today.
// Honest note: for production HDBSCAN in Rust there is no battle-tested
// crate; teams typically call Python's hdbscan over PyO3 or FFI.
use linfa::traits::Transformer;
use linfa_clustering::{Dbscan, OpticsAnalysis, Optics};
use ndarray::{array, Array2};

fn main() {
    // Three blobs of differing spread plus background noise.
    // (In real code, generate with ndarray-rand using a fixed seed.)
    let data: Array2<f64> = array![
        [0.0, 0.0], [0.2, -0.1], [-0.1, 0.3],      // tight blob
        [6.0, 6.0], [7.2, 5.1], [4.9, 7.0],        // diffuse blob
        [-6.0, 6.0], [-5.6, 6.4], [-6.3, 5.7],     // medium blob
        [10.0, -8.0], [-8.5, 11.0],                // noise
    ];

    // OPTICS produces an ordering with reachability distances.
    let analysis: OpticsAnalysis<f64> = Optics::params(3)
        .tolerance(2.0)          // eps upper bound for the search
        .transform(&data)
        .expect("optics failed");
    println!("optics samples ordered: {}", analysis.len());

    // DBSCAN for a direct flat clustering at one density level.
    let labels = Dbscan::params(3)
        .tolerance(2.0)
        .transform(&data)
        .expect("dbscan failed");
    let n = labels.iter().filter_map(|x| *x).max().map_or(0, |m| m + 1);
    println!("dbscan clusters: {}", n);
}

Only the Python tab executes in this book; the Julia and Rust tabs are illustrative and show the closest mature equivalents in those ecosystems. The practical takeaway across all three is the same. DBSCAN and OPTICS are widely implemented, but HDBSCAN’s full machinery (condensed tree, stability extraction, soft memberships, GLOSH) lives almost entirely in the Python ecosystem through scikit-learn and the standalone hdbscan package. If you need HDBSCAN specifically and your stack is Julia or Rust, the pragmatic path is to call the Python implementation; if you only need a density-aware flat clustering, the native OPTICS plus an $\varepsilon$-cut in Clustering.jl or linfa-clustering is a sound choice.

133.7 7. Summary

Variable-density clustering requires abandoning the single-threshold view. The core-distance and mutual-reachability constructs turn local density into a metric on which a full hierarchy can be built. OPTICS linearizes that hierarchy into a reachability plot, where valleys are clusters and the $\xi$ method extracts variable-density partitions, though the cut you choose deserves scrutiny. HDBSCAN materializes the hierarchy as a condensed tree and selects clusters by maximizing stability, then offers soft memberships and outlier scores as a bonus. Mastery of these methods rests on two ideas worth repeating: mutual reachability is the metric that equalizes density across the space, and cluster stability is the criterion that lets different parts of the space be clustered at different densities within one coherent result. Because the heavy machinery is subtle, the recommended practice is to lean on mature open-source implementations, scikit-learn’s HDBSCAN and OPTICS and the standalone hdbscan package, and spend your judgment on parameters, validation, and metric choice rather than on reimplementing nearest-neighbor and minimum-spanning-tree code.

133.8 References

Ankerst, M., Breunig, M., Kriegel, H.-P., and Sander, J. “OPTICS: Ordering Points To Identify the Clustering Structure.” ACM SIGMOD, 1999. https://dl.acm.org/doi/10.1145/304182.304187
Campello, R. J. G. B., Moulavi, D., and Sander, J. “Density-Based Clustering Based on Hierarchical Density Estimates.” PAKDD, 2013. https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14
Campello, R. J. G. B., Moulavi, D., Zimek, A., and Sander, J. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” ACM TKDD, 2015. https://dl.acm.org/doi/10.1145/2733381
McInnes, L., Healy, J., and Astels, S. “hdbscan: Hierarchical density based clustering.” Journal of Open Source Software, 2017. https://joss.theoj.org/papers/10.21105/joss.00205
Moulavi, D., Jaskowiak, P. A., Campello, R. J. G. B., Zimek, A., and Sander, J. “Density-Based Clustering Validation.” SIAM International Conference on Data Mining, 2014. https://epubs.siam.org/doi/10.1137/1.9781611973440.96
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” KDD, 1996. https://dl.acm.org/doi/10.5555/3001460.3001507
McInnes, L., and Healy, J. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” 2018. https://arxiv.org/abs/1802.03426
Pedregosa, F., et al. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research, 2011. https://jmlr.org/papers/v12/pedregosa11a.html
The HDBSCAN Clustering Library documentation. https://hdbscan.readthedocs.io/
scikit-learn developers. “Clustering: HDBSCAN and OPTICS.” https://scikit-learn.org/stable/modules/clustering.html

# Density-Based Clustering with HDBSCAN and OPTICS Density-based clustering treats clusters as regions of high observation density separated by regions of low density. This view, made popular by DBSCAN, frees the analyst from specifying the number of clusters in advance and allows clusters of arbitrary shape. Yet plain DBSCAN carries a fragile assumption: a single global density threshold governs the entire dataset. When clusters occur at genuinely different densities, no single threshold recovers all of them. OPTICS and HDBSCAN are the two principled answers to this problem. Both replace the flat threshold with a hierarchical, multi-scale notion of density. This chapter develops the reachability formalism behind OPTICS, the condensed-tree and stability formalism behind HDBSCAN, and the practical judgment needed to deploy them. It then demonstrates both algorithms with mature open-source tooling rather than a hand-rolled reimplementation, because for density clustering the correctness of the spatial-index and minimum-spanning-tree machinery is exactly what you do not want to get subtly wrong. ## 1. From a Global Threshold to a Density Hierarchy ### 1.1 The DBSCAN baseline and its failure mode DBSCAN is parameterized by a radius $\varepsilon$ and a minimum count `minPts`. A point $p$ is a core point if its $\varepsilon$-neighborhood contains at least `minPts` points. Clusters grow by chaining core points that are within $\varepsilon$ of one another, and points that fall in no cluster are labeled noise. The procedure is elegant, but the radius $\varepsilon$ fixes one density scale for the whole space. Consider two Gaussian blobs, one tight and one diffuse, plus background noise. A small $\varepsilon$ tuned to the tight blob shatters the diffuse blob into fragments and noise. A large $\varepsilon$ tuned to the diffuse blob merges the tight blob with nearby background. There is no $\varepsilon$ that recovers both. This variable-density problem motivates a hierarchy: rather than commit to one density level, examine the cluster structure across all density levels at once. ### 1.2 Core distance and mutual reachability Both algorithms begin with the same local density estimate. For a point $p$ and a parameter $k$ (the analog of `minPts`, often written `min_samples`), the core distance is the distance to the $k$-th nearest neighbor: $$d_{\text{core}}(p) = \text{distance from } p \text{ to its } k\text{-th nearest neighbor}.$$ A small core distance means $p$ sits in a dense region. To compare points fairly, define the mutual reachability distance between $p$ and $q$: $$d_{\text{mreach}}(p, q) = \max\big(d_{\text{core}}(p),\, d_{\text{core}}(q),\, d(p, q)\big).$$ This symmetric quantity inflates the raw distance whenever either endpoint lies in a sparse region. Its effect is to push sparse points away from one another while leaving dense points connected at their true separation: two points sitting in the dense core keep their actual distance $d(p,q)$, but a point in a sparse halo is held at least its (large) core distance away from everything. Mutual reachability is the metric on which the density hierarchy is built, and it is the single most important construct in this chapter. Note that it is not a true metric, since the triangle inequality can fail, but it is symmetric and non-negative, which is all the spanning-tree construction needs. ## 2. OPTICS and the Reachability Plot ### 2.1 The ordering algorithm OPTICS (Ordering Points To Identify the Clustering Structure) does not produce clusters directly. Instead it produces an ordering of the points together with a reachability distance for each, from which clusterings at every density level can be read off. The algorithm processes points in a priority queue keyed by reachability distance. Starting from an arbitrary unprocessed point, it computes the core distance, then for each neighbor updates a tentative reachability distance equal to the mutual reachability to the current core. It always expands the point with the smallest current reachability, which causes the walk to stay inside a dense region until that region is exhausted before jumping across a sparse gap. ```text seed an unprocessed point, output it with reachability = undefined maintain a priority queue of (reachability, point) repeatedly: pop the point with smallest reachability, mark processed, output it if it is a core point, update queue reachabilities of its neighbors ``` The output is a sequence $o_1, o_2, \ldots, o_n$ with reachability values $r_1, r_2, \ldots, r_n$. Crucially, the parameter $\varepsilon$ in OPTICS is only an upper bound used to prune the search for efficiency. It does not commit the analysis to one density scale; setting it to infinity simply makes the algorithm exact at higher cost. ### 2.2 Reading the reachability plot Plotting the reachability values $r_i$ in processing order produces the reachability plot, the signature visualization of OPTICS. The horizontal axis is the ordering; the vertical axis is reachability distance. Valleys correspond to clusters: while the walk stays in a dense region, reachability stays low. Peaks correspond to the sparse boundaries crossed when the walk jumps from one cluster to the next. A nested valley, a small dip inside a larger dip, reveals a subcluster embedded in a coarser cluster. This is precisely the variable-density structure that defeats DBSCAN. To extract a flat DBSCAN-equivalent clustering, draw a horizontal cut at height $\varepsilon'$ and treat each contiguous run below the line as a cluster. Different cut heights yield different clusterings, and the plot lets the analyst choose visually. This horizontal-cut extraction is exactly equivalent to running DBSCAN with that $\varepsilon'$, which is why a single OPTICS run can replay every DBSCAN result without recomputation. ### 2.3 The xi method for variable-density extraction A horizontal cut still imposes one global level. The $\xi$ (xi) extraction method instead looks for steep relative changes in reachability. A downward region where reachability drops by at least a factor $1 - \xi$ marks a cluster start, and a matching upward region marks its end. By keying on the steepness of transitions rather than their absolute height, $\xi$ extraction adapts the effective threshold to each valley and recovers clusters at multiple densities from a single run. This is the OPTICS counterpart to what HDBSCAN achieves through stability. In practice $\xi$ is delicate: it reacts to the relative slope of the reachability curve, so a value that cleanly separates well-isolated blobs can over-fragment or merge clusters whose densities differ only mildly. When the blobs touch or overlap, a plain DBSCAN-style horizontal cut on the same OPTICS ordering is often the steadier choice, a contrast the worked example below makes concrete. ## 3. HDBSCAN: Hierarchical Density-Based Clustering HDBSCAN (Hierarchical DBSCAN) takes the hierarchical idea further. Rather than leave extraction to a visual cut, it builds the full hierarchy, condenses it, and selects clusters by an explicit optimization over stability. The pipeline has four stages. ### 3.1 Stage one: mutual reachability graph and minimum spanning tree Treat the points as nodes of a complete weighted graph whose edge weights are the mutual reachability distances of Section 1.2. Compute its minimum spanning tree (MST). The MST is the cheapest set of edges connecting all points under the mutual reachability metric, and it encodes how the dataset knits together as the density threshold is relaxed. ```text weight(p, q) = d_mreach(p, q) T = minimum_spanning_tree(complete_graph) ``` The complete graph is never materialized in a good implementation. A dual-tree Boruvka algorithm over space-partitioning trees finds the MST while touching only a small fraction of the $\binom{n}{2}$ possible edges, which is what makes HDBSCAN tractable on large data. ### 3.2 Stage two: the single-linkage hierarchy Sort the MST edges by increasing weight and add them one at a time, merging the components they connect. This is single-linkage agglomerative clustering on the mutual reachability metric. The result is a dendrogram: at the bottom every point is its own cluster, and as the threshold rises (equivalently, as the distance threshold rises) components merge, until at the top everything is one cluster. It is convenient to work in the inverse-distance scale $\lambda = 1 / d_{\text{mreach}}$. A high $\lambda$ corresponds to a strict, high-density threshold; a low $\lambda$ to a permissive one. Each cluster in the hierarchy is born at some $\lambda_{\text{birth}}$ and, if it splits or dissolves, dies at some $\lambda_{\text{death}}$. ### 3.3 Stage three: condensing the tree The raw single-linkage dendrogram has $n - 1$ internal nodes, most of them spurious: a single point peeling off a cluster is not a meaningful split. HDBSCAN condenses the tree using the `min_cluster_size` parameter. Walking down from the root, at each split it asks whether each of the two children has at least `min_cluster_size` points. - If both children are large enough, this is a true split into two clusters. - If only one child is large enough, the smaller child is treated as points falling out of the parent, and the parent persists as the same cluster at a lower $\lambda$. The outcome is the condensed tree, a much smaller hierarchy whose nodes are genuine clusters, each spanning a range of $\lambda$ and gradually shedding points as $\lambda$ increases. ### 3.4 Stage four: stability and flat extraction For each cluster $C$ in the condensed tree, define its stability as the total persistence of its members across the $\lambda$ range during which they belong to $C$: $$S(C) = \sum_{p \in C} \big(\lambda_p - \lambda_{\text{birth}}(C)\big),$$ where $\lambda_p$ is the value at which point $p$ leaves $C$ (by falling out or by the cluster splitting) and $\lambda_{\text{birth}}(C)$ is the value at which $C$ first appears. A cluster that persists over a wide $\lambda$ range with many members accumulates high stability. A cluster that fleetingly appears and immediately splits scores low. Extraction is then a constrained optimization over the condensed tree. Process clusters from the leaves upward. If the sum of the stabilities of a cluster's children exceeds the cluster's own stability, keep the children and set the parent stability to that sum. Otherwise select the parent and discard the subtree beneath it. The selected clusters form an antichain: no selected cluster is an ancestor of another, so the result is a valid flat partition with a noise label for unselected points. This rule, the Excess of Mass criterion, lets HDBSCAN choose clusters at different density levels in different parts of the space, which is exactly what flat DBSCAN cannot do. ### 3.5 Soft clustering and outlier scores Because each point has a $\lambda$ at which it leaves its cluster, HDBSCAN yields more than hard labels. The membership strength of a point can be derived from how deep into its cluster's $\lambda$ range it survives, giving soft cluster probabilities. The reference library also exposes GLOSH outlier scores, which compare a point's $\lambda$ at departure to the maximum $\lambda$ reached by its cluster, flagging points that detach early as likely outliers. These scores are valuable in anomaly detection, where the question is not which cluster a point belongs to but how firmly it belongs anywhere. ## 4. Parameters, Tuning, and Diagnostics ### 4.1 The two governing parameters HDBSCAN exposes two parameters with distinct roles. - `min_cluster_size` is the smallest group of points you are willing to call a cluster. It controls the granularity of the condensed tree and is the parameter to reach for first. Raising it suppresses small clusters and pushes their points toward larger clusters or noise. - `min_samples` (the $k$ of the core distance) controls how conservative the density estimate is. Larger values make core distances larger in sparse regions, which expands what counts as noise and yields tighter, more conservative clusters. When unset it defaults to `min_cluster_size`. A useful mental model: `min_cluster_size` shapes what survives in the hierarchy, while `min_samples` shapes how aggressively the algorithm calls borderline points noise. ### 4.2 Leaf versus excess-of-mass selection The default Excess of Mass method favors larger, more stable clusters and can swallow fine substructure under a single broad cluster. Setting `cluster_selection_method="leaf"` instead selects the leaf clusters of the condensed tree, producing more numerous and homogeneous clusters. A related knob, `cluster_selection_epsilon`, merges clusters whose separation falls below a distance floor, which is helpful when EOM oversplits a region you regard as one group. ### 4.3 Validating density clusterings Density clusters are non-convex, so centroid-based indices such as silhouette can mislead. The Density-Based Clustering Validation index (DBCV) is designed for this setting: it scores a clustering by contrasting within-cluster density against between-cluster separation using the same density-connectivity ideas, and it can fairly compare partitions with noise and arbitrary shapes. When ground truth exists, the Adjusted Rand Index or Adjusted Mutual Information remain appropriate. Always inspect the fraction of points labeled noise; an unexpectedly large noise fraction usually signals that `min_samples` is too high or the feature scaling is off. ### 4.4 Metric choice and dimensionality Mutual reachability inherits whatever base metric you supply, so feature scaling and metric choice matter as much as the algorithm. In high dimensions, distance concentration erodes the contrast between core and non-core points, and density estimates degrade. A common and effective pipeline reduces dimensionality with UMAP, which preserves local density structure better than PCA for this purpose, and then runs HDBSCAN in the reduced space. This combination underlies many modern document and embedding clustering systems, including topic-modeling pipelines built on transformer embeddings. ## 5. Choosing Between OPTICS and HDBSCAN The two methods share a foundation but differ in emphasis. OPTICS produces the reachability plot, an interpretable artifact that exposes the entire density structure for human inspection; this makes it attractive for exploratory analysis where the analyst wants to see and reason about nested structure before committing to a clustering. HDBSCAN automates the extraction step through stability optimization, producing a defensible flat clustering, soft memberships, and outlier scores with minimal manual judgment; this makes it attractive for production pipelines and for embedding in larger systems. In terms of cost, both are dominated by nearest-neighbor computation. Naive implementations are $O(n^2)$, but spatial index acceleration brings the practical complexity closer to $O(n \log n)$ on low-dimensional or well-structured data, and HDBSCAN's reference implementation uses space-partitioning trees and a Boruvka-style MST to scale to hundreds of thousands of points. For very large datasets, approximate nearest-neighbor variants extend the reach further. Memory is the other constraint: OPTICS stores the ordering and reachability arrays in $O(n)$, while a naive mutual-reachability matrix would be $O(n^2)$ and must be avoided in favor of tree-based MST construction. A practical default: when you need an interpretable map of multi-scale structure, reach for OPTICS and study its reachability plot. When you need an automatic, robust clustering with membership and outlier information, reach for HDBSCAN. In both cases the conceptual payoff is the same. By replacing a single global density threshold with a hierarchy over mutual reachability, these methods recover clusters of varying density and arbitrary shape that flat DBSCAN cannot, and they make the act of choosing a density level an explicit, inspectable decision rather than a hidden assumption. ## 6. Practice: Library-Driven Density Clustering Density clustering is a poor candidate for a from-scratch reimplementation: the difficult parts are the spatial-index nearest-neighbor queries, the Boruvka MST, and the condensed-tree bookkeeping, all of which mature libraries have implemented and battle-tested. The right engineering posture is to understand the algorithm deeply, as the preceding sections do, and then call a trusted tool. The reference open-source tool is scikit-learn, which since version 1.3 ships both `sklearn.cluster.HDBSCAN` and `sklearn.cluster.OPTICS` in its core, with no extra dependency. The standalone `hdbscan` package by McInnes and collaborators remains the most feature-complete implementation (GLOSH outlier scores, soft membership vectors, DBCV, prediction on new points), and scikit-learn's port covers the common cases including `probabilities_` and the `dbscan_clustering` re-extraction method. For the exploratory reachability-plot workflow, scikit-learn's OPTICS additionally exposes `cluster_optics_dbscan`, which replays any DBSCAN cut on a single ordering. The running example is a deliberately variable-density scene: three Gaussian blobs with standard deviations spanning roughly a factor of five, plus uniform background noise. This is precisely the configuration that defeats a single-$\varepsilon$ DBSCAN and that the hierarchical methods are built to handle. The dataset is generated inline with `make_blobs`, with all seeds fixed, so the printed numbers are reproducible. The example fits HDBSCAN, fits OPTICS with the $\xi$ extractor, and also reads a single DBSCAN cut off the OPTICS ordering, then scores each against the known labels with the Adjusted Rand Index. ::: {.panel-tabset} ## Python ```{python} import numpy as np from sklearn.datasets import make_blobs from sklearn.cluster import HDBSCAN, OPTICS, cluster_optics_dbscan from sklearn.metrics import adjusted_rand_score rng = np.random.RandomState(42) # Variable-density scene: one tight blob, one diffuse blob, one medium # blob (std spans ~5x), plus uniform background noise that the methods # should label as outliers (-1). Xc, yc = make_blobs( n_samples=[150, 150, 150], centers=[[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]], cluster_std=[0.30, 1.40, 0.70], random_state=42, ) noise = rng.uniform(low=-9, high=12, size=(60, 2)) X = np.vstack([Xc, noise]) y_true = np.concatenate([yc, np.full(len(noise), -1)]) print(f"dataset: X={X.shape}, true clusters=3 + {len(noise)} noise points") # --- HDBSCAN: automatic stability-based extraction ------------------- hdb = HDBSCAN(min_cluster_size=25, min_samples=10, store_centers="centroid") hdb_labels = hdb.fit_predict(X) n_hdb = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0) n_noise = int((hdb_labels == -1).sum()) print("\n[HDBSCAN]") print(f" clusters found: {n_hdb}") print(f" points labeled noise: {n_noise}") print(f" ARI vs ground truth: {adjusted_rand_score(y_true, hdb_labels):.3f}") print(f" centroids (rounded):\n{np.round(hdb.centroids_, 2)}") # probabilities_ are soft membership strengths in [0, 1]; noise gets 0. core = hdb_labels != -1 print(f" mean membership prob (clustered): " f"{hdb.probabilities_[core].mean():.3f}") # --- OPTICS: reachability ordering + xi extraction ------------------ opt = OPTICS(min_samples=10, xi=0.05, min_cluster_size=25) opt_labels = opt.fit_predict(X) n_opt = len(set(opt_labels)) - (1 if -1 in opt_labels else 0) print("\n[OPTICS]") print(f" clusters (xi method): {n_opt}") print(f" ARI vs ground truth: {adjusted_rand_score(y_true, opt_labels):.3f}") # The ordered reachabilities are exactly what the reachability plot draws. reach = opt.reachability_[opt.ordering_] finite = reach[np.isfinite(reach)] print(f" reachability (ordered): min={finite.min():.2f}, " f"median={np.median(finite):.2f}, max={finite.max():.2f}") # A single global DBSCAN cut at eps=2.0, read off the same ordering for free. db_labels = cluster_optics_dbscan( reachability=opt.reachability_, core_distances=opt.core_distances_, ordering=opt.ordering_, eps=2.0, ) n_db = len(set(db_labels)) - (1 if -1 in db_labels else 0) print(f" DBSCAN cut at eps=2.0: {n_db} clusters, " f"ARI={adjusted_rand_score(y_true, db_labels):.3f}") ``` HDBSCAN recovers all three blobs at their differing densities and scores a high ARI while quarantining most background points as noise, exactly the variable-density behavior a single-$\varepsilon$ DBSCAN cannot match. The `centroids_` line confirms the cluster locations, and the mean membership probability shows that points sit firmly, not marginally, inside their clusters. The OPTICS block illustrates the tuning subtlety of Section 2.3: the $\xi$ extractor finds the right number of clusters but, because the diffuse blob's reachability slope is gentle, it can mislabel boundary points and score a weaker ARI than the plain DBSCAN cut read off the very same ordering. The lesson is the chapter's thesis in miniature: the ordering carries the full structure, and how you cut it is a deliberate, inspectable choice. ## Julia ```julia # Clustering.jl provides DBSCAN and OPTICS over a precomputed distance # matrix or coordinates. It has no native HDBSCAN, so OPTICS plus an # eps-cut is the closest equivalent in the Julia ecosystem. using Clustering, Distances, Random Random.seed!(42) # Three blobs of differing spread plus uniform noise (columns = points). c1 = 0.30 .* randn(2, 150) .+ [0.0, 0.0] c2 = 1.40 .* randn(2, 150) .+ [6.0, 6.0] c3 = 0.70 .* randn(2, 150) .+ [-6.0, 6.0] noise = rand(2, 60) .* [21.0, 21.0] .+ [-9.0, -9.0] X = hcat(c1, c2, c3, noise) # OPTICS: build the ordering, then extract clusters with an eps-cut. optres = optics(X, 10; metric=Euclidean()) # min_cluster_size = 10 clusts = extract_dbscan_clustering(optres, 2.0) # eps = 2.0 println("OPTICS+eps clusters: ", nclusters(clusts)) println("assignments (first 10): ", assignments(clusts)[1:10]) # A direct DBSCAN for comparison on the same data. dbres = dbscan(X, 2.0; min_neighbors=10, min_cluster_size=25) println("DBSCAN clusters: ", nclusters(dbres)) ``` ## Rust ```rust // linfa-clustering provides DBSCAN and OPTICS via the `linfa` ML // toolkit. It does NOT implement HDBSCAN, so OPTICS (with a manual // reachability cut) is the closest mature equivalent in Rust today. // Honest note: for production HDBSCAN in Rust there is no battle-tested // crate; teams typically call Python's hdbscan over PyO3 or FFI. use linfa::traits::Transformer; use linfa_clustering::{Dbscan, OpticsAnalysis, Optics}; use ndarray::{array, Array2}; fn main() { // Three blobs of differing spread plus background noise. // (In real code, generate with ndarray-rand using a fixed seed.) let data: Array2<f64> = array![ [0.0, 0.0], [0.2, -0.1], [-0.1, 0.3], // tight blob [6.0, 6.0], [7.2, 5.1], [4.9, 7.0], // diffuse blob [-6.0, 6.0], [-5.6, 6.4], [-6.3, 5.7], // medium blob [10.0, -8.0], [-8.5, 11.0], // noise ]; // OPTICS produces an ordering with reachability distances. let analysis: OpticsAnalysis<f64> = Optics::params(3) .tolerance(2.0) // eps upper bound for the search .transform(&data) .expect("optics failed"); println!("optics samples ordered: {}", analysis.len()); // DBSCAN for a direct flat clustering at one density level. let labels = Dbscan::params(3) .tolerance(2.0) .transform(&data) .expect("dbscan failed"); let n = labels.iter().filter_map(|x| *x).max().map_or(0, |m| m + 1); println!("dbscan clusters: {}", n); } ``` ::: Only the Python tab executes in this book; the Julia and Rust tabs are illustrative and show the closest mature equivalents in those ecosystems. The practical takeaway across all three is the same. DBSCAN and OPTICS are widely implemented, but HDBSCAN's full machinery (condensed tree, stability extraction, soft memberships, GLOSH) lives almost entirely in the Python ecosystem through scikit-learn and the standalone `hdbscan` package. If you need HDBSCAN specifically and your stack is Julia or Rust, the pragmatic path is to call the Python implementation; if you only need a density-aware flat clustering, the native OPTICS plus an $\varepsilon$-cut in Clustering.jl or linfa-clustering is a sound choice. ## 7. Summary Variable-density clustering requires abandoning the single-threshold view. The core-distance and mutual-reachability constructs turn local density into a metric on which a full hierarchy can be built. OPTICS linearizes that hierarchy into a reachability plot, where valleys are clusters and the $\xi$ method extracts variable-density partitions, though the cut you choose deserves scrutiny. HDBSCAN materializes the hierarchy as a condensed tree and selects clusters by maximizing stability, then offers soft memberships and outlier scores as a bonus. Mastery of these methods rests on two ideas worth repeating: mutual reachability is the metric that equalizes density across the space, and cluster stability is the criterion that lets different parts of the space be clustered at different densities within one coherent result. Because the heavy machinery is subtle, the recommended practice is to lean on mature open-source implementations, scikit-learn's `HDBSCAN` and `OPTICS` and the standalone `hdbscan` package, and spend your judgment on parameters, validation, and metric choice rather than on reimplementing nearest-neighbor and minimum-spanning-tree code. ## References 1. Ankerst, M., Breunig, M., Kriegel, H.-P., and Sander, J. "OPTICS: Ordering Points To Identify the Clustering Structure." ACM SIGMOD, 1999. https://dl.acm.org/doi/10.1145/304182.304187 2. Campello, R. J. G. B., Moulavi, D., and Sander, J. "Density-Based Clustering Based on Hierarchical Density Estimates." PAKDD, 2013. https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14 3. Campello, R. J. G. B., Moulavi, D., Zimek, A., and Sander, J. "Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection." ACM TKDD, 2015. https://dl.acm.org/doi/10.1145/2733381 4. McInnes, L., Healy, J., and Astels, S. "hdbscan: Hierarchical density based clustering." Journal of Open Source Software, 2017. https://joss.theoj.org/papers/10.21105/joss.00205 5. Moulavi, D., Jaskowiak, P. A., Campello, R. J. G. B., Zimek, A., and Sander, J. "Density-Based Clustering Validation." SIAM International Conference on Data Mining, 2014. https://epubs.siam.org/doi/10.1137/1.9781611973440.96 6. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." KDD, 1996. https://dl.acm.org/doi/10.5555/3001460.3001507 7. McInnes, L., and Healy, J. "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." 2018. https://arxiv.org/abs/1802.03426 8. Pedregosa, F., et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 2011. https://jmlr.org/papers/v12/pedregosa11a.html 9. The HDBSCAN Clustering Library documentation. https://hdbscan.readthedocs.io/ 10. scikit-learn developers. "Clustering: HDBSCAN and OPTICS." https://scikit-learn.org/stable/modules/clustering.html