151 Distance-Based Anomaly Detection
Distance-based anomaly detection rests on a simple geometric intuition: normal observations lie in dense regions of the feature space, surrounded by many close neighbors, while anomalies sit far from the rest of the data. Unlike model-based approaches that fit a parametric distribution and then test for tail membership, distance methods make almost no assumptions about the shape of the data generating process. They require only a metric and a notion of locality. This chapter develops the principal distance-based scores used in practice, the k-nearest-neighbor distance, the local outlier factor, and the connectivity-based outlier factor, and then confronts the phenomenon that limits all of them in high dimensions: the curse of dimensionality.
151.1 1. Setup and Notation
Let \(X = \{x_1, \dots, x_n\}\) be a dataset with each \(x_i \in \mathbb{R}^d\), equipped with a distance function \(\operatorname{dist}(x_i, x_j)\). Throughout we assume the Euclidean metric
\[ \operatorname{dist}(x_i, x_j) = \|x_i - x_j\|_2 = \left( \sum_{m=1}^{d} (x_{im} - x_{jm})^2 \right)^{1/2}, \]
though everything below generalizes to any Minkowski \(L_p\) distance or to a learned metric. We write \(N_k(x)\) for the set of the \(k\) nearest neighbors of a point \(x\) (excluding \(x\) itself), and we treat anomaly detection as an unsupervised ranking problem: the goal is to assign each point a real-valued score such that larger scores correspond to greater outlierness. A downstream threshold or top-\(m\) cutoff then converts scores into labels.
The central design question is what reference set a point should be compared against. Global methods compare a point to the entire dataset and implicitly assume a single density scale. Local methods compare a point only to its own neighborhood and so tolerate datasets with regions of differing density. The progression from k-nearest-neighbor distance to local outlier factor to connectivity-based outlier factor is precisely a progression in how locality is handled.
151.2 2. The k-Nearest-Neighbor Distance
The most direct distance-based score is the distance from a point to its \(k\)-th nearest neighbor. Define
\[ \operatorname{kNN}(x) = \operatorname{dist}\big(x, \, \text{the } k\text{-th nearest neighbor of } x\big). \]
A point in a dense cluster has a small \(k\)-th neighbor distance because \(k\) points lie nearby; an isolated point has a large one. A common variant averages over the whole neighborhood rather than using the single \(k\)-th value,
\[ \operatorname{kNN}_{\text{avg}}(x) = \frac{1}{k} \sum_{y \in N_k(x)} \operatorname{dist}(x, y), \]
which is smoother and less sensitive to the exact choice of \(k\). Both are global scores in the sense that the same threshold is applied everywhere.
The parameter \(k\) governs a bias variance tradeoff. Small \(k\) makes the score sensitive to individual nearby points and therefore noisy, and it can fail when anomalies arrive in small tight clusters of size up to \(k\), since each member then has \(k-1\) close companions and a deceptively small score. Large \(k\) smooths the estimate but blurs fine local structure and raises computational cost. A practical rule is to choose \(k\) larger than the size of any anomaly micro cluster one wishes to detect.
score(x) = distance(x, kth_nearest_neighbor(x))
rank all points by score, descending
flag the top m as anomalies
The fundamental weakness of the global kNN distance is its single density scale. Consider two Gaussian clusters, one tight and one diffuse. A point on the natural edge of the diffuse cluster may have a larger absolute \(k\)-th neighbor distance than a genuine outlier sitting just outside the tight cluster. The global threshold cannot separate these cases. Resolving this requires normalizing each point’s distance by the typical distance in its own neighborhood, which is exactly what the local outlier factor does.
151.3 3. The Local Outlier Factor
The local outlier factor, introduced by Breunig and colleagues in 2000, converts the absolute kNN distance into a relative density ratio. It produces a score near \(1\) for points whose local density matches that of their neighbors and a score well above \(1\) for points that are substantially sparser than their surroundings.
151.3.1 3.1 Reachability Distance
To stabilize density estimates, LOF replaces the raw distance with a smoothed quantity. Let \(\operatorname{kdist}(y)\) denote the distance from \(y\) to its \(k\)-th nearest neighbor. The reachability distance of \(x\) with respect to \(y\) is
\[ \operatorname{reach\text{-}dist}_k(x, y) = \max\big\{ \operatorname{kdist}(y), \, \operatorname{dist}(x, y) \big\}. \]
If \(x\) is far from \(y\), the true distance is used. If \(x\) lies inside the \(k\)-neighborhood of \(y\), the value is floored at \(\operatorname{kdist}(y)\). This flooring reduces the statistical fluctuation of distances among very close points and is the only role of the asymmetric construction.
151.3.2 3.2 Local Reachability Density
The local reachability density of \(x\) is the inverse of the average reachability distance from \(x\) to its neighbors,
\[ \operatorname{lrd}_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \operatorname{reach\text{-}dist}_k(x, y) \right)^{-1}. \]
A high \(\operatorname{lrd}\) means \(x\) sits in a dense pocket; a low \(\operatorname{lrd}\) means its neighbors are typically far away.
151.3.3 3.3 The LOF Score
The LOF score compares the local density of \(x\) to the local densities of its neighbors,
\[ \operatorname{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\operatorname{lrd}_k(y)}{\operatorname{lrd}_k(x)}. \]
The interpretation is clean. When \(\operatorname{LOF}_k(x) \approx 1\), the point is as dense as its neighbors and is considered normal. When \(\operatorname{LOF}_k(x) \gg 1\), the neighbors are much denser than \(x\), marking \(x\) as a local outlier. Values below \(1\) indicate a point in a region denser than its neighbors, which under most conventions is treated as strongly normal.
for each x:
compute k-distance and N_k(x)
for each x:
lrd(x) = 1 / mean over y in N_k(x) of reach_dist(x, y)
for each x:
LOF(x) = mean over y in N_k(x) of lrd(y) / lrd(x)
Because every quantity in \(\operatorname{LOF}\) is a ratio of local densities, the method is invariant to the absolute density scale of a region. In the tight versus diffuse cluster example, an edge point of the diffuse cluster has neighbors of similarly low density, so its ratio stays near \(1\), while a true outlier near the tight cluster has dense neighbors and a large ratio. This local normalization is the decisive advantage of LOF over the global kNN distance. The price is increased sensitivity to \(k\) and a tendency to produce false positives along legitimate cluster boundaries where density changes sharply.
151.4 4. The Connectivity-Based Outlier Factor
The connectivity-based outlier factor, or COF, proposed by Tang and colleagues in 2002, addresses a structural blind spot of LOF. LOF assumes that the neighbors of a normal point are distributed roughly isotropically, like a spherical cloud. When normal data lie along a curve, a line, or any low-dimensional manifold embedded in the feature space, LOF can misjudge points on the manifold because their neighbors are arranged anisotropically rather than in a ball.
151.4.1 4.1 Set-Based Nearest Path
COF replaces the density notion with a connectivity notion built from a set-based nearest path, a greedy chain that grows the neighborhood one point at a time. Starting from \(x\), the path repeatedly appends the data point nearest to the current set, producing an ordering \(\{x = p_1, p_2, \dots, p_{k+1}\}\) of \(x\) together with its \(k\) neighbors. At each step \(i\) the cost \(e_i\) is the edge length that connected the new point to the growing set. This construction follows the local geometry of the data, tracing along a manifold rather than insisting on a sphere.
151.4.2 4.2 Average Chaining Distance
The average chaining distance weights earlier, shorter edges more heavily,
\[ \operatorname{ac\text{-}dist}_k(x) = \sum_{i=1}^{k} \frac{2(k + 1 - i)}{k(k+1)} \, e_i . \]
The weights \(\tfrac{2(k+1-i)}{k(k+1)}\) are positive, decreasing in \(i\), and sum to one, so \(\operatorname{ac\text{-}dist}\) is a weighted average of the path edge lengths that emphasizes the steps taken when the set was still small and close to \(x\).
151.4.3 4.3 The COF Score
The COF score compares the average chaining distance of \(x\) to those of its neighbors,
\[ \operatorname{COF}_k(x) = \frac{|N_k(x)| \cdot \operatorname{ac\text{-}dist}_k(x)}{\sum_{y \in N_k(x)} \operatorname{ac\text{-}dist}_k(y)} . \]
As with LOF, a value near \(1\) indicates that \(x\) is as well connected to the data as its neighbors are, while a value substantially greater than \(1\) marks \(x\) as poorly connected and therefore anomalous. Because connectivity is measured along the nearest path rather than through an isotropic density estimate, COF correctly handles points distributed along low-dimensional structures, flagging only those that deviate from the local pattern of connection. The cost is additional computation, since each chaining distance requires constructing a path, and a sensitivity to noise that can fragment the chain.
151.5 5. The Curse of Dimensionality
Every method above depends on the meaningfulness of distances and of the contrast between near and far. In high-dimensional spaces this contrast erodes, a phenomenon known as the curse of dimensionality, and it undermines distance-based detection at its foundation.
151.5.1 5.1 Distance Concentration
Consider \(n\) points drawn independently from a distribution in \(\mathbb{R}^d\). Under broad conditions, as \(d\) grows the pairwise distances concentrate around a common value. A widely cited formalization states that
\[ \lim_{d \to \infty} \frac{\mathbb{E}\big[\, \operatorname{dist}_{\max}(d) - \operatorname{dist}_{\min}(d) \,\big]}{\mathbb{E}\big[\, \operatorname{dist}_{\min}(d) \,\big]} \to 0, \]
where \(\operatorname{dist}_{\max}\) and \(\operatorname{dist}_{\min}\) are the largest and smallest distances from a query point to the data. The relative gap between the nearest and farthest neighbor vanishes. When the closest and most distant points are nearly equidistant, the very idea of a nearest neighbor loses discriminative power, and a kNN distance, an \(\operatorname{lrd}\), or an \(\operatorname{ac\text{-}dist}\) computed from such distances carries little signal.
A complementary way to see the effect is through the coefficient of variation of distances. For independent and identically distributed coordinates the squared Euclidean distance is a sum of \(d\) terms, so its mean grows like \(d\) while its standard deviation grows like \(\sqrt{d}\), giving a relative spread that decays as \(1/\sqrt{d}\). Distances become almost constant.
151.5.2 5.2 Consequences for Detection
Three practical failures follow. First, scores compress: the LOF and COF ratios drift toward \(1\) for all points, collapsing the separation between normal and anomalous. Second, the choice of metric matters more, since fractional \(L_p\) norms with \(p < 1\) retain contrast somewhat better than the Euclidean norm in high dimensions. Third, irrelevant features dominate, because every noisy dimension contributes to the distance and dilutes the few informative ones, so an anomaly that is obvious in a two-feature subspace becomes invisible in the full space.
151.5.3 5.3 Mitigations
The standard responses fall into two families. One reduces dimensionality before applying a distance method, through principal component analysis, random projection justified by the Johnson Lindenstrauss lemma, autoencoders, or feature selection, so that distances are computed in a space where they remain meaningful. The other abandons full-space distances in favor of subspace and ensemble methods, which search for low-dimensional projections in which an anomaly stands out and then aggregate the evidence. Feature bagging for LOF and subspace outlier detection are representative. In genuinely high-dimensional regimes these strategies routinely outperform any single full-space distance score.
151.6 6. Practical Guidance
Distance-based detectors share a workflow. Standardize features so that no dimension dominates the metric merely through scale. Choose \(k\) deliberately, large enough to exceed plausible anomaly cluster sizes yet small enough to preserve locality, and report sensitivity to \(k\) rather than a single value. Prefer LOF or COF over global kNN distance when the data contain regions of differing density, and prefer COF when normal data follow low-dimensional structure. Above roughly ten to twenty informative dimensions, treat raw full-space distances with suspicion and reach for dimensionality reduction or subspace ensembles. Finally, because all three scores are unsupervised rankings, validate thresholds against whatever labeled anomalies or domain knowledge are available, since the methods produce orderings, not calibrated probabilities.
151.7 7. Summary
The k-nearest-neighbor distance offers a transparent global measure of isolation but assumes a single density scale. The local outlier factor restores locality by scoring each point against the density of its own neighborhood, and the connectivity-based outlier factor further accommodates data that lie along low-dimensional manifolds by measuring connectivity along a nearest path rather than through isotropic density. All three are bound together by their dependence on meaningful distances, and all three degrade as dimensionality grows and distance contrast concentrates. Understanding both the geometric intuition and this shared failure mode is what lets a practitioner choose, tune, and trust a distance-based detector.
151.8 References
- Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying Density-Based Local Outliers. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. https://dl.acm.org/doi/10.1145/342009.335388
- Tang, J., Chen, Z., Fu, A. W.-C., and Cheung, D. W. Enhancing Effectiveness of Outlier Detections for Low Density Patterns. Advances in Knowledge Discovery and Data Mining (PAKDD), 2002. https://link.springer.com/chapter/10.1007/3-540-47887-6_53
- Ramaswamy, S., Rastogi, R., and Shim, K. Efficient Algorithms for Mining Outliers from Large Data Sets. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. https://dl.acm.org/doi/10.1145/342009.335437
- Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When Is Nearest Neighbor Meaningful? Proceedings of the International Conference on Database Theory (ICDT), 1999. https://link.springer.com/chapter/10.1007/3-540-49257-7_15
- Aggarwal, C. C., Hinneburg, A., and Keim, D. A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. Proceedings of the International Conference on Database Theory (ICDT), 2001. https://link.springer.com/chapter/10.1007/3-540-44503-X_27
- Lazarevic, A., and Kumar, V. Feature Bagging for Outlier Detection. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005. https://dl.acm.org/doi/10.1145/1081870.1081891
- Aggarwal, C. C. Outlier Analysis, 2nd edition. Springer, 2017. https://link.springer.com/book/10.1007/978-3-319-47578-3