151 Distance-Based Anomaly Detection

Distance-based anomaly detection rests on a simple geometric intuition: normal observations lie in dense regions of the feature space, surrounded by many close neighbors, while anomalies sit far from the rest of the data. Unlike model-based approaches that fit a parametric distribution and then test for tail membership, distance methods make almost no assumptions about the shape of the data generating process. They require only a metric and a notion of locality. This chapter develops the principal distance-based scores used in practice, the k-nearest-neighbor distance, the local outlier factor, and the connectivity-based outlier factor, and then confronts the phenomenon that limits all of them in high dimensions: the curse of dimensionality.

151.1 1. Setup and Notation

Let $X = \{x_1, \dots, x_n\}$ be a dataset with each $x_i \in \mathbb{R}^d$, equipped with a distance function $\operatorname{dist}(x_i, x_j)$. Throughout we assume the Euclidean metric

\[ \operatorname{dist}(x_i, x_j) = \|x_i - x_j\|_2 = \left( \sum_{m=1}^{d} (x_{im} - x_{jm})^2 \right)^{1/2}, \]

though everything below generalizes to any Minkowski $L_p$ distance or to a learned metric. We write $N_k(x)$ for the set of the $k$ nearest neighbors of a point $x$ (excluding $x$ itself), and we treat anomaly detection as an unsupervised ranking problem: the goal is to assign each point a real-valued score such that larger scores correspond to greater outlierness. A downstream threshold or top-$m$ cutoff then converts scores into labels.

The central design question is what reference set a point should be compared against. Global methods compare a point to the entire dataset and implicitly assume a single density scale. Local methods compare a point only to its own neighborhood and so tolerate datasets with regions of differing density. The progression from k-nearest-neighbor distance to local outlier factor to connectivity-based outlier factor is precisely a progression in how locality is handled.

151.2 2. The k-Nearest-Neighbor Distance

The most direct distance-based score is the distance from a point to its $k$-th nearest neighbor. Define

\[ \operatorname{kNN}(x) = \operatorname{dist}\big(x, \, \text{the } k\text{-th nearest neighbor of } x\big). \]

A point in a dense cluster has a small $k$-th neighbor distance because $k$ points lie nearby; an isolated point has a large one. A common variant averages over the whole neighborhood rather than using the single $k$-th value,

\[ \operatorname{kNN}_{\text{avg}}(x) = \frac{1}{k} \sum_{y \in N_k(x)} \operatorname{dist}(x, y), \]

which is smoother and less sensitive to the exact choice of $k$. Both are global scores in the sense that the same threshold is applied everywhere.

The parameter $k$ governs a bias variance tradeoff. Small $k$ makes the score sensitive to individual nearby points and therefore noisy, and it can fail when anomalies arrive in small tight clusters of size up to $k$, since each member then has $k-1$ close companions and a deceptively small score. Large $k$ smooths the estimate but blurs fine local structure and raises computational cost. A practical rule is to choose $k$ larger than the size of any anomaly micro cluster one wishes to detect.

score(x) = distance(x, kth_nearest_neighbor(x))
rank all points by score, descending
flag the top m as anomalies

The fundamental weakness of the global kNN distance is its single density scale. Consider two Gaussian clusters, one tight and one diffuse. A point on the natural edge of the diffuse cluster may have a larger absolute $k$-th neighbor distance than a genuine outlier sitting just outside the tight cluster. The global threshold cannot separate these cases. Resolving this requires normalizing each point’s distance by the typical distance in its own neighborhood, which is exactly what the local outlier factor does.

151.3 3. The Local Outlier Factor

The local outlier factor, introduced by Breunig and colleagues in 2000, converts the absolute kNN distance into a relative density ratio. It produces a score near $1$ for points whose local density matches that of their neighbors and a score well above $1$ for points that are substantially sparser than their surroundings.

151.3.1 3.1 Reachability Distance

To stabilize density estimates, LOF replaces the raw distance with a smoothed quantity. Let $\operatorname{kdist}(y)$ denote the distance from $y$ to its $k$-th nearest neighbor. The reachability distance of $x$ with respect to $y$ is

\[ \operatorname{reach\text{-}dist}_k(x, y) = \max\big\{ \operatorname{kdist}(y), \, \operatorname{dist}(x, y) \big\}. \]

If $x$ is far from $y$, the true distance is used. If $x$ lies inside the $k$-neighborhood of $y$, the value is floored at $\operatorname{kdist}(y)$. This flooring reduces the statistical fluctuation of distances among very close points and is the only role of the asymmetric construction.

151.3.2 3.2 Local Reachability Density

The local reachability density of $x$ is the inverse of the average reachability distance from $x$ to its neighbors,

\[ \operatorname{lrd}_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \operatorname{reach\text{-}dist}_k(x, y) \right)^{-1}. \]

A high $\operatorname{lrd}$ means $x$ sits in a dense pocket; a low $\operatorname{lrd}$ means its neighbors are typically far away.

151.3.3 3.3 The LOF Score

The LOF score compares the local density of $x$ to the local densities of its neighbors,

\[ \operatorname{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\operatorname{lrd}_k(y)}{\operatorname{lrd}_k(x)}. \]

The interpretation is clean. When $\operatorname{LOF}_k(x) \approx 1$, the point is as dense as its neighbors and is considered normal. When $\operatorname{LOF}_k(x) \gg 1$, the neighbors are much denser than $x$, marking $x$ as a local outlier. Values below $1$ indicate a point in a region denser than its neighbors, which under most conventions is treated as strongly normal.

for each x:
    compute k-distance and N_k(x)
for each x:
    lrd(x) = 1 / mean over y in N_k(x) of reach_dist(x, y)
for each x:
    LOF(x) = mean over y in N_k(x) of lrd(y) / lrd(x)

Because every quantity in $\operatorname{LOF}$ is a ratio of local densities, the method is invariant to the absolute density scale of a region. In the tight versus diffuse cluster example, an edge point of the diffuse cluster has neighbors of similarly low density, so its ratio stays near $1$, while a true outlier near the tight cluster has dense neighbors and a large ratio. This local normalization is the decisive advantage of LOF over the global kNN distance. The price is increased sensitivity to $k$ and a tendency to produce false positives along legitimate cluster boundaries where density changes sharply.

151.3.4 3.4 Why the Inlier Score Is Exactly One

The scale-invariance claim can be made precise. Suppose every point in $N_k(x)$ and every point in their neighborhoods shares a common local reachability density $\rho$, that is $\operatorname{lrd}_k(x) = \operatorname{lrd}_k(y) = \rho$ for all $y \in N_k(x)$. Then

\[ \operatorname{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\operatorname{lrd}_k(y)}{\operatorname{lrd}_k(x)} = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\rho}{\rho} = 1 . \]

The density $\rho$ cancels identically, so the score does not depend on the absolute scale of the region: a point embedded in a uniformly dense pocket scores exactly $1$ whether that pocket is tight or diffuse. This is the algebraic content of “locality.” Worked numerically, the four corners of a unit square each have two neighbors at reachability distance $1$, giving $\operatorname{lrd} = 1$ for all four and hence $\operatorname{LOF} = 1$ each. Insert a fifth point far away and its neighbors retain $\operatorname{lrd} = 1$ while its own $\operatorname{lrd}$ collapses to $1 / \overline{\text{reach}} \approx 0.097$, so its $\operatorname{LOF} = \overline{\operatorname{lrd}(y)} / \operatorname{lrd}(x) \approx 1/0.097 \approx 10.3$. The example below reproduces exactly these numbers in three languages.

A subtle implementation point: the neighbor set $N_k(x)$ must be made deterministic when distances tie, otherwise the ratio above can change between runs and between languages. The reference implementations break ties by ascending point index, which fixes a single canonical neighborhood. They also floor $\operatorname{lrd}$ at $+\infty$ when a point coincides with all of its neighbors (zero mean reachability distance), and define the resulting score as $0$, so duplicate points never produce a NaN.

151.3.5 3.5 Implementation

The aiinaction companion library ships a from-scratch LOF in Python, Julia, and Rust that share one source of truth: the same five-point fixture (a unit square plus a far outlier) and the same numeric results to floating-point tolerance. Ties in distance are broken by ascending index so the neighborhoods, and therefore the scores, are reproducible across all three. The example fits LOF with $k = 2$ and reports the per-point scores together with the ranked anomalies.

Code

from aiinaction.ch146_lof import lof_scores, top_anomalies, k_distance

# Four points on a unit square (a dense cluster) plus one far outlier.
X = [
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 1.0],
    [8.0, 8.0],
]

scores = lof_scores(X, k=2)
for i, s in enumerate(scores):
    print(f"point {i}: LOF = {s:.4f}")

print("k-distances:", [round(v, 4) for v in k_distance(X, k=2)])
print("most anomalous (top 2):", top_anomalies(X, k=2, m=2))

point 0: LOF = 1.0000
point 1: LOF = 1.0000
point 2: LOF = 1.0000
point 3: LOF = 1.0000
point 4: LOF = 10.2648
k-distances: [1.0, 1.0, 1.0, 1.0, 10.6301]
most anomalous (top 2): [4, 0]

The four cluster points score exactly 1.0; the isolated point scores about 10.26, and top_anomalies ranks it first.

using AIInAction.Ch146Lof

# Same fixture as the Python example (one point per row).
X = [
    0.0 0.0
    0.0 1.0
    1.0 0.0
    1.0 1.0
    8.0 8.0
]

scores = lof_scores(X, 2)
for (i, s) in enumerate(scores)
    println("point $(i - 1): LOF = ", round(s, digits=4))
end

println("k-distances: ", round.(k_distance(X, 2), digits=4))
# Julia indices are 1-based; the outlier is point 5 (point 4 zero-based).
println("most anomalous (top 2): ", top_anomalies(X, 2, 2))

use aiinaction::ch146_lof::{lof_scores, top_anomalies, k_distance, Matrix};

fn main() {
    // Same fixture as the Python example.
    let x = Matrix::from_rows(&[
        vec![0.0, 0.0],
        vec![0.0, 1.0],
        vec![1.0, 0.0],
        vec![1.0, 1.0],
        vec![8.0, 8.0],
    ])
    .unwrap();

    let scores = lof_scores(&x, 2).unwrap();
    for (i, s) in scores.iter().enumerate() {
        println!("point {}: LOF = {:.4}", i, s);
    }

    println!("k-distances: {:?}", k_distance(&x, 2).unwrap());
    println!("most anomalous (top 2): {:?}", top_anomalies(&x, 2, 2).unwrap());
}

151.4 4. The Connectivity-Based Outlier Factor

The connectivity-based outlier factor, or COF, proposed by Tang and colleagues in 2002, addresses a structural blind spot of LOF. LOF assumes that the neighbors of a normal point are distributed roughly isotropically, like a spherical cloud. When normal data lie along a curve, a line, or any low-dimensional manifold embedded in the feature space, LOF can misjudge points on the manifold because their neighbors are arranged anisotropically rather than in a ball.

151.4.1 4.1 Set-Based Nearest Path

COF replaces the density notion with a connectivity notion built from a set-based nearest path, a greedy chain that grows the neighborhood one point at a time. Starting from $x$, the path repeatedly appends the data point nearest to the current set, producing an ordering $\{x = p_1, p_2, \dots, p_{k+1}\}$ of $x$ together with its $k$ neighbors. At each step $i$ the cost $e_i$ is the edge length that connected the new point to the growing set. This construction follows the local geometry of the data, tracing along a manifold rather than insisting on a sphere.

151.4.2 4.2 Average Chaining Distance

The average chaining distance weights earlier, shorter edges more heavily,

\[ \operatorname{ac\text{-}dist}_k(x) = \sum_{i=1}^{k} \frac{2(k + 1 - i)}{k(k+1)} \, e_i . \]

The weights $\tfrac{2(k+1-i)}{k(k+1)}$ are positive, decreasing in $i$, and sum to one, so $\operatorname{ac\text{-}dist}$ is a weighted average of the path edge lengths that emphasizes the steps taken when the set was still small and close to $x$.

151.4.3 4.3 The COF Score

The COF score compares the average chaining distance of $x$ to those of its neighbors,

\[ \operatorname{COF}_k(x) = \frac{|N_k(x)| \cdot \operatorname{ac\text{-}dist}_k(x)}{\sum_{y \in N_k(x)} \operatorname{ac\text{-}dist}_k(y)} . \]

As with LOF, a value near $1$ indicates that $x$ is as well connected to the data as its neighbors are, while a value substantially greater than $1$ marks $x$ as poorly connected and therefore anomalous. Because connectivity is measured along the nearest path rather than through an isotropic density estimate, COF correctly handles points distributed along low-dimensional structures, flagging only those that deviate from the local pattern of connection. The cost is additional computation, since each chaining distance requires constructing a path, and a sensitivity to noise that can fragment the chain.

151.5 5. The Curse of Dimensionality

Every method above depends on the meaningfulness of distances and of the contrast between near and far. In high-dimensional spaces this contrast erodes, a phenomenon known as the curse of dimensionality, and it undermines distance-based detection at its foundation.

151.5.1 5.1 Distance Concentration

Consider $n$ points drawn independently from a distribution in $\mathbb{R}^d$. Under broad conditions, as $d$ grows the pairwise distances concentrate around a common value. A widely cited formalization states that

\[ \lim_{d \to \infty} \frac{\mathbb{E}\big[\, \operatorname{dist}_{\max}(d) - \operatorname{dist}_{\min}(d) \,\big]}{\mathbb{E}\big[\, \operatorname{dist}_{\min}(d) \,\big]} \to 0, \]

where $\operatorname{dist}_{\max}$ and $\operatorname{dist}_{\min}$ are the largest and smallest distances from a query point to the data. The relative gap between the nearest and farthest neighbor vanishes. When the closest and most distant points are nearly equidistant, the very idea of a nearest neighbor loses discriminative power, and a kNN distance, an $\operatorname{lrd}$, or an $\operatorname{ac\text{-}dist}$ computed from such distances carries little signal.

A complementary way to see the effect is through the coefficient of variation of distances. For independent and identically distributed coordinates the squared Euclidean distance is a sum of $d$ terms, so its mean grows like $d$ while its standard deviation grows like $\sqrt{d}$, giving a relative spread that decays as $1/\sqrt{d}$. Distances become almost constant.

151.5.2 5.2 Consequences for Detection

Three practical failures follow. First, scores compress: the LOF and COF ratios drift toward $1$ for all points, collapsing the separation between normal and anomalous. Second, the choice of metric matters more, since fractional $L_p$ norms with $p < 1$ retain contrast somewhat better than the Euclidean norm in high dimensions. Third, irrelevant features dominate, because every noisy dimension contributes to the distance and dilutes the few informative ones, so an anomaly that is obvious in a two-feature subspace becomes invisible in the full space.

151.5.3 5.3 Mitigations

The standard responses fall into two families. One reduces dimensionality before applying a distance method, through principal component analysis, random projection justified by the Johnson Lindenstrauss lemma, autoencoders, or feature selection, so that distances are computed in a space where they remain meaningful. The other abandons full-space distances in favor of subspace and ensemble methods, which search for low-dimensional projections in which an anomaly stands out and then aggregate the evidence. Feature bagging for LOF and subspace outlier detection are representative. In genuinely high-dimensional regimes these strategies routinely outperform any single full-space distance score.

151.6 6. Practical Guidance

Distance-based detectors share a workflow. Standardize features so that no dimension dominates the metric merely through scale. Choose $k$ deliberately, large enough to exceed plausible anomaly cluster sizes yet small enough to preserve locality, and report sensitivity to $k$ rather than a single value. Prefer LOF or COF over global kNN distance when the data contain regions of differing density, and prefer COF when normal data follow low-dimensional structure. Above roughly ten to twenty informative dimensions, treat raw full-space distances with suspicion and reach for dimensionality reduction or subspace ensembles. Finally, because all three scores are unsupervised rankings, validate thresholds against whatever labeled anomalies or domain knowledge are available, since the methods produce orderings, not calibrated probabilities.

151.7 7. Summary

The k-nearest-neighbor distance offers a transparent global measure of isolation but assumes a single density scale. The local outlier factor restores locality by scoring each point against the density of its own neighborhood, and the connectivity-based outlier factor further accommodates data that lie along low-dimensional manifolds by measuring connectivity along a nearest path rather than through isotropic density. All three are bound together by their dependence on meaningful distances, and all three degrade as dimensionality grows and distance contrast concentrates. Understanding both the geometric intuition and this shared failure mode is what lets a practitioner choose, tune, and trust a distance-based detector.

151.8 References

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying Density-Based Local Outliers. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. https://dl.acm.org/doi/10.1145/342009.335388
Tang, J., Chen, Z., Fu, A. W.-C., and Cheung, D. W. Enhancing Effectiveness of Outlier Detections for Low Density Patterns. Advances in Knowledge Discovery and Data Mining (PAKDD), 2002. https://link.springer.com/chapter/10.1007/3-540-47887-6_53
Ramaswamy, S., Rastogi, R., and Shim, K. Efficient Algorithms for Mining Outliers from Large Data Sets. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. https://dl.acm.org/doi/10.1145/342009.335437
Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When Is Nearest Neighbor Meaningful? Proceedings of the International Conference on Database Theory (ICDT), 1999. https://link.springer.com/chapter/10.1007/3-540-49257-7_15
Aggarwal, C. C., Hinneburg, A., and Keim, D. A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. Proceedings of the International Conference on Database Theory (ICDT), 2001. https://link.springer.com/chapter/10.1007/3-540-44503-X_27
Lazarevic, A., and Kumar, V. Feature Bagging for Outlier Detection. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005. https://dl.acm.org/doi/10.1145/1081870.1081891
Aggarwal, C. C. Outlier Analysis, 2nd edition. Springer, 2017. https://link.springer.com/book/10.1007/978-3-319-47578-3

# Distance-Based Anomaly Detection Distance-based anomaly detection rests on a simple geometric intuition: normal observations lie in dense regions of the feature space, surrounded by many close neighbors, while anomalies sit far from the rest of the data. Unlike model-based approaches that fit a parametric distribution and then test for tail membership, distance methods make almost no assumptions about the shape of the data generating process. They require only a metric and a notion of locality. This chapter develops the principal distance-based scores used in practice, the k-nearest-neighbor distance, the local outlier factor, and the connectivity-based outlier factor, and then confronts the phenomenon that limits all of them in high dimensions: the curse of dimensionality. ## 1. Setup and Notation Let $X = \{x_1, \dots, x_n\}$ be a dataset with each $x_i \in \mathbb{R}^d$, equipped with a distance function $\operatorname{dist}(x_i, x_j)$. Throughout we assume the Euclidean metric $$ \operatorname{dist}(x_i, x_j) = \|x_i - x_j\|_2 = \left( \sum_{m=1}^{d} (x_{im} - x_{jm})^2 \right)^{1/2}, $$ though everything below generalizes to any Minkowski $L_p$ distance or to a learned metric. We write $N_k(x)$ for the set of the $k$ nearest neighbors of a point $x$ (excluding $x$ itself), and we treat anomaly detection as an unsupervised ranking problem: the goal is to assign each point a real-valued score such that larger scores correspond to greater outlierness. A downstream threshold or top-$m$ cutoff then converts scores into labels. The central design question is what reference set a point should be compared against. Global methods compare a point to the entire dataset and implicitly assume a single density scale. Local methods compare a point only to its own neighborhood and so tolerate datasets with regions of differing density. The progression from k-nearest-neighbor distance to local outlier factor to connectivity-based outlier factor is precisely a progression in how locality is handled. ## 2. The k-Nearest-Neighbor Distance The most direct distance-based score is the distance from a point to its $k$-th nearest neighbor. Define $$ \operatorname{kNN}(x) = \operatorname{dist}\big(x, \, \text{the } k\text{-th nearest neighbor of } x\big). $$ A point in a dense cluster has a small $k$-th neighbor distance because $k$ points lie nearby; an isolated point has a large one. A common variant averages over the whole neighborhood rather than using the single $k$-th value, $$ \operatorname{kNN}_{\text{avg}}(x) = \frac{1}{k} \sum_{y \in N_k(x)} \operatorname{dist}(x, y), $$ which is smoother and less sensitive to the exact choice of $k$. Both are global scores in the sense that the same threshold is applied everywhere. The parameter $k$ governs a bias variance tradeoff. Small $k$ makes the score sensitive to individual nearby points and therefore noisy, and it can fail when anomalies arrive in small tight clusters of size up to $k$, since each member then has $k-1$ close companions and a deceptively small score. Large $k$ smooths the estimate but blurs fine local structure and raises computational cost. A practical rule is to choose $k$ larger than the size of any anomaly micro cluster one wishes to detect. ```text score(x) = distance(x, kth_nearest_neighbor(x)) rank all points by score, descending flag the top m as anomalies ``` The fundamental weakness of the global kNN distance is its single density scale. Consider two Gaussian clusters, one tight and one diffuse. A point on the natural edge of the diffuse cluster may have a larger absolute $k$-th neighbor distance than a genuine outlier sitting just outside the tight cluster. The global threshold cannot separate these cases. Resolving this requires normalizing each point's distance by the typical distance in its own neighborhood, which is exactly what the local outlier factor does. ## 3. The Local Outlier Factor The local outlier factor, introduced by Breunig and colleagues in 2000, converts the absolute kNN distance into a relative density ratio. It produces a score near $1$ for points whose local density matches that of their neighbors and a score well above $1$ for points that are substantially sparser than their surroundings. ### 3.1 Reachability Distance To stabilize density estimates, LOF replaces the raw distance with a smoothed quantity. Let $\operatorname{kdist}(y)$ denote the distance from $y$ to its $k$-th nearest neighbor. The reachability distance of $x$ with respect to $y$ is $$ \operatorname{reach\text{-}dist}_k(x, y) = \max\big\{ \operatorname{kdist}(y), \, \operatorname{dist}(x, y) \big\}. $$ If $x$ is far from $y$, the true distance is used. If $x$ lies inside the $k$-neighborhood of $y$, the value is floored at $\operatorname{kdist}(y)$. This flooring reduces the statistical fluctuation of distances among very close points and is the only role of the asymmetric construction. ### 3.2 Local Reachability Density The local reachability density of $x$ is the inverse of the average reachability distance from $x$ to its neighbors, $$ \operatorname{lrd}_k(x) = \left( \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \operatorname{reach\text{-}dist}_k(x, y) \right)^{-1}. $$ A high $\operatorname{lrd}$ means $x$ sits in a dense pocket; a low $\operatorname{lrd}$ means its neighbors are typically far away. ### 3.3 The LOF Score The LOF score compares the local density of $x$ to the local densities of its neighbors, $$ \operatorname{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\operatorname{lrd}_k(y)}{\operatorname{lrd}_k(x)}. $$ The interpretation is clean. When $\operatorname{LOF}_k(x) \approx 1$, the point is as dense as its neighbors and is considered normal. When $\operatorname{LOF}_k(x) \gg 1$, the neighbors are much denser than $x$, marking $x$ as a local outlier. Values below $1$ indicate a point in a region denser than its neighbors, which under most conventions is treated as strongly normal. ```text for each x: compute k-distance and N_k(x) for each x: lrd(x) = 1 / mean over y in N_k(x) of reach_dist(x, y) for each x: LOF(x) = mean over y in N_k(x) of lrd(y) / lrd(x) ``` Because every quantity in $\operatorname{LOF}$ is a ratio of local densities, the method is invariant to the absolute density scale of a region. In the tight versus diffuse cluster example, an edge point of the diffuse cluster has neighbors of similarly low density, so its ratio stays near $1$, while a true outlier near the tight cluster has dense neighbors and a large ratio. This local normalization is the decisive advantage of LOF over the global kNN distance. The price is increased sensitivity to $k$ and a tendency to produce false positives along legitimate cluster boundaries where density changes sharply. ### 3.4 Why the Inlier Score Is Exactly One The scale-invariance claim can be made precise. Suppose every point in $N_k(x)$ and every point in their neighborhoods shares a common local reachability density $\rho$, that is $\operatorname{lrd}_k(x) = \operatorname{lrd}_k(y) = \rho$ for all $y \in N_k(x)$. Then $$ \operatorname{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\operatorname{lrd}_k(y)}{\operatorname{lrd}_k(x)} = \frac{1}{|N_k(x)|} \sum_{y \in N_k(x)} \frac{\rho}{\rho} = 1 . $$ The density $\rho$ cancels identically, so the score does not depend on the absolute scale of the region: a point embedded in a uniformly dense pocket scores exactly $1$ whether that pocket is tight or diffuse. This is the algebraic content of "locality." Worked numerically, the four corners of a unit square each have two neighbors at reachability distance $1$, giving $\operatorname{lrd} = 1$ for all four and hence $\operatorname{LOF} = 1$ each. Insert a fifth point far away and its neighbors retain $\operatorname{lrd} = 1$ while its own $\operatorname{lrd}$ collapses to $1 / \overline{\text{reach}} \approx 0.097$, so its $\operatorname{LOF} = \overline{\operatorname{lrd}(y)} / \operatorname{lrd}(x) \approx 1/0.097 \approx 10.3$. The example below reproduces exactly these numbers in three languages. A subtle implementation point: the neighbor set $N_k(x)$ must be made deterministic when distances tie, otherwise the ratio above can change between runs and between languages. The reference implementations break ties by ascending point index, which fixes a single canonical neighborhood. They also floor $\operatorname{lrd}$ at $+\infty$ when a point coincides with all of its neighbors (zero mean reachability distance), and define the resulting score as $0$, so duplicate points never produce a NaN. ### 3.5 Implementation The `aiinaction` companion library ships a from-scratch LOF in Python, Julia, and Rust that share one source of truth: the same five-point fixture (a unit square plus a far outlier) and the same numeric results to floating-point tolerance. Ties in distance are broken by ascending index so the neighborhoods, and therefore the scores, are reproducible across all three. The example fits LOF with $k = 2$ and reports the per-point scores together with the ranked anomalies. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch146_lof import lof_scores, top_anomalies, k_distance # Four points on a unit square (a dense cluster) plus one far outlier. X = [ [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [8.0, 8.0], ] scores = lof_scores(X, k=2) for i, s in enumerate(scores): print(f"point {i}: LOF = {s:.4f}") print("k-distances:", [round(v, 4) for v in k_distance(X, k=2)]) print("most anomalous (top 2):", top_anomalies(X, k=2, m=2)) ``` The four cluster points score exactly `1.0`; the isolated point scores about `10.26`, and `top_anomalies` ranks it first. ## Julia ```julia using AIInAction.Ch146Lof # Same fixture as the Python example (one point per row). X = [ 0.0 0.0 0.0 1.0 1.0 0.0 1.0 1.0 8.0 8.0 ] scores = lof_scores(X, 2) for (i, s) in enumerate(scores) println("point $(i - 1): LOF = ", round(s, digits=4)) end println("k-distances: ", round.(k_distance(X, 2), digits=4)) # Julia indices are 1-based; the outlier is point 5 (point 4 zero-based). println("most anomalous (top 2): ", top_anomalies(X, 2, 2)) ``` ## Rust ```rust use aiinaction::ch146_lof::{lof_scores, top_anomalies, k_distance, Matrix}; fn main() { // Same fixture as the Python example. let x = Matrix::from_rows(&[ vec![0.0, 0.0], vec![0.0, 1.0], vec![1.0, 0.0], vec![1.0, 1.0], vec![8.0, 8.0], ]) .unwrap(); let scores = lof_scores(&x, 2).unwrap(); for (i, s) in scores.iter().enumerate() { println!("point {}: LOF = {:.4}", i, s); } println!("k-distances: {:?}", k_distance(&x, 2).unwrap()); println!("most anomalous (top 2): {:?}", top_anomalies(&x, 2, 2).unwrap()); } ``` ::: ## 4. The Connectivity-Based Outlier Factor The connectivity-based outlier factor, or COF, proposed by Tang and colleagues in 2002, addresses a structural blind spot of LOF. LOF assumes that the neighbors of a normal point are distributed roughly isotropically, like a spherical cloud. When normal data lie along a curve, a line, or any low-dimensional manifold embedded in the feature space, LOF can misjudge points on the manifold because their neighbors are arranged anisotropically rather than in a ball. ### 4.1 Set-Based Nearest Path COF replaces the density notion with a connectivity notion built from a set-based nearest path, a greedy chain that grows the neighborhood one point at a time. Starting from $x$, the path repeatedly appends the data point nearest to the current set, producing an ordering $\{x = p_1, p_2, \dots, p_{k+1}\}$ of $x$ together with its $k$ neighbors. At each step $i$ the cost $e_i$ is the edge length that connected the new point to the growing set. This construction follows the local geometry of the data, tracing along a manifold rather than insisting on a sphere. ### 4.2 Average Chaining Distance The average chaining distance weights earlier, shorter edges more heavily, $$ \operatorname{ac\text{-}dist}_k(x) = \sum_{i=1}^{k} \frac{2(k + 1 - i)}{k(k+1)} \, e_i . $$ The weights $\tfrac{2(k+1-i)}{k(k+1)}$ are positive, decreasing in $i$, and sum to one, so $\operatorname{ac\text{-}dist}$ is a weighted average of the path edge lengths that emphasizes the steps taken when the set was still small and close to $x$. ### 4.3 The COF Score The COF score compares the average chaining distance of $x$ to those of its neighbors, $$ \operatorname{COF}_k(x) = \frac{|N_k(x)| \cdot \operatorname{ac\text{-}dist}_k(x)}{\sum_{y \in N_k(x)} \operatorname{ac\text{-}dist}_k(y)} . $$ As with LOF, a value near $1$ indicates that $x$ is as well connected to the data as its neighbors are, while a value substantially greater than $1$ marks $x$ as poorly connected and therefore anomalous. Because connectivity is measured along the nearest path rather than through an isotropic density estimate, COF correctly handles points distributed along low-dimensional structures, flagging only those that deviate from the local pattern of connection. The cost is additional computation, since each chaining distance requires constructing a path, and a sensitivity to noise that can fragment the chain. ## 5. The Curse of Dimensionality Every method above depends on the meaningfulness of distances and of the contrast between near and far. In high-dimensional spaces this contrast erodes, a phenomenon known as the curse of dimensionality, and it undermines distance-based detection at its foundation. ### 5.1 Distance Concentration Consider $n$ points drawn independently from a distribution in $\mathbb{R}^d$. Under broad conditions, as $d$ grows the pairwise distances concentrate around a common value. A widely cited formalization states that $$ \lim_{d \to \infty} \frac{\mathbb{E}\big[\, \operatorname{dist}_{\max}(d) - \operatorname{dist}_{\min}(d) \,\big]}{\mathbb{E}\big[\, \operatorname{dist}_{\min}(d) \,\big]} \to 0, $$ where $\operatorname{dist}_{\max}$ and $\operatorname{dist}_{\min}$ are the largest and smallest distances from a query point to the data. The relative gap between the nearest and farthest neighbor vanishes. When the closest and most distant points are nearly equidistant, the very idea of a nearest neighbor loses discriminative power, and a kNN distance, an $\operatorname{lrd}$, or an $\operatorname{ac\text{-}dist}$ computed from such distances carries little signal. A complementary way to see the effect is through the coefficient of variation of distances. For independent and identically distributed coordinates the squared Euclidean distance is a sum of $d$ terms, so its mean grows like $d$ while its standard deviation grows like $\sqrt{d}$, giving a relative spread that decays as $1/\sqrt{d}$. Distances become almost constant. ### 5.2 Consequences for Detection Three practical failures follow. First, scores compress: the LOF and COF ratios drift toward $1$ for all points, collapsing the separation between normal and anomalous. Second, the choice of metric matters more, since fractional $L_p$ norms with $p < 1$ retain contrast somewhat better than the Euclidean norm in high dimensions. Third, irrelevant features dominate, because every noisy dimension contributes to the distance and dilutes the few informative ones, so an anomaly that is obvious in a two-feature subspace becomes invisible in the full space. ### 5.3 Mitigations The standard responses fall into two families. One reduces dimensionality before applying a distance method, through principal component analysis, random projection justified by the Johnson Lindenstrauss lemma, autoencoders, or feature selection, so that distances are computed in a space where they remain meaningful. The other abandons full-space distances in favor of subspace and ensemble methods, which search for low-dimensional projections in which an anomaly stands out and then aggregate the evidence. Feature bagging for LOF and subspace outlier detection are representative. In genuinely high-dimensional regimes these strategies routinely outperform any single full-space distance score. ## 6. Practical Guidance Distance-based detectors share a workflow. Standardize features so that no dimension dominates the metric merely through scale. Choose $k$ deliberately, large enough to exceed plausible anomaly cluster sizes yet small enough to preserve locality, and report sensitivity to $k$ rather than a single value. Prefer LOF or COF over global kNN distance when the data contain regions of differing density, and prefer COF when normal data follow low-dimensional structure. Above roughly ten to twenty informative dimensions, treat raw full-space distances with suspicion and reach for dimensionality reduction or subspace ensembles. Finally, because all three scores are unsupervised rankings, validate thresholds against whatever labeled anomalies or domain knowledge are available, since the methods produce orderings, not calibrated probabilities. ## 7. Summary The k-nearest-neighbor distance offers a transparent global measure of isolation but assumes a single density scale. The local outlier factor restores locality by scoring each point against the density of its own neighborhood, and the connectivity-based outlier factor further accommodates data that lie along low-dimensional manifolds by measuring connectivity along a nearest path rather than through isotropic density. All three are bound together by their dependence on meaningful distances, and all three degrade as dimensionality grows and distance contrast concentrates. Understanding both the geometric intuition and this shared failure mode is what lets a practitioner choose, tune, and trust a distance-based detector. ## References 1. Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying Density-Based Local Outliers. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. https://dl.acm.org/doi/10.1145/342009.335388 2. Tang, J., Chen, Z., Fu, A. W.-C., and Cheung, D. W. Enhancing Effectiveness of Outlier Detections for Low Density Patterns. Advances in Knowledge Discovery and Data Mining (PAKDD), 2002. https://link.springer.com/chapter/10.1007/3-540-47887-6_53 3. Ramaswamy, S., Rastogi, R., and Shim, K. Efficient Algorithms for Mining Outliers from Large Data Sets. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. https://dl.acm.org/doi/10.1145/342009.335437 4. Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When Is Nearest Neighbor Meaningful? Proceedings of the International Conference on Database Theory (ICDT), 1999. https://link.springer.com/chapter/10.1007/3-540-49257-7_15 5. Aggarwal, C. C., Hinneburg, A., and Keim, D. A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. Proceedings of the International Conference on Database Theory (ICDT), 2001. https://link.springer.com/chapter/10.1007/3-540-44503-X_27 6. Lazarevic, A., and Kumar, V. Feature Bagging for Outlier Detection. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005. https://dl.acm.org/doi/10.1145/1081870.1081891 7. Aggarwal, C. C. Outlier Analysis, 2nd edition. Springer, 2017. https://link.springer.com/book/10.1007/978-3-319-47578-3