149 Anomaly Detection Foundations

Anomaly detection is the task of identifying observations that deviate so markedly from the bulk of the data that they arouse suspicion of having been generated by a different mechanism. The phrasing comes from Hawkins and captures the central intuition that anomalies are not merely rare points but points whose generative process differs from the norm. This chapter develops the conceptual scaffolding that every practitioner needs before reaching for a specific algorithm: a taxonomy of anomaly types, the learning settings defined by label availability, the statistical pathology known as the base-rate problem, and the evaluation machinery appropriate to severely imbalanced detection tasks.

The two governing axes of the field, anomaly type and label availability, can be held in mind from the outset.

flowchart TD
    A["Anomaly detection problem"] --> B["Axis 1: anomaly type"]
    A --> C["Axis 2: label availability"]
    B --> B1["Point"]
    B --> B2["Contextual"]
    B --> B3["Collective"]
    C --> C1["Supervised: normal and anomaly labels"]
    C --> C2["Unsupervised: no labels"]
    C --> C3["Semi-supervised: normal only"]

Figure 149.1: The two design axes that determine method selection in anomaly detection.

Overlaying both axes is the base-rate problem, the arithmetic fact that rarity crushes precision, which dictates how detectors must be evaluated. The chapter closes with that evaluation machinery and a disciplined protocol for reporting results honestly.

149.1 1. What Counts as an Anomaly

An anomaly, also called an outlier or a novelty depending on context, is an observation that is inconsistent with a model of normal behavior. The qualifier “model of normal behavior” matters. Anomaly is not an intrinsic property of a point in isolation; it is a relationship between a point and a reference distribution. A heart rate of 180 beats per minute is anomalous at rest and entirely normal during a sprint. This relational character is what makes the problem subtle and what forces us to be precise about the context in which normality is defined.

Definition: anomaly

Fix a normal generating process with density $p_{\text{normal}}$ over a feature space $\mathcal{X}$. An observation $x \in \mathcal{X}$ is an anomaly when it is poorly explained by $p_{\text{normal}}$, formalized either as low density $p_{\text{normal}}(x) < \tau$ for a threshold $\tau$, or as a likelihood ratio favoring an alternative process, $p_{\text{anom}}(x) / p_{\text{normal}}(x) > 1$. Anomaly is defined relative to $p_{\text{normal}}$, never absolutely.

149.1.1 1.1 The Generative Framing

Let $p_{\text{normal}}(x)$ denote the density of the normal generating process over a feature space $\mathcal{X}$. A point $x$ is anomalous to the degree that it has low density under this process or, in the closely related likelihood-ratio view, to the degree that it is better explained by some alternative process $p_{\text{anom}}(x)$. The two views coincide when $p_{\text{anom}}$ is taken to be uniform over $\mathcal{X}$, since the ratio then depends on $x$ only through $p_{\text{normal}}(x)$. Many detectors implicitly or explicitly estimate $p_{\text{normal}}$ with $\hat p_{\text{normal}}$ and then assign an anomaly score equal to the negative log density, flagging points that exceed a threshold:

\[ s(x) = -\log \hat p_{\text{normal}}(x), \qquad \text{flag } x \iff s(x) > \tau . \]

The negative log makes the score a surprisal: it is large exactly where the model is surprised. The threshold $\tau$ is the single tunable knob that trades recall against precision, and choosing it well is the operational heart of the field.
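To make the surprisal score concrete, here is a minimal sketch that assumes $\hat p_{\text{normal}}$ is a single Gaussian fitted to a small sample. The heart-rate values, the Gaussian assumption, and the threshold `tau = 8.0` are all illustrative choices, not part of any particular detector.

```python
import math
from statistics import NormalDist

# Hypothetical sample of normal resting heart rates (beats per minute).
normal_sample = [62, 65, 68, 70, 71, 72, 74, 75, 78, 80]

# Assumption for illustration: p_normal is estimated by a single Gaussian.
p_hat = NormalDist.from_samples(normal_sample)

def surprisal(x: float) -> float:
    """Anomaly score s(x) = -log p_hat(x), computed in log space to avoid underflow."""
    z = (x - p_hat.mean) / p_hat.stdev
    return 0.5 * z * z + math.log(p_hat.stdev * math.sqrt(2 * math.pi))

tau = 8.0  # illustrative threshold only

for x in (72, 180):
    s = surprisal(x)
    print(f"x={x}: s(x)={s:.2f}, flagged={s > tau}")
```

A reading of 72 sits near the fitted mean and receives a small score, while 180 lies many standard deviations away and is flagged. Moving `tau` up or down is exactly the recall-versus-precision trade described above.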

This density view unifies a large family of methods, but it is not the only framing. The same scoring intuition surfaces in three other guises that avoid estimating a density directly.

Distance based. Score by remoteness from the data, for example the distance to the $k$-th nearest neighbor. Remote points are sparse, and sparsity is a nonparametric proxy for low density.
Reconstruction based. Fit a low-capacity map (principal components, an autoencoder) to normal data and score by reconstruction error $\lVert x - \hat x \rVert$. The map compresses what is typical, so atypical points reconstruct poorly.
Isolation based. Score by how easily a point is separated from the rest under random partitioning, as in isolation forest. Anomalies are isolated in few splits because they lie in sparse regions.

Each of these can be read as a proxy for the same question, “how surprising is this point under normality,” even when no explicit $\hat p_{\text{normal}}$ is ever formed. This unifying view is developed at length by Ruff and colleagues, who show that shallow and deep detectors alike reduce to a learned score plus a threshold.
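As one instance of a density-free score, the distance-based variant can be sketched in a few lines. The two-dimensional points and the choice $k = 2$ are hypothetical, and the brute-force search is only suitable for tiny datasets.

```python
import math

def knn_score(x, others, k):
    """Distance from x to its k-th nearest neighbor among `others`."""
    dists = sorted(math.dist(x, y) for y in others)
    return dists[k - 1]

# Hypothetical 2-D data: a tight cluster plus one remote point.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
remote = (3.0, 3.0)
points = cluster + [remote]

k = 2
# Score each point against all the other points (the point itself is excluded).
scores = {p: knn_score(p, [q for q in points if q != p], k) for p in points}

for p, s in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{p}: k-NN distance = {s:.3f}")
```

The remote point receives by far the largest score because its neighbors are all far away, which is the sparsity proxy for low density in action.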

149.1.2 1.2 Anomaly, Outlier, Novelty

The terms overlap but carry different operational connotations. An outlier usually refers to a point inside an existing dataset that we wish to identify, often for cleaning. A novelty refers to a genuinely new pattern arriving after a model has been trained on clean data, as in novelty detection. Anomaly is the umbrella term and is the one we use throughout. The distinction governs whether contamination is present at training time, which in turn constrains the learning setting available to us.

149.2 2. A Taxonomy of Anomaly Types

Chandola, Banerjee, and Kumar give the canonical three way classification that organizes almost all of the literature: point, contextual, and collective anomalies. Recognizing which type you face is the single most consequential modeling decision, because methods that excel at one type are frequently blind to another.

149.2.1 2.1 Point Anomalies

A point anomaly is an individual instance that is anomalous with respect to the rest of the data. This is the simplest and most studied case. A single credit card transaction of one hundred thousand dollars, when a cardholder’s transactions are typically under two hundred dollars, is a point anomaly. Formally, given a normality region $R \subseteq \mathcal{X}$ that contains the overwhelming majority of probability mass, a point $x$ is a point anomaly if $x \notin R$. Most distance and density methods, including $k$ nearest neighbor distance, local outlier factor, and one class support vector machines, are designed primarily for point anomalies.

149.2.2 2.2 Contextual Anomalies

A contextual anomaly, also called a conditional anomaly, is an instance that is anomalous only in a specific context, while being perfectly ordinary in another. The feature set is partitioned into contextual attributes and behavioral attributes. The contextual attributes define the situation, such as time of day, geographic location, or season. The behavioral attributes carry the measurement whose normality is conditional on the context.

The defining property is that the same behavioral value can be normal or anomalous depending on context. A temperature of thirty degrees Celsius is unremarkable in July and extraordinary in January. The detector must therefore model a conditional distribution $p(\text{behavioral} \mid \text{context})$ rather than a marginal one. A point $x = (c, b)$ with context $c$ and behavior $b$ is contextually anomalous when $b$ is improbable under $p(b \mid c)$ even though it may be probable under the marginal $p(b)$:

\[ s(c, b) = -\log \hat p(b \mid c). \]

The two views diverge precisely when $b$ and $c$ are dependent. If they were independent, $p(b \mid c) = p(b)$ and contextual detection would collapse to point detection. A useful diagnostic is therefore to ask whether the behavioral variable’s distribution actually shifts with context; if it does not, the extra machinery buys nothing.
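The difference between the marginal and the conditional score can be seen in a small sketch. It assumes one Gaussian per context as the estimate of $p(b \mid c)$; the temperature histories are invented for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical daily temperatures (Celsius), grouped by the context attribute (month).
history = {
    "January": [-2, 0, 1, 3, 4, 2, -1, 0, 1, 2],
    "July": [26, 28, 29, 30, 31, 27, 32, 29, 30, 28],
}

# Assumption for illustration: one Gaussian per context estimates p(b | c).
conditional = {c: NormalDist.from_samples(v) for c, v in history.items()}
# The marginal model p(b) pools all values and ignores context.
marginal = NormalDist.from_samples([b for v in history.values() for b in v])

def neg_log_density(model: NormalDist, b: float) -> float:
    """Surprisal -log p(b) under a Gaussian model, computed in log space."""
    z = (b - model.mean) / model.stdev
    return 0.5 * z * z + math.log(model.stdev * math.sqrt(2 * math.pi))

b = 30.0
s_marginal = neg_log_density(marginal, b)
s_january = neg_log_density(conditional["January"], b)
s_july = neg_log_density(conditional["July"], b)
print(f"marginal: {s_marginal:.2f}  January: {s_january:.2f}  July: {s_july:.2f}")
```

Thirty degrees scores as unremarkable under the marginal model and under the July model, but as extremely surprising under the January model, which is the behavior a contextual detector needs.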

Time series provide the richest source of contextual anomalies, because the timestamp is a natural contextual attribute. Spatial data, where location is the context, is the other common source. Failing to model context is the most common reason a marginal point detector misses real anomalies while drowning the analyst in false alarms on perfectly normal seasonal extremes.

149.2.3 2.3 Collective Anomalies

A collective anomaly is a set of related instances that is anomalous with respect to the entire dataset, even though the individual instances within the set may not be anomalous on their own. The anomaly lives in the relationship among the points, not in any single point. A classic example is an electrocardiogram trace in which a low value persists for an unusually long interval. Each individual low reading is within the normal range, but the sustained run of low readings is the anomaly.

Collective anomalies require that the data have structure: a sequence, a graph, or a spatial arrangement that makes “a related set” meaningful. They cannot occur in a dataset of independent and identically distributed records, because there is no relationship to violate. Detection typically operates over subsequences or subgraphs, scoring a window $W = (x_t, x_{t+1}, \ldots, x_{t+\ell})$ by its distance to the nearest normal pattern in a reference library $\mathcal{L}$:

\[ s(W) = \min_{W' \in \mathcal{L}} d(W, W'). \]
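A minimal sketch of the window score follows, assuming Euclidean distance for $d$ and a reference library built from a short, clean periodic signal. Both series are hypothetical, and the window length of four corresponds to $\ell = 3$ in the notation above.

```python
import math

def windows(series, length):
    """All contiguous windows of the given length, as tuples."""
    return [tuple(series[i:i + length]) for i in range(len(series) - length + 1)]

def window_score(w, library):
    """s(W): distance from window w to the nearest normal pattern in the library."""
    return min(math.dist(w, ref) for ref in library)

# Hypothetical clean reference signal; its windows form the library of normal patterns.
reference = [0, 1, 0, 1, 0, 1, 0, 1]
library = set(windows(reference, 4))

# Test series with a sustained run of zeros. Every single value (0 or 1) is normal.
test_series = [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [window_score(w, library) for w in windows(test_series, 4)]

for start, s in enumerate(scores):
    print(f"window starting at t={start}: s(W)={s:.3f}")
```

No individual reading is unusual, yet the windows that lie entirely inside the run of zeros receive the highest score, because no pattern in the library resembles them.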

The interplay between the three types is worth internalizing. A point anomaly is collective with window length one. A contextual anomaly becomes a point anomaly once the context is folded into the feature representation, for example by appending the season or the time of day to the feature vector. A collective anomaly becomes a point anomaly once a window is embedded as a single feature vector, for example by stacking the values of the window or summarizing it with statistics. Much of the craft of anomaly detection is choosing a representation that turns the anomaly you actually face into the type your chosen detector handles well. The corollary is a warning: the wrong representation can also hide an anomaly, as when point-wise scoring of an electrocardiogram never sees the sustained low run because it never forms a window.

149.3 3. Learning Settings and Label Availability

The second axis of the design space is how much labeled information is available. The three settings are supervised, unsupervised, and semi-supervised, and they correspond to fundamentally different assumptions about the training data.

149.3.1 3.1 Supervised Anomaly Detection

In the supervised setting we possess a training set with labeled normal and labeled anomalous instances. In principle this reduces to ordinary binary classification, and one might ask why a separate field exists at all. The answer is the extreme class imbalance that defines anomaly detection. Anomalies are rare by construction, so the positive class may constitute well under one percent of the data. Standard classifiers trained to minimize average error will happily predict “normal” for everything and achieve high accuracy while detecting nothing.

A second difficulty is that the anomalous class is rarely representative. The anomalies we have labeled are the ones we have already seen; the dangerous ones are the novel attacks and failures we have not. A supervised model fit to known anomalies can overfit to their specific signature and miss the next variant entirely. For these reasons the genuinely supervised setting is less common in practice than its conceptual simplicity would suggest, and when used it is paired with imbalance aware techniques such as cost sensitive losses, resampling, or focal style reweighting.

149.3.2 3.2 Unsupervised Anomaly Detection

In the unsupervised setting no labels are available at all. This is the most common situation in practice. The detector must infer normality from the data itself, under the working assumption that anomalies are both rare and different. Concretely, the algorithm assumes that normal points occupy dense regions while anomalies are sparse and far from the dense core, and it scores points accordingly.

The critical and often unstated assumption is that the training data is dominated by normal instances. If contamination, the fraction of anomalies present, is too high, the notion of “normal” is corrupted and the densest regions may themselves be anomalous clusters. Algorithms such as isolation forest, local outlier factor, and $k$ nearest neighbor distance accept a contamination hyperparameter $\nu \in (0, 1)$ that encodes the analyst’s prior on the anomaly rate and that sets the decision threshold at the corresponding upper quantile of the scores: flag $x$ when $s(x) > Q_{1-\nu}(s)$, where $Q_{1-\nu}$ is the empirical $(1-\nu)$ quantile of the scores over the data.

This makes the limitation explicit. The detector does not discover the anomaly rate; it is told the rate through $\nu$ and merely ranks. Set $\nu$ too low and real anomalies fall below the threshold; set it too high and normal points are flagged. The convenience of needing no labels is paid for in the impossibility of principled threshold selection without ground truth, and in the absence of any direct way to validate the result. In the popular open-source library scikit-learn, this hyperparameter is named contamination; in recent releases its default for isolation forest and local outlier factor is the setting “auto”, which applies a fixed score offset taken from the original papers rather than estimating the rate, precisely because the rate cannot be learned from unlabeled data.
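The quantile rule can be sketched directly. The scores below are made up, with three true anomalies placed at the end so that the effect of a wrong $\nu$ is visible; `flag_by_contamination` is a hypothetical helper, not a library function.

```python
def flag_by_contamination(scores, nu):
    """Indices of the top round(nu * n) scores, i.e. those above the (1 - nu) quantile."""
    n_flag = round(nu * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n_flag])

# Hypothetical scores: 97 normal points followed by 3 true anomalies (indices 97 to 99).
scores = [i / 100 for i in range(97)] + [5.0, 6.0, 7.0]
true_anomalies = {97, 98, 99}

for nu in (0.01, 0.03, 0.05):
    flagged = flag_by_contamination(scores, nu)
    caught = len(flagged & true_anomalies)
    print(f"nu={nu}: flagged={len(flagged)}, caught={caught}, "
          f"false alarms={len(flagged) - caught}")
```

The ranking is identical in all three runs; only the guess for $\nu$ changes how many points are flagged. With $\nu$ too low two anomalies are missed, and with $\nu$ too high two normal points are flagged.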

149.3.3 3.3 Semi-Supervised Anomaly Detection

The semi-supervised setting, which in this literature usually means one class learning, assumes access to a training set consisting only of normal instances. No anomalies are seen during training. The model builds a boundary, a density, or a reconstruction map of normality, and at test time anything that falls outside the learned region of normality is flagged. This is the natural framing for novelty detection and for fault detection in engineered systems, where one can collect abundant data from healthy operation but, by definition, little or none from the rare failure.

One class support vector machines, support vector data description, autoencoder reconstruction error, and deep one class methods all live here. The autoencoder version is especially intuitive: train an encoder-decoder pair $(f, g)$ to reconstruct normal data by minimizing $\lVert x - g(f(x)) \rVert^2$ over normal points, then at inference time score by reconstruction error and flag points that exceed a threshold,

\[ s(x) = \lVert x - g(f(x)) \rVert, \qquad \text{flag } x \iff s(x) > \tau . \]

The logic is that the network allocates its limited capacity to the patterns it saw, so it never learned to reconstruct what it never saw, and anomalies reconstruct poorly. The threshold $\tau$ is commonly set from a held-out sample of normal data, for example at a high quantile of the normal reconstruction errors. Setting it at the $(1-\alpha)$ quantile targets a false alarm rate of roughly $\alpha$ on normal data, provided the normal distribution at deployment matches the held-out sample.
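The same recipe can be sketched with the simplest possible reconstructor. Instead of a neural autoencoder, the example below substitutes a rank-one linear projection (the first principal direction of two-dimensional data), which plays the role of the pair $(f, g)$. The synthetic data, the seed, and the 95 percent quantile are illustrative assumptions.

```python
import math
import random

random.seed(0)

def make_normal(n):
    """Hypothetical normal data: points near the line y = x with small noise."""
    pts = []
    for _ in range(n):
        t = random.gauss(0, 1)
        pts.append((t + random.gauss(0, 0.1), t + random.gauss(0, 0.1)))
    return pts

train, held_out = make_normal(500), make_normal(200)

# Fit the rank-1 linear map: mean plus first principal direction of the training data.
mx = sum(p[0] for p in train) / len(train)
my = sum(p[1] for p in train) / len(train)
sxx = sum((p[0] - mx) ** 2 for p in train)
syy = sum((p[1] - my) ** 2 for p in train)
sxy = sum((p[0] - mx) * (p[1] - my) for p in train)
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the principal axis
u = (math.cos(theta), math.sin(theta))

def score(p):
    """Reconstruction error ||x - g(f(x))|| for the rank-1 projection."""
    dx, dy = p[0] - mx, p[1] - my
    code = dx * u[0] + dy * u[1]                 # f: encode to a single number
    rx, ry = mx + code * u[0], my + code * u[1]  # g: decode back to 2-D
    return math.hypot(p[0] - rx, p[1] - ry)

# Threshold at the 95th percentile of reconstruction errors on held-out normal data.
errors = sorted(score(p) for p in held_out)
tau = errors[len(errors) * 95 // 100 - 1]

anomaly = (2.0, -2.0)  # far from the line y = x
print(f"tau={tau:.3f}, anomaly score={score(anomaly):.3f}, flagged={score(anomaly) > tau}")
```

By construction about five percent of the held-out normal points exceed `tau`, while the off-line point reconstructs poorly and is flagged.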

The semi-supervised setting is attractive precisely because clean normal data is often obtainable when labeled anomalies are not, and because it sidesteps the contamination assumption of the unsupervised setting. Its weakness is sensitivity to distribution shift in the normal class, which produces false alarms when the definition of normal drifts over time. A subtle failure mode of high-capacity reconstructors deserves mention: a network powerful enough to generalize can learn to reconstruct anomalies it never saw, collapsing the error gap that the method relies on, which is why capacity must be deliberately constrained.

149.4 4. The Base-Rate Problem

No topic causes more confusion among newcomers to anomaly detection than the base-rate problem, sometimes called the base-rate fallacy. It explains why a detector that appears excellent in isolation can be useless or worse in deployment, and it follows directly from Bayes’ theorem applied to a rare positive class.

149.4.1 4.1 The Statement

Suppose anomalies have prevalence, or base rate, $\pi = P(A)$, where $A$ denotes the event that a point is anomalous. Let the detector have true positive rate, or recall, $\text{TPR} = P(\text{flag} \mid A)$ and false positive rate $\text{FPR} = P(\text{flag} \mid \neg A)$. The quantity an operator actually cares about is precision, the probability that a flagged point is genuinely anomalous, $P(A \mid \text{flag})$. By Bayes’ theorem:

\[ P(A \mid \text{flag}) = \frac{\text{TPR} \cdot \pi}{\text{TPR} \cdot \pi + \text{FPR} \cdot (1 - \pi)}. \]

When $\pi$ is tiny, the term $\text{FPR} \cdot (1 - \pi)$ in the denominator dominates unless $\text{FPR}$ is extraordinarily small, and precision collapses.

149.4.2 4.2 A Concrete Calculation

Consider a fraud detector with an enviable recall of $0.99$ and a false positive rate of $0.01$, applied to a transaction stream where the fraud rate is $\pi = 0.001$, that is one in a thousand. Plugging in:

\[ P(A \mid \text{flag}) = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.01 \cdot 0.999} = \frac{0.00099}{0.00099 + 0.00999} \approx 0.090. \]

Roughly nine out of every ten alerts are false. A detector with ninety nine percent recall and a one percent false alarm rate, numbers that would be celebrated on most benchmarks, generates an alert stream that is ninety percent noise. This is not a defect of the detector; it is an arithmetic consequence of rarity, and no amount of clever modeling changes the structure of the equation. The only levers are driving $\text{FPR}$ far lower, raising $\pi$ by pre filtering the stream, or accepting that alerts feed a downstream triage process rather than an automated action.

It is worth seeing just how far $\text{FPR}$ must fall. Holding recall at $0.99$ and the base rate at $\pi = 0.001$, the table below shows precision as a function of $\text{FPR}$, computed from the same Bayes formula.

False positive rate $\text{FPR}$	Precision $P(A \mid \text{flag})$
$10^{-2}$	$0.090$
$10^{-3}$	$0.498$
$10^{-4}$	$0.908$
$10^{-5}$	$0.990$

To reach even coin-flip precision the false positive rate must drop to one in a thousand, and to reach ninety percent precision it must drop to one in ten thousand, a hundredfold improvement over the already respectable starting point. This is the quantitative core of why anomaly detection is hard, and why operators speak in terms of an alert budget rather than a single accuracy number.
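The table can be reproduced from the Bayes formula in a few lines. The helper `required_fpr` is an addition obtained by solving that formula for $\text{FPR}$ at a target precision; both function names are hypothetical.

```python
def precision(tpr: float, fpr: float, pi: float) -> float:
    """P(A | flag) from Bayes' theorem."""
    return tpr * pi / (tpr * pi + fpr * (1 - pi))

def required_fpr(tpr: float, pi: float, target: float) -> float:
    """False positive rate needed to reach a target precision (Bayes formula solved for FPR)."""
    return tpr * pi * (1 - target) / (target * (1 - pi))

tpr, pi = 0.99, 0.001
for fpr in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"FPR={fpr:.0e}  precision={precision(tpr, fpr, pi):.3f}")

print(f"FPR needed for 90% precision: {required_fpr(tpr, pi, 0.9):.2e}")
```

Solving for the false positive rate directly shows that ninety percent precision at this recall and base rate requires an $\text{FPR}$ of about $1.1 \times 10^{-4}$, consistent with the third row of the table.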

149.4.3 4.3 Operational Consequences

The base-rate problem reframes what “good” means. It tells us that recall and false positive rate, the quantities most detectors are tuned on, are insufficient summaries when the base rate is low. It motivates a relentless focus on the absolute number of false positives an operator must review per day, often called the alert budget. It explains why precision oriented and rank oriented metrics, discussed next, are the correct evaluation language for this field, and why accuracy is actively misleading.
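The alert budget becomes tangible once rates are converted into daily counts. The sketch below reuses the fraud detector from the calculation above and assumes, purely for illustration, a stream of one million transactions per day.

```python
def daily_alerts(n_per_day: int, pi: float, tpr: float, fpr: float):
    """Expected true and false alerts per day for a stream of n_per_day points."""
    true_alerts = n_per_day * pi * tpr
    false_alerts = n_per_day * (1 - pi) * fpr
    return true_alerts, false_alerts

# Hypothetical volume: one million transactions per day.
tp, fp = daily_alerts(1_000_000, pi=0.001, tpr=0.99, fpr=0.01)
print(f"true alerts/day = {tp:.0f}, false alerts/day = {fp:.0f}")
```

At this volume the detector produces on the order of ten thousand false alerts a day against roughly one thousand true ones, so whether it is deployable depends on how many alerts the triage team can actually review.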

149.5 5. Evaluation

Because anomalies are rare, evaluation demands metrics that are insensitive to the dominant normal class and that respect the ranking nature of most detectors.

149.5.1 5.1 Why Accuracy Fails

If the base rate is $\pi = 0.001$, a model that labels every point normal achieves accuracy $1 - \pi = 0.999$. Accuracy rewards ignoring the positive class entirely, so it is worthless here. The confusion matrix must be read through the lens of the rare class, using precision $\text{TP} / (\text{TP} + \text{FP})$ and recall $\text{TP} / (\text{TP} + \text{FN})$, and their harmonic mean, the $F_1$ score.
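A short sketch makes the failure of accuracy explicit. It evaluates the trivial "everything is normal" predictor on a hypothetical sample of one thousand points containing a single anomaly; precision is reported as zero when nothing is flagged, which is a convention adopted here.

```python
def rare_class_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, with the anomaly (label 1) as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention: 0 when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical sample with base rate 0.001: one anomaly among 1000 points.
y_true = [1] + [0] * 999
all_normal = [0] * 1000  # predictor that never flags anything

accuracy, precision, recall, f1 = rare_class_metrics(y_true, all_normal)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Accuracy comes out at $0.999$ while recall and $F_1$ are both zero, which is why the rare-class metrics are the ones to read.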

149.5.2 5.2 Threshold-Free Ranking Metrics

Most detectors output a continuous score, and the choice of threshold is a separate operational decision. It is therefore valuable to evaluate the ranking quality across all thresholds at once. Two curves dominate practice.

The receiver operating characteristic curve plots $\text{TPR}$ against $\text{FPR}$ as the threshold varies, and the area under it, ROC AUC, is the probability that a randomly chosen anomaly is scored above a randomly chosen normal point. ROC AUC has a hidden flaw for rare classes: because $\text{FPR}$ has the large normal count in its denominator, even many false positives barely move the curve, so ROC AUC can look reassuringly high while precision is dismal.

The precision recall curve plots precision against recall, and the area under it, commonly summarized as the average precision, is far more informative under heavy imbalance because both axes involve the rare class. A useful anchor is the baseline of a random scorer: on the ROC plot it is the diagonal with area $0.5$ regardless of imbalance, whereas on the precision recall plot it is the horizontal line at precision $\pi$. As $\pi$ shrinks, that baseline falls toward zero and, at any fixed $\text{TPR}$ and $\text{FPR}$, precision falls with it, so the precision recall plot exposes difficulty that the ROC plot conceals. When the base rate is low, the precision recall curve and average precision are the primary metrics, and ROC AUC is a secondary one. This ordering is argued carefully by Davis and Goadrich and by Saito and Rehmsmeier.
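The gap between the two summaries can be demonstrated on a constructed example. The sketch below computes ROC AUC from its pairwise-ranking definition and average precision as the mean of the precision at each anomaly's rank. The integer scores are hypothetical and chosen so that every anomaly outranks about 99 percent of the normal points.

```python
def roc_auc(pos, neg):
    """Probability that a random anomaly outscores a random normal point (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(pos, neg):
    """Mean of the precision measured at the rank of each anomaly (assumes no tied scores)."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / len(pos)

# Hypothetical scores: 10,000 normal points and 10 anomalies (base rate about 0.001).
neg = [2 * i for i in range(10_000)]
pos = [2 * (9_900 + 10 * j) + 1 for j in range(10)]

print(f"ROC AUC           = {roc_auc(pos, neg):.4f}")
print(f"average precision = {average_precision(pos, neg):.4f}")
print(f"random baseline   = {len(pos) / (len(pos) + len(neg)):.4f}")
```

With these made-up scores the ROC AUC is above $0.99$ while average precision stays below $0.1$: the few dozen normal points that outrank each anomaly are negligible relative to ten thousand negatives, but they dominate the top of the alert list.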

149.5.3 5.3 Metrics Matched to Operations

Beyond curves, deployment usually fixes an alert budget. If analysts can review one hundred alerts per day, the relevant metric is precision at $k$, the fraction of the top $k$ scored points that are true anomalies, with $k$ set to the budget. Recall at a fixed FPR, and the number of true anomalies caught within a fixed alert volume, translate model quality into the currency the operator spends.
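Precision at $k$ is simple enough to compute by hand, as the following sketch shows. The six scores and labels are toy values, and $k = 3$ stands in for the daily alert budget.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored points that are true anomalies (label 1)."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k

def recall_at_k(scores, labels, k):
    """Fraction of all true anomalies that appear among the k highest-scored points."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / sum(labels)

# Toy example: six scored points, three of which are true anomalies.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]

k = 3  # stands in for the alert budget
print(f"precision@{k} = {precision_at_k(scores, labels, k):.3f}")
print(f"recall@{k}    = {recall_at_k(scores, labels, k):.3f}")
```

Both numbers answer the operator's question directly: of the alerts that fit in the budget, how many are real, and how many of the real anomalies did the budget capture.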

149.5.4 5.4 Time Series and Range-Based Evaluation

When anomalies are collective or contextual events spanning intervals, point wise precision and recall mislead, because a single alert anywhere inside a long anomalous range arguably constitutes a catch, and a one timestamp offset should not count as a miss. Range based precision and recall, and point adjusted scoring, were developed to credit a detection that overlaps the true anomalous interval. These should be used deliberately and reported transparently, since point adjustment in particular can inflate scores dramatically and has drawn justified criticism for making weak detectors look strong.
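The inflation caused by point adjustment is easy to reproduce. The sketch below implements the common form of the rule, in which one flagged point inside a true anomalous segment marks the whole segment as detected, and compares $F_1$ before and after. The twenty-step series and the two alerts are hypothetical.

```python
def f1(y_true, y_pred):
    """Point-wise F1 with the anomaly (1) as the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def point_adjust(y_true, y_pred):
    """If any point inside a true anomalous segment is flagged, flag the whole segment."""
    adjusted = list(y_pred)
    i = 0
    while i < len(y_true):
        if y_true[i]:
            j = i
            while j < len(y_true) and y_true[j]:
                j += 1                      # j is one past the end of the segment
            if any(y_pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

# Hypothetical series: one anomalous interval covering time steps 5 to 14.
y_true = [0] * 5 + [1] * 10 + [0] * 5
# The detector raises one alert inside the interval and one false alert outside it.
y_pred = [0] * 20
y_pred[9] = 1
y_pred[18] = 1

print(f"point-wise F1     = {f1(y_true, y_pred):.3f}")
print(f"point-adjusted F1 = {f1(y_true, point_adjust(y_true, y_pred)):.3f}")
```

A single hit inside the interval lifts $F_1$ from $1/6$ to $20/21$, which illustrates why the scoring convention has to be stated alongside any reported number.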

149.5.5 5.5 A Disciplined Evaluation Protocol

A sound protocol fixes the contamination assumption before looking at results, evaluates with a ranking metric appropriate to the base rate, reports a precision oriented operating point tied to a realistic alert budget, and, for temporal data, states explicitly whether scoring is point wise or range based. Reporting a single number without the base rate and without the operating point is uninformative, because the same detector can be excellent or useless depending on the prevalence it faces.

149.6 6. Choosing a Setting: When to Use What, and Pitfalls

The foundations above translate into a short decision procedure. First ask what kind of anomaly you face, because that fixes the representation. If individual records are independent, you have point anomalies and a marginal density or distance method suffices. If normality depends on a covariate such as time, season, or location, you have contextual anomalies and must condition on that covariate, either by modeling $p(b \mid c)$ directly or by adding the context to the feature vector. If the anomaly lives in a run, a shape, or a subgraph, you have collective anomalies and must score windows or substructures rather than points.

Then ask what labels you can obtain, because that fixes the learning setting.

Use supervised only when you have a reliable, reasonably representative sample of anomalies, and pair it with imbalance-aware training such as cost-sensitive loss or resampling. The pitfall is overfitting to the known anomaly signature and missing novel variants.
Use semi-supervised (one-class) when clean normal data is plentiful but anomalies are not, which is the common case in fault and novelty detection. The pitfall is distribution shift in the normal class, which manifests as a rising false alarm rate over time and calls for periodic recalibration.
Use unsupervised when no labels exist at all. The pitfall is the contamination assumption: if anomalies are not rare in the training data, the densest region need not be normal, and the contamination hyperparameter $\nu$ is a guess rather than a learned quantity.

Three pitfalls cut across all settings. The first is evaluating with accuracy or ROC AUC alone, which can flatter a detector that an operator would find useless once the base rate is accounted for. The second is ignoring the alert budget, since a detector is only deployable if its false positives per day fit the capacity of whoever triages them. The third is leaking the future into the past in temporal data, where naive shuffling of a time series into train and test sets lets the model see information it could not have at prediction time and produces optimistic scores that collapse in deployment.
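The third pitfall has a simple structural remedy: split on time rather than at random. The sketch below contrasts a chronological split with a shuffled one on a hypothetical series of `(timestamp, value)` records; the cutoff of 70 and the seed are arbitrary.

```python
import random

def chronological_split(records, cutoff):
    """Train on records strictly before the cutoff timestamp, test on the rest."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

# Hypothetical time series of (timestamp, value) records.
records = [(t, float(t % 7)) for t in range(100)]

train, test = chronological_split(records, cutoff=70)
print("chronological: latest train t =", max(t for t, _ in train),
      "| earliest test t =", min(t for t, _ in test))

# Naive alternative: shuffle, then split. Training data now contains the future.
shuffled = records[:]
random.Random(0).shuffle(shuffled)
leaky_train, leaky_test = shuffled[:70], shuffled[70:]
print("shuffled:      latest train t =", max(t for t, _ in leaky_train),
      "| earliest test t =", min(t for t, _ in leaky_test))
```

In the chronological split every training timestamp precedes every test timestamp. In the shuffled split the training set contains records from after the earliest test record, which is the leak that produces optimistic scores.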

149.7 7. Summary

Anomaly detection is organized along two axes that together determine method selection. The first axis is the anomaly type: point anomalies sit far from the bulk of the data, contextual anomalies are abnormal only relative to a conditioning context, and collective anomalies are structured sets whose members are individually unremarkable. The second axis is label availability: supervised detection treats the problem as imbalanced classification, unsupervised detection infers normality from contaminated data under a rarity assumption, and semi-supervised detection learns a model of clean normal data and flags departures from it. Overlaying both axes is the base-rate problem, the arithmetic certainty that low prevalence crushes precision even for high recall detectors, which in turn dictates an evaluation regime built on precision recall analysis, average precision, and budget aware operating points rather than accuracy or ROC AUC alone. Mastering these foundations is the prerequisite for using any specific algorithm well, because the algorithm is only ever as good as the match between its assumptions and the type, setting, and prevalence of the anomalies you actually face.

149.8 References

Hawkins, D. M. Identification of Outliers. Chapman and Hall, 1980. https://link.springer.com/book/10.1007/978-94-015-3994-4
Chandola, V., Banerjee, A., and Kumar, V. “Anomaly Detection: A Survey.” ACM Computing Surveys, 41(3), 2009. https://dl.acm.org/doi/10.1145/1541880.1541882
Aggarwal, C. C. Outlier Analysis, 2nd edition. Springer, 2017. https://link.springer.com/book/10.1007/978-3-319-47578-3
Liu, F. T., Ting, K. M., and Zhou, Z.-H. “Isolation Forest.” IEEE International Conference on Data Mining, 2008. https://ieeexplore.ieee.org/document/4781136
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. “LOF: Identifying Density-Based Local Outliers.” ACM SIGMOD, 2000. https://dl.acm.org/doi/10.1145/342009.335388
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. “Estimating the Support of a High-Dimensional Distribution.” Neural Computation, 13(7), 2001. https://direct.mit.edu/neco/article/13/7/1443/6529
Ruff, L., Vandermeulen, R., et al. “A Unifying Review of Deep and Shallow Anomaly Detection.” Proceedings of the IEEE, 109(5), 2021. https://ieeexplore.ieee.org/document/9347460
Davis, J., and Goadrich, M. “The Relationship Between Precision-Recall and ROC Curves.” International Conference on Machine Learning, 2006. https://dl.acm.org/doi/10.1145/1143844.1143874
Saito, T., and Rehmsmeier, M. “The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets.” PLOS ONE, 10(3), 2015. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
Tatbul, N., Lee, T. J., Zdonik, S., Alam, M., and Gottschlich, J. “Precision and Recall for Time Series.” Advances in Neural Information Processing Systems, 2018. https://papers.nips.cc/paper/2018/hash/8f468c873a32bb0619eaeb2050ba45d1-Abstract.html
Kim, S., Choi, K., Choi, H.-S., Lee, B., and Yoon, S. “Towards a Rigorous Evaluation of Time-Series Anomaly Detection.” AAAI Conference on Artificial Intelligence, 2022. https://ojs.aaai.org/index.php/AAAI/article/view/20680

# Anomaly Detection Foundations Anomaly detection is the task of identifying observations that deviate so markedly from the bulk of the data that they arouse suspicion of having been generated by a different mechanism. The phrasing comes from Hawkins and captures the central intuition that anomalies are not merely rare points but points whose generative process differs from the norm. This chapter develops the conceptual scaffolding that every practitioner needs before reaching for a specific algorithm: a taxonomy of anomaly types, the learning settings defined by label availability, the statistical pathology known as the base-rate problem, and the evaluation machinery appropriate to severely imbalanced detection tasks. The two governing axes of the field, anomaly type and label availability, can be held in mind from the outset. ```{mermaid} %%| label: fig-axes %%| fig-cap: "The two design axes that determine method selection in anomaly detection." flowchart TD A["Anomaly detection problem"] --> B["Axis 1: anomaly type"] A --> C["Axis 2: label availability"] B --> B1["Point"] B --> B2["Contextual"] B --> B3["Collective"] C --> C1["Supervised: normal and anomaly labels"] C --> C2["Unsupervised: no labels"] C --> C3["Semi-supervised: normal only"] ``` Overlaying both axes is the base-rate problem, the arithmetic fact that rarity crushes precision, which dictates how detectors must be evaluated. The chapter closes with that evaluation machinery and a disciplined protocol for reporting results honestly. ## 1. What Counts as an Anomaly An anomaly, also called an outlier or a novelty depending on context, is an observation that is inconsistent with a model of normal behavior. The qualifier "model of normal behavior" matters. Anomaly is not an intrinsic property of a point in isolation; it is a relationship between a point and a reference distribution. A heart rate of 180 beats per minute is anomalous at rest and entirely normal during a sprint. 
This relational character is what makes the problem subtle and what forces us to be precise about the context in which normality is defined. ::: {.callout-note title="Definition: anomaly"} Fix a normal generating process with density $p_{\text{normal}}$ over a feature space $\mathcal{X}$. An observation $x \in \mathcal{X}$ is an anomaly when it is poorly explained by $p_{\text{normal}}$, formalized either as low density $p_{\text{normal}}(x) < \tau$ for a threshold $\tau$, or as a likelihood ratio favoring an alternative process, $p_{\text{anom}}(x) / p_{\text{normal}}(x) > 1$. Anomaly is defined relative to $p_{\text{normal}}$, never absolutely. ::: ### 1.1 The Generative Framing Let $p_{\text{normal}}(x)$ denote the density of the normal generating process over a feature space $\mathcal{X}$. A point $x$ is anomalous to the degree that it has low density under this process, or equivalently to the degree that it is better explained by some alternative process $p_{\text{anom}}(x)$. Many detectors implicitly or explicitly estimate $p_{\text{normal}}$ with $\hat p_{\text{normal}}$ and then assign an anomaly score equal to the negative log density, flagging points that exceed a threshold: $$ s(x) = -\log \hat p_{\text{normal}}(x), \qquad \text{flag } x \iff s(x) > \tau . $$ The negative log makes the score a surprisal: it is large exactly where the model is surprised. The threshold $\tau$ is the single tunable knob that trades recall against precision, and choosing it well is the operational heart of the field. This density view unifies a large family of methods, but it is not the only framing. The same scoring intuition surfaces in three other guises that avoid estimating a density directly. - **Distance based.** Score by remoteness from the data, for example the distance to the $k$-th nearest neighbor. Remote points are sparse, and sparsity is a nonparametric proxy for low density. 
- **Reconstruction based.** Fit a low-capacity map (principal components, an autoencoder) to normal data and score by reconstruction error $\lVert x - \hat x \rVert$. The map compresses what is typical, so atypical points reconstruct poorly. - **Isolation based.** Score by how easily a point is separated from the rest under random partitioning, as in isolation forest. Anomalies are isolated in few splits because they lie in sparse regions. Each of these can be read as a proxy for the same question, "how surprising is this point under normality," even when no explicit $\hat p_{\text{normal}}$ is ever formed. This unifying view is developed at length by Ruff and colleagues, who show that shallow and deep detectors alike reduce to a learned score plus a threshold. ### 1.2 Anomaly, Outlier, Novelty The terms overlap but carry different operational connotations. An outlier usually refers to a point inside an existing dataset that we wish to identify, often for cleaning. A novelty refers to a genuinely new pattern arriving after a model has been trained on clean data, as in novelty detection. Anomaly is the umbrella term and is the one we use throughout. The distinction governs whether contamination is present at training time, which in turn constrains the learning setting available to us. ## 2. A Taxonomy of Anomaly Types Chandola, Banerjee, and Kumar give the canonical three way classification that organizes almost all of the literature: point, contextual, and collective anomalies. Recognizing which type you face is the single most consequential modeling decision, because methods that excel at one type are frequently blind to another. ### 2.1 Point Anomalies A point anomaly is an individual instance that is anomalous with respect to the rest of the data. This is the simplest and most studied case. A single credit card transaction of one hundred thousand dollars, when a cardholder's transactions are typically under two hundred dollars, is a point anomaly. 
Formally, given a normality region $R \subseteq \mathcal{X}$ that contains the overwhelming majority of probability mass, a point $x$ is a point anomaly if $x \notin R$. Most distance and density methods, including $k$ nearest neighbor distance, local outlier factor, and one class support vector machines, are designed primarily for point anomalies.

### 2.2 Contextual Anomalies

A contextual anomaly, also called a conditional anomaly, is an instance that is anomalous only in a specific context, while being perfectly ordinary in another. The feature set is partitioned into contextual attributes and behavioral attributes. The contextual attributes define the situation, such as time of day, geographic location, or season. The behavioral attributes carry the measurement whose normality is conditional on the context.

The defining property is that the same behavioral value can be normal or anomalous depending on context. A temperature of thirty degrees Celsius is unremarkable in July and extraordinary in January. The detector must therefore model a conditional distribution $p(\text{behavioral} \mid \text{context})$ rather than a marginal one. A point $x = (c, b)$ with context $c$ and behavior $b$ is contextually anomalous when $b$ is improbable under $p(b \mid c)$ even though it may be probable under the marginal $p(b)$:

$$
s(c, b) = -\log \hat p(b \mid c).
$$

The two views diverge precisely when $b$ and $c$ are dependent. If they were independent, $p(b \mid c) = p(b)$ and contextual detection would collapse to point detection. A useful diagnostic is therefore to ask whether the behavioral variable's distribution actually shifts with context; if it does not, the extra machinery buys nothing.

Time series provide the richest source of contextual anomalies, because the timestamp is a natural contextual attribute. Spatial data, where location is the context, is the other common source.
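The contrast between the conditional score $-\log \hat p(b \mid c)$ and a marginal score can be made concrete with a small sketch. It assumes, purely for illustration, that the behavior is Gaussian within each context; the temperature history and the helper names (`gaussian_surprisal`, `fit_by_context`, `contextual_score`, `marginal_score`) are hypothetical and not part of any library.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def gaussian_surprisal(value, mu, sigma):
    """Negative log density of `value` under a Gaussian N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (value - mu) ** 2 / (2 * sigma ** 2)


def fit_by_context(records):
    """Estimate (mean, std) of the behavior separately for each context."""
    grouped = defaultdict(list)
    for context, behavior in records:
        grouped[context].append(behavior)
    return {c: (mean(v), pstdev(v)) for c, v in grouped.items()}


# Hypothetical temperature history in degrees Celsius, keyed by month.
history = [("jan", t) for t in (-2.0, 0.0, 1.0, 3.0, -1.0, 2.0)]
history += [("jul", t) for t in (27.0, 29.0, 30.0, 31.0, 28.0, 32.0)]

conditional = fit_by_context(history)               # stands in for p(b | c)
all_values = [b for _, b in history]
marginal = (mean(all_values), pstdev(all_values))   # stands in for p(b)


def contextual_score(context, behavior):
    """s(c, b) = -log p_hat(b | c) under the per-context Gaussian assumption."""
    return gaussian_surprisal(behavior, *conditional[context])


def marginal_score(behavior):
    """Score that ignores context entirely."""
    return gaussian_surprisal(behavior, *marginal)


print("30 C in January, contextual:", round(contextual_score("jan", 30.0), 2))
print("30 C in July,    contextual:", round(contextual_score("jul", 30.0), 2))
print("30 C, marginal:", round(marginal_score(30.0), 2))
print(" 0 C, marginal:", round(marginal_score(0.0), 2))
```

In this toy history the marginal model assigns thirty degrees and zero degrees the same score, because both sit equally far from the pooled mean, while the conditional model separates a thirty degree reading in January from the same reading in July by a wide margin. A real detector would need a richer conditional model and a guard against contexts with too few observations.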
Failing to model context is the most common reason a marginal point detector misses real anomalies while drowning the analyst in false alarms on perfectly normal seasonal extremes.

### 2.3 Collective Anomalies

A collective anomaly is a set of related instances that is anomalous with respect to the entire dataset, even though the individual instances within the set may not be anomalous on their own. The anomaly lives in the relationship among the points, not in any single point. A classic example is an electrocardiogram trace in which a low value persists for an unusually long interval. Each individual low reading is within the normal range, but the sustained run of low readings is the anomaly.

Collective anomalies require that the data have structure: a sequence, a graph, or a spatial arrangement that makes "a related set" meaningful. They cannot occur in a dataset of independent and identically distributed records, because there is no relationship to violate. Detection typically operates over subsequences or subgraphs, scoring a window $W = (x_t, x_{t+1}, \ldots, x_{t+\ell})$ by its distance to the nearest normal pattern in a reference library $\mathcal{L}$:

$$
s(W) = \min_{W' \in \mathcal{L}} d(W, W').
$$

The interplay between the three types is worth internalizing. A point anomaly is collective with window length one. A contextual anomaly becomes a point anomaly once the context is folded into the feature representation, for example by appending the season or the time of day to the feature vector. A collective anomaly becomes a point anomaly once a window is embedded as a single feature vector, for example by stacking the values of the window or summarizing it with statistics. Much of the craft of anomaly detection is choosing a representation that turns the anomaly you actually face into the type your chosen detector handles well.
The corollary is a warning: the wrong representation can also hide an anomaly, as when point-wise scoring of an electrocardiogram never sees the sustained low run because it never forms a window.

## 3. Learning Settings and Label Availability

The second axis of the design space is how much labeled information is available. The three settings are supervised, unsupervised, and semi-supervised, and they correspond to fundamentally different assumptions about the training data.

### 3.1 Supervised Anomaly Detection

In the supervised setting we possess a training set with labeled normal and labeled anomalous instances. In principle this reduces to ordinary binary classification, and one might ask why a separate field exists at all. The answer is the extreme class imbalance that defines anomaly detection. Anomalies are rare by construction, so the positive class may constitute well under one percent of the data. Standard classifiers trained to minimize average error will happily predict "normal" for everything and achieve high accuracy while detecting nothing.

A second difficulty is that the anomalous class is rarely representative. The anomalies we have labeled are the ones we have already seen; the dangerous ones are the novel attacks and failures we have not. A supervised model fit to known anomalies can overfit to their specific signature and miss the next variant entirely. For these reasons the genuinely supervised setting is less common in practice than its conceptual simplicity would suggest, and when used it is paired with imbalance aware techniques such as cost sensitive losses, resampling, or focal style reweighting.

### 3.2 Unsupervised Anomaly Detection

In the unsupervised setting no labels are available at all. This is the most common situation in practice. The detector must infer normality from the data itself, under the working assumption that anomalies are both rare and different.
Concretely, the algorithm assumes that normal points occupy dense regions while anomalies are sparse and far from the dense core, and it scores points accordingly.

The critical and often unstated assumption is that the training data is dominated by normal instances. If contamination, the fraction of anomalies present, is too high, the notion of "normal" is corrupted and the densest regions may themselves be anomalous clusters. Algorithms such as isolation forest, local outlier factor, and $k$ nearest neighbor distance accept a contamination hyperparameter $\nu \in (0, 1)$ that encodes the analyst's prior on the anomaly rate and that sets the decision threshold at the corresponding upper quantile of the scores: flag $x$ when $s(x) > Q_{1-\nu}(s)$, where $Q_{1-\nu}$ is the empirical $(1-\nu)$ quantile of the scores over the data.

This makes the limitation explicit. The detector does not discover the anomaly rate; it is told the rate through $\nu$ and merely ranks. Set $\nu$ too low and real anomalies fall below the threshold; set it too high and normal points are flagged. The convenience of needing no labels is paid for in the impossibility of principled threshold selection without ground truth, and in the absence of any direct way to validate the result. In the popular open-source library scikit-learn, this hyperparameter is named `contamination`; in `IsolationForest` and `LocalOutlierFactor` its default is `'auto'`, which applies the fixed score offset proposed in the original papers instead of estimating the anomaly rate, precisely because that rate cannot be learned from unlabeled data.

### 3.3 Semi-Supervised Anomaly Detection

The semi-supervised setting, which in this literature usually means one class learning, assumes access to a training set consisting only of normal instances. No anomalies are seen during training. The model builds a boundary, a density, or a reconstruction map of normality, and at test time anything that falls outside the learned region of normality is flagged.
This is the natural framing for novelty detection and for fault detection in engineered systems, where one can collect abundant data from healthy operation but, by definition, little or none from the rare failure. One class support vector machines, support vector data description, autoencoder reconstruction error, and deep one class methods all live here.

The autoencoder version is especially intuitive: train an encoder-decoder pair $(f, g)$ to reconstruct normal data by minimizing $\lVert x - g(f(x)) \rVert^2$ over normal points, then at inference time score by reconstruction error and flag points that exceed a threshold,

$$
s(x) = \lVert x - g(f(x)) \rVert, \qquad \text{flag } x \iff s(x) > \tau .
$$

The logic is that the network allocates its limited capacity to the patterns it saw, so it never learned to reconstruct what it never saw, and anomalies reconstruct poorly. The threshold $\tau$ is commonly set from a held-out sample of normal data, for example at a high quantile of the normal reconstruction errors, which fixes the false alarm rate on in-distribution normal data at roughly one minus the chosen quantile level.

The semi-supervised setting is attractive precisely because clean normal data is often obtainable when labeled anomalies are not, and because it sidesteps the contamination assumption of the unsupervised setting. Its weakness is sensitivity to distribution shift in the normal class, which produces false alarms when the definition of normal drifts over time. A subtle failure mode of high-capacity reconstructors deserves mention: a network powerful enough to generalize can learn to reconstruct anomalies it never saw, collapsing the error gap that the method relies on, which is why capacity must be deliberately constrained.

## 4. The Base-Rate Problem

No topic causes more confusion among newcomers to anomaly detection than the base-rate problem, sometimes called the base-rate fallacy.
It explains why a detector that appears excellent in isolation can be useless or worse in deployment, and it follows directly from Bayes' theorem applied to a rare positive class.

### 4.1 The Statement

Suppose anomalies have prevalence, or base rate, $\pi = P(A)$, where $A$ denotes the event that a point is anomalous. Let the detector have true positive rate, or recall, $\text{TPR} = P(\text{flag} \mid A)$ and false positive rate $\text{FPR} = P(\text{flag} \mid \neg A)$. The quantity an operator actually cares about is precision, the probability that a flagged point is genuinely anomalous, $P(A \mid \text{flag})$. By Bayes' theorem:

$$
P(A \mid \text{flag}) = \frac{\text{TPR} \cdot \pi}{\text{TPR} \cdot \pi + \text{FPR} \cdot (1 - \pi)}.
$$

When $\pi$ is tiny, the term $\text{FPR} \cdot (1 - \pi)$ in the denominator dominates unless $\text{FPR}$ is extraordinarily small, and precision collapses.

### 4.2 A Concrete Calculation

Consider a fraud detector with an enviable recall of $0.99$ and a false positive rate of $0.01$, applied to a transaction stream where the fraud rate is $\pi = 0.001$, that is one in a thousand. Plugging in:

$$
P(A \mid \text{flag}) = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.01 \cdot 0.999} = \frac{0.00099}{0.00099 + 0.00999} \approx 0.090.
$$

Roughly nine out of every ten alerts are false. A detector with ninety nine percent recall and a one percent false alarm rate, numbers that would be celebrated on most benchmarks, generates an alert stream that is ninety percent noise. This is not a defect of the detector; it is an arithmetic consequence of rarity, and no amount of clever modeling changes the structure of the equation. The only levers are driving $\text{FPR}$ far lower, raising $\pi$ by pre filtering the stream, or accepting that alerts feed a downstream triage process rather than an automated action.

It is worth seeing just how far $\text{FPR}$ must fall.
Holding recall at $0.99$ and the base rate at $\pi = 0.001$, the table below shows precision as a function of $\text{FPR}$, computed from the same Bayes formula.

| False positive rate $\text{FPR}$ | Precision $P(A \mid \text{flag})$ |
|---|---|
| $10^{-2}$ | $0.090$ |
| $10^{-3}$ | $0.498$ |
| $10^{-4}$ | $0.908$ |
| $10^{-5}$ | $0.990$ |

To reach even coin-flip precision the false positive rate must drop to one in a thousand, and to reach ninety percent precision it must drop to one in ten thousand, a hundredfold improvement over the already respectable starting point. This is the quantitative core of why anomaly detection is hard, and why operators speak in terms of an alert budget rather than a single accuracy number.

### 4.3 Operational Consequences

The base-rate problem reframes what "good" means. It tells us that recall and false positive rate, the quantities most detectors are tuned on, are insufficient summaries when the base rate is low. It motivates a relentless focus on the absolute number of false positives an operator must review per day, often called the alert budget. It explains why precision oriented and rank oriented metrics, discussed next, are the correct evaluation language for this field, and why accuracy is actively misleading.

## 5. Evaluation

Because anomalies are rare, evaluation demands metrics that are insensitive to the dominant normal class and that respect the ranking nature of most detectors.

### 5.1 Why Accuracy Fails

If the base rate is $\pi = 0.001$, a model that labels every point normal achieves accuracy $1 - \pi = 0.999$. Accuracy rewards ignoring the positive class entirely, so it is worthless here. The confusion matrix must be read through the lens of the rare class, using precision $\text{TP} / (\text{TP} + \text{FP})$ and recall $\text{TP} / (\text{TP} + \text{FN})$, and their harmonic mean, the $F_1$ score.
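The accuracy paradox can be checked with a few lines of arithmetic. The sketch below builds expected confusion-matrix counts for a stream with base rate $\pi = 0.001$ and compares an always-normal labeler with the detector from the earlier fraud example ($\text{TPR} = 0.99$, $\text{FPR} = 0.01$). The function names and the stream size are illustrative assumptions, and reporting precision and $F_1$ as zero when nothing is flagged is a convention chosen here, not a universal one.

```python
def confusion_counts(n, base_rate, tpr, fpr):
    """Expected confusion-matrix counts for n points at the given rates."""
    positives = n * base_rate
    negatives = n - positives
    tp = tpr * positives
    fp = fpr * negatives
    return {"tp": tp, "fp": fp, "fn": positives - tp, "tn": negatives - fp}


def summarize(counts):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = sum(counts.values())
    flagged = counts["tp"] + counts["fp"]
    actual_positives = counts["tp"] + counts["fn"]
    # Convention (an assumption of this sketch): undefined ratios report 0.0.
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / actual_positives if actual_positives else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (counts["tp"] + counts["tn"]) / total
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


N = 1_000_000   # hypothetical stream size
PI = 0.001      # base rate used in the text

always_normal = summarize(confusion_counts(N, PI, tpr=0.0, fpr=0.0))
detector = summarize(confusion_counts(N, PI, tpr=0.99, fpr=0.01))

for name, metrics in (("always normal", always_normal), ("detector", detector)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Under these assumptions the always-normal labeler scores an accuracy of about 0.999 with zero recall, while the detector that catches ninety nine percent of fraud scores a lower accuracy of about 0.990 and a precision of about 0.090. Accuracy ranks the two in the wrong order, which is the failure the section describes.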
### 5.2 Threshold-Free Ranking Metrics

Most detectors output a continuous score, and the choice of threshold is a separate operational decision. It is therefore valuable to evaluate the ranking quality across all thresholds at once. Two curves dominate practice.

The receiver operating characteristic curve plots $\text{TPR}$ against $\text{FPR}$ as the threshold varies, and the area under it, ROC AUC, is the probability that a randomly chosen anomaly is scored above a randomly chosen normal point. ROC AUC has a hidden flaw for rare classes: because $\text{FPR}$ has the large normal count in its denominator, even many false positives barely move the curve, so ROC AUC can look reassuringly high while precision is dismal.

The precision recall curve plots precision against recall, and the area under it, the average precision, is far more informative under heavy imbalance because both axes involve the rare class. A useful anchor is the baseline of a random scorer: on the ROC plot it is the diagonal with area $0.5$ regardless of imbalance, whereas on the precision recall plot it is the horizontal line at precision $\pi$. As $\pi$ shrinks, both that baseline and the precision a given detector attains at each recall level fall with it, so the precision recall plot exposes difficulty that the ROC plot conceals.

When the base rate is low, the precision recall curve and average precision are the primary metrics, and ROC AUC is a secondary one. This ordering is argued carefully by Davis and Goadrich and by Saito and Rehmsmeier.

### 5.3 Metrics Matched to Operations

Beyond curves, deployment usually fixes an alert budget. If analysts can review one hundred alerts per day, the relevant metric is precision at $k$, the fraction of the top $k$ scored points that are true anomalies, with $k$ set to the budget. Recall at a fixed FPR, and the number of true anomalies caught within a fixed alert volume, translate model quality into the currency the operator spends.
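Precision at $k$ is simple enough to compute by hand, and a short sketch makes the definition unambiguous. The functions below are illustrative helpers, not a library API; the scores and labels are made-up values, and ties are broken by input order because Python's sort is stable.

```python
def top_k_labels(scores, labels, k):
    """Labels of the k highest-scored points, highest score first."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of points")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [label for _, label in ranked[:k]]


def precision_at_k(scores, labels, k):
    """Fraction of the top-k alerts that are true anomalies (label 1)."""
    return sum(top_k_labels(scores, labels, k)) / k


def recall_at_k(scores, labels, k):
    """Fraction of all true anomalies that appear among the top-k alerts."""
    total_anomalies = sum(labels)
    if total_anomalies == 0:
        raise ValueError("recall is undefined without any true anomalies")
    return sum(top_k_labels(scores, labels, k)) / total_anomalies


# Hypothetical anomaly scores and ground-truth labels (1 = anomaly).
scores = [0.90, 0.80, 0.75, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0]

budget = 3  # the alert budget plays the role of k
print("precision@3:", precision_at_k(scores, labels, budget))
print("recall@3:   ", recall_at_k(scores, labels, budget))
```

With a budget of three alerts, two of the three flagged points are true anomalies and two of the three true anomalies are caught, so both quantities equal two thirds in this toy example. Setting $k$ to the daily review capacity ties the reported number directly to what an analyst would experience.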
### 5.4 Time Series and Range-Based Evaluation

When anomalies are collective or contextual events spanning intervals, point wise precision and recall mislead, because a single alert anywhere inside a long anomalous range arguably constitutes a catch, and a one timestamp offset should not count as a miss. Range based precision and recall, and point adjusted scoring, were developed to credit a detection that overlaps the true anomalous interval. These should be used deliberately and reported transparently, since point adjustment in particular can inflate scores dramatically and has drawn justified criticism for making weak detectors look strong.

### 5.5 A Disciplined Evaluation Protocol

A sound protocol fixes the contamination assumption before looking at results, evaluates with a ranking metric appropriate to the base rate, reports a precision oriented operating point tied to a realistic alert budget, and, for temporal data, states explicitly whether scoring is point wise or range based. Reporting a single number without the base rate and without the operating point is uninformative, because the same detector can be excellent or useless depending on the prevalence it faces.

## 6. Choosing a Setting: When to Use What, and Pitfalls

The foundations above translate into a short decision procedure. First ask what kind of anomaly you face, because that fixes the representation. If individual records are independent, you have point anomalies and a marginal density or distance method suffices. If normality depends on a covariate such as time, season, or location, you have contextual anomalies and must condition on that covariate, either by modeling $p(b \mid c)$ directly or by adding the context to the feature vector. If the anomaly lives in a run, a shape, or a subgraph, you have collective anomalies and must score windows or substructures rather than points.

Then ask what labels you can obtain, because that fixes the learning setting.

- **Use supervised** only when you have a reliable, reasonably representative sample of anomalies, and pair it with imbalance-aware training such as cost-sensitive loss or resampling. The pitfall is overfitting to the known anomaly signature and missing novel variants.
- **Use semi-supervised (one-class)** when clean normal data is plentiful but anomalies are not, which is the common case in fault and novelty detection. The pitfall is distribution shift in the normal class, which manifests as a rising false alarm rate over time and calls for periodic recalibration.
- **Use unsupervised** when no labels exist at all. The pitfall is the contamination assumption: if anomalies are not rare in the training data, the densest region need not be normal, and the contamination hyperparameter $\nu$ is a guess rather than a learned quantity.

Three pitfalls cut across all settings. The first is evaluating with accuracy or ROC AUC alone, which can flatter a detector that an operator would find useless once the base rate is accounted for. The second is ignoring the alert budget, since a detector is only deployable if its false positives per day fit the capacity of whoever triages them. The third is leaking the future into the past in temporal data, where naive shuffling of a time series into train and test sets lets the model see information it could not have at prediction time and produces optimistic scores that collapse in deployment.

## 7. Summary

Anomaly detection is organized along two axes that together determine method selection. The first axis is the anomaly type: point anomalies sit far from the bulk of the data, contextual anomalies are abnormal only relative to a conditioning context, and collective anomalies are structured sets whose members are individually unremarkable.
The second axis is label availability: supervised detection treats the problem as imbalanced classification, unsupervised detection infers normality from contaminated data under a rarity assumption, and semi-supervised detection learns a model of clean normal data and flags departures from it. Overlaying both axes is the base-rate problem, the arithmetic certainty that low prevalence crushes precision even for high recall detectors, which in turn dictates an evaluation regime built on precision recall analysis, average precision, and budget aware operating points rather than accuracy or ROC AUC alone.

Mastering these foundations is the prerequisite for using any specific algorithm well, because the algorithm is only ever as good as the match between its assumptions and the type, setting, and prevalence of the anomalies you actually face.

## References

1. Hawkins, D. M. *Identification of Outliers*. Chapman and Hall, 1980. https://link.springer.com/book/10.1007/978-94-015-3994-4
2. Chandola, V., Banerjee, A., and Kumar, V. "Anomaly Detection: A Survey." *ACM Computing Surveys*, 41(3), 2009. https://dl.acm.org/doi/10.1145/1541880.1541882
3. Aggarwal, C. C. *Outlier Analysis*, 2nd edition. Springer, 2017. https://link.springer.com/book/10.1007/978-3-319-47578-3
4. Liu, F. T., Ting, K. M., and Zhou, Z.-H. "Isolation Forest." *IEEE International Conference on Data Mining*, 2008. https://ieeexplore.ieee.org/document/4781136
5. Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. "LOF: Identifying Density-Based Local Outliers." *ACM SIGMOD*, 2000. https://dl.acm.org/doi/10.1145/342009.335388
6. Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. "Estimating the Support of a High-Dimensional Distribution." *Neural Computation*, 13(7), 2001. https://direct.mit.edu/neco/article/13/7/1443/6529
7. Ruff, L., Vandermeulen, R., et al. "A Unifying Review of Deep and Shallow Anomaly Detection." *Proceedings of the IEEE*, 109(5), 2021.
https://ieeexplore.ieee.org/document/9347460
8. Davis, J., and Goadrich, M. "The Relationship Between Precision-Recall and ROC Curves." *International Conference on Machine Learning*, 2006. https://dl.acm.org/doi/10.1145/1143844.1143874
9. Saito, T., and Rehmsmeier, M. "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." *PLOS ONE*, 10(3), 2015. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
10. Tatbul, N., Lee, T. J., Zdonik, S., Alam, M., and Gottschlich, J. "Precision and Recall for Time Series." *Advances in Neural Information Processing Systems*, 2018. https://papers.nips.cc/paper/2018/hash/8f468c873a32bb0619eaeb2050ba45d1-Abstract.html
11. Kim, S., Choi, K., Choi, H.-S., Lee, B., and Yoon, S. "Towards a Rigorous Evaluation of Time-Series Anomaly Detection." *AAAI Conference on Artificial Intelligence*, 2022. https://ojs.aaai.org/index.php/AAAI/article/view/20680
