154 Deep Anomaly Detection

Anomaly detection asks a deceptively simple question: which observations do not belong? In low dimensional, well behaved data the classical answers (Gaussian models, distance based outlier scores, kernel methods such as one class support vector machines, and tree based isolation) often suffice. They break down when the data is high dimensional, structured, and rich in nuisance variation: images, network traffic, sensor streams, financial transactions, and medical scans. The difficulty is the curse of dimensionality. In high dimensions distances concentrate, volumes are dominated by their boundary, and any fixed sample becomes vanishingly sparse, so raw distance and density estimates lose discriminative power. Deep anomaly detection addresses this regime by learning representations in which normality is compact and deviations are exposed. This chapter develops the main families of deep methods, the assumptions that make them work, the mathematics that explains when they fail, and the practical tradeoffs that govern their use. The unifying review by Ruff et al. (reference 1) is a useful companion throughout.

154.1 1. Problem Setting and Assumptions

Let $x \in \mathcal{X}$ be drawn from an unknown distribution. We posit a dominant “normal” distribution $p_{\text{normal}}$ and treat anomalies as samples from a different, typically unknown and under sampled, process $p_{\text{anom}}$. The observed data is a mixture

\[ p(x) = (1 - \pi)\, p_{\text{normal}}(x) + \pi\, p_{\text{anom}}(x), \qquad \pi \ll 1, \]

where the contamination rate $\pi$ is small. The goal is a scoring function $s: \mathcal{X} \to \mathbb{R}$ such that anomalous inputs receive higher scores. A threshold $\tau$ converts scores into a decision rule $\hat{y}(x) = \mathbb{1}[s(x) > \tau]$, and downstream metrics such as area under the ROC curve (AUROC) or area under the precision recall curve (AUPRC) summarize ranking quality independent of $\tau$.

It is worth distinguishing three notions that the word “anomaly” conflates. A point anomaly is a single observation that is improbable in isolation. A contextual anomaly is normal in general but improbable given its context, for example a temperature of thirty degrees is unremarkable in summer and anomalous in winter. A collective anomaly is a set of points each individually plausible whose joint pattern is not, for example an otherwise normal sequence of packets that together constitute a port scan. Different method families implicitly target different notions, and a detector tuned for point anomalies will often miss contextual or collective ones.

Three labeling regimes recur. In the unsupervised setting the training set is unlabeled and assumed mostly normal. In the semi supervised (one class) setting the training set is clean normal data only. In the supervised setting a few labeled anomalies are available, usually far too few to train a conventional classifier. Most deep methods target the first two regimes because anomalies are rare, diverse, and expensive to label.

154.1.1 1.1 Two assumptions, made precise

Nearly every deep detector rests on two assumptions. State them explicitly, because every failure mode below traces back to a violation of one of them.

Assumption 1 (concentration / manifold). Normal data lies on or near a low dimensional manifold $\mathcal{M} \subset \mathcal{X}$ or, in density terms, in a compact high density region $\mathcal{R}_\alpha = \{x : p_{\text{normal}}(x) \ge \alpha\}$ whose volume is small.

Assumption 2 (separation). Anomalies violate the regularities the model has internalized. Formally, the model induces a representation or density under which anomalies map outside $\mathcal{R}_\alpha$, reconstruct poorly, or receive low likelihood.

These assumptions make the task well posed. By the Neyman Pearson lemma, if both $p_{\text{normal}}$ and $p_{\text{anom}}$ were known the optimal detector at any false alarm rate would threshold the likelihood ratio $p_{\text{anom}}(x) / p_{\text{normal}}(x)$. We almost never know $p_{\text{anom}}$, so deep methods substitute a surrogate: thresholding low density (an implicit ratio against a uniform background), large reconstruction error, or large distance to a learned center. Each surrogate is the Neyman Pearson rule under a particular, usually unstated, assumption about how anomalies are distributed. Recognizing that assumption is the difference between a detector that works and one that fails silently.
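The "implicit ratio against a uniform background" can be written out in one line. The derivation below assumes, purely for illustration, that anomalies are uniform on a bounded set $\mathcal{B}$ of volume $V$ and that the likelihood ratio is thresholded at some level $\eta > 0$; both $\mathcal{B}$ and $\eta$ are notation introduced here.

\[ \frac{p_{\text{anom}}(x)}{p_{\text{normal}}(x)} = \frac{1/V}{p_{\text{normal}}(x)} > \eta \quad\Longleftrightarrow\quad p_{\text{normal}}(x) < \frac{1}{\eta V} =: \alpha, \qquad x \in \mathcal{B}. \]

Under this assumption, flagging points outside the level set $\mathcal{R}_\alpha$ of Assumption 1 coincides with the Neyman Pearson rule. If anomalies are far from uniform, for example concentrated near the normal data, the same density threshold is no longer the optimal test.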

When the assumptions fail (for example when anomalies are themselves common so $\pi$ is not small, or when normality is multimodal and the model under fits it) detectors degrade. Keeping the assumptions explicit is the single most useful habit in applied anomaly detection.

154.1.2 1.2 A map of the methods

The families below differ mainly in how they operationalize Assumption 2: by reconstruction, by geometry, by generation, by density, or by an auxiliary self supervised task.

flowchart TD
    A["Deep anomaly detection"] --> B["Reconstruction based"]
    A --> C["One class geometric"]
    A --> D["Generative or density based"]
    A --> E["Self supervised"]
    B --> B1["Autoencoder error"]
    B --> B2["VAE negative ELBO"]
    C --> C1["Deep SVDD hypersphere"]
    D --> D1["GAN reconstruction"]
    D --> D2["Normalizing flow density"]
    D --> D3["Feature memory nearest neighbor"]
    E --> E1["Transformation prediction"]
    E --> E2["Contrastive plus compactness"]
    E --> E3["Synthetic anomaly segmentation"]

154.2 2. Autoencoder Reconstruction Error

154.2.1 2.1 The core idea

An autoencoder learns an encoder $f_\theta: \mathcal{X} \to \mathbb{R}^d$ and a decoder $g_\phi: \mathbb{R}^d \to \mathcal{X}$ trained to minimize reconstruction loss on (mostly) normal data:

\[ \mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - g_\phi(f_\theta(x_i)) \rVert^2 . \]

The bottleneck dimension $d$ forces the network to capture the dominant factors of variation in normal data. At test time the anomaly score is the reconstruction error,

\[ s(x) = \lVert x - g_\phi(f_\theta(x)) \rVert^2, \]

on the hypothesis that the decoder, never having seen anomalous structure, reconstructs it poorly.

There is a clean intuition for why this can work. A linear autoencoder with squared loss recovers principal component analysis: the optimal encoder projects onto the top $d$ principal subspace and the score is the squared residual outside that subspace, $\lVert (I - P_d) x \rVert^2$, where $P_d$ projects onto the leading eigenvectors of the normal covariance. Reconstruction based anomaly detection is therefore a nonlinear generalization of “distance to the dominant subspace of normal data.” This identity also exposes the central risk: if the learned subspace (or its nonlinear analogue) is large enough to contain anomalous directions, the residual collapses and the score is blind.

# Reconstruction score (PyTorch sketch, not runnable as is)
z = encoder(x)
x_hat = decoder(z)
score = ((x - x_hat) ** 2).flatten(1).mean(dim=1)
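The risk named above, that a large enough subspace swallows anomalous directions, can be checked numerically in the linear (PCA) case. The following is a minimal NumPy sketch under stated assumptions: the data, the anomaly, and the helper `residual_score` are all hypothetical, and the data is mean centered before projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normal data in R^5: two high-variance directions, three low-variance ones.
scales = np.array([3.0, 2.0, 0.1, 0.1, 0.1])
x_train = rng.normal(size=(2000, 5)) * scales

mean = x_train.mean(axis=0)
# Rows of vt are the principal directions, ordered by decreasing variance.
_, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)

def residual_score(x, d):
    """Squared residual ||(I - P_d)(x - mean)||^2 for the top-d principal subspace."""
    centered = x - mean
    P_d = vt[:d].T @ vt[:d]
    r = centered - P_d @ centered
    return float(r @ r)

# Hypothetical anomaly: large along a direction that normal data barely uses.
anomaly = np.array([0.0, 0.0, 0.0, 2.0, 0.0])

scores = {d: residual_score(anomaly, d) for d in range(1, 6)}
for d, s in scores.items():
    print(f"d={d}: residual score = {s:.4f}")
```

With $d \le 2$ the anomaly keeps nearly all of its squared norm as residual. As $d$ grows the residual can only shrink, and at $d = 5$ the projector is the identity and the score vanishes.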

154.2.2 2.2 Why it can fail and how to harden it

The central pathology is generalization to anomalies: a sufficiently expressive autoencoder may reconstruct inputs it has never seen, including anomalies, by exploiting local image statistics or identity like shortcuts. Several remedies exist. A denoising objective trains the model to map a corrupted input $\tilde{x}$ back to the clean $x$, which discourages identity mappings and ties the model to the normal manifold. Under Gaussian corruption with small variance $\sigma^2$, the displacement of the optimal denoiser $r$, namely $r(\tilde{x}) - \tilde{x}$, is approximately $\sigma^2 \nabla_{\tilde{x}} \log p_{\text{normal}}(\tilde{x})$, a scaled version of the score, which gives the residual a probabilistic reading. Memory augmented autoencoders (MemAE, reference 6) restrict the latent code to a convex combination of a learned dictionary of normal prototypes, so anomalous codes cannot be represented and reconstruction error rises. Sparsity and contractive penalties regularize the latent geometry. Choosing $d$ matters: too large invites memorization, too small under fits normal variation and produces false alarms.

Reconstruction error is also sensitive to the loss metric. Squared error in pixel space penalizes high frequency detail unevenly; perceptual or feature space distances (errors measured on activations of a pretrained network) often localize defects better, which matters for tasks such as industrial inspection where the goal is a pixel level anomaly map rather than a single score.

154.2.3 2.3 Variational and probabilistic variants

A variational autoencoder (VAE) places a prior $p(z)$ on the latent and maximizes the evidence lower bound,

\[ \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\theta(z \mid x)}[\log p_\phi(x \mid z)] - D_{\text{KL}}\big(q_\theta(z \mid x) \,\Vert\, p(z)\big) . \]

The negative ELBO, or the reconstruction probability under the decoder, serves as a score with a probabilistic interpretation. A caution that recurs in the literature: likelihood based generative models can assign higher likelihood to certain out of distribution inputs than to in distribution ones (Nalisnick et al., reference 11). The reason is structural. The log likelihood of a flow or VAE is dominated by low level statistics; a model trained on a complex dataset can assign higher density to a simpler, smoother input simply because smooth inputs sit near the high probability mode of a typical pixel distribution, independent of semantic content. Concretely, Nalisnick et al. report that flows trained on CIFAR-10 assign higher likelihood to SVHN images than to CIFAR-10 test images. Raw likelihood is therefore not a reliable anomaly score on its own. A principled fix is a likelihood ratio against a background model $p_0$ that captures the same low level statistics,

\[ s(x) = -\log \frac{p_{\text{normal}}(x)}{p_0(x)}, \]

which cancels the input agnostic component and isolates the semantic part of the density (Ren et al., reference 13). This ratio has the same form as the Neyman Pearson statistic of Section 1.1, with the background model $p_0$ standing in for the unknown anomaly distribution.
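Why raw likelihood can mislead is easiest to see in a toy model. The sketch below is an analogy, not a claim about any particular image model: it assumes the "trained model" is a standard Gaussian in 1000 dimensions and compares the log density of genuine samples with that of a hypothetical plain input, the all-zero vector.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000

def log_density_std_normal(x):
    """Log density of N(0, I) at x (the last axis is the feature axis)."""
    return -0.5 * (x ** 2).sum(axis=-1) - 0.5 * x.shape[-1] * np.log(2 * np.pi)

typical = rng.normal(size=(5000, dim))   # genuine samples from the model
flat = np.zeros(dim)                     # a "plain" input that is never actually sampled

logp_typical = log_density_std_normal(typical)
logp_flat = log_density_std_normal(flat)

print("mean log p of real samples:", logp_typical.mean())
print("log p of the flat input   :", logp_flat)
print("fraction of samples at least as likely:", (logp_typical >= logp_flat).mean())
```

The flat input sits at the mode and receives a log density roughly `dim / 2` nats above a typical sample, even though samples essentially never land near it. High density and "looks like the training data" are different properties in high dimensions.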

154.3 3. Deep SVDD and One Class Representation Learning

Deep Support Vector Data Description (Deep SVDD) learns a representation in which normal data is enclosed by a small hypersphere (Ruff et al., reference 2). Given a network $\phi_\theta$, the soft boundary objective is

\[ \min_{R, \theta} \; R^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \max\!\big(0,\; \lVert \phi_\theta(x_i) - c \rVert^2 - R^2\big) + \frac{\lambda}{2}\lVert \theta \rVert^2, \]

where $c$ is a fixed center, $R$ the radius, and $\nu \in (0,1]$ controls the fraction of points allowed outside the sphere. As in the classical $\nu$ support vector formulation, $\nu$ is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors, which gives it a direct operational meaning. The common one class simplification drops $R$ and simply pulls representations toward $c$:

\[ \min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \lVert \phi_\theta(x_i) - c \rVert^2 + \frac{\lambda}{2}\lVert \theta \rVert^2 . \]

The anomaly score at test time is the distance to the center, $s(x) = \lVert \phi_\theta(x) - c \rVert^2$.

The danger here is representational collapse: the trivial solution $\phi_\theta \equiv c$ achieves zero loss and is useless. It is worth seeing why the obvious failure modes arise. If the network had a bias term in its final layer it could output the constant $c$ regardless of input; if it used a bounded saturating activation it could drive all inputs to a saturated corner that maps to $c$. Deep SVDD therefore rules out both: it uses no bias terms anywhere and only unbounded, non saturating activations such as ReLU. The center $c$ is fixed to the mean of the initial network outputs rather than learned, since a learnable $c$ together with the trivial map yields the degenerate optimum. Even with these guards, collapse and the choice of $c$ remain practical concerns, and many practitioners initialize $\phi_\theta$ from an autoencoder encoder so that the geometry is meaningful before the compactness objective is applied.

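# Deep SVDD one-class loss (PyTorch sketch, not runnable as is)
# c: fixed center, set once to the mean of the initial outputs net(x) on training data
z = net(x)                                  # net has no bias terms
dist_sq = ((z - c) ** 2).sum(dim=1)         # squared distance to the center, per sample
loss = dist_sq.mean() + l2_penalty          # l2_penalty = (lambda / 2) * sum of squared weights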
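The collapse argument can be spelled out in two lines. The sketch assumes a final affine layer $\phi_\theta(x) = W h_\theta(x) + b$, where $W$, $b$, and the penultimate features $h_\theta$ are notation introduced here for illustration.

\[ W = 0,\; b = c \;\Longrightarrow\; \phi_\theta(x) = c \;\; \text{for all } x \;\Longrightarrow\; \frac{1}{n} \sum_{i=1}^{n} \lVert \phi_\theta(x_i) - c \rVert^2 = 0 . \]

\[ b \equiv 0,\; W = 0 \;\Longrightarrow\; \phi_\theta(x) = 0 \neq c \quad \text{whenever } c \neq 0 . \]

With a bias, the data term is minimized by a map that ignores its input. Without one, zeroing the last layer sends every input to the origin, so this shortcut is closed as long as the center is not the origin. That rules out the simplest route to collapse but does not by itself guarantee a useful representation.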

A useful conceptual link: minimizing volume around normal data (SVDD) and minimizing reconstruction error (autoencoders) are two faces of the same goal, learning a compact description of normality. Both can be read as estimating a level set $\mathcal{R}_\alpha$ of $p_{\text{normal}}$, the hypersphere by bounding a region in feature space and the autoencoder by bounding the residual off the manifold. Hybrid objectives that combine reconstruction with a compactness term often outperform either alone.

154.4 4. GAN Based Deep Anomaly Detection

Generative adversarial networks (GANs) learn a generator $G$ mapping latent noise to data and a discriminator $D$ separating real from generated samples. After training $G$ on normal data only, anomalies are detected by how poorly the model can explain a test input.

154.4.1 4.1 AnoGAN and its successors

The original AnoGAN (Schlegl et al., reference 3) scores an input by searching the latent space for the code $z$ whose generation best matches $x$:

\[ s(x) = (1 - \lambda)\, \underbrace{\lVert x - G(z^\ast) \rVert_1}_{\text{residual}} + \lambda \, \underbrace{\lVert f_D(x) - f_D(G(z^\ast)) \rVert_1}_{\text{discrimination}}, \]

where $z^\ast$ is found by gradient descent at inference and $f_D$ denotes discriminator features. The combination of a residual term (pixel mismatch) and a feature matching term (mismatch in discriminator space) is the recurring template for GAN based scores. The crippling cost is the per sample latent optimization, which makes inference slow.
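The per sample latent search is the expensive step, and a toy version makes the cost visible. The sketch below makes strong simplifying assumptions that are not part of AnoGAN itself: the "generator" is a fixed linear map `W` (hypothetical), the search minimizes squared error rather than the full objective, and the discriminator feature term is dropped ($\lambda = 0$), so only the L1 residual is scored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained generator: a fixed linear map G(z) = W z.
latent_dim, data_dim = 2, 10
W = rng.normal(size=(data_dim, latent_dim))

def generate(z):
    return W @ z

def latent_search(x, steps=500):
    """Gradient descent on z to minimize ||x - G(z)||^2, run once per test input."""
    lr = 0.5 / np.linalg.norm(W, 2) ** 2      # step size chosen to be stable for this quadratic
    z = np.zeros(latent_dim)
    for _ in range(steps):
        grad = -2.0 * W.T @ (x - generate(z))  # gradient of ||x - W z||^2 with respect to z
        z -= lr * grad
    return z

def residual_score(x):
    z_star = latent_search(x)
    return float(np.abs(x - generate(z_star)).sum())   # L1 residual term of s(x)

x_normal = generate(np.array([0.5, -1.0]))             # lies in the generator's range
x_anom = x_normal + 2.0 * rng.normal(size=data_dim)    # pushed off the range

print("score, in-range input :", residual_score(x_normal))
print("score, off-range input:", residual_score(x_anom))
```

Every call to `residual_score` runs hundreds of generator evaluations. With a deep generator in place of `W`, that loop is what makes inference slow.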

Later methods remove that cost by adding a learned encoder. Efficient GAN based detection and f-AnoGAN add an encoder $E$ so that $z^\ast \approx E(x)$ is produced in a single forward pass. GANomaly (Akcay et al., reference 4) trains an encoder, decoder, encoder pipeline and scores by the distance between the latent code of the input and that of its re encoded reconstruction, $\lVert E(x) - \hat{E}(G(E(x))) \rVert_1$, where $\hat{E}$ is the second encoder and $G$ the decoder. The comparison happens in latent space, which sidesteps pixel space comparison at scoring time.

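# GANomaly-style latent consistency score (PyTorch sketch, not runnable as is)
z   = enc1(x)        # latent code of the input
x_h = dec(z)         # reconstruction
z_h = enc2(x_h)      # latent code of the reconstruction (second encoder)
score = (z - z_h).abs().flatten(1).mean(dim=1)   # per-sample L1 gap; flatten handles conv latents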

154.4.2 4.2 Practical reality

GAN based detectors can capture sharp, high frequency normal structure that blurry autoencoders miss, which helps on textured images. They inherit GAN training instability, mode collapse, and sensitivity to hyperparameters, and they rarely dominate simpler reconstruction or feature based baselines once those baselines are tuned. They remain valuable when realistic generation of normal samples is itself useful, for example to synthesize plausible counterfactuals.

154.5 5. Density Based Deep Methods

Density based methods estimate $p_{\text{normal}}(x)$ (or a density in a learned feature space) and flag low density points. Doing this directly in pixel space is hopeless in high dimensions, so the deep variants estimate density on learned representations.

154.5.1 5.1 Deep autoencoding Gaussian mixtures and normalizing flows

DAGMM (Zong et al., reference 5) couples a compression network (an autoencoder) with an estimation network that fits a Gaussian mixture model over the joint of the latent code and the reconstruction error features, trained end to end so the representation is shaped to be density friendly. The energy of the mixture, the negative log likelihood, is the anomaly score.

Normalizing flows offer exact likelihoods through an invertible map $h = T(x)$ with tractable Jacobian, by the change of variables formula:

\[ \log p_X(x) = \log p_H(T(x)) + \log \left| \det \frac{\partial T}{\partial x} \right| . \]

The two terms have a clean reading. The first rewards mapping $x$ to a high density region of the base distribution $p_H$ (usually a standard Gaussian); the second, the log absolute Jacobian determinant, accounts for how the transformation locally expands or contracts volume, ensuring the result integrates to one. Flows applied to features from a pretrained backbone are among the strongest detectors for industrial image inspection, because they model the density of rich semantic features rather than raw pixels. The same out of distribution likelihood caveat from Section 2.3 applies: flows trained on one dataset can assign high likelihood to structurally simple inputs from another, so the choice of feature space and, sometimes, likelihood ratio corrections matter.
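The change of variables formula can be sanity checked on the simplest possible flow. The sketch below assumes a hypothetical elementwise affine map $T(x) = (x - \text{shift}) / \text{scale}$ with a standard Gaussian base, for which the exact answer is known: the induced density is a Gaussian with mean `shift` and standard deviations `scale`.

```python
import numpy as np

# Hypothetical elementwise affine flow h = T(x) = (x - shift) / scale.
shift = np.array([1.0, -2.0, 0.5])
scale = np.array([2.0, 0.5, 3.0])

def flow_log_prob(x):
    h = (x - shift) / scale
    log_base = -0.5 * (h ** 2).sum() - 0.5 * h.size * np.log(2 * np.pi)  # log p_H(T(x))
    log_det = -np.log(scale).sum()     # log |det dT/dx| for a diagonal Jacobian
    return log_base + log_det

def gaussian_log_prob(x):
    """Direct log density of N(shift, diag(scale^2)), for comparison."""
    var = scale ** 2
    return -0.5 * ((x - shift) ** 2 / var).sum() - 0.5 * np.log(2 * np.pi * var).sum()

x = np.array([0.3, -1.5, 2.0])
print(flow_log_prob(x), gaussian_log_prob(x))
```

The two numbers agree, and dropping `log_det` would break the match. The Jacobian term is what keeps the density normalized when the map stretches or shrinks space.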

154.5.2 5.2 Feature memory and nearest neighbor density

A simple and remarkably strong family stores a memory bank of normal feature vectors from a frozen pretrained backbone and scores a test patch by its distance to the nearest stored normal feature. This is a $k$ nearest neighbor density estimate in feature space: the estimated local density is inversely proportional to the volume of the ball that reaches the $k$th neighbor, so it decreases monotonically as that distance grows, and a large nearest neighbor distance means low estimated density. Thresholding the distance is therefore the level set rule of Section 1.1. The PatchCore approach (Roth et al., reference 10) is the prominent example, using a coreset subsampled memory bank to keep the bank small while preserving coverage. At publication it reported state of the art localization on standard benchmarks with no generative training at all. The lesson is that representation quality often matters more than the scoring mechanism: good frozen features plus a trivial nearest neighbor rule can beat elaborate end to end models.
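The scoring rule itself fits in a few lines. The following NumPy sketch assumes features have already been extracted; the random vectors stand in for backbone features, and coreset subsampling is omitted, so this illustrates the nearest neighbor score rather than PatchCore as published.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory bank: 1000 "normal" feature vectors of dimension 16.
memory_bank = rng.normal(size=(1000, 16))

def knn_score(features, bank, k=1):
    """Distance from each query feature to its k-th nearest neighbor in the bank."""
    # Pairwise squared distances, shape (n_queries, n_bank).
    d2 = ((features[:, None, :] - bank[None, :, :]) ** 2).sum(axis=-1)
    kth = np.partition(d2, k - 1, axis=1)[:, k - 1]
    return np.sqrt(kth)

normal_queries = rng.normal(size=(50, 16))
shifted_queries = rng.normal(size=(50, 16)) + 4.0   # far from everything in the bank

s_normal = knn_score(normal_queries, memory_bank)
s_shifted = knn_score(shifted_queries, memory_bank)

print("mean score, normal queries :", s_normal.mean())
print("mean score, shifted queries:", s_shifted.mean())
```

The brute force distance matrix costs memory and time proportional to the bank size, which is the motivation for subsampling the bank or using an approximate nearest neighbor index in practice.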

154.6 6. Self Supervised Anomaly Detection

Self supervised learning builds a pretext task whose solution requires understanding normal structure, then derives anomaly scores from the model’s behavior on that task. The appeal is that no anomalies and no labels are needed, only the design of a task that normal data solves easily and anomalies do not.

154.6.1 6.1 Geometric and transformation prediction

A foundational instance (Golan and El-Yaniv, reference 7) trains a classifier to predict which of $K$ geometric transformations (rotations, flips, translations) was applied to an input. Normal data yields confident, correct predictions; anomalies, lacking the learned regularities, produce diffuse predictions. The score aggregates the model’s confidence across transformations, for example

\[ s(x) = -\sum_{k=1}^{K} \log p_\theta\big(k \mid t_k(x)\big), \]

where $t_k$ is the $k$th transformation and $p_\theta(k \mid \cdot)$ the predicted probability of the applied transformation. A normal input keeps each term small (the model is confident and correct), so the sum is small; an anomaly spreads probability mass and inflates every term. The principle generalizes: any auxiliary task that normal data solves through its specific structure can expose anomalies through degraded task performance.
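A short numerical check shows the range of this score. The sketch assumes the classifier outputs are already collected in a $K \times K$ matrix whose row $k$ holds the predicted distribution for $t_k(x)$; the two matrices are hypothetical, not outputs of a trained model.

```python
import numpy as np

def transformation_score(probs):
    """probs[k, j] = p_theta(j | t_k(x)); the score sums -log of the diagonal entries."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.log(np.diag(probs)).sum())

K = 4
# Hypothetical outputs for a normal input: confident and correct on every transformation.
confident = np.full((K, K), 0.01)
np.fill_diagonal(confident, 0.97)
# Hypothetical outputs for an anomaly: uniform over the K transformations.
diffuse = np.full((K, K), 1.0 / K)

s_normal = transformation_score(confident)
s_anom = transformation_score(diffuse)
print("normal :", s_normal)
print("anomaly:", s_anom, "(equals K * log K for uniform predictions)")
```

A fully uninformed classifier yields $K \log K$, which gives a natural reference level. A model that is confidently wrong on a transformation scores even higher.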

154.6.2 6.2 Contrastive and outlier exposure approaches

Contrastive representation learning, which pulls augmented views of the same sample together and pushes different samples apart, produces features in which simple one class scores work well. Methods in this vein (for example CSI, Tack et al., reference 8) combine a contrastive objective with a compactness or distribution shifting transformation so that the learned space is both discriminative and tight around normal data.

When even a small or synthetic set of anomalies is available, outlier exposure (Hendrycks et al., reference 12) trains the model to produce uniform or high uncertainty outputs on auxiliary outliers while remaining confident on normal data, sharpening the decision boundary. A closely related practical trick is synthetic anomaly generation, cutting and pasting patches or otherwise corrupting normal images to create pseudo anomalies, which converts the one class problem into a supervised classification or segmentation problem and yields strong, well localized detectors (CutPaste, Li et al., reference 9, being a representative example).
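Generating such pseudo anomalies takes very little code. The sketch below is a simplified cut and paste augmentation in NumPy: it copies one square patch to an independently drawn location and returns a mask of the pasted region. The function name and arguments are hypothetical, and the published CutPaste method adds further patch variations that are omitted here.

```python
import numpy as np

def cut_paste(image, patch_size, rng):
    """Copy a random square patch to an independently drawn location.

    Returns the augmented image and a boolean mask marking the pasted region.
    """
    h, w = image.shape[:2]
    p = patch_size
    src_r, src_c = rng.integers(0, h - p + 1), rng.integers(0, w - p + 1)
    dst_r, dst_c = rng.integers(0, h - p + 1), rng.integers(0, w - p + 1)
    out = image.copy()
    out[dst_r:dst_r + p, dst_c:dst_c + p] = image[src_r:src_r + p, src_c:src_c + p]
    mask = np.zeros((h, w), dtype=bool)
    mask[dst_r:dst_r + p, dst_c:dst_c + p] = True
    return out, mask

rng = np.random.default_rng(0)
image = rng.random((32, 32))            # stand-in for a normal training image
augmented, mask = cut_paste(image, patch_size=8, rng=rng)
print("pasted pixels:", int(mask.sum()))
```

The pair `(augmented, mask)` can serve as a labeled example: the image level label is "anomalous" and the mask is a pixel level target. Whether a detector trained this way transfers depends on how closely pasted patches resemble the real defects.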

154.6.3 6.3 Why self supervision often wins

Self supervised detectors frequently outperform reconstruction and one class baselines because the pretext task forces the network to learn discriminative, semantically meaningful features rather than the smooth, averaging representations that reconstruction objectives encourage. The cost is task design: a pretext task poorly matched to the anomalies of interest provides no signal, so domain knowledge about what makes an anomaly anomalous remains essential.

154.7 7. A Worked Example: Why Reconstruction Can Miss an Anomaly

A small linear example makes the central failure mode concrete and quantitative. Suppose normal data is two dimensional and concentrated on the first axis: $x = (a, \varepsilon)$ with $a \sim \mathcal{N}(0, 1)$ and a tiny off axis noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma = 0.01$. A bottleneck autoencoder with $d = 1$ learns essentially the projection onto the first axis, $g(f(x)) = (a, 0)$, because that captures almost all of the variance. Its reconstruction score is the squared off axis residual, $s(x) = \varepsilon^2$.

Now consider two test inputs. Input $u = (50, 0)$ is a clear point anomaly: it lies far out along the normal axis, fifty standard deviations from the mean. Yet it sits exactly on the learned subspace, so its reconstruction is perfect and its score is $s(u) = 0$. Input $v = (0, 0.5)$ is only half a unit off axis, but that is fifty standard deviations along the thin direction, and its score is $s(v) = 0.25$, enormous relative to the normal range of order $\sigma^2 = 10^{-4}$. The autoencoder flags $v$ loudly and misses $u$ entirely.

The lesson is general and not an artifact of linearity. Reconstruction error measures only the component of an input orthogonal to the learned manifold. Anomalies that are extreme along the manifold, the directions of high normal variance, are invisible to it, while a density or distance based score (Mahalanobis distance, Deep SVDD distance to center, nearest neighbor in feature space) catches $u$ immediately because $u$ is far from the bulk of normal data in the learned representation. This is the formal reason hybrid objectives that combine reconstruction with a compactness or density term are more robust than reconstruction alone, and a concrete instance of matching the surrogate score to the way anomalies actually deviate.
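The numbers in this example are easy to reproduce. The sketch below uses the population covariance stated above rather than an estimate from samples, and compares the reconstruction score with a squared Mahalanobis distance; the function names are introduced here for illustration.

```python
import numpy as np

sigma = 0.01
cov = np.diag([1.0, sigma ** 2])   # covariance of the normal data in the example
P = np.diag([1.0, 0.0])            # projection onto the first axis (d = 1)

def reconstruction_score(x):
    r = x - P @ x
    return float(r @ r)

def mahalanobis_sq(x):
    return float(x @ np.linalg.inv(cov) @ x)

u = np.array([50.0, 0.0])   # extreme along the manifold
v = np.array([0.0, 0.5])    # off the manifold

for name, x in [("u", u), ("v", v)]:
    print(name, "reconstruction:", reconstruction_score(x),
          "Mahalanobis^2:", mahalanobis_sq(x))
```

The reconstruction score is 0 for $u$ and 0.25 for $v$, while the squared Mahalanobis distance is 2500 for both: each input is fifty standard deviations out, and only the density aware score treats them alike.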

154.8 8. Evaluation and Deployment Considerations

Threshold free metrics (AUROC, AUPRC) are the standard for benchmarking. AUROC is the probability that a random anomaly is ranked above a random normal point, which makes it invariant to the base rate $\pi$ and therefore attractive but also misleading: under heavy imbalance an ROC curve can look strong while the precision at any useful operating point is poor. AUPRC is preferred when anomalies are extremely rare, because its baseline equals the positive rate $\pi$ and it directly reflects the alert precision an operator will experience. For localization tasks, pixel level AUROC and the per region overlap metric (PRO) are reported alongside image level scores; the per region metric weights small and large defects equally and is harder to inflate than pixel AUROC. Beware contaminated test protocols and information leakage; many published gains evaporate under careful, leakage free evaluation.
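The contrast between the two metrics under imbalance can be simulated directly. The sketch below assumes a hypothetical score model (normal scores from $\mathcal{N}(0, 1)$, anomaly scores from $\mathcal{N}(2, 1)$) and implements both metrics in plain NumPy; average precision is used as the AUPRC estimate.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random anomaly outscores a random normal point (no ties assumed)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    """Mean precision at the rank of each anomaly, a common AUPRC estimate."""
    order = np.argsort(-scores)
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_k[hits == 1].mean()

rng = np.random.default_rng(0)

def simulate(n_normal, n_anom):
    # Hypothetical score model: same separation in both settings, only the base rate differs.
    scores = np.concatenate([rng.normal(0, 1, n_normal), rng.normal(2, 1, n_anom)])
    labels = np.concatenate([np.zeros(n_normal, int), np.ones(n_anom, int)])
    return scores, labels

results = {}
for name, (n_normal, n_anom) in {"balanced": (1000, 1000), "rare": (100_000, 100)}.items():
    s, y = simulate(n_normal, n_anom)
    results[name] = (auroc(s, y), average_precision(s, y))
    print(f"{name:9s} AUROC={results[name][0]:.3f}  AP={results[name][1]:.3f}")
```

The detector is identical in both runs, and AUROC barely moves, yet average precision drops sharply when anomalies are one in a thousand. That drop is what an operator experiences as a flood of false alerts.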

Deployment raises issues the offline benchmark hides.

Distribution shift moves the normal manifold over time (seasonality, sensor drift, software updates), so scores must be recalibrated and thresholds adapted rather than fixed once. A useful discipline is to monitor the score distribution itself for drift, not only the labeled outcomes.
Contamination of the assumed clean training set with undetected anomalies biases every one class method toward enclosing those anomalies as normal; robust training, trimming the highest scoring training points, or iterative cleaning may be needed.
Explainability matters operationally: a reconstruction error map, the nearest normal neighbor, or the failed transformation gives an analyst something to act on, whereas a bare score does not.
Threshold selection, the step that turns scores into actionable alert rates while accounting for the cost asymmetry between misses and false alarms, is usually more consequential to the deployed system than a marginal gain in AUROC. When a target alert budget is known, set $\tau$ from an empirical quantile of validation scores rather than from a fixed score value, since the score scale drifts but the quantile is comparatively stable.
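The quantile rule for $\tau$ can be sketched as follows. The validation scores, the 1% alert budget, and the multiplicative drift are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation scores from presumed-normal inputs.
val_scores = rng.exponential(scale=1.0, size=10_000)

alert_budget = 0.01                                  # alert on about 1% of normal inputs
tau = np.quantile(val_scores, 1.0 - alert_budget)

def decide(scores, tau):
    """Decision rule y_hat = 1[s(x) > tau]."""
    return (np.asarray(scores) > tau).astype(int)

alert_rate = decide(val_scores, tau).mean()

# Simulated drift: the score scale inflates by 50% while the data stays normal.
drifted = 1.5 * val_scores
rate_fixed_tau = decide(drifted, tau).mean()                 # stale threshold
tau_refreshed = np.quantile(drifted, 1.0 - alert_budget)     # quantile recomputed after drift
rate_refreshed = decide(drifted, tau_refreshed).mean()

print("alert rate at calibration      :", alert_rate)
print("after drift, stale threshold   :", rate_fixed_tau)
print("after drift, refreshed quantile:", rate_refreshed)
```

A fixed score value overshoots the budget several times over once the scale drifts, while recomputing the quantile on recent scores restores it. Recalibration of this kind assumes the recent window is still mostly normal.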

154.9 9. Choosing a Method: When to Use, and Pitfalls

No single method dominates. The table summarizes the practical tradeoffs; the prose that follows gives the heuristic.

Family	Use when	Main pitfall
Autoencoder reconstruction	Bespoke unlabeled data, need an interpretable baseline and anomaly maps	Generalizes to anomalies; blind to on manifold outliers
Deep SVDD / one class	Tight one class geometry wanted, collapse controllable	Representational collapse; sensitive to center and architecture
GAN based	High fidelity generation of normal data is independently useful	Training instability; rarely beats tuned simpler baselines
Normalizing flow on features	Strong pretrained backbone, density on semantic features	Raw likelihood unreliable out of distribution; needs ratio correction
Feature memory nearest neighbor	A strong frozen backbone exists, localization matters	Memory and latency scale with bank size; backbone domain mismatch
Self supervised	A pretext task or synthetic anomalies fit the target anomalies	No signal if the task is mismatched to real anomalies

As a practical heuristic: when a strong pretrained backbone exists for the domain, start with frozen features plus nearest neighbor or normalizing flow density, since these are simple and competitive. When data is bespoke and unlabeled, an autoencoder (denoising or memory augmented) gives a robust, interpretable baseline, but pair it with a distance or density score so on manifold anomalies are not missed (Section 7). When tight one class geometry is desired and collapse can be controlled, Deep SVDD or a hybrid reconstruction plus compactness objective is attractive. When a meaningful pretext task or synthetic anomalies can be designed, self supervised approaches typically give the best accuracy. GAN based methods are a specialized tool, justified mainly when high fidelity generation of normal data is independently valuable.

For tooling, mature open source options cover this entire stack and are worth preferring over bespoke or proprietary code. The PyOD library collects a broad set of classical and deep detectors behind a uniform interface, Anomalib targets deep image anomaly detection and localization with reference implementations of flow and memory based methods, and PyTorch with scikit-learn covers the building blocks (autoencoders, nearest neighbor, mixtures) when a custom pipeline is needed.

In every case, the governing discipline is the one from Section 1: state the normality assumption, verify that anomalies plausibly violate it, match the surrogate score to the way they deviate, and evaluate without leakage.

154.10 References

1. Ruff, L., et al. (2021). A Unifying Review of Deep and Shallow Anomaly Detection. Proceedings of the IEEE. https://doi.org/10.1109/JPROC.2021.3052449
2. Ruff, L., et al. (2018). Deep One Class Classification (Deep SVDD). ICML. https://proceedings.mlr.press/v80/ruff18a.html
3. Schlegl, T., et al. (2017). Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery (AnoGAN). IPMI. https://doi.org/10.1007/978-3-319-59050-9_12
4. Akcay, S., Atapour-Abarghouei, A., Breckon, T. (2018). GANomaly: Semi Supervised Anomaly Detection via Adversarial Training. ACCV. https://doi.org/10.1007/978-3-030-20893-6_39
5. Zong, B., et al. (2018). Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection (DAGMM). ICLR. https://openreview.net/forum?id=BJJLHbb0-
6. Gong, D., et al. (2019). Memorizing Normality to Detect Anomaly: Memory Augmented Deep Autoencoder (MemAE). ICCV. https://doi.org/10.1109/ICCV.2019.00179
7. Golan, I., El-Yaniv, R. (2018). Deep Anomaly Detection Using Geometric Transformations. NeurIPS. https://arxiv.org/abs/1805.10917
8. Tack, J., et al. (2020). CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances. NeurIPS. https://arxiv.org/abs/2007.08176
9. Li, C.-L., et al. (2021). CutPaste: Self Supervised Learning for Anomaly Detection and Localization. CVPR. https://doi.org/10.1109/CVPR46437.2021.00954
10. Roth, K., et al. (2022). Towards Total Recall in Industrial Anomaly Detection (PatchCore). CVPR. https://doi.org/10.1109/CVPR52688.2022.01392
11. Nalisnick, E., et al. (2019). Do Deep Generative Models Know What They Don’t Know? ICLR. https://arxiv.org/abs/1810.09136
12. Hendrycks, D., Mazeika, M., Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. ICLR. https://arxiv.org/abs/1812.04606
13. Ren, J., et al. (2019). Likelihood Ratios for Out of Distribution Detection. NeurIPS. https://arxiv.org/abs/1906.02845

# Deep Anomaly Detection Anomaly detection asks a deceptively simple question: which observations do not belong? In low dimensional, well behaved data the classical answers (Gaussian models, distance based outlier scores, kernel methods such as one class support vector machines, and tree based isolation) often suffice. They break down when the data is high dimensional, structured, and rich in nuisance variation: images, network traffic, sensor streams, financial transactions, and medical scans. The difficulty is the curse of dimensionality. In high dimensions distances concentrate, volumes are dominated by their boundary, and any fixed sample becomes vanishingly sparse, so raw distance and density estimates lose discriminative power. Deep anomaly detection addresses this regime by learning representations in which normality is compact and deviations are exposed. This chapter develops the main families of deep methods, the assumptions that make them work, the mathematics that explains when they fail, and the practical tradeoffs that govern their use. The unifying review by Ruff et al. (reference 1) is a useful companion throughout. ## 1. Problem Setting and Assumptions Let $x \in \mathcal{X}$ be drawn from an unknown distribution. We posit a dominant "normal" distribution $p_{\text{normal}}$ and treat anomalies as samples from a different, typically unknown and under sampled, process $p_{\text{anom}}$. The observed data is a mixture $$ p(x) = (1 - \pi)\, p_{\text{normal}}(x) + \pi\, p_{\text{anom}}(x), \qquad \pi \ll 1, $$ where the contamination rate $\pi$ is small. The goal is a scoring function $s: \mathcal{X} \to \mathbb{R}$ such that anomalous inputs receive higher scores. A threshold $\tau$ converts scores into a decision rule $\hat{y}(x) = \mathbb{1}[s(x) > \tau]$, and downstream metrics such as area under the ROC curve (AUROC) or area under the precision recall curve (AUPRC) summarize ranking quality independent of $\tau$. 
It is worth distinguishing three notions that the word "anomaly" conflates. A **point anomaly** is a single observation that is improbable in isolation. A **contextual anomaly** is normal in general but improbable given its context, for example a temperature of thirty degrees is unremarkable in summer and anomalous in winter. A **collective anomaly** is a set of points each individually plausible whose joint pattern is not, for example an otherwise normal sequence of packets that together constitute a port scan. Different method families implicitly target different notions, and a detector tuned for point anomalies will often miss contextual or collective ones. Three labeling regimes recur. In the **unsupervised** setting the training set is unlabeled and assumed mostly normal. In the **semi supervised** (one class) setting the training set is clean normal data only. In the **supervised** setting a few labeled anomalies are available, usually far too few to train a conventional classifier. Most deep methods target the first two regimes because anomalies are rare, diverse, and expensive to label. ### 1.1 Two assumptions, made precise Nearly every deep detector rests on two assumptions. State them explicitly, because every failure mode below traces back to a violation of one of them. **Assumption 1 (concentration / manifold).** Normal data lies on or near a low dimensional manifold $\mathcal{M} \subset \mathcal{X}$, or equivalently in a compact high density region $\mathcal{R}_\alpha = \{x : p_{\text{normal}}(x) \ge \alpha\}$ whose volume is small. **Assumption 2 (separation).** Anomalies violate the regularities the model has internalized. Formally, the model induces a representation or density under which anomalies map outside $\mathcal{R}_\alpha$, reconstruct poorly, or receive low likelihood. These assumptions make the task well posed. 
By the Neyman Pearson lemma, if both $p_{\text{normal}}$ and $p_{\text{anom}}$ were known the optimal detector at any false alarm rate would threshold the likelihood ratio $p_{\text{anom}}(x) / p_{\text{normal}}(x)$. We almost never know $p_{\text{anom}}$, so deep methods substitute a surrogate: thresholding low density (an implicit ratio against a uniform background), large reconstruction error, or large distance to a learned center. Each surrogate is the Neyman Pearson rule under a particular, usually unstated, assumption about how anomalies are distributed. Recognizing that assumption is the difference between a detector that works and one that fails silently.

When the assumptions fail (for example when anomalies are themselves common so $\pi$ is not small, or when normality is multimodal and the model under fits it) detectors degrade. Keeping the assumptions explicit is the single most useful habit in applied anomaly detection.

### 1.2 A map of the methods

The families below differ mainly in how they operationalize Assumption 2: by reconstruction, by geometry, by generation, by density, or by an auxiliary self supervised task.

```{mermaid}
flowchart TD
    A["Deep anomaly detection"] --> B["Reconstruction based"]
    A --> C["One class geometric"]
    A --> D["Generative or density based"]
    A --> E["Self supervised"]
    B --> B1["Autoencoder error"]
    B --> B2["VAE negative ELBO"]
    C --> C1["Deep SVDD hypersphere"]
    D --> D1["GAN reconstruction"]
    D --> D2["Normalizing flow density"]
    D --> D3["Feature memory nearest neighbor"]
    E --> E1["Transformation prediction"]
    E --> E2["Contrastive plus compactness"]
    E --> E3["Synthetic anomaly segmentation"]
```

## 2. Autoencoder Reconstruction Error

### 2.1 The core idea

An autoencoder learns an encoder $f_\theta: \mathcal{X} \to \mathbb{R}^d$ and a decoder $g_\phi: \mathbb{R}^d \to \mathcal{X}$ trained to minimize reconstruction loss on (mostly) normal data:

$$
\mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - g_\phi(f_\theta(x_i)) \rVert^2 .
$$

The bottleneck dimension $d$ forces the network to capture the dominant factors of variation in normal data. At test time the anomaly score is the reconstruction error,

$$
s(x) = \lVert x - g_\phi(f_\theta(x)) \rVert^2,
$$

on the hypothesis that the decoder, never having seen anomalous structure, reconstructs it poorly.

There is a clean intuition for why this can work. A linear autoencoder with squared loss recovers principal component analysis: the optimal encoder projects onto the top $d$ principal subspace and the score is the squared residual outside that subspace, $\lVert (I - P_d) x \rVert^2$, where $P_d$ projects onto the leading eigenvectors of the normal covariance. Reconstruction based anomaly detection is therefore a nonlinear generalization of "distance to the dominant subspace of normal data." This identity also exposes the central risk: if the learned subspace (or its nonlinear analogue) is large enough to contain anomalous directions, the residual collapses and the score is blind.

```python
# Reconstruction score (PyTorch sketch, not runnable as is)
z = encoder(x)
x_hat = decoder(z)
score = ((x - x_hat) ** 2).flatten(1).mean(dim=1)
```

### 2.2 Why it can fail and how to harden it

The central pathology is **generalization to anomalies**: a sufficiently expressive autoencoder may reconstruct inputs it has never seen, including anomalies, by exploiting local image statistics or identity like shortcuts. Several remedies exist.
A **denoising** objective trains the model to map a corrupted input $\tilde{x}$ back to the clean $x$, which discourages identity mappings and ties the model to the normal manifold; under a Gaussian corruption the denoising autoencoder learns a direction proportional to the score $\nabla_x \log p_{\text{normal}}(x)$, giving the residual a probabilistic reading. **Memory augmented** autoencoders (MemAE, reference 6) restrict the latent code to a convex combination of a learned dictionary of normal prototypes, so anomalous codes cannot be represented and reconstruction error rises. **Sparsity** and **contractive** penalties regularize the latent geometry. Choosing $d$ matters: too large invites memorization, too small under fits normal variation and produces false alarms.

Reconstruction error is also sensitive to the loss metric. Squared error in pixel space penalizes high frequency detail unevenly; perceptual or feature space distances (errors measured on activations of a pretrained network) often localize defects better, which matters for tasks such as industrial inspection where the goal is a pixel level anomaly map rather than a single score.

### 2.3 Variational and probabilistic variants

A variational autoencoder (VAE) places a prior $p(z)$ on the latent and maximizes the evidence lower bound,

$$
\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\theta(z \mid x)}[\log p_\phi(x \mid z)] - D_{\text{KL}}\big(q_\theta(z \mid x) \,\Vert\, p(z)\big) .
$$

The negative ELBO, or the reconstruction probability under the decoder, serves as a score with a probabilistic interpretation.

A caution that recurs in the literature: likelihood based generative models can assign **higher** likelihood to certain out of distribution inputs than to in distribution ones (Nalisnick et al., reference 11). The reason is structural.
The log likelihood of a flow or VAE is dominated by low level statistics; a model trained on a complex dataset can assign higher density to a simpler, smoother input simply because smooth inputs sit near the high probability mode of a typical pixel distribution, independent of semantic content. Concretely, models trained on natural images can rate plainer images as more probable than their own training data. Raw likelihood is therefore not a reliable anomaly score on its own.

A principled fix is a **likelihood ratio** against a background model $p_0$ that captures the same low level statistics,

$$
s(x) = -\log \frac{p_{\text{normal}}(x)}{p_0(x)},
$$

which cancels the input agnostic component and isolates the semantic part of the density (Ren et al., reference 13). This ratio is exactly the Neyman Pearson statistic of Section 1.1 with $p_0$ standing in for the unknown anomaly distribution.

## 3. Deep SVDD and One Class Representation Learning

Deep Support Vector Data Description (Deep SVDD) learns a representation in which normal data is enclosed by a small hypersphere (Ruff et al., reference 2). Given a network $\phi_\theta$, the soft boundary objective is

$$
\min_{R, \theta} \; R^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \max\!\big(0,\; \lVert \phi_\theta(x_i) - c \rVert^2 - R^2\big) + \frac{\lambda}{2}\lVert \theta \rVert^2,
$$

where $c$ is a fixed center, $R$ the radius, and $\nu \in (0,1]$ controls the fraction of points allowed outside the sphere. As in the classical $\nu$ support vector formulation, $\nu$ is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors, which gives it a direct operational meaning. The common **one class** simplification drops $R$ and simply pulls representations toward $c$:

$$
\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \lVert \phi_\theta(x_i) - c \rVert^2 + \frac{\lambda}{2}\lVert \theta \rVert^2 .
$$

The anomaly score at test time is the distance to the center, $s(x) = \lVert \phi_\theta(x) - c \rVert^2$.

The danger here is **representational collapse**: the trivial solution $\phi_\theta \equiv c$ achieves zero loss and is useless. It is worth seeing why the obvious failure modes arise. If the network had a bias term in its final layer it could output the constant $c$ regardless of input; if it used a bounded saturating activation it could drive all inputs to a saturated corner that maps to $c$. Deep SVDD therefore forbids both: no bias terms anywhere, and unbounded non saturating activations. The center $c$ is fixed to the mean of the initial network outputs rather than learned, since a learnable $c$ together with the trivial map yields the degenerate optimum. Even with these guards, collapse and the choice of $c$ remain practical concerns, and many practitioners initialize $\phi_\theta$ from an autoencoder encoder so that the geometry is meaningful before the compactness objective is applied.

```python
# Deep SVDD one-class loss (sketch)
z = net(x)  # net has no bias terms
loss = ((z - c) ** 2).sum(dim=1).mean() + l2_penalty
```

A useful conceptual link: minimizing volume around normal data (SVDD) and minimizing reconstruction error (autoencoders) are two faces of the same goal, learning a compact description of normality. Both can be read as estimating a level set $\mathcal{R}_\alpha$ of $p_{\text{normal}}$, the hypersphere by bounding a region in feature space and the autoencoder by bounding the residual off the manifold. Hybrid objectives that combine reconstruction with a compactness term often outperform either alone.

## 4. GAN Based Deep Anomaly Detection

Generative adversarial networks (GANs) learn a generator $G$ mapping latent noise to data and a discriminator $D$ separating real from generated samples. After training $G$ on normal data only, anomalies are detected by how poorly the model can explain a test input.

### 4.1 AnoGAN and its successors

The original **AnoGAN** (Schlegl et al., reference 3) scores an input by searching the latent space for the code $z$ whose generation best matches $x$:

$$
s(x) = (1 - \lambda)\, \underbrace{\lVert x - G(z^\ast) \rVert_1}_{\text{residual}} + \lambda \, \underbrace{\lVert f_D(x) - f_D(G(z^\ast)) \rVert_1}_{\text{discrimination}},
$$

where $z^\ast$ is found by gradient descent at inference and $f_D$ denotes discriminator features. The combination of a **residual** term (pixel mismatch) and a **feature matching** term (mismatch in discriminator space) is the recurring template for GAN based scores. The crippling cost is the per sample latent optimization, which makes inference slow.

Later methods remove that cost by training an encoder jointly with the GAN. **Efficient GAN based** detection and **f-AnoGAN** add an encoder $E$ so that $z^\ast \approx E(x)$ is produced in a single forward pass. **GANomaly** (Akcay et al., reference 4) trains an encoder, decoder, encoder pipeline and scores by the distance between the latent code of the input and the re encoded reconstruction, $\lVert E(x) - E(G(E(x))) \rVert$, which sidesteps pixel space comparison entirely.

```python
# GANomaly-style latent consistency score (sketch)
z = enc1(x)
x_h = dec(z)
z_h = enc2(x_h)
score = (z - z_h).abs().mean(dim=1)
```

### 4.2 Practical reality

GAN based detectors can capture sharp, high frequency normal structure that blurry autoencoders miss, which helps on textured images. They inherit GAN training instability, mode collapse, and sensitivity to hyperparameters, and they rarely dominate simpler reconstruction or feature based baselines once those baselines are tuned. They remain valuable when realistic generation of normal samples is itself useful, for example to synthesize plausible counterfactuals.

## 5. Density Based Deep Methods

Density based methods estimate $p_{\text{normal}}(x)$ (or a density in a learned feature space) and flag low density points.
Doing this directly in pixel space is hopeless in high dimensions, so the deep variants estimate density on learned representations.

### 5.1 Deep autoencoding Gaussian mixtures and normalizing flows

**DAGMM** (Zong et al., reference 5) couples a compression network (an autoencoder) with an estimation network that fits a Gaussian mixture model over the joint of the latent code and the reconstruction error features, trained end to end so the representation is shaped to be density friendly. The energy of the mixture, the negative log likelihood, is the anomaly score.

**Normalizing flows** offer exact likelihoods through an invertible map $h = T(x)$ with tractable Jacobian, by the change of variables formula:

$$
\log p_X(x) = \log p_H(T(x)) + \log \left| \det \frac{\partial T}{\partial x} \right| .
$$

The two terms have a clean reading. The first rewards mapping $x$ to a high density region of the base distribution $p_H$ (usually a standard Gaussian); the second, the log absolute Jacobian determinant, accounts for how the transformation locally expands or contracts volume, ensuring the result integrates to one. Flows applied to features from a pretrained backbone are among the strongest detectors for industrial image inspection, because they model the density of rich semantic features rather than raw pixels. The same out of distribution likelihood caveat from Section 2.3 applies: flows trained on one dataset can assign high likelihood to structurally simple inputs from another, so the choice of feature space and, sometimes, likelihood ratio corrections matter.

### 5.2 Feature memory and nearest neighbor density

A simple and remarkably strong family stores a memory bank of normal feature vectors from a frozen pretrained backbone and scores a test patch by its distance to the nearest stored normal feature.
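
A minimal sketch of this scoring rule follows, using only the standard library. The two dimensional features are synthetic stand-ins for backbone activations, and the names `knn_score` and `bank` are hypothetical; a real system would extract patch features from a pretrained network and would typically rely on an approximate nearest neighbor index rather than a full sort.

```python
import math
import random

def knn_score(query, bank, k=1):
    """Anomaly score: Euclidean distance from `query` to its k-th nearest vector in `bank`."""
    dists = sorted(math.dist(query, b) for b in bank)
    return dists[k - 1]

rng = random.Random(0)
# Synthetic stand-in for a memory bank of normal features (assumption): a tight cluster.
bank = [(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(500)]

normal_query = (0.05, -0.02)   # lies inside the cluster
odd_query = (1.5, 1.5)         # lies far from every stored feature

print(knn_score(normal_query, bank))  # small distance, low anomaly score
print(knn_score(odd_query, bank))     # large distance, high anomaly score
```
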
This is a $k$ nearest neighbor density estimate in feature space: the distance to the $k$th neighbor is monotone in an estimate of the local normal density, so large nearest neighbor distance means low estimated density, which is exactly the level set rule of Section 1.1. The **PatchCore** approach (Roth et al., reference 10) is the prominent example, using a coreset subsampled memory bank to keep the bank small while preserving coverage. At publication it reported state of the art detection and localization on the MVTec AD benchmark with no generative training at all. The lesson is that **representation quality often matters more than the scoring mechanism**: good frozen features plus a trivial nearest neighbor rule can beat elaborate end to end models.

## 6. Self Supervised Anomaly Detection

Self supervised learning builds a pretext task whose solution requires understanding normal structure, then derives anomaly scores from the model's behavior on that task. The appeal is that no anomalies and no labels are needed, only the design of a task that normal data solves easily and anomalies do not.

### 6.1 Geometric and transformation prediction

A foundational instance (Golan and El-Yaniv, reference 7) trains a classifier to predict which of $K$ applied geometric transformations (rotations, flips, translations) was applied to an input. Normal data yields confident, correct predictions; anomalies, lacking the learned regularities, produce diffuse predictions. The score aggregates the model's confidence across transformations, for example

$$
s(x) = -\sum_{k=1}^{K} \log p_\theta\big(k \mid t_k(x)\big),
$$

where $t_k$ is the $k$th transformation and $p_\theta(k \mid \cdot)$ the predicted probability of the applied transformation. A normal input keeps each term small (the model is confident and correct), so the sum is small; an anomaly spreads probability mass and inflates every term.
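
To make the score concrete, the sketch below evaluates it on hand written probability vectors rather than on a trained classifier. The function name `transformation_score` and the example probabilities are illustrative assumptions; row $k$ of each table holds the classifier's softmax output on the input transformed by $t_k$.

```python
import math

def transformation_score(probs):
    """s(x) = -sum_k log p(k | t_k(x)); probs[k] is the softmax output for t_k(x)."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(max(row[k], eps)) for k, row in enumerate(probs))

K = 4
# Normal-like input: the classifier identifies each applied transformation confidently.
confident = [[0.97 if j == k else 0.01 for j in range(K)] for k in range(K)]
# Anomaly-like input: predictions are spread uniformly over the K transformations.
diffuse = [[1.0 / K] * K for _ in range(K)]

print(transformation_score(confident))  # equals -4 * log(0.97)
print(transformation_score(diffuse))    # equals 4 * log(4)
```

A uniform prediction over $K$ transformations contributes $\log K$ per term, so $K \log K$ is a natural reference level for a fully uninformative classifier.
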
The principle generalizes: any auxiliary task that normal data solves through its specific structure can expose anomalies through degraded task performance.

### 6.2 Contrastive and outlier exposure approaches

Contrastive representation learning, which pulls augmented views of the same sample together and pushes different samples apart, produces features in which simple one class scores work well. Methods in this vein (for example CSI, Tack et al., reference 8) combine a contrastive objective with a compactness or distribution shifting transformation so that the learned space is both discriminative and tight around normal data.

When even a small or synthetic set of anomalies is available, **outlier exposure** (Hendrycks et al., reference 12) trains the model to produce uniform or high uncertainty outputs on auxiliary outliers while remaining confident on normal data, sharpening the decision boundary. A closely related practical trick is **synthetic anomaly generation**, cutting and pasting patches or otherwise corrupting normal images to create pseudo anomalies, which converts the one class problem into a supervised classification or segmentation problem and yields strong, well localized detectors (the **CutPaste** style being a representative example, Li et al., reference 9).

### 6.3 Why self supervision often wins

Self supervised detectors frequently outperform reconstruction and one class baselines because the pretext task forces the network to learn discriminative, semantically meaningful features rather than the smooth, averaging representations that reconstruction objectives encourage. The cost is task design: a pretext task poorly matched to the anomalies of interest provides no signal, so domain knowledge about what makes an anomaly anomalous remains essential.

## 7. A Worked Example: Why Reconstruction Can Miss an Anomaly

A small linear example makes the central failure mode concrete and quantitative.
Suppose normal data is two dimensional and concentrated on the first axis: $x = (a, \varepsilon)$ with $a \sim \mathcal{N}(0, 1)$ and a tiny off axis noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, $\sigma = 0.01$. A bottleneck autoencoder with $d = 1$ learns essentially the projection onto the first axis, $g(f(x)) = (a, 0)$, because that captures almost all of the variance. Its reconstruction score is the squared off axis residual, $s(x) = \varepsilon^2$.

Now consider two test inputs. Input $u = (50, 0)$ is a clear point anomaly: it lies far out along the normal axis, fifty standard deviations from the mean. Yet it sits exactly on the learned subspace, so its reconstruction is perfect and its score is $s(u) = 0$. Input $v = (0, 0.5)$ is only half a unit off axis, but that is fifty standard deviations along the thin direction, and its score is $s(v) = 0.25$, enormous relative to the normal range of order $\sigma^2 = 10^{-4}$. The autoencoder flags $v$ loudly and misses $u$ entirely.

The lesson is general and not an artifact of linearity. Reconstruction error measures only the component of an input orthogonal to the learned manifold. Anomalies that are extreme **along** the manifold, the directions of high normal variance, are invisible to it, while a density or distance based score (Mahalanobis distance, Deep SVDD distance to center, nearest neighbor in feature space) catches $u$ immediately because $u$ is far from the bulk of normal data in the learned representation. This is the formal reason hybrid objectives that combine reconstruction with a compactness or density term are more robust than reconstruction alone, and a concrete instance of matching the surrogate score to the way anomalies actually deviate.

## 8. Evaluation and Deployment Considerations

Threshold free metrics (AUROC, AUPRC) are the standard for benchmarking.
AUROC is the probability that a random anomaly is ranked above a random normal point, which makes it invariant to the base rate $\pi$ and therefore attractive but also misleading: under heavy imbalance an ROC curve can look strong while the precision at any useful operating point is poor. AUPRC is preferred when anomalies are extremely rare, because its baseline equals the positive rate $\pi$ and it directly reflects the alert precision an operator will experience. For localization tasks, pixel level AUROC and the per region overlap metric (PRO) are reported alongside image level scores; the per region metric weights small and large defects equally and is harder to inflate than pixel AUROC. Beware contaminated test protocols and information leakage; many published gains evaporate under careful, leakage free evaluation.

Deployment raises issues the offline benchmark hides.

- **Distribution shift** moves the normal manifold over time (seasonality, sensor drift, software updates), so scores must be recalibrated and thresholds adapted rather than fixed once. A useful discipline is to monitor the score distribution itself for drift, not only the labeled outcomes.
- **Contamination** of the assumed clean training set with undetected anomalies biases every one class method toward enclosing those anomalies as normal; robust training, trimming the highest scoring training points, or iterative cleaning may be needed.
- **Explainability** matters operationally: a reconstruction error map, the nearest normal neighbor, or the failed transformation gives an analyst something to act on, whereas a bare score does not.
- **Threshold selection**, the conversion of scores into actionable alert rates with the cost asymmetry between misses and false alarms taken into account, is usually more consequential to the deployed system than a marginal gain in AUROC.
  When a target alert budget is known, set $\tau$ from an empirical quantile of validation scores rather than from a fixed score value, since the score scale drifts but the quantile is comparatively stable.

## 9. Choosing a Method: When to Use, and Pitfalls

No single method dominates. The table summarizes the practical tradeoffs; the prose that follows gives the heuristic.

| Family | Use when | Main pitfall |
| --- | --- | --- |
| Autoencoder reconstruction | Bespoke unlabeled data, need an interpretable baseline and anomaly maps | Generalizes to anomalies; blind to on manifold outliers |
| Deep SVDD / one class | Tight one class geometry wanted, collapse controllable | Representational collapse; sensitive to center and architecture |
| GAN based | High fidelity generation of normal data is independently useful | Training instability; rarely beats tuned simpler baselines |
| Normalizing flow on features | Strong pretrained backbone, density on semantic features | Raw likelihood unreliable out of distribution; needs ratio correction |
| Feature memory nearest neighbor | A strong frozen backbone exists, localization matters | Memory and latency scale with bank size; backbone domain mismatch |
| Self supervised | A pretext task or synthetic anomalies fit the target anomalies | No signal if the task is mismatched to real anomalies |

As a practical heuristic: when a strong pretrained backbone exists for the domain, start with frozen features plus nearest neighbor or normalizing flow density, since these are simple and competitive. When data is bespoke and unlabeled, an autoencoder (denoising or memory augmented) gives a robust, interpretable baseline, but pair it with a distance or density score so on manifold anomalies are not missed (Section 7). When tight one class geometry is desired and collapse can be controlled, Deep SVDD or a hybrid reconstruction plus compactness objective is attractive.
When a meaningful pretext task or synthetic anomalies can be designed, self supervised approaches typically give the best accuracy. GAN based methods are a specialized tool, justified mainly when high fidelity generation of normal data is independently valuable.

For tooling, mature open source options cover this entire stack and are worth preferring over bespoke or proprietary code. The PyOD library collects a broad set of classical and deep detectors behind a uniform interface, Anomalib targets deep image anomaly detection and localization with reference implementations of flow and memory based methods, and PyTorch with scikit-learn covers the building blocks (autoencoders, nearest neighbor, mixtures) when a custom pipeline is needed.

In every case, the governing discipline is the one from Section 1: state the normality assumption, verify that anomalies plausibly violate it, match the surrogate score to the way they deviate, and evaluate without leakage.

## References

1. Ruff, L., et al. (2021). A Unifying Review of Deep and Shallow Anomaly Detection. Proceedings of the IEEE. https://doi.org/10.1109/JPROC.2021.3052449
2. Ruff, L., et al. (2018). Deep One Class Classification (Deep SVDD). ICML. https://proceedings.mlr.press/v80/ruff18a.html
3. Schlegl, T., et al. (2017). Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery (AnoGAN). IPMI. https://doi.org/10.1007/978-3-319-59050-9_12
4. Akcay, S., Atapour-Abarghouei, A., Breckon, T. (2018). GANomaly: Semi Supervised Anomaly Detection via Adversarial Training. ACCV. https://doi.org/10.1007/978-3-030-20893-6_39
5. Zong, B., et al. (2018). Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection (DAGMM). ICLR. https://openreview.net/forum?id=BJJLHbb0-
6. Gong, D., et al. (2019). Memorizing Normality to Detect Anomaly: Memory Augmented Deep Autoencoder (MemAE). ICCV. https://doi.org/10.1109/ICCV.2019.00179
7. Golan, I., El-Yaniv, R. (2018). Deep Anomaly Detection Using Geometric Transformations. NeurIPS. https://arxiv.org/abs/1805.10917
8. Tack, J., et al. (2020). CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances. NeurIPS. https://arxiv.org/abs/2007.08176
9. Li, C.-L., et al. (2021). CutPaste: Self Supervised Learning for Anomaly Detection and Localization. CVPR. https://doi.org/10.1109/CVPR46437.2021.00954
10. Roth, K., et al. (2022). Towards Total Recall in Industrial Anomaly Detection (PatchCore). CVPR. https://doi.org/10.1109/CVPR52688.2022.01392
11. Nalisnick, E., et al. (2019). Do Deep Generative Models Know What They Don't Know? ICLR. https://arxiv.org/abs/1810.09136
12. Hendrycks, D., Mazeika, M., Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. ICLR. https://arxiv.org/abs/1812.04606
13. Ren, J., et al. (2019). Likelihood Ratios for Out of Distribution Detection. NeurIPS. https://arxiv.org/abs/1906.02845