153 One-Class SVM and Boundary-Based Anomaly Detection

Anomaly detection often arrives with an awkward asymmetry. We have an abundance of examples of normal behavior, such as healthy machines, legitimate transactions, or routine network traffic, but almost no examples of the anomalies we want to catch. Worse, the anomalies we have not yet seen may look nothing like the ones we have. This setting rules out ordinary supervised classification, which needs both classes, and motivates a family of methods that learn the shape of normality from positive examples alone. The One-Class Support Vector Machine (OCSVM) is the canonical kernel method for this task. It learns a boundary that wraps around the normal data, and it flags anything outside that boundary as anomalous.

This chapter develops the One-Class SVM from its geometric intuition through its dual optimization, explains the meaning of the $\nu$ parameter and the role of the kernel, contrasts it with the closely related Support Vector Data Description (SVDD), and finishes with a careful comparison against Isolation Forests.

153.1 1. The One-Class Problem

Let $x_1, \dots, x_n \in \mathbb{R}^d$ be a sample drawn from an unknown distribution $P$ that we treat as normal. The goal is to estimate a region $R$ such that $P(R)$ is high and the volume of $R$ is small. A new point $x$ is then scored as normal if $x \in R$ and anomalous otherwise. This is a density level set estimation problem in disguise. If we could estimate the density $p$, the ideal region would be a level set $\{x : p(x) \ge \rho\}$. Estimating a full density in high dimensions is hard, so the One-Class SVM sidesteps it and estimates the boundary of a level set directly.

Two design choices follow. First, we work in a feature space induced by a kernel so that the boundary can be nonlinear in the original coordinates. Second, we allow a controlled fraction of training points to fall outside the boundary, because insisting that every training point be enclosed makes the estimate brittle and sensitive to noise.

153.2 2. The Schölkopf Formulation

Schölkopf and colleagues framed the One-Class SVM as separating the data from the origin with maximum margin in feature space [1]. Let $\phi : \mathbb{R}^d \to \mathcal{H}$ map inputs into a reproducing kernel Hilbert space with kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$. We seek a hyperplane $\langle w, \phi(x) \rangle = \rho$ that separates most of the mapped data from the origin while lying as far from the origin as possible.

The primal optimization problem is

\[ \min_{w, \, \xi, \, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \quad \text{subject to} \quad \langle w, \phi(x_i)\rangle \ge \rho - \xi_i, \;\; \xi_i \ge 0 . \]

Here $\xi_i$ are slack variables that permit points to lie on the wrong side of the hyperplane, and $\rho$ is the offset. The term $-\rho$ in the objective pushes the hyperplane away from the origin, which enlarges the margin. The decision function is

\[ f(x) = \operatorname{sign}\!\big(\langle w, \phi(x)\rangle - \rho\big), \]

returning $+1$ for normal points and $-1$ for anomalies.

Why separate from the origin? With many common kernels, notably the Gaussian RBF kernel, all mapped points $\phi(x)$ live on a hypersphere of fixed radius in feature space and reside in the same orthant. Maximizing the distance of the data from the origin then carves out a region that tightly contains the bulk of the mapped points. The origin plays the role of a surrogate for “everything that is not data.”
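A quick numerical check makes this geometry concrete. The sketch below uses scikit-learn's rbf_kernel on arbitrary random points; the data and the value of gamma are illustrative assumptions, not part of the method. It checks that every mapped point has unit norm, because $k(x, x) = 1$, and that all pairwise inner products are positive, so no two mapped points are more than 90 degrees apart.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))          # arbitrary illustrative points

K = rbf_kernel(X, X, gamma=0.5)       # K[i, j] = <phi(x_i), phi(x_j)>

# The squared feature-space norm of each point is the kernel diagonal.
unit_norm = bool(np.allclose(np.diag(K), 1.0))
# Positive inner products: every pair of mapped points forms an acute angle.
all_positive = bool((K > 0).all())
# Largest angle between any two mapped points, in degrees.
max_angle = float(np.degrees(np.arccos(np.clip(K.min(), -1.0, 1.0))))

print(unit_norm, all_positive, round(max_angle, 2))
```

Because the mapped data occupy a patch of the unit sphere that subtends less than a right angle, a hyperplane pushed away from the origin cuts off a cap containing that patch, which is the region the One-Class SVM reports as normal.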

153.2.1 2.1 The Dual and Support Vectors

Introducing Lagrange multipliers $\alpha_i$ and eliminating $w$, $\xi$, and $\rho$ yields the dual

\[ \min_{\alpha} \; \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu n}, \;\; \sum_{i=1}^{n}\alpha_i = 1 . \]
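The elimination step can be made explicit. What follows is a sketch of the standard Lagrangian argument; the multipliers $\beta_i \ge 0$ for the constraints $\xi_i \ge 0$ are notation introduced here for illustration.

\[ L = \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i}\xi_i - \rho - \sum_{i}\alpha_i\big(\langle w, \phi(x_i)\rangle - \rho + \xi_i\big) - \sum_{i}\beta_i\,\xi_i . \]

Setting the derivatives with respect to the primal variables to zero gives

\[ \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i}\alpha_i\,\phi(x_i), \qquad \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = \frac{1}{\nu n} - \beta_i \le \frac{1}{\nu n}, \qquad \frac{\partial L}{\partial \rho} = 0 \;\Rightarrow\; \sum_{i}\alpha_i = 1 . \]

Substituting these back, the terms in $\xi_i$ and $\rho$ cancel and the Lagrangian reduces to $\frac{1}{2}\|w\|^2 - \|w\|^2 = -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j)$. Maximizing this over $\alpha$ is the same as minimizing the quadratic form in the dual above, subject to the box and sum constraints just derived.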

The recovered weight vector is $w = \sum_i \alpha_i \phi(x_i)$, so the decision function depends only on kernel evaluations:

\[ f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n}\alpha_i\, k(x_i, x) - \rho\right). \]

Points with $\alpha_i > 0$ are the support vectors. The box constraint $\alpha_i \le 1/(\nu n)$ caps the influence of any single point, which is precisely what bounds the number of outliers. Points strictly inside the boundary have $\alpha_i = 0$, points on the boundary have $0 < \alpha_i < 1/(\nu n)$, and points outside the boundary saturate at $\alpha_i = 1/(\nu n)$. The offset $\rho$ is recovered from any margin support vector, where $\sum_j \alpha_j k(x_j, x_i) = \rho$.
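The dual quantities can be inspected on a fitted model. The sketch below assumes scikit-learn's OneClassSVM, whose dual_coef_, support_vectors_, and intercept_ attributes expose the solution; the dataset and the values of nu and gamma are illustrative. One caveat, which the code checks rather than takes on trust: LIBSVM rescales the dual by $\nu n$, so the stored coefficients satisfy $0 \le \alpha_i \le 1$ and $\sum_i \alpha_i = \nu n$ instead of the normalization used in the text.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM

X, _ = make_blobs(n_samples=200, centers=[[0.0, 0.0]], cluster_std=1.0,
                  random_state=0)
nu, gamma = 0.1, 0.5                      # illustrative values
model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X)

alpha = model.dual_coef_.ravel()          # one coefficient per support vector
sv = model.support_vectors_

# Rescaled constraints: 0 < alpha_i <= 1 and sum(alpha) = nu * n.
alpha_sum = float(alpha.sum())
at_bound = int(np.isclose(alpha, 1.0).sum())     # saturated: outside the boundary
on_margin = int((alpha < 1.0 - 1e-8).sum())      # unsaturated: on the boundary

# Rebuild the decision values from kernel evaluations alone.
manual = rbf_kernel(X, sv, gamma=gamma) @ alpha + model.intercept_[0]
matches = bool(np.allclose(manual, model.decision_function(X), atol=1e-6))

print(f"sum(alpha) = {alpha_sum:.4f}  (nu * n = {nu * len(X):.4f})")
print(f"support vectors: {len(alpha)}, at bound: {at_bound}, on margin: {on_margin}")
print("manual decision values match:", matches)
```

The intercept plays the role of $-\rho$ in the rescaled problem, so the reconstructed values agree with decision_function up to floating-point error.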

153.3 3. The Meaning of $\nu$

The parameter $\nu \in (0, 1]$ is the conceptual heart of the method. It is not an opaque regularization knob but a quantity with a precise, dual interpretation given by the $\nu$-property [1]:

$\nu$ is an upper bound on the fraction of outliers, meaning the fraction of training points that fall on the wrong side of the hyperplane, $\langle w, \phi(x_i)\rangle < \rho$.
$\nu$ is a lower bound on the fraction of support vectors.

Asymptotically, both fractions converge to $\nu$ under mild conditions. This gives the practitioner a direct lever. Setting $\nu = 0.05$ expresses a belief that roughly five percent of the training data are contaminated or atypical and should be allowed to fall outside the learned boundary. Small $\nu$ produces a permissive boundary that encloses almost everything and rarely raises alarms. Large $\nu$ produces a tight boundary that flags many points.

The contrast with the soft-margin parameter $C$ in two-class SVMs is worth stating. Where $C$ trades margin against error in units that are hard to interpret, $\nu$ lies in $(0, 1]$ and reads directly as a fraction of the training set. The same reparameterization underlies the wider $\nu$-SVM family, to which the One-Class SVM belongs. In practice $\nu$ should be set to the analyst’s prior estimate of the contamination rate, then refined if a small validation set of labeled anomalies is available.

A worked, runnable demonstration of exactly this lever appears in Section 8.
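The bracketing behavior is easiest to see across several values of $\nu$ at once. The following is a minimal sketch with scikit-learn on synthetic two-cluster data; the dataset, the fixed gamma, and the tightened solver tolerance are all illustrative choices. Because the solver stops at approximate optimality, a few boundary points can be flagged beyond the bound, so the outlier fraction should be read as "at most about $\nu$".

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM

X, _ = make_blobs(n_samples=400, centers=[[0.0, 0.0], [4.0, 4.0]],
                  cluster_std=0.6, random_state=1)

rows = []
for nu in (0.01, 0.05, 0.10, 0.20):
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu, tol=1e-6).fit(X)
    outlier_frac = float((model.predict(X) == -1).mean())       # flagged training points
    sv_frac = model.support_vectors_.shape[0] / X.shape[0]      # support-vector share
    rows.append((nu, outlier_frac, sv_frac))
    print(f"nu={nu:.2f}  outliers={outlier_frac:.3f}  support vectors={sv_frac:.3f}")
```

Each row should show the outlier fraction near or below $\nu$ and the support-vector fraction at or above it, with both moving together as $\nu$ grows.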

153.4 4. Kernel Choice

The kernel determines the geometry of the boundary, and for the One-Class SVM it matters even more than for two-class classification because there is no second class to anchor the decision surface.

153.4.1 4.1 The Gaussian RBF Kernel

The Gaussian kernel

\[ k(x, x') = \exp\!\left(-\gamma \|x - x'\|^2\right), \qquad \gamma = \frac{1}{2\sigma^2}, \]

is the default choice. It is universal, meaning that functions in its feature space can approximate any continuous function on a compact domain, so the boundary is not confined to a fixed parametric shape. It also maps every point onto the unit sphere in feature space, since $k(x, x) = 1$, which makes the separate-from-origin geometry behave as intended. The parameter $\gamma$ is an inverse squared bandwidth and controls smoothness. Small $\gamma$ (large $\sigma$) yields a smooth, blob-like boundary that may underfit multimodal normal data. Large $\gamma$ (small $\sigma$) yields a wiggly boundary that can fragment into islands around individual points and overfit, in the extreme tracing a tight bubble around each training example.

A useful heuristic sets $\gamma$ from the spread of the data, for instance the inverse of the median squared pairwise distance. The “scale” setting in scikit-learn follows the same spirit with a variance-based rule, $\gamma = 1 / (d \cdot \operatorname{Var}(X))$, where $d$ is the number of features and the variance is taken over all entries of the training matrix. Because $\nu$ and $\gamma$ interact, with $\gamma$ shaping the boundary and $\nu$ shaping how much of the data it must enclose, the two are best tuned jointly.
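Both rules of thumb are one-liners. The sketch below computes the median heuristic with SciPy's pdist and the variance rule $1 / (d \cdot \operatorname{Var}(X))$ that scikit-learn documents for gamma="scale"; the synthetic dataset is an illustrative assumption. Either value is a starting point for a search, not a final answer.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_raw, _ = make_blobs(n_samples=300, centers=[[0.0, 0.0], [4.0, 4.0]],
                      cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X_raw)
n_features = X.shape[1]

# Median heuristic: inverse of the median squared pairwise distance.
gamma_median = float(1.0 / np.median(pdist(X, metric="sqeuclidean")))

# Variance rule: variance over all entries of X, scaled by the feature count.
gamma_scale = float(1.0 / (n_features * X.var()))

print(f"median heuristic gamma: {gamma_median:.3f}")
print(f"variance-rule gamma:    {gamma_scale:.3f}")
```

On standardized data the variance rule reduces to $1/d$, which is why it gives $0.5$ for two features here regardless of how the clusters are arranged; the median heuristic does respond to the arrangement.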

153.4.2 4.2 Linear and Polynomial Kernels

The linear kernel $k(x, x') = \langle x, x' \rangle$ reduces the boundary to a half-space and is rarely adequate for normality, which is seldom well described by a single linear cut from the origin. Polynomial kernels $k(x, x') = (\langle x, x'\rangle + c)^p$ can capture curved boundaries but are numerically delicate and lack the bounded feature norm of the RBF kernel. For most applications the RBF kernel is the right starting point, with the linear kernel reserved for very high-dimensional sparse data such as text, where linear models often suffice.

153.5 5. Support Vector Data Description

The Support Vector Data Description of Tax and Duin takes a different geometric route to the same goal [2]. Instead of separating data from the origin with a hyperplane, SVDD finds the smallest hypersphere in feature space that contains most of the data. We seek a center $a$ and radius $R$ solving

\[ \min_{a, \, R, \, \xi} \; R^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad \|\phi(x_i) - a\|^2 \le R^2 + \xi_i, \;\; \xi_i \ge 0 . \]

A test point is normal if it falls inside the sphere, that is if $\|\phi(x) - a\|^2 \le R^2$. The dual again involves only kernel evaluations, and the center is $a = \sum_i \alpha_i \phi(x_i)$.

The key theoretical fact is that for any kernel where $k(x, x)$ is constant for all $x$, the RBF kernel being the prime example since $k(x,x) = 1$, the SVDD hypersphere and the Schölkopf hyperplane give identical decision boundaries [1], [2]. The expansion $\|\phi(x) - a\|^2 = k(x,x) - 2\sum_i \alpha_i k(x_i, x) + \text{const}$ collapses, when $k(x,x)$ is constant, into a function of $\sum_i \alpha_i k(x_i, x)$ alone, which is the same quantity that drives the hyperplane decision. The two formulations are therefore not competitors but dual viewpoints, one spherical and one planar, of the same underlying estimator. They diverge only for kernels without constant diagonal, where the sphere interpretation of SVDD is the more natural one and accommodates a richer set of kernels and even data-dependent rescalings.
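The collapse can be written out. This is a sketch of the algebra, writing $\kappa$ for the constant value of $k(x, x)$; the symbols $\kappa$ and $\rho'$ are introduced here for illustration. Expanding the squared distance to the center gives

\[ \|\phi(x) - a\|^2 = \kappa - 2\sum_{i}\alpha_i\, k(x_i, x) + \sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j), \]

so the inside-the-sphere test becomes a threshold on the same kernel expansion that drives the hyperplane:

\[ \|\phi(x) - a\|^2 \le R^2 \;\iff\; \sum_{i}\alpha_i\, k(x_i, x) \ge \rho', \qquad \rho' = \frac{1}{2}\Big(\kappa + \sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j) - R^2\Big). \]

The SVDD dual maximizes $\sum_i \alpha_i\, k(x_i, x_i) - \sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j)$ under the constraints $0 \le \alpha_i \le 1/(\nu n)$ and $\sum_i \alpha_i = 1$. With a constant diagonal the first sum equals $\kappa$, so the maximizer is exactly the minimizer of the quadratic form in the One-Class SVM dual. Evaluating the threshold at a margin support vector, where $\|\phi(x_i) - a\|^2 = R^2$, gives $\rho' = \sum_j \alpha_j k(x_j, x_i) = \rho$.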

153.6 6. Comparison with Isolation Forests

The Isolation Forest takes a strategy that is almost the conceptual opposite of the One-Class SVM [3]. Rather than modeling the dense region of normality, it directly exploits the fact that anomalies are few and different and are therefore easy to isolate. An isolation tree is built by repeatedly choosing a random feature and a random split value, partitioning the data until points are isolated into leaves. Anomalous points, being sparse and far from the mass of the data, tend to be separated after only a few splits, so they have short average path lengths from the root. The anomaly score for a point $x$ averaged over a forest of trees is

\[ s(x, n) = 2^{-\,\frac{\mathbb{E}[h(x)]}{c(n)}}, \]

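where $h(x)$ is the path length needed to isolate $x$ in a single tree, the expectation is taken over the trees of the forest, and $c(n)$ is the average path length of an unsuccessful search in a binary search tree built on $n$ points, used as a normalizing constant. Scores near $1$ indicate anomalies, scores well below $0.5$ indicate normal points, and a sample whose scores all sit near $0.5$ contains no distinct anomalies.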
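The score is simple enough to compute by hand. The sketch below assumes the normalizer from the Isolation Forest paper [3], $c(n) = 2H(n-1) - 2(n-1)/n$ with the harmonic number approximated as $H(i) \approx \ln i + 0.5772$; the function names are hypothetical helpers, and the subsample size of 256 is the paper's default rather than a requirement.

```python
import math

EULER_GAMMA = 0.5772156649015329

def c(n: int) -> float:
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA      # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256                                           # subsample size per tree
for h in (2.0, c(n), 2.0 * c(n)):
    print(f"E[h] = {h:6.2f}  ->  s = {anomaly_score(h, n):.3f}")
```

A point isolated after two splits on average scores well above $0.5$, a point with the typical path length $c(n)$ scores exactly $0.5$, and a point buried twice as deep scores $0.25$.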

The two methods differ along several axes that matter in practice.

Aspect	One-Class SVM	Isolation Forest
Core idea	Wrap a boundary around normal data	Isolate sparse points quickly
Model	Kernel boundary in feature space	Ensemble of random trees
Key parameters	$\nu$, kernel bandwidth $\gamma$	number of trees, subsample size
Training cost	$O(n^2)$ to $O(n^3)$ in sample size	near linear, sublinear with subsampling
Scaling sensitivity	high, distances depend on feature scale	low, splits are axis aligned
High dimensions	degrades, distances concentrate	generally more robust
Interpretability	support vectors, geometric boundary	path lengths, partial transparency

Several practical implications follow. The quadratic to cubic training cost of the One-Class SVM makes it awkward on large datasets, where Isolation Forests, which subsample a few hundred points per tree, scale comfortably to millions of records. The One-Class SVM relies on distances and so demands careful feature standardization, whereas an Isolation Forest is unaffected by shifting or rescaling individual features, because each split value is drawn uniformly between the observed minimum and maximum of the chosen feature. It is not invariant to nonlinear monotone transformations such as a logarithm, which change where those uniform draws land relative to the data. On the other hand, the One-Class SVM, with a well-chosen RBF kernel, can model smooth curved manifolds of normality that axis-aligned partitions approximate only coarsely, and its $\nu$ parameter gives a cleaner handle on the expected contamination rate. Isolation Forests can struggle when anomalies are not globally sparse but lie in local pockets, and they exhibit axis-aligned artifacts that the rotation-aware Extended Isolation Forest was designed to mitigate [5].

A pragmatic recommendation is to treat the two as complementary baselines. For moderate sample sizes with a meaningful distance metric after careful preprocessing, the One-Class SVM and its SVDD twin offer a principled geometric boundary. For large, high-dimensional, or heterogeneous tabular data where speed and robustness to scaling dominate, the Isolation Forest is usually the stronger and cheaper first choice. Evaluating both on a held-out set with whatever labeled anomalies exist, using ranking metrics such as area under the precision-recall curve, is the surest way to choose.
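The side-by-side evaluation recommended above can be sketched in a few lines. The example assumes scikit-learn's OneClassSVM and IsolationForest and a synthetic problem with labeled anomalies in the held-out set; the data, the hyperparameters, and the 200-tree forest are illustrative choices, so the numbers say nothing about which method wins on real data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_all, _ = make_blobs(n_samples=700, centers=[[0.0, 0.0], [4.0, 4.0]],
                      cluster_std=0.6, random_state=0)
X_train, X_test_normal = X_all[:500], X_all[500:]
X_test_anom = rng.uniform(low=-4.0, high=8.0, size=(40, 2))

X_test = np.vstack([X_test_normal, X_test_anom])
y_test = np.r_[np.zeros(len(X_test_normal)), np.ones(len(X_test_anom))]  # 1 = anomaly

models = {
    "one-class SVM": make_pipeline(
        StandardScaler(), OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)),
    "isolation forest": IsolationForest(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train)                            # trained on normal data only
    # Both estimators score inliers high, so negate to rank anomalies first.
    anomaly_score = -model.decision_function(X_test)
    results[name] = float(average_precision_score(y_test, anomaly_score))
    print(f"{name:>16}: AUPRC = {results[name]:.3f}")
```

Only the One-Class SVM is wrapped in a scaler, since the forest does not need one; the labels are used for scoring only and never reach either model during fitting.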

153.7 7. Mature Open-Source Tooling

The One-Class SVM is a solved engineering problem, and there is no reason to reimplement the quadratic program by hand. The reference solver underneath almost every implementation is LIBSVM, the C++ library of Chang and Lin that implements an SMO-type working-set decomposition algorithm and exposes the $\nu$-SVM family directly [4]. The mature, freely licensed, well-maintained wrappers around it are what you should reach for in production.

Python: sklearn.svm.OneClassSVM (BSD-licensed scikit-learn) wraps LIBSVM and slots into the same fit / predict / decision_function interface as the rest of the ecosystem, so it composes with StandardScaler, Pipeline, and GridSearchCV for leakage-free tuning. For larger problems scikit-learn also ships SGDOneClassSVM, a linear stochastic-gradient approximation whose cost is linear in the sample size, usually paired with Nystroem kernel approximation to recover nonlinearity.
Julia: LIBSVM.jl (MIT-licensed) binds the same LIBSVM C library; its svmtrain(X; svmtype = OneClassSVM, nu = 0.05, kernel = Kernel.RadialBasis) exposes the identical $\nu$ and $\gamma$ knobs.
Rust: linfa-svm (part of the Apache/MIT-licensed linfa toolkit) provides a pure-Rust SVM with a one-class mode via Svm::params().nu_weight(...), and smartcore offers a related SVM family. The Rust ecosystem here is younger than LIBSVM, so for high-stakes work cross-check against the scikit-learn reference.

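For identical data and hyperparameters, the scikit-learn and LIBSVM.jl models should agree up to solver tolerance, because both call the same LIBSVM solver. linfa-svm is a pure-Rust implementation rather than a binding, so expect close but not bit-identical results. The choice among them is governed mainly by the language your pipeline already lives in.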

153.8 8. A Worked Example

The example below builds a small, self-contained anomaly-detection problem. The normal class is two tight Gaussian clusters, the kind of structure you see when a sensor operates in two healthy regimes. A separate batch of uniformly scattered points stands in for unseen anomalies. We standardize on the training statistics, fit an RBF One-Class SVM with $\nu = 0.05$, and inspect both the hard predict labels and the continuous decision_function scores. Watch the $\nu$-property assert itself: the fraction of flagged training points stays at or below $\nu$, while the support-vector fraction sits at or above it.

Code

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Normal data: two tight Gaussian clusters of "healthy" sensor readings.
X_normal, _ = make_blobs(
    n_samples=300, centers=[[0.0, 0.0], [4.0, 4.0]],
    cluster_std=0.6, random_state=0,
)

# Held-out anomalies: points scattered uniformly across the plane,
# most of which land away from the two normal clusters.
X_anom = rng.uniform(low=-4.0, high=8.0, size=(40, 2))

# Standardize on the TRAINING statistics only (the RBF kernel is distance based).
scaler = StandardScaler().fit(X_normal)
Xn = scaler.transform(X_normal)
Xa = scaler.transform(X_anom)

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(Xn)

pred_normal = ocsvm.predict(Xn)        # +1 inlier, -1 outlier
pred_anom = ocsvm.predict(Xa)
scores_normal = ocsvm.decision_function(Xn)
scores_anom = ocsvm.decision_function(Xa)

n_sv = ocsvm.support_vectors_.shape[0]

print(f"training points:        {Xn.shape[0]}")
print(f"support vectors:        {n_sv}  "
      f"({n_sv / Xn.shape[0]:.1%} of train, >= nu = 0.05)")
print(f"train flagged outliers: {(pred_normal == -1).sum()} "
      f"({(pred_normal == -1).mean():.1%}, <= nu = 0.05)")
print(f"anomalies caught:       {(pred_anom == -1).sum()} / {Xa.shape[0]}")
print()
print("decision scores  > 0 means inlier, < 0 means outlier")
print(f"  normal  min/mean/max: {scores_normal.min():+.3f} / "
      f"{scores_normal.mean():+.3f} / {scores_normal.max():+.3f}")
print(f"  anomaly min/mean/max: {scores_anom.min():+.3f} / "
      f"{scores_anom.mean():+.3f} / {scores_anom.max():+.3f}")
print()
print("first 5 anomaly scores:", np.round(scores_anom[:5], 3))

training points:        300
support vectors:        18  (6.0% of train, >= nu = 0.05)
train flagged outliers: 14 (4.7%, <= nu = 0.05)
anomalies caught:       35 / 40

decision scores  > 0 means inlier, < 0 means outlier
  normal  min/mean/max: -0.371 / +0.548 / +0.917
  anomaly min/mean/max: -4.867 / -2.364 / +0.900

first 5 anomaly scores: [-0.027  0.236 -1.315 -3.238 -4.112]

The printout confirms the theory of Section 3. With $\nu = 0.05$ the support-vector fraction (6.0%) lands just above five percent and the flagged-training fraction (4.7%) just below it, bracketing $\nu$ as the $\nu$-property predicts. The continuous scores separate the two groups well but not perfectly: normal points carry mostly positive scores, most of the scattered anomalies are pushed strongly negative, and the handful of anomalies that happen to fall near a normal cluster receive near-zero or positive scores, which is the correct and honest behavior of a boundary estimator. Ranking by decision_function rather than thresholding at zero is what you would do in deployment when you can tolerate only a fixed alert budget. The two listings that follow sketch the same setup in Julia with LIBSVM.jl and in Rust with linfa-svm.

# LIBSVM.jl wraps the same LIBSVM C library as scikit-learn.
# Pkg.add("LIBSVM"); Pkg.add("Statistics")
using LIBSVM, Statistics, Random

Random.seed!(0)

# Two normal clusters (columns are samples, the LIBSVM.jl convention).
c1 = randn(2, 150) .* 0.6 .+ [0.0, 0.0]
c2 = randn(2, 150) .* 0.6 .+ [4.0, 4.0]
X_normal = hcat(c1, c2)
X_anom = rand(2, 40) .* 12.0 .- 4.0          # uniform in [-4, 8]^2

# Standardize on training statistics.
mu = mean(X_normal; dims = 2)
sd = std(X_normal; dims = 2)
Xn = (X_normal .- mu) ./ sd
Xa = (X_anom .- mu) ./ sd

model = svmtrain(Xn;
    svmtype = OneClassSVM, kernel = Kernel.RadialBasis,
    nu = 0.05, gamma = 1.0 / size(Xn, 1))

pred_n, dec_n = svmpredict(model, Xn)        # pred is true/false (inlier/outlier)
pred_a, dec_a = svmpredict(model, Xa)

println("support vectors:  ", model.SVs.l)          # total support-vector count
println("train outliers:   ", count(!, pred_n), " / ", size(Xn, 2))
println("anomalies caught: ", count(!, pred_a), " / ", size(Xa, 2))

// linfa-svm provides a pure-Rust SVM with a one-class mode.
// Cargo.toml: linfa = "0.7", linfa-svm = "0.7", ndarray = "0.15", ndarray-rand = "0.14"
use linfa::prelude::*;
use linfa::dataset::DatasetBase;
use linfa_svm::Svm;
use ndarray::Array2;
use ndarray_rand::{rand_distr::Normal, RandomExt};

fn main() {
    // Two normal Gaussian clusters (300 x 2), standardized beforehand in practice.
    let c1 = Array2::random((150, 2), Normal::new(0.0, 0.6).unwrap());
    let c2 = Array2::random((150, 2), Normal::new(4.0, 0.6).unwrap());
    let x = ndarray::concatenate(ndarray::Axis(0), &[c1.view(), c2.view()]).unwrap();
    let data = DatasetBase::from(x);

    // One-class RBF SVM: nu controls the outlier fraction, gamma the bandwidth.
    let model = Svm::<f64, bool>::params()
        .nu_weight(0.05)
        .gaussian_kernel(1.0)
        .fit(&data)
        .unwrap();

    let preds = model.predict(&data);
    let outliers = preds.iter().filter(|&&p| !p).count();
    println!("train outliers: {} / {}", outliers, preds.len());
}

Honest note: the Rust SVM ecosystem is less mature than LIBSVM. linfa-svm implements one-class classification but its API and numerics evolve between releases, and smartcore focuses on supervised SVM. For production anomaly detection in a Rust service, the safest path today is to validate against the scikit-learn reference, or to call a LIBSVM binding through FFI.

153.9 9. Practical Guidance

A short checklist captures the operational essentials. Standardize features before fitting a One-Class SVM, since the RBF kernel is distance based. Set $\nu$ from a genuine prior on the contamination rate rather than leaving it at a default. Tune $\gamma$ jointly with $\nu$, watching for the overfitting signature where nearly every training point becomes a support vector. Prefer the RBF kernel unless the data are high-dimensional and sparse. Remember that the One-Class SVM produces a hard boundary by default, so use the continuous decision_function rather than the sign when you need a ranking, and keep in mind that these scores are uncalibrated margins, not probabilities. Finally, validate against Isolation Forest and, where the spherical view or a non-constant kernel is natural, against SVDD, because no single boundary estimator dominates across the diversity of anomaly detection problems.
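Two items on the checklist, joint tuning of $\nu$ and $\gamma$ and ranking under a fixed alert budget, combine naturally into one small search. The sketch below assumes a handful of labeled anomalies in a validation set, which is not always available; the grid, the budget of 20 alerts, and the helper precision_at_budget are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_all, _ = make_blobs(n_samples=500, centers=[[0.0, 0.0], [4.0, 4.0]],
                      cluster_std=0.6, random_state=0)
X_train, X_val_normal = X_all[:300], X_all[300:]
X_val_anom = rng.uniform(low=-4.0, high=8.0, size=(20, 2))

scaler = StandardScaler().fit(X_train)           # training statistics only
Xt = scaler.transform(X_train)
Xv = scaler.transform(np.vstack([X_val_normal, X_val_anom]))
yv = np.r_[np.zeros(len(X_val_normal)), np.ones(len(X_val_anom))]  # 1 = anomaly

budget = 20                                      # alerts we can afford to review

def precision_at_budget(model, X, y, k):
    """Fraction of true anomalies among the k lowest decision scores."""
    worst = np.argsort(model.decision_function(X))[:k]
    return float(y[worst].mean())

results = {}
for nu in (0.01, 0.05, 0.10):
    for gamma in (0.05, 0.5, 5.0):
        model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(Xt)
        sv_frac = model.support_vectors_.shape[0] / Xt.shape[0]   # overfitting check
        results[(nu, gamma)] = (precision_at_budget(model, Xv, yv, budget), sv_frac)

best = max(results, key=lambda key: results[key][0])
print("best (nu, gamma):", best, "precision at budget:", results[best][0])
```

The support-vector fraction is stored next to each score so that a configuration which wins only by turning most training points into support vectors can be spotted and discarded.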

153.10 References

[1] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. “Estimating the Support of a High-Dimensional Distribution.” Neural Computation, 13(7), 2001. https://doi.org/10.1162/089976601750264965
[2] Tax, D. M. J., and Duin, R. P. W. “Support Vector Data Description.” Machine Learning, 54(1), 2004. https://doi.org/10.1023/B:MACH.0000008084.60811.49
[3] Liu, F. T., Ting, K. M., and Zhou, Z.-H. “Isolation Forest.” Proceedings of the 2008 IEEE International Conference on Data Mining. https://doi.org/10.1109/ICDM.2008.17
[4] Chang, C.-C., and Lin, C.-J. “LIBSVM: A Library for Support Vector Machines.” ACM Transactions on Intelligent Systems and Technology, 2(3), 2011. https://doi.org/10.1145/1961189.1961199
[5] Hariri, S., Kind, M. C., and Brunner, R. J. “Extended Isolation Forest.” IEEE Transactions on Knowledge and Data Engineering, 33(4), 2021. https://doi.org/10.1109/TKDE.2019.2947676
[6] scikit-learn developers. “Novelty and Outlier Detection.” https://scikit-learn.org/stable/modules/outlier_detection.html

# One-Class SVM and Boundary-Based Anomaly Detection Anomaly detection often arrives with an awkward asymmetry. We have an abundance of examples of normal behavior, such as healthy machines, legitimate transactions, or routine network traffic, but almost no examples of the anomalies we want to catch. Worse, the anomalies we have not yet seen may look nothing like the ones we have. This setting rules out ordinary supervised classification, which needs both classes, and motivates a family of methods that learn the shape of normality from positive examples alone. The One-Class Support Vector Machine (OCSVM) is the canonical kernel method for this task. It learns a boundary that wraps around the normal data, and it flags anything outside that boundary as anomalous. This chapter develops the One-Class SVM from its geometric intuition through its dual optimization, explains the meaning of the $\nu$ parameter and the role of the kernel, contrasts it with the closely related Support Vector Data Description (SVDD), and finishes with a careful comparison against Isolation Forests. ## 1. The One-Class Problem Let $x_1, \dots, x_n \in \mathbb{R}^d$ be a sample drawn from an unknown distribution $P$ that we treat as normal. The goal is to estimate a region $R$ such that $P(R)$ is high and the volume of $R$ is small. A new point $x$ is then scored as normal if $x \in R$ and anomalous otherwise. This is a density level set estimation problem in disguise. If we could estimate the density $p$, the ideal region would be a level set $\{x : p(x) \ge \rho\}$. Estimating a full density in high dimensions is hard, so the One-Class SVM sidesteps it and estimates the boundary of a level set directly. Two design choices follow. First, we work in a feature space induced by a kernel so that the boundary can be nonlinear in the original coordinates. 
Second, we allow a controlled fraction of training points to fall outside the boundary, because insisting that every training point be enclosed makes the estimate brittle and sensitive to noise. ## 2. The Schölkopf Formulation Schölkopf and colleagues framed the One-Class SVM as separating the data from the origin with maximum margin in feature space [1]. Let $\phi : \mathbb{R}^d \to \mathcal{H}$ map inputs into a reproducing kernel Hilbert space with kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$. We seek a hyperplane $\langle w, \phi(x) \rangle = \rho$ that separates most of the mapped data from the origin while lying as far from the origin as possible. The primal optimization problem is $$ \min_{w, \, \xi, \, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \quad \text{subject to} \quad \langle w, \phi(x_i)\rangle \ge \rho - \xi_i, \;\; \xi_i \ge 0 . $$ Here $\xi_i$ are slack variables that permit points to lie on the wrong side of the hyperplane, and $\rho$ is the offset. The term $-\rho$ in the objective pushes the hyperplane away from the origin, which enlarges the margin. The decision function is $$ f(x) = \operatorname{sign}\!\big(\langle w, \phi(x)\rangle - \rho\big), $$ returning $+1$ for normal points and $-1$ for anomalies. Why separate from the origin? With many common kernels, notably the Gaussian RBF kernel, all mapped points $\phi(x)$ live on a hypersphere of fixed radius in feature space and reside in the same orthant. Maximizing the distance of the data from the origin then carves out a region that tightly contains the bulk of the mapped points. The origin plays the role of a surrogate for "everything that is not data." 
### 2.1 The Dual and Support Vectors Introducing Lagrange multipliers $\alpha_i$ and eliminating $w$, $\xi$, and $\rho$ yields the dual $$ \min_{\alpha} \; \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, k(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu n}, \;\; \sum_{i=1}^{n}\alpha_i = 1 . $$ The recovered weight vector is $w = \sum_i \alpha_i \phi(x_i)$, so the decision function depends only on kernel evaluations: $$ f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n}\alpha_i\, k(x_i, x) - \rho\right). $$ Points with $\alpha_i > 0$ are the support vectors. The box constraint $\alpha_i \le 1/(\nu n)$ caps the influence of any single point, which is precisely what bounds the number of outliers. Points strictly inside the boundary have $\alpha_i = 0$, points on the boundary have $0 < \alpha_i < 1/(\nu n)$, and points outside the boundary saturate at $\alpha_i = 1/(\nu n)$. The offset $\rho$ is recovered from any margin support vector, where $\sum_j \alpha_j k(x_j, x_i) = \rho$. ## 3. The Meaning of $\nu$ The parameter $\nu \in (0, 1]$ is the conceptual heart of the method. It is not an opaque regularization knob but a quantity with a precise, dual interpretation given by the $\nu$-property [1]: - $\nu$ is an upper bound on the fraction of outliers, meaning the fraction of training points for which $f(x_i) \le 0$. - $\nu$ is a lower bound on the fraction of support vectors. Asymptotically, both fractions converge to $\nu$ under mild conditions. This gives the practitioner a direct lever. Setting $\nu = 0.05$ expresses a belief that roughly five percent of the training data are contaminated or atypical and should be allowed to fall outside the learned boundary. Small $\nu$ produces a permissive boundary that encloses almost everything and rarely raises alarms. Large $\nu$ produces a tight boundary that flags many points. The contrast with the soft-margin parameter $C$ in two-class SVMs is worth stating. 
Where $C$ trades margin against error in units that are hard to interpret, $\nu$ is bounded in $[0, 1]$ and carries a direct probabilistic meaning. This is why the formulation is sometimes called the $\nu$-SVM family. In practice $\nu$ should be set to the analyst's prior estimate of the contamination rate, then refined if a small validation set of labeled anomalies is available. A worked, runnable demonstration of exactly this lever appears in Section 8. ## 4. Kernel Choice The kernel determines the geometry of the boundary, and for the One-Class SVM it matters even more than for two-class classification because there is no second class to anchor the decision surface. ### 4.1 The Gaussian RBF Kernel The Gaussian kernel $$ k(x, x') = \exp\!\left(-\gamma \|x - x'\|^2\right), \qquad \gamma = \frac{1}{2\sigma^2}, $$ is the default choice. It is universal, meaning it can approximate any continuous decision boundary, and it maps every point onto a sphere of unit norm in feature space, which makes the separate-from-origin geometry behave as intended. The bandwidth $\gamma$ controls smoothness. Small $\gamma$ (large $\sigma$) yields a smooth, blob-like boundary that may underfit multimodal normal data. Large $\gamma$ (small $\sigma$) yields a wiggly boundary that can fragment into islands around individual points and overfit, in the extreme tracing a tight bubble around each training example. A useful heuristic sets $\gamma$ from the distribution of pairwise distances, for instance the inverse of the median squared distance, which is the spirit of the "scale" setting in common libraries. Because $\nu$ and $\gamma$ interact, with $\gamma$ shaping the boundary and $\nu$ shaping how much of the data it must enclose, the two are best tuned jointly. 
### 4.2 Linear and Polynomial Kernels The linear kernel $k(x, x') = \langle x, x' \rangle$ reduces the boundary to a half-space and is rarely adequate for normality, which is seldom well described by a single linear cut from the origin. Polynomial kernels $k(x, x') = (\langle x, x'\rangle + c)^p$ can capture curved boundaries but are numerically delicate and lack the bounded feature norm of the RBF kernel. For most applications the RBF kernel is the right starting point, with the linear kernel reserved for very high-dimensional sparse data such as text, where linear models often suffice. ## 5. Support Vector Data Description The Support Vector Data Description of Tax and Duin takes a different geometric route to the same goal [2]. Instead of separating data from the origin with a hyperplane, SVDD finds the smallest hypersphere in feature space that contains most of the data. We seek a center $a$ and radius $R$ solving $$ \min_{a, \, R, \, \xi} \; R^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad \|\phi(x_i) - a\|^2 \le R^2 + \xi_i, \;\; \xi_i \ge 0 . $$ A test point is normal if it falls inside the sphere, that is if $\|\phi(x) - a\|^2 \le R^2$. The dual again involves only kernel evaluations, and the center is $a = \sum_i \alpha_i \phi(x_i)$. The key theoretical fact is that for any kernel where $k(x, x)$ is constant for all $x$, the RBF kernel being the prime example since $k(x,x) = 1$, the SVDD hypersphere and the Schölkopf hyperplane give identical decision boundaries [1], [2]. The expansion $\|\phi(x) - a\|^2 = k(x,x) - 2\sum_i \alpha_i k(x_i, x) + \text{const}$ collapses, when $k(x,x)$ is constant, into a function of $\sum_i \alpha_i k(x_i, x)$ alone, which is the same quantity that drives the hyperplane decision. The two formulations are therefore not competitors but dual viewpoints, one spherical and one planar, of the same underlying estimator. 
They diverge only for kernels without constant diagonal, where the sphere interpretation of SVDD is the more natural one and accommodates a richer set of kernels and even data-dependent rescalings. ## 6. Comparison with Isolation Forests The Isolation Forest takes a strategy that is almost the conceptual opposite of the One-Class SVM [3]. Rather than modeling the dense region of normality, it directly exploits the fact that anomalies are few and different and are therefore easy to isolate. An isolation tree is built by repeatedly choosing a random feature and a random split value, partitioning the data until points are isolated into leaves. Anomalous points, being sparse and far from the mass of the data, tend to be separated after only a few splits, so they have short average path lengths from the root. The anomaly score for a point $x$ averaged over a forest of trees is $$ s(x, n) = 2^{-\,\frac{\mathbb{E}[h(x)]}{c(n)}}, $$ where $h(x)$ is the path length to isolate $x$ in a tree, and $c(n)$ is the expected path length for $n$ points used as a normalizing constant. Scores near $1$ indicate anomalies and scores well below $0.5$ indicate normal points. The two methods differ along several axes that matter in practice. | Aspect | One-Class SVM | Isolation Forest | | --- | --- | --- | | Core idea | Wrap a boundary around normal data | Isolate sparse points quickly | | Model | Kernel boundary in feature space | Ensemble of random trees | | Key parameters | $\nu$, kernel bandwidth $\gamma$ | number of trees, subsample size | | Training cost | $O(n^2)$ to $O(n^3)$ in sample size | near linear, sublinear with subsampling | | Scaling sensitivity | high, distances depend on feature scale | low, splits are axis aligned | | High dimensions | degrades, distances concentrate | generally more robust | | Interpretability | support vectors, geometric boundary | path lengths, partial transparency | Several practical implications follow. 
The quadratic to cubic training cost of the One-Class SVM makes it awkward on large datasets, where Isolation Forests, which subsample a few hundred points per tree, scale comfortably to millions of records. The One-Class SVM relies on distances and so demands careful feature standardization, whereas the axis-aligned splits of an Isolation Forest, drawn uniformly between each feature's minimum and maximum, are unaffected by per-feature shifts and positive rescalings. On the other hand, the One-Class SVM, with a well-chosen RBF kernel, can model smooth curved manifolds of normality that axis-aligned partitions approximate only coarsely, and its $\nu$ parameter gives a cleaner handle on the expected contamination rate. Isolation Forests can struggle when anomalies are not globally sparse but lie in local pockets, and they exhibit axis-aligned artifacts that the rotation-aware Extended Isolation Forest was designed to mitigate [5].

A pragmatic recommendation is to treat the two as complementary baselines. For moderate sample sizes with a meaningful distance metric after careful preprocessing, the One-Class SVM and its SVDD twin offer a principled geometric boundary. For large, high-dimensional, or heterogeneous tabular data where speed and robustness to scaling dominate, the Isolation Forest is usually the stronger and cheaper first choice. Evaluating both on a held-out set with whatever labeled anomalies exist, using ranking metrics such as area under the precision-recall curve, is the surest way to choose.

## 7. Mature Open-Source Tooling

The One-Class SVM is a solved engineering problem, and there is no reason to reimplement the quadratic program by hand. The reference solver underneath almost every implementation is **LIBSVM**, the C++ library of Chang and Lin, which implements an SMO-type working-set decomposition algorithm and exposes the $\nu$-SVM family directly [4]. The mature, freely licensed, well-maintained wrappers around it are what you should reach for in production.

- **Python**: `sklearn.svm.OneClassSVM` (BSD-licensed scikit-learn) wraps LIBSVM and slots into the same `fit` / `predict` / `decision_function` interface as the rest of the ecosystem, so it composes with `StandardScaler`, `Pipeline`, and `GridSearchCV` for leakage-free tuning [6]. For larger problems scikit-learn also ships `SGDOneClassSVM` (in `sklearn.linear_model`), a linear stochastic-gradient approximation whose cost is linear in the sample size, usually paired with `Nystroem` kernel approximation to recover nonlinearity.
- **Julia**: `LIBSVM.jl` (MIT-licensed) binds the same LIBSVM C library; its `svmtrain(X; svmtype = OneClassSVM, nu = 0.05, kernel = Kernel.RadialBasis)` exposes the identical $\nu$ and $\gamma$ knobs.
- **Rust**: `linfa-svm` (part of the Apache/MIT-licensed `linfa` toolkit) provides a pure-Rust SVM with a one-class mode via `Svm::params().nu_weight(...)`, and `smartcore` offers a related SVM family. The Rust ecosystem here is younger than LIBSVM, so for high-stakes work cross-check against the scikit-learn reference.

For the same data, kernel, $\nu$, and $\gamma$, the scikit-learn and LIBSVM.jl boundaries should agree up to solver tolerance because both call the same LIBSVM solver, while `linfa-svm` is an independent implementation whose results may differ slightly. The choice is governed by the language your pipeline already lives in.

## 8. A Worked Example

The example below builds a small, self-contained anomaly-detection problem. The normal class is two tight Gaussian clusters, the kind of structure you see when a sensor operates in two healthy regimes. A separate batch of uniformly scattered points stands in for unseen anomalies. We standardize on the training statistics, fit an RBF One-Class SVM with $\nu = 0.05$, and inspect both the hard `predict` labels and the continuous `decision_function` scores. Watch the $\nu$-property assert itself: up to the solver's numerical tolerance, the fraction of flagged training points stays at or below $\nu$, while the support-vector fraction sits at or above it.

::: {.panel-tabset}

## Python

```{python}
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Normal data: two tight Gaussian clusters of "healthy" sensor readings.
X_normal, _ = make_blobs(
    n_samples=300,
    centers=[[0.0, 0.0], [4.0, 4.0]],
    cluster_std=0.6,
    random_state=0,
)

# Held-out anomalies: points scattered uniformly across the plane,
# most of which land away from the two normal clusters.
X_anom = rng.uniform(low=-4.0, high=8.0, size=(40, 2))

# Standardize on the TRAINING statistics only (the RBF kernel is distance based).
scaler = StandardScaler().fit(X_normal)
Xn = scaler.transform(X_normal)
Xa = scaler.transform(X_anom)

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(Xn)

pred_normal = ocsvm.predict(Xn)  # +1 inlier, -1 outlier
pred_anom = ocsvm.predict(Xa)
scores_normal = ocsvm.decision_function(Xn)
scores_anom = ocsvm.decision_function(Xa)

n_sv = ocsvm.support_vectors_.shape[0]
print(f"training points: {Xn.shape[0]}")
print(f"support vectors: {n_sv} "
      f"({n_sv / Xn.shape[0]:.1%} of train, >= nu = 0.05)")
print(f"train flagged outliers: {(pred_normal == -1).sum()} "
      f"({(pred_normal == -1).mean():.1%}, <= nu = 0.05)")
print(f"anomalies caught: {(pred_anom == -1).sum()} / {Xa.shape[0]}")
print()
print("decision scores > 0 means inlier, < 0 means outlier")
print(f"  normal  min/mean/max: {scores_normal.min():+.3f} / "
      f"{scores_normal.mean():+.3f} / {scores_normal.max():+.3f}")
print(f"  anomaly min/mean/max: {scores_anom.min():+.3f} / "
      f"{scores_anom.mean():+.3f} / {scores_anom.max():+.3f}")
print()
print("first 5 anomaly scores:", np.round(scores_anom[:5], 3))
```

The printout should be consistent with the theory of Section 3. With $\nu = 0.05$ the support-vector fraction is expected to land at or just above five percent and the flagged-training fraction at or near five percent, bracketing $\nu$ as the $\nu$-property predicts, up to the solver's numerical tolerance.
The continuous scores should separate the two groups: normal points carry mostly positive scores, while the scattered anomalies receive negative ones. A handful of anomalies that happen to fall near a normal cluster receive near-zero or slightly positive scores, which is the correct and honest behavior of a boundary estimator. Ranking by `decision_function` rather than thresholding at zero is what you would do in deployment when you can tolerate only a fixed alert budget.

## Julia

```julia
# LIBSVM.jl wraps the same LIBSVM C library as scikit-learn.
# Pkg.add("LIBSVM")   (Statistics and Random ship with Julia)
using LIBSVM, Statistics, Random

Random.seed!(0)

# Two normal clusters (columns are samples, the LIBSVM.jl convention).
c1 = randn(2, 150) .* 0.6 .+ [0.0, 0.0]
c2 = randn(2, 150) .* 0.6 .+ [4.0, 4.0]
X_normal = hcat(c1, c2)
X_anom = rand(2, 40) .* 12.0 .- 4.0   # uniform in [-4, 8]^2

# Standardize on training statistics.
mu = mean(X_normal; dims = 2)
sd = std(X_normal; dims = 2)
Xn = (X_normal .- mu) ./ sd
Xa = (X_anom .- mu) ./ sd

model = svmtrain(Xn;
                 svmtype = OneClassSVM,
                 kernel = Kernel.RadialBasis,
                 nu = 0.05,
                 gamma = 1.0 / size(Xn, 1))   # 1 / number of features

pred_n, dec_n = svmpredict(model, Xn)   # pred is true/false (inlier/outlier)
pred_a, dec_a = svmpredict(model, Xa)

# The total support-vector count lives on the SVs field of the model.
println("support vectors:  ", model.SVs.l)
println("train outliers:   ", count(!, pred_n), " / ", size(Xn, 2))
println("anomalies caught: ", count(!, pred_a), " / ", size(Xa, 2))
```

## Rust

```rust
// linfa-svm provides a pure-Rust SVM with a one-class mode.
// Cargo.toml: linfa = "0.7", linfa-svm = "0.7", ndarray = "0.15", ndarray-rand = "0.14"
// The one-class API has changed between releases, so check these calls
// against the version you pin.
use linfa::prelude::*;
use linfa::dataset::DatasetBase;
use linfa_svm::Svm;
use ndarray::Array2;
use ndarray_rand::{rand_distr::Normal, RandomExt};

fn main() {
    // Two normal Gaussian clusters (300 x 2), standardized beforehand in practice.
    let c1 = Array2::random((150, 2), Normal::new(0.0, 0.6).unwrap());
    let c2 = Array2::random((150, 2), Normal::new(4.0, 0.6).unwrap());
    let x = ndarray::concatenate(ndarray::Axis(0), &[c1.view(), c2.view()]).unwrap();
    let data = DatasetBase::from(x);

    // One-class RBF SVM: nu controls the outlier fraction, gamma the bandwidth.
    let model = Svm::<f64, bool>::params()
        .nu_weight(0.05)
        .gaussian_kernel(1.0)
        .fit(&data)
        .unwrap();

    let preds = model.predict(&data);
    let outliers = preds.iter().filter(|&&p| !p).count();
    println!("train outliers: {} / {}", outliers, preds.len());
}
```

Honest note: the Rust SVM ecosystem is less mature than LIBSVM. `linfa-svm` implements one-class classification but its API and numerics evolve between releases, and `smartcore` focuses on supervised SVM. For production anomaly detection in a Rust service, the safest path today is to validate against the scikit-learn reference, or to call a LIBSVM binding through FFI.

:::

## 9. Practical Guidance

A short checklist captures the operational essentials.

- Standardize features before fitting a One-Class SVM, since the RBF kernel is distance based.
- Set $\nu$ from a genuine prior on the contamination rate rather than leaving it at a default.
- Tune $\gamma$ jointly with $\nu$, watching for the overfitting signature where nearly every training point becomes a support vector.
- Prefer the RBF kernel unless the data are high-dimensional and sparse.
- Remember that `predict` returns only hard inlier/outlier labels, so use the continuous `decision_function` rather than its sign when you need a graded ranking. The scores order points by distance from the boundary but are not calibrated probabilities.
- Finally, validate against Isolation Forest and, where the spherical view or a non-constant kernel is natural, against SVDD, because no single boundary estimator dominates across the diversity of anomaly detection problems.

## References

1. Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. "Estimating the Support of a High-Dimensional Distribution."
   Neural Computation, 13(7), 2001. https://doi.org/10.1162/089976601750264965
2. Tax, D. M. J., and Duin, R. P. W. "Support Vector Data Description." Machine Learning, 54(1), 2004. https://doi.org/10.1023/B:MACH.0000008084.60811.49
3. Liu, F. T., Ting, K. M., and Zhou, Z.-H. "Isolation Forest." Proceedings of the 2008 IEEE International Conference on Data Mining. https://doi.org/10.1109/ICDM.2008.17
4. Chang, C.-C., and Lin, C.-J. "LIBSVM: A Library for Support Vector Machines." ACM Transactions on Intelligent Systems and Technology, 2(3), 2011. https://doi.org/10.1145/1961189.1961199
5. Hariri, S., Kind, M. C., and Brunner, R. J. "Extended Isolation Forest." IEEE Transactions on Knowledge and Data Engineering, 33(4), 2021. https://doi.org/10.1109/TKDE.2019.2947676
6. scikit-learn developers. "Novelty and Outlier Detection." https://scikit-learn.org/stable/modules/outlier_detection.html