119 Kernel Methods Beyond SVM

Support vector machines made kernels famous, but the kernel idea is far more general than max margin classification. Any learning algorithm that can be written purely in terms of inner products between data points can be lifted into a rich, possibly infinite dimensional feature space by replacing those inner products with a kernel function. This chapter develops the kernel machinery that surrounds and outlives the SVM, with a sharp focus on the three tools that matter most in practice today: kernel ridge regression as the canonical kernel learner, and the two approximations, the Nystrom method and random Fourier features, that make kernels usable on large data. The representer theorem ties them together and explains why every one of them is tractable.

The treatment is deliberately library driven. The mathematics below is worth understanding line by line, but you should almost never implement it yourself. Mature, well tested open source libraries already provide exact and approximate kernel methods with sensible defaults, and the demonstration at the end uses them directly rather than reinventing the linear algebra.

119.1 1. Kernels and Feature Spaces

119.1.1 1.1 The kernel trick, restated

A kernel is a symmetric function $k(\mathbf{x}, \mathbf{x}')$ that computes an inner product in some feature space. Concretely, there exists a feature map $\phi: \mathcal{X} \to \mathcal{H}$ into a Hilbert space such that

\[ k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}. \]

The point of the trick is that we never form $\phi(\mathbf{x})$ explicitly. The Gaussian (RBF) kernel

\[ k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2\right), \qquad \gamma = \frac{1}{2\sigma^2}, \]

corresponds to an infinite dimensional $\mathcal{H}$, yet each evaluation costs only $O(d)$ where $d$ is the input dimension. The bandwidth parameter $\gamma$ (equivalently $\sigma$) sets the length scale over which two points are considered similar, and it is the single most important hyperparameter in any RBF based model.
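As a minimal sketch of the $O(d)$ per-entry cost, the Gaussian kernel can be evaluated for a whole batch with a few array operations. The helper name `rbf_gram`, the sizes, and the value of $\sigma$ are illustrative assumptions, not part of any library.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, clipped at zero
    # to absorb tiny negative values from floating-point cancellation.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # 5 points in d = 3 (illustrative sizes)
sigma = 2.0                       # assumed length scale
gamma = 1.0 / (2.0 * sigma**2)    # gamma = 1 / (2 sigma^2)

K = rbf_gram(X, X, gamma)
print(K.shape)
print(np.allclose(np.diag(K), 1.0), np.allclose(K, K.T))
```

Each entry touches the $d$ coordinates of two points once, and the infinite dimensional feature vector never appears.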

119.1.2 1.2 Positive definiteness and Mercer’s condition

Not every symmetric function is a valid kernel. The requirement is positive definiteness: for any finite set $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, the Gram matrix $K$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semidefinite, meaning $\mathbf{c}^\top K \mathbf{c} \geq 0$ for all $\mathbf{c} \in \mathbb{R}^n$. Mercer’s theorem guarantees that any continuous positive definite kernel on a compact domain admits an eigenfunction expansion $k(\mathbf{x}, \mathbf{x}') = \sum_i \lambda_i \psi_i(\mathbf{x}) \psi_i(\mathbf{x}')$ with $\lambda_i \geq 0$, which is exactly the feature map written out: $\phi(\mathbf{x}) = (\sqrt{\lambda_i}\, \psi_i(\mathbf{x}))_i$. Valid kernels are closed under addition, multiplication by a positive scalar, pointwise products, and composition with any fixed input map $g$, as in $k(g(\mathbf{x}), g(\mathbf{x}'))$, which lets practitioners build complex kernels from simple parts.
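The positive semidefiniteness condition and the closure rules can be checked numerically on a small sample. The following is a rough sketch with assumed sizes and an assumed helper `min_eig`; a finite check like this can expose an invalid kernel but cannot prove validity.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # pairwise squared distances

K_rbf = np.exp(-0.5 * sq)        # Gaussian kernel with gamma = 0.5
K_poly = (1.0 + X @ X.T)**2      # inhomogeneous quadratic kernel

def min_eig(K):
    # eigvalsh assumes symmetry; symmetrize to remove rounding asymmetry.
    return np.linalg.eigvalsh((K + K.T) / 2).min()

for name, K in [("rbf", K_rbf), ("poly", K_poly),
                ("sum", K_rbf + K_poly), ("product", K_rbf * K_poly)]:
    print(f"{name:8s} min eigenvalue = {min_eig(K): .2e}")

# A symmetric function that is NOT a kernel: negative squared distance.
print(f"invalid  min eigenvalue = {min_eig(-sq): .2e}")
```

The four valid Gram matrices have smallest eigenvalues that are nonnegative up to rounding, while the negative squared distance matrix has a clearly negative one.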

119.1.3 1.3 The reproducing kernel Hilbert space

Each positive definite kernel induces a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$, the closure of the span of functions $\{k(\cdot, \mathbf{x})\}$ under the inner product $\langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{x}') \rangle = k(\mathbf{x}, \mathbf{x}')$. The defining property is reproduction:

\[ f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle_{\mathcal{H}_k} \quad \text{for all } f \in \mathcal{H}_k. \]

The RKHS norm $\lVert f \rVert_{\mathcal{H}_k}$ measures the smoothness of $f$, and controlling it is how kernel methods regularize. This abstract setup pays off immediately in the representer theorem.

119.2 2. The Representer Theorem

119.2.1 2.1 Statement

Consider any regularized empirical risk problem over an RKHS:

\[ \min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} L\big(y_i, f(\mathbf{x}_i)\big) + \Omega\big(\lVert f \rVert_{\mathcal{H}_k}\big), \]

where $L$ is an arbitrary loss and $\Omega$ is strictly increasing. The representer theorem states that any minimizer admits a finite expansion over the training points:

\[ f^\star(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \, k(\mathbf{x}_i, \mathbf{x}). \]

The infinite dimensional optimization collapses to finding $n$ coefficients $\alpha_i$.

119.2.2 2.2 Why it holds

Decompose any candidate $f$ into a part inside the span $\mathcal{S} = \mathrm{span}\{k(\cdot, \mathbf{x}_i)\}$ and an orthogonal remainder: $f = f_{\parallel} + f_{\perp}$. By the reproducing property, $f(\mathbf{x}_i) = \langle f, k(\cdot, \mathbf{x}_i)\rangle = \langle f_{\parallel}, k(\cdot, \mathbf{x}_i)\rangle$, so $f_{\perp}$ does not affect any training prediction and hence does not affect the loss term. But $\lVert f \rVert^2 = \lVert f_{\parallel} \rVert^2 + \lVert f_{\perp} \rVert^2$, so a nonzero $f_{\perp}$ only inflates the regularizer. Since $\Omega$ is strictly increasing, any minimizer must have $f_{\perp} = 0$, which is precisely the claimed finite expansion. This single argument is what makes kernel ridge regression, kernel logistic regression, the SVM, and the posterior mean of Gaussian process regression all tractable.
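The orthogonal decomposition can be made concrete with an explicit, finite feature map. The sketch below assumes a linear kernel with $\phi(\mathbf{x}) = \mathbf{x}$ in $d = 20$ dimensions and only $n = 5$ training points, so that the span of the training features is a proper subspace; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 20                       # fewer points than feature dimensions
Phi = rng.normal(size=(n, d))      # row i plays the role of phi(x_i)
w = rng.normal(size=d)             # arbitrary candidate f(x) = <w, phi(x)>

# Orthogonal projector onto span{phi(x_1), ..., phi(x_n)}.
P = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
w_par = P @ w                      # f_parallel
w_perp = w - w_par                 # f_perp

# f_perp is invisible to the training points ...
print(np.allclose(Phi @ w, Phi @ w_par))
# ... but it still costs norm.
print(np.linalg.norm(w_par) < np.linalg.norm(w))

# f_parallel is a finite expansion over training points: w_par = Phi^T alpha.
alpha = np.linalg.solve(Phi @ Phi.T, Phi @ w)
print(np.allclose(w_par, Phi.T @ alpha))
```

Dropping `w_perp` leaves every training prediction unchanged while reducing the norm, which is the whole argument in three lines of linear algebra.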

119.3 3. Kernel Ridge Regression

119.3.1 3.1 From ridge to kernel ridge

Kernel ridge regression (KRR) is the most direct application of the machinery above and the workhorse we will scale up later. Ordinary ridge regression solves $\min_{\mathbf{w}} \lVert \mathbf{y} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2$ with closed form $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$. Replacing $\mathbf{x}$ with $\phi(\mathbf{x})$ and invoking the representer theorem, write $f(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})$. Substituting into the squared loss with RKHS regularizer $\lambda \lVert f \rVert^2 = \lambda \, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}$ yields

\[ \min_{\boldsymbol{\alpha}} \; \lVert \mathbf{y} - K\boldsymbol{\alpha} \rVert^2 + \lambda \, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}. \]

Setting the gradient to zero gives the elegant closed form

\[ \boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}, \qquad f(\mathbf{x}_\star) = \mathbf{k}_\star^\top (K + \lambda I)^{-1} \mathbf{y}, \]

where $\mathbf{k}_\star = [k(\mathbf{x}_1, \mathbf{x}_\star), \dots, k(\mathbf{x}_n, \mathbf{x}_\star)]^\top$. Notice there is no iterative optimization: KRR is one linear solve, which is part of why it is such a clean baseline.
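A minimal NumPy sketch of this closed form follows, under assumed toy data and an assumed helper `rbf`. It also checks that the gradient $2K\big((K + \lambda I)\boldsymbol{\alpha} - \mathbf{y}\big)$ vanishes at the solution and that, with a linear kernel, the dual solution recovers the primal ridge weights. In practice the library estimator should be preferred over this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 50, 3, 0.1
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-gamma * sq)

K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(n), y)    # one linear solve

# Gradient of ||y - K a||^2 + lam a^T K a is 2 K ((K + lam I) a - y).
grad = 2.0 * K @ ((K + lam * np.eye(n)) @ alpha - y)
print(f"max |gradient| = {np.abs(grad).max():.1e}")

# Prediction at new points: f(x*) = k*^T alpha.
X_new = rng.normal(size=(4, d))
f_new = rbf(X_new, X) @ alpha
print(f_new.shape)

# Sanity check with the linear kernel: dual and primal ridge agree.
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_dual, w_primal))
```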

119.3.2 3.2 Properties and cost

KRR has no hyperparameter for sparsity, so unlike the SVM every training point contributes a nonzero $\alpha_i$. It is the predictive mean of Gaussian process regression with noise variance $\lambda$, which connects it to a full Bayesian treatment and lets you borrow GP intuition about kernels and length scales. The dominant cost is the $O(n^3)$ factorization of the $n \times n$ matrix $K + \lambda I$ and $O(n^2)$ memory to store $K$. This cubic scaling is the central obstacle. At a few thousand points it is a second or two; at a hundred thousand it is hours and tens of gigabytes; beyond that it is simply infeasible. Sections 4 and 5 exist entirely to break this barrier.
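The memory side of this scaling is simple arithmetic. As a quick sketch, assuming a dense float64 Gram matrix at 8 bytes per entry (the function name is illustrative):

```python
def gram_gigabytes(n, bytes_per_entry=8):
    """Memory for a dense n x n Gram matrix, in gigabytes (10^9 bytes)."""
    return n * n * bytes_per_entry / 1e9

for n in (5_000, 100_000, 1_000_000):
    print(f"n = {n:>9,d}: {gram_gigabytes(n):>10,.1f} GB")
```

This gives roughly 0.2 GB at five thousand points, 80 GB at a hundred thousand, and 8,000 GB at a million, before counting any workspace for the factorization.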

119.3.3 3.3 Choosing the regularizer and bandwidth

Two knobs govern KRR: the regularization strength $\lambda$ (called alpha in scikit-learn) and the kernel bandwidth $\gamma$. Small bandwidth (large $\gamma$) produces a wiggly function that interpolates noise; large bandwidth (small $\gamma$) oversmooths toward a constant. Because KRR has a closed form, leave one out cross validation can be computed cheaply from a single factorization: the LOO residuals are $e_i / (1 - H_{ii})$, where $e_i = y_i - \hat{y}_i$ is the ordinary in-sample residual and $H = K(K + \lambda I)^{-1}$ is the smoother (hat) matrix. This makes principled tuning practical on moderate datasets without refitting. In a library setting you would wrap KRR in a grid search over $(\gamma, \lambda)$; at moderate $n$ each fit is a single linear solve, so the search stays affordable.
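The leave one out shortcut is easy to verify against brute force refitting on a small problem. This is a sketch with assumed toy data; it forms an explicit inverse for readability, which a production implementation would replace with a factorization.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam, gamma = 30, 0.5, 0.5
X = rng.normal(size=(n, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-gamma * sq)

# Shortcut: one inverse gives the hat matrix H = K (K + lam I)^{-1}.
H = K @ np.linalg.inv(K + lam * np.eye(n))
e = y - H @ y                            # in-sample residuals
loo_fast = e / (1.0 - np.diag(H))

# Brute force: refit n times, each time leaving one point out.
loo_slow = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    a = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(n - 1), y[keep])
    loo_slow[i] = y[i] - K[i, keep] @ a

print(np.allclose(loo_fast, loo_slow))
print(f"LOO mean squared error: {np.mean(loo_fast**2):.4f}")
```

The two vectors agree to rounding error, so a sweep over $\lambda$ at fixed $\gamma$ needs no refits beyond the linear algebra already done.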

119.3.4 3.4 Failure modes

KRR fails quietly in a few recognizable ways. An unstandardized input with one dominant scale makes a single $\gamma$ meaningless across dimensions, so always standardize before an RBF kernel. Too small a $\lambda$ with a small bandwidth interpolates the training noise and generalizes terribly even as training error vanishes. And the dense $O(n^2)$ Gram matrix silently exhausts memory well before the $O(n^3)$ solve becomes the bottleneck, which is the practical reason large problems demand the approximations below rather than a faster solver.

119.4 4. The Nystrom Approximation

119.4.1 4.1 Low rank from a landmark subset

The Nystrom method attacks the $O(n^3)$ cost by approximating the Gram matrix with a low rank factorization built from a data dependent subsample. Choose $m \ll n$ landmark points (often a uniform or leverage score weighted sample of the training set). Let $K_{mm}$ be the kernel matrix among landmarks and $K_{nm}$ the kernel between all points and landmarks. The Nystrom low rank approximation is

\[ K \approx \tilde{K} = K_{nm} \, K_{mm}^{+} \, K_{nm}^\top, \]

where $K_{mm}^{+}$ is the pseudoinverse. This reconstructs the full Gram matrix from an $m$ column sketch and has rank at most $m$.

119.4.2 4.2 Explicit features and solving

Factoring $K_{mm}^{+} = L L^\top$ gives an explicit feature map $\mathbf{z}(\mathbf{x}) = L^\top \mathbf{k}_m(\mathbf{x})$ where $\mathbf{k}_m(\mathbf{x})$ collects the kernel values against the $m$ landmarks, since then $\mathbf{z}(\mathbf{x})^\top \mathbf{z}(\mathbf{x}') = \mathbf{k}_m(\mathbf{x})^\top K_{mm}^{+} \mathbf{k}_m(\mathbf{x}')$ reproduces the entries of $\tilde{K}$. This is what scikit-learn’s Nystroem transformer returns, up to the particular choice of factor $L$: it maps each input to an $m$ dimensional vector. Once the data live in this explicit space, downstream learning becomes linear, so KRR reduces to ordinary ridge regression on the transformed features and costs $O(nm^2 + m^3)$ time and $O(nm)$ memory. Using the Woodbury identity, the solution can be written directly in terms of the small matrices without ever forming the full $n \times n$ system.

The key design choice is $m$. Larger $m$ gives a more faithful approximation and better accuracy at higher cost; the right value is the smallest $m$ at which validation error plateaus, typically a few hundred to a couple thousand.
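The construction in this section can be sketched directly in NumPy. The factor $L$ below comes from an eigendecomposition of $K_{mm}$, which is one assumed choice among several; the data, sizes, and uniform landmark sampling are also illustrative, and the library transformer should be preferred in real use.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, gamma = 200, 20, 0.5
X = rng.normal(size=(n, 3))

def rbf(A, B):
    sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-gamma * sq)

idx = rng.choice(n, size=m, replace=False)    # uniform landmark sample
K_nm = rbf(X, X[idx])
K_mm = rbf(X[idx], X[idx])

# Factor the pseudoinverse: keep eigenvalues above a relative threshold.
evals, U = np.linalg.eigh(K_mm)
keep = evals > 1e-10 * evals.max()
L = U[:, keep] / np.sqrt(evals[keep])         # L @ L.T equals pinv(K_mm)

Z = K_nm @ L                                  # row i is z(x_i)^T
K_tilde = K_nm @ (L @ L.T) @ K_nm.T           # formed only to check the sketch

K = rbf(X, X)
rel_err = np.linalg.norm(K - K_tilde) / np.linalg.norm(K)
print(Z.shape)
print(np.allclose(Z @ Z.T, K_tilde))
print(f"relative Frobenius error: {rel_err:.3f}")
```

The landmark block of $\tilde{K}$ matches $K_{mm}$ up to the discarded eigenvalues, and rerunning with a larger `m` shows the relative error shrinking, which is the plateau search described above.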

119.5 5. Random Fourier Features

119.5.1 5.1 Bochner’s theorem

Random Fourier features (RFF) attack the same cost from a data independent angle: they approximate the kernel with an explicit, low dimensional feature map drawn without looking at the data. For shift invariant kernels, $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$, Bochner’s theorem says the kernel is the Fourier transform of a nonnegative measure. After normalizing $k(\mathbf{0}) = 1$, that measure is a probability density $p(\boldsymbol{\omega})$:

\[ k(\mathbf{x} - \mathbf{x}') = \int_{\mathbb{R}^d} p(\boldsymbol{\omega}) \, e^{\,i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} \, d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}\!\left[ e^{\,i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} \right]. \]

For the Gaussian kernel, $p$ is itself a Gaussian, $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, 2\gamma I)$.

119.5.2 5.2 The Monte Carlo feature map

The expectation above is a population average that we estimate by sampling. Taking real parts and drawing $D$ frequencies $\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_D \sim p$ and phases $b_j \sim \mathrm{Uniform}(0, 2\pi)$, define

\[ \mathbf{z}(\mathbf{x}) = \sqrt{\tfrac{2}{D}} \, \big[\cos(\boldsymbol{\omega}_1^\top \mathbf{x} + b_1), \dots, \cos(\boldsymbol{\omega}_D^\top \mathbf{x} + b_D)\big]^\top. \]

Then $\mathbf{z}(\mathbf{x})^\top \mathbf{z}(\mathbf{x}')$ is an unbiased estimator of $k(\mathbf{x}, \mathbf{x}')$, with approximation error that decays as $O(1/\sqrt{D})$, up to logarithmic factors, uniformly over a compact data domain. This is precisely what scikit-learn’s RBFSampler computes. The kernel has been replaced by an explicit $D$ dimensional map drawn once, up front, independent of the labels and of the data distribution.
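A short NumPy sketch of the feature map makes the Monte Carlo estimate tangible. The sizes, seed, and $\gamma$ are assumptions chosen so the exact Gram matrix is cheap to compare against.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, D, gamma = 40, 5, 5000, 0.1
X = rng.normal(size=(n, d))

# Frequencies from N(0, 2*gamma*I) and phases from Uniform(0, 2*pi).
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)      # n x D explicit features

# Exact Gaussian Gram matrix for comparison.
sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-gamma * sq)

max_err = np.abs(Z @ Z.T - K).max()
print(f"max |z(x)^T z(x') - k(x, x')| = {max_err:.4f}")
```

Each entry of `Z @ Z.T` has a standard deviation of at most $1/\sqrt{D}$, so quadrupling `D` roughly halves the error.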

119.5.3 5.3 Linear cost after lifting

The payoff is the same as Nystrom: once each point is mapped to $\mathbf{z}(\mathbf{x}) \in \mathbb{R}^D$, any linear method applies directly. KRR becomes ordinary ridge regression on the lifted matrix $Z$, costing $O(nD^2 + D^3)$ time and $O(nD)$ memory instead of $O(n^3)$ and $O(n^2)$. For $D \ll n$ this is the difference between feasible and impossible, and because the features are independent of the data the transform is embarrassingly parallel and streams trivially. Variants such as orthogonal random features and structured (Fastfood) transforms reduce variance or accelerate the projection from $O(dD)$ toward $O(D \log d)$, and they are common choices at very large scale.
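The streaming claim can be illustrated by accumulating the ridge normal equations chunk by chunk, so that only a $D \times D$ matrix and a $D$ vector stay in memory. This is a sketch under assumed toy data and an assumed chunk size of 256; the helper `lift` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, D, gamma, lam = 2000, 5, 100, 0.1, 1.0
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# One fixed random map, drawn before any data is seen.
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def lift(X_chunk):
    return np.sqrt(2.0 / D) * np.cos(X_chunk @ W + b)

# Stream over chunks, keeping only the D x D and D-vector accumulators.
A = lam * np.eye(D)
c = np.zeros(D)
for start in range(0, n, 256):
    Zc = lift(X[start:start + 256])
    A += Zc.T @ Zc
    c += Zc.T @ y[start:start + 256]
w_stream = np.linalg.solve(A, c)

# Batch reference: ridge on the full lifted matrix.
Z = lift(X)
w_batch = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
print(np.allclose(w_stream, w_batch))
```

Because the map is fixed in advance, the chunks could equally be processed on separate workers and their accumulators summed.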

119.6 6. Choosing Among the Three

The three methods sit on a clear spectrum, and the demonstration in Section 7 runs all of them on the same data.

Exact KRR is the accuracy gold standard and the right choice up to a few thousand points, where the cubic solve is still cheap and the closed form LOO makes tuning painless. Past that, you approximate.

Nystrom adapts to the spectrum of the data, so when the Gram matrix has rapidly decaying eigenvalues it achieves a given accuracy with far fewer features than RFF, whose error is governed by the worst case rather than the actual spectrum. Theory shows the Nystrom generalization error can improve to $O(1/m)$ when leverage scores guide sampling, beating the $O(1/\sqrt{D})$ rate of plain random features. The cost is a data dependent construction and sensitivity to landmark selection.

Random Fourier features, by contrast, are trivial to generate, label and distribution independent, embarrassingly parallel, and apply to any shift invariant kernel without touching the data, which makes them the natural fit for streaming and privacy constrained settings. A useful default is to prefer Nystrom when the kernel matrix is approximately low rank and random features when it is not, or when retaining landmarks is undesirable.

One subtlety ties the approximation level back to regularization: a coarser approximation (small $m$ or $D$) acts as implicit regularization, so the optimal $\lambda$ often shrinks as $m$ or $D$ grows. Tune them together rather than in isolation.
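One way to tune the approximation level and the regularizer together is a joint grid search over a pipeline. The sketch below assumes scikit-learn is installed and uses a small Friedman #1 sample with an illustrative grid; the step names are the lowercased class names that `make_pipeline` assigns.

```python
from sklearn.datasets import make_friedman1
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0,
                      random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.1, random_state=0),
    Ridge(),
)

# Search m (n_components) and lambda (alpha) jointly, not one at a time.
grid = {
    "nystroem__n_components": [50, 100, 200],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, grid, cv=3, scoring="r2").fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Inspecting `search.cv_results_` for each value of `n_components` shows how the best `alpha` moves as the approximation gets finer.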

119.7 7. Library Demonstration

Rather than reimplement any of this, we lean on scikit-learn, the mature open source standard for classical machine learning in Python. It ships KernelRidge for the exact method and the Nystroem and RBFSampler transformers for the two approximations, all sharing the same RBF kernel and gamma so the comparison is apples to apples. The running example is make_friedman1, a standard synthetic regression benchmark whose target

\[ y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon \]

is a smooth nonlinear function of five informative features, padded here with five pure noise features. A linear model is hopeless on it, while an RBF kernel fits it well, so it cleanly exposes both the value of kernels and the accuracy cost of each approximation. Everything is seeded for reproducibility, and each model is wrapped in a StandardScaler pipeline because, as Section 3.4 warned, a single bandwidth only makes sense on standardized inputs.


import time
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

SEED = 0

# Friedman #1: a smooth nonlinear surface with 5 informative features
# plus 5 pure-noise features. n is large enough that the exact O(n^3)
# solve is visibly slower than the O(n D^2) approximations.
X, y = make_friedman1(n_samples=6000, n_features=10, noise=1.0,
                      random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=SEED)

GAMMA = 0.1     # shared RBF bandwidth, gamma = 1 / (2 sigma^2)
LAMBDA = 1.0    # ridge / KRR regularization (sklearn calls it alpha)
D = 300         # explicit features for Nystroem and RFF (m == D here)

def evaluate(name, model):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_s = time.perf_counter() - t0
    r2 = r2_score(y_te, model.predict(X_te))
    print(f"{name:<26} test R2 = {r2:6.3f}   fit = {fit_s:6.2f}s")
    return r2

print(f"train shape {X_tr.shape}, test shape {X_te.shape}\n")

# 1. Exact kernel ridge regression: the O(n^3) gold standard.
evaluate("Exact KernelRidge (RBF)", make_pipeline(
    StandardScaler(),
    KernelRidge(kernel="rbf", gamma=GAMMA, alpha=LAMBDA),
))

# 2. Nystroem: data-dependent landmark sketch, then linear ridge.
evaluate("Nystroem + Ridge", make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=GAMMA, n_components=D, random_state=SEED),
    Ridge(alpha=LAMBDA),
))

# 3. Random Fourier features: data-independent map, then linear ridge.
evaluate("RandomFourier + Ridge", make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=GAMMA, n_components=D, random_state=SEED),
    Ridge(alpha=LAMBDA),
))

# 4. Plain linear ridge: the baseline the kernel must beat.
evaluate("Linear Ridge (baseline)", make_pipeline(
    StandardScaler(), Ridge(alpha=LAMBDA)))

# Inspect the explicit feature map the approximation builds: an n x D
# matrix replaces the n x n Gram matrix the exact method would form.
scaler = StandardScaler().fit(X_tr)
Z = Nystroem(kernel="rbf", gamma=GAMMA, n_components=D,
             random_state=SEED).fit_transform(scaler.transform(X_tr))
print(f"\nNystroem feature matrix: {Z.shape}  "
      f"(vs a {X_tr.shape[0]} x {X_tr.shape[0]} Gram matrix)")
print("first row, first 5 features:", np.round(Z[0, :5], 4))

train shape (4800, 10), test shape (1200, 10)

Exact KernelRidge (RBF)    test R2 =  0.917   fit =   1.36s
Nystroem + Ridge           test R2 =  0.892   fit =   0.10s
RandomFourier + Ridge      test R2 =  0.875   fit =   0.05s
Linear Ridge (baseline)    test R2 =  0.717   fit =   0.00s

Nystroem feature matrix: (4800, 300)  (vs a 4800 x 4800 Gram matrix)
first row, first 5 features: [0.1304 0.003  0.0128 0.0242 0.0014]

The exact RBF kernel reaches the highest test $R^2$ but takes the longest to fit. Nystrom recovers almost all of that accuracy in a fraction of the time, and random Fourier features are faster still at a slightly larger accuracy cost. All three crush the linear baseline, whose $R^2$ near $0.7$ shows how much nonlinear signal the kernel captures. The printed $n \times D$ matrix is the whole point of the approximations: a tall, thin explicit feature matrix replaces the dense $n \times n$ Gram matrix, and from there it is just linear algebra.

# Julia: KernelFunctions.jl builds the kernel; the solve is plain LinearAlgebra.
# Exact kernel ridge regression on the same idea as the Python cell.
using KernelFunctions, LinearAlgebra, Random
Random.seed!(0)

# A Friedman-1-style target (see the Python cell for the exact form).
n, d = 6000, 10
X = rand(n, d)
y = 10 .* sin.(pi .* X[:,1] .* X[:,2]) .+ 20 .* (X[:,3] .- 0.5).^2 .+
    10 .* X[:,4] .+ 5 .* X[:,5] .+ randn(n)

# Rows are i.i.d. draws, so a contiguous 80/20 split is already random.
ntr = 4800
Xtr, Xte = X[1:ntr, :], X[ntr+1:end, :]
ytr, yte = y[1:ntr], y[ntr+1:end]

# Exact kernel ridge: form the RBF Gram matrix and solve (K + lambda I) a = y.
# `I` comes from LinearAlgebra; the length scale sqrt(1 / (2 gamma)) makes
# the squared-exponential kernel equal exp(-gamma * ||x - x'||^2).
gamma, lambda = 0.1, 1.0
k = with_lengthscale(SqExponentialKernel(), sqrt(1 / (2 * gamma)))
Ktr = kernelmatrix(k, RowVecs(Xtr))
alpha = (Ktr + lambda * I) \ ytr

# Predict on the held-out rows: f(x*) = k*^T alpha.
yhat = kernelmatrix(k, RowVecs(Xte), RowVecs(Xtr)) * alpha

# For Nystroem-style approximations at scale, see ApproximateGPs.jl, which
# provides inducing-point sparse GP approximations.

// Rust: smartcore is a general-purpose ML crate. The sketch below uses its
// RBF-kernel support vector regression (SVR), a close relative of kernel
// ridge. Module paths and builder names differ across smartcore versions,
// so check the crate documentation for the version you pin.
use smartcore::dataset::generator;
use smartcore::linalg::basic::matrix::DenseMatrix;
use smartcore::svm::svr::{SVR, SVRParameters};
use smartcore::svm::RBFKernel;

fn main() {
    // SVR stands in here for an exact kernel regressor: it minimizes an
    // epsilon-insensitive loss rather than the squared loss of kernel
    // ridge, but it works with the exact RBF kernel, with no low-rank
    // approximation.
    let data = generator::make_regression(6000, 10, 5, 1.0, 42);
    let x = DenseMatrix::from_2d_vec(&data.0).unwrap();
    let y = data.1;

    let model = SVR::fit(
        &x, &y,
        SVRParameters::default().with_kernel(RBFKernel::default().with_gamma(0.1)),
    ).unwrap();
    let _pred = model.predict(&x).unwrap();
}
// Honest note: as of mid-2026 no mainstream Rust crate ships Nystroem
// or random Fourier features out of the box. For large-scale kernels in
// Rust you implement the explicit feature map yourself (sample omega,
// apply cos) and feed it to smartcore's linear Ridge, or call sklearn
// across an FFI boundary. smartcore's exact RBF path is shown above.

119.8 8. Practical Guidance

When data is small (up to a few thousand points), use exact KRR and tune $\lambda$ and the bandwidth by cross validation, exploiting the closed form LOO formula for cheap model selection. From roughly ten thousand to a million points, switch to Nystrom if the spectrum decays quickly or to random Fourier features otherwise, choosing $m$ or $D$ in the low hundreds to low thousands and increasing until validation error plateaus. Always standardize inputs before applying an RBF kernel so a single bandwidth is sensible across dimensions. Remember that the approximations interact with regularization: a coarser approximation acts as implicit regularization, so the optimal $\lambda$ often shrinks as $m$ or $D$ grows, and the two should be tuned jointly. Finally, when an explicit feature map is available through either approximation, the entire kernel toolbox (regression, classification, clustering) reduces to fast linear algebra on the lifted features, which is the modern, library friendly way to deploy kernel methods at scale. Reach for the well tested transformers in scikit-learn (or the Julia and Rust analogues above) before writing any of this by hand.

119.9 References

Scholkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, 2002. https://mitpress.mit.edu/9780262536578/learning-with-kernels/
Scholkopf, B., Herbrich, R., and Smola, A. J. A Generalized Representer Theorem. COLT, 2001. https://doi.org/10.1007/3-540-44581-1_27
Rahimi, A. and Recht, B. Random Features for Large Scale Kernel Machines. NeurIPS, 2007. https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html
Williams, C. and Seeger, M. Using the Nystrom Method to Speed Up Kernel Machines. NeurIPS, 2001. https://papers.nips.cc/paper/2000/hash/19de10adbaa1b2ee13f77f679fa1483a-Abstract.html
Drineas, P. and Mahoney, M. W. On the Nystrom Method for Approximating a Gram Matrix. JMLR, 2005. https://www.jmlr.org/papers/v6/drineas05a.html
Rudi, A., Camoriano, R., and Rosasco, L. Less is More: Nystrom Computational Regularization. NeurIPS, 2015. https://papers.nips.cc/paper/2015/hash/03e0704b5690a2dee1861dc3ad3316c9-Abstract.html
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. https://gaussianprocess.org/gpml/
Le, Q., Sarlos, T., and Smola, A. Fastfood: Approximating Kernel Expansions in Loglinear Time. ICML, 2013. https://proceedings.mlr.press/v28/le13.html
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. JMLR, 2011. https://www.jmlr.org/papers/v12/pedregosa11a.html

# Kernel Methods Beyond SVM Support vector machines made kernels famous, but the kernel idea is far more general than max margin classification. Any learning algorithm that can be written purely in terms of inner products between data points can be lifted into a rich, possibly infinite dimensional feature space by replacing those inner products with a kernel function. This chapter develops the kernel machinery that surrounds and outlives the SVM, with a sharp focus on the three tools that matter most in practice today: kernel ridge regression as the canonical kernel learner, and the two approximations, the Nystrom method and random Fourier features, that make kernels usable on large data. The representer theorem ties them together and explains why every one of them is tractable. The treatment is deliberately library driven. The mathematics below is worth understanding line by line, but you should almost never implement it yourself. Mature, well tested open source libraries already provide exact and approximate kernel methods with sensible defaults, and the demonstration at the end uses them directly rather than reinventing the linear algebra. ## 1. Kernels and Feature Spaces ### 1.1 The kernel trick, restated A kernel is a symmetric function $k(\mathbf{x}, \mathbf{x}')$ that computes an inner product in some feature space. Concretely, there exists a feature map $\phi: \mathcal{X} \to \mathcal{H}$ into a Hilbert space such that $$ k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}. $$ The point of the trick is that we never form $\phi(\mathbf{x})$ explicitly. The Gaussian (RBF) kernel $$ k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2\right), \qquad \gamma = \frac{1}{2\sigma^2}, $$ corresponds to an infinite dimensional $\mathcal{H}$, yet each evaluation costs only $O(d)$ where $d$ is the input dimension. 
The bandwidth parameter $\gamma$ (equivalently $\sigma$) sets the length scale over which two points are considered similar, and it is the single most important hyperparameter in any RBF based model. ### 1.2 Positive definiteness and Mercer's condition Not every symmetric function is a valid kernel. The requirement is positive definiteness: for any finite set $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, the Gram matrix $K$ with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semidefinite, meaning $\mathbf{c}^\top K \mathbf{c} \geq 0$ for all $\mathbf{c} \in \mathbb{R}^n$. Mercer's theorem guarantees that any continuous positive definite kernel admits an eigenfunction expansion $k(\mathbf{x}, \mathbf{x}') = \sum_i \lambda_i \psi_i(\mathbf{x}) \psi_i(\mathbf{x}')$ with $\lambda_i \geq 0$, which is exactly the feature map written out. Valid kernels are closed under addition, multiplication by a positive scalar, pointwise products, and composition with feature maps, which lets practitioners build complex kernels from simple parts. ### 1.3 The reproducing kernel Hilbert space Each positive definite kernel induces a unique reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$, the closure of the span of functions $\{k(\cdot, \mathbf{x})\}$ under the inner product $\langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{x}') \rangle = k(\mathbf{x}, \mathbf{x}')$. The defining property is reproduction: $$ f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle_{\mathcal{H}_k} \quad \text{for all } f \in \mathcal{H}_k. $$ The RKHS norm $\lVert f \rVert_{\mathcal{H}_k}$ measures the smoothness of $f$, and controlling it is how kernel methods regularize. This abstract setup pays off immediately in the representer theorem. ## 2. 
The Representer Theorem ### 2.1 Statement Consider any regularized empirical risk problem over an RKHS: $$ \min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} L\big(y_i, f(\mathbf{x}_i)\big) + \Omega\big(\lVert f \rVert_{\mathcal{H}_k}\big), $$ where $L$ is an arbitrary loss and $\Omega$ is strictly increasing. The representer theorem states that any minimizer admits a finite expansion over the training points: $$ f^\star(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \, k(\mathbf{x}_i, \mathbf{x}). $$ The infinite dimensional optimization collapses to finding $n$ coefficients $\alpha_i$. ### 2.2 Why it holds Decompose any candidate $f$ into a part inside the span $\mathcal{S} = \mathrm{span}\{k(\cdot, \mathbf{x}_i)\}$ and an orthogonal remainder: $f = f_{\parallel} + f_{\perp}$. By the reproducing property, $f(\mathbf{x}_i) = \langle f, k(\cdot, \mathbf{x}_i)\rangle = \langle f_{\parallel}, k(\cdot, \mathbf{x}_i)\rangle$, so $f_{\perp}$ does not affect any training prediction and hence does not affect the loss term. But $\lVert f \rVert^2 = \lVert f_{\parallel} \rVert^2 + \lVert f_{\perp} \rVert^2$, so a nonzero $f_{\perp}$ only inflates the regularizer. Since $\Omega$ is increasing, the optimum must have $f_{\perp} = 0$, which is precisely the claimed finite expansion. This single argument is what makes kernel ridge regression, kernel logistic regression, the SVM, and Gaussian process regression all tractable. ## 3. Kernel Ridge Regression ### 3.1 From ridge to kernel ridge Kernel ridge regression (KRR) is the most direct application of the machinery above and the workhorse we will scale up later. Ordinary ridge regression solves $\min_{\mathbf{w}} \lVert \mathbf{y} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2$ with closed form $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$. Replacing $\mathbf{x}$ with $\phi(\mathbf{x})$ and invoking the representer theorem, write $f(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})$. 
Substituting into the squared loss with RKHS regularizer $\lambda \lVert f \rVert^2 = \lambda \, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}$ yields $$ \min_{\boldsymbol{\alpha}} \; \lVert \mathbf{y} - K\boldsymbol{\alpha} \rVert^2 + \lambda \, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}. $$ Setting the gradient to zero gives the elegant closed form $$ \boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}, \qquad f(\mathbf{x}_\star) = \mathbf{k}_\star^\top (K + \lambda I)^{-1} \mathbf{y}, $$ where $\mathbf{k}_\star = [k(\mathbf{x}_1, \mathbf{x}_\star), \dots, k(\mathbf{x}_n, \mathbf{x}_\star)]^\top$. Notice there is no iterative optimization: KRR is one linear solve, which is part of why it is such a clean baseline. ### 3.2 Properties and cost KRR has no hyperparameter for sparsity, so unlike the SVM every training point contributes a nonzero $\alpha_i$. It is the predictive mean of Gaussian process regression with noise variance $\lambda$, which connects it to a full Bayesian treatment and lets you borrow GP intuition about kernels and length scales. The dominant cost is the $O(n^3)$ factorization of the $n \times n$ matrix $K + \lambda I$ and $O(n^2)$ memory to store $K$. This cubic scaling is the central obstacle. At a few thousand points it is a second or two; at a hundred thousand it is hours and tens of gigabytes; beyond that it is simply infeasible. Sections 4 and 5 exist entirely to break this barrier. ### 3.3 Choosing the regularizer and bandwidth Two knobs govern KRR: the regularization strength $\lambda$ (called `alpha` in scikit-learn) and the kernel bandwidth $\gamma$. Small bandwidth (large $\gamma$) produces a wiggly function that interpolates noise; large bandwidth (small $\gamma$) oversmooths toward a constant. Because KRR has a closed form, leave one out cross validation can be computed cheaply from a single factorization: the LOO residuals are $e_i / (1 - H_{ii})$ where $H = K(K + \lambda I)^{-1}$ is the smoother (hat) matrix. 
This makes principled tuning practical on moderate datasets without refitting. In a library setting you would wrap KRR in a grid search over $(\gamma, \lambda)$; the closed form keeps each fit cheap. ### 3.4 Failure modes KRR fails quietly in a few recognizable ways. An unstandardized input with one dominant scale makes a single $\gamma$ meaningless across dimensions, so always standardize before an RBF kernel. Too small a $\lambda$ with a small bandwidth interpolates the training noise and generalizes terribly even as training error vanishes. And the dense $O(n^2)$ Gram matrix silently exhausts memory well before the $O(n^3)$ solve becomes the bottleneck, which is the practical reason large problems demand the approximations below rather than a faster solver. ## 4. The Nystrom Approximation ### 4.1 Low rank from a landmark subset The Nystrom method attacks the $O(n^3)$ cost by approximating the Gram matrix with a low rank factorization built from a data dependent subsample. Choose $m \ll n$ landmark points (often a uniform or leverage score weighted sample of the training set). Let $K_{mm}$ be the kernel matrix among landmarks and $K_{nm}$ the kernel between all points and landmarks. The Nystrom low rank approximation is $$ K \approx \tilde{K} = K_{nm} \, K_{mm}^{+} \, K_{nm}^\top, $$ where $K_{mm}^{+}$ is the pseudoinverse. This reconstructs the full Gram matrix from an $m$ column sketch and has rank at most $m$. ### 4.2 Explicit features and solving Factoring $K_{mm}^{+} = L L^\top$ gives an explicit feature map $\mathbf{z}(\mathbf{x}) = L^\top \mathbf{k}_m(\mathbf{x})$ where $\mathbf{k}_m(\mathbf{x})$ collects the kernel values against the $m$ landmarks. This is exactly what scikit-learn's `Nystroem` transformer returns: it maps each input to an $m$ dimensional vector. 
Once the data live in this explicit space, downstream learning becomes linear, so KRR reduces to ordinary ridge regression on the transformed features and costs $O(nm^2 + m^3)$ time and $O(nm)$ memory. Using the Woodbury identity, the solution can be written directly in terms of the small matrices without ever forming the full $n \times n$ system.

The key design choice is $m$. Larger $m$ gives a more faithful approximation and better accuracy at higher cost; the right value is the smallest $m$ at which validation error plateaus, typically a few hundred to a couple thousand.

## 5. Random Fourier Features

### 5.1 Bochner's theorem

Random Fourier features (RFF) attack the same cost from a data independent angle: they approximate the kernel with an explicit, low dimensional feature map drawn without looking at the data. For shift invariant kernels, $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$, Bochner's theorem says the kernel is the Fourier transform of a nonnegative measure. After normalizing $k(\mathbf{0}) = 1$, that measure is a probability density $p(\boldsymbol{\omega})$:

$$
k(\mathbf{x} - \mathbf{x}') = \int_{\mathbb{R}^d} p(\boldsymbol{\omega}) \, e^{\,i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} \, d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}\!\left[ e^{\,i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} \right].
$$

For the Gaussian kernel, $p$ is itself a Gaussian, $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, 2\gamma I)$.

### 5.2 The Monte Carlo feature map

The expectation above is a population average that we estimate by sampling. Taking real parts and drawing $D$ frequencies $\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_D \sim p$ and phases $b_j \sim \mathrm{Uniform}(0, 2\pi)$, define

$$
\mathbf{z}(\mathbf{x}) = \sqrt{\tfrac{2}{D}} \, \big[\cos(\boldsymbol{\omega}_1^\top \mathbf{x} + b_1), \dots, \cos(\boldsymbol{\omega}_D^\top \mathbf{x} + b_D)\big]^\top.
$$

Then $\mathbf{z}(\mathbf{x})^\top \mathbf{z}(\mathbf{x}')$ is an unbiased estimator of $k(\mathbf{x}, \mathbf{x}')$, with approximation error that decays as $O(1/\sqrt{D})$, up to logarithmic factors, uniformly over a compact data domain. This is precisely what scikit-learn's `RBFSampler` computes. The kernel has been replaced by an explicit $D$ dimensional map drawn once, up front, independent of the labels and of the data distribution.

### 5.3 Linear cost after lifting

The payoff is the same as Nystrom: once each point is mapped to $\mathbf{z}(\mathbf{x}) \in \mathbb{R}^D$, any linear method applies directly. KRR becomes ordinary ridge regression on the lifted matrix $Z$, costing $O(nD^2 + D^3)$ time and $O(nD)$ memory instead of $O(n^3)$ and $O(n^2)$. For $D \ll n$ this is the difference between feasible and impossible, and because the features are independent of the data the transform is embarrassingly parallel and streams trivially. Variants such as orthogonal random features and structured (Fastfood) transforms reduce variance or accelerate the projection from $O(dD)$ toward $O(D \log d)$, and they are common choices at very large scale.

## 6. Choosing Among the Three

The three methods sit on a clear spectrum, and the panel below demonstrates all of them on the same data. Exact KRR is the accuracy gold standard and the right choice up to a few thousand points, where the cubic solve is still cheap and the closed form LOO makes tuning painless. Past that, you approximate.

Nystrom adapts to the spectrum of the data, so when the Gram matrix has rapidly decaying eigenvalues it achieves a given accuracy with far fewer features than RFF, whose error is governed by the worst case rather than the actual spectrum. Under favorable spectral decay, theory shows the Nystrom generalization error can improve to $O(1/m)$ when leverage scores guide sampling, beating the $O(1/\sqrt{D})$ rate of plain random features. The cost is a data dependent construction and sensitivity to landmark selection.
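The $O(1/\sqrt{D})$ Monte Carlo rate cited for plain random features is easy to observe directly. The sketch below is a NumPy illustration under assumed toy settings (standard normal inputs, `gamma = 0.5`, and hypothetical helpers `rbf_kernel` and `rff_features`); it draws the feature map from Section 5.2 at several values of $D$ and reports the root mean square error of the implied Gram matrix.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def rff_features(X, D, gamma, rng):
    """z(x) = sqrt(2/D) cos(W^T x + b), W ~ N(0, 2 gamma I), b ~ U(0, 2 pi)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
gamma = 0.5                               # illustrative bandwidth
X = rng.standard_normal((300, 5))
K = rbf_kernel(X, X, gamma)               # exact Gram matrix, for reference

errors = {}
for D in (100, 400, 1600, 6400):
    Z = rff_features(X, D, gamma, rng)
    errors[D] = np.sqrt(np.mean((Z @ Z.T - K) ** 2))
    # If the rate is 1/sqrt(D), error * sqrt(D) should stay roughly constant.
    print(f"D = {D:5d}  RMS error = {errors[D]:.4f}  "
          f"error * sqrt(D) = {errors[D] * np.sqrt(D):.3f}")
```

Quadrupling $D$ should roughly halve the error, which is the practical meaning of the rate: each extra digit of kernel accuracy costs a hundredfold increase in features.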
Random Fourier features, by contrast, are trivial to generate, label and distribution independent, embarrassingly parallel, and apply to any shift invariant kernel without touching the data, which makes them the natural fit for streaming and privacy constrained settings. A useful default is to prefer Nystrom when the kernel matrix is approximately low rank and random features when it is not, or when retaining landmarks is undesirable.

One subtlety ties the approximation level back to regularization: a coarser approximation (small $m$ or $D$) acts as implicit regularization, so the optimal $\lambda$ often shrinks as $m$ or $D$ grows. Tune them together rather than in isolation.

## 7. Library Demonstration

Rather than reimplement any of this, we lean on scikit-learn, the mature open source standard for classical machine learning in Python. It ships `KernelRidge` for the exact method and the `Nystroem` and `RBFSampler` transformers for the two approximations, all sharing the same RBF kernel and `gamma` so the comparison is apples to apples.

The running example is `make_friedman1`, a standard synthetic regression benchmark whose target

$$
y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon
$$

is a smooth nonlinear function of five informative features, padded here with five pure noise features. A linear model is hopeless on it, while an RBF kernel fits it well, so it cleanly exposes both the value of kernels and the accuracy cost of each approximation. Everything is seeded for reproducibility, and each model is wrapped in a `StandardScaler` pipeline because, as Section 3.4 warned, a single bandwidth only makes sense on standardized inputs.
::: {.panel-tabset}

## Python

```{python}
import time
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

SEED = 0

# Friedman #1: a smooth nonlinear surface with 5 informative features
# plus 5 pure-noise features. n is large enough that the exact O(n^3)
# solve is visibly slower than the O(n D^2) approximations.
X, y = make_friedman1(n_samples=6000, n_features=10, noise=1.0,
                      random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=SEED)

GAMMA = 0.1    # shared RBF bandwidth (1 / 2 sigma^2)
LAMBDA = 1.0   # ridge / KRR regularization (sklearn calls it alpha)
D = 300        # explicit features for Nystroem and RFF (m == D here)


def evaluate(name, model):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_s = time.perf_counter() - t0
    r2 = r2_score(y_te, model.predict(X_te))
    print(f"{name:<26} test R2 = {r2:6.3f}  fit = {fit_s:6.2f}s")
    return r2


print(f"train shape {X_tr.shape}, test shape {X_te.shape}\n")

# 1. Exact kernel ridge regression: the O(n^3) gold standard.
evaluate("Exact KernelRidge (RBF)", make_pipeline(
    StandardScaler(),
    KernelRidge(kernel="rbf", gamma=GAMMA, alpha=LAMBDA),
))

# 2. Nystroem: data-dependent landmark sketch, then linear ridge.
evaluate("Nystroem + Ridge", make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=GAMMA, n_components=D, random_state=SEED),
    Ridge(alpha=LAMBDA),
))

# 3. Random Fourier features: data-independent map, then linear ridge.
evaluate("RandomFourier + Ridge", make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=GAMMA, n_components=D, random_state=SEED),
    Ridge(alpha=LAMBDA),
))

# 4. Plain linear ridge: the baseline the kernel must beat.
evaluate("Linear Ridge (baseline)", make_pipeline(
    StandardScaler(), Ridge(alpha=LAMBDA)))

# Inspect the explicit feature map the approximation builds: an n x D
# matrix replaces the n x n Gram matrix the exact method would form.
scaler = StandardScaler().fit(X_tr)
Z = Nystroem(kernel="rbf", gamma=GAMMA, n_components=D,
             random_state=SEED).fit_transform(scaler.transform(X_tr))
print(f"\nNystroem feature matrix: {Z.shape} "
      f"(vs a {X_tr.shape[0]} x {X_tr.shape[0]} Gram matrix)")
print("first row, first 5 features:", np.round(Z[0, :5], 4))
```

The exact RBF kernel reaches the highest test $R^2$ but takes the longest to fit. Nystrom recovers almost all of that accuracy in a fraction of the time, and random Fourier features are faster still at a slightly larger accuracy cost. All three crush the linear baseline, whose $R^2$ near $0.7$ shows how much nonlinear signal the kernel captures. The printed $n \times D$ matrix is the whole point of the approximations: a tall, thin explicit feature matrix replaces the dense $n \times n$ Gram matrix, and from there it is just linear algebra.

## Julia

```julia
# Julia: KernelFunctions.jl builds the kernel; MLJ handles the data split.
# Equivalent exact kernel ridge regression on the same idea.
using MLJ, KernelFunctions, LinearAlgebra, Random

Random.seed!(0)

# A Friedman-1-style target (see the Python cell for the exact form).
n, d = 6000, 10
X = rand(n, d)
y = 10 .* sin.(pi .* X[:, 1] .* X[:, 2]) .+ 20 .* (X[:, 3] .- 0.5) .^ 2 .+
    10 .* X[:, 4] .+ 5 .* X[:, 5] .+ randn(n)

(Xtr, Xte), (ytr, yte) = partition((MLJ.table(X), y), 0.8, multi=true, rng=0)

# Exact kernel ridge: form the RBF Gram matrix and solve (K + lambda I) a = y.
# The Gram matrix is built from the same shuffled training rows as ytr.
gamma, lambda = 0.1, 1.0
k = with_lengthscale(SqExponentialKernel(), sqrt(1 / (2 * gamma)))
Ktr = kernelmatrix(k, RowVecs(MLJ.matrix(Xtr)))
alpha = (Ktr + lambda * I) \ ytr   # I is the identity from LinearAlgebra

# For Nystroem-style approximations at scale, see ApproximateGPs.jl, which
# provides inducing-point methods; check its docs for Fourier-feature support.
```

## Rust

```rust
// Rust: smartcore is a general-purpose ML crate. This sketch uses its
// RBF-kernel support vector regression (SVR) as the exact kernel regressor.
use smartcore::dataset::generator;
use smartcore::linalg::basic::matrix::DenseMatrix;
use smartcore::svm::svr::{SVR, SVRParameters};
use smartcore::svm::RBFKernel;

fn main() {
    // smartcore exposes RBF-kernel regression via SVR, which works on the
    // dense Gram matrix (exact, no low rank approximation). The module
    // paths, the data generator, and the builder calls below are
    // assumptions to check against your smartcore version, as is the
    // availability of a dedicated kernel ridge estimator.
    let data = generator::make_regression(6000, 10, 5, 1.0, 42);
    let x = DenseMatrix::from_2d_vec(&data.0).unwrap();
    let y = data.1;

    let model = SVR::fit(
        &x,
        &y,
        SVRParameters::default().with_kernel(RBFKernel::default().with_gamma(0.1)),
    )
    .unwrap();
    let _pred = model.predict(&x).unwrap();
}

// Honest note: as of mid-2026 no mainstream Rust crate ships Nystroem
// or random Fourier features out of the box. For large-scale kernels in
// Rust you implement the explicit feature map yourself (sample omega,
// apply cos) and feed it to smartcore's linear Ridge, or call sklearn
// across an FFI boundary. smartcore's exact RBF path is shown above.
```

:::

## 8. Practical Guidance

When data is small (up to a few thousand points), use exact KRR and tune $\lambda$ and the bandwidth by cross validation, exploiting the closed form LOO formula for cheap model selection. From roughly ten thousand to a million points, switch to Nystrom if the spectrum decays quickly or to random Fourier features otherwise, choosing $m$ or $D$ in the low hundreds to low thousands and increasing until validation error plateaus. Always standardize inputs before applying an RBF kernel so a single bandwidth is sensible across dimensions. Remember that the approximations interact with regularization: a coarser approximation acts as implicit regularization, so the optimal $\lambda$ often shrinks as $m$ or $D$ grows, and the two should be tuned jointly.
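Joint tuning can be as simple as a nested loop over the feature count and the regularizer, scored on a held-out split. The following NumPy sketch uses random Fourier features with a hand-written ridge solve; the toy target, the grids for `D` and `lam`, and the helper names `rff_features` and `ridge_fit` are all illustrative assumptions rather than recommended defaults.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier feature map sqrt(2/D) cos(X W + b) for fixed W, b."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def ridge_fit(Z, y, lam):
    """Solve the D x D system (Z^T Z + lam I) w = Z^T y."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

rng = np.random.default_rng(0)
n, d, gamma = 2000, 3, 0.5                 # illustrative toy values
X = rng.standard_normal((n, d))            # already on a standardized scale
y = np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

X_tr, X_va, y_tr, y_va = X[:1500], X[1500:], y[:1500], y[1500:]
y_mean = y_tr.mean()                       # ridge here has no intercept

results = {}
for D in (50, 200, 800):
    # One draw of frequencies and phases per D, shared across all lam values.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z_tr, Z_va = rff_features(X_tr, W, b), rff_features(X_va, W, b)
    for lam in (1e-3, 1e-1, 1e1):
        w = ridge_fit(Z_tr, y_tr - y_mean, lam)
        results[(D, lam)] = np.mean((Z_va @ w + y_mean - y_va) ** 2)

for (D, lam), mse in sorted(results.items()):
    print(f"D = {D:4d}  lambda = {lam:7.3f}  validation MSE = {mse:.4f}")
best = min(results, key=results.get)
print("best (D, lambda):", best)
```

Reading the printed grid row by row shows whether the best $\lambda$ moves as $D$ changes, which is the interaction the guidance above warns about.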
Finally, when an explicit feature map is available through either approximation, the entire kernel toolbox (regression, classification, clustering) reduces to fast linear algebra on the lifted features, which is the modern, library friendly way to deploy kernel methods at scale. Reach for the well tested transformers in scikit-learn (or the Julia and Rust analogues above) before writing any of this by hand.

## References

1. Scholkopf, B. and Smola, A. J. *Learning with Kernels*. MIT Press, 2002. https://mitpress.mit.edu/9780262536578/learning-with-kernels/
2. Scholkopf, B., Herbrich, R., and Smola, A. J. A Generalized Representer Theorem. *COLT*, 2001. https://doi.org/10.1007/3-540-44581-1_27
3. Rahimi, A. and Recht, B. Random Features for Large Scale Kernel Machines. *NeurIPS*, 2007. https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html
4. Williams, C. and Seeger, M. Using the Nystrom Method to Speed Up Kernel Machines. *NeurIPS*, 2001. https://papers.nips.cc/paper/2000/hash/19de10adbaa1b2ee13f77f679fa1483a-Abstract.html
5. Drineas, P. and Mahoney, M. W. On the Nystrom Method for Approximating a Gram Matrix. *JMLR*, 2005. https://www.jmlr.org/papers/v6/drineas05a.html
6. Rudi, A., Camoriano, R., and Rosasco, L. Less is More: Nystrom Computational Regularization. *NeurIPS*, 2015. https://papers.nips.cc/paper/2015/hash/03e0704b5690a2dee1861dc3ad3316c9-Abstract.html
7. Rasmussen, C. E. and Williams, C. K. I. *Gaussian Processes for Machine Learning*. MIT Press, 2006. https://gaussianprocess.org/gpml/
8. Le, Q., Sarlos, T., and Smola, A. Fastfood: Approximating Kernel Expansions in Loglinear Time. *ICML*, 2013. https://proceedings.mlr.press/v28/le13.html
9. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. *JMLR*, 2011. https://www.jmlr.org/papers/v12/pedregosa11a.html