119  Kernel Methods Beyond SVM

Support vector machines made kernels famous, but the kernel idea is far more general than max margin classification. Any learning algorithm that can be written purely in terms of inner products between data points can be lifted into a rich, possibly infinite dimensional feature space by replacing those inner products with a kernel function. This chapter develops the kernel machinery that surrounds and outlives the SVM: kernel ridge regression, kernel principal component analysis, the representer theorem that justifies the whole enterprise, and the scaling techniques (random Fourier features and the Nystrom approximation) that make kernels usable on large data.

119.1 1. Kernels and Feature Spaces

119.1.1 1.1 The kernel trick, restated

A kernel is a symmetric function \(k(\mathbf{x}, \mathbf{x}')\) that computes an inner product in some feature space. Concretely, there exists a feature map \(\phi: \mathcal{X} \to \mathcal{H}\) into a Hilbert space such that

\[ k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}. \]

The point of the trick is that we never form \(\phi(\mathbf{x})\) explicitly. The Gaussian (RBF) kernel

\[ k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2\sigma^2}\right) \]

corresponds to an infinite dimensional \(\mathcal{H}\), yet each evaluation costs only \(O(d)\) where \(d\) is the input dimension.

119.1.2 1.2 Positive definiteness and Mercer’s condition

Not every symmetric function is a valid kernel. The requirement is positive definiteness: for any finite set \(\{\mathbf{x}_1, \dots, \mathbf{x}_n\}\), the Gram matrix \(K\) with entries \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\) must be positive semidefinite, meaning \(\mathbf{c}^\top K \mathbf{c} \geq 0\) for all \(\mathbf{c} \in \mathbb{R}^n\). Mercer’s theorem guarantees that any continuous positive definite kernel admits an eigenfunction expansion \(k(\mathbf{x}, \mathbf{x}') = \sum_i \lambda_i \psi_i(\mathbf{x}) \psi_i(\mathbf{x}')\) with \(\lambda_i \geq 0\), which is exactly the feature map written out. Valid kernels are closed under addition, multiplication by a positive scalar, pointwise products, and composition with feature maps, which lets practitioners build complex kernels from simple parts.

119.1.3 1.3 The reproducing kernel Hilbert space

Each positive definite kernel induces a unique reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_k\), the closure of the span of functions \(\{k(\cdot, \mathbf{x})\}\) under the inner product \(\langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{x}') \rangle = k(\mathbf{x}, \mathbf{x}')\). The defining property is reproduction:

\[ f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle_{\mathcal{H}_k} \quad \text{for all } f \in \mathcal{H}_k. \]

The RKHS norm \(\lVert f \rVert_{\mathcal{H}_k}\) measures the smoothness of \(f\), and controlling it is how kernel methods regularize. This abstract setup pays off immediately in the representer theorem.

119.2 2. The Representer Theorem

119.2.1 2.1 Statement

Consider any regularized empirical risk problem over an RKHS:

\[ \min_{f \in \mathcal{H}_k} \; \sum_{i=1}^{n} L\big(y_i, f(\mathbf{x}_i)\big) + \Omega\big(\lVert f \rVert_{\mathcal{H}_k}\big), \]

where \(L\) is an arbitrary loss and \(\Omega\) is strictly increasing. The representer theorem states that any minimizer admits a finite expansion over the training points:

\[ f^\star(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \, k(\mathbf{x}_i, \mathbf{x}). \]

The infinite dimensional optimization collapses to finding \(n\) coefficients \(\alpha_i\).

119.2.2 2.2 Why it holds

Decompose any candidate \(f\) into a part inside the span \(\mathcal{S} = \mathrm{span}\{k(\cdot, \mathbf{x}_i)\}\) and an orthogonal remainder: \(f = f_{\parallel} + f_{\perp}\). By the reproducing property, \(f(\mathbf{x}_i) = \langle f, k(\cdot, \mathbf{x}_i)\rangle = \langle f_{\parallel}, k(\cdot, \mathbf{x}_i)\rangle\), so \(f_{\perp}\) does not affect any training prediction and hence does not affect the loss term. But \(\lVert f \rVert^2 = \lVert f_{\parallel} \rVert^2 + \lVert f_{\perp} \rVert^2\), so a nonzero \(f_{\perp}\) only inflates the regularizer. Since \(\Omega\) is increasing, the optimum must have \(f_{\perp} = 0\), which is precisely the claimed finite expansion. This single argument is what makes kernel ridge regression, kernel logistic regression, the SVM, and Gaussian process regression all tractable.

119.3 3. Kernel Ridge Regression

119.3.1 3.1 From ridge to kernel ridge

Ordinary ridge regression solves \(\min_{\mathbf{w}} \lVert \mathbf{y} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2\) with closed form \(\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}\). Replacing \(\mathbf{x}\) with \(\phi(\mathbf{x})\) and invoking the representer theorem, write \(f(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})\). Substituting into the squared loss with RKHS regularizer \(\lambda \lVert f \rVert^2 = \lambda \, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}\) yields

\[ \min_{\boldsymbol{\alpha}} \; \lVert \mathbf{y} - K\boldsymbol{\alpha} \rVert^2 + \lambda \, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}. \]

Setting the gradient to zero gives the elegant closed form

\[ \boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}, \qquad f(\mathbf{x}_\star) = \mathbf{k}_\star^\top (K + \lambda I)^{-1} \mathbf{y}, \]

where \(\mathbf{k}_\star = [k(\mathbf{x}_1, \mathbf{x}_\star), \dots, k(\mathbf{x}_n, \mathbf{x}_\star)]^\top\).

119.3.2 3.2 Properties and cost

Kernel ridge regression (KRR) has no hyperparameter for sparsity, so unlike the SVM every training point contributes a nonzero \(\alpha_i\). It is the predictive mean of Gaussian process regression with noise variance \(\lambda\), which connects it to a full Bayesian treatment. The dominant cost is the \(O(n^3)\) factorization of the \(n \times n\) matrix \(K + \lambda I\) and \(O(n^2)\) memory to store \(K\). This cubic scaling is the central obstacle that Sections 5 and 6 address.

# Kernel ridge regression, illustrative
K = rbf_kernel(X, X, gamma)          # n x n Gram matrix
alpha = solve(K + lam * eye(n), y)   # O(n^3)
k_star = rbf_kernel(X, X_test, gamma)
y_pred = k_star.T @ alpha

119.3.3 3.3 Choosing the regularizer and bandwidth

Two knobs govern KRR: the regularization strength \(\lambda\) and the kernel bandwidth (for the RBF, \(\sigma\) or \(\gamma = 1/(2\sigma^2)\)). Small \(\sigma\) produces a wiggly function that interpolates noise; large \(\sigma\) oversmooths. Because KRR has a closed form, leave one out cross validation can be computed cheaply from a single factorization: the LOO residuals are \(e_i / (1 - H_{ii})\) where \(H = K(K + \lambda I)^{-1}\) is the smoother matrix. This makes principled tuning practical on moderate datasets.

119.4 4. Kernel PCA

119.4.1 4.1 Centering in feature space

Principal component analysis finds directions of maximal variance. Kernel PCA performs the same operation on \(\phi(\mathbf{x})\), extracting nonlinear principal components. We never access \(\phi\) directly, so we work with the Gram matrix. First the features must be centered. The centered kernel matrix is

\[ \tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n, \]

where \(\mathbf{1}_n\) is the \(n \times n\) matrix with every entry \(1/n\). This implements \(\tilde{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \bar{\phi}\) entirely through inner products.

119.4.2 4.2 Eigenproblem and projections

The principal directions in feature space are \(\mathbf{v}_k = \sum_i a_i^{(k)} \tilde{\phi}(\mathbf{x}_i)\), and the coefficients come from the eigendecomposition of the centered Gram matrix:

\[ \tilde{K} \mathbf{a}^{(k)} = n \lambda_k \, \mathbf{a}^{(k)}. \]

After normalizing so that \(\lambda_k \lVert \mathbf{a}^{(k)} \rVert^2 = 1\), the projection of a new point \(\mathbf{x}_\star\) onto component \(k\) is

\[ z_k(\mathbf{x}_\star) = \sum_{i=1}^{n} a_i^{(k)} \, \tilde{k}(\mathbf{x}_i, \mathbf{x}_\star). \]

119.4.3 4.3 Uses and caveats

Kernel PCA is a workhorse for nonlinear dimensionality reduction, denoising, and as a preprocessing step that unfolds manifolds before a linear model. With an RBF kernel it can separate concentric clusters that linear PCA cannot. Two practical caveats deserve emphasis. First, there is no exact pre image: a point in feature space generally has no input that maps to it, so reconstruction in input space requires solving a separate optimization. Second, kernel PCA shares the \(O(n^3)\) eigendecomposition cost and \(O(n^2)\) memory of all dense kernel methods, which again motivates the approximations below.

119.5 5. Random Fourier Features

119.5.1 5.1 Bochner’s theorem

The scaling problem can be attacked by approximating the kernel with an explicit, low dimensional feature map. For shift invariant kernels, \(k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')\), Bochner’s theorem says the kernel is the Fourier transform of a nonnegative measure. After normalizing \(k(\mathbf{0}) = 1\), that measure is a probability density \(p(\boldsymbol{\omega})\):

\[ k(\mathbf{x} - \mathbf{x}') = \int_{\mathbb{R}^d} p(\boldsymbol{\omega}) \, e^{\,i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} \, d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}\!\left[ e^{\,i \boldsymbol{\omega}^\top (\mathbf{x} - \mathbf{x}')} \right]. \]

For the Gaussian kernel, \(p\) is itself a Gaussian, \(\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, \sigma^{-2} I)\).

119.5.2 5.2 The Monte Carlo feature map

The expectation above is a population average that we estimate by sampling. Taking real parts and drawing \(D\) frequencies \(\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_D \sim p\) and phases \(b_j \sim \mathrm{Uniform}(0, 2\pi)\), define

\[ \mathbf{z}(\mathbf{x}) = \sqrt{\tfrac{2}{D}} \, \big[\cos(\boldsymbol{\omega}_1^\top \mathbf{x} + b_1), \dots, \cos(\boldsymbol{\omega}_D^\top \mathbf{x} + b_D)\big]^\top. \]

Then \(\mathbf{z}(\mathbf{x})^\top \mathbf{z}(\mathbf{x}')\) is an unbiased estimator of \(k(\mathbf{x}, \mathbf{x}')\), with approximation error that decays as \(O(1/\sqrt{D})\) uniformly over the data domain. The kernel has been replaced by an explicit \(D\) dimensional map.

119.5.3 5.3 Linear cost after lifting

The payoff is enormous. Once each point is mapped to \(\mathbf{z}(\mathbf{x}) \in \mathbb{R}^D\), any linear method applies directly. Kernel ridge regression becomes ordinary ridge regression on \(Z\), costing \(O(nD^2 + D^3)\) time and \(O(nD)\) memory instead of \(O(n^3)\) and \(O(n^2)\). For \(D \ll n\) this is the difference between feasible and impossible.

# Random Fourier features for the RBF kernel
W = randn(d, D) / sigma          # frequencies ~ N(0, 1/sigma^2)
b = uniform(0, 2*pi, size=D)
Z = sqrt(2.0 / D) * cos(X @ W + b)
w = solve(Z.T @ Z + lam * eye(D), Z.T @ y)   # ridge on features

Variants such as orthogonal random features and structured (Fastfood) transforms reduce variance or accelerate the projection from \(O(dD)\) toward \(O(D \log d)\), and they are the default at scale.

119.6 6. The Nystrom Approximation

119.6.1 6.1 Low rank from a landmark subset

Random Fourier features are data independent: the frequencies are drawn without looking at the data. The Nystrom method takes the complementary, data dependent route. Choose \(m \ll n\) landmark points (often a uniform or leverage score weighted sample of the training set). Let \(K_{mm}\) be the kernel matrix among landmarks and \(K_{nm}\) the kernel between all points and landmarks. The Nystrom low rank approximation is

\[ K \approx \tilde{K} = K_{nm} \, K_{mm}^{+} \, K_{nm}^\top, \]

where \(K_{mm}^{+}\) is the pseudoinverse. This reconstructs the full Gram matrix from an \(m\) column sketch and has rank at most \(m\).

119.6.2 6.2 Explicit features and solving

Factoring \(K_{mm}^{+} = L L^\top\) gives an explicit feature map \(\mathbf{z}(\mathbf{x}) = L^\top \mathbf{k}_m(\mathbf{x})\) where \(\mathbf{k}_m(\mathbf{x})\) collects the kernel values against the \(m\) landmarks. As with random features, downstream learning becomes linear in this \(m\) dimensional space, so KRR costs \(O(nm^2 + m^3)\). Using the Woodbury identity, the KRR solution can be written directly in terms of the small matrices without ever forming the full \(n \times n\) system.

119.6.3 6.3 Nystrom versus random Fourier features

The two methods trade off along a clear axis. Nystrom adapts to the spectrum of the data, so when the Gram matrix has rapidly decaying eigenvalues it achieves a given accuracy with far fewer features than random Fourier features, whose error is governed by the worst case rather than the actual spectrum. Theory shows the Nystrom generalization error can improve to \(O(1/m)\) when leverage scores guide sampling, beating the \(O(1/\sqrt{D})\) rate of plain random features. The cost is a data dependent construction and sensitivity to landmark selection. Random Fourier features, by contrast, are trivial to generate, embarrassingly parallel, and apply to any shift invariant kernel without touching the data. A useful default is to prefer Nystrom when the kernel matrix is approximately low rank and random features when it is not or when streaming and privacy constraints discourage retaining landmarks.

119.7 7. Practical Guidance

When data is small (up to a few thousand points), use exact KRR or kernel PCA and tune \(\lambda\) and the bandwidth by cross validation, exploiting the closed form LOO formula. From roughly ten thousand to a million points, switch to Nystrom if the spectrum decays quickly or to random Fourier features otherwise, choosing \(m\) or \(D\) in the low hundreds to low thousands and increasing until validation error plateaus. Always standardize inputs before applying an RBF kernel so a single bandwidth is sensible across dimensions. Remember that these approximations interact with regularization: a coarser approximation acts as implicit regularization, so the optimal \(\lambda\) often shrinks as \(m\) or \(D\) grows. Finally, when an explicit feature map is available through either approximation, the entire kernel toolbox (regression, PCA, clustering, classification) reduces to fast linear algebra, which is the modern way to deploy kernel methods at scale.

119.8 References

  1. Scholkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, 2002. https://mitpress.mit.edu/9780262536578/learning-with-kernels/
  2. Scholkopf, B., Herbrich, R., and Smola, A. J. A Generalized Representer Theorem. COLT, 2001. https://link.springer.com/chapter/10.1007/3-540-44581-1_27
  3. Scholkopf, B., Smola, A., and Muller, K. R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 1998. https://direct.mit.edu/neco/article/10/5/1299/6193
  4. Rahimi, A. and Recht, B. Random Features for Large Scale Kernel Machines. NeurIPS, 2007. https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html
  5. Williams, C. and Seeger, M. Using the Nystrom Method to Speed Up Kernel Machines. NeurIPS, 2001. https://papers.nips.cc/paper/2000/hash/19de10adbaa1b2ee13f77f679fa1483a-Abstract.html
  6. Drineas, P. and Mahoney, M. W. On the Nystrom Method for Approximating a Gram Matrix. JMLR, 2005. https://www.jmlr.org/papers/v6/drineas05a.html
  7. Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. https://gaussianprocess.org/gpml/
  8. Le, Q., Sarlos, T., and Smola, A. Fastfood: Approximating Kernel Expansions in Loglinear Time. ICML, 2013. https://proceedings.mlr.press/v28/le13.html