142  Kernel PCA

Principal component analysis (PCA) is the workhorse of linear dimensionality reduction. It finds an orthogonal basis that captures the directions of maximal variance in a dataset, and it does so by an eigendecomposition of the sample covariance matrix. Yet many datasets of practical interest live on curved, nonlinear manifolds. A spiral, a Swiss roll, or the set of all rotated images of a single object cannot be summarized by linear projections without severe distortion. Kernel PCA, introduced by Scholkopf, Smola, and Muller in 1998, extends PCA to nonlinear structure by performing ordinary PCA in a high dimensional feature space that is never explicitly constructed. The mechanism that makes this tractable is the kernel trick. This chapter develops the method from first principles, derives the centered kernel matrix, shows how to project previously unseen points, and confronts the pre-image problem that arises when one wishes to map results back to the input space.

142.1 1. From Linear PCA to Feature Space

Let \(\{\mathbf{x}_1, \dots, \mathbf{x}_n\}\) be a dataset with \(\mathbf{x}_i \in \mathbb{R}^d\). Ordinary PCA assumes the data are centered, \(\sum_i \mathbf{x}_i = \mathbf{0}\), and diagonalizes the sample covariance matrix

\[ C = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top . \]

The principal directions are the eigenvectors \(\mathbf{v}\) of \(C\) satisfying \(C\mathbf{v} = \lambda \mathbf{v}\) with \(\lambda \geq 0\). Projecting a point onto the leading eigenvectors yields a low dimensional representation that preserves variance.

The key idea of Kernel PCA is to map each input through a nonlinear feature map \(\phi: \mathbb{R}^d \to \mathcal{H}\), where \(\mathcal{H}\) is a possibly infinite dimensional Hilbert space, and then to run PCA on the images \(\phi(\mathbf{x}_i)\). A linear direction in \(\mathcal{H}\) corresponds to a nonlinear direction in the original input space, so linear PCA in \(\mathcal{H}\) recovers nonlinear structure in \(\mathbb{R}^d\).

Assume for the moment that the mapped data are centered in feature space, \(\sum_i \phi(\mathbf{x}_i) = \mathbf{0}\). We will relax this assumption in Section 4. The covariance operator in feature space is

\[ \bar{C} = \frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i)\, \phi(\mathbf{x}_i)^\top, \]

and we seek eigenvectors \(\mathbf{V}\) with \(\bar{C}\mathbf{V} = \lambda \mathbf{V}\). The obstacle is immediate. If \(\mathcal{H}\) has dimension in the millions or is infinite, we cannot form \(\bar{C}\), let alone diagonalize it. The kernel trick removes this obstacle.

142.2 2. The Kernel Trick and the Dual Formulation

A kernel function \(k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle\) computes the inner product of two feature vectors without ever evaluating \(\phi\) explicitly. By Mercer’s theorem, any symmetric positive semidefinite kernel corresponds to such an inner product in some feature space. Common choices include the polynomial kernel \(k(\mathbf{x},\mathbf{y}) = (\mathbf{x}^\top \mathbf{y} + c)^p\) and the Gaussian radial basis function (RBF) kernel

\[ k(\mathbf{x}, \mathbf{y}) = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{y} \rVert^2}{2\sigma^2} \right), \]

whose feature space is infinite dimensional.

To exploit the kernel, we observe that every eigenvector of \(\bar{C}\) with nonzero eigenvalue lies in the span of the mapped data. From the eigenvalue equation,

\[ \mathbf{V} = \frac{1}{n\lambda} \sum_{i=1}^{n} \phi(\mathbf{x}_i)\, \langle \phi(\mathbf{x}_i), \mathbf{V} \rangle, \]

so there exist coefficients \(\alpha_i\) such that

\[ \mathbf{V} = \sum_{i=1}^{n} \alpha_i\, \phi(\mathbf{x}_i) . \]

This is the representer property in disguise. We have reduced the unknown from a vector in \(\mathcal{H}\) to a vector \(\boldsymbol{\alpha} \in \mathbb{R}^n\).

Substitute this expansion into \(\bar{C}\mathbf{V} = \lambda \mathbf{V}\) and project both sides onto each \(\phi(\mathbf{x}_\ell)\) by taking inner products. Define the kernel matrix, also called the Gram matrix, with entries \(K_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = k(\mathbf{x}_i, \mathbf{x}_j)\). After algebra the eigenproblem collapses to

\[ K^2 \boldsymbol{\alpha} = n \lambda\, K \boldsymbol{\alpha} . \]

Because \(K\) is positive semidefinite, the solutions of interest coincide with those of the simpler problem

\[ K \boldsymbol{\alpha} = n \lambda\, \boldsymbol{\alpha} . \]

Thus the eigenvectors of the \(n \times n\) kernel matrix \(K\) give the coefficient vectors \(\boldsymbol{\alpha}\), and the eigenvalues are \(n\lambda\). The dimensionality of the problem is now \(n\), the number of data points, rather than the dimension of \(\mathcal{H}\). This is the central computational victory of Kernel PCA.

142.3 3. Normalization of the Eigenvectors

The eigenvectors \(\mathbf{V}_k\) in feature space must be unit normalized, \(\langle \mathbf{V}_k, \mathbf{V}_k \rangle = 1\), for the projections to be properly scaled. Expanding,

\[ \langle \mathbf{V}_k, \mathbf{V}_k \rangle = \sum_{i,j} \alpha_i^{k} \alpha_j^{k} \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = (\boldsymbol{\alpha}^k)^\top K \boldsymbol{\alpha}^k = n \lambda_k\, (\boldsymbol{\alpha}^k)^\top \boldsymbol{\alpha}^k, \]

where the last step uses \(K\boldsymbol{\alpha}^k = n\lambda_k \boldsymbol{\alpha}^k\). Requiring this to equal one means we rescale each raw eigenvector \(\boldsymbol{\alpha}^k\) so that

\[ (\boldsymbol{\alpha}^k)^\top \boldsymbol{\alpha}^k = \frac{1}{n \lambda_k}. \]

In practice, if a numerical routine returns eigenvectors of \(K\) with unit Euclidean norm, we divide each by \(\sqrt{n\lambda_k}\), equivalently by \(\sqrt{\mu_k}\) where \(\mu_k = n\lambda_k\) is the eigenvalue actually returned for \(K\). This normalization is easy to omit and a frequent source of bugs.

142.4 4. The Centered Kernel Matrix

The derivation above assumed the mapped data are centered in feature space. In general they are not, and centering cannot be done directly because we never compute \(\phi(\mathbf{x}_i)\). We must center implicitly through the kernel.

Define the centered feature map

\[ \tilde{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \frac{1}{n} \sum_{m=1}^{n} \phi(\mathbf{x}_m). \]

The centered kernel matrix has entries \(\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle\). Expanding the inner product term by term,

\[ \tilde{K}_{ij} = K_{ij} - \frac{1}{n}\sum_{m} K_{im} - \frac{1}{n}\sum_{m} K_{mj} + \frac{1}{n^2}\sum_{m,\ell} K_{m\ell}. \]

This admits a clean matrix form. Let \(\mathbf{1}_n\) be the \(n \times n\) matrix every entry of which equals \(1/n\), and let \(I\) be the identity. With \(H = I - \mathbf{1}_n\), the centering matrix, we obtain

\[ \tilde{K} = H K H = \left( I - \mathbf{1}_n \right) K \left( I - \mathbf{1}_n \right). \]

The matrix \(H\) is symmetric and idempotent, \(H^2 = H\), and projects onto the subspace orthogonal to the all ones vector. Kernel PCA proceeds by diagonalizing \(\tilde{K}\) rather than \(K\). The following snippet expresses the construction.

# K is the n-by-n uncentered Gram matrix
one_n = np.ones((n, n)) / n
K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n
eigvals, eigvecs = np.linalg.eigh(K_tilde)
# sort descending, normalize each alpha by 1 / sqrt(eigval)

All subsequent steps, eigendecomposition and normalization, use \(\tilde{K}\) in place of \(K\).

142.5 5. Projecting Data Points

Once the normalized coefficient vectors \(\boldsymbol{\alpha}^k\) are in hand, the projection of a training point \(\phi(\mathbf{x}_i)\) onto the \(k\)th principal component is the inner product

\[ \beta_i^{k} = \langle \mathbf{V}_k, \tilde{\phi}(\mathbf{x}_i) \rangle = \sum_{j=1}^{n} \alpha_j^{k} \langle \tilde{\phi}(\mathbf{x}_j), \tilde{\phi}(\mathbf{x}_i) \rangle = \sum_{j=1}^{n} \alpha_j^{k} \tilde{K}_{ji} . \]

In matrix form the matrix of projections is simply \(\tilde{K}\) times the matrix of coefficient vectors. Because \(\tilde{K}\boldsymbol{\alpha}^k = \mu_k \boldsymbol{\alpha}^k\), the projections of the training set are \(\beta_i^k = \mu_k \alpha_i^k\), a useful shortcut. The number of usable components is at most \(n - 1\), since centering removes one degree of freedom, and in practice many fewer because the trailing eigenvalues are negligible.

142.6 6. Projecting New Points Out of Sample

A new test point \(\mathbf{z}\) that was not part of the training set must be projected using only kernel evaluations against the training data. Its projection onto component \(k\) is

\[ \beta^{k}(\mathbf{z}) = \langle \mathbf{V}_k, \tilde{\phi}(\mathbf{z}) \rangle = \sum_{j=1}^{n} \alpha_j^{k} \big\langle \tilde{\phi}(\mathbf{x}_j), \tilde{\phi}(\mathbf{z}) \big\rangle . \]

The centered inner product between a training image and the test image expands to

\[ \big\langle \tilde{\phi}(\mathbf{x}_j), \tilde{\phi}(\mathbf{z}) \big\rangle = k(\mathbf{x}_j, \mathbf{z}) - \frac{1}{n}\sum_{m} k(\mathbf{x}_m, \mathbf{z}) - \frac{1}{n}\sum_{m} k(\mathbf{x}_j, \mathbf{x}_m) + \frac{1}{n^2}\sum_{m,\ell} k(\mathbf{x}_m, \mathbf{x}_\ell). \]

Collecting the vector \(\mathbf{k}_z\) with entries \((\mathbf{k}_z)_j = k(\mathbf{x}_j, \mathbf{z})\), the centered test kernel vector is

\[ \tilde{\mathbf{k}}_z = \mathbf{k}_z - \mathbf{1}_n \mathbf{k}_z - K \mathbf{1}_n^{\text{col}} + \mathbf{1}_n K \mathbf{1}_n^{\text{col}}, \]

where \(\mathbf{1}_n^{\text{col}}\) denotes the length \(n\) vector of entries \(1/n\). The projection is then \(\beta^k(\mathbf{z}) = (\boldsymbol{\alpha}^k)^\top \tilde{\mathbf{k}}_z\). Note that the centering of a test point reuses the training statistics, the row and column means of the training kernel, and never recomputes them from the test point alone. This guarantees that the same affine offset is applied consistently across train and test.

142.7 7. The Pre-Image Problem

Kernel PCA produces points in feature space, but for tasks such as denoising, visualization of reconstructions, or compression, we want a point \(\mathbf{z} \in \mathbb{R}^d\) in the original input space whose image \(\phi(\mathbf{z})\) approximates the projection. Finding such a \(\mathbf{z}\) is the pre-image problem.

The difficulty is fundamental. The feature map \(\phi\) is generally not surjective, so the projection of \(\tilde\phi(\mathbf{x})\) onto the leading principal components, call it \(P\tilde\phi(\mathbf{x})\), usually lies outside the image of \(\phi\). There may be no exact pre-image at all. We therefore seek an approximate pre-image by minimizing the feature space reconstruction error

\[ \mathbf{z}^\star = \arg\min_{\mathbf{z}} \big\lVert \phi(\mathbf{z}) - P\tilde\phi(\mathbf{x}) \big\rVert^2 . \]

Expanding the squared norm and discarding terms independent of \(\mathbf{z}\) leaves

\[ \mathbf{z}^\star = \arg\min_{\mathbf{z}} \Big( k(\mathbf{z}, \mathbf{z}) - 2 \sum_{i} \gamma_i\, k(\mathbf{z}, \mathbf{x}_i) \Big), \]

where the coefficients \(\gamma_i\) collect the contributions of the projection \(P\tilde\phi(\mathbf{x})\) expressed in the basis of mapped training points. This is a nonconvex optimization in \(\mathbf{z}\) and in general has no closed form.

For the Gaussian RBF kernel, where \(k(\mathbf{z},\mathbf{z}) = 1\) is constant, Mika and colleagues derived an elegant fixed point iteration. Setting the gradient to zero yields

\[ \mathbf{z}_{t+1} = \frac{\sum_{i} \gamma_i \exp\!\big(-\lVert \mathbf{z}_t - \mathbf{x}_i \rVert^2 / 2\sigma^2\big)\, \mathbf{x}_i}{\sum_{i} \gamma_i \exp\!\big(-\lVert \mathbf{z}_t - \mathbf{x}_i \rVert^2 / 2\sigma^2\big)} . \]

Each update places \(\mathbf{z}\) at a weighted average of the training points, with weights that depend on the current estimate. The iteration is a form of weighted mean shift. It is sensitive to initialization, can converge to poor local optima, and can fail when the denominator approaches zero. A common initialization is the input point itself or the mean of its nearest neighbors.

def rbf_preimage(z0, X, gamma, sigma, iters=50):
    z = z0.copy()
    for _ in range(iters):
        d = np.sum((X - z)**2, axis=1)
        w = gamma * np.exp(-d / (2 * sigma**2))
        s = w.sum()
        if abs(s) < 1e-12:
            break
        z = (w[:, None] * X).sum(axis=0) / s
    return z

Because of these instabilities, alternative pre-image methods have been developed. Kwok and Tsang proposed a method that exploits the relationship between feature space distances and input space distances, reconstructing \(\mathbf{z}\) from its distances to nearby training points by multidimensional scaling, which avoids iterative optimization and tends to be more robust. Other approaches learn a parametric inverse map directly, for example with a neural network trained to regress input points from their kernel PCA codes.

142.8 8. Practical Considerations

Several issues govern whether Kernel PCA succeeds in practice. The choice of kernel and its hyperparameters, such as the RBF bandwidth \(\sigma\), controls the geometry of the feature space and strongly affects results. Too small a bandwidth makes every point nearly orthogonal to every other, so the kernel matrix approaches the identity and no structure emerges; too large a bandwidth makes the map nearly linear and Kernel PCA degenerates toward ordinary PCA. Cross validation against a downstream objective is the usual remedy.

Computationally, the eigendecomposition of the \(n \times n\) matrix \(\tilde{K}\) costs \(O(n^3)\) time and \(O(n^2)\) memory, which becomes prohibitive beyond tens of thousands of points. Scalable variants address this through low rank approximations such as the Nystrom method, which samples a subset of columns of \(K\) and reconstructs an approximate spectrum, or through incremental and randomized eigensolvers. These reduce both storage and runtime at the cost of approximation error.

Finally, interpretation differs from linear PCA. The components live in feature space, so the loadings do not correspond to interpretable directions in input space, and reconstruction requires solving the pre-image problem. When the goal is denoising, the pipeline projects a noisy input onto the leading kernel principal components and then recovers a pre-image, with the projection acting as a nonlinear filter that suppresses variation along low variance directions of the manifold.

142.9 9. Summary

Kernel PCA generalizes principal component analysis to nonlinear structure by performing linear PCA in an implicit feature space. The kernel trick reduces an intractable eigenproblem in \(\mathcal{H}\) to the diagonalization of an \(n \times n\) kernel matrix. Centering must be performed implicitly through the transformation \(\tilde{K} = HKH\), eigenvectors must be normalized so that \((\boldsymbol{\alpha}^k)^\top \boldsymbol{\alpha}^k = 1/\mu_k\), and out of sample points are projected by centering their kernel vector against the stored training statistics. Mapping results back to input space requires solving the pre-image problem, a nonconvex task with iterative and distance based approximations. Together these ingredients make Kernel PCA a flexible tool for nonlinear dimensionality reduction, manifold learning, and denoising.

142.10 References

  1. Scholkopf, B., Smola, A., and Muller, K.-R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5), 1299 to 1319, 1998. https://doi.org/10.1162/089976698300017467
  2. Scholkopf, B. and Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002. https://mitpress.mit.edu/9780262536578/learning-with-kernels/
  3. Mika, S., Scholkopf, B., Smola, A., Muller, K.-R., Scholz, M., and Ratsch, G. Kernel PCA and De-Noising in Feature Spaces. Advances in Neural Information Processing Systems 11, 1999. https://proceedings.neurips.cc/paper/1998/hash/226d1f15ecd35f784d2a20c3ecf56d7f-Abstract.html
  4. Kwok, J. T. and Tsang, I. W. The Pre-Image Problem in Kernel Methods. IEEE Transactions on Neural Networks, 15(6), 1517 to 1525, 2004. https://doi.org/10.1109/TNN.2004.837781
  5. Bakir, G. H., Weston, J., and Scholkopf, B. Learning to Find Pre-Images. Advances in Neural Information Processing Systems 16, 2004. https://proceedings.neurips.cc/paper/2003/hash/ac1ad983e08ad3304a97e147f522747e-Abstract.html
  6. Williams, C. K. I. and Seeger, M. Using the Nystrom Method to Speed Up Kernel Machines. Advances in Neural Information Processing Systems 13, 2001. https://proceedings.neurips.cc/paper/2000/hash/19de10adbaa1b2ee13f77f679fa1483a-Abstract.html
  7. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825 to 2830, 2011. https://scikit-learn.org/stable/modules/decomposition.html#kernel-principal-component-analysis-kpca