147  Autoencoders for Dimensionality Reduction

Dimensionality reduction is the practice of mapping high dimensional data into a lower dimensional space while preserving as much of the structure that matters as possible. Classical methods such as principal component analysis (PCA) accomplish this with linear algebra. Autoencoders generalize the idea to nonlinear maps learned by neural networks. This chapter develops the encoder-decoder architecture, explains why the bottleneck forces compression, establishes the precise relationship between linear autoencoders and PCA, surveys undercomplete and regularized variants, and closes with a bridge to representation learning, where the learned code becomes the central object of interest rather than a byproduct of reconstruction.

147.1 1. The Encoder-Decoder Architecture

An autoencoder is a neural network trained to copy its input to its output. That sounds trivial until you constrain the network so that it cannot learn the identity function directly. The architecture is built from two maps. The encoder \(f_\theta : \mathbb{R}^d \to \mathbb{R}^k\) sends an input \(x\) to a code (also called the latent representation or the bottleneck activation) \(h = f_\theta(x)\). The decoder \(g_\phi : \mathbb{R}^k \to \mathbb{R}^d\) maps the code back to the input space, producing a reconstruction \(\hat{x} = g_\phi(h)\). The composed network computes

\[ \hat{x} = g_\phi(f_\theta(x)). \]

Training minimizes a reconstruction loss averaged over a dataset \(\{x^{(i)}\}_{i=1}^{n}\),

\[ \mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, g_\phi(f_\theta(x^{(i)}))\bigr). \]

The per-example loss \(\ell\) is chosen to match the data. For real valued inputs the squared error \(\ell(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2\) is standard, and it corresponds to a Gaussian observation model. For inputs in \([0,1]\), such as normalized pixel intensities, the binary cross entropy loss is common, corresponding to a Bernoulli model per coordinate. The parameters \(\theta\) and \(\phi\) are optimized jointly by stochastic gradient descent and backpropagation, exactly as in any feedforward network.

A minimal encoder and decoder can each be a single affine layer followed by a nonlinearity. A typical specification looks like the following.

# Single hidden layer (bottleneck) autoencoder, schematic
h     = activation(W_enc @ x + b_enc)   # encoder: R^d -> R^k
x_hat = W_dec @ h + b_dec               # decoder: R^k -> R^d
loss  = mean((x - x_hat) ** 2)          # reconstruction objective

The signal that drives learning is the gap between \(x\) and \(\hat{x}\). Because no labels are required, the autoencoder is an instance of self-supervised learning: the supervisory target is the input itself.

147.2 2. The Bottleneck and Why It Forces Structure

If the code dimension \(k\) is at least as large as the input dimension \(d\) and the network is expressive enough, the autoencoder can trivially learn the identity map, copying each coordinate straight through and reconstructing perfectly without learning anything useful. The whole enterprise depends on preventing this collapse. The most direct prevention is to make the code narrow, \(k < d\). This narrow layer is the bottleneck, and it is what gives dimensionality reduction its teeth.

When \(k < d\), the network cannot store every input exactly. To keep reconstruction error low across the dataset it must discover the directions of variation that are shared among examples and allocate its limited code budget to them. Coordinates that are noisy, redundant, or constant carry little reconstruction value and are discarded. The code \(h\) therefore concentrates the recurring, predictive structure of the data. This is the same intuition behind lossy compression: a good code throws away what can be regenerated and keeps what cannot.

Two informal information bounds make the point. The reconstruction is a function of the code alone, so by the data processing inequality the mutual information between input and reconstruction cannot exceed the information the code retains,

\[ I(x; \hat{x}) \le I(x; h). \]

At the same time the code is a deterministic function of \(k\) real numbers, so it cannot carry more than \(k\) effective degrees of freedom. Low reconstruction error therefore requires that the \(k\) dimensional code capture the dominant variation in \(x\). The bottleneck is not the only way to constrain an autoencoder, as Section 5 shows, but it is the most interpretable, because the width \(k\) is the literal target dimension of the reduction.

147.3 3. Undercomplete Autoencoders

An autoencoder whose code dimension is smaller than its input dimension, \(k < d\), is called undercomplete. The undercomplete autoencoder is the canonical tool for nonlinear dimensionality reduction. After training, the encoder \(f_\theta\) alone is used to embed data: each input \(x\) yields a \(k\) dimensional vector \(h\) that can feed a downstream classifier, a clustering routine, a nearest neighbor index, or a two dimensional or three dimensional visualization when \(k\) is small.

The expressive power of the encoder and decoder controls the geometry that can be captured. With linear maps and a squared error loss the undercomplete autoencoder recovers a linear subspace, as the next section makes precise. With nonlinear activations and multiple layers the encoder can fold, bend, and unroll the data, fitting a curved low dimensional manifold embedded in the high dimensional space. Data such as natural images, sensor readings, or word counts often lie near such a manifold rather than on a flat plane, which is exactly where nonlinear autoencoders outperform linear projections.

Depth matters. A single hidden layer limits the encoder to a shallow transformation, whereas stacking layers lets the network compose simple operations into the highly nonlinear coordinate change a complicated manifold demands. A practical recipe is a symmetric pair of multilayer perceptrons whose widths taper from \(d\) down to \(k\) in the encoder and expand back from \(k\) to \(d\) in the decoder.

# Encoder widths taper to the bottleneck k; decoder mirrors them
encoder: d -> 512 -> 128 -> k      # nonlinear layers between sizes
decoder: k -> 128 -> 512 -> d      # symmetric expansion back to input

The width \(k\) trades reconstruction fidelity against compression. Small \(k\) yields aggressive reduction and forces the model to be selective, but discards more detail. Large \(k\) preserves detail but offers weaker reduction and risks approaching the identity map. Choosing \(k\) is an empirical decision, often guided by a reconstruction error versus dimension curve analogous to the scree plot used for PCA.

147.4 4. The Relationship to PCA

The cleanest theoretical anchor for autoencoders is their relationship to principal component analysis. Consider a linear autoencoder with no nonlinear activations: the encoder is \(h = W_e x\) and the decoder is \(\hat{x} = W_d h\), with \(W_e \in \mathbb{R}^{k \times d}\) and \(W_d \in \mathbb{R}^{d \times k}\). Assume the data are mean centered. The training objective with squared error is

\[ \min_{W_e, W_d} \; \frac{1}{n} \sum_{i=1}^{n} \bigl\lVert x^{(i)} - W_d W_e x^{(i)} \bigr\rVert_2^2 . \]

Write \(M = W_d W_e\). Because \(W_e\) has rank at most \(k\) and \(W_d\) has rank at most \(k\), the product \(M\) has rank at most \(k\). The problem reduces to finding the best rank \(k\) linear map that reconstructs the data, which is precisely the problem solved by PCA. The Eckart-Young theorem states that the optimal rank \(k\) reconstruction projects each point onto the subspace spanned by the top \(k\) eigenvectors of the empirical covariance matrix \(\Sigma = \frac{1}{n} \sum_i x^{(i)} {x^{(i)}}^\top\). At the optimum the linear autoencoder spans the same subspace that PCA finds, so it achieves the same minimal reconstruction error.

There is one important caveat about uniqueness. PCA returns an orthonormal basis whose components are ordered by decreasing variance and are mutually uncorrelated. A linear autoencoder is only required to span the correct \(k\) dimensional subspace; its solution is determined only up to an invertible \(k \times k\) change of basis, since \(W_d W_e = (W_d A^{-1})(A W_e)\) for any invertible \(A\). The encoder weights need not be orthonormal and the code coordinates need not be ordered or uncorrelated. The two methods agree on the optimal subspace and on the optimal reconstruction error; they need not agree on the particular coordinates within that subspace. Tied weights (\(W_d = W_e^\top\)) and an orthogonality penalty can push a linear autoencoder toward the PCA basis itself, but plain training does not guarantee it.

This equivalence is what justifies thinking of autoencoders as a strict generalization of PCA. Replace the linear maps with deep nonlinear networks and the model can represent reductions that no linear projection can express. The price is the loss of the clean closed form solution: where PCA reduces to an eigendecomposition with a global optimum, a nonlinear autoencoder is trained by nonconvex optimization with no guarantee of reaching the best possible code. In practice the gain in representational flexibility usually outweighs this loss for data that is genuinely nonlinear.

147.5 5. Regularized Autoencoders

Constraining the code width is not the only way to prevent an autoencoder from learning the identity. Regularized autoencoders keep a code that may be as wide as the input, or even wider (the overcomplete case), but add a penalty or a structural device that forces the code to capture useful structure rather than copy the input. These methods broaden the design space and often produce more useful codes than a bare bottleneck.

147.5.1 5.1 Sparse Autoencoders

A sparse autoencoder augments the reconstruction loss with a penalty that drives most code activations toward zero for any given input, so each input is explained by a small active subset of code units. The objective becomes

\[ \mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, \hat{x}^{(i)}\bigr) + \lambda \, \Omega(h), \]

where \(\Omega(h)\) measures activation magnitude. A common choice is an \(L_1\) penalty \(\Omega(h) = \lVert h \rVert_1\), which yields sparse codes much as the lasso yields sparse coefficients. An alternative penalizes the deviation of each unit’s average activation from a small target rate using a Kullback-Leibler term. Sparsity lets the code dimension \(k\) exceed \(d\) while still being informative, because the effective number of active units per input stays small. Such overcomplete sparse codes are a recurring theme in feature learning and in modern work on interpretable features inside large models.

147.5.2 5.2 Denoising Autoencoders

A denoising autoencoder corrupts each input before encoding and asks the network to reconstruct the clean original. Let \(\tilde{x}\) be a corrupted version of \(x\) drawn from a noise process \(C(\tilde{x} \mid x)\), for example additive Gaussian noise or randomly zeroed coordinates. The model minimizes

\[ \mathcal{L} = \mathbb{E}_{x} \, \mathbb{E}_{\tilde{x} \sim C(\cdot \mid x)} \bigl[ \ell\bigl(x, g_\phi(f_\theta(\tilde{x}))\bigr) \bigr]. \]

Because the network never sees the clean input at its own input layer, copying is impossible; it must learn how the data is structured well enough to undo the corruption. The reconstruction map is pushed to project corrupted points back toward the data manifold, and the learned code captures the manifold’s shape. Denoising also has an elegant interpretation: under small Gaussian corruption the trained reconstruction approximates the direction of the gradient of the log density of the data, linking the autoencoder to score based generative models.

147.5.3 5.3 Contractive Autoencoders

A contractive autoencoder penalizes the sensitivity of the code to small input changes by adding the squared Frobenius norm of the encoder Jacobian to the loss,

\[ \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, \hat{x}^{(i)}\bigr) + \lambda \left\lVert \frac{\partial f_\theta}{\partial x} \Big|_{x^{(i)}} \right\rVert_F^2 . \]

This term encourages the encoder to be locally flat, contracting most directions in input space so that the code changes only along the directions in which the data actually varies. The result is a representation that is robust to small perturbations and aligned with the local manifold tangent. Denoising and contractive autoencoders are closely related: both make the code insensitive to off manifold variation, one by injecting noise and one by penalizing the Jacobian directly.

147.6 6. A Bridge to Representation Learning

Up to this point reconstruction has been the objective and the code a means to that end. Representation learning inverts the emphasis. The reconstruction loss becomes a pretext task whose real purpose is to produce a code \(h\) that is useful for problems the network was never directly trained on. A good representation makes downstream tasks easier: it disentangles factors of variation, it is robust to nuisance perturbations, it places semantically similar inputs near one another, and it requires few labeled examples to fit a downstream model on top.

Autoencoders were among the first practical tools for unsupervised representation learning. Stacked autoencoders, trained greedily layer by layer, were used in the late 2000s to initialize deep networks before purely supervised training with better initialization and optimization made such pretraining unnecessary for many tasks. The core idea survived and matured: learn features by setting up a self-supervised objective on unlabeled data, then transfer those features.

The variational autoencoder reframes the bottleneck probabilistically. Instead of a deterministic code, the encoder outputs the parameters of a distribution \(q(h \mid x)\) over latent variables, and training maximizes a lower bound on the data likelihood that combines a reconstruction term with a term pulling \(q(h \mid x)\) toward a prior \(p(h)\). This yields a smooth, generative latent space from which new samples can be drawn, and it makes explicit the link between dimensionality reduction and density modeling.

Modern self-supervised representation learning has largely moved beyond pixel level reconstruction. Contrastive methods learn codes by pulling together representations of different views of the same input and pushing apart representations of different inputs, and masked prediction methods, which reconstruct deliberately hidden portions of the input, power the pretraining of large language models and masked image models. These approaches abandon full input reconstruction yet inherit the autoencoder’s defining move: define a self-supervised objective whose solution requires a representation that captures the structure of the data. The encoder-decoder skeleton, the bottleneck that forces selectivity, and the use of a learned code as a transferable feature are the durable ideas, and they remain the conceptual foundation on which contemporary representation learning is built.

147.7 7. Practical Considerations

Several choices determine whether an autoencoder produces a useful reduction. Inputs should be scaled or normalized so that no coordinate dominates the squared error purely because of its units. The bottleneck width \(k\) should be selected by sweeping a range and inspecting reconstruction error together with downstream task performance, not reconstruction alone, because the lowest reconstruction error often corresponds to a code that has merely memorized rather than generalized. Symmetric encoder and decoder depths are a sensible default, and weight tying can reduce parameters and nudge linear models toward the PCA solution. When the goal is a clean low dimensional embedding for analysis, a regularized variant such as a denoising or contractive autoencoder usually yields a more stable and more transferable code than an unregularized bottleneck of the same width. Finally, always compare against PCA as a baseline: if a linear projection reaches comparable reconstruction and downstream accuracy, the simpler, deterministic, and globally optimal method is the better engineering choice, and the autoencoder earns its place only when the data’s nonlinearity makes the extra capacity pay off.

147.8 References

  1. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 14: Autoencoders. MIT Press, 2016. https://www.deeplearningbook.org/contents/autoencoders.html
  2. Hinton, G. E., and Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504-507, 2006. https://www.science.org/doi/10.1126/science.1127647
  3. Baldi, P., and Hornik, K. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks, 2(1), 53-58, 1989. https://doi.org/10.1016/0893-6080(89)90014-2
  4. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. A. Stacked Denoising Autoencoders. Journal of Machine Learning Research, 11, 3371-3408, 2010. https://www.jmlr.org/papers/v11/vincent10a.html
  5. Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. ICML, 2011. https://icml.cc/2011/papers/455_icmlpaper.pdf
  6. Kingma, D. P., and Welling, M. Auto-Encoding Variational Bayes. ICLR, 2014. https://arxiv.org/abs/1312.6114
  7. Bengio, Y., Courville, A., and Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828, 2013. https://arxiv.org/abs/1206.5538
  8. He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. Masked Autoencoders Are Scalable Vision Learners. CVPR, 2022. https://arxiv.org/abs/2111.06377