147 Autoencoders for Dimensionality Reduction

Dimensionality reduction is the practice of mapping high dimensional data into a lower dimensional space while preserving as much of the structure that matters as possible. Classical methods such as principal component analysis (PCA) accomplish this with linear algebra. Autoencoders generalize the idea to nonlinear maps learned by neural networks. This chapter develops the encoder-decoder architecture, explains why the bottleneck forces compression, establishes the precise relationship between linear autoencoders and PCA, surveys undercomplete and regularized variants, analyzes cost and failure modes, and closes with a bridge to representation learning, where the learned code becomes the central object of interest rather than a byproduct of reconstruction. A worked example near the end trains a small autoencoder with the mature open-source PyTorch library and pits it head to head against PCA on a curved manifold, with parallel Julia (Flux.jl) and Rust (candle) versions.

147.1 1. The Encoder-Decoder Architecture

An autoencoder is a neural network trained to copy its input to its output. That sounds trivial until you constrain the network so that it cannot learn the identity function directly. The architecture is built from two maps. The encoder $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$ sends an input $x$ to a code (also called the latent representation or the bottleneck activation) $h = f_\theta(x)$. The decoder $g_\phi : \mathbb{R}^k \to \mathbb{R}^d$ maps the code back to the input space, producing a reconstruction $\hat{x} = g_\phi(h)$. The composed network computes

\[ \hat{x} = g_\phi(f_\theta(x)). \]

Training minimizes a reconstruction loss averaged over a dataset $\{x^{(i)}\}_{i=1}^{n}$,

\[ \mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, g_\phi(f_\theta(x^{(i)}))\bigr). \]

The per-example loss $\ell$ is chosen to match the data. For real valued inputs the squared error $\ell(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2$ is standard, and it corresponds to a Gaussian observation model. For inputs in $[0,1]$, such as normalized pixel intensities, the binary cross entropy loss is common, corresponding to a Bernoulli model per coordinate. The parameters $\theta$ and $\phi$ are optimized jointly by stochastic gradient descent and backpropagation, exactly as in any feedforward network.

A minimal encoder and decoder can each be a single affine layer followed by a nonlinearity. A typical specification looks like the following.

# Single hidden layer (bottleneck) autoencoder, schematic
h     = activation(W_enc @ x + b_enc)   # encoder: R^d -> R^k
x_hat = W_dec @ h + b_dec               # decoder: R^k -> R^d
loss  = mean((x - x_hat) ** 2)          # reconstruction objective

The signal that drives learning is the gap between $x$ and $\hat{x}$. Because no labels are required, the autoencoder is an instance of self-supervised learning: the supervisory target is the input itself.

147.2 2. The Bottleneck and Why It Forces Structure

If the code dimension $k$ is at least as large as the input dimension $d$ and the network is expressive enough, the autoencoder can trivially learn the identity map, copying each coordinate straight through and reconstructing perfectly without learning anything useful. The whole enterprise depends on preventing this collapse. The most direct prevention is to make the code narrow, $k < d$. This narrow layer is the bottleneck, and it is what gives dimensionality reduction its teeth.

When $k < d$, the network cannot store every input exactly. To keep reconstruction error low across the dataset it must discover the directions of variation that are shared among examples and allocate its limited code budget to them. Coordinates that are noisy, redundant, or constant carry little reconstruction value and are discarded. The code $h$ therefore concentrates the recurring, predictive structure of the data. This is the same intuition behind lossy compression: a good code throws away what can be regenerated and keeps what cannot.

Two informal information bounds make the point. The reconstruction is a function of the code alone, so by the data processing inequality the mutual information between input and reconstruction cannot exceed the information the code retains,

\[ I(x; \hat{x}) \le I(x; h). \]

At the same time the code is a deterministic function of $k$ real numbers, so it cannot carry more than $k$ effective degrees of freedom. Low reconstruction error therefore requires that the $k$ dimensional code capture the dominant variation in $x$. The bottleneck is not the only way to constrain an autoencoder, as Section 5 shows, but it is the most interpretable, because the width $k$ is the literal target dimension of the reduction.

147.3 3. Undercomplete Autoencoders

An autoencoder whose code dimension is smaller than its input dimension, $k < d$, is called undercomplete. The undercomplete autoencoder is the canonical tool for nonlinear dimensionality reduction. After training, the encoder $f_\theta$ alone is used to embed data: each input $x$ yields a $k$ dimensional vector $h$ that can feed a downstream classifier, a clustering routine, a nearest neighbor index, or a two dimensional or three dimensional visualization when $k$ is small.

The expressive power of the encoder and decoder controls the geometry that can be captured. With linear maps and a squared error loss the undercomplete autoencoder recovers a linear subspace, as the next section makes precise. With nonlinear activations and multiple layers the encoder can fold, bend, and unroll the data, fitting a curved low dimensional manifold embedded in the high dimensional space. Data such as natural images, sensor readings, or word counts often lie near such a manifold rather than on a flat plane, which is exactly where nonlinear autoencoders outperform linear projections.

Depth matters. A single hidden layer limits the encoder to a shallow transformation, whereas stacking layers lets the network compose simple operations into the highly nonlinear coordinate change a complicated manifold demands. A practical recipe is a symmetric pair of multilayer perceptrons whose widths taper from $d$ down to $k$ in the encoder and expand back from $k$ to $d$ in the decoder.

# Encoder widths taper to the bottleneck k; decoder mirrors them
encoder: d -> 512 -> 128 -> k      # nonlinear layers between sizes
decoder: k -> 128 -> 512 -> d      # symmetric expansion back to input

The width $k$ trades reconstruction fidelity against compression. Small $k$ yields aggressive reduction and forces the model to be selective, but discards more detail. Large $k$ preserves detail but offers weaker reduction and risks approaching the identity map. Choosing $k$ is an empirical decision, often guided by a reconstruction error versus dimension curve analogous to the scree plot used for PCA.

147.4 4. The Relationship to PCA

The cleanest theoretical anchor for autoencoders is their relationship to principal component analysis. Consider a linear autoencoder with no nonlinear activations: the encoder is $h = W_e x$ and the decoder is $\hat{x} = W_d h$, with $W_e \in \mathbb{R}^{k \times d}$ and $W_d \in \mathbb{R}^{d \times k}$. Assume the data are mean centered. The training objective with squared error is

\[ \min_{W_e, W_d} \; \frac{1}{n} \sum_{i=1}^{n} \bigl\lVert x^{(i)} - W_d W_e x^{(i)} \bigr\rVert_2^2 . \]

Write $M = W_d W_e$. Because $W_e$ has rank at most $k$ and $W_d$ has rank at most $k$, the product $M$ has rank at most $k$. The problem reduces to finding the best rank $k$ linear map that reconstructs the data, which is precisely the problem solved by PCA. The Eckart-Young theorem states that the optimal rank $k$ reconstruction projects each point onto the subspace spanned by the top $k$ eigenvectors of the empirical covariance matrix $\Sigma = \frac{1}{n} \sum_i x^{(i)} {x^{(i)}}^\top$. At the optimum the linear autoencoder spans the same subspace that PCA finds, so it achieves the same minimal reconstruction error.

There is one important caveat about uniqueness. PCA returns an orthonormal basis whose components are ordered by decreasing variance and are mutually uncorrelated. A linear autoencoder is only required to span the correct $k$ dimensional subspace; its solution is determined only up to an invertible $k \times k$ change of basis, since $W_d W_e = (W_d A^{-1})(A W_e)$ for any invertible $A$. The encoder weights need not be orthonormal and the code coordinates need not be ordered or uncorrelated. The two methods agree on the optimal subspace and on the optimal reconstruction error; they need not agree on the particular coordinates within that subspace. Tied weights ($W_d = W_e^\top$) and an orthogonality penalty can push a linear autoencoder toward the PCA basis itself, but plain training does not guarantee it.

This equivalence is what justifies thinking of autoencoders as a strict generalization of PCA. Replace the linear maps with deep nonlinear networks and the model can represent reductions that no linear projection can express. The price is the loss of the clean closed form solution: where PCA reduces to an eigendecomposition with a global optimum, a nonlinear autoencoder is trained by nonconvex optimization with no guarantee of reaching the best possible code. In practice the gain in representational flexibility usually outweighs this loss for data that is genuinely nonlinear.

147.5 5. Regularized Autoencoders

Constraining the code width is not the only way to prevent an autoencoder from learning the identity. Regularized autoencoders keep a code that may be as wide as the input, or even wider (the overcomplete case), but add a penalty or a structural device that forces the code to capture useful structure rather than copy the input. These methods broaden the design space and often produce more useful codes than a bare bottleneck.

147.5.1 5.1 Sparse Autoencoders

A sparse autoencoder augments the reconstruction loss with a penalty that drives most code activations toward zero for any given input, so each input is explained by a small active subset of code units. The objective becomes

\[ \mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, \hat{x}^{(i)}\bigr) + \lambda \, \Omega(h), \]

where $\Omega(h)$ measures activation magnitude. A common choice is an $L_1$ penalty $\Omega(h) = \lVert h \rVert_1$, which yields sparse codes much as the lasso yields sparse coefficients. An alternative penalizes the deviation of each unit’s average activation from a small target rate using a Kullback-Leibler term. Sparsity lets the code dimension $k$ exceed $d$ while still being informative, because the effective number of active units per input stays small. Such overcomplete sparse codes are a recurring theme in feature learning and in modern work on interpretable features inside large models.

147.5.2 5.2 Denoising Autoencoders

A denoising autoencoder corrupts each input before encoding and asks the network to reconstruct the clean original. Let $\tilde{x}$ be a corrupted version of $x$ drawn from a noise process $C(\tilde{x} \mid x)$, for example additive Gaussian noise or randomly zeroed coordinates. The model minimizes

\[ \mathcal{L} = \mathbb{E}_{x} \, \mathbb{E}_{\tilde{x} \sim C(\cdot \mid x)} \bigl[ \ell\bigl(x, g_\phi(f_\theta(\tilde{x}))\bigr) \bigr]. \]

Because the network never sees the clean input at its own input layer, copying is impossible; it must learn how the data is structured well enough to undo the corruption. The reconstruction map is pushed to project corrupted points back toward the data manifold, and the learned code captures the manifold’s shape. Denoising also has an elegant interpretation: under small Gaussian corruption the trained reconstruction approximates the direction of the gradient of the log density of the data, linking the autoencoder to score based generative models.

147.5.3 5.3 Contractive Autoencoders

A contractive autoencoder penalizes the sensitivity of the code to small input changes by adding the squared Frobenius norm of the encoder Jacobian to the loss,

\[ \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, \hat{x}^{(i)}\bigr) + \lambda \left\lVert \frac{\partial f_\theta}{\partial x} \Big|_{x^{(i)}} \right\rVert_F^2 . \]

This term encourages the encoder to be locally flat, contracting most directions in input space so that the code changes only along the directions in which the data actually varies. The result is a representation that is robust to small perturbations and aligned with the local manifold tangent. Denoising and contractive autoencoders are closely related: both make the code insensitive to off manifold variation, one by injecting noise and one by penalizing the Jacobian directly.

147.6 6. A Bridge to Representation Learning

Up to this point reconstruction has been the objective and the code a means to that end. Representation learning inverts the emphasis. The reconstruction loss becomes a pretext task whose real purpose is to produce a code $h$ that is useful for problems the network was never directly trained on. A good representation makes downstream tasks easier: it disentangles factors of variation, it is robust to nuisance perturbations, it places semantically similar inputs near one another, and it requires few labeled examples to fit a downstream model on top.

Autoencoders were among the first practical tools for unsupervised representation learning. Stacked autoencoders, trained greedily layer by layer, were used in the late 2000s to initialize deep networks before purely supervised training with better initialization and optimization made such pretraining unnecessary for many tasks. The core idea survived and matured: learn features by setting up a self-supervised objective on unlabeled data, then transfer those features.

The variational autoencoder reframes the bottleneck probabilistically. Instead of a deterministic code, the encoder outputs the parameters of a distribution $q(h \mid x)$ over latent variables, and training maximizes a lower bound on the data likelihood that combines a reconstruction term with a term pulling $q(h \mid x)$ toward a prior $p(h)$. This yields a smooth, generative latent space from which new samples can be drawn, and it makes explicit the link between dimensionality reduction and density modeling.

Modern self-supervised representation learning has largely moved beyond pixel level reconstruction. Contrastive methods learn codes by pulling together representations of different views of the same input and pushing apart representations of different inputs, and masked prediction methods, which reconstruct deliberately hidden portions of the input, power the pretraining of large language models and masked image models. These approaches abandon full input reconstruction yet inherit the autoencoder’s defining move: define a self-supervised objective whose solution requires a representation that captures the structure of the data. The encoder-decoder skeleton, the bottleneck that forces selectivity, and the use of a learned code as a transferable feature are the durable ideas, and they remain the conceptual foundation on which contemporary representation learning is built.

147.7 7. Complexity, When to Use It, and Failure Modes

147.7.1 7.1 Computational Cost

Training cost scales with the size of the network and the dataset. A forward and backward pass through a multilayer perceptron with layer widths $m_0 = d, m_1, \dots, m_L = k$ in the encoder (mirrored in the decoder) costs $O\!\left(\sum_\ell m_{\ell-1} m_\ell\right)$ floating point operations per example, and training runs that pass over $n$ examples for $E$ epochs, giving roughly $O\!\left(E \, n \sum_\ell m_{\ell-1} m_\ell\right)$ total work. The dominant term is usually the first layer, $O(d \cdot m_1)$, when the input dimension $d$ is large. Memory at training time holds the parameters plus the per batch activations needed for backpropagation. Inference, the only part that matters once the encoder is deployed as a feature extractor, is a single forward pass through the encoder at $O\!\left(\sum_\ell m_{\ell-1} m_\ell\right)$ per example.

Contrast this with PCA, whose truncated randomized SVD costs $O(n d m)$ once and has no hyperparameters to tune and no optimization to babysit. An autoencoder is strictly more expensive: it adds epochs, learning rates, architecture choices, and the risk of a bad local optimum. That cost is only justified when the data’s nonlinearity is real, which leads to the next point.

147.7.2 7.2 When to Reach for an Autoencoder

Prefer an autoencoder over a linear method when the data plausibly lie on a curved manifold that a flat subspace cannot capture, when you have enough unlabeled data to fit a flexible model without overfitting, and when a nonlinear code measurably improves a downstream task you can evaluate. Prefer PCA, kernel PCA, or a spectral manifold method when the dataset is small, when a transparent and globally optimal solution matters, or when a quick linear baseline already reaches the accuracy you need. The disciplined workflow is to fit PCA first, record its reconstruction error and downstream score, and only adopt an autoencoder if it clears that bar by a margin worth the added complexity. The example at the end of this chapter runs exactly that comparison.

147.7.3 7.3 Failure Modes

Several pathologies recur in practice.

Identity collapse. If the bottleneck is too wide or the network too powerful relative to the constraint, the model approximates the identity and the code carries no useful compression. The remedy is a narrower $k$, a regularizer, or a denoising objective.
Memorization without generalization. The lowest training reconstruction error often corresponds to a code that has overfit. Select $k$ and architecture by validation reconstruction and, crucially, by downstream task performance, never by training loss alone.
Unscaled inputs. Because squared error sums over coordinates, a feature measured in large units dominates the loss and the code purely as an artifact of scale. Standardize or normalize before training.
Optimization sensitivity. The objective is nonconvex, so results depend on initialization, learning rate, and seed. Fix the seed for reproducibility and run a small sweep when stability matters.
An arbitrary, unaligned code. Unlike PCA, the autoencoder code has no canonical orientation, ordering, or scale, and successive runs can produce rotated or permuted axes. If you need stable, interpretable coordinates, add a regularizer (sparsity, contraction) or fall back to PCA.

147.7.4 7.4 Mature Open-Source Tooling

There is no reason to hand-roll training loops, optimizers, or autodiff. The autoencoder in this chapter is a few layers expressed in PyTorch, the most widely used open-source deep learning framework, with Keras on top of TensorFlow and JAX (via Flax) as equally capable free alternatives. For the linear and kernel baselines, scikit-learn ships PCA, KernelPCA, and IncrementalPCA out of the box. For the manifold visualizations an autoencoder competes with, the open-source UMAP and scikit-learn’s TSNE and Isomap are the standard references. In the Julia ecosystem Flux.jl provides the same building blocks, and in Rust the candle and burn crates do. The tabs below show the same undercomplete autoencoder in all three languages. Build on these mature, well-tested libraries rather than reimplementing backpropagation; reserve a from-scratch implementation for learning, not production.

147.8 8. A Worked Example: Unrolling a Swiss Roll

The cleanest way to see why a nonlinear autoencoder beats a linear projection is to give both a dataset that is genuinely curved. The Swiss roll is the textbook case: points sampled from a two dimensional sheet that has been rolled up inside three dimensional space. The data truly have two intrinsic degrees of freedom, but they wind around so that no flat plane can separate the near and far layers of the roll. PCA, restricted to a linear projection, smears the layers together; an autoencoder with nonlinear layers can unroll the sheet and lay it flat in its two dimensional code.

The example below generates the roll with scikit-learn’s make_swiss_roll, standardizes it, and trains a small symmetric autoencoder in PyTorch to compress $\mathbb{R}^3$ to a $\mathbb{R}^2$ code. It then compares against a two-component PCA on two axes: reconstruction error, and how well a plain logistic classifier can recover a label that varies along the roll’s intrinsic coordinate from each two dimensional embedding. Everything is seeded so the numbers reproduce.

Code

import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 0
np.random.seed(SEED)
torch.manual_seed(SEED)

# A 3D Swiss roll: data lie on a curved 2D manifold embedded in R^3.
# A linear projection cannot unroll it; a nonlinear autoencoder can.
X, color = make_swiss_roll(n_samples=2000, noise=0.05, random_state=SEED)
# Label tracks position ALONG the roll (its intrinsic unrolled coordinate).
# The two classes interleave in R^3 because the sheet wraps around, so a
# straight cut in raw space cannot separate them, but an unrolled code can.
y = (np.sin(color / 3.0) > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y
)
scaler = StandardScaler().fit(Xtr)
Xtr = scaler.transform(Xtr).astype(np.float32)
Xte = scaler.transform(Xte).astype(np.float32)

D, K = X.shape[1], 2  # reduce R^3 -> R^2


class Autoencoder(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, k),
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 16), nn.ReLU(),
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, d),
        )

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h


model = Autoencoder(D, K)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

Xtr_t = torch.from_numpy(Xtr)
for epoch in range(600):
    model.train()
    opt.zero_grad()
    recon, _ = model(Xtr_t)
    loss = loss_fn(recon, Xtr_t)
    loss.backward()
    opt.step()

# Encode train/test to the 2D code and measure test reconstruction error.
model.eval()
with torch.no_grad():
    recon_te, Hte = model(torch.from_numpy(Xte))
    ae_mse = loss_fn(recon_te, torch.from_numpy(Xte)).item()
    Htr = model.encoder(Xtr_t).numpy()
Hte = Hte.numpy()

# Linear baseline: PCA to the same 2 dimensions.
pca = PCA(n_components=K, random_state=SEED).fit(Xtr)
pca_mse = float(np.mean((Xte - pca.inverse_transform(pca.transform(Xte))) ** 2))


# Downstream probe: how linearly separable is the label in each 2D code?
def probe(Ztr, Zte):
    clf = LogisticRegression(max_iter=1000).fit(Ztr, ytr)
    return clf.score(Zte, yte)


ae_acc = probe(Htr, Hte)
pca_acc = probe(pca.transform(Xtr), pca.transform(Xte))

print(f"input dim d = {D}, code dim k = {K}")
print(f"AE  test reconstruction MSE: {ae_mse:.4f}")
print(f"PCA test reconstruction MSE: {pca_mse:.4f}")
print(f"AE  code shape: {Hte.shape}")
print("AE  first three codes:")
print(np.round(Hte[:3], 3))
print(f"downstream logistic-probe accuracy  AE: {ae_acc:.3f}  PCA: {pca_acc:.3f}")

input dim d = 3, code dim k = 2
AE  test reconstruction MSE: 0.0605
PCA test reconstruction MSE: 0.2777
AE  code shape: (500, 2)
AE  first three codes:
[[ 0.181 -0.675]
 [ 0.612  1.525]
 [ 0.25   0.691]]
downstream logistic-probe accuracy  AE: 0.676  PCA: 0.500

The autoencoder reaches a far lower reconstruction error than PCA on the same two dimensional budget, because its nonlinear layers fold the curved sheet flat where a linear projection cannot. The downstream probe makes the practical payoff explicit: a plain linear classifier recovers the roll’s intrinsic label well above chance from the autoencoder code, but only at chance from the PCA code, because PCA leaves the interleaved layers of the roll overlapping. This is precisely the regime in which the extra cost of an autoencoder pays off.

# Equivalent undercomplete autoencoder with Flux.jl (not executed here).
using Flux, MLJ, Random, Statistics
using Flux: mse, train!

Random.seed!(0)

# Assume X (3 x n, standardized) and labels y are available, e.g. from a
# Swiss-roll generator; Flux expects features in columns.
d, k = 3, 2

encoder = Chain(Dense(d, 32, relu), Dense(32, 16, relu), Dense(16, k))
decoder = Chain(Dense(k, 16, relu), Dense(16, 32, relu), Dense(32, d))
model   = Chain(encoder, decoder)

opt   = Flux.setup(Adam(1e-2), model)
loss(m, x) = mse(m(x), x)

for epoch in 1:600
    grads = Flux.gradient(m -> loss(m, Xtr), model)
    Flux.update!(opt, model, grads[1])
end

# Encode to the 2D code and report test reconstruction error.
H       = encoder(Xte)
ae_mse  = mse(decoder(H), Xte)
println("AE test reconstruction MSE: ", ae_mse)
println("AE code shape: ", size(H))

// Equivalent undercomplete autoencoder with candle (not executed here).
// candle is Hugging Face's minimalist, mature deep-learning crate for Rust;
// burn (https://burn.dev/) is an equally good alternative with a higher-level
// training loop. Assume `xtr` is a standardized (n x 3) Tensor.
use candle_core::{DType, Device, Tensor};
use candle_nn::{loss, linear, Linear, Module, Optimizer, VarBuilder, VarMap, AdamW, ParamsAdamW};

struct Autoencoder { e1: Linear, e2: Linear, e3: Linear,
                     d1: Linear, d2: Linear, d3: Linear }

impl Autoencoder {
    fn encode(&self, x: &Tensor) -> candle_core::Result<Tensor> {
        let h = self.e1.forward(x)?.relu()?;
        let h = self.e2.forward(&h)?.relu()?;
        self.e3.forward(&h)
    }
    fn forward(&self, x: &Tensor) -> candle_core::Result<Tensor> {
        let h = self.encode(x)?;
        let h = self.d1.forward(&h)?.relu()?;
        let h = self.d2.forward(&h)?.relu()?;
        self.d3.forward(&h)
    }
}

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    let vm = VarMap::new();
    let vb = VarBuilder::from_varmap(&vm, DType::F32, &dev);
    let (d, k) = (3usize, 2usize);
    let model = Autoencoder {
        e1: linear(d, 32, vb.pp("e1"))?, e2: linear(32, 16, vb.pp("e2"))?,
        e3: linear(16, k, vb.pp("e3"))?, d1: linear(k, 16, vb.pp("d1"))?,
        d2: linear(16, 32, vb.pp("d2"))?, d3: linear(32, d, vb.pp("d3"))?,
    };
    let mut opt = AdamW::new(vm.all_vars(), ParamsAdamW { lr: 1e-2, ..Default::default() })?;

    // xtr: standardized training tensor of shape (n, 3).
    for _ in 0..600 {
        let recon = model.forward(&xtr)?;
        let l = loss::mse(&recon, &xtr)?;
        opt.backward_step(&l)?;
    }
    let code = model.encode(&xtr)?;       // (n, 2) code
    println!("AE code shape: {:?}", code.dims());
    Ok(())
}

147.9 9. Practical Considerations

Several choices determine whether an autoencoder produces a useful reduction. Inputs should be scaled or normalized so that no coordinate dominates the squared error purely because of its units. The bottleneck width $k$ should be selected by sweeping a range and inspecting reconstruction error together with downstream task performance, not reconstruction alone, because the lowest reconstruction error often corresponds to a code that has merely memorized rather than generalized. Symmetric encoder and decoder depths are a sensible default, and weight tying can reduce parameters and nudge linear models toward the PCA solution. When the goal is a clean low dimensional embedding for analysis, a regularized variant such as a denoising or contractive autoencoder usually yields a more stable and more transferable code than an unregularized bottleneck of the same width. Finally, always compare against PCA as a baseline: if a linear projection reaches comparable reconstruction and downstream accuracy, the simpler, deterministic, and globally optimal method is the better engineering choice, and the autoencoder earns its place only when the data’s nonlinearity makes the extra capacity pay off.

147.10 References

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 14: Autoencoders. MIT Press, 2016. https://www.deeplearningbook.org/contents/autoencoders.html
Hinton, G. E., and Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504-507, 2006. https://www.science.org/doi/10.1126/science.1127647
Baldi, P., and Hornik, K. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks, 2(1), 53-58, 1989. https://doi.org/10.1016/0893-6080(89)90014-2
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. A. Stacked Denoising Autoencoders. Journal of Machine Learning Research, 11, 3371-3408, 2010. https://www.jmlr.org/papers/v11/vincent10a.html
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. ICML, 2011. https://icml.cc/2011/papers/455_icmlpaper.pdf
Kingma, D. P., and Welling, M. Auto-Encoding Variational Bayes. ICLR, 2014. https://arxiv.org/abs/1312.6114
Bengio, Y., Courville, A., and Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828, 2013. https://arxiv.org/abs/1206.5538
He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. Masked Autoencoders Are Scalable Vision Learners. CVPR, 2022. https://arxiv.org/abs/2111.06377

# Autoencoders for Dimensionality Reduction Dimensionality reduction is the practice of mapping high dimensional data into a lower dimensional space while preserving as much of the structure that matters as possible. Classical methods such as principal component analysis (PCA) accomplish this with linear algebra. Autoencoders generalize the idea to nonlinear maps learned by neural networks. This chapter develops the encoder-decoder architecture, explains why the bottleneck forces compression, establishes the precise relationship between linear autoencoders and PCA, surveys undercomplete and regularized variants, analyzes cost and failure modes, and closes with a bridge to representation learning, where the learned code becomes the central object of interest rather than a byproduct of reconstruction. A worked example near the end trains a small autoencoder with the mature open-source PyTorch library and pits it head to head against PCA on a curved manifold, with parallel Julia (Flux.jl) and Rust (candle) versions. ## 1. The Encoder-Decoder Architecture An autoencoder is a neural network trained to copy its input to its output. That sounds trivial until you constrain the network so that it cannot learn the identity function directly. The architecture is built from two maps. The encoder $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$ sends an input $x$ to a code (also called the latent representation or the bottleneck activation) $h = f_\theta(x)$. The decoder $g_\phi : \mathbb{R}^k \to \mathbb{R}^d$ maps the code back to the input space, producing a reconstruction $\hat{x} = g_\phi(h)$. The composed network computes $$ \hat{x} = g_\phi(f_\theta(x)). $$ Training minimizes a reconstruction loss averaged over a dataset $\{x^{(i)}\}_{i=1}^{n}$, $$ \mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, g_\phi(f_\theta(x^{(i)}))\bigr). $$ The per-example loss $\ell$ is chosen to match the data. For real valued inputs the squared error $\ell(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2$ is standard, and it corresponds to a Gaussian observation model. For inputs in $[0,1]$, such as normalized pixel intensities, the binary cross entropy loss is common, corresponding to a Bernoulli model per coordinate. The parameters $\theta$ and $\phi$ are optimized jointly by stochastic gradient descent and backpropagation, exactly as in any feedforward network. A minimal encoder and decoder can each be a single affine layer followed by a nonlinearity. A typical specification looks like the following. ```python # Single hidden layer (bottleneck) autoencoder, schematic h = activation(W_enc @ x + b_enc) # encoder: R^d -> R^k x_hat = W_dec @ h + b_dec # decoder: R^k -> R^d loss = mean((x - x_hat) ** 2) # reconstruction objective ``` The signal that drives learning is the gap between $x$ and $\hat{x}$. Because no labels are required, the autoencoder is an instance of self-supervised learning: the supervisory target is the input itself. ## 2. The Bottleneck and Why It Forces Structure If the code dimension $k$ is at least as large as the input dimension $d$ and the network is expressive enough, the autoencoder can trivially learn the identity map, copying each coordinate straight through and reconstructing perfectly without learning anything useful. The whole enterprise depends on preventing this collapse. The most direct prevention is to make the code narrow, $k < d$. This narrow layer is the bottleneck, and it is what gives dimensionality reduction its teeth. When $k < d$, the network cannot store every input exactly. To keep reconstruction error low across the dataset it must discover the directions of variation that are shared among examples and allocate its limited code budget to them. Coordinates that are noisy, redundant, or constant carry little reconstruction value and are discarded. The code $h$ therefore concentrates the recurring, predictive structure of the data. This is the same intuition behind lossy compression: a good code throws away what can be regenerated and keeps what cannot. Two informal information bounds make the point. The reconstruction is a function of the code alone, so by the data processing inequality the mutual information between input and reconstruction cannot exceed the information the code retains, $$ I(x; \hat{x}) \le I(x; h). $$ At the same time the code is a deterministic function of $k$ real numbers, so it cannot carry more than $k$ effective degrees of freedom. Low reconstruction error therefore requires that the $k$ dimensional code capture the dominant variation in $x$. The bottleneck is not the only way to constrain an autoencoder, as Section 5 shows, but it is the most interpretable, because the width $k$ is the literal target dimension of the reduction. ## 3. Undercomplete Autoencoders An autoencoder whose code dimension is smaller than its input dimension, $k < d$, is called undercomplete. The undercomplete autoencoder is the canonical tool for nonlinear dimensionality reduction. After training, the encoder $f_\theta$ alone is used to embed data: each input $x$ yields a $k$ dimensional vector $h$ that can feed a downstream classifier, a clustering routine, a nearest neighbor index, or a two dimensional or three dimensional visualization when $k$ is small. The expressive power of the encoder and decoder controls the geometry that can be captured. With linear maps and a squared error loss the undercomplete autoencoder recovers a linear subspace, as the next section makes precise. With nonlinear activations and multiple layers the encoder can fold, bend, and unroll the data, fitting a curved low dimensional manifold embedded in the high dimensional space. Data such as natural images, sensor readings, or word counts often lie near such a manifold rather than on a flat plane, which is exactly where nonlinear autoencoders outperform linear projections. Depth matters. A single hidden layer limits the encoder to a shallow transformation, whereas stacking layers lets the network compose simple operations into the highly nonlinear coordinate change a complicated manifold demands. A practical recipe is a symmetric pair of multilayer perceptrons whose widths taper from $d$ down to $k$ in the encoder and expand back from $k$ to $d$ in the decoder. ```python # Encoder widths taper to the bottleneck k; decoder mirrors them encoder: d -> 512 -> 128 -> k # nonlinear layers between sizes decoder: k -> 128 -> 512 -> d # symmetric expansion back to input ``` The width $k$ trades reconstruction fidelity against compression. Small $k$ yields aggressive reduction and forces the model to be selective, but discards more detail. Large $k$ preserves detail but offers weaker reduction and risks approaching the identity map. Choosing $k$ is an empirical decision, often guided by a reconstruction error versus dimension curve analogous to the scree plot used for PCA. ## 4. The Relationship to PCA The cleanest theoretical anchor for autoencoders is their relationship to principal component analysis. Consider a linear autoencoder with no nonlinear activations: the encoder is $h = W_e x$ and the decoder is $\hat{x} = W_d h$, with $W_e \in \mathbb{R}^{k \times d}$ and $W_d \in \mathbb{R}^{d \times k}$. Assume the data are mean centered. The training objective with squared error is $$ \min_{W_e, W_d} \; \frac{1}{n} \sum_{i=1}^{n} \bigl\lVert x^{(i)} - W_d W_e x^{(i)} \bigr\rVert_2^2 . $$ Write $M = W_d W_e$. Because $W_e$ has rank at most $k$ and $W_d$ has rank at most $k$, the product $M$ has rank at most $k$. The problem reduces to finding the best rank $k$ linear map that reconstructs the data, which is precisely the problem solved by PCA. The Eckart-Young theorem states that the optimal rank $k$ reconstruction projects each point onto the subspace spanned by the top $k$ eigenvectors of the empirical covariance matrix $\Sigma = \frac{1}{n} \sum_i x^{(i)} {x^{(i)}}^\top$. At the optimum the linear autoencoder spans the same subspace that PCA finds, so it achieves the same minimal reconstruction error. There is one important caveat about uniqueness. PCA returns an orthonormal basis whose components are ordered by decreasing variance and are mutually uncorrelated. A linear autoencoder is only required to span the correct $k$ dimensional subspace; its solution is determined only up to an invertible $k \times k$ change of basis, since $W_d W_e = (W_d A^{-1})(A W_e)$ for any invertible $A$. The encoder weights need not be orthonormal and the code coordinates need not be ordered or uncorrelated. The two methods agree on the optimal subspace and on the optimal reconstruction error; they need not agree on the particular coordinates within that subspace. Tied weights ($W_d = W_e^\top$) and an orthogonality penalty can push a linear autoencoder toward the PCA basis itself, but plain training does not guarantee it. This equivalence is what justifies thinking of autoencoders as a strict generalization of PCA. Replace the linear maps with deep nonlinear networks and the model can represent reductions that no linear projection can express. The price is the loss of the clean closed form solution: where PCA reduces to an eigendecomposition with a global optimum, a nonlinear autoencoder is trained by nonconvex optimization with no guarantee of reaching the best possible code. In practice the gain in representational flexibility usually outweighs this loss for data that is genuinely nonlinear. ## 5. Regularized Autoencoders Constraining the code width is not the only way to prevent an autoencoder from learning the identity. Regularized autoencoders keep a code that may be as wide as the input, or even wider (the overcomplete case), but add a penalty or a structural device that forces the code to capture useful structure rather than copy the input. These methods broaden the design space and often produce more useful codes than a bare bottleneck. ### 5.1 Sparse Autoencoders A sparse autoencoder augments the reconstruction loss with a penalty that drives most code activations toward zero for any given input, so each input is explained by a small active subset of code units. The objective becomes $$ \mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, \hat{x}^{(i)}\bigr) + \lambda \, \Omega(h), $$ where $\Omega(h)$ measures activation magnitude. A common choice is an $L_1$ penalty $\Omega(h) = \lVert h \rVert_1$, which yields sparse codes much as the lasso yields sparse coefficients. An alternative penalizes the deviation of each unit's average activation from a small target rate using a Kullback-Leibler term. Sparsity lets the code dimension $k$ exceed $d$ while still being informative, because the effective number of active units per input stays small. Such overcomplete sparse codes are a recurring theme in feature learning and in modern work on interpretable features inside large models. ### 5.2 Denoising Autoencoders A denoising autoencoder corrupts each input before encoding and asks the network to reconstruct the clean original. Let $\tilde{x}$ be a corrupted version of $x$ drawn from a noise process $C(\tilde{x} \mid x)$, for example additive Gaussian noise or randomly zeroed coordinates. The model minimizes $$ \mathcal{L} = \mathbb{E}_{x} \, \mathbb{E}_{\tilde{x} \sim C(\cdot \mid x)} \bigl[ \ell\bigl(x, g_\phi(f_\theta(\tilde{x}))\bigr) \bigr]. $$ Because the network never sees the clean input at its own input layer, copying is impossible; it must learn how the data is structured well enough to undo the corruption. The reconstruction map is pushed to project corrupted points back toward the data manifold, and the learned code captures the manifold's shape. Denoising also has an elegant interpretation: under small Gaussian corruption the trained reconstruction approximates the direction of the gradient of the log density of the data, linking the autoencoder to score based generative models. ### 5.3 Contractive Autoencoders A contractive autoencoder penalizes the sensitivity of the code to small input changes by adding the squared Frobenius norm of the encoder Jacobian to the loss, $$ \mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(x^{(i)}, \hat{x}^{(i)}\bigr) + \lambda \left\lVert \frac{\partial f_\theta}{\partial x} \Big|_{x^{(i)}} \right\rVert_F^2 . $$ This term encourages the encoder to be locally flat, contracting most directions in input space so that the code changes only along the directions in which the data actually varies. The result is a representation that is robust to small perturbations and aligned with the local manifold tangent. Denoising and contractive autoencoders are closely related: both make the code insensitive to off manifold variation, one by injecting noise and one by penalizing the Jacobian directly. ## 6. A Bridge to Representation Learning Up to this point reconstruction has been the objective and the code a means to that end. Representation learning inverts the emphasis. The reconstruction loss becomes a pretext task whose real purpose is to produce a code $h$ that is useful for problems the network was never directly trained on. A good representation makes downstream tasks easier: it disentangles factors of variation, it is robust to nuisance perturbations, it places semantically similar inputs near one another, and it requires few labeled examples to fit a downstream model on top. Autoencoders were among the first practical tools for unsupervised representation learning. Stacked autoencoders, trained greedily layer by layer, were used in the late 2000s to initialize deep networks before purely supervised training with better initialization and optimization made such pretraining unnecessary for many tasks. The core idea survived and matured: learn features by setting up a self-supervised objective on unlabeled data, then transfer those features. The variational autoencoder reframes the bottleneck probabilistically. Instead of a deterministic code, the encoder outputs the parameters of a distribution $q(h \mid x)$ over latent variables, and training maximizes a lower bound on the data likelihood that combines a reconstruction term with a term pulling $q(h \mid x)$ toward a prior $p(h)$. This yields a smooth, generative latent space from which new samples can be drawn, and it makes explicit the link between dimensionality reduction and density modeling. Modern self-supervised representation learning has largely moved beyond pixel level reconstruction. Contrastive methods learn codes by pulling together representations of different views of the same input and pushing apart representations of different inputs, and masked prediction methods, which reconstruct deliberately hidden portions of the input, power the pretraining of large language models and masked image models. These approaches abandon full input reconstruction yet inherit the autoencoder's defining move: define a self-supervised objective whose solution requires a representation that captures the structure of the data. The encoder-decoder skeleton, the bottleneck that forces selectivity, and the use of a learned code as a transferable feature are the durable ideas, and they remain the conceptual foundation on which contemporary representation learning is built. ## 7. Complexity, When to Use It, and Failure Modes ### 7.1 Computational Cost Training cost scales with the size of the network and the dataset. A forward and backward pass through a multilayer perceptron with layer widths $m_0 = d, m_1, \dots, m_L = k$ in the encoder (mirrored in the decoder) costs $O\!\left(\sum_\ell m_{\ell-1} m_\ell\right)$ floating point operations per example, and training runs that pass over $n$ examples for $E$ epochs, giving roughly $O\!\left(E \, n \sum_\ell m_{\ell-1} m_\ell\right)$ total work. The dominant term is usually the first layer, $O(d \cdot m_1)$, when the input dimension $d$ is large. Memory at training time holds the parameters plus the per batch activations needed for backpropagation. Inference, the only part that matters once the encoder is deployed as a feature extractor, is a single forward pass through the encoder at $O\!\left(\sum_\ell m_{\ell-1} m_\ell\right)$ per example. Contrast this with PCA, whose truncated randomized SVD costs $O(n d m)$ once and has no hyperparameters to tune and no optimization to babysit. An autoencoder is strictly more expensive: it adds epochs, learning rates, architecture choices, and the risk of a bad local optimum. That cost is only justified when the data's nonlinearity is real, which leads to the next point. ### 7.2 When to Reach for an Autoencoder Prefer an autoencoder over a linear method when the data plausibly lie on a curved manifold that a flat subspace cannot capture, when you have enough unlabeled data to fit a flexible model without overfitting, and when a nonlinear code measurably improves a downstream task you can evaluate. Prefer PCA, kernel PCA, or a spectral manifold method when the dataset is small, when a transparent and globally optimal solution matters, or when a quick linear baseline already reaches the accuracy you need. The disciplined workflow is to fit PCA first, record its reconstruction error and downstream score, and only adopt an autoencoder if it clears that bar by a margin worth the added complexity. The example at the end of this chapter runs exactly that comparison. ### 7.3 Failure Modes Several pathologies recur in practice. - **Identity collapse.** If the bottleneck is too wide or the network too powerful relative to the constraint, the model approximates the identity and the code carries no useful compression. The remedy is a narrower $k$, a regularizer, or a denoising objective. - **Memorization without generalization.** The lowest training reconstruction error often corresponds to a code that has overfit. Select $k$ and architecture by validation reconstruction and, crucially, by downstream task performance, never by training loss alone. - **Unscaled inputs.** Because squared error sums over coordinates, a feature measured in large units dominates the loss and the code purely as an artifact of scale. Standardize or normalize before training. - **Optimization sensitivity.** The objective is nonconvex, so results depend on initialization, learning rate, and seed. Fix the seed for reproducibility and run a small sweep when stability matters. - **An arbitrary, unaligned code.** Unlike PCA, the autoencoder code has no canonical orientation, ordering, or scale, and successive runs can produce rotated or permuted axes. If you need stable, interpretable coordinates, add a regularizer (sparsity, contraction) or fall back to PCA. ### 7.4 Mature Open-Source Tooling There is no reason to hand-roll training loops, optimizers, or autodiff. The autoencoder in this chapter is a few layers expressed in [PyTorch](https://pytorch.org/), the most widely used open-source deep learning framework, with [Keras](https://keras.io/) on top of [TensorFlow](https://www.tensorflow.org/) and [JAX](https://jax.readthedocs.io/) (via [Flax](https://flax.readthedocs.io/)) as equally capable free alternatives. For the linear and kernel baselines, [scikit-learn](https://scikit-learn.org/) ships `PCA`, `KernelPCA`, and `IncrementalPCA` out of the box. For the manifold visualizations an autoencoder competes with, the open-source [UMAP](https://umap-learn.readthedocs.io/) and scikit-learn's `TSNE` and `Isomap` are the standard references. In the Julia ecosystem [Flux.jl](https://fluxml.ai/) provides the same building blocks, and in Rust the [candle](https://github.com/huggingface/candle) and [burn](https://burn.dev/) crates do. The tabs below show the same undercomplete autoencoder in all three languages. Build on these mature, well-tested libraries rather than reimplementing backpropagation; reserve a from-scratch implementation for learning, not production. ## 8. A Worked Example: Unrolling a Swiss Roll The cleanest way to see why a nonlinear autoencoder beats a linear projection is to give both a dataset that is genuinely curved. The Swiss roll is the textbook case: points sampled from a two dimensional sheet that has been rolled up inside three dimensional space. The data truly have two intrinsic degrees of freedom, but they wind around so that no flat plane can separate the near and far layers of the roll. PCA, restricted to a linear projection, smears the layers together; an autoencoder with nonlinear layers can unroll the sheet and lay it flat in its two dimensional code. The example below generates the roll with scikit-learn's `make_swiss_roll`, standardizes it, and trains a small symmetric autoencoder in PyTorch to compress $\mathbb{R}^3$ to a $\mathbb{R}^2$ code. It then compares against a two-component PCA on two axes: reconstruction error, and how well a plain logistic classifier can recover a label that varies along the roll's intrinsic coordinate from each two dimensional embedding. Everything is seeded so the numbers reproduce. ::: {.panel-tabset} ## Python ```{python} import numpy as np import torch import torch.nn as nn from sklearn.datasets import make_swiss_roll from sklearn.decomposition import PCA from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler SEED = 0 np.random.seed(SEED) torch.manual_seed(SEED) # A 3D Swiss roll: data lie on a curved 2D manifold embedded in R^3. # A linear projection cannot unroll it; a nonlinear autoencoder can. X, color = make_swiss_roll(n_samples=2000, noise=0.05, random_state=SEED) # Label tracks position ALONG the roll (its intrinsic unrolled coordinate). # The two classes interleave in R^3 because the sheet wraps around, so a # straight cut in raw space cannot separate them, but an unrolled code can. y = (np.sin(color / 3.0) > 0).astype(int) Xtr, Xte, ytr, yte = train_test_split( X, y, test_size=0.25, random_state=SEED, stratify=y ) scaler = StandardScaler().fit(Xtr) Xtr = scaler.transform(Xtr).astype(np.float32) Xte = scaler.transform(Xte).astype(np.float32) D, K = X.shape[1], 2 # reduce R^3 -> R^2 class Autoencoder(nn.Module): def __init__(self, d, k): super().__init__() self.encoder = nn.Sequential( nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, k), ) self.decoder = nn.Sequential( nn.Linear(k, 16), nn.ReLU(), nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, d), ) def forward(self, x): h = self.encoder(x) return self.decoder(h), h model = Autoencoder(D, K) opt = torch.optim.Adam(model.parameters(), lr=1e-2) loss_fn = nn.MSELoss() Xtr_t = torch.from_numpy(Xtr) for epoch in range(600): model.train() opt.zero_grad() recon, _ = model(Xtr_t) loss = loss_fn(recon, Xtr_t) loss.backward() opt.step() # Encode train/test to the 2D code and measure test reconstruction error. model.eval() with torch.no_grad(): recon_te, Hte = model(torch.from_numpy(Xte)) ae_mse = loss_fn(recon_te, torch.from_numpy(Xte)).item() Htr = model.encoder(Xtr_t).numpy() Hte = Hte.numpy() # Linear baseline: PCA to the same 2 dimensions. pca = PCA(n_components=K, random_state=SEED).fit(Xtr) pca_mse = float(np.mean((Xte - pca.inverse_transform(pca.transform(Xte))) ** 2)) # Downstream probe: how linearly separable is the label in each 2D code? def probe(Ztr, Zte): clf = LogisticRegression(max_iter=1000).fit(Ztr, ytr) return clf.score(Zte, yte) ae_acc = probe(Htr, Hte) pca_acc = probe(pca.transform(Xtr), pca.transform(Xte)) print(f"input dim d = {D}, code dim k = {K}") print(f"AE test reconstruction MSE: {ae_mse:.4f}") print(f"PCA test reconstruction MSE: {pca_mse:.4f}") print(f"AE code shape: {Hte.shape}") print("AE first three codes:") print(np.round(Hte[:3], 3)) print(f"downstream logistic-probe accuracy AE: {ae_acc:.3f} PCA: {pca_acc:.3f}") ``` The autoencoder reaches a far lower reconstruction error than PCA on the same two dimensional budget, because its nonlinear layers fold the curved sheet flat where a linear projection cannot. The downstream probe makes the practical payoff explicit: a plain linear classifier recovers the roll's intrinsic label well above chance from the autoencoder code, but only at chance from the PCA code, because PCA leaves the interleaved layers of the roll overlapping. This is precisely the regime in which the extra cost of an autoencoder pays off. ## Julia ```julia # Equivalent undercomplete autoencoder with Flux.jl (not executed here). using Flux, MLJ, Random, Statistics using Flux: mse, train! Random.seed!(0) # Assume X (3 x n, standardized) and labels y are available, e.g. from a # Swiss-roll generator; Flux expects features in columns. d, k = 3, 2 encoder = Chain(Dense(d, 32, relu), Dense(32, 16, relu), Dense(16, k)) decoder = Chain(Dense(k, 16, relu), Dense(16, 32, relu), Dense(32, d)) model = Chain(encoder, decoder) opt = Flux.setup(Adam(1e-2), model) loss(m, x) = mse(m(x), x) for epoch in 1:600 grads = Flux.gradient(m -> loss(m, Xtr), model) Flux.update!(opt, model, grads[1]) end # Encode to the 2D code and report test reconstruction error. H = encoder(Xte) ae_mse = mse(decoder(H), Xte) println("AE test reconstruction MSE: ", ae_mse) println("AE code shape: ", size(H)) ``` ## Rust ```rust // Equivalent undercomplete autoencoder with candle (not executed here). // candle is Hugging Face's minimalist, mature deep-learning crate for Rust; // burn (https://burn.dev/) is an equally good alternative with a higher-level // training loop. Assume `xtr` is a standardized (n x 3) Tensor. use candle_core::{DType, Device, Tensor}; use candle_nn::{loss, linear, Linear, Module, Optimizer, VarBuilder, VarMap, AdamW, ParamsAdamW}; struct Autoencoder { e1: Linear, e2: Linear, e3: Linear, d1: Linear, d2: Linear, d3: Linear } impl Autoencoder { fn encode(&self, x: &Tensor) -> candle_core::Result<Tensor> { let h = self.e1.forward(x)?.relu()?; let h = self.e2.forward(&h)?.relu()?; self.e3.forward(&h) } fn forward(&self, x: &Tensor) -> candle_core::Result<Tensor> { let h = self.encode(x)?; let h = self.d1.forward(&h)?.relu()?; let h = self.d2.forward(&h)?.relu()?; self.d3.forward(&h) } } fn main() -> candle_core::Result<()> { let dev = Device::Cpu; let vm = VarMap::new(); let vb = VarBuilder::from_varmap(&vm, DType::F32, &dev); let (d, k) = (3usize, 2usize); let model = Autoencoder { e1: linear(d, 32, vb.pp("e1"))?, e2: linear(32, 16, vb.pp("e2"))?, e3: linear(16, k, vb.pp("e3"))?, d1: linear(k, 16, vb.pp("d1"))?, d2: linear(16, 32, vb.pp("d2"))?, d3: linear(32, d, vb.pp("d3"))?, }; let mut opt = AdamW::new(vm.all_vars(), ParamsAdamW { lr: 1e-2, ..Default::default() })?; // xtr: standardized training tensor of shape (n, 3). for _ in 0..600 { let recon = model.forward(&xtr)?; let l = loss::mse(&recon, &xtr)?; opt.backward_step(&l)?; } let code = model.encode(&xtr)?; // (n, 2) code println!("AE code shape: {:?}", code.dims()); Ok(()) } ``` ::: ## 9. Practical Considerations Several choices determine whether an autoencoder produces a useful reduction. Inputs should be scaled or normalized so that no coordinate dominates the squared error purely because of its units. The bottleneck width $k$ should be selected by sweeping a range and inspecting reconstruction error together with downstream task performance, not reconstruction alone, because the lowest reconstruction error often corresponds to a code that has merely memorized rather than generalized. Symmetric encoder and decoder depths are a sensible default, and weight tying can reduce parameters and nudge linear models toward the PCA solution. When the goal is a clean low dimensional embedding for analysis, a regularized variant such as a denoising or contractive autoencoder usually yields a more stable and more transferable code than an unregularized bottleneck of the same width. Finally, always compare against PCA as a baseline: if a linear projection reaches comparable reconstruction and downstream accuracy, the simpler, deterministic, and globally optimal method is the better engineering choice, and the autoencoder earns its place only when the data's nonlinearity makes the extra capacity pay off. ## References 1. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 14: Autoencoders. MIT Press, 2016. https://www.deeplearningbook.org/contents/autoencoders.html 2. Hinton, G. E., and Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504-507, 2006. https://www.science.org/doi/10.1126/science.1127647 3. Baldi, P., and Hornik, K. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks, 2(1), 53-58, 1989. https://doi.org/10.1016/0893-6080(89)90014-2 4. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. A. Stacked Denoising Autoencoders. Journal of Machine Learning Research, 11, 3371-3408, 2010. https://www.jmlr.org/papers/v11/vincent10a.html 5. Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. ICML, 2011. https://icml.cc/2011/papers/455_icmlpaper.pdf 6. Kingma, D. P., and Welling, M. Auto-Encoding Variational Bayes. ICLR, 2014. https://arxiv.org/abs/1312.6114 7. Bengio, Y., Courville, A., and Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828, 2013. https://arxiv.org/abs/1206.5538 8. He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. Masked Autoencoders Are Scalable Vision Learners. CVPR, 2022. https://arxiv.org/abs/2111.06377