Almost every modern AI system is built on the same idea:
Turn complex objects into vectors that live in a relatively low-dimensional space, then operate on those vectors.
Words, images, audio clips, graphs, user profiles, molecules, and entire documents all end up as points in some high-dimensional vector space. Learning useful representations is at least half of the story of deep learning (Bengio, Courville, and Vincent 2013).
This chapter develops a systematic view of representations and embeddings, with three goals:
Explain what representations are and why vector spaces are so central.
Show how to construct linear and nonlinear feature maps in practice, at both small and production scales.
Develop geometric and information-theoretic intuitions that we will reuse when we talk about attention, contrastive learning, and generative models.
We will move through:
Linear and nonlinear feature maps.
Distributional and embedding-based representations of words and tokens.
Manifold hypotheses about high-dimensional data.
Information bottleneck ideas and compression.
Geometric properties of learned representations.
18.1 What is a representation?
At a high level, a representation is a mapping from objects of interest into a space where learning and reasoning are convenient.
Formally, let \(\mathcal{X}\) be a set of inputs: sentences, images, graphs, etc. A representation is a function
\[
\phi: \mathcal{X} \to \mathbb{R}^d
\]
that maps each input \(x \in \mathcal{X}\) to a vector \(\phi(x)\) in a \(d\)-dimensional space. This \(\phi\) might be:
Handcrafted, as in classical feature engineering.
Learned, as in neural networks.
The downstream model then operates on \(\phi(x)\) instead of directly on \(x\):
\[
f_\theta(x) = g_\theta(\phi(x)).
\]
In deep learning, \(\phi\) is itself parameterized and learned:
For documents, we can use bag of words representations:
Count how many times each vocabulary word appears in the document.
Optionally normalize or weight using tf-idf.
These are high dimensional and sparse, but conceptually simple.
Code
from collections import Counterfrom typing import List, Dict, Tupleimport numpy as npimport scipy.sparse as spdef build_vocabulary(corpus: List[List[str]], min_freq: int=1) -> Dict[str, int]:""" Build token to index mapping from a tokenized corpus. corpus: list of documents, each a list of tokens. """ freq = Counter()for doc in corpus: freq.update(doc) vocab = {}for token, count in freq.items():if count >= min_freq: vocab[token] =len(vocab)return vocabdef one_hot_encode(token: str, vocab: Dict[str, int]) -> np.ndarray:"""Dense one hot vector for a single token.""" d =len(vocab) vec = np.zeros(d, dtype=float)if token in vocab: vec[vocab[token]] =1.0return vecdef bow_sparse_matrix( corpus: List[List[str]], vocab: Dict[str, int],) -> sp.csr_matrix:""" Build a sparse bag-of-words matrix X of shape (n_docs, |vocab|). Entry X[i, j] is the count of vocab token j in document i. """ rows = [] cols = [] data = []for i, doc inenumerate(corpus): counts = Counter(token for token in doc if token in vocab)for token, c in counts.items(): rows.append(i) cols.append(vocab[token]) data.append(c) n_docs =len(corpus) X = sp.csr_matrix((data, (rows, cols)), shape=(n_docs, len(vocab)), dtype=float)return Xdef tfidf_transform(X: sp.csr_matrix) -> sp.csr_matrix:""" Apply a simple tf-idf transform to a sparse count matrix X. tf = count in document / total counts in document. idf = log((1 + n_docs) / (1 + df)) + 1, where df is document frequency. """ n_docs, n_terms = X.shape# Document frequency for each term. df = (X >0).sum(axis=0).A1 # shape: (n_terms,) idf = np.log((1+ n_docs) / (1+ df)) +1.0# Term frequencies: normalize rows to sum to 1 row_sums = np.asarray(X.sum(axis=1)).ravel() row_sums[row_sums ==0] =1.0 inv_row_sums =1.0/ row_sums D_inv = sp.diags(inv_row_sums) tf = D_inv @ X# Apply idf. IDF = sp.diags(idf) tfidf = tf @ IDFreturn tfidf
Notes:
We use scipy.sparse to handle large vocabularies and corpora. In production, bag-of-words matrices can be extremely sparse and must be stored and processed as such.
tfidf_transform illustrates a common weighting scheme that downweights very frequent words.
Even though one-hot and bag-of-words representations are simple, they already enable powerful linear models (SVMs, logistic regression) for text classification and retrieval tasks.
18.2.2 Learned linear projections
One-hot vectors are extremely high-dimensional. A standard remedy is to learn a linear projection to a lower-dimensional space.
Let \(x \in \mathbb{R}^{D}\) be a feature vector (for example, bag of words). A linear embedding is
\[
z = W x, \quad W \in \mathbb{R}^{d \times D},
\]
where typically \(d \ll D\). In neural networks, such linear layers are ubiquitous.
Code
class LinearProjection:""" Simple linear feature map z = W x + b. For high dimensional sparse x, we assume x is a scipy sparse CSR row vector. """def__init__(self, in_dim: int, out_dim: int, rng: np.random.Generator |None=None):self.in_dim = in_dimself.out_dim = out_dimif rng isNone: rng = np.random.default_rng(0)# Xavier initialization limit = np.sqrt(6.0/ (in_dim + out_dim))self.W = rng.uniform(-limit, limit, size=(out_dim, in_dim)).astype(np.float32)self.b = np.zeros(out_dim, dtype=np.float32)def__call__(self, x) -> np.ndarray:""" Apply linear map. x can be dense (1D array) or sparse row vector. Returns dense vector z of shape (out_dim,). """if sp.issparse(x): z =self.W @ x.T # shape (out_dim, 1) z = np.asarray(z).reshape(-1).astype(np.float32)else: x = np.asarray(x, dtype=np.float32) z =self.W @ xreturn z +self.b
For very large dimensions, we rarely store full \(W\) explicitly unless \(in_{dim}\) is already modest. For very large categorical spaces we use embedding tables, which is the next topic.
18.3 Embedding tables for high cardinality categorical features
In large production systems, we often have categorical IDs:
User IDs, item IDs, product IDs.
Token IDs in very large vocabularies.
Categorical combinations of features.
A standard pattern is to maintain an embedding table:
\[
E \in \mathbb{R}^{|\mathcal{C}| \times d},
\]
where each category \(c \in \mathcal{C}\) has a learned vector \(E_c \in \mathbb{R}^d\).
Lookup is simply:
\[
\phi(c) = E_c.
\]
The gradient during training only updates the rows of \(E\) that were actually used in the batch. This sparsity is crucial at large-scale.
18.3.1 The General Framework
An embedding is: discrete object → continuous vector such that vector similarity reflects meaningful object similarity
What are you embedding?
The Two Universal Questions
Unit Level
Text Domain
Graph Domain
Image Domain
Recommendation Domain
Atomic element
Token (Word2Vec, GloVe)
Node (DeepWalk, Node2Vec, GraphSAGE)
Pixel/Patch (ViT patches)
User or Item (matrix factorization)
Pairwise relation
Word pair / collocation
Edge (TransE, DistMult, RotatE)
Spatial relationship
User-item interaction
Multi-way relation
n-gram, phrase
Hyperedge (HyperGCN, HGNN)
Scene graph relations
Session or basket
Local structure
Sentence (Sentence-BERT)
Subgraph / Motif (Subgraph2Vec)
Object / Region (RoI features)
User neighborhood
Community / cluster
Paragraph, topic
Community (ComE, vGraph)
Image segment
User segment
Whole structure
Document (Doc2Vec)
Graph (Graph2Vec, DiffPool)
Full image (ResNet, CLIP)
Full catalog / platform
What’s the unit?
Determined by your downstream task
Classifying documents → document embedding.
Predicting links → edge or node-pair embedding.
Comparing networks → graph embedding.
What training signal defines similarity?
Signal Type
How It Works
Examples
Co-occurrence
Things in similar contexts are similar
Word2Vec, DeepWalk, GloVe
Supervision
Things with same labels are similar
Fine-tuned BERT, supervised GNN
Contrastive
Push positives together, negatives apart
SimCLR, GraphCL, CLIP
Reconstruction
Good embeddings can rebuild the input
Autoencoders, GAE
Translation/transformation
Relations as operations on embeddings
TransE, RotatE
Everything else is implementation detail.
18.3.2 Numpy implementation of an embedding table
Code
from dataclasses import dataclassfrom typing import Sequence@dataclassclass EmbeddingTable:""" Simple embedding table for categorical IDs 0, 1, ..., num_embeddings - 1. """ num_embeddings: int embedding_dim: int weights: np.ndarray # shape (num_embeddings, embedding_dim)@classmethoddef random( cls, num_embeddings: int, embedding_dim: int, rng: np.random.Generator |None=None, ) ->"EmbeddingTable":if rng isNone: rng = np.random.default_rng(0) limit = np.sqrt(6.0/ (num_embeddings + embedding_dim)) weights = rng.uniform(-limit, limit, size=(num_embeddings, embedding_dim)).astype(np.float32)return cls(num_embeddings=num_embeddings, embedding_dim=embedding_dim, weights=weights)def lookup(self, ids: Sequence[int]) -> np.ndarray:""" Return embedding matrix of shape (len(ids), embedding_dim). """ ids = np.asarray(ids, dtype=int)returnself.weights[ids]def update(self, ids: Sequence[int], grad_embeddings: np.ndarray, lr: float) ->None:""" Gradient descent update for a batch of embeddings. ids: indices used in this batch, shape (batch_size,) grad_embeddings: gradient wrt embeddings, shape (batch_size, embedding_dim) Only those rows are updated. """ ids = np.asarray(ids, dtype=int)assert grad_embeddings.shape == (len(ids), self.embedding_dim)# Accumulate gradient per row in case of duplicates grad_accum = np.zeros_like(self.weights) np.add.at(grad_accum, ids, grad_embeddings)self.weights -= lr * grad_accum
This pattern (store a big matrix, do row lookups, update only those rows) is the basis of token embeddings in language models, as well as user and item embeddings in recommender systems.
In production with billions of IDs:
The embedding table may be sharded across many machines.
Some rows may be evicted or stored on slower storage when rarely used.
Quantization and compression of embeddings become important for memory and bandwidth.
Conceptually, though, the core abstraction remains: an embedding table is a learned dictionary from IDs to vectors.
18.4 4. Nonlinear feature maps
Linear maps can only capture linear structure. To represent more complex patterns, we use nonlinear feature maps.
Two complementary viewpoints:
Explicit nonlinear transformations \(\phi(x)\) applied before a linear model.
Deep networks where layers and activations jointly form \(\phi_\psi(x)\).
18.4.1 4.1 Kernel features and random Fourier features
Classical kernel methods rely on the idea of mapping inputs into a (possibly infinite dimensional) feature space where linear methods are powerful.
Given a positive definite kernel \(k(x, x') = \langle \phi(x), \phi(x') \rangle\), one can work with \(\phi\) implicitly via kernel evaluations. However, this does not scale well to large datasets.
Random Fourier features approximate shift invariant kernels such as the Gaussian kernel:
The key result {cite}rahimi2007random is that we can approximate \(k(x, x')\) with a randomly generated finite dimensional map \(\phi(x) \in \mathbb{R}^D\) such that
\[
k(x, x') \approx \phi(x)^\top \phi(x').
\]
Construction:
Sample \(D\) random frequencies \(\omega_i \sim \mathcal{N}(0, \sigma^{-2} I_d)\).
Sample random phases \(b_i \sim \text{Uniform}[0, 2\pi]\).
Define \[
\phi(x) = \sqrt{\frac{2}{D}} \left[\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)\right].
\]
Then inner products in the feature space approximate the Gaussian kernel.
Code
class RandomFourierFeatures:""" Random Fourier feature map approximating a Gaussian kernel. phi(x) = sqrt(2 / D) * cos(W x + b), where rows of W are drawn from N(0, sigma^{-2} I). """def__init__(self, input_dim: int, num_features: int, sigma: float, rng: np.random.Generator |None=None):self.input_dim = input_dimself.num_features = num_featuresself.sigma = sigmaif rng isNone: rng = np.random.default_rng(0)self.W = rng.normal(loc=0.0, scale=1.0/ sigma, size=(num_features, input_dim)).astype(np.float32)self.b = rng.uniform(0.0, 2.0* np.pi, size=(num_features,)).astype(np.float32)def__call__(self, X: np.ndarray) -> np.ndarray:""" Map a batch X of shape (n_samples, input_dim) to random Fourier features of shape (n_samples, num_features). """ X = np.asarray(X, dtype=np.float32)# Compute W x + b for each sample. WX = X @self.W.T # shape (n_samples, num_features)return np.sqrt(2.0/self.num_features) * np.cos(WX +self.b)
We can then train a linear model on top of RandomFourierFeatures to approximate a nonlinear kernel method while keeping training and inference linear in the dataset size and feature dimension.
\[
\phi^{(\ell)}(h) = \sigma(W^{(\ell)} h + b^{(\ell)}).
\]
Even a two layer network can represent highly nonlinear decision boundaries.
A minimalist fully connected network for tabular inputs using NumPy:
Code
class TwoLayerMLP:""" Two layer perceptron: h = relu(W1 x + b1), z = W2 h + b2. """def__init__(self, input_dim: int, hidden_dim: int, output_dim: int, rng: np.random.Generator |None=None):if rng isNone: rng = np.random.default_rng(0)self.input_dim = input_dimself.hidden_dim = hidden_dimself.output_dim = output_dim limit1 = np.sqrt(6.0/ (input_dim + hidden_dim)) limit2 = np.sqrt(6.0/ (hidden_dim + output_dim))self.W1 = rng.uniform(-limit1, limit1, size=(hidden_dim, input_dim)).astype(np.float32)self.b1 = np.zeros(hidden_dim, dtype=np.float32)self.W2 = rng.uniform(-limit2, limit2, size=(output_dim, hidden_dim)).astype(np.float32)self.b2 = np.zeros(output_dim, dtype=np.float32)@staticmethoddef relu(z: np.ndarray) -> np.ndarray:return np.maximum(z, 0.0)def forward(self, X: np.ndarray) -> np.ndarray: X = np.asarray(X, dtype=np.float32) h =self.relu(X @self.W1.T +self.b1) z = h @self.W2.T +self.b2return zdef forward_hidden(self, X: np.ndarray) -> np.ndarray:"""Return hidden representation h, which is a learned nonlinear feature.""" X = np.asarray(X, dtype=np.float32)returnself.relu(X @self.W1.T +self.b1)
Training this MLP with SGD on a classification task will produce a hidden representation h that is tuned to separate the classes in a way that a simple linear classifier in h space can use.
In practice, we rely on frameworks like PyTorch or JAX for automatic differentiation and efficient batched operations, but the conceptual structure is the same.
18.5 5. Distributional representations of words and tokens
Natural language is the canonical domain where embeddings transformed the field. The distributional hypothesis {cite}harris1954distributional states:
Words that occur in similar contexts tend to have similar meanings.
Word embeddings operationalize this by learning vector representations where words with similar usage patterns are close in vector space.
18.5.1 5.1 Cooccurrence and matrix factorization
Let \(\mathcal{V}\) be a vocabulary. Given a large corpus, we can build a cooccurrence matrix\(X \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}\) where \(X_{ij}\) counts how often word \(j\) appears in a context window around word \(i\).
We can then seek low rank matrices \(W, C \in \mathbb{R}^{|\mathcal{V}| \times d}\) such that
\[
W C^\top \approx f(X)
\]
for some transformation \(f\), for example logarithm or pointwise mutual information (PMI). The rows of \(W\) and \(C\) are word and context embeddings. GloVe {cite}pennington2014glove is a refined version of this idea.
Below is a small pipeline to build a cooccurrence matrix and derive embeddings via truncated SVD.
Code
from itertools import islicefrom typing import Iterableimport numpy as npimport scipy.sparse as spfrom sklearn.decomposition import TruncatedSVDdef build_cooccurrence_matrix( corpus: List[List[str]], vocab: Dict[str, int], window_size: int=4, symmetric: bool=True, min_cooccurrence: int=1,) -> sp.coo_matrix:""" Build a word-word cooccurrence matrix X where X[i, j] counts how often j appears in the context window of i. """ rows = [] cols = [] data = []for doc in corpus: ids = [vocab[token] for token in doc if token in vocab]for center_pos, center_id inenumerate(ids): left =max(0, center_pos - window_size) right =min(len(ids), center_pos + window_size +1)for pos inrange(left, right):if pos == center_pos:continue context_id = ids[pos] rows.append(center_id) cols.append(context_id) data.append(1.0)if symmetric: rows.append(context_id) cols.append(center_id) data.append(1.0) n_vocab =len(vocab) X = sp.coo_matrix((data, (rows, cols)), shape=(n_vocab, n_vocab), dtype=np.float32)if min_cooccurrence >1: X = X.tocsr() X.data[X.data < min_cooccurrence] =0.0 X.eliminate_zeros() X = X.tocoo()return Xdef pmi_matrix(X: sp.csr_matrix, positive: bool=True, eps: float=1e-8) -> sp.csr_matrix:""" Compute (positive) PMI matrix from a cooccurrence matrix X. PMI(i, j) = log ( p(i, j) / (p(i) p(j)) ). """ X = X.tocsr() total = X.sum() sum_over_rows = np.asarray(X.sum(axis=1)).ravel() # p(i) sum_over_cols = np.asarray(X.sum(axis=0)).ravel() # p(j) rows, cols = X.nonzero() data = []for i, j inzip(rows, cols): p_ij = X[i, j] / total p_i = sum_over_rows[i] / total p_j = sum_over_cols[j] / total pmi = np.log(p_ij / (p_i * p_j + eps) + eps)if positive: pmi =max(pmi, 0.0) data.append(pmi) M = sp.csr_matrix((data, (rows, cols)), shape=X.shape, dtype=np.float32)return Mdef svd_embeddings( M: sp.csr_matrix, dim: int, rng: np.random.Generator |None=None,) -> np.ndarray:""" Compute low rank embeddings from a matrix M via truncated SVD. Returns word embeddings of shape (n_vocab, dim). """if rng isNone: rng = np.random.default_rng(0) svd = TruncatedSVD(n_components=dim, random_state=rng.integers(10**9)) U = svd.fit_transform(M) # shape (n_vocab, dim)return U.astype(np.float32)
This approach is not how very large systems train embeddings in practice, because:
A dense \(|\mathcal{V}| \times |\mathcal{V}|\) matrix is too large.
SVD on massive matrices is expensive.
However, the matrix factorization viewpoint is conceptually important. It connects to:
Word2vec as implicit factorization of a shifted PMI matrix.
Collaborative filtering in recommender systems as factorization of user item matrices.
18.5.2 5.2 Skip gram with negative sampling (word2vec style)
Word2vec {cite}mikolov2013efficient introduced a very scalable embedding training method. The skip gram with negative sampling (SGNS) objective is:
Given a center word \(w\) and a context word \(c\) observed in the corpus, encourage \(\langle v_w, u_c \rangle\) to be large.
For randomly sampled negative contexts \(c'\), encourage \(\langle v_w, u_{c'} \rangle\) to be small.
For a single positive pair \((w, c)\) and \(k\) negative samples \(c_1', \dots, c_k'\), the loss is
This implementation is intentionally explicit to show how embeddings and gradients interact. In real systems:
Batching is more careful for speed.
Negative samplers use frequency adjusted distributions.
Training is distributed across many workers.
Still, the core idea is:
Embeddings are rows in matrices.
Training updates those rows based on cooccurrence patterns.
18.6 6. Manifolds and the geometry of data
The manifold hypothesis states that high dimensional data of interest often lies near a low dimensional manifold embedded in \(\mathbb{R}^D\):
Although the ambient space is high dimensional, the data concentrates near a lower dimensional structure.
Representations and embeddings can be seen as attempts to find coordinate systems that make this structure simple.
18.6.1 6.1 Linear manifolds: PCA
Principal component analysis (PCA) is the canonical linear method for discovering low dimensional structure. Given centered data \(X \in \mathbb{R}^{n \times D}\), PCA finds directions \(w_1, \dots, w_d\) that maximize variance.
In matrix terms, the top \(d\) singular vectors of \(X\) define the projection.
Code
from sklearn.decomposition import PCAfrom sklearn.datasets import make_swiss_rolldef pca_embedding(X: np.ndarray, dim: int) -> np.ndarray: pca = PCA(n_components=dim) Z = pca.fit_transform(X)return Z.astype(np.float32)def demo_pca_on_swiss_roll():""" Illustrate linear vs nonlinear embeddings on a 3D swiss roll manifold. """ X, _ = make_swiss_roll(n_samples=2000, noise=0.05)# X has shape (n_samples, 3) Z_linear = pca_embedding(X, dim=2)print("Original shape:", X.shape)print("Linear embedding shape:", Z_linear.shape)
PCA is efficient and widely used as a preprocessing step, but it can only recover linear manifolds. Many complex data manifolds (for example, images) are highly nonlinear.
18.6.2 6.2 Nonlinear manifolds: autoencoders
To handle nonlinear structure, we use autoencoders:
An encoder maps \(x\) to a latent code \(z\): \[
z = \text{enc}_\psi(x).
\]
A decoder reconstructs \(x\) from \(z\): \[
\hat{x} = \text{dec}_\theta(z).
\]
We train by minimizing a reconstruction loss: \[
\min_{\psi, \theta} \ \frac{1}{n} \sum_{i=1}^n |x_i - \hat{x}_i|_2^2.
\]
The latent \(z\) is a learned nonlinear embedding. With appropriate architectures and regularization, autoencoders discover useful structure in data.
Below is a concise PyTorch autoencoder for 2D embeddings of 784 dimensional images (for example MNIST). Even if you do not run PyTorch in your pipeline, the structure is worth understanding.
For large datasets, we stream from disk or remote storage.
For large images, we use convolutional encoders and decoders, not fully connected ones.
For stability and better structure in \(z\), we might use variational autoencoders, which connect directly to information bottleneck ideas (next section).
Manifold based thinking reappears constantly when we visualize and interpret representations. Tools like t-SNE or UMAP {cite}vanDerMaaten2008tsne,mcinnes2018umap are essentially nonlinear embedding methods for visualization.
18.7 7. Information bottleneck and compression
The information bottleneck (IB) framework {cite}tishby1999information provides a principled way to think about representations as lossy compressions.
We have:
Input random variable \(X\).
Target variable \(Y\) (for example, labels).
Representation \(Z\) produced from \(X\).
The IB principle:
Find a representation \(Z\) that preserves as much information about \(Y\) as possible, while being as compressed a summary of \(X\) as possible.
Formally, we consider the optimization problem
\[
\min_{p(z \mid x)} I(X; Z) - \beta I(Z; Y),
\]
where:
\(I(\cdot; \cdot)\) denotes mutual information.
\(I(X; Z)\) measures how much information \(Z\) retains from \(X\) (we want this small).
\(I(Z; Y)\) measures how much information \(Z\) has about \(Y\) (we want this large).
\(\beta > 0\) controls the tradeoff.
While directly solving this optimization is hard for complex models, the IB perspective leads to:
Regularization schemes that encourage compression.
Variational approximations like variational autoencoders.
Intuitions about generalization: models that compress input information but retain label information may generalize better.
18.7.1 7.1 Variational autoencoders (high level)
Variational autoencoders (VAEs) implement a stochastic encoder:
where \(p(z)\) is a prior, typically standard normal.
This has an IB flavor:
The reconstruction term encourages \(Z\) to retain information about \(X\).
The KL term encourages \(Z\) to be close to the prior, limiting capacity and compressing information.
Adjusting the weight of the KL term (as in \(\beta\)-VAEs) makes the connection to the information bottleneck explicit.
18.7.2 7.2 Compression of embedding tables
The IB ideas also suggest compressing representations where possible, especially in production:
Embedding table compression:
Reduce embedding dimension.
Apply quantization (for example 8 bit or 4 bit) to embedding vectors.
Factorize embedding matrices into smaller components.
Model compression:
Prune weights with small magnitudes.
Distill representations from a large model into a smaller one.
A simple example: post training quantization of embeddings to 8 bits centered around zero.
Code
def quantize_embeddings_8bit(E: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:""" Uniformly quantize embeddings E to 8 bit signed integers per column. Returns integer matrix Q and per-dimension scale factors s such that E ≈ s * Q, with broadcasting s of shape (1, d). """ E = np.asarray(E, dtype=np.float32) max_abs = np.max(np.abs(E), axis=0) # shape (d,) max_abs[max_abs ==0] =1.0# Map to [-127, 127] scale = max_abs /127.0 Q = np.round(E / scale).astype(np.int8)return Q, scaledef dequantize_embeddings_8bit(Q: np.ndarray, scale: np.ndarray) -> np.ndarray:""" Recover approximate embeddings from quantized representation. """return Q.astype(np.float32) * scale
In large systems, more sophisticated schemes are used, but the basic pattern is:
Compress embedding representations while keeping information about targets intact.
Use approximate nearest neighbor search algorithms that operate efficiently on compressed embeddings.
18.8 8. Geometry of learned representations
Finally, we turn to the geometry of representation spaces and how to analyze them.
18.8.1 8.1 Similarity measures
Given embeddings \(z, z' \in \mathbb{R}^d\), typical similarity measures include:
Cosine similarity is scale invariant and widely used for comparing word and sentence embeddings.
Code
def l2_distance(a: np.ndarray, b: np.ndarray, eps: float=1e-12) ->float:returnfloat(np.linalg.norm(a - b))def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float=1e-12) ->float: a = np.asarray(a, dtype=np.float32) b = np.asarray(b, dtype=np.float32) na = np.linalg.norm(a) nb = np.linalg.norm(b)if na < eps or nb < eps:return0.0returnfloat(a @ b / (na * nb))
For batches, we can compute all pairwise cosine similarities efficiently:
Code
def batch_cosine_similarity(E: np.ndarray) -> np.ndarray:""" Compute cosine similarity matrix S for embeddings E of shape (n, d), so that S[i, j] is the cosine similarity between E[i] and E[j]. """ E = np.asarray(E, dtype=np.float32) norms = np.linalg.norm(E, axis=1, keepdims=True) +1e-12 E_norm = E / normsreturn E_norm @ E_norm.T # shape (n, n)
In production retrieval systems, we rarely compute full \(n \times n\) similarity matrices, but we rely on approximate nearest neighbor indices (for example HNSW or IVF) to retrieve top similar vectors in logarithmic or sublinear time.
18.8.2 8.2 Nearest neighbors and analogies
One of the most striking properties of good word embeddings is that they support semantic analogies. For example, with appropriate training, one can see:
Practically, we often want to find nearest neighbors of a given embedding in embedding space.
Code
def knn( query: np.ndarray, E: np.ndarray, k: int=10, metric: str="cosine",) -> np.ndarray:""" Return indices of k nearest neighbors of query in embedding matrix E. E has shape (n_items, d). query has shape (d,). """ query = np.asarray(query, dtype=np.float32) E = np.asarray(E, dtype=np.float32)if metric =="cosine": norms = np.linalg.norm(E, axis=1) * (np.linalg.norm(query) +1e-12) scores = E @ query / (norms +1e-12)# Larger is more similar idx = np.argsort(-scores)[:k]elif metric =="l2": diffs = E - query[None, :] dists = np.sum(diffs * diffs, axis=1) idx = np.argsort(dists)[:k]else:raiseValueError(f"Unknown metric: {metric}")return idx
To use this for words we need a mapping between vocabulary indices and tokens:
Code
def build_index_to_token(vocab: Dict[str, int]) -> Dict[int, str]:return {idx: tok for tok, idx in vocab.items()}def nearest_words( word: str, vocab: Dict[str, int], E: np.ndarray, k: int=10,) -> List[Tuple[str, float]]:""" Return top k nearest words to the given word based on cosine similarity. """if word notin vocab:raiseValueError(f"Unknown word: {word}") idx = vocab[word] query = E[idx] sim_matrix = batch_cosine_similarity(E) sims = sim_matrix[idx]# Exclude the word itself sims[idx] =-1.0 top_idx = np.argsort(-sims)[:k] idx_to_token = build_index_to_token(vocab)return [(idx_to_token[j], float(sims[j])) for j in top_idx]
In large production systems, similar functionality is implemented using:
Approximate nearest neighbor search (for example FAISS, ScaNN).
Precomputed indices updated periodically as embeddings change.
The core geometry is the same: similar items correspond to nearby vectors.
18.8.3 8.3 Linear probes
To assess the quality of a representation, we can train a simple linear probe:
Freeze the representation \(\phi(x)\).
Train a linear classifier or regressor \(g_\theta(z)\) on top of \(z = \phi(x)\).
If a simple linear probe can solve a task well, we say the relevant information is linearly accessible in the representation.
In code, assuming we have embeddings Z of shape (n_samples, d) and labels y:
Code
from sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import accuracy_scorefrom sklearn.model_selection import train_test_splitdef linear_probe_classification(Z: np.ndarray, y: np.ndarray, test_size: float=0.2) ->float:""" Train a linear classifier on fixed embeddings Z and evaluate accuracy. """ X_train, X_test, y_train, y_test = train_test_split(Z, y, test_size=test_size, random_state=0) clf = LogisticRegression(max_iter=1000) clf.fit(X_train, y_train) y_pred = clf.predict(X_test) acc = accuracy_score(y_test, y_pred)returnfloat(acc)
This pattern is heavily used to evaluate representations of large pretrained models:
For language models: are syntax, semantics, sentiment, etc. linearly separable?
For vision models: are object categories, pose, or texture linearly accessible?
From a practical standpoint, high quality representations can dramatically reduce the amount of labeled data and training required for downstream tasks.
18.8.4 8.4 Anisotropy and whitening
Real world embeddings often exhibit anisotropy: the distribution of vectors is not isotropic in \(\mathbb{R}^d\). For example:
Many vectors may have large projections along a few dominant directions.
Cosine similarities may be biased upward due to a shared mean vector.
This can degrade nearest neighbor quality and interpretability. Common remedies include:
Centering: subtract the mean vector from all embeddings.
Whitening: apply a linear transform that makes covariance close to identity.
Removing top principal components.
Code
def center_embeddings(E: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:""" Center embeddings by subtracting the mean vector. Returns centered embeddings and the mean. """ mean = np.mean(E, axis=0, keepdims=True) E_centered = E - meanreturn E_centered, meandef whiten_embeddings(E: np.ndarray, eps: float=1e-5) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:""" Whiten embeddings: transform them so that covariance is close to identity. Returns whitened embeddings, mean, and whitening matrix W such that E_whitened = (E - mean) @ W. """ E = np.asarray(E, dtype=np.float32) n, d = E.shape mean = np.mean(E, axis=0, keepdims=True) X = E - mean# Covariance C = (X.T @ X) / (n -1) # shape (d, d)# Eigen decomposition evals, evecs = np.linalg.eigh(C)# Avoid division by zero evals_clipped = np.maximum(evals, eps) D_inv_sqrt = np.diag(1.0/ np.sqrt(evals_clipped)) W = evecs @ D_inv_sqrt @ evecs.T E_whitened = X @ Wreturn E_whitened, mean, W
Such geometric preprocessing can significantly improve downstream tasks that rely on distances and similarities in embedding space.
18.9 9. Summary and connections
In this chapter, we developed a detailed view of representations and embeddings:
We formalized representations as mappings \(\phi: \mathcal{X} \to \mathbb{R}^d\) and motivated vector spaces as the natural setting for learning and inference.
We built linear feature maps, including one hot encoding, bag of words, and learned linear projections, with attention to sparse handling and scalability.
We introduced embedding tables for large categorical spaces, emphasizing the row lookup and sparse update pattern that is central to production systems.
We explored nonlinear feature maps using random Fourier features and multilayer perceptrons as learned nonlinear embeddings.
We developed distributional word embeddings via cooccurrence matrices, PMI, and truncated SVD, and then moved to skip gram with negative sampling as an explicitly trainable embedding model.
We framed high dimensional data through the manifold hypothesis, using PCA and autoencoders to illustrate linear and nonlinear embeddings.
We connected representations to information bottleneck ideas, highlighting how compression and relevance trade off and how this appears in variational autoencoders and embedding compression.
We examined the geometry of embedding spaces through distances, cosine similarity, nearest neighbors, linear probes, anisotropy, and whitening.
These ideas are not isolated. They are deeply woven into the topics that follow:
Attention mechanisms operate on sequences of embeddings, using dot product similarities to compute weighted combinations.
Contrastive learning often trains embeddings so that positive pairs are close and negative pairs are far in embedding space, directly operating on the geometric properties we analyzed.
Generative models (language models, diffusion models, VAEs) transform embeddings through deep architectures, and their quality depends heavily on the structure of the latent spaces they learn.
A strong grasp of representations and embeddings is essential for understanding the behavior, limitations, and potential of modern AI systems. In later chapters, we will repeatedly return to these concepts, especially when designing architectures, choosing pretraining objectives, and analyzing model behavior.
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. 2013. “Representation Learning: A Review and New Perspectives.”IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8): 1798–828.
# Representations and embeddingsAlmost every modern AI system is built on the same idea:> Turn complex objects into vectors that live in a relatively low-dimensional space, then operate on those vectors.Words, images, audio clips, graphs, user profiles, molecules, and entire documents all end up as points in some high-dimensional vector space. Learning useful **representations** is at least half of the story of deep learning [@bengio2013representation].This chapter develops a systematic view of representations and embeddings, with three goals:1. Explain **what** representations are and **why** vector spaces are so central.2. Show **how** to construct linear and nonlinear feature maps in practice, at both small and production scales.3. Develop geometric and information-theoretic intuitions that we will reuse when we talk about attention, contrastive learning, and generative models.We will move through:- Linear and nonlinear feature maps.- Distributional and embedding-based representations of words and tokens.- Manifold hypotheses about high-dimensional data.- Information bottleneck ideas and compression.- Geometric properties of learned representations.------------------------------------------------------------------------## What is a representation?At a high level, a **representation** is a mapping from objects of interest into a space where learning and reasoning are convenient.Formally, let $\mathcal{X}$ be a set of inputs: sentences, images, graphs, etc. A representation is a function$$\phi: \mathcal{X} \to \mathbb{R}^d$$that maps each input $x \in \mathcal{X}$ to a vector $\phi(x)$ in a $d$-dimensional space. This $\phi$ might be:- Handcrafted, as in classical feature engineering.- Learned, as in neural networks.The downstream model then operates on $\phi(x)$ instead of directly on $x$:$$f_\theta(x) = g_\theta(\phi(x)).$$In deep learning, $\phi$ is itself parameterized and learned:$$\phi_\psi: \mathcal{X} \to \mathbb{R}^d, \quad f_{\theta,\psi}(x) = g_\theta(\phi_\psi(x)).$$The key advantages of vector representations:- We can use linear algebra (dot products, matrix multiplication, eigenvalue decompositions).- We can measure similarity with norms and angles.- Computations can be batched and run on accelerators.The subtlety, and much of the art, lies in how we choose or learn $\phi$.------------------------------------------------------------------------## Linear feature mapsWe start with linear feature maps, both fixed and learned. They are limited but foundational.### One hot and bag of wordsFor a finite vocabulary $\mathcal{V}$ (words, tokens, discrete IDs), the simplest representation is **one-hot encoding**:- Fix an ordering of tokens in the vocabulary, so $\mathcal{V} = {w_1, \dots, w_{|\mathcal{V}|}}$.- Represent token $w_j$ as a vector $e_j \in \mathbb{R}^{|\mathcal{V}|}$ with a 1 in position $j$ and 0 elsewhere.This is a linear feature map:$$\phi_{\text{one-hot}}: \mathcal{V} \to \mathbb{R}^{|\mathcal{V}|}.$$For documents, we can use **bag of words** representations:- Count how many times each vocabulary word appears in the document.- Optionally normalize or weight using tf-idf.These are high dimensional and sparse, but conceptually simple.```{python}from collections import Counterfrom typing import List, Dict, Tupleimport numpy as npimport scipy.sparse as spdef build_vocabulary(corpus: List[List[str]], min_freq: int=1) -> Dict[str, int]:""" Build token to index mapping from a tokenized corpus. corpus: list of documents, each a list of tokens. """ freq = Counter()for doc in corpus: freq.update(doc) vocab = {}for token, count in freq.items():if count >= min_freq: vocab[token] =len(vocab)return vocabdef one_hot_encode(token: str, vocab: Dict[str, int]) -> np.ndarray:"""Dense one hot vector for a single token.""" d =len(vocab) vec = np.zeros(d, dtype=float)if token in vocab: vec[vocab[token]] =1.0return vecdef bow_sparse_matrix( corpus: List[List[str]], vocab: Dict[str, int],) -> sp.csr_matrix:""" Build a sparse bag-of-words matrix X of shape (n_docs, |vocab|). Entry X[i, j] is the count of vocab token j in document i. """ rows = [] cols = [] data = []for i, doc inenumerate(corpus): counts = Counter(token for token in doc if token in vocab)for token, c in counts.items(): rows.append(i) cols.append(vocab[token]) data.append(c) n_docs =len(corpus) X = sp.csr_matrix((data, (rows, cols)), shape=(n_docs, len(vocab)), dtype=float)return Xdef tfidf_transform(X: sp.csr_matrix) -> sp.csr_matrix:""" Apply a simple tf-idf transform to a sparse count matrix X. tf = count in document / total counts in document. idf = log((1 + n_docs) / (1 + df)) + 1, where df is document frequency. """ n_docs, n_terms = X.shape# Document frequency for each term. df = (X >0).sum(axis=0).A1 # shape: (n_terms,) idf = np.log((1+ n_docs) / (1+ df)) +1.0# Term frequencies: normalize rows to sum to 1 row_sums = np.asarray(X.sum(axis=1)).ravel() row_sums[row_sums ==0] =1.0 inv_row_sums =1.0/ row_sums D_inv = sp.diags(inv_row_sums) tf = D_inv @ X# Apply idf. IDF = sp.diags(idf) tfidf = tf @ IDFreturn tfidf```Notes:- We use `scipy.sparse` to handle large vocabularies and corpora. In production, bag-of-words matrices can be extremely sparse and must be stored and processed as such.- `tfidf_transform` illustrates a common weighting scheme that downweights very frequent words.Even though one-hot and bag-of-words representations are simple, they already enable powerful linear models (SVMs, logistic regression) for text classification and retrieval tasks.### Learned linear projectionsOne-hot vectors are extremely high-dimensional. A standard remedy is to learn a linear projection to a lower-dimensional space.Let $x \in \mathbb{R}^{D}$ be a feature vector (for example, bag of words). A linear embedding is$$z = W x, \quad W \in \mathbb{R}^{d \times D},$$where typically $d \ll D$. In neural networks, such linear layers are ubiquitous.```{python}class LinearProjection:""" Simple linear feature map z = W x + b. For high dimensional sparse x, we assume x is a scipy sparse CSR row vector. """def__init__(self, in_dim: int, out_dim: int, rng: np.random.Generator |None=None):self.in_dim = in_dimself.out_dim = out_dimif rng isNone: rng = np.random.default_rng(0)# Xavier initialization limit = np.sqrt(6.0/ (in_dim + out_dim))self.W = rng.uniform(-limit, limit, size=(out_dim, in_dim)).astype(np.float32)self.b = np.zeros(out_dim, dtype=np.float32)def__call__(self, x) -> np.ndarray:""" Apply linear map. x can be dense (1D array) or sparse row vector. Returns dense vector z of shape (out_dim,). """if sp.issparse(x): z =self.W @ x.T # shape (out_dim, 1) z = np.asarray(z).reshape(-1).astype(np.float32)else: x = np.asarray(x, dtype=np.float32) z =self.W @ xreturn z +self.b```For very large dimensions, we rarely store full $W$ explicitly unless $in_{dim}$ is already modest. For very large categorical spaces we use embedding tables, which is the next topic.------------------------------------------------------------------------## Embedding tables for high cardinality categorical featuresIn large production systems, we often have categorical IDs:- User IDs, item IDs, product IDs.- Token IDs in very large vocabularies.- Categorical combinations of features.A standard pattern is to maintain an embedding table:$$E \in \mathbb{R}^{|\mathcal{C}| \times d},$$where each category $c \in \mathcal{C}$ has a learned vector $E_c \in \mathbb{R}^d$.Lookup is simply:$$\phi(c) = E_c.$$The gradient during training only updates the rows of $E$ that were actually used in the batch. This sparsity is crucial at large-scale.------------------------------------------------------------------------### The General FrameworkAn embedding is: **discrete object → continuous vector such that vector similarity reflects meaningful object similarity****What are you embedding?**| Unit Level | Text Domain | Graph Domain | Image Domain | Recommendation Domain ||---------------|---------------|---------------|---------------|---------------|| Atomic element | Token (Word2Vec, GloVe) | Node (DeepWalk, Node2Vec, GraphSAGE) | Pixel/Patch (ViT patches) | User or Item (matrix factorization) || Pairwise relation | Word pair / collocation | Edge (TransE, DistMult, RotatE) | Spatial relationship | User-item interaction || Multi-way relation | n-gram, phrase | Hyperedge (HyperGCN, HGNN) | Scene graph relations | Session or basket || Local structure | Sentence (Sentence-BERT) | Subgraph / Motif (Subgraph2Vec) | Object / Region (RoI features) | User neighborhood || Community / cluster | Paragraph, topic | Community (ComE, vGraph) | Image segment | User segment || Whole structure | Document (Doc2Vec) | Graph (Graph2Vec, DiffPool) | Full image (ResNet, CLIP) | Full catalog / platform |: The Two Universal Questions1. **What's the unit?**Determined by your downstream task- Classifying documents → document embedding.- Predicting links → edge or node-pair embedding.- Comparing networks → graph embedding.2. **What training signal defines similarity?**| Signal Type | How It Works | Examples ||-------------------------|---------------------------|--------------------|| Co-occurrence | Things in similar contexts are similar | Word2Vec, DeepWalk, GloVe || Supervision | Things with same labels are similar | Fine-tuned BERT, supervised GNN || Contrastive | Push positives together, negatives apart | SimCLR, GraphCL, CLIP || Reconstruction | Good embeddings can rebuild the input | Autoencoders, GAE || Translation/transformation | Relations as operations on embeddings | TransE, RotatE |Everything else is implementation detail.### Numpy implementation of an embedding table```{python}from dataclasses import dataclassfrom typing import Sequence@dataclassclass EmbeddingTable:""" Simple embedding table for categorical IDs 0, 1, ..., num_embeddings - 1. """ num_embeddings: int embedding_dim: int weights: np.ndarray # shape (num_embeddings, embedding_dim)@classmethoddef random( cls, num_embeddings: int, embedding_dim: int, rng: np.random.Generator |None=None, ) ->"EmbeddingTable":if rng isNone: rng = np.random.default_rng(0) limit = np.sqrt(6.0/ (num_embeddings + embedding_dim)) weights = rng.uniform(-limit, limit, size=(num_embeddings, embedding_dim)).astype(np.float32)return cls(num_embeddings=num_embeddings, embedding_dim=embedding_dim, weights=weights)def lookup(self, ids: Sequence[int]) -> np.ndarray:""" Return embedding matrix of shape (len(ids), embedding_dim). """ ids = np.asarray(ids, dtype=int)returnself.weights[ids]def update(self, ids: Sequence[int], grad_embeddings: np.ndarray, lr: float) ->None:""" Gradient descent update for a batch of embeddings. ids: indices used in this batch, shape (batch_size,) grad_embeddings: gradient wrt embeddings, shape (batch_size, embedding_dim) Only those rows are updated. """ ids = np.asarray(ids, dtype=int)assert grad_embeddings.shape == (len(ids), self.embedding_dim)# Accumulate gradient per row in case of duplicates grad_accum = np.zeros_like(self.weights) np.add.at(grad_accum, ids, grad_embeddings)self.weights -= lr * grad_accum```This pattern (store a big matrix, do row lookups, update only those rows) is the basis of token embeddings in language models, as well as user and item embeddings in recommender systems.In production with billions of IDs:- The embedding table may be sharded across many machines.- Some rows may be evicted or stored on slower storage when rarely used.- Quantization and compression of embeddings become important for memory and bandwidth.Conceptually, though, the core abstraction remains: an embedding table is a learned dictionary from IDs to vectors.------------------------------------------------------------------------## 4. Nonlinear feature mapsLinear maps can only capture linear structure. To represent more complex patterns, we use nonlinear feature maps.Two complementary viewpoints:1. Explicit nonlinear transformations $\phi(x)$ applied before a linear model.2. Deep networks where layers and activations jointly form $\phi_\psi(x)$.### 4.1 Kernel features and random Fourier featuresClassical kernel methods rely on the idea of mapping inputs into a (possibly infinite dimensional) feature space where linear methods are powerful.Given a positive definite kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$, one can work with $\phi$ implicitly via kernel evaluations. However, this does not scale well to large datasets.**Random Fourier features** approximate shift invariant kernels such as the Gaussian kernel:$$k(x, x') = \exp\left( -\frac{|x - x'|_2^2}{2 \sigma^2} \right).$$The key result {cite}`rahimi2007random` is that we can approximate $k(x, x')$ with a randomly generated finite dimensional map $\phi(x) \in \mathbb{R}^D$ such that$$k(x, x') \approx \phi(x)^\top \phi(x').$$Construction:- Sample $D$ random frequencies $\omega_i \sim \mathcal{N}(0, \sigma^{-2} I_d)$.- Sample random phases $b_i \sim \text{Uniform}[0, 2\pi]$.- Define $$ \phi(x) = \sqrt{\frac{2}{D}} \left[\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)\right]. $$Then inner products in the feature space approximate the Gaussian kernel.```{python}class RandomFourierFeatures:""" Random Fourier feature map approximating a Gaussian kernel. phi(x) = sqrt(2 / D) * cos(W x + b), where rows of W are drawn from N(0, sigma^{-2} I). """def__init__(self, input_dim: int, num_features: int, sigma: float, rng: np.random.Generator |None=None):self.input_dim = input_dimself.num_features = num_featuresself.sigma = sigmaif rng isNone: rng = np.random.default_rng(0)self.W = rng.normal(loc=0.0, scale=1.0/ sigma, size=(num_features, input_dim)).astype(np.float32)self.b = rng.uniform(0.0, 2.0* np.pi, size=(num_features,)).astype(np.float32)def__call__(self, X: np.ndarray) -> np.ndarray:""" Map a batch X of shape (n_samples, input_dim) to random Fourier features of shape (n_samples, num_features). """ X = np.asarray(X, dtype=np.float32)# Compute W x + b for each sample. WX = X @self.W.T # shape (n_samples, num_features)return np.sqrt(2.0/self.num_features) * np.cos(WX +self.b)```We can then train a linear model on top of `RandomFourierFeatures` to approximate a nonlinear kernel method while keeping training and inference linear in the dataset size and feature dimension.### 4.2 Multilayer nonlinear feature maps (feedforward networks)In deep learning, we implement nonlinear feature maps as compositions of linear layers and nonlinear activation functions:$$\phi(x) = \phi^{(L)} \circ \dots \circ \phi^{(1)}(x),$$where each layer has the form$$\phi^{(\ell)}(h) = \sigma(W^{(\ell)} h + b^{(\ell)}).$$Even a two layer network can represent highly nonlinear decision boundaries.A minimalist fully connected network for tabular inputs using NumPy:```{python}class TwoLayerMLP:""" Two layer perceptron: h = relu(W1 x + b1), z = W2 h + b2. """def__init__(self, input_dim: int, hidden_dim: int, output_dim: int, rng: np.random.Generator |None=None):if rng isNone: rng = np.random.default_rng(0)self.input_dim = input_dimself.hidden_dim = hidden_dimself.output_dim = output_dim limit1 = np.sqrt(6.0/ (input_dim + hidden_dim)) limit2 = np.sqrt(6.0/ (hidden_dim + output_dim))self.W1 = rng.uniform(-limit1, limit1, size=(hidden_dim, input_dim)).astype(np.float32)self.b1 = np.zeros(hidden_dim, dtype=np.float32)self.W2 = rng.uniform(-limit2, limit2, size=(output_dim, hidden_dim)).astype(np.float32)self.b2 = np.zeros(output_dim, dtype=np.float32)@staticmethoddef relu(z: np.ndarray) -> np.ndarray:return np.maximum(z, 0.0)def forward(self, X: np.ndarray) -> np.ndarray: X = np.asarray(X, dtype=np.float32) h =self.relu(X @self.W1.T +self.b1) z = h @self.W2.T +self.b2return zdef forward_hidden(self, X: np.ndarray) -> np.ndarray:"""Return hidden representation h, which is a learned nonlinear feature.""" X = np.asarray(X, dtype=np.float32)returnself.relu(X @self.W1.T +self.b1)```Training this MLP with SGD on a classification task will produce a hidden representation `h` that is tuned to separate the classes in a way that a simple linear classifier in `h` space can use.In practice, we rely on frameworks like PyTorch or JAX for automatic differentiation and efficient batched operations, but the conceptual structure is the same.------------------------------------------------------------------------## 5. Distributional representations of words and tokensNatural language is the canonical domain where embeddings transformed the field. The **distributional hypothesis** {cite}`harris1954distributional` states:> Words that occur in similar contexts tend to have similar meanings.Word embeddings operationalize this by learning vector representations where words with similar usage patterns are close in vector space.### 5.1 Cooccurrence and matrix factorizationLet $\mathcal{V}$ be a vocabulary. Given a large corpus, we can build a **cooccurrence matrix** $X \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ where $X_{ij}$ counts how often word $j$ appears in a context window around word $i$.We can then seek low rank matrices $W, C \in \mathbb{R}^{|\mathcal{V}| \times d}$ such that$$W C^\top \approx f(X)$$for some transformation $f$, for example logarithm or pointwise mutual information (PMI). The rows of $W$ and $C$ are word and context embeddings. GloVe {cite}`pennington2014glove` is a refined version of this idea.Below is a small pipeline to build a cooccurrence matrix and derive embeddings via truncated SVD.```{python}from itertools import islicefrom typing import Iterableimport numpy as npimport scipy.sparse as spfrom sklearn.decomposition import TruncatedSVDdef build_cooccurrence_matrix( corpus: List[List[str]], vocab: Dict[str, int], window_size: int=4, symmetric: bool=True, min_cooccurrence: int=1,) -> sp.coo_matrix:""" Build a word-word cooccurrence matrix X where X[i, j] counts how often j appears in the context window of i. """ rows = [] cols = [] data = []for doc in corpus: ids = [vocab[token] for token in doc if token in vocab]for center_pos, center_id inenumerate(ids): left =max(0, center_pos - window_size) right =min(len(ids), center_pos + window_size +1)for pos inrange(left, right):if pos == center_pos:continue context_id = ids[pos] rows.append(center_id) cols.append(context_id) data.append(1.0)if symmetric: rows.append(context_id) cols.append(center_id) data.append(1.0) n_vocab =len(vocab) X = sp.coo_matrix((data, (rows, cols)), shape=(n_vocab, n_vocab), dtype=np.float32)if min_cooccurrence >1: X = X.tocsr() X.data[X.data < min_cooccurrence] =0.0 X.eliminate_zeros() X = X.tocoo()return Xdef pmi_matrix(X: sp.csr_matrix, positive: bool=True, eps: float=1e-8) -> sp.csr_matrix:""" Compute (positive) PMI matrix from a cooccurrence matrix X. PMI(i, j) = log ( p(i, j) / (p(i) p(j)) ). """ X = X.tocsr() total = X.sum() sum_over_rows = np.asarray(X.sum(axis=1)).ravel() # p(i) sum_over_cols = np.asarray(X.sum(axis=0)).ravel() # p(j) rows, cols = X.nonzero() data = []for i, j inzip(rows, cols): p_ij = X[i, j] / total p_i = sum_over_rows[i] / total p_j = sum_over_cols[j] / total pmi = np.log(p_ij / (p_i * p_j + eps) + eps)if positive: pmi =max(pmi, 0.0) data.append(pmi) M = sp.csr_matrix((data, (rows, cols)), shape=X.shape, dtype=np.float32)return Mdef svd_embeddings( M: sp.csr_matrix, dim: int, rng: np.random.Generator |None=None,) -> np.ndarray:""" Compute low rank embeddings from a matrix M via truncated SVD. Returns word embeddings of shape (n_vocab, dim). """if rng isNone: rng = np.random.default_rng(0) svd = TruncatedSVD(n_components=dim, random_state=rng.integers(10**9)) U = svd.fit_transform(M) # shape (n_vocab, dim)return U.astype(np.float32)```This approach is not how very large systems train embeddings in practice, because:- A dense $|\mathcal{V}| \times |\mathcal{V}|$ matrix is too large.- SVD on massive matrices is expensive.However, the **matrix factorization viewpoint** is conceptually important. It connects to:- Word2vec as implicit factorization of a shifted PMI matrix.- Collaborative filtering in recommender systems as factorization of user item matrices.### 5.2 Skip gram with negative sampling (word2vec style)Word2vec {cite}`mikolov2013efficient` introduced a very scalable embedding training method. The skip gram with negative sampling (SGNS) objective is:- Given a center word $w$ and a context word $c$ observed in the corpus, encourage $\langle v_w, u_c \rangle$ to be large.- For randomly sampled negative contexts $c'$, encourage $\langle v_w, u_{c'} \rangle$ to be small.For a single positive pair $(w, c)$ and $k$ negative samples $c_1', \dots, c_k'$, the loss is$$ \ell(w, c) = -\log \sigma(v_w^\top u_c) - \sum_{j=1}^k \log \sigma(-v_w^\top u_{c_j'}) $$where $\sigma(z) = 1 / (1 + e^{-z})$.A stripped down implementation for demonstration:```{python}class SkipGramNegativeSampling:""" Minimal skip-gram with negative sampling trainer for word embeddings. """def__init__(self, vocab_size: int, embedding_dim: int, num_negative: int=5, lr: float=0.025, rng: np.random.Generator |None=None, ):if rng isNone: rng = np.random.default_rng(0)self.vocab_size = vocab_sizeself.embedding_dim = embedding_dimself.num_negative = num_negativeself.lr = lr# Center word vectors v and context vectors u limit =0.5/ embedding_dimself.v = rng.uniform(-limit, limit, size=(vocab_size, embedding_dim)).astype(np.float32)self.u = np.zeros((vocab_size, embedding_dim), dtype=np.float32)# Unigram table for negative sampling (simplified) probs = np.ones(vocab_size, dtype=np.float64) probs /= probs.sum()self.neg_sampler_probs = probsself.rng = rng@staticmethoddef sigmoid(z: np.ndarray) -> np.ndarray:return1.0/ (1.0+ np.exp(-z))def sample_negative(self, batch_size: int) -> np.ndarray:""" Sample batch_size * num_negative negative contexts. Returns shape (batch_size, num_negative) of word IDs. """returnself.rng.choice(self.vocab_size, size=(batch_size, self.num_negative), p=self.neg_sampler_probs, replace=True, )def train_batch(self, center_ids: np.ndarray, context_ids: np.ndarray, ) ->float:""" Train on a batch of positive (center, context) pairs. center_ids, context_ids: shape (batch_size,) Returns average loss. """ center_ids = np.asarray(center_ids, dtype=int) context_ids = np.asarray(context_ids, dtype=int) batch_size = center_ids.shape[0] v_c =self.v[center_ids] # shape (B, d) u_pos =self.u[context_ids] # shape (B, d) neg_ids =self.sample_negative(batch_size) # (B, K) u_neg =self.u[neg_ids] # (B, K, d)# Positive logits and loss pos_logits = np.sum(v_c * u_pos, axis=1) # shape (B,) pos_sigmoid =self.sigmoid(pos_logits) pos_loss =-np.log(pos_sigmoid +1e-10)# Negative logits and loss# Reshape for broadcasting: v_c: (B, 1, d), u_neg: (B, K, d) v_c_exp = v_c[:, None, :] neg_logits = np.sum(v_c_exp * u_neg, axis=2) # (B, K) neg_sigmoid =self.sigmoid(-neg_logits) neg_loss =-np.sum(np.log(neg_sigmoid +1e-10), axis=1) # (B,) loss = np.mean(pos_loss + neg_loss)# Gradients# dL/d pos_logits = pos_sigmoid - 1 grad_pos_logits = pos_sigmoid -1.0# (B,) grad_v_pos = grad_pos_logits[:, None] * u_pos grad_u_pos = grad_pos_logits[:, None] * v_c# For negatives: dL/d(-neg_logits) = 1 - neg_sigmoid# dL/d neg_logits = neg_sigmoid - 1 grad_neg_logits = neg_sigmoid -1.0# (B, K) grad_v_neg = np.sum(grad_neg_logits[:, :, None] * u_neg, axis=1) # (B, d) grad_u_neg = grad_neg_logits[:, :, None] * v_c_exp # (B, K, d) grad_v = grad_v_pos + grad_v_neg # (B, d)# Apply updates to v and u# Accumulate center gradients grad_v_accum = np.zeros_like(self.v) np.add.at(grad_v_accum, center_ids, grad_v)self.v -=self.lr * grad_v_accum# Accumulate positive context gradients grad_u_accum = np.zeros_like(self.u) np.add.at(grad_u_accum, context_ids, grad_u_pos)# Accumulate negative context gradientsfor b inrange(batch_size): np.add.at(grad_u_accum, neg_ids[b], grad_u_neg[b])self.u -=self.lr * grad_u_accumreturnfloat(loss)```This implementation is intentionally explicit to show how embeddings and gradients interact. In real systems:- Batching is more careful for speed.- Negative samplers use frequency adjusted distributions.- Training is distributed across many workers.Still, the core idea is:- Embeddings are rows in matrices.- Training updates those rows based on cooccurrence patterns.------------------------------------------------------------------------## 6. Manifolds and the geometry of dataThe **manifold hypothesis** states that high dimensional data of interest often lies near a low dimensional manifold embedded in $\mathbb{R}^D$:> Although the ambient space is high dimensional, the data concentrates near a lower dimensional structure.Representations and embeddings can be seen as attempts to find coordinate systems that make this structure simple.### 6.1 Linear manifolds: PCAPrincipal component analysis (PCA) is the canonical linear method for discovering low dimensional structure. Given centered data $X \in \mathbb{R}^{n \times D}$, PCA finds directions $w_1, \dots, w_d$ that maximize variance.The PCA embedding is:$$z_i = W^\top x_i,\quad W = [w_1, \dots, w_d] \in \mathbb{R}^{D \times d}.$$In matrix terms, the top $d$ singular vectors of $X$ define the projection.```{python}from sklearn.decomposition import PCAfrom sklearn.datasets import make_swiss_rolldef pca_embedding(X: np.ndarray, dim: int) -> np.ndarray: pca = PCA(n_components=dim) Z = pca.fit_transform(X)return Z.astype(np.float32)def demo_pca_on_swiss_roll():""" Illustrate linear vs nonlinear embeddings on a 3D swiss roll manifold. """ X, _ = make_swiss_roll(n_samples=2000, noise=0.05)# X has shape (n_samples, 3) Z_linear = pca_embedding(X, dim=2)print("Original shape:", X.shape)print("Linear embedding shape:", Z_linear.shape)```PCA is efficient and widely used as a preprocessing step, but it can only recover linear manifolds. Many complex data manifolds (for example, images) are highly nonlinear.### 6.2 Nonlinear manifolds: autoencodersTo handle nonlinear structure, we use **autoencoders**:- An encoder maps $x$ to a latent code $z$: $$ z = \text{enc}_\psi(x). $$- A decoder reconstructs $x$ from $z$: $$ \hat{x} = \text{dec}_\theta(z). $$- We train by minimizing a reconstruction loss: $$ \min_{\psi, \theta} \ \frac{1}{n} \sum_{i=1}^n |x_i - \hat{x}_i|_2^2. $$The latent $z$ is a learned nonlinear embedding. With appropriate architectures and regularization, autoencoders discover useful structure in data.Below is a concise PyTorch autoencoder for 2D embeddings of 784 dimensional images (for example MNIST). Even if you do not run PyTorch in your pipeline, the structure is worth understanding.```{python}import torchfrom torch import nnfrom torch.utils.data import DataLoader, TensorDatasetclass Autoencoder(nn.Module):def__init__(self, input_dim: int, latent_dim: int):super().__init__()self.encoder = nn.Sequential( nn.Linear(input_dim, 400), nn.ReLU(), nn.Linear(400, 200), nn.ReLU(), nn.Linear(200, latent_dim), )self.decoder = nn.Sequential( nn.Linear(latent_dim, 200), nn.ReLU(), nn.Linear(200, 400), nn.ReLU(), nn.Linear(400, input_dim), )def forward(self, x): z =self.encoder(x) x_rec =self.decoder(z)return x_rec, zdef train_autoencoder(X: np.ndarray, latent_dim: int=2, num_epochs: int=10, batch_size: int=256): X_tensor = torch.from_numpy(X.astype(np.float32)) dataset = TensorDataset(X_tensor) loader = DataLoader(dataset, batch_size=batch_size, shuffle=True) model = Autoencoder(input_dim=X.shape[1], latent_dim=latent_dim) optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) loss_fn = nn.MSELoss() model.train()for epoch inrange(num_epochs): total_loss =0.0for (batch,) in loader: optimizer.zero_grad() recon, _ = model(batch) loss = loss_fn(recon, batch) loss.backward() optimizer.step() total_loss += loss.item() * batch.size(0) avg_loss = total_loss /len(dataset)print(f"Epoch {epoch:3d}, reconstruction loss {avg_loss:.6f}")# Return latent codes for all data model.eval()with torch.no_grad(): _, Z = model(X_tensor)return model, Z.numpy()```Production level notes:- For large datasets, we stream from disk or remote storage.- For large images, we use convolutional encoders and decoders, not fully connected ones.- For stability and better structure in $z$, we might use **variational autoencoders**, which connect directly to information bottleneck ideas (next section).Manifold based thinking reappears constantly when we visualize and interpret representations. Tools like t-SNE or UMAP {cite}`vanDerMaaten2008tsne,mcinnes2018umap` are essentially nonlinear embedding methods for visualization.------------------------------------------------------------------------## 7. Information bottleneck and compressionThe **information bottleneck** (IB) framework {cite}`tishby1999information` provides a principled way to think about representations as lossy compressions.We have:- Input random variable $X$.- Target variable $Y$ (for example, labels).- Representation $Z$ produced from $X$.The IB principle:> Find a representation $Z$ that preserves as much information about $Y$ as possible, while being as compressed a summary of $X$ as possible.Formally, we consider the optimization problem$$\min_{p(z \mid x)} I(X; Z) - \beta I(Z; Y),$$where:- $I(\cdot; \cdot)$ denotes mutual information.- $I(X; Z)$ measures how much information $Z$ retains from $X$ (we want this small).- $I(Z; Y)$ measures how much information $Z$ has about $Y$ (we want this large).- $\beta > 0$ controls the tradeoff.While directly solving this optimization is hard for complex models, the IB perspective leads to:- Regularization schemes that encourage compression.- Variational approximations like variational autoencoders.- Intuitions about generalization: models that compress input information but retain label information may generalize better.### 7.1 Variational autoencoders (high level)Variational autoencoders (VAEs) implement a stochastic encoder:- Encoder: $q_\phi(z \mid x)$.- Decoder: $p_\theta(x \mid z)$.They optimize the evidence lower bound (ELBO):$$\mathcal{L}(\theta, \phi; x) =\mathbb{E}*{q*\phi(z \mid x)}[\log p_\theta(x \mid z)] - \text{KL}(q_\phi(z \mid x) \parallel p(z)),$$where $p(z)$ is a prior, typically standard normal.This has an IB flavor:- The reconstruction term encourages $Z$ to retain information about $X$.- The KL term encourages $Z$ to be close to the prior, limiting capacity and compressing information.Adjusting the weight of the KL term (as in $\beta$-VAEs) makes the connection to the information bottleneck explicit.### 7.2 Compression of embedding tablesThe IB ideas also suggest compressing representations where possible, especially in production:- Embedding table compression: - Reduce embedding dimension. - Apply quantization (for example 8 bit or 4 bit) to embedding vectors. - Factorize embedding matrices into smaller components.- Model compression: - Prune weights with small magnitudes. - Distill representations from a large model into a smaller one.A simple example: post training quantization of embeddings to 8 bits centered around zero.```{python}def quantize_embeddings_8bit(E: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:""" Uniformly quantize embeddings E to 8 bit signed integers per column. Returns integer matrix Q and per-dimension scale factors s such that E ≈ s * Q, with broadcasting s of shape (1, d). """ E = np.asarray(E, dtype=np.float32) max_abs = np.max(np.abs(E), axis=0) # shape (d,) max_abs[max_abs ==0] =1.0# Map to [-127, 127] scale = max_abs /127.0 Q = np.round(E / scale).astype(np.int8)return Q, scaledef dequantize_embeddings_8bit(Q: np.ndarray, scale: np.ndarray) -> np.ndarray:""" Recover approximate embeddings from quantized representation. """return Q.astype(np.float32) * scale```In large systems, more sophisticated schemes are used, but the basic pattern is:- Compress embedding representations while keeping information about targets intact.- Use approximate nearest neighbor search algorithms that operate efficiently on compressed embeddings.------------------------------------------------------------------------## 8. Geometry of learned representationsFinally, we turn to the geometry of representation spaces and how to analyze them.### 8.1 Similarity measuresGiven embeddings $z, z' \in \mathbb{R}^d$, typical similarity measures include:- Euclidean distance: $$ d_{\text{L2}}(z, z') = |z - z'|_2. $$- Cosine similarity: $$ \cos(z, z') = \frac{z^\top z'}{|z|_2 |z'|_2}. $$Cosine similarity is scale invariant and widely used for comparing word and sentence embeddings.```{python}def l2_distance(a: np.ndarray, b: np.ndarray, eps: float=1e-12) ->float:returnfloat(np.linalg.norm(a - b))def cosine_similarity(a: np.ndarray, b: np.ndarray, eps: float=1e-12) ->float: a = np.asarray(a, dtype=np.float32) b = np.asarray(b, dtype=np.float32) na = np.linalg.norm(a) nb = np.linalg.norm(b)if na < eps or nb < eps:return0.0returnfloat(a @ b / (na * nb))```For batches, we can compute all pairwise cosine similarities efficiently:```{python}def batch_cosine_similarity(E: np.ndarray) -> np.ndarray:""" Compute cosine similarity matrix S for embeddings E of shape (n, d), so that S[i, j] is the cosine similarity between E[i] and E[j]. """ E = np.asarray(E, dtype=np.float32) norms = np.linalg.norm(E, axis=1, keepdims=True) +1e-12 E_norm = E / normsreturn E_norm @ E_norm.T # shape (n, n)```In production retrieval systems, we rarely compute full $n \times n$ similarity matrices, but we rely on approximate nearest neighbor indices (for example HNSW or IVF) to retrieve top similar vectors in logarithmic or sublinear time.### 8.2 Nearest neighbors and analogiesOne of the most striking properties of good word embeddings is that they support semantic analogies. For example, with appropriate training, one can see:$$\text{vec}(\text{"king"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) \approx \text{vec}(\text{"queen"}).$$Practically, we often want to find nearest neighbors of a given embedding in embedding space.```{python}def knn( query: np.ndarray, E: np.ndarray, k: int=10, metric: str="cosine",) -> np.ndarray:""" Return indices of k nearest neighbors of query in embedding matrix E. E has shape (n_items, d). query has shape (d,). """ query = np.asarray(query, dtype=np.float32) E = np.asarray(E, dtype=np.float32)if metric =="cosine": norms = np.linalg.norm(E, axis=1) * (np.linalg.norm(query) +1e-12) scores = E @ query / (norms +1e-12)# Larger is more similar idx = np.argsort(-scores)[:k]elif metric =="l2": diffs = E - query[None, :] dists = np.sum(diffs * diffs, axis=1) idx = np.argsort(dists)[:k]else:raiseValueError(f"Unknown metric: {metric}")return idx```To use this for words we need a mapping between vocabulary indices and tokens:```{python}def build_index_to_token(vocab: Dict[str, int]) -> Dict[int, str]:return {idx: tok for tok, idx in vocab.items()}def nearest_words( word: str, vocab: Dict[str, int], E: np.ndarray, k: int=10,) -> List[Tuple[str, float]]:""" Return top k nearest words to the given word based on cosine similarity. """if word notin vocab:raiseValueError(f"Unknown word: {word}") idx = vocab[word] query = E[idx] sim_matrix = batch_cosine_similarity(E) sims = sim_matrix[idx]# Exclude the word itself sims[idx] =-1.0 top_idx = np.argsort(-sims)[:k] idx_to_token = build_index_to_token(vocab)return [(idx_to_token[j], float(sims[j])) for j in top_idx]```In large production systems, similar functionality is implemented using:- Approximate nearest neighbor search (for example FAISS, ScaNN).- Precomputed indices updated periodically as embeddings change.The core geometry is the same: similar items correspond to nearby vectors.### 8.3 Linear probesTo assess the quality of a representation, we can train a simple **linear probe**:- Freeze the representation $\phi(x)$.- Train a linear classifier or regressor $g_\theta(z)$ on top of $z = \phi(x)$.If a simple linear probe can solve a task well, we say the relevant information is **linearly accessible** in the representation.In code, assuming we have embeddings `Z` of shape `(n_samples, d)` and labels `y`:```{python}from sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import accuracy_scorefrom sklearn.model_selection import train_test_splitdef linear_probe_classification(Z: np.ndarray, y: np.ndarray, test_size: float=0.2) ->float:""" Train a linear classifier on fixed embeddings Z and evaluate accuracy. """ X_train, X_test, y_train, y_test = train_test_split(Z, y, test_size=test_size, random_state=0) clf = LogisticRegression(max_iter=1000) clf.fit(X_train, y_train) y_pred = clf.predict(X_test) acc = accuracy_score(y_test, y_pred)returnfloat(acc)```This pattern is heavily used to evaluate representations of large pretrained models:- For language models: are syntax, semantics, sentiment, etc. linearly separable?- For vision models: are object categories, pose, or texture linearly accessible?From a practical standpoint, high quality representations can dramatically reduce the amount of labeled data and training required for downstream tasks.### 8.4 Anisotropy and whiteningReal world embeddings often exhibit **anisotropy**: the distribution of vectors is not isotropic in $\mathbb{R}^d$. For example:- Many vectors may have large projections along a few dominant directions.- Cosine similarities may be biased upward due to a shared mean vector.This can degrade nearest neighbor quality and interpretability. Common remedies include:- Centering: subtract the mean vector from all embeddings.- Whitening: apply a linear transform that makes covariance close to identity.- Removing top principal components.```{python}def center_embeddings(E: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:""" Center embeddings by subtracting the mean vector. Returns centered embeddings and the mean. """ mean = np.mean(E, axis=0, keepdims=True) E_centered = E - meanreturn E_centered, meandef whiten_embeddings(E: np.ndarray, eps: float=1e-5) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:""" Whiten embeddings: transform them so that covariance is close to identity. Returns whitened embeddings, mean, and whitening matrix W such that E_whitened = (E - mean) @ W. """ E = np.asarray(E, dtype=np.float32) n, d = E.shape mean = np.mean(E, axis=0, keepdims=True) X = E - mean# Covariance C = (X.T @ X) / (n -1) # shape (d, d)# Eigen decomposition evals, evecs = np.linalg.eigh(C)# Avoid division by zero evals_clipped = np.maximum(evals, eps) D_inv_sqrt = np.diag(1.0/ np.sqrt(evals_clipped)) W = evecs @ D_inv_sqrt @ evecs.T E_whitened = X @ Wreturn E_whitened, mean, W```Such geometric preprocessing can significantly improve downstream tasks that rely on distances and similarities in embedding space.------------------------------------------------------------------------## 9. Summary and connectionsIn this chapter, we developed a detailed view of representations and embeddings:- We formalized representations as mappings $\phi: \mathcal{X} \to \mathbb{R}^d$ and motivated vector spaces as the natural setting for learning and inference.- We built linear feature maps, including one hot encoding, bag of words, and learned linear projections, with attention to sparse handling and scalability.- We introduced embedding tables for large categorical spaces, emphasizing the row lookup and sparse update pattern that is central to production systems.- We explored nonlinear feature maps using random Fourier features and multilayer perceptrons as learned nonlinear embeddings.- We developed distributional word embeddings via cooccurrence matrices, PMI, and truncated SVD, and then moved to skip gram with negative sampling as an explicitly trainable embedding model.- We framed high dimensional data through the manifold hypothesis, using PCA and autoencoders to illustrate linear and nonlinear embeddings.- We connected representations to information bottleneck ideas, highlighting how compression and relevance trade off and how this appears in variational autoencoders and embedding compression.- We examined the geometry of embedding spaces through distances, cosine similarity, nearest neighbors, linear probes, anisotropy, and whitening.These ideas are not isolated. They are deeply woven into the topics that follow:- **Attention mechanisms** operate on sequences of embeddings, using dot product similarities to compute weighted combinations.- **Contrastive learning** often trains embeddings so that positive pairs are close and negative pairs are far in embedding space, directly operating on the geometric properties we analyzed.- **Generative models** (language models, diffusion models, VAEs) transform embeddings through deep architectures, and their quality depends heavily on the structure of the latent spaces they learn.A strong grasp of representations and embeddings is essential for understanding the behavior, limitations, and potential of modern AI systems. In later chapters, we will repeatedly return to these concepts, especially when designing architectures, choosing pretraining objectives, and analyzing model behavior.