50 Mathematical Notation Reference

This chapter consolidates the mathematical notation used throughout the book. It is meant as a quick lookup rather than a tutorial. Symbols are grouped by area, and each table maps a symbol to its meaning. Where the same letter carries different meanings in different areas, context resolves the ambiguity. We follow standard conventions: lowercase italic for scalars, lowercase bold for vectors, uppercase bold for matrices, and calligraphic or blackboard letters for sets.

50.1 How to read this reference

Notation is a compression scheme for mathematical ideas. A good convention is consistent (the same symbol means the same kind of thing everywhere) and unambiguous in context (any genuine collision is resolved by the surrounding sentence). This book follows the dominant conventions of the machine learning literature, which trace back to the textbooks listed under Further Reading. The table below fixes the typographic rules so that the type of an object is visible at a glance, before you read its name.

Form	Object	Example
lowercase italic	scalar	$x$, $\eta$, $\lambda$
lowercase bold	(column) vector	$\mathbf{x}$, $\boldsymbol{\theta}$
uppercase bold	matrix	$\mathbf{A}$, $\boldsymbol{\Sigma}$
uppercase italic	random variable	$X$, $Y$
calligraphic	set or operator	$\mathcal{X}$, $\mathcal{L}$
blackboard	number system or expectation	$\mathbb{R}$, $\mathbb{E}$
hat	estimate from data	$\hat{\theta}$, $\hat{y}$
star	optimal or true value	$\theta^\ast$, $\mathbf{x}^\ast$

Two reading habits make the rest of the chapter faster. First, indices are almost always one based and run over the dimension implied by the object, so $x_i$ ranges over the $n$ entries of $\mathbf{x} \in \mathbb{R}^n$. Second, dimensions should be checked like units in physics: an expression such as $\mathbf{A}\mathbf{x}$ is only well formed when the column count of $\mathbf{A}$ equals the length of $\mathbf{x}$, and tracking these shapes catches most notational errors before any computation.
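
The shape check described above can be made mechanical. The following is a minimal sketch in plain Python, not a library routine: `shape` and `matvec` are hypothetical helper names introduced only for illustration, and the code uses Python's zero based indexing even though the notation in this chapter is one based.

```python
def shape(A):
    """Return (rows, columns) of a matrix stored as a list of equal-length rows."""
    return (len(A), len(A[0]))


def matvec(A, x):
    """Compute A x, refusing to proceed when the shapes do not match."""
    m, n = shape(A)
    if n != len(x):
        raise ValueError(f"cannot multiply a {m}x{n} matrix by a length-{len(x)} vector")
    # Entry i of the result is the inner product of row i of A with x.
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # A is 2 x 3
x = [1.0, 0.0, -1.0]           # x has length 3, so A x is well formed
y = matvec(A, x)               # y has length 2
```

Calling `matvec(A, [1.0, 2.0])` raises a `ValueError` before any arithmetic happens, which is the point of checking dimensions like units.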

50.2 1. Sets and Logic

We treat sets as the foundational layer. A set is an unordered collection of distinct elements, and the symbols below describe membership, construction, and relationships between sets.

Symbol	Meaning
$\{a, b, c\}$	Set containing the listed elements
$\{x : P(x)\}$	Set of all $x$ satisfying property $P$
$x \in \mathcal{X}$	$x$ is an element of $\mathcal{X}$
$x \notin \mathcal{X}$	$x$ is not an element of $\mathcal{X}$
$\varnothing$	Empty set
$\mathcal{A} \subseteq \mathcal{B}$	$\mathcal{A}$ is a subset of $\mathcal{B}$
$\mathcal{A} \cup \mathcal{B}$	Union of $\mathcal{A}$ and $\mathcal{B}$
$\mathcal{A} \cap \mathcal{B}$	Intersection of $\mathcal{A}$ and $\mathcal{B}$
$\mathcal{A} \setminus \mathcal{B}$	Set difference (elements in $\mathcal{A}$ but not $\mathcal{B}$)
$\mathcal{A} \times \mathcal{B}$	Cartesian product
$\lvert \mathcal{A} \rvert$	Cardinality (number of elements)
$2^{\mathcal{A}}$	Power set (set of all subsets)
$\mathbb{N}, \mathbb{Z}, \mathbb{Q}$	Natural numbers, integers, rationals
$\mathbb{R}, \mathbb{C}$	Real numbers, complex numbers
$\mathbb{R}^n$	$n$ dimensional real coordinate space
$[a, b]$	Closed interval $\{x : a \le x \le b\}$

Two conventions are worth fixing. The Cartesian product $\mathcal{A} \times \mathcal{B} = \{(a, b) : a \in \mathcal{A},\, b \in \mathcal{B}\}$ produces ordered pairs, so it satisfies $\lvert \mathcal{A} \times \mathcal{B} \rvert = \lvert \mathcal{A} \rvert \cdot \lvert \mathcal{B} \rvert$ for finite sets, and the power set has $\lvert 2^{\mathcal{A}} \rvert = 2^{\lvert \mathcal{A} \rvert}$, which is the origin of the notation. A data set of $n$ labeled examples is usually written $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$, pairing each input from a feature space $\mathcal{X}$ with a label from a label space $\mathcal{Y}$.
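
The two cardinality identities can be checked by brute force on small sets. The sketch below uses only the Python standard library; the particular sets `A` and `B` are arbitrary choices made for illustration.

```python
from itertools import chain, combinations, product

A = {"a", "b", "c"}
B = {0, 1}

# Cartesian product: all ordered pairs (a, b) with a in A and b in B.
cartesian = set(product(A, B))

# Power set: every subset of A, from the empty set up to A itself.
elements = sorted(A)
power_set = [set(s) for s in chain.from_iterable(
    combinations(elements, k) for k in range(len(elements) + 1))]

print(len(cartesian), len(A) * len(B))   # both counts are 6
print(len(power_set), 2 ** len(A))       # both counts are 8
```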

Logical statements glue these objects together. The connectives below appear in proofs, definitions of estimators, and constraint specifications.

Symbol	Meaning
$\forall$	For all
$\exists$	There exists
$\neg$	Logical negation (not)
$\land, \lor$	Logical and, or
$\implies$	Implies
$\iff$	If and only if
$:=$	Defined to be equal to
$\propto$	Proportional to
$\mathbb{1}[\cdot]$	Indicator: $1$ if the condition holds, else $0$

The indicator function deserves emphasis because it converts a logical condition into arithmetic and so appears constantly in loss functions and estimators. A key identity links it to probability and expectation: for an event $A$, the expected value of its indicator is exactly its probability, $\mathbb{E}[\mathbb{1}[A]] = P(A)$. The empirical risk of a classifier under the zero one loss, for instance, is the sample average of the misclassification indicator, $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[\hat{y}_i \ne y_i]$, which is simply the fraction of examples the classifier gets wrong.
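
As a small illustration of the indicator turning logic into arithmetic, the sketch below computes the misclassification average for a handful of made up labels. The function names `indicator` and `empirical_risk` and the data are assumptions introduced for this example.

```python
def indicator(condition):
    """Return 1 if the condition holds, else 0."""
    return 1 if condition else 0


def empirical_risk(y_hat, y):
    """Average of the misclassification indicator over paired predictions and labels."""
    if len(y_hat) != len(y):
        raise ValueError("predictions and labels must have the same length")
    return sum(indicator(p != t) for p, t in zip(y_hat, y)) / len(y)


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
risk = empirical_risk(y_pred, y_true)   # two of the five predictions are wrong
```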

50.3 2. Linear Algebra

Linear algebra supplies the data structures of machine learning. A vector $\mathbf{x} \in \mathbb{R}^n$ is a point in $n$ dimensional space, written as a column by convention, and a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ maps vectors between spaces.

Symbol	Meaning
$\mathbf{x}, \mathbf{y}$	Column vectors
$x_i$	$i$th component of vector $\mathbf{x}$
$\mathbf{A}, \mathbf{B}$	Matrices
$A_{ij}$	Entry in row $i$, column $j$ of $\mathbf{A}$
$\mathbf{a}_j$	$j$th column of $\mathbf{A}$
$\mathbf{0}, \mathbf{1}$	Vector or matrix of all zeros, all ones
$\mathbf{I}_n$	$n \times n$ identity matrix
$\operatorname{diag}(\mathbf{x})$	Diagonal matrix with $\mathbf{x}$ on the diagonal

The operations below combine and transform these objects. The inner product measures alignment, the outer product builds rank one matrices, and the inverse undoes a linear map when it exists.

Symbol	Meaning
$\mathbf{A}^\top$	Transpose of $\mathbf{A}$
$\mathbf{A}^{-1}$	Inverse of $\mathbf{A}$
$\mathbf{A}^+$	Moore Penrose pseudoinverse
$\mathbf{x}^\top \mathbf{y}$	Inner (dot) product, equal to $\sum_i x_i y_i$
$\langle \mathbf{x}, \mathbf{y} \rangle$	Inner product (alternative notation)
$\mathbf{x} \mathbf{y}^\top$	Outer product (rank one matrix)
$\mathbf{A} \mathbf{B}$	Matrix product
$\mathbf{x} \odot \mathbf{y}$	Elementwise (Hadamard) product
$\mathbf{x} \otimes \mathbf{y}$	Kronecker (tensor) product
$\operatorname{tr}(\mathbf{A})$	Trace (sum of diagonal entries)
$\det(\mathbf{A})$	Determinant
$\operatorname{rank}(\mathbf{A})$	Rank (dimension of column space)
$\operatorname{span}(\cdot)$	Linear span of a set of vectors

Norms quantify the size of a vector or matrix and underpin regularization and distance metrics. Eigenvalues and singular values expose the geometry of a linear map.

Symbol	Meaning
$\lVert \mathbf{x} \rVert_2$	Euclidean ($\ell_2$) norm, $\sqrt{\sum_i x_i^2}$
$\lVert \mathbf{x} \rVert_1$	$\ell_1$ norm, $\sum_i \lvert x_i \rvert$
$\lVert \mathbf{x} \rVert_p$	$\ell_p$ norm, $\left(\sum_i \lvert x_i \rvert^p\right)^{1/p}$
$\lVert \mathbf{x} \rVert_\infty$	Max norm, $\max_i \lvert x_i \rvert$
$\lVert \mathbf{A} \rVert_F$	Frobenius norm, $\sqrt{\sum_{ij} A_{ij}^2}$
$\lambda_i(\mathbf{A})$	$i$th eigenvalue of $\mathbf{A}$
$\sigma_i(\mathbf{A})$	$i$th singular value of $\mathbf{A}$
$\mathbf{A} \succeq 0$	$\mathbf{A}$ is positive semidefinite
$\mathbf{A} \succ 0$	$\mathbf{A}$ is positive definite

A few structural facts make these symbols easier to use. The inner product relates to the Euclidean norm by $\lVert \mathbf{x} \rVert_2^2 = \mathbf{x}^\top \mathbf{x}$ and encodes the angle $\phi$ between two vectors through $\mathbf{x}^\top \mathbf{y} = \lVert \mathbf{x} \rVert_2 \lVert \mathbf{y} \rVert_2 \cos\phi$, so orthogonality is precisely $\mathbf{x}^\top \mathbf{y} = 0$. The eigenvalues and singular values are connected: for any real matrix the singular values of $\mathbf{A}$ are the square roots of the eigenvalues of $\mathbf{A}^\top \mathbf{A}$, and for a symmetric positive semidefinite matrix the two coincide. Positive semidefiniteness has the operational definition $\mathbf{A} \succeq 0 \iff \mathbf{x}^\top \mathbf{A} \mathbf{x} \ge 0$ for all $\mathbf{x}$, which is exactly the condition that makes a quadratic form, such as a squared Mahalanobis distance, never negative. Covariance matrices and Hessians at a minimum are the two cases this book meets most often.
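
The relations between the inner product, the Euclidean norm, and the angle can be verified numerically. This is a minimal sketch in plain Python; `dot`, `norm2`, and `angle` are hypothetical helper names, and the two vectors are chosen so that they happen to be orthogonal.

```python
import math


def dot(x, y):
    """Inner product: the sum over i of x_i * y_i."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def norm2(x):
    """Euclidean norm, computed from the identity ||x||^2 = x^T x."""
    return math.sqrt(dot(x, x))


def angle(x, y):
    """Angle in radians between two nonzero vectors."""
    cosine = dot(x, y) / (norm2(x) * norm2(y))
    # Clamp to [-1, 1] to guard against floating point rounding.
    return math.acos(max(-1.0, min(1.0, cosine)))


x = [3.0, 4.0]
y = [-4.0, 3.0]
print(norm2(x))          # 5.0
print(dot(x, y))         # 0.0, so x and y are orthogonal
print(angle(x, y))       # pi / 2
```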

The $\ell_p$ norms form a family worth picturing, because the choice of $p$ changes the geometry of regularization. The set of points with $\lVert \mathbf{x} \rVert_p \le 1$ is the unit ball, and its shape governs which solutions a penalty prefers.

flowchart LR
  P0["p = 0 pseudonorm, counts nonzeros, sparsest"]
  P1["p = 1 diamond ball, promotes sparsity, convex"]
  P2["p = 2 round ball, shrinks smoothly"]
  PI["p = infinity square ball, bounds the largest entry"]
  P0 --> P1 --> P2 --> PI

The $\ell_1$ ball has corners on the axes, so its constraint surface tends to touch the loss at points where some coordinates are exactly zero, which is the geometric reason the $\ell_1$ penalty induces sparsity. The $\ell_2$ ball is round and rotationally symmetric, so it shrinks all coordinates smoothly without forcing any to zero. The $\ell_0$ entry is written as a norm only by convention; it is not actually a norm because it fails homogeneity.
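
The norm family and the failure of homogeneity for the $\ell_0$ count can be seen on a single vector. The sketch below is plain Python under the assumption that vectors are lists of floats; `lp_norm` and `l0` are illustrative names, not functions from any library.

```python
import math


def lp_norm(x, p):
    """l_p norm of x for p >= 1, with p = math.inf giving the max norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    if p < 1:
        raise ValueError("the formula defines a norm only for p >= 1")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def l0(x):
    """Number of nonzero entries; called a norm only by convention."""
    return sum(1 for v in x if v != 0)


x = [3.0, 0.0, -4.0]
print(lp_norm(x, 1), lp_norm(x, 2), lp_norm(x, math.inf))   # 7.0 5.0 4.0

# Homogeneity requires ||c x|| = |c| ||x||. The l_2 norm satisfies it, l_0 does not.
scaled = [2.0 * v for v in x]
print(lp_norm(scaled, 2), 2.0 * lp_norm(x, 2))   # equal
print(l0(scaled), 2.0 * l0(x))                   # 2 versus 4.0
```

Scaling the vector by two doubles its $\ell_2$ norm but leaves the count of nonzeros unchanged, which is exactly the homogeneity failure noted above.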

50.4 3. Calculus

Calculus describes how quantities change, and gradient based optimization is the engine of modern learning. For a scalar valued function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient collects all first order partial derivatives into a vector that points in the direction of steepest ascent.

Symbol	Meaning
$\dfrac{df}{dx}$	Derivative of $f$ with respect to $x$
$\dfrac{\partial f}{\partial x_i}$	Partial derivative with respect to $x_i$
$f'(x), f''(x)$	First and second derivatives
$\nabla f(\mathbf{x})$	Gradient, vector of partials $\left[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right]^\top$
$\nabla_{\boldsymbol{\theta}} f$	Gradient taken with respect to parameters $\boldsymbol{\theta}$
$\nabla^2 f(\mathbf{x})$	Hessian, matrix of second partials
$H_{ij} = \dfrac{\partial^2 f}{\partial x_i \partial x_j}$	Entry of the Hessian $\mathbf{H}$

For a vector valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian generalizes the gradient and records how every output responds to every input. The remaining symbols cover integration and limiting behavior.

Symbol	Meaning
$\mathbf{J}$ or $\dfrac{\partial \mathbf{f}}{\partial \mathbf{x}}$	Jacobian, with $J_{ij} = \dfrac{\partial f_i}{\partial x_j}$
$\displaystyle\int_a^b f(x)\, dx$	Definite integral over $[a, b]$
$\displaystyle\int f\, d\mu$	Integral with respect to measure $\mu$
$\lim_{x \to a} f(x)$	Limit of $f$ as $x$ approaches $a$
$\sum, \prod$	Summation, product over an index
$\mathcal{O}(\cdot), o(\cdot)$	Big O and little o asymptotic bounds
$\circ$	Function composition, $(f \circ g)(x) = f(g(x))$

The gradient and Hessian supply the first and second order terms of a Taylor expansion. Near a point $\mathbf{x}_0$ a smooth scalar function is approximated by

\[ f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) + \tfrac{1}{2}(\mathbf{x} - \mathbf{x}_0)^\top \nabla^2 f(\mathbf{x}_0)\, (\mathbf{x} - \mathbf{x}_0). \]

This expansion is the lens through which optimization reads curvature: the gradient gives the local slope, and the Hessian, which is symmetric whenever the second partials are continuous, gives the curvature. At a local minimum the gradient vanishes and the Hessian is positive semidefinite, the multivariate analogue of the familiar conditions $f'(x) = 0$ and $f''(x) \ge 0$.
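
A quick numerical check makes the quality of the second order approximation concrete. The sketch below assumes an example function $f(x_1, x_2) = e^{x_1} + x_1 x_2^2$, chosen only for illustration, with its gradient and Hessian derived by hand; `taylor2` is a hypothetical helper name.

```python
import math


def f(x):
    """Example scalar function f(x1, x2) = exp(x1) + x1 * x2**2 (chosen for illustration)."""
    return math.exp(x[0]) + x[0] * x[1] ** 2


def grad_f(x):
    """Gradient of f, derived by hand."""
    return [math.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]]


def hess_f(x):
    """Hessian of f, derived by hand; note that it is symmetric."""
    return [[math.exp(x[0]), 2.0 * x[1]],
            [2.0 * x[1], 2.0 * x[0]]]


def taylor2(x, x0):
    """Second order Taylor approximation of f around x0, evaluated at x."""
    d = [a - b for a, b in zip(x, x0)]
    g, H = grad_f(x0), hess_f(x0)
    linear = sum(gi * di for gi, di in zip(g, d))
    quadratic = sum(d[i] * H[i][j] * d[j] for i in range(2) for j in range(2))
    return f(x0) + linear + 0.5 * quadratic


x0 = [0.0, 1.0]
for step in (0.1, 0.01):
    x = [x0[0] + step, x0[1] + step]
    error = abs(f(x) - taylor2(x, x0))
    print(step, error)   # the error shrinks roughly like step**3
```

Shrinking the step by a factor of ten reduces the error by roughly a factor of a thousand, as expected when the first neglected term is cubic.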

The chain rule, written $\frac{d}{dx} f(g(x)) = f'(g(x)) g'(x)$, is the basis of backpropagation. In the vector setting it becomes a product of Jacobians: if $\mathbf{z} = \mathbf{f}(\mathbf{y})$ and $\mathbf{y} = \mathbf{g}(\mathbf{x})$, then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$. Backpropagation applies this identity recursively across the layers of a network, multiplying the per layer Jacobians from output to input to accumulate the gradient of a scalar loss with respect to every parameter.
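
The Jacobian form of the chain rule can be checked against finite differences. In the sketch below the maps `g` and `f` are arbitrary examples, their Jacobians are written out by hand, and `matmul` and `numeric_jacobian` are hypothetical helpers. It illustrates the identity itself and is not an implementation of backpropagation.

```python
import math


def g(x):
    """Inner map y = g(x) from R^2 to R^2 (illustrative)."""
    return [x[0] * x[1], x[0] + x[1]]


def f(y):
    """Outer map z = f(y) from R^2 to R^2 (illustrative)."""
    return [math.sin(y[0]), y[0] * y[1]]


def jac_g(x):
    """Jacobian of g, with entry (i, j) equal to d g_i / d x_j."""
    return [[x[1], x[0]],
            [1.0, 1.0]]


def jac_f(y):
    """Jacobian of f, with entry (i, j) equal to d f_i / d y_j."""
    return [[math.cos(y[0]), 0.0],
            [y[1], y[0]]]


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def numeric_jacobian(h, x, eps=1e-6):
    """Central finite difference estimate of the Jacobian of h at x."""
    m = len(h(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += eps
        down[j] -= eps
        h_up, h_down = h(up), h(down)
        for i in range(m):
            J[i][j] = (h_up[i] - h_down[i]) / (2.0 * eps)
    return J


x = [0.5, -1.5]
chain = matmul(jac_f(g(x)), jac_g(x))             # dz/dy times dy/dx
numeric = numeric_jacobian(lambda v: f(g(v)), x)  # direct estimate of dz/dx
```

The two matrices `chain` and `numeric` agree entry by entry up to the finite difference error.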

50.5 4. Probability and Statistics

Probability formalizes uncertainty. A random variable maps outcomes to numbers, a distribution assigns probability mass or density to those numbers, and expectation summarizes a distribution by its average behavior. We write random variables in uppercase and their realized values in lowercase.

Symbol	Meaning
$\Omega$	Sample space of outcomes
$P(A)$	Probability of event $A$
$X, Y$	Random variables
$P(X = x)$	Probability mass at $x$ (discrete)
$p(x)$	Probability density at $x$ (continuous)
$P(X \mid Y)$	Conditional distribution of $X$ given $Y$
$X \sim \mathcal{D}$	$X$ is distributed according to $\mathcal{D}$
$X \perp Y$	$X$ and $Y$ are independent
$\text{i.i.d.}$	Independent and identically distributed

Expectation, variance, and their relatives are the workhorses of statistical analysis. The estimator notation distinguishes an unknown true quantity from the value computed from a finite sample.

Symbol	Meaning
$\mathbb{E}[X]$	Expected value of $X$
$\mathbb{E}_{x \sim p}[f(x)]$	Expectation of $f$ under distribution $p$
$\operatorname{Var}(X)$	Variance, $\mathbb{E}[(X - \mathbb{E}[X])^2]$
$\operatorname{Cov}(X, Y)$	Covariance between $X$ and $Y$
$\operatorname{Corr}(X, Y)$	Pearson correlation coefficient
$\boldsymbol{\Sigma}$	Covariance matrix
$\mu, \sigma, \sigma^2$	Mean, standard deviation, variance
$\bar{x}$	Sample mean
$\hat{\theta}$	Estimator of parameter $\theta$
$\theta^\ast$	True or optimal parameter value
$\mathcal{L}(\theta)$	Likelihood of parameters $\theta$
$\ell(\theta)$	Log likelihood

A handful of identities tie these symbols together. Expectation is linear, so $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ regardless of dependence, whereas variance is not, and $\operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$. Bayes’ rule, which underlies inference throughout the book, rearranges the definition of conditional probability into

\[ P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}, \]

reading from left to right as posterior, likelihood, prior, and evidence. The hat versus star distinction is the recurring grammar of statistics: $\theta^\ast$ is the fixed unknown that nature chose, while $\hat{\theta}$ is a random quantity computed from data, and most of estimation theory studies how the second tracks the first as the sample grows.
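
Bayes' rule is easiest to absorb on a discrete example. The sketch below assumes a toy setting with two hypotheses about a coin and made up numbers; none of the values come from the book.

```python
# Two hypotheses about a coin, with a prior P(theta) over them (illustrative numbers).
prior = {"fair": 0.5, "biased": 0.5}
heads_probability = {"fair": 0.5, "biased": 0.8}

# Observed data D: 3 heads in 3 independent tosses.
num_heads = 3
likelihood = {theta: heads_probability[theta] ** num_heads for theta in prior}

# Evidence P(D): total probability of the data, summed over hypotheses.
evidence = sum(likelihood[theta] * prior[theta] for theta in prior)

# Posterior P(theta | D) = P(D | theta) P(theta) / P(D).
posterior = {theta: likelihood[theta] * prior[theta] / evidence for theta in prior}
print(posterior)
```

Dividing by the evidence is what makes the posterior sum to one. After three heads the biased hypothesis is favored even though the prior was even.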

Several distributions recur often enough to merit a dedicated notation. The table lists their standard names and parameters.

Symbol	Meaning
$\mathcal{N}(\mu, \sigma^2)$	Gaussian (normal) with mean $\mu$, variance $\sigma^2$
$\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$	Multivariate Gaussian
$\operatorname{Bern}(p)$	Bernoulli with success probability $p$
$\operatorname{Bin}(n, p)$	Binomial over $n$ trials
$\operatorname{Cat}(\boldsymbol{\pi})$	Categorical with class probabilities $\boldsymbol{\pi}$
$\operatorname{Unif}(a, b)$	Uniform on $[a, b]$
$\operatorname{Pois}(\lambda)$	Poisson with rate $\lambda$

50.6 5. Information Theory

Information theory measures uncertainty and the cost of communication, and it provides the loss functions used to train probabilistic models. Entropy quantifies the average surprise of a distribution, while cross entropy and divergence compare two distributions.

Symbol	Meaning
$H(X)$	Entropy, $-\sum_x p(x) \log p(x)$
$H(X \mid Y)$	Conditional entropy
$H(p, q)$	Cross entropy between $p$ and $q$
$D_{\mathrm{KL}}(p \,\Vert\, q)$	Kullback Leibler divergence from $q$ to $p$
$I(X; Y)$	Mutual information between $X$ and $Y$
$\operatorname{JSD}(p \,\Vert\, q)$	Jensen Shannon divergence
$\log_2, \ln$	Logarithm base $2$ (bits), natural log (nats)

The KL divergence is defined as $D_{\mathrm{KL}}(p \,\Vert\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$. It is non negative and equals zero exactly when $p$ and $q$ agree, which is why minimizing it pulls a model distribution toward the data distribution. These quantities are not independent. Cross entropy decomposes cleanly as $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\Vert\, q)$, so when $p$ is the fixed data distribution the entropy term is a constant and minimizing cross entropy over $q$ is equivalent to minimizing the KL divergence. This is the precise sense in which training a classifier with the cross entropy loss is fitting the model distribution to the data. Mutual information measures shared content as the divergence between a joint distribution and the product of its marginals, $I(X; Y) = D_{\mathrm{KL}}(p(x, y) \,\Vert\, p(x) p(y))$, and it equals zero exactly when $X \perp Y$. Two cautions accompany the KL divergence. First, it is asymmetric, so $D_{\mathrm{KL}}(p \,\Vert\, q) \ne D_{\mathrm{KL}}(q \,\Vert\, p)$ in general, and it is therefore not a metric. Second, it is infinite whenever $p$ places mass on an outcome to which $q$ assigns probability zero. The Jensen Shannon divergence is a symmetrized, bounded alternative: it averages the KL divergences of $p$ and of $q$ from the mixture $\tfrac{1}{2}(p + q)$.
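
The decomposition of cross entropy and the two cautions about the KL divergence can be confirmed on small discrete distributions. The sketch below works in nats and assumes distributions given as lists of probabilities over the same outcomes; the function names and the example distributions are illustrative.

```python
import math


def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats, with 0 log 0 taken as 0."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); infinite if q rules out an outcome p allows."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total -= px * math.log(qx)
    return total


def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), with the same conventions."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx)
    return total


p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))   # equal up to rounding
print(kl_divergence(p, q), kl_divergence(q, p))                # asymmetric
print(kl_divergence(p, [0.5, 0.5, 0.0]))                       # inf
```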

50.7 6. Optimization

Optimization is the act of choosing parameters to minimize a loss. The notation below frames a generic problem: minimize an objective over a feasible region, possibly subject to constraints. Most training procedures are instances of this template.

Symbol	Meaning
$\min_{\mathbf{x}} f(\mathbf{x})$	Minimize $f$ over $\mathbf{x}$
$\max_{\mathbf{x}} f(\mathbf{x})$	Maximize $f$ over $\mathbf{x}$
$\arg\min_{\mathbf{x}} f(\mathbf{x})$	Argument that attains the minimum
$\arg\max_{\mathbf{x}} f(\mathbf{x})$	Argument that attains the maximum
$\text{s.t.}$	Subject to (constraints follow)
$f(\mathbf{x})$	Objective or loss function
$J(\boldsymbol{\theta})$	Cost function over parameters
$\boldsymbol{\theta}_t$	Parameter vector at iteration $t$
$\eta$ or $\alpha$	Learning rate (step size)
$\lambda$	Regularization strength
$\mathbf{g}_t = \nabla J(\boldsymbol{\theta}_t)$	Gradient at iteration $t$
$\mathcal{L}(\mathbf{x}, \boldsymbol{\mu})$	Lagrangian with multipliers $\boldsymbol{\mu}$

A canonical gradient descent update reads $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla J(\boldsymbol{\theta}_t)$, moving the parameters a small step in the direction that most rapidly decreases the cost. Stochastic variants replace the full gradient with an estimate computed on a minibatch, trading exactness for speed.
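
The update rule translates almost symbol for symbol into code. The sketch below assumes a deliberately simple cost, $J(\boldsymbol{\theta}) = \tfrac{1}{2}\lVert \boldsymbol{\theta} - \mathbf{c} \rVert_2^2$ with a fixed target $\mathbf{c}$, so that the minimizer is known in advance. The step size and iteration count are arbitrary illustrative choices.

```python
def grad_J(theta, target):
    """Gradient of the illustrative cost J(theta) = 0.5 * ||theta - target||^2."""
    return [t - c for t, c in zip(theta, target)]


def gradient_descent(theta0, target, eta=0.1, num_steps=200):
    """Iterate theta_{t+1} = theta_t - eta * grad J(theta_t)."""
    theta = list(theta0)
    for _ in range(num_steps):
        g = grad_J(theta, target)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta


target = [3.0, -1.0]
theta = gradient_descent([0.0, 0.0], target)
print(theta)   # close to the minimizer [3.0, -1.0]
```

For this cost each step shrinks the distance to the minimizer by the factor $1 - \eta$, so a larger learning rate converges faster here until it exceeds the stable range.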

Symbol	Meaning
$\mathcal{B}$	Minibatch of training examples
$\nabla_{\boldsymbol{\theta}} \mathcal{L}$	Loss gradient with respect to parameters
$\preceq, \succeq$	Componentwise or matrix inequality
$\Pi_{\mathcal{C}}(\cdot)$	Projection onto feasible set $\mathcal{C}$
$\partial f(\mathbf{x})$	Subdifferential at $\mathbf{x}$

A function is convex when its domain is convex and every chord lies on or above the graph, a property that guarantees any local minimum is global. Convexity is the dividing line between problems we can solve reliably and the non convex landscapes typical of deep networks. The subdifferential $\partial f(\mathbf{x})$ extends the gradient to convex functions with kinks, such as the absolute value at zero, by collecting every slope that supports the graph from below, and a point is a global minimum exactly when $\mathbf{0} \in \partial f(\mathbf{x})$. This is what gives the $\ell_1$ penalty a well defined optimality condition despite being nondifferentiable at the origin.
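
The optimality condition $\mathbf{0} \in \partial f(\mathbf{x})$ can be tested directly in one dimension. The sketch below assumes the scalar objective $f(x) = \tfrac{1}{2}(x - a)^2 + \lambda \lvert x \rvert$, chosen as the simplest $\ell_1$ penalized problem; the function names are hypothetical.

```python
def subdifferential(x, a, lam):
    """Interval [low, high] of subgradients of f(x) = 0.5 * (x - a)**2 + lam * |x|."""
    if x > 0:
        slope = (x - a) + lam
        return (slope, slope)
    if x < 0:
        slope = (x - a) - lam
        return (slope, slope)
    # At the kink, |x| contributes every slope in [-1, 1].
    return (-a - lam, -a + lam)


def is_global_minimum(x, a, lam, tol=1e-12):
    """Check the optimality condition 0 in the subdifferential at x."""
    low, high = subdifferential(x, a, lam)
    return low - tol <= 0.0 <= high + tol


lam = 1.0
print(is_global_minimum(0.0, a=0.5, lam=lam))   # True: |a| <= lam, so zero is optimal
print(is_global_minimum(0.0, a=2.0, lam=lam))   # False: the data term pulls away from zero
print(is_global_minimum(1.0, a=2.0, lam=lam))   # True: slope (1 - 2) + 1 = 0
```

The first case shows the mechanism behind sparsity: whenever $\lvert a \rvert \le \lambda$ the interval of subgradients at zero contains zero, so the minimizer sits exactly at the origin.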

50.8 Worked example: reading a regularized objective

To see the notation working together, consider the objective of ridge regression, which appears in nearly every chapter that fits a linear model:

\[ \hat{\boldsymbol{\theta}} \;=\; \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \;\; \underbrace{\tfrac{1}{n} \lVert \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \rVert_2^2}_{\text{data fit}} \;+\; \underbrace{\lambda \lVert \boldsymbol{\theta} \rVert_2^2}_{\text{penalty}}. \]

Every symbol here is drawn from the tables above. Reading it as a sentence: find the parameter vector $\hat{\boldsymbol{\theta}}$ in $d$ dimensional real space ($\arg\min$, blackboard $\mathbb{R}^d$) that minimizes the average squared $\ell_2$ distance between the predictions $\mathbf{X}\boldsymbol{\theta}$ and the targets $\mathbf{y}$ (a matrix times a vector, then a norm), plus a penalty of strength $\lambda$ on the squared norm of the parameters. The hat on $\hat{\boldsymbol{\theta}}$ flags it as an estimate computed from data rather than a true value $\boldsymbol{\theta}^\ast$. Because the data fit term is convex and, for $\lambda > 0$, the penalty is strictly convex, the whole objective is strictly convex, so the optimization tools of section 6 apply and the minimizer is unique. Setting the gradient $\tfrac{2}{n}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\theta} - \mathbf{y}) + 2\lambda\boldsymbol{\theta}$ to zero, using the calculus of section 3, yields the closed form $\hat{\boldsymbol{\theta}} = (\mathbf{X}^\top \mathbf{X} + n\lambda \mathbf{I}_d)^{-1} \mathbf{X}^\top \mathbf{y}$, where for $\lambda > 0$ the added $n\lambda \mathbf{I}_d$ makes the matrix positive definite and therefore invertible even when $\mathbf{X}^\top \mathbf{X}$ alone is singular. Once a reader can move fluently between the compressed formula and this spoken paraphrase, the notation has done its job.
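
The closed form can be checked numerically by confirming that the gradient of the objective vanishes at the computed solution. The sketch below is restricted to $d = 2$ features so that the linear system can be solved with Cramer's rule in plain Python. The data, the value of $\lambda$, and the function names are assumptions made for illustration.

```python
def ridge_closed_form(X, y, lam):
    """Solve (X^T X + n * lam * I) theta = X^T y for two features, using Cramer's rule."""
    n = len(X)
    # Entries of the symmetric 2 x 2 matrix X^T X + n * lam * I.
    a = sum(row[0] * row[0] for row in X) + n * lam
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X) + n * lam
    # Entries of the vector X^T y.
    r0 = sum(row[0] * t for row, t in zip(X, y))
    r1 = sum(row[1] * t for row, t in zip(X, y))
    det = a * d - b * b          # positive whenever lam > 0
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]


def ridge_gradient(theta, X, y, lam):
    """Gradient of (1/n) * ||X theta - y||^2 + lam * ||theta||^2."""
    n = len(X)
    residual = [row[0] * theta[0] + row[1] * theta[1] - t for row, t in zip(X, y)]
    return [(2.0 / n) * sum(row[j] * r for row, r in zip(X, residual)) + 2.0 * lam * theta[j]
            for j in range(2)]


# Two identical columns make X^T X singular, yet the ridge system is still solvable.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta_hat = ridge_closed_form(X, y, lam=0.1)
print(theta_hat)
print(ridge_gradient(theta_hat, X, y, lam=0.1))   # both entries are zero up to rounding
```

With duplicated columns the unpenalized problem has infinitely many minimizers, while the penalty selects the one that splits the weight evenly between the two features.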

50.9 Further Reading

Goodfellow, Bengio, and Courville, Deep Learning (2016), notation chapter.
Bishop, Pattern Recognition and Machine Learning (2006), appendices.
Strang, Introduction to Linear Algebra (2016).
Boyd and Vandenberghe, Convex Optimization (2004).
Cover and Thomas, Elements of Information Theory (2006).
Murphy, Probabilistic Machine Learning: An Introduction (2022).

# Mathematical Notation Reference This chapter consolidates the mathematical notation used throughout the book. It is meant as a quick lookup rather than a tutorial. Symbols are grouped by area, and each table maps a symbol to its meaning. Where the same letter carries different meanings in different areas, context resolves the ambiguity. We follow standard conventions: lowercase italic for scalars, lowercase bold for vectors, uppercase bold for matrices, and calligraphic or blackboard letters for sets. ## How to read this reference Notation is a compression scheme for mathematical ideas. A good convention is consistent (the same symbol means the same kind of thing everywhere) and unambiguous in context (any genuine collision is resolved by the surrounding sentence). This book follows the dominant conventions of the machine learning literature, which trace back to the textbooks listed under Further Reading. The table below fixes the typographic rules so that the type of an object is visible at a glance, before you read its name. | Form | Object | Example | |---|---|---| | lowercase italic | scalar | $x$, $\eta$, $\lambda$ | | lowercase bold | (column) vector | $\mathbf{x}$, $\boldsymbol{\theta}$ | | uppercase bold | matrix | $\mathbf{A}$, $\boldsymbol{\Sigma}$ | | uppercase italic | random variable | $X$, $Y$ | | calligraphic | set or operator | $\mathcal{X}$, $\mathcal{L}$ | | blackboard | number system or expectation | $\mathbb{R}$, $\mathbb{E}$ | | hat | estimate from data | $\hat{\theta}$, $\hat{y}$ | | star | optimal or true value | $\theta^\ast$, $\mathbf{x}^\ast$ | Two reading habits make the rest of the chapter faster. First, indices are almost always one based and run over the dimension implied by the object, so $x_i$ ranges over the $n$ entries of $\mathbf{x} \in \mathbb{R}^n$. 
Second, dimensions should be checked like units in physics: an expression such as $\mathbf{A}\mathbf{x}$ is only well formed when the column count of $\mathbf{A}$ equals the length of $\mathbf{x}$, and tracking these shapes catches most notational errors before any computation. ## 1. Sets and Logic We treat sets as the foundational layer. A set is an unordered collection of distinct elements, and the symbols below describe membership, construction, and relationships between sets. | Symbol | Meaning | |---|---| | $\{a, b, c\}$ | Set containing the listed elements | | $\{x : P(x)\}$ | Set of all $x$ satisfying property $P$ | | $x \in \mathcal{X}$ | $x$ is an element of $\mathcal{X}$ | | $x \notin \mathcal{X}$ | $x$ is not an element of $\mathcal{X}$ | | $\varnothing$ | Empty set | | $\mathcal{A} \subseteq \mathcal{B}$ | $\mathcal{A}$ is a subset of $\mathcal{B}$ | | $\mathcal{A} \cup \mathcal{B}$ | Union of $\mathcal{A}$ and $\mathcal{B}$ | | $\mathcal{A} \cap \mathcal{B}$ | Intersection of $\mathcal{A}$ and $\mathcal{B}$ | | $\mathcal{A} \setminus \mathcal{B}$ | Set difference (elements in $\mathcal{A}$ but not $\mathcal{B}$) | | $\mathcal{A} \times \mathcal{B}$ | Cartesian product | | $\lvert \mathcal{A} \rvert$ | Cardinality (number of elements) | | $2^{\mathcal{A}}$ | Power set (set of all subsets) | | $\mathbb{N}, \mathbb{Z}, \mathbb{Q}$ | Natural numbers, integers, rationals | | $\mathbb{R}, \mathbb{C}$ | Real numbers, complex numbers | | $\mathbb{R}^n$ | $n$ dimensional real coordinate space | | $[a, b]$ | Closed interval $\{x : a \le x \le b\}$ | Two conventions are worth fixing. 
The Cartesian product $\mathcal{A} \times \mathcal{B} = \{(a, b) : a \in \mathcal{A},\, b \in \mathcal{B}\}$ produces ordered pairs, so it satisfies $\lvert \mathcal{A} \times \mathcal{B} \rvert = \lvert \mathcal{A} \rvert \cdot \lvert \mathcal{B} \rvert$ for finite sets, and the power set has $\lvert 2^{\mathcal{A}} \rvert = 2^{\lvert \mathcal{A} \rvert}$, which is the origin of the notation. A data set of $n$ labeled examples is usually written $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$, pairing each input from a feature space $\mathcal{X}$ with a label from a label space $\mathcal{Y}$. Logical statements glue these objects together. The connectives below appear in proofs, definitions of estimators, and constraint specifications. | Symbol | Meaning | |---|---| | $\forall$ | For all | | $\exists$ | There exists | | $\neg$ | Logical negation (not) | | $\land, \lor$ | Logical and, or | | $\implies$ | Implies | | $\iff$ | If and only if | | $:=$ | Defined to be equal to | | $\propto$ | Proportional to | | $\mathbb{1}[\cdot]$ | Indicator: $1$ if the condition holds, else $0$ | The indicator function deserves emphasis because it converts a logical condition into arithmetic and so appears constantly in loss functions and estimators. A key identity links it to probability and expectation: for an event $A$, the expected value of its indicator is exactly its probability, $\mathbb{E}[\mathbb{1}[A]] = P(A)$. Empirical risk, for instance, is the sample average of an indicator of misclassification, $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[\hat{y}_i \ne y_i]$. ## 2. Linear Algebra Linear algebra supplies the data structures of machine learning. A vector $\mathbf{x} \in \mathbb{R}^n$ is a point in $n$ dimensional space, written as a column by convention, and a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ maps vectors between spaces. 
| Symbol | Meaning | |---|---| | $\mathbf{x}, \mathbf{y}$ | Column vectors | | $x_i$ | $i$th component of vector $\mathbf{x}$ | | $\mathbf{A}, \mathbf{B}$ | Matrices | | $A_{ij}$ | Entry in row $i$, column $j$ of $\mathbf{A}$ | | $\mathbf{a}_j$ | $j$th column of $\mathbf{A}$ | | $\mathbf{0}, \mathbf{1}$ | Vector or matrix of all zeros, all ones | | $\mathbf{I}_n$ | $n \times n$ identity matrix | | $\operatorname{diag}(\mathbf{x})$ | Diagonal matrix with $\mathbf{x}$ on the diagonal | The operations below combine and transform these objects. The inner product measures alignment, the outer product builds rank one matrices, and the inverse undoes a linear map when it exists. | Symbol | Meaning | |---|---| | $\mathbf{A}^\top$ | Transpose of $\mathbf{A}$ | | $\mathbf{A}^{-1}$ | Inverse of $\mathbf{A}$ | | $\mathbf{A}^+$ | Moore Penrose pseudoinverse | | $\mathbf{x}^\top \mathbf{y}$ | Inner (dot) product, equal to $\sum_i x_i y_i$ | | $\langle \mathbf{x}, \mathbf{y} \rangle$ | Inner product (alternative notation) | | $\mathbf{x} \mathbf{y}^\top$ | Outer product (rank one matrix) | | $\mathbf{A} \mathbf{B}$ | Matrix product | | $\mathbf{x} \odot \mathbf{y}$ | Elementwise (Hadamard) product | | $\mathbf{x} \otimes \mathbf{y}$ | Kronecker (tensor) product | | $\operatorname{tr}(\mathbf{A})$ | Trace (sum of diagonal entries) | | $\det(\mathbf{A})$ | Determinant | | $\operatorname{rank}(\mathbf{A})$ | Rank (dimension of column space) | | $\operatorname{span}(\cdot)$ | Linear span of a set of vectors | Norms quantify the size of a vector or matrix and underpin regularization and distance metrics. Eigenvalues and singular values expose the geometry of a linear map. 
| Symbol | Meaning | |---|---| | $\lVert \mathbf{x} \rVert_2$ | Euclidean ($\ell_2$) norm, $\sqrt{\sum_i x_i^2}$ | | $\lVert \mathbf{x} \rVert_1$ | $\ell_1$ norm, $\sum_i \lvert x_i \rvert$ | | $\lVert \mathbf{x} \rVert_p$ | $\ell_p$ norm, $\left(\sum_i \lvert x_i \rvert^p\right)^{1/p}$ | | $\lVert \mathbf{x} \rVert_\infty$ | Max norm, $\max_i \lvert x_i \rvert$ | | $\lVert \mathbf{A} \rVert_F$ | Frobenius norm, $\sqrt{\sum_{ij} A_{ij}^2}$ | | $\lambda_i(\mathbf{A})$ | $i$th eigenvalue of $\mathbf{A}$ | | $\sigma_i(\mathbf{A})$ | $i$th singular value of $\mathbf{A}$ | | $\mathbf{A} \succeq 0$ | $\mathbf{A}$ is positive semidefinite | | $\mathbf{A} \succ 0$ | $\mathbf{A}$ is positive definite | A few structural facts make these symbols easier to use. The inner product relates to the Euclidean norm by $\lVert \mathbf{x} \rVert_2^2 = \mathbf{x}^\top \mathbf{x}$ and encodes the angle $\phi$ between two vectors through $\mathbf{x}^\top \mathbf{y} = \lVert \mathbf{x} \rVert_2 \lVert \mathbf{y} \rVert_2 \cos\phi$, so orthogonality is precisely $\mathbf{x}^\top \mathbf{y} = 0$. The eigenvalues and singular values are connected: for any real matrix the singular values of $\mathbf{A}$ are the square roots of the eigenvalues of $\mathbf{A}^\top \mathbf{A}$, and for a symmetric positive semidefinite matrix the two coincide. Positive semidefiniteness has the operational definition $\mathbf{A} \succeq 0 \iff \mathbf{x}^\top \mathbf{A} \mathbf{x} \ge 0$ for all $\mathbf{x}$, which is exactly the condition that makes a quadratic form, such as a squared Mahalanobis distance, never negative. Covariance matrices and Hessians at a minimum are the two cases this book meets most often. The $\ell_p$ norms form a family worth picturing, because the choice of $p$ changes the geometry of regularization. The set of points with $\lVert \mathbf{x} \rVert_p \le 1$ is the unit ball, and its shape governs which solutions a penalty prefers. 
```{mermaid} flowchart LR P0["p = 0 quasinorm, counts nonzeros, sparsest"] P1["p = 1 diamond ball, promotes sparsity, convex"] P2["p = 2 round ball, shrinks smoothly"] PI["p = infinity square ball, bounds the largest entry"] P0 --> P1 --> P2 --> PI ``` The $\ell_1$ ball has corners on the axes, so its constraint surface tends to touch the loss at points where some coordinates are exactly zero, which is the geometric reason the $\ell_1$ penalty induces sparsity. The $\ell_2$ ball is round and rotationally symmetric, so it shrinks all coordinates smoothly without forcing any to zero. The $\ell_0$ entry is written as a norm only by convention; it is not actually a norm because it fails homogeneity. ## 3. Calculus Calculus describes how quantities change, and gradient based optimization is the engine of modern learning. For a scalar valued function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient collects all first order partial derivatives into a vector that points in the direction of steepest ascent. | Symbol | Meaning | |---|---| | $\dfrac{df}{dx}$ | Derivative of $f$ with respect to $x$ | | $\dfrac{\partial f}{\partial x_i}$ | Partial derivative with respect to $x_i$ | | $f'(x), f''(x)$ | First and second derivatives | | $\nabla f(\mathbf{x})$ | Gradient, vector of partials $\left[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right]^\top$ | | $\nabla_{\boldsymbol{\theta}} f$ | Gradient taken with respect to parameters $\boldsymbol{\theta}$ | | $\nabla^2 f(\mathbf{x})$ | Hessian, matrix of second partials | | $H_{ij} = \dfrac{\partial^2 f}{\partial x_i \partial x_j}$ | Entry of the Hessian $\mathbf{H}$ | For a vector valued function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian generalizes the gradient and records how every output responds to every input. The remaining symbols cover integration and limiting behavior. 

| Symbol | Meaning |
|---|---|
| $\mathbf{J}$ or $\dfrac{\partial \mathbf{f}}{\partial \mathbf{x}}$ | Jacobian, with $J_{ij} = \dfrac{\partial f_i}{\partial x_j}$ |
| $\displaystyle\int_a^b f(x)\, dx$ | Definite integral over $[a, b]$ |
| $\displaystyle\int f\, d\mu$ | Integral with respect to measure $\mu$ |
| $\lim_{x \to a} f(x)$ | Limit of $f$ as $x$ approaches $a$ |
| $\sum, \prod$ | Summation, product over an index |
| $\mathcal{O}(\cdot), o(\cdot)$ | Big O and little o asymptotic bounds |
| $\circ$ | Function composition, $(f \circ g)(x) = f(g(x))$ |

The gradient and Hessian fit together as the first and second order pieces of a Taylor expansion. Near a point $\mathbf{x}_0$ a smooth scalar function is approximated by

$$
f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^\top (\mathbf{x} - \mathbf{x}_0) + \tfrac{1}{2}(\mathbf{x} - \mathbf{x}_0)^\top \nabla^2 f(\mathbf{x}_0)\, (\mathbf{x} - \mathbf{x}_0).
$$

This expansion is the lens through which optimization reads curvature: the gradient gives the local slope, and the Hessian, which is symmetric whenever the second partials are continuous, gives the curvature. At a local minimum the gradient vanishes and the Hessian is positive semidefinite, the multivariate analogue of the familiar conditions $f'(x) = 0$ and $f''(x) \ge 0$.

The chain rule, written $\frac{d}{dx} f(g(x)) = f'(g(x)) g'(x)$, is the basis of backpropagation. In the vector setting it becomes a product of Jacobians: if $\mathbf{z} = \mathbf{f}(\mathbf{y})$ and $\mathbf{y} = \mathbf{g}(\mathbf{x})$, then $\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$. Backpropagation applies this identity recursively across the layers of a network, multiplying the per layer Jacobians from output to input to accumulate the gradient of a scalar loss with respect to every parameter.

## 4. Probability and Statistics

Probability formalizes uncertainty. A random variable maps outcomes to numbers, a distribution assigns probability mass or density to those numbers, and expectation summarizes a distribution by its average behavior. We write random variables in uppercase and their realized values in lowercase.

| Symbol | Meaning |
|---|---|
| $\Omega$ | Sample space of outcomes |
| $P(A)$ | Probability of event $A$ |
| $X, Y$ | Random variables |
| $P(X = x)$ | Probability mass at $x$ (discrete) |
| $p(x)$ | Probability density at $x$ (continuous) |
| $P(X \mid Y)$ | Conditional distribution of $X$ given $Y$ |
| $X \sim \mathcal{D}$ | $X$ is distributed according to $\mathcal{D}$ |
| $X \perp Y$ | $X$ and $Y$ are independent |
| $\text{i.i.d.}$ | Independent and identically distributed |

Expectation, variance, and their relatives are the workhorses of statistical analysis. The estimator notation distinguishes an unknown true quantity from the value computed from a finite sample.

| Symbol | Meaning |
|---|---|
| $\mathbb{E}[X]$ | Expected value of $X$ |
| $\mathbb{E}_{x \sim p}[f(x)]$ | Expectation of $f$ under distribution $p$ |
| $\operatorname{Var}(X)$ | Variance, $\mathbb{E}[(X - \mathbb{E}[X])^2]$ |
| $\operatorname{Cov}(X, Y)$ | Covariance between $X$ and $Y$ |
| $\operatorname{Corr}(X, Y)$ | Pearson correlation coefficient |
| $\boldsymbol{\Sigma}$ | Covariance matrix |
| $\mu, \sigma, \sigma^2$ | Mean, standard deviation, variance |
| $\bar{x}$ | Sample mean |
| $\hat{\theta}$ | Estimator of parameter $\theta$ |
| $\theta^\ast$ | True or optimal parameter value |
| $\mathcal{L}(\theta)$ | Likelihood of parameters $\theta$ |
| $\ell(\theta)$ | Log likelihood |

A handful of identities tie these symbols together. Expectation is linear, so $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ regardless of dependence, whereas variance is not; it instead satisfies the identity $\operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.
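The two identities just stated can be checked exactly on a small discrete distribution. The sketch below is only an illustration: the fair die, the helper name `expectation`, and the constants `a` and `b` are assumptions chosen for the example, and $Y = X^2$ is picked deliberately so that $X$ and $Y$ are dependent.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, P(X = k) = 1/6 for k = 1..6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}


def expectation(f, pmf):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(p * f(x) for x, p in pmf.items())


mean = expectation(lambda x: x, pmf)               # E[X]
second_moment = expectation(lambda x: x * x, pmf)  # E[X^2]

# Variance from the definition E[(X - E[X])^2] ...
var_definition = expectation(lambda x: (x - mean) ** 2, pmf)
# ... and from the identity E[X^2] - E[X]^2.
var_identity = second_moment - mean ** 2

# Linearity with Y = X^2, which is clearly dependent on X.
a, b = 2, 5
lhs = expectation(lambda x: a * x + b * x * x, pmf)  # E[aX + bY]
rhs = a * mean + b * second_moment                   # a E[X] + b E[Y]

print(mean, var_definition, var_identity, lhs == rhs)
```

Exact rational arithmetic from the standard library is used so that both sides of each identity agree exactly rather than only up to rounding.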
Bayes' rule, which underlies inference throughout the book, rearranges the definition of conditional probability into

$$
P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})},
$$

reading from left to right as posterior, likelihood, prior, and evidence. The hat versus star distinction is the recurring grammar of statistics: $\theta^\ast$ is the fixed unknown that nature chose, while $\hat{\theta}$ is a random quantity computed from data, and most of estimation theory studies how the second tracks the first as the sample grows.

Several distributions recur often enough to merit a dedicated notation. The table lists their standard names and parameters.

| Symbol | Meaning |
|---|---|
| $\mathcal{N}(\mu, \sigma^2)$ | Gaussian (normal) with mean $\mu$, variance $\sigma^2$ |
| $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ | Multivariate Gaussian |
| $\operatorname{Bern}(p)$ | Bernoulli with success probability $p$ |
| $\operatorname{Bin}(n, p)$ | Binomial over $n$ trials |
| $\operatorname{Cat}(\boldsymbol{\pi})$ | Categorical with class probabilities $\boldsymbol{\pi}$ |
| $\operatorname{Unif}(a, b)$ | Uniform on $[a, b]$ |
| $\operatorname{Pois}(\lambda)$ | Poisson with rate $\lambda$ |

## 5. Information Theory

Information theory measures uncertainty and the cost of communication, and it provides the loss functions used to train probabilistic models. Entropy quantifies the average surprise of a distribution, while cross entropy and divergence compare two distributions.
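To make "average surprise" and "comparing two distributions" concrete, here is a minimal sketch that computes entropy, cross entropy, and KL divergence in nats for two small distributions. The function names and the distributions `p` and `q` are hypothetical, and the cross entropy is taken in its standard form $-\sum_x p(x) \log q(x)$.

```python
import math


def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats; outcomes with p(x) = 0 add nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


# Hypothetical distributions over three outcomes. Both are strictly positive,
# so no logarithm of zero arises in either direction.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(h_p, h_pq, kl_pq, kl_qp)
```

On these inputs the cross entropy equals the entropy plus the KL divergence up to rounding, and the two directions of the KL divergence give different values.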

| Symbol | Meaning |
|---|---|
| $H(X)$ | Entropy, $-\sum_x p(x) \log p(x)$ |
| $H(X \mid Y)$ | Conditional entropy |
| $H(p, q)$ | Cross entropy between $p$ and $q$ |
| $D_{\mathrm{KL}}(p \,\Vert\, q)$ | Kullback Leibler divergence from $q$ to $p$ |
| $I(X; Y)$ | Mutual information between $X$ and $Y$ |
| $\operatorname{JSD}(p \,\Vert\, q)$ | Jensen Shannon divergence |
| $\log_2, \ln$ | Logarithm base $2$ (bits), natural log (nats) |

The KL divergence is defined as $D_{\mathrm{KL}}(p \,\Vert\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$. It is non negative and equals zero exactly when $p$ and $q$ agree, which is why minimizing it pulls a model distribution toward the data distribution.

These quantities are not independent. Cross entropy decomposes cleanly as $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\Vert\, q)$, so when $p$ is the fixed data distribution the entropy term is a constant and minimizing cross entropy is identical to minimizing the KL divergence. This is the precise sense in which training a classifier with the cross entropy loss is fitting the model distribution to the data. Mutual information measures shared content as the divergence between a joint distribution and the product of its marginals, $I(X; Y) = D_{\mathrm{KL}}(p(x, y) \,\Vert\, p(x) p(y))$, and it equals zero exactly when $X \perp Y$.

Two cautions accompany the KL divergence. First, it is asymmetric, so $D_{\mathrm{KL}}(p \,\Vert\, q) \ne D_{\mathrm{KL}}(q \,\Vert\, p)$ in general, and it is therefore not a metric. Second, it diverges to infinity whenever $p$ places mass on an outcome that $q$ rules out. The Jensen Shannon divergence is a symmetrized, bounded alternative built from the two KL terms against the mixture $\tfrac{1}{2}(p + q)$.

## 6. Optimization

Optimization is the act of choosing parameters to minimize a loss. The notation below frames a generic problem: minimize an objective over a feasible region, possibly subject to constraints. Most training procedures are instances of this template.

| Symbol | Meaning |
|---|---|
| $\min_{\mathbf{x}} f(\mathbf{x})$ | Minimize $f$ over $\mathbf{x}$ |
| $\max_{\mathbf{x}} f(\mathbf{x})$ | Maximize $f$ over $\mathbf{x}$ |
| $\arg\min_{\mathbf{x}} f(\mathbf{x})$ | Argument that attains the minimum |
| $\arg\max_{\mathbf{x}} f(\mathbf{x})$ | Argument that attains the maximum |
| $\text{s.t.}$ | Subject to (constraints follow) |
| $f(\mathbf{x})$ | Objective or loss function |
| $J(\boldsymbol{\theta})$ | Cost function over parameters |
| $\boldsymbol{\theta}_t$ | Parameter vector at iteration $t$ |
| $\eta$ or $\alpha$ | Learning rate (step size) |
| $\lambda$ | Regularization strength |
| $\mathbf{g}_t = \nabla J(\boldsymbol{\theta}_t)$ | Gradient at iteration $t$ |
| $\mathcal{L}(\mathbf{x}, \boldsymbol{\mu})$ | Lagrangian with multipliers $\boldsymbol{\mu}$ |

A canonical gradient descent update reads $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla J(\boldsymbol{\theta}_t)$, moving the parameters a small step in the direction that most rapidly decreases the cost. Stochastic variants replace the full gradient with an estimate computed on a minibatch, trading exactness for speed.

| Symbol | Meaning |
|---|---|
| $\mathcal{B}$ | Minibatch of training examples |
| $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ | Loss gradient with respect to parameters |
| $\preceq, \succeq$ | Componentwise or matrix inequality |
| $\Pi_{\mathcal{C}}(\cdot)$ | Projection onto feasible set $\mathcal{C}$ |
| $\partial f(\mathbf{x})$ | Subdifferential at $\mathbf{x}$ |

A function is convex when its domain is convex and every chord lies on or above the graph, a property that guarantees any local minimum is global. Convexity is the dividing line between problems we can solve reliably and the non convex landscapes typical of deep networks.
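The gradient descent update given earlier in this section can be sketched on a one dimensional convex cost, where the unique local minimum is also the global one. The cost $J(\theta) = (\theta - 3)^2$, the step size, and the function names are assumptions made for this illustration only.

```python
def grad_J(theta):
    """Gradient of the hypothetical convex cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, eta, num_steps):
    """Iterate theta_{t+1} = theta_t - eta * grad J(theta_t)."""
    theta = theta0
    for _ in range(num_steps):
        theta = theta - eta * grad_J(theta)
    return theta


# With eta = 0.1 each step shrinks the distance to the minimizer by a factor 0.8.
theta_final = gradient_descent(theta0=0.0, eta=0.1, num_steps=200)
print(theta_final)
```

For this quadratic the iterates contract toward $\theta = 3$ whenever $0 < \eta < 1$; a step size at or above $1$ would make them stall or diverge, which is one reason the learning rate $\eta$ gets its own symbol.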
The subdifferential $\partial f(\mathbf{x})$ extends the gradient to convex functions with kinks, such as the absolute value at zero, by collecting every slope that supports the graph from below, and a point is a global minimum exactly when $\mathbf{0} \in \partial f(\mathbf{x})$. This is what gives the $\ell_1$ penalty a well defined optimality condition despite being nondifferentiable at the origin.

## Worked example: reading a regularized objective

To see the notation working together, consider the objective of ridge regression, which appears in nearly every chapter that fits a linear model:

$$
\hat{\boldsymbol{\theta}} \;=\; \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \;\; \underbrace{\tfrac{1}{n} \lVert \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \rVert_2^2}_{\text{data fit}} \;+\; \underbrace{\lambda \lVert \boldsymbol{\theta} \rVert_2^2}_{\text{penalty}}.
$$

Every symbol here is drawn from the tables above. Reading it as a sentence: find the parameter vector $\hat{\boldsymbol{\theta}}$ in $d$ dimensional real space ($\arg\min$, blackboard $\mathbb{R}^d$) that minimizes the average squared $\ell_2$ distance between the predictions $\mathbf{X}\boldsymbol{\theta}$ and the targets $\mathbf{y}$ (a matrix times a vector, then a norm), plus a penalty of strength $\lambda$ on the squared norm of the parameters. The hat on $\hat{\boldsymbol{\theta}}$ flags it as an estimate computed from data rather than a true value $\boldsymbol{\theta}^\ast$. Because both terms are convex and the penalty is strictly convex for $\lambda > 0$, the whole objective is strictly convex, so the optimization tools of section 6 apply and the minimizer is unique.
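The reading above can be transcribed almost word for word into code. The sketch below handles only the scalar case $d = 1$ on made-up data; the names `ridge_objective`, `xs`, `ys`, and `lam` are hypothetical, and the minimizer it uses is the $d = 1$ case of the closed form obtained by setting the gradient to zero.

```python
def ridge_objective(theta, xs, ys, lam):
    """(1/n) * sum_i (x_i * theta - y_i)^2 + lam * theta^2, for scalar theta."""
    n = len(xs)
    data_fit = sum((x * theta - y) ** 2 for x, y in zip(xs, ys)) / n
    penalty = lam * theta ** 2
    return data_fit + penalty


# Hypothetical toy data with a single feature.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
lam = 0.5
n = len(xs)

# Scalar case of the minimizer: sum(x * y) / (sum(x^2) + n * lam).
theta_hat = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)

# Strict convexity means nearby values of theta give a strictly larger objective.
best = ridge_objective(theta_hat, xs, ys, lam)
left = ridge_objective(theta_hat - 0.01, xs, ys, lam)
right = ridge_objective(theta_hat + 0.01, xs, ys, lam)
print(theta_hat, best, left, right)
```

The data were generated with slope $2$, yet the estimate comes out below $2$: the penalty pulls $\hat{\theta}$ toward zero, which is the shrinkage that $\lambda$ controls.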
Setting the gradient to zero, using the calculus of section 3, yields the closed form $\hat{\boldsymbol{\theta}} = (\mathbf{X}^\top \mathbf{X} + n\lambda \mathbf{I}_d)^{-1} \mathbf{X}^\top \mathbf{y}$, where for $\lambda > 0$ the added $n\lambda \mathbf{I}_d$ makes the matrix positive definite and therefore invertible even when $\mathbf{X}^\top \mathbf{X}$ alone is singular. Once a reader can move fluently between the compressed formula and this spoken paraphrase, the notation has done its job.

## Further Reading

* Goodfellow, Bengio, and Courville, *Deep Learning* (2016), notation chapter.
* Bishop, *Pattern Recognition and Machine Learning* (2006), appendices.
* Strang, *Introduction to Linear Algebra* (2016).
* Boyd and Vandenberghe, *Convex Optimization* (2004).
* Cover and Thomas, *Elements of Information Theory* (2006).
* Murphy, *Probabilistic Machine Learning: An Introduction* (2022).

