50 Mathematical Notation Reference
This chapter consolidates the mathematical notation used throughout the book. It is meant as a quick lookup rather than a tutorial. Symbols are grouped by area, and each table maps a symbol to its meaning. Where the same letter carries different meanings in different areas, context resolves the ambiguity. For example, \(\lambda\) denotes an eigenvalue in linear algebra, a Poisson rate in probability, and a regularization strength in optimization. We follow standard conventions: lowercase italic for scalars, lowercase bold for vectors, uppercase bold for matrices, and calligraphic or blackboard letters for sets.
50.1 1. Sets and Logic
We treat sets as the foundational layer. A set is an unordered collection of distinct elements, and the symbols below describe membership, construction, and relationships between sets.
| Symbol | Meaning |
|---|---|
| \(\{a, b, c\}\) | Set containing the listed elements |
| \(\{x : P(x)\}\) | Set of all \(x\) satisfying property \(P\) |
| \(x \in \mathcal{X}\) | \(x\) is an element of \(\mathcal{X}\) |
| \(x \notin \mathcal{X}\) | \(x\) is not an element of \(\mathcal{X}\) |
| \(\varnothing\) | Empty set |
| \(\mathcal{A} \subseteq \mathcal{B}\) | \(\mathcal{A}\) is a subset of \(\mathcal{B}\) |
| \(\mathcal{A} \cup \mathcal{B}\) | Union of \(\mathcal{A}\) and \(\mathcal{B}\) |
| \(\mathcal{A} \cap \mathcal{B}\) | Intersection of \(\mathcal{A}\) and \(\mathcal{B}\) |
| \(\mathcal{A} \setminus \mathcal{B}\) | Set difference (elements in \(\mathcal{A}\) but not \(\mathcal{B}\)) |
| \(\mathcal{A} \times \mathcal{B}\) | Cartesian product |
| \(\lvert \mathcal{A} \rvert\) | Cardinality (number of elements) |
| \(2^{\mathcal{A}}\) | Power set (set of all subsets) |
| \(\mathbb{N}, \mathbb{Z}, \mathbb{Q}\) | Natural numbers, integers, rationals |
| \(\mathbb{R}, \mathbb{C}\) | Real numbers, complex numbers |
| \(\mathbb{R}^n\) | \(n\)-dimensional real coordinate space |
| \([a, b]\) | Closed interval \(\{x : a \le x \le b\}\) |
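For readers who think in code, the set operations in the table map closely onto Python's built-in `set` type. The sketch below is only an illustration: the sets `A` and `B` and the helper `power_set` are hypothetical names chosen for this example, not notation used elsewhere in the book.

```python
from itertools import chain, combinations, product

# Two small example sets (hypothetical values chosen for illustration).
A = {1, 2, 3}
B = {3, 4}

union = A | B                    # union of A and B
intersection = A & B             # intersection of A and B
difference = A - B               # set difference: in A but not in B
cartesian = set(product(A, B))   # Cartesian product as a set of ordered pairs


def power_set(s):
    """Return the power set of s: every subset, each stored as a frozenset."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return {frozenset(c) for c in subsets}


# Set-builder notation {x : P(x)} corresponds to a set comprehension.
evens = {x for x in range(10) if x % 2 == 0}

print(len(cartesian))      # |A x B| equals |A| * |B|
print(len(power_set(A)))   # |2^A| equals 2 ** |A|
```

Subsets are stored as `frozenset` objects because ordinary Python sets are not hashable and so cannot be elements of another set.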
Logical statements glue these objects together. The connectives below appear in proofs, definitions of estimators, and constraint specifications.
| Symbol | Meaning |
|---|---|
| \(\forall\) | For all |
| \(\exists\) | There exists |
| \(\neg\) | Logical negation (not) |
| \(\land, \lor\) | Logical and, or |
| \(\implies\) | Implies |
| \(\iff\) | If and only if |
| \(:=\) | Defined to be equal to |
| \(\propto\) | Proportional to |
| \(\mathbb{1}[\cdot]\) | Indicator: \(1\) if the condition holds, else \(0\) |
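The quantifiers and the indicator have direct counterparts in Python, which can make their meaning easier to check on small examples. This is a minimal sketch; the list `xs` and the helpers `indicator` and `implies` are hypothetical names introduced only for this illustration.

```python
def indicator(condition):
    """Indicator 1[condition]: 1 if the condition holds, else 0."""
    return 1 if condition else 0


def implies(p, q):
    """Material implication: p implies q is equivalent to (not p) or q."""
    return (not p) or q


xs = [3, -1, 4, -1, 5]

all_positive = all(x > 0 for x in xs)    # "for all x, x > 0"
some_negative = any(x < 0 for x in xs)   # "there exists x with x < 0"

# Counting with indicators: the sum over i of 1[x_i < 0].
num_negative = sum(indicator(x < 0) for x in xs)
```

Negating a universal statement gives an existential one, so `not all_positive` agrees with `any(not (x > 0) for x in xs)` on this list.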
50.2 2. Linear Algebra
Linear algebra supplies the data structures of machine learning. A vector \(\mathbf{x} \in \mathbb{R}^n\) is a point in \(n\)-dimensional space, written as a column by convention, and a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) maps vectors in \(\mathbb{R}^n\) to vectors in \(\mathbb{R}^m\).
| Symbol | Meaning |
|---|---|
| \(\mathbf{x}, \mathbf{y}\) | Column vectors |
| \(x_i\) | \(i\)th component of vector \(\mathbf{x}\) |
| \(\mathbf{A}, \mathbf{B}\) | Matrices |
| \(A_{ij}\) | Entry in row \(i\), column \(j\) of \(\mathbf{A}\) |
| \(\mathbf{a}_j\) | \(j\)th column of \(\mathbf{A}\) |
| \(\mathbf{0}, \mathbf{1}\) | Vector or matrix of all zeros, all ones |
| \(\mathbf{I}_n\) | \(n \times n\) identity matrix |
| \(\operatorname{diag}(\mathbf{x})\) | Diagonal matrix with \(\mathbf{x}\) on the diagonal |
The operations below combine and transform these objects. The inner product measures alignment, the outer product builds rank-one matrices, and the inverse undoes a linear map when it exists.
| Symbol | Meaning |
|---|---|
| \(\mathbf{A}^\top\) | Transpose of \(\mathbf{A}\) |
| \(\mathbf{A}^{-1}\) | Inverse of \(\mathbf{A}\) |
| \(\mathbf{A}^+\) | Moore-Penrose pseudoinverse |
| \(\mathbf{x}^\top \mathbf{y}\) | Inner (dot) product, equal to \(\sum_i x_i y_i\) |
| \(\langle \mathbf{x}, \mathbf{y} \rangle\) | Inner product (alternative notation) |
| \(\mathbf{x} \mathbf{y}^\top\) | Outer product (rank-one matrix) |
| \(\mathbf{A} \mathbf{B}\) | Matrix product |
| \(\mathbf{x} \odot \mathbf{y}\) | Elementwise (Hadamard) product |
| \(\mathbf{x} \otimes \mathbf{y}\) | Kronecker (tensor) product |
| \(\operatorname{tr}(\mathbf{A})\) | Trace (sum of diagonal entries) |
| \(\det(\mathbf{A})\) | Determinant |
| \(\operatorname{rank}(\mathbf{A})\) | Rank (dimension of column space) |
| \(\operatorname{span}(\cdot)\) | Linear span of a set of vectors |
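Several of the operations above can be spelled out in a few lines of plain Python, which makes the index conventions explicit. The sketch below represents vectors as lists and matrices as lists of rows; that representation and the function names are assumptions made for illustration, and a numerical library would normally be used in practice.

```python
def dot(x, y):
    """Inner product x^T y, the sum over i of x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))


def outer(x, y):
    """Outer product x y^T: entry (i, j) is x_i * y_j."""
    return [[xi * yj for yj in y] for xi in x]


def hadamard(x, y):
    """Elementwise (Hadamard) product of two vectors of equal length."""
    return [xi * yi for xi, yi in zip(x, y)]


def transpose(A):
    """Transpose: rows of A become columns."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Matrix product AB: entry (i, j) is row i of A dotted with column j of B."""
    columns_of_B = transpose(B)
    return [[dot(row, col) for col in columns_of_B] for row in A]


def trace(A):
    """Trace: sum of the diagonal entries of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
```

Two identities are easy to confirm with these helpers: the trace of the outer product equals the inner product, and the transpose of a product reverses the order of the factors.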
Norms quantify the size of a vector or matrix and underpin regularization and distance metrics. Eigenvalues and singular values expose the geometry of a linear map.
| Symbol | Meaning |
|---|---|
| \(\lVert \mathbf{x} \rVert_2\) | Euclidean (\(\ell_2\)) norm, \(\sqrt{\sum_i x_i^2}\) |
| \(\lVert \mathbf{x} \rVert_1\) | \(\ell_1\) norm, \(\sum_i \lvert x_i \rvert\) |
| \(\lVert \mathbf{x} \rVert_p\) | \(\ell_p\) norm, \(\left(\sum_i \lvert x_i \rvert^p\right)^{1/p}\) |
| \(\lVert \mathbf{x} \rVert_\infty\) | Max norm, \(\max_i \lvert x_i \rvert\) |
| \(\lVert \mathbf{A} \rVert_F\) | Frobenius norm, \(\sqrt{\sum_{ij} A_{ij}^2}\) |
| \(\lambda_i(\mathbf{A})\) | \(i\)th eigenvalue of \(\mathbf{A}\) |
| \(\sigma_i(\mathbf{A})\) | \(i\)th singular value of \(\mathbf{A}\) |
| \(\mathbf{A} \succeq 0\) | \(\mathbf{A}\) is positive semidefinite |
| \(\mathbf{A} \succ 0\) | \(\mathbf{A}\) is positive definite |
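The norm definitions in the table translate almost symbol for symbol into code. The following sketch uses plain Python lists; the function names `lp_norm`, `max_norm`, and `frobenius_norm` are hypothetical and chosen only to mirror the table.

```python
import math


def lp_norm(x, p):
    """l_p norm: (sum over i of |x_i| ** p) ** (1 / p), for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def max_norm(x):
    """Max norm: the largest absolute component."""
    return max(abs(xi) for xi in x)


def frobenius_norm(A):
    """Frobenius norm: square root of the sum of squared entries (real A)."""
    return math.sqrt(sum(a * a for row in A for a in row))


x = [3.0, -4.0]

l1 = lp_norm(x, 1)      # |3| + |-4|
l2 = lp_norm(x, 2)      # sqrt(3^2 + 4^2)
linf = max_norm(x)      # max(|3|, |-4|)
```

For any vector the three norms are ordered as \(\lVert \mathbf{x} \rVert_\infty \le \lVert \mathbf{x} \rVert_2 \le \lVert \mathbf{x} \rVert_1\), and the \(\ell_p\) norm approaches the max norm as \(p\) grows.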
50.3 3. Calculus
Calculus describes how quantities change, and gradient-based optimization is the engine of modern learning. For a scalar-valued function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient collects all first-order partial derivatives into a vector that points in the direction of steepest ascent.
| Symbol | Meaning |
|---|---|
| \(\dfrac{df}{dx}\) | Derivative of \(f\) with respect to \(x\) |
| \(\dfrac{\partial f}{\partial x_i}\) | Partial derivative with respect to \(x_i\) |
| \(f'(x), f''(x)\) | First and second derivatives |
| \(\nabla f(\mathbf{x})\) | Gradient, vector of partials \(\left[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right]^\top\) |
| \(\nabla_{\boldsymbol{\theta}} f\) | Gradient taken with respect to parameters \(\boldsymbol{\theta}\) |
| \(\nabla^2 f(\mathbf{x})\) | Hessian, matrix of second partials |
| \(H_{ij} = \dfrac{\partial^2 f}{\partial x_i \partial x_j}\) | Entry \((i, j)\) of the Hessian \(\mathbf{H}\) |
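To make the gradient concrete, it can be approximated one partial derivative at a time with central differences and compared against a hand-derived formula. This is a sketch under stated assumptions: the test function `f`, its analytic gradient `grad_f`, and the step size `h` are all hypothetical choices for illustration.

```python
def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x with central differences.

    Component i estimates the partial derivative with respect to x_i as
    (f(x + h e_i) - f(x - h e_i)) / (2 h), where e_i is the i-th basis vector.
    """
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad


def f(x):
    """Example scalar-valued function f(x) = x_1^2 + 3 x_1 x_2."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def grad_f(x):
    """Analytic gradient of f: [2 x_1 + 3 x_2, 3 x_1]."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


point = [1.0, 2.0]
approx = numerical_gradient(f, point)
exact = grad_f(point)
```

Comparing `approx` with `exact` is a common sanity check when a gradient has been derived by hand.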
For a vector-valued function \(\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian generalizes the gradient and records how every output responds to every input. The remaining symbols cover integration and limiting behavior.
| Symbol | Meaning |
|---|---|
| \(\mathbf{J}\) or \(\dfrac{\partial \mathbf{f}}{\partial \mathbf{x}}\) | Jacobian, with \(J_{ij} = \dfrac{\partial f_i}{\partial x_j}\) |
| \(\displaystyle\int_a^b f(x)\, dx\) | Definite integral over \([a, b]\) |
| \(\displaystyle\int f\, d\mu\) | Integral with respect to measure \(\mu\) |
| \(\lim_{x \to a} f(x)\) | Limit of \(f\) as \(x\) approaches \(a\) |
| \(\sum, \prod\) | Summation, product over an index |
| \(\mathcal{O}(\cdot), o(\cdot)\) | Big O and little o asymptotic bounds |
| \(\circ\) | Function composition, \((f \circ g)(x) = f(g(x))\) |
The chain rule, written \(\frac{d}{dx} f(g(x)) = f'(g(x)) g'(x)\), is the basis of backpropagation, where it is applied recursively across the layers of a network.
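The chain rule stated above can be checked numerically on a simple composition. The sketch below assumes \(f = \sin\) and \(g(x) = x^2\), both chosen only for illustration, and compares the chain-rule derivative with a finite-difference estimate.

```python
import math


def g(x):
    """Inner function g(x) = x^2."""
    return x ** 2


def g_prime(x):
    """Derivative of the inner function: g'(x) = 2x."""
    return 2.0 * x


def composite(x):
    """The composition (f o g)(x) = sin(x^2)."""
    return math.sin(g(x))


def composite_prime(x):
    """Chain rule: f'(g(x)) * g'(x), with f = sin and f' = cos."""
    return math.cos(g(x)) * g_prime(x)


def central_difference(fn, x, h=1e-6):
    """Finite-difference estimate of the derivative of fn at x."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


x0 = 0.7
analytic = composite_prime(x0)
numeric = central_difference(composite, x0)
```

Backpropagation applies the same product of local derivatives repeatedly, once per layer, rather than to a single pair of functions.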
50.4 4. Probability and Statistics
Probability formalizes uncertainty. A random variable maps outcomes to numbers, a distribution assigns probability mass or density to those numbers, and expectation summarizes a distribution by its average behavior. We write random variables in uppercase and their realized values in lowercase.
| Symbol | Meaning |
|---|---|
| \(\Omega\) | Sample space of outcomes |
| \(P(A)\) | Probability of event \(A\) |
| \(X, Y\) | Random variables |
| \(P(X = x)\) | Probability mass at \(x\) (discrete) |
| \(p(x)\) | Probability density at \(x\) (continuous) |
| \(P(X \mid Y)\) | Conditional distribution of \(X\) given \(Y\) |
| \(X \sim \mathcal{D}\) | \(X\) is distributed according to \(\mathcal{D}\) |
| \(X \perp Y\) | \(X\) and \(Y\) are independent |
| \(\text{i.i.d.}\) | Independent and identically distributed |
Expectation, variance, and their relatives are the workhorses of statistical analysis. The estimator notation distinguishes an unknown true quantity from the value computed from a finite sample.
| Symbol | Meaning |
|---|---|
| \(\mathbb{E}[X]\) | Expected value of \(X\) |
| \(\mathbb{E}_{x \sim p}[f(x)]\) | Expectation of \(f\) under distribution \(p\) |
| \(\operatorname{Var}(X)\) | Variance, \(\mathbb{E}[(X - \mathbb{E}[X])^2]\) |
| \(\operatorname{Cov}(X, Y)\) | Covariance between \(X\) and \(Y\) |
| \(\operatorname{Corr}(X, Y)\) | Pearson correlation coefficient |
| \(\boldsymbol{\Sigma}\) | Covariance matrix |
| \(\mu, \sigma, \sigma^2\) | Mean, standard deviation, variance |
| \(\bar{x}\) | Sample mean |
| \(\hat{\theta}\) | Estimator of parameter \(\theta\) |
| \(\theta^\ast\) | True or optimal parameter value |
| \(\mathcal{L}(\theta)\) | Likelihood of parameters \(\theta\) |
| \(\ell(\theta)\) | Log-likelihood, \(\log \mathcal{L}(\theta)\) |
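The summary statistics in the table can be computed directly from their definitions. The sketch below makes one assumption worth stating loudly: it uses the population form with a \(1/n\) factor, whereas many libraries default to the \(1/(n-1)\) sample form. The data and function names are hypothetical.

```python
import math


def mean(xs):
    """Sample mean: the average of the observations."""
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Covariance with a 1/n factor (population form, an assumption here)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def variance(xs):
    """Variance is the covariance of a variable with itself."""
    return covariance(xs, xs)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly 2 * xs, so correlation is 1
```

Because `ys` is an exact positive multiple of `xs`, the correlation comes out as 1 up to floating-point rounding, while the covariance still depends on the scale of the data.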
Several distributions recur often enough to merit a dedicated notation. The table lists their standard names and parameters.
| Symbol | Meaning |
|---|---|
| \(\mathcal{N}(\mu, \sigma^2)\) | Gaussian (normal) with mean \(\mu\), variance \(\sigma^2\) |
| \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) | Multivariate Gaussian |
| \(\operatorname{Bern}(p)\) | Bernoulli with success probability \(p\) |
| \(\operatorname{Bin}(n, p)\) | Binomial over \(n\) trials |
| \(\operatorname{Cat}(\boldsymbol{\pi})\) | Categorical with class probabilities \(\boldsymbol{\pi}\) |
| \(\operatorname{Unif}(a, b)\) | Uniform on \([a, b]\) |
| \(\operatorname{Pois}(\lambda)\) | Poisson with rate \(\lambda\) |
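As a small worked example of how these parameters enter a formula, the binomial probability mass function can be written out and checked against two facts: the masses sum to one, and the mean is \(np\). The function name `binomial_pmf` and the values of `n` and `p` are hypothetical choices for this sketch.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


n, p = 10, 0.3

# Probability mass at every possible count k = 0, ..., n.
masses = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(masses)                                      # should be 1
expected = sum(k * m for k, m in zip(range(n + 1), masses))  # should be n * p
```

With \(n = 1\) the same function reduces to the Bernoulli distribution, since `binomial_pmf(1, 1, p)` is just `p`.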
50.5 5. Information Theory
Information theory measures uncertainty and the cost of communication, and it provides the loss functions used to train probabilistic models. Entropy quantifies the average surprise of a distribution, while cross-entropy and divergence compare two distributions.
| Symbol | Meaning |
|---|---|
| \(H(X)\) | Entropy, \(-\sum_x p(x) \log p(x)\) |
| \(H(X \mid Y)\) | Conditional entropy |
| \(H(p, q)\) | Cross-entropy between \(p\) and \(q\), \(-\sum_x p(x) \log q(x)\) |
| \(D_{\mathrm{KL}}(p \,\Vert\, q)\) | Kullback-Leibler divergence from \(q\) to \(p\) |
| \(I(X; Y)\) | Mutual information between \(X\) and \(Y\) |
| \(\operatorname{JSD}(p \,\Vert\, q)\) | Jensen-Shannon divergence |
| \(\log_2, \ln\) | Logarithm base \(2\) (bits), natural log (nats) |
The KL divergence is defined as \(D_{\mathrm{KL}}(p \,\Vert\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\). It is non-negative and equals zero exactly when \(p\) and \(q\) agree, which is why minimizing it pulls a model distribution toward the data distribution.
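The definitions above can be checked on a small discrete example. The sketch below uses natural logarithms (nats) and assumes that \(q(x) > 0\) wherever \(p(x) > 0\); the distributions `p` and `q` and the function names are hypothetical.

```python
import math


def entropy(p):
    """H(p) = -sum over x of p(x) log p(x), skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum over x of p(x) log q(x); assumes q(x) > 0 where p(x) > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum over x of p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]
q = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
```

Two properties are visible here: cross-entropy decomposes as \(H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\Vert\, q)\), and the divergence is not symmetric, so `kl_pq` and `kl_qp` differ.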
50.6 6. Optimization
Optimization is the act of choosing parameters to minimize a loss. The notation below frames a generic problem: minimize an objective over a feasible region, possibly subject to constraints. Most training procedures are instances of this template.
| Symbol | Meaning |
|---|---|
| \(\min_{\mathbf{x}} f(\mathbf{x})\) | Minimize \(f\) over \(\mathbf{x}\) |
| \(\max_{\mathbf{x}} f(\mathbf{x})\) | Maximize \(f\) over \(\mathbf{x}\) |
| \(\arg\min_{\mathbf{x}} f(\mathbf{x})\) | Argument that attains the minimum |
| \(\arg\max_{\mathbf{x}} f(\mathbf{x})\) | Argument that attains the maximum |
| \(\text{s.t.}\) | Subject to (constraints follow) |
| \(f(\mathbf{x})\) | Objective or loss function |
| \(J(\boldsymbol{\theta})\) | Cost function over parameters |
| \(\boldsymbol{\theta}_t\) | Parameter vector at iteration \(t\) |
| \(\eta\) or \(\alpha\) | Learning rate (step size) |
| \(\lambda\) | Regularization strength |
| \(\mathbf{g}_t = \nabla J(\boldsymbol{\theta}_t)\) | Gradient at iteration \(t\) |
| \(\mathcal{L}(\mathbf{x}, \boldsymbol{\mu})\) | Lagrangian with multipliers \(\boldsymbol{\mu}\) |
A canonical gradient descent update reads \(\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla J(\boldsymbol{\theta}_t)\), moving the parameters a small step in the direction that most rapidly decreases the cost. Stochastic variants replace the full gradient with an estimate computed on a minibatch, trading exactness for speed.
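The update rule above can be sketched in a few lines on a toy problem. The example assumes a quadratic cost \(J(\boldsymbol{\theta}) = \tfrac{1}{2} \lVert \boldsymbol{\theta} - \mathbf{c} \rVert_2^2\) with a known minimizer \(\mathbf{c}\); the cost, the constant `TARGET`, and the hyperparameter values are hypothetical choices for illustration, not a recipe for real training.

```python
# Known minimizer c of the toy cost (hypothetical value for illustration).
TARGET = [3.0, -1.0]


def cost(theta):
    """Toy cost J(theta) = 0.5 * ||theta - c||^2."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, TARGET))


def grad_cost(theta):
    """Gradient of the toy cost: theta - c."""
    return [t - c for t, c in zip(theta, TARGET)]


def gradient_descent(theta0, eta, num_steps):
    """Repeat theta_{t+1} = theta_t - eta * grad J(theta_t)."""
    theta = list(theta0)
    for _ in range(num_steps):
        g = grad_cost(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta


theta_final = gradient_descent(theta0=[0.0, 0.0], eta=0.1, num_steps=200)
```

On this cost each step shrinks the distance to the minimizer by a factor of \(1 - \eta\), so a step size of 0.1 converges steadily, while a step size above 2 would diverge.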
| Symbol | Meaning |
|---|---|
| \(\mathcal{B}\) | Minibatch of training examples |
| \(\nabla_{\boldsymbol{\theta}} \mathcal{L}\) | Loss gradient with respect to parameters |
| \(\preceq, \succeq\) | Componentwise or matrix inequality |
| \(\Pi_{\mathcal{C}}(\cdot)\) | Projection onto feasible set \(\mathcal{C}\) |
| \(\partial f(\mathbf{x})\) | Subdifferential at \(\mathbf{x}\) |
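As one concrete instance of the projection \(\Pi_{\mathcal{C}}\), suppose the feasible set is a box in which every coordinate lies in \([lo, hi]\). The box constraint is an assumption made for this sketch; projections onto other sets generally need different formulas. The function names are hypothetical.

```python
def project_onto_box(x, lo, hi):
    """Euclidean projection onto the box {x : lo <= x_i <= hi for all i}.

    For a box, the nearest feasible point is found by clipping each
    coordinate independently.
    """
    return [min(max(xi, lo), hi) for xi in x]


def projected_gradient_step(theta, grad, eta, lo, hi):
    """Take a gradient step, then project back onto the feasible box."""
    stepped = [t - eta * g for t, g in zip(theta, grad)]
    return project_onto_box(stepped, lo, hi)


clipped = project_onto_box([-2.0, 0.5, 3.0], lo=0.0, hi=1.0)
next_theta = projected_gradient_step([0.9], grad=[-1.0], eta=0.5, lo=0.0, hi=1.0)
```

Projection is idempotent: applying it to a point that is already feasible returns that point unchanged.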
A function is convex when its domain is convex and every chord lies on or above the graph, a property that guarantees any local minimum is global. Convexity is the dividing line between problems we can solve reliably and the non-convex landscapes typical of deep networks.
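The chord condition can be written as \(f(t x + (1 - t) y) \le t f(x) + (1 - t) f(y)\) for all \(t \in [0, 1]\), and it is easy to probe numerically in one dimension. The sketch below is only a spot check under stated assumptions: the helper `chord_gap` and the two test functions are hypothetical, and passing at finitely many points can refute convexity but never prove it.

```python
import math


def chord_gap(f, x, y, t):
    """Height of the chord above the graph at the point t*x + (1-t)*y.

    Equals t*f(x) + (1-t)*f(y) - f(t*x + (1-t)*y), which is
    non-negative for every x, y, and t in [0, 1] when f is convex.
    """
    return t * f(x) + (1.0 - t) * f(y) - f(t * x + (1.0 - t) * y)


def square(z):
    """A convex function: z^2."""
    return z * z


ts = [i / 10.0 for i in range(11)]

# For z^2 the gap is t * (1 - t) * (x - y)^2, which is never negative.
square_gaps = [chord_gap(square, -1.0, 3.0, t) for t in ts]

# sin is concave on [0, pi], so the chord falls below the graph there.
sine_gap = chord_gap(math.sin, 0.0, math.pi, 0.5)
```

A negative gap, as with the sine example, is a certificate that the function is not convex on that interval.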
50.7 Further Reading
- Goodfellow, Bengio, and Courville, Deep Learning (2016), notation chapter.
- Bishop, Pattern Recognition and Machine Learning (2006), appendices.
- Strang, Introduction to Linear Algebra (2016).
- Boyd and Vandenberghe, Convex Optimization (2004).
- Cover and Thomas, Elements of Information Theory (2006).
- Murphy, Probabilistic Machine Learning: An Introduction (2022).