81 The Classical Machine Learning Paradigm

Before transformers, before backpropagation through hundred billion parameter networks, there was a clean and durable idea: a learning algorithm is a procedure that takes data and returns a function. The classical machine learning paradigm formalizes this idea with enough precision that we can reason about when learning works, why it generalizes, and how to trade off competing concerns. This chapter develops that formalism, the supervised learning setup, empirical risk minimization, loss functions, hypothesis classes, and generalization theory, and then argues that this machinery remains indispensable in the deep learning era rather than a historical curiosity.

81.1 1. The Supervised Learning Setup

81.1.1 1.1 Inputs, outputs, and the data distribution

Supervised learning begins with an input space $\mathcal{X}$ and an output space $\mathcal{Y}$. We assume there exists an unknown joint probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. A training set is a collection of $n$ examples

\[ S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \]

drawn independently and identically distributed (i.i.d.) from $\mathcal{D}$. The i.i.d. assumption is the load bearing assumption of the entire framework. It says the data we train on and the data we will eventually be evaluated on come from the same source, and that examples do not influence one another. When this assumption breaks, through distribution shift, temporal correlation, or selection bias, the guarantees discussed below weaken or vanish.

The goal is to produce a predictor $h: \mathcal{X} \to \mathcal{Y}$ that maps inputs to outputs well, where “well” is defined precisely in Section 2. When $\mathcal{Y}$ is a finite set of labels we call the task classification; when $\mathcal{Y} = \mathbb{R}$ or $\mathbb{R}^k$ we call it regression. The same theory covers both.

The full pipeline that this chapter formalizes can be read off a single diagram. The world supplies an unknown distribution, we observe only a finite i.i.d. sample, an algorithm searches a restricted hypothesis class for the predictor of lowest empirical (and possibly regularized) risk, and we judge the result by its true risk on fresh draws from the same distribution.

flowchart LR
    D["Unknown distribution D over X times Y"] -->|"i.i.d. sample"| S["Training set S of n examples"]
    S --> A["Learning algorithm: ERM over hypothesis class H"]
    H["Hypothesis class H (inductive bias)"] --> A
    L["Loss function l"] --> A
    A --> HH["Learned predictor h-hat"]
    HH -->|"evaluate on fresh draws"| R["True risk R(h-hat)"]
    D -.->|"defines"| R

81.1.2 1.2 What we are really estimating

It is tempting to think the target is a deterministic function $f^\star(x)$, but the world is noisier than that. The same input $x$ can be associated with different labels $y$. The framework handles this by working with the conditional distribution $\mathcal{D}(y \mid x)$. The best possible deterministic predictor under squared loss, for instance, is the conditional mean $\mathbb{E}[y \mid x]$, and under zero one loss it is the most probable label $\arg\max_y \mathcal{D}(y \mid x)$, called the Bayes optimal classifier. No algorithm can do better than the Bayes predictor on the underlying distribution, and the residual error it incurs, the Bayes error, is irreducible. Recognizing irreducible error keeps us honest about when more data or a bigger model will help and when it will not.

To see why the conditional mean is optimal under squared loss, write any predictor’s risk as $\mathbb{E}[(h(x) - y)^2]$ and condition on $x$. For a fixed $x$, let $m(x) = \mathbb{E}[y \mid x]$. Then \[ \mathbb{E}\big[(h(x) - y)^2 \mid x\big] = \big(h(x) - m(x)\big)^2 + \mathbb{E}\big[(y - m(x))^2 \mid x\big], \] because the cross term $2(h(x) - m(x))\,\mathbb{E}[y - m(x) \mid x]$ vanishes by the definition of $m(x)$. The first piece is nonnegative and is zeroed exactly by the choice $h(x) = m(x)$; the second piece does not depend on $h$ at all. The second term, averaged over $x$, is precisely the irreducible Bayes error $\sigma^2$ that reappears in the bias variance decomposition of Section 4.3.

81.2 2. Empirical Risk Minimization

81.2.1 2.1 Risk, the quantity we care about

Fix a loss function $\ell(\hat{y}, y) \geq 0$ that measures the penalty for predicting $\hat{y}$ when the truth is $y$. The true risk, also called the generalization error or population risk, of a predictor $h$ is its expected loss over the data distribution:

\[ R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(h(x), y) \big]. \]

This is the number we ultimately want to minimize. The problem is that $\mathcal{D}$ is unknown, so $R(h)$ cannot be computed directly. We only have the sample $S$.

81.2.2 2.2 The empirical risk

The natural surrogate is the average loss on the training set, the empirical risk:

\[ \hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i). \]

By the law of large numbers, for any fixed $h$ the empirical risk converges to the true risk as $n$ grows. Empirical risk minimization (ERM) is the principle of choosing the predictor that minimizes this quantity within some restricted set of candidate functions $\mathcal{H}$, the hypothesis class:

\[ \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h). \]

Nearly every supervised algorithm you have used, linear regression, logistic regression, support vector machines, decision trees, and the training loop of a neural network, is an instance or approximation of ERM, often with an added regularization term discussed in Section 5.

81.2.3 2.3 Why restricting the hypothesis class is essential

If we allowed $\mathcal{H}$ to contain every possible function, ERM would simply memorize: build a lookup table returning $y_i$ for each $x_i$ and arbitrary values elsewhere. This achieves zero empirical risk while learning nothing about unseen inputs. The art of ERM is choosing $\mathcal{H}$ rich enough to express good predictors but constrained enough that low empirical risk implies low true risk. This tension is the central theme of the paradigm and is made quantitative in Section 4.

# ERM in pseudocode: search a hypothesis class for the
# function with lowest average training loss.
def erm(hypothesis_class, S, loss):
    best, best_risk = None, float("inf")
    for h in hypothesis_class:
        risk = sum(loss(h(x), y) for x, y in S) / len(S)
        if risk < best_risk:
            best, best_risk = h, risk
    return best

81.3 3. Loss Functions

81.3.1 3.1 The role of the loss

The loss function encodes what we mean by a good prediction. Choosing it is a modeling decision, not a mathematical inevitability, and different losses lead to different optimal predictors and different sensitivities to outliers. Two desiderata usually compete: fidelity to the real world cost of mistakes, and tractability for optimization, especially differentiability and convexity.

81.3.2 3.2 Regression losses

For real valued targets the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$ is standard. It is smooth, convex, and its minimizer is the conditional mean, but it penalizes large errors quadratically and is therefore sensitive to outliers. The absolute loss $|\hat{y} - y|$ targets the conditional median and is more robust, at the cost of nondifferentiability at zero. The Huber loss interpolates between the two, behaving quadratically for small residuals and linearly for large ones, which gives robustness without sacrificing smoothness near the optimum.

81.3.3 3.3 Classification losses

The loss we truly care about in classification is often the zero one loss, $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$, which counts mistakes. Unfortunately it is nonconvex and discontinuous, so direct minimization is NP hard for most hypothesis classes. Practical algorithms minimize a convex surrogate that upper bounds the zero one loss. The cross entropy or logistic loss,

\[ \ell(\hat{p}, y) = -\sum_{c} \mathbb{1}[y = c] \log \hat{p}_c, \]

where $\hat{p}$ is a predicted probability vector, is the workhorse for both logistic regression and modern neural classifiers. It is not an arbitrary choice: minimizing average cross entropy is exactly maximizing the likelihood of the labels under the model’s predicted probabilities, so it inherits the asymptotic guarantees of maximum likelihood estimation. The hinge loss $\max(0, 1 - y \cdot \hat{y})$ underlies support vector machines and induces a margin, penalizing predictions that are correct but not confident. The exponential loss $\exp(-y \cdot \hat{y})$ drives AdaBoost. Each is a surrogate chosen because it is convex and (Fisher) consistent, meaning that the predictor minimizing the surrogate risk over all functions agrees in sign with the Bayes classifier, so minimizing it drives the predictor toward Bayes optimal as data grows.

81.3.4 3.4 Losses encode values

A loss function is also where domain priorities enter. Class weighting, asymmetric costs for false positives versus false negatives in medical screening, and quantile losses for risk aware forecasting all live here. The lesson is that the loss is a place to express what matters, not a setting to leave at its default.

81.4 4. Hypothesis Classes and Capacity

81.4.1 4.1 What a hypothesis class is

A hypothesis class $\mathcal{H}$ is the set of functions the learning algorithm is permitted to return. Linear predictors $h(x) = w^\top x + b$ form one class; degree $d$ polynomials form a larger one; decision trees of depth $k$, kernel machines, and neural networks of a given architecture each define their own. The choice of $\mathcal{H}$ embeds our inductive bias, the assumptions about the world that let finite data generalize to infinitely many unseen inputs. Without inductive bias, learning from finite samples is impossible, a fact formalized by the no free lunch theorem: averaged over all possible data distributions, no learner beats random guessing.

81.4.2 4.2 Measuring capacity

The richness of a hypothesis class is its capacity. For binary classification the classical measure is the Vapnik Chervonenkis (VC) dimension. A set of points is shattered by $\mathcal{H}$ if, for every one of the $2^m$ possible labelings of $m$ points, some $h \in \mathcal{H}$ realizes that labeling. The VC dimension $\mathrm{VC}(\mathcal{H})$ is the size of the largest set that $\mathcal{H}$ can shatter.

As a concrete example, consider thresholded linear classifiers in the plane, $h(x) = \mathrm{sign}(w^\top x + b)$ with $x \in \mathbb{R}^2$. Any three points in general position (not collinear) can be split in all $2^3 = 8$ ways by some line, so three points are shattered. But no set of four points can be shattered: by Radon’s theorem any four points in the plane can be partitioned into two subsets whose convex hulls intersect, and that particular labeling (one subset positive, the other negative) cannot be separated by any straight line. Hence $\mathrm{VC} = 3$, matching the general formula $d + 1$ for linear classifiers in $\mathbb{R}^d$.

Higher capacity means more expressive power but also a greater risk of fitting noise. Related measures include Rademacher complexity, which captures how well a class can correlate with random $\pm 1$ labels, and covering numbers. These are not just theoretical ornaments; they appear directly in the bounds that connect training error to test error.

81.4.3 4.3 The bias variance decomposition

Capacity controls a fundamental tradeoff. Treat the learned predictor $\hat{h}$ as random because it depends on the random training sample $S$. For a fixed test point $x$, write $\bar{h}(x) = \mathbb{E}_S[\hat{h}(x)]$ for the average prediction across resamples. The expected squared error decomposes as

\[ \mathbb{E}_{S, y}\big[(\hat{h}(x) - y)^2\big] = \underbrace{\big(\bar{h}(x) - m(x)\big)^2}_{\text{Bias}^2,\ \text{too rigid}} + \underbrace{\mathbb{E}_S\big[(\hat{h}(x) - \bar{h}(x))^2\big]}_{\text{Variance},\ \text{too sensitive}} + \underbrace{\sigma^2}_{\text{irreducible}}, \]

where $m(x) = \mathbb{E}[y \mid x]$ is the Bayes regressor from Section 1.2 and $\sigma^2$ is its irreducible noise. The bias term measures how far the class is systematically displaced from the truth; the variance term measures how much the fit jitters as the sample changes. A low capacity class has high bias: it cannot represent the truth and underfits. A high capacity class has high variance: it chases noise in the particular training sample and overfits. The classical prescription is to tune capacity to the point where the sum is minimized, the sweet spot of the U shaped test error curve. Section 6 revisits how deep learning complicates this picture without overturning it.

81.4.4 4.4 A worked example: fitting a polynomial

Make the tradeoff concrete. Suppose the truth is $m(x) = \sin(2\pi x)$ on $[0, 1]$ and we observe $n$ noisy points $y_i = \sin(2\pi x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Our hypothesis class is polynomials of degree $d$, fit by least squares, so $d$ is the capacity knob.

Degree $0$ or $1$. A constant or a line cannot bend to follow a full sine wave. Every resample produces nearly the same wrong curve, so variance is tiny but bias is large. This is underfitting.
Degree $3$ to $5$. A cubic or quintic tracks one period of the sine closely. Bias has collapsed and variance is still modest, so the total error is near its minimum. This is the sweet spot.
Degree $15$ with $n = 20$. The polynomial has nearly as many free coefficients as data points. It threads every noisy observation, oscillating wildly between them. Bias is essentially zero, but a single resampled point swings the curve dramatically, so variance explodes. This is overfitting.

The training error falls monotonically as $d$ rises, because a higher degree class contains every lower degree fit and can only reduce empirical risk. The test error instead traces the U shape: it falls while reducing bias dominates, then climbs once growing variance takes over. Cross validation (Section 5.3) is how we locate the bottom of that U without peeking at the test set. Adding an $L_2$ penalty on the coefficients (Section 5.2) lets us keep a high degree $d$ for expressiveness while $\lambda$ reins in the variance, which is often more effective than discretely choosing $d$.

81.5 5. Generalization

81.5.1 5.1 The generalization gap

Define the generalization gap as the difference between true and empirical risk, $R(\hat{h}) - \hat{R}_S(\hat{h})$. ERM controls the empirical term by construction; generalization theory bounds the gap. A representative uniform convergence result states that, with probability at least $1 - \delta$ over the draw of the sample, simultaneously for all $h \in \mathcal{H}$,

\[ R(h) \leq \hat{R}_S(h) + \mathcal{O}\!\left( \sqrt{\frac{\text{capacity}(\mathcal{H}) + \log(1/\delta)}{n}} \right). \]

Read this bound qualitatively. The gap shrinks as the sample size $n$ grows and widens as the capacity of $\mathcal{H}$ grows. More data helps; more flexible models cost you, unless you have the data to pay for them. This single inequality justifies the entire practice of holding out validation data, limiting model complexity, and gathering more examples.

Making the capacity term concrete, a standard VC bound for binary classification with the zero one loss states that with probability at least $1 - \delta$, for all $h \in \mathcal{H}$,

\[ R(h) \leq \hat{R}_S(h) + \sqrt{\frac{8\big(\mathrm{VC}(\mathcal{H})\,\log\frac{2en}{\mathrm{VC}(\mathcal{H})} + \log\frac{4}{\delta}\big)}{n}}. \]

The qualitative shape is exactly the inequality above: the numerator grows roughly linearly in the VC dimension and the denominator is the sample size, so the gap behaves like $\sqrt{\mathrm{VC}(\mathcal{H}) / n}$ up to logarithmic factors. A useful rule of thumb that falls out of such bounds is that the sample size needed for reliable generalization scales with the capacity of the class: you want $n$ comfortably larger than $\mathrm{VC}(\mathcal{H})$.

81.5.2 5.2 Regularization as capacity control

In practice we rarely enumerate hypothesis classes by hand. Instead we add a penalty to the empirical risk that discourages complex solutions, regularized risk minimization:

\[ \hat{h} = \arg\min_{h \in \mathcal{H}} \ \hat{R}_S(h) + \lambda \, \Omega(h). \]

The penalty $\Omega(h)$ might be the squared norm of the weights ($L_2$, ridge), the absolute norm ($L_1$, lasso, which also induces sparsity and feature selection), or structural constraints like tree depth. The hyperparameter $\lambda$ continuously trades empirical fit against capacity, effectively shrinking the hypothesis class. This is the operational form of the bias variance tradeoff and the most common knob a practitioner actually turns.

81.5.3 5.3 Estimating generalization honestly

Theory gives the shape of the bounds, but the constants are usually too loose to use directly. In practice we estimate generalization empirically. Hold out a test set never touched during training; use a separate validation set or cross validation to select hyperparameters. The cardinal sin is letting test information leak into model selection, which silently inflates capacity usage and breaks the guarantee. Cross validation, repeatedly partitioning the data into training and validation folds, gives a lower variance estimate of true risk when data is scarce.

# k-fold cross validation: estimate true risk without
# touching the held-out test set.
def cross_val_risk(fit, S, loss, k=5):
    folds = partition(S, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = concat(folds[:i] + folds[i+1:])
        h = fit(train)
        scores.append(mean(loss(h(x), y) for x, y in val))
    return mean(scores)

81.6 6. Why Classical ML Still Matters

81.6.1 6.1 The vocabulary did not change

Deep learning did not repeal empirical risk minimization; it is a spectacular instance of it. A neural network defines a hypothesis class parameterized by its weights, stochastic gradient descent is an approximate ERM solver, and cross entropy is the same surrogate loss logistic regression uses. Every concept in this chapter, risk, loss, capacity, the generalization gap, regularization, applies verbatim to a transformer. An engineer who understands the classical paradigm reads a training curve, a learning rate schedule, or a weight decay setting as a member of a family they already know, rather than as folklore.

81.6.2 6.2 The deep learning puzzle, and what survives

Deep networks complicate one part of the classical story. Heavily overparameterized models, with far more parameters than training examples, can fit random labels perfectly yet still generalize on real data, and test error can descend again past the interpolation point, the double descent phenomenon. This shows that raw parameter count is the wrong capacity measure for these models. What survives is the framework itself. Researchers explain the puzzle with the same tools, the implicit regularization of gradient descent, margin based bounds, and norm based rather than count based capacity. The questions are classical even where the answers are new.

81.6.3 6.3 When classical models are the right tool

For tabular data, the bread and butter of business, finance, and the sciences, gradient boosted trees and regularized linear models routinely match or beat deep networks while training in seconds and offering interpretability. When data is scarce, when latency and compute budgets are tight, when a model must be audited or its decisions explained to a regulator, or when a strong baseline is needed before reaching for anything heavier, the classical toolkit wins on the merits. Reaching for a billion parameter model when ridge regression suffices is an engineering error, not a sign of sophistication.

81.6.4 6.4 Common pitfalls

The paradigm is most useful when its failure modes are kept in view.

Breaking the i.i.d. assumption silently. Time series, grouped data (multiple rows per patient or user), and any sample selected non-randomly violate the foundation of every bound here. Validation folds that split such data at random leak information and report optimistic error. Split by time or by group instead.
Test set leakage through repeated peeking. Each time the test set guides a decision it stops being a fair estimate of true risk. Selecting among many models on the same test set inflates capacity usage exactly as enlarging $\mathcal{H}$ does.
Preprocessing fit on the full data. Computing scaling, imputation, or feature selection on all rows before splitting leaks test statistics into training. Fit every transform inside the training fold only.
Optimizing the wrong loss. A model tuned on accuracy can be useless when classes are imbalanced or when false positives and false negatives carry different costs. Choose the loss and the evaluation metric to match the real decision, as Section 3.4 urges.
Reading parameter count as capacity. As Section 6.2 notes, for overparameterized models the number of weights is a poor proxy for effective capacity. Judge generalization by held out performance, not by model size.

81.6.5 6.5 The enduring discipline

Most of all, the classical paradigm supplies the discipline that keeps applied machine learning honest. Split your data. Choose a loss that reflects real costs. Watch the gap between training and validation error. Prefer the simplest hypothesis class that does the job. These habits predate deep learning, outlast each generation of architectures, and apply with equal force to a logistic regression and a frontier model. The paradigm is not a stage we have left behind. It is the foundation every later development stands on.

81.7 References

Vapnik, V. N. “The Nature of Statistical Learning Theory.” Springer, 1995. https://link.springer.com/book/10.1007/978-1-4757-3264-1
Shalev-Shwartz, S. and Ben-David, S. “Understanding Machine Learning: From Theory to Algorithms.” Cambridge University Press, 2014. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/
Hastie, T., Tibshirani, R., and Friedman, J. “The Elements of Statistical Learning.” Springer, 2009. https://hastie.su.domains/ElemStatLearn/
Mohri, M., Rostamizadeh, A., and Talwalkar, A. “Foundations of Machine Learning.” MIT Press, 2018. https://cs.nyu.edu/~mohri/mlbook/
Bishop, C. M. “Pattern Recognition and Machine Learning.” Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
Wolpert, D. H. “The Lack of A Priori Distinctions Between Learning Algorithms.” Neural Computation, 1996. https://doi.org/10.1162/neco.1996.8.7.1341
Belkin, M., Hsu, D., Ma, S., and Mandal, S. “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off.” PNAS, 2019. https://www.pnas.org/doi/10.1073/pnas.1903070116
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. “Understanding Deep Learning Requires Rethinking Generalization.” ICLR, 2017. https://arxiv.org/abs/1611.03530
Grinsztajn, L., Oyallon, E., and Varoquaux, G. “Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?” NeurIPS, 2022. https://arxiv.org/abs/2207.08815
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. “Convexity, Classification, and Risk Bounds.” Journal of the American Statistical Association, 2006. https://doi.org/10.1198/016214505000000907
Vapnik, V. N. and Chervonenkis, A. Ya. “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability and Its Applications, 1971. https://doi.org/10.1137/1116025

# The Classical Machine Learning Paradigm Before transformers, before backpropagation through hundred billion parameter networks, there was a clean and durable idea: a learning algorithm is a procedure that takes data and returns a function. The classical machine learning paradigm formalizes this idea with enough precision that we can reason about when learning works, why it generalizes, and how to trade off competing concerns. This chapter develops that formalism, the supervised learning setup, empirical risk minimization, loss functions, hypothesis classes, and generalization theory, and then argues that this machinery remains indispensable in the deep learning era rather than a historical curiosity. ## 1. The Supervised Learning Setup ### 1.1 Inputs, outputs, and the data distribution Supervised learning begins with an input space $\mathcal{X}$ and an output space $\mathcal{Y}$. We assume there exists an unknown joint probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. A training set is a collection of $n$ examples $$ S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} $$ drawn independently and identically distributed (i.i.d.) from $\mathcal{D}$. The i.i.d. assumption is the load bearing assumption of the entire framework. It says the data we train on and the data we will eventually be evaluated on come from the same source, and that examples do not influence one another. When this assumption breaks, through distribution shift, temporal correlation, or selection bias, the guarantees discussed below weaken or vanish. The goal is to produce a predictor $h: \mathcal{X} \to \mathcal{Y}$ that maps inputs to outputs well, where "well" is defined precisely in Section 2. When $\mathcal{Y}$ is a finite set of labels we call the task classification; when $\mathcal{Y} = \mathbb{R}$ or $\mathbb{R}^k$ we call it regression. The same theory covers both. The full pipeline that this chapter formalizes can be read off a single diagram. The world supplies an unknown distribution, we observe only a finite i.i.d. sample, an algorithm searches a restricted hypothesis class for the predictor of lowest empirical (and possibly regularized) risk, and we judge the result by its true risk on fresh draws from the same distribution. ```{mermaid} flowchart LR D["Unknown distribution D over X times Y"] -->|"i.i.d. sample"| S["Training set S of n examples"] S --> A["Learning algorithm: ERM over hypothesis class H"] H["Hypothesis class H (inductive bias)"] --> A L["Loss function l"] --> A A --> HH["Learned predictor h-hat"] HH -->|"evaluate on fresh draws"| R["True risk R(h-hat)"] D -.->|"defines"| R ``` ### 1.2 What we are really estimating It is tempting to think the target is a deterministic function $f^\star(x)$, but the world is noisier than that. The same input $x$ can be associated with different labels $y$. The framework handles this by working with the conditional distribution $\mathcal{D}(y \mid x)$. The best possible deterministic predictor under squared loss, for instance, is the conditional mean $\mathbb{E}[y \mid x]$, and under zero one loss it is the most probable label $\arg\max_y \mathcal{D}(y \mid x)$, called the Bayes optimal classifier. No algorithm can do better than the Bayes predictor on the underlying distribution, and the residual error it incurs, the Bayes error, is irreducible. Recognizing irreducible error keeps us honest about when more data or a bigger model will help and when it will not. To see why the conditional mean is optimal under squared loss, write any predictor's risk as $\mathbb{E}[(h(x) - y)^2]$ and condition on $x$. For a fixed $x$, let $m(x) = \mathbb{E}[y \mid x]$. Then $$ \mathbb{E}\big[(h(x) - y)^2 \mid x\big] = \big(h(x) - m(x)\big)^2 + \mathbb{E}\big[(y - m(x))^2 \mid x\big], $$ because the cross term $2(h(x) - m(x))\,\mathbb{E}[y - m(x) \mid x]$ vanishes by the definition of $m(x)$. The first piece is nonnegative and is zeroed exactly by the choice $h(x) = m(x)$; the second piece does not depend on $h$ at all. The second term, averaged over $x$, is precisely the irreducible Bayes error $\sigma^2$ that reappears in the bias variance decomposition of Section 4.3. ## 2. Empirical Risk Minimization ### 2.1 Risk, the quantity we care about Fix a loss function $\ell(\hat{y}, y) \geq 0$ that measures the penalty for predicting $\hat{y}$ when the truth is $y$. The true risk, also called the generalization error or population risk, of a predictor $h$ is its expected loss over the data distribution: $$ R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(h(x), y) \big]. $$ This is the number we ultimately want to minimize. The problem is that $\mathcal{D}$ is unknown, so $R(h)$ cannot be computed directly. We only have the sample $S$. ### 2.2 The empirical risk The natural surrogate is the average loss on the training set, the empirical risk: $$ \hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i). $$ By the law of large numbers, for any fixed $h$ the empirical risk converges to the true risk as $n$ grows. Empirical risk minimization (ERM) is the principle of choosing the predictor that minimizes this quantity within some restricted set of candidate functions $\mathcal{H}$, the hypothesis class: $$ \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h). $$ Nearly every supervised algorithm you have used, linear regression, logistic regression, support vector machines, decision trees, and the training loop of a neural network, is an instance or approximation of ERM, often with an added regularization term discussed in Section 5. ### 2.3 Why restricting the hypothesis class is essential If we allowed $\mathcal{H}$ to contain every possible function, ERM would simply memorize: build a lookup table returning $y_i$ for each $x_i$ and arbitrary values elsewhere. This achieves zero empirical risk while learning nothing about unseen inputs. The art of ERM is choosing $\mathcal{H}$ rich enough to express good predictors but constrained enough that low empirical risk implies low true risk. This tension is the central theme of the paradigm and is made quantitative in Section 4. ```python # ERM in pseudocode: search a hypothesis class for the # function with lowest average training loss. def erm(hypothesis_class, S, loss): best, best_risk = None, float("inf") for h in hypothesis_class: risk = sum(loss(h(x), y) for x, y in S) / len(S) if risk < best_risk: best, best_risk = h, risk return best ``` ## 3. Loss Functions ### 3.1 The role of the loss The loss function encodes what we mean by a good prediction. Choosing it is a modeling decision, not a mathematical inevitability, and different losses lead to different optimal predictors and different sensitivities to outliers. Two desiderata usually compete: fidelity to the real world cost of mistakes, and tractability for optimization, especially differentiability and convexity. ### 3.2 Regression losses For real valued targets the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$ is standard. It is smooth, convex, and its minimizer is the conditional mean, but it penalizes large errors quadratically and is therefore sensitive to outliers. The absolute loss $|\hat{y} - y|$ targets the conditional median and is more robust, at the cost of nondifferentiability at zero. The Huber loss interpolates between the two, behaving quadratically for small residuals and linearly for large ones, which gives robustness without sacrificing smoothness near the optimum. ### 3.3 Classification losses The loss we truly care about in classification is often the zero one loss, $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$, which counts mistakes. Unfortunately it is nonconvex and discontinuous, so direct minimization is NP hard for most hypothesis classes. Practical algorithms minimize a convex surrogate that upper bounds the zero one loss. The cross entropy or logistic loss, $$ \ell(\hat{p}, y) = -\sum_{c} \mathbb{1}[y = c] \log \hat{p}_c, $$ where $\hat{p}$ is a predicted probability vector, is the workhorse for both logistic regression and modern neural classifiers. It is not an arbitrary choice: minimizing average cross entropy is exactly maximizing the likelihood of the labels under the model's predicted probabilities, so it inherits the asymptotic guarantees of maximum likelihood estimation. The hinge loss $\max(0, 1 - y \cdot \hat{y})$ underlies support vector machines and induces a margin, penalizing predictions that are correct but not confident. The exponential loss $\exp(-y \cdot \hat{y})$ drives AdaBoost. Each is a surrogate chosen because it is convex and (Fisher) consistent, meaning that the predictor minimizing the surrogate risk over all functions agrees in sign with the Bayes classifier, so minimizing it drives the predictor toward Bayes optimal as data grows. ### 3.4 Losses encode values A loss function is also where domain priorities enter. Class weighting, asymmetric costs for false positives versus false negatives in medical screening, and quantile losses for risk aware forecasting all live here. The lesson is that the loss is a place to express what matters, not a setting to leave at its default. ## 4. Hypothesis Classes and Capacity ### 4.1 What a hypothesis class is A hypothesis class $\mathcal{H}$ is the set of functions the learning algorithm is permitted to return. Linear predictors $h(x) = w^\top x + b$ form one class; degree $d$ polynomials form a larger one; decision trees of depth $k$, kernel machines, and neural networks of a given architecture each define their own. The choice of $\mathcal{H}$ embeds our inductive bias, the assumptions about the world that let finite data generalize to infinitely many unseen inputs. Without inductive bias, learning from finite samples is impossible, a fact formalized by the no free lunch theorem: averaged over all possible data distributions, no learner beats random guessing. ### 4.2 Measuring capacity The richness of a hypothesis class is its capacity. For binary classification the classical measure is the Vapnik Chervonenkis (VC) dimension. A set of points is shattered by $\mathcal{H}$ if, for every one of the $2^m$ possible labelings of $m$ points, some $h \in \mathcal{H}$ realizes that labeling. The VC dimension $\mathrm{VC}(\mathcal{H})$ is the size of the largest set that $\mathcal{H}$ can shatter. As a concrete example, consider thresholded linear classifiers in the plane, $h(x) = \mathrm{sign}(w^\top x + b)$ with $x \in \mathbb{R}^2$. Any three points in general position (not collinear) can be split in all $2^3 = 8$ ways by some line, so three points are shattered. But no set of four points can be shattered: by Radon's theorem any four points in the plane can be partitioned into two subsets whose convex hulls intersect, and that particular labeling (one subset positive, the other negative) cannot be separated by any straight line. Hence $\mathrm{VC} = 3$, matching the general formula $d + 1$ for linear classifiers in $\mathbb{R}^d$. Higher capacity means more expressive power but also a greater risk of fitting noise. Related measures include Rademacher complexity, which captures how well a class can correlate with random $\pm 1$ labels, and covering numbers. These are not just theoretical ornaments; they appear directly in the bounds that connect training error to test error. ### 4.3 The bias variance decomposition Capacity controls a fundamental tradeoff. Treat the learned predictor $\hat{h}$ as random because it depends on the random training sample $S$. For a fixed test point $x$, write $\bar{h}(x) = \mathbb{E}_S[\hat{h}(x)]$ for the average prediction across resamples. The expected squared error decomposes as $$ \mathbb{E}_{S, y}\big[(\hat{h}(x) - y)^2\big] = \underbrace{\big(\bar{h}(x) - m(x)\big)^2}_{\text{Bias}^2,\ \text{too rigid}} + \underbrace{\mathbb{E}_S\big[(\hat{h}(x) - \bar{h}(x))^2\big]}_{\text{Variance},\ \text{too sensitive}} + \underbrace{\sigma^2}_{\text{irreducible}}, $$ where $m(x) = \mathbb{E}[y \mid x]$ is the Bayes regressor from Section 1.2 and $\sigma^2$ is its irreducible noise. The bias term measures how far the class is systematically displaced from the truth; the variance term measures how much the fit jitters as the sample changes. A low capacity class has high bias: it cannot represent the truth and underfits. A high capacity class has high variance: it chases noise in the particular training sample and overfits. The classical prescription is to tune capacity to the point where the sum is minimized, the sweet spot of the U shaped test error curve. Section 6 revisits how deep learning complicates this picture without overturning it. ### 4.4 A worked example: fitting a polynomial Make the tradeoff concrete. Suppose the truth is $m(x) = \sin(2\pi x)$ on $[0, 1]$ and we observe $n$ noisy points $y_i = \sin(2\pi x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Our hypothesis class is polynomials of degree $d$, fit by least squares, so $d$ is the capacity knob. - Degree $0$ or $1$. A constant or a line cannot bend to follow a full sine wave. Every resample produces nearly the same wrong curve, so variance is tiny but bias is large. This is underfitting. - Degree $3$ to $5$. A cubic or quintic tracks one period of the sine closely. Bias has collapsed and variance is still modest, so the total error is near its minimum. This is the sweet spot. - Degree $15$ with $n = 20$. The polynomial has nearly as many free coefficients as data points. It threads every noisy observation, oscillating wildly between them. Bias is essentially zero, but a single resampled point swings the curve dramatically, so variance explodes. This is overfitting. The training error falls monotonically as $d$ rises, because a higher degree class contains every lower degree fit and can only reduce empirical risk. The test error instead traces the U shape: it falls while reducing bias dominates, then climbs once growing variance takes over. Cross validation (Section 5.3) is how we locate the bottom of that U without peeking at the test set. Adding an $L_2$ penalty on the coefficients (Section 5.2) lets us keep a high degree $d$ for expressiveness while $\lambda$ reins in the variance, which is often more effective than discretely choosing $d$. ## 5. Generalization ### 5.1 The generalization gap Define the generalization gap as the difference between true and empirical risk, $R(\hat{h}) - \hat{R}_S(\hat{h})$. ERM controls the empirical term by construction; generalization theory bounds the gap. A representative uniform convergence result states that, with probability at least $1 - \delta$ over the draw of the sample, simultaneously for all $h \in \mathcal{H}$, $$ R(h) \leq \hat{R}_S(h) + \mathcal{O}\!\left( \sqrt{\frac{\text{capacity}(\mathcal{H}) + \log(1/\delta)}{n}} \right). $$ Read this bound qualitatively. The gap shrinks as the sample size $n$ grows and widens as the capacity of $\mathcal{H}$ grows. More data helps; more flexible models cost you, unless you have the data to pay for them. This single inequality justifies the entire practice of holding out validation data, limiting model complexity, and gathering more examples. Making the capacity term concrete, a standard VC bound for binary classification with the zero one loss states that with probability at least $1 - \delta$, for all $h \in \mathcal{H}$, $$ R(h) \leq \hat{R}_S(h) + \sqrt{\frac{8\big(\mathrm{VC}(\mathcal{H})\,\log\frac{2en}{\mathrm{VC}(\mathcal{H})} + \log\frac{4}{\delta}\big)}{n}}. $$ The qualitative shape is exactly the inequality above: the numerator grows roughly linearly in the VC dimension and the denominator is the sample size, so the gap behaves like $\sqrt{\mathrm{VC}(\mathcal{H}) / n}$ up to logarithmic factors. A useful rule of thumb that falls out of such bounds is that the sample size needed for reliable generalization scales with the capacity of the class: you want $n$ comfortably larger than $\mathrm{VC}(\mathcal{H})$. ### 5.2 Regularization as capacity control In practice we rarely enumerate hypothesis classes by hand. Instead we add a penalty to the empirical risk that discourages complex solutions, regularized risk minimization: $$ \hat{h} = \arg\min_{h \in \mathcal{H}} \ \hat{R}_S(h) + \lambda \, \Omega(h). $$ The penalty $\Omega(h)$ might be the squared norm of the weights ($L_2$, ridge), the absolute norm ($L_1$, lasso, which also induces sparsity and feature selection), or structural constraints like tree depth. The hyperparameter $\lambda$ continuously trades empirical fit against capacity, effectively shrinking the hypothesis class. This is the operational form of the bias variance tradeoff and the most common knob a practitioner actually turns. ### 5.3 Estimating generalization honestly Theory gives the shape of the bounds, but the constants are usually too loose to use directly. In practice we estimate generalization empirically. Hold out a test set never touched during training; use a separate validation set or cross validation to select hyperparameters. The cardinal sin is letting test information leak into model selection, which silently inflates capacity usage and breaks the guarantee. Cross validation, repeatedly partitioning the data into training and validation folds, gives a lower variance estimate of true risk when data is scarce. ```python # k-fold cross validation: estimate true risk without # touching the held-out test set. def cross_val_risk(fit, S, loss, k=5): folds = partition(S, k) scores = [] for i in range(k): val = folds[i] train = concat(folds[:i] + folds[i+1:]) h = fit(train) scores.append(mean(loss(h(x), y) for x, y in val)) return mean(scores) ``` ## 6. Why Classical ML Still Matters ### 6.1 The vocabulary did not change Deep learning did not repeal empirical risk minimization; it is a spectacular instance of it. A neural network defines a hypothesis class parameterized by its weights, stochastic gradient descent is an approximate ERM solver, and cross entropy is the same surrogate loss logistic regression uses. Every concept in this chapter, risk, loss, capacity, the generalization gap, regularization, applies verbatim to a transformer. An engineer who understands the classical paradigm reads a training curve, a learning rate schedule, or a weight decay setting as a member of a family they already know, rather than as folklore. ### 6.2 The deep learning puzzle, and what survives Deep networks complicate one part of the classical story. Heavily overparameterized models, with far more parameters than training examples, can fit random labels perfectly yet still generalize on real data, and test error can descend again past the interpolation point, the double descent phenomenon. This shows that raw parameter count is the wrong capacity measure for these models. What survives is the framework itself. Researchers explain the puzzle with the same tools, the implicit regularization of gradient descent, margin based bounds, and norm based rather than count based capacity. The questions are classical even where the answers are new. ### 6.3 When classical models are the right tool For tabular data, the bread and butter of business, finance, and the sciences, gradient boosted trees and regularized linear models routinely match or beat deep networks while training in seconds and offering interpretability. When data is scarce, when latency and compute budgets are tight, when a model must be audited or its decisions explained to a regulator, or when a strong baseline is needed before reaching for anything heavier, the classical toolkit wins on the merits. Reaching for a billion parameter model when ridge regression suffices is an engineering error, not a sign of sophistication. ### 6.4 Common pitfalls The paradigm is most useful when its failure modes are kept in view. - Breaking the i.i.d. assumption silently. Time series, grouped data (multiple rows per patient or user), and any sample selected non-randomly violate the foundation of every bound here. Validation folds that split such data at random leak information and report optimistic error. Split by time or by group instead. - Test set leakage through repeated peeking. Each time the test set guides a decision it stops being a fair estimate of true risk. Selecting among many models on the same test set inflates capacity usage exactly as enlarging $\mathcal{H}$ does. - Preprocessing fit on the full data. Computing scaling, imputation, or feature selection on all rows before splitting leaks test statistics into training. Fit every transform inside the training fold only. - Optimizing the wrong loss. A model tuned on accuracy can be useless when classes are imbalanced or when false positives and false negatives carry different costs. Choose the loss and the evaluation metric to match the real decision, as Section 3.4 urges. - Reading parameter count as capacity. As Section 6.2 notes, for overparameterized models the number of weights is a poor proxy for effective capacity. Judge generalization by held out performance, not by model size. ### 6.5 The enduring discipline Most of all, the classical paradigm supplies the discipline that keeps applied machine learning honest. Split your data. Choose a loss that reflects real costs. Watch the gap between training and validation error. Prefer the simplest hypothesis class that does the job. These habits predate deep learning, outlast each generation of architectures, and apply with equal force to a logistic regression and a frontier model. The paradigm is not a stage we have left behind. It is the foundation every later development stands on. ## References 1. Vapnik, V. N. "The Nature of Statistical Learning Theory." Springer, 1995. https://link.springer.com/book/10.1007/978-1-4757-3264-1 2. Shalev-Shwartz, S. and Ben-David, S. "Understanding Machine Learning: From Theory to Algorithms." Cambridge University Press, 2014. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/ 3. Hastie, T., Tibshirani, R., and Friedman, J. "The Elements of Statistical Learning." Springer, 2009. https://hastie.su.domains/ElemStatLearn/ 4. Mohri, M., Rostamizadeh, A., and Talwalkar, A. "Foundations of Machine Learning." MIT Press, 2018. https://cs.nyu.edu/~mohri/mlbook/ 5. Bishop, C. M. "Pattern Recognition and Machine Learning." Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/ 6. Wolpert, D. H. "The Lack of A Priori Distinctions Between Learning Algorithms." Neural Computation, 1996. https://doi.org/10.1162/neco.1996.8.7.1341 7. Belkin, M., Hsu, D., Ma, S., and Mandal, S. "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off." PNAS, 2019. https://www.pnas.org/doi/10.1073/pnas.1903070116 8. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. "Understanding Deep Learning Requires Rethinking Generalization." ICLR, 2017. https://arxiv.org/abs/1611.03530 9. Grinsztajn, L., Oyallon, E., and Varoquaux, G. "Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?" NeurIPS, 2022. https://arxiv.org/abs/2207.08815 10. Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. "Convexity, Classification, and Risk Bounds." Journal of the American Statistical Association, 2006. https://doi.org/10.1198/016214505000000907 11. Vapnik, V. N. and Chervonenkis, A. Ya. "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities." Theory of Probability and Its Applications, 1971. https://doi.org/10.1137/1116025