81  The Classical Machine Learning Paradigm

Before transformers, before backpropagation through hundred billion parameter networks, there was a clean and durable idea: a learning algorithm is a procedure that takes data and returns a function. The classical machine learning paradigm formalizes this idea with enough precision that we can reason about when learning works, why it generalizes, and how to trade off competing concerns. This chapter develops that formalism, the supervised learning setup, empirical risk minimization, loss functions, hypothesis classes, and generalization theory, and then argues that this machinery remains indispensable in the deep learning era rather than a historical curiosity.

81.1 1. The Supervised Learning Setup

81.1.1 1.1 Inputs, outputs, and the data distribution

Supervised learning begins with an input space \(\mathcal{X}\) and an output space \(\mathcal{Y}\). We assume there exists an unknown joint probability distribution \(\mathcal{D}\) over \(\mathcal{X} \times \mathcal{Y}\). A training set is a collection of \(n\) examples

\[ S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \]

drawn independently and identically distributed (i.i.d.) from \(\mathcal{D}\). The i.i.d. assumption is the load bearing assumption of the entire framework. It says the data we train on and the data we will eventually be evaluated on come from the same source, and that examples do not influence one another. When this assumption breaks, through distribution shift, temporal correlation, or selection bias, the guarantees discussed below weaken or vanish.

The goal is to produce a predictor \(h: \mathcal{X} \to \mathcal{Y}\) that maps inputs to outputs well, where “well” is defined precisely in Section 2. When \(\mathcal{Y}\) is a finite set of labels we call the task classification; when \(\mathcal{Y} = \mathbb{R}\) or \(\mathbb{R}^k\) we call it regression. The same theory covers both.

81.1.2 1.2 What we are really estimating

It is tempting to think the target is a deterministic function \(f^\star(x)\), but the world is noisier than that. The same input \(x\) can be associated with different labels \(y\). The framework handles this by working with the conditional distribution \(\mathcal{D}(y \mid x)\). The best possible deterministic predictor under squared loss, for instance, is the conditional mean \(\mathbb{E}[y \mid x]\), and under zero one loss it is the most probable label \(\arg\max_y \mathcal{D}(y \mid x)\), called the Bayes optimal classifier. No algorithm can do better than the Bayes predictor on the underlying distribution, and the residual error it incurs, the Bayes error, is irreducible. Recognizing irreducible error keeps us honest about when more data or a bigger model will help and when it will not.

81.2 2. Empirical Risk Minimization

81.2.1 2.1 Risk, the quantity we care about

Fix a loss function \(\ell(\hat{y}, y) \geq 0\) that measures the penalty for predicting \(\hat{y}\) when the truth is \(y\). The true risk, also called the generalization error or population risk, of a predictor \(h\) is its expected loss over the data distribution:

\[ R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(h(x), y) \big]. \]

This is the number we ultimately want to minimize. The problem is that \(\mathcal{D}\) is unknown, so \(R(h)\) cannot be computed directly. We only have the sample \(S\).

81.2.2 2.2 The empirical risk

The natural surrogate is the average loss on the training set, the empirical risk:

\[ \hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i). \]

By the law of large numbers, for any fixed \(h\) the empirical risk converges to the true risk as \(n\) grows. Empirical risk minimization (ERM) is the principle of choosing the predictor that minimizes this quantity within some restricted set of candidate functions \(\mathcal{H}\), the hypothesis class:

\[ \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h). \]

Nearly every supervised algorithm you have used, linear regression, logistic regression, support vector machines, decision trees, and the training loop of a neural network, is an instance or approximation of ERM, often with an added regularization term discussed in Section 5.

81.2.3 2.3 Why restricting the hypothesis class is essential

If we allowed \(\mathcal{H}\) to contain every possible function, ERM would simply memorize: build a lookup table returning \(y_i\) for each \(x_i\) and arbitrary values elsewhere. This achieves zero empirical risk while learning nothing about unseen inputs. The art of ERM is choosing \(\mathcal{H}\) rich enough to express good predictors but constrained enough that low empirical risk implies low true risk. This tension is the central theme of the paradigm and is made quantitative in Section 4.

# ERM in pseudocode: search a hypothesis class for the
# function with lowest average training loss.
def erm(hypothesis_class, S, loss):
    best, best_risk = None, float("inf")
    for h in hypothesis_class:
        risk = sum(loss(h(x), y) for x, y in S) / len(S)
        if risk < best_risk:
            best, best_risk = h, risk
    return best

81.3 3. Loss Functions

81.3.1 3.1 The role of the loss

The loss function encodes what we mean by a good prediction. Choosing it is a modeling decision, not a mathematical inevitability, and different losses lead to different optimal predictors and different sensitivities to outliers. Two desiderata usually compete: fidelity to the real world cost of mistakes, and tractability for optimization, especially differentiability and convexity.

81.3.2 3.2 Regression losses

For real valued targets the squared loss \(\ell(\hat{y}, y) = (\hat{y} - y)^2\) is standard. It is smooth, convex, and its minimizer is the conditional mean, but it penalizes large errors quadratically and is therefore sensitive to outliers. The absolute loss \(|\hat{y} - y|\) targets the conditional median and is more robust, at the cost of nondifferentiability at zero. The Huber loss interpolates between the two, behaving quadratically for small residuals and linearly for large ones, which gives robustness without sacrificing smoothness near the optimum.

81.3.3 3.3 Classification losses

The loss we truly care about in classification is often the zero one loss, \(\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]\), which counts mistakes. Unfortunately it is nonconvex and discontinuous, so direct minimization is NP hard for most hypothesis classes. Practical algorithms minimize a convex surrogate that upper bounds the zero one loss. The cross entropy or logistic loss,

\[ \ell(\hat{p}, y) = -\sum_{c} \mathbb{1}[y = c] \log \hat{p}_c, \]

where \(\hat{p}\) is a predicted probability vector, is the workhorse for both logistic regression and modern neural classifiers. The hinge loss \(\max(0, 1 - y \cdot \hat{y})\) underlies support vector machines and induces a margin. The exponential loss drives AdaBoost. Each is a surrogate chosen because it is convex and statistically consistent, meaning that minimizing it drives the predictor toward the Bayes classifier as data grows.

81.3.4 3.4 Losses encode values

A loss function is also where domain priorities enter. Class weighting, asymmetric costs for false positives versus false negatives in medical screening, and quantile losses for risk aware forecasting all live here. The lesson is that the loss is a place to express what matters, not a setting to leave at its default.

81.4 4. Hypothesis Classes and Capacity

81.4.1 4.1 What a hypothesis class is

A hypothesis class \(\mathcal{H}\) is the set of functions the learning algorithm is permitted to return. Linear predictors \(h(x) = w^\top x + b\) form one class; degree \(d\) polynomials form a larger one; decision trees of depth \(k\), kernel machines, and neural networks of a given architecture each define their own. The choice of \(\mathcal{H}\) embeds our inductive bias, the assumptions about the world that let finite data generalize to infinitely many unseen inputs. Without inductive bias, learning from finite samples is impossible, a fact formalized by the no free lunch theorem: averaged over all possible data distributions, no learner beats random guessing.

81.4.2 4.2 Measuring capacity

The richness of a hypothesis class is its capacity. For binary classification the classical measure is the Vapnik Chervonenkis (VC) dimension, the size of the largest set of points that \(\mathcal{H}\) can label in all possible ways. A linear classifier in \(\mathbb{R}^d\) has VC dimension \(d + 1\). Higher capacity means more expressive power but also a greater risk of fitting noise. Related measures include Rademacher complexity, which captures how well a class can correlate with random labels, and covering numbers. These are not just theoretical ornaments; they appear directly in the bounds that connect training error to test error.

81.4.3 4.3 The bias variance decomposition

Capacity controls a fundamental tradeoff. For squared loss the expected error of a learned predictor decomposes as

\[ \mathbb{E}[(h(x) - y)^2] = \underbrace{\text{Bias}^2}_{\text{too rigid}} + \underbrace{\text{Variance}}_{\text{too sensitive}} + \underbrace{\sigma^2}_{\text{irreducible}}. \]

A low capacity class has high bias: it cannot represent the truth and underfits. A high capacity class has high variance: it chases noise in the particular training sample and overfits. The classical prescription is to tune capacity to the point where the sum is minimized, the sweet spot of the U shaped test error curve. Section 6 revisits how deep learning complicates this picture without overturning it.

81.5 5. Generalization

81.5.1 5.1 The generalization gap

Define the generalization gap as the difference between true and empirical risk, \(R(\hat{h}) - \hat{R}_S(\hat{h})\). ERM controls the empirical term by construction; generalization theory bounds the gap. A representative uniform convergence result states that, with probability at least \(1 - \delta\) over the draw of the sample, simultaneously for all \(h \in \mathcal{H}\),

\[ R(h) \leq \hat{R}_S(h) + \mathcal{O}\!\left( \sqrt{\frac{\text{capacity}(\mathcal{H}) + \log(1/\delta)}{n}} \right). \]

Read this bound qualitatively. The gap shrinks as the sample size \(n\) grows and widens as the capacity of \(\mathcal{H}\) grows. More data helps; more flexible models cost you, unless you have the data to pay for them. This single inequality justifies the entire practice of holding out validation data, limiting model complexity, and gathering more examples.

81.5.2 5.2 Regularization as capacity control

In practice we rarely enumerate hypothesis classes by hand. Instead we add a penalty to the empirical risk that discourages complex solutions, regularized risk minimization:

\[ \hat{h} = \arg\min_{h \in \mathcal{H}} \ \hat{R}_S(h) + \lambda \, \Omega(h). \]

The penalty \(\Omega(h)\) might be the squared norm of the weights (\(L_2\), ridge), the absolute norm (\(L_1\), lasso, which also induces sparsity and feature selection), or structural constraints like tree depth. The hyperparameter \(\lambda\) continuously trades empirical fit against capacity, effectively shrinking the hypothesis class. This is the operational form of the bias variance tradeoff and the most common knob a practitioner actually turns.

81.5.3 5.3 Estimating generalization honestly

Theory gives the shape of the bounds, but the constants are usually too loose to use directly. In practice we estimate generalization empirically. Hold out a test set never touched during training; use a separate validation set or cross validation to select hyperparameters. The cardinal sin is letting test information leak into model selection, which silently inflates capacity usage and breaks the guarantee. Cross validation, repeatedly partitioning the data into training and validation folds, gives a lower variance estimate of true risk when data is scarce.

# k-fold cross validation: estimate true risk without
# touching the held-out test set.
def cross_val_risk(fit, S, loss, k=5):
    folds = partition(S, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = concat(folds[:i] + folds[i+1:])
        h = fit(train)
        scores.append(mean(loss(h(x), y) for x, y in val))
    return mean(scores)

81.6 6. Why Classical ML Still Matters

81.6.1 6.1 The vocabulary did not change

Deep learning did not repeal empirical risk minimization; it is a spectacular instance of it. A neural network defines a hypothesis class parameterized by its weights, stochastic gradient descent is an approximate ERM solver, and cross entropy is the same surrogate loss logistic regression uses. Every concept in this chapter, risk, loss, capacity, the generalization gap, regularization, applies verbatim to a transformer. An engineer who understands the classical paradigm reads a training curve, a learning rate schedule, or a weight decay setting as a member of a family they already know, rather than as folklore.

81.6.2 6.2 The deep learning puzzle, and what survives

Deep networks complicate one part of the classical story. Heavily overparameterized models, with far more parameters than training examples, can fit random labels perfectly yet still generalize on real data, and test error can descend again past the interpolation point, the double descent phenomenon. This shows that raw parameter count is the wrong capacity measure for these models. What survives is the framework itself. Researchers explain the puzzle with the same tools, the implicit regularization of gradient descent, margin based bounds, and norm based rather than count based capacity. The questions are classical even where the answers are new.

81.6.3 6.3 When classical models are the right tool

For tabular data, the bread and butter of business, finance, and the sciences, gradient boosted trees and regularized linear models routinely match or beat deep networks while training in seconds and offering interpretability. When data is scarce, when latency and compute budgets are tight, when a model must be audited or its decisions explained to a regulator, or when a strong baseline is needed before reaching for anything heavier, the classical toolkit wins on the merits. Reaching for a billion parameter model when ridge regression suffices is an engineering error, not a sign of sophistication.

81.6.4 6.4 The enduring discipline

Most of all, the classical paradigm supplies the discipline that keeps applied machine learning honest. Split your data. Choose a loss that reflects real costs. Watch the gap between training and validation error. Prefer the simplest hypothesis class that does the job. These habits predate deep learning, outlast each generation of architectures, and apply with equal force to a logistic regression and a frontier model. The paradigm is not a stage we have left behind. It is the foundation every later development stands on.

81.7 References

  1. Vapnik, V. N. “The Nature of Statistical Learning Theory.” Springer, 1995. https://link.springer.com/book/10.1007/978-1-4757-3264-1
  2. Shalev-Shwartz, S. and Ben-David, S. “Understanding Machine Learning: From Theory to Algorithms.” Cambridge University Press, 2014. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/
  3. Hastie, T., Tibshirani, R., and Friedman, J. “The Elements of Statistical Learning.” Springer, 2009. https://hastie.su.domains/ElemStatLearn/
  4. Mohri, M., Rostamizadeh, A., and Talwalkar, A. “Foundations of Machine Learning.” MIT Press, 2018. https://cs.nyu.edu/~mohri/mlbook/
  5. Bishop, C. M. “Pattern Recognition and Machine Learning.” Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
  6. Wolpert, D. H. “The Lack of A Priori Distinctions Between Learning Algorithms.” Neural Computation, 1996. https://doi.org/10.1162/neco.1996.8.7.1341
  7. Belkin, M., Hsu, D., Ma, S., and Mandal, S. “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off.” PNAS, 2019. https://www.pnas.org/doi/10.1073/pnas.1903070116
  8. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. “Understanding Deep Learning Requires Rethinking Generalization.” ICLR, 2017. https://arxiv.org/abs/1611.03530
  9. Grinsztajn, L., Oyallon, E., and Varoquaux, G. “Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?” NeurIPS, 2022. https://arxiv.org/abs/2207.08815