5 Types of Learning

Machine learning is often introduced as a single idea: fit a function to data. In practice, the field is organized into a taxonomy of learning paradigms that differ not in the optimizers they use but in the kind of supervisory signal they consume. Understanding this taxonomy is the difference between treating learning as a black box and treating it as an engineering discipline in which the choice of paradigm follows directly from the data you have, the feedback you can collect, and the cost of getting labels. This chapter surveys the major paradigms, makes their signals precise, and explains why modern foundation models deliberately combine several of them.

5.1 1. The Organizing Question: What Is the Signal?

Every learning paradigm answers one question: where does the information that shapes the model come from? In supervised learning it comes from human-provided target labels. In unsupervised learning it comes from structure latent in the inputs themselves. In reinforcement learning it comes from a scalar reward delivered by an environment. The taxonomy below is best read not as a list of separate algorithms but as a list of distinct answers to this single question.

To make this precise, fix a few objects that recur in every paradigm. Let inputs $x$ live in an input space $\mathcal{X}$, and when targets are available let $y$ live in a target space $\mathcal{Y}$. Assume the world generates examples according to a fixed but unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. A learner chooses a hypothesis $f_\theta$ from a hypothesis class $\mathcal{F} = \{ f_\theta : \theta \in \Theta \}$. Performance is measured by a loss $L(\hat y, y) \ge 0$, and the quantity we ultimately care about is the risk, the expected loss under the true distribution:

\[ R(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, L(f_\theta(x), y) \,\big]. \]

We never observe $\mathcal{D}$ directly, so we minimize the empirical risk computed on a finite sample of $N$ examples,

\[ \hat R(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\!\left(f_\theta(x_i),\, y_i\right), \]

and we rely on generalization theory to guarantee, with high probability over an i.i.d. sample, that a small $\hat R$ implies a small $R$ when $N$ is large and $\mathcal{F}$ is suitably constrained. A learning paradigm is then characterized by three choices: (1) what is observed at training time, (2) what plays the role of $y$ and therefore of $L$, and (3) the assumptions that link the observed signal back to the true risk we want to control. Keeping these three choices in mind lets you classify any new method you encounter, including hybrid methods that resist a single label.
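
The empirical risk is simple enough to compute by hand, and doing so once makes the definition concrete. The sketch below is illustrative only: the one-parameter hypothesis $f_\theta(x) = \theta x$, the three-point sample, and the helper names are all assumptions chosen for the example, not part of any particular library.

```python
from typing import Callable, Sequence, Tuple


def empirical_risk(
    predict: Callable[[float], float],
    loss: Callable[[float, float], float],
    sample: Sequence[Tuple[float, float]],
) -> float:
    """Average loss of `predict` over a finite sample of (x, y) pairs."""
    return sum(loss(predict(x), y) for x, y in sample) / len(sample)


def squared_error(y_hat: float, y: float) -> float:
    return (y_hat - y) ** 2


def make_linear(theta: float) -> Callable[[float], float]:
    """Hypothetical hypothesis class: f_theta(x) = theta * x."""
    return lambda x: theta * x


# Toy sample (made up for illustration).
sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]

risk_at_2 = empirical_risk(make_linear(2.0), squared_error, sample)  # 1/3
risk_at_1 = empirical_risk(make_linear(1.0), squared_error, sample)  # 7.0
```

Comparing the two values shows what minimizing $\hat R$ means in practice: $\theta = 2$ fits this sample far better than $\theta = 1$, although neither number says anything yet about the true risk $R$.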

It is the second choice, the source of the target, that distinguishes the paradigms. The table below previews the answer for each, and the sections that follow develop them in turn.

Paradigm	What stands in for the target $y$
Supervised	A human-provided label attached to each $x$
Unsupervised	Nothing; only the geometry of $x$ is used
Self-supervised	A withheld portion of $x$, predicted from the rest
Semi-supervised	Labels on a few $x$, structure of $x$ on the many
Reinforcement	A scalar reward returned by an environment

5.2 2. Supervised Learning

5.2.1 2.1 Definition

Supervised learning assumes access to a labeled dataset $\{(x_i, y_i)\}_{i=1}^N$ of input-target pairs drawn from $\mathcal{D}$, and the goal is to learn a mapping $f_\theta : \mathcal{X} \to \mathcal{Y}$ that predicts the target for new inputs. The signal is the explicit label $y$ attached to each example. Classification predicts a discrete category (is this email spam or not), and regression predicts a continuous quantity (what will this house sell for).

The objective is the empirical risk of Section 1 instantiated with a task-appropriate loss:

\[ \hat R(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\!\left(f_\theta(x_i),\, y_i\right). \]

Two losses cover most cases. For regression with $\mathcal{Y} = \mathbb{R}$, the squared error $L(\hat y, y) = (\hat y - y)^2$ is minimized in expectation by the conditional mean $\mathbb{E}[y \mid x]$. For $K$-class classification, the model outputs a probability vector $\hat y = \operatorname{softmax}(f_\theta(x))$ and we minimize the cross entropy

\[ L(\hat y, y) = -\sum_{k=1}^{K} \mathbb{1}[y = k]\, \log \hat y_k = -\log \hat y_{y}, \]

which is the negative log-likelihood of the true class. Minimizing cross entropy is therefore maximum-likelihood estimation of a categorical model, and in the limit of infinite data its minimizer recovers the true posterior $\mathbb{P}(y \mid x)$ when the model is well specified.
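
As a minimal sketch of the cross-entropy formula, the code below turns a vector of scores into probabilities with a softmax and evaluates $-\log \hat y_y$. The logits are invented for illustration, and classes are indexed from 0 here (the text indexes them from 1).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], y: int) -> float:
    """Negative log-probability assigned to the true class y (0-indexed)."""
    return -math.log(softmax(logits)[y])


# Hypothetical model scores for a 3-class problem.
logits = [2.0, 0.0, 0.0]
probs = softmax(logits)

loss_if_true_is_0 = cross_entropy(logits, 0)  # model favors class 0: small loss
loss_if_true_is_1 = cross_entropy(logits, 1)  # model disfavors class 1: larger loss
```

The loss depends only on the probability given to the class that actually occurred, which is why the indicator sum collapses to a single term.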

5.2.2 2.2 When to Use It

Use supervised learning when you can obtain labels that are accurate, consistent, and representative of deployment conditions. It remains the workhorse for tabular prediction, medical diagnosis from labeled scans, credit scoring, and any setting where a ground truth exists and can be recorded. Its central limitation is the cost of labels. Producing a large, high-quality labeled corpus often requires expert annotators, and label noise propagates directly into the model, since the empirical risk faithfully fits whatever targets it is given, correct or not. A second limitation is covariate shift: the guarantee that small empirical risk implies small true risk holds only when training and deployment inputs share a distribution, so a model trained on one population can fail silently on another.

5.3 3. Unsupervised Learning

5.3.1 3.1 Definition

Unsupervised learning works with inputs alone and no targets. The signal is the geometric and statistical structure of the data. The model is asked to discover that structure: to group similar examples (clustering), to find a lower-dimensional coordinate system that preserves information (dimensionality reduction), or to estimate the density from which the data were drawn.

Two classic methods make the idea concrete. k-means clustering partitions points by proximity to $k$ learned centroids $\mu_1, \dots, \mu_k$, minimizing the within-cluster sum of squares

\[ J(\{\mu_j\}, \{c_i\}) = \sum_{i=1}^{N} \big\lVert x_i - \mu_{c_i} \big\rVert^2, \]

where $c_i \in \{1, \dots, k\}$ is the cluster assigned to point $i$. Lloyd’s algorithm minimizes $J$ by alternating two steps, neither of which can increase the objective: assign every point to its nearest centroid, then move every centroid to the mean of its assigned points. Because $J$ never increases and there are only finitely many ways to assign $N$ points to $k$ clusters, the procedure terminates, though only at a local minimum, which is why practitioners restart from several initializations.
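
The two alternating steps can be sketched in a few lines of plain Python. This is a teaching sketch under stated assumptions, not a production implementation: the points and starting centroids are made up, an empty cluster simply keeps its old centroid (one convention among several), and the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def assign(points: List[Point], centroids: List[Point]) -> List[int]:
    """Assignment step: index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def recenter(points: List[Point], labels: List[int],
             centroids: List[Point]) -> List[Point]:
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, c in zip(points, labels) if c == j]
        if members:
            new_centroids.append((sum(p[0] for p in members) / len(members),
                                  sum(p[1] for p in members) / len(members)))
        else:
            new_centroids.append(old)  # assumption: empty cluster stays put
    return new_centroids


def objective(points: List[Point], labels: List[int],
              centroids: List[Point]) -> float:
    """Within-cluster sum of squares J."""
    return sum(sq_dist(p, centroids[c]) for p, c in zip(points, labels))


def lloyd(points: List[Point], centroids: List[Point], max_iters: int = 100):
    labels = assign(points, centroids)
    history = []
    for _ in range(max_iters):
        history.append(objective(points, labels, centroids))
        centroids = recenter(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    history.append(objective(points, labels, centroids))
    return centroids, labels, history


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, history = lloyd(points, [(0.0, 0.0), (1.0, 0.0)])
```

On this toy data the recorded values of $J$ never increase from one entry of `history` to the next, and the final assignment separates the two visually obvious groups even though both starting centroids sat inside the first one.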

Principal component analysis finds orthogonal directions of maximal variance. If $\Sigma$ is the data covariance matrix, the first principal component is the unit vector $w$ maximizing $w^\top \Sigma w$, whose solution is the leading eigenvector of $\Sigma$; the top $d$ eigenvectors give the $d$-dimensional projection that minimizes reconstruction error. Both methods illustrate the unsupervised pattern: the objective references only $x$, never a label.
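
To illustrate the eigenvector characterization without any linear-algebra library, the sketch below finds the first principal component of a small 2-D dataset by power iteration. The data are invented, and power iteration is a stand-in chosen for brevity (it assumes the starting vector is not orthogonal to the leading eigenvector); a real analysis would normally call a library eigensolver or SVD.

```python
import math
from typing import List, Tuple


def covariance_2d(data: List[Tuple[float, float]]) -> List[List[float]]:
    """Covariance matrix (dividing by N) of 2-D data, centered internally."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    return [[sxx, sxy], [sxy, syy]]


def leading_eigenvector(sigma: List[List[float]], iters: int = 200) -> List[float]:
    """Power iteration: repeatedly apply sigma and renormalize to unit length."""
    w = [1.0, 0.0]
    for _ in range(iters):
        v = [sigma[0][0] * w[0] + sigma[0][1] * w[1],
             sigma[1][0] * w[0] + sigma[1][1] * w[1]]
        norm = math.hypot(v[0], v[1])
        w = [v[0] / norm, v[1] / norm]
    return w


def variance_along(sigma: List[List[float]], w: List[float]) -> float:
    """The quadratic form w^T sigma w."""
    return (w[0] * (sigma[0][0] * w[0] + sigma[0][1] * w[1])
            + w[1] * (sigma[1][0] * w[0] + sigma[1][1] * w[1]))


# Toy data lying roughly along a diagonal (made up for illustration).
data = [(-2.0, -1.0), (-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
sigma = covariance_2d(data)
w1 = leading_eigenvector(sigma)
var_w1 = variance_along(sigma, w1)
```

The variance captured along `w1` exceeds the variance along either coordinate axis, which is exactly the sense in which the first principal component is the direction of maximal variance.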

5.3.2 3.2 When to Use It

Use unsupervised learning for exploratory analysis, customer segmentation, anomaly detection, and as a preprocessing step that compresses or denoises features before a downstream supervised stage. Because there is no label to define correctness, evaluation is intrinsically harder. You must rely on intrinsic proxy metrics (silhouette scores, reconstruction error, held-out log-likelihood) or on extrinsic validation, namely whether the discovered structure proves useful for a downstream task. Mature open-source implementations of all of the above ship in scikit-learn, which is a sensible default starting point.

5.4 4. Self-Supervised Learning

5.4.1 4.1 Definition

Self-supervised learning is the paradigm that powers modern foundation models, and it sits between supervised and unsupervised learning. The data carry no human labels, yet the method constructs a supervised-style task automatically by withholding part of each input and asking the model to predict it from the rest. The signal is therefore generated from the data itself, hence the name.

The most influential instance is next-token prediction in language models. Given a sequence of tokens $x_1, x_2, \dots, x_T$, the model factorizes the joint probability by the chain rule and is trained to maximize the likelihood of each token given its predecessors. Equivalently, it minimizes the average negative log-likelihood

\[ \mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right). \]

The label for position $t$ is simply the token that actually occurs there, so a vast unlabeled text corpus becomes a vast supervised dataset at no annotation cost. The exponential of this loss is the model’s perplexity, a standard intrinsic measure of how well the model predicts held-out text.
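
The relationship between the loss and perplexity is easy to check numerically. In the sketch below, each list holds the probabilities that some hypothetical model assigned to the tokens that actually occurred; the numbers are invented, and no real language model is involved.

```python
import math
from typing import List


def average_nll(token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs: List[float]) -> float:
    """Exponential of the average negative log-likelihood."""
    return math.exp(average_nll(token_probs))


# A model that spreads its mass uniformly over 4 candidate tokens at every step.
uniform_guesser = [0.25] * 8

# A hypothetical model that is more confident about the tokens that occurred.
sharper_model = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9, 0.6, 0.75]

ppl_uniform = perplexity(uniform_guesser)  # 4.0: as uncertain as a 4-way guess
ppl_sharper = perplexity(sharper_model)    # closer to 1: fewer effective choices
```

Perplexity can thus be read as an effective branching factor: a value of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.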

Other variants change what is withheld. Masked prediction, used by BERT for text and by masked autoencoders for images, blanks a random subset of tokens or patches and reconstructs them from the visible remainder (Devlin et al., 2019). Contrastive learning, used by SimCLR for vision, builds two augmented views of the same example and learns an embedding that pulls matching views together while pushing non-matching views apart (Chen et al., 2020). A representative contrastive objective for an anchor view $z_i$ and its positive partner $z_j$ within a batch of views is

\[ L_{i,j} = -\log \frac{\exp\!\big(\operatorname{sim}(z_i, z_j) / \tau\big)}{\sum_{k \ne i} \exp\!\big(\operatorname{sim}(z_i, z_k) / \tau\big)}, \]

where $\operatorname{sim}$ is cosine similarity and $\tau$ is a temperature. In every case the supervisory target is carved out of the input, not supplied from outside.
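
The contrastive objective is a softmax cross entropy in disguise: the positive partner plays the role of the correct class among all other views in the batch. The sketch below evaluates $L_{i,j}$ directly from that formula for a handful of hand-written embeddings; the vectors, the temperature, and the function names are assumptions made for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(z: List[List[float]], i: int, j: int, tau: float = 0.5) -> float:
    """L_{i,j}: negative log-softmax of sim(z_i, z_j)/tau over all k != i."""
    logits = {k: cosine(z[i], z[k]) / tau for k in range(len(z)) if k != i}
    m = max(logits.values())
    # log-sum-exp of the denominator, computed stably
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_denominator - logits[j]


# Hypothetical embeddings: view 1 is nearly aligned with view 0, the rest are not.
z = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_aligned_positive = contrastive_loss(z, i=0, j=1)
loss_misaligned_positive = contrastive_loss(z, i=0, j=2)
```

The loss is small when the designated positive is the view most similar to the anchor and grows when some other view is closer, which is the pull-together, push-apart behavior described above.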

5.4.2 4.2 Why It Matters

Self-supervised learning solved the central bottleneck of deep learning, namely the scarcity of labels, by turning the abundance of raw data into the source of supervision. A model pretrained this way learns broadly useful representations of language, vision, or audio that transfer to many downstream tasks. The pretraining objective is not the goal in itself; it is a pretext task whose only purpose is to force the model to build rich internal structure. Predicting the next token well, for instance, requires latent competence in syntax, world knowledge, and reasoning, none of which were labeled but all of which become available to later stages (Bommasani et al., 2021).

5.5 5. Semi-Supervised Learning

5.5.1 5.1 Definition

Semi-supervised learning addresses the common situation in which labels are scarce but unlabeled data are plentiful. It uses a small labeled set $\{(x_i, y_i)\}_{i=1}^{n}$ together with a large unlabeled set $\{x_j\}_{j=n+1}^{n+m}$ with $m \gg n$, combining the strengths of supervised and unsupervised approaches. The labeled examples anchor the decision boundary, while the unlabeled examples reveal the shape of the data distribution and discourage the boundary from cutting through dense regions (van Engelen and Hoos, 2020).

This intuition rests on explicit assumptions, and naming them clarifies when the method can work at all. The cluster assumption holds that points in the same cluster share a label; the closely related low-density separation assumption holds that the decision boundary should lie in a region of low data density. A combined objective makes these operational by adding an unsupervised regularizer to the supervised loss:

\[ \mathcal{L}(\theta) = \underbrace{\frac{1}{n}\sum_{i=1}^{n} L\!\left(f_\theta(x_i), y_i\right)}_{\text{supervised}} \;+\; \lambda \underbrace{\frac{1}{m}\sum_{j=n+1}^{n+m} U\!\left(f_\theta, x_j\right)}_{\text{unsupervised}}, \]

where $\lambda > 0$ weights the unsupervised term $U$. Two choices of $U$ dominate practice. Pseudo-labeling lets a model trained on the labeled data assign provisional labels to confident unlabeled examples and retrains on the union. Consistency regularization sets $U$ to penalize changes in the prediction when an unlabeled input is slightly perturbed, encoding the belief that small input perturbations should not cross the boundary.
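
A small numerical sketch shows how a consistency term encodes low-density separation. Everything here is a toy assumption: a one-dimensional logistic model with parameters `(w, b)`, a squared-difference penalty standing in for $U$, a fixed perturbation `eps`, and made-up data with two unlabeled clumps near $-2$ and $+2$.

```python
import math
from typing import List, Tuple

Theta = Tuple[float, float]  # (w, b) of a hypothetical 1-D logistic model


def predict(theta: Theta, x: float) -> float:
    w, b = theta
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def supervised_loss(theta: Theta, labeled: List[Tuple[float, int]]) -> float:
    """Binary cross entropy averaged over the labeled set."""
    total = 0.0
    for x, y in labeled:
        p = predict(theta, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labeled)


def consistency_penalty(theta: Theta, unlabeled: List[float], eps: float = 0.1) -> float:
    """U: squared change in the prediction under a small input perturbation."""
    return sum((predict(theta, x + eps) - predict(theta, x)) ** 2
               for x in unlabeled) / len(unlabeled)


def combined_loss(theta: Theta, labeled, unlabeled, lam: float = 1.0) -> float:
    return supervised_loss(theta, labeled) + lam * consistency_penalty(theta, unlabeled)


labeled = [(-2.0, 0), (2.0, 1)]
unlabeled = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]

boundary_in_gap = (4.0, 0.0)     # decision boundary at x = 0, between the clumps
boundary_in_clump = (4.0, -8.0)  # decision boundary at x = 2, inside a clump

penalty_gap = consistency_penalty(boundary_in_gap, unlabeled)
penalty_clump = consistency_penalty(boundary_in_clump, unlabeled)
```

The penalty is far smaller when the boundary sits in the empty gap than when it cuts through a clump of unlabeled points, so adding $\lambda U$ to the supervised loss steers the boundary toward low-density regions without using a single extra label.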

5.5.2 5.2 When to Use It

Reach for semi-supervised learning when labeling is expensive but raw data collection is cheap, which is the norm in domains like speech recognition, medical imaging, and web-scale classification. Two assumptions are crucial: the unlabeled data must come from the same distribution as the labeled data, and the cluster or low-density-separation assumption must hold. When either fails, for instance when the unlabeled pool contains out-of-distribution classes, propagating labels through unlabeled data can amplify errors rather than correct them, and the regularizer actively pulls the boundary in the wrong direction.

5.6 6. Reinforcement Learning

5.6.1 6.1 Definition

Reinforcement learning concerns an agent that interacts with an environment over time, taking actions, observing states, and receiving scalar rewards. There is no labeled correct action. The signal is the reward, which may be sparse and delayed, and the agent must learn a policy that maximizes cumulative reward over the long run (Sutton and Barto, 2018).

The setting is formalized as a Markov decision process (MDP), the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ of states, actions, a transition kernel $P(s' \mid s, a)$, a reward function $r(s, a)$, and a discount factor $\gamma \in [0, 1)$. A policy $\pi(a \mid s)$ maps states to action distributions, and the agent seeks the policy that maximizes the expected discounted return

\[ J(\pi) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\right]. \]

The discount $\gamma$ trades immediate against future reward and, when rewards are bounded, keeps the sum finite. The value of acting from a state is captured by the state-action value function $Q^\pi(s, a)$, the expected return after taking $a$ in $s$ and following $\pi$ thereafter, which satisfies the Bellman equation

\[ Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[ Q^\pi(s', a') \big] \,\Big]. \]
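
The Bellman equation doubles as an algorithm: applying its right-hand side repeatedly to any initial guess converges to $Q^\pi$ because the update is a $\gamma$-contraction. The sketch below does this for a two-state MDP invented for illustration; the states, actions, rewards, and fixed policy are all hypothetical.

```python
from typing import Dict, Tuple

GAMMA = 0.9

# Hypothetical MDP: P[(s, a)] maps next states to probabilities.
P: Dict[Tuple[str, str], Dict[str, float]] = {
    ("A", "stay"): {"A": 1.0},
    ("A", "go"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "go"): {"A": 1.0},
}
# Reward r(s, a): only staying in B pays.
R: Dict[Tuple[str, str], float] = {
    ("A", "stay"): 0.0, ("A", "go"): 0.0,
    ("B", "stay"): 1.0, ("B", "go"): 0.0,
}
# Fixed policy pi(a | s): go when in A, stay when in B.
PI: Dict[str, Dict[str, float]] = {"A": {"go": 1.0}, "B": {"stay": 1.0}}


def bellman_backup(q: Dict[Tuple[str, str], float]) -> Dict[Tuple[str, str], float]:
    """One application of the right-hand side of the Bellman equation."""
    new_q = {}
    for (s, a), next_dist in P.items():
        expected_next = sum(
            p_next * sum(pi_a * q[(s2, a2)] for a2, pi_a in PI[s2].items())
            for s2, p_next in next_dist.items()
        )
        new_q[(s, a)] = R[(s, a)] + GAMMA * expected_next
    return new_q


def evaluate_policy(tol: float = 1e-12) -> Dict[Tuple[str, str], float]:
    """Iterate the backup until successive estimates stop changing."""
    q = {sa: 0.0 for sa in P}
    while True:
        new_q = bellman_backup(q)
        if max(abs(new_q[sa] - q[sa]) for sa in q) < tol:
            return new_q
        q = new_q


q_pi = evaluate_policy()
```

The result can be checked by hand: staying in B forever earns $1/(1-\gamma) = 10$, so $Q^\pi(B, \text{stay}) = 10$, $Q^\pi(A, \text{go}) = \gamma \cdot 10 = 9$, and the two remaining entries are $\gamma \cdot 9 = 8.1$.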

Two features distinguish this paradigm. First, the data are not fixed: the agent’s own actions determine what it observes next, creating a feedback loop absent from supervised learning. Second, the agent must balance exploration (trying new actions to gather information) against exploitation (choosing actions known to yield reward), a tension with no analogue in the static paradigms.

5.6.2 6.2 When to Use It

Use reinforcement learning for sequential decision problems where success is defined by a long-horizon objective rather than per-step labels: game playing, robotic control, resource scheduling, and recommendation under long-term engagement. The cost is sample inefficiency and instability; reinforcement learning typically needs far more interactions than supervised learning needs labels, and reward design is notoriously delicate, since agents will exploit any loophole in a poorly specified reward, a failure mode called reward hacking. When a faithful simulator or a verifiable correctness signal is unavailable, the expense of collecting real interactions often makes another paradigm preferable.

5.7 7. Transfer Learning

5.7.1 7.1 Definition

Transfer learning reuses knowledge gained on a source task to improve learning on a different but related target task (Pan and Yang, 2010). Rather than training from random initialization, you start from a model already trained on a large source dataset and adapt it. The signal is indirect: it is the inductive bias baked into the pretrained weights, which encode regularities of the source domain that, with luck, also hold in the target.

The dominant recipe is pretrain then fine-tune. A model is pretrained, often self-supervised, on a broad corpus, and then its weights initialize supervised training on the smaller target dataset. Adaptation spans a spectrum: updating all weights (full fine-tuning), updating only the final layers, or freezing the backbone and inserting a small number of new trainable parameters (parameter-efficient fine-tuning, as in adapters or low-rank updates). The fewer parameters you leave free to move, the more of the source bias you retain, which is desirable exactly when the target data are too few to estimate many parameters reliably.
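
The frozen-backbone end of that spectrum can be caricatured in a few lines. The sketch below is deliberately a toy: the "backbone" is a fixed hand-written feature map standing in for pretrained weights, the "head" is a three-parameter linear layer, and the four target examples are invented. Real transfer learning uses a neural network backbone, but the division of labor is the same: the backbone is never updated, and only the head's parameters are trained on the small target set.

```python
from typing import List, Tuple


def backbone(x: float) -> List[float]:
    """Stand-in for a frozen pretrained feature extractor (hypothetical)."""
    return [1.0, x, x * x]


def head(weights: List[float], features: List[float]) -> float:
    """Trainable linear head on top of the frozen features."""
    return sum(w * f for w, f in zip(weights, features))


def fine_tune_head(data: List[Tuple[float, float]],
                   steps: int = 2000, lr: float = 0.05) -> List[float]:
    """Gradient descent on mean squared error; only head weights change."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            feats = backbone(x)  # backbone output is used, never modified
            err = head(weights, feats) - y
            for k, f in enumerate(feats):
                grad[k] += 2.0 * err * f / len(data)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Tiny target dataset generated by y = 1 + x^2 (made up for illustration).
target_data = [(-1.0, 2.0), (0.0, 1.0), (1.0, 2.0), (2.0, 5.0)]
head_weights = fine_tune_head(target_data)
```

Four examples suffice here only because the frozen features already contain what the target task needs; if they did not, no amount of head training would help, which is the toy analogue of negative transfer.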

5.7.2 7.2 When to Use It

Transfer learning is the default whenever your target dataset is small relative to the capacity of the model you want to use, which is almost always the case in the foundation-model era. It dramatically reduces the data and compute needed for a new task. The main risk is negative transfer: if the source and target domains are too dissimilar, the pretrained bias can hurt rather than help, and the model may need more data to unlearn the source than it would have needed to learn the target from scratch.

5.8 8. Multi-Task Learning

5.8.1 8.1 Definition

Multi-task learning trains a single model to perform several tasks at once, sharing representations across them (Caruana, 1997). The signal is the union of the supervisory signals of all the tasks. By optimizing a combined objective, the model is encouraged to learn features useful across tasks, which acts as a form of inductive regularization and can improve generalization on each task individually.

A common architecture uses a shared backbone $g_\phi$ with task-specific heads $h_{\psi_k}$, and the loss is a weighted sum over $K$ tasks:

\[ \mathcal{L}(\phi, \psi_1, \dots, \psi_K) = \sum_{k=1}^{K} w_k\, \mathcal{L}_k\!\big( h_{\psi_k} \circ g_\phi \big). \]

The weights $w_k$ matter. If one task’s loss is on a much larger scale, or if its gradients dominate, the shared backbone $g_\phi$ drifts toward that task at the expense of the others. Methods that normalize gradient magnitudes across tasks or that treat the weights as learnable address this directly.
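
A quick calculation shows how badly scale mismatch can skew the combined objective. The sketch below computes each task's share of the weighted sum for two made-up loss values, first with equal weights and then with a simple inverse-scale heuristic. That heuristic is an assumption chosen for illustration, not one of the published balancing methods, and shares of the loss value are only a rough proxy for influence on the gradient.

```python
from typing import List


def weighted_total(task_losses: List[float], weights: List[float]) -> float:
    """The combined objective: sum over tasks of w_k * L_k."""
    return sum(w * l for w, l in zip(weights, task_losses))


def contribution_shares(task_losses: List[float], weights: List[float]) -> List[float]:
    """Fraction of the combined objective contributed by each task."""
    total = weighted_total(task_losses, weights)
    return [w * l / total for w, l in zip(weights, task_losses)]


# Hypothetical current losses: task 0 is on a much larger scale than task 1.
task_losses = [100.0, 0.5]

equal_shares = contribution_shares(task_losses, [1.0, 1.0])

# Illustrative heuristic: weight each task by the inverse of its loss scale.
inverse_scale_weights = [1.0 / l for l in task_losses]
balanced_shares = contribution_shares(task_losses, inverse_scale_weights)
```

With equal weights the first task accounts for more than 99 percent of the objective, so the shared backbone is effectively trained on that task alone; after rescaling, each task contributes half.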

5.8.2 8.2 When to Use It

Multi-task learning helps when tasks are related and you have limited data for each one, so that knowledge learned for one task supports the others. It is widely used in autonomous driving (jointly detecting lanes, vehicles, and pedestrians) and in language models that perform many tasks through a unified interface. The difficulty lies in balancing the tasks: if one task dominates the gradient or if tasks make conflicting demands on the shared features, performance on some tasks can degrade, again a form of negative transfer, here between simultaneously trained tasks rather than across time.

5.9 9. Online Versus Batch Learning

5.9.1 9.1 The Distinction

The previous sections classified learning by its signal. This section classifies it by how data arrive in time, an orthogonal axis that cross-cuts every paradigm above. Batch learning, also called offline learning, trains on a fixed dataset all at once and then deploys a frozen model. Online learning consumes data sequentially, updating the model incrementally as each example or small mini-batch arrives, and never assumes the full dataset is available at once.

The contrast is one of access, not of objective:

# batch: one optimization over the whole dataset
model = train(all_data)

# online: update as each example arrives
for x, y in stream:
    model = update(model, x, y)

Online learning is the natural setting for streaming data and for non-stationary distributions, where $\mathcal{D}$ itself changes over time, a phenomenon called concept drift. Its theory is often framed in terms of regret, the gap between the learner’s cumulative loss and that of the best fixed model in hindsight; a good online algorithm has regret that grows sublinearly in the number of rounds, so its average performance approaches the best fixed competitor.
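
A runnable counterpart to the schematic loop above makes both the access pattern and the notion of regret concrete. This is a sketch under simplifying assumptions: a one-parameter model $f(x) = wx$, squared loss, a deterministic noiseless stream with $y = 3x$, and a fixed learning rate, none of which come from a specific system.

```python
from typing import List, Tuple


def make_stream(rounds: int) -> List[Tuple[float, float]]:
    """Deterministic toy stream with y = 3x (hypothetical, noiseless)."""
    xs = [1.0, 2.0, -1.0, 0.5]
    return [(xs[t % len(xs)], 3.0 * xs[t % len(xs)]) for t in range(rounds)]


def batch_fit(data: List[Tuple[float, float]]) -> float:
    """Batch: closed-form least squares for f(x) = w * x on the whole dataset."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


def online_fit(stream: List[Tuple[float, float]], lr: float = 0.1):
    """Online: one gradient step per example, tracking cumulative loss."""
    w, cumulative, curve = 0.0, 0.0, []
    for x, y in stream:
        cumulative += (w * x - y) ** 2   # loss suffered before seeing the update
        curve.append(cumulative)
        w -= lr * 2.0 * (w * x - y) * x  # gradient step on this example only
    return w, curve


stream = make_stream(200)
w_batch = batch_fit(stream)
w_online, learner_loss = online_fit(stream)

# Regret: learner's cumulative loss minus that of the best fixed w in hindsight.
best_fixed_loss = sum((w_batch * x - y) ** 2 for x, y in stream)
regret = learner_loss[-1] - best_fixed_loss
average_regret = regret / len(stream)
```

Both routes arrive at the same parameter on this stationary stream. The online learner pays for its early mistakes, but because those mistakes stop accumulating, the regret stays bounded and the average regret shrinks as the number of rounds grows.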

5.9.2 9.2 When to Use Which

Batch learning is appropriate when the data distribution is stable and you can afford periodic retraining; it is simpler to reason about and to evaluate. Online learning is the right choice when data arrive as a continuous stream, when the distribution drifts over time, or when the dataset is too large to fit in memory. Examples include fraud detection, news ranking, and trading systems. The trade-off is that online systems are more vulnerable to noisy or adversarial updates and can suffer catastrophic forgetting, in which new data overwrite previously learned knowledge. Mitigations include rehearsal (replaying old examples) and regularizing parameters that were important for earlier data.

5.10 10. Active Learning

5.10.1 10.1 Definition

Active learning is a paradigm for reducing labeling cost by letting the model choose which examples it most wants labeled (Settles, 2009). Instead of labeling data at random, a learner queries an oracle (typically a human annotator) for the labels of the unlabeled examples it expects to be most informative. The signal is still a human label, as in supervised learning, but the selection of what to label is driven by the model’s own state.

The dominant strategy is uncertainty sampling: query the examples on which the current model is least confident, since those lie near the decision boundary and are most likely to refine it. For a classifier outputting class probabilities $p_\theta(y \mid x)$, a natural uncertainty score is the predictive entropy

\[ H(x) = -\sum_{k=1}^{K} p_\theta(k \mid x)\, \log p_\theta(k \mid x), \]

which is maximized when the model spreads its mass evenly across classes. The learner queries the oracle for the label of $\arg\max_x H(x)$ from the unlabeled pool, adds the newly labeled example to the training set, retrains, and repeats. Other criteria include query-by-committee, which selects examples on which an ensemble disagrees most, and decision-theoretic methods that estimate the expected reduction in model error.
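
The selection step of uncertainty sampling is a one-liner once the entropy is available. In the sketch below, the predicted distributions for three pool examples are invented, and the retraining loop and the oracle are omitted; only the scoring and the choice of the next query are shown.

```python
import math
from typing import List


def predictive_entropy(probs: List[float]) -> float:
    """H(x) for one example; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_query(pool_probs: List[List[float]]) -> int:
    """Index of the pool example on which the model is least certain."""
    return max(range(len(pool_probs)),
               key=lambda i: predictive_entropy(pool_probs[i]))


# Hypothetical model outputs p(k | x) for three unlabeled examples.
pool_probs = [
    [0.90, 0.05, 0.05],  # confident
    [0.34, 0.33, 0.33],  # nearly uniform: close to the decision boundary
    [0.60, 0.30, 0.10],  # moderately uncertain
]

query_index = select_query(pool_probs)
max_possible_entropy = math.log(3)  # entropy of the uniform distribution, K = 3
```

The nearly uniform example is selected, and its entropy sits just below the ceiling of $\log K$, which is the bound attained when the mass is spread perfectly evenly.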

5.10.2 10.2 When to Use It

Active learning shines when unlabeled data are abundant but each label is expensive, for example when labeling requires a physician, a chemist, or a costly physical experiment. By concentrating the annotation budget on the most informative examples, it can reach a target accuracy with substantially fewer labels than random sampling. The caveats are real. The selection process can introduce sampling bias, since the labeled set is no longer drawn from $\mathcal{D}$, which complicates unbiased evaluation. And the informativeness heuristics depend on the current, possibly poor, model, so early queries made before the model is competent may be misguided.

5.11 11. How Foundation Models Blur the Categories

The taxonomy above is clarifying, but the most capable modern systems deliberately violate its boundaries by chaining paradigms into a pipeline. A contemporary large language model is not the product of one paradigm but of three stacked stages, each contributing a different signal.

The first stage is self-supervised pretraining on a web-scale corpus using next-token prediction. This stage consumes no human labels and produces a base model with broad linguistic and factual competence, a representation that can be transferred to almost anything. This is self-supervised learning serving as the foundation for transfer learning.

The second stage is supervised fine-tuning, sometimes called instruction tuning. Here a comparatively small set of high-quality demonstrations, often written by humans, teaches the base model to follow instructions and respond in a useful format. This is ordinary supervised learning applied on top of the pretrained weights.

The third stage uses reinforcement learning from human feedback, in which human preferences between candidate responses train a reward model, and the language model is then optimized against that reward with a reinforcement learning algorithm (Ouyang et al., 2022). This stage aligns the model’s behavior with human values and intentions that are difficult to capture with demonstrations alone. The same slot is increasingly filled by reinforcement learning on verifiable rewards, where correctness on tasks like mathematics or code execution provides the reward signal directly, sidestepping the need to learn a reward model.

flowchart LR
    A["Web-scale text corpus"] --> B["Stage 1: Self-supervised pretraining (next-token prediction)"]
    B --> C["Base model"]
    D["Human demonstrations"] --> E["Stage 2: Supervised fine-tuning (instruction following)"]
    C --> E
    E --> F["Instruction-tuned model"]
    G["Human preferences or verifiable rewards"] --> H["Stage 3: Reinforcement learning (alignment)"]
    F --> H
    H --> I["Aligned assistant"]

The result is a single artifact whose competence comes from self-supervision, whose helpfulness comes from supervised demonstrations, and whose alignment comes from reinforcement learning. Multi-task learning appears throughout, since a single instruction-tuned model performs translation, summarization, coding, and question answering through one interface. Transfer learning is the connective tissue, because every stage builds on the weights of the previous one. The lesson is that the paradigms are not competitors but components: a serious system selects a signal for each stage of training according to what that stage is trying to instill.

5.12 12. Comparison Table

Paradigm	Signal used	Typical use case	Key limitation
Supervised	Human target labels	Classification, regression with ground truth	Labels are costly and noisy
Unsupervised	Structure in inputs	Clustering, dimensionality reduction	Hard to evaluate correctness
Self-supervised	Withheld parts of the input	Pretraining language and vision models	Pretext task is not the end goal
Semi-supervised	Few labels plus many unlabeled	Speech, medical imaging at scale	Assumes shared distribution
Reinforcement	Scalar reward from environment	Control, games, sequential decisions	Sample inefficient, reward design hard
Transfer	Pretrained weights as prior	Adapting to small target datasets	Risk of negative transfer
Multi-task	Union of several task signals	Joint perception, unified models	Task balancing, conflicts
Online vs batch	Same signal, different arrival	Streaming versus stable data	Drift versus forgetting
Active	Human labels chosen by model	Expensive-label domains	Selection bias from weak model

5.13 13. A Worked Example: One Problem, Several Signals

To see how the choice of paradigm follows from the available signal rather than from a favorite algorithm, consider a single concrete problem: a hospital wants to flag chest X-rays that show pneumonia. The same images can be exploited by different paradigms depending on what feedback is on hand.

If the hospital has 50,000 X-rays each read and labeled by a radiologist, the problem is supervised: train a classifier on the labeled pairs and minimize cross entropy. If it has the same 50,000 images but only 500 are labeled, the problem becomes semi-supervised: use the 500 labels to anchor the boundary and the 49,500 unlabeled images, through consistency regularization, to shape it. If it has a million unlabeled X-rays and wants reusable features before any labeling, the move is self-supervised pretraining (mask patches and reconstruct them) followed by transfer learning onto whatever small labeled set it can afford. If radiologist time is the binding constraint and the hospital can label only a few hundred images total, active learning spends that budget on the images the model is least sure about, ranked by predictive entropy $H(x)$. And if the same backbone must also localize the affected lung region and estimate severity, multi-task learning shares one representation across all three heads.

The data did not change across these scenarios. What changed was the signal the hospital could collect, and with it the paradigm. This is the reasoning the next section generalizes.

5.14 14. Choosing a Paradigm in Practice

Confronted with a new problem, the practitioner should reason from the available signal rather than from a favorite algorithm. Ask first whether ground-truth labels exist and at what cost. If they are abundant and cheap, supervised learning is the natural baseline. If they are expensive but raw data are plentiful, consider self-supervised pretraining, semi-supervised methods, or active learning to stretch the labeling budget. If the problem is one of sequential decisions evaluated by a long-horizon objective, reinforcement learning is indicated despite its overhead. If a powerful pretrained model already exists for a related domain, transfer learning will almost always beat training from scratch. Finally, ask how data arrive in time, since a streaming, drifting source pushes you toward online updates regardless of which signal you use.

A short list of pitfalls recurs across paradigms and is worth keeping in view. Label noise and covariate shift undermine supervised learning. Unsupervised structure can be meaningless without downstream validation. Semi-supervised regularizers backfire when the unlabeled pool is off-distribution. Reinforcement learning invites reward hacking. Transfer and multi-task learning both risk negative transfer when sources or tasks conflict. Online learning courts catastrophic forgetting. In every case the failure traces back to a mismatch between the assumed signal and the real one, which is why naming the signal first, as Section 1 urges, is the most reliable safeguard.

The mature view is that these paradigms form a toolkit rather than a menu from which you pick exactly one item. The systems that define the current state of the art succeed precisely because they compose paradigms, matching each phase of training to the signal that phase can most efficiently exploit.

5.15 References

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
Sutton, R. S., and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Settles, B. (2009). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison. https://minds.wisconsin.edu/handle/1793/60660
Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1), 41–75. https://link.springer.com/article/10.1023/A:1007379606734
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL. https://arxiv.org/abs/1810.04805
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations (SimCLR). ICML. https://arxiv.org/abs/2002.05709
Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS. https://arxiv.org/abs/2203.02155
Bommasani, R., et al. (2021). On the Opportunities and Risks of Foundation Models. Stanford CRFM. https://arxiv.org/abs/2108.07258
Pan, S. J., and Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359. https://ieeexplore.ieee.org/document/5288526
van Engelen, J. E., and Hoos, H. H. (2020). A Survey on Semi-Supervised Learning. Machine Learning, 109, 373–440. https://link.springer.com/article/10.1007/s10994-019-05855-6

The model is asked to discover that structure: to group similar examples (clustering), to find a lower-dimensional coordinate system that preserves information (dimensionality reduction), or to estimate the density from which the data were drawn. Two classic methods make the idea concrete. **k-means clustering** partitions points by proximity to $k$ learned centroids $\mu_1, \dots, \mu_k$, minimizing the within-cluster sum of squares $$ J(\{\mu_j\}, \{c_i\}) = \sum_{i=1}^{N} \big\lVert x_i - \mu_{c_i} \big\rVert^2, $$ where $c_i \in \{1, \dots, k\}$ is the cluster assigned to point $i$. Lloyd's algorithm minimizes $J$ by alternating two steps that each weakly decrease the objective: assign every point to its nearest centroid, then move every centroid to the mean of its assigned points. Because $J$ is bounded below and each step is non-increasing, the procedure converges, though only to a local minimum, which is why practitioners restart from several initializations. **Principal component analysis** finds orthogonal directions of maximal variance. If $\Sigma$ is the data covariance matrix, the first principal component is the unit vector $w$ maximizing $w^\top \Sigma w$, whose solution is the leading eigenvector of $\Sigma$; the top $d$ eigenvectors give the $d$-dimensional projection that minimizes reconstruction error. Both methods illustrate the unsupervised pattern: the objective references only $x$, never a label. ### 3.2 When to Use It Use unsupervised learning for exploratory analysis, customer segmentation, anomaly detection, and as a preprocessing step that compresses or denoises features before a downstream supervised stage. Because there is no label to define correctness, evaluation is intrinsically harder. You must rely on intrinsic proxy metrics (silhouette scores, reconstruction error, held-out log-likelihood) or on extrinsic validation, namely whether the discovered structure proves useful for a downstream task. 
Mature open-source implementations of all of the above ship in scikit-learn, which is a sensible default starting point. ## 4. Self-Supervised Learning ### 4.1 Definition Self-supervised learning is the paradigm that powers modern foundation models, and it sits between supervised and unsupervised learning. The data carry no human labels, yet the method constructs a supervised-style task automatically by withholding part of each input and asking the model to predict it from the rest. The signal is therefore generated from the data itself, hence the name. The most influential instance is **next-token prediction** in language models. Given a sequence of tokens $x_1, x_2, \dots, x_T$, the model factorizes the joint probability by the chain rule and is trained to maximize the likelihood of each token given its predecessors. Equivalently, it minimizes the average negative log-likelihood $$ \mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right). $$ The label for position $t$ is simply the token that actually occurs there, so a vast unlabeled text corpus becomes a vast supervised dataset at no annotation cost. The exponential of this loss is the model's **perplexity**, a standard intrinsic measure of how well the model predicts held-out text. Other variants change what is withheld. **Masked prediction**, used by BERT for text and by masked autoencoders for images, blanks a random subset of tokens or patches and reconstructs them from the visible remainder (Devlin et al., 2019). **Contrastive learning**, used by SimCLR for vision, builds two augmented views of the same example and learns an embedding that pulls matching views together while pushing non-matching views apart (Chen et al., 2020). 
A representative contrastive objective for an anchor view $z_i$ and its positive partner $z_j$ among a batch is $$ L_{i,j} = -\log \frac{\exp\!\big(\operatorname{sim}(z_i, z_j) / \tau\big)}{\sum_{k \ne i} \exp\!\big(\operatorname{sim}(z_i, z_k) / \tau\big)}, $$ where $\operatorname{sim}$ is cosine similarity and $\tau$ is a temperature. In every case the supervisory target is carved out of the input, not supplied from outside. ### 4.2 Why It Matters Self-supervised learning solved the central bottleneck of deep learning, namely the scarcity of labels, by turning the abundance of raw data into the source of supervision. A model pretrained this way learns broadly useful representations of language, vision, or audio that transfer to many downstream tasks. The pretraining objective is not the goal in itself; it is a **pretext task** whose only purpose is to force the model to build rich internal structure. Predicting the next token well, for instance, requires latent competence in syntax, world knowledge, and reasoning, none of which were labeled but all of which become available to later stages (Bommasani et al., 2021). ## 5. Semi-Supervised Learning ### 5.1 Definition Semi-supervised learning addresses the common situation in which labels are scarce but unlabeled data are plentiful. It uses a small labeled set $\{(x_i, y_i)\}_{i=1}^{n}$ together with a large unlabeled set $\{x_j\}_{j=n+1}^{n+m}$ with $m \gg n$, combining the strengths of supervised and unsupervised approaches. The labeled examples anchor the decision boundary, while the unlabeled examples reveal the shape of the data distribution and discourage the boundary from cutting through dense regions (van Engelen and Hoos, 2020). This intuition rests on explicit assumptions, and naming them clarifies when the method can work at all. 
The **cluster assumption** holds that points in the same cluster share a label; the closely related **low-density separation assumption** holds that the decision boundary should lie in a region of low data density. A combined objective makes these operational by adding an unsupervised regularizer to the supervised loss: $$ \mathcal{L}(\theta) = \underbrace{\frac{1}{n}\sum_{i=1}^{n} L\!\left(f_\theta(x_i), y_i\right)}_{\text{supervised}} \;+\; \lambda \underbrace{\frac{1}{m}\sum_{j=n+1}^{n+m} U\!\left(f_\theta, x_j\right)}_{\text{unsupervised}}, $$ where $\lambda > 0$ weights the unsupervised term $U$. Two choices of $U$ dominate practice. **Pseudo-labeling** lets a model trained on the labeled data assign provisional labels to confident unlabeled examples and retrains on the union. **Consistency regularization** sets $U$ to penalize changes in the prediction when an unlabeled input is slightly perturbed, encoding the belief that small input perturbations should not cross the boundary. ### 5.2 When to Use It Reach for semi-supervised learning when labeling is expensive but raw data collection is cheap, which is the norm in domains like speech recognition, medical imaging, and web-scale classification. The crucial assumption is that the unlabeled data come from the same distribution as the labeled data and that the cluster or low-density-separation assumption holds. When that assumption fails, for instance when the unlabeled pool contains out-of-distribution classes, propagating labels through unlabeled data can amplify errors rather than correct them, and the regularizer actively pulls the boundary in the wrong direction. ## 6. Reinforcement Learning ### 6.1 Definition Reinforcement learning concerns an agent that interacts with an environment over time, taking actions, observing states, and receiving scalar rewards. There is no labeled correct action. 
The signal is the reward, which may be sparse and delayed, and the agent must learn a policy that maximizes cumulative reward over the long run (Sutton and Barto, 2018). The setting is formalized as a **Markov decision process** (MDP), the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ of states, actions, a transition kernel $P(s' \mid s, a)$, a reward function $r(s, a)$, and a discount factor $\gamma \in [0, 1)$. A policy $\pi(a \mid s)$ maps states to action distributions, and the agent seeks the policy that maximizes the expected discounted return $$ J(\pi) = \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\right]. $$ The discount $\gamma$ trades immediate against future reward and keeps the sum finite. The value of acting from a state is captured by the **state-action value function** $Q^\pi(s, a)$, the expected return after taking $a$ in $s$ and following $\pi$ thereafter, which satisfies the Bellman equation $$ Q^\pi(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[ Q^\pi(s', a') \big] \,\Big]. $$ Two features distinguish this paradigm. First, the data are not fixed: the agent's own actions determine what it observes next, creating a feedback loop absent from supervised learning. Second, the agent must balance **exploration** (trying new actions to gather information) against **exploitation** (choosing actions known to yield reward), a tension with no analogue in the static paradigms. ### 6.2 When to Use It Use reinforcement learning for sequential decision problems where success is defined by a long-horizon objective rather than per-step labels: game playing, robotic control, resource scheduling, and recommendation under long-term engagement. 
The cost is sample inefficiency and instability; reinforcement learning typically needs far more interactions than supervised learning needs labels, and reward design is notoriously delicate, since agents will exploit any loophole in a poorly specified reward, a failure mode called **reward hacking**. When a faithful simulator or a verifiable correctness signal is unavailable, the expense of collecting real interactions often makes another paradigm preferable. ## 7. Transfer Learning ### 7.1 Definition Transfer learning reuses knowledge gained on a source task to improve learning on a different but related target task (Pan and Yang, 2010). Rather than training from random initialization, you start from a model already trained on a large source dataset and adapt it. The signal is indirect: it is the inductive bias baked into the pretrained weights, which encode regularities of the source domain that, with luck, also hold in the target. The dominant recipe is **pretrain then fine-tune**. A model is pretrained, often self-supervised, on a broad corpus, and then its weights initialize supervised training on the smaller target dataset. Adaptation spans a spectrum: updating all weights (full fine-tuning), updating only the final layers, or freezing the backbone and inserting a small number of new trainable parameters (parameter-efficient fine-tuning, as in adapters or low-rank updates). The fewer parameters you free to move, the more of the source bias you retain, which is desirable exactly when the target data are too few to estimate many parameters reliably. ### 7.2 When to Use It Transfer learning is the default whenever your target dataset is small relative to the capacity of the model you want to use, which is almost always the case in the foundation-model era. It dramatically reduces the data and compute needed for a new task. 
The main risk is **negative transfer**: if the source and target domains are too dissimilar, the pretrained bias can hurt rather than help, and the model may need more data to unlearn the source than it would have needed to learn the target from scratch. ## 8. Multi-Task Learning ### 8.1 Definition Multi-task learning trains a single model to perform several tasks at once, sharing representations across them (Caruana, 1997). The signal is the union of the supervisory signals of all the tasks. By optimizing a combined objective, the model is encouraged to learn features useful across tasks, which acts as a form of inductive regularization and can improve generalization on each task individually. A common architecture uses a shared backbone $g_\phi$ with task-specific heads $h_{\psi_k}$, and the loss is a weighted sum over $K$ tasks: $$ \mathcal{L}(\phi, \psi_1, \dots, \psi_K) = \sum_{k=1}^{K} w_k\, \mathcal{L}_k\!\big( h_{\psi_k} \circ g_\phi \big). $$ The weights $w_k$ matter. If one task's loss is on a much larger scale, or if its gradients dominate, the shared backbone $g_\phi$ drifts toward that task at the expense of the others. Methods that normalize gradient magnitudes across tasks or that treat the weights as learnable address this directly. ### 8.2 When to Use It Multi-task learning helps when tasks are related and you have limited data for each one, so that knowledge learned for one task supports the others. It is widely used in autonomous driving (jointly detecting lanes, vehicles, and pedestrians) and in language models that perform many tasks through a unified interface. The difficulty lies in balancing the tasks: if one task dominates the gradient or if tasks make conflicting demands on the shared features, performance on some tasks can degrade, again a form of negative transfer, here between simultaneously trained tasks rather than across time. ## 9. 
Online Versus Batch Learning ### 9.1 The Distinction The previous sections classified learning by its signal. This section classifies it by how data arrive in time, an orthogonal axis that cross-cuts every paradigm above. **Batch learning**, also called offline learning, trains on a fixed dataset all at once and then deploys a frozen model. **Online learning** consumes data sequentially, updating the model incrementally as each example or small mini-batch arrives, and never assumes the full dataset is available at once. The contrast is one of access, not of objective: ``` # batch: one optimization over the whole dataset model = train(all_data) # online: update as each example arrives for x, y in stream: model = update(model, x, y) ``` Online learning is the natural setting for streaming data and for **non-stationary** distributions, where $\mathcal{D}$ itself changes over time, a phenomenon called **concept drift**. Its theory is often framed in terms of **regret**, the gap between the learner's cumulative loss and that of the best fixed model in hindsight; a good online algorithm has regret that grows sublinearly in the number of rounds, so its average performance approaches the best fixed competitor. ### 9.2 When to Use Which Batch learning is appropriate when the data distribution is stable and you can afford periodic retraining; it is simpler to reason about and to evaluate. Online learning is the right choice when data arrive as a continuous stream, when the distribution drifts over time, or when the dataset is too large to fit in memory. Examples include fraud detection, news ranking, and trading systems. The trade-off is that online systems are more vulnerable to noisy or adversarial updates and can suffer **catastrophic forgetting**, in which new data overwrite previously learned knowledge. Mitigations include rehearsal (replaying old examples) and regularizing parameters that were important for earlier data. ## 10. 
Active Learning ### 10.1 Definition Active learning is a paradigm for reducing labeling cost by letting the model choose which examples it most wants labeled (Settles, 2009). Instead of labeling data at random, a learner queries an oracle (typically a human annotator) for the labels of the unlabeled examples it expects to be most informative. The signal is still a human label, as in supervised learning, but the **selection** of what to label is driven by the model's own state. The dominant strategy is **uncertainty sampling**: query the examples on which the current model is least confident, since those lie near the decision boundary and are most likely to refine it. For a classifier outputting class probabilities $p_\theta(y \mid x)$, a natural uncertainty score is the predictive entropy $$ H(x) = -\sum_{k=1}^{K} p_\theta(k \mid x)\, \log p_\theta(k \mid x), $$ which is maximized when the model spreads its mass evenly across classes. The learner labels $\arg\max_x H(x)$ from the unlabeled pool, adds it to the training set, retrains, and repeats. Other criteria include **query-by-committee**, which selects examples on which an ensemble disagrees most, and decision-theoretic methods that estimate the expected reduction in model error. ### 10.2 When to Use It Active learning shines when unlabeled data are abundant but each label is expensive, for example when labeling requires a physician, a chemist, or a costly physical experiment. By concentrating the annotation budget on the most informative examples, it can reach a target accuracy with substantially fewer labels than random sampling. The caveats are real. The selection process can introduce **sampling bias**, since the labeled set is no longer drawn from $\mathcal{D}$, which complicates unbiased evaluation. And the informativeness heuristics depend on the current, possibly poor, model, so early queries made before the model is competent may be misguided. ## 11. 
How Foundation Models Blur the Categories The taxonomy above is clarifying, but the most capable modern systems deliberately violate its boundaries by chaining paradigms into a pipeline. A contemporary large language model is not the product of one paradigm but of three stacked stages, each contributing a different signal. The first stage is **self-supervised pretraining** on a web-scale corpus using next-token prediction. This stage consumes no human labels and produces a base model with broad linguistic and factual competence, a representation that can be transferred to almost anything. This is self-supervised learning serving as the foundation for transfer learning. The second stage is **supervised fine-tuning**, sometimes called instruction tuning. Here a comparatively small set of high-quality demonstrations, often written by humans, teaches the base model to follow instructions and respond in a useful format. This is ordinary supervised learning applied on top of the pretrained weights. The third stage uses **reinforcement learning from human feedback**, in which human preferences between candidate responses train a reward model, and the language model is then optimized against that reward with a reinforcement learning algorithm (Ouyang et al., 2022). This stage aligns the model's behavior with human values and intentions that are difficult to capture with demonstrations alone. The same slot is increasingly filled by reinforcement learning on verifiable rewards, where correctness on tasks like mathematics or code execution provides the reward signal directly, sidestepping the need to learn a reward model. 
```{mermaid} flowchart LR A["Web-scale text corpus"] --> B["Stage 1: Self-supervised pretraining (next-token prediction)"] B --> C["Base model"] D["Human demonstrations"] --> E["Stage 2: Supervised fine-tuning (instruction following)"] C --> E E --> F["Instruction-tuned model"] G["Human preferences or verifiable rewards"] --> H["Stage 3: Reinforcement learning (alignment)"] F --> H H --> I["Aligned assistant"] ``` The result is a single artifact whose competence comes from self-supervision, whose helpfulness comes from supervised demonstrations, and whose alignment comes from reinforcement learning. Multi-task learning appears throughout, since a single instruction-tuned model performs translation, summarization, coding, and question answering through one interface. Transfer learning is the connective tissue, because every stage builds on the weights of the previous one. The lesson is that the paradigms are not competitors but components: a serious system selects a signal for each stage of training according to what that stage is trying to instill. ## 12. 
Comparison Table | Paradigm | Signal used | Typical use case | Key limitation | |---|---|---|---| | Supervised | Human target labels | Classification, regression with ground truth | Labels are costly and noisy | | Unsupervised | Structure in inputs | Clustering, dimensionality reduction | Hard to evaluate correctness | | Self-supervised | Withheld parts of the input | Pretraining language and vision models | Pretext task is not the end goal | | Semi-supervised | Few labels plus many unlabeled | Speech, medical imaging at scale | Assumes shared distribution | | Reinforcement | Scalar reward from environment | Control, games, sequential decisions | Sample inefficient, reward design hard | | Transfer | Pretrained weights as prior | Adapting to small target datasets | Risk of negative transfer | | Multi-task | Union of several task signals | Joint perception, unified models | Task balancing, conflicts | | Online vs batch | Same signal, different arrival | Streaming versus stable data | Drift versus forgetting | | Active | Human labels chosen by model | Expensive-label domains | Selection bias from weak model | ## 13. A Worked Example: One Problem, Several Signals To see how the choice of paradigm follows from the available signal rather than from a favorite algorithm, consider a single concrete problem: a hospital wants to flag chest X-rays that show pneumonia. The same images can be exploited by different paradigms depending on what feedback is on hand. If the hospital has 50,000 X-rays each read and labeled by a radiologist, the problem is **supervised**: train a classifier on the labeled pairs and minimize cross entropy. If it has the same 50,000 images but only 500 are labeled, the problem becomes **semi-supervised**: use the 500 labels to anchor the boundary and the 49,500 unlabeled images, through consistency regularization, to shape it. 
If it has a million unlabeled X-rays and wants reusable features before any labeling, the move is **self-supervised** pretraining (mask patches and reconstruct them) followed by **transfer learning** onto whatever small labeled set it can afford. If radiologist time is the binding constraint and the hospital can label only a few hundred images total, **active learning** spends that budget on the images the model is least sure about, ranked by predictive entropy $H(x)$. And if the same backbone must also localize the affected lung region and estimate severity, **multi-task learning** shares one representation across all three heads. The data did not change across these scenarios. What changed was the signal the hospital could collect, and with it the paradigm. This is the reasoning the next section generalizes. ## 14. Choosing a Paradigm in Practice Confronted with a new problem, the practitioner should reason from the available signal rather than from a favorite algorithm. Ask first whether ground-truth labels exist and at what cost. If they are abundant and cheap, supervised learning is the natural baseline. If they are expensive but raw data are plentiful, consider self-supervised pretraining, semi-supervised methods, or active learning to stretch the labeling budget. If the problem is one of sequential decisions evaluated by a long-horizon objective, reinforcement learning is indicated despite its overhead. If a powerful pretrained model already exists for a related domain, transfer learning will almost always beat training from scratch. Finally, ask how data arrive in time, since a streaming, drifting source pushes you toward online updates regardless of which signal you use. A short list of pitfalls recurs across paradigms and is worth keeping in view. Label noise and covariate shift undermine supervised learning. Unsupervised structure can be meaningless without downstream validation. 
Semi-supervised regularizers backfire when the unlabeled pool is off-distribution. Reinforcement learning invites reward hacking. Transfer and multi-task learning both risk negative transfer when sources or tasks conflict. Online learning courts catastrophic forgetting. In every case the failure traces back to a mismatch between the assumed signal and the real one, which is why naming the signal first, as Section 1 urges, is the most reliable safeguard. The mature view is that these paradigms form a toolkit rather than a menu from which you pick exactly one item. The systems that define the current state of the art succeed precisely because they compose paradigms, matching each phase of training to the signal that phase can most efficiently exploit. ## References 1. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. https://www.deeplearningbook.org/ 2. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/ 3. Sutton, R. S., and Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html 4. Settles, B. (2009). *Active Learning Literature Survey*. Computer Sciences Technical Report 1648, University of Wisconsin-Madison. https://minds.wisconsin.edu/handle/1793/60660 5. Caruana, R. (1997). Multitask Learning. *Machine Learning*, 28(1), 41 to 75. https://link.springer.com/article/10.1023/A:1007379606734 6. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *NAACL*. https://arxiv.org/abs/1810.04805 7. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations (SimCLR). *ICML*. https://arxiv.org/abs/2002.05709 8. Ouyang, L., et al. (2022). 
Training Language Models to Follow Instructions with Human Feedback (InstructGPT). *NeurIPS*. https://arxiv.org/abs/2203.02155 9. Bommasani, R., et al. (2021). On the Opportunities and Risks of Foundation Models. *Stanford CRFM*. https://arxiv.org/abs/2108.07258 10. Pan, S. J., and Yang, Q. (2010). A Survey on Transfer Learning. *IEEE Transactions on Knowledge and Data Engineering*, 22(10), 1345 to 1359. https://ieeexplore.ieee.org/document/5288526 11. van Engelen, J. E., and Hoos, H. H. (2020). A Survey on Semi-Supervised Learning. *Machine Learning*, 109, 373 to 440. https://link.springer.com/article/10.1007/s10994-019-05855-6