5 Types of Learning
Machine learning is often introduced as a single idea: fit a function to data. In practice, the field is organized into a taxonomy of learning paradigms that differ not in the optimizers they use but in the kind of supervisory signal they consume. Understanding this taxonomy is the difference between treating learning as a black box and treating it as an engineering discipline in which the choice of paradigm follows directly from the data you have, the feedback you can collect, and the cost of getting labels. This chapter surveys the major paradigms, makes their signals precise, and explains why modern foundation models deliberately combine several of them.
5.1 1. The Organizing Question: What Is the Signal?
Every learning paradigm answers one question: where does the information that shapes the model come from? In supervised learning it comes from human-provided target labels. In unsupervised learning it comes from structure latent in the inputs themselves. In reinforcement learning it comes from a scalar reward delivered by an environment. The taxonomy below is best read not as a list of separate algorithms but as a list of distinct answers to this single question.
Formally, we usually assume the data are drawn independently from some fixed but unknown distribution. Let \(x\) denote inputs and, when available, \(y\) denote targets. A paradigm is characterized by (1) what is observed at training time, (2) the loss that the observation induces, and (3) the assumptions that make generalization to unseen data possible. Keeping these three things in mind lets you classify any new method you encounter, including hybrid methods that resist a single label.
5.2 2. Supervised Learning
5.2.1 2.1 Definition
Supervised learning assumes access to a labeled dataset of input-target pairs, and the goal is to learn a mapping that predicts the target for new inputs. The signal is the explicit label \(y\) attached to each example. Classification predicts a discrete category (is this email spam or not), and regression predicts a continuous quantity (what will this house sell for).
The canonical objective minimizes expected loss over the data distribution, approximated by the empirical average over the training set:
loss = (1/N) * sum_i L(model(x_i), y_i)
where \(N\) is the number of training examples and \(L\) is a per-example loss, such as cross-entropy for classification or squared error for regression.
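As a minimal sketch of the empirical objective above, the snippet below averages a per-example loss over a tiny dataset. The function names (`empirical_risk`, `squared_error`, `cross_entropy`) and the two toy models are illustrative assumptions, not part of any library.

```python
import math

def squared_error(prediction, target):
    """Per-example regression loss: (prediction - target)^2."""
    return (prediction - target) ** 2

def cross_entropy(probabilities, target_class):
    """Per-example classification loss: negative log-probability of the true class."""
    return -math.log(probabilities[target_class])

def empirical_risk(model, inputs, targets, loss_fn):
    """Average loss over the training set: (1/N) * sum_i L(model(x_i), y_i)."""
    losses = [loss_fn(model(x), y) for x, y in zip(inputs, targets)]
    return sum(losses) / len(losses)

# Hypothetical regression model that predicts y = 2x.
def linear_model(x):
    return 2.0 * x

regression_risk = empirical_risk(
    linear_model, [1.0, 2.0, 3.0], [2.0, 4.5, 6.0], squared_error
)

# Hypothetical classifier that always outputs the same two-class distribution.
def constant_classifier(x):
    return [0.25, 0.75]

classification_risk = empirical_risk(
    constant_classifier, [0, 1], [1, 1], cross_entropy
)
```

Only the loss function changes between the regression and classification cases; the averaging over examples is identical.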
5.2.2 2.2 When to Use It
Use supervised learning when you can obtain labels that are accurate, consistent, and representative of deployment conditions. It remains the workhorse for tabular prediction, medical diagnosis from labeled scans, credit scoring, and any setting where a ground truth exists and can be recorded. Its central limitation is the cost of labels: producing a large, high-quality labeled corpus often requires expert annotators, and label noise propagates directly into the model.
5.3 3. Unsupervised Learning
5.3.1 3.1 Definition
Unsupervised learning works with inputs alone and no targets. The signal is the geometric and statistical structure of the data. The model is asked to discover that structure: to group similar examples (clustering), to find a lower-dimensional coordinate system that preserves information (dimensionality reduction), or to estimate the density from which the data were drawn.
Classic methods include k-means clustering, which partitions points by proximity to learned centroids, and principal component analysis, which finds orthogonal directions of maximal variance. The k-means objective, for example, minimizes the total squared distance from each point to the centroid of its cluster:
loss = sum_over_clusters sum_over_points_in_cluster squared_distance(point, centroid)
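The clustering objective can be made concrete with a small sketch of the standard alternating procedure for k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points). This version assumes squared Euclidean distance and caller-supplied initial centroids; all function names and the toy points are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

def kmeans_loss(points, centroids, labels):
    """Within-cluster sum of squared distances."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its assigned points."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(old)  # an empty cluster keeps its previous centroid
            continue
        updated.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return updated

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid updates, recording the loss each round."""
    history = []
    for _ in range(iterations):
        labels = assign(points, centroids)
        history.append(kmeans_loss(points, centroids, labels))
        centroids = update_centroids(points, labels, centroids)
    return centroids, assign(points, centroids), history

# Two well-separated toy groups; one initial centroid is taken from each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial_centroids = [points[0], points[3]]
centroids, labels, loss_history = kmeans(points, initial_centroids, iterations=5)
```

Note that no labels appear anywhere: the loss is computed from the inputs alone, and it never increases from one round to the next.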
5.3.2 3.2 When to Use It
Use unsupervised learning for exploratory analysis, customer segmentation, anomaly detection, and as a preprocessing step that compresses or denoises features before a downstream supervised stage. Because there is no label to define correctness, evaluation is intrinsically harder: you must rely on proxy metrics (silhouette scores, reconstruction error) or on whether the discovered structure proves useful downstream.
5.4 4. Self-Supervised Learning
5.4.1 4.1 Definition
Self-supervised learning is the paradigm that powers modern foundation models, and it sits between supervised and unsupervised learning. The data carry no human labels, yet the method constructs a supervised-style task automatically by withholding part of each input and asking the model to predict it from the rest. The signal is therefore generated from the data itself, hence the name.
The most influential instance is next-token prediction in language models: given a sequence, predict the next token. The label for position \(t\) is simply the token that actually occurs at position \(t+1\), so a vast unlabeled text corpus becomes a vast supervised dataset at no annotation cost.
# next-token objective over a sequence of tokens
loss = (1/T) * sum_t cross_entropy(model(tokens[:t]), tokens[t])
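To show how an unlabeled sequence turns into supervised pairs, here is a minimal sketch that uses a count-based bigram model as a stand-in for a real language model. The `BigramModel` class, the add-one smoothing, and the toy corpus are assumptions made purely for illustration; the sketch also skips the empty-context position because this toy model needs at least one preceding token.

```python
import math
from collections import Counter, defaultdict

def next_token_pairs(tokens):
    """Turn one unlabeled sequence into (context, target) training pairs."""
    # Position 0 is skipped: the toy model below needs a non-empty context.
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

class BigramModel:
    """Toy stand-in for a language model: conditions on the last context token only."""

    def __init__(self, vocabulary):
        self.vocabulary = sorted(vocabulary)
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for tokens in sequences:
            for context, target in next_token_pairs(tokens):
                self.counts[context[-1]][target] += 1

    def probability(self, context, target):
        # Add-one smoothing keeps every probability strictly positive.
        row = self.counts[context[-1]]
        return (row[target] + 1) / (sum(row.values()) + len(self.vocabulary))

def next_token_loss(model, tokens):
    """Average cross-entropy of the true next token over all positions."""
    pairs = next_token_pairs(tokens)
    return sum(-math.log(model.probability(c, y)) for c, y in pairs) / len(pairs)

corpus = ["the cat sat on the mat".split()]
vocabulary = set(corpus[0])
pairs = next_token_pairs(corpus[0])

untrained = BigramModel(vocabulary)   # no counts: uniform predictions
trained = BigramModel(vocabulary)
trained.fit(corpus)

untrained_loss = next_token_loss(untrained, corpus[0])
trained_loss = next_token_loss(trained, corpus[0])
```

The targets come entirely from the text itself, which is the point of the paradigm: no annotator ever touched the corpus.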
Other variants include masked prediction, where random tokens or image patches are blanked and reconstructed, and contrastive learning, where the model learns to pull together two augmented views of the same example and push apart views of different examples.
5.4.2 4.2 Why It Matters
Self-supervised learning relieved a central bottleneck of deep learning, the scarcity of labels, by turning the abundance of raw data into the source of supervision. A model pretrained this way learns broadly useful representations of language, vision, or audio that transfer to many downstream tasks. The pretraining objective is not the goal in itself; it is a pretext task whose only purpose is to force the model to build rich internal structure.
5.5 5. Semi-Supervised Learning
5.5.1 5.1 Definition
Semi-supervised learning addresses the common situation in which labels are scarce but unlabeled data are plentiful. It uses a small labeled set together with a large unlabeled set, combining the strengths of supervised and unsupervised approaches. The labeled examples anchor the decision boundary, while the unlabeled examples reveal the shape of the data distribution and discourage the boundary from cutting through dense regions.
Typical techniques include pseudo-labeling, in which a model trained on the labeled data assigns provisional labels to unlabeled examples and then retrains on the union, and consistency regularization, in which the model is penalized for changing its prediction when an unlabeled input is slightly perturbed.
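Pseudo-labeling can be sketched in a few lines. The example below assumes a 1-D nearest-centroid classifier and adds a confidence threshold (a common refinement, not something the description above requires) so that only confidently labeled points join the training set. All names and data are hypothetical.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid classifier on 1-D inputs: one mean per class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def predict_with_confidence(centroids, x):
    """Return (label, margin). Assumes at least two classes.

    The margin is how much closer x is to the best centroid than to the runner-up.
    """
    ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

def pseudo_label(labeled_x, labeled_y, unlabeled_x, threshold):
    """Train on labeled data, label confident unlabeled points, retrain on the union."""
    model = fit_centroids(labeled_x, labeled_y)
    union_x, union_y = list(labeled_x), list(labeled_y)
    for x in unlabeled_x:
        label, margin = predict_with_confidence(model, x)
        if margin >= threshold:          # skip points near the boundary
            union_x.append(x)
            union_y.append(label)
    added = len(union_x) - len(labeled_x)
    return fit_centroids(union_x, union_y), added

labeled_x, labeled_y = [0.0, 10.0], ["a", "b"]
unlabeled_x = [1.0, 2.0, 5.1, 8.0, 9.0]
model, added = pseudo_label(labeled_x, labeled_y, unlabeled_x, threshold=2.0)
```

The point at 5.1 sits almost exactly between the two classes and is left unlabeled, which is how the threshold limits the risk of reinforcing early mistakes.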
5.5.2 5.2 When to Use It
Reach for semi-supervised learning when labeling is expensive but raw data collection is cheap, which is the norm in domains like speech recognition, medical imaging, and web-scale classification. Two assumptions are crucial: the unlabeled data must come from the same distribution as the labeled data, and the cluster (or low-density-separation) assumption must hold, meaning that class boundaries fall in sparse regions of the input space. When either assumption fails, propagating labels through unlabeled data can amplify errors rather than correct them.
5.6 6. Reinforcement Learning
5.6.1 6.1 Definition
Reinforcement learning concerns an agent that interacts with an environment over time, taking actions, observing states, and receiving scalar rewards. There is no labeled correct action. The signal is the reward, which may be sparse and delayed, and the agent must learn a policy that maximizes cumulative reward over the long run.
The setting is formalized as a Markov decision process with states, actions, a transition function, and a reward function. The agent seeks a policy that maximizes expected discounted return:
return = sum_t (gamma ** t) * reward_t
where the discount factor \(\gamma\) in \([0, 1)\) trades off immediate against future reward. Two features distinguish this paradigm. First, the data are not fixed: the agent’s own actions determine what it observes next, creating a feedback loop. Second, the agent must balance exploration (trying new actions to gather information) against exploitation (choosing actions known to yield reward).
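The two ideas above, discounted return and the exploration-exploitation trade-off, can each be sketched in a few lines. Epsilon-greedy is used here as one simple exploration rule; it is an illustrative choice, and the function names are assumptions.

```python
import random

def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * reward_t by accumulating from the last step backward."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def epsilon_greedy(action_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))                            # explore
    return max(range(len(action_values)), key=lambda a: action_values[a])   # exploit

# A reward of 10 that arrives two steps late is worth 10 * 0.9**2 today.
delayed = discounted_return([0.0, 0.0, 10.0], gamma=0.9)

# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
steady = discounted_return([1.0, 1.0, 1.0], gamma=0.5)

# With epsilon = 0 the agent never explores and always takes the top-valued action.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=random.Random(0))
```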
5.6.2 6.2 When to Use It
Use reinforcement learning for sequential decision problems where success is defined by a long-horizon objective rather than per-step labels: game playing, robotic control, resource scheduling, and recommendation optimized for long-term engagement. The cost is sample inefficiency and instability: reinforcement learning typically needs far more interactions than supervised learning needs labels, and reward design is notoriously delicate, since agents will exploit any loophole in a poorly specified reward.
5.7 7. Transfer Learning
5.7.1 7.1 Definition
Transfer learning reuses knowledge gained on a source task to improve learning on a different but related target task. Rather than training from random initialization, you start from a model already trained on a large source dataset and adapt it. The signal is indirect: it is the inductive bias baked into the pretrained weights.
The dominant recipe is pretrain then fine-tune. A model is pretrained, often self-supervised, on a broad corpus, and then its weights are used as the starting point for supervised training on the smaller target dataset. Adaptation can mean updating all weights, updating only a few layers, or inserting a small number of new trainable parameters while freezing the rest (parameter-efficient fine-tuning).
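The "freeze most weights, train a small part" variant can be illustrated with a deliberately tiny sketch. Here `pretrained_features` is a hypothetical fixed function standing in for a pretrained backbone, and only a two-weight linear head is trained by gradient descent on a four-example target dataset. Nothing here reflects a real pretrained model or library.

```python
def pretrained_features(x):
    """Hypothetical frozen backbone: maps a raw input to a fixed feature vector."""
    return [x, x * x]

def head_predict(head, features):
    """Linear head on top of the frozen features."""
    return sum(w * f for w, f in zip(head, features))

def fine_tune_head(inputs, targets, steps=500, learning_rate=0.1):
    """Train only the head with gradient descent on mean squared error."""
    head = [0.0, 0.0]
    # Features are computed once: the backbone receives no gradient updates.
    features = [pretrained_features(x) for x in inputs]
    n = len(inputs)
    for _ in range(steps):
        gradients = [0.0] * len(head)
        for f, y in zip(features, targets):
            error = head_predict(head, f) - y
            for j in range(len(head)):
                gradients[j] += 2.0 * error * f[j] / n
        head = [w - learning_rate * g for w, g in zip(head, gradients)]
    return head

# Small target dataset generated by y = 3x^2 - x, which the frozen features can express.
inputs = [-1.0, -0.5, 0.5, 1.0]
targets = [3.0 * x * x - x for x in inputs]
head = fine_tune_head(inputs, targets)
```

Because the frozen features already contain what the target task needs, four examples suffice to recover the head weights; when the features are a poor match for the target, the same recipe is where negative transfer shows up.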
5.7.2 7.2 When to Use It
Transfer learning is the default whenever your target dataset is small relative to the capacity of the model you want to use, which is almost always the case in the foundation-model era. It dramatically reduces the data and compute needed for a new task. The main risk is negative transfer: if the source and target domains are too dissimilar, the pretrained bias can hurt rather than help.
5.8 8. Multi-Task Learning
5.8.1 8.1 Definition
Multi-task learning trains a single model to perform several tasks at once, sharing representations across them. The signal is the union of the supervisory signals of all the tasks. By optimizing a combined objective, the model is encouraged to learn features that are useful across tasks, which acts as a form of regularization and can improve generalization on each task individually.
A common architecture uses a shared backbone with task-specific heads, and the loss is a weighted sum:
loss = sum_over_tasks weight_k * task_loss_k(shared_backbone, head_k)
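A minimal sketch of the weighted-sum objective follows, using a one-parameter shared backbone and one scalar head per task. The task names, weights, and data are made up for illustration.

```python
def shared_backbone(x, backbone_weight):
    """Toy shared representation used by every task."""
    return backbone_weight * x

def multi_task_loss(backbone_weight, heads, weights, batch):
    """Return (weighted total, per-task losses) for tasks sharing one backbone."""
    total = 0.0
    per_task = {}
    for task, head_weight in heads.items():
        squared_errors = [
            (head_weight * shared_backbone(x, backbone_weight) - y) ** 2
            for x, y in batch[task]
        ]
        per_task[task] = sum(squared_errors) / len(squared_errors)
        total += weights[task] * per_task[task]
    return total, per_task

heads = {"depth": 1.0, "speed": 0.5}       # hypothetical task-specific heads
weights = {"depth": 1.0, "speed": 0.25}    # hypothetical loss weights
batch = {
    "depth": [(1.0, 2.0), (2.0, 5.0)],
    "speed": [(1.0, 3.0)],
}
total, per_task = multi_task_loss(2.0, heads, weights, batch)
```

The weights matter: without the 0.25 factor, the larger raw loss of the second task would dominate the total and therefore the gradient reaching the shared backbone.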
5.8.2 8.2 When to Use It
Multi-task learning helps when tasks are related and you have limited data for each one, so that knowledge learned for one task supports the others. It is widely used in autonomous driving (jointly detecting lanes, vehicles, and pedestrians) and in language models that perform many tasks through a unified interface. The difficulty lies in balancing the tasks: if one task dominates the gradient or if tasks conflict, performance on some tasks can degrade, a phenomenon known as negative transfer between tasks.
5.9 9. Online Versus Batch Learning
5.9.1 9.1 The Distinction
The previous sections classified learning by its signal. This section classifies it by how data arrive in time, an orthogonal axis. Batch learning, also called offline learning, trains on a fixed dataset all at once and then deploys a frozen model. Online learning consumes data sequentially, updating the model incrementally as each example or mini-batch arrives, and never assumes the full dataset is available at once.
# batch: one optimization over the whole dataset
model = train(all_data)
# online: update as each example arrives
for x, y in stream:
    model = update(model, x, y)
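To make the contrast concrete, the sketch below estimates a single number (a mean) in both modes. The "model" is just that estimate, and the stream with an abrupt shift halfway through is a hypothetical illustration of drift.

```python
def train_batch(all_data):
    """Batch: one computation over the whole dataset."""
    return sum(all_data) / len(all_data)

def update_running_mean(count, mean, x):
    """Online: incremental update that reproduces the batch mean exactly."""
    count += 1
    mean += (x - mean) / count
    return count, mean

def update_constant_step(estimate, x, step_size):
    """Online with a constant step size: recent data count more than old data."""
    return estimate + step_size * (x - estimate)

# Hypothetical stream whose distribution shifts from 0 to 10 halfway through.
stream = [0.0] * 50 + [10.0] * 50

batch_estimate = train_batch(stream)

count, running_estimate = 0, 0.0
tracking_estimate = 0.0
for x in stream:
    count, running_estimate = update_running_mean(count, running_estimate, x)
    tracking_estimate = update_constant_step(tracking_estimate, x, step_size=0.2)
```

The running mean matches the batch answer, which averages over both regimes. The constant-step estimate instead ends near the current regime: it tracks the drift, but only by forgetting the earlier data, which is the same trade-off discussed in the next subsection.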
5.9.2 9.2 When to Use Which
Batch learning is appropriate when the data distribution is stable and you can afford periodic retraining; it is simpler to reason about and to evaluate. Online learning is the right choice when data arrive as a continuous stream, when the distribution drifts over time (concept drift), or when the dataset is too large to fit in memory. Examples include fraud detection, news ranking, and trading systems. The trade-off is that online systems are more vulnerable to noisy or adversarial updates and can suffer catastrophic forgetting, in which new data overwrite previously learned knowledge.
5.10 10. Active Learning
5.10.1 10.1 Definition
Active learning is a paradigm for reducing labeling cost by letting the model choose which examples it most wants labeled. Instead of labeling data at random, a learner queries an oracle (typically a human annotator) for the labels of the unlabeled examples it expects to be most informative. The signal is still a human label, as in supervised learning, but the selection of what to label is driven by the model’s own uncertainty.
A common strategy is uncertainty sampling: query the examples on which the current model is least confident, since those lie near the decision boundary and are most likely to refine it.
# pick the unlabeled example the model is least sure about
scores = [uncertainty(model, x) for x in unlabeled]
query = unlabeled[scores.index(max(scores))]
label = ask_human(query)
Other criteria include query-by-committee, which selects examples on which an ensemble disagrees, and methods that estimate expected reduction in model error.
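A self-contained sketch of uncertainty sampling is given below. It assumes a 1-D logistic classifier with fixed, hypothetical parameters and uses binary entropy as the uncertainty score; other scores (margin, least confidence) would slot into the same structure.

```python
import math

def predict_proba(weight, bias, x):
    """Probability of the positive class under a 1-D logistic model."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def uncertainty(probability):
    """Binary entropy of the prediction; largest when the probability is 0.5."""
    if probability <= 0.0 or probability >= 1.0:
        return 0.0
    return -(probability * math.log(probability)
             + (1.0 - probability) * math.log(1.0 - probability))

def select_query(weight, bias, unlabeled):
    """Return the unlabeled example whose prediction is most uncertain."""
    scores = [uncertainty(predict_proba(weight, bias, x)) for x in unlabeled]
    best_index = max(range(len(unlabeled)), key=lambda i: scores[i])
    return unlabeled[best_index]

# Hypothetical current model with its decision boundary at x = 3.
weight, bias = 1.0, -3.0
unlabeled = [0.0, 2.5, 6.0, 3.2]
query = select_query(weight, bias, unlabeled)
```

The selected point is the one closest to the current decision boundary, which is exactly where a new label is most likely to move that boundary.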
5.10.2 10.2 When to Use It
Active learning shines when unlabeled data are abundant but each label is expensive, for example when labeling requires a physician, a chemist, or a costly physical experiment. By concentrating the annotation budget on the most informative examples, it can reach a target accuracy with substantially fewer labels than random sampling. The caveats are that the selection process can introduce sampling bias, and that the informativeness heuristics depend on the current (possibly poor) model, so early queries may be misguided.
5.11 11. How Foundation Models Blur the Categories
The taxonomy above is clarifying, but the most capable modern systems deliberately violate its boundaries by chaining paradigms into a pipeline. A contemporary large language model is not the product of one paradigm but of three stacked stages, each contributing a different signal.
The first stage is self-supervised pretraining on a web-scale corpus using next-token prediction. This stage consumes no human labels and produces a base model with broad linguistic and factual competence, a representation that can be transferred to almost anything. This is self-supervised learning serving as the foundation for transfer learning.
The second stage is supervised fine-tuning, sometimes called instruction tuning. Here a comparatively small set of high-quality demonstrations, often written by humans, teaches the base model to follow instructions and respond in a useful format. This is ordinary supervised learning applied on top of the pretrained weights.
The third stage uses reinforcement learning from human feedback, in which human preferences between candidate responses train a reward model, and the language model is then optimized against that reward with a reinforcement learning algorithm. This stage aligns the model’s behavior with human values and intentions that are difficult to capture with demonstrations alone. The same slot is increasingly filled by reinforcement learning on verifiable rewards, where correctness on tasks like mathematics or code execution provides the reward signal directly.
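The text does not specify how preferences train the reward model. One widely used choice is a pairwise logistic (Bradley-Terry style) loss, which pushes the reward of the preferred response above the reward of the rejected one. The sketch below assumes that choice and operates on plain scalar rewards; the function name is illustrative.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(reward_chosen - reward_rejected)).

    Written as log(1 + exp(-margin)) in a form that avoids overflow
    for large positive or negative margins.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Equal rewards mean the reward model is indifferent: the loss is log(2).
indifferent = preference_loss(0.0, 0.0)

# Ranking the preferred response higher gives a small loss ...
agrees = preference_loss(2.0, 0.0)

# ... and ranking it lower gives a large one.
disagrees = preference_loss(0.0, 2.0)
```

Only the difference between the two rewards enters the loss, so the reward model learns a relative ranking of responses rather than an absolute score.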
The result is a single artifact whose competence comes from self-supervision, whose helpfulness comes from supervised demonstrations, and whose alignment comes from reinforcement learning. Multi-task learning appears throughout, since a single instruction-tuned model performs translation, summarization, coding, and question answering through one interface. Transfer learning is the connective tissue, because every stage builds on the weights of the previous one. The lesson is that the paradigms are not competitors but components: a serious system selects a signal for each stage of training according to what that stage is trying to instill.
5.12 12. Comparison Table
| Paradigm | Signal used | Typical use case | Key limitation |
|---|---|---|---|
| Supervised | Human target labels | Classification, regression with ground truth | Labels are costly and noisy |
| Unsupervised | Structure in inputs | Clustering, dimensionality reduction | Hard to evaluate correctness |
| Self-supervised | Withheld parts of the input | Pretraining language and vision models | Pretext task is not the end goal |
| Semi-supervised | Few labels plus many unlabeled | Speech, medical imaging at scale | Assumes shared distribution |
| Reinforcement | Scalar reward from environment | Control, games, sequential decisions | Sample inefficient, reward design hard |
| Transfer | Pretrained weights as prior | Adapting to small target datasets | Risk of negative transfer |
| Multi-task | Union of several task signals | Joint perception, unified models | Task balancing, conflicts |
| Online vs batch | Same signal, different arrival | Streaming versus stable data | Drift versus forgetting |
| Active | Human labels chosen by model | Expensive-label domains | Selection bias from weak model |
5.13 13. Choosing a Paradigm in Practice
Confronted with a new problem, the practitioner should reason from the available signal rather than from a favorite algorithm. Ask first whether ground-truth labels exist and at what cost. If they are abundant and cheap, supervised learning is the natural baseline. If they are expensive but raw data are plentiful, consider self-supervised pretraining, semi-supervised methods, or active learning to stretch the labeling budget. If the problem is one of sequential decisions evaluated by a long-horizon objective, reinforcement learning is indicated despite its overhead. If a powerful pretrained model already exists for a related domain, transfer learning will almost always beat training from scratch. Finally, ask how data arrive in time, since a streaming, drifting source pushes you toward online updates regardless of which signal you use.
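The questions in the paragraph above can be written down as a small checklist function. This is only a sketch of the reasoning, with hypothetical parameter names; real projects weigh these factors rather than treating them as hard switches.

```python
def suggest_paradigms(labels_cheap, unlabeled_plentiful, sequential_decisions,
                      pretrained_model_exists, streaming_or_drifting):
    """Map answers about the available signal to candidate paradigms."""
    suggestions = []
    if labels_cheap:
        suggestions.append("supervised learning")
    elif unlabeled_plentiful:
        # Expensive labels, cheap raw data: stretch the labeling budget.
        suggestions += ["self-supervised pretraining",
                        "semi-supervised learning",
                        "active learning"]
    if sequential_decisions:
        suggestions.append("reinforcement learning")
    if pretrained_model_exists:
        suggestions.append("transfer learning")
    if streaming_or_drifting:
        suggestions.append("online updates")
    return suggestions

# Hypothetical project: labels need an expert, raw data are abundant,
# and a pretrained model exists for a related domain.
plan = suggest_paradigms(
    labels_cheap=False,
    unlabeled_plentiful=True,
    sequential_decisions=False,
    pretrained_model_exists=True,
    streaming_or_drifting=False,
)
```

The function returns a list rather than a single answer, reflecting the point that paradigms are combined rather than chosen exclusively.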
The mature view is that these paradigms form a toolkit rather than a menu from which you pick exactly one item. The systems that define the current state of the art succeed precisely because they compose paradigms, matching each phase of training to the signal that phase can most efficiently exploit.