175 Learning Curves
Learning curves are among the most economical diagnostic tools in applied machine learning. By plotting error against either the amount of training data or the number of optimization iterations, a practitioner can read off whether a model suffers primarily from bias or from variance, whether collecting more data is likely to help, and whether the optimizer has converged. This chapter develops the theory behind these curves, shows how to construct them carefully, and turns the resulting shapes into concrete decisions about data, model capacity, and training budget.
175.1 1. Two Families of Learning Curves
The term “learning curve” refers to two related but distinct constructions, and conflating them is a common source of confusion.
The first family plots error as a function of training set size \(m\). We train the model on subsets of increasing size and, for each subset, record the training error and a validation error computed on a held out set. This curve answers the question of how performance scales with data and is the primary tool for deciding whether to collect more examples.
The second family plots error as a function of optimization iteration \(t\), where \(t\) might count gradient descent steps, epochs, or boosting rounds. This curve answers questions about optimization and convergence, and it is the natural place to detect overfitting that emerges late in training. Practitioners often call this second curve a training curve or a loss curve to distinguish it from the data scaling curve.
Both rest on the same underlying decomposition of generalization error, so we begin there.
175.2 2. The Bias Variance Decomposition
Let \(f(x)\) be the true regression function, let \(y\) be the observed target, equal to \(f(x)\) plus zero mean noise of variance \(\sigma^2\), and let \(\hat{f}_{D}(x)\) be the model learned from a training set \(D\). For squared loss, the expected error at a point \(x\), averaged over random draws of the training set and of the noise in \(y\), decomposes as
\[ \mathbb{E}_{D,y}\left[(\hat{f}_{D}(x) - y)^2\right] = \underbrace{\left(\mathbb{E}_{D}[\hat{f}_{D}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{D}\left[(\hat{f}_{D}(x) - \mathbb{E}_{D}[\hat{f}_{D}(x)])^2\right]}_{\text{variance}} + \sigma^2 . \]
The first term, bias, measures how far the average prediction sits from the truth. A model whose hypothesis class cannot represent \(f\) has high bias regardless of how much data it sees. The second term, variance, measures how much the prediction wobbles as the training set changes. A flexible model fit on limited data has high variance. The final term \(\sigma^2\) is the irreducible noise, a floor that no model can cross.
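As a rough illustration of the decomposition, the sketch below estimates squared bias and variance at a single point by redrawing the training set many times. Everything in it is an assumption made for the example: the true function \(\sin(x)\), the noise level, the sample size, and the two toy predictors (a constant model that predicts the mean target, and a one nearest neighbor model).

```python
import math
import random
import statistics


def f(x):
    # Assumed "true" regression function for this toy example.
    return math.sin(x)


def draw_training_set(n, sigma, rng):
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def predict_constant(xs, ys, x0):
    # Rigid model: ignores x0 and predicts the mean training target.
    return statistics.fmean(ys)


def predict_1nn(xs, ys, x0):
    # Flexible model: copies the target of the nearest training point.
    nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[nearest]


def bias_variance(predict, x0, n=20, sigma=0.3, trials=2000, seed=0):
    """Monte Carlo estimate of (bias^2, variance) of `predict` at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = draw_training_set(n, sigma, rng)
        preds.append(predict(xs, ys, x0))
    bias_sq = (statistics.fmean(preds) - f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


x0 = math.pi / 2
const_bias_sq, const_var = bias_variance(predict_constant, x0)
nn_bias_sq, nn_var = bias_variance(predict_1nn, x0)
print("constant model: bias^2 =", const_bias_sq, "variance =", const_var)
print("1-NN model:     bias^2 =", nn_bias_sq, "variance =", nn_var)
```

In this setup the constant model should show the larger squared bias and the nearest neighbor model the larger variance, which is the trade the decomposition describes.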
Learning curves are, in effect, a way to visualize where on the bias variance spectrum a given model and dataset sit. High bias and high variance leave distinct fingerprints, and the rest of this chapter is about reading them.
175.3 3. Curves Versus Training Set Size
175.3.1 3.1 How the Curves Move
Consider training error and validation error as functions of \(m\).
Training error tends to rise with \(m\). With very few examples the model can fit them almost perfectly, so training error starts near zero. As more examples arrive, the model can no longer memorize all of them, and training error climbs toward an asymptote.
Validation error tends to fall with \(m\). With few examples the model generalizes poorly, so validation error starts high. As data accumulates, the model captures more of the underlying structure and validation error decreases, also approaching an asymptote.
In the large sample limit both curves converge toward the same value, which for squared loss equals the squared bias of the model plus the irreducible noise \(\sigma^2\). The gap between them at finite \(m\) reflects variance.
175.3.2 3.2 The High Bias Signature
When a model has high bias, both curves flatten quickly and converge to a high error value. The training error itself is large, because the model cannot fit even the data it has seen. The two curves sit close together, separated by only a small gap, and that small gap is the telltale sign that variance is not the problem.
The practical consequence is blunt. If the curves have already converged at a disappointing error level, adding more data will not help. More examples push you along a curve that has already plateaued. The remedy lies elsewhere, in a more expressive model or richer features, topics taken up in Section 5.
175.3.3 3.3 The High Variance Signature
When a model has high variance, the training error is low, often much lower than the validation error, and a wide gap persists between the two curves even at the largest training sizes you have tried. Crucially, the validation curve is still descending and has not yet flattened.
This shape carries good news for data collection. Because the validation curve is still falling and the gap is closing, more training data is likely to reduce generalization error. The two curves are on a trajectory to meet, and additional examples move you toward that meeting point.
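The two signatures can be summarized by a pair of numbers: the gap at the largest size and the recent slope of the validation curve. The sketch below computes both from a measured size curve. The function name, the choice of the last three points, and the example numbers are illustrative assumptions, not measurements from a real model.

```python
import math


def size_curve_signature(sizes, train_errs, val_errs):
    """Return (gap, val_slope) for a measured size curve.

    gap:       validation error minus training error at the largest size.
    val_slope: least squares slope of validation error against log10(m)
               over the last three sizes (negative means still falling).
    """
    gap = val_errs[-1] - train_errs[-1]
    xs = [math.log10(m) for m in sizes[-3:]]
    ys = list(val_errs[-3:])
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return gap, sxy / sxx


sizes = [100, 200, 400, 800, 1600]

# Made-up curve with the high variance shape: low training error, wide gap,
# validation error still dropping at the largest sizes.
hv_gap, hv_slope = size_curve_signature(
    sizes, [0.01, 0.02, 0.03, 0.04, 0.05], [0.40, 0.33, 0.27, 0.22, 0.18]
)

# Made-up curve with the high bias shape: both errors high, close, and flat.
hb_gap, hb_slope = size_curve_signature(
    sizes, [0.27, 0.29, 0.30, 0.30, 0.30], [0.36, 0.33, 0.32, 0.31, 0.31]
)

print("high variance example: gap =", hv_gap, "slope =", hv_slope)
print("high bias example:     gap =", hb_gap, "slope =", hb_slope)
```

A large gap with a clearly negative slope points toward collecting data, while a small gap with a slope near zero points away from it. Where to draw the thresholds depends on the noise in the measured curve.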
175.3.4 3.4 A Worked Mental Model
A useful sanity check is to imagine the asymptote both curves are heading toward. If you can extrapolate the validation curve and it appears to level off above your target error, then more data alone will not get you there even in principle, and you are bias limited. If the extrapolated asymptote sits below your target, more data is a viable path. This extrapolation is informal, but it disciplines the decision and prevents the common mistake of collecting data for a model that has already saturated.
175.4 4. Curves Versus Iterations
175.4.1 4.1 Reading the Optimization Curve
Now hold the data fixed and plot training and validation loss against iteration \(t\). Early in training both losses fall together as the optimizer reduces a large initial error. This regime reflects optimization progress and tells you whether the learning rate and other hyperparameters allow the model to fit at all.
As training proceeds, three patterns can emerge. If both losses plateau at a high value, the model is underfitting and the run is bias limited. If both losses keep falling and have not flattened, training has simply not finished, and the budget should be extended. If the training loss continues to fall while the validation loss bottoms out and then begins to rise, the model is overfitting, and the rising validation loss marks the onset.
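The three patterns can be told apart mechanically from recorded loss values. The following is a minimal sketch, assuming positive losses recorded at regular evaluations. The function name and the `window` and `tol` thresholds are hypothetical choices for illustration, not recommended defaults.

```python
def iteration_pattern(train_loss, val_loss, window=5, tol=0.01):
    """Label an iteration curve as 'overfitting', 'unfinished', or 'plateau'."""
    if len(train_loss) != len(val_loss) or len(val_loss) <= window:
        raise ValueError("need equal-length curves longer than the window")

    def recent_drop(losses):
        # Relative decrease over the last `window` evaluations.
        start = losses[-window - 1]
        return (start - losses[-1]) / max(abs(start), 1e-12)

    # Validation loss has climbed back above its best value by more than tol.
    val_rising = val_loss[-1] > min(val_loss) * (1 + tol)
    train_falling = recent_drop(train_loss) > tol
    val_falling = recent_drop(val_loss) > tol

    if val_rising and train_falling:
        return "overfitting"
    if train_falling and val_falling:
        return "unfinished"
    return "plateau"


overfit = iteration_pattern(
    [1.0, 0.8, 0.6, 0.45, 0.35, 0.27, 0.2, 0.15, 0.1, 0.07],
    [1.0, 0.85, 0.7, 0.62, 0.58, 0.57, 0.59, 0.63, 0.68, 0.74],
)
unfinished = iteration_pattern(
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3],
    [1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
)
plateau = iteration_pattern(
    [1.0, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6],
    [1.0, 0.75, 0.65, 0.65, 0.65, 0.65, 0.65, 0.65],
)
print(overfit, unfinished, plateau)
```

A "plateau" label says only that both losses have stopped moving. Whether that plateau counts as underfitting depends on how its height compares with the error you need.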
175.4.2 4.2 Early Stopping
The iteration at which validation loss reaches its minimum is the early stopping point. Training past it trades a still falling training loss for a worsening validation loss, the very definition of overfitting in the iteration domain. Early stopping is a regularizer in its own right, and it is often the cheapest one available, because it requires no change to the model and no new data. In practice one monitors validation loss with a patience window, stopping when no improvement appears for a fixed number of evaluations and restoring the best checkpoint.
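best, wait, patience = float("inf"), 0, 10
for t in range(max_iters):
    train_one_step()
    v = validate()                # validation loss at this evaluation
    if v < best:
        best, wait = v, 0
        save_checkpoint()         # keep the best weights seen so far
    else:
        wait += 1
        if wait >= patience:      # no improvement for `patience` evaluations
            break
restore_best_checkpoint()         # also applies when max_iters is reached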
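The patience rule can be exercised offline on a recorded sequence of validation losses, which is a convenient way to check how a given patience value would have behaved. The sketch below is self-contained. The function name and the loss values are made up for illustration.

```python
def early_stopping_index(val_losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses.

    best_index is the evaluation with the lowest loss seen before stopping;
    stop_index is the evaluation at which the patience window ran out, or
    the last evaluation if it never did.
    """
    best = float("inf")
    best_index = -1
    wait = 0
    for t, v in enumerate(val_losses):
        if v < best:
            best, best_index, wait = v, t, 0
        else:
            wait += 1
            if wait >= patience:
                return best_index, t
    return best_index, len(val_losses) - 1


val = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
best_index, stop_index = early_stopping_index(val, patience=3)
print(best_index, stop_index)  # prints: 3 6
```

The difference between the two indices is the compute spent confirming that the minimum was real. A larger patience tolerates noisier validation curves at the price of more wasted steps.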
175.4.3 4.3 Distinguishing the Two Curve Families
It bears repeating that an iteration curve and a training size curve answer different questions. A model can look perfectly converged on the iteration plot, with a flat training loss, yet still be data starved on the size plot, with a wide but still closing gap between training and validation error. Diagnosing a system usually requires both views. The iteration curve confirms that optimization is healthy, and the size curve tells you whether the bottleneck is data or capacity.
175.5 5. From Diagnosis to Decision
The value of learning curves lies in the actions they recommend. The following decision rules summarize the analysis.
175.5.1 5.1 If the Diagnosis Is High Bias
When training error is high and the curves have converged, the model is too simple for the task. Productive moves include adding features or higher order interaction terms, increasing model capacity such as depth or width, decreasing regularization strength, and training longer if the iteration curve has not yet flattened. Collecting more data is not productive here, and recognizing that saves time and budget.
175.5.2 5.2 If the Diagnosis Is High Variance
When training error is low but a large validation gap persists and the validation curve is still falling, the model is overfitting the available data. Productive moves include gathering more training examples, adding regularization such as \(L_2\) penalties or dropout, reducing model capacity, applying data augmentation, and using early stopping on the iteration curve. Here additional data is among the most reliable remedies.
175.5.3 5.3 Quantifying the Value of More Data
When the validation curve is still descending, it is often worth fitting a simple parametric model to forecast the payoff of more data. Empirically, generalization error frequently follows a power law in the number of training examples,
\[ \varepsilon(m) \approx \varepsilon_{\infty} + a\, m^{-\alpha}, \]
where \(\varepsilon_{\infty}\) is the asymptotic error that this model family approaches with unlimited data (its bias plus the irreducible noise), \(a\) sets the scale, and \(\alpha > 0\) controls how quickly error decays. Estimating \(\alpha\) from the measured curve lets you project the error at a hypothetical \(10m\) or \(100m\) and weigh that gain against the cost of acquisition. Each tenfold increase in data shrinks the excess error \(\varepsilon(m) - \varepsilon_{\infty}\) by a factor of \(10^{-\alpha}\), so a large \(\alpha\) means data is cheaply effective, while a small \(\alpha\) warns that even an order of magnitude more data buys little. These scaling relationships underpin modern empirical studies of how model and dataset size jointly govern performance [4].
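One simple way to estimate the three parameters is sketched below, assuming positive error values measured at several sizes. It uses a crude grid search over \(\varepsilon_{\infty}\) with a straight line fit in log space for each candidate. The function names and the synthetic data are assumptions made for the example, and a nonlinear least squares routine such as `scipy.optimize.curve_fit` is a common alternative.

```python
import math


def fit_power_law(sizes, errors, grid=200):
    """Fit errors ~ eps_inf + a * m**(-alpha).

    For each candidate eps_inf in [0, min(errors)), regress
    log(error - eps_inf) on log(m) and keep the candidate with the
    smallest squared residual in log space.
    """
    log_m = [math.log(m) for m in sizes]
    x_bar = sum(log_m) / len(log_m)
    sxx = sum((x - x_bar) ** 2 for x in log_m)
    floor = min(errors)
    best = None
    for k in range(grid):
        eps_inf = floor * k / grid
        log_e = [math.log(e - eps_inf) for e in errors]
        y_bar = sum(log_e) / len(log_e)
        slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_m, log_e)) / sxx
        intercept = y_bar - slope * x_bar
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(log_m, log_e))
        if best is None or sse < best[0]:
            best = (sse, eps_inf, math.exp(intercept), -slope)
    _, eps_inf, a, alpha = best
    return eps_inf, a, alpha


def predict_error(m, eps_inf, a, alpha):
    return eps_inf + a * m ** (-alpha)


# Synthetic, noise free curve generated from eps_inf=0.1, a=2.0, alpha=0.5.
sizes = [100, 400, 1600, 6400]
errors = [0.1 + 2.0 * m ** -0.5 for m in sizes]

eps_inf, a, alpha = fit_power_law(sizes, errors)
projected = predict_error(10 * sizes[-1], eps_inf, a, alpha)
print("eps_inf =", eps_inf, "a =", a, "alpha =", alpha)
print("projected error at 10x the largest size:", projected)
```

For instance, \(\alpha = 0.5\) cuts the excess error to roughly 32% with ten times the data, while \(\alpha = 0.1\) leaves about 79% of it. With noisy measurements and only a handful of sizes the fitted values can be unstable, so treat projections as rough guides.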
175.6 6. Practical Construction
175.6.1 6.1 Building a Size Curve
To build a curve versus \(m\), fix a held out validation set, then for each of several training sizes draw a random subset, fit the model, and record both errors. Because a single subset is noisy, average over several random draws at each size and plot the mean with a band for variability. The training error should be measured on the same subset used to fit, not on the full data, otherwise the curve loses its meaning.
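for m in sizes:
    errs = []
    for seed in repeats:
        sub = sample(train, m, seed)          # random subset of size m
        model = fit(sub)
        # training error on the fitted subset, validation error on the fixed set
        errs.append((error(model, sub), error(model, val)))
    train_errs, val_errs = zip(*errs)         # split the pairs into two series
    record(m, mean(train_errs), mean(val_errs))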
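A runnable version of this loop, under toy assumptions, might look like the following. The data generator, the straight line model, and all function names are hypothetical stand-ins chosen so the example needs only the standard library. In practice `fit_line` and `mse` would be replaced by your own model and metric.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least squares for y = w * x + b on a list of (x, y) pairs."""
    xs, ys = zip(*points)
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return w, y_bar - w * x_bar


def mse(model, points):
    w, b = model
    return statistics.fmean([(w * x + b - y) ** 2 for x, y in points])


def size_curve(train, val, sizes, repeats=30, seed=0):
    """Return rows of (m, mean_train, sd_train, mean_val, sd_val)."""
    rng = random.Random(seed)
    rows = []
    for m in sizes:
        train_errs, val_errs = [], []
        for _ in range(repeats):
            sub = rng.sample(train, m)           # fresh random subset each repeat
            model = fit_line(sub)
            train_errs.append(mse(model, sub))   # error on the fitted subset
            val_errs.append(mse(model, val))     # error on the fixed validation set
        rows.append((
            m,
            statistics.fmean(train_errs), statistics.stdev(train_errs),
            statistics.fmean(val_errs), statistics.stdev(val_errs),
        ))
    return rows


def make_data(n, rng, noise=0.5):
    # Assumed ground truth: y = 2x + 1 plus Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, noise)) for x in xs]


data_rng = random.Random(1)
train = make_data(1000, data_rng)
val = make_data(500, data_rng)

rows = size_curve(train, val, sizes=[5, 20, 80, 320])
for m, tr, tr_sd, va, va_sd in rows:
    print(f"m={m:4d}  train={tr:.3f} (sd {tr_sd:.3f})  val={va:.3f} (sd {va_sd:.3f})")
```

The standard deviations give the band for variability. Plotting the mean plus and minus one standard deviation at each size is one common choice.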
175.6.2 6.2 Common Pitfalls
Several mistakes recur often enough to warrant a checklist.
- Measuring training error on the wrong set, such as the full training pool rather than the fitted subset, destroys the interpretation of the gap.
- The validation set must remain fixed across all sizes, since a moving target confounds the comparison.
- Hyperparameters should be held constant or tuned at each size with care, because a model that is well regularized at small \(m\) may be poorly regularized at large \(m\).
- Class imbalance can make accuracy a misleading vertical axis, and a metric such as the F1 score or the area under the ROC curve is often more honest.
- A single random subset per size produces jagged, unreliable curves, so repetition and averaging are essential.
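To make the class imbalance pitfall concrete, the short sketch below scores a degenerate classifier that always predicts the majority class. The labels are made up, and the convention of returning 0 for F1 when there are no true positives is a choice made for this example.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # convention used here when no positives are recovered
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 5 positives among 100 examples; the "model" predicts negative every time.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
score = f1(y_true, y_pred)
print(acc, score)  # prints: 0.95 0.0
```

A learning curve drawn with accuracy on the vertical axis would look excellent for this classifier at every training size, which is why the choice of metric matters before any shape is interpreted.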
175.6.3 6.3 Reading Real Curves
Real curves are rarely as clean as the textbook shapes. Noise, distribution shift between splits, and optimization instability all introduce wobble. The discipline is to look past the wobble for the two essential quantities, namely the height at which the curves are settling and the size of the gap between them. Those two numbers, the asymptotic error and the variance gap, drive nearly every decision described above.
175.7 7. A Compact Decision Table
The chapter can be distilled into a short reference.

| Observation | Diagnosis | Action |
| --- | --- | --- |
| Training error high, gap small | Bias limited | Increase capacity; do not collect data |
| Training error low, gap large but closing | Variance limited | Collect data or regularize |
| Both errors low and close | Well fit | Spend effort elsewhere |
| Both errors high, iteration curve still falling | Training unfinished | Extend the budget before drawing any conclusion |
This table is the payoff of the whole exercise. Two cheap plots, read for height and gap, convert a vague sense that “the model is not good enough” into a specific and defensible next step.
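The same rules can be written as a small function, which is sometimes handy in experiment reports. This is only a sketch: the function name, the `target_err` notion of "high", and the `gap_tol` threshold are assumptions introduced here, and whether a large gap is actually closing still has to be judged from the size curve itself.

```python
def next_step(train_err, val_err, target_err, gap_tol, still_training=False):
    """Map the two headline numbers of a learning curve to a suggested action.

    train_err, val_err: errors at the largest training size.
    target_err:         error level below which training error counts as low.
    gap_tol:            gap above which the curves count as far apart.
    still_training:     True if the iteration curve has not yet flattened.
    """
    high_train = train_err > target_err
    large_gap = (val_err - train_err) > gap_tol

    if high_train and still_training:
        return "unfinished: extend the training budget before concluding anything"
    if high_train and not large_gap:
        return "bias limited: increase capacity, more data will not help"
    if not high_train and large_gap:
        return "variance limited: collect more data or regularize"
    if not high_train and not large_gap:
        return "well fit: spend effort elsewhere"
    # High training error with a large gap is not one of the four listed cases.
    return "not covered by the reference: inspect both curves directly"


print(next_step(0.30, 0.32, target_err=0.10, gap_tol=0.05))
print(next_step(0.02, 0.25, target_err=0.10, gap_tol=0.05))
print(next_step(0.04, 0.06, target_err=0.10, gap_tol=0.05))
print(next_step(0.30, 0.35, target_err=0.10, gap_tol=0.05, still_training=True))
```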
175.8 8. Summary
Learning curves turn abstract questions about bias, variance, and data sufficiency into pictures that can be read at a glance. The size curve reveals whether more data will help by showing whether the validation error is still falling and whether a gap remains. The iteration curve reveals whether optimization is healthy and where overfitting begins, enabling early stopping. Read together, and interpreted through the bias variance decomposition, they tell a practitioner which lever to pull next, whether that lever is more data, more capacity, more regularization, or simply more patience. The cost of producing them is small, and the cost of skipping them, in wasted data collection and misdirected modeling effort, is often large.
175.9 References
1. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
2. Ng, A. Machine Learning Yearning. https://www.deeplearning.ai/machine-learning-yearning/
3. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
4. Hestness, J., Narang, S., Ardalani, N., et al. (2017). Deep Learning Scaling is Predictable, Empirically. https://arxiv.org/abs/1712.00409
5. Perlich, C., Provost, F., and Simonoff, J. (2003). Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research. https://www.jmlr.org/papers/v4/perlich03a.html
6. scikit-learn developers. Plotting Learning Curves. https://scikit-learn.org/stable/modules/learning_curve.html