175 Learning Curves

Learning curves are among the most economical diagnostic tools in applied machine learning. By plotting error against either the amount of training data or the number of optimization iterations, a practitioner can read off whether a model suffers primarily from bias or from variance, whether collecting more data is likely to help, and whether the optimizer has converged. This chapter develops the theory behind these curves, shows how to construct them carefully, and turns the resulting shapes into concrete decisions about data, model capacity, and training budget.

175.1 1. Two Families of Learning Curves

The term “learning curve” refers to two related but distinct constructions, and conflating them is a common source of confusion.

The first family plots error as a function of training set size $m$. We train the model on subsets of increasing size and, for each subset, record the training error and a validation error computed on a held out set. This curve answers the question of how performance scales with data and is the primary tool for deciding whether to collect more examples.

The second family plots error as a function of optimization iteration $t$, where $t$ might count gradient descent steps, epochs, or boosting rounds. This curve answers questions about optimization and convergence, and it is the natural place to detect overfitting that emerges late in training. Practitioners often call this second curve a training curve or a loss curve to distinguish it from the data scaling curve.

Definitions used throughout

Let a model be trained on a set $D$ of size $m = |D|$ drawn i.i.d. from a distribution $\mathcal{P}$ over pairs $(x, y)$. Fix a loss $\ell$.

The training error (empirical risk) is the average loss on the very examples used to fit the model, $\hat{R}_D(\hat{f}_D) = \frac{1}{m} \sum_{(x,y) \in D} \ell(\hat{f}_D(x), y)$.
The generalization error (true risk) is the expected loss on a fresh draw, $R(\hat{f}_D) = \mathbb{E}_{(x,y) \sim \mathcal{P}}[\ell(\hat{f}_D(x), y)]$. In practice it is estimated by the validation error on a fixed held out set disjoint from $D$.
The generalization gap is $R(\hat{f}_D) - \hat{R}_D(\hat{f}_D)$, the quantity that the visible distance between the two curves estimates.

Both families of curves rest on the same underlying decomposition of generalization error, so we begin there.

175.2 2. The Bias Variance Decomposition

Let $f(x) = \mathbb{E}[y \mid x]$ be the true regression function, let $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and let $\hat{f}_{D}(x)$ be the model learned from a training set $D$. For squared loss, the expected error at a point $x$, averaged over random draws of the training set and over the label noise, decomposes as

\[ \mathbb{E}_{D, \varepsilon}\left[(\hat{f}_{D}(x) - y)^2\right] = \underbrace{\left(\mathbb{E}_{D}[\hat{f}_{D}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{D}\left[(\hat{f}_{D}(x) - \mathbb{E}_{D}[\hat{f}_{D}(x)])^2\right]}_{\text{variance}} + \sigma^2 . \]

The first term, bias, measures how far the average prediction sits from the truth. A model whose hypothesis class cannot represent $f$ has high bias regardless of how much data it sees. The second term, variance, measures how much the prediction wobbles as the training set changes. A flexible model fit on limited data has high variance. The final term $\sigma^2$ is the irreducible noise, a floor that no model can cross.

This decomposition is exact for squared loss. The derivation uses only that the cross terms vanish: the term $\mathbb{E}_D[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])(\mathbb{E}_D[\hat{f}_D(x)] - f(x))]$ is zero because the second factor is a constant and the first has mean zero, and every cross term involving $\varepsilon$ is zero because the noise at the test point has mean zero and is independent of $D$. For other losses such as the 0/1 loss or cross entropy there is no single canonical decomposition, but the qualitative split into a systematic component (bias) and a sensitivity-to-sampling component (variance) still organizes the diagnostics that follow [3].
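The decomposition can be checked numerically. The sketch below is an illustration under assumed settings, not part of the chapter's derivation: the true function $f(x) = x^2$, the straight-line model, the noise level, and the names `fit_line` and `bias_variance_at` are all hypothetical choices. It draws many training sets, records the prediction at one point $x_0$, and compares the sum of the three terms with the directly measured squared error.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def bias_variance_at(x0, m=30, sigma=0.3, trials=4000, seed=0):
    """Monte Carlo estimate of bias^2, variance, noise, and total error at x0."""
    rng = random.Random(seed)
    f = lambda x: x * x  # assumed true regression function (illustrative)
    preds, sq_errs = [], []
    for _ in range(trials):
        # Draw a fresh training set D of size m and fit a straight line to it.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        pred = a + b * x0
        y0 = f(x0) + rng.gauss(0.0, sigma)  # fresh noisy label at x0
        preds.append(pred)
        sq_errs.append((pred - y0) ** 2)
    mean_pred = statistics.fmean(preds)
    bias_sq = (mean_pred - f(x0)) ** 2
    variance = statistics.fmean((p - mean_pred) ** 2 for p in preds)
    return bias_sq, variance, sigma ** 2, statistics.fmean(sq_errs)


bias_sq, variance, noise, total = bias_variance_at(0.8)
print(f"bias^2={bias_sq:.4f} variance={variance:.4f} noise={noise:.4f}")
print(f"sum of terms={bias_sq + variance + noise:.4f} measured error={total:.4f}")
```

Because a straight line cannot represent a parabola, the bias term dominates in this setup. The two printed totals agree only up to Monte Carlo error, since the noise and cross terms are estimated from a finite number of trials.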

Two facts make the decomposition the right lens for learning curves. First, bias and variance respond differently to $m$: increasing the sample size leaves bias essentially fixed (it is a property of the hypothesis class relative to $f$) while it shrinks variance, typically at rate $O(1/m)$ for well behaved estimators. Second, the training error and the validation error approach their common limit from opposite sides: training error is optimistically low at small $m$ because the model was fit to those very points, while validation error is inflated by variance, which is exactly why the two curves move toward each other as data accumulates.

Learning curves are, in effect, a way to visualize where on the bias variance spectrum a given model and dataset sit. High bias and high variance leave distinct fingerprints, and the rest of this chapter is about reading them.

175.3 3. Curves Versus Training Set Size

175.3.1 3.1 How the Curves Move

Consider training error and validation error as functions of $m$.

Training error tends to rise with $m$. With very few examples the model can fit them almost perfectly, so training error starts near zero. As more examples arrive, the model can no longer memorize all of them, and training error climbs toward an asymptote.

Validation error tends to fall with $m$. With few examples the model generalizes poorly, so validation error starts high. As data accumulates, the model captures more of the underlying structure and validation error decreases, also approaching an asymptote.

In the large sample limit both curves converge toward the same value, which for squared loss equals the squared bias of the model plus the irreducible noise, namely $\text{bias}^2 + \sigma^2$. The gap between them at finite $m$ reflects variance. The two statements together explain the geometry: validation error descends from above and training error rises from below, both squeezing toward the common floor $\text{bias}^2 + \sigma^2$ as variance is driven out by data.
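A small simulation makes the squeeze visible. The sketch below assumes a well specified setting, a straight-line model fit to data that really is linear with noise variance $\sigma^2 = 0.25$, so the bias is zero and the common floor is $\sigma^2$ itself. The data generator, the sizes, and the function names are illustrative assumptions.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(a, b, xs, ys):
    """Mean squared error of the line a + b * x on the given points."""
    return statistics.fmean((a + b * x - y) ** 2 for x, y in zip(xs, ys))


def simulate_size_curve(sizes, sigma=0.5, repeats=200, n_val=2000, seed=0):
    """Return (m, mean training error, mean validation error) for each size."""
    rng = random.Random(seed)

    def draw(n):
        # Assumed data generator: y = 1 + 2x + Gaussian noise (illustrative).
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        ys = [1.0 + 2.0 * x + rng.gauss(0.0, sigma) for x in xs]
        return xs, ys

    val_x, val_y = draw(n_val)  # one fixed held out set for every size
    rows = []
    for m in sizes:
        train_errs, val_errs = [], []
        for _ in range(repeats):
            xs, ys = draw(m)
            a, b = fit_line(xs, ys)
            train_errs.append(mse(a, b, xs, ys))       # on the fitted subset
            val_errs.append(mse(a, b, val_x, val_y))   # on the held out set
        rows.append((m, statistics.fmean(train_errs), statistics.fmean(val_errs)))
    return rows


for m, train_err, val_err in simulate_size_curve([5, 20, 80, 320]):
    print(f"m={m:>4}  train={train_err:.3f}  val={val_err:.3f}")
```

With two fitted parameters, the expected training error of least squares in this setting is $\sigma^2 (1 - 2/m)$, so the training column should climb toward 0.25 from below while the validation column descends toward it from above.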

175.3.2 3.2 The High Bias Signature

When a model has high bias, both curves flatten quickly and converge to a high error value. The training error itself is large, because the model cannot fit even the data it has seen. The two curves sit close together, separated by only a small gap, and that small gap is the telltale sign that variance is not the problem.

The practical consequence is blunt. If the curves have already converged at a disappointing error level, adding more data will not help. More examples push you along a curve that has already plateaued. The remedy lies elsewhere, in a more expressive model or richer features, topics taken up in Section 5.

175.3.3 3.3 The High Variance Signature

When a model has high variance, the training error is low, often much lower than the validation error, and a wide gap persists between the two curves even at the largest training sizes you have tried. Crucially, the validation curve is still descending. It has not yet flattened.

This shape carries good news for data collection. Because the validation curve is still falling and the gap is closing, more training data is likely to reduce generalization error. The two curves are on a trajectory to meet, and additional examples move you toward that meeting point.

175.3.4 3.4 A Worked Mental Model

A useful sanity check is to imagine the asymptote both curves are heading toward. If you can extrapolate the validation curve and it appears to level off above your target error, then more data alone will not get you there even in principle, and you are bias limited. If the extrapolated asymptote sits below your target, more data is a viable path. This extrapolation is informal, but it disciplines the decision and prevents the common mistake of collecting data for a model that has already saturated.

175.3.5 3.5 A Concrete Worked Example

Numbers make the two signatures vivid. Suppose your target validation error is 0.10 and you observe the following size curve (errors are illustrative).

$m$	Training error	Validation error	Gap
500	0.02	0.34	0.32
1000	0.04	0.27	0.23
2000	0.06	0.21	0.15
4000	0.07	0.17	0.10
8000	0.08	0.14	0.06

Three readings follow directly. The training error is low and rising slowly, the gap is large but shrinking with every doubling, and the validation error is still falling. This is the high variance signature, so more data is the right lever. To quantify how much more, fit the power law of Section 5.3. Using the last two points, $\varepsilon(4000) = 0.17$ and $\varepsilon(8000) = 0.14$, and guessing an asymptote $\varepsilon_\infty \approx 0.09$, slightly above the last training error of 0.08, which is still creeping upward, the implied decay exponent solves $(0.17 - 0.09) / (0.14 - 0.09) = 2^{\alpha}$, giving $\alpha = \log_2(0.08 / 0.05) \approx 0.68$. Projecting to $m = 32{,}000$ (two further doublings) gives roughly $0.09 + 0.05 \cdot 2^{-2 \times 0.68} \approx 0.11$, just above target, and to $m = 64{,}000$ (three doublings) roughly $0.09 + 0.05 \cdot 2^{-3 \times 0.68} \approx 0.10$, or 0.102 before rounding. The forecast turns “collect more data” into a budgeted plan: expect to need on the order of eight to eleven times the current data to reach the goal, and decide whether that acquisition cost is justified. The range is wide because the target sits only 0.01 above the guessed asymptote, so the answer is sensitive to that guess.
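The arithmetic of this forecast is easy to script. The sketch below reproduces the two-point estimate from the table; the helper names `two_point_exponent`, `project`, and `data_multiple_for` are hypothetical, and the asymptote of 0.09 is the same guess used in the text rather than a fitted value.

```python
import math


def two_point_exponent(err_small, err_large, asymptote, size_ratio=2.0):
    """Decay exponent implied by two errors measured size_ratio apart in m."""
    return math.log((err_small - asymptote) / (err_large - asymptote)) / math.log(size_ratio)


def project(err_last, m_last, m_new, asymptote, alpha):
    """Extrapolate the power law from the last measured point to m_new."""
    return asymptote + (err_last - asymptote) * (m_new / m_last) ** (-alpha)


def data_multiple_for(target, err_last, asymptote, alpha):
    """Factor by which m must grow for the projected error to hit target."""
    return ((err_last - asymptote) / (target - asymptote)) ** (1.0 / alpha)


asymptote = 0.09  # guessed, as in the text
alpha = two_point_exponent(0.17, 0.14, asymptote)
err_32k = project(0.14, 8000, 32000, asymptote, alpha)
err_64k = project(0.14, 8000, 64000, asymptote, alpha)
multiple = data_multiple_for(0.10, 0.14, asymptote, alpha)

print(f"alpha = {alpha:.3f}")
print(f"projected error at 32,000 = {err_32k:.4f}")
print(f"projected error at 64,000 = {err_64k:.4f}")
print(f"data multiple to reach 0.10 = {multiple:.1f}")
```

Rerunning the script with the asymptote set to 0.08 or 0.095 is a quick way to see how strongly the required data multiple depends on that single guess.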

Contrast a high bias table where training error reads 0.18, 0.19, 0.19, 0.20, 0.20 across the same sizes while validation sits at 0.24, 0.23, 0.22, 0.22, 0.21. Here both curves have nearly met near 0.20, far above the 0.10 target, and no amount of data lowers that floor. The lever must be capacity or features, not acquisition.

175.4 4. Curves Versus Iterations

175.4.1 4.1 Reading the Optimization Curve

Now hold the data fixed and plot training and validation loss against iteration $t$. Early in training both losses fall together as the optimizer reduces a large initial error. This regime reflects optimization progress and tells you whether the learning rate and other hyperparameters allow the model to fit at all.

As training proceeds, three patterns can emerge. If both losses plateau at a high value, the model is underfitting and the run is bias limited. If both losses keep falling and have not flattened, training has simply not finished, and the budget should be extended. If the training loss continues to fall while the validation loss bottoms out and then begins to rise, the model is overfitting, and the rising validation loss marks the onset.

175.4.2 4.2 Early Stopping

The point where validation loss reaches its minimum defines the early stopping point. Training past this point trades a still falling training loss for a worsening validation loss, the very definition of overfitting in the iteration domain. Early stopping is a regularizer in its own right, and it is often the cheapest one available, because it requires no change to the model and no new data. For squared loss fit by gradient descent on a linear model, early stopping is closely connected to $L_2$ regularization: stopping after a finite number of steps restricts the effective parameter norm, and the product of the learning rate and the number of steps plays a role analogous to the inverse of the ridge penalty [3]. In practice one monitors validation loss with a patience window, stopping when no improvement appears for a fixed number of evaluations and restoring the best checkpoint.

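# Placeholders: train_one_step, validate, save_checkpoint, and
# restore_best_checkpoint stand in for the project's own routines.
best = float("inf")  # lowest validation loss seen so far
wait = 0             # evaluations since the last improvement
patience = 10
for t in range(max_iters):
    train_one_step()
    v = validate()
    if v < best:
        best = v
        wait = 0
        save_checkpoint()
    else:
        wait += 1
        if wait >= patience:
            break
# Restore after the loop so runs that exhaust max_iters also end on the best model.
restore_best_checkpoint()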

A subtlety worth flagging: the validation loss used for early stopping is itself an estimate, and selecting the iteration that minimizes it introduces a small optimistic bias into that minimum. For an unbiased read of final performance, report the stopped model’s error on a separate test set, not on the validation set that drove the stopping decision.
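The optimistic bias is easy to demonstrate with a deliberately extreme simulation. In the sketch below every checkpoint has the same true loss, so any apparent winner on the validation set is pure noise. The true loss, the noise level, and the function name `selection_bias_demo` are illustrative assumptions.

```python
import random
import statistics


def selection_bias_demo(true_loss=0.5, noise_sd=0.02, checkpoints=50,
                        trials=2000, seed=0):
    """Compare the winning validation loss with an independent test estimate."""
    rng = random.Random(seed)
    picked_val, picked_test = [], []
    for _ in range(trials):
        # All checkpoints share one true loss; estimates differ only by noise.
        val = [true_loss + rng.gauss(0.0, noise_sd) for _ in range(checkpoints)]
        best = min(range(checkpoints), key=val.__getitem__)
        picked_val.append(val[best])
        # Independent test estimate for the checkpoint that was selected.
        picked_test.append(true_loss + rng.gauss(0.0, noise_sd))
    return statistics.fmean(picked_val), statistics.fmean(picked_test)


val_mean, test_mean = selection_bias_demo()
print(f"mean validation loss at the selected checkpoint: {val_mean:.3f}")
print(f"mean test loss of the same checkpoint:          {test_mean:.3f}")
```

The selected validation loss averages well below the true value of 0.5, while the test estimate stays centered on it. In a real run the checkpoints do differ in quality, so the bias is smaller, but it grows with the number of evaluations and with the noise of the validation estimate.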

175.4.3 4.3 Distinguishing the Two Curve Families

It bears repeating that an iteration curve and a training size curve answer different questions. A model can look perfectly converged on the iteration plot, with a flat training loss, yet still be data starved on the size plot, with a wide and closing validation gap. Diagnosing a system usually requires both views. The iteration curve confirms that optimization is healthy, and the size curve tells you whether the bottleneck is data or capacity.

The following diagram summarizes how the two families route to a decision.

flowchart TD
    A["Symptom: model not good enough"] --> B["Plot iteration curve"]
    B --> C{"Both losses still falling?"}
    C -->|"yes"| D["Train longer, extend budget"]
    C -->|"no"| E{"Validation loss rising late?"}
    E -->|"yes"| F["Overfitting: early stop, regularize"]
    E -->|"no"| G["Plot size curve"]
    G --> H{"Training error high, small gap?"}
    H -->|"yes"| I["High bias: add capacity or features"]
    H -->|"no"| J["High variance: collect data or regularize"]

175.5 5. From Diagnosis to Decision

The value of learning curves lies in the actions they recommend. The following decision rules summarize the analysis.

175.5.1 5.1 If the Diagnosis Is High Bias

When training error is high and the curves have converged, the model is too simple for the task. Productive moves include adding features or higher order interaction terms, increasing model capacity such as depth or width, decreasing regularization strength, and training longer if the iteration curve has not yet flattened. Collecting more data is not productive here, and recognizing that saves time and budget.

175.5.2 5.2 If the Diagnosis Is High Variance

When training error is low but a large validation gap persists and the validation curve is still falling, the model is overfitting the available data. Productive moves include gathering more training examples, adding regularization such as $L_2$ penalties or dropout, reducing model capacity, applying data augmentation, and using early stopping on the iteration curve. Here additional data is among the most reliable remedies.

175.5.3 5.3 Quantifying the Value of More Data

When the validation curve is still descending, it is often worth fitting a simple parametric model to forecast the payoff of more data. Empirically, generalization error frequently follows a power law in the number of training examples,

\[ \varepsilon(m) \approx \varepsilon_{\infty} + a\, m^{-\alpha}, \]

where $\varepsilon_{\infty}$ is the irreducible asymptote, $a$ sets the scale, and $\alpha > 0$ controls how quickly error decays. Estimating $\alpha$ from the measured curve lets you project the error at a hypothetical $10m$ or $100m$ and weigh that gain against the cost of acquisition. A large $\alpha$ means data is cheaply effective, while a small $\alpha$ warns that even an order of magnitude more data buys little.

A clean way to estimate the parameters is to subtract a candidate asymptote and regress in log space. If the model holds, then $\log(\varepsilon(m) - \varepsilon_\infty) \approx \log a - \alpha \log m$, a straight line whose slope is $-\alpha$. One sweeps a few values of $\varepsilon_\infty$, picks the one that straightens the line best, and reads off $\alpha$. The exponent is usually modest in practice. Classic learning curve studies report values broadly in the range from about 0.2 to 1.0 depending on problem and model, and theory for parametric estimators predicts $\alpha = 1$ for excess squared error in the well specified case [5]. These scaling relationships underpin modern empirical studies of how model and dataset size jointly govern performance [4].
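The sweep described above can be sketched in a few lines. The function name `fit_power_law`, the candidate grid, and the use of the residual sum of squares as the measure of straightness are assumptions made for illustration. The data are the validation errors from the table in Section 3.5.

```python
import math


def fit_power_law(sizes, errors, candidates):
    """Sweep candidate asymptotes; return (eps_inf, a, alpha) of the straightest fit."""
    xs = [math.log(m) for m in sizes]
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    best = None
    for eps_inf in candidates:
        if eps_inf >= min(errors):
            continue  # log(error - eps_inf) would be undefined
        ys = [math.log(e - eps_inf) for e in errors]
        my = sum(ys) / n
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        # Residual sum of squares in log space measures how straight the line is.
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, eps_inf, math.exp(intercept), -slope)
    if best is None:
        raise ValueError("every candidate asymptote is at or above the smallest error")
    _, eps_inf, a, alpha = best
    return eps_inf, a, alpha


sizes = [500, 1000, 2000, 4000, 8000]
val_errors = [0.34, 0.27, 0.21, 0.17, 0.14]
candidates = [i / 100 for i in range(14)]  # 0.00, 0.01, ..., 0.13
eps_inf, a, alpha = fit_power_law(sizes, val_errors, candidates)
print(f"eps_inf={eps_inf:.2f}  a={a:.3f}  alpha={alpha:.3f}")
print(f"projected error at m=80,000: {eps_inf + a * 80000 ** (-alpha):.3f}")
```

With only five points, several candidate asymptotes can straighten the line almost equally well, so it is worth printing the residual for each candidate and treating the chosen value as one plausible fit among neighbors.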

A caution: the power law is a local extrapolation device, not a law of nature. It can break where a new regime begins, for example when a model’s capacity becomes the binding constraint, when distribution shift appears between the small and large data regimes, or when the irreducible noise floor is reached. Treat any projection beyond a factor of a few in $m$ as a hypothesis to be checked, not a guarantee.

175.6 6. Practical Construction

175.6.1 6.1 Building a Size Curve

To build a curve versus $m$, fix a held out validation set, then for each of several training sizes draw a random subset, fit the model, and record both errors. Because a single subset is noisy, average over several random draws at each size and plot the mean with a band for variability. The training error should be measured on the same subset used to fit, not on the full data, otherwise the curve loses its meaning.

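# Placeholders: sample, fit, error, mean, std, and record stand in for project code.
for m in sizes:
    train_errs, val_errs = [], []
    for seed in repeats:
        sub = sample(train, m, seed)          # random subset of size m
        model = fit(sub)
        train_errs.append(error(model, sub))  # on the fitted subset only
        val_errs.append(error(model, val))    # on the fixed held out set
    # Average each curve separately and keep the spread for the band.
    record(m, mean(train_errs), mean(val_errs), std(val_errs))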

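The mature, free, open-source path to the size curve is the learning_curve utility in scikit-learn, which handles the subset sampling and the repeated, cross validated fits described above and returns per-fold training and validation scores that are ready to average and plot with matplotlib [6]. Note that it scores each fit on cross validation folds rather than on a single fixed held out set. Its companion validation_curve sweeps a hyperparameter instead of the training size, which makes it the matching tool for the capacity and regularization questions of Section 5. Reaching for them avoids a surprising number of the pitfalls below.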

175.6.2 6.2 Common Pitfalls

Several mistakes recur often enough to warrant a checklist.

Measuring training error on the wrong set, such as the full training pool rather than the fitted subset, destroys the interpretation of the gap. The validation set must remain fixed across all sizes, since a moving target confounds the comparison. Hyperparameters should be held constant or tuned at each size with care, because a model that is well regularized at small $m$ may be poorly regularized at large $m$. Class imbalance can make accuracy a misleading vertical axis, and a metric such as the F1 score or area under the curve is often more honest. A subset that is not stratified can drift in class balance as $m$ shrinks, which injects spurious wobble, so stratified sampling is preferable for classification. Data leakage between the validation set and the training pool inflates apparent performance and flattens the gap deceptively, so the split must precede subsetting. Finally, a single random subset per size produces jagged, unreliable curves, so repetition and averaging are essential.
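Stratified subsetting is the pitfall most easily fixed in code. The sketch below is one possible implementation, using largest remainder rounding so that the class counts sum exactly to $m$. The function name `stratified_subset` and the rounding rule are illustrative choices, not a prescribed method.

```python
import random
from collections import defaultdict


def stratified_subset(labels, m, seed=0):
    """Return m distinct indices whose class proportions track those of labels."""
    if not 0 < m <= len(labels):
        raise ValueError("m must be between 1 and len(labels)")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n = len(labels)
    # Largest remainder allocation: floor each quota, then give the leftover
    # slots to the classes with the largest fractional parts.
    quotas = {c: m * len(idx) / n for c, idx in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = m - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    chosen = []
    for c, idx in by_class.items():
        chosen.extend(rng.sample(idx, counts[c]))
    rng.shuffle(chosen)  # avoid returning the indices grouped by class
    return chosen


labels = [0] * 90 + [1] * 10
subset = stratified_subset(labels, 20)
positives = sum(labels[i] for i in subset)
print(f"subset size={len(subset)}, positives={positives}")
```

Varying `seed` across repeats gives the independent draws needed for averaging, while the class balance stays fixed at every size.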

175.6.3 6.3 Reading Real Curves

Real curves are rarely as clean as the textbook shapes. Noise, distribution shift between splits, and optimization instability all introduce wobble. The discipline is to look past the wobble for the two essential quantities, namely the height at which the curves are settling and the size of the gap between them. Those two numbers, the asymptotic error and the variance gap, drive nearly every decision described above.

175.7 7. A Compact Decision Table

The chapter can be distilled into a short reference.

Training error	Gap (validation minus training)	Diagnosis	Action
High	Small	Bias limited	Add capacity or features, do not collect data
Low	Large but closing	Variance limited	Collect data or regularize
Low	Small	Well fit	Spend effort elsewhere
High	Either, iteration curve still falling	Unfinished training	Extend budget before concluding

This table is the payoff of the whole exercise. Two cheap plots, read for height and gap, convert a vague sense that “the model is not good enough” into a specific and defensible next step.
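For teams that want the table as a routine check, it can be encoded as a small function. The sketch below is a rough encoding under stated assumptions: "high" training error is taken to mean above the target error, "small" gap means at most `gap_tol`, and both thresholds, like the function name `diagnose`, are hypothetical and should be set per problem. A single measurement cannot tell whether a gap is closing, so that judgment still comes from the plotted curve.

```python
def diagnose(train_err, val_err, target_err, gap_tol=0.05, still_improving=False):
    """Map one (training error, validation error) reading to a table row."""
    gap = val_err - train_err
    high_train = train_err > target_err  # assumed meaning of "high"
    small_gap = gap <= gap_tol           # assumed meaning of "small"
    if high_train and still_improving:
        return "unfinished training: extend budget before concluding"
    if high_train and small_gap:
        return "bias limited: add capacity or features, do not collect data"
    if not high_train and not small_gap:
        return "variance limited: collect data or regularize"
    if not high_train and small_gap:
        return "well fit: spend effort elsewhere"
    # High training error with a large gap is not a row of the table.
    return "not covered by the table: high training error with a large gap"


# Last row of the high variance example and of the high bias example in Section 3.5.
print(diagnose(0.08, 0.14, target_err=0.10))
print(diagnose(0.20, 0.21, target_err=0.10))
```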

175.8 8. When to Use and When Not To

Learning curves earn their keep whenever the next modeling investment is expensive or ambiguous, for example when a data acquisition or labeling budget is on the table, when a team disputes whether the model or the data is the bottleneck, or when a training run is costly and you must justify extending it. They are equally valuable as a cheap routine check before scaling anything up.

They are less informative in a few settings. Under strong distribution shift between the deployment data and any available split, the validation curve estimates the wrong target and its asymptote misleads. When the held out set is tiny, both curves are dominated by sampling noise and neither height nor gap can be read reliably. And when the metric is highly non additive or threshold sensitive, a single scalar curve can hide the behavior that actually matters, so it should be paired with a more granular evaluation. In these cases the curves remain a useful first look but should not be the sole basis for a decision.
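The warning about tiny held out sets can be made quantitative. For a 0/1 error rate measured on $n$ independent held out examples, the binomial standard error is $\sqrt{p(1-p)/n}$. The sketch below, with the hypothetical helper `error_rate_stderr`, shows how wide the resulting interval is at a few set sizes. It assumes a classification error rate and does not apply directly to other losses.

```python
import math


def error_rate_stderr(error_rate, n_val):
    """Binomial standard error of an error rate measured on n_val examples."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n_val)


for n_val in (100, 1000, 10000):
    half_width = 1.96 * error_rate_stderr(0.14, n_val)
    print(f"n_val={n_val:>6}: 0.14 +/- {half_width:.3f} (approximate 95% interval)")
```

When the half width is comparable to the gap between the two curves, or to the drop in validation error between successive sizes, neither height nor gap can be read with confidence.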

175.9 9. Summary

Learning curves turn abstract questions about bias, variance, and data sufficiency into pictures that can be read at a glance. The size curve reveals whether more data will help by showing whether the validation error is still falling and whether a gap remains. The iteration curve reveals whether optimization is healthy and where overfitting begins, enabling early stopping. Read together, and interpreted through the bias variance decomposition, they tell a practitioner which lever to pull next, whether that lever is more data, more capacity, more regularization, or simply more patience. The cost of producing them is small, and the cost of skipping them, in wasted data collection and misdirected modeling effort, is often large.

175.10 References

[1] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
[2] Ng, A. Machine Learning Yearning. https://www.deeplearning.ai/machine-learning-yearning/
[3] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
[4] Hestness, J., Narang, S., Ardalani, N., et al. (2017). Deep Learning Scaling is Predictable, Empirically. https://arxiv.org/abs/1712.00409
[5] Perlich, C., Provost, F., and Simonoff, J. (2003). Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research. https://www.jmlr.org/papers/v4/perlich03a.html
[6] scikit-learn developers. Plotting Learning Curves. https://scikit-learn.org/stable/modules/learning_curve.html

# Learning Curves Learning curves are among the most economical diagnostic tools in applied machine learning. By plotting error against either the amount of training data or the number of optimization iterations, a practitioner can read off whether a model suffers primarily from bias or from variance, whether collecting more data is likely to help, and whether the optimizer has converged. This chapter develops the theory behind these curves, shows how to construct them carefully, and turns the resulting shapes into concrete decisions about data, model capacity, and training budget. ## 1. Two Families of Learning Curves The term "learning curve" refers to two related but distinct constructions, and conflating them is a common source of confusion. The first family plots error as a function of training set size $m$. We train the model on subsets of increasing size and, for each subset, record the training error and a validation error computed on a held out set. This curve answers the question of how performance scales with data and is the primary tool for deciding whether to collect more examples. The second family plots error as a function of optimization iteration $t$, where $t$ might count gradient descent steps, epochs, or boosting rounds. This curve answers questions about optimization and convergence, and it is the natural place to detect overfitting that emerges late in training. Practitioners often call this second curve a training curve or a loss curve to distinguish it from the data scaling curve. ::: {.callout-note} ## Definitions used throughout Let a model be trained on a set $D$ of size $m = |D|$ drawn i.i.d. from a distribution $\mathcal{P}$ over pairs $(x, y)$. Fix a loss $\ell$. - The **training error** (empirical risk) is the average loss on the very examples used to fit the model, $\hat{R}_D(\hat{f}_D) = \frac{1}{m} \sum_{(x,y) \in D} \ell(\hat{f}_D(x), y)$. 
- The **generalization error** (true risk) is the expected loss on a fresh draw, $R(\hat{f}_D) = \mathbb{E}_{(x,y) \sim \mathcal{P}}[\ell(\hat{f}_D(x), y)]$. In practice it is estimated by the **validation error** on a fixed held out set disjoint from $D$. - The **generalization gap** is $R(\hat{f}_D) - \hat{R}_D(\hat{f}_D)$, the quantity that the visible distance between the two curves estimates. ::: Both rest on the same underlying decomposition of generalization error, so we begin there. ## 2. The Bias Variance Decomposition Let $f(x) = \mathbb{E}[y \mid x]$ be the true regression function, let $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and let $\hat{f}_{D}(x)$ be the model learned from a training set $D$. For squared loss, the expected error at a point $x$, averaged over random draws of the training set and over the label noise, decomposes as $$ \mathbb{E}_{D, \varepsilon}\left[(\hat{f}_{D}(x) - y)^2\right] = \underbrace{\left(\mathbb{E}_{D}[\hat{f}_{D}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{D}\left[(\hat{f}_{D}(x) - \mathbb{E}_{D}[\hat{f}_{D}(x)])^2\right]}_{\text{variance}} + \sigma^2 . $$ The first term, bias, measures how far the average prediction sits from the truth. A model whose hypothesis class cannot represent $f$ has high bias regardless of how much data it sees. The second term, variance, measures how much the prediction wobbles as the training set changes. A flexible model fit on limited data has high variance. The final term $\sigma^2$ is the irreducible noise, a floor that no model can cross. This decomposition is exact for squared loss; the derivation uses only that the cross term $\mathbb{E}_D[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])(\mathbb{E}_D[\hat{f}_D(x)] - f(x))]$ vanishes, because the second factor is a constant and the first has mean zero. 
For other losses such as the 0/1 loss or cross entropy there is no single canonical decomposition, but the qualitative split into a systematic component (bias) and a sensitivity-to-sampling component (variance) still organizes the diagnostics that follow [3]. Two facts make the decomposition the right lens for learning curves. First, both bias and variance depend on $m$: increasing the sample size leaves bias essentially fixed (it is a property of the hypothesis class relative to $f$) while it shrinks variance, typically at rate $O(1/m)$ for well behaved estimators. Second, the training error and the validation error respond to these terms in opposite directions, which is exactly why the two curves move toward each other as data accumulates. Learning curves are, in effect, a way to visualize where on the bias variance spectrum a given model and dataset sit. High bias and high variance leave distinct fingerprints, and the rest of this chapter is about reading them. ## 3. Curves Versus Training Set Size ### 3.1 How the Curves Move Consider training error and validation error as functions of $m$. Training error tends to rise with $m$. With very few examples the model can fit them almost perfectly, so training error starts near zero. As more examples arrive, the model can no longer memorize all of them, and training error climbs toward an asymptote. Validation error tends to fall with $m$. With few examples the model generalizes poorly, so validation error starts high. As data accumulates, the model captures more of the underlying structure and validation error decreases, also approaching an asymptote. In the large sample limit both curves converge toward the same value, which equals the bias of the model plus the irreducible noise, namely $\text{bias}^2 + \sigma^2$. The gap between them at finite $m$ reflects variance. 
The two statements together explain the geometry: validation error descends from above and training error rises from below, both squeezing toward the common floor $\text{bias}^2 + \sigma^2$ as variance is driven out by data. ### 3.2 The High Bias Signature When a model has high bias, both curves flatten quickly and converge to a high error value. The training error itself is large, because the model cannot fit even the data it has seen. The two curves sit close together, separated by only a small gap, and that small gap is the telltale sign that variance is not the problem. The practical consequence is blunt. If the curves have already converged at a disappointing error level, adding more data will not help. More examples push you along a curve that has already plateaued. The remedy lies elsewhere, in a more expressive model or richer features, topics taken up in Section 5. ### 3.3 The High Variance Signature When a model has high variance, the training error is low, often much lower than the validation error, and a wide gap persists between the two curves even at the largest training sizes you have tried. Crucially the validation curve is still descending. It has not yet flattened. This shape carries good news for data collection. Because the validation curve is still falling and the gap is closing, more training data is likely to reduce generalization error. The two curves are on a trajectory to meet, and additional examples move you toward that meeting point. ### 3.4 A Worked Mental Model A useful sanity check is to imagine the asymptote both curves are heading toward. If you can extrapolate the validation curve and it appears to level off above your target error, then more data alone will not get you there even in principle, and you are bias limited. If the extrapolated asymptote sits below your target, more data is a viable path. 
This extrapolation is informal, but it disciplines the decision and prevents the common mistake of collecting data for a model that has already saturated. ### 3.5 A Concrete Worked Example Numbers make the two signatures vivid. Suppose your target validation error is 0.10 and you observe the following size curve (errors are illustrative). | $m$ | Training error | Validation error | Gap | |----:|---------------:|-----------------:|----:| | 500 | 0.02 | 0.34 | 0.32 | | 1000 | 0.04 | 0.27 | 0.23 | | 2000 | 0.06 | 0.21 | 0.15 | | 4000 | 0.07 | 0.17 | 0.10 | | 8000 | 0.08 | 0.14 | 0.06 | Three readings follow directly. The training error is low and rising slowly, the gap is large but shrinking with every doubling, and the validation error is still falling. This is the high variance signature, so more data is the right lever. To quantify how much more, fit the power law of Section 5.3. Using the last two points, $\varepsilon(4000) = 0.17$ and $\varepsilon(8000) = 0.14$, and guessing an asymptote $\varepsilon_\infty \approx 0.09$ from the slowing training error, the implied decay exponent solves $(0.17 - 0.09) / (0.14 - 0.09) = 2^{\alpha}$, giving $\alpha = \log_2(0.08 / 0.05) \approx 0.68$. Projecting to $m = 32{,}000$ (two further doublings) gives roughly $0.09 + 0.05 \cdot 2^{-2 \times 0.68} \approx 0.11$, just above target, and to $m = 64{,}000$ roughly $0.10$. The forecast turns "collect more data" into a budgeted plan: expect to need on the order of eight times the current data to reach the goal, and decide whether that acquisition cost is justified. Contrast a high bias table where training error reads 0.18, 0.19, 0.19, 0.20, 0.20 across the same sizes while validation sits at 0.24, 0.23, 0.22, 0.22, 0.21. Here both curves have nearly met near 0.20, far above the 0.10 target, and no amount of data closes that floor. The lever must be capacity or features, not acquisition. ## 4. 
Curves Versus Iterations ### 4.1 Reading the Optimization Curve Now hold the data fixed and plot training and validation loss against iteration $t$. Early in training both losses fall together as the optimizer reduces a large initial error. This regime reflects optimization progress and tells you whether the learning rate and other hyperparameters allow the model to fit at all. As training proceeds, three patterns can emerge. If both losses plateau at a high value, the model is underfitting and the run is bias limited. If both losses keep falling and have not flattened, training has simply not finished, and the budget should be extended. If the training loss continues to fall while the validation loss bottoms out and then begins to rise, the model is overfitting, and the rising validation loss marks the onset. ### 4.2 Early Stopping The point where validation loss reaches its minimum defines the early stopping criterion. Training past this point trades a still falling training loss for a worsening validation loss, the very definition of overfitting in the iteration domain. Early stopping is a regularizer in its own right, and it is often the cheapest one available, because it requires no change to the model and no new data. For squared loss fit by gradient descent on a linear model, early stopping is closely connected to $L_2$ regularization: stopping after a finite number of steps restricts the effective parameter norm, and the number of steps plays a role analogous to the inverse of the ridge penalty [3]. In practice one monitors validation loss with a patience window, stopping when no improvement appears for a fixed number of evaluations and restoring the best checkpoint. 

```text
best = inf; wait = 0; patience = 10
for t in range(max_iters):
    train_one_step()
    v = validate()
    if v < best:
        best = v; wait = 0; save_checkpoint()
    else:
        wait += 1
        if wait >= patience:
            break
restore_best_checkpoint()   # also covers runs that reach max_iters
```

A subtlety worth flagging: the validation loss used for early stopping is itself an estimate, and selecting the iteration that minimizes it introduces a small optimistic bias into that minimum. For an unbiased read of final performance, report the stopped model's error on a separate test set, not on the validation set that drove the stopping decision.

### 4.3 Distinguishing the Two Curve Families

It bears repeating that an iteration curve and a training size curve answer different questions. A model can look perfectly converged on the iteration plot, with a flat training loss, yet still be data starved on the size plot, with a wide and closing validation gap. Diagnosing a system usually requires both views. The iteration curve confirms that optimization is healthy, and the size curve tells you whether the bottleneck is data or capacity.

The following diagram summarizes how the two families route to a decision.

```{mermaid}
flowchart TD
    A["Symptom: model not good enough"] --> B["Plot iteration curve"]
    B --> C{"Both losses still falling?"}
    C -->|"yes"| D["Train longer, extend budget"]
    C -->|"no"| E{"Validation loss rising late?"}
    E -->|"yes"| F["Overfitting: early stop, regularize"]
    E -->|"no"| G["Plot size curve"]
    G --> H{"Training error high, small gap?"}
    H -->|"yes"| I["High bias: add capacity or features"]
    H -->|"no"| J["High variance: collect data or regularize"]
```

## 5. From Diagnosis to Decision

The value of learning curves lies in the actions they recommend. The following decision rules summarize the analysis.

### 5.1 If the Diagnosis Is High Bias

When training error is high and the curves have converged, the model is too simple for the task.
Productive moves include adding features or higher order interaction terms, increasing model capacity such as depth or width, decreasing regularization strength, and training longer if the iteration curve has not yet flattened. Collecting more data is not productive here, and recognizing that saves time and budget.

### 5.2 If the Diagnosis Is High Variance

When training error is low but a large validation gap persists and the validation curve is still falling, the model is overfitting the available data. Productive moves include gathering more training examples, adding regularization such as $L_2$ penalties or dropout, reducing model capacity, applying data augmentation, and using early stopping on the iteration curve. Here additional data is among the most reliable remedies.

### 5.3 Quantifying the Value of More Data

When the validation curve is still descending, it is often worth fitting a simple parametric model to forecast the payoff of more data. Empirically, generalization error frequently follows a power law in the number of training examples,

$$
\varepsilon(m) \approx \varepsilon_{\infty} + a\, m^{-\alpha},
$$

where $\varepsilon_{\infty}$ is the irreducible asymptote, $a$ sets the scale, and $\alpha > 0$ controls how quickly error decays. Estimating $\alpha$ from the measured curve lets you project the error at a hypothetical $10m$ or $100m$ and weigh that gain against the cost of acquisition. A large $\alpha$ means data is cheaply effective, while a small $\alpha$ warns that even an order of magnitude more data buys little.

A clean way to estimate the parameters is to subtract a candidate asymptote and regress in log space. If the model holds, then $\log(\varepsilon(m) - \varepsilon_\infty) \approx \log a - \alpha \log m$, a straight line whose slope is $-\alpha$. One sweeps a few values of $\varepsilon_\infty$, picks the one that straightens the line best, and reads off $\alpha$. The exponent is usually modest in practice.
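
As a rough illustration of the sweep-and-regress procedure, the following plain-Python sketch fits the power law to the illustrative validation errors from the Section 3.5 table. The function names `fit_power_law` and `project`, the candidate grid for $\varepsilon_\infty$, and the choice of residual sum of squares as the measure of "straightness" are assumptions made for this example, not part of any library. With only five points the fitted values are sensitive to the grid, so the projection is best read as a cross-check on a hand estimate.

```python
import math


def fit_power_law(sizes, errors, asymptotes):
    """Fit eps(m) ~ eps_inf + a * m**(-alpha) by sweeping eps_inf.

    For each candidate asymptote, regress log(eps - eps_inf) on log(m)
    by ordinary least squares and keep the candidate whose line leaves
    the smallest residual sum of squares (an assumed straightness score).
    """
    best = None
    xs = [math.log(m) for m in sizes]
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    for eps_inf in asymptotes:
        if eps_inf >= min(errors):
            continue  # log(eps - eps_inf) is undefined for this candidate
        ys = [math.log(e - eps_inf) for e in errors]
        y_bar = sum(ys) / n
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = y_bar - slope * x_bar
        rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or rss < best["rss"]:
            best = {
                "eps_inf": eps_inf,
                "a": math.exp(intercept),  # intercept estimates log a
                "alpha": -slope,           # slope estimates -alpha
                "rss": rss,
            }
    if best is None:
        raise ValueError("every candidate asymptote is at or above the smallest error")
    return best


def project(fit, m):
    """Evaluate the fitted curve at a hypothetical training set size m."""
    return fit["eps_inf"] + fit["a"] * m ** (-fit["alpha"])


# Illustrative validation errors from the Section 3.5 table.
sizes = [500, 1000, 2000, 4000, 8000]
val_errors = [0.34, 0.27, 0.21, 0.17, 0.14]
candidates = [i / 100 for i in range(14)]  # 0.00, 0.01, ..., 0.13 (arbitrary grid)

fit = fit_power_law(sizes, val_errors, candidates)
print(f"eps_inf={fit['eps_inf']:.2f}  a={fit['a']:.3f}  alpha={fit['alpha']:.3f}")
print(f"projected error at m=64000: {project(fit, 64_000):.3f}")
```
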
Classic learning curve studies report values broadly in the range from about 0.2 to 1.0 depending on problem and model, and theory for parametric estimators predicts $\alpha = 1$ for excess squared error in the well specified case [5]. These scaling relationships underpin modern empirical studies of how model and dataset size jointly govern performance [4].

A caution: the power law is a local extrapolation device, not a law of nature. It can break where a new regime begins, for example when a model's capacity becomes the binding constraint, when distribution shift appears between the small and large data regimes, or when the irreducible noise floor is reached. Treat any projection beyond a factor of a few in $m$ as a hypothesis to be checked, not a guarantee.

## 6. Practical Construction

### 6.1 Building a Size Curve

To build a curve versus $m$, fix a held out validation set, then for each of several training sizes draw a random subset, fit the model, and record both errors. Because a single subset is noisy, average over several random draws at each size and plot the mean with a band for variability. The training error should be measured on the same subset used to fit, not on the full data, otherwise the curve loses its meaning.

```text
for m in sizes:
    train_errs = []; val_errs = []
    for seed in repeats:
        sub = sample(train, m, seed)
        model = fit(sub)
        train_errs.append(error(model, sub))   # the fitted subset, not the full pool
        val_errs.append(error(model, val))     # the same fixed validation set at every m
    # the mean gives the curve, the spread gives the variability band
    record(m, mean(train_errs), std(train_errs), mean(val_errs), std(val_errs))
```

The mature, free, open-source path to these curves is the `learning_curve` and `validation_curve` utilities in scikit-learn, which handle the subset sampling, repeated fits, and cross validated averaging described above and return arrays ready to plot with matplotlib [6]. Reaching for them avoids a surprising number of the pitfalls below.

### 6.2 Common Pitfalls

Several mistakes recur often enough to warrant a checklist. Measuring training error on the wrong set, such as the full training pool rather than the fitted subset, destroys the interpretation of the gap.
The validation set must remain fixed across all sizes, since a moving target confounds the comparison. Hyperparameters should be held constant or tuned at each size with care, because a model that is well regularized at small $m$ may be poorly regularized at large $m$. Class imbalance can make accuracy a misleading vertical axis, and a metric such as the F1 score or area under the curve is often more honest. A subset that is not stratified can drift in class balance as $m$ shrinks, which injects spurious wobble, so stratified sampling is preferable for classification. Data leakage between the validation set and the training pool inflates apparent performance and flattens the gap deceptively, so the split must precede subsetting. Finally, a single random subset per size produces jagged, unreliable curves, so repetition and averaging are essential.

### 6.3 Reading Real Curves

Real curves are rarely as clean as the textbook shapes. Noise, distribution shift between splits, and optimization instability all introduce wobble. The discipline is to look past the wobble for the two essential quantities, namely the height at which the curves are settling and the size of the gap between them. Those two numbers, the asymptotic error and the variance gap, drive nearly every decision described above.

## 7. A Compact Decision Table

The chapter can be distilled into a short reference.

| Training error | Gap (validation minus training) | Diagnosis | Action |
|---|---|---|---|
| High | Small | Bias limited | Add capacity or features, do not collect data |
| Low | Large but closing | Variance limited | Collect data or regularize |
| Low | Small | Well fit | Spend effort elsewhere |
| High | Either, iteration curve still falling | Unfinished training | Extend budget before concluding |

This table is the payoff of the whole exercise. Two cheap plots, read for height and gap, convert a vague sense that "the model is not good enough" into a specific and defensible next step.

## 8. When to Use and When Not To

Learning curves earn their keep whenever the next modeling investment is expensive or ambiguous, for example when a data acquisition or labeling budget is on the table, when a team disputes whether the model or the data is the bottleneck, or when a training run is costly and you must justify extending it. They are equally valuable as a cheap routine check before scaling anything up.

They are less informative in a few settings. Under strong distribution shift between the deployment data and any available split, the validation curve estimates the wrong target and its asymptote misleads. When the held out set is tiny, both curves are dominated by sampling noise and neither height nor gap can be read reliably. And when the metric is highly non additive or threshold sensitive, a single scalar curve can hide the behavior that actually matters, so it should be paired with a more granular evaluation. In these cases the curves remain a useful first look but should not be the sole basis for a decision.

## 9. Summary

Learning curves turn abstract questions about bias, variance, and data sufficiency into pictures that can be read at a glance. The size curve reveals whether more data will help by showing whether the validation error is still falling and whether a gap remains. The iteration curve reveals whether optimization is healthy and where overfitting begins, enabling early stopping. Read together, and interpreted through the bias variance decomposition, they tell a practitioner which lever to pull next, whether that lever is more data, more capacity, more regularization, or simply more patience. The cost of producing them is small, and the cost of skipping them, in wasted data collection and misdirected modeling effort, is often large.

## References

1. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/
2. Ng, A. *Machine Learning Yearning*. https://www.deeplearning.ai/machine-learning-yearning/
3. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. https://www.deeplearningbook.org/
4. Hestness, J., Narang, S., Ardalani, N., et al. (2017). Deep Learning Scaling is Predictable, Empirically. https://arxiv.org/abs/1712.00409
5. Perlich, C., Provost, F., and Simonoff, J. (2003). Tree Induction vs. Logistic Regression: A Learning Curve Analysis. *Journal of Machine Learning Research*. https://www.jmlr.org/papers/v4/perlich03a.html
6. scikit-learn developers. Plotting Learning Curves. https://scikit-learn.org/stable/modules/learning_curve.html