63 Statistical Graphics for AI

Numbers summarize, but pictures reveal. A single accuracy figure can hide a model that fails badly on a minority class, a regression that systematically underpredicts large values, or a training run that stopped improving thousands of steps ago. Statistical graphics are the instruments through which a practitioner inspects data and models before, during, and after fitting. This chapter surveys the plots that earn their place in a machine learning workflow and, more importantly, teaches you how to read them so that you draw correct conclusions rather than comforting ones.

The guiding idea, due to Tukey, is that exploratory graphics exist to force us to notice what we did not expect to see. A plot is not a summary that confirms a hypothesis we already hold; it is a search for the structure that our model and our metrics have failed to account for. Everything below is organized around one question: which shape, if it appeared, would change the next thing you do?

63.1 1. Why Graphics Belong in the Machine Learning Loop

63.1.1 1.1 The Limits of Scalar Summaries

Anscombe’s quartet remains the canonical warning. Four datasets share nearly identical means, variances, correlation, and ordinary least squares fit, yet they look entirely different when plotted: one is linear with noise, one is curved, one is linear with a single outlier, and one is a vertical cluster with one leverage point. The lesson generalizes directly to machine learning. Aggregate metrics compress a high dimensional object into a scalar, and compression is lossy by construction. The question is never whether information is lost but whether the lost information mattered.

Graphics restore some of that lost structure. A residual plot exposes the curvature that a single $R^2$ conceals. A calibration plot exposes the overconfidence that accuracy ignores. A learning curve distinguishes a model that needs more data from one that needs more capacity. None of these distinctions is visible in the headline number, and all of them change what you do next.

It is worth being concrete about why scalar summaries are lossy in a way that is not merely philosophical. A summary statistic $T$ is a many to one map from the space of datasets to the real line, so the preimage $T^{-1}(t)$, the set of all datasets producing the same value $t$, is typically enormous. Anscombe’s quartet is simply four members of one such preimage chosen to look different. A graphic is a richer, higher dimensional map: a histogram of $n$ points retains the full empirical distribution up to bin resolution, and a scatterplot retains every pair. The art of diagnostics is choosing maps whose preimages separate the failure modes you care about.
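The preimage argument can be checked numerically. The sketch below uses the first two datasets of Anscombe's quartet and a hypothetical summary map `summary` (the function names are illustrative, not from any library) to show that two visibly different datasets collapse to the same rounded scalars.

```python
from statistics import mean, variance

# First two datasets of Anscombe's quartet (Anscombe, 1973); they share x.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    """Pearson correlation computed from first principles."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)

def summary(xs, ys):
    """A many to one map: dataset -> (mean of y, variance of y, correlation)."""
    return (round(mean(ys), 2), round(variance(ys), 2), round(pearson(xs, ys), 2))

s1, s2 = summary(x, y1), summary(x, y2)
print(s1, s2)  # the two tuples agree although y1 and y2 differ point by point
```

A scatterplot of `(x, y1)` and `(x, y2)` separates the two immediately, which is the sense in which the plot is the richer map.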

63.1.2 1.2 Graphics as Diagnostics, Not Decoration

It helps to separate two uses of plots. Explanatory graphics communicate a finished result to an audience and should be polished, labeled, and selective. Exploratory and diagnostic graphics serve the analyst and should be fast, plentiful, and disposable. Most of the plots in this chapter are diagnostic. You make dozens, glance at each for a few seconds, and keep the two or three that surprised you. Treating diagnostics with the ceremony of a publication figure is a common way to make too few of them.

63.2 2. Looking at Distributions

63.2.1 2.1 Histograms, Densities, and Their Failure Modes

The histogram is the first plot most people reach for, and the most easily misread. Its appearance depends on two arbitrary choices: bin width and bin origin. Too few bins smooth away real structure such as bimodality; too many bins turn sampling noise into spurious spikes. A useful default is the Freedman Diaconis rule, which sets bin width to $2 \cdot \mathrm{IQR} \cdot n^{-1/3}$, where $\mathrm{IQR}$ is the interquartile range and $n$ the sample size. This scales the resolution to both the spread of the data and the amount of it. The cube root rate is not arbitrary: $n^{-1/3}$ is the rate at which the bin width that asymptotically minimizes the mean integrated squared error of the histogram shrinks as the sample grows. Using the interquartile range in place of the standard deviation makes the rule robust to the heavy tails common in machine learning features.
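As a minimal sketch of the rule, the function below computes the Freedman Diaconis width and the bin count it implies. The helper names are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different widths.

```python
import math
from statistics import quantiles

def freedman_diaconis_width(data):
    """Bin width 2 * IQR * n**(-1/3)."""
    n = len(data)
    # The quartile convention is an assumption; conventions differ slightly.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return 2.0 * (q3 - q1) * n ** (-1.0 / 3.0)

def bin_count(data):
    """Number of bins needed to cover the data range at that width."""
    width = freedman_diaconis_width(data)
    if width <= 0:          # degenerate case: IQR of zero
        return 1
    return max(1, math.ceil((max(data) - min(data)) / width))

data = list(range(1, 101))  # hypothetical sample: the integers 1..100
print(freedman_diaconis_width(data), bin_count(data))
```

Because the width shrinks only as the cube root of $n$, going from one hundred to one hundred thousand points multiplies the bin count by roughly ten, not by a thousand.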

Kernel density estimates avoid the bin origin problem by placing a smooth kernel at each observation. The estimator is

\[\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),\]

where $K$ is a kernel that integrates to one, typically Gaussian, and $h$ is the bandwidth. The bandwidth plays exactly the role that bin width plays for a histogram, and choosing it badly produces the same two failures: a small $h$ tracks sampling noise, a large $h$ oversmooths away real modes. Density estimates carry an additional hazard. They can invent support where there is none, drawing smooth tails into negative values for a strictly positive quantity such as a duration or a count, because the kernel placed near a boundary spills probability mass across it. When a variable has a hard boundary, prefer a histogram, a log transform before estimation, or an explicitly bounded estimator such as a reflected or beta kernel.
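The boundary hazard is easy to quantify for a Gaussian kernel, because the mass each kernel pushes across the boundary is a normal tail probability. The following sketch assumes a small set of hypothetical positive durations and two illustrative bandwidths.

```python
import math

def gaussian_kde(x, data, h):
    """Evaluate the Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def mass_below(bound, data, h):
    """Probability mass the estimate assigns to values at or below bound."""
    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # Each kernel contributes its own normal CDF evaluated at the bound.
    return sum(normal_cdf((bound - xi) / h) for xi in data) / len(data)

durations = [0.05, 0.1, 0.2, 0.4, 0.8, 1.5, 3.0]   # hypothetical, strictly positive
leak_wide = mass_below(0.0, durations, h=0.5)
leak_narrow = mass_below(0.0, durations, h=0.05)
print(leak_wide, leak_narrow)   # mass placed on impossible negative durations
```

A smaller bandwidth leaks less but tracks noise more, so shrinking $h$ is not a substitute for a bounded estimator or a log transform.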

63.2.2 2.2 Boxplots, Violins, and the Tails

Boxplots compress a distribution to five numbers and flag points beyond $1.5 \times \mathrm{IQR}$ from the quartiles as candidate outliers. They are excellent for comparing many groups side by side and poor at revealing shape, since a bimodal distribution and a uniform one can produce identical boxes. Violin plots and their cousins overlay a density to recover shape information, which matters when you suspect mixtures, for example a feature whose distribution differs between two latent subpopulations.

For machine learning the most consequential part of a distribution is often the tail. Heavy tails break the implicit assumptions of many models and inflate the variance of gradient estimates. A log scaled histogram or a plot of the empirical survival function $\hat{S}(x) = 1 - \hat{F}(x)$ on log axes makes tail behavior legible. The reason a power law tail straightens on log log axes is immediate: if $S(x) \propto x^{-\alpha}$ for large $x$, then $\log S(x) = c - \alpha \log x$, a line of slope $-\alpha$. So the survival plot does double duty, both confirming the heavy tail and reading off its exponent. This is a far more reliable diagnostic than staring at a linear scale histogram whose tail is crushed against the axis, where dozens of distinct tail behaviors all look like the same flat sliver. A caution worth stating: a straight stretch on a log log plot is necessary but not sufficient for a power law, since other heavy tailed laws can mimic linearity over a limited range, so treat the slope as a description rather than a proof.
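The slope reading can be sketched in a few lines. To keep the example deterministic, the "sample" here is an idealized one placed at the exact quantiles of a Pareto law with a hypothetical exponent $\alpha = 2.5$; a real sample would scatter around the line, most of all in the far tail.

```python
import math

def survival_points(sample):
    """Return (log x, log S(x)) pairs from the empirical survival function."""
    xs = sorted(sample)
    n = len(xs)
    # Plotting positions (i - 0.5)/n keep S strictly positive at the largest point.
    return [(math.log(x), math.log(1.0 - (i - 0.5) / n))
            for i, x in enumerate(xs, start=1)]

def ols_slope(points):
    """Least squares slope of the second coordinate on the first."""
    mx = sum(p[0] for p in points) / len(points)
    my = sum(p[1] for p in points) / len(points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

alpha, n = 2.5, 1000
# Idealized Pareto sample: exact quantiles x = (1 - p)^(-1/alpha), an assumption.
sample = [(1.0 - (i - 0.5) / n) ** (-1.0 / alpha) for i in range(1, n + 1)]
slope = ols_slope(survival_points(sample))
print(slope)   # close to -alpha
```

On real data it is usually safer to fit the slope only over the upper tail, since the power law form is claimed for large $x$ only.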

63.3 3. QQ Plots and the Question of Shape

63.3.1 3.1 Construction and Reading

A quantile quantile plot compares the quantiles of your data against the quantiles of a reference distribution, most often the standard normal. You sort the observations, assign each the cumulative probability $p_i = (i - 0.5)/n$, and plot the sample value against the theoretical quantile $\Phi^{-1}(p_i)$. If the data follow the reference distribution up to location and scale, the points fall on a straight line. The intercept reflects the mean and the slope reflects the standard deviation, so the line itself carries no diagnostic weight; only departures from it do.
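The construction translates directly into code. This sketch uses `statistics.NormalDist` from the Python standard library for $\Phi^{-1}$; the function name `normal_qq_points` and the sample values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair each sorted observation with the normal quantile at (i - 0.5)/n."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()   # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]

sample = [2.1, 3.4, 1.9, 5.6, 2.8, 3.1, 4.0, 2.5, 9.7]   # hypothetical values
points = normal_qq_points(sample)
for theoretical, observed in points:
    print(f"{theoretical:+.3f}  {observed:.1f}")
```

Plotting the first coordinate on the horizontal axis and the second on the vertical axis gives the QQ plot; the largest observation here sits well above the trend of the rest, the signature of a right tail outlier.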

The shapes of those departures are a vocabulary worth memorizing. Points that bend upward at the right and downward at the left, forming an S that is steeper than the reference at both ends, indicate heavy tails. The opposite curvature indicates light tails. A consistent C shape indicates skew: with sample values on the vertical axis, points that sit above the line at both ends signal right skew, and points that sit below it at both ends signal left skew. Isolated points far off the line at one end are outliers. Reading a QQ plot is reading these patterns, not checking whether every point sits exactly on the line, because finite samples always wobble.

63.3.2 3.2 Where QQ Plots Earn Their Keep in ML

Normality of raw features is rarely required by modern models, so a QQ plot of an input feature is mostly an exploratory tool that suggests transformations, such as a log transform to straighten a right skewed C shape. The higher value use is on residuals and on calibration of probabilistic predictions. Many uncertainty estimates assume Gaussian errors, and a QQ plot of standardized residuals tests that assumption directly. For a probabilistic forecaster you can construct a QQ style plot of the probability integral transform values, which should be uniform if the predictive distributions are well calibrated, turning an abstract calibration claim into a visible diagonal.

63.4 4. Residual Plots for Regression

63.4.1 4.1 The Core Diagnostic

The single most informative plot for a regression model is residuals against fitted values. Define the residual as $r_i = y_i - \hat{y}_i$ and plot $r_i$ on the vertical axis against $\hat{y}_i$ on the horizontal axis. Under a well specified model with homoskedastic errors, this plot should look like structureless noise: a horizontal band centered on zero with constant vertical spread. Every departure from that ideal names a specific problem.

A curved trend, often a U or inverted U, means the model has missed nonlinearity and the functional form is wrong. A fan shape, where spread grows with the fitted value, means heteroskedasticity, so the error variance depends on the prediction and ordinary confidence intervals will be miscalibrated. A residual band that sits above or below zero in some region means local bias. Distinct horizontal stripes can indicate a discrete or censored target. Because the plot maps each symptom to a cause, it is the regression analogue of a stethoscope.

residual = actual minus predicted
plot residual (y axis) vs predicted (x axis)
ideal: flat random band around zero
curve  -> wrong functional form
fan    -> non constant variance
offset -> local bias
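The curve symptom can be reproduced with a deliberately misspecified fit. In this sketch a straight line is fitted to a hypothetical target that is exactly quadratic, and the residuals come out positive at both ends and negative in the middle.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

xs = [float(i) for i in range(11)]
ys = [x ** 2 for x in xs]                # hypothetical, truly quadratic target
a, b = fit_line(xs, ys)
fitted = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]   # actual minus predicted

for f, r in zip(fitted, residuals):
    print(f"{f:7.1f}  {r:+6.1f}")        # U shape: +, then -, then +
```

The residuals still sum to zero, as they must for a least squares fit with an intercept, which is why the average residual says nothing about functional form.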

63.4.2 4.2 Companion Residual Plots

Plotting residuals against each individual feature, including features not in the model, finds structure the model failed to capture. A trend against an omitted variable is direct evidence that it belongs in the model. A scale location plot, which graphs the square root of the absolute standardized residual against fitted values, isolates the variance trend and makes mild heteroskedasticity easier to see than the raw residual plot does.

For influence, a plot of residuals against leverage, with Cook’s distance contours overlaid, separates points that are merely unusual in $y$ from points that actually move the fit. Leverage measures how extreme an observation is in feature space. For a linear model with design matrix $X$, the fitted values are $\hat{y} = Hy$ where $H = X(X^\top X)^{-1} X^\top$ is the hat matrix, and the leverage of observation $i$ is its diagonal entry $h_{ii}$, which equals $\partial \hat{y}_i / \partial y_i$, the sensitivity of a point’s own fitted value to its own observed value. Leverages satisfy $0 \le h_{ii} \le 1$ and sum to the number of parameters $p$, so the average leverage is $p/n$ and a common rule flags points with $h_{ii} > 2p/n$. A point with high leverage and a large residual is the kind of single observation that can swing a coefficient. Cook’s distance combines both,

\[D_i = \frac{r_i^2}{p\, \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},\]

so it grows with both the residual $r_i$ and the leverage $h_{ii}$ and measures how far the full vector of fitted values moves when observation $i$ is deleted. In large models you rarely inspect individual points this way, but for linear baselines and for debugging suspicious slices it remains valuable.
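For a straight line fit the hat matrix diagonal has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which makes a small self-contained sketch possible. The data are hypothetical, with the last point placed far out in $x$, and $\hat{\sigma}^2$ is taken as the residual sum of squares over $n - p$.

```python
def leverage_and_cooks(xs, ys):
    """Leverage and Cook's distance for simple linear regression (p = 2)."""
    n, p = len(xs), 2                       # intercept and slope
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Diagonal of the hat matrix for a straight line fit.
    h = [1.0 / n + (x - mx) ** 2 / sxx for x in xs]
    sigma2 = sum(r * r for r in resid) / (n - p)
    cooks = [r * r / (p * sigma2) * hi / (1.0 - hi) ** 2
             for r, hi in zip(resid, h)]
    return h, cooks

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0]   # last point is extreme in x
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 10.0]   # and off the trend of the others
h, cooks = leverage_and_cooks(xs, ys)
flagged = [i for i, hi in enumerate(h) if hi > 2 * 2 / len(xs)]
print(flagged, max(cooks))
```

Note that the extreme point's raw residual is not the largest in the set: high leverage pulls the line toward it, which is exactly why leverage and residual need to be read together.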

63.5 5. Learning Curves

63.5.1 5.1 Two Different Plots With the Same Name

The phrase learning curve denotes two distinct diagnostics, and conflating them causes confusion. The first plots a performance metric against the number of training examples, holding the model fixed. The second plots a metric against training progress, meaning epochs or optimizer steps, holding the dataset fixed. The first answers whether more data would help. The second answers whether training is converging, overfitting, or unstable. Always know which one you are looking at.

63.5.2 5.2 Sample Size Curves and the Bias Variance Story

In the sample size version you plot both training and validation error as the training set grows. The classic reading: if training and validation error converge to a high value, the model is underfitting and is limited by bias, so more data will not help and you need a richer model or better features. If a wide gap persists, with low training error and high validation error, the model is overfitting and is limited by variance, so more data, regularization, or a simpler model should help. The gap between the two curves is a direct visual estimate of the generalization gap.

63.5.3 5.3 Epoch Curves and Early Stopping

The training progress version plots loss or metric per epoch for training and validation. The signature of overfitting is a validation curve that descends, reaches a minimum, then rises while the training curve keeps falling. The minimum of the validation curve is the early stopping point. Be cautious about reading these curves too literally. Validation curves are noisy, so a single uptick is not evidence of overfitting; use a patience window. Learning rate schedules create characteristic shapes, such as a sharp drop when the rate decays, that are properties of the optimizer rather than the model. Plotting loss on a log scale often makes the late stage behavior, where the interesting decisions live, far more legible.
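A patience window can be sketched as a small function over the recorded validation losses. The rule below is one common variant and an assumption, not a canonical definition: stop once `patience` consecutive epochs have passed without a new best loss, and report the epoch of the best loss seen.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a patience based stopping rule."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch        # patience exhausted
    return best_epoch, len(val_losses) - 1  # ran out of epochs first

# Hypothetical validation losses with one noisy uptick at epoch 3.
val = [1.00, 0.80, 0.65, 0.66, 0.60, 0.61, 0.63, 0.66, 0.70]
print(early_stopping_epoch(val, patience=3))
print(early_stopping_epoch(val, patience=1))
```

With a patience of one the single uptick at epoch 3 halts training before the better minimum at epoch 4 is reached, which is the misreading the text warns against.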

63.6 6. Classifier Diagnostics

63.6.1 6.1 Confusion Matrices

A confusion matrix tabulates predicted class against true class; throughout this section rows index the true class and columns the predicted class. For binary problems its four cells are true positives, false positives, true negatives, and false negatives, and almost every classification metric is a ratio of these counts. Reading a confusion matrix well means normalizing it deliberately. Normalizing each row by its sum gives recall per class, answering what fraction of each true class was caught. Normalizing each column gives precision, answering what fraction of each prediction was correct. The raw counts and these two normalizations answer different questions, and quoting only one is a frequent source of misleading claims.

On imbalanced data the raw matrix is dominated by the majority class and can look excellent while the minority class is barely detected. Row normalization is the antidote, since it exposes a minority recall of, say, $0.2$ that the overall accuracy of $0.95$ hid. For multiclass problems the off diagonal cells form a map of which classes get confused with which, and clusters of confusion frequently reveal genuine label ambiguity or related categories rather than model failure.
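The two normalizations are short enough to write out. The counts below are hypothetical and chosen to reproduce the situation just described, with rows as true classes and columns as predicted classes (an assumed convention; some tools transpose it).

```python
def normalize_rows(matrix):
    """Divide each row by its sum: the diagonal becomes per class recall."""
    return [[c / sum(row) for c in row] for row in matrix]

def normalize_columns(matrix):
    """Divide each column by its sum: the diagonal becomes per class precision."""
    col_sums = [sum(col) for col in zip(*matrix)]
    return [[c / s for c, s in zip(row, col_sums)] for row in matrix]

# Hypothetical counts: 950 majority cases (row 0), 50 minority cases (row 1).
counts = [[940, 10],
          [40, 10]]
total = sum(sum(row) for row in counts)
accuracy = (counts[0][0] + counts[1][1]) / total
recall = normalize_rows(counts)
precision = normalize_columns(counts)
print(accuracy, recall[1][1], precision[1][1])
```

Neither helper guards against an empty row or column; a class that is never true or never predicted needs explicit handling before dividing.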

63.6.2 6.2 ROC Curves

The receiver operating characteristic curve traces the true positive rate against the false positive rate as the decision threshold sweeps from permissive to strict. Each threshold yields one point; the curve is their locus. The diagonal represents random guessing, and the area under the curve, the AUC, equals the probability that the model ranks a random positive above a random negative. The AUC is therefore a measure of ranking quality that does not depend on the choice of any single threshold or operating point.

\[\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}\]

The crucial subtlety is that the false positive rate has the total number of actual negatives, $FP + TN$, in its denominator. When negatives vastly outnumber positives, a large absolute number of false positives still produces a small false positive rate, so the ROC curve can stay flatteringly close to the top left even when most positive predictions are wrong. This is why ROC curves can look strong on heavily imbalanced problems where the model is not actually useful.
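Both readings of the AUC, as an area and as a ranking probability, can be checked against each other on a toy example. The scores and labels are hypothetical, and the sweep here runs from strict to permissive, which traces the same curve in the opposite direction.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR), one per distinct threshold, strict to permissive."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_by_trapezoid(points):
    """Area under the piecewise linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_by_ranking(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]   # hypothetical model scores
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc_by_trapezoid(roc_points(scores, labels)), auc_by_ranking(scores, labels))
```

The pairwise version is quadratic in the sample size and is meant only to make the definition concrete; it is not how one would compute AUC at scale.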

63.6.3 6.3 Precision Recall Curves

The precision recall curve plots precision against recall across thresholds and is the right tool under heavy class imbalance precisely because neither axis involves the count of true negatives.

\[\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}\]

Because precision responds directly to false positives among a small positive class, the precision recall curve drops visibly when the model floods its positive predictions with negatives, exactly the failure ROC can mask. The no skill baseline is also more honest: it is a horizontal line at the positive class prevalence rather than a fixed diagonal, so a rare positive class sets a low and clearly marked bar. As a rule, report ROC when classes are roughly balanced or when ranking across the full range matters, and report precision recall when positives are rare and the cost of false positives is the concern. Many practitioners show both.

63.6.3.1 A Worked Example of the Imbalance Trap

The contrast is sharpest in numbers. Suppose a fraud detector is evaluated on $100{,}000$ transactions of which $100$ are fraudulent, a prevalence of $0.1$ percent. At one threshold the model catches $90$ of the $100$ frauds, so $TP = 90$ and $FN = 10$, but it also flags $4{,}000$ legitimate transactions, so $FP = 4{,}000$ and $TN = 95{,}900$. The two rates that ROC uses look reassuring:

\[\mathrm{TPR} = \frac{90}{100} = 0.90, \qquad \mathrm{FPR} = \frac{4{,}000}{99{,}900} \approx 0.040.\]

A point at recall $0.90$ and false positive rate $0.04$ sits comfortably in the upper left of an ROC plot, suggesting an excellent classifier. Precision tells the opposite story:

\[\text{precision} = \frac{90}{90 + 4{,}000} \approx 0.022.\]

Fewer than three in a hundred flagged transactions are actually fraud, so an analyst acting on the alerts wastes almost all of their effort. The false positive rate looked tiny only because its denominator, the $99{,}900$ actual negatives ($FP + TN$), is huge. The precision recall curve would show this point sitting near precision $0.02$, about twenty two times the prevalence baseline of $0.001$ yet still far too low to act on, and would make the problem impossible to miss. Same predictions, same threshold, two completely different impressions: the choice of plot is the choice of which truth you let yourself see.
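The arithmetic of the example is worth reproducing, since all of it follows from the four confusion counts stated above.

```python
# Counts from the worked fraud example.
tp, fn, fp, tn = 90, 10, 4_000, 95_900

tpr = tp / (tp + fn)                 # recall, the ROC vertical axis
fpr = fp / (fp + tn)                 # the ROC horizontal axis
precision = tp / (tp + fp)           # the precision recall vertical axis
prevalence = (tp + fn) / (tp + fn + fp + tn)   # no skill precision baseline

print(f"TPR={tpr:.3f} FPR={fpr:.3f} precision={precision:.3f} "
      f"prevalence={prevalence:.4f}")
```

Changing only `fp` and `tn` while keeping their sum fixed shows how slowly the false positive rate moves compared with precision when negatives dominate.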

63.7 7. Calibration Plots

63.7.1 7.1 What Calibration Means

A model is calibrated if its stated probabilities match observed frequencies: among all cases predicted with probability $0.7$, about $70$ percent should be positive. Calibration is orthogonal to discrimination. A model can rank perfectly, achieving high AUC, while being badly miscalibrated, and a well calibrated model can rank poorly. Both properties matter, and they answer different questions. Discrimination asks whether the ordering is right; calibration asks whether the numbers can be taken at face value, which is what any downstream decision based on expected value requires.

63.7.2 7.2 Reliability Diagrams and Their Pitfalls

The reliability diagram bins predictions by their stated probability and plots, for each bin, the mean predicted probability against the observed positive frequency. Perfect calibration lies on the diagonal. A curve that sags below the diagonal at high probabilities means the model is overconfident, claiming more certainty than the data support, which is the typical pathology of modern deep networks. A curve above the diagonal means underconfidence.

Two pitfalls recur. First, the diagram depends on binning just as a histogram does, and equal width bins can leave high probability bins nearly empty when predictions cluster, making the curve there extremely noisy; equal frequency bins mitigate this. Second, the popular scalar summary, expected calibration error, is a weighted average of the gaps between the curve and the diagonal and is sensitive to bin count; it can also hide compensating errors when overconfident and underconfident predictions fall into the same bin and offset each other. With $B$ bins, where bin $b$ holds $n_b$ of the $n$ predictions with mean confidence $\mathrm{conf}(b)$ and observed accuracy $\mathrm{acc}(b)$, the expected calibration error is

\[\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{n} \,\bigl| \mathrm{acc}(b) - \mathrm{conf}(b) \bigr|.\]

The absolute value is taken per bin and the bins are combined with positive weights, so a bin where $\mathrm{acc} > \mathrm{conf}$ and a bin where $\mathrm{acc} < \mathrm{conf}$ both add to the total rather than cancelling. Within a single coarse bin, however, opposite errors at the two ends do cancel, because they are averaged before the absolute value is taken. Shrinking the bins reduces that within bin cancellation but raises the variance of each $\mathrm{acc}(b)$, which is the familiar bias variance tradeoff of binning reappearing in a calibration metric. Always look at the diagram, not only its summary.
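The sensitivity to binning can be shown with a small sketch of the formula using equal width bins. The predictions are hypothetical: one group is underconfident, the other overconfident, and with a single coarse bin the two errors offset each other completely.

```python
def expected_calibration_error(confidences, correct, n_bins):
    """ECE with equal width bins on [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue                      # empty bins carry zero weight
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

# Hypothetical predictions: confidence 0.6 and always right (underconfident),
# confidence 0.9 and right half the time (overconfident).
confidences = [0.6] * 10 + [0.9] * 10
correct = [1] * 10 + [1] * 5 + [0] * 5
print(expected_calibration_error(confidences, correct, n_bins=1))
print(expected_calibration_error(confidences, correct, n_bins=4))
```

The same predictions score as essentially perfectly calibrated with one bin and badly miscalibrated with four, which is reason enough to inspect the diagram alongside the number.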

When miscalibration is found, post hoc methods such as temperature scaling often straighten the curve. Temperature scaling divides the logits by a single learned positive scalar $T$ before the softmax, replacing $\mathrm{softmax}(z)$ with $\mathrm{softmax}(z/T)$, and fits $T$ by minimizing validation negative log likelihood. Because every logit is divided by the same positive $T$, the relative ordering of the classes within each prediction is untouched, so the predicted labels and the accuracy are unchanged while the confidences are rescaled toward calibration. For a binary classifier the ranking of examples is preserved as well, so the AUC is also unchanged. A value $T > 1$ softens an overconfident model and $T < 1$ sharpens an underconfident one.
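A minimal sketch of temperature scaling follows. The logits and labels are hypothetical, and $T$ is chosen by a coarse grid search over candidate values rather than by the gradient based optimization typically used in practice; the grid is a simplification for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max before exponentiating
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log likelihood of the true labels at a given temperature."""
    return -sum(math.log(softmax(z, temperature)[y])
                for z, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid):
    """Pick the grid value with the lowest validation NLL (a coarse stand-in)."""
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical validation set: always about 99 percent sure, right 3 times in 4.
logit_rows = [[5.0, 0.0]] * 4
labels = [0, 0, 0, 1]
T = fit_temperature(logit_rows, labels, grid=[0.5, 1.0, 2.0, 4.0, 8.0])
before = softmax(logit_rows[0])
after = softmax(logit_rows[0], T)
print(T, max(before), max(after))
```

The fitted temperature exceeds one, the top confidence drops toward the observed accuracy of 0.75, and the predicted class is the same before and after.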

A complementary and binning free check uses the probability integral transform. For a probabilistic forecaster that emits a full predictive cumulative distribution $F_i$ for each case, define $u_i = F_i(y_i)$, the forecast cumulative probability evaluated at the realized outcome. If the forecasts are ideal, the $u_i$ are uniform on $[0,1]$, so a histogram of the $u_i$ should be flat and a QQ plot of them against the uniform should lie on the diagonal. A $\cup$ shaped PIT histogram signals forecasts that are too narrow, an $\cap$ shape signals forecasts that are too wide, and a slope signals a systematic bias in the predicted center.
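The $\cup$ shape for forecasts that are too narrow can be reproduced deterministically. In this sketch the outcomes are placed at exact standard normal quantiles (an idealization standing in for a random sample), and two hypothetical Gaussian forecasters are compared: one with the correct spread and one with half of it.

```python
from statistics import NormalDist

def pit_values(outcomes, mean, sd):
    """u_i = F(y_i) for a Gaussian predictive distribution shared by all cases."""
    forecast = NormalDist(mean, sd)
    return [forecast.cdf(y) for y in outcomes]

def histogram(values, n_bins=5):
    """Counts of values in equal width bins on [0, 1]."""
    counts = [0] * n_bins
    for u in values:
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return counts

n = 100
# Idealized outcomes: exact standard normal quantiles, an assumption.
outcomes = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

calibrated = histogram(pit_values(outcomes, mean=0.0, sd=1.0))
too_narrow = histogram(pit_values(outcomes, mean=0.0, sd=0.5))
print(calibrated)   # flat
print(too_narrow)   # piled up in the first and last bins
```

Shifting the forecast mean instead of shrinking its spread tilts the histogram to one side, the slope pattern that indicates bias in the predicted center.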

63.8 8. Practical Principles for Honest Graphics

63.8.1 8.1 Avoiding Self Deception

The order of operations matters. Inspect distributions and residuals before trusting any aggregate metric, and look at validation diagnostics on data the model never touched, since every plot in this chapter can be made to look perfect on the training set. Beware the axis that flatters: a truncated vertical axis exaggerates trivial differences, a linear axis crushes the tails and late stage training behavior that a log axis reveals, and a smoothed curve can hide the variance that tells you whether a difference is real. When comparing models, plot their curves on shared axes rather than judging from separate panels, because the eye is a poor calibrator across plots.

63.8.2 8.2 A Minimal Diagnostic Workflow

A disciplined sequence covers most needs without producing a flood of plots. Begin with per feature distributions and a target distribution to find skew, outliers, and imbalance. For regression, fit a simple baseline and study residuals against fitted values and against key features. For classification, build the confusion matrix with both normalizations, then choose ROC or precision recall by the prevalence of the positive class, and add a calibration plot if the probabilities will drive decisions. Throughout training, watch epoch learning curves on a log scale for convergence and overfitting. Each plot answers a specific question, and the value of the workflow comes from asking the questions in an order that lets each answer inform the next.

1. distributions   -> skew, outliers, imbalance
2. residuals       -> functional form, variance, bias
3. confusion + PR/ROC -> classifier behavior by class
4. calibration     -> are the probabilities trustworthy
5. learning curves -> convergence and overfitting

The same workflow can be read as a decision tree that routes each diagnostic question to the plot that answers it.

flowchart TD
    A["Inspect raw data and target"] --> B{"Task type"}
    B -->|"Regression"| C["Residuals vs fitted"]
    B -->|"Classification"| D["Confusion matrix, both normalizations"]
    C --> E{"Structure in residuals"}
    E -->|"Curve"| F["Fix functional form"]
    E -->|"Fan"| G["Address heteroskedasticity"]
    E -->|"Flat band"| H["Functional form looks adequate"]
    D --> I{"Positive class rare"}
    I -->|"Yes"| J["Precision recall curve"]
    I -->|"No"| K["ROC curve"]
    J --> L{"Probabilities drive decisions"}
    K --> L
    L -->|"Yes"| M["Calibration plot"]
    L -->|"No"| N["Stop at ranking diagnostics"]
    M --> O["Watch epoch learning curves throughout"]
    N --> O
    F --> O
    G --> O
    H --> O

63.8.3 8.3 When to Use Each Plot, and What Can Go Wrong

The table condenses the chapter into a quick reference. Each row pairs a plot with the question it answers and the failure that most often misleads a reader of it.

Plot	Primary question	Most common pitfall
Histogram or KDE	What is the shape of one variable	Bin width or bandwidth manufactures or hides modes; KDE leaks across hard boundaries
Log log survival	Is the tail heavy, and how heavy	A short straight stretch is read as proof of a power law
QQ plot	Does the shape match a reference	Judging point by point instead of reading the curvature pattern
Residuals vs fitted	Is the regression well specified	Inspecting on training data, where structure can be absorbed
Sample size learning curve	Would more data help	Stopping the curve before it has plateaued
Epoch learning curve	Is training converging or overfitting	Reading a single noisy uptick as overfitting; linear loss axis hides late behavior
Confusion matrix	Which classes are confused	Quoting one normalization as if it were the whole story
ROC curve	How good is the ranking	Looks strong under heavy imbalance even when precision is poor
Precision recall curve	Is the positive class found cleanly	Forgetting the baseline is prevalence, not one half
Calibration plot	Can the probabilities be trusted	Trusting ECE alone, which is bin sensitive and hides compensating errors

The thread running through every section is the same. A plot is worth making only if some shape in it would change your next decision, and it is worth reading correctly only if you know in advance which shapes mean which problems. Memorize the vocabulary of departures, distrust the flattering axis, and look before you summarize.

63.9 References

Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician. https://www.jstor.org/stable/2682899
Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator. https://link.springer.com/article/10.1007/BF01025868
Wilke, C. O. (2019). Fundamentals of Data Visualization. https://clauswilke.com/dataviz/
Cleveland, W. S. (1993). Visualizing Data. https://www.stat.purdue.edu/~wsc/visualizing.html
scikit-learn Developers. Model Evaluation: Visualizations and Metrics. https://scikit-learn.org/stable/modules/model_evaluation.html
Saito, T. and Rehmsmeier, M. (2015). The Precision Recall Plot Is More Informative than the ROC Plot on Imbalanced Datasets. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
Niculescu-Mizil, A. and Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
Guo, C. et al. (2017). On Calibration of Modern Neural Networks. https://arxiv.org/abs/1706.04599
Fawcett, T. (2006). An Introduction to ROC Analysis. https://www.sciencedirect.com/science/article/abs/pii/S016786550500303X
Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. https://doi.org/10.1111/j.1467-9868.2007.00587.x
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Cook, R. D. (1977). Detection of Influential Observation in Linear Regression. Technometrics, 19(1), 15-18. https://doi.org/10.1080/00401706.1977.10489493
Diebold, F. X., Gunther, T. A., and Tay, A. S. (1998). Evaluating Density Forecasts, with Applications to Financial Risk Management. International Economic Review, 39(4), 863-883. https://doi.org/10.2307/2527342
Scott, D. W. (1979). On Optimal and Data-Based Histograms. Biometrika, 66(3), 605-610. https://doi.org/10.1093/biomet/66.3.605

# Statistical Graphics for AI Numbers summarize, but pictures reveal. A single accuracy figure can hide a model that fails badly on a minority class, a regression that systematically underpredicts large values, or a training run that stopped improving thousands of steps ago. Statistical graphics are the instruments through which a practitioner inspects data and models before, during, and after fitting. This chapter surveys the plots that earn their place in a machine learning workflow and, more importantly, teaches you how to read them so that you draw correct conclusions rather than comforting ones. The guiding idea, due to Tukey, is that exploratory graphics exist to force us to notice what we did not expect to see. A plot is not a summary that confirms a hypothesis we already hold; it is a search for the structure that our model and our metrics have failed to account for. Everything below is organized around one question: which shape, if it appeared, would change the next thing you do? ## 1. Why Graphics Belong in the Machine Learning Loop ### 1.1 The Limits of Scalar Summaries Anscombe's quartet remains the canonical warning. Four datasets share nearly identical means, variances, correlation, and ordinary least squares fit, yet they look entirely different when plotted: one is linear with noise, one is curved, one is linear with a single outlier, and one is a vertical cluster with one leverage point. The lesson generalizes directly to machine learning. Aggregate metrics compress a high dimensional object into a scalar, and compression is lossy by construction. The question is never whether information is lost but whether the lost information mattered. Graphics restore some of that lost structure. A residual plot exposes the curvature that a single $R^2$ conceals. A calibration plot exposes the overconfidence that accuracy ignores. A learning curve distinguishes a model that needs more data from one that needs more capacity. 
None of these distinctions is visible in the headline number, and all of them change what you do next. It is worth being concrete about why scalar summaries are lossy in a way that is not merely philosophical. A summary statistic $T$ is a many to one map from the space of datasets to the real line, so the preimage $T^{-1}(t)$, the set of all datasets producing the same value $t$, is typically enormous. Anscombe's quartet is simply four members of one such preimage chosen to look different. A graphic is a richer, higher dimensional map: a histogram of $n$ points retains the full empirical distribution up to bin resolution, and a scatterplot retains every pair. The art of diagnostics is choosing maps whose preimages separate the failure modes you care about. ### 1.2 Graphics as Diagnostics, Not Decoration It helps to separate two uses of plots. Explanatory graphics communicate a finished result to an audience and should be polished, labeled, and selective. Exploratory and diagnostic graphics serve the analyst and should be fast, plentiful, and disposable. Most of the plots in this chapter are diagnostic. You make dozens, glance at each for a few seconds, and keep the two or three that surprised you. Treating diagnostics with the ceremony of a publication figure is a common way to make too few of them. ## 2. Looking at Distributions ### 2.1 Histograms, Densities, and Their Failure Modes The histogram is the first plot most people reach for, and the most easily misread. Its appearance depends on two arbitrary choices: bin width and bin origin. Too few bins smooth away real structure such as bimodality; too many bins turn sampling noise into spurious spikes. A useful default is the Freedman Diaconis rule, which sets bin width to $2 \cdot \mathrm{IQR} \cdot n^{-1/3}$, where $\mathrm{IQR}$ is the interquartile range and $n$ the sample size. This scales the resolution to both the spread of the data and the amount of it. 
The cube root rate is not arbitrary: it is the bin width that asymptotically minimizes the mean integrated squared error of the histogram as a density estimator, and the use of the interquartile range in place of the standard deviation makes the rule robust to the heavy tails common in machine learning features.

Kernel density estimates avoid the bin origin problem by placing a smooth kernel at each observation. The estimator is

$$\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $K$ is a kernel that integrates to one, typically Gaussian, and $h$ is the bandwidth. The bandwidth plays exactly the role that bin width plays for a histogram, and choosing it badly produces the same two failures: a small $h$ tracks sampling noise, a large $h$ oversmooths away real modes.

Density estimates carry an additional hazard. They can invent support where there is none, drawing smooth tails into negative values for a strictly positive quantity such as a duration or a count, because the kernel placed near a boundary spills probability mass across it. When a variable has a hard boundary, prefer a histogram, a log transform before estimation, or an explicitly bounded estimator such as a reflected or beta kernel.

### 2.2 Boxplots, Violins, and the Tails

Boxplots compress a distribution to five numbers and flag points beyond $1.5 \times \mathrm{IQR}$ from the quartiles as candidate outliers. They are excellent for comparing many groups side by side and poor at revealing shape, since a bimodal distribution and a uniform one can produce identical boxes. Violin plots and their cousins overlay a density to recover shape information, which matters when you suspect mixtures, for example a feature whose distribution differs between two latent subpopulations.

For machine learning the most consequential part of a distribution is often the tail. Heavy tails break the implicit assumptions of many models and inflate the variance of gradient estimates.
A log scaled histogram or a plot of the empirical survival function $\hat{S}(x) = 1 - \hat{F}(x)$ on log axes makes tail behavior legible. The reason a power law tail straightens on log log axes is immediate: if $S(x) \propto x^{-\alpha}$ for large $x$, then $\log S(x) = c - \alpha \log x$, a line of slope $-\alpha$. So the survival plot does double duty, both confirming the heavy tail and reading off its exponent. This is a far more reliable diagnostic than staring at a linear scale histogram whose tail is crushed against the axis, where dozens of distinct tail behaviors all look like the same flat sliver. A caution worth stating: a straight stretch on a log log plot is necessary but not sufficient for a power law, since other heavy tailed laws can mimic linearity over a limited range, so treat the slope as a description rather than a proof.

## 3. QQ Plots and the Question of Shape

### 3.1 Construction and Reading

A quantile quantile plot compares the quantiles of your data against the quantiles of a reference distribution, most often the standard normal. You sort the observations, assign each the cumulative probability $p_i = (i - 0.5)/n$, and plot the sample value against the theoretical quantile $\Phi^{-1}(p_i)$. If the data follow the reference distribution up to location and scale, the points fall on a straight line. The intercept reflects the mean and the slope reflects the standard deviation, so the line itself carries no diagnostic weight; only departures from it do.

The shapes of those departures are a vocabulary worth memorizing. Points that bend upward at the right and downward at the left, forming an S that is steeper than the reference at both ends, indicate heavy tails. The opposite curvature indicates light tails. A consistent C shape indicates skew. Isolated points far off the line at one end are outliers. Reading a QQ plot is reading these patterns, not checking whether every point sits exactly on the line, because finite samples always wobble.
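
The construction in section 3.1 can be sketched in a few lines of standard library Python. This is a minimal illustration rather than a reference implementation: the helper names `normal_qq_points`, `fit_line`, and `laplace_quantile` are hypothetical, the Gaussian parameters are arbitrary, and exact Laplace quantiles stand in for a heavy tailed sample so that the result is deterministic. The code returns the coordinates only; any plotting library can draw them.

```python
import math
import random
from statistics import NormalDist, mean


def normal_qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal QQ plot."""
    observed = sorted(sample)
    n = len(observed)
    ref = NormalDist()  # standard normal reference distribution
    # Plotting positions p_i = (i - 0.5) / n for i = 1..n, as in the text.
    theoretical = [ref.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed


def fit_line(x, y):
    """Ordinary least squares slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


# Gaussian sample with mean 5 and standard deviation 2 (illustrative values).
rng = random.Random(0)
gaussian = [rng.gauss(5.0, 2.0) for _ in range(2000)]
theo, obs = normal_qq_points(gaussian)
slope, intercept = fit_line(theo, obs)
print(f"slope (approx. sd): {slope:.2f}, intercept (approx. mean): {intercept:.2f}")


def laplace_quantile(p):
    """Quantile function of the standard Laplace distribution."""
    return math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p))


# Heavy tailed case: exact Laplace quantiles at the same plotting positions.
n = 2000
laplace = [laplace_quantile((i - 0.5) / n) for i in range(1, n + 1)]
theo_l, obs_l = normal_qq_points(laplace)
s, c = fit_line(theo_l, obs_l)
left_gap = obs_l[0] - (c + s * theo_l[0])     # leftmost point minus line
right_gap = obs_l[-1] - (c + s * theo_l[-1])  # rightmost point minus line
print(f"left gap: {left_gap:.2f}, right gap: {right_gap:.2f}")
```

For the Gaussian sample the fitted slope and intercept land near the standard deviation and mean, which is why the line itself is uninformative. For the Laplace quantiles the leftmost point falls below the fitted line and the rightmost point above it, the S shaped heavy tail signature described above.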

### 3.2 Where QQ Plots Earn Their Keep in ML

Normality of raw features is rarely required by modern models, so a QQ plot of an input feature is mostly an exploratory tool that suggests transformations, such as a log transform to straighten a right skewed C shape. The higher value use is on residuals and on calibration of probabilistic predictions. Many uncertainty estimates assume Gaussian errors, and a QQ plot of standardized residuals tests that assumption directly. For a probabilistic forecaster you can construct a QQ style plot of the probability integral transform values, which should be uniform if the predictive distributions are well calibrated, turning an abstract calibration claim into a visible diagonal.

## 4. Residual Plots for Regression

### 4.1 The Core Diagnostic

The single most informative plot for a regression model is residuals against fitted values. Define the residual as $r_i = y_i - \hat{y}_i$ and plot $r_i$ on the vertical axis against $\hat{y}_i$ on the horizontal axis. Under a well specified model with homoskedastic errors, this plot should look like structureless noise: a horizontal band centered on zero with constant vertical spread.

Every departure from that ideal names a specific problem. A curved trend, often a U or inverted U, means the model has missed nonlinearity and the functional form is wrong. A fan shape, where spread grows with the fitted value, means heteroskedasticity, so the error variance depends on the prediction and ordinary confidence intervals will be miscalibrated. A residual band that sits above or below zero in some region means local bias. Distinct horizontal stripes can indicate a discrete or censored target. Because the plot maps each symptom to a cause, it is the regression analogue of a stethoscope.
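
The curve and fan signatures can also be checked numerically, which helps when there are too many fits to eyeball. The sketch below is one possible approach under stated assumptions, not a standard routine: `residual_summary` is a hypothetical helper that fits a straight line, sorts the residuals by fitted value, and reports the mean and spread of the residuals in three equal count bins. The two synthetic datasets are invented for illustration.

```python
import random
from statistics import mean, pstdev


def fit_line(x, y):
    """Ordinary least squares slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def residual_summary(x, y, bins=3):
    """Fit y = a + b*x, then return (mean, sd) of residuals per fitted-value bin."""
    slope, intercept = fit_line(x, y)
    fitted = [intercept + slope * a for a in x]
    resid = [b - f for b, f in zip(y, fitted)]  # r_i = y_i - yhat_i
    order = sorted(range(len(x)), key=lambda i: fitted[i])
    size = len(order) // bins
    summary = []
    for k in range(bins):
        # The last bin absorbs any remainder so every point is counted once.
        idx = order[k * size:] if k == bins - 1 else order[k * size:(k + 1) * size]
        r = [resid[i] for i in idx]
        summary.append((mean(r), pstdev(r)))
    return summary


rng = random.Random(1)
x = [2 * i / 299 for i in range(300)]  # 300 evenly spaced points on [0, 2]

# Quadratic truth fitted with a line: residual means should trace a U.
y_curved = [a * a + rng.gauss(0, 0.05) for a in x]
curved = residual_summary(x, y_curved)

# Linear truth with noise that grows with x: residual spread should fan out.
y_fan = [2 * a + rng.gauss(0, 0.1 + 0.5 * a) for a in x]
fan = residual_summary(x, y_fan)

print("curved bin means:", [round(m, 3) for m, _ in curved])
print("fan bin sds:     ", [round(s, 3) for _, s in fan])
```

In the curved case the bin means run positive, negative, positive, which is the U shape of a missed quadratic term. In the fan case the bin means stay near zero while the standard deviation climbs from the first bin to the last.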

```text
residual = actual minus predicted
plot residual (y axis) vs predicted (x axis)
ideal:  flat random band around zero
curve  -> wrong functional form
fan    -> non constant variance
offset -> local bias
```

### 4.2 Companion Residual Plots

Plotting residuals against each individual feature, including features not in the model, finds structure the model failed to capture. A trend against an omitted variable is direct evidence that it belongs in the model. A scale location plot, which graphs the square root of the absolute standardized residual against fitted values, isolates the variance trend and makes mild heteroskedasticity easier to see than the raw residual plot does. For influence, a plot of residuals against leverage, with Cook's distance contours overlaid, separates points that are merely unusual in $y$ from points that actually move the fit.

Leverage measures how extreme an observation is in feature space. For a linear model with design matrix $X$, the fitted values are $\hat{y} = Hy$ where $H = X(X^\top X)^{-1} X^\top$ is the hat matrix, and the leverage of observation $i$ is its diagonal entry $h_{ii}$, which equals $\partial \hat{y}_i / \partial y_i$, the sensitivity of a point's own fitted value to its own observed value. Leverages satisfy $0 \le h_{ii} \le 1$ and sum to the number of parameters $p$, so the average leverage is $p/n$ and a common rule flags points with $h_{ii} > 2p/n$. A point with high leverage and a large residual is the kind of single observation that can swing a coefficient. Cook's distance combines both,

$$D_i = \frac{r_i^2}{p\, \hat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},$$

so it grows with both the residual $r_i$ and the leverage $h_{ii}$ and measures how far the full vector of fitted values moves when observation $i$ is deleted. In large models you rarely inspect individual points this way, but for linear baselines and for debugging suspicious slices it remains valuable.

## 5. Learning Curves

### 5.1 Two Different Plots With the Same Name

The phrase learning curve denotes two distinct diagnostics, and conflating them causes confusion. The first plots a performance metric against the number of training examples, holding the model fixed. The second plots a metric against training progress, meaning epochs or optimizer steps, holding the dataset fixed. The first answers whether more data would help. The second answers whether training is converging, overfitting, or unstable. Always know which one you are looking at.

### 5.2 Sample Size Curves and the Bias Variance Story

In the sample size version you plot both training and validation error as the training set grows. The classic reading: if training and validation error converge to a high value, the model is underfitting and is limited by bias, so more data will not help and you need a richer model or better features. If a wide gap persists, with low training error and high validation error, the model is overfitting and is limited by variance, so more data, regularization, or a simpler model should help. The gap between the two curves is a direct visual estimate of the generalization gap.

### 5.3 Epoch Curves and Early Stopping

The training progress version plots loss or metric per epoch for training and validation. The signature of overfitting is a validation curve that descends, reaches a minimum, then rises while the training curve keeps falling. The minimum of the validation curve is the early stopping point.

Be cautious about reading these curves too literally. Validation curves are noisy, so a single uptick is not evidence of overfitting; use a patience window. Learning rate schedules create characteristic shapes, such as a sharp drop when the rate decays, that are properties of the optimizer rather than the model. Plotting loss on a log scale often makes the late stage behavior, where the interesting decisions live, far more legible.

## 6. Classifier Diagnostics

### 6.1 Confusion Matrices

A confusion matrix tabulates predicted class against true class. For binary problems its four cells are true positives, false positives, true negatives, and false negatives, and almost every classification metric is a ratio of these counts.

Reading a confusion matrix well means normalizing it deliberately. With rows indexing the true class and columns indexing the predicted class, normalizing each row by its sum gives recall per class, answering what fraction of each true class was caught. Normalizing each column gives precision, answering what fraction of each prediction was correct. If your tool lays the matrix out the other way round, the two roles swap. The raw counts and these two normalizations answer different questions, and quoting only one is a frequent source of misleading claims.

On imbalanced data the raw matrix is dominated by the majority class and can look excellent while the minority class is barely detected. Row normalization is the antidote, since it exposes a minority recall of, say, $0.2$ that the overall accuracy of $0.95$ hid. For multiclass problems the off diagonal cells form a map of which classes get confused with which, and clusters of confusion frequently reveal genuine label ambiguity or related categories rather than model failure.

### 6.2 ROC Curves

The receiver operating characteristic curve traces the true positive rate against the false positive rate as the decision threshold sweeps from permissive to strict. Each threshold yields one point; the curve is their locus. The diagonal represents random guessing, and the area under the curve, the AUC, equals the probability that the model ranks a random positive above a random negative. The AUC is therefore a measure of ranking quality that does not depend on any single threshold or operating point.

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

The crucial subtlety is that the false positive rate has the count of all actual negatives, $FP + TN$, in its denominator.
When negatives vastly outnumber positives, a large absolute number of false positives still produces a small false positive rate, so the ROC curve can stay flatteringly close to the top left even when most positive predictions are wrong. This is why ROC curves can look strong on heavily imbalanced problems where the model is not actually useful.

### 6.3 Precision Recall Curves

The precision recall curve plots precision against recall across thresholds and is the right tool under heavy class imbalance precisely because neither axis involves the count of true negatives.

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$

Because precision responds directly to false positives among a small positive class, the precision recall curve drops visibly when the model floods its positive predictions with negatives, exactly the failure ROC can mask. The no skill baseline is also more honest: it is a horizontal line at the positive class prevalence rather than a fixed diagonal, so a rare positive class sets a low and clearly marked bar.

As a rule, report ROC when classes are roughly balanced or when ranking across the full range matters, and report precision recall when positives are rare and the cost of false positives is the concern. Many practitioners show both.

#### A Worked Example of the Imbalance Trap

The contrast is sharpest in numbers. Suppose a fraud detector is evaluated on $100{,}000$ transactions of which $100$ are fraudulent, a prevalence of $0.1$ percent. At one threshold the model catches $90$ of the $100$ frauds, so $TP = 90$ and $FN = 10$, but it also flags $4{,}000$ legitimate transactions, so $FP = 4{,}000$ and $TN = 95{,}900$. The two rates that ROC uses look reassuring:

$$\mathrm{TPR} = \frac{90}{100} = 0.90, \qquad \mathrm{FPR} = \frac{4{,}000}{99{,}900} \approx 0.040.$$

A point at recall $0.90$ and false positive rate $0.04$ sits comfortably in the upper left of an ROC plot, suggesting an excellent classifier.
Precision tells the opposite story:

$$\text{precision} = \frac{90}{90 + 4{,}000} \approx 0.022.$$

Fewer than three in a hundred flagged transactions are actually fraud, so an analyst acting on the alerts wastes almost all of their effort. The false positive rate looked tiny only because its denominator, the $99{,}900$ actual negatives, is huge. The precision recall curve would show this point sitting near precision $0.02$, close to the floor of the plot and not far in absolute terms above the prevalence baseline of $0.001$, and would make the problem impossible to miss. Same predictions, same threshold, two completely different impressions: the choice of plot is the choice of which truth you let yourself see.

## 7. Calibration Plots

### 7.1 What Calibration Means

A model is calibrated if its stated probabilities match observed frequencies: among all cases predicted with probability $0.7$, about $70$ percent should be positive. Calibration is orthogonal to discrimination. A model can rank perfectly, achieving high AUC, while being badly miscalibrated, and a well calibrated model can rank poorly. Both properties matter, and they answer different questions. Discrimination asks whether the ordering is right; calibration asks whether the numbers can be taken at face value, which is what any downstream decision based on expected value requires.

### 7.2 Reliability Diagrams and Their Pitfalls

The reliability diagram bins predictions by their stated probability and plots, for each bin, the mean predicted probability against the observed positive frequency. Perfect calibration lies on the diagonal. A curve that sags below the diagonal at high probabilities means the model is overconfident, claiming more certainty than the data support, which is the typical pathology of modern deep networks. A curve above the diagonal means underconfidence.

Two pitfalls recur.
First, the diagram depends on binning just as a histogram does, and equal width bins can leave high probability bins nearly empty when predictions cluster, making the curve there extremely noisy; equal frequency bins mitigate this. Second, the popular scalar summary, expected calibration error, is a weighted average of the gaps between the curve and the diagonal and is sensitive to bin count; it can also hide compensating errors when overconfident and underconfident predictions fall in the same bin.

With $B$ bins, where bin $b$ holds $n_b$ predictions with mean confidence $\mathrm{conf}(b)$ and observed accuracy $\mathrm{acc}(b)$, the expected calibration error is

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{n} \,\bigl| \mathrm{acc}(b) - \mathrm{conf}(b) \bigr|.$$

Because the absolute value is taken bin by bin and the bins are summed with positive weights, a bin where $\mathrm{acc} > \mathrm{conf}$ and a bin where $\mathrm{acc} < \mathrm{conf}$ both add to the total rather than cancelling. Within a single coarse bin, however, opposite errors at the two ends do cancel before the absolute value is taken. Shrinking the bins reduces that within bin cancellation but raises the variance of each $\mathrm{acc}(b)$, which is the familiar bias variance tradeoff of binning reappearing in a calibration metric. Always look at the diagram, not only its summary.

When miscalibration is found, post hoc methods such as temperature scaling often straighten the curve. Temperature scaling divides the logits by a single learned positive scalar $T$ before the softmax, replacing $\mathrm{softmax}(z)$ with $\mathrm{softmax}(z/T)$, and fits $T$ by minimizing validation negative log likelihood. Because the same $T$ is applied to every logit, the relative ordering of the classes is untouched, so the predicted labels, and in the binary case the AUC, are unchanged while the confidences are rescaled toward calibration. A value $T > 1$ softens an overconfident model and $T < 1$ sharpens an underconfident one.
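
Both points can be made concrete with a small standard library sketch. It is illustrative only: `expected_calibration_error` and `softmax` are hypothetical helpers written for this example, the bins are equal width by assumption, the eight predictions and the logits are invented, and the temperature is set by hand rather than fitted on validation data.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal width bins [k/B, (k+1)/B); the last bin includes 1.0."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        k = min(int(conf * n_bins), n_bins - 1)
        bins[k].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Four overconfident predictions (0.9 stated, half correct) and four
# underconfident ones (0.6 stated, all correct). Illustrative values only.
conf = [0.9] * 4 + [0.6] * 4
hit = [1, 1, 0, 0, 1, 1, 1, 1]
ece_fine = expected_calibration_error(conf, hit, n_bins=10)   # about 0.4
ece_coarse = expected_calibration_error(conf, hit, n_bins=1)  # about 0
print(f"ECE with 10 bins: {ece_fine:.3f}, with 1 bin: {ece_coarse:.3f}")

# Temperature scaling with a hand-picked T: the top class stays the same,
# only the confidence changes.
logits = [4.0, 1.0, 0.0]
p_raw = softmax(logits)
p_soft = softmax(logits, temperature=2.0)
print(f"max prob at T=1: {max(p_raw):.3f}, at T=2: {max(p_soft):.3f}")
```

With ten bins the two groups sit in separate bins and each contributes a gap of 0.4. With a single bin the mean confidence and the accuracy are both 0.75, so the opposite errors cancel and the summary reports almost perfect calibration. In the temperature example the first class remains the argmax at both temperatures while its probability falls when $T = 2$.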

A complementary and binning free check uses the probability integral transform. For a probabilistic forecaster that emits a full predictive cumulative distribution $F_i$ for each case, define $u_i = F_i(y_i)$, the forecast cumulative probability evaluated at the realized outcome. If the forecasts are ideal, the $u_i$ are uniform on $[0,1]$, so a histogram of the $u_i$ should be flat and a QQ plot of them against the uniform should lie on the diagonal. A $\cup$ shaped PIT histogram signals forecasts that are too narrow, an $\cap$ shape signals forecasts that are too wide, and a slope signals a systematic bias in the predicted center.

## 8. Practical Principles for Honest Graphics

### 8.1 Avoiding Self Deception

The order of operations matters. Inspect distributions and residuals before trusting any aggregate metric, and look at validation diagnostics on data the model never touched, since every plot in this chapter can be made to look perfect on the training set. Beware the axis that flatters: a truncated vertical axis exaggerates trivial differences, a linear axis crushes the tails and late stage training behavior that a log axis reveals, and a smoothed curve can hide the variance that tells you whether a difference is real. When comparing models, plot their curves on shared axes rather than judging from separate panels, because the eye is a poor calibrator across plots.

### 8.2 A Minimal Diagnostic Workflow

A disciplined sequence covers most needs without producing a flood of plots. Begin with per feature distributions and a target distribution to find skew, outliers, and imbalance. For regression, fit a simple baseline and study residuals against fitted values and against key features. For classification, build the confusion matrix with both normalizations, then choose ROC or precision recall by the prevalence of the positive class, and add a calibration plot if the probabilities will drive decisions.
Throughout training, watch epoch learning curves on a log scale for convergence and overfitting. Each plot answers a specific question, and the value of the workflow comes from asking the questions in an order that lets each answer inform the next.

```text
1. distributions       -> skew, outliers, imbalance
2. residuals           -> functional form, variance, bias
3. confusion + PR/ROC  -> classifier behavior by class
4. calibration         -> are the probabilities trustworthy
5. learning curves     -> convergence and overfitting
```

The same workflow can be read as a decision tree that routes each diagnostic question to the plot that answers it.

```{mermaid}
flowchart TD
    A["Inspect raw data and target"] --> B{"Task type"}
    B -->|"Regression"| C["Residuals vs fitted"]
    B -->|"Classification"| D["Confusion matrix, both normalizations"]
    C --> E{"Structure in residuals"}
    E -->|"Curve"| F["Fix functional form"]
    E -->|"Fan"| G["Address heteroskedasticity"]
    E -->|"Flat band"| H["Functional form looks adequate"]
    D --> I{"Positive class rare"}
    I -->|"Yes"| J["Precision recall curve"]
    I -->|"No"| K["ROC curve"]
    J --> L{"Probabilities drive decisions"}
    K --> L
    L -->|"Yes"| M["Calibration plot"]
    L -->|"No"| N["Stop at ranking diagnostics"]
    M --> O["Watch epoch learning curves throughout"]
    N --> O
    F --> O
    G --> O
    H --> O
```

### 8.3 When to Use Each Plot, and What Can Go Wrong

The table condenses the chapter into a quick reference. Each row pairs a plot with the question it answers and the failure that most often misleads a reader of it.

| Plot | Primary question | Most common pitfall |
|------|------------------|---------------------|
| Histogram or KDE | What is the shape of one variable | Bin width or bandwidth manufactures or hides modes; KDE leaks across hard boundaries |
| Log log survival | Is the tail heavy, and how heavy | A short straight stretch is read as proof of a power law |
| QQ plot | Does the shape match a reference | Judging point by point instead of reading the curvature pattern |
| Residuals vs fitted | Is the regression well specified | Inspecting on training data, where structure can be absorbed |
| Sample size learning curve | Would more data help | Stopping the curve before it has plateaued |
| Epoch learning curve | Is training converging or overfitting | Reading a single noisy uptick as overfitting; linear loss axis hides late behavior |
| Confusion matrix | Which classes are confused | Quoting one normalization as if it were the whole story |
| ROC curve | How good is the ranking | Looks strong under heavy imbalance even when precision is poor |
| Precision recall curve | Is the positive class found cleanly | Forgetting the baseline is prevalence, not one half |
| Calibration plot | Can the probabilities be trusted | Trusting ECE alone, which is bin sensitive and hides compensating errors |

The thread running through every section is the same. A plot is worth making only if some shape in it would change your next decision, and it is worth reading correctly only if you know in advance which shapes mean which problems. Memorize the vocabulary of departures, distrust the flattering axis, and look before you summarize.

## References

1. Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician. https://www.jstor.org/stable/2682899
2. Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator. https://link.springer.com/article/10.1007/BF01025868
3. Wilke, C. O. (2019). Fundamentals of Data Visualization. https://clauswilke.com/dataviz/
4. Cleveland, W. S. (1993). Visualizing Data. https://www.stat.purdue.edu/~wsc/visualizing.html
5. scikit-learn Developers. Model Evaluation: Visualizations and Metrics. https://scikit-learn.org/stable/modules/model_evaluation.html
6. Saito, T. and Rehmsmeier, M. (2015). The Precision Recall Plot Is More Informative than the ROC Plot on Imbalanced Datasets. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
7. Niculescu-Mizil, A. and Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
8. Guo, C. et al. (2017). On Calibration of Modern Neural Networks. https://arxiv.org/abs/1706.04599
9. Fawcett, T. (2006). An Introduction to ROC Analysis. https://www.sciencedirect.com/science/article/abs/pii/S016786550500303X
10. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. https://doi.org/10.1111/j.1467-9868.2007.00587.x
11. Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
12. Cook, R. D. (1977). Detection of Influential Observation in Linear Regression. Technometrics, 19(1), 15-18. https://doi.org/10.1080/00401706.1977.10489493
13. Diebold, F. X., Gunther, T. A., and Tay, A. S. (1998). Evaluating Density Forecasts, with Applications to Financial Risk Management. International Economic Review, 39(4), 863-883. https://doi.org/10.2307/2527342
14. Scott, D. W. (1979). On Optimal and Data-Based Histograms. Biometrika, 66(3), 605-610. https://doi.org/10.1093/biomet/66.3.605