63  Statistical Graphics for AI

Numbers summarize, but pictures reveal. A single accuracy figure can hide a model that fails badly on a minority class, a regression that systematically underpredicts large values, or a training run that stopped improving thousands of steps ago. Statistical graphics are the instruments through which a practitioner inspects data and models before, during, and after fitting. This chapter surveys the plots that earn their place in a machine learning workflow and, more importantly, teaches you how to read them so that you draw correct conclusions rather than comforting ones.

63.1 1. Why Graphics Belong in the Machine Learning Loop

63.1.1 1.1 The Limits of Scalar Summaries

Anscombe’s quartet remains the canonical warning. Four datasets share nearly identical means, variances, correlation, and ordinary least squares fit, yet they look entirely different when plotted: one is linear with noise, one is curved, one is linear with a single outlier, and one is a vertical cluster with one leverage point. The lesson generalizes directly to machine learning. Aggregate metrics compress a high dimensional object into a scalar, and compression is lossy by construction. The question is never whether information is lost but whether the lost information mattered.

Graphics restore some of that lost structure. A residual plot exposes the curvature that a single \(R^2\) conceals. A calibration plot exposes the overconfidence that accuracy ignores. A learning curve distinguishes a model that needs more data from one that needs more capacity. None of these distinctions is visible in the headline number, and all of them change what you do next.

63.1.2 1.2 Graphics as Diagnostics, Not Decoration

It helps to separate two uses of plots. Explanatory graphics communicate a finished result to an audience and should be polished, labeled, and selective. Exploratory and diagnostic graphics serve the analyst and should be fast, plentiful, and disposable. Most of the plots in this chapter are diagnostic. You make dozens, glance at each for a few seconds, and keep the two or three that surprised you. Treating diagnostics with the ceremony of a publication figure is a common way to make too few of them.

63.2 2. Looking at Distributions

63.2.1 2.1 Histograms, Densities, and Their Failure Modes

The histogram is the first plot most people reach for, and the most easily misread. Its appearance depends on two arbitrary choices: bin width and bin origin. Too few bins smooth away real structure such as bimodality; too many bins turn sampling noise into spurious spikes. A useful default is the Freedman Diaconis rule, which sets bin width to \(2 \cdot \mathrm{IQR} \cdot n^{-1/3}\), where \(\mathrm{IQR}\) is the interquartile range and \(n\) the sample size. This scales the resolution to both the spread of the data and the amount of it.
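The rule is short enough to compute by hand. The following is a minimal sketch on synthetic data; the helper name `fd_bin_width` and the normal sample are illustrative assumptions, not part of any library.

```python
import numpy as np

def fd_bin_width(x):
    """Freedman Diaconis bin width: 2 * IQR * n^(-1/3)."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    return 2.0 * (q75 - q25) * x.size ** (-1.0 / 3.0)

rng = np.random.default_rng(0)            # synthetic data, for illustration only
sample = rng.normal(size=1000)

width = fd_bin_width(sample)
# Convert the width into a bin count that covers the observed range.
n_bins = int(np.ceil((sample.max() - sample.min()) / width))
counts, edges = np.histogram(sample, bins=n_bins)
```

NumPy also accepts `bins="fd"` in `np.histogram`, which applies the same width formula; the explicit version is shown only to make the two ingredients, spread and sample size, visible.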

Kernel density estimates avoid the bin origin problem by placing a smooth kernel at each observation, but they substitute a bandwidth choice that plays the same role as bin width. A density estimate can also invent support where there is none, drawing smooth tails into negative values for a strictly positive quantity such as a duration or a count. When a variable has a hard boundary, prefer a histogram or an explicitly bounded estimator.
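The size of the boundary leak can be computed directly, without drawing anything. The sketch below assumes a Gaussian kernel and the normal reference bandwidth \(1.06\,\hat\sigma\,n^{-1/5}\); both are illustrative choices rather than anything prescribed above, and the exponential sample stands in for a strictly positive duration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(10)                    # synthetic data
durations = rng.exponential(scale=1.0, size=500)   # strictly positive by construction

# Normal reference bandwidth (an assumed default for this sketch).
h = 1.06 * durations.std(ddof=1) * durations.size ** (-1 / 5)

# A Gaussian KDE is an equal weight mixture of N(x_i, h^2) components, so the
# probability mass it places below zero is the mean of Phi((0 - x_i) / h).
std_normal = NormalDist()
leaked_mass = float(np.mean([std_normal.cdf(-x / h) for x in durations]))
```

A nonzero `leaked_mass` is probability the estimate assigns to impossible negative durations, which is the invented support the paragraph warns about.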

63.2.2 2.2 Boxplots, Violins, and the Tails

Boxplots compress a distribution to five numbers and flag points beyond \(1.5 \times \mathrm{IQR}\) from the quartiles as candidate outliers. They are excellent for comparing many groups side by side and poor at revealing shape, since a bimodal distribution and a uniform one can produce identical boxes. Violin plots and their cousins overlay a density to recover shape information, which matters when you suspect mixtures, for example a feature whose distribution differs between two latent subpopulations.

For machine learning the most consequential part of a distribution is often the tail. Heavy tails break the implicit assumptions of many models and inflate the variance of gradient estimates. A log scaled histogram or a plot of the empirical survival function \(\hat{S}(x) = 1 - \hat{F}(x)\) on log axes makes tail behavior legible. A power law tail appears roughly linear on a log log survival plot, which is a far more reliable diagnostic than staring at a linear scale histogram whose tail is crushed against the axis.
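A rough sketch of the survival plot coordinates follows, using a synthetic Pareto sample with an assumed tail exponent of 2. The least squares slope on log axes is only a quick visual check of linearity, not a recommended estimator of the exponent.

```python
import numpy as np

def empirical_survival(x):
    """Sorted values and S_hat = 1 - F_hat evaluated at each of them."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    s = 1.0 - np.arange(1, n + 1) / n
    return xs[:-1], s[:-1]           # drop the last point, where S_hat = 0

rng = np.random.default_rng(11)
alpha = 2.0                                     # assumed tail exponent
sample = rng.pareto(alpha, size=5000) + 1.0     # classical Pareto with minimum 1

xs, s = empirical_survival(sample)
# On log log axes a power law survival function is a line with slope -alpha.
slope, _ = np.polyfit(np.log(xs), np.log(s), 1)
```

Plotting `np.log(xs)` against `np.log(s)` gives the log log survival plot; a slope near `-alpha` over a wide range is what "roughly linear" looks like in numbers.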

63.3 3. QQ Plots and the Question of Shape

63.3.1 3.1 Construction and Reading

A quantile quantile plot compares the quantiles of your data against the quantiles of a reference distribution, most often the standard normal. You sort the observations, assign each the cumulative probability \(p_i = (i - 0.5)/n\), and plot the sample value against the theoretical quantile \(\Phi^{-1}(p_i)\). If the data follow the reference distribution up to location and scale, the points fall on a straight line. The intercept reflects the mean and the slope reflects the standard deviation, so the line itself carries no diagnostic weight; only departures from it do.
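The construction translates almost line for line into code. This sketch uses `statistics.NormalDist` from the standard library for \(\Phi^{-1}\); the function name `normal_qq` and the simulated sample with mean 3 and standard deviation 2 are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

def normal_qq(x):
    """Return (theoretical quantiles, sorted sample) for a normal QQ plot."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    p = (np.arange(1, n + 1) - 0.5) / n            # p_i = (i - 0.5) / n
    std_normal = NormalDist()
    z = np.array([std_normal.inv_cdf(pi) for pi in p])
    return z, xs

rng = np.random.default_rng(1)
theo, samp = normal_qq(rng.normal(loc=3.0, scale=2.0, size=2000))

# For normal data the points hug a line: intercept ~ mean, slope ~ std.
slope, intercept = np.polyfit(theo, samp, 1)
```

Plotting `samp` against `theo` yields the QQ plot; the fitted slope and intercept recover the scale and location, which is why the line itself carries no diagnostic information.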

The shapes of those departures are a vocabulary worth memorizing. Points that bend upward at the right and downward at the left, forming an S that is steeper than the reference at both ends, indicate heavy tails. The opposite curvature indicates light tails. A consistent C shape indicates skew. Isolated points far off the line at one end are outliers. Reading a QQ plot is reading these patterns, not checking whether every point sits exactly on the line, because finite samples always wobble.
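The heavy tail and light tail patterns can be checked numerically by comparing the extreme points with a reference line. The sketch below draws that line through the sample quartiles, which is one common convention and an assumption here; the Student t and uniform samples are synthetic stand-ins for heavy and light tails.

```python
import numpy as np
from statistics import NormalDist

STD_NORMAL = NormalDist()

def tail_departures(x):
    """Gaps between the two extreme points and a line through the quartiles."""
    xs = np.sort(np.asarray(x, dtype=float))
    p = (np.arange(1, xs.size + 1) - 0.5) / xs.size
    z = np.array([STD_NORMAL.inv_cdf(pi) for pi in p])
    q25, q75 = np.percentile(xs, [25, 75])
    slope = (q75 - q25) / (2.0 * STD_NORMAL.inv_cdf(0.75))
    intercept = (q25 + q75) / 2.0
    line = intercept + slope * z
    return xs[0] - line[0], xs[-1] - line[-1]      # (left gap, right gap)

rng = np.random.default_rng(12)
heavy = tail_departures(rng.standard_t(df=3, size=5000))    # heavy tails
light = tail_departures(rng.uniform(-1.0, 1.0, size=5000))  # light tails
```

For the heavy tailed sample the left gap is negative and the right gap positive, the S that is steeper than the line at both ends; the uniform sample shows the opposite signs.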

63.3.2 3.2 Where QQ Plots Earn Their Keep in ML

Normality of raw features is rarely required by modern models, so a QQ plot of an input feature is mostly an exploratory tool that suggests transformations, such as a log transform to straighten a right skewed C shape. The higher value use is on residuals and on calibration of probabilistic predictions. Many uncertainty estimates assume Gaussian errors, and a QQ plot of standardized residuals tests that assumption directly. For a probabilistic forecaster you can construct a QQ style plot of the probability integral transform values, which should be uniform if the predictive distributions are well calibrated, turning an abstract calibration claim into a visible diagonal.
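As a small illustration of the probability integral transform check, the sketch below scores the same outcomes under two hypothetical Gaussian forecasters, one with the correct spread and one that is too narrow. The forecasters and sample are invented for the example.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
y = rng.normal(loc=0.0, scale=1.0, size=4000)    # outcomes, true sigma = 1

def pit_values(y, mu, sigma):
    """PIT: the predictive CDF evaluated at each realized outcome."""
    cdf = NormalDist(mu, sigma).cdf
    return np.array([cdf(v) for v in y])

pit_good = pit_values(y, 0.0, 1.0)      # calibrated forecaster
pit_narrow = pit_values(y, 0.0, 0.5)    # overconfident: predictive sd too small

# Calibrated forecasts give a flat PIT histogram; overconfident ones pile
# mass into the extreme bins, producing a U shape.
hist_good, _ = np.histogram(pit_good, bins=10, range=(0, 1))
hist_narrow, _ = np.histogram(pit_narrow, bins=10, range=(0, 1))
```

Sorting the PIT values and plotting them against uniform quantiles gives the QQ style view described above: the calibrated forecaster traces the diagonal and the narrow one bends away from it at both ends.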

63.4 4. Residual Plots for Regression

63.4.1 4.1 The Core Diagnostic

The single most informative plot for a regression model is residuals against fitted values. Define the residual as \(r_i = y_i - \hat{y}_i\) and plot \(r_i\) on the vertical axis against \(\hat{y}_i\) on the horizontal axis. Under a well specified model with homoskedastic errors, this plot should look like structureless noise: a horizontal band centered on zero with constant vertical spread. Every departure from that ideal names a specific problem.

A curved trend, often a U or inverted U, means the model has missed nonlinearity and the functional form is wrong. A fan shape, where spread grows with the fitted value, means heteroskedasticity, so the error variance depends on the prediction and ordinary confidence intervals will be miscalibrated. A residual band that sits above or below zero in some region means local bias. Distinct horizontal stripes can indicate a discrete or censored target. Because the plot maps each symptom to a cause, it is the regression analogue of a stethoscope.

residual = actual minus predicted
plot residual (y axis) vs predicted (x axis)
ideal: flat random band around zero
curve  -> wrong functional form
fan    -> non constant variance
offset -> local bias
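The curve symptom can be reproduced in a few lines. This is a minimal sketch on synthetic data, assuming a deliberately misspecified straight line fitted to a quadratic relationship; binning the residuals into thirds of the fitted range is an illustrative shortcut for what the eye does on the plot.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=2000)
y = 1.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)   # truly quadratic

slope, intercept = np.polyfit(x, y, 1)      # misspecified linear fit
fitted = intercept + slope * x
resid = y - fitted                          # residual = actual minus predicted

# Mean residual in the low, middle, and high thirds of the fitted values.
order = np.argsort(fitted)
bin_means = [float(part.mean()) for part in np.array_split(resid[order], 3)]
```

The bin means come out positive, negative, positive: the U shape that signals a wrong functional form. A scatter of `resid` against `fitted` shows the same pattern directly.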

63.4.2 4.2 Companion Residual Plots

Plotting residuals against each individual feature, including features not in the model, finds structure the model failed to capture. A trend against an omitted variable is direct evidence that it belongs in the model. A scale location plot, which graphs the square root of the absolute standardized residual against fitted values, isolates the variance trend and makes mild heteroskedasticity easier to see than the raw residual plot does.
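A sketch of the scale location quantities on synthetic heteroskedastic data follows. It standardizes residuals by a single overall standard deviation, a simplification that ignores leverage and is assumed here for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1.0, 10.0, size=2000)
y = 2.0 * x + rng.normal(scale=0.3 * x)     # noise sd grows with x (fan shape)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# Simplified standardization: one global scale, no leverage correction.
std_resid = resid / resid.std(ddof=2)
scale_loc = np.sqrt(np.abs(std_resid))

# Raw residuals have no linear trend against fitted values by construction,
# while the scale location values trend upward when variance grows.
raw_corr = float(np.corrcoef(fitted, resid)[0, 1])
scale_trend = float(np.polyfit(fitted, scale_loc, 1)[0])
```

The near zero `raw_corr` alongside a clearly positive `scale_trend` shows why the transformed plot isolates the variance pattern that the raw residual plot only hints at.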

For influence, a plot of residuals against leverage, with Cook’s distance contours overlaid, separates points that are merely unusual in \(y\) from points that actually move the fit. Leverage measures how extreme an observation is in feature space, and a point with high leverage and a large residual is the kind of single observation that can swing a coefficient. In large models you rarely inspect individual points this way, but for linear baselines and for debugging suspicious slices it remains valuable.
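For a linear baseline, leverage and Cook's distance follow from the hat matrix. The sketch below uses the standard formulas \(h_{ii} = x_i^\top (X^\top X)^{-1} x_i\) and \(D_i = \frac{r_i^2}{p\,s^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}\) on synthetic data with one planted influential point; the data and the planted point are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 8.0, 0.0          # planted point: extreme in x and far off the line

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of the hat matrix

beta = XtX_inv @ X.T @ y
resid = y - X @ beta
p = X.shape[1]                                  # number of fitted coefficients
s2 = resid @ resid / (len(y) - p)               # residual variance estimate
cooks = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2
```

Plotting standardized residuals against `leverage` and shading by `cooks` gives the influence plot; the planted point dominates both quantities.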

63.5 5. Learning Curves

63.5.1 5.1 Two Different Plots With the Same Name

The phrase learning curve denotes two distinct diagnostics, and conflating them causes confusion. The first plots a performance metric against the number of training examples, holding the model fixed. The second plots a metric against training progress, meaning epochs or optimizer steps, holding the dataset fixed. The first answers whether more data would help. The second answers whether training is converging, overfitting, or unstable. Always know which one you are looking at.

63.5.2 5.2 Sample Size Curves and the Bias Variance Story

In the sample size version you plot both training and validation error as the training set grows. The classic reading: if training and validation error converge to a high value, the model is underfitting and is limited by bias, so more data will not help and you need a richer model or better features. If a wide gap persists, with low training error and high validation error, the model is overfitting and is limited by variance, so more data, regularization, or a simpler model should help. The gap between the two curves is a direct visual estimate of the generalization gap.
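The two readings can be simulated with polynomial regression on a noisy sine curve. This is a sketch under stated assumptions: the data generator, the degrees 1 and 9, and the training sizes are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

def make_data(n):
    x = rng.uniform(-3.0, 3.0, size=n)
    return x, np.sin(x) + rng.normal(scale=0.3, size=n)   # noise variance 0.09

x_val, y_val = make_data(2000)        # fixed held out set

def learning_curve(degree, sizes):
    """(n, training MSE, validation MSE) for a polynomial of the given degree."""
    rows = []
    for n in sizes:
        x_tr, y_tr = make_data(n)
        coefs = np.polyfit(x_tr, y_tr, degree)
        mse_tr = float(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
        mse_val = float(np.mean((np.polyval(coefs, x_val) - y_val) ** 2))
        rows.append((n, mse_tr, mse_val))
    return rows

sizes = [20, 50, 200, 1000, 5000]
underfit = learning_curve(1, sizes)   # straight line: limited by bias
flexible = learning_curve(9, sizes)   # degree 9: limited by variance when n is small
```

The straight line's two errors meet well above the noise floor and stay there, while the degree 9 model starts with a wide gap that closes as the training set grows.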

63.5.3 5.3 Epoch Curves and Early Stopping

The training progress version plots loss or metric per epoch for training and validation. The signature of overfitting is a validation curve that descends, reaches a minimum, then rises while the training curve keeps falling. The minimum of the validation curve is the early stopping point. Be cautious about reading these curves too literally. Validation curves are noisy, so a single uptick is not evidence of overfitting; use a patience window. Learning rate schedules create characteristic shapes, such as a sharp drop when the rate decays, that are properties of the optimizer rather than the model. Plotting loss on a log scale often makes the late stage behavior, where the interesting decisions live, far more legible.
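A patience window is simple to state in code. The function below is a hypothetical helper, not taken from any framework, and the loss values are made up to include one early uptick followed by real overfitting.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs pass without the loss improving
    on the best value seen so far by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Illustrative losses: a noisy uptick at epoch 3, true minimum at epoch 4.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.65, 0.66, 0.67, 0.70, 0.75]

patient = early_stopping_epoch(val_losses, patience=3)
hasty = early_stopping_epoch(val_losses, patience=1)
```

With a patience of 3 the rule rides out the uptick and settles on epoch 4; with a patience of 1 it stops at the uptick and keeps the worse checkpoint from epoch 2.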

63.6 6. Classifier Diagnostics

63.6.1 6.1 Confusion Matrices

A confusion matrix tabulates predicted class against true class; the convention assumed here, and the one scikit-learn uses, puts true classes on the rows and predicted classes on the columns. For binary problems its four cells are true positives, false positives, true negatives, and false negatives, and almost every classification metric is a ratio of these counts. Reading a confusion matrix well means normalizing it deliberately. Normalizing each row by its sum puts per class recall on the diagonal, answering what fraction of each true class was caught. Normalizing each column by its sum puts per class precision on the diagonal, answering what fraction of each prediction was correct. The raw counts and these two normalizations answer different questions, and quoting only one is a frequent source of misleading claims.

On imbalanced data the raw matrix is dominated by the majority class and can look excellent while the minority class is barely detected. Row normalization is the antidote, since it exposes a minority recall of, say, \(0.2\) that the overall accuracy of \(0.95\) hid. For multiclass problems the off diagonal cells form a map of which classes get confused with which, and clusters of confusion frequently reveal genuine label ambiguity or related categories rather than model failure.
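The numbers in the imbalanced example can be reproduced with a small constructed dataset. The sketch assumes rows index the true class and columns the predicted class; the helper `confusion_matrix` is a hand written stand in, and the labels are fabricated to give exactly the accuracy and recall quoted above.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Constructed example: 950 negatives, 50 positives.
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.array([0] * 940 + [1] * 10 + [0] * 40 + [1] * 10)

cm = confusion_matrix(y_true, y_pred, 2)
accuracy = np.trace(cm) / cm.sum()

row_norm = cm / cm.sum(axis=1, keepdims=True)   # diagonal holds recall per class
col_norm = cm / cm.sum(axis=0, keepdims=True)   # diagonal holds precision per class
```

Here `accuracy` is 0.95 while `row_norm[1, 1]`, the minority recall, is only 0.2, and `col_norm[1, 1]` shows that half of the positive predictions are wrong.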

63.6.2 6.2 ROC Curves

The receiver operating characteristic curve traces the true positive rate against the false positive rate as the decision threshold sweeps from permissive to strict. Each threshold yields one point; the curve is their locus. The diagonal represents random guessing, and the area under the curve, the AUC, equals the probability that the model ranks a random positive above a random negative. AUC is therefore a measure of ranking quality that does not depend on any single threshold or operating point.

\[\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}\]

The crucial subtlety is that the false positive rate has the total number of actual negatives, \(FP + TN\), in its denominator. When negatives vastly outnumber positives, a large absolute number of false positives still produces a small false positive rate, so the ROC curve can stay flatteringly close to the top left even when most positive predictions are wrong. This is why ROC curves can look strong on heavily imbalanced problems where the model is not actually useful.
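Both the ranking interpretation of AUC and the imbalance effect can be seen in a simulation. The sketch below assumes continuous scores with no ties, Gaussian score distributions two standard deviations apart, and a ratio of 100 negatives per positive; all of these are illustrative choices.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """FPR and TPR as the threshold sweeps from strict to permissive (no ties assumed)."""
    order = np.argsort(-scores)
    y = y_true[order]
    tp, fp = np.cumsum(y), np.cumsum(1 - y)
    return (np.concatenate([[0.0], fp / fp[-1]]),
            np.concatenate([[0.0], tp / tp[-1]]))

rng = np.random.default_rng(7)
n_neg, n_pos = 20000, 200
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)])
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

fpr, tpr = roc_curve_points(y_true, scores)
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))   # trapezoid rule

# AUC as the probability that a random positive outranks a random negative.
pos, neg = scores[y_true == 1], scores[y_true == 0]
rank_prob = float(np.mean(pos[:, None] > neg[None, :]))

# One operating point: a small FPR can coexist with poor precision.
flagged = scores >= 2.0
tp_count = int(np.sum(flagged & (y_true == 1)))
fp_count = int(np.sum(flagged & (y_true == 0)))
fpr_at_thr = fp_count / n_neg
precision_at_thr = tp_count / (tp_count + fp_count)
```

The area and the pairwise ranking probability agree, and at the chosen threshold the false positive rate is only a few percent even though most flagged cases are negatives.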

63.6.3 6.3 Precision Recall Curves

The precision recall curve plots precision against recall across thresholds and is the right tool under heavy class imbalance precisely because neither axis involves the count of true negatives.

\[\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}\]

Because precision responds directly to false positives among a small positive class, the precision recall curve drops visibly when the model floods its positive predictions with negatives, exactly the failure ROC can mask. The no skill baseline is also more honest: it is a horizontal line at the positive class prevalence rather than a fixed diagonal, so a rare positive class sets a low and clearly marked bar. As a rule, report ROC when classes are roughly balanced or when ranking across the full range matters, and report precision recall when positives are rare and the cost of false positives is the concern. Many practitioners show both.
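The prevalence baseline is easy to verify by scoring the same labels with a model and with pure noise. In this sketch the summary is average precision computed as a step sum over recall increments, one common definition and an assumption here; the data mirror a 1 percent positive class and are synthetic.

```python
import numpy as np

def pr_curve_points(y_true, scores):
    """Recall and precision after each cut of the score sorted list."""
    order = np.argsort(-scores)
    hits = np.cumsum(y_true[order])               # true positives so far
    n_flagged = np.arange(1, y_true.size + 1)     # predicted positives so far
    return hits / hits[-1], hits / n_flagged

def average_precision(y_true, scores):
    recall, precision = pr_curve_points(y_true, scores)
    steps = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(steps * precision))

rng = np.random.default_rng(8)
n_neg, n_pos = 20000, 200
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
model_scores = np.concatenate([rng.normal(0.0, 1.0, n_neg),
                               rng.normal(2.0, 1.0, n_pos)])
random_scores = rng.uniform(size=y_true.size)     # a no skill scorer

prevalence = float(y_true.mean())                 # height of the no skill line
ap_model = average_precision(y_true, model_scores)
ap_random = average_precision(y_true, random_scores)
```

The no skill scorer lands near `prevalence`, about 0.01, so the distance between that bar and `ap_model` is a far less flattering measure than the same model's AUC.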

63.7 7. Calibration Plots

63.7.1 7.1 What Calibration Means

A model is calibrated if its stated probabilities match observed frequencies: among all cases predicted with probability \(0.7\), about \(70\) percent should be positive. Calibration is orthogonal to discrimination. A model can rank perfectly, achieving high AUC, while being badly miscalibrated, and a well calibrated model can rank poorly. Both properties matter, and they answer different questions. Discrimination asks whether the ordering is right; calibration asks whether the numbers can be taken at face value, which is what any downstream decision based on expected value requires.

63.7.2 7.2 Reliability Diagrams and Their Pitfalls

The reliability diagram bins predictions by their stated probability and plots, for each bin, the mean predicted probability against the observed positive frequency. Perfect calibration lies on the diagonal. A curve that sags below the diagonal at high probabilities means the model is overconfident, claiming more certainty than the data support, which is the typical pathology of modern deep networks. A curve above the diagonal means underconfidence.
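The points of a reliability diagram take only a few lines to compute. The sketch uses equal width bins and simulates an overconfident model by doubling the true log odds; the helper name `reliability_bins` and the simulation are assumptions for illustration.

```python
import numpy as np

def reliability_bins(y_true, prob, n_bins=10):
    """(mean stated probability, observed frequency, count) for each non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(prob, edges[1:-1])          # bin index in 0 .. n_bins - 1
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(prob[mask].mean()),
                         float(y_true[mask].mean()),
                         int(mask.sum())))
    return rows

rng = np.random.default_rng(13)
p_true = rng.uniform(0.02, 0.98, size=20000)
y = (rng.uniform(size=p_true.size) < p_true).astype(int)

# Overconfident model: stated log odds are twice the true log odds.
logit = np.log(p_true / (1.0 - p_true))
p_stated = 1.0 / (1.0 + np.exp(-2.0 * logit))

rows = reliability_bins(y, p_stated)
```

Plotting the second element of each row against the first gives the diagram. In the top bin the observed frequency falls below the stated probability, and in the bottom bin it sits above, which is overconfidence at both ends.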

Two pitfalls recur. First, the diagram depends on binning just as a histogram does, and equal width bins can leave high probability bins nearly empty when predictions cluster, making the curve there extremely noisy; equal frequency bins mitigate this. Second, the popular scalar summary, expected calibration error, is a weighted average of the absolute gaps between the curve and the diagonal and is sensitive to bin count. Because each gap enters in absolute value, errors in different bins cannot cancel, but overconfident and underconfident predictions that land in the same bin average out before the gap is taken, so coarse bins can hide real miscalibration. Always look at the diagram, not only its summary. When miscalibration is found, post hoc methods such as temperature scaling, which divides the logits by a single learned positive scalar \(T\), often straighten the curve without harming the ranking and so without changing AUC.
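The following sketch ties the two ideas together for a binary model: it computes expected calibration error with equal width bins and then fits a temperature by grid search on the negative log likelihood. Several simplifications are assumed: the logits are simulated as three times the true log odds, the grid search replaces a proper optimizer, and \(T\) is fitted on the same data it is evaluated on, whereas in practice it would be fitted on a held out validation set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ece(y_true, prob, n_bins=10):
    """Weighted average of |mean stated probability - observed frequency| per bin."""
    idx = np.digitize(prob, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            total += mask.mean() * abs(prob[mask].mean() - y_true[mask].mean())
    return float(total)

def nll(y_true, logits, T):
    p = np.clip(sigmoid(logits / T), 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1.0 - p)))

rng = np.random.default_rng(9)
true_logit = rng.normal(scale=1.5, size=20000)
y = (rng.uniform(size=true_logit.size) < sigmoid(true_logit)).astype(int)
logits = 3.0 * true_logit                  # simulated overconfident model

grid = np.linspace(0.5, 6.0, 111)          # candidate temperatures, step 0.05
T = float(grid[np.argmin([nll(y, logits, t) for t in grid])])

ece_before = ece(y, sigmoid(logits))
ece_after = ece(y, sigmoid(logits / T))
```

The fitted temperature comes out near the factor of 3 used to distort the logits, and the calibration error shrinks. Dividing by a positive \(T\) is monotone, so the ordering of the scores, and with it the AUC, is unchanged.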

63.8 8. Practical Principles for Honest Graphics

63.8.1 8.1 Avoiding Self Deception

The order of operations matters. Inspect distributions and residuals before trusting any aggregate metric, and look at validation diagnostics on data the model never touched, since every plot in this chapter can be made to look perfect on the training set. Beware the axis that flatters: a truncated vertical axis exaggerates trivial differences, a linear axis crushes the tails and late stage training behavior that a log axis reveals, and a smoothed curve can hide the variance that tells you whether a difference is real. When comparing models, plot their curves on shared axes rather than judging from separate panels, because the eye is a poor calibrator across plots.

63.8.2 8.2 A Minimal Diagnostic Workflow

A disciplined sequence covers most needs without producing a flood of plots. Begin with per feature distributions and a target distribution to find skew, outliers, and imbalance. For regression, fit a simple baseline and study residuals against fitted values and against key features. For classification, build the confusion matrix with both normalizations, then choose ROC or precision recall by the prevalence of the positive class, and add a calibration plot if the probabilities will drive decisions. Throughout training, watch epoch learning curves on a log scale for convergence and overfitting. Each plot answers a specific question, and the value of the workflow comes from asking the questions in an order that lets each answer inform the next.

1. distributions   -> skew, outliers, imbalance
2. residuals       -> functional form, variance, bias
3. confusion + PR/ROC -> classifier behavior by class
4. calibration     -> are the probabilities trustworthy
5. learning curves -> convergence and overfitting

The thread running through every section is the same. A plot is worth making only if some shape in it would change your next decision, and it is worth reading correctly only if you know in advance which shapes mean which problems. Memorize the vocabulary of departures, distrust the flattering axis, and look before you summarize.

63.9 References

  1. Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician. https://www.jstor.org/stable/2682899
  2. Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator. https://link.springer.com/article/10.1007/BF01025868
  3. Wilke, C. O. (2019). Fundamentals of Data Visualization. https://clauswilke.com/dataviz/
  4. Cleveland, W. S. (1993). Visualizing Data. https://www.stat.purdue.edu/~wsc/visualizing.html
  5. scikit-learn Developers. Model Evaluation: Visualizations and Metrics. https://scikit-learn.org/stable/modules/model_evaluation.html
  6. Saito, T. and Rehmsmeier, M. (2015). The Precision Recall Plot Is More Informative than the ROC Plot on Imbalanced Datasets. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
  7. Niculescu-Mizil, A. and Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
  8. Guo, C. et al. (2017). On Calibration of Modern Neural Networks. https://arxiv.org/abs/1706.04599
  9. Fawcett, T. (2006). An Introduction to ROC Analysis. https://www.sciencedirect.com/science/article/abs/pii/S016786550500303X
  10. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2007.00587.x