159  ROC Curves and AUC

Binary classifiers rarely emit hard decisions. A logistic regression model, a gradient boosted tree, or a neural network typically produces a real valued score \(s(x) \in \mathbb{R}\) that ranks instances by their estimated propensity to belong to the positive class. The conversion of that score into a label requires a threshold \(\tau\), and every choice of \(\tau\) produces a different confusion matrix with its own error profile. The Receiver Operating Characteristic (ROC) curve is the device that summarizes classifier behavior across the entire continuum of thresholds, decoupling the evaluation of the score from the choice of operating point. This chapter develops the ROC space, the true and false positive rate tradeoff, the area under the curve and its probabilistic meaning, and the well known failure of AUC to reflect practical utility under severe class imbalance.

159.1 1. From Scores to the Confusion Matrix

Let \(Y \in \{0, 1\}\) denote the true label, with \(Y = 1\) marking the positive class, and let the classifier assign label \(\hat{Y} = \mathbb{1}[s(x) \ge \tau]\). Over a population or a held out sample, the joint distribution of \((Y, \hat{Y})\) is captured by four counts: true positives \(\mathrm{TP}\), false positives \(\mathrm{FP}\), false negatives \(\mathrm{FN}\), and true negatives \(\mathrm{TN}\). From these we define two rates that condition on the true class.

The true positive rate, also called sensitivity or recall, is the fraction of actual positives that are correctly flagged:

\[ \mathrm{TPR}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau) + \mathrm{FN}(\tau)} = \Pr\!\left[s(X) \ge \tau \mid Y = 1\right]. \]

The false positive rate, equal to one minus specificity, is the fraction of actual negatives that are incorrectly flagged:

\[ \mathrm{FPR}(\tau) = \frac{\mathrm{FP}(\tau)}{\mathrm{FP}(\tau) + \mathrm{TN}(\tau)} = \Pr\!\left[s(X) \ge \tau \mid Y = 0\right]. \]

The crucial property of both quantities is that they are computed within a single true class. They are class conditional rates, and as a consequence neither depends on the prevalence \(\pi = \Pr[Y = 1]\). This invariance is the source of both the strength and the weakness of ROC analysis, a point we return to in Section 6.
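As a concrete illustration of these definitions, the following minimal sketch counts the four cells of the confusion matrix at one threshold and returns the two rates. The function name `rates_at_threshold` and the toy scores are made up for this example, and the code assumes both classes are present. Replicating the negatives tenfold changes the prevalence but leaves both rates untouched.

```python
def rates_at_threshold(scores, labels, tau):
    """Return (TPR, FPR) for the rule y_hat = 1[s >= tau]."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= tau
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Toy data (made up): 4 positives followed by 4 negatives.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
tpr, fpr = rates_at_threshold(scores, labels, tau=0.5)

# Replicate every negative so that negatives are 10x as numerous.
# Prevalence drops from 0.5 to about 0.09, but the rates do not move.
neg_scores = [s for s, y in zip(scores, labels) if y == 0]
skewed_scores = scores + neg_scores * 9
skewed_labels = labels + [0] * (len(neg_scores) * 9)
tpr_skewed, fpr_skewed = rates_at_threshold(skewed_scores, skewed_labels, tau=0.5)

print(tpr, fpr)
print(tpr_skewed, fpr_skewed)
```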

159.2 2. The ROC Space

The ROC space is the unit square \([0, 1]^2\) with \(\mathrm{FPR}\) on the horizontal axis and \(\mathrm{TPR}\) on the vertical axis. A single classifier with a fixed threshold occupies one point in this space. Sweeping \(\tau\) from \(+\infty\) down to \(-\infty\) traces a curve from the origin to the top right corner.

159.2.1 2.1 Landmark Points

Several locations carry fixed interpretations.

  • \((0, 0)\) corresponds to \(\tau = +\infty\), where every instance is declared negative. No positives are caught and no negatives are wrongly flagged.
  • \((1, 1)\) corresponds to \(\tau = -\infty\), where every instance is declared positive.
  • \((0, 1)\) is the point of perfect classification: every positive is caught and no negative is flagged. A model whose positive and negative score distributions do not overlap can reach this corner at some threshold.
  • The diagonal line \(\mathrm{TPR} = \mathrm{FPR}\) represents a classifier whose score carries no information about the label. A coin that flags a fraction \(p\) of instances at random, independent of \(Y\), achieves \(\mathrm{TPR} = \mathrm{FPR} = p\) for every \(p\).

A point above the diagonal is better than random, and a point below it is worse than random, though a classifier that sits consistently below the diagonal can be inverted, by negating its score or flipping its decisions, to land above it.

159.2.2 2.2 Monotonicity and Shape

As \(\tau\) decreases, more instances cross the decision boundary into the positive prediction, so both \(\mathrm{TP}\) and \(\mathrm{FP}\) can only grow. Therefore \(\mathrm{TPR}(\tau)\) and \(\mathrm{FPR}(\tau)\) are each monotone non increasing in \(\tau\), and the ROC curve, traced as \(\tau\) falls, is a monotone non decreasing function from \((0,0)\) to \((1,1)\). For a finite sample the curve is a staircase: lowering the threshold past one positive instance produces a vertical step of height \(1 / n_+\), where \(n_+\) is the number of positives, while crossing a negative produces a horizontal step of width \(1 / n_-\). Ties in the score yield diagonal segments.
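The staircase construction can be sketched as follows. This is one possible implementation, not a reference one: the name `roc_points` is hypothetical, the data are made up, and both classes are assumed to be present. Instances tied in score are absorbed together before a vertex is emitted, which is what produces the diagonal segments.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) vertices of the empirical ROC curve."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Visit instances from highest score to lowest, i.e. as tau falls.
    order = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        tau = order[i][0]
        # Absorb every instance tied at this score before emitting a vertex.
        while i < len(order) and order[i][0] == tau:
            if order[i][1] == 1:
                tp += 1   # vertical step of height 1 / n_pos
            else:
                fp += 1   # horizontal step of width 1 / n_neg
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made up sample with one tie (score 0.6) that mixes both classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.6, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(round(fpr, 3), round(tpr, 3))
```

In this sample the tie at score 0.6 moves the curve from \((1/3, 2/3)\) to \((2/3, 1)\) in a single diagonal step.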

159.3 3. The True and False Positive Rate Tradeoff

The shape of the ROC curve encodes a tradeoff that no single threshold can escape. Lowering \(\tau\) to catch more positives, raising \(\mathrm{TPR}\), simultaneously sweeps in more negatives, raising \(\mathrm{FPR}\). The slope of the ROC curve at a point quantifies the local exchange rate between the two.

159.3.1 3.1 The Slope as a Likelihood Ratio

Let \(f_1\) and \(f_0\) be the densities of the score under the positive and negative classes. Then

\[ \mathrm{TPR}(\tau) = \int_\tau^\infty f_1(s)\, ds, \qquad \mathrm{FPR}(\tau) = \int_\tau^\infty f_0(s)\, ds, \]

so that \(\frac{d\,\mathrm{TPR}}{d\tau} = -f_1(\tau)\) and \(\frac{d\,\mathrm{FPR}}{d\tau} = -f_0(\tau)\). The slope of the curve in ROC space is therefore

\[ \frac{d\,\mathrm{TPR}}{d\,\mathrm{FPR}} = \frac{f_1(\tau)}{f_0(\tau)}, \]

the likelihood ratio at the threshold. The curve is steep where positives are dense relative to negatives, which for a well behaved ranker means at high scores, and flat where the reverse holds. A concave ROC curve corresponds to a likelihood ratio that decreases monotonically as \(\tau\) falls, which is the hallmark of a score that is a monotone function of the likelihood ratio. Non concavities signal regions where the score ranks instances suboptimally; the ranking there could be improved by pooling or reordering the scores in those regions, whereas a strictly increasing recalibration would leave the curve unchanged.
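The slope identity can be checked numerically. The sketch below assumes, purely for illustration, that the two score distributions are unit variance Gaussians with means 0 and 1.5; these parameters and all function names are hypothetical. It compares a finite difference estimate of \(d\,\mathrm{TPR} / d\,\mathrm{FPR}\) with the density ratio \(f_1(\tau) / f_0(\tau)\).

```python
import math

MU_NEG, MU_POS = 0.0, 1.5   # hypothetical class means, unit variance


def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def normal_tail(tau, mu):
    """Pr[S >= tau] for S ~ Normal(mu, 1)."""
    return 0.5 * math.erfc((tau - mu) / math.sqrt(2.0))


def roc_slope_numeric(tau, h=1e-5):
    """Central difference estimate of dTPR/dFPR at threshold tau."""
    d_tpr = normal_tail(tau + h, MU_POS) - normal_tail(tau - h, MU_POS)
    d_fpr = normal_tail(tau + h, MU_NEG) - normal_tail(tau - h, MU_NEG)
    return d_tpr / d_fpr


def likelihood_ratio(tau):
    return normal_pdf(tau, MU_POS) / normal_pdf(tau, MU_NEG)


for tau in (2.0, 1.0, 0.0):
    print(tau, roc_slope_numeric(tau), likelihood_ratio(tau))
```

Under this equal variance assumption the likelihood ratio is increasing in the score, so the slope shrinks as \(\tau\) falls and the resulting ROC curve is concave.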

159.3.2 3.2 Choosing an Operating Point

The ROC curve presents the menu of achievable \((\mathrm{FPR}, \mathrm{TPR})\) pairs, but selecting one requires external information about costs and prevalence. Suppose a false negative costs \(c_{\mathrm{FN}}\) and a false positive costs \(c_{\mathrm{FP}}\). The expected cost at an operating point is

\[ \mathbb{E}[\text{cost}] = \pi\, c_{\mathrm{FN}} \left(1 - \mathrm{TPR}\right) + (1 - \pi)\, c_{\mathrm{FP}}\, \mathrm{FPR}. \]

To minimize this over the curve, note that points of equal expected cost lie on straight iso cost lines in ROC space, with slope

\[ m = \frac{(1 - \pi)\, c_{\mathrm{FP}}}{\pi\, c_{\mathrm{FN}}}, \]

and the optimal operating point is where a line of slope \(m\) is tangent to the ROC curve from above. This is the geometric content of the Neyman Pearson lemma: the optimal decision rule thresholds the likelihood ratio, and the tangency condition matches the curve slope \(f_1 / f_0\) to the cost weighted prevalence ratio. The same curve thus serves every cost regime; only the tangent slope changes.
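For an empirical curve with finitely many vertices, the tangency condition reduces to picking the vertex with the smallest expected cost. The sketch below assumes a hypothetical concave set of ROC vertices and made up costs; the function names are illustrative only.

```python
def expected_cost(fpr, tpr, prevalence, cost_fn, cost_fp):
    """Expected cost per instance at the operating point (fpr, tpr)."""
    return prevalence * cost_fn * (1.0 - tpr) + (1.0 - prevalence) * cost_fp * fpr


def best_operating_point(points, prevalence, cost_fn, cost_fp):
    """Return the (FPR, TPR) vertex with the smallest expected cost."""
    return min(
        points,
        key=lambda p: expected_cost(p[0], p[1], prevalence, cost_fn, cost_fp),
    )


# Hypothetical concave ROC vertices (FPR, TPR); segment slopes are 6, 1.25, 0.4, 0.075.
points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (0.6, 0.97), (1.0, 1.0)]

# Iso cost slope m = 1: balanced classes, equal costs.
balanced = best_operating_point(points, prevalence=0.5, cost_fn=1.0, cost_fp=1.0)
# m = 99: rare positives, equal costs, so flagging nothing is cheapest.
rare = best_operating_point(points, prevalence=0.01, cost_fn=1.0, cost_fp=1.0)
# m = 0.198: rare positives, but a miss costs 500 times a false alarm.
rare_costly_miss = best_operating_point(points, prevalence=0.01, cost_fn=500.0, cost_fp=1.0)

print(balanced, rare, rare_costly_miss)
```

In each case the selected vertex is the one where the segment slopes on either side bracket \(m\), which is the discrete analogue of tangency.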

159.4 4. Area Under the Curve

A scalar summary of the entire curve is convenient for model comparison and selection. The area under the ROC curve, abbreviated AUC or sometimes AUROC, is

\[ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\,\mathrm{FPR}. \]

It ranges from \(0\) to \(1\). A perfect classifier that passes through \((0,1)\) has \(\mathrm{AUC} = 1\); the random diagonal has \(\mathrm{AUC} = 0.5\); a classifier reliably worse than random has \(\mathrm{AUC} < 0.5\) and can be inverted. Because the curve is invariant to any strictly increasing transformation of the score, AUC depends only on the ranking the model induces, not on the numerical scale or calibration of the scores.
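For a piecewise linear empirical curve the integral reduces to a sum of trapezoids. The following sketch assumes the vertices are given as \((\mathrm{FPR}, \mathrm{TPR})\) pairs sorted along the curve; the function name and the example vertices are made up.

```python
def auc_trapezoid(points):
    """Area under a piecewise linear ROC curve given (FPR, TPR) vertices in order."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Width of the FPR interval times the mean TPR over it.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
chance = [(0.0, 0.0), (1.0, 1.0)]
example = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (0.6, 0.97), (1.0, 1.0)]

print(auc_trapezoid(perfect), auc_trapezoid(chance), auc_trapezoid(example))
```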

159.5 5. The Probabilistic Interpretation of AUC

The single most important fact about AUC is that it equals a probability about rankings. Draw one positive instance \(X^+\) at random from the positive class and one negative instance \(X^-\) at random from the negative class, independently. Then

\[ \mathrm{AUC} = \Pr\!\left[s(X^+) > s(X^-)\right] + \tfrac{1}{2}\Pr\!\left[s(X^+) = s(X^-)\right]. \]

AUC is the probability that the classifier ranks a random positive above a random negative, with ties broken evenly. This interpretation makes AUC a measure of discrimination, of the model’s ability to separate the two classes by score, entirely separate from where any threshold is set.

159.5.1 5.1 Derivation

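Start from the integral and change the variable of integration from \(\mathrm{FPR}\) to the threshold \(\tau\). As \(\mathrm{FPR}\) runs from \(0\) to \(1\), \(\tau\) runs from \(+\infty\) down to \(-\infty\), and \(d\,\mathrm{FPR} = -f_0(\tau)\, d\tau\), so reversing the limits absorbs the minus sign:

\[ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\,\mathrm{FPR} = \int_{+\infty}^{-\infty} \mathrm{TPR}(\tau)\, \big(-f_0(\tau)\big)\, d\tau = \int_{-\infty}^{\infty} \Pr\!\left[s(X^+) \ge \tau\right] f_0(\tau)\, d\tau. \]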

Reading \(f_0(\tau)\,d\tau\) as the probability that a random negative has score in \([\tau, \tau + d\tau]\), and \(\Pr[s(X^+) \ge \tau]\) as the probability that a random positive scores at least that high, the integral sums over all negative score values the probability that an independent positive outscores it. That is exactly \(\Pr[s(X^+) > s(X^-)]\) in the continuous case where ties have measure zero.
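The identity can be verified numerically in a case where the pairwise probability has a closed form. The sketch below assumes unit variance Gaussian scores for both classes, an assumption made only for this check, so that \(s(X^+) - s(X^-)\) is Gaussian with variance 2 and \(\Pr[s(X^+) > s(X^-)] = \Phi\big((\mu_1 - \mu_0)/\sqrt{2}\big)\). It compares that value with a midpoint rule approximation of \(\int \mathrm{TPR}(\tau)\, f_0(\tau)\, d\tau\). All names are hypothetical.

```python
import math


def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def normal_tail(tau, mu):
    """Pr[S >= tau] for S ~ Normal(mu, 1)."""
    return 0.5 * math.erfc((tau - mu) / math.sqrt(2.0))


def auc_by_integration(mu_neg, mu_pos, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint rule for the integral of TPR(tau) * f0(tau) over tau."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        tau = lo + (k + 0.5) * width
        total += normal_tail(tau, mu_pos) * normal_pdf(tau, mu_neg)
    return total * width


def auc_closed_form(mu_neg, mu_pos):
    """Pr[S+ > S-] when S+ - S- ~ Normal(mu_pos - mu_neg, 2)."""
    d = (mu_pos - mu_neg) / math.sqrt(2.0)
    return 0.5 * math.erfc(-d / math.sqrt(2.0))   # standard normal CDF at d


print(auc_by_integration(0.0, 1.0), auc_closed_form(0.0, 1.0))
```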

159.5.2 5.2 The Mann Whitney Connection

The empirical AUC computed from a sample is the normalized Mann Whitney \(U\) statistic. With positives scored \(\{a_i\}_{i=1}^{n_+}\) and negatives scored \(\{b_j\}_{j=1}^{n_-}\),

\[ \widehat{\mathrm{AUC}} = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \left( \mathbb{1}[a_i > b_j] + \tfrac{1}{2}\,\mathbb{1}[a_i = b_j] \right). \]

This identity ties AUC to classical nonparametric statistics and supplies its sampling distribution under the null hypothesis of no discrimination, enabling significance tests; standard error estimates built on the same statistic yield confidence intervals. A naive evaluation of the double sum is \(O(n_+ n_-)\), but sorting the combined scores and accumulating ranks reduces it to \(O(n \log n)\).

sort all instances by score ascending
walk the list, maintaining a running count of negatives seen so far
each time a positive is encountered, add the current negative count
for a group of tied scores, add half the number of tied negatives for each tied positive
AUC = accumulated_pairs / (n_pos * n_neg)
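A runnable version of this rank counting procedure might look like the sketch below. It sorts in ascending order so that the running count holds the negatives scored strictly below each positive, and it credits half a pair for each tie, matching the empirical formula above. The function names and data are made up, and both classes are assumed present. A naive double sum is included only as a cross check.

```python
def auc_rank(scores, labels):
    """Empirical AUC in O(n log n) by counting correctly ordered pairs."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(zip(scores, labels))   # ascending by score
    pairs = 0.0
    neg_below = 0                         # negatives with strictly lower score
    i = 0
    while i < len(order):
        j = i
        pos_tied = neg_tied = 0
        # Gather the group of instances sharing this score.
        while j < len(order) and order[j][0] == order[i][0]:
            if order[j][1] == 1:
                pos_tied += 1
            else:
                neg_tied += 1
            j += 1
        # Each tied positive beats all lower negatives and half-beats tied ones.
        pairs += pos_tied * (neg_below + 0.5 * neg_tied)
        neg_below += neg_tied
        i = j
    return pairs / (n_pos * n_neg)


def auc_naive(scores, labels):
    """O(n_pos * n_neg) double sum, used here only as a cross check."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return total / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.6, 0.6, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(auc_rank(scores, labels), auc_naive(scores, labels))
```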

159.5.3 5.3 A Note on Equivalence with the Gini Coefficient

The Gini coefficient used in credit scoring is a linear rescaling, \(\mathrm{Gini} = 2\,\mathrm{AUC} - 1\), mapping the random baseline of \(0.5\) to \(0\) and perfect discrimination to \(1\). It conveys no information beyond AUC but is sometimes preferred because the rescaled baseline is more intuitive.

159.6 6. Limitations Under Class Imbalance

The prevalence invariance established in Section 1 is a double edged property. It makes ROC curves stable across populations with different base rates, which is desirable when a model trained on a balanced sample will be deployed where positives are rare. But it also means that ROC analysis is blind to the consequences of imbalance, and under severe skew this blindness becomes misleading.

159.6.1 6.1 The Insensitivity of FPR to Many False Positives

Consider a fraud detection setting with prevalence \(\pi \approx 0.001\). Suppose a sample contains \(1{,}000\) positives and \(1{,}000{,}000\) negatives, and a model operates at \(\mathrm{TPR} = 0.9\) and \(\mathrm{FPR} = 0.01\). The false positive rate looks small, but \(\mathrm{FPR} = 0.01\) over a million negatives generates \(10{,}000\) false positives against only \(900\) true positives. The precision,

\[ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{900}{900 + 10{,}000} \approx 0.083, \]

is dismal: more than nine in ten flagged cases are wrong. Yet the ROC point \((0.01, 0.9)\) sits comfortably in the upper left and contributes to a flattering AUC. The reason is structural. The horizontal axis normalizes false positives by the enormous count of negatives, so a number of false alarms that overwhelms the positives in absolute terms still registers as a tiny FPR. ROC analysis simply does not see the imbalance because each axis is internally normalized within its own class.

159.6.2 6.2 Precision Recall Curves

When the positive class is rare and the cost of false alarms is borne per flagged case, the precision recall (PR) curve is the more informative diagnostic. It plots precision against recall, equivalently \(\mathrm{TPR}\), as the threshold varies. Unlike \(\mathrm{FPR}\), precision mixes counts from both classes in its denominator, \(\mathrm{TP} + \mathrm{FP}\), so it depends on prevalence and the PR curve responds sharply to class skew. The baseline of a random classifier in PR space is the horizontal line \(\mathrm{Precision} = \pi\), which collapses toward zero as positives become rare, exposing the difficulty that ROC hides. The relationship between precision and the two ROC rates is

\[ \mathrm{Precision} = \frac{\pi\, \mathrm{TPR}}{\pi\, \mathrm{TPR} + (1 - \pi)\, \mathrm{FPR}}. \]
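The formula can be checked against the fraud detection counts from Section 6.1. In the sketch below the function name `precision_from_rates` is hypothetical; the sample sizes and rates are those of the example, and the last line shows how the same ROC point would look under balanced classes.

```python
def precision_from_rates(prevalence, tpr, fpr):
    """Precision implied by an ROC operating point at a given prevalence."""
    flagged_pos = prevalence * tpr            # mass of true positives
    flagged_neg = (1.0 - prevalence) * fpr    # mass of false positives
    return flagged_pos / (flagged_pos + flagged_neg)


n_pos, n_neg = 1_000, 1_000_000
tp = 0.9 * n_pos     # 900 true positives
fp = 0.01 * n_neg    # 10,000 false positives

precision_counts = tp / (tp + fp)
precision_rates = precision_from_rates(n_pos / (n_pos + n_neg), tpr=0.9, fpr=0.01)
precision_balanced = precision_from_rates(0.5, tpr=0.9, fpr=0.01)

print(precision_counts, precision_rates, precision_balanced)
```

The first two values agree at about 0.083, while the identical ROC point yields a precision close to 0.99 when the classes are balanced.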

A curve dominates in ROC space if and only if it dominates in PR space, so the two share the same notion of one model being uniformly better than another, but their summary areas reward different regions. The area under the PR curve, or the closely related average precision, weights performance in the high precision regime that matters when acting on positives is expensive.
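A PR curve and its average precision can be computed with the same threshold sweep used for ROC. The sketch below is illustrative only: the names are hypothetical, the data are made up, and average precision is taken to be the sum of precision values weighted by the recall gained at each threshold, which is one common definition among several.

```python
def pr_points(scores, labels):
    """Return (recall, precision) pairs as the threshold falls through each distinct score."""
    n_pos = sum(labels)
    order = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    i = 0
    while i < len(order):
        tau = order[i][0]
        # Process all instances tied at this score together.
        while i < len(order) and order[i][0] == tau:
            tp += order[i][1]
            fp += 1 - order[i][1]
            i += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def average_precision(points):
    """Sum of precision weighted by the increase in recall at each threshold."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Made up sample: 2 positives among 6 instances (prevalence 1/3).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
curve = pr_points(scores, labels)
print(curve)
print(average_precision(curve))
```

At the lowest threshold every instance is flagged, so the final precision equals the prevalence, which is the random baseline described above.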

159.6.3 6.3 Partial AUC and Cost Weighting

Even without imbalance, the standard AUC integrates over the entire range of \(\mathrm{FPR}\), including operating points no deployment would ever use. A spam filter that cannot tolerate \(\mathrm{FPR}\) above \(0.05\) gains nothing from good ranking at \(\mathrm{FPR} = 0.6\), yet that region contributes to AUC. The partial AUC restricts the integral to a relevant interval \([\,0, f^\star]\) and, after normalization, focuses the summary on the achievable operating region. More generally, when costs and prevalence are known, expected cost or a cost weighted score at the chosen operating point is a more honest figure of merit than any threshold free area.
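A partial AUC over \([0, f^\star]\) can be sketched by clipping the trapezoid sum at the cutoff. The normalization used here, dividing by \(f^\star\) so that the result is the mean \(\mathrm{TPR}\) over the interval, is an assumption chosen for simplicity; other normalizations are in use. The function name and vertices are hypothetical.

```python
def partial_auc(points, max_fpr):
    """Trapezoidal area under ROC vertices for FPR in [0, max_fpr], divided by max_fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate TPR linearly at the cutoff and clip the segment there.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


model = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (0.6, 0.97), (1.0, 1.0)]
chance = [(0.0, 0.0), (1.0, 1.0)]

print(partial_auc(model, 0.05), partial_auc(chance, 0.05), partial_auc(model, 1.0))
```

With this normalization the chance diagonal scores \(f^\star / 2\) rather than \(0.5\), so values are comparable only at a fixed \(f^\star\).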

159.6.4 6.4 What AUC Does and Does Not Tell You

AUC answers a clean and limited question: how well does the score rank a random positive above a random negative. It does not tell you whether the scores are calibrated probabilities, whether any usable threshold exists, what the precision will be at deployment, or whether the model is good enough given real costs. Two models with identical AUC can differ wildly in their performance at the single operating point that a system actually uses, because AUC averages \(\mathrm{TPR}\) uniformly over the whole range of \(\mathrm{FPR}\) rather than weighting the region where the system operates. The discipline of ROC and AUC analysis is to treat the curve as a summary of discriminative capacity, then to choose and report the operating point, the prevalence adjusted precision, and the cost weighted outcome that govern the deployed system.

159.7 7. Summary

The ROC curve plots the true positive rate against the false positive rate as a classification threshold sweeps across all values, displaying the full tradeoff between catching positives and admitting false alarms. Its slope is the likelihood ratio, and the cost optimal operating point is found by tangency with a line whose slope encodes prevalence and misclassification costs. The area under the curve has a clean probabilistic meaning as the chance that a random positive outscores a random negative, equal to the normalized Mann Whitney \(U\) statistic, and it depends only on the induced ranking. Because both ROC axes are class conditional, the analysis is invariant to prevalence, which makes it portable but blind to the absolute flood of false positives that severe class imbalance produces. In rare positive regimes, precision recall curves and cost weighted operating point metrics restore the visibility that AUC obscures.
