159 ROC Curves and AUC

Binary classifiers rarely emit hard decisions. A logistic regression model, a gradient boosted tree, or a neural network typically produces a real valued score $s(x) \in \mathbb{R}$ that ranks instances by their estimated propensity to belong to the positive class. The conversion of that score into a label requires a threshold $\tau$, and every choice of $\tau$ produces a different confusion matrix with its own error profile. The Receiver Operating Characteristic (ROC) curve is the device that summarizes classifier behavior across the entire continuum of thresholds, decoupling the evaluation of the score from the choice of operating point. This chapter develops the ROC space, the true and false positive rate tradeoff, the area under the curve and its probabilistic meaning, and the well known failure of AUC to reflect practical utility under severe class imbalance.

159.1 1. From Scores to the Confusion Matrix

Let $Y \in \{0, 1\}$ denote the true label, with $Y = 1$ marking the positive class, and let the classifier assign label $\hat{Y} = \mathbb{1}[s(x) \ge \tau]$. Over a population or a held out sample, the joint distribution of $(Y, \hat{Y})$ is captured by four counts: true positives $\mathrm{TP}$, false positives $\mathrm{FP}$, false negatives $\mathrm{FN}$, and true negatives $\mathrm{TN}$. From these we define two rates that condition on the true class.

The true positive rate, also called sensitivity or recall, is the fraction of actual positives that are correctly flagged:

\[ \mathrm{TPR}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau) + \mathrm{FN}(\tau)} = \Pr\!\left[s(X) \ge \tau \mid Y = 1\right]. \]

The false positive rate, equal to one minus specificity, is the fraction of actual negatives that are incorrectly flagged:

\[ \mathrm{FPR}(\tau) = \frac{\mathrm{FP}(\tau)}{\mathrm{FP}(\tau) + \mathrm{TN}(\tau)} = \Pr\!\left[s(X) \ge \tau \mid Y = 0\right]. \]

The crucial property of both quantities is that they are computed within a single true class. They are class conditional rates, and as a consequence neither depends on the prevalence $\pi = \Pr[Y = 1]$. This invariance is the source of both the strength and the weakness of ROC analysis, a point we return to in Section 6.

The flow from a continuous score to the four counts is summarized below. The same scores feed every threshold; only the placement of $\tau$ changes which side of the cut each instance falls on.

flowchart LR
  S["Score s(x)"] --> T{"Compare to threshold tau"}
  T -->|"s greater or equal tau"| P["Predict positive"]
  T -->|"s less than tau"| N["Predict negative"]
  P --> TP["TP if Y is 1"]
  P --> FP["FP if Y is 0"]
  N --> FN["FN if Y is 1"]
  N --> TN["TN if Y is 0"]

159.1.1 1.1 A Worked Confusion Matrix

Concrete numbers fix the definitions. Suppose a held out set has $200$ positives and $800$ negatives, and at a chosen threshold the classifier produces the counts $\mathrm{TP} = 160$, $\mathrm{FN} = 40$, $\mathrm{FP} = 80$, $\mathrm{TN} = 720$. The two class conditional rates are

\[ \mathrm{TPR} = \frac{160}{160 + 40} = 0.80, \qquad \mathrm{FPR} = \frac{80}{80 + 720} = 0.10, \]

placing this operating point at $(0.10, 0.80)$ in ROC space, comfortably above the diagonal. Notice that the prevalence here is $\pi = 200 / 1000 = 0.20$, but neither rate used that figure: $\mathrm{TPR}$ divided only by the $200$ positives and $\mathrm{FPR}$ divided only by the $800$ negatives. Precision, by contrast, mixes the classes, $160 / (160 + 80) \approx 0.667$, and would shift immediately if the negative count grew while the positives held fixed. This single example previews the entire tension of the chapter.
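The arithmetic above can be checked with a short Python sketch. The function name `rates` and the second scenario, which multiplies the negatives by ten while holding both class conditional rates fixed, are illustrative assumptions rather than part of the original example.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, precision) from the four confusion matrix counts."""
    tpr = tp / (tp + fn)        # conditions on actual positives only
    fpr = fp / (fp + tn)        # conditions on actual negatives only
    precision = tp / (tp + fp)  # mixes both classes
    return tpr, fpr, precision


# Counts from the worked example: 200 positives, 800 negatives.
base = rates(tp=160, fn=40, fp=80, tn=720)

# Hypothetical scenario: 8000 negatives at the same FPR of 0.10.
skewed = rates(tp=160, fn=40, fp=800, tn=7200)

print(base)    # TPR 0.80, FPR 0.10, precision about 0.667
print(skewed)  # TPR 0.80, FPR 0.10, precision about 0.167
```

The ROC point stays at $(0.10, 0.80)$ in both scenarios, while precision falls from about $0.667$ to about $0.167$.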

159.2 2. The ROC Space

The ROC space is the unit square $[0, 1]^2$ with $\mathrm{FPR}$ on the horizontal axis and $\mathrm{TPR}$ on the vertical axis. A single classifier with a fixed threshold occupies one point in this space. Sweeping $\tau$ from $+\infty$ down to $-\infty$ traces a curve from the origin to the top right corner.

159.2.1 2.1 Landmark Points

Several locations carry fixed interpretations.

$(0, 0)$ corresponds to $\tau = +\infty$, where every instance is declared negative. No positives are caught and no negatives are wrongly flagged.
$(1, 1)$ corresponds to $\tau = -\infty$, where every instance is declared positive.
$(0, 1)$ is the point of perfect classification: every positive is caught and no negative is flagged. A model whose positive and negative score distributions do not overlap can reach this corner at some threshold.
The diagonal line $\mathrm{TPR} = \mathrm{FPR}$ represents a classifier whose score carries no information about the label. A coin that flags a fraction $p$ of instances at random, independent of $Y$, achieves $\mathrm{TPR} = \mathrm{FPR} = p$ for every $p$.

A point above the diagonal is better than random, and a point below it is worse than random, though a consistently below diagonal classifier can be inverted to land above the diagonal.

159.2.2 2.2 Monotonicity and Shape

As $\tau$ decreases, more instances cross the decision boundary into the positive prediction, so both $\mathrm{TP}$ and $\mathrm{FP}$ can only grow. Therefore $\mathrm{TPR}(\tau)$ and $\mathrm{FPR}(\tau)$ are each monotone non increasing in $\tau$, and the ROC curve, traced as $\tau$ falls, is a monotone non decreasing function from $(0,0)$ to $(1,1)$. For a finite sample the curve is a staircase: lowering the threshold past one positive instance produces a vertical step of height $1 / n_+$, where $n_+$ is the number of positives, while crossing a negative produces a horizontal step of width $1 / n_-$. Ties in the score yield diagonal segments.
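The staircase construction can be sketched in a few lines of Python. This is a minimal illustration, assuming labels coded as 0 and 1 with at least one instance of each class; the function name `roc_points` and the six example scores are made up for the demonstration.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points traced as the threshold falls.

    Instances that share a score are consumed together, so a tie between
    a positive and a negative yields a single diagonal segment.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(zip(scores, labels), key=lambda p: -p[0])  # descending score
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = order[i][0]
        while i < len(order) and order[i][0] == current:
            if order[i][1] == 1:
                tp += 1   # vertical move of height 1 / n_pos
            else:
                fp += 1   # horizontal move of width 1 / n_neg
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
for fpr, tpr in pts:
    print(f"FPR={fpr:.3f}  TPR={tpr:.3f}")
```

With three positives and three negatives, each step has height or width $1/3$, and the path runs from $(0,0)$ to $(1,1)$ without ever decreasing in either coordinate.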

159.3 3. The True and False Positive Rate Tradeoff

The shape of the ROC curve encodes a tradeoff that no single threshold can escape. Lowering $\tau$ to catch more positives, raising $\mathrm{TPR}$, simultaneously sweeps in more negatives, raising $\mathrm{FPR}$. The slope of the ROC curve at a point quantifies the local exchange rate between the two.

159.3.1 3.1 The Slope as a Likelihood Ratio

Let $f_1$ and $f_0$ be the densities of the score under the positive and negative classes. Then

\[ \mathrm{TPR}(\tau) = \int_\tau^\infty f_1(s)\, ds, \qquad \mathrm{FPR}(\tau) = \int_\tau^\infty f_0(s)\, ds, \]

so that $\frac{d\,\mathrm{TPR}}{d\tau} = -f_1(\tau)$ and $\frac{d\,\mathrm{FPR}}{d\tau} = -f_0(\tau)$. The slope of the curve in ROC space is therefore

\[ \frac{d\,\mathrm{TPR}}{d\,\mathrm{FPR}} = \frac{f_1(\tau)}{f_0(\tau)}, \]

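the likelihood ratio at the threshold. The curve is steep where positives are dense relative to negatives, which for an informative score means at high thresholds, and flat where the reverse holds. A concave ROC curve corresponds to a slope that decreases monotonically as $\tau$ falls, that is, to a likelihood ratio $f_1 / f_0$ that is non decreasing in the score. This holds whenever the score is a monotone function of the true likelihood ratio, and such curves are sometimes called proper ROC curves. Non concavities signal regions where the score ranks instances suboptimally. A strictly monotone recalibration cannot remove them, because it leaves the ranking unchanged, but remapping the scores onto the convex hull of the ROC curve can.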
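As an illustrative special case, not taken from the text above, assume the equal variance binormal model in which negative scores follow $\mathcal{N}(0, 1)$ and positive scores follow $\mathcal{N}(\mu, 1)$ with $\mu > 0$. The likelihood ratio is then

\[ \frac{f_1(\tau)}{f_0(\tau)} = \frac{\exp\!\left(-\tfrac{1}{2}(\tau - \mu)^2\right)}{\exp\!\left(-\tfrac{1}{2}\tau^2\right)} = \exp\!\left(\mu\tau - \tfrac{1}{2}\mu^2\right), \]

which is strictly increasing in $\tau$. The slope of the ROC curve therefore shrinks steadily as the threshold falls and the curve moves to the right, so under this assumed model the curve is concave. The slope equals $1$ exactly at $\tau = \mu / 2$, the midpoint between the two class means.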

159.3.2 3.2 Choosing an Operating Point

The ROC curve presents the menu of achievable $(\mathrm{FPR}, \mathrm{TPR})$ pairs, but selecting one requires external information about costs and prevalence. Suppose a false negative costs $c_{\mathrm{FN}}$ and a false positive costs $c_{\mathrm{FP}}$. The expected cost at an operating point is

\[ \mathbb{E}[\text{cost}] = \pi\, c_{\mathrm{FN}} \left(1 - \mathrm{TPR}\right) + (1 - \pi)\, c_{\mathrm{FP}}\, \mathrm{FPR}. \]

Points of equal expected cost lie on straight lines in ROC space, the iso cost lines, whose slope is

\[ m = \frac{(1 - \pi)\, c_{\mathrm{FP}}}{\pi\, c_{\mathrm{FN}}}, \]

and the optimal operating point is where a line of slope $m$ is tangent to the ROC curve from above. This mirrors the Neyman Pearson lemma, under which the most powerful rule at any fixed false positive rate thresholds the likelihood ratio: the tangency condition matches the curve slope $f_1 / f_0$ to the cost weighted prevalence ratio $m$. The same curve thus serves every cost regime; only the tangent slope changes.
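On an empirical curve with finitely many points, the tangency condition reduces to picking the point with the lowest expected cost. The sketch below assumes a small hypothetical list of $(\mathrm{FPR}, \mathrm{TPR})$ points and made up costs; the names `expected_cost` and `best_operating_point` are illustrative.

```python
def expected_cost(fpr, tpr, pi, c_fn, c_fp):
    """Expected cost per instance at one operating point."""
    return pi * c_fn * (1.0 - tpr) + (1.0 - pi) * c_fp * fpr


def best_operating_point(points, pi, c_fn, c_fp):
    """Return the (FPR, TPR) point on the curve with the lowest expected cost."""
    return min(points, key=lambda p: expected_cost(p[0], p[1], pi, c_fn, c_fp))


# Hypothetical ROC points, ordered from strict to lenient thresholds.
points = [(0.0, 0.0), (0.05, 0.60), (0.10, 0.80), (0.30, 0.95), (1.0, 1.0)]

# Balanced classes, equal costs: iso cost slope m = 1.
balanced = best_operating_point(points, pi=0.5, c_fn=1.0, c_fp=1.0)

# Rare positives, equal costs: m = 99, so flagging nothing is cheapest.
rare = best_operating_point(points, pi=0.01, c_fn=1.0, c_fp=1.0)

# Rare positives, but a miss costs 200 times a false alarm: m = 0.495.
rare_costly_miss = best_operating_point(points, pi=0.01, c_fn=200.0, c_fp=1.0)

print(balanced, rare, rare_costly_miss)
```

The same five points yield three different optima, $(0.10, 0.80)$, $(0, 0)$, and $(0.30, 0.95)$, as the slope $m$ moves from $1$ to $99$ to roughly $0.5$.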

159.4 4. Area Under the Curve

A scalar summary of the entire curve is convenient for model comparison and selection. The area under the ROC curve, abbreviated AUC or sometimes AUROC, is

\[ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\,\mathrm{FPR}. \]

It ranges from $0$ to $1$. A perfect classifier that passes through $(0,1)$ has $\mathrm{AUC} = 1$; the random diagonal has $\mathrm{AUC} = 0.5$; a classifier reliably worse than random has $\mathrm{AUC} < 0.5$ and can be inverted. Because the curve is invariant to any strictly monotone transformation of the score, AUC depends only on the ranking the model induces, not on the numerical scale or calibration of the scores.
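For an empirical curve given as a list of points, the integral becomes a sum of trapezoids. The following is a minimal sketch; the function name `auc_trapezoid` and the example curves are illustrative, and the points are assumed to be ordered by non decreasing $\mathrm{FPR}$.

```python
def auc_trapezoid(points):
    """Area under a piecewise linear ROC curve given as (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # vertical steps contribute zero
    return area


diagonal = [(0.0, 0.0), (1.0, 1.0)]              # uninformative score
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # passes through (0, 1)
inverted = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # ranks every pair wrongly

# A staircase for three positives and three negatives (hypothetical data).
staircase = [(0.0, 0.0), (0.0, 2 / 3), (1 / 3, 2 / 3),
             (1 / 3, 1.0), (1.0, 1.0)]

print(auc_trapezoid(diagonal))   # 0.5
print(auc_trapezoid(perfect))    # 1.0
print(auc_trapezoid(inverted))   # 0.0
print(auc_trapezoid(staircase))  # about 0.889
```

Flipping the inverted curve's predictions reflects it through the center of the unit square and turns an AUC of $0$ into $1$, which is the sense in which a reliably wrong classifier can be inverted.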

159.5 5. The Probabilistic Interpretation of AUC

The single most important fact about AUC is that it equals a probability about rankings. Draw one positive instance $X^+$ at random from the positive class and one negative instance $X^-$ at random from the negative class, independently. Then

\[ \mathrm{AUC} = \Pr\!\left[s(X^+) > s(X^-)\right] + \tfrac{1}{2}\Pr\!\left[s(X^+) = s(X^-)\right]. \]

AUC is the probability that the classifier ranks a random positive above a random negative, with ties broken evenly. This interpretation makes AUC a measure of discrimination, of the model’s ability to separate the two classes by score, entirely separate from where any threshold is set.

159.5.1 5.1 Derivation

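Start from the integral and change the variable of integration from $\mathrm{FPR}$ to the threshold $\tau$. Because $\mathrm{FPR}$ rises from $0$ to $1$ as $\tau$ falls from $+\infty$ to $-\infty$, and $d\,\mathrm{FPR} = -f_0(\tau)\, d\tau$, the limits reverse and the two sign changes cancel:

\[ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\,\mathrm{FPR} = \int_{+\infty}^{-\infty} \mathrm{TPR}(\tau)\, \big(-f_0(\tau)\big)\, d\tau = \int_{-\infty}^{\infty} \Pr\!\left[s(X^+) \ge \tau\right] f_0(\tau)\, d\tau. \]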

Reading $f_0(\tau)\,d\tau$ as the probability that a random negative has score in $[\tau, \tau + d\tau]$, and $\Pr[s(X^+) \ge \tau]$ as the probability that a random positive scores at least that high, the integral sums over all negative score values the probability that an independent positive outscores it. That is exactly $\Pr[s(X^+) > s(X^-)]$ in the continuous case where ties have measure zero.

159.5.2 5.2 The Mann Whitney Connection

The empirical AUC computed from a sample is the normalized Mann Whitney $U$ statistic. With positives scored $\{a_i\}_{i=1}^{n_+}$ and negatives scored $\{b_j\}_{j=1}^{n_-}$,

\[ \widehat{\mathrm{AUC}} = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \left( \mathbb{1}[a_i > b_j] + \tfrac{1}{2}\,\mathbb{1}[a_i = b_j] \right). \]

This identity ties AUC to classical nonparametric statistics and supplies its sampling distribution under the null hypothesis of no discrimination, which enables significance tests. A naive evaluation of the double sum is $O(n_+ n_-)$, but sorting the combined scores and accumulating ranks reduces it to $O(n \log n)$, where $n = n_+ + n_-$.
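The double sum translates directly into code. The sketch below is the naive $O(n_+ n_-)$ evaluation, written for clarity rather than speed; the function name `auc_pairwise` and the example scores are illustrative assumptions. It also checks that a strictly increasing transformation of the scores, here the exponential, leaves the value unchanged.

```python
import math


def auc_pairwise(pos_scores, neg_scores):
    """Empirical AUC as the normalized Mann Whitney U statistic."""
    total = 0.0
    for a in pos_scores:
        for b in neg_scores:
            if a > b:
                total += 1.0   # positive correctly ranked above negative
            elif a == b:
                total += 0.5   # tie counts one half
    return total / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8, 0.6]
neg = [0.7, 0.55, 0.4]
auc = auc_pairwise(pos, neg)  # 8 of 9 pairs are ordered correctly

# Ranking, not scale, is what matters: exp() preserves the ordering.
auc_exp = auc_pairwise([math.exp(s) for s in pos], [math.exp(s) for s in neg])

# One tied pair out of four contributes one half.
auc_tied = auc_pairwise([0.9, 0.5], [0.5, 0.1])

print(auc, auc_exp, auc_tied)
```

On the first data set eight of the nine positive and negative pairs are ordered correctly, giving $8/9$; the tied example gives $3.5 / 4 = 0.875$.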

The empirical AUC is an unbiased estimator of the population AUC, and its variance can be approximated in closed form. Writing $A = \mathrm{AUC}$, the classic Hanley and McNeil approximation expresses the variance through two derived quantities, $Q_1 = \Pr[s(X^+_1) > s(X^-) \text{ and } s(X^+_2) > s(X^-)]$, the probability that two random positives both outrank the same negative, and $Q_2 = \Pr[s(X^+) > s(X^-_1) \text{ and } s(X^+) > s(X^-_2)]$, the symmetric quantity for one positive against two negatives:

\[ \widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}}) = \frac{A(1 - A) + (n_+ - 1)(Q_1 - A^2) + (n_- - 1)(Q_2 - A^2)}{n_+\, n_-}. \]

A Wald interval $\widehat{\mathrm{AUC}} \pm z_{1 - \alpha/2}\,\widehat{\mathrm{SE}}$ then follows, though it should be used with caution near the boundaries $0$ and $1$ where the normal approximation degrades. The distribution free DeLong method estimates the same variance from the empirical placement values without distributional assumptions and extends cleanly to the comparison of two correlated AUCs computed on the same test set, which is the usual situation when ranking competing models. The open source pROC package in R implements DeLong intervals and tests directly, and scikit-learn paired with a bootstrap over the scored test set gives a comparable, assumption light interval in Python.
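The variance formula can be turned into a rough standard error once $Q_1$ and $Q_2$ are available. The sketch below plugs in the closed form approximations $Q_1 \approx A / (2 - A)$ and $Q_2 \approx 2A^2 / (1 + A)$ from Hanley and McNeil (1982). These are brought in here for illustration and are not stated in the text above, and they rest on a distributional assumption about the scores. The function name and the example figures are likewise hypothetical.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an empirical AUC.

    Assumes the approximations Q1 = A / (2 - A) and Q2 = 2 A^2 / (1 + A)
    instead of estimating Q1 and Q2 from the data.
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)


# Hypothetical result: AUC of 0.85 on 200 positives and 800 negatives.
a_hat = 0.85
se = hanley_mcneil_se(a_hat, n_pos=200, n_neg=800)
z = 1.96  # normal quantile for a 95 percent Wald interval
interval = (a_hat - z * se, a_hat + z * se)
print(f"SE = {se:.4f}, 95% interval = ({interval[0]:.3f}, {interval[1]:.3f})")
```

At $A = 0.5$ both approximations equal $1/3$ and the variance collapses to $(n_+ + n_- + 1) / (12\, n_+ n_-)$, the null variance of the normalized Mann Whitney statistic, which is a useful sanity check on the formula.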

sort all instances by score ascending
walk the list, maintaining a running count of negatives seen
each time a positive is encountered, add the current negative count
(process tied scores together, counting each tied negative as one half)
AUC = accumulated_pairs / (n_pos * n_neg)
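A runnable counterpart is sketched below. It is not a line for line transcription of the steps above: it uses the equivalent rank sum form, assigning each instance its midrank in ascending score order and computing $\widehat{\mathrm{AUC}} = \big(R_+ - n_+(n_+ + 1)/2\big) / (n_+ n_-)$, where $R_+$ is the sum of the positives' ranks. Midranks give tied pairs a weight of one half. The function name `auc_by_ranks` and the data are illustrative, and labels are assumed to be 0 or 1 with both classes present.

```python
def auc_by_ranks(scores, labels):
    """Empirical AUC in O(n log n) from midranks of the combined sample."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of instances tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_by_ranks(scores, labels)
print(auc)  # about 0.889, i.e. 8 of 9 pairs ordered correctly
```

The sort dominates the running time; everything after it is a single linear pass.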

159.5.3 5.3 A Note on Equivalence with the Gini Coefficient

The Gini coefficient used in credit scoring is a linear rescaling, $\mathrm{Gini} = 2\,\mathrm{AUC} - 1$, mapping the random baseline of $0.5$ to $0$ and perfect discrimination to $1$. It conveys no information beyond AUC but is sometimes preferred because the rescaled baseline is more intuitive.

159.6 6. Limitations Under Class Imbalance

The prevalence invariance established in Section 1 is a double edged property. It makes ROC curves stable across populations with different base rates, which is desirable when a model trained on a balanced sample will be deployed where positives are rare. But it also means that ROC analysis is blind to the consequences of imbalance, and under severe skew this blindness becomes misleading.

159.6.1 6.1 The Insensitivity of FPR to Many False Positives

Consider a fraud detection setting with prevalence $\pi \approx 0.001$. Suppose a sample contains $1{,}000$ positives and $1{,}000{,}000$ negatives, and a model operates at $\mathrm{TPR} = 0.9$ and $\mathrm{FPR} = 0.01$. The false positive rate looks small, but $\mathrm{FPR} = 0.01$ over a million negatives generates $10{,}000$ false positives against only $900$ true positives. The precision,

\[ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{900}{900 + 10{,}000} \approx 0.083, \]

is dismal: more than nine in ten flagged cases are wrong. Yet the ROC point $(0.01, 0.9)$ sits comfortably in the upper left and contributes to a flattering AUC. The reason is structural. The horizontal axis normalizes false positives by the enormous count of negatives, so a number of false alarms that overwhelms the positives in absolute terms still registers as a tiny FPR. ROC analysis simply does not see the imbalance because each axis is internally normalized within its own class.

159.6.2 6.2 Precision Recall Curves

When the positive class is rare and the cost of false alarms is borne per flagged case, the precision recall (PR) curve is the more informative diagnostic. It plots precision against recall, equivalently $\mathrm{TPR}$, as the threshold varies. Unlike $\mathrm{FPR}$, whose denominator counts only negatives, precision divides by $\mathrm{TP} + \mathrm{FP}$, a total that mixes both classes, so it depends on prevalence and the PR curve responds sharply to class skew. The baseline of a random classifier in PR space is the horizontal line $\mathrm{Precision} = \pi$, which collapses toward zero as positives become rare, exposing the difficulty that ROC hides. The relationship between precision and the two ROC rates is

\[ \mathrm{Precision} = \frac{\pi\, \mathrm{TPR}}{\pi\, \mathrm{TPR} + (1 - \pi)\, \mathrm{FPR}}. \]
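A short sketch makes the dependence on prevalence concrete. It holds one ROC point fixed, $\mathrm{TPR} = 0.9$ and $\mathrm{FPR} = 0.01$, and varies only $\pi$; the function name `precision_from_roc` and the chosen prevalences are illustrative assumptions.

```python
def precision_from_roc(tpr, fpr, pi):
    """Precision implied by an ROC operating point at prevalence pi."""
    return pi * tpr / (pi * tpr + (1.0 - pi) * fpr)


tpr, fpr = 0.9, 0.01

# 1,000 positives among 1,001,000 instances.
pi_fraud = 1000 / 1_001_000
p_fraud = precision_from_roc(tpr, fpr, pi_fraud)

# The same ROC point on a balanced population.
p_balanced = precision_from_roc(tpr, fpr, 0.5)

for pi in (0.5, 0.1, 0.01, pi_fraud):
    print(f"pi={pi:.6f}  precision={precision_from_roc(tpr, fpr, pi):.3f}")
```

At balanced prevalence the point yields precision near $0.99$; at roughly one positive per thousand it yields about $0.083$, even though the ROC coordinates have not moved.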

A curve dominates in ROC space if and only if it dominates in PR space, so the two share the same notion of one model being uniformly better than another, but their summary areas reward different regions. The area under the PR curve, or the closely related average precision, weights performance in the high precision regime that matters when acting on positives is expensive.

159.6.3 6.3 Partial AUC and Cost Weighting

Even without imbalance, the standard AUC integrates over the entire range of $\mathrm{FPR}$, including operating points no deployment would ever use. A spam filter that cannot tolerate $\mathrm{FPR}$ above $0.05$ gains nothing from good ranking at $\mathrm{FPR} = 0.6$, yet that region contributes to AUC. The partial AUC restricts the integral to a relevant interval $[\,0, f^\star]$ and, after normalization, focuses the summary on the achievable operating region. More generally, when costs and prevalence are known, expected cost or a cost weighted score at the chosen operating point is a more honest figure of merit than any threshold free area.
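The partial AUC can be sketched by clipping the trapezoid sum at $f^\star$. The normalization used here, dividing by $f^\star$ so that the result is the mean $\mathrm{TPR}$ over the retained interval, is one simple choice among several and is an assumption of this example, as are the function name `partial_auc` and the hypothetical curve.

```python
def partial_auc(points, max_fpr):
    """Area under a piecewise linear ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


f_star = 0.05
model = [(0.0, 0.0), (0.05, 0.60), (1.0, 1.0)]  # hypothetical ROC curve
chance = [(0.0, 0.0), (1.0, 1.0)]               # uninformative diagonal

full_auc = partial_auc(model, 1.0)                  # ordinary AUC
model_norm = partial_auc(model, f_star) / f_star    # mean TPR on [0, f_star]
chance_norm = partial_auc(chance, f_star) / f_star

print(full_auc, model_norm, chance_norm)
```

The hypothetical model has a full AUC of $0.775$, but its mean $\mathrm{TPR}$ over $\mathrm{FPR} \le 0.05$ is only $0.30$, against $0.025$ for the diagonal. The restricted figure describes the region a low false alarm deployment would actually use.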

159.6.4 6.4 What AUC Does and Does Not Tell You

AUC answers a clean and limited question: how well does the score rank a random positive above a random negative. It does not tell you whether the scores are calibrated probabilities, whether any usable threshold exists, what the precision will be at deployment, or whether the model is good enough given real costs. Two models with identical AUC can differ wildly in their performance at the single operating point that a system actually uses, because AUC averages the true positive rate uniformly over all false positive rates, including those far from the deployed threshold. The discipline of ROC and AUC analysis is to treat the curve as a summary of discriminative capacity, then to choose and report the operating point, the prevalence adjusted precision, and the cost weighted outcome that govern the deployed system.

159.6.5 6.5 When to Use ROC and AUC, and When Not To

A short field guide consolidates the practical advice.

Reach for ROC and AUC when the two classes are roughly balanced, when you care about ranking quality independent of any fixed threshold, when you need a prevalence portable summary because deployment base rates differ from the test set, or when you are comparing the discriminative power of competing scorers.
Prefer precision recall curves and average precision when positives are rare and the cost of a false alarm is paid per flagged case, since precision exposes the absolute flood of false positives that FPR hides.
Prefer expected cost or a cost weighted score at a fixed operating point when misclassification costs and prevalence are known, because a single honest number at the threshold you will actually deploy beats any threshold free average.
Add a calibration assessment, such as a reliability diagram or the Brier score, whenever downstream decisions consume the scores as probabilities, since AUC is invariant to monotone rescaling and therefore says nothing about calibration.

Common pitfalls to avoid: reporting AUC alone on a severely imbalanced problem and declaring victory; comparing two AUCs without a paired test such as DeLong that accounts for their correlation on the shared test set; integrating over FPR regions no deployment would tolerate rather than using a partial AUC; and conflating a high AUC with usable precision at the chosen threshold.

159.7 7. Summary

The ROC curve plots the true positive rate against the false positive rate as a classification threshold sweeps across all values, displaying the full tradeoff between catching positives and admitting false alarms. Its slope is the likelihood ratio, and the cost optimal operating point is found by tangency with a line whose slope encodes prevalence and misclassification costs. The area under the curve has a clean probabilistic meaning as the chance that a random positive outscores a random negative, equal to the normalized Mann Whitney $U$ statistic, and it depends only on the induced ranking. Because both ROC axes are class conditional, the analysis is invariant to prevalence, which makes it portable but blind to the absolute flood of false positives that severe class imbalance produces. In rare positive regimes, precision recall curves and cost weighted operating point metrics restore the visibility that AUC obscures.

159.8 References

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861 to 874. https://doi.org/10.1016/j.patrec.2005.10.010
Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29 to 36. https://doi.org/10.1148/radiology.143.1.7063747
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145 to 1159. https://doi.org/10.1016/S0031-3203(96)00142-2
Davis, J., and Goadrich, M. (2006). The relationship between Precision Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, 233 to 240. https://doi.org/10.1145/1143844.1143874
Saito, T., and Rehmsmeier, M. (2015). The precision recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
Provost, F., and Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42(3), 203 to 231. https://doi.org/10.1023/A:1007601015854
Mann, H. B., and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50 to 60. https://doi.org/10.1214/aoms/1177730491
DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837 to 845. https://doi.org/10.2307/2531595
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J. C., and Muller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

# ROC Curves and AUC Binary classifiers rarely emit hard decisions. A logistic regression model, a gradient boosted tree, or a neural network typically produces a real valued score $s(x) \in \mathbb{R}$ that ranks instances by their estimated propensity to belong to the positive class. The conversion of that score into a label requires a threshold $\tau$, and every choice of $\tau$ produces a different confusion matrix with its own error profile. The Receiver Operating Characteristic (ROC) curve is the device that summarizes classifier behavior across the entire continuum of thresholds, decoupling the evaluation of the score from the choice of operating point. This chapter develops the ROC space, the true and false positive rate tradeoff, the area under the curve and its probabilistic meaning, and the well known failure of AUC to reflect practical utility under severe class imbalance. ## 1. From Scores to the Confusion Matrix Let $Y \in \{0, 1\}$ denote the true label, with $Y = 1$ marking the positive class, and let the classifier assign label $\hat{Y} = \mathbb{1}[s(x) \ge \tau]$. Over a population or a held out sample, the joint distribution of $(Y, \hat{Y})$ is captured by four counts: true positives $\mathrm{TP}$, false positives $\mathrm{FP}$, false negatives $\mathrm{FN}$, and true negatives $\mathrm{TN}$. From these we define two rates that condition on the true class. The true positive rate, also called sensitivity or recall, is the fraction of actual positives that are correctly flagged: $$ \mathrm{TPR}(\tau) = \frac{\mathrm{TP}(\tau)}{\mathrm{TP}(\tau) + \mathrm{FN}(\tau)} = \Pr\!\left[s(X) \ge \tau \mid Y = 1\right]. $$ The false positive rate, equal to one minus specificity, is the fraction of actual negatives that are incorrectly flagged: $$ \mathrm{FPR}(\tau) = \frac{\mathrm{FP}(\tau)}{\mathrm{FP}(\tau) + \mathrm{TN}(\tau)} = \Pr\!\left[s(X) \ge \tau \mid Y = 0\right]. 
$$ The crucial property of both quantities is that they are computed within a single true class. They are class conditional rates, and as a consequence neither depends on the prevalence $\pi = \Pr[Y = 1]$. This invariance is the source of both the strength and the weakness of ROC analysis, a point we return to in Section 6. The flow from a continuous score to the four counts is summarized below. The same scores feed every threshold; only the placement of $\tau$ changes which side of the cut each instance falls on. ```{mermaid} flowchart LR S["Score s(x)"] --> T{"Compare to threshold tau"} T -->|"s greater or equal tau"| P["Predict positive"] T -->|"s less than tau"| N["Predict negative"] P --> TP["TP if Y is 1"] P --> FP["FP if Y is 0"] N --> FN["FN if Y is 1"] N --> TN["TN if Y is 0"] ``` ### 1.1 A Worked Confusion Matrix Concrete numbers fix the definitions. Suppose a held out set has $200$ positives and $800$ negatives, and at a chosen threshold the classifier produces the counts $\mathrm{TP} = 160$, $\mathrm{FN} = 40$, $\mathrm{FP} = 80$, $\mathrm{TN} = 720$. The two class conditional rates are $$ \mathrm{TPR} = \frac{160}{160 + 40} = 0.80, \qquad \mathrm{FPR} = \frac{80}{80 + 720} = 0.10, $$ placing this operating point at $(0.10, 0.80)$ in ROC space, comfortably above the diagonal. Notice that the prevalence here is $\pi = 200 / 1000 = 0.20$, but neither rate used that figure: $\mathrm{TPR}$ divided only by the $200$ positives and $\mathrm{FPR}$ divided only by the $800$ negatives. Precision, by contrast, mixes the classes, $160 / (160 + 80) \approx 0.667$, and would shift immediately if the negative count grew while the positives held fixed. This single example previews the entire tension of the chapter. ## 2. The ROC Space The ROC space is the unit square $[0, 1]^2$ with $\mathrm{FPR}$ on the horizontal axis and $\mathrm{TPR}$ on the vertical axis. A single classifier with a fixed threshold occupies one point in this space. 
Sweeping $\tau$ from $+\infty$ down to $-\infty$ traces a curve from the origin to the top right corner. ### 2.1 Landmark Points Several locations carry fixed interpretations. - $(0, 0)$ corresponds to $\tau = +\infty$, where every instance is declared negative. No positives are caught and no negatives are wrongly flagged. - $(1, 1)$ corresponds to $\tau = -\infty$, where every instance is declared positive. - $(0, 1)$ is the point of perfect classification: every positive is caught and no negative is flagged. A model whose positive and negative score distributions do not overlap can reach this corner at some threshold. - The diagonal line $\mathrm{TPR} = \mathrm{FPR}$ represents a classifier whose score carries no information about the label. A coin that flags a fraction $p$ of instances at random, independent of $Y$, achieves $\mathrm{TPR} = \mathrm{FPR} = p$ for every $p$. A point above the diagonal is better than random, and a point below it is worse than random, though a consistently below diagonal classifier can be inverted to land above the diagonal. ### 2.2 Monotonicity and Shape As $\tau$ decreases, more instances cross the decision boundary into the positive prediction, so both $\mathrm{TP}$ and $\mathrm{FP}$ can only grow. Therefore $\mathrm{TPR}(\tau)$ and $\mathrm{FPR}(\tau)$ are each monotone non increasing in $\tau$, and the ROC curve, traced as $\tau$ falls, is a monotone non decreasing function from $(0,0)$ to $(1,1)$. For a finite sample the curve is a staircase: lowering the threshold past one positive instance produces a vertical step of height $1 / n_+$, where $n_+$ is the number of positives, while crossing a negative produces a horizontal step of width $1 / n_-$. Ties in the score yield diagonal segments. ## 3. The True and False Positive Rate Tradeoff The shape of the ROC curve encodes a tradeoff that no single threshold can escape. 
Lowering $\tau$ to catch more positives, raising $\mathrm{TPR}$, simultaneously sweeps in more negatives, raising $\mathrm{FPR}$. The slope of the ROC curve at a point quantifies the local exchange rate between the two. ### 3.1 The Slope as a Likelihood Ratio Let $f_1$ and $f_0$ be the densities of the score under the positive and negative classes. Then $$ \mathrm{TPR}(\tau) = \int_\tau^\infty f_1(s)\, ds, \qquad \mathrm{FPR}(\tau) = \int_\tau^\infty f_0(s)\, ds, $$ so that $\frac{d\,\mathrm{TPR}}{d\tau} = -f_1(\tau)$ and $\frac{d\,\mathrm{FPR}}{d\tau} = -f_0(\tau)$. The slope of the curve in ROC space is therefore $$ \frac{d\,\mathrm{TPR}}{d\,\mathrm{FPR}} = \frac{f_1(\tau)}{f_0(\tau)}, $$ the likelihood ratio at the threshold. The curve is steep where positives are dense relative to negatives, that is at high scores for a well calibrated ranker, and flat where the reverse holds. A concave ROC curve corresponds to a monotone decreasing likelihood ratio as $\tau$ falls, which is the hallmark of a proper scoring rule; non concavities signal regions where the score ranks instances suboptimally and could be improved by recalibration. ### 3.2 Choosing an Operating Point The ROC curve presents the menu of achievable $(\mathrm{FPR}, \mathrm{TPR})$ pairs, but selecting one requires external information about costs and prevalence. Suppose a false negative costs $c_{\mathrm{FN}}$ and a false positive costs $c_{\mathrm{FP}}$. The expected cost at an operating point is $$ \mathbb{E}[\text{cost}] = \pi\, c_{\mathrm{FN}} \left(1 - \mathrm{TPR}\right) + (1 - \pi)\, c_{\mathrm{FP}}\, \mathrm{FPR}. $$ Minimizing this over the curve, iso cost lines have slope $$ m = \frac{(1 - \pi)\, c_{\mathrm{FP}}}{\pi\, c_{\mathrm{FN}}}, $$ and the optimal operating point is where a line of slope $m$ is tangent to the ROC curve from above. 
This is the geometric content of the Neyman Pearson lemma: the optimal decision rule thresholds the likelihood ratio, and the tangency condition matches the curve slope $f_1 / f_0$ to the cost weighted prevalence ratio. The same curve thus serves every cost regime; only the tangent slope changes. ## 4. Area Under the Curve A scalar summary of the entire curve is convenient for model comparison and selection. The area under the ROC curve, abbreviated AUC or sometimes AUROC, is $$ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\,\mathrm{FPR}. $$ It ranges from $0$ to $1$. A perfect classifier that passes through $(0,1)$ has $\mathrm{AUC} = 1$; the random diagonal has $\mathrm{AUC} = 0.5$; a classifier reliably worse than random has $\mathrm{AUC} < 0.5$ and can be inverted. Because the curve is invariant to any strictly monotone transformation of the score, AUC depends only on the ranking the model induces, not on the numerical scale or calibration of the scores. ## 5. The Probabilistic Interpretation of AUC The single most important fact about AUC is that it equals a probability about rankings. Draw one positive instance $X^+$ at random from the positive class and one negative instance $X^-$ at random from the negative class, independently. Then $$ \mathrm{AUC} = \Pr\!\left[s(X^+) > s(X^-)\right] + \tfrac{1}{2}\Pr\!\left[s(X^+) = s(X^-)\right]. $$ AUC is the probability that the classifier ranks a random positive above a random negative, with ties broken evenly. This interpretation makes AUC a measure of discrimination, of the model's ability to separate the two classes by score, entirely separate from where any threshold is set. ### 5.1 Derivation Start from the integral and change the variable of integration from $\mathrm{FPR}$ to the threshold $\tau$: $$ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d\,\mathrm{FPR} = \int_{-\infty}^{\infty} \mathrm{TPR}(\tau)\, \big(-f_0(\tau)\big)\, d\tau = \int_{-\infty}^{\infty} \Pr\!\left[s(X^+) \ge \tau\right] f_0(\tau)\, d\tau. 
$$ Reading $f_0(\tau)\,d\tau$ as the probability that a random negative has score in $[\tau, \tau + d\tau]$, and $\Pr[s(X^+) \ge \tau]$ as the probability that a random positive scores at least that high, the integral sums over all negative score values the probability that an independent positive outscores it. That is exactly $\Pr[s(X^+) > s(X^-)]$ in the continuous case where ties have measure zero. ### 5.2 The Mann Whitney Connection The empirical AUC computed from a sample is the normalized Mann Whitney $U$ statistic. With positives scored $\{a_i\}_{i=1}^{n_+}$ and negatives scored $\{b_j\}_{j=1}^{n_-}$, $$ \widehat{\mathrm{AUC}} = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \left( \mathbb{1}[a_i > b_j] + \tfrac{1}{2}\,\mathbb{1}[a_i = b_j] \right). $$ This identity ties AUC to a century of nonparametric statistics and supplies its sampling distribution under the null hypothesis of no discrimination, enabling significance tests and confidence intervals. A naive evaluation of the double sum is $O(n_+ n_-)$, but sorting the combined scores and accumulating ranks reduces it to $O(n \log n)$. The empirical AUC is an unbiased estimator of the population AUC, and its variance can be approximated in closed form. Writing $A = \mathrm{AUC}$, the classic Hanley and McNeil approximation expresses the variance through two derived quantities, $Q_1 = \Pr[s(X^+_1) > s(X^-) \text{ and } s(X^+_2) > s(X^-)]$, the probability that two random positives both outrank the same negative, and $Q_2 = \Pr[s(X^+) > s(X^-_1) \text{ and } s(X^+) > s(X^-_2)]$, the symmetric quantity for one positive against two negatives: $$ \widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}}) = \frac{A(1 - A) + (n_+ - 1)(Q_1 - A^2) + (n_- - 1)(Q_2 - A^2)}{n_+\, n_-}. $$ A Wald interval $\widehat{\mathrm{AUC}} \pm z_{1 - \alpha/2}\,\widehat{\mathrm{SE}}$ then follows, though it should be used with caution near the boundaries $0$ and $1$ where the normal approximation degrades. 
The distribution free DeLong method estimates the same variance from the empirical placement values without distributional assumptions and extends cleanly to the comparison of two correlated AUCs computed on the same test set, which is the usual situation when ranking competing models. The open source `pROC` package in R implements DeLong intervals and tests directly, and `scikit-learn` paired with a bootstrap over the scored test set gives a comparable, assumption light interval in Python.

The $O(n \log n)$ computation of the empirical AUC mentioned earlier amounts to a single pass over the sorted scores:

```text
sort all instances by score ascending
walk the list, maintaining a running count of negatives seen
each time a positive is encountered, add the current negative count
    (a negative tied with that positive counts as one half)
AUC = accumulated_pairs / (n_pos * n_neg)
```

### 5.3 A Note on Equivalence with the Gini Coefficient

The Gini coefficient used in credit scoring is a linear rescaling, $\mathrm{Gini} = 2\,\mathrm{AUC} - 1$, mapping the random baseline of $0.5$ to $0$ and perfect discrimination to $1$. It conveys no information beyond AUC but is sometimes preferred because the rescaled baseline is more intuitive.

## 6. Limitations Under Class Imbalance

The prevalence invariance established in Section 1 is a double edged property. It makes ROC curves stable across populations with different base rates, which is desirable when a model trained on a balanced sample will be deployed where positives are rare. But it also means that ROC analysis is blind to the consequences of imbalance, and under severe skew this blindness becomes misleading.

### 6.1 The Insensitivity of FPR to Many False Positives

Consider a fraud detection setting with prevalence $\pi \approx 0.001$. Suppose a sample contains $1{,}000$ positives and $1{,}000{,}000$ negatives, and a model operates at $\mathrm{TPR} = 0.9$ and $\mathrm{FPR} = 0.01$. The false positive rate looks small, but $\mathrm{FPR} = 0.01$ over a million negatives generates $10{,}000$ false positives against only $900$ true positives.
The precision,

$$
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{900}{900 + 10{,}000} \approx 0.083,
$$

is dismal: more than nine in ten flagged cases are wrong. Yet the ROC point $(0.01, 0.9)$ sits comfortably in the upper left and contributes to a flattering AUC. The reason is structural. The horizontal axis normalizes false positives by the enormous count of negatives, so a number of false alarms that overwhelms the positives in absolute terms still registers as a tiny FPR. ROC analysis simply does not see the imbalance because each axis is internally normalized within its own class.

### 6.2 Precision Recall Curves

When the positive class is rare and the cost of false alarms is borne per flagged case, the precision recall (PR) curve is the more informative diagnostic. It plots precision against recall, equivalently $\mathrm{TPR}$, as the threshold varies. Unlike $\mathrm{FPR}$, precision mixes counts from both classes: its denominator $\mathrm{TP} + \mathrm{FP}$ grows with the number of negatives through $\mathrm{FP}$, so the PR curve responds sharply to class skew. The baseline of a random classifier in PR space is the horizontal line $\mathrm{Precision} = \pi$, which collapses toward zero as positives become rare, exposing the difficulty that ROC hides.

The relationship between precision and the two ROC rates is

$$
\mathrm{Precision} = \frac{\pi\, \mathrm{TPR}}{\pi\, \mathrm{TPR} + (1 - \pi)\, \mathrm{FPR}}.
$$

For a fixed test set, a curve dominates in ROC space if and only if it dominates in PR space, so the two share the same notion of one model being uniformly better than another, but their summary areas reward different regions. The area under the PR curve, or the closely related average precision, weights performance in the high precision regime that matters when acting on positives is expensive.

### 6.3 Partial AUC and Cost Weighting

Even without imbalance, the standard AUC integrates over the entire range of $\mathrm{FPR}$, including operating points no deployment would ever use.
A spam filter that cannot tolerate $\mathrm{FPR}$ above $0.05$ gains nothing from good ranking at $\mathrm{FPR} = 0.6$, yet that region contributes to AUC. The partial AUC restricts the integral to a relevant interval $[\,0, f^\star]$ and, after normalization, focuses the summary on the achievable operating region. More generally, when costs and prevalence are known, expected cost or a cost weighted score at the chosen operating point is a more honest figure of merit than any threshold free area.

### 6.4 What AUC Does and Does Not Tell You

AUC answers a clean and limited question: how well does the score rank a random positive above a random negative. It does not tell you whether the scores are calibrated probabilities, whether any usable threshold exists, what the precision will be at deployment, or whether the model is good enough given real costs. Two models with identical AUC can differ wildly in their performance at the single operating point that a system actually uses, because AUC averages $\mathrm{TPR}$ uniformly over the whole range of $\mathrm{FPR}$ rather than concentrating on that point. The discipline of ROC and AUC analysis is to treat the curve as a summary of discriminative capacity, then to choose and report the operating point, the prevalence adjusted precision, and the cost weighted outcome that govern the deployed system.

### 6.5 When to Use ROC and AUC, and When Not To

A short field guide consolidates the practical advice.

- **Reach for ROC and AUC** when the two classes are roughly balanced, when you care about ranking quality independent of any fixed threshold, when you need a prevalence portable summary because deployment base rates differ from the test set, or when you are comparing the discriminative power of competing scorers.
- **Prefer precision recall curves and average precision** when positives are rare and the cost of a false alarm is paid per flagged case, since precision exposes the absolute flood of false positives that FPR hides.
- **Prefer expected cost or a cost weighted score at a fixed operating point** when misclassification costs and prevalence are known, because a single honest number at the threshold you will actually deploy beats any threshold free average.
- **Add a calibration assessment**, such as a reliability diagram or the Brier score, whenever downstream decisions consume the scores as probabilities, since AUC is invariant to monotone rescaling and therefore says nothing about calibration.

Common pitfalls to avoid: reporting AUC alone on a severely imbalanced problem and declaring victory; comparing two AUCs without a paired test such as DeLong that accounts for their correlation on the shared test set; integrating over FPR regions no deployment would tolerate rather than using a partial AUC; and conflating a high AUC with usable precision at the chosen threshold.

## 7. Summary

The ROC curve plots the true positive rate against the false positive rate as a classification threshold sweeps across all values, displaying the full tradeoff between catching positives and admitting false alarms. Its slope is the likelihood ratio, and the cost optimal operating point is found by tangency with a line whose slope encodes prevalence and misclassification costs. The area under the curve has a clean probabilistic meaning as the chance that a random positive outscores a random negative, equal to the normalized Mann Whitney $U$ statistic, and it depends only on the induced ranking. Because both ROC axes are class conditional, the analysis is invariant to prevalence, which makes it portable but blind to the absolute flood of false positives that severe class imbalance produces. In rare positive regimes, precision recall curves and cost weighted operating point metrics restore the visibility that AUC obscures.

## References

1. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861 to 874. https://doi.org/10.1016/j.patrec.2005.10.010
2. Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29 to 36. https://doi.org/10.1148/radiology.143.1.7063747
3. Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145 to 1159. https://doi.org/10.1016/S0031-3203(96)00142-2
4. Davis, J., and Goadrich, M. (2006). The relationship between Precision Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning, 233 to 240. https://doi.org/10.1145/1143844.1143874
5. Saito, T., and Rehmsmeier, M. (2015). The precision recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
6. Provost, F., and Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42(3), 203 to 231. https://doi.org/10.1023/A:1007601015854
7. Mann, H. B., and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50 to 60. https://doi.org/10.1214/aoms/1177730491
8. DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837 to 845. https://doi.org/10.2307/2531595
9. Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J. C., and Muller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77