160  Precision-Recall Curves and Average Precision

Classification systems that operate under heavy class imbalance, such as fraud detection, information retrieval, rare disease screening, and click prediction, demand evaluation tools that remain sensitive to performance on the scarce positive class. The Precision-Recall (PR) curve and its summary statistics are the workhorses of this regime. This chapter develops the PR curve from first principles, defines average precision and the area under the curve with care, explains why PR analysis is often preferred to Receiver Operating Characteristic (ROC) analysis under imbalance, and makes the formal relationship between the two spaces precise.

160.1 1. From Scores to Operating Points

Most probabilistic classifiers do not emit a hard label. They produce a real valued score \(s(x) \in \mathbb{R}\) for each instance \(x\), where larger scores indicate stronger evidence for the positive class. A decision is recovered by thresholding,

\[ \hat{y}(x) = \mathbb{1}[\, s(x) \geq \tau \,], \]

for a threshold \(\tau\). Each choice of \(\tau\) induces a confusion matrix over a labeled evaluation set with true labels \(y \in \{0, 1\}\). Writing TP, FP, FN, and TN for the counts of true positives, false positives, false negatives, and true negatives, the two quantities of interest are

\[ \text{Precision}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FP}(\tau)}, \qquad \text{Recall}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FN}(\tau)}. \]

Precision answers the question “of the instances I flagged as positive, what fraction truly are?” Recall, also called the true positive rate or sensitivity, answers “of the truly positive instances, what fraction did I retrieve?” The denominator of recall, \(\text{TP} + \text{FN} = P\), is the fixed number of actual positives and does not depend on \(\tau\). The denominator of precision, \(\text{TP} + \text{FP}\), is the number of predicted positives, which shrinks as the threshold rises.

A PR curve is the locus of points \((\text{Recall}(\tau), \text{Precision}(\tau))\) traced as \(\tau\) sweeps from \(+\infty\) down to \(-\infty\). Because precision and recall both depend only on the ranking that the scores induce over the positive and negative classes, the curve is invariant to any strictly increasing transformation of the scores (a decreasing transformation would reverse the ranking). Calibration of the scores into well behaved probabilities is therefore irrelevant to the curve. Only the relative order matters.
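The threshold sweep can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pr_points` and the toy labels and scores are made up for the example, and the data are assumed to contain at least one positive.

```python
from typing import List, Tuple

def pr_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs, one per distinct threshold, highest threshold first."""
    total_pos = sum(labels)
    points = []
    for tau in sorted(set(scores), reverse=True):
        # Predict positive whenever s(x) >= tau, then count the confusion matrix entries.
        tp = sum(1 for y, s in zip(labels, scores) if s >= tau and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= tau and y == 0)
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

# Hypothetical toy data: three positives, three negatives.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
curve = pr_points(labels, scores)

# A strictly increasing transform of the scores leaves the ranking,
# and therefore every PR point, unchanged.
transformed = [10 * s ** 3 for s in scores]
same_curve = pr_points(labels, transformed)
```

Here `curve` and `same_curve` hold identical points, which is the invariance described above.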

160.2 2. The Geometry of PR Space

PR space is the unit square with recall on the horizontal axis and precision on the vertical axis. Several structural facts shape how curves live in it.

160.2.1 2.1 Endpoints and the Baseline

At a very high threshold almost nothing is predicted positive. If the single highest scored instance is positive, the curve begins near precision \(1\) at recall close to \(0\). At a very low threshold everything is predicted positive, so \(\text{TP} = P\) and \(\text{FP} = N\), giving recall \(1\) and precision

\[ \pi = \frac{P}{P + N}, \]

the prevalence of the positive class. The right edge of any PR curve therefore pins to the horizontal line \(y = \pi\). A classifier that ranks instances no better than random achieves expected precision \(\pi\) at every recall, so the baseline in PR space is the horizontal line at the prevalence, not a diagonal. This is the first sharp contrast with ROC space, where the random baseline is always the main diagonal regardless of prevalence.
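The random baseline can be checked exactly on a tiny example. The sketch below enumerates every ordering of a made-up five instance dataset and averages the precision at each cutoff \(k\); it works with cutoffs rather than recall levels, which is a simplifying assumption of the example.

```python
from fractions import Fraction
from itertools import permutations

labels = (1, 1, 0, 0, 0)                      # hypothetical data, prevalence 2/5
prevalence = Fraction(sum(labels), len(labels))

# A random ranker makes every ordering equally likely. permutations() repeats
# each distinct ordering the same number of times, so the average is unaffected.
orderings = list(permutations(labels))

mean_precision_at_k = []
for k in range(1, len(labels) + 1):
    total = sum(Fraction(sum(order[:k]), k) for order in orderings)
    mean_precision_at_k.append(total / len(orderings))
```

Every entry of `mean_precision_at_k` equals the prevalence, which matches the flat baseline at height \(\pi\).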

160.2.2 2.2 Non Monotonicity and Sawtooth Behavior

Unlike the ROC curve, the PR curve is not monotone. As the threshold decreases by enough to admit one more instance, recall either stays flat (the admitted instance is negative) or increases (it is positive). Precision, however, can move in either direction. Admitting a true positive nudges precision up; admitting a false positive pulls it down. The result is a characteristic sawtooth shape. Because of this, naive linear interpolation between adjacent operating points is not generally valid and can produce an optimistic curve, a subtlety addressed by interpolation schemes discussed below.
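The sawtooth is easy to see by tracing one ranking by hand. The following sketch uses a made-up list of labels, already sorted by descending score, and records recall and precision after each instance is admitted.

```python
# Hypothetical labels in descending score order (1 = positive, 0 = negative).
ranked_labels = [1, 0, 1, 1, 0, 1, 0, 0]
total_pos = sum(ranked_labels)

tp = 0
trace = []
for k, y in enumerate(ranked_labels, start=1):
    tp += y
    trace.append((tp / total_pos, tp / k))    # (recall, precision) after admitting rank k

recalls = [r for r, _ in trace]
precisions = [p for _, p in trace]

# Recall never decreases, while precision moves in both directions.
precision_rises = any(b > a for a, b in zip(precisions, precisions[1:]))
precision_falls = any(b < a for a, b in zip(precisions, precisions[1:]))
```

In this trace precision drops at every admitted negative and recovers at every admitted positive, while recall only ever steps upward.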

160.2.3 2.3 Achievable Points

Davis and Goadrich showed that PR space and ROC space are tightly coupled: a point is achievable in PR space if and only if its corresponding point is achievable in ROC space, given a fixed dataset. The mapping is one to one. A consequence is that a curve dominating another in ROC space, meaning it lies entirely above and to the left, also dominates in PR space, and vice versa. Dominance is preserved even though area is not.

160.3 3. Average Precision and the Area Under the PR Curve

We want a single scalar that summarizes the whole curve. Two closely related but distinct definitions are in common use, and conflating them is a frequent source of confusion.

160.3.1 3.1 Average Precision

Average precision (AP) is defined as a sum over thresholds, weighting each precision value by the increase in recall it achieves,

\[ \text{AP} = \sum_{k} \big( R_k - R_{k-1} \big)\, P_k, \]

where \(P_k\) and \(R_k\) are the precision and recall at the \(k\)-th operating point, taken in order of decreasing threshold, and \(R_0 = 0\). This is a right Riemann sum of precision against recall. Each newly retrieved positive raises recall by exactly \(1/P\) in a dataset with \(P\) positives (the unsubscripted \(P\) is the positive count, not a precision), while each retrieved negative leaves recall unchanged. AP can therefore be rewritten as an average of the precision values observed precisely at the ranks where a true positive appears,

\[ \text{AP} = \frac{1}{P} \sum_{i : y_i = 1} \text{Precision@}k_i, \]

where \(k_i\) is the rank of the \(i\)-th positive in the score sorted list and \(\text{Precision@}k\) is the precision when the top \(k\) instances are predicted positive. This form makes clear that AP rewards ranking true positives ahead of negatives. It is the standard summary in information retrieval and in object detection benchmarks.
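The step from the recall weighted sum to the rank based average can be spelled out. This is a short derivation sketch. The symbols \(y_{(k)}\) (the label of the instance at rank \(k\)) and \(n\) (the number of evaluated instances) are introduced here for illustration, and tied scores are assumed absent. Moving the cutoff from rank \(k-1\) to rank \(k\) changes recall by

\[ R_k - R_{k-1} = \frac{y_{(k)}}{P}, \]

so every rank that holds a negative drops out of the sum:

\[ \text{AP} = \sum_{k=1}^{n} \frac{y_{(k)}}{P}\, P_k = \frac{1}{P} \sum_{k \,:\, y_{(k)} = 1} \text{Precision@}k. \]

For example, a ranking with labels \(1, 0, 1, 1\) has positives at ranks 1, 3, and 4, giving \(\text{AP} = \tfrac{1}{3}\big(1 + \tfrac{2}{3} + \tfrac{3}{4}\big) = \tfrac{29}{36} \approx 0.806\).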

160.3.2 3.2 Area Under the PR Curve

The area under the PR curve (AUPRC) is the integral

\[ \text{AUPRC} = \int_0^1 P(R)\, dR, \]

where \(P(R)\) is precision viewed as a function of recall. AP is a particular numerical estimator of this integral, namely the one using right endpoint rectangles without interpolation. Other estimators, such as the trapezoidal rule or the interpolated estimator used by some software, give different numbers on the same curve. The trapezoidal rule can be optimistically biased, most visibly when operating points are sparse (few thresholds or many tied scores), because a straight line between two distant points overstates the precision that is actually achievable between them. For this reason the rectangle based AP is generally the preferred estimator, and practitioners should report which estimator they used.
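The optimism of linear interpolation can be made concrete with two sparse operating points. The sketch below compares the straight line between them with the interpolation rule of Davis and Goadrich (reference 1), in which false positives are assumed to grow linearly with true positives between the two confusion matrices. The counts and helper names are illustrative choices, not taken from any particular dataset.

```python
def linear_precision(rec, a, b):
    """Precision at recall `rec` on the straight line between PR points a and b."""
    (ra, pa), (rb, pb) = a, b
    return pa + (pb - pa) * (rec - ra) / (rb - ra)

def achievable_precision(tp, tp_a, fp_a, tp_b, fp_b):
    """Precision at `tp` true positives when FP grows linearly with TP from A to B."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)     # false positives admitted per new true positive
    fp = fp_a + slope * (tp - tp_a)
    return tp / (tp + fp)

total_pos = 20                                # illustrative number of positives
tp_a, fp_a = 5, 5                             # point A: recall 0.25, precision 0.50
tp_b, fp_b = 10, 30                           # point B: recall 0.50, precision 0.25
a = (tp_a / total_pos, tp_a / (tp_a + fp_a))
b = (tp_b / total_pos, tp_b / (tp_b + fp_b))

tp_mid = 7.5                                  # halfway between A and B in recall
linear = linear_precision(tp_mid / total_pos, a, b)
achievable = achievable_precision(tp_mid, tp_a, fp_a, tp_b, fp_b)
```

At the midpoint the straight line reports a precision of 0.375, while the count based interpolation gives 0.30, so a trapezoid drawn between A and B would overstate the area.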

160.3.3 3.3 Interpolated Precision

Some benchmarks, notably older PASCAL VOC object detection, replace the raw precision at each recall by the maximum precision achieved at any recall greater than or equal to the current one,

\[ P_{\text{interp}}(R) = \max_{R' \geq R} P(R'). \]

This produces a monotonically non increasing envelope and removes the sawtooth. The eleven point variant averages the envelope at the recall levels \(0, 0.1, \ldots, 1\), while the all point variant integrates it exactly. All point interpolated AP is always at least as large as raw AP, making it a smoother but slightly optimistic summary. COCO instead samples the envelope at 101 equally spaced recall levels and additionally averages over a range of intersection over union thresholds, which is why its reported mean AP differs from earlier conventions.
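The envelope is a running maximum taken from the high recall end. The sketch below computes it for a made-up ranking and compares the rectangle area under the raw points with the area under the envelope; the helper names are hypothetical.

```python
def interpolated_envelope(precisions):
    """Replace each precision by the max precision at this or any later (higher recall) point."""
    envelope = list(precisions)
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])
    return envelope

def rectangle_area(recalls, precisions):
    """Right endpoint rectangle sum of precision against recall, starting from recall 0."""
    area, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        area += (r - prev_r) * p
        prev_r = r
    return area

# Hypothetical labels in descending score order.
ranked = [1, 0, 0, 1, 1, 0]
total_pos = sum(ranked)
recalls, precisions, tp = [], [], 0
for k, y in enumerate(ranked, start=1):
    tp += y
    recalls.append(tp / total_pos)
    precisions.append(tp / k)

envelope = interpolated_envelope(precisions)
raw_ap = rectangle_area(recalls, precisions)
interp_ap = rectangle_area(recalls, envelope)
```

For this ranking the raw AP is 0.70 and the envelope based value is about 0.733, consistent with the interpolated summary being the more generous of the two.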

A short illustration of the rank based computation:

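def average_precision(labels, scores):
    """Rank based AP; assumes binary labels, at least one positive, and no tied scores."""
    # Sort instance indices by score, descending.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    seen, tp, ap = 0, 0, 0.0
    for i in order:
        seen += 1                      # instances predicted positive so far (the rank k)
        if labels[i] == 1:
            tp += 1
            ap += tp / seen            # Precision@k; each positive contributes 1/P of recall
    return ap / tp                     # tp now equals the total number of positives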

160.4 4. Why Imbalance Favors PR Curves

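The strongest argument for PR analysis appears when positives are rare. Consider a dataset with prevalence \(\pi = 0.001\), one positive per thousand instances. Suppose a model retrieves all positives but in doing so also flags an equal raw number of negatives. Its precision is only \(0.5\), yet its false positive rate is \(\pi / (1 - \pi) \approx 0.001\), which looks nearly perfect in ROC space. The subsections below unpack this gap.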

160.4.1 4.1 The False Positive Rate Hides Large Absolute Errors

ROC space plots recall against the false positive rate,

\[ \text{FPR}(\tau) = \frac{\text{FP}(\tau)}{\text{FP}(\tau) + \text{TN}(\tau)} = \frac{\text{FP}(\tau)}{N}. \]

When \(N\) is enormous, the FPR can stay tiny even when the absolute number of false positives dwarfs the number of true positives. Imagine \(P = 100\) positives and \(N = 100{,}000\) negatives. A threshold yielding \(\text{TP} = 90\) and \(\text{FP} = 900\) produces an excellent looking \(\text{FPR} = 900 / 100{,}000 = 0.009\), a point hugging the top left of ROC space. Yet precision is only \(90 / (90 + 900) \approx 0.091\). Roughly ten of every eleven alerts are false. The ROC curve flatters the model because its denominator \(N\) absorbs the false positives, while precision exposes the problem because its denominator \(\text{TP} + \text{FP}\) is dominated by them.
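The arithmetic of this example can be reproduced directly from the four confusion matrix counts. The variable names below are chosen for the sketch; the numbers are the ones used in the paragraph above.

```python
n_pos, n_neg = 100, 100_000        # P and N from the example
tp, fp = 90, 900
fn, tn = n_pos - tp, n_neg - fp    # the remaining confusion matrix cells

recall = tp / (tp + fn)            # true positive rate, the ROC vertical axis
fpr = fp / (fp + tn)               # false positive rate, the ROC horizontal axis
precision = tp / (tp + fp)         # the PR vertical axis
false_alert_share = fp / (tp + fp) # fraction of alerts that are wrong
```

The same operating point reads as recall 0.90 at an FPR of 0.009 in ROC space, and as recall 0.90 at a precision of about 0.091 in PR space, with about 91% of alerts being false.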

160.4.2 4.2 Prevalence Sensitivity Is a Feature, Not a Bug

Precision is the positive predictive value, and by Bayes’ rule it depends explicitly on prevalence,

\[ \text{Precision} = \frac{\text{TPR}\cdot \pi}{\text{TPR}\cdot \pi + \text{FPR}\cdot(1 - \pi)}. \]

As \(\pi \to 0\) with TPR and FPR fixed, precision collapses toward zero. ROC coordinates, TPR and FPR, are conditional on the true class and so are invariant to prevalence by construction. That invariance is often advertised as a virtue, and it is, when the deployment prevalence differs from the test prevalence and you want a measure that transfers. But when the operational cost is driven by the burden of false alarms relative to genuine hits, the prevalence dependence of precision is exactly what you want to see, because it reflects the user’s actual experience. The PR curve and AUPRC therefore track the quantity that matters in needle in a haystack tasks.
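The collapse of precision with falling prevalence can be tabulated from the Bayes' rule expression. In this sketch the function name and the list of prevalences are illustrative, and the fixed rates \(\text{TPR} = 0.9\) and \(\text{FPR} = 0.009\) are borrowed from the example in Section 4.1.

```python
def precision_from_rates(tpr: float, fpr: float, prevalence: float) -> float:
    """Positive predictive value from class conditional rates and prevalence (Bayes' rule)."""
    hits = tpr * prevalence
    false_alarms = fpr * (1.0 - prevalence)
    return hits / (hits + false_alarms)

tpr, fpr = 0.9, 0.009
prevalences = [0.5, 0.1, 0.01, 0.001]
table = [(pi, precision_from_rates(tpr, fpr, pi)) for pi in prevalences]

# With 100 positives among 100,100 instances this recovers the 90 / 990 of Section 4.1.
section_example = precision_from_rates(tpr, fpr, 100 / 100_100)
```

The ROC point never moves in this table, yet the precision falls from about 0.99 at balanced classes to under 0.10 at one positive per thousand.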

160.4.3 4.3 Discriminative Power in the Region of Interest

Under heavy imbalance the interesting operating points cluster at very low false positive rates, a region that occupies a thin sliver near the left edge of ROC space where competing curves are visually indistinguishable. PR space stretches this region across the full vertical axis, giving better visual and numerical resolution between models that all look near optimal in ROC terms. This magnification is the practical reason PR curves are favored for model selection in imbalanced settings.

160.5 5. The Formal Relationship to ROC

The two spaces are connected by a clean change of coordinates, since both are built from the same confusion matrix entries.

160.5.1 5.1 Coordinate Translation

A ROC point is \((\text{FPR}, \text{TPR})\) and the matching PR point is \((\text{Recall}, \text{Precision})\) with \(\text{Recall} = \text{TPR}\). Precision is recovered from the ROC coordinates and the class counts through

\[ \text{Precision} = \frac{\text{TPR}\cdot P}{\text{TPR}\cdot P + \text{FPR}\cdot N}. \]

Given the dataset’s \(P\) and \(N\), every ROC point determines a unique PR point and the reverse holds as well, which is the achievability equivalence of Section 2.3. The transformation is nonlinear, so equal areas do not map to equal areas. A model can have a higher area under the ROC curve (AUROC) than a competitor yet a lower AUPRC, although if one curve dominates the other everywhere the ordering is consistent across both spaces.
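The change of coordinates can be written as a pair of small functions. This is a sketch under stated assumptions: the function names are hypothetical, the forward map is undefined when nothing is predicted positive, and the inverse map requires a strictly positive precision.

```python
from typing import Tuple

def roc_to_pr(fpr: float, tpr: float, n_pos: int, n_neg: int) -> Tuple[float, float]:
    """Map a ROC point (FPR, TPR) to the PR point (recall, precision)."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tpr, tp / (tp + fp)          # undefined if tp + fp == 0

def pr_to_roc(recall: float, precision: float, n_pos: int, n_neg: int) -> Tuple[float, float]:
    """Map a PR point (recall, precision) back to the ROC point (FPR, TPR)."""
    tp = recall * n_pos
    fp = tp * (1.0 - precision) / precision   # requires precision > 0
    return fp / n_neg, recall

n_pos, n_neg = 100, 100_000             # class counts from the Section 4.1 example
recall, precision = roc_to_pr(0.009, 0.9, n_pos, n_neg)
fpr_back, tpr_back = pr_to_roc(recall, precision, n_pos, n_neg)
```

The round trip returns the original ROC point, and the dependence on `n_pos` and `n_neg` makes explicit that the correspondence holds only for a fixed dataset.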

160.5.2 5.2 AUROC as a Ranking Probability

AUROC has a clean probabilistic meaning: it equals the probability that a uniformly random positive instance is scored above a uniformly random negative instance,

\[ \text{AUROC} = \Pr\big[\, s(X^{+}) > s(X^{-}) \,\big], \]

which, in the absence of tied scores, is the Mann-Whitney U statistic normalized by \(PN\). AP and AUPRC have no equally clean single probability interpretation, but AP is closely tied to the expected precision a user encounters while scanning a ranked list from the top, which is why retrieval communities adopted it.
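The ranking interpretation suggests a direct, if quadratic time, way to compute AUROC by counting pairs. In the sketch below the function name and toy data are made up, and tied pairs are counted as one half, a common convention that the formula above does not spell out.

```python
def auroc_pairwise(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive is scored higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5          # assumed convention: a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: three positives, three negatives.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auroc = auroc_pairwise(labels, scores)
```

Seven of the nine positive and negative pairs are ordered correctly here, so `auroc` is \(7/9\).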

160.5.3 5.3 Baselines and Chance

A random classifier yields \(\text{AUROC} = 0.5\) regardless of prevalence, whereas a random classifier yields \(\text{AUPRC} \approx \pi\). The PR baseline therefore moves with the problem. When reporting AUPRC it is good practice to also state the prevalence, so that a value of \(0.30\) is recognized as strong when \(\pi = 0.01\) and weak when \(\pi = 0.25\). A normalized variant divides the gain over baseline by the maximum possible gain to produce a prevalence adjusted score.
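One way to read the normalized variant is as \((\text{AUPRC} - \pi) / (1 - \pi)\), taking \(\pi\) as the baseline and \(1\) as the best attainable value. That reading is an assumption of this sketch, as is the function name; other normalizations exist.

```python
def normalized_auprc(auprc: float, prevalence: float) -> float:
    """Gain over the prevalence baseline, divided by the maximum possible gain (assumed form)."""
    return (auprc - prevalence) / (1.0 - prevalence)

rare = normalized_auprc(0.30, 0.01)      # AUPRC of 0.30 when positives are 1% of the data
common = normalized_auprc(0.30, 0.25)    # the same AUPRC when positives are 25% of the data
```

Under this assumed form the same raw AUPRC of 0.30 maps to about 0.29 at one percent prevalence and about 0.07 at twenty five percent prevalence, which puts the two readings on a common scale.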

160.5.4 5.4 When to Prefer Each

Choose ROC and AUROC when the two classes are comparably frequent, when you care about ranking quality independent of the operating prevalence, or when you expect the deployment prevalence to differ from the test set and want a transferable measure. Choose PR, AP, and AUPRC when positives are rare, when false positives are costly relative to the value of true positives, and when the practical question is the quality of the top of a ranked list. In many imbalanced applications both are reported, with PR taking the lead for model selection and ROC providing a prevalence independent sanity check.

160.6 6. Practical Reporting Guidance

Three habits prevent most misuse. First, name the estimator. State whether AUPRC was computed by the rectangle based AP rule, the trapezoidal rule, or an interpolated envelope, because the numbers are not interchangeable. Second, report prevalence alongside any PR summary, since the baseline is \(\pi\) rather than a fixed constant. Third, when comparing models, prefer curve dominance and confidence intervals from bootstrap resampling over a single scalar, because two curves crossing each other can yield equal areas while implying very different behavior at the operating point you actually intend to deploy. Together these practices keep PR analysis both rigorous and honest in the imbalanced regimes where it earns its keep.
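A percentile bootstrap interval for AP can be sketched with the standard library alone. The function names, the toy data, the number of resamples, and the choice to redraw any resample that contains no positives are all assumptions of this example; a production analysis might prefer a stratified or bias corrected scheme.

```python
import random

def average_precision(labels, scores):
    """Rank based AP; assumes binary labels and at least one positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            total += tp / k
    return total / tp

def bootstrap_ap_interval(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AP, resampling instances with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if sum(ys) == 0:
            continue                     # AP is undefined without positives, so redraw
        stats.append(average_precision(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical toy data: four positives among ten instances.
labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.2, 0.1]
point_estimate = average_precision(labels, scores)
lo, hi = bootstrap_ap_interval(labels, scores)
```

Reporting `point_estimate` together with `(lo, hi)` and the prevalence gives a reader enough context to judge whether a difference between two models is larger than the resampling noise.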

160.7 References

  1. Davis, J. and Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. Proceedings of the 23rd International Conference on Machine Learning. https://www.biostat.wisc.edu/~page/rocpr.pdf
  2. Saito, T. and Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 10(3): e0118432. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
  3. Boyd, K., Eng, K. H., and Page, C. D. (2013). Area Under the Precision-Recall Curve: Point Estimates and Confidence Intervals. ECML PKDD. https://link.springer.com/chapter/10.1007/978-3-642-40994-3_55
  4. Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters 27(8): 861-874. https://www.sciencedirect.com/science/article/abs/pii/S016786550500303X
  5. Everingham, M. et al. (2010). The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88: 303-338. https://link.springer.com/article/10.1007/s11263-009-0275-4
  6. scikit-learn developers. Precision-Recall and Average Precision. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
  7. Lin, T.-Y. et al. (2014). Microsoft COCO: Common Objects in Context. ECCV. https://arxiv.org/abs/1405.0312