158 Precision, Recall, and F1
Accuracy is the metric everyone reaches for first and the one that fails first. When a classifier is asked to find fraud, disease, or defects, the interesting class is usually rare, and a model that predicts “negative” for every input can score 99 percent accuracy while being useless. Precision, recall, and the F1 score exist to describe performance in exactly these settings, where the cost of a mistake depends on which kind of mistake it is. This chapter develops the definitions, the tradeoff between them, the F-beta generalization, the averaging schemes used for multiclass problems, and a practical framework for deciding which metric should drive a decision.
158.1 1. The Confusion Matrix and Basic Definitions
Every count based classification metric is built from the confusion matrix. For a binary problem with a designated positive class, each prediction falls into one of four cells.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
A false positive is a negative example wrongly flagged as positive, sometimes called a type I error. A false negative is a positive example the model missed, a type II error. The total number of actual positives is \(\text{TP} + \text{FN}\), and the total number of predicted positives is \(\text{TP} + \text{FP}\).
Accuracy is the fraction of all predictions that are correct:
\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}. \]
The weakness of accuracy is that it averages over both classes weighted by their prevalence. If positives make up 1 percent of the data, the term \(\text{TN}\) dominates and swamps any signal about how well the rare class is handled. Precision and recall sidestep this by conditioning on different denominators, each of which ignores the true negative count entirely.
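The failure mode is easy to reproduce. The following is a minimal sketch on synthetic labels (the data and the helper names `confusion_counts` and `accuracy` are made up for illustration) that fills in the four cells for a model that always predicts negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Synthetic data: 10 positives among 1000 examples (1 percent prevalence).
y_true = [1] * 10 + [0] * 990
# A degenerate model that predicts "negative" for every input.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)            # 0 0 10 990
print(accuracy(tp, fp, fn, tn))  # 0.99
```

All ten positives land in the FN cell, yet the 990 true negatives carry accuracy to 0.99.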
158.2 2. Precision and Recall
Precision answers the question: of the examples the model labeled positive, what fraction really were positive?
\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}. \]
Precision is the reliability of a positive prediction. High precision means that when the model raises an alarm, you can trust it. It is the metric of interest whenever acting on a positive prediction is expensive or disruptive.
Recall, also called sensitivity or the true positive rate, answers a complementary question: of the examples that really were positive, what fraction did the model find?
\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}. \]
Recall is the coverage of the positive class. High recall means few positives slip through. It is the metric of interest whenever missing a positive is the costly outcome.
A useful way to keep the two straight is to note their denominators. Precision divides by what the model predicted positive (the column), so it is degraded by false positives. Recall divides by what is actually positive (the row), so it is degraded by false negatives. Neither metric references \(\text{TN}\), which is why both remain informative under heavy class imbalance.
It helps to interpret these probabilistically. If we draw a random predicted positive, precision is the probability it is truly positive, \(P(Y=1 \mid \hat{Y}=1)\). If we draw a random actual positive, recall is the probability the model catches it, \(P(\hat{Y}=1 \mid Y=1)\). The two are conditional probabilities in opposite directions, related through prevalence by Bayes’ rule, which is exactly why one can be high while the other is low.
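The Bayes' rule relationship can be made concrete with a small sketch. It introduces one quantity the text has not named, the false positive rate \(P(\hat{Y}=1 \mid Y=0)\), and the numbers used (recall 0.9, false positive rate 0.05) are illustrative assumptions, not measurements.

```python
def precision_from_rates(recall, false_positive_rate, prevalence):
    """Bayes' rule: P(Y=1 | Yhat=1) from P(Yhat=1 | Y=1), P(Yhat=1 | Y=0), and P(Y=1)."""
    true_alarm_mass = recall * prevalence
    false_alarm_mass = false_positive_rate * (1.0 - prevalence)
    return true_alarm_mass / (true_alarm_mass + false_alarm_mass)


# Same classifier behavior per class, three different prevalences.
for prevalence in (0.5, 0.1, 0.01):
    p = precision_from_rates(recall=0.9, false_positive_rate=0.05, prevalence=prevalence)
    print(f"prevalence={prevalence:<5} precision={p:.3f}")
```

With recall held at 0.9, precision falls from roughly 0.95 at balanced classes to roughly 0.15 at 1 percent prevalence, which is how a model can have high recall and low precision at the same time.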
158.3 3. The Precision-Recall Tradeoff
Most classifiers do not emit a hard label directly. They produce a score \(s(x)\), often a probability, and a threshold \(\tau\) converts it to a decision: predict positive when \(s(x) \geq \tau\). Sweeping the threshold traces out the entire range of operating points.
Lowering \(\tau\) makes the model more eager to predict positive. It catches more true positives, so recall rises, but it also admits more false positives, so precision tends to fall. Raising \(\tau\) does the reverse: the model only commits to a positive when very confident, which lifts precision but lets more positives escape, lowering recall. This inverse pressure is the precision-recall tradeoff, and it is structural rather than a defect of any particular model.
The tradeoff is visualized as a precision-recall curve, plotting precision against recall as \(\tau\) varies from permissive to strict. The area under this curve, commonly estimated by the average precision, summarizes performance across all thresholds in a single number. Unlike the area under the ROC curve, it stays sensitive to class imbalance, because neither of its axes involves true negatives.
threshold high -> high precision, low recall (cautious)
threshold low -> low precision, high recall (eager)
The practical consequence is that precision and recall are not properties of a model alone but of a model paired with a threshold. Reporting a single precision number without stating the operating point, or comparing two models at different thresholds, is a common and misleading error. When a target precision or recall is fixed by the application, the right procedure is to choose \(\tau\) on a validation set to meet that target and then report the other metric at that point.
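A threshold sweep takes only a few lines to sketch. The scores and labels below are hypothetical validation data, and treating precision as 1.0 when nothing is predicted positive is one convention among several, chosen here only so the function always returns a number.

```python
def precision_recall_at(scores, labels, tau):
    """Precision and recall when predicting positive for score >= tau."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= tau and y == 1)
    fp = sum(1 for s, y in pairs if s >= tau and y == 0)
    fn = sum(1 for s, y in pairs if s < tau and y == 1)
    # Assumed convention: precision is 1.0 when there are no predicted positives.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical validation scores with their true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

# Sweep from strict to permissive.
for tau in (0.85, 0.65, 0.45, 0.25):
    p, r = precision_recall_at(scores, labels, tau)
    print(f"tau={tau:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data the strictest threshold gives precision 1.00 with recall 0.50, and the most permissive gives precision 0.50 with recall 1.00, so the same model yields very different numbers depending on the operating point.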
158.4 4. The F1 Score
Often we want a single number that rewards a model only when both precision and recall are reasonable. The natural candidate, the arithmetic mean, is a poor choice because it can be propped up by one component. A model with precision 1.0 and recall 0.02 has arithmetic mean 0.51, which badly overstates a system that finds almost nothing.
The F1 score instead uses the harmonic mean of precision and recall:
\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \]
The harmonic mean is dominated by the smaller of its arguments, so \(F_1\) is high only when precision and recall are both high. For the example above, \(F_1 = \frac{2 (1.0)(0.02)}{1.0 + 0.02} \approx 0.039\), which honestly reflects a broken classifier. The harmonic mean always lies at or below the arithmetic mean, with equality only when precision equals recall.
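The gap between the two means is easy to check numerically. This is a small sketch with illustrative precision and recall pairs; returning 0 when both inputs are 0 is an assumed convention.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def harmonic_mean(p, r):
    # Assumed convention: the harmonic mean is 0 when both inputs are 0.
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


for p, r in [(1.0, 0.02), (0.9, 0.5), (0.7, 0.7)]:
    a = arithmetic_mean(p, r)
    h = harmonic_mean(p, r)
    print(f"P={p:.2f} R={r:.2f}  arithmetic={a:.3f}  harmonic={h:.3f}")
```

The second and third pairs share an arithmetic mean of 0.700, but only the balanced pair keeps a harmonic mean of 0.700; the lopsided pair drops to about 0.643.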
There is a clean way to see what \(F_1\) counts. Substituting the definitions gives an expression purely in confusion matrix cells:
\[ F_1 = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}. \]
This shows that \(F_1\) weighs each false positive and each false negative equally and, crucially, never involves \(\text{TN}\). That property makes it well suited to imbalanced problems and information retrieval, where the negative class is enormous and uninformative. It also reveals the metric’s main blind spot: by treating the two error types symmetrically, \(F_1\) implicitly assumes a false positive and a false negative cost the same. When they do not, \(F_1\) is the wrong summary.
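For readers who want the intermediate step, the count form follows by working with the reciprocal of \(F_1\), assuming \(\text{TP} > 0\) so that precision and recall are both defined and nonzero:

\[ \frac{1}{F_1} = \frac{1}{2}\left(\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}\right) = \frac{1}{2}\left(\frac{\text{TP} + \text{FP}}{\text{TP}} + \frac{\text{TP} + \text{FN}}{\text{TP}}\right) = \frac{2\,\text{TP} + \text{FP} + \text{FN}}{2\,\text{TP}}. \]

Inverting the last expression gives \(F_1 = 2\,\text{TP} / (2\,\text{TP} + \text{FP} + \text{FN})\), in which \(\text{FP}\) and \(\text{FN}\) enter the denominator with the same coefficient.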
158.5 5. The F-beta Family
The symmetry of \(F_1\) is a special case of a parameterized family that lets us weight recall more or less heavily than precision. The F-beta score is
\[ F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}. \]
The parameter \(\beta\) sets the relative importance of recall to precision. The standard interpretation is that recall is considered \(\beta\) times as important as precision. Setting \(\beta = 1\) recovers \(F_1\). Two common choices anchor the intuition:
- \(F_2\) (\(\beta = 2\)) weights recall higher than precision. It is used when missing a positive is the more serious error, such as screening for a dangerous disease where a missed case is far worse than a false alarm.
- \(F_{0.5}\) (\(\beta = 0.5\)) weights precision higher than recall. It is used when acting on a false positive is costly, such as a system that automatically blocks accounts, where wrongly punishing a legitimate user is the error to avoid.
The limits make the weighting explicit. As \(\beta \to 0\), \(F_\beta \to \text{Precision}\); as \(\beta \to \infty\), \(F_\beta \to \text{Recall}\). A helpful way to read \(\beta\) is through the implied cost ratio: choosing \(\beta\) asserts that you are willing to tolerate \(\beta^2\) false positives to avoid one false negative. So \(F_2\) encodes a four to one tolerance for false positives over false negatives, and \(F_{0.5}\) encodes the inverse. Choosing \(\beta\) is therefore not a tuning knob to be optimized blindly but a statement about the cost structure of the application, and it should be set from that cost structure rather than from whatever value makes a model look best.
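The cost-ratio reading can be checked by substituting the definitions of precision and recall into \(F_\beta\), which gives \((1+\beta^2)\,\text{TP} / ((1+\beta^2)\,\text{TP} + \beta^2\,\text{FN} + \text{FP})\): each false negative carries \(\beta^2\) times the weight of a false positive. The sketch below computes both forms on hypothetical counts and probes the two limits with a very small and a very large \(\beta\); the function names are illustrative.

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall; 0.0 when the denominator vanishes."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0


def f_beta_from_counts(tp, fp, fn, beta):
    """Count form: each FN is weighted beta^2 times as heavily as each FP."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


# Hypothetical counts: precision = 60/100 = 0.6, recall = 60/80 = 0.75.
tp, fp, fn = 60, 40, 20
precision = tp / (tp + fp)
recall = tp / (tp + fn)

for beta in (0.5, 1.0, 2.0):
    from_rates = f_beta(precision, recall, beta)
    from_counts = f_beta_from_counts(tp, fp, fn, beta)
    print(f"beta={beta}: {from_rates:.4f} (rates)  {from_counts:.4f} (counts)")

# Limits: a tiny beta approaches precision, a huge beta approaches recall.
print(f_beta(precision, recall, 1e-3), f_beta(precision, recall, 1e3))
```

Because recall exceeds precision in this example, the score rises with \(\beta\): about 0.625 for \(F_{0.5}\), 0.667 for \(F_1\), and 0.714 for \(F_2\).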
158.6 6. Averaging Across Classes: Micro, Macro, and Weighted
The definitions so far assume a single positive class. Multiclass and multilabel problems have many classes, each with its own precision and recall computed in a one versus rest manner. To report a single figure we must aggregate, and the choice of aggregation changes the meaning of the number.
Let there be \(K\) classes, and write \(\text{TP}_k\), \(\text{FP}_k\), \(\text{FN}_k\) for the per class counts.
Micro averaging pools the counts across all classes first, then computes the metric once on the global totals:
\[ \text{Precision}_{\text{micro}} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FP}_k)}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FN}_k)}. \]
Because every individual prediction contributes equally to the pooled totals, micro averaging is dominated by frequent classes. For single label multiclass classification, micro precision, micro recall, and micro \(F_1\) all equal overall accuracy, since every error is simultaneously a false positive for one class and a false negative for another. Micro averaging answers: how well does the model classify a randomly chosen instance?
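The equality with accuracy can be verified on a toy three-class problem. The labels below are invented for illustration, and `per_class_counts` is a hypothetical helper, not a library function.

```python
def per_class_counts(y_true, y_pred, classes):
    """One-versus-rest (TP, FP, FN) for each class."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for k in classes:
        tp = sum(1 for t, p in pairs if t == k and p == k)
        fp = sum(1 for t, p in pairs if t != k and p == k)
        fn = sum(1 for t, p in pairs if t == k and p != k)
        counts[k] = (tp, fp, fn)
    return counts


# Toy single-label data: class "a" is frequent, class "c" is rare.
y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]
classes = ["a", "b", "c"]

counts = per_class_counts(y_true, y_pred, classes)
tp_sum = sum(tp for tp, _, _ in counts.values())
fp_sum = sum(fp for _, fp, _ in counts.values())
fn_sum = sum(fn for _, _, fn in counts.values())

# Pool first, then compute once on the global totals.
micro_precision = tp_sum / (tp_sum + fp_sum)
micro_recall = tp_sum / (tp_sum + fn_sum)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(counts)
print(micro_precision, micro_recall, accuracy)
```

Each of the three misclassified instances adds one false positive and one false negative to the pooled totals, so both micro denominators equal the number of instances and all three figures come out to 0.7.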
Macro averaging computes the metric per class and then takes an unweighted mean:
\[ \text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k. \]
Every class counts the same regardless of size, so a tiny class has as much influence as a huge one. Macro averaging is the right choice when rare classes matter as much as common ones, for example a medical taxonomy where rare conditions must not be ignored. Its sensitivity to small classes is also a hazard: a class with few examples produces a noisy per class score that the unweighted mean propagates directly into the headline number.
Weighted averaging is a compromise that takes the mean of per class metrics weighted by each class’s support \(n_k\), the number of true instances of that class:
\[ \text{Precision}_{\text{weighted}} = \frac{1}{N} \sum_{k=1}^{K} n_k \cdot \text{Precision}_k, \qquad N = \sum_k n_k. \]
This restores the influence of frequent classes that macro averaging discards, while still computing the metric per class. Note that weighted \(F_1\) is not in general bounded between weighted precision and weighted recall, an occasional source of confusion when reading a report.
micro : pool all TP/FP/FN, then compute -> favors frequent classes
macro : per-class metric, plain average -> every class equal, small classes loud
weighted : per-class metric, support-weighted average -> compromise
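A short sketch shows how macro and weighted precision diverge on imbalanced data. The labels are invented, and scoring an undefined per class precision (a class that is never predicted) as 0 is an assumed convention that should be stated whenever such numbers are reported.

```python
def safe_div(num, den):
    # Assumed convention: an undefined ratio (0/0) is scored as 0.
    return num / den if den > 0 else 0.0


# Toy single-label data: class "a" is frequent, class "c" is rare and never predicted.
y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]
classes = sorted(set(y_true))

precision = {}
support = {}
for k in classes:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
    precision[k] = safe_div(tp, tp + fp)
    support[k] = sum(1 for t in y_true if t == k)  # number of true instances

# Macro: plain mean over classes. Weighted: mean weighted by support.
macro_precision = sum(precision.values()) / len(classes)
weighted_precision = sum(support[k] * precision[k] for k in classes) / sum(support.values())

print(precision)
print(f"macro={macro_precision:.3f}  weighted={weighted_precision:.3f}")
```

The single example of class "c" pulls macro precision down to about 0.460, while the support-weighted figure stays near 0.629 because that class accounts for only one instance in ten.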
The guidance is straightforward. Report macro when you care about performance on every class equally, especially rare ones. Report micro, or equivalently accuracy in the single label case, when you care about aggregate instance level correctness. Report weighted when you want a class aware figure that still reflects the population mix. A large gap between macro and micro scores is itself diagnostic: it signals that the model performs very differently on rare classes than on common ones, which is exactly the situation a single accuracy number would hide.
158.7 7. Choosing Between Precision and Recall by Use Case
No metric is correct in the abstract. The right metric follows from the asymmetric cost of the two error types in the deployed system. The disciplined approach is to ask which mistake hurts more, a false positive or a false negative, and to let that answer pick the metric.
Optimize recall when a missed positive is the expensive error. Disease screening is the canonical example: failing to flag a patient who has the condition can be fatal, while a false alarm leads to a follow up test. The same logic governs fraud detection at the screening stage, security threat detection, and search and rescue, where the downstream cost of investigating a false positive is small relative to the cost of missing a real case. These settings call for a low threshold and an \(F_\beta\) with \(\beta > 1\).
Optimize precision when a false positive is the expensive error. Consider a spam filter that diverts mail to a junk folder. A false positive, a legitimate and possibly important message silently hidden, is far worse than a false negative, a spam message that reaches the inbox. Recommender systems, automated content moderation that removes posts, and any pipeline that takes an irreversible or costly action on a positive prediction share this profile. These settings call for a high threshold and an \(F_\beta\) with \(\beta < 1\).
Use a balanced metric when the costs are roughly symmetric or genuinely unknown. \(F_1\) is the sensible default for general document classification, balanced benchmarking, and early model development before deployment costs are quantified.
Two refinements matter in practice. First, the costs are frequently not constant per error but scale with the instance: a fraudulent transaction of one hundred thousand dollars is not equivalent to one of ten dollars. When such weights are available, an expected cost objective that multiplies each error by its monetary impact is a better target than any unweighted count metric, and precision, recall, and \(F_\beta\) should be treated as proxies that approximate it. Second, the operating threshold should be selected on validation data to satisfy whatever constraint the business imposes, such as “maximize recall subject to precision at least 0.9,” and then evaluated once on a held out test set to get an honest estimate. Choosing the threshold and reporting on the same data inflates the result, and quoting precision or recall without naming the threshold that produced it is not a complete statement of performance.
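The constrained selection in the second refinement might be sketched as follows. The validation scores and labels are hypothetical, `pick_threshold` is an illustrative helper, and restricting candidate thresholds to the observed scores is a simplifying assumption.

```python
def precision_recall_at(scores, labels, tau):
    """Precision and recall when predicting positive for score >= tau."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= tau and y == 1)
    fp = sum(1 for s, y in pairs if s >= tau and y == 0)
    fn = sum(1 for s, y in pairs if s < tau and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def pick_threshold(scores, labels, min_precision):
    """Return (tau, recall, precision) with the highest recall among observed
    scores whose precision meets the floor, or None if no threshold qualifies."""
    best = None
    for tau in sorted(set(scores)):
        p, r = precision_recall_at(scores, labels, tau)
        if p >= min_precision and (best is None or r > best[1]):
            best = (tau, r, p)
    return best


# Hypothetical validation data; the threshold is chosen here and nowhere else.
val_scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
val_labels = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]

choice = pick_threshold(val_scores, val_labels, min_precision=0.9)
print(choice)
```

On this toy data the only thresholds that meet the precision floor sit at 0.80 and above, so the constraint caps recall at 4 of 7 positives. The chosen threshold would then be frozen and applied once to the held out test set, and the test precision and recall reported together with it.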
158.8 8. Summary
Precision and recall decompose classification quality into the reliability of positive predictions and the coverage of actual positives, neither of which is fooled by class imbalance because neither counts true negatives. They trade off against each other as the decision threshold moves, so an operating point must always be specified. The F1 score combines them through the harmonic mean to reward balanced performance, and the F-beta family generalizes this to encode an explicit preference for precision or recall through the cost ratio \(\beta^2\). For multiclass problems, micro averaging measures aggregate instance correctness, macro averaging gives every class equal voice, and weighted averaging compromises by support. The choice among all of these is ultimately a question about the cost of errors in the real system, and the metric should be derived from that cost structure rather than chosen for convenience.
158.9 References
- Sokolova, M., and Lapalme, G. “A systematic analysis of performance measures for classification tasks.” Information Processing and Management, 2009. https://doi.org/10.1016/j.ipm.2009.03.002
- Davis, J., and Goadrich, M. “The relationship between Precision-Recall and ROC curves.” Proceedings of the 23rd International Conference on Machine Learning, 2006. https://doi.org/10.1145/1143844.1143874
- Saito, T., and Rehmsmeier, M. “The Precision-Recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.” PLOS ONE, 2015. https://doi.org/10.1371/journal.pone.0118432
- Van Rijsbergen, C. J. “Information Retrieval,” 2nd ed. Butterworths, 1979. https://www.dcs.gla.ac.uk/Keith/Preface.html
- Powers, D. M. W. “Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.” Journal of Machine Learning Technologies, 2011. https://arxiv.org/abs/2010.16061
- scikit-learn developers. “Metrics and scoring: quantifying the quality of predictions.” scikit-learn User Guide. https://scikit-learn.org/stable/modules/model_evaluation.html
- Yang, Y. “An evaluation of statistical approaches to text categorization.” Information Retrieval, 1999. https://doi.org/10.1023/A:1009982220290