158 Precision, Recall, and F1

Accuracy is the metric everyone reaches for first and the one that fails first. When a classifier is asked to find fraud, disease, or defects, the interesting class is usually rare, and a model that predicts “negative” for every input can score 99 percent accuracy while being useless. Precision, recall, and the F1 score exist to describe performance in exactly these settings, where the cost of a mistake depends on which kind of mistake it is. This chapter develops the definitions, the tradeoff between them, the F-beta generalization, the averaging schemes used for multiclass problems, and a practical framework for deciding which metric should drive a decision.

158.1 1. The Confusion Matrix and Basic Definitions

Every count-based classification metric is built from the confusion matrix. For a binary problem with a designated positive class, each prediction falls into one of four cells.

	Predicted positive	Predicted negative
Actual positive	True positive (TP)	False negative (FN)
Actual negative	False positive (FP)	True negative (TN)

A false positive is a negative example wrongly flagged as positive, sometimes called a type I error. A false negative is a positive example the model missed, a type II error. The total number of actual positives is $\text{TP} + \text{FN}$, and the total number of predicted positives is $\text{TP} + \text{FP}$. The four cells partition the dataset exactly, so $\text{TP} + \text{FP} + \text{FN} + \text{TN} = N$, the total number of examples.

flowchart TD
    A["A test example"] --> B{"Actual label"}
    B -->|"positive"| C{"Predicted label"}
    B -->|"negative"| D{"Predicted label"}
    C -->|"positive"| TP["True positive"]
    C -->|"negative"| FN["False negative"]
    D -->|"positive"| FP["False positive"]
    D -->|"negative"| TN["True negative"]

Figure 158.1: How each prediction lands in one of the four confusion-matrix cells.
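As a minimal sketch of the tally, assuming labels are encoded as 0 and 1 with 1 as the positive class, the four cells can be counted directly from paired actual and predicted labels. The function name `confusion_counts` and the toy data are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem with the given positive label."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1  # negative example wrongly flagged (type I error)
        elif actual == positive:
            fn += 1  # positive example missed (type II error)
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical toy labels, chosen only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 2 1 1 4
```

The four counts always sum to the number of examples, which is a cheap sanity check on any hand-rolled tally.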

Accuracy is the fraction of all predictions that are correct:

\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}. \]

The weakness of accuracy is that it averages over both classes weighted by their prevalence. If positives make up 1 percent of the data, the term $\text{TN}$ dominates and swamps any signal about how well the rare class is handled. Precision and recall sidestep this by conditioning on different denominators, each of which ignores the true negative count entirely.
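A short sketch makes the failure concrete. It assumes a hypothetical dataset of 1,000 examples with 10 positives and a classifier that predicts "negative" for every input; the counts are invented for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Always-negative classifier on a hypothetical 1%-positive dataset:
# it never predicts positive, so TP = FP = 0 and all 10 positives are missed.
tp, fp, fn, tn = 0, 0, 10, 990
always_negative_accuracy = accuracy(tp, fp, fn, tn)
positives_found = tp

print(f"accuracy={always_negative_accuracy:.2f}, positives found={positives_found}")
```

The accuracy is driven entirely by the `tn` term, while the number of positives found is zero.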

158.2 2. Precision and Recall

Precision answers the question: of the examples the model labeled positive, what fraction really were positive?

\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}. \]

Precision is the reliability of a positive prediction. High precision means that when the model raises an alarm, you can trust it. It is the metric of interest whenever acting on a positive prediction is expensive or disruptive.

Recall, also called sensitivity or the true positive rate, answers a complementary question: of the examples that really were positive, what fraction did the model find?

\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}. \]

Recall is the coverage of the positive class. High recall means few positives slip through. It is the metric of interest whenever missing a positive is the costly outcome.

A useful way to keep the two straight is to note their denominators. Precision divides by what the model predicted positive (the column), so it is degraded by false positives. Recall divides by what is actually positive (the row), so it is degraded by false negatives. Neither metric references $\text{TN}$, which is why both remain informative under heavy class imbalance.
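The two definitions translate into a few lines of Python. This is a sketch under one assumption that the text does not fix: when a denominator is zero (no predicted positives, or no actual positives) the functions return 0.0. Other conventions exist, so treat that choice as illustrative.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive (assumed convention)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives (assumed convention)."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts. TN is looped over only to show that it has no effect.
tp, fp, fn = 30, 10, 20
for tn in (40, 40_000):
    print(tn, precision(tp, fp), recall(tp, fn))  # 0.75 and 0.6 both times
```

Neither function accepts `tn` as an argument, which is the code-level version of the observation that true negatives never enter either metric.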

It helps to interpret these probabilistically. If we draw a random predicted positive, precision is the probability it is truly positive, $P(Y=1 \mid \hat{Y}=1)$. If we draw a random actual positive, recall is the probability the model catches it, $P(\hat{Y}=1 \mid Y=1)$. The two are conditional probabilities in opposite directions, related through prevalence by Bayes’ rule, which is exactly why one can be high while the other is low.

Making the Bayes link explicit clarifies a recurring surprise. Let the prevalence be $\pi = P(Y=1)$, and write recall as the true positive rate $\text{TPR} = P(\hat{Y}=1 \mid Y=1)$ and the false positive rate as $\text{FPR} = P(\hat{Y}=1 \mid Y=0)$. Then

\[ \text{Precision} = \frac{\pi \cdot \text{TPR}}{\pi \cdot \text{TPR} + (1 - \pi) \cdot \text{FPR}}. \]

Precision depends on the prevalence $\pi$, while recall, which conditions only on the positive subpopulation, does not. This is why a classifier with fixed recall and a fixed false positive rate can still have its precision collapse when the positive class is rare: the denominator is overwhelmed by the many negatives. A detector with a respectable $1\%$ false positive rate applied to a population where positives are $0.1\%$ produces roughly ten false alarms for every true catch even at perfect recall, so precision is at most about $0.09$ no matter how good the model looks on a balanced test set. Reporting precision without stating the prevalence it was measured at is therefore incomplete.
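The prevalence effect is easy to tabulate. The sketch below evaluates the Bayes expression for precision at several prevalences, assuming a detector with perfect recall ($\text{TPR} = 1$) and a $1\%$ false positive rate; the function name `precision_from_rates` is illustrative.

```python
def precision_from_rates(prevalence, tpr, fpr):
    """Precision implied by prevalence, true positive rate, and false positive rate."""
    true_alarms = prevalence * tpr
    false_alarms = (1.0 - prevalence) * fpr
    return true_alarms / (true_alarms + false_alarms)


# Same detector (assumed TPR = 1.0, FPR = 0.01) applied at different prevalences.
for prevalence in (0.5, 0.1, 0.01, 0.001):
    p = precision_from_rates(prevalence, tpr=1.0, fpr=0.01)
    print(f"prevalence={prevalence:<6} precision={p:.3f}")
```

Nothing about the detector changes between rows; only the population does, and precision falls from about 0.99 on balanced data to about 0.09 at a prevalence of 0.1 percent.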

158.3 3. The Precision-Recall Tradeoff

Most classifiers do not emit a hard label directly. They produce a score $s(x)$, often a probability, and a threshold $\tau$ converts it to a decision: predict positive when $s(x) \geq \tau$. Sweeping the threshold traces out the entire range of operating points.

Lowering $\tau$ makes the model more eager to predict positive. It catches more true positives, so recall rises, but it also admits more false positives, so precision tends to fall. Raising $\tau$ does the reverse: the model only commits to a positive when very confident, which lifts precision but lets more positives escape, lowering recall. This inverse pressure is the precision-recall tradeoff, and it is structural rather than a defect of any particular model.

The tradeoff is visualized as a precision-recall curve, plotting precision against recall as $\tau$ varies from permissive to strict. The area under this curve, usually estimated by the average precision, summarizes performance across all thresholds in a single number that, unlike the area under the ROC curve, stays sensitive to imbalance because both its axes ignore true negatives. Average precision is a ranking metric: it measures how well the score $s(x)$ orders positives above negatives, independent of any single threshold, so it is the right tool for comparing models before an operating point has been fixed. Once the operating point is fixed, the threshold-specific precision, recall, and $F_\beta$ take over.

threshold high  ->  high precision, low recall   (cautious)
threshold low   ->  low precision, high recall    (eager)

A subtlety worth flagging is that the precision-recall curve need not be monotone. As the threshold drops, recall is non-decreasing because lowering $\tau$ can only add predicted positives, but precision can rise and fall locally as the next admitted example happens to be a true or a false positive. Only the overall trend trades precision for recall; the curve itself is often jagged on finite data.
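A small threshold sweep shows both behaviors at once. This is a sketch on invented scores and labels, using the rule "predict positive when $s(x) \geq \tau$" and taking each distinct score as a candidate threshold; `pr_points` is a hypothetical helper name.

```python
def pr_points(scores, labels):
    """Return (threshold, precision, recall) at each distinct score, strictest first."""
    total_positive = sum(labels)
    points = []
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        # tp + fp >= 1 here because tau is itself one of the scores.
        points.append((tau, tp / (tp + fp), tp / total_positive))
    return points


# Hypothetical scores and 0/1 labels, for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 0, 1, 1, 0, 1, 0, 0]
points = pr_points(scores, labels)
for tau, p, r in points:
    print(f"tau={tau:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Reading the output from top to bottom, recall never decreases, while precision drops from 1.00 to 0.50, climbs back to 0.75, and then wanders downward.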

The practical consequence is that precision and recall are not properties of a model alone but of a model paired with a threshold. Reporting a single precision number without stating the operating point, or comparing two models at different thresholds, is a common and misleading error. When a target precision or recall is fixed by the application, the right procedure is to choose $\tau$ on a validation set to meet that target and then report the other metric at that point.
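One way to carry out that procedure is sketched below: among all candidate thresholds on a validation set, keep those that meet a precision target and pick the one with the highest recall. The helper name `pick_threshold`, the target of 0.75, and the validation data are all assumptions made for illustration.

```python
def pick_threshold(scores, labels, min_precision):
    """Return (tau, precision, recall) with the highest recall among thresholds
    whose precision is at least min_precision, or None if no threshold qualifies."""
    total_positive = sum(labels)
    best = None
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        p, r = tp / (tp + fp), tp / total_positive
        if p >= min_precision and (best is None or r > best[2]):
            best = (tau, p, r)
    return best


# Hypothetical validation scores and 0/1 labels.
val_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
val_labels = [1, 1, 1, 0, 1, 0, 1, 0]
chosen = pick_threshold(val_scores, val_labels, min_precision=0.75)
print(chosen)
```

The returned threshold would then be frozen and applied unchanged to held-out data; the precision and recall printed here are validation figures, not the final estimate.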

158.4 4. The F1 Score

Often we want a single number that rewards a model only when both precision and recall are reasonable. The natural candidate, the arithmetic mean, is a poor choice because it can be propped up by one component. A model with precision 1.0 and recall 0.02 has arithmetic mean 0.51, which badly overstates a system that finds almost nothing.

The F1 score instead uses the harmonic mean of precision and recall:

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \]

The harmonic mean is dominated by the smaller of its arguments, so $F_1$ is high only when precision and recall are both high. For the example above, $F_1 = \frac{2 (1.0)(0.02)}{1.0 + 0.02} \approx 0.039$, which honestly reflects a broken classifier. The harmonic mean always lies at or below the arithmetic mean, with equality only when precision equals recall.
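The contrast between the two means can be checked numerically. The sketch below compares them on the lopsided example from the text and on two further invented pairs; the function names are illustrative, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def f1(p, r):
    """Harmonic mean of precision and recall; 0.0 if both are zero (assumed convention)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# (precision, recall) pairs: the example from the text plus two hypothetical ones.
pairs = [(1.0, 0.02), (0.5, 0.8), (0.7, 0.7)]
for p, r in pairs:
    print(f"P={p:.2f} R={r:.2f}  arithmetic={arithmetic_mean(p, r):.3f}  F1={f1(p, r):.3f}")
```

The gap between the two columns is largest for the lopsided pair and closes when precision equals recall.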

There is a clean way to see what $F_1$ counts. Substituting the definitions gives an expression purely in confusion matrix cells:

\[ F_1 = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}. \]

This shows that $F_1$ weighs each false positive and each false negative equally and, crucially, never involves $\text{TN}$. That property makes it well suited to imbalanced problems and information retrieval, where the negative class is enormous and uninformative. It also reveals the metric’s main blind spot: by treating the two error types symmetrically, $F_1$ implicitly assumes a false positive and a false negative cost the same. When they do not, $F_1$ is the wrong summary.
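For readers who want the intermediate step behind the count form, substitute the definitions of precision and recall and multiply numerator and denominator by $(\text{TP} + \text{FP})(\text{TP} + \text{FN})$. The final step cancels a factor of $\text{TP}$, so it assumes $\text{TP} > 0$.

\[ F_1 = \frac{2 \cdot \frac{\text{TP}}{\text{TP} + \text{FP}} \cdot \frac{\text{TP}}{\text{TP} + \text{FN}}}{\frac{\text{TP}}{\text{TP} + \text{FP}} + \frac{\text{TP}}{\text{TP} + \text{FN}}} = \frac{2\,\text{TP}^2}{\text{TP}\,(\text{TP} + \text{FN}) + \text{TP}\,(\text{TP} + \text{FP})} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}. \]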

158.4.1 A worked example

Concrete numbers anchor the definitions. Suppose a fraud model scores $10{,}000$ transactions, of which $100$ are genuinely fraudulent ($1\%$ prevalence). At a chosen threshold the model flags $160$ transactions as fraud, and $80$ of those flags are correct. The confusion matrix is then

	Predicted fraud	Predicted legitimate
Actual fraud	$\text{TP} = 80$	$\text{FN} = 20$
Actual legitimate	$\text{FP} = 80$	$\text{TN} = 9820$

The metrics follow directly:

\[ \text{Accuracy} = \frac{80 + 9820}{10000} = 0.990, \qquad \text{Precision} = \frac{80}{80 + 80} = 0.500, \qquad \text{Recall} = \frac{80}{80 + 20} = 0.800. \]

\[ F_1 = \frac{2(0.5)(0.8)}{0.5 + 0.8} \approx 0.615, \qquad F_2 = \frac{(1 + 4)(0.5)(0.8)}{4(0.5) + 0.8} \approx 0.714, \qquad F_{0.5} = \frac{(1.25)(0.5)(0.8)}{0.25(0.5) + 0.8} \approx 0.541. \]

Accuracy reads as a glowing $99\%$ even though half of all fraud alerts are false alarms and one fraud in five escapes. The recall-weighted $F_2$ exceeds the precision-weighted $F_{0.5}$ here precisely because this model’s recall is higher than its precision, so a metric that prizes recall rewards it more. The example is the whole argument of the chapter in miniature: the headline accuracy hides everything that matters, and which $F_\beta$ you quote silently encodes a judgment about whether the missed frauds or the false alarms are worse.
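The arithmetic of the worked example can be reproduced in a few lines. The counts are the ones from the confusion matrix above; the helper name `f_beta` is hypothetical and implements the $F_\beta$ formula as this chapter states it.

```python
def f_beta(p, r, beta):
    """F-beta from precision and recall; 0.0 if the denominator is zero (assumed convention)."""
    b2 = beta ** 2
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


# Counts from the fraud example.
tp, fp, fn, tn = 80, 80, 20, 9820

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"F1={f_beta(precision, recall, 1):.3f} "
      f"F2={f_beta(precision, recall, 2):.3f} "
      f"F0.5={f_beta(precision, recall, 0.5):.3f}")
```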

158.5 5. The F-beta Family

The symmetry of $F_1$ is a special case of a parameterized family that lets us weight recall more or less heavily than precision. The F-beta score is

\[ F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}. \]

The parameter $\beta$ sets the relative importance of recall to precision. The standard interpretation is that recall is considered $\beta$ times as important as precision. Setting $\beta = 1$ recovers $F_1$. Two common choices anchor the intuition:

$F_2$ ($\beta = 2$) weights recall higher than precision. It is used when missing a positive is the more serious error, such as screening for a dangerous disease where a missed case is far worse than a false alarm.
$F_{0.5}$ ($\beta = 0.5$) weights precision higher than recall. It is used when acting on a false positive is costly, such as a system that automatically blocks accounts, where wrongly punishing a legitimate user is the error to avoid.

The limits make the weighting explicit. As $\beta \to 0$, $F_\beta \to \text{Precision}$; as $\beta \to \infty$, $F_\beta \to \text{Recall}$. A helpful way to read $\beta$ is through the implied cost ratio: choosing $\beta$ asserts that you are willing to tolerate $\beta^2$ false positives to avoid one false negative. So $F_2$ encodes a four-to-one tolerance for false positives over false negatives, and $F_{0.5}$ encodes the inverse. Choosing $\beta$ is therefore not a tuning knob to be optimized blindly but a statement about the cost structure of the application, and it should be set from that cost structure rather than from whatever value makes a model look best.
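The two limits can be observed numerically by sweeping $\beta$ for a fixed operating point. The sketch below uses an assumed precision of 0.5 and recall of 0.8; the helper name `f_beta` is hypothetical.

```python
def f_beta(precision, recall, beta):
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Assumed operating point, for illustration only.
precision, recall = 0.5, 0.8
betas = (0.01, 0.5, 1, 2, 100)
values = [f_beta(precision, recall, b) for b in betas]
for b, v in zip(betas, values):
    print(f"beta={b:<5} F_beta={v:.4f}")
```

With recall above precision, the score climbs from roughly the precision at small $\beta$ toward roughly the recall at large $\beta$.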

Two properties carry over from the $F_1$ case. First, $F_\beta$ is a weighted harmonic mean of precision and recall, which equals the form above when the recall weight is $\beta^2 / (1 + \beta^2)$ and the precision weight is $1 / (1 + \beta^2)$; writing $1 / F_\beta = \frac{1}{1 + \beta^2} \cdot \frac{1}{\text{Precision}} + \frac{\beta^2}{1 + \beta^2} \cdot \frac{1}{\text{Recall}}$ makes this explicit. Because it is a harmonic mean, $F_\beta$ lies between the two components and is zero whenever either precision or recall is zero, so no value of $\beta$ can rescue a model that produces no true positives. Second, in confusion-matrix terms,

\[ F_\beta = \frac{(1 + \beta^2)\,\text{TP}}{(1 + \beta^2)\,\text{TP} + \beta^2 \,\text{FN} + \text{FP}}, \]

which shows the asymmetry plainly: each false negative is weighted $\beta^2$ times as heavily as each false positive, and true negatives never appear.
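The three ways of writing $F_\beta$ can be checked against each other. In the sketch below, the weighted harmonic mean puts weight $1/(1+\beta^2)$ on $1/\text{Precision}$ and $\beta^2/(1+\beta^2)$ on $1/\text{Recall}$; the function names and the counts are illustrative, and all three functions assume nonzero precision and recall.

```python
def f_beta_ratio(p, r, beta):
    """F-beta in the precision/recall ratio form."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


def f_beta_harmonic(p, r, beta):
    """F-beta as a weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    w_precision = 1 / (1 + b2)
    w_recall = b2 / (1 + b2)
    return 1 / (w_precision / p + w_recall / r)


def f_beta_counts(tp, fp, fn, beta):
    """F-beta written directly in confusion-matrix counts."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical counts with nonzero precision and recall.
tp, fp, fn = 80, 80, 20
p, r = tp / (tp + fp), tp / (tp + fn)
for beta in (0.5, 1, 2):
    forms = (f_beta_ratio(p, r, beta),
             f_beta_harmonic(p, r, beta),
             f_beta_counts(tp, fp, fn, beta))
    print(beta, [round(v, 4) for v in forms])
```

Each row prints the same value three times, up to floating-point rounding.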

158.6 6. Averaging Across Classes: Micro, Macro, and Weighted

The definitions so far assume a single positive class. Multiclass and multilabel problems have many classes, each with its own precision and recall computed in a one-versus-rest manner. To report a single figure we must aggregate, and the choice of aggregation changes the meaning of the number.

Let there be $K$ classes, and write $\text{TP}_k$, $\text{FP}_k$, $\text{FN}_k$ for the per-class counts.

Micro averaging pools the counts across all classes first, then computes the metric once on the global totals:

\[ \text{Precision}_{\text{micro}} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FP}_k)}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FN}_k)}. \]

Because every individual prediction contributes equally to the pooled totals, micro averaging is dominated by frequent classes. For single-label multiclass classification, micro precision, micro recall, and micro $F_1$ all equal overall accuracy, since every error is simultaneously a false positive for one class and a false negative for another. Micro averaging answers: how well does the model classify a randomly chosen instance?
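The identity between micro averages and accuracy can be verified on a small single-label example. The sketch below tallies one-versus-rest counts per class and pools them; `per_class_counts` is a hypothetical helper and the labels are invented.

```python
def per_class_counts(y_true, y_pred):
    """One-versus-rest [tp, fp, fn] for every class label that appears."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {k: [0, 0, 0] for k in classes}
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            counts[actual][0] += 1      # TP for the shared class
        else:
            counts[predicted][1] += 1   # FP for the predicted class
            counts[actual][2] += 1      # FN for the actual class
    return counts


# Hypothetical single-label data with three classes.
y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]

counts = per_class_counts(y_true, y_pred)
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
print(micro_precision, micro_recall, accuracy)  # 0.7 0.7 0.7
```

Each misclassification increments exactly one `fp` and one `fn`, so the pooled false positive and false negative totals are equal and both denominators reduce to the number of examples.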

Macro averaging computes the metric per class and then takes an unweighted mean:

\[ \text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k. \]

Every class counts the same regardless of size, so a tiny class has as much influence as a huge one. Macro averaging is the right choice when rare classes matter as much as common ones, for example a medical taxonomy where rare conditions must not be ignored. Its sensitivity to small classes is also a hazard: a class with few examples produces a noisy per-class score that the unweighted mean propagates directly into the headline number.

Weighted averaging is a compromise that takes the mean of per-class metrics weighted by each class’s support $n_k$, the number of true instances of that class:

\[ \text{Precision}_{\text{weighted}} = \frac{1}{N} \sum_{k=1}^{K} n_k \cdot \text{Precision}_k, \qquad N = \sum_k n_k. \]

This restores the influence of frequent classes that macro averaging discards, while still computing the metric per class. Note that weighted $F_1$ is not in general bounded between weighted precision and weighted recall, an occasional source of confusion when reading a report.

micro    : pool all TP/FP/FN, then compute  -> favors frequent classes
macro    : per-class metric, plain average  -> every class equal, small classes loud
weighted : per-class metric, support-weighted average -> compromise
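The three schemes can be compared side by side on one set of per-class counts. This is a sketch with invented counts for three classes of very different sizes; treating an empty denominator as 0.0 is an assumed convention, and it matters here because the rare class is never predicted.

```python
def safe_div(num, den):
    return num / den if den else 0.0  # assumed convention for empty denominators


# Hypothetical one-versus-rest counts: class -> (tp, fp, fn).
counts = {"a": (5, 2, 1), "b": (2, 1, 1), "c": (0, 0, 1)}

precision_k = {k: safe_div(tp, tp + fp) for k, (tp, fp, fn) in counts.items()}
support = {k: tp + fn for k, (tp, fp, fn) in counts.items()}  # n_k: true instances
n_total = sum(support.values())

# Micro: pool the counts, then divide once.
micro = safe_div(sum(tp for tp, _, _ in counts.values()),
                 sum(tp + fp for tp, fp, _ in counts.values()))
# Macro: plain mean of the per-class values.
macro = sum(precision_k.values()) / len(counts)
# Weighted: per-class values averaged with support weights.
weighted = sum(support[k] * precision_k[k] for k in counts) / n_total

print(f"micro={micro:.3f} macro={macro:.3f} weighted={weighted:.3f}")
```

On these counts the rare class `c` contributes a precision of zero, which pulls the macro figure well below the micro figure, with the weighted figure in between.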

The guidance is straightforward. Report macro when you care about performance on every class equally, especially rare ones. Report micro, or equivalently accuracy in the single-label case, when you care about aggregate instance-level correctness. Report weighted when you want a class-aware figure that still reflects the population mix. A large gap between macro and micro scores is itself diagnostic: it signals that the model performs very differently on rare classes than on common ones, which is exactly the situation a single accuracy number would hide.

158.7 7. Choosing Between Precision and Recall by Use Case

No metric is correct in the abstract. The right metric follows from the asymmetric cost of the two error types in the deployed system. The disciplined approach is to ask which mistake hurts more, a false positive or a false negative, and to let that answer pick the metric.

Optimize recall when a missed positive is the expensive error. Disease screening is the canonical example: failing to flag a patient who has the condition can be fatal, while a false alarm leads to a follow-up test. The same logic governs fraud detection at the screening stage, security threat detection, and search and rescue, where the downstream cost of investigating a false positive is small relative to the cost of missing a real case. These settings call for a low threshold and an $F_\beta$ with $\beta > 1$.

Optimize precision when a false positive is the expensive error. Consider a spam filter that diverts mail to a junk folder. A false positive, a legitimate and possibly important message silently hidden, is far worse than a false negative, a spam message that reaches the inbox. Recommender systems, automated content moderation that removes posts, and any pipeline that takes an irreversible or costly action on a positive prediction share this profile. These settings call for a high threshold and an $F_\beta$ with $\beta < 1$.

Use a balanced metric when the costs are roughly symmetric or genuinely unknown. $F_1$ is the sensible default for general document classification, balanced benchmarking, and early model development before deployment costs are quantified.

Two refinements matter in practice. First, the costs are frequently not constant per error but scale with the instance: a fraudulent transaction of one hundred thousand dollars is not equivalent to one of ten dollars. When such weights are available, an expected cost objective that multiplies each error by its monetary impact dominates any unweighted count metric, and precision, recall, and $F_\beta$ should be treated as proxies that approximate it. Second, the operating threshold should be selected on validation data to satisfy whatever constraint the business imposes, such as “maximize recall subject to precision at least 0.9,” and then evaluated once on a held-out test set to get an honest estimate. Choosing the threshold and reporting on the same data inflates the result, and quoting precision or recall without naming the threshold that produced it is not a complete statement of performance.
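The first refinement can be sketched as an empirical cost computed per threshold. The cost model below is an assumption made for illustration: a missed fraud costs the transaction amount, and a false alarm costs a flat review fee. The function name `total_cost`, the fee, and the data are all hypothetical.

```python
def total_cost(scores, labels, amounts, tau, review_cost):
    """Empirical cost at threshold tau under the assumed cost model:
    a missed fraud costs its amount, a false alarm costs a flat review fee."""
    cost = 0.0
    for s, y, amount in zip(scores, labels, amounts):
        flagged = s >= tau
        if y == 1 and not flagged:
            cost += amount        # false negative: the fraud goes through
        elif y == 0 and flagged:
            cost += review_cost   # false positive: analyst time
    return cost


# Hypothetical validation data: score, 0/1 fraud label, transaction amount.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 1, 0]
amounts = [10.0, 50.0, 100_000.0, 20.0, 10.0, 30.0]

costs = {tau: total_cost(scores, labels, amounts, tau, review_cost=25.0)
         for tau in (0.85, 0.5, 0.25)}
best_tau = min(costs, key=costs.get)
print(costs, best_tau)
```

The strictest threshold misses a single large fraud and is by far the most expensive, even though it raises no false alarms. The cheapest threshold here is the middle one, which tolerates one false alarm and one small missed fraud, an outcome an unweighted count metric would not single out.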

158.8 8. Common Pitfalls

A handful of mistakes recur often enough to list explicitly.

Quoting precision or recall without the threshold. A single precision number is meaningless until the operating point is named, because sliding $\tau$ moves it anywhere along the curve.
Comparing models at different operating points. Model A at $\tau = 0.3$ and Model B at $\tau = 0.7$ are not comparable. Compare the full curves, or fix a shared constraint (such as precision at least $0.9$) and read off the other metric.
Picking the threshold on the test set. Choosing $\tau$ to maximize a metric and then reporting that same metric on the same data inflates the result. Select the threshold on validation data and evaluate once on a held-out test set.
Reading accuracy on imbalanced data. On a $1\%$-positive problem a do-nothing classifier scores $99\%$ accuracy. Accuracy and micro $F_1$ coincide for single-label multiclass, so neither one exposes failures on rare classes.
Treating $F_1$ as cost-neutral. $F_1$ is not free of assumptions: it asserts that a false positive and a false negative cost the same. If they do not, use $F_\beta$ with a $\beta$ derived from the actual cost ratio, or optimize expected cost directly.
Mismatching the averaging scheme to the question. Macro and micro answer different questions; a large gap between them is a signal about rare-class performance, not a number to average away.

158.9 9. Summary

Precision and recall decompose classification quality into the reliability of positive predictions and the coverage of actual positives, neither of which is fooled by class imbalance because neither counts true negatives. They trade off against each other as the decision threshold moves, so an operating point must always be specified. The F1 score combines them through the harmonic mean to reward balanced performance, and the F-beta family generalizes this to encode an explicit preference for precision or recall through the cost ratio $\beta^2$. For multiclass problems, micro averaging measures aggregate instance correctness, macro averaging gives every class equal voice, and weighted averaging compromises by support. The choice among all of these is ultimately a question about the cost of errors in the real system, and the metric should be derived from that cost structure rather than chosen for convenience.


# Precision, Recall, and F1 Accuracy is the metric everyone reaches for first and the one that fails first. When a classifier is asked to find fraud, disease, or defects, the interesting class is usually rare, and a model that predicts "negative" for every input can score 99 percent accuracy while being useless. Precision, recall, and the F1 score exist to describe performance in exactly these settings, where the cost of a mistake depends on which kind of mistake it is. This chapter develops the definitions, the tradeoff between them, the F-beta generalization, the averaging schemes used for multiclass problems, and a practical framework for deciding which metric should drive a decision. ## 1. The Confusion Matrix and Basic Definitions Every count based classification metric is built from the confusion matrix. For a binary problem with a designated positive class, each prediction falls into one of four cells. | | Predicted positive | Predicted negative | |---|---|---| | **Actual positive** | True positive (TP) | False negative (FN) | | **Actual negative** | False positive (FP) | True negative (TN) | A false positive is a negative example wrongly flagged as positive, sometimes called a type I error. A false negative is a positive example the model missed, a type II error. The total number of actual positives is $\text{TP} + \text{FN}$, and the total number of predicted positives is $\text{TP} + \text{FP}$. The four cells partition the dataset exactly, so $\text{TP} + \text{FP} + \text{FN} + \text{TN} = N$, the total number of examples. ```{mermaid} %%| label: fig-confusion %%| fig-cap: "How each prediction lands in one of the four confusion-matrix cells." 
flowchart TD A["A test example"] --> B{"Actual label"} B -->|"positive"| C{"Predicted label"} B -->|"negative"| D{"Predicted label"} C -->|"positive"| TP["True positive"] C -->|"negative"| FN["False negative"] D -->|"positive"| FP["False positive"] D -->|"negative"| TN["True negative"] ``` **Accuracy** is the fraction of all predictions that are correct: $$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}. $$ The weakness of accuracy is that it averages over both classes weighted by their prevalence. If positives make up 1 percent of the data, the term $\text{TN}$ dominates and swamps any signal about how well the rare class is handled. Precision and recall sidestep this by conditioning on different denominators, each of which ignores the true negative count entirely. ## 2. Precision and Recall **Precision** answers the question: of the examples the model labeled positive, what fraction really were positive? $$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}. $$ Precision is the reliability of a positive prediction. High precision means that when the model raises an alarm, you can trust it. It is the metric of interest whenever acting on a positive prediction is expensive or disruptive. **Recall**, also called sensitivity or the true positive rate, answers a complementary question: of the examples that really were positive, what fraction did the model find? $$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}. $$ Recall is the coverage of the positive class. High recall means few positives slip through. It is the metric of interest whenever missing a positive is the costly outcome. A useful way to keep the two straight is to note their denominators. Precision divides by what the model predicted positive (the column), so it is degraded by false positives. Recall divides by what is actually positive (the row), so it is degraded by false negatives. 
Neither metric references $\text{TN}$, which is why both remain informative under heavy class imbalance. It helps to interpret these probabilistically. If we draw a random predicted positive, precision is the probability it is truly positive, $P(Y=1 \mid \hat{Y}=1)$. If we draw a random true positive, recall is the probability the model catches it, $P(\hat{Y}=1 \mid Y=1)$. The two are conditional probabilities in opposite directions, related through prevalence by Bayes' rule, which is exactly why one can be high while the other is low. Making the Bayes link explicit clarifies a recurring surprise. Let the prevalence be $\pi = P(Y=1)$, and write recall as the true positive rate $\text{TPR} = P(\hat{Y}=1 \mid Y=1)$ and the false positive rate as $\text{FPR} = P(\hat{Y}=1 \mid Y=0)$. Then $$ \text{Precision} = \frac{\pi \cdot \text{TPR}}{\pi \cdot \text{TPR} + (1 - \pi) \cdot \text{FPR}}. $$ Precision depends on the prevalence $\pi$, while recall, which conditions only on the positive subpopulation, does not. This is why a classifier with fixed recall and a fixed false positive rate can still have its precision collapse when the positive class is rare: the denominator is overwhelmed by the many negatives. A detector with a respectable $1\%$ false positive rate applied to a population where positives are $0.1\%$ produces roughly ten false alarms for every true catch, so precision sits near $0.09$ no matter how good the model looks on a balanced test set. Reporting precision without stating the prevalence it was measured at is therefore incomplete. ## 3. The Precision-Recall Tradeoff Most classifiers do not emit a hard label directly. They produce a score $s(x)$, often a probability, and a threshold $\tau$ converts it to a decision: predict positive when $s(x) \geq \tau$. Sweeping the threshold traces out the entire range of operating points. Lowering $\tau$ makes the model more eager to predict positive. 
It catches more true positives, so recall rises, but it also admits more false positives, so precision tends to fall. Raising $\tau$ does the reverse: the model only commits to a positive when very confident, which lifts precision but lets more positives escape, lowering recall. This inverse pressure is the precision-recall tradeoff, and it is structural rather than a defect of any particular model. The tradeoff is visualized as a precision-recall curve, plotting precision against recall as $\tau$ varies from permissive to strict. The area under this curve, the average precision, summarizes performance across all thresholds in a single number that, unlike the area under the ROC curve, stays sensitive to imbalance because both its axes ignore true negatives. Average precision is a ranking metric: it measures how well the score $s(x)$ orders positives above negatives, independent of any single threshold, so it is the right tool for comparing models before an operating point has been fixed. Once the operating point is fixed, the threshold-specific precision, recall, and $F_\beta$ take over. ```text threshold high -> high precision, low recall (cautious) threshold low -> low precision, high recall (eager) ``` A subtlety worth flagging is that the precision-recall curve need not be monotone. As the threshold drops, recall is non-decreasing because lowering $\tau$ can only add predicted positives, but precision can rise and fall locally as the next admitted example happens to be a true or a false positive. Only the overall trend trades precision for recall; the curve itself is often jagged on finite data. The practical consequence is that precision and recall are not properties of a model alone but of a model paired with a threshold. Reporting a single precision number without stating the operating point, or comparing two models at different thresholds, is a common and misleading error. 
When a target precision or recall is fixed by the application, the right procedure is to choose $\tau$ on a validation set to meet that target and then report the other metric at that point. ## 4. The F1 Score Often we want a single number that rewards a model only when both precision and recall are reasonable. The natural candidate, the arithmetic mean, is a poor choice because it can be propped up by one component. A model with precision 1.0 and recall 0.02 has arithmetic mean 0.51, which badly overstates a system that finds almost nothing. The **F1 score** instead uses the harmonic mean of precision and recall: $$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. $$ The harmonic mean is dominated by the smaller of its arguments, so $F_1$ is high only when precision and recall are both high. For the example above, $F_1 = \frac{2 (1.0)(0.02)}{1.0 + 0.02} \approx 0.039$, which honestly reflects a broken classifier. The harmonic mean always lies at or below the arithmetic mean, with equality only when precision equals recall. There is a clean way to see what $F_1$ counts. Substituting the definitions gives an expression purely in confusion matrix cells: $$ F_1 = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}. $$ This shows that $F_1$ weighs each false positive and each false negative equally and, crucially, never involves $\text{TN}$. That property makes it well suited to imbalanced problems and information retrieval, where the negative class is enormous and uninformative. It also reveals the metric's main blind spot: by treating the two error types symmetrically, $F_1$ implicitly assumes a false positive and a false negative cost the same. When they do not, $F_1$ is the wrong summary. ### A worked example Concrete numbers anchor the definitions. Suppose a fraud model scores $10{,}000$ transactions, of which $100$ are genuinely fraudulent ($1\%$ prevalence). 
At a chosen threshold the model flags $160$ transactions as fraud, and $80$ of those flags are correct. The confusion matrix is then | | Predicted fraud | Predicted legitimate | |---|---|---| | **Actual fraud** | $\text{TP} = 80$ | $\text{FN} = 20$ | | **Actual legitimate** | $\text{FP} = 80$ | $\text{TN} = 9820$ | The metrics follow directly: $$ \text{Accuracy} = \frac{80 + 9820}{10000} = 0.990, \qquad \text{Precision} = \frac{80}{80 + 80} = 0.500, \qquad \text{Recall} = \frac{80}{80 + 20} = 0.800. $$ $$ F_1 = \frac{2(0.5)(0.8)}{0.5 + 0.8} \approx 0.615, \qquad F_2 = \frac{(1 + 4)(0.5)(0.8)}{4(0.5) + 0.8} \approx 0.714, \qquad F_{0.5} = \frac{(1.25)(0.5)(0.8)}{0.25(0.5) + 0.8} \approx 0.541. $$ Accuracy reads as a glowing $99\%$ even though half of every fraud alert is a false alarm and one fraud in five escapes. The recall-weighted $F_2$ exceeds the precision-weighted $F_{0.5}$ here precisely because this model recalls more than it precisions, so a metric that prizes recall rewards it more. The example is the whole argument of the chapter in miniature: the headline accuracy hides everything that matters, and which $F_\beta$ you quote silently encodes a judgment about whether the missed frauds or the false alarms are worse. ## 5. The F-beta Family The symmetry of $F_1$ is a special case of a parameterized family that lets us weight recall more or less heavily than precision. The **F-beta score** is $$ F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}. $$ The parameter $\beta$ sets the relative importance of recall to precision. The standard interpretation is that recall is considered $\beta$ times as important as precision. Setting $\beta = 1$ recovers $F_1$. Two common choices anchor the intuition: - $F_2$ ($\beta = 2$) weights recall higher than precision. 
  It is used when missing a positive is the more serious error, such as screening for a dangerous disease where a missed case is far worse than a false alarm.
- $F_{0.5}$ ($\beta = 0.5$) weights precision more heavily than recall. It is used when acting on a false positive is costly, such as a system that automatically blocks accounts, where wrongly punishing a legitimate user is the error to avoid.

The limits make the weighting explicit. As $\beta \to 0$, $F_\beta \to \text{Precision}$; as $\beta \to \infty$, $F_\beta \to \text{Recall}$. A helpful way to read $\beta$ is through the implied cost ratio: choosing $\beta$ asserts that you are willing to tolerate $\beta^2$ false positives to avoid one false negative. So $F_2$ encodes a four to one tolerance for false positives over false negatives, and $F_{0.5}$ encodes the inverse. The parameter $\beta$ is therefore not a tuning knob to be optimized blindly but a statement about the cost structure of the application, and it should be set from that cost structure rather than from whatever value makes a model look best.

Two properties carry over from the $F_1$ case. First, $F_\beta$ is a weighted harmonic mean of precision and recall: writing $1 / F_\beta = w_P / \text{Precision} + w_R / \text{Recall}$, the form above corresponds to a precision weight $w_P = 1 / (1 + \beta^2)$ and a recall weight $w_R = \beta^2 / (1 + \beta^2)$. Because it is a weighted harmonic mean, $F_\beta$ always lies between precision and recall, and it is zero whenever $\text{TP} = 0$, so no value of $\beta$ can rescue a model that finds no true positives. Second, in confusion-matrix terms,

$$
F_\beta = \frac{(1 + \beta^2)\,\text{TP}}{(1 + \beta^2)\,\text{TP} + \beta^2 \,\text{FN} + \text{FP}},
$$

which shows the asymmetry plainly: each false negative is weighted $\beta^2$ times as heavily as each false positive, and true negatives never appear.

## 6. Averaging Across Classes: Micro, Macro, and Weighted

The definitions so far assume a single positive class.
Multiclass and multilabel problems have many classes, each with its own precision and recall computed in a one versus rest manner. To report a single figure we must aggregate, and the choice of aggregation changes the meaning of the number. Let there be $K$ classes, and write $\text{TP}_k$, $\text{FP}_k$, $\text{FN}_k$ for the per class counts.

**Micro averaging** pools the counts across all classes first, then computes the metric once on the global totals:

$$
\text{Precision}_{\text{micro}} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FP}_k)}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FN}_k)}.
$$

Because every individual prediction contributes equally to the pooled totals, micro averaging is dominated by frequent classes. For single label multiclass classification, micro precision, micro recall, and micro $F_1$ all equal overall accuracy, since every error is simultaneously a false positive for one class and a false negative for another. Micro averaging answers: how well does the model classify a randomly chosen instance?

**Macro averaging** computes the metric per class and then takes an unweighted mean:

$$
\text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k.
$$

Every class counts the same regardless of size, so a tiny class has as much influence as a huge one. Macro averaging is the right choice when rare classes matter as much as common ones, for example a medical taxonomy where rare conditions must not be ignored. Its sensitivity to small classes is also a hazard: a class with few examples produces a noisy per class score that the unweighted mean propagates directly into the headline number.

**Weighted averaging** is a compromise that takes the mean of per class metrics weighted by each class's support $n_k$, the number of true instances of that class:

$$
\text{Precision}_{\text{weighted}} = \frac{1}{N} \sum_{k=1}^{K} n_k \cdot \text{Precision}_k, \qquad N = \sum_k n_k.
$$

This restores the influence of frequent classes that macro averaging discards, while still computing the metric per class. Note that weighted $F_1$ is not in general bounded between weighted precision and weighted recall, an occasional source of confusion when reading a report.

```text
micro    : pool all TP/FP/FN, then compute             -> favors frequent classes
macro    : per-class metric, plain average             -> every class equal, small classes loud
weighted : per-class metric, support-weighted average  -> compromise
```

The guidance is straightforward. Report macro when you care about performance on every class equally, especially rare ones. Report micro, which equals accuracy in the single label case, when you care about aggregate instance level correctness. Report weighted when you want a class aware figure that still reflects the population mix. A large gap between macro and micro scores is itself diagnostic: it signals that the model performs very differently on rare classes than on common ones, which is exactly the situation a single accuracy number would hide.

## 7. Choosing Between Precision and Recall by Use Case

No metric is correct in the abstract. The right metric follows from the asymmetric cost of the two error types in the deployed system. The disciplined approach is to ask which mistake hurts more, a false positive or a false negative, and to let that answer pick the metric.

**Optimize recall when a missed positive is the expensive error.** Disease screening is the canonical example: failing to flag a patient who has the condition can be fatal, while a false alarm leads to a follow up test. The same logic governs fraud detection at the screening stage, security threat detection, and search and rescue, where the downstream cost of investigating a false positive is small relative to the cost of missing a real case. These settings call for a low threshold and an $F_\beta$ with $\beta > 1$.

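
To make the effect of choosing $\beta > 1$ concrete, here is a minimal sketch that evaluates the confusion-matrix form of $F_\beta$ from Section 5 on the fraud counts from the worked example in Section 4. The function name `f_beta` and its convention of returning zero when the denominator is zero are assumptions made for illustration; they are not taken from any library.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta computed from confusion-matrix counts (true negatives are not used)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumed convention: with no TP, FP, or FN at all, report 0.0 instead of dividing by zero.
    return (1 + b2) * tp / denom if denom else 0.0


# Counts from the fraud example: 80 true positives, 80 false positives, 20 false negatives.
tp, fp, fn = 80, 80, 20

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F_beta={f_beta(tp, fp, fn, beta):.3f}")
```

On these counts the sketch gives approximately $0.541$, $0.615$, and $0.714$ for $\beta = 0.5$, $1$, and $2$, matching the worked example: the same model looks better as more weight shifts to recall, because its recall is the stronger of its two components.
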
**Optimize precision when a false positive is the expensive error.** Consider a spam filter that diverts mail to a junk folder. A false positive, a legitimate and possibly important message silently hidden, is far worse than a false negative, a spam message that reaches the inbox. Recommender systems, automated content moderation that removes posts, and any pipeline that takes an irreversible or costly action on a positive prediction share this profile. These settings call for a high threshold and an $F_\beta$ with $\beta < 1$.

**Use a balanced metric when the costs are roughly symmetric or genuinely unknown.** $F_1$ is the sensible default for general document classification, balanced benchmarking, and early model development before deployment costs are quantified.

Two refinements matter in practice. First, the costs are frequently not constant per error but scale with the instance: a fraudulent transaction of one hundred thousand dollars is not equivalent to one of ten dollars. When such weights are available, an expected cost objective that multiplies each error by its monetary impact dominates any unweighted count metric, and precision, recall, and $F_\beta$ should be treated as proxies that approximate it. Second, the operating threshold should be selected on validation data to satisfy whatever constraint the business imposes, such as "maximize recall subject to precision at least 0.9," and then evaluated once on a held out test set to get an honest estimate. Choosing the threshold and reporting on the same data inflates the result, and quoting precision or recall without naming the threshold that produced it is not a complete statement of performance.

## 8. Common Pitfalls

A handful of mistakes recur often enough to list explicitly.

- **Quoting precision or recall without the threshold.** A single precision number is meaningless until the operating point is named, because sliding $\tau$ moves it anywhere along the curve.
- **Comparing models at different operating points.** Model A at $\tau = 0.3$ and Model B at $\tau = 0.7$ are not comparable. Compare the full curves, or fix a shared constraint (such as precision at least $0.9$) and read off the other metric.
- **Picking the threshold on the test set.** Choosing $\tau$ to maximize a metric and then reporting that same metric on the same data inflates the result. Select the threshold on validation data and evaluate once on a held-out test set.
- **Reading accuracy on imbalanced data.** On a $1\%$-positive problem a do-nothing classifier scores $99\%$ accuracy. Accuracy and micro $F_1$ coincide for single-label multiclass, so neither one exposes failures on rare classes.
- **Treating $F_1$ as cost-neutral.** $F_1$ is not free of assumptions: it asserts that a false positive and a false negative cost the same. If they do not, use $F_\beta$ with a $\beta$ derived from the actual cost ratio, or optimize expected cost directly.
- **Mismatching the averaging scheme to the question.** Macro and micro answer different questions; a large gap between them is a signal about rare-class performance, not a number to average away.

## 9. Summary

Precision and recall decompose classification quality into the reliability of positive predictions and the coverage of actual positives. Neither counts true negatives, so neither can be inflated by a large negative class the way accuracy can. They trade off against each other as the decision threshold moves, so an operating point must always be specified. The F1 score combines them through the harmonic mean to reward balanced performance, and the F-beta family generalizes this to encode an explicit preference for precision or recall through the cost ratio $\beta^2$. For multiclass problems, micro averaging measures aggregate instance correctness, macro averaging gives every class equal voice, and weighted averaging compromises by support.
The choice among all of these is ultimately a question about the cost of errors in the real system, and the metric should be derived from that cost structure rather than chosen for convenience.
