157 Classification Accuracy and the Confusion Matrix

Classification sits at the center of supervised machine learning, and almost every classification project eventually confronts a deceptively simple question: how good is this model? The answer begins with accuracy, the single most widely reported and most widely misunderstood metric in the field. This chapter develops the confusion matrix as the foundational data structure from which all common classification metrics are derived, defines accuracy precisely, and then dismantles the assumption that a high accuracy figure is sufficient evidence of a useful classifier. We pay particular attention to the accuracy paradox, the situation in which a model achieves impressive accuracy precisely because it has failed to learn anything interesting about the rare class that motivated the project. By the end, you should be able to state, with rigor, exactly when accuracy is the right tool and when it quietly lies.

157.1 1. The Confusion Matrix

157.1.1 1.1 Definition and Structure

Consider a binary classifier that assigns each instance to one of two classes, conventionally labeled positive and negative. Given a set of $n$ labeled examples, every prediction falls into exactly one of four categories formed by crossing the true label with the predicted label. These four counts constitute the confusion matrix.

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

The diagonal entries, TP and TN, are correct predictions. The off-diagonal entries, FP and FN, are the two distinct kinds of error. A false positive is a negative instance mistakenly flagged as positive, sometimes called a Type I error. A false negative is a positive instance missed by the classifier, a Type II error. The total number of examples satisfies

\[ n = \text{TP} + \text{TN} + \text{FP} + \text{FN}. \]

Each cell is the outcome of two questions asked in sequence: what is the instance's true label, and what label did the classifier predict? The pair of answers determines the cell. The following diagram traces that flow for a single instance.

flowchart TD
    A["Instance with true label"] --> B{"True label is positive?"}
    B -->|"Yes"| C{"Predicted positive?"}
    B -->|"No"| D{"Predicted positive?"}
    C -->|"Yes"| TP["True Positive"]
    C -->|"No"| FN["False Negative"]
    D -->|"Yes"| FP["False Positive"]
    D -->|"No"| TN["True Negative"]
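The same flow can be written as a few lines of code. The following is a minimal Python sketch, assuming labels are encoded as 1 for positive and 0 for negative; the function name `confusion_counts` and the sample labels are illustrative choices, not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels encoded as 1 = positive, 0 = negative."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:              # first question: is the true label positive?
            if pred == 1:           # second question: did the classifier say positive?
                tp += 1
            else:
                fn += 1
        else:
            if pred == 1:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

Every instance lands in exactly one cell, so the four counts always sum to the number of examples.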

The power of the confusion matrix is that it preserves the structure of the errors rather than collapsing them. Two classifiers can share an identical accuracy yet have wildly different distributions of FP and FN, and in most real applications those two error types carry very different costs. A spam filter that deletes a legitimate email (a false positive on the “spam” class) inflicts a different harm than one that lets a junk message through (a false negative). The confusion matrix is the object that keeps this distinction visible.

Formally, on a fixed test set the confusion matrix is a sufficient statistic for any classification metric that depends only on the agreement between predicted and true labels. If two evaluations produce the same four counts, every count-based metric (accuracy, precision, recall, $F_1$, specificity, and the rest) takes the same value on both. This is why the chapter treats the matrix as primary and the individual metrics as projections of it: nothing in the count-based family is lost by storing the matrix, and a great deal is lost by storing only a scalar summary.

157.1.2 1.2 Marginals and Derived Quantities

The row and column sums of the matrix have names worth knowing. The actual positive count is $P = \text{TP} + \text{FN}$, and the actual negative count is $N = \text{TN} + \text{FP}$. The prevalence of the positive class is

\[ \pi = \frac{P}{n} = \frac{\text{TP} + \text{FN}}{n}, \]

a quantity that will turn out to be the hinge on which the entire accuracy discussion swings. From the four cells we can derive the family of conditional metrics that appear throughout the classification literature:

\[ \text{TPR (recall, sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{TNR (specificity)} = \frac{\text{TN}}{\text{TN} + \text{FP}}, \]

\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad \text{FPR} = \frac{\text{FP}}{\text{TN} + \text{FP}} = 1 - \text{TNR}. \]

Recall conditions on the actual class (of all real positives, how many did we catch), whereas precision conditions on the prediction (of everything we flagged, how many were right). This chapter focuses on accuracy, but the reader should keep this derived family in view, because the central argument is that accuracy alone discards information that these conditional metrics retain.
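As a small illustration of how these quantities fall out of the four cells, here is a Python sketch. The function name `rates` and the counts passed to it are hypothetical, and returning `None` for a zero denominator is a choice made for this example rather than a universal convention.

```python
def rates(tp, fn, fp, tn):
    """Derive prevalence and the conditional rates from the four confusion-matrix cells.

    A rate whose denominator is zero is reported as None (an assumption of this sketch).
    """
    def ratio(num, den):
        return num / den if den else None

    return {
        "prevalence": ratio(tp + fn, tp + fn + fp + tn),
        "tpr": ratio(tp, tp + fn),        # recall, sensitivity
        "tnr": ratio(tn, tn + fp),        # specificity
        "precision": ratio(tp, tp + fp),
        "fpr": ratio(fp, tn + fp),        # equals 1 - TNR
    }


# Illustrative counts, not taken from any real evaluation.
r = rates(tp=40, fn=10, fp=20, tn=130)
for name, value in r.items():
    print(f"{name:>10}: {value:.4f}")
```

Note that recall divides by a row total (actual positives) while precision divides by a column total (predicted positives), which is the distinction drawn above.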

157.1.3 1.2.1 The Threshold Behind the Counts

It is worth stressing that the confusion matrix is not a property of a model alone. Most classifiers output a score or a probability $s(x) \in [0, 1]$, and a discrete prediction is produced only after applying a decision threshold $\tau$: predict positive when $s(x) \ge \tau$, negative otherwise. The four counts, and therefore accuracy and every other metric derived from them, are functions of $\tau$. Raising $\tau$ makes the classifier more conservative about predicting positive: TP and FP can only fall or stay the same, and TN and FN can only rise or stay the same. Sweeping $\tau$ from $0$ to $1$ traces out a family of confusion matrices, and curves such as the ROC curve and the precision-recall curve summarize that entire family rather than a single operating point (Fawcett, 2006; Saito and Rehmsmeier, 2015). When a single accuracy number is reported, it reflects one specific (and often unstated) choice of threshold, usually the default $\tau = 0.5$, which need not be the threshold that best serves the deployment objective. A model that looks weak at $\tau = 0.5$ may be excellent at the threshold its application actually warrants.
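To make the dependence on $\tau$ concrete, the sketch below recomputes the four counts at several thresholds. The scores and labels are invented for illustration, and `counts_at_threshold` is a hypothetical helper name.

```python
def counts_at_threshold(scores, y_true, tau):
    """Return (TP, FN, FP, TN) when predicting positive iff score >= tau."""
    tp = fn = fp = tn = 0
    for score, truth in zip(scores, y_true):
        pred = 1 if score >= tau else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Made-up scores and true labels (1 = positive).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
y_true = [1, 1, 0, 1, 0, 0]

for tau in (0.25, 0.50, 0.75):
    tp, fn, fp, tn = counts_at_threshold(scores, y_true, tau)
    accuracy = (tp + tn) / len(y_true)
    print(f"tau={tau:.2f}  TP={tp} FN={fn} FP={fp} TN={tn}  accuracy={accuracy:.3f}")
```

The same scores yield a different confusion matrix, and potentially a different accuracy, at each threshold, which is why a reported accuracy should always be accompanied by the threshold that produced it.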

157.1.4 1.3 The Multiclass Generalization

For a problem with $K$ classes the confusion matrix becomes a $K \times K$ array $C$, where entry $C_{ij}$ counts the instances whose true class is $i$ and whose predicted class is $j$. Correct predictions again lie on the main diagonal, and every off-diagonal entry $C_{ij}$ with $i \neq j$ records a specific confusion of class $i$ for class $j$. This finer structure is diagnostically rich: a model for handwritten digits may concentrate its errors in the cell for true $4$ predicted $9$, revealing a systematic visual confusion that a scalar metric would never expose.

            pred 0  pred 1  pred 2
true 0  [    50      2       0   ]
true 1  [     1     45       4   ]
true 2  [     0      3      48   ]
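A minimal Python sketch of the multiclass case follows, assuming classes are encoded as integers $0, \dots, K-1$. The helper names `confusion_matrix` and `worst_confusion` are illustrative (the first is a hand-rolled version, not the library function of the same name), and the matrix `C` reproduces the three-class example above.

```python
def confusion_matrix(y_true, y_pred, k):
    """Build a k x k matrix c where c[i][j] counts true class i predicted as class j."""
    c = [[0] * k for _ in range(k)]
    for i, j in zip(y_true, y_pred):
        c[i][j] += 1
    return c


def worst_confusion(c):
    """Return (count, true_class, predicted_class) for the largest off-diagonal cell."""
    k = len(c)
    return max((c[i][j], i, j) for i in range(k) for j in range(k) if i != j)


# The three-class matrix shown above.
C = [
    [50, 2, 0],
    [1, 45, 4],
    [0, 3, 48],
]

count, true_cls, pred_cls = worst_confusion(C)
print(f"most frequent confusion: true {true_cls} predicted as {pred_cls} ({count} times)")
```

Scanning the off-diagonal cells in this way is one simple route to the kind of systematic confusion described above.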

157.2 2. Accuracy and Its Limits

157.2.1 2.1 Definition

Accuracy is the proportion of predictions that are correct. In binary terms,

\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{\text{TP} + \text{TN}}{n}, \]

and in the multiclass case it is the sum of the diagonal of $C$ divided by the grand total,

\[ \text{Accuracy} = \frac{\sum_{i=1}^{K} C_{ii}}{\sum_{i=1}^{K} \sum_{j=1}^{K} C_{ij}}. \]

Its complement, the error rate, is $1 - \text{Accuracy}$. Accuracy estimates the probability that the classifier’s prediction matches the truth for a randomly drawn instance, and under the standard assumption that the test set is an i.i.d. sample from the deployment distribution it is an unbiased estimator of that probability. The metric is intuitive, symmetric in the classes, and directly interpretable as a percentage of correct decisions. These virtues explain its popularity.
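Both formulas reduce to "diagonal sum over grand total", which the following sketch computes for any square confusion matrix. The function name `accuracy` is illustrative; the three-class matrix is the one from Section 1.3, and the binary counts are made up.

```python
def accuracy(c):
    """Accuracy of a square confusion matrix: sum of the diagonal over the grand total."""
    correct = sum(c[i][i] for i in range(len(c)))
    total = sum(sum(row) for row in c)
    return correct / total


# Three-class matrix from Section 1.3 (rows = true class, columns = predicted class).
C3 = [
    [50, 2, 0],
    [1, 45, 4],
    [0, 3, 48],
]

# Illustrative binary matrix laid out as [[TP, FN], [FP, TN]].
C2 = [
    [40, 10],
    [20, 130],
]

print(f"multiclass accuracy: {accuracy(C3):.4f}")
print(f"binary accuracy:     {accuracy(C2):.4f}")
print(f"binary error rate:   {1 - accuracy(C2):.4f}")
```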

Because each of the $n$ test predictions is either correct or incorrect, the number of correct predictions is a binomial count, $n \cdot \text{Accuracy} \sim \text{Binomial}(n, a)$, where $a$ is the unknown true accuracy on the deployment distribution. The observed accuracy is therefore a sample proportion, and its uncertainty obeys the usual binomial standard error,

\[ \widehat{\mathrm{SE}} = \sqrt{\frac{\text{Accuracy}\,(1 - \text{Accuracy})}{n}}. \]

This matters in practice: a reported accuracy of $0.92$ on a test set of $n = 100$ carries a standard error near $0.027$, so the difference between two models at $0.92$ and $0.90$ may be pure noise. Reporting a confidence interval (the Wilson interval is the standard, well-behaved choice for proportions and is implemented in the open-source statsmodels library) is far more honest than reporting a bare point estimate, and the same logic applies to every count-based metric in this chapter.
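The arithmetic behind that example can be checked directly. The sketch below implements the binomial standard error and the Wilson score interval by hand, using only the standard library; the function names are illustrative, and the default $z = 1.96$ (roughly a 95% interval) is an assumption of this example.

```python
import math


def binomial_se(acc, n):
    """Estimated standard error of an accuracy measured on n test examples."""
    return math.sqrt(acc * (1 - acc) / n)


def wilson_interval(acc, n, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    denom = 1 + z**2 / n
    center = (acc + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(acc * (1 - acc) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


se = binomial_se(0.92, 100)
low, high = wilson_interval(0.92, 100)
print(f"standard error: {se:.4f}")
print(f"Wilson interval: [{low:.3f}, {high:.3f}]")
```

With $n = 100$ the interval around $0.92$ is wide enough to contain $0.90$, which is the sense in which the gap between the two models may be noise.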

157.2.2 2.2 The Hidden Assumptions

Accuracy carries three assumptions that are easy to overlook. First, it treats all errors as equivalent: a false positive and a false negative each subtract the same amount from the score. This is an implicit statement that the cost of the two error types is identical, which is rarely true in practice. Second, accuracy is a function of the class prevalence $\pi$, so a single accuracy number cannot be interpreted without knowing the base rates of the classes in the evaluation set. Third, because it summarizes the entire confusion matrix into one scalar, accuracy is many-to-one: for a test set of any realistic size, many different confusion matrices yield the same accuracy, and the mapping cannot be inverted to recover the error structure.

We can make the prevalence dependence explicit, and the derivation is short. Start from the definition and split the correct predictions by true class:

\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{n} = \frac{\text{TP}}{n} + \frac{\text{TN}}{n}. \]

Multiply and divide the first term by $P = \text{TP} + \text{FN}$ and the second by $N = \text{TN} + \text{FP}$:

\[ \text{Accuracy} = \frac{P}{n}\cdot\frac{\text{TP}}{P} + \frac{N}{n}\cdot\frac{\text{TN}}{N} = \frac{P}{n}\,\text{TPR} + \frac{N}{n}\,\text{TNR}. \]

Since $P/n = \pi$ and $N/n = 1 - \pi$, this yields the prevalence-weighted average of the two conditional rates,

\[ \text{Accuracy} = \pi \cdot \text{TPR} + (1 - \pi) \cdot \text{TNR}. \]

This decomposition is the key to everything that follows. It shows that accuracy is a convex combination of how well the model handles positives and how well it handles negatives, with the weights set by prevalence. When $\pi$ is far from $0.5$, the term with the larger weight dominates, and the classifier can post a high accuracy by performing well only on the majority class while performing arbitrarily badly on the minority class.
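The identity is easy to confirm numerically. The sketch below uses invented counts (chosen so that the positive class is rare) and compares accuracy computed directly with accuracy computed from the decomposition.

```python
# Illustrative counts with a rare positive class (n = 1000).
tp, fn, fp, tn = 30, 20, 45, 905
n = tp + fn + fp + tn

prevalence = (tp + fn) / n          # pi
tpr = tp / (tp + fn)                # performance on actual positives
tnr = tn / (tn + fp)                # performance on actual negatives

direct = (tp + tn) / n
decomposed = prevalence * tpr + (1 - prevalence) * tnr

print(f"pi={prevalence:.3f}  TPR={tpr:.3f}  TNR={tnr:.3f}")
print(f"accuracy (direct)     = {direct:.4f}")
print(f"accuracy (decomposed) = {decomposed:.4f}")
print(f"share contributed by positives = {prevalence * tpr / direct:.3f}")
```

In this example the positives contribute only a few percent of the total, even though the model misses a large fraction of them.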

157.3 3. The Accuracy Paradox Under Imbalance

157.3.1 3.1 The Majority-Class Baseline

Class imbalance occurs when one class vastly outnumbers the other, as in fraud detection, rare-disease screening, defect identification, or click prediction, where the interesting positive class may constitute well under one percent of the data. Consider the trivial classifier that ignores its input entirely and always predicts the majority class. Suppose the negative class has prevalence $1 - \pi = 0.99$. This do-nothing model has $\text{TPR} = 0$ and $\text{TNR} = 1$, so by the decomposition above its accuracy is

\[ \text{Accuracy} = \pi \cdot 0 + (1 - \pi) \cdot 1 = 1 - \pi = 0.99. \]

A model that has learned nothing, that cannot identify a single positive instance, reports ninety-nine percent accuracy. This is the accuracy paradox: under strong imbalance the accuracy of a useless model approaches one, so any genuinely useful model must clear a very high baseline before its accuracy looks even slightly impressive, and a high accuracy figure conveys almost no information about whether the rare class is being detected.

157.3.2 3.2 A Worked Example

Suppose we screen $10{,}000$ patients for a disease with prevalence $\pi = 0.01$, so there are $100$ true cases. Two models are compared.

Model A (always negative):
  TP = 0    FN = 100
  FP = 0    TN = 9900
  Accuracy = 9900 / 10000 = 0.9900
  Recall   = 0 / 100      = 0.000

Model B (a real classifier):
  TP = 80   FN = 20
  FP = 300  TN = 9600
  Accuracy = 9680 / 10000 = 0.9680
  Recall   = 80 / 100     = 0.800

Model A wins on accuracy, $0.99$ against $0.968$, yet it is medically worthless: it detects none of the cases the screening program exists to find. Model B catches eighty percent of the cases at the cost of three hundred false alarms, which in a screening context are typically resolved by a cheap follow-up test. Ranking these two models by accuracy inverts the ordering that any sensible clinical objective would impose. The paradox is not a rare edge case; it is the default behavior of accuracy whenever the class of interest is rare.
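The comparison can be reproduced in a few lines. This sketch uses the counts from the two models above; the dictionary layout and function names are illustrative.

```python
# Counts for the two models in the worked example.
models = {
    "A (always negative)": {"tp": 0, "fn": 100, "fp": 0, "tn": 9900},
    "B (real classifier)": {"tp": 80, "fn": 20, "fp": 300, "tn": 9600},
}


def accuracy(m):
    return (m["tp"] + m["tn"]) / (m["tp"] + m["fn"] + m["fp"] + m["tn"])


def recall(m):
    return m["tp"] / (m["tp"] + m["fn"])


for name, m in models.items():
    print(f"Model {name}: accuracy={accuracy(m):.4f}  recall={recall(m):.3f}")

best_by_accuracy = max(models, key=lambda name: accuracy(models[name]))
best_by_recall = max(models, key=lambda name: recall(models[name]))
print("best by accuracy:", best_by_accuracy)
print("best by recall:  ", best_by_recall)
```

The two rankings disagree, which is the inversion described above.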

157.3.3 3.3 Why the Paradox Arises

The mechanism is the prevalence weighting in the decomposition $\text{Accuracy} = \pi \cdot \text{TPR} + (1-\pi)\cdot\text{TNR}$. When $\pi$ is small, the coefficient on TPR is small, so the model’s performance on positives barely registers in the accuracy total. Errors on the minority class are numerically swamped by correct predictions on the majority class. Accuracy thus aligns itself with whatever objective the data distribution happens to favor, and under imbalance that objective is “be right about the common class,” which is usually the opposite of the project’s true goal.

157.3.4 3.4 Metrics That Survive Imbalance

The standard response is to report metrics that do not let the majority class drown out the minority. Balanced accuracy removes the prevalence weighting by averaging the two conditional rates with equal weight,

\[ \text{Balanced Accuracy} = \frac{1}{2}\left(\text{TPR} + \text{TNR}\right), \]

so the always-negative model of Section 3.1 earns $\tfrac{1}{2}(0 + 1) = 0.5$, correctly signaling that it is no better than a coin flip on the balanced problem. The $F_1$ score, the harmonic mean of precision and recall,

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \]

ignores the true negatives entirely and so cannot be inflated by an abundant negative class. The Matthews correlation coefficient,

\[ \text{MCC} = \frac{\text{TP}\cdot\text{TN} - \text{FP}\cdot\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}, \]

uses all four cells and is widely regarded as one of the most informative single numbers for imbalanced binary problems (Chicco and Jurman, 2020). MCC is the Pearson correlation coefficient between the binary vectors of true and predicted labels, which is why it ranges over $[-1, 1]$, with $+1$ for perfect prediction, $0$ for chance-level prediction, and $-1$ for total disagreement. For the trivial classifier of Section 3.1 the factor $\text{TP} + \text{FP}$ is zero, so the formula reads $0/0$; the usual convention assigns $\text{MCC} = 0$ in that case, regardless of prevalence.
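To see how these three metrics treat the two models of Section 3.2, consider the following sketch. The function names are illustrative, $F_1$ is computed in the algebraically equivalent form $2\,\text{TP} / (2\,\text{TP} + \text{FP} + \text{FN})$, and returning $0$ when a denominator vanishes is a convention adopted for this example.

```python
import math


def balanced_accuracy(tp, fn, fp, tn):
    """Mean of TPR and TNR; assumes both classes are present in the test set."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f1(tp, fn, fp, tn):
    """F1 as 2TP / (2TP + FP + FN); returns 0.0 if the denominator is zero."""
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


model_a = dict(tp=0, fn=100, fp=0, tn=9900)     # always negative
model_b = dict(tp=80, fn=20, fp=300, tn=9600)   # real classifier

for name, m in (("A", model_a), ("B", model_b)):
    print(
        f"Model {name}: balanced accuracy={balanced_accuracy(**m):.3f}  "
        f"F1={f1(**m):.3f}  MCC={mcc(**m):.3f}"
    )
```

All three metrics place Model B above Model A, reversing the ranking that raw accuracy produced.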

A complementary idea is to correct accuracy for the agreement expected by chance. Cohen’s kappa rescales accuracy against the accuracy a random classifier would achieve given the same marginal class frequencies,

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where $p_o = \text{Accuracy}$ is the observed agreement and $p_e$ is the agreement expected if predictions and labels were independent. For the always-negative model, the prediction marginal is degenerate and $p_e = p_o$, driving $\kappa$ to $0$ and again unmasking the do-nothing classifier. Each of these retains exactly the minority-class sensitivity that plain accuracy discards. The default recommendation for an imbalanced binary problem is to lead with MCC or balanced accuracy, report the full confusion matrix, and treat raw accuracy only as a sanity check against the majority-class baseline.
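Cohen's kappa can be computed from the same four cells. The sketch below is a minimal binary implementation applied to the two models of Section 3.2; the function name `cohens_kappa` and the handling of the degenerate case $p_e = 1$ are assumptions of this example.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n
    # Chance agreement: product of the predicted and actual marginals, per class.
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_pos + p_neg
    if p_e == 1:
        return 0.0  # degenerate case: a single class in both labels and predictions
    return (p_o - p_e) / (1 - p_e)


kappa_a = cohens_kappa(tp=0, fn=100, fp=0, tn=9900)    # always negative
kappa_b = cohens_kappa(tp=80, fn=20, fp=300, tn=9600)  # real classifier
print(f"kappa, Model A: {kappa_a:.3f}")
print(f"kappa, Model B: {kappa_b:.3f}")
```

Model A's $0.99$ accuracy is exactly what chance agreement predicts given its marginals, so its kappa is zero, while Model B's lower accuracy still exceeds its chance level.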

157.4 4. When Accuracy Is Appropriate

Having spent three sections warning against accuracy, we should be clear that it is not a bad metric, only a frequently misapplied one. There are well-defined conditions under which accuracy is the right summary, and recognizing them prevents the overcorrection of abandoning a simple, interpretable metric when it is in fact suitable.

157.4.1 4.1 Balanced Classes

When the classes are roughly balanced, $\pi \approx 0.5$, the prevalence weighting that drives the paradox disappears, and accuracy converges toward balanced accuracy. In this regime accuracy faithfully reflects overall performance, and the majority-class baseline it must beat is only fifty percent rather than the inflated value imbalance produces. Many benchmark datasets are deliberately balanced for exactly this reason, and on them accuracy is a defensible headline metric.

157.4.2 4.2 Symmetric Error Costs

Accuracy is the appropriate objective when the two error types genuinely carry equal cost. If misclassifying a positive is no more or less harmful than misclassifying a negative, then the symmetric treatment baked into accuracy matches the decision problem. Formally, the error rate is the average zero-one loss, in which every misclassification incurs the same penalty, so maximizing accuracy is equivalent to minimizing that loss. When the zero-one loss is the correct model of the application, accuracy is not merely acceptable but the natural criterion.

157.4.3 4.3 Accuracy as Expected Utility

We can state the appropriateness condition precisely. Let the cost of a false positive be $c_{\text{FP}}$ and of a false negative be $c_{\text{FN}}$, both positive. The total misclassification cost on the test set is $c_{\text{FP}}\cdot\text{FP} + c_{\text{FN}}\cdot\text{FN}$, and the error rate (one minus accuracy) is proportional to $\text{FP} + \text{FN}$. The two objectives are proportional to each other, and therefore rank classifiers identically on every test set, if and only if $c_{\text{FP}} = c_{\text{FN}}$. Accuracy is therefore the special case of cost-sensitive evaluation in which the two error costs are equal. Class balance enters as a separate, practical condition: when the classes are roughly balanced, the majority-class baseline is low and the raw accuracy figure is easy to read. Stating it this way reframes the choice of metric as a modeling decision about costs and base rates rather than a matter of convention.
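A short sketch shows how the ranking depends on the cost ratio. It reuses the error counts of the two models from Section 3.2; the cost values, including the tenfold false-negative cost, are hypothetical figures chosen for illustration.

```python
def total_cost(fp, fn, c_fp, c_fn):
    """Total misclassification cost given per-error costs."""
    return c_fp * fp + c_fn * fn


# Error counts from the worked example in Section 3.2.
model_a = {"fp": 0, "fn": 100}    # always negative
model_b = {"fp": 300, "fn": 20}   # real classifier

# Hypothetical cost settings: equal costs, then a false negative ten times as costly.
for c_fp, c_fn in ((1, 1), (1, 10)):
    cost_a = total_cost(**model_a, c_fp=c_fp, c_fn=c_fn)
    cost_b = total_cost(**model_b, c_fp=c_fp, c_fn=c_fn)
    preferred = "A" if cost_a < cost_b else "B"
    print(f"c_FP={c_fp}, c_FN={c_fn}: cost A={cost_a}, cost B={cost_b}, preferred={preferred}")
```

Under equal costs the cheaper model is the one with higher accuracy; once false negatives are weighted more heavily, the preference switches to the model that actually detects positives.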

157.4.4 4.4 Practical Guidance

In practice, treat accuracy as a first-glance summary to be reported alongside, never in place of, the full confusion matrix and at least one imbalance-robust metric. Always compare any accuracy figure against the majority-class baseline $\max(\pi, 1 - \pi)$, which equals $1 - \pi$ when the positive class is the rare one, rather than against zero; an accuracy of $0.95$ is excellent at $\pi = 0.5$ and embarrassing at $\pi = 0.01$, where always predicting negative already scores $0.99$. When error costs are asymmetric, replace accuracy with a cost-weighted criterion or with precision and recall reported at an operating threshold chosen to reflect those costs. The confusion matrix remains the right place to start in every case, because every metric in this chapter is a particular projection of it, and reporting the matrix itself lets a reader compute whichever projection their own problem demands.

157.4.5 4.5 When to Use Accuracy, and Common Pitfalls

A compact decision rule: report accuracy as the headline metric only when the classes are roughly balanced and the two error types carry comparable cost. Outside that regime, demote it to a secondary figure reported next to the confusion matrix. The recurring pitfalls are worth naming explicitly.

Reading accuracy without prevalence. A bare accuracy number is uninterpretable. Always state $\pi$ and the baseline $1 - \pi$ alongside it.
Comparing models on accuracy under imbalance. As the worked example showed, this can reverse the clinically or commercially correct ranking. Use balanced accuracy, $F_1$, or MCC instead.
Ignoring the threshold. Accuracy is computed at one operating point. A model dismissed at $\tau = 0.5$ may dominate at the threshold the application requires; inspect the full ROC or precision-recall curve before concluding.
Treating small accuracy gaps as real. Differences within a couple of standard errors cannot be distinguished from noise. Report a confidence interval and, when comparing two classifiers on the same test set, use a paired test such as McNemar’s test on the discordant predictions.
Reporting accuracy when costs are asymmetric. If a false negative is ten times as costly as a false positive, accuracy is simply the wrong objective; optimize and report a cost-weighted criterion.
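For the paired comparison mentioned in the list above, the exact form of McNemar's test needs only the two discordant counts. The following is a minimal standard-library sketch; the function name `mcnemar_exact` and the example counts are hypothetical, and a production analysis would more likely rely on an established statistics package.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b = test cases the first model classified correctly and the second did not;
    c = test cases the second model classified correctly and the first did not.
    Under the null hypothesis each discordant case is equally likely to fall
    either way, so min(b, c) follows a Binomial(b + c, 0.5) lower tail.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts from comparing two classifiers on one test set.
p_value = mcnemar_exact(b=8, c=10)
print(f"McNemar exact p-value: {p_value:.3f}")
```

Only the cases on which the two models disagree enter the test; cases both models get right or both get wrong carry no information about which model is better.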

The mature open-source tooling makes all of this cheap. The scikit-learn library computes the confusion matrix, balanced accuracy, $F_1$, MCC, and Cohen’s kappa from the same predicted and true label arrays, and statsmodels supplies the proportion confidence intervals and McNemar’s test. There is no practical reason to report accuracy in isolation.

157.5 5. Summary

On a fixed test set the confusion matrix is a sufficient statistic for every count-based classification metric, and accuracy is one scalar projection of it. Accuracy is intuitive and, under balanced classes with symmetric error costs, entirely appropriate. Its decomposition $\text{Accuracy} = \pi\cdot\text{TPR} + (1-\pi)\cdot\text{TNR}$ exposes its fatal weakness under imbalance: when the class of interest is rare, accuracy rewards classifiers that ignore that class, producing the accuracy paradox in which a do-nothing model outscores a genuinely useful one. The disciplined practitioner reads the confusion matrix directly, benchmarks accuracy against the majority-class rate, and reaches for balanced accuracy, $F_1$, or MCC whenever the positive class is both rare and important.

157.6 References

Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies, 2(1), 37-63. https://arxiv.org/abs/2010.16061
Saito, T., and Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
Chicco, D., and Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genomics, 21, 6. https://doi.org/10.1186/s12864-019-6413-7
Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. (2010). The Balanced Accuracy and Its Posterior Distribution. International Conference on Pattern Recognition, 3121-3124. https://doi.org/10.1109/ICPR.2010.764
He, H., and Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. https://doi.org/10.1109/TKDE.2008.239
Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8), 861-874. https://doi.org/10.1016/j.patrec.2005.10.010
Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37-46. https://doi.org/10.1177/001316446002000104
Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1923. https://doi.org/10.1162/089976698300017197
Scikit-learn Developers. Classification Metrics. https://scikit-learn.org/stable/modules/model_evaluation.html

# Classification Accuracy and the Confusion Matrix Classification sits at the center of supervised machine learning, and almost every classification project eventually confronts a deceptively simple question: how good is this model? The answer begins with accuracy, the single most widely reported and most widely misunderstood metric in the field. This chapter develops the confusion matrix as the foundational data structure from which all common classification metrics are derived, defines accuracy precisely, and then dismantles the assumption that a high accuracy figure is sufficient evidence of a useful classifier. We pay particular attention to the accuracy paradox, the situation in which a model achieves impressive accuracy precisely because it has failed to learn anything interesting about the rare class that motivated the project. By the end, you should be able to state, with rigor, exactly when accuracy is the right tool and when it quietly lies. ## 1. The Confusion Matrix ### 1.1 Definition and Structure Consider a binary classifier that assigns each instance to one of two classes, conventionally labeled positive and negative. Given a set of $n$ labeled examples, every prediction falls into exactly one of four categories formed by crossing the true label with the predicted label. These four counts constitute the confusion matrix. | | Predicted Positive | Predicted Negative | |---|---|---| | **Actual Positive** | True Positive (TP) | False Negative (FN) | | **Actual Negative** | False Positive (FP) | True Negative (TN) | The diagonal entries, TP and TN, are correct predictions. The off-diagonal entries, FP and FN, are the two distinct kinds of error. A false positive is a negative instance mistakenly flagged as positive, sometimes called a Type I error. A false negative is a positive instance missed by the classifier, a Type II error. The total number of examples satisfies $$ n = \text{TP} + \text{TN} + \text{FP} + \text{FN}. 
$$ The four cells are produced by a two-stage decision process: each instance has a true label, and the classifier emits a predicted label, and the pair determines the cell. The following diagram traces that flow for a single instance. ```{mermaid} flowchart TD A["Instance with true label"] --> B{"True label is positive?"} B -->|"Yes"| C{"Predicted positive?"} B -->|"No"| D{"Predicted positive?"} C -->|"Yes"| TP["True Positive"] C -->|"No"| FN["False Negative"] D -->|"Yes"| FP["False Positive"] D -->|"No"| TN["True Negative"] ``` The power of the confusion matrix is that it preserves the structure of the errors rather than collapsing them. Two classifiers can share an identical accuracy yet have wildly different distributions of FP and FN, and in most real applications those two error types carry very different costs. A spam filter that deletes a legitimate email (a false positive on the "spam" class) inflicts a different harm than one that lets a junk message through (a false negative). The confusion matrix is the object that keeps this distinction visible. Formally, on a fixed test set the confusion matrix is a sufficient statistic for any classification metric that depends only on the agreement between predicted and true labels. If two evaluations produce the same four counts, every count-based metric (accuracy, precision, recall, $F_1$, specificity, and the rest) takes the same value on both. This is why the chapter treats the matrix as primary and the individual metrics as projections of it: nothing in the count-based family is lost by storing the matrix, and a great deal is lost by storing only a scalar summary. ### 1.2 Marginals and Derived Quantities The row and column sums of the matrix have names worth knowing. The actual positive count is $P = \text{TP} + \text{FN}$, and the actual negative count is $N = \text{TN} + \text{FP}$. 
The prevalence of the positive class is $$ \pi = \frac{P}{n} = \frac{\text{TP} + \text{FN}}{n}, $$ a quantity that will turn out to be the hinge on which the entire accuracy discussion swings. From the four cells we can derive the family of conditional metrics that appear throughout the classification literature: $$ \text{TPR (recall, sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{TNR (specificity)} = \frac{\text{TN}}{\text{TN} + \text{FP}}, $$ $$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad \text{FPR} = \frac{\text{FP}}{\text{TN} + \text{FP}} = 1 - \text{TNR}. $$ Recall conditions on the actual class (of all real positives, how many did we catch), whereas precision conditions on the prediction (of everything we flagged, how many were right). This chapter focuses on accuracy, but the reader should keep this derived family in view, because the central argument is that accuracy alone discards information that these conditional metrics retain. ### 1.2.1 The Threshold Behind the Counts It is worth stressing that the confusion matrix is not a property of a model alone. Most classifiers output a score or a probability $s(x) \in [0, 1]$, and a discrete prediction is produced only after applying a decision threshold $\tau$: predict positive when $s(x) \ge \tau$, negative otherwise. The four counts, and therefore accuracy and every metric derived from it, are functions of $\tau$. Raising $\tau$ makes the classifier more conservative about predicting positive, which can only decrease TP and FP and can only increase TN and FN. Sweeping $\tau$ from $0$ to $1$ traces out a family of confusion matrices, and curves such as the ROC curve and the precision-recall curve summarize that entire family rather than a single operating point (Fawcett, 2006; Saito and Rehmsmeier, 2015). 
When a single accuracy number is reported, it reflects one specific (and often unstated) choice of threshold, usually the default $\tau = 0.5$, which is rarely the threshold that optimizes the deployment objective. A model that looks weak at $\tau = 0.5$ may be excellent at the threshold its application actually warrants. ### 1.3 The Multiclass Generalization For a problem with $K$ classes the confusion matrix becomes a $K \times K$ array $C$, where entry $C_{ij}$ counts the instances whose true class is $i$ and whose predicted class is $j$. Correct predictions again lie on the main diagonal, and every off-diagonal entry $C_{ij}$ with $i \neq j$ records a specific confusion of class $i$ for class $j$. This finer structure is diagnostically rich: a model for handwritten digits may concentrate its errors in the cell for true $4$ predicted $9$, revealing a systematic visual confusion that a scalar metric would never expose. ```text pred 0 pred 1 pred 2 true 0 [ 50 2 0 ] true 1 [ 1 45 4 ] true 2 [ 0 3 48 ] ``` ## 2. Accuracy and Its Limits ### 2.1 Definition Accuracy is the proportion of predictions that are correct. In binary terms, $$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{\text{TP} + \text{TN}}{n}, $$ and in the multiclass case it is the sum of the diagonal of $C$ divided by the grand total, $$ \text{Accuracy} = \frac{\sum_{i=1}^{K} C_{ii}}{\sum_{i=1}^{K} \sum_{j=1}^{K} C_{ij}}. $$ Its complement, the error rate, is $1 - \text{Accuracy}$. Accuracy estimates the probability that the classifier's prediction matches the truth for a randomly drawn instance, and under the standard assumption that the test set is an i.i.d. sample from the deployment distribution it is an unbiased estimator of that probability. The metric is intuitive, symmetric in the classes, and directly interpretable as a percentage of correct decisions. These virtues explain its popularity. 
Because each of the $n$ test predictions is either correct or incorrect, the number of correct predictions is a binomial count, $n \cdot \text{Accuracy} \sim \text{Binomial}(n, a)$, where $a$ is the unknown true accuracy on the deployment distribution. The observed accuracy is therefore a sample proportion, and its uncertainty obeys the usual binomial standard error, $$ \widehat{\mathrm{SE}} = \sqrt{\frac{\text{Accuracy}\,(1 - \text{Accuracy})}{n}}. $$ This matters in practice: a reported accuracy of $0.92$ on a test set of $n = 100$ carries a standard error near $0.027$, so the difference between two models at $0.92$ and $0.90$ may be pure noise. Reporting a confidence interval (the Wilson interval is the standard, well-behaved choice for proportions and is implemented in the open-source `statsmodels` library) is far more honest than reporting a bare point estimate, and the same logic applies to every count-based metric in this chapter. ### 2.2 The Hidden Assumptions Accuracy carries three assumptions that are easy to overlook. First, it treats all errors as equivalent: a false positive and a false negative each subtract the same amount from the score. This is an implicit statement that the cost of the two error types is identical, which is rarely true in practice. Second, accuracy is a function of the class prevalence $\pi$, so a single accuracy number cannot be interpreted without knowing the base rates of the classes in the evaluation set. Third, because it summarizes the entire confusion matrix into one scalar, accuracy is many-to-one: infinitely many confusion matrices yield the same accuracy, and the mapping cannot be inverted to recover the error structure. We can make the prevalence dependence explicit, and the derivation is short. Start from the definition and split the correct predictions by true class: $$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{n} = \frac{\text{TP}}{n} + \frac{\text{TN}}{n}. 
$$

Multiply and divide the first term by $P = \text{TP} + \text{FN}$ and the second by $N = \text{TN} + \text{FP}$:

$$
\text{Accuracy} = \frac{P}{n}\cdot\frac{\text{TP}}{P} + \frac{N}{n}\cdot\frac{\text{TN}}{N} = \frac{P}{n}\,\text{TPR} + \frac{N}{n}\,\text{TNR}.
$$

Since $P/n = \pi$ and $N/n = 1 - \pi$, this yields the prevalence-weighted average of the two conditional rates,

$$
\text{Accuracy} = \pi \cdot \text{TPR} + (1 - \pi) \cdot \text{TNR}.
$$

This decomposition is the key to everything that follows. It shows that accuracy is a convex combination of how well the model handles positives and how well it handles negatives, with the weights set by prevalence. When $\pi$ is far from $0.5$, the term with the larger weight dominates, and the classifier can post a high accuracy by performing well only on the majority class while performing arbitrarily badly on the minority class.

## 3. The Accuracy Paradox Under Imbalance

### 3.1 The Majority-Class Baseline

Class imbalance occurs when one class vastly outnumbers the other, as in fraud detection, rare-disease screening, defect identification, or click prediction, where the interesting positive class may constitute well under one percent of the data. Consider the trivial classifier that ignores its input entirely and always predicts the majority class. Suppose the negative class has prevalence $1 - \pi = 0.99$. This do-nothing model has $\text{TPR} = 0$ and $\text{TNR} = 1$, so by the decomposition above its accuracy is

$$
\text{Accuracy} = \pi \cdot 0 + (1 - \pi) \cdot 1 = 1 - \pi = 0.99.
$$

A model that has learned nothing, that cannot identify a single positive instance, reports ninety-nine percent accuracy.
This is the accuracy paradox: under strong imbalance the accuracy of a useless model approaches one, so any genuinely useful model must clear a very high baseline before its accuracy looks even slightly impressive, and a high accuracy figure conveys almost no information about whether the rare class is being detected.

### 3.2 A Worked Example

Suppose we screen $10{,}000$ patients for a disease with prevalence $\pi = 0.01$, so there are $100$ true cases. Two models are compared.

```text
Model A (always negative):
  TP =   0    FN =  100
  FP =   0    TN = 9900
  Accuracy = 9900 / 10000 = 0.9900
  Recall   =    0 /   100 = 0.000

Model B (a real classifier):
  TP =  80    FN =   20
  FP = 300    TN = 9600
  Accuracy = 9680 / 10000 = 0.9680
  Recall   =   80 /   100 = 0.800
```

Model A wins on accuracy, $0.99$ against $0.968$, yet it is medically worthless: it detects none of the cases the screening program exists to find. Model B catches eighty percent of the cases at the cost of three hundred false alarms, which in a screening context are typically resolved by a cheap follow-up test. Ranking these two models by accuracy inverts the ordering that any sensible clinical objective would impose. The paradox is not a rare edge case; it is the default behavior of accuracy whenever the class of interest is rare.

### 3.3 Why the Paradox Arises

The mechanism is the prevalence weighting in the decomposition $\text{Accuracy} = \pi \cdot \text{TPR} + (1-\pi)\cdot\text{TNR}$. When $\pi$ is small, the coefficient on TPR is small, so the model's performance on positives barely registers in the accuracy total. Errors on the minority class are numerically swamped by correct predictions on the majority class. Accuracy thus aligns itself with whatever objective the data distribution happens to favor, and under imbalance that objective is "be right about the common class," which is usually the opposite of the project's true goal.

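
The prevalence weighting can be verified numerically on the two models of Section 3.2. The sketch below is plain Python with a hypothetical helper `rates`; it assumes both classes are present in the test set, so that TPR and TNR are defined. It recomputes accuracy directly from the four cells and again through the decomposition $\pi \cdot \text{TPR} + (1 - \pi) \cdot \text{TNR}$.

```python
def rates(tp, fn, fp, tn):
    """Return (prevalence, TPR, TNR, accuracy) from the four confusion-matrix cells."""
    n = tp + fn + fp + tn
    prevalence = (tp + fn) / n      # pi: share of actual positives
    tpr = tp / (tp + fn)            # recall on the positive class
    tnr = tn / (tn + fp)            # recall on the negative class
    accuracy = (tp + tn) / n
    return prevalence, tpr, tnr, accuracy


# Cell counts taken from the worked example in Section 3.2.
models = {
    "A (always negative)": dict(tp=0, fn=100, fp=0, tn=9900),
    "B (real classifier)": dict(tp=80, fn=20, fp=300, tn=9600),
}

results = {}
for name, cells in models.items():
    pi, tpr, tnr, acc = rates(**cells)
    decomposed = pi * tpr + (1 - pi) * tnr   # prevalence-weighted average
    results[name] = (acc, decomposed, tpr)
    print(f"Model {name}: accuracy={acc:.4f}  "
          f"pi*TPR + (1-pi)*TNR={decomposed:.4f}  "
          f"positive-class term={pi * tpr:.4f}  baseline={1 - pi:.2f}")
```

The printed positive-class term is the quantity to watch: with $\pi = 0.01$ it can contribute at most $0.01$ to the total, so even a jump in recall from $0$ to $0.8$ moves accuracy by less than one point.
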
### 3.4 Metrics That Survive Imbalance

The standard response is to report metrics that do not let the majority class drown out the minority. Balanced accuracy removes the prevalence weighting by averaging the two conditional rates with equal weight,

$$
\text{Balanced Accuracy} = \frac{1}{2}\left(\text{TPR} + \text{TNR}\right),
$$

so the always-negative model of Section 3.1 earns $\tfrac{1}{2}(0 + 1) = 0.5$, correctly signaling that it is no better than a coin flip on the balanced problem. The $F_1$ score, the harmonic mean of precision and recall,

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},
$$

ignores the true negatives entirely and so cannot be inflated by an abundant negative class. The Matthews correlation coefficient,

$$
\text{MCC} = \frac{\text{TP}\cdot\text{TN} - \text{FP}\cdot\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}},
$$

uses all four cells and is widely regarded as the most informative single number for imbalanced binary problems (Chicco and Jurman, 2020). For the trivial classifier one predicted-class marginal is empty, so the formula reduces to $0/0$; the value is conventionally set to zero in that case, regardless of prevalence. MCC is the Pearson correlation coefficient between the binary vectors of true and predicted labels, which is why it ranges over $[-1, 1]$, with $+1$ for perfect prediction, $0$ for chance, and $-1$ for total disagreement.

A complementary idea is to correct accuracy for the agreement expected by chance. Cohen's kappa rescales accuracy against the accuracy a random classifier would achieve given the same marginal class frequencies,

$$
\kappa = \frac{p_o - p_e}{1 - p_e},
$$

where $p_o = \text{Accuracy}$ is the observed agreement and $p_e$ is the agreement expected if predictions and labels were independent. For the always-negative model, the prediction marginal is degenerate and $p_e = p_o$, driving $\kappa$ to $0$ and again unmasking the do-nothing classifier.

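
To make the four formulas concrete, here is a minimal sketch that evaluates them on the two models of Section 3.2 using only the standard library. The function name `imbalance_metrics` is hypothetical, and two conventions are assumptions of this sketch: $F_1$ is computed in the equivalent count form $2\,\text{TP} / (2\,\text{TP} + \text{FP} + \text{FN})$ so that it is defined when precision is $0/0$, and MCC is set to zero when its denominator vanishes.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Balanced accuracy, F1, MCC and Cohen's kappa from the four cells.

    Assumes both classes occur in the test set (tp + fn > 0 and tn + fp > 0).
    """
    n = tp + fn + fp + tn
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    balanced_accuracy = 0.5 * (tpr + tnr)

    # Count form of F1; identical to the harmonic mean when both are defined.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # Convention (assumed here): MCC = 0 when any marginal is empty.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Kappa: observed agreement versus agreement expected under independence.
    p_o = (tp + tn) / n
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0

    return {"balanced_accuracy": balanced_accuracy, "f1": f1,
            "mcc": mcc, "kappa": kappa}


model_a = imbalance_metrics(tp=0, fn=100, fp=0, tn=9900)    # always negative
model_b = imbalance_metrics(tp=80, fn=20, fp=300, tn=9600)  # real classifier

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

On every one of the four metrics Model B now ranks above Model A, reversing the ordering that raw accuracy produced.
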
Each of these retains exactly the minority-class sensitivity that plain accuracy discards. The default recommendation for an imbalanced binary problem is to lead with MCC or balanced accuracy, report the full confusion matrix, and treat raw accuracy only as a sanity check against the majority-class baseline.

## 4. When Accuracy Is Appropriate

Having spent three sections warning against accuracy, we should be clear that it is not a bad metric, only a frequently misapplied one. There are well-defined conditions under which accuracy is the right summary, and recognizing them prevents the overcorrection of abandoning a simple, interpretable metric when it is in fact suitable.

### 4.1 Balanced Classes

When the classes are roughly balanced, $\pi \approx 0.5$, the prevalence weighting that drives the paradox disappears, and accuracy converges toward balanced accuracy. In this regime accuracy faithfully reflects overall performance, and the majority-class baseline it must beat is only fifty percent rather than the inflated value imbalance produces. Many benchmark datasets are deliberately balanced for exactly this reason, and on them accuracy is a defensible headline metric.

### 4.2 Symmetric Error Costs

Accuracy is the appropriate objective when the two error types genuinely carry equal cost. If misclassifying a positive is no more or less harmful than misclassifying a negative, then the symmetric treatment baked into accuracy matches the decision problem. Formally, maximizing accuracy is equivalent to minimizing expected loss under a zero-one loss function in which every misclassification incurs the same penalty. When that loss function is the correct model of the application, accuracy is not merely acceptable but exactly the right objective.

### 4.3 Accuracy as Expected Utility

We can state the appropriateness condition precisely. Let the cost of a false positive be $c_{\text{FP}}$ and of a false negative be $c_{\text{FN}}$.
The expected misclassification cost is proportional to $c_{\text{FP}}\cdot\text{FP} + c_{\text{FN}}\cdot\text{FN}$, and error rate (one minus accuracy) is proportional to $\text{FP} + \text{FN}$. These two objectives coincide, up to a constant factor, if and only if $c_{\text{FP}} = c_{\text{FN}}$. Accuracy is therefore the special case of cost-sensitive evaluation in which the cost matrix is symmetric and the classes are balanced enough that prevalence does not distort the comparison. Stating it this way reframes the choice of metric as a modeling decision about costs and base rates rather than a matter of convention.

### 4.4 Practical Guidance

In practice, treat accuracy as a first-glance summary to be reported alongside, never in place of, the full confusion matrix and at least one imbalance-robust metric. Always compare any accuracy figure against the majority-class baseline $\max(\pi, 1 - \pi)$, which equals $1 - \pi$ when positives are the minority, rather than against zero; an accuracy of $0.95$ is excellent at $\pi = 0.5$ and embarrassing at $\pi = 0.01$, where always predicting negative already scores $0.99$. When error costs are asymmetric, replace accuracy with a cost-weighted criterion or with precision and recall reported at an operating threshold chosen to reflect those costs. The confusion matrix remains the right place to start in every case, because every metric in this chapter is a particular projection of it, and reporting the matrix itself lets a reader compute whichever projection their own problem demands.

### 4.5 When to Use Accuracy, and Common Pitfalls

A compact decision rule: report accuracy as the headline metric only when the classes are roughly balanced and the two error types carry comparable cost. Outside that regime, demote it to a secondary figure reported next to the confusion matrix. The recurring pitfalls are worth naming explicitly.

- **Reading accuracy without prevalence.** A bare accuracy number is uninterpretable. Always state $\pi$ and the majority-class baseline $\max(\pi, 1 - \pi)$ alongside it.
- **Comparing models on accuracy under imbalance.** As the worked example showed, this can reverse the clinically or commercially correct ranking. Use balanced accuracy, $F_1$, or MCC instead.
- **Ignoring the threshold.** Accuracy is computed at one operating point. A model dismissed at $\tau = 0.5$ may dominate at the threshold the application requires; inspect the full ROC or precision-recall curve before concluding.
- **Treating small accuracy gaps as real.** Differences within a couple of standard errors are noise. Report a confidence interval and, when comparing two classifiers on the same test set, use a paired test such as McNemar's test on the discordant predictions.
- **Reporting accuracy when costs are asymmetric.** If a false negative is ten times as costly as a false positive, accuracy is simply the wrong objective; optimize and report a cost-weighted criterion.

The mature open-source tooling makes all of this cheap. The `scikit-learn` library computes the confusion matrix, balanced accuracy, $F_1$, MCC, and Cohen's kappa from the same predicted and true label arrays, and `statsmodels` supplies the proportion confidence intervals and McNemar's test. There is no practical reason to report accuracy in isolation.

## 5. Summary

The confusion matrix contains everything needed to compute every metric in this chapter for a classifier evaluated on a fixed test set at a fixed threshold, and accuracy is one scalar projection of it. Accuracy is intuitive and, under balanced classes with symmetric error costs, entirely appropriate. Its decomposition $\text{Accuracy} = \pi\cdot\text{TPR} + (1-\pi)\cdot\text{TNR}$ exposes its fatal weakness under imbalance: when the class of interest is rare, accuracy rewards classifiers that ignore that class, producing the accuracy paradox in which a do-nothing model outscores a genuinely useful one.
The disciplined practitioner reads the confusion matrix directly, benchmarks accuracy against the majority-class rate, and reaches for balanced accuracy, $F_1$, or MCC whenever the positive class is both rare and important.

## References

1. Powers, D. M. W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies, 2(1), 37-63. https://arxiv.org/abs/2010.16061
2. Saito, T., and Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
3. Chicco, D., and Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genomics, 21, 6. https://doi.org/10.1186/s12864-019-6413-7
4. Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. (2010). The Balanced Accuracy and Its Posterior Distribution. International Conference on Pattern Recognition, 3121-3124. https://doi.org/10.1109/ICPR.2010.764
5. He, H., and Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. https://doi.org/10.1109/TKDE.2008.239
6. Fawcett, T. (2006). An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8), 861-874. https://doi.org/10.1016/j.patrec.2005.10.010
7. Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37-46. https://doi.org/10.1177/001316446002000104
8. Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1923. https://doi.org/10.1162/089976698300017197
9. Scikit-learn Developers. Classification Metrics. https://scikit-learn.org/stable/modules/model_evaluation.html