161 Multiclass Classification Metrics

Binary classification metrics rest on a comfortable simplification: there is one positive class and one negative class, and every prediction lands in one of four cells. Multiclass problems break this symmetry. When a model must assign each instance to one of $K > 2$ mutually exclusive classes, the notions of true positive, recall, and ROC all require generalization. This chapter develops those generalizations rigorously, with attention to the averaging schemes that turn per-class scores into a single headline number, to agreement coefficients that correct for chance, and to the curve-based diagnostics that extend ROC analysis beyond two classes.

A single running example threads the chapter together. Throughout we use a three-class animal classifier evaluated on $N = 100$ images of cats, dogs, and foxes. Its confusion matrix appears in Section 1, and every metric we define is computed on those same counts so that the numbers can be checked by hand and compared directly. The mathematics is deliberately self-contained: each formula is stated, motivated, and then applied, so that the chapter doubles as a reference one can return to when deciding what to report.

flowchart TD
    A["Multiclass predictions and scores"] --> B["Confusion matrix C, K by K"]
    B --> C["Per class OvR counts TP FP FN TN"]
    C --> D["Per class precision recall F1"]
    D --> E["Macro average, classes weighted equally"]
    D --> F["Weighted average, by support"]
    C --> G["Micro average, pooled counts"]
    B --> H["Cohen kappa, chance corrected"]
    A --> I["OvR or OvO ROC and AUC"]

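The diagram traces the dependencies. Most of the chapter begins with the confusion matrix: the one versus rest reduction feeds the per-class table, which in turn feeds the averaging schemes, and the agreement coefficient is computed from the matrix and its marginals. The curve-based diagnostics are the exception, since they are built from the raw scores rather than from hard predictions. Averaged scores, agreement coefficients, and curve-based diagnostics are the three families of scalar summaries built on top.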

161.1 1. The Multiclass Confusion Matrix

Let the classes be indexed $1, \dots, K$. The confusion matrix $C \in \mathbb{N}^{K \times K}$ collects counts, where $C_{ij}$ is the number of instances whose true class is $i$ and whose predicted class is $j$. The diagonal entries $C_{ii}$ are correct predictions; off-diagonal mass records the specific ways the classifier confuses one class for another. The total sample size is $N = \sum_{i,j} C_{ij}$.

Overall accuracy is the normalized trace:

\[ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{K} C_{ii}. \]

Accuracy alone is a weak summary. With imbalanced classes it can be high while the classifier ignores rare categories entirely, and it tells us nothing about which confusions occur. The richness of the multiclass setting lives in the off-diagonal structure, and most useful metrics are derived by reducing the full matrix to per-class quantities.

The standard reduction is the one versus rest (OvR) decomposition. For a fixed class $c$, collapse all other classes into a single negative class. This induces a binary confusion matrix with

\[ \text{TP}_c = C_{cc}, \quad \text{FP}_c = \sum_{i \neq c} C_{ic}, \quad \text{FN}_c = \sum_{j \neq c} C_{cj}, \quad \text{TN}_c = N - \text{TP}_c - \text{FP}_c - \text{FN}_c. \]

Here $\text{FP}_c$ sums the column of class $c$ excluding the diagonal (instances wrongly pushed into $c$), and $\text{FN}_c$ sums the row of class $c$ excluding the diagonal (instances of $c$ sent elsewhere). Every binary metric can now be computed once per class.

                 predicted
               cat  dog  fox
   true cat [   18    2    0 ]   row sum = 20
   true dog [    3   25    2 ]   row sum = 30
   true fox [    1    4   45 ]   row sum = 50
   col sum      22   31   47     N = 100

Read the entries directly. The row labeled cat holds the true cats: of its 20 members, 18 were predicted cat (the diagonal), 2 were predicted dog, and 0 were predicted fox. The column labeled cat collects everything predicted cat: 18 true cats, 3 dogs, and 1 fox, for a column sum of 22. Applying the OvR formulas to class cat gives

\[ \text{TP}_{\text{cat}} = 18, \quad \text{FN}_{\text{cat}} = 2 + 0 = 2, \quad \text{FP}_{\text{cat}} = 3 + 1 = 4, \quad \text{TN}_{\text{cat}} = 100 - 18 - 4 - 2 = 76. \]

We will reuse these counts, and the analogous counts for dog ($\text{TP}=25$, $\text{FN}=5$, $\text{FP}=6$) and fox ($\text{TP}=45$, $\text{FN}=5$, $\text{FP}=2$), throughout the chapter. The trace of the matrix is $18 + 25 + 45 = 88$, so overall accuracy is $88 / 100 = 0.88$.
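The OvR counts for all three classes can be checked mechanically. The following is a minimal sketch using NumPy; the variable names (`C`, `tp`, `fp`, `fn`, `tn`) are illustrative choices, not part of any library API.

```python
import numpy as np

# Running-example confusion matrix: rows are true classes, columns are predictions.
C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])
labels = ["cat", "dog", "fox"]

N = C.sum()
tp = np.diag(C)            # correct predictions per class
fp = C.sum(axis=0) - tp    # column sum minus the diagonal entry
fn = C.sum(axis=1) - tp    # row sum minus the diagonal entry
tn = N - tp - fp - fn      # everything that does not involve the class

accuracy = tp.sum() / N    # normalized trace

for name, a, b, c, d in zip(labels, tp, fp, fn, tn):
    print(f"{name}: TP={a} FP={b} FN={c} TN={d}")
print(f"accuracy = {accuracy:.2f}")
```

Because the reductions are vectorized over the diagonal, rows, and columns, the same four lines work for any $K$.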

161.2 2. Extending Binary Metrics Per Class

With the OvR counts in hand, precision, recall, and the $F_1$ score carry over verbatim for each class $c$:

\[ \text{Precision}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FP}_c}, \qquad \text{Recall}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FN}_c}, \]

\[ F_{1,c} = \frac{2 \, \text{Precision}_c \, \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}. \]

Recall for class $c$ is exactly the $c$-th diagonal entry divided by the $c$-th row sum, also called the per-class sensitivity or the true positive rate of that class. Precision is the diagonal entry divided by the column sum. These two quantities answer complementary questions. Recall asks: of the genuine members of $c$, what fraction did we recover? Precision asks: of the instances we labeled $c$, what fraction truly belong?

On the running example, class cat has $\text{Precision}_{\text{cat}} = 18 / 22 \approx 0.818$ and $\text{Recall}_{\text{cat}} = 18 / 20 = 0.900$, so

\[ F_{1,\text{cat}} = \frac{2 (0.818)(0.900)}{0.818 + 0.900} \approx 0.857. \]

The same arithmetic gives $F_{1,\text{dog}} \approx 0.820$ from $\text{Precision}_{\text{dog}} = 25/31 \approx 0.806$ and $\text{Recall}_{\text{dog}} = 25/30 \approx 0.833$, and $F_{1,\text{fox}} \approx 0.928$ from $\text{Precision}_{\text{fox}} = 45/47 \approx 0.957$ and $\text{Recall}_{\text{fox}} = 45/50 = 0.900$. These three triples are the raw material that every aggregate below compresses.
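The per-class arithmetic above can be reproduced in a few lines. This is a sketch under the assumption that every class is predicted at least once and has at least one true instance; otherwise the divisions below would need a zero-division convention.

```python
import numpy as np

# Running-example confusion matrix: rows are true classes, columns are predictions.
C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])

tp = np.diag(C).astype(float)
precision = tp / C.sum(axis=0)   # diagonal entry over column sum
recall = tp / C.sum(axis=1)      # diagonal entry over row sum
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["cat", "dog", "fox"], precision, recall, f1):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```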

A per-class report exposes failure modes that a scalar hides. A model may achieve $\text{Recall}_{\text{cat}} = 0.90$ yet $\text{Recall}_{\text{rare}} = 0.10$, a disparity invisible in overall accuracy when the rare class is small. For this reason, mature evaluation pipelines typically print the full per-class table before collapsing it into an aggregate. The question of how to collapse it is the subject of averaging.

161.3 3. Micro and Macro Averaging

Three averaging conventions dominate practice. They differ in how they weight classes, and the difference is consequential under imbalance.

161.3.1 3.1 Macro Averaging

Macro averaging computes a metric per class and takes an unweighted mean:

\[ \text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{Precision}_c, \qquad \text{Recall}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{Recall}_c. \]

The macro $F_1$ is most commonly defined as the mean of the per-class $F_{1,c}$ values,

\[ F_{1,\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} F_{1,c}. \]

On the running example this is $(0.857 + 0.820 + 0.928)/3 \approx 0.868$. Because each class contributes equally regardless of its support, macro averaging treats a class of ten instances and a class of ten thousand as equally important. This is the right choice when minority class performance matters as much as majority performance, as in medical screening across rare conditions. It is also the average that most heavily penalizes a model that abandons small classes.

A definitional fork deserves a warning. The form above, the arithmetic mean of per-class $F_1$ values, is the one scikit-learn reports and the one assumed here. An older alternative first macro-averages precision and recall and then combines those two averages with the $F_1$ formula. The two coincide when the ratio of precision to recall is the same in every class, for instance when precision equals recall everywhere, and they can diverge noticeably otherwise. Because the harmonic mean is concave, the mean-of-$F_1$ form is never larger than the $F_1$-of-means form, which makes it the more conservative of the two. When you cite a macro $F_1$, you are implicitly committing to one of these definitions, so name the tool or the formula.
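To see the two macro $F_1$ definitions side by side, the sketch below evaluates both on the running example. The variable names are illustrative, and the code assumes no class has zero precision and recall at once.

```python
import numpy as np

# Running-example confusion matrix: rows are true classes, columns are predictions.
C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])

tp = np.diag(C).astype(float)
precision = tp / C.sum(axis=0)
recall = tp / C.sum(axis=1)

# Definition A: arithmetic mean of the per-class F1 scores.
f1_per_class = 2 * precision * recall / (precision + recall)
macro_f1_mean_of_f1 = f1_per_class.mean()

# Definition B: F1 formula applied to macro precision and macro recall.
p_macro, r_macro = precision.mean(), recall.mean()
macro_f1_of_means = 2 * p_macro * r_macro / (p_macro + r_macro)

print(f"mean of per-class F1: {macro_f1_mean_of_f1:.4f}")
print(f"F1 of macro averages: {macro_f1_of_means:.4f}")
```

On these counts the two values differ only in the third decimal place, since precision and recall are fairly well matched in every class; the gap widens as the per-class precision-recall balance becomes more uneven.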

161.3.2 3.2 Micro Averaging

Micro averaging pools the counts across all classes before computing the metric:

\[ \text{Precision}_{\text{micro}} = \frac{\sum_c \text{TP}_c}{\sum_c (\text{TP}_c + \text{FP}_c)}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_c \text{TP}_c}{\sum_c (\text{TP}_c + \text{FN}_c)}. \]

In single-label multiclass classification a clean identity holds. Every instance produces exactly one predicted label and has exactly one true label, so $\sum_c \text{TP}_c$ equals the number of correct predictions, while $\sum_c (\text{TP}_c + \text{FP}_c) = \sum_c (\text{TP}_c + \text{FN}_c) = N$. Consequently

\[ \text{Precision}_{\text{micro}} = \text{Recall}_{\text{micro}} = F_{1,\text{micro}} = \text{Accuracy}. \]

Micro averaging therefore collapses to accuracy in the standard setting, which means it inherits accuracy’s blindness to minority classes. Micro-averaged scores and accuracy genuinely come apart in multilabel problems, where each instance can carry several labels, so the pooled totals $\sum_c (\text{TP}_c + \text{FP}_c)$ and $\sum_c (\text{TP}_c + \text{FN}_c)$ no longer both equal $N$.
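The identity is easy to confirm numerically. The following sketch pools the OvR counts of the running example and compares the result with the normalized trace; names are illustrative.

```python
import numpy as np

# Running-example confusion matrix: rows are true classes, columns are predictions.
C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])

tp = np.diag(C)
fp = C.sum(axis=0) - tp
fn = C.sum(axis=1) - tp

# Pool the counts over classes, then apply the binary formulas once.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = np.trace(C) / C.sum()

print(micro_precision, micro_recall, micro_f1, accuracy)
```

Every off-diagonal entry is counted once as a false positive (for its column) and once as a false negative (for its row), which is why the two pooled denominators coincide.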

161.3.3 3.3 Weighted Averaging

Weighted averaging takes the per-class metric and weights by support $n_c$, the number of true instances of class $c$:

\[ \text{Precision}_{\text{weighted}} = \frac{1}{N} \sum_{c=1}^{K} n_c \, \text{Precision}_c. \]

This sits between the two extremes: it respects class frequency like micro averaging but is computed from per-class scores like macro averaging. It is a reasonable default for reporting on imbalanced data when you want a number that tracks population performance without fully ignoring small classes. One caution is that the weighted $F_1$ can fall outside the interval spanned by weighted precision and weighted recall, because the harmonic mean is taken before weighting, so it should not be over-interpreted.

report:           precision  recall    f1   support
  cat                0.82     0.90    0.86      20
  dog                0.81     0.83    0.82      30
  fox                0.96     0.90    0.93      50
  macro avg          0.86     0.88    0.87     100
  weighted avg       0.88     0.88    0.88     100
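The two average rows can be recomputed from the confusion matrix. This is a minimal sketch with illustrative names; it also makes visible that support-weighted recall is the same quantity as accuracy, because each class's recall is multiplied by exactly its row sum.

```python
import numpy as np

# Running-example confusion matrix: rows are true classes, columns are predictions.
C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])

tp = np.diag(C).astype(float)
support = C.sum(axis=1)              # n_c, true instances per class
precision = tp / C.sum(axis=0)
recall = tp / support
f1 = 2 * precision * recall / (precision + recall)

weights = support / support.sum()    # n_c / N
per_class = [("precision", precision), ("recall", recall), ("f1", f1)]
macro = {name: float(values.mean()) for name, values in per_class}
weighted = {name: float(weights @ values) for name, values in per_class}

print("macro   ", {k: round(v, 3) for k, v in macro.items()})
print("weighted", {k: round(v, 3) for k, v in weighted.items()})
```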

The choice among these is not a technicality. State the averaging scheme whenever you report a multiclass score, since a macro $F_1$ and a micro $F_1$ on the same predictions can differ by tens of points. On the running example the gap is modest, micro $F_1 = 0.88$ against macro $F_1 \approx 0.87$, because no class is severely starved, but on data where a rare class collapses to near-zero recall the two diverge sharply, and reporting only the larger of them is a common way to flatter a model.

161.3.4 3.4 Balanced Accuracy

A frequently useful special case of macro averaging is balanced accuracy, defined as the macro-averaged recall:

\[ \text{Balanced Accuracy} = \frac{1}{K} \sum_{c=1}^{K} \text{Recall}_c = \frac{1}{K} \sum_{c=1}^{K} \frac{C_{cc}}{\sum_j C_{cj}}. \]

It is the unweighted mean of the per-class true positive rates, equal to ordinary accuracy when classes are balanced but immune to the inflation that imbalance produces. On the running example it is $(0.900 + 0.833 + 0.900)/3 \approx 0.878$, close to the raw accuracy of $0.88$ because the per-class recalls are similar and the class sizes are only moderately skewed. Its great virtue is interpretability: a trivial majority-class predictor scores $1/K$ rather than the deceptively high number that plain accuracy would award, so balanced accuracy is a strong default headline for imbalanced problems where false negatives on small classes carry real cost.
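The $1/K$ floor for a majority-class predictor can be illustrated directly. In the sketch below, `C_majority` is a hypothetical baseline that keeps the true class counts of the running example but predicts fox for every image; the helper name `balanced_accuracy` is our own.

```python
import numpy as np

def balanced_accuracy(C):
    """Unweighted mean of per-class recall for a confusion matrix C."""
    C = np.asarray(C, dtype=float)
    return float(np.mean(np.diag(C) / C.sum(axis=1)))

# Running-example confusion matrix: rows are true classes, columns are predictions.
C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])

# Hypothetical baseline: same true counts, but everything is predicted as fox.
C_majority = np.array([[0, 0, 20],
                       [0, 0, 30],
                       [0, 0, 50]])

print(f"classifier: balanced accuracy = {balanced_accuracy(C):.3f}")
print(f"majority:   accuracy = {np.trace(C_majority) / C_majority.sum():.2f}, "
      f"balanced accuracy = {balanced_accuracy(C_majority):.3f}")
```

The baseline earns a plain accuracy equal to the majority prevalence, yet its balanced accuracy is pinned at one third because two of the three recalls are zero.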

161.4 4. Cohen’s Kappa

Accuracy and its micro twin ignore the fact that some agreement between predictions and truth arises by chance, especially when one class dominates. Cohen’s kappa corrects observed agreement for the agreement expected under independence.

Let $p_o$ be the observed agreement, equal to accuracy:

\[ p_o = \frac{1}{N} \sum_{i=1}^{K} C_{ii}. \]

Let $p_e$ be the chance agreement, computed from the marginals. With row marginal $a_i = \sum_j C_{ij}$ (true class frequency) and column marginal $b_i = \sum_j C_{ji}$ (predicted class frequency),

\[ p_e = \frac{1}{N^2} \sum_{i=1}^{K} a_i \, b_i. \]

This is the agreement two independent raters would reach given the same marginal class proportions. Cohen’s kappa is then

\[ \kappa = \frac{p_o - p_e}{1 - p_e}. \]

The denominator normalizes by the maximum possible improvement over chance. When the classifier is perfect, $p_o = 1$ and $\kappa = 1$. When it does no better than chance, $p_o = p_e$ and $\kappa = 0$. Negative values, where $p_o < p_e$, indicate systematic disagreement worse than random labeling. Common but informal interpretive bands place $\kappa$ above $0.8$ as strong agreement and below $0.4$ as weak, though these thresholds are conventions rather than laws, and any verbal label should be accompanied by the raw value.

On the running example, $p_o = 0.88$. The true class marginals are $a = (20, 30, 50)$ and the predicted marginals are the column sums $b = (22, 31, 47)$, so

\[ p_e = \frac{(20)(22) + (30)(31) + (50)(47)}{100^2} = \frac{440 + 930 + 2350}{10000} = 0.372, \]

and therefore

\[ \kappa = \frac{0.88 - 0.372}{1 - 0.372} = \frac{0.508}{0.628} \approx 0.809. \]

The classifier is correct $88\%$ of the time, but agreement of roughly $37\%$ was available by chance alone given these marginals. Kappa rescales the genuine $51$ point improvement over chance against the $63$ points of headroom that remained.
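The same calculation can be written as a small function. This is a sketch with an illustrative helper name, `cohen_kappa`; it assumes $p_e < 1$, which fails only in the degenerate case where both marginals put all mass on one class.

```python
import numpy as np

def cohen_kappa(C):
    """Return (p_o, p_e, kappa) for a confusion matrix (rows true, columns predicted)."""
    C = np.asarray(C, dtype=float)
    n = C.sum()
    p_o = np.trace(C) / n          # observed agreement
    a = C.sum(axis=1) / n          # true class proportions
    b = C.sum(axis=0) / n          # predicted class proportions
    p_e = float(a @ b)             # agreement expected under independence
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])
p_o, p_e, kappa = cohen_kappa(C)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")
```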

Kappa’s appeal is that it discounts the easy agreement available when classes are skewed. A model predicting the majority class for every instance can achieve high accuracy, but its $p_o$ equals its $p_e$, so $\kappa = 0$, exposing it as uninformative. Its main subtlety is that $\kappa$ depends on the marginal distributions, so the same misclassification rate can yield different $\kappa$ values under different class balances, which complicates comparison across datasets. This sensitivity produces the well-known kappa paradoxes: under high prevalence of one class $\kappa$ can be low even when observed agreement is very high, because the chance correction $p_e$ is itself inflated, and asymmetry between the two raters' marginals shifts $\kappa$ further. Treat a single $\kappa$ value as informative only alongside the raw accuracy and the marginals that produced it. The weighted variant of kappa, which assigns graded penalties to different confusions, is appropriate when the classes are ordinal so that confusing adjacent categories is less serious than confusing distant ones.
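To make the dependence on marginals concrete, the sketch below holds a hypothetical per-class error profile fixed (recall $0.9$ in every class, errors split evenly) and changes only the class prior. The `rates` matrix and both priors are invented for illustration and are not the running example.

```python
import numpy as np

def cohen_kappa(P):
    """Cohen's kappa from a joint table P (rows true, columns predicted)."""
    P = np.asarray(P, dtype=float)
    P = P / P.sum()
    p_o = np.trace(P)
    p_e = float(P.sum(axis=1) @ P.sum(axis=0))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical error profile: every class has recall 0.9 and
# splits its errors evenly between the other two classes.
rates = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90]])

for name, prior in [("balanced", [1/3, 1/3, 1/3]), ("skewed", [0.90, 0.05, 0.05])]:
    joint = np.asarray(prior)[:, None] * rates   # P(true = i, predicted = j)
    print(f"{name}: accuracy = {np.trace(joint):.2f}, kappa = {cohen_kappa(joint):.3f}")
```

Accuracy is $0.90$ under both priors, yet kappa drops substantially under the skewed prior because $p_e$ rises with the dominant class.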

A closely related chance-corrected summary is the multiclass Matthews correlation coefficient (MCC), which extends the binary phi coefficient by treating the confusion matrix as a contingency table and computing a correlation between true and predicted labels. Like kappa it returns $1$ for perfect prediction, $0$ for chance-level prediction, and negative values for anti-correlation, and like kappa it is unchanged when the roles of the true and predicted labels are swapped. It is often preferred for its robustness under severe imbalance. When the goal is a single scalar that does not privilege any class, balanced accuracy (the macro-averaged recall), the multiclass MCC, and Cohen’s kappa form a useful trio to report together, since they fail in different ways and rarely mislead simultaneously.
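For reference, the multiclass MCC can be computed from the confusion matrix with the closed form given in the scikit-learn documentation of `matthews_corrcoef`, which follows Gorodkin's $K$-category coefficient. The sketch below is our own transcription of that formula, with an illustrative function name, and it assumes the convention of returning $0$ when the denominator vanishes.

```python
import numpy as np

def multiclass_mcc(C):
    """Multiclass MCC from a confusion matrix (rows true, columns predicted)."""
    C = np.asarray(C, dtype=float)
    s = C.sum()                 # total number of samples
    c = np.trace(C)             # correctly classified samples
    t = C.sum(axis=1)           # true count per class
    p = C.sum(axis=0)           # predicted count per class
    numerator = c * s - t @ p
    denominator = np.sqrt((s**2 - p @ p) * (s**2 - t @ t))
    # Assumed convention: return 0 when the denominator vanishes,
    # e.g. when every prediction falls in a single class.
    return float(numerator / denominator) if denominator > 0 else 0.0

C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])
print(f"MCC = {multiclass_mcc(C):.3f}")
```

On the running example the MCC lands very close to kappa, which is typical when the true and predicted marginals are similar.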

161.5 5. The Multiclass ROC

The receiver operating characteristic curve plots the true positive rate against the false positive rate as a decision threshold sweeps across the range of scores, and the area under it (AUC) summarizes ranking quality independent of any single threshold. Both depend on a binary positive-versus-negative split, so extension to $K$ classes requires a strategy for inducing such splits from a classifier that outputs a score vector $s(x) = (s_1, \dots, s_K)$.

161.5.1 5.1 One versus Rest ROC

The OvR approach produces one curve per class. For class $c$, treat $c$ as positive and the union of the others as negative, then sweep the threshold over the score $s_c$. Each class yields an AUC, denoted $\text{AUC}_c$. These can be aggregated by macro averaging,

\[ \text{AUC}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{AUC}_c, \]

or by micro averaging, which pools the binarized score and label pairs across all classes into one long binary problem and computes a single curve. Macro AUC weights each class equally; micro AUC is dominated by frequent classes. As with $F_1$, the gap between them diagnoses imbalance.

161.5.2 5.2 One versus One ROC

The OvR scheme can be distorted by the heavy negative class it constructs. The one versus one (OvO) alternative, formalized by Hand and Till, considers each unordered pair of classes $\{i, j\}$ and computes a pairwise AUC on instances belonging to those two classes. Their multiclass measure averages over all pairs:

\[ M = \frac{2}{K(K-1)} \sum_{i < j} \hat{A}(i, j), \]

where $\hat{A}(i,j)$ is the AUC distinguishing class $i$ from class $j$. Hand and Till define $\hat{A}(i,j)$ symmetrically by averaging the two directional AUCs, so the measure is insensitive to which class of the pair is treated as positive. A key property is that $M$ is insensitive to class prior probabilities, which is precisely the bias that troubles OvR aggregation under imbalance. The cost is computational, since there are $\binom{K}{2}$ pairs to evaluate.
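The pairwise construction can be sketched in NumPy alone. The function names `pairwise_auc` and `hand_till_m`, the toy labels, and the toy scores below are all hypothetical; the code assumes column $k$ of the score matrix scores the $k$-th class in sorted label order.

```python
import numpy as np
from itertools import combinations

def pairwise_auc(pos_scores, neg_scores):
    """P(random positive outranks random negative), ties counted as one half."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def hand_till_m(y_true, scores):
    """OvO multiclass AUC; column k of scores is assumed to score class k."""
    classes = np.unique(y_true)
    pair_aucs = []
    for i, j in combinations(range(len(classes)), 2):
        in_i, in_j = y_true == classes[i], y_true == classes[j]
        a_ij = pairwise_auc(scores[in_i, i], scores[in_j, i])  # class i as positive
        a_ji = pairwise_auc(scores[in_j, j], scores[in_i, j])  # class j as positive
        pair_aucs.append(0.5 * (a_ij + a_ji))
    return float(np.mean(pair_aucs))

# Hypothetical toy data: six instances, three classes, scores sum to one per row.
y_true = np.array([0, 0, 1, 1, 2, 2])
scores = np.array([[0.7, 0.2, 0.1],
                   [0.4, 0.5, 0.1],
                   [0.3, 0.6, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.3, 0.4]])
print(f"Hand-Till M = {hand_till_m(y_true, scores):.3f}")
```

Only instances from the two classes in a pair enter each comparison, which is why class priors drop out of the measure. In practice the `roc_auc_score` function in scikit-learn with `multi_class="ovo"` is the more convenient route; the hand-written version is meant only to expose the structure.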

161.5.3 5.3 Interpretation and the Volume Under the Surface

A subtlety often missed is that the convenient probabilistic reading of binary AUC, namely the probability that a random positive outranks a random negative, does not transfer cleanly to multiclass averages. The OvR and OvO aggregates are useful scalar diagnostics of ranking quality, but they are summaries of many binary comparisons rather than a single coherent area. A genuinely $K$-dimensional generalization exists, the volume under the ROC surface (VUS), which measures the probability that a random tuple drawn one per class is ranked in correct order. The VUS is theoretically elegant but grows costly and hard to visualize as $K$ increases, which is why the OvR and OvO scalar reductions remain the practical default.

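# OvR macro AUC; assumes y_true is an (N,) label array and scores an (N, K) matrix
import numpy as np
from sklearn.metrics import roc_auc_score

classes = np.unique(y_true)                      # column k of scores matches classes[k]
auc = {}
for k, c in enumerate(classes):
    y_bin = (y_true == c).astype(int)            # class c against the rest
    auc[c] = roc_auc_score(y_bin, scores[:, k])  # OvR AUC for class c
macro_auc = float(np.mean(list(auc.values())))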

161.6 6. Choosing and Reporting Metrics

No single number captures multiclass performance. A defensible report combines several layers. Begin with the full confusion matrix, which preserves all information about hard predictions and reveals the specific confusions a model makes. Add a per-class precision, recall, and $F_1$ table so minority class behavior is visible. Then report aggregates, stating the averaging scheme explicitly: macro when every class matters equally, weighted when population performance is the goal, and micro only with the understanding that it equals accuracy in the single-label case. Include Cohen’s kappa to discount chance agreement under imbalance. When threshold-independent ranking quality is of interest, also report a multiclass AUC, preferring the OvO form when priors are skewed. The discipline is to match the metric to the cost structure of the application rather than to default to whatever a library prints first.

Mature open-source tooling makes the full report nearly free to produce. The classification_report, confusion_matrix, cohen_kappa_score, balanced_accuracy_score, matthews_corrcoef, and roc_auc_score functions in scikit-learn cover nearly every quantity in this chapter (the volume under the surface is the exception), and the last accepts multi_class="ovr" or multi_class="ovo" for the two ROC reductions. There is rarely a reason to hand-roll these in production code or to settle for a single library default.
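As a usage sketch, the running example can be pushed through these scikit-learn functions by expanding the confusion matrix back into label vectors. The expansion step is our own device for illustration; with real data `y_true` and `y_pred` come straight from the evaluation set. The AUC functions are omitted because they need per-class scores, which a confusion matrix does not retain.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix, matthews_corrcoef)

labels = ["cat", "dog", "fox"]
C = np.array([[18, 2, 0],
              [3, 25, 2],
              [1, 4, 45]])

# Expand each cell C[i, j] into that many (true, predicted) label pairs.
true_idx, pred_idx = np.nonzero(C)
counts = C[true_idx, pred_idx]
y_true = np.repeat(np.array(labels)[true_idx], counts)
y_pred = np.repeat(np.array(labels)[pred_idx], counts)

# Sanity check: the expansion reproduces the original matrix.
assert (confusion_matrix(y_true, y_pred, labels=labels) == C).all()

print(classification_report(y_true, y_pred, labels=labels, digits=3))
print(f"kappa             = {cohen_kappa_score(y_true, y_pred):.3f}")
print(f"balanced accuracy = {balanced_accuracy_score(y_true, y_pred):.3f}")
print(f"MCC               = {matthews_corrcoef(y_true, y_pred):.3f}")
```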

161.6.1 6.1 Pitfalls

A short catalog of the traps that recur in practice.

Reporting micro $F_1$ on single-label data as if it were informative. It is exactly accuracy and carries all of accuracy’s blindness to minority classes, so a high micro $F_1$ on imbalanced data says little.
Quoting a macro $F_1$ without naming the definition. The mean-of-$F_1$ form and the $F_1$-of-means form disagree, so the number is ambiguous on its own.
Comparing kappa across datasets with different class balances. Because $\kappa$ depends on the marginals through $p_e$, two models with the same per-class error rates can post different $\kappa$ values on differently balanced test sets.
Treating a macro or micro AUC as a single probabilistic quantity. These are averages of many binary comparisons, not one coherent area, so the random-positive-outranks-random-negative interpretation does not transfer.
Choosing the averaging scheme after seeing the numbers. The cost structure of the application should fix the metric in advance; selecting the most flattering aggregate afterward is a form of result shopping.

161.7 7. Summary

The move from binary to multiclass evaluation is organized around the confusion matrix and its one versus rest reduction, which lets every binary metric reappear per class. Aggregation then forces a choice: macro averaging weights classes equally and exposes minority failures, micro averaging weights instances equally and collapses to accuracy in the single-label setting, and weighted averaging interpolates by support. Balanced accuracy, the macro-averaged recall, is an interpretable default headline for imbalanced data. Cohen’s kappa and the multiclass Matthews correlation coefficient correct agreement for chance and are essential under imbalance, though kappa must be read alongside its marginals because of the prevalence paradoxes. ROC analysis generalizes through OvR and OvO aggregation, with the OvO measure of Hand and Till offering prior insensitivity at higher computational cost. Reporting all of these layers, rather than a lone scalar, is what gives an honest picture of multiclass performance. Throughout, the single three-class example showed each metric collapsing the same confusion matrix in a different way, which is the clearest evidence that the choice of metric is a choice of what to care about.

161.8 References

Sokolova, M., and Lapalme, G. A systematic analysis of performance measures for classification tasks. Information Processing and Management, 2009. https://doi.org/10.1016/j.ipm.2009.03.002
Hand, D. J., and Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 2001. https://doi.org/10.1023/A:1010920819831
Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960. https://doi.org/10.1177/001316446002000104
Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 2006. https://doi.org/10.1016/j.patrec.2005.10.010
Grandini, M., Bagli, E., and Visani, G. Metrics for multi-class classification: an overview. arXiv:2008.05756, 2020. https://arxiv.org/abs/2008.05756
Pedregosa, F., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011. https://scikit-learn.org/stable/modules/model_evaluation.html
Ferri, C., Hernandez-Orallo, J., and Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognition Letters, 2009. https://doi.org/10.1016/j.patrec.2008.08.010
Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry, 2004. https://doi.org/10.1016/j.compbiolchem.2004.09.006

# Multiclass Classification Metrics Binary classification metrics rest on a comfortable simplification: there is one positive class and one negative class, and every prediction lands in one of four cells. Multiclass problems break this symmetry. When a model must assign each instance to one of $K > 2$ mutually exclusive classes, the notions of true positive, recall, and ROC all require generalization. This chapter develops those generalizations rigorously, with attention to the averaging schemes that turn per-class scores into a single headline number, to agreement coefficients that correct for chance, and to the curve-based diagnostics that extend ROC analysis beyond two classes. A single running example threads the chapter together. Throughout we use a three-class animal classifier evaluated on $N = 100$ images of cats, dogs, and foxes. Its confusion matrix appears in Section 1, and every metric we define is computed on those same counts so that the numbers can be checked by hand and compared directly. The mathematics is deliberately self-contained: each formula is stated, motivated, and then applied, so that the chapter doubles as a reference one can return to when deciding what to report. ```{mermaid} flowchart TD A["Multiclass predictions and scores"] --> B["Confusion matrix C, K by K"] B --> C["Per class OvR counts TP FP FN TN"] C --> D["Per class precision recall F1"] D --> E["Macro average, classes weighted equally"] D --> F["Weighted average, by support"] C --> G["Micro average, pooled counts"] B --> H["Cohen kappa, chance corrected"] A --> I["OvR or OvO ROC and AUC"] ``` The diagram traces the dependencies. Everything begins with the confusion matrix; the one versus rest reduction feeds the per-class table; and the averaging schemes, the agreement coefficient, and the curve-based diagnostics are the three families of scalar summaries built on top. ## 1. The Multiclass Confusion Matrix Let the classes be indexed $1, \dots, K$. 
The confusion matrix $C \in \mathbb{N}^{K \times K}$ collects counts, where $C_{ij}$ is the number of instances whose true class is $i$ and whose predicted class is $j$. The diagonal entries $C_{ii}$ are correct predictions; off-diagonal mass records the specific ways the classifier confuses one class for another. The total sample size is $N = \sum_{i,j} C_{ij}$. Overall accuracy is the normalized trace: $$ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{K} C_{ii}. $$ Accuracy alone is a weak summary. With imbalanced classes it can be high while the classifier ignores rare categories entirely, and it tells us nothing about which confusions occur. The richness of the multiclass setting lives in the off-diagonal structure, and most useful metrics are derived by reducing the full matrix to per-class quantities. The standard reduction is the one versus rest (OvR) decomposition. For a fixed class $c$, collapse all other classes into a single negative class. This induces a binary confusion matrix with $$ \text{TP}_c = C_{cc}, \quad \text{FP}_c = \sum_{i \neq c} C_{ic}, \quad \text{FN}_c = \sum_{j \neq c} C_{cj}, \quad \text{TN}_c = N - \text{TP}_c - \text{FP}_c - \text{FN}_c. $$ Here $\text{FP}_c$ sums the column of class $c$ excluding the diagonal (instances wrongly pushed into $c$), and $\text{FN}_c$ sums the row (instances of $c$ sent elsewhere). Every binary metric can now be computed once per class. ```text predicted cat dog fox cat [ 18 2 0 ] row sum = 20 (true cats) dog [ 3 25 2 ] row sum = 30 fox [ 1 4 45 ] row sum = 50 col 22 31 47 ``` Read the entries directly. The row labeled cat has true cats; of its 20 members, 18 were predicted cat (the diagonal), 2 were predicted dog, and 0 were predicted fox. The column labeled cat collects everything predicted cat: 18 true cats, 3 dogs, and 1 fox, for a column sum of 22. 
Applying the OvR formulas to class cat gives $$ \text{TP}_{\text{cat}} = 18, \quad \text{FN}_{\text{cat}} = 2 + 0 = 2, \quad \text{FP}_{\text{cat}} = 3 + 1 = 4, \quad \text{TN}_{\text{cat}} = 100 - 18 - 4 - 2 = 76. $$ We will reuse these counts, and the analogous counts for dog ($\text{TP}=25$, $\text{FN}=5$, $\text{FP}=6$) and fox ($\text{TP}=45$, $\text{FN}=5$, $\text{FP}=2$), throughout the chapter. The trace of the matrix is $18 + 25 + 45 = 88$, so overall accuracy is $88 / 100 = 0.88$. ## 2. Extending Binary Metrics Per Class With the OvR counts in hand, precision, recall, and the $F_1$ score carry over verbatim for each class $c$: $$ \text{Precision}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FP}_c}, \qquad \text{Recall}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FN}_c}, $$ $$ F_{1,c} = \frac{2 \, \text{Precision}_c \, \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}. $$ Recall for class $c$ is exactly the $c$-th diagonal entry divided by the $c$-th row sum, also called the per-class sensitivity or the true positive rate of that class. Precision is the diagonal entry divided by the column sum. These two quantities answer complementary questions. Recall asks: of the genuine members of $c$, what fraction did we recover? Precision asks: of the instances we labeled $c$, what fraction truly belong? On the running example, class cat has $\text{Precision}_{\text{cat}} = 18 / 22 \approx 0.818$ and $\text{Recall}_{\text{cat}} = 18 / 20 = 0.900$, so $$ F_{1,\text{cat}} = \frac{2 (0.818)(0.900)}{0.818 + 0.900} \approx 0.857. $$ The same arithmetic gives $F_{1,\text{dog}} \approx 0.820$ from $\text{Precision}_{\text{dog}} = 25/31 \approx 0.806$ and $\text{Recall}_{\text{dog}} = 25/30 \approx 0.833$, and $F_{1,\text{fox}} \approx 0.928$ from $\text{Precision}_{\text{fox}} = 45/47 \approx 0.957$ and $\text{Recall}_{\text{fox}} = 45/50 = 0.900$. These three triples are the raw material that every aggregate below compresses. 
A per-class report exposes failure modes that a scalar hides. A model may achieve $\text{Recall}_{\text{cat}} = 0.90$ yet $\text{Recall}_{\text{rare}} = 0.10$, a disparity invisible in overall accuracy when the rare class is small. For this reason, mature evaluation pipelines always print the full per-class table before collapsing it into an aggregate. The question of how to collapse it is the subject of averaging. ## 3. Micro and Macro Averaging Three averaging conventions dominate practice. They differ in how they weight classes, and the difference is consequential under imbalance. ### 3.1 Macro Averaging Macro averaging computes a metric per class and takes an unweighted mean: $$ \text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{Precision}_c, \qquad \text{Recall}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{Recall}_c. $$ The macro $F_1$ is most commonly defined as the mean of the per-class $F_{1,c}$ values, $$ F_{1,\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} F_{1,c}. $$ On the running example this is $(0.857 + 0.820 + 0.928)/3 \approx 0.868$. Because each class contributes equally regardless of its support, macro averaging treats a class of ten instances and a class of ten thousand as equally important. This is the right choice when minority class performance matters as much as majority performance, as in medical screening across rare conditions. It is also the metric most punished by a model that abandons small classes. A definitional fork deserves a warning. The form above, the arithmetic mean of per-class $F_1$ values, is the one scikit-learn reports and the one assumed here. An older alternative first macro-averages precision and recall and then combines those two averages with the $F_1$ formula. The two agree only when precision equals recall in every class, and they can diverge noticeably otherwise, with the mean-of-$F_1$ form being the more conservative when classes vary in their precision-recall balance. 
When you cite a macro $F_1$, you are implicitly committing to one of these definitions, so name the tool or the formula. ### 3.2 Micro Averaging Micro averaging pools the counts across all classes before computing the metric: $$ \text{Precision}_{\text{micro}} = \frac{\sum_c \text{TP}_c}{\sum_c (\text{TP}_c + \text{FP}_c)}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_c \text{TP}_c}{\sum_c (\text{TP}_c + \text{FN}_c)}. $$ In single-label multiclass classification a clean identity holds. Every instance produces exactly one predicted label and has exactly one true label, so $\sum_c \text{TP}_c$ equals the number of correct predictions, while $\sum_c (\text{TP}_c + \text{FP}_c) = \sum_c (\text{TP}_c + \text{FN}_c) = N$. Consequently $$ \text{Precision}_{\text{micro}} = \text{Recall}_{\text{micro}} = F_{1,\text{micro}} = \text{Accuracy}. $$ Micro averaging therefore collapses to accuracy in the standard setting, which means it inherits accuracy's blindness to minority classes. Micro and macro genuinely diverge in multilabel problems, where each instance can carry several labels and the counts no longer sum to $N$ per class. ### 3.3 Weighted Averaging Weighted averaging takes the per-class metric and weights by support $n_c$, the number of true instances of class $c$: $$ \text{Precision}_{\text{weighted}} = \frac{1}{N} \sum_{c=1}^{K} n_c \, \text{Precision}_c. $$ This sits between the two extremes: it respects class frequency like micro averaging but is computed from per-class scores like macro averaging. It is a reasonable default for reporting on imbalanced data when you want a number that tracks population performance without fully ignoring small classes. One caution is that the weighted $F_1$ can fall outside the interval spanned by weighted precision and weighted recall, because the harmonic mean is taken before weighting, so it should not be over-interpreted. 
```text
report:
              precision  recall    f1  support
cat                0.82    0.90  0.86       20
dog                0.81    0.83  0.82       30
fox                0.96    0.90  0.93       50
macro avg          0.86    0.88  0.87      100
weighted avg       0.88    0.88  0.88      100
```

The choice among these is not a technicality. State the averaging scheme whenever you report a multiclass score, since a macro $F_1$ and a micro $F_1$ on the same predictions can differ by tens of points. On the running example the gap is modest, micro $F_1 = 0.88$ against macro $F_1 \approx 0.87$, because no class is severely starved. On data where a rare class collapses to near-zero recall the two diverge sharply, and reporting only the larger of them is a common way to flatter a model.

### 3.4 Balanced Accuracy

A frequently useful special case of macro averaging is balanced accuracy, defined as the macro-averaged recall:

$$
\text{Balanced Accuracy} = \frac{1}{K} \sum_{c=1}^{K} \text{Recall}_c = \frac{1}{K} \sum_{c=1}^{K} \frac{C_{cc}}{\sum_j C_{cj}}.
$$

It is the unweighted mean of the per-class true positive rates, equal to ordinary accuracy when classes are balanced but immune to the inflation that imbalance produces. On the running example it is $(0.900 + 0.833 + 0.900)/3 \approx 0.878$, close to the raw accuracy of $0.88$ only because the classes here are not extremely skewed. Its great virtue is interpretability: a trivial majority-class predictor scores $1/K$ rather than the deceptively high number that plain accuracy would award, so balanced accuracy is a strong default headline for imbalanced problems where false negatives on small classes carry real cost.

## 4. Cohen's Kappa

Accuracy and its micro twin ignore the fact that some agreement between predictions and truth arises by chance, especially when one class dominates. Cohen's kappa corrects observed agreement for the agreement expected under independence. Let $p_o$ be the observed agreement, equal to accuracy:

$$
p_o = \frac{1}{N} \sum_{i=1}^{K} C_{ii}.
$$

Let $p_e$ be the chance agreement, computed from the marginals.
With row marginal $a_i = \sum_j C_{ij}$ (true class frequency) and column marginal $b_i = \sum_j C_{ji}$ (predicted class frequency),

$$
p_e = \frac{1}{N^2} \sum_{i=1}^{K} a_i \, b_i.
$$

This is the agreement two independent raters would reach given the same marginal class proportions. Cohen's kappa is then

$$
\kappa = \frac{p_o - p_e}{1 - p_e}.
$$

The denominator normalizes by the maximum possible improvement over chance. When the classifier is perfect, $p_o = 1$ and $\kappa = 1$. When it does no better than chance, $p_o = p_e$ and $\kappa = 0$. Negative values, where $p_o < p_e$, indicate systematic disagreement worse than random labeling. Common but informal interpretive bands place $\kappa$ above $0.8$ as strong agreement and below $0.4$ as weak, though these thresholds are conventions rather than laws, and the raw value should always be reported alongside any such label.

On the running example, $p_o = 0.88$. The true class marginals are $a = (20, 30, 50)$ and the predicted marginals are the column sums $b = (22, 31, 47)$, so

$$
p_e = \frac{(20)(22) + (30)(31) + (50)(47)}{100^2} = \frac{440 + 930 + 2350}{10000} = 0.372,
$$

and therefore

$$
\kappa = \frac{0.88 - 0.372}{1 - 0.372} = \frac{0.508}{0.628} \approx 0.809.
$$

The classifier is correct $88\%$ of the time, but an agreement rate of roughly $37\%$ would be expected by chance alone given these marginals. Kappa rescales the genuine $51$ point improvement against the $63$ points of headroom that remained.

Kappa's appeal is that it discounts the easy agreement available when classes are skewed. A model predicting the majority class for every instance achieves high accuracy on skewed data but $\kappa = 0$, because its observed agreement equals its chance agreement exactly, which exposes it as uninformative. Its main subtlety is that $\kappa$ depends on the marginal distributions, so the same misclassification rate can yield different $\kappa$ values under different class balances, which complicates comparison across datasets.
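
The kappa arithmetic above needs only the diagonal of the confusion matrix and its two sets of marginals, so it is easy to verify in a few lines of plain Python. In this sketch the function name `cohen_kappa` is illustrative rather than a library call, the diagonal $(18, 25, 45)$ is an inference from the recalls and supports stated earlier, and the "always fox" predictor is a hypothetical baseline added to show the majority-class case.

```python
def cohen_kappa(diagonal, true_totals, pred_totals):
    """Cohen's kappa from diagonal counts and the row/column marginals."""
    n = sum(true_totals)
    p_o = sum(diagonal) / n                                        # observed agreement
    p_e = sum(a * b for a, b in zip(true_totals, pred_totals)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Running example: diagonal inferred from recall * support, marginals as quoted.
kappa_model = cohen_kappa([18, 25, 45], [20, 30, 50], [22, 31, 47])

# Hypothetical baseline that labels every image "fox": accuracy 0.50, kappa 0.
kappa_majority = cohen_kappa([0, 0, 50], [20, 30, 50], [0, 0, 100])

print(f"kappa (running example) = {kappa_model:.3f}")     # 0.809
print(f"kappa (always fox)      = {kappa_majority:.3f}")  # 0.000
```
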
Kappa's sensitivity to the marginals produces the well-known kappa paradoxes: when one class is highly prevalent, $\kappa$ can be low even though observed agreement is very high, because the chance correction $p_e$ is itself inflated; and at a fixed observed agreement, $\kappa$ also shifts with how closely the true and predicted marginals match. Treat a single $\kappa$ value as informative only alongside the raw accuracy and the marginals that produced it. The weighted variant of kappa, which assigns graded penalties to different confusions, is appropriate when the classes are ordinal, so that confusing adjacent categories is less serious than confusing distant ones.

A closely related chance-corrected summary is the multiclass Matthews correlation coefficient (MCC), which extends the binary phi coefficient by treating the confusion matrix as a contingency table and computing a correlation between true and predicted labels. Like kappa, it returns $1$ for perfect prediction, $0$ for chance-level prediction, and negative values for anti-correlation, and it treats the two label vectors symmetrically. It is often preferred for its robustness under severe imbalance. When the goal is a single scalar that discounts chance and does not privilege any class, balanced accuracy (the macro-averaged recall), the multiclass MCC, and Cohen's kappa form a useful trio to report together, since they fail in different ways and rarely mislead simultaneously.

## 5. The Multiclass ROC

The receiver operating characteristic curve plots the true positive rate against the false positive rate as a decision threshold sweeps across the range of scores, and the area under it (AUC) summarizes ranking quality independent of any single threshold. Both depend on a binary positive-versus-negative split, so extension to $K$ classes requires a strategy for inducing such splits from a classifier that outputs a score vector $s(x) = (s_1, \dots, s_K)$.

### 5.1 One versus Rest ROC

The OvR approach produces one curve per class.
For class $c$, treat $c$ as positive and the union of the others as negative, then sweep the threshold over the score $s_c$. Each class yields an AUC, denoted $\text{AUC}_c$. These can be aggregated by macro averaging,

$$
\text{AUC}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{AUC}_c,
$$

or by micro averaging, which pools the binarized score and label pairs across all classes into one long binary problem and computes a single curve. Macro AUC weights each class equally; micro AUC is dominated by frequent classes. As with $F_1$, the gap between them diagnoses imbalance.

### 5.2 One versus One ROC

The OvR scheme can be distorted by the heavy negative class it constructs. The one versus one (OvO) alternative, formalized by Hand and Till, considers each unordered pair of classes $\{i, j\}$ and computes a pairwise AUC on the instances belonging to those two classes only. Their multiclass measure averages over all pairs:

$$
M = \frac{2}{K(K-1)} \sum_{i < j} \hat{A}(i, j),
$$

where $\hat{A}(i,j)$ is the AUC distinguishing class $i$ from class $j$. Hand and Till define $\hat{A}(i,j)$ symmetrically by averaging the two directional AUCs, so the measure is insensitive to which class of the pair is treated as positive. A key property is that $M$ is insensitive to class prior probabilities, which removes precisely the bias that troubles OvR aggregation under imbalance. The cost is computational, since there are $\binom{K}{2}$ pairs to evaluate.

### 5.3 Interpretation and the Volume Under the Surface

A subtlety often missed is that the convenient probabilistic reading of binary AUC, namely the probability that a random positive outranks a random negative, does not transfer cleanly to multiclass averages. The OvR and OvO aggregates are useful scalar diagnostics of ranking quality, but they are summaries of many binary comparisons rather than a single coherent area.
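
To make "many binary comparisons" concrete, the OvO measure $M$ of Section 5.2 can be computed by brute force. The sketch below is plain Python under stated assumptions: `auc` and `hand_till_m` are illustrative names, not library functions; the six labelled score vectors are hypothetical toy data, not the running example; and the quadratic pair counting is chosen for readability, not speed. Each pairwise term averages the two directional AUCs, using the score column of whichever class is treated as positive.

```python
from itertools import combinations


def auc(pos_scores, neg_scores):
    """P(random positive outranks random negative); ties count one half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def hand_till_m(y_true, scores, n_classes):
    """Mean of the symmetric pairwise AUCs over all unordered class pairs."""

    def column(cls, col):
        # Scores in column `col` for instances whose true class is `cls`.
        return [s[col] for y, s in zip(y_true, scores) if y == cls]

    pair_aucs = []
    for i, j in combinations(range(n_classes), 2):
        a_ij = auc(column(i, i), column(j, i))  # class i positive, ranked by s_i
        a_ji = auc(column(j, j), column(i, j))  # class j positive, ranked by s_j
        pair_aucs.append((a_ij + a_ji) / 2)
    return sum(pair_aucs) / len(pair_aucs)


# Hypothetical toy data: six instances, three classes, one score per class.
y_true = [0, 0, 1, 1, 2, 2]
scores = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]
m = hand_till_m(y_true, scores, n_classes=3)
print(f"Hand-Till M = {m:.3f}")
```
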
A genuinely $K$-dimensional generalization exists, the volume under the ROC surface (VUS), which measures the probability that a random tuple drawn one per class is ranked in correct order. The VUS is theoretically elegant but grows costly and hard to visualize as $K$ increases, which is why the OvR and OvO scalar reductions remain the practical default.

```python
# OvR scoring sketch on illustrative toy inputs (not the running example)
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 2, 2])
scores = np.array([[.7, .2, .1], [.4, .5, .1], [.2, .6, .2],
                   [.3, .4, .3], [.1, .2, .7], [.2, .3, .5]])
# One AUC per class: class c against the rest, ranked by the score for c
auc = {c: roc_auc_score((y_true == c).astype(int), scores[:, c]) for c in range(3)}
macro_auc = np.mean(list(auc.values()))
```

## 6. Choosing and Reporting Metrics

No single number captures multiclass performance. A defensible report combines several layers. Begin with the full confusion matrix, which preserves all information and reveals the specific confusions a model makes. Add a per-class precision, recall, and $F_1$ table so minority class behavior is visible. Then report aggregates, stating the averaging scheme explicitly: macro when every class matters equally, weighted when population performance is the goal, and micro only with the understanding that it equals accuracy in the single-label case. Include Cohen's kappa to discount chance agreement under imbalance. When threshold-independent ranking quality is of interest, report a multiclass AUC as well, preferring the OvO form when priors are skewed. The discipline is to match the metric to the cost structure of the application rather than to default to whatever a library prints first.

Mature open-source tooling makes the full report nearly free to produce. The `classification_report`, `confusion_matrix`, `cohen_kappa_score`, `balanced_accuracy_score`, `matthews_corrcoef`, and `roc_auc_score` functions in scikit-learn cover the metrics computed in this chapter, and the last accepts `multi_class="ovr"` or `multi_class="ovo"` for the two ROC reductions. There is no reason to hand-roll these in production or to settle for a single library default.

### 6.1 Pitfalls

A short catalog of the traps that recur in practice.
- Reporting micro $F_1$ on single-label data as if it were informative. It is exactly accuracy and carries all of accuracy's blindness to minority classes, so a high micro $F_1$ on imbalanced data says little.
- Quoting a macro $F_1$ without naming the definition. The mean-of-$F_1$ form and the $F_1$-of-means form disagree, so the number is ambiguous on its own.
- Comparing kappa across datasets with different class balances. Because $\kappa$ depends on the marginals through $p_e$, two models with the same per-class error rates can post different $\kappa$ values on differently balanced test sets.
- Treating a macro or micro AUC as a single probabilistic quantity. These are averages of many binary comparisons, not one coherent area, so the random-positive-outranks-random-negative interpretation does not transfer.
- Choosing the averaging scheme after seeing the numbers. The cost structure of the application should fix the metric in advance; selecting the most flattering aggregate afterward is a form of result shopping.

## 7. Summary

The move from binary to multiclass evaluation is organized around the confusion matrix and its one versus rest reduction, which lets every binary metric reappear per class. Aggregation then forces a choice: macro averaging weights classes equally and exposes minority failures, micro averaging weights instances equally and collapses to accuracy in the single-label setting, and weighted averaging interpolates by support. Balanced accuracy, the macro-averaged recall, is an interpretable default headline for imbalanced data. Cohen's kappa and the multiclass Matthews correlation coefficient correct agreement for chance and are essential under imbalance, though kappa must be read alongside its marginals because of the prevalence paradoxes. ROC analysis generalizes through OvR and OvO aggregation, with the OvO measure of Hand and Till offering prior insensitivity at higher computational cost.
Reporting all of these layers, rather than a lone scalar, is what gives an honest picture of multiclass performance. Throughout, the single three-class example showed each metric collapsing the same confusion matrix in a different way, which is the clearest evidence that the choice of metric is a choice of what to care about.

## References

1. Sokolova, M., and Lapalme, G. A systematic analysis of performance measures for classification tasks. Information Processing and Management, 2009. https://doi.org/10.1016/j.ipm.2009.03.002
2. Hand, D. J., and Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 2001. https://doi.org/10.1023/A:1010920819831
3. Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960. https://doi.org/10.1177/001316446002000104
4. Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 2006. https://doi.org/10.1016/j.patrec.2005.10.010
5. Grandini, M., Bagli, E., and Visani, G. Metrics for multi-class classification: an overview. arXiv:2008.05756, 2020. https://arxiv.org/abs/2008.05756
6. Pedregosa, F., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011. https://scikit-learn.org/stable/modules/model_evaluation.html
7. Ferri, C., Hernandez-Orallo, J., and Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognition Letters, 2009. https://doi.org/10.1016/j.patrec.2008.08.010
8. Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry, 2004. https://doi.org/10.1016/j.compbiolchem.2004.09.006