161 Multiclass Classification Metrics
Binary classification metrics rest on a comfortable simplification: there is one positive class and one negative class, and every prediction lands in one of four cells. Multiclass problems break this symmetry. When a model must assign each instance to one of \(K > 2\) mutually exclusive classes, the notions of true positive, recall, and ROC all require generalization. This chapter develops those generalizations rigorously, with attention to the averaging schemes that turn per-class scores into a single headline number, to agreement coefficients that correct for chance, and to the curve-based diagnostics that extend ROC analysis beyond two classes.
161.1 1. The Multiclass Confusion Matrix
Let the classes be indexed \(1, \dots, K\). The confusion matrix \(C \in \mathbb{N}^{K \times K}\) collects counts, where \(C_{ij}\) is the number of instances whose true class is \(i\) and whose predicted class is \(j\). The diagonal entries \(C_{ii}\) are correct predictions; off-diagonal mass records the specific ways the classifier confuses one class for another. The total sample size is \(N = \sum_{i,j} C_{ij}\).
Overall accuracy is the normalized trace:
\[ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{K} C_{ii}. \]
Accuracy alone is a weak summary. With imbalanced classes it can be high while the classifier ignores rare categories entirely, and it tells us nothing about which confusions occur. The richness of the multiclass setting lives in the off-diagonal structure, and most useful metrics are derived by reducing the full matrix to per-class quantities.
The standard reduction is the one versus rest (OvR) decomposition. For a fixed class \(c\), collapse all other classes into a single negative class. This induces a binary confusion matrix with
\[ \text{TP}_c = C_{cc}, \quad \text{FP}_c = \sum_{i \neq c} C_{ic}, \quad \text{FN}_c = \sum_{j \neq c} C_{cj}, \quad \text{TN}_c = N - \text{TP}_c - \text{FP}_c - \text{FN}_c. \]
Here \(\text{FP}_c\) sums the column of class \(c\) excluding the diagonal (instances wrongly pushed into \(c\)), and \(\text{FN}_c\) sums the row (instances of \(c\) sent elsewhere). Every binary metric can now be computed once per class.
                    predicted
                 cat   dog   fox
    true  cat  [  18     2     0 ]   row sum = 20 (true cats)
          dog  [   3    25     2 ]   row sum = 30
          fox  [   1     4    45 ]   row sum = 50
    col sum       22    31    47
For class cat: \(\text{TP}=18\), \(\text{FN}=2\), \(\text{FP}=4\), \(\text{TN}=76\).
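The one versus rest reduction is mechanical enough to sketch in a few lines. The block below is a minimal illustration using plain Python lists for the cat/dog/fox matrix; the names `LABELS`, `ovr_counts`, and `counts` are chosen here for illustration and are not part of any library.

```python
# Confusion matrix from the cat/dog/fox example: rows = true class, columns = predicted class.
LABELS = ["cat", "dog", "fox"]
C = [
    [18, 2, 0],
    [3, 25, 2],
    [1, 4, 45],
]

def ovr_counts(C, c):
    """Return (TP, FP, FN, TN) for class index c under the one-versus-rest reduction."""
    K = len(C)
    N = sum(sum(row) for row in C)
    tp = C[c][c]
    fp = sum(C[i][c] for i in range(K) if i != c)  # column c without the diagonal
    fn = sum(C[c][j] for j in range(K) if j != c)  # row c without the diagonal
    tn = N - tp - fp - fn
    return tp, fp, fn, tn

counts = {label: ovr_counts(C, c) for c, label in enumerate(LABELS)}
for label, (tp, fp, fn, tn) in counts.items():
    print(f"{label}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

For cat this prints the same four counts as the worked example. The dog and fox rows follow from the same row and column sums.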
161.2 2. Extending Binary Metrics Per Class
With the OvR counts in hand, precision, recall, and the \(F_1\) score carry over verbatim for each class \(c\):
\[ \text{Precision}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FP}_c}, \qquad \text{Recall}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FN}_c}, \]
\[ F_{1,c} = \frac{2 \, \text{Precision}_c \, \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}. \]
Recall for class \(c\), also called the per-class sensitivity or the true positive rate of that class, is exactly the \(c\)-th diagonal entry divided by the \(c\)-th row sum. Precision is the diagonal entry divided by the column sum. These two quantities answer complementary questions. Recall asks: of the genuine members of \(c\), what fraction did we recover? Precision asks: of the instances we labeled \(c\), what fraction truly belong?
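The row-sum and column-sum reading of recall and precision translates directly into code. The following is a sketch on the cat/dog/fox matrix; `per_class_report` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption made here, not a universal convention.

```python
# Rows = true class, columns = predicted class (cat, dog, fox).
LABELS = ["cat", "dog", "fox"]
C = [
    [18, 2, 0],
    [3, 25, 2],
    [1, 4, 45],
]

def per_class_report(C):
    """Return one (precision, recall, f1, support) tuple per class."""
    K = len(C)
    rows = []
    for c in range(K):
        tp = C[c][c]
        row_sum = sum(C[c])                       # true instances of c (the support)
        col_sum = sum(C[i][c] for i in range(K))  # instances predicted as c
        # Assumption: undefined ratios are reported as 0.0.
        precision = tp / col_sum if col_sum else 0.0
        recall = tp / row_sum if row_sum else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, row_sum))
    return rows

report = per_class_report(C)
for label, (p, r, f, n) in zip(LABELS, report):
    print(f"{label:>4}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}  support={n}")
```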
A per-class report exposes failure modes that a scalar hides. A model may achieve \(\text{Recall}_{\text{cat}} = 0.90\) yet \(\text{Recall}_{\text{rare}} = 0.10\), a disparity invisible in overall accuracy when the rare class is small. For this reason, mature evaluation pipelines always print the full per-class table before collapsing it into an aggregate. The question of how to collapse it is the subject of averaging.
161.3 3. Micro and Macro Averaging
Three averaging conventions dominate practice. They differ in how they weight classes, and the difference is consequential under imbalance.
161.3.1 3.1 Macro Averaging
Macro averaging computes a metric per class and takes an unweighted mean:
\[ \text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{Precision}_c, \qquad \text{Recall}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{Recall}_c. \]
The macro \(F_1\) is most commonly defined as the mean of the per-class \(F_{1,c}\) values. Because each class contributes equally regardless of its support, macro averaging treats a class of ten instances and a class of ten thousand as equally important. This is the right choice when minority class performance matters as much as majority performance, as in medical screening across rare conditions. It is also the metric most punished by a model that abandons small classes.
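As a small illustration, the macro scores for the cat/dog/fox matrix can be computed as below. The variable names are illustrative. The last quantity, the harmonic mean of macro precision and macro recall, is an alternative definition of macro \(F_1\) that appears in some of the literature; it is included only to show that the two definitions need not agree.

```python
# Rows = true class, columns = predicted class (cat, dog, fox).
C = [
    [18, 2, 0],
    [3, 25, 2],
    [1, 4, 45],
]
K = len(C)

# Per-class scores: diagonal over column sum (precision) and over row sum (recall).
precision = [C[c][c] / sum(C[i][c] for i in range(K)) for c in range(K)]
recall = [C[c][c] / sum(C[c]) for c in range(K)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Macro averaging: unweighted mean over classes, regardless of support.
macro_precision = sum(precision) / K
macro_recall = sum(recall) / K
macro_f1 = sum(f1) / K

# Alternative definition (not the common one): F1 of the macro averages.
f1_of_macro_averages = (
    2 * macro_precision * macro_recall / (macro_precision + macro_recall)
)

print(f"macro precision = {macro_precision:.4f}")
print(f"macro recall    = {macro_recall:.4f}")
print(f"macro F1        = {macro_f1:.4f}")
print(f"F1 of macro P/R = {f1_of_macro_averages:.4f}")
```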
161.3.2 3.2 Micro Averaging
Micro averaging pools the counts across all classes before computing the metric:
\[ \text{Precision}_{\text{micro}} = \frac{\sum_c \text{TP}_c}{\sum_c (\text{TP}_c + \text{FP}_c)}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_c \text{TP}_c}{\sum_c (\text{TP}_c + \text{FN}_c)}. \]
In single-label multiclass classification a clean identity holds. Every instance produces exactly one predicted label and has exactly one true label, so \(\sum_c \text{TP}_c\) equals the number of correct predictions, while \(\sum_c (\text{TP}_c + \text{FP}_c) = \sum_c (\text{TP}_c + \text{FN}_c) = N\). Consequently
\[ \text{Precision}_{\text{micro}} = \text{Recall}_{\text{micro}} = F_{1,\text{micro}} = \text{Accuracy}. \]
Micro averaging therefore collapses to accuracy in the standard setting, which means it inherits accuracy’s blindness to minority classes. The identity breaks in multilabel problems, where each instance can carry several labels: the pooled counts no longer sum to \(N\), so micro precision, micro recall, and accuracy become distinct quantities.
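The identity is easy to check numerically. The sketch below pools the one versus rest counts of the cat/dog/fox matrix and compares the micro scores with accuracy; all names are illustrative.

```python
# Rows = true class, columns = predicted class (cat, dog, fox).
C = [
    [18, 2, 0],
    [3, 25, 2],
    [1, 4, 45],
]
K = len(C)
N = sum(sum(row) for row in C)

# Per-class one-versus-rest counts.
tp = [C[c][c] for c in range(K)]
fp = [sum(C[i][c] for i in range(K)) - C[c][c] for c in range(K)]
fn = [sum(C[c]) - C[c][c] for c in range(K)]

# Micro averaging pools the counts before forming the ratio.
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall = sum(tp) / (sum(tp) + sum(fn))
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = sum(tp) / N

# Both pooled denominators equal N in the single-label setting.
print(sum(tp) + sum(fp), sum(tp) + sum(fn), N)
print(micro_precision, micro_recall, round(micro_f1, 6), accuracy)
```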
161.3.3 3.3 Weighted Averaging
Weighted averaging takes the per-class metric and weights by support \(n_c\), the number of true instances of class \(c\):
\[ \text{Precision}_{\text{weighted}} = \frac{1}{N} \sum_{c=1}^{K} n_c \, \text{Precision}_c. \]
This sits between the two extremes: it respects class frequency like micro averaging but is computed from per-class scores like macro averaging. It is a reasonable default for reporting on imbalanced data when you want a number that tracks population performance without fully ignoring small classes. One caution is that the weighted \(F_1\) can fall outside the interval spanned by weighted precision and weighted recall, because the harmonic mean is taken before weighting, so it should not be over-interpreted.
    report:        precision   recall     f1   support
    cat                 0.82     0.90   0.86        20
    dog                 0.81     0.83   0.82        30
    fox                 0.96     0.90   0.93        50
    macro avg           0.86     0.88   0.87       100
    weighted avg        0.88     0.88   0.88       100
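A minimal sketch of support-weighted averaging for the cat/dog/fox matrix follows; the variable names are illustrative only.

```python
# Rows = true class, columns = predicted class (cat, dog, fox).
C = [
    [18, 2, 0],
    [3, 25, 2],
    [1, 4, 45],
]
K = len(C)
N = sum(sum(row) for row in C)

support = [sum(row) for row in C]  # n_c: true instances of each class
precision = [C[c][c] / sum(C[i][c] for i in range(K)) for c in range(K)]
recall = [C[c][c] / support[c] for c in range(K)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

def weighted_mean(values, weights):
    """Average per-class values with weights proportional to class support."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

weighted_precision = weighted_mean(precision, support)
weighted_recall = weighted_mean(recall, support)
weighted_f1 = weighted_mean(f1, support)
accuracy = sum(C[c][c] for c in range(K)) / N

print(f"weighted precision = {weighted_precision:.4f}")
print(f"weighted recall    = {weighted_recall:.4f}")
print(f"weighted F1        = {weighted_f1:.4f}")
```

Weighted recall always coincides with accuracy in the single-label setting, because each \(n_c\) cancels the row sum in \(\text{Recall}_c\) and leaves the trace divided by \(N\).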
The choice among these is not a technicality. State the averaging scheme whenever you report a multiclass score, since a macro \(F_1\) and a micro \(F_1\) on the same predictions can differ by tens of points.
161.4 4. Cohen’s Kappa
Accuracy and its micro twin ignore the fact that some agreement between predictions and truth arises by chance, especially when one class dominates. Cohen’s kappa corrects observed agreement for the agreement expected under independence.
Let \(p_o\) be the observed agreement, equal to accuracy:
\[ p_o = \frac{1}{N} \sum_{i=1}^{K} C_{ii}. \]
Let \(p_e\) be the chance agreement, computed from the marginals. With row marginal \(a_i = \sum_j C_{ij}\) (true class frequency) and column marginal \(b_i = \sum_j C_{ji}\) (predicted class frequency),
\[ p_e = \frac{1}{N^2} \sum_{i=1}^{K} a_i \, b_i. \]
This is the agreement two independent raters would reach given the same marginal class proportions. Cohen’s kappa is then
\[ \kappa = \frac{p_o - p_e}{1 - p_e}. \]
The denominator normalizes by the maximum possible improvement over chance. When the classifier is perfect, \(p_o = 1\) and \(\kappa = 1\). When it does no better than chance, \(p_o = p_e\) and \(\kappa = 0\). Negative values, where \(p_o < p_e\), indicate systematic disagreement worse than random labeling. Common but informal interpretive bands place \(\kappa\) above \(0.8\) as strong agreement and below \(0.4\) as weak, though these thresholds are conventions rather than laws and should be reported alongside the raw value.
Kappa’s appeal is that it discounts the easy agreement available when classes are skewed. A model predicting the majority class for every instance achieves high accuracy but \(\kappa = 0\), because its observed agreement equals the majority class proportion, which is exactly its chance agreement \(p_e\); kappa thus exposes it as uninformative. Its main subtlety is that \(\kappa\) depends on the marginal distributions, so the same misclassification rate can yield different \(\kappa\) values under different class balances, which complicates comparison across datasets. The weighted variant of kappa, which assigns graded penalties to different confusions, is appropriate when the classes are ordinal so that confusing adjacent categories is less serious than confusing distant ones.
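The definition of \(\kappa\) can be sketched directly from the confusion matrix. The function name `cohens_kappa` is chosen here for illustration. The second matrix is a hypothetical classifier that labels every instance fox, constructed to show the majority-class case. The degenerate case \(p_e = 1\) is not handled.

```python
def cohens_kappa(C):
    """Cohen's kappa from a square confusion matrix (rows = true, columns = predicted)."""
    K = len(C)
    N = sum(sum(row) for row in C)
    p_o = sum(C[i][i] for i in range(K)) / N                # observed agreement
    a = [sum(C[i]) for i in range(K)]                       # row marginals (true)
    b = [sum(C[j][i] for j in range(K)) for i in range(K)]  # column marginals (predicted)
    p_e = sum(a_i * b_i for a_i, b_i in zip(a, b)) / N**2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# The cat/dog/fox example: p_o = 0.88, p_e = 0.372.
kappa_example = cohens_kappa([[18, 2, 0], [3, 25, 2], [1, 4, 45]])

# Hypothetical majority-class predictor: everything is labelled fox.
kappa_majority = cohens_kappa([[0, 0, 20], [0, 0, 30], [0, 0, 50]])

print(f"example kappa  = {kappa_example:.3f}")
print(f"majority kappa = {kappa_majority:.3f}")
```

The majority-class predictor reaches an accuracy of 0.50 on these class proportions, yet its kappa is zero because its observed agreement equals its chance agreement.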
161.5 5. The Multiclass ROC
The receiver operating characteristic curve plots the true positive rate against the false positive rate as a decision threshold sweeps across the range of scores, and the area under it (AUC) summarizes ranking quality independent of any single threshold. Both depend on a binary positive-versus-negative split, so extension to \(K\) classes requires a strategy for inducing such splits from a classifier that outputs a score vector \(s(x) = (s_1, \dots, s_K)\).
161.5.1 5.1 One versus Rest ROC
The OvR approach produces one curve per class. For class \(c\), treat \(c\) as positive and the union of the others as negative, then sweep the threshold over the score \(s_c\). Each class yields an AUC, denoted \(\text{AUC}_c\). These can be aggregated by macro averaging,
\[ \text{AUC}_{\text{macro}} = \frac{1}{K} \sum_{c=1}^{K} \text{AUC}_c, \]
or by micro averaging, which pools the binarized score and label pairs across all classes into one long binary problem and computes a single curve. Macro AUC weights each class equally; micro AUC is dominated by frequent classes. As with \(F_1\), the gap between them diagnoses imbalance.
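The difference between the two aggregations can be seen on a toy problem. The sketch below uses a brute-force pairwise AUC (ties count one half) and six made-up instances; the data, the helper `binary_auc`, and all variable names are assumptions for illustration, not output from any real model.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: class 0 is frequent, class 2 is rare.
y_true = [0, 0, 0, 1, 1, 2]
scores = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.50, 0.10, 0.40],
    [0.30, 0.50, 0.20],
    [0.45, 0.25, 0.30],
    [0.48, 0.30, 0.22],
]
classes = [0, 1, 2]

# Macro: one OvR curve per class, then an unweighted mean of the areas.
per_class_auc = [
    binary_auc([y == c for y in y_true], [row[c] for row in scores])
    for c in classes
]
macro_auc = sum(per_class_auc) / len(classes)

# Micro: pool every (instance, class) pair into one long binary problem.
pooled_labels = [y == c for y in y_true for c in classes]
pooled_scores = [row[c] for row in scores for c in classes]
micro_auc = binary_auc(pooled_labels, pooled_scores)

print("per-class AUC:", [round(a, 3) for a in per_class_auc])
print(f"macro AUC = {macro_auc:.3f}, micro AUC = {micro_auc:.3f}")
```

In this toy case the frequent class is ranked perfectly while the rare class is not, so the micro value sits above the macro value.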
161.5.2 5.2 One versus One ROC
The OvR scheme can be distorted by the heavy negative class it constructs. The one versus one (OvO) alternative, formalized by Hand and Till, considers each unordered pair of classes \(\{i, j\}\) and computes a pairwise AUC on instances belonging to those two classes. Their multiclass measure averages over all pairs:
\[ M = \frac{2}{K(K-1)} \sum_{i < j} \hat{A}(i, j), \]
where \(\hat{A}(i,j)\) is the AUC distinguishing class \(i\) from class \(j\). Hand and Till define \(\hat{A}(i,j)\) symmetrically by averaging the two directional AUCs, so the measure is insensitive to which class of the pair is treated as positive. A key property is that \(M\) is insensitive to class prior probabilities, which is precisely the bias that troubles OvR aggregation under imbalance. The cost is computational, since there are \(\binom{K}{2}\) pairs to evaluate.
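A brute-force sketch of the pairwise measure is given below. It assumes each instance has a score vector whose \(i\)-th entry is the score for class \(i\), and it uses made-up toy data; the helper names `pair_auc`, `directional_auc`, and `hand_till_m` are illustrative, not a library API.

```python
from itertools import combinations

def pair_auc(pos, neg):
    """Fraction of (positive, negative) score pairs ranked correctly; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def directional_auc(y_true, scores, i, j):
    """A(i|j): separate class i from class j using only the class-i score."""
    pos = [row[i] for y, row in zip(y_true, scores) if y == i]
    neg = [row[i] for y, row in zip(y_true, scores) if y == j]
    return pair_auc(pos, neg)

def hand_till_m(y_true, scores, K):
    """Average of the symmetrized pairwise AUCs over all unordered class pairs."""
    total = 0.0
    for i, j in combinations(range(K), 2):
        a_ij = directional_auc(y_true, scores, i, j)
        a_ji = directional_auc(y_true, scores, j, i)
        total += 0.5 * (a_ij + a_ji)
    return 2.0 * total / (K * (K - 1))

# Hypothetical toy data: three classes, six instances.
y_true = [0, 0, 0, 1, 1, 2]
scores = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.50, 0.10, 0.40],
    [0.30, 0.50, 0.20],
    [0.45, 0.25, 0.30],
    [0.48, 0.30, 0.22],
]

M = hand_till_m(y_true, scores, K=3)
print(f"M = {M:.3f}")
```

Each pairwise term uses only the instances of the two classes involved, so changing the size of a third class leaves that term untouched.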
161.5.3 5.3 Interpretation and the Volume Under the Surface
A subtlety often missed is that the convenient probabilistic reading of binary AUC, namely the probability that a random positive outranks a random negative, does not transfer cleanly to multiclass averages. The OvR and OvO aggregates are useful scalar diagnostics of ranking quality, but they are summaries of many binary comparisons rather than a single coherent area. A genuinely \(K\)-dimensional generalization exists, the volume under the ROC surface (VUS), which measures the probability that a random tuple drawn one per class is ranked in correct order. The VUS is theoretically elegant but grows costly and hard to visualize as \(K\) increases, which is why the OvR and OvO scalar reductions remain the practical default.
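    # OvR scoring sketch; assumes y_true is an (N,) array and scores[:, k] scores classes[k]
    import numpy as np
    from sklearn.metrics import roc_auc_score
    auc = {}
    for k, c in enumerate(classes):
        y_bin = (y_true == c)                        # class c versus the rest
        auc[c] = roc_auc_score(y_bin, scores[:, k])  # OvR AUC for class c
    macro_auc = np.mean(list(auc.values()))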
161.6 6. Choosing and Reporting Metrics
No single number captures multiclass performance. A defensible report combines several layers. Begin with the full confusion matrix, which preserves all information and reveals the specific confusions a model makes. Add a per-class precision, recall, and \(F_1\) table so minority class behavior is visible. Then report aggregates, stating the averaging scheme explicitly: macro when every class matters equally, weighted when population performance is the goal, and micro only with the understanding that it equals accuracy in the single-label case. Include Cohen’s kappa to discount chance agreement under imbalance. When threshold-independent ranking quality is of interest, also report a multiclass AUC, preferring the OvO form when priors are skewed. The discipline is to match the metric to the cost structure of the application rather than to default to whatever a library prints first.
161.7 7. Summary
The move from binary to multiclass evaluation is organized around the confusion matrix and its one versus rest reduction, which lets every binary metric reappear per class. Aggregation then forces a choice: macro averaging weights classes equally and exposes minority failures, micro averaging weights instances equally and collapses to accuracy in the single-label setting, and weighted averaging interpolates by support. Cohen’s kappa corrects agreement for chance and is essential under imbalance. ROC analysis generalizes through OvR and OvO aggregation, with the OvO measure of Hand and Till offering prior insensitivity at higher computational cost. Reporting all of these layers, rather than a lone scalar, is what gives an honest picture of multiclass performance.
161.8 References
- Sokolova, M., and Lapalme, G. A systematic analysis of performance measures for classification tasks. Information Processing and Management, 2009. https://doi.org/10.1016/j.ipm.2009.03.002
- Hand, D. J., and Till, R. J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 2001. https://doi.org/10.1023/A:1010920819831
- Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960. https://doi.org/10.1177/001316446002000104
- Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 2006. https://doi.org/10.1016/j.patrec.2005.10.010
- Grandini, M., Bagli, E., and Visani, G. Metrics for multi-class classification: an overview. arXiv:2008.05756, 2020. https://arxiv.org/abs/2008.05756
- Pedregosa, F., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011. https://scikit-learn.org/stable/modules/model_evaluation.html
- Ferri, C., Hernandez-Orallo, J., and Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognition Letters, 2009. https://doi.org/10.1016/j.patrec.2008.08.010