16  Large-Scale Benchmarking of Classifiers

Scope: retail. Large-scale classifier benchmark across UCI German, Taiwan, Home Credit, HMDA, and LendingClub. All benchmark datasets are consumer; corporate-distress benchmarking lives in Chapter 6 and Chapter 33.

Overview

Credit scoring has been the most benchmarked application of supervised classification in the operations-research literature. Two studies anchor the field: Baesens et al. (2003) in the Journal of the Operational Research Society, and Lessmann et al. (2015) in the European Journal of Operational Research. Between them they cover two decades of method development, from logistic regression and discriminant analysis through support vector machines, random forests, gradient boosting, and early neural architectures. Their conclusions are the only place where a practitioner can read, in one line, whether a new method is worth the operational cost of moving away from a scorecard.

This chapter reproduces the core comparative machinery on two public datasets, German and Taiwan, and places the findings in the context of the modern tree-ensemble era and the recent tabular-deep-learning wave examined by Grinsztajn et al. (2022). The comparison is framed around a specific research question: conditional on a fixed training budget, fixed features, and fixed evaluation metric, which families of classifiers dominate, by how much, and with what statistical confidence. The secondary question is methodological: how should a practitioner compare several classifiers across several datasets without inflating Type I error.

The organizing tool is the non-parametric multi-classifier comparison framework of Demšar (2006): Friedman rank test, Nemenyi post-hoc, and the critical-difference diagram. The chapter derives each step from first principles, implements the test and the diagram in NumPy and matplotlib, and then applies the framework to a mini-benchmark of nine classifiers under stratified 5-by-2 cross-validation. The chapter closes with an algorithm-selection guide that explicitly states when logistic regression still wins, and with a reading of the deep-learning-versus-trees debate on tabular data.

Notation

\(K\) is the number of classifiers, indexed by \(j\). \(N\) is the number of datasets or independent evaluation splits, indexed by \(i\). \(r_{ij}\) is the rank of classifier \(j\) on dataset \(i\), with average rank \(\bar r_j = \frac{1}{N}\sum_i r_{ij}\). Performance metrics are \(\mathrm{AUC}\) (area under the ROC curve, Hanley & McNeil (1982)), the Kolmogorov-Smirnov statistic \(\mathrm{KS}\), the Brier score \(B\) (Brier, 1950), and Hand’s \(H\)-measure (Hand, 2009). \(y_i \in \{0,1\}\) is the default indicator, \(\hat p_i \in [0,1]\) the predicted probability of default.


Why benchmarking is hard

Benchmarking in credit scoring is not a neutral exercise. The choice of datasets, metric, cross-validation scheme, and hyper-parameter budget all load the dice. Hand (2009) showed that AUC can give incoherent rankings when two classifiers induce different implicit cost distributions. Verbraken et al. (2014) argued that profit-based metrics should replace AUC whenever loss-given-default is known. Demšar (2006) pointed out that the paired \(t\)-test across datasets is badly mis-calibrated because datasets are heterogeneous on both variance and difficulty.

Three confounds appear in every benchmark paper worth reading. The first is variance inflation from the small number of public credit datasets: typically eight to ten, which gives a non-parametric rank test at most ten observations per classifier and hence low power. The second is the hyper-parameter budget: many published results exaggerate the gap between gradient boosting and logistic regression because the boosting model was tuned and the baseline was not. The third is the target metric: a classifier that wins on Brier score may lose on AUC because Brier rewards calibration and AUC rewards ranking, and the two can disagree (see Hand (2009) for the coherence argument).

A serious benchmark has to neutralize all three. It needs (i) enough datasets or enough independent resamples to give the rank test real power, (ii) a common, pre-registered tuning protocol applied symmetrically, and (iii) a basket of metrics rather than a single scalar. Lessmann and colleagues did all three. We follow their template.

The template also has to survive emerging-market conditions. In Vietnam, the Credit Information Center reports bureau coverage below 70 percent of adults (National Credit Information Centre of Vietnam, 2023), Lunar New Year introduces vintage seasonality, and regulated banks operate under Basel II standardized rules via SBV Circular 41/2016 (State Bank of Vietnam, 2016). Any benchmark that ignores vintage and coverage skew will rank classifiers that overfit to a single year. The Vietnam-and-EM section at the end of this chapter returns to this point.

16.1 The Baesens 2003 benchmark

Baesens et al. (2003) compared seventeen classification algorithms on eight real-life credit-scoring datasets. Their study set the template for everything that followed: metric was classification accuracy and AUC, protocol was stratified ten-fold cross-validation with fixed tuning grids, and the statistical comparison used paired McNemar tests.

The eight datasets were a mix of public (German, Australian, Japanese from UCI) and industry-provided retail portfolios. Sample sizes ranged from 690 (Australian) to roughly 37,000 (Bene1, a large European consumer loans book). Default rates ranged from 5.6% to 44.4%. The heterogeneity is the point: any algorithm that wins on all eight is robust to class imbalance and sample size.

The classifier set spanned six groups. Linear and quadratic methods: logistic regression, linear discriminant analysis (Section 6.1), quadratic discriminant analysis (Section 6.1.5), Fisher’s discriminant. Decision-tree methods: C4.5 (Quinlan, 1993, as cited in the paper), C4.5rules, CART, and an instance-averaged tree. Neural networks: a multi-layer perceptron trained with back-propagation, radial-basis networks, and LVQ. Kernel methods: least-squares support vector machines with linear and RBF kernels (LS-SVM is the Suykens variant studied extensively in the Leuven group that authored the paper). Nearest-neighbor methods: \(k\)-NN with \(k \in \{10, 100\}\). Probabilistic methods: naive Bayes.

The three headline findings of Baesens et al. (2003) have held up well:

  1. Classification accuracy differs little across sensibly specified classifiers. On five of the eight datasets the difference between the best and worst classifier was under three percentage points of accuracy. McNemar tests rejected the null of equal error rates for most pairs, but the effect sizes were small. This is the origin of the folk claim in retail credit that “the data matters more than the algorithm”.

  2. Least-squares SVM with RBF kernel had the best average rank, followed closely by the neural-network perceptron and logistic regression. LS-SVM and perceptron both require standardization and tuning; logistic regression does not. On a tuning-adjusted comparison the perceptron and logistic regression were statistically indistinguishable.

  3. Simple methods are competitive. Logistic regression, linear discriminant analysis, and \(k\)-NN with \(k = 100\) were all in the top half of the rank table on most datasets. Decision trees underperformed, in line with the classical result that single trees have high variance on small datasets (Breiman, 2001).

Three limitations of Baesens et al. (2003) are worth naming. First, the conclusions leaned heavily on classification accuracy, with AUC in a supporting role. Accuracy is threshold-dependent and penalizes a calibrated classifier that picks the wrong decision point for the test-set class balance. Second, the ensemble families that now dominate (bagging, random forests, gradient boosting, and stacking) were only nascent in 2003 and were not included. Third, the paper did not apply a multi-comparison correction, so the pairwise McNemar tests over-reject. Lessmann et al. (2015) fixed all three.

16.2 The Lessmann 2015 update

Lessmann et al. (2015) extended the comparison to 41 classifiers on eight credit datasets using a richer metric set: AUC, partial AUC restricted to the operational range of low false-positive rates (McClish, 1989), Brier score, Hand’s \(H\)-measure (Hand, 2009), and the expected maximum profit criterion EMP (Verbraken et al., 2014). The 41 classifiers cluster into families:

  • Individual classifiers: logistic regression, regularized logistic (Lasso, Ridge, Elastic Net), LDA, naive Bayes, \(k\)-NN, classification trees (C4.5, CART), ANN, RBF networks, SVM (linear, RBF), LS-SVM.
  • Homogeneous ensembles: bagging of trees, random forests, AdaBoost, stochastic gradient boosting, rotation forest, LogitBoost.
  • Heterogeneous ensembles: stacking with a linear meta-learner, hill-climbing ensemble selection, dynamic classifier selection, mean and median voting across heterogeneous bases.
  • Rule learners: RIPPER, PART.

The critical methodological contribution was the use of Demšar (2006)’s non-parametric machinery: rank by AUC on each dataset, compute average ranks across datasets, apply the Friedman test with the Iman & Davenport (1980) correction, then draw a Nemenyi critical-difference diagram to reveal which classifiers are statistically indistinguishable at a chosen confidence level.

16.2.1 The ranking in one paragraph

Heterogeneous ensembles, specifically hill-climbing ensemble selection and stacking, had the best average ranks on AUC, partial AUC, and \(H\)-measure. They were followed, tightly, by random forest and stochastic gradient boosting. Individual classifiers other than regularized logistic regression finished below the ensembles. Among individual classifiers, regularized logistic regression (Ridge) had the best rank, followed by ANN and SVM-RBF. Decision trees and naive Bayes anchored the bottom of the table. Logistic regression without regularization sat in the middle of the individual classifiers, behind Ridge but ahead of LDA and the rule learners.

16.2.2 Effect sizes

The AUC gap between the best heterogeneous ensemble and logistic regression, averaged across the eight datasets in the Lessmann study, was approximately 1.5 to 2 percentage points. On partial AUC restricted to the 0 to 0.4 FPR range, the gap widened to around 3 points. On Brier score the gap was smaller in absolute terms, roughly 0.005 to 0.010, but this translates into a non-trivial improvement in calibration-weighted loss. On \(H\)-measure, the heterogeneous ensembles retained their lead. EMP told the same story but with much tighter effect sizes: the monetary value of switching from logistic regression to a stacked ensemble was, in the datasets studied, positive but small, of the order of 0.1 to 0.3 percent of portfolio expected profit per granted loan.

This is the empirical fact practitioners need to internalize: in properly benchmarked credit scoring, the best modern method beats logistic regression by 1 to 2 AUC points, not 5 to 10. A single internal validation where the gap is larger than that is almost certainly a symptom of under-tuned baselines, leakage, or a non-representative test split.

16.2.3 The Lessmann ordering

Collapsing the paper’s average-rank table across four of its metrics (AUC, partial AUC, Brier, \(H\)), of which only the Brier score is a proper scoring rule, the classifier families sort as:

\[ \begin{aligned} &\text{heterogeneous ensembles} \succ \text{gradient boosting} \succ \text{random forest} \\ &\quad \succ \text{ANN} \succ \text{regularized LR} \succ \text{LR} \succ \text{LDA} \succ \text{trees}. \end{aligned} \]

The gaps between adjacent families shrink as we move left to right. The last three are statistically indistinguishable at the 95 percent confidence level in the Nemenyi diagram for most metrics, and all three trail the ensembles by a distance that clears the critical-difference threshold on AUC and \(H\).

16.2.4 What this means for practitioners

Three practitioner takeaways follow. First, if the regulator is agnostic and the cost of model complexity is low, heterogeneous ensembles are the AUC-maximizing choice. Second, among single-model options, the sensible rank order is: gradient-boosted trees first, random forest second, regularized logistic regression third. Third, the gap between options two and three is almost always smaller than model-risk considerations: if the regulator demands monotonicity, explainability, and stable coefficient interpretation, the small AUC concession from choosing regularized logistic regression is usually worth it.

Later work by Dastile et al. (2020) reviewed 74 follow-up papers and reached compatible conclusions, with the addition that XGBoost specifically has emerged as the most-studied single model in post-2015 credit-scoring papers and has, on average, matched or slightly beaten random forests on AUC, consistent with the gradient-boosting family being the strongest single-model choice.

16.3 Statistical comparison of classifiers

The statistical problem of Demšar (2006) is: given a matrix \(P \in \mathbb{R}^{N \times K}\) of performance scores, with \(N\) datasets and \(K\) classifiers, test the null hypothesis that all classifiers have the same expected performance, and, if rejected, identify which pairs differ.

16.3.1 Why not paired \(t\)-tests

The paired \(t\)-test across datasets assumes performances are commensurable and normally distributed. In practice, one dataset might have an AUC range of 0.60 to 0.65 across classifiers, while another has 0.80 to 0.90. Averaging absolute differences in AUC across such datasets weights the dataset with the wider spread more heavily (here the high-AUC one), even though it may be the easier problem where all classifiers do well. Demšar (2006) recommended ranks instead of raw scores because ranks are scale-free: the best classifier on a dataset gets rank 1 regardless of whether its AUC is 0.65 or 0.95.

A paired \(t\)-test across datasets also has the wrong Type I error because \(N\) is small (typically 8 to 10) and the classifier-specific deviations are heavy-tailed. The Wilcoxon signed-rank test (Wilcoxon, 1945) handles pairwise comparisons robustly, but for more than two classifiers the M. Friedman (1937) rank test is the standard.

16.3.2 The Friedman test

Rank the \(K\) classifiers on each of the \(N\) datasets, with rank 1 for the best. Let \(r_{ij}\) be the rank of classifier \(j\) on dataset \(i\), with tied classifiers sharing the average of the ranks they span. Define the average rank of classifier \(j\) as \(\bar r_j = \frac{1}{N}\sum_{i=1}^N r_{ij}\). Under the null \(H_0\) that all classifiers are equivalent, each dataset generates a uniformly random permutation of the ranks, so \(\bar r_j\) has expectation \((K+1)/2\) and variance \((K^2-1)/(12N)\); both moments are exact for any \(N\) when there are no ties.

Friedman’s statistic measures deviation of observed average ranks from the null expectation:

\[ \chi_F^2 = \frac{12 N}{K(K+1)} \left[\sum_{j=1}^K \bar r_j^2 - \frac{K(K+1)^2}{4}\right]. \tag{16.1}\]

Under \(H_0\), \(\chi_F^2\) is asymptotically distributed as \(\chi^2\) with \(K-1\) degrees of freedom. Iman & Davenport (1980) pointed out that \(\chi_F^2\) is conservative for small \(N\) and \(K\) and proposed the \(F\)-statistic

\[ F_F = \frac{(N-1) \chi_F^2}{N(K-1) - \chi_F^2}, \tag{16.2}\]

which follows an \(F\) distribution with \(K-1\) and \((K-1)(N-1)\) degrees of freedom. The Iman-Davenport adjustment is the version Demšar (2006) and Lessmann et al. (2015) report.
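Eq. 16.1 and Eq. 16.2 can be sketched in a few lines of NumPy and SciPy. The function name `friedman_iman_davenport` and the toy AUC matrix are illustrative assumptions, not part of the benchmark; the sketch assumes higher scores are better and that \(\chi_F^2 < N(K-1)\), so the denominator of Eq. 16.2 stays positive.

```python
import numpy as np
from scipy.stats import rankdata, chi2, f as f_dist

def friedman_iman_davenport(scores, higher_is_better=True):
    """Friedman chi-square (Eq. 16.1) and Iman-Davenport F (Eq. 16.2).

    scores: array of shape (N datasets, K classifiers).
    Returns average ranks, both statistics, and their p-values.
    """
    scores = np.asarray(scores, dtype=float)
    N, K = scores.shape
    # Rank 1 = best on each dataset; ties share the average of their ranks.
    signed = -scores if higher_is_better else scores
    ranks = rankdata(signed, method='average', axis=1)
    avg_ranks = ranks.mean(axis=0)
    chi2_f = 12.0 * N / (K * (K + 1)) * (
        np.sum(avg_ranks ** 2) - K * (K + 1) ** 2 / 4.0)
    # Undefined if chi2_f equals N(K-1), i.e. identical rankings everywhere.
    f_f = (N - 1) * chi2_f / (N * (K - 1) - chi2_f)
    p_chi2 = chi2.sf(chi2_f, K - 1)
    p_f = f_dist.sf(f_f, K - 1, (K - 1) * (N - 1))
    return avg_ranks, chi2_f, p_chi2, f_f, p_f

# Hypothetical AUCs: 4 datasets (rows) by 3 classifiers (columns).
auc = np.array([[0.78, 0.80, 0.75],
                [0.64, 0.66, 0.61],
                [0.90, 0.91, 0.92],
                [0.71, 0.74, 0.70]])
avg_ranks, chi2_f, p_chi2, f_f, p_f = friedman_iman_davenport(auc)
print(avg_ranks, chi2_f, f_f)
```

On this toy matrix the average ranks are 2.25, 1.25, and 2.5, which gives \(\chi_F^2 = 3.5\) and \(F_F = 10.5/4.5 \approx 2.33\). With only four datasets neither statistic comes close to rejecting, which illustrates the low-power problem discussed earlier.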

Derivation of Eq. 16.1

Under the null, ranks \((r_{i1}, \dots, r_{iK})\) are a uniform random permutation of \(\{1, \dots, K\}\). The sum \(\sum_j r_{ij} = K(K+1)/2\) and the sum of squared ranks is \(\sum_j r_{ij}^2 = K(K+1)(2K+1)/6\), both non-random. The only random quantities are the individual \(r_{ij}\).

Compute \(\mathrm{Var}(\bar r_j) = \frac{1}{N^2}\sum_i \mathrm{Var}(r_{ij}) = \frac{1}{N}\mathrm{Var}(r_{1j})\). For a single dataset, since \(r_{1j}\) is uniform on \(\{1,\dots,K\}\), \(\mathrm{Var}(r_{1j}) = (K^2-1)/12\). So \(\mathrm{Var}(\bar r_j) = (K^2-1)/(12N)\).

Now treat the \(\bar r_j\) as approximately normal under the null. The sum of squared deviations from the common mean \((K+1)/2\), rescaled by the variance, is

\[ Q = \sum_{j=1}^K \frac{(\bar r_j - (K+1)/2)^2}{(K^2-1)/(12N)}. \]

Expanding the square and using \(\sum_j \bar r_j = K(K+1)/2\):

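\[ \begin{aligned} Q &= \frac{12N}{K^2-1} \left[\sum_j \bar r_j^2 - K \left(\frac{K+1}{2}\right)^2\right] \\ &= \frac{K}{K-1} \cdot \frac{12N}{K(K+1)} \left[\sum_j \bar r_j^2 - \frac{K(K+1)^2}{4}\right] = \frac{K}{K-1}\,\chi_F^2. \end{aligned} \]

So \(Q\) overshoots Eq. 16.1 by the factor \(K/(K-1)\), and \(\chi_F^2 = \frac{K-1}{K} Q\). The correction reflects the fact that the ranks are not independent: they sum to a constant within each dataset, so \(\mathrm{Cov}(\bar r_j, \bar r_{j'}) = -(K+1)/(12N)\) for \(j \neq j'\), and the deviations \(\bar r_j - (K+1)/2\) lie in a \((K-1)\)-dimensional subspace on which every direction has variance \(K(K+1)/(12N)\). Scaling the sum of squared deviations by \(K(K+1)/(12N)\) instead of \((K^2-1)/(12N)\) therefore gives Eq. 16.1, with \(K-1\) degrees of freedom. The \(\chi^2\) approximation becomes exact in the limit \(N \to \infty\) by a Lindeberg-type central-limit argument; corrections for tied ranks and for small \(N\) are standard (Hodges & Lehmann, 1962).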
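The null moments used in the derivation can be checked by simulation. The sketch below is illustrative: \(K = 5\), \(N = 8\), and the number of replications are arbitrary choices. It draws independent uniform permutations of the ranks and compares the empirical variance and covariance of the average ranks, and the mean of \(\chi_F^2\), with their theoretical values.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, n_sim = 5, 8, 20000   # illustrative sizes, not the chapter's benchmark

# Under H0 each dataset yields an independent uniform permutation of 1..K.
# argsort of i.i.d. uniforms along the last axis gives such permutations.
ranks = np.argsort(rng.random((n_sim, N, K)), axis=2) + 1   # (n_sim, N, K)
avg_ranks = ranks.mean(axis=1)                              # (n_sim, K)

var_theory = (K ** 2 - 1) / (12 * N)     # Var of an average rank
cov_theory = -(K + 1) / (12 * N)         # Cov between two average ranks
emp_cov = np.cov(avg_ranks, rowvar=False)

# Friedman statistic of Eq. 16.1 for every simulated replication.
chi2_f = 12 * N / (K * (K + 1)) * (
    (avg_ranks ** 2).sum(axis=1) - K * (K + 1) ** 2 / 4)

print('variance  :', emp_cov[0, 0], 'theory', var_theory)
print('covariance:', emp_cov[0, 1], 'theory', cov_theory)
print('mean chi2 :', chi2_f.mean(), 'theory', K - 1)
```

The mean of the simulated statistic should sit close to \(K - 1\), the mean of a \(\chi^2\) variable with \(K-1\) degrees of freedom. The negative covariance is what separates Eq. 16.1 from the naive sum of \(K\) independent squared \(z\)-scores.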

16.3.3 Nemenyi post-hoc

If the Friedman test rejects, compare pairs. The Nemenyi procedure (Nemenyi, 1963) is the Friedman analog of Tukey’s range test. Two classifiers \(j\) and \(j'\) differ significantly at family-wise level \(\alpha\) if

\[ |\bar r_j - \bar r_{j'}| \geq q_\alpha \sqrt{\frac{K(K+1)}{6N}}, \tag{16.3}\]

where \(q_\alpha\) is the upper-\(\alpha\) critical value (the \(1-\alpha\) quantile) of the Studentized range distribution with \(K\) groups and \(\infty\) degrees of freedom, divided by \(\sqrt 2\). The quantity on the right is the critical difference (CD). Tables of \(q_\alpha\) are standard; for \(\alpha = 0.05\) and \(K\) between 2 and 10, values range from about 1.96 (for \(K=2\), recovering the two-sided normal critical value) up to about 3.16 for \(K = 10\).
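As a worked instance of Eq. 16.3, the sketch below computes the critical difference from the \(\alpha = 0.05\) critical values tabulated by Demšar (2006). The helper name `nemenyi_cd` is an illustrative assumption, and the table covers only \(K\) from 2 to 10.

```python
import math

# Critical values q_alpha at alpha = 0.05 (Studentized range / sqrt(2)),
# as tabulated by Demsar (2006) for K = 2..10 classifiers.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def nemenyi_cd(K, N, q_table=Q_ALPHA_05):
    """Critical difference of Eq. 16.3 for K classifiers and N rank vectors."""
    return q_table[K] * math.sqrt(K * (K + 1) / (6.0 * N))

# CD for nine classifiers at several values of N.
for N in (2, 10, 20):
    print(N, round(nemenyi_cd(9, N), 3))
```

With \(K = 9\) and \(N = 2\) the critical difference exceeds \(K - 1 = 8\), the largest gap two average ranks can have, so no pair can be declared different from two rank vectors alone. The threshold only becomes useful as \(N\) grows. Whether \(N\) should count datasets or resampled folds is a modelling choice, and treating folds as independent observations is an assumption.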

The critical-difference diagram visualizes Eq. 16.3. Classifiers are placed on a horizontal axis at their average rank. A horizontal bar of length CD is drawn starting at the best average rank; any classifier whose average rank falls within the bar is statistically indistinguishable from the best at level \(\alpha\). The same construction is repeated along the axis: each maximal group of classifiers whose average ranks all lie within CD of one another is joined by a bar, and only classifiers that share no bar differ significantly.
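The grouping step behind the diagram can be sketched without any plotting. The function `cd_groups`, the classifier labels, and the average ranks below are hypothetical; the sketch returns the maximal sets of classifiers whose average ranks differ by less than CD, which are the sets a diagram would join with a bar.

```python
import numpy as np

def cd_groups(names, avg_ranks, cd):
    """Maximal groups of classifiers whose average ranks differ by less than cd."""
    order = np.argsort(avg_ranks)
    r = np.asarray(avg_ranks, dtype=float)[order]
    labels = [names[i] for i in order]
    groups, last_end = [], -1
    for start in range(len(r)):
        end = start
        # Extend the run while the next classifier is within cd of the run's best.
        while end + 1 < len(r) and r[end + 1] - r[start] < cd:
            end += 1
        # Keep only runs of two or more that are not inside an earlier run.
        if end > last_end and end > start:
            groups.append(labels[start:end + 1])
        last_end = max(last_end, end)
    return groups

# Hypothetical labels and average ranks, given in arbitrary order.
names = ['C', 'A', 'E', 'B', 'D']
avg_ranks = [3.2, 1.5, 4.3, 2.0, 4.0]
groups = cd_groups(names, avg_ranks, cd=1.0)
print(groups)   # [['A', 'B'], ['C', 'D'], ['D', 'E']]
```

Classifier D belongs to two bars, which is common in practice: indistinguishability under Nemenyi is not transitive, so C and E differ significantly even though neither differs from D.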

The Nemenyi procedure controls the family-wise error rate over all \(K(K-1)/2\) pairwise comparisons, which makes it conservative. For comparisons against a single control classifier, the Bonferroni-Dunn correction is the right analog: replace \(q_\alpha\) with the upper \(\alpha/(2(K-1))\) quantile of the standard normal for a two-sided test. Holm’s step-down procedure (Holm, 1979) is uniformly more powerful than Bonferroni-Dunn and is the recommended default when controlling FWER. García & Herrera (2008) reviewed these options and recommended Holm and Hommel corrections over Nemenyi when all-pairwise control is needed with high power.
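A control-versus-rest comparison with Holm’s step-down adjustment can be sketched as follows. The function name `holm_vs_control` and the example ranks are illustrative assumptions; the test statistic is the rank difference divided by the standard error \(\sqrt{K(K+1)/(6N)}\) from Eq. 16.3, referred to a standard normal.

```python
import numpy as np
from scipy.stats import norm

def holm_vs_control(avg_ranks, N, control=0):
    """Holm-adjusted two-sided p-values for each classifier against a control."""
    avg_ranks = np.asarray(avg_ranks, dtype=float)
    K = len(avg_ranks)
    se = np.sqrt(K * (K + 1) / (6.0 * N))
    others = [j for j in range(K) if j != control]
    z = (avg_ranks[others] - avg_ranks[control]) / se
    p = 2 * norm.sf(np.abs(z))
    m = len(others)
    adj = np.empty(m)
    running = 0.0
    # Step down from the smallest p-value: multiply by m, m-1, ..., 1 and
    # enforce monotonicity so adjusted p-values never decrease.
    for step, idx in enumerate(np.argsort(p)):
        running = max(running, (m - step) * p[idx])
        adj[idx] = min(1.0, running)
    return dict(zip(others, adj))

# Hypothetical average ranks for K = 4 classifiers over N = 10 datasets;
# classifier 0 is the control.
adjusted = holm_vs_control([1.2, 2.4, 2.9, 3.5], N=10, control=0)
for j, p_adj in adjusted.items():
    print(j, round(float(p_adj), 5))
```

The classifier farthest from the control carries the full multiplier \(K-1\); the nearest one carries multiplier 1, so its adjusted p-value equals its raw p-value unless the monotonicity step raises it. That is the sense in which Holm dominates the single-step Bonferroni-Dunn correction.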

16.3.4 Ranks and AUC

There is a direct relationship between the rank-based tests of Demšar (2006) and the rank-based metric AUC. Hanley & McNeil (1982) showed that AUC equals the Mann-Whitney \(U\) statistic normalized by the product of positive and negative class sizes:

\[ \mathrm{AUC} = \frac{1}{n_+ n_-} \sum_{i: y_i = 1}\sum_{k: y_k = 0} \left[ \mathbb{1}\{\hat p_i > \hat p_k\} + \tfrac{1}{2}\mathbb{1}\{\hat p_i = \hat p_k\} \right]. \tag{16.4}\]

So AUC is itself a rank statistic on predictions. Applying the Friedman test to AUC across datasets is therefore a rank test of rank statistics: the outer rank is over classifiers, the inner rank is over predictions. This double-rank structure is robust to monotone transformations of the prediction scale, which is exactly the invariance property that makes AUC attractive for credit scoring in the first place.

The practical upshot: a Friedman-Nemenyi analysis on AUC asks whether classifier \(j\) tends to rank borrowers better than classifier \(j'\), averaged over datasets, not whether it produces better-calibrated probabilities. For calibration, apply the same machinery to Brier score or to log-loss, which are strictly proper scoring rules.
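Eq. 16.4 and the invariance argument can be illustrated on a toy example. The two AUC helpers, the Brier helper, and the six-borrower data are illustrative assumptions; cubing the scores stands in for an arbitrary strictly increasing transformation on \([0,1]\).

```python
import numpy as np
from scipy.stats import rankdata

def auc_pairwise(y, p):
    """AUC from Eq. 16.4: share of (default, non-default) pairs ordered correctly."""
    y = np.asarray(y).astype(bool)
    p = np.asarray(p, dtype=float)
    diff = p[y][:, None] - p[~y][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def auc_from_ranks(y, p):
    """Same quantity via the Mann-Whitney U statistic on average ranks."""
    y = np.asarray(y).astype(bool)
    r = rankdata(p)                       # ties share the average rank
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    u = r[y].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

def brier(y, p):
    return float(np.mean((np.asarray(p, dtype=float) - np.asarray(y)) ** 2))

# Hypothetical default indicators and predicted probabilities (one tie at 0.4).
y = np.array([1, 0, 1, 0, 0, 1])
p = np.array([0.9, 0.4, 0.4, 0.1, 0.6, 0.7])
p_warped = p ** 3                         # strictly increasing on [0, 1]

print(auc_pairwise(y, p), auc_from_ranks(y, p))     # both 7.5 / 9
print(auc_pairwise(y, p_warped))                    # unchanged by the warp
print(brier(y, p), brier(y, p_warped))              # Brier score changes
```

Both AUC routes give \(7.5/9\): seven of the nine default versus non-default pairs are ordered correctly, one is tied, and one is reversed. Warping the scores leaves every pairwise ordering intact, so AUC is unchanged while the Brier score moves.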

16.3.5 Bayesian alternatives

Benavoli et al. (2017) argue that the Friedman-Nemenyi framework answers the wrong question for most practical purposes. A frequentist rejection of \(H_0\) does not translate into a posterior statement about which classifier is better for deployment. They propose Bayesian alternatives: posterior distributions over differences in mean AUC or over the probability that classifier \(j\) beats classifier \(j'\). For the scope of this chapter we stay with the frequentist framework because it is what the benchmarking literature uses; the Bayesian version is a straightforward add-on.
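To make the contrast concrete, here is a simplified sketch in the spirit of the Bayesian sign test discussed by Benavoli et al. (2017). It is not their reference implementation. The region of practical equivalence (`rope`), the symmetric Dirichlet prior of total weight 1, the function name, and the AUC differences are all assumptions of this sketch.

```python
import numpy as np

def bayesian_sign_test(diffs, rope=0.01, prior_weight=1.0,
                       n_draws=50000, seed=0):
    """Posterior probability that 'B better', 'equivalent', or 'A better'
    is the most likely outcome on a new dataset.

    diffs: per-dataset score differences, score_A - score_B.
    """
    diffs = np.asarray(diffs, dtype=float)
    counts = np.array([(diffs < -rope).sum(),          # B better
                       (np.abs(diffs) <= rope).sum(),  # practically equivalent
                       (diffs > rope).sum()],          # A better
                      dtype=float)
    # Symmetric Dirichlet prior (an assumption of this sketch).
    alpha = counts + prior_weight / 3.0
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(alpha, size=n_draws)
    winners = theta.argmax(axis=1)
    return np.bincount(winners, minlength=3) / n_draws

# Hypothetical AUC differences (A minus B) on eight datasets.
diffs = [0.02, 0.03, 0.015, -0.005, 0.04, 0.025, 0.0, 0.05]
probs = bayesian_sign_test(diffs)
print(dict(zip(['B better', 'equivalent', 'A better'], probs)))
```

The output is a direct posterior statement of the kind a deployment decision needs: how probable it is that A is practically better, practically equivalent, or practically worse. The answer depends on the chosen `rope` width and prior, which should be fixed before looking at the results.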

16.4 Standard credit benchmark datasets

Eight public datasets dominate the credit-scoring benchmark literature. Each has a characteristic sample size, imbalance profile, and feature mix. Their role in a benchmark is complementary: Australian and Japanese are small, clean, and near-balanced; German is small and moderately imbalanced with many categorical features; Taiwan is medium and realistic; Home Credit, Give Me Some Credit, and LendingClub are large and realistic; HMDA is the specialized fair-lending dataset.

16.4.1 Australian Credit Approval (UCI)

690 applications, 14 anonymized features (8 categorical, 6 numeric), 44.5% positive class. From a small Australian bank’s credit-card application pool. Anonymization makes feature interpretation impossible, which is why this dataset is used for methodological comparisons rather than substantive economic analysis. Near-balance makes AUC and accuracy nearly interchangeable. Good for sanity-checking a new classifier.

16.4.2 German Credit (UCI Statlog)

1000 applications, 20 features (13 categorical, 7 numeric), 30% default rate. Collected in southern Germany around 1994 by Hofmann (1994) as cited in the UCI repository. The most pedagogically important dataset in credit scoring: small enough to fit on a laptop in milliseconds, categorical-heavy enough to exercise encoding choices, imbalanced enough to exercise class-weight handling. Dominates introductory benchmarks.

16.4.3 Japanese Credit (UCI “crx”)

690 applications, 15 features (9 categorical, 6 numeric), roughly 44% positive. Similar profile to Australian and often treated as a replication check. Missing values on a handful of features make it a useful testbed for imputation.

16.4.4 Taiwan Default (UCI)

30,000 credit-card clients, 23 features, default-payment-next-month binary target with a 22.1% positive rate. Collected by Yeh & Lien (2009) in Taiwan in October 2005. Features include the credit limit, demographics, six months of bill amounts, six months of payment amounts, and six monthly repayment-status codes (PAY_0 and PAY_2 through PAY_6 in the UCI file). The repayment-status columns are highly predictive, which is realistic for behavior-based scoring but potentially misleading for application scoring, where such history is unavailable.

16.4.5 Give Me Some Credit (Kaggle)

150,000 borrowers, 10 features, 6.7% serious delinquency. Hosted on Kaggle in 2011. The target is serious delinquency within two years. The feature set is mostly behavioral (revolving utilization, debt ratio, number of past due observations). Missing values are concentrated in monthly income and number of dependents. Imbalance is moderate.

16.4.6 Home Credit Default Risk (Kaggle)

307,511 applications in the core table and six auxiliary tables containing bureau history and bureau balances, previous applications, credit card balances, installments, and POS cash balances. Positive rate 8.1%. One of the largest public credit datasets for applied work. Exercises joining, aggregation, feature engineering, and memory-conscious coding. The winning Kaggle solution used a blend of dozens of LightGBM models on engineered features; this sets an upper bound on realistic gradient-boosting AUC for the dataset around 0.805.

16.4.7 LendingClub

Raw dumps of the LendingClub loan book are available from 2007 to 2018, with over two million loans at peak. Features include loan amount, interest rate, term, FICO band, debt-to-income, employment, home ownership, purpose, zip-code first three digits, and post-origination status (current, fully paid, charged off, late). The target for scoring work is binary default (charged off vs fully paid, after filtering out current loans). Jagtiani & Lemieux (2019) use LendingClub as their empirical setting; the related peer-to-peer lending studies of Iyer et al. (2016) and Lin et al. (2013) draw on Prosper.com data, so their cleaning conventions do not carry over directly. LendingClub is realistic and large, but post-2018 changes to the platform limit its use for forward-looking research.

16.4.8 HMDA

The Home Mortgage Disclosure Act (HMDA) public data covers essentially all US mortgage applications, about 15 to 20 million records per year after 2018, with over 100 fields per application including race, sex, age, census tract, loan amount, income, debt-to-income, loan-to-value, and approval decision. The default target is not observed in HMDA directly; researchers either use application approval as a proxy or merge to GSE performance data. HMDA is the standard dataset for fair-lending research (Bartlett et al., 2022).

16.4.9 What each dataset exercises

A benchmark using only Australian and German will under-detect gradient boosting’s advantage because tree ensembles need medium-to-large samples to shine. A benchmark using only Home Credit and LendingClub will over-detect it because tree ensembles are most helpful on large messy data. The Lessmann benchmark’s strength was geographic and size diversity. A modern benchmark should include at least one dataset from each of three size classes: small (German, Australian), medium (Taiwan, Give Me Some Credit), large (Home Credit, LendingClub).

16.5 Mini-benchmark on German and Taiwan

We run a benchmark in the style of Lessmann et al. (2015) at a scale that renders in under two minutes. Nine classifiers: logistic regression (LR), linear discriminant analysis (LDA, Section 6.1), a shallow decision tree (DT), random forest (RF), XGBoost (XGB), LightGBM (LGB), CatBoost (CAT), radial-basis SVM, and a multi-layer perceptron (MLP) with two hidden layers. Two datasets: German and a 6,000-row random sample of Taiwan. Evaluation protocol: stratified 5-by-2 cross-validation, i.e. five repetitions of 2-fold splits, yielding ten out-of-fold AUC estimates per classifier per dataset. The 5-by-2 protocol is the Dietterich (1998) and Alpaydin (1999) recommendation for classifier comparison.

import os
os.environ['PYTHONHASHSEED'] = '0'

import sys, warnings, time
warnings.filterwarnings('ignore')
sys.path.insert(0, '../code')

import numpy as np
import pandas as pd
from creditutils import load_german_credit, load_taiwan_default, ks_statistic

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss, roc_curve

import xgboost as xgb
import lightgbm as lgb
import catboost as cb

RNG = np.random.default_rng(0)

16.5.1 The Hand H-measure

We need an H-measure implementation that integrates the expected misclassification cost against a Beta(2,2) severity prior, per Hand (2009). The integral is over the cost-weight \(c \in [0,1]\), where \(c\) is the share of total cost attributable to false negatives. For a given threshold \(t\) and score distribution the expected cost is \(c \pi_1 (1-\mathrm{TPR}(t)) + (1-c) \pi_0 \mathrm{FPR}(t)\). The Bayes-optimal threshold at each \(c\) minimizes this expected cost. The \(H\)-measure is one minus the normalized expected loss under the optimal policy, with \(L_{\max}\) being the loss of the trivial classifier.

from scipy.stats import beta as beta_dist

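def h_measure(y_true, scores, alpha=2.0, beta=2.0, grid=401):
    """H-measure: 1 - (expected minimum loss / loss of the trivial classifier),
    with the cost weight c integrated against a Beta(alpha, beta) density."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    n1 = int(y_true.sum())
    n0 = len(y_true) - n1
    if n1 == 0 or n0 == 0:
        return float('nan')  # undefined when only one class is present
    pi1 = n1 / len(y_true)
    pi0 = 1 - pi1
    fpr, tpr, _ = roc_curve(y_true, scores)
    # grid over the cost weight c, kept away from the endpoints 0 and 1
    cs = np.linspace(1e-3, 1 - 1e-3, grid)
    w = beta_dist.pdf(cs, alpha, beta)
    dc = cs[1] - cs[0]
    # expected cost at every (c, threshold) pair: rows index c, columns index ROC points
    costs = cs[:, None] * pi1 * (1 - tpr)[None, :] + (1 - cs)[:, None] * pi0 * fpr[None, :]
    L_opt = costs.min(axis=1)  # loss under the cost-minimizing threshold at each c
    # trivial classifier: the cheaper of "predict no defaults" and "predict all default"
    L_triv = np.minimum(cs * pi1, (1 - cs) * pi0)
    # Riemann-sum approximation of the two Beta-weighted integrals
    num = (L_opt * w).sum() * dc
    den = (L_triv * w).sum() * dc
    return float(1 - num / den)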

16.5.2 Data preparation

For German we one-hot encode categorical columns. For Taiwan we take a 6,000-row simple random sample to keep the benchmark inside its time budget (the draw is not stratified, so the sampled default rate of 21.6% sits slightly below the 22.1% population rate); nothing about the ordering of classifiers changes on the full 30,000 rows, a fact we verify in a footnote section below.

def prep_german():
    df = load_german_credit()
    y = df['default'].values.astype(int)
    X = pd.get_dummies(df.drop(columns=['default']), drop_first=True).astype(float).values
    return X, y

def prep_taiwan(n=6000, seed=0):
    df = load_taiwan_default().drop(columns=['id'])
    df = df.sample(n=n, random_state=seed).reset_index(drop=True)
    y = df['default'].values.astype(int)
    X = df.drop(columns=['default']).astype(float).values
    return X, y

Xg, yg = prep_german()
Xt, yt = prep_taiwan()
print(f'German: X={Xg.shape}, mean(y)={yg.mean():.3f}')
print(f'Taiwan: X={Xt.shape}, mean(y)={yt.mean():.3f}')
German: X=(1000, 48), mean(y)=0.300
Taiwan: X=(6000, 23), mean(y)=0.216
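The `df.sample` call in `prep_taiwan` draws a simple random sample, which is why the sampled default rate differs slightly from the population rate. A stratified draw that preserves the class shares can be sketched as below. The helper `stratified_sample` and the synthetic frame are illustrative assumptions, and the result is not used elsewhere in the benchmark.

```python
import numpy as np
import pandas as pd

def stratified_sample(df, target, n, seed=0):
    """Sample about n rows while keeping each class's share of the target."""
    frac = n / len(df)
    out = (df.groupby(target, group_keys=False)
             .sample(frac=frac, random_state=seed))
    # Shuffle so the classes are not stored in contiguous blocks.
    return out.sample(frac=1.0, random_state=seed).reset_index(drop=True)

# Synthetic stand-in with a 22.1% positive rate (not the real Taiwan data).
toy = pd.DataFrame({
    'x': np.arange(30000, dtype=float),
    'default': np.r_[np.ones(6630, dtype=int), np.zeros(23370, dtype=int)],
})
sub = stratified_sample(toy, 'default', n=6000)
print(len(sub), round(sub['default'].mean(), 3))
```

Because each class is sampled at the same fraction, the sampled default rate matches the population rate up to rounding within each class, and the total row count can differ from `n` by a row or two for the same reason.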

16.5.3 Classifier factory

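`make_models` returns a dictionary of unfitted estimators, each with a fixed random seed; the cross-validation routine clones an estimator before every fit, so no state carries over between folds. The boosted ensembles use moderate depths and 200 rounds without early stopping, and the random forest grows 200 unpruned trees. Scaling is pipelined for the estimators that need it.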

def make_models():
    return {
        'LR':  Pipeline([('s', StandardScaler()),
                         ('m', LogisticRegression(max_iter=2000, C=1.0, solver='lbfgs'))]),
        'LDA': Pipeline([('s', StandardScaler()),
                         ('m', LinearDiscriminantAnalysis())]),
        'DT':  DecisionTreeClassifier(max_depth=5, random_state=0),
        'RF':  RandomForestClassifier(n_estimators=200, max_depth=None,
                                       n_jobs=1, random_state=0),
        'XGB': xgb.XGBClassifier(n_estimators=200, max_depth=4,
                                  learning_rate=0.1, tree_method='hist',
                                  eval_metric='logloss', n_jobs=1,
                                  random_state=0, verbosity=0),
        'LGB': lgb.LGBMClassifier(n_estimators=200, num_leaves=31,
                                   learning_rate=0.1, n_jobs=1,
                                   random_state=0, verbose=-1),
        'CAT': cb.CatBoostClassifier(iterations=200, depth=5,
                                      learning_rate=0.1, verbose=0,
                                      random_seed=0,
                                      allow_writing_files=False),
        'SVM': Pipeline([('s', StandardScaler()),
                         ('m', SVC(C=1.0, kernel='rbf',
                                   probability=True, random_state=0))]),
        'MLP': Pipeline([('s', StandardScaler()),
                         ('m', MLPClassifier(hidden_layer_sizes=(32, 16),
                                             max_iter=300, random_state=0))]),
    }

16.5.4 The 5-by-2 cross-validation routine

Each repetition uses a fresh random seed to partition the data into two stratified halves, then trains on one half and evaluates on the other, and vice versa. Five repetitions yield ten evaluation folds per classifier.

from sklearn.base import clone

def five_by_two(X, y, builder_map, seed0=1):
    rows = []
    for name, builder in builder_map.items():
        aucs, kss, briers, hs = [], [], [], []
        for rep in range(5):
            # a new seed per repetition gives five different stratified 2-fold partitions
            skf = StratifiedKFold(n_splits=2, shuffle=True,
                                   random_state=seed0 + rep)
            for tr, te in skf.split(X, y):
                # make_models() supplies estimator instances; a zero-argument
                # builder is also accepted. Clone so each fold fits a fresh copy.
                m = builder() if callable(builder) else builder
                m = clone(m) if hasattr(m, 'get_params') else m
                m.fit(X[tr], y[tr])
                if hasattr(m, 'predict_proba'):
                    p = m.predict_proba(X[te])[:, 1]
                else:
                    s = m.decision_function(X[te])
                    p = (s - s.min()) / (s.max() - s.min() + 1e-12)
                aucs.append(roc_auc_score(y[te], p))
                kss.append(ks_statistic(y[te], p))
                briers.append(brier_score_loss(y[te], p))
                hs.append(h_measure(y[te], p))
        rows.append({
            'classifier': name,
            'AUC_mean': np.mean(aucs), 'AUC_std': np.std(aucs, ddof=1),
            'KS_mean':  np.mean(kss),  'KS_std':  np.std(kss, ddof=1),
            'Brier_mean': np.mean(briers), 'Brier_std': np.std(briers, ddof=1),
            'H_mean':   np.mean(hs),   'H_std':   np.std(hs, ddof=1),
            'auc_folds': aucs,
        })
    return pd.DataFrame(rows)

16.5.5 Running the benchmark

t0 = time.time()
res_german = five_by_two(Xg, yg, make_models())
res_taiwan = five_by_two(Xt, yt, make_models())
print(f'Benchmark finished in {time.time()-t0:.1f}s')
Benchmark finished in 49.5s
def pretty(df, tag):
    out = df[['classifier', 'AUC_mean', 'KS_mean', 'Brier_mean', 'H_mean']].copy()
    out.columns = ['classifier', f'AUC_{tag}', f'KS_{tag}', f'Brier_{tag}', f'H_{tag}']
    return out

combo = pretty(res_german, 'G').merge(pretty(res_taiwan, 'T'), on='classifier')
combo = combo.sort_values('AUC_G', ascending=False).reset_index(drop=True)
combo_display = combo.copy()
for c in combo_display.columns[1:]:
    combo_display[c] = combo_display[c].round(4)
combo_display
Mini-benchmark results on German (1,000 rows) and Taiwan (6,000 row sample), stratified 5x2 CV.
classifier AUC_G KS_G Brier_G H_G AUC_T KS_T Brier_T H_T
0 LDA 0.7698 0.4379 0.1770 0.2375 0.7154 0.3757 0.1411 0.2128
1 LR 0.7677 0.4277 0.1765 0.2310 0.7209 0.3761 0.1415 0.2124
2 CAT 0.7672 0.4333 0.1804 0.2308 0.7644 0.4082 0.1365 0.2187
3 RF 0.7666 0.4187 0.1719 0.2255 0.7653 0.4063 0.1367 0.2166
4 SVM 0.7653 0.4047 0.1712 0.2290 0.7239 0.3650 0.1391 0.2108
5 XGB 0.7628 0.4162 0.1829 0.2278 0.7528 0.3904 0.1398 0.2062
6 LGB 0.7601 0.4061 0.1980 0.2298 0.7456 0.3768 0.1495 0.1890
7 MLP 0.7292 0.3741 0.2369 0.1900 0.7199 0.3361 0.1640 0.1419
8 DT 0.6839 0.3459 0.2155 0.1186 0.7197 0.3724 0.1434 0.1998
combo.round(6).to_csv('ch16_benchmark_results.csv', index=False)

16.5.6 Reading the tables

Three patterns are visible, and they broadly match the ordering in Lessmann et al. (2015). First, on German the tree ensembles (RF, XGB, LGB, CAT), the linear baselines (LR, LDA), and the SVM cluster tightly at the top of AUC: the whole group spans less than 0.01 AUC, and LR sits within 0.001 of the best tree ensemble. Second, on Taiwan the tree ensembles pull ahead by a larger margin (RF and CAT lead LR by about 0.044 AUC), consistent with the dataset size being in the regime where non-linear models can discover interactions. Third, the single decision tree is the weakest classifier on German by a wide margin, which reproduces the classical bias-variance intuition; on Taiwan it sits in a bottom group with LR, LDA, SVM, and MLP, all between 0.715 and 0.724 AUC. MLP with only 32+16 units and no tuning underperforms; a well-tuned deeper MLP could close the gap, but the exercise of the chapter is to show untuned performance, which is what practitioners usually see in the first experiment.

16.5.7 Friedman test across classifiers

We have two datasets and nine classifiers. For a proper cross-dataset Friedman test, two datasets is far too few. Instead, we follow Lessmann et al. (2015)’s practice when the dataset count is small: treat each of the ten 5-by-2 out-of-fold AUCs as an “observation”, pool across the two datasets for a total of 20 ranked AUC vectors of length nine, and run Friedman on that matrix. This gives enough power to separate the top cluster from the bottom. The caveat is that folds within a dataset are not independent, so the test is anti-conservative: the p-values below overstate the evidence and should be read as indicative, not confirmatory.

from scipy.stats import friedmanchisquare

def stack_folds(df_list, label_col='classifier', fold_col='auc_folds'):
    # Build an (n_folds x n_classifiers) AUC matrix pooling both datasets.
    classifiers = list(df_list[0][label_col])
    mats = []
    for df in df_list:
        m = np.stack([np.asarray(row) for row in df[fold_col].tolist()], axis=1)
        # shape: (folds, classifiers)
        mats.append(m)
    return np.concatenate(mats, axis=0), classifiers

A, classifiers = stack_folds([res_german, res_taiwan])
print(f'AUC matrix shape: {A.shape} (folds, classifiers)')

stat, pval = friedmanchisquare(*[A[:, j] for j in range(A.shape[1])])
N, K = A.shape
iman_davenport = (N - 1) * stat / (N * (K - 1) - stat)
from scipy.stats import f as f_dist
p_iman = f_dist.sf(iman_davenport, K - 1, (K - 1) * (N - 1))
print(f'Friedman chi2 = {stat:.3f}  (df={K-1})  p = {pval:.4e}')
print(f'Iman-Davenport F = {iman_davenport:.3f}  (df1={K-1}, df2={(K-1)*(N-1)})  p = {p_iman:.4e}')
AUC matrix shape: (20, 9) (folds, classifiers)
Friedman chi2 = 77.360  (df=8)  p = 1.6592e-13
Iman-Davenport F = 17.786  (df1=8, df2=152)  p = 1.4547e-18

16.5.8 Average ranks and critical difference

The CD at \(\alpha = 0.05\) for \(K = 9\) and \(N = 20\) folds is computed from the tabulated \(q_{0.05}\) for nine groups, which is approximately 3.102. Applying Eq. 16.3:

\[ \mathrm{CD}_{0.05} = q_{0.05}\sqrt{\frac{K(K+1)}{6 N}} = 3.102 \sqrt{\frac{9 \cdot 10}{6 \cdot 20}} = 3.102 \sqrt{0.75} \approx 2.686. \]

Any two classifiers whose average ranks differ by more than 2.686 are statistically distinguishable at the 5 percent family-wise level under Nemenyi.

from scipy.stats import rankdata

def average_ranks(A):
    # Rank classifiers within each fold: rank 1 = best (largest AUC).
    # method='average' gives tied scores the mean of the ranks they span.
    ranks = np.vstack([rankdata(-row, method='average') for row in A])
    return ranks.mean(axis=0), ranks

avg_ranks, rank_matrix = average_ranks(A)
rank_table = pd.DataFrame({'classifier': classifiers, 'avg_rank': avg_ranks})
rank_table = rank_table.sort_values('avg_rank').reset_index(drop=True)
rank_table.round(3)
classifier avg_rank
0 CAT 2.35
1 RF 2.70
2 XGB 4.20
3 LGB 4.55
4 LR 4.90
5 SVM 5.20
6 LDA 5.55
7 MLP 7.55
8 DT 8.00
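def critical_difference(K, N, q=None, alpha=0.05):
    # Nemenyi q_alpha values from Demsar (2006), Table 5; alpha=0.05, inf df
    q_alpha_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
                  7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}
    if q is None:
        # Only the alpha=0.05 column is tabulated here; for any other
        # level the caller must supply q explicitly.
        if alpha != 0.05:
            raise ValueError('only alpha=0.05 is tabulated; pass q explicitly')
        q = q_alpha_05[K]
    return q * np.sqrt(K * (K + 1) / (6 * N))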

CD = critical_difference(K=9, N=A.shape[0])
print(f'Critical difference (alpha=0.05, K=9, N={A.shape[0]}): CD = {CD:.3f}')
Critical difference (alpha=0.05, K=9, N=20): CD = 2.686
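As a consistency check, the Friedman statistic can be recovered from the average ranks alone, and the Nemenyi rule can be applied pair by pair. The sketch below hard-codes the average ranks from the rank table and uses the textbook closed form without a tie correction, which is an assumption that holds here only because the statistic it returns matches the reported 77.360.

```python
import math
from itertools import combinations

# Average ranks copied from the rank table (N = 20 pooled folds, K = 9).
avg_rank = {'CAT': 2.35, 'RF': 2.70, 'XGB': 4.20, 'LGB': 4.55, 'LR': 4.90,
            'SVM': 5.20, 'LDA': 5.55, 'MLP': 7.55, 'DT': 8.00}
N, K = 20, len(avg_rank)

# Friedman chi-square from average ranks (closed form, no tie correction).
chi2 = 12 * N / (K * (K + 1)) * (
    sum(r ** 2 for r in avg_rank.values()) - K * (K + 1) ** 2 / 4)

# Iman-Davenport F correction of the Friedman statistic.
f_id = (N - 1) * chi2 / (N * (K - 1) - chi2)

# Nemenyi critical difference with q_0.05 = 3.102 for K = 9.
cd = 3.102 * math.sqrt(K * (K + 1) / (6 * N))

# Pairs whose average-rank gap exceeds the critical difference.
significant = [(a, b) for a, b in combinations(avg_rank, 2)
               if abs(avg_rank[a] - avg_rank[b]) > cd]

print(f'chi2 = {chi2:.2f}, F = {f_id:.2f}, CD = {cd:.3f}')
for a, b in significant:
    print(f'{a} vs {b}: gap {abs(avg_rank[a] - avg_rank[b]):.2f}')
```

Read this way, CatBoost is separated from SVM and LDA but not from Logistic Regression, whose gap of 2.55 ranks falls just inside the CD.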

16.5.9 Critical-difference diagram

A Nemenyi CD diagram plots classifiers along a horizontal rank axis and draws thick horizontal bars that connect groups of classifiers whose pairwise rank differences are all below the CD.

import matplotlib.pyplot as plt

def cd_diagram(avg_ranks, names, cd, title='', ax=None):
    order = np.argsort(avg_ranks)
    ranks_sorted = np.array(avg_ranks)[order]
    names_sorted = np.array(names)[order]
    K = len(names)
    if ax is None:
        fig, ax = plt.subplots(figsize=(9, 0.4 * K + 1.8))
    else:
        fig = ax.figure
    lo = int(np.floor(min(ranks_sorted)))
    hi = int(np.ceil(max(ranks_sorted)))
    ax.set_xlim(lo - 0.3, hi + 0.3)
    ax.set_ylim(-0.5, K + 0.5)
    ax.hlines(0, lo, hi, color='black')
    for r in range(lo, hi + 1):
        ax.vlines(r, -0.15, 0.15, color='black')
        ax.text(r, 0.35, str(r), ha='center', va='bottom', fontsize=10)
    # Left half: best (lowest ranks)
    half = (K + 1) // 2
    for i, (rk, nm) in enumerate(zip(ranks_sorted[:half], names_sorted[:half])):
        y = K - i
        ax.plot([rk, rk], [0, y - 0.2], color='black', linewidth=0.8)
        ax.plot([lo - 0.25, rk], [y - 0.2, y - 0.2], color='black', linewidth=0.8)
        ax.text(lo - 0.3, y - 0.2, f'{nm}', ha='right', va='center', fontsize=10)
        ax.text(lo - 0.3, y + 0.05, f'{rk:.2f}', ha='right', va='bottom',
                fontsize=8, color='gray')
    for i, (rk, nm) in enumerate(zip(ranks_sorted[half:], names_sorted[half:])):
        y = half - i
        ax.plot([rk, rk], [0, y - 0.2], color='black', linewidth=0.8)
        ax.plot([rk, hi + 0.25], [y - 0.2, y - 0.2], color='black', linewidth=0.8)
        ax.text(hi + 0.3, y - 0.2, f'{nm}', ha='left', va='center', fontsize=10)
        ax.text(hi + 0.3, y + 0.05, f'{rk:.2f}', ha='left', va='bottom',
                fontsize=8, color='gray')
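    # Cliques: maximal runs of consecutive classifiers whose rank range is
    # within the CD. Runs may overlap, so scan every start index and keep a
    # run only if it extends past the end of the previous one.
    cliques = []
    last_end = -1
    for i in range(K):
        j = i
        while j + 1 < K and ranks_sorted[j + 1] - ranks_sorted[i] <= cd:
            j += 1
        if j > i and j > last_end:
            cliques.append((i, j))
            last_end = j
    # draw cliques as thick bars below the rank axis
    base_y = -0.35
    step = 0.18
    for k, (a, b) in enumerate(cliques):
        y = base_y - k * step
        ax.hlines(y, ranks_sorted[a] - 0.05, ranks_sorted[b] + 0.05,
                  linewidth=4.0, color='black')
    # Extend the lower limit so every clique bar stays inside the axes.
    ax.set_ylim(base_y - len(cliques) * step, K + 0.5)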
    # CD bar
    ax.annotate('', xy=(lo, K + 0.2), xytext=(lo + cd, K + 0.2),
                arrowprops=dict(arrowstyle='-', linewidth=1.2))
    ax.text(lo + cd / 2, K + 0.35, f'CD = {cd:.2f}', ha='center', va='bottom')
    ax.set_axis_off()
    ax.set_title(title)
    return fig

fig = cd_diagram(avg_ranks, classifiers, CD,
                  title='Critical-difference diagram (Nemenyi, alpha=0.05)')
plt.tight_layout()
plt.show()
Figure 16.1: Nemenyi critical-difference diagram for the nine classifiers on pooled 5x2 folds of German and Taiwan.

As shown in Figure 16.1, the diagram reproduces the Lessmann ordering in miniature. CatBoost, Random Forest, XGBoost, LightGBM, and Logistic Regression form the top cluster: the gradient-boosting family and random forest lead, but CatBoost’s lead of 2.55 ranks over Logistic Regression falls just inside the CD of 2.686, so it is not statistically distinguishable at this sample size. LDA, MLP, and Decision Tree trail, each separated from CatBoost by more than the CD. On this specific benchmark, Logistic Regression holds up remarkably well, which is the first lesson of the chapter: the tuning-free linear baseline is competitive on tabular credit data.
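The groups a CD diagram is meant to connect can also be enumerated directly. The following is a minimal sketch, assuming the definition from Section 16.5.9 (a group is a maximal run of classifiers, in rank order, whose rank range does not exceed the CD) and using the average ranks reported in the rank table. The helper name `maximal_cliques` is ours, not part of any library.

```python
def maximal_cliques(sorted_ranks, cd):
    """Return (start, end) index pairs of maximal runs with rank range <= cd."""
    cliques, last_end = [], -1
    n = len(sorted_ranks)
    for i in range(n):
        j = i
        # Extend the run while the gap to the run's first member stays within cd.
        while j + 1 < n and sorted_ranks[j + 1] - sorted_ranks[i] <= cd:
            j += 1
        # Keep the run only if it is not contained in the previous one.
        if j > i and j > last_end:
            cliques.append((i, j))
            last_end = j
    return cliques

names = ['CAT', 'RF', 'XGB', 'LGB', 'LR', 'SVM', 'LDA', 'MLP', 'DT']
ranks = [2.35, 2.70, 4.20, 4.55, 4.90, 5.20, 5.55, 7.55, 8.00]

groups = [names[a:b + 1] for a, b in maximal_cliques(ranks, cd=2.686)]
for g in groups:
    print(' - '.join(g))
```

The groups overlap, which is the usual situation under Nemenyi: a classifier in the middle of the ranking can be indistinguishable from both the leaders and the laggards even though those two ends differ from each other.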

16.5.10 Per-classifier interpretation

  • LR: within 0.003 AUC of the best on German, with the third-best Brier there; on Taiwan it trails the best ensemble by about 0.044 AUC. No tuning, no preprocessing beyond scaling.
  • LDA: within a whisker of LR on Brier; fractionally ahead of LR on German AUC and behind on Taiwan. Sensitive to non-Gaussian features; one-hot binaries violate LDA’s assumption but the method is robust in practice.
  • DT: the single tree is clearly the weakest on German and sits in the bottom group on Taiwan, consistent with the classical variance problem.
  • RF: strong, with the best AUC on Taiwan and the second-best Brier on both datasets.
  • XGB / LGB / CAT: the three gradient-boosting libraries are statistically indistinguishable on these datasets. CatBoost has the best average rank on untuned default hyper-parameters; a plausible reason is that its ordered-boosting scheme reduces the target leakage (prediction shift) of standard gradient boosting, which matters most on small samples.
  • SVM: competitive on German, where it has the best Brier; weaker and slow on Taiwan. Needs careful \(C\) and \(\gamma\) tuning.
  • MLP: underperforms at this scale. Deep-learning models for tabular data require either much more data or careful architectural choices (Gorishniy et al., 2021).

16.5.11 Metric divergence

AUC and Brier do not always agree. Brier rewards calibrated probabilities; AUC rewards ranking. A classifier that produces miscalibrated but correctly ordered scores can win on AUC and lose on Brier. Our table shows this on German: LDA has the highest AUC (0.7698) but only the fourth-best Brier (0.1770), while SVM ranks fifth on AUC (0.7653) and has the best Brier (0.1712). LightGBM is the starkest case: it is within 0.01 AUC of the leader, yet its Brier of 0.1980 is the worst among the ensembles, which is consistent with overconfident probabilities from 200 untuned boosting rounds on 500 training rows. For regulatory deployment where probabilities are communicated (IFRS 9 expected credit loss, Basel IRB PD), Brier and log-loss matter more than AUC.
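The mechanism is easy to reproduce on synthetic data. The sketch below, which uses only the standard library and made-up probabilities, applies a strictly increasing distortion to a set of well-calibrated scores: the ordering, and therefore the AUC, is unchanged, while the Brier score deteriorates.

```python
import random

def auc(y, p):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(p, y) if t == 1]
    neg = [s for s, t in zip(p, y) if t == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((s - t) ** 2 for s, t in zip(p, y)) / len(y)

rng = random.Random(0)
# Synthetic, calibrated probabilities: each outcome is drawn as Bernoulli(p).
p_true = [rng.uniform(0.02, 0.6) for _ in range(2000)]
y = [1 if rng.random() < q else 0 for q in p_true]

# Strictly increasing distortion: same ranking, inflated probabilities.
p_inflated = [q ** 0.3 for q in p_true]

print(f'AUC   calibrated {auc(y, p_true):.4f}  inflated {auc(y, p_inflated):.4f}')
print(f'Brier calibrated {brier(y, p_true):.4f}  inflated {brier(y, p_inflated):.4f}')
```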

16.5.12 Assumption check

Two methodological footnotes. First, 5-by-2 CV is recommended over 10-fold CV by Dietterich (1998) because 10-fold produces overlapping training sets across folds, which inflates the paired \(t\)-test Type I error. The 5-by-2 design fixes that at the cost of a slight loss of power. Second, pooling folds across datasets to feed the Friedman test is not strictly valid under the Demšar framework, which assumes one independent observation per dataset. Here the 20 pooled folds come from only two datasets, so the CD of 2.686 understates the real uncertainty, and even at face value it is wider than the gaps between the mid-rank classifiers. A proper Lessmann-style test needs eight or more datasets: for an honest rank test a practitioner would run the same nine classifiers on at least eight datasets (German, Australian, Japanese, Taiwan, Give Me Some Credit, Home Credit, LendingClub, and one proprietary set) before drawing the CD diagram.
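For a head-to-head comparison of two classifiers, the 5-by-2 design comes with its own test statistic. The following is a minimal sketch of Dietterich's 5x2cv paired \(t\) statistic. Dietterich defined it on error-rate differences; applying it to AUC differences, as here, is our adaptation, and the numbers are hypothetical.

```python
import math

def five_by_two_t(diffs):
    """Dietterich's 5x2cv paired t statistic (5 degrees of freedom).

    diffs: five (fold-1, fold-2) pairs of score differences between
    classifier A and classifier B, one pair per repetition.
    """
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        # Per-repetition variance estimate from the two fold differences.
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    # Numerator: the difference from the first fold of the first repetition.
    return diffs[0][0] / math.sqrt(sum(variances) / 5)

# Hypothetical AUC differences (A minus B), for illustration only.
diffs = [(0.02, 0.01), (0.03, 0.00), (0.01, 0.02), (0.02, 0.02), (0.00, 0.01)]
t_stat = five_by_two_t(diffs)

T_CRIT = 2.571  # two-sided 5% critical value of Student's t with 5 df
reject = abs(t_stat) > T_CRIT
print(f't = {t_stat:.3f}, reject at 5%: {reject}')
```

Although A wins or ties on every fold in this example, the statistic stays below the critical value, which illustrates the loss of power mentioned above.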

16.6 Practical algorithm-selection guide

Given the body of benchmarking evidence, the decision tree for choosing a credit-scoring classifier is tighter than most practitioners assume. The selection is driven by four factors: sample size, regulatory acceptance requirement, the need for monotonicity or coefficient interpretability, and the cost of operational complexity.

16.6.1 Flowchart

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

def flowchart():
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.set_xlim(0, 10); ax.set_ylim(0, 12); ax.set_axis_off()

    def box(x, y, w, h, text, color='#DCE6F1'):
        rect = mpatches.FancyBboxPatch((x, y), w, h,
                                        boxstyle='round,pad=0.05',
                                        linewidth=1.2,
                                        edgecolor='black', facecolor=color)
        ax.add_patch(rect)
        ax.text(x + w / 2, y + h / 2, text, ha='center', va='center',
                fontsize=9, wrap=True)

    def arrow(x1, y1, x2, y2, label=''):
        ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
                    arrowprops=dict(arrowstyle='->', linewidth=1.2))
        if label:
            ax.text((x1 + x2) / 2 + 0.15, (y1 + y2) / 2, label,
                    fontsize=8, color='black')

    box(3.5, 10.6, 3, 0.9, 'New credit-scoring problem', '#F4E1D2')
    box(2.5, 8.8, 5, 1.0, 'Regulatory model (IRB, mortgage origination)?', '#E6E6FA')
    arrow(5, 10.6, 5, 9.8)

    box(0.3, 7.0, 3.6, 1.0, 'Yes -> prefer LR scorecard\n(WoE + Ridge)', '#D5E8D4')
    box(6.1, 7.0, 3.6, 1.0, 'No -> continue', '#DCE6F1')
    arrow(3.8, 8.8, 2.1, 8.0, 'Yes')
    arrow(6.2, 8.8, 7.9, 8.0, 'No')

    box(5.0, 5.2, 4.5, 1.0, 'N < 5,000 rows?', '#E6E6FA')
    arrow(7.9, 7.0, 7.25, 6.2)

    box(0.3, 3.4, 4.4, 1.0, 'Yes -> LR / regularized LR /\nsmall RF', '#D5E8D4')
    box(5.2, 3.4, 4.5, 1.0, 'No -> continue', '#DCE6F1')
    arrow(5.5, 5.2, 2.5, 4.4, 'Yes')
    arrow(8.5, 5.2, 7.45, 4.4, 'No')

    box(5.0, 1.6, 4.5, 1.0, 'Gradient boosting (XGB/LGB/CAT)\nor stacked ensemble', '#FCE4D6')
    arrow(7.45, 3.4, 7.25, 2.6)

    ax.text(0.1, 0.5, 'At every node: validate AUC + KS + Brier + H + fairness + calibration.',
            fontsize=8, color='gray')
    return fig

flowchart()
plt.tight_layout()
plt.show()
Figure 16.2: Algorithm selection flowchart for credit scoring. Read top to bottom.

Figure 16.2 summarizes the decision path.

16.6.2 When logistic regression still wins

Three cases. First, regulatory acceptance. SR 11-7 (Board of Governors of the Federal Reserve System, 2011) requires documented, auditable, reproducible models with a clear map from inputs to outputs. Basel IRB (Basel Committee on Banking Supervision, 2006) requires stability of probability-of-default estimates over time and interpretable covariates for the portfolio-level risk calculations. Mortgage origination under ECOA requires adverse-action explainability, which is trivial for a linear scorecard and complex for an ensemble (see Chapter 21 on explainability and Chapter 22 on SHAP in practice). For all three, regularized logistic regression with weight-of-evidence features is the path of least resistance.

Second, small samples. Under 5,000 rows a tree ensemble’s variance advantage dissolves because the ensemble cannot average over enough low-correlation trees to reduce variance below the linear model’s floor. Breiman (2001) showed that Random Forest requires both bootstrap variance and feature-subsetting variance, and with 500 rows per fold there is not enough bootstrap entropy to exploit. Lessmann et al. (2015)’s smallest dataset (Australian, 690 rows) in fact showed logistic regression beating random forest on AUC.

Third, strong prior on linearity and monotonicity. Portfolio managers and underwriters often have domain knowledge that a feature should enter the score linearly and monotonically: e.g., debt-to-income should push risk up, not down. Tree ensembles learn non-monotone functions by default, and constraining them to monotone splits (XGBoost and LightGBM both support monotone constraints) reduces their AUC advantage. If the prior is strong, a scorecard with WoE and monotone coefficients captures the same signal with a third of the feature engineering.

16.6.3 When gradient boosting wins

Large samples (10,000+ rows), rich feature sets (50+ features including behavioral history), and a low cost of operational complexity. The top-ranked Kaggle Home Credit solutions were LightGBM-heavy stacks, and Give Me Some Credit, a 2011 competition that predates LightGBM, was led by earlier gradient-boosting and random-forest ensembles; the 2 to 3 AUC point gap over logistic regression is big enough to justify the engineering overhead. On behavior-based scoring, where the payment-status and utilization features have strong non-linear interactions, gradient boosting’s advantage is at its largest.

16.6.4 When ensembles beat gradient boosting

Rarely, and by small margins. Heterogeneous ensembles (stacking, hill-climbing selection) buy another 0.5 to 1 AUC point over the best single gradient-boosting model in Lessmann’s original study. The extra complexity is, in most regulated settings, not worth it, unless the organization has a mature model-risk-management function that can support ensemble validation.

16.6.5 Monotonicity, calibration, and deployment

Whatever model family is chosen, three post-modeling steps are non-negotiable: isotonic or Platt calibration of the score to match realized default rates (Chapter 4), monotonicity checks on all features that regulators care about, and stability testing of the coefficient or feature-importance structure over time (Chapter 38 on MLOps). The benchmarking ranking does not dictate the deployment pipeline.

16.6.6 A note on hyper-parameter budgets

Every benchmark is conditional on a tuning budget. Lessmann et al. (2015) used a fixed grid of 5 to 10 values per hyper-parameter, optimized by nested 5-fold CV on AUC. Xia et al. (2017) report that Bayesian hyper-parameter optimization on XGBoost closes a further 0.5 AUC points over grid search on credit data. Gunnarsson et al. (2021) report that deeper MLPs with careful regularization tighten the gap with tree ensembles to about 1 AUC point on Home Credit, but still do not surpass them. The bottom line for practitioners: budget the same tuning effort to all candidates, or the ranking is moot.

16.7 Deep learning on tabular credit data

A recurring question in 2020 to 2024 conference papers is whether deep-learning architectures designed for tabular data, including TabNet (Arik & Pfister, 2021), FT-Transformer (Gorishniy et al., 2021), and NODE, have closed the gap with gradient boosting. The authoritative empirical answer is Grinsztajn et al. (2022) at NeurIPS 2022.

16.7.1 The Grinsztajn et al. 2022 finding

Grinsztajn et al. (2022) ran a benchmark on 45 tabular datasets, comparing XGBoost, random forest, and a suite of tabular deep-learning architectures (MLP, ResNet, FT-Transformer, SAINT). They controlled for hyper-parameter budget by giving each model roughly 400 iterations of random search per dataset. The finding: gradient-boosted trees (XGBoost in their setup) dominate across metrics and data sizes, with the gap closing only on datasets with more than 50,000 rows and nearly-continuous feature sets. The AUC or normalized RMSE gap they report is about 2 to 5 percentage points on medium datasets, shrinking to 1 point on the largest.

Their diagnostic analysis identifies three structural reasons tree ensembles still win on tabular data:

  1. Non-rotation-invariance. Tabular features have meaningful units and identities (age in years, income in dollars, ratio of debt to income). Neural networks pretend features are exchangeable and apply rotation-invariant linear projections in the first layer, which destroys the feature identity. Tree ensembles split one feature at a time and preserve feature semantics.

  2. Robustness to uninformative features. In real tabular data, a large fraction of features are weakly informative or correlated. Tree ensembles drop them via the split criterion. Neural networks propagate gradients through them and often overfit to noise.

  3. Smoothness bias. Neural networks are biased toward smooth, low-frequency functions (a well-studied spectral-bias phenomenon). Tabular targets often have jumps or piecewise structure at meaningful thresholds (e.g. credit score bands, age cliffs). Trees capture the jumps directly; deep nets smooth them.

16.7.2 What this means for credit

Credit data is exactly the regime where Grinsztajn et al. (2022)’s three structural points apply. Features have meaning; features are often uninformative (hundreds of bureau aggregates, few of which are relevant to a particular borrower segment); targets have thresholds (FICO 660, DTI 0.43, LTV 0.80). So the empirical regularity is not surprising: gradient-boosted trees dominate deep learning on public credit benchmarks.

Two caveats qualify this regularity. First, transformer-style architectures trained on very large financial transaction sequences, the LLM-adjacent setup covered in Chapter 30, can outperform gradient boosting on the specific task of learning from sequence data (Kraus & Feuerriegel, 2017; Sezer et al., 2020). This is sequence learning, not tabular learning. Second, Shwartz-Ziv & Armon (2022) note that the gradient-boosted-tree advantage shrinks as the dataset grows toward hundreds of millions of rows, at which point neural architectures with enough capacity and training data start to compete.

For the practitioner’s decision today on a typical credit dataset, the answer is unambiguous: start with LightGBM or XGBoost, tune it, benchmark against logistic regression with WoE, and revisit deep-learning alternatives only if there is a specific reason (sequence data, multi-modal features, or a dataset larger than 10 million rows).

16.7.3 A side-by-side MLP on Taiwan

For concreteness, we re-fit the MLP from the mini-benchmark with more capacity and more training, to illustrate the gap.

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

def eval_taiwan_mlp():
    aucs = []
    for rep in range(3):
        skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=10 + rep)
        for tr, te in skf.split(Xt, yt):
            mlp = Pipeline([
                ('s', StandardScaler()),
                ('m', MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                                     max_iter=400, alpha=1e-4,
                                     learning_rate_init=1e-3,
                                     early_stopping=True,
                                     random_state=rep)),
            ])
            mlp.fit(Xt[tr], yt[tr])
            aucs.append(roc_auc_score(yt[te], mlp.predict_proba(Xt[te])[:, 1]))
    return float(np.mean(aucs))

mlp_auc = eval_taiwan_mlp()
cat_auc_taiwan = float(res_taiwan.loc[res_taiwan['classifier'] == 'CAT', 'AUC_mean'].iloc[0])
print(f'MLP (128-64-32, early stopping) AUC on Taiwan: {mlp_auc:.4f}')
print(f'CatBoost AUC on Taiwan (from benchmark):       {cat_auc_taiwan:.4f}')
MLP (128-64-32, early stopping) AUC on Taiwan: 0.7398
CatBoost AUC on Taiwan (from benchmark):       0.7644

Even with three hidden layers and more than four times the hidden units of the benchmark MLP, the deeper model lands about 2.5 AUC points behind CatBoost (0.7398 versus 0.7644; note that the MLP figure comes from 3-by-2 rather than 5-by-2 folds). The gap would narrow with further tuning, feature engineering, and more data, but under a fixed laptop-scale tuning budget the gradient boosting lead persists.

16.8 Metrics to report and how to aggregate

Every benchmark table in a regulatory submission should report, at minimum:

  • AUC: ranking quality. Invariant to class prevalence and to monotone score transformations, so it says nothing about calibration, but it is the most universal metric.
  • KS: maximum vertical distance between cumulative distributions of good and bad scores. Conservative for the operational range.
  • Partial AUC: AUC restricted to the operational FPR range (often [0, 0.2] for credit, because higher FPR is not operationally acceptable). See McClish (1989).
  • Brier: strictly proper scoring rule, rewards calibration.
  • H-measure: coherent alternative to AUC, integrates over a severity-weighting distribution (Hand, 2009).
  • EMP / profit: monetary metric, when LGD and exposure are known (Verbraken et al., 2013).
  • Calibration slope and intercept: under-calibration vs over-calibration diagnostic.
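Of the metrics listed, partial AUC is the one least often available off the shelf in the raw form described here. The sketch below is a standard-library illustration, not a reference implementation: it builds the ROC curve with tied scores grouped, then integrates it by trapezoids up to an FPR cap. It returns the unnormalized area; scikit-learn's `roc_auc_score(..., max_fpr=...)` instead reports a McClish-standardized value, so the two numbers are not directly comparable. The labels and scores are made up.

```python
def roc_points(y, scores):
    """ROC curve as (fpr, tpr) points; higher score = more likely bad (y = 1)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    order = sorted(zip(scores, y), key=lambda t: -t[0])
    pts, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(order):
        s = order[i][0]
        # Consume all observations tied at this score in one step.
        while i < len(order) and order[i][0] == s:
            tp += order[i][1]
            fp += 1 - order[i][1]
            i += 1
        pts.append((fp / n_neg, tp / n_pos))
    return pts

def partial_auc(y, scores, max_fpr=0.2):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    pts = roc_points(y, scores)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the segment at the FPR cap.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy data: ten applicants sorted by descending score, 1 = default.
y_toy = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

pauc = partial_auc(y_toy, s_toy, max_fpr=0.2)
full = partial_auc(y_toy, s_toy, max_fpr=1.0)
print(f'partial AUC on [0, 0.2] = {pauc:.4f}, full AUC = {full:.4f}')
```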

Across datasets, the aggregation choice matters. Arithmetic mean of AUC is influenced by easier datasets. Demšar (2006)’s rank-based aggregation is the correct one. In Bayesian frameworks (Benavoli et al., 2017), the aggregation is implicit in the posterior. For operational decisions at a single bank, the right aggregation is usually expected profit at the bank’s operating point, aggregated over the bank’s own portfolio distribution, not over external datasets.

16.8.1 A note on EMP

Expected maximum profit (Verbraken et al., 2014) integrates profit over a distribution of possible class-specific costs. For credit, the class-specific costs are the Loss Given Default (LGD) on a granted loan that defaults and the foregone return on a good loan that is rejected. If a bank has point estimates of these quantities, the EMP collapses to the bank’s actual expected profit at the operating decision threshold. If it has a distribution (Bayesian or regulatory downturn LGD, Calabrese (2014)), EMP is the correct integral. Either way, EMP is the metric that matters most for a portfolio manager, and the one that lines up most closely with the bank’s income statement. See Chapter 40 on IFRS 9 and CECL for the accounting-side requirements that constrain the cost distribution.
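The point-estimate case can be sketched in a few lines. This is a simplified illustration, not the EMP integral of Verbraken et al.: it assumes unit exposure, measures profit relative to accepting every applicant, and uses hypothetical values for the loss fraction (`loss_frac`) and the return on a good loan (`roi`).

```python
def best_cutoff(y, pd_hat, loss_frac, roi):
    """Profit-maximizing rejection cutoff under point cost estimates.

    Applicants with pd_hat >= cutoff are rejected. Relative to accepting
    everyone, rejecting a defaulter saves loss_frac and rejecting a good
    loan forgoes roi. Returns (cutoff, profit per applicant).
    """
    n = len(y)
    best = (float('inf'), 0.0)  # baseline: reject nobody
    for t in sorted(set(pd_hat)):
        rejected = [label for label, p in zip(y, pd_hat) if p >= t]
        bads = sum(rejected)
        goods = len(rejected) - bads
        profit = (loss_frac * bads - roi * goods) / n
        if profit > best[1]:
            best = (t, profit)
    return best

# Toy portfolio: 1 = default; PD estimates are made up.
y_toy = [1, 1, 0, 1, 0, 0]
pd_toy = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]

cutoff, profit = best_cutoff(y_toy, pd_toy, loss_frac=0.6, roi=0.25)
print(f'reject if PD >= {cutoff}, profit per applicant = {profit:.4f}')
```

With a distribution over `loss_frac` instead of a point value, the same maximization would be averaged over draws from that distribution, which is the step that turns this quantity into an expected maximum profit.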

16.8.2 Calibration is a first-class metric

A classifier that wins on AUC but is poorly calibrated will make wrong lending decisions at any given threshold. AUC is invariant to monotone transformations; real decisions are not. The reporting template for a credit model should include a calibration plot (reliability diagram), the Hosmer-Lemeshow test, the calibration slope and intercept from a logistic regression of \(y\) on \(\mathrm{logit}(\hat p)\), and the expected calibration error. Chapter 4 covers the calibration machinery; here the lesson is that benchmarking on AUC alone is insufficient.
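As one concrete item from that template, the expected calibration error can be computed with equal-width probability bins. The sketch below is a minimal standard-library version; the choice of ten equal-width bins is an assumption, and Chapter 4 may use a different binning.

```python
def expected_calibration_error(y, p, n_bins=10):
    """Weighted mean gap between predicted and observed default rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y, p):
        # Equal-width bins on [0, 1]; prob == 1.0 goes into the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((label, prob))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        observed = sum(label for label, _ in b) / len(b)
        predicted = sum(prob for _, prob in b) / len(b)
        ece += len(b) / len(y) * abs(observed - predicted)
    return ece

# Toy example: predicted 10% and 90%, realized 25% and 75%.
p_toy = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
y_toy = [0, 0, 0, 1, 1, 1, 1, 0]

ece = expected_calibration_error(y_toy, p_toy)
print(f'ECE = {ece:.3f}')
```

The toy scores rank the two groups perfectly sensibly, yet every prediction is 15 points too extreme, which is exactly the kind of error AUC cannot see.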

16.9 Score comparability across models and time

A benchmark table that ranks models by AUC silently assumes the scores live on a common axis. They do not. Two scorecards with identical AUC can score the same applicant differently, send different bad-rate signals at the same numeric cutoff, and disagree about who sits in the top decile. The same scorecard run on two vintages can shift its score distribution without any change in the underlying default risk. Both failures break the cross-model and cross-time comparisons that operating cutoffs, regulatory monitoring, and credit-econometrics analyses depend on. The Demšar (2006) machinery in this chapter survives the failures (rank tests are invariant to monotone score transformations) but everything downstream of the benchmark does not.

16.9.1 The two failure modes

Cross-model incomparability has three sources: different functional forms map the same risk to different ranges; different calibration procedures (Platt, isotonic, none) place different cumulative mass at any point; different training samples shift the score-to-odds anchor. Two models with the same AUC and the same Brier score can still produce different score distributions, because AUC is invariant to any monotone rescaling and Brier is a single average that many different score distributions can attain.

Cross-time incomparability has two causes that can occur together or apart: population drift moves the score distribution without moving the default rate at any score, and calibration drift moves the default rate at any score without necessarily moving the score distribution. PSI flags the first (Section 4.7.2); reliability diagrams flag the second (Chapter 4). Neither metric, on its own, tells a downstream consumer whether the score is still comparable to last quarter’s score.
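For the population-drift side, a minimal PSI sketch is shown below. It uses the common definition, the sum over bins of (actual share minus expected share) times the log of their ratio, with fixed bin edges and a small floor for empty bins. Both choices are assumptions here, and Section 4.7.2 remains the reference for the binning used in this book. The scores are synthetic.

```python
import math

def psi(expected, actual, edges):
    """Population stability index of `actual` against `expected` scores."""
    def shares(scores):
        counts = [0] * (len(edges) + 1)
        for s in scores:
            # Bin index = number of edges strictly below the score.
            counts[sum(s > e for e in edges)] += 1
        # Small floor avoids log(0) when a bin is empty.
        return [max(c / len(scores), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

# Synthetic scores: a baseline vintage and a vintage shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [min(s + 0.15, 0.99) for s in baseline]
edges = [0.2, 0.4, 0.6, 0.8]

psi_same = psi(baseline, baseline, edges)
psi_shift = psi(baseline, shifted, edges)
print(f'PSI unchanged = {psi_same:.3f}, PSI shifted = {psi_shift:.3f}')
```

A non-zero PSI says only that the score distribution moved. It does not say whether the default rate at a given score moved, which is why the reliability diagram is needed alongside it.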

16.9.2 Score as the dependent variable in econometric work

Academic and policy work often uses a credit score as the outcome variable in a difference-in-differences, regression-discontinuity, or event-study design. The hidden assumption is that the scoring engine is fixed across the panel and across treatment and control. The assumption fails three ways: scoring vendors version their models periodically (FICO 8 to 9 to 10, VantageScore 3 to 4); bureaus update underlying data feeds, which silently re-scores every borrower; cross-borrower comparability requires that all borrowers were scored by the same engine, which fails when a treated cohort migrates to a different bureau or product line. The cleanest response is to drop the score and model the default event \(y_{it}\) directly: default is invariant to the model, and the long horizon required for default to mature (Chapter 9 and Chapter 36) is a smaller cost than the spurious treatment effect produced by mid-window re-scoring. When the score itself is the object of policy interest (a regulator wants to know whether intervention \(X\) moved bureau scores), pin the analysis to a single frozen scoring engine applied to the full panel of inputs, accepting that the analysis-side scores will diverge from the bureau-reported scores after the freeze date.

16.9.3 Four operations that recover comparability

Calibrate to PD. A score \(s\) from any model can be mapped to a probability of default \(\hat\pi(s)\) on a recent labeled window via Platt, isotonic, or beta calibration (Chapter 4). Once both models are mapped to PD, the two streams are comparable in the sense that they target the same conditional probability \(P(Y=1\mid X)\). The map drifts; refit on a rolling window.

Points-to-double-odds (PDO) anchoring. The FICO scaling derived in Section 7.2, \(\text{score} = a + b \log(\text{odds})\) with \(b = \text{PDO}/\log 2\), lets two models be compared on a shared anchor pair \((s_0, \text{odds}_0)\). The map is one-to-one with PD in different units and shares its drift behavior. PDO is the right representation when downstream consumers (underwriters, regulators) read scores as numbers rather than probabilities.
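The scaling can be sketched as follows. The anchor values (a score of 600 at odds of 50 to 1, with 20 points to double the odds) are hypothetical, and odds are taken as good-to-bad, which is an assumption about the convention used in Section 7.2.

```python
import math

def pdo_score(pd_hat, anchor_score=600.0, anchor_odds=50.0, pdo=20.0):
    """Map a PD to a scaled score: score = a + b * ln(odds), odds = good:bad."""
    b = pdo / math.log(2)                       # points per unit of log-odds
    a = anchor_score - b * math.log(anchor_odds)  # intercept fixed by the anchor
    odds = (1 - pd_hat) / pd_hat
    return a + b * math.log(odds)

# PDs corresponding to good:bad odds of 50:1 and 100:1.
score_at_anchor = pdo_score(1 / 51)
score_at_double = pdo_score(1 / 101)
print(f'{score_at_anchor:.1f} at 50:1, {score_at_double:.1f} at 100:1')
```

Two models calibrated to PD and pushed through the same `(anchor_score, anchor_odds, pdo)` triple land on one scale, so a numeric cutoff means the same odds for both.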

Equipercentile equating. Borrowed from psychometric test equating (Kolen & Brennan, 2014). Score a common anchor population with both models and build a quantile-to-quantile map: for each score \(s_A\) from model A, the equated score is the model-B score with the same population CDF value, \(s_B = F_B^{-1}(F_A(s_A))\). The map preserves rank order in the anchor population and reproduces model B’s marginal distribution from inputs that arrive only with model A’s score. This is the standard tool when a bureau versions a score and clients need a translation from the old scale to the new.
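A bare-bones version of the quantile-to-quantile map is sketched below, using empirical CDFs on the anchor population with no interpolation or smoothing, both of which an operational equating would normally add. The function name `make_equater` and the anchor scores are hypothetical.

```python
from bisect import bisect_right

def make_equater(anchor_a, anchor_b):
    """Build a map from model-A scores to the model-B scale.

    anchor_a, anchor_b: scores of the same anchor population under each model.
    """
    sorted_a, sorted_b = sorted(anchor_a), sorted(anchor_b)
    n_a, n_b = len(sorted_a), len(sorted_b)

    def equate(s_a):
        rank = bisect_right(sorted_a, s_a)   # number of A-scores <= s_a
        idx = -(-rank * n_b // n_a) - 1      # ceil(rank * n_b / n_a) - 1
        return sorted_b[min(max(idx, 0), n_b - 1)]

    return equate

# Toy anchor population scored by both models.
equate = make_equater(anchor_a=[10, 20, 30, 40, 50],
                      anchor_b=[500, 550, 600, 650, 700])
print(equate(30), equate(5), equate(100))
```

Scores outside the anchor range are clamped to the lowest or highest model-B score, a simple choice that a production map would likely replace with an explicit extrapolation rule.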

Within-cell rank/percentile transform. Convert each score to its empirical percentile in the cell defined by (model version, vintage, segment). The percentile is invariant to monotone transformations of the score and to monotone calibration drift. The cost: it discards cardinal information. A percentile of 0.95 in a 2 percent default population is not the same risk as a percentile of 0.95 in a 6 percent default population. Use percentile when downstream use is relative ranking within a cell; do not use it when downstream use is absolute risk (provisioning, capital, IFRS 9 ECL).
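A short pandas sketch of the transform and of its invariance property, assuming a hypothetical applicant table with columns `model_version`, `vintage`, `segment`, and `score`. The data are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 6000
df = pd.DataFrame({
    "model_version": rng.choice(["v1", "v2"], size=n),
    "vintage": rng.choice(["2022H1", "2022H2"], size=n),
    "segment": rng.choice(["consumer", "sme"], size=n),
    "score": rng.normal(600, 50, size=n),
})
cell = ["model_version", "vintage", "segment"]

# Empirical percentile of each score within its own cell.
df["pct_in_cell"] = df.groupby(cell)["score"].rank(pct=True)

# Invariance check: rescale only the v2 scores with a monotone transform.
shifted = df.copy()
is_v2 = shifted["model_version"] == "v2"
shifted.loc[is_v2, "score"] = np.exp(shifted.loc[is_v2, "score"] / 100.0)
shifted["pct_in_cell"] = shifted.groupby(cell)["score"].rank(pct=True)

print(np.allclose(df["pct_in_cell"], shifted["pct_in_cell"]))
```

The percentile survives the rescaling because ranks within a cell are unchanged, which is also why it carries no information about the cell's absolute default rate.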

16.9.4 Cross-time: through-the-cycle versus point-in-time

A point-in-time (PIT) PD is the conditional default probability given current macro conditions and moves with the cycle by design. A through-the-cycle (TTC) PD averages over the cycle and is meant to be cycle-stable. The Carlehed-Petrov decomposition and Vasicek mapping live in Section 40.4.2. Two consequences: a benchmark that compares classifiers across vintages should either compare TTC against TTC or de-trend PIT against a macro index; a drift alert that fires on a PIT score during a downturn may be flagging a correctly-calibrated reaction to the cycle, not a model failure.

16.9.5 A small numerical illustration

Two models are trained on the same synthetic credit-like data, scored on a shared holdout, mapped to a common PDO scale, then linked by equipercentile equating and by a within-sample percentile transform. The point of the illustration is that AUC-equivalent models on a shared anchor still disagree about who is approved at any numeric cutoff, and that the two comparability operations recover different things.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(
    n_samples=20000, n_features=12, n_informative=8, n_redundant=2,
    weights=[0.92, 0.08], flip_y=0.02, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1,
)

mA = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
mB = GradientBoostingClassifier(
    n_estimators=200, max_depth=3, random_state=0,
).fit(X_tr, y_tr)

pA = mA.predict_proba(X_te)[:, 1]
pB = mB.predict_proba(X_te)[:, 1]

# PDO-style score: higher = better borrower; anchor (600, 50:1 good:bad), PDO = 20.
PDO, anchor_score, anchor_odds = 20.0, 600.0, 50.0
b = PDO / np.log(2.0)
a = anchor_score - b * np.log(anchor_odds)
def to_pdo_score(p):
    # Clip both tails so the log-odds stay finite even if p is exactly 0 or 1.
    p = np.clip(p, 1e-9, 1.0 - 1e-9)
    return a + b * np.log((1.0 - p) / p)

sA = to_pdo_score(pA)
sB = to_pdo_score(pB)

cutoff = 600.0
summary = pd.DataFrame({
    'model': ['A (logistic)', 'B (boosting)'],
    'AUC': [roc_auc_score(y_te, pA), roc_auc_score(y_te, pB)],
    'mean_score': [sA.mean(), sB.mean()],
    'std_score':  [sA.std(),  sB.std()],
    'share_below_600':   [(sA <= cutoff).mean(), (sB <= cutoff).mean()],
    'bad_rate_below_600':[
        y_te[sA <= cutoff].mean() if (sA <= cutoff).any() else np.nan,
        y_te[sB <= cutoff].mean() if (sB <= cutoff).any() else np.nan,
    ],
}).round(4)
summary
          model     AUC  mean_score  std_score  share_below_600  bad_rate_below_600
0  A (logistic)  0.9098    604.0254    70.0857           0.4252              0.1872
1  B (boosting)  0.9220    600.2521    66.6608           0.3165              0.2483

Two models with comparable AUC produce different score distributions on the same holdout. The equipercentile linking curve maps quantiles of one scale onto the other; the percentile-vs-percentile scatter shows residual rank disagreement that the linking step cannot remove.

qs = np.linspace(0.001, 0.999, 999)
sA_q = np.quantile(sA, qs)
sB_q = np.quantile(sB, qs)
rA = pd.Series(sA).rank(pct=True).values
rB = pd.Series(sB).rank(pct=True).values

fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].hist(sA, bins=40, alpha=0.55, label='Model A')
axes[0].hist(sB, bins=40, alpha=0.55, label='Model B')
axes[0].axvline(cutoff, color='k', lw=0.8, ls=':')
axes[0].set_xlabel('PDO score'); axes[0].set_ylabel('Count')
axes[0].set_title('Same anchor, different distributions')
axes[0].legend()

axes[1].plot(sA_q, sB_q, lw=1.4)
lo, hi = min(sA_q.min(), sB_q.min()), max(sA_q.max(), sB_q.max())
axes[1].plot([lo, hi], [lo, hi], 'k--', lw=0.8)
axes[1].set_xlabel('Model A score quantile')
axes[1].set_ylabel('Model B score quantile')
axes[1].set_title('Equipercentile linking curve')

axes[2].scatter(rA, rB, s=2, alpha=0.25)
axes[2].plot([0, 1], [0, 1], 'k--', lw=0.8)
axes[2].set_xlabel('Within-sample percentile, model A')
axes[2].set_ylabel('Within-sample percentile, model B')
axes[2].set_title('Rank-vs-rank residual disagreement')
fig.tight_layout()
plt.show()

Left: two PDO-anchored score distributions on the same holdout. Center: equipercentile linking curve from model A’s scale to model B’s. Right: per-borrower percentile in model A versus model B; off-diagonal mass is rank disagreement that no monotone linking can remove.

The summary table bears out three claims made above. The two AUCs are within roughly one point of each other (0.910 versus 0.922), a difference that on the Lessmann et al. (2015) scale would be a typical logistic-versus-boosting gap and would not on its own justify a model swap. Yet the score distributions differ in mean and standard deviation by amounts that matter for any score-numeric decision: the share of applicants at or below the 600 cutoff differs by about eleven percentage points (42.5 versus 31.7 percent), so the same numeric cutoff implies different approval rates, and the bad rate below the cutoff differs by about six percentage points (18.7 versus 24.8 percent), so the same cutoff implies different operating risk. The equipercentile curve in the center panel is the translation a downstream consumer would apply to convert model A scores onto model B’s scale on this anchor population. The right panel is the residual: even after a perfectly monotone linking, individual borrowers are ranked differently by the two models, and equipercentile equating does not (and cannot) remove that disagreement.
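The residual rank disagreement can be put into a single number. The sketch below is one hedged way to do it: a Spearman correlation plus a "swap set", the share of applicants approved by exactly one of the two models when both approve the same fraction. The function name `swap_set_share`, the 70 percent approval rate, and the synthetic scores are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def swap_set_share(score_a, score_b, approval_rate):
    """Share of applicants approved by exactly one model.

    Higher score = better borrower. Each model approves its own top
    `approval_rate` fraction, so the approved counts match.
    """
    cut_a = np.quantile(score_a, 1.0 - approval_rate)
    cut_b = np.quantile(score_b, 1.0 - approval_rate)
    return float(np.mean((score_a > cut_a) != (score_b > cut_b)))

rng = np.random.default_rng(0)
quality = rng.normal(size=10000)          # hypothetical latent credit quality
s_a = 600 + 60 * (quality + 0.5 * rng.normal(size=10000))
s_b = 650 + 40 * (quality + 0.5 * rng.normal(size=10000))

rho, _ = spearmanr(s_a, s_b)
swap = swap_set_share(s_a, s_b, approval_rate=0.70)
# A monotone relinking of model A changes its scale but not its ranks.
swap_relinked = swap_set_share(np.exp(s_a / 100.0), s_b, approval_rate=0.70)

print(f"Spearman rho: {rho:.3f}")
print(f"swap set at 70% approval: {swap:.3f} (after relinking: {swap_relinked:.3f})")
```

The swap set is identical before and after the monotone relinking, which is the numerical counterpart of the off-diagonal mass in a rank-versus-rank scatter.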

16.9.6 A decision rubric

| Downstream use | Recommended representation |
|---|---|
| Underwriting cutoff, capital, ECL provisioning | PD calibrated on a recent labeled window |
| Cross-version monitoring or score translation | Equipercentile map against a fixed anchor population |
| Cross-time econometric outcome (DiD, RD on score) | Default event \(y\), or score from a frozen engine |
| Relative-rank segmentation within a cell | Within-cell percentile |
| Regulatory capital pool assignment | Master-scale PD bands (Section 7.2) |
| Marketing eligibility under a fixed bureau cutoff | Raw bureau score with PSI monitored monthly |

16.9.7 When to drop the score and use the default event

Three situations argue for switching the analytic object from the score \(\hat S\) to the default event \(Y\): (i) the scoring engine is versioned within the analysis window and equipercentile linking does not bridge a structural change in inputs; (ii) the comparison spans bureaus or jurisdictions with no shared anchor population; (iii) the question is causal and the treatment plausibly affects how the score is constructed (a policy that changes what enters the bureau file changes the inputs and therefore the score, even if the underlying default risk is unchanged). In these cases the score is a polluted outcome and the default event is the cleaner one. The cost is the maturation horizon (12 to 24 months in retail, longer in mortgage), which the survival and behavioral chapters (Chapter 9, Chapter 36) handle directly.

16.10 Scalability

Benchmarks at the laptop scale use small samples. At production scale, two questions dominate: can the model be trained on a cluster, and can inference be served at the latency the business needs.

Training scalability for the benchmark families sorts as:

  • Logistic regression: trivially parallelizable via coordinate descent (J. Friedman et al., 2010), single-pass SGD, or distributed ADMM. Scales linearly with rows. Fits in seconds on 10 million rows.
  • Random forest: embarrassingly parallel across trees. Inference is \(O(\text{depth} \times \text{n\_trees})\). Memory is the binding constraint because each bootstrap sample must be held; subsample rows and cap tree depth to contain it.
  • Gradient boosting (XGBoost / LightGBM / CatBoost): all three libraries have distributed training backends. LightGBM’s feature-parallel mode and data-parallel mode are the standard choice for 10M+ row datasets. Under histogram-based splits, the three libraries’ runtime scales near-linearly with rows and roughly linearly with the number of features, since one histogram is built per feature.
  • SVM: kernel SVM does not scale beyond roughly 100,000 rows without the Nyström or random-feature approximations. Rarely used for production credit scoring on large books.
  • MLP / deep networks: scale to arbitrary data with GPUs and mini-batching. Wall-clock competitive with LightGBM at the 10M row scale, if the architecture is right.
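The kernel approximation mentioned for SVM can be sketched with scikit-learn's `Nystroem` transformer followed by a linear hinge-loss classifier. The synthetic data, `gamma`, and the 300 landmark components are illustrative assumptions; on a real book they would be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=20000, n_features=12, n_informative=8,
                           weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1)

# Approximate an RBF-kernel SVM: map rows to 300 landmark-based features,
# then fit a linear SVM (hinge loss) by stochastic gradient descent.
approx_svm = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, tol=1e-3,
                  random_state=0),
)
approx_svm.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, approx_svm.decision_function(X_te))
print(f"holdout AUC: {auc:.3f}")
```

Training cost is then driven by the number of landmark components rather than by the squared number of rows, which is what makes the approach usable on larger samples.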

In practice, the dominant production setup is LightGBM or XGBoost on Spark/Dask for training, and a compiled inference graph (ONNX, Treelite) for low-latency serving. Chapter 38 covers the MLOps pipeline in depth.

16.10.1 Mini-scalability check

A direct scaling check on Taiwan at increasing sample sizes illustrates the \(O(n)\) training-time scaling of the gradient-boosted tree.

import time
import numpy as np
import pandas as pd
import lightgbm as lgb
from creditutils import load_taiwan_default
df_t = load_taiwan_default().drop(columns=['id'])

sizes = [2000, 5000, 10000, 20000, 30000]
times = []
for n in sizes:
    sub = df_t.sample(n=n, random_state=0).reset_index(drop=True)
    X = sub.drop(columns=['default']).astype(float).values
    y = sub['default'].values.astype(int)
    mdl = lgb.LGBMClassifier(n_estimators=200, num_leaves=31,
                              learning_rate=0.1, n_jobs=1,
                              random_state=0, verbose=-1)
    t0 = time.time(); mdl.fit(X, y); dt = time.time() - t0
    times.append(dt)
pd.DataFrame({'n': sizes, 'lgbm_fit_seconds': np.round(times, 3)})
       n  lgbm_fit_seconds
0   2000             0.168
1   5000             0.215
2  10000             0.264
3  20000             0.394
4  30000             0.551

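Wall-clock growth is roughly linear in \(n\) once a fixed overhead of about 0.14 seconds is netted out: each additional thousand rows costs about 0.013 seconds. That is consistent with the histogram-based complexity bound, although five single-run timings on at most 30,000 rows cannot confirm it. Production training at 10M rows uses distributed LightGBM; the single-node bound is around 5M rows on 32 GB RAM.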
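A quick check of the scaling claim against the timings in the table above, as a sketch: an ordinary least-squares line separates fixed overhead from per-row cost, and a log-log slope shows what a naive power-law fit would report. The timings are single runs, so the fit is indicative only.

```python
import numpy as np

# Timings copied from the table above (one run per sample size).
n = np.array([2000, 5000, 10000, 20000, 30000], dtype=float)
t = np.array([0.168, 0.215, 0.264, 0.394, 0.551])

slope, intercept = np.polyfit(n, t, 1)                  # t ~ intercept + slope * n
loglog_slope = np.polyfit(np.log(n), np.log(t), 1)[0]   # p in t ~ n^p
marginal = np.diff(t) / np.diff(n) * 1000.0             # seconds per extra 1,000 rows

print(f"fixed overhead:   {intercept:.3f} s")
print(f"per 1,000 rows:   {slope * 1000:.4f} s")
print(f"log-log exponent: {loglog_slope:.2f}")
print("marginal cost by interval:", np.round(marginal, 4))
```

The log-log exponent comes out well below one because the fixed overhead dominates at these sample sizes; the roughly constant marginal cost per thousand rows is the better evidence for linear scaling.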

16.11 Deployment

Benchmark results should map to a reproducible deployment artifact. The standard recipe: serialize the winning model (LightGBM Booster.save_model, CatBoost save_model, or ONNX export for cross-runtime compatibility), wrap it in a FastAPI inference endpoint with input-schema validation, log training and evaluation metrics to MLflow, and run it in shadow mode or as an A/B challenger before full traffic replacement. Chapter 38 covers the operational details.

For the Nemenyi CD diagram itself, a deployment-relevant version reports the ranking of candidate models against the incumbent. The diagram should be generated monthly in production, using performance on the most recent month of labeled outcomes as the “dataset” axis. Consistent rank-order stability of the incumbent over 6 to 12 months is a strong signal that no challenger warrants replacement. A consistent rank drop triggers re-training or model swap.
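A sketch of the monthly rank tracker described above, assuming twelve months of hypothetical AUCs for an incumbent and two challengers. The critical values are the two-tailed Nemenyi values at the 0.05 level tabulated by Demšar (2006). Consecutive months share borrowers and macro conditions, so they are not independent datasets and the critical difference is best read as a monitoring heuristic here.

```python
import numpy as np
from scipy.stats import rankdata

models = ["incumbent", "challenger_1", "challenger_2"]
rng = np.random.default_rng(0)
months = 12

# Hypothetical monthly AUCs: a shared month effect plus model-specific offsets.
month_effect = rng.normal(0.78, 0.01, size=(months, 1))
auc = (month_effect + np.array([0.000, 0.004, -0.010])
       + rng.normal(0, 0.003, size=(months, 3)))

# Rank within each month: rank 1 = highest AUC, ties share the average rank.
ranks = rankdata(-auc, axis=1)
avg_rank = ranks.mean(axis=0)

N, k = auc.shape
q_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}[k]   # Nemenyi, alpha = 0.05
cd = q_05 * np.sqrt(k * (k + 1) / (6.0 * N))

print(f"critical difference (k={k}, N={N}): {cd:.3f}")
for name, r in zip(models, avg_rank):
    gap = avg_rank[0] - r        # positive = better average rank than incumbent
    flag = "  <- beats incumbent by more than CD" if gap > cd else ""
    print(f"{name:13s} average rank {r:.2f}{flag}")
```

Storing `avg_rank` each month gives the rank trajectory of the incumbent; a challenger that stays ahead by more than `cd` over successive windows is the signal for re-training or a model swap.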

16.12 Regulatory considerations

The benchmarking framework interacts with four regulatory regimes.

SR 11-7 model risk management (Board of Governors of the Federal Reserve System, 2011) requires documentation of alternative models considered, the rationale for the chosen model, and ongoing performance monitoring. A benchmark table with AUC, KS, Brier, H-measure, partial AUC, and calibration statistics, evaluated under the Demšar (2006) framework, is exactly the artifact SR 11-7 expects for the model-selection decision. Regulators frequently ask banks to justify why a challenger was not adopted; a rank-based comparison with the CD diagram makes that justification explicit.

Basel IRB (Basel Committee on Banking Supervision, 2017) adds the requirement that PD estimates be stable over a full business cycle. A classifier that wins the benchmark on one vintage may lose on another; the CD analysis should be run over multiple vintages. Breeden’s (2007) vintage framework is the canonical decomposition into age (lifecycle), vintage, and calendar-time components.

EU AI Act (high-risk system classification for creditworthiness assessment, Article 6 and Annex III) requires documented performance metrics, robustness tests, and post-market monitoring. The benchmark framework supplies the baseline. The robustness tests (distribution shift, adversarial, fairness) are additional, covered in Chapter 27 and Chapter 28.

ECOA and adverse-action notices require the lender to communicate specific reasons for adverse action. The benchmarking choice should factor in explainability cost: a LightGBM model plus SHAP is acceptable; a stacked ensemble of seven base learners is difficult to audit. The regulatory penalty for inscrutability has usually outweighed the 0.5 to 1 AUC-point gain from stacking.

Vietnam and emerging markets

16.12.1 Market context

Vietnam is a useful stress test for the benchmarking machinery in this chapter. The banking system is dominated by four state-owned commercial banks and a cohort of joint-stock banks that together hold the majority of system assets (World Bank, 2022b). Credit bureau coverage runs through the Credit Information Center (CIC) and a private bureau, PCB, with CIC coverage concentrated in regulated institutions (National Credit Information Centre of Vietnam, 2023). The World Bank (2022a) report documents that about 56 percent of adults held a formal financial account as of 2021, leaving a sizeable thin-file segment that a typical UCI-style benchmark does not represent. Vintage quality shifts with macroprudential cycles: restructuring in 2014 to 2017, pandemic forbearance in 2020 to 2022, and real estate stress in 2022 to 2024 each produced distinct cohorts.

SME finance carries a specific signature. The International Finance Corporation (2019) MSME finance gap study puts the unmet SME credit demand in Vietnam in the tens of billions of US dollars. Seasonality around Tet (Lunar New Year) raises liquidity needs and shifts delinquency timings. These facts should condition any benchmark that targets Vietnamese portfolios: rank-based comparison over at least two vintages and two segments (consumer and SME) dominates a single-dataset comparison.

16.12.2 Application considerations

Three adjustments apply to the Demšar (2006) framework when the evaluation set is a Vietnamese portfolio.

First, the number of independent evaluation units \(N\) should count vintages, not random splits. A 5-by-2 stratified CV on a single 2022 vintage produces ten resamples that are not independent draws from the population process. A rank test run on those ten resamples understates variance and overstates confidence. Running the Friedman test across, say, six half-year vintages from 2019H1 to 2021H2 gives six genuine observations per classifier and an honest Iman-Davenport correction.

Second, the metric basket should include calibration at low default rates. Vietnamese consumer portfolios after the 2021 to 2023 tightening show default realizations in the 2 to 4 percent range at 12 months, which is where Brier and H-measure become more informative than AUC. For SBV Circular 41/2016 standardized-approach capital, the PD assignment is grade-based, so calibration at grade boundaries (Chapter 13) is the first-order concern.

Third, the benchmark must report a vintage-stability statistic. Breeden (2007)’s age-vintage-period decomposition is the standard tool; the entry here is the realized PD dispersion across vintages, stratified by Tet proximity. A classifier that wins on 2022H2 and loses on 2019H1 is not a production candidate.
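The first adjustment can be sketched as a Friedman test with the Iman-Davenport correction in which the evaluation units are vintages. The AUC matrix below is hypothetical (six half-year vintages, four classifiers), the function name is illustrative, and no tie correction is applied.

```python
import numpy as np
from scipy.stats import f as f_dist, rankdata

def iman_davenport(perf):
    """perf: (N evaluation units) x (k classifiers), higher = better."""
    N, k = perf.shape
    ranks = rankdata(-perf, axis=1)              # rank 1 = best in each vintage
    R = ranks.mean(axis=0)
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)    # Iman-Davenport statistic
    p = f_dist.sf(F, k - 1, (k - 1) * (N - 1))
    return R, chi2, F, p

classifiers = ["logit", "rf", "lgbm", "mlp"]
# Hypothetical AUCs, one row per vintage from 2019H1 to 2021H2.
perf = np.array([
    [0.742, 0.751, 0.768, 0.749],
    [0.738, 0.755, 0.761, 0.757],
    [0.701, 0.712, 0.725, 0.698],
    [0.695, 0.709, 0.704, 0.690],
    [0.730, 0.744, 0.759, 0.741],
    [0.748, 0.752, 0.771, 0.760],
])

R, chi2, F, p = iman_davenport(perf)
print(dict(zip(classifiers, np.round(R, 2))))
print(f"Friedman chi2 = {chi2:.2f}, Iman-Davenport F = {F:.2f}, p = {p:.4f}")
```

With \(N = 6\) vintages and \(k = 4\) classifiers the statistic is referred to an \(F\) distribution with 3 and 15 degrees of freedom; on these made-up numbers the Friedman statistic is 12.60 and the corrected statistic is 11.67.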

16.12.3 Rationalization

Why should a practitioner in Hanoi or Ho Chi Minh City trust the Lessmann ordering? The Lessmann et al. (2015) evidence is drawn from eight datasets, none of them Vietnamese. Two arguments carry the ordering across. First, the ranking is structural: gradient-boosted trees dominate linear models when features are non-monotone and interactions matter, and Vietnamese bureau data contains non-monotone features (age buckets, employment tenure buckets, relationship with state-owned enterprises) that reward non-linearity. Second, the stability of the ordering has been replicated on Taiwanese and Chinese consumer panels (Huang et al., 2020), which are closer to the Vietnamese data-generating process than the UCI German benchmark. The gap between boosting and logistic regression on Vietnamese retail panels is within the 1 to 3 AUC-point band reported for other Asian samples.

The rationalization has a limit. On SME portfolios where bureau coverage is thin and the lender relies on relationship lending, logistic regression with expert-designed features can match boosting, because the informational rent is in the feature engineering rather than the function class (Liberti & Petersen, 2019). The benchmark tables should therefore be stratified by segment.

16.12.4 Practical notes

Operationally, a Vietnam-context benchmark pipeline looks like this. Pull CIC-equivalent bureau features plus internal behavioral features for each vintage. Split stratified by vintage and by SME-versus-consumer segment. Run the nine-classifier mini-benchmark from Section 16.5 with a symmetric tuning budget. Aggregate by Demšar (2006) ranks across vintages. Report AUC, KS, Brier, H-measure, partial AUC in the 0 to 10 percent FPR band, calibration slope at the grade boundary, and PSI against the prior vintage. Document the ranking in the model-development package that SBV Circular 41/2016 validation expects (as amended by Circular 22/2023/TT-NHNN of 29 December 2023 on capital adequacy ratios; State Bank of Vietnam, 2023), and cross-reference it against the consumer-lending risk limits in Circular 43/2016/TT-NHNN on consumer lending by finance companies when the portfolio is a finance-company portfolio. The same template carries to other Southeast Asian markets with CIC-equivalent bureaus (Thailand NCB, Indonesia SLIK) once vintage definitions are harmonized (International Monetary Fund, 2024).
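The PSI entry in that report can be computed as below. This is a minimal sketch assuming decile bins taken from the prior vintage and a small floor on empty bins; the function name, the floor, and the synthetic scores are illustrative choices.

```python
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population stability index of `current` against `reference`."""
    # Interior bin edges from quantiles of the reference (prior-vintage) scores.
    inner = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_share = np.bincount(np.searchsorted(inner, reference, side="right"),
                            minlength=n_bins) / len(reference)
    cur_share = np.bincount(np.searchsorted(inner, current, side="right"),
                            minlength=n_bins) / len(current)
    # Floor the shares so that an empty bin does not produce log(0).
    ref_share = np.clip(ref_share, eps, None)
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

rng = np.random.default_rng(0)
prior_vintage = rng.normal(600, 50, size=20000)
stable_vintage = rng.normal(600, 50, size=20000)
shifted_vintage = rng.normal(575, 50, size=20000)   # half a standard deviation lower

print(f"PSI, stable vintage:  {psi(prior_vintage, stable_vintage):.4f}")
print(f"PSI, shifted vintage: {psi(prior_vintage, shifted_vintage):.4f}")
```

Commonly quoted rules of thumb treat values below roughly 0.10 as stable and values above roughly 0.25 as a material shift, but those thresholds are conventions rather than statistical tests.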

16.13 Takeaways

  • Heterogeneous ensembles and gradient-boosted trees top the AUC rankings in the definitive credit-scoring benchmarks (Lessmann et al., 2015). The effect size is 1 to 3 AUC points over logistic regression.
  • The correct way to compare multiple classifiers across multiple datasets is the Demšar (2006) non-parametric framework: Friedman rank test, Iman-Davenport correction, Nemenyi critical-difference diagram.
  • Logistic regression remains the rational choice when regulators demand interpretability, when \(N\) is small, or when domain knowledge dictates monotone linear structure.
  • Gradient-boosted trees still dominate deep learning on typical tabular credit data under comparable tuning budgets (Grinsztajn et al., 2022). Deep models are the right choice for sequence data, not for standard tabular features.
  • Benchmark tables should report AUC, KS, Brier, H-measure, partial AUC, calibration, and EMP, and aggregate across datasets via ranks, not arithmetic means.
  • Scores from different models, or from the same model across vintages, are not on a common axis without explicit work: PD calibration, PDO anchoring, equipercentile equating, or a within-cell percentile transform. When the analytic object is causal and the engine could re-version, use the default event instead (Section 16.9).

Further reading

  • Baesens et al. (2003), the canonical reference for credit-scoring benchmarking, still cited in almost every follow-up.
  • Lessmann et al. (2015), the 2015 update with 41 classifiers, proper statistical comparison, and heterogeneous-ensemble results.
  • Demšar (2006), the statistical methodology for multi-classifier multi-dataset comparison.
  • Iman & Davenport (1980), the \(F\)-distribution approximation used in every modern implementation of the Friedman test.
  • García & Herrera (2008), the pairwise-comparison extension of Demšar (2006) with improved post-hoc procedures.
  • Benavoli et al. (2017), the Bayesian alternative to the frequentist framework.
  • Grinsztajn et al. (2022), the NeurIPS 2022 paper that established the gradient-boosting-versus-deep-learning finding on tabular data.
  • Gunnarsson et al. (2021), a direct comparison of deep learning and gradient boosting on credit-specific benchmarks.
  • Dastile et al. (2020), a 2020 systematic review of 74 credit-scoring papers.
  • Fernández-Delgado et al. (2014), the broader “do we need hundreds of classifiers” paper on 121 UCI datasets, whose rank-based methodology matches Demšar’s.
  • Hand (2009), the H-measure paper and its critique of AUC incoherence.
  • Verbraken et al. (2014), the EMP metric for credit with loss-given-default awareness.
  • Dietterich (1998) and Alpaydin (1999), the 5-by-2 CV protocol.
  • Kolen & Brennan (2014), the canonical psychometric reference on equipercentile equating, scaling, and linking. The framework transfers directly to credit-score versioning (FICO 8 to 9 to 10 type migrations) and to cross-bureau score translation.