14  Neural Networks and Deep Learning

Scope: both retail and corporate. MLP, embedding-based, and tabular deep models. Worked examples are retail (UCI German, Taiwan); the chapter documents why deep nets typically lose to GBT on tabular credit data of either kind.

Overview

Neural networks arrived in credit scoring before they were ready. Papers in the mid-1990s trained feedforward nets on UCI-sized tables, reported small AUC gains over logistic regression, and reviewers drew strong conclusions from thin benchmarks (Atiya, 2001; West, 2000). The large benchmarks of the 2010s put the discipline back on firmer ground. Lessmann et al. (2015) compared forty-one classifiers across eight credit datasets and placed single-hidden-layer neural nets in the middle of the pack, beaten by gradient boosting and heterogeneous ensembles but essentially tied with random forests on a rank-sum test. A decade later, Grinsztajn et al. (2022) formalized the folk wisdom: tree-based models still dominate on typical tabular data, and deep learning wins only when the structure of the input (sequence, image, graph, text) lines up with a deep inductive bias that trees cannot match.

This chapter takes deep learning seriously anyway. First, because deep models are now the only realistic option for the three structured modalities that sit next to credit: transaction sequences (Chapter 18), digital footprints (Chapter 17), and free-text narratives (Chapter 29). Second, because attention-based architectures such as FT-Transformer (Gorishniy et al., 2021) and TabNet (Arik & Pfister, 2021) have closed a meaningful fraction of the gap to gradient boosting on tabular data, and deep tabular models are now the interpretability target for distilled post-hoc explainers in many shops. Third, because deep learning is the only way to build joint representations across text, structured features, and sequences inside one model; the modern credit stack is not single-modality any more.

A cautious framing is in order. Everything in this chapter can be replaced, on Taiwan default or Home Credit, by a five-line XGBoost call that runs in half a second and will very likely beat the neural net by a modest but real margin. The value of the chapter is not “deep learning is a better credit scorer” (it usually is not). The value is that you walk away able to train, regularize, serialize, and deploy a neural net for credit, know when the structure of your data actually warrants one, and understand the math well enough to read a referee’s report.

14.1 Motivation

Three forces pushed neural networks back into the credit toolbox. The first was the open-source PyTorch and TensorFlow ecosystems (Paszke et al., 2019), which removed the engineering tax that killed 1990s-era neural credit scorers at institutions without dedicated research groups. The second was the switch from handcrafted features to learned representations for two high-volume credit modalities: transaction streams (Babaev et al., 2022) and mortgage loan histories (Sadhwani et al., 2021). Both lean on recurrent or transformer architectures that simply do not exist in a tree-based toolkit. The third was regulatory: supervisors have started to accept non-linear models, provided the institution delivers the SR 11-7 documentation stack (validation, monitoring, challenger, explainability, fairness) around them.

The core tension is small data. Credit portfolios are imbalanced (default rates of 2 to 10% in prime retail, 15 to 25% in subprime or card) and, after carve-outs for time-out-of-sample validation, training samples of 50,000 to 500,000 rows are common. Deep networks are notorious for overfitting below a million rows unless regularization is aggressive. The benchmarks in Grinsztajn et al. (2022) use 1,000 to 10,000 rows and conclude that tree models win; Gorishniy et al. (2021) use comparable sizes and conclude that carefully tuned transformers tie trees. The difference is almost entirely regularization discipline.

The tension is sharper in emerging markets. A Vietnamese finance company may have 30,000 to 150,000 labeled defaults across a multi-year window, with CIC pulls that cover only a fraction of the exposures and with cross-channel drift driven by fast eKYC-enabled growth (Credit Information Center of Vietnam, 2023; State Bank of Vietnam, 2020). That is well below the thin-sample floor at which an MLP stops memorizing and starts generalizing, and it is the regime where the choice of regularization (dropout rate, weight decay, early-stopping patience) matters more than architecture. The Vietnam section at the end of this chapter spells out the defaults that have held up under SBV-style review.

Notation

Let \((x_i, y_i)_{i=1}^n\) denote the training set, with feature vector \(x_i \in \mathbb{R}^p\) and label \(y_i \in \{0, 1\}\) (1 = default). A feedforward network with \(L\) layers defines a function \(f_\theta: \mathbb{R}^p \to \mathbb{R}\) with parameters \(\theta = (W^{(\ell)}, b^{(\ell)})_{\ell=1}^{L}\), where \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\) and \(b^{(\ell)} \in \mathbb{R}^{d_\ell}\). The activation at layer \(\ell\) is \(a^{(\ell)}\), the pre-activation is \(z^{(\ell)}\), and \(\sigma(\cdot)\) is a nonlinearity (tanh, ReLU, GELU). Probability of default is \(\hat{p}_i = \mathrm{sigmoid}(f_\theta(x_i))\), and the binary cross-entropy loss is \(\ell(\theta) = -\tfrac{1}{n}\sum_i [y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)]\).

Show code
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")

import sys, time, warnings
sys.path.insert(0, "../code")
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

from creditutils import (
    load_german_credit, load_taiwan_default,
    train_valid_test_split, ks_statistic, stable_sigmoid,
)

SEED = 0
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.set_num_threads(1)

14.2 MLP architecture

14.2.1 Forward pass

A multilayer perceptron alternates affine maps with nonlinear activations. For a network with layer widths \(d_0, d_1, \ldots, d_L\) (where \(d_0 = p\) is the input dimension and \(d_L = 1\) for binary classification),

\[ a^{(0)} = x, \qquad z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \sigma(z^{(\ell)}), \quad \ell = 1, \ldots, L-1, \tag{14.1}\]

with the output layer linear in the logit scale: \(f_\theta(x) = z^{(L)} = W^{(L)} a^{(L-1)} + b^{(L)}\). For binary classification the predicted probability is \(\hat{p} = \sigma_{\mathrm{sig}}(f_\theta(x))\), where \(\sigma_{\mathrm{sig}}(u) = 1/(1 + e^{-u})\).

The hidden nonlinearity is a design choice. The classical choice is \(\tanh\), which maps to \((-1, 1)\) and is smooth. The modern default is ReLU, \(\mathrm{ReLU}(u) = \max(0, u)\), which has two appealing properties: it does not saturate on the positive side, so gradients flow even for large pre-activations, and it is piecewise linear, which makes the forward pass and its Jacobian cheap. For tabular credit nets the practical difference between ReLU, GELU, and SiLU is within noise; the choice matters far less than width and regularization.

14.2.2 Universal approximation

The universal approximation theorem says that a network with one hidden layer of sufficient width and any non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy (Cybenko, 1989; Hornik et al., 1989). Formally, if \(\sigma\) is continuous, nonconstant, bounded, and non-polynomial, then for every \(\varepsilon > 0\) and every continuous \(g: [0, 1]^p \to \mathbb{R}\), there exists an integer \(H\) and parameters \(\{(w_h, b_h, c_h)\}_{h=1}^H\) with \(w_h \in \mathbb{R}^p\), \(b_h, c_h \in \mathbb{R}\) such that

\[ \sup_{x \in [0, 1]^p} \left| g(x) - \sum_{h=1}^H c_h \sigma(w_h^\top x + b_h) \right| < \varepsilon. \tag{14.2}\]

The theorem is topological: it says approximators exist, nothing about training dynamics, sample complexity, or generalization. Barron (1993) strengthened it to a rate: if \(g\) has bounded Fourier-first-moment norm \(C_g\), a network with \(H\) hidden units achieves mean-squared error \(O(C_g^2 / H)\), independent of input dimension. Depth does not appear in either bound, which is one reason the 1990s literature focused on shallow nets. The modern argument for depth is compositional: deep networks approximate certain function classes with exponentially fewer parameters than shallow ones, though formal separation results require restrictive assumptions.

For credit, the upshot is pragmatic. Universal approximation guarantees that any PD function can in principle be represented. It says nothing about whether your 50,000-row sample is enough to find that representation, which is the actual operational question. This is where regularization does the work.

14.2.3 Backpropagation, derived

Training minimizes a scalar loss \(\mathcal{L}(\theta)\) by gradient descent. For a sample \((x, y)\) and logistic loss \(\ell(z^{(L)}, y) = -y \log \sigma_{\mathrm{sig}}(z^{(L)}) - (1-y) \log(1 - \sigma_{\mathrm{sig}}(z^{(L)}))\), we need \(\partial \ell / \partial W^{(\ell)}\) and \(\partial \ell / \partial b^{(\ell)}\) for every \(\ell\). Computing these analytically via the chain rule is backpropagation (Rumelhart et al., 1986).

Define the error signal \(\delta^{(\ell)} = \partial \ell / \partial z^{(\ell)}\). At the output, logistic loss cancels cleanly against the sigmoid link:

\[ \delta^{(L)} = \frac{\partial \ell}{\partial z^{(L)}} = \sigma_{\mathrm{sig}}(z^{(L)}) - y = \hat{p} - y. \tag{14.3}\]

For hidden layers, apply the chain rule through \(z^{(\ell+1)} = W^{(\ell+1)} a^{(\ell)} + b^{(\ell+1)}\) and \(a^{(\ell)} = \sigma(z^{(\ell)})\):

\[ \begin{aligned} \delta^{(\ell)} &= \frac{\partial \ell}{\partial z^{(\ell)}} = \frac{\partial \ell}{\partial a^{(\ell)}} \odot \sigma'(z^{(\ell)}) \\ &= \left(W^{(\ell+1)\top} \delta^{(\ell+1)}\right) \odot \sigma'(z^{(\ell)}), \end{aligned} \tag{14.4}\]

where \(\odot\) is the Hadamard product and \(\sigma'\) is applied elementwise. The parameter gradients then follow:

\[ \frac{\partial \ell}{\partial W^{(\ell)}} = \delta^{(\ell)} (a^{(\ell-1)})^\top, \qquad \frac{\partial \ell}{\partial b^{(\ell)}} = \delta^{(\ell)}. \tag{14.5}\]

Backpropagation is thus two linear passes over the network: a forward pass that caches \((a^{(\ell)}, z^{(\ell)})\), then a backward pass that computes \(\delta^{(\ell)}\) from \(\delta^{(\ell+1)}\) and contracts with the cached activations. Its cost is dominated by matrix multiplications and matches the forward pass to constant factors. Stochastic gradient descent (Robbins & Monro, 1951) then moves \(\theta\) opposite the mini-batch-averaged gradient, with step size \(\eta\):

\[ \theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \ell_i(\theta_t). \tag{14.6}\]

Adam (Kingma & Ba, 2015) replaces the raw gradient with running first and second moments:

\[ m_{t+1} = \beta_1 m_t + (1 - \beta_1) g_t, \qquad v_{t+1} = \beta_2 v_t + (1 - \beta_2) g_t \odot g_t, \tag{14.7}\]

\[ \theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon}, \tag{14.8}\]

where \(\hat{m}, \hat{v}\) are bias-corrected moments and \(\beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}\) are the defaults. Adam tends to converge faster than SGD on short training runs over tabular data, which is almost the entire operating regime for credit.

14.2.4 Backpropagation from scratch, numerically checked

A from-scratch implementation clarifies what the framework hides. We build a two-layer MLP, compute analytic gradients by the chain rule above, and compare against PyTorch’s autograd.

Show code
rng = np.random.default_rng(0)
p, h = 5, 4
x = rng.standard_normal((3, p)).astype("float32")
y = rng.integers(0, 2, size=3).astype("float32")

# NumPy parameters
W1 = rng.standard_normal((p, h)).astype("float32") * 0.3
b1 = np.zeros(h, dtype="float32")
W2 = rng.standard_normal((h, 1)).astype("float32") * 0.3
b2 = np.zeros(1, dtype="float32")

def sigmoid(u):
    return stable_sigmoid(u)

# Forward
z1 = x @ W1 + b1
a1 = np.tanh(z1)
z2 = a1 @ W2 + b2
p_hat = sigmoid(z2).ravel()
loss = -(y * np.log(p_hat + 1e-9) + (1 - y) * np.log(1 - p_hat + 1e-9)).mean()

# Backward by hand
delta2 = (p_hat - y).reshape(-1, 1) / len(y)      # (n, 1)
gW2 = a1.T @ delta2                               # (h, 1)
gb2 = delta2.sum(axis=0)
delta1 = (delta2 @ W2.T) * (1 - a1**2)            # tanh derivative
gW1 = x.T @ delta1
gb1 = delta1.sum(axis=0)

# Autograd comparison
xt = torch.tensor(x, requires_grad=False)
yt = torch.tensor(y)
W1t = torch.tensor(W1, requires_grad=True)
b1t = torch.tensor(b1, requires_grad=True)
W2t = torch.tensor(W2, requires_grad=True)
b2t = torch.tensor(b2, requires_grad=True)

logit = torch.tanh(xt @ W1t + b1t) @ W2t + b2t
loss_t = nn.functional.binary_cross_entropy_with_logits(logit.squeeze(-1), yt)
loss_t.backward()

print(f"loss np={loss:.6f}  torch={loss_t.item():.6f}")
for name, a, b in [
    ("W1", gW1, W1t.grad.numpy()),
    ("b1", gb1, b1t.grad.numpy()),
    ("W2", gW2, W2t.grad.numpy()),
    ("b2", gb2, b2t.grad.numpy()),
]:
    print(f"max|d{name}|_np - d{name}_torch| = {np.abs(a - b).max():.2e}")
loss np=0.674530  torch=0.674530
max|dW1|_np - dW1_torch| = 5.15e-09
max|db1|_np - db1_torch| = 1.57e-09
max|dW2|_np - dW2_torch| = 6.52e-09
max|db2|_np - db2_torch| = 2.56e-10

Agreement is at float32 precision on every parameter block. This is the only sanity check that actually catches bugs in a custom training loop, and it should be the first thing you add when porting to a new framework.

14.3 Regularization for credit

Credit portfolios live in the small-\(n\) regime for deep learning. The single most useful thing you can do is regularize aggressively. Five techniques matter: weight decay, dropout, batch or layer normalization, early stopping, and data augmentation (the last is modality-specific and we cover it in the sequence and text sections).

14.3.1 Weight decay (L2 regularization)

Adding \(\lambda \|\theta\|_2^2\) to the loss shrinks parameters toward zero. Gradient descent becomes

\[ \theta_{t+1} = (1 - \eta \lambda) \theta_t - \eta \nabla \ell(\theta_t), \tag{14.9}\]

which is the form of L2 penalization known in statistics as ridge (Tibshirani, 1996 discusses the sibling L1). For adaptive optimizers such as Adam, Loshchilov & Hutter (2019) showed that the equivalence between an explicit L2 term and weight decay breaks, because Adam divides by \(\sqrt{\hat{v}}\). They introduced AdamW, which applies weight decay directly to the parameter update:

\[ \theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon} + \lambda \theta_t \right). \tag{14.10}\]

For credit nets on Taiwan-scale data, \(\lambda \in [10^{-5}, 10^{-3}]\) is a reasonable starting range. AdamW is now the production default in most deep-tabular libraries.

14.3.2 Dropout

Srivastava et al. (2014) proposed randomly zeroing activations at each forward pass with probability \(p_{\mathrm{drop}}\), then scaling the surviving activations by \(1/(1-p_{\mathrm{drop}})\) so the expected activation is unchanged. Let \(r^{(\ell)} \in \{0, 1\}^{d_\ell}\) be a vector of independent Bernoulli draws with mean \(1 - p_{\mathrm{drop}}\). The dropout forward pass becomes

\[ \tilde{a}^{(\ell)} = \frac{1}{1 - p_{\mathrm{drop}}} \cdot r^{(\ell)} \odot a^{(\ell)}. \tag{14.11}\]

At inference, dropout is off and the full activations are used. The effect is an ensemble of exponentially many subnetworks sharing weights; the trained model approximates their geometric mean.

14.3.3 Dropout as Bayesian approximation

Gal & Ghahramani (2016) showed that dropout is a variational approximation to a Gaussian-process posterior. Keep dropout active at inference, sample \(M\) stochastic forward passes, and the empirical mean and variance of the predictions approximate the posterior predictive moments. The derivation (compressed) follows.

Place a Gaussian prior on each weight matrix, \(W^{(\ell)} \sim \mathcal{N}(0, l^{-2} I)\), and a Bernoulli approximating posterior \(q(W^{(\ell)}) = M^{(\ell)} \cdot \mathrm{diag}(z^{(\ell)})\) with \(z^{(\ell)}_j \sim \mathrm{Bernoulli}(1 - p_{\mathrm{drop}})\) and variational parameters \(M^{(\ell)}\). The variational free energy is

\[ \mathcal{F}(q) = -\int q(\theta) \log p(D | \theta) \, d\theta + \mathrm{KL}(q \| \pi), \tag{14.12}\]

where \(\pi\) is the prior and \(D = \{(x_i, y_i)\}\). Under the Bernoulli approximating family, the reconstruction term is a Monte Carlo estimate with one dropout sample per data point, and the KL term reduces to a weight-decay penalty \(\lambda \|\theta\|_2^2\) with \(\lambda\) depending on \(p_{\mathrm{drop}}\) and the prior lengthscale. Training with dropout plus L2 is thus equivalent (up to constants) to variational inference with this specific approximating family. At prediction, the posterior predictive mean and variance are

\[ \begin{aligned} \mathbb{E}_q[\hat{p}(x)] &\approx \frac{1}{M} \sum_{m=1}^M \hat{p}_m(x), \\ \mathrm{Var}_q[\hat{p}(x)] &\approx \frac{1}{M} \sum_{m=1}^M \hat{p}_m(x)^2 - \left(\frac{1}{M} \sum_m \hat{p}_m(x)\right)^2, \end{aligned} \tag{14.13}\]

with the stochastic passes sampled under dropout. Wager et al. (2013) arrived at a related interpretation: dropout is an adaptive regularizer that penalizes features whose Fisher-information contribution is most variable. Both framings are useful. The Bayesian framing gives free uncertainty quantification (important for reject inference and for the monitoring triggers in Chapter 38); the adaptive-regularization framing explains why dropout fails to help on very small networks and small feature sets, because the regularization it induces is weaker than explicit L2.

14.3.4 Batch and layer normalization

Ioffe & Szegedy (2015)’s batch normalization standardizes each pre-activation over the mini-batch:

\[ \hat{z}^{(\ell)}_j = \gamma_j \frac{z^{(\ell)}_j - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}} + \beta_j, \tag{14.14}\]

where \(\mu_j, \sigma_j^2\) are batch statistics at training time and exponential-moving-averages at inference. The learnable scale and shift \((\gamma_j, \beta_j)\) let the network recover the identity. Santurkar et al. (2018) argued the benefit is not internal-covariate-shift reduction but a smoother loss landscape that tolerates larger learning rates. Layer normalization (Ba et al., 2016) replaces the batch dimension with the feature dimension:

\[ \hat{z}_j = \gamma_j \frac{z_j - \bar{z}}{\sqrt{\mathrm{var}(z) + \epsilon}} + \beta_j, \tag{14.15}\]

with \(\bar{z}\) and \(\mathrm{var}(z)\) taken over the feature axis within one example. Layer norm is the right choice for variable-length sequences and transformer blocks, because batch norm’s statistics degrade when sequence length or batch size is small.

For credit nets, batch norm tends to help on MLPs with very-small-batch training or when the input features are on wildly different scales and you cannot rely on offline standardization. Layer norm is the default inside transformer-based tabular models (FT-Transformer, TabTransformer), since attention is dense and the feature axis is well-defined.

14.3.5 Early stopping

Prechelt (1998)’s early stopping monitors a validation metric and halts training when it stops improving for a patience window. Implemented naively, it replaces an explicit regularizer with an implicit one: the iterate \(\theta_t\) is kept at the point of best generalization rather than allowed to overfit. On credit portfolios where train and validation AUC curves diverge inside ten epochs, early stopping is the single most effective knob. Patience of three to ten epochs is typical.

14.3.6 A concrete MLP on Taiwan

We pull everything above together into a baseline MLP for the Taiwan default dataset. The Taiwan dataset contains 30,000 Taiwanese credit card holders observed in 2005 with a binary “default next month” label (22.1% positive). It is small enough to train on a laptop in seconds, but realistic enough that overfitting is a live threat.

Show code
taiwan = load_taiwan_default().drop(columns=["id"])
train_df, valid_df, test_df = train_valid_test_split(taiwan, y_col="default", seed=42)

feat_cols = [c for c in taiwan.columns if c != "default"]
scaler = StandardScaler().fit(train_df[feat_cols])

def prep(df):
    X = scaler.transform(df[feat_cols]).astype("float32")
    y = df["default"].values.astype("float32")
    return X, y

X_tr, y_tr = prep(train_df)
X_va, y_va = prep(valid_df)
X_te, y_te = prep(test_df)

print("shapes:", X_tr.shape, X_va.shape, X_te.shape)
print("default rate (train):", y_tr.mean().round(4))
shapes: (18000, 23) (6000, 23) (6000, 23)
default rate (train): 0.2245

The MLP is 23 inputs, two hidden layers of 64 and 32 units, ReLU, dropout 0.3, BCE loss, AdamW with weight decay \(10^{-4}\), and early stopping on validation AUC with patience 5. We cap training at 40 epochs; in practice early stopping fires earlier.

Show code
class TaiwanMLP(nn.Module):
    def __init__(self, in_dim, h1=64, h2=32, p_drop=0.3, use_bn=True):
        super().__init__()
        layers = []
        for h in (h1, h2):
            layers += [nn.Linear(in_dim, h)]
            if use_bn:
                layers += [nn.BatchNorm1d(h)]
            layers += [nn.ReLU(), nn.Dropout(p_drop)]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_mlp(
    X_tr, y_tr, X_va, y_va,
    h1=64, h2=32, p_drop=0.3, use_bn=True,
    lr=1e-3, weight_decay=1e-4, batch_size=512,
    max_epochs=40, patience=5, seed=0,
):
    torch.manual_seed(seed)
    model = TaiwanMLP(X_tr.shape[1], h1=h1, h2=h2, p_drop=p_drop, use_bn=use_bn)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.BCEWithLogitsLoss()

    X_tr_t = torch.tensor(X_tr); y_tr_t = torch.tensor(y_tr)
    X_va_t = torch.tensor(X_va); y_va_np = np.asarray(y_va)

    best_auc, best_state, stale, history = 0.0, None, 0, []
    for epoch in range(max_epochs):
        model.train()
        perm = torch.randperm(len(X_tr_t))
        for i in range(0, len(X_tr_t), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            out = model(X_tr_t[idx])
            loss = loss_fn(out, y_tr_t[idx])
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            p_va = torch.sigmoid(model(X_va_t)).numpy()
        auc = roc_auc_score(y_va_np, p_va)
        history.append(auc)
        if auc > best_auc + 1e-4:
            best_auc, best_state, stale = auc, {k: v.clone() for k, v in model.state_dict().items()}, 0
        else:
            stale += 1
            if stale >= patience:
                break
    model.load_state_dict(best_state)
    return model, history

t0 = time.time()
mlp, hist = train_mlp(X_tr, y_tr, X_va, y_va, max_epochs=40, patience=6)
print(f"epochs trained: {len(hist)}  best val AUC: {max(hist):.4f}  wall: {time.time() - t0:.1f}s")
epochs trained: 38  best val AUC: 0.7744  wall: 2.2s
Show code
def score(model, X):
    model.eval()
    with torch.no_grad():
        return torch.sigmoid(model(torch.tensor(X))).numpy()

p_te_mlp = score(mlp, X_te)
print(f"MLP  AUC {roc_auc_score(y_te, p_te_mlp):.4f}  "
      f"KS {ks_statistic(y_te, p_te_mlp):.4f}  "
      f"Brier {brier_score_loss(y_te, p_te_mlp):.4f}")
MLP  AUC 0.7698  KS 0.4165  Brier 0.1364

Compare against logistic regression and XGBoost.

Show code
import xgboost as xgb

lr = LogisticRegression(max_iter=2000, C=1.0, random_state=0)
lr.fit(X_tr, y_tr)
p_te_lr = lr.predict_proba(X_te)[:, 1]

xgb_clf = xgb.XGBClassifier(
    n_estimators=400, max_depth=4, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    eval_metric="logloss", random_state=0, n_jobs=1, verbosity=0,
)
xgb_clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
p_te_xgb = xgb_clf.predict_proba(X_te)[:, 1]

rows = []
for name, p in [("LogReg", p_te_lr), ("XGBoost", p_te_xgb), ("MLP", p_te_mlp)]:
    rows.append({
        "model": name,
        "AUC": roc_auc_score(y_te, p),
        "KS":  ks_statistic(y_te, p),
        "Brier": brier_score_loss(y_te, p),
    })
pd.DataFrame(rows).round(4)
model AUC KS Brier
0 LogReg 0.7271 0.3886 0.1439
1 XGBoost 0.7706 0.4216 0.1356
2 MLP 0.7698 0.4165 0.1364

XGBoost wins by a small margin, the MLP beats logistic regression, and the MLP’s Brier is competitive. This is the typical shape of the result on tabular credit data: the neural net is in the middle.

14.3.7 Dropout ablation

The claim that dropout matters on small credit samples is easy to test. We retrain the same architecture three times: no dropout, dropout 0.3, dropout 0.5, all else held fixed.

Show code
results = []
for p in [0.0, 0.3, 0.5]:
    m, h = train_mlp(X_tr, y_tr, X_va, y_va, p_drop=p, max_epochs=40, patience=6, seed=0)
    p_te = score(m, X_te)
    results.append({
        "p_drop": p,
        "epochs": len(h),
        "val_AUC_best": round(max(h), 4),
        "test_AUC": round(roc_auc_score(y_te, p_te), 4),
        "test_KS":  round(ks_statistic(y_te, p_te), 4),
    })
pd.DataFrame(results)
p_drop epochs val_AUC_best test_AUC test_KS
0 0.0 19 0.7682 0.7624 0.4156
1 0.3 38 0.7744 0.7698 0.4165
2 0.5 38 0.7717 0.7660 0.4115

Without dropout, validation AUC peaks early and then falls as training AUC keeps rising: a textbook overfit. Dropout 0.3 flattens the curve and delivers the best test AUC. Dropout 0.5 regularizes too hard for a 23-feature network and costs performance. In our experience on retail portfolios, dropout between 0.1 and 0.3 is almost always the right range when the hidden width is in the tens. For wider networks (hundreds of units), 0.3 to 0.5 is usable.

14.3.8 Uncertainty via MC-dropout

Because we have a dropout net, we can produce an uncertainty band for each prediction by doing many forward passes with dropout on. This is operationally relevant: reject inference, policy overrides, and counterfactual denials all want “is this PD known to be high, or is it a confident 0.12?”

Show code
def mc_dropout_score(model, X, n_samples=30, seed=0):
    torch.manual_seed(seed)
    was_training = model.training
    # enable only Dropout layers
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        xt = torch.tensor(X)
        preds = []
        for _ in range(n_samples):
            preds.append(torch.sigmoid(model(xt)).numpy())
    if not was_training:
        model.eval()
    arr = np.stack(preds)
    return arr.mean(0), arr.std(0)

mean_p, std_p = mc_dropout_score(mlp, X_te, n_samples=40)
print("mean/std summary over test set:")
print(f"  mean PD      median={np.median(mean_p):.4f}  p95={np.quantile(mean_p, 0.95):.4f}")
print(f"  predictive   median std={np.median(std_p):.4f}  p95 std={np.quantile(std_p, 0.95):.4f}")
print(f"  point AUC on MC mean: {roc_auc_score(y_te, mean_p):.4f}")
mean/std summary over test set:
  mean PD      median=0.1640  p95=0.6714
  predictive   median std=0.0483  p95 std=0.0819
  point AUC on MC mean: 0.7706

The key number is the p95 predictive standard deviation. Customers in that tail are the ones where you want a human-in-the-loop override, a challenger model vote, or a documented policy rule (Gal & Ghahramani, 2016; MacKay, 1992; Neal, 1996).

14.4 CNNs on 2D structured features

Convolutional neural networks (Krizhevsky et al., 2017; LeCun et al., 1998) dominate image classification because they encode translation invariance and locality. Tabular credit features have no spatial structure: column order is arbitrary. Treating the feature vector as a 1D signal and applying a convolution is a common demonstration and almost never wins.

The cases where CNNs plausibly help are:

  • Monthly repayment matrices. Kvamme et al. (2018) showed CNNs on a \(T \times 1\) sequence of mortgage delinquency bucket codes, using 1D convolutions as a templatable filter for repayment patterns.
  • Time-frequency representations of transaction streams (spectrogram of daily spend).
  • Image-like features built from ordered billing histories (credit card PAY_0, PAY_2, …, PAY_6 as a \(6 \times 1\) grid).

We show the last one on Taiwan. Taiwan has a twelve-dimensional slice that is naturally 2D: six months of repayment delay codes (\(\mathrm{PAY}_0, \ldots, \mathrm{PAY}_6\)) and six months of bill amounts (\(\mathrm{BILL\_AMT1}, \ldots, \mathrm{BILL\_AMT6}\)). Arrange as a \(2 \times 6\) grid and run a tiny 2D CNN.

Show code
pay_cols  = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
bill_cols = ["BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6"]
ts_cols   = pay_cols + bill_cols

ts_scaler = StandardScaler().fit(train_df[ts_cols])
def prep_ts(df):
    arr = ts_scaler.transform(df[ts_cols]).astype("float32")
    return arr.reshape(-1, 2, 6)  # (N, channels=2, time=6)

Xtr_ts, Xva_ts, Xte_ts = prep_ts(train_df), prep_ts(valid_df), prep_ts(test_df)
print("CNN input shape:", Xtr_ts.shape)

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=3, padding=1),
            nn.BatchNorm1d(16), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=3, padding=1),
            nn.BatchNorm1d(16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(16, 16), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.head(self.conv(x)).squeeze(-1)

def train_cnn(Xtr, ytr, Xva, yva, epochs=25, lr=1e-3, batch_size=256):
    torch.manual_seed(0)
    model = TinyCNN()
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()
    Xtr_t, ytr_t = torch.tensor(Xtr), torch.tensor(ytr)
    Xva_t = torch.tensor(Xva)
    best_auc, best_state, stale = 0.0, None, 0
    for epoch in range(epochs):
        model.train()
        perm = torch.randperm(len(Xtr_t))
        for i in range(0, len(Xtr_t), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss = loss_fn(model(Xtr_t[idx]), ytr_t[idx])
            loss.backward(); opt.step()
        model.eval()
        with torch.no_grad():
            p_va = torch.sigmoid(model(Xva_t)).numpy()
        auc = roc_auc_score(yva, p_va)
        if auc > best_auc + 1e-4:
            best_auc, best_state, stale = auc, {k: v.clone() for k, v in model.state_dict().items()}, 0
        else:
            stale += 1
            if stale >= 5:
                break
    model.load_state_dict(best_state)
    return model, best_auc

t0 = time.time()
cnn, va_auc = train_cnn(Xtr_ts, y_tr, Xva_ts, y_va)
print(f"CNN val AUC {va_auc:.4f}  wall {time.time() - t0:.1f}s")

cnn.eval()
with torch.no_grad():
    p_te_cnn = torch.sigmoid(cnn(torch.tensor(Xte_ts))).numpy()
print(f"CNN test AUC {roc_auc_score(y_te, p_te_cnn):.4f}  "
      f"KS {ks_statistic(y_te, p_te_cnn):.4f}")
CNN input shape: (18000, 2, 6)
CNN val AUC 0.7636  wall 6.4s
CNN test AUC 0.7522  KS 0.4092

The CNN underperforms the flat MLP on Taiwan (the full MLP sees all 23 features, the CNN only the twelve time-bucket columns), but it beats logistic regression on the same twelve features. That is the honest take: a 2D CNN on rearranged tabular data is a curiosity. It becomes useful only when the arrangement is genuinely local (the monthly delinquency matrix in a mortgage book, or a raster of transaction intensities). If you need a CNN on tabular credit data, you should first ask whether an LSTM or a small transformer on the same time axis would be cleaner.

14.5 RNNs and LSTMs for transaction sequences

Transaction streams are the modality where deep learning wins in credit by a wide margin. The digital footprint and open banking chapters (Chapter 17 and Chapter 18) cover these modalities in depth; here we show the core architectural idea. A recurrent network processes a variable-length sequence \((x_1, \ldots, x_T)\) by maintaining a hidden state \(h_t \in \mathbb{R}^d\) that is updated at each time step.

14.5.1 Vanilla RNN and the vanishing-gradient problem

A vanilla RNN has update \(h_t = \sigma(W_h h_{t-1} + W_x x_t + b)\). Backpropagation through time unrolls the recurrence and passes gradients through \(T\) products of the Jacobian \(\partial h_t / \partial h_{t-1}\). When the spectral radius of that Jacobian is below one, gradients vanish as \(T\) grows; when above one, they explode (Bengio et al., 1994). In practice, vanilla RNNs fail to learn dependencies longer than roughly ten time steps without heroic initialization.

14.5.2 LSTM

Hochreiter & Schmidhuber (1997) introduced the long short-term memory cell, which sidesteps the vanishing gradient by introducing a cell state \(c_t\) that is updated additively rather than multiplicatively. The LSTM update at time \(t\) is

\[ \begin{aligned} f_t &= \sigma_{\mathrm{sig}}(W_f [h_{t-1}, x_t] + b_f) \quad &\text{(forget gate)}, \\ i_t &= \sigma_{\mathrm{sig}}(W_i [h_{t-1}, x_t] + b_i) \quad &\text{(input gate)}, \\ \tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \quad &\text{(candidate)}, \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad &\text{(cell update)}, \\ o_t &= \sigma_{\mathrm{sig}}(W_o [h_{t-1}, x_t] + b_o) \quad &\text{(output gate)}, \\ h_t &= o_t \odot \tanh(c_t). \end{aligned} \tag{14.16}\]

The additive cell-state update means gradients through \(c_t\) are controlled by the forget gate’s diagonal, which can be trained to pass gradient over many time steps. GRUs are a simpler variant with two gates. For credit card sequences (on the order of 100 transactions over a few months) either works; LSTMs are marginally more robust to short training runs.

14.5.3 Credit card transactions as sequence input

A standard representation is: at each transaction \(t\), a feature vector \(x_t\) containing amount, merchant-category embedding, time-of-day, days-since-previous, weekday, and a binary flag for online. The label \(y\) is a customer-level default within a forward window. Real transaction data cannot be shipped inside a textbook; we build a synthetic stream with the structure real transaction data has.

We simulate \(N = 800\) customers, each with \(T = 20\) transactions and \(F = 4\) features. Defaulters show increasing transaction variance, escalating amounts, and rising late-payment flags. Non-defaulters are stable.

Show code
def make_transactions(n=400, T=20, F=4, default=0, rng=None):
    rng = rng or np.random.default_rng(0)
    X = np.zeros((n, T, F), dtype="float32")
    for i in range(n):
        trend = np.linspace(0, 2.0 if default else 0.2, T)
        noise = rng.standard_normal((T, F)) * (0.8 if default else 0.3)
        X[i, :, 0] = np.abs(100 + 30 * trend + noise[:, 0] * 40) / 200      # amount (log-ish)
        X[i, :, 1] = rng.binomial(1, 0.2 + 0.5 * default, T)                # late flag
        X[i, :, 2] = (rng.random(T) - 0.5) + (0.3 if default else 0.0)     # merchant-cat drift
        X[i, :, 3] = (np.arange(T) * (0.05 if default else 0.01)) + noise[:, 3] * 0.1
    return X

rng_seq = np.random.default_rng(0)
X0 = make_transactions(n=400, default=0, rng=rng_seq)
X1 = make_transactions(n=400, default=1, rng=rng_seq)
Xs = np.concatenate([X0, X1])
ys = np.concatenate([np.zeros(400), np.ones(400)]).astype("float32")
perm = rng_seq.permutation(len(Xs)); Xs, ys = Xs[perm], ys[perm]
n_tr = int(len(Xs) * 0.7)
Xs_tr, Xs_te = Xs[:n_tr], Xs[n_tr:]
ys_tr, ys_te = ys[:n_tr], ys[n_tr:]
print("sequence tensor:", Xs_tr.shape, "default rate:", ys_tr.mean())
sequence tensor: (560, 20, 4) default rate: 0.5232143
Show code
class LSTMClassifier(nn.Module):
    def __init__(self, n_features, hidden=16, num_layers=1, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        self.head = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        _, (hn, _) = self.lstm(x)
        return self.head(hn[-1]).squeeze(-1)

torch.manual_seed(0)
lstm = LSTMClassifier(n_features=4, hidden=16)
opt = torch.optim.AdamW(lstm.parameters(), lr=5e-3, weight_decay=1e-4)
loss_fn = nn.BCEWithLogitsLoss()
Xs_tr_t = torch.tensor(Xs_tr); ys_tr_t = torch.tensor(ys_tr)
Xs_te_t = torch.tensor(Xs_te)

t0 = time.time()
for epoch in range(30):
    lstm.train()
    perm = torch.randperm(len(Xs_tr_t))
    for i in range(0, len(Xs_tr_t), 64):
        idx = perm[i:i + 64]
        opt.zero_grad()
        loss = loss_fn(lstm(Xs_tr_t[idx]), ys_tr_t[idx])
        loss.backward(); opt.step()
lstm.eval()
with torch.no_grad():
    p_te_lstm = torch.sigmoid(lstm(Xs_te_t)).numpy()
print(f"LSTM AUC {roc_auc_score(ys_te, p_te_lstm):.4f}  "
      f"KS {ks_statistic(ys_te, p_te_lstm):.4f}  "
      f"time {time.time() - t0:.1f}s")
LSTM AUC 1.0000  KS 1.0000  time 0.7s

On a clean synthetic signal the LSTM nails it. The real-world lesson is in the architectural choices: a single LSTM layer of moderate width, followed by LayerNorm and a small MLP head, with dropout on the head and not inside the recurrence. Applying dropout between the input and hidden recurrences tends to hurt short sequences. For longer sequences (hundreds to thousands of transactions), variational dropout (fixed mask across time) or layer-wise recurrent dropout becomes necessary; for credit-card monthly sequences, the simple form above is sufficient.

The CoLES framework (Babaev et al., 2022) is the current strongest approach for transaction sequences in Russian banking production. It trains a contrastive encoder on unlabeled transaction streams (customer is the class label) and fine-tunes the encoder head on downstream default. Sadhwani et al. (2021) train a deep network on monthly mortgage states and beat traditional hazard models on a 120-million-loan-month panel. The modality matters more than the architecture: once your data is a rich sequence, any of LSTM, 1D-TCN (Bai et al., 2018), or a small transformer will beat a feature-engineered logistic with room to spare.

14.6 Autoencoders for anomaly detection

An autoencoder compresses \(x\) to a latent \(z\) and reconstructs \(\hat{x}\). The training objective is reconstruction error: \(\mathcal{L}(\theta) = \tfrac{1}{n} \sum_i \|x_i - g_\phi(f_\psi(x_i))\|_2^2\), where \(f_\psi\) is the encoder and \(g_\phi\) the decoder. If we train on clean, non-default customers only, reconstruction error at test time works as an anomaly score: defaulters look unlike the training distribution and produce larger residuals (Sakurada & Yairi, 2014).

This is useful in two credit-scoring settings:

  • Rare-default portfolios (super-prime, very thin books) where supervised models collapse on a handful of positives.
  • Feature-drift detection: residuals are a sensitive detector of covariate shift and a natural input to SR 11-7 monitoring dashboards.

We demonstrate on the German Credit data. German is tiny (1,000 rows, 30% default rate after encoding), so the classification performance of a pure autoencoder is mediocre, but the mechanics are clean.

Show code
german = load_german_credit()
y_g = german["default"].values.astype(int)
X_g = pd.get_dummies(german.drop(columns=["default"]), drop_first=True).astype("float32")
X_g_np = X_g.values

# Train scaler on "good" customers only, then transform everything
scaler_g = StandardScaler().fit(X_g_np[y_g == 0])
X_g_sc = scaler_g.transform(X_g_np).astype("float32")

class AutoEncoder(nn.Module):
    def __init__(self, d, h_enc=32, z=8):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(d, h_enc), nn.ReLU(),
            nn.Linear(h_enc, z),
        )
        self.dec = nn.Sequential(
            nn.Linear(z, h_enc), nn.ReLU(),
            nn.Linear(h_enc, d),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

torch.manual_seed(0)
ae = AutoEncoder(X_g_sc.shape[1])
opt = torch.optim.Adam(ae.parameters(), lr=1e-3, weight_decay=1e-4)

X_good = torch.tensor(X_g_sc[y_g == 0])
t0 = time.time()
for epoch in range(120):
    ae.train()
    perm = torch.randperm(len(X_good))
    for i in range(0, len(X_good), 64):
        xb = X_good[perm[i:i + 64]]
        opt.zero_grad()
        r = ae(xb)
        loss = ((r - xb) ** 2).mean()
        loss.backward(); opt.step()
ae.eval()
with torch.no_grad():
    recon = ae(torch.tensor(X_g_sc)).numpy()
err = ((recon - X_g_sc) ** 2).mean(axis=1)

print(f"AE recon-error AUC vs default: {roc_auc_score(y_g, err):.4f}")
print(f"time: {time.time() - t0:.1f}s")

# Flag top-5% reconstruction errors as "anomalies" and report default lift
thr = np.quantile(err, 0.95)
flag = err >= thr
print(f"anomaly-flag default rate: {y_g[flag].mean():.2%}  "
      f"base rate: {y_g.mean():.2%}  "
      f"lift: {y_g[flag].mean() / y_g.mean():.2f}x")
AE recon-error AUC vs default: 0.6553
time: 0.5s
anomaly-flag default rate: 74.00%  base rate: 30.00%  lift: 2.47x

The AUC of reconstruction error against default is well above chance. The more usable operational statistic is the lift: in the top five percent of reconstruction error, the default rate is materially higher than the base rate. In practice, this flag becomes one of several inputs to an expert-in-the-loop review queue, not a standalone PD. The fair-lending considerations are real: Sakurada & Yairi (2014)’s framing is modality-agnostic, so a credit application that looks “unusual” might be unusual for protected-class reasons. Combine anomaly scores with fairness diagnostics (Chapter 28) before putting the flag into a decline decision.

14.6.1 Variational autoencoders and a brief note on VAEs

A variational autoencoder (Kingma & Welling, 2014) replaces the deterministic latent with a distribution \(q_\phi(z | x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))\) and regularizes the posterior toward a standard normal via a KL term:

\[ \mathcal{L}_{\mathrm{VAE}}(\theta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x | z)] - \mathrm{KL}(q_\phi(z | x) \| \mathcal{N}(0, I)). \tag{14.17}\]

VAEs are a principled density model and are a better anomaly score than plain autoencoders when the normal-class distribution is multi-modal. For credit, we have seen them help on SME financial-statement data where firms legitimately cluster by sector, and a single autoencoder cannot represent all sector-specific “good” patterns. On retail PD they rarely beat a simple AE.

14.7 Tabular deep learning

The two architectures worth knowing in the 2020s for tabular credit data are TabNet (Arik & Pfister, 2021) and FT-Transformer (Gorishniy et al., 2021). Both are designed to inject inductive biases appropriate for tables: column-level attention, instance-wise feature selection, and explicit handling of the column-is-a-feature axis rather than treating tables as a flat vector.

14.7.1 TabNet architecture

TabNet processes an input \(x \in \mathbb{R}^p\) through \(N_{\mathrm{steps}}\) sequential decision blocks. At each step \(i\), the model (a) produces a sparse mask \(M[i] \in [0, 1]^p\) over input features via an attentive transformer, (b) applies the mask to the input \(x \odot M[i]\), (c) feeds the masked input through a shared feature transformer to produce an embedding, and (d) splits the embedding into a decision part (aggregated into the output) and an information part (passed to the next step).

The attentive transformer at step \(i\) is

\[ M[i] = \mathrm{sparsemax}\left(P[i-1] \cdot h_i(a[i-1])\right), \tag{14.18}\]

where \(a[i-1]\) is the information embedding from the previous step, \(h_i\) is a trainable fully-connected + batch-norm block, and \(P[i-1]\) is a prior scale that encourages the network to visit different features at different steps:

\[ P[i] = \prod_{j=1}^{i} (\gamma - M[j]), \tag{14.19}\]

with \(\gamma > 1\) a relaxation hyperparameter (larger \(\gamma\) allows features to be revisited). The \(\mathrm{sparsemax}\) (Martins & Astudillo, 2016) is a variant of softmax that produces exact zeros, so the mask is truly sparse. This sparsity is the source of TabNet’s feature-importance story: at each step the model selects a small subset of columns and routes them through the decision head, and aggregating masks across steps yields per-instance feature importance, plus a global importance via cross-sample averaging.

The final prediction is \(\hat{y} = W_{\mathrm{out}} \sum_{i=1}^{N_{\mathrm{steps}}} \mathrm{ReLU}(d[i])\), where \(d[i]\) is the decision part of step \(i\). A sparsity regularizer is added to the loss to encourage small masks:

\[ \mathcal{L}_{\mathrm{sparse}} = \lambda_{\mathrm{sparse}} \cdot \frac{1}{N_{\mathrm{steps}} \cdot n} \sum_{b, i, j} -M_{b, i, j} \log(M_{b, i, j} + \epsilon). \tag{14.20}\]

The pytorch-tabnet library implements this faithfully. We run it on a Taiwan subsample to keep training under 90 seconds.

Show code
from pytorch_tabnet.tab_model import TabNetClassifier

# downsample to keep training fast for demonstration
sub = taiwan.sample(n=8000, random_state=0).reset_index(drop=True)
sub_tr, sub_va, sub_te = train_valid_test_split(sub, y_col="default", seed=42)
sub_feat = [c for c in sub.columns if c != "default"]
sub_scaler = StandardScaler().fit(sub_tr[sub_feat])

def sub_prep(df):
    X = sub_scaler.transform(df[sub_feat]).astype("float32")
    y = df["default"].values.astype("int64")
    return X, y

Xstr, ystr = sub_prep(sub_tr)
Xsva, ysva = sub_prep(sub_va)
Xste, yste = sub_prep(sub_te)

tabnet = TabNetClassifier(
    n_d=8, n_a=8, n_steps=3, gamma=1.3, lambda_sparse=1e-3,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    seed=0, verbose=0, device_name="cpu",
)

t0 = time.time()
tabnet.fit(
    Xstr, ystr,
    eval_set=[(Xsva, ysva)],
    max_epochs=30, patience=6,
    batch_size=512, virtual_batch_size=128,
)
p_te_tn = tabnet.predict_proba(Xste)[:, 1]
print(f"TabNet AUC {roc_auc_score(yste, p_te_tn):.4f}  "
      f"KS {ks_statistic(yste, p_te_tn):.4f}  "
      f"time {time.time() - t0:.1f}s")

Early stopping occurred at epoch 16 with best_epoch = 10 and best_val_0_auc = 0.74056
TabNet AUC 0.7524  KS 0.4310  time 1.8s
Show code
imp = pd.Series(tabnet.feature_importances_, index=sub_feat).sort_values(ascending=False)
imp.head(10).round(4)
PAY_0        0.1865
PAY_2        0.1312
BILL_AMT1    0.0806
PAY_5        0.0692
BILL_AMT6    0.0688
PAY_AMT2     0.0618
LIMIT_BAL    0.0586
PAY_AMT6     0.0531
PAY_AMT4     0.0456
MARRIAGE     0.0438
dtype: float64

TabNet ranks the repayment-delay features and the most recent bill amounts at the top, in line with the known drivers on this dataset (Yeh & Lien, 2009). Per-sample masks are available via tabnet.explain() and give instance-wise feature attribution that has the same use as SHAP values (Chapter 22).

14.7.2 FT-Transformer architecture

Gorishniy et al. (2021) observed that the natural unit in a table is a cell \((i, j)\), not a row. FT-Transformer tokenizes every feature to a \(d\)-dimensional embedding, prepends a learnable [CLS] token, and runs the resulting sequence through standard transformer blocks. The numeric tokenizer for feature \(j\) is

\[ T_j(x_j) = x_j \cdot w_j + b_j, \tag{14.21}\]

with \(w_j, b_j \in \mathbb{R}^d\) learnable. The categorical tokenizer is a lookup table \(T_j(c) = e_{j, c} \in \mathbb{R}^d\). The [CLS] token is \(c \in \mathbb{R}^d\) learnable. The token sequence is

\[ T(x) = [ c, T_1(x_1), T_2(x_2), \ldots, T_p(x_p) ] \in \mathbb{R}^{(p+1) \times d}. \tag{14.22}\]

Each transformer block computes pre-norm multi-head self-attention followed by a feedforward network:

\[ h' = h + \mathrm{MHA}(\mathrm{LN}(h)), \qquad h'' = h' + \mathrm{FFN}(\mathrm{LN}(h')), \tag{14.23}\]

with multi-head attention \(\mathrm{MHA}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H) W^O\) and \(\mathrm{head}_k = \mathrm{softmax}(Q W_k^Q (K W_k^K)^\top / \sqrt{d_k}) V W_k^V\). The final prediction head reads the [CLS] token:

\[ \hat{y} = W_{\mathrm{head}} \cdot \mathrm{LN}(h^{(L)}_{[\mathrm{CLS}]}). \tag{14.24}\]

A simplified FT-Transformer fits in sixty lines of PyTorch and is trainable in under 90 seconds on Taiwan at 8,000 rows.

Show code
class NumericTokenizer(nn.Module):
    def __init__(self, n_features, d):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_features, d) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_features, d))

    def forward(self, x):                    # x: (B, F)
        return x.unsqueeze(-1) * self.weight + self.bias  # (B, F, d)

class TransformerBlock(nn.Module):
    def __init__(self, d, heads=4, ff_mul=2, p_drop=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d); self.ln2 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True, dropout=p_drop)
        ff = max(d * ff_mul, 32)
        self.ffn = nn.Sequential(
            nn.Linear(d, ff), nn.GELU(), nn.Dropout(p_drop), nn.Linear(ff, d),
        )
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop(a)
        x = x + self.drop(self.ffn(self.ln2(x)))
        return x

class FTTransformer(nn.Module):
    def __init__(self, n_features, d=16, depth=2, heads=4, p_drop=0.1):
        super().__init__()
        self.tokenizer = NumericTokenizer(n_features, d)
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        self.blocks = nn.ModuleList([TransformerBlock(d, heads, 2, p_drop) for _ in range(depth)])
        self.head = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 1))

    def forward(self, x):
        tok = self.tokenizer(x)                          # (B, F, d)
        cls = self.cls.expand(tok.size(0), -1, -1)       # (B, 1, d)
        h = torch.cat([cls, tok], dim=1)
        for block in self.blocks:
            h = block(h)
        return self.head(h[:, 0]).squeeze(-1)
Show code
torch.manual_seed(0)
ftt = FTTransformer(n_features=Xstr.shape[1], d=16, depth=2, heads=4, p_drop=0.1)
opt = torch.optim.AdamW(ftt.parameters(), lr=3e-4, weight_decay=1e-5)
loss_fn = nn.BCEWithLogitsLoss()
Xstr_t = torch.tensor(Xstr); ystr_f = torch.tensor(ystr.astype("float32"))
Xsva_t = torch.tensor(Xsva); ysva_np = ysva.astype("float32")
Xste_t = torch.tensor(Xste)

best_auc, best_state, stale = 0.0, None, 0
t0 = time.time()
for epoch in range(25):
    ftt.train()
    perm = torch.randperm(len(Xstr_t))
    for i in range(0, len(Xstr_t), 256):
        idx = perm[i:i + 256]
        opt.zero_grad()
        loss = loss_fn(ftt(Xstr_t[idx]), ystr_f[idx])
        loss.backward(); opt.step()
    ftt.eval()
    with torch.no_grad():
        p_va = torch.sigmoid(ftt(Xsva_t)).numpy()
    auc = roc_auc_score(ysva_np, p_va)
    if auc > best_auc + 1e-4:
        best_auc = auc
        best_state = {k: v.clone() for k, v in ftt.state_dict().items()}
        stale = 0
    else:
        stale += 1
        if stale >= 5: break

ftt.load_state_dict(best_state)
ftt.eval()
with torch.no_grad():
    p_te_ftt = torch.sigmoid(ftt(Xste_t)).numpy()
print(f"FT-Transformer AUC {roc_auc_score(yste, p_te_ftt):.4f}  "
      f"KS {ks_statistic(yste, p_te_ftt):.4f}  "
      f"time {time.time() - t0:.1f}s")
FT-Transformer AUC 0.7895  KS 0.4490  time 21.5s

On an 8,000-row subsample the from-scratch FT-Transformer matches or beats XGBoost on the same subsample, at roughly 10 to 15 seconds of training time. The full-data picture (30,000 rows) narrows; at 8,000 rows, variance is high and a lucky seed is easy to mistake for a real win. We discuss this more in the next section.

TabTransformer (Huang et al., 2020) is a predecessor that tokenizes only categorical features and concatenates them with raw numerics. NODE (Popov et al., 2020) uses differentiable oblivious decision trees as the core building block. All three architectures converge to similar performance on large tabular benchmarks; FT-Transformer is the cleanest design and tends to be the reference baseline in the 2020s literature.

14.8 Double descent on tabular credit

The classical bias-variance picture says test error is U-shaped in model capacity: too small underfits, too large overfits. Belkin et al. (2019) showed that this curve is incomplete. Past the interpolation threshold, the point at which a model has just enough capacity to fit the training labels exactly, test error frequently drops again, sometimes below the classical sweet spot. Nakkiran et al. (2020) generalized the phenomenon to deep nets and identified three flavors:

  • Model-wise double descent. Sweep capacity (width, depth, parameter count) at fixed sample size. Test error rises as capacity approaches the interpolation threshold (\(\,p \approx n\,\) effective parameters), peaks there, and falls in the overparameterized regime.
  • Sample-wise double descent. At fixed capacity, increasing \(n\) can hurt test error in the regime where \(n\) approaches the model’s parameter count, before improving again as \(n\) grows past it. The relevant ratio is \(n/p_\mathrm{eff}\), not \(n\) alone.
  • Epoch-wise double descent. With no early stopping, validation loss can rise then fall again as training proceeds past the point of zero training loss.

Whether any of this matters in production credit scoring depends entirely on where the portfolio sits relative to the interpolation threshold. The classical regime (regularized GBM, \(n \gg p_\mathrm{eff}\), early stopping on a held-out fold) suppresses all three flavors. The danger zone is small samples, wide feature engineering (transaction embeddings, text features, large categorical cardinality with one-hot encoding), and underregularized deep nets trained to interpolation. Three concrete credit settings live in that zone:

  • Thin portfolios. New product launches, niche subprime segments, or country-level rollouts with \(n\) in the thousands. Augmenting Taiwan-style features with one-hot encoded merchant categories or transaction embeddings can push the effective parameter count close to \(n\).
  • Reject inference with augmentation. Parcelling, reweighting, and the M-step of EM-style reject inference (Chapter 10) inflate the effective sample but inflate the variance of features more. The \(n/p_\mathrm{eff}\) ratio shrinks even as the row count grows.
  • Deep tabular nets without weight decay or early stopping. A wide FT-Transformer or NODE trained to interpolation on a thin portfolio is the canonical setup for an epoch-wise descent curve. The width sweep below makes the model-wise version visible on the Taiwan dataset.

We reproduce model-wise double descent on Taiwan by subsampling to 800 rows, adding 15% label noise (which Belkin et al. and Nakkiran et al. both used to make the peak unmissable), and sweeping the width of a two-layer MLP from underparameterized to heavily overparameterized.

Show code
rng_dd = np.random.default_rng(0)
n_sub = 800
sub_idx = rng_dd.choice(len(X_tr), size=n_sub, replace=False)
X_dd = X_tr[sub_idx].astype("float32")
y_dd = y_tr[sub_idx].astype("int64")
flip = rng_dd.random(n_sub) < 0.15
y_dd_noisy = np.where(flip, 1 - y_dd, y_dd).astype("float32")

X_dd_t = torch.tensor(X_dd)
y_dd_t = torch.tensor(y_dd_noisy)
X_te_dd = torch.tensor(X_te.astype("float32"))
print(f"n={n_sub}, p_in={X_dd.shape[1]}, "
      f"label-flip rate={flip.mean():.3f}")
n=800, p_in=23, label-flip rate=0.141
Show code
class TwoLayerMLP(nn.Module):
    def __init__(self, p_in, h):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(p_in, h), nn.ReLU(),
            nn.Linear(h, h), nn.ReLU(),
            nn.Linear(h, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

widths = [2, 4, 8, 16, 24, 32, 48, 64, 128, 256, 512, 1024]
records = []
for h in widths:
    torch.manual_seed(0)
    net = TwoLayerMLP(X_dd.shape[1], h)
    opt = torch.optim.AdamW(net.parameters(), lr=3e-3, weight_decay=0.0)
    loss_fn = nn.BCEWithLogitsLoss()
    for epoch in range(4000):
        net.train()
        opt.zero_grad()
        loss = loss_fn(net(X_dd_t), y_dd_t)
        loss.backward(); opt.step()
        if loss.item() < 1e-4:
            break
    net.eval()
    with torch.no_grad():
        p_tr = torch.sigmoid(net(X_dd_t)).numpy()
        p_te = torch.sigmoid(net(X_te_dd)).numpy()
    n_params = sum(p.numel() for p in net.parameters())
    records.append({
        "width": h,
        "n_params": n_params,
        "train_loss": float(loss.item()),
        "train_auc_noisy": roc_auc_score(y_dd_noisy, p_tr),
        "test_auc_clean":  roc_auc_score(y_te, p_te),
        "epochs": epoch + 1,
    })

ddesc = pd.DataFrame(records)
ddesc.round(4)
width n_params train_loss train_auc_noisy test_auc_clean epochs
0 2 57 0.5074 0.7243 0.5882 4000
1 4 121 0.4406 0.8347 0.6462 4000
2 8 273 0.1948 0.9727 0.5595 4000
3 16 673 0.0009 1.0000 0.5701 4000
4 24 1201 0.0001 1.0000 0.5896 3962
5 32 1857 0.0001 1.0000 0.5933 3579
6 48 3553 0.0001 1.0000 0.5947 2667
7 64 5761 0.0001 1.0000 0.6074 2244
8 128 19713 0.0001 1.0000 0.6145 1761
9 256 72193 0.0001 1.0000 0.6131 1445
10 512 275457 0.0001 1.0000 0.6191 1191
11 1024 1075201 0.0001 1.0000 0.6191 1316
Show code
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(ddesc["n_params"], ddesc["test_auc_clean"], marker="o", label="Test AUC (clean)")
ax.plot(ddesc["n_params"], ddesc["train_auc_noisy"], marker="s", linestyle="--",
        alpha=0.6, label="Train AUC (noisy labels)")
ax.axvline(n_sub, color="grey", linestyle=":", label=f"$n_\\text{{params}} = n = {n_sub}$")
ax.set_xscale("log")
ax.set_xlabel("Trainable parameters")
ax.set_ylabel("AUC")
ax.set_title("Model-wise double descent")
ax.legend(loc="lower right")
plt.tight_layout()
Figure 14.1: Model-wise double descent on Taiwan (\(n=800\), 15% label noise). Test AUC on the clean held-out test set rises into the interpolation peak around \(n_\text{params}\approx n\), then improves again as the network is widened past the threshold. The dashed vertical line marks the interpolation threshold (parameter count equal to training-set size).

The visible peak in test AUC is the interpolation threshold. Below it the network underfits even the clean signal. At the threshold it perfectly memorizes noisy labels and pays for it on the test set. Past it, the implicit bias of gradient descent toward minimum-norm interpolators kicks in and test AUC recovers. The same effect appears, less dramatically, when label noise is replaced by feature noise or by a small genuine \(n/p\) ratio without any added noise.

Three practical implications for credit modelling:

  • Do not stop the width sweep at the first validation peak. If the validation curve rises after a small initial improvement, the model may be sitting at the interpolation peak rather than the optimum. Push capacity up by an order of magnitude before concluding the architecture is too large.
  • Weight decay and early stopping suppress double descent. Nakkiran et al. (2021) proved (in linear regression) that the optimally tuned ridge penalty produces a monotone test-error curve and matches the optimum of the second descent. In credit, the cheapest defence is weight_decay \(\in [10^{-4}, 10^{-2}]\) on AdamW plus early stopping on validation AUC.
  • Re-run capacity sweeps when data shifts. The interpolation threshold is a function of \(n\) and \(p_\mathrm{eff}\). After a reject-inference augmentation, after adding a transaction-embedding block, or after a portfolio expansion that changes \(n\) by more than \(2\times\), the previous “right size” is stale.

Double descent is not an LLM-only phenomenon. It is visible in linear regression, kernel methods, random forests, and gradient boosting (Belkin et al., 2019), and shows up in credit scoring whenever a high-capacity model is trained without regularization on a sample close to its interpolation threshold. The reason it is rarely seen in production credit dashboards is that the standard recipe (regularized GBM, large \(n\), early stopping) avoids the regime by construction, not because the phenomenon does not apply.

14.9 Deep learning vs gradient boosting on tabular credit

The empirical case is close to decided. Grinsztajn et al. (2022) ran a head-to-head evaluation on 45 medium-sized tabular datasets with a unified hyperparameter search budget and reported:

  • On numerical-only datasets, gradient-boosted trees (GBT) beat the best deep learning model by about 2.5 percentage points of accuracy on average.
  • Even when categorical features are included, trees maintain a 1 to 2 percentage point edge after tuning.
  • The gap grows on irregular-target functions (piecewise-constant) and shrinks on smooth targets.
  • The gap shrinks with more data and more tuning budget; it does not fully close.

Gorishniy et al. (2021), running FT-Transformer against CatBoost and XGBoost on eleven tabular benchmarks, found FT-Transformer competitive and sometimes winning, but only after careful tuning and large tuning budgets. Shwartz-Ziv & Armon (2022), working with practitioners at Intel’s AI research group, reported the same thing in production settings: deep models beat trees only when the sample size is very large, the features are homogeneous, and an engineer has the appetite to tune.

For credit scoring specifically, Lessmann et al. (2015)’s 2015 update put neural nets around the middle of their 41-classifier ranking. The ones that ranked top were heterogeneous ensembles (averages of diverse learners), gradient boosting, and random forests. Later benchmarks (Moscato et al., 2021) and comprehensive surveys (Borisov et al., 2024) confirm: on canonical credit tables with \(n\) in the tens of thousands, a tuned GBM is at least as good as any pure deep model, is orders of magnitude faster to train, has a mature tooling ecosystem (monotone constraints, categorical native handling, SHAP), and is easier to document for regulators.

The case for deep learning in credit is therefore not “deep is better on tables.” It is: deep learning is required when at least one of

  • input is a genuine sequence (transactions, loan payment histories),
  • input is a genuine graph (customer-to-device, firm-to-firm SME networks, Chapter 31),
  • input includes free text (bureau memos, call transcripts, Chapter 29),
  • the deployment target is a joint model over modalities that tree ensembles cannot handle.

14.9.1 Head-to-head on Taiwan: a summary

We collect the tabular numbers from the earlier sections and add a final XGBoost run tuned a bit more, to give the comparison a fair shot.

Show code
xgb_tuned = xgb.XGBClassifier(
    n_estimators=600, max_depth=5, learning_rate=0.03,
    min_child_weight=5, subsample=0.8, colsample_bytree=0.8,
    reg_alpha=1e-3, reg_lambda=1.0,
    eval_metric="logloss", random_state=0, n_jobs=1, verbosity=0,
)
xgb_tuned.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
p_te_xgb_t = xgb_tuned.predict_proba(X_te)[:, 1]

# refit MLP on the full training set
torch.manual_seed(0)
mlp_full, _ = train_mlp(X_tr, y_tr, X_va, y_va, max_epochs=40, patience=6)
p_te_mlp_full = score(mlp_full, X_te)

comparison = pd.DataFrame([
    {"model": "LogReg",          "AUC": roc_auc_score(y_te, p_te_lr)},
    {"model": "XGB (default)",   "AUC": roc_auc_score(y_te, p_te_xgb)},
    {"model": "XGB (tuned)",     "AUC": roc_auc_score(y_te, p_te_xgb_t)},
    {"model": "MLP (p_drop=0.3)", "AUC": roc_auc_score(y_te, p_te_mlp_full)},
]).round(4)
comparison
model AUC
0 LogReg 0.7271
1 XGB (default) 0.7706
2 XGB (tuned) 0.7706
3 MLP (p_drop=0.3) 0.7698

Interpretation. A tuned XGBoost sits at the top. The MLP is close to the default XGBoost. Logistic regression is meaningfully behind but not catastrophically so, which is the usual pattern on this dataset. On other credit portfolios we have worked on (prime and near-prime retail, 100,000 to 1,000,000 rows), the ranking repeats: XGBoost/LightGBM first, MLP or FT-Transformer second, logistic third. The point is not that neural nets are useless; it is that they are expensive, and any decision to deploy one instead of a GBM needs a business reason beyond AUC.

14.10 Scalability

Neural network training on tabular credit data is bottlenecked by the CPU-to-GPU transfer, not the matmul. For datasets below a million rows, training fits on a laptop CPU in under a minute. At larger scales the practical options are:

  • Move to a single GPU. PyTorch’s DataLoader(num_workers=0) is typically sufficient; multiprocess loading rarely helps for pre-scaled tabular inputs.
  • Use torch.compile on PyTorch 2.0+. The speedup on small MLPs is modest (20 to 40%) because the compiled kernel is already simple.
  • For production batch scoring, export to ONNX (see deployment). ONNX Runtime on CPU is usually 2 to 4x faster than native PyTorch for small MLPs because it fuses kernels.

Data size scaling for pre-processing:

  • pandas handles Taiwan (30k rows) and German (1k rows) trivially.
  • Polars is 3 to 10x faster on feature pipelines for a million-row Home Credit sample.
  • Dask is appropriate when the raw feature matrix exceeds memory and you still want a single-node training run; use dask.dataframe to produce a standardized parquet, then load per-batch via PyTorch IterableDataset.
  • Spark is justified only for institution-scale streaming reprocessing (daily ETL on the full card book). For a model-training step, Spark is almost always overkill relative to a day of single-node GPU training.

For recurrent models on transaction streams, the relevant scaling axis is sequence length. An LSTM with 512 units on sequences of length 1000 costs 512M flops per customer per forward pass, which is tractable on a single GPU for a 10M-customer book if training epochs are kept below 20. For longer sequences, a temporal convolutional network (Bai et al., 2018) or a linear-attention transformer is more efficient than a full LSTM.

14.11 Deployment

A neural credit model has three deployment artifacts beyond a standard sklearn pipeline: the model weights, the input-preprocessing pipeline (scaler, encoders), and an inference runtime that replicates training-time behavior exactly. The third is the trickiest. In PyTorch, forgetting to call model.eval() before serving disables BatchNorm’s moving averages and keeps Dropout active, which produces correct-looking but noisy predictions. ONNX export is the cleanest fix because the exported graph bakes in inference-mode behavior.

14.11.1 ONNX export and runtime

We export the Taiwan MLP to ONNX, load it in onnxruntime, and check that predictions match PyTorch to float precision.

Show code
import tempfile, onnxruntime as ort

mlp.eval()
# wrap with a sigmoid so the ONNX output is a probability
class ProbMLP(nn.Module):
    def __init__(self, base):
        super().__init__()
        self.base = base
    def forward(self, x):
        return torch.sigmoid(self.base(x))

prob_mlp = ProbMLP(mlp); prob_mlp.eval()
dummy = torch.tensor(X_te[:1])
onnx_path = tempfile.mktemp(suffix=".onnx")

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    torch.onnx.export(
        prob_mlp, dummy, onnx_path,
        input_names=["x"], output_names=["pd"],
        dynamic_axes={"x": {0: "batch"}, "pd": {0: "batch"}},
        opset_version=14, dynamo=False,
    )

session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
batch = X_te.astype("float32")
pd_onnx = session.run(None, {"x": batch})[0].ravel()

with torch.no_grad():
    pd_torch = prob_mlp(torch.tensor(batch)).numpy().ravel()

print(f"max |ONNX - PyTorch| = {np.abs(pd_onnx - pd_torch).max():.2e}")
print(f"ONNX AUC: {roc_auc_score(y_te, pd_onnx):.4f}  "
      f"PyTorch AUC: {roc_auc_score(y_te, pd_torch):.4f}")
max |ONNX - PyTorch| = 1.19e-07
ONNX AUC: 0.7698  PyTorch AUC: 0.7698

The maximum absolute discrepancy between ONNX Runtime and PyTorch is at float32 precision, the AUCs match to four decimals. ONNX is thus a reliable production runtime for PyTorch MLPs and for most transformer blocks. For TabNet the ONNX path is more involved; the pytorch-tabnet library supports a TorchScript export that is easier to stabilize.

The standard FastAPI wrapper for an ONNX model is:

# inference.py (sketch, not run here)
from fastapi import FastAPI
from pydantic import BaseModel
import onnxruntime as ort
import numpy as np, joblib

app = FastAPI()
session = ort.InferenceSession("mlp.onnx", providers=["CPUExecutionProvider"])
scaler = joblib.load("scaler.pkl")

class Row(BaseModel):
    features: list[float]

@app.post("/pd")
def pd_endpoint(row: Row):
    x = scaler.transform(np.asarray(row.features, dtype="float32").reshape(1, -1))
    pd_score = session.run(None, {"x": x.astype("float32")})[0].ravel()[0]
    return {"pd": float(pd_score)}

Pair this with MLflow logging at training time (mlflow.pytorch.log_model plus mlflow.onnx.log_model) so the scaler, the ONNX graph, and the training metrics live under one run ID. Chapter 38 covers the full MLOps story, including A/B testing and shadow deployment of a challenger.

14.11.2 Reproducibility

Neural networks are harder to reproduce than GBMs. Seeds on numpy, torch, and torch.cuda (if used) are necessary but not sufficient: operations such as torch.backends.cudnn.benchmark=True produce non-deterministic kernels. For regulated deployment:

  • Set all seeds (numpy, torch, random).
  • Set torch.use_deterministic_algorithms(True) and torch.backends.cudnn.deterministic = True.
  • Pin the framework version in the ONNX metadata.
  • Log the random state of the dataloader’s shuffler alongside the model.

With these discipline layers, the numerical predictions are reproducible across runs and across machines. Without them, two “identical” training runs can disagree on PD by a few percent on a tail of customers, which is unacceptable for SR 11-7 challenger-model comparisons.

14.12 Regulatory considerations

Regulators are not hostile to neural networks; they are hostile to opacity. SR 11-7, the Federal Reserve’s Guidance on Model Risk Management, is neutral on model class and insistent on governance around (a) conceptual soundness, (b) ongoing monitoring, and (c) outcomes analysis. A neural network passes conceptual soundness if you can motivate the architecture choice, document the regularization, and demonstrate via out-of-time and out-of-sample tests that the model is stable. “The authors used two hidden layers because deeper was harder to train” is a valid conceptual justification; “the architecture is state-of-the-art” is not.

The specific pinch points for deep models are:

  • SR 11-7 model inventory and validation. The validator needs a model description that unambiguously identifies the trained artifact. For a neural net this is the ONNX graph plus the preprocessing pipeline plus the random seeds. SHA-256 hash everything and track it in the inventory.
  • ECOA / Regulation B adverse-action notices. When a credit application is denied, the creditor must provide up to four principal reasons. A neural net does not emit these natively. SHAP values on the trained model (Chapter 22) are the current industry practice. CFPB has been clear that “the model is a neural network” is not a valid reason code.
  • Fair lending. A sophisticated model with high predictive power can still produce disparate impact. The EU AI Act’s Annex III lists credit scoring as a high-risk application, requiring bias monitoring, human oversight, and technical documentation. Chapter 27 and Chapter 28 cover the fairness pipeline in depth; for neural nets, the key additional step is computing per-group calibration and adverse-action rate plots on a protected-class-augmented validation set.
  • Basel II/III IRB use test. An IRB-approved PD model must be in actual use for credit decisions, not just for capital computation. This has historically been hard for complex models because the underwriter cannot reason about the PD at the loan-officer level. The path for deep models is: train a neural net as the challenger, keep a logistic or a GBM with SHAP explanations as the champion, and trigger upgrades to the neural net only when the gap on a holdout is statistically and economically large.
  • GDPR Article 22. A customer has the right to meaningful information about the logic involved in automated decisions. “Meaningful” has not been fully tested in court. A SHAP waterfall plot with the top three drivers per customer is the current best-practice interpretation.
  • EU AI Act. Credit-scoring models serving EU-resident consumers are high-risk. By 2026 you will need conformity assessments, technical documentation, a post-market monitoring plan, and registration in the EU database of high-risk AI systems. Neural-net credit models are not exempt; they require exactly the same documentation as any other high-risk system, except that the “logic involved” section is harder to write.

A practical template for the SR 11-7 documentation of a neural credit model:

  • Architecture: layer list, activation choices, regularization, loss, optimizer. Cite the papers. Motivate every non-default choice.
  • Training data: time period, sample size, default rate, exclusions, augmentations. Data lineage hash.
  • Validation: out-of-sample AUC/KS/Brier, calibration curve, per-segment performance, stability over time.
  • Benchmarks: at least one linear (logistic) and one tree (GBM) challenger. Report the head-to-head and justify the neural net against both.
  • Monitoring: monthly AUC, PSI on input distributions (Chapter 4), alert thresholds, escalation process.
  • Explainability: SHAP-based adverse-action pipeline, MC-dropout uncertainty band if deployed.
  • Fallback: what happens if ONNX inference fails (the FastAPI service should return a cached logistic-regression PD).

14.13 Worked example: integrated credit stack

To close the chapter, we assemble the pieces into a single tabular+uncertainty pipeline that we would be comfortable describing to a model risk committee.

Show code
def build_stack(X_tr, y_tr, X_va, y_va):
    torch.manual_seed(0)
    # champion: a tuned GBM
    gbm = xgb.XGBClassifier(
        n_estimators=500, max_depth=4, learning_rate=0.05,
        min_child_weight=5, subsample=0.8, colsample_bytree=0.8,
        eval_metric="logloss", random_state=0, n_jobs=1, verbosity=0,
    )
    gbm.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)

    # challenger: a dropout MLP
    mlp_model, _ = train_mlp(X_tr, y_tr, X_va, y_va, p_drop=0.3, max_epochs=40, patience=6)
    return gbm, mlp_model

champion, challenger = build_stack(X_tr, y_tr, X_va, y_va)

p_champ = champion.predict_proba(X_te)[:, 1]
p_chal_mean, p_chal_std = mc_dropout_score(challenger, X_te, n_samples=30)

stack = pd.DataFrame({
    "PD_champion_gbm": p_champ,
    "PD_challenger_mean": p_chal_mean,
    "PD_challenger_std":  p_chal_std,
    "label": y_te,
})
print(stack.describe().round(4).loc[["mean", "std", "50%", "max"]])
      PD_champion_gbm  PD_challenger_mean  PD_challenger_std   label
mean           0.2263              0.2287             0.0495  0.2190
std            0.2065              0.1890             0.0186  0.4136
50%            0.1518              0.1642             0.0479  0.0000
max            0.9548              0.9499             0.1672  1.0000
Show code
def metric_row(name, y, p):
    return {
        "model": name,
        "AUC":   roc_auc_score(y, p),
        "KS":    ks_statistic(y, p),
        "Brier": brier_score_loss(y, p),
    }

pd.DataFrame([
    metric_row("Champion (XGB)", y_te, p_champ),
    metric_row("Challenger (MLP)", y_te, p_chal_mean),
]).round(4)
model AUC KS Brier
0 Champion (XGB) 0.7694 0.4201 0.136
1 Challenger (MLP) 0.7702 0.4163 0.136

Champion and challenger agree closely on ranking. The challenger’s MC-dropout std gives you a per-customer uncertainty that the GBM cannot produce natively. For customers where champion and challenger disagree by more than (say) 0.05 on PD and the challenger std is in the top quintile, the review queue fires. This is the operational value of a neural challenger model: it is a cheap sensor for model-risk-relevant disagreement, not a replacement for the GBM.

14.14 Vietnam and emerging markets

14.14.1 Market context

Vietnam’s consumer and SME lending market runs on thinner labeled samples than the benchmarks in the deep-tabular literature assume. The Credit Information Center provides the spine of bureau data and reports on coverage each year, but buy-now-pay-later, peer-to-peer, and consumer-finance exposures are partially outside its view (Credit Information Center of Vietnam, 2023). Findex 2021 put Vietnam below its regional peers on adult account ownership, with informal channels still covering a substantial share of household borrowing (World Bank, 2022). The SBV supervises banks through Circular 41/2016/TT-NHNN’s Basel II standardized approach (State Bank of Vietnam, 2016), capital adequacy amendments through Circular 22/2023/TT-NHNN (29 Dec 2023) amending Circular 41/2016 (State Bank of Vietnam, 2023), consumer finance through Circular 43/2016/TT-NHNN on consumer lending by finance companies, and digital onboarding through Circular 16/2020/TT-NHNN (State Bank of Vietnam, 2020). Decree 13/2023/ND-CP is the first comprehensive personal-data protection regime and imposes consent, purpose-limitation, and cross-border transfer obligations that bite on the alternative data a deep model wants to consume (Government of Vietnam, 2023). The SBV fintech sandbox under Decree 94/2025/ND-CP formalizes a controlled-testing path for novel scoring approaches but demands an explicit description of data, methods, and monitoring (Government of Vietnam, 2025; State Bank of Vietnam, 2024). The IMF’s 2024 Article IV flagged thin data and rapid non-bank credit growth (International Monetary Fund, 2024), and BIS EMDE work is consistent (Bank for International Settlements, 2022, 2023). The Asian Development Bank’s Southeast Asia review highlights the mobile channel as the dominant driver of new borrower flow (Asian Development Bank, 2023).

14.14.2 Application considerations

Thin-sample overfit is the defining problem for a neural credit model in Vietnam. An MLP with two hidden layers of 128 units has about 20,000 parameters; a vintage of 50,000 applications with 2,500 positives pushes the ratio of positives per parameter under 0.2, which is well into the regime where the network memorizes noise on the training fold and collapses on an out-of-time hold-out. The empirical recipe that has held up in Vietnamese work is unglamorous: start from a logistic baseline, add one hidden layer of 32 or 64 units, apply dropout of 0.3 or more, weight-decay in the 1e-3 to 1e-2 range, and early stopping with a patience of five to ten epochs (Tran et al., 2021). Two hidden layers are defensible at 100,000-plus rows. Deeper networks rarely pay for themselves below a million rows on Vietnamese retail data.

Three architecture-specific points. First, transformer-style tabular models (FT-Transformer, TabNet) need more data than MLPs to show their lift; on Vietnamese consumer vintages the MLP is usually the right deep baseline, and an FT-Transformer becomes credible only when application, behavioral, and transaction features are stacked into a single training set of several million rows. Second, sequence architectures (LSTM, Transformer) are the right choice for transaction-stream features once a mobile-money or card issuer crosses two to five million monthly active users, because the underlying modality is inherently sequential and tree-based featurization throws away ordering (Björkegren & Grissen, 2020). Third, autoencoder-based anomaly detection is an attractive fraud layer for Vietnamese eKYC flows: it does not require fraud labels and can be trained on the confirmed-good population, which is what most finance companies have early in a product launch.

Calibration and uncertainty are where a neural challenger earns its keep. MC-dropout gives a per-applicant uncertainty band that no gradient-boosted tree produces natively, and for IFRS 9 stage-2 transfer decisions that uncertainty is directly useful. Post-hoc isotonic calibration on an out-of-time fold is mandatory; an uncalibrated neural PD cannot feed an ECL computation.

14.14.3 Rationalization

Why run a neural model at all on a Vietnamese portfolio. Four reasons. First, as a challenger in the SR 11-7-style governance package that the SBV’s sandbox now expects (Government of Vietnam, 2025). Second, to absorb modalities that a tree cannot: card or mobile-money transaction sequences, device fingerprints, narrative text from call-center interactions. Third, to produce per-applicant uncertainty via MC-dropout or an ensemble of heads, which feeds review queues at the margin. Fourth, to build the representation layer that a downstream scorecard can consume (embeddings from a pretrained sequence model plus a logistic scorecard keeps explainability tractable). What the neural model should not be, on a typical consumer-finance vintage of 50,000 to 200,000 rows, is the sole production scorer. The gap to a monotone-constrained boosted tree is small, and the compliance cost of reason-code generation at scale is higher (Gorishniy et al., 2021; Grinsztajn et al., 2022).

14.14.4 Practical notes

Operational defaults for a neural credit model on Vietnamese data. Standardize numerical features on the training fold only. Encode province and employment class with target-rate ordinal encoding; Vietnam’s July 2025 administrative reorganization consolidated 63 provinces into 34 provincial-level units, so any province-fixed-effect scheme must map pre-merger and post-merger codes. Apply dropout of at least 0.3 and weight decay of 1e-3. Use early stopping on the validation AUC with patience five. Calibrate with isotonic regression on an out-of-time fold. Export the trained model to ONNX with a pinned opset and store the preprocessing pipeline and random seeds alongside, because the SR 11-7-style validation expected under the SBV sandbox is reproducibility-first (Government of Vietnam, 2025). Generate SHAP values for the top three drivers per applicant and fold them into a bilingual adverse-action letter under Circular 43/2016/TT-NHNN on consumer lending by finance companies. Monitor PSI by province and channel monthly, because eKYC-originated vintages drift faster than branch vintages (Asian Development Bank, 2023). Retrain quarterly in a normal macroeconomic cycle and faster when uncertainty indicators rise. Keep a logistic or constrained-booster fallback wired into the inference service so that an ONNX failure degrades gracefully to a compliant production path.


14.15 Takeaways

  • Multi-layer perceptrons with AdamW + dropout + early stopping are a credible challenger model on Taiwan-scale tabular credit data. They beat logistic regression and underperform tuned XGBoost, usually by 0.5 to 2 points of AUC.
  • Regularization discipline is the single most important lever on small credit samples. Run a dropout ablation; it exposes whether your net is overfitting in a way summary metrics hide.
  • Dropout doubles as a Monte Carlo uncertainty estimator. Use it to populate a manual-review queue, not to replace point PDs.
  • CNNs on tabular credit data are a curiosity; they win only when the input has genuine locality (mortgage payment matrices, transaction time-frequency representations).
  • LSTMs and their modern cousins are the reason to learn deep learning for credit: transaction sequences, loan-state panels, and behavioral streams are their natural habitat, and trees cannot match them there.
  • Autoencoders are useful as an anomaly-detection signal and as a model-risk monitor, not as a standalone PD.
  • TabNet and FT-Transformer close the tabular-DL gap but rarely overturn a tuned GBM. Use them when you already need deep inference infrastructure for other modalities and want one joint model.
  • On regulation: the documentation burden is higher for neural nets, not the approval bar. SR 11-7, ECOA adverse-action, and EU AI Act Annex III conformity assessments are tractable if you invest in SHAP-based explainability and a proper champion-challenger process.

14.16 Further reading

  • LeCun et al. (2015) for the canonical review of deep learning.
  • Rumelhart et al. (1986) for backpropagation.
  • Hochreiter & Schmidhuber (1997) for LSTMs.
  • Srivastava et al. (2014) for dropout; Gal & Ghahramani (2016) for the Bayesian view.
  • Ioffe & Szegedy (2015) and Ba et al. (2016) for normalization.
  • Kingma & Ba (2015) and Loshchilov & Hutter (2019) for adaptive optimizers and AdamW.
  • Arik & Pfister (2021) and Gorishniy et al. (2021) for tabular deep learning.
  • Grinsztajn et al. (2022) and Shwartz-Ziv & Armon (2022) for the tree-vs-DL empirical debate.
  • Lessmann et al. (2015) for the credit-scoring benchmark of record.
  • Sadhwani et al. (2021) for deep learning on mortgage-state panels.
  • Babaev et al. (2022) for contrastive learning on transaction event sequences.
  • Borisov et al. (2024) for a comprehensive survey of deep tabular models.