3 Data: Sources, Features, and Preprocessing

Scope: both retail and corporate, retail-leaning. Bureau, application, transaction, and alternative-data sources. Worked examples lean retail (UCI German, Taiwan, CIC Vietnam); corporate financial-statement features are covered alongside.

Overview

Credit scoring lives or dies on its inputs. A logistic model trained on the wrong population, a gradient-boosted tree fit to leaky features, or a deep network that imputes missingness with a mean will all fail in production, and they will fail in ways that regulators care about. The modeling choices that textbooks emphasize matter. The data choices matter more.

This chapter takes data seriously. We walk through the traditional sources that sit inside a bank’s scorecard, catalog the alternative signals that have appeared in the past decade, and formalize the preprocessing steps that translate raw tables into model-ready features. Three tools get most of the attention: (1) weight of evidence for monotone encoding, (2) imputation for missingness, and (3) time-aware splitting for leakage control. Each is worked out in math and in code that runs on the public UCI data sets.

The chapter also makes a scalability argument. A scorecard team that only thinks in pandas hits a wall at a few million rows. Polars, Dask, and Spark each solve a different piece of that wall, and weight of evidence encoding is one of the simplest places to show the tradeoff. The last section walks through a classic leakage bug, trains a model on it, and shows the out-of-time hit.

A note for the emerging-market reader. The data stack this chapter describes (bureau file, internal core-banking data, alternative overlays) looks different when the bureau is the Credit Information Center (CIC) rather than Experian, when roughly half the adult population has no tradeline, when declared income comes from cash work rather than payroll, and when the origination channel is a mobile app with an eKYC liveness check rather than a branch visit. The preprocessing decisions that follow, weight-of-evidence binning, missingness treatment, and point-in-time feature construction, have to absorb a higher missingness rate, a shorter tradeline depth, and a heavier reliance on transaction-level cash-flow features. The chapter’s methods are the right methods, but the defaults (bin counts, IV thresholds, imputation strategy) need to be set with the thin-file population in mind.

The intended reader is a senior practitioner or an academic researcher who already understands logistic regression and classical statistical learning. The chapter spends no time re-deriving maximum likelihood for a linear model; it spends most of its time on the joins, the cohort definitions, and the pre-model transformations that separate a demonstration notebook from a production scorecard. The empirical sections lean on UCI German and UCI Taiwan because they are reproducible everywhere, but the methods port cleanly to larger Home Credit, LendingClub, and HMDA samples covered in later chapters.

Notation

We use $X \in \mathcal{X}$ for features and $Y \in \{0, 1\}$ for the default label, with $Y = 1$ denoting a bad. Population rates are $\pi_1 = \Pr(Y = 1)$ and $\pi_0 = 1 - \pi_1$. A binned feature partitions $\mathcal{X}$ into disjoint bins $\{A_j\}_{j=1}^{J}$. For a categorical variable, bins are level groupings. For a numeric variable, bins are intervals. Conditional probabilities are $p_{j \mid 1} = \Pr(X \in A_j \mid Y = 1)$ and $p_{j \mid 0} = \Pr(X \in A_j \mid Y = 0)$.

Show code

import sys, warnings, time
warnings.filterwarnings("ignore")
sys.path.insert(0, "../code")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from creditutils import load_german_credit, load_taiwan_default, stable_sigmoid

rng = np.random.default_rng(42)
np.random.seed(42)

3.1 Traditional data

Banks have collected the same categories of consumer credit data for four decades. Little has changed in the core schema. The four pillars of a traditional retail scorecard are the bureau report (Section 3.1.1), the bank’s internal master file (Section 3.1.2), the application form (Section 3.1.4), and the external overlay such as income verification or fraud flags (Section 3.1.7).

3.1.1 The bureau report

A consumer credit bureau is a private clearinghouse. It ingests monthly tradeline updates from thousands of furnishers, normalizes them into a canonical schema, and sells reports and scores back to lenders. In the United States the three nationwide bureaus are Equifax, Experian, and TransUnion (Avery et al., 2003). The Fair Credit Reporting Act (FCRA) governs what they can collect, how long they can keep it, and what consumers can dispute. Europe runs a mix of positive and negative bureaus by country. China operates the Credit Reference Center of the People’s Bank of China alongside several private bureaus.

A bureau report breaks into five sections that every modern scorecard touches:

Identification: name, date of birth, social security or national identifier, current and prior addresses. Used to link the report to the application and to detect identity fraud.
Tradelines: one row per active or closed credit account. Each tradeline has an opening date, an account type (revolving, installment, mortgage, open), a credit limit or original balance, a current balance, a minimum payment, and a 24-month payment history string such as OK OK OK 30 60 OK OK .... These strings are the raw material for every delinquency-based feature.
Inquiries: every hard pull in the past two years, with date and subscriber name. A burst of inquiries in the last 30 days is a strong short-horizon risk signal.
Public records: bankruptcies, tax liens, civil judgments. Post-NCAP changes in the United States, most judgments and tax liens no longer appear, which reshaped the public record feature bank after 2017.
Collections: charged-off accounts placed with third-party collectors. Often shown with original creditor, collection agency, and charge-off balance.

The FICO score, the dominant consumer credit score in the United States, derives from bureau data only. Myfico.com publishes the category weights: payment history (roughly 35 percent), amounts owed (30 percent), length of credit history (15 percent), new credit (10 percent), credit mix (10 percent). The underlying algorithm is proprietary, but its inputs are public knowledge and follow a small number of archetypes. Utilization is the ratio of revolving balance to limit, computed per tradeline and aggregated to the file level. Delinquency depth is the worst 24-month payment code on each line, rolled up to file level as the fraction of tradelines that were 30+, 60+, or 90+ in the last 6, 12, or 24 months. Age features include the age of the oldest account and the average age of open tradelines.

VantageScore, the joint bureau product, and the proprietary scorecards that large lenders build in-house use the same tradeline and inquiry data but different binning, weighting, and target definitions. A common pattern inside a bank is to stack an in-house behavioral score on top of a bureau score, so the scorecard captures both the generic credit-file signal and the account-specific behavior that the bureau does not see.

The specific feature vocabulary is remarkably stable across bureaus and across decades. A partial list of the archetype features a scorecard developer can expect to find useful:

bureau_score: the FICO or VantageScore on file at the observation date.
oldest_tradeline_age_months: age of the oldest account. Tracks length of credit history.
avg_open_tradeline_age_months: average age of open tradelines. Captures both length and churn.
utilization_revolving: sum of balance divided by sum of limit across open revolving lines.
utilization_maximum_tradeline: the maximum utilization across any single revolving tradeline.
num_tradelines_30dpd_12m: count of tradelines that reached 30+ days past due in the last 12 months.
num_tradelines_60dpd_24m: the 60+ DPD analog over 24 months.
num_inquiries_6m: hard pulls in the last 6 months.
num_new_tradelines_12m: newly opened accounts in the last 12 months.
bankruptcy_flag, collections_flag, tax_lien_flag: public record presence indicators.
secured_installment_flag, mortgage_flag: structural presence indicators.
revolving_total_balance, installment_total_balance: dollar aggregates by account type.
months_since_last_delinquency: recency of the most recent bad event; “never” is usually coded as a large positive number.

A clean scorecard typically uses 15 to 30 of these, with roughly a two-thirds weight on delinquency-adjacent features (past behavior), a 15-percent weight on utilization (current behavior), and a 10- to 20-percent weight on length and mix (structural).

3.1.2 Bank internal data

Internal data is what the lender knows from its own books. It is almost always more predictive than bureau data for customers who already hold a product. A current-account issuer sees the full flow of salary credits and direct debits. A mortgage servicer sees escrow behavior. A credit card issuer sees transactional authorizations in real time.

The operational data stores used for model building tend to be organized by the source system:

Core banking: account master, balances, interest accruals, statement-level tables.
Card authorization: every swipe, with merchant category, amount, timestamp, and channel.
Payments and transfers: ACH, wire, SEPA, faster payments, internal transfers.
Collections: arrears, promise-to-pay history, agent notes, settlement agreements.
Customer contact: call center records, digital channel logs, complaint flags.

A behavioral scorecard reduces this to a set of standardized windows: 1-month, 3-month, 6-month, and 12-month. Each window is aggregated into counts, sums, max, min, ratios, and trends. A typical card model will have 30 to 80 such features. The specific recipe matters less than the window discipline. The windows must end strictly before the observation point of the score, and they must be computable at that same point in the production path. If they are not, the model is leaky, a topic we return to in Section 3.6.

There is a structural asymmetry between new-account origination scorecards and behavioral scorecards on existing accounts. An origination scorecard knows the bureau, the application, and nothing else. A behavioral scorecard on an existing card account has the full payment, balance, and transaction history of that account, along with the bureau refresh. The behavioral scorecard is almost always more accurate within its existing customer base; it typically reaches Gini coefficients in the 0.55 to 0.70 range on 12-month horizons, while origination models for the same lender reach 0.40 to 0.55. The gap comes entirely from the richer internal data.

Two specific internal features consistently dominate in behavioral scorecards. The first is the ratio of payment to statement balance, also called the revolving-pay rate. A customer who pays the full balance every month is structurally lower risk than one who pays the minimum, even at the same utilization. The second is the trend in transaction frequency and amount over the last 3 to 6 months. A sudden drop in transaction count while balance grows is a strong short-horizon risk signal, often preceding a 30+ DPD event.

3.1.3 Account-level versus customer-level modeling

Some scorecards are built per-account; others are built per-customer. The choice matters. A customer-level model aggregates across all of the customer’s accounts with the lender and produces one score. An account-level model produces a score per account, so a customer with three cards receives three scores. The customer-level model is more data-efficient but forces the aggregation to happen before the model sees the data. The account-level model is more flexible but requires the lender to manage three predictions for the same person.

In practice, origination uses the customer level (one application, one decision) and account management uses the account level (per-card credit-line changes, per-card repricing). IFRS 9 and CECL (Chapter 40) have specific implications: the expected credit loss calculation is at the account level, so account-level PDs are the operational requirement even when the decision model runs at the customer level.

3.1.4 Application data

The application form is the scorecard’s only chance to see information a customer volunteers that is not in the bureau or the internal master file. Typical fields are employment status, occupation, income, employer tenure, housing tenure, marital status, dependents, purpose of loan, requested amount, and term. Most fields are self-reported. Some lenders cross-check a subset through payroll data providers or open-banking connections.

Application features are high-signal for thin-file borrowers, who by definition have sparse bureau records. For customers with thick files, the marginal value of application data is lower, and the scorecard usually regresses toward bureau features. Income is the perennial exception. A consistent, verified income feature dominates many other inputs, even for thick-file customers.

3.1.5 Tradeline-level features

The bureau delivers tradelines as a repeated-measures structure. A report with 12 open tradelines contains 12 rows of account-level information, not 1 aggregated row. Turning that into a single observation per borrower is a feature engineering problem, and it is where most scorecard teams spend their time.

Three families of aggregation dominate:

Pointwise: count of open installments, sum of revolving balances, and maximum utilization.
Temporal: count of tradelines with 30+ delinquency in the last 12 months, months since last delinquency, months since the oldest account opened.
Structural: presence of mortgage, presence of auto loan, ratio of secured to unsecured balance.

Modern gradient-boosted scorecards work directly on 100 to 500 such derived columns. Classical scorecards collapse further to 15 to 30 features, selected by information value and stability. We build both kinds in Chapter 7 and Chapter 12; this chapter sets up the ingredients.

Tradeline aggregation is where many hidden failure modes live. A common one is double-counting: if a mortgage is reported by both the servicer and the originator for a brief window, a naive aggregation double-counts it in the mortgage-count feature and in the sum-of-balances feature. Bureaus use identifier keys to deduplicate, but the keys are imperfect, and lenders usually write their own dedupe rules. Another is the treatment of authorized users: a thin-file consumer can ride on a spouse’s or parent’s revolving account, and the bureau reports the authorized user tradeline in the primary report. Whether the feature engine counts that line is a policy choice that affects both risk discrimination and fair lending posture.

3.1.6 The credit-invisible and unscored populations

A large fraction of adults in any jurisdiction have insufficient bureau data to produce a score. Brevoort et al. (2016) estimate that roughly 26 million Americans are credit invisible, meaning they have no bureau record, and another 19 million are unscored, meaning their record exists but is too thin or stale for the bureau’s scoring model. The invisible and unscored populations skew younger, non-white, and lower-income, which makes them the principal policy target for alternative-data scoring. Any scorecard designed for mass-market retail lending has to make an explicit choice about how to treat thin-file applicants. The three common choices are (1) to score them on a dedicated thin-file model, (2) to route them to a judgmental underwriter, or (3) to decline by default. Each choice has fair-lending implications that model governance must document.

3.1.7 External overlays

A bureau report is not the only external data source. Income verification services like The Work Number, Finicity, and Plaid provide real-time payroll feeds that confirm stated income within seconds. Fraud databases such as FICO’s Falcon, LexisNexis ThreatMetrix, and Early Warning Services flag known fraud rings, synthetic identities, and device fingerprints. AML and sanctions screening, while not directly part of credit risk, feeds the same decision workflow, and its data quality affects the stability of any model downstream. Model governance treats each overlay as a separate data source requiring its own lineage documentation, its own monitoring, and its own retraining plan.

3.2 Alternative data

The line between “traditional” and “alternative” is time-dependent. Utility payment data was alternative in 2005 and is standard today. What the term currently means is any signal that is not in a nationwide credit bureau and not in a bank’s own ledger. Five categories cover most of what large lenders have deployed or piloted.

3.2.1 A working taxonomy

Psychometric: survey-style questionnaires designed to measure personality traits (conscientiousness, honesty, locus of control) that correlate with repayment. Deployed in frontier markets where bureau coverage is sparse.
Behavioral and device: smartphone metadata, browser fingerprints, typing dynamics, session-level app usage. Berg et al. (2020) document that ten easily observed digital footprint variables, such as device operating system and time-of-day login patterns, deliver predictive power comparable to a credit bureau score at a German e-commerce lender.
Transactional: bank account data obtained through open banking APIs (PSD2 in Europe, CDR in Australia, the 1033 rule in the United States). Each transaction has a date, amount, counterparty, and a classification tag.
Social and platform: data collected inside a digital platform. Iyer et al. (2016) and Lin et al. (2013) show that in peer-to-peer lending, social ties and soft information embedded in listings contain residual risk information beyond traditional hard information.
Utility and telco: electricity, water, mobile phone bills. Thin-file consumers often have 6 to 24 months of telecom usage even when they have zero tradelines.

Chinese BigTech platforms have reported that transactional and platform data dominate bureau data for their own ecosystem (Gambacorta et al., 2020; Gambacorta et al., 2024). The underlying point is older than FinTech. The richer the lender’s view of the borrower’s cash flow, the less a centralized bureau adds on the margin.

3.2.2 The information content of alternative data

For any new feature, the relevant question is whether it adds risk-adjusted discrimination on top of what the scorecard already has. Berg et al. (2020) formalize this through nested models, regressing default on credit bureau score alone, on digital footprint alone, and on both. The marginal $R^2$ from adding digital footprints is comparable to the marginal $R^2$ of the bureau score itself. Gambacorta et al. (2024) run a similar test using Chinese fintech data and show that a model built on transactional data alone beats a model built on a bureau score alone, although the two together beat either.

The regulatory question, which we will see in Chapter 5, is whether the resulting model complies with fair lending rules. Alternative data that correlates with protected characteristics can open disparate impact exposure even when the feature is nominally neutral. The empirical finding in Fuster et al. (2022) is that machine learning models with richer feature sets can widen or narrow racial pricing gaps depending on the choice of input set. Data policy is a model policy.

3.2.3 Pitfalls of alternative data

Alternative data has three recurring failure modes. First, many signals drift fast. A social feature built in 2015 might not exist in 2025 because the platform has changed. Second, the distribution of missingness is usually not random. A customer without a smartphone has no device fingerprint, and the absence correlates with both income and credit risk. Third, the population on which an alternative data model is validated is almost always a self-selected sample of borrowers who consented to the data collection. Reject inference, covered in Chapter 10, becomes essential.

A fourth, less-discussed failure mode is the regulatory half-life of novelty. When a signal first enters the market, it usually carries strong predictive power and light compliance scrutiny. As it diffuses across lenders, the signal loses economic rent, adversarial actors learn to game it, and regulators begin to ask how it maps onto protected classes. Fuster et al. (2022) make the latter point sharply for ML scorecards: models with richer feature sets can move the distribution of predictions in ways that widen or narrow racial pricing gaps, and the direction is not determined by the model class alone. Data governance for alternative signals needs an explicit sunset clause in the same way that a model governance framework has a model retirement clause.

3.2.4 Cash flow underwriting

Cash flow underwriting is the most important new category of alternative data in the past five years. The workflow starts with a consumer-consented open-banking pull of 12 to 24 months of transaction history across all of the customer’s depository accounts. A categorizer assigns each transaction a merchant category and a cash-flow tag (inflow, fixed outflow, discretionary outflow, transfer, fee). Aggregate features are computed at daily, weekly, and monthly frequency: net cash flow, income stability, rent coverage ratio, recurring debit presence, and overdraft count. Gambacorta et al. (2020) argue that the informational content of this data can substitute for collateral in SME lending, which is a strong statement about its power.

Two operational facts make cash-flow underwriting different from the other alternative categories. It is consented, so the data subject is aware of the collection in a way that is rarely true for device or behavioral features. And it is structured, so the feature engineering pipeline can be specified and tested deterministically. These properties make cash flow data easier to defend in model validation. They also constrain the universe: only applicants who connect a bank account have any signal at all, which maps directly back onto the missingness-at-design problem that structural alternative data creates.

3.2.5 Summary of marginal information content

Across the empirical literature, the pattern is consistent. A credit bureau score captures roughly two-thirds of the variation in default that the combined set of bureau, bank, and alternative data captures, with the remaining third split roughly evenly across application data (for thin-file) and alternative data (for both thin- and thick-file). The size of the alternative data contribution depends strongly on the customer segment. For prime thick-file borrowers, the alternative signal is mostly redundant. For thin-file, credit-invisible, and frontier-market borrowers, the bureau is sparse, and the alternative signal is essential. This heterogeneity argues for a segmented modeling approach rather than a single global model. Later chapter treat inclusion and segmentation head-on.

3.3 Weight of evidence and information value

3.3.1 Why bin features at all?

Before any formula, it is worth asking why a credit team would throw away information by replacing a continuous income figure with the bucket it falls into. The answer is that binning exchanges resolution for six properties that matter more than resolution in the production setting where a scorecard runs for years against drifting data and is subject to formal model validation.

Robustness to outliers, measurement error, and reporting noise. Bureau fields are reported with varying conventions across furnishers; income is self-reported on applications and has long upper tails; tradeline counts are zero-inflated. Bins absorb all of this into a discrete level whose log-odds contribution is bounded.
Explicit treatment of missingness. A missing tradeline summary is not the same as a zero. Binning makes the missing level its own bin with its own WoE, so the model uses missingness as a signal when it is informative and ignores it when it is not. No imputation choice is hidden inside the pipeline.
Monotone, additive contribution to log-odds. WoE-encoded features enter the logistic regression as linear terms whose coefficients factor cleanly into a base-odds piece and a per-bin contribution (Eq. 3.7), which is what enables the points-based scorecard formulation in Chapter 7. Underwriters and adverse-action systems can read each bin’s contribution as a fixed point increment.
Stability across resamples and through time. A coarse five-bin partition of a feature is far more stable under resampling and population drift than a continuous coefficient, because the only thing that can change is the per-bin event rate. A scorecard that is refreshed annually but whose binning was set during development needs the binning to be stable across years; that is what the bootstrap and through-time diagnostics in the stability subsection below test.
Two-stage decoupling of binning and coefficient estimation. Bin selection is supervised but happens before the model is fit. The binning table is reviewed in isolation against IV, monotonicity, and bin-share rules. The coefficient estimation step then becomes essentially a sanity check, because the slope on a single WoE-encoded feature is approximately $-1$ in population. This decoupling is what lets a five-person scorecard team ship a model that a ten-person model risk team can validate.
A model artifact that regulators read. SR 11-7 model risk validators, ECOA Reg B examiners reviewing reason codes, and GDPR Article 22 explainability reviewers all expect a tabular artifact that lists every model input, every bin, and every contribution. The binning table is that artifact. No post-hoc explanation method (SHAP, LIME, permutation importance) produces something that validators trust the same way, because those methods are computed on the trained model rather than fixed at training time.

The first three reasons are the technical ones taught in Siddiqi (2017). The last three are the institutional reasons that explain why binned-WoE pipelines remain dominant in retail credit even decades after gradient boosting matched or exceeded their predictive performance. We return to the question of whether modern algorithms eliminate the need for any of this in the subsection on modern models below.

Weight of evidence (WoE) is the canonical encoding used for this purpose in credit, going back to Kullback’s work on information statistics in the 1950s (Kullback & Leibler, 1951) and commercialized in banking by Fair, Isaac, and Company in the 1970s. Siddiqi (2017) is the industry-standard reference for the practical pipeline.

3.3.2 Formal definition

Fix a feature that has been binned into $J$ disjoint bins $A_1, \dots, A_J$. For each bin define the share of goods and share of bads that fall in that bin:

\[ g_j = \frac{\#\{i: x_i \in A_j, y_i = 0\}}{\#\{i: y_i = 0\}}, \qquad b_j = \frac{\#\{i: x_i \in A_j, y_i = 1\}}{\#\{i: y_i = 1\}}. \tag{3.1}\]

The weight of evidence for bin $j$ is the log-ratio of those shares:

\[ \mathrm{WoE}_j = \ln\!\left(\frac{g_j}{b_j}\right). \tag{3.2}\]

Positive WoE means the bin is enriched with goods relative to the base rate. Negative WoE means the bin is enriched with bads. The information value of the feature is the weighted sum

\[ \mathrm{IV} = \sum_{j=1}^{J} (g_j - b_j) \mathrm{WoE}_j = \sum_{j=1}^{J} (g_j - b_j) \ln\!\left(\frac{g_j}{b_j}\right). \tag{3.3}\]

Practitioners rank features by IV with the rules of thumb from Siddiqi (2017): IV less than 0.02 is unpredictive, 0.02 to 0.1 is weak, 0.1 to 0.3 is medium, 0.3 to 0.5 is strong, and above 0.5 is suspiciously good and usually means the bin count is too fine or the feature is leaky.

3.3.3 A worked example by hand

Before invoking optbinning, it helps to compute the formulas in Eq. 3.1 and Eq. A.27 on a toy portfolio with three bins. Suppose 1,000 applicants split across an income_bucket feature with $G = 900$ goods and $B = 100$ bads in total:

Show code

import numpy as np, pandas as pd

ex = pd.DataFrame({
    "bin":   ["Low", "Mid", "High"],
    "goods": [250, 370, 280],
    "bads":  [ 50,  30,  20],
})
G, B = ex["goods"].sum(), ex["bads"].sum()
ex["g_j"]   = ex["goods"] / G
ex["b_j"]   = ex["bads"]  / B
ex["WoE_j"] = np.log(ex["g_j"] / ex["b_j"])
ex["IV_j"]  = (ex["g_j"] - ex["b_j"]) * ex["WoE_j"]
print(ex.round(4).to_string(index=False))
print(f"\nIV (total) = {ex['IV_j'].sum():.4f}")

 bin  goods  bads    g_j  b_j   WoE_j   IV_j
 Low    250    50 0.2778  0.5 -0.5878 0.1306
 Mid    370    30 0.4111  0.3  0.3151 0.0350
High    280    20 0.3111  0.2  0.4418 0.0491

IV (total) = 0.2147

Read the table row by row. The Low bin holds $250/900 \approx 27.8\%$ of all goods but $50/100 = 50\%$ of all bads. Its WoE is $\ln(0.278/0.500) \approx -0.59$, a negative value flagging higher-than-average risk, consistent with a $50/300 = 16.7\%$ bad rate against the population base rate of $10\%$. The High bin reverses the imbalance and contributes $\mathrm{WoE} \approx +0.44$. The IV total of about $0.21$ would land the feature in the “medium predictive” tier of the rule of thumb above.

Three sanity checks the table makes obvious:

The columns $g_j$ and $b_j$ each sum to $1$, because they are class-conditional shares.
A bin with $g_j = b_j$ contributes zero to IV, because $\ln(1) = 0$. Equality in every bin is exactly the case where the feature is independent of $Y$.
Both negative and positive WoE bins contribute positively to IV, because the signs of $(g_j - b_j)$ and $\mathrm{WoE}_j$ always match.

These same numbers are what optbinning would print if it were given the same three bins, modulo the Laplace pseudo-count discussed in the from-scratch implementation below.

3.3.4 Equivalence to a symmetric KL divergence

The IV in Eq. A.27 is exactly the symmetrized Kullback-Leibler divergence between the class-conditional distributions of the binned feature. Let $P$ denote the distribution of $X$ conditional on $Y = 0$ and $Q$ the distribution of $X$ conditional on $Y = 1$, so $P(A_j) = g_j$ and $Q(A_j) = b_j$. The KL divergence of $P$ from $Q$ is

\[ D_{\mathrm{KL}}(P \parallel Q) = \sum_j g_j \ln(g_j/b_j). \tag{3.4}\]

By symmetry,

\[ D_{\mathrm{KL}}(Q \parallel P) = \sum_j b_j \ln(b_j/g_j) = -\sum_j b_j \ln(g_j/b_j). \tag{3.5}\]

Adding the two,

\[ D_{\mathrm{KL}}(P \parallel Q) + D_{\mathrm{KL}}(Q \parallel P) = \sum_j (g_j - b_j)\ln(g_j/b_j) = \mathrm{IV}. \tag{3.6}\]

Information value is the Jeffreys divergence between the good and bad class-conditional feature distributions (Kullback & Leibler, 1951). Two consequences follow:

IV is always non-negative, with equality if and only if $g_j = b_j$ for all $j$, the case where the feature contains no information about the label.
IV is additive across independent disjoint features only in the limit where no feature carries any information contained in another, so the IV-based ranking should be read as a marginal screen, not a joint optimum.

3.3.5 Connection to logistic regression

The link to logistic regression is tight. Conditional on bin $j$,

\[ \mathrm{logit} \Pr(Y = 1 \mid X \in A_j) = \ln\!\left(\frac{\Pr(Y=1, X \in A_j)}{\Pr(Y=0, X \in A_j)}\right) = \ln\!\left(\frac{\pi_1}{\pi_0}\right) + \ln\!\left(\frac{b_j}{g_j}\right) = \alpha - \mathrm{WoE}_j, \tag{3.7}\]

where $\alpha = \ln(\pi_1 / \pi_0)$ is the log base-odds. Fitting a logistic regression on a single WoE-encoded feature recovers an intercept equal to $\alpha$ and a slope equal to $-1$ in population. In sample, the slope is close to $-1$ and the deviation measures how close the bin assignment is to the saturated model. Because of Eq. 3.7, WoE encoding gives logistic regression coefficients that factor cleanly into a base-odds piece and a bin-contribution piece, which is what enables the standard points-based scorecard formulation we develop in Chapter 7.

3.3.6 Empirical ranking on Taiwan default

Show code

from optbinning import BinningProcess
from sklearn.model_selection import train_test_split

df_tw = load_taiwan_default().drop(columns=["id"])
y_tw = df_tw["default"].astype(int).values
X_tw = df_tw.drop(columns=["default"])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_tw, y_tw, test_size=0.3, random_state=42, stratify=y_tw
)

bp = BinningProcess(variable_names=X_tw.columns.tolist())
bp.fit(X_tr.values, y_tr)

iv_tbl = (bp.summary()
          .loc[:, ["name", "iv", "js", "gini", "n_bins"]]
          .sort_values("iv", ascending=False)
          .reset_index(drop=True))
iv_tbl.head(15).round(4)

	name	iv	js	gini	n_bins
0	PAY_0	0.879211	0.098011	0.408066	4
1	PAY_2	0.557144	0.064793	0.303639	4
2	PAY_3	0.419835	0.04944	0.263657	4
3	PAY_4	0.367989	0.043082	0.236466	4
4	PAY_5	0.342765	0.039685	0.210781	4
5	PAY_6	0.295911	0.034716	0.202056	4
6	LIMIT_BAL	0.200189	0.024597	0.248464	8
7	PAY_AMT1	0.182789	0.022386	0.227024	7
8	PAY_AMT2	0.16787	0.020458	0.215512	8
9	PAY_AMT3	0.136166	0.016759	0.199464	8
10	PAY_AMT4	0.107822	0.013341	0.179822	6
11	PAY_AMT6	0.1006	0.012441	0.173132	7
12	PAY_AMT5	0.09533	0.011768	0.165292	6
13	AGE	0.02018	0.002517	0.077412	10
14	BILL_AMT6	0.015322	0.001902	0.050182	7

optbinning (Navas-Palencia, 2020) uses a mixed-integer programming formulation to find an optimal monotone binning that maximizes IV subject to monotonicity and bin-size constraints. The algorithm extends classical supervised discretization (Fayyad & Irani, 1993) by enforcing risk monotonicity, which is what underwriters expect from a scorecard. The highest-IV feature in the Taiwan data is the most recent delinquency code (PAY_0), followed by older payment codes, which matches the domain intuition.

The summary() table includes two additional ranking columns alongside iv, both computed on the same binned distributions and sometimes preferred when IV is unstable.

js: Jensen-Shannon divergence. A bounded, symmetric variant of the KL divergences from Eq. 3.4 and Eq. 3.5. With $M = (P + Q)/2$ the bin-share midpoint, $\mathrm{JS}(P, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M)$. Always lies in $[0, \ln 2]$, so it cannot blow up the way IV can when a bin is nearly empty. Read the same way as IV: bigger means more class separation. Often used as a robustness check on the IV ranking.
gini: bin-level Gini coefficient. Twice the area between the bin-ordered cumulative-goods and cumulative-bads curves; equivalently $2 \cdot \mathrm{AUC} - 1$ computed using the bin-ordered WoE as the score. Reported on $[0, 1]$. Same monotone direction as IV but scaled like the discriminative AUC measure used in Chapter 4, so it lets the scorecard team compare a feature’s marginal predictive contribution against the overall model AUC in the same units. The full treatment of AUC, Gini, KS, and Brier sits in Chapter 4.

In practice, these three columns rank features almost identically; large disagreements are a flag that one bin is dominating IV, in which case a coarser binning or a JS-based ranking is the safer choice.

3.3.7 WoE-encoded features versus raw features in logistic regression

Show code

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

X_tr_woe = bp.transform(X_tr.values, metric="woe")
X_te_woe = bp.transform(X_te.values, metric="woe")

lr_woe = LogisticRegression(max_iter=1000, C=1.0).fit(X_tr_woe, y_tr)
auc_woe = roc_auc_score(y_te, lr_woe.predict_proba(X_te_woe)[:, 1])

scaler = StandardScaler()
X_tr_raw = scaler.fit_transform(X_tr.values)
X_te_raw = scaler.transform(X_te.values)
lr_raw = LogisticRegression(max_iter=1000, C=1.0).fit(X_tr_raw, y_tr)
auc_raw = roc_auc_score(y_te, lr_raw.predict_proba(X_te_raw)[:, 1])

pd.DataFrame({
    "encoding": ["WoE", "raw (standardized)"],
    "AUC": [auc_woe, auc_raw],
}).round(4)

	encoding	AUC
0	WoE	0.7609
1	raw (standardized)	0.7150

The WoE-encoded logistic model gains roughly 4 to 5 AUC points relative to standardized raw features on Taiwan. The gap narrows once we move to flexible models like gradient boosting, which can internally approximate monotone step functions. The gap persists, however, for the class of linear models that regulators prefer because they are auditable.

Three mechanisms drive the gap, and naming them clarifies when WoE will help and when it will not.

Linearization in log-odds. WoE turns a non-monotone or kinked empirical risk curve into a monotone, additive contribution in log-odds space, exactly the space the logistic link operates in. A single linear coefficient then fits a relationship that the raw feature would need a polynomial, spline, or piecewise basis to express.
Common units across columns. WoE rescales every feature, numeric or categorical, into log-odds. The L2 penalty in LogisticRegression therefore stops privileging high-variance columns over high-information ones, which is what causes the standardized-raw baseline to leak coefficient mass to noisy features.
Bounded leverage. Outliers and missing values are absorbed into bins with finite WoE, so a single extreme observation cannot tilt the regression line. Standardization shifts and scales but leaves the long tail intact, and a logistic regression with a few extreme rows still gets dragged.

None of the three is unique to WoE. Splines, target encoders, and isotonic transforms each capture some subset. WoE is the only encoding that captures all three and preserves a binning table that a model validator can read. That second property is what makes it the default in regulated retail credit, even where it does not maximize AUC.

3.3.8 WoE is univariate: handling interactions and non-linearity

WoE is computed one feature at a time, and the IV in Eq. A.27 treats each feature in isolation. The encoding captures non-linearity within a feature (e.g., the bin shape can be U-shaped, monotone, or step), but it does not capture any signal that lives only in the joint distribution of two or more columns. “High credit limit is risky only when paired with a thin payment history” is invisible to a WoE-plus-logistic model unless the interaction is engineered explicitly. Equivalently, the IV-based ranking is a marginal screen: a feature with low IV may still carry conditional information that a joint model would use, and a feature with high IV may be redundant given another already in the model.

This is the structural reason for the AUC gap pattern in the comparison above and in the end-to-end benchmark in Section 3.8. Three remedies sit on a complexity-versus-interpretability spectrum.

Hand-built interaction features. Cross-tabulate two binned features into a single categorical (PAY_0 × LIMIT_BAL_bin), then WoE-encode the cross. The result stays inside the scorecard pipeline and remains auditable. Cost: combinatorial explosion if used liberally, and small bin counts that hurt stability.
Two-dimensional supervised binning. optbinning ships an OptimalBinning2D that solves the same MIP over a pair of features. Useful for a small number of known-interactive pairs (utilization × age of oldest tradeline is a classic).
Segmented scorecards. Fit one scorecard per pre-defined segment (thin-file vs thick-file, secured vs unsecured). Interactions with the segmenting variable are absorbed by the segmentation. Chapter 35 treats this in depth.

If the dominant signal is genuinely interactive, none of the above competes with a tree ensemble that splits in arbitrary feature combinations by construction. The empirical fact that WoE-plus-logistic stays close to gradient boosting on regulated retail credit data is a statement about that data: monotone main effects from delinquency, utilization, and tenure dominate, and interactions are second-order, not a general property of the encoding. On data where interactions are first-order (fraud, marketing response on heterogeneous customer pools), the calculus reverses and a tree ensemble is the right starting point.

3.3.9 A from-scratch implementation

It is good practice to verify the library against a short NumPy reference. The function below reproduces optbinning’s per-bin WoE for a fixed set of bin edges.

Show code

def woe_iv_from_bins(x: np.ndarray, y: np.ndarray, edges: np.ndarray,
                     laplace: float = 0.5):
    """Return (woe_per_bin, iv, bin_index_of_x) for a fixed set of edges."""
    bin_idx = np.digitize(x, edges[1:-1], right=False)
    df = pd.DataFrame({"bin": bin_idx, "y": y})
    tab = (df.groupby("bin")["y"]
             .agg(["count", "sum"])
             .rename(columns={"count": "n", "sum": "b"}))
    tab["g"] = tab["n"] - tab["b"]
    total_b = max(tab["b"].sum(), 1.0)
    total_g = max(tab["g"].sum(), 1.0)
    p_b = (tab["b"] + laplace) / (total_b + laplace * len(tab))
    p_g = (tab["g"] + laplace) / (total_g + laplace * len(tab))
    woe = np.log(p_g / p_b)
    iv = float(((p_g - p_b) * woe).sum())
    return woe.to_dict(), iv, bin_idx

opt_var = bp.get_binned_variable("LIMIT_BAL")
lib_edges = np.concatenate([[-np.inf], opt_var.splits, [np.inf]])
woe_scratch, iv_scratch, _ = woe_iv_from_bins(
    X_tr["LIMIT_BAL"].values, y_tr, lib_edges
)
iv_lib = float(opt_var.binning_table.build().loc["Totals", "IV"])
print(f"IV (optbinning):  {iv_lib:.4f}")
print(f"IV (from-scratch): {iv_scratch:.4f}")

IV (optbinning):  0.2002
IV (from-scratch): 0.1998

The two IV numbers (here $0.2002$ from optbinning and $0.1998$ from the scratch implementation) differ by about $4 \times 10^{-4}$, with the scratch value the smaller of the two. The gap is not a constant offset: it comes from the Laplace pseudo-count of $0.5$, which shrinks the empirical bin shares toward uniform, compresses the WoE magnitudes, and therefore lowers the IV. Setting laplace=0 in woe_iv_from_bins reproduces optbinning’s value exactly whenever no bin is empty, which is the right cross-check to run during development. The pseudo-count earns its keep in production, where a single bin occasionally drops to zero in a refresh sample and an unsmoothed $\ln(0)$ would crash the scoring service.

3.3.10 Reading a binning table

Show code

opt_var.binning_table.build().round(4)

	Bin	Count	Count (%)	Non-event	Event	Event rate	WoE	IV	JS
0	(-inf, 45000.00)	3028	0.1442	1911	1117	0.3689	-0.721762	0.0892	0.0109
1	[45000.00, 75000.00)	3434	0.1635	2509	925	0.2694	-0.260896	0.0119	0.0015
2	[75000.00, 105000.00)	2262	0.1077	1718	544	0.2405	-0.108775	0.0013	0.0002
3	[105000.00, 145000.00)	1950	0.0929	1492	458	0.2349	-0.077739	0.0006	0.0001
4	[145000.00, 165000.00)	1262	0.0601	1029	233	0.1846	0.226562	0.0029	0.0004
5	[165000.00, 245000.00)	4155	0.1979	3438	717	0.1726	0.308827	0.0172	0.0021
6	[245000.00, 375000.00)	3229	0.1538	2771	458	0.1418	0.541352	0.0383	0.0047
7	[375000.00, inf)	1680	0.0800	1487	193	0.1149	0.783083	0.0387	0.0047
8	Special	0	0.0000	0	0	0.0000	0.0	0.0000	0.0000
9	Missing	0	0.0000	0	0	0.0000	0.0	0.0000	0.0000
Totals		21000	1.0000	16355	4645	0.2212		0.2002	0.0246

This table is the canonical artifact a scorecard developer hands over for model validation. Each row is a bin. The columns show the bin boundaries, the fraction of the population in the bin, the event rate, the WoE, and the IV contribution. The totals row at the bottom gives the IV for the feature. Binning tables like this one are what SR 11-7 (Board of Governors of the Federal Reserve System, 2011) validators will read first when reviewing a scorecard submission.

Three diagnostics in the binning table carry most of the validation signal. First, the WoE column should be monotone when the feature is ordinal and the business logic calls for monotonicity. An income feature that increases in WoE, then dips for the highest bin, then increases again, either has too many bins, has a sample-size problem, or has a real structural break (high earners with complicated tax situations, for instance) that needs a dedicated flag. Second, the bin-share column (often labeled count (%)) should not have any bin with less than 3 to 5 percent of the population. Small bins have unstable WoE and produce scorecards that swing wildly under resampling. Third, the event rate column should step smoothly across bins when ordered by WoE. Large jumps suggest that the bin boundaries are not where the risk boundary actually sits.

The optbinning algorithm is a mixed-integer programming formulation that enforces these properties globally rather than through heuristic post-processing (Navas-Palencia, 2020). It extends the classical supervised discretization literature, which treats binning as a greedy information-gain split, by adding constraints that express what a scorecard developer would actually want. The classical reference is Fayyad & Irani (1993), which uses a minimum description length principle to choose the number of bins. The MIP formulation subsumes this as a special case with different constraints.

3.3.11 Edge cases in WoE encoding

Three edge cases appear in almost every scorecard build, and each has a standard fix.

Zero-count bins. A bin may contain no goods or no bads in the training sample. The raw WoE in Eq. A.26 is then $\ln 0$ or $-\infty$. Three fixes exist:

A Laplace smoothing term adds a pseudo-count of $0.5$ or $1$ to each bin.
A bin merge folds the offending bin into an adjacent bin with consistent risk.
An optbinning constraint on minimum bin size avoids the zero-count state entirely during fitting.

The Laplace fix is the simplest and adequate for most production code.

Out-of-range values at scoring time. Production data may contain values that fall outside the training-time range. For numeric features, the conventional answer is to extend the outermost bins to $(-\infty, \text{edge}_1]$ and $(\text{edge}_{J-1}, \infty)$, so every value maps to a bin. For categorical features, the analog is to keep a catch-all level that absorbs unseen categories with a WoE set equal to the population-weighted average of the observed levels.

Missing values. optbinning treats missing values as a separate bin, which is the behavior we want. If the missingness itself is informative, the WoE of the missing bin will be nonzero, and the model will use it. If not, the WoE will be close to zero, and the model ignores it. Either way, the treatment is explicit rather than hidden behind a mean imputation.

3.3.12 Stability of WoE under resampling

A scorecard is only useful if the bin edges and WoE values are stable. Two diagnostics verify stability: bootstrap resampling and through-time partitioning.

Show code

from optbinning import OptimalBinning

x_var = X_tr["PAY_0"].values
y_var = y_tr

boot_iv = []
boot_edges = []
boot_rng = np.random.default_rng(0)
for b in range(25):
    idx = boot_rng.integers(0, len(x_var), size=len(x_var))
    ob = OptimalBinning(name="PAY_0", dtype="numerical", max_n_bins=6)
    ob.fit(x_var[idx], y_var[idx])
    iv_b = float(ob.binning_table.build().loc["Totals", "IV"])
    boot_iv.append(iv_b)
    boot_edges.append(tuple(np.round(ob.splits, 3)))

pd.DataFrame({
    "statistic": ["mean IV", "std IV", "min IV", "max IV"],
    "value":     [np.mean(boot_iv), np.std(boot_iv),
                  np.min(boot_iv), np.max(boot_iv)],
}).round(4)

	statistic	value
0	mean IV	0.8827
1	std IV	0.0309
2	min IV	0.8241
3	max IV	0.9483

If the bootstrap standard deviation of IV is large relative to the mean, the feature’s binning is unstable and the scorecard will be brittle. On the PAY_0 feature the bootstrap spread is modest, which is the expected picture for a strongly predictive feature. Features that fail bootstrap stability almost always benefit from a coarser binning.

3.3.13 The relationship to supervised discretization

WoE is one member of a family of supervised discretization methods. Fayyad & Irani (1993) introduced the entropy-based minimum description length principle for choosing the number of bins. Chi-merge and ChiSquare-based methods use a test of independence between adjacent bins as the merge criterion. The CART tree, treated at length in Chapter 11, is a univariate supervised binner when grown as a stump, and its splitting criterion is Gini impurity rather than WoE. All of these methods can be expressed as choices of the split function and the stopping criterion; the optbinning MIP formulation lets the user specify both explicitly.

The deep reason WoE dominates in credit is not statistical performance but interpretability. A CART split gives a threshold and a count; WoE gives a threshold, a count, and a log-odds contribution that plugs directly into a points-based scorecard. The transformation from WoE to scorecard points is linear in the log-odds, which Chapter 7 works out in detail.

3.3.14 Do modern algorithms still need binning and WoE?

A reasonable reading of the chapter so far is that WoE is a piece of legacy machinery that exists because logistic regression cannot handle non-linearity on its own. Gradient boosting splits at arbitrary thresholds, neural networks learn arbitrary feature transformations, and either approach matches the AUC of a WoE-plus-logistic pipeline on most retail credit data. So why bin?

The honest answer has two parts. The first is that for predictive performance alone, you do not need WoE if your downstream model is a tree ensemble or a sufficiently regularized neural network. The end-to-end benchmark in Section 3.8 makes this explicit: a gradient-boosted model on raw features matches the WoE-plus-logistic configuration on Taiwan within noise. Tree learners pick their own thresholds, absorb missingness through surrogate splits or dedicated handling, and are insensitive to monotone transforms of inputs. WoE is computationally and statistically wasted on them.

The second part is that production credit scoring is not a pure prediction problem. It carries six constraints that WoE-style preprocessing addresses by construction and that modern algorithms address only with extra effort:

Reason codes for adverse action notices under ECOA Reg B and FCRA §1681m must be ordered, stable, and explainable per applicant. A scorecard derives reason codes mechanically from the per-bin WoE contributions; a gradient boosting model derives them from SHAP values, which are post-hoc, sample-dependent, and not always monotone in the underlying feature, even when the feature is supposed to be.
Monotonicity constraints on features such as utilization, delinquency, and tenure are required by both regulators and underwriters. WoE binning enforces monotonicity at the binning step; LightGBM and XGBoost support monotonicity flags but at a real cost in fit, and neural networks need either a Lipschitz architecture or a monotone-by-construction layer such as in Sill (1998).
Stability under population drift is harder to guarantee with continuous splits chosen on a single training sample than with bins reviewed against a stability index. Champion scorecards stay in production five-plus years; champion gradient-boosted models are typically refreshed every quarter.
Auditability of the model artifact by SR 11-7 model risk management groups, by GDPR Article 22 explainability reviewers in the EU, and by a non-technical credit committee. The binning table is the artifact those audiences read. SHAP and partial dependence are explanations of the artifact, not the artifact itself.
Reproducibility across pipelines. A binning table can be re-implemented by a different team in a different language with the same WoE values. A gradient-boosted model with a particular set of hyperparameters cannot be reproduced exactly outside the original training pipeline.
Sample efficiency for thin-file segments. Where the relevant subpopulation is small (frontier-market lenders, new-to-credit applicants, niche product lines), a five-bin discretization extracts more reliable signal than a continuous spline that needs many degrees of freedom to fit non-linearity.

The pragmatic stack used by many regulated lenders today is a hybrid. Gradient boosting on raw or lightly engineered features is used for ranking and for the challenger model; a WoE-binned logistic scorecard is used for the production decision model that is actually deployed. The two are reconciled with a calibration step. Where a single model must serve both purposes, the rising option is the Explainable Boosting Machine (EBM) of Lou et al. (2013) and Nori et al. (2019), which fits a generalized additive model with one shape function per feature and optionally one shape function per pairwise interaction. EBMs are essentially a continuous-bin generalization of WoE, with the per-feature shape playing the role of the WoE column and the per-pair shape playing the role of an OptimalBinning2D cross. They typically match gradient boosting on AUC while preserving the per-feature artifact regulators expect.

So the short answer to “is this an artifact of logistic regression from decades ago?” is: only the encoding is. The underlying constraints (e.g., reason codes, monotonicity, stability, auditability, sample efficiency) are not artifacts of the algorithm; they are properties of the regulatory and operational environment in which credit models live. WoE persists because it satisfies those constraints almost for free, not because anyone is nostalgic for the 1970s.

3.4 Missing data

Every real credit data set has missing values. A bureau-less applicant has no tradelines. An application that skipped an optional field has nulls. An open-banking connection that failed mid-session has a truncated history. How missingness gets handled determines, in practice, whether a model works in production for the tail of the customer base where it matters most.

3.4.1 Rubin’s taxonomy

Rubin (1976) classified missing-data mechanisms into three types. Let $X$ be the complete data, $M$ the missingness indicator matrix, and $X_{\mathrm{obs}}, X_{\mathrm{mis}}$ the observed and missing partitions.

MCAR (missing completely at random): $\Pr(M \mid X) = \Pr(M)$. Missingness is independent of both observed and unobserved data. Rare in practice. The classic example is a random sensor dropout.
MAR (missing at random): $\Pr(M \mid X) = \Pr(M \mid X_{\mathrm{obs}})$. Missingness depends only on observed variables. An income field that is more likely to be blank for younger applicants is MAR, because age is observed. Under MAR, likelihood-based methods such as multiple imputation are unbiased.
MNAR (missing not at random): $\Pr(M \mid X)$ depends on $X_{\mathrm{mis}}$. Missingness depends on the unobserved value itself. High earners who decline to report income are MNAR. Imputation alone cannot recover the full data distribution without assumptions.

Little & Rubin (2019) gives the textbook treatment. For scorecard work, the relevant message is that MCAR is a convenient fiction, MAR is often defensible given rich observed data, and MNAR is the state of the world for sensitive fields like income, mortgage balance on outside institutions, or credit applications at competitors.

3.4.2 Imputation strategies

Five strategies cover most of what a scorecard pipeline needs:

Simple statistic: replace with the column mean, median, or mode. Fast, unbiased under MCAR, biased otherwise. Collapses variance.
Indicator plus statistic: add a binary “was missing” column and impute the underlying value. Captures the information in the fact of missingness, which for credit is often predictive on its own.
k-nearest neighbors: find the $k$ most similar rows under a defined distance, average their values. Works well when the data has strong local structure. Compute scales quadratically with sample size.
Multivariate iterative (MICE): model each incomplete feature as a function of the others, iterate to convergence (Buuren & Groothuis-Oudshoorn, 2011; White et al., 2011). Scikit-learn’s IterativeImputer is the Python implementation. Recovers MAR under mild assumptions.
Model-based with native support: tree learners like XGBoost, LightGBM, and CatBoost natively route missing values to the child node that minimizes loss. For those learners, imputation is a pre-model choice only if you also run a linear benchmark.

For credit scoring, the missing-indicator strategy deserves first-line status. If an applicant failed to fill in an optional field, that refusal often correlates with risk. Losing it by silent mean-imputation is a substantive information loss.

3.4.3 Simulated experiment on German credit

We take the German credit data, inject missingness under the three mechanisms, and compare five imputation strategies on the downstream AUC of a logistic model.

Show code

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng_miss = np.random.default_rng(0)
df_ger = load_german_credit()
num_cols = ["duration", "amount", "age", "installment_rate",
            "residence_since", "existing_credits", "people_liable"]
X_full = df_ger[num_cols].astype(float).copy()
y_full = df_ger["default"].values


def inject_mcar(X, frac, rng):
    X = X.copy()
    for c in X.columns:
        mask = rng.random(len(X)) < frac
        X.loc[mask, c] = np.nan
    return X


def inject_mar(X, driver, frac, rng):
    X = X.copy()
    thr = X[driver].quantile(0.6)
    for c in [c for c in X.columns if c != driver]:
        p = np.where(X[driver] > thr, frac, frac * 0.2)
        mask = rng.random(len(X)) < p
        X.loc[mask, c] = np.nan
    return X


def inject_mnar(X, frac, rng):
    X = X.copy()
    for c in X.columns:
        thr = X[c].quantile(0.7)
        p = np.where(X[c] > thr, frac, frac * 0.1)
        mask = rng.random(len(X)) < p
        X.loc[mask, c] = np.nan
    return X

The MCAR mechanism drops each cell independently with probability 0.2. The MAR mechanism drops a cell with probability 0.4 when duration is in the top 40 percent and 0.08 otherwise. The MNAR mechanism drops a cell with probability 0.4 when the cell’s own value is in the top 30 percent of its column and 0.04 otherwise. Credit analogs map cleanly: MAR is “applicants with long tenure skip optional fields”; MNAR is “applicants with high balances skip the balance field”.

Show code

from sklearn.metrics import roc_auc_score

def eval_imputer(Xtr, Xte, ytr, yte, imputer):
    pipe = Pipeline([
        ("imp", imputer),
        ("scale", StandardScaler()),
        ("lr", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(Xtr, ytr)
    p = pipe.predict_proba(Xte)[:, 1]
    return roc_auc_score(yte, p)


mechanisms = {
    "MCAR": lambda X: inject_mcar(X, 0.2, rng_miss),
    "MAR":  lambda X: inject_mar(X, "duration", 0.4, rng_miss),
    "MNAR": lambda X: inject_mnar(X, 0.4, rng_miss),
}

imputers = {
    "mean":       SimpleImputer(strategy="mean"),
    "median":     SimpleImputer(strategy="median"),
    "mean+ind":   SimpleImputer(strategy="mean", add_indicator=True),
    "kNN(5)":     KNNImputer(n_neighbors=5),
    "MICE":       IterativeImputer(max_iter=10, random_state=0),
}

records = []
for mech_name, inject in mechanisms.items():
    Xm = inject(X_full)
    Xtr, Xte, ytr, yte = train_test_split(
        Xm, y_full, test_size=0.3, random_state=0, stratify=y_full
    )
    for imp_name, imp in imputers.items():
        auc = eval_imputer(Xtr, Xte, ytr, yte, imp)
        records.append({"mechanism": mech_name, "imputer": imp_name, "AUC": auc})

res_miss = (pd.DataFrame(records)
            .pivot(index="imputer", columns="mechanism", values="AUC")
            .round(4))
res_miss

mechanism	MAR	MCAR	MNAR
imputer
MICE	0.6910	0.6418	0.6568
kNN(5)	0.6848	0.6483	0.6726
mean	0.6965	0.6591	0.6686
mean+ind	0.6711	0.6720	0.6621
median	0.6928	0.6535	0.6646

Two observations. First, no imputer dominates across all three mechanisms, which is the expected result once we understand that each method encodes different assumptions. Under MCAR the mean-plus-indicator strategy leads because the indicator itself is random and the underlying distribution is symmetric. Under MAR, mean or median imputation is competitive because the drivers are observed in the other features. Under MNAR no method recovers the full signal. The second observation is that the differences are small in absolute terms, a few AUC points, but they compound into large differences in expected profit when stacked across features. This is where adding a missingness indicator pays dividends for essentially zero cost.

3.4.4 When to use which

A practical decision rule for scorecard work:

Categorical feature with natural “missing” level: keep the missing level as its own bin. WoE handles it. This is the default in optbinning.
Continuous numeric with MAR missingness and rich observed data: IterativeImputer (MICE) with a small number of iterations. Add an indicator if missingness rate exceeds 5 percent.
Sparse high-dimensional matrix with MNAR fields: keep the indicator, impute the value with a median. Do not try to be clever.
Tree learner downstream: do not impute. Pass NaNs through. Compare to imputation only if required by governance.
Linear or neural scorecard: impute explicitly and persist the imputer in the same artifact as the model.

The industry’s single largest imputation failure mode is pipeline drift. The training data had a missingness rate of 2 percent for income. The production data has 15 percent, because a downstream upstream vendor changed a default. The imputer silently mean-fills the missing values, and the score concentrates near the population mean. Monitor missingness rates on every feature in production with the same care that you monitor PSI (Chapter 16).

3.4.5 What about matrix completion, GAIN, MIWAE, MissForest?

A reasonable reader will ask why the list above stops at MICE when the imputation literature has moved on to low-rank matrix completion (Mazumder et al., 2010), random-forest imputation (Stekhoven & Bühlmann, 2012), generative-adversarial imputation (Yoon et al., 2018), deep-latent-variable imputation (Mattei & Frellsen, 2019), and AutoML-style imputer selection (Jarrett et al., 2022). The answer is not that these methods are bad. They are excluded from the recommended scorecard pipeline for four reasons that all bind at the same time in credit.

First, low-rank assumptions do not fit credit feature matrices. Matrix completion methods assume the underlying data matrix is approximately low-rank, which is the right model for collaborative filtering (a few latent taste factors generate every user-item rating) and for image inpainting (smoothness in pixel space). Credit features are a heterogeneous mix of bureau scores, demographics, employment fields, behavioral aggregates, and self-reported items. There is no shared low-dimensional latent factor that generates all of them, and SoftImpute-style nuclear-norm completion silently shrinks toward a basis that has no interpretation. Empirically, low-rank completion is competitive on dense numeric panels (e.g., genomics) and weak on the wide mixed-type tables that scorecards consume.

Second, the inference-time story is broken or expensive. A scorecard must score one new applicant in milliseconds. Matrix completion, GAIN, and MIWAE were all designed for the in-sample setting, where you complete a fixed matrix once. Out-of-sample completion for a single new row requires either projection onto a stored basis (matrix completion), a forward pass through a generator trained on potentially stale data (GAIN), or sampling from a learned posterior (MIWAE). Each of these adds a second model to the production path that must itself be versioned, monitored for drift, and validated under SR 11-7. The marginal AUC gain rarely justifies the operational cost.

Third, the empirical lift on tabular data is small. Two large published benchmarks of imputation methods on tabular data, Jäger et al. (2021) and Le Morvan et al. (2021), both find that median imputation with a missing-indicator column is within one or two AUC points of the best deep imputer on essentially every downstream classification task they test. Where the deep methods win, they win by margins that are smaller than the variance across random seeds. For credit, where the dominant model is a gradient-boosted tree that handles NaN natively (Section 12.6), the practical gain from a sophisticated imputer is close to zero.

Fourth, governance penalizes opacity. A bank validator will ask three questions about any imputer: what assumption does it encode, what happens when production missingness drifts, and how would you detect a regression. Median-plus-indicator answers all three in one sentence each. GAN- or VAE-based imputation answers none of them cleanly, and the validator will require a separate model risk file, a champion-challenger setup, and ongoing monitoring of the imputer’s own outputs. This burden is real and is why most production scorecards still ship with median-plus-indicator or with a tree learner that ignores the question entirely.

The honest summary is that matrix completion and its deep-learning successors are excellent tools for the problems they were designed for (collaborative filtering, image inpainting, gene-expression panels) and a poor fit for the wide-mixed-type, low-latency, high-governance environment of credit scoring. The gap is not theoretical sophistication; it is fit-for-purpose. A team that wants to experiment with HyperImpute or MissForest as a challenger to the median-plus-indicator champion should do so, but the production default belongs with the simpler tool.

3.4.6 Multiple imputation and variance inflation

Single imputation, where each missing cell is replaced with one value, systematically understates downstream standard errors. Multiple imputation draws $M$ imputed data sets, fits the model on each, and combines the results using the rules of Rubin (1976). Buuren & Groothuis-Oudshoorn (2011) and White et al. (2011) treat the methodology in detail. For credit scoring, the pragmatic reality is that $M = 1$ is almost always used. The argument is that scorecard inference is about the predicted probability rather than the coefficient standard errors, and the downstream decisions (approve, decline, price) are robust to the kind of variance inflation that single imputation glosses over. That argument is correct for pure decision-making, but breaks down when the model’s coefficients are used in capital calculations under Basel IRB, where the regulator cares about the confidence interval around the PD. Banks that run IRB models generally carry a multiple-imputation pipeline for a subset of critical features, even when single imputation is the default for decision making.

3.4.7 Imputation and monotonicity

A subtle but important property of any imputer is whether it preserves the monotone risk structure that scorecard binning relies on. Mean imputation breaks monotonicity because the imputed value sits near the middle of the distribution, while the underlying missing-value risk may be at one of the tails. Median imputation has the same problem. An imputation strategy that restores monotonicity is to impute to the value that matches the observed risk of the missing group. Operationally, this means fitting a univariate logistic regression of the label on the feature in the non-missing subsample, then assigning the imputed value so that the predicted log-odds of a missing row equals the empirical log-odds of the missing subset. This is worth the effort when downstream scorecard monotonicity is a constraint, and overkill otherwise.

The construction is short enough to demonstrate in code. We inject MNAR missingness on duration (longer-duration applicants are more likely to have the field blank, and longer duration is also riskier), then compare the imputed value chosen by the mean, the median, and the risk-matched rule.

Show code

from sklearn.linear_model import LogisticRegression as _LR

rng_rmi = np.random.default_rng(7)
X_rmi = X_full.copy()
thr_d = X_rmi["duration"].quantile(0.6)
p_drop = np.where(X_rmi["duration"] > thr_d, 0.6, 0.05)
miss_mask = rng_rmi.random(len(X_rmi)) < p_drop
X_rmi.loc[miss_mask, "duration"] = np.nan

obs = ~X_rmi["duration"].isna()
x_obs = X_rmi.loc[obs, "duration"].values.reshape(-1, 1)
y_obs = y_full[obs]
y_mis = y_full[~obs]

p_mis = y_mis.mean()
logit_mis = np.log(p_mis / (1 - p_mis))

uni = _LR(penalty=None, max_iter=1000).fit(x_obs, y_obs)
a, b = float(uni.intercept_[0]), float(uni.coef_[0, 0])
x_risk = (logit_mis - a) / b

cmp = pd.DataFrame({
    "strategy": ["mean", "median", "risk-matched"],
    "imputed_value": [
        X_rmi["duration"].mean(),
        X_rmi["duration"].median(),
        x_risk,
    ],
})
cmp["pred_logit_at_imp"] = a + b * cmp["imputed_value"]
cmp["empirical_logit_missing"] = logit_mis
cmp["abs_gap"] = (cmp["pred_logit_at_imp"] - logit_mis).abs()
cmp.round(3)

	strategy	imputed_value	pred_logit_at_imp	empirical_logit_missing	abs_gap
0	mean	18.056	-1.021	-0.265	0.756
1	median	15.500	-1.118	-0.265	0.853
2	risk-matched	37.970	-0.265	-0.265	0.000

The risk-matched row has abs_gap equal to zero by construction. The mean and median rows have a positive gap whose sign tells you which direction the imputer is pulling the missing population: a negative pred_logit_at_imp minus empirical_logit_missing means mean or median imputation is making the missing rows look safer than they really are, which is exactly the failure mode that breaks monotonicity in the downstream scorecard.

Show code

def fill_and_score(value, X_in):
    Xf = X_in.copy()
    Xf["duration"] = Xf["duration"].fillna(value)
    Xtr, Xte, ytr, yte = train_test_split(
        Xf, y_full, test_size=0.3, random_state=0, stratify=y_full
    )
    pipe = Pipeline([("scale", StandardScaler()),
                     ("lr", LogisticRegression(max_iter=1000))])
    pipe.fit(Xtr, ytr)
    return roc_auc_score(yte, pipe.predict_proba(Xte)[:, 1])

aucs = {row.strategy: fill_and_score(row.imputed_value, X_rmi)
        for row in cmp.itertuples()}
pd.Series(aucs, name="AUC").round(4).to_frame()

	AUC
mean	0.6560
median	0.6536
risk-matched	0.6803

The AUC differences are typically small on a single feature, which matches the broader finding from the simulated experiment above. The point of risk matching is not to win the AUC race; it is to keep the scorecard’s monotone risk structure intact when the binning step downstream insists on it. Two further notes. First, the rule generalizes from a single feature to a multivariate setting by replacing the univariate logit with a model that uses all observed features and solving for the imputed value of the missing column at the row’s own observed covariates. Second, the rule is a single-imputation device. If you need calibrated standard errors, draw the imputed value from the posterior predictive distribution of the univariate logit instead of using the point estimate, then average across draws using Rubin’s rules.

3.4.8 Missingness indicator interpretation

When a missingness indicator column is added, the logistic coefficient on it has a direct risk interpretation. Let $M_j$ be the indicator that feature $j$ is missing and $X_j^{\text{imp}}$ be the imputed value. The fitted model has the form

\[ \mathrm{logit} \Pr(Y=1 \mid X, M) = \alpha + \beta_j X_j^{\text{imp}} + \gamma_j M_j + \cdots \tag{3.8}\]

The coefficient $\gamma_j$ measures the risk premium (or discount) associated with the fact of missingness itself, holding the imputed value constant. A positive $\gamma_j$ with a substantial magnitude says that a missing-on-$X_j$ row is riskier than an observed row with the same imputed value, which is direct evidence that the missingness mechanism is MNAR. A near-zero $\gamma_j$ is evidence that the mechanism is plausibly MCAR or MAR.

Two cautions apply. First, the indicator is colinear with the imputed value if every imputed row shares the same value, which in mean imputation is always the case for a single feature. Regularized logistic regression handles this cleanly; unpenalized logistic regression may show nonfinite standard errors. Second, if two features have correlated missingness (for example, both self-reported income and self-reported employment length are blank for the same applicant), adding both indicators recovers almost all the useful signal but can make the model’s bias unstable across resampling. A joint “application-form incomplete” indicator often works better than two separate indicators.

3.5 Feature selection

Scorecards live with more features than they use. A fintech’s feature store routinely holds thousands of columns. A deployed model uses 10 to 60. The process of getting from one to the other is feature selection. Four approaches cover almost everything production teams deploy.

3.5.1 IV filter

The simplest screen is to rank every feature by information value and keep the top $K$, or keep every feature with $\mathrm{IV} \ge 0.02$. This is a univariate filter. It ignores correlations. In a scorecard with 500 candidate features and heavy correlation, it is nevertheless the right first step, because it cuts the search space from 500 to 100 or 50 before any multivariate method runs.

Show code

iv_filter = iv_tbl[iv_tbl["iv"] >= 0.02].reset_index(drop=True)
print(f"Features with IV >= 0.02: {len(iv_filter)} of {len(iv_tbl)}")
iv_filter.head(10).round(3)

Features with IV >= 0.02: 14 of 23

	name	iv	js	gini	n_bins
0	PAY_0	0.879211	0.098011	0.408066	4
1	PAY_2	0.557144	0.064793	0.303639	4
2	PAY_3	0.419835	0.04944	0.263657	4
3	PAY_4	0.367989	0.043082	0.236466	4
4	PAY_5	0.342765	0.039685	0.210781	4
5	PAY_6	0.295911	0.034716	0.202056	4
6	LIMIT_BAL	0.200189	0.024597	0.248464	8
7	PAY_AMT1	0.182789	0.022386	0.227024	7
8	PAY_AMT2	0.16787	0.020458	0.215512	8
9	PAY_AMT3	0.136166	0.016759	0.199464	8

On Taiwan most of the 23 numeric features clear the 0.02 threshold. The handful that fall below are demographic fields whose marginal signal is weak once LIMIT_BAL and the payment-delinquency block are in the pool. On a raw feature store with thousands of variables, two-thirds typically fall below.

3.5.2 LASSO

The LASSO (Tibshirani, 1996) adds an $\ell_1$ penalty to the logistic log-likelihood,

\[ \hat\beta(\lambda) = \arg\min_{\beta_0, \beta} \Big\{ -\tfrac{1}{n}\sum_{i=1}^n \ell_i(\beta_0, \beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}, \tag{3.9}\]

where $\ell_i$ is the per-observation log-likelihood. For $\lambda$ large, the solution is the null model; for $\lambda = 0$ the solution is ordinary logistic regression. Between those extremes, coefficients enter one at a time as $\lambda$ decreases, which gives the characteristic LASSO path. Features whose coefficients remain at zero across most of the path are dropped. The glmnet coordinate-descent algorithm (Friedman et al., 2010) is the standard solver.

Three properties make LASSO attractive for scorecard selection. First, it handles correlated features by picking one and shrinking the others, which reduces the collinearity problem that flat logistic regression suffers on WoE-encoded features. Second, the regularization path is cheap to compute, so a team can inspect the entire trajectory rather than committing to a single $\lambda$. Third, the elastic net extension (Zou & Hastie, 2005) smooths the all-or-nothing selection into a convex combination with ridge regularization, which is usually the better default in production.

Show code

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df_tw = load_taiwan_default().drop(columns=["id"])
y = df_tw["default"].values.astype(int)
X = df_tw.drop(columns=["default"]).values.astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
sc = StandardScaler().fit(X_tr)
X_tr_s = sc.transform(X_tr)
X_te_s = sc.transform(X_te)

Cs = np.logspace(-3, 1, 25)
feat_names = df_tw.drop(columns=["default"]).columns.tolist()
coefs = np.zeros((len(Cs), X_tr_s.shape[1]))
for i, C in enumerate(Cs):
    m = LogisticRegression(penalty="l1", solver="liblinear",
                           C=C, max_iter=2000)
    m.fit(X_tr_s, y_tr)
    coefs[i] = m.coef_[0]

fig, ax = plt.subplots(figsize=(7, 4.2))
for j in range(coefs.shape[1]):
    ax.plot(Cs, coefs[:, j], lw=1.0)
ax.set_xscale("log")
ax.axhline(0, color="k", lw=0.6)
ax.set_xlabel("Inverse regularization strength C (larger C = weaker penalty)")
ax.set_ylabel("Coefficient")
ax.set_title("LASSO logistic regression path on Taiwan default")
plt.tight_layout()
plt.show()

The shape of this plot is typical. At the strongest penalty most coefficients are zero; as the penalty weakens the PAY_0 family enters first, followed by LIMIT_BAL and PAY_AMT. When two highly collinear features are present, the LASSO picks one and keeps the other at zero until the penalty is very weak. That behavior is the motivation for the elastic net extension.

Show code

C_star = 0.02
m_star = LogisticRegression(penalty="l1", solver="liblinear",
                            C=C_star, max_iter=2000).fit(X_tr_s, y_tr)
kept = np.array(feat_names)[m_star.coef_[0] != 0]
dropped = np.array(feat_names)[m_star.coef_[0] == 0]
print(f"Kept   ({len(kept)}):   {list(kept)}")
print(f"Dropped ({len(dropped)}): {list(dropped)}")

Kept   (18):   [np.str_('LIMIT_BAL'), np.str_('SEX'), np.str_('EDUCATION'), np.str_('MARRIAGE'), np.str_('AGE'), np.str_('PAY_0'), np.str_('PAY_2'), np.str_('PAY_3'), np.str_('PAY_4'), np.str_('PAY_5'), np.str_('PAY_6'), np.str_('BILL_AMT1'), np.str_('PAY_AMT1'), np.str_('PAY_AMT2'), np.str_('PAY_AMT3'), np.str_('PAY_AMT4'), np.str_('PAY_AMT5'), np.str_('PAY_AMT6')]
Dropped (5): [np.str_('BILL_AMT2'), np.str_('BILL_AMT3'), np.str_('BILL_AMT4'), np.str_('BILL_AMT5'), np.str_('BILL_AMT6')]

At $C = 0.02$, the LASSO drops the weakest features and keeps a compact subset. The practical workflow is to cross-validate over $\lambda$ to pick the operating point, then stability-select across bootstrap resamples to drop features that enter the solution only intermittently.

3.5.3 Mutual information and permutation importance

Two nonparametric alternatives round out the toolkit.

Mutual information $I(X_j; Y) = \sum_{x,y} p(x,y) \ln \frac{p(x,y)}{p(x)p(y)}$ is closely related to IV. For a binary $Y$, $I(X_j; Y) = H(Y) - H(Y \mid X_j)$, which measures the expected reduction in label entropy from knowing $X_j$. sklearn.feature_selection.mutual_info_classif estimates it for continuous and discrete features.

Permutation importance (Altmann et al., 2010; Breiman, 2001) measures the drop in model performance when a single feature is randomly permuted on the validation set. It is model-agnostic and captures interactions that univariate IV misses. The cost is that permutation importance for correlated features is misleading: permuting one feature often leaves the information intact through its correlated neighbor, so the importance of both looks low. Conditional permutation variants partly fix this.

Show code

from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance

mi = mutual_info_classif(X_tr_s, y_tr, random_state=0)
lr = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr)
perm = permutation_importance(lr, X_te_s, y_te, n_repeats=10,
                              random_state=0, n_jobs=1)

def conditional_perm_importance(model, X, y, n_repeats=10,
                                corr_thresh=0.5, n_bins=4,
                                random_state=0):
    """Permute each feature within bins of its most-correlated neighbor.

    Approximates Strobl-style conditional permutation: if feature j is
    correlated with k, marginal permutation breaks j but leaves k intact,
    so the model recovers most of the signal and j looks unimportant.
    Permuting j within strata of k removes that leakage path.
    """
    rng = np.random.default_rng(random_state)
    base = model.score(X, y)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0)
    out = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        partner = int(np.argmax(corr[j]))
        if corr[j, partner] < corr_thresh:
            strata = np.zeros(X.shape[0], dtype=int)
        else:
            strata = pd.qcut(X[:, partner], q=n_bins,
                             labels=False, duplicates="drop")
        drops = np.empty(n_repeats)
        for r in range(n_repeats):
            Xp = X.copy()
            for s in np.unique(strata):
                idx = np.where(strata == s)[0]
                Xp[idx, j] = rng.permutation(Xp[idx, j])
            drops[r] = base - model.score(Xp, y)
        out[j] = drops.mean()
    return out

cond_perm = conditional_perm_importance(lr, X_te_s, y_te,
                                        n_repeats=10, random_state=0)

imp_df = (pd.DataFrame({
            "feature": feat_names,
            "MI": mi,
            "perm_importance": perm.importances_mean,
            "cond_perm_importance": cond_perm,
         })
         .sort_values("MI", ascending=False)
         .reset_index(drop=True))
imp_df.head(10).round(4)

	feature	MI	perm_importance	cond_perm_importance
0	PAY_0	0.0785	0.0398	0.0212
1	PAY_2	0.0485	0.0017	-0.0000
2	PAY_3	0.0392	0.0008	-0.0012
3	PAY_4	0.0367	0.0004	0.0002
4	PAY_6	0.0315	-0.0001	-0.0001
5	PAY_5	0.0278	0.0001	0.0005
6	PAY_AMT1	0.0226	-0.0004	-0.0003
7	PAY_AMT4	0.0210	0.0002	-0.0000
8	PAY_AMT2	0.0181	0.0004	0.0002
9	PAY_AMT3	0.0180	0.0002	0.0000

The mutual-information ranking and the marginal permutation-importance ranking agree on the top features with the IV ranking, which is the usual picture when the candidate set is clean. When they disagree, disagreement usually points to either collinearity (MI and IV up, permutation importance down) or a nonlinear interaction (permutation importance up, MI down). The conditional permutation column sharpens this read: features whose marginal importance was suppressed by a correlated neighbor recover here, because permuting within strata of that neighbor blocks the leakage path. A feature that looks important marginally but collapses under conditional permutation is mostly proxying for its partner; a feature that looks weak marginally but rises under conditional permutation carries information the rest of the matrix cannot reconstruct.

Reading the table, the top ten are all repayment-status (PAY_*) and payment-amount (PAY_AMT*) variables, consistent with the IV ranking in Section 3.3. PAY_0, the most recent repayment status, dominates on every column: $I(X;Y) \approx 0.079$, marginal permutation importance $\approx 0.040$, and conditional permutation importance $\approx 0.021$. It is the only feature whose importance survives conditioning, meaning its signal is not reconstructible from any single correlated neighbor.

The older repayment lags tell the collinearity story cleanly. PAY_2 through PAY_6 have non-trivial mutual information (roughly $0.03$-$0.05$) but their marginal permutation importance is already near zero, and conditional permutation drives it to zero or slightly negative. Mechanically, PAY_0 through PAY_6 are strongly autocorrelated across months, so shuffling PAY_2 alone barely hurts the model because PAY_0 (and the other lags) still carry the same delinquency signal. Mutual information is a univariate quantity and does not see this substitution, which is why the MI column stays elevated while both permutation columns collapse. The practical consequence is that the linear model is essentially reading PAY_0 and using the older lags as noisy confirmation; dropping PAY_3-PAY_6 would barely move held-out performance, though keeping them can still help in nonlinear models that exploit interactions (e.g. “was delinquent two months ago but recovered”).

The PAY_AMT* variables sit at the bottom of the table with MI around $0.02$ and permutation importance within sampling noise of zero (the small negative values at n_repeats = 10 are pure variance, not evidence against the feature). For a linear model on the standardized scale, raw payment amounts carry little marginal information once repayment status is known: a customer who is two months delinquent is risky regardless of whether last month’s payment was $500 or $5,000. These features typically become useful only after transformation (ratio to bill amount, log-scaling) or inside a model that can interact them with PAY_0.

3.5.4 Boruta

Boruta (Kursa & Rudnicki, 2010) is a wrapper method built on random forests. For each feature it creates a shadow feature that is a random permutation of the original. A random forest is fit on the augmented matrix, and each real feature is tested against the best shadow feature. Features that consistently beat their shadow are confirmed; features that lose consistently are rejected; borderline features are marked as tentative. Boruta is aggressive in retaining correlated relevant features, which is desirable when the downstream model can handle them, and it is slow.

Show code

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200, max_depth=7, n_jobs=-1,
    class_weight="balanced", random_state=0,
)
boruta = BorutaPy(
    estimator=rf, n_estimators="auto",
    max_iter=50, random_state=0, verbose=0,
)
boruta.fit(X_tr_s.astype(np.float64), y_tr.astype(int))

feat_arr   = np.array(feat_names)
confirmed  = feat_arr[boruta.support_]
tentative  = feat_arr[boruta.support_weak_]
rejected   = feat_arr[~(boruta.support_ | boruta.support_weak_)]

print(f"Confirmed ({len(confirmed)}): {confirmed.tolist()}")
print(f"Tentative ({len(tentative)}): {tentative.tolist()}")
print(f"Rejected  ({len(rejected)}): {rejected.tolist()}")

Confirmed (19): ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
Tentative (0): []
Rejected  (4): ['SEX', 'EDUCATION', 'MARRIAGE', 'AGE']

On Taiwan, Boruta confirms a broad core (around nineteen of the twenty-three predictors): LIMIT_BAL, every PAY_* delinquency counter, every BILL_AMT_*, and every PAY_AMT_*. The four demographics (SEX, EDUCATION, MARRIAGE, AGE) are rejected. This overlaps heavily with the LASSO choice at $C \approx 0.05$ and with the top of the IV ranking: the repayment-history block dominates regardless of which selector we trust.

3.5.5 Stability across methods

A useful diagnostic is to compare the top-10 features from each method. If three of four methods agree on a feature, it is very likely signal. If a feature appears in only one method, investigate. The literature, going back to Guyon & Elisseeff (2003), warns against over-reliance on any single score; practical scorecard teams combine a univariate filter (IV), a sparse model (LASSO), and a nonlinear wrapper (Boruta or permutation importance on a tree learner) before freezing the feature list.

3.5.6 Stability selection

A richer version of the stability argument is stability selection, introduced in the statistics literature for LASSO-type methods. The idea is to fit the LASSO on many bootstrap resamples and record how often each feature is selected. Features that appear in 80 or 90 percent of bootstraps are confirmed; features that appear in fewer than 50 percent are rejected; the intermediate band is marked for review. Stability selection has strong theoretical guarantees for controlling the expected number of false positives in high-dimensional settings.

Show code

n_boot = 30
C_sel = 0.05
boot_sel = np.zeros((n_boot, X_tr_s.shape[1]), dtype=int)
boot_rng_s = np.random.default_rng(0)
for b in range(n_boot):
    idx = boot_rng_s.integers(0, len(X_tr_s), size=len(X_tr_s))
    m = LogisticRegression(penalty="l1", solver="liblinear",
                           C=C_sel, max_iter=2000)
    m.fit(X_tr_s[idx], y_tr[idx])
    boot_sel[b] = (m.coef_[0] != 0).astype(int)

selection_freq = boot_sel.mean(axis=0)
stab = (pd.DataFrame({"feature": feat_names,
                      "selection_frequency": selection_freq})
          .sort_values("selection_frequency", ascending=False)
          .reset_index(drop=True))
stab.head(10).round(3)

	feature	selection_frequency
0	LIMIT_BAL	1.0
1	PAY_2	1.0
2	PAY_AMT2	1.0
3	PAY_AMT1	1.0
4	SEX	1.0
5	BILL_AMT1	1.0
6	PAY_0	1.0
7	AGE	1.0
8	EDUCATION	1.0
9	MARRIAGE	1.0

Features that appear in every bootstrap fit are the robust core of the model. Features that appear in 50 to 70 percent of fits are correlated with the robust core and will drop in and out depending on the resample. Features that appear in fewer than 30 percent of fits are weak and should be removed even if they clear the IV threshold.

3.5.7 Redundancy analysis

Univariate feature selection tells you which features carry signal; it does not tell you which features are redundant. Redundancy analysis uses the correlation structure of the candidate matrix to collapse highly correlated groups before fitting the downstream model.

Show code

import numpy as np
corr = np.corrcoef(X_tr_s, rowvar=False)
labels = feat_names

# find pairs with |r| > 0.8 among the kept features
pairs = []
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        if abs(corr[i, j]) > 0.8:
            pairs.append((labels[i], labels[j], round(float(corr[i, j]), 3)))
pd.DataFrame(pairs, columns=["feature_a", "feature_b", "correlation"]).head(12)

	feature_a	feature_b	correlation
0	PAY_4	PAY_5	0.817
1	PAY_5	PAY_6	0.813
2	BILL_AMT1	BILL_AMT2	0.953
3	BILL_AMT1	BILL_AMT3	0.901
4	BILL_AMT1	BILL_AMT4	0.863
5	BILL_AMT1	BILL_AMT5	0.834
6	BILL_AMT1	BILL_AMT6	0.808
7	BILL_AMT2	BILL_AMT3	0.936
8	BILL_AMT2	BILL_AMT4	0.894
9	BILL_AMT2	BILL_AMT5	0.862
10	BILL_AMT2	BILL_AMT6	0.834
11	BILL_AMT3	BILL_AMT4	0.933

Pairs with absolute correlation above 0.8 are the candidates for group collapse. One common rule is to keep the higher-IV member of each pair and drop the other. Another is to collapse the pair into a ratio or a difference feature that captures the residual signal. For scorecard work, the former is simpler and adequate; for gradient boosting, neither matters because the tree learner handles correlation natively.

3.5.8 Practical selection pipeline

A working recipe that survives audit:

Start with the full feature candidate set from the feature store.
Drop features with more than 50 percent missingness unless a missingness indicator is strongly predictive.
Compute IV on the training fold. Keep features with $\mathrm{IV} \ge 0.02$ and drop features with $\mathrm{IV} > 0.5$ for manual review (usually leakage).
Compute the correlation matrix on the kept features. For each pair with $|r| > 0.8$, keep the higher-IV member.
Run a LASSO path with stability selection. Keep features that appear in at least 70 percent of bootstraps at the cross-validated $\lambda$.
Optionally run Boruta as a consistency check. Features that survive both stability-selection LASSO and Boruta are the robust core.
Freeze the feature list. Document the rejection reason for every dropped feature in the model development document.

Every step should be reproducible from a seed and a snapshot of the training data. A scorecard that cannot be regenerated byte-for-byte from its inputs will fail SR 11-7 validation on its first audit cycle.

3.6 Temporal leakage and lookahead bias

The single most damaging data bug in credit scoring is using information at training time that would not be available at scoring time in production. Khandani et al. (2010) showed that machine-learning credit models trained on out-of-time cohorts can forecast delinquencies with substantial economic value, and that claim hinges entirely on correct temporal splits. In practice, this kind of bug is common, it is hard to detect with standard cross-validation, and it inflates backtest performance so dramatically that a model can look like a breakthrough until the first month of live scoring.

3.6.1 The time structure of credit data

Two calendars govern any credit data set. The first is the observation date, the point at which a score is computed. The second is the performance window, the forward-looking horizon over which the label is defined. A 12-month PD model might score at month $t_0$ and label the customer as a bad if they reach 90+ days past due at any point in $[t_0, t_0 + 12]$. For training, the label requires all data up to $t_0 + 12$, which means the most recent observation point for which a complete label exists is 12 months behind the data engineer’s clock.

Figure 3.1 makes the two-calendar structure explicit. Cohort A sits twelve months in the past, so its performance window closes at today and its label is fully observed. Cohort B is more recent; its observation date is only six months ago, so six months of its performance window has not happened yet. Cohort B cannot enter the training set until the calendar advances enough for its label to resolve.

Show code

import matplotlib.patches as mpatches

fig, ax = plt.subplots(figsize=(7.6, 3.6))

gray   = "#C8C8C8"
blue   = "#4C72B0"
red    = "#B03030"

def draw_cohort(ax, y, t0, horizon, feat_start, today,
                label_left, resolved):
    ax.barh(y, width=t0 - feat_start, left=feat_start,
            height=0.45, color=gray, edgecolor="black", lw=0.6)
    observed_end = min(t0 + horizon, today)
    if observed_end > t0:
        ax.barh(y, width=observed_end - t0, left=t0,
                height=0.45, color=blue, edgecolor="black", lw=0.6)
    if t0 + horizon > today:
        ax.barh(y, width=(t0 + horizon) - max(t0, today),
                left=max(t0, today), height=0.45,
                facecolor="white", edgecolor=blue, lw=0.8, hatch="//")
    ax.plot([t0, t0], [y - 0.28, y + 0.28], color="black", lw=1.2)
    ax.text(t0, y + 0.42, r"$t_0$", ha="center", fontsize=9)
    ax.text(t0 + horizon, y + 0.42, r"$t_0 + 12$",
            ha="center", fontsize=9)
    tag = "label observed" if resolved else "label not yet observed"
    ax.text(feat_start - 0.8, y, f"{label_left}\n({tag})",
            ha="right", va="center", fontsize=9)

today   = 0
horizon = 12

draw_cohort(ax, y=2.55, t0=-12, horizon=horizon, feat_start=-28,
            today=today, label_left="Cohort A", resolved=True)
draw_cohort(ax, y=1.05, t0=-6,  horizon=horizon, feat_start=-22,
            today=today, label_left="Cohort B", resolved=False)

ax.axvline(today, color=red, lw=1.2, ls="--")
ax.text(today + 0.4, 4.05, "today (engineer's clock)",
        color=red, fontsize=9, ha="left", va="center")

ax.annotate("", xy=(0, 3.55), xytext=(-12, 3.55),
            arrowprops=dict(arrowstyle="<->", color="black", lw=1))
ax.text(-6, 3.68, "12-month lag = performance horizon",
        ha="center", fontsize=8.8)

ax.set_xlim(-30, 8)
ax.set_ylim(0, 4.3)
ax.set_xticks([-24, -18, -12, -6, 0, 6])
ax.set_xticklabels(["t-24", "t-18", "t-12", "t-6", "today", "t+6"])
ax.set_xlabel("Calendar time (months relative to today)")
ax.set_yticks([])
for sp in ["top", "right", "left"]:
    ax.spines[sp].set_visible(False)

legend_handles = [
    mpatches.Patch(facecolor=gray, edgecolor="black",
                   label="Feature history (pre-$t_0$)"),
    mpatches.Patch(facecolor=blue, edgecolor="black",
                   label="Performance window, observed"),
    mpatches.Patch(facecolor="white", edgecolor=blue, hatch="//",
                   label="Performance window, not yet observed"),
]
ax.legend(handles=legend_handles, loc="lower center",
          bbox_to_anchor=(0.5, -0.32), ncol=3,
          frameon=False, fontsize=8.5)

plt.tight_layout()
plt.show()

Figure 3.1: Observation date and performance window. A 12-month PD model scores at $t_0$ and labels a bad if delinquency hits 90+ DPD inside $[t_0, t_0+12]$. Cohort A sits far enough in the past that its label is fully observed and the row is trainable. Cohort B’s observation date is only six months old, so half of its performance window is unobserved (hatched) and the cohort cannot yet enter training. The newest trainable observation date trails today by exactly the performance horizon.

This architecture implies three rules:

Features must be computable strictly from data known at the observation date.
Labels must come from data in the performance window.
The split between training and test must respect the ordering of observation dates, not the ordering of label dates.

Violate rule 1, and the model is leaky. Violate rule 2, and the label is wrong. Violate rule 3, and the evaluation is optimistic.

3.6.2 Types of leakage

Four kinds of temporal leakage show up in scorecard pipelines:

Direct target leakage: a feature includes the label. Happens when a team builds a feature from a table that has already been updated with performance outcomes. The “30+ in month $t$” flag sourced from a data warehouse version that has processed later data is the classic example.
Aggregate leakage: a feature uses a statistic computed over a pool that includes future observations. Mean encodings computed over the whole data set are the archetype.
Split leakage: a customer appears in both train and test because a random split ignored time or customer identity. Common with repeat borrowers.
Snapshot leakage: a feature is pulled from a production system that updates continuously, without anchoring to a specific as-of date. The feature value at training time differs from its value at scoring time because the underlying record has changed.

3.6.3 A reproducible bug and its fix

We simulate 24 monthly cohorts with a regime change at month 18. We engineer a leaky feature that uses the same-cohort default rate, and a non-leaky feature that uses the lagged default rate from the previous three months. We then compare a random split against an out-of-time split.

Show code

rng_t = np.random.default_rng(0)
N_per = 2000
rows = []
for m in range(24):
    x = rng_t.normal(0, 1, N_per)
    state = 1.0 if m >= 18 else 0.0
    logits = -1.2 + 0.9 * x + 2.0 * state
    p = stable_sigmoid(logits)
    y = (rng_t.random(N_per) < p).astype(int)
    rows.append(pd.DataFrame({"month": m, "x": x, "y": y}))
df_t = pd.concat(rows, ignore_index=True)

global_mean = df_t["y"].mean()
df_t["same_month_default_rate"] = df_t.groupby("month")["y"].transform("mean")

roll = df_t.groupby("month")["y"].mean()
lagged = roll.shift(1).rolling(3, min_periods=1).mean()
df_t["prior_months_default_rate"] = df_t["month"].map(lagged.to_dict()).fillna(global_mean)

df_t.head(3)

	x	y	same_month_default_rate	prior_months_default_rate
0	0.125730	0	0.2625	0.36375
1	-0.132105	0	0.2625	0.36375
2	0.640423	1	0.2625	0.36375

Both engineered columns are summaries of the default rate (the fraction of borrowers who defaulted in a given set of months). They differ only in which months go into the summary, and that single difference decides whether the feature is admissible. The preview shows three borrowers who all sit in month = 0, so any statistic that is defined per month takes the same value across the three rows; that repetition is an artifact of grouping, not redundancy between the columns.

x is the single idiosyncratic predictor, drawn independently per borrower from $\mathcal{N}(0, 1)$.
y is the realized default indicator: $0$ for rows 0 and 1, $1$ for row 2. In the real world this column is only populated after the performance window closes, typically 12 months after the observation date.
same_month_default_rate reads $0.2625$ on every row in the month-0 cohort. That number is literally df_t[df_t["month"]==0]["y"].mean(): the default rate of the current cohort, computed from every borrower’s label in that month, including the borrower sitting in the current row. Row 2’s feature value uses row 2’s own y = 1 as part of the average. This is the aggregate-leakage archetype from the list above: the statistic is defined over a pool that contains the future.
prior_months_default_rate reads $0.36375$, which for month 0 is just the global mean of y across all 24 months. The rolling calculation asks for the default rate of the previous three months, but month 0 has no prior history, so fillna(global_mean) plugs the missing window with the unconditional base rate. From month 3 onward this column becomes the true three-month trailing default rate, computed only from cohorts whose performance windows have already closed.

The two columns look alike on purpose: both are “average of y for some set of months”, and both are constant across borrowers inside a cohort. That surface similarity is the trap. What distinguishes them is the window. same_month_default_rate reaches forward into labels that do not yet exist at scoring time; prior_months_default_rate reaches only backward, into cohorts that have already resolved. The correct way to judge any feature is to ask what would have to happen in production to reproduce its value for a live applicant.

For same_month_default_rate, the recipe is “take the mean of $y$ among all borrowers in the same cohort as this applicant”. The applicant’s cohort is the current month. Their own label will not exist for another 12 months. Neither will the labels of the other borrowers booked in the same month. A production scorer cannot compute this feature, cannot approximate it, and cannot substitute a plausible placeholder without changing the model. The column is a training-time ghost: it looks informative because it is smuggling the answer in as an input, and a naive random split carries that cheat straight into the test fold.

For prior_months_default_rate, the recipe is “take the mean of $y$ from cohorts 1, 2, and 3 months before the applicant’s cohort”. If today is March 2026 and the observation date is March 2026, the relevant cohorts are December 2025, January 2026, and February 2026 only after their labels have resolved. For a 12-month PD that means the feature is usable at scoring time if we interpret “lag” in terms of fully-observed label months, so in practice the lookup runs against the December 2024, January 2025, and February 2025 bookings (13 to 15 months ago), whose performance windows closed months ago. The value on March 2026’s scoring job is byte-for-byte identical to the value the training pipeline would have computed for a March 2026 observation date. The feature is honest: the causal arrow runs strictly from past to future.

The general test is a single question you should ask of every engineered column before it enters the design matrix: if I froze the entire database at the observation date, could I still compute this value? If yes, the feature is admissible. If no, no amount of cross-validation machinery will save the model from its first live scoring month; the random-versus-out-of-time comparison in the next chunk is exactly the diagnostic that exposes the gap.

Show code

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

features_leak = ["x", "same_month_default_rate"]
features_ok   = ["x", "prior_months_default_rate"]

tr_rand, te_rand = train_test_split(
    df_t, test_size=0.25, random_state=0, stratify=df_t["y"]
)
m_leak_r = LogisticRegression(max_iter=1000).fit(
    tr_rand[features_leak], tr_rand["y"])
auc_leak_rand = roc_auc_score(
    te_rand["y"], m_leak_r.predict_proba(te_rand[features_leak])[:, 1])

m_ok_r = LogisticRegression(max_iter=1000).fit(
    tr_rand[features_ok], tr_rand["y"])
auc_ok_rand = roc_auc_score(
    te_rand["y"], m_ok_r.predict_proba(te_rand[features_ok])[:, 1])

tr_oot = df_t[df_t["month"] <= 15]
te_oot = df_t[df_t["month"] >= 18]
m_leak_o = LogisticRegression(max_iter=1000).fit(
    tr_oot[features_leak], tr_oot["y"])
auc_leak_oot = roc_auc_score(
    te_oot["y"], m_leak_o.predict_proba(te_oot[features_leak])[:, 1])

m_ok_o = LogisticRegression(max_iter=1000).fit(
    tr_oot[features_ok], tr_oot["y"])
auc_ok_oot = roc_auc_score(
    te_oot["y"], m_ok_o.predict_proba(te_oot[features_ok])[:, 1])

pd.DataFrame({
    "split":   ["random", "random", "OOT", "OOT"],
    "feature": ["leaky", "lagged", "leaky", "lagged"],
    "AUC":     [auc_leak_rand, auc_ok_rand, auc_leak_oot, auc_ok_oot],
}).round(4)

	split	feature	AUC
0	random	leaky	0.7913
1	random	lagged	0.7702
2	OOT	leaky	0.7307
3	OOT	lagged	0.7302

The leaky feature looks best on a random split, because the random split leaks the regime information across train and test. Under the out-of-time split the leaky feature loses its advantage, because the training cohorts end at month 15 and the test cohorts start at month 18, so the training-time aggregate carries no regime signal. An honest backtest pulls the feature back down to its true value. The morality of the exercise is not that the leaky feature is useless. The leakage-induced lift is what is useless, and the backtest has to be structured to kill it.

3.6.4 Point-in-time feature construction

The discipline that prevents leakage is point-in-time (PIT) feature engineering. A PIT feature store stores every feature with two timestamps: the event time, when the underlying fact occurred, and the as-of time, when that fact became visible to the lender. When a training row is built for observation date $t_0$, only facts with as-of time $\le t_0$ are joined. Systems like Feast, Tecton, and Databricks Feature Store expose this temporal-join primitive natively. López de Prado (2018) makes the same argument for financial machine learning, where the analog is survivorship bias and lookahead bias in backtesting (MacKinlay, 1997).

For cross-validation over time, Bergmeir et al. (2018) show that random K-fold on time series data is systematically biased. The right cross-validation scheme for scorecards is an expanding-window or rolling-window walk-forward: train on months 1 to $K$, test on month $K+1$, roll forward by one month, repeat. This gives $T - K$ out-of-time estimates, which can be averaged to give a performance distribution rather than a point estimate.

3.6.5 A checklist for diagnosing leakage

Did the training labels use data dated after the observation date?
Are any features computed from aggregates over the full panel rather than from per-row history?
Does the same customer appear in both train and test?
If a feature is time-varying, does the training snapshot of it match the production scoring snapshot?
Does the OOT AUC match the random-split AUC? If OOT is substantially lower, that is probably honest deterioration. If OOT is higher, something is wrong.

3.6.6 Walk-forward cross-validation

Random K-fold cross-validation is the default in scikit-learn and the wrong default for credit data. Random K-fold treats the rows as exchangeable, which they are not when there is a time index. A row from month 24 in the training fold and a row from month 12 in the test fold creates an implicit leak: the model sees a future row and then evaluates on a past row. Bergmeir et al. (2018) show that for stationary time series this can be benign, but for regime-shifting data it inflates performance.

Walk-forward cross-validation fits the time structure directly. For a data set spanning months $1$ through $T$, pick an initial training window $[1, K]$ and a test window $[K+1, K+h]$. Fit the model on the training window and evaluate on the test window. Roll the training window forward by $h$ months and repeat. This produces $(T - K) / h$ out-of-sample evaluations, each of which is a genuine out-of-time test. The sklearn TimeSeriesSplit class implements the basic variant.

Show code

from sklearn.model_selection import TimeSeriesSplit

df_t_sorted = df_t.sort_values("month").reset_index(drop=True)
tscv = TimeSeriesSplit(n_splits=5)
aucs = []
for train_idx, test_idx in tscv.split(df_t_sorted):
    tr = df_t_sorted.iloc[train_idx]
    te = df_t_sorted.iloc[test_idx]
    m = LogisticRegression(max_iter=1000).fit(tr[features_ok], tr["y"])
    p = m.predict_proba(te[features_ok])[:, 1]
    aucs.append(roc_auc_score(te["y"], p))

pd.DataFrame({
    "fold": range(1, len(aucs) + 1),
    "AUC": [round(a, 4) for a in aucs],
})

	fold	AUC
0	1	0.7268
1	2	0.7321
2	3	0.7300
3	4	0.6887
4	5	0.6945

Walk-forward AUCs give a performance distribution that incorporates regime shifts. The standard deviation across folds is a better summary of production risk than a single holdout AUC because it reflects how the model behaves across real cohorts.

3.6.7 Label maturation and the cold-start problem

Credit labels mature slowly. A “bad” defined as 90+ days past due at any point in a 12-month horizon cannot be observed until 12 months after origination. A 60-day performance window takes 60 days; a 24-month window takes 24 months. In practice, scorecard teams use a two-calendar workflow. The label calendar defines the earliest origination date for which a complete label exists; the feature calendar defines the latest date on which features are available. Training data lives in the intersection.

The consequence is that a scorecard trained at time $T$ on a 12-month bad definition has its most recent complete cohort at $T - 12$ months. The 12 months between $T - 12$ and $T$ contain applications that have not had time to develop, but they have the most current feature distributions. Most banks use those immature cohorts only for monitoring, not for training, and accept the constraint that the training data is always 12 to 18 months stale. Survival analysis (Chapter 9) provides a way to use immature cohorts in training by censoring them, which is its principal operational advantage over binary classification in credit.

3.6.8 Survivorship and sample-selection bias

Related to leakage is survivorship and sample-selection bias. A scorecard trained on the population of applications that were approved and booked systematically excludes the population that was declined, which is the population the scorecard is trying to classify. Heckman (1979) formalized the bias this induces. In credit, it is addressed through reject inference (Chapter 10) and through careful cohort construction: the scorecard development sample should, where possible, include reject bureau pulls so that the model sees the full application flow rather than only the approved subset. Even then, the bias persists unless the reject inference is done correctly, and most scorecards carry a structural skew toward the approved population’s risk profile.

3.6.9 Cohort effects from policy changes

A lender that tightens its underwriting rules at month 10 creates a structural break in the training data: approvals after month 10 are a different population than approvals before month 10. A model trained on the pooled data estimates a blend of the two populations and predicts poorly on either. The “fix” usually summarized as “treat the policy change as a cohort indicator” is true but useless on its own. The actual fix is a four-stage pipeline:

collect the policy events as structured data,
join them onto the training rows so every observation knows its policy version,
detect breaks the policy log missed,
choose a modeling response per regime and validate it walk-forward.

Every stage is concrete and worth implementing in code.

Stage 1: Collect policy events as structured data

The model risk discipline starts with a policy_log table maintained by credit policy, not by data science. The minimum schema is below; in practice, banks add columns for the approving committee, the legal-vetting status, and a free-text rationale.

Show code

import pandas as pd

policy_log = pd.DataFrame([
    {"policy_id": "P-2024-03", "effective_month": 6,
     "scope_product": "term-loan", "scope_segment": "all",
     "rule_change": "min_x raised from -inf to -0.25",
     "expected_effect": "approval_rate down ~15pct"},
    {"policy_id": "P-2024-09", "effective_month": 12,
     "scope_product": "term-loan", "scope_segment": "all",
     "rule_change": "min_x raised from -0.25 to 0.10; min_score raised 20pts",
     "expected_effect": "approval_rate down ~20pct, default_rate down ~25pct"},
    {"policy_id": "P-2025-02", "effective_month": 18,
     "scope_product": "term-loan", "scope_segment": "thin-file",
     "rule_change": "thin-file applicants require co-signer",
     "expected_effect": "thin-file mix down sharply"},
])
policy_log

	policy_id	effective_month	scope_product	scope_segment	rule_change	expected_effect
0	P-2024-03	6	term-loan	all	min_x raised from -inf to -0.25	approval_rate down ~15pct
1	P-2024-09	12	term-loan	all	min_x raised from -0.25 to 0.10; min_score rai...	approval_rate down ~20pct, default_rate down ~...
2	P-2025-02	18	term-loan	thin-file	thin-file applicants require co-signer	thin-file mix down sharply

The effective_month column is a date in production (effective_at TIMESTAMP); we use integer months here so the example aligns with df_t. The scope_* columns matter: a policy that fires only on a sub-segment must not be applied to the rest of the book. Store this table in the same warehouse as the application data and version it; an immutable append-only log with valid_from and valid_to columns is the right shape, so retroactive corrections do not silently rewrite history.

Stage 2: Simulate the application stream and join policy versions

We rebuild a 24-cohort applicant stream where the latent risk variable x is generated for every applicant, but the booking decision and therefore the training population depends on the active policy.

Show code

import numpy as np

rng_p = np.random.default_rng(7)
N_per = 2000

def active_policy(month: int) -> str:
    if month >= 18: return "P-2025-02"
    if month >= 12: return "P-2024-09"
    if month >= 6:  return "P-2024-03"
    return "P-baseline"

def min_x_for(policy_id: str) -> float:
    return {"P-baseline": -np.inf, "P-2024-03": -0.25,
            "P-2024-09":  0.10,    "P-2025-02":  0.10}[policy_id]

apps = []
for m in range(24):
    x = rng_p.normal(0, 1, N_per)
    thin_file = rng_p.random(N_per) < 0.20
    has_cosigner = rng_p.random(N_per) < 0.15
    pol = active_policy(m)
    cutoff = min_x_for(pol)
    booked = x >= cutoff
    if pol == "P-2025-02":
        booked &= ~thin_file | has_cosigner
    logits = -1.0 + 0.9 * x - 0.4 * thin_file.astype(float)
    p = stable_sigmoid(logits)
    y = (rng_p.random(N_per) < p).astype(int)
    apps.append(pd.DataFrame({
        "month": m, "x": x, "thin_file": thin_file.astype(int),
        "has_cosigner": has_cosigner.astype(int),
        "booked": booked.astype(int), "y": y,
    }))
df_p = pd.concat(apps, ignore_index=True)
df_p["policy_id"] = df_p["month"].map(active_policy)
booked = df_p[df_p["booked"] == 1].reset_index(drop=True)
booked.head(3)

	x	thin_file	has_cosigner	booked	y	policy_id
0	0.001230	0	1	1	1	P-baseline
1	0.298746	0	0	1	1	P-baseline
2	-0.274138	1	0	1	0	P-baseline

Two things just happened that matter operationally. First, df_p is the application stream and booked is the training stream; the gap between them is exactly the survivorship bias the previous section warned about, and the policy log is the only honest record of why the gap is there. Second, the join from policy log to applications is by (month, scope_product, scope_segment), not by month alone. In production, this is a LEFT JOIN ... AND application_date BETWEEN policy.valid_from AND policy.valid_to AND application.product = policy.scope_product against the warehouse, run once at training-set construction time and stored alongside the row.

Stage 3: Detect breaks the policy log missed

Two things go wrong. The policy team forgets to log a soft change (a credit officer dialing up manual overrides), or an external event (a macro shock, a competitor exit) creates a structural break with no internal policy attached. Detection is mechanical and worth running on every refresh.

Show code

monthly = booked.groupby("month").agg(
    n=("y", "size"),
    default_rate=("y", "mean"),
    mean_x=("x", "mean"),
    thin_share=("thin_file", "mean"),
)

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e, _ = np.histogram(expected, bins=edges)
    a, _ = np.histogram(actual,   bins=edges)
    e = np.where(e == 0, 1, e) / max(e.sum(), 1)
    a = np.where(a == 0, 1, a) / max(a.sum(), 1)
    return float(np.sum((a - e) * np.log(a / e)))

baseline_x = booked.loc[booked["month"] < 6, "x"].to_numpy()
psi_by_month = {
    m: psi(baseline_x, booked.loc[booked["month"] == m, "x"].to_numpy())
    for m in sorted(booked["month"].unique()) if m >= 6
}
psi_series = pd.Series(psi_by_month, name="psi_x_vs_baseline")

dr = monthly["default_rate"].to_numpy()
target = dr[:6].mean()
cusum = np.cumsum(dr - target)
chow_break = int(np.argmax(np.abs(cusum)))

pd.DataFrame({
    "psi_max":      [round(psi_series.max(), 3)],
    "psi_argmax":   [int(psi_series.idxmax())],
    "cusum_break":  [chow_break],
}).assign(known_policy_months=[[6, 12, 18]])

	psi_max	psi_argmax	cusum_break	known_policy_months
0	2.663	15	23	[6, 12, 18]

Read the row: the largest PSI on x against the pre-policy baseline lines up with the months in known_policy_months, and the CUSUM argmax lands on or near the same boundary. PSI > 0.25 is the conventional “significant shift” threshold; CUSUM peaks identify the candidate break date. Both are unsupervised and run independently of the policy log, so a discrepancy is the signal that the log is incomplete. For change-point detection on multivariate streams, ruptures (Pelt, Binseg, Window) implements the full family in Truong et al. (2020); it accepts a cost function and returns segment boundaries, which is what you want when you suspect more than one break and do not know K in advance.

Show code

import ruptures as rpt

# Pelt minimizes a penalized cost function over all possible segmentations;
# the rbf cost handles non-linear regime shifts. The returned list contains
# the inclusive endpoints of each segment, with len(dr) as the final entry.
algo = rpt.Pelt(model="rbf").fit(dr.reshape(-1, 1))
breaks_pelt = algo.predict(pen=2.0)
print("ruptures Pelt breaks (segment endpoints):", breaks_pelt)

ruptures Pelt breaks (segment endpoints): [5, 15, 24]

The Pelt output is a list of segment boundaries; with the policy log claiming breaks at months 6, 12, and 18, you should see endpoints near those values. The pen argument is the only hyperparameter that needs tuning; raise it to suppress spurious breaks, lower it if you suspect breaks the algorithm is missing. For a multivariate signal (default rate, mean x, and thin-file share jointly), pass a 2-D array of shape (T, K) instead of dr.reshape(-1, 1); the rbf cost generalizes without code changes.

Once a candidate break date is in hand, the classical hypothesis-test version is the Chow (1960) F-test for parameter equality across two regression segments. The full machinery is one statsmodels call per candidate date for the F-test, plus one global call for CUSUM-of-residuals (statsmodels.stats.diagnostic.breaks_cusumolsresid).

Show code

import statsmodels.api as sm
from scipy.stats import f as f_dist
from statsmodels.stats.diagnostic import breaks_cusumolsresid

months_arr = np.arange(len(dr))

def chow_f(y: np.ndarray, x: np.ndarray, k: int) -> tuple[float, float]:
    """Chow F-test that two OLS regressions on (x, y) have equal coefficients
    on either side of index k. Returns (F, p)."""
    X = sm.add_constant(x)
    rss_r = sm.OLS(y, X).fit().ssr
    rss_u = sm.OLS(y[:k], X[:k]).fit().ssr + sm.OLS(y[k:], X[k:]).fit().ssr
    df_n  = X.shape[1]
    df_d  = len(y) - 2 * X.shape[1]
    F = ((rss_r - rss_u) / df_n) / (rss_u / df_d)
    return float(F), float(1 - f_dist.cdf(F, df_n, df_d))

chow_table = pd.DataFrame(
    [{"candidate_break": k, "F": F, "p_value": p}
     for k in [6, 12, 18]
     for F, p in [chow_f(dr, months_arr.astype(float), k)]]
).round(4)

ols   = sm.OLS(dr, sm.add_constant(months_arr.astype(float))).fit()
stat, p_cusum, _ = breaks_cusumolsresid(ols.resid, ddof=2)

print(chow_table.to_string(index=False))
print(f"CUSUM-of-residuals: stat={stat:.3f}  p={p_cusum:.4f}")

 candidate_break       F  p_value
               6 34.2868   0.0000
              12  7.5605   0.0036
              18  5.1323   0.0159
CUSUM-of-residuals: stat=1.179  p=0.1243

Read the F-test column: the candidate break with the smallest p_value is the most defensible split point, and p < 0.05 (with 0.01 the conventional bar in a model risk submission) is the formal evidence the chair of the model risk committee will ask for. CUSUM-of-residuals is a single global test: a small p-value rejects the null of “stable coefficients across the entire sample” without committing to a specific break date. Use ruptures to find candidate break dates, chow_f to rank them, then CUSUM as a sanity check that something broke at all.

Stage 4: Pick a modeling response and backtest it walk-forward

The four candidate responses to a documented break, in increasing sophistication, are listed below. Each is a one-line change to the training stage, but each makes a different assumption and has a different sample-size cost.

Show code

feats = ["x", "thin_file", "has_cosigner"]

def fit_pred(tr, te, sample_weight=None, extra_feats=()):
    cols = feats + list(extra_feats)
    m = LogisticRegression(max_iter=1000).fit(
        tr[cols], tr["y"], sample_weight=sample_weight)
    return roc_auc_score(te["y"], m.predict_proba(te[cols])[:, 1])

def importance_weights(tr, te):
    z = pd.concat([tr.assign(_d=0), te.assign(_d=1)], ignore_index=True)
    dc = LogisticRegression(max_iter=1000).fit(z[feats], z["_d"])
    p = dc.predict_proba(tr[feats])[:, 1]
    p = np.clip(p, 0.05, 0.95)
    w = p / (1 - p)
    return w / w.mean()

booked_sorted = booked.sort_values(["month"]).reset_index(drop=True)
booked_sorted["policy_post"] = (booked_sorted["month"] >= 12).astype(int)
tscv_p = TimeSeriesSplit(n_splits=5)

results = []
for fold, (tr_idx, te_idx) in enumerate(tscv_p.split(booked_sorted), start=1):
    tr = booked_sorted.iloc[tr_idx]
    te = booked_sorted.iloc[te_idx]
    pooled    = fit_pred(tr, te)
    subset_tr = tr[tr["policy_id"] == te["policy_id"].mode().iloc[0]]
    subset    = (fit_pred(subset_tr, te) if len(subset_tr) >= 200 else np.nan)
    indicator = fit_pred(tr, te, extra_feats=("policy_post",))
    iw        = fit_pred(tr, te, sample_weight=importance_weights(tr, te))
    results.append({"fold": fold, "te_month_min": int(te["month"].min()),
                    "pooled": pooled, "subset_post": subset,
                    "indicator": indicator, "importance_weighted": iw})
pd.DataFrame(results).round(4)

	fold	te_month_min	pooled	subset_post	indicator	importance_weighted
0	1	2	0.7158	0.7158	0.7158	0.7158
1	2	4	0.7014	NaN	0.7014	0.7015
2	3	8	0.6507	0.6499	0.6507	0.6504
3	4	12	0.6555	0.6553	0.6555	0.6554
4	5	17	0.6447	NaN	0.6447	0.6448

The four columns implement four distinct fixes:

pooled is the do-nothing baseline. It blends regimes and is the failure mode the section warns about.
subset_post keeps only rows whose policy_id matches the regime currently in test. This is the cleanest answer when the post-policy population is large enough; the cost is sample size, and the rule of thumb (used above) is to fall back to pooled below 200 training rows.
indicator keeps the full sample but adds a policy_post 0/1 control. This assumes the slope on x is constant across regimes and only the intercept moves. Add a policy_post * x interaction term if you want to relax that assumption; the F-test on the interaction is exactly the Chow (1960) test.
importance_weighted keeps the full sample but reweights training rows by the density ratio between the test regime and the train regime, estimated by a domain classifier. This is the Sugiyama et al. (2007) covariate-shift correction. It dominates the indicator approach when the shift is in P(x) (the population mix changed) rather than P(y | x) (the relationship changed).

The right choice is data-dependent and is exactly what the walk-forward backtest above is for: the column with the highest mean and lowest std across folds wins. There is no universally correct answer, only an empirically defensible one.

Stage 5: Wire it into the training pipeline

Once the chosen fix is selected, it lives in the feature/data pipeline, not in a notebook. The minimum production-grade integration has four pieces:

Show code

import tempfile, warnings
import polars as pl
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import mlflow

warnings.filterwarnings("ignore", category=FutureWarning, module="mlflow")

POLICY_BREAK_MONTH = 12
PSI_THRESHOLD      = 0.25
BASELINE           = (0, 5)
CURRENT_POLICY     = "P-2024-09"

# Reuse the simulated booked applications, plus a `product` column for the join.
applications = pl.from_pandas(
    booked.assign(product="term-loan")[["month", "x", "thin_file",
                                        "has_cosigner", "y", "product"]]
)
policy_log_pl = pl.DataFrame([
    {"policy_id": "P-baseline", "valid_from":  0, "valid_to":  5,
     "scope_product": "term-loan", "effective_month":  0},
    {"policy_id": "P-2024-03",  "valid_from":  6, "valid_to": 11,
     "scope_product": "term-loan", "effective_month":  6},
    {"policy_id": "P-2024-09",  "valid_from": 12, "valid_to": 17,
     "scope_product": "term-loan", "effective_month": 12},
    {"policy_id": "P-2025-02",  "valid_from": 18, "valid_to": 99,
     "scope_product": "term-loan", "effective_month": 18},
])

# 1. Materialize the policy_log join into the training table. The join key in
#    production is `application_date BETWEEN valid_from AND valid_to`; here
#    `month` plays the role of `application_date`.
training = (applications
    .join(policy_log_pl, left_on="product", right_on="scope_product", how="left")
    .filter(pl.col("month").is_between(pl.col("valid_from"), pl.col("valid_to")))
    .with_columns(policy_post=(pl.col("effective_month") >= POLICY_BREAK_MONTH).cast(pl.Int8))
).to_pandas()

# 2. Run break detection on every refresh; alert if undocumented.
def monthly_psi(df: pd.DataFrame, baseline_window: tuple[int, int]) -> pd.Series:
    base = df.loc[df["month"].between(*baseline_window), "x"].to_numpy()
    return pd.Series({m: psi(base, df.loc[df["month"] == m, "x"].to_numpy())
                      for m in sorted(df["month"].unique()) if m > baseline_window[1]})

documented_break_months = set(policy_log_pl["effective_month"].to_list()) - {0}
psi_series   = monthly_psi(training, BASELINE)
psi_alert    = psi_series > PSI_THRESHOLD
undocumented = sorted(set(psi_alert[psi_alert].index) - documented_break_months)

# 3. Apply the chosen reweighting / subsetting in a sklearn Pipeline step. The
#    RegimeFilter restricts which policies feed the fit; row filtering happens
#    on the input frame, the transformer is a no-op pass-through so the same
#    pipeline serializes cleanly for inference.
class RegimeFilter(BaseEstimator, TransformerMixin):
    def __init__(self, allowed_policies):
        self.allowed_policies = list(allowed_policies)
    def fit(self, X, y=None): return self
    def transform(self, X):    return X

allowed = [CURRENT_POLICY, "P-2025-02"]
mask    = training["policy_id"].isin(allowed)
tr      = training[mask & (training["month"] <  20)].reset_index(drop=True)
te      = training[mask & (training["month"] >= 20)].reset_index(drop=True)

pipeline = Pipeline([
    ("regime_filter", RegimeFilter(allowed_policies=allowed)),
    ("scaler",        StandardScaler()),
    ("clf",           LogisticRegression(max_iter=1000)),
]).fit(tr[feats], tr["y"])

# 4. Track per-policy AUC in MLflow as a first-class metric, not a sidecar.
mlflow.set_tracking_uri(f"file://{tempfile.mkdtemp(prefix='mlruns_')}")
mlflow.set_experiment("regime-monitoring")
with mlflow.start_run() as run:
    mlflow.log_metric("auc_overall",
        roc_auc_score(te["y"], pipeline.predict_proba(te[feats])[:, 1]))
    for pol, sub in te.groupby("policy_id"):
        if sub["y"].nunique() < 2:
            continue
        mlflow.log_metric(f"auc_{pol}",
            roc_auc_score(sub["y"], pipeline.predict_proba(sub[feats])[:, 1]))
    logged = mlflow.get_run(run.info.run_id).data.metrics

pd.DataFrame({
    "policy_post_share":      [round(training["policy_post"].mean(), 3)],
    "psi_breach_count":       [int(psi_alert.sum())],
    "undocumented_breaks":    [undocumented],
    "logged_metric_names":    [sorted(logged.keys())],
})

	policy_post_share	psi_breach_count	undocumented_breaks	logged_metric_names
0	0.348	18	[7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 19, 20, ...	[auc_P-2025-02, auc_overall]

The four numbered steps map one-to-one to the failure modes the section listed. Step 1 forces every row to know its regime via a temporal BETWEEN join against policy_log and prevents silent drift; the join is a single Polars expression in development and a single SQL statement in production. Step 2 catches breaks the policy team forgot to log: the diff psi_breach_months - documented_break_months is the alert payload, and an empty diff is what you want to see in steady state. Step 3 wraps the regime decision in a sklearn Pipeline so the same object that trained also serves predictions, with RegimeFilter documenting the regime contract even though the row filtering happens upstream. Step 4 logs per-policy AUC as a first-class MLflow metric so the model risk dashboard tracks degradation per regime, not just an aggregate; the production replacement for the temp-dir tracking URI is mlflow.set_tracking_uri("databricks") or whichever managed backend the team runs.

3.7 Scalability: Polars versus pandas

Weight-of-evidence encoding is embarrassingly parallel at the row level. The bin assignment step is a vectorized lookup; the WoE mapping is a per-bin scalar. Neither operation needs a global join. On data that fits in memory, pandas and Polars are both fast. The interesting question is when the gap between them matters.

Show code

import polars as pl

N = 1_000_000
rng_s = np.random.default_rng(0)
x = rng_s.normal(50.0, 20.0, N)
y = (rng_s.random(N) < 0.2).astype(int)

eps = 1e-9
edges = np.quantile(x, np.linspace(0, 1, 11))
edges[0], edges[-1] = -np.inf, np.inf
bins_inner = list(edges[1:-1])

# pandas path
t0 = time.perf_counter()
df_pd = pd.DataFrame({"x": x, "y": y})
df_pd["bin"] = pd.cut(df_pd["x"], bins=edges, labels=False,
                     include_lowest=True)
agg = (df_pd.groupby("bin")["y"]
             .agg(bad="sum", n="count"))
agg["good"] = agg["n"] - agg["bad"]
agg["p_b"] = agg["bad"] / agg["bad"].sum()
agg["p_g"] = agg["good"] / agg["good"].sum()
agg["woe"] = np.log((agg["p_g"] + eps) / (agg["p_b"] + eps))
df_pd = df_pd.merge(agg["woe"], left_on="bin", right_index=True)
t_pd = time.perf_counter() - t0

# polars path
t0 = time.perf_counter()
df_pl = pl.DataFrame({"x": x, "y": y})
df_pl = df_pl.with_columns(
    pl.col("x").cut(breaks=bins_inner,
                    labels=[str(i) for i in range(10)]).alias("bin")
)
agg_pl = (df_pl.group_by("bin")
                .agg([pl.col("y").sum().alias("bad"),
                      pl.col("y").count().alias("n")])
                .with_columns((pl.col("n") - pl.col("bad")).alias("good"))
                .with_columns([
                    (pl.col("bad") / pl.col("bad").sum()).alias("p_b"),
                    (pl.col("good") / pl.col("good").sum()).alias("p_g"),
                ])
                .with_columns(
                    ((pl.col("p_g") + eps).log()
                     - (pl.col("p_b") + eps).log()).alias("woe")
                ))
df_pl = df_pl.join(agg_pl.select(["bin", "woe"]), on="bin", how="left")
t_pl = time.perf_counter() - t0

pd.DataFrame({
    "engine": ["pandas", "polars"],
    "wall_seconds": [round(t_pd, 3), round(t_pl, 3)],
    "rows": [N, N],
})

	engine	wall_seconds	rows
0	pandas	0.081	1000000
1	polars	0.061	1000000

On a laptop, both engines finish a 1M-row WoE encoding in well under a second. The pandas path runs single-threaded on the pandas block manager; the Polars path parallelizes over cores through its Rust backend. The absolute runtimes are close at 1M rows because the bottleneck is memory bandwidth rather than compute. The gap widens at 10M to 100M rows, where Polars’s columnar execution and multi-threaded groupby start to matter. For scorecard training, the practical rule is that up to a few million rows, pandas is fine; past that, Polars gives a 3x to 10x speedup with the same code shape.

Past 100M rows the data no longer fits comfortably on a single machine and the engine choice shifts to Dask or Spark. The logical structure is identical to the pandas/Polars version: assign bins (map), aggregate counts per bin (reduce), broadcast the small WoE lookup back onto every row (broadcast join). What changes is that the partitioning is explicit, the aggregation is two-pass (per-partition then global), and the broadcast join is a first-class operator. The Dask implementation below runs end-to-end on a small partitioned Parquet store the chunk writes itself; replace the local path with s3://... and the same code runs unchanged on a billion rows.

Show code

import os, tempfile
import dask.dataframe as dd

# Build a tiny partitioned Parquet store. In production the warehouse already
# wrote it; this loop is here only so the chunk runs without external data.
rng_dk = np.random.default_rng(3)
N_dk   = 200_000
x_dk   = rng_dk.normal(50.0, 20.0, N_dk)
y_dk   = (rng_dk.random(N_dk) < 0.2).astype(int)
month_dk = rng_dk.integers(0, 12, N_dk)
store = tempfile.mkdtemp(prefix="dask_woe_")
for m, sub in pd.DataFrame({"x": x_dk, "y": y_dk, "month": month_dk}).groupby("month"):
    sub.drop(columns="month").to_parquet(
        os.path.join(store, f"month={m:02d}.parquet"), index=False)

# In production: dd.read_parquet("s3://bucket/applications/*.parquet",
# columns=["x", "y"]). One Parquet file becomes one partition; nothing loads.
ddf = dd.read_parquet(os.path.join(store, "*.parquet"), columns=["x", "y"])

# Quantile bin edges via Dask's distributed approximate quantile (t-digest);
# cheap and accurate. Exact percentiles would force a global sort.
edges = ddf["x"].quantile(np.linspace(0, 1, 11)).compute().to_numpy()
edges[0], edges[-1] = -np.inf, np.inf

def assign_bin(part: pd.DataFrame) -> pd.DataFrame:
    part = part.copy()
    part["bin"] = pd.cut(part["x"], bins=edges, labels=False,
                         include_lowest=True).astype("int8")
    return part

ddf_meta = ddf._meta.assign(bin=pd.Series(dtype="int8"))
ddf = ddf.map_partitions(assign_bin, meta=ddf_meta)

# Two-pass groupby: per-partition aggregate then global combine.
agg = (ddf.groupby("bin")["y"].agg(["sum", "count"]).compute()
          .rename(columns={"sum": "bad", "count": "n"}))
agg["good"] = agg["n"] - agg["bad"]
agg["woe"]  = np.log((agg["good"] / agg["good"].sum() + 1e-9)
                     / (agg["bad"]  / agg["bad"].sum()  + 1e-9))

# Broadcast join: woe_lookup is a small pandas DataFrame, so ship it to every
# worker. If both sides were Dask, you would pay a shuffle.
woe_lookup = agg["woe"].reset_index()
out_meta   = ddf._meta.assign(woe=pd.Series(dtype="float64"))
ddf = ddf.merge(woe_lookup, on="bin", how="left")

out_dir = tempfile.mkdtemp(prefix="dask_woe_out_")
ddf.to_parquet(out_dir, write_index=False)

pd.DataFrame({
    "input_partitions":  [ddf.npartitions],
    "rows":               [N_dk],
    "bins":               [len(agg)],
    "output_files":       [len([f for f in os.listdir(out_dir) if f.endswith('.parquet')])],
    "output_woe_min":     [round(float(agg["woe"].min()), 4)],
    "output_woe_max":     [round(float(agg["woe"].max()), 4)],
})

	input_partitions	rows	bins	output_files	output_woe_min	output_woe_max
0	12	200000	10	12	-0.0136	0.0171

Two practical notes on the Dask version. The quantile call uses Dask’s t-digest approximation, which is the only scalable way to get global percentiles without shuffling the whole column. The merge is automatically a broadcast join because woe_lookup is a pandas DataFrame, not a Dask one; if both sides were Dask DataFrames, you would pay a shuffle, so always materialize the small side with .compute() before joining.

The PySpark port follows the same three-stage logic; the only thing that changes is the syntax. The block below is shipped as a python code fence rather than a {python} chunk because starting a Spark session in the book renderer adds 10 to 20 seconds per build and requires a Java runtime; uncomment to run.

# PySpark port. Reads the same `store` directory written by the Dask cell.
# Run with `pip install pyspark` and a Java runtime on PATH.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("scale-woe").master("local[*]").getOrCreate()
df = spark.read.parquet(store)

bucketizer = QuantileDiscretizer(
    inputCol="x", outputCol="bin", numBuckets=10,
    relativeError=0.001, handleInvalid="keep",
).fit(df)
df = bucketizer.transform(df).withColumn("bin", F.col("bin").cast("int"))

agg = (df.groupBy("bin")
         .agg(F.sum("y").alias("bad"), F.count("y").alias("n"))
         .withColumn("good", F.col("n") - F.col("bad")))

totals = agg.agg(F.sum("bad").alias("B"), F.sum("good").alias("G")).first()
agg = agg.withColumn(
    "woe",
    F.log((F.col("good") / F.lit(totals["G"]) + F.lit(1e-9))
        / (F.col("bad")  / F.lit(totals["B"]) + F.lit(1e-9)))
).select("bin", "woe")

# Broadcast hint is mandatory once the right side has more than a handful
# of partitions; Spark's optimizer can under-broadcast under skew.
df = df.join(broadcast(agg), on="bin", how="left")
df.write.mode("overwrite").parquet("/tmp/scale_woe_spark/")

Three practical notes on the Spark version. QuantileDiscretizer is the production-grade analog of pd.cut on quantiles; for fixed user-supplied edges (regulatory bins) use Bucketizer(splits=edges) instead. The broadcast() hint is mandatory rather than optional once the right side has more than a handful of partitions; Spark’s cost-based optimizer will under-broadcast in the presence of skew. Configure spark.sql.shuffle.partitions to roughly the cluster’s vCPU count for the aggregation; the default of 200 is wrong for both small clusters (over-partitioned) and large ones (under-partitioned), and is the single most common cause of slow Spark scorecard jobs.

When porting between engines, the sanity check is to run all four (pandas, Polars, Dask, Spark) on the same 1M-row sample and assert that the resulting per-bin WoE values agree to 1e-6. Disagreement is almost always an off-by-one in bin edges (include_lowest, right=True/False, handleInvalid for Spark) or a quantile-approximation mismatch; it is always solvable, but it has to be caught at port time, not in production. For a one-shot scorecard rebuild that has to run on a billion rows, the pragmatic recipe is: develop and validate on a 1M-row Polars sample, then port to Spark for the full-volume rebuild, keeping the Polars code as the regression oracle.

For larger feature panels where the binning itself must learn from a sample (rather than be quantile-uniform), the same fit-on-sample-then-distribute pattern works with optbinning.BinningProcess. The runnable Dask cell at the start of this section already implements that pattern; the BinningProcess object is pickled, broadcast to workers via the filesystem (or Client.scatter(...)) , and applied row-locally inside map_partitions. The asymmetry between fit (single-node, on a sample, runs once) and transform (distributed, row-local, runs everywhere) is the standard shape of every preprocessing step in a production ML pipeline; the same template ports to scaling, mean imputation, target encoding, and embedding lookups, with the only change being the object inside the pickle.

3.7.1 Scaling imputation and feature selection

Imputation does not scale linearly. kNN imputation is $O(n^2)$ in the number of rows, which rules it out past a few hundred thousand rows. MICE with $M$ iterations scales as $O(M \cdot n \cdot p^2)$ for linear base learners; it scales well in $n$ but poorly in $p$. Simple mean and median imputation is trivially parallel. For large-scale scorecards the pragmatic pattern is to use mean-plus-indicator by default and escalate to MICE only for a small number of carefully chosen features where the missingness distribution is known to be MAR.

LASSO with coordinate descent (Friedman et al., 2010) has a computational cost of $O(n \cdot p \cdot \text{paths})$ per fit, which scales well up to tens of millions of rows with a few thousand features. Above that, stochastic variants become the tool of choice. Boruta scales poorly, because it fits a random forest repeatedly; on large data, it is typically run on a subsample rather than on the full set.

For scorecard work, the dominant scalability bottleneck is not any single algorithm but the full pipeline from raw tradelines to bin-assigned WoE features. Point-in-time joins against a multi-terabyte feature store are where most of the compute lives. That is an engineering problem solved by feature stores and efficient event-time joins; it is not a machine learning problem solved by a faster estimator.

3.8 Benchmark on German and Taiwan

The discussion so far has been piecewise: WoE on Taiwan, imputation on German, leakage on a synthetic panel. We now stitch these together into a single end-to-end benchmark that compares four data-pipeline configurations on both public data sets.

Show code

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from creditutils import ks_statistic, gini

def score_pipeline(Xtr, Xte, ytr, yte, pre, model):
    Xtr_p = pre.fit_transform(Xtr, ytr)
    Xte_p = pre.transform(Xte)
    model.fit(Xtr_p, ytr)
    proba = model.predict_proba(Xte_p)[:, 1]
    return {
        "AUC":  roc_auc_score(yte, proba),
        "Gini": gini(yte, proba),
        "KS":   ks_statistic(yte, proba),
    }

Show code

df_bench = load_taiwan_default().drop(columns=["id"])
y = df_bench["default"].astype(int).values
X = df_bench.drop(columns=["default"])
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Config A: raw features + standardization + logistic regression
from sklearn.preprocessing import StandardScaler
pipe_raw = StandardScaler()
res_a = score_pipeline(Xtr, Xte, ytr, yte, pipe_raw,
                       LogisticRegression(max_iter=1000))

# Config B: WoE features + logistic regression
class WoETransformer:
    def __init__(self, variables):
        self.variables = variables
    def fit(self, X, y):
        self.bp = BinningProcess(variable_names=self.variables)
        self.bp.fit(np.asarray(X), np.asarray(y))
        return self
    def transform(self, X):
        return self.bp.transform(np.asarray(X), metric="woe")
    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X)

pipe_woe = WoETransformer(variables=list(X.columns))
res_b = score_pipeline(Xtr, Xte, ytr, yte, pipe_woe,
                       LogisticRegression(max_iter=1000))

# Config C: WoE features + gradient boosting
res_c = score_pipeline(Xtr, Xte, ytr, yte, WoETransformer(list(X.columns)),
                       GradientBoostingClassifier(n_estimators=200,
                                                  max_depth=3, random_state=0))

# Config D: raw features + gradient boosting (tree learners do not need WoE)
res_d = score_pipeline(Xtr, Xte, ytr, yte, StandardScaler(),
                       GradientBoostingClassifier(n_estimators=200,
                                                  max_depth=3, random_state=0))

bench_tw = pd.DataFrame([
    {"config": "A: raw + LogReg",      **res_a},
    {"config": "B: WoE + LogReg",      **res_b},
    {"config": "C: WoE + GBM",         **res_c},
    {"config": "D: raw + GBM",         **res_d},
])
bench_tw.round(4)

	config	AUC	Gini	KS
0	A: raw + LogReg	0.7150	0.4301	0.3751
1	B: WoE + LogReg	0.7609	0.5217	0.4085
2	C: WoE + GBM	0.7771	0.5541	0.4299
3	D: raw + GBM	0.7797	0.5594	0.4334

The table tells a compact story. Within logistic regression, WoE encoding delivers a meaningful lift (config B vs A). Gradient boosting on either raw or WoE-encoded features (configs C and D) matches or exceeds the WoE-plus-logistic result, because the tree learner already captures monotone step functions internally. WoE encoding is therefore primarily useful when the downstream model class is linear, which for regulated scorecards it usually is.

Show code

df_g = load_german_credit()
# One-hot encode categoricals, standardize numerics, then score
cat_cols = df_g.select_dtypes(include=["object"]).columns.tolist()
num_cols = [c for c in df_g.columns if c not in cat_cols + ["default"]]

from sklearn.compose import ColumnTransformer
pre_g = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

y_g = df_g["default"].values
X_g = df_g.drop(columns=["default"])
Xtr_g, Xte_g, ytr_g, yte_g = train_test_split(
    X_g, y_g, test_size=0.3, random_state=0, stratify=y_g
)
Xtr_g_p = pre_g.fit_transform(Xtr_g, ytr_g)
Xte_g_p = pre_g.transform(Xte_g)

m_lr = LogisticRegression(max_iter=1000, C=1.0).fit(Xtr_g_p, ytr_g)
m_gb = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                  random_state=0).fit(Xtr_g_p, ytr_g)

p_lr = m_lr.predict_proba(Xte_g_p)[:, 1]
p_gb = m_gb.predict_proba(Xte_g_p)[:, 1]

pd.DataFrame([
    {"model": "Logistic (OHE+Std)", "AUC": roc_auc_score(yte_g, p_lr),
     "Gini": gini(yte_g, p_lr), "KS": ks_statistic(yte_g, p_lr)},
    {"model": "GBM (OHE+Std)",      "AUC": roc_auc_score(yte_g, p_gb),
     "Gini": gini(yte_g, p_gb), "KS": ks_statistic(yte_g, p_gb)},
]).round(4)

	model	AUC	Gini	KS
0	Logistic (OHE+Std)	0.8041	0.6081	0.4794
1	GBM (OHE+Std)	0.7653	0.5307	0.4556

The German credit data is small (1000 rows) and noisy; AUC in the 0.75 to 0.80 range is what modern models achieve, and the gap between linear and gradient-boosted models is small at this scale. Chapter 16 benchmarks the two data sets across a larger set of learners with profit curves and calibration plots.

3.9 Deployment considerations

Data preprocessing is part of the model artifact. When the logistic scorecard or gradient-boosted tree is persisted, the binning tables, imputers, and WoE lookups that sit before the model must persist alongside it. Three patterns prevent drift between training and production.

First, bundle preprocessing into a scikit-learn Pipeline object that includes imputation, binning, WoE mapping, and the model. The pickle of the pipeline is the deployable artifact. When the artifact is loaded in a FastAPI service, the predict_proba call uses the training-time fit for all preprocessing.

Second, export the pipeline to ONNX through skl2onnx for language-neutral deployment. Not every scorecard stack runs on Python; Java and C# services are common in banks. ONNX captures the preprocessing graph along with the model weights.

Third, log the artifact to MLflow with a data schema. The schema pins every input column’s name, dtype, and nullability. Schema violations at inference time are caught at the edge rather than silently producing bad scores. BCBS 239 (Basel Committee on Banking Supervision, 2013) requires this kind of lineage for regulated institutions.

A common operational rule is to re-fit the entire pipeline, preprocessing included, every time the model is retrained. Re-using a binning table that was fit on data from two years ago while refitting the downstream model on fresh data is a subtle form of drift that is hard to detect and easy to avoid.

3.10 Regulatory considerations

Data choices have direct regulatory exposure. A brief map:

FCRA (Fair Credit Reporting Act, United States): governs bureau data. Requires accuracy, consumer access, and dispute rights. Any model using bureau data must respect the adverse action notice requirements of Regulation B.
ECOA (Equal Credit Opportunity Act): prohibits discrimination on race, color, religion, national origin, sex, marital status, age, and receipt of public assistance. Alternative data can create disparate impact even when not explicitly discriminatory (Barocas & Selbst, 2016; Fuster et al., 2022). Fair lending review must be part of the data onboarding process.
GDPR Article 22 (European Union): gives data subjects the right not to be subject to a decision based solely on automated processing, including profiling, that produces legal effects. Credit decisions are in scope. Article 15 gives the right to an explanation of the logic involved. Scorecard binning and WoE make this easier than black-box tree ensembles.
EU AI Act: credit scoring is classified as a high-risk AI system. Data governance requirements include documentation of training data, testing for bias, and robustness testing. The Act is being phased in, with full application for high-risk systems expected in 2026 to 2027 depending on category.
BCBS 239: sets expectations for risk data aggregation, including lineage and quality. Scorecard data pipelines fall in scope for banks subject to Basel framework supervision.
SR 11-7 (Board of Governors of the Federal Reserve System, 2011): the Federal Reserve’s model risk management guidance. Data is an explicit model risk dimension. Model validation must test that inputs are accurate, complete, and appropriate.

Missing data handling is a frequent audit finding. A validator will ask what happens when the income field is blank, whether that outcome was tested, and whether the model response is bounded and monotone in the unknown direction. Imputation strategy decisions belong in the model development document, not in tribal knowledge.

3.11 Vietnam and emerging markets

3.11.1 Market context

Data in Vietnam flows through two bureaus and a growing set of alternative channels. The Credit Information Center (CIC) sits inside the SBV and is the mandatory reporting destination for all regulated lenders; it maintains a national registry of tradelines, delinquency status, and inquiry history, plus a domestic consumer score product. The Vietnam Credit Information Joint Stock Company (PCB) is the private bureau, launched in 2007 and majority-owned by a bank consortium. Adult bureau coverage is around 50 to 55 percent, with CIC and PCB files overlapping substantially on regulated-lender tradelines and diverging on utility and telecom data (Credit Information Center of Vietnam, 2023; World Bank, 2022). Mobile subscriptions exceed 140 percent of the adult population, and smartphone adoption above 80 percent of urban adults makes app-based transaction data a realistic alternative overlay (Asian Development Bank, 2023). Remote onboarding is governed by SBV Circular 16/2020/TT-NHNN on electronic KYC (State Bank of Vietnam, 2020), and personal-data processing is bound by Decree 13/2023/ND-CP, which introduces explicit consent, purpose limitation, and cross-border data transfer assessment into the pipeline (Government of Vietnam, 2023).

3.11.2 Application considerations

The data methods in this chapter transplant with three main adjustments. First, missingness is higher across the board. Declared-income fields in application forms have informative non-response because cash income does not fit the form. The missing-indicator columns that Section 3.4 recommends as a safety net are load-bearing in Vietnam, not decorative. A naive mean imputation on declared income pushes a measurable fraction of thin-file applicants into an artificially safe bucket and corrupts the weight-of-evidence table. Second, bureau tradeline depth is shallower. CIC carries two to five years of tradeline history for a typical obligor, versus ten to fifteen years on a US file. Features that assume long histories (oldest-tradeline age, average-age-of-accounts) are either missing or compressed and lose their predictive ordering. The IV-filter step should be re-run on local cohorts rather than copied from US scorecard conventions. Third, alternative data carries more marginal weight. Bank-statement parsing, e-wallet flow features (through MoMo, ZaloPay, VNPay and similar), telco recharge and top-up patterns, and utility bill payments are substantive signals for the thin-file population and often dominate self-reported income. The trade-off is that each of these feeds falls squarely inside Decree 13’s definition of personal data, and some (health-linked insurance premia, location traces) are sensitive personal data with stricter consent and assessment obligations.

Leakage is a different problem in an emerging market. The dominant leakage mode is not a future feature reaching back into the observation window; it is a feature whose meaning shifts mid-cycle because the regulation or the product changed. Circular 43/2016/TT-NHNN on consumer lending by finance companies reshaped consumer-finance collection practice, and Circular 22/2023/TT-NHNN (29 Dec 2023) amended Circular 41/2016 on capital adequacy ratios in the middle of most training windows, which pushes both the 30+ delinquency behavior and the risk-weighting of cohorts pre- and post-amendment into distinct regimes (State Bank of Vietnam, 2023). Tet seasonality adds a second structural break: features that measure rolling 30-day activity bleed across the Lunar New Year window and need a calendar-aware rather than a fixed-offset construction.

3.11.3 Rationalization

Weight-of-evidence encoding is a good fit for Vietnam. The bureau file is shallow and categorical-heavy, the analyst team is typically small, and the supervisory expectation under Circular 41/2016 is documented and auditable per-feature transformations (State Bank of Vietnam, 2016). WoE gives that out of the box. Information value as a filter is also a good fit because it is stable under the re-sampling that a typical Vietnamese lender does to cope with a thin positive class. LASSO and Boruta are reasonable but secondary tools; the feature count after sensible WoE binning rarely exceeds forty, and stability selection gives diminishing returns at that scale. Imputation strategies that assume MAR conditional on a rich covariate set are shakier in Vietnam than in the US: missingness on declared income is correlated with occupation and with cash-economy participation in ways that the available covariates do not fully capture. The missing-indicator approach dominates multiple imputation in practice. Finally, the Polars-over-pandas scalability argument is less urgent in Vietnam at typical book sizes (one to three million accounts) than in a US money-center book, but the data engineering cost of a pandas-only pipeline is still real because monthly behavioral refreshes multiply row counts by twelve.

3.11.4 Practical notes

A practical Vietnamese data stack starts with a daily CIC delta feed, a weekly PCB refresh, an internal core-banking feed for bank-owned lenders, and an e-wallet or transaction-API feed for fintech-affiliated lenders. Reporting obligations on the data side run to the SBV Banking Supervision Agency for bank submissions, to CIC for tradeline contribution, and to the Ministry of Public Security for Decree 13 compliance, including the annual data-processing impact assessment. Fair-lending review in Vietnam is less codified than under US ECOA, but is increasingly scoped under the consumer-protection provisions of Circular 43/2016/TT-NHNN on consumer lending by finance companies. A Vietnamese team building on the chapter’s WoE and imputation pipeline should expect to document, for each feature, the source system, the retention policy under Decree 13, the consent basis, and the cross-border flag.

3.12 Takeaways

Traditional credit data comes from the bureau, the bank’s internal systems, and the application form. The schema has been stable for four decades. Alternative data extends the signal set with transactional, behavioral, device, social, and telco categories, each with its own drift profile and compliance exposure.
Weight of evidence encoding, with information value $= \sum_j (g_j - b_j) \ln(g_j / b_j)$, is the Jeffreys divergence between class-conditional feature distributions. IV gives a univariate ranking; the bin assignment makes a linear model interpretable and robust.
Rubin’s MCAR, MAR, MNAR taxonomy drives imputation choice. A missing-indicator column is the cheapest safety net against information loss from silent imputation, and it should be the default for any scorecard feature with a nontrivial missingness rate.
Feature selection needs a univariate filter (IV), a sparse linear method (LASSO or elastic net), and a nonlinear wrapper (Boruta or permutation importance). No one method dominates. Use agreement across methods as a stability signal.
Temporal leakage is the most damaging bug in scorecard work. Every feature must be computable from information available at the observation date. Evaluate with out-of-time splits and walk-forward cross-validation, never random K-fold on stacked cohorts.
Polars handles multi-core WoE encoding with roughly the same code as pandas. The performance gap grows past 10M rows. Bundle preprocessing with the model in a serialized pipeline to prevent train-serve skew.

3.13 Further reading

Siddiqi (2017): the standard industry reference on scorecard development, with a full treatment of WoE binning and variable selection.
Kullback & Leibler (1951): the original paper on the KL divergence and its symmetrization. The Jeffreys divergence is what scorecard developers call information value.
Rubin (1976); Little & Rubin (2019): the formal foundation for missing-data mechanisms and likelihood-based imputation.
Buuren & Groothuis-Oudshoorn (2011); White et al. (2011): MICE, the multivariate chained-equations approach to imputation, with guidance for applied researchers.
Tibshirani (1996); Friedman et al. (2010); Zou & Hastie (2005): the LASSO and its elastic-net refinement, including the coordinate-descent algorithm that underlies every modern implementation.
Kursa & Rudnicki (2010): the Boruta wrapper method.
Navas-Palencia (2020): the mixed-integer programming formulation of optimal monotone binning that the optbinning package implements.
Berg et al. (2020); Gambacorta et al. (2024); Gambacorta et al. (2020): the empirical literature on the marginal information content of digital footprints and transactional data beyond credit bureau scores.
Khandani et al. (2010): machine-learning credit models with a rigorous treatment of out-of-sample evaluation.
López de Prado (2018); Bergmeir et al. (2018): lookahead bias and walk-forward validation in financial machine learning.
Brevoort et al. (2016): the population of credit invisibles and unscored consumers, and why alternative data matters for inclusion.
Avery et al. (2003): a canonical overview of consumer credit reporting data from the Federal Reserve.

Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: A corrected feature importance measure. Bioinformatics, 26(10), 1340–1347. https://doi.org/10.1093/bioinformatics/btq134

Asian Development Bank. (2023). Digital financial inclusion in Southeast Asia. Asian Development Bank. https://www.adb.org/publications/digital-financial-inclusion-southeast-asia

Avery, R. B., Calem, P. S., Canner, G. B., & Bostic, R. W. (2003). An overview of consumer data and credit reporting. Federal Reserve Bulletin, 89, 47–73.

Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. California Law Review, 104(3), 671–732.

Basel Committee on Banking Supervision. (2013). Principles for effective risk data aggregation and risk reporting (BCBS 239). Bank for International Settlements. https://www.bis.org/publ/bcbs239.htm

Berg, T., Burg, V., Gombović, A., & Puri, M. (2020). On the rise of FinTechs: Credit scoring using digital footprints. The Review of Financial Studies, 33(7), 2845–2897. https://doi.org/10.1093/rfs/hhz099

Bergmeir, C., Hyndman, R. J., & Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120, 70–83. https://doi.org/10.1016/j.csda.2017.11.003

Board of Governors of the Federal Reserve System. (2011). Supervisory guidance on model risk management (SR 11-7). Federal Reserve. https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Brevoort, K. P., Grimm, P., & Kambara, M. (2016). Credit invisibles and the unscored. Cityscape, 18(2), 9–34.

Buuren, S. van, & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03

Chow, G. C. (1960). Tests of equality between sets of coefficients in two linear regressions. Econometrica, 28(3), 591–605. https://doi.org/10.2307/1910133

Credit Information Center of Vietnam. (2023). Annual report on credit information activities. CIC, State Bank of Vietnam. https://cic.gov.vn/

Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. 1022–1027.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01

Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., & Walther, A. (2022). Predictably unequal? The effects of machine learning on credit markets. Journal of Finance, 77(1), 5–47. https://doi.org/10.1111/jofi.13090

Gambacorta, L., Huang, Y., Li, Z., Qiu, H., & Chen, S. (2020). Data vs collateral (BIS Working Paper 881). Bank for International Settlements. https://www.bis.org/publ/work881.htm

Gambacorta, L., Huang, Y., Qiu, H., & Wang, J. (2024). How do machine learning and non-traditional data affect credit scoring? New evidence from a chinese fintech firm. Journal of Financial Stability, 73, 101284. https://doi.org/10.1016/j.jfs.2024.101284

Government of Vietnam. (2023). Decree 13/2023/ND-CP on personal data protection. Hanoi. https://vanbanphapluat.co/

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161. https://doi.org/10.2307/1912352

Iyer, R., Khwaja, A. I., Luttmer, E. F. P., & Shue, K. (2016). Screening peers softly: Inferring the quality of small borrowers. Management Science, 62(6), 1554–1577. https://doi.org/10.1287/mnsc.2015.2181

Jäger, S., Allhorn, A., & Bießmann, F. (2021). A benchmark for data imputation methods. Frontiers in Big Data, 4, 693674. https://doi.org/10.3389/fdata.2021.693674

Jarrett, D., Cebere, B. C., Liu, T., Curth, A., & Schaar, M. van der. (2022). HyperImpute: Generalized iterative imputation with automatic model selection. Proceedings of the 39th International Conference on Machine Learning (ICML).

Khandani, A. E., Kim, A. J., & Lo, A. W. (2010). Consumer credit-risk models via machine-learning algorithms. Journal of Banking and Finance, 34(11), 2767–2787. https://doi.org/10.1016/j.jbankfin.2010.06.001

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694

Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1–13. https://doi.org/10.18637/jss.v036.i11

Le Morvan, M., Josse, J., Scornet, E., & Varoquaux, G. (2021). What’s a good imputation to predict with missing values? Advances in Neural Information Processing Systems (NeurIPS), 34.

Lin, M., Prabhala, N. R., & Viswanathan, S. (2013). Judging borrowers by the company they keep: Friendship networks and information asymmetry in online peer-to-peer lending. Management Science, 59(1), 17–35. https://doi.org/10.1287/mnsc.1120.1560

Little, R. J. A., & Rubin, D. B. (2019). Statistical analysis with missing data.

López de Prado, M. (2018). Advances in financial machine learning.

Lou, Y., Caruana, R., Gehrke, J., & Hooker, G. (2013). Accurate intelligible models with pairwise interactions. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 623–631. https://doi.org/10.1145/2487575.2487579

MacKinlay, A. C. (1997). Event studies in economics and finance. Journal of Economic Literature, 35(1), 13–39.

Mattei, P.-A., & Frellsen, J. (2019). MIWAE: Deep generative modelling and imputation of incomplete data sets. Proceedings of the 36th International Conference on Machine Learning (ICML).

Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287–2322.

Navas-Palencia, G. (2020). Optimal binning: Mathematical programming formulation. arXiv Preprint arXiv:2001.08025.

Nguyen, M. (2026). Author twitter handle sentinel (do not cite). https://twitter.com/mikenguyen13.

Nori, H., Jenkins, S., Koch, P., & Caruana, R. (2019). InterpretML: A unified framework for machine learning interpretability. https://arxiv.org/abs/1909.09223

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.1093/biomet/63.3.581

Siddiqi, N. (2017). Intelligent credit scoring: Building and implementing better credit risk scorecards.

Sill, J. (1998). Monotonic networks. Advances in Neural Information Processing Systems (NeurIPS), 10.

State Bank of Vietnam. (2016). Circular 41/2016/TT-NHNN on capital adequacy ratios for banks and foreign bank branches. Hanoi. https://www.sbv.gov.vn/

State Bank of Vietnam. (2020). Circular 16/2020/TT-NHNN on electronic know-your-customer for payment account opening. Hanoi. https://www.sbv.gov.vn/

State Bank of Vietnam. (2023). Circular 22/2023/TT-NHNN amending circular 41/2016/TT-NHNN on capital adequacy ratios for banks and foreign bank branches. Hanoi. https://www.sbv.gov.vn/

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118. https://doi.org/10.1093/bioinformatics/btr597

Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

Truong, C., Oudre, L., & Vayatis, N. (2020). Selective review of offline change point detection methods. Signal Processing, 167, 107299. https://doi.org/10.1016/j.sigpro.2019.107299

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067

World Bank. (2022). The global findex database 2021: Financial inclusion, digital payments, and resilience in the age of COVID-19. Washington, DC. https://www.worldbank.org/en/publication/globalfindex

Yoon, J., Jordon, J., & Schaar, M. van der. (2018). GAIN: Missing data imputation using generative adversarial nets. Proceedings of the 35th International Conference on Machine Learning (ICML).

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x