# Credit Scoring: Theory, Methods, and Practice Source: https://mikenguyen13.github.io/credit_score Author: Mike Nguyen License: CC-BY-4.0 (text), MIT (code) This file is a single-document ingestion bundle for LLMs. It contains the full prose of every chapter and appendix with executable code chunks stripped. For runnable code, see the GitHub repository: https://github.com/mikenguyen13/credit_score ================================================================================ ================================================================================ # Source: index.qmd ================================================================================ # Preface This book is a working reference for people who build, audit, deploy, and regulate credit scoring models. Every method is derived, every line of code runs in the reader's own environment, and every dataset is publicly downloadable under a permissive license. ## Who this book is for **Practitioners**: model developers, validators, MLOps engineers, credit analysts, and risk officers who need code that works and methods that pass audit. **Academics**: researchers in finance, statistics, machine learning, and law who want a single coherent reference with verified derivations and top-tier citations. **Regulators and auditors** will also find the regulatory chapters, the model risk workflow, and the fairness and explainability material directly useful. ## How to use this book The book is a Quarto project. Each chapter is a `.qmd` file with executable Python. Clone the repo, install the environment, render locally: Details on environment setup and macOS OpenMP notes are in @sec-app-B-env. ## Data Four public datasets anchor most examples: - **UCI Statlog German Credit Data** (Hofmann 1994): 1,000 consumer loans, 20 features. Small enough for pedagogy, large enough for real benchmarks. - **UCI Default of Credit Card Clients** (Yeh and Lien 2009): 30,000 Taiwanese credit card customers. 
Class imbalance around 22%, rich behavioral history. - **Home Credit Default Risk** (Kaggle, CC0): large, real-world mixed tabular with application, bureau, and installment tables. - **HMDA Loan-level Public Data** (CFPB, public domain): millions of U.S. mortgage applications, the default source for fair-lending research. These anchor examples, but several chapters also simulate data when a specific statistical property is pedagogically necessary. @sec-app-C-data provides download and caching code. ## What is new in this treatment Four things distinguish this book from the existing literature: 1. Every algorithm ships with a from-scratch derivation, a reference NumPy implementation, and the standard production library call. Readers see the math, the code, and the package API side by side. 2. Scalability is treated as a first-class concern. Each method is benchmarked on single-node pandas, Polars, Dask, and PySpark where relevant, and the throughput numbers are the ones the reader actually reproduces. 3. Deployment patterns are cloud-agnostic. FastAPI plus Docker plus MLflow form the core stack. SageMaker, Vertex, Databricks, and Azure ML map onto this stack with small adapters. 4. Regulatory, fairness, and explainability material is integrated chapter by chapter rather than confined to a single appendix. SR 11-7, GDPR Article 22, ECOA, and the EU AI Act are referenced in every chapter whose content they actually govern. ## Reproducibility All results in this book are rendered directly from executable code. Random seeds are fixed. Dataset versions are pinned. A continuous integration run renders the full book from scratch; any number or figure that does not match the text is treated as a build failure. ## License Text is licensed under Creative Commons Attribution 4.0 International (CC-BY-4.0). Code is licensed under the MIT License. Redistribute, adapt, and use in your own work with attribution. 
## A note on scope The book does not cover quantitative credit pricing, CDS markets, or structured credit. It focuses on models whose output is the probability of default for an individual borrower or facility over a fixed horizon, plus the calibration, explanation, and capital consequences of that output. The structural and causal chapters (@sec-ch08 and @sec-ch28) touch on pricing only insofar as the lenses they introduce inform retail and SME scoring. ## Acknowledgments This book builds on four decades of work by Baesens, Thomas, Hand, Lessmann, Bastos, Verbraken, Crook, Altman, Ohlson, Merton, and many others whose contributions we cite throughout. Any errors are ours. ================================================================================ # Source: references.qmd ================================================================================ # References {.unnumbered} ::: ================================================================================ # Source: chapters/01-introduction.qmd ================================================================================ # Introduction and Historical Development **Scope: both retail and corporate.** Surveys consumer scoring (FICO, scorecards) and corporate distress modeling (Altman Z, Ohlson O, Merton) as one historical lineage. ## Why a book on credit scoring {.unnumbered} A credit score is a conditional expectation. Given what a lender can observe about a borrower at the moment of decision, a score is an estimate of the probability that the borrower will fail to meet a contractual obligation over some horizon. Every decision that follows, whether to extend credit, at what price, against what collateral, with what limit, is a function of that estimate and its uncertainty. This book is about how to construct that estimate well. The problem has three features that, together, make credit scoring distinct from generic binary classification. First, the ground truth is expensive and delayed. 
A default observation arrives months or years after the decision, and often only for the subset of applicants the lender chose to accept, so the training distribution is selected. Second, decisions are regulated. The Equal Credit Opportunity Act, the Fair Credit Reporting Act, Basel III, IFRS 9, CECL, SR 11-7, GDPR Article 22, and the EU AI Act all impose hard constraints on what features can be used, how models must be documented, how risk-weighted assets are computed, and how losses are provisioned. Third, the consumer side is large, roughly 18 trillion US dollars of household debt in the United States as of 2024 according to the Federal Reserve, with billions of dollars in interest and fees flowing through scoring systems every day. Small improvements in discrimination compound into large profit-and-loss effects and large welfare effects. The goals of this book are narrow and concrete. For the practitioner, we derive every method from scratch, implement it in NumPy or PyTorch, and then call the same standard library that a risk team would run in production. For the academic, we cite the primary literature in top-tier venues and benchmark each method on the same three public datasets so that results are comparable across chapters. For the regulator or supervisor, we tie every technique to the supervisory text that constrains its use. We prefer working code over narrative, and working math over intuition. This first chapter is the only one without a single estimator as its core object. Its job is to explain why the field exists, how it arrived at its current shape, and how the rest of the book is organized. A short empirical section fits a logistic scorecard on the two canonical public datasets. That baseline recurs throughout later chapters as a reference point for every more elaborate method. A word to the practitioner in an emerging market. The institutional history in this chapter is Anglo-American because the primary sources and the regulatory templates are. 
The modeling problems are not. A Vietnamese consumer-finance lender, an Indonesian digital bank, a Kenyan mobile-money scorer, and a Brazilian fintech share a set of features that the mid-1990s US scorecard literature did not contemplate: thin-file or no-file borrowers, a cash economy with self-reported income, a credit bureau whose coverage is partial and whose tradeline depth is shallow, and a distribution channel that is mobile-first from the first customer touch. Every chapter from here on has to be read twice: once for the US or EU template, once for what needs to be re-derived, rebalanced, or replaced when the bureau carries half the adult population and half the income is informal. ## Why credit scoring exists ### Information asymmetry as the core friction The theoretical justification for credit scoring was provided in two papers written eleven years apart. The first is @akerlof1970lemons. Akerlof showed that when sellers know more about product quality than buyers, the market can unravel. Low-quality goods crowd out high-quality goods because buyers, unable to distinguish quality, pay only an average price. Owners of high-quality goods withdraw, the average quality drops, prices drop further, and the market collapses toward the lowest quality or disappears altogether. The argument is a one-paragraph proof of the welfare cost of asymmetric information. The second is @stiglitz1981credit. Stiglitz and Weiss adapted Akerlof's logic to credit markets. A bank cannot perfectly observe the riskiness of a loan applicant. If it raises the interest rate to compensate for unobserved risk, it worsens the pool of applicants, because safe borrowers have lower reservation rates and drop out, while risky borrowers, who repay only when their projects succeed and whose downside is bounded by default, remain. The result is credit rationing: in equilibrium, banks prefer to cap quantity rather than clear the market with price, and some creditworthy borrowers are rejected. 
This is the adverse-selection side of the story. There is also a moral-hazard side. Once a loan is made, the borrower can take unobservable actions that affect repayment: whether to invest the proceeds productively, to maintain insurance, or to honor the repayment plan when default is attractive but legal. Under moral hazard, contracts and monitoring become the margins of adjustment [@holmstrom1979moral; @townsend1979optimal]. Screening at origination addresses selection; monitoring during the life of the loan addresses moral hazard. A credit score is primarily a screening device, although behavioral scores used after origination are monitoring devices. Earlier work laid the foundation. @spence1973job showed how informed parties can signal quality through costly actions. @rothschild1976equilibrium analyzed how uninformed insurers can screen by offering menus of contracts. @jaffee1976imperfect argued that credit rationing arises when loan supply functions become backward-bending under default risk. @diamond1984financial showed that a delegated monitor, the bank, can resolve the free-rider problem among dispersed creditors by aggregating monitoring costs. @diamond1991monitoring sharpened the argument into a theory of the choice between bank loans and public debt, with reputation and screening as the relevant margins. A simple numerical illustration makes the Stiglitz-Weiss mechanism concrete. Suppose borrowers come in two unobservable types, safe and risky, each drawn with equal probability. Safe projects pay back a fixed amount $R_s$ with certainty. Risky projects pay back a larger amount $R_r > R_s$ with probability $p$ and zero with probability $1 - p$. A bank that sets interest rate $r$ faces the participation margin: safe borrowers accept only if $R_s \ge 1 + r$, while risky borrowers accept if $p \cdot R_r \ge p \cdot (1 + r)$, that is, whenever $R_r \ge 1 + r$. Because $R_r > R_s$, any rate in the interval $R_s - 1 < r \le R_r - 1$ drives safe borrowers out while still attracting risky ones. 
The expected return to the bank is non-monotone in $r$: raising $r$ increases revenue per contract but worsens the mix. At some $r^*$ the two effects exactly cancel; above it, expected profit falls. The bank's optimal policy caps $r$ at $r^*$ and rations quantity at that rate. The welfare loss is the mass of safe borrowers who would have borrowed at rates up to $R_s - 1$ and to whom the bank would have lent, except that the rate required to break even on the pooled portfolio is unacceptable to the safe type. Credit scoring resolves the friction by conditioning the offer on an observable signal that is correlated with type. The costly-state-verification argument of @townsend1979optimal takes a different route to the same destination. In Townsend's setup, the borrower knows her return but the lender can verify it only by paying a verification cost. The optimal contract is a standard debt contract: the borrower pays a fixed amount in non-default states, and the lender verifies only in default. The expected verification cost is a deadweight loss that is priced into the loan, and scoring reduces it by lowering the probability of default ex ante. @holmstrom1979moral's moral-hazard setup generates a different implication: when effort is unobservable, the first-best is not implementable and the contract must make the borrower's payment contingent on the outcome. Scoring affects this setup through the participation constraint, not the incentive constraint, because it improves the ex-ante distribution of the contract partners. The screening-versus-relationship view connects back to banking structure. @hauswald2006information model how a bank's informational advantage from screening is eroded by competitor acquisition of the same signals, which changes the equilibrium compensation for screening effort. 
@liberti2019information formalize the distinction between hard information (codifiable, transferable across the org) and soft information (subjective, context-dependent, tied to the loan officer) and show how the two interact as scoring technology improves. For the practitioner, the takeaway is that the value of a scoring system is not just the loss reduction it delivers on accepted loans. It is the change in the whole portfolio allocation induced by conditioning on a predictive signal. ### Screening versus monitoring A useful distinction for the rest of this book is between screening (ex ante, before credit is extended) and monitoring (ex post, during the life of the loan). Screening models use application data, bureau data, and any alternative data legally available at origination to estimate the probability of default over a fixed horizon, typically 12 or 24 months. Monitoring models, often called behavioral scores, use the ongoing trajectory of payments, balances, utilization, and external data to update the probability of default as new information arrives. The same mathematical machinery supports both, but the feature sets differ and the horizon of the prediction differs. Parts II and III of this book focus on screening. @sec-ch32 takes up dynamic behavioral scores. The separation maps onto the classical theory. Screening attacks Akerlof-style adverse selection by extracting information from observable signals. Monitoring attacks Stiglitz-style moral hazard by verifying actions after they are taken. A bank that does both well captures the lion's share of the borrower's informational rent. A bank that does only screening leaves the moral-hazard channel open. A bank that does only monitoring accepts too many bad loans at origination. Most institutional lenders run both systems. Most fintech lenders, at least in the early generations, focused on screening with rich alternative data and delegated monitoring back to traditional servicers. 
The distinction is methodologically useful because it pins down the label. For screening, the label $Y$ is a default indicator over a fixed horizon after origination, typically 12, 18, or 24 months. The observation window is forward-looking and the training set consists of past originations observed long enough to label. For monitoring, the label is a default indicator over a horizon after the as-of date, and the training set can be a panel of monthly observations on on-book accounts. The covariates in the monitoring case include not only origination attributes but also the whole history of balances, payments, and status codes since origination. The modeling choice on the monitoring side is often a discrete-time hazard rather than a single-horizon binary classifier, because the panel structure is natural and the competing-risk structure (default, attrition, prepayment) is material. A further distinction, not always emphasized, is between application scoring and scoring for collections or loss-mitigation. Collections models predict the probability that a delinquent account will roll to charge-off or, conditionally, that a given recovery tactic (letter, call, settlement offer) will cure the delinquency. The label here is different: it is recovery or cure, not default. The feature set overlaps with behavioral scoring but the loss function is different. ### Welfare arguments There is a tension between two welfare claims, both defensible. On one side, accurate scoring improves allocative efficiency. It reduces the rate at which safe borrowers are pooled with risky ones, lowers the cost of credit for the safe, and raises the rate at which productive projects are financed. @einav2013impact document that credit scoring technology introduced at a large auto lender caused cross-subsidies to collapse, with safe borrowers receiving more generous terms, and overall profits rising. 
@petersen2002does show that small-business lending distance rose sharply after the diffusion of scoring, which is consistent with scoring replacing costly soft-information production by loan officers. @frame2001effect report that small-business scoring expanded credit access in lower-income neighborhoods. On the other side, scoring can create disparate-impact harms when the features used, or the historical patterns encoded in the labels, reflect protected characteristics. @fuster2022predictably show that moving from logistic regression to random forests on the same mortgage dataset raised predicted default probabilities for Black and Hispanic borrowers relative to White borrowers and that the differential is driven by the technology itself, not by a change in the underlying portfolio. @bartlett2022consumer find that FinTech algorithms in mortgage lending discriminate less than face-to-face loan officers on the origination decision, but continue to charge minority borrowers more on price. @howell2024lender show that automation of small-business Paycheck Protection Program lending narrowed racial gaps in credit access because human discretion was a material source of disparity. The welfare question is not whether scoring is good or bad. It is: given that scoring exists, which methods and governance processes minimize error-variance, minimize disparate impact, and respect individual rights? Parts V and VI of this book treat that question in detail. There is a third welfare channel that cuts across the first two: the effect of scoring on screening incentives. @rajan2015failure show that when loan officers know that a statistical model will be used to approve, their effort to collect soft information falls, and the model's performance on the induced sample degrades. The mechanism is that loan officers stop recording marginal information once the approval decision is made by a model. 
The data-generating process shifts, and what looked like a predictive signal in the old regime no longer predicts in the new one. @rajan2011statistical formalize the incentive feedback. @keys2010did document an analogous effect in the subprime-mortgage securitization market: when loans were more easily securitized, screening effort at origination fell. @mian2009consequences tie the resulting credit expansion to the 2008 mortgage default crisis. The welfare analysis also interacts with credit supply during macro shocks. @agarwal2018banking show that during the 2000s expansion, banks passed through only a fraction of monetary-policy-driven cost reductions to consumer borrowers, and the pass-through varied with borrower risk score. @bhutta2015payday document the welfare effect of payday borrowing on credit-constrained households, where scoring determines access to mainstream credit and therefore the outside option. @bazot2018financial places the long-run cost of financial intermediation in Europe in historical perspective. A scoring system is not just a classifier; it is one link in a longer chain through which monetary policy, banking structure, and household welfare interact. ### A minimal formal frame Let $X \in \mathcal{X}$ be the features observed at origination. Let $Y \in \{0, 1\}$ be the default indicator over a horizon $H$. A score is any function $$ s: \mathcal{X} \to \mathbb{R}, $$ that preserves the ordering of the conditional default probability $\pi(x) = \Pr(Y=1 \mid X=x)$. Under the logistic form $$ \pi(x) = \frac{1}{1 + \exp(-\beta_0 - x^\top \beta)}, $$ we can take $s(x) = \beta_0 + x^\top \beta$ directly. Under any monotone transformation of a probability estimate, we can also take the probability itself or an affine scaling to an integer scale, like the Fair Isaac convention of mapping log-odds to points with a points-to-double-the-odds constant. 
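The points convention described above can be sketched in a few lines. The snippet evaluates the logistic form and then applies an affine map from the log of the good:bad odds to points, so that points rise as the default probability falls. The anchor values (600 points at 50:1 odds, 20 points to double the odds), the coefficients, and the function names are illustrative assumptions, not the parameters of any actual Fair Isaac scorecard.

```python
import math


def default_probability(beta0, beta, x):
    """Logistic default probability pi(x) = 1 / (1 + exp(-(beta0 + x'beta)))."""
    linear_score = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear_score))


def points_from_probability(pd_hat, base_points=600.0, base_odds=50.0, pdo=20.0):
    """Affine map from log good:bad odds to points.

    base_points, base_odds, and pdo are hypothetical anchors: a borrower with
    good:bad odds equal to base_odds receives base_points, and every doubling
    of the odds adds pdo points.
    """
    factor = pdo / math.log(2.0)
    offset = base_points - factor * math.log(base_odds)
    good_bad_odds = (1.0 - pd_hat) / pd_hat
    return offset + factor * math.log(good_bad_odds)


# Hypothetical coefficients and applicant features, for illustration only.
beta0, beta = -3.0, [0.8, -0.5]
x = [1.0, 2.0]

pd_hat = default_probability(beta0, beta, x)
points = points_from_probability(pd_hat)
print(f"PD = {pd_hat:.4f}, points = {points:.1f}")
```

Because the map is affine in the log-odds, it preserves the ranking of applicants (in reverse order of default probability), which is why the points scale carries the same discriminatory information as the underlying probability.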
The central operational quantity in this book is the receiver-operating-characteristic curve and its summaries, the area under the curve (AUC) and the Gini coefficient $2 \cdot \text{AUC} - 1$. The Kolmogorov-Smirnov statistic $$ \text{KS} = \sup_t \bigl| F_{s \mid Y=0}(t) - F_{s \mid Y=1}(t) \bigr|, $$ measures the maximum separation between the score distributions of good and bad borrowers. We will use AUC, KS, Gini, Brier score, and calibration plots throughout. The profit-based view connects the score to the accept-reject decision. Let $c$ be the marginal profit from an accepted good borrower, let $\ell$ be the marginal loss from an accepted bad borrower, and let $\pi(x)$ be the estimated default probability. The expected profit from accepting an applicant with features $x$ is $$ \mathbb{E}[\text{profit} \mid x] = (1 - \pi(x)) \cdot c - \pi(x) \cdot \ell, $$ which is nonnegative exactly when $\pi(x) \le c / (c + \ell)$. The profit-maximizing rule therefore accepts when $\pi(x) \le c / (c + \ell)$, or equivalently, for a score oriented so that higher values mean lower risk, when $s(x) \ge s^*$ for some threshold $s^*$ calibrated against that loss ratio. @elkan2001foundations derives the same rule in the cost-sensitive-learning framework. @verbraken2014novel extends it to include fixed costs and expected maximum profit as a classifier-selection criterion. The Bayes-optimal classifier under 0-1 loss is a threshold on $\pi(x)$ at 0.5. The Bayes-optimal classifier under the cost matrix above is a threshold at $c / (c + \ell)$. Neither is necessarily achievable: if the hypothesis class cannot represent $\pi(x)$, we incur approximation error. If the sample is finite, we incur estimation error. @hand2006classifier argues that the literature overstates the gap between classifiers because most of the variance is in the data-generating process and relatively little in the model class. 
This book's empirical results on the Taiwan and Home Credit datasets are consistent with that view: the spread in AUC across ten modern methods on the same data is typically 3 to 5 AUC points, which is material but less than the spread across different feature sets or the spread across different sampling seeds on small datasets. A remark on probabilities versus scores. The unit of measurement on a credit bureau report is points, not probability. The reason is presentational: a three-digit integer between 300 and 850 is easier for a consumer to anchor on than a probability between 0 and 1, and the log-odds scale compresses the tails so that a 40-point gap at the bottom of the distribution and a 40-point gap at the top of the distribution correspond to the same multiplicative change in odds. Inside a lender's risk system, the operative quantity is still the probability (or its calibrated cousin, expected loss); the points are a display layer. Calibration of the underlying probability to realized default rates is therefore a critical step, not a cosmetic one. ## A brief history: 1840 to 1980 ### The mercantile agencies and the invention of the credit report The credit reporting industry predates the credit score by a century. @olegario2006culture provides the definitive treatment; @lauer2017creditworthy extends the story to consumer surveillance. The origin is Lewis Tappan's Mercantile Agency, founded in New York in 1841, which paid a network of lawyers and local merchants to file reports on the character, capital, and circumstances of country-store proprietors buying wholesale goods on credit. The reports were written in a telegraphic style and filed in ledgers that subscribers could consult. The Mercantile Agency became R. G. Dun and Company. A competing operation, founded by John Bradstreet in Cincinnati in 1849, published rated directories. The two merged in 1933 to form Dun and Bradstreet, which is still the leading commercial credit-rating agency. 
Two features of the 19th-century mercantile agency matter for modern scoring. First, the agency produced a common informational infrastructure that allowed credit decisions to scale beyond the informal networks of merchant correspondents. A wholesaler in New York could extend 90-day credit to a shopkeeper in Kansas because the agency had a ledger entry, even though the two parties had never met. Second, the ratings were encoded. By the 1860s, Dun used letter-number combinations, like A1 or G3, that compressed a paragraph of qualitative assessment into a single symbol. That compression is the lineage of the modern three-digit credit score. Consumer credit reporting followed commercial reporting by several decades. Retail credit bureaus emerged locally in the early 20th century, aggregating payment histories across merchants. The Associated Credit Bureaus trade association was formed in 1906. The three bureaus that dominate the US consumer market today, Equifax (descended from Retail Credit Company, founded 1899), Experian (descended from TRW Information Services and, ultimately, CCN of Nottingham, founded 1980), and TransUnion (founded 1968), all consolidated hundreds of local bureaus into national networks during the postwar decades. The data architecture that emerged had two lasting features. First, the bureau is a data aggregator, not a lender, and it sells data to lenders in exchange for contributions from those same lenders. The tradeline structure, a record per credit account with balance, payment, delinquency, and utilization, is the unit of exchange. Second, the bureau maintains a set of public-record attachments, typically judgments, tax liens, and bankruptcies, that hang off the consumer identity. The Fair Credit Reporting Act of 1970 codified consumer rights over this record; the rules on what can and cannot appear, and for how long, shape what inputs a scoring model can legally use. 
@leyshon2008credit document how the growth of this electronic record-keeping interacted with the retail-banking business model in the 1990s, when automated underwriting went from a niche to a standard. The key conceptual point is that the bureaus are the infrastructure on which modern scoring runs; every US consumer lender, and many commercial lenders, use bureau data either as input to their scoring or as input to challenger models that validate their own decisions. International variation in this infrastructure is material. The United Kingdom and Ireland have two dominant bureaus (Experian and Equifax), with TransUnion (formerly Callcredit) a distant third. Germany has SCHUFA, a mutualized bureau owned by the financial-services sector with a different data-sharing model from the US bureaus. France, until recently, had no positive-data bureau at all; scoring was built largely from internal bank data and negative public-record flags. Emerging markets often have thin bureau coverage, which is why alternative-data approaches have outsized traction in those markets. The cross-country variation in bureau depth is one of the reasons the literature on financial inclusion [@bis2020data; @bazarbash2019fintech] places so much weight on non-bureau signals. ### Early bank scoring The first numerical scoring work in US banking is usually attributed to @durand1941risk, whose NBER monograph on consumer installment financing applied @fisher1936use's linear discriminant (@sec-ch06-discriminant) to loan-approval data from personal-finance companies and small-loan lenders. Durand built weighted-factor scoring that assigned points to borrower attributes (age, occupation, years at current employer, bank account ownership) and summed them into a single risk index. 
The classification accuracy was modest by modern standards, but the conceptual move, from individual-case judgment to a point-total that could be applied consistently across a portfolio, was the foundation of everything that followed. @myers1963credit extended the framework with practical weight-construction procedures that banks could implement manually or on punched-card machinery. @bierman1970equation derived a Bayesian optimal accept-reject rule for trade-credit decisions. @greer1967optimal worked through the profit-maximizing cutoff under known loss-given-default and recovery distributions. @orgler1970credit applied statistical scoring to commercial loans at a money-center bank. These papers, spread across statistics, operations research, and money-credit-banking journals, show that by the late 1960s, the theoretical apparatus for scoring was essentially in place. What was not yet in place was the electronic infrastructure. Credit applications in the 1950s and 1960s were processed by hand. A typical consumer lender would have a policy manual and a form that branch staff filled in. Rules were deterministic and heavy on excluded occupations, residency requirements, and employment stability. The transition from manual policy rules to statistical scorecards required not only a methodology but also the data-collection infrastructure and the computing hardware to execute the model consistently. Durand's methodology deserves a closer look because it set the template for the next forty years. He tabulated borrower characteristics against the observed good/bad outcome on a sample of nearly 7,000 loans, computed the correlation of each attribute with repayment, selected the attributes that contributed the most information jointly, and assigned points by combining Fisher-style weights with rounding for implementation ease. The final score was a sum of bin-level points. 
The approach can be read as a constrained logistic regression in which the link function is linear, the design matrix is a wide one-hot encoding of binned features, and the coefficients are rounded to sensible integer multiples. Decades later, this is exactly the recipe that @myers1963credit, @orgler1970credit, and every subsequent Fair Isaac (FICO) scorecard would follow. The critical insight was procedural: by writing the scoring function as a sum of independent contributions, the model becomes interpretable, auditable, and implementable on the computing hardware of its day. The 1960s literature added decision-theoretic grounding. @bierman1970equation wrote down the optimal accept-reject threshold in a Bayesian framework and showed that it depends on the ratio of the loss from accepting a bad to the profit from accepting a good. @greer1967optimal extended the analysis to the loss-given-default margin. @orgler1970credit applied the scoring form to commercial loans at Chase Manhattan and documented a 30-plus percent reduction in bad-loan rates relative to judgmental underwriting in the matched comparison. These papers collectively established that (a) scoring could be more accurate than judgmental assessment, (b) the accept-reject decision depends on economics, not just accuracy, and (c) the same math could be applied to consumer and commercial portfolios, even if the input data differed. The hardware context matters. A 1960s-era credit-granting system ran on punched-card tabulators or early electronic mainframes. The per-decision compute budget was small. A scorecard of 10 to 20 characteristics, each with 3 to 8 bins, could be evaluated by a lookup table; a logistic regression with continuous features could not be, without an arithmetic unit and a logarithm routine. The scorecard format was therefore not just an interpretability choice but also a deployment choice. 
Much of the survival of the scorecard format into the 21st century, well past the point at which computing ceased to be the constraint, is inertia from this early architectural fit.

### Altman and the modern bankruptcy-prediction literature

@altman1968zscore was the watershed paper. Altman applied multiple discriminant analysis to a matched sample of 33 bankrupt and 33 non-bankrupt manufacturing firms and derived the Z-score,

$$
Z = 1.2 X_1 + 1.4 X_2 + 3.3 X_3 + 0.6 X_4 + 1.0 X_5,
$$

where $X_1$ through $X_5$ are working capital / total assets, retained earnings / total assets, earnings before interest and taxes / total assets, market value of equity / book value of debt, and sales / total assets. Firms with $Z < 1.81$ were predicted to fail; firms with $Z > 2.99$ were predicted to survive; the zone between was ambiguous. On the original estimation sample, Altman reported 95 percent classification accuracy at a one-year horizon. The paper mattered for three reasons. First, it used a compact and interpretable statistical model to beat subjective assessment. Second, it turned corporate distress into a measurable object: a company's Z-score could be tracked over time and compared across industries. Third, it spawned an enormous literature. @altman1977zeta introduced the ZETA model, a seven-factor extension fit to a larger sample. @beaver1966financial, published two years earlier, used univariate ratio analysis; Altman subsumed and improved on it. @ohlson1980financial replaced the discriminant framework with logistic regression, which avoided the multivariate-normality assumption on the predictors and the restrictive equal-covariance assumption of linear discriminant analysis (@sec-ch06-discriminant; see @sec-ch06-rda for the regularized variant that relaxes this assumption without going to full QDA). @shumway2001forecasting and @campbell2008search moved the literature to discrete-time hazard models with dynamic covariates.
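
The Z-score formula and its zone cutoffs can be sketched in a few lines of Python. This is a minimal illustration, not Altman's estimation procedure: the `FirmRatios` container, the function names, and the example ratios are hypothetical, and the inputs are assumed to be ratios expressed as decimals (0.10 rather than 10 percent), which is the scaling the coefficients above imply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirmRatios:
    """Hypothetical container for the five Z-score inputs (decimal ratios)."""
    wc_ta: float     # X1: working capital / total assets
    re_ta: float     # X2: retained earnings / total assets
    ebit_ta: float   # X3: EBIT / total assets
    mve_bvd: float   # X4: market value of equity / book value of debt
    sales_ta: float  # X5: sales / total assets


def altman_z(r: FirmRatios) -> float:
    """Weighted sum of the five ratios with the coefficients quoted in the text."""
    return (1.2 * r.wc_ta + 1.4 * r.re_ta + 3.3 * r.ebit_ta
            + 0.6 * r.mve_bvd + 1.0 * r.sales_ta)


def zone(z: float) -> str:
    """Map a Z-score to the three zones using the 1.81 and 2.99 cutoffs."""
    if z < 1.81:
        return "distressed"
    if z > 2.99:
        return "safe"
    return "gray"


# Invented example firm, for illustration only.
example = FirmRatios(wc_ta=0.10, re_ta=0.20, ebit_ta=0.08,
                     mve_bvd=1.5, sales_ta=1.1)
z = altman_z(example)
print(f"Z = {z:.3f}, zone = {zone(z)}")
```

For the invented firm, the five contributions are 0.12, 0.28, 0.264, 0.9, and 1.1, so the score lands in the gray zone between the two cutoffs. Note how much of the total comes from $X_4$ and $X_5$, which is one reason the score transfers poorly to firms without traded equity or with unusual asset turnover.
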
The Z-score is interesting methodologically beyond its empirical success. The five ratios were selected from a larger candidate set by stepwise discriminant analysis, which today would be considered a feature-selection procedure with high selection-induced bias. The signs and magnitudes of the coefficients were interpretable in light of accounting logic: profitability (EBIT / total assets) and efficiency (sales / total assets) carry positive weight; leverage (book value of debt in the denominator of $X_4$) carries negative weight through the inverted ratio. The thresholds for the three zones (safe, gray, distressed) were chosen to minimize misclassification cost on the matched sample. The matched-sample design (33 pairs) is now understood to give an overoptimistic picture of real-world accuracy because the base rate in the sample is 50 percent, whereas in the population it might be 2 to 5 percent. @ohlson1980financial's move to logistic regression partly addressed this by accommodating unbalanced samples, and @shumway2001forecasting's hazard-model approach further corrected it by using all firm-year observations, not just matched pairs. Parallel to the academic literature, the rating agencies (Moody's, Standard and Poor's, Fitch) were developing their own quantitative models to complement analyst judgment. The Moody's KMV Expected Default Frequency (EDF), built on the @merton1974pricing structural framework, combined equity-volatility-implied distance to default with an empirical mapping to observed default frequencies. S&P's CreditModel produced analogous outputs for private firms. These commercial models shared a lineage with Altman's work but also drew on the options-pricing literature of @black1973pricing and @merton1974pricing, which gave them a structural interpretation that the reduced-form Z-score lacked. One result from @campbell2008search deserves special attention because it applies, with modification, to retail credit as well. 
Campbell, Hilscher, and Szilagyi document that simple accounting ratios alone explain only a modest share of the variation in corporate default probability. The remainder is driven by market-based inputs (stock volatility, excess returns) and macro inputs (term spreads, unemployment). The implication for retail scoring is analogous: origination-time features alone miss a substantial chunk of the variation that later unfolds, and behavioral and macro features close the gap.

### The 1970s and the regulatory response

Four US laws in the 1970s shaped scoring for the next fifty years, and an earlier statute laid the groundwork. The Fair Credit Reporting Act of 1970 created the statutory framework for consumer credit reports: who may issue them, who may obtain them, what permissible purposes are, how errors are disputed, and how long adverse information may remain (seven years for most items, ten for bankruptcies). Before FCRA, bureau records were essentially private commercial property and consumers had no legal right to inspect their own files. After FCRA, consumers could request reports, dispute errors, and see who had pulled their file. The law simultaneously created the modern bureau compliance regime and enabled the bureau-based scoring that became the industry standard. The Equal Credit Opportunity Act of 1974 prohibited credit discrimination on the basis of sex and marital status; the 1976 amendments added race, color, religion, national origin, age, and receipt of public assistance. The Federal Reserve's Regulation B, first published in 1975, implemented ECOA. Two of Regulation B's provisions matter in particular for scoring. The effects test, codified in the 1977 Regulation B revisions, required that a scoring system's outputs not produce disparate impact against protected groups unless the system was empirically derived, demonstrably and statistically sound, and the specific features used were justified by business necessity.
Adverse-action notices, required for any denial or less-favorable approval, had to state the principal reasons for the action in writing, with specific reason codes. The Home Mortgage Disclosure Act of 1975 required mortgage lenders above a threshold size to disclose the geographic distribution of their lending; since the 1989 amendments, the disclosure has been loan-level and public, including applicant race, ethnicity, sex, and census tract. HMDA is the primary data source for academic and regulatory work on mortgage fairness [@bhutta2021how; @bartlett2022consumer]. The 2018 HMDA amendments, implementing sections of the Dodd-Frank Act, expanded the data fields to include interest rate, debt-to-income ratio, and property value, which sharply increased the data's usefulness for fair-lending analysis. The Community Reinvestment Act of 1977 required depository institutions to serve the credit needs of their local communities; it does not directly regulate scoring, but CRA examinations consider lending distributions that scoring shapes. An earlier statute, the Fair Housing Act of 1968, prohibits discrimination in residential real estate transactions, including mortgage lending, on protected characteristics. The Fair Housing Act and ECOA overlap for mortgage lending; they diverge on other credit products. The structure that emerged by the late 1970s has three anchor properties. First, scoring is legally permitted, and even preferred, over subjective assessment, but must be empirically validated and cannot use protected characteristics as direct inputs. Second, consumers have statutory rights to inspect, dispute, and receive reasons. Third, aggregate lending distributions are publicly observable and subject to fair-lending oversight. All three properties still hold and shape modern practice, including the governance of machine-learning credit models.

### International parallels

The US scoring history is not universal.
The United Kingdom developed bureau-based scoring in the 1980s and 1990s on a similar timeline, with Experian (through CCN and its predecessors), Equifax, and Callcredit (now TransUnion UK) as the main bureaus, and with the Office of Fair Trading and later the Financial Conduct Authority as the regulators. The UK's consumer-credit legislation, the Consumer Credit Act of 1974, predates ECOA but focuses more on truth-in-lending than on fair-lending. Disparate-impact analysis is a weaker part of the UK tradition, although the Equality Act 2010 provides the statutory hook when needed. Continental Europe developed scoring more slowly in the consumer segment because bureau coverage was thinner and bank-based relationship lending was stronger. Germany's SCHUFA is owned and contributed to by the banking sector under a mutualized structure; the Data Protection Directive and its successor GDPR impose constraints on automated decision-making that are stricter than US rules. France's credit information landscape has historically been dominated by the Fichier des Incidents de Remboursement des Crédits aux Particuliers, a negative-data registry, with positive data added only recently. Scoring in these jurisdictions has depended more on internal bank data and less on third-party bureau scores than the US equivalent. East Asia has developed alternative architectures. Japan has multiple credit information centers (JICC, CIC, JBA) with statutory information-sharing and a scoring industry centered on retail banks and consumer finance companies. Korea has the Korea Credit Bureau and NICE Information Service, which calculate proprietary scores analogous to FICO. China's credit-scoring landscape is shaped by the People's Bank of China's Credit Reference Center and by private scoring systems built on top of the Alipay and WeChat Pay platforms (see @bis2020data). 
India has CIBIL (TransUnion India), Experian India, Equifax India, and CRIF High Mark, with scoring that developed quickly after Reserve Bank of India licensure in 2010. Emerging markets present the most dramatic contrast. Many African, Latin American, and South Asian countries have thin bureau coverage, shallow banking penetration, and a large unbanked population. Scoring in these markets relies heavily on alternative data: mobile-money transaction history, psychometric test results, utility-payment records, and social-graph signals. @bazarbash2019fintech and @bis2020data are the main macro treatments. @gambacorta2024data is an account of Chinese fintech scoring in particular.

### Regulatory and structural backdrop

Through the 1960s and 1970s, the legal environment shifted. The Fair Credit Reporting Act of 1970 (FCRA) gave consumers the right to access their credit reports and to dispute inaccuracies, and it required bureaus to follow reasonable procedures to ensure accuracy. The Equal Credit Opportunity Act of 1974 (ECOA), as amended in 1976, prohibited discrimination in credit on the basis of race, color, religion, national origin, sex, marital status, or age. Regulation B, issued by the Federal Reserve to implement ECOA, allowed empirically derived, demonstrably and statistically sound (EDDSS) credit scoring systems and specified the conditions under which characteristics like age could be used. The legal architecture created a demand for statistical models that could be documented and defended, which accelerated the industry's move away from subjective judgment. At the macro level, the 1970s and early 1980s saw a surge in consumer credit volume. Revolving credit on bank-issued cards grew rapidly. Deposit-rate deregulation under the Monetary Control Act of 1980 and the diffusion of the MasterCard and Visa interchange networks expanded the addressable market.
The combination of legal pressure to standardize, commercial pressure to scale, and the increasing availability of mainframe computing produced the environment into which modern credit scoring arrived.

## The FICO era, 1956 to 2000

### Fair, Isaac founding and the scorecard form

Fair, Isaac and Company was founded in 1956 in San Rafael, California, by Bill Fair, an engineer, and Earl Isaac, a mathematician, who had met at the Stanford Research Institute. Their first products were custom scorecards sold to individual lenders. The scorecard form is a linear model that scores a borrower on a set of categorical or banded characteristics and sums points to produce a three-digit score. The form derives from the logistic model (@eq-logistic) after substituting weight-of-evidence (WoE) transformations of the original features:

$$
\text{WoE}_j(x) = \log\left( \frac{\Pr(X_j = x \mid Y = 0)}{\Pr(X_j = x \mid Y = 1)} \right),
$$

and fitting logistic regression on the WoE-encoded features. Points for a bin are the contribution of that bin's WoE to the log-odds, rescaled to the FICO convention (typically points-to-double-the-odds = 20, base score = 600 at base odds = 50:1). This formalism has several operational virtues that kept it dominant through the 1980s and 1990s. First, the scorecard is trivially interpretable: points per bin add up to the score, and the contribution of each characteristic to the score is transparent. Second, bin-based encoding handles nonlinearity without requiring explicit polynomial or spline terms. Third, the form maps cleanly onto adverse-action notice requirements under ECOA, because the four or five characteristics that contributed the most negative points can be listed as reasons for denial. Fourth, the scorecard is robust to missing values when missingness is treated as its own bin. We derive the scorecard formalism in full in a later chapter.
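
The WoE transformation and the points scaling described above can be sketched as follows. This is a minimal illustration under stated assumptions: $Y = 1$ is taken to denote a bad account and $Y = 0$ a good one, the bin counts are invented, the intercept `alpha` and coefficient `beta` are hypothetical stand-ins for a fitted logistic regression of the good-versus-bad log-odds on the WoE-encoded feature, and the intercept is spread evenly across characteristics, which is one common convention rather than the only one.

```python
import math

# Scaling convention quoted in the text: 20 points double the odds,
# and a score of 600 corresponds to good:bad odds of 50:1.
PDO = 20.0
BASE_SCORE = 600.0
BASE_ODDS = 50.0

FACTOR = PDO / math.log(2.0)                         # points per unit of log-odds
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)   # score at log-odds of zero


def woe_by_bin(good_counts, bad_counts):
    """WoE per bin: log of (share of goods in bin) / (share of bads in bin).

    Assumes every bin holds at least one good and one bad account;
    empty cells would need merging or smoothing first.
    """
    total_good = sum(good_counts.values())
    total_bad = sum(bad_counts.values())
    return {
        b: math.log((good_counts[b] / total_good) / (bad_counts[b] / total_bad))
        for b in good_counts
    }


def score_from_log_odds(log_odds_good):
    """Rescale good-versus-bad log-odds to the points scale."""
    return OFFSET + FACTOR * log_odds_good


def bin_points(woe, beta, alpha, n_characteristics):
    """Points for one bin, with the intercept split evenly across characteristics."""
    return FACTOR * beta * woe + (OFFSET + FACTOR * alpha) / n_characteristics


# Invented counts for a single characteristic (tenure at current job).
good = {"<1y": 200, "1-5y": 500, "5y+": 300}
bad = {"<1y": 40, "1-5y": 40, "5y+": 20}
woe = woe_by_bin(good, bad)

# Hypothetical fitted values: intercept equal to the sample log-odds, slope of one.
alpha, beta = math.log(10.0), 1.0
points = {b: round(bin_points(w, beta, alpha, 1)) for b, w in woe.items()}
print(woe)
print(points)
```

With a slope of one, the `<1y` bin, whose share of goods is half its share of bads, sits exactly 20 points below a neutral bin, which is the points-to-double-the-odds convention read in reverse. Rounding to integers at the end mirrors the lookup-table form in which scorecards are deployed.
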
The information-value statistic, usually credited to the Fair Isaac technical tradition, measures the predictive strength of a binned feature:

$$
\text{IV}_j = \sum_{b \in \text{bins}_j} \left( \Pr(X_j = b \mid Y=0) - \Pr(X_j = b \mid Y=1) \right) \cdot \text{WoE}_j(b).
$$

By industry rule of thumb, $\text{IV} < 0.02$ is unpredictive, $0.02 \le \text{IV} < 0.1$ is weak, $0.1 \le \text{IV} < 0.3$ is medium, and $\text{IV} \ge 0.3$ is strong; values above roughly 0.5 are suspicious and should be checked for leakage. The statistic is equivalent to the symmetrized Kullback-Leibler divergence between the binned feature distribution conditional on good and the binned feature distribution conditional on bad. It gives the modeler a fast, univariate screen before stepwise logistic regression. The fine-to-coarse classing procedure is the signature operational step of scorecard development. Fine classing divides each feature into many small bins, often deciles for continuous features and observed categories for discrete features. Coarse classing then merges adjacent bins to produce a stable, monotone WoE profile with enough observations per bin to estimate the WoE reliably. The typical target is 20 to 50 bins fine, 4 to 8 bins coarse. Monotonicity is usually imposed to match business intuition (for example, a bin encoding longer tenure at the current job should have a WoE at least as favorable as the adjacent shorter-tenure bin).

### Bureau data and the FICO score

Through the 1970s Fair, Isaac delivered custom scorecards to banks and retailers. The product that changed the industry was the bureau-based generic score. In 1989, Fair, Isaac and Equifax released the Beacon score; similar products followed with TransUnion (Empirica) and the predecessors of Experian (Fair Isaac Risk Model). By 1995, Fannie Mae and Freddie Mac endorsed the FICO score for mortgage underwriting, which anchored the score as the de facto standard.
@mester1997whats gave an early survey from the Federal Reserve Bank of Philadelphia; @avery2009credit reports on the diffusion effects; @frb2007report is the Federal Reserve's comprehensive congressional report on the availability and affordability effects. The FICO score itself is a weighted sum constructed from bureau data with five published component families: payment history (about 35 percent of the weight), amounts owed (about 30 percent), length of credit history (about 15 percent), new credit (about 10 percent), and credit mix (about 10 percent). The precise algorithm is proprietary. What is public is the range (300 to 850), the distribution shape, and the broad feature-family weights. For a lender, the key property is that the score is comparable across applicants and across time, which allowed the entire mortgage, auto, and card industries to standardize underwriting guidelines in terms of score bands. That standardization, combined with the GSE endorsement, made the FICO score the central coordinating institution of US consumer lending by the late 1990s. The three-bureau structure also generated an important product distinction that persists today. Each bureau runs its own version of FICO (the Beacon variants at Equifax, the Empirica variants at TransUnion, and the Fair Isaac Risk Model variants at Experian), trained on its own historical data, and a given consumer can have three somewhat different FICO scores at any moment. Mortgage underwriters pull all three and take the middle; card issuers often pull one and make a decision against it. The VantageScore consortium, founded in 2006 by the three bureaus, tried to unify the scoring tradition outside Fair Isaac's pricing regime; it has seen meaningful but minority adoption. The competitive dynamic between FICO and VantageScore continues to shape what data flow into bureau scores, what cutoffs dominate underwriting, and how regulators think about the concentration of this market. 
In 2022, the Federal Housing Finance Agency announced that Fannie Mae and Freddie Mac would begin accepting both FICO 10T and VantageScore 4.0 in mortgage underwriting, a multi-year transition that ends the pure FICO monopoly in GSE-eligible originations. Three structural consequences of FICO's dominance bear on modern scoring practice. First, the score acts as a compression layer between bureau data and lender decisions. A lender that relies primarily on FICO has a less granular view of the borrower than a lender that pulls raw tradelines. **FinTech lenders have exploited this gap by building in-house models on raw bureau data that compress differently, and often better, than FICO for specific product-segment pairings**. Second, FICO is a regulated model: Fair Isaac has a model-governance regime and regularly publishes performance statistics to lender clients. This is one of the reasons the model remained stable over decades; changes to FICO have knock-on effects on mortgage underwriting guidelines that neither the GSEs nor their regulator wants to process frequently. Third, the FICO score itself has become a feature in downstream models. Lenders build their own probability of default models on top of bureau data and include the FICO score as one input; bureau scores include FICO as a feature in some variants; and academic work on mortgage pricing [@bhutta2021how] uses FICO bands as an explanatory variable in causal analyses of disparities. The score is simultaneously an output and an input.

### ECOA, Regulation B, and the compliance architecture

The compliance infrastructure around scoring tightened in parallel. Regulation B required that any demographic characteristic used in a credit decision be empirically validated as predictive and not function as a proxy for protected class.
The Office of the Comptroller of the Currency, the Federal Reserve, and the Federal Deposit Insurance Corporation issued examination manuals that specified how scorecards should be documented, how override rates should be tracked, and how disparate-impact testing should be performed. @hoffman1983interpretation provides an early legal analysis of how the ECOA effects test applied to scoring. The combination of ECOA and FCRA pushed lenders toward systems where each decision could be explained to the applicant and audited by the supervisor. The scorecard form fit that requirement naturally. A practical consequence of ECOA that every modern practitioner confronts is the adverse-action notice. When an application is denied or approved on less favorable terms than requested, the lender must provide the principal reasons for the action. For a scorecard, the reason codes are the characteristics that contributed the largest negative points relative to the base, typically presented as four or five reason codes chosen from a fixed menu per product. The menu is designed to be non-discriminatory on its face: "level of delinquency on credit accounts" is acceptable; "balance on revolving accounts" is acceptable; something like "zip code" is not, because it can function as a proxy for race. The transition from scorecards to machine-learning models has complicated the adverse-action notice: a gradient-boosted tree ensemble does not have additive, feature-level contributions to the score in the same way a scorecard does. Shapley-value decompositions [@lundberg2017unified], applied to the ensemble's output, provide the functional equivalent of scorecard points and are now the dominant approach to ML adverse-action notices. The Community Reinvestment Act of 1977 is a parallel but distinct constraint. CRA requires depository institutions to serve the credit needs of the communities in which they operate, including low- and moderate-income neighborhoods. 
Scoring does not directly violate CRA, but the aggregate distribution of lending across census tracts is an examination item, and a scoring model that systematically underweights features specific to lower-income applicants can trigger CRA concerns even if it does not violate ECOA on an individual-applicant basis.

### Small-business scoring and the relationship-transaction debate

Scoring technology spread from consumer to small-business lending in the 1990s. @frame2001effect document the diffusion and its effect on small-business credit supply. @petersen1994benefits had earlier established the value of lending relationships in small-business credit, where soft information about the borrower's management and local conditions was the dominant input. @petersen2002does, using the same Survey of Small Business Finances, found that after the diffusion of scoring, the mean distance between small-business borrowers and their lenders rose substantially. The interpretation was that hard information (coded in the score) was replacing soft information (produced by proximate loan officers). @liberti2019information survey the modern literature on this transition.

### The 1997 Hand and Henley synthesis

@hand1997statistical is the clearest mid-1990s statement of where the field had arrived methodologically. The authors reviewed linear (@sec-ch06-discriminant) and quadratic (@sec-ch06-qda) discriminant analysis, logistic regression, nearest-neighbor methods, classification trees, and early neural networks, evaluated them on consumer credit data, and concluded that sophisticated methods rarely outperformed logistic regression by enough to justify the loss of interpretability. @hand2006classifier generalizes the argument. @hand2009measuring critiques AUC as a coherent performance measure and proposes the H-measure. @thomas2000survey and @crook2007recent are complementary surveys.
For two decades, the logistic scorecard was the industry standard, not because it was the most accurate method available, but because the marginal accuracy gain from alternatives was small, the cost of moving away from an interpretable model was high, and the governance infrastructure was aligned around scorecards.

## The machine-learning era, 2000 to the present

### The Baesens benchmark and the ensemble turn

The turning point on the methodology side was @baesens2003benchmarking. Baesens and coauthors ran a head-to-head benchmark of linear discriminant analysis (@sec-ch06-discriminant), quadratic discriminant analysis (@sec-ch06-qda), logistic regression, classification trees, k-nearest neighbors, least-squares support vector machines, and several neural network architectures on eight credit datasets. The two headline findings: no single classifier dominated, but the nonlinear methods, support vector machines and neural networks, produced the best AUC on most datasets by a small but consistent margin. The gap was 1 to 3 AUC points in most cases, which is material in risk-adjusted profit but not revolutionary. The interpretation was that the loss function of credit scoring is benign enough that simple methods do almost as well as complex ones. @lessmann2015benchmarking updated the study with 41 classifiers on eight datasets and arrived at a sharper conclusion. Heterogeneous ensembles, particularly ensembles of neural networks and gradient boosting machines, consistently beat logistic regression by an AUC margin of 3 to 8 points, which corresponds to a Gini improvement of 6 to 16 points. Ensembles of ensembles dominated. The authors reported that 17 of the 41 classifiers statistically outperformed logistic regression on their multi-dataset comparison after Bonferroni correction. Between the two benchmarks, the underlying algorithms evolved. @breiman2001random introduced random forests.
@friedman2001greedy introduced gradient boosting for regression and the AdaBoost cousin for classification. @friedman2000additive showed that AdaBoost is a greedy additive logistic-regression-style fit. @chen2016xgboost released XGBoost, which became the dominant credit-scoring algorithm in the industry within three years of publication. @ke2017lightgbm (LightGBM) and @prokhorenkova2018catboost (CatBoost) followed with faster histogram-based and ordered-boosting variants. The Gradient Boosted Decision Trees (GBDT) family combined the interpretability and feature-handling advantages of trees with the error-reduction benefits of ensembling. Three features of GBDT drove industry adoption in credit, specifically. First, GBDTs handle mixed-type data (numeric, categorical, missing) natively, without feature engineering. A credit scoring dataset has hundreds of raw bureau attributes, each with varying fractions of missingness tied to account age and type; logistic regression requires imputation and careful WoE binning for each, whereas XGBoost learns the missing-direction automatically. Second, GBDTs reach near-top performance with a few hundred rounds and default hyperparameters, which reduces development-cycle cost relative to support-vector machines or neural networks. Third, GBDT models are reasonably interpretable after Shapley decomposition, which aligns with the ECOA adverse-action requirement and the SR 11-7 explainability expectations. The combination of data-handling convenience, out-of-the-box accuracy, and post-hoc interpretability is why the GBDT family, not deep learning, won the credit-scoring market despite the parallel deep-learning revolution in image and language tasks. Deep learning for tabular credit data has had a slower path. Early applied-ML papers reported marginal gains from deep networks over gradient boosting, in the 1 to 2 AUC-point range, but the gains were inconsistent across datasets and sensitive to preprocessing and hyperparameter choice. 
@grinsztajn2022why provided the most rigorous side-by-side comparison and concluded that tree-based models still outperform deep learning on tabular benchmarks, attributing the gap to inductive biases: trees handle piecewise-constant patterns and irregular feature distributions better than neural networks without extensive engineering. @arik2021tabnet and @gorishniy2021revisiting propose attention-based architectures for tabular data that narrow the gap. The consensus as of 2024 is that for a typical credit-scoring dataset with a few hundred features and a few hundred thousand to a few million observations, a well-tuned GBDT is the default, and deep learning should be a challenger, not the primary. The calculus changes when the feature set includes unstructured inputs: text, images, graphs, or sequences.

### Industry adoption

Adoption at regulated lenders was gradual. Basel II, published in 2006 [@basel2006international], allowed the internal-ratings-based (IRB) approach in which banks use their own PD (Probability of Default), LGD (Loss Given Default), and EAD (Exposure at Default) estimates for regulatory capital, which raised the compliance cost of any change to an approved model and slowed adoption of machine learning in that segment. Card issuers, unsecured-personal lenders, and FinTech firms, which did not compute regulatory capital under IRB, were faster. By the late 2010s, most US card issuers had production XGBoost models for originations and for line management. Mortgage underwriting remained anchored to FICO and Desktop Underwriter / Loan Prospector automated underwriting through the 2008 crisis and after. @khandani2010consumer is a representative academic-industry bridge: the authors applied machine-learning classifiers to combined transaction-level and bureau data from a major US bank and reported 6 to 25 percent improvements in the cost-adjusted forecast of 90-day delinquency.
@verbraken2014novel proposed profit-based performance measures that tied classifier selection to the lender's expected profit curve. @finlay2011multiple built multi-classifier architectures that approximated the top line of later benchmarks. @breeden2020survey surveys the credit-risk ML literature through 2019. The industry-academic split on adoption is worth noting. Top-tier finance journals accepted ML-credit papers only after the benchmark was established and the disparity-effects literature caught up. @fuster2019role and @fuster2022predictably in RFS and JF, @bartlett2022consumer in JFE, and @howell2024lender in JF are the recent anchor papers. The machine-learning venues (NeurIPS, ICML, KDD, JMLR) accepted credit-scoring applications earlier, but often with small datasets and a narrower lens on the policy consequences. The gap has narrowed as the same authors began to publish across both literatures, and the regulatory interest in algorithmic credit has forced a convergence. The book attempts to respect both, with the theory and method sections drawing on the ML venues and the empirical and regulatory sections drawing on the finance venues.

### Alternative data and the FinTech wave

Two parallel developments expanded the input space. The first was alternative-data scoring. @berg2020rise document that a German online lender replaced traditional credit-bureau inputs with digital footprints, such as device type, operating system, time of day of the application, email-provider class, and page-navigation behavior, and obtained discrimination at least as good as a credit-bureau baseline. On their sample of roughly 250,000 applications, the digital-footprint model delivered an AUC of 0.696 versus 0.683 for the credit-bureau model and 0.736 for the combination. The implication is that a lender with essentially zero bureau history, the typical FinTech starting position, can still underwrite competitively using only the trace left by an online application.
@iyer2016screening and @lin2013judging studied a related problem in peer-to-peer lending, where small-borrower data were combined with social-network data and verbal descriptions. @duarte2012trust introduced the appearance-trust mechanism: borrowers who appear trustworthy in their profile photographs are more likely to be funded and less likely to default, even after controlling for observable credit-quality signals. @vallee2019marketplace situated marketplace lenders in the broader banking landscape. @buchak2018fintech documented the rise of shadow banks in US mortgage lending and the role of technology in that rise, with FinTech lenders' share of the US mortgage market rising from near zero in 2007 to roughly 10 percent by 2015 and roughly 15 percent by 2019. @fuster2019role isolated the role of technology adoption in mortgage refinancing take-up and found that tech-enabled lenders processed applications roughly 20 percent faster, which passed through partially to a higher refinance take-up rate. @jagtiani2019roles analyzed LendingClub directly and documented that the platform's internal grades contained information beyond FICO. The alternative-data story is not uniformly positive. The same signals that predict default can correlate with protected characteristics, creating legal exposure under ECOA effects testing even without explicit use of protected attributes. Device type, operating system, and page-navigation timing all carry demographic information; the residual predictive power of those signals, after netting out demographic content, is what the lender is entitled to use. Separating the two is non-trivial and is one of the motivations for the causal-fairness work in a later chapter of this book. A second concern is stability: digital-footprint signals can be gamed. Applicants who learn that iOS devices get better offers will acquire iOS devices, or use them for the application, even if they don't otherwise. The signal then decays. 
We will discuss the practical stability evidence in a later chapter.

The second development was big-tech platform scoring. @bis2020data document that a Chinese fintech platform's machine-learning models trained on payment and commerce data from its parent platform can predict small-business default at least as well as, and sometimes better than, commercial-bank models that rely on collateral values and financial statements. @gambacorta2024data extends the analysis. @bazarbash2019fintech surveys the fintech-lending literature from an IMF perspective. @philippon2016fintech frames the welfare question: how much of the incumbent banking system's margin is due to genuine intermediation and how much is due to legacy cost that fintech can displace.

### Fairness, interpretability, and the regulatory response

As machine-learning models entered credit decisions, two literatures intensified. The first is fairness. @hardt2016equality proposed equalized-odds and equal-opportunity criteria for supervised learning. @chouldechova2017fair proved that under base-rate differences across groups, multiple natural fairness definitions cannot be simultaneously satisfied. @kusner2017counterfactual proposed counterfactual fairness as an alternative causal criterion. @barocas2016big gave the legal framing in Big Data's Disparate Impact. @hurlin2026fairness provide a recent fairness benchmark specifically for credit scoring.

The second is explainability. @ribeiro2016why introduced LIME. @lundberg2017unified introduced SHAP. @mitchell2019model proposed model cards for model reporting. We will derive and apply these tools.

The Federal Reserve's SR 11-7 [@sr117] guidance on model risk management, first published in 2011, is the document every US bank model team reads before deploying a scoring model. Basel's BCBS 239 [@bcbs239] governs risk-data aggregation. IFRS 9 [@ifrs9] and CECL [@cecl] govern expected-loss provisioning.
The EU AI Act, adopted in 2024, classifies credit-scoring systems as high-risk AI and imposes documentation, human oversight, and incident-reporting requirements. GDPR Article 22 restricts automated decisions that produce legal or similarly significant effects on individuals, which scoring generally does.

### The FinTech empirical literature on disparities

Three Journal of Finance and Journal of Financial Economics papers define the current empirical frontier on fintech and disparities. @fuster2022predictably show that the switch from logistic regression to nonlinear machine-learning models in mortgage pricing raises predicted default rates for minority borrowers relative to the same borrowers under linear models, even when the training data are identical and the predictors are fair on their face. @bartlett2022consumer decompose the fintech-lending pricing wedge and find that FinTech lenders price-discriminate less than traditional branches on the origination decision, but the interest-rate disparity persists. @howell2024lender show that automation in Paycheck Protection Program (PPP) small-business lending narrowed racial gaps in credit access, consistent with human discretion being a source of the gap. These three papers pull in different directions on the net welfare effect of algorithmic credit, and the resolution will come from careful empirical work on decision-making margins.

The mechanism in @fuster2022predictably is instructive. The authors fit both logistic regression and random forest models on identical mortgage data from Fannie Mae and Freddie Mac, using the same features and the same training period. They then compute predicted default probabilities for Black, Hispanic, Asian, and White borrowers and compare the distributions. The random forest predictions are systematically higher for Black and Hispanic borrowers than the logistic-regression predictions, and systematically lower for White borrowers.
The authors trace the differential to feature interactions captured by the tree ensemble that are not captured by the linear model. A feature that is modestly correlated with a protected characteristic becomes more predictive when combined nonlinearly with other features, and the nonlinear combination carries more of the protected-characteristic signal than either feature alone. This is not a data problem; it is a model-class problem. The policy response, the authors suggest, may require either constraining the model class or applying fairness constraints at training time. @howell2024lender exploit the PPP program's automated lending channels as a natural experiment. The program had both human-underwritten loans at banks and fully automated loans at online lenders; both operated under identical federal guarantees. The authors compare racial gaps in access across the two channels and find the automated channel had a 13-percentage-point narrower Black-White gap in loan receipt. The identification rests on the quasi-random assignment of applicants to channels, partly based on pre-existing banking relationships and partly on the timing of different lenders coming online. The interpretation is that human loan officers, not algorithms, were a material source of the disparity, and that algorithmic triage was therefore, on net, a fair-lending improvement in this setting. Whether the result generalizes beyond PPP, where the underwriting was thin and the guarantee was federal, is an open empirical question. The reconciliation of these seemingly opposed results is that the effect of algorithmic credit on fairness depends on the counterfactual. Relative to a fully judgmental loan officer with biases, algorithms can be fairer. Relative to a well-specified linear model, nonlinear algorithms can be less fair in the distribution of predictions. 
The policy-relevant question is which counterfactual applies to which decision, and how to design the model-selection procedure so that the right counterfactual is realized.

### An empirical baseline

Before the rest of the book piles on more elaborate methods, it is worth reporting the baseline: how well does a textbook logistic scorecard do on two canonical public datasets, the 1,000-row UCI Statlog German Credit set (discussed in @hand1997statistical) and the 30,000-row UCI Taiwan Credit Card Default set [@yeh2009comparisons]? The next subsection runs exactly that experiment. Every later chapter benchmarks against the same split and the same metrics.

#### Loading the datasets

The `creditutils` module exposes deterministic loaders for the two public datasets. Both come from the UCI Machine Learning Repository and are cached in `book/data/` after the first fetch. `train_valid_test_split` performs a deterministic 60/20/20 partition keyed by seed, so every chapter that imports it sees the same rows in the same slice. The three slices serve distinct roles:

- **Training set (60 percent).** The rows the model actually fits on. Coefficients, splits, embeddings, and any other parameters are estimated only from this slice.
- **Validation set (20 percent).** A held-out slice used *during* development to pick hyperparameters, thresholds, and early-stopping rounds. It is seen many times but never fit on directly.
- **Test set (20 percent).** A locked-away slice touched exactly once, at the end, to report out-of-sample performance. Anything tuned against it stops being a test set and starts being a second validation set.

The German set carries a 30 percent default rate by construction (the original Statlog protocol oversampled defaults to balance the classes). The Taiwan set carries a 22 percent rate, which is the actual portfolio rate in the 2005 vintage.
Neither is representative of a modern US prime portfolio, but both are standard benchmarks in the credit-scoring literature and every method in this book will be evaluated on them.

#### A minimal logistic scorecard

The first baseline is logistic regression with standard scaling for numeric features and one-hot encoding for categoricals. No feature engineering, no weight-of-evidence binning, no regularization tuning. This is the simplest defensible model.

Three observations are worth absorbing. First, the AUC on both datasets is in the 0.74 to 0.83 range. That is the neighborhood of performance where all subsequent benchmarks in this book will live. A nonlinear model that beats this by more than 3 AUC points on a holdout of this size should be suspected of data leakage. A model that beats it by 1 to 2 AUC points is doing what @baesens2003benchmarking and @lessmann2015benchmarking predict. Second, the KS statistic on the German set is above 0.5, which reflects the heavily oversampled target distribution and the small sample size. The Taiwan KS, near 0.4, is closer to a production figure for an unsecured revolving product. Third, the Brier scores are near the variance of the labels, which is expected for an unregularized logistic fit; calibration work in a later chapter will close some of that gap.

#### Converting probabilities to scorecard points

The scorecard convention, inherited from Fair Isaac, rescales the log-odds into integer points with two free parameters: a base score at a base odds ratio, and a points-to-double-the-odds (PDO) value. Higher points mean lower risk. The default parameters in `creditutils.scorecard_points` are a base score of 600 at base odds 50:1 (50 good per 1 bad) and PDO = 20.

The bulk of both distributions lands in a FICO-adjacent band. On the German set, the 5th to 95th percentile range is roughly 450 to 590, tight around a median near 525, because the sample is small and the 30 percent default rate compresses the log-odds.
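
The points conversion described above can be sketched in a few lines. This is a minimal illustration, not the book's `creditutils.scorecard_points` implementation, whose signature is not shown in this text; the function name `points_from_pd` and its arguments are hypothetical. It assumes the usual scaling convention: a factor of PDO / ln 2 points per unit of log-odds, an offset chosen so that the base odds map to the base score, and odds quoted as good:bad.

```python
import math

def points_from_pd(p_default, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map a default probability to scorecard points (hypothetical helper).

    Odds are good:bad, so a lower default probability gives more points.
    """
    if not 0.0 < p_default < 1.0:
        raise ValueError("p_default must lie strictly between 0 and 1")
    factor = pdo / math.log(2.0)                        # points per unit of log-odds
    offset = base_score - factor * math.log(base_odds)  # pins base_score at base_odds
    good_bad_odds = (1.0 - p_default) / p_default
    return offset + factor * math.log(good_bad_odds)

# An applicant at exactly 50:1 odds sits on the base score;
# doubling the odds to 100:1 adds exactly PDO points.
at_base = points_from_pd(1.0 / 51.0)
at_double = points_from_pd(1.0 / 101.0)
```

Rounding the result to the nearest integer gives points in the usual scorecard form. Under these defaults a 30 percent default probability maps to roughly 512 points, which is why a high-default benchmark population sits well below the 600 base score.
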
On the Taiwan set the 5th to 95th range is roughly 480 to 570 with a similar median, but the tails are much wider: the minimum dips to 353 and the maximum reaches 1085. That right tail is not a scoring artifact; it is the unregularized logistic model producing near-zero default probabilities for a handful of very safe applicants, which the points formula then maps to scores well above any realistic cutoff. Calibration and regularization in later chapters pull those tails in. For policy purposes the usable signal sits in the interquartile range: a 680 cutoff would accept essentially the entire prime population here; a 620 cutoff would accept near-prime. The actual FICO algorithm, of course, is proprietary and uses many more inputs than the 20 or 23 variables in these sets.

#### Default rate by score decile

The next plot is the single most common diagnostic in credit-scoring practice. The test set is ranked by predicted score and partitioned into ten deciles of equal size. The realized default rate within each decile is plotted. A good model produces a monotone, steeply increasing step function: the riskiest decile (decile 10) should have a default rate several multiples of the safest decile (decile 1).

As shown in @fig-decile-default, neither panel is perfectly monotone, and that is the point of plotting this diagnostic rather than relying on AUC alone. On the German holdout the ordering is correct only in the coarse sense: the three riskiest deciles sit well above the 0.275 portfolio rate (0.50, 0.50, 0.75) and the four safest sit well below it, but the middle of the curve is not rank-ordered. With only 20 applicants per decile, a single bad moves the rate by 5 points, so those middle-decile inversions are sampling noise, not a failure of the model. On the Taiwan holdout, the tails are sharp, but the middle six deciles are essentially flat between 0.11 and 0.18, with a mild inversion around deciles 3 to 5.
This is the typical shape for an unregularized logistic model on a 22-percent-default population: strong separation at the extremes, weak separation in the belly. That belly is where calibration, WoE binning, and nonlinear models earn their keep in later chapters.

That lift at the tails is already the economic value of the model: by rejecting the two riskiest deciles, a lender cuts the loss rate on the remaining book substantially, at the cost of a smaller book. Later chapters will derive the expected-profit calculus that turns this diagnostic into a cutoff-selection rule, formalize benchmark comparisons, and add the fairness layer on top.

#### A lift table for economic interpretation

A slightly more structured version of the decile plot is the cumulative-gains and lift table. It is among the most-used artifacts in credit model reviews because it translates model discrimination into a quantity a credit officer can read: if we accept the top X percent of applicants by score, what fraction of defaults have we avoided? The following block produces that table for the Taiwan holdout.

Read the table left to right. If the lender accepts the safest 10 percent of applicants (cum_pop = 0.10, the left end of the sorted score distribution), the captured-bad percentage is well below 10 percent, meaning the accepted book has few defaults relative to the population. The realized default rate on that 10 percent book is near 10 percent of the population default rate, which is the lift benefit of the model. Moving along the rows, the lender trades volume for quality. The shape of this curve is the discriminatory content of the model at every operating point.

#### Comparing the two datasets side by side

The German ROC curve sits visibly higher than the Taiwan ROC curve, reflecting the higher AUC on the small balanced-label dataset. Both curves are recognizably concave; neither shows the pathology of a dominated ROC (which would indicate a poorly calibrated or leakage-corrupted model).
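
The decile plot and the cumulative-gains table described above share one computation, which the following sketch makes explicit. It is a standard-library illustration under stated assumptions, not the book's own code: the function name `gains_table`, the `decile`, `bad_rate`, and `cum_bad` keys, and the toy data are all invented here; only the `cum_pop` name comes from the text. Applicants are sorted from safest to riskiest, so decile 1 is the safest bin.

```python
def gains_table(y_true, p_default, n_bins=10):
    """Per-bin default rates and cumulative gains, safest applicants first."""
    if len(y_true) != len(p_default):
        raise ValueError("y_true and p_default must have the same length")
    # Rank applicants from lowest to highest predicted default probability.
    order = sorted(range(len(y_true)), key=lambda i: p_default[i])
    n, total_bads = len(order), sum(y_true)
    rows, seen, seen_bads = [], 0, 0
    for b in range(n_bins):
        # Integer bin edges keep the bins as equal-sized as the sample allows.
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        bads = sum(y_true[i] for i in order[lo:hi])
        seen += hi - lo
        seen_bads += bads
        rows.append({
            "decile": b + 1,  # 1 = safest, n_bins = riskiest
            "n": hi - lo,
            "bad_rate": bads / (hi - lo) if hi > lo else float("nan"),
            "cum_pop": seen / n,
            "cum_bad": seen_bads / total_bads if total_bads else float("nan"),
        })
    return rows

# Toy data (invented): 20 applicants, defaults concentrated
# among the highest predicted default probabilities.
toy_p = [i / 20 for i in range(20)]
toy_y = [0] * 14 + [1, 0, 1, 1, 1, 1]
table = gains_table(toy_y, toy_p, n_bins=5)
```

Reading `cum_bad` at `cum_pop = 0.8` in the toy table says that a lender accepting the safest 80 percent of this book takes on 20 percent of its defaults; plotting `bad_rate` against `decile` gives the step function of the decile diagnostic.
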
We will discuss how to derive the connection between ROC and cost curves formally.

#### Calibration check

A final diagnostic: is the probability output of the logistic baseline well-calibrated? A calibrated model satisfies $\Pr(Y=1 \mid \hat\pi = p) \approx p$ across the range of $p$. The quick check is a reliability plot that bins predicted probabilities and compares bin-mean predictions to bin-mean realized default rates.

As shown in @fig-calibration, the plot is close to the diagonal across most of the range, with some mild overprediction in the highest-probability bin. A production model would typically refine this with isotonic regression or Platt scaling.

This baseline is the reference point for every subsequent chapter. A logistic regression gets you to an AUC of 0.74 on the Taiwan set and 0.83 on the German set with a few lines of code, a fifteen-minute setup, and total transparency. Every additional complexity that the book adds should be evaluated against the cost-adjusted improvement it delivers over this reference.

## Scope and structure of this book

### Notational conventions used throughout the book

Every chapter respects the following notational conventions. Random variables are capital letters ($X, Y, Z$); specific realizations are lowercase ($x, y, z$). The default indicator is $Y \in \{0, 1\}$ with $Y = 1$ encoding default. The feature vector is $X \in \mathbb{R}^d$ or $X \in \mathcal{X}$ when the feature space includes categorical components. The probability of default conditional on features is $\pi(x) = \Pr(Y = 1 \mid X = x)$. A score is $s: \mathcal{X} \to \mathbb{R}$, usually written as $s(x)$. Log-odds are $\eta(x) = \log \pi(x) - \log(1 - \pi(x))$. Parameters are Greek letters: $\beta$ for linear coefficients, $\theta$ for a generic parameter vector, $\sigma$ for a scale. Loss functions are $\ell(\cdot, \cdot)$ with the first argument the label and the second the prediction.
Expectations are $\mathbb{E}$, variances are $\mathrm{Var}$, and indicators are $\mathbb{1}$. Population quantities have no subscript; sample quantities have a hat, $\hat\beta$, $\hat\pi(x)$.

For data matrices, $X \in \mathbb{R}^{n \times d}$ is the design matrix of $n$ observations in $d$ features, and $y \in \{0, 1\}^n$ is the label vector. The $i$-th row of $X$ is $x_i$, a column vector in $\mathbb{R}^d$. The $j$-th feature is $X_j$. Training, validation, and test subscripts are $tr$, $va$, $te$. When splits are introduced by a cross-validation fold, the fold index is $k$ and the out-of-fold predictions are $\hat\pi^{(-k)}(x)$.

For time, $t$ indexes calendar time for the macro side and origination-relative months for the credit side. Horizon $H$ is a positive integer in months, commonly $H \in \{12, 18, 24, 36\}$. A default within the horizon is $Y_{t,H} = \mathbb{1}\{\text{default occurs in } (t, t + H]\}$. When the context is clear, the horizon subscript is dropped.

For matrices in derivations, vectors are column vectors by default. Transposes are $X^\top$. Inner products are $x^\top y$. Norms are $\|x\|$ (Euclidean unless subscripted). Identity matrices are $I_d$. Zero matrices and vectors are $0$ with context giving the dimension. Probability measures on $\mathcal{X}$ are $P$; their expectations are $\mathbb{E}_P[\cdot]$.

Regulatory abbreviations that the reader will see repeatedly are: ECOA (Equal Credit Opportunity Act), FCRA (Fair Credit Reporting Act), HMDA (Home Mortgage Disclosure Act), CRA (Community Reinvestment Act, not to be confused with credit rating agency), SR 11-7 (Federal Reserve Supervisory Guidance on Model Risk Management), IFRS 9 (International Financial Reporting Standard 9), CECL (Current Expected Credit Losses, the US GAAP analog of IFRS 9), GDPR (General Data Protection Regulation), EU AI Act (Regulation (EU) 2024/1689 on Artificial Intelligence).
Basel abbreviations, in increasing specificity, are Basel II (2006 framework), Basel III (post-crisis revisions, ongoing), BCBS 239 (risk-data-aggregation principles), IRB (internal ratings based), AIRB (advanced IRB), and FIRB (foundation IRB). Statistical abbreviations are AUC (area under the ROC curve), KS (Kolmogorov-Smirnov), PSI (population stability index), WoE (weight of evidence), IV (information value), PDO (points to double the odds), and LGD/EAD (loss given default, exposure at default).

### Software environment

The computational environment is a Python 3.12 virtual environment with the packages listed in `book/requirements.txt` and installed via `uv` or `pip`. The key libraries are numpy, pandas, polars for dataframes; scikit-learn, statsmodels, lifelines, scikit-survival for classical statistics; xgboost, lightgbm, catboost for gradient boosting; torch, transformers for deep learning and language models; shap, lime, dice-ml for explainability; fairlearn for fairness; optbinning and scorecardpy for scorecard-specific tooling; mlflow, fastapi, onnx for deployment; dask and pyspark for scalability. @sec-app-B-env lists exact versions and installation steps.

GPU acceleration is used only in chapters that benefit materially, such as neural networks, NLP, LLMs, and graph neural networks. Every other chapter runs on a modern laptop CPU in under 90 seconds per benchmark.

### What this book is not

This is not a textbook on probability or statistics. Readers should be comfortable with maximum-likelihood estimation, generalized linear models, convex optimization at the level of @friedman2010regularization, and measure-theoretic probability at the introductory level. @sec-app-A-math is a review, not a self-contained treatment.

This is not a software engineering book. Production scoring systems involve feature stores, streaming ingestion, real-time decisioning, A/B testing infrastructure, and change management that are outside this book's scope.
The deployment chapters in Part VIII cover the minimum wrapper (FastAPI, MLflow, ONNX, Docker) but defer to specialized texts on the rest.

This is not a trading book. Credit-default-swap pricing, bond pricing with credit risk, and counterparty credit in derivatives portfolios are different problems with different primary data sources.

### Relationship to adjacent books

Several adjacent texts cover parts of this ground, and a reader should know where to go for the depth that the present book does not provide. Thomas, Edelman, and Crook's Credit Scoring and Its Applications (2017, 2nd edition) is the definitive practitioner reference on scorecards and behavioral scoring; its coverage of machine-learning methods is intentionally limited. Siddiqi's Intelligent Credit Scoring (2017) is the operating guide for Fair Isaac-style scorecards and is the book we point readers to for the production nuances of fine-to-coarse classing and adverse-action design. Baesens, Roesch, and Scheule's Credit Risk Analytics (2016) is the SAS-centric analog. Duffie and Singleton's Credit Risk (2003) is the fixed-income-oriented reference for structural and reduced-form default models. Hastie, Tibshirani, and Friedman's Elements of Statistical Learning (2009) remains the canonical statistical-learning reference and the source for most of the math in our Parts II and III. Murphy's Probabilistic Machine Learning (2022) is a more recent alternative with stronger Bayesian coverage. For regulatory context, the Federal Reserve's Supervisory Review Program manuals, the BCBS working papers, and the EBA's technical standards are the primary sources; each chapter cites the specific documents that apply.

Within the academic research that this book draws on, four sources of recurring material stand out. The Journal of Finance, Journal of Financial Economics, and Review of Financial Studies carry the anchor empirical work on credit markets, fintech, and disparities.
Management Science and Operations Research carry the operations and benchmark work. The Journal of the Royal Statistical Society Series B, the Annals of Statistics, the Journal of the American Statistical Association, and Biometrika carry the methodological work. JMLR, NeurIPS, and ICML carry the newer machine-learning methodology relevant to credit. We cite these venues directly; conference proceedings for KDD are cited when a method (for example, XGBoost) first appeared there.

### Reproducibility commitment

Every chapter renders end-to-end under Quarto from a clean checkout. Every dataset is either public (German, Taiwan, HMDA, LendingClub, Home Credit) or has a documented synthetic fallback (when a Kaggle credential is required). Every random process is seeded. Every numerical output in the text agrees with the code block above it, because the text is generated after the block runs. The repository is on GitHub; issues and pull requests are welcome.

## Vietnam and emerging markets

### Market context

Vietnam is a useful running exemplar for the emerging-market practitioner because every structural feature that breaks an off-the-shelf Western scorecard is present at once. The credit bureau infrastructure is two-tiered. The Credit Information Center (CIC) is the public bureau operated under the State Bank of Vietnam (SBV) and consolidates regulated-lender tradelines from banks, finance companies, and microfinance institutions. The Vietnam Credit Information Joint Stock Company (PCB) is a private bureau, launched in 2007 and majority-owned by a consortium of commercial banks, and complements CIC with a broader set of non-bank and utility data. Adult coverage in the combined bureau system sits around the mid-50% range, well below the 90%-plus coverage that Anglo-American scorecard literature assumes [@worldbank_findex2021; @cic_vietnam2023].

The other side of the population is mobile.
Active mobile subscriptions exceed 140% of the adult population, and smartphone penetration crossed 80% of urban adults by 2023 [@adb2023digital]. The SBV has codified remote onboarding through Circular 16/2020/TT-NHNN, which permits fully electronic know-your-customer for payment accounts subject to liveness, biometric, and database-match controls [@sbv_circular16_2020]. Personal data processing is governed by Decree 13/2023/ND-CP, the first comprehensive data protection regime in Vietnam, which defines sensitive personal data, lawful bases, cross-border transfer impact assessment, and data subject rights in terms that read like a lighter-footprint GDPR [@vn_decree13_2023]. A regulatory sandbox for fintech, including credit scoring and peer-to-peer lending use cases, was formalized through Decree 94/2025/ND-CP [@vn_decree94_2025; @sbv2023vietnam].

### Application considerations

This chapter is introductory, so the application implication is programmatic rather than methodological. Every later chapter inherits four constraints from the Vietnamese environment. First, the effective training sample is smaller than the US or EU equivalent. A mid-sized Vietnamese consumer-finance book carries one to three million active accounts, not tens of millions, and bureau-inquiry depth on each account is shallower. This favors simpler models, tighter regularization, and a higher bar for adopting deep or transformer-scale architectures. Second, macro volatility is first-order. The 2011 banking-sector stress, the 2022 corporate-bond episode, and the recurrent FX pressure on the dong each produced sharp cohort effects that a through-the-cycle model has to accommodate [@imf2024vietnamart4]. Third, informal income is large. Roughly one-third to one-half of urban household income and a larger share of rural income do not pass through a payroll account, so self-reported income must be validated against bank statements and transaction-level signals.
Fourth, seasonal effects from the Tet (Lunar New Year) holiday produce a January-February spike in consumer borrowing and a late-Q1 spike in short-term delinquency that dominate any quarterly-seasonality adjustment fitted on a Western calendar.

Real-estate collateral concentration is a fifth recurring issue. Vietnamese bank balance sheets carry a heavy weight of residential and land-use-right collateral, and the correlation between collateral value and default probability is stronger and more regime-dependent than in the US mortgage market. The later chapters that deal with LGD, stress testing, and IFRS 9 overlays are where this matters most.

### Rationalization

An introductory chapter does not have a method to accept or reject. It has a reading strategy. The reading strategy for the emerging-market practitioner is to treat the book's core estimators (logistic regression with weight-of-evidence, gradient boosting, and survival models) as the default toolkit. These are the methods that tolerate small samples, that document cleanly to a regulator who has never seen a neural network, and that support the reason-code apparatus that a Circular-41 bank or a licensed consumer-finance subsidiary needs on every declined application. The deep sequence and graph models are worth the reading, but not usually worth the production spend on a Vietnamese book below roughly ten million accounts. The fairness and explainability chapters are worth more, not less, because Decree 13/2023 and the SBV's model-risk expectations are moving in the direction of documented and contestable decisions.

### Practical notes

Data in Vietnam starts with two bureau pulls. A CIC pull returns tradelines across regulated lenders, a credit score (CIC operates a domestic scoring product), and inquiry history. A PCB pull returns a broader tradeline set and, for some subscribers, utility and telecom tradelines.
Neither bureau carries the ten-to-fifteen-year tradeline depth of Experian or Equifax, so the observation window on a CIC-based scorecard is shorter and the feature list is correspondingly leaner. Most lenders supplement with internal transaction data (current-account flows for bank-owned finance companies, e-wallet flows for fintech affiliates) and with telecom-derived features sourced through SBV-approved data partners.

Regulatory reporting lines run to the SBV Banking Supervision Agency for licensed banks, to the SBV Department of Credit for finance companies, and to the Ministry of Public Security for Decree 13 data-protection compliance. Basel II capital is framed by Circular 41/2016/TT-NHNN for most domestic banks [@sbv_circular41_2016]; a limited number of systemically important institutions have moved toward Basel III elements under SBV pilot programs, and IFRS-style provisioning is being phased in alongside the domestic Vietnamese Accounting Standards. Consumer-finance lending carries its own overlay under Circular 43/2016/TT-NHNN on consumer lending by finance companies, which sets conduct rules on fee disclosure, collection practices, and maximum cash-lending ratios for finance-company portfolios. Alongside this, Circular 22/2023/TT-NHNN (29 Dec 2023) amends Circular 41/2016 on capital adequacy ratios and updates the Basel II standardized capital calculation for banks [@sbv_circular22_2023]. A scorecard that is lawful under US ECOA but that cannot produce a Vietnamese-language adverse-notice string within the Circular 43 format is not deployable. We return to each of these anchors in a later chapter.

## Takeaways

- Credit scoring is a response to information asymmetry between lenders and borrowers. The theoretical case, from @akerlof1970lemons and @stiglitz1981credit through @diamond1984financial and @holmstrom1979moral, establishes that without some screening technology, competitive credit markets ration quantity and undersupply efficient loans.
- The history of the field is a continuous tightening of three feedback loops: more data (from agency ledgers in 1840 to bureau scores in 1989 to digital footprints in 2018), more statistical sophistication (from Durand in 1941 to Altman in 1968 to the @baesens2003benchmarking and @lessmann2015benchmarking ensemble benchmarks), and more regulatory scaffolding (FCRA 1970, ECOA 1974, Basel II 2006, IFRS 9 2014, CECL 2016, EU AI Act 2024).
- The modern empirical frontier is not about squeezing another AUC point out of XGBoost. It is about alternative data, fairness, explanation, and the interaction between model choice and the allocation of credit across groups. The three papers to read first are @fuster2022predictably, @bartlett2022consumer, and @howell2024lender.
- Logistic regression on clean bureau data gets a lender to AUC 0.74 to 0.83 on the two public benchmarks in this book. Everything else in this book should be measured against the marginal cost-adjusted benefit it delivers over that reference.
- The rest of the book provides working code for every method, the primary references for every claim, the regulatory context for every deployment, and a reproducible pipeline from raw data to benchmarked score.

## Further reading

- @akerlof1970lemons introduced the lemons problem.
- @stiglitz1981credit established the credit-rationing equilibrium.
- @durand1941risk is the first NBER application of statistical discrimination to consumer credit.
- @altman1968zscore is the founding paper of quantitative bankruptcy prediction.
- @hand1997statistical is the mid-1990s state-of-the-art review.
- @baesens2003benchmarking and @lessmann2015benchmarking are the two multi-dataset benchmarks that bracket the ML era.
- @berg2020rise is the canonical digital-footprint paper.
- @fuster2022predictably is the benchmark fairness-under-ML paper in credit.
- @bartlett2022consumer decomposes the fintech-lending discrimination wedge.
- @howell2024lender is the cleanest natural-experiment identification of automation effects on disparities.
- @thomas2000survey and @crook2007recent are practitioner surveys with wide coverage.
- @olegario2006culture and @lauer2017creditworthy are the two indispensable books on the institutional history of credit reporting.
- @basel2006international and @sr117 are the foundational regulatory documents for model risk and capital.
- @liberti2019information ties the hard-information versus soft-information distinction to modern scoring.

================================================================================
# Source: chapters/02-formal-setup.qmd
================================================================================

# The Credit Scoring Problem: Formal Setup

**Scope: both retail and corporate.** PD, LGD, EAD, and M definitions under the Basel IRB framework. The identities and decomposition apply identically to consumer and firm-level portfolios.

## Overview {.unnumbered}

Credit scoring is a decision problem wearing the clothes of a classification problem. A lender does not really want to know whether a borrower will default. A lender wants to know whether to approve, at what price, with what limit, and how much capital to set aside. The probability is an input. The decision is the output. Everything in this book flows from that distinction.

We define what counts as a default, what counts as an indeterminate outcome, and what the three canonical scoring problems are: application scoring (@sec-ch02-app-scoring), behavioral scoring (@sec-ch02-beh-scoring), and collection scoring (@sec-ch02-coll-scoring). We write down the Basel II and Basel III definitions of PD (Probability of Default), LGD (Loss Given Default), and EAD (Exposure at Default), derive the expected loss identity, and derive the regulatory capital formula under the Asymptotic Single Risk Factor (ASRF) model of @gordy2003risk and @vasicek2002distribution.
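
As a preview of where the chapter is heading, the expected-loss identity and the ASRF capital charge can be sketched with the standard library alone. This is a hedged illustration, not the chapter's derivation: the function names are invented here, the capital function is the widely quoted Vasicek form per unit of exposure without the maturity adjustment that applies to corporate exposures, and the asset correlation `rho` and the 99.9 percent confidence level are passed in as assumptions rather than taken from any specific supervisory table.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}

def expected_loss(prob_default, lgd, ead):
    """Expected loss identity: EL = PD x LGD x EAD."""
    return prob_default * lgd * ead

def asrf_capital(prob_default, lgd, rho, confidence=0.999):
    """Unexpected-loss capital per unit of EAD under the ASRF model.

    Assumes no maturity adjustment (illustrative retail-style form).
    """
    # PD conditional on the systematic factor sitting at its stressed quantile.
    stressed = (
        _STD_NORMAL.inv_cdf(prob_default)
        + rho ** 0.5 * _STD_NORMAL.inv_cdf(confidence)
    ) / (1.0 - rho) ** 0.5
    conditional_pd = _STD_NORMAL.cdf(stressed)
    # Capital covers the stressed loss in excess of the expected loss.
    return lgd * (conditional_pd - prob_default)

# Illustrative inputs (assumed, not from the text): PD 2%, LGD 45%, rho 0.15.
el = expected_loss(0.02, 0.45, 1_000.0)
k_base = asrf_capital(0.02, 0.45, rho=0.15)
k_shift = asrf_capital(0.03, 0.45, rho=0.15)  # PD one point higher
```

Comparing `k_shift` with `k_base` is a quick way to see how sensitive the capital charge is to a shift in PD at a given correlation, which is the kind of calculation the rest of the chapter formalizes.
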
A word for the emerging-market reader. The Basel, IFRS 9, and Vasicek machinery below is jurisdiction-neutral in the math but not in the inputs. In Vietnam and peer markets, PDs have to be estimated on thinner tradeline files from the Credit Information Center and PCB, on cohorts whose macro backdrop includes exchange-rate shocks and episodic property-sector stress, and on obligors whose income is partly informal and whose delinquency cycle has a pronounced Tet seasonality. Every later step in the pipeline, from bad definition to LGD floor to the supervisory correlation $\rho$, inherits that input structure. The formal setup in this chapter is the place where a practitioner writing under SBV Circular 41/2016 has to decide which parameters are locally estimable and which have to be borrowed from supervisor-supplied or regional benchmarks. A word on sequencing. If the math here looks heavy, it is. The reason is simple. Every later chapter in this book, whether logistic regression, survival analysis, gradient boosting, graph neural networks, or large language models, ultimately outputs a probability that gets fed through the same Basel pipeline. The numerics of that pipeline drive every design choice in the model. You cannot reason about a scorecard without knowing what a 1% shift in PD does to regulatory capital. That calculation lives here. ### Notation {.unnumbered} Let $X \in \mathcal{X} \subseteq \mathbb{R}^d$ denote the feature vector of a borrower. Let $Y \in \{0, 1\}$ denote the default indicator, with $Y=1$ for bad and $Y=0$ for good. Let $D \in \{0, 1\}$ denote the lender's accept-reject decision, with $D=1$ meaning approve. Let $\eta(x) = \Pr(Y=1 \mid X=x)$ denote the true posterior. A scoring model is any measurable function $s : \mathcal{X} \to \mathbb{R}$. A probability model is a scoring model whose output can be calibrated to $[0, 1]$. We write $\hat p(x)$ for the model's probability of default estimate and $t$ for a cutoff. 
Greek letters $\Phi$ and $\varphi$ are the standard normal CDF and PDF. Basel capital symbols are $\mathrm{PD}, \mathrm{LGD}, \mathrm{EAD}, K, \mathrm{RWA}$ and are defined in later sections. Class prior is $\pi_1 = \Pr(Y=1)$. ## Borrower types: goods, bads, indeterminates A dataset is a list of loans. A loan has a maturity, a sequence of payments, and eventually a final outcome. Labeling that outcome as good, bad, or indeterminate is not a statistical problem. It is an accounting and supervisory problem. Getting this labeling wrong is a leading source of bad models, even before a single feature is chosen. ### The canonical three-way split A goods-bads-indeterminates partition was formalized in the early scorecard literature and rehearsed by @thomas2000survey. A bad is a borrower whose outcome is bad enough to count as a default. A good borrower is one who completes the observation window without ever crossing that threshold. An indeterminate is a borrower whose outcome is ambiguous: too far along to call a good, not far enough to call a bad. Indeterminates are typically dropped from the training sample for application scoring, with the caveat that dropping them biases the estimator of $\eta(x)$. The operational definitions are set by the regulator, by accounting standards, and by internal policy. The three main anchors are the Basel default definition, the IFRS 9 and CECL staging framework, and the firm's collections policy. ### The Basel default definition Paragraph 452 of @basel2006international and its successor text in @basel2017finalising define a default as having occurred when either of two conditions is met: 1. The bank considers that the obligor is unlikely to pay its credit obligations in full, without recourse by the bank to actions such as realizing security. 2. The obligor is past due more than 90 days on any material credit obligation to the banking group. The second condition is what most modelers mean by 90+ days past due (90+ dpd). 
The first condition is the unlikely-to-pay (UTP) trigger. UTP is a judgment call and includes events such as distressed restructuring, specific provisions being raised, and the sale of the obligation at a material credit-related economic loss. For retail exposures, the 90+ dpd threshold can be extended to 180 days at national supervisory discretion for some product classes. The EBA guidelines tightened this (see @eba2017gl), and the modern European practice is 90 dpd with a materiality threshold. The materiality threshold, under EBA Regulatory Technical Standards, has absolute (100 EUR retail, 500 EUR non-retail) and relative (1% of the on-balance-sheet exposure) components. There is a subtle point here that matters for modeling. Default is observed at the facility level, but some jurisdictions require default to be recognized at the obligor level. The EBA guideline [@eba2017gl] applies an obligor-level default trigger for non-retail exposures and allows facility-level default only for certain retail exposures. A borrower with one defaulted credit card does not automatically default on their mortgage under facility-level treatment, but does under obligor-level. The choice affects both labels and feature construction. ### Observation window, performance window, sampling window Every application scoring dataset is defined by three time windows: 1. The observation window is the time interval during which the feature vector $X$ is measured. For application scoring, this is a snapshot at origination. 2. The performance window is the time interval during which the outcome $Y$ is observed. A common choice is 12 months. 3. The sampling window is the calendar interval from which the accounts are drawn. A typical setup for a monthly originated consumer loan portfolio is: sampling window of 12 to 24 months ending 18 months before today, observation window of one application date per account, performance window of 12 months. 
The 18-month gap ensures that every account in the training sample has had a chance to reach the 12-month performance horizon. If the performance window is shorter than the emergence period of defaults, the bad rate in the training sample is downward-biased. If it is too long, the sample excludes recent cohorts and the model lags the population. A 12-month horizon is standard for unsecured consumer credit. For mortgages, the horizon is often 24 to 36 months because defaults emerge more slowly.

### Defining the bad more precisely

In practice, firms use a bad definition that is stricter than Basel. A common retail policy is: 90+ dpd in the 12-month performance window, or a written-off status, or a charge-off flag. The written-off and charge-off flags are internal accounting triggers that typically fire later than 90+ dpd, so the 90+ dpd condition dominates. A few alternatives show up:

- Ever-90 in 12 months: the borrower reached 90 dpd at any point in the 12-month window. This is the usual choice.
- Worst-status: the borrower's maximum dpd bucket over the window. Both 90+ dpd and a 60+ dpd ever-delinquent flag can be modeled.
- Roll-rate based: transition matrix from the delinquency status at month $m$ to the status at month $m+k$. Used for behavioral scoring.

The choice of bad definition is not just a label transformation. A looser definition like ever-60 produces a higher bad rate, a different discriminative signal, and a different calibration target. Models trained on ever-60 labels cannot be used directly as a probability of ever-90 without recalibration.

### Indeterminates

An indeterminate is a loan whose outcome is ambiguous. Typical examples:

- A loan that reached 30 to 59 dpd but never went further. Not quite a default, not a pristine repayment.
- A loan that was voluntarily closed during the performance window without a final status.
- A loan that was sold to a third party and whose subsequent performance is unknown.
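The ever-90 bad definition and the 30+ dpd indeterminate band described above can be sketched as a small labeling function. This is a minimal illustration, not a production rule: the function name `label_account`, the month-end dpd input format, and the choice to treat everything between 30 and 89 dpd as indeterminate are assumptions made for the example.

```python
from enum import Enum


class Outcome(str, Enum):
    GOOD = "good"
    BAD = "bad"
    INDETERMINATE = "indeterminate"


def label_account(dpd_by_month, written_off=False, bad_dpd=90, indet_dpd=30):
    """Ever-delinquent label over one performance window.

    dpd_by_month: days past due observed at each month-end of the window.
    bad_dpd / indet_dpd: policy cut-offs (illustrative defaults, not a standard).
    """
    worst = max(dpd_by_month, default=0)  # worst status reached in the window
    if written_off or worst >= bad_dpd:
        return Outcome.BAD
    if worst >= indet_dpd:
        return Outcome.INDETERMINATE
    return Outcome.GOOD


# Three hypothetical accounts observed at four month-ends.
accounts = {
    "A": [0, 0, 0, 0],       # never delinquent
    "B": [0, 30, 45, 0],     # reached 30-59 dpd, then cured
    "C": [0, 30, 60, 95],    # rolled through to 90+ dpd
}
labels = {name: label_account(dpd) for name, dpd in accounts.items()}
print(labels)
```

Keeping the cut-offs as parameters makes it cheap to regenerate labels under an ever-60 definition and compare the resulting bad rates.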
Three handling strategies are standard:

1. Drop indeterminates from training. Simplest, loses information, biases the estimator of $\eta(x)$.
2. Assign a fractional label based on the empirical bad rate among indeterminates in a matched population.
3. Survival modeling where indeterminates become censored observations.

The best practice for scorecards is usually strategy 1 with a sensitivity check on strategy 2. The exceptions are portfolios where indeterminates are a large fraction of the sample, in which case strategy 3 is preferred.

### Class prior and population mixture

The prior $\pi_1 = \Pr(Y=1)$ is product-dependent (@tbl-formal-setup-class-prior). Typical ranges:

| Product                   | Typical 12-month bad rate |
|---------------------------|--------------------------:|
| Prime mortgage            | 0.3% to 1.5%              |
| Auto loan (prime)         | 1% to 3%                  |
| Credit card (mainstream)  | 2% to 6%                  |
| Personal loan (unsecured) | 3% to 10%                 |
| Subprime credit           | 10% to 30%                |
| SME lending               | 2% to 10%                 |

: Product-based class prior

The Taiwan dataset we use throughout the book has a 22% bad rate, which reflects a credit card book in a stressed cohort (@yeh2009comparisons). The German dataset has a 30% bad rate, which is a sampling artifact: bads were deliberately oversampled. Real German retail books at the time sat around 3% to 5%.

The class prior matters because it appears in every decision-theoretic calculation and because the posterior $\eta(x)$ is prior-dependent. If we retrain on a resampled dataset with different prior $\pi_1'$, the score is still useful for ranking but the probability is wrong. We return to this at length in a later chapter.

## What is a PD? Five conditioning choices

A PD on a screen looks like a number. It is not. It is a conditional probability whose conditioning set has five moving parts. Two PDs that disagree on any one of the five are not comparable as numbers, only as ranks.
This section names the five parts and gives the operating rules for making PDs comparable when the business forces a cross-vendor, cross-portfolio, or cross-vintage comparison. The five parts also explain a recurring surprise. A vendor quotes a 4% PD on a borrower; an internal model quotes 1.5% on the same borrower; both pass calibration on their own books. Neither model is wrong. The two numbers are estimates of different quantities under different conditioning. The reconciliation requires aligning the conditioning set, not retraining the models. ### The construct expanded Write the PD as the full conditional probability it really is: $$ \mathrm{PD}(x) = \Pr(Y \in \mathcal{B} \text{ within horizon } h \mid X = x, \mathcal{P}, \mathcal{C}, \mathcal{S}). $$ The five conditioners: 1. $\mathcal{B}$, the bad event set. Which outcomes count as a default. 2. $h$, the performance horizon. The window over which $Y$ is observed. 3. $\mathcal{P}$, the reference population. The portfolio whose mixture defines $\eta_{\mathcal{P}}(x) = \Pr(Y \in \mathcal{B} \mid X = x)$. 4. $\mathcal{C}$, the conditioning information used. Whether macro state is conditioned on (PIT) or integrated out (TTC). 5. $\mathcal{S}$, the sampling frame. The selection from the through-the-door (TTD) population that produced the training data. In plain English: who counts as defaulted, how long we wait, who is in the pool, what macro state we assume, and whether the data we used reflects the full applicant pool or only the accepted slice. Change any one and the number changes, often by a factor of two or three on the same borrower. A PD quote without the five-tuple is incomplete the same way a bond yield without a maturity is incomplete. The construct here is the thing the model is estimating; @sec-ch02-pd-lgd-ead-and-regulatory-capital starts from a fully specified construct and works out the capital arithmetic. Get the construct wrong and the arithmetic is exact but meaningless. 
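One way to make the five-tuple operational is to carry it as metadata next to every PD quote. The sketch below is a hypothetical data structure, not something the book or any regulator prescribes: the class name `PDSpec`, its field names, and the example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PDSpec:
    bad_event: str        # B, e.g. "ever-90" or "ever-60"
    horizon_months: int   # h
    population: str       # P, the reference portfolio
    cycle: str            # C, "PIT" or "TTC"
    sampling_frame: str   # S, e.g. "accepted-only" or "TTD"


def mismatches(a: PDSpec, b: PDSpec) -> list:
    """Names of the conditioners on which two PD quotes disagree."""
    return [f.name for f in fields(PDSpec) if getattr(a, f.name) != getattr(b, f.name)]


# Hypothetical vendor and internal quotes on the same borrower.
vendor = PDSpec("ever-60", 12, "national bureau file", "PIT", "accepted-only")
internal = PDSpec("ever-90", 12, "prime card book", "TTC", "accepted-only")
print(mismatches(vendor, internal))  # ['bad_event', 'population', 'cycle']
```

An empty list is the precondition for comparing two PDs as numbers; anything else means the comparison is only valid as a ranking or after an explicit translation step.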
### Choice 1: the bad event $\mathcal{B}$

The bad event has already been treated at length in @sec-ch02-setup. We restate the point here because it is the most common source of cross-vendor non-comparability. The Basel anchor is 90+ dpd or UTP, but real PD numbers in the market correspond to half a dozen variants: ever-90 within 12 months, ever-60, worst-status, charge-off, distressed-restructuring flag, bankruptcy. Bad rates under these variants can differ by a factor of two to four on the same book.

A useful identity. If $\mathcal{B}_A \subseteq \mathcal{B}_B$ (the looser definition is a superset of the stricter one), then

$$
\Pr(Y \in \mathcal{B}_A \mid X = x) \le \Pr(Y \in \mathcal{B}_B \mid X = x) \quad \text{for every } x,
$$

so a loose-bad PD is always at least as large as a strict-bad PD on the same exposure. In plain English: counting more events as "default" can only push the default probability up. The ratio between the two is not constant in $x$, which is why a simple multiplicative correction across all borrowers fails.

Operating rule. Before comparing two PD numbers, write down each model's $\mathcal{B}$. If they differ, do not compare the numbers directly. Fit a mapping $\mathcal{B}_A \to \mathcal{B}_B$ on a held-out sample using a roll-rate matrix [@thomas2017credit], then convert one to the other before comparison.

### Choice 2: the performance horizon $h$

The horizon turns a PD from a probability into a function of time. Hazard intensity matters: a borrower with a 4% 12-month PD does not have a 16% four-year PD, because survival compounds and the hazard typically decays or peaks for seasoned exposures. Three horizons dominate in practice:

- 12-month PD. Basel IRB anchor and the standard for application scoring on unsecured retail.
- Lifetime PD. IFRS 9 stage-2/3 and CECL anchor. Computed by integrating a hazard over the remaining contractual term.
- Term PD (point-event). Probability of default before the next behavioral score refresh, often one to three months.
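To see why the 4% twelve-month figure above does not scale linearly with horizon, the sketch below compounds conditional annual default probabilities (discrete hazards) into a cumulative PD. The function name `cumulative_pd` and the decaying hazard path are illustrative assumptions, not estimates from any portfolio.

```python
from math import prod


def cumulative_pd(annual_hazards):
    """Cumulative PD from conditional annual default probabilities.

    annual_hazards[k] is Pr(default in year k+1 | survived the first k years).
    """
    survival = prod(1.0 - h for h in annual_hazards)
    return 1.0 - survival


flat = cumulative_pd([0.04] * 4)                      # constant 4% hazard
decaying = cumulative_pd([0.04, 0.03, 0.02, 0.015])   # hypothetical seasoning path
linear = 4 * 0.04                                     # the naive 16% figure

print(f"linear={linear:.4f}  flat hazard={flat:.4f}  decaying hazard={decaying:.4f}")
```

Both compounded figures sit below the linear one, and the gap between the flat and decaying paths shows how much the answer depends on the hazard shape rather than on the 12-month PD alone.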
The naive conversion $h$-year PD $\approx 1 - (1 - p_{12})^h$ assumes a constant hazard and independent yearly trials. It is correct only as a first-order approximation. The right derivation uses a survival or Markov framework (see @sec-ch35-ifrs9 and the survival chapter referenced there): $$ \mathrm{PD}(x, h) = 1 - \exp\!\left(-\int_0^h \lambda(u \mid x, \mathcal{F}_0) \, du\right), $$ with $\lambda$ the hazard intensity at age $u$ conditional on covariates at origination $\mathcal{F}_0$. In plain English: time stretches the probability the same way it stretches a bond's default risk. A 1% one-year PD is not a 1% lifetime PD on a 30-year mortgage; it is 20% to 30%, depending on hazard shape and prepayment. Operating rule. Never compare a 12-month PD to a lifetime PD. Translate one to the other via a hazard model fit on the same portfolio, then compare. A reported PD without a horizon is unusable for provisioning or pricing. ### Choice 3: the reference population $\mathcal{P}$ The posterior $\eta(x) = \Pr(Y = 1 \mid X = x)$ is a function of the joint distribution of $(X, Y)$. The joint distribution is determined by the population. Two models trained on a prime card book and a subprime auto book learn different $\eta$ functions, and a borrower with identical feature vector $x$ gets different PDs from the two. This is not a calibration bug. It is the correct posterior under each population. The same $x$ is genuinely riskier in a subprime book because the unobserved factors that landed the borrower in the subprime channel are themselves correlated with default. By Bayes' rule: $$ \eta_{\mathcal{P}}(x) = \frac{\pi_{\mathcal{P}} f_{\mathcal{P}}(x \mid Y = 1)}{\pi_{\mathcal{P}} f_{\mathcal{P}}(x \mid Y = 1) + (1 - \pi_{\mathcal{P}}) f_{\mathcal{P}}(x \mid Y = 0)}, $$ so both the class prior $\pi_{\mathcal{P}}$ and the class-conditional densities $f_{\mathcal{P}}(\cdot \mid Y)$ shift with $\mathcal{P}$. 
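Bayes' rule above can be checked numerically. The sketch below holds two hypothetical Gaussian class-conditional densities fixed and changes only the class prior, which isolates the prior's contribution to $\eta_{\mathcal{P}}(x)$. The densities, the two priors, and the helper names are assumptions made for the illustration.

```python
from math import log
from statistics import NormalDist

# Hypothetical class-conditional densities of a single feature x.
f_bad = NormalDist(mu=1.0, sigma=1.0)    # f(x | Y = 1)
f_good = NormalDist(mu=0.0, sigma=1.0)   # f(x | Y = 0)


def posterior(x, prior):
    """Bayes' rule for eta(x) given a class prior and fixed class-conditionals."""
    num = prior * f_bad.pdf(x)
    return num / (num + (1.0 - prior) * f_good.pdf(x))


def logit(p):
    return log(p / (1.0 - p))


prior_a, prior_b = 0.03, 0.12   # e.g. a prime book versus a riskier book
for x in (-1.0, 0.0, 1.5):
    gap = logit(posterior(x, prior_b)) - logit(posterior(x, prior_a))
    print(f"x={x:+.1f}  eta_a={posterior(x, prior_a):.4f}  "
          f"eta_b={posterior(x, prior_b):.4f}  log-odds gap={gap:.4f}")
```

With the densities held fixed, the log-odds gap is the same at every $x$ and equals the difference in prior log-odds, while the probabilities themselves differ by a ratio that varies with $x$. If the densities also moved between populations, no such constant gap would exist.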
If the class-conditional densities are roughly invariant (a strong assumption usually called prior probability shift or label shift, not to be confused with covariate shift; see @sec-ch04-drift), then the posterior on a new population is reachable by a prior-correction formula. @king2001logistic give the working version for logistic regression: adjust only the intercept by $\log(\pi_{\mathcal{P}}' / (1 - \pi_{\mathcal{P}}')) - \log(\pi_{\mathcal{P}} / (1 - \pi_{\mathcal{P}}))$, where $\pi_{\mathcal{P}}$ is the training prior and $\pi_{\mathcal{P}}'$ the target prior.

In plain English: if the *shape* of the risk function in feature space is portable but the average default rate differs, you can rescale the intercept and get usable PDs. If the *shape* is also different, you have to retrain or recalibrate, not just rescale.

Operating rule. A vendor's PD on a portfolio they did not train on is suspect at the absolute-probability level even when discrimination is excellent. Always recalibrate on a holdout drawn from the target population (@sec-ch04-brier).

### Choice 4: cycle treatment $\mathcal{C}$ (PIT vs TTC)

The same borrower with the same feature vector has a higher one-year PD in a recession than in a boom. The point-in-time (PIT) PD captures this; the through-the-cycle (TTC) PD averages over it. Both are valid quantities; they answer different questions.

Formally, let $M_t$ denote a vector of macro factors at time $t$. Then:

$$
\mathrm{PD}^{\mathrm{PIT}}(x, t) = \Pr(Y = 1 \mid X = x, M_t),
$$

$$
\mathrm{PD}^{\mathrm{TTC}}(x) = \mathbb{E}_{M}\!\left[\Pr(Y = 1 \mid X = x, M)\right] = \int \mathrm{PD}^{\mathrm{PIT}}(x, m) \, dF(m).
$$

The TTC PD is the expected PIT PD over the long-run macro distribution $F(m)$.

In plain English: PIT is "what we think will happen this year"; TTC is "what happens on average across the cycle." A pure PIT estimate moves up in recessions and down in booms; a pure TTC estimate sits still and lets the macro overlay do the work elsewhere.

Basel IRB targets TTC for capital-stability reasons. IFRS 9 and CECL target PIT (or near-PIT) for provisioning.
A bank therefore runs two PD numbers on the same exposure, and a vendor that ships only one of them is incompletely positioned for either use case. The intermediate construct is a hybrid PD with explicit macro overlay [@carlehed2012framework]. Common practice is to estimate $\mathrm{PD}^{\mathrm{TTC}}(x)$ as the model baseline and apply a scalar macro adjustment so that $\mathrm{PD}^{\mathrm{PIT}}(x, t) = g(\mathrm{PD}^{\mathrm{TTC}}(x), M_t)$. Rating-agency practice has been examined empirically in @loffler2013rating, who finds that even agency ratings are not pure TTC. Migration matrices conditional on the cycle are derived in @bangia2002ratings. Stress-testing chapters (@sec-ch35-ifrs9) develop this further. Operating rule. Tag every PD with its cycle stance. A 3% PD that is PIT and a 3% PD that is TTC are not the same risk claim, even if both pass calibration on their respective targets. ### Choice 5: sampling frame $\mathcal{S}$ The PD a model learns is a PD conditional on the data the model saw. If the data is accepted-only, the learned $\eta(x)$ is $\Pr(Y = 1 \mid X = x, D = 1)$, not the target $\Pr(Y = 1 \mid X = x)$ on the TTD applicant population. The two are equal only when $D$ is independent of $Y$ given $X$, which is precisely the assumption reject inference tries to relax (@sec-ch10-reject). The selection bias propagates into every comparison: - Two banks with different approval rates produce different selected-sample distributions even if their TTD populations are identical. Their internal PDs are conditional on different selection events. - A bureau score trained on observed-default tradelines is implicitly conditioned on having survived previous credit decisions. Apply it to a thin-file applicant who would have been rejected at past stages and the score's PD interpretation breaks. - Low-default portfolios (sovereigns, prime corporates) suffer the dual problem of selection plus tiny event counts. 
The standard PD estimate is biased and almost certainly understates risk; @plutotasche2005 give a confidence-bound estimator that is the industry workhorse. Operating rule. State the sampling frame. When PDs from two sources need to be compared, the comparison is valid only on the intersection of their training frames or after a selection correction (Heckman or its generalizations, in @sec-ch10-heckman-selection-correction). ### Score versus PD: ordinal versus cardinal A clean separation that saves a great deal of confusion. - A **score** is a real-valued ranking function $s : \mathcal{X} \to \mathbb{R}$. Higher means safer (or riskier, depending on sign). Designed to be rank-comparable. Says: borrower A is safer than borrower B. Does not claim an absolute probability. - A **PD** is a calibrated probability $\hat p : \mathcal{X} \to [0, 1]$. Cardinal. Claims $\mathbb{E}[\mathbf{1}\{Y \in \mathcal{B}\} \mid X = x] = \hat p(x)$. A strictly monotone transform of a score is the same score for ranking purposes. AUC, KS, Gini, and the H-measure are all invariant to any strictly monotone transform of $s$ (@sec-ch04-auc). Brier, log-loss, calibration intercept and slope, and the expected calibration error are not invariant: they react to the absolute level of $\hat p$, not just the ordering. This is why two vendors can have identical AUC on the same portfolio and still produce wildly different PDs. AUC is a ranking statistic. The PDs differ because the calibration mapping from rank to probability is fit under different $(\mathcal{B}, h, \mathcal{P}, \mathcal{C}, \mathcal{S})$ tuples. In plain English: the score answers "who is riskier"; the PD answers "how risky in absolute terms." Two scoring shops can agree on the first answer perfectly and disagree on the second by factor-of-three magnitudes. ### What is comparable, and what is not The five conditioners give a precise decision rule for whether a comparison is meaningful (@tbl-ch02-pd-comparability). 
| Comparison                          | Conditioner alignment needed                                      | What fails otherwise                                                  |
|-------------------------------------|-------------------------------------------------------------------|-----------------------------------------------------------------------|
| Two borrowers, one model            | None                                                              | Comparable by construction                                            |
| Two models, same portfolio          | Same $\mathcal{B}$, $h$, $\mathcal{S}$                            | Different label definitions inflate one model's AUC                   |
| Two vendors, same borrower          | All five aligned, or recalibrated to a common scale               | Vendor A's 700 corresponds to a different PD than vendor B's 700      |
| Same borrower, two dates            | TTC stance, or explicit PIT-with-macro decomposition              | Cyclical PD movement gets read as a borrower-level shift              |
| Two products (card, auto, mortgage) | Same $\mathcal{B}$, $h$, common scale                             | "PD" gets contaminated by exposure and recovery, which live elsewhere |
| Two vintages, same product          | Same $\mathcal{B}$, $h$, $\mathcal{S}$, plus seasoning adjustment | Hazard-shape differences look like population changes                 |

: A decision rule for PD comparability

The pattern. Ranking comparisons are robust to most conditioner mismatches because AUC is monotone-invariant. Probability comparisons require all five to align or an explicit translation step.

### The industry fix: master rating scale and recalibration

The Basel-conformant resolution is a **master rating scale**. The bank defines a fixed ladder of grades (say 18 buckets, grade 1 the safest, grade 18 the defaulted), each with a target PD range on a fixed triple $(\mathcal{B}, h, \mathcal{C}) =$ (Basel 90+ dpd or UTP, 12 months, TTC). Every model on every portfolio is recalibrated so that its raw output PD is mapped, by isotonic regression or Platt scaling on a reference holdout, to a grade on the master scale. Low-default grades use the @plutotasche2005 confidence-bound estimator to avoid the zero-event trap.
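The zero-event trap mentioned above can be made concrete with a simplified confidence-bound calculation. The sketch below is a single-grade, independent-defaults version only: it finds the largest PD at which observing no more than the recorded number of defaults is still plausible at a chosen confidence level. The full most-prudent-estimation approach in @plutotasche2005 also pools across grades and can allow for default correlation, which this sketch does not attempt. The function names, the 90% confidence level, and the 500-obligor grade are assumptions for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """Pr(Binomial(n, p) <= k)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def pd_upper_bound(n_obligors, n_defaults, confidence=0.90):
    """Largest PD still consistent with seeing at most n_defaults, by bisection."""
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(n_defaults, n_obligors, mid) > target:
            lo = mid   # outcome still too plausible at this PD: bound lies higher
        else:
            hi = mid
    return hi


zero_default = pd_upper_bound(500, 0)            # grade with no observed defaults
closed_form = 1.0 - (1.0 - 0.90) ** (1 / 500)    # analytic answer for zero defaults
two_defaults = pd_upper_bound(500, 2)

print(f"0 defaults: {zero_default:.4%} (closed form {closed_form:.4%})")
print(f"2 defaults: {two_defaults:.4%}")
```

The point of the bound is that a grade with zero observed defaults still receives a strictly positive PD, and the bound tightens as the obligor count grows.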
The downstream effect:

- Two vendors that map to the same grade are by definition expressing the same TTC PD claim. The grade is the common currency.
- Across products, the comparison is grade-to-grade. PD differences across product lines are dampened by the calibration step.
- Across vintages, the score-to-grade mapping is re-estimated at each refresh. Drift in that mapping is the diagnostic; the grade itself is intended to be stable.

Calibration mechanics belong in @sec-ch04-brier and @sec-ch16-score-comparability; the master-scale construct belongs in this chapter because it is the construct-level resolution to the five-conditioner problem. For vendor onboarding, the master scale is the operating layer through which a candidate model is judged. The performance back-test in @sec-ch34b-perf works at the grade level for exactly this reason.

### A numerical illustration

We make the non-comparability concrete with a simulation. Two outcome definitions on the same latent risk produce two PD models with almost identical AUC but per-borrower PDs that disagree by a factor of two or more.

The two models rank borrowers almost identically. AUC on each model's own label sits around 0.85, and the strict-trained model's score also ranks the loose-defined label well. At the individual borrower level, the PDs differ by a factor of three at the median, with much larger ratios in the tails. That is the gap a master rating scale closes: by mapping each model's score to a fixed grade ladder on a common $\mathcal{B}$, the per-borrower PD becomes the grade's target PD, and the cross-vendor comparison is well-defined again.

The operational takeaway. If you are asked "is vendor A's PD higher than vendor B's PD on this borrower?", the answer is undefined until each vendor's PD is converted to a common scale. If you are asked "does vendor A rank this borrower higher than vendor B?", the answer is well-defined and the standard discrimination tools handle it (@sec-ch04-auc).
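The kind of simulation described in the numerical illustration can be sketched with the standard library alone. This is not the book's own simulation, and it is not tuned to reproduce the figures quoted above: instead of fitting two models, it uses the true conditional PD under each bad definition, which is what a well-calibrated model on each label would converge to. The latent-severity setup, the two thresholds, the seed, and the sample size are all assumptions.

```python
import random
from statistics import NormalDist, median

N = NormalDist()
rng = random.Random(0)

C_LOOSE, C_STRICT = 1.5, 2.0   # assumed severity thresholds (loose < strict)
n = 20_000

x = [rng.gauss(0.0, 1.0) for _ in range(n)]     # observed risk score
z = [xi + rng.gauss(0.0, 1.0) for xi in x]      # latent delinquency severity
y_loose = [zi > C_LOOSE for zi in z]            # e.g. an ever-60 style label
y_strict = [zi > C_STRICT for zi in z]          # e.g. an ever-90 style label

# True PD under each definition: Pr(x + eps > c | x) = Phi(x - c).
pd_loose = [N.cdf(xi - C_LOOSE) for xi in x]
pd_strict = [N.cdf(xi - C_STRICT) for xi in x]


def auc(scores, labels):
    """Mann-Whitney AUC: chance that a random bad outranks a random good."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    rank_sum_bad = sum(r for r, i in enumerate(order, start=1) if labels[i])
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    return (rank_sum_bad - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)


auc_loose = auc(x, y_loose)
auc_strict = auc(x, y_strict)
median_ratio = median(pl / ps for pl, ps in zip(pd_loose, pd_strict))
print(f"AUC loose={auc_loose:.3f}  strict={auc_strict:.3f}  "
      f"median PD ratio={median_ratio:.2f}")
```

Both PDs are strictly increasing in the same score, so the ranking is identical by construction, yet the ratio of the two PDs varies across borrowers. That is the pattern the section describes: agreement on order, disagreement on level.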
## PD, LGD, EAD, and regulatory capital

The three building blocks of Basel credit risk capital are the probability of default (PD), the loss given default (LGD), and the exposure at default (EAD). Each is a separate estimation problem with its own target, horizon, and regulatory treatment. The expected loss on a facility is their product, and the unexpected loss is what regulatory capital is designed to absorb.

### Probability of default

The PD is the probability, over a one-year horizon, that the obligor will default:

$$
\mathrm{PD}(x) = \Pr(Y = 1 \mid X = x, \text{horizon} = 1\text{yr}).
$$

Two operational flavors exist. The point-in-time (PIT) PD is the best estimate of the one-year default probability given everything observable today, including the current state of the economy. The through-the-cycle (TTC) PD is a long-run average that smooths over macroeconomic fluctuations. Basel IRB PDs are intended to be closer to TTC for capital-stability reasons. IFRS 9 and CECL require PIT-style estimates for expected credit loss provisioning.

For retail exposures, Basel II requires PD estimates to be at least 0.03% (the three-basis-point floor). This prevents the capital calculation from collapsing for very low-risk obligors. The Basel III finalization [@basel2017finalising] raised the floor to 0.05% for corporate and most retail exposures, with 0.10% for revolving QRRE accounts.

### Loss given default

The LGD is the fraction of the exposure that is lost in the event of default, net of recoveries and workout costs:

$$
\mathrm{LGD} = 1 - \mathrm{RR}, \quad \mathrm{RR} = \frac{\text{recoveries} - \text{workout costs}}{\text{EAD at default}}.
$$

The LGD is bounded in $[0, 1]$ in principle. In practice, LGDs can exceed 1 for exposures with expensive workouts or can be negative for exposures that are over-collateralized. Basel LGDs are floored at a regulatory minimum (for example, 10% for residential mortgages under Basel II, lowered to 5% in the Basel III finalization) to cap how low a modeled LGD can go.
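The recovery-rate identity above is plain arithmetic, and a short sketch shows how realised LGDs land outside $[0, 1]$. The cash amounts are invented for illustration, the helper name `realised_lgd` is an assumption, and recoveries are left undiscounted for simplicity, which a production workout LGD would not do.

```python
def realised_lgd(ead, recoveries, workout_costs):
    """LGD = 1 - (recoveries - workout costs) / EAD, undiscounted."""
    recovery_rate = (recoveries - workout_costs) / ead
    return 1.0 - recovery_rate


cases = {
    "unsecured card": realised_lgd(10_000, 1_500, 500),         # about 0.90
    "costly workout": realised_lgd(10_000, 200, 900),           # about 1.07
    "over-collateralised": realised_lgd(10_000, 11_000, 400),   # about -0.06
}
for name, lgd in cases.items():
    print(f"{name:>20}: LGD = {lgd:.2f}")
```

The second case exceeds 1 because workout costs outrun recoveries; the third is negative because net recoveries exceed the exposure at default.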
LGD estimation has its own literature [@bastos2010forecasting, @calabrese2014fractional, @calabrese2014downturn]. A recurring issue is the bimodality of recovery rates: either a collateralized facility recovers most of the exposure, or an unsecured one recovers almost nothing. The resulting U-shaped LGD distribution resists standard regression and motivates fractional-response models. A critical Basel distinction is between a regular LGD and a downturn LGD. The regular LGD is the empirical average over the portfolio history. The downturn LGD is the worst-case LGD under a stressed macro scenario. Basel IRB capital is calibrated against downturn LGDs, on the theory that defaults and recoveries are correlated (recoveries fall when defaults rise). ### Exposure at default The EAD is the expected amount of exposure at the moment of default. For term loans, this is close to the current outstanding balance, which makes EAD uninteresting. For revolving facilities (credit cards, lines of credit), EAD is much more interesting because a borrower approaching default typically draws down unused commitments. The standard decomposition is: $$ \mathrm{EAD} = \mathrm{OnBalanceSheet} + \mathrm{CCF} \times \mathrm{UndrawnCommitment}, $$ where CCF is the credit conversion factor, the fraction of undrawn commitment that is expected to be drawn by the time of default. Basel II IRB allows banks to estimate CCFs internally for some exposure classes; Basel III finalization @basel2017finalising tightened input floors and retired the advanced IRB approach for several exposure classes. EAD vs LGD: - EAD (Exposure at Default): dollar amount owed at the moment of default. The size you're exposed to. E.g. \$1M loan drawn, hence EAD = \$1M. - LGD (Loss Given Default): fraction of EAD you actually lose after recovery (collateral, workout). E.g., LGD = 40% means recover 60 cents on the dollar. Loss on one default = EAD × LGD. - \$1M exposure × 40% LGD = \$400K actual loss. EAD = how much at risk. 
LGD = how much of that risk becomes real loss. ### Expected loss The expected loss on a single obligor over a one-year horizon is the product of the three: $$ \mathrm{EL} = \mathrm{PD} \times \mathrm{LGD} \times \mathrm{EAD}. $$ The derivation is a direct consequence of the law of total expectation. Let $L$ be the loss, $Y \in \{0, 1\}$ be the default indicator, and let $L \mid Y=1$ have mean $\mathrm{LGD} \times \mathrm{EAD}$ and $L \mid Y=0 = 0$. Then $$ \mathbb{E}[L] = \mathbb{E}[L \mid Y=1]\Pr(Y=1) + \mathbb{E}[L \mid Y=0]\Pr(Y=0) = \mathrm{LGD} \times \mathrm{EAD} \times \mathrm{PD}. $$ This assumes that PD, LGD, and EAD are independent across the three factors. In reality, LGDs tend to be worse when PDs rise (a recession effect), which is why Basel requires downturn LGDs. ### Unexpected loss and the ASRF model Expected loss is covered by loan loss provisions. Unexpected loss, the tail of the loss distribution, is what regulatory capital is for. Basel II introduced the Asymptotic Single Risk Factor (ASRF) model to compute capital as a closed-form function of PD, LGD, and a supervisory correlation $\rho$. The derivation is due to @gordy2003risk, building on the single-factor Vasicek portfolio model (@vasicek2002distribution) and ultimately on the Merton structural model (@merton1974pricing). We now derive the formula from scratch. #### The Vasicek single-factor model Let obligor $i$ have an unobserved latent asset return $Z_i$ modeled as $$ Z_i = \sqrt{\rho} M + \sqrt{1 - \rho} \varepsilon_i, $$ where $M \sim \mathcal{N}(0, 1)$ is a systemic factor shared across all obligors and $\varepsilon_i \sim \mathcal{N}(0, 1)$ are idiosyncratic innovations, independent of $M$ and across obligors. The correlation between any two obligors' asset returns is $\rho$ by construction, and each $Z_i$ is marginally standard normal. An obligor defaults when its asset return falls below a threshold $c_i$: $$ Y_i = \mathbb{1}\{Z_i \le c_i\}. 
$$ The unconditional default probability is $$ \mathrm{PD}_i = \Pr(Z_i \le c_i) = \Phi(c_i), \quad \Rightarrow \quad c_i = \Phi^{-1}(\mathrm{PD}_i). $$ This is the Merton link [@merton1974pricing] between the structural latent model and a reduced-form PD. #### Conditional default probability Condition on $M = m$. Then $Z_i \mid M = m \sim \mathcal{N}(\sqrt{\rho} m, 1 - \rho)$, and $$ \Pr(Y_i = 1 \mid M = m) = \Pr(Z_i \le c_i \mid M = m) = \Phi\!\left(\frac{c_i - \sqrt{\rho} m}{\sqrt{1 - \rho}}\right). $$ Conditional on $M$, the $Y_i$ are independent. Unconditionally, they are not: the common factor $M$ induces correlation. #### The 99.9% worst-case factor Capital is calibrated at the 99.9% confidence level under Basel II IRB, meaning one year in a thousand. The 99.9% worst-case outcome for the systemic factor $M$ is the 0.001-quantile of its distribution. Because a low $M$ produces more defaults (conditional PD is decreasing in $m$), the 99.9% stress corresponds to $M = \Phi^{-1}(0.001) = -\Phi^{-1}(0.999)$. Substituting $m = -\Phi^{-1}(0.999)$ into @eq-cond-pd: $$ \mathrm{PD}_i^{(0.999)} = \Phi\!\left(\frac{\Phi^{-1}(\mathrm{PD}_i) + \sqrt{\rho} \Phi^{-1}(0.999)}{\sqrt{1 - \rho}}\right). $$ This is the default probability under a one-in-a-thousand stress scenario for the systemic factor. #### From a single obligor to a portfolio For a portfolio, the loss is $L = \sum_i \mathrm{LGD}_i \times \mathrm{EAD}_i \times Y_i$. The ASRF assumption is that the portfolio is infinitely fine-grained, meaning no single obligor dominates and idiosyncratic risk diversifies away. Under this assumption (see @gordy2003risk, Proposition 5), the portfolio loss conditional on $M$ converges to its conditional mean: $$ L / \Big(\sum_i \mathrm{EAD}_i\Big) \to \sum_i w_i \mathrm{LGD}_i \Pr(Y_i = 1 \mid M), $$ where $w_i = \mathrm{EAD}_i / \sum_j \mathrm{EAD}_j$. 
The portfolio's 99.9% value-at-risk is then $$ \mathrm{VaR}_{0.999} = \sum_i \mathrm{EAD}_i \times \mathrm{LGD}_i \times \mathrm{PD}_i^{(0.999)}. $$ #### Subtracting expected loss The 99.9% VaR includes the expected loss $\sum_i \mathrm{EAD}_i \mathrm{LGD}_i \mathrm{PD}_i$. Because EL is already covered by provisions, regulatory capital needs to cover only the gap: $$ K_i = \mathrm{LGD}_i \cdot \Phi\!\left(\frac{\Phi^{-1}(\mathrm{PD}_i) + \sqrt{\rho}\, \Phi^{-1}(0.999)}{\sqrt{1 - \rho}}\right) - \mathrm{PD}_i \times \mathrm{LGD}_i. $$ This is the per-unit-of-EAD capital charge. The risk-weighted asset amount for an exposure is $$ \mathrm{RWA} = K \times \mathrm{EAD} \times \mathrm{MaturityAdjustment} \times 12.5, $$ where the 12.5 multiplier converts the capital charge into a risk-weighted asset amount at an 8% capital ratio ($1 / 0.08 = 12.5$). The minimum capital requirement is 8% of RWA, that is, $K \times \mathrm{EAD} \times \mathrm{MaturityAdjustment}$. The maturity adjustment is an additional multiplicative factor for corporate exposures and is set to 1 for retail exposures under the Basel IRB formula. We ignore it for retail. #### Supervisory correlation Basel II supplies the correlation $\rho$ as a supervisory function of PD. For residential mortgages, $\rho = 0.15$ flat. For other retail exposures: $$ \rho_{\mathrm{other\ retail}} = 0.03 \frac{1 - e^{-35 \mathrm{PD}}}{1 - e^{-35}} + 0.16 \left(1 - \frac{1 - e^{-35 \mathrm{PD}}}{1 - e^{-35}}\right). $$ For corporate, sovereign, and bank exposures: $$ \rho_{\mathrm{corp}} = 0.12 \frac{1 - e^{-50 \mathrm{PD}}}{1 - e^{-50}} + 0.24 \left(1 - \frac{1 - e^{-50 \mathrm{PD}}}{1 - e^{-50}}\right). $$ The functional form is monotone decreasing in PD: riskier obligors have lower asset correlations because they are more idiosyncratic. This empirical regularity was calibrated from data and discussed in the Basel explanatory note [@basel2005irb]. #### Implementing the IRB capital calculator @fig-basel-k shows the shape every credit risk officer has internalized. Capital is concave in PD.
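A minimal sketch of such a calculator, using only the Python standard library, is shown below. It is an illustration rather than the book's reference implementation: the function names are hypothetical, and the maturity adjustment, PD floors, and other regulatory details are deliberately left out.

```python
from math import exp
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: .cdf is Phi, .inv_cdf is Phi^{-1}
RHO_MORTGAGE = 0.15        # flat supervisory correlation for residential mortgages


def rho_other_retail(prob_default: float) -> float:
    """Supervisory correlation for other retail: 0.16 at low PD, 0.03 at high PD."""
    w = (1.0 - exp(-35.0 * prob_default)) / (1.0 - exp(-35.0))
    return 0.03 * w + 0.16 * (1.0 - w)


def rho_corporate(prob_default: float) -> float:
    """Supervisory correlation for corporates: 0.24 at low PD, 0.12 at high PD."""
    w = (1.0 - exp(-50.0 * prob_default)) / (1.0 - exp(-50.0))
    return 0.12 * w + 0.24 * (1.0 - w)


def irb_k(prob_default: float, lgd: float, rho: float,
          confidence: float = 0.999) -> float:
    """Capital charge per unit of EAD: LGD * stressed PD - LGD * PD.

    No maturity adjustment is applied (assumption: retail-style treatment).
    """
    stressed_pd = STD_NORMAL.cdf(
        (STD_NORMAL.inv_cdf(prob_default)
         + rho ** 0.5 * STD_NORMAL.inv_cdf(confidence))
        / (1.0 - rho) ** 0.5
    )
    return lgd * stressed_pd - lgd * prob_default


# Illustrative grid: LGD = 45% is an assumed value, not taken from the text.
for p in (0.001, 0.01, 0.05, 0.20):
    k_retail = irb_k(p, 0.45, rho_other_retail(p))
    k_mortgage = irb_k(p, 0.45, RHO_MORTGAGE)
    k_corp = irb_k(p, 0.45, rho_corporate(p))
    print(f"PD={p:6.3f}  retail={k_retail:.4f}  "
          f"mortgage={k_mortgage:.4f}  corporate={k_corp:.4f}")
```

As a sanity check, `irb_k(0.05, 0.70, rho_other_retail(0.05))` evaluates to about 0.083 of EAD. Multiplying `irb_k` by EAD gives the capital amount for a single exposure.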
A borrower at 1% PD costs roughly five times as much in capital as a borrower at 0.1% PD, not ten times. The corporate curve is always above the retail curve because corporates have higher supervisory correlations. The residential mortgage curve is nearly straight because $\rho$ is constant at 0.15. The **Basel IRB risk-weight function** [@basel2006international, retained in @basel2017finalising] in @eq-basel-k is the single most important calculator in credit risk. It stacks three named results: the **Merton structural default link** [@merton1974pricing], the **Vasicek single-factor portfolio loss distribution** [@vasicek2002distribution], and the **ASRF granularity limit** of @gordy2003risk. The supervisory correlation functions $\rho(\mathrm{PD})$ in @eq-rho-retail and @eq-rho-corp are calibrated per @basel2005irb, and the corporate maturity adjustment uses the Basel para. 272 slope $b(\mathrm{PD}) = (0.11852 - 0.05478 \ln (\mathrm{PD}))^2$. Expected loss @eq-el, unexpected loss as $\mathrm{VaR}_{0.999} - \mathrm{EL}$, and the 12.5 RWA multiplier ($1/0.08$) close the pipeline. Every pricing model, every strategic capital calculation, every IRB benchmark uses this stack. Memorize it. #### A sensitivity calculation Consider a retail credit card book at PD = 5%, LGD = 70%, EAD = 1000. The baseline capital per account is about 8.26% of EAD. A 100 basis point upward miscalibration of PD on this credit-card book lifts capital from 8.26% to 8.43% of EAD, or roughly \$1.64 extra per \$1000 of exposure. For a \$5B book, that is \$8M of capital tied up or released. The sensitivity is modest at mid-range PDs because the Basel $\rho$ for other retail falls with PD, partially offsetting the effect. At lower PDs, where $\rho$ is near its 16% upper bound, the same 100bp shift can move capital several times as much. PD calibration is not a rounding exercise. ### What the IRB formula does not capture Three assumptions in the ASRF derivation are known to be wrong in practice: 1. Infinite granularity.
Real portfolios have concentration, especially in SME and corporate books. The granularity adjustment [@gordy2010small] is an explicit correction, not used in the Basel formula, but used in internal capital models. 2. Single systemic factor. Real factor structure is multi-dimensional: country, industry, tenor. The single-factor model is a conservative approximation that happens to give a closed form. 3. Gaussian dependence. Default dependence has tails fatter than Gaussian, well-documented post-2008. The formula is known to underestimate tail losses for heavy-tailed portfolios. Frailty-correlated defaults [@duffie2009frailty] are an empirical demonstration that the Basel assumption is too thin. These limitations motivate the economic capital layer that banks run alongside the regulatory calculation. We revisit the multi-factor and non-Gaussian issues in later chapters. A related practitioner reference on conservative PD estimation in low-default portfolios is @pluto2005thinking. ## Application, behavioral, and collection scoring Scorecards solve three distinct problems: 1. decide whether to open an account, 2. decide what to do with an existing account, and 3. decide how to collect on a delinquent account. Each problem has its own features, its own target, its own performance window, and its own way of failing. Treating them as the same problem is a common mistake. ### Application scoring Application scoring is the classic scorecard setting. At time $t = 0$, an applicant submits an application with features $X_0$ (demographics, income, employment, declared debt, bureau pull). The lender must decide whether to approve and, if so, what limit and price to offer. The target $Y_{12}$ is the default indicator over the 12-month performance window starting at origination. The estimand is $$ \eta_{\mathrm{app}}(x) = \Pr(Y_{12} = 1 \mid X_0 = x, D = 1), $$ where $D = 1$ conditions on approval. This conditioning is the source of the reject-inference problem (section 2.4). 
The training sample is the set of previously approved applicants, with features frozen at origination and outcomes observed over the performance window. The classical reference for application scorecards is the survey of @thomas2000survey. The logistic regression scorecard with Weight of Evidence (WoE) binning (see @sec-ch07) dominates this setting. Gradient boosting models have the highest raw discrimination (see @lessmann2015benchmarking) but are harder to reason about for regulatory purposes. An application scorecard typically has a short feature list (10 to 30 bins after WoE transformation) and is retrained every 12 to 18 months. The feature list is constrained by what can be collected at application time: the set of bureau attributes, self-reported income, and derived ratios. The most predictive single feature in almost every application scorecard is a credit bureau score (FICO, VantageScore, or equivalent). A bureau score is a scorecard itself, trained on a national-level archive, fed as one feature into the bank's scorecard. ### Behavioral scoring Behavioral scoring operates on existing accounts. Features include the application scorecard's original inputs plus the time-varying on-book history: balance, payment behavior, utilization, and delinquency flags. @crook2007recent trace the evolution of behavioral scoring through the 2000s. The target is usually a forward-looking default indicator over a 12-month window: $$ \eta_{\mathrm{beh}}(x_t) = \Pr(Y_{t+12} = 1 \mid X_t = x_t, \text{on-book at } t). $$ Behavioral scores are recomputed monthly. They drive: - Credit line management: raise or cut the limit on an approved account. - Cross-sell triggers: send a pre-approved loan offer to a profitable customer. - Collection triggers: flag an account for proactive outreach before it defaults. - Pricing updates: re-price a variable-rate facility at a review date. 
Behavioral scores out-predict application scores by a wide margin, because the observed payment history dominates everything else. A single variable, such as "number of months in the last 12 with any delinquency," carries more signal than the entire application form. The design issue with behavioral scoring is that features are time-varying. A naive approach extracts snapshots at fixed time points (for example, the balance on the observation date) and feeds them to a logistic regression. A more principled approach uses recurrent or transformer models on the full sequence (@sec-ch26). The middle ground is panel-style regressions with hand-engineered summary features, which is what most banks actually run. See @shumway2001forecasting for the hazard-model formalization of panel default prediction, and @duffie2007multi for the multi-period extension. ### Collection scoring Collection scoring operates on accounts that are already delinquent. The decision is which collection action to take, not whether to approve the loan. The candidate actions are: - Send a reminder (letter, SMS, email, app notification). - Call the customer. - Refer to an internal collections team. - Sell the debt to a third-party collector. - Charge off and write down. The target in a collection model is not default. Default has effectively already happened (the account is delinquent). The target is the recovery amount over a short horizon, typically 90 days: $$ \eta_{\mathrm{coll}}(x_t, a) = \mathbb{E}[R_{t + 90} \mid X_t = x_t, A = a], $$ where $R$ is the recovery amount and $A$ is the collection action. This is a treatment-effect problem disguised as a regression. The data-generating process is policy-driven: the firm's past collections policy determines which actions were taken on which accounts, so the observed outcomes are not the same as the potential outcomes under a new policy. Naive regression on action effects is confounded. 
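A small worked example makes the confounding concrete. All numbers below are invented for illustration: a hypothetical past policy calls 80% of high-severity accounts and 20% of low-severity ones, a call adds 50 to expected recovery in both groups, and high-severity accounts recover 300 less. Under these assumptions the naive called-versus-not-called comparison gets the sign of the effect wrong, while comparing within severity strata recovers it.

```python
# Hypothetical data-generating process; every number is an illustrative assumption.
SHARE = {"low": 0.5, "high": 0.5}              # population share of each severity stratum
P_CALL = {"low": 0.2, "high": 0.8}             # past policy: Pr(A = call | stratum)
BASE_RECOVERY = {"low": 500.0, "high": 200.0}  # expected recovery without a call
CALL_EFFECT = 50.0                             # true effect of a call in both strata


def expected_recovery(stratum: str, called: bool) -> float:
    """E[R | stratum, A], the structural recovery function."""
    return BASE_RECOVERY[stratum] + (CALL_EFFECT if called else 0.0)


def mean_recovery_given_action(called: bool) -> float:
    """E[R | A = a] as observed under the past policy (strata mixed by the policy)."""
    weights = {
        s: SHARE[s] * (P_CALL[s] if called else 1.0 - P_CALL[s]) for s in SHARE
    }
    total = sum(weights.values())
    return sum(weights[s] / total * expected_recovery(s, called) for s in SHARE)


# Naive estimate: compare called accounts with not-called accounts.
naive_effect = mean_recovery_given_action(True) - mean_recovery_given_action(False)

# Stratified estimate: compare within each stratum, then average over the population.
stratified_effect = sum(
    SHARE[s] * (expected_recovery(s, True) - expected_recovery(s, False))
    for s in SHARE
)

print(f"naive effect:      {naive_effect:+.1f}")
print(f"stratified effect: {stratified_effect:+.1f}")
```

The naive estimate is negative because called accounts are mostly the high-severity ones. Stratification works here only because severity is observed. If the past policy relied on information missing from $X_t$, the adjustment would not be available.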
Collection scoring is where the tools of causal inference (@sec-ch28) have the most immediate payoff. Uplift models, off-policy evaluation, and contextual bandits all show up here. In practice, most large lenders run simple propensity-to-pay models and A/B test new policies into production. ### Why the distinction matters A common failure mode is using one model where another was needed. Three examples: 1. An application scorecard is deployed on the behavioral book. The features are stale. Performance degrades because the application scorecard lacks the payment-behavior features that a behavioral scorecard would use. 2. A behavioral scorecard is used for new applicants. There is no on-book history, so the most predictive features are missing. The model extrapolates, and the calibration breaks. 3. A default-prediction model is used for collections. The default has already happened. The model tells you what you already know. The three models should share a common infrastructure (data, monitoring, model risk framework) but be kept conceptually and operationally separate. ## Reject inference Application scoring has a structural problem. The training sample is the set of previously approved applicants because only they have observed outcomes. The scorecard is then deployed on all applicants, approved or not. If the approval policy was non-random, which it always is, the training distribution differs from the deployment distribution. This is sample selection bias, the canonical @heckman1979sample problem, adapted to credit scoring by @hand1997statistical and extensively studied by @banasik2003sample and @crook2004does. ### The setup Let $X$ be application features, $D \in \{0, 1\}$ be the historical approval decision, and $Y \in \{0, 1\}$ be the default outcome observed only when $D = 1$. The lender wants $$ \eta(x) = \Pr(Y = 1 \mid X = x), $$ but the training sample only provides $$ \eta_A(x) = \Pr(Y = 1 \mid X = x, D = 1). 
$$ If $D$ is conditionally independent of $Y$ given $X$, then $\eta_A = \eta$ and the problem goes away. This is often called the missing-at-random condition. It holds when the historical approval rule depends only on $X$. It fails when approval depends on information the new model does not observe: loan officer judgment, soft collateral, relationship history, or unobserved applicant characteristics. ### Heckman's two-step The @heckman1979sample model assumes latent variables $$ \begin{aligned} Y^* &= X^{\top} \beta + U, \\ D^* &= Z^{\top} \gamma + V, \end{aligned} $$ with $(U, V) \sim \mathcal{N}(0, \Sigma)$ jointly normal and correlated: $\rho_{UV} = \sigma_{UV} / \sqrt{\sigma_U^2 \sigma_V^2}$. Observed decisions are $D = \mathbb{1}\{D^* > 0\}$ and observed outcomes are $Y = \mathbb{1}\{Y^* > 0\}$ when $D = 1$. Under this model, $$ \mathbb{E}[Y^* \mid X, Z, D = 1] = X^{\top} \beta + \sigma_{UV} \lambda(Z^{\top} \gamma), $$ where $\lambda(u) = \varphi(u) / \Phi(u)$ is the inverse Mills ratio. The correction term $\sigma_{UV} \lambda(Z^{\top} \gamma)$ is the bias induced by conditioning on $D = 1$. Heckman's two-step estimator is: 1. Estimate $\gamma$ by probit on $D$ against $Z$ in the full sample of applicants. 2. Compute $\hat{\lambda}_i = \lambda(Z_i^{\top} \hat{\gamma})$ for each approved applicant. 3. Regress $Y^*$ on $X$ and $\hat{\lambda}$ in the approved sample. The coefficient on $\hat{\lambda}$ estimates $\sigma_{UV}$. The Heckman model gives a closed-form bias correction but requires either (a) an exclusion restriction (a variable in $Z$ that is not in $X$ but drives $D$) or (b) strong distributional assumptions. In the credit context, exclusion restrictions are often argued from the loan officer's judgment features (captured in $Z$, not in the modelable $X$), but the assumption is rarely defensible in modern automated underwriting. ### Alternative approaches The credit scoring literature has explored several alternatives: - Re-weighting.
Use propensity scores $\Pr(D = 1 \mid X)$ to re-weight the approved sample. @banasik2007reject applied this idea and found modest improvements. - Parceling. Assign a fractional bad label to rejected applicants based on the approved-sample model's prediction. A classical approach from @thomas2000survey. It produces stable models but merely shifts the bias rather than removing it. - Fuzzy augmentation. Score each reject twice, once as a good and once as a bad, with weights from the approved-sample model. An iterative variant of parceling. - Control groups. Randomly approve a small fraction of would-be rejects. Gives unbiased data on the rejected region at the cost of some defaults. Widely used in fintech, rarely used in traditional banking. - Instrumental variables. Exploit exogenous variation in the approval rule (a policy change, a regional experiment). See @imbens2008recent for the methodology and @angrist1996identification for the identification theory. The consensus in the literature [@crook2004does; @banasik2003sample; @hand1997statistical] is that reject inference techniques offer modest improvements at best when the approval rule is well-explained by observable features, and are genuinely useful only when the approval rule relies on information not in the model. @crook2004does famously conclude that reject inference is rarely worth the effort for typical bank datasets. This negative result is partly because banks approve around 60 to 80 percent of applicants, so the rejected region is not that informative. @sec-ch10 develops reject inference in depth, including the modern approaches based on semi-supervised learning and causal identification strategies. ## Class imbalance and its consequences Credit portfolios are imbalanced. Prime mortgage books have 99.5% goods and 0.5% bads. Even subprime books are 80% good, 20% bad. This imbalance affects what metrics to track, how to regularize the model, and how to set the classification threshold.
### What imbalance does not break Class imbalance is often blamed for issues it does not cause. Logistic regression's maximum likelihood estimator is consistent under imbalance (@mcfadden1974conditional). The calibration of the model's probability predictions depends on the prior, but in a known way: the intercept shifts by $\log\left[\pi_1 / (1 - \pi_1)\right]$ compared to a balanced sample, and the slopes are unaffected [@king2001logistic]. AUC is invariant to the class prior [@hand2001measuring; @japkowicz2002class]. Gradient boosting and random forests are also not structurally broken by imbalance. What breaks them is the interaction between imbalance and finite samples: with very few positives, the model has very little signal. This is a sample size problem, not an imbalance problem. ### What imbalance does break Three things go wrong under imbalance: 1. Accuracy is useless. At 1% bad rate, a constant "predict good" classifier has 99% accuracy. Accuracy is dominated by the majority class. Use AUC, KS, and log-loss instead. 2. Brier score is not invariant to class prior. Because Brier is an absolute squared-error measure, it tracks the variance of the outcome $Y$, which is $\pi_1 (1 - \pi_1)$. Under imbalance, Brier is mechanically small even for uninformative models. Brier should be interpreted relative to the baseline $\pi_1 (1 - \pi_1)$ or re-expressed as a Brier skill score. 3. Threshold-based metrics (precision, recall, F1) shift with prior. These metrics depend on the operating point, which in turn depends on the ratio of positives to negatives. Across portfolios with different priors, threshold-based metrics are not comparable without re-calibration. We now demonstrate points 2 and 3 with a controlled simulation. #### AUC invariance, Brier sensitivity As shown in @fig-auc-vs-brier, AUC is constant within simulation noise, consistent with its prior-invariance result. Brier, however, does not tell the same story.
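A simulation in this spirit can be sketched with the standard library alone. The setup is an assumption made for illustration, not the configuration behind the figure: goods draw a score $s \sim \mathcal{N}(0, 1)$, bads draw $s \sim \mathcal{N}(1.5, 1)$, and the uncalibrated probability is $\hat p = \sigma(s)$. Only the class mix changes between runs.

```python
import math
import random


def sigmoid(s: float) -> float:
    return 1.0 / (1.0 + math.exp(-s))


def auc(scores, labels) -> float:
    """Rank-based (Mann-Whitney) AUC; ties are ignored, which is safe for continuous scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    rank_sum_pos = sum(rank for rank, idx in enumerate(order, start=1) if labels[idx] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(prior: float, n: int = 50_000, seed: int = 0):
    """Return (AUC, raw Brier, Brier skill score) at a given bad rate."""
    rng = random.Random(seed)
    n_bad = round(prior * n)
    labels = [1] * n_bad + [0] * (n - n_bad)
    # Assumed class-conditional score distributions (illustrative only).
    scores = [rng.gauss(1.5 if y == 1 else 0.0, 1.0) for y in labels]
    probs = [sigmoid(s) for s in scores]  # deliberately not recalibrated to the prior
    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    base_rate = n_bad / n
    reference = base_rate * (1.0 - base_rate)  # Brier of the constant forecast pi_1
    return auc(scores, labels), brier, 1.0 - brier / reference


results = {prior: simulate(prior) for prior in (0.5, 0.2, 0.05, 0.01)}
for prior, (auc_value, brier, skill) in results.items():
    print(f"prior={prior:5.2f}  AUC={auc_value:.3f}  Brier={brier:.3f}  skill={skill:8.2f}")
```

Because the class-conditional score distributions are held fixed, any movement in Brier or in the skill score across rows comes from the class mix alone.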
As the prior falls, the raw Brier score climbs because the predicted probabilities $\hat p = \sigma(s)$ have their mass around 0.5, while the labels become increasingly concentrated at 0. The Brier skill score relative to the forecast $\pi_1$ turns strongly negative for small priors, which is the correct signal that the probabilities are badly calibrated for that mixture, not that the discriminative score got worse. The fix is recalibration via @eq-prior-correction or via an isotonic step on a held-out sample. This is why regulators accept AUC and KS as universal monitoring metrics across portfolios, while Brier is always reported alongside the base rate or as a skill score [@brier1950verification; @murphy1973new]. The Brier skill is a sharp diagnostic for miscalibration; raw Brier on its own is not. ### Bayes decision boundary The optimal classification threshold under a cost-sensitive loss function is not 0.5. It depends on the costs of false approvals and false rejections. We derive it. Let the cost matrix be:

|                   | $Y = 0$ (good)         | $Y = 1$ (bad)           |
|-------------------|------------------------|-------------------------|
| $D = 1$ (approve) | 0                      | $C_{10}$ (default loss) |
| $D = 0$ (decline) | $C_{01}$ (lost margin) | 0                       |

Only relative costs matter, so the diagonal is normalized to zero. Expected cost given $\hat p = \Pr(Y = 1 \mid X)$: $$ \mathbb{E}[\text{Approve}] = \hat p C_{10}, \qquad \mathbb{E}[\text{Decline}] = (1 - \hat p) C_{01}. $$ Approve when the expected cost of approving is smaller: $$ \hat p C_{10} < (1 - \hat p) C_{01} \iff \hat p < \frac{C_{01}}{C_{01} + C_{10}}. $$ The Bayes threshold is $$ t^* = \frac{C_{01}}{C_{01} + C_{10}}. $$ This result is independent of the class prior. The prior matters only through its effect on $\hat p$. For example, with $C_{01} = 0.03$ (3% margin lost on a declined good) and $C_{10} = 0.45$ (45% LGD on an approved bad), the threshold is $$ t^* = \frac{0.03}{0.03 + 0.45} = 0.0625.
$$ Any borrower with $\hat p \ge 6.25\%$ is declined. The same formula gives different cut-offs for other cost structures, because the threshold pins down the cost ratio: $C_{10} / C_{01} = (1 - t^*) / t^*$. The credit-card threshold is aggressive at 6.25% (a default loss 15 times the lost margin). The mortgage threshold is tighter at 2% (a ratio of 49). The subprime threshold sits at 13% (a ratio of about 6.7). These numbers match the published approval rate experience for the relevant books. The derivation is straight from @elkan2001foundations, and the logic generalizes to multi-action decisions and to non-binary outcomes. A profit-oriented generalization that integrates the cost matrix with the EMP framework is developed by @verbraken2014novel. ### Log-loss and Bernoulli likelihood Every probabilistic classifier this book trains ends up minimizing, explicitly or implicitly, the cross-entropy (log-loss). We derive it from first principles. Let $Y_i \in \{0, 1\}$ be independent Bernoulli draws with parameter $p_i = \eta(X_i)$ and let the model estimate $\hat p_i = f_\theta(X_i)$. The Bernoulli likelihood for a single observation is $$ \mathcal{L}_i(\theta) = \hat p_i^{Y_i} (1 - \hat p_i)^{1 - Y_i}. $$ The joint likelihood over $n$ independent observations is the product $\prod_i \mathcal{L}_i$. The log-likelihood is $$ \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \left[ Y_i \log \hat p_i + (1 - Y_i) \log (1 - \hat p_i) \right]. $$ The negative log-likelihood (NLL), divided by $n$, is the cross-entropy loss: $$ \mathrm{CE}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ Y_i \log \hat p_i + (1 - Y_i) \log (1 - \hat p_i) \right]. $$ This is identical to the information-theoretic cross-entropy between the empirical label distribution and the model's predictive distribution. Minimizing CE is equivalent to maximum likelihood for the Bernoulli family. The result holds whatever the functional form of $f_\theta$: logistic regression, gradient boosting, random forests, neural networks, transformers. They all minimize the same target under the same justification. Two useful properties follow. 1.
CE is a strictly proper scoring rule [@dawid1982well; @degroot1983comparison]: the unique minimizer over all predictive distributions is the true conditional distribution $\eta(x)$. A model trained to minimize CE, in the infinite-data limit, recovers the Bayes-optimal predictor. 2. CE decomposes into calibration and refinement components [@murphy1973new]. If $\hat p$ is a function of a coarser score $S$, then $$ \mathrm{CE} = \mathbb{E}[\mathrm{KL}(\eta \| S)] + \mathbb{E}[\mathrm{KL}(\hat p \| \eta \mid S)]. $$ The first term is the refinement loss: how much information is lost by summarizing $X$ into $S$. The second term is the calibration loss: how much the model deviates from the true conditional given its own score bin. A well-calibrated model has the second term equal to zero. @sec-ch04 develops the calibration-refinement decomposition in detail. ### A calibration note Many production systems re-balance the training sample (undersampling the majority, oversampling the minority, or SMOTE-style synthetic generation [@chawla2002smote]). These interventions change the effective prior and bias the output probabilities. If you resample, you must recalibrate. The correction is a direct consequence of Bayes' rule. If the training prior is $\pi_1^{\mathrm{train}}$ and the deployment prior is $\pi_1^{\mathrm{deploy}}$, the recalibration of a predicted probability is $$ \hat p^{\mathrm{deploy}} = \frac{a}{a + b}, \qquad \begin{aligned} a &= \hat p^{\mathrm{train}} \cdot \pi_1^{\mathrm{deploy}} (1 - \pi_1^{\mathrm{train}}), \\ b &= (1 - \hat p^{\mathrm{train}}) \cdot \pi_1^{\mathrm{train}} (1 - \pi_1^{\mathrm{deploy}}). \end{aligned} $$ This is derived from the posterior odds ratio of Bayes' theorem and appears in @elkan2001foundations and @king2001logistic. It is the single most useful formula to know when moving a model between a resampled training distribution and an unsampled deployment distribution.
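The correction is short enough to write out directly. The sketch below is one possible transcription of the formula, with a hypothetical function name and an assumed example of a model trained on a 50/50 resampled set and deployed on a 5% bad-rate population.

```python
def prior_correct(p_train: float, prior_train: float, prior_deploy: float) -> float:
    """Map a probability fitted under the training prior onto the deployment prior.

    Direct transcription of p_deploy = a / (a + b) from the text.
    """
    a = p_train * prior_deploy * (1.0 - prior_train)
    b = (1.0 - p_train) * prior_train * (1.0 - prior_deploy)
    return a / (a + b)


# Assumed scenario: balanced training sample, 5% bad rate in deployment.
PRIOR_TRAIN, PRIOR_DEPLOY = 0.50, 0.05
for p in (0.10, 0.50, 0.90):
    corrected = prior_correct(p, PRIOR_TRAIN, PRIOR_DEPLOY)
    print(f"p_train={p:.2f} -> p_deploy={corrected:.4f}")
```

Two checks are worth keeping as unit tests. A score equal to the training prior maps to the deployment prior, and the function is the identity when the two priors coincide. The correction rescales the odds by a constant factor, so the rank ordering of accounts, and therefore AUC, is unchanged.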
@sec-ch15 develops the resampling family in depth and revisits this correction. ## Benchmark on Taiwan data: observed vs. predicted PDs We end the main content with a short benchmark that ties the formalism to real data. We train a logistic regression on the UCI Taiwan default dataset [@yeh2009comparisons], partition borrowers into deciles of predicted PD, and plot the observed default rate against the predicted rate. This is the elementary calibration diagnostic that every production scorecard is expected to pass. As shown in @fig-taiwan-pd-buckets, the deciles mostly sit near the 45-degree line, with a visible lift in the top decile. The top decile's observed default rate exceeds its predicted PD, which means a plain logistic regression with standardized features understates the worst deciles. A scorecard in production would pass this through isotonic or Platt calibration [@platt1999probabilistic] (see @sec-ch04) to correct the systematic lift. The KS and AUC of this naive logistic are already usable, which is a reminder that credit scoring problems are tractable with small models if the features are informative. The reason we ran this benchmark is to underline the chapter's main point. Every downstream calculation (IRB capital, IFRS 9 expected credit loss, approval threshold, pricing) uses the predicted PD as an input. A systematic bias at the top decile translates directly into systematic bias in capital and pricing. @sec-ch02-pd-lgd-ead-and-regulatory-capital gave us the sensitivity: at a mid-range 5% book, 100 basis points of PD bias moves capital by one to two dollars per \$1000 of exposure, and the effect is several times larger at lower PDs. A miscalibrated top decile is a real-money problem. ## Scalability considerations The benchmarks in later chapters run on the three canonical public datasets: German (1000 rows), Taiwan (30,000 rows), and Home Credit (300,000 to 1 million rows).
Real bank portfolios are larger: a mid-sized US card issuer has 10 to 50 million active accounts, evaluated monthly, with a transaction history that can extend to 10 years. A year of daily transaction-level features on a 50M account book runs to a low-terabyte scale. The scaling path for application scoring is straightforward. Feature engineering dominates. An application scorecard refits well under pandas up to about 5 million rows. Beyond that, Polars is the pragmatic next step (same API semantics, multi-threaded, columnar). Dask and Spark come into play for monthly behavioral refreshes across tens of millions of accounts. We show concrete pandas-to-Polars-to-Spark comparisons in @sec-ch17 for feature engineering and in @sec-ch34 for training. The scaling path for behavioral scoring is different. The data is a time-indexed panel. The features are aggregations over rolling windows. The natural tool is an out-of-core column-store (Parquet with Polars lazy frames, or DuckDB, or Spark). The natural model at this scale is gradient boosting (@sec-ch12) rather than deep sequence models, for latency and interpretability reasons. The deep sequence and graph cases are treated in @sec-ch26 and @sec-ch27. For the IRB capital calculation itself, scalability is trivial. The formula is a scalar function that vectorizes cleanly over NumPy arrays. A portfolio of 100 million exposures runs in under a second on a laptop. The bottleneck in production is always data movement, not math. ## Deployment considerations A credit scoring model is a small cog in a much larger decision system. The model gets a feature vector, outputs a PD, and hands it off to a policy engine that applies hard-coded rules (minimum credit bureau score, maximum debt-to-income, and similar) before the final decision. The model is almost never the final decision maker, for regulatory and practical reasons. The deployment pattern we use across the book is: 1. 
Package the model as a versioned artifact (ONNX, pickle, or MLflow format). Store training data, hyperparameters, and metrics alongside the artifact. 2. Wrap the artifact in a FastAPI or gRPC service. The service exposes `predict` (returns PD and optional explanations) and `health`. Latency budget: single-digit milliseconds for application scoring, tens of milliseconds for behavioral monthly batch. 3. Route decisions through a separate policy engine that consumes the PD and applies the rest of the decision logic. 4. Log every prediction with input features, output score, model version, and timestamp. This is required by @sr117 and by the EU AI Act for high-risk systems. 5. Monitor in production for population stability (PSI), performance drift (AUC and KS on vintage cohorts), and calibration drift (predicted vs. observed by bucket). The deployment artifact of this chapter is the IRB capital calculator, which we expose as a small reference implementation. @sec-ch34 treats the full MLOps pipeline. ## Regulatory considerations Five regulatory anchors frame everything in this book. This chapter touched the first two; the others recur in later chapters. ### Basel II/III (IRB) We derived the ASRF formula from first principles. The practitioner consequences are: - Internal PD, LGD, and EAD models require supervisory approval. The validation is framed by @basel2006international Part 2.3 and the EBA @eba2017gl technical standards. - PDs must be TTC-style (through-the-cycle) for capital. IFRS 9 and CECL PDs are PIT and not the same number. - The 0.03% PD floor on retail exposures constrains the tail of the rating scale. - LGDs must be downturn-calibrated. Downturn LGDs are the empirical average in stressed periods, not the overall average. - Model risk is monitored continuously, with an annual validation cycle. 
Basel III finalization (@basel2017finalising, also known as the output floor package) tightened IRB input floors and introduced an aggregate floor of 72.5% against the standardized risk-weighted assets. The practical effect is that the capital saved by a sophisticated internal model is capped at 27.5% of the standardized figure. The BCBS 239 principles on risk data aggregation [@bcbs239] then impose data-quality and timeliness standards on every input that feeds the capital calculation. ### SR 11-7 The Federal Reserve's Supervisory Guidance on Model Risk Management [@sr117] is the US equivalent. Its key tenets are effective challenge, independent validation, comprehensive documentation, and a model inventory. Every credit scoring model in a US bank is required to satisfy SR 11-7. The chapter's construction of PD, LGD, EAD, and the capital formula is the kind of derivation an SR 11-7 validator expects to see in the model documentation. ### IFRS 9 and CECL Accounting standards, such as @ifrs9 and @cecl, require expected credit loss provisioning. IFRS 9 uses a three-stage model (stage 1: 12-month ECL, stage 2: lifetime ECL for significantly increased credit risk, stage 3: lifetime ECL for impaired). CECL uses lifetime ECL from inception, without staging. Both frameworks require PIT-style PD and LGD estimates, forward-looking macroeconomic overlays, and transparent documentation. @sec-ch35 develops these in depth. ### ECOA, FCRA, and fairness In the US, credit decisions are regulated by the Equal Credit Opportunity Act (ECOA) and the Fair Credit Reporting Act (FCRA). ECOA prohibits discrimination based on protected classes (race, color, religion, national origin, sex, marital status, age). FCRA regulates the use of credit reports and mandates adverse action notices with specific reasons. A modern credit scoring pipeline must provide feature-level reason codes for every declined application.
SHAP values [@lundberg2017unified], treated in @sec-ch22, are the current standard tool for this. ### EU AI Act and GDPR Article 22 The EU AI Act (effective 2024 to 2026 in phases) classifies credit scoring as a high-risk system, imposing requirements on data governance, technical documentation, transparency, human oversight, accuracy, robustness, and cybersecurity. GDPR Article 22 grants the right not to be subject to a decision based solely on automated processing, with the practical effect that an automated credit decision must have a human-in-the-loop pathway. @sec-ch05 and @sec-ch24 treat the full regulatory compliance surface. ## Vietnam and emerging markets ### Market context The formal setup of this chapter (PD, LGD, EAD, the ASRF capital formula, and the three scoring problems) is transplanted into Vietnam through SBV Circular 41/2016/TT-NHNN, which adopts Basel II's standardized approach for most domestic banks and opens an internal-ratings pathway on a pilot basis for a short list of systemically important institutions [@sbv_circular41_2016]. The counterparty infrastructure has two pillars. The Credit Information Center (CIC) is the SBV's public bureau and is the mandatory reporting destination for regulated lenders. The Vietnam Credit Information JSC (PCB) is the private bureau. Combined adult coverage is around the 50 to 55 percent range, with thinner tradeline depth than a US or EU bureau file [@cic_vietnam2023; @worldbank_findex2021]. Mobile penetration above 140 percent of adults and smartphone adoption above 80 percent of the urban adult population underpin an onboarding channel that is mobile-first; eKYC under Circular 16/2020/TT-NHNN and personal-data handling under Decree 13/2023/ND-CP are the binding constraints on what data can enter the feature vector $X$ at origination [@sbv_circular16_2020; @vn_decree13_2023]. ### Application considerations The formal estimands of this chapter survive the move to Vietnam. The inputs that feed them do not. 
Four adjustments recur. First, the training sample for an application scorecard is small by US standards. Mid-size consumer-finance portfolios carry one to three million active accounts, and the 12-month performance window combined with the 18-month gap-to-today discipline leaves a usable cohort of a few hundred thousand loans. The 0.03 percent Basel PD floor rarely binds in this regime because the fitted rating scale is coarser, with a floor defined at one of the top rating grades rather than at the individual obligor level. Second, macro volatility pushes a lender toward the through-the-cycle PD definition of @eq-pd-def even for IFRS 9 reporting. The 2011 non-performing-loan spike, the 2022 corporate-bond episode, and recurrent FX pressure on the dong mean that a point-in-time PD that is accurate for any single quarter is structurally unstable across two-year windows [@imf2024vietnamart4]. Third, informal income breaks the self-reported income feature in the application form. A bank that treats declared income as exogenous is modeling a proxy. Bank-statement parsing, e-wallet flow features, and cross-checks against telco and utility billing are the practical substitutes. Fourth, Tet seasonality creates a January-February origination cohort that is systematically riskier than the annual average and a short-term delinquency spike in the following quarter that a naive monthly vintage curve reads as a break. The LGD-downturn concept in the chapter needs a local anchor. The Basel instruction to use a stressed LGD average assumes a recession history that a lender can sample.
Vietnamese consumer-finance portfolios at the relevant scale rarely have a full stress cycle in the observable sample, and LGDs on unsecured personal loans interact with collection-sector regulation (Circular 43/2016/TT-NHNN on consumer lending by finance companies) in ways that change mid-cycle, while capital treatment is set by Circular 41/2016 as amended by Circular 22/2023/TT-NHNN (29 Dec 2023) on capital adequacy ratios [@sbv_circular22_2023]. A conservative practitioner applies a floor to LGD rather than relying on an empirical downturn estimate on a short panel. ### Rationalization The ASRF formula and the three-way good-bad-indeterminate split are good fits for Vietnam because they are precisely the machinery that Circular 41/2016 codifies. The supervisory correlation $\rho$ is supplied by the regulator, so the practitioner is not asked to estimate it on a thin sample. The PD floor and the LGD floor are exactly the conservatism tools that an emerging-market portfolio needs. The reject-inference problem of the formal setup is, if anything, more acute in Vietnam than in the US: historical approval rules lean heavily on loan-officer judgment for SME and near-prime consumer lending, so the missing-at-random condition is less defensible. @sec-ch10 is the place to come back for this. The one piece of the chapter that has to be handled with care is the PIT-TTC distinction[^02-formal-setup-1]. The chapter presents them as two operational flavors of the same estimand. In a Vietnamese book, the PIT estimate is unstable across the macro cycle and the TTC estimate is the only one that survives supervisory review for capital. Practitioners should default to the TTC definition for PD models that enter Circular 41 capital and treat the PIT estimate as a separate, monitoring-only output. [^02-formal-setup-1]: **Point-in-Time (PIT)** models evaluate a borrower's current risk using real-time economic data, making them volatile over economic cycles. 
**Through-the-Cycle (TTC)** models estimate long-term risk, focusing on stable, enduring creditworthiness over economic cycles. ### Practical notes The two local datasets that support this chapter's machinery are the CIC inquiry-and-tradeline extract and the PCB enriched file. Neither is publicly downloadable, but both are accessible to licensed lenders under CIC's subscriber program. For reproducibility in this book, the UCI Taiwan dataset is a reasonable regional credit-card analog, and the Home Credit Group public Kaggle release is the closest open-source stand-in for a thin-file consumer-finance portfolio. Reporting lines for the capital formula run to the SBV Banking Supervision Agency for commercial banks, with model validation documentation expected in parallel with the capital return. Model-risk-management expectations in Vietnam are not codified at the level of SR 11-7, but the SBV's Circular 13/2018/TT-NHNN on internal control systems, plus the Circular 41/2016 approval process for internal-model pilots, function as a working equivalent. A team building an IRB-style PD model in Vietnam should expect to submit the ASRF derivation, the calibration diagnostics from @fig-taiwan-pd-buckets, and the per-segment $K$ curve from @fig-basel-k as core exhibits. ## Takeaways - Credit scoring is a probabilistic classification task embedded in a decision-theoretic pipeline. The probability is the intermediate output; the decision is what matters. - Goods, bads, and indeterminates are defined by the Basel 90+ dpd rule, UTP triggers, and firm policy. Getting the bad definition wrong invalidates every downstream metric. - A PD is a conditional probability indexed by five choices: bad event $\mathcal{B}$, horizon $h$, population $\mathcal{P}$, cycle stance $\mathcal{C}$, sampling frame $\mathcal{S}$ (@sec-ch02-pd-construct).
Cross-vendor and cross-vintage comparisons are only well-defined after these are aligned or after both PDs are mapped to a common master rating scale. - Expected loss decomposes as $\mathrm{EL} = \mathrm{PD} \times \mathrm{LGD} \times \mathrm{EAD}$. Unexpected loss is what Basel regulatory capital covers, via the Asymptotic Single Risk Factor (ASRF) formula. - The IRB capital formula $K = \mathrm{LGD} \cdot \Phi((\Phi^{-1}(\mathrm{PD}) + \sqrt{\rho} \Phi^{-1}(0.999)) / \sqrt{1 - \rho}) - \mathrm{PD} \cdot \mathrm{LGD}$ falls out of a single-factor Vasicek model plus a 99.9% stress scenario. Memorize it. - Application, behavioral, and collection scoring are three different problems. Do not confuse them. - Reject inference is the credit-scoring-specific version of sample selection bias. The bias is small when the approval rule is well-explained by observed features, large when it is not. - Class imbalance makes accuracy useless, shifts Brier mechanically, and bends threshold metrics. AUC is invariant. Log-loss is the natural loss under the Bernoulli model and is a strictly proper scoring rule. - The Bayes-optimal cutoff from a cost matrix is $t^* = C_{01} / (C_{01} + C_{10})$. It is independent of the class prior and is the production threshold for cost-sensitive classification. ## Further reading - @basel2006international, the original Basel II text, and @basel2017finalising for Basel III finalization. - @basel2005irb, the BIS explanatory note on the IRB risk weight functions, which derives $\rho$ calibration. - @gordy2003risk for the formal risk-factor justification of the IRB formula. - @vasicek2002distribution for the single-factor portfolio loss distribution. - @thomas2000survey for the foundational scorecard survey. - @thomas2017credit for the modern scorecard text and the standard roll-rate machinery used for bad-definition translation. - @carlehed2012framework for the canonical PIT-TTC decomposition. 
- @loffler2013rating for empirical evidence on through-the-cycle rating practice. - @bangia2002ratings for cycle-conditional migration matrices. - @plutotasche2005 for low-default PD estimation under the master-scale workflow. - @crook2007recent for the behavioral-scoring update. - @heckman1979sample for the canonical sample-selection correction. - @hand1997statistical for the credit-scoring adaptation. - @banasik2003sample and @crook2004does for empirical reject-inference results. - @elkan2001foundations for cost-sensitive classification theory. - @king2001logistic for rare-event logistic regression and prior correction. - @lessmann2015benchmarking for the modern classifier benchmark landscape. - @eba2017gl for the EBA IRB PD/LGD estimation guidelines. - @sr117 for the US supervisory guidance on model risk. ================================================================================ # Source: chapters/03-data.qmd ================================================================================ # Data: Sources, Features, and Preprocessing **Scope: both retail and corporate, retail-leaning.** Bureau, application, transaction, and alternative-data sources. Worked examples lean retail (UCI German, Taiwan, CIC Vietnam); corporate financial-statement features are covered alongside. ## Overview {.unnumbered} Credit scoring lives or dies on its inputs. A logistic model trained on the wrong population, a gradient-boosted tree fit to leaky features, or a deep network that imputes missingness with a mean will all fail in production, and they will fail in ways that regulators care about. The modeling choices that textbooks emphasize matter. The data choices matter more. This chapter takes data seriously. We walk through the traditional sources that sit inside a bank's scorecard, catalog the alternative signals that have appeared in the past decade, and formalize the preprocessing steps that translate raw tables into model-ready features. 
Three tools get most of the attention: (1) weight of evidence for monotone encoding, (2) imputation for missingness, and (3) time-aware splitting for leakage control. Each is worked out in math and in code that runs on the public UCI data sets. The chapter also makes a scalability argument. A scorecard team that only thinks in pandas hits a wall at a few million rows. Polars, Dask, and Spark each solve a different piece of that wall, and weight of evidence encoding is one of the simplest places to show the tradeoff. The last section walks through a classic leakage bug, trains a model on it, and shows the out-of-time hit. A note for the emerging-market reader. The data stack this chapter describes (bureau file, internal core-banking data, alternative overlays) looks different when the bureau is the Credit Information Center (CIC) rather than Experian, when roughly half the adult population has no tradeline, when declared income comes from cash work rather than payroll, and when the origination channel is a mobile app with an eKYC liveness check rather than a branch visit. The preprocessing decisions that follow, weight-of-evidence binning, missingness treatment, and point-in-time feature construction, have to absorb a higher missingness rate, a shorter tradeline depth, and a heavier reliance on transaction-level cash-flow features. The chapter's methods are the right methods, but the defaults (bin counts, IV thresholds, imputation strategy) need to be set with the thin-file population in mind. The intended reader is a senior practitioner or an academic researcher who already understands logistic regression and classical statistical learning. The chapter spends no time re-deriving maximum likelihood for a linear model; it spends most of its time on the joins, the cohort definitions, and the pre-model transformations that separate a demonstration notebook from a production scorecard. 
The empirical sections lean on UCI German and UCI Taiwan because they are reproducible everywhere, but the methods port cleanly to larger Home Credit, LendingClub, and HMDA samples covered in later chapters. ### Notation {.unnumbered} We use $X \in \mathcal{X}$ for features and $Y \in \{0, 1\}$ for the default label, with $Y = 1$ denoting a bad. Population rates are $\pi_1 = \Pr(Y = 1)$ and $\pi_0 = 1 - \pi_1$. A binned feature partitions $\mathcal{X}$ into disjoint bins $\{A_j\}_{j=1}^{J}$. For a categorical variable, bins are level groupings. For a numeric variable, bins are intervals. Conditional probabilities are $p_{j \mid 1} = \Pr(X \in A_j \mid Y = 1)$ and $p_{j \mid 0} = \Pr(X \in A_j \mid Y = 0)$. ## Traditional data Banks have collected the same categories of consumer credit data for four decades. Little has changed in the core schema. The four pillars of a traditional retail scorecard are the bureau report (@sec-ch03-bureau), the bank's internal master file (@sec-ch03-internal), the application form (@sec-ch03-application), and the external overlay such as income verification or fraud flags (@sec-ch03-overlays). ### The bureau report A consumer credit bureau is a private clearinghouse. It ingests monthly tradeline updates from thousands of furnishers, normalizes them into a canonical schema, and sells reports and scores back to lenders. In the United States the three nationwide bureaus are Equifax, Experian, and TransUnion [@avery2003overview]. The Fair Credit Reporting Act (FCRA) governs what they can collect, how long they can keep it, and what consumers can dispute. Europe runs a mix of positive and negative bureaus by country. China operates the Credit Reference Center of the People's Bank of China alongside several private bureaus. A bureau report breaks into five sections that every modern scorecard touches: 1. **Identification**: name, date of birth, social security or national identifier, current and prior addresses. 
Used to link the report to the application and to detect identity fraud. 2. **Tradelines**: one row per active or closed credit account. Each tradeline has an opening date, an account type (revolving, installment, mortgage, open), a credit limit or original balance, a current balance, a minimum payment, and a 24-month payment history string such as `OK OK OK 30 60 OK OK ...`. These strings are the raw material for every delinquency-based feature. 3. **Inquiries**: every hard pull in the past two years, with date and subscriber name. A burst of inquiries in the last 30 days is a strong short-horizon risk signal. 4. **Public records**: bankruptcies, tax liens, civil judgments. Since the National Consumer Assistance Plan (NCAP) changes in the United States, most civil judgments and tax liens no longer appear, which reshaped the public-record feature bank after 2017. 5. **Collections**: charged-off accounts placed with third-party collectors. Often shown with original creditor, collection agency, and charge-off balance. The FICO score, the dominant consumer credit score in the United States, derives from bureau data only. Myfico.com publishes the category weights: payment history (roughly 35 percent), amounts owed (30 percent), length of credit history (15 percent), new credit (10 percent), credit mix (10 percent). The underlying algorithm is proprietary, but its inputs are public knowledge and follow a small number of archetypes. Utilization is the ratio of revolving balance to limit, computed per tradeline and aggregated to the file level. Delinquency depth is the worst 24-month payment code on each line, rolled up to file level as the fraction of tradelines that were 30+, 60+, or 90+ days past due in the last 6, 12, or 24 months. Age features include the age of the oldest account and the average age of open tradelines. VantageScore, the joint bureau product, and the proprietary scorecards that large lenders build in-house use the same tradeline and inquiry data but different binning, weighting, and target definitions.
A common pattern inside a bank is to stack an in-house behavioral score on top of a bureau score, so the scorecard captures both the generic credit-file signal and the account-specific behavior that the bureau does not see. The specific feature vocabulary is remarkably stable across bureaus and across decades. A partial list of the archetype features a scorecard developer can expect to find useful: - `bureau_score`: the FICO or VantageScore on file at the observation date. - `oldest_tradeline_age_months`: age of the oldest account. Tracks length of credit history. - `avg_open_tradeline_age_months`: average age of open tradelines. Captures both length and churn. - `utilization_revolving`: sum of balance divided by sum of limit across open revolving lines. - `utilization_maximum_tradeline`: the maximum utilization across any single revolving tradeline. - `num_tradelines_30dpd_12m`: count of tradelines that reached 30+ days past due in the last 12 months. - `num_tradelines_60dpd_24m`: the 60+ DPD analog over 24 months. - `num_inquiries_6m`: hard pulls in the last 6 months. - `num_new_tradelines_12m`: newly opened accounts in the last 12 months. - `bankruptcy_flag`, `collections_flag`, `tax_lien_flag`: public record presence indicators. - `secured_installment_flag`, `mortgage_flag`: structural presence indicators. - `revolving_total_balance`, `installment_total_balance`: dollar aggregates by account type. - `months_since_last_delinquency`: recency of the most recent bad event; "never" is usually coded as a large positive number. A clean scorecard typically uses 15 to 30 of these, with roughly a two-thirds weight on delinquency-adjacent features (past behavior), a 15-percent weight on utilization (current behavior), and a 10- to 20-percent weight on length and mix (structural). ### Bank internal data Internal data is what the lender knows from its own books. It is almost always more predictive than bureau data for customers who already hold a product. 
A current-account issuer sees the full flow of salary credits and direct debits. A mortgage servicer sees escrow behavior. A credit card issuer sees transactional authorizations in real time. The operational data stores used for model building tend to be organized by the source system: - **Core banking**: account master, balances, interest accruals, statement-level tables. - **Card authorization**: every swipe, with merchant category, amount, timestamp, and channel. - **Payments and transfers**: ACH, wire, SEPA, faster payments, internal transfers. - **Collections**: arrears, promise-to-pay history, agent notes, settlement agreements. - **Customer contact**: call center records, digital channel logs, complaint flags. A behavioral scorecard reduces this to a set of standardized windows: 1-month, 3-month, 6-month, and 12-month. Each window is aggregated into counts, sums, max, min, ratios, and trends. A typical card model will have 30 to 80 such features. The specific recipe matters less than the window discipline. The windows must end strictly before the observation point of the score, and they must be computable at that same point in the production path. If they are not, the model is leaky, a topic we return to in @sec-ch03-temporal-leakage-and-lookahead-bias. There is a structural asymmetry between new-account origination scorecards and behavioral scorecards on existing accounts. An origination scorecard knows the bureau, the application, and nothing else. A behavioral scorecard on an existing card account has the full payment, balance, and transaction history of that account, along with the bureau refresh. The behavioral scorecard is almost always more accurate within its existing customer base; it typically reaches Gini coefficients in the 0.55 to 0.70 range on 12-month horizons, while origination models for the same lender reach 0.40 to 0.55. The gap comes entirely from the richer internal data. 
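The window discipline described in this subsection can be sketched in pandas. This is a minimal illustration under stated assumptions, not the book's pipeline: the `transactions` frame, its column names (`account_id`, `txn_date`, `amount`), and the helper `window_features` are all hypothetical. The point it shows is the half-open interval: each window covers `[obs_date - k months, obs_date)`, so nothing dated on or after the observation date can enter a feature.

```python
import pandas as pd


def window_features(txns: pd.DataFrame, obs_date: str, months=(1, 3, 6, 12)) -> pd.DataFrame:
    """Aggregate per-account transaction features over trailing windows.

    Hypothetical helper. Each window is the half-open interval
    [obs_date - k months, obs_date), so rows dated on or after
    obs_date never contribute to any feature.
    """
    obs = pd.Timestamp(obs_date)
    # Keep every account in the output, even if some windows are empty.
    accounts = pd.Index(sorted(txns["account_id"].unique()), name="account_id")
    out = pd.DataFrame(index=accounts)
    for k in months:
        start = obs - pd.DateOffset(months=k)
        in_window = txns[(txns["txn_date"] >= start) & (txns["txn_date"] < obs)]
        agg = in_window.groupby("account_id")["amount"].agg(["count", "sum", "max"])
        agg.columns = [f"txn_{c}_{k}m" for c in agg.columns]
        out = out.join(agg)
    # An empty window means zero transactions and zero volume;
    # the max is left as NaN because it is undefined without data.
    fill_cols = [c for c in out.columns if c.startswith(("txn_count_", "txn_sum_"))]
    out[fill_cols] = out[fill_cols].fillna(0)
    return out


# Toy data (assumed, for illustration only).
transactions = pd.DataFrame({
    "account_id": ["A", "A", "A", "B", "B"],
    "txn_date": pd.to_datetime(
        ["2024-01-15", "2024-05-20", "2024-06-01", "2024-03-10", "2024-06-15"]
    ),
    "amount": [100.0, 50.0, 75.0, 200.0, 30.0],
})
features = window_features(transactions, obs_date="2024-06-01")
```

In the toy data, account A's transaction dated exactly on the observation date and account B's later transaction are both excluded from every window. The same function, run with the same `obs_date` logic in the production path, is what keeps the training and scoring features aligned.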
Two specific internal features consistently dominate in behavioral scorecards. The first is the ratio of payment to statement balance, also called the revolving-pay rate. A customer who pays the full balance every month is structurally lower risk than one who pays the minimum, even at the same utilization. The second is the trend in transaction frequency and amount over the last 3 to 6 months. A sudden drop in transaction count while balance grows is a strong short-horizon risk signal, often preceding a 30+ DPD event. ### Account-level versus customer-level modeling Some scorecards are built per-account; others are built per-customer. The choice matters. A customer-level model aggregates across all of the customer's accounts with the lender and produces one score. An account-level model produces a score per account, so a customer with three cards receives three scores. The customer-level model is more data-efficient but forces the aggregation to happen before the model sees the data. The account-level model is more flexible but requires the lender to manage three predictions for the same person. In practice, origination uses the customer level (one application, one decision) and account management uses the account level (per-card credit-line changes, per-card repricing). IFRS 9 and CECL (@sec-ch35) have specific implications: the expected credit loss calculation is at the account level, so account-level PDs are the operational requirement even when the decision model runs at the customer level. ### Application data The application form is the scorecard's only chance to see information a customer volunteers that is not in the bureau or the internal master file. Typical fields are employment status, occupation, income, employer tenure, housing tenure, marital status, dependents, purpose of loan, requested amount, and term. Most fields are self-reported. Some lenders cross-check a subset through payroll data providers or open-banking connections. 
Application features are high-signal for thin-file borrowers, who by definition have sparse bureau records. For customers with thick files, the marginal value of application data is lower, and the scorecard usually regresses toward bureau features. Income is the perennial exception. A consistent, verified income feature dominates many other inputs, even for thick-file customers. ### Tradeline-level features The bureau delivers tradelines as a repeated-measures structure. A report with 12 open tradelines contains 12 rows of account-level information, not 1 aggregated row. Turning that into a single observation per borrower is a feature engineering problem, and it is where most scorecard teams spend their time. Three families of aggregation dominate: 1. **Pointwise**: count of open installments, sum of revolving balances, and maximum utilization. 2. **Temporal**: count of tradelines with 30+ delinquency in the last 12 months, months since last delinquency, months since the oldest account opened. 3. **Structural**: presence of mortgage, presence of auto loan, ratio of secured to unsecured balance. Modern gradient-boosted scorecards work directly on 100 to 500 such derived columns. Classical scorecards collapse further to 15 to 30 features, selected by information value and stability. We build both kinds in @sec-ch07 and @sec-ch12; this chapter sets up the ingredients. Tradeline aggregation is where many hidden failure modes live. A common one is double-counting: if a mortgage is reported by both the servicer and the originator for a brief window, a naive aggregation double-counts it in the mortgage-count feature and in the sum-of-balances feature. Bureaus use identifier keys to deduplicate, but the keys are imperfect, and lenders usually write their own dedupe rules. Another is the treatment of authorized users: a thin-file consumer can ride on a spouse's or parent's revolving account, and the bureau reports the authorized user tradeline in the primary report. 
Whether the feature engine counts that line is a policy choice that affects both risk discrimination and fair lending posture. ### The credit-invisible and unscored populations A large fraction of adults in any jurisdiction have insufficient bureau data to produce a score. @brevoort2016credit estimate that roughly 26 million Americans are credit invisible, meaning they have no bureau record, and another 19 million are unscored, meaning their record exists but is too thin or stale for the bureau's scoring model. The invisible and unscored populations skew younger, non-white, and lower-income, which makes them the principal policy target for alternative-data scoring. Any scorecard designed for mass-market retail lending has to make an explicit choice about how to treat thin-file applicants. The three common choices are (1) to score them on a dedicated thin-file model, (2) to route them to a judgmental underwriter, or (3) to decline by default. Each choice has fair-lending implications that model governance must document. ### External overlays A bureau report is not the only external data source. Income verification services like The Work Number, Finicity, and Plaid provide real-time payroll feeds that confirm stated income within seconds. Fraud databases such as FICO's Falcon, LexisNexis ThreatMetrix, and Early Warning Services flag known fraud rings, synthetic identities, and device fingerprints. AML and sanctions screening, while not directly part of credit risk, feeds the same decision workflow, and its data quality affects the stability of any model downstream. Model governance treats each overlay as a separate data source requiring its own lineage documentation, its own monitoring, and its own retraining plan. ## Alternative data The line between "traditional" and "alternative" is time-dependent. Utility payment data was alternative in 2005 and is standard today. 
What the term currently means is any signal that is not in a nationwide credit bureau and not in a bank's own ledger. Five categories cover most of what large lenders have deployed or piloted. ### A working taxonomy 1. **Psychometric**: survey-style questionnaires designed to measure personality traits (conscientiousness, honesty, locus of control) that correlate with repayment. Deployed in frontier markets where bureau coverage is sparse. 2. **Behavioral and device**: smartphone metadata, browser fingerprints, typing dynamics, session-level app usage. @berg2020rise document that ten easily observed digital footprint variables, such as device operating system and time-of-day login patterns, deliver predictive power comparable to a credit bureau score at a German e-commerce lender. 3. **Transactional**: bank account data obtained through open banking APIs (PSD2 in Europe, CDR in Australia, the 1033 rule in the United States). Each transaction has a date, amount, counterparty, and a classification tag. 4. **Social and platform**: data collected inside a digital platform. @iyer2016screening and @lin2013judging show that in peer-to-peer lending, social ties and soft information embedded in listings contain residual risk information beyond traditional hard information. 5. **Utility and telco**: electricity, water, mobile phone bills. Thin-file consumers often have 6 to 24 months of telecom usage even when they have zero tradelines. Chinese BigTech platforms have reported that transactional and platform data dominate bureau data for their own ecosystem [@bis2020data; @gambacorta2024data]. The underlying point is older than FinTech. The richer the lender's view of the borrower's cash flow, the less a centralized bureau adds on the margin. ### The information content of alternative data For any new feature, the relevant question is whether it adds risk-adjusted discrimination on top of what the scorecard already has. 
@berg2020rise formalize this through nested models, regressing default on credit bureau score alone, on digital footprint alone, and on both. The marginal $R^2$ from adding digital footprints is comparable to the marginal $R^2$ of the bureau score itself. @gambacorta2024data run a similar test using Chinese fintech data and show that a model built on transactional data alone beats a model built on a bureau score alone, although the two together beat either. The regulatory question, which we will see in @sec-ch05, is whether the resulting model complies with fair lending rules. Alternative data that correlates with protected characteristics can open disparate impact exposure even when the feature is nominally neutral. The empirical finding in @fuster2022predictably is that machine learning models with richer feature sets can widen or narrow racial pricing gaps depending on the choice of input set. Data policy is a model policy. ### Pitfalls of alternative data Alternative data has three recurring failure modes. First, many signals drift fast. A social feature built in 2015 might not exist in 2025 because the platform has changed. Second, the distribution of missingness is usually not random. A customer without a smartphone has no device fingerprint, and the absence correlates with both income and credit risk. Third, the population on which an alternative data model is validated is almost always a self-selected sample of borrowers who consented to the data collection. Reject inference, covered in @sec-ch10, becomes essential. A fourth, less-discussed failure mode is the regulatory half-life of novelty. When a signal first enters the market, it usually carries strong predictive power and light compliance scrutiny. As it diffuses across lenders, the signal loses economic rent, adversarial actors learn to game it, and regulators begin to ask how it maps onto protected classes. 
@fuster2022predictably make the latter point sharply for ML scorecards: models with richer feature sets can move the distribution of predictions in ways that widen or narrow racial pricing gaps, and the direction is not determined by the model class alone. Data governance for alternative signals needs an explicit sunset clause in the same way that a model governance framework has a model retirement clause. ### Cash flow underwriting Cash flow underwriting is the most important new category of alternative data in the past five years. The workflow starts with a consumer-consented open-banking pull of 12 to 24 months of transaction history across all of the customer's depository accounts. A categorizer assigns each transaction a merchant category and a cash-flow tag (inflow, fixed outflow, discretionary outflow, transfer, fee). Aggregate features are computed at daily, weekly, and monthly frequency: net cash flow, income stability, rent coverage ratio, recurring debit presence, and overdraft count. @bis2020data argue that the informational content of this data can substitute for collateral in SME lending, which is a strong statement about its power. Two operational facts make cash-flow underwriting different from the other alternative categories. It is consented, so the data subject is aware of the collection in a way that is rarely true for device or behavioral features. And it is structured, so the feature engineering pipeline can be specified and tested deterministically. These properties make cash flow data easier to defend in model validation. They also constrain the universe: only applicants who connect a bank account have any signal at all, which maps directly back onto the missingness-at-design problem that structural alternative data creates. ### Summary of marginal information content Across the empirical literature, the pattern is consistent. 
A credit bureau score captures roughly two-thirds of the variation in default that the combined set of bureau, bank, and alternative data captures, with the remaining third split roughly evenly across application data (for thin-file borrowers) and alternative data (for both thin- and thick-file borrowers). The size of the alternative data contribution depends strongly on the customer segment. For prime thick-file borrowers, the alternative signal is mostly redundant. For thin-file, credit-invisible, and frontier-market borrowers, the bureau is sparse, and the alternative signal is essential. This heterogeneity argues for a segmented modeling approach rather than a single global model. Later chapters treat inclusion and segmentation head-on. ## Weight of evidence and information value ### Why bin features at all? Before any formula, it is worth asking why a credit team would throw away information by replacing a continuous income figure with the bucket it falls into. The answer is that binning exchanges resolution for six properties that matter more than resolution in the production setting where a scorecard runs for years against drifting data and is subject to formal model validation. 1. **Robustness to outliers, measurement error, and reporting noise.** Bureau fields are reported with varying conventions across furnishers; income is self-reported on applications and has long upper tails; tradeline counts are zero-inflated. Bins absorb all of this into a discrete level whose log-odds contribution is bounded. 2. **Explicit treatment of missingness.** A missing tradeline summary is not the same as a zero. Binning makes the missing level its own bin with its own WoE, so the model uses missingness as a signal when it is informative and ignores it when it is not. No imputation choice is hidden inside the pipeline. 3.
**Monotone, additive contribution to log-odds.** WoE-encoded features enter the logistic regression as linear terms whose coefficients factor cleanly into a base-odds piece and a per-bin contribution (@eq-logit-woe), which is what enables the points-based scorecard formulation in @sec-ch07. Underwriters and adverse-action systems can read each bin's contribution as a fixed point increment. 4. **Stability across resamples and through time.** A coarse five-bin partition of a feature is far more stable under resampling and population drift than a continuous coefficient, because the only thing that can change is the per-bin event rate. A scorecard that is refreshed annually but whose binning was set during development needs the binning to be stable across years; that is what the bootstrap and through-time diagnostics in the stability subsection below test. 5. **Two-stage decoupling of binning and coefficient estimation.** Bin selection is supervised but happens before the model is fit. The binning table is reviewed in isolation against IV, monotonicity, and bin-share rules. The coefficient estimation step then becomes essentially a sanity check, because the slope on a single WoE-encoded feature is approximately $-1$ in population. This decoupling is what lets a five-person scorecard team ship a model that a ten-person model risk team can validate. 6. **A model artifact that regulators read.** SR 11-7 model risk validators, ECOA Reg B examiners reviewing reason codes, and GDPR Article 22 explainability reviewers all expect a tabular artifact that lists every model input, every bin, and every contribution. The binning table *is* that artifact. No post-hoc explanation method (SHAP, LIME, permutation importance) produces something that validators trust the same way, because those methods are computed on the trained model rather than fixed at training time. The first three reasons are the technical ones taught in @siddiqi2017intelligent. 
The last three are the institutional reasons that explain why binned-WoE pipelines remain dominant in retail credit even decades after gradient boosting matched or exceeded their predictive performance. We return to the question of whether modern algorithms eliminate the need for any of this in the subsection on modern models below. Weight of evidence (WoE) is the canonical encoding used for this purpose in credit, going back to Kullback's work on information statistics in the 1950s [@kullback1951information] and commercialized in banking by Fair, Isaac, and Company in the 1970s. @siddiqi2017intelligent is the industry-standard reference for the practical pipeline. ### Formal definition Fix a feature that has been binned into $J$ disjoint bins $A_1, \dots, A_J$. For each bin define the share of goods and share of bads that fall in that bin: $$ g_j = \frac{\#\{i: x_i \in A_j, y_i = 0\}}{\#\{i: y_i = 0\}}, \qquad b_j = \frac{\#\{i: x_i \in A_j, y_i = 1\}}{\#\{i: y_i = 1\}}. $$ The weight of evidence for bin $j$ is the log-ratio of those shares: $$ \mathrm{WoE}_j = \ln\!\left(\frac{g_j}{b_j}\right). $$ Positive WoE means the bin is enriched with goods relative to the base rate. Negative WoE means the bin is enriched with bads. The information value of the feature is the weighted sum $$ \mathrm{IV} = \sum_{j=1}^{J} (g_j - b_j) \mathrm{WoE}_j = \sum_{j=1}^{J} (g_j - b_j) \ln\!\left(\frac{g_j}{b_j}\right). $$ Practitioners rank features by IV with the rules of thumb from @siddiqi2017intelligent: IV less than 0.02 is unpredictive, 0.02 to 0.1 is weak, 0.1 to 0.3 is medium, 0.3 to 0.5 is strong, and above 0.5 is suspiciously good and usually means the bin count is too fine or the feature is leaky. ### A worked example by hand Before invoking `optbinning`, it helps to compute the formulas in @eq-shares and @eq-iv on a toy portfolio with three bins. 
Suppose 1,000 applicants split across an `income_bucket` feature with $G = 900$ goods and $B = 100$ bads in total: Read the table row by row. The Low bin holds $250/900 \approx 27.8\%$ of all goods but $50/100 = 50\%$ of all bads. Its WoE is $\ln(0.278/0.500) \approx -0.59$, a negative value flagging higher-than-average risk, consistent with a $50/300 = 16.7\%$ bad rate against the population base rate of $10\%$. The High bin reverses the imbalance and contributes $\mathrm{WoE} \approx +0.44$. The IV total of about $0.21$ would land the feature in the "medium predictive" tier of the rule of thumb above. Three sanity checks the table makes obvious: - The columns $g_j$ and $b_j$ each sum to $1$, because they are class-conditional shares. - A bin with $g_j = b_j$ contributes zero to IV, because $\ln(1) = 0$. Equality in every bin is exactly the case where the feature is independent of $Y$. - Both negative and positive WoE bins contribute positively to IV, because the signs of $(g_j - b_j)$ and $\mathrm{WoE}_j$ always match. These same numbers are what `optbinning` would print if it were given the same three bins, modulo the Laplace pseudo-count discussed in the from-scratch implementation below. ### Equivalence to a symmetric KL divergence The IV in @eq-iv is exactly the symmetrized Kullback-Leibler divergence between the class-conditional distributions of the binned feature. Let $P$ denote the distribution of $X$ conditional on $Y = 0$ and $Q$ the distribution of $X$ conditional on $Y = 1$, so $P(A_j) = g_j$ and $Q(A_j) = b_j$. The KL divergence of $P$ from $Q$ is $$ D_{\mathrm{KL}}(P \parallel Q) = \sum_j g_j \ln(g_j/b_j). $$ By symmetry, $$ D_{\mathrm{KL}}(Q \parallel P) = \sum_j b_j \ln(b_j/g_j) = -\sum_j b_j \ln(g_j/b_j). $$ Adding the two, $$ D_{\mathrm{KL}}(P \parallel Q) + D_{\mathrm{KL}}(Q \parallel P) = \sum_j (g_j - b_j)\ln(g_j/b_j) = \mathrm{IV}. 
$$ Information value is the Jeffreys divergence between the good and bad class-conditional feature distributions [@kullback1951information]. Two consequences follow: 1. IV is always non-negative, with equality if and only if $g_j = b_j$ for all $j$, the case where the feature contains no information about the label. 2. IV is additive across independent disjoint features only in the limit where no feature carries any information contained in another, so the IV-based ranking should be read as a marginal screen, not a joint optimum. ### Connection to logistic regression The link to logistic regression is tight. Conditional on bin $j$, $$ \mathrm{logit} \Pr(Y = 1 \mid X \in A_j) = \ln\!\left(\frac{\Pr(Y=1, X \in A_j)}{\Pr(Y=0, X \in A_j)}\right) = \ln\!\left(\frac{\pi_1}{\pi_0}\right) + \ln\!\left(\frac{b_j}{g_j}\right) = \alpha - \mathrm{WoE}_j, $$ where $\alpha = \ln(\pi_1 / \pi_0)$ is the log base-odds. Fitting a logistic regression on a single WoE-encoded feature recovers an intercept equal to $\alpha$ and a slope equal to $-1$ in population. In sample, the slope is close to $-1$ and the deviation measures how close the bin assignment is to the saturated model. Because of @eq-logit-woe, WoE encoding gives logistic regression coefficients that factor cleanly into a base-odds piece and a bin-contribution piece, which is what enables the standard points-based scorecard formulation we develop in @sec-ch07. ### Empirical ranking on Taiwan default `optbinning` [@navas2020optimal] uses a mixed-integer programming formulation to find an optimal monotone binning that maximizes IV subject to monotonicity and bin-size constraints. The algorithm extends classical supervised discretization [@fayyad1993multi] by enforcing risk monotonicity, which is what underwriters expect from a scorecard. The highest-IV feature in the Taiwan data is the most recent delinquency code (`PAY_0`), followed by older payment codes, which matches the domain intuition. 
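The two identities derived above, IV as the Jeffreys divergence and the bin-level logit as $\alpha - \mathrm{WoE}_j$, are easy to confirm numerically before trusting any library output. The sketch below is a minimal check, not the book's own code. It assumes a hypothetical three-bin count table: only the Low row (250 goods, 50 bads) and the totals (900 goods, 100 bads) come from the worked example, while the Medium and High counts are illustrative values chosen to respect those totals.

```python
import numpy as np

# Hypothetical per-bin counts, ordered Low, Medium, High.
# Only the Low row and the column totals (900 goods, 100 bads) are from the text;
# the Medium and High rows are assumptions made for illustration.
goods = np.array([250.0, 370.0, 280.0])
bads = np.array([50.0, 30.0, 20.0])

# Class-conditional bin shares g_j and b_j.
g = goods / goods.sum()
b = bads / bads.sum()

# Weight of evidence per bin and information value of the feature.
woe = np.log(g / b)
iv = np.sum((g - b) * woe)

# The two directed KL divergences between goods (P) and bads (Q).
kl_pq = np.sum(g * np.log(g / b))
kl_qp = np.sum(b * np.log(b / g))

# Bin-level logit of the bad rate, computed directly and via alpha - WoE_j.
alpha = np.log(bads.sum() / goods.sum())  # log base-odds ln(pi_1 / pi_0)
logit_direct = np.log(bads / goods)
logit_from_woe = alpha - woe

print("WoE per bin:", np.round(woe, 4))
print("IV:", round(float(iv), 4), "| KL sum:", round(float(kl_pq + kl_qp), 4))
print("max logit discrepancy:", float(np.max(np.abs(logit_direct - logit_from_woe))))
```

On these assumed counts the Low-bin WoE is about $-0.59$, the High-bin WoE about $+0.44$, and the IV about $0.21$, in line with the worked example. The IV and the sum of the two KL terms agree to floating-point precision, as do the two logit computations.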
The `summary()` table includes two additional ranking columns alongside `iv`, both computed on the same binned distributions and sometimes preferred when IV is unstable. - **`js`: Jensen-Shannon divergence.** A bounded, symmetric variant of the KL divergences from @eq-kl-pq and @eq-kl-qp. With $M = (P + Q)/2$ the bin-share midpoint, $\mathrm{JS}(P, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M)$. Always lies in $[0, \ln 2]$, so it cannot blow up the way IV can when a bin is nearly empty. Read the same way as IV: bigger means more class separation. Often used as a robustness check on the IV ranking. - **`gini`: bin-level Gini coefficient.** Twice the area between the bin-ordered cumulative-goods and cumulative-bads curves; equivalently $2 \cdot \mathrm{AUC} - 1$ computed using the bin-ordered WoE as the score. Reported on $[0, 1]$. Same monotone direction as IV but scaled like the discriminative AUC measure used in @sec-ch04, so it lets the scorecard team compare a feature's marginal predictive contribution against the overall model AUC in the same units. The full treatment of AUC, Gini, KS, and Brier sits in @sec-ch04. In practice, these three columns rank features almost identically; large disagreements are a flag that one bin is dominating IV, in which case a coarser binning or a JS-based ranking is the safer choice. ### WoE-encoded features versus raw features in logistic regression The WoE-encoded logistic model gains roughly 4 to 5 AUC points relative to standardized raw features on Taiwan. The gap narrows once we move to flexible models like gradient boosting, which can internally approximate monotone step functions. The gap persists, however, for the class of linear models that regulators prefer because they are auditable. Three mechanisms drive the gap, and naming them clarifies when WoE will help and when it will not. 1. 
**Linearization in log-odds.** WoE turns a non-monotone or kinked empirical risk curve into a monotone, additive contribution in log-odds space, exactly the space the logistic link operates in. A single linear coefficient then fits a relationship that the raw feature would need a polynomial, spline, or piecewise basis to express. 2. **Common units across columns.** WoE rescales every feature, numeric or categorical, into log-odds. The L2 penalty in `LogisticRegression` therefore stops privileging high-variance columns over high-information ones, which is what causes the standardized-raw baseline to leak coefficient mass to noisy features. 3. **Bounded leverage.** Outliers and missing values are absorbed into bins with finite WoE, so a single extreme observation cannot tilt the regression line. Standardization shifts and scales but leaves the long tail intact, and a logistic regression with a few extreme rows still gets dragged. None of the three is unique to WoE. Splines, target encoders, and isotonic transforms each capture some subset. WoE is the only encoding that captures all three *and* preserves a binning table that a model validator can read. That second property is what makes it the default in regulated retail credit, even where it does not maximize AUC. ### WoE is univariate: handling interactions and non-linearity WoE is computed one feature at a time, and the IV in @eq-iv treats each feature in isolation. The encoding captures non-linearity *within* a feature (e.g., the bin shape can be U-shaped, monotone, or step), but it does not capture any signal that lives only in the joint distribution of two or more columns. "High credit limit is risky only when paired with a thin payment history" is invisible to a WoE-plus-logistic model unless the interaction is engineered explicitly. 
Equivalently, the IV-based ranking is a marginal screen: a feature with low IV may still carry conditional information that a joint model would use, and a feature with high IV may be redundant given another already in the model. This is the structural reason for the AUC gap pattern in the comparison above and in the end-to-end benchmark in @sec-ch03-benchmark. Three remedies sit on a complexity-versus-interpretability spectrum. 1. **Hand-built interaction features.** Cross-tabulate two binned features into a single categorical (`PAY_0` × `LIMIT_BAL_bin`), then WoE-encode the cross. The result stays inside the scorecard pipeline and remains auditable. Cost: combinatorial explosion if used liberally, and small bin counts that hurt stability. 2. **Two-dimensional supervised binning.** `optbinning` ships an `OptimalBinning2D` that solves the same MIP over a pair of features. Useful for a small number of known-interactive pairs (utilization × age of oldest tradeline is a classic). 3. **Segmented scorecards.** Fit one scorecard per pre-defined segment (thin-file vs thick-file, secured vs unsecured). Interactions with the segmenting variable are absorbed by the segmentation. @sec-ch31 treats this in depth. If the dominant signal is genuinely interactive, none of the above competes with a tree ensemble that splits in arbitrary feature combinations by construction. The empirical fact that WoE-plus-logistic stays close to gradient boosting on regulated retail credit data is a statement about that data: monotone main effects from delinquency, utilization, and tenure dominate, and interactions are second-order, not a general property of the encoding. On data where interactions are first-order (fraud, marketing response on heterogeneous customer pools), the calculus reverses and a tree ensemble is the right starting point. ### A from-scratch implementation It is good practice to verify the library against a short NumPy reference. 
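A minimal sketch of such a NumPy reference is shown here. It is a reconstruction under stated assumptions rather than the book's stripped code chunk: the name `woe_iv_from_bins` and the `laplace` argument appear in the surrounding text, but the exact signature, the way the pseudo-count enters the bin shares, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def woe_iv_from_bins(x, y, edges, laplace=0.5):
    """Per-bin WoE and total IV for fixed interior bin edges.

    Assumed conventions (not confirmed by the source): y == 1 marks a bad,
    bins are (-inf, e1], (e1, e2], ..., (eK, inf), x has no missing values,
    and `laplace` is a pseudo-count added to every bin's good and bad counts.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    edges = np.asarray(edges, dtype=float)
    n_bins = len(edges) + 1

    # right=True gives bin i when edges[i-1] < x <= edges[i].
    idx = np.digitize(x, edges, right=True)

    goods = np.bincount(idx[y == 0], minlength=n_bins).astype(float)
    bads = np.bincount(idx[y == 1], minlength=n_bins).astype(float)

    # Smoothed class-conditional shares; each vector still sums to one.
    g = (goods + laplace) / (goods.sum() + laplace * n_bins)
    b = (bads + laplace) / (bads.sum() + laplace * n_bins)

    # With laplace=0 an empty bin produces an infinite WoE here.
    woe = np.log(g / b)
    iv = float(np.sum((g - b) * woe))
    return woe, iv


# Synthetic stand-in data: risk rises with x, so WoE should fall across bins.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
p_bad = 1.0 / (1.0 + np.exp(-(-2.0 + 0.8 * x)))
y = (rng.random(5000) < p_bad).astype(int)
edges = [-1.0, 0.0, 1.0]

woe_raw, iv_raw = woe_iv_from_bins(x, y, edges, laplace=0.0)
woe_smooth, iv_smooth = woe_iv_from_bins(x, y, edges, laplace=0.5)
print("unsmoothed WoE:", np.round(woe_raw, 4), "IV:", round(iv_raw, 4))
print("smoothed WoE:  ", np.round(woe_smooth, 4), "IV:", round(iv_smooth, 4))
```

Passing the bin edges from a fitted library binning into a function like this, once with `laplace=0` and once with the production pseudo-count, is one way to separate a genuine implementation mismatch from the small shift that smoothing introduces.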
The reference function `woe_iv_from_bins` reproduces `optbinning`'s per-bin WoE for a fixed set of bin edges. The two IV numbers (here $0.2002$ from `optbinning` and $0.1998$ from the scratch implementation) differ by about $4 \times 10^{-4}$, with the scratch value the smaller of the two. The gap is not a constant offset: it comes from the Laplace pseudo-count of $0.5$, which shrinks the empirical bin shares toward uniform, compresses the WoE magnitudes, and therefore lowers the IV. Setting `laplace=0` in `woe_iv_from_bins` reproduces `optbinning`'s value exactly whenever no bin is empty, which is the right cross-check to run during development. The pseudo-count earns its keep in production, where a single bin occasionally drops to zero in a refresh sample and an unsmoothed $\ln(0)$ would crash the scoring service. ### Reading a binning table This table is the canonical artifact a scorecard developer hands over for model validation. Each row is a bin. The columns show the bin boundaries, the fraction of the population in the bin, the event rate, the WoE, and the IV contribution. The totals row at the bottom gives the IV for the feature. Binning tables like this one are what SR 11-7 [@sr117] validators will read first when reviewing a scorecard submission. Three diagnostics in the binning table carry most of the validation signal. First, the WoE column should be monotone when the feature is ordinal and the business logic calls for monotonicity. An income feature that increases in WoE, then dips in one of the upper bins, then increases again, either has too many bins, has a sample-size problem, or has a real structural break (high earners with complicated tax situations, for instance) that needs a dedicated flag. Second, the bin-share column (often labeled `count (%)`) should not have any bin with less than 3 to 5 percent of the population. Small bins have unstable WoE and produce scorecards that swing wildly under resampling.
Third, the event rate column should step smoothly across bins when ordered by WoE. Large jumps suggest that the bin boundaries are not where the risk boundary actually sits. The `optbinning` algorithm is a mixed-integer programming formulation that enforces these properties globally rather than through heuristic post-processing [@navas2020optimal]. It extends the classical supervised discretization literature, which treats binning as a greedy information-gain split, by adding constraints that express what a scorecard developer would actually want. The classical reference is @fayyad1993multi, which uses a minimum description length principle to choose the number of bins. The MIP formulation subsumes this as a special case with different constraints. ### Edge cases in WoE encoding Three edge cases appear in almost every scorecard build, and each has a standard fix. **Zero-count bins.** A bin may contain no goods or no bads in the training sample. The raw WoE in @eq-woe is then $-\infty$ (no goods, so $g_j = 0$) or $+\infty$ (no bads, so $b_j = 0$). Three fixes exist: 1. A Laplace smoothing term adds a pseudo-count of $0.5$ or $1$ to each bin. 2. A bin merge folds the offending bin into an adjacent bin with consistent risk. 3. An `optbinning` constraint on minimum bin size avoids the zero-count state entirely during fitting. > The Laplace fix is the simplest and adequate for most production code. **Out-of-range values at scoring time.** Production data may contain values that fall outside the training-time range. For numeric features, the conventional answer is to extend the outermost bins to $(-\infty, \text{edge}_1]$ and $(\text{edge}_{J-1}, \infty)$, so every value maps to a bin. For categorical features, the analog is to keep a catch-all level that absorbs unseen categories with a WoE set equal to the population-weighted average of the observed levels. **Missing values.** `optbinning` treats missing values as a separate bin, which is the behavior we want.
If the missingness itself is informative, the WoE of the missing bin will be nonzero, and the model will use it. If not, the WoE will be close to zero, and the model ignores it. Either way, the treatment is explicit rather than hidden behind a mean imputation. ### Stability of WoE under resampling A scorecard is only useful if the bin edges and WoE values are stable. Two diagnostics verify stability: bootstrap resampling and through-time partitioning. If the bootstrap standard deviation of IV is large relative to the mean, the feature's binning is unstable and the scorecard will be brittle. On the `PAY_0` feature the bootstrap spread is modest, which is the expected picture for a strongly predictive feature. Features that fail bootstrap stability almost always benefit from a coarser binning. ### The relationship to supervised discretization WoE is one member of a family of supervised discretization methods. @fayyad1993multi introduced the entropy-based minimum description length principle for choosing the number of bins. Chi-merge and ChiSquare-based methods use a test of independence between adjacent bins as the merge criterion. The CART tree, treated at length in @sec-ch11, is a univariate supervised binner when grown as a stump, and its splitting criterion is Gini impurity rather than WoE. All of these methods can be expressed as choices of the split function and the stopping criterion; the `optbinning` MIP formulation lets the user specify both explicitly. The deep reason WoE dominates in credit is not statistical performance but interpretability. A CART split gives a threshold and a count; WoE gives a threshold, a count, and a log-odds contribution that plugs directly into a points-based scorecard. The transformation from WoE to scorecard points is linear in the log-odds, which @sec-ch07 works out in detail. ### Do modern algorithms still need binning and WoE? 
A reasonable reading of the chapter so far is that WoE is a piece of legacy machinery that exists because logistic regression cannot handle non-linearity on its own. Gradient boosting splits at arbitrary thresholds, neural networks learn arbitrary feature transformations, and either approach matches the AUC of a WoE-plus-logistic pipeline on most retail credit data. So why bin? The honest answer has two parts. The first is that for **predictive performance alone**, you do not need WoE if your downstream model is a tree ensemble or a sufficiently regularized neural network. The end-to-end benchmark in @sec-ch03-benchmark makes this explicit: a gradient-boosted model on raw features matches the WoE-plus-logistic configuration on Taiwan within noise. Tree learners pick their own thresholds, absorb missingness through surrogate splits or dedicated handling, and are insensitive to monotone transforms of inputs. WoE is computationally and statistically wasted on them. The second part is that **production credit scoring is not a pure prediction problem**. It carries six constraints that WoE-style preprocessing addresses by construction and that modern algorithms address only with extra effort: - **Reason codes for adverse action notices** under ECOA Reg B and FCRA §1681m must be ordered, stable, and explainable per applicant. A scorecard derives reason codes mechanically from the per-bin WoE contributions; a gradient boosting model derives them from SHAP values, which are post-hoc, sample-dependent, and not always monotone in the underlying feature, even when the feature is supposed to be. - **Monotonicity constraints** on features such as utilization, delinquency, and tenure are required by both regulators and underwriters. 
WoE binning enforces monotonicity at the binning step; LightGBM and XGBoost support monotonicity flags but at a real cost in fit, and neural networks need either a Lipschitz architecture or a monotone-by-construction layer such as in @sill1998monotonic. - **Stability under population drift** is harder to guarantee with continuous splits chosen on a single training sample than with bins reviewed against a stability index. Champion scorecards stay in production five-plus years; champion gradient-boosted models are typically refreshed every quarter. - **Auditability of the model artifact** by SR 11-7 model risk management groups, by GDPR Article 22 explainability reviewers in the EU, and by a non-technical credit committee. The binning table is the artifact those audiences read. SHAP and partial dependence are explanations *of* the artifact, not the artifact itself. - **Reproducibility across pipelines.** A binning table can be re-implemented by a different team in a different language with the same WoE values. A gradient-boosted model with a particular set of hyperparameters cannot be reproduced exactly outside the original training pipeline. - **Sample efficiency for thin-file segments.** Where the relevant subpopulation is small (frontier-market lenders, new-to-credit applicants, niche product lines), a five-bin discretization extracts more reliable signal than a continuous spline that needs many degrees of freedom to fit non-linearity. The pragmatic stack used by many regulated lenders today is a hybrid. Gradient boosting on raw or lightly engineered features is used for ranking and for the challenger model; a WoE-binned logistic scorecard is used for the production decision model that is actually deployed. The two are reconciled with a calibration step. 
Where a single model must serve both purposes, the rising option is the **Explainable Boosting Machine** (EBM) of @lou2013accurate and @nori2019interpretml, which fits a generalized additive model with one shape function per feature and optionally one shape function per pairwise interaction. EBMs are essentially a continuous-bin generalization of WoE, with the per-feature shape playing the role of the WoE column and the per-pair shape playing the role of an `OptimalBinning2D` cross. They typically match gradient boosting on AUC while preserving the per-feature artifact regulators expect. So the short answer to "is this an artifact of logistic regression from decades ago?" is: only the *encoding* is. The underlying constraints (e.g., reason codes, monotonicity, stability, auditability, sample efficiency) are not artifacts of the algorithm; they are properties of the regulatory and operational environment in which credit models live. WoE persists because it satisfies those constraints almost for free, not because anyone is nostalgic for the 1970s. ## Missing data Every real credit data set has missing values. A bureau-less applicant has no tradelines. An application that skipped an optional field has nulls. An open-banking connection that failed mid-session has a truncated history. How missingness gets handled determines, in practice, whether a model works in production for the tail of the customer base where it matters most. ### Rubin's taxonomy @rubin1976inference classified missing-data mechanisms into three types. Let $X$ be the complete data, $M$ the missingness indicator matrix, and $X_{\mathrm{obs}}, X_{\mathrm{mis}}$ the observed and missing partitions. - **MCAR** (missing completely at random): $\Pr(M \mid X) = \Pr(M)$. Missingness is independent of both observed and unobserved data. Rare in practice. The classic example is a random sensor dropout. - **MAR** (missing at random): $\Pr(M \mid X) = \Pr(M \mid X_{\mathrm{obs}})$. 
Missingness depends only on observed variables. An income field that is more likely to be blank for younger applicants is MAR, because age is observed. Under MAR, likelihood-based methods and multiple imputation give unbiased estimates when the model is correctly specified. - **MNAR** (missing not at random): $\Pr(M \mid X)$ depends on $X_{\mathrm{mis}}$. Missingness depends on the unobserved value itself. High earners who decline to report income are MNAR. Imputation alone cannot recover the full data distribution without assumptions. @little2019statistical gives the textbook treatment. For scorecard work, the relevant message is that MCAR is a convenient fiction, MAR is often defensible given rich observed data, and MNAR is the state of the world for sensitive fields like income, mortgage balance at outside institutions, or credit applications at competitors. ### Imputation strategies Five strategies cover most of what a scorecard pipeline needs: 1. **Simple statistic**: replace with the column mean, median, or mode. Fast, unbiased under MCAR, biased otherwise. Collapses variance. 2. **Indicator plus statistic**: add a binary "was missing" column and impute the underlying value. Captures the information in the fact of missingness, which for credit is often predictive on its own. 3. **k-nearest neighbors**: find the $k$ most similar rows under a defined distance, average their values. Works well when the data has strong local structure. Compute scales quadratically with sample size. 4. **Multivariate iterative (MICE)**: model each incomplete feature as a function of the others, iterate to convergence [@vanbuuren2011mice; @white2011multiple]. Scikit-learn's `IterativeImputer` is a Python implementation. Valid under MAR given mild assumptions on the conditional models. 5. **Model-based with native support**: tree learners like XGBoost, LightGBM, and CatBoost natively route missing values to the child node that minimizes loss. For those learners, imputation is a pre-model choice only if you also run a linear benchmark.
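The first four strategies in the list map onto scikit-learn imputers, and a short sketch makes the differences in output shape concrete. The data here is a synthetic two-column stand-in (hypothetical `income` and `age` columns with MCAR missingness injected into `income`), not one of the book's datasets, and the hyperparameters are illustrative defaults rather than tuned recommendations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Synthetic stand-in: income depends on age, then 20% of income goes missing (MCAR).
rng = np.random.default_rng(0)
n = 500
age = rng.normal(40, 10, size=n)
income = 20 + 0.8 * age + rng.normal(0, 5, size=n)
X = np.column_stack([income, age])
miss = rng.random(n) < 0.2
X[miss, 0] = np.nan

# 1. Simple statistic: every missing income gets the observed median.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# 2. Indicator plus statistic: same fill, plus a 0/1 "was missing" column
#    appended for each feature that had missing values at fit time.
X_ind = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X)

# 3. k-nearest neighbors: average income over the 5 closest rows by age.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X)

# 4. Multivariate iterative (MICE-style): regress income on age, fill, repeat.
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

# 5. Native support: for tree learners, X would be passed through with NaNs intact.

print("median fill shape:    ", X_median.shape)
print("with indicator shape: ", X_ind.shape)
print("variance of income, observed vs median-filled:",
      round(float(np.nanvar(X[:, 0])), 2), round(float(X_median[:, 0].var()), 2))
```

The variance comparison in the last line shows the "collapses variance" caveat from item 1: the median-filled column is less dispersed than the observed values because a block of rows now sits at a single point. In a real pipeline the fitted imputer would be persisted alongside the model so that scoring applies the training-time statistics.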
For credit scoring, the missing-indicator strategy deserves first-line status. If an applicant failed to fill in an optional field, that refusal often correlates with risk. Losing it by silent mean-imputation is a substantive information loss. ### Simulated experiment on German credit We take the German credit data, inject missingness under the three mechanisms, and compare five imputation strategies on the downstream AUC of a logistic model. The MCAR mechanism drops each cell independently with probability 0.2. The MAR mechanism drops a cell with probability 0.4 when `duration` is in the top 40 percent and 0.08 otherwise. The MNAR mechanism drops a cell with probability 0.4 when the cell's own value is in the top 30 percent of its column and 0.04 otherwise. Credit analogs map cleanly: MAR is "applicants with long tenure skip optional fields"; MNAR is "applicants with high balances skip the balance field". Two observations. First, no imputer dominates across all three mechanisms, which is the expected result once we understand that each method encodes different assumptions. Under MCAR the mean-plus-indicator strategy leads because the indicator itself is random and the underlying distribution is symmetric. Under MAR, mean or median imputation is competitive because the drivers are observed in the other features. Under MNAR no method recovers the full signal. The second observation is that the differences are small in absolute terms, a few AUC points, but they compound into large differences in expected profit when stacked across features. This is where adding a missingness indicator pays dividends for essentially zero cost. ### When to use which A practical decision rule for scorecard work: - **Categorical feature with natural "missing" level**: keep the missing level as its own bin. WoE handles it. This is the default in `optbinning`. - **Continuous numeric with MAR missingness and rich observed data**: `IterativeImputer` (MICE) with a small number of iterations. 
Add an indicator if the missingness rate exceeds 5 percent. - **Sparse high-dimensional matrix with MNAR fields**: keep the indicator, impute the value with a median. Do not try to be clever. - **Tree learner downstream**: do not impute. Pass NaNs through. Compare to imputation only if required by governance. - **Linear or neural scorecard**: impute explicitly and persist the imputer in the same artifact as the model. The industry's single largest imputation failure mode is pipeline drift. The training data had a missingness rate of 2 percent for income. The production data has 15 percent, because an upstream vendor changed a default. The imputer silently mean-fills the missing values, and the score concentrates near the population mean. Monitor missingness rates on every feature in production with the same care that you monitor PSI (@sec-ch16). ### What about matrix completion, GAIN, MIWAE, MissForest? A reasonable reader will ask why the list above stops at MICE when the imputation literature has moved on to low-rank matrix completion [@mazumder2010softimpute], random-forest imputation [@stekhoven2012missforest], generative-adversarial imputation [@yoon2018gain], deep-latent-variable imputation [@mattei2019miwae], and AutoML-style imputer selection [@jarrett2022hyperimpute]. The answer is not that these methods are bad. They are excluded from the recommended scorecard pipeline for four reasons that all bind at the same time in credit. First, **low-rank assumptions do not fit credit feature matrices**. Matrix completion methods assume the underlying data matrix is approximately low-rank, which is the right model for collaborative filtering (a few latent taste factors generate every user-item rating) and for image inpainting (smoothness in pixel space). Credit features are a heterogeneous mix of bureau scores, demographics, employment fields, behavioral aggregates, and self-reported items.
There is no shared low-dimensional latent factor that generates all of them, and SoftImpute-style nuclear-norm completion silently shrinks toward a basis that has no interpretation. Empirically, low-rank completion is competitive on dense numeric panels (e.g., genomics) and weak on the wide mixed-type tables that scorecards consume. Second, **the inference-time story is broken or expensive**. A scorecard must score one new applicant in milliseconds. Matrix completion, GAIN, and MIWAE were all designed for the in-sample setting, where you complete a fixed matrix once. Out-of-sample completion for a single new row requires either projection onto a stored basis (matrix completion), a forward pass through a generator trained on potentially stale data (GAIN), or sampling from a learned posterior (MIWAE). Each of these adds a second model to the production path that must itself be versioned, monitored for drift, and validated under SR 11-7. The marginal AUC gain rarely justifies the operational cost. Third, **the empirical lift on tabular data is small**. Two large published benchmarks of imputation methods on tabular data, @jager2021benchmark and @lemorvan2021whatsgood, both find that median imputation with a missing-indicator column is within one or two AUC points of the best deep imputer on essentially every downstream classification task they test. Where the deep methods win, they win by margins that are smaller than the variance across random seeds. For credit, where the dominant model is a gradient-boosted tree that handles NaN natively (@sec-ch12-xgboost), the practical gain from a sophisticated imputer is close to zero. Fourth, **governance penalizes opacity**. A bank validator will ask three questions about any imputer: what assumption does it encode, what happens when production missingness drifts, and how would you detect a regression. Median-plus-indicator answers all three in one sentence each. 
GAN- or VAE-based imputation answers none of them cleanly, and the validator will require a separate model risk file, a champion-challenger setup, and ongoing monitoring of the imputer's own outputs. This burden is real and is why most production scorecards still ship with median-plus-indicator or with a tree learner that ignores the question entirely. The honest summary is that matrix completion and its deep-learning successors are excellent tools for the problems they were designed for (collaborative filtering, image inpainting, gene-expression panels) and a poor fit for the wide-mixed-type, low-latency, high-governance environment of credit scoring. The gap is not theoretical sophistication; it is fit-for-purpose. A team that wants to experiment with HyperImpute or MissForest as a challenger to the median-plus-indicator champion should do so, but the production default belongs with the simpler tool. ### Multiple imputation and variance inflation Single imputation, where each missing cell is replaced with one value, systematically understates downstream standard errors. Multiple imputation draws $M$ imputed data sets, fits the model on each, and combines the results using the rules of @rubin1976inference. @vanbuuren2011mice and @white2011multiple treat the methodology in detail. For credit scoring, the pragmatic reality is that $M = 1$ is almost always used. The argument is that scorecard inference is about the predicted probability rather than the coefficient standard errors, and the downstream decisions (approve, decline, price) are robust to the kind of variance inflation that single imputation glosses over. That argument is correct for pure decision-making, but breaks down when the model's coefficients are used in capital calculations under Basel IRB, where the regulator cares about the confidence interval around the PD. 
Banks that run IRB models generally carry a multiple-imputation pipeline for a subset of critical features, even when single imputation is the default for decision making. ### Imputation and monotonicity A subtle but important property of any imputer is whether it preserves the monotone risk structure that scorecard binning relies on. Mean imputation breaks monotonicity because the imputed value sits near the middle of the distribution, while the underlying missing-value risk may be at one of the tails. Median imputation has the same problem. An imputation strategy that restores monotonicity is to impute to the value that matches the observed risk of the missing group. Operationally, this means fitting a univariate logistic regression of the label on the feature in the non-missing subsample, then assigning the imputed value so that the predicted log-odds of a missing row equals the empirical log-odds of the missing subset. This is worth the effort when downstream scorecard monotonicity is a constraint, and overkill otherwise. The construction is short enough to demonstrate in code. We inject MNAR missingness on `duration` (longer-duration applicants are more likely to have the field blank, and longer duration is also riskier), then compare the imputed value chosen by the mean, the median, and the risk-matched rule. The risk-matched row has `abs_gap` equal to zero by construction. The mean and median rows have a positive gap whose sign tells you which direction the imputer is pulling the missing population: a negative `pred_logit_at_imp` minus `empirical_logit_missing` means mean or median imputation is making the missing rows look safer than they really are, which is exactly the failure mode that breaks monotonicity in the downstream scorecard. The AUC differences are typically small on a single feature, which matches the broader finding from the simulated experiment above. 
The point of risk matching is not to win the AUC race; it is to keep the scorecard's monotone risk structure intact when the binning step downstream insists on it.

Two further notes. First, the rule generalizes from a single feature to a multivariate setting by replacing the univariate logit with a model that uses all observed features and solving for the imputed value of the missing column at the row's own observed covariates. Second, the rule is a single-imputation device. If you need calibrated standard errors, draw the imputed value from the posterior predictive distribution of the univariate logit instead of using the point estimate, then combine the results across draws using Rubin's rules.

### Missingness indicator interpretation

When a missingness indicator column is added, the logistic coefficient on it has a direct risk interpretation. Let $M_j$ be the indicator that feature $j$ is missing and $X_j^{\text{imp}}$ be the imputed value. The fitted model has the form

$$
\mathrm{logit} \Pr(Y=1 \mid X, M) = \alpha + \beta_j X_j^{\text{imp}} + \gamma_j M_j + \cdots
$$

The coefficient $\gamma_j$ measures the risk premium (or discount) associated with the fact of missingness itself, holding the imputed value constant. A positive $\gamma_j$ with a substantial magnitude says that a missing-on-$X_j$ row is riskier than an observed row with the same imputed value, which is direct evidence that the missingness mechanism is MNAR. A near-zero $\gamma_j$ is evidence that the mechanism is plausibly MCAR or MAR.

> Two cautions apply. First, the indicator is collinear with the imputed value if every imputed row shares the same value, which in mean imputation is always the case for a single feature. Regularized logistic regression handles this cleanly; unpenalized logistic regression may show nonfinite standard errors.
Second, if two features have correlated missingness (for example, both self-reported income and self-reported employment length are blank for the same applicant), adding both indicators recovers almost all the useful signal but can make the model's coefficients unstable across resampling. A joint "application-form incomplete" indicator often works better than two separate indicators.

## Feature selection

Scorecards live with more features than they use. A fintech's feature store routinely holds thousands of columns. A deployed model uses 10 to 60. The process of getting from one to the other is feature selection. Four approaches cover almost everything production teams deploy.

### IV filter

The simplest screen is to rank every feature by information value and keep the top $K$, or keep every feature with $\mathrm{IV} \ge 0.02$. This is a univariate filter. It ignores correlations. In a scorecard with 500 candidate features and heavy correlation, it is nevertheless the right first step, because it cuts the search space from 500 to 100 or 50 before any multivariate method runs.

On Taiwan most of the 23 numeric features clear the 0.02 threshold. The handful that fall below are demographic fields whose marginal signal is weak once `LIMIT_BAL` and the payment-delinquency block are in the pool. On a raw feature store with thousands of variables, two-thirds typically fall below.

### LASSO

The LASSO [@tibshirani1996regression] adds an $\ell_1$ penalty to the logistic log-likelihood,

$$
\hat\beta(\lambda) = \arg\min_{\beta_0, \beta} \Big\{ -\tfrac{1}{n}\sum_{i=1}^n \ell_i(\beta_0, \beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Big\},
$$

where $\ell_i$ is the per-observation log-likelihood. For $\lambda$ large, the solution is the null model; for $\lambda = 0$ the solution is ordinary logistic regression. Between those extremes, coefficients enter one at a time as $\lambda$ decreases, which gives the characteristic LASSO path.
Features whose coefficients remain at zero across most of the path are dropped. The `glmnet` coordinate-descent algorithm [@friedman2010regularization] is the standard solver. Three properties make LASSO attractive for scorecard selection. First, it handles correlated features by picking one and shrinking the others, which reduces the collinearity problem that flat logistic regression suffers on WoE-encoded features. Second, the regularization path is cheap to compute, so a team can inspect the entire trajectory rather than committing to a single $\lambda$. Third, the elastic net extension [@zou2005regularization] smooths the all-or-nothing selection into a convex combination with ridge regularization, which is usually the better default in production. The shape of this plot is typical. At the strongest penalty most coefficients are zero; as the penalty weakens the `PAY_0` family enters first, followed by `LIMIT_BAL` and `PAY_AMT`. When two highly collinear features are present, the LASSO picks one and keeps the other at zero until the penalty is very weak. That behavior is the motivation for the elastic net extension. At $C = 0.02$, the LASSO drops the weakest features and keeps a compact subset. The practical workflow is to cross-validate over $\lambda$ to pick the operating point, then stability-select across bootstrap resamples to drop features that enter the solution only intermittently. ### Mutual information and permutation importance Two nonparametric alternatives round out the toolkit. Mutual information $I(X_j; Y) = \sum_{x,y} p(x,y) \ln \frac{p(x,y)}{p(x)p(y)}$ is closely related to IV. For a binary $Y$, $I(X_j; Y) = H(Y) - H(Y \mid X_j)$, which measures the expected reduction in label entropy from knowing $X_j$. `sklearn.feature_selection.mutual_info_classif` estimates it for continuous and discrete features. 
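The entropy identity above can be checked numerically. The following is a minimal sketch on synthetic data, not the book's own chunk: the helper names (`mutual_information`, `entropy`, `conditional_entropy`) and the four-bin feature are hypothetical, and the estimate is the plain plug-in estimator for discrete inputs rather than the nearest-neighbor estimator that `mutual_info_classif` uses for continuous features.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete 1-D arrays."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((x_vals.size, y_vals.size))
    np.add.at(joint, (x_idx, y_idx), 1.0)      # contingency counts
    joint /= joint.sum()                       # joint distribution p(x, y)
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y)
    mask = joint > 0                           # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def conditional_entropy(x, y):
    """H(Y | X) = sum over x of p(x) * H(Y | X = x)."""
    total = 0.0
    for v in np.unique(x):
        sel = x == v
        _, counts = np.unique(y[sel], return_counts=True)
        total += sel.mean() * entropy(counts / counts.sum())
    return total

# Synthetic example (assumption): a 4-bin feature with rising bad rate per bin.
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 4, size=5000)
p_bad = np.array([0.05, 0.10, 0.25, 0.50])[x_demo]
y_demo = (rng.random(5000) < p_bad).astype(int)

mi_direct = mutual_information(x_demo, y_demo)
_, y_counts = np.unique(y_demo, return_counts=True)
mi_entropy = entropy(y_counts / y_counts.sum()) - conditional_entropy(x_demo, y_demo)
print(f"I(X;Y) direct = {mi_direct:.6f}, H(Y) - H(Y|X) = {mi_entropy:.6f}")
```

The two numbers agree up to floating-point error because both are computed from the same empirical joint distribution; on a binned feature, the same contingency table is also what the IV calculation consumes.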
Permutation importance [@breiman2001random; @altmann2010permutation] measures the drop in model performance when a single feature is randomly permuted on the validation set. It is model-agnostic and captures interactions that univariate IV misses. The cost is that permutation importance for correlated features is misleading: permuting one feature often leaves the information intact through its correlated neighbor, so the importance of both looks low. Conditional permutation variants partly fix this. The mutual-information ranking and the marginal permutation-importance ranking agree on the top features with the IV ranking, which is the usual picture when the candidate set is clean. When they disagree, disagreement usually points to either collinearity (MI and IV up, permutation importance down) or a nonlinear interaction (permutation importance up, MI down). The conditional permutation column sharpens this read: features whose marginal importance was suppressed by a correlated neighbor recover here, because permuting within strata of that neighbor blocks the leakage path. A feature that looks important marginally but collapses under conditional permutation is mostly proxying for its partner; a feature that looks weak marginally but rises under conditional permutation carries information the rest of the matrix cannot reconstruct. Reading the table, the top ten are all repayment-status (`PAY_*`) and payment-amount (`PAY_AMT*`) variables, consistent with the IV ranking in @sec-ch03-weight-of-evidence-and-information-value. `PAY_0`, the most recent repayment status, dominates on every column: $I(X;Y) \approx 0.079$, marginal permutation importance $\approx 0.040$, and conditional permutation importance $\approx 0.021$. It is the only feature whose importance survives conditioning, meaning its signal is not reconstructible from any single correlated neighbor. The older repayment lags tell the collinearity story cleanly. 
`PAY_2` through `PAY_6` have non-trivial mutual information (roughly $0.03$-$0.05$) but their marginal permutation importance is already near zero, and conditional permutation drives it to zero or slightly negative. Mechanically, `PAY_0` through `PAY_6` are strongly autocorrelated across months, so shuffling `PAY_2` alone barely hurts the model because `PAY_0` (and the other lags) still carry the same delinquency signal. Mutual information is a univariate quantity and does not see this substitution, which is why the MI column stays elevated while both permutation columns collapse. The practical consequence is that the linear model is essentially reading `PAY_0` and using the older lags as noisy confirmation; dropping `PAY_3`-`PAY_6` would barely move held-out performance, though keeping them can still help in nonlinear models that exploit interactions (e.g. "was delinquent two months ago but recovered"). The `PAY_AMT*` variables sit at the bottom of the table with MI around $0.02$ and permutation importance within sampling noise of zero (the small negative values at `n_repeats = 10` are pure variance, not evidence against the feature). For a linear model on the standardized scale, raw payment amounts carry little marginal information once repayment status is known: a customer who is two months delinquent is risky regardless of whether last month's payment was \$500 or \$5,000. These features typically become useful only after transformation (ratio to bill amount, log-scaling) or inside a model that can interact them with `PAY_0`. ### Boruta Boruta [@kursa2010boruta] is a wrapper method built on random forests. For each feature it creates a shadow feature that is a random permutation of the original. A random forest is fit on the augmented matrix, and each real feature is tested against the best shadow feature. Features that consistently beat their shadow are confirmed; features that lose consistently are rejected; borderline features are marked as tentative. 
Boruta is aggressive in retaining correlated relevant features, which is desirable when the downstream model can handle them, and it is slow. On Taiwan, Boruta confirms a broad core (around nineteen of the twenty-three predictors): `LIMIT_BAL`, every `PAY_*` delinquency counter, every `BILL_AMT_*`, and every `PAY_AMT_*`. The four demographics (`SEX`, `EDUCATION`, `MARRIAGE`, `AGE`) are rejected. This overlaps heavily with the LASSO choice at $C \approx 0.05$ and with the top of the IV ranking: the repayment-history block dominates regardless of which selector we trust. ### Stability across methods A useful diagnostic is to compare the top-10 features from each method. If three of four methods agree on a feature, it is very likely signal. If a feature appears in only one method, investigate. The literature, going back to @guyon2003introduction, warns against over-reliance on any single score; practical scorecard teams combine a univariate filter (IV), a sparse model (LASSO), and a nonlinear wrapper (Boruta or permutation importance on a tree learner) before freezing the feature list. ### Stability selection A richer version of the stability argument is stability selection, introduced in the statistics literature for LASSO-type methods. The idea is to fit the LASSO on many bootstrap resamples and record how often each feature is selected. Features that appear in 80 or 90 percent of bootstraps are confirmed; features that appear in fewer than 50 percent are rejected; the intermediate band is marked for review. Stability selection has strong theoretical guarantees for controlling the expected number of false positives in high-dimensional settings. Features that appear in every bootstrap fit are the robust core of the model. Features that appear in 50 to 70 percent of fits are correlated with the robust core and will drop in and out depending on the resample. 
Features that appear in fewer than 30 percent of fits are weak and should be removed even if they clear the IV threshold.

### Redundancy analysis

Univariate feature selection tells you which features carry signal; it does not tell you which features are redundant. Redundancy analysis uses the correlation structure of the candidate matrix to collapse highly correlated groups before fitting the downstream model.

Pairs with absolute correlation above 0.8 are the candidates for group collapse. One common rule is to keep the higher-IV member of each pair and drop the other. Another is to collapse the pair into a ratio or a difference feature that captures the residual signal. For scorecard work, the former is simpler and adequate; for gradient boosting, neither matters because the tree learner handles correlation natively.

### Practical selection pipeline

A working recipe that survives audit:

1. Start with the full feature candidate set from the feature store.
2. Drop features with more than 50 percent missingness unless a missingness indicator is strongly predictive.
3. Compute IV on the training fold. Keep features with $\mathrm{IV} \ge 0.02$, and set aside features with $\mathrm{IV} > 0.5$ for manual review (usually leakage).
4. Compute the correlation matrix on the kept features. For each pair with $|r| > 0.8$, keep the higher-IV member.
5. Run a LASSO path with stability selection. Keep features that appear in at least 70 percent of bootstraps at the cross-validated $\lambda$.
6. Optionally run Boruta as a consistency check. Features that survive both stability-selection LASSO and Boruta are the robust core.
7. Freeze the feature list. Document the rejection reason for every dropped feature in the model development document.

Every step should be reproducible from a seed and a snapshot of the training data. A scorecard that cannot be regenerated byte-for-byte from its inputs will fail SR 11-7 validation on its first audit cycle.
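Step 4 of the recipe (and the rejection log that step 7 asks for) can be sketched in a few lines of pandas. This is an illustrative sketch under stated assumptions, not the book's implementation: `prune_correlated` is a hypothetical helper, the IV values are made up, and the rule is applied greedily (features are visited from highest to lowest IV and dropped if they correlate above the threshold with a feature already kept).

```python
import numpy as np
import pandas as pd

def prune_correlated(X, iv, threshold=0.8):
    """Greedy redundancy filter: keep the higher-IV member of correlated pairs.

    X  : DataFrame of candidate features (training fold only).
    iv : Series of information values indexed by feature name.
    Returns the kept feature list and a {feature: reason} rejection log.
    """
    corr = X.corr().abs()
    # Visit features from strongest to weakest so the stronger one is retained.
    order = iv.loc[X.columns].sort_values(ascending=False).index
    kept, dropped = [], {}
    for col in order:
        partner = next((k for k in kept if corr.loc[col, k] > threshold), None)
        if partner is None:
            kept.append(col)
        else:
            dropped[col] = f"|r| > {threshold} with higher-IV feature {partner}"
    return kept, dropped

# Synthetic demo (assumption): b is a noisy copy of a, c is independent.
rng = np.random.default_rng(42)
a = rng.normal(size=2000)
X_demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=2000),
    "c": rng.normal(size=2000),
})
iv_demo = pd.Series({"a": 0.30, "b": 0.20, "c": 0.05})  # illustrative IVs

kept, dropped = prune_correlated(X_demo, iv_demo)
print("kept:", kept)
print("dropped:", dropped)
```

Returning the rejection reasons alongside the kept list keeps the audit trail in the same artifact as the selection itself, which is usually easier to defend than reconstructing reasons after the fact.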
## Temporal leakage and lookahead bias The single most damaging data bug in credit scoring is using information at training time that would not be available at scoring time in production. @khandani2010consumer showed that machine-learning credit models trained on out-of-time cohorts can forecast delinquencies with substantial economic value, and that claim hinges entirely on correct temporal splits. In practice, this kind of bug is common, it is hard to detect with standard cross-validation, and it inflates backtest performance so dramatically that a model can look like a breakthrough until the first month of live scoring. ### The time structure of credit data Two calendars govern any credit data set. The first is the **observation date**, the point at which a score is computed. The second is the **performance window**, the forward-looking horizon over which the label is defined. A 12-month PD model might score at month $t_0$ and label the customer as a bad if they reach 90+ days past due at any point in $[t_0, t_0 + 12]$. For training, the label requires all data up to $t_0 + 12$, which means the most recent observation point for which a complete label exists is 12 months behind the data engineer's clock. @fig-ch03-pd-timeline makes the two-calendar structure explicit. Cohort A sits twelve months in the past, so its performance window closes at today and its label is fully observed. Cohort B is more recent; its observation date is only six months ago, so six months of its performance window has not happened yet. Cohort B cannot enter the training set until the calendar advances enough for its label to resolve. This architecture implies three rules: 1. Features must be computable strictly from data known at the observation date. 2. Labels must come from data in the performance window. 3. The split between training and test must respect the ordering of observation dates, not the ordering of label dates. Violate rule 1, and the model is leaky. 
Violate rule 2, and the label is wrong. Violate rule 3, and the evaluation is optimistic. ### Types of leakage Four kinds of temporal leakage show up in scorecard pipelines: 1. **Direct target leakage**: a feature includes the label. Happens when a team builds a feature from a table that has already been updated with performance outcomes. The "30+ in month $t$" flag sourced from a data warehouse version that has processed later data is the classic example. 2. **Aggregate leakage**: a feature uses a statistic computed over a pool that includes future observations. Mean encodings computed over the whole data set are the archetype. 3. **Split leakage**: a customer appears in both train and test because a random split ignored time or customer identity. Common with repeat borrowers. 4. **Snapshot leakage**: a feature is pulled from a production system that updates continuously, without anchoring to a specific as-of date. The feature value at training time differs from its value at scoring time because the underlying record has changed. ### A reproducible bug and its fix We simulate 24 monthly cohorts with a regime change at month 18. We engineer a leaky feature that uses the same-cohort default rate, and a non-leaky feature that uses the lagged default rate from the previous three months. We then compare a random split against an out-of-time split. Both engineered columns are summaries of the *default rate* (the fraction of borrowers who defaulted in a given set of months). They differ only in *which months* go into the summary, and that single difference decides whether the feature is admissible. The preview shows three borrowers who all sit in `month = 0`, so any statistic that is defined per month takes the same value across the three rows; that repetition is an artifact of grouping, not redundancy between the columns. - `x` is the single idiosyncratic predictor, drawn independently per borrower from $\mathcal{N}(0, 1)$. 
- `y` is the realized default indicator: $0$ for rows 0 and 1, $1$ for row 2. In the real world this column is only populated after the performance window closes, typically 12 months after the observation date. - `same_month_default_rate` reads $0.2625$ on every row in the month-0 cohort. That number is literally `df_t[df_t["month"]==0]["y"].mean()`: the default rate of the *current* cohort, computed from every borrower's label in that month, *including the borrower sitting in the current row*. Row 2's feature value uses row 2's own `y = 1` as part of the average. This is the aggregate-leakage archetype from the list above: the statistic is defined over a pool that contains the future. - `prior_months_default_rate` reads $0.36375$, which for month 0 is just the global mean of `y` across all 24 months. The rolling calculation asks for the default rate of the *previous three* months, but month 0 has no prior history, so `fillna(global_mean)` plugs the missing window with the unconditional base rate. From month 3 onward this column becomes the true three-month trailing default rate, computed only from cohorts whose performance windows have already closed. The two columns look alike on purpose: both are "average of `y` for some set of months", and both are constant across borrowers inside a cohort. That surface similarity is the trap. What distinguishes them is the *window*. `same_month_default_rate` reaches forward into labels that do not yet exist at scoring time; `prior_months_default_rate` reaches only backward, into cohorts that have already resolved. The correct way to judge any feature is to ask what would have to happen in production to reproduce its value for a live applicant. For `same_month_default_rate`, the recipe is "take the mean of $y$ among all borrowers in the same cohort as this applicant". The applicant's cohort is the current month. Their own label will not exist for another 12 months. 
Neither will the labels of the other borrowers booked in the same month. A production scorer cannot compute this feature, cannot approximate it, and cannot substitute a plausible placeholder without changing the model. The column is a training-time ghost: it looks informative because it is smuggling the answer in as an input, and a naive random split carries that cheat straight into the test fold. For `prior_months_default_rate`, the recipe is "take the mean of $y$ from cohorts 1, 2, and 3 months before the applicant's cohort". If today is March 2026 and the observation date is March 2026, the relevant cohorts are December 2025, January 2026, and February 2026 *only after their labels have resolved*. For a 12-month PD that means the feature is usable at scoring time if we interpret "lag" in terms of fully-observed label months, so in practice the lookup runs against the December 2024, January 2025, and February 2025 bookings (13 to 15 months ago), whose performance windows closed months ago. The value on March 2026's scoring job is byte-for-byte identical to the value the training pipeline would have computed for a March 2026 observation date. The feature is honest: the causal arrow runs strictly from past to future. The general test is a single question you should ask of every engineered column before it enters the design matrix: *if I froze the entire database at the observation date, could I still compute this value?* If yes, the feature is admissible. If no, no amount of cross-validation machinery will save the model from its first live scoring month; the random-versus-out-of-time comparison in the next chunk is exactly the diagnostic that exposes the gap. The leaky feature looks best on a random split, because the random split leaks the regime information across train and test. 
Under the out-of-time split the leaky feature loses its advantage, because the training cohorts end at month 15 and the test cohorts start at month 18, so the training-time aggregate carries no regime signal. An honest backtest pulls the feature back down to its true value. The moral of the exercise is not that the leaky feature is useless. The leakage-induced lift is what is useless, and the backtest has to be structured to kill it.

### Point-in-time feature construction

The discipline that prevents leakage is point-in-time (PIT) feature engineering. A PIT feature store stores every feature with two timestamps: the **event time**, when the underlying fact occurred, and the **as-of time**, when that fact became visible to the lender. When a training row is built for observation date $t_0$, only facts with as-of time $\le t_0$ are joined. Systems like Feast, Tecton, and Databricks Feature Store expose this temporal-join primitive natively.

@lopez2018advances makes the same argument for financial machine learning, where the analog is survivorship bias and lookahead bias in backtesting [@mackinlay1997nonlinear]. For cross-validation over time, @bergmeir2018note show that random K-fold is defensible only for stationary, well-specified autoregressive series, conditions that regime-shifting credit data does not meet. The right cross-validation scheme for scorecards is an expanding-window or rolling-window walk-forward: train on months 1 to $K$, test on month $K+1$, roll forward by one month, repeat. This gives $T - K$ out-of-time estimates, which together form a performance distribution rather than a single point estimate.

### A checklist for diagnosing leakage

- Did the training labels use data dated after the observation date?
- Are any features computed from aggregates over the full panel rather than from per-row history?
- Does the same customer appear in both train and test?
- If a feature is time-varying, does the training snapshot of it match the production scoring snapshot?
- Does the OOT AUC match the random-split AUC?
If OOT is substantially lower, that is probably honest deterioration. If OOT is higher, something is wrong. ### Walk-forward cross-validation Random K-fold cross-validation is the default in scikit-learn and the wrong default for credit data. Random K-fold treats the rows as exchangeable, which they are not when there is a time index. A row from month 24 in the training fold and a row from month 12 in the test fold creates an implicit leak: the model sees a future row and then evaluates on a past row. @bergmeir2018note show that for stationary time series this can be benign, but for regime-shifting data it inflates performance. Walk-forward cross-validation fits the time structure directly. For a data set spanning months $1$ through $T$, pick an initial training window $[1, K]$ and a test window $[K+1, K+h]$. Fit the model on the training window and evaluate on the test window. Roll the training window forward by $h$ months and repeat. This produces $(T - K) / h$ out-of-sample evaluations, each of which is a genuine out-of-time test. The sklearn `TimeSeriesSplit` class implements the basic variant. Walk-forward AUCs give a performance distribution that incorporates regime shifts. The standard deviation across folds is a better summary of production risk than a single holdout AUC because it reflects how the model behaves across real cohorts. ### Label maturation and the cold-start problem Credit labels mature slowly. A "bad" defined as 90+ days past due at any point in a 12-month horizon cannot be observed until 12 months after origination. A 60-day performance window takes 60 days; a 24-month window takes 24 months. In practice, scorecard teams use a two-calendar workflow. The label calendar defines the earliest origination date for which a complete label exists; the feature calendar defines the latest date on which features are available. Training data lives in the intersection. 
The consequence is that a scorecard trained at time $T$ on a 12-month bad definition has its most recent complete cohort at $T - 12$ months. The 12 months between $T - 12$ and $T$ contain applications that have not had time to develop, but they have the most current feature distributions. Most banks use those immature cohorts only for monitoring, not for training, and accept the constraint that the training data is always 12 to 18 months stale. Survival analysis (@sec-ch09) provides a way to use immature cohorts in training by censoring them, which is its principal operational advantage over binary classification in credit. ### Survivorship and sample-selection bias Related to leakage is survivorship and sample-selection bias. A scorecard trained on the population of applications that were approved and booked systematically excludes the population that was declined, which is the population the scorecard is trying to classify. @heckman1979sample formalized the bias this induces. In credit, it is addressed through reject inference (@sec-ch10) and through careful cohort construction: the scorecard development sample should, where possible, include reject bureau pulls so that the model sees the full application flow rather than only the approved subset. Even then, the bias persists unless the reject inference is done correctly, and most scorecards carry a structural skew toward the approved population's risk profile. ### Cohort effects from policy changes A lender that tightens its underwriting rules at month 10 creates a structural break in the training data: approvals after month 10 are a different population than approvals before month 10. A model trained on the pooled data estimates a blend of the two populations and predicts poorly on either. The "fix" usually summarized as "treat the policy change as a cohort indicator" is true but useless on its own. The actual fix is a four-stage pipeline: 1. collect the policy events as structured data, 2. 
join them onto the training rows so every observation knows its policy version, 3. detect breaks the policy log missed, 4. choose a modeling response per regime and validate it walk-forward. Every stage is concrete and worth implementing in code. #### Stage 1: Collect policy events as structured data The model risk discipline starts with a `policy_log` table maintained by credit policy, not by data science. The minimum schema is below; in practice, banks add columns for the approving committee, the legal-vetting status, and a free-text rationale. The `effective_month` column is a *date* in production (`effective_at TIMESTAMP`); we use integer months here so the example aligns with `df_t`. The `scope_*` columns matter: a policy that fires only on a sub-segment must not be applied to the rest of the book. Store this table in the same warehouse as the application data and version it; an immutable append-only log with `valid_from` and `valid_to` columns is the right shape, so retroactive corrections do not silently rewrite history. #### Stage 2: Simulate the application stream and join policy versions We rebuild a 24-cohort applicant stream where the latent risk variable `x` is generated for every applicant, but the booking decision and therefore the *training population* depends on the active policy. Two things just happened that matter operationally. First, `df_p` is the *application* stream and `booked` is the *training* stream; the gap between them is exactly the survivorship bias the previous section warned about, and the policy log is the only honest record of why the gap is there. Second, the join from policy log to applications is by `(month, scope_product, scope_segment)`, not by `month` alone. In production, this is a `LEFT JOIN ... AND application_date BETWEEN policy.valid_from AND policy.valid_to AND application.product = policy.scope_product` against the warehouse, run once at training-set construction time and stored alongside the row. 
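The scoped, interval-based join described above can be sketched in pandas. This is a minimal illustration under assumptions, not the book's chunk: the tables, the `attach_policy` helper, and the column names `app_id` and `product` are hypothetical, months are integers with inclusive `valid_from` and `valid_to`, and only the product scope is shown (a segment scope would be one more equality condition).

```python
import pandas as pd

# Hypothetical append-only policy log with a product scope and validity window.
policy_log = pd.DataFrame({
    "policy_id": ["P0", "P1", "P2"],
    "scope_product": ["card", "card", "loan"],
    "valid_from": [0, 12, 0],    # inclusive, integer months
    "valid_to": [11, 23, 23],    # inclusive
})

# Hypothetical application stream.
applications = pd.DataFrame({
    "app_id": [1, 2, 3, 4],
    "month": [3, 12, 20, 7],
    "product": ["card", "card", "card", "loan"],
})

def attach_policy(apps, log):
    """Tag each application with the policy active for its product and month."""
    # Equality part of the join: product scope.
    merged = apps.merge(log, left_on="product", right_on="scope_product", how="left")
    # Interval part of the join: month inside [valid_from, valid_to].
    in_window = (merged["month"] >= merged["valid_from"]) & (
        merged["month"] <= merged["valid_to"]
    )
    matched = merged.loc[in_window, ["app_id", "policy_id"]]
    # More than one active policy per application means the log has overlaps.
    if matched["app_id"].duplicated().any():
        raise ValueError("overlapping policy windows for at least one application")
    # Left join back so applications with no matching policy surface as NaN.
    return apps.merge(matched, on="app_id", how="left")

tagged = attach_policy(applications, policy_log)
print(tagged)
```

Applications that come back with a missing `policy_id` are worth a hard failure in a real pipeline, since they indicate either a gap in the log or a scope that was never recorded.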
#### Stage 3: Detect breaks the policy log missed Two things go wrong. The policy team forgets to log a soft change (a credit officer dialing up manual overrides), or an external event (a macro shock, a competitor exit) creates a structural break with no internal policy attached. Detection is mechanical and worth running on every refresh. Read the row: the largest PSI on `x` against the pre-policy baseline lines up with the months in `known_policy_months`, and the CUSUM argmax lands on or near the same boundary. PSI \> 0.25 is the conventional "significant shift" threshold; CUSUM peaks identify the candidate break date. Both are unsupervised and run independently of the policy log, so a discrepancy is the signal that the log is incomplete. For change-point detection on multivariate streams, `ruptures` (Pelt, Binseg, Window) implements the full family in @truong2020selective; it accepts a cost function and returns segment boundaries, which is what you want when you suspect more than one break and do not know `K` in advance. The Pelt output is a list of segment boundaries; with the policy log claiming breaks at months 6, 12, and 18, you should see endpoints near those values. The `pen` argument is the only hyperparameter that needs tuning; raise it to suppress spurious breaks, lower it if you suspect breaks the algorithm is missing. For a multivariate signal (default rate, mean `x`, and thin-file share jointly), pass a 2-D array of shape `(T, K)` instead of `dr.reshape(-1, 1)`; the rbf cost generalizes without code changes. Once a candidate break date is in hand, the classical hypothesis-test version is the @chow1960tests F-test for parameter equality across two regression segments. The full machinery is one `statsmodels` call per candidate date for the F-test, plus one global call for CUSUM-of-residuals (`statsmodels.stats.diagnostic.breaks_cusumolsresid`). 
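To make the Chow statistic concrete, here is a minimal NumPy/SciPy sketch on a synthetic monthly default-rate series with a level shift at month 12. It is an illustration under stated assumptions, not the book's `statsmodels` chunk: the `chow_f` helper here is written from the textbook formula, and the series, noise level, and candidate dates are invented for the example.

```python
import numpy as np
from scipy import stats


def chow_f(y: np.ndarray, X: np.ndarray, break_idx: int) -> tuple[float, float]:
    """Chow F-test for equal OLS coefficients before and after break_idx.

    X must already contain an intercept column.
    """
    def rss(y_seg: np.ndarray, X_seg: np.ndarray) -> float:
        beta, *_ = np.linalg.lstsq(X_seg, y_seg, rcond=None)
        resid = y_seg - X_seg @ beta
        return float(resid @ resid)

    n, k = X.shape
    rss_pooled = rss(y, X)
    rss_split = rss(y[:break_idx], X[:break_idx]) + rss(y[break_idx:], X[break_idx:])
    # F = [(RSS_pooled - RSS_split) / k] / [RSS_split / (n - 2k)]
    f_stat = ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))
    p_value = float(stats.f.sf(f_stat, k, n - 2 * k))
    return f_stat, p_value


rng = np.random.default_rng(0)
T = 24
t = np.arange(T, dtype=float)
X = np.column_stack([np.ones(T), t])  # intercept + linear trend

# Synthetic default rate: 5% before month 12, 8% from month 12 onward, plus noise.
y = 0.05 + np.where(t >= 12, 0.03, 0.0) + rng.normal(0.0, 0.003, T)

# Rank hypothetical candidate break dates by p-value.
results = {b: chow_f(y, X, b) for b in (6, 12, 18)}
best_break = min(results, key=lambda b: results[b][1])
for b, (f_stat, p_value) in results.items():
    print(f"break={b:2d}  F={f_stat:8.2f}  p={p_value:.2e}")
```

The candidate with the smallest p-value is the one that best separates the two regimes. On this series, that is the true shift month.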
Read the F-test column: the candidate break with the smallest `p_value` is the most defensible split point, and `p < 0.05` (with `0.01` the conventional bar in a model risk submission) is the formal evidence the chair of the model risk committee will ask for. CUSUM-of-residuals is a single global test: a small p-value rejects the null of "stable coefficients across the entire sample" without committing to a specific break date. Use `ruptures` to *find* candidate break dates, `chow_f` to *rank* them, then CUSUM as a sanity check that something broke at all. #### Stage 4: Pick a modeling response and backtest it walk-forward The four candidate responses to a documented break, in increasing sophistication, are listed below. Each is a one-line change to the training stage, but each makes a different assumption and has a different sample-size cost. The four columns implement four distinct fixes: - **`pooled`** is the do-nothing baseline. It blends regimes and is the failure mode the section warns about. - **`subset_post`** keeps only rows whose `policy_id` matches the regime currently in test. This is the cleanest answer when the post-policy population is large enough; the cost is sample size, and the rule of thumb (used above) is to fall back to pooled below 200 training rows. - **`indicator`** keeps the full sample but adds a `policy_post` 0/1 control. This assumes the *slope* on `x` is constant across regimes and only the intercept moves. Add a `policy_post * x` interaction term if you want to relax that assumption; the F-test on the interaction is exactly the @chow1960tests test. - **`importance_weighted`** keeps the full sample but reweights training rows by the density ratio between the test regime and the train regime, estimated by a domain classifier. This is the @sugiyama2007covariate covariate-shift correction. It dominates the indicator approach when the shift is in `P(x)` (the population mix changed) rather than `P(y | x)` (the relationship changed). 
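The importance-weighted option can be sketched with a domain classifier as below. This is a hedged illustration rather than the book's backtest code: the two Gaussian regimes, the outcome model, the clipping cap of 10, and all variable names are assumptions chosen to make the mechanics visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical regimes that differ only in P(x): pre-policy (train) vs post-policy (test).
x_train = rng.normal(0.0, 1.0, size=(2000, 1))
x_test = rng.normal(1.0, 1.0, size=(2000, 1))

# Domain classifier: label 1 for test-regime rows, 0 for train-regime rows.
X_dom = np.vstack([x_train, x_test])
d = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_test))])
domain_clf = LogisticRegression().fit(X_dom, d)

# Density ratio p_test(x) / p_train(x) = odds of "test" times n_train / n_test.
p = domain_clf.predict_proba(x_train)[:, 1]
w = (p / (1.0 - p)) * (len(x_train) / len(x_test))
w = np.clip(w, 0.0, 10.0)  # cap extreme weights; the cap value is an arbitrary choice
w = w / w.mean()           # normalise so the effective sample size stays comparable

# Sanity check: the weighted train mean of x should move toward the test mean.
weighted_mean = float(np.average(x_train[:, 0], weights=w))
print(f"train mean={x_train.mean():.3f}  weighted={weighted_mean:.3f}  test mean={x_test.mean():.3f}")

# Synthetic outcome, then the reweighted fit via sample_weight.
y_train = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x_train[:, 0] - 1.0))))
model = LogisticRegression().fit(x_train, y_train, sample_weight=w)
```

Clipping trades a little bias for a large variance reduction. Without it, a handful of train rows deep in the test regime's support can dominate the fit.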
The right choice is data-dependent and is exactly what the walk-forward backtest above is for: the column with the highest mean and lowest std across folds wins. There is no universally correct answer, only an empirically defensible one. #### Stage 5: Wire it into the training pipeline Once the chosen fix is selected, it lives in the feature/data pipeline, not in a notebook. The minimum production-grade integration has four pieces: The four numbered steps map one-to-one to the failure modes the section listed. Step 1 forces every row to know its regime via a temporal `BETWEEN` join against `policy_log` and prevents silent drift; the join is a single Polars expression in development and a single SQL statement in production. Step 2 catches breaks the policy team forgot to log: the diff `psi_breach_months - documented_break_months` is the alert payload, and an empty diff is what you want to see in steady state. Step 3 wraps the regime decision in a sklearn `Pipeline` so the same object that trained also serves predictions, with `RegimeFilter` documenting the regime contract even though the row filtering happens upstream. Step 4 logs per-policy AUC as a first-class MLflow metric so the model risk dashboard tracks degradation per regime, not just an aggregate; the production replacement for the temp-dir tracking URI is `mlflow.set_tracking_uri("databricks")` or whichever managed backend the team runs. ## Scalability: Polars versus pandas Weight-of-evidence encoding is embarrassingly parallel at the row level. The bin assignment step is a vectorized lookup; the WoE mapping is a per-bin scalar. Neither operation needs a global join. On data that fits in memory, pandas and Polars are both fast. The interesting question is when the gap between them matters. On a laptop, both engines finish a 1M-row WoE encoding in well under a second. The pandas path runs single-threaded on the pandas block manager; the Polars path parallelizes over cores through its Rust backend. 
The absolute runtimes are close at 1M rows because the bottleneck is memory bandwidth rather than compute. The gap widens at 10M to 100M rows, where Polars's columnar execution and multi-threaded groupby start to matter. For scorecard training, the practical rule is that up to a few million rows, pandas is fine; past that, Polars gives a 3x to 10x speedup with the same code shape. Past 100M rows the data no longer fits comfortably on a single machine and the engine choice shifts to Dask or Spark. The logical structure is identical to the pandas/Polars version: assign bins (map), aggregate counts per bin (reduce), broadcast the small WoE lookup back onto every row (broadcast join). What changes is that the partitioning is explicit, the aggregation is two-pass (per-partition then global), and the broadcast join is a first-class operator. The Dask implementation below runs end-to-end on a small partitioned Parquet store the chunk writes itself; replace the local path with `s3://...` and the same code runs unchanged on a billion rows. Two practical notes on the Dask version. The `quantile` call uses Dask's t-digest approximation, which is the only scalable way to get global percentiles without shuffling the whole column. The `merge` is automatically a broadcast join because `woe_lookup` is a pandas DataFrame, not a Dask one; if both sides were Dask DataFrames, you would pay a shuffle, so always materialize the small side with `.compute()` before joining. The PySpark port follows the same three-stage logic; the only thing that changes is the syntax. The block below is shipped as a `python` code fence rather than a `{python}` chunk because starting a Spark session in the book renderer adds 10 to 20 seconds per build and requires a Java runtime; uncomment to run. Three practical notes on the Spark version. `QuantileDiscretizer` is the production-grade analog of `pd.cut` on quantiles; for fixed user-supplied edges (regulatory bins) use `Bucketizer(splits=edges)` instead. 
The `broadcast()` hint is mandatory rather than optional once the right side has more than a handful of partitions; Spark's cost-based optimizer can under-broadcast in the presence of skew. Configure `spark.sql.shuffle.partitions` to roughly the cluster's vCPU count for the aggregation; the default of 200 is wrong for both small clusters (over-partitioned) and large ones (under-partitioned), and is the single most common cause of slow Spark scorecard jobs. When porting between engines, the sanity check is to run all four (pandas, Polars, Dask, Spark) on the same 1M-row sample and assert that the resulting per-bin WoE values agree to 1e-6. Disagreement is almost always an off-by-one in bin edges (`include_lowest`, `right=True/False`, `handleInvalid` for Spark) or a quantile-approximation mismatch; it is always solvable, but it has to be caught at port time, not in production. For a one-shot scorecard rebuild that has to run on a billion rows, the pragmatic recipe is: develop and validate on a 1M-row Polars sample, then port to Spark for the full-volume rebuild, keeping the Polars code as the regression oracle. For larger feature panels where the binning itself must learn from a sample (rather than be quantile-uniform), the same fit-on-sample-then-distribute pattern works with `optbinning.BinningProcess`. The runnable Dask cell at the start of this section already implements that pattern; the `BinningProcess` object is pickled, broadcast to workers via the filesystem (or `Client.scatter(...)`), and applied row-locally inside `map_partitions`. The asymmetry between `fit` (single-node, on a sample, runs once) and `transform` (distributed, row-local, runs everywhere) is the standard shape of every preprocessing step in a production ML pipeline; the same template ports to scaling, mean imputation, target encoding, and embedding lookups, with the only change being the object inside the `pickle`. 
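The fit-once, transform-everywhere asymmetry can be sketched without a cluster. The example below is a simplified stand-in, not the book's Dask cell: `fit_woe` and `transform_partition` are hypothetical helpers, quantile bins replace `optbinning`, and a list of pandas chunks plays the role of Dask partitions.

```python
import pickle

import numpy as np
import pandas as pd


def fit_woe(x, y, n_bins: int = 5, eps: float = 0.5) -> dict:
    """Fit quantile edges and per-bin WoE on a sample; a plain dict pickles anywhere."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    interior = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(x, interior))      # interior bin edges only
    bins = np.searchsorted(edges, x, side="right")   # bin index in 0..len(edges)
    k = len(edges) + 1
    bad = np.bincount(bins, weights=y, minlength=k) + eps       # smoothed bad counts
    good = np.bincount(bins, weights=1 - y, minlength=k) + eps  # smoothed good counts
    woe = np.log((good / good.sum()) / (bad / bad.sum()))       # ln(g_j / b_j)
    return {"edges": edges, "woe": woe}


def transform_partition(part: pd.DataFrame, blob: bytes) -> pd.DataFrame:
    """Row-local transform: unpickle the fitted state and look up WoE per row."""
    state = pickle.loads(blob)
    bins = np.searchsorted(state["edges"], part["x"].to_numpy(dtype=float), side="right")
    return part.assign(x_woe=state["woe"][bins])


rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(2.0 + x)))  # higher x means lower default risk
df = pd.DataFrame({"x": x, "y": y})

# Fit: single node, on a sample, runs once.
sample = df.sample(10_000, random_state=0)
blob = pickle.dumps(fit_woe(sample["x"], sample["y"]))

# Transform: applied independently to each partition.
parts = [df.iloc[i:i + 25_000] for i in range(0, n, 25_000)]
encoded = pd.concat([transform_partition(p, blob) for p in parts])

# Regression check: partition-wise output must equal the single-pass output.
whole = transform_partition(df, blob)
max_gap = float((encoded["x_woe"] - whole["x_woe"]).abs().max())
print("max partition-vs-whole gap:", max_gap)
```

With Dask, the same `transform_partition` function would be passed to `map_partitions`. The check against a single-pass transform is the small-scale analog of the cross-engine 1e-6 agreement test.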
### Scaling imputation and feature selection Imputation does not scale linearly. kNN imputation is $O(n^2)$ in the number of rows, which rules it out past a few hundred thousand rows. MICE with $M$ iterations scales as $O(M \cdot n \cdot p^2)$ for linear base learners; it scales well in $n$ but poorly in $p$. Simple mean and median imputation is trivially parallel. For large-scale scorecards the pragmatic pattern is to use mean-plus-indicator by default and escalate to MICE only for a small number of carefully chosen features where the missingness distribution is known to be MAR. LASSO with coordinate descent [@friedman2010regularization] has a computational cost of $O(n \cdot p \cdot \text{paths})$ per fit, which scales well up to tens of millions of rows with a few thousand features. Above that, stochastic variants become the tool of choice. Boruta scales poorly, because it fits a random forest repeatedly; on large data, it is typically run on a subsample rather than on the full set. For scorecard work, the dominant scalability bottleneck is not any single algorithm but the full pipeline from raw tradelines to bin-assigned WoE features. Point-in-time joins against a multi-terabyte feature store are where most of the compute lives. That is an engineering problem solved by feature stores and efficient event-time joins; it is not a machine learning problem solved by a faster estimator. ## Benchmark on German and Taiwan The discussion so far has been piecewise: WoE on Taiwan, imputation on German, leakage on a synthetic panel. We now stitch these together into a single end-to-end benchmark that compares four data-pipeline configurations on both public data sets. The table tells a compact story. Within logistic regression, WoE encoding delivers a meaningful lift (config B vs A). 
Gradient boosting on either raw or WoE-encoded features (configs C and D) matches or exceeds the WoE-plus-logistic result, because the tree learner already captures monotone step functions internally. WoE encoding is therefore primarily useful when the downstream model class is linear, which for regulated scorecards it usually is. The German credit data is small (1000 rows) and noisy; AUC in the 0.75 to 0.80 range is what modern models achieve, and the gap between linear and gradient-boosted models is small at this scale. @sec-ch16 benchmarks the two data sets across a larger set of learners with profit curves and calibration plots. ## Deployment considerations Data preprocessing is part of the model artifact. When the logistic scorecard or gradient-boosted tree is persisted, the binning tables, imputers, and WoE lookups that sit before the model must persist alongside it. Three patterns prevent drift between training and production. First, bundle preprocessing into a scikit-learn `Pipeline` object that includes imputation, binning, WoE mapping, and the model. The pickle of the pipeline is the deployable artifact. When the artifact is loaded in a FastAPI service, the `predict_proba` call uses the training-time fit for all preprocessing. Second, export the pipeline to ONNX through `skl2onnx` for language-neutral deployment. Not every scorecard stack runs on Python; Java and C# services are common in banks. ONNX captures the preprocessing graph along with the model weights. Third, log the artifact to MLflow with a data schema. The schema pins every input column's name, dtype, and nullability. Schema violations at inference time are caught at the edge rather than silently producing bad scores. BCBS 239 [@bcbs239] requires this kind of lineage for regulated institutions. A common operational rule is to re-fit the entire pipeline, preprocessing included, every time the model is retrained. 
Re-using a binning table that was fit on data from two years ago while refitting the downstream model on fresh data is a subtle form of drift that is hard to detect and easy to avoid. ## Regulatory considerations Data choices have direct regulatory exposure. A brief map: - **FCRA** (Fair Credit Reporting Act, United States): governs bureau data. Requires accuracy, consumer access, and dispute rights. Any model using bureau data must support the adverse action notices required by FCRA and by Regulation B (which implements ECOA). - **ECOA** (Equal Credit Opportunity Act): prohibits discrimination on race, color, religion, national origin, sex, marital status, age, and receipt of public assistance. Alternative data can create disparate impact even when not explicitly discriminatory [@barocas2016big; @fuster2022predictably]. Fair lending review must be part of the data onboarding process. - **GDPR Article 22** (European Union): gives data subjects the right not to be subject to a decision based solely on automated processing, including profiling, that produces legal or similarly significant effects. Credit decisions are in scope. Article 15 gives the right to meaningful information about the logic involved. Scorecard binning and WoE make this easier than black-box tree ensembles. - **EU AI Act**: credit scoring is classified as a high-risk AI system. Data governance requirements include documentation of training data, testing for bias, and robustness testing. The Act is being phased in, with full application for high-risk systems expected in 2026 to 2027 depending on category. - **BCBS 239**: sets expectations for risk data aggregation, including lineage and quality. Scorecard data pipelines fall in scope for banks subject to Basel framework supervision. - **SR 11-7** [@sr117]: the Federal Reserve's model risk management guidance. Data is an explicit model risk dimension. Model validation must test that inputs are accurate, complete, and appropriate. Missing data handling is a frequent audit finding. 
A validator will ask what happens when the income field is blank, whether that outcome was tested, and whether the model response is bounded and monotone in the unknown direction. Imputation strategy decisions belong in the model development document, not in tribal knowledge. ## Vietnam and emerging markets ### Market context Data in Vietnam flows through two bureaus and a growing set of alternative channels. The Credit Information Center (CIC) sits inside the SBV and is the mandatory reporting destination for all regulated lenders; it maintains a national registry of tradelines, delinquency status, and inquiry history, plus a domestic consumer score product. The Vietnam Credit Information Joint Stock Company (PCB) is the private bureau, launched in 2007 and majority-owned by a bank consortium. Adult bureau coverage is around 50 to 55 percent, with CIC and PCB files overlapping substantially on regulated-lender tradelines and diverging on utility and telecom data [@cic_vietnam2023; @worldbank_findex2021]. Mobile subscriptions exceed 140 percent of the adult population, and smartphone adoption above 80 percent of urban adults makes app-based transaction data a realistic alternative overlay [@adb2023digital]. Remote onboarding is governed by SBV Circular 16/2020/TT-NHNN on electronic KYC [@sbv_circular16_2020], and personal-data processing is bound by Decree 13/2023/ND-CP, which introduces explicit consent, purpose limitation, and cross-border data transfer assessment into the pipeline [@vn_decree13_2023]. ### Application considerations The data methods in this chapter transplant with three main adjustments. First, missingness is higher across the board. Declared-income fields in application forms have informative non-response because cash income does not fit the form. The missing-indicator columns that @sec-ch03-missing recommends as a safety net are load-bearing in Vietnam, not decorative. 
A naive mean imputation on declared income pushes a measurable fraction of thin-file applicants into an artificially safe bucket and corrupts the weight-of-evidence table. Second, bureau tradeline depth is shallower. CIC carries two to five years of tradeline history for a typical obligor, versus ten to fifteen years on a US file. Features that assume long histories (oldest-tradeline age, average-age-of-accounts) are either missing or compressed and lose their predictive ordering. The IV-filter step should be re-run on local cohorts rather than copied from US scorecard conventions. Third, alternative data carries more marginal weight. Bank-statement parsing, e-wallet flow features (through MoMo, ZaloPay, VNPay and similar), telco recharge and top-up patterns, and utility bill payments are substantive signals for the thin-file population and often dominate self-reported income. The trade-off is that each of these feeds falls squarely inside Decree 13's definition of personal data, and some (health-linked insurance premia, location traces) are sensitive personal data with stricter consent and assessment obligations. Leakage is a different problem in an emerging market. The dominant leakage mode is not a future feature reaching back into the observation window; it is a feature whose meaning shifts mid-cycle because the regulation or the product changed. Circular 43/2016/TT-NHNN on consumer lending by finance companies reshaped consumer-finance collection practice, and Circular 22/2023/TT-NHNN (29 Dec 2023) amended Circular 41/2016 on capital adequacy ratios in the middle of most training windows, which pushes both the 30+ delinquency behavior and the risk-weighting of cohorts pre- and post-amendment into distinct regimes [@sbv_circular22_2023]. Tet seasonality adds a second structural break: features that measure rolling 30-day activity bleed across the Lunar New Year window and need a calendar-aware rather than a fixed-offset construction. 
### Rationalization Weight-of-evidence encoding is a good fit for Vietnam. The bureau file is shallow and categorical-heavy, the analyst team is typically small, and the supervisory expectation under Circular 41/2016 is documented and auditable per-feature transformations [@sbv_circular41_2016]. WoE gives that out of the box. Information value as a filter is also a good fit because it is stable under the re-sampling that a typical Vietnamese lender does to cope with a thin positive class. LASSO and Boruta are reasonable but secondary tools; the feature count after sensible WoE binning rarely exceeds forty, and stability selection gives diminishing returns at that scale. Imputation strategies that assume MAR conditional on a rich covariate set are shakier in Vietnam than in the US: missingness on declared income is correlated with occupation and with cash-economy participation in ways that the available covariates do not fully capture. The missing-indicator approach dominates multiple imputation in practice. Finally, the Polars-over-pandas scalability argument is less urgent in Vietnam at typical book sizes (one to three million accounts) than in a US money-center book, but the data engineering cost of a pandas-only pipeline is still real because monthly behavioral refreshes multiply row counts by twelve. ### Practical notes A practical Vietnamese data stack starts with a daily CIC delta feed, a weekly PCB refresh, an internal core-banking feed for bank-owned lenders, and an e-wallet or transaction-API feed for fintech-affiliated lenders. Reporting obligations on the data side run to the SBV Banking Supervision Agency for bank submissions, to CIC for tradeline contribution, and to the Ministry of Public Security for Decree 13 compliance, including the annual data-processing impact assessment. 
Fair-lending review in Vietnam is less codified than under US ECOA, but is increasingly scoped under the consumer-protection provisions of Circular 43/2016/TT-NHNN on consumer lending by finance companies. A Vietnamese team building on the chapter's WoE and imputation pipeline should expect to document, for each feature, the source system, the retention policy under Decree 13, the consent basis, and the cross-border flag. ## Takeaways - Traditional credit data comes from the bureau, the bank's internal systems, and the application form. The schema has been stable for four decades. Alternative data extends the signal set with transactional, behavioral, device, social, and telco categories, each with its own drift profile and compliance exposure. - Under weight-of-evidence encoding, information value, $\mathrm{IV} = \sum_j (g_j - b_j) \ln(g_j / b_j)$, is the Jeffreys divergence between the class-conditional feature distributions. IV gives a univariate ranking; the bin assignment makes a linear model interpretable and robust. - Rubin's MCAR, MAR, MNAR taxonomy drives imputation choice. A missing-indicator column is the cheapest safety net against information loss from silent imputation, and it should be the default for any scorecard feature with a nontrivial missingness rate. - Feature selection needs a univariate filter (IV), a sparse linear method (LASSO or elastic net), and a nonlinear wrapper (Boruta or permutation importance). No one method dominates. Use agreement across methods as a stability signal. - Temporal leakage is the most damaging bug in scorecard work. Every feature must be computable from information available at the observation date. Evaluate with out-of-time splits and walk-forward cross-validation, never random K-fold on stacked cohorts. - Polars handles multi-core WoE encoding with roughly the same code as pandas. The performance gap grows past 10M rows. Bundle preprocessing with the model in a serialized pipeline to prevent train-serve skew. 
## Further reading - @siddiqi2017intelligent: the standard industry reference on scorecard development, with a full treatment of WoE binning and variable selection. - @kullback1951information: the original paper on the KL divergence and its symmetrization. The Jeffreys divergence is what scorecard developers call information value. - @rubin1976inference; @little2019statistical: the formal foundation for missing-data mechanisms and likelihood-based imputation. - @vanbuuren2011mice; @white2011multiple: MICE, the multivariate chained-equations approach to imputation, with guidance for applied researchers. - @tibshirani1996regression; @friedman2010regularization; @zou2005regularization: the LASSO and its elastic-net refinement, including the coordinate-descent algorithm that underlies every modern implementation. - @kursa2010boruta: the Boruta wrapper method. - @navas2020optimal: the mixed-integer programming formulation of optimal monotone binning that the `optbinning` package implements. - @berg2020rise; @gambacorta2024data; @bis2020data: the empirical literature on the marginal information content of digital footprints and transactional data beyond credit bureau scores. - @khandani2010consumer: machine-learning credit models with a rigorous treatment of out-of-sample evaluation. - @lopez2018advances; @bergmeir2018note: lookahead bias and walk-forward validation in financial machine learning. - @brevoort2016credit: the population of credit invisibles and unscored consumers, and why alternative data matters for inclusion. - @avery2003overview: a canonical overview of consumer credit reporting data from the Federal Reserve. 
================================================================================ # Source: chapters/04-metrics.qmd ================================================================================ # Performance Metrics and Model Evaluation **Scope: both retail and corporate.** Discrimination (AUC, KS), calibration (Brier, reliability), and profit metrics. Worked examples on Taiwan default; the metrics themselves are portfolio-agnostic. ## Overview {.unnumbered} A credit score is useful only to the extent that it ranks, calibrates, and pays. Ranking is about discrimination between defaulters and non-defaulters. Calibration is about the scores matching observed default rates. Paying is about the dollars a portfolio gains or loses at a chosen cut-off. This chapter treats each of these three questions formally, derives the standard metrics from first principles, implements them from scratch, and compares the from-scratch code against the production libraries that will be used everywhere else in the book. The chapter is unusually long because the field has accumulated a large collection of conflicting conventions. AUC (@sec-ch04-auc) is the default academic yardstick but is incoherent as a cost measure [@hand2009measuring]. KS (@sec-ch04-ks) is the default regulatory yardstick and is arguably worse for ordering classifiers. Brier (@sec-ch04-brier) is proper but ignores ranking. Profit curves (@sec-ch04-profit) require cost assumptions that most teams never write down. H-measure (@sec-ch04-hmeasure) fixes the coherence problem, but almost nobody uses it. EMP (@sec-ch04-emp) is the right objective for many credit portfolios, but is missing from `sklearn`. A practitioner must know when each one matters. A chapter on metrics is also implicitly a chapter on validation design. Every point estimate of AUC, KS, Brier, PSI, or profit is an estimate from a finite sample, which means every number comes with a standard error that a careful practitioner reports and defends. 
Two teams disagreeing about which of their models is better is almost always a disagreement about variance, not about the point estimate. Most of the interesting arguments in credit-scoring benchmark papers [@baesens2003benchmarking; @lessmann2015benchmarking] turn out to be about the right statistical test, not about the right algorithm. This chapter, therefore, spends as much time on the statistics of model comparison as on the metric formulas. A word for the emerging-market reader. AUC, KS, Brier, and profit-based metrics transplant unchanged, but the operating context does not. In Vietnam and peer markets, thin bureau files mean smaller evaluation samples and wider confidence intervals on every point estimate; macro volatility means that out-of-time validation on a single recent quarter can be misleading; and cost-matrix parameters for profit curves have to be set against local funding cost, local LGD histories, and local collection rules rather than a US credit-card template. A metric dashboard calibrated on US benchmarks will report a healthy AUC at a Vietnamese bank while hiding a calibration drift that moves Circular 41/2016 capital by basis points. This chapter's statistics still apply; the defaults need adjustment. The running datasets are the UCI German Credit file and the UCI Taiwan credit-card default file loaded through `creditutils`. Both come from `load_german_credit()` and `load_taiwan_default()`. For drift and walk-forward experiments we generate a time-stamped synthetic cohort because neither UCI file carries dates. For 10M-row scalability we synthesize Bernoulli labels and Gaussian scores and drive the computation through Dask `delayed` graphs. The point is not that 10 million rows are exotic for a credit portfolio, they are not, but that the same code must run correctly at that scale without rewriting. ### Notation {.unnumbered} We write $Y \in \{0, 1\}$ for the default label, with $Y=1$ meaning default. 
$S$ is a real-valued score or a probability of default. The class-conditional cdfs are $F_0(t) = \Pr(S \le t \mid Y=0)$ and $F_1(t) = \Pr(S \le t \mid Y=1)$. Class priors are $\pi_1 = \Pr(Y=1)$ and $\pi_0 = 1-\pi_1$. A threshold $t$ defines a decision: predict positive if $S > t$. This gives a true positive rate $\mathrm{TPR}(t) = 1 - F_1(t)$ and a false positive rate $\mathrm{FPR}(t) = 1 - F_0(t)$. ## The three questions a credit model must answer Discrimination, calibration, and expected profit are mathematically distinct objects. A model can discriminate perfectly yet be badly miscalibrated. A model can be well calibrated yet still lose money at every threshold because the cost structure is asymmetric. The Hand and Henley review lays out the three-way taxonomy cleanly [@hand1997statistical]. Lessmann and colleagues update it and show that model rankings depend on which question you ask [@lessmann2015benchmarking]. - Discrimination answers: if I draw a random good and a random bad, what is the probability the score ranks them correctly? AUC, Gini, KS, and the H-measure all live here. - Calibration answers: among borrowers with predicted probability $p$, is the observed default rate also $p$? Brier score, reliability diagrams, and isotonic or Platt rescaling live here. - Expected profit answers: given the unit economics of my loan book, what threshold maximizes dollars? Profit curves, cost-sensitive learning, and EMP live here. - Monitoring adds a fourth question, more operational than statistical: does the score distribution this month look like the distribution on which the model was trained? PSI and CSI answer that. - Finally, the chapter closes on validation design and on statistical comparison of two or more classifiers. The reason three distinct questions matter is most visible in a stress scenario. Consider a retail lender that keeps ranking performance (AUC, KS) flat quarter-on-quarter while the macro environment deteriorates, say in a mild recession. 
The portfolio default rate rises from 2 percent to 4 percent. If the scoring model was only validated on ranking, nothing is flagged. If it was validated on calibration, the reliability diagram crosses above the diagonal in every bucket, Brier spikes, and the lender responds by increasing loss allowances. If it was validated on profit, the profit curve at the current threshold is below zero and the lender tightens. Each of the three views gives a different and complementary signal. A governance regime that collapses them into a single number has no chance of detecting the recession fast enough. The logistic baseline reaches an out-of-sample AUC near 0.72 on Taiwan. Gradient boosting lifts it roughly six points, to around 0.78. That gap, which in relative terms is substantial, sets the scale for the rest of the chapter: metrics are not just ranking tools, they are the yardstick on which the return-on-effort of model improvements is measured. A 0.005 AUC difference between logistic and boosting is noise on a dataset of this size. A 0.05 difference is a genuine lift. The DeLong test in @sec-ch04-compare makes that distinction formal. A further pedagogical reason for this dataset: the base rate of 22 percent is closer to a sub-prime or emerging-market book than to a prime retail portfolio, where the base rate is often under 2 percent. Many of the subtleties of metrics in credit scoring only become operationally relevant under class imbalance. A Taiwan-like base rate is near enough to balanced that the textbook formulas work, but far enough from 50-50 that the effect of imbalance on Brier, on KS, and on profit curves is visible. The German Credit file, with its base rate of 30 percent and just 1000 observations, is the pedagogical toy; Taiwan at 30000 observations is the realistic workhorse. ## AUC-ROC and Gini ### Definition and probabilistic reading The ROC curve plots $\mathrm{TPR}(t)$ against $\mathrm{FPR}(t)$ as $t$ sweeps from $+\infty$ to $-\infty$. 
The area under the ROC curve is

$$
\mathrm{AUC} = \int_0^1 \mathrm{TPR}\bigl(\mathrm{FPR}^{-1}(u)\bigr) du.
$$

A cleaner definition, due to @bamber1975area, rewrites AUC as a probability over pairs. Let $S_+$ be the score of a random positive (defaulter) and $S_-$ the score of a random negative (non-defaulter). Then

$$
\mathrm{AUC} = \Pr(S_+ > S_-) + \tfrac{1}{2}\Pr(S_+ = S_-).
$$

This is the classical reading for a credit score: given a random defaulter and a random non-defaulter, the AUC is the probability that the model ranks the defaulter above the non-defaulter. Because scoring practice usually wants non-defaulters ranked higher, the convention is often flipped. It changes nothing substantive: AUC is invariant under strictly increasing transforms of the score, and reversing the score direction together with the positive-class label returns the same number (reversing the score alone maps AUC to $1 - \mathrm{AUC}$).

The Gini coefficient is the standard credit-bureau restatement,

$$
\mathrm{Gini} = 2\cdot\mathrm{AUC} - 1,
$$

which maps random to 0 and perfect to 1. Gini is widely reported in model development documents in European and Asian retail-credit shops, while AUC is preferred in academic machine learning and in US model risk documents. Both carry the same information.

### Deriving AUC from Mann-Whitney U

The connection between @eq-auc-prob and the Mann-Whitney U statistic [@mann1947test] is exact. Let $m = |\{i : y_i = 1\}|$ and $n = |\{i : y_i = 0\}|$. Let $R_+$ be the sum of ranks of the positive-class scores when all $m+n$ scores are ranked from smallest to largest. Mann-Whitney U is

$$
U = R_+ - \tfrac{m(m+1)}{2},
$$

and the empirical AUC is

$$
\widehat{\mathrm{AUC}} = \frac{U}{m\cdot n}.
$$

Equation @eq-auc-mw has three practical consequences. First, AUC requires only ranks, so ties are handled by average ranking. Second, the computational cost is dominated by a sort, giving $O((m+n)\log(m+n))$. Third, the sampling variance of $\widehat{\mathrm{AUC}}$ can be derived from the variance of $U$, which is the trick DeLong uses for inference [@delong1988comparing].
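The rank identity above can be checked on a toy sample. The following is a minimal sketch, not the book's reference implementation: it uses only the standard library, the helper names (`average_ranks`, `auc_mann_whitney`, `auc_pairwise`) are illustrative, and the eight labels and scores are made up. It computes the AUC once from average ranks and once by brute-force counting over all positive-negative pairs, with ties counted as one half.

```python
from itertools import product

def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the block of scores tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def auc_mann_whitney(y_true, scores):
    """AUC = U / (m n), with U = R_+ - m(m+1)/2 and R_+ the positive-class rank sum."""
    ranks = average_ranks(scores)
    m = sum(y_true)              # number of positives (defaulters)
    n = len(y_true) - m          # number of negatives
    r_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    u = r_pos - m * (m + 1) / 2
    return u / (m * n)

def auc_pairwise(y_true, scores):
    """Pr(S+ > S-) + 0.5 Pr(S+ = S-), by O(m n) enumeration of all pairs."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p, q in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Made-up example with one tie between a positive and a negative (score 0.4).
y = [1, 0, 1, 0, 0, 1, 0, 1]
s = [0.9, 0.1, 0.4, 0.4, 0.35, 0.8, 0.7, 0.65]
print(auc_mann_whitney(y, s), auc_pairwise(y, s))
```

The pairwise version is quadratic and only useful as a check; the rank version inherits the cost of the sort.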
### From-scratch implementation The agreement is to eight decimal places, which is as close as 64-bit floats get on this sample size. The inequality $|\hat{A}_{\text{MW}} - \hat{A}_{\text{sk}}| < 10^{-9}$ is a cheap regression test we will reuse in later chapters. ### Interpretation and a warning AUC has a third reading, often forgotten: it is also the probability that a randomly chosen observation is correctly classified when the threshold is itself drawn uniformly at random from the set of scores [@hand2013area]. Hand's argument against AUC as a scalar summary rests on this: the implicit weighting over thresholds depends on the classifier's score distribution, and therefore on the classifier itself. That weighting is not a user-chosen cost function. It is an artifact of the model. Two models compared by AUC are being compared under two different implicit cost distributions. The H-measure (@sec-ch04-hmeasure) in @hand2009measuring fixes this. ### Partial AUC Before getting to the partial variant, it helps to restate what the ROC curve actually draws, because the rest of this section is a claim about *which part* of that curve matters for a credit decision. The notation was fixed at the start of the chapter, but the quick reminder is: - **TPR** (true positive rate, also called *sensitivity* or *recall*) at threshold $t$ is the fraction of actual defaulters the model flags as risky, $\mathrm{TPR}(t) = \Pr(S > t \mid Y=1)$. Higher is better: it is the share of the bad book you caught. - **FPR** (false positive rate, $1 - \text{specificity}$) at threshold $t$ is the fraction of actual non-defaulters the model wrongly flags as risky, $\mathrm{FPR}(t) = \Pr(S > t \mid Y=0)$. Lower is better: it is the share of the good book you turned away. 
- The **ROC curve** (Receiver Operating Characteristic, a name inherited from WWII radar detection) is the parametric plot of $\mathrm{TPR}(t)$ on the $y$-axis against $\mathrm{FPR}(t)$ on the $x$-axis as the threshold $t$ sweeps from $+\infty$ (deny nobody, both rates at 0) to $-\infty$ (approve nobody, both rates at 1). A useful model bows up into the top-left corner: high TPR at low FPR. A coin-flip model tracks the diagonal.

The full AUC in @eq-auc-def is the area under this whole curve. The partial AUC is the same integral, restricted to a slice of that curve:

$$
\mathrm{pAUC}(a, b) = \int_a^b \mathrm{TPR}\bigl(\mathrm{FPR}^{-1}(u)\bigr) du,
$$

where $\mathrm{FPR}^{-1}(u)$ is the threshold that produces false-positive rate $u$, so the integrand is just "the TPR you get when the FPR is $u$". Integrating from $a$ to $b$ accumulates TPR over the FPR band $[a, b]$ (the average TPR on that band, up to the factor $b - a$) and ignores the rest of the curve.

The motivation is that full AUC averages TPR over *every* possible FPR from 0 to 1, which is operationally absurd for a lender. A threshold that produces FPR = 0.9 rejects 90 percent of the good customers in order to catch the last few defaulters; no bank would ever deploy a model there, so performance in that region is economically irrelevant, yet full AUC counts it with equal weight. Partial AUC zeroes the contribution of thresholds the business will not use. The usable region in credit scoring is the low-FPR end: lenders reject few good customers, which means low FPR, and accept whatever TPR that buys them.

A concrete example. Suppose a lender's current policy approves roughly the top 60 percent of applicants by score. On a book with a 3 percent default rate, those 40 percent of declined applicants are overwhelmingly good customers, so the operating FPR is near 0.4. Anything beyond FPR = 0.4 corresponds to cut-offs more aggressive than the bank would ever use.
Reporting $\mathrm{pAUC}(0, 0.4)$ captures every cut-off the credit committee would actually consider, and nothing else. This makes pAUC a cheap, practical approximation to the H-measure (@sec-ch04-hmeasure), which formalizes the same "only count thresholds you would actually pick" idea through a cost-distribution prior. pAUC replaces that prior with a hard window: weight 1 inside $[a, b]$, weight 0 outside. It is crude but easy to explain to a non-technical audience, which is why it shows up in model-validation reports when H-measure does not. Two implementation notes. First, the raw $\mathrm{pAUC}(a, b)$ has an awkward scale. It lies in $[0, b - a]$, so for $a = 0, b = 0.4$, a perfect classifier scores 0.4 and random scores 0.08, which is hard to read. @mcclish1989analyzing proposed the standard rescaling: $$ \mathrm{pAUC}_{\text{norm}}(a, b) = \tfrac{1}{2}\left[1 + \frac{\mathrm{pAUC}(a, b) - \tfrac{1}{2}(b^2 - a^2)} {(b - a) - \tfrac{1}{2}(b^2 - a^2)}\right], $$ which maps random to 0.5 and perfect to 1, matching the scale of full AUC. This is the number `sklearn.metrics.roc_auc_score` returns when called with the `max_fpr` argument (which sets $a = 0$ and $b$ equal to `max_fpr`). Second, a warning before anyone puts pAUC into a production scorecard document: the choice of $[a, b]$ is a modeling decision and should be justified from the book's operating policy, not tuned to make the model look good. Two teams reporting pAUC on the same model with different FPR windows will get different numbers; without the window, the metric is ambiguous. Always report the window alongside the statistic: "pAUC(0, 0.4) = 0.84, McClish-normalized", not just "pAUC = 0.84". When the business question is narrow and the operating point is known, pAUC is often a better summary than full AUC. When the operating point is unknown or the model will be used across many regimes, full AUC or the H-measure is safer. 
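The raw and McClish-normalized partial AUC can be sketched in a few lines. This is an illustrative standard-library version, assuming $a = 0$, the chapter's convention that a higher score means riskier, and trapezoidal integration with linear interpolation at the window edge. The function names and the ten-observation example are hypothetical, and the sketch does not call scikit-learn.

```python
def roc_points(y_true, scores):
    """ROC vertices (FPR, TPR) as the threshold sweeps from high to low scores."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    m = sum(y_true)            # positives (defaulters)
    n = len(y_true) - m        # negatives
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (s, y) in enumerate(pairs):
        tp += y
        fp += 1 - y
        # emit a vertex only after the last observation sharing this score
        if i + 1 == len(pairs) or pairs[i + 1][0] != s:
            fpr.append(fp / n)
            tpr.append(tp / m)
    return fpr, tpr

def partial_auc(fpr, tpr, b):
    """Trapezoid area under the ROC for FPR in [0, b], interpolating TPR at b."""
    area = 0.0
    for x0, y0, x1, y1 in zip(fpr, tpr, fpr[1:], tpr[1:]):
        if x0 >= b:
            break
        if x1 > b:  # segment crosses the window edge: cut it at FPR = b
            y1 = y0 + (y1 - y0) * (b - x0) / (x1 - x0)
            x1 = b
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def mcclish(pauc, a, b):
    """McClish rescaling: random maps to 0.5, perfect maps to 1."""
    chance = (b * b - a * a) / 2           # area under the diagonal on [a, b]
    return 0.5 * (1 + (pauc - chance) / ((b - a) - chance))

# Made-up scores: 4 defaulters, 6 non-defaulters.
y = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
s = [0.95, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
fpr, tpr = roc_points(y, s)
raw = partial_auc(fpr, tpr, 0.4)
print(f"pAUC(0, 0.4) = {raw:.4f}, McClish-normalized = {mcclish(raw, 0.0, 0.4):.4f}")
```

Printing the window next to the statistic, as the last line does, mirrors the reporting advice above.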
### Sampling variance

The asymptotic variance of $\widehat{\mathrm{AUC}}$ under the non-parametric model, due to @hanley1982meaning, is

$$
\widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}}) = \frac{\hat A(1-\hat A) + (m-1)(Q_1 - \hat A^2) + (n-1)(Q_2 - \hat A^2)}{m n},
$$

with $Q_1 = \hat A / (2 - \hat A)$ and $Q_2 = 2\hat A^2/(1+\hat A)$. For a Taiwan-like sample with $m \approx 2000$ positives and $n \approx 7000$ negatives at $\hat A = 0.78$, plugging in gives $Q_1 \approx 0.639$, $Q_2 \approx 0.684$, and a standard error of about 0.0065, corresponding to a 95 percent interval of roughly $[0.77, 0.79]$. The bootstrap and DeLong standard errors in @sec-ch04-compare should both land in this neighborhood.

For pure ranking, AUC is defensible. For any decision that depends on a threshold, ranking is not enough.

## Kolmogorov-Smirnov statistic

### Definition and history

KS has become the dominant metric in US consumer-credit regulation and in the risk dashboards of every retail bank. It is the maximum vertical gap between the class-conditional cdfs,

$$
\mathrm{KS} = \sup_t \bigl|F_1(t) - F_0(t)\bigr|,
$$

an application of the classical two-sample statistic of @kolmogorov1933sulla and @smirnov1948table. In terms of ROC coordinates, it is the maximum vertical distance between the ROC curve and the diagonal,

$$
\mathrm{KS} = \sup_t \bigl(\mathrm{TPR}(t) - \mathrm{FPR}(t)\bigr).
$$

Given scored observations sorted in ascending order, the empirical KS is the largest gap between the cumulative fractions of bad and good borrowers at any threshold. Practitioners often report the score bucket at which the maximum gap occurs and use it as an operating point.

### From-scratch implementation

The KS value on the Taiwan logistic baseline is roughly 0.37.
Intuitively, at the score threshold where the gap is largest, the model rejects 37 percentage points more of the defaulters than of the non-defaulters: for example, at that cut-off it might reject 60% of the true bads while only rejecting 23% of the true goods ($\mathrm{TPR}-\mathrm{FPR}=0.37$). ### The geometric link to AUC Both KS and AUC integrate over the ROC curve, but differently. Gini can be written as $$ \mathrm{Gini} = 2\int_0^1 \bigl(\mathrm{TPR}(u) - u\bigr) du, $$ so Gini is (twice) the *mean* vertical distance of the ROC curve above the diagonal, whereas KS is its *maximum*. Because one summary is an average and the other is a peak, two classifiers can have the same Gini and very different KS, or the same KS and very different Gini. - *Same Gini, different KS.* Model A has an ROC curve that bulges uniformly above the diagonal, giving a moderate gap at every threshold. Model B has an ROC curve that spikes sharply in one region and sits close to the diagonal elsewhere. The two areas under the curve can match exactly, so their Gini agrees, yet Model B's peak gap (its KS) is taller because all of its separating power is concentrated at one cut-off. - *Same KS, different Gini.* Two models can reach the same peak TPR$-$FPR at some threshold, but one keeps that gap wide across a large range of thresholds (a fat ROC curve, higher Gini) while the other drops back to the diagonal immediately on either side of the peak (a narrow spike, lower Gini). @fig-gini-vs-ks makes both cases concrete. Four piecewise-linear ROC curves are constructed by hand so the arithmetic is transparent. In the left panel, two models with the same Gini land at the same area under the curve, yet the red spike delivers a KS of 0.50 against the blue bulge's 0.30. In the right panel, two models touch the diagonal-gap ceiling at the same FPR, and both report KS of 0.50, yet the wider green ROC carries a Gini of 0.61 while the narrow orange triangle registers 0.50. 
The vertical bars mark the KS point on each curve. The operational lesson is that KS rewards a model that separates well at one particular threshold, while AUC rewards average separation across all thresholds. If the business runs a single accept/reject policy at a known cut-off, KS near that cut-off is the relevant number; if the model is used across many cut-offs (risk-based pricing, tiered limits, challenger testing), AUC or Gini is more faithful to how the scorecard is actually consumed. A classic failure mode is celebrating a model with the highest KS in a validation deck, then deploying it at a business cut-off that sits far from the KS-maximizing threshold, where a rival model with a lower KS but a flatter, fatter ROC curve would have done better. A common trap: KS-optimizing a classifier silently chooses an operating point. If business uses a different cut-off, that KS is operationally irrelevant. ### The score bucket at which KS is maximized Banks often report the decile or score bucket at which the KS gap occurs, and adopt that bucket as the cut-off. The practice is defensible when the KS cut-off aligns with the unit economics of the portfolio. When the profit-maximizing threshold is somewhere else, the KS cut-off is merely a convenient statistical landmark with no financial interpretation. The KS of a random scorer is zero in expectation, and its sampling distribution under the null is the two-sample Kolmogorov distribution. Critical values depend only on the sample sizes $m, n$, $$ \Pr\bigl(\mathrm{KS} > c\bigr) \approx 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 c^2 \frac{mn}{m+n}}. $$ In practice, the KS of a credit model is orders of magnitude above the null, so the critical-value test is not useful for model validation. The two-sample KS is, however, useful for detecting distribution shift at the feature level, a cheap complement to PSI for continuous variables. 
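Both the empirical KS and the null tail probability above are short computations. The sketch below is illustrative only: it uses the standard library, hypothetical function names, and made-up data. The gap is compared in integer arithmetic so that floating-point rounding cannot change which cut-off is reported. The series is truncated at 100 terms and clipped to $[0, 1]$, which is an assumption of this sketch; the asymptotic formula itself is unreliable for very small $c$.

```python
import math

def ks_statistic(y_true, scores):
    """Largest gap between cumulative bad and good fractions over ascending cut-offs.

    Returns (KS, score at which the gap is first attained).
    """
    pairs = sorted(zip(scores, y_true))
    m = sum(y_true)            # bads (defaulters)
    n = len(y_true) - m        # goods
    cum_bad = cum_good = 0
    best_num, best_cut = 0, None
    for i, (s, y) in enumerate(pairs):
        cum_bad += y
        cum_good += 1 - y
        # evaluate the gap only after the last observation tied at this score
        if i + 1 == len(pairs) or pairs[i + 1][0] != s:
            # |cum_bad/m - cum_good/n| scaled by m*n, kept as an exact integer
            num = abs(cum_bad * n - cum_good * m)
            if num > best_num:
                best_num, best_cut = num, s
    return best_num / (m * n), best_cut

def ks_tail_prob(c, m, n, terms=100):
    """Asymptotic Pr(KS > c) for a random scorer with m bads and n goods."""
    lam = c * c * m * n / (m + n)
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))

# Made-up sample: 5 bads, 5 goods, scores already in ascending order.
y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
s = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ks, cut = ks_statistic(y, s)
print(f"KS = {ks:.2f} at score {cut}")
print("null tail, tiny sample:", ks_tail_prob(ks, 5, 5))
print("null tail, KS = 0.37 with m = 2000, n = 7000:", ks_tail_prob(0.37, 2000, 7000))
```

On a sample of realistic size the tail probability is numerically indistinguishable from zero, which is the sense in which the critical-value test carries no information for model validation.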
### Why practitioners cling to KS KS is appealing because it maps cleanly to a business decision: the gap between cumulative bads and goods at a threshold is the headline number on every credit-policy deck. It is also the natural number to plot against score deciles. Banks have used KS for 50 years, and every downstream process (policy rules, pricing matrices, recovery operations) is engineered around a KS-selected cut-off. The consequence is path-dependence: even when AUC or H-measure is a better metric, a bank cannot easily switch because the downstream plumbing assumes a single KS cut-off. Any serious metric overhaul must therefore include a policy migration plan. ## The H-measure ### Why AUC is incoherent @hand2009measuring points out that when we compare two classifiers $A$ and $B$ by AUC we are implicitly averaging misclassification loss over thresholds with a different weight function for each classifier. The weight is the score distribution itself, which changes when the classifier changes. That makes comparisons by AUC non-transitive in cost terms. Hand calls it incoherent, in the sense used by Bayesian statisticians for non-axiomatic procedures. Hand and Anagnostopoulos return to the problem and sharpen the critique [@hand2013area]. The H-measure replaces the classifier-dependent weighting by a user-specified prior $w(c)$ over the cost ratio $c$, where $c$ represents the relative cost of a false positive. Practitioners in banking usually pick a Beta prior concentrated around sensible ranges. The default Beta(2, 2) gives equal weight to both error directions and peaks near $c=0.5$, which corresponds to equal costs. ### Derivation **Step 1: costs on a single scale.** A false positive (a good flagged as bad) costs $c_{FP}$; a false negative (a bad accepted as good) costs $c_{FN}$. Only the ratio matters for ranking thresholds, so rescale the two costs to sum to one and write $c = c_{FP}/(c_{FP}+c_{FN}) \in (0,1)$. 
Then $c_{FP} = c$ and $c_{FN} = 1-c$, a single scalar. $c = 0.5$ is the symmetric case; $c \to 1$ penalizes false positives almost exclusively, $c \to 0$ penalizes false negatives almost exclusively. **Step 2: expected loss at a threshold.** With the notation fixed at the start of the chapter (predict positive when $S > t$), the two error probabilities for a randomly drawn subject are - false positive: $\Pr(S > t,\ Y=0) = \pi_0 (1 - F_0(t))$, - false negative: $\Pr(S \le t,\ Y=1) = \pi_1 F_1(t)$. The $\pi_0$ and $\pi_1$ appear because a FP requires the subject to *be* a good in the first place ($Y=0$, probability $\pi_0$) and *then* fall on the wrong side of the threshold (probability $1-F_0(t)$). Same for the FN. Multiplying each error probability by its cost and summing: $$ \mathcal{L}(t, c) = \pi_0 c (1-F_0(t)) + \pi_1 (1-c) F_1(t). $$ This is the expected per-subject loss of using threshold $t$ under cost ratio $c$. **Step 3: optimal threshold for a given** $c$. The decision-maker picks $t$ to minimize @eq-hloss-threshold, giving the cost-conditional Bayes threshold $$ t^*(c) = \arg\min_t \bigl\{\pi_0 c (1-F_0(t)) + \pi_1 (1-c) F_1(t)\bigr\}. $$ The minimized loss is $$ L(c) = \pi_0 c (1-F_0(t^*(c))) + \pi_1 (1-c) F_1(t^*(c)). $$ As $c$ sweeps from 0 to 1, $t^*(c)$ traces out the ROC-convex-hull operating points: high $c$ (costly FP) drives $t^*$ up so few subjects get flagged; low $c$ drives $t^*$ down. **Step 4: trivial baselines.** Two threshold-free classifiers bracket the problem: - *Accept everyone* ($t = +\infty$): no FP, every bad missed. Loss $= \pi_1 (1-c)$. - *Reject everyone* ($t = -\infty$): every good flagged, no FN. Loss $= \pi_0 c$. A decision-maker would use whichever of the two is cheaper at the cost ratio $c$, so the best the trivial rule can do is $$ L_{\max}(c) = \min\{\pi_0 c,\ \pi_1 (1-c)\}. 
$$ The two lines cross at $c^\dagger = \pi_1/(\pi_0+\pi_1) = \pi_1$: left of $c^\dagger$ reject-everyone is cheaper, right of $c^\dagger$ accept-everyone is cheaper. $L_{\max}$ is a triangular tent with peak $\pi_0 \pi_1$. Any useful classifier must beat this at every $c$ where we care, i.e., $L(c) \le L_{\max}(c)$. **Step 5: averaging over cost ratios.** A single value of $c$ is rarely known, so integrate $L(c)$ against a user-specified prior $w(c)$ on $(0,1)$. The *loss gap* is $L_{\max}(c) - L(c)$, the savings over the trivial rule at cost $c$. Normalizing this average gap by the average trivial loss gives the H-measure: $$ H = 1 - \frac{\int_0^1 L(c) w(c) dc}{\int_0^1 L_{\max}(c) w(c) dc}. $$ **Step 6: bounds and corner cases.** Because $0 \le L(c) \le L_{\max}(c)$ pointwise, the ratio lies in $[0,1]$, so $H \in [0,1]$: - $H = 1$ when $L(c) = 0$ for $w$-almost every $c$, which requires the score to separate the classes perfectly ($F_0$ and $F_1$ have disjoint support). - $H = 0$ when $L(c) = L_{\max}(c)$ for $w$-almost every $c$, i.e., the classifier is never better than picking the cheaper trivial rule. A random score achieves this in expectation because $t^*(c)$ under a random score collapses to one of the two trivial thresholds. - Values in between measure the fraction of the trivial-rule loss the classifier recovers, averaged under $w$. The weighting $w$ is the one thing the user controls. Hand's default is $\mathrm{Beta}(2,2)$, centered on $c=0.5$ with light tails. A bank with a calibrated estimate of its FP/FN cost ratio should pick a $w$ tightly concentrated near that value; a regulator auditing a portfolio across many use cases should pick a broader $w$. ### From-scratch implementation The random scorer gets $H \approx 0$, the perfect scorer $H = 1$, and the Taiwan logistic model sits well inside the unit interval. On the same dataset, boosting wins against logistic under both H-measure and AUC, which is reassuring. 
That agreement is not automatic, and the reason comes straight from the two metrics' definitions: - AUC $= \int_0^1 \text{TPR}(u) du$ treats every FPR equally. The implicit weight on the cost ratio $c$ is the classifier's own score density [@hand2009measuring], so two classifiers are effectively weighed on different scales. - H integrates the Bayes loss $L(c)$ against a *fixed* user prior $w(c)$. Each $c$ pins down one operating point on the ROC convex hull, specifically the tangent with slope $\pi_0 c / (\pi_1 (1-c))$. When one ROC sits weakly above the other at every FPR the classifier is Pareto-dominant: $\text{TPR}_A(u) \ge \text{TPR}_B(u)$ for all $u$ forces both $\text{AUC}_A \ge \text{AUC}_B$ and $L_A(c) \le L_B(c)$ at every $c$, so AUC and H must agree. The interesting case is when the ROCs cross: one classifier is better in a low-FPR region (tight-credit regime, high $c$) and worse in a high-FPR region (loose-credit regime, low $c$), or vice versa. AUC's uniform average over FPR and H's $w$-weighted average over $c$ then emphasize different slices of the curve, and the winner flips. The "When H-measure changes the ranking" example below builds two classifiers with identical AUC but opposite regime strengths, and shows the H rank flip as $w(c)$ shifts from low $c$ to high $c$. ### When H-measure changes the ranking The previous subsection argued abstractly that crossing ROCs can cause AUC and H to disagree. This subsection *builds* two such classifiers on purpose, then walks through the graphics to show why the rank flips. **Construction.** Pick a synthetic population with $\pi_1 = 0.3$. Build two scores on the same labels, each calibrated so the ROCs cross and the AUCs nearly match: - **Model A (top-loaded).** Half of the positives are "obvious," their score is drawn from $\mathcal{N}(4.5, 0.3^2)$, far above everyone else. The remaining positives look like the negatives, $\mathcal{N}(0, 1)$. 
The ROC shoots up to $\mathrm{TPR} \approx 0.5$ at almost zero FPR, then runs along the diagonal. Good when the business only keeps the very top of the ranked list. - **Model B (uniform shift).** Every positive gets the same moderate boost: $\mathcal{N}(1, 1)$ versus $\mathcal{N}(0, 1)$ for negatives. The ROC is smoothly concave: no fast start, but a better climb once you are willing to tolerate some FPR. Good when the business operates at moderate-to-high flag rates. **Graphic 1: the ROC crossing.** Left panel shows the full ROC, right panel zooms into the low-FPR corner. Model A's ROC lifts vertically in the first 1% of FPR, reaching about 0.5 TPR almost for free. Beyond that it is essentially random: the non-obvious positives are pure noise. Model B's ROC is boring but steady, overtaking A once enough FPR budget is available. **Graphic 2: Bayes loss as a function of cost ratio.** The H-measure integrand is $L(c)$. Compute it for both models on the same cost grid, overlay the trivial-rule tent $L_{\max}(c)$, and mark the two priors' centers of mass. The left panel tells the story. $L_A(c)$ drops well below $L_{\max}$ in the right tail (high $c$), because A's obvious-positives block means you can flag defaulters without flagging any negatives, exactly what you want when FP is expensive. But $L_A(c)$ hugs $L_{\max}$ in the middle and left. Once the obvious positives are taken, A's remaining score is random, so no improvement is available. $L_B(c)$ sits below $L_{\max}$ across the interior but never as low as $L_A$ in the right tail. The right panel shows the two priors that will weight these $L(c)$ curves. Beta(10, 2) concentrates mass near $c = 0.83$ (right tail, where A wins), Beta(2, 10) near $c = 0.17$ (left tail, where B wins). **Graphic 3: the integrand.** The H-measure is not just $L(c)$ but $L(c) w(c)$ integrated, then normalized. Plotting the integrand makes the flip unmistakable. 
Under Beta(10, 2), the blue curve ($L_A w$) sits well below the red curve ($L_B w$) where the prior has mass, so A integrates to less loss and wins H. Under Beta(2, 10), the same comparison reverses. **Graphic 4: H under a sweep of priors.** To drive it home, sweep the Beta prior's mean across $(0, 1)$ and plot $H_A$ and $H_B$ as functions of the mean. The crossover is where the ranking flips. Three observations from this last figure: 1. **AUC lives on a horizontal line.** The dashed lines are Gini = $2\text{AUC}-1$ (a monotone transform of AUC). They ignore the prior: AUC gives one number regardless of which cost regime the business operates in. On this dataset Gini ranks B above A. 2. **H ranks A above B across most of the prior-mean axis.** For any prior with $\mathbb{E}[c] \gtrsim 0.15$, including the symmetric Beta(2, 2), H prefers A. That already contradicts the AUC ranking. 3. **Crossover is in the low-**$c$ tail. Only when the prior concentrates very heavily on $c < 0.15$ does H agree with AUC that B is better. A bank with a symmetric or FP-costly prior should pick A; a lender operating in an extreme FN-costly regime (mid-tier subprime, for example) should pick B. **Numerical summary.** **Why the regimes map to priors the way they do.** The cost ratio $c$ is the cost of a false positive (flagging a good applicant as bad). Two concrete scenarios: - A politically constrained prime lender with a low default rate, audited on fair-lending, pays a high reputational cost every time it rejects a creditworthy applicant. For this lender, $c$ is large and the right prior is Beta(10, 2), concentrated near 0.83. The lender will only flag applicants it is very confident about (operating at low FPR). **Model A's obvious-positives block wins**, because it lets the lender flag roughly half of the defaulters while flagging virtually no goods. 
- A subprime lender on a near-break-even book of loans with a 20% default rate cannot afford to accept defaulters; each one wipes out the margin on many good loans. For this lender $1 - c$ is large (FN is expensive), so $c$ is small and the right prior is Beta(2, 10), concentrated near 0.17. The lender tolerates a high FPR in exchange for catching almost every defaulter. **Model B's uniform separation wins**, because its ROC keeps rising past the point where A's ROC flattens at the diagonal.

AUC reports a single number that averages these two regimes under a weighting each classifier gets to pick for itself [@hand2009measuring]. The H-measure forces the bank to state its weighting up front and answers the question *that* bank actually has.

### Implementation notes

**Existing packages.** The R `hmeasure` package on CRAN is the reference implementation. For Python, `pip install hmeasure` (PyPI: `hmeasure` 0.1.6, last updated 2021) gets you a direct translation of the R code. Two things to know before relying on it:

1. **Constrained score range.** The package requires `y_score` to fall in the label range: for 0/1 labels, `y_score ∈ [0, 1]`. Raw logits, z-scores, or any score outside that interval are rejected. You must rescale first.
2. **One-parameter Beta family.** The only prior control is `severity_ratio` $= \text{cost}_{FN}/\text{cost}_{FP} = (1-c)/c$. Internally it maps to $\alpha = 2,\ \beta = 1 + 1/\text{severity\_ratio}$, so the prior is always in the family Beta$(2, b)$ with $b \ge 1$. Priors with $\alpha \ne 2$, such as the Beta(10, 2) used in the rank-flip demo, cannot be expressed. The default `severity_ratio=None` sets the ratio to $\pi_1/\pi_0$, giving Beta$(2, 1+\pi_0/\pi_1)$.

The custom `h_measure(y_true, y_score, alpha, beta)` above takes arbitrary $\alpha, \beta$ and accepts any real-valued score, which is why we used it for the rank-flip example.
It agrees with the pip package on the priors the package can express. The differences are at the $10^{-6}$ level, attributable to our grid-based trapezoid integration versus the package's closed-form cdf evaluation. Either implementation is fine in practice.

**Three "default" priors in the literature.** Be careful to cite which one you report.

| Default | $(\alpha, \beta)$ | Rationale | Source |
|------------------|------------------|------------------|------------------|
| Symmetric | $(2, 2)$ | No prior opinion on $c$ | @hand2009measuring, §5 |
| Mean-at-$\pi_1$ | $(2,\ 2\pi_0/\pi_1)$ | Prior mean equals base rate; costs proportional to priors | @hand2009measuring, §5.1 |
| Severity-ratio | $(2,\ 1+\pi_0/\pi_1)$ | Package default; prior mode equals base rate | R/Python `hmeasure` default |

For Taiwan-like $\pi_1 = 0.22$: Beta(2, 2), Beta(2, 7.09), Beta(2, 4.55). The three disagree by a few percent on the same scorecard, so consistency matters more than the specific choice. If you have a *calibrated* cost ratio, use it and skip the family altogether.

**Two further subtleties.** First, the H-measure is not a strict improvement over AUC in every regime. When the ROC curves of two classifiers are well separated (one Pareto-dominates the other), AUC and H agree and the extra complexity of the prior is not buying anything. H earns its keep when ROCs cross, which is precisely the regime where AUC's implicit weighting is most misleading. The rank-flip example above is the clean demonstration.

Second, H is a ratio, and ratios misbehave when the denominator shrinks. Recall the construction:

$$
H = 1 - \frac{\int L(c) w(c) dc}{\int L_{\max}(c) w(c) dc},
$$

where the numerator is the model's expected loss under the prior $w(c)$ and the denominator is the expected loss of the *trivial* benchmark (classify everyone the same way).
A standard form of the benchmark loss is $L_{\max}(c) = \min\{\pi_0 c, \pi_1 (1-c)\}$: at small $c$ (below $\pi_1$), rejecting everyone is optimal and the loss is $\pi_0 c$; at large $c$, accepting everyone is optimal and the loss is $\pi_1(1-c)$. Either way, $L_{\max}(c) \to 0$ as $c \to 0$ or $c \to 1$. The function is pinned to zero at both corners.

That is where trouble starts. If the prior $w(c)$ concentrates almost all of its mass near a corner, it is integrating $L_{\max}$ precisely over the region where $L_{\max}$ is nearly zero. The denominator then shrinks toward zero, and H becomes a small number divided by a small number: tiny numerical perturbations in the numerator (bin edges, grid spacing, a single extra observation near the cut) can swing H by a lot.

Concretely:

- **Beta(2, 2), Beta(2, 7), Beta(2, 4.5)**: the three defaults tabled above, all put substantial mass in the interior of $(0,1)$, so the denominator is comfortably away from zero and H is stable.
- **Beta(2, 200)** has mean $\approx 0.01$ and puts essentially all mass in $c \in (0, 0.05)$. The denominator integrates $L_{\max}$ over a region where $L_{\max}(c) \le \pi_0 c \le 0.05\,\pi_0$, a very small number. H computed from such a prior is numerically fragile; reporting it to three decimals is false precision.

Extreme class imbalance is the regime where this bites. For fraud detection with $\pi_1 = 0.005$:

- Mean-at-$\pi_1$ gives $\beta = 2\pi_0/\pi_1 = 2 \cdot 0.995 / 0.005 \approx 398$, i.e., Beta(2, 398) with prior mean $\approx 0.005$.
- Severity-ratio gives $\beta = 1 + \pi_0/\pi_1 \approx 200$, i.e., Beta(2, 200) with mean $\approx 0.01$.

Both formulas push the prior hard into the left corner, exactly where the denominator is near-zero and H loses stability. The mechanical rule "just plug $\pi_1$ into the default formula" stops being safe here.
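The ratio structure of H and the shrinking denominator can be sketched numerically. This is a minimal illustration, not the chapter's `h_measure` and not the `hmeasure` package. It takes a list of ROC vertices instead of raw scores, uses a midpoint rule on a fixed grid of cost ratios, and runs on a hypothetical three-vertex ROC with a Taiwan-like $\pi_1 = 0.22$. All function names here are assumptions of the sketch.

```python
import math

def beta_pdf(c, a, b):
    """Beta(a, b) density at c in (0, 1), evaluated on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(c) + (b - 1) * math.log1p(-c))

def trivial_loss(c, pi1):
    """L_max(c): the cheaper of reject-everyone (pi0 c) and accept-everyone (pi1 (1 - c))."""
    return min((1 - pi1) * c, pi1 * (1 - c))

def bayes_loss(c, pi1, roc):
    """L(c): minimum over ROC vertices of pi0 c FPR + pi1 (1 - c) (1 - TPR)."""
    return min((1 - pi1) * c * fpr + pi1 * (1 - c) * (1 - tpr) for fpr, tpr in roc)

def h_measure(pi1, roc, a, b, grid=20_000):
    """Return (H, denominator) under a Beta(a, b) prior, via a midpoint rule on (0, 1).

    `roc` must contain the trivial vertices (0, 0) and (1, 1).
    """
    num = den = 0.0
    for k in range(grid):
        c = (k + 0.5) / grid
        w = beta_pdf(c, a, b)
        num += bayes_loss(c, pi1, roc) * w
        den += trivial_loss(c, pi1) * w
    return 1 - num / den, den / grid

pi1 = 0.22                                      # Taiwan-like base rate
toy_roc = [(0.0, 0.0), (0.2, 0.7), (1.0, 1.0)]  # hypothetical one-kink ROC
priors = {
    "Beta(2, 2)": (2, 2),
    "Beta(2, 4.55)": (2, 1 + (1 - pi1) / pi1),  # severity-ratio default
    "Beta(2, 7.09)": (2, 2 * (1 - pi1) / pi1),  # mean-at-pi1 default
    "Beta(2, 200)": (2, 200),                   # corner-concentrated prior
}
for name, (a, b) in priors.items():
    h, den = h_measure(pi1, toy_roc, a, b)
    print(f"{name:14s} H = {h:.4f}   denominator = {den:.5f}")
```

Under these assumptions the denominator for Beta(2, 200) comes out more than an order of magnitude smaller than for Beta(2, 2). That gap is why small perturbations in the numerator move H so much more under a corner-concentrated prior.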
The robust practice in very-imbalanced settings is to **report H under several priors** (e.g., the package default, a symmetric Beta(2, 2), and one prior derived from a business-stated cost ratio) and treat large disagreements among them as information about the comparison, not as a number to be averaged away. If a single-number summary is required, justify the choice of prior explicitly rather than inheriting a default that happens to land in the unstable region. ## Brier score, reliability, and calibration ### From ranking to probability AUC, KS, and H measure only how a score orders observations. They say nothing about whether a predicted probability of 0.15 corresponds to a 15 percent default rate in the data. In credit scoring that gap matters. IFRS 9 and CECL both require expected credit losses stated in probability units [@ifrs9; @cecl]. Capital under Basel IRB is a function of calibrated PD [@basel2006international]. A score that ranks well but is miscalibrated lets lenders set the wrong reserves and the wrong interest rate. The Brier score [@brier1950verification] is the mean squared error of the probabilistic prediction, $$ \mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\bigl(p_i - y_i\bigr)^2, $$ where $p_i = \Pr(Y=1 \mid \mathbf{x}_i)$ is the forecast probability and $y_i \in \{0,1\}$ is the realized label. Brier is a strictly proper scoring rule [@gneiting2007strictly]: it is minimized when the forecaster reports her true conditional probability. ### The Murphy decomposition @murphy1973new showed that the Brier score admits a canonical decomposition into reliability, resolution, and uncertainty. Bin the forecasts into $K$ groups with $n_k$ observations and mean forecast $\bar{p}_k$ and observed base rate $\bar{o}_k$ within each bin, and let $\bar{o}$ be the overall base rate. 
Then

$$
\mathrm{BS} = \underbrace{\frac{1}{N}\sum_k n_k (\bar{p}_k - \bar{o}_k)^2}_{\text{reliability}} - \underbrace{\frac{1}{N}\sum_k n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}} + \underbrace{\bar{o}(1-\bar{o})}_{\text{uncertainty}}.
$$

- **Reliability** (calibration penalty, *lower is better*) measures the squared gap between what the model *says* and what actually *happens* inside each bin. For bin $k$, if the model predicts $\bar{p}_k = 0.30$, but the observed default rate is $\bar{o}_k = 0.45$, that bin contributes $n_k (0.30 - 0.45)^2$ to reliability. A perfectly calibrated model has $\bar{p}_k = \bar{o}_k$ for every bin, so reliability $= 0$. In credit scoring, this directly controls whether a predicted PD of 5% really corresponds to a 5% realized default rate (i.e., the quantity pricing, provisioning, and IFRS 9/CECL rely on).
  - *Intuition:* "Do my probabilities mean what they say?"
  - *What increases it:* overconfident scores, covariate shift, training on a different base rate than production sees.
- **Resolution** (discrimination reward, *higher is better*) measures how much the bin-conditional rates $\bar{o}_k$ spread around the overall base rate $\bar{o}$. If every bin has $\bar{o}_k \approx \bar{o}$, the model is not separating good borrowers from bad, and resolution $\approx 0$. If low-score bins default at 1% and high-score bins at 40%, the variance across bins is large and resolution is high. Note the minus sign in @eq-murphy: more resolution *subtracts* from Brier, so a model that sorts risk well is rewarded.
  - *Intuition:* "Do my probabilities actually vary with the truth?"
  - *What increases it:* informative features, flexible-enough models, adequate sample size in the tail bins.
- **Uncertainty** ($\bar{o}(1-\bar{o})$) is the Bernoulli variance of the labels. It depends only on the *mix* of defaulters and non-defaulters in the data, not on the model.
A portfolio with a 2% default rate has uncertainty $0.02 \times 0.98 = 0.0196$; a balanced 50/50 sample has the maximum possible uncertainty of $0.25$. It is the Brier score of the constant forecast $p_i = \bar{o}$ for all $i$. - *Intuition:* "How hard is this problem inherently?" - *Why it matters:* raw Brier scores are not comparable across portfolios with different base rates, because uncertainty alone will make them look different. **The trade-off the decomposition exposes.** Rearranging @eq-murphy, $\mathrm{BS} = \text{uncertainty} - (\text{resolution} - \text{reliability})$. Two classifiers evaluated on the *same* dataset share the uncertainty term exactly, so their Brier gap is entirely driven by (resolution $-$ reliability). This is why the decomposition is diagnostic, not just descriptive: - A model that predicts the constant base rate $\bar{p}_i = \bar{o}$ is perfectly calibrated (reliability $= 0$) but has zero resolution. Its Brier equals uncertainty. Operationally it is useless: every applicant gets the same PD, so no one can be ranked, priced, or cut off. - A model that sorts risk well but is miscalibrated (say, every PD is inflated by $3\times$) can still beat the constant forecast on AUC yet have a *worse* Brier than a calibrated but less discriminating model. Recalibration (isotonic regression, Platt scaling) fixes reliability without touching the ranking (i.e., without touching resolution), which is why it is an almost-free improvement when available. - Because reliability and resolution move independently, report both alongside the headline Brier. A single Brier number hides whether you need better features (raise resolution) or better calibration (lower reliability) [@degroot1983comparison; @dawid1982well]. ### From-scratch implementation The reconstructed Brier agrees with the `sklearn` value up to the bucketing error. On this run boosting wins on *both* terms: higher resolution (0.0368 vs. 
0.0298) because it captures nonlinear interactions the linear logit misses, and slightly lower reliability (0.0005 vs. 0.0034) because the logistic model is mildly underfit so its bin-average predictions drift from the bin-observed rates. This outcome is not the norm. Gradient-boosted classifiers trained with log-loss are usually *less* well-calibrated than logistic regression (i.e., shallow ensembles shrink probabilities toward $0.5$, and deep ensembles push them toward $0$ and $1$ [@niculescu2005predicting]), which is why Platt scaling or isotonic regression on a held-out fold is standard practice for boosted models. Logistic regression, by contrast, is calibrated-in-the-large on its training data by construction of the MLE. The typical decomposition pattern is therefore *boosting wins resolution, loses reliability*, with the Brier winner determined by which term dominates; always inspect both columns rather than reading the headline Brier alone. ### Reliability diagrams The reliability diagram plots observed frequency against mean predicted probability within each bin. Points on the 45-degree line are perfectly calibrated. **Reading the diagram.** The dashed 45-degree line is perfect calibration. A curve *above* the diagonal means the model is **under-confident** (it predicts, say, 0.40 but the true default rate in that bin is 0.48); *below* the diagonal means **over-confident**. Three things stand out in the Taiwan split: - **Support.** Boosting's squares reach out to predicted probability $\approx 0.74$ while logistic's circles stop near $0.61$. Boosting is willing to issue sharper forecasts, the visual signature of the higher resolution we saw in the decomposition. - **Boosting (orange).** The curve sits essentially on the diagonal across the full range, with a small dip only in the top bin (predicted $\approx 0.74$, observed $\approx 0.70$). This is the near-zero reliability term (REL=0.0005) made visible. 
- **Logistic (blue).** The curve is jagged and non-monotone in the $0.05$-$0.25$ region: bins at predicted $\approx 0.20$ default at only $\approx 0.12$-$0.13$ (over-confident, below the diagonal), while the top bin at predicted $\approx 0.61$ defaults at $\approx 0.68$ (under-confident, above the diagonal). The model is simultaneously too bold in the middle and too timid at the top: a classic symptom of a linear-in-the-logit fit trying to approximate a nonlinear default surface. That wiggle is exactly what shows up as the larger REL=0.0034.

Operationally, the logistic mis-shape would over-price the middle-risk segment (charging as if PD were $20\%$ when the realized default rate is closer to $12\%$) and under-price the top segment (booking applicants whose true PD is $68\%$ at a price set for $61\%$). Boosting's curve hugs the diagonal, so PDs can be fed into pricing and provisioning with no post-hoc correction; the logistic model would benefit from Platt or isotonic recalibration, which is exactly what the next sections cover.

### Post-hoc recalibration

The reliability diagram shows that a model's raw score $s_i$ may sort risk well (good resolution / AUC) while still mapping to the wrong *level* of probability (poor reliability). **Recalibration** is a cheap post-hoc fix: leave the model alone, learn a scalar map $\hat{p} = g(s)$ on a held-out slice, and deploy $g$ in front of the scorer. Because $g$ is monotone (or near-monotone), it preserves the ranking of applicants (i.e., AUC and resolution are essentially unchanged), while bending the probabilities onto the diagonal. Two canonical choices differ in how much shape they allow $g$ to take:

1. @sec-metrics-platt-scaling
2. @sec-metrics-isotonic-regression

::: callout-tip
## Why a held-out slice is non-negotiable

Fitting $g$ on the same data used to train the underlying model would let $g$ absorb the model's training-set overfitting and report fake calibration.
Standard practice is an out-of-fold slice of the training data (sklearn's `CalibratedClassifierCV` does this via cross-validation), never the test set; the test set is reserved for final evaluation.
:::

### Platt scaling

@platt1999probabilistic proposed the parametric route: assume the miscalibration is a simple squash-or-stretch along the logit axis, and learn it with a *one-dimensional logistic regression* whose only feature is the raw score,

$$
\hat{p}_i = \sigma(A s_i + B), \quad \sigma(z) = \frac{1}{1+e^{-z}},
$$

with $A, B$ estimated by maximum likelihood on an out-of-fold slice of the training data. The two parameters have clean interpretations: $A$ controls *sharpness* ($|A| > 1$ stretches probabilities toward $\{0,1\}$, $|A| < 1$ pulls them toward the base rate), and $B$ is an intercept shift that re-centers the score on the observed prevalence. Having only two parameters is also its limitation: Platt can fix a global sigmoidal bias, but it cannot repair the kind of local non-monotone wiggle we saw in the logistic reliability curve.

This shape assumption is why Platt is the natural choice for models whose raw scores are already sigmoidal-looking: classical SVM decision values, boosted-tree margins before the logistic link, and logistic regressions whose only problem is the wrong intercept after sampling correction or threshold shifting. On models with fundamentally non-sigmoidal score distributions (e.g. Naive Bayes with its characteristic push toward 0 and 1), Platt is usually outperformed by the non-parametric alternative below.

One practical detail from the original paper: Platt replaces the hard labels $\{0,1\}$ with the smoothed targets

$$
y^+ = \frac{N_+ + 1}{N_+ + 2}, \qquad y^- = \frac{1}{N_- + 2},
$$

where $N_+$ and $N_-$ are the positive and negative counts in the calibration set. Without this smoothing the MLE can blow up toward infinite $A$ when the scores separate the classes perfectly; the Laplace-style prior keeps the estimate finite.
Implementations that omit the smoothing (rare but not unheard of) tend to produce over-confident $\hat{p}$ at the extremes. To make this concrete, we construct a small near-separable calibration set and fit Platt's two parameters two ways: once against the hard $\{0,1\}$ targets, and once against the smoothed $(y^+, y^-)$ targets. Because the targets are no longer binary we cannot reuse `LogisticRegression`; we minimize the Bernoulli negative log-likelihood directly. With hard labels the optimizer drives $A$ toward a large value (the gradient keeps rewarding steeper slopes because *every* positive sits above *every* negative). The resulting $\hat{p}$ at moderate scores like $s = \pm 2$ is already indistinguishable from $0$ or $1$ in floating-point, which is exactly the over-confidence at the extremes that the paper warns about. With smoothed targets the MLE's ceiling is set by $y^+ < 1$ and $y^- > 0$: the slope that best matches $y^+ \approx 0.976$ for the positives is finite, so $A$ converges to a moderate value and the recalibrated probabilities leave room for uncertainty. ### Isotonic regression @zadrozny2002transforming took the non-parametric route: instead of assuming a sigmoidal shape, only assume **monotonicity** (i.e., if the model ranks A as riskier than B, the recalibrated probability of A should not be lower). That is the bare minimum any reasonable calibration map must satisfy, and it is enough to identify a unique fit by least squares, $$ \hat{p} = \arg\min_{\text{mono}}\sum_i (y_i - g(s_i))^2 \quad \text{subject to } g \text{ non-decreasing}. $$ The solution is a monotone step function computed in $O(N \log N)$ by the pool-adjacent-violators algorithm: sort by score, walk left to right, and whenever an adjacent block has a lower mean than its predecessor, merge the two and replace both with their pooled mean. 
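The pooling step just described can be sketched in a few lines. This is a minimal unit-weight version written for illustration, not the book's implementation or scikit-learn's; the function name `pava` and the toy labels are assumptions. The caller is expected to sort by raw score first, which is where the $O(N \log N)$ cost comes from; the pooling pass itself is linear.

```python
def pava(y_sorted):
    """Pool-adjacent-violators with unit weights.

    `y_sorted` holds the labels ordered by increasing raw score.
    Returns the non-decreasing least-squares fit, one value per observation.
    """
    blocks = []  # each block is [sum_of_labels, count]
    for y in y_sorted:
        blocks.append([float(y), 1])
        # Merge while the newest block's mean is below its predecessor's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)  # every member of a block gets the pooled mean
    return fitted

# Toy calibration set (hypothetical): sort by score, then pool.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
labels = [0, 1, 0, 0, 1, 1]
order = sorted(range(len(scores)), key=scores.__getitem__)
print(pava([labels[i] for i in order]))
```

In the toy data the label sequence 1, 0, 0 violates monotonicity, so those three observations are pooled to a common value of 1/3 while the clean ends stay at 0 and 1.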
The result looks like a staircase hugging the reliability curve: flat over regions where the raw scores are well-ordered but at the wrong level, and stepping up wherever the observed rate jumps.

Because isotonic adapts locally, it can repair exactly the non-monotone wiggle that Platt cannot, which is why, on the Taiwan logistic model, we expect isotonic to drive REL closer to zero than Platt does. The price of that flexibility is variance: with few calibration points, isotonic tends to overfit into a coarse staircase that memorizes noise. @niculescu2005predicting benchmark the two across a range of base classifiers and find isotonic wins once the calibration set exceeds a few thousand observations, while Platt is more robust on smaller samples. A reasonable default: use Platt below $\sim$ 1,000 calibration points, isotonic above $\sim$ 5,000, and either one (compare via held-out Brier) in between.

::: callout-note
## What recalibration does *not* fix

Neither method adds information. If the model's resolution is low (bins don't separate defaulters from non-defaulters), recalibration cannot raise it: the monotone map can only slide existing bin centers along the diagonal, not spread them further apart. Recalibration is a remedy for reliability problems, not for a weak feature set or an under-fit model.
:::

### Calibrating with sklearn

Three things to read off the figure:

- **Platt (orange) almost perfectly overlays the uncalibrated curve (blue).** The top bin stays at $(\approx 0.61, \approx 0.68)$ and the mid-range wiggle at predicted $\approx 0.15$-$0.25$ is untouched. This is the *expected* behavior, not a failure: logistic regression fit by MLE is calibrated-in-the-large on its training data by construction, so Platt's two parameters land near the identity map $A \approx 1, B \approx 0$ and Platt has no local flexibility to fix the middle-range non-monotonicity even if $A, B$ had moved.
Platt earns its keep when the underlying model is *globally* sigmoidally miscalibrated (SVM margins, boosted-tree raw scores); it has little to offer a logistic regression. - **Isotonic (green) is the only curve that visibly changes.** Its top bin extends to $(\approx 0.71, \approx 0.69)$. This is much closer to the diagonal, and the staircase pools the jagged middle bins into a monotone sequence. This is the pool-adjacent-violators algorithm doing exactly what it was designed for: repairing local, non-sigmoidal mis-shape that a parametric form cannot touch. - **AUC is unchanged for both.** Platt and isotonic are monotone maps, so the *ordering* of applicants by $\hat{p}$ is the same as by $s$. Rank-based metrics (AUC, KS, Gini) are invariant under monotone transformations; only probability-level metrics (Brier, log-loss, ECE) move. Brier improves by only a few basis points here. That modest gain is consistent with the starting point: the base logistic model's REL was already $0.0034$, leaving little room for any recalibrator to work. The picture is very different for boosted trees and random forests, whose raw probabilities are typically pushed toward $0.5$ (shallow ensembles) or toward $\{0,1\}$ (deep ensembles), producing much larger reliability gaps and correspondingly larger post-calibration Brier improvements [@niculescu2005predicting]. A useful rule of thumb: the size of the calibration gain is roughly proportional to the pre-calibration REL term; if REL is already small, no method will move Brier much, and you should look to better features (resolution) rather than better calibration to improve the model. ### Calibration error as a separate metric Reliability diagrams are visual. For automated monitoring, a scalar summary of miscalibration is useful. 
Two standards exist: Expected Calibration Error (ECE), which is the bin-weighted absolute deviation between mean forecast and mean outcome within bins, $$ \mathrm{ECE} = \sum_{k=1}^{K} \frac{n_k}{N}\bigl|\bar p_k - \bar o_k\bigr|, $$ and the reliability component of the Brier decomposition from @eq-murphy, which is the squared analog. ECE is sensitive to bin count and the binning strategy, so quantile bins with $K = 10$ or $K = 15$ are standard. **Why the tails are the weak spot.** Both ECE and reliability estimate $\bar o_k$ by averaging $y_i \in \{0,1\}$ over the $n_k$ observations in bin $k$. The standard error of that estimate is $$ \mathrm{SE}(\bar o_k) = \sqrt{\frac{\bar o_k (1-\bar o_k)}{n_k}}, $$ so the noise scales as $1/\sqrt{n_k}$. In the body of the score distribution, equal-frequency binning puts hundreds or thousands of observations into each bin and $\mathrm{SE}$ is negligible. In the tails, two things go wrong at once: 1. **Sparsity.** The top and bottom quantile bins often contain only a handful of observations, especially with quantile binning on a score that is itself concentrated near $0$, which is typical for a 2-3% default portfolio. A bin with $n_k = 20$ has $\mathrm{SE} \approx 0.10$ even under perfect calibration, so the observed rate can land $\pm 0.20$ from the true rate by pure sampling noise. 2. **Label scarcity.** The tails are precisely where one class dominates. A "top-risk" bin may have only $2$ or $3$ actual defaults out of $30$ applicants; flip one label and the estimated $\bar o_k$ jumps by 3 percentage points. The estimator is most unstable exactly where the decisions are most expensive (approve/decline at the cutoff, price the riskiest applicants). The combination means that a tail bin can look wildly miscalibrated when the model is actually fine: inflating ECE and reliability, and producing the alarming spikes at the edges of the reliability diagram that practitioners learn to distrust. 
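The tail-noise argument is easy to reproduce. The following sketch builds equal-frequency bins, reports $n_k$, $\bar p_k$, $\bar o_k$ and $\mathrm{SE}(\bar o_k)$ per bin, and sums the ECE from that table. The helper names (`binomial_se`, `bin_table`, `ece_from_table`) and the toy data are assumptions for illustration; this is not the chapter's own estimator.

```python
import math

def binomial_se(rate, n):
    """Standard error of an observed rate estimated from n Bernoulli draws."""
    return math.sqrt(rate * (1.0 - rate) / n)

def bin_table(p, y, n_bins=10):
    """Equal-frequency bins over forecasts p with labels y; one dict per non-empty bin."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(order)
    rows = []
    for k in range(n_bins):
        idx = order[k * n // n_bins:(k + 1) * n // n_bins]
        if not idx:
            continue
        n_k = len(idx)
        p_bar = sum(p[i] for i in idx) / n_k
        o_bar = sum(y[i] for i in idx) / n_k
        rows.append({"n": n_k, "p_bar": p_bar, "o_bar": o_bar,
                     "se": binomial_se(o_bar, n_k)})
    return rows

def ece_from_table(rows):
    """Bin-weighted mean absolute gap between forecast and observed rate."""
    total = sum(r["n"] for r in rows)
    return sum(r["n"] / total * abs(r["p_bar"] - r["o_bar"]) for r in rows)

# How noisy is a 20-observation bin at different true rates?
for rate in (0.03, 0.30, 0.50):
    print(f"rate={rate:.2f}, n=20: SE={binomial_se(rate, 20):.3f}")

# Toy forecasts (hypothetical) that are perfectly calibrated within each bin.
p = [0.1] * 10 + [0.9] * 10
y = [0] * 9 + [1] + [1] * 9 + [0]
rows = bin_table(p, y, n_bins=2)
print(rows, ece_from_table(rows))
```

The standard error depends on the bin's rate as well as its size: it peaks at a rate of 0.5 and shrinks toward the extremes, so printing $n_k$ and the SE next to each bin makes it visible which deviations from the diagonal are within sampling noise.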
**Practical remedies.** - **Minimum-count thresholding.** Require $n_k \ge n_{\min}$ (typical choices: $n_{\min} = 50$-$100$). Bins below the threshold are either dropped from the ECE sum (and the $n_k/N$ weights renormalized over the survivors) or merged into the adjacent bin until the threshold is met. Merging is preferable because dropping biases the estimator toward the body of the distribution. - **Equal-frequency (quantile) bins** over equal-width bins, so every bin has the same $n_k = N/K$ by construction and no bin is automatically sparse. - **Confidence intervals on** $\bar o_k$, drawn as vertical error bars on the reliability diagram, so the reader can see which deviations are real signal and which are $\pm 2\mathrm{SE}$ sampling noise. - **Adaptive / debiased estimators** such as the debiased ECE of @kumar2019verified, which subtract the expected-under-null bias, or kernel-smoothed calibration curves that borrow strength across neighboring score values instead of treating each bin independently. The upshot: a reliability spike in a sparse tail bin is not automatically a calibration problem; it may be a sample-size problem. Always report $n_k$ alongside $\bar p_k$ and $\bar o_k$ before acting on tail miscalibration. Reading the three numbers. A scalar ECE is a probability-weighted average of $|\bar p_k - \bar o_k|$ across bins, so an ECE of $0.05$ means the model's predicted PD in a typical bin is off by about 5 percentage points against the realized default rate. For a Taiwan card book with a base rate near 22 percent, that is a material miss: a decile priced at a predicted PD of 10 percent but defaulting at 15 percent mis-prices every loan in the bucket by \~50 basis points of spread. The uncalibrated logistic comes in near 0.05. 
The Platt-scaled version is essentially the same, which is a useful negative result: Platt imposes a single sigmoid curve on the calibration map, and if the miscalibration is not itself sigmoid-shaped (for example, a bowed S instead of a monotone squeeze) the parametric fit has nowhere to go and can even worsen ECE slightly on a finite test fold while still lowering Brier. Isotonic regression cuts ECE by roughly a factor of four because it is a non-parametric monotone step function and can absorb arbitrary calibration curve shapes, at the cost of more variance in small bins. Operationally this is the ordering one usually sees on tabular credit data: uncalibrated $\approx$ Platt $\gg$ isotonic in large-sample regimes, with the ranking reversing in small-sample regimes where isotonic starts to overfit. That ranking is still subject to the tail-noise caveats above. Before treating a 50 bp gap between two calibrators as a real difference, confirm it is not inside the $\pm 2\text{SE}$ bands implied by the bin sizes $n_k$, which is exactly what the remedies in the next four chunks compute. The `ece_score` above is the naive textbook estimator: equal-frequency bins, every non-empty bin included, no standard errors. Each of the four remedies from the previous list turns into a small modification of that loop. **Remedy 1: minimum-count thresholding.** Require $n_k \ge n_{\min}$ either by dropping the offending bin or by merging it into its neighbor. Merging preserves the total mass $\sum_k n_k/N = 1$ and is therefore the less biased choice. **Remedy 2: equal-frequency vs equal-width bins.** The naive `ece_score` already uses equal-frequency (quantile) bins, which is the safer default. For contrast, the equal-width version below is what many tutorials show, and it is exactly the version that explodes in the tails when the score distribution is skewed (common for a low-default portfolio). **Remedy 3: confidence intervals on** $\bar o_k$. 
The simplest honest reliability diagram plots a Clopper-Pearson (exact binomial) band around each $\bar o_k$; bars that overlap the diagonal are not evidence of miscalibration. The same logic extends to Wilson or Jeffreys intervals. **Remedy 4: debiased ECE.** The naive squared estimator $(\bar p_k - \bar o_k)^2$ has positive bias equal to $\operatorname{Var}(\bar o_k) = \bar o_k(1-\bar o_k)/n_k$ even under perfect calibration. The Kumar-Liang-Ma debiased estimator subtracts that bias per bin before summing [@kumar2019verified]. The correction is largest where $n_k$ is smallest, which is exactly the tails. For a perfectly calibrated model it shrinks the reported ECE toward zero, where it belongs. The null-check line is the important one: on data drawn from a calibrated process the naive L2 estimator reports a non-zero "error" that is pure sampling noise, whereas the debiased version collapses to (near) zero. In production monitoring this is what prevents a calibration alarm from firing every quarter on a model that has not actually drifted. ### When to calibrate Calibration should be done on a held-out slice that the classifier has not seen during training. `CalibratedClassifierCV` handles this with an inner cross-validation loop: the base estimator is refit on each fold and the calibration map is fit on the complement. Calibrating on the training set, or on the same fold used to pick hyperparameters, is a common bug and produces over-confident, miscalibrated probabilities. The bug is subtle because it produces a calibration map that looks excellent *in-sample* and fails out-of-sample. On the training set, the base model has already overfit (its high-risk predictions are systematically too high and its low-risk predictions systematically too low because it has memorized some noise), so a recalibrator fit on that same data learns the *inverse of the overfit*, not the inverse of the true miscalibration. 
Applied to unseen data, it pushes probabilities in the wrong direction. The effect is most visible for flexible models such as gradient boosting. The following block contrasts the two workflows on the Taiwan boosting model: Two diagnostics to watch in the output. First, the leaky variant's test REL ($0.0024$) is about six times the CV variant's test REL ($0.0004$) and is in fact *worse* than not calibrating at all ($0.0005$). This is the signature of fitting the calibration map to in-sample noise: on the training fold it would drive REL close to zero, but that gain does not transfer. Second, AUC for the leaky variant is identical to the uncalibrated AUC ($0.7804$), because Platt is a monotone sigmoid on the original scores and monotone transforms preserve ROC ordering. The CV variant's AUC ($0.7800$) drifts down by a hair, not because Platt broke ranking, but because `CalibratedClassifierCV` refits the base booster on inner folds and averages their predictions, so the *scores being calibrated* are not exactly $p_{gb}$. A difference of $4 \times 10^{-4}$ in AUC is well inside the bootstrap band you would report anyway. The operational lesson is that a team monitoring only AUC or only raw Brier will see Platt-leaky as a no-op; only the REL component of the Murphy decomposition exposes the bug. A second rule: do not calibrate before you have exhausted feature engineering. If the input features are miscalibrated (for example a missing indicator that modifies the relationship between a feature and default, such as an unemployment flag that changes the slope on income) calibrating the output only hides the problem without fixing it. The recalibrator will squash or stretch the average, but segment-level biases (the unemployed sub-population systematically under-predicted) remain. 
@gelman2008prior goes further and argues that weakly informative priors on logistic coefficients are themselves a calibration device, by shrinking over-extrapolated coefficients toward a plausible scale before any post-hoc fix is needed. The right sequence is: fit the model, inspect reliability on the validation fold, fix model misspecification if the gap is structural (missing features, wrong functional form, segment-specific slopes), and only then apply Platt or isotonic post-processing as a cheap final correction. **Random k-fold is the wrong default for credit.** The demo above uses `cv=5`, which defaults to random `KFold`. That assumes rows are exchangeable in time, which is precisely the assumption that breaks in credit scoring. Macro regimes shift, product mixes change, underwriting rules tighten, and the calibration curve learned on 2018-2022 applications can point the wrong direction for 2024 applications even when AUC is stable. Random k-fold further leaks future information into the calibrator: rows originated *after* the scoring date are used to fit the map that corrects predictions made *at* the scoring date, giving an over-optimistic held-out REL that the live system will never match. The operationally honest setup is out-of-time (OOT) calibration: fit the base model on period $[T_0, T_1]$, fit the calibrator on a strictly later period $[T_1, T_2]$, and evaluate on $[T_2, T_3]$. `CalibratedClassifierCV` accepts any scikit-learn splitter via its `cv=` argument, so swapping random folds for walk-forward folds is one line: Two additional defenses compound with OOT splitting. First, recalibrate on a rolling window rather than once at deployment, so the map tracks regime drift instead of freezing the 2022 shape into 2025 decisions. Second, monitor the REL component of Brier on each new vintage and trigger a recalibration when REL crosses a pre-agreed threshold rather than on a fixed calendar. 
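The three-period layout described above can be sketched without any modelling library. The helper `out_of_time_split` and the vintage labels are hypothetical and only illustrate the index bookkeeping; the base model and the calibrator would then be fit on the first two index sets with whatever estimator is in use.

```python
def out_of_time_split(periods, train_end, calib_end):
    """Split row indices by origination period.

    Rows with period <= train_end train the base model, rows with
    train_end < period <= calib_end fit the calibrator, and all later
    rows are held back for evaluation.
    """
    train, calib, test = [], [], []
    for i, t in enumerate(periods):
        if t <= train_end:
            train.append(i)
        elif t <= calib_end:
            calib.append(i)
        else:
            test.append(i)
    return train, calib, test

# Hypothetical origination years, one per loan, in arbitrary row order.
periods = [2019, 2020, 2021, 2022, 2023, 2024, 2021, 2023]
train, calib, test = out_of_time_split(periods, train_end=2021, calib_end=2023)
print(train, calib, test)

# No calibration row precedes the latest training row in time.
assert max(periods[i] for i in train) < min(periods[i] for i in calib)
```

The closing assertion is the property random k-fold cannot guarantee: every row used to fit the calibrator originates strictly after every row used to fit the base model.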
The PSI and CSI sections later in this chapter operationalize the monitoring side; the point here is only that the calibration workflow itself must be time-aware from the start.

## Financial impact: cost matrices, profit curves, and EMP

### Cost-sensitive learning

@elkan2001foundations derives the optimal threshold under asymmetric misclassification costs. Let $c_{01}$ be the cost of accepting a bad (false negative in the default-prediction framing) and $c_{10}$ the cost of rejecting a good (false positive). For an applicant with posterior probability $p = \Pr(Y=1 \mid \mathbf{x})$, the expected loss of the decision at threshold $t$ is

$$
E[\text{loss} \mid p] = c_{10} (1-p) \mathbf{1}_{p > t} + c_{01}\, p\, \mathbf{1}_{p \le t}.
$$

Rejecting is the cheaper action exactly when $c_{10}(1-p) < c_{01}\, p$, so the minimizing threshold is $t^* = c_{10}/(c_{01} + c_{10})$, a result that depends only on cost ratios.

**What** $t$ **is and what it is measured in**. The threshold $t$ lives on exactly the same scale as the posterior $p$: it is a number in $[0,1]$ with the units of a default *probability*, not the units of whatever raw score the model emits. The decision rule encoded in the indicators is:

- if $p > t$ the model predicts "bad" and the lender *rejects* the applicant, incurring cost $c_{10}$ if the applicant was actually good;
- if $p \le t$ the lender *accepts* and incurs cost $c_{01}$ if the applicant turns out to default.

If your score is a log-odds, a FICO-like 300-850 integer, or an internal rating grade, you cannot plug that score into the Elkan inequality directly. You must first map the score to a calibrated PD (Platt, isotonic, @sec-ch04-brier), or equivalently push $t^{*}$ through the same monotone transform so the inequality is evaluated on matching scales. This is the reason calibration matters: an uncalibrated classifier can still be a good *ranker*, but its cut-off under Elkan's rule is meaningless because the numerical comparison $p > t^{*}$ has no unit-correct interpretation.
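As a small illustration of pushing $t^{*}$ through a monotone transform, the sketch below takes a model that emits log-odds and compares the decision made on the PD scale with the one made on the log-odds scale. The cost figures and function names are hypothetical, and the equivalence is only useful if the log-odds are themselves calibrated.

```python
import math

def elkan_threshold(c_reject_good, c_accept_bad):
    """Break-even default probability t* = c10 / (c01 + c10)."""
    return c_reject_good / (c_accept_bad + c_reject_good)

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical costs: a rejected good forgoes 1 unit, an accepted bad loses 5.
t_star = elkan_threshold(c_reject_good=1.0, c_accept_bad=5.0)
z_star = logit(t_star)  # the same cut-off expressed in log-odds

log_odds_scores = [-3.0, -1.7, -1.5, 0.4]
reject_on_pd = [sigmoid(z) > t_star for z in log_odds_scores]
reject_on_log_odds = [z > z_star for z in log_odds_scores]

print(t_star, z_star)
print(reject_on_pd == reject_on_log_odds)
```

On the log-odds scale the cut-off reduces to $\log(c_{10}/c_{01})$, which is another way of seeing that only the cost ratio enters the decision.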
**Why only cost ratios matter.** Multiplying both $c_{01}$ and $c_{10}$ by the same constant (e.g., switching the unit of currency from USD to VND, or the exposure size from a 10k loan to a 1k loan on a homogeneous book) does not move $t^{*}$. So the whole decision is parameterized by a single number, the severity ratio $c_{01}/c_{10}$. Most practical disputes reduce to arguments about that ratio, not about the absolute cost figures.

**A concrete example.** An unsecured personal-loan desk books a \$10,000 loan with a 4% net interest margin over an expected three-year amortization. The foregone-profit cost of rejecting a good applicant is roughly $c_{10} \approx 0.04 \times 3 \times 10{,}000 = 1{,}200$. The loss-given-default on that product is 70%, so accepting a borrower who ultimately defaults costs $c_{01} \approx 0.70 \times 10{,}000 = 7{,}000$. Elkan's cut-off is then

$$
t^{*} = \frac{c_{10}}{c_{01} + c_{10}} = \frac{1{,}200}{7{,}000 + 1{,}200} \approx 0.146,
$$

so the desk should reject any applicant whose *calibrated* PD exceeds 14.6%, even though the book-wide base rate is only around 3%. If the desk's margin narrows to 2.5% without touching LGD, $c_{10}$ drops to \$750 and $t^{*}$ drops to $750/7{,}750 \approx 9.7\%$: a cheaper-to-forgo good makes rejection less costly, so the cutoff tightens and the book contracts. Moving in the other direction, a secured product with 30% LGD would give $c_{01} = 3{,}000$, $c_{10} = 1{,}200$, and $t^{*} \approx 0.286$. The desk can extend credit to materially riskier applicants because each bad costs less.

Credit lenders rarely state costs directly; they state yields and loss-given-default. The example above is the bridge between the two vocabularies: $c_{10}$ is the present value of the interest margin you would have earned on a good booking, and $c_{01}$ is EAD times LGD.
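The arithmetic of the worked example can be rerun for any margin, tenor, and LGD with a few lines. The function below uses the same simple approximation as the example (undiscounted margin times years for $c_{10}$, exposure times LGD for $c_{01}$); the name `elkan_cutoff` and its signature are assumptions for illustration.

```python
def elkan_cutoff(margin, years, lgd, exposure=10_000):
    """Break-even PD t* = c10 / (c01 + c10) from yield-style inputs."""
    c_reject_good = margin * years * exposure  # forgone interest margin, c_10
    c_accept_bad = lgd * exposure              # EAD x LGD, c_01
    return c_reject_good / (c_accept_bad + c_reject_good)

scenarios = {
    "unsecured, 4% margin": elkan_cutoff(0.04, 3, 0.70),
    "unsecured, 2.5% margin": elkan_cutoff(0.025, 3, 0.70),
    "secured, 30% LGD": elkan_cutoff(0.04, 3, 0.30),
}
for name, t_star in scenarios.items():
    print(f"{name}: reject above PD = {t_star:.3f}")

# Rescaling the exposure leaves the cut-off unchanged: only the ratio matters.
same = elkan_cutoff(0.04, 3, 0.70, exposure=1_000)
print(abs(same - scenarios["unsecured, 4% margin"]) < 1e-12)
```

Because exposure multiplies both costs, it cancels in the ratio, which is the scale-invariance property stated at the start of this passage.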
The Verbraken family of metrics reframes the same object directly in profit units so practitioners never have to construct $(c_{01}, c_{10})$ by hand [@verbraken2013novel; @verbraken2014novel].

**Mind the convention.** Elkan writes "$p > t$ $\Rightarrow$ reject" because $p$ is a default probability (high $=$ risky). The profit curve in the next subsection flips to "$S \le t$ $\Rightarrow$ accept" because it defines $S$ as risk and writes the *acceptance* set explicitly. Both statements describe the same action: reject the riskiest tail. The direction of the inequality is a matter of which side the author chose to name.

### Profit curve

Let $r$ be the profit per correctly accepted good and $L$ the loss per accepted bad. The expected net profit from accepting everyone whose score sits at or below threshold $t$ is

$$
\Pi(t) = \pi_0 r F_0(t) - \pi_1 L F_1(t).
$$

Observe the thresholding convention: we accept applicants with $S \le t$ because we are now thinking of $S$ as risk, high risk on top, so $F_0(t)$ and $F_1(t)$ are the accepted fractions of goods and bads. The profit curve traces $\Pi(t)$ as $t$ sweeps. The threshold that maximizes $\Pi$ is the operational cut-off under the assumed $(r, L)$. It is distinct from the threshold that maximizes KS or that sits at the point of tangency between ROC and a cost-sensitive iso-loss line [@provost2001robust; @drummond2006cost].

The boosted model dominates the logistic model everywhere on this grid, and both curves are negative for very loose acceptance policies because the portfolio starts paying more in loan losses than it earns in interest.

#### Three thresholds on one score {.unnumbered}

The profit curve is a sweep, but in production the lender writes a single number into the policy: a cut-off. Three candidate cut-offs show up in the literature and they usually disagree because they are solving different problems:

- **KS-optimal** maximizes $|F_0(t) - F_1(t)|$ (i.e., the vertical gap between the cumulative distributions of goods and bads). It uses *no cost information at all*.
It is the right answer only if the business objective is "discriminate as loudly as possible at one point on the CDF", which is almost never the business objective. - **Empirical profit-maximum** is $\arg\max_t \widehat{\Pi}(t)$ on the held-out fold, given a specific $(r, L)$. It is the number that falls out of the previous chunk. It uses the realized costs but it is noisy: a fold with one extra bad at the margin can move the cut-off by several percentage points of PD, so it should be bootstrapped. - **Elkan / Bayes-optimal** is the closed-form $t^{*} = L \pi_1 / (r \pi_0 + L \pi_1) \cdot [\ldots]$. In the profit framing where $c_{10}=r$ and $c_{01}=L$, the per-applicant break-even PD simplifies to $t^{*} = r/(r+L)$. It lives on the calibrated PD scale (see the earlier warning that this threshold is meaningless on an uncalibrated score). It uses no data beyond $(r, L)$. > In theory, when the model is perfectly calibrated and the test fold is infinite, the empirical profit-max and the Elkan threshold coincide. > > In practice, any of the three can be up to a few percentage points of accept-rate apart. The only one that is typically far off is KS-optimal, because it is optimizing the wrong object. A cleaner way to see the relationship is to view all three on the ROC plane, not on the profit curve. The profit function $r \pi_0 (1 - \text{FPR}) - L \pi_1 (1 - \text{TPR})$ is linear in $(\text{FPR}, \text{TPR})$, so curves of constant profit are parallel lines with slope $m = r \pi_0 / (L \pi_1)$ in $(\text{FPR}, \text{TPR})$ space. The profit-max operating point is the tangency of the ROC curve with the highest such iso-profit line, which is the geometric argument of @provost2001robust. KS-optimal is the tangency of the ROC with a line of slope 1 (because $\text{TPR}-\text{FPR}$ is maximized there). The two are the same point only in the special case $r \pi_0 = L \pi_1$. Reading the picture. 
The KS-optimal cut-off sits materially to the left of the profit-max point: it rejects fewer applicants than profit maximization wants to, because $r \pi_0 < L \pi_1$ implies the iso-profit slope $m$ is shallower than 1, and on a concave ROC curve the tangency with a shallower line sits further to the right, at higher FPR and TPR. The empirical profit-max and the Elkan threshold are neighbors: the empirical answer is a small perturbation around the Bayes answer driven by sample noise and residual miscalibration of the boosted scores. If you re-run the chunk after isotonic calibration of $p_{gb}$, the two usually collapse onto the same point. The operational lesson is simple: use KS for storytelling about ranking, use Elkan for the policy, and use the empirical profit-max as a sanity check on whether calibration is close enough to trust the closed-form answer.

The sensitivity plot closes the loop with the earlier Elkan worked example: at $L/r = 5$ the three models agree on an accept rate near 70%; at $L/r = 20$ (a subprime-like product with 70% LGD and thin margin) the optimal book compresses to near 20%; at $L/r = 50$ all three models converge to "accept almost nobody", because the loss per bad so dominates the profit per good that only the deepest prime tail is worth booking. The slope of each curve is, to first order, the density of the score distribution at the moving Elkan cut-off: scores that are concentrated near the decision boundary are fragile to small changes in the severity ratio, which is itself an argument for reporting EMP rather than $\widehat\Pi$ at a single $(r, L)$ pair.

### Expected maximum profit (EMP)

@verbraken2014novel argue that the profit curve depends on the arbitrary choice of $(r, L)$ and propose averaging the maximum profit over a prior on the uncertain parameter.
In credit scoring, the uncertain parameter is usually the fractional loss $\lambda$, the share of outstanding principal lost in default, drawn from a Beta distribution calibrated to historical loss-given-default data. Formally, $$ \mathrm{EMP} = \int_0^1 \max_t \Pi(t; r, \lambda) h(\lambda) d\lambda, \qquad h(\lambda) \sim \mathrm{Beta}(\alpha, \beta). $$ Using EMP moves the metric from an arbitrary point on the profit curve to a business-oriented integrated criterion. Verbraken and co-authors recommend $\alpha = 6$ and $\beta = 14$ as a default, which gives a loss-given-default density concentrated around 0.3. The EMP gap between logistic and boosted trees maps cleanly to the profit curve gap at the maximum, weighted by the LGD distribution. #### Anatomy of an EMP number {.unnumbered} EMP is a single scalar, which makes it easy to drop on a scorecard, but that compactness hides three ingredients that business users need to see directly: the prior $h(\lambda)$, the conditional optimal profit $\Pi^{*}(\lambda) = \max_t \Pi(t; r, \lambda)$, and the integrand $\Pi^{*}(\lambda) h(\lambda)$ whose area is the numerator of EMP. The next figure decomposes EMP for the default parameters; the vertical-axis scales are deliberately independent because the three objects live in different units. Three things to read off this plot. First, the middle panel shows that as $\lambda$ rises the conditional profit falls *and* the conditional optimal accept rate falls: a more punishing LGD forces the lender to book a tighter subset of applicants. Second, the integrand in the bottom panel is concentrated between $\lambda \in [0.15, 0.50]$; values of $\lambda$ below 0.1 or above 0.7 contribute essentially nothing to EMP because the prior places almost no mass there. Third, the area of the shaded bottom panel divided by the area of the top panel is literally the EMP number printed above the figure. 
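The EMP integral and its three ingredients can be sketched numerically. The block below is a minimal illustration on synthetic data, not the book's implementation: the function names `beta_pdf` and `emp`, the synthetic Beta(2, 5) PDs, and $r = 0.10$ are all assumptions, and the loss per accepted bad is taken to be $\lambda$ per unit of exposure, matching $\Pi(t; r, \lambda)$ in the formula above.

```python
import numpy as np
from math import lgamma

def beta_pdf(x, a, b):
    """Beta(a, b) density on the open interval (0, 1)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(x) + (b - 1) * np.log1p(-x))

def emp(p, y, r, alpha=6.0, beta=14.0, n_grid=201):
    """Trapezoid approximation of EMP: average of max_t Pi(t; r, lam) over the prior."""
    order = np.argsort(p, kind="stable")          # accept in ascending-risk order
    y_s = np.asarray(y)[order]
    n = len(y_s)
    goods = np.concatenate([[0], np.cumsum(y_s == 0)])   # accepted goods at each cut
    bads = np.concatenate([[0], np.cumsum(y_s == 1)])    # accepted bads at each cut
    lam = np.linspace(0.0, 1.0, n_grid)[1:-1]            # interior grid avoids log(0)
    h = beta_pdf(lam, alpha, beta)                       # prior (top panel)
    # conditional optimal profit per applicant for every lam (middle panel)
    best = (r * goods[None, :] - lam[:, None] * bads[None, :]).max(axis=1) / n
    integrand = best * h                                 # bottom panel

    def area(f):
        return float(np.sum((f[1:] + f[:-1]) * np.diff(lam)) / 2)

    return area(integrand) / area(h)

rng = np.random.default_rng(0)
p = rng.beta(2, 5, size=4_000)          # synthetic calibrated PDs (assumption)
y = rng.binomial(1, p)                  # 1 = default
emp_model = emp(p, y, r=0.10)
emp_noise = emp(rng.random(4_000), y, r=0.10)   # uninformative score for contrast
print(f"EMP informative={emp_model:.4f}  EMP random={emp_noise:.4f}")
```

Dividing by the area under the prior mirrors the "bottom panel over top panel" reading in the text; with a properly normalized prior that denominator is one up to quadrature error.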
Changing the prior shape changes the shaded area; changing the model changes the middle-panel curve.

#### Plugging in your own product economics {.unnumbered}

The Beta-LGD prior and the yield $r$ should come from the lender's own book, not from a textbook default. Pick $r$ as the cumulative net-interest or net-fee yield per dollar of exposure on a good booking over the product's expected life (short-tenure installment: 0.05-0.10; unsecured card: 0.15-0.25; subprime personal: 0.20-0.35). Pick $(\alpha, \beta)$ so that the Beta mean $\alpha/(\alpha+\beta)$ matches the historical loss-given-default mean on workout data, and so that the Beta spread matches the realized spread across vintages. The following scenarios span the usual products.

The bar chart tells the operationally important story: EMP changes more as the product changes than as the model changes. Switching from the logistic baseline to the boosted model is worth a few basis points of EMP across every product; switching from prime-mortgage economics to SME-unsecured economics, with the same two models, changes EMP by an order of magnitude. When the prior mean LGD is very high (SME unsecured, E\[LGD\] $\approx 0.80$) the EMP can turn negative for at least one of the models, which is a direct signal that the product is mispriced: no cut-off exists at which the book earns a positive expected profit under that loss distribution.

#### Making a decision from EMP {.unnumbered}

EMP is in units of *expected profit per applicant* on the same currency scale as $r$. Three decisions fall out of the number:

1. **Model selection.** Pick the model with higher EMP, provided the gap is larger than the bootstrap dispersion. A 95% bootstrap CI on EMP is built by resampling $(y, \hat p)$ pairs with replacement; the gap is "real" if the two intervals separate. Without that check, a 20-basis-point EMP gap is indistinguishable from a reshuffled test fold.
2.
**Go / no-go on the product.** If the realistic prior delivers EMP $\le 0$, the product cannot be priced into profitability at any cut-off; either raise $r$ (rate, fee, or fee income assumption), tighten $\lambda$ (more collateral, stricter workout), or drop the product. The EMP is more honest than a profit curve at a single $(r, L)$ because it accounts for LGD uncertainty, which is where most credit-book surprises originate.
3. **Portfolio-level dollar translation.** EMP is per-applicant and exposure-normalized, so the portfolio value is $\text{EMP} \times N \times \bar E$ where $N$ is the annual application volume and $\bar E$ is the average booked exposure. A 0.002 EMP improvement from switching classifiers on a book of 100k applications at an average 10k USD exposure is 2 million USD a year. That is typically the unit in which a model-replacement proposal should be pitched to a risk committee.

Two caveats should be reported alongside every EMP number. First, EMP ignores fixed and operating costs; it is the ceiling on portfolio contribution, not the bottom line. Second, EMP is only as honest as $(r, \alpha, \beta)$: always report it at the central prior *and* at a stressed prior with higher LGD mean (for example $\mathrm{Beta}(10, 10)$ instead of $\mathrm{Beta}(6, 14)$) so the reader can see whether the model ranking is robust to a macro-driven LGD shift. If the ranking inverts under stress, the decision should wait for a richer LGD study before a model swap.

### Threshold optimization under business constraints

Real lenders rarely pick the unconstrained optimum. They add constraints: minimum acceptance rate to satisfy loan growth targets, maximum exposure to a risk segment, and fairness floors to meet ECOA obligations. The constrained optimum is found by sweeping the profit curve and taking the most profitable feasible point. The following pattern is defensive and explicit.
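A minimal sketch of the sweep-and-filter pattern, run on synthetic data. The function names, the Beta(2, 5) PDs, and the $(r, L)$ values are illustrative assumptions, so the numbers will not match the figures quoted from the Taiwan fold.

```python
import numpy as np

def profit_curve(p, y, r, L):
    """Per-applicant profit when the k lowest-PD applicants are accepted, k = 0..n."""
    order = np.argsort(p, kind="stable")
    gains = np.where(np.asarray(y)[order] == 0, r, -L)   # +r per good, -L per bad
    n = len(gains)
    accept_rate = np.arange(n + 1) / n
    profit = np.concatenate([[0.0], np.cumsum(gains)]) / n
    return accept_rate, profit

def best_feasible(accept_rate, profit, min_accept=0.0):
    """Most profitable operating point among those that satisfy the accept-rate floor."""
    feasible = np.flatnonzero(accept_rate >= min_accept)
    if feasible.size == 0:
        raise ValueError("no operating point satisfies the floor")
    k = feasible[np.argmax(profit[feasible])]
    return float(accept_rate[k]), float(profit[k])

rng = np.random.default_rng(0)
p = rng.beta(2, 5, size=5_000)      # synthetic calibrated PDs (assumption)
y = rng.binomial(1, p)              # 1 = default
accept_rate, profit = profit_curve(p, y, r=0.10, L=0.50)

a_free, pi_free = best_feasible(accept_rate, profit)
a_con, pi_con = best_feasible(accept_rate, profit, min_accept=0.55)
print(f"unconstrained: accept={a_free:.1%} profit={pi_free:.4f}")
print(f"floor 55%:     accept={a_con:.1%} profit={pi_con:.4f}")
```

Raising an error on an empty feasible set, rather than silently returning the nearest point, is the "defensive" part: an infeasible policy should surface as a conflict, not as a number.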
@sec-ch23 extends the formalism to fairness-constrained thresholds, where one enforces approximate equality of either acceptance rate or of true-positive rate across protected groups.

**Reading the numbers.** @fig-constrained-threshold plots the profit curve with both optima marked. The unconstrained optimum is $49.0\%$ accept at profit $0.0457$ per applicant. Adding a minimum-accept floor of $55\%$ moves the operating point to $61.1\%$ accept and the profit down to $0.0455$: a drop of only $0.0002$, which is about $0.4\%$ of the peak. Two things are visible. First, the cost of the constraint is small because the profit curve is nearly flat near its peak; the lender is giving up very little expected profit to satisfy a loan-growth target. Second, the constrained optimum lands at $61.1\%$, not on the constraint boundary at $55\%$. That happens because the empirical profit curve has small local wiggles (the next block of applicants between $55\%$ and $61\%$ happens to contain more goods than bads, producing a local bump). In a world with infinite test data the curve would be smooth and the constrained optimum would sit exactly at $55\%$; in production this is a good place to bootstrap the curve to see how stable the operating point is.

#### Common constraint families in credit {.unnumbered}

Beyond the min-accept floor above, the policy discussion almost always includes some subset of:

- **Growth and volume floors**: minimum acceptance rate or minimum booked volume per period, to hit origination targets and absorb fixed costs.
- **Loss-rate ceilings**: maximum *expected* loss rate in the booked portfolio, e.g., $\sum_i p_i \mathbf{1}_{\text{accept}_i} \big/ N_{\text{accept}} \le \bar p_{\max}$. This is a risk-appetite statement distinct from a profit objective: a book can be profitable and still breach the loss ceiling.
- **Concentration and segment caps**: maximum share of accepted book in any single risk decile, geography, product, or industry.
Regulatory capital rules (Basel Standardized Approach, SBV Circular 41/2016) and internal risk-appetite limits live here. - **Fair-lending floors**: minimum acceptance rate per protected group, or approximate parity of true-positive rates across groups. @sec-ch23 develops this family in detail and threads it through a post-processing step. - **Capital and RWA ceilings:** maximum risk-weighted asset increment per decision window, driven by regulatory capital ratios rather than profit. - **Operational capacity**: maximum decisions per day given underwriter or collections throughput. Binds mostly for manual-review pipelines and during portfolio stress. Most of these reduce to linear inequalities on either the acceptance rate $a$ or on a per-segment acceptance vector $(a_1, \dots, a_K)$, which is why the constrained problem is almost always a small linear program in practice. #### Stacking multiple constraints {.unnumbered} The same sweep-and-filter logic extends to any number of constraints that are functions of "which applicants are accepted in ascending-risk order." Stacking a min accept rate of $60\%$ with an expected loss-rate ceiling on the booked portfolio looks like this: Each additional constraint does one of three things: it is *slack* (the unconstrained optimum already satisfies it, so adding it costs nothing), *binding* (it pulls the operating point and costs some profit, which is its shadow price), or *infeasible* (no operating point satisfies all constraints simultaneously). The last row is deliberately infeasible: at $60\%$ acceptance the booked-portfolio average PD climbs to roughly $11\%$, so a loss ceiling of $8\%$ cannot be honored while respecting the growth floor. The correct response from a policy committee is not to re-solve until something fits; it is to acknowledge the conflict and relax one of the constraints, using the shadow price below to decide which relaxation is cheaper. 
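The slack / binding / infeasible distinction can be made concrete with a small sweep that stacks an accept-rate floor with a booked-PD ceiling. This is a sketch on synthetic PDs; the function name `stacked_optimum` and all parameter values are illustrative assumptions, and the "accept nobody" point is deliberately left out so the booked-portfolio PD is always defined.

```python
import numpy as np

def stacked_optimum(p, y, r, L, min_accept=0.0, max_booked_pd=1.0):
    """Best point on the sweep that satisfies both constraints, or None if none does."""
    order = np.argsort(p, kind="stable")
    p_s = np.asarray(p, dtype=float)[order]
    y_s = np.asarray(y)[order]
    n = len(p_s)
    k = np.arange(1, n + 1)                       # number of accepted applicants
    accept_rate = k / n
    profit = np.cumsum(np.where(y_s == 0, r, -L)) / n
    booked_pd = np.cumsum(p_s) / k                # average PD of the accepted book
    feasible = np.flatnonzero((accept_rate >= min_accept) & (booked_pd <= max_booked_pd))
    if feasible.size == 0:
        return None                               # constraints conflict: report, do not force
    best = feasible[np.argmax(profit[feasible])]
    return {
        "accept_rate": float(accept_rate[best]),
        "profit": float(profit[best]),
        "booked_pd": float(booked_pd[best]),
    }

rng = np.random.default_rng(0)
p = rng.beta(2, 5, size=5_000)      # synthetic calibrated PDs (assumption)
y = rng.binomial(1, p)

free = stacked_optimum(p, y, r=0.10, L=0.50)
loose = stacked_optimum(p, y, r=0.10, L=0.50, min_accept=0.60, max_booked_pd=0.30)
tight = stacked_optimum(p, y, r=0.10, L=0.50, min_accept=0.60, max_booked_pd=0.08)
for name, res in [("unconstrained", free), ("floor 60% + cap 30%", loose), ("floor 60% + cap 8%", tight)]:
    print(name, "->", res if res is not None else "INFEASIBLE")
```

Returning `None` for the conflicting pair keeps the infeasibility visible to the caller, which is the behavior the paragraph above asks of a policy committee.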
The curve is flat at zero while the floor is below the unconstrained optimum (\~49%), because the unconstrained choice already satisfies the floor; once the floor climbs above that point, every additional percentage point of mandated acceptance costs a roughly constant slice of per-applicant profit. This slope is the number a CFO should be quoting when negotiating loan-growth targets against risk appetite.

#### From sweep to linear program: which library to reach for {.unnumbered}

The sweep pattern above works because the problem is one-dimensional: a single threshold on a single score. As soon as there are multiple scores, multiple products, or per-segment thresholds, the constrained optimum should be solved as a general linear program. Python offers a ladder of tools:

- `scipy.optimize.linprog`: small dense LPs, fine for a handful of segments; built into the scientific stack.
- `pulp` or `cvxpy`: mid-size problems where constraints and objective are easier to *read* than to code as matrices. `cvxpy` in particular lets the policy team write `sum(cost * accept) <= budget` instead of shaping `A_ub` and `b_ub` by hand.
- `python-mip` or `pyomo`: binary-decision problems (accept or reject per individual applicant) with interfaces to `CBC` / `Gurobi`. Typically overkill for credit threshold selection because the LP relaxation is tight: sorting by score and taking the top fraction per segment is an optimal LP solution on totally unimodular data.
- `fairlearn.postprocessing.ThresholdOptimizer`: scikit-learn-compatible utility that returns group-specific thresholds subject to equalized-odds, demographic-parity, or related constraints. This is the shortest path from a fitted classifier to a fairness-constrained policy.
- `optbinning`: originally a WoE/binning library, but its profit and EMP helpers expose "given a score and a cost vector, return the optimal cut-off" with a solver under the hood.

For most credit policy problems the right first step is the sweep.
Move to `cvxpy` when you have more than two or three segments or non-trivial couplings between them (a cap on *joint* share of two geographies, for example), and move to a 0/1 integer solver only if you need a per-applicant decision that is not expressible as "threshold on a score per segment." #### A worked comparison across the ladder The four code chunks below solve the *same* multi-segment credit-policy problem with four different libraries, on the Taiwan test fold. The problem has per-applicant accept variables $x_i \in [0, 1]$, per-applicant expected-contribution coefficients $c_i = r(1 - p_i) - L p_i$, and three families of constraints that a real policy committee would argue about: 1. an overall minimum acceptance rate (growth target), 2. a cap on the booked-book expected PD (risk appetite), 3. per-segment minimum-acceptance floors (e.g., to keep the high-school-educated segment from collapsing to zero accepts). The point of showing the same problem four times is to make the trade-off between readability, solver power, and scale concrete. The first shared block re-derives the split indices and the per-applicant segmentation so every demo operates on the same applicants. ##### `scipy.optimize.linprog`: the LP, written as matrices {.unnumbered} The SciPy interface wants the objective and constraints as dense (or sparse) arrays. It is the shortest dependency footprint (NumPy + SciPy) and plenty fast for the $\approx 10^3$ applicants here, but the code reads like a stack of matrix rows rather than like the policy statement. The LP relaxation is tight: every $x_i$ comes back at $0$ or $1$ (no fractional decisions), which is the totally-unimodular property referenced in the ladder. That is why sorting by score and taking the top fraction per segment is the same optimal policy. ##### `cvxpy`: the same LP, written as policy statements {.unnumbered} `cvxpy` lets the policy discussion and the code converge. 
Each line below corresponds to one bullet that a chief risk officer would read on a policy memo, with the added bonus that the *dual values* of the constraints come back for free: exactly the shadow-price number used in the third habit below. Two things are visible. First, the `scipy` and `cvxpy` solutions agree to solver tolerance, confirming they are solving the same LP. Second, the constraint on the joint `grad + uni` share of the booked book *could not* have been expressed as a per-segment floor; it is a cross-segment coupling, and it is exactly the kind of constraint that turns an Excel-grade policy memo into an LP. ##### `pulp` / `python-mip`: when decisions must be binary {.unnumbered} The LP relaxation being tight is the reason the ladder says an integer solver is "typically overkill." The integer solver becomes necessary only when the policy has genuinely combinatorial structure: per-applicant binary decisions with side constraints that couple individuals, e.g., "book at most one of these two correlated exposures" or "approve in batches of 10 to respect underwriter throughput." `pulp` and `python-mip` are near-identical in spirit; here is `pulp`, calling the open-source CBC solver. The per-segment "max accepted PD $\le$ min rejected PD" check confirms the integer solution collapses back to a threshold per segment, which is why the LP relaxation was adequate in the first two demos. `pyomo` and `python-mip` are drop-in replacements that expose the same CBC or Gurobi backends; the choice between them is mostly about which modeling API the team already knows. ##### `fairlearn.postprocessing.ThresholdOptimizer`: group-specific thresholds {.unnumbered} Fair-lending floors sit in a different slot on the ladder. 
When the constraint is "approximately equalize TPR and FPR across a protected attribute," the cleanest implementation is not to encode the constraint into the LP but to post-process a scored classifier with group-specific thresholds chosen to satisfy the parity condition. `fairlearn` wraps that choice in a scikit-learn-compatible object.

The "fair" columns come from `ThresholdOptimizer`; the "one" columns from a single shared threshold picked to match the overall accept rate. The gap in FPR across groups is narrower under the fair policy, which is the equalized-odds guarantee; the cost is a small drop in overall accuracy relative to the shared threshold. The reason to reach for `fairlearn` rather than add a fairness constraint to the `cvxpy` program is not that it cannot be expressed there (it can, as a linear inequality on per-group acceptance rates). It is that `ThresholdOptimizer` chooses between *randomized* group-specific thresholds when no deterministic rule exactly hits the parity condition, which is the regulator-expected behavior in US Regulation B disparate-impact analysis and is easy to get wrong by hand.

#### How business teams should think about this {.unnumbered}

The point of constraints is not to squeeze the last dollar out of the model; it is to make the trade-off between risk appetite, regulatory obligation, and commercial ambition quantitative. Three habits help:

1. **Always report unconstrained and constrained side by side.** The gap is the dollar price of the constraint. If the gap is small (as with the $55\%$ floor above, $\sim 2$ bp of profit per applicant, i.e., $0.0457 - 0.0455 = 0.0002$), the policy discussion can focus on whether the constraint is the right one without worrying about the model choice. If the gap is large, the lender should interrogate whether the constraint is truly necessary or whether it can be softened (e.g., by averaging the minimum-accept rate over a rolling quarter rather than enforcing it every month).
2.
**Think in shadow prices, not in absolutes.** A sentence like "every additional 5 pp of mandated accept rate above $49\%$ costs about $0.3$ bp of per-applicant profit, which at 100k annual applicants and a \$10,000 average exposure is $\approx 30,000$ USD a year per 5 pp" is far more useful than "profit went from $0.0457$ to $0.0455$", because it lets policy authors choose the constraint level where the marginal profit sacrifice is tolerable. 3. **Name the binding constraint.** In practice only one or two constraints actually bind at any given time; the rest are slack. If the binding constraint is the loss ceiling, the lever is model quality or pricing. If the binding constraint is the accept-rate floor, the lever is underwriting throughput or channel growth. If the binding constraint is a fairness floor, the lever is feature engineering or reject-inference coverage on the under-served segment. Identifying the binding constraint tells the reader *which part of the business* is actually deciding this year's book. #### The three habits, in code Each habit is an executable operation, not just a principle. The three chunks below apply them to the Taiwan profit curve already built above, using policy-team assumptions a CFO would recognize: $100,000$ scored applicants per year and an average booked exposure of $10,000$ USD per loan. These two constants let us translate a "basis points of per-applicant profit" number into an annual dollar figure, which is what lets the habit change the conversation. ##### Habit 1: unconstrained vs constrained, priced in USD {.unnumbered} The first habit turns the profit-curve gap into a sentence a finance committee can use. The gap at a $55\%$ floor is tiny; the gap at an $85\%$ floor is not. Showing both in the same table is how you stop a policy discussion from anchoring on whichever number was loaded into the slide deck first. 
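A minimal sketch of the first habit, assuming the two policy constants stated above and a synthetic profit curve in place of the Taiwan one. The helper names `annual_usd` and `floor_cost_table` are illustrative, and the printed gaps will differ from the figures discussed in the text.

```python
import numpy as np

N_APPLICANTS = 100_000      # scored applicants per year (from the text)
AVG_EXPOSURE_USD = 10_000   # average booked exposure per loan (from the text)

def annual_usd(profit_per_applicant):
    """Translate exposure-normalized per-applicant profit into annual USD."""
    return profit_per_applicant * N_APPLICANTS * AVG_EXPOSURE_USD

def floor_cost_table(accept_rate, profit, floors):
    """Gap between the unconstrained peak and the best point above each floor."""
    peak = profit.max()
    rows = []
    for floor in floors:
        constrained = profit[accept_rate >= floor].max()
        gap = peak - constrained
        rows.append({"floor": floor, "gap_bp": gap * 1e4, "gap_usd": annual_usd(gap)})
    return rows

# Synthetic profit curve (assumption): accept in ascending-PD order.
rng = np.random.default_rng(0)
p = rng.beta(2, 5, size=5_000)
y = rng.binomial(1, p)
gains = np.where(y[np.argsort(p, kind="stable")] == 0, 0.10, -0.50)
accept_rate = np.arange(len(p) + 1) / len(p)
profit = np.concatenate([[0.0], np.cumsum(gains)]) / len(p)

rows = floor_cost_table(accept_rate, profit, floors=(0.55, 0.85))
for row in rows:
    print(f"floor {row['floor']:.0%}: {row['gap_bp']:.1f} bp  ~ {row['gap_usd']:,.0f} USD/yr")
```

The unit conversion is the whole point: a 2 bp gap, i.e. `annual_usd(0.0002)`, is 200,000 USD a year at these volumes.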
Reading the output: the $55\%$ floor costs roughly $2$ bp of per-applicant profit, or about \$200k a year at the stated volume. That is a rounding error in a retail credit portfolio; the policy discussion can focus on whether $55\%$ is the right target, not on whether the model lift justifies the constraint. The $85\%$ floor costs roughly $300$ bp per applicant, or about \$30M a year: an altogether different conversation, and one that should include reconsidering whether loan growth should be averaged over a rolling quarter rather than enforced every month. ##### Habit 2: the shadow price, per 5 pp of floor {.unnumbered} The second habit converts the profit curve into a *marginal* dollar cost: "each additional 5 pp of mandated acceptance above the unconstrained optimum costs about $X$ USD a year." That single sentence is far more useful than two decimal places on an absolute profit number, because it lets a policy author pick the tightest constraint whose marginal cost is still tolerable. The table and the figure are the same object, read two different ways. The plateau (zero marginal cost) is the region where the floor is slack: the unconstrained optimum already sits above it. The elbow is where the constraint starts pulling the operating point off the flat part of the profit curve, and the steep tail past $80\%$ is where every incremental 5 pp costs seven-figure annual profit. A policy author can now pick the constraint level at which the marginal sacrifice is tolerable, rather than negotiating in the abstract. ##### Habit 3: naming which constraint is actually binding {.unnumbered} The third habit is the most diagnostic. Solve the policy LP with `cvxpy`, read off the dual values of every constraint, and print which constraint has a non-zero dual. That is exactly the set of constraints the business is actively giving up profit for; the rest are free. 
The scenarios below walk the committee through three policy regimes and, for each, identify the *single* constraint that is deciding the book. Each row of scenarios tells a single-sentence story. In the base policy the min-accept floor binds: the business is giving up profit to hit a growth target, so the *lever* is channel growth or underwriting throughput. If the marketing team can deliver more applicants of the same quality, the floor becomes slack and the cost disappears. In the "growth push" the same constraint is still binding but with a much larger dual, which is how you see (without re-reading the policy memo) that the growth target has moved from comfortable to aggressive. In the "risk hawk" scenario the booked-PD cap binds instead: the lever is model quality or pricing, because the only way to accept more applicants without breaching the cap is to separate good risks from bad ones more cleanly. In the "fair-lending" scenario the SEX=1 accept-rate floor binds: the lever is feature engineering or reject-inference on that sub-population, because the model is currently under-booking them relative to the parity target. The dual value itself is the marginal USD cost of that constraint at the current operating point and converts directly to the annual-dollar figure shown on the last line of each block. The operational reading is the same across all three habits. The unconstrained/constrained gap says whether the constraint is free or expensive. The shadow price per 5 pp says *how* the expense scales. The binding-constraint reading says *which* part of the business is deciding this year's book. Reported together on a single page, these three numbers let a policy committee argue about the right target rather than about the modeling choice. ## Population stability, CSI, and drift monitoring ### Why stability matters A credit score trained on 2020 data starts drifting the day after its model monitoring report is signed. 
Application mix changes, credit-bureau data changes, and macro conditions change. Three monitoring tools are standard: PSI (@sec-ch04-psi) on the score distribution, CSI (@sec-ch04-csi) on individual input features, and rolling AUC (@sec-ch04-auc) or KS (@sec-ch04-ks) on recent outcomes that have matured. Drift-induced performance loss is documented in a long line of machine-learning work [@gama2014survey]. ### Population Stability Index **What is being compared.** PSI is a distance between *two distributions of the same scalar quantity* evaluated on two populations. You pick one scalar (the variable you are monitoring) and two time windows (the populations), then compute one PSI number that answers "did window $A$ look like window $E$ for this variable?". In credit monitoring, the scalar is almost always the *model score* (or equivalently the calibrated PD), because a single score-level PSI summarizes whether the overall risk mix of applicants has moved. The two populations are: - $E$ = "expected" or reference: the score distribution on the *development sample* (the data the model was trained on), or on the last revalidation vintage. $E$ is held fixed, often for a full year of production, so that successive PSI numbers are comparable. - $A$ = "actual": the score distribution on the *current scoring window*, typically the most recent calendar month or quarter of applications that have been scored but not necessarily matured yet. **A concrete example.** Suppose the logistic scorecard was trained on applications booked January through December 2024 (the development sample) and went live on 1 January 2025. On 1 April 2025, the monitoring team wants to know whether March 2025 applicants still look like the development book. They: 1. Pull the $\hat p$ (PD) the current model assigns to every 2024 development-sample application. Call this vector $E$. This is fixed for 2025 and reused every month. 2. 
Pull the $\hat p$ the current model assigns to every March 2025 application. Call this vector $A$. 3. Bin $E$ into 10 deciles (so each reference decile holds 10 percent by construction), drop $A$ into the same cutpoints, and apply @eq-psi. A PSI of $0.03$ means the March 2025 applicant mix is indistinguishable from development at monitoring resolution. A PSI of $0.18$ in, say, May 2025 says "investigate": maybe a new marketing channel is sending thinner files. A PSI of $0.31$ in August 2025 says the score no longer describes the population it is being used on, and retraining is on the table. One month later, the team repeats the exercise with $E$ unchanged and $A$ now equal to the April 2025 scores, and so on. The same formula applies unchanged to any *single* input feature (income, utilization, days-past-due-30, debt-to-income). In that case, the scalar $E$ is "debt-to-income on the development sample" and $A$ is "debt-to-income on March 2025 applicants". When the scalar is a feature rather than the score, the metric is called the Characteristic Stability Index (CSI) and is covered in @sec-ch04-csi. The division of labor is simple: PSI on the score answers "has the overall risk mix of my applicants changed?", CSI on individual features answers "which specific input moved?" and therefore "why did PSI move?". **What it looks like.** The clearest way to build intuition is to draw $E$ and $A$ on top of each other for two cases: a quiet month that should produce a near-zero PSI, and a drifted month that should trip the investigation threshold. @fig-psi-intuition uses the logistic-scorecard test-fold PDs as a stand-in for $E$, treats the first half as the development reference, and constructs two "actual" populations: a stable one (the second half, i.i.d. with the first) and a drifted one (the second half with a deliberate upward shift). 
The left column overlays the two densities, the right column shows the decile-level expected-versus-actual proportions and the per-bin PSI contributions that sum to the headline number.

The top row is what a healthy month looks like: the two densities lie on top of each other, every decile of $E$ holds roughly 10 percent of $A$, and the per-bin contributions are all within rounding of zero. The bottom row is the picture a monitoring committee cares about: $A$ has shifted to the right, the low-PD deciles of $E$ are under-populated in $A$ (fewer clean files), the high-PD deciles of $E$ are over-populated (risk mix moved up), and two or three bin contributions account for most of the PSI total. Nothing in the scalar would tell you this, but the bar chart tells the remediation team exactly which part of the score range to investigate first.

Partition the expected score distribution $E$ and the actual distribution $A$ into $B$ buckets with proportions $e_b$ and $a_b$. PSI is the symmetric Kullback-Leibler discrepancy up to constants,

$$ \mathrm{PSI} = \sum_{b=1}^{B} (a_b - e_b) \log\frac{a_b}{e_b}. $$

Two properties are worth naming explicitly. The sum is *symmetric* in $E$ and $A$ (i.e., swapping reference and actual gives the same PSI, unlike the raw KL divergence). And every per-bin term $(a_b - e_b) \log(a_b/e_b)$ is *non-negative*, because the difference and the log always carry the same sign, so the total decomposes cleanly as a non-negative sum of bin-level contributions. That decomposition is what we use below to localize the drift.

Industry thresholds, often credited to the early Experian and FICO model-governance notes, call $\mathrm{PSI} < 0.10$ stable, $0.10 \le \mathrm{PSI} < 0.25$ requires investigation, $\mathrm{PSI} \ge 0.25$ means the model needs retraining. The cut-offs are convention, not theory; they survive because they work on long-run empirical data.
In practice the most useful cut-off is the one calibrated against the *noise floor* your own pipeline generates in quiet periods (see below); the industry numbers are a starting point, not a mandate.

**Reading the 0.0034 result.** The scalar we pass into `psi_from_scratch` here is `p_lr`, the held-out-fold logistic PD. The two "populations" are simply the first 3,000 and the last 3,000 rows of that same test fold (i.e., two random halves of an i.i.d. sample). By construction, they should look statistically identical, and a PSI of $0.0034$ says exactly that: roughly three-tenths of a percent, well below the $0.10$ "stable" threshold and nowhere near the $0.25$ "retrain" threshold. This value is the *noise floor*: the level the monitor must rise above before the alarm should fire on this dataset. In a live pipeline, calibrate your investigation threshold against the empirical distribution of PSI during historically quiet periods; the industry $0.10/0.25$ numbers are a reasonable default, but the right threshold is the one that separates signal from the noise level of your particular data feed.

**Decomposing PSI by bin.** A single PSI scalar hides the *shape* of the drift. The per-bin contributions $(a_b - e_b) \log(a_b/e_b)$ are non-negative and sum to PSI, so they localize *where* the distribution moved. Two PSI $= 0.20$ episodes can have very different causes:

- *Concentrated in the highest-risk decile.* The portfolio has absorbed a new cohort of higher-risk applicants (e.g., a macro shock, a new marketing channel, a competitor's risk-based-pricing change). Remediation is usually business: tighten underwriting or re-price the product.
- *Spread roughly evenly across all deciles.* An upstream data change is shifting every score mechanically (e.g., a bureau integration switched, a missing-value imputation changed, a new version of a feature transformer). Remediation is usually engineering: find the data change, not the credit policy.
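
The per-bin decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's `psi_from_scratch`: the function name `psi_by_bin`, the `eps` floor for empty bins, and the synthetic Beta-distributed scores are all assumptions made for the example.

```python
import numpy as np

def psi_by_bin(expected, actual, n_bins=10, eps=1e-6):
    """Return (psi, per-bin contributions, e_b, a_b) on reference-decile bins."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cutpoints come from the reference sample only.
    cuts = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    e_idx = np.searchsorted(cuts, expected, side="right")
    a_idx = np.searchsorted(cuts, actual, side="right")
    e = np.bincount(e_idx, minlength=n_bins) / expected.size
    a = np.bincount(a_idx, minlength=n_bins) / actual.size
    # Floor empty bins so the log stays finite (eps is an assumed choice).
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)
    contrib = (a - e) * np.log(a / e)
    return contrib.sum(), contrib, e, a

rng = np.random.default_rng(0)
scores = rng.beta(2, 5, size=6000)            # synthetic PD-like scores
reference, quiet = scores[:3000], scores[3000:]
drifted = quiet ** 0.8                        # pushes every score in (0, 1) upward

psi_quiet, contrib_quiet, _, _ = psi_by_bin(reference, quiet)
psi_drift, contrib_drift, e_b, a_b = psi_by_bin(reference, drifted)

# Bin-level table: signed shift in bin mass and its PSI contribution.
for b, (c, d) in enumerate(zip(contrib_drift, 100 * (a_b - e_b)), start=1):
    print(f"bin {b:2d}  delta % = {d:+6.2f}  contribution = {c:.4f}")
print(f"PSI quiet = {psi_quiet:.4f}, PSI drifted = {psi_drift:.4f}")
```

The floor on empty bins matters in practice: a bin that is empty in one sample would otherwise make the log infinite, and the size of `eps` then dominates the headline number, so that case is worth flagging separately in a live monitor.
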
In this split-the-sample case, every bin contributes essentially nothing: the `delta %` column hovers inside plus-or-minus one percentage point of bin mass, which is what binomial noise of order $\sqrt{e_b(1-e_b)/n}$ looks like at $n = 3{,}000$ and $B = 10$ (with $e_b = 0.10$, about half a percentage point per standard error). In a real drift episode, the equivalent table shows one or two rows with contributions an order of magnitude larger than the rest; the `delta %` signs and the bin index together tell the committee whether the drift is at the top of the score distribution, the bottom, or smeared across the middle. That is the level of detail a remediation conversation needs, and it is lost if the monitor only reports the headline scalar.

### PSI under intentional drift

The split-the-sample check in the previous subsection establishes a noise floor of roughly $0.003$: that is the value PSI takes when nothing has moved. The opposite end of the scale is equally important. If we know the distribution has shifted by a controlled amount, does PSI respond monotonically, and where does it cross the conventional $0.10$ and $0.25$ thresholds? Answering that question is what lets a monitoring team interpret a live PSI reading rather than just report it.

The experiment below sweeps a shift parameter $\delta$ from $0$ to $0.5$, adds $\delta \cdot \mathrm{Beta}(2,2)$ noise to a reference $\mathrm{Beta}(2,5)$ score population, and recomputes PSI at each step. The reference distribution is the shape a typical PD model produces: mass concentrated at low scores with a thin upper tail. The additive perturbation pushes probability mass to the right, which is what a deteriorating portfolio looks like in practice.

The curve is monotone and roughly convex: each additional unit of shift buys a larger increment in PSI, so the index is more sensitive to drift once it has already started. The investigate line at $0.10$ is crossed between $\delta \approx 0.15$ and $0.20$, and the retrain line at $0.25$ around $\delta \approx 0.25$.
Two practical points follow. First, the industry thresholds are not arbitrary round numbers; on a realistically shaped score they correspond to distribution shifts large enough to be visible by eye in an overlay histogram. Second, by the time PSI reaches $0.25$, the shift has consumed about half of the x-axis range used here, which is why monitoring at the $0.10$ line, not waiting for $0.25$, is the standard operating discipline.

### Characteristic Stability Index

CSI and PSI are *the same formula*. The Characteristic Stability Index for feature $j$ is

$$
\mathrm{CSI}_j = \sum_{b} (a_{j,b} - e_{j,b}) \log\frac{a_{j,b}}{e_{j,b}},
$$

which is @eq-psi with $(e_b, a_b)$ replaced by the binned marginal distribution of input $j$. There is no new mathematics here, and implementations routinely reuse the same `psi` function for both quantities (as we do in the code below).

The two names exist because the monitoring conversation is different depending on what you binned. PSI on the composite score answers "does the model's output look like what we trained on?", which is the first signal a governance committee looks at. CSI on each input answers "which of the things we feed the model has drifted?", which is the diagnostic you pull up *after* PSI fires. Reporting them under separate names keeps the dashboard legible: an alert on $\mathrm{PSI}$ goes to the model owner, a cluster of alerts on $\mathrm{CSI}_j$ goes to the data engineering team that owns the feature pipeline.

The pairing of the two readings is what makes CSI useful. A large $\mathrm{CSI}_j$ on a single input combined with a modest PSI on the score means the model has *absorbed* a feature shift, usually because correlated inputs compensated or the feature had low weight; remediation may be no more than a documentation update. A large $\mathrm{CSI}_j$ on multiple inputs combined with a large PSI is a hard distribution shift the model cannot absorb, and is the textbook case for retraining.
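
Because CSI is the PSI formula applied feature by feature, a per-feature drift report is a short loop over one shared function. The sketch below is illustrative only: the compact `psi` helper, the feature names, and the synthetic distributions (with `utilization` deliberately shifted) are assumptions, not the chapter's data.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI on reference-decile bins; the same function serves as CSI."""
    cuts = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    e = np.bincount(np.searchsorted(cuts, expected, side="right"),
                    minlength=n_bins) / len(expected)
    a = np.bincount(np.searchsorted(cuts, actual, side="right"),
                    minlength=n_bins) / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)  # assumed floor
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
n = 5000
# Hypothetical development sample and current month, one array per feature.
dev = {
    "dti": rng.beta(2, 5, n),
    "utilization": rng.beta(2, 2, n),
    "income": rng.lognormal(10.5, 0.5, n),
}
cur = {
    "dti": rng.beta(2, 5, n),              # unchanged
    "utilization": rng.beta(3, 2, n),      # deliberately drifted upward
    "income": rng.lognormal(10.5, 0.5, n), # unchanged
}

csi = {name: psi(dev[name], cur[name]) for name in dev}
for name, value in sorted(csi.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} CSI = {value:.4f}")
```

Sorting the features by CSI gives the data engineering team a ranked list of inputs to inspect once the score-level PSI has fired.
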

### Rolling PSI

In production a daily PSI is computed against a fixed reference (often the training distribution). Rolling plots make drift visible.

## Validation designs

### Holdout

A single train/test split is the cheapest design and the weakest. The estimator of generalization error has the variance of a single draw. It is only adequate when data are abundant and the question is whether model A beats model B by a large margin.

### k-fold cross-validation

@stone1974cross defines cross-validation as the rotation of $k$ non-overlapping folds, with the point estimate the average of the $k$ held-out scores; its bias-variance properties are analyzed in @arlot2010survey.

A warning before the code. k-fold is the textbook default for i.i.d. tabular data and it is what almost every published credit-scoring benchmark reports [@baesens2003benchmarking; @lessmann2015benchmarking], because the public UCI files have no timestamps and there is nothing else to do. It is *not* the right design for a production credit model. Shuffling observations across folds mixes future and past, so a model validated by k-fold sees information that a live model will not have, and the estimated AUC hides any temporal drift. The next two subsections (@sec-ch04-oot on out-of-time validation, @sec-ch04-walkforward on walk-forward) present the designs that a supervisor will actually accept. k-fold appears here for three reasons only: it is the result most benchmark papers quote, it is an honest estimator on the UCI files used in this chapter, and it provides the variance baseline that out-of-time and walk-forward numbers are compared against. When running it, stratify by the rare class; $k=5$ and $k=10$ are the conventional choices.

### Out-of-time validation

For a production credit model, k-fold shuffles through time and hides temporal drift. The supervisory preference is out-of-time (OOT) validation, and the design is deliberately simple.

1. Pick a cutoff date $T$.
2. Train on everything before $T$.
3. Test on the single most recent window after $T$ that already contains matured outcomes (i.e., applications old enough that the default label has been observed).
4. Report one AUC, one KS, one Brier, one profit number.

The OOT performance is the honest answer to the question the bank cares about, namely how the model will behave on next quarter's applications, and it is the number that shows up in a model-validation memo to the regulator. The price of the simplicity is that OOT is *one* estimate. You learn nothing about whether next quarter's number is better or worse than the quarter before it, and the sample size of that single window sets the width of the confidence interval.

### Walk-forward validation

Walk-forward is OOT repeated. Slide the cutoff $T$ forward by one month (or one quarter), refit on the updated training window, evaluate on the next period, record the number, and continue. The design yields a *time series* of performance metrics rather than the single scalar OOT produces. Two things become visible that OOT hides: the shape of degradation between retrainings, and the natural month-to-month variance against which any single OOT estimate should be read. It also lets you compare training-window lengths directly, as the $6$-month and $12$-month lines below do. @bergmeir2012use shows that walk-forward is consistent under mild stationarity conditions and recommends it as the default for time-series predictor evaluation.

> In short: OOT is the point estimate, walk-forward is the series that puts error bars on it.

Since neither UCI file carries timestamps, we synthesize a cohort with a mild temporal shift.

The shorter window tracks the drift faster but is noisier. The longer window is smoother but lags. The choice between them is governed by the stationarity assumption in the portfolio: fast-moving consumer populations want shorter windows, stable commercial books can carry longer ones.
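
The walk-forward loop described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's experiment: the synthetic monthly cohorts, the drift in the coefficients, and the helper names `make_month` and `walk_forward` are all hypothetical, and a plain `LogisticRegression` stands in for whatever model is being validated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_month(t, n=2000):
    """Synthetic cohort for month t with slowly drifting coefficients."""
    X = rng.normal(size=(n, 3))
    beta = np.array([1.0 - 0.02 * t, 0.5, 0.02 * t])
    logit = -2.0 + X @ beta
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return X, y

data = [make_month(t) for t in range(24)]  # 24 monthly cohorts, in time order

def walk_forward(data, window):
    """Fit on the `window` months before t, score month t, then slide forward."""
    aucs = []
    for t in range(window, len(data)):
        X_tr = np.vstack([X for X, _ in data[t - window:t]])
        y_tr = np.concatenate([y for _, y in data[t - window:t]])
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        X_te, y_te = data[t]
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.array(aucs)

auc_6 = walk_forward(data, window=6)    # shorter training window
auc_12 = walk_forward(data, window=12)  # longer training window
print(f"6-month window : mean AUC {auc_6.mean():.3f}, sd {auc_6.std(ddof=1):.3f}")
print(f"12-month window: mean AUC {auc_12.mean():.3f}, sd {auc_12.std(ddof=1):.3f}")
```

Each entry of `auc_6` or `auc_12` is one OOT estimate; the spread of the series is the month-to-month variance against which any single OOT number should be read. In a real cohort the evaluation month must also be old enough for outcomes to have matured, which this synthetic example does not need to model.
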

### Nested cross-validation

Nested CV addresses a different leakage than the temporal one discussed in @sec-ch04-oot and @sec-ch04-walkforward. The problem it solves is *hyperparameter* leakage: if the same folds are used to both pick hyperparameters and report generalization, the reported number is biased upward because the hyperparameters were tuned against the very observations now being used to score them. Reusing the same fold for both overstates performance by roughly $0.5$ to $2$ percent of AUC in credit-scoring benchmarks [@lessmann2015benchmarking]. The nested design fixes this by separating the roles: an outer loop evaluates generalization, and an inner loop inside each outer training block selects hyperparameters.

It does *not* fix temporal leakage. If both the outer and inner splits are shuffled k-folds, the outer training blocks still contain observations from after the outer validation period, and the estimate remains optimistic in the same way a plain k-fold is. The production-correct pattern is nested *time-based* splits: the outer loop is walk-forward over time (@sec-ch04-walkforward), and the inner loop grid-searches inside each historical training window, respecting the same order-preserving discipline. Use shuffled nested CV only in the same scope where plain shuffled k-fold is acceptable, namely benchmark tables on the UCI files, which is the context of the code below.

In high-signal regimes a cheap substitute is to fix hyperparameters from prior experience and use a single (time-respecting) CV for estimation.

#### Production pattern: nested walk-forward CV

The code above uses `StratifiedKFold` in both loops and therefore inherits the temporal-leakage critique of plain shuffled k-fold. This subsection replaces both loops with time-respecting splits.
The pattern is the one that goes into a model-validation memo: the outer loop walks the cutoff forward month by month exactly as in @sec-ch04-walkforward, and the inner loop is a chronological `TimeSeriesSplit` *within* the current outer training window. No row from month $t$ ever participates in selecting hyperparameters for a model that will be scored on month $\tau < t$, and no validation month ever contributes to the fit that predicts it.

**Packages used.** `sklearn.model_selection.TimeSeriesSplit` for the inner chronological splitter [@pedregosa2011scikit]. No custom splitter is needed because the data is already grouped by month in the `data` list built in @sec-ch04-walkforward; the inner loop splits along the *month axis*, which is the unit that must stay ordered. If the cohort were a flat dataframe with a date column, the equivalent construction would be `TimeSeriesSplit` on the sorted unique month index, then `np.isin(df["month"], train_months)` to materialize each fold. `GridSearchCV` is deliberately avoided here: its default splitter does not see the month grouping, and passing a prebuilt list of `(train_idx, val_idx)` tuples through its `cv=` argument obscures the invariant this code is meant to make obvious.

Three details deserve emphasis.

1. *The inner splitter runs on months, not on rows.* `TimeSeriesSplit(n_splits=k)` is called with `np.arange(n)` where `n` is the number of training months. Each inner fold is therefore a contiguous block of months, which is the grouping that actually matters for temporal leakage. Splitting on rows inside the outer window would recreate the same leakage this pattern is trying to avoid, because a single month's observations would straddle inner train and inner validation.
2. *The reported number is the mean of the outer scores.* The selected $C$ per outer fold is a byproduct, not the deliverable. Papers that report the *single* hyperparameter chosen by nested CV are misreading the procedure: nested CV estimates the error of the *model-selection pipeline*, not of one fixed model. If the goal is a single deployable model, pick hyperparameters once on the full historical window using the same inner `TimeSeriesSplit`, and report the nested number as the honest generalization estimate for that pipeline.
3. *The bottom panel is a diagnostic.* If the selected $C$ moves substantially across outer months, the pipeline is drift-sensitive and the nested estimate is the right summary to quote. If $C$ is flat across all outer folds, the inner search is adding variance without changing the answer, and the cheap substitute mentioned earlier, namely fixing hyperparameters from prior experience and running a single time-respecting CV, is likely adequate.

When the data is a flat panel with a date column rather than a prebuilt list, the same construction is:

The structure is identical: outer `TimeSeriesSplit` on sorted unique months, inner `TimeSeriesSplit` on the outer training months only, and row materialization through `isin` masks. The same pattern extends to `GradientBoostingClassifier`, `lightgbm.LGBMClassifier`, or any estimator whose hyperparameters need tuning; only the `param_grid` and `fit` call change.

## Statistical comparison of classifiers

Every section so far has produced *point estimates*: one AUC, one KS, one Brier, one profit number. A practitioner now has to answer the question that point estimates cannot: given that model A scores higher than model B on the test set, is that difference a real improvement or is it within the sampling noise of this particular evaluation sample? The chapter intro already flagged that most benchmark-paper disagreements turn out to be about variance rather than algorithms [@baesens2003benchmarking; @lessmann2015benchmarking].
This section gives the standard procedures that let a model owner defend "A is better than B" to a validator, and that let a benchmark paper rank many classifiers across many datasets without pretending that small gaps are meaningful.

Two settings come up, and they need different tools.

- *Two classifiers, one evaluation sample.* This is the common case inside a single bank: the challenger model and the champion model are scored on the same OOT window, and the question is whether $\Delta\mathrm{AUC}$ is significantly nonzero. Because both AUCs are computed on the same observations, they are *correlated*, and an unpaired test would give the wrong standard error. @sec-ch04-delong is the parametric paired procedure; @sec-ch04-bootstrap-ci is the distribution-free paired alternative.
- *Many classifiers, many datasets.* This is the setting of a benchmark paper or a cross-portfolio comparison: $K$ algorithms each run on $N$ datasets, and the question is which algorithms sit significantly above the others overall. Pairwise tests do not compose here because of multiple-comparison inflation and because AUC on different datasets is not directly commensurable. @sec-ch04-friedman gives the rank-based procedure that handles both issues.

A natural question: if Friedman-Nemenyi solves the multiple-comparison and scale problems, is it strictly better than DeLong? No. The two tests operate on different null hypotheses and different data structures, and the correct choice is dictated by how many datasets are on the table, not by which test has the cleaner statistical properties.

- *On one dataset, DeLong strictly dominates Friedman-Nemenyi.* DeLong exploits the pairing of predictions on the *same observations* and consumes the full placement structure (@eq-delong-place); Friedman-Nemenyi would have only $N = 1$ dataset, which is below the sample size the rank test needs to reject anything. Running Friedman on a single OOT window is not conservative, it is uninformative.
- *Across many datasets, DeLong does not compose.* Pairwise DeLong on $K$ classifiers gives $K(K-1)/2$ $p$-values with no built-in family-wise correction, and the variance estimator is per-dataset so there is no principled way to pool across datasets. Friedman-Nemenyi is the correct aggregation precisely because it moves to ranks.
- *In a hybrid workflow, use both.* Run DeLong inside each OOT window to defend pair-level improvements to the validator, and run Friedman-Nemenyi across OOT windows or portfolios to defend overall ranking in a benchmarking memo. The two answers rarely conflict, but when they do, each is answering a different question.

### DeLong test for two correlated AUCs

@delong1988comparing derive a nonparametric variance estimator for the difference between two AUCs computed on the same observations. Let $V_{10}^{(k)}(i)$ be the structural component for the $i$-th of $m$ positive observations under scorer $k$, and $V_{01}^{(k)}(j)$ the component for the $j$-th of $n$ negative observations, where $S_i^+$ and $S_j^-$ denote the scores that scorer $k$ assigns to them. Define the placements

$$
V_{10}^{(k)}(i) = \frac{1}{n}\sum_{j=1}^{n}\psi(S_i^+, S_j^-), \quad V_{01}^{(k)}(j) = \frac{1}{m}\sum_{i=1}^{m}\psi(S_i^+, S_j^-),
$$

with $\psi(a, b) = \mathbf{1}(a > b) + \tfrac{1}{2}\mathbf{1}(a = b)$. Then $\mathrm{AUC}^{(k)} = \tfrac{1}{m}\sum_i V_{10}^{(k)}(i) = \tfrac{1}{n}\sum_j V_{01}^{(k)}(j)$ and

$$
\widehat{\mathrm{Var}}(\mathrm{AUC}^{(1)} - \mathrm{AUC}^{(2)}) = \mathbf{L}^\top \left(\frac{\mathbf{S}_{10}}{m} + \frac{\mathbf{S}_{01}}{n}\right)\mathbf{L},
$$

with $\mathbf{L} = (1, -1)^\top$ and $\mathbf{S}_{10}, \mathbf{S}_{01}$ the $2\times 2$ sample covariance matrices of the placements. Under $H_0 : \Delta\mathrm{AUC} = 0$, the ratio $\Delta\widehat{\mathrm{AUC}} / \sqrt{\widehat{\mathrm{Var}}}$ is asymptotically standard normal.

@sun2014fast give an $O((m+n)\log(m+n))$ implementation that avoids the explicit double sum in @eq-delong-place.
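
The placement and variance formulas above translate almost line for line into NumPy. The sketch below uses the direct $O(mn)$ grid of pairwise comparisons; the function names `placements` and `delong_test` and the synthetic scorers are assumptions made for illustration, and the two-sided $p$-value relies on the asymptotic normal approximation stated in the text.

```python
import numpy as np
from scipy.stats import norm

def placements(pos, neg):
    """V10 (one value per positive) and V01 (one per negative) for one scorer."""
    diff = pos[:, None] - neg[None, :]          # m x n grid of score gaps
    psi = (diff > 0) + 0.5 * (diff == 0)        # 1 if a > b, 1/2 if tied
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y, s1, s2):
    """Paired DeLong z-test for AUC(s1) - AUC(s2) on the same observations."""
    y = np.asarray(y).astype(bool)
    v10, v01 = [], []
    for s in (np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)):
        a, b = placements(s[y], s[~y])
        v10.append(a)
        v01.append(b)
    v10, v01 = np.array(v10), np.array(v01)     # shapes (2, m) and (2, n)
    m, n = v10.shape[1], v01.shape[1]
    auc = v10.mean(axis=1)
    s10, s01 = np.cov(v10), np.cov(v01)         # 2 x 2 sample covariances
    L = np.array([1.0, -1.0])
    var = L @ (s10 / m + s01 / n) @ L
    z = (auc[0] - auc[1]) / np.sqrt(var)
    return auc, z, 2 * norm.sf(abs(z))

rng = np.random.default_rng(0)
x = rng.normal(size=3000)                        # latent risk driver
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * x))))
s1 = x + rng.normal(scale=0.3, size=x.size)      # stronger scorer
s2 = x + rng.normal(scale=2.0, size=x.size)      # weaker scorer

auc, z, p = delong_test(y, s1, s2)
print(f"AUC1 = {auc[0]:.3f}, AUC2 = {auc[1]:.3f}, z = {z:.2f}, p = {p:.2g}")
```

The $m \times n$ grid is the memory bottleneck of this form; identical scorers give a zero variance and an undefined $z$, a case a production implementation would need to guard.
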
The chapter's version uses the direct $O(mn)$ form because the tests in this chapter fit comfortably in memory; the fast form is useful when $m, n$ exceed $10^5$.

### Bootstrap CIs and comparison

A distribution-free alternative is the paired bootstrap. Draw $B$ bootstrap samples of observation indices, compute the AUC difference in each, and take the empirical quantiles [@efron1979bootstrap].

The two inferential procedures should broadly agree in large samples. When they diverge, DeLong is the answer that rests on asymptotic normality, which fails for tiny defaulter counts; the paired bootstrap is the robust fallback.

### Multi-classifier comparison: Friedman and Nemenyi

DeLong and the paired bootstrap answer the two-classifier question. A bank benchmarking many candidate algorithms, or a paper like @lessmann2015benchmarking comparing dozens of classifiers across dozens of portfolios, faces a harder setup: $K$ classifiers each evaluated on $N$ datasets, and the question is which ones sit significantly above the others across the whole benchmark.

Two problems rule out running DeLong or a bootstrap on every pair. First, pairwise testing inflates the family-wise error rate: at $K=10$ there are $45$ pairs, and naive $\alpha = 0.05$ tests will flag several "significant" differences by chance alone. Second, AUC numbers on different datasets are not on a common scale (an AUC of $0.72$ on a thin emerging-market file is not comparable to $0.72$ on a mature US portfolio), so averaging raw AUCs across datasets is not defensible.

The Friedman-Nemenyi procedure of @demsar2006statistical solves both problems by switching to *ranks*. Within each dataset, the classifiers are ranked from best to worst, which removes the cross-dataset scale problem. Friedman tests whether the rank distribution is uniform across classifiers; Nemenyi gives the post-hoc critical difference that controls family-wise error.
This is why the procedure is the default in the benchmarking literature and why @sec-ch16 adopts it.

The @friedman1937use test is a non-parametric ANOVA on ranks. For each dataset, rank the classifiers from 1 (best) to $K$ (worst), averaging ties. Let $\bar R_k$ be the average rank of classifier $k$ across $N$ datasets. The test statistic

$$
\chi^2_F = \frac{12 N}{K(K+1)} \left(\sum_{k=1}^{K} \bar R_k^2 - \frac{K(K+1)^2}{4}\right)
$$

has an approximate $\chi^2_{K-1}$ distribution. On rejection, pairwise comparisons use the @nemenyi1963distribution post-hoc procedure. The critical difference between average ranks at significance level $\alpha$ is

$$
\mathrm{CD} = q_\alpha \sqrt{\frac{K(K+1)}{6N}},
$$

with $q_\alpha$ the Studentized-range-based critical value. Two classifiers are declared significantly different when $|\bar R_i - \bar R_j| > \mathrm{CD}$.

Because the Nemenyi table and its extensions live in @demsar2006statistical and @sec-app-A-math, the $q_\alpha$ constants are hard-coded in the chapter's code rather than re-derived here.

## Scalability

### Pandas is the baseline

For a typical book scoring application with up to a few million observations, pandas plus sklearn is adequate and should be the default. Beyond roughly 10 to 20 million observations, the memory cost of loading all labels and scores at once begins to matter, and a two-pass algorithm with streaming quantiles and chunked histograms becomes attractive.

### Dask delayed AUC on 10M rows

The Mann-Whitney form in @eq-auc-mw can be approximated well by a fine histogram of the score distribution conditional on the label. Divide the score axis into $B$ bins, accumulate $(h_p, h_n)$ over all chunks, and compute the ROC from the cumulative class histograms. The error is $O(1/B)$ and the communication cost is $O(B)$ per chunk, independent of chunk size.
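
The chunked-histogram idea can be sketched in plain NumPy, with an ordinary loop standing in for Dask partitions. Everything here is an assumption for illustration: the helper names, the synthetic scores in $(0, 1)$, and the reduced sample size. The area is computed as the Mann-Whitney count with within-bin pairs scored as ties, which is the same number as the trapezoidal area under the ROC built from the cumulative class histograms.

```python
import numpy as np

def class_histograms(y, s, edges):
    """Per-chunk score histograms for positives and negatives."""
    h_pos, _ = np.histogram(s[y == 1], bins=edges)
    h_neg, _ = np.histogram(s[y == 0], bins=edges)
    return h_pos, h_neg

def auc_from_histograms(h_pos, h_neg):
    """Mann-Whitney AUC from binned counts; same-bin pairs count one half."""
    h_pos = h_pos.astype(float)
    h_neg = h_neg.astype(float)
    neg_below = np.cumsum(h_neg) - h_neg         # negatives in strictly lower bins
    wins = np.sum(h_pos * (neg_below + 0.5 * h_neg))
    return wins / (h_pos.sum() * h_neg.sum())

rng = np.random.default_rng(0)
n_bins, n_chunks, chunk_size = 2000, 10, 50_000
edges = np.linspace(0.0, 1.0, n_bins + 1)        # scores assumed to lie in [0, 1]
h_pos = np.zeros(n_bins, dtype=np.int64)
h_neg = np.zeros(n_bins, dtype=np.int64)
all_y, all_s = [], []                            # kept only for the exact check below

for _ in range(n_chunks):                        # each pass mimics one partition
    x = rng.normal(size=chunk_size)
    s = 1.0 / (1.0 + np.exp(-(-2.0 + x)))        # synthetic PD-like score
    y = rng.binomial(1, s)
    hp, hn = class_histograms(y, s, edges)
    h_pos += hp                                  # O(B) state is all that is shared
    h_neg += hn
    all_y.append(y)
    all_s.append(s)

auc_hist = auc_from_histograms(h_pos, h_neg)

# Exact rank-based AUC on the pooled data, for comparison only.
y_all, s_all = np.concatenate(all_y), np.concatenate(all_s)
ranks = np.empty(s_all.size)
ranks[np.argsort(s_all)] = np.arange(1, s_all.size + 1)
m = int(y_all.sum())
n = y_all.size - m
auc_exact = (ranks[y_all == 1].sum() - m * (m + 1) / 2) / (m * n)
print(f"histogram AUC = {auc_hist:.6f}, exact AUC = {auc_exact:.6f}")
```

Only the two length-$B$ count vectors cross partition boundaries, so the reduction step is the same whether the chunks live in memory or on separate workers.
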
The histogram approximation matches sklearn to the fifth decimal place at $B=2000$ bins on 10M rows, and runs in a fraction of the sklearn time because it never materializes the full sort. In a true distributed setting, replace the in-memory chunks with Dask Bag or Spark RDD partitions and the logic is unchanged.

### Polars for joins, Dask for reductions

In a production pipeline the typical split is: Polars for data prep (joins, filtering, feature engineering), Dask or Spark for aggregations and statistical reductions, and then back to NumPy for the final metric computation. All three respect the same API shape, and the metrics implemented in this chapter can be plugged into any of them.

## Deployment hooks for metrics

Metrics are worthless without a governance layer that surfaces them. MLflow logs every metric at every evaluation step, together with the model artifact and the dataset fingerprint required by SR 11-7. A minimal wrapper:

A FastAPI endpoint that returns a calibrated probability, a decile score, and the decision under the current threshold is the minimum contract expected of a production scoring service.

@sec-ch34 on MLOps expands this into a full production deployment with ONNX export, canary deployment, and shadow scoring.

## Regulatory touchpoints

SR 11-7 requires model performance testing on "an ongoing basis" [@sr117]. In practice, that is read as: quarterly out-of-time validation, monthly PSI and CSI, and rolling AUC or KS on each monthly origination cohort once outcomes have matured. Model risk management teams want to see at least two metrics for each of discrimination and calibration, so AUC or KS plus Brier or a reliability diagram is the minimum.

Basel IRB implicitly requires calibration. Capital is a function of PD, and PD is calibrated only if Brier and reliability are tracked [@basel2006international; @basel2017finalising]. A model with strong AUC but drifting calibration understates or overstates capital.
Under IFRS 9 and CECL, the same logic applies to expected credit losses [@ifrs9; @cecl]. The loss allowance on a loan is a function of the calibrated PD, the loss-given-default, and the exposure at default; mis-calibration flows into reported net income.

The EU AI Act classifies consumer credit scoring as high risk and requires documentation of validation procedures and drift monitoring. The Demsar framework [@demsar2006statistical] for multi-classifier comparison appears in several model selection documents as the preferred way to demonstrate that an updated model beats the incumbent on multiple held-out windows.

Under GDPR Article 22, a right to meaningful information about automated decisions has been read to include an explanation of the score distribution in which the individual applicant sits; PSI and reliability diagrams feed this.

@mitchell2019model propose model cards as a unified document bundling the metrics, intended use, and ethical considerations. Several US banks now attach a model card as an appendix to their model development document under SR 11-7; it typically carries an AUC table, a KS table, a reliability diagram, a PSI trend line, and a fairness decomposition.

## Vietnam and emerging markets

### Market context

Evaluation metrics in Vietnam operate inside a specific supervisory and data context. The Credit Information Center (CIC) under the State Bank of Vietnam (SBV) and the private bureau PCB together cover around 50 to 55 percent of the adult population, so held-out evaluation samples drawn from recent cohorts are meaningfully smaller than a comparable US cut [@cic_vietnam2023; @worldbank_findex2021]. Mobile penetration above 140 percent and eKYC under Circular 16/2020/TT-NHNN mean the origination channel refreshes quickly; a cohort that is three quarters old already differs materially from the current applicant mix [@sbv_circular16_2020].
Personal-data handling under Decree 13/2023/ND-CP constrains how long raw features can be retained for back-testing, which feeds directly into how far back out-of-time evaluation samples can extend [@vn_decree13_2023]. Basel II under SBV Circular 41/2016 supplies the capital formula whose inputs the metrics in this chapter are quietly evaluating [@sbv_circular41_2016].

### Application considerations

The metric toolkit ports cleanly, with four concrete adjustments.

First, sample size bounds on AUC confidence intervals bind harder. A typical monthly cohort for a consumer-finance lender is 20,000 to 80,000 accounts with a bad rate of 3 to 8 percent, which yields a few thousand positives at most; DeLong intervals around AUC 0.72 can reach plus-or-minus 0.02 or more. The chapter's Friedman-Nemenyi machinery across multiple cohorts is therefore more valuable in Vietnam than in a US setting because it aggregates power across thin panels.

Second, PSI thresholds set on Western books (0.10 investigate, 0.25 retrain) are too loose for a Vietnamese portfolio that sees sharp seasonal shifts around Tet. A calendar-aware PSI, computed against a same-period-prior-year baseline rather than a rolling three-month baseline, is the pragmatic fix.

Third, profit-curve and EMP parameters have to be re-anchored. Vietnamese consumer-finance funding cost, regulated maximum interest rate under Circular 43/2016/TT-NHNN on consumer lending by finance companies, and realized LGD on unsecured personal loans differ from US credit-card norms, and the default Verbraken-style prior on LGD from @verbraken2014novel will mis-weight the Vietnamese profit curve if used unedited.

Fourth, calibration drift is the metric that moves capital.
Under Circular 41/2016 as amended by Circular 22/2023/TT-NHNN (29 Dec 2023) on capital adequacy ratios, PD miscalibration flows through the standardized or IRB capital formula [@sbv_circular22_2023]; a Brier-skill drop of a few tenths over a year is a capital signal, not just a modeling signal.

Reject-inference-driven bias matters more in Vietnam than the point estimates of AUC and KS suggest. Historical approval rules at most Vietnamese banks are heavily judgmental on SME and near-prime consumer segments, so the approved-only AUC overstates the discriminative power of the model in the full applicant pool. The chapter's statistics assume the evaluation sample is representative; in Vietnam, teams should report AUC and KS separately for the scored-through channel and for randomly-approved control cohorts where those exist.

### Rationalization

The full metric stack of this chapter is the right stack for Vietnam, with one small re-weighting. AUC and Gini remain the primary discrimination metrics because they are prior-invariant and therefore comparable across cohorts of different bad-rate mix, which is useful when Tet cohorts and off-Tet cohorts sit side by side. KS remains the regulator-facing number because it is what Vietnamese supervisors read, even though the academic case against KS in @hand2009measuring applies unchanged. Brier and reliability diagrams are more important in Vietnam than in a US setting because they drive the capital calculation under Circular 41. Profit curves and EMP are genuinely useful but need local parameters. The H-measure is under-used in local practice and is worth adding because the cost prior can be set explicitly to reflect Vietnamese consumer-finance economics. PSI and CSI are essential given the Tet seasonal regime and the mid-window regulatory shifts described in @sec-ch03.

Where simpler methods dominate: for most Vietnamese lenders below roughly one million active accounts, a weekly AUC, a monthly KS, a monthly PSI against a twelve-month-prior baseline, and a quarterly Brier on the calibrated PD cover the supervisory surface without requiring the DeLong or Friedman-Nemenyi apparatus. The multi-classifier comparison tools are worth building only when the team is running champion-challenger cycles at scale.

### Practical notes

Concrete practical notes for a Vietnamese scorecard team. Evaluation data should be drawn from the CIC performance-tape join, which provides a 90+ dpd flag consistent with the SBV default definition under Circular 41. PCB can supplement for lenders that subscribe, primarily to widen the feature evaluation base rather than the outcome tape. Reporting lines for validation metrics run to the SBV Banking Supervision Agency for commercial banks and to the SBV Department of Credit for licensed finance companies, with IFRS-9-style forward-looking validation increasingly expected alongside the domestic accounting-standard reports. An internal model-risk-management function built to the substance (if not the letter) of SR 11-7 is now industry practice at the top-tier Vietnamese banks, and the metric package in this chapter is the baseline deliverable for a quarterly model-performance review. Teams should budget for an annual re-estimation of cost-matrix parameters against realized LGD and funding cost, not a one-time calibration at model launch.

## Takeaways

- AUC measures ranking, KS measures the best operating point, Brier measures calibration, profit curves measure money. Any serious credit model reports all four.
- AUC is incoherent as a cost-weighted metric because the implicit weight depends on the classifier [@hand2009measuring]. Use the H-measure when you want a single scalar that respects a user-specified cost prior.
- Calibration is cheap to fix with Platt or isotonic post-processing and expensive to ignore.
Miscalibration translates one-for-one into mis-stated capital and reserves.
- EMP is the right objective for a credit book because it integrates the profit curve over uncertainty in loss-given-default [@verbraken2014novel]. Pick the prior, justify it, and report the number alongside AUC.
- PSI on the score and CSI on features are the monitoring workhorses. A 0.10 threshold triggers investigation and 0.25 triggers retraining in almost every bank.
- Walk-forward validation is the honest estimator of production performance. Shuffled k-fold should be used only when data are plainly iid, which a credit portfolio almost never is.
- For comparing classifiers, DeLong is the nonparametric test for correlated AUCs on one dataset and Friedman-Nemenyi is the rank-based answer across many datasets.

## Further reading

- @hand2009measuring: the original H-measure paper and the cleanest critique of AUC.
- @verbraken2014novel: development of the profit-based EMP measure for credit scoring.
- @gneiting2007strictly: the modern reference on strictly proper scoring rules.
- @niculescu2005predicting: comprehensive empirical study of probability calibration across classifiers.
- @demsar2006statistical: the canonical framework for statistical comparison of classifiers.
- @delong1988comparing: nonparametric variance for AUC differences, with the fast @sun2014fast variant for large samples.
- @lessmann2015benchmarking: the definitive benchmark of classifiers in credit scoring, a useful calibration for what to expect.
- @provost2001robust: ROC convex hull and the link between thresholds and cost ratios.
- @drummond2006cost: cost curves, a complement to ROC that surfaces the cost dependence directly.
- @bergmeir2012use: when time-series cross-validation is valid and when it is not.
- @gama2014survey: concept drift taxonomy and adaptation strategies, framing for PSI and CSI.
- @allen2014mergers and @allen2019search: structural estimates of search frictions and bargaining in negotiated mortgage prices, useful when ROI metrics need a market-equilibrium interpretation rather than a portfolio one.
- @crawford2018asymmetric: structural identification of adverse selection alongside imperfect competition; reframes "calibration" as a joint property of pricing and selection rather than the model alone.

================================================================================
# Source: chapters/05-regulation.qmd
================================================================================

# Regulatory and Legal Framework

**Scope: both retail and corporate.** SR 11-7 model risk and Basel IRB apply across portfolios. ECOA, FCRA, GDPR Article 22, and EU AI Act provisions on automated decisions are consumer-specific; ECOA Regulation B also covers small-business credit.

## Overview {.unnumbered}

A credit model is not a mathematical object that merely happens to sit inside a bank. It is a regulated object. Its inputs, training regime, internal parameters, calibration, monitoring, and every adverse decision it issues are bound by overlapping regimes: prudential (Basel, SR 11-7), consumer (ECOA, FCRA), data protection (GDPR), and sectoral AI law (the EU AI Act). A model that earns a higher AUC but cannot produce a lawful adverse action notice is a model a bank cannot deploy.

This chapter treats the regulatory framework as a set of constraints on the estimator. Each regime maps to precise artifacts: a Pillar I capital number, a reason code string on a notice, a record of an automated decision, a conformity dossier. The methods and code that produce those artifacts sit alongside the estimators that produce the probability of default. Treating them as separable is a common failure mode. We build them jointly.

Why spend an entire chapter on regulation before the first serious estimator? Two reasons. The first is that the constraints are binding.
A scorecard architect who does not know that Regulation B §1002.9(b)(2) forbids a generic "failed our internal screening" reason will build a pipeline that cannot be deployed. A modeler who does not know that Basel III §9 imposes an output floor will overestimate the marginal capital benefit of a sophisticated IRB model. A data scientist who does not know that Annex III §5(b) of Regulation (EU) 2024/1689 classifies credit scoring as high-risk will ship a model that requires a conformity assessment and a fundamental-rights impact assessment that have not been built. The failure modes are not statistical; they are legal and operational, and they crystallize the week before launch. The second is that the regulations shape what is measurable. The Basel IRB definition of default (90 days past due or unlikeliness to pay) is the dependent variable for most PD models at banks. The FCRA definition of a "consumer report" constrains which features enter the model at origination. The GDPR Article 22(3) right to contest means the pipeline must support human review. The EU AI Act Article 14 human oversight requirement means the model is not stand-alone; it is embedded in a workflow that a person can intervene in. Build the estimator without these constraints in mind, and the retrofit is expensive. The chapter has two halves. The first (@sec-ch05) walks through the Basel IRB capital formula, derives it from the Vasicek asymptotic single-risk-factor (ASRF) model, and implements it in NumPy. The second half covers the law and policy that govern a credit decision once PD is estimated. It includes the Equal Credit Opportunity Act (ECOA) and Regulation B (@sec-ch05-ecoa), the Fair Credit Reporting Act (FCRA) (@sec-ch05-fcra), GDPR Article 22 (@sec-ch05-gdpr), the EU AI Act classification of credit scoring as high-risk (@sec-ch05-euaia), and the U.S. model-risk supervisory guidance SR 11-7 and OCC 2011-12 (@sec-sr117). 
Adverse action notices, reason-code generation from logistic regression and gradient boosted trees (@sec-adverse-action), and a worked model card complete the chapter.

A word to the emerging-market reader. The Basel, ECOA, FCRA, GDPR, and EU AI Act anchors below are Anglo-American and European, but the substance transplants unevenly. A Vietnamese, Indonesian, Indian, or Nigerian lender operates under a local prudential regime (in Vietnam, SBV Circular 41/2016 for Basel II capital as amended by Circular 22/2023 on capital adequacy ratios, Circular 43/2016 for consumer lending by finance companies, Decree 94/2025 for the fintech sandbox) and a local data-protection regime (in Vietnam, Decree 13/2023 on personal data) that mirror the Western framework in substance while differing in scope, definitions of sensitive data, and adverse-action obligations. The architecture of the chapter (capital formula, reason codes, documentation artifacts) is the right architecture anywhere. The specific statutory triggers and the drafting of the reason-code strings are local and are where a cross-border lender has to invest.

One note on scope. The chapter is written from the perspective of a U.S. or EU regulated lender. Many jurisdictions have parallel structures: the UK PRA's SS3/18 on model risk management, the Monetary Authority of Singapore's FEAT principles, Guideline E-23 from Canada's Office of the Superintendent of Financial Institutions (OSFI), and the Australian Prudential Regulation Authority's CPG 235. These tend to converge on the same substance: IRB-style capital, effective challenge, adverse action or reason-for-decision notices, and an emerging AI-specific overlay. A practitioner in one of those jurisdictions should read the citations here and substitute the local equivalent.

### Notation {.unnumbered}

- $PD$: one-year probability of default for an obligor or facility, expressed as a real number in $[0,1]$.
- $LGD$: loss given default as a fraction of EAD, in $[0,1]$.
- $EAD$: exposure at default, in monetary units.
- $M$: effective maturity of the facility in years (IRB corporate).
- $R$ or $\rho$: asset value correlation.
- $\Phi$ and $\Phi^{-1}$: the standard normal CDF and its inverse.
- $K$: regulatory capital requirement per unit of EAD.
- $RWA$: risk-weighted assets.
- $\mathrm{MoC}$: margin of conservatism.

## Basel II and III IRB: PD, LGD, EAD, and the ASRF capital formula

The Internal Ratings Based (IRB) approach under Basel II and its Basel III revisions [@basel2006international; @basel2017finalising] lets a bank use its own estimates of risk parameters to compute regulatory capital. The parameters are $PD$, $LGD$, $EAD$, and (for non-retail exposures) $M$. The capital formula is not a regression fit to data; it is a closed-form consequence of the Vasicek [@vasicek2002loan] asymptotic single-risk-factor (ASRF) model, made portfolio-invariant by Gordy [@gordy2003risk].

### Formal definitions of the IRB parameters

Basel II (paragraphs 452 to 468 of the Comprehensive Version) defines $PD$ as the one-year probability that an obligor will default, conditional on survival to the start of the year. Default itself (paragraph 452) is triggered by either a 90-days-past-due event or an "unlikeliness to pay" assessment, whichever occurs first. Formally,

$$ PD_i = \Pr\!\left(D_i^{t+1} = 1 \mid \mathcal{F}_t \right), $$

where $D_i^{t+1}$ indicates default of obligor $i$ over the horizon $(t, t+1]$ and $\mathcal{F}_t$ is the information set at time $t$. IRB estimates must be long-run averages. Basel II paragraph 447 sets the PD floor for non-retail exposures at 3 basis points (3bps); Basel III raises the floor to 5 basis points [@basel2017finalising §36].

$LGD$ is the facility-level economic loss conditional on default:

$$ LGD_i = \mathbb{E}\!\left[ 1 - \frac{\text{discounted net recoveries}_i}{\text{EAD}_i} \,\Big|\, D_i = 1 \right]. $$

Economic loss includes direct workout costs, indirect costs, and a discount rate that reflects funding and risk.
Basel III adds input floors on bank-estimated LGD that vary by exposure class and collateral type; the EBA operationalizes the estimation steps in @eba2017gl.

$EAD$ is the expected exposure at the moment of default. For a facility with an undrawn commitment, $EAD$ equals the drawn amount plus the undrawn commitment multiplied by a supervisor-set or bank-estimated credit conversion factor (CCF):

$$ EAD_i = \text{Drawn}_i + CCF_i \cdot \text{Undrawn}_i . $$

The effective maturity $M$ for corporate, sovereign, and bank exposures is the cash-flow-weighted average:

$$ M = \frac{\sum_t t \cdot CF_t}{\sum_t CF_t},\qquad 1 \le M \le 5 \text{ years}. $$

Retail IRB does not use $M$. Retail exposures are assumed short-term and not subject to maturity mismatch charges.

Retail IRB splits into three sub-segments: (i) residential mortgages, (ii) qualifying revolving retail exposures (QRRE, principally credit cards and similar revolving lines), and (iii) "other retail" (auto loans, personal loans, small business loans below the retail threshold). Each sub-segment uses a different asset-value correlation function. The three retail functions are the consequence of Basel II's empirical calibration against observed default correlations; corporate exposures, by contrast, use a PD-dependent correlation that ranges from 0.12 to 0.24.

#### The default definition in practice

Paragraph 452 of Basel II defines default as occurring when at least one of two events has taken place:

1. The bank considers that the obligor is unlikely to pay its credit obligations in full, without recourse to actions such as realizing security.
2. The obligor is past due more than 90 days on any material credit obligation.

The "unlikeliness to pay" (UTP) leg is qualitative and leaves room for supervisory disagreement. Basel II Annex 7 lists indicators: restructuring with economic loss, distressed sale of assets, payment holidays to prevent arrears, bankruptcy filing, specific provisions booked.
The EBA guidelines on the application of the default definition (EBA/GL/2016/07) harmonize these indicators across EU banks and introduce a two-part materiality threshold: an absolute component (100 EUR retail, 500 EUR non-retail) and a relative component (1% of on-balance-sheet exposure).

Counting days past due seems mechanical but is not. The clock starts the day the obligation becomes due and unpaid; it resets only once the arrears are cured. Technical past-due items (e.g., a payment held in suspense due to processing error, or a disputed charge under FCRA) do not start the clock. The default status must persist for a minimum probation period after the cure (EBA: at least three months in general, at least one year for distressed restructurings) before the obligor can be re-classified as performing. Data pipelines that miss the probation requirement tend to underestimate long-run PDs.

#### LGD: the work beyond the mean

Equation @eq-lgd-def hides considerable operational complexity. The discount rate must reflect the risk of the recovery cash flows, not the risk-free rate. A common practice is to use the original contract rate plus a risk premium; some jurisdictions require the risk-adjusted rate from the bank's internal funds transfer pricing. Workout costs include the salary of the collections staff allocated to the facility, legal fees, and indirect overhead. Indirect costs are typically the hardest to pin down; EBA's 2017 guidelines require that they be included, estimated as a percentage of direct costs if no better measure exists.

Realized LGDs on retail loans are often bimodal: a large mass near zero (obligors who cure or repay in full, often under hardship programs) and a second mass near one (facilities that charge off with little or no recovery). Bastos [@bastos2010forecasting] documents this for bank loans; Calabrese and Zenga [@calabrese2014fractional] for Italian consumer loans.
A beta regression is a defensible default if the modeler accepts that the mean LGD is a poor summary of the recovery distribution. For downturn LGD the tail of the distribution matters more than the mean, because downturn conditions shift mass from the "recovered" mode to the "charge-off" mode. #### EAD and off-balance-sheet exposures For revolving lines, equation @eq-ead-def requires estimating $CCF$ for the undrawn commitment. A CCF of 50% on an undrawn credit card balance means the bank expects half of the available headroom to be drawn between the reporting date and default. For non-retail exposures Basel II provides supervisor-set CCFs (paragraph 311): 75% for commitments with an original maturity over one year, 20% for short-term trade-related contingencies. For advanced IRB retail and non-retail exposures the bank estimates its own CCF or EAD conversion factor. The Basel III revision [@basel2017finalising §31] removes CCF estimation for retail revolving exposures under the advanced IRB approach and replaces it with supervisor-set numbers for some facilities. This is part of the broader Basel III narrowing of advanced IRB scope; the framework's authors judged that banks' CCF estimates were too optimistic. ### The ASRF model and the capital formula The Vasicek single-factor structural model takes obligor $i$'s standardized asset return as $$ A_i = \sqrt{\rho} Y + \sqrt{1 - \rho} \varepsilon_i,\qquad Y,\varepsilon_i \sim \mathcal{N}(0,1) \text{ i.i.d.} $$ The obligor defaults when $A_i$ falls below a threshold $c_i = \Phi^{-1}(PD_i)$. Conditional on the systematic factor $Y = y$, the default probability is $$ p_i(y) = \Phi\!\left(\frac{\Phi^{-1}(PD_i) - \sqrt{\rho} y}{\sqrt{1 - \rho}}\right). $$ Gordy [@gordy2003risk] shows that in an infinitely fine-grained, single-factor portfolio the 99.9% VaR of loss is attained by fixing $Y$ at the one-sided 0.1% quantile, $y = -\Phi^{-1}(0.999) = \Phi^{-1}(0.001)$. 
Substituting, $$ p_i^{\text{worst}} = \Phi\!\left(\frac{\Phi^{-1}(PD_i) + \sqrt{\rho} \Phi^{-1}(0.999)}{\sqrt{1 - \rho}}\right). $$ The unexpected loss per unit of $EAD$, on which IRB capital is charged, is $LGD \cdot (p_i^{\text{worst}} - PD_i)$. For corporate exposures Basel II introduces a maturity adjustment that inflates the charge with $M > 1$: $$ b(PD) = \bigl(0.11852 - 0.05478 \ln PD\bigr)^2, $$ $$ MA(PD, M) = \frac{1 + (M - 2.5) b(PD)}{1 - 1.5 b(PD)}. $$ The Basel II asset value correlation for corporate, sovereign, and bank exposures is $$ \rho_{\text{corp}}(PD) = 0.12 \cdot \frac{1 - e^{-50 PD}}{1 - e^{-50}} + 0.24 \cdot \left(1 - \frac{1 - e^{-50 PD}}{1 - e^{-50}}\right). $$ For residential mortgages Basel uses a flat $\rho = 0.15$. For qualifying revolving retail exposures (QRRE, typically credit cards) $\rho = 0.04$. For "other retail" the formula mirrors corporate with a decay constant of 35: $$ \rho_{\text{other retail}}(PD) = 0.03 \cdot \frac{1 - e^{-35 PD}}{1 - e^{-35}} + 0.16 \cdot \left(1 - \frac{1 - e^{-35 PD}}{1 - e^{-35}}\right). $$ The IRB capital requirement per unit of EAD is then $$ K(PD, LGD, M) = \left[ LGD \cdot \Phi\!\left(\frac{\Phi^{-1}(PD) + \sqrt{\rho}\, \Phi^{-1}(0.999)}{\sqrt{1 - \rho}}\right) - LGD \cdot PD \right] \cdot MA(PD, M). $$ Risk-weighted assets are $RWA = K \cdot 12.5 \cdot EAD$, with the $12.5 = 1/0.08$ factor embedding the 8% Basel total-capital ratio. The @bcbs128 explanatory note derives each element of this formula from the Vasicek model. Three properties of the formula deserve attention. **Portfolio invariance**. Gordy's key theoretical contribution [@gordy2003risk] is that in the infinitely fine-grained limit the 99.9% VaR is a sum of contributions, each of which depends only on the obligor's own parameters ($PD_i$, $LGD_i$, $M_i$, $EAD_i$) and the systematic factor. No cross-obligor interaction term survives. This is what lets Basel set capital per facility rather than per portfolio. 
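
The correlation functions, maturity adjustment, and capital formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the book's reference implementation: it uses only the standard library (`statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$), and the function names (`corr_corporate`, `corr_other_retail`, `maturity_adjustment`, `irb_capital`) are hypothetical choices made for this sketch. Parameter floors and the output floor are deliberately left out.

```python
import math
from statistics import NormalDist

_N = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^{-1}


def corr_corporate(pd_):
    """Asset correlation for corporate, sovereign, and bank exposures."""
    w = (1 - math.exp(-50 * pd_)) / (1 - math.exp(-50))
    return 0.12 * w + 0.24 * (1 - w)


def corr_other_retail(pd_):
    """Asset correlation for 'other retail' exposures (decay constant 35)."""
    w = (1 - math.exp(-35 * pd_)) / (1 - math.exp(-35))
    return 0.03 * w + 0.16 * (1 - w)


def maturity_adjustment(pd_, m):
    """Maturity adjustment MA(PD, M); equals 1 when M = 1."""
    b = (0.11852 - 0.05478 * math.log(pd_)) ** 2
    return (1 + (m - 2.5) * b) / (1 - 1.5 * b)


def irb_capital(pd_, lgd, rho, m=None):
    """Capital K per unit of EAD. Pass m=None for retail (no maturity adjustment)."""
    worst = _N.cdf(
        (_N.inv_cdf(pd_) + math.sqrt(rho) * _N.inv_cdf(0.999)) / math.sqrt(1 - rho)
    )
    k = lgd * (worst - pd_)  # unexpected loss: stressed loss minus expected loss
    return k * maturity_adjustment(pd_, m) if m is not None else k


def rwa(k, ead):
    """Risk-weighted assets: 12.5 = 1 / 0.08."""
    return 12.5 * k * ead


# Illustrative inputs: PD = 1%, LGD = 45%, M = 2.5 years for the corporate case.
pd_, lgd = 0.01, 0.45
k_corp = irb_capital(pd_, lgd, corr_corporate(pd_), m=2.5)
k_other = irb_capital(pd_, lgd, corr_other_retail(pd_))
k_qrre = irb_capital(pd_, lgd, 0.04)   # flat QRRE correlation
k_mort = irb_capital(pd_, lgd, 0.15)   # flat mortgage correlation
```

With these inputs the sketch gives roughly 7.4% of $EAD$ for the corporate exposure, 3.7% for other retail, and 1.4% for QRRE. Keeping the correlation as an explicit argument makes it easy to see that the segment differences come entirely from $\rho$ and the maturity adjustment, not from the functional form.
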
The trade-off of portfolio invariance is that idiosyncratic concentration risk, sectoral concentration risk, and double default risk are lost; they re-enter through Pillar II add-ons.

**Inelasticity at the extremes**. Because $\rho$ is a convex combination of two constants, with the PD-dependent weight $w(PD) = (1 - e^{-50 PD})/(1 - e^{-50})$ on the lower constant, the correlation approaches $0.24$ as $PD \to 0$ and $0.12$ as $PD \to 1$ for corporate exposures. In the other-retail formula the analogous limits are 0.16 and 0.03. The effect is that low-$PD$ obligors have higher correlation and therefore disproportionately higher capital per unit of expected loss. The Basel committee's rationale is that a small shock to a highly-rated obligor (a downgrade that moves $PD$ from 10bps to 100bps) is likely to be systemic; obligors already rated as high-risk have default probabilities driven more by idiosyncratic stress.

**No cycle dependence in the formula itself**. The IRB formula takes $PD$ as given; the cycle dependence enters through the bank's choice of rating philosophy. A "through-the-cycle" (TTC) PD is designed to be stable across the business cycle; a "point-in-time" (PIT) PD reflects current economic conditions and moves with the cycle. A TTC PD plugged into the IRB formula yields stable capital charges; a PIT PD yields capital that rises in recessions. The Basel framework permits either, but supervisors scrutinize the stability of capital under stress. In practice many banks use a hybrid rating philosophy, and the rating philosophy must be disclosed and documented under SR 11-7.

### Implementation from scratch and retail vs corporate comparison

Three practical takeaways from @fig-irb-capital. The corporate curve lies well above the retail curves at low $PD$, because a corporate exposure is assumed more correlated with a single systematic factor ($\rho \in [0.12, 0.24]$) than a retail obligor ($\rho \in [0.03, 0.16]$).
The QRRE curve is the flattest because $\rho = 0.04$ is the lowest fixed correlation in the framework; credit card portfolios diversify systemic risk. The mortgage curve's steepness at small $PD$ follows from a flat but higher correlation $\rho = 0.15$ combined with the steep slope of $\Phi^{-1}$ near zero.

Table @tbl-irb reports the capital numbers across representative PDs. At $PD = 1\%$, $LGD = 45\%$, and $M = 2.5$ the IRB capital requirement for a corporate exposure is about 7.4% of $EAD$; an "other retail" exposure is about 3.7%; a QRRE (credit card) exposure is about 1.4%. This is not an approximation; it is what Pillar I demands. Bank holding companies under Collins Amendment floors and the Basel III output floor of 72.5% [@basel2017finalising §9] must also compute the standardized charge, and a bank can use the IRB number only to the extent that it does not drop below the floor multiplied by the standardized number.

### Margin of conservatism

Basel III [@basel2017finalising §32.12] and the EBA PD/LGD guidelines [@eba2017gl] require that risk parameter estimates include a *margin of conservatism* (MoC) to compensate for identified weaknesses. The EBA framework decomposes MoC into three categories:

- **Category A**: data and methodological deficiencies. Missing data periods, small portfolio subsegments, rating philosophy drift.
- **Category B**: model changes and changes in regulatory definition. A new default definition, a restructuring of the rating system, or a change in reporting segment.
- **Category C**: general estimation error. Quantifiable statistical uncertainty in the estimators, including finite-sample bias.

A common operationalization sums the three components, each floored at zero:

$$ PD^{\text{applied}} = PD^{\text{best}} + \mathrm{MoC}_A + \mathrm{MoC}_B + \mathrm{MoC}_C.
$$ Category C is often estimated by a bootstrap of the PD calibration sample: compute the PD point estimate on each resample, take the upper one-sided confidence bound at 75% or 90%, and subtract the point estimate. Categories A and B are supervisory judgment anchored in documented data issues. The MoC applies at the grade or pool level, not at the obligor level, because IRB capital is computed on calibrated grade averages, not raw model output. A worked example clarifies the bootstrap for Category C. Suppose a rating grade has 400 observations over a 10-year window, with 12 defaults. The point estimate of the long-run PD is $12/400 = 3\%$. A non-parametric bootstrap with 10,000 resamples on the calibration window yields a one-sided 90% upper confidence bound of, say, 4.2%. The Category C MoC is then $4.2\% - 3.0\% = 1.2\%$. The applied PD for the grade is $3.0\% + \mathrm{MoC}_A + \mathrm{MoC}_B + 1.2\%$. The cross-resample variation captures statistical noise but does not capture model misspecification; Category A components do that. There is a temptation, in conservative model development, to double-count MoC. A modeler who holds out a stressed validation period, fits the PD there, and takes the stressed PD as the long-run value is effectively adding a cycle-based conservatism to the point estimate. If the Category B MoC then also adds for the same cycle risk, the final PD is over-conservative. The EBA guidelines are explicit: the MoC components must be distinct and non-overlapping. Supervisory review checks for both under- and over-conservatism. A persistently excessive MoC triggers questions about the underlying model's quality. ### LGD downturn LGD must reflect "economic downturn" conditions [@basel2006international §468; @eba2019downturn]. The EBA 2019 guidelines define a downturn using two steps: identify a downturn period from macro variables (typically GDP, unemployment, and default rate cycles), then compute the LGD that would obtain under that period. 
The applied LGD is the maximum of the long-run average LGD, the downturn LGD estimated from historical data, and a downturn LGD estimated via a macroeconomic mapping if downturn data are scarce: $$ LGD^{\text{applied}} = \max\!\left( LGD^{\text{long-run}}, LGD^{\text{dt, historical}}, LGD^{\text{dt, estimated}} \right) + \mathrm{MoC}_{LGD}. $$ Calabrese [@calabrese2014downturn] shows that mixture distributions for recoveries fit downturn tails better than beta regressions. Bastos [@bastos2010forecasting] documents that secured retail recoveries are bimodal and state-dependent, so a naive long-run mean understates downturn losses. Practitioners typically estimate an additive or multiplicative downturn add-on on top of the long-run LGD; the additive version is easier to reconcile to reference data, the multiplicative version scales more realistically with LGD level. #### How the downturn period is identified The EBA 2019 guidelines detail the identification procedure. The bank selects a set of economic indicators relevant to the loss drivers of the portfolio: GDP growth, unemployment, the bank's own default rate, and a portfolio-specific indicator such as house prices for mortgages or car prices for auto loans. For each indicator the bank identifies the trough over the reference period of at least 20 years (or the longest available series for newer portfolios). The union of the troughs defines the downturn period. If the reference period is shorter than 20 years the MoC compensates for the shortfall. A mortgage portfolio in the United States faces a natural reference period: 2007 to 2011, when the combined collapse of house prices, rise in unemployment, and surge in defaults produced the worst retail credit losses in post-war data. A mortgage LGD model calibrated on the 2001 to 2023 period must include this window and typically assigns the downturn LGD to it. 
A corporate LGD model faces a more diffuse set of candidates: 2001 (dot-com and Enron-era restructurings), 2008 to 2009 (general distress), 2020 (COVID, partially offset by government support for corporates). The bank must justify its chosen reference period with quantitative evidence and obtain supervisory approval.

#### The LGD floor

Basel III introduces LGD input floors for bank-estimated parameters, documented in the Basel III finalization paper and implemented through jurisdictional rulebooks (for example, Regulation (EU) 2024/1623, known as CRR3, in the European Union; in the United States the corresponding step so far is the interagency Basel III Endgame notice of proposed rulemaking of July 2023, which is a proposal and not a final rule). For retail residential mortgages the floor is 5%; for unsecured QRRE it is 50% and for other unsecured retail 30%; for corporate exposures the floor is 25% on unsecured claims. The floors are calibrated to prevent banks from using implausibly low LGDs and should be applied at the exposure level before the EAD weighting.

The combination of MoC, downturn LGD, and the LGD floor can produce an applied LGD that is substantially above the observed average recovery. This is by design. The Basel framework's premise is that capital requirements must be robust to stress, and Pillar I LGD is not a best estimate; it is a conservative long-run downturn estimate.

### Where IRB sits in the rest of the chapter

The IRB parameters map onto every downstream artifact. The PD model feeds @sec-adverse-action reason codes. The IRB rating system triggers the @sec-sr117 model risk controls on development, validation, and ongoing monitoring. The LGD downturn methodology is, in regulatory view, another "model" with its own validation. Basel III introduces output floors that limit the benefit of sophisticated estimators; this is one reason a bank cannot simply deploy a deep learning PD model and use its number directly for Pillar I capital.
The EBA discussion paper on machine learning for IRB [@eba2020mlrr] enumerates the obstacles: lack of interpretability, lack of stability, and incompatibility with the rating philosophy. ## ECOA and Regulation B The Equal Credit Opportunity Act (ECOA) of 1974 [@ecoa1974] prohibits credit discrimination. The implementing regulation, Regulation B at 12 CFR Part 1002 [@regb1002], is administered by the Consumer Financial Protection Bureau (CFPB). Regulation B binds any "creditor" that "regularly participates in a credit decision, including setting the terms of the credit." This is broad. It covers banks, credit unions, fintech lenders, merchant lenders, and any algorithm-driven underwriter that touches a U.S. consumer or small business credit application. ### Prohibited bases Section 1002.2(z) lists the prohibited bases: - race, - color, - religion, - national origin, - sex (including sexual orientation and gender identity, per CFPB interpretive guidance), - marital status, - age (provided the applicant has the capacity to contract), - receipt of income from any public assistance program, - exercise in good faith of a right under the Consumer Credit Protection Act. ECOA forbids any credit decision that is based on a prohibited basis. Regulation B operationalizes this through two distinct legal theories: **disparate treatment** and **disparate impact (effects test)**. ### Disparate treatment vs effects test **Disparate treatment** is the use of a prohibited basis, or a deliberate proxy for one, as a decision input. Demonstrating disparate treatment requires evidence that the creditor considered the protected attribute. Intentional use is the classic form; "facial" disparate treatment includes using a protected attribute as a feature. Under 12 CFR 1002.6(b)(1), a creditor shall not consider a prohibited basis in any aspect of a credit transaction. 
There are narrow exceptions: a creditor may inquire about age to verify contractual capacity, may inquire about marital status in community-property states, and must collect monitoring information for Regulation B §1002.13 (for home-secured credit) and HMDA reporting. **Disparate impact** (effects test) applies even absent intent. Regulation B §1002.6(a) adopts the effects test standard articulated in *Griggs v. Duke Power Co.*: a facially neutral policy that has a disproportionate adverse impact on a prohibited class is unlawful unless justified by business necessity, and even then the claimant can prevail by showing a less discriminatory alternative. HUD's parallel standard for the Fair Housing Act [@hud2013disparate] formalizes the three-step burden-shifting framework: 1. the plaintiff shows a facially neutral practice causes a disparate impact on a protected class, 2. the defendant shows the practice is necessary to achieve a substantial, legitimate, nondiscriminatory business interest, 3. the plaintiff shows the interest can be served by a less discriminatory alternative. For credit models, the operational question is whether a feature, or the model as a whole, causes disparate impact. This is where the four-fifths rule (selection rate for a protected group below 80% of the reference group's rate) and statistical tests such as the adverse-impact ratio enter practice. But Regulation B's text anchors the standard in judicial doctrine, not in a bright-line statistical test. Bartlett et al. [@bartlett2022consumer] show that algorithmic pricing in fintech mortgage platforms reduces but does not eliminate disparities relative to face-to-face lending. Howell et al. [@howell2024lender] demonstrate that increased lender automation expands minority credit access by removing discretionary loan officer bias, a mirror-image finding. Both papers make the point that an automated model can reduce disparate treatment while still producing disparate impact. 
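
The four-fifths screen mentioned above is straightforward to compute once approval counts by group are available. The following is a minimal sketch under stated assumptions: the counts are hypothetical, group membership is taken as known (or imputed), and the pooled two-proportion z-test is shown only as one common companion statistic. Neither the 80% threshold nor the z-test is a legal standard under Regulation B; both are screening tools.

```python
import math
from statistics import NormalDist


def adverse_impact_ratio(approved_prot, n_prot, approved_ref, n_ref):
    """Selection rate of the protected group divided by that of the reference group."""
    return (approved_prot / n_prot) / (approved_ref / n_ref)


def two_proportion_z(approved_prot, n_prot, approved_ref, n_ref):
    """Pooled two-proportion z statistic and its two-sided p-value."""
    p_prot = approved_prot / n_prot
    p_ref = approved_ref / n_ref
    pooled = (approved_prot + approved_ref) / (n_prot + n_ref)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_prot + 1 / n_ref))
    z = (p_prot - p_ref) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 54 of 100 protected-group applicants approved,
# 75 of 100 reference-group applicants approved.
air = adverse_impact_ratio(54, 100, 75, 100)
z, p_value = two_proportion_z(54, 100, 75, 100)
below_four_fifths = air < 0.80  # flags the model for further review, nothing more
```

Here the ratio is 0.72, below the 0.80 screen, and the difference in approval rates is statistically distinguishable from zero at conventional levels. A flag of this kind opens the three-step analysis described above; it does not settle it.
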
#### Proxies and the effects test A recurring question in fair-lending enforcement is whether a feature operates as a proxy for a prohibited basis. ZIP code is the archetypal example: it is not a protected attribute, but it correlates with race. If a model uses ZIP code and the ZIP-code coefficient produces an adverse impact on a racial group, a plaintiff can argue disparate impact. The defendant's burden under step 2 of the effects test is to show business necessity, typically through an econometric argument that ZIP code carries predictive information beyond what is captured in bureau data and personal financials. The plaintiff's step 3 burden is then to propose a less discriminatory alternative, such as restricting the model to non-ZIP features at the cost of some predictive power. @barocas2016big discuss the general problem that any sufficiently rich model will pick up features that are proxies for protected attributes, even when the modeler intends neutrality. This is the core of the "disparate impact" theory. The empirical literature [@bhutta2021how; @bartlett2022consumer; @dobbie2021measuring] provides quantitative estimates of disparity under various modeling regimes. #### Operational controls A compliant fair-lending program typically includes: - a documented list of prohibited bases and their operationalization in the bank's data, - a disparate-impact test run on every new model before deployment, at each material change, and on a defined monitoring cadence, - a documented "less discriminatory alternative" analysis that evaluates candidate alternative models or feature sets and records the selection criteria, - a governance owner in the second line of defense (compliance or a dedicated fair-lending team) with authority to block deployment, - a periodic audit by the third line of defense (internal audit). The fair-lending analysis draws on @sec-ch23 and @sec-ch24 of this book. Here we only fix the legal framing; the statistical apparatus comes later. 
#### Applicant characteristic inference (BISG) Regulation B §1002.5(b) prohibits creditors from asking about race in most credit transactions (with exceptions for HMDA-reportable home loans), so fair-lending analysts typically do not have the protected attribute on the application file. For fair-lending testing they use the Bayesian Improved Surname Geocoding (BISG) method, originally developed by the RAND Corporation and adopted by the CFPB. BISG combines a Bayesian prior from the 2010 U.S. Census surname distribution with a geographic update from the Census block-group race distribution. It produces a probability that an applicant belongs to each racial group. Fair-lending tests then weight the outcomes by the BISG probabilities. BISG has known flaws. It performs poorly on mixed-race applicants and on minority groups outside the surname database. The CFPB's 2014 Proxy Methodology White Paper acknowledges these limits. For ECOA enforcement, BISG-derived disparities are probative but not dispositive; the Bureau looks for convergent evidence. ### Adverse action notice requirements (Reg B §1002.9) An adverse action under ECOA is, per §1002.2(c), "a refusal to grant credit in substantially the amount or on substantially the terms requested" or "a termination of an account or an unfavorable change in the terms of an account." If the creditor takes adverse action, §1002.9 [@regb10029] imposes: 1. **Notice within 30 days** of receiving a completed application. For accounts already existing, the notice must be provided within 30 days of the action. 2. **Content**: a statement of the action taken; the name and address of the creditor; the ECOA notice text (§1002.9(b)(1)); a statement of the specific reasons for the adverse action, or a statement that the applicant has the right to request the specific reasons within 60 days and the address to which the request must be sent. 3. **Specific reasons must be specific**. 
§1002.9(b)(2) provides that the statement of reasons "must be specific and indicate the principal reason(s) for the adverse action." A statement that the adverse action was based on the creditor's internal standards or policies, or that the applicant failed to achieve a qualifying score, is insufficient.

The CFPB has issued two recent circulars clarifying how §1002.9 applies to algorithmic models. Circular 2022-03 [@cfpbecoa2022] states that ECOA's adverse action requirements apply even when a creditor relies on a complex algorithm, such as one incorporating machine learning, that operates as a "black box." A creditor that cannot accurately identify the principal reasons for the adverse action cannot use that algorithm to deny credit. Circular 2023-03 [@cfpbsection1033] reiterates that the official sample form is not a safe harbor for overly generic reasons; the creditor must tailor reasons to the actual basis of the decision.

The implication for this book is concrete: if a lender uses XGBoost, LightGBM, or a deep neural network to score applicants, the lender must also deploy a mechanism that extracts a specific, principal-reason adverse action notice for every denial. @sec-adverse-action derives such mechanisms.

#### "Principal reasons" in practice

How many reasons count as "specific"? Regulation B §1002.9(b)(2) and Appendix C to Regulation B do not fix a number, but industry practice is four reasons on the standard adverse action notice, matching the FCRA §615(a) disclosure of "key factors" on a credit score. The four reasons are not arbitrary. They represent the four factors with the largest adverse contribution to the score, in rank order. A lender that reports four reasons but has ten features contributing materially must have a documented rule for the selection.
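One way to make such a selection rule explicit is to encode it as a small deterministic function. The sketch below is an illustration, not a prescribed method: it assumes per-feature contributions are already available on a common scale (for example, logit contributions), and the function name, the alphabetical tie-break, and the example values are all hypothetical.

```python
from typing import Dict, List, Tuple


def principal_reasons(
    contributions: Dict[str, float],
    max_reasons: int = 4,
    min_contribution: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return up to `max_reasons` features with the largest adverse contribution.

    Rule encoded here (an assumption, to be documented by the lender):
      1. keep only features whose contribution exceeds `min_contribution`,
      2. rank by contribution, largest first,
      3. break ties alphabetically by feature name so the output is reproducible.
    """
    adverse = [
        (name, value)
        for name, value in contributions.items()
        if value > min_contribution
    ]
    adverse.sort(key=lambda item: (-item[1], item[0]))
    return adverse[:max_reasons]


# Hypothetical contributions for one denied applicant (positive = adverse).
applicant = {
    "amount": 0.42,
    "duration": 0.42,
    "status": 0.90,
    "age": -0.30,
    "savings": 0.10,
    "purpose": 0.05,
    "housing": 0.0,
}

reasons = principal_reasons(applicant)
for name, value in reasons:
    print(f"{name}: {value:+.2f}")
```

Features with zero or favorable contributions are excluded, so an applicant with fewer than four adverse factors receives fewer than four reasons.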
The Bureau's sample adverse action notices (Appendix C to Regulation B) list common reasons: credit application incomplete, temporary or irregular employment, insufficient credit references, income insufficient for amount of credit requested, length of residence, number of recent inquiries on credit bureau report, and so on. A lender can use the sample reasons verbatim or tailor them. Tailored reasons must still be specific: "your income was below the threshold we use for this product" is specific; "you did not meet our standards" is not.

#### Adverse action on counteroffers and pricing

An adverse action is not only a denial. §1002.2(c) covers a refusal to grant credit in substantially the amount or on substantially the terms requested. A pricing tier that is higher than the requested rate, a credit limit that is lower than requested, or a term that is shorter than requested can all trigger the notice obligation if the gap is "substantial." In practice, risk-based pricing that places an applicant into a tier other than the prime tier may trigger a §1002.9 notice or, alternatively, a risk-based pricing notice under FCRA §615(h).

The FCRA risk-based pricing notice is a parallel, narrower obligation. If a creditor grants credit on terms materially less favorable than the most favorable terms available to a substantial proportion of consumers, and the determination was based in whole or in part on a consumer report, the creditor must provide the risk-based pricing notice. A lender can often choose between the two regimes (the ECOA notice or the FCRA notice) but typically defaults to the more stringent ECOA notice to avoid compliance error.

## FCRA: credit bureau regulation and dispute rights

The Fair Credit Reporting Act of 1970 [@fcra1970] governs "consumer reporting agencies" (CRAs, the credit bureaus) and "users" of consumer reports. The statute is codified at 15 U.S.C. §§ 1681 et seq. Four provisions are central for credit modeling.
**Permissible purposes (§1681b)**. A consumer report may be obtained only for a permissible purpose: in connection with a credit transaction, an employment decision, insurance underwriting, legitimate business need, a court order, or with the consumer's written instructions. A model pipeline that pulls bureau data for a population not covered by a permissible purpose is unlawful regardless of the downstream use. **Adverse action triggers and disclosure (§1681m)**. If a user takes adverse action "based in whole or in part on any information contained in a consumer report," the user must provide the consumer a notice with the name, address, and telephone number of the CRA that furnished the report; a statement that the CRA did not make the decision and is not able to provide specific reasons; notice of the consumer's right to a free copy of the report; and notice of the right to dispute inaccuracies. §615(a) also requires disclosure of the numerical credit score used, the range of possible scores, and the key factors that adversely affected the score. This is the origin of the term "reason codes": each bureau score (FICO, VantageScore) is accompanied by four reason codes that identify the main factors pushing the score downward. **Accuracy and dispute rights (§1681i, §1681s-2)**. A consumer may dispute the accuracy or completeness of any item in their file. On dispute, the CRA must conduct a reasonable investigation within 30 days, and furnishers (creditors who reported the information) must themselves investigate and correct if warranted. This is not a cosmetic right; the statute creates a private right of action with actual and punitive damages. **Pre-screening (§1681b(c))**. A creditor may use bureau data for pre-approved credit offers subject to firm offer of credit requirements and opt-out mechanisms. Two FCRA items constrain modeling practice directly. First, a model that uses bureau information as inputs is, for §1681m purposes, treated as using the report. 
Second, many features commonly used in credit scoring (trade-line age, utilization, number of recent inquiries) must be traceable back to a bureau record because the adverse action notice must identify bureau-sourced factors among the "key factors." #### Alternative data and FCRA A growing share of lenders use alternative data: cashflow from bank-account aggregation, rent payments, utilities, telecom, and in some cases behavioral signals such as device fingerprints or browsing history. The FCRA's reach depends on whether the data aggregator is a "consumer reporting agency," defined at §1681a(f) as any person who, for monetary fees, dues, or on a cooperative nonprofit basis, regularly engages in whole or in part in the practice of assembling or evaluating consumer credit information or other information on consumers for the purpose of furnishing consumer reports to third parties. Many bank-account aggregators (Plaid, MX, Finicity) assert that they are not CRAs because the consumer initiates the data-sharing and directs the aggregator to transmit the data to the lender. The CFPB and state regulators have scrutinized this position; under Dodd-Frank Section 1033 and the CFPB's 2024 Personal Financial Data Rights Rule (codifying consumer access to financial data), the regulatory boundary is shifting. The operational point for modelers is simple: before including a feature in a production model, document the source, the permissible purpose on which it was obtained, and whether the source is a CRA. If the source is a CRA, the FCRA §615(a) disclosure of key factors must reach through to that source. #### Dispute pipelines and retraining A borrower who disputes an item in their credit report and prevails forces the bureau to correct the record. A model trained on stale bureau data will embed the uncorrected item until retraining. 
Regulatory practice tolerates a retraining cadence (quarterly for most bureau-driven models), but it does not tolerate systematic use of known-inaccurate data. A model that scored an applicant on an item that was subsequently disputed and corrected must, on re-application, use the corrected item. This forces a dependency: the bureau pull at application time must use the current file. #### FCRA and adverse action from pure bureau scores For a pure bureau-score decision (e.g., a credit card cross-sell that uses only the applicant's FICO score), §615(a) requires the creditor to disclose the numerical score, the range of possible scores, the date, the name of the scoring entity, and up to four key factors that adversely affected the score. The four key factors are produced by the scoring entity (FICO, VantageScore) at the time the score is pulled and are included in the credit bureau response. The creditor does not have to re-derive them; the creditor just has to include them in the notice. For a proprietary model that uses bureau inputs alongside internal data, the creditor must derive its own principal reasons from its own model. The bureau-provided "key factors" are not sufficient, because they reflect the bureau score, not the creditor's model. ## GDPR Article 22 and automated decision-making The General Data Protection Regulation [@gdpr2016] applies to processing of personal data of data subjects in the European Union. Credit scoring of EU residents is in scope even when the controller is established outside the EU, per Article 3(2). Article 22 is the critical provision for automated credit decisions. ### The text of Article 22 Article 22(1) provides a qualified right: > The data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her. 
Article 22(2) lists exceptions: the automated decision is necessary for entering into or performance of a contract with the data subject, authorized by Union or Member State law, or based on the data subject's explicit consent. Article 22(3) then requires, even when an exception applies, that "the data controller shall implement suitable measures to safeguard the data subject's rights and freedoms and legitimate interests, at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision." Credit scoring plainly is a decision with legal or similarly significant effects. A fully automated credit denial is covered. The contract exception (22(2)(a)) typically applies because the automated decision is taken in the context of contract formation, but the 22(3) safeguards still bind. ### Meaningful information about the logic Articles 13(2)(f), 14(2)(g), and 15(1)(h) require the controller to provide the data subject with "meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject" whenever automated decision-making under Article 22(1) takes place. The precise content of "meaningful information about the logic" is debated. Wachter, Mittelstadt, and Floridi [@wachter2017right] argue that the GDPR does not create a right to a specific explanation of an individual decision; the recitals are non-binding and Article 22 references "the logic involved" in the general sense. Selbst and Powles [@selbst2017meaningful] push back, reading the provision as a right to information sufficient to understand the individual decision. Malgieri and Commandé [@malgieri2017right] sit between: not a right to the full algorithm, but a right to legibility of the factors that drove the decision. 
Operational practice has converged on providing at least: (i) the categories of data used, (ii) the model class (logistic regression, gradient boosted trees, neural network), (iii) the main factors that influenced the individual decision, and (iv) a mechanism to contest. The ECOA adverse action notice mechanism, when ported to EU credit, largely satisfies these demands. The Court of Justice of the European Union's 2023 *SCHUFA* ruling (Case C-634/21) held that the computation of a probability value constitutes a "decision" for Article 22 purposes when the value is used by a third party as a substantial determinant of a credit decision. This extends Article 22 obligations to bureau scoring, not just the downstream lender. ### Contest provisions Article 22(3) requires an avenue to "contest the decision." Practice involves three components: 1. A non-automated review channel with a named human reviewer. 2. The data subject's ability to submit additional evidence (payment history, error correction, hardship documentation) that the reviewer considers. 3. A documented outcome with a separate notice if the contested decision is maintained. For a lender using a machine learning model this implies shadow human decision capacity. A pipeline with 99% automated denials that cannot absorb a 1% contest rate into a human queue is not compliant. #### GDPR fairness and data minimization Article 5 of the GDPR imposes general principles: lawfulness, fairness, and transparency (5(1)(a)); purpose limitation (5(1)(b)); data minimization (5(1)(c)); accuracy (5(1)(d)); storage limitation (5(1)(e)); integrity and confidentiality (5(1)(f)); and accountability (5(2)). For a credit model these translate to concrete constraints. - **Purpose limitation**. Personal data collected for one purpose cannot be re-used for another incompatible purpose without a fresh legal basis. 
A bank that collected transaction data for payment processing cannot freely re-use it to train a credit model without assessing compatibility or obtaining consent. - **Data minimization**. The model must use only data that is adequate, relevant, and limited to what is necessary. A modeler who adds a device-fingerprint feature that provides 0.1 point of AUC on a 0.80 base must justify the marginal benefit against the marginal privacy cost. Courts and data protection authorities have read this requirement strictly in the credit-scoring context. - **Accuracy**. Inaccurate personal data must be rectified or erased without delay. If a feature in the model is based on a data point the data subject successfully rectified under Article 16, the rectified value must feed the model on next use. - **Storage limitation**. Training data must be kept no longer than necessary. A common practice is to retain training data for a documented period tied to the model refresh cycle and the statute-of-limitations period for regulatory audit. #### Special category data Article 9 of the GDPR prohibits the processing of "special category data" (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data, data concerning health, or data concerning a natural person's sex life or sexual orientation) unless an exception applies. A credit model cannot use race, religion, or health as a feature. This is stricter than ECOA (which forbids use of protected attributes in decisions) because GDPR Article 9 reaches to *processing*, not only the decision. A subtle question arises with fair-lending audits. Under Article 9(2)(g), processing can be lawful if it is necessary for reasons of substantial public interest, on the basis of Union or Member State law. A bank performing a fair-lending test on its model using BISG-inferred race probabilities is processing a special-category variable. 
Most EU data protection authorities treat this as lawful under Article 9(2)(g) when a statutory fair-lending framework is in place, but the legal basis must be documented. ## EU AI Act: credit scoring as a high-risk AI system Regulation (EU) 2024/1689 [@aiact2024], the EU AI Act, entered into force 1 August 2024, with tiered application dates (obligations for high-risk systems apply from 2 August 2026 for most Annex III systems; the prohibited-practices provisions and general-purpose AI chapters apply earlier). Credit scoring is in scope. ### Annex III classification Annex III of the AI Act lists the use cases classified as "high-risk." Point 5(b) covers: > AI systems intended to be used to evaluate the creditworthiness of natural persons or establish their credit score, with the exception of AI systems used for the purpose of detecting financial fraud. Consumer and SME credit scoring systems fall squarely within Annex III §5(b). The scope exclusion for fraud detection is narrow: a system that uses credit-related signals to prevent fraud may be out of scope, but a system that determines creditworthiness for origination is in. ### Obligations on providers of high-risk systems Chapter III, Section 2 of the AI Act (Articles 8 to 15) imposes substantive obligations on providers: - **Risk management system (Article 9)**. A continuous, iterative process spanning the entire lifecycle of the system, including identification of known and reasonably foreseeable risks, adoption of risk-management measures, and monitoring. - **Data and data governance (Article 10)**. Training, validation, and testing datasets must be relevant, representative, free of errors to the extent feasible, and examined for possible biases likely to affect fundamental rights. - **Technical documentation (Article 11 and Annex IV)**. 
A dossier including general description of the system, detailed description of its elements and development process, monitoring, functioning and control, and performance metrics. - **Record keeping (Article 12)**. Automatic logging of events over the lifetime of the system. - **Transparency and provision of information to deployers (Article 13)**. Instructions for use that are clear on intended purpose, accuracy, robustness, and known limitations. - **Human oversight (Article 14)**. The system must be designed so that it can be effectively overseen by natural persons, including the ability to intervene, override, or stop operation. - **Accuracy, robustness, and cybersecurity (Article 15)**. Appropriate levels of accuracy and robustness, including against adversarial attempts to manipulate outputs. ### Fundamental Rights Impact Assessment (FRIA) Article 27 of the AI Act introduces the Fundamental Rights Impact Assessment for deployers that are either public bodies or private entities providing public services, and specifically for deployers of Annex III §5(b) (credit scoring) and §5(c) (life and health insurance) systems. Before first use, the deployer must conduct an assessment containing: - a description of the processes in which the system will be used, - the period and frequency of use, - the categories of natural persons likely to be affected, - the specific risks of harm likely to have an impact on the affected groups, - a description of the implementation of human oversight measures, - the measures to be taken in the case of materialization of those risks, including internal governance and complaint mechanisms. The FRIA must be notified to the national market-surveillance authority. A standardized template is to be issued by the AI Office under Article 27(5). ### Practical consequence A U.S. bank that serves EU residents, a fintech in the European Economic Area, and a large model vendor providing a credit scoring service are all within scope. 
Deployments using open-source or internally built models are not exempt. The high-risk regime layers on top of GDPR (which continues to apply to the personal-data aspects), the Consumer Credit Directive 2023/2225 (which addresses creditworthiness assessment under consumer protection law), and national banking regulation. The AI Act does not preempt those regimes; it adds. #### Provider vs deployer The AI Act distinguishes a "provider" (Article 3(3)) from a "deployer" (Article 3(4)). The provider develops or has developed an AI system with a view to placing it on the market or putting it into service under its own name or trademark. The deployer is any natural or legal person using the AI system under its authority. A bank that builds its own credit model in-house is both provider and deployer. A bank that licenses a model from a vendor and uses it is a deployer; the vendor is the provider. A bank that builds a model, fine-tunes a vendor's model, or modifies a system enough to change its intended purpose can become a provider, even when it did not author the original system. The provider has the heavier obligations: conformity assessment (Article 43), CE marking (Article 48), registration in the EU database (Article 49), and post-market monitoring (Article 72). The deployer has the human-oversight obligation (Article 26), the FRIA obligation (Article 27), and an obligation to use the system in accordance with the provider's instructions. #### Conformity assessment and CE marking Before placing a high-risk AI system on the EU market, the provider must carry out a conformity assessment. For Annex III §5(b) credit scoring systems the assessment is an internal control procedure: the provider verifies that the system meets the Chapter III Section 2 requirements, prepares the technical documentation (Article 11 and Annex IV), and issues an EU declaration of conformity. The declaration is retained for 10 years and made available on request. CE marking signals conformity. 
Registration in the EU AI database (Article 71) includes a public-facing record of the provider, the system's intended purpose, and the deployer (for deployers that are public bodies or EU institutions). The database is maintained by the Commission; as of this writing (2024 into 2025) the registration system is under development. #### Substantial modification Article 25 addresses what happens when a deployer modifies a high-risk AI system. A "substantial modification" (Article 3(23)) turns the deployer into a provider for that modification. A bank that retrains a licensed model on its own data, changes the input feature set materially, or adjusts the model to score a new population (e.g., small business instead of consumer) risks crossing the substantial-modification threshold. The Commission guidance on Article 25 (anticipated 2025) will clarify the threshold; in the meantime, prudent practice treats any retraining that materially changes model outputs on the relevant evaluation population as substantial. #### Overlap with IRB For IRB PD models, the AI Act stacks on top of the Basel framework. The EBA's 2021 discussion paper on machine learning for IRB [@eba2020mlrr] anticipated this: any ML-based IRB model must satisfy the IRB framework (through-the-cycle stability, interpretability for supervisory review, MoC documentation) and, if it processes natural-person data, the AI Act. The dual regime is why many large banks continue to prefer logistic regression scorecards for retail IRB: simplicity is a compliance asset. ## SR 11-7 and OCC 2011-12: model risk management SR 11-7 [@sr117] and the parallel OCC Bulletin 2011-12 [@occ201112] are the U.S. supervisory guidance on model risk management. They apply to national banks (OCC) and bank holding companies and state member banks (Federal Reserve). Together with the FDIC's adoption of the same guidance (FIL-22-2017), they set the baseline expectation for any U.S. bank that develops, purchases, or uses a credit model. 
### What SR 11-7 requires

SR 11-7 defines a model as a "quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates." This is deliberately broad and covers:

- scorecards and logistic regression credit models,
- tree ensembles and deep networks used for underwriting,
- economic capital models,
- CCAR/DFAST stress-test engines,
- CECL/IFRS 9 expected-credit-loss models,
- pricing models and ALM models.

The guidance is organized around three elements: development, validation, and governance.

**Model development**. The guidance requires robust model development aligned with the business purpose, comprehensive testing (including out-of-sample and out-of-time), and full documentation sufficient that a third party could replicate the model.

**Model validation**. Validation is an independent effective-challenge function, structured around three components:

1. Conceptual soundness (theory, inputs, methodology, implementation review).
2. Ongoing monitoring (process verification, benchmarking, outcome analysis, sensitivity analysis).
3. Outcomes analysis (backtesting, stability tests, benchmarking against alternative models and challenger models).

SR 11-7 explicitly requires that validation be conducted by staff with no stake in the model's use. For a challenger model, validation runs the same analyses on a different structure.

**Model governance**. An inventory of all models with risk tiering, a model risk policy signed off by the board, a documented process for model changes, and exception and limitation tracking. The policy must define roles for model owner, developer, validator, and user.

### Effective challenge

The phrase "effective challenge" is an SR 11-7 term of art. It means "critical analysis by objective, informed parties who can identify model limitations and assumptions and produce appropriate changes."
Effective challenge is not merely a review for process adherence; it probes the model's assumptions. In credit, effective challenge on a PD model typically involves:

- replicating calibration on a held-out time period,
- stress-testing rating migration under adverse macro scenarios,
- comparing PD rankings against a naive external benchmark (bureau score, Altman Z-score, rating agency default rate table),
- running sensitivity analyses on included features (removing any single feature and measuring the performance drop),
- constructing a challenger model of a different class (for example, logistic regression as a challenger to XGBoost).

### Model inventory and tiering

Institutions run hundreds to thousands of models. SR 11-7 requires an inventory and a risk tier for each. A typical scheme:

- **Tier 1**: critical regulatory models (IRB PD, stress test, CECL). Annual independent validation, documented effective challenge, board reporting.
- **Tier 2**: important decision models (underwriting scorecards, pricing). Full validation at implementation plus re-validation on a defined cycle (18 to 24 months).
- **Tier 3**: lower-impact models (utilization forecasters, marketing propensity). Lighter validation, streamlined documentation.

Adverse action reason-code generators are themselves often treated as tier 2 models because a faulty reason code is a compliance exposure.

### How SR 11-7 reads on machine learning

SR 11-7 (2011) predates deep learning in banking. The guidance applies, however, to any model. The Fed, OCC, and FDIC issued the 2021 interagency RFI on AI/ML in banking, signaling that the SR 11-7 framework is the governance lens through which ML models are supervised. The specific additional concerns for ML are model opacity, feature engineering stability, hyperparameter governance, and data leakage. The EBA report on machine learning for IRB [@eba2020mlrr] lists parallel concerns on the European side.
#### Hyperparameter governance A single XGBoost model for credit scoring can be configured along dozens of hyperparameters: number of trees, maximum depth, learning rate, subsample and colsample fractions, L1 and L2 regularization weights, minimum child weight, gamma, number of parallel threads, monotonicity constraints on individual features, and so on. Each of these choices affects the out-of-sample error and the fairness profile. SR 11-7 requires that the selection be documented, justified, and controlled. In practice that means: a defined hyperparameter search space, a defined search algorithm (grid, random, Bayesian optimization), a defined selection criterion (out-of-sample AUC, calibration, or a multi-objective score that includes fairness), and a defined test data set that was held out from the search. The cross-validation folds must be locked before the search; a modeler who retunes on a fold after seeing the test result is leaking information and must reset. #### Data leakage and feature lineage Data leakage is the modeler's recurrent failure mode. A feature that appears in training data but is not available at the moment of decision is leaked. Examples from credit modeling: - a feature that includes payment behavior from the month after the scoring date, - a target-encoded categorical where the encoding used the full dataset rather than just the training partition, - a feature that aggregates counterparty information updated after the loan originated. SR 11-7's process-verification requirement is the primary control: the validation team traces each feature's definition back to its source system and verifies that it could have been computed at the moment of decision. A production pipeline that computes features on a historical snapshot (a "feature-time-travel" system) is easier to audit than one that computes features on the latest data at retraining time. #### Ongoing monitoring and backtesting SR 11-7 requires ongoing monitoring. 
For a PD model this typically includes: - **Discrimination metrics**: AUC or Gini on new vintages, tracked quarterly. - **Calibration**: Hosmer-Lemeshow, Brier score, or binomial backtests at each grade. For IRB, the BCBS 2005 paper on backtesting [@bcbs193] lays out the approach. - **Stability**: Population Stability Index (PSI) on the score distribution and feature distributions. A threshold of 0.10 for yellow and 0.25 for red is common but arbitrary; what matters is that the threshold is documented. - **Override rate**: the share of model outputs overridden by human review, tracked by override reason. When any of these breach the defined threshold, a remediation is triggered: re-calibration if stability is fine but calibration is off, re-fit if discrimination has drifted, rebuild if the feature distribution has materially changed. #### The three lines of defense SR 11-7 does not mandate the "three lines of defense" structure by name but is typically operationalized through it: - **First line**: the business and model development team. Owns the model, submits documentation, responds to findings. - **Second line**: the model risk management function (validation) and compliance. Runs effective challenge, approves or rejects, reports to senior management. - **Third line**: internal audit. Tests whether the first and second lines are fulfilling their defined responsibilities. Does not re-run validation; audits the process. The structure puts the model developer at arm's length from the approver. This arm's length is what the regulator checks. #### The OCC 2011-12 overlay OCC Bulletin 2011-12 [@occ201112] is substantively the same as SR 11-7 in intent, with some wording differences. OCC applies it to national banks. The OCC's examination manual drills in more deeply on scorecards and vendor models; the OCC has a long history of examining credit scoring at the portfolio level through the Uniform Retail Credit Classification system. 
A national bank supervised by the OCC will typically see OCC examiners review its credit scoring models on-site every 12 to 18 months, while state member banks supervised by the Federal Reserve will see their examiners operate on a comparable cadence.

#### Vendor models

Vendor-supplied models are not exempt from SR 11-7. The guidance explicitly requires the same validation rigor for vendor models as for internal ones. The vendor must provide sufficient documentation for the bank to conduct validation; if the vendor will not share the model internals, the bank must negotiate contractual protection or not use the model for material decisions. This is the governance dimension of the build-vs-buy decision, and it is the reason why many banks keep core underwriting models internal even when vendor models are cheaper.

## Adverse action notices and reason-code generation

Given the regulatory setup above, generating a compliant adverse action notice from a modern credit model is the critical operational task. The task factors into three components:

1. Decide whether the applicant receives an adverse action under the model.
2. Identify the principal reasons, in specific, factor-level terms, that drove the adverse action.
3. Translate the factor labels into consumer-readable reason statements.

We focus on (2), which is the interesting algorithmic step. We run the exercise on the German credit dataset, training both a logistic regression and an XGBoost model, and extracting reason codes from each.

### Reason codes from a logistic regression

For a logistic regression model with standardized features, the score for applicant $i$ is

$$
\text{logit}(PD_i) = \beta_0 + \sum_j \beta_j z_{ij},
$$

where $z_{ij}$ is the standardized feature value. The contribution of feature $j$ to the logit is $\beta_j z_{ij}$. The features that drive an adverse decision are those with the largest positive contribution. A subtle point: the "reference" for reason codes need not be the population mean.
Hurlin, Pérignon, and Saurin [@hurlin2026fairness] discuss this in the context of fairness, and the same logic applies here. If the baseline is an average applicant, the contribution $\beta_j z_{ij}$ measures distance from the mean. For ECOA purposes, that is typically what the regulator expects: "your amount was higher than typical," "your credit history was shorter than typical." If the baseline is instead a "reference approved applicant," then the contributions measure distance from approval. We use the first convention below.

The output shows, for a set of adversely actioned applicants, the three features with the largest positive contribution to the logit. The `status` feature is the German dataset's checking account status; `purpose` is the loan purpose; `credit_history` is the credit history string. These are mapped to consumer-readable labels in a reason-code table (not shown) that translates, for example, `status` to "Your checking account balance was low or the account is absent" and `amount` to "The requested loan amount was high relative to typical applicants."

### Reason codes from tree ensembles via TreeSHAP

Gradient boosted trees require a more general attribution. The *Shapley Additive Explanation* of Lundberg and Lee [@lundberg2017unified] decomposes a model's prediction for an individual into per-feature contributions that satisfy efficiency (contributions sum to prediction minus expected prediction), symmetry, and additivity. For tree ensembles the exact TreeSHAP algorithm runs in polynomial time and is exposed in XGBoost through the `pred_contribs=True` prediction option.

The output mirrors the logistic regression reason codes in structure: for each applicant with a PD above the denial threshold, the three most adverse features are reported. Some observations carry through. First, SHAP values are on the logit scale for the XGBoost binary classifier. They are therefore directly comparable to the logistic regression contributions.
The unit is "log-odds deviation from the dataset mean prediction." Second, one-hot-encoded categorical features produce one reason per level. A reasonable aggregation rolls per-level SHAP up to the parent feature before taking the top-$k$. The code above reports the raw per-level feature name; a production system would aggregate and translate. Third, interaction effects get split across main effects by TreeSHAP. If the regulator requires that an applicant sees a single "reason," and the underlying model contains a `purpose x duration` interaction, the top-$k$ SHAP algorithm may surface `purpose` and `duration` separately. This is acceptable under §1002.9 as long as each reason is specific and accurate. Barocas, Selbst, and Raghavan [@barocas2020hidden] point out two hidden assumptions in this approach: the choice of reference point (what "baseline applicant" are we explaining against?) and the granularity of the feature (is `credit_history` a single feature, or four categorical levels?). Both choices affect which reasons surface. For ECOA compliance, the documented convention must be deliberate and consistent across applicants. ### A production reason-code service The TreeSHAP call above returns raw per-column contributions. A production adverse-action service wraps that array in a function that (a) aggregates one-hot columns back to the parent feature, (b) excludes or flags age contributions per Regulation B §1002.6(b)(2), (c) breaks ties deterministically so identical inputs always return the same reason order, (d) maps parent names to consumer-readable strings, and (e) emits an audit record so the lender can reproduce the notice on demand. The emitted JSON is the audit artifact. A compliance query reproduces the reasons from `input_hash`, `model_version`, and `code_version` alone: load the pinned model checkpoint, replay the input through the same code path, and confirm the hash and reason list match. 
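A minimal sketch of such an audit record follows, under stated assumptions. The helper names `aggregate_to_parent` and `build_reason_record`, the `parent__level` column-naming convention, the `flagged_excluded` field, and all example values are illustrative and are not the book's implementation; only the field names `input_hash`, `model_version`, `code_version`, and `baseline_kind` come from the text. Excluding `age` by default is one possible reading of step (b), not a statement of what Regulation B requires in every case.

```python
import hashlib
import json


def aggregate_to_parent(contribs, sep="__"):
    """Sum per-column contributions into their parent feature.

    Assumes one-hot columns are named "<parent><sep><level>"; columns
    without the separator are treated as their own parent.
    """
    parents = {}
    for column, value in contribs.items():
        parent = column.split(sep, 1)[0]
        parents[parent] = parents.get(parent, 0.0) + float(value)
    return parents


def build_reason_record(features, contribs, *, model_version, code_version,
                        baseline_kind, excluded=("age",), k=3):
    """Return a JSON-serialisable audit record with the top-k adverse reasons."""
    parents = aggregate_to_parent(contribs)
    # Only positive (PD-increasing) contributions from permitted features qualify.
    adverse = [(p, v) for p, v in parents.items() if v > 0 and p not in excluded]
    # Deterministic order: largest contribution first, ties broken by name.
    adverse.sort(key=lambda pv: (-round(pv[1], 12), pv[0]))
    # Canonical JSON so the hash does not depend on dictionary key order.
    payload = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return {
        "input_hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "code_version": code_version,
        "baseline_kind": baseline_kind,
        "reasons": [p for p, _ in adverse[:k]],
        # Excluded features that would otherwise have been adverse reasons.
        "flagged_excluded": sorted(
            p for p, v in parents.items() if p in excluded and v > 0
        ),
    }


# Hypothetical logit-scale contributions for one applicant.
contribs = {"status__A11": 0.40, "status__A12": 0.10, "duration": 0.30,
            "purpose__car": 0.30, "age": 0.25, "amount": -0.20}
features = {"status": "A11", "duration": 48, "purpose": "car",
            "age": 23, "amount": 1500}

record = build_reason_record(
    features, contribs,
    model_version="xgb-demo-1", code_version="abc1234",
    baseline_kind="population_mean",
)
print(json.dumps(record, indent=2))
```

In this toy case the two `status` dummies sum to the largest parent contribution, `duration` and `purpose` tie and are ordered by name, and `age` is kept out of the reason list but recorded so a reviewer can see it was suppressed.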
The `baseline_kind` field records the reference convention (population mean versus reference-approved applicant) so a dispute can be reviewed against the correct counterfactual. The service treats the decision threshold, the baseline convention, the excluded features, and the consumer-text table as configuration, not code. A change in any of them is a versioned deployment. This is the minimum structure needed to satisfy SR 11-7 process verification for the adverse-action pipeline. ### Reason codes from deep and model-agnostic explainers For a neural network, kernel machine, stacking ensemble, or any scorer without a native SHAP solver, the adverse-action pipeline falls back to model-agnostic attribution. Four methods dominate the literature: - **Integrated Gradients** [@sundararajan2017axiomatic]. Path integral of the gradient from a baseline input to the observed input. Satisfies completeness (attributions sum to $f(x) - f(x^\text{ref})$) and implementation invariance. - **DeepLIFT** [@shrikumar2017learning]. Per-feature contribution relative to a reference activation. The Rescale rule attributes $(x_j - x_j^\text{ref}) \cdot m_j$, where $m_j$ is a chain-rule multiplier through the network that coincides with the gradient when activations are linear. - **Kernel SHAP** [@lundberg2017unified]. Model-agnostic sampling-based Shapley estimation. Works on any callable that maps $x$ to a scalar score. - **LIME** [@ribeiro2016why]. Local linear surrogate fit to perturbed samples around the instance; the surrogate coefficients are the reasons. The code below trains a multi-layer perceptron on the German credit features and extracts reason codes with each method. The MLP is chosen not because it is the right model for this dataset (it is not) but because it is neither a linear model nor a tree ensemble, so it exercises the model-agnostic path. 
The same code works on a stacking ensemble, a calibrated random forest, a kernel SVM, or any `sklearn`-style estimator that exposes `predict_proba`. #### Integrated Gradients (black-box, finite-difference) Integrated Gradients is defined as $\phi_j = (x_j - x_j^\text{ref}) \int_0^1 \partial_j f(x^\text{ref} + \alpha (x - x^\text{ref})) \, d\alpha$. For a black-box scorer we approximate the path integral with the midpoint rule and the per-step gradient with vectorised central finite differences. The result satisfies completeness up to numerical error. #### DeepLIFT Rescale (exact, exploiting MLP weights) `MLPClassifier` exposes `coefs_` and `intercepts_`, so we can walk the network by hand and apply the DeepLIFT Rescale rule exactly. Completeness holds to machine precision. #### Kernel SHAP (model-agnostic, any callable) Kernel SHAP needs only a scalar-output function and a background sample. By explaining `logit_mlp` directly, the attributions land on the logit scale, directly comparable to IG and DeepLIFT. #### LIME (local linear surrogate) LIME fits a weighted linear model to perturbed samples around the instance. The surrogate coefficients are the reasons. LIME weights live on the surrogate's scale, not the logit scale, so they should not be compared numerically to IG, DeepLIFT, or Kernel SHAP. They can still be ranked. #### Method comparison and governance The four outputs rank the same handful of parent features for most applicants (`status`, `duration`, and `credit_history` dominate on this dataset) but the magnitudes and scales differ. Integrated Gradients and DeepLIFT are both on the logit scale, complete with respect to the chosen reference, and deterministic for a fixed baseline. Kernel SHAP lands on the logit scale here because we explained log-odds directly; it carries Monte Carlo variance that shrinks as `nsamples` grows. LIME's coefficients live on the surrogate's scale and should not be compared numerically to the other three. 
A production pipeline that mixes model families therefore fixes one attribution method per family and documents the scale, not a single method across all models. The `resid` diagnostic printed for IG and DeepLIFT is the numerical gap between the sum of attributions and the model's logit change from baseline to observed input. For the IG implementation above it is bounded by the finite-difference step size and the number of path steps; for DeepLIFT Rescale it is machine precision. An adverse-action audit that finds a material residual (say, more than 1% of `delta_logit`) should treat the attribution as unreliable and either tighten the numerical scheme, switch to a gradient-exact implementation for the specific model family, or fall back to Kernel SHAP with higher `nsamples`. Rudin [@rudin2019stop; @rudin2022interpretable] argues that in high-stakes credit one should start with an interpretable model rather than an opaque one plus post-hoc explanation. That is a defensible position; the adverse-action-notice mechanism here does not excuse deploying a model whose reasons cannot be audited. The code above demonstrates that the mechanics are available for any model; the governance question is whether the explanation is faithful enough for ECOA, which turns on the choice of baseline, the aggregation to parent features, and the stability of the reason set under small input perturbations. For completeness, a Kernel SHAP run on the XGBoost model produces nearly identical answers to TreeSHAP on most applicants because both target the same Shapley decomposition. Exact TreeSHAP remains strictly preferred when available because it is deterministic and has no Monte Carlo variance. ### From reasons to reason codes The top-$k$ features are not the adverse action notice. The notice is consumer-readable text. 
The bank maintains a reason code table that maps a raw feature name to a consumer-readable statement, and an optional secondary mapping that adjusts the statement based on the direction and magnitude of the contribution. A minimal example for the German dataset:

| Feature | Consumer-readable reason |
|------------------|------------------------------------------------------|
| `status` | "The balance or status of your checking account did not meet our criteria." |
| `duration` | "The requested loan term was longer than typical for this product." |
| `amount` | "The requested loan amount was higher than we typically extend to applicants with your profile." |
| `credit_history` | "Your credit history showed items that indicated elevated risk." |
| `purpose` | "The stated purpose of the loan placed the application in a higher-risk category." |
| `savings` | "The balance of your reported savings was low relative to the requested loan size." |
| `employment` | "Your length of employment was short relative to the requested loan size." |
| `other_installment` | "You have other active installment obligations at another institution." |
| `property` | "The value of property you hold as security or evidence of stability was low." |
| `age` | "Your reported age fell into a category we use as one of several factors in our decision." (subject to ECOA age exceptions) |

The last row illustrates a trap. Age is a partial prohibited basis under ECOA: a creditor may not consider age except in limited circumstances, including where the applicant is a minor or where age is used as a predictive factor in an empirically derived, demonstrably and statistically sound credit scoring system that does not assign a negative factor or value to the age of any applicant 62 or older. Regulation B §1002.6(b)(2) and the credit-scoring-system definition in §1002.2(p) set the boundary.
A lender using age as a feature must maintain documentation that satisfies the "empirically derived, demonstrably and statistically sound" (EDDSS) requirement.

### Reason codes for embeddings and opaque features

Modern credit models increasingly consume features whose coordinates are not directly consumer-readable: text embeddings of a free-form loan-purpose field, graph embeddings summarising the applicant's transaction counterparties, image embeddings of an uploaded ID document, learned representations from a pretrained tabular foundation model. A SHAP value on "embedding coordinate 37" is not a reason a regulator will accept. "Your value on latent dimension 37 was high" fails the ECOA specificity test.

Three patterns reduce an arbitrary feature space back to something the bank can print on a notice.

1. **Concept grouping.** Name a small set of concepts (for example, "unsecured discretionary purpose", "auto purchase", "business use") and learn a direction in embedding space for each concept, either by training a linear probe on labelled examples or by computing a Concept Activation Vector [@kim2018interpretability]. Project the embedding-space attribution onto the concept directions and report the top-$k$ concepts.
2. **Prototype matching.** Precompute a set of prototype applicants with labelled archetypes ("thin-file self-employed", "young first-car borrower"). At scoring time, report the prototype nearest in embedding space and use its reason-code template. This is the mechanism of prototype-based deep nets [@li2018deep] reused at attribution time.
3. **Structural aggregation.** When the embedding has a natural decomposition (image tiles, text spans, transaction merchant categories, graph neighbours), run SHAP or Integrated Gradients at that decomposition level and aggregate attribution by a human-readable grouping. The notice then names the group, not the coordinate.
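The projection step of the concept-grouping pattern can be sketched in a few lines of NumPy. This is an illustration only: the helper `concept_reasons`, the 4-dimensional toy embedding, the concept directions, and the attribution vector are all hypothetical, and a real pipeline would learn the directions with a linear probe or a Concept Activation Vector as described above.

```python
import numpy as np


def concept_reasons(attribution, directions, k=2):
    """Rank named concepts by the projection of an embedding-space attribution.

    attribution: 1-D array of per-coordinate attributions (e.g. SHAP or IG).
    directions:  dict mapping concept name -> direction vector in the same space.
    Returns up to k (name, score) pairs with positive (adverse) scores.
    """
    names = sorted(directions)
    # Normalise each direction so scores are comparable across concepts.
    unit = np.stack([directions[n] / np.linalg.norm(directions[n]) for n in names])
    scores = unit @ attribution
    # Largest projection first; ties broken by concept name for determinism.
    order = sorted(range(len(names)), key=lambda i: (-scores[i], names[i]))
    return [(names[i], float(scores[i])) for i in order[:k] if scores[i] > 0]


# Toy concept directions in a 4-dimensional embedding (assumed, not learned).
directions = {
    "unsecured discretionary purpose": np.array([1.0, 0.0, 0.0, 0.0]),
    "auto purchase": np.array([0.0, 1.0, 0.0, 0.0]),
    "business use": np.array([0.0, 0.0, 1.0, 1.0]),
}
# Toy per-coordinate attribution for one applicant (logit scale).
attribution = np.array([0.50, -0.20, 0.10, 0.10])

top = concept_reasons(attribution, directions)
for name, score in top:
    print(f"{name}: {score:+.3f}")
```

One design caveat: when concept directions are not orthogonal, the projected scores no longer sum to the model's logit change, so they are better treated as a ranking device for the notice than as an additive decomposition.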
In all three patterns the reason-code table maps *concept* or *prototype* or *region* to consumer-readable text. The regulator accepts the notice as long as the entity named is a real, auditable function of the applicant's data. What fails is "coordinate 37"; what succeeds is "a high share of gambling merchants in your recent transactions" or "loan-purpose text matched patterns associated with unsecured discretionary spending". The code below implements concept grouping on a synthetic opaque-embedding block derived from the German purpose field. The same pattern applies to a real transformer embedding: only the embedding tensor changes. The reason a regulator sees is still a sentence about the applicant's behaviour, not a number on a latent axis. The attribution math is identical to the tabular case; only the last mile (mapping attribution to consumer text) changes. ### Grouping one-hot levels for reasons The XGBoost model above was trained on one-hot-encoded categoricals. SHAP then attributes contribution to each one-hot column, not to the parent categorical. Adverse action notices expect the parent name. Two approaches handle the grouping. 1. The first approach trains directly on label-encoded or native categorical columns. XGBoost 1.5+ and LightGBM support native categorical handling. SHAP then attributes to the parent feature natively. This is cleaner, but loses some expressiveness in the tree structure. 2. The second approach (used in the code above) trains on one-hot and aggregates SHAP across levels to get a per-parent-feature contribution. The aggregation is additive because TreeSHAP is additive. Two details matter. First, "zero-valued" one-hot dummies can still carry SHAP contribution if the tree's path includes a split on that dummy; SHAP attributes the contribution to the absence of the category, which is still information. 
Second, for a parent with $L$ levels and reference level absorbed by drop-first, the summed SHAP across the $L-1$ dummies is the full parent contribution relative to the reference. In the code above, the `parent_feature` function and the `parent_scores` dictionary implement this aggregation in the logistic regression path. For the XGBoost path the first snippet merely relabels the top-$k$ one-hot columns with their parent name. A production implementation sums SHAP per parent and then ranks parents. The production reason-code service defined earlier (`_aggregate_to_parent` and `build_reason_record`) already does this. Factored into a single pure function for reuse: The two orderings often disagree. A categorical parent with four one-hot levels that each contribute $+0.15$ sums to $+0.60$ at the parent level, dominating any single column that contributed $+0.40$ but whose siblings contributed near zero. The column-level ranking would hide the parent; the parent-level ranking surfaces it. For ECOA purposes, the parent is the correct unit of attribution: a denial reason is a feature of the applicant, not a value of a dummy column. ### Stability of reason codes across model refreshes A quiet failure mode of reason-code pipelines is instability across model refreshes. If the model is retrained every quarter and the feature importances shift materially, applicants who receive identical decisions on two applications can see different reasons across them. The regulator does not require stability, but consumers notice. A simple stability check: after each refresh, compute the reason codes for a fixed panel of applicants (a "regression test set"), and measure the share of applicants whose top-three reasons changed. A threshold of 10% change without underlying data change triggers a review. A persistent instability suggests the model is overfitting to nuisance variation and the training regimen needs review. 
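The stability metric itself reduces to a per-applicant set comparison. A minimal sketch, assuming top-$k$ reason lists have already been extracted for the same pinned panel under the reference model and the refreshed model; the function name `reason_change_share`, the applicant IDs, and the toy panel are illustrative only.

```python
def reason_change_share(reference, refreshed, k=3):
    """Share of panel applicants whose top-k reason *set* differs.

    reference, refreshed: dicts mapping applicant id -> ordered list of reasons.
    Order changes within the top-k are ignored; only membership is compared.
    """
    if not reference:
        raise ValueError("panel is empty")
    if reference.keys() != refreshed.keys():
        raise ValueError("both runs must cover the same applicants")
    changed = sum(
        set(reference[a][:k]) != set(refreshed[a][:k]) for a in reference
    )
    return changed / len(reference)


# Toy panel of four adverse applicants (hypothetical reasons).
reference = {
    "app-001": ["status", "duration", "credit_history"],
    "app-002": ["amount", "status", "purpose"],
    "app-003": ["savings", "status", "duration"],
    "app-004": ["status", "credit_history", "employment"],
}
refreshed = {
    "app-001": ["duration", "status", "credit_history"],    # same set, new order
    "app-002": ["amount", "status", "savings"],             # purpose replaced
    "app-003": ["savings", "status", "duration"],           # unchanged
    "app-004": ["status", "credit_history", "employment"],  # unchanged
}

THRESHOLD = 0.10  # review trigger taken from the text above
share = reason_change_share(reference, refreshed)
needs_review = share > THRESHOLD
print(f"changed share = {share:.2f}, review = {needs_review}")
```

Comparing sets rather than ordered lists is a design choice: a notice that lists the same three reasons in a different order is arguably the same notice. A stricter variant would compare the ordered lists directly.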
The code below implements the check against the XGBoost model trained above. The panel is the set of adverse applicants in the test fold. "Refreshes" are perturbed retrains: same data, different seeds and subsampling rates, standing in for the small amount of stochastic variation any production retrain introduces. In a production pipeline, the reference panel is pinned (stored with its SHAP matrix and reason sets), the threshold is part of the model-governance configuration, and the check runs in CI as a gate on the retrained artifact. A breach does not automatically block the deploy, but it does force second-line review: is the shift explained by a deliberate feature change, a distribution shift in the training data, or is it nuisance variation that the retraining regimen should be tightened to suppress? ### Reason codes under monotone constraints Modern boosting implementations support monotonicity constraints: force the model's output to be monotonically increasing or decreasing in a specific feature. This is valuable for reason codes. A lender can enforce that higher utilization never decreases the PD, which precludes cases where the model, counterintuitively, penalizes low utilization due to interaction effects with other features. The monotone-constrained model is easier to explain because every feature-level contribution has a consistent sign. For ECOA purposes, monotonicity constraints are a defensible business-necessity design. A model that violates monotonicity on a feature the business expects to be monotone (debt-to-income, for example) is harder to justify to a regulator. The cost is a small AUC reduction, typically 0.5% to 2% depending on the number of constraints and the flexibility of the underlying data. ## Documentation artifacts SR 11-7, the EU AI Act, ECOA, and IRB all demand documentation. Four artifacts carry most of the weight. ### Validation report Produced by the second-line validation function. 
Covers conceptual soundness, process verification, backtesting, benchmarking, and a documented sign-off. Typical length: 40 to 120 pages for a tier 1 model. A validation report does not report on the business case for the model; it reports on whether the model does what it claims, works as implemented, and remains fit for purpose.

### Datasheet for the dataset

Gebru et al. [@gebru2021datasheets] introduce "Datasheets for Datasets," a structured template for disclosing dataset provenance, composition, collection process, preprocessing, labeling, intended use, distribution, and maintenance. For a credit dataset, the datasheet includes: who and what the records represent, the sampling frame (approved applicants only, all applicants including declines, rejected applicants with inferred outcomes), temporal coverage, labeling rules for default, protected-attribute coverage, and any reweighting applied.

The datasheet is not a nice-to-have. Under EU AI Act Article 10, the dataset used for training a high-risk system must be examined for biases and characterized in the technical documentation. A datasheet satisfies that requirement.

### Model card

Mitchell et al. [@mitchell2019model] introduce the *model card*, a short document describing a trained model. A well-formed model card is a one-to-three-page document that covers intended use, out-of-scope uses, factors (relevant demographic, phenotypic, and environmental factors), metrics, evaluation data, training data, quantitative analyses disaggregated by factor, ethical considerations, and caveats.

Below is a worked model card for the XGBoost PD model fit above, in JSON so it can be parsed by downstream tooling (an MLflow registry, a model inventory database, an AI Act conformity system). We compute the quantitative fields from the data we just fit.

The JSON card is machine-readable. A bank's model inventory can ingest it and attach it to the governance ledger.
An AI Act conformity assessment can use it as the starting point for the Article 11 technical documentation. ### Validation report skeleton The fourth artifact is the validation report. Unlike the three above, the validation report is authored by an independent team. Its skeleton, at minimum: - Executive summary and conclusion. - Conceptual soundness assessment (theory, methodology, data). - Process verification (code review, environment, data lineage, feature pipelines). - Outcomes analysis (backtesting, benchmarking, sensitivity, stability, calibration). - Monitoring plan (metrics, triggers, frequency). - Limitations, assumptions, and compensating controls. - Approval, exceptions, and re-validation schedule. The validation report cites the model card, the datasheet, and the development report; it does not reproduce them. Every limitation surfaces in the risk tiering and monitoring plan. ## Regulatory implications for the rest of this book The chapters that follow rarely return to the full apparatus of this chapter, but every method intersects with it. The discriminant analysis of @sec-ch06 and the logistic scorecard of @sec-ch07 produce the simplest reason codes: a linear contribution per feature. That interpretability is why they remain the workhorses of origination scoring. The survival models of @sec-ch09 and the reject-inference methods of @sec-ch10 touch directly on IRB PD estimation: survival calibrates the time-to-default horizon properly, and reject inference addresses the selection bias in the training data that the Basel framework acknowledges as a risk. The trees (@sec-ch11), ensembles (@sec-ch12), SVMs (@sec-ch13), and deep networks (@sec-ch14-nn) force the reason-code apparatus of this chapter into play. Without a compliant reason-code pipeline and a model card, a gradient boosted model cannot be used for U.S. retail origination. 
The fairness chapters (@sec-ch23 and @sec-ch24) pick up the disparate-treatment and effects-test framework of @sec-adverse-action and make it operational. The MLOps chapter (@sec-ch34) operationalizes the SR 11-7 controls: logging, ongoing monitoring, champion-challenger pipelines, retraining governance. The IFRS 9 and CECL chapter (@sec-ch35) takes the IRB PD formula of @sec-ch05 and embeds it into an accounting-based expected-credit-loss estimator. ## IRB capital applied to a small synthetic portfolio To close the chapter, apply the IRB formula to a synthetic retail portfolio that mirrors what a U.S. lender would face. Two supervisory points drop out of the numbers. First, the RWA density (total RWA divided by total EAD) is markedly different across segments. QRRE density sits well below other-retail density at the same PD and LGD mix, because the fixed $\rho = 0.04$ mutes the Vasicek tail. A portfolio rotation from other-retail to QRRE, holding PD and LGD means fixed, reduces RWA without doing anything to the underlying credit risk. This is regulatory arbitrage and a key supervisory concern. Basel III's output floor [@basel2017finalising §9] is designed to reduce the scope for such arbitrage. Second, the portfolio's capital is not just a sum of individual $K$s; it is the expectation that fast-growing QRRE, despite low $\rho$, generates unexpected losses systemically correlated across obligors. The ASRF model is a first-order approximation that ignores granularity and sectoral concentration. Pillar II, Pillar III, and concentration add-ons pick up what Pillar I misses. ## Emerging markets The five regulatory pillars developed in this chapter, including Basel IRB capital in @sec-ch05-regulation, ECOA adverse action in @sec-ch05-ecoa, FCRA bureau regulation in @sec-ch05-fcra, GDPR Article 22 automated-decision rights in @sec-ch05-gdpr, and the EU AI Act high-risk regime in @sec-ch05-euaia, each have direct statutory analogs in the major emerging markets. 
Mapping them is not cosmetic: circular numbers, filing obligations, regulator contact lines, and dispute timelines differ. But the decomposition is the same one a US or EU scorecard team would recognize, and the internal artifacts (model card, datasheet, validation report, reason codes, Article 27-style impact assessment) transfer with minor relabeling. This section does for India, Brazil, Indonesia, Mexico, Kenya, and Vietnam what the rest of the chapter does for the US and EU: name the instrument, say what it requires, and state how it lands on the scorecard team.

### Cross-jurisdictional mapping

@tbl-em-pillars lines up the local instrument against each of the five chapter pillars. The table is indicative (i.e., the jurisdictions differ in how tightly each pillar binds), but the point is that a practitioner moving from a New York or Frankfurt desk to São Paulo, Mumbai, Jakarta, Mexico City, Nairobi, or Hanoi should expect to find all five pillars already in local law, usually under an older statute than the equivalent US or EU version. The gaps are where an AI-specific regime has not yet been enacted (Indonesia, Mexico, Kenya, Vietnam) and where IRB access is effectively closed (Kenya, most of Indonesia, the Vietnamese pilot aside); in these cases the standardized approach plus a domestic Pillar II overlay is the binding capital channel.

| Pillar | India | Brazil | Indonesia | Mexico | Kenya | Vietnam |
|:--|:--|:--|:--|:--|:--|:--|
| IRB / capital | RBI Basel III Master Circular; NBFC SBR [@rbi2023basel_master] | BCB Circ. 3648/2013 [@bcb_circ3648_2013] | OJK POJK 11/03/2016 KPMM [@ojk_kpmm_2016] | CNBV CUB [@cnbv_cub2023] | CBK PG/02 (Basel II standardized) [@cbk_risk2013] | SBV Circ. 41/2016 and 22/2023 [@sbv_circular41_2016; @sbv_circular22_2023] |
| Adverse action | RBI Fair Practices Code; Digital Lending KFS [@rbi2022digitallending] | CDC Art. 43; Cadastro Positivo [@brazil_cadpositivo2011] | POJK 22/2023 [@ojk_pojk22_2023] | Fintech Law; LFPDPPP ARCO [@mexico_fintech2018; @mexico_lfpdppp2010] | Consumer Protection Act 2012; CRB pre-listing notice [@kenya_cis2020] | Circ. 43/2016; Decree 13/2023 Art. 14 [@vn_decree13_2023] |
| Bureau / FCRA | CICRA 2005; four RBI-licensed CICs [@india_crica2005] | LC 166/2019 Cadastro Positivo opt-out [@brazil_cadpositivo2011] | OJK SLIK; POJK 15/2022 LPIP [@ojk2022fintech] | LRSIC 2002; Buró, Círculo [@mexico_sic2002] | CBK CRB Regulations 2013/2020 [@kenya_cis2020] | SBV CIC; PCB; Circ. 03/2013 [@cic_vietnam2023] |
| Data protection / Art. 22 | DPDP Act 2023 [@india_dpdp2023] | LGPD Art. 20 (explicit) [@lgpd2018] | UU PDP Art. 10 [@indonesia_pdp2022] | LFPDPPP Art. 16 [@mexico_lfpdppp2010] | DPA 2019 s. 35 (explicit) [@kenya_dpa2019] | Decree 13/2023 Art. 11, 14 [@vn_decree13_2023] |
| High-risk AI | MeitY advisory; RBI FREE-AI committee | PL 2338/2023 (pending, EU-style tiers) | OJK fintech sandbox POJK 13/2018 [@ojk2022fintech] | No binding AI law; INAI drafting | DPA Part V automated-decision rights [@kenya_dpa2019] | Decree 94/2025 sandbox [@vn_decree94_2025] |
| Open / consent data | RBI Account Aggregator [@rbi2023aa] | BCB Open Finance Joint Res. 1 [@bcb_openfinance_2020] | OJK open-API roadmap | Fintech Law Art. 76 open APIs [@mexico_fintech2018] | (in consultation) | Decree 94/2025 sandbox [@vn_decree94_2025] |

: Five regulatory pillars across six emerging markets. Rows correspond to the chapter sections (@sec-ch05-regulation, @sec-ch05-ecoa, @sec-ch05-fcra, @sec-ch05-gdpr, @sec-ch05-euaia) plus an open-banking row because alternative-data scoring depends on it in every one of these markets.

### India

The Reserve Bank of India runs both the prudential and the consumer-conduct regime for banks; the Securities and Exchange Board of India (SEBI) and the insurance regulator (IRDAI) are outside the credit-scoring perimeter.
Capital is set by the Master Circular on Basel III Capital Regulations [@rbi2023basel_master]. IRB access requires supervisory pre-approval and in practice Indian banks operate on the standardized approach with RBI-set risk weights; unsecured consumer credit risk weights were raised from 100% to 125% in late 2023 in response to the rapid growth of the segment. Non-bank finance companies (NBFCs) sit under the Scale-Based Regulation (SBR) framework, which imposes bank-equivalent capital obligations on the top tier. The adverse-action analog is the RBI Fair Practices Code, which requires lenders to communicate rejection reasons in writing, and the Digital Lending Guidelines 2022 [@rbi2022digitallending], which mandate a Key Fact Statement disclosing APR and a cooling-off period; the Default Loss Guarantee circular of June 2023 [@rbi2023fldg] caps first-loss cover at 5% of loan portfolio for regulated-lender/fintech tie-ups and is the operative constraint on co-lending scorecards. Bureau regulation runs through the Credit Information Companies (Regulation) Act 2005 [@india_crica2005]; the four licensed CICs are CIBIL (TransUnion), Experian, Equifax, and CRIF High Mark. CICRA and its regulations give a consumer the right to access the credit information file and to seek correction of inaccurate data (the functional analog of FCRA §611 dispute rights) with the operational timeline set by the CIC Regulations. The Article 22 analog is the Digital Personal Data Protection Act 2023 [@india_dpdp2023], notified but not yet in full force as of 2026-04; it is narrower than GDPR (no explicit right against solely automated decisions, no data-portability right), but its rights chapter gives consent, grievance, and correction rights that collectively pin down an appeal pathway. 
The AI-specific regime is still non-statutory: MeitY's 2024 advisories on generative AI and the RBI committee on Responsible and Ethical Enablement of AI (FREE-AI), constituted in late 2024, signal that guidance is in progress, but there is no Annex III analog yet. The practical substitute for open banking is the RBI NBFC-Account Aggregator framework [@rbi2023aa], a consent-based financial-data-sharing layer that sits between banks and lenders; an Indian credit-scoring team building alternative-data features goes through an Account Aggregator rather than through bank-by-bank API deals. The scorecard-team takeaways: standardized capital is the binding channel; Digital Lending KFS strings are the adverse-action artifact; CICRA dispute rights are the FCRA-equivalent pipeline; the Account Aggregator is the consent log; DPDP grievance redressal is the Article-22-equivalent appeal route. ### Brazil The Banco Central do Brasil (BCB) and the Conselho Monetário Nacional (CMN) are the prudential authorities; consumer conduct is shared with Senacon (the federal consumer-defense secretariat) and data protection with the ANPD. Brazil has the deepest IRB adoption in Latin America: BCB Circular 3648/2013 [@bcb_circ3648_2013] sets out the foundation and advanced IRB approaches, with Basel III buffers layered through subsequent CMN resolutions. Several of the largest Brazilian banks operate approved IRB models on retail portfolios, and the integrated risk-management obligation under CMN Resolution 4557/2017 [@cmn_res4557_2017] is the Brazilian operational analog to SR 11-7: it requires a documented model-risk framework covering development, validation, implementation, monitoring, and governance (i.e., the same five headings a US bank would list). 
The adverse-action analog is Article 43 of the Code of Consumer Protection (CDC, Law 8078/1990), which entitles the consumer to access and correct any credit data used against them; the Cadastro Positivo regime in Law 12.414/2011, amended by Complementary Law 166/2019 [@brazil_cadpositivo2011], switched positive-data inclusion from opt-in to opt-out and materially changed the thin-file Brazilian subprime segment by making positive behavior visible by default. The bureau regime runs through Serasa Experian, Boa Vista SCPC, SPC Brasil, and Quod, all licensed under Law 12.414. The Article 22 analog is the LGPD [@lgpd2018], which in Article 20 gives an explicit, named right to request review of decisions taken solely on the basis of automated processing, including credit scoring and personality profiling. Article 20 is the closest any emerging-market data-protection law comes to reproducing GDPR Article 22 verbatim; a Brazilian scorecard team should treat it as operationally identical to the GDPR obligation. The AI-specific regime is in motion: PL 2338/2023, the Brazilian AI bill, was approved by the Senate in December 2024 and copies the EU risk-tier structure, including a "high-risk" class that will capture credit scoring; the House vote is pending as of 2026-04, so a Brazilian deployment should expect an Annex-III-equivalent obligation to bind within the planning horizon. Open Finance Brazil, launched by the CMN-BCB Joint Resolution No. 1/2020 [@bcb_openfinance_2020] and rolled out in four phases from 2021 into 2022, is the consent-based data-sharing rail for alternative-data scoring; its scope has been extended beyond banking into investments, insurance, and pensions. ### Indonesia The OJK (Otoritas Jasa Keuangan) is the integrated prudential and conduct regulator; Bank Indonesia retains monetary and payments authority. 
The Basel III capital regime sits in POJK 11/POJK.03/2016 on minimum capital adequacy for commercial banks (KPMM), as amended [@ojk_kpmm_2016]; IRB is not operational in Indonesia, so the binding calculation is standardized risk weights with an OJK add-on for concentration and macro-prudential buffers. The adverse-action analog is POJK 22/2023 on consumer and community protection in the financial services sector [@ojk_pojk22_2023], which requires transparent disclosure of credit-decision reasons, timely complaint handling, and an escalation path to OJK consumer protection. The bureau regime is a hybrid: the public SLIK (Sistem Layanan Informasi Keuangan), run by OJK, succeeded BI Checking in 2018 and contains all regulated-lender data; private bureaus (LPIPs, Lembaga Pengelola Informasi Perkreditan) operate under a separate OJK licensing regime and add telco, utility, and e-commerce data. The Article 22 analog is the Personal Data Protection Law (UU PDP) 27/2022 [@indonesia_pdp2022], which gives the data subject a right to object to decisions based solely on automated processing that carry legal or significant effect --- close to GDPR Article 22 in scope. The enforcement body under the PDP Law is still being stood up as of 2026-04, so the practical compliance pressure today comes from OJK rather than from the PDP authority. The digital-lending channel is the dominant consumer-credit surface in Indonesia and sits under POJK 10/POJK.05/2022 [@ojk2022fintech], which licenses information-technology-based lending services (LPBBTI, formerly known as P2P lending), caps daily effective interest through subsequent OJK circulars, prohibits collection harassment, and requires blacklist disclosure. OJK's regulatory sandbox for digital financial innovation is the channel for novel scoring approaches, including alternative-data and ML-based models, that sit outside POJK 10. 
There is no Indonesian AI Act, and OJK guidance on AI in financial services is still advisory rather than Annex-III-equivalent. Indonesian practice: SLIK pull + POJK 22/2023 reason-code strings + UU PDP consent log + OJK sandbox admission if the model is ML-based. ### Mexico CNBV (Comisión Nacional Bancaria y de Valores) is the banking supervisor; Banxico runs payments and monetary policy; CONDUSEF handles consumer complaints. Capital rules are in the Circular Única de Bancos (CUB) [@cnbv_cub2023], which implements Basel III with Mexican calibrations; internal-model approvals for credit risk exist in principle under CUB but are case-by-case, so the standardized approach is the default. Model-risk governance obligations inside the CUB require independent validation and board-level oversight of internal models (i.e., an SR 11-7-shape obligation with different numbering). The adverse-action analog is the Fintech Law [@mexico_fintech2018] for regulated fintechs (Institutions of Financial Technology, IFTs) and the ARCO rights under the LFPDPPP [@mexico_lfpdppp2010] for banks: access, rectification, cancellation, and opposition. The "opposition" right is the closest ARCO gets to an Article 22 appeal; a Mexican lender that cannot produce a natural-language rationale for a denial is exposed to both a CONDUSEF complaint and an ARCO opposition claim. The bureau regime is the Law to Regulate Credit Information Companies of 2002 [@mexico_sic2002]; two licensed SICs (Buró de Crédito and Círculo de Crédito) share coverage, and the law sets out consumer dispute and rectification rights against SIC files. There is no binding AI law in Mexico; INAI, the federal data-protection authority, published guidance on personal data and AI in 2023, and a legislative restructuring of INAI has been under way since 2024 as part of the broader transparency-agency reform. 
The Fintech Law's open-API mandate has produced slow progress (Mexico's open-banking rollout is well behind Brazil's), but it is the statutory basis for consent-based data sharing that alternative-data scorecards rely on. Mexican takeaways for a scorecard team: CUB-governed capital with a high procedural bar for internal models, CONDUSEF-visible reason codes as the adverse-action artifact, SIC data pulls through Buró or Círculo, LFPDPPP ARCO logs as the Article-22 substitute, and no AI-specific regime today.

### Kenya

The Central Bank of Kenya (CBK) supervises banks and, following amendments to the CBK Act that brought digital credit providers under its remit, also licenses the digital-credit segment. Capital follows CBK Prudential Guideline PG/02 (Basel II standardized); IRB is not open to Kenyan banks. PG/04 on Risk Management [@cbk_risk2013] is the model-governance document; it is narrower than SR 11-7 but covers the same three pillars (development, validation, independent review). The adverse-action analog is split between the Consumer Protection Act 2012 (generic) and the CBK Banking (Credit Reference Bureau) Regulations [@kenya_cis2020], which require a lender to give prior written notice to a borrower before reporting a default to a CRB; amendments in 2020 responded to the digital-lender listing explosion by tightening consent requirements, unwinding small-value negative listings, and narrowing the data-use perimeter. Three CRBs are licensed in Kenya: Metropol, TransUnion Kenya, and Creditinfo. The Article 22 analog is the Kenya Data Protection Act 2019 [@kenya_dpa2019], specifically Section 35, which grants a data subject the right not to be subject to a decision based solely on automated processing that produces legal or significant effects; the wording is close to a verbatim copy of GDPR Article 22. Kenya has one of the strongest automated-decision rights in Sub-Saharan Africa and an active Office of the Data Protection Commissioner.
The digital-credit segment sits under the Digital Credit Providers Regulations 2022 [@cbk2023digital], which licensed the sector for the first time and imposed rate caps, collection rules, and data-use limits; the initial licensing round saw only a fraction of applicants licensed, which reshaped the market. The Kenyan scorecard team lands on: standardized capital with a CBK Pillar II overlay, CRB Regulations pre-listing notice as the adverse-action strong form, DPA §35 as a GDPR-strength Article-22 substitute, and the DCP Regulations as the digital-credit conduct perimeter. ### Vietnam: worked example ### Market context Vietnam's prudential and consumer-credit framework is a good worked example for the emerging-market practitioner because the legal sources map cleanly onto the five pillars of this chapter. The Basel II capital regime is implemented through SBV Circular 41/2016/TT-NHNN, which prescribes the standardized approach for most domestic banks and opens a limited IRB pilot pathway for systemically important institutions [@sbv_circular41_2016]. Consumer lending conduct is governed by Circular 43/2016/TT-NHNN on consumer lending by finance companies, which sets fee disclosure, collection, and cash-lending-ratio rules. Separately, Circular 22/2023/TT-NHNN (29 Dec 2023) amends Circular 41/2016 on capital adequacy ratios and refines the Basel II standardized capital calculation for banks [@sbv_circular22_2023]. The State Bank of Vietnam (SBV) is the principal prudential supervisor. The Credit Information Center (CIC), a public bureau operated under the SBV, and the private Vietnam Credit Information Joint Stock Company (PCB) between them reach roughly 50 to 55 percent of the adult population [@cic_vietnam2023; @worldbank_findex2021]. Mobile penetration above 140 percent of adults and smartphone adoption above 80 percent of urban adults underpin an eKYC onboarding channel codified by Circular 16/2020/TT-NHNN [@sbv_circular16_2020]. 
Personal data protection is governed by Decree 13/2023/ND-CP, the first comprehensive Vietnamese data-protection instrument [@vn_decree13_2023]. Regulatory-sandbox experimentation with credit scoring, peer-to-peer lending, and open banking is framed by Decree 94/2025/ND-CP, which supersedes earlier draft circulars and establishes the SBV-run controlled testing mechanism [@vn_decree94_2025; @sbv2023vietnam].

### Application considerations

Mapping the chapter's regulatory surface onto Vietnam produces five concrete adjustments.

First, the IRB capital derivation in @sec-ch05 survives unchanged, but the jurisdictional wrapper is Circular 41/2016 rather than the Basel text itself. Most Vietnamese banks today run the Circular 41 standardized approach; a handful of state-owned and joint-stock banks are in the IRB pilot. The ASRF formula, the 99.9 percent confidence level, the 12.5 RWA multiplier, and the 8 percent minimum capital ratio all carry through directly. The $\rho$ supervisory functions are set identically to the Basel defaults. What differs is the output floor: Basel III's 72.5 percent floor is not yet binding in the Vietnamese transposition, so the capital saving from a successful IRB pilot is larger for a Vietnamese bank than for an EU or US bank, which changes the economics of the pilot investment.

Second, the adverse-action analog in Vietnam is thinner than ECOA Regulation B §1002.9 but is tightening. Circular 43/2016 on consumer lending by finance companies requires clear fee and rate disclosure and a lawful reason for collection actions, and Decree 13/2023 Article 14 gives a data subject the right to know the purpose and legal basis of processing and to contest an automated decision. The practical drafting obligation on a Vietnamese scorecard team is close to the ECOA reason-code obligation even though the statutory trigger is different.
Third, FCRA-style bureau regulation is embedded in the CIC and PCB subscriber agreements plus the SBV credit-reporting regulations (Circular 03/2013/TT-NHNN and its successors). Consumer access to the CIC file is enabled through the CIC Credit Connect app, which is the nearest local analog to the US annualcreditreport disclosure. Dispute rights exist in practice, but are less heavily litigated than in the US. Fourth, the GDPR Article 22 analog in Vietnam is Decree 13/2023 Article 11 (consent) and Article 14 (rights of the data subject), which together require a human-review pathway for automated decisions producing significant legal or financial effects. The scope is narrower than GDPR Article 22 but the practical design constraint is similar: the pipeline must support an appeal channel and must log the automated decision. Fifth, the EU AI Act analog is nascent. Decree 94/2025 establishes a sandbox for fintech including credit scoring, and the Ministry of Science and Technology has published draft AI-governance principles aligned with the ASEAN AI Governance Framework, but there is no Vietnamese counterpart to Annex III of the AI Act as of the drafting date [@vn_decree94_2025]. Two crosscutting issues deserve attention. Real-estate collateral concentration on Vietnamese bank balance sheets is large enough that the Pillar II concentration add-on to Pillar I capital is often the binding constraint, not the IRB formula itself. The 2022 corporate-bond episode and recurrent property-sector stress mean that downturn-LGD estimation under Circular 41 has to rely on conservative floors rather than empirical recession averages. Macro volatility and FX pressure on the dong mean that PIT PDs are unstable across two-year windows, so the supervisory expectation is effectively TTC for capital and PIT for IFRS-9-style provisioning. 
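The output-floor point in the first adjustment above can be made concrete with a small calculation. The following is a minimal sketch, not a regulatory implementation: it uses the ASRF capital requirement at the 99.9 percent confidence level with no maturity adjustment, the 12.5 multiplier, and the 8 percent minimum ratio named in the text. The function names, the PD, LGD, and EAD values, the 4 percent asset correlation, and the 75 percent standardized risk weight are all illustrative assumptions, not Circular 41 calibrations.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, stdlib only


def asrf_capital_requirement(pd, lgd, rho, confidence=0.999):
    """Unexpected-loss capital per unit of EAD under the ASRF model.

    Sketch only: no maturity adjustment and no scaling factor.
    """
    stressed = (STD_NORMAL.inv_cdf(pd)
                + sqrt(rho) * STD_NORMAL.inv_cdf(confidence)) / sqrt(1.0 - rho)
    conditional_pd = STD_NORMAL.cdf(stressed)
    return lgd * (conditional_pd - pd)


def floored_rwa(rwa_internal, rwa_standardized, floor=0.725):
    """Apply an output floor: RWA cannot fall below floor x standardized RWA."""
    return max(rwa_internal, floor * rwa_standardized)


# Hypothetical retail exposure (all inputs are assumptions for illustration).
pd_, lgd, ead, rho = 0.02, 0.45, 100.0, 0.04
standardized_risk_weight = 0.75

k = asrf_capital_requirement(pd_, lgd, rho)
rwa_irb = 12.5 * k * ead                      # 12.5 = 1 / 0.08
rwa_sa = standardized_risk_weight * ead

capital_sa = 0.08 * rwa_sa
capital_irb_no_floor = 0.08 * rwa_irb
capital_irb_floored = 0.08 * floored_rwa(rwa_irb, rwa_sa)

saving_no_floor = capital_sa - capital_irb_no_floor
saving_floored = capital_sa - capital_irb_floored

print(f"K per unit EAD:          {k:.4f}")
print(f"Capital, standardized:   {capital_sa:.2f}")
print(f"Capital, IRB unfloored:  {capital_irb_no_floor:.2f}")
print(f"Capital, IRB with floor: {capital_irb_floored:.2f}")
```

With these assumed inputs the floor binds, so the capital saving relative to the standardized approach is smaller with the floor than without it. That gap is the quantity that changes the economics of an IRB pilot where the floor is not yet in force.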
### Rationalization

The regulatory architecture of this chapter (IRB capital, adverse-action notices, model-risk management, documentation artifacts) is a good fit for Vietnam because the local regime is moving toward the same substance under different labels. Teams that build to the chapter's surface (Circular 41 capital, Circular 43 disclosure strings, Decree 13 consent and subject-rights logging, SR 11-7-style model cards and validation reports) will satisfy SBV expectations today and will absorb the expected tightening of the fintech sandbox and data-protection rules with modest incremental effort. Where simpler methods dominate: adverse-action reason codes from a logistic scorecard with WoE bins are more defensible in a Vietnamese adverse-notice dispute than TreeSHAP explanations from a gradient-boosted model, because the linear decomposition is inspectable by a supervisor who has not seen SHAP and because the reason-code strings map onto the field-level disclosures in Circular 43. The more elaborate reason-code machinery in @sec-sr117 is worth building only for the subset of Vietnamese lenders that have already moved to ensemble models in production. Documentation artifacts, particularly the datasheet, the model card, and the validation report, are under-built in Vietnamese practice today and are the highest-leverage addition a risk team can make.

### Practical notes

Reporting lines for a Vietnamese credit-risk team run to the SBV Banking Supervision Agency for commercial banks, to the SBV Department of Credit for licensed finance companies, to the SBV Payment Department for e-wallet and payment-related data flows, and to the Ministry of Public Security for Decree 13/2023 personal-data compliance, including the annual personal-data processing impact assessment. The CIC contribution and subscription agreements are a separate reporting line inside the SBV umbrella.
Model-risk governance is codified partly through Circular 13/2018/TT-NHNN on internal control systems and partly through the Circular 41/2016 approval process for internal-model pilots; there is no single document with the scope of SR 11-7, so most top-tier banks write internal model-risk policies that lift the SR 11-7 structure. The sandbox pathway under Decree 94/2025 is the realistic entry point for novel credit-scoring approaches that sit outside Circular 41, including alternative-data scorecards and AI-driven underwriting. Cross-border banks in Vietnam should expect to maintain parallel documentation packages: a Basel II Pillar III disclosure aligned with SBV Circular 41, a home-jurisdiction SR 11-7 or PRA SS3/18 package, and a Decree 13 data-processing register. The chapter's @fig-irb-capital capital curve and the documentation templates in @sec-adverse-action are the same in Ho Chi Minh City and in New York; the statutory wrappers are not. ## Takeaways - Basel IRB's capital formula is a direct consequence of the Vasicek ASRF model at 99.9% VaR. It is deterministic given PD, LGD, EAD, M, and the segment. The differences across segments are entirely driven by the asset-value correlation parameter and the retail/corporate split. - Regulation B §1002.9 requires specific, principal reasons for any ECOA adverse action, including those generated by complex algorithms. The CFPB's 2022-03 circular removes any ambiguity: "black box" is not a safe harbor. - GDPR Article 22, the EU AI Act Annex III §5(b), and the Article 27 FRIA are three overlapping obligations that together govern credit scoring in the EU. A U.S. lender serving EU residents is in scope. - SR 11-7 and OCC 2011-12 structure model risk management around development, validation, and governance. "Effective challenge" is the test that a model survived adversarial internal review. - Reason codes from logistic regression follow from the decomposition of the logit. 
Reason codes from gradient boosted trees follow from TreeSHAP. Both approaches preserve the property that per-feature contributions sum to the prediction minus a baseline. - The documentation artifacts (model card, datasheet, validation report) are not optional. Under the EU AI Act they form the Article 11 technical documentation; under SR 11-7 they are the governance record; under ECOA they underpin the adverse action notice. ## Further reading - The IRB foundations in @basel2006international and @basel2017finalising, with the @bcbs128 explanatory note. - @gordy2003risk for the risk-factor model foundation. - @vasicek2002loan for the original loan portfolio value model. - @calabrese2014downturn on downturn LGD modeling and @bastos2010forecasting on recovery rates. - @hurlin2026fairness for fairness in credit scoring. - @wachter2017right, @selbst2017meaningful, @malgieri2017right for the GDPR Article 22 debate. - @aiact2024 (the AI Act text) and @gdpr2016 (the GDPR text). - @sr117 and @occ201112 for U.S. model risk management. - @mitchell2019model (model cards) and @gebru2021datasheets (datasheets for datasets). - @rudin2019stop and @rudin2022interpretable for the interpretability-first position. - @bartlett2022consumer and @howell2024lender for empirical evidence on algorithmic fair lending. ================================================================================ # Source: chapters/06-discriminant-analysis.qmd ================================================================================ # Discriminant Analysis and the Altman Z-Score **Scope: corporate.** Altman MDA, Z'/Z'', Ohlson, Shumway, and Campbell-Hilscher-Szilagyi on the UCI 572 Taiwanese Bankruptcy panel. Consumer applicability is discussed only in @sec-ch06-limitations. ## Overview {.unnumbered} Linear discriminant analysis was the first statistical tool a bank analyst could hand to a credit committee with a coefficient table and a decision rule. 
It still is, in many corporate risk groups, because regulators, auditors, and working capital officers can read it. @altman1968zscore turned Fisher's 1936 idea into a working bankruptcy filter by fitting a five-ratio discriminant function on a matched sample of 66 manufacturers. More than five decades later, the Z-score survives as a monitoring metric, a covenant trigger, and a classroom staple. The method is no longer state of the art for out-of-sample accuracy, but it is a lower bound on interpretability and a useful calibration against fancier models. This chapter rebuilds that machinery end to end. The formal part derives Fisher's criterion from the between-to-within variance ratio, proves its equivalence to the Bayes rule under Gaussian equal-covariance class-conditionals, and extends to quadratic discriminant analysis (@sec-ch06-qda) when covariances differ. The empirical part replays the Altman MDA on the @liang2016financial Taiwanese Bankruptcy Prediction panel (UCI 572: 6,819 firm-years, 220 bankruptcies), then steps through the Z', Z'', and ZETA extensions. The benchmark part puts LDA head to head with logistic regression, Ohlson's logit, Shumway's hazard model, and the Campbell-Hilscher-Szilagyi distance measure (@sec-ch06-chs), and documents where LDA still wins and where it loses badly. A pragmatic warning first. LDA on raw consumer-credit features, with their mixture of one-hot dummies and skewed amounts, is almost always dominated by a penalized logit or a gradient-boosted tree on the same design matrix. The reason is not that LDA is wrong in principle. It is that its generative Gaussian assumption is wrong in that particular setting. Where features really are close to jointly Gaussian, LDA remains statistically efficient [@efron1975efficiency]. The chapter gives the conditions and shows them in code. An emerging-market framing sits underneath the whole chapter. 
In Vietnam and peer economies, corporate books are dominated by thin-file private SMEs whose audited financials arrive late, if at all. Household lending is pulled around by the Tet holiday liquidity cycle, informal-income cash flows, and macro volatility. An LDA or Z''-style model is often the only thing a credit committee in Ho Chi Minh City or Hanoi will approve for middle-market corporate scoring, because the coefficient table is auditable and the sample sizes do not support heavier machinery. The emerging-market section at the end of the chapter returns to this with the CIC bureau, SBV Circular 11/2021, and practical notes on fitting Z'' to Vietnamese manufacturers. ### Notation {.unnumbered} Let $X \in \mathbb{R}^p$ be the feature vector and $Y \in \{0, 1\}$ the default indicator, with 1 coding default. Write $\pi_k = \Pr(Y = k)$, $\mu_k = \mathbb{E}[X \mid Y = k]$, and $\Sigma_k = \operatorname{Var}(X \mid Y = k)$. When the common-covariance assumption holds, $\Sigma_0 = \Sigma_1 = \Sigma$. Sample estimates are hatted. The within-class scatter is $S_W$ and the between-class scatter is $S_B$. $\Phi$ is the standard normal CDF. For firm-level work, $X_1, \dots, X_5$ name the Altman ratios in the order he wrote them. ## Motivation {.unnumbered} Banks run two kinds of default models at a minimum: one for corporates and large SMEs, scored on financial statements, and one for consumer accounts, scored on application plus bureau data. @beaver1966financial showed that individual accounting ratios discriminate between bankrupt and healthy firms one to five years out, but he scored one ratio at a time. The weakness is obvious: ratios are correlated, the information is redundant, and a single-ratio cutoff throws away the multivariate signal. @altman1968zscore fixed this with Fisher's multiple discriminant analysis (MDA). 
He picked five ratios out of an initial list of 22, fit a linear discriminant on a paired sample of 33 bankrupt and 33 non-bankrupt manufacturers over 1946 to 1965, and published a scoring function that bank analysts could compute by hand. The published function, his decision zones, and his reported hit rates (95 percent on the original sample, about 80 percent at two-year horizons on holdout) made the Z-score the reference point every later bankruptcy model had to beat.

Three things changed after 1980. @ohlson1980financial showed that a logit on nine variables beat the Z-score on a bigger sample, because binary outcomes with mixed-type predictors fit the logit log-likelihood better than the Gaussian likelihood behind LDA. @shumway2001forecasting reframed bankruptcy as a time-to-event process and built a multi-period hazard model, which avoids the selection bias baked into static matched samples. The derivation, pooled-logit equivalence, and its place in the lineage appear in @sec-ch06-empirical of this chapter; the full implementation (long-table construction, time-varying covariates, term-structure recovery, and the current state of the art) is developed in @sec-ch09-shumway, with the connection to distance-to-default covered in @sec-ch08-empirical. @campbell2008search combined accounting and market-based inputs, including volatility and equity returns, and improved out-of-sample ranking further. The sequence from Altman through Campbell is a textbook case of models climbing a ladder of statistical sophistication while the underlying economics stay close to "leverage, profitability, liquidity, size."

This chapter keeps the whole ladder in one place. @sec-ch06 derives LDA from scratch. @sec-ch06-altman reconstructs Altman's Z. Sections [-@sec-ch06-extensions] and [-@sec-ch06-empirical] step through its extensions and its empirical competitors.
@sec-ch06-limitations returns to the original question: when does the linear-Gaussian generative model win against the discriminative logit?

## Formal setup {.unnumbered}

A credit classifier produces a score $s(x) \in \mathbb{R}$ for each applicant vector $x \in \mathbb{R}^p$. A decision rule declares default when $s(x) > t$ for some threshold $t$. Quality of the score is measured by a ranking metric (AUC, KS) and by calibration to the observed default rate in bins.

Three ingredients separate LDA from its alternatives.

1. **A generative assumption on the class-conditional distribution**. LDA posits $X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma)$ with shared covariance. QDA relaxes to $\Sigma_k$. Naive Bayes factors the density across features. Logistic regression makes no density assumption at all and models $\Pr(Y \mid X)$ directly.
2. **An estimation procedure**. LDA uses the sample class means and pooled covariance, which are the maximum-likelihood estimators under the Gaussian assumption (up to the degrees-of-freedom correction in the covariance). Logit maximizes the conditional likelihood of $Y$ given $X$. Both converge at the standard parametric rate $n^{-1/2}$ to their respective targets.
3. **A decision function**. LDA's score uses the coefficient vector $\hat\Sigma^{-1}(\hat\mu_1 - \hat\mu_0)$; logit's uses the MLE of the log-odds coefficient. When the LDA assumptions hold, both targets coincide and the question is efficiency. When they fail, LDA's estimand is no longer the Bayes rule and logit wins by consistency.

The chapter walks through these three ingredients in order, first for the two-class case that matches corporate bankruptcy, then for the multi-class case that matches rating-grade assignment, then back to binary with the full credit-scoring machinery around it.

## Linear discriminant analysis

### Fisher's criterion

@fisher1936use asked for a linear projection $w^\top X$ of the feature vector that separates the two classes as well as possible.
Measure separation by the ratio of between-class to within-class variance along the projected axis. If $\mu_0, \mu_1 \in \mathbb{R}^p$ are the class means and $\Sigma_0, \Sigma_1$ are the class covariances, the projected between-class squared distance is $\left(w^\top(\mu_1 - \mu_0)\right)^2$, and the projected within-class variance is $w^\top(\Sigma_0 + \Sigma_1) w$ up to class weights. Fisher's criterion is

$$
J(w) = \frac{\bigl(w^\top(\mu_1 - \mu_0)\bigr)^2}{w^\top S_W w} = \frac{w^\top S_B w}{w^\top S_W w},
$$

where $S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^\top$ is the rank-one between-class scatter and $S_W = \pi_0 \Sigma_0 + \pi_1 \Sigma_1$ is the within-class scatter. The objective is scale-invariant in $w$, so fix $w^\top S_W w = 1$. The Lagrangian is

$$
\mathcal{L}(w, \lambda) = w^\top S_B w - \lambda\bigl(w^\top S_W w - 1\bigr).
$$

Stationarity $\partial\mathcal{L}/\partial w = 0$ gives the generalized eigenvalue problem

$$
S_B w = \lambda S_W w.
$$

When $S_W$ is positive definite, left-multiply by $S_W^{-1}$ to get the standard eigenvalue problem $S_W^{-1} S_B w = \lambda w$. Because $S_B$ has rank 1 in the two-class case, there is exactly one non-zero eigenvalue, and the corresponding eigenvector is proportional to $S_W^{-1}(\mu_1 - \mu_0)$. The maximum value of the criterion equals that eigenvalue. Under the common-covariance assumption, where $S_W = \Sigma$, it is the squared Mahalanobis distance between the class means [@mahalanobis1936generalized]:

$$
\max_{w \ne 0} J(w) = (\mu_1 - \mu_0)^\top \Sigma^{-1} (\mu_1 - \mu_0) = \Delta^2.
$$

In the $K > 2$ class case, $S_B$ has rank up to $K - 1$, and the discriminant projection has $K - 1$ directions. This is the "multiple" in MDA [@rao1948utilization].

The geometric content deserves a second pass. Write the within-class scatter as a symmetric positive-definite matrix and factor it as $S_W = L L^\top$ via Cholesky. Substitute $u = L^\top w$. The criterion becomes

$$
J(w) = \frac{u^\top (L^{-1} S_B L^{-\top}) u}{u^\top u}.
$$

The criterion now has the structure of an ordinary Rayleigh quotient. The optimal $u^\star$ is the top eigenvector of the symmetric matrix $L^{-1} S_B L^{-\top}$, and we recover $w^\star = L^{-\top} u^\star$. Equivalently, Fisher's projection is the linear direction that would be maximally separating in a whitened coordinate system where the within-class scatter is isotropic. This is also how @bickel2004some interpret LDA's failure in high dimensions: the whitening step breaks when $L$ is near-singular, and the finite-sample direction diverges from the true Bayes direction even with moderate dimension.

### Equivalence with the decorrelated signal-to-noise direction

Start from a different angle. Suppose $X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma)$. Let $Z = \Sigma^{-1/2}(X - \bar\mu)$ where $\bar\mu = (\mu_0 + \mu_1)/2$. Under the change of variables, $Z \mid Y = k \sim \mathcal{N}\bigl(\tfrac{1}{2}(-1)^{1-k} \Sigma^{-1/2}(\mu_1-\mu_0), I\bigr)$. The two class distributions are now unit-covariance Gaussians symmetric about the origin, separated along the direction $d = \Sigma^{-1/2}(\mu_1 - \mu_0)$. The Bayes rule reduces to thresholding the projection $d^\top Z$, and in the original coordinate system that projection is $(\Sigma^{-1/2})^\top d \cdot (X - \bar\mu) = \Sigma^{-1}(\mu_1-\mu_0) \cdot (X - \bar\mu)$. Same answer, different derivation, same coefficient $\beta = \Sigma^{-1}(\mu_1 - \mu_0)$.

The Mahalanobis distance @eq-mahalanobis controls the discriminability. When $\Delta$ is small, no linear rule separates well; any competing non-linear rule that does better must be exploiting non-Gaussianity, not geometry. When $\Delta$ is large, almost any sensible rule works, and the optimization details stop mattering. @anderson1951classification formalized this and gave the asymptotic error rate for Fisher's rule as $\Phi(-\Delta/2)$ when the priors are equal, which is the quantity most later empirical papers use as a benchmark.
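The two routes to the Fisher direction described above can be checked numerically. The sketch below is illustrative rather than a reference implementation: the class means and the common covariance are hypothetical population values, and the function names are our own. It computes $S_W^{-1}(\mu_1 - \mu_0)$ directly, recovers the same direction from the Cholesky-whitened Rayleigh quotient, and evaluates the equal-prior error rate $\Phi(-\Delta/2)$.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical population parameters for a two-ratio illustration.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([1.0, 0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])   # common covariance, so S_W = Sigma


def fisher_direction(mu0, mu1, S_W):
    """Closed-form Fisher direction S_W^{-1} (mu1 - mu0)."""
    return np.linalg.solve(S_W, mu1 - mu0)


def fisher_direction_whitened(mu0, mu1, S_W):
    """Same direction via S_W = L L^T and a symmetric eigenproblem."""
    diff = mu1 - mu0
    S_B = np.outer(diff, diff)               # rank-one between-class scatter
    L = np.linalg.cholesky(S_W)              # lower-triangular factor
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_B @ L_inv.T                # L^{-1} S_B L^{-T}
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    u_star = eigvecs[:, -1]                  # top eigenvector
    w_star = np.linalg.solve(L.T, u_star)    # w = L^{-T} u
    return w_star, eigvals[-1]


def fisher_criterion(w, mu0, mu1, S_W):
    """J(w): squared projected mean gap over projected within-class variance."""
    return (w @ (mu1 - mu0)) ** 2 / (w @ S_W @ w)


w_closed = fisher_direction(mu0, mu1, Sigma)
w_white, top_eig = fisher_direction_whitened(mu0, mu1, Sigma)

# Squared Mahalanobis distance and the equal-prior asymptotic error rate.
delta_sq = (mu1 - mu0) @ np.linalg.solve(Sigma, mu1 - mu0)
error_rate = NormalDist().cdf(-float(np.sqrt(delta_sq)) / 2.0)

# Cosine between the two directions (sign of an eigenvector is arbitrary).
cosine = abs(w_closed @ w_white) / (
    np.linalg.norm(w_closed) * np.linalg.norm(w_white))

print("closed-form direction:", w_closed)
print("|cosine| between routes:", cosine)
print("top eigenvalue vs Delta^2:", top_eig, delta_sq)
print("equal-prior error rate:", error_rate)
```

The two directions agree up to sign and scale, and the top eigenvalue of the whitened matrix equals $\Delta^2$, which is the maximized value of $J(w)$.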
### Sample-size corrections and plug-in bias In practice, $\Sigma$ is unknown and we plug in a sample estimate. The unbiased within-class covariance is $$ \hat\Sigma = \frac{1}{n-2}\left[\sum_{i: y_i=0} (x_i - \hat\mu_0)(x_i - \hat\mu_0)^\top + \sum_{i: y_i=1} (x_i - \hat\mu_1)(x_i - \hat\mu_1)^\top\right]. $$ Plugging $\hat\Sigma$ and $\hat\mu_k$ into the Bayes rule produces a linear classifier whose error exceeds the Bayes error by an $O(p/n)$ term [@anderson1951classification]. @bickel2004some show that as $p/n \to \gamma > 0$, the classifier loses all discriminative power unless $\Sigma$ has structure (sparsity, block-diagonality, a factor model). In the $p \ll n$ regime relevant to Altman's 5-variable model on 66 firms, the plug-in correction is small. In the consumer-credit regime with 50 to 200 dummies on a few thousand applicants, it is not. A partial fix is regularized discriminant analysis [@friedman1989regularized], which shrinks $\hat\Sigma_k$ toward a pooled covariance and a diagonal target to trade bias against variance. The full derivation, the hyperparameter grid, and a runnable comparison against LDA and QDA appear in @sec-ch06-rda. ### Bayes decision under Gaussian equal-covariance Now change view. Suppose the class-conditional densities are multivariate Gaussian with a common covariance: $$ X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma), \qquad k = 0, 1. $$ The posterior log-odds reduce to a linear discriminant. Write the log-posterior ratio: $$ \begin{aligned} \log\frac{\Pr(Y=1\mid X)}{\Pr(Y=0\mid X)} ={}& \log\frac{\pi_1}{\pi_0} - \tfrac12 (X-\mu_1)^\top \Sigma^{-1}(X-\mu_1) \\ & + \tfrac12 (X-\mu_0)^\top \Sigma^{-1}(X-\mu_0). \end{aligned} $$ The quadratic terms in $X$ cancel under equal covariance, leaving $$ \begin{aligned} \log\frac{\Pr(Y=1\mid X)}{\Pr(Y=0\mid X)} ={}& X^\top \Sigma^{-1}(\mu_1-\mu_0) \\ & - \tfrac12(\mu_1+\mu_0)^\top \Sigma^{-1}(\mu_1-\mu_0) + \log\frac{\pi_1}{\pi_0}. 
\end{aligned} $$ The Bayes-optimal classifier thresholds this linear function of $X$. The coefficient vector $\Sigma^{-1}(\mu_1 - \mu_0)$ is exactly the Fisher direction @eq-fisher-gep up to scaling, so the two derivations coincide. The intercept differs only by the prior adjustment $\log(\pi_1/\pi_0)$ and the midpoint term, which Fisher's variance-ratio criterion does not fix because it is scale and location invariant. Three consequences matter in practice. First, LDA is linear in $X$, so the decision boundary is a hyperplane. Second, its coefficients are interpretable in the same way OLS coefficients are, because they come from inverting a single covariance matrix. Third, the estimated probability $$ \Pr(Y=1 \mid X) = \sigma\!\left(X^\top \beta + \beta_0\right), \qquad \beta = \Sigma^{-1}(\mu_1 - \mu_0), $$ is correctly calibrated when the Gaussian assumption holds. When it does not hold, the resulting probabilities are often miscalibrated even if the ranking remains good. This matters for credit scorecards because regulators expect the probability of default, not only its rank. ### Quadratic discriminant analysis Drop the equal-covariance assumption. Let $X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma_k)$. The same algebra yields $$ \log\frac{\Pr(Y=1\mid X)}{\Pr(Y=0\mid X)} = -\tfrac12 X^\top(\Sigma_1^{-1} - \Sigma_0^{-1}) X + X^\top(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0) + C, $$ where $C$ collects the scalar intercept with $\log(\pi_1/\pi_0)$, $\log|\Sigma_k|$ terms, and quadratic terms in the class means. The decision surface is now a quadric, not a hyperplane. QDA has $p(p+1)$ parameters in the covariance blocks versus $p(p+1)/2$ for LDA, so it overfits quickly when $p$ grows relative to $n$ [@friedman1989regularized]. For credit work, QDA is the natural upgrade when defaulters show a different covariance structure from survivors. That is common in practice: distressed firms have fatter tails and more correlated deterioration across ratios. 
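As a sanity check on the algebra, the sketch below codes the QDA log-odds term by term, compares it with a direct difference of Gaussian log-densities, and confirms that it collapses to the linear LDA expression when the two covariances coincide. The function names and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np


def gauss_logpdf(x, mu, S):
    """Log density of N(mu, S) evaluated at x."""
    p = len(mu)
    _, logdet = np.linalg.slogdet(S)
    r = x - mu
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + r @ np.linalg.solve(S, r))


def qda_log_odds(x, mu0, mu1, S0, S1, pi1=0.5):
    """Quadratic log-odds: quadratic term + linear term + scalar intercept C."""
    S0_inv, S1_inv = np.linalg.inv(S0), np.linalg.inv(S1)
    quad = -0.5 * x @ (S1_inv - S0_inv) @ x
    lin = x @ (S1_inv @ mu1 - S0_inv @ mu0)
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    C = (np.log(pi1 / (1.0 - pi1))
         - 0.5 * (logdet1 - logdet0)
         - 0.5 * (mu1 @ S1_inv @ mu1 - mu0 @ S0_inv @ mu0))
    return quad + lin + C


def lda_log_odds(x, mu0, mu1, S, pi1=0.5):
    """Linear log-odds under a shared covariance S."""
    beta = np.linalg.solve(S, mu1 - mu0)
    beta0 = -0.5 * (mu1 + mu0) @ beta + np.log(pi1 / (1.0 - pi1))
    return x @ beta + beta0


# Illustrative inputs (assumed values).
x = np.array([0.4, -1.2])
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
S0 = np.array([[1.0, 0.2], [0.2, 1.0]])
S1 = np.array([[2.0, -0.5], [-0.5, 1.5]])
pi1 = 0.1

qda_value = qda_log_odds(x, mu0, mu1, S0, S1, pi1)
direct_value = (np.log(pi1 / (1.0 - pi1))
                + gauss_logpdf(x, mu1, S1) - gauss_logpdf(x, mu0, S0))

# With S1 = S0 the quadratic and log-determinant terms vanish.
equal_cov_qda = qda_log_odds(x, mu0, mu1, S0, S0, pi1)
equal_cov_lda = lda_log_odds(x, mu0, mu1, S0, pi1)

print(qda_value, direct_value, equal_cov_qda, equal_cov_lda)
```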
Whether QDA actually beats LDA depends on whether you have enough defaulters to estimate $\Sigma_1$ well. When the defaulter sample is too thin to support separate covariances but LDA's equal-covariance constraint is visibly wrong, the regularized path in @sec-ch06-rda is the practical middle ground. ### Regularized discriminant analysis @friedman1989regularized proposed a two-parameter shrinkage that interpolates between LDA (@sec-ch06-discriminant) and QDA (@sec-ch06-qda) and then shrinks each covariance toward its diagonal: $$ \hat\Sigma_k(\alpha, \gamma) = (1 - \gamma)\left[(1 - \alpha)\hat\Sigma_k + \alpha \hat\Sigma_{\text{pool}}\right] + \gamma \operatorname{diag}\!\left(\hat\Sigma_k\right). $$ The two hyperparameters index a rectangle of models. At $\alpha = 1, \gamma = 0$ the pooled covariance recovers LDA. At $\alpha = 0, \gamma = 0$ the class-specific covariances recover QDA. At $\alpha = 1, \gamma = 1$ the pooled diagonal reproduces diagonal LDA, which under Gaussian marginals is Gaussian naive Bayes. The interior of the rectangle covers the intermediate regularization paths. The first parameter $\alpha$ controls covariance pooling. Pure QDA uses $\hat\Sigma_k$ estimated on the $n_k$ observations of class $k$, which has $p(p+1)/2$ free parameters per class. When the rarer class carries a few dozen observations (the Altman 33 defaulters, a stressed emerging-market corporate book, a tail-event sample), $\hat\Sigma_1$ is noisy and QDA's quadratic decision surface follows the noise. Shrinking toward $\hat\Sigma_{\text{pool}}$ borrows strength from the larger class at the cost of a small bias if the covariances truly differ. The second parameter $\gamma$ controls diagonal shrinkage. The off-diagonal entries of $\hat\Sigma_k$ are noisier than the diagonal in high dimension [@bickel2004some], and setting $\gamma > 0$ throws away the noisiest entries. 
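A minimal sketch of the shrinkage formula, coded exactly as displayed above, may help fix ideas. The helper name `rda_covariance`, the simulated data, and the deliberately tiny minority class ($n_1 < p$, so that $\hat\Sigma_1$ is singular) are assumptions made for illustration.

```python
import numpy as np


def rda_covariance(S_k, S_pool, alpha, gamma):
    """Shrunk class covariance, following the displayed formula term by term."""
    blended = (1.0 - alpha) * S_k + alpha * S_pool          # pooling step
    return (1.0 - gamma) * blended + gamma * np.diag(np.diag(S_k))  # diagonal step


rng = np.random.default_rng(1)
p, n0, n1 = 5, 200, 4            # hypothetical sizes; n1 < p makes S_1 singular
X0 = rng.standard_normal((n0, p))
X1 = 1.5 * rng.standard_normal((n1, p)) + 1.0

S_0 = np.cov(X0, rowvar=False)   # unbiased class covariances
S_1 = np.cov(X1, rowvar=False)
S_pool = ((n0 - 1) * S_0 + (n1 - 1) * S_1) / (n0 + n1 - 2)

lda_corner = rda_covariance(S_1, S_pool, alpha=1.0, gamma=0.0)   # pooled covariance
qda_corner = rda_covariance(S_1, S_pool, alpha=0.0, gamma=0.0)   # class covariance
interior = rda_covariance(S_1, S_pool, alpha=0.5, gamma=0.2)

rank_S1 = np.linalg.matrix_rank(S_1)                   # at most n1 - 1 < p
min_eig_interior = np.linalg.eigvalsh(interior).min()  # positive after shrinkage

print(f"rank(S_1) = {rank_S1} of {p}, min eigenvalue after shrinkage = {min_eig_interior:.4f}")
```

The raw minority-class covariance cannot be inverted, so unregularized QDA is undefined on this sample, whereas any interior $(\alpha, \gamma)$ with a positive-definite pooled matrix yields an invertible estimate.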
The limit $\gamma = 1$ is diagonal LDA, which assumes feature independence within a class; the limit $\gamma = 0$ keeps the full sample covariance.

For small samples with modest $p$, a cross-validated RDA typically outperforms both pure LDA and pure QDA. It is a good default when the modeler is uncertain about the covariance structure, because the optimal $(\alpha, \gamma)$ tells the modeler which assumption was closer to the data without a separate hypothesis test.

RDA finds an interior $(\alpha, \gamma)$ that beats both corners. On a Gaussian-equal-covariance sample, the optimum would collapse to the LDA corner; on a sample with distinct covariances and a small minority class the optimum is typically in the interior. For credit work, this matters most in two settings: corporate distress scoring with a dozen or two defaulters per year, and consumer-credit segments like fraud-adjacent cohorts where the rarer class is both thin and heteroskedastic. Either way the cost is one cross-validation grid over an $11 \times 11$ rectangle, which is negligible next to the downstream calibration and monitoring pipeline.

A caveat: RDA inherits LDA's generative Gaussian assumption. It handles covariance misspecification but not the failure modes documented in @sec-ch06-limitations (heavy categoricals, skewed amounts, rare-event bias). On a mixed-type consumer design matrix, a well-tuned regularized logit remains the better default; RDA is the right tool when the predictors are continuous financial ratios and the sample is too thin for unconstrained QDA.

### From-scratch Fisher LDA

The following block implements LDA from the generalized eigenvalue system @eq-fisher-gep and compares it to `sklearn`. It also verifies the closed-form equivalence $w \propto S_W^{-1}(\mu_1 - \mu_0)$.

The two directions agree exactly up to sign because the rank-one $S_B$ forces the sole non-trivial eigenvector to lie along $S_W^{-1}(\mu_1 - \mu_0)$.
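Since the executable chunk is not part of this bundle, here is a minimal NumPy-only sketch of the same idea. It is not the book's original code: the simulated data, the seed, and the variable names are assumptions. It solves the generalized eigenproblem through the Cholesky whitening described earlier and compares the result with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-class Gaussian sample with a shared covariance (assumed values).
p, n0, n1 = 3, 400, 400
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
X0 = rng.multivariate_normal(np.zeros(p), Sigma, size=n0)
X1 = rng.multivariate_normal(np.array([1.0, 0.5, -0.5]), Sigma, size=n1)

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
diff = m1 - m0

# Pooled within-class scatter and rank-one between-class scatter.
S_W = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (n0 + n1 - 2)
S_B = np.outer(diff, diff)

# Solve S_B w = lambda S_W w by whitening: S_W = L L^T and u = L^T w.
L = np.linalg.cholesky(S_W)
L_inv = np.linalg.inv(L)
M = L_inv @ S_B @ L_inv.T             # symmetric matrix L^{-1} S_B L^{-T}
evals, evecs = np.linalg.eigh(M)      # eigenvalues in ascending order
u_star = evecs[:, -1]                 # top eigenvector in whitened coordinates
w_eig = L_inv.T @ u_star              # back-transform: w = L^{-T} u

# Closed form: w proportional to S_W^{-1}(m1 - m0).
w_closed = np.linalg.solve(S_W, diff)

# Eigenvectors are defined only up to sign and scale, so compare directions.
cosine = (w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
top_eigenvalue = evals[-1]
delta2_hat = float(diff @ w_closed)   # sample squared Mahalanobis distance

print(f"|cosine| = {abs(cosine):.12f}")
print(f"top eigenvalue = {top_eigenvalue:.6f}, Delta^2 estimate = {delta2_hat:.6f}")
```

The remaining eigenvalues are numerically zero, which is the rank-one structure of $S_B$ showing up in the spectrum.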
Now verify against `sklearn`:

Both implementations return the same linear decision rule up to a positive scaling and produce identical predictions on this sample.

### Decision boundary plot

The LDA boundary is the set where @eq-lda-logit equals zero. For the shared-covariance case it is a straight line. QDA (@sec-ch06-qda) adds the quadratic terms in @eq-qda-logit, producing a conic boundary.

### QDA on heteroskedastic data

When the two classes have different covariance structures the LDA hyperplane systematically cuts into one of them. Simulate a sample where class 1 has a rotated and stretched covariance relative to class 0.

QDA beats LDA by several percentage points on this specific simulation because the Bayes boundary is genuinely quadratic. The cost is fragility: QDA's covariance in class 1 has only three free parameters in a two-dimensional problem, but the count grows as $p(p+1)/2$ and reaches 210 at $p = 20$, so extending this to $p = 20$ ratios on an $n_1 = 33$ defaulter sample, the setting Altman was in, is a recipe for overfitting. That is one reason he stuck to LDA.

### Statistical efficiency of LDA versus the logit

@efron1975efficiency studied the asymptotic relative efficiency of LDA and logistic regression under Gaussian class-conditionals. When the Gaussian model holds, LDA is more efficient than logit by up to about 40 percent at extreme class separations. When the Gaussian model fails, logit is consistent for the log-odds while LDA is not, so the ordering flips. @press1978choosing made the same observation on binary-heavy data and recommended logit for application scoring. The folklore that "logistic regression almost always beats LDA on real credit data" traces to this efficiency argument. It is about model misspecification, not about LDA being a bad estimator under its own assumptions.

The efficiency result is worth unpacking, because it contradicts a common intuition.
Both LDA and logit are consistent for the same linear Bayes rule when the Gaussian model holds, so an asymptotic comparison is between two unbiased estimators of the same coefficient vector, and the question becomes whose sampling variance is smaller. LDA exploits the additional information that the class-conditional distributions are Gaussian, giving it access to the covariance matrix estimated on all $n$ observations rather than only the information captured by the gradient of the log-likelihood at $\beta$. Logit ignores the full covariance and extracts only the first-order information at the decision boundary. Under Gaussian, LDA's information is strictly richer, which is where the efficiency gain comes from. Under misspecification, the information LDA uses is wrong, and the extra signal becomes a biased signal. A useful diagnostic is the Henze-Zirkler test or the Mardia skew and kurtosis tests for multivariate normality on each class. If the class-conditional density is heavily non-Gaussian, the efficiency argument no longer applies and a discriminative model like logit is the safer default. In corporate bankruptcy work, financial ratios after a log-plus-Winsorize transformation are typically close enough to Gaussian that LDA's efficiency is a real bonus. In consumer credit work, the mix of dummies makes the Gaussian assumption a fantasy. ### Multiclass discriminant analysis Bankruptcy is the binary case. A rating agency or a banking supervisor usually wants a multi-class classifier that assigns firms to one of several rating grades. For $K$ classes, Fisher's criterion generalizes to $$ J(W) = \operatorname{tr}\!\left[(W^\top S_W W)^{-1} (W^\top S_B W)\right], \qquad W \in \mathbb{R}^{p \times (K-1)}, $$ with $S_B = \sum_{k=1}^K n_k (\hat\mu_k - \hat\mu)(\hat\mu_k - \hat\mu)^\top$ the between-class scatter, $S_W = \sum_{k=1}^K \sum_{i: y_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^\top$ the within-class scatter, and $\hat\mu$ the overall sample mean. 
The optimal $W^\star$ collects the top $K - 1$ generalized eigenvectors of $S_B w = \lambda S_W w$. For $K = 2$ this reduces to @eq-fisher-gep, with $W^\star$ a single vector. Under Gaussian class-conditionals with shared covariance $\Sigma$, the multi-class Bayes classifier assigns $x$ to the class $k^\star$ that maximizes the linear discriminant function $$ \delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k. $$ Ratings-grade applications typically have $K$ between 7 and 22. In that range the $K - 1$ MDA directions often capture only a few axes of genuine variation: one for leverage-profitability, one for size-liquidity. Higher MDA components add noise. A useful diagnostic is a scree plot of the eigenvalues from the generalized system, keeping only those above the Marchenko-Pastur cutoff for pure noise. ### The connection with linear regression Fisher's paper [@fisher1936use] observed that the LDA coefficients for a two-class problem can be obtained as the OLS slope of an indicator variable regressed on $X$, up to a positive constant. The constant is computable and depends on the class priors and the within-class variance. The upshot is that a practitioner with only a linear regression implementation can still compute an LDA direction. Write $y_i \in \{-1, +1\}$ or $\{0, 1\}$, run OLS of $y$ on $X$, and interpret the coefficient vector as proportional to $\Sigma^{-1}(\mu_1 - \mu_0)$. This is not a recommended implementation for numerical reasons (LDA's own linear algebra is more stable), but the identity is useful in proofs and occasionally in debugging a mismatch between two library implementations. ## The Altman Z-score ### Construction Altman's 1968 sample was 66 manufacturing firms, 33 that had filed Chapter X or XI bankruptcy between 1946 and 1965 and 33 matched survivors of similar size and industry. 
He started from 22 financial ratios in five categories (liquidity, profitability, leverage, solvency, activity), ran MDA with stepwise selection, and converged on five ratios that collectively maximized the multivariate separation. The equation, as usually quoted, is

$$
Z = 1.2 X_1 + 1.4 X_2 + 3.3 X_3 + 0.6 X_4 + 1.0 X_5,
$$

with ratios defined as

| Ratio | Definition | Story |
|-------|------------|-------|
| $X_1$ | Working capital / Total assets | Short-term liquidity buffer. |
| $X_2$ | Retained earnings / Total assets | Cumulative profitability and age. |
| $X_3$ | EBIT / Total assets | Operating efficiency, independent of leverage and tax. |
| $X_4$ | Market value of equity / Book value of total liabilities | Market-implied solvency cushion. |
| $X_5$ | Sales / Total assets | Asset turnover. |

The original paper enters $X_1$ through $X_4$ as percentages (a ratio of 0.10 is entered as 10) and $X_5$ as a plain ratio, with coefficients 0.012, 0.014, 0.033, 0.006, and 0.999. Later presentations restate the equation so that all five ratios are entered as decimals and the coefficients become 1.2, 1.4, 3.3, 0.6, and 1.0, which is algebraically the same model apart from rounding 0.999 to 1.0. The version in @eq-altman-z uses the decimal convention, which is how it appears in most textbooks.

### Why five ratios and not more

A modern analyst faced with the same problem today would reach for a regularized logit or an XGBoost model with several hundred candidate features, not a hand-selected five. Altman's constraint was different. He had 66 observations and a desk analyst as the intended consumer. Five ratios was the natural upper bound on what the analyst could compute from a paper balance sheet and what MDA could fit without overfitting.

The information content of the five ratios also reflects five distinct mechanisms of corporate distress.

- Liquidity ($X_1$) captures the short-term survival buffer.
  A firm with deeply negative working capital cannot pay suppliers next month and is forced to restructure or file for protection.
- Cumulative profitability ($X_2$) captures firm age and past performance. Retained earnings over assets is low for young firms and for firms that have been paying out everything they earn. Both subgroups default at higher rates.
- Operating efficiency ($X_3$) captures the core economic engine. EBIT is independent of leverage and tax and measures how well the operating assets generate cash, which is the most fundamental driver of long-run survival.
- Market solvency ($X_4$) captures the market's forward-looking assessment. Equity value over debt is the option-theoretic buffer in Merton's sense.
- Asset turnover ($X_5$) captures managerial efficiency. High-turnover firms extract more revenue from their asset base and tend to survive shocks better.

A modern feature-engineered ratio set would add volatility measures, size effects, industry controls, and macroeconomic conditioning. The gains from those additions are real but incremental. Altman's five variables still capture the largest part of the predictable signal, which is why they show up as top predictors in later work with much richer feature sets [@tian2015variable; @das2009accounting].

### Decision zones

Altman reported two cutoffs on the training sample. Firms with $Z > 2.99$ fell firmly into the non-bankrupt class in every year-ahead cross-section. Firms with $Z < 1.81$ fell firmly into the bankrupt class. Between these values lay a zone of ignorance that he called the gray zone. The rule is

$$
Z > 2.99 \Rightarrow \text{safe}, \qquad 1.81 \le Z \le 2.99 \Rightarrow \text{gray}, \qquad Z < 1.81 \Rightarrow \text{distress}.
$$

The two thresholds are not symmetric around zero because LDA's intercept depends on the class priors, and Altman picked cutoffs that minimized the empirical Type I and Type II error separately rather than a single Bayes-optimal threshold.
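The score and the three-zone rule are simple enough to write down directly. The sketch below assumes the ratios are entered as decimals (0.10 for 10 percent), the convention under which the 1.2, 1.4, 3.3, 0.6, 1.0 coefficients are normally applied. The `Ratios` container, the function names, and the three example firms are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ratios:
    wc_ta: float      # X1: working capital / total assets
    re_ta: float      # X2: retained earnings / total assets
    ebit_ta: float    # X3: EBIT / total assets
    mve_tl: float     # X4: market value of equity / book value of total liabilities
    sales_ta: float   # X5: sales / total assets


def altman_z(r: Ratios) -> float:
    """Altman Z with all five ratios entered as decimals."""
    return (1.2 * r.wc_ta + 1.4 * r.re_ta + 3.3 * r.ebit_ta
            + 0.6 * r.mve_tl + 1.0 * r.sales_ta)


def zone(z: float) -> str:
    """Three-zone rule: safe above 2.99, distress below 1.81, gray in between."""
    if z > 2.99:
        return "safe"
    if z < 1.81:
        return "distress"
    return "gray"


# Made-up firms, one per zone.
healthy = Ratios(wc_ta=0.20, re_ta=0.30, ebit_ta=0.10, mve_tl=2.00, sales_ta=1.10)
middling = Ratios(wc_ta=0.10, re_ta=0.15, ebit_ta=0.06, mve_tl=0.80, sales_ta=1.00)
stressed = Ratios(wc_ta=-0.10, re_ta=-0.20, ebit_ta=-0.05, mve_tl=0.30, sales_ta=0.80)

for name, firm in [("healthy", healthy), ("middling", middling), ("stressed", stressed)]:
    z = altman_z(firm)
    print(f"{name}: Z = {z:.3f} -> {zone(z)}")
```

The cutoffs are those of the public-firm $Z$; the private-firm and non-manufacturing variants use different coefficients and thresholds.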
### A historical note on Altman's sample Altman's 1968 sample deserves closer inspection because several of his choices propagate into modern practice. He matched each bankrupt firm with a non-bankrupt firm of similar asset size and in the same industry (two-digit SIC). The match served two purposes: it controlled for industry and size effects that would otherwise leak into the discriminant direction, and it let him estimate a covariance structure on a small sample by pooling observations from roughly comparable operating environments. The downside is that the matched sample implicitly imposes a 50-50 prior. Altman's published intercept and decision zones inherit that prior, and his out-of-sample accuracy numbers assume it. The stepwise selection procedure Altman used is no longer the methodology of choice. Stepwise selection with a small sample and correlated features is known to produce an inflated in-sample fit and an unstable set of retained variables. The fact that Altman's five ratios have survived decades of refit work is some evidence that the chosen ratios capture genuine economic mechanisms (liquidity, cumulative profitability, operational efficiency, solvency, turnover), not just that stepwise hit a lucky local optimum. @altman2000predicting and @altman2017financial document that the same ratios reappear as top predictors in regressions with hundreds of candidate features, so the original variable choice has held up even as the coefficients have drifted. One more historical detail matters. Altman's paper reports two sets of error rates. The first is the in-sample error rate on the 66-firm training sample (6 percent). The second is a jack-knife estimate that holds out each firm in turn (20 to 25 percent). The out-of-sample rate is what held up over time; the in-sample rate is an artifact of fitting a 5-coefficient linear model on 66 observations. 
Readers who quote the 95 percent accuracy figure without the jack-knife context usually overstate the model's true predictive power by a factor of three on the error side. ### Reproducing the coefficients on a public corporate panel Altman's original 66-firm panel is not redistributable, but the @liang2016financial Taiwanese Bankruptcy Prediction dataset (UCI 572) is. It carries 6,819 firm-years from companies listed on the Taiwan Stock Exchange between 1999 and 2009, 220 of them flagged as bankrupt the following year (a 3.2 percent base rate), with 95 financial ratios per firm-year. Five of those ratios line up directly with Altman's $X_1$ through $X_5$, with one substitution: UCI 572 ships only book-value items, so $X_4$ is the book-equity-to-liability ratio used in Altman's $Z'$ refit for private firms (@altman2000predicting), not the original market-value ratio. Everything in this section therefore fits $Z'$, not the public-firm $Z$, and the appropriate decision cutoffs are $Z' < 1.23$ for distress and $Z' > 2.90$ for safe. The released features are min-max normalized to $[0,1]$, so the recovered coefficient magnitudes will not match Altman's published numbers in absolute scale; the relative weights and the implied ranking are what carry over. The two distributions overlap heavily: the bankruptcy mode sits about 0.3 to the left of the survivor mode but the right tail of the bankrupt group spills well past the survivor mode and vice versa. That is the honest empirical picture. Altman's original 6 percent in-sample error rate on a 66-firm matched panel does not generalize to a 6,819-firm unmatched cross-section at a 3 percent base rate; the AUC numbers later in this section will quantify the gap. Now refit MDA on the Taiwan panel and compare the recovered direction with Altman's published $Z'$ coefficients, after standardizing both sides so the comparison is in Mahalanobis units. 
The relative weighting is broadly consistent with Altman's $Z'$ ordering: profitability ($X_3$, EBIT/TA) and cumulative profitability ($X_2$, RE/TA) carry most of the discriminative weight, with liquidity ($X_1$, WC/TA) and the book-equity ratio ($X_4$) contributing materially. The numerical magnitudes do not match Altman's 1968 publication and they are not supposed to. The Fisher direction $\Sigma^{-1}(\mu_1 - \mu_0)$ depends on the within-class covariance of the underlying sample, and a 6,819-firm Taiwanese panel with min-max normalized ratios and a 3.2 percent base rate has a different $\Sigma$ from a 66-firm matched US manufacturing sample with raw ratios and a 50 percent base rate. The substantive lesson is the one Altman's coefficients always carried: profitability and cumulative profitability dominate, leverage and liquidity contribute, and asset turnover is the smallest of the five even after a refit on a different country, decade, and base rate. ### Demonstrating the three caveats The historical note above claims three things about Altman's 1968 design: (i) the matched sample bakes in a 50-50 prior that the published intercept inherits, (ii) stepwise selection on a 66-firm sample picks an unstable variable subset, and (iii) the in-sample accuracy headline overstates predictive power by a factor of three relative to a jack-knife estimate. The Taiwan panel is the right laboratory for each claim because it has more than 200 actual bankruptcies, which is enough to replay Altman's 33-plus-33 design hundreds of times. #### Caveat 1: the matched 50-50 prior Take one draw of 33 bankrupt and 33 healthy firms from the Taiwan panel (Altman's proportions), fit LDA on that matched subset, and compare the intercept and the implied decision boundary against an LDA fit on the full cross-section with its empirical 3.2 percent base rate. The two fits point in almost the same direction in feature space. What shifts is the intercept. 
A rule of "classify as distressed if LDA score exceeds zero" assigns roughly half the matched sample to each class by construction; the same rule applied under the empirical 3.2 percent prior misclassifies a different count because the base rate is far from 50 percent. Any practitioner who imports Altman's 1.23/2.90 cutoffs to a book whose default rate is 2 percent is implicitly operating at a 50-50 prior anchor that the cutoffs were calibrated for. #### Caveat 2: stepwise instability on a small sample Pad the five Altman ratios with five spurious candidates of similar marginal variance, then run forward selection on repeated 33-plus-33 bootstraps. Tracking which ratios survive across resamples isolates the stability problem from the signal problem. Across resamples, the five true ratios are picked most of the time but not all of the time, and at least one noise variable clears the selection threshold in a meaningful fraction of resamples. Altman fixed the feature set at publication and that froze the particular realization he drew. Later refits (Z', Z'', ZETA, the @tian2015variable and @altman2017financial updates) are essentially new draws from this distribution, which is why the retained ratios shift slightly across papers even when the economic story stays the same. #### Caveat 3: the jack-knife gap Repeat the 33-plus-33 design 300 times. For each draw, fit LDA and report two numbers: the resubstitution error on the 66 training firms and the leave-one-out error. The distance between the two is the bias Altman warned about. The resubstitution distribution concentrates near the 6 percent that Altman's paper headlines. The leave-one-out distribution sits several times higher. A simulation with a known data-generating process reproduces his reported gap exactly because the gap is a structural property of fitting a five-coefficient linear rule on 66 observations, not a quirk of the particular 1946 to 1965 sample. 
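The structural nature of the gap can be illustrated without any real data. The sketch below is not the Taiwan-panel experiment: it draws synthetic Gaussian classes (five features, 33 observations per class, an assumed mean shift) and compares resubstitution error with leave-one-out error for a hand-rolled equal-prior LDA. It shows the direction of the bias only, not Altman's magnitudes.

```python
import numpy as np


def fit_lda(X, y):
    """Two-class LDA with equal priors: returns slope vector and intercept."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    beta = np.linalg.solve(S, m1 - m0)
    beta0 = -0.5 * (m1 + m0) @ beta
    return beta, beta0


def predict(X, beta, beta0):
    """Assign class 1 when the linear score is positive."""
    return (X @ beta + beta0 > 0).astype(int)


def one_draw(rng, p=5, n_per_class=33, shift=0.9):
    """One 33-plus-33 sample: return (resubstitution error, leave-one-out error)."""
    X = np.vstack([rng.standard_normal((n_per_class, p)),
                   rng.standard_normal((n_per_class, p)) + shift])
    y = np.repeat([0, 1], n_per_class)
    beta, beta0 = fit_lda(X, y)
    resub = np.mean(predict(X, beta, beta0) != y)
    loo_miss = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # drop firm i, refit, score firm i
        b, b0 = fit_lda(X[keep], y[keep])
        loo_miss.append(predict(X[i:i + 1], b, b0)[0] != y[i])
    return resub, np.mean(loo_miss)


rng = np.random.default_rng(42)
results = np.array([one_draw(rng) for _ in range(200)])
mean_resub, mean_loo = results.mean(axis=0)

print(f"mean resubstitution error = {mean_resub:.3f}")
print(f"mean leave-one-out error  = {mean_loo:.3f}")
```

On average the leave-one-out error sits above the resubstitution error, because the latter scores each firm with a rule that has already seen it.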
The practical lesson: on any small-sample MDA or logistic scorecard, publish both numbers or neither; the in-sample figure on its own is misleading. ### Applying the Z-score The empirical pattern matches the design intent of the cutoffs but is far less crisp than the textbook figure that simulations produce. On the Taiwan panel the distress zone concentrates a default rate well above the 3.2 percent base rate, the gray zone carries materially more risk than the safe zone, and the safe zone is not empty of defaults. Two practical points follow. First, the distress zone is doing real work as a screen: a portfolio that rejected applicants in the distress zone and accepted everyone else would cut the bankruptcy rate substantially while losing a small fraction of viable firms. Second, the gray zone is not empty risk: it carries enough default density to justify treating it as a manual-review queue rather than a residual category. Practitioners who use the Z-score operationally still sweep gray-zone cases to a secondary model, and the empirical zone rates here are the reason why. ## Extensions: Z' and Z'' ### Why one model does not fit all firms The 1968 model has a market-value input, $X_4$, which requires a traded equity. Private firms do not have one, and neither do most SMEs. Service-sector firms have very different asset turnover ($X_5$), so imposing the manufacturing-calibrated coefficient shifts their Z artificially low. Emerging-market firms have a different accounting regime and different default rates. Altman responded with two refits that are now called Z' and Z''. A third, ZETA, came out of @altman1977zeta as a proprietary seven-variable model for a commercial bankruptcy service. The ZETA coefficients are not public, but its rough structure survives in practitioner writing on the extensions. ### Z' for private firms Altman replaced $X_4$ with book value of equity over book value of total liabilities and refit on a private-firm sample. 
The resulting equation is

$$
Z' = 0.717 X_1 + 0.847 X_2 + 3.107 X_3 + 0.420 X_4^{\prime} + 0.998 X_5,
$$

where $X_4^{\prime} = \text{BVE}/\text{TL}$. The cutoffs shift: $Z' > 2.90$ is safe, $Z' < 1.23$ is distress, and the gray zone widens. The lower $X_4^{\prime}$ weight reflects the noisier signal from book values compared with market values.

### Z'' for non-manufacturers and emerging markets

For non-manufacturing firms or emerging-market issuers, Altman dropped $X_5$ entirely because asset turnover differs sharply by industry and contaminates cross-industry comparisons. The Z'' model uses book value again and drops sales:

$$
Z^{\prime\prime} = 6.56 X_1 + 3.26 X_2 + 6.72 X_3 + 1.05 X_4^{\prime}.
$$

A constant of $+3.25$ is added in some versions so that the safe and distress cutoffs can be anchored at 2.60 and 1.10 respectively. The Z'' model is the one most often cited in emerging-market sovereign and corporate work [@altman2005emerging] and is still used by rating agencies as a first-pass screen for non-listed issuers.

### ZETA and descendants

@altman1977zeta introduced a seven-variable MDA that added a measure of earnings stability (standard deviation of EBIT/TA), a debt-service coverage ratio, and a measure of firm size. The ZETA model was a commercial product. Its publicly reported out-of-sample accuracy was higher than the Z-score on the 1970s sample it was trained on (about 90 percent at one year and 70 percent at five years). Modern Altman papers [@altman2000predicting; @altman2017financial] have revisited the model with much larger international samples and report that the original coefficients still carry predictive information, but optimal thresholds and coefficient magnitudes have drifted with macroeconomic conditions and accounting standards.

### Implementing Z' and Z''

On the Taiwan panel both variants land in the same neighborhood.
$Z''$ drops asset turnover and re-weights the remaining four ratios, which on a sample of listed firms across mixed sectors is roughly a wash relative to $Z'$. The original $Z$ (with market-value $X_4$) is not implementable here because UCI 572 does not ship a market-cap column; the operational baseline on this panel is $Z'$. ## Empirical performance across decades ### Benchmarks and the sequence Altman, Ohlson, Shumway, CHS The literature on corporate default prediction is a sequence of ladder steps. Each step added either better statistical machinery or better inputs. 1. @altman1968zscore: MDA, five accounting ratios, static matched sample. 2. @ohlson1980financial: logit, nine variables including a size factor and funds-from-operations, unmatched sample of \~2,000 firms. 3. @zmijewski1984methodological: probit on three variables, introduced choice-based sampling corrections. 4. @shumway2001forecasting: multi-period hazard model with accounting and market inputs, reducing selection bias from static design. 5. @hillegeist2004assessing: Merton-based KMV distance-to-default compared against accounting models. 6. @chava2004bankruptcy: industry-adjusted hazard model, larger sample. 7. @campbell2008search: hazard model with equity returns and volatility added, multi-period logit. 8. @bharath2008forecasting: test of whether the KMV structural distance contains information beyond a simplified version of it. By the time you reach @campbell2008search, the distance-to-default input (Merton-style, @merton1974pricing) is no longer treated as a complete model: it is one feature among many in a hazard regression. The Altman Z, by the same logic, is one feature. Later chapters in this book cover the hazard machinery and the structural models. This chapter's narrower question is how the original MDA Z compares to what came after on out-of-sample data. 
### What "out of sample" means in the Altman literature A reader of this literature encounters three different out-of-sample protocols, and they are not equivalent. 1. Hold-out within the training period. Split the 66-firm sample into an estimation set and a validation set. This tells you something about in-sample variance but nothing about temporal generalization. 2. Hold-out out of period. Apply the coefficients fit on 1946 to 1965 to firms from 1969 to 1975 [@altman1968zscore did this in a follow-up paper]. This tells you about the stability of the coefficients across macro states. 3. Hold-out out of country or industry. Apply the coefficients to a different jurisdiction or sector. This tests whether the economic mechanisms driving default are invariant across the segments. Different papers report different protocols and the choice matters. @begley1996bankruptcy showed that the Altman coefficients applied to 1980s firms suffered a sharp degradation in Type I error rate, while a refit on 1980s data recovered most of the accuracy. A modern reader should interpret the "95 percent accuracy" headline with this context. ### Ohlson's logit Ohlson's model, O-score, is a nine-predictor logistic regression. The predictors include size (log total assets deflated by GNP), TL/TA, WC/TA, CL/CA, an indicator for negative equity, NI/TA, FFO/TL, an indicator for a net loss in the last two years, and a change-in-net-income measure. The fitted coefficients are documented in @ohlson1980financial. The model's one-year misclassification rate on Ohlson's hold-out sample was about 12.4 percent versus Altman's 26.9 percent on the same hold-out, though the two models used different definitions of bankruptcy. Ohlson's nine variables are - $\log(\text{TA}/\text{GNP deflator})$: a size control. - $\text{TL}/\text{TA}$: leverage. - $\text{WC}/\text{TA}$: liquidity. - $\text{CL}/\text{CA}$: short-term stress. - $\text{OENEG}$: a binary indicator for $\text{TL} > \text{TA}$ (negative equity). 
- $\text{NI}/\text{TA}$: profitability. - $\text{FFO}/\text{TL}$: coverage. - $\text{INTWO}$: a binary indicator for negative net income in each of the last two years. - $\text{CHIN} = (\text{NI}_t - \text{NI}_{t-1})/(|\text{NI}_t| + |\text{NI}_{t-1}|)$: a relative change in net income. The coefficients in Ohlson's primary model are reported to four significant figures in his Table 4. The inclusion of binary flags like OENEG and INTWO is what first made the logit framework visibly superior to LDA on this data: LDA has no natural way to handle discrete indicators inside its Gaussian assumption. Logit takes them in stride. Two mechanisms explain Ohlson's edge. First, the logit likelihood is matched to the binary response, while LDA maximizes a different criterion that coincides with the Bayes rule only under Gaussian conditional distributions. Second, Ohlson used a non-matched sample, so the prior reflected the actual bankruptcy base rate. Altman's matched sample implicitly assumed a prior of 0.5, which overstates the intercept for practical scoring. ### Shumway's hazard model @shumway2001forecasting pointed out that bankruptcy is a time-to-event process, so a one-period static classifier mis-specifies the dependence between survival and covariates. He estimated a discrete-time hazard model, $$ h(t \mid X_{it}) = \Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, X_{it}) = \sigma(X_{it}^\top \beta + \alpha_t), $$ on annual firm-year panels, with $\alpha_t$ a baseline year effect. The econometric content is the same as a pooled logit on firm-years with time fixed effects, but the interpretation differs: each firm contributes every observation year until it either defaults or exits the sample. Shumway reported that his hazard model beat both the Altman Z and the Ohlson O on out-of-sample ranking across 1962 to 1992. ### The structural distance-to-default Before reaching CHS (@sec-ch06-chs), it is worth pausing on the market-based alternative Altman could not use in 1968. 
@merton1974pricing models equity as a call option on the firm's assets. Under the Black-Scholes framework [@black1973pricing], equity value $E$ and asset value $V$ are linked by $$ E = V \Phi(d_1) - D e^{-rT} \Phi(d_2), \qquad d_{1,2} = \frac{\log(V/D) + (r \pm \tfrac{1}{2}\sigma_V^2) T}{\sigma_V \sqrt{T}}, $$ where $D$ is the face value of debt, $T$ the horizon, $r$ the risk-free rate, and $\sigma_V$ the asset volatility. The distance to default is $$ \mathrm{DD} = \frac{\log(V/D) + (\mu_V - \tfrac{1}{2}\sigma_V^2) T}{\sigma_V \sqrt{T}}, $$ with associated default probability $\Phi(-\mathrm{DD})$ under the physical measure. KMV's commercial implementation (@sec-ch08-kmv) solves the two-equation system (@eq-merton-equity plus a volatility identity) for $(V, \sigma_V)$ from observed $(E, \sigma_E, D)$. @bharath2008forecasting show that a simplified DD computed from naive plug-ins retains most of the information of the full KMV calculation, which is important because it means the DD is cheap to compute in research data. @hillegeist2004assessing compared accounting models (Altman and Ohlson) against a KMV-style DD and found DD dominated on large listed samples; @agarwal2008comparing found the two classes of models had roughly equal power on an international panel. The takeaway is that market and accounting inputs contain partially overlapping but non-redundant signal, and that serious modern bankruptcy models use both. ### Pooled logit as the practical benchmark Shumway's likelihood is identical to a pooled logit on firm-year panels with year fixed effects. That observation is important for practitioners because it means Shumway's model is a one-line estimation in any statistics package that supports logistic regression. The estimation treats each firm-year as an independent observation conditional on the firm surviving to that year, which is a discrete-time hazard parameterization. 
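A minimal sketch of the pooled-logit estimation described above, on a simulated firm-year panel. The panel size, the single covariate, and the true parameters are illustrative assumptions, and scikit-learn's `LogisticRegression` with a very large `C` stands in for an unpenalized logit. Each firm contributes rows only up to and including its default year, and year dummies play the role of $\alpha_t$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_firms, n_years = 4000, 10
beta_true = 1.0                                   # assumed covariate effect
alpha_true = np.linspace(-4.0, -3.0, n_years)     # assumed baseline year effects

# Simulate a covariate and a default draw for every potential firm-year.
x = rng.normal(size=(n_firms, n_years))
hazard = 1.0 / (1.0 + np.exp(-(alpha_true[None, :] + beta_true * x)))
event = rng.random((n_firms, n_years)) < hazard

# A firm stays in the risk set up to and including its first default year;
# firms that never default are censored at the last year.
first = np.where(event.any(axis=1), event.argmax(axis=1), n_years - 1)
at_risk = np.arange(n_years)[None, :] <= first[:, None]

# Long firm-year panel: one row per firm-year at risk.
firm_idx, year_idx = np.nonzero(at_risk)
y = event[at_risk].astype(int)
X = np.column_stack([x[at_risk], np.eye(n_years)[year_idx]])  # covariate + year dummies

# The hazard model itself is a single logistic-regression call on the long panel.
model = LogisticRegression(C=1e6, fit_intercept=False, max_iter=2000).fit(X, y)
beta_hat = model.coef_[0, 0]
alpha_hat = model.coef_[0, 1:]
print(f"beta_hat = {beta_hat:.3f} (true {beta_true})")
```

The work is in building the risk set, not in the fit: dropping post-default rows is what turns an ordinary logit into a discrete-time hazard model.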
For a panel of $N$ firms in which firm $i$ is observed for $T_i$ years (so the panel need not be balanced), the likelihood is

$$ \mathcal{L}(\beta, \alpha) = \prod_{i=1}^N \prod_{t=1}^{T_i} h(t \mid X_{it})^{y_{it}} \bigl(1 - h(t \mid X_{it})\bigr)^{1 - y_{it}}, $$

where $h(\cdot)$ is @eq-shumway-hazard, $T_i$ is the last year in which firm $i$ is observed (its default year, or its censoring year if it never defaults), and $y_{it} = 1$ only in the single default year. The log-likelihood is a standard logit log-likelihood with firm-year rows, which is how it is estimated in practice.

The practical lesson is that the gap between Altman's MDA and a modern bankruptcy model is not a gap between linear and non-linear models. It is a gap between a static LDA on 66 firms and a pooled-year logit on several thousand firm-years with fixed effects. The linear form is the same. The estimation framework and the data structure are what changed.

### Campbell-Hilscher-Szilagyi distance

@campbell2008search (CHS) fold market-based variables into Shumway's hazard framework and argue that the combined accounting-plus-market model dominates either input class on its own. Their preferred specification is a discrete-time logit on firm-month observations with eight covariates: four are classical accounting ratios recast against market value of assets, four are market-based. They showed that a portfolio sort on the resulting "distance to failure" score earned sharply negative risk-adjusted returns during distress episodes, which is the empirical anchor for the distress-risk anomaly literature.

**The eight CHS covariates.** Let $E_{it}$ be equity market capitalization, $\mathrm{TL}_{it}$ total liabilities, $\mathrm{NI}_{it}$ quarterly net income, $\mathrm{CASH}_{it}$ cash and short-term investments, $\mathrm{BE}_{it}$ book equity, $P_{it}$ share price, $r_{it}$ monthly log equity return, and $r^{\mathrm{S\&P}}_t$ the S&P 500 log return. Market value of total assets is $\mathrm{MTA}_{it} = E_{it} + \mathrm{TL}_{it}$.
The four accounting-adjusted ratios are

$$ \mathrm{NIMTA}_{it} = \frac{\mathrm{NI}_{it}}{\mathrm{MTA}_{it}}, \quad \mathrm{TLMTA}_{it} = \frac{\mathrm{TL}_{it}}{\mathrm{MTA}_{it}}, \quad \mathrm{CASHMTA}_{it} = \frac{\mathrm{CASH}_{it}}{\mathrm{MTA}_{it}}, \quad \mathrm{MB}_{it} = \frac{\mathrm{MTA}_{it}}{\mathrm{TL}_{it} + \mathrm{BE}^{+}_{it}}, $$

where $\mathrm{BE}^{+}$ follows @daniel2001explaining and adds 10 percent of the market-book gap to avoid negative-equity singularities. The four market-based covariates are

$$ \mathrm{EXRET}_{it} = r_{it} - r^{\mathrm{S\&P}}_t, \quad \mathrm{SIGMA}_{it} = \sqrt{252} \cdot \mathrm{sd}(r^d_{i, t-2:t}), \quad \mathrm{RSIZE}_{it} = \log\!\frac{E_{it}}{\mathrm{MktCap}^{\mathrm{S\&P}}_t}, \quad \mathrm{PRICE}_{it} = \log\min(P_{it}, 15), $$

with $\mathrm{SIGMA}$ the annualized standard deviation of daily returns over the trailing three months. Profitability and excess returns enter as geometrically declining moving averages:

$$ \mathrm{NIMTAAVG}_{it} = \frac{1 - \phi^3}{1 - \phi^{12}} \sum_{k=0}^{3} \phi^{3k} \mathrm{NIMTA}_{i, t-3k}, \qquad \mathrm{EXRETAVG}_{it} = \frac{1 - \phi}{1 - \phi^{12}} \sum_{k=0}^{11} \phi^{k} \mathrm{EXRET}_{i, t-k}, $$

with $\phi = 2^{-1/3}$, so the weight halves every three months. Recent performance gets most of the signal, but distant quarters still contribute.

**Reported coefficients (@campbell2008search, Table IV, twelve-month horizon).** The published signs and rough magnitudes are

| Covariate | Sign     | Magnitude       |
|-----------|----------|-----------------|
| NIMTAAVG  | negative | $\approx -20$   |
| TLMTA     | positive | $\approx 1.4$   |
| EXRETAVG  | negative | $\approx -7$    |
| SIGMA     | positive | $\approx 1.4$   |
| RSIZE     | negative | $\approx -0.05$ |
| CASHMTA   | negative | $\approx -2.4$  |
| MB        | positive | $\approx 0.05$  |
| PRICE     | negative | $\approx -0.9$  |
| Intercept |          | $\approx -9.1$  |

Leverage (TLMTA), volatility (SIGMA), and overvaluation (MB) push default risk up.
Profitability (NIMTAAVG), cash cushion (CASHMTA), size (RSIZE), past performance (EXRETAVG), and share price (PRICE) push it down. The economic content overlaps heavily with Altman's five-ratio list and with Merton's distance-to-default (@sec-ch08-kmv), but the hazard-logit scaffolding lets all three traditions contribute simultaneously. **Replication status.** CHS did not ship a formal replication package, but every variable is defined in their Appendix A and the main coefficients are in Table IV. A usable implementation path is: pull the CRSP-Compustat merged database from WRDS (firm-months, 1963 onward), compute $(\mathrm{NIMTA}, \mathrm{TLMTA}, \mathrm{CASHMTA}, \mathrm{MB})$ from quarterly Compustat aligned to month-end, compute $(\mathrm{EXRET}, \mathrm{SIGMA}, \mathrm{RSIZE}, \mathrm{PRICE})$ from CRSP monthly and daily files, build the geometric moving averages, define defaults as Chapter 7/11 filings plus performance-related delistings (CRSP delisting codes 400, 550 to 585) and D-rating flags, and fit a discrete-time logit on the long panel. @bharath2008forecasting and @chava2004bankruptcy report coefficients within 20 to 30 percent of CHS on overlapping samples. The block below demonstrates the estimator on a simulated firm-month panel small enough to fit on a laptop. The fit recovers the sign on all eight covariates and is within 30 percent of the data-generating value for TLMTA, SIGMA, MB, and PRICE. CASHMTA and RSIZE come back at roughly half the DGP magnitude. NIMTAAVG and EXRETAVG attenuate the most: on a 3,000-firm panel, the cross-sectional spread of profitability and excess-return averages is narrow relative to the within-month default-shock noise, so the estimator cannot pin down their large coefficients precisely. On the full CRSP-Compustat sample with millions of firm-months, the same code recovers magnitudes close to @campbell2008search Table IV. 
The point of the exercise is the scaffolding: once the eight covariates and the geometric-average weights are constructed, the CHS model is a one-line logistic regression. That is why the specification has become the reference hazard model for public-firm bankruptcy prediction, and why later papers (e.g., @bharath2008forecasting, @duffie2009frailty) cite it as the benchmark rather than as just another entrant in the horse race.

### Out-of-sample accuracy in the research record

The evidence on decade-level stability of these models comes from three sources:

- @begley1996bankruptcy for the 1980s: Altman's Type I error rate roughly doubled when the original coefficients were applied out of sample, and most of the accuracy was recovered when the model was refit.
- @agarwal2008comparing for the late 1990s: accounting-only, market-only, and combined models each came out ahead on different segments.
- @altman2017financial for an international panel of over 1.5 million firm-years: the Z'' ranks similarly to logit on a balanced sample and loses ground on unbalanced samples.

The robust summary:

- The original Altman coefficients are stale after 10 to 15 years. Refitting coefficients on fresh data recovers most of the accuracy.
- Logit beats LDA out of sample in most documented replications, usually by 2 to 5 percentage points of AUC at one-year horizons.
- Market-based inputs (volatility, returns) beat accounting-only models on listed firms by a further 3 to 8 percentage points of AUC.
- No single model dominates across time and geography, which is why modern practice builds ensembles (@sec-ch12-ensembles) and runs large-scale horse races across classifier families (@sec-ch16-bench).

### Decomposing the sequence of improvements

It is useful to step back and ask how much each methodological jump contributed to measurable accuracy. @chava2004bankruptcy ran all four ancestors side by side on a US panel from 1962 to 1999: Altman Z, Ohlson O, Shumway hazard, and a KMV-style DD.
Their one-year out-of-sample accuracy ratios (a Gini-like ranking statistic) run roughly 0.65 for Altman, 0.75 for Ohlson, 0.83 for Shumway, and 0.86 for a joint accounting-plus-market hazard model. The Altman-to-Ohlson jump is worth 10 points and is almost entirely about the likelihood being matched to the binary response and the sample being bigger and unmatched. The Ohlson-to-Shumway jump is worth 8 points and comes from using panel data instead of a point-in-time cross-section. The last 3 points come from market inputs. None of the jumps change the qualitative story (leverage, profitability, and liquidity drive default), but each added between 3 and 10 points of accuracy ratio.

For a modern bankruptcy model on a public-firm panel, the minimum defensible approach is therefore a hazard logit on a combination of Altman-style accounting ratios, a Merton-style DD, and size and return controls. That is Shumway's original specification plus one variable, and it reproduces most of Campbell-Hilscher-Szilagyi's gain (@sec-ch06-chs) at a much smaller implementation cost.

The contribution of deep learning on the same inputs, documented in @tian2015variable and later work, is modest: an improvement of 1 to 3 accuracy points at large sample sizes, usually at the cost of interpretability. For covenant triggers, regulatory reporting, and cross-industry comparability, the linear hazard model remains the sensible default.

### Benchmark on the German credit data

The corporate-bankruptcy literature evaluates models on firm-year panels with market data. Consumer credit data look different, but we can still compare LDA to logistic regression on the UCI German sample. This is a classic benchmark in the LDA-versus-logit tradition [@press1978choosing; @hand1997statistical; @baesens2003benchmarking].

LDA and logit are within a basis point or two of each other on AUC and KS on this sample.
The standardized pipeline helps both of them: the German data have dummies and order-of-magnitude differences in amounts, and without scaling LDA ends up dominated by `amount` alone. The ranking metrics tie. The calibration differs. That pattern is general: LDA and logit produce similar orderings but different probabilities whenever the features depart meaningfully from joint Gaussian, and the calibration gap is the one that shows up in regulatory backtesting. ### The Altman model on a corporate sample The Altman ratios are corporate inputs. The German Credit data are consumer loans, so the Z-score does not apply directly. The corporate-style comparison runs on the same Taiwan bankruptcy panel from earlier in the chapter. Restrict the feature matrix to the five Altman ratios and put LDA head to head with logit on the held-out half of the panel. LDA and logit are very close in AUC on the five Altman ratios; the Brier scores differ by more than the AUCs because logit's likelihood is matched to the binary outcome and LDA's is not. That is @efron1975efficiency's efficiency result running in reverse: when the class-conditional density departs from joint Gaussian (which it does on a real corporate panel with 3 percent base rate and bounded normalized inputs), the calibration penalty for LDA is real even when the ranking is not. ### Benchmark on the Taiwan default sample The Taiwan credit-card default dataset [@yeh2009comparisons] is a larger consumer benchmark, with 30,000 observations and a 22 percent default rate. We apply the same LDA-versus-logit comparison and add a random-forest baseline to see where the linear models sit relative to a non-linear one. The Taiwan benchmark shows the pattern one expects from the literature. LDA and logit are within a basis point of each other in ranking. 
The random forest improves on both by several points of AUC because the default boundary depends on interaction effects between payment status, bill amounts, and demographics that are invisible to a linear model. For a production application scorecard, the practical modeling question is whether the interpretability gain from a linear model is worth the accuracy loss relative to an ensemble. ### Profit-based evaluation and decision zones Ranking metrics (AUC, KS) treat every false positive and false negative as equally costly. Credit decisions are not symmetric. @elkan2001foundations formalized cost-sensitive learning, and @verbraken2014novel developed profit-based measures specific to credit scoring. The operating threshold that maximizes expected profit depends on the net interest margin, the loss given default, and the denial rate that the business is willing to tolerate. For the Z-score, Altman's asymmetric zones (2.99 and 1.81) can be read as a crude profit maximization. The safe cutoff is high enough that firms above it almost never default, so the lender can accept them with near-certainty of repayment. The distress cutoff is low enough that firms below it default often enough to justify rejection. The gray zone absorbs the cases where the evidence is mixed and additional information (manual review, covenants) can produce a better decision than the statistical model. That is a sensible design pattern for a score whose calibration is imperfect. The top-decile capture of the random forest is noticeably above the linear models. In profit terms that translates to a better triage decision at high-cost operating points, which is exactly where ensemble methods earn their keep. ## Limitations for consumer credit ### The Gaussian assumption versus reality Consumer credit features are not joint Gaussian. 
They are a mix of continuous amounts (loan principal, income, balances) with heavy skew, integer counts (number of open accounts, hard inquiries), binary flags (homeowner status, paystub verified), ordinal categories (employment length buckets), and high-cardinality nominals (state, purpose, funding channel). Every one of these violates the LDA generative model. The violation is not fatal for ranking, as the German benchmark shows: LDA's hyperplane recovers roughly the same ordering as the logit's hyperplane because both are linear in the same features. The violation is fatal for calibration and for probability-of-default use cases, because the sigmoid in @eq-lda-sigmoid is derived under the Gaussian assumption, and that assumption is what guarantees the sigmoid is correct. ### Failure mode: heavy categoricals Suppose a modeler adds an interaction dummy for `purpose x credit_history`, picking up small-cell combinations that contain very few defaulters. The logit handles this with shrinkage or a simple prior [@gelman2008prior]. LDA cannot, because it has no regularization built in: its coefficients come from a single covariance inverse that goes unstable as the rank of the design matrix approaches the sample size. LDA's calibration error roughly doubles once the heavy interaction dummies enter; the logit barely moves. A practitioner should read this as follows: on a raw one-hot design with high-cardinality interactions, LDA is using variance it does not have to estimate differences in means that are dominated by noise, and the resulting probability scores drift. ### Mixed types and the right generative model A cleaner fix for LDA in a mixed-type setting is to use location models: continuous features conditioned on the discrete cells, with a separate covariance per cell if you have enough data. That lifts LDA into a hierarchical version that takes back some of the territory logit gets from flexible conditional distributions. 
In practice the cost of maintaining a cell-conditioned model exceeds the benefit, which is why logit and trees dominate the consumer-credit stack. ### Illustrating the calibration failure with class-conditional histograms The calibration pattern in the reliability plot earlier in the chapter has a simple explanation once you look at the class-conditional densities of the LDA projection. On a mixed-type design, the projected score is not Gaussian within each class. The logit is robust to this because it learns the sigmoid coefficients that best map the score to the binary outcome. LDA instead assumes the projected score is Gaussian within each class and computes the posterior from the class-conditional Gaussian densities. When the real class-conditional is skewed or bimodal, the posterior formula systematically overweights or underweights the tails. The defaulter distribution has a long tail pulling to the left (lower score, higher risk). LDA's sigmoid extrapolates the density from the center to the tail assuming a Gaussian shape, which under-estimates the posterior default probability in the right tail of the defaulter distribution. That is the mechanism behind the reliability-diagram deviation. ### When LDA still wins Three conditions favor LDA in real work. 1. Small samples, low feature dimension, nearly continuous features. If you are scoring 200 middle-market corporates on 6 financial ratios, the Gaussian assumption is a soft approximation and the efficiency gain from using it is real. 2. Strict interpretability requirements with a linear scoring function. Altman's Z-score is still the default because a credit analyst can compute it in a spreadsheet. Regulators accept it because its coefficients do not change with data batches. 3. Extreme class imbalance with a small tail of defaulters. 
LDA's estimator for the class-1 mean $\mu_1$ is an unbiased sample mean and does not suffer from logit's rare-event bias [@king2001logistic], which penalizes the intercept of a maximum-likelihood logit when defaults are below, say, one percent. Outside these conditions, logit beats LDA, and boosted trees beat both on out-of-sample ranking. Altman himself later moved to logit and hazard formulations in his empirical work [@altman2017financial], while keeping the Z-score as a monitoring signal. ### Calibrating LDA outputs When an LDA model is selected for regulatory reasons despite its miscalibration on mixed data, the standard fix is a post-hoc calibration. Two choices dominate. 1. Platt scaling [@platt1999probabilistic]. Fit a univariate logistic regression of the outcome on the LDA decision function, using a held-out sample. The two fitted coefficients (slope and intercept) absorb the calibration bias. Platt scaling assumes a sigmoid shape for the miscalibration, which is usually correct if the underlying score is approximately monotone in the true risk. 2. Isotonic regression. Fit a monotone step function of the outcome on the LDA score. Isotonic is more flexible than Platt, but it needs more data to estimate reliably. With small validation sets the isotonic fit can overfit specific bins. On this holdout, the three curves overlap within sampling noise: the apparent ordering of LDA, LDA + Platt, and logit is not statistically meaningful at $n \approx 300$. Platt scaling does not visibly remove a systematic bias here because the raw LDA curve was not strongly biased to begin with, and the small sample inflates per-bin variance. The takeaway is procedural rather than empirical: for Basel IRB reporting, running raw LDA and then applying a Platt-style calibration on a holdout is a defensible pipeline, provided the calibration step is documented, evaluated on a sample large enough to make the reliability diagram interpretable, and re-checked over time. 
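The two calibration options above can be sketched as follows. This is a hedged illustration on simulated mixed-type data: the features, the sample split, and the data-generating coefficients are assumptions, not the German holdout used in the chapter. Platt scaling is a one-feature logistic regression on the LDA decision function, and isotonic regression is a monotone step fit on the same score.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 6000
# Simulated non-Gaussian inputs (assumed): a skewed amount and a binary flag.
amount = rng.lognormal(mean=0.0, sigma=1.0, size=n)
flag = rng.integers(0, 2, size=n)
true_logit = -2.0 + 0.6 * np.log(amount) + 1.0 * flag
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)
X = np.column_stack([amount, flag])

# Separate samples for fitting LDA, fitting the calibrator, and evaluation.
train, calib, test = np.split(rng.permutation(n), [3000, 4500])

lda = LinearDiscriminantAnalysis().fit(X[train], y[train])
score_calib = lda.decision_function(X[calib])
score_test = lda.decision_function(X[test])

# Platt scaling: slope and intercept of a logit of the outcome on the LDA score.
platt = LogisticRegression(C=1e6).fit(score_calib.reshape(-1, 1), y[calib])
pd_platt = platt.predict_proba(score_test.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone step function of the outcome on the LDA score.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(score_calib, y[calib])
pd_iso = iso.predict(score_test)

pd_raw = lda.predict_proba(X[test])[:, 1]

def brier(p, outcome):
    """Mean squared error between predicted PD and the 0/1 outcome."""
    return float(np.mean((p - outcome) ** 2))

for name, p in [("raw LDA", pd_raw), ("Platt", pd_platt), ("isotonic", pd_iso)]:
    print(f"{name:9s} Brier = {brier(p, y[test]):.4f}")
```

Both calibrators preserve the LDA ranking (up to ties for isotonic), so AUC is unchanged and only the probability levels move.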
### Stability under covariate drift LDA's coefficients are a function of the class means and a common covariance. Both drift with the business cycle. @begley1996bankruptcy documented that Altman's original 1968 coefficients, applied in the 1980s without refit, had a Type I error rate roughly twice the one Altman reported. The same drift applies to modern refits. A reasonable monitoring protocol for a production LDA model includes: - A monthly or quarterly refresh of the class means $\hat\mu_0, \hat\mu_1$ on a rolling window of observations, with a formal test for mean equality against the previous window (a Hotelling's $T^2$ statistic suffices). - A monthly refresh of the pooled covariance $\hat\Sigma$, with a log of the condition number and a formal test for covariance equality across time windows (Box's M test, used with caution because it is sensitive to non-normality). - A check that the decision zones remain associated with their historical default rates. A population-stability index between the current score distribution and the calibration distribution is a reasonable summary. Two facts come out of this bootstrap. First, the signs of the top-10 coefficients are stable across resamples, which is the single most important property for governance: a reviewer can attach a directional story to each driver without worrying that the next refit will flip it. Second, the magnitudes are not equally well identified. Coefficients like feat9 and feat13 sit several standard deviations away from zero, while feat44 and feat43 have whiskers that nearly cross zero, meaning a different training draw could materially down-weight them. Any single refit should therefore be read as one draw from this distribution, and production deployment of an LDA Z-score under SR 11-7 style governance should report bootstrap intervals (or an equivalent uncertainty quantification) for the coefficients that drive scoring decisions. 
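The population-stability index named in the monitoring list above can be sketched in a few lines of NumPy. The function name, the decile binning on the reference scores, and the small floor `eps` for empty bins are illustrative choices, not the book's implementation.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference score sample and a current score sample.

    Bins are quantiles of the reference (calibration) scores; the two outer
    bins are open-ended so out-of-range current scores are still counted.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    interior = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    e_idx = np.searchsorted(interior, expected, side="right")
    a_idx = np.searchsorted(interior, actual, side="right")
    e = np.bincount(e_idx, minlength=n_bins) / len(expected)
    a = np.bincount(a_idx, minlength=n_bins) / len(actual)
    e = np.clip(e, eps, None)   # floor empty bins so the log is defined
    a = np.clip(a, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(3)
reference = rng.normal(size=20000)             # scores at calibration time
same = rng.normal(size=20000)                  # same population, new draw
shifted = rng.normal(loc=0.5, size=20000)      # population drifted by 0.5 sd

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI, no drift: {psi_same:.4f}   PSI, 0.5 sd shift: {psi_shifted:.4f}")
```

Commonly quoted industry rules of thumb treat PSI below about 0.1 as stable and above about 0.25 as a material shift; those cutoffs are conventions rather than statistical tests, so they are best paired with the formal mean and covariance tests listed above.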
## Reading the coefficient table A coefficient table is the artifact a risk committee reviews, not the algebra behind it. This section trains a small LDA on a subset of German features that admits a narrative walk-through and annotates the coefficients. The exercise is a template for how to document any linear model for a governance review. Three observations are worth making to a non-statistical reader of such a table. 1. The coefficient sign matches the direction of the class-mean gap. If defaulters have a higher average loan duration, the LDA coefficient on `duration` is positive (pushing the score toward default) after the standardization. If the sign disagrees with the class-mean gap, the feature is redundant given the others, and the correlation structure has flipped its apparent effect. This is the LDA analog of a Simpson's paradox diagnostic. 2. The magnitudes are comparable only after standardization, because the raw LDA coefficients inherit the scale of the input features. A coefficient of 0.1 on `amount` (measured in marks) and a coefficient of 0.5 on `installment_rate` (measured on a 1 to 4 scale) are not directly comparable until both features have been divided by their standard deviation. 3. The intercept encodes the base rate. Under Gaussian LDA, the intercept is $-\tfrac{1}{2}(\mu_0 + \mu_1)^\top \Sigma^{-1}(\mu_1 - \mu_0) + \log(\pi_1/\pi_0)$. The first term is purely geometric, and the second is the prior log-odds. Reporting both pieces separately (the geometric midpoint contribution and the prior contribution) helps a reviewer understand whether the model is moving the decision boundary because of the data or because of the prior assumption. The coefficient table on a larger design is the same template. For a 50-feature LDA model, the table becomes long enough that a graphical representation (a forest plot of standardized coefficients with bootstrap confidence intervals) is more readable than a numerical table, but the content is the same. 
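The intercept decomposition in the third observation can be checked numerically. The sketch below uses assumed population parameters (two features, a 10 percent prior) rather than fitted estimates, and confirms that the linear score $\beta_0 + \beta^\top x$ equals the log posterior odds computed directly from the two Gaussian densities.

```python
import numpy as np

# Assumed population parameters for illustration.
mu0 = np.array([0.0, 0.0])            # non-defaulter mean
mu1 = np.array([1.0, 0.5])            # defaulter mean
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])        # common covariance
pi1 = 0.10
pi0 = 1.0 - pi1

Sinv = np.linalg.inv(Sigma)
beta = Sinv @ (mu1 - mu0)

# Two pieces of the intercept, reported separately.
geometric = -0.5 * (mu0 + mu1) @ Sinv @ (mu1 - mu0)   # midpoint term
prior = np.log(pi1 / pi0)                             # prior log-odds
intercept = geometric + prior

def log_gauss_kernel(x, mu):
    """Gaussian log density up to the constant shared by both classes."""
    d = x - mu
    return -0.5 * d @ Sinv @ d

x = np.array([0.7, -0.2])
log_odds_direct = (log_gauss_kernel(x, mu1) + np.log(pi1)) - (
    log_gauss_kernel(x, mu0) + np.log(pi0)
)
log_odds_linear = intercept + beta @ x

print(f"geometric term = {geometric:.4f}, prior term = {prior:.4f}")
print(f"direct = {log_odds_direct:.6f}, linear = {log_odds_linear:.6f}")
```

Changing `pi1` moves only the prior term, which is the separation a reviewer needs in order to tell a data-driven boundary shift from a prior-driven one.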
## A worked example: from Z-score to pricing

A credit analyst does not just want a pass/fail decision. She wants a spread. Suppose the bank's funding cost is 3 percent, its operating cost on a corporate loan is 1 percent, the expected LGD is 45 percent, and the target return on economic capital is 12 percent. The minimum spread the bank can charge on a one-year term loan to a firm with default probability $p$ is

$$ s(p) = \frac{p \cdot \mathrm{LGD} + \mathrm{cost} + \kappa \cdot \mathrm{RWA}(p)}{1 - p}, $$

where $\mathrm{RWA}(p)$ is the risk-weighted asset amount per unit of exposure produced by the regulatory IRB formula and $\kappa$ is the target return on capital. For a rough illustration, take $p = 0.02$ and $\mathrm{RWA}(0.02) = 0.45$. Plugging the stated inputs into @eq-pricing gives a numerator of $0.02 \times 0.45 + 0.01 + 0.12 \times 0.45 = 0.073$ and a spread of $0.073 / 0.98 \approx 745$ basis points, because the capital term charges the 12 percent hurdle on the full risk-weighted amount. If capital is instead held at 8 percent of RWA, the capital term drops to $0.12 \times 0.08 \times 0.45 \approx 0.0043$ and the spread falls to roughly 240 basis points on top of the funding cost. The level is dominated by the capital convention, so that convention should be stated before any spread is quoted.

The Z-score enters by mapping to $p$. A raw Z-score is not a probability. The standard conversion fits a logit of observed defaults on the Z-score on a holdout, producing a sigmoid that maps Z directly to PD. @altman2000predicting gives a rule-of-thumb mapping for US manufacturers of roughly PD = 1 percent at Z = 3, 5 percent at Z = 2, 25 percent at Z = 1, and 70+ percent at Z = 0. That mapping is what turns a Z-score into a pricing input.

The fitted logit and the rule-of-thumb anchors agree on shape (PD falls monotonically with Z') and disagree on level. The rule of thumb was constructed against a matched-sample 50 percent prior, so it overstates absolute PD on a population whose true base rate is 3 percent. The standard fix is the calibration step shown here: fit a logit of observed defaults on Z on a holdout, and use the fitted curve rather than the published mapping. The mapping absorbs both the calibration bias of LDA and the base-rate gap between the holdout portfolio and Altman's original 1968 sample.
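A small sketch of the spread formula as a function, so the sensitivity to each input can be inspected. The function name and the `capital_ratio` argument are hypothetical additions: `capital_ratio=1.0` applies the formula exactly as written, while `capital_ratio=0.08` assumes capital is held at 8 percent of risk-weighted assets.

```python
def min_spread(p, lgd, cost, kappa, rwa, capital_ratio=1.0):
    """Minimum one-year spread for default probability p.

    p             one-year probability of default
    lgd           loss given default (fraction of exposure)
    cost          operating cost (fraction of exposure)
    kappa         target return on capital
    rwa           risk-weighted assets per unit of exposure at this p
    capital_ratio share of RWA held as capital (hypothetical knob; 1.0
                  reproduces the formula in the text as written)
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    expected_loss = p * lgd
    capital_charge = kappa * capital_ratio * rwa
    return (expected_loss + cost + capital_charge) / (1.0 - p)

# Inputs from the worked example in the text.
inputs = dict(p=0.02, lgd=0.45, cost=0.01, kappa=0.12, rwa=0.45)

spread_as_written = min_spread(**inputs)
spread_8pct_capital = min_spread(**inputs, capital_ratio=0.08)

print(f"formula as written:     {1e4 * spread_as_written:.0f} bp")
print(f"capital = 8% of RWA:    {1e4 * spread_8pct_capital:.0f} bp")
```

The capital term is the largest driver of the result at these inputs, so the choice of capital convention deserves the same documentation as the PD mapping itself.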
## Scalability {.unnumbered}

LDA scales as $O(n p^2)$ for the covariance estimation plus $O(p^3)$ for the covariance inverse. In credit practice $p$ is small (tens of features) and $n$ runs to tens of millions at most, so the bottleneck is the streaming pass through the data to accumulate $S_W$. The class means and $S_W$ are both sums over rows, so accumulating them is embarrassingly parallel:

- Pandas: single-pass `DataFrame.groupby` with `.cov()` or `.mean()` on the feature matrix.
- Polars: same logic with the lazy API, chunked reads for data that do not fit in memory.
- Dask: partition-level scatter-gather (`map_partitions` to emit per-class sums of squares, then reduce).
- PySpark: `groupBy(label).agg(...)` on a Vector-type column, joined with a global Vector-aware summarizer to produce $\hat\Sigma$.

Because LDA's training is a closed-form sufficient-statistic update, it is a good candidate for online and incremental fitting on a rolling window. Maintain running sums of observations, feature totals, and outer products; solve the generalized eigenvalue system on a schedule. The cost to refit daily on tens of millions of accounts is dominated by the data shuffle, not the math.

A million rows by twenty features fits LDA in a fraction of a second on a laptop. The practical scale question is not raw compute. It is the pipeline around the fit: feature monitoring, covariance stability, and the question of whether a common covariance assumption still holds after three quarters of macro drift.

### Scalability warning: condition-number surveillance

LDA breaks silently when $S_W$ becomes ill-conditioned. Two typical causes: a feature goes constant in a subsample, or a one-hot dummy becomes perfectly collinear with another after monotone transformation. In production monitoring, log the condition number of $S_W$ on every refit and alert if it exceeds a threshold (rule of thumb: $10^8$ for double precision).
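The running-sum update and the condition-number check described above can be sketched together. The class name and its methods are hypothetical, and the raw sum-of-outer-products form is chosen for brevity; a production version would likely prefer a numerically safer update (for example a Welford-style recursion) when features have large means.

```python
import numpy as np

class RunningLDAStats:
    """Per-class counts, feature totals, and outer products, updated by chunk."""

    def __init__(self, n_features, n_classes=2):
        self.n = np.zeros(n_classes)
        self.s = np.zeros((n_classes, n_features))
        self.ss = np.zeros((n_classes, n_features, n_features))

    def update(self, X, y):
        for k in range(len(self.n)):
            Xk = X[y == k]
            self.n[k] += len(Xk)
            self.s[k] += Xk.sum(axis=0)
            self.ss[k] += Xk.T @ Xk

    def class_means(self):
        return self.s / self.n[:, None]

    def within_scatter(self):
        # S_W = sum_k ( sum_i x_i x_i^T - n_k * mu_k mu_k^T )
        mu = self.class_means()
        return sum(
            self.ss[k] - self.n[k] * np.outer(mu[k], mu[k])
            for k in range(len(self.n))
        )

rng = np.random.default_rng(4)
n, p = 50_000, 5
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, p)) + 0.5 * y[:, None]

stats = RunningLDAStats(p)
for start in range(0, n, 10_000):          # stream the data in chunks
    stats.update(X[start:start + 10_000], y[start:start + 10_000])

S_W = stats.within_scatter()
cond = np.linalg.cond(S_W)
COND_ALERT = 1e8                            # rule-of-thumb threshold from the text
print(f"condition number = {cond:.2f}, alert = {cond > COND_ALERT}")

# A duplicated column makes S_W singular and should trip the alert.
X_bad = np.column_stack([X, X[:, 0]])
bad = RunningLDAStats(p + 1)
bad.update(X_bad, y)
cond_bad = np.linalg.cond(bad.within_scatter())
```

Because the state is just three arrays per class, partitions can be processed independently and merged by addition, which is the same reduce step the Dask and PySpark bullets rely on.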
The reference library `sklearn` uses SVD by default, which is numerically stable but still produces silently biased coefficients when the effective rank drops.

## Deployment {.unnumbered}

Wrapping LDA as a scoring service is simple. The learned state is a coefficient vector $\beta \in \mathbb{R}^p$ and an intercept $\beta_0$; prediction is one dot product per record. ONNX export is straightforward: `skl2onnx.convert_sklearn(lda_pipeline, initial_types=...)` produces a graph that is a single matrix multiplication plus a softmax. Inference latency is sub-millisecond on any hardware that can compute a 20-element dot product.

MLflow logging should include the fitted `coef_` and `intercept_`, the within-class covariance, the training prior, and the feature list. For regulated deployments, log the sample means per class and the eigenvalues of $S_W^{-1} S_B$ as summary statistics that backtesting can reference.

## Regulatory considerations {.unnumbered}

The Altman Z and its LDA cousins land in regulatory documentation more often than their predictive performance would justify, precisely because they are linear. Four regulatory angles matter.

### SR 11-7 model risk management

The Federal Reserve's Supervisory Guidance on Model Risk Management [@sr117] requires documentation of the conceptual soundness, the data used, the methodology, and the ongoing monitoring of every model that drives material decisions. A Z-score satisfies conceptual soundness trivially: five accounting ratios, one linear combination. The weak point is monitoring. An LDA whose coefficients depend on a covariance that drifts with the economy needs either periodic recalibration, a stability test on the class means, or both.

### Basel II/III IRB

Under the internal ratings-based framework [@basel2006international; @basel2017finalising], regulators require that a bank's PD model produces a calibrated probability of default over a one-year horizon, backed by a sufficient data history and a long-run average.
An LDA score is not calibrated out of the box on mixed-type data, as the German example above shows. A standard workaround is to apply isotonic regression or a Platt-scale calibration on top of the LDA score, converting the raw linear output into a calibrated PD. EBA guidelines on PD estimation [@eba2017gl] are compatible with this as long as the calibration step is documented and backtested.

### ECOA and FCRA

On consumer-credit portfolios, the Equal Credit Opportunity Act prohibits the use of certain protected attributes, and the Fair Credit Reporting Act requires adverse-action reasoning to cite specific factors from the applicant's file. LDA is compatible with adverse-action generation because each coefficient maps to a specific feature contribution. The reason-code algorithm is usually some variant of sorting features by $|\beta_j (x_j - \bar{x}_j)|$ on the rejected application and returning the top four to six contributors. @sec-ch05 walked through this.

### GDPR Article 22 and the EU AI Act

Article 22 of the GDPR gives data subjects the right not to be subject to a decision based solely on automated processing that produces legal or similarly significant effects. The EU AI Act classifies creditworthiness assessment of natural persons as a high-risk system, with obligations around transparency, human oversight, and documentation. A linear LDA satisfies the transparency requirement by construction. Its weaker calibration on consumer data is actually a practical risk here, because the Act implicitly requires that probabilistic statements be accurate. Running a calibrated logit on top of an LDA score is one path; running the logit directly is another.

### IFRS 9 and CECL lifetime expected credit loss

Under IFRS 9 [@ifrs9] and CECL [@cecl], banks book lifetime expected credit losses: under IFRS 9 for each exposure that has experienced a significant increase in credit risk, and under CECL for every exposure from origination. The PD input in these calculations is a forward-looking, point-in-time PD, not a through-the-cycle average.
The Altman Z-score is a through-the-cycle accounting measure and does not by itself supply the macroeconomic conditioning that IFRS 9 stage-2 and stage-3 transitions require. In practice, banks use the Z-score (or a refit LDA) as the starting PD and apply a macroeconomic scaling factor that depends on forecasted GDP growth, unemployment, and interest rates. The scaling factor is usually calibrated on a logit of the default rate on macro variables (a transition-matrix adjustment if a ratings-based approach is used). This two-stage architecture keeps the interpretable LDA at the core and pushes the non-linear conditioning into a smaller, auditable layer. ### Adverse action and explanation mechanics The Fair Credit Reporting Act requires a lender that takes adverse action to disclose the principal reasons for that action in the consumer's file. For a linear model, the canonical algorithm computes the contribution of each feature to the applicant's score, sorts by absolute contribution, and returns the top four or five features as reason codes. For LDA, the contribution of feature $j$ to the decision function at input $x$ is $\beta_j(x_j - \bar x_j)$ using the standardized coefficient. The sign of the contribution indicates whether the feature pushed the score toward approval or rejection. The ranking is stable under re-scaling provided the standardization is applied consistently. A regulator will also want to know that the reason codes are meaningful rather than artifacts of a feature cluster. Best practice is to group highly correlated features (for example, TL/TA and debt-to-equity) into a single named reason ("high leverage") at the reporting stage, using a predefined group-to-feature mapping. That mapping is a governance artifact that should be documented and versioned with the model. LDA's coefficient structure makes this kind of grouping natural, which is one of the reasons it has persisted in consumer-credit regulatory contexts despite its weaknesses. 
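
The contribution-and-grouping mechanics described above can be sketched in a few lines. The coefficients, portfolio means, applicant values, and reason names below are hypothetical, and the function `reason_codes` is an illustrative helper rather than a library API. The sketch assumes a higher score means lower risk, so it keeps only negative contributions as adverse reasons.

```python
import numpy as np

def reason_codes(beta, x, x_bar, names, groups, top_k=4):
    """Rank named reasons by their adverse contribution beta_j * (x_j - x_bar_j).

    Assumes a higher score means lower risk, so negative contributions
    push the applicant toward rejection.
    """
    contrib = beta * (x - x_bar)
    totals = {}
    for name, c in zip(names, contrib):
        reason = groups.get(name, name)          # ungrouped features keep their own name
        totals[reason] = totals.get(reason, 0.0) + float(c)
    adverse = [(r, v) for r, v in totals.items() if v < 0]
    adverse.sort(key=lambda rv: rv[1])           # most negative first
    return adverse[:top_k]

# Hypothetical coefficients, portfolio means, and applicant values.
names = ["WC/TA", "RE/TA", "EBIT/TA", "TL/TA", "debt_to_equity"]
beta = np.array([1.0, 1.5, 3.0, -0.8, -0.5])
x_bar = np.array([0.20, 0.30, 0.10, 0.50, 1.00])
x = np.array([0.10, 0.35, 0.02, 0.80, 2.00])

# Governance artifact: correlated features share one named reason.
groups = {
    "TL/TA": "high leverage",
    "debt_to_equity": "high leverage",
    "EBIT/TA": "weak profitability",
    "WC/TA": "low liquidity",
    "RE/TA": "limited retained earnings",
}

codes = reason_codes(beta, x, x_bar, names, groups)
for reason, value in codes:
    print(f"{reason}: {value:+.3f}")
```

Summing contributions within a group before ranking is one possible design choice. It stops two correlated leverage ratios from occupying two separate reason slots, at the cost of letting offsetting contributions inside a group cancel.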
## Practitioner notes: what to do if you inherit a Z-score model A new team often inherits a Z-score or an LDA-style scoring function that has been in production for years. The cheapest costly mistake is to assume it still works. Five diagnostic steps separate a healthy inheritance from a liability, and a sixth decides what to ship next. **Step 0: rebuild the artifact inventory.** Before touching the math, write down what actually exists. The minimum set is the coefficient vector with its training date, the feature dictionary that maps production columns to model inputs (including any winsorization or imputation that runs before the score), the cutoff schedule (the score-to-decision table and any policy overrides bolted on top), the calibration map from raw score to PD, and the monitoring artifacts that have been produced since deployment. A Z-score model is not the five Altman ratios. It is the pipeline that turns a customer file into an approve, decline, or refer decision, and the pipeline is where most of the drift hides. If any one of these artifacts is missing, treat the model as undocumented and budget for a full re-derivation rather than a refit. **Step 1: refit and compare coefficients.** Rerun the estimation on the most recent three years of in-scope obligors, using the same feature definitions as production, and compare the refit coefficients against the deployed ones. Three failure modes matter. (1) A sign flip on any feature with non-trivial coefficient mass. This usually means the economic relationship has reversed (e.g., during a low-rate window, leverage stops predicting default in the inherited direction) or that a feature has been redefined upstream. (2) A magnitude shift larger than a factor of two on a top-rank feature, which moves cutoffs materially even when the sign is preserved. 
(3) A new feature that the refit pulls in with a large coefficient when forced into the specification, which means the original feature set is missing a now-important driver. Hotelling's $T^2$ on the class means across time windows is a compact test of whether the inputs themselves have moved. Box's M flags whether the pooled covariance assumption still holds, with the usual caveat that it is sensitive to non-normality. Both should be logged with a confidence level rather than a p-value, since with large modern panels every test rejects. **Step 2: redraw the reliability diagram.** Compute the reliability diagram on the last year of production decisions, bucketing by score decile and overlaying the observed default rate against the calibration map's predicted PD. Three patterns to look for. (1) A uniform vertical shift, where the curve is parallel to the diagonal but offset. This is a base-rate change, often macro-driven, and is correctable by re-fitting the intercept of the calibration logit on the most recent vintage. (2) A tilt, where low-risk bins are well calibrated but high-risk bins under- or over-predict. This is usually a sign that the score's discrimination has degraded in the tail and a slope refresh is not enough; consider isotonic recalibration or a feature refresh. (3) Bin-level zigzag with no systematic pattern. This is sampling noise, common when fewer than roughly 30 observations land in a bin; either widen the bins, lengthen the window, or accept that the calibration cannot be evaluated at the tail until more outcomes accrue. Either way, a Platt-scale refresh on the current population is a defensible patch for the parallel-shift case and should be the first remediation tried. **Step 3: stress the feature set.** Run a leave-one-out sensitivity on the top-rank features. Drop each in turn, refit the LDA, and measure the AUC gap, the KS gap, and the Brier gap on a held-out window. 
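
A leave-one-feature-out loop of this kind can be sketched with NumPy alone. The data are simulated, the helper names (`fit_lda`, `auc`, `ks`, `leave_one_out`) are illustrative, and the sketch reports only the AUC and KS gaps. A Brier gap would additionally need a calibrated PD on top of the raw discriminant score, which is not shown here.

```python
import numpy as np

def fit_lda(X, y):
    """Two-class LDA direction: pooled covariance inverse times the mean gap."""
    X0, X1 = X[y == 0], X[y == 1]
    d0, d1 = X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)
    sigma_hat = (d0.T @ d0 + d1.T @ d1) / (len(y) - 2)
    return np.linalg.solve(sigma_hat, X1.mean(axis=0) - X0.mean(axis=0))

def auc(score, y):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores, so no tie handling."""
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n1 = int(y.sum())
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def ks(score, y):
    """Maximum gap between the score CDFs of defaulters and non-defaulters."""
    y_sorted = y[np.argsort(score)]
    cdf_bad = np.cumsum(y_sorted) / y_sorted.sum()
    cdf_good = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()
    return float(np.max(np.abs(cdf_bad - cdf_good)))

def leave_one_out(X_tr, y_tr, X_te, y_te, names):
    """AUC and KS lost on the held-out window when each feature is dropped and LDA is refit."""
    s_full = X_te @ fit_lda(X_tr, y_tr)
    base = {"auc": auc(s_full, y_te), "ks": ks(s_full, y_te)}
    gaps = {}
    for j, name in enumerate(names):
        keep = [k for k in range(X_tr.shape[1]) if k != j]
        s = X_te[:, keep] @ fit_lda(X_tr[:, keep], y_tr)
        gaps[name] = {"auc": base["auc"] - auc(s, y_te), "ks": base["ks"] - ks(s, y_te)}
    return base, gaps

# Simulated data (assumption): one dominant feature and three weak ones.
rng = np.random.default_rng(1)
names = ["x0", "x1", "x2", "x3"]
shift = np.array([1.5, 0.2, 0.2, 0.2])

def draw(n):
    y = (rng.random(n) < 0.2).astype(int)
    return rng.normal(size=(n, 4)) + shift * y[:, None], y

X_tr, y_tr = draw(4_000)
X_te, y_te = draw(4_000)
base, gaps = leave_one_out(X_tr, y_tr, X_te, y_te, names)
for name in names:
    print(name, {k: round(v, 3) for k, v in gaps[name].items()})
```

On data built this way, dropping `x0` removes most of the discrimination while dropping any weak feature barely moves the metrics. That is the fragility pattern this step is designed to surface.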
A single feature contributing more than 10 to 15 AUC points means the model is fragile to a feature outage or a definition change at the upstream source, which is not a hypothetical: bureau format changes, accounting-standard transitions (IFRS 15, IFRS 16), and ERP migrations all rewrite features without warning. Either add a redundant input from a separate data source, or move to a model family with more graceful degradation under feature loss (a tree ensemble with surrogate splits is the usual fallback). Pair this with a population stability index (PSI) check on each top feature against the original training window: a PSI above 0.25 on a top-rank feature is a stronger signal than the AUC drop because it precedes the performance loss. **Step 4: audit the policy overlay.** Production scoring is rarely just the model. Cutoffs, exclusion rules, automatic referrals, and analyst overrides accrete around an inherited model and frequently account for as much approve-or-decline variance as the score itself. Pull the last year of decisions and decompose them into pure-model approvals, pure-model declines, override approvals, and override declines. If the override rate exceeds 5 percent of decisions, the declared model is not the operating model, and the diagnostics in Steps 1 to 3 are scoring the wrong object. The remediation is to either fold the most common overrides back into the model (e.g., as a hard exclusion feature) or to retire them with documented rationale. Keep the override audit as an ongoing report, not a one-time exercise. **Step 5: decide what to ship.** The four findings combine into one of four actions. (a) The model and its calibration both pass. Document the diagnostics, set a quarterly re-check cadence, and stop. (b) Calibration has drifted but discrimination is intact. Apply a Platt or isotonic refresh, re-evaluate, and document the refresh as a model change under SR 11-7 or its local analogue. 
(c) Discrimination has degraded in a specific segment (sector, vintage, channel). Add segment-specific intercepts or fit a segmented model, and re-validate by segment. (d) The feature set is no longer adequate or the override rate has overtaken the model. Retire the inherited model on a planned timeline, run a parallel build with a modern specification (logit with a richer feature set, or a gradient-boosted challenger), and document the migration. The temptation to skip (d) and keep patching is the most expensive failure mode in inherited-model maintenance, because every successive Platt refresh masks a discrimination problem that compounds. The governance lesson is that an inherited model is a working assumption, not a finished product. The Altman Z-score is the rare model that has survived this kind of scrutiny for fifty years, and it has survived precisely because its variable choice reflects real economic mechanisms, not because its coefficients are stable. Modelers who treat inherited Z-scores as immutable artifacts replicate the failure of @begley1996bankruptcy, where Altman's 1968 coefficients applied unchanged in the 1980s nearly doubled the Type I error rate. Modelers who treat them as throwaway artifacts and rebuild from scratch on every refit lose the institutional memory encoded in the original feature choice and reintroduce features that have already been ruled out for legal, operational, or reputational reasons. The discipline is to treat the inherited model as a hypothesis with a known prior and to update both the prior and the hypothesis on each refresh cycle. ## Where LDA connects to later chapters LDA's linear decision rule is the simplest member of a family of techniques that later chapters build out. @sec-ch07 on logistic scorecards shows how to move from LDA's Gaussian-derived sigmoid to a maximum-likelihood-derived sigmoid with regularization. 
@sec-ch08 on structural models formalizes the Merton and KMV distance-to-default (@sec-ch08-kmv) that competed with Altman's accounting model in the 1990s. @sec-ch09 on survival analysis generalizes the one-period hazard of Shumway to a full time-to-event framework. @sec-ch11 on trees and @sec-ch12 on ensembles show the non-linear gains available to a modeler willing to pay for them with interpretability. The Altman tradition does not disappear as the chapters progress. It reappears in @sec-ch28 on causal credit, where the coefficients of a linear model are easier to interpret causally than a deep network's weights, and in @sec-ch29-sme on corporate SME scoring, where LDA on six accounting ratios is still the default for small business lending when data are scarce. A reader who has finished this chapter should be able to: (1) derive Fisher's direction and show it equals the Bayes direction under Gaussian equal-covariance; (2) implement a two-class LDA from a generalized eigenvalue solver and compare to `sklearn`; (3) read the Altman 1968 paper and explain why the coefficients look the way they do; (4) apply the Z, Z', and Z'' variants correctly across firm types; (5) benchmark LDA against logit on mixed-type consumer data and interpret where each wins; (6) diagnose an LDA calibration failure and patch it with Platt scaling; and (7) walk a governance reviewer through the coefficient table without jargon. ## Vietnam and emerging markets {.unnumbered} ### Market context Vietnam's wholesale credit market is bank-dominated and overwhelmingly private-SME by headcount. The State Bank of Vietnam (SBV) supervises 49 credit institutions plus finance and leasing companies under a Basel II standardized-approach framework rolled out under Circular 41/2016 and tightened by Circular 11/2021 on loan classification and provisioning [@sbv2021circular11]. 
The single public credit bureau for bank supervision is the Credit Information Center (CIC), run as an SBV subsidiary, which aggregates obligor histories across licensed banks and finance companies and produces a supervisory CIC score [@cicvn2023report]. Private bureau coverage (PCB Vietnam) is thinner and concentrated on consumer segments. The data rail that a corporate modeler touches is therefore: CIC pulls keyed on national ID or tax code, plus the obligor's own audited statements where available, plus internal account behavior. Identity verification moved online under Circular 16/2020/TT-NHNN, which authorized electronic KYC for payment accounts and unlocked remote onboarding for retail-credit originators [@sbv2020ekyc]. Personal data handling is now governed by Decree 13/2023/ND-CP, which sets consent, cross-border transfer, and breach notification rules similar in spirit to the GDPR but with a narrower legitimate-interest basis and a data protection impact assessment filing requirement with the Ministry of Public Security [@govvn2023decree13]. The macro backdrop matters for any model that inherits Altman-style coefficients calibrated on US manufacturers. Vietnamese GDP volatility is roughly twice the OECD median, credit-to-GDP crossed 130 percent in 2022, and NPL recognition has historically lagged because of VAMC special-bond treatment [@imf2023vietnamart4; @worldbank2022vietnamfinance]. Corporate failures cluster in construction, real estate, and trade finance, cycles driven by property policy and export demand. The informal economy is still around one quarter of GDP, and Findex 2021 places the adult bank-account rate below the ASEAN average although closing fast [@worldbank2021findex; @adb2022vnfin]. ### Application considerations A textbook Altman Z on Vietnamese manufacturers misreads two of its five inputs. 
First, the numerator of $X_4$, market value of equity, is unavailable for the vast majority of firms because only a few hundred are listed on HOSE and HNX. Second, retained earnings ($X_2$) are shaped by SBV-mandated provisioning additions rather than pure accumulated profit. Altman's Z'' [@altman1977zeta; @altman2000predicting], which drops $X_5$ and uses book equity over total liabilities for $X_4$, is the natural starting point. LDA and Z'' transfer well when three conditions are met: (i) the ratios have been winsorized to tame heavy tails from state-owned enterprise reporting; (ii) the covariance matrix is pooled across a reasonably homogeneous sector, not across banking, real estate, and manufacturing together; (iii) the estimated coefficients are refit on Vietnamese defaults rather than copied from Altman (2000). Bank-lending sensitivity to uncertainty in Vietnam differs systematically from developed-market benchmarks, which means the prior on coefficient magnitudes should not be imported. Tet adds a second wrinkle. Consumer-credit outstanding balances and arrears move with the Lunar New Year in ways not present in US benchmark data. If the design matrix includes age of most-recent delinquency or utilization ratios pulled at month end, a model fit on January-February snapshots overstates risk, and one fit on May-June snapshots understates it. The practical response is to fit separate LDA means per observation month and to use a calendar-adjusted cumulative default rate as the target. ### Rationalization LDA and Z''-style models fit Vietnam best where the modeler has few defaults, a small feature list of accounting ratios, and a supervisor who insists on a readable coefficient table. Middle-market corporate scoring at a mid-tier joint-stock bank is the canonical case. The method fits poorly when the design matrix is dominated by CIC-derived behavioral indicators for retail obligors, because these are heavily one-hot and skewed. 
Consumer-credit scoring under eKYC workflows should use a WoE scorecard (@sec-ch07) or a calibrated tree. A second contraindication is the lack of market-implied volatility for most obligors, which blocks the KMV-DD variable (@sec-ch08-kmv) that would otherwise stabilize a corporate LDA in a hybrid model. ### Practical notes Training data. The 500-firm HOSE/HNX sample is sufficient to refit Z'' coefficients on listed manufacturers. For the broader private-SME universe, the IFC MSME Finance Gap (Vietnam profile) provides aggregate default rates by sector that can anchor a prior [@ifc2019vnmsme]. Bank-level panels from DataCore and ADB supervisory data can be licensed for academic benchmarking [@adb2022vnfin]. Regulator touchpoints. Model documentation for SBV on-site inspections must include the discriminant coefficients, the sample window, the observed default definition in the sense of Circular 11/2021, and a stability back-test across at least two downturns. Data-protection impact assessments filed under Decree 13/2023 should specify the legal basis for each CIC pull and each bureau attribute consumed by the LDA [@govvn2023decree13]. Validation units should map the model's rank-order performance against the CIC supervisory score, not just internal booking performance. Internal escalation. In a typical Vietnamese joint-stock bank, the Credit Risk Committee owns sign-off on corporate PD models and the Model Risk Unit (where it exists) owns the independent validation. LDA and Z''-style documentation sits comfortably with both because the coefficient table is legible without a statistician. The same legibility is a liability when the model degrades silently: stability drift tends to surface only when the annual revalidation runs. A quarterly $S_W$ condition-number check and a rolling AUC on a fresh CIC cut are cheap safeguards that practitioners should build into the pipeline by default [@bis2020em]. 
IMF FSAP findings on Vietnam repeatedly flag the gap between model development and ongoing monitoring as a supervisory concern, and a discriminant model's simplicity is not a substitute for that monitoring discipline [@imf2023vietnamart4].

## Takeaways {.unnumbered}

- LDA is the Bayes-optimal classifier under Gaussian equal-covariance. Its coefficients equal $\Sigma^{-1}(\mu_1 - \mu_0)$, and in the two-class case the Fisher direction is the only eigenvector of $S_W^{-1} S_B$ with a nonzero eigenvalue, unique up to scale.
- Altman's 1968 Z-score is MDA applied to five financial ratios on 66 matched firms. The coefficients 1.2, 1.4, 3.3, 0.6, 1.0 are not magical; they are the multivariate separation direction in that specific sample. Refitting on new data gives new coefficients.
- The decision zones (safe above 2.99, distress below 1.81) are empirical thresholds, not Bayes cutoffs. Z' and Z'' restate the model for private firms and for non-manufacturers, with refitted coefficients.
- Logit beats LDA on mixed-type consumer data, usually by 1 to 3 points of AUC and substantially more on calibration. Hazard models with market-based inputs [@shumway2001forecasting; @campbell2008search] beat both on corporate data.
- LDA still wins when features are near Gaussian, samples are small, or interpretability and regulatory acceptance dominate. For a middle-market corporate PD model on six ratios, LDA with a Platt-scale calibration remains a reasonable choice.
- Monitor the condition number of $S_W$ and the stability of class means. LDA degrades silently under heavy one-hot interactions and under covariance drift.

## Further reading {.unnumbered}

- @fisher1936use: the original discriminant function.
- @rao1948utilization: the multiple discriminant generalization.
- @anderson1951classification: the classification-theoretic derivation that connects LDA to the Bayes rule.
- @efron1975efficiency: the asymptotic efficiency calculation that settles the LDA-versus-logit question under Gaussian.
- @press1978choosing: the empirical argument for logit on binary-heavy data.
- @altman1968zscore: the 1968 paper every credit analyst should own.
- @altman1977zeta: ZETA and the seven-variable extension.
- @altman2000predicting: Altman's own review of the Z-score and ZETA after 30 years of data.
- @altman2017financial: international evidence on Z-score stability across decades.
- @ohlson1980financial: logit replaces MDA, on a larger sample.
- @zmijewski1984methodological: choice-based sampling corrections for default models.
- @shumway2001forecasting: the hazard-model reframing.
- @campbell2008search: accounting plus market-based inputs in a hazard model.
- @hillegeist2004assessing: accounting versus structural bankruptcy models.
- @agarwal2008comparing: market-based versus accounting-based head to head.
- @friedman1989regularized: regularized discriminant analysis for small samples.
- @bickel2004some: LDA in high dimensions, where the naive version fails.

================================================================================
# Source: chapters/07-logistic-scorecard.qmd
================================================================================

# Logistic Regression and the Scorecard

**Scope: retail (with one corporate detour).** Primary applications are consumer credit scorecards on UCI German Credit and UCI Taiwan default. The Ohlson O-score section (@sec-ch07-ohlson) applies the same logit machinery to corporate bankruptcy and is flagged inline.

## Overview {.unnumbered}

Logistic regression is still the workhorse of retail credit risk. Every large bank, every bureau, every fintech with a prime book runs a logistic regression scorecard somewhere in its decision stack. Not because nothing better exists, but because nothing else clears the simultaneous bar of statistical rigor, regulatory transparency, and operational robustness.
A well-built scorecard is auditable at the bin level, easy to monitor, cheap to score at ten thousand requests per second, and trivial to explain to an adverse-action letter recipient. That combination is rare. This chapter derives logistic regression the way a practitioner should know it. We build the MLE by hand using Newton-Raphson / IRLS, prove the equivalence between a logistic regression on weight-of-evidence features and an additive scorecard, derive the points-to-double-odds (PDO) scaling from first principles, train a full scorecard on Taiwan default and a regularized logistic on German credit, apply Platt and isotonic calibration with reliability diagrams, reproduce Ohlson's 1980 O-score, then walk the model through operational concerns: reason codes, monotonic constraints, PSI monitoring, recalibration versus refit, FastAPI deployment, ONNX export, MLflow logging, and PySpark MLlib for 1-million-row scale. By the end, you will have a working, logged, versioned, testable scorecard pipeline that maps cleanly onto SR 11-7 [@sr117] and EBA IRB [@eba2017gl; @eba2022irb] expectations. None of the math is hidden, none of the code is stubbed. The chapter is deliberately long because scorecards sit at an unusual intersection. The statistics are classical, the engineering is production-grade, and the regulatory framing is enormous. A credit scorecard fails if any of those three legs wobbles, so we give each its own derivation, code, and failure modes. Readers who already know the math can skip ahead to @sec-ch07-scaling and @sec-ch07-impl; readers who already ship models may find the history and regulatory sections repetitive. The intent is that a graduate student can hand the chapter to a risk executive and vice versa. An emerging-market framing runs alongside the math. 
A Vietnamese retail lender opening files under eKYC faces applicants whose bureau footprint at CIC is two lines long, whose income arrives in cash, and whose outstanding balances compress violently around Tet [@cicvn2023report]. WoE binning is the right tool because it turns a thin bureau line plus a noisy informal-income proxy into a stable score without over-parameterizing. The closing section returns to this with CIC data, SBV Circular 11/2021 default definitions, and the practical binning of informal-income indicators. A word on why logistic regression persists. @hand2006classifier argued nearly two decades ago that the "illusion of progress" in classification is that tiny AUC improvements dominate the literature while the costs and benefits of deployment dominate practice. Credit is the cleanest example. Regulated lenders care about monotone constraints, bin-level explainability, portability across booking systems, and the ability to retrain a vintage in a week. A 1% lift from a gradient-boosted ensemble often fails to pay for the governance overhead. @dumitrescu2022machine revisit this question on modern data and find that a carefully binned logistic scorecard is within one or two percent of AUC of tuned tree ensembles, sometimes ahead of them on small-sample out-of-time windows. That is the empirical case for this chapter still existing in a book that also contains a chapter on graph neural networks. ### Notation {.unnumbered} Let $y_i \in \{0,1\}$ denote default on obligor $i \in \{1,\dots,n\}$ and $x_i \in \mathbb{R}^p$ the covariate vector (already one-hot / WoE-encoded). Define $\beta \in \mathbb{R}^p$ as the regression coefficients and $\eta_i = x_i^\top \beta$ as the linear predictor. The conditional default probability is $\pi_i = P(y_i = 1 \mid x_i) = \sigma(\eta_i)$ where $\sigma(z) = (1 + e^{-z})^{-1}$ is the sigmoid. The log-odds of default is $\mathrm{logit}(\pi) = \log(\pi/(1-\pi)) = \eta$. 
The diagonal matrix $W$ with entries $W_{ii} = \pi_i(1-\pi_i)$ is the Fisher information weight. Bin $k$ of feature $j$ has weight of evidence

$$
\mathrm{WoE}_{jk} = \log\left(\frac{\Pr(x_j \in \text{bin } k \mid y=0)}{\Pr(x_j \in \text{bin } k \mid y=1)}\right)
$$

and information value $\mathrm{IV}_j = \sum_k (\Pr(\text{bin}_k \mid y=0) - \Pr(\text{bin}_k \mid y=1)) \cdot \mathrm{WoE}_{jk}$.

------------------------------------------------------------------------

## Logistic regression as a PD model

### The Bernoulli GLM

A PD model answers one question: what is $\Pr(y_i = 1 \mid x_i)$? The minimum assumption that keeps the answer inside $[0,1]$ while letting covariates enter linearly is the logit link of a Bernoulli GLM [@nelder1972generalized]:

$$
\log \frac{\pi_i}{1 - \pi_i} = x_i^\top \beta.
$$

@berkson1944application introduced logits for bioassay, @cox1958regression formalized their use for binary regression, and @mcfadden1974conditional gave the discrete-choice interpretation that dominates credit applications: the score $x_i^\top \beta$ is the (shifted) log-odds of choosing "default" in a binary latent-utility model.

Three properties make @eq-logit-link the natural PD specification:

1. **Calibrated by construction on a representative sample.** The MLE score equation is $\sum_i (y_i - \pi_i) x_i = 0$, so residuals sum to zero within any contrast that is in the column space of $X$. When $X$ contains an intercept column, the sample mean PD therefore matches the sample default rate.
2. **Additive on the log-odds scale.** Incremental effects combine via addition, which is what enables the scorecard.
3. **Coherent with the Basel IRB philosophy.** Regulators expect PDs that are additive in explanatory factors, ranked, and back-testable [@basel2006international; @basel2005irb]. Logistic regression meets all three natively.
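
The first property is easy to check numerically. The sketch below is a minimal illustration on simulated data, with hypothetical names (`sigmoid`, `fit_logit`) and a plain undamped Newton loop of the kind derived later in this chapter. It assumes $X$ carries an intercept column and a segment dummy, and then compares the fitted mean PD with the observed default rate overall and inside the segment.

```python
import numpy as np

def sigmoid(eta):
    """Numerically stable logistic function (the exponent is always non-positive)."""
    e = np.exp(-np.abs(eta))
    return np.where(eta >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def fit_logit(X, y, n_iter=25):
    """Plain Newton iterations on the Bernoulli log-likelihood (no damping)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X @ beta)
        w = pi * (1.0 - pi)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - pi))
    return beta

# Simulated portfolio (assumption): one continuous driver and one segment dummy.
rng = np.random.default_rng(7)
n = 5_000
x1 = rng.normal(size=n)
seg = (rng.random(n) < 0.3).astype(float)
y = (rng.random(n) < sigmoid(-2.0 + 0.8 * x1 + 0.5 * seg)).astype(float)

X = np.column_stack([np.ones(n), x1, seg])     # intercept column included
beta_hat = fit_logit(X, y)
pi_hat = sigmoid(X @ beta_hat)

print("score equation X'(y - pi):", np.round(X.T @ (y - pi_hat), 8))
print("mean PD vs default rate  :", pi_hat.mean(), y.mean())
print("segment PD vs segment DR :", pi_hat[seg == 1].mean(), y[seg == 1].mean())
```

The segment-level match follows from the same score equation: the dummy is a column of $X$, so residuals sum to zero over the rows where it equals one. The guarantee holds on the training sample only; a new vintage with a different base rate is exactly where recalibration becomes necessary.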
### Likelihood and log-likelihood The sample log-likelihood under independent Bernoulli observations is $$ \ell(\beta) = \sum_{i=1}^{n} \big[ y_i \log \pi_i + (1-y_i) \log(1-\pi_i) \big] = \sum_{i=1}^{n} \big[ y_i \eta_i - \log(1 + e^{\eta_i}) \big]. $$ @eq-loglik is strictly concave in $\beta$ whenever $X$ has full column rank, so the MLE is unique (when it exists: complete separation breaks existence, see @firth1993bias for the penalized remedy). ### Score function and Hessian Differentiating @eq-loglik term by term and using $\partial \pi_i / \partial \beta = \pi_i(1-\pi_i) x_i$ (chain rule on the logistic CDF), the gradient (score function) is $$ U(\beta) = \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{n} (y_i - \pi_i) x_i = X^\top (y - \pi), $$ a $p\times 1$ vector of weighted residuals. Differentiating once more, $$ H(\beta) = \frac{\partial^2 \ell}{\partial \beta \partial \beta^\top} = -\sum_{i=1}^{n} \pi_i(1-\pi_i) x_i x_i^\top = - X^\top W(\beta) X, $$ the $p\times p$ matrix of second partials, where $$ W(\beta) = \mathrm{diag}\big(\pi_1(1-\pi_1), \ldots, \pi_n(1-\pi_n)\big) $$ is the diagonal matrix of Bernoulli variances at the current $\beta$. Each diagonal entry $w_i = \pi_i(1-\pi_i) \in (0, 1/4]$ is the variance of $y_i \mid x_i$, peaking at $\pi_i = 1/2$ (most uncertain) and shrinking to zero as $\pi_i$ approaches 0 or 1 (near-certain cases contribute little curvature). Three properties matter for estimation. 1. **Negative semi-definiteness.** For any $v \in \mathbb{R}^p$, $v^\top H v = -\sum_i w_i (x_i^\top v)^2 \le 0$ since $w_i \ge 0$. If $X$ has full column rank and at least one $\pi_i \in (0,1)$, $H$ is strictly negative-definite, so $\ell$ is strictly concave and the MLE (when it exists) is unique. Complete or quasi-complete separation drives some $\pi_i$ to $\{0,1\}$, sending $w_i \to 0$ and pushing $\beta$ to infinity. 2. **No dependence on** $y$. The Hessian depends on $\beta$ through $\pi$, not on the observed $y$. 
This is a hallmark of the canonical link (logit for the Bernoulli family): the observed information $-H(\beta)$ equals the expected (Fisher) information $\mathcal{I}(\beta) = -\mathbb{E}[H(\beta)] = X^\top W(\beta) X$. Newton-Raphson and Fisher scoring therefore coincide, which is why a single algorithm (IRLS, @eq-irls below) drops out cleanly.
3. **Asymptotic covariance.** In large samples the MLE satisfies $\hat\beta \overset{a}{\sim} \mathcal{N}\big(\beta, (X^\top \widehat W X)^{-1}\big)$, with $\widehat W$ evaluated at $\hat\beta$. The diagonal of this inverse gives the squared standard errors that drive Wald tests and the score-band confidence intervals reported by `statsmodels` and `glm` in R.

### Newton-Raphson

The Newton step solves the local quadratic:

$$
\beta^{(t+1)} = \beta^{(t)} - \big[\nabla^2 \ell(\beta^{(t)})\big]^{-1} \nabla \ell(\beta^{(t)}) = \beta^{(t)} + (X^\top W^{(t)} X)^{-1} X^\top (y - \pi^{(t)}).
$$

Substituting the identity $X^\top(y - \pi) = X^\top W (W^{-1}(y-\pi))$ and defining the working response $z^{(t)} = X \beta^{(t)} + (W^{(t)})^{-1}(y - \pi^{(t)})$ rearranges @eq-newton into a weighted least-squares solve:

$$
\beta^{(t+1)} = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} z^{(t)}.
$$

@eq-ch07-irls is the iteratively reweighted least squares (IRLS) form [@green1984iteratively; @nelder1972generalized]. Each iteration is a WLS regression of $z$ on $X$ with weights $W$. Convergence is quadratic once you are close, and damping (step halving) handles the rare divergent early steps.

Three practical properties of IRLS matter for credit work. First, the update is scale-equivariant: rescaling columns of $X$ leaves predictions unchanged and simply rescales coefficients. That lets us standardize for numerical conditioning without interpretive cost.
Second, the weight matrix $W$ only depends on the current prediction $\pi^{(t)}$, which means a single IRLS iteration on a fresh dataset is a closed-form Platt-style refit of the linear predictor: useful when we want to recalibrate a deployed model against a new vintage without re-learning the binning. Third, the working response $z$ can be interpreted as the current linear predictor plus the raw residual $y - \pi$ scaled by $W^{-1}$; that correction is the GLM working residual, which sits alongside the Pearson and deviance residuals in GLM diagnostics. Understanding that construction pays dividends when we turn to calibration (the Platt fit in @sec-ch07-calibration is exactly a single IRLS step on the $(\eta, y)$ pair).

Under standard MLE asymptotics, $\sqrt{n}(\hat\beta - \beta) \Rightarrow \mathcal{N}(0, I(\beta)^{-1})$, where $I(\beta) = X^\top W X / n$. Practitioners use $\widehat{\mathrm{Var}}(\hat\beta) = (X^\top \hat W X)^{-1}$ for Wald tests and confidence intervals on the points. The corresponding likelihood-ratio test for nested models compares $2[\ell(\hat\beta_{\text{full}}) - \ell(\hat\beta_{\text{restricted}})]$ against $\chi^2_{\text{df}}$. Credit teams use it to justify dropping or adding a characteristic: if the LR statistic fails to clear the $\chi^2$ critical value (the extra characteristic adds no significant fit) and the resulting out-of-time Gini is within a basis point, the restricted model wins on parsimony.

#### What can go wrong with IRLS

Four failure modes appear repeatedly in credit modeling.

1. *Separation.* If one feature perfectly predicts the target on the training sample, $\hat\beta_j \to \infty$. IRLS oscillates or diverges; the likelihood has no finite maximizer. This is not rare with high-cardinality categorical variables or with rare PAY status bins after aggressive binning. Solutions: Jeffreys prior [@firth1993bias], L2 regularization, or forcing a minimum obligor count per bin during binning.
2. *Ill-conditioning.* Near-collinear columns make $X^\top W X$ nearly singular. The Newton step explodes.
Regularization fixes this; so does dropping columns by VIF or by feature-engineering the binning.
3. *Numerical overflow in the sigmoid.* Large $|\eta|$ causes `exp(eta)` to overflow. The naive form $1/(1+e^{-\eta})$ blows up for $\eta \ll 0$, and the alternative $e^{\eta}/(1+e^{\eta})$ blows up for $\eta \gg 0$. The branchless stable form picks whichever side keeps the exponent non-positive, so the result stays in $[0,1]$ and never becomes `nan` at any $\eta$ representable in float64. This is the exact `pi = np.where(...)` step used by `irls_logit` in @sec-ch07-impl. The chunk below demonstrates the difference on $\eta \in \{-2000, -50, 0, 50, 2000\}$ and verifies the stable form matches `scipy.special.expit` to machine precision.

   At $\eta = 2000$ the form $e^{\eta}/(1+e^{\eta})$ evaluates `inf/inf` and returns `nan`. At $\eta = -2000$ the form $1/(1+e^{-\eta})$ raises an overflow warning that downstream code is free to ignore but a fitter pinned to `np.errstate(over="raise")` will still abort on. The branchless `np.where` form picks whichever branch keeps the exponent non-positive, matches `scipy.special.expit` to within $3 \times 10^{-38}$, and emits no warnings. In an IRLS loop, a single corrupted $\pi_i$ contaminates the working response $z_i = \eta_i + (y_i - \pi_i)/(\pi_i(1-\pi_i))$ and the Newton step diverges silently, so this guard is non-optional in any production fitter.
4. *Non-monotone log-likelihood between steps.* If a Newton step worsens the loss, halve the step and retry. The function is concave, so a sufficiently short step along the Newton direction always improves the log-likelihood; in practice one or two halvings suffice.

### WoE encoding and the additive scorecard

Credit-scoring practice fits logistic regression not on raw features but on WoE-encoded features [@thomas2017credit; @anderson2007credit; @siddiqi2017intelligent]. Each continuous or categorical feature $j$ is bucketed into bins $B_{j1}, \dots, B_{j K_j}$ by a supervised binning algorithm that maximizes information value subject to monotonicity. Each bin is replaced by its WoE (@eq-woe-def).
Formally, the design matrix becomes a block of one-hot indicators multiplied by the bin's WoE value:

$$ x_{ij}^{\text{WoE}} = \sum_{k=1}^{K_j} \mathrm{WoE}_{jk} \cdot \mathbf{1}\{x_{ij} \in B_{jk}\}. $$

#### Equivalence proof

**Claim.** A logistic regression on WoE-encoded features is algebraically equivalent to a logistic regression with a separate coefficient per bin in which the coefficients within each feature are constrained to be proportional to the bin WoEs, and it yields an additive point score per bin.

**Proof sketch.** Consider a logistic regression with bin-level one-hot encoding, so $x_{ij}$ is replaced by indicators $d_{ij1}, \dots, d_{ij K_j}$ and coefficients $\alpha_{j1}, \dots, \alpha_{j K_j}$. The linear predictor is

$$ \eta_i = \beta_0 + \sum_j \sum_k \alpha_{jk} d_{ijk}. $$

Substituting $\alpha_{jk} = \beta_j \cdot \mathrm{WoE}_{jk}$ (one coefficient $\beta_j$ per feature, scaled by each bin's WoE) gives

$$ \eta_i = \beta_0 + \sum_j \beta_j \sum_k \mathrm{WoE}_{jk} d_{ijk} = \beta_0 + \sum_j \beta_j x_{ij}^{\text{WoE}}. $$

This is exactly the logistic regression on WoE-encoded features. The restriction $\alpha_{jk} = \beta_j \mathrm{WoE}_{jk}$ is a single-factor constraint per feature: instead of $K_j$ degrees of freedom, the WoE model uses one. When the empirical WoEs approximate the population log-odds-ratio well (which is the reason binning is done), this constraint loses little accuracy while dramatically reducing over-fit.

The point formula in the next section will reveal why this representation yields an additive scorecard: because $\eta_i$ is a sum of per-bin contributions, scaling it to points preserves additivity, so every applicant's score decomposes exactly into feature-level point contributions.

#### Why WoE and not raw indicators?

In principle, one could fit logistic regression on raw one-hot indicators. Three reasons it is not done.
- **Generalization.** With $K_j$ free coefficients per feature, a 20-feature scorecard with 8 bins each has 160 free coefficients, which over-fits on the $\sim$ 10k-obligor training samples that are common for a new product.
- **Monotonicity.** Raw indicators have no enforced relationship between adjacent bins, so one can get non-monotone coefficient estimates that contradict policy beliefs. WoE, combined with monotone binning, enforces the relationship by construction.
- **Stability under population drift.** If one indicator bin fills up unevenly across vintages, its coefficient moves independently. WoE pools the sample through binning, making coefficients substantially more stable vintage-to-vintage, as @siddiqi2017intelligent documents.

#### Binning choices in practice

The bin boundaries matter. Three supervised binning recipes dominate production scorecards:

1. *Decision-tree binning.* A shallow CART on $(x_j, y)$ gives boundaries optimized for target split quality. Simple, but can over-fit if the tree depth is not bounded.
2. *Chi-merge.* Iteratively merge adjacent bins with low chi-square statistic on the event-rate contingency table [@thomas2017credit].
3. *Optimal binning via mixed-integer programming.* @navas2020optimal formulates bin selection as an MILP with monotonicity, minimum sample size, and maximum bin-count constraints. This is what `optbinning` implements, and what we use below.

In each case, the output is a list of bin boundaries plus the empirical WoE per bin. The sklearn `ColumnTransformer` plus custom transformer idiom is enough to industrialize any of the three.

#### Information value as a feature-selection filter

Before model fitting, practitioners rank features by information value:

$$ \mathrm{IV}_j = \sum_{k=1}^{K_j} \big( f_{jk}^{(0)} - f_{jk}^{(1)} \big) \cdot \mathrm{WoE}_{jk} $$

where $f_{jk}^{(y)}$ is the share of observations with outcome $y$ falling in bin $k$.
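The WoE and IV definitions can be checked on a handful of counts. The sketch below is illustrative only: `woe_iv_table` is a hypothetical helper (not the book's `creditutils` API), the bin counts are made up, and it assumes $y = 1$ marks a default so that $f^{(0)}$ is the good share and $f^{(1)}$ the bad share.

```python
import numpy as np

def woe_iv_table(bin_ids, y, n_bins):
    """Per-bin WoE and IV contributions; y == 1 marks a default (bad)."""
    bin_ids = np.asarray(bin_ids)
    y = np.asarray(y)
    goods = np.bincount(bin_ids[y == 0], minlength=n_bins).astype(float)
    bads = np.bincount(bin_ids[y == 1], minlength=n_bins).astype(float)
    if (goods == 0).any() or (bads == 0).any():
        # A bin with zero goods or zero bads has infinite WoE: merge it first.
        raise ValueError("empty good or bad count in a bin")
    f_good = goods / goods.sum()          # f_jk^(0): bin's share of all goods
    f_bad = bads / bads.sum()             # f_jk^(1): bin's share of all bads
    woe = np.log(f_good / f_bad)
    iv_parts = (f_good - f_bad) * woe     # each term is non-negative
    return woe, iv_parts

# Made-up counts for a three-bin feature (assumption, not book data).
goods_per_bin = [400, 300, 200]
bads_per_bin = [10, 30, 60]
counts = goods_per_bin + bads_per_bin
bin_ids = np.repeat([0, 1, 2, 0, 1, 2], counts)
y = np.repeat([0, 0, 0, 1, 1, 1], counts)

woe, iv_parts = woe_iv_table(bin_ids, y, n_bins=3)
iv = iv_parts.sum()
print(np.round(woe, 4), np.round(iv_parts, 4), round(iv, 4))
```

Every IV term is non-negative because the share gap and the log-ratio always carry the same sign, which is why IV can only grow as bins are split further.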
Rough conventions [@siddiqi2017intelligent]:

- IV \< 0.02: not predictive.
- 0.02 - 0.1: weak.
- 0.1 - 0.3: medium.
- 0.3 - 0.5: strong.
- $> 0.5$: suspiciously strong, check for leakage.

IV is sensitive to sample size and binning choices, so treat it as a screen rather than a selection criterion. The final feature set should be chosen by **out-of-time Gini contribution under penalized LR**, not by IV rank alone.

### Related nuances

Several items worth flagging before we code.

1. *Rare events.* Under heavy imbalance (low base rate), MLE $\hat\beta_0$ is biased downward. @king2001logistic give a closed-form correction; @firth1993bias recommends Jeffreys prior penalization, which has become the default in modern credit practice. sklearn's L2 penalty (@sec-logistic-l2-ridge) with modest `C` achieves similar regularization in large-$n$ credit datasets without the small-sample closed-form. The full menu of resampling, cost-sensitive, and threshold-moving fixes for severe imbalance is treated in @sec-ch15.
2. *Separation.* Rare monotone bins (e.g., a PAY_0 bin with zero goods) make the likelihood diverge. Optimal binning enforces a minimum bad and good rate per bin [@navas2020optimal] to prevent this before fitting.
3. *Prior corrections.* When a training set is stratified (over-sampled bads), the MLE intercept no longer reflects the deployment prior. The standard correction shifts $\hat\beta_0$ by $\log(\pi_{\text{pop}} / (1-\pi_{\text{pop}})) - \log(\pi_{\text{train}} / (1-\pi_{\text{train}}))$ [@king2001logistic]. All other coefficients are left unchanged; only the intercept carries the mismatch between training and deployment base rates. This is a one-line fix that many deployed scorecards get wrong when a sampling policy changes mid-year.
4. *Choice-based sampling.* When the sampling scheme itself is endogenous (e.g., the training set is only of accepted applicants), the logistic likelihood is mis-specified in a more fundamental way.
Reject inference (@sec-ch10) addresses this directly. For the bulk of retail products where the sampling scheme is exogenous or rebuilt via weighted likelihood, the base-rate shift is the only correction needed.
5. *Interpreting* $\beta_j$. In a logit on WoE-encoded features, $|\hat\beta_j|$ close to 1 indicates the empirical WoE is a faithful summary of the feature's log-odds-ratio. (With $y=1$ = default and the good-over-bad WoE of @eq-woe-def the sign is negative, as in Step 6 of the worked example below.) Magnitudes substantially above 1 imply the binning under-resolves the feature (the WoE signal is being amplified by the linear coefficient to compensate). Magnitudes substantially below 1 suggest the binning over-resolves or is contaminated by noise. Senior scorecard modelers use this as a diagnostic: after fitting, inspect the distribution of $\hat\beta_j$ values. Most should live between 0.5 and 1.2 in absolute value. Outliers deserve a look.

### Worked example: from raw inputs to a score

The math above is easier to internalize on a small concrete dataset. This subsection takes one continuous feature (debt-to-income ratio, `DTI`) and one categorical feature (`employment_type`, with four levels), bins each, computes WoE and IV by hand, fits the logistic regression, maps the coefficients to points, and scores a single applicant end-to-end. The arithmetic in each chunk is small enough to reproduce on paper, so any mismatch with intuition is locatable to a single line.

The pipeline is the same end-to-end chain that every production scorecard implements; @fig-ch07-pipeline lays it out so that each step below, and each later section of the chapter, has a place on the map.

The dashed feedback edge is important: monitoring does not just report, it triggers either a recalibration (cheap, intercept and slope only) or a full refit (expensive, often new bins) depending on what PSI and out-of-time AUC say; the *Recalibration vs refit* section covers the choice between the two. Steps 1 through 8 below populate the first half of this diagram with concrete numbers.
The Regularization section sits at the *fit* stage, @sec-ch07-calibration at the *calibrate* stage, and the monitoring sections at the dashed feedback loop.

#### Step 1. Generate a 4,000-obligor portfolio

`DTI` is drawn so that higher leverage carries higher default probability; `employment_type` is drawn so that `salaried` is safest, `self_employed` is the median, `gig` is risky, and `unemployed` is riskiest. The relationship is not perfect, which is what makes the binning informative.

#### Step 2. Bin the continuous feature

Three binning strategies dominate practice for a continuous feature: equal-width cuts, equal-frequency (quantile) cuts, and supervised cuts learned from a shallow decision tree. Each delivers a different bad-rate profile on the same `DTI` column. We run all three on the simulated portfolio, compare counts and monotonicity, then settle on the fixed cuts used for the rest of the walkthrough.

Three patterns recur every time this comparison is run on a real portfolio, and they show up here as well.

1. *Equal-width is sensitive to skew.* `DTI` is gamma-distributed, so the two highest equal-width bins together hold under 10% of the portfolio. Bad rates are monotone on this seed but the tail bin counts are small enough that a typical 5% min-bin-size rule would force a merge, and on neighboring seeds the top bin's rate is noisy enough to invert against its neighbor.
2. *Equal-frequency stabilizes counts but not cuts.* Quantile cuts give every bin the same `n`, which is what makes IV and WoE estimates low-variance. Cuts land at population quantiles (here 0.142, 0.243, 0.355, 0.527), not at policy-relevant thresholds. A 36% DTI is a meaningful underwriting boundary; a 35.5% quantile is not.
3. *Supervised cuts find risk-driven boundaries.* The tree minimizes Gini impurity on `default`, so its cut points (here near 0.20, 0.33, 0.52, 0.86) sit at genuine changes in bad rate, and the resulting bad-rate profile is the steepest of the three.
With a 5% minimum leaf size and at most five leaves, this is exactly what `optbinning` does for a single feature, minus the mixed-integer monotone constraint [@navas2020optimal].

The CART splitting rule used here is derived in @sec-ch11-splits; the same impurity criterion underlies decision-tree binning in production.

In production we would feed the supervised cuts into an optimizer that adds a monotone-event-rate constraint per feature, the way `optbinning` and `scorecardpy` do. For a one-feature walkthrough the supervised cuts are usually adequate; we use rounded, policy-readable boundaries instead so the WoE arithmetic in the next step stays legible.

The `bad_rate` column is monotone increasing across the five `DTI` bins, which is the property the binning was supposed to deliver. If a middle bin's bad rate dipped below its lower neighbor, we would merge it with the neighbor with the closer rate and refit. The supervised tree above produces a similar monotone profile on this draw; the manual cuts win here on readability, not on bad-rate fidelity.

#### Step 3. Compute WoE and IV by hand

WoE compares a bin's share of all goods to its share of all bads (@eq-woe-def). Let $G$ and $B$ be portfolio totals of goods and bads. For bin $k$, $\mathrm{WoE}_k = \log( (g_k/G) / (b_k/B) )$ where $g_k$, $b_k$ are bin counts. Information value (@eq-iv-def) sums the bin-level signal weighted by the gap between good-share and bad-share.

Reading this table left to right: the safest `DTI` bin has positive WoE (more goods per bad than the portfolio average), the riskiest bin has negative WoE, and the IV contributions are all positive because in every bin the share gap and the WoE carry the same sign. The summed IV lands above the 0.5 "suspiciously strong" threshold; in real data, that would prompt a leakage check, but here it is expected because the synthetic generator made `DTI` a dominant driver of the true PD.

#### Step 4. Bin the categorical feature

For `employment_type` the bins are the four observed levels; no boundaries to choose. We compute the same WoE table.

`unemployed` carries the most negative WoE (it is the highest-bad-rate level), `salaried` the most positive. If two levels had nearly identical WoE we could collapse them to reduce degrees of freedom; here the four levels separate cleanly.

#### Step 5. Replace each raw value with its bin's WoE

This is @eq-woe-encode applied row by row. After this step, every column the logistic regression sees is already on a log-odds-ratio scale, so the regression coefficient on each WoE column is dimensionless and comparable across features.

#### Step 6. Fit the logistic regression on the two WoE columns

The design matrix has three columns: an intercept and two WoE features. Fitting via the from-scratch IRLS in @sec-ch07-impl returns the same coefficients as `statsmodels`.

To confirm the from-scratch solver, fit the same design matrix with `statsmodels.Logit` and print both coefficient vectors plus the max absolute deviation. Anything above 1e-6 means the IRLS implementation has a bug; here it sits at machine precision.

The first three rows show the IRLS and `statsmodels` coefficients agree to roughly 1e-12. The standard errors come from the diagonal of $(X^\top \widehat W X)^{-1}$ evaluated at $\hat\beta$, which is the same Hessian the IRLS loop already computed; `statsmodels` returns it for free, so we report it here rather than re-derive it.

Both slope coefficients land near $-1$. The sign is negative because under @eq-woe-def a positive WoE marks a safer bin, and the logit of *default* should fall as the bin gets safer; the unit magnitude is the sanity check from @sec-ch07-scorecard, namely that the binning is faithful to the underlying log-odds-ratio so the regression has very little extra work beyond aggregating the two WoE channels.

#### Step 7. Map coefficients to points per bin

Apply the FICO-style scaling `(base_score=600, base_odds=50, pdo=20)` from @eq-points-per-bin. The factor and offset are computed once; the bin-level points then drop out as $-B \beta_j \mathrm{WoE}_{jk} + (A - B \beta_0)/p$ with $p = 2$ characteristics here.

The two tables together are the entire scorecard a credit officer would see. Column `Points` is the number a row earns when its applicant falls in that bin. Higher points = safer, by the convention chosen here.

#### Step 8. Score one applicant end to end

Pick a single applicant and trace the arithmetic from raw inputs to total score and PD.

Two things to notice. First, the total score equals `offset - factor * eta` to the last decimal, which is the algebraic identity the bin tables were built to satisfy: summing the per-bin points reproduces the affine transform of the linear predictor. Second, this applicant's PD is close to the portfolio average because their DTI sits in the middle bin and their employment level is the median-risk level, so neither feature pushes the score far from the intercept. A second applicant with `DTI=0.05` and `employment_type="salaried"` would gain roughly `-factor * beta_dti * (WoE_safest_DTI - WoE_middle_DTI)` (a positive number, because `beta_dti` is negative) plus the equivalent employment delta; that is the exact mechanism by which an underwriter explains why one file approves and another does not.

#### What this example is not

The walkthrough uses hand-picked bin edges so the arithmetic stays legible. A production scorecard would use `optbinning` or chi-merge to find boundaries, enforce minimum bin counts, enforce monotonicity in the bad rate, and split out a holdout for IV stability. It would also run a separation check before fitting to flag bins with zero bads. The shape of the pipeline (raw -\> bin -\> WoE -\> logit -\> points) is identical; only the boundary-selection step gets replaced. The full pipeline run on Taiwan default appears in @sec-ch07-impl.
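As a compact companion to the fit in Step 6 and to the overflow guard discussed under the IRLS failure modes, here is a minimal NumPy sketch of Newton/IRLS with step halving. It is not the book's `irls_logit` from @sec-ch07-impl: `stable_sigmoid`, `log_lik`, and `irls_logit_sketch` are hypothetical names, and the two "WoE" columns are simulated Gaussians with assumed true slopes of $-1$ rather than the binned `DTI` and `employment_type` columns.

```python
import numpy as np

def stable_sigmoid(eta):
    """Logistic CDF that only ever exponentiates a non-positive number."""
    eta = np.asarray(eta, dtype=float)
    e = np.exp(-np.abs(eta))              # in [0, 1], cannot overflow
    return np.where(eta >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def log_lik(beta, X, y):
    eta = X @ beta
    # log(1 + exp(eta)) evaluated stably through logaddexp
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def irls_logit_sketch(X, y, max_iter=50, tol=1e-10):
    """Newton-Raphson for logistic regression with step halving."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = stable_sigmoid(X @ beta)
        w = pi * (1.0 - pi)                       # diagonal of W
        score = X.T @ (y - pi)                    # U(beta)
        info = X.T @ (X * w[:, None])             # X' W X
        step = np.linalg.solve(info, score)
        ll_old, t = log_lik(beta, X, y), 1.0
        # Halve the step while it lowers the log-likelihood.
        while t > 1e-8 and log_lik(beta + t * step, X, y) < ll_old - 1e-9:
            t *= 0.5
        beta = beta + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    pi = stable_sigmoid(X @ beta)
    cov = np.linalg.inv(X.T @ (X * (pi * (1.0 - pi))[:, None]))
    return beta, np.sqrt(np.diag(cov))            # estimates, standard errors

# Simulated stand-ins for the two WoE columns (assumption, not book data).
rng = np.random.default_rng(0)
n = 4000
X = np.column_stack([np.ones(n),
                     rng.normal(0.0, 0.8, n),
                     rng.normal(0.0, 0.5, n)])
true_beta = np.array([-2.0, -1.0, -1.0])
y = rng.binomial(1, stable_sigmoid(X @ true_beta)).astype(float)

beta_hat, se = irls_logit_sketch(X, y)
score_at_mle = X.T @ (y - stable_sigmoid(X @ beta_hat))
print(np.round(beta_hat, 3), np.round(se, 3))
```

Because `stable_sigmoid` exponentiates `-abs(eta)`, both branches of `np.where` are safe to evaluate, which a naive two-branch `np.where` is not. The score vector at the returned estimate should be numerically zero, which is a cheap convergence check to run alongside the `statsmodels` comparison.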
## Scaling: points to double the odds

### The PDO formula

A scorecard converts the model's log-odds into integer points such that the score is easy to read and stays stable across portfolios. The conventions are fixed by two parameters:

- `base_score`: the points assigned to a reference applicant whose odds of being **good** (non-default) equal `base_odds`.
- `pdo`: "points to double the odds" is the number of points a score must gain for the good-bad odds to double.

Let $o(s) = (1 - p(s))/p(s)$ be the odds that an applicant with score $s$ is good, where $p(s)$ is the applicant's PD. Linearity requires

$$ s = A + B \log o $$

for some constants $A$ (offset) and $B$ (factor). Doubling the odds means $\log o$ increases by $\log 2$. The definition of PDO says the associated increase in $s$ is `pdo`:

$$ \mathrm{pdo} = B \log 2 \ \Longrightarrow\ B = \mathrm{pdo} / \log 2. $$

Anchoring the score at $s = \mathrm{base\_score}$ when $\log o = \log(\mathrm{base\_odds})$ gives

$$ \mathrm{base\_score} = A + B \log(\mathrm{base\_odds}) \ \Longrightarrow\ A = \mathrm{base\_score} - B \log(\mathrm{base\_odds}). $$

For the FICO-style `(base_score=600, base_odds=50, pdo=20)` convention: $B = 20 / \log 2 \approx 28.8539$ and $A = 600 - B \log 50 \approx 487.1229$.

### Points per bin

Under logistic regression on WoE-encoded features, $\log(p_i/(1-p_i)) = \beta_0 + \sum_j \beta_j \mathrm{WoE}_{ji}$, hence

$$ \log o_i = -\beta_0 - \sum_j \beta_j \mathrm{WoE}_{ji}. $$

Substituting into @eq-score-linear:

$$ s_i = A - B \beta_0 - B \sum_j \beta_j \mathrm{WoE}_{ji} = \Big(\frac{A - B\beta_0}{p}\Big) p + \sum_j \big(-B \beta_j \mathrm{WoE}_{ji}\big), $$

where $p$ is the number of characteristics. The bin-level point contribution is

$$ \mathrm{points}_{jk} = -B \cdot \beta_j \cdot \mathrm{WoE}_{jk} + \frac{A - B \beta_0}{p} $$

with total score $s_i = \sum_{j=1}^p \mathrm{points}_{j, k(i,j)}$ where $k(i,j)$ is the bin applicant $i$ falls into for feature $j$.
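The offset, factor, and per-bin points formula above can be verified numerically. The sketch below is an assumption-laden illustration: `pdo_constants` and `points_table` are hypothetical helpers (not the `scorecard_points` function in `creditutils.py`), and the coefficients and WoE values are invented for the check.

```python
import math

def pdo_constants(base_score=600.0, base_odds=50.0, pdo=20.0):
    """Return (offset A, factor B) for s = A + B * log(good:bad odds)."""
    factor = pdo / math.log(2.0)
    offset = base_score - factor * math.log(base_odds)
    return offset, factor

def points_table(beta0, betas, woe_by_feature, offset, factor):
    """points_jk = -B * beta_j * WoE_jk + (A - B * beta0) / p."""
    p = len(betas)
    intercept_share = (offset - factor * beta0) / p
    return [[-factor * b * w + intercept_share for w in woes]
            for b, woes in zip(betas, woe_by_feature)]

A, B = pdo_constants()

# Invented fit: y = 1 is default, so slopes on good-over-bad WoE are negative.
beta0 = -2.0
betas = [-1.0, -0.9]
woe_by_feature = [[1.2, 0.1, -0.8],   # feature 0: safest bin first
                  [0.6, -0.4]]        # feature 1
pts = points_table(beta0, betas, woe_by_feature, A, B)

# One applicant: riskiest bin of feature 0, safest bin of feature 1.
bins = [2, 0]
score = sum(pts[j][k] for j, k in enumerate(bins))
eta = beta0 + sum(betas[j] * woe_by_feature[j][k] for j, k in enumerate(bins))

print(round(A, 4), round(B, 4), round(score, 2))
```

Summing the per-bin points reproduces `A - B * eta` exactly, and moving from odds of 50 to odds of 100 adds `pdo` points by construction; both identities make convenient unit tests for any scaling helper.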
The $(A - B\beta_0)/p$ term spreads the intercept evenly across characteristics so that each feature contributes a clean per-bin number. Sign conventions vary: most credit shops use "higher points = safer". With $y=1$ = default and the good-over-bad WoE of @eq-woe-def, the fitted $\beta_j$ is negative, so a risky bin (large negative WoE) earns negative points from $-B \beta_j \mathrm{WoE}_{jk}$ and a safe bin earns positive points. The `scorecard_points` helper in `creditutils.py` implements this mapping.

### Cutoff reasoning

Given a desired approval rate $\alpha$ and a loss tolerance, the cutoff score $s^*$ is the quantile such that above-cutoff applicants yield an expected bad rate below the target. Because $s$ is affine in $\log o$ and $\log o$ is a monotone function of $p$, picking a cutoff on points is equivalent to picking a PD threshold, but points are what analysts actually use in policy discussions.

#### Why PDO scaling has survived

The PDO convention is not a mathematical requirement. It survived because it solves three non-technical problems at once. First, integers compress better in legacy core-banking systems than floats, and once upon a time every byte mattered. Second, the 20-points-per-doubling rule maps neatly onto human intuition: a 40-point gap means odds quadruple, which is the kind of magnitude that lending officers can discuss without a calculator. Third, portability across portfolios is easier when each lender uses the same PDO; although the absolute anchor point differs, the "points per doubling" semantics are shared across FICO, VantageScore, and most in-house scorecards.

That said, scaling conventions vary. Some shops use `base_score=500, base_odds=20, pdo=20`; others use `600, 50, 20`. The arithmetic is identical up to a global affine shift. The only thing that matters in practice is that the scorecard's master-scale mapping (points to rating grade) is recalibrated whenever the scaling constants change.
Getting this wrong once, and shipping mis-anchored scores to downstream pricing engines, is how a lender burns several million dollars before noticing.

#### Master-scale mapping

For IRB portfolios, the score must be discretized into a master scale of rating grades with pre-defined PD midpoints. The master scale is a single table that every credit risk system in the bank agrees on.

A typical master scale has 10 to 22 grades. Bin boundaries are set so each grade contains roughly equal obligor counts on the training set and so the pooled default rate in each grade monotonically increases. @eba2017gl requires the grades to be distinct, ordered, and sufficiently granular that no two adjacent grades have overlapping 95% confidence intervals on their default rate. The score in points gives us a clean axis to draw these boundaries on. The IRB capital function the master scale feeds into is derived in @sec-ch05-regulation.

#### Negative points and policy overrides

Some scorecards use a signed convention where "safer" applicants get higher scores. Others reverse it. The `reverse_scorecard=True` flag in `optbinning.Scorecard` picks the direction; once set, keep it fixed for the life of the scorecard, because monitoring dashboards and override rules depend on the sign.

Policy overrides (e.g., "deny anyone with a recent bankruptcy regardless of points") sit outside the arithmetic, but live in the same deployment pipeline. Good practice is to encode every override in a rule table that is versioned alongside the scorecard artifact.

## Regularization

Regularization plugs into a single stage of the workflow: the *fit* box in @fig-ch07-pipeline. Bins, WoE values, points scaling, and downstream calibration are all unchanged by the choice of penalty. What the penalty controls is which $\beta$ vector IRLS converges to when the unpenalized objective is ill-conditioned, separated, or overfit to the training vintage.

Why regularize logistic regression at all? Three reasons.
First, credit features are correlated. Payment status, utilization, and recent delinquencies share variance. Without regularization, coefficients can be noisy even at $n$ in the hundreds of thousands. Second, unpenalized MLE diverges under quasi-separation, which happens whenever an optimal-binning run produces a bin with zero bads. Third, regularization improves out-of-time performance on shifted populations, a practical concern for credit scorecards that see macro cycles the training data did not.

Three triage rules for *when* to regularize, mapped onto the same workflow:

1. *After binning, before fitting,* if any bin has fewer than \~30 bads or zero bads. Quasi-separation will make unpenalized IRLS diverge or produce wildly large coefficients. Use L2 with a modest `C` (e.g. `C = 1.0` to `4.0` in sklearn) by default; this is the cheapest fix and matches what monotone optimal binning expects downstream.
2. *Before fitting,* if your candidate feature pool has more than \~3x the features you intend to keep. Use L1 to do selection, then refit L2 on the survivors. The two-stage approach is what production teams ship because the L2 refit produces stable coefficients, and the L1 stage produces an auditable selection trail.
3. *During the out-of-time check,* if the recent-vintage AUC is materially worse than CV AUC. This is a sign the unregularized model has memorized vintage-specific noise. Increase $\lambda$ until the gap closes; the *Picking* $\lambda$ subsection below has the rule.

### L1 (lasso)

@tibshirani1996regression introduced the lasso penalty:

$$ \hat\beta^{L1} = \arg\min_\beta \Big\{- \ell(\beta) + \lambda \sum_{j=1}^p |\beta_j| \Big\}. $$

L1 induces sparsity because the sub-differential of $|\cdot|$ at zero is the interval $[-1, 1]$: any coefficient whose partial derivative of the unpenalized loss is below $\lambda$ in magnitude is set to zero.
Coordinate descent is the standard solver [@friedman2010regularization]; large-scale L1 logistic uses interior-point methods [@koh2007interior] or the LARS-IC path [@park2007l1].

In credit scoring, L1 is useful when your candidate feature pool is much larger than your stable signal set. It drops characteristics that do not survive cross-validation.

### L2 (ridge)

@lecessie1992ridge formalized ridge logistic:

$$ \hat\beta^{L2} = \arg\min_\beta \Big\{- \ell(\beta) + \tfrac{\lambda}{2} \sum_{j=1}^p \beta_j^2 \Big\}. $$

L2 shrinks coefficients smoothly, never to exactly zero. The penalized Hessian $X^\top W X + \lambda I$ is always invertible, which solves the separation problem and stabilizes IRLS. For WoE-encoded features whose effective degrees of freedom are low, modest L2 is usually enough. sklearn's default `penalty='l2'` with `C=1` is a reasonable starting point on WoE models.

### Elastic net

@zou2005regularization combined the two:

$$ \hat\beta^{EN} = \arg\min_\beta \Big\{- \ell(\beta) + \lambda_1 \sum |\beta_j| + \tfrac{\lambda_2}{2} \sum \beta_j^2 \Big\}. $$

Elastic net keeps groups of correlated features together (unlike lasso, which picks one and drops the rest) while still doing selection. Credit models with highly correlated behavioral variables (e.g. payment history lags) benefit.

### Stability selection

Coefficient stability matters as much as accuracy. @meinshausen2010stability proposed sub-sampling the data, fitting lasso at a grid of penalties, and counting how often each feature is selected. Features with high selection probability across samples are kept. Practitioners use this routinely to prune candidate pools before fitting the production scorecard.

### The Bayesian view

Ridge logistic is the MAP estimate under a Gaussian prior: $\beta_j \sim \mathcal{N}(0, \sigma^2)$ with $\lambda = 1/\sigma^2$. Lasso is the MAP under a Laplace prior. Elastic net is the MAP under a prior whose log-density combines the two.
Treating the penalty as a prior has a practical payoff: @gelman2008prior show that a weakly informative Cauchy$(0, 2.5)$ prior on standardized coefficients acts as a default that prevents separation without meaningfully biasing large effects. In Python, the Cauchy prior itself is available through `pymc`; sklearn's L2 with a modest `C` is a Gaussian-prior approximation to it. Credit modelers who ship Bayesian scorecards get credible intervals on the points directly, which makes governance reviews easier.

### Picking $\lambda$

The cross-validated AUC curve is usually flat across a factor of ten in $\lambda$. Two rules of thumb narrow the choice.

1. The 1-standard-error rule [@hastie2009elements]: pick the largest $\lambda$ (the strongest penalty) whose CV-AUC is within one standard error of the best. This delivers a sparser, more stable model with negligible accuracy cost.
2. The out-of-time AUC rule: hold out the most recent vintage, fit on earlier data, pick $\lambda$ that maximizes AUC on the recent vintage. This is closer to the deployment distribution than random-fold CV and usually selects slightly stronger regularization.

In practice, we do both and pick the larger $\lambda$ of the two.

### Coefficient sign constraints

Business and regulatory rules often require certain coefficients to have a known sign. For example, "longer credit history should not *lower* the score" is both a common-sense constraint and a defensible anti-discrimination argument.

Two implementations:

1. **Binning-level enforcement** via monotone WoE constraints. This is the preferred approach when the variable is numeric. If WoE is monotone in the feature, then the scorecard points are also monotone in the feature, regardless of the LR coefficient sign.
2. **Optimization-level enforcement.** Fit penalized logistic regression with a linear equality or inequality constraint on $\beta_j$.
`cvxpy` (a Python domain-specific language for disciplined convex programs that compiles to ECOS, SCS, or a commercial solver) or a projected gradient descent step handles this in a few lines. The downside is that a constraint binding at $\beta_j = 0$ signals that the model wants a different sign than policy allows; the right response is to drop the feature, not to fight the data.

#### Newton-Cholesky vs SAGA

sklearn offers several solvers. For credit-sized L2 problems (p \< 1000, n \< 10M), `lbfgs` or `newton-cholesky` is the fastest. For L1 or elastic-net penalties, `saga` is the only general choice, and `liblinear` works for the L1 + binary case. When benchmarking on a laptop, `newton-cholesky` (added in scikit-learn 1.2) typically matches statsmodels' IRLS in speed and produces coefficients that agree to 1e-6.

### When does each help?

@tbl-ch07-penalty-regimes maps the most common credit-modeling regimes to a default penalty choice. The rows are not exhaustive, but each captures a situation that recurs in practice.

| Regime | Recommended penalty |
|------------------------------------|------------------------------------|
| Small WoE scorecard (20 features, 50k obs) | L2, `C = 1.0 - 4.0` |
| Large raw-feature logistic (500+ candidates) | L1 then refit L2 on survivors |
| Correlated behavioral signals | Elastic net |
| Bayesian prior on coefficients | L2 with calibrated $\lambda$ [@gelman2008prior] |
| Production with legal sign constraints | L2 + projection onto sign cone, or monotonic binning upstream |

: Default penalty choice by credit-modeling regime.

Triage rules at the start of @sec-ch07-regularization (bin sparsity, candidate pool size, OOT gap) decide *whether* to regularize; this table decides *which* penalty once the answer is yes.

In all cases, tune $\lambda$ on the training window with a cross-validation scheme that matches the data structure, then confirm the penalty choice on an out-of-time validation set.
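The "L1 then refit L2 on survivors" row of the table can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the data are simulated (3 signal columns, 27 pure-noise columns), the `C` values are picked by hand rather than tuned by cross-validation, and nothing here is taken from the book's own pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Simulated candidate pool: 3 real signals followed by 27 noise features.
rng = np.random.default_rng(0)
n, p_signal, p_noise = 5000, 3, 27
X = rng.normal(size=(n, p_signal + p_noise))
true_beta = np.r_[np.ones(p_signal), np.zeros(p_noise)]
eta = -2.0 + X @ true_beta
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Penalties are scale-sensitive, so standardize before fitting.
Xs = StandardScaler().fit_transform(X)

# Stage 1: a strong L1 penalty (small C) zeroes out weak candidates.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.01,
                        random_state=0)
l1.fit(Xs, y)
keep = np.flatnonzero(l1.coef_.ravel() != 0.0)

# Stage 2: refit the survivors with a modest L2 penalty for stable betas.
l2 = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0, max_iter=1000)
l2.fit(Xs[:, keep], y)

print("kept columns:", keep.tolist())
print("refit coefficients:", np.round(l2.coef_.ravel(), 3))
```

Keeping the list of surviving column indices from stage 1 gives the auditable selection trail, while the stage 2 coefficients are the ones that would be mapped to points.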
The inner CV is for hyperparameter selection; the OOT set is the time-shift check. Three cases cover most credit data:

1. **Independent obligor-snapshots** (one row per borrower, single performance window). Use `StratifiedKFold` on the label. Random folds are safe because every row already shares the same observation and performance frame, so there is no temporal channel through which information can leak between folds.
2. **Panel data with repeating obligors** (same borrower appears in multiple snapshots, e.g. monthly behavioral scoring). Random K-fold leaks: the same borrower can land in both train and validation folds, inflating CV AUC. Use `StratifiedGroupKFold` and pass `groups=borrower_id` to its `split` call (or to the cross-validation helper that wraps it) so all rows for a given borrower stay in the same fold, and stratification still balances bads across folds.
3. **Long training window spanning macro regimes** (multiple vintages, visible cycle inside the training period). If you want the inner CV to mirror the deployment condition rather than the within-window condition, use `TimeSeriesSplit` (rolling or expanding origin) so each validation fold is later in time than its training fold. This is closer to OOT but costs statistical efficiency; reserve it for cases where the training window itself is non-stationary.

The default for a textbook scorecard built on a single application vintage is case 1. Cases 2 and 3 are the situations where "stratified K-fold" without further qualification quietly overstates performance.

## Calibration

Discrimination (AUC and Gini in @sec-ch04-auc, KS in @sec-ch04-ks) tells you whether the score ranks bads above goods. Calibration tells you whether the predicted PD equals the observed default rate (@sec-ch04-brier). A lender needs both. Miscalibrated scores damage pricing, capital, and loss provisioning regardless of AUC.

### Reliability diagram

Partition the score into equal-quantile bins. Plot $\bar p_k$ (mean predicted PD within bin $k$) against $\bar y_k$ (observed default rate within bin $k$).
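The binning step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the chapter's own code: the helper name `reliability_bins`, the ten-bin default, and the simulated PDs are all hypothetical.

```python
import numpy as np

def reliability_bins(p_hat, y, n_bins=10):
    """Per equal-quantile bin: mean predicted PD, observed default rate, count."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    # Sort by predicted PD, then cut the sorted index into near-equal-count chunks.
    order = np.argsort(p_hat, kind="stable")
    chunks = np.array_split(order, n_bins)
    p_bar = np.array([p_hat[idx].mean() for idx in chunks])
    y_bar = np.array([y[idx].mean() for idx in chunks])
    counts = np.array([len(idx) for idx in chunks])
    return p_bar, y_bar, counts

# Simulated example (assumption): outcomes are drawn from the stated PDs,
# so the points should sit close to the identity line.
rng = np.random.default_rng(0)
p_true = rng.beta(2, 8, size=20_000)
y = rng.binomial(1, p_true)
p_bar, y_bar, counts = reliability_bins(p_true, y)
bin_gap = np.abs(p_bar - y_bar)  # vertical distance from the diagonal, per bin
```

Plotting `p_bar` against `y_bar` gives the diagram; `bin_gap` is the per-bin quantity that the calibration metrics later in the chapter summarize.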
A perfectly calibrated model lies on the identity line. @dawid1982well gives the Bayesian foundation; @degroot1983comparison decompose the Brier score into calibration + refinement, which underlies the metric toolkit in @sec-ch04-brier. @fig-ch07-reliability-taiwan shows what the diagram looks like in practice on Taiwan default for the uncalibrated, Platt, and isotonic versions of the same logistic regression.

### Platt scaling

@platt1999probabilistic introduced a two-parameter sigmoid recalibration (slope and intercept) originally for SVMs: fit a logistic regression of $y$ on the raw score $\eta$. For logistic regression it amounts to refitting the intercept and slope, which is nearly a no-op unless the training population is mis-weighted (stratified sampling, re-weighting for imbalance). The Platt curve in @fig-ch07-reliability-taiwan is visibly closer to the diagonal than the uncalibrated one in the middle deciles, with no movement at the endpoints; that is the signature of a two-parameter sigmoid fit.

### Isotonic

@zadrozny2002transforming fit an isotonic (monotone non-decreasing) step function that minimizes mean squared error between predicted and observed PD on a calibration sample. Isotonic is more expressive than Platt and handles S-shaped miscalibration that sigmoids cannot. Cost: higher variance on small calibration sets. The isotonic line in @fig-ch07-reliability-taiwan is visibly more responsive to local deviations than Platt; @tbl-ch07-brier-decomposition quantifies the trade-off via the Brier reliability and resolution components.

### Beta calibration

@kull2017beta proposed a three-parameter family that generalizes Platt and corrects for S-shaped, L-shaped, or U-shaped miscalibration. Use it when the reliability diagram shows an asymmetric deviation (the kind of S-shape the isotonic fit picks up in @fig-ch07-reliability-taiwan) but the calibration set is only medium-sized, so isotonic would over-fit. @niculescu2005predicting is the canonical empirical comparison.
The summary, adapted for credit: logistic regression on enough data is usually well calibrated out of the box; calibration pays off when the training population does not match deployment (policy changes, re-weighting) or after tree ensembles.

### Temperature scaling and confidence calibration

@guo2017calibration popularized temperature scaling for neural networks: divide the logit by a scalar $T > 0$ learned on the validation set. For logistic regression, it collapses to rescaling the slope. In credit scorecards, this is useful when the score is produced by a stacked model whose top layer is not itself a logit (think SHAP-stacked trees fed into a ranker); a temperature-scaled calibrator then turns the raw margin into a probability without touching the base model. Temperature scaling is a special case of Platt with the intercept shift fixed at zero, so only the slope $1/T$ is learned. The runnable demo in @sec-ch07-temperature-demo fits $T$ by 1-D minimization of validation NLL on the Taiwan logits and produces the figure showing $T^*$ landing near 1 (as expected for a base model whose log-likelihood is already at its maximum).

### Choosing the calibration method

A short decision tree, drawn in @fig-ch07-calibration-decision and then run as code in @tbl-ch07-calibration-recommendations:

1. Logistic regression on a representative training sample, modest regularization, sample above 20,000 obligors: no calibration. The MLE is already calibrated in-sample by construction.
2. Logistic regression on stratified sample: apply the @king2001logistic intercept correction, no Platt needed.
3. Tree ensemble or calibrated sigmoid needed: Platt first, isotonic if the reliability diagram still shows S-shape.
4. Small calibration set (below 1,000): Platt or beta calibration. Isotonic over-fits on small samples.
5. Miscalibration is asymmetric in the tails: beta calibration [@kull2017beta] or an isotonic fit with care taken at the endpoints.
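One way the five rules could be written down as a lookup is sketched below. The function name, its four arguments, the use of a single `n_samples` for both size thresholds, and the order in which the rules are checked are assumptions made for illustration; the source lists the rules without fixing a precedence.

```python
def recommend_calibrator(base_model, stratified_sample, n_samples, asymmetric_tails):
    """Map a modeling scenario to a calibrator label (hypothetical encoding).

    base_model        : "lr" for logistic regression, anything else for a tree ensemble
    stratified_sample : True if the training sample over-represents bads
    n_samples         : size of the sample available for fitting / calibration
    asymmetric_tails  : True if the reliability diagram deviates asymmetrically
    """
    if base_model == "lr":
        if stratified_sample:
            return "king-zeng intercept correction"      # rule 2
        if n_samples > 20_000 and not asymmetric_tails:
            return "none"                                # rule 1
    if n_samples < 1_000:
        return "beta" if asymmetric_tails else "platt"   # rule 4
    if asymmetric_tails:
        return "beta"                                    # rule 5
    return "platt, then isotonic if S-shape persists"    # rule 3

scenarios = {
    "big retail LR": ("lr", False, 250_000, False),
    "stratified bad-oversample LR": ("lr", True, 40_000, False),
    "GBM, small calibration set": ("gbm", False, 800, False),
    "GBM, asymmetric tails": ("gbm", False, 30_000, True),
}
recommendations = {name: recommend_calibrator(*args) for name, args in scenarios.items()}
```

Keeping the rules in a plain function makes the choice reviewable: a validator can read the branch that fired for a given scenario instead of reconstructing it from a diagram.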
The same tree, encoded as a function, lets us tag a list of representative scenarios with the recommended calibrator and then verify the recommendation against held-out Brier on the Taiwan test set. @tbl-ch07-calibration-recommendations runs this end-to-end after the Taiwan calibration demo below; the function takes four inputs that match the diamonds in @fig-ch07-calibration-decision and returns the leaf label. ### Calibration metrics beyond Brier Three alternatives appear in regulatory validation docs. - **Expected Calibration Error (ECE).** Weighted mean absolute gap between bin-average PD and bin-average default rate, with weights proportional to bin count. - **Maximum Calibration Error (MCE).** Worst-case bin gap. Used as a conservative upper bound. - **Hosmer-Lemeshow goodness-of-fit test.** Chi-square on deciles of predicted PD [@hosmer2013applied]. A low p-value flags miscalibration; under SR 11-7 a bank is expected to act on that signal. In practice, ECE with ten deciles plus a reliability plot is the combination you will see in most validation packages. @sec-ch07-ece-mce-hl runs all three on the Taiwan PDs from @fig-ch07-reliability-taiwan and reports them in @tbl-ch07-ece-mce-hl. ### Base rate drift and recalibration A recurring production issue is base-rate drift. Your scorecard predicts 4% default but the current vintage is running at 6%. Options: 1. *Affine recalibration (cheap).* Shift intercept by $\log(6/94) - \log(4/96)$. Keeps the ranking, adjusts the level. Defensible when the ranking KS/AUC on the new vintage is still acceptable. 2. *Platt recalibration (cheaper than refit).* Re-learn intercept and slope on a held-out recent vintage. Defensible when the ranking is slightly compressed but still correct on ordering. 3. *Full refit.* When CSI on a dominant feature exceeds 0.25 or when KS drops by more than 10%. Requires full revalidation. 
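The intercept shift in option 1 is plain log-odds arithmetic, sketched here with the standard library only. The helper names are hypothetical, and the 4% and 6% base rates are the ones from the example above.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def intercept_shift(base_old, base_new):
    """Log-odds offset that moves a PD equal to base_old onto base_new."""
    return logit(base_new) - logit(base_old)

def recalibrate_pd(pd_old, delta):
    """Apply the offset on the log-odds scale; the ranking of applicants is unchanged."""
    return sigmoid(logit(pd_old) + delta)

delta = intercept_shift(0.04, 0.06)   # log(6/94) - log(4/96)
shifted = [recalibrate_pd(p, delta) for p in (0.01, 0.04, 0.20)]
```

Because the shift is applied on the log-odds scale, a PD of exactly 4% maps to 6%, but the portfolio-average PD will land near, not exactly at, the new base rate; checking that average on the recent vintage is a sensible follow-up.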
## Ohlson's O-score @ohlson1980financial introduced the logit bankruptcy model that shifted corporate distress prediction off the discriminant-analysis path that @altman1968zscore (see @sec-ch06) had defined. Ohlson fitted a logistic regression on 105 bankrupt and 2058 non-bankrupt US firms over 1970-1976. The nine covariates, with Ohlson's estimated coefficients, are $$ \mathrm{O} = -1.32 - 0.407 \cdot \mathrm{SIZE} + 6.03 \cdot \mathrm{TLTA} - 1.43 \cdot \mathrm{WCTA} + 0.0757 \cdot \mathrm{CLCA} $$ $$ \quad - 1.72 \cdot \mathrm{OENEG} - 2.37 \cdot \mathrm{NITA} - 1.83 \cdot \mathrm{FUTL} + 0.285 \cdot \mathrm{INTWO} - 0.521 \cdot \mathrm{CHIN}. $$ where - $\mathrm{SIZE} = \log(\text{total assets}/\text{GNP deflator})$ - $\mathrm{TLTA} = \text{total liabilities}/\text{total assets}$ - $\mathrm{WCTA} = \text{working capital}/\text{total assets}$ - $\mathrm{CLCA} = \text{current liabilities}/\text{current assets}$ - $\mathrm{OENEG} = \mathbf{1}\{\text{total liabilities} > \text{total assets}\}$ - $\mathrm{NITA} = \text{net income}/\text{total assets}$ - $\mathrm{FUTL} = \text{funds from operations}/\text{total liabilities}$ - $\mathrm{INTWO} = \mathbf{1}\{\text{net income was negative in last two years}\}$ - $\mathrm{CHIN} = (NI_t - NI_{t-1})/(|NI_t| + |NI_{t-1}|)$. The one-year-ahead PD is $\Pr(\text{bankrupt}) = \sigma(\mathrm{O})$. Ohlson reported a Type I error of 12.4% at a 3.8% cutoff on his holdout. Later work reconfirmed on larger samples [@shumway2001forecasting; @campbell2008search] and extended the logit framework to multi-period hazard (@sec-ch09). The equation is presented here because it is an instructive example of how a logit with nine well-chosen ratios competes with modern ML on firm-level distress, and because many commercial credit-risk systems still use an O-score variant as a baseline. 
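To make the arithmetic concrete, the coefficients listed above can be applied to a row of ratios in a few lines. This is a minimal sketch: the two firms and their ratio values are invented for illustration, and the inputs are assumed to be already computed as defined in the list above.

```python
import math

# Coefficients as quoted in the text for Ohlson's nine-covariate model.
OHLSON_COEF = {
    "SIZE": -0.407, "TLTA": 6.03, "WCTA": -1.43, "CLCA": 0.0757, "OENEG": -1.72,
    "NITA": -2.37, "FUTL": -1.83, "INTWO": 0.285, "CHIN": -0.521,
}
OHLSON_INTERCEPT = -1.32

def o_score(ratios):
    """Linear index O for a dict of precomputed ratios."""
    return OHLSON_INTERCEPT + sum(OHLSON_COEF[name] * value for name, value in ratios.items())

def o_score_pd(ratios):
    """One-year PD as the logistic transform of O."""
    return 1.0 / (1.0 + math.exp(-o_score(ratios)))

# Hypothetical firms (illustrative values, not taken from any dataset).
healthy = dict(SIZE=5.0, TLTA=0.40, WCTA=0.25, CLCA=0.50, OENEG=0,
               NITA=0.08, FUTL=0.30, INTWO=0, CHIN=0.10)
stressed = dict(SIZE=3.0, TLTA=0.95, WCTA=-0.10, CLCA=1.60, OENEG=0,
                NITA=-0.12, FUTL=-0.05, INTWO=1, CHIN=-0.60)

pd_healthy = o_score_pd(healthy)
pd_stressed = o_score_pd(stressed)
```

The stressed firm's higher leverage, negative earnings, and falling net income all push the index up, so its PD comes out well above the healthy firm's.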
Below we first verify the arithmetic of @eq-oscore on a small synthetic panel, then refit the same specification on the UCI 572 Taiwanese Bankruptcy Prediction panel [@liang2016financial] (6,819 firm-years, 1999-2009) so the reader can see how Ohlson's 1980 sign pattern survives on out-of-sample public data. #### Why Ohlson matters Three things are remarkable about the O-score. First, Ohlson chose a logit specification when discriminant analysis was still the standard [@altman1968zscore]. His justification is econometric: discriminant analysis assumes multivariate normality of the covariates within each class, which financial ratios violate badly (they are fat-tailed, skewed, and mixed continuous-binary). The logit link drops that assumption and replaces it with a weaker one: that the log-odds of bankruptcy is linear in the covariates. Second, the sign pattern of @eq-oscore is the sign pattern every modern corporate-default model produces: leverage up, profitability down, liquidity down, volatility in NI up. A logit with nine ratios captures almost the entire story. Third, @shumway2001forecasting showed that Ohlson's one-year specification is biased because it treats each firm-year as independent when the same firm contributes multiple observations. The right object is a discrete-time hazard model (@sec-ch09-shumway), which can be estimated as a pooled logit with a time-varying hazard baseline. The O-score is the one-shot logit; the Shumway hazard is the panel-logit generalization (@sec-ch09-shumway). #### Reproducing Ohlson's diagnostics @ohlson1980financial reports a pseudo-$R^2$ of about 0.83 and Type I error of 12.4% at a classification cutoff of 3.8%. On his 105-bankrupt, 2058-healthy sample, that is a striking separation: roughly 88% of bankrupt firms are flagged one year in advance, at the price of a modest false-positive rate. The reason the model works so well is the feature choice. 
Leverage (TLTA), profitability (NITA), liquidity (WCTA, CLCA), funds from operations to liabilities (FUTL), a two-year-loss dummy (INTWO), and the scaled change in net income (CHIN) together capture the textbook theory of corporate distress [@beaver1966financial; @altman1968zscore]. The residual innovation in Ohlson's work is the use of log-size scaled by the GNP deflator, which standardizes across years and across the size distribution.

Modern replications typically add macro covariates (GDP growth, credit spreads) and time-varying covariates to turn the O-score into a discrete hazard model. @campbell2008search report that such hazard-model extensions are the gold standard for corporate distress in public equities.

## Implementation from scratch

### IRLS, matched against `statsmodels.Logit`

IRLS converges in a handful of iterations, and the coefficients agree with `statsmodels.Logit` to machine precision. The Fisher information lets us recover asymptotic standard errors as the square roots of the diagonal of its inverse.

### Points per bin by hand on a one-feature toy

The point delta between bins is exactly $-B \beta (\mathrm{WoE}_{k_1} - \mathrm{WoE}_{k_2})$, matching @eq-points-per-bin.

### Ohlson O-score demonstration

Firm `delta` (leverage above assets, negative working capital, sharp NI drop) lands with the highest PD; firm `gamma` (profitable, conservative leverage, improving NI) lands with the lowest. The arithmetic sign pattern reproduces @ohlson1980financial Table 4.

#### Refitting Ohlson on UCI 572 (public data)

The synthetic block above only checks that we can multiply Ohlson's coefficients by a row of ratios. The interesting question is whether the *specification* still works on data Ohlson never saw. The UCI 572 Taiwanese Bankruptcy panel ships nearly every Ohlson covariate by name, so we can map columns one-for-one and refit.
The two exceptions are `INTWO` (the UCI `Net_Income_Flag` column is constant on the released file, so it carries no information and we drop it) and `CHIN` (Ohlson's earnings-change ratio requires a $t-1$ observation, but UCI 572 is a single firm-year cross-section without a usable lag). All remaining ratios in UCI 572 are min-max scaled to $[0,1]$ by the publishers [@liang2016financial], so the *magnitudes* of the refit coefficients will not match Ohlson 1980 in absolute units. The *signs* should. The hold-out AUC sits above 0.9 with only seven covariates, in the same ballpark as the full 95-ratio classifiers @liang2016financial benchmark on this panel. Of the refit coefficients that are statistically distinguishable from zero (TLTA, WCTA, NITA, and the intercept), the sign of every one matches Ohlson's 1980 sign on US Compustat data: higher leverage pushes PD up, lower working capital pushes PD up, lower profitability pushes PD up. The two covariates whose refit sign disagrees with @ohlson1980financial (CLCA, FUTL) are not statistically significant here and so are not load-bearing for the classifier. The point of the exercise is not that one should ship Ohlson's 1980 *coefficients* on Taiwan 2009 firms (that would be coefficient transport without recalibration; see @sec-ch04-drift on PSI/CSI monitoring and @sec-ch04-oot on out-of-time validation). It is that the *feature set* @ohlson1980financial chose in 1980 still produces a usable bankruptcy logit on a different country and a different decade. Corporate-rating extensions of the Ohlson logit, including ordered-multinomial and hazard variants on rating grades, are treated in @sec-ch29. ## The standard library call We fit logistic regression three ways on the UCI Taiwan default data: `statsmodels.Logit` for inference, `sklearn.linear_model.LogisticRegression` for pipelines, and `optbinning.Scorecard` for the full points scorecard [@yeh2009comparisons]. ### Route A. 
`statsmodels.Logit` on standardized raw features ### Route B. `sklearn.LogisticRegression` (L2) ### Route C. Full `optbinning.Scorecard` pipeline The three routes agree on the qualitative story but differ at the second decimal of AUC. Optbinning's per-variable supervised discretization buys a small but real AUC lift, mostly through the `PAY_*` payment-status variables where the relation to default is sharply non-linear. That is the canonical credit-scorecard gain from WoE binning. ### Inspect the scorecard table Reading the `PAY_0` block: bins with later payment status have lower WoE (more bad signal) and negative points; bins showing timely payment have positive points. The sum of a row's `Points` across all features plus the implicit intercept points equals the total score for that applicant. ### Score cutoff policy The cutoff has two levers: approval rate and expected loss. Credit policy tunes both against pricing and origination targets; the scorecard is the shared ledger. #### Cutoff optimization as a profit calculation The cutoff should not be set by rule of thumb; it should solve an explicit expected-profit problem. Let $r$ be the risk-adjusted return on a good account, $L$ be the expected loss on a bad account (LGD times EAD), and $p(s)$ be the calibrated PD at score $s$. Expected profit per approved applicant is $$ \pi(s) = (1 - p(s)) \cdot r - p(s) \cdot L. $$ Solving $\pi(s^*) = 0$ gives the breakeven PD $p^* = r / (r + L)$, and the corresponding score is the break-even cutoff. In practice, you want positive expected profit plus a margin for model error, so the operational cutoff sits slightly above breakeven. @verbraken2014novel embed this logic inside a profit-based classifier metric, the Expected Maximum Profit (EMP), derived in @sec-ch04-emp and revisited in the benchmarking chapter (@sec-ch16). Cutoff choices interact with regulatory requirements in three places. 
(a) Fair-lending: the cutoff must not produce disparate impact on a protected class (@sec-ch24). (b) Regulatory capital: the cutoff ties into the portfolio's PD distribution, which feeds the IRB risk-weight function (@sec-ch05-regulation). (c) CECL / IFRS 9: the cutoff implicitly defines the provisioning split between Stage 1 (performing) and Stage 2 (significant increase in credit risk), so moving the cutoff moves the allowance for credit losses; the staging rules and ECL math sit in @sec-ch35. For all three reasons, cutoff changes need change-management sign-off even though the underlying scorecard is unchanged.

## Benchmark on German + Taiwan

### Regularization path on German

The L1 solution drops about a third of the one-hot indicators without losing test AUC, which is the intended behavior: lasso keeps the bins that carry information and discards redundant ones.

### Coefficient stability under L1

The path plot reads left to right: at strong regularization (`C` near 0.001) every coefficient is zero; as the penalty relaxes, coefficients enter one at a time. Status of existing checking account, duration, and credit history enter earliest, which is consistent with classical credit-scoring intuition [@hand1997statistical; @thomas2000survey].

### Stability selection on German

@meinshausen2010stability run lasso on many sub-samples of the data and keep features that get selected often. The implementation is short: draw a sub-sample of the training rows without replacement, fit L1 at a fixed `C`, record which coefficients are non-zero, repeat. We use 100 sub-samples at 50% draw size, which is the recipe in the original paper.

The 0.6 threshold is the @meinshausen2010stability default: features chosen in at least 60% of sub-samples are stable enough to ship. The shortlist usually overlaps with the indicators that entered the L1 path earliest, which is the consistency check between the two methods.
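The recipe can be sketched as follows. This is an illustrative version on simulated data, not the chapter's German-credit run: the function name, the choice `C=0.05`, the `liblinear` solver, and the synthetic design (three informative columns followed by nine noise columns) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_selection(X, y, C=0.05, n_subsamples=100, frac=0.5, seed=0):
    """Selection frequency of each feature under L1 logistic regression."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(frac * n)
    hits = np.zeros(p)
    for _ in range(n_subsamples):
        # Sub-sample rows without replacement (half the data by default).
        idx = rng.choice(n, size=m, replace=False)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[idx], y[idx])
        hits += np.abs(model.coef_.ravel()) > 1e-10
    return hits / n_subsamples

# Simulated data (assumption): only the first three columns drive the outcome.
rng = np.random.default_rng(0)
n, p = 600, 12
X = rng.standard_normal((n, p))
eta = 1.5 * X[:, 0] - 1.5 * X[:, 1] + 1.5 * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

freq = stability_selection(X, y)
stable = np.flatnonzero(freq >= 0.6)   # indices that clear the 0.6 threshold
```

On real scorecard inputs the features should be on a common scale (WoE or standardized) before the L1 fit, otherwise the penalty treats columns unequally and the selection frequencies are hard to compare.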
### Picking $\lambda$: 1-SE rule and out-of-time AUC `LogisticRegressionCV` reports the best `C` by mean CV-AUC, which is the *minimum-loss* rule. The two rules of thumb in the prose above are: (1) the 1-standard-error rule, which picks a sparser model whose CV-AUC is within one SE of the best; (2) the out-of-time rule, which scores each `C` on a held-out recent vintage rather than random folds. Both are short to implement. Note the sklearn convention: `C = 1/λ`, so "larger λ" in the prose corresponds to "smaller C" in the code. The 1-SE rule almost always returns a smaller `C` (stronger penalty) than the min-loss rule, and the resulting model carries fewer non-zero coefficients. On German credit the AUC penalty is typically under 0.005, well within governance noise. The out-of-time rule needs a vintage column. German credit does not ship one, so we synthesize a "vintage" by ordering the training rows and treating the last 25% as the recent block, then scoring each `C` on that block. Because we synthesized the vintage from an already-shuffled `train_test_split`, the "OOT" block is a random subsample rather than a true time shift, and the OOT AUC (≈0.81) actually *exceeds* the CV-AUC (≈0.76) instead of dropping below it. On this German-credit run that flips the usual ordering: OOT lands on a *larger* `C` than the 1-SE rule. With genuine vintage drift, OOT typically picks an equal-or-smaller `C` than 1-SE, because time shift penalizes models that leaned on training-window quirks. The deploy rule `C = min(1-SE, OOT)` is robust either way: it takes the more strongly regularized of the two, which is the "take the larger $\lambda$" recipe in the prose. ### Calibration demo on Taiwan On Taiwan, the uncalibrated logistic regression is already close to the 45-degree line because the training and test draws are homogeneous. Platt and isotonic both remove minor residual miscalibration in the middle deciles. 
The AUC is invariant to monotone transforms, so it stays the same; Brier improves slightly for both recalibration methods, as expected from the DeGroot-Fienberg decomposition [@degroot1983comparison]. ### Brier decomposition The Brier decomposition in @tbl-ch07-brier-decomposition shows where each calibration method spends its effort: Platt reduces reliability (the miscalibration term) without moving resolution much, while isotonic shaves both (it can reshape non-monotone residual patterns). ### ECE, MCE, and Hosmer-Lemeshow on Taiwan The three regulatory-grade summaries from the *Calibration metrics beyond Brier* list reduce to a few lines on top of the same quantile binning used for the reliability diagram. ECE and MCE share the bin-gap object $\bar p_k - \bar y_k$; the Hosmer-Lemeshow $\hat C$ statistic squares and standardizes it under a binomial null and compares the result to a $\chi^2_{B-2}$ reference [@hosmer2013applied]. Read the table column by column. ECE summarizes average miscalibration in the same units as PD; values under roughly 0.01 are acceptable for retail PD on a sample of this size. MCE is sensitive to a single bad bin and is the right metric when a tail miscalibration would mis-price the highest-risk segment. The Hosmer-Lemeshow p-value is the formal test; on a clean Taiwan split the uncalibrated logistic typically does not reject, and Platt and isotonic move the p-value upward by shrinking residual bin gaps. A rejecting p-value on a recalibrated model is a signal that the bin structure itself is wrong, for example because of a tied-score plateau, and that wider or rank-based bins are needed before the test can be trusted. ### Beta calibration on Taiwan The @kull2017beta family is $$ \mu(s; a, b, c) = \sigma\!\bigl(a \log s - b \log(1 - s) + c\bigr), $$ which collapses to Platt when $a = b$. Fit by stacking the two log-odds-like features $\log s$ and $-\log(1 - s)$ and running a two-feature logistic regression on the calibration sample. 
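A minimal sketch of that two-feature fit follows. The function name, the clipping constant, the large `C` used to approximate an unpenalized fit, and the simulated over-confident scores are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibrator(s, y, eps=1e-6):
    """Fit mu(s) = sigmoid(a*log(s) - b*log(1-s) + c) by logistic regression."""
    s = np.clip(np.asarray(s, dtype=float), eps, 1.0 - eps)
    feats = np.column_stack([np.log(s), -np.log(1.0 - s)])
    lr = LogisticRegression(C=1e6, max_iter=1000)   # large C: nearly unpenalized
    lr.fit(feats, y)
    a, b = lr.coef_.ravel()
    c = lr.intercept_[0]

    def calibrate(s_new):
        s_new = np.clip(np.asarray(s_new, dtype=float), eps, 1.0 - eps)
        z = a * np.log(s_new) - b * np.log(1.0 - s_new) + c
        return 1.0 / (1.0 + np.exp(-z))

    return (a, b, c), calibrate

# Simulated scores (assumption): the raw score doubles the true log-odds,
# so it is over-confident at both ends.
rng = np.random.default_rng(0)
p_true = rng.beta(2, 6, size=20_000)
y = rng.binomial(1, p_true)
s_raw = 1.0 / (1.0 + np.exp(-2.0 * np.log(p_true / (1.0 - p_true))))

cal, test = slice(0, 10_000), slice(10_000, None)   # calibration / hold-out halves
(a, b, c), calibrate = fit_beta_calibrator(s_raw[cal], y[cal])
brier_raw = np.mean((s_raw[test] - y[test]) ** 2)
brier_cal = np.mean((calibrate(s_raw[test]) - y[test]) ** 2)
```

Note that the original beta-calibration formulation constrains $a, b \ge 0$ to keep the map monotone; this sketch does not enforce that and only relies on the data producing positive values.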
The fit needs no external `betacal` dependency; `numpy` + `scikit-learn` are enough.

When $a$ and $b$ come out close to one and $c$ close to zero, Platt scaling already absorbed the available correction and the beta fit collapses to the identity. Asymmetric values, for instance $a$ noticeably larger than $b$, indicate that the model over-shoots in the high-PD tail more than it does in the low-PD tail; that is the regime where beta calibration earns its keep over a pure sigmoid.

### Temperature scaling on Taiwan

The @guo2017calibration recipe is one line of code once the base logits are in hand: minimize validation NLL over a single positive scalar $T$, then divide every deployment-time logit by $T^*$ before applying the sigmoid. With logistic regression the base logits already come from a likelihood maximizer, so $T^*$ is expected to land near 1 on a representative sample; the value of running the fit is to produce evidence for that claim and to have a deployable T-scaler ready when the base model is later swapped for a stacked or non-logit ranker.

On Taiwan the optimizer returns $T^*$ very close to 1 and Brier nearly identical to `pd_uncal`, which is the predicted behavior: temperature scaling re-optimizes a single parameter of the same likelihood that produced the base logits, so there is nothing to recover. The demo becomes useful in the stacked-model setting referenced in @sec-ch07-calibration: replace `raw_lr.decision_function` with the raw margin output of a non-logit ranker and the same six lines fit a deployable T-scaler without touching the base model.

### Niculescu-Mizil and Caruana on Taiwan

@niculescu2005predicting compared logistic regression, boosted trees (@sec-ch12-gbm), SVMs (@sec-ch13), random forests (@sec-ch12-bagging), and naive Bayes across eleven UCI datasets.
The headline is that boosted trees and SVMs produce sigmoid-distorted scores that Platt fixes cheaply, while logistic regression on a representative sample needs no help. The block below reproduces the spirit of that comparison on Taiwan default by fitting a logistic regression and a gradient boosted classifier, then applying each calibrator and tabulating Brier reliability and resolution. Two patterns from the table line up with the original Niculescu-Mizil and Caruana finding. First, the four logistic-regression rows have nearly identical Brier and reliability: the MLE is already calibrated on a representative training sample, so the calibrators have nothing to add. Second, the gradient-boosting rows show a visibly larger reliability gap in the *uncalibrated* row, and Platt closes most of it; isotonic and beta typically tie Platt on a sample this size and pull ahead only when the residual miscalibration is non-monotone or asymmetric. AUC is held constant within each base model by construction, since all three calibrators are monotone in the input score. ### Decision-tree recommendations applied to Taiwan The `prob_store` dictionary from the previous block holds the held-out probability vectors for each (base, calibrator) cell of @tbl-ch07-niculescu-mizil-credit. We encode the calibration decision tree from @fig-ch07-calibration-decision as a function and look up the resulting Brier from the right base-model column. Three patterns are visible. First, the "Big retail LR" and "Stratified bad-oversample LR" rows take the *no calibration* and *King-Zeng intercept* branches respectively and land at the LR base Brier (`0.146`) by construction; King-Zeng is an intercept-only shift, so it leaves Brier unchanged on a held-out sample drawn from the same population. 
Second, the gradient-boosting baseline on Taiwan is already at `0.135`, lower than LR; with a default rate of about 22% and 300 shallow trees, the boosted ensemble does not show the textbook S-shape that Platt was designed to fix. The three tree-ensemble rows therefore land within `±0.001` of the GBM baseline: Platt and beta hold Brier nearly flat, and isotonic is a hair worse here because it spends degrees of freedom fitting bin-level noise. The point of the table is *not* that calibration always helps but that the decision tree picks the lowest-risk calibrator for each regime; the empirical effect on any one dataset depends on whether the base model is already calibrated. This is the auditable artifact an SR 11-7 validator will ask for.

## Scalability

We scale the logistic fit from a single pandas call to an out-of-core fit and a PySpark MLlib fit on a 1M-row synthetic default dataset. The goal is to verify that AUC is recoverable at scale and to quantify the wall-clock tradeoff.

### Synthetic 1M-row generator

### sklearn SAGA on the full 1M rows

### PySpark MLlib on the same dataset (graceful fallback)

PySpark MLlib fits logistic regression in a distributed fashion. In environments without a JVM we fall back to a Dask out-of-core comparison so the chapter always renders.

On a real cluster, MLlib parallelizes the gradient accumulation of its quasi-Newton solver across workers and returns coefficients within a few minutes on a 1M-row, 5-column dataset on 4 cores. The AUC is within rounding distance of the sklearn SAGA fit because the MLE is the same estimand. The tradeoff is operational: sklearn SAGA is faster at 1M rows on one laptop; MLlib wins when the data does not fit in memory or when you want a Spark pipeline for downstream feature engineering. At 10M rows, sklearn with float32 and SAGA still works under 30 seconds if the features fit; beyond that PySpark MLlib or a GPU-based solver is the better path.
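The "same estimand" point can be checked on a single machine by fitting two different solvers to the same synthetic data and comparing hold-out AUC. The sketch below uses 50,000 rows rather than 1M to keep it quick; the generator, the coefficient vector, and the split are assumptions made for illustration, not the chapter's benchmark.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic default data (assumption): five standardized features, known coefficients.
rng = np.random.default_rng(0)
n, p = 50_000, 5
X = rng.standard_normal((n, p)).astype(np.float32)
beta = np.array([0.8, -0.6, 0.4, 0.0, 0.2])
eta = -2.0 + X @ beta
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

X_tr, X_te = X[:40_000], X[40_000:]
y_tr, y_te = y[:40_000], y[40_000:]

# Same L2-penalized likelihood, two optimizers.
auc = {}
for solver in ("lbfgs", "saga"):
    model = LogisticRegression(solver=solver, max_iter=200).fit(X_tr, y_tr)
    auc[solver] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

auc_gap = abs(auc["lbfgs"] - auc["saga"])
```

Any gap between the two AUCs here reflects optimizer tolerance rather than a modeling difference, which is the same reason a distributed fit and a single-node fit of the same specification should agree.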
### Dask pattern for out-of-core fitting For cases where even reading the full training set into memory is tight, Dask plus mini-batch logistic via `SGDClassifier.partial_fit` gives a streaming fit. The API pattern is: This is what teams reach for when running behavioral scorecards across multi-year customer panels and the full history cannot fit on a single node. ## Deployment The MLOps stack (FastAPI service contracts, container images, ONNX runtimes, MLflow registries, CI / shadow-deploy patterns) gets a full treatment in @sec-ch34. The blocks below cover only the scorecard-specific glue: serializing the artifact, exposing a thin scoring endpoint, and confirming numerical equivalence under ONNX export. ### Persist the scorecard ### FastAPI scoring service The companion file `book/deployment/scorecard_app.py` wraps the pickle behind a POST endpoint. Skeleton: The endpoint returns the integer points, the PD estimate, the approve/decline decision given a configured cutoff, and the FCRA-style reason codes derived from the weakest-contributing bins. This matches the Regulation B requirement that a denied applicant receive up to four principal reasons. To run locally: ### ONNX export of a scikit-learn LR ONNX gives us a language-neutral artifact that any serving platform (Triton, ONNX Runtime, TorchServe) can load. The numeric equivalence with sklearn confirms the conversion is faithful to 1e-6 or better on 32-bit float. ### MLflow logging Every production scorecard should be MLflow-logged with the hyperparameters, metric suite (AUC, KS, Brier, PSI, approval/bad rates), training data signature, and the artifact itself. SR 11-7 (@sec-sr117) expects you to reproduce the fit from logged artifacts on demand; @sec-ch34 covers the registry-plus-CI workflow that operationalizes that requirement. 
## Operational deployment

### Reason codes

Regulation B (ECOA) and FCRA require that a denied applicant be told, in concrete terms, why; the legal text and the four-reason rule are walked through in @sec-adverse-action, with the broader ECOA and FCRA framing in @sec-ch05-ecoa and @sec-ch05-fcra. The standard scorecard approach is: for each applicant, rank features by how far the applicant's bin points fall below the approved-population average for that feature, and return the top-k features. The FastAPI service described above does this by comparing each applicant's bin points against the table of bin points.

### Monotonic constraints

Regulatory and policy teams require that certain features be monotone: higher utilization should never *lower* PD, and older delinquencies should never *raise* it. Two mechanisms enforce this; tree-ensemble equivalents are derived in @sec-ch11-monotonic.

1. **Monotonic binning.** Optimal binning solves its mixed-integer program with a monotone event-rate constraint per feature, so the learned WoE values are automatically monotone in the feature. This is the preferred approach because it is visible in the scorecard table and auditable.
2. **Sign-constrained logistic regression.** Fit a penalized LR with coefficient sign constraints imposed via convex optimization (`cvxpy`) or a projected gradient step. This is the fallback when a feature must enter raw.

For example, setting `monotonic_trend="ascending"` in the binning step forces the event rate to rise with PAY_0 (later payment), which matches credit intuition. The resulting bins can be trusted in front of regulators and policy committees.

### Model monitoring pillars

Three quantities must be tracked at production cadence (daily for application scorecards, monthly for behavioral; behavioral scoring itself is treated in @sec-ch32):

1. **Population Stability Index (PSI) on score** (@sec-ch04-psi), flagged above 0.1, escalated above 0.25.
2. **Characteristic Stability Index (CSI)** per feature (@sec-ch04-csi), for root-cause on any PSI alert.
3. **Bad rate backtesting by score band**, with confidence intervals.

The `creditutils.psi` helper computes the score-level PSI; CSI applies the same formula to each feature's bin distribution.

## Stability in production

### PSI, characteristic stability, and recalibration cadence

The Population Stability Index measures how the score distribution has shifted between a baseline window (training) and a current window (production); a fuller treatment with derivation, sampling distribution, and worked thresholds sits in @sec-ch04-psi:

$$
\mathrm{PSI} = \sum_{b=1}^{B} (A_b - E_b) \log(A_b / E_b)
$$

where $E_b$ and $A_b$ are the expected (baseline) and actual (current) fractions in quantile bucket $b$. A typical rule set:

- **PSI \< 0.10:** no action.
- **0.10 to 0.25:** investigate, check CSI, consider Platt/offset recalibration on the latest vintage.
- **\> 0.25:** pause lending on the at-risk segment, refit the scorecard with the latest data, repeat back-test.

### Recalibration vs refit

Recalibration keeps the coefficient vector and reshapes only the probability mapping. It is useful when the shift is in the base rate (macro cycle) but the ranking is intact. The cheap implementation is Platt with a single intercept shift.

Refit re-estimates all coefficients, often with the same binning. It is needed when CSI is high on a top-IV feature or when the KS drops below a governance threshold. Refits require full SR 11-7 validation documentation, while recalibrations usually pass as "minor change" under a bank's change management policy.

### Quarterly cadence

A defensible cadence is: recalibrate quarterly on rolling 12-month data, refit annually, and run full challenger benchmarking (@sec-ch16) every two years. Credit cycles (recessions) break this schedule: a shift of more than 30% in the monthly bad rate triggers an out-of-cycle refit.

## Regulatory considerations

### SR 11-7: Model Risk Management

@sr117 and @occ2021model define three lines of defense; the supervisory letter and OCC bulletin are walked through in @sec-sr117.
Scorecards sit inside the first line (development), are validated by the second (model risk), and audited by the third. The chapter's deliverables map onto SR 11-7 as follows. - **Conceptual soundness.** The derivation in @sec-ch07-scorecard and @sec-ch07-scaling is the text you cite when asked "why logistic regression here." Monotonic WoE binning is the concrete control on feature behavior. - **Data and design.** MLflow logs capture the training data signature. PSI and CSI are the ongoing data-quality signals. - **Process verification.** The IRLS-vs-statsmodels check confirms the solver is correct. The ONNX round-trip confirms the deployed artifact is the trained artifact. - **Outcome analysis.** Holdout AUC/KS/Brier, reliability diagram, and back-testing at the score-band level. ### ECOA / FCRA Reason codes are mandatory for adverse actions [@hoffman1983interpretation]; mechanics in @sec-adverse-action, statutory framing in @sec-ch05-ecoa and @sec-ch05-fcra. The scorecard's additive form makes this trivial: the feature contributing the lowest points is the top reason. Disparate-impact analysis is required under the effects test; @sec-ch24 walks through the audit. A scorecard that passes disparate-treatment review but fails disparate-impact review needs redesign, not a wrapper. ### Basel II / III IRB @basel2006international, @basel2005irb, and @basel2017finalising lay out IRB expectations; the ASRF capital formula and PD/LGD/EAD definitions are derived in @sec-ch05-regulation. A PD model used for regulatory capital must be pointed at a 12-month outcome window, ranked into pools, and validated annually. Logistic regression scorecards are the most common IRB PD model [@eba2017gl; @eba2022irb], and the points system in @sec-ch07-scaling is typically translated into rating grades by binning the score into master-scale bands. ### GDPR Article 22 and the EU AI Act Article 22 of GDPR entitles the data subject to an explanation of an automated decision (@sec-ch05-gdpr). 
Reason codes satisfy the right to explanation in practice. The EU AI Act classifies credit-scoring as high-risk and imposes documentation and human-oversight requirements (@sec-ch05-euaia). Scorecards are naturally auditable, which is one reason banks in the EU are reluctant to replace them wholesale with black-box ensembles. ### Fair-lending guardrails @bartlett2022consumer and @hurlin2026fairness give the up-to-date empirical view on fintech-era lending discrimination. A logistic regression on legitimate features can still produce disparate outcomes; @sec-ch24's fairness audit is mandatory before go-live. ## Vietnam and emerging markets ### Market context Vietnamese retail scorecards live on top of three data layers: the CIC supervisory pull, the bank's internal deposit and card behavior, and increasingly a consented bureau pull from DataCore or PCB. CIC reports carry loan-by-loan status for bank and finance-company exposures, aged arrears buckets, and a CIC group rating that mirrors the five-group classification of Circular 11/2021/TT-NHNN [@sbv2021circular11; @cicvn2023report]. Circular 16/2020/TT-NHNN allowed video-plus-liveness eKYC for payment account opening and, via subsequent guidance, for consumer credit onboarding, which shifted application flow from branch to mobile in three years [@sbv2020ekyc]. Decree 13/2023/ND-CP is the binding personal data regime. Under it, a bureau pull or an alternative-data pull (telco, e-wallet) requires an explicit consent record and a data protection impact assessment filed with the Ministry of Public Security's cybersecurity department [@govvn2023decree13]. For SBV supervision, the scorecard must map to the Circular 11 definition of default (overdue more than 90 days or group 3 and worse), not to an internal roll-rate definition [@sbv2021circular11]. 
Findex 2021 places Vietnam's account-holder rate at roughly 56 percent of adults with fast growth in mobile-money uptake, which is the feature universe a retail modeler now writes against [@worldbank2021findex]. Macro volatility is not optional. Vietnamese bank credit responds to uncertainty shocks more strongly than in advanced markets, which means scorecard PD tracks a moving ground truth. Vietnamese GDP swings and property-cycle episodes (2012, 2022) are documented in IMF Article IV filings [@imf2023vietnamart4]. Seasonality is the other first-order effect. Tet bonuses, rural-urban remittances, and closing wholesale markets produce repeatable Q1 liquidity compression that a fixed-threshold scorecard reads as a risk spike unless explicitly adjusted. ### Application considerations WoE binning is the backbone of a Vietnamese scorecard because it tolerates thin CIC lines, informal-income proxies, and categorical variables with many small cells. Three concrete patterns matter. First, informal-income proxies (utility bill regularity, e-wallet top-up cadence, salary-like deposit rhythm) bin well against default once the monotonic constraint is imposed and optbinning's pre-binning granularity is raised from 20 to 40. Raw income declared on the application is a weak predictor because it is self-reported and frequently refers to household rather than obligor income. Second, CIC thin-file applicants (zero or one historical trade) should be modeled with a thin-file indicator plus WoE on alternative attributes rather than imputed into the main bins, because the missing-not-at-random structure is adversarial: thin-file applicants are disproportionately young, migrant, or recently formalized. Third, Tet seasonality is handled by including a calendar-month-of-application feature with WoE, not by dropping pre-Tet vintages. The information value of a well-binned month variable typically sits at 0.02 to 0.05 and preserves calibration across the year. 
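To make the month-of-application feature concrete, here is a minimal sketch of the WoE and information-value calculation for a coarse calendar bucket. The good/bad counts are synthetic and invented purely for illustration, and the three-bucket grouping and the `woe_iv` helper are assumptions rather than the book's binning. The sign convention assumed here is WoE = ln(share of goods / share of bads); the IV is the same under either sign convention.

```python
import math


def woe_iv(counts):
    """counts maps bin -> (n_good, n_bad); returns ({bin: WoE}, IV)."""
    total_good = sum(good for good, _ in counts.values())
    total_bad = sum(bad for _, bad in counts.values())
    woe, iv = {}, 0.0
    for name, (good, bad) in counts.items():
        share_good = good / total_good
        share_bad = bad / total_bad
        woe[name] = math.log(share_good / share_bad)
        # Each bin contributes (share_good - share_bad) * WoE to the IV.
        iv += (share_good - share_bad) * woe[name]
    return woe, iv


# Synthetic counts (illustration only): applications grouped by calendar window.
counts = {
    "pre_tet": (1800, 260),
    "post_tet": (1900, 190),
    "rest_of_year": (6300, 550),
}
woe, iv = woe_iv(counts)
for name, value in woe.items():
    print(f"{name:>13}: WoE = {value:+.3f}")
print(f"IV = {iv:.3f}")
```

With these synthetic counts the IV comes out near 0.04, inside the range quoted above: weak as a standalone predictor, but enough to keep calibration from drifting around the Lunar New Year.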
Default-rate drift and target definition. The Circular 11 default definition aggregates over loan groups, which changes the positive rate in a way that matters for the scaling factor. A scorecard built against a 30-days-past-due target and redeployed under a Basel-aligned 90-days-past-due target will require recalibration, not refit; the PDO (points to double the odds) should be recomputed and the intercept shifted. Segmented scorecards by product (cash loan, BNPL, auto, secured) are standard because the WoE binning of tenure interacts differently with the default definition. ### Rationalization Scorecards fit Vietnamese retail credit well. The regulatory environment rewards auditability. SBV on-site teams understand a coefficient table. Reason codes, which Decree 13/2023 pushes toward under its automated-processing language, fall out of a scorecard for free. The method fits less well when the portfolio is dominated by heavy alternative data streams (transaction text, device fingerprints) that are not easily binned; in that regime, a stacked model with a logistic scorecard on core features plus a gradient-boosted residual (@sec-ch12-gbm) on alternative data is a realistic compromise, with the meta-learner choice analyzed in @sec-ch12-stacking, as long as both components are documented for SBV. The scorecard also underperforms on super-thin-file segments where there is no variation to bin; @bjorkegren2020behavior gives the benchmark for alternative-data PD models in adjacent markets. ### Practical notes Datasets. Use CIC bureau pulls (on license), DataCore retail panels, and, for pedagogy, the Taiwan default dataset [@yeh2009comparisons]. The State Bank of Vietnam Fintech Regulatory Sandbox under Decree 94/2025/ND-CP is the legal venue to pilot alternative-data scorecards [@sbv2023vietnam]. ADB's Viet Nam Financial Sector Report gives sectoral default aggregates for sanity-checking base rates [@adb2022vnfin]. Regulator touchpoints. 
SBV Banking Inspection and Supervision Agency reviews scorecards under the Circular 11/2021 loan-classification lens. Documentation must include the WoE binning table, the points-per-bin mapping, the PSI monitoring cadence, and the cutoff governance. Decree 13/2023 requires a Personal Data Impact Assessment filing whenever a new feature category is added to the scorecard [@govvn2023decree13]. Operational cadence. A Vietnamese retail scorecard should be revalidated at least annually, with interim PSI checks keyed on the Lunar calendar (pre-Tet, post-Tet, mid-year). Recalibration rather than refit is appropriate when PSI stays under 0.1 and the population default rate shifts by less than 20 percent; otherwise a full refit with a fresh WoE binning is the honest answer. IFC and ADB work on Vietnamese SME lending documents that many consumer-finance lenders recalibrate quarterly to absorb Tet and policy-rate shifts [@ifc2019vnmsme; @adb2022vnfin]. Alternative-data additions (e-wallet telemetry, telco usage) should go through the SBV Fintech Regulatory Sandbox before being embedded in the production scorecard, both to harden the Decree 13/2023 lawful-basis narrative and to obtain supervisory comfort [@sbv2023vietnam]. ## Takeaways - A scorecard is a logistic regression on WoE-encoded features plus an affine scaling that turns log-odds into integer points. Both pieces have closed-form math and should be understood end to end. - IRLS is the right solver to know by hand. The four-line derivation in @sec-ch07-scorecard is enough to implement logistic regression from scratch in NumPy and to verify any production library. - Points per bin are $-B \beta_j \mathrm{WoE}_{jk}$ plus an intercept share. That formula is the contract between modelers and policy analysts. - Regularization helps three times: stability under quasi-separation, lower variance under correlated features, and better out-of-time transfer. 
L2 is safe; L1 is for feature selection; elastic net handles correlated behaviorals. - Calibration matters as much as discrimination for pricing and capital. Platt and isotonic are mechanical corrections; the reliability diagram is the test. - Production adds reason codes, PSI/CSI monitoring, a recalibration-vs-refit playbook, and an artifact pipeline (pickle, ONNX, MLflow). A scorecard that is not logged and monitored is not in production. ## Further reading - @hastie2009elements for the definitive statistical treatment of logistic regression. - @hosmer2013applied for applied tests, diagnostics, and categorical-variable handling. - @mccullagh1989generalized for the canonical GLM theory [@nelder1972generalized]. - @thomas2017credit and @anderson2007credit for scorecard-specific engineering. - @siddiqi2017intelligent for a vendor-inflected but practically invaluable scorecard walkthrough. - @friedman2010regularization for coordinate descent on penalized GLMs. - @platt1999probabilistic, @zadrozny2002transforming, @kull2017beta, and @niculescu2005predicting for the calibration literature. - @ohlson1980financial and @shumway2001forecasting for the logit bankruptcy lineage. - @dumitrescu2022machine for a modern benchmark that puts penalized LR on WoE features inside one percent of gradient-boosted trees for credit. - @sr117, @basel2006international, @eba2017gl, and @occ2021model for the regulatory frame. The borrower side of the scorecard is increasingly informed by a behavioral-economics literature that treats the customer's repayment trajectory as a function of attention, present-bias, and exponential-growth comprehension as much as of liquidity or risk. 
@gathergood2019balancematching show with UK and US card data that consumers fail to allocate payments toward the highest-APR card, sacrificing several hundred dollars per year; @meier2010presentbiased and @kuchler2021sticking trace revolving behavior to time-preference structure; @stango2009exponential document widespread underestimation of compound interest. @agarwal2009ageofreason find that financial mistakes follow a U-shape in age (equivalently, sophistication is hump-shaped), with mistakes concentrated at the 25-year and 75-year ends of the life cycle. The card-market backdrop in @ausubel1991failure, @gross2002doliquidity, @stango2016borrowing, @agarwal2015regulating and @agarwal2018dobanks supplies the institutional context for these mechanisms: switching costs, limit bunching, and incomplete pass-through of regulatory rate caps shape how a logistic scorecard's threshold translates into observed repayment.

================================================================================
# Source: chapters/08-structural-models.qmd
================================================================================

# Structural Models: Merton and the KMV Framework

**Scope: corporate.** Merton structural model, Black-Cox extensions, and the KMV distance-to-default. Inputs are firm-level (asset volatility, leverage, equity), so the framework does not transfer to consumer credit.

## Overview {.unnumbered}

A firm defaults when it cannot pay. That sentence sounds like an accounting identity but it is really a statement about two random variables. One is the value of the firm's assets, which drifts and fluctuates as markets reprice the business. The other is the face value of the firm's obligations, which is a fixed claim written into debt indentures. Default is what happens when the first variable falls below the second on a date that matters. Everything in this chapter follows from taking that picture seriously.
Structural models make the identity operational by embedding the firm inside a no-arbitrage asset-pricing framework. Starting from the balance-sheet identity $V = E + D$, they cast equity as a call option on the firm's assets and debt as a risky bond written on the same underlying. The probability of default is then the probability that the call finishes out of the money. That idea is due to @merton1974pricing, built directly on the Black-Scholes option-pricing framework of @black1973pricing, and it remains the single most influential piece of corporate credit theory a half-century later. The engineering version lives inside KMV (named for its founders Kealhofer, McQuown, and Vasicek), the commercial platform that Moody's bought in 2002 and turned into the public Expected Default Frequency (EDF) model. KMV translates Merton's formula into a workflow: observe equity and its volatility, back out asset value and asset volatility, compute a distance-to-default in standard deviations, map that distance into a PD using a proprietary historical table. The framework is still deployed at every major bank for wholesale and middle-market corporates, and its metric, DD, has become a standard covariate in reduced-form and accounting-based default models as well. This chapter builds the structural model from first principles, derives distance-to-default and the PD map (@sec-ch08-dd), codes the KMV iterative solver from scratch (@sec-ch08-kmv), and compares its output to Altman Z on a simulated Compustat-like panel (@sec-ch08-compare-altman). It then develops the reduced-form alternative of @jarrow1995pricing (@sec-ch08-reduced-form), contrasts the two philosophies, and ends with a tour of the empirical horse-race literature (@sec-ch08-empirical) that led from Merton to the hybrid frailty models of @duffie2009frailty. 
### Notation {.unnumbered} Throughout this chapter: $V_t$ is the market value of the firm's assets at time $t$, $E_t$ its equity, $D$ the face value of a zero-coupon debt maturing at $T$, $\mu$ the physical drift of assets, $r$ the risk-free rate, $\sigma_V$ the asset volatility, and $\sigma_E$ the equity volatility. $\Phi$ is the standard normal CDF, $\phi$ its density. PD is real-world probability of default on the physical measure $\mathbb{P}$; PD$^Q$ is the risk-neutral counterpart on $\mathbb{Q}$. EDF is the KMV map of DD to PD. Hazard rate is $\lambda_t$, cumulative hazard $\Lambda_t = \int_0^t \lambda_s ds$. Two pieces of that notation deserve a fuller gloss before they show up inside derivations. #### Physical measure $\mathbb{P}$ versus risk-neutral measure $\mathbb{Q}$ {.unnumbered} A probability measure is just a rule that assigns probabilities to events. In a structural model the relevant event is "the firm's asset value at time $T$ is below $D$". Two different rules can be applied to that same event, and the textbook calls them $\mathbb{P}$ and $\mathbb{Q}$. The physical measure $\mathbb{P}$, also called the real-world measure, the historical measure, or the data-generating measure, is the law that actually governs the world. If you could rerun history a million times and tabulate how often each firm defaulted, the limiting frequency would be its $\mathbb{P}$ probability. Every empirical default frequency you ever read in a Moody's cohort study, an S&P transition matrix, or a Basel IRB pillar-3 disclosure is a sample estimate of a $\mathbb{P}$ probability. Under $\mathbb{P}$ the asset value drifts at the rate investors actually expect, $\mu$, which equals the risk-free rate plus a risk premium that compensates for bearing equity-like volatility: $$ dV_t = \mu V_t \, dt + \sigma_V V_t \, dW_t^{\mathbb{P}}. 
$$

The risk-neutral measure $\mathbb{Q}$ is a different probability law on the same sample space, constructed so that every traded asset earns the risk-free rate in expectation. It is a calculational device, not a description of reality: nobody believes stocks really drift at $r$. By Girsanov's theorem $\mathbb{Q}$ replaces the physical drift with $r$ while leaving the volatility unchanged,

$$
dV_t = r V_t \, dt + \sigma_V V_t \, dW_t^{\mathbb{Q}},
$$

and the two measures are linked by an explicit Radon-Nikodym derivative whose log involves the Sharpe ratio $(\mu - r)/\sigma_V$. The reason $\mathbb{Q}$ exists at all is the fundamental theorem of asset pricing: in a frictionless arbitrage-free market, today's price of any payoff is the discounted $\mathbb{Q}$-expectation of that payoff. Bond and CDS prices therefore embed $\mathbb{Q}$-probabilities of default by construction.

Two consequences follow. First, the same firm has two PDs, not one. The physical PD answers "how often does this firm default in the real world?" and the risk-neutral PD$^{Q}$ answers "what default probability is consistent with the price the market is charging for default protection?". Second, PD$^{Q}$ is mechanically larger than PD for any firm with a positive risk premium, because shifting the drift from $\mu$ down to $r$ pushes more probability mass below the default barrier. The wedge $\text{PD}^{Q} - \text{PD}$ is the credit risk premium, the same object that makes investment-grade bond spreads systematically wider than realized losses would justify [@huang2012how].

Concretely, plug $\mu = 0.10$, $r = 0.03$, $\sigma_V = 0.25$, $T = 1$, $V_0/D = 1.5$ into the Merton formula $\text{PD} = \Phi(-\text{DD})$ with $\text{DD} = [\ln(V_0/D) + (\mu - \tfrac{1}{2}\sigma_V^2)T]/(\sigma_V\sqrt{T})$. The physical distance-to-default is about $1.90$, so the physical PD is about $2.9\%$. Replacing $\mu$ with $r$ for the risk-neutral version lowers the distance to about $1.62$ and raises the PD to roughly $5.3\%$. Same firm, same balance sheet, same volatility, nearly twice the probability, all driven by the change of measure.
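The calculation can be reproduced in a few lines. This is a minimal sketch, assuming lognormal asset values and the standard form PD $= \Phi(-\text{DD})$ with $\text{DD} = [\ln(V_0/D) + (\text{drift} - \tfrac{1}{2}\sigma_V^2)T]/(\sigma_V\sqrt{T})$; the helper name `merton_pd` is hypothetical. The only thing that changes between the two measures is the drift argument.

```python
from math import log, sqrt
from statistics import NormalDist


def merton_pd(v0_over_d, drift, sigma_v, horizon):
    """Probability that V_T < D when V follows a GBM with the given drift."""
    dd = (log(v0_over_d) + (drift - 0.5 * sigma_v**2) * horizon) / (
        sigma_v * sqrt(horizon)
    )
    return NormalDist().cdf(-dd)


# Parameters from the worked example: V0/D = 1.5, sigma_V = 0.25, T = 1.
pd_p = merton_pd(1.5, drift=0.10, sigma_v=0.25, horizon=1.0)  # physical, drift mu
pd_q = merton_pd(1.5, drift=0.03, sigma_v=0.25, horizon=1.0)  # risk-neutral, drift r

print(f"PD under P: {pd_p:.4f}")
print(f"PD under Q: {pd_q:.4f}")
print(f"ratio Q/P : {pd_q / pd_p:.2f}")
```

Keeping the drift as an explicit argument makes it harder to mix the two measures by accident: every PD the function returns is tied to the drift that was passed in.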
The pair PD and PD$^Q$ refers to the same event (the firm defaults by time $T$) measured under two different probability laws. PD on the physical measure $\mathbb{P}$ is the actual frequency you would expect to see if you could replay history many times: it uses the physical asset drift $\mu$, which contains the equity risk premium, and it is the right number for risk management, capital, expected loss, and forecasting. PD$^Q$ on the risk-neutral measure $\mathbb{Q}$ replaces $\mu$ with the risk-free rate $r$ and is the number embedded in market prices of bonds, CDS, and other credit derivatives. Because investors demand compensation for bearing default risk, PD$^Q$ is mechanically larger than PD for the same firm; the wedge between them is the credit risk premium. Practically: use PD for loss forecasting and Basel IRB inputs, use PD$^Q$ for pricing and hedging, and never mix the two inside a single calculation. EDF (Expected Default Frequency) is KMV's empirical replacement for the textbook formula PD $= \Phi(-\text{DD})$. The textbook formula is exact only if asset returns are truly lognormal, which they are not, so it badly understates default risk in the tails. KMV instead pools a large proprietary default database, sorts firms into DD buckets, computes the realized one-year default rate inside each bucket, and fits a smooth monotone curve through those bucket-level rates. The resulting function $\text{EDF}(\text{DD})$ is what gets shipped to clients. It is still a one-to-one map from distance-to-default to a probability, but the shape is calibrated to data rather than assumed from a Gaussian. The empirical-map step is built out in detail in @sec-ch08-dd. ## Motivation: why equity can be a call option on the firm Consider a firm with a single zero-coupon debt contract. The firm promises to pay the creditor $D$ dollars at maturity $T$ and is financed in part by equity. Shareholders control the firm until $T$, at which point two states of the world matter. 1. 
Either the assets $V_T$ exceed $D$, the creditors are paid in full, and shareholders keep the residual $V_T - D$. 2. Or $V_T < D$, in which case limited liability kicks in, shareholders walk away with nothing, and creditors seize the assets worth $V_T$. The payoff at $T$ to shareholders is therefore $$ E_T = \max(V_T - D, 0). $$ That is the payoff of a European call option on $V$ struck at $D$ with expiry $T$. The payoff to creditors is $$ \text{Debt}_T = \min(V_T, D) = D - \max(D - V_T, 0), $$ which is a risk-free bond minus a European put on $V$ struck at $D$. @merton1974pricing turned these two identities into the foundation of structural credit risk by pricing them under the Black-Scholes assumptions. The intellectual leap is that once equity is a call on assets, equity trading contains information about firm-asset volatility and firm-asset value. Equity is observed daily in liquid markets; asset value and asset volatility are not. The structural model lets you back them out. Everything KMV ships is built on that inversion. Two warnings are worth stating before the derivations. First, this is a model. Real firms have coupon debt, senior and junior tranches, callable provisions, cross-default clauses, pension obligations, lease liabilities, and revolvers. Compressing all of that into a single zero-coupon face value is a first approximation and the extensions literature ([@black1976valuing; @geske1977valuation; @longstaff1995simple; @leland1994corporate; @leland1996optimal]) exists precisely to relax those assumptions. Second, default in the classical Merton setup only happens at $T$. In real life, covenants, rating triggers, and liquidity crises can force default earlier. Barrier versions such as @black1976valuing address that. The emerging-market framing matters here more than in any other chapter. Merton-KMV needs a liquid equity price and an estimate of equity volatility. 
Vietnam has fewer than 800 listings across HOSE, HNX, and UPCoM, with thin free float at many names, and the vast majority of corporate borrowers are private SMEs with no equity price at all [@worldbank2022vietnamfinance; @adb2022vnfin]. Macro volatility amplifies the asset-drift uncertainty that already plagues Merton in developed markets. The closing emerging-market section returns to this with practical hybrids: Z'' plus CIC ratings, and Merton on the listed subset only. ### Why bother with a structural model at all A purely statistical model of corporate default, say a logistic regression on financial ratios, can deliver competitive AUC numbers without invoking any option pricing. Why incur the cost of an option-theoretic derivation to solve a classification problem? Four reasons. First, the structural model forces the analyst to confront the joint distribution of asset value and debt face value in a coherent way. Accounting ratios are noisy proxies for this joint distribution. The structural model is a generative story that ties them together. That generative story is what lets the framework extrapolate outside the historical sample. A logistic regression fit on 1985-2005 US data has no mechanism to think about what a sudden asset-volatility shock of the kind seen in March 2020 does to PD; the Merton model does, through $\sigma_V$. Second, the structural framework produces PDs that are internally consistent with bond and equity prices at the same time. An accounting-only model might predict a 1% PD for a firm whose bond yield implies 4%. Either the accounting model is wrong, the bond price is wrong, or the recovery assumption is wrong. The structural model at least gives a disciplined way to choose between these hypotheses. Third, the framework extends cleanly to more complex capital structures. The seniority ranking of debt tranches can be modeled as a waterfall of call options with progressively higher strikes. 
The priority of bank debt versus bond debt shows up as the strike ordering. Collateral and covenants show up as barrier features. These extensions preserve the option-theoretic skeleton and let a wholesale credit desk price instruments that a logistic regression would have no way to approach. Fourth, structural models are forward-looking by construction. Equity prices aggregate market expectations over all future states. An accounting-based score is backward-looking: it uses last quarter's balance sheet, which reflects last quarter's performance. In fast-moving distressed situations, the backward lag of accounting data can be fatal. @vassalou2004default shows that the structural DD has information content about equity returns beyond book-to-market and size, and @bharath2008forecasting shows that DD dominates accounting ratios at short forecast horizons. ## Formal setup ### The firm under Black-Scholes dynamics Assume a frictionless market, continuous trading, no taxes or dividends, a flat risk-free rate $r$, and a single risky firm. Firm assets evolve as a geometric Brownian motion under the physical measure $\mathbb{P}$: $$ dV_t = \mu V_t dt + \sigma_V V_t dW_t, $$ where $W_t$ is a standard Brownian motion, $\mu$ the expected asset return, and $\sigma_V$ the asset volatility. The SDE in @eq-asset-gbm is not solved by ordinary calculus, because $W_t$ has unbounded variation and a non-vanishing quadratic variation $d\langle W \rangle_t = dt$. Ito's lemma is the chain rule that fixes this: for a twice-differentiable function $f(t, V_t)$ of an Ito process, $$ df(t, V_t) = \frac{\partial f}{\partial t}\,dt + \frac{\partial f}{\partial V}\,dV_t + \tfrac{1}{2}\frac{\partial^2 f}{\partial V^2}\,d\langle V \rangle_t, $$ the only difference from the deterministic chain rule being the second-order term $\tfrac{1}{2} f_{VV}\, d\langle V \rangle_t$. That extra term is non-negligible because $(dW_t)^2 = dt$ rather than $0$. 
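The rule $(dW_t)^2 = dt$ can be checked numerically. The sketch below assumes a uniform grid on $[0, T]$ and draws the increments with Python's `random.gauss`; the function name `variations` is hypothetical. As the grid is refined, the sum of squared increments settles at $T$, while the sum of absolute increments keeps growing, which is the unbounded variation that rules out the ordinary chain rule.

```python
import random
from math import sqrt


def variations(n_steps, horizon=1.0, seed=0):
    """Return (sum of squared increments, sum of absolute increments) of W on [0, horizon]."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    increments = [rng.gauss(0.0, sqrt(dt)) for _ in range(n_steps)]
    quadratic = sum(dw * dw for dw in increments)  # approximates <W>_T = T
    total = sum(abs(dw) for dw in increments)      # grows like sqrt(n_steps)
    return quadratic, total


for n in (100, 10_000, 250_000):
    quadratic, total = variations(n)
    print(f"n = {n:>7}: sum dW^2 = {quadratic:.4f}, sum |dW| = {total:.1f}")
```

The squared increments are what survive in the limit, and they are exactly the source of the $\tfrac{1}{2} f_{VV}\, d\langle V \rangle_t$ term.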
Apply @eq-ito-general to $f(V) = \ln V$, whose derivatives are $f_V = 1/V$ and $f_{VV} = -1/V^2$. The quadratic variation of $V$ from @eq-asset-gbm is $d\langle V \rangle_t = \sigma_V^2 V_t^2\, dt$, so $$ d \ln V_t = \frac{1}{V_t}\, dV_t - \frac{1}{2}\,\frac{1}{V_t^2}\,\sigma_V^2 V_t^2\, dt = \left(\mu - \tfrac{1}{2}\sigma_V^2\right) dt + \sigma_V\, dW_t. $$ The drift of $\ln V_t$ is therefore $\mu - \tfrac{1}{2}\sigma_V^2$, not $\mu$. The $-\tfrac{1}{2}\sigma_V^2$ piece is the Ito correction (or convexity correction): even with a fair coin, log-returns drift down because $\ln$ is concave and Jensen's inequality penalizes volatility. This is the same mechanism behind the volatility drag in geometric returns and behind the half-variance term in the Black-Scholes formula. Integrating @eq-ito-logV from $0$ to $T$ is now ordinary calculus on a deterministic drift plus a Wiener integral, $$ \ln V_T - \ln V_0 = \left(\mu - \tfrac{1}{2}\sigma_V^2\right) T + \sigma_V\, (W_T - W_0), $$ and exponentiating, with $W_T - W_0 \sim \mathcal{N}(0, T)$ written as $\sqrt{T}\, Z$ for a standard normal $Z$, gives the closed-form solution $$ V_T = V_0 \exp\!\left[(\mu - \tfrac{1}{2}\sigma_V^2)T + \sigma_V \sqrt{T} Z\right],\qquad Z \sim \mathcal{N}(0,1). $$ So $\ln V_T$ is normal with mean $\ln V_0 + (\mu - \tfrac{1}{2}\sigma_V^2)T$ and variance $\sigma_V^2 T$, i.e. $V_T$ is lognormal. Every PD formula in this chapter, including $\Phi(-\text{DD})$ and the Black-Scholes call price for equity, ultimately rides on @eq-VT-solution. The firm's capital structure consists of equity $E$ and a single zero-coupon bond with face $D$ maturing at $T$. The balance sheet identity holds at every date, $$ V_t = E_t + B_t, $$ where $B_t$ is the market value of the debt at $t$. ### The information structure: incomplete accounting information An important subtlety in the Merton setup is the information set. The model assumes that $V_t$ and $\sigma_V$ are known at time $t$. In practice neither is observed. 
What is observed is $E_t$ and a noisy proxy for $\sigma_E$ estimated from equity returns. The textbook structural model papers over this by assuming that markets can see through equity to asset value via the Black-Scholes inversion. That is a strong assumption, and relaxing it changes the model in ways large enough to deserve their own subsection. @duffielando2001 is the canonical treatment. Their setup is worth walking through because it is the cleanest bridge from structural to reduced-form models, and it underlies several of the extensions discussed later in the chapter (jumps in @sec-ch08-dd, the structural-reduced contrast in @sec-ch08-reduced-form, and the hybrid frailty work in @sec-ch08-empirical). #### Setup: manager's filtration versus market's filtration {.unnumbered} The manager observes the asset path $V_t$ continuously and therefore works on the natural filtration $\mathcal{F}_t^M = \sigma(V_s : s \le t)$. The market does not. Investors see the equity price (which under Merton is a deterministic function of $V$ but in the Duffie-Lando setup is observed only at the accounting-report frequency) and a sequence of noisy accounting reports $$ y_n = \ln V_{t_n} + \varepsilon_n,\qquad \varepsilon_n \sim \mathcal{N}(0, u^2), $$ released at dates $t_1 < t_2 < \cdots$. The market filtration is $\mathcal{F}_t^I = \sigma(y_n : t_n \le t) \vee \sigma(\mathbf{1}\{\tau \le s\} : s \le t)$, i.e. the noisy reports plus knowledge of whether the firm has already defaulted. Crucially $\mathcal{F}_t^I \subsetneq \mathcal{F}_t^M$. Default is the first passage of $V$ to a barrier $V_B$ (the Merton special case is $V_B = D$ at $t = T$ only), $$ \tau = \inf\{t \ge 0 : V_t \le V_B\}. $$ #### The key result: predictable under $\mathcal{F}^M$, totally inaccessible under $\mathcal{F}^I$ {.unnumbered} A stopping time is *predictable* if it can be announced by an increasing sequence of stopping times: there exist $\tau_n \uparrow \tau$ with $\tau_n < \tau$. 
Diffusions do not jump, so on the manager's filtration the first-passage time $\tau$ is predictable: as $V_t$ approaches $V_B$ the manager sees disaster coming. The Doob-Meyer compensator of the indicator $\mathbf{1}\{\tau \le t\}$ in this filtration is degenerate, the conditional hazard at $t = 0$ is zero, and short-horizon credit spreads collapse to zero. This is the well-known short-spread defect of the pure Merton model, which the empirical literature documents repeatedly [@huang2012how; @eom2004structural]. Project the same default time onto the smaller filtration $\mathcal{F}^I$. Because $V_t$ is now itself a random variable conditional on the noisy reports, the market does not see $V_t$ approaching $V_B$ in a deterministic way. @duffielando2001 prove that under mild regularity $\tau$ is *totally inaccessible* with respect to $\mathcal{F}^I$: it cannot be announced. The Doob-Meyer decomposition then yields a positive intensity $$ \lambda_t^I = \lim_{h \downarrow 0} \frac{1}{h}\, \Pr[\tau \le t+h \mid \mathcal{F}_t^I,\, \tau > t], $$ which has a closed-form expression in terms of the conditional density $g(v \mid \mathcal{F}_t^I)$ of $\ln V_t$ given the market's information, $$ \lambda_t^I = \tfrac{1}{2}\sigma_V^2\, \frac{\partial g}{\partial v}\bigg|_{v = \ln V_B}. $$ Equation @eq-lambda-density is the bridge between the structural and reduced-form worlds: a structural model with incomplete information *generates* a reduced-form intensity endogenously, rather than postulating one as in @jarrow1995pricing. #### Why short-end spreads stop collapsing {.unnumbered} Under full information, $\Pr[\tau \le h]$ for small $h$ behaves like $\exp(-c/h)$ near a non-zero distance to the barrier: vanishingly small. Under incomplete information, the conditional density $g$ has positive mass arbitrarily close to $\ln V_B$ even when the point estimate $\hat V_t \gg V_B$, simply because the posterior over $V_t$ is diffuse. 
The spread at short maturity inherits this density and becomes $O(1)$ rather than exponentially small. Numerically, with realistic accounting noise $u \in [0.10, 0.25]$ and a reporting frequency of one quarter, @duffielando2001 close roughly half of the short-end credit-spread puzzle without invoking jumps or stochastic volatility.

#### Implications for the rest of the chapter {.unnumbered}

The filtration argument has three downstream consequences that recur in later sections.

1. **Empirical EDF beats theoretical** $\Phi(-\text{DD})$. The KMV calibration in @sec-ch08-dd folds the incomplete-information distortion into the bucket-wise default-rate map. That is one of the three reasons the Gaussian formula undershoots; the other two (jumps and strategic default) are listed alongside in @sec-ch08-dd.
2. **Structural-reduced hybrids are not a hack**. Because the Duffie-Lando intensity $\lambda_t^I$ is itself a structural object (a derivative of a structural posterior), running a hazard model whose intensity depends on DD plus accounting and macro covariates is consistent with the underlying theory rather than an ad-hoc patch. This is the philosophical justification for the hybrid models in @sec-ch08-reduced-form and @sec-ch08-empirical.
3. **Filtering is unavoidable in emerging markets**. Vietnamese listed firms publish quarterly reports with material noise (accounting standard transition, related-party transactions, undisclosed contingent liabilities); private SMEs report annually with even larger $u$. The filtration problem is not a textbook curiosity in this setting; it is the modal case, and the practical hybrids in @sec-ch08-empirical handle it explicitly.

### Default event and default probability

Default occurs if and only if $V_T < D$. Under the physical measure $\mathbb{P}$,

$$ \text{PD}^{\mathbb{P}} = \Pr[V_T < D] = \Pr\!\left[\ln V_T < \ln D\right].
$$ Using (@eq-VT-solution), $$ \ln V_T = \ln V_0 + (\mu - \tfrac{1}{2}\sigma_V^2)T + \sigma_V \sqrt{T} Z, $$ so $$ \text{PD}^{\mathbb{P}} = \Pr\!\left[Z < \frac{\ln(D/V_0) - (\mu - \tfrac{1}{2}\sigma_V^2)T}{\sigma_V \sqrt{T}}\right] = \Phi(-\text{DD}), $$ with $$ \text{DD} = \frac{\ln(V_0/D) + (\mu - \tfrac{1}{2}\sigma_V^2)T}{\sigma_V \sqrt{T}}. $$ That is the definition of distance-to-default. It measures, in asset-volatility units, how many standard deviations the log asset value sits above the log default barrier after accounting for drift. The larger the DD, the smaller the PD, and the mapping is purely the normal CDF when the model is literally correct. KMV replaces $\Phi(-\text{DD})$ with an empirical map estimated from historical defaults; that calibration is developed in @sec-ch08-pd-routes, the reasons the lognormal map fails are dissected in @sec-ch08-undershoot, and a runnable empirical PD map on simulated data is built in @sec-ch08-empirical-pd-map. ## Derivation: equity as a call and debt as face value minus a put ### Step 1: translate the problem to a call option By @eq-equity-payoff, the terminal payoff of equity is that of a European call on $V_T$ struck at $D$. The Merton claim is that everything we know about pricing Black-Scholes calls transfers directly to corporate equity. The argument runs as follows. Under the risk-neutral measure $\mathbb{Q}$ the drift of $V$ is $r$, not $\mu$, because a self-financing hedging portfolio in $V$ must earn the risk-free rate. @harrison1979martingales and @harrison1981martingales provide the measure-theoretic machinery: in a complete arbitrage-free market there is a unique equivalent martingale measure under which discounted traded-asset prices are martingales. Asset value, as the underlying of a tradable claim, has drift $r$ under $\mathbb{Q}$, so $$ dV_t = r V_t dt + \sigma_V V_t dW_t^{\mathbb{Q}}. $$ By no-arbitrage, $E_0 = e^{-rT} \mathbb{E}^{\mathbb{Q}}[\max(V_T - D, 0)]$. 
Substituting the lognormal distribution of $V_T$ under $\mathbb{Q}$ and integrating yields the Black-Scholes formula, $$ E_0 = V_0 \Phi(d_1) - D e^{-rT} \Phi(d_2), $$ with $$ d_1 = \frac{\ln(V_0/D) + (r + \tfrac{1}{2}\sigma_V^2)T}{\sigma_V \sqrt{T}}, \quad d_2 = d_1 - \sigma_V \sqrt{T}. $$ ### Step 2: the Black-Scholes derivation step by step The derivation of (@eq-merton-equity) from (@eq-V-Q) and (@eq-equity-payoff) is textbook but worth spelling out because every symbol here has a credit-risk meaning. **Step 2.1: law of the terminal asset value.** Under $\mathbb{Q}$, $V_T = V_0 \exp[(r - \tfrac{1}{2}\sigma_V^2)T + \sigma_V \sqrt{T} Z^{\mathbb{Q}}]$ with $Z^{\mathbb{Q}} \sim \mathcal{N}(0,1)$ under $\mathbb{Q}$. Equivalently, $\ln(V_T/V_0) \sim \mathcal{N}((r - \tfrac{1}{2}\sigma_V^2)T, \sigma_V^2 T)$. **Step 2.2: split the expected payoff.** Write $$ \mathbb{E}^{\mathbb{Q}}[\max(V_T - D, 0)] = \mathbb{E}^{\mathbb{Q}}[V_T \mathbf{1}\{V_T > D\}] - D \cdot \Pr^{\mathbb{Q}}[V_T > D]. $$ **Step 2.3: the risk-neutral survival probability.** Because $\ln V_T$ is normal, $$ \Pr^{\mathbb{Q}}[V_T > D] = \Pr^{\mathbb{Q}}[\ln V_T > \ln D] = \Phi(d_2), $$ where $d_2$ comes from standardizing $\ln V_T$ under $\mathbb{Q}$ and noticing $d_2 = \frac{\ln(V_0/D) + (r - \tfrac{1}{2}\sigma_V^2)T}{\sigma_V \sqrt{T}}$. **Step 2.4: the expectation** $\mathbb{E}^{\mathbb{Q}}[V_T \mathbf{1}\{V_T > D\}]$. This is a standard "partial expectation of a lognormal." Change variables to $u = \ln(V_T/V_0)$, so $V_T = V_0 e^u$, and condition on $u > \ln(D/V_0)$: $$ \mathbb{E}^{\mathbb{Q}}[V_T \mathbf{1}\{V_T > D\}] = V_0 \int_{\ln(D/V_0)}^{\infty} e^u f_u(u) du, $$ with $f_u$ the normal density of $u$ with mean $m = (r - \tfrac{1}{2}\sigma_V^2)T$ and variance $s^2 = \sigma_V^2 T$. 
Completing the square, $$ \begin{aligned} e^u f_u(u) &= \frac{1}{\sqrt{2\pi s^2}} \exp\!\left[-\frac{(u - m)^2}{2 s^2} + u\right] \\ &= e^{m + s^2/2} \cdot \frac{1}{\sqrt{2\pi s^2}} \exp\!\left[-\frac{(u - m - s^2)^2}{2 s^2}\right]. \end{aligned} $$ The factor $e^{m + s^2/2} = e^{rT}$ because $m + s^2/2 = rT$. The remaining integral is the tail of a normal with mean $m + s^2$: $$ \int_{\ln(D/V_0)}^{\infty} e^u f_u(u) du = e^{rT} \Phi(d_1), $$ with $d_1 = \frac{\ln(V_0/D) + (r + \tfrac{1}{2}\sigma_V^2)T}{\sigma_V \sqrt{T}}$, by direct standardization. **Step 2.5: assemble.** Combine the two pieces and discount: $$ E_0 = e^{-rT} \left[V_0 e^{rT} \Phi(d_1) - D \Phi(d_2)\right] = V_0 \Phi(d_1) - D e^{-rT} \Phi(d_2), $$ which is (@eq-merton-equity). Debt follows from the balance-sheet identity $B_0 = V_0 - E_0$: $$ B_0 = V_0 \Phi(-d_1) + D e^{-rT} \Phi(d_2). $$ ### Step 3: risk-neutral PD The risk-neutral probability of default is $$ \text{PD}^{\mathbb{Q}} = 1 - \Pr^{\mathbb{Q}}[V_T > D] = 1 - \Phi(d_2) = \Phi(-d_2). $$ The only difference between $\text{PD}^{\mathbb{Q}}$ and $\text{PD}^{\mathbb{P}}$ is the drift: $r$ versus $\mu$. That difference is first-order; it is why KMV uses the physical drift and why quants pricing credit derivatives use the risk-neutral one. @vassalou2004default shows that Merton-implied default probabilities using the physical drift have genuine forecasting power for equity returns, which would not be true of the risk-neutral construct. ### Step 4: credit spread From (@eq-merton-debt), the continuously compounded yield on the zero-coupon defaultable bond is $y = -\frac{1}{T} \ln(B_0 / D)$, so the credit spread is $$ s = y - r = -\frac{1}{T} \ln\!\left[\Phi(d_2) + \frac{V_0}{D e^{-rT}} \Phi(-d_1)\right]. 
$$ Merton's empirical miss is well known: plugging observed leverage, volatility, and recovery into (@eq-spread) generates spreads that are too small relative to observed investment-grade spreads, the so-called credit-spread puzzle ([@huang2012how; @collin2001determinants; @chen2010macroeconomic; @eom2004structural]). Structural models with taxes, jumps, stochastic volatility, and stochastic interest rates close some of the gap but not all. ### Numerical check: Black-Scholes and put-call parity Put-call parity is satisfied to machine precision, which confirms the equity-as-call and debt-as-face-minus-put decompositions agree. The same two functions will be reused throughout the chapter, with $V$ playing the role of $S$ and $D$ the role of $K$. ### Extensions that actually ship The classical Merton model has well-known weaknesses and four extensions have become standard in practice. **Barrier default.** @black1976valuing allow default to happen any time the asset value crosses a lower threshold $K < D$, capturing covenants and early-trigger clauses. The equity payoff is a down-and-out call struck at $D$ with barrier $K$. The closed form is messier but still analytic, and for moderate leverage the resulting DD is lower than the classical DD by an amount that reflects the probability of passing through the barrier before $T$. @longstaff1995simple extend to a constant barrier with exogenous recovery and a stochastic interest rate, producing term-structure fits that are materially better than pure Merton. **Endogenous default.** @leland1994corporate and @leland1996optimal treat the default barrier as an equilibrium choice of shareholders, who compare the option value of continuing to service debt against the option of defaulting immediately. The equilibrium barrier rises with leverage and falls with asset volatility, capturing the strategic dimension of default that Merton's exogenous barrier misses. 
The Leland framework also delivers endogenous term-structure of credit spreads and an optimal capital structure that roughly matches observed leverage ratios in investment-grade corporates. **Compound options.** @geske1977valuation treats equity as a compound option in the presence of multiple debt maturities. Each coupon date is itself an option on the post-coupon firm. The resulting formula is a multivariate normal integral and provides a more realistic pricing of long-dated debt with intermediate coupon payments. The compound-option correction is what KMV uses internally to deal with firms that have revolving debt maturities. **Stochastic interest rates and jumps.** Adding Vasicek or CIR dynamics to $r$ lets the model capture the interest-rate-spread interaction that @collin2001determinants highlight. Adding jumps in $V$ raises short-horizon PD to realistic levels and closes the short end of the credit-spread puzzle. @chen2010macroeconomic embeds the whole thing inside a consumption-based asset-pricing framework with time-varying risk premia and produces a structural model that matches both the level and the cyclicality of observed credit spreads. None of these extensions have displaced Merton as the workhorse. KMV EDF ships a compound-option variant; academic researchers still benchmark on pure Merton DD because its estimation is unambiguous and its inputs are public. The practical compromise is to use Merton DD as a feature and let a downstream logistic or tree model pick up the residual structure that the extensions would have captured analytically. ## Distance-to-default and the PD map ### Defining DD inside the model The quantity DD from (@eq-dd) sits at the center of the whole structural edifice. It has three useful interpretations. 
**Reading 1: standardized log leverage.** Rewrite $\text{DD} = \frac{\ln(V_0/D) + (\mu - \sigma_V^2/2)T}{\sigma_V \sqrt{T}}$ as the number of one-year asset-volatility units separating log asset value (drifted by $(\mu - \sigma_V^2/2)T$) from log default barrier $\ln D$. Because the numerator is the mean of $\ln V_T - \ln D$ under the physical measure and the denominator is its standard deviation, DD is literally the $z$-score of log survival. **Reading 2:** $d_2$ under the physical drift. Compare to (@eq-d1d2): $d_2 = (\ln(V_0/D) + (r - \sigma_V^2/2)T)/(\sigma_V \sqrt{T})$. So DD and $d_2$ differ only in that DD uses $\mu$ and $d_2$ uses $r$. Under the risk-neutral measure, DD collapses to $d_2$. Structural PD under $\mathbb{Q}$ is $\Phi(-d_2)$; under $\mathbb{P}$ it is $\Phi(-\text{DD})$. **Reading 3: standardized log-moneyness.** The call-option analogy: DD is how far in the money the implicit call $\max(V_T - D, 0)$ is expected to finish, measured in asset-return standard deviations. Very in-the-money calls correspond to very distant-to-default firms. ### From DD to PD: two routes The theoretical route maps DD to PD through the normal CDF, $$ \widehat{\text{PD}} = \Phi(-\text{DD}). $$ This is exactly right if the asset-return distribution really is lognormal. It is badly wrong in the tails of real data. Empirically, actual default rates at high DD are nowhere near as small as the normal CDF predicts. The fix in KMV is to replace $\Phi$ with an empirical map built from a large proprietary default database: group firms by DD bucket, compute the realized one-year default rate in each bucket, and smooth the bucket-level hazard to get a monotone decreasing function $\text{EDF}(\text{DD})$. A useful stylized fact: for investment-grade firms the empirical EDF at a given DD sits roughly one to two orders of magnitude above $\Phi(-\text{DD})$. 
For a firm with DD equal to 4, the lognormal formula gives a PD of about 0.3 bps ($\Phi(-4) \approx 3.2 \times 10^{-5}$); Moody's KMV EDF puts the same firm closer to 30 bps to 50 bps. This gap is one reason structural PDs cannot be used as-is for capital under a regulatory IRB model.

### Why the normal CDF undershoots

The discrepancy between theoretical $\Phi(-\text{DD})$ and empirical EDF is not a minor calibration bug. It reflects a deep problem with the structural model's distributional assumption. Three mechanisms conspire to produce fatter tails than the lognormal allows.

**Jumps.** Asset values do jump. Fraud disclosures, litigation surprises, adverse regulatory rulings, commodity price shocks, and pandemic-level events are not drawn from a lognormal distribution. Even a small Poisson jump component with intensity 2% per year and expected jump size -20% raises DD-implied PDs by 30-80% at low DDs. @duffie1999modeling and subsequent work in the structural literature quantify the jump contribution to observed spreads.

**Incomplete information.** The filtration problem from @sec-ch08-filtration produces a positive short-end hazard that the diffusion model lacks. Investors do not observe $V_t$ exactly; they infer it from noisy accounting and market signals. The inferred distribution of $V_t$ has fatter tails than the underlying $V_t$, and the implied PD at any given point estimate is larger. The Duffie-Lando intensity in @eq-lambda-density is precisely the contribution this channel makes to the empirical PD map.

**Strategic default.** Under limited liability, shareholders may walk away from a firm whose $V_T$ exceeds $D$ if the cost of equity injection exceeds the option value of continuing. This behavior is documented in sovereign and municipal debt (the "willingness to pay" problem) and in private equity-held firms with aggressive dividend recap structures. The Merton model does not capture strategic default because it assumes shareholders always pay if $V_T > D$.
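The jump mechanism can be made concrete with a small sketch. It assumes a Merton-style jump-diffusion in which the number of jumps over $[0, T]$ is Poisson and log jump sizes are i.i.d. normal, so that conditional on $n$ jumps $\ln V_T$ is still normal and the PD is a Poisson-weighted mixture of Gaussian tail probabilities. The function names, the parameter values, and the choice not to compensate the drift for jumps are all illustrative assumptions, not the book's own implementation, and the printed numbers are not meant to reproduce the percentages quoted above.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via erfc, which stays accurate in the left tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def pd_diffusion(V0, D, mu, sigma, T):
    """Physical Merton PD, Phi(-DD), under pure geometric Brownian motion."""
    dd = (math.log(V0 / D) + (mu - 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    return norm_cdf(-dd)


def pd_jump_diffusion(V0, D, mu, sigma, T, lam, jump_mean, jump_sd, n_max=60):
    """PD as a Poisson mixture: sum_n P(N_T = n) * Pr[ln V_T < ln D | n jumps].

    Assumptions (illustrative): log jump sizes are i.i.d. N(jump_mean, jump_sd^2)
    and the diffusive drift mu is NOT compensated for the jump component.
    """
    total = 0.0
    for n in range(n_max + 1):
        # Poisson probability of exactly n jumps over the horizon.
        weight = math.exp(-lam * T) * (lam * T) ** n / math.factorial(n)
        # Conditional on n jumps, ln V_T is normal with shifted mean and wider variance.
        mean = math.log(V0) + (mu - 0.5 * sigma**2) * T + n * jump_mean
        sd = math.sqrt(sigma**2 * T + n * jump_sd**2)
        total += weight * norm_cdf((math.log(D) - mean) / sd)
    return total


# Hypothetical firm: assets 100, debt face 60, 8% drift, 20% asset volatility.
params = dict(V0=100.0, D=60.0, mu=0.08, sigma=0.20, T=1.0)
pd_gbm = pd_diffusion(**params)
pd_jump = pd_jump_diffusion(
    **params, lam=0.02, jump_mean=math.log(0.8), jump_sd=0.10
)
print(f"diffusion PD: {pd_gbm:.4%}   jump-diffusion PD: {pd_jump:.4%}")
```

Setting `lam=0` collapses the mixture to the single Gaussian term, which is a convenient sanity check that the two functions agree when the jump channel is switched off.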
The empirical EDF calibration absorbs all three effects by construction. If you fit a smooth map from DD to realized default rates, the map folds in the jump, information, and strategic contributions automatically. The disadvantage is that the resulting PD is not a PD in any rigorous no-arbitrage sense; it is a conditional expectation of a default indicator given a model-implied covariate. For capital purposes that is usually good enough; for exotic-derivative pricing it is not. ### Numerical implementation The risk-neutral PD is larger than the physical PD because the drift under $\mathbb{Q}$ is the risk-free rate, and any firm with $\mu > r$ is riskier in the risk-neutral world than in the real world. That wedge is the basis of the credit risk premium. ### A simple empirical PD map If you have your own default database, you can build a KMV-style map in a dozen lines. The recipe is to bucket DD, compute the realized one-year default rate per bucket, and regress a logit of the default rate on DD to smooth. @bharath2008forecasting gives an influential comparison between the full structural DD and a naive approximation that skips the iterative solver; the naive version retains nearly all of the predictive power. That table is the empirical skeleton of EDF. KMV fits a smooth monotone curve through the `DD_mid`-to-`default_rate` mapping using a log-link-style GLM; the specific functional form is proprietary but the idea is exactly what the code above produces. ## The KMV implementation: inverting equity to recover asset value and volatility ### The identification problem Everything in the structural model is written in terms of unobservable inputs: $V_t$ and $\sigma_V$. Only $E_t$ is observed directly, and $\sigma_E$ can be estimated from its time series. We need a way to back out $V_t$ and $\sigma_V$ from $(E_t, \sigma_E, D, r, T)$. Two equations pin down the two unknowns. The first is (@eq-merton-equity) relating $E$ to $V$: $$ E = V \Phi(d_1) - D e^{-rT} \Phi(d_2). 
$$

The second is Ito's lemma applied to $E$ as a function of $V$. Since $E = f(V)$ with $f$ the BS call function, the instantaneous volatility of $\ln E$ satisfies

$$ \sigma_E = \frac{V}{E} \frac{\partial E}{\partial V} \sigma_V = \frac{V}{E} \Phi(d_1) \sigma_V. $$

Here $\partial E / \partial V = \Phi(d_1)$ is the Black-Scholes delta of equity with respect to assets. Multiplying by $V/E$ rescales to log-returns. Equation (@eq-sigma-e-vega) is the structural-model hedge ratio.

@jones1984contingent and early KMV memos solved the system by simultaneous nonlinear root-finding on $(V, \sigma_V)$ given a single observation of $(E, \sigma_E)$. The modern KMV approach instead uses an iterative fixed-point algorithm on an observed equity time series.

### The iterative KMV algorithm

The standard KMV procedure, popularized by @vassalou2004default, is:

1. Initialize $\sigma_V^{(0)} = \sigma_E \cdot E_t/(E_t + D)$ (the naive leverage adjustment) and $V_t^{(0)} = E_t + D$.
2. Holding $\sigma_V^{(k)}$ fixed, invert (@eq-merton-equity) pointwise across the equity time series to get $V_t^{(k+1)}$ for every $t$.
3. Compute $\sigma_V^{(k+1)}$ as the annualized standard deviation of $\log V_t^{(k+1)} - \log V_{t-1}^{(k+1)}$.
4. Repeat steps 2-3 until $|\sigma_V^{(k+1)} - \sigma_V^{(k)}| < \epsilon$.

There are two subtleties that matter for numerical stability.

**Jensen-style correction.** Equation (@eq-sigma-e-vega) holds instantaneously but is a nonlinear transformation of $V$, so any finite-sample estimator of $\sigma_E$ implies a non-trivial $\sigma_V$. Using (@eq-sigma-e-vega) directly as a one-step estimator gives $\sigma_V \approx \sigma_E / (\Phi(d_1) V/E)$, but $\Phi(d_1)$ itself depends on $\sigma_V$. Iterating closes the loop.
@duan1994maximum and @duan2004structural show that the KMV fixed-point estimator is closely related to the maximum-likelihood estimator for the transformed GBM and is consistent for $\sigma_V$ under the structural model, with the same asymptotic distribution up to a boundary correction.

**Fixed-point monotonicity.** Write $g$ for the update map $\sigma_V^{(k)} \mapsto \sigma_V^{(k+1)}$ defined by steps 2-3. The map $g$ is a contraction in reasonable regions of parameter space, which is why Picard iteration converges. When the firm is deeply in the money ($V \gg D$), equity is close to $V - D e^{-rT}$, so the inversion from $E$ to $V$ is almost linear with slope near one and $g$ is nearly flat in $\sigma_V$; when the firm is near default ($V \approx D$), $g$ can temporarily become non-contractive and produce oscillations. Practical implementations add damping $\sigma_V^{(k+1)} = (1 - \alpha) \sigma_V^{(k)} + \alpha\, g(\sigma_V^{(k)})$ with $\alpha \in (0, 1)$.

### KMV solver implementation

The loop is not vectorized inside `brentq` because the bracketing root-finder needs a scalar objective. For a 252-observation equity time series, this runs in roughly 100 milliseconds per iteration on a laptop. Production KMV systems run the same idea on millions of firm-year observations by replacing `brentq` with a vectorized Newton step on $\ln V$ since the BS call is monotone in $V$.

### Testing the solver on a simulated Compustat-like sample

Recovery is accurate to a fraction of a percent. With 252 daily observations, the limiting factor is not bias but the sampling error of the log-asset-return standard-deviation estimator, whose standard error is approximately $\sigma_V / \sqrt{2n}$ for $n$ return observations under normality. That is why KMV uses rolling windows of one or two years and shrinks to a sector mean.

### Why the naive BS-implied asset volatility breaks

A common error in applied work is to compute $\sigma_V = \sigma_E \cdot E/(E + D)$, often called "leverage-adjusted" equity volatility. This is the starting point of the KMV iteration, not its output.
The error scales like the difference between the correct multiplier $E/(\Phi(d_1) V)$ and the naive multiplier $E/(E + D)$, which can be large when leverage is high or when the firm is close to default. @bharath2008forecasting points out that even this naive quantity, when plugged back into the DD formula, retains most of the predictive power of the full iterative DD, but the predicted level of PD can be off by a factor of two or three. The naive estimate is biased low because, by (@eq-merton-equity), $\Phi(d_1) V = E + D e^{-rT}\Phi(d_2)$, which is smaller than $E + D$ for $r \ge 0$, so the correct multiplier $E/(\Phi(d_1) V)$ exceeds $E/(E + D)$. The iterative solver corrects the bias.

### Common implementation gotchas

A production KMV pipeline hits several non-obvious pitfalls that take years to surface.

**Face value definition.** Merton's $D$ is the face of a single zero-coupon bond. Real firms have short-term debt, long-term debt, off-balance-sheet commitments, and operating leases. @vassalou2004default uses $D = \text{short-term debt} + \tfrac{1}{2} \cdot \text{long-term debt}$ as a pragmatic approximation. The factor $\tfrac{1}{2}$ reflects the average time to maturity of long-term debt and the coupons that will be paid before the notional. @bharath2008forecasting show that the choice of $D$ definition matters less than the KMV literature's own emphasis would suggest; several alternative definitions produce DDs that are rank-correlated at 0.95 or higher.

**Horizon** $T$. KMV uses $T = 1$ year. For capital purposes this matches the Basel one-year PD horizon. For bond pricing and credit-derivative applications, the horizon should match the instrument's maturity. The DD at $T = 5$ years and $T = 1$ year can differ substantially because the drift term $(\mu - \sigma_V^2/2) T$ scales linearly with $T$ while the noise scales with $\sqrt{T}$; for high-drift firms, longer horizons produce higher DDs.

**Dividends.** A firm that pays dividends has an effective negative drift of size equal to the dividend yield, because assets drain out of the firm.
The standard fix is to use $\mu - q$ in the DD formula, where $q$ is the dividend yield. Ignoring dividends for mature blue-chip firms with 2-4% dividend yields biases DD upward by 10-20%. **Stock splits and corporate actions.** Equity price history must be adjusted for splits, reverse splits, and spin-offs before the KMV iteration runs. Splits are easy; spin-offs change the asset base mid-sample and require a segment-by-segment reconstruction of $V$. A standard validation step is to compare implied $V_t$ against quarterly book-value-of-assets from Compustat; a persistent large gap usually indicates an unhandled corporate action. **Delisting.** Firms that delist for reasons other than default (going private, merging into another entity) must be censored at the delisting date, not treated as survivors. The delisting indicator in CRSP (DLSTCD codes 200-699) is the standard source; @shumway2001forecasting provides the conventional mapping. **Survivorship bias.** The KMV panel must include firms that have already defaulted, not just currently listed firms. A backtest on currently listed Compustat firms will overstate the model's accuracy by 20-40% because the most informative data points (realized defaults) are missing. The correct panel comes from the CRSP-Compustat merged database with all historical firm-years included. **Convergence failures.** The iterative solver occasionally fails to converge for firms with extreme leverage or near-zero equity. The symptom is $\sigma_V$ oscillating between two attractors. The standard fix is damping (as in the code above) plus a fallback to the naive estimator when damping does not settle. A production pipeline logs convergence diagnostics and flags firms with non-convergence for manual review. ## Comparing structural DD to Altman Z on a simulated Compustat sample ### Setup @altman1968zscore derived Z as a discriminant-analysis score on a small US bankruptcy sample. 
The formula is

$$ Z = 1.2 X_1 + 1.4 X_2 + 3.3 X_3 + 0.6 X_4 + 1.0 X_5, $$

where $X_1$ = working capital / total assets, $X_2$ = retained earnings / total assets, $X_3$ = EBIT / total assets, $X_4$ = market value of equity / book value of total liabilities, and $X_5$ = sales / total assets. Higher Z means safer. The classical thresholds are Z above 2.99 (safe), between 1.81 and 2.99 (gray), below 1.81 (distress). @altman1977zeta updated the coefficients to ZETA, and subsequent work [@ohlson1980financial; @shumway2001forecasting; @campbell2008search] generalized the approach to logistic, hazard, and multi-period frameworks.

Structural DD and Altman Z are conceptually different: DD is a forward-looking, market-implied distance to the default barrier; Z is a backward-looking, accounting-implied discriminant. The natural question is whether one dominates the other on the same sample.

### A synthetic Compustat panel

Public data note: a structural KMV demonstration needs the joint distribution of equity time series, book leverage, and a default label. The accounting side is in the @liang2016financial Taiwanese Bankruptcy Prediction panel (UCI 572) used in @sec-ch06-altman-replication, but UCI 572 ships no daily equity prices, no market capitalization series, and no firm identifiers that would let one join external market data; this rules it out for distance-to-default. Free firm-month equity data (Yahoo Finance via `yfinance`, AlphaVantage) cover only currently-listed firms and so suffer from survivorship bias, which is precisely the bias that would inflate any out-of-sample KMV result. Compustat-CRSP (paywalled) is the production data source. The synthetic panel below preserves the joint dependence between accounting health and asset volatility that makes the DD-versus-Z comparison meaningful, without distributing licensed data.

Each firm has a latent "health" variable that drives leverage, asset volatility, asset drift, and accounting inputs jointly.
Default risk is therefore cross-correlated through `latent`, which gives both DD and Z a signal to pick up. ### Compute DD, PD, Altman Z ### Rank-correlation and discrimination The structural DD dominates here because the label was generated from PD. That is a tautology. The more honest comparison uses an independent default signal. Now Z, which loads on multiple accounting variables correlated with the latent health, catches up. The empirical literature [@bharath2008forecasting; @campbell2008search] reports exactly this pattern on real data: DD and Z have correlated but not redundant information, and hybrid models that include both dominate either alone. ### Plotting DD over time for healthy and distressed firms The DD trajectory of the distressed firm grinds toward zero over three years while the healthy firm drifts up. In practice, a DD below about 2 is a strong warning signal; below 1 is typically an investment-grade-to-junk migration; below 0 means the model implies the firm is already default-likely at the horizon. ### What DD tells you that a bond yield does not There is a tempting shortcut in credit analysis: read the bond yield, subtract the risk-free rate, call the result the implied PD (after dividing by one minus recovery). This gets you to a risk-neutral PD that the market has already priced. Why bother with Merton-DD at all? Three reasons, in order of importance. First, bond yields incorporate a credit risk premium that is a multiple of the physical PD. The typical long-run wedge between risk-neutral and physical PD for investment-grade corporates is 4x to 8x; for high-yield it narrows to 2x to 4x. A 200 bp spread does not mean a 200 bp physical PD. @huang2012how decomposes observed spreads into expected loss, credit risk premium, tax effects, and liquidity effects, and finds that in the investment-grade segment less than a third of the spread is expected loss. Second, not all firms have liquid bond markets. 
Middle-market corporates, private firms, and emerging-market issuers rarely have traded bonds with clean yields. Equity-based DD is available for any publicly listed firm and for many private firms through comparable-company adjustments. KMV's private-firm model uses sector regressions of public-firm DD on accounting ratios to produce DDs for private firms with no market data. Third, structural DD has forward-looking content that bond yields miss at moderate horizons. Bond yields are dominated by near-term default risk; Merton DD at a one-year horizon blends near-term volatility and longer-horizon drift, which is often what a through-the-cycle risk manager wants. The practical compromise is to use all three signals: KMV EDF from equity, market-implied PD from bonds and CDS, and a logistic-hazard model on accounting and macro covariates. Each provides a different slice of the information set, and a wholesale credit desk that watches all three detects regime shifts that a single signal would miss. ## Reduced-form models: Jarrow-Turnbull ### The reduced-form idea Structural models tie default to the firm's capital structure and asset process. Reduced-form models do the opposite. They treat the default time $\tau$ as an exogenous random variable with a hazard-rate process $\lambda_t$, and they calibrate $\lambda_t$ to market prices of defaultable bonds or CDS without modeling why default happens. The cost is that you cannot inspect the driver of $\lambda_t$ from fundamentals; the benefit is that you get exact calibration to any observed term structure and clean machinery for pricing exotic credit derivatives. @jarrow1995pricing is the canonical paper. The two-state model posits that default is a Poisson event with intensity $\lambda$, independent of interest rates in the simplest case and correlated in extensions. 
@jarrow1997markov generalizes to a Markov rating-migration structure; @lando1998cox develops the Cox-process framework with stochastic $\lambda_t$; @duffie1999modeling recasts the price of a defaultable cash flow as a discounted expectation with a default-adjusted discount rate. ### Hazard rates and survival probabilities Define the hazard rate $$ \lambda_t = \lim_{h \to 0^+} \frac{1}{h} \Pr[t \leq \tau < t + h \mid \tau \geq t]. $$ Cumulative hazard is $$ \Lambda(t) = \int_0^t \lambda_s ds. $$ Survival probability: $$ S(t) = \Pr[\tau > t] = \exp\!\left[-\Lambda(t)\right] = \exp\!\left[-\int_0^t \lambda_s ds\right]. $$ In the homogeneous case with constant $\lambda$, $\tau \sim \text{Exp}(\lambda)$ and $S(t) = e^{-\lambda t}$. In the inhomogeneous case, $\lambda_t$ is a deterministic or stochastic function of time and possibly covariates; the Cox-process case of @lando1998cox makes $\lambda_t$ itself a stochastic process. ### Pricing a zero-coupon defaultable bond Consider a bond with face value 1 maturing at $T$, no coupons, and a recovery rate $R$ paid at $T$ in the event of default before $T$ (the "recovery-of-treasury" convention; recovery of face value would instead pay $R$ at the default time). Under the risk-neutral measure with deterministic $\lambda$ and $r$: $$ P(0, T) = \mathbb{E}^{\mathbb{Q}}\!\left[e^{-rT} \mathbf{1}\{\tau > T\}\right] + R \cdot \mathbb{E}^{\mathbb{Q}}\!\left[e^{-rT} \mathbf{1}\{\tau \leq T\}\right]. $$ Independence of $\tau$ and $r$ (the simplest Jarrow-Turnbull case) gives $$ P(0, T) = e^{-rT}\left[S(T) + R(1 - S(T))\right] = e^{-rT}\left[e^{-\Lambda(T)} + R(1 - e^{-\Lambda(T)})\right]. $$ Take logs and compare to the risk-free price $e^{-rT}$ to get the implied credit spread $$ s(T) = -\frac{1}{T} \ln\!\left[S(T) + R(1 - S(T))\right]. $$ For constant $\lambda$, small $\lambda T$, and $S(T) \approx 1 - \lambda T$, $$ s(T) \approx \lambda (1 - R), $$ which is the celebrated "spread is hazard times loss-given-default" approximation that industry CDS desks use every day.
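The spread formula and its first-order approximation can be compared numerically. The sketch below assumes a constant hazard and uses illustrative values ($\lambda = 2\%$, $R = 40\%$) that are not taken from the book's data; the function name `credit_spread` is hypothetical.

```python
import math


def credit_spread(hazard: float, recovery: float, maturity: float) -> float:
    """Exact spread s(T) = -(1/T) ln[S(T) + R (1 - S(T))] for a constant hazard."""
    survival = math.exp(-hazard * maturity)  # S(T) = exp(-lambda * T)
    return -math.log(survival + recovery * (1.0 - survival)) / maturity


# Illustrative inputs (assumptions, not calibrated to any issuer).
hazard, recovery = 0.02, 0.40
approx = hazard * (1.0 - recovery)  # first-order rule: lambda * (1 - R)

for maturity in (0.5, 1.0, 5.0, 10.0):
    exact = credit_spread(hazard, recovery, maturity)
    print(f"T={maturity:>4}: exact={exact * 1e4:6.1f} bp, approx={approx * 1e4:6.1f} bp")
```

The exact spread sits slightly below $\lambda(1 - R)$ and the gap widens with maturity, so the approximation is best read as a short-maturity rule of thumb.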
### Contrasting structural and reduced-form Structural models derive PD from the capital structure. The advantage is interpretability and a tight link to fundamentals. The disadvantage is that they miss short-horizon default risk because diffusion processes do not jump: with $V$ following a GBM, $\Pr[V_T < D]$ at short $T$ goes to zero like $\Phi(-\text{DD}) \sim e^{-\text{DD}^2/2}$, which undershoots observed short-maturity spreads badly. The fixes split into two families. The first keeps the structural skeleton and adds either jumps, stochastic volatility, or unobserved asset value (the incomplete-information route formalized by @duffielando2001 and developed in @sec-ch08-filtration). The second switches to reduced-form altogether, as @duffie1999modeling and @sundaresan2013review survey. Reduced-form models bypass the mechanism and match spreads by construction. The advantage is calibration and tractability for exotics. The disadvantage is that $\lambda_t$ is a data-fit object with no causal story; macroeconomic stress tests must bolt on an external model for $\lambda_t$. Hybrid approaches combine the two: DD becomes an input to a logistic or hazard model alongside accounting ratios and macro variables. @campbell2008search is the best-known hybrid, using DD together with accounting ratios in a dynamic logit to forecast bankruptcies and delistings. @duffie2009frailty adds a latent frailty factor that explains the bunching of defaults in crises beyond what DD and accounting can capture. The frailty factor is effectively a reduced-form random intensity common to many firms, and it improves out-of-sample calibration in stress periods. ### Jarrow-Turnbull simulation and MLE The exponential MLE is the simplest Jarrow-Turnbull fit. 
When intensity varies over time, one can fit a piecewise-constant $\lambda_t$ by maximum likelihood across the hazard segments, or fit a Cox partial likelihood with covariates; both reduce to the same exponential MLE in the piecewise-constant case without covariates. The implied term structure is almost flat because $\lambda$ is constant. Non-flat term structures in practice reflect either $\lambda_t$ varying with $t$ or rating migrations in the @jarrow1997markov extension. ### Rating migrations: Jarrow-Lando-Turnbull The single-hazard model cannot reproduce the empirical pattern of transitions between rating categories. @jarrow1997markov extend the reduced-form framework by treating the credit rating as a continuous-time Markov chain over states $\{1, 2, \dots, K\}$, where states $1, \dots, K-1$ are the non-default rating grades and state $K$ is the absorbing default state. The generator matrix $\mathbf{Q}$ collects the transition intensities; the transition probability matrix over horizon $T$ is $$ \mathbf{P}(T) = \exp(\mathbf{Q} T), $$ using the matrix exponential. Calibrating $\mathbf{Q}$ from observed one-year transition matrices published by Moody's and S&P is standard practice. Under risk-neutral dynamics the generator $\mathbf{Q}^{\mathbb{Q}}$ may differ from the physical generator $\mathbf{Q}^{\mathbb{P}}$ through a "credit risk premium adjustment" that scales transitions toward default by a factor greater than one. @jarrow1997markov derive the adjustment from observed bond prices, and empirical estimates for investment-grade corporates put the adjustment factor in the 2 to 4 range. The rating-migration model solves the practical problem of pricing instruments whose payoff depends on rating, not just default: corporate bonds with rating-linked coupon step-ups, credit-default swaps with rating-triggered knockouts, and structured products with rating-based waterfall tranches.
It also provides a natural framework for downgrade-risk management: the probability of downgrading from BBB to BB in the next year is directly computable from $\mathbf{P}(1)$. ### Correlated defaults Both structural and reduced-form models in their single-firm forms fail to capture the correlation in defaults across firms. Observed defaults are clustered in time: 2001, 2008, and 2020 each produced unusual bunching relative to what an independent-default model would predict. Two mechanisms generate default correlation in the structural framework. The first is a common asset-return factor: all firms' $V_t$ respond to a common market factor, so joint downturns push multiple firms below their barriers simultaneously. This is the idea underlying the @vasicek2002distribution and @gordy2003risk one-factor models used in the Basel IRB formula. The second is a common jump factor: systemic events like financial crises deliver simultaneous jumps to many firms' asset values, which a diffusion-only model cannot capture. @duffie2009frailty document a third mechanism: a latent "frailty" factor that is not captured by observed covariates. Even after controlling for DD, accounting ratios, and macro variables, US corporate defaults cluster more than the hazard model predicts. Adding a filtered unobserved factor improves out-of-sample calibration materially, especially in crisis periods. The frailty factor can be interpreted as capturing common information that market participants have but modelers do not. @das2007common test whether the bunching of defaults is consistent with a doubly stochastic hazard model (the Cox-process of @lando1998cox) and reject the independence hypothesis: conditional on observed covariates, defaults are still correlated. This has become the empirical motivation for portfolio credit risk models that go beyond independent-firm PDs. 
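The first mechanism, a common asset-return factor, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: a homogeneous portfolio, a Gaussian one-factor asset return $A_i = \sqrt{\rho}\, M + \sqrt{1 - \rho}\, \varepsilon_i$, and default when $A_i$ falls below $\Phi^{-1}(\text{PD})$. The portfolio size, PD, and asset correlation are illustrative values, not estimates, and `simulate_default_counts` is a hypothetical helper name.

```python
import math
import random
from statistics import NormalDist, mean, pvariance


def simulate_default_counts(n_firms, n_years, pd, rho, seed=0):
    """Annual default counts under a Gaussian one-factor asset-return model."""
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(pd)  # default barrier on the standardized asset return
    load_market, load_idio = math.sqrt(rho), math.sqrt(1.0 - rho)
    counts = []
    for _ in range(n_years):
        market = rng.gauss(0.0, 1.0)  # common factor shared by all firms this year
        defaults = 0
        for _ in range(n_firms):
            asset = load_market * market + load_idio * rng.gauss(0.0, 1.0)
            if asset < threshold:
                defaults += 1
        counts.append(defaults)
    return counts


# Same marginal PD in both runs; only the asset correlation differs (assumed values).
independent = simulate_default_counts(200, 1000, pd=0.02, rho=0.0, seed=1)
correlated = simulate_default_counts(200, 1000, pd=0.02, rho=0.2, seed=1)

for label, counts in (("rho=0.0", independent), ("rho=0.2", correlated)):
    print(f"{label}: mean={mean(counts):.2f}, variance={pvariance(counts):.2f}, max={max(counts)}")
```

Both runs have the same expected default count, but the correlated run shows a much larger variance and a fatter right tail, which is the bunching pattern described above. Frailty and jump mechanisms are not represented in this sketch.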
### Jarrow-Turnbull with covariates: the proportional hazards form The estimator recovers the true coefficients to two decimal places. This is the workhorse of the @duffie2007multi multi-period default-prediction literature: hazard-rate models with DD as one of the covariates among firm financial ratios and macro factors. ### Dynamic hazard versus static logistic @shumway2001forecasting makes an important methodological point that applies directly to credit scoring: a static logit that uses a single observation per firm, when the underlying data-generating process is a multi-period hazard, produces biased coefficients and makes inefficient use of the data. The fix is to use a discrete-time hazard specification that acknowledges the within-firm repeated observations. The Shumway setup writes the conditional probability of default in year $t$ given survival to year $t-1$ as $$ \Pr[\tau = t \mid \tau \geq t, X_{t-1}] = \frac{1}{1 + \exp(-X_{t-1}^\top \beta - \alpha_t)}, $$ with $\alpha_t$ a baseline-hazard term. The likelihood contribution of a firm that defaults in year $t$ is $$ L_i = \left[\prod_{s=1}^{t-1} \Pr[\tau \neq s \mid \tau \geq s, X_{i, s-1}]\right] \cdot \Pr[\tau = t \mid \tau \geq t, X_{i, t-1}], $$ while a firm censored at $t^*$ contributes the product of survival probabilities only. @shumway2001forecasting shows this likelihood is identical to a pooled logit on the firm-year panel with each firm contributing one observation per year until default or censoring, which is why the approach is sometimes called "pooled logit with risk-set sampling." The key caveat is that firm-years from the same firm are not independent draws: the pooled-logit point estimates are the hazard-model maximum-likelihood estimates, but the test statistics reported by a standard logit routine must be adjusted for the number of firms rather than the number of firm-years, and both the coefficients and the standard errors differ from those of the naive cross-sectional logit.
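The equivalence between the firm-level hazard likelihood and the pooled firm-year logit can be checked on a toy panel. The sketch below makes simplifying assumptions that are not in the text: a single scalar covariate, a constant baseline term $\alpha_t = \alpha$, and three made-up firms. All function names are hypothetical.

```python
import math


def hazard(x, beta, alpha):
    """Discrete-time default probability for one firm-year (logit link)."""
    return 1.0 / (1.0 + math.exp(-(beta * x + alpha)))


def firm_log_likelihood(covariates, defaulted, beta, alpha):
    """log L_i: survive every year at risk, except a default in the final year."""
    last = len(covariates) - 1
    loglik = 0.0
    for s, x in enumerate(covariates):
        h = hazard(x, beta, alpha)
        is_default_year = defaulted and s == last
        loglik += math.log(h) if is_default_year else math.log(1.0 - h)
    return loglik


def expand_to_firm_years(firms):
    """One row per firm-year at risk; y = 1 only in the year of default."""
    rows = []
    for firm_id, covariates, defaulted in firms:
        last = len(covariates) - 1
        for s, x in enumerate(covariates):
            rows.append((firm_id, s + 1, x, int(defaulted and s == last)))
    return rows


def pooled_logit_log_likelihood(rows, beta, alpha):
    """Ordinary logit log-likelihood over the stacked firm-year rows."""
    total = 0.0
    for _firm_id, _year, x, y in rows:
        h = hazard(x, beta, alpha)
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return total


# Toy data: (firm id, lagged covariate per year at risk, defaulted in last year?).
firms = [
    ("A", [0.2, 0.5, 1.1], True),
    ("B", [-0.3, -0.1, 0.0, 0.2], False),  # censored after four years
    ("C", [1.5], True),
]
beta, alpha = 0.8, -3.0  # arbitrary parameter values for the check

rows = expand_to_firm_years(firms)
by_firm = sum(firm_log_likelihood(xs, d, beta, alpha) for _, xs, d in firms)
pooled = pooled_logit_log_likelihood(rows, beta, alpha)
print(len(rows), by_firm, pooled)
```

The two log-likelihoods agree at any parameter value, so any logit routine run on the expanded panel maximizes the hazard likelihood. The sketch does not address the standard-error adjustment.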
@campbell2008search build on the Shumway framework with an expanded covariate set: DD from a KMV-style solver, equity volatility from recent returns, profitability, leverage, cash holdings, market-to-book, and relative price performance. Their preferred specification puts DD and equity volatility in the same model, which is mildly redundant by construction; both contain information about asset volatility. The empirical coefficient on DD remains large and significant even with volatility in the model, which suggests that the drift component of DD ($\mu - \sigma_V^2/2$) is adding something over and above pure volatility. ### CDS and market-implied PD A liquid credit-default-swap market exists for a few thousand corporate reference entities. CDS spreads imply risk-neutral default probabilities directly, without needing a structural inversion. The standard bootstrap procedure is: 1. Observe par CDS spreads at maturities 1y, 3y, 5y, 7y, 10y. 2. Assume a recovery rate, typically 40% for senior unsecured corporate bonds. 3. Solve for a piecewise-constant hazard rate $\lambda_t$ that reprices the CDS term structure exactly. The resulting $\lambda_t$ is a risk-neutral intensity. Converting to physical hazard requires a credit risk premium assumption, which in practice is calibrated from the historical ratio of observed default rates to CDS-implied rates, typically 0.25 to 0.5 for investment grade. For firms with liquid CDS, the CDS-implied PD is usually the preferred input for short-horizon trading decisions: CDS updates in real time, reflects credit market consensus, and is arbitrage-consistent with bond prices. For firms without liquid CDS (the vast majority of corporates by count), the KMV-style structural PD remains the standard. A sophisticated credit desk runs both and reconciles discrepancies as potential trading signals. 
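The three bootstrap steps above can be sketched in a few lines. This is a simplified illustration, not a market-standard pricer: it assumes a flat risk-free rate, quarterly premium payments, no accrual on default, and default settlement at the end of each quarter. The spread curve is made up, and all function names are hypothetical. It also assumes the quoted curve is arbitrage-free, so that each bisection bracket contains a root.

```python
import math


def survival(t, knots, hazards):
    """S(t) for a piecewise-constant hazard; hazards[k] applies up to knots[k]."""
    cumulative, previous = 0.0, 0.0
    for knot, lam in zip(knots, hazards):
        upper = min(t, knot)
        if upper > previous:
            cumulative += lam * (upper - previous)
        previous = knot
        if t <= knot:
            break
    return math.exp(-cumulative)


def cds_pv_gap(spread, maturity, knots, hazards, recovery, rate, dt=0.25):
    """Protection-leg PV minus premium-leg PV per unit notional."""
    premium = protection = 0.0
    s_prev = 1.0
    for i in range(1, round(maturity / dt) + 1):
        t = i * dt
        s_now = survival(t, knots, hazards)
        discount = math.exp(-rate * t)
        premium += spread * dt * discount * s_now
        protection += (1.0 - recovery) * discount * (s_prev - s_now)
        s_prev = s_now
    return protection - premium


def bootstrap_hazards(maturities, spreads, recovery=0.40, rate=0.03):
    """Solve segment by segment for the hazard that reprices each par spread."""
    maturities = list(maturities)
    hazards = []
    for k, (maturity, spread) in enumerate(zip(maturities, spreads)):
        lo, hi = 1e-10, 5.0
        for _ in range(200):  # bisection; the gap is increasing in the trial hazard
            mid = 0.5 * (lo + hi)
            gap = cds_pv_gap(spread, maturity, maturities[: k + 1],
                             hazards + [mid], recovery, rate)
            if gap > 0.0:
                hi = mid
            else:
                lo = mid
        hazards.append(0.5 * (lo + hi))
    return hazards


# Illustrative upward-sloping par spread curve (assumed, in decimal).
maturities = [1, 3, 5, 7, 10]
spreads = [0.0060, 0.0080, 0.0100, 0.0110, 0.0120]
hazards = bootstrap_hazards(maturities, spreads)
for m, s, lam in zip(maturities, spreads, hazards):
    print(f"{m:>2}y spread={s * 1e4:5.0f} bp  segment hazard={lam:.4f}")
```

For a flat spread curve the bootstrapped hazards are flat and close to $s / (1 - R)$, which ties the bootstrap back to the spread approximation in the zero-coupon pricing section.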
## Empirical comparison: structural, accounting, hybrid ### What the literature has settled Three families of corporate-default models compete in the empirical literature. **Structural.** DD from @merton1974pricing and its commercial implementation in KMV. Inputs: equity price, equity volatility, leverage. Output: PD as $\Phi(-\text{DD})$ or a proprietary EDF map. **Accounting-based.** @altman1968zscore (linear discriminant, @sec-ch06-discriminant), @ohlson1980financial (static logit), @shumway2001forecasting (hazard logit). Inputs: balance-sheet ratios. Output: default score, interpretable as log-odds of default. **Hybrid/dynamic.** @campbell2008search, @duffie2007multi, @duffie2009frailty. Inputs: DD plus accounting ratios plus macro/industry factors, fit via dynamic hazard model, often with latent frailty. The empirical verdict, across multiple studies on US data, is reasonably consistent: 1. @bharath2008forecasting show that a naive DD, computed without the iterative KMV solver, has nearly the same forecasting accuracy as the full DD. They also show that DD enters significantly in a hazard model with accounting ratios but does not dominate Altman Z. 2. @campbell2008search report an AUC near 0.94 for one-year bankruptcy prediction using a dynamic logit with twelve accounting and market covariates; DD by itself reaches about 0.87. The incremental contribution of DD after controlling for profitability, leverage, and equity volatility is modest but significant. 3. @hillegeist2004assessing compare Merton-based BSM probabilities to Altman Z and Ohlson O on US bankruptcies 1980-2000 and find BSM dominates accounting-only models but is dominated by the hybrid. 4. @duffie2009frailty document that a common frailty factor, on top of DD and accounting variables, is necessary to explain the clustering of defaults in 2001 and 2008. The practical implication is that structural DD is a useful covariate but not a sufficient statistic for corporate PD. 
Wholesale IRB models at large banks typically blend DD, accounting ratios, and industry/macro overlays, with ratings benchmarks from Moody's EDF and S&P as external anchors. ### Benchmark code We reuse the simulated panel from earlier, compute DD, Z, and an Ohlson-style logit, and compare discrimination on a held-out default label that mixes DD and accounting information. The hybrid dominates on the simulated panel because we wrote the DGP to mix both families. On real Compustat-CRSP panels [@bharath2008forecasting; @campbell2008search] the qualitative ordering is the same though the margins are smaller. ### Calibration and profit-based evaluation Discrimination is not enough for a regulatory model. Wholesale IRB capital is nonlinear in PD, so miscalibration compounds into capital misallocation. @pluto2005thinking derive conservative upper confidence bounds on PD estimates under low default sampling, which is especially relevant for investment-grade wholesale portfolios where default counts are thin. A typical validation suite for a Merton-DD-based model includes: - **Rank correlation** with external ratings (Moody's, S&P). - **Transition matrices** over one-year and five-year windows. - **Calibration** by PD bucket: realized vs expected default frequency. - **Slotting** into Basel master scales where the regulator requires it. Bins are close on average but will deviate in the tails on real data, especially in the lowest-PD buckets where a handful of defaults can move the realized rate by an order of magnitude. ### Through-the-cycle versus point-in-time PD Wholesale PD estimates come in two flavors that do not always play nicely together. Point-in-time (PIT) PD conditions on current information and is the natural output of KMV EDF: a firm's PD today given equity, leverage, and market conditions today. Through-the-cycle (TTC) PD is an expected PD over a full business cycle, stripped of cyclical variation: the firm's PD averaged over booms and busts.
Basel IRB rules require TTC PDs to avoid procyclical capital swings: if PD rises in a downturn, required capital rises, which forces banks to contract lending exactly when the economy most needs credit. @eba2017gl lays out the TTC requirement in detail. The practical methods for converting PIT to TTC are: 1. **Time-series smoothing.** Average a firm's PIT PD over the last one to three years. Simple but it lags reality. 2. **Macro-factor decomposition.** Regress PIT PD (or its logit) on macroeconomic variables and strip out the macro component, leaving a residual firm-specific PD. Recompose using long-run average values of the macro factors. This is the approach in @chen2010macroeconomic applied at the portfolio level. 3. **Rating anchoring.** Map PIT PDs to external rating categories, use historical long-run average default rates per rating as the TTC PD. This is the industry-standard approach for wholesale IRB and is documented in @pluto2005thinking. KMV EDF is explicitly PIT and must be converted for regulatory use. Through the 2008-2009 crisis, PIT EDFs rose dramatically and then reverted while realized default rates lagged by six to twelve months. The lag is exactly what you expect from a forward-looking signal: markets price default risk before it materializes in accounting figures or defaults. ### The low-default portfolio problem Investment-grade wholesale portfolios have typical one-year default rates of 5-20 bps. In a bank portfolio of 1,000 investment-grade corporate exposures, the expected number of defaults is 0.5 to 2 per year. Estimating a PD under this much noise is hard, and estimating the PD by rating bucket is essentially impossible from the bank's own data. @pluto2005thinking derive upper confidence bounds on PD estimates under low default sampling: given $n$ exposures and $d$ observed defaults over $T$ years, a one-sided $(1 - \alpha)$ upper confidence bound on $\lambda$ is obtained by inverting the exponential likelihood.
With $n = 1000$, $d = 1$, $T = 1$, and $\alpha = 5\%$, the upper bound is approximately 4.7 per 1000, or 47 bps, even though the point estimate is 10 bps. The practical implication: banks with small wholesale portfolios cannot rely on internal data alone for IRB PD calibration. They either pool with external data (via Moody's, S&P, Credit Bureau of Japan, etc.) or anchor to published rating-grade default rates. The KMV EDF is one of the standard anchors; the Basel IRB framework allows PIT-to-TTC conversion with external data provided the bank justifies the approach. ## Scalability A production Merton-KMV pipeline runs across a universe of tens of thousands of public firms with daily equity data going back decades. The scale challenge is the pointwise root-find on $V$ inside the iterative solver. Three tiers of scale matter. **Tier 1: single firm, single day.** `scipy.optimize.brentq` on a scalar function, sub-millisecond. This is the baseline. **Tier 2: single firm, time series of one year of daily data.** 252 root-finds per iteration, roughly 100 ms per iteration, 1-2 seconds for a typical convergence. Vectorizing with Newton's method and a smart warm start drops this to 50 ms per firm-year. **Tier 3: full Compustat universe, 40 years.** Roughly 10,000 firms by 10,000 trading days equals 100 million firm-days. At 50 ms per firm-year, this is manageable with parallelism: 400,000 firm-years cost about 20,000 core-seconds (roughly 5.6 core-hours), which 64 cores finish in about five minutes of compute, before I/O and scheduling overhead. The preferred setup is Spark (`pyspark`) partitioning by firm-ticker: each partition runs an independent KMV solver. `polars` is an attractive middle layer for assembling the equity panel from Compustat and CRSP without the JVM overhead. Eight Newton steps converge to machine precision for a full panel of 252 observations in a few milliseconds. At Tier 3 scale, this Newton-based solver runs over the full Compustat universe in under an hour on a single modern workstation.
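The Newton inversion mentioned above can be sketched for a single observation. This is a scalar, standard-library illustration under stated assumptions: $\sigma_V$ is held fixed (the inner step of the iterative solver), the Merton equity equation is the Black-Scholes call on assets, and the iteration runs on $\ln V$ with a warm start at $E + D e^{-rT}$. It is not the book's production solver. The function names and parameter values are hypothetical, and a vectorized version would apply the same update to an array of observations.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def merton_equity(asset_value, debt, sigma_v, rate, horizon):
    """Equity as a call on assets; returns (equity value, delta = Phi(d1))."""
    vol = sigma_v * math.sqrt(horizon)
    d1 = (math.log(asset_value / debt) + (rate + 0.5 * sigma_v ** 2) * horizon) / vol
    d2 = d1 - vol
    equity = asset_value * norm_cdf(d1) - debt * math.exp(-rate * horizon) * norm_cdf(d2)
    return equity, norm_cdf(d1)


def implied_asset_value(equity, debt, sigma_v, rate, horizon, n_steps=8):
    """Invert the equity equation for V with a fixed number of log-Newton steps."""
    # Warm start: equity plus discounted debt is an upper bound on V.
    log_v = math.log(equity + debt * math.exp(-rate * horizon))
    for _ in range(n_steps):
        v = math.exp(log_v)
        model_equity, delta = merton_equity(v, debt, sigma_v, rate, horizon)
        # d(equity)/d(log V) = V * delta, so the Newton step in log V is:
        log_v -= (model_equity - equity) / (v * delta)
    return math.exp(log_v)


# Round-trip check on assumed inputs: price equity from a known V, then recover V.
debt, sigma_v, rate, horizon = 80.0, 0.25, 0.03, 1.0
true_values = [90.0, 100.0, 120.0, 150.0]
recovered = []
for true_v in true_values:
    observed_equity, _ = merton_equity(true_v, debt, sigma_v, rate, horizon)
    recovered.append(implied_asset_value(observed_equity, debt, sigma_v, rate, horizon))
print(list(zip(true_values, recovered)))
```

Because the call price is increasing and convex in $\ln V$ and the warm start lies above the root, the iterates approach the solution from one side, which is why a fixed step count is a reasonable design here. A production solver would still add a residual check and a bracketing fall-back.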
### Polars and Dask for the equity panel The KMV solver is embarrassingly parallel at the firm level. The scalability bottleneck is usually the panel construction: assembling equity prices, dividend-adjusted close, shares outstanding, and debt face values across firms and dates. `polars` handles the Compustat-CRSP merge faster than `pandas` and with lower memory overhead. A typical workflow: This lazy pipeline streams 40 years of daily equity and quarterly accounting data through the join in a few minutes on a modern laptop. `dask` is the fallback when data exceeds RAM. A `dask.dataframe` partitioned by `gvkey` makes the KMV solver trivially parallelizable: `.map_partitions` applies the iterative solver firm-by-firm. At BIS-scale or regulator-scale data (entire universe of listed firms, multi-decade history), PySpark with partitioning by industry sector adds another order of magnitude. The KMV solver itself does not vectorize across firms cleanly because the Newton step uses firm-specific Black-Scholes parameters, but the outer loop is trivially distributed. ## Deployment A wholesale PD service built on a Merton-KMV pipeline typically has three layers. **Feeds.** Daily equity prices (Bloomberg, Refinitiv, IEX), debt face value from Compustat quarterly (`DLTT + DLC`), risk-free rates from FRED or the swap curve. The feed orchestrator runs overnight, deduplicates, and materializes to a date-partitioned Parquet lake. **Estimation.** The KMV solver runs per firm on a rolling 1-year window of daily equity. Output is a time series of $(V_t, \sigma_V^{(t)}, \text{DD}_t, \text{EDF}_t)$ per firm. The job is embarrassingly parallel; any of Airflow, Dagster, or Spark structured streaming suffices. **Serving.** A FastAPI endpoint exposes `GET /firm/{ticker}/edf?date=YYYY-MM-DD` that reads from the EDF store, applies a rating-letter transformation, and returns the mapped PD and rating. The same endpoint is called by the bank's RAROC engine and by the wholesale limits system. 
The model-management wrapper tracks: - **Model card** [@mitchell2019model] with the DGP, calibration sample, known failure modes, and scope limitations. - **Version** with immutable parameter artifacts under MLflow. - **Challenger** model [@sr117], typically a refreshed EDF map or a competitor reduced-form model, running in shadow mode. ONNX export is less relevant here than in ML pipelines because the Merton-KMV formula is a closed-form computation rather than a learned function. What does matter is numerical reproducibility: the same equity input on the same day should produce bit-identical EDF regardless of the compute node, which requires pinned NumPy/SciPy versions and deterministic root-finding tolerances. The rest of this section walks through a deployable reference implementation. The full source is shipped with this book under [book/code/merton_kmv/](../code/merton_kmv/) (the estimation library) and [book/deployment/merton_kmv_app.py](../deployment/merton_kmv_app.py) (the FastAPI service). The chapter chunks below import from those modules and exercise each layer end to end on a synthetic Merton-consistent panel, so a reader can clone the repo, swap the synthetic feed for a real one, and have a working pipeline. ### Estimation layer: the production solver The chapter's pedagogical solver in @sec-ch08-kmv calls `brentq` once per observation per outer iteration. A production solver replaces the inner `brentq` with vectorised log-Newton on $V$, falls back to `brentq` only on rows that fail the monotonicity guard, and returns full diagnostics so monitoring can read iteration count, residual, damping, and fall-back use without re-running the solve. The interface lives in [solver.py](../code/merton_kmv/solver.py). The frozen-dataclass config is the single place where every numerical knob is set; `MertonKMVConfig()` reproduces the Vassalou-Xing (2004) reference.
Pinning NumPy and SciPy versions plus this config is what gives the bit-identical reproducibility the prose promised. ### Feeds and per-firm orchestration The feed adapter is intentionally schema-first: the rest of the pipeline only sees a long-form panel `(firm_id, date, equity, sector)` and a per-firm debt scalar. Switching from the synthetic generator below to a Bloomberg or Refinitiv adapter is a one-class change in [feeds.py](../code/merton_kmv/feeds.py). The orchestrator in [pipeline.py](../code/merton_kmv/pipeline.py) is a `joblib.Parallel` over firms, with per-firm error containment so a single bad ticker cannot poison the batch. `run_panel` returns two frames: the EDF panel that goes to the serving store, and a parallel diagnostics frame that goes to monitoring. Keeping them separate is what lets the FastAPI service stay read-only on the EDF store while the monitoring stack alerts on the diagnostics frame independently. ### End-to-end run on a synthetic Merton panel The chunk below runs the whole pipeline. It builds a 60-firm Merton-consistent synthetic panel, runs the parallel solver, and prints the EDF distribution by sector together with convergence diagnostics. The recovered $\sigma_V$ is concentrated near the sector ground truth (Utility 0.18, Industrial 0.28, Financial 0.18, Tech 0.45). Convergence is reached on every firm in roughly ten outer iterations, no fall-back to brentq is triggered, and no firm errors out. ### DD-to-PD calibration The chapter introduced two PD maps: the closed-form Merton tail $\Phi(-\text{DD})$ and an empirical isotonic curve. The isotonic version is what production EDF systems use because the diffusion-only Merton tail under-states short-horizon PD. The next chunk fits the isotonic map on a synthetic firm-year sample and compares both calibrations on the panel. The Merton-tail and isotonic columns rank firms identically (DD is the only input) but assign different absolute PD levels. 
Production EDF substitutes the isotonic curve at the last step. ### Serving layer: the FastAPI endpoint [merton_kmv_app.py](../deployment/merton_kmv_app.py) is the read-only service the bank's downstream systems call. The route signature mirrors the deployment prose above, and the model card from [model_card.py](../code/merton_kmv/model_card.py) is exposed under `/version` so audit can pull the same artefact the engineers see. The next chunk persists the EDF panel from the previous run to a Parquet artefact, points the FastAPI app at it, and exercises both endpoints in-process via `fastapi.testclient.TestClient`. This is the same path a CI smoke test would take. The same endpoint is what the wholesale RAROC engine and the limits system call in production. Replacing the demo Parquet artefact with the daily batch output and pointing `EDF_PATH` at the live store is the only change needed to deploy. ### Model management wrapper The model-management bullets above are operationalised by [model_card.py](../code/merton_kmv/model_card.py), which renders a markdown card from a dataclass. The card lists intended use, out-of-scope populations, known failure modes, and the challenger candidates, and it is what the SR 11-7 packet attaches. ### Monitoring and drift A Merton-KMV pipeline can fail in subtle ways that a simple "has the EDF number changed?" alert does not catch. The failure modes worth monitoring explicitly: **Asset-volatility drift.** $\sigma_V$ should be stable for established firms. If a firm's recovered $\sigma_V$ jumps by more than a few percent in a week without an obvious corporate event, the solver may have found a spurious fixed point. The standard remedy is to monitor rolling 90-day $\sigma_V$ and flag outliers. **Convergence statistics.** Every KMV run should log the number of iterations to convergence, the final residual, and the maximum damping factor used. 
A pipeline whose mean iteration count suddenly rises is usually hitting a numerical boundary, often because a new firm ticker has highly leveraged capital structure. **PD-to-spread reconciliation.** For firms with liquid bonds, the implied PD from the KMV model and the bond market should be rank-correlated at 0.7 or higher. A breakdown in this correlation, for example the KMV PDs fall while bond spreads widen, is a leading indicator that something is wrong, either in the pipeline or in the data feeds. **Back-testing.** Annual back-tests compare realized one-year default rates to the beginning-of-year EDF forecast. The Hosmer-Lemeshow test or the Binomial test by PD bucket give a disciplined way to measure miscalibration. **Sector drift.** Industry sectors have structurally different asset volatilities, drift rates, and leverage norms. A pipeline that ignores sector effects will over-estimate PD for utilities (stable, high leverage, low volatility) and under-estimate PD for tech (volatile, low leverage, high equity returns). A sector-level recalibration layer on top of the raw KMV EDF closes this gap. The five monitors are implemented in [monitoring.py](../code/merton_kmv/monitoring.py). The next chunk runs every monitor on the synthetic batch so the reader can see exactly what each one returns; in production these are scheduled jobs that write to a monitoring store and alert on threshold breaches. The five outputs are exactly what an operations dashboard plots. A breach in any of them, a spike in sigma drift alerts, a Hosmer-Lemeshow $p$-value below 0.01, a Binomial-test bucket with $p < 0.01$, a PD-spread rank correlation that drops below 0.7, or a sector recalibration shift larger than one notch, triggers a model-monitoring ticket and a rerun against the prior day's artefact for diff inspection. ## Regulatory considerations Structural models sit awkwardly in the regulatory framework. 
They are neither pure statistical models in the @sr117 sense nor pure accounting frameworks in the IFRS 9 [@ifrs9] sense. The practical regulatory touchpoints are the following. **SR 11-7 model risk management.** A Merton-KMV pipeline is unambiguously a model under @sr117. It requires documented conceptual soundness (the Black-Scholes derivation), ongoing monitoring (DD drift, parameter stability), effective challenge (alternative structural or reduced-form models), and outcomes analysis (realized defaults vs predicted EDF). The iterative solver's convergence properties must themselves be part of the validation because a non-converged $\sigma_V$ produces a silently wrong DD. **Basel II/III IRB wholesale.** Wholesale PD under @basel2006international must be estimated on a through-the-cycle basis with a minimum floor. KMV EDF is point-in-time and must be smoothed or cycle-adjusted before it enters the IRB risk-weight function. The Basel formula for wholesale risk-weighted assets [@basel2005irb] is the Vasicek one-factor model [@vasicek2002distribution; @gordy2003risk], which is itself structural in spirit: it uses a latent asset-return factor to drive correlation across firms. **IFRS 9 ECL.** Under @ifrs9, wholesale lifetime ECL requires forward-looking PDs conditional on macro scenarios. A Merton-DD pipeline with macro overlays (unemployment, GDP, term spread) on the drift or volatility can produce scenario-conditional EDFs that satisfy IFRS 9's "reasonable and supportable" requirement. **Capital floors and rating benchmarks.** US FDIC and Fed examiners routinely compare IRB PDs to Moody's KMV EDF as an external benchmark. A material deviation (say, more than one notch) triggers a question in the exam. Banks that use KMV EDF as the input face a different question: does the internal cycle adjustment move the TTC PD within a reasonable band? 
**Fairness.** Wholesale corporate lending is largely outside the ECOA/FCRA fair lending perimeter, which targets consumer credit. Corporate structural models are not regulated under @bartlett2022consumer or the CFPB's anti-discrimination guidance. The EU AI Act may reach corporate-credit AI systems if classified as high-risk, but structural models based on closed-form option pricing are not what the Act's "algorithmic decision system" language is targeting. **BCBS 239 data lineage.** A Merton-KMV pipeline must document where equity price came from, how debt face value was mapped from Compustat fields, and how missing data was handled, because @bcbs239 requires auditable lineage for any capital-relevant input. ## Vietnam and emerging markets ### Market context Vietnamese corporate credit is a bank-funded market with a thin public equity spine. HOSE (Ho Chi Minh Stock Exchange), HNX (Hanoi), and UPCoM together carry approximately 1,600 listed or registered names, dominated by banks, real estate, and a few large manufacturers. Free float at a median listing is well under 30 percent and bid-ask spreads widen sharply outside the VN30 basket [@worldbank2022vietnamfinance]. Foreign-ownership caps and state shareholding produce a further wedge between market capitalization and economic equity. The private SME universe, which carries most of the credit exposure supervised by the State Bank of Vietnam under Circular 11/2021 [@sbv2021circular11], has no traded equity. For these firms, audited statements are filed late, tax filings are the alternative data, and CIC provides the cross-bank picture of outstanding balances and arrears [@cicvn2023report]. Fixed-income markets are bank-heavy, with a corporate bond market concentrated in real estate and infrastructure, which limits the CDS-implied PD workaround available in the US [@imf2023vietnamart4].
Decree 13/2023/ND-CP governs personal data but corporate credit files are outside its main perimeter, although beneficial-owner data falls inside [@govvn2023decree13]. ADB country surveys document the slow pace of private-sector credit deepening outside the banking channel [@adb2022vnfin]. Macro volatility is the elephant in the room. Vietnamese bank lending responds to uncertainty shocks with roughly twice the elasticity of developed-market benchmarks. Policy-driven property cycles (the 2022 bond-market freeze, the 2012 NPL episode) generated step changes in asset volatility that are easy to miss in a rolling-window KMV calibration. ### Application considerations Merton-KMV on the Vietnamese equity market works only on VN30 and a few large mid-caps. For these, two adjustments should be considered. First, the equity volatility input must be cleaned of event-driven gaps (ex-dividend shocks, trading-halt resumptions, foreign-ownership threshold hits) that a mechanical GARCH would treat as diffusion. Second, the debt face value from financial statements should be augmented with off-balance-sheet guarantees and intra-group payables, which are common in Vietnamese conglomerate structures and which a naive total-liabilities pull will miss. For the non-listed majority, pure Merton does not apply. Two realistic hybrids exist. Altman Z'' (@sec-ch06) with coefficients refit on Vietnamese defaults is the best pure-accounting anchor. A structural-lite alternative uses asset-return proxies built from peer-listed volatility plus firm-level accounting ratios to approximate $\sigma_V$. [@chava2011modeling]-style loss models can then combine the pseudo-DD with bureau-based indicators. CIC's own group rating, though coarse, is a useful prior. 
The reduced-form pathway via Jarrow-Turnbull requires a hazard input that is typically borrowed from pooled logistic or survival models fit on Vietnamese banking-book defaults, not from CDS spreads, because corporate CDS on Vietnamese names are rare outside a handful of sovereign-linked issuers. Through-the-cycle versus point-in-time. SBV expects IFRS 9 alignment for the largest banks under Circular 13/2018/TT-NHNN technical guidance on internal control [@sbv_circular13_2018]. A point-in-time Merton PD is too volatile for the Stage 2 trigger logic; supervisors prefer a smoothed PD with a macro overlay. The right engineering answer is a two-stage model: an EDF-style PD for MIS and a smoothed TTC PD for capital and provisioning, with a documented mapping between the two. ### Rationalization Merton fits Vietnam only for VN30-style large listings. It does not fit the private SME book, which is where most supervised credit risk lives. Practitioners should use Merton as one of several inputs in a hybrid stack rather than as the primary PD for wholesale. The structural intuition, that default is a threshold event driven by asset volatility, survives in a useful diagnostic form: distance-to-default and its trend tell a credit committee the same story that a rating migration tells, and the story is harder to game than an accounting ratio. In an emerging-market context the same intuition is why BIS EM staff find KMV-style inputs useful for early-warning analytics even when the PD map requires major recalibration [@bis2020em]. ### Practical notes Datasets. Use the HOSE/HNX daily equity panel from SSC (State Securities Commission) archives, merged with annual audited financials filed via the two exchanges. DataCore's corporate default database is the standard private source for Vietnamese defaults. Compustat does not cover Vietnamese privates. Regulator touchpoints. 
SBV on-site teams reviewing an IRB-aspirant model will check that the DD calibration is grounded in Vietnamese defaults, not imported from Moody's KMV global tables, and that the debt face-value mapping has been reviewed by internal audit under BCBS 239 lineage requirements [@basel2017finalising]. IMF Article IV consultations and World Bank FSAP reports provide the macro-scenario inputs that a forward-looking PD layer will need [@imf2023vietnamart4; @worldbank2022vietnamfinance]. Operational hygiene. Structural-model outputs should be produced daily for VN30 names and reviewed weekly by the corporate credit desk alongside CIC migration data. Equity volatility estimates should use an asymmetric model (GJR-GARCH) to pick up the leverage effect that matters around corporate-event news. Asset-volatility estimates should be smoothed with a prior drawn from sector peers because single-name inversion is noisy on thin-float listings. IFC MSME data and ADB Viet Nam banking reports are useful anchors for base-rate sanity checks on the non-listed extension [@ifc2019vnmsme; @adb2022vnfin]. Finally, stress testing under SBV Circular 13/2018/TT-NHNN expects scenario-conditional PDs [@sbv_circular13_2018], and a Merton-style model with macro-overlaid drift and volatility is well placed to produce them, provided the overlay is documented and the base calibration is local. ### Code: a Vietnam-specific deployment in action The five Vietnam-specific deviations called out above (Tet calendar, event-day winsorisation, off-balance-sheet debt augmentation, sector parameters anchored to VN30, PIT-to-TTC overlay) are implemented in [vietnam.py](../code/merton_kmv/vietnam.py) and compose with the production solver and orchestrator from @sec-ch08-deployment. 
The synthetic generator produces a VN30-style panel with five sector buckets (Banks, RealEstate, Utilities_SOE, Industrials, Consumer), a macro-shock window that mimics the 2022 corporate-bond freeze, and one ex-dividend and one trading-halt event per firm so the cleaner can be exercised on data that looks like a real HOSE/HNX feed. The `synthetic_vn_panel` returns four frames: equity, debt (both augmented and naive), risk-free, and metadata (per-firm sector, free float, ex-dividend date, trading-halt date, true asset volatility). The trading calendar honours the 2026 Tet closure (16-22 February), so the 252 daily observations in the panel are spread over a longer wall-clock window than a US 252-day window would be. The next chunk runs the production KMV solver on the augmented-debt face value and on the naive `0.5 * LT + ST` face value, so the reader can see what dropping off-balance-sheet guarantees and intra-group payables does to the PD level. The KMV solver is configured with `r = 0.04` (a VN 1y Treasury anchor) and `horizon_days = 245`, which is the actual HOSE/HNX trading-day count after Tet and public holidays. Augmenting the face value with the off-balance-sheet load lifts the median PD across every sector by roughly fifteen to twenty-five percent in relative terms, but the absolute basis-point shift concentrates in the sectors with the heaviest load. RealEstate, which sits at a 25 percent off-balance-sheet load against an already-high base PD, gains several hundred basis points; Banks gain seventy basis points; Industrials, Utilities, and Consumer move by single-digit basis points. This is the gap that BCBS 239 lineage reviews probe for: a model that prices Vietnamese banks and real-estate developers off `DLTT` and `DLC` alone is structurally optimistic. The next chunk runs the volatility cleaner on a single firm to show what the event-day winsorisation does. 
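The repository's `vietnam.py` is not reproduced in this bundle. As a rough stand-in, a minimal sketch of an event-day cleaner could look like the following. It assumes event days are known by their position in the price series, winsorises at a multiple of the median absolute deviation (MAD) around the median, and annualises on a 245-day count. The names `raw_equity_vol`, `clean_equity_vol`, and `simulate_prices`, and the jump sizes in the toy data, are hypothetical.

```python
import math
import random
import statistics


def log_returns(prices):
    """Daily log-returns; element i is the move from day i to day i + 1."""
    return [math.log(prices[i + 1] / prices[i]) for i in range(len(prices) - 1)]


def raw_equity_vol(prices, trading_days=245):
    """Annualised sample volatility with no cleaning."""
    return statistics.stdev(log_returns(prices)) * math.sqrt(trading_days)


def clean_equity_vol(prices, event_days, n_mads=4.0, trading_days=245):
    """Drop known event days, winsorise the rest at n_mads MADs, annualise.

    event_days holds price-series indices (1-based from the first return)
    of ex-dividend or halt-resumption days whose returns are discarded.
    """
    drop = set(event_days)
    kept = [r for day, r in enumerate(log_returns(prices), start=1) if day not in drop]
    med = statistics.median(kept)
    mad = statistics.median([abs(r - med) for r in kept])
    lo, hi = med - n_mads * mad, med + n_mads * mad
    clipped = [min(max(r, lo), hi) for r in kept]   # winsorise, do not delete
    return statistics.stdev(clipped) * math.sqrt(trading_days)


def simulate_prices(n_days=245, daily_sigma=0.02, seed=7):
    """Toy diffusion with one ex-dividend gap and one halt-resumption gap."""
    rng = random.Random(seed)
    prices = [100.0]
    for day in range(1, n_days + 1):
        shock = rng.gauss(0.0, daily_sigma)
        if day == 60:
            shock -= 0.10   # hypothetical ex-dividend drop
        if day == 150:
            shock -= 0.15   # hypothetical halt-resumption drop
        prices.append(prices[-1] * math.exp(shock))
    return prices


if __name__ == "__main__":
    px = simulate_prices()
    print(f"raw sigma_E     = {raw_equity_vol(px):.3f}")
    print(f"cleaned sigma_E = {clean_equity_vol(px, event_days=[60, 150]):.3f}")
```

Winsorising clips extreme returns rather than deleting them, so the sample size feeding the standard deviation changes only by the explicitly dropped event days. A 4-MAD band is fairly tight for Gaussian returns, so a cleaner of this shape will sit slightly below the true diffusion volatility even without events.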
The synthetic injects an ex-dividend day and a halt-resumption day; the cleaner drops both, then winsorises the remaining log-returns at 4 MADs before annualising on the actual VN trading-day count. The raw equity-volatility estimator is biased upward by the two event days; the cleaner drops both and winsorises the rest, producing a tighter $\sigma_E$ that the KMV inversion then translates back to a less-biased $\sigma_V$. The asset volatility itself remains lower than the equity volatility (the BS hedge ratio, equation @eq-sigma-e-vega, multiplies asset vol by $V \Phi(d_1) / E$, which is well above one for a leveraged firm). The PIT-to-TTC overlay applies a credit-cycle multiplier to the point-in-time PD. The next chunk runs the overlay under three regimes: a neutral cycle (`cycle = 1.0`), a loose-credit cycle (`cycle > 1`, PIT under-states tail risk and TTC adjusts up), and a tight-credit cycle (`cycle < 1`, PIT over-states tail risk). The output is what flows downstream into the Stage 2 trigger and the Basel risk-weight calculation. In the loose-credit regime the TTC PD is pushed up (the loose cycle is suppressing observed PIT defaults, so the TTC anchor pulls the PD back toward the long-run average); in the tight-credit regime the TTC PD is pulled down (the cycle is amplifying observed PIT defaults). The smoother is documented in the model card and is what closes the SR 11-7 challenge on point-in-time volatility. The hybrid stack for the unlisted majority (Vietnamese SMEs without traded equity) borrows $\sigma_V$ from listed peers in the same sector, shrunk by a leverage gap. The next chunk simulates a private-firm balance sheet and routes it through `peer_sigma_lite` against the listed VN panel. The borrowed $\sigma_V$ is the structural-lite input that the chapter described: it lets the rest of the pipeline (DD computation, isotonic EDF map, monitoring) run on private-firm balance sheets without an equity feed. 
CIC group ratings can layer on top as a Bayesian prior, exactly as the prose recommended. A practical observation from the run above: Banks and RealEstate dominate the tail of the PD distribution, which is the right qualitative result for a panel that includes a 2022-style macro-shock window. SBV examiners look for exactly this: a model that flags the sectors that drove the last credit event, with the sector-level recalibration knobs documented and the PIT-TTC mapping shown to be model-monitored. ## Takeaways - Structural models tie default to the firm's capital structure through a single elegant identity: equity is a call on assets struck at debt face value. - Distance-to-default, $\text{DD} = [\ln(V/D) + (\mu - \sigma_V^2/2)T] / (\sigma_V \sqrt{T})$, is the workhorse metric; $\Phi(-\text{DD})$ is its theoretical PD and KMV EDF its empirical calibration. - The KMV iterative solver inverts observed equity and equity volatility into latent asset value and asset volatility; the iteration converges rapidly under mild conditions and is closely related to maximum-likelihood for the transformed GBM. - Structural PD is dominated out of sample by hybrid models that add accounting ratios, macro factors, and, for crisis periods, a latent frailty factor. - Reduced-form models bypass the structural mechanism by calibrating a hazard intensity directly; they are indispensable for pricing credit derivatives and for risk-neutral PD extraction from CDS. - For regulatory capital, KMV EDF enters as one input among several, not as the final PD; cycle adjustment and calibration testing are non-negotiable. ## Further reading - @merton1974pricing: the foundational paper. Indispensable. - @black1973pricing: the option-pricing engine underneath. - @vassalou2004default: DD as a priced risk factor in equity returns. - @bharath2008forecasting: naive DD versus full KMV on US data. - @duan1994maximum and @duan2004structural: MLE view of the KMV estimator. 
- @jarrow1995pricing: the canonical reduced-form paper. - @jarrow1997markov and @lando1998cox: rating-migration and Cox-process extensions. - @duffie1999modeling: defaultable bond pricing with default-adjusted discount rates. - @eom2004structural and @huang2012how: structural models and the credit-spread puzzle. - @campbell2008search: the leading hybrid bankruptcy-prediction paper. - @duffie2007multi and @duffie2009frailty: dynamic multi-period hazard with latent frailty. - @shumway2001forecasting and @ohlson1980financial: accounting-based baselines to benchmark against. - @leland1994corporate and @leland1996optimal: endogenous default with strategic debt service. - @sundaresan2013review: review of the Merton framework and its extensions. A correspondent-bank or emerging-market credit team needs the sovereign tier on top of the corporate one. @arellano2008default and @aguiar2006defaultable supply the canonical strategic-default model in which countries default in bad income states; @longstaff2011sovereign decompose the risk premium in sovereign CDS spreads into US-equity and global-volatility components, and @borri2023sovereign extend the analysis with a richer set of global macro factors. These models are not direct PD estimators for sovereigns the way KMV is for corporates, but they pin down the pricing kernel that converts country-level distance-to-default analogues into spread quotes that desks actually trade. ================================================================================ # Source: chapters/09-survival-analysis.qmd ================================================================================ # Survival Analysis and Time-to-Default **Scope: both retail and corporate.** Survival and discrete-time hazard models. Retail vintage analysis (account-level time-to-default) and corporate firm-year hazards (@sec-ch09-shumway, popularized by Shumway 2001) share the same likelihood. 
## Overview {.unnumbered} ### A failure that motivates the chapter {.unnumbered} A logistic regression trained on a 36-month auto-loan vintage at month 6 and scored at month 24 will mis-rank an obligor who defaulted in month 4 the same way it mis-ranks one who was censored in month 4: both are collapsed into a fixed binary label at horizon 6, even though the first obligor exited the risk set through default and the second is still on book with its outcome unobserved. Dropping censored observations biases the bad rate upward; keeping them as zeros biases it downward. Either way the IFRS 9 stage-2 lifetime provision computed off the resulting score is wrong by tens of basis points (the direction depends on which censoring choice you made), and the Basel one-year through-the-cycle PD is mis-calibrated by enough to fail an SR 11-7 effective-challenge benchmark against any model that respects the time axis. The failure is structural: a binary classifier *cannot* represent the joint distribution of (event, time) that the regulator's question is asking about. It is also avoidable: the same data, rescored on a Cox PH or a discrete-time Shumway logit fit on the same loan-month panel, recovers the time-dependent AUC and brings the calibration deviation at 24 months back inside the stage-2 SLA. The rest of the chapter is what that rescoring entails, what it costs, and how to defend it in writing to four regulators. A binary default flag tells you whether a loan went bad. It does not tell you when. In consumer and corporate credit, the when matters at least as much as the whether. A loan that defaults in month 6 bleeds capital differently from a loan that defaults in month 36. An IFRS 9 stage-2 provision [@ifrs9] depends on the lifetime distribution of default, not on a point prediction. A Basel IRB model [@basel2006international] must deliver a through-the-cycle probability of default at a one-year horizon, plus term-structure inputs for stress tests [@bellotti2013forecasting].
The problem is intrinsically temporal, and treating it as classification throws away the most useful piece of the data: the time axis. Survival analysis is the right tool. It was built in biostatistics [@kaplan1958nonparametric; @cox1972regression; @aalen1978nonparametric] to handle exactly the situation lenders face: the event of interest may not occur during the observation window (censoring), covariates influence the timing of the event (regression on times), and competing events can preempt the one you care about (prepayment terminates a loan without default). Retail credit adopted these methods early [@narain1992survival; @banasik1999not; @stepanova2002survival] and continues to refine them [@bellotti2009credit; @dirick2017time]. ### The chapter's throughline {.unnumbered} Default is a time-to-event problem with six structural assumptions a model can lock in: independence of censoring from the event clock, a parametric (or nonparametric) hazard shape, proportional hazards across covariates, a single absorbing event, no immune fraction, and homogeneity within an observed risk band. This chapter walks the family of estimators that progressively relaxes those assumptions, scores the cost of each relaxation under controlled stress, and lands the surviving roster on a regulator-grade Vietnamese consumer-credit case study where four of the six assumptions are violated at once. ### Three threads, one chapter {.unnumbered} The chapter braids three threads. Knowing which one you are on at any moment is the difference between reading the chapter and being lost in it. - **Thread M (methods).** The genealogy walk from Kaplan-Meier down each branch (Cox, AFT, competing risks, cure, the heterogeneity extensions, Shumway). Every method section opens with the credit question it answers and the limitation of the prior section that motivated it. This is the chapter's spine.
- **Thread P (production).** Every method has a "leave the notebook" companion: the `survival_diagnostics` package (@sec-ch09-defensibility-production), the `discrete_hazard` package (@sec-ch09-shumway-production), the FastAPI scoring service (@sec-ch09-deployment), the MLflow artifact lineage, the Spark-scale fits (@sec-ch09-scalability). Each Thread P interlude opens with one paragraph on why the code needs to leave the notebook. - **Thread C (case).** Two applied case threads do different work. The controlled six-DGP stress benchmark at @sec-ch09-comparison-stress proves the cost sheet at @sec-ch09-comparison-matrix by violating one assumption per world with a known oracle. The Vietnam capstone at @sec-ch09-vietnam-code proves the chapter on a portfolio that triggers four assumption violations at once with no oracle and a regulator watching. ### Reader contract {.unnumbered} Three concrete promises: - *Methods reader.* Every model is implemented twice (from-scratch so the math is visible, and with a reference library: `lifelines`, `scikit-survival`, `statsmodels`). Every section opens with the credit question it answers and the prior-section limitation it relaxes. - *Production reader.* Every method has a Thread P interlude with a versioned package, a schema validator, a FastAPI surface, and an MLflow lineage. The cross-cutting infrastructure is gathered around @sec-ch09-deployment. - *Reviewer reader.* The chapter delivers a cost sheet (@sec-ch09-comparison-matrix), a routing aid (@sec-ch09-comparison-flowchart), an upgrade aid (@sec-ch09-marketing's extension selector), a controlled assumption-violation oracle (@sec-ch09-comparison-stress), and a no-oracle public-file reality check (@sec-ch09-benchmark), all calibrated against a regulator's pre-read. The case for survival models is sharpest in emerging markets. 
Vietnamese consumer loans book with thin CIC histories, cash-flow incomes that flex with Tet, and informal-sector obligors whose default timing concentrates in months 2 to 6 when a seasonal cash buffer runs out. A one-year classification target hides both the seasonal spike and the early-prepayment culture that ends the risk window for a large fraction of the book. The capstone case study at @sec-ch09-vietnam returns to this with Circular 11/2021 default timing, competing-risk prepayment from Tet bonuses, vintage analysis under macro volatility, and Decree 13/2023 data-protection obligations. This chapter develops the machinery, end to end, from nonparametric product-limit estimators (@sec-ch09-km-cox) to parametric accelerated failure time models (@sec-ch09-aft), through competing risks (@sec-ch09-competing), cure mixtures (@sec-ch09-cure), heterogeneity and state dependence (@sec-ch09-marketing), vintage analysis (@sec-ch09-vintage), and the discrete-time hazard formulation (@sec-ch09-shumway) popularized in corporate default by @shumway2001forecasting and @duffie2007multi. ### Model genealogy: what each step up buys you {.unnumbered} Survival is a family of models, not a single estimator. Each member of the family relaxes a structural assumption that an earlier member relied on, and pays for that flexibility somewhere else (more data, more compute, weaker extrapolation, harder identification). @fig-ch09-genealogy is the chapter map. The cost sheet at @sec-ch09-comparison-matrix is the dual: each row is a node on the tree, each column an assumption an arrow into the node relaxed. The routing aid at @sec-ch09-comparison-flowchart compresses both into binary questions a model-risk pre-read answers in five minutes. The stress benchmark at @sec-ch09-comparison-stress drops the whole roster onto six controlled DGPs and turns each cost-sheet entry into a number. A reader can use the map as a decision aid. 
*Need a one-year PD with the strongest discrimination on the file you have?* Walk down to RSF or GBSurv and accept that you cannot extrapolate past the longest training horizon. *Need a lifetime ECL curve to month 60 from a book observed only to month 36?* Walk down the AFT branch and pay with a parametric hazard shape. *Need a CIF that does not double-count prepayments as defaults?* Walk down to Aalen-Johansen, then to Fine-Gray once covariates matter. *Need a covariate effect that flips sign at age 12?* Walk down to TVC or to Shumway with a period basis. *Suspect a long-run immune fraction (revolvers who never default)?* Walk to mixture cure. *Suspect cluster heterogeneity (branches, dealers, originators)?* Walk to frailty Cox, or to latent-class PWE if the heterogeneity is discrete and the hazard shape is unknown. The chapter walks each branch, fits each model both from scratch and with a reference library, and closes at @sec-ch09-comparison with the same roster scored on six DGPs that each break exactly one assumption. ### Notation {.unnumbered} - $T \in (0, \infty)$: time to default, a positive random variable with density $f(t)$ and c.d.f. $F(t)$. - $S(t) = \Pr(T > t) = 1 - F(t)$: survival function. - $h(t) = \lim_{\Delta \downarrow 0} \Pr(t \le T < t+\Delta \mid T \ge t)/\Delta = f(t)/S(t)$: hazard rate. - $H(t) = \int_0^t h(u)du = -\log S(t)$: cumulative hazard. - $C$: right-censoring time, often administrative. We observe $Y = \min(T, C)$ and the event indicator $\delta = \mathbf{1}\{T \le C\}$. When $\delta = 1$ the true default time is seen ($Y = T$); when $\delta = 0$ the loan is censored ($T > C$): it is still alive at cutoff $C$, and the default time is known only to exceed $C$. - $x \in \mathbb{R}^p$: time-fixed covariates (e.g., application attributes). $x(t)$: time-varying (e.g., unemployment rate in month $t$). - $\beta \in \mathbb{R}^p$: regression coefficients in proportional hazards or AFT form. - Vintage $v$: the origination period of a cohort. Age $a$: months since origination. Calendar $c = v + a$.
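The notation above can be exercised numerically. The sketch below assumes a Weibull default time purely for illustration (the shape and scale values are made up, and the function names are hypothetical) and shows how $h$, $H$, $S$, and the observed pair $(Y, \delta)$ relate under an administrative cutoff.

```python
import math
import random


def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_cum_hazard(t, shape, scale):
    """H(t) = integral of h from 0 to t = (t / scale) ** shape."""
    return (t / scale) ** shape


def weibull_survival(t, shape, scale):
    """S(t) = exp(-H(t))."""
    return math.exp(-weibull_cum_hazard(t, shape, scale))


def observe(t_default, c_censor):
    """Map latent (T, C) to the observed pair (Y, delta)."""
    y = min(t_default, c_censor)
    delta = 1 if t_default <= c_censor else 0
    return y, delta


if __name__ == "__main__":
    shape, scale = 1.3, 80.0          # illustrative values, in months
    pd_12m = 1.0 - weibull_survival(12.0, shape, scale)
    pd_36m = 1.0 - weibull_survival(36.0, shape, scale)
    print(f"12-month PD F(12) = {pd_12m:.4f}, 36-month PD F(36) = {pd_36m:.4f}")

    rng = random.Random(0)
    cutoff = 36.0                     # administrative censoring time C
    for _ in range(5):
        # random.weibullvariate takes (scale, shape) in that order
        t_latent = rng.weibullvariate(scale, shape)
        y, delta = observe(t_latent, cutoff)
        print(f"T = {t_latent:6.1f}  ->  Y = {y:5.1f}, delta = {delta}")
```

Rows printed with `delta = 0` carry only the information $T > 36$, which is exactly what the censored term of the likelihood uses.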
## Credit as survival The logistic-regression failure that opened the chapter was a structural mismatch between the question (lifetime distribution of an event time) and the model (one-period probability of a binary label). The next page gives that question its language: a state machine for the loan, a likelihood that respects censoring, and three fundamental functions ($S$, $h$, $H$) that every estimator in the rest of the chapter is a parametrization of. Everything below in this section is data-side: shape of the panel, threats to identification, defensibility diagnostics. Everything from @sec-ch09-km-cox onward is a parametric or nonparametric specification of the hazard. A loan originated in month $v$ with principal $L$ and contractual term $M$ becomes a point in a state diagram. At each month $a = 1, 2, \ldots, M$ the loan is in exactly one of four states: current, delinquent, defaulted, closed (paid off, refinanced, or written off). The transition of interest is current-or-delinquent to defaulted. Call that random transition time $T$. Because the loan matures at month $M$, the event time is right-censored at $C = M$ unless the loan prepays, in which case a competing event removes the loan from the risk set early. This is the canonical survival setup [@cox1972regression; @prentice1978analysis]. @fig-ch09-states draws the state machine: solid arrows are within-loan rolls, the bold arrow into *Defaulted* is the event of interest, *Closed* is the competing event, and reaching age $M$ without either is administrative right-censoring. The three fundamental functions are equivalent descriptions of the same distribution: $$ S(t) = \Pr(T > t) = \exp\{-H(t)\}, \qquad H(t) = \int_0^t h(u) du, \qquad h(t) = -\frac{d}{dt}\log S(t). $$ The hazard is the natural modeling primitive. 
It is local in time (unlike $S$ or $F$, which are cumulative), it is constrained only to be nonnegative (unlike $S$ and $F$, which must also be monotone and stay in $[0, 1]$, or the density $f$, which must integrate to one), and covariates enter it in clean multiplicative or additive form. Credit-risk reports prefer $S(t)$ or the probability of default curve $F(t)$ because provisioning formulas, Basel risk-weight functions [@basel2017finalising], and stress tests quote lifetime or 12-month probabilities. A good modeler specifies $h$ and reports $S$. @fig-ch09-spec-report makes that workflow concrete: pick a parametric hazard, integrate to the cumulative hazard $H$, take $S = \exp(-H)$, and read off the 12-month and lifetime PDs the report consumer actually wants. ### Right censoring and the likelihood Right censoring is the defining feature of survival data. In retail credit, the most common form is administrative: the observation window ends at calendar time $\tau_{\text{end}}$, so a loan originated in month $v$ has follow-up $\tau_{\text{end}} - v$. Loans still current at $\tau_{\text{end}}$ contribute only their realized duration, not their (unobserved) default time. Assume independent censoring: $T \perp C \mid x$. In words, among loans that share the same covariate vector $x$, the ones whose follow-up gets cut short carry no extra information about default timing beyond what their $x$ already says. Equivalently, the censoring mechanism is allowed to depend on $x$ (and on calendar time, since that is the same for everyone) but not on the latent $T$ once $x$ is conditioned on. If the assumption holds, the at-risk set $\mathcal{R}(t) = \{i : Y_i \ge t\}$ is a random sample of the population still at risk at age $t$, and the partial-likelihood and product-limit estimators treat each censored observation as "alive on its last seen day, future unknown" without bias. Is the assumption realistic in retail credit? It is partly enforced by design and partly violated in practice. Three patterns matter: 1.
*Administrative cutoff at* $\tau_{\text{end}}$ is the safe case. The data extraction date is exogenous to any individual loan's risk. Conditional on origination month $v$ and the covariate vector, the censoring time $C = \tau_{\text{end}} - v$ is deterministic, so $T \perp C \mid x, v$ holds by construction. This is why most credit-survival papers simply state "all censoring is administrative" and stop there.[^09-survival-analysis-1] 2. *Prepayment is the dangerous case.* A 36-month auto loan booked at month $v$ with covariates $x$ has a latent default time $T$ drawn from $h(t \mid x)$. At month 18, the borrower's credit improves (a fact not in $x$, unless you instrument refreshed scores), and a competitor offers a lower rate; the borrower refinances, so the loan is closed at $C = 18$ with $\delta_i = 0$. The naive likelihood treats this row as "survived 18 months, future unknown, average risk going forward" via the $S(18 \mid x)$ factor in @eq-liki. But the row was *not* average: it was a future low-risk borrower, removed from the risk set precisely because that information leaked through the refinance offer. Multiply across thousands of similar prepayments. After month 18, the surviving cohort is enriched in high-risk borrowers, the Kaplan-Meier drop rate over each subsequent interval rises, and the estimated baseline hazard $\hat{h}(t)$ for $t > 18$ tilts upward. Lifetime $\hat{F}(M \mid x) = 1 - \hat{S}(M \mid x)$ inherits the bias and the bank over-reserves on a portfolio that, if anything, is healthier than reported. **Fix**: do not call refinance "censoring." Treat it as a competing event with its own cause-specific hazard $h_{\text{prepay}}(t \mid x)$, fit jointly, and use Aalen-Johansen or Fine-Gray for the report (see @sec-ch09-competing). 3. *Lender-initiated closure (line cuts, charge-off short of default, forced refinance) is the intermediate case.* The decision is made by the bank using information about the account that may or may not be in $x$. 
If risk-driver scores, behavior, and macro covariates are all in $x$, conditional independence is plausible; if not, censoring is informative.[^09-survival-analysis-2] [^09-survival-analysis-1]: Even the safe case has corner cases. Suppose the bank truncates the data extract at $\tau_{\text{end}}$ but a separate IT pipeline drops loans that have been "inactive" for three months ahead of extraction. Now $C$ depends on payment behavior, which depends on $T$. The fix is to use the original servicing snapshot, not a cleaned downstream copy. [^09-survival-analysis-2]: Three concrete examples. (a) *Hardship programs* in the 2020 pandemic re-amortized millions of mortgages. The eligibility rule (recent unemployment, payment hardship attestation) used information about the borrower that the application-time $x$ did not contain. Loans that entered hardship were closed in the analytic record at the modification date; they were the ones most likely to default. Treating them as censored biases the default hazard *down*. (b) *Credit-line reductions* on revolving products. The bank cuts the limit on accounts whose utilization is climbing or whose external bureau score has fallen, and the account either pays out or transitions to a different product, ending its observation. Censoring depends on a behavior covariate that is rarely in the application-time $x$. (c) *Dealer recourse on indirect auto loans.* Loans bought with recourse can be sold back to the dealer when the dealer suspects payment trouble; those exits look like prepayments in the servicer's record but track future default better than prepayment does. Independent censoring is *not* fully testable from observed data: $T$ is unobserved precisely when $C$ is observed, so the joint distribution $(T, C)$ is not identified without further assumptions [@tsiatis1975nonidentifiability]. 
What can be done is to gather evidence: - *Compare covariate distributions across censoring causes.* If administratively-censored loans, prepaid loans, and lender-closed loans have visibly different $x$ distributions, conditional independence is more demanding; either widen $x$ or model the cause explicitly. - *Inverse-probability-of-censoring weighting (IPCW).* Fit a model for the censoring hazard $\lambda_C(t \mid x)$, weight each at-risk observation by $1/\hat{S}_C(t \mid x)$, and refit the survival model. Stable estimates under IPCW are evidence that conditional independence on the chosen $x$ is enough; large shifts say the censoring depends on something not in $x$ [@robins1992recovery]. - *Sensitivity / tipping-point analysis.* Assume censored borrowers default at rate $\rho \cdot \hat{h}(t \mid x)$ for $\rho \in [0.5, 2]$ and re-estimate $S$. Report the range. If the 12m PD is stable across the range, the report is robust; if it flips sign on a key decision, escalate. - *Holdout against a clean cohort.* Where possible, fit on a vintage with mostly administrative censoring and compare the implied hazard to a vintage with heavy prepay. Persistent disagreement past what covariates explain is informative-censoring evidence. > $T \perp C \mid x$ is a working assumption that you make defensible by > > \(a\) including the covariates that drive censoring, > > \(b\) modeling prepayment as a competing event rather than independent censoring, and > > \(c\) reporting the IPCW or tipping-point sensitivity alongside the headline survival curve. > > @sec-ch09-defensibility runs all four diagnostics in code on the simulated cohort. 
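The first diagnostic in the list above (comparing covariate distributions across censoring causes) is the simplest to sketch. The example below uses a standardized mean difference (SMD) as the comparison statistic, which is a choice made here for illustration and not necessarily what the chapter's `survival_diagnostics` package does. The column names `exit_cause` and `bureau_score`, the cause labels, and the synthetic data are all hypothetical.

```python
import math
import random
import statistics


def standardized_mean_difference(a, b):
    """(mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2.0)
    if pooled_sd == 0.0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def smd_by_exit_cause(rows, covariate, reference="administrative"):
    """SMD of `covariate` for each non-reference exit cause vs the reference."""
    groups = {}
    for row in rows:
        groups.setdefault(row["exit_cause"], []).append(row[covariate])
    ref_values = groups[reference]
    return {
        cause: standardized_mean_difference(values, ref_values)
        for cause, values in groups.items()
        if cause != reference
    }


def synthetic_rows(n_per_cause=500, seed=11):
    """Toy censored rows: prepayers score higher, lender closures lower."""
    rng = random.Random(seed)
    means = {"administrative": 650.0, "prepaid": 700.0, "lender_closed": 600.0}
    return [
        {"exit_cause": cause, "bureau_score": rng.gauss(mu, 50.0)}
        for cause, mu in means.items()
        for _ in range(n_per_cause)
    ]


if __name__ == "__main__":
    for cause, smd in smd_by_exit_cause(synthetic_rows(), "bureau_score").items():
        print(f"{cause:>14s}: SMD vs administrative = {smd:+.2f}")
```

A common rule of thumb in the covariate-balance literature treats an absolute SMD above roughly 0.1 as worth a second look; that threshold is a convention, not a test. Large values for prepaid or lender-closed exits suggest either widening $x$ or modeling those exits as competing events.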
Then the contribution of observation $i$ to the likelihood is

$$
\begin{aligned}
L_i(\theta) &= f(y_i \mid x_i; \theta)^{\delta_i}\, S(y_i \mid x_i; \theta)^{1-\delta_i} \\
&= \bigl[h(y_i \mid x_i; \theta)\, S(y_i \mid x_i; \theta)\bigr]^{\delta_i}\, S(y_i \mid x_i; \theta)^{1-\delta_i} \\
&= h(y_i \mid x_i; \theta)^{\delta_i}\, S(y_i \mid x_i; \theta)^{\delta_i + (1-\delta_i)} \\
&= h(y_i \mid x_i; \theta)^{\delta_i}\, S(y_i \mid x_i; \theta).
\end{aligned}
$$

The step from line one to line two is the key substitution: $f(t) = h(t)\, S(t)$. This follows immediately from the definition of the hazard, $h(t) = f(t)/S(t)$, just rearranged. Once both observed and censored contributions are written in terms of $h$ and $S$, they share the same survival factor, and because the exponents sum to $\delta_i + (1 - \delta_i) = 1$, the powers of $S$ collapse to a single $S(y_i \mid x_i; \theta)$. The remaining $h^{\delta_i}$ rewards the model only when an event was actually observed ($\delta_i = 1$), and is silent otherwise. This is exactly why the hazard, not the density, is the natural primitive to specify: censored rows contribute through $S$, event rows contribute through $h \cdot S$, and both terms are something the modeler already controls.

Taking logs and using $\log S = -H$, the total log-likelihood is $\ell(\theta) = \sum_i \bigl[\delta_i \log h(y_i \mid x_i; \theta) - H(y_i \mid x_i; \theta)\bigr]$. Every parametric model we will fit in this chapter (Weibull, log-logistic, log-normal, Cox with Breslow baseline, mixture cure) is a special case of @eq-liki. Every likelihood-ratio test, AIC comparison, and Wald statistic derives from it.

A related but distinct pitfall is *left truncation*. Suppose the analytic window opens at calendar time $\tau_{\text{start}}$ and a loan was originated earlier, at $v < \tau_{\text{start}}$. The loan only enters the dataset because it was *still alive* at $\tau_{\text{start}}$, that is, at age $a_0 = \tau_{\text{start}} - v > 0$. What is wrong with treating it as if it had been observed from age 0?
Two things, both about selection. - First, the cohort of "loans alive at $\tau_{\text{start}}$" excludes every loan from the same vintage that already defaulted before $\tau_{\text{start}}$. Pretending the observation started at age 0 puts a survivor in the risk set at every young age $0 \le t < a_0$ where they were *not actually observable*, so $n_k$ in the KM denominator is inflated for early time bins. Early hazards come out biased *downward*. - Second, the at-risk indicator inside the partial likelihood becomes wrong: at event time $t < a_0$, this loan should not be in $\mathcal{R}(t)$ at all, because we would never have seen it had it failed before $\tau_{\text{start}}$. Including it pretends we had information we did not. The fix is *delayed entry*, not deletion. Drop the rows and you discard valid follow-up at ages $a \ge a_0$, throwing away exactly the data the older vintages contribute (and biasing toward young vintages, which themselves bias toward early defaulters). Instead, re-define each row's at-risk window: enter the risk set at age $a_0$, exit at age $a_0 + \text{follow-up}$, with the event indicator unchanged. The Kaplan-Meier and Cox estimators then form $\mathcal{R}(t) = \{i : a_0^{(i)} \le t \le \text{exit}^{(i)}\}$ and the math goes through. The `lifelines` `entry` argument and the counting-process $(\text{start}, \text{stop}, \text{event})$ formulation of @andersen1982cox implement this directly. @sec-ch09-truncation-demo shows the bias and the fix on simulated data. The mirror-image pitfall is *right truncation*. It is structurally distinct from right *censoring* and the two are routinely confused in the credit-risk literature. Right censoring means a loan is alive at the analysis cutoff and we will eventually see whether it defaults; the row is in the dataset, the event time is bounded below. Right truncation means the row is in the dataset *only because* the event has already happened by some calendar bound. 
Three concrete sources in production:

- *Defaulted-only extracts.* The data team hands you a chargeoff table joined to origination, on the grounds that "good loans don't need a default-time field". Every row is a defaulter; the never-defaulted population is silently absent.
- *Reporting-lag truncation in incident data.* Fraud, first-payment-default, or recovery feeds arrive at the warehouse only once a case file is closed. The cohort assembled at calendar time $\tau_{\text{end}}$ contains case $i$ iff $t_{\text{event}}^{(i)} + \ell^{(i)} \le \tau_{\text{end}}$, where $\ell$ is the random reporting lag. Long-lag events for recently-originated loans are not yet visible.
- *Recovery-time studies.* Loss-given-default analyses that retain only loans whose recovery completed by $\tau_{\text{end}}$ truncate exactly the long-lag, low-recovery tail.

Naively fitting Kaplan-Meier on a right-truncated sample biases the survival curve *downward*: short-failing loans are over-represented relative to the full origination cohort and long-failing loans are under-represented, so the implied PD is overstated, most severely at short horizons. The standard fixes invert the time axis and run KM on $\tau - t$ [@lagakos1988nonparametric] or use the @efron1999nonparametric self-consistent NPMLE. In `lifelines` the practical handle is `KaplanMeierFitter.fit_left_truncation_right_censoring` for the symmetric case; for retrospective right-truncation only, the reverse-time KM is a half-page of NumPy. @sec-ch09-right-truncation-demo shows both the bias and the fix on simulated data, and `survival_diagnostics.truncation` ships a production guard that flags when an incoming cohort looks event-only.

### Why not just classification?

A naive approach frames default as a binary outcome: over the horizon $H$, did the borrower default? Fit a logistic regression [@thomas2000survey]. That works when $H$ is fixed and the portfolio composition is stable. It fails in three ways:

1. **Horizons are not fixed**.
IFRS 9 stage-2 uses lifetime. Scenario testing uses 3-year. Pricing uses 5-year. A single logistic cannot produce all three without refitting. 2. **Censoring is ignored**. A loan booked 3 months ago with 33 months to go is treated as a non-default. It gives the same evidence as a loan that survived 36 months. The first is mostly missing. 3. **The time profile is informative**. Early defaults cluster around affordability shocks; late defaults track adverse selection and macro shocks [@duffie2007multi; @bellotti2009credit]. A hazard curve carries that signature. The rest of the chapter shows how to extract it. To make "specify $h$, report $S$" tangible before any data appears, fix a Weibull hazard $h(t \mid x) = (k/\lambda)(t/\lambda)^{k-1} \exp(\beta x)$ with shape $k$, scale $\lambda$, and a single binary covariate $x \in \{0, 1\}$ for a higher-risk segment. The modeler chooses the hazard form and parameters; everything the report consumer sees is derived. The cumulative hazard is $H(t \mid x) = (t/\lambda)^k \exp(\beta x)$, the survival is $S(t \mid x) = \exp\{-H(t \mid x)\}$, and the marginal default probability over horizon $H$ is $F(H \mid x) = 1 - S(H \mid x)$. @fig-ch09-spec-report shows the three curves; the table below it converts to the two numbers a stress test or IFRS 9 stage classifier actually wants. The modeler touched only $k$, $\lambda$, $\beta$. Everything the report shows, the curves and the two PDs, follows from @eq-triplet. Swapping the Weibull for a Cox baseline plus the same $\beta x$ would change the *shape* of $h$, but leave the pipeline (hazard $\to$ $H$ $\to$ $S$ $\to$ horizon PD) identical; that is the payoff of treating the hazard as the primitive. The remaining sections of this chapter populate the *specify* $h$ step with progressively richer estimators, but the *report* $S$ step never changes. 
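The hazard $\to$ $H$ $\to$ $S$ $\to$ horizon PD pipeline described above can be sketched in a few lines. This is an illustrative sketch only: the parameter values `K`, `LAM`, `BETA` and the 12- and 36-month horizons are assumptions chosen for the example, not the values behind @fig-ch09-spec-report.

```python
import math

# Illustrative parameter values (assumptions, not taken from the chapter).
K = 1.5      # Weibull shape k
LAM = 60.0   # Weibull scale lambda, in months
BETA = 0.7   # log hazard ratio for the higher-risk segment (x = 1)


def hazard(t, x, k=K, lam=LAM, beta=BETA):
    """h(t | x) = (k / lam) * (t / lam)**(k - 1) * exp(beta * x)."""
    return (k / lam) * (t / lam) ** (k - 1) * math.exp(beta * x)


def cumulative_hazard(t, x, k=K, lam=LAM, beta=BETA):
    """H(t | x) = (t / lam)**k * exp(beta * x), the integral of h from 0 to t."""
    return (t / lam) ** k * math.exp(beta * x)


def survival(t, x):
    """S(t | x) = exp(-H(t | x))."""
    return math.exp(-cumulative_hazard(t, x))


def horizon_pd(horizon, x):
    """Probability of default by `horizon` months: 1 - S(horizon | x)."""
    return 1.0 - survival(horizon, x)


def report(horizons=(12, 36)):
    """One row per segment with the PD at each reporting horizon."""
    return [
        {"segment": x, **{f"pd_{h}m": horizon_pd(h, x) for h in horizons}}
        for x in (0, 1)
    ]


if __name__ == "__main__":
    for row in report():
        print(row)
```

Only `hazard` encodes a modeling choice; swapping it (and its integral) for another form leaves `survival`, `horizon_pd`, and `report` unchanged.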
### Informative censoring: a numerical demo The earlier walkthrough claimed that treating prepayment as independent censoring biases the survival estimate. @fig-ch09-informative-censoring quantifies the bias on a simulated cohort where a latent risk score $Z$ drives both the default time and the prepayment time, in opposite directions: high $Z$ (bad risks) default early and rarely prepay; low $Z$ (good risks) survive long and prepay early. The naive Kaplan-Meier curve treats prepayments as ordinary censoring; the oracle curve uses the full latent default time. The gap is the bias. The naive lifetime PD comes out larger than the truth: prepay-driven exits removed the good risks early, so the conditional default rate among the survivors is inflated. In a real portfolio you do not have the oracle column; the right move is to recognize prepay as a competing event (@sec-ch09-competing) and report cause-specific or Aalen-Johansen cumulative incidence instead of treating prepay as censoring. ### Defensibility diagnostics: IPCW, tipping-point, and cohort holdout Independence $T \perp C \mid x$ is untestable directly: the joint distribution of $(T, C)$ is not identified from the data we observe. Four diagnostics provide *indirect* evidence by attacking the assumption from different angles. Each answers a distinct sub-question, and a validation pack should report all four: 1. *Cause-cohort overlap* asks whether censored loans look like at-risk loans on the covariates we already have. 2. *IPCW reweighting* asks whether putting the suspect covariate into the censoring model closes the bias. 3. *Tipping-point sensitivity* asks how wrong the assumption would have to be before the headline number flips. 4. *Clean-cohort holdout* asks whether the bias disappears on a parallel vintage where censoring is rare. All four run on the cohort from @sec-ch09-informative-censoring-demo, so the bias in @fig-ch09-informative-censoring and its corrections share one axis. 
The output is the artifact a model-validation pack attaches next to the headline KM curve. #### Diagnostic 1: cause-cohort overlap on covariates **Question.** Do prepaid loans look like administratively-censored loans on the observed covariates? **Intuition.** If censoring is unrelated to risk *conditional on* $x$, then censored and at-risk loans should share the same $x$ distribution within each stratum. The diagnostic is as follows: when prepaid loans cluster at low $Z$ (good risks), while admin-censored loans straddle the full $Z$ range, $x$ is too narrow to absorb the dependence. We do not need to know the truth to see this; we just need the cause-of-exit label. **How to read it.** A Kolmogorov-Smirnov statistic on $Z$ across cause cohorts, plus group means and standard deviations. A large KS distance with a small p-value means censoring is selective on $Z$, which forces a choice: widen $x$ to include $Z$, or move to IPCW with $Z$ in the censoring model. The prepaid pool sits at low $Z$ (good risks), the default pool at high $Z$, and admin censoring straddles both because it conditions only on age. The KS distance between admin and prepay is large and the null of equal $Z$ distributions is rejected: the censoring mechanism *is* selective on $Z$. #### Diagnostic 2: IPCW reweighting **Question.** If we put the suspect covariate into the censoring model, does the bias close? **Intuition.** Every loan that prepays would have continued accruing default-time information had it stayed in the book. IPCW reconstructs that lost information by *upweighting at-risk loans that look like the prepaid ones*, where resemblance is measured through the censoring survival $\hat S_C(Y_i^- \mid x_i)$ from a Cox model fit on the prepay hazard. Each row carries weight $1/\hat S_C$: observations whose covariate-siblings tend to leave early carry more weight, because they are speaking on behalf of the prepayers we no longer observe. 
If the lost information runs along $x$, IPCW recovers it; if it runs along an *unmeasured* driver, IPCW cannot help and the residual gap is evidence of that. **How to read it.** Overlay three KMs: the oracle (latent $T$, no prepay), the naive (treats prepay as independent censoring), and the IPCW-weighted. A closed gap on the IPCW curve is a positive signal but not proof, since IPCW only corrects for marginalization across the modeled covariates. Watch the weights: a max or 99th-percentile weight past 5-10 means a handful of rows do most of the correcting and bootstrap CIs widen accordingly. Production stabilizes the weights (numerator $\hat S_C^{\text{marg}}(t)$ from a covariate-free censoring KM) and caps at the 99th percentile to trade a small bias for a large variance reduction; @robins1992recovery is the IPCW reference. The IPCW curve closes most of the gap on this cohort because the lost information runs along $Z$, which the censoring model captures. A residual gap survives because IPCW corrects for marginalisation, not for unmeasured drivers; if the gap stayed wide after conditioning on every observable, that would be evidence of unmeasured informative censoring and a job for Diagnostic 3. #### Diagnostic 3: tipping-point sensitivity **Question.** How wrong would the censoring assumption have to be before the headline number flips? **Intuition.** IPCW asks "given $x$, what is the right answer?" Tipping-point asks the *dual*: ignore $x$ and ask how much the prepaid rows' true default hazard would have to differ from the at-risk pool's hazard for the lifetime PD to cross a policy threshold. Encode the discrepancy as a multiplier $\rho$ on the implied censored-row hazard, and sweep $\rho \in [0.5, 2]$ as a Rosenbaum-style robustness range. 
$\rho = 1$ recovers the naive estimate ("censored rows default at the same rate as the at-risk pool"); $\rho < 1$ says prepayers were better-than-average risks (which is correct for our DGP, since low-$Z$ borrowers prepay early); $\rho > 1$ says they were worse. The lifetime PD at horizon $h$ becomes the observed-event share plus the censored-row contribution $\Pr(T \le h \mid T > Y_i, \rho)$, computed off the naive baseline survival raised to $\rho$. **How to read it.** Plot lifetime PD as a function of $\rho$, mark the oracle, and report the $\rho$ at which the headline crosses any decision threshold the model feeds into. The width of the curve over $\rho \in [0.5, 2]$ is the *defensible* uncertainty around the point estimate, and a risk report should disclose it next to the headline. #### Diagnostic 4: clean-cohort holdout **Question.** When prepay is rare, does the bias disappear? **Intuition.** Find or construct a parallel vintage where censoring is sparse, a "clean cohort". In production, this might be an early-vintage book that closed before the rate-driven refinance wave, or a portfolio segment whose contracts forbid prepayment, or a synthetic counterfactual cohort generated under the same DGP with prepay suppressed (which is what we do here). Fit the *same* naive KM on the clean cohort and compare its lifetime PD against the prepay-heavy fit. The logic is a difference-in-differences over the censoring channel: if the clean-cohort PD lines up with the oracle but the prepay-heavy PD does not, censoring was the confound and IPCW ([Diagnostic 2](#sec-ch09-defensibility-ipcw)) is the right tool. If the clean cohort *also* misses the oracle, an unmeasured driver is in play and IPCW will not save you; that is the case for richer covariates or a structural model. **How to read it.** Print prepay share on each cohort, lifetime PD on each, and the clean-vs-oracle gap. - Small gap = censoring was the main confound. 
- Large gap = look elsewhere (covariate set, model form, or unmeasured exposure). #### Persisted artifact The four diagnostics serialize to one JSON blob that travels with the headline survival fit through the validation pack: Four numbers reach the validation pack: the 12m PD under naive vs IPCW, the lifetime PD range across $\rho \in [0.5, 2]$, the clean-cohort lifetime PD, and the KS distance on $Z$ across cause cohorts. No single number is dispositive: the naive-vs-IPCW gap detects mis-specification of $x$, the tipping range bounds decision robustness, the clean-cohort vintage probes for confounding the model never sees, and the KS column triggers all three when it is large. A model card that reports only the headline survival curve has not earned the right to call its censoring independent. ### From script to production: the `survival_diagnostics` package The scratch block above is the right shape for a chapter, but the validation cycle is not "run a notebook once." A bank pulls a fresh cohort every quarter, refits the headline survival model, and needs the four diagnostics rebuilt without rewriting any of them. The package `book/code/survival_diagnostics/` factors the same logic into versioned modules and exposes a single entry point `run_diagnostics(cohort, config)` that returns a JSON-serializable artifact suitable for the SR 11-7 / IFRS 9 model-validation pack. A FastAPI wrapper at `book/deployment/survival_diagnostics_app.py` serves the artifact on demand. The package layout mirrors the four diagnostics one-to-one: `overlap.py` runs the cause-cohort KS plus standardized mean differences, `ipcw.py` fits the censoring Cox with stabilized and capped weights, `tipping.py` runs the $\rho$ sweep, `holdout.py` compares the clean and prepay-heavy cohorts, and `competing.py` adds Aalen-Johansen cumulative incidence and a Fine-Gray fit under the Geskus reduction. 
`pipeline.py` orchestrates them, traps per-step failures into an `errors` block rather than failing the whole artifact, and serializes everything through `DiagnosticsArtifact.to_json()`.

The same synthetic cohort that drove the scratch block, but routed through the production entry point:

The values reproduce the scratch block to two decimals: the IPCW correction closes most of the naive-vs-oracle gap, the tipping band brackets the lifetime PD over the conventional $\rho \in [0.5, 2]$ range, the clean-cohort vintage sits close to the full cohort because the simulated DGP does not have unmeasured confounders, and the cause-overlap test fires because $Z$ does discriminate prepay from default by construction. The Fine-Gray fit returns a default-cause subdistribution coefficient on $Z$ that an IFRS 9 stage-2 lifetime PD curve would consume directly.

The FastAPI service is the contract between this package and a downstream validation system. A `POST /diagnostics/run` with a vintage tag, a covariate list, and an optional clean-cohort query string runs the same `run_diagnostics` call against a cohort Parquet at `$SD_COHORT_ROOT/.parquet`, persists the artifact at `$SD_ARTIFACT_ROOT/.json`, and returns a summary block. `GET /diagnostics/` and `GET /diagnostics//card` serve the persisted artifact and the auto-generated model card. Two operational notes:

- The Cox censoring fit is the slow step. For vintages above \~200k loans, batch the diagnostics in Airflow / Dagster overnight and let the API serve cached artifacts; ad-hoc reruns then fall back to the on-demand path for slices that fit in seconds.
- The `errors` field is non-empty when one diagnostic fails (too few prepay events, positivity violations on a sub-cohort, sksurv's competing-risks routine refusing a degenerate cause vector). The pipeline records the error and returns the rest of the artifact: silence in a validation pack is worse than a partial result with an explicit failure mode.
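The "record the error, return the rest" behavior can be illustrated with a generic pattern. The sketch below is not the `survival_diagnostics` source: `run_steps`, the two toy steps, and the artifact keys are hypothetical names invented for the example, and the five-event threshold is an arbitrary assumption.

```python
import json

# Toy cohort (hypothetical): one latent-risk value and an exit cause per loan.
COHORT = [
    {"z": -1.0, "cause": "prepay"},
    {"z": -0.5, "cause": "prepay"},
    {"z": -0.8, "cause": "prepay"},
    {"z": 0.1, "cause": "admin"},
    {"z": 0.4, "cause": "admin"},
    {"z": 1.2, "cause": "default"},
]


def overlap_step(cohort):
    """Mean difference in z between prepaid and admin-censored loans."""
    prepay = [r["z"] for r in cohort if r["cause"] == "prepay"]
    admin = [r["z"] for r in cohort if r["cause"] == "admin"]
    return sum(prepay) / len(prepay) - sum(admin) / len(admin)


def ipcw_step(cohort, min_events=5):
    """Placeholder that refuses to fit a censoring model on too few events."""
    n_prepay = sum(r["cause"] == "prepay" for r in cohort)
    if n_prepay < min_events:
        raise ValueError(f"only {n_prepay} prepay events; need {min_events}")
    return {"n_prepay": n_prepay}


STEPS = {"overlap": overlap_step, "ipcw": ipcw_step}


def run_steps(cohort, steps):
    """Run every step; a failure is recorded under 'errors' instead of raised."""
    artifact = {"results": {}, "errors": {}}
    for name, step in steps.items():
        try:
            artifact["results"][name] = step(cohort)
        except Exception as exc:  # deliberately broad: one step must not sink the rest
            artifact["errors"][name] = f"{type(exc).__name__}: {exc}"
    return artifact


if __name__ == "__main__":
    print(json.dumps(run_steps(COHORT, STEPS), indent=2))
```

On this toy cohort the overlap step succeeds and the IPCW placeholder fails, so the artifact carries one result and one explicit error rather than nothing.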
The package and the chapter block compute the same numbers off the same logic. The difference is reproducibility: the package is unit-testable, versionable through `__init__.py`, and the artifact JSON sits next to the headline KM in the validation pack with a SHA on the cohort file as provenance. ### Left truncation: a numerical demo @fig-ch09-truncation makes the selection issue concrete. A single Weibull cohort is generated and three KM curves are compared: (i) the oracle, observing every loan from origination; (ii) a left-truncated dataset where loans only enter when they are still alive at calendar window open ($\tau_{\text{start}}$), fit *naively* as if all observations started at age 0; and (iii) the same truncated dataset fit with delayed entry. Curves (i) and (iii) overlap. Curve (ii) lies above the oracle across the entire age axis: the gap *forms* over the first $\sim 10$ months (while truncation excludes early defaulters proportionally more than late ones, depressing the observed hazard) and then *persists* at older ages because KM is multiplicative and the early under-counting compounds into every later interval. The naive PD sits below the truth at both horizons. Two readings of the same gap matter for different audiences. In *absolute* PD, the bias grows with horizon (0.024 at 6m, 0.065 at 24m) because the early hazard deficit propagates multiplicatively, so risk reports keyed off lifetime PD are most distorted at long horizons. In *relative* PD, the bias is largest at the youngest ages (81% of truth at 6m, 37% at 24m) because the truth itself is small there: the truncation removes proportionally more of the early defaulters, and a small absolute deficit is a large fraction of a small denominator. Both readings vanish under the entry-corrected fit, which sits within Monte Carlo noise of the oracle at every horizon. 
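For readers who want to see the mechanics without a survival library, the following is a minimal pure-Python sketch of the three fits. It is not the code behind @fig-ch09-truncation: the Weibull parameters, the origination window, the cutoff, and the helper names `km_survival_at` and `simulate` are assumptions made for the example, so the printed PDs will differ from the figures quoted above.

```python
import random


def km_survival_at(horizon, exit_age, event, entry_age=None):
    """Kaplan-Meier S(horizon); rows join the risk set at entry_age (default 0)."""
    if entry_age is None:
        entry_age = [0.0] * len(exit_age)
    # Tag entries 0 and exits 1 so an entry sorts ahead of an exit at the same age.
    timeline = [(a, 0, 0) for a in entry_age]
    timeline += [(t, 1, d) for t, d in zip(exit_age, event)]
    timeline.sort()
    at_risk, surv = 0, 1.0
    for age, kind, d in timeline:
        if age > horizon:
            break
        if kind == 0:
            at_risk += 1
        else:
            if d:
                surv *= 1.0 - 1.0 / at_risk  # one event out of the current risk set
            at_risk -= 1
    return surv


def simulate(n=20000, seed=0, scale=40.0, shape=1.2, tau_end=36.0):
    """Loans originate between month -24 and +12; the window opens at month 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(-24.0, 12.0)           # origination date
        t = rng.weibullvariate(scale, shape)   # latent default age (scale, shape)
        a0 = max(0.0, -v)                      # age at window open
        c = tau_end - v                        # age at the analysis cutoff
        # (entry age, exit age, event flag, visible in the truncated extract)
        rows.append((a0, min(t, c), int(t <= c), t > a0))
    return rows


rows = simulate()
seen = [r for r in rows if r[3]]  # survivors to window open, plus new loans

oracle = km_survival_at(24.0, [r[1] for r in rows], [r[2] for r in rows])
naive = km_survival_at(24.0, [r[1] for r in seen], [r[2] for r in seen])
fixed = km_survival_at(
    24.0, [r[1] for r in seen], [r[2] for r in seen], [r[0] for r in seen]
)

if __name__ == "__main__":
    print(f"24m PD  oracle={1 - oracle:.3f}  naive={1 - naive:.3f}  "
          f"delayed-entry={1 - fixed:.3f}")
```

The only difference between the naive and corrected calls is the fourth argument, which mirrors the single `entry` column the text describes.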
The same correction extends to Cox: pass an `entry` column (or use the start/stop counting-process layout) and the partial-likelihood risk set $\mathcal{R}(t)$ is built from $\{i : a_0^{(i)} \le t \le \text{exit}^{(i)}\}$ instead of $\{i : \text{exit}^{(i)} \ge t\}$. Both fixes cost a single column in the input frame. ### Right truncation: a numerical demo Right truncation has a different fingerprint and a different fix. We simulate the *defaulted-only extract* case: a Weibull cohort is generated from origination, the analysis cutoff is $\tau_{\text{end}}$ months after the earliest origination, and we keep only the loans that have already defaulted by the cutoff. The pretend-it-is-complete sample is what arrives in the warehouse when a chargeoff team hands you "the default file" without the at-risk denominator. A clarification on what is identifiable. With right truncation alone, the data identify the *conditional* event-time distribution on the observed support $[0, t^*]$ where $t^* = \max_i R_i$ and $R_i = \tau_{\text{end}} - v_i$ is the per-row truncation bound, that is, $F_T(t)/F_T(t^*)$. The marginal $F_T$ on the full support is unidentifiable from the truncated sample alone; recovering it requires either an external estimate of $F_T(t^*)$ (e.g. a known portfolio default rate) or a parametric tail. The simulation below is calibrated so $F_T(t^*) \approx 1$, which lets us read the conditional and unconditional CDFs as essentially the same number; the production code reports the conditional CDF and flags whenever $t^*$ is materially below the credit-policy horizon. @fig-ch09-right-truncation overlays three curves. (i) The oracle KM, fit on the full origination cohort with administrative right-censoring at $\tau_{\text{end}}$, is the truth we are trying to recover. 
(ii) The naive KM, fit on the defaulted-only subsample as if it were complete, is biased: every observation is an event, so the estimator collapses to the empirical CDF of $\{T_i \mid T_i \le R_i\}$, which over-represents short failure times. (iii) The reverse-time delayed-entry KM applies the @lagakos1988nonparametric construction: with $X_i = t^* - T_i$ and $B_i = t^* - R_i$, the right-truncation constraint $T_i \le R_i$ becomes the left-truncation constraint $B_i \le X_i$, and forward-time delayed-entry KM on $(B_i, X_i)$ with all-event indicator gives $\widehat F_T(t)/\widehat F_T(t^*) = \widehat S_X(t^* - t)$. Curves (i) and (iii) overlap to within Monte Carlo noise; curve (ii) does not. Three things to read off the printed table: - First, the naive estimator overstates PD at every horizon: the defaulted-only sample is dominated by short failure times, so the empirical CDF climbs too fast. - Second, the bias is largest at the youngest ages and shrinks with $h$, because by $h \approx t^*$ the naive empirical CDF is forced to one (every retained row defaulted by then) regardless of cohort. - Third, the reverse-time delayed-entry KM matches the oracle to within tens of basis points across two horizons, which is the practical demonstration that the fix is the right one. Lifelines' `KaplanMeierFitter.fit_left_truncation_right_censoring` covers the symmetric case where both biases are present at once. The production lesson is that the *first* check on any incoming cohort should be whether the event indicator is degenerate. If `event.mean() == 1` the cohort is event-only and a right-truncation correction is mandatory; if `event.mean() < 0.001` the cohort may have lost the defaulter join, which is the mirror failure mode and equally damaging. 
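The degenerate-indicator check and the reverse-time construction can both be sketched in pure Python. This is an illustration under stated assumptions, not the `survival_diagnostics.truncation` source: the function names, the flag strings, the Weibull parameters, and the uniform truncation bounds are all hypothetical, and the target is the conditional CDF $F_T(t)/F_T(t^*)$ only.

```python
import math
import random


def event_indicator_flag(events, low=0.001):
    """First-pass guard on the event share of an incoming cohort."""
    share = sum(events) / len(events)
    if share == 1.0:
        return "event_only"       # right-truncation correction needed
    if share < low:
        return "events_missing"   # defaulter join may have been lost
    return "ok"


def reverse_time_cdf(t, event_age, bound):
    """Estimate F_T(t) / F_T(t*) from rows observed only when T_i <= R_i."""
    t_star = max(bound)
    # Reverse the clock: X = t* - T is left-truncated at B = t* - R.
    timeline = [(t_star - r, 0) for r in bound]        # entries at B_i
    timeline += [(t_star - x, 1) for x in event_age]   # exits at X_i, all events
    timeline.sort()                                    # entries sort before tied exits
    at_risk, surv = 0, 1.0
    for x, kind in timeline:
        if x > t_star - t:
            break
        if kind == 0:
            at_risk += 1
        else:
            surv *= 1.0 - 1.0 / at_risk
            at_risk -= 1
    return surv, t_star                                # surv = S_X(t* - t)


SCALE, SHAPE, TAU_END = 20.0, 1.5, 60.0  # assumed simulation settings
rng = random.Random(1)
sample = []
for _ in range(20000):
    r = rng.uniform(0.0, TAU_END)            # R_i: loan age at the cutoff
    t = rng.weibullvariate(SCALE, SHAPE)     # default age (scale, shape)
    if t <= r:                               # defaulted-only extract keeps this row
        sample.append((t, r))

ages = [t for t, _ in sample]
bounds = [r for _, r in sample]
flag = event_indicator_flag([1] * len(sample))

h = 12.0
naive = sum(a <= h for a in ages) / len(ages)          # empirical CDF of the extract
corrected, t_star = reverse_time_cdf(h, ages, bounds)


def weibull_cdf(t):
    return 1.0 - math.exp(-((t / SCALE) ** SHAPE))


truth = weibull_cdf(h) / weibull_cdf(t_star)           # conditional CDF on [0, t*]

if __name__ == "__main__":
    print(flag, f"naive={naive:.3f} corrected={corrected:.3f} truth={truth:.3f}")
```

Because the simulated $t^*$ sits far in the Weibull tail, the conditional and marginal CDFs nearly coincide here; with a shorter window only the conditional quantity would be recovered.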
`survival_diagnostics.truncation` wraps both checks, fits the appropriate corrected KM, and emits an artifact field that the validation pipeline blocks on when the corrected and naive lifetime PDs disagree by more than the configured basis-point threshold. ### Truncation diagnostics in production The chapter demos and the production code share a single implementation path. `detect_truncation(duration, event, entry=..., vintage_age_at_cutoff=...)` ingests exactly the columns each correction needs, fits the delayed-entry KM (left truncation) and the reverse-time delayed-entry KM (right truncation) under the hood, and returns a typed result with bias deltas in basis points. The summary table below is the same artifact field the FastAPI service writes into the validation pack JSON. Two points worth restating. The artifact is non-fatal by design: the pipeline records `blocks=True` and stops the validation run, but it preserves the rest of the diagnostic so reviewers see *which* check fired. And the `entry_age_months` and `vintage_age_at_cutoff_months` columns on the FastAPI request body are optional: a cohort assembled from a clean origination snapshot needs neither, but a cohort assembled from a calendar-window snapshot or a chargeoff feed needs at least one, and the model card escalation rule is the audit-side enforcement of that requirement. ## Input data layouts Survival fitters disagree on what their input looks like. The same cohort feeds Kaplan-Meier in lifelines, a Cox fit in scikit-survival, a Shumway logit in statsmodels, and a Fine-Gray Geskus reduction in lifelines, and each one wants a *different* in-memory shape. Most "the package crashed" tickets in production trace to a layout mismatch, not a modeling bug. This section materializes a small synthetic cohort and shows the `head()` of every layout the rest of the chapter uses, with the package and fitter that consumes each one. We use six loans so the printed frames fit on one screen. 
The same construction scales to a real portfolio without changes. Loan 3 enters the risk set six months after origination (the left-truncation case from @sec-ch09-truncation-demo). Loan 2 exits via prepayment, the competing risk in @sec-ch09-competing. Everything else is a vanilla right-censored observation. ### Layout 1: wide per-loan frame One row per loan, with `duration` and `event` columns and any number of fixed-at-origination covariates. This is the layout `lifelines` expects across `KaplanMeierFitter`, `CoxPHFitter`, and the AFT family (`WeibullAFTFitter`, `LogNormalAFTFitter`, `LogLogisticAFTFitter`). Consumers: - `KaplanMeierFitter().fit(wide['duration'], wide['event'])` — see @sec-ch09-km-cox. - `CoxPHFitter().fit(wide.drop(columns='loan_id'), 'duration', 'event')` — see @sec-ch09-km-cox. - `WeibullAFTFitter().fit(wide.drop(columns='loan_id'), 'duration', 'event')` — see @sec-ch09-aft. Add an `entry` column to handle left truncation in lifelines: `KaplanMeierFitter().fit(durations, events, entry=cohort['entry_age'])`. The Cox equivalent in lifelines is `CoxPHFitter().fit(..., entry_col='entry_age')`. Both implementations build the risk set $\mathcal{R}(t) = \{i : a_0^{(i)} \le t \le \text{exit}^{(i)}\}$ from those two columns. ### Layout 2: scikit-survival structured array `scikit-survival` separates the response from the design matrix. The response is a NumPy *structured array* of `(event_bool, time_float)` records; the design is a plain 2-D feature array. Consumers: - `RandomSurvivalForest().fit(X_sksurv, y_sksurv)` — see @sec-ch09-benchmark. - `GradientBoostingSurvivalAnalysis().fit(X_sksurv, y_sksurv)` — see @sec-ch09-benchmark. - `CoxPHSurvivalAnalysis().fit(X_sksurv, y_sksurv)` (the sksurv Cox, distinct from the lifelines one). - Metrics: `concordance_index_censored`, `cumulative_dynamic_auc`, `integrated_brier_score` all read this dtype directly. The dtype convention `[('event', '?'), ('time', '