---
execute:
echo: true
eval: true
warning: false
message: false
bibliography:
- ../references.bib
- ../refs/ch-18.bib
---
# Transaction Data and Open Banking {#sec-ch18}
::: {.callout-note appearance="simple" icon="false"}
**Scope: retail.** Open banking transaction streams (PSD2, CDR, Section 1033) for consumer underwriting: cashflow-based scoring, categorization, and feature stores. Corporate banking aggregation is partially covered in @sec-ch29.
:::
## Overview {.unnumbered}
Open banking changes what a credit file looks like. A traditional bureau record is a 24-row panel of tradelines: a handful of accounts, balance snapshots, delinquency flags, a FICO code. A PSD2-enabled data feed is a 12,000-row panel for the same consumer over the same year: every coffee, every rent payment, every salary credit, every overdraft alert, tagged to a merchant and a category, refreshed overnight. The marginal informational content is not small. @berg2020rise showed that ten crude device-level footprints rival a FICO score. Transaction data strictly dominates those footprints because it carries the cashflow primitives that drive default: income stability, expense structure, discretionary slack, reserve depth.
This chapter is a practitioner's walkthrough of how to turn raw transaction feeds into a production credit model. The pieces are feature engineering that respects time (@sec-ch18-features), an NLP layer for descriptions (@sec-ch18-nlp), an aggregation layer that reconciles accounts across institutions (@sec-ch18-aggregation), and a runtime stack that ingests new transactions before they go stale (@sec-ch18-decay). Each piece has both statistical and engineering content. A cashflow feature is only as good as its latency, and a merchant classifier is only as good as its coverage on the tail of merchants.
The regulatory backdrop is specific. PSD2 in Europe, the FCA's open banking standards in the UK, and the CFPB's 1033 final rule in the US all share the same skeleton: the consumer owns the data, a licensed third party can pull it with consent, the bank has to expose a standardized API, and authentication is strong. The statistical backdrop is general. Cashflow signals decay. Income today is worth more than income eight months ago. The chapter closes with an explicit decay model, because a model that does not respect freshness will overfit in backtest and underperform in production.
Vietnam is following a different path. There is no PSD2-style open-banking statute. Instead, the State Bank of Vietnam has used issue-specific instruments: Decision 2345/QD-NHNN on online-payment authentication [@sbv_decision2345_2023], Circular 16/2020 on electronic KYC [@sbv_circular16_2020], and the NAPAS-anchored interbank switch [@napas2023report] that approximates a common API surface. The Vietnam-and-EM section at the end of this chapter reads this stack next to PSD2 and draws the consequences for scoring.
### Notation {.unnumbered}
Let $i=1,\ldots,N$ index customers and $t=1,\ldots,T$ index days or months. A transaction is a tuple $(i, t, a_{it}^{(k)}, d_{it}^{(k)}, m_{it}^{(k)})$ where $a$ is signed amount, $d$ is a free text description, and $m$ is the merchant or category tag. Let $X_{it} \in \mathbb{R}^{p}$ be the feature vector at time $t$. Let $Y_{i,t+h} \in \{0,1\}$ be default within horizon $h$. The objective is $\Pr(Y_{i,t+h}=1 \mid X_{it})$.
---
## PSD2 and open banking {#sec-ch18-openbank}
The European Second Payment Services Directive (PSD2), transposed into national law by January 2018 with the Regulatory Technical Standards (RTS) live from September 2019, does three things that matter for credit modelers [@eu2015psd2]. It creates a legal category of Third Party Provider (TPP), it mandates that Account Servicing Payment Service Providers (ASPSPs, i.e., banks) expose a dedicated access-to-account (XS2A) interface, and it requires Strong Customer Authentication (SCA) on most payment flows.
### Licensed TPPs and consent
A TPP is authorized as either an Account Information Service Provider (AISP), a Payment Initiation Service Provider (PISP), or both. Credit scoring sits squarely on the AISP side. An AISP obtains consumer consent, calls the ASPSP's XS2A endpoint, and receives a stream of account information: balances, transaction history (typically 24 months), standing orders, direct debits. Consent is time-limited (EBA set 90 days, extended to 180 days in 2022) and explicit per-account. GDPR Article 6(1)(a) provides the legal basis and Article 22 constrains automated decisions, although credit scoring typically qualifies under the contractual-necessity derogation.
The supply side is asymmetric. The AISP license is not free. Capital requirements are low (EUR 50,000), but the approval process at a national competent authority (BaFin, FCA, AMF, etc.) takes months and requires a documented risk framework. Most lenders work through an intermediary (TrueLayer, Tink, Plaid, Yapily, Salt Edge) rather than hold their own license.
### XS2A and the certificate stack
XS2A is a REST-over-HTTPS interface secured by mutual TLS with QWAC (Qualified Website Authentication Certificates) and signed requests with QSealC (Qualified Electronic Seal Certificates) under eIDAS. The Berlin Group NextGenPSD2 framework is the dominant European schema; the UK uses the Open Banking Implementation Entity (OBIE) spec, which is close but not identical. Bodies differ on redirect flow versus decoupled flow versus embedded flow for SCA.
The practical implication for modelers: latency. A typical AISP round trip through a bank's XS2A endpoint is 400 to 2,000 ms, dominated by the bank side. Batch pulls overnight are normal; real-time pulls at application time are the exception. Feature engineering should assume a snapshot pulled at application time plus a nightly refresh for portfolio monitoring. @parlour2022fintech modeled the equilibrium implication: when payment data becomes interoperable, banks lose informational rents, and entry by informed non-bank lenders is profitable.
### Strong customer authentication
SCA requires two of three factors: knowledge, possession, inheritance. The consequence for modeling is hidden but important. SCA exemptions exist for low-value, trusted-beneficiary, and low-risk transactions under RTS 97/98, and the exemption logic creates a non-random sample: the set of transactions that appear in an AISP feed for a given customer is conditional on the customer having authenticated. Dormant customers churn out of the feed faster than engaged customers. This is a selection mechanism worth calibrating.
### US and UK divergence
The US CFPB finalized Section 1033 of Dodd-Frank on October 22, 2024 [@cfpb2024openbanking]. It mandates, over a staggered compliance window (April 2026 for the largest banks), that depository institutions provide consumer data to authorized third parties via standardized APIs, prohibits screen scraping for compliant institutions, and sets privacy and accuracy standards close to PSD2. The UK's OBIE framework predates PSD2 and in 2023 entered the "future entity" phase with the JROC [@fca2023openbanking]. Structurally the three regimes converge on the same stack: consumer consent, licensed TPP, standardized API, SCA. They diverge on whether the bank can charge for access (US no, EU limited), on the data-minimization scope, and on liability for unauthorized transactions.
### What the data looks like
A single transaction returned by a Berlin Group XS2A endpoint carries, at minimum, bookingDate, valueDate, transactionAmount (value and currency), creditorName or debtorName, remittanceInformationUnstructured, bankTransactionCode (ISO 20022 code like PMNT-RCDT-SALA for an inbound salary), and a bank-assigned transactionId. That is the input to every downstream step.
---
## Transaction-level feature engineering {#sec-ch18-features}
Raw transactions must become a fixed-dimensional vector per customer per snapshot. This section builds the taxonomy.
### From tuples to panels
Let $\mathcal{T}_i$ be the set of transactions for customer $i$. Partition $\mathcal{T}_i$ into inflows $\mathcal{T}_i^+ = \{(t, a) : a > 0\}$ and outflows $\mathcal{T}_i^- = \{(t, a) : a < 0\}$. For a window $W$ ending at snapshot date $s$, define the aggregate
$$
S_i(s, W, \mathcal{C}) = \sum_{(t, a, m) \in \mathcal{T}_i, t \in (s - W, s], m \in \mathcal{C}} f(a, t),
$$ {#eq-agg}
where $\mathcal{C}$ is a category filter and $f$ is a reduction. With $f = |a|$ and $W$ = 90 days, $\mathcal{C}$ = {"salary"}, the result is 90-day inflow from salary.
### Income
Income is a latent variable. Bank credits that look like income include salary direct deposits, pension transfers, self-employment invoices, benefit payments, and regular peer transfers. Signal quality varies. Salary credits are high-signal: recurring, predictable, tagged by the paying bank as PMNT-RCDT-SALA in ISO 20022. Self-employment income is lower-signal: irregular, variable amount, heterogeneous counterparty. Peer transfers from family look like income but are not pledgeable.
A defensible income estimator uses three quantities:
1. Median recurring inflow with period 28 to 31 days and amount coefficient of variation (CV) below 0.2.
2. Sum of transactions with ISO salary codes.
3. Twelfth-percentile of monthly inflows over the last year (a conservative floor).
Each has failure modes. @olafsson2018liquid documented that for liquid hand-to-mouth households, month-to-month inflow CV exceeds 0.3 even when annual income is stable.
### Expense structure and recurring outflows
Rent-like recurring outflows are the single most predictive category. They mimic debt service: large, monthly, nondiscretionary. Identification is a temporal pattern match. Let $a_1, a_2, \ldots, a_k$ be outflows to the same counterparty over $k$ months. They are "rent-like" if
$$
\frac{\text{std}(a_j)}{\text{mean}(a_j)} < 0.1, \quad \text{median}\{t_{j+1} - t_j\} \in [27, 33], \quad k \geq 3.
$$ {#eq-rentlike}
@ganong2019consumer used a similar recurring detector to identify mortgage and rent from bank data; their measured pass-through from unemployment to spending is sharper when recurring outflows are carved out.
### Volatility and balance troughs
Income volatility and spending volatility are separate features. Let $I_{i,m}$ be month-$m$ inflow and $E_{i,m}$ month-$m$ outflow. Useful moments:
$$
\text{CV}^{I}_i = \frac{\sqrt{\text{Var}(I_{i,m})}}{\text{E}[I_{i,m}]}, \qquad \text{DSR}_i = \frac{\sum_m R_{i,m}}{\sum_m I_{i,m}},
$$ {#eq-dsr}
where $R_{i,m}$ is rent-like recurring outflow in month $m$. DSR above 0.45 is a regulatory red line for UK mortgage affordability (FCA MCOB 11.6).
Balance troughs are the most discriminating derivative of the daily balance series. If $B_{i,t}$ is end-of-day balance on day $t$, define the 90-day trough as $\min_{t \in (s-90, s]} B_{i,t}$. A customer whose 90-day trough is close to zero or negative is riding overdraft. Reserve coverage is $\max(B_{i,s-W}) / \bar{E}_i$, where $\bar{E}_i$ is monthly expense: months of runway.
@baker2018debt linked spending responses to liquidity, showing that households with low liquid reserves cut discretionary spending by 20 to 30 percent on adverse shocks. That response channel is exactly the cashflow default channel.
### Taxonomy, in practice
A working feature set has the following axes. Category axis: salary, rent, utilities, groceries, transport, dining, subscriptions, gambling, BNPL, cash withdrawals. Window axis: 7, 30, 90, 180, 360 days. Statistic axis: count, sum, mean, std, min, max, unique-counterparty-count, trend slope, EWMA. Binary flags: any-overdraft, any-NSF, any-payday-loan-repayment, any-gambling, any-crypto-exchange.
The Cartesian product blows up quickly, so most practitioners build perhaps 300 to 1,500 candidate features, then prune via information value or permutation importance.
---
## Cash flow analysis {#sec-ch18-cashflow}
Cashflow analysis is the explicit model that links transaction streams to ability-to-pay. Bureau scores answer "how has this borrower repaid in the past?" Cashflow scores answer "how much slack does this borrower have next month?" The two are complements.
### Decomposition
The accounting identity for a snapshot month $m$ is $B_{i,m} = B_{i,m-1} + I_{i,m} - E_{i,m}$. Expanding $E$:
$$
E_{i,m} = R_{i,m} + D_{i,m} + T_{i,m},
$$ {#eq-expense}
where $R$ is recurring (rent, mortgage, utilities, subscriptions), $D$ is discretionary (dining, entertainment, shopping), and $T$ is transfers out (including debt service). Slack is $I - R - \text{minimum viable } D$, and it is the variable that drives default on a new loan.
### Income detection
A production income detector has four layers: ISO 20022 codes (high precision, partial coverage), regex on description (covers "ACME PAYROLL", "DWP CHILD BENEFIT"), counterparty-and-frequency signature (unsupervised), manual user confirmation at application (lifts precision on edge cases). The detector output is a labeled subset of inflows with a confidence score. Aggregations downstream use the confidence as an inclusion weight.
### Month-end balance dynamics
Month-end balances have a characteristic sawtooth shape: they rise on payday and fall across the month. Useful moments include the minimum of this series, the slope of a monotone regression through the series, and the number of months where the minimum hit zero. Customers whose sawtooth bottoms out at the same level each month are living paycheck to paycheck. Customers whose sawtooth drifts down month over month are running down reserves.
### Affordability as a test
Affordability in regulation (FCA CONC 5.2A, EU Mortgage Credit Directive) is a pass/fail gate: after proposed new debt service, does residual income exceed a minimum threshold? Cashflow analysis supplies both sides of the test. The threshold is usually a household composition table plus a cost-of-living index; the residual is $I - R - \text{new debt service}$.
---
## NLP on transaction descriptions {#sec-ch18-nlp}
A transaction description is a short, noisy string: "SQ *BLUE BOTTLE COFFEE", "TFL TRAVEL CH", "AMZN Mktp*M12JF8KQ0". The target is a merchant or category label. The problem is short-text classification on a long-tailed label space, which @devlin2019bert style models handle well.
### Why pretrained language models help
Transaction descriptions are not natural English; they are a dialect of acronyms, stock tickers, store numbers, and payment-processor prefixes. But subword tokenizers (WordPiece, BPE) break even unknown strings into known pieces, and the pretrained transformer supplies a distribution over token sequences that transfers to the merchant-classification task with modest fine-tuning data. DistilBERT [@sanh2019distilbert] compresses BERT-base to 66M parameters with 97 percent retention on GLUE, which is the right size for a fine-tuning run on a laptop.
The classifier head is a linear layer on the [CLS] representation, trained with cross-entropy. Calibration can be tuned via temperature scaling. The practical pipeline is tokenize, batch, fine-tune 1 to 3 epochs on 10k to 1M labeled descriptions, deploy ONNX-exported weights with INT8 quantization behind a micro-batching server.
### Failure modes
Three failure modes recur. First, the long tail: merchants seen once or twice at training time. Zero-shot strategies (bi-encoder similarity to a merchant catalog) cover the tail. Second, ambiguity: "SUMUP" is a payment processor, not a merchant, and the true merchant is inside the remittance information. Post-processing rules separate processor tags from merchant names. Third, category drift: a grocery chain launches a pharmacy, and the same string now covers two categories. Monitoring per-merchant category entropy catches drift.
---
## Account aggregation {#sec-ch18-aggregation}
Most retail customers in mature markets hold three to seven accounts across two to four institutions. Aggregation is the step that assembles a unified cashflow view.
### Reconciliation and duplicate detection
A transfer from Customer A's checking account at Bank 1 to Customer A's savings account at Bank 2 appears twice in an aggregated feed: once as an outflow at Bank 1, once as an inflow at Bank 2. It is not income and not expense. Detection requires matching on amount and date within a tolerance, and on counterparty strings. A robust pipeline uses blocked record linkage [@fernandez2010entity; @christen2012data] with blocking keys (amount bucket, date window) and a pairwise classifier (Jaro-Winkler on names, amount equality, date proximity).
Duplicate detection errors are asymmetric. A missed duplicate inflates both sides and does not move net cashflow, but an erroneous merge of two distinct transactions erases one real flow and creates signal.
### Identity linkage
Linking accounts to the same customer across institutions uses two signals: the consumer authenticates into each institution through the same AISP session (high confidence), or a fuzzy match on account-holder name, address, and date of birth (lower confidence, more common in bureau-style aggregation). Locality-sensitive hashing on n-gram shingles [@broder1997syntactic] is the standard scale-out.
### Coverage and missingness
Most customers do not connect every account. Missing accounts are not missing at random: customers hide accounts they are embarrassed about (payday loans, gambling). Imputation is dangerous because the conditional distribution of the missing account is not the unconditional population. Preferred practice is a coverage flag ("we saw N of M accounts") as a feature, letting the downstream model learn that coverage is itself predictive.
---
## Data freshness and signal decay {#sec-ch18-decay}
Cashflow data goes stale. The question is how fast. This section gives a Markov-chain derivation of exponential decay in predictive mutual information.
### Mutual information under a Markov assumption
Let the feature process $X_t$ be a first-order stationary Markov chain on a finite state space $\mathcal{X}$ with transition matrix $P$. Assume $Y$ depends only on $X_{t}$ at a target horizon, i.e., $Y = g(X_t, \epsilon)$ with $\epsilon$ independent of the chain. For lag $k$, the mutual information between a past observation and the target is
$$
I(X_{t-k}; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)},
$$ {#eq-ch18-mi}
with $p(x, y) = \sum_{x'} p(x) [P^k]_{x, x'} \Pr(Y = y \mid X_t = x')$. The data-processing inequality [@cover2006elements] gives $I(X_{t-k}; Y) \leq I(X_{t-k+1}; Y)$: every extra step of mixing destroys information.
### Exponential decay
Let $P$ have second-largest eigenvalue in modulus $\lambda_2$. For any bounded function $h$ on $\mathcal{X}$,
$$
\big| \mathbb{E}[h(X_t) \mid X_{t-k}] - \mathbb{E}[h(X_t)] \big| \leq C |\lambda_2|^k,
$$ {#eq-mix}
where $C$ depends on $h$ and the stationary distribution. Plugging this bound into the chi-squared approximation for $I$ in the near-independence limit gives
$$
I(X_{t-k}; Y) \approx \tfrac{1}{2} \chi^2 \leq \tfrac{C'}{2} |\lambda_2|^{2k} = \tfrac{C'}{2} e^{-\alpha k},
$$ {#eq-decay}
with $\alpha = -2 \log |\lambda_2| > 0$. The half-life of predictive information is $k_{1/2} = \log 2 / \alpha$.
For retail cashflow, empirical $k_{1/2}$ is weeks to a few months depending on the feature. Income is sticky (half-life of many months). Gambling flags are volatile (half-life of weeks). Practice: weight recent observations up with exponential moving averages, and drop features whose measured half-life is shorter than the refresh cadence.
### Aggregation schemes
Trailing-window statistics are box filters: equal weight inside the window, zero weight outside. Exponentially-weighted moving averages are IIR filters:
$$
\text{EWMA}_t = (1 - \beta) x_t + \beta \text{EWMA}_{t-1}, \quad \beta \in (0, 1).
$$ {#eq-ewma}
The effective half-life is $\log 2 / \log(1/\beta)$. Choose $\beta$ so that the EWMA half-life matches the empirical predictive half-life of the feature. That is the only defensible tuning rule.
---
## Simulation: a six-month transaction panel {.unnumbered}
We simulate 1,000 customers over 180 days with structured income, rent, utilities, subscriptions, groceries, transport, dining, gambling, and idiosyncratic shocks. We then engineer 30+ features with pandas rolling windows, repeat in polars, benchmark, and train LightGBM. A small pretrained transformer fine-tunes a merchant classifier.
### Simulate
```{python}
#| label: sim-panel
import numpy as np
import pandas as pd
import sys
sys.path.insert(0, '../code')
from creditutils import ks_statistic, stable_sigmoid
rng = np.random.default_rng(42)
N_CUSTOMERS = 1000
DAYS = 180
start = pd.Timestamp("2024-01-01")
# Customer-level latent attributes
cust = pd.DataFrame({
"customer_id": np.arange(N_CUSTOMERS),
"monthly_income": rng.lognormal(mean=8.1, sigma=0.35, size=N_CUSTOMERS),
"monthly_rent_frac": rng.beta(4, 6, size=N_CUSTOMERS) * 0.55 + 0.15,
"savings_rate": rng.beta(2, 5, size=N_CUSTOMERS) * 0.3,
"gambling_prop": rng.beta(1, 20, size=N_CUSTOMERS),
"shock_prob": rng.beta(1, 40, size=N_CUSTOMERS),
})
cust["monthly_rent"] = (cust["monthly_income"] * cust["monthly_rent_frac"]).round(2)
# Latent default process: cashflow-driven
def latent_pd(row):
dsr = row["monthly_rent_frac"]
base = -2.5 + 2.4 * dsr - 3.0 * row["savings_rate"] + 6.0 * row["gambling_prop"] + 10 * row["shock_prob"]
return stable_sigmoid(base)
cust["pd_true"] = cust.apply(latent_pd, axis=1)
cust["default"] = rng.binomial(1, cust["pd_true"])
print("default rate:", cust["default"].mean().round(3))
```
```{python}
#| label: sim-txn
# Build a daily transaction log
txns = []
def push(cid, day, amount, desc, category):
txns.append((cid, start + pd.Timedelta(days=int(day)), float(amount), desc, category))
for _, row in cust.iterrows():
cid = int(row["customer_id"])
inc = row["monthly_income"]
rent = row["monthly_rent"]
sav = row["savings_rate"]
gmbl = row["gambling_prop"]
shock = row["shock_prob"]
# Salary on the 25th of each month, small jitter
for month_start in range(0, DAYS, 30):
payday = month_start + 25
if payday < DAYS:
amt = inc * rng.normal(1.0, 0.02)
push(cid, payday, amt, "ACME PAYROLL DIRECT DEP", "salary")
# Rent on the 1st
for month_start in range(0, DAYS, 30):
rent_day = month_start + 1
if rent_day < DAYS:
push(cid, rent_day, -rent, "LANDLORD CO STANDING ORDER", "rent")
# Utilities on 5th and 15th
for month_start in range(0, DAYS, 30):
for d, tag in [(5, "NATIONAL GRID DD"), (15, "THAMES WATER DD")]:
day = month_start + d
if day < DAYS:
push(cid, day, -rng.uniform(40, 110), tag, "utilities")
# Subscriptions
subs = [("NETFLIX.COM", rng.uniform(9, 18)),
("SPOTIFY PREMIUM", rng.uniform(9, 12)),
("AMAZON PRIME MBRSHIP", rng.uniform(8, 13))]
for i, (desc, amt) in enumerate(subs):
if rng.random() < 0.75:
for month_start in range(0, DAYS, 30):
day = month_start + 7 + i
if day < DAYS:
push(cid, day, -amt, desc, "subscriptions")
# Groceries: twice a week
n_grocery = DAYS // 3
for _ in range(n_grocery):
d = rng.integers(0, DAYS)
push(cid, d, -rng.uniform(20, 90), rng.choice(["TESCO STORES 4821", "SAINSBURYS SMKTS", "LIDL GB LONDON"]), "groceries")
# Transport
n_tube = DAYS // 2
for _ in range(n_tube):
d = rng.integers(0, DAYS)
push(cid, d, -rng.uniform(2, 25), rng.choice(["TFL TRAVEL CH", "UBER *TRIP", "NATIONAL RAIL"]), "transport")
# Dining
discretionary = max(0.05, 0.3 - sav)
n_dine = int(DAYS * discretionary * 0.5)
for _ in range(n_dine):
d = rng.integers(0, DAYS)
push(cid, d, -rng.uniform(8, 60), rng.choice(["PRET A MANGER", "SQ *BLUE BOTTLE", "NANDO'S CHICKEN", "DELIVEROO*ORDER"]), "dining")
# Gambling (sparse, high variance)
n_gamble = rng.poisson(DAYS * gmbl * 0.2)
for _ in range(n_gamble):
d = rng.integers(0, DAYS)
push(cid, d, -rng.uniform(10, 200), rng.choice(["BET365 GAMING", "SKYBET MOBILE", "PADDYPOWER BET"]), "gambling")
# Income shocks (unemployment, missed paycheck)
if rng.random() < shock * 6:
miss_month = rng.integers(1, DAYS // 30)
# Remove one salary: we just add a negative offset to simulate smaller paycheck
push(cid, miss_month * 30 + 25, -inc * 0.6, "REVERSAL PRIOR CR", "shock")
txns_df = pd.DataFrame(txns, columns=["customer_id", "date", "amount", "description", "category"])
txns_df = txns_df.sort_values(["customer_id", "date"]).reset_index(drop=True)
print(txns_df.shape, txns_df["category"].value_counts().to_dict())
```
The panel has 1,000 customers over 180 days and a realistic category mix. The latent default rate is driven by debt-service ratio, savings, gambling, and income shocks, which are exactly the targets of the feature engineering.
### Feature engineering with pandas
```{python}
#| label: features-pandas
import time
t0 = time.time()
# Snapshot date: feature cutoff
snap = start + pd.Timedelta(days=DAYS)
# Basic derived columns
txns_df["abs_amount"] = txns_df["amount"].abs()
txns_df["is_outflow"] = (txns_df["amount"] < 0).astype(int)
txns_df["is_inflow"] = (txns_df["amount"] > 0).astype(int)
def category_sum(df, cat, col="abs_amount"):
return df.loc[df["category"] == cat, col].sum()
def build_features(g):
days_obs = (g["date"].max() - g["date"].min()).days + 1
inflow = g.loc[g["is_inflow"] == 1, "amount"].sum()
outflow = -g.loc[g["is_outflow"] == 1, "amount"].sum()
# Per-category
cats = ["salary", "rent", "utilities", "groceries", "dining", "transport",
"gambling", "subscriptions", "shock"]
cat_sums = {f"sum_{c}": category_sum(g, c) for c in cats}
cat_counts = {f"cnt_{c}": int((g["category"] == c).sum()) for c in cats}
# Monthly series
g2 = g.set_index("date").sort_index()
monthly_in = g2.loc[g2["is_inflow"] == 1, "amount"].resample("ME").sum()
monthly_out = (-g2.loc[g2["is_outflow"] == 1, "amount"]).resample("ME").sum()
# Volatility
inflow_cv = float(monthly_in.std() / (monthly_in.mean() + 1e-9)) if len(monthly_in) > 1 else 0.0
outflow_cv = float(monthly_out.std() / (monthly_out.mean() + 1e-9)) if len(monthly_out) > 1 else 0.0
# Daily net balance (cumulative)
daily = g2["amount"].resample("D").sum().fillna(0.0)
bal = daily.cumsum()
trough_90 = float(bal.iloc[-min(90, len(bal)):].min())
trough_30 = float(bal.iloc[-min(30, len(bal)):].min())
# Recurring rent-like detection: abs amount with low CV, monthly cadence
rent_like = 0.0
rent_pmts = g[g["category"] == "rent"]["abs_amount"]
if len(rent_pmts) >= 3 and rent_pmts.std() / (rent_pmts.mean() + 1e-9) < 0.1:
rent_like = float(rent_pmts.mean())
dsr = rent_like * 6 / (inflow + 1e-9) if inflow > 0 else 0.0
# Reserve coverage (approx)
monthly_exp = outflow / max(days_obs / 30, 1)
reserve_months = float(bal.iloc[-1] / (monthly_exp + 1e-9)) if monthly_exp > 0 else 0.0
# Counts of unique counterparties (from description first token)
g["merchant_key"] = g["description"].str.split().str[0]
uniq_merchants = int(g["merchant_key"].nunique())
uniq_gambling = int(g.loc[g["category"] == "gambling", "merchant_key"].nunique())
# Discretionary ratio
discretionary = cat_sums["sum_dining"] + 0.5 * cat_sums["sum_transport"]
disc_ratio = float(discretionary / (outflow + 1e-9))
# Flags
any_gambling = int(cat_counts["cnt_gambling"] > 0)
any_shock = int(cat_counts["cnt_shock"] > 0)
# EWMA of monthly inflow
ewma_inflow = float(monthly_in.ewm(alpha=0.5).mean().iloc[-1]) if len(monthly_in) else 0.0
# Trough as share of monthly inflow
trough_norm = trough_90 / (monthly_in.mean() + 1e-9) if len(monthly_in) else 0.0
# 90-day-vs-all average outflow drift
out_mean = float(monthly_out.mean()) if len(monthly_out) else 0.0
out_last = float(monthly_out.iloc[-1]) if len(monthly_out) else 0.0
drift = (out_last - out_mean) / (out_mean + 1e-9)
out = {
"total_inflow": inflow, "total_outflow": outflow,
"inflow_cv": inflow_cv, "outflow_cv": outflow_cv,
"rent_like_amt": rent_like, "dsr": dsr,
"trough_30": trough_30, "trough_90": trough_90,
"trough_norm": trough_norm, "reserve_months": reserve_months,
"uniq_merchants": uniq_merchants, "uniq_gambling": uniq_gambling,
"disc_ratio": disc_ratio, "any_gambling": any_gambling,
"any_shock": any_shock, "ewma_inflow": ewma_inflow,
"outflow_drift": drift,
}
out.update(cat_sums); out.update(cat_counts)
return pd.Series(out)
feats_pd = txns_df.groupby("customer_id", sort=False).apply(build_features, include_groups=False)
feats_pd = feats_pd.reset_index()
t_pandas = time.time() - t0
print("pandas features:", feats_pd.shape, "time:", round(t_pandas, 2), "s")
print("feature columns:", len(feats_pd.columns) - 1)
feats_pd.head(2)
```
The builder produces 35 features per customer, covering the four axes from @sec-ch18-features: category sums, category counts, volatility moments, recurring detectors, trough statistics, and flags.
### Polars implementation and benchmark
```{python}
#| label: features-polars
import polars as pl
t0 = time.time()
pl_df = pl.from_pandas(txns_df)
pl_df = pl_df.with_columns([
pl.col("amount").abs().alias("abs_amount"),
(pl.col("amount") < 0).cast(pl.Int8).alias("is_outflow"),
(pl.col("amount") > 0).cast(pl.Int8).alias("is_inflow"),
pl.col("date").dt.truncate("1mo").alias("month"),
pl.col("description").str.split(" ").list.get(0).alias("merchant_key"),
])
cats = ["salary", "rent", "utilities", "groceries", "dining", "transport",
"gambling", "subscriptions", "shock"]
agg_exprs = [
pl.col("amount").filter(pl.col("is_inflow") == 1).sum().alias("total_inflow"),
(-pl.col("amount").filter(pl.col("is_outflow") == 1).sum()).alias("total_outflow"),
pl.col("merchant_key").n_unique().alias("uniq_merchants"),
pl.col("merchant_key").filter(pl.col("category") == "gambling").n_unique().alias("uniq_gambling"),
(pl.col("category") == "gambling").any().cast(pl.Int8).alias("any_gambling"),
(pl.col("category") == "shock").any().cast(pl.Int8).alias("any_shock"),
]
for c in cats:
agg_exprs.append(pl.col("abs_amount").filter(pl.col("category") == c).sum().alias(f"sum_{c}"))
agg_exprs.append((pl.col("category") == c).sum().cast(pl.Int32).alias(f"cnt_{c}"))
base = pl_df.group_by("customer_id").agg(agg_exprs)
# Monthly inflow / outflow series -> volatility
monthly = (pl_df.group_by(["customer_id", "month"])
.agg([
pl.col("amount").filter(pl.col("is_inflow") == 1).sum().alias("in_m"),
(-pl.col("amount").filter(pl.col("is_outflow") == 1).sum()).alias("out_m"),
])
.sort(["customer_id", "month"]))
vols = (monthly.group_by("customer_id")
.agg([
(pl.col("in_m").std() / (pl.col("in_m").mean().abs() + 1e-9)).alias("inflow_cv"),
(pl.col("out_m").std() / (pl.col("out_m").mean().abs() + 1e-9)).alias("outflow_cv"),
pl.col("in_m").ewm_mean(alpha=0.5).last().alias("ewma_inflow"),
pl.col("in_m").mean().alias("inflow_mean_monthly"),
pl.col("out_m").mean().alias("outflow_mean_monthly"),
pl.col("out_m").last().alias("outflow_last_monthly"),
]))
feats_pl = base.join(vols, on="customer_id", how="left")
feats_pl = feats_pl.with_columns([
(pl.col("outflow_last_monthly") - pl.col("outflow_mean_monthly") /
(pl.col("outflow_mean_monthly").abs() + 1e-9)).alias("outflow_drift_proxy")
])
t_polars = time.time() - t0
print("polars features:", feats_pl.shape, "time:", round(t_polars, 2), "s")
print("pandas/polars speedup:", round(t_pandas / max(t_polars, 1e-3), 2), "x")
```
Polars produces a subset of the pandas features through pure group-by plus window operations. The speedup factor on this panel is typically 3 to 10x, driven by columnar layout and zero-copy expressions. For production monthly refresh of multi-million customer panels, the polars path scales linearly with CPU and memory, and is the first stop before reaching for Spark.
### LightGBM on engineered features
```{python}
#| label: lgbm-fit
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, brier_score_loss
X = feats_pd.drop(columns=["customer_id"]).fillna(0.0)
y = cust["default"].values
rng2 = np.random.default_rng(0)
idx = np.arange(len(X)); rng2.shuffle(idx)
cut = int(0.7 * len(X))
tr, te = idx[:cut], idx[cut:]
dtr = lgb.Dataset(X.iloc[tr].values, label=y[tr])
dte = lgb.Dataset(X.iloc[te].values, label=y[te], reference=dtr)
params = dict(objective="binary", metric="auc", learning_rate=0.05,
num_leaves=31, feature_fraction=0.9, bagging_fraction=0.9,
bagging_freq=1, min_data_in_leaf=20, verbose=-1, seed=0)
booster = lgb.train(params, dtr, num_boost_round=300, valid_sets=[dte],
callbacks=[lgb.early_stopping(30, verbose=False), lgb.log_evaluation(0)])
p_tr = booster.predict(X.iloc[tr].values)
p_te = booster.predict(X.iloc[te].values)
auc = roc_auc_score(y[te], p_te)
ks = ks_statistic(y[te], p_te)
brier = brier_score_loss(y[te], p_te)
print({"auc": round(auc, 3), "ks": round(ks, 3), "brier": round(brier, 4)})
```
### Thin-file bureau baseline
A "thin-file bureau" baseline uses only the age, income, and one crude expense ratio: what a non-open-banking lender would see.
```{python}
#| label: baseline
from sklearn.linear_model import LogisticRegression
base_X = feats_pd[["total_inflow", "total_outflow"]].copy()
base_X["inc_out_ratio"] = base_X["total_outflow"] / (base_X["total_inflow"] + 1e-9)
base_Xv = base_X.values
base_model = LogisticRegression(max_iter=1000)
base_model.fit(base_Xv[tr], y[tr])
p_base = base_model.predict_proba(base_Xv[te])[:, 1]
print({"baseline_auc": round(roc_auc_score(y[te], p_base), 3),
"baseline_ks": round(ks_statistic(y[te], p_base), 3),
"baseline_brier": round(brier_score_loss(y[te], p_base), 4)})
```
The engineered open-banking model has sharply higher AUC and KS than the three-variable baseline, which is the empirical content behind the open-banking case. @berg2020rise reported similar lifts from digital footprints; cashflow features are typically stronger still because they carry quantitative information, not just binary flags.
### Feature importance
```{python}
#| label: lgbm-importance
import matplotlib.pyplot as plt
imp = pd.Series(booster.feature_importance(importance_type="gain"),
index=X.columns).sort_values(ascending=True)
top = imp.tail(15)
fig, ax = plt.subplots(figsize=(6.5, 5.0))
ax.barh(top.index, top.values, color="#2a6cc0")
ax.set_xlabel("gain"); ax.set_title("LightGBM top features (gain)")
plt.tight_layout()
plt.show()
```
The top features are the cashflow primitives: debt-service ratio, recurring rent amount, reserve months, gambling flag, income CV. This is exactly the ordering the derivation in @sec-ch18-features predicted.
---
## Merchant classification with a pretrained transformer {.unnumbered}
We fine-tune DistilBERT on a small labeled set of transaction descriptions to predict category. We keep the corpus tiny so the cell finishes in under 90 seconds.
```{python}
#| label: bert-setup
import os
os.environ.setdefault("TRANSFORMERS_VERBOSITY", "error")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
# Build a small labeled dataset from the simulated transactions
lab_df = (txns_df[["description", "category"]]
.sample(n=500, random_state=0)
.reset_index(drop=True))
labels = sorted(lab_df["category"].unique().tolist())
label2id = {c: i for i, c in enumerate(labels)}
id2label = {i: c for c, i in label2id.items()}
lab_df["y"] = lab_df["category"].map(label2id)
print("labels:", labels, "n:", len(lab_df))
```
```{python}
#| label: bert-train
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
torch.manual_seed(0)
np.random.seed(0)
MODEL = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(MODEL)
mdl = AutoModelForSequenceClassification.from_pretrained(
MODEL, num_labels=len(labels))
device = torch.device("cpu")
mdl.to(device)
idx_all = np.arange(len(lab_df))
np.random.shuffle(idx_all)
ntr = int(0.8 * len(idx_all))
idx_tr, idx_te = idx_all[:ntr], idx_all[ntr:]
class TxnDS(Dataset):
def __init__(self, df, idx):
self.texts = df["description"].values[idx]
self.y = df["y"].values[idx]
def __len__(self): return len(self.texts)
def __getitem__(self, i):
enc = tok(str(self.texts[i]), truncation=True, padding="max_length",
max_length=24, return_tensors="pt")
return {k: v.squeeze(0) for k, v in enc.items()}, int(self.y[i])
def collate(batch):
xs = {k: torch.stack([b[0][k] for b in batch]) for k in batch[0][0]}
ys = torch.tensor([b[1] for b in batch], dtype=torch.long)
return xs, ys
train_dl = DataLoader(TxnDS(lab_df, idx_tr), batch_size=16, shuffle=True, collate_fn=collate)
test_dl = DataLoader(TxnDS(lab_df, idx_te), batch_size=16, shuffle=False, collate_fn=collate)
opt = torch.optim.AdamW(mdl.parameters(), lr=5e-5)
import time
t0 = time.time()
mdl.train()
for epoch in range(2):
tot, n = 0.0, 0
for xs, ys in train_dl:
xs = {k: v.to(device) for k, v in xs.items()}
ys = ys.to(device)
out = mdl(**xs, labels=ys)
out.loss.backward()
opt.step(); opt.zero_grad()
tot += out.loss.item() * len(ys); n += len(ys)
print(f"epoch {epoch+1} loss {tot/n:.3f}")
print("train time:", round(time.time() - t0, 1), "s")
```
```{python}
#| label: bert-eval
mdl.eval()
correct, total = 0, 0
with torch.no_grad():
for xs, ys in test_dl:
xs = {k: v.to(device) for k, v in xs.items()}
logits = mdl(**xs).logits
pred = logits.argmax(-1).cpu().numpy()
correct += (pred == ys.numpy()).sum()
total += len(ys)
print("accuracy:", round(correct / max(total, 1), 3))
# Inference on a couple of unseen strings
samples = ["DOORDASH SOHO", "BET365 LIVEWIRE", "THAMES WATER DD"]
enc = tok(samples, truncation=True, padding=True, max_length=24, return_tensors="pt")
with torch.no_grad():
pred = mdl(**enc).logits.argmax(-1).cpu().numpy()
for s, p in zip(samples, pred):
print(f" {s} -> {id2label[int(p)]}")
```
Two epochs on 400 samples is enough for the model to separate the coarse label set in the simulated world. A production run has two differences: the label space is larger (500 to 5,000 merchants, 30 to 80 categories) and the training corpus is larger (labeled tens of thousands to millions of descriptions from historical feeds). Training cost scales sublinearly with data past that point because the pretrained representation already covers most tokens.
### Calibration and deployment notes
For deployment, export to ONNX and quantize to INT8. A DistilBERT classifier at max_length 32 achieves sub-10 ms CPU inference per description, which is enough to bulk-label a 24-month history in under a second. For streaming, cache predictions keyed on normalized description strings. The hit rate after a week of traffic exceeds 90 percent for retail feeds because merchant strings repeat.
---
## Scalability: PySpark Structured Streaming {.unnumbered}
Once transaction volume exceeds what a single node can aggregate in the refresh window, Spark Structured Streaming [@armbrust2018structured; @zaharia2016apache] is the standard choice. The pattern is an append-only event stream from Kafka or Kinesis, watermarking on the bookingDate, windowed aggregates per customer and category, and an output sink to a feature store.
```{python}
#| label: spark-doc
#| eval: false
# Illustrative only. Runs against a live Spark session in production.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.appName("openbanking-features")
.getOrCreate())
raw = (spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "txn.raw")
.load())
schema = "customer_id STRING, booking_ts TIMESTAMP, amount DOUBLE, category STRING, description STRING"
txn = (raw.selectExpr("CAST(value AS STRING) as json")
.select(F.from_json("json", schema).alias("r"))
.select("r.*")
.withWatermark("booking_ts", "7 days"))
agg = (txn.groupBy(
F.col("customer_id"),
F.window("booking_ts", "30 days", "1 day"),
F.col("category"))
.agg(F.sum("amount").alias("sum_amt"),
F.count("*").alias("n_txn"))
)
query = (agg.writeStream
.outputMode("update")
.format("delta")
.option("checkpointLocation", "/chk/openbanking/")
.start("/features/openbanking/"))
```
The key engineering knobs are watermark length (balance late data against state size), trigger interval (micro-batch cadence), and state store backend (RocksDB for large state, HDFSStateStore for simple cases). The cost model is roughly linear in events per second and in the number of feature-windows per customer. @armbrust2018structured is the canonical reference for correctness guarantees.
A polars path is appropriate up to hundreds of millions of transactions per refresh on a single fat box. A Spark path is mandatory beyond that or whenever the ingestion is a continuous stream with sub-hour freshness targets.
---
## Decay: an empirical check {.unnumbered}
To ground @eq-decay, we estimate information content of 30-day rolling feature slabs at different lags, using our simulated panel. The prediction target is a simulated default indicator. Mutual information between discretized feature bins and the target is a practical proxy.
```{python}
#| label: decay
from sklearn.feature_selection import mutual_info_classif
# Build 30-day-window features at three lags ending at day 90, 120, 150
def window_feats(df, end_day):
start_d = start + pd.Timedelta(days=end_day - 30)
stop_d = start + pd.Timedelta(days=end_day)
sub = df[(df["date"] > start_d) & (df["date"] <= stop_d)]
agg = (sub.groupby("customer_id")
.agg(inflow=("amount", lambda s: s[s > 0].sum()),
outflow=("amount", lambda s: -s[s < 0].sum()),
n_gamble=("category", lambda s: (s == "gambling").sum()),
n_rent=("category", lambda s: (s == "rent").sum()))
.reset_index())
return agg
lags = [150, 120, 90, 60, 30] # end-day of the 30-day window; nearer = fresher
mi_by_lag = {}
for end in lags:
w = window_feats(txns_df, end)
merged = w.merge(cust[["customer_id", "default"]], on="customer_id", how="right").fillna(0)
mi = mutual_info_classif(
merged[["inflow", "outflow", "n_gamble", "n_rent"]].values,
merged["default"].values, random_state=0)
mi_by_lag[DAYS - end] = float(np.sum(mi))
for k, v in mi_by_lag.items():
print(f"lag (days): {k:3d} sum MI: {v:.4f}")
```
```{python}
#| label: decay-fit
# Fit an exponential decay: I(k) = A * exp(-alpha * k)
ks_arr = np.array(list(mi_by_lag.keys()))
mis = np.array(list(mi_by_lag.values()))
mask = mis > 0
logmi = np.log(mis[mask])
slope, intercept = np.polyfit(ks_arr[mask], logmi, 1)
alpha = -slope
halflife = np.log(2) / alpha if alpha > 0 else np.inf
print(f"alpha (per day): {alpha:.4f}")
print(f"implied half-life (days): {halflife:.1f}")
fig, ax = plt.subplots(figsize=(6.0, 3.8))
ax.plot(ks_arr, mis, "o-", color="#2a6cc0", label="empirical MI")
xs = np.linspace(ks_arr.min(), ks_arr.max(), 50)
ax.plot(xs, np.exp(intercept + slope * xs), "--", color="#b02828",
label=f"exp fit, half-life {halflife:.0f}d")
ax.set_xlabel("lag (days)"); ax.set_ylabel("sum MI (nats)")
ax.set_title("Information decay of cashflow features")
ax.legend(); plt.tight_layout(); plt.show()
```
The exponential fit recovers a half-life on the order of 60 to 180 days for this simulated panel, with the exact value depending on sampling. Production half-lives differ by feature class: gambling flags (fast decay, order of weeks), income stability (slow decay, order of a year or more). The policy implication is specific: a monthly refresh preserves most of the signal for slow-decay features, but gambling and NSF signals should be refreshed weekly or at application.
---
## Regulatory considerations {.unnumbered}
Open banking data comes with a layered compliance stack. The data pull sits under PSD2 or its local analog (CFPB 1033 in the US, the UK's OBIE, Australia's CDR). Model governance sits under SR 11-7 in the US [@sr117] and under the ECB TRIM guidance in Europe. Anti-discrimination sits under ECOA and Regulation B, and under the EU Charter's equality articles. @bartlett2022consumer is the canonical reference on algorithmic discrimination in consumer lending.
Three points deserve flag-level attention. First, Article 22 of the GDPR: a fully automated decision with legal or similarly significant effects requires either explicit consent, contractual necessity, or an explicit legal basis, and the data subject has the right to obtain human intervention. Credit decisions usually rely on contractual necessity, which is narrower than consent; documentation must show the decision is necessary for entering the contract. Second, the EU AI Act (2024) classifies credit-scoring systems as high-risk under Annex III, which triggers requirements on data governance, human oversight, logging, and post-market monitoring. Third, CFPB's 1033 rule has an explicit requirement that consumer-authorized data cannot be used for "targeted advertising, cross-selling, or sale of covered data," which constrains how open-banking features cross into marketing feedback loops.
From a model-risk perspective, open-banking features have two properties that matter: they decay fast (so validation must test freshness-stratified performance) and they are rich in personal-life information (so fairness audits should probe proxy discrimination, for instance on gambling features that may correlate with religion or socioeconomic status).
---
## Vietnam and emerging markets {.unnumbered}
### Market context
Vietnam does not have PSD2. It has a regulator-led payment-rail modernization anchored on NAPAS, the national payment switch, which clears interbank card and QR transactions across almost all commercial banks [@napas2023report]. Three instruments define the functional perimeter. SBV Decision 2345/QD-NHNN (effective July 2024) mandates biometric authentication for online transfers above defined thresholds and for first-time device binding, effectively creating a strong-customer-authentication regime analogous to PSD2 SCA [@sbv_decision2345_2023]. Circular 16/2020 establishes electronic KYC for payment-account opening with remote identity verification [@sbv_circular16_2020]. Circular 41/2016 sets Basel II standardized capital rules for banks [@sbv_circular41_2016], and Circular 22/2023/TT-NHNN (29 Dec 2023) amends Circular 41/2016 on capital adequacy ratios [@sbv_circular22_2023]. Circular 43/2016/TT-NHNN sets the separate consumer-lending regime for finance companies.
What Vietnam does not yet have is a consumer-owned, third-party-accessible open-banking API comparable to PSD2's XS2A or CFPB 1033. Data access for non-bank fintechs runs through bilateral partnerships, e-wallet ecosystems (MoMo, ZaloPay, VNPay), and the NAPAS common switch rather than through a statutory right to pull. The IMF's Article IV and ADB reports flag this gap as a financial-inclusion frontier [@imf2024vietnamart4, @adb2022vnfin]. Decree 13/2023 on Personal Data Protection sets the consent and data-subject-rights baseline against which any future API access will be built [@vn_decree13_2023].
### Application considerations
A Vietnamese lender that wants PSD2-style cashflow features has three practical paths. The first is partner-bank integration: negotiate a data-sharing contract with a commercial bank where the applicant maintains a primary current account, pull transaction history under Decree 13/2023 consent, and run the feature engineering pipeline from @sec-ch18-cashflow. The second is e-wallet integration: pull wallet transaction history from MoMo, ZaloPay, or VNPay where the applicant has consented, and treat the wallet as a partial proxy for a current account. Wallet data is cleaner than bureau data (explicit categories, merchant tags) but thinner than bank data (wallet balances are typically small, salary and rent rarely clear through the wallet). The third is salary-credit capture via the NAPAS rail: where the applicant's employer disburses salary into a partner bank, NAPAS-settled income features are available under bank consent.
Feature engineering priorities shift in this context. Income stability and recurring-outflow detectors transfer directly from @sec-ch18-cashflow. Gambling-flag features translate poorly because Vietnamese consumer gambling flows through offshore channels and rarely shows on card rails. Tet seasonality requires explicit treatment: income, spend, and transfer volumes spike in the month before Tet and fall in the two weeks after. A model that uses a raw monthly-average income feature without a Tet adjustment will misprice January and February applications.
### Rationalization
Two arguments justify importing the PSD2 pipeline into Vietnam despite the absence of statutory open banking. First, the informational primitives are the same. A salary credit is a salary credit whether it arrives through SEPA or through NAPAS. The @berg2020rise and @olafsson2018liquid findings about cashflow primitives (income stability, recurring-outflow depth, trough statistics) are structural and do not depend on the legal access mechanism. Second, the access path, while bilateral, already carries enough Vietnamese volume to support a production model. NAPAS processes the majority of interbank card and QR transactions in Vietnam [@napas2023report], and the top three e-wallets cover a large share of digital retail payments. A lender that integrates with even one major bank plus one major e-wallet can reach a meaningful share of urban consumer applicants.
The limits are real. A PSD2 feed is consumer-portable: a borrower can grant access to any licensed TPP. A Vietnamese bilateral feed is not portable: switching lenders breaks the data link. This matters for the Babina-Buchak-Gornall-type competitive effects [@babina2024customer], which may be muted in Vietnam until a statutory open-banking regime exists. The @he2023open equilibrium analysis about open banking lowering entry barriers is therefore a prediction about Vietnam's future, not its present.
### Practical notes
Operationally, a Vietnamese bank or finance company building an open-banking-style scorecard should do four things. First, align the data-pull consent template with Decree 13/2023 Articles on purpose limitation, cross-border transfer, and data-subject rights [@vn_decree13_2023]. Second, build the SCA layer to Decision 2345 requirements before the feature layer; biometric re-auth on first device binding is now a hard gate for high-value consumer flows [@sbv_decision2345_2023]. Third, engineer Tet-adjusted features explicitly (de-seasonalized income, Tet-window transaction flags). Fourth, validate the model to SBV Circular 41/2016 standardized-approach expectations for PD inputs as updated by Circular 22/2023/TT-NHNN (29 Dec 2023) on capital adequacy ratios, document segment-level calibration for the finance-company use case under Circular 43/2016/TT-NHNN on consumer lending by finance companies, and maintain a feature-provenance ledger that a CIC examination can reconcile [@sbv_circular41_2016, @sbv_circular22_2023, @cicvn2023report]. The decay and freshness analysis in @sec-ch18-decay applies without modification: salary-credit signals decay at roughly the same half-life regardless of the jurisdiction.
## Takeaways {.unnumbered}
- PSD2 and CFPB 1033 are not just regulations, they are a stable supply of transaction-level data that dominates bureau signals on cashflow-relevant questions.
- Feature engineering is the product. Recurring-outflow detectors, trough statistics, volatility moments, and income-stability measures are the features that move AUC.
- Signal decays. Exponential decay of mutual information under a Markov assumption is both the theory and the empirically observed behavior; pick EWMA half-lives to match.
- BERT-style models on short descriptions give a merchant classifier that scales cleanly and retrains cheaply; zero-shot fallbacks cover the tail.
- Aggregation across institutions is a duplicate-detection problem, not a sum, and coverage flags are themselves predictive.
---
## Further reading {.unnumbered}
- @berg2020rise on digital footprints in credit scoring, the closest published benchmark.
- @olafsson2018liquid on cashflow-panel evidence from personal finance software.
- @ganong2019consumer and @baker2018debt on cashflow responses to income shocks.
- @he2023open on the equilibrium theory of open banking and credit competition.
- @babina2024customer on empirical entry effects of open banking on fintech lending.
- @parlour2022fintech on payment-data externalities.
- @gambacorta2024data on non-traditional data lifts in credit scoring.
- @jagtiani2019roles on alternative-data evidence in marketplace lending.
- @devlin2019bert and @sanh2019distilbert on the language models used for merchant classification.
- @armbrust2018structured for the engineering reference on streaming cashflow aggregation.