235 Business Applications of Visual Inference: From eKYC to Physiognomy

235.1 1. Introduction

The previous chapters built systems that verify a claimed identity. This chapter widens the lens to the broader commercial question that practitioners and executives actually ask: what business value can AI extract from a face, a voice, or a video, and where does that value turn into liability, pseudoscience, or illegality?

The organizing idea is a single distinction that cuts through the entire field:

Verification asks a checkable question (“is this the person in the document?” “is this person plausibly over 18?”). Inference asks an unanswerable one (“is this person creditworthy, employable, honest, or gay, from their face?”).

As applications move from the first kind to the second, the “signal” the model finds increasingly turns out to be self-presentation, demographic proxy, or dataset artifact rather than the trait claimed. This chapter organizes the landscape into three tiers (legitimate, contested, and discredited), with concrete business and academic references for each. It closes with practical guidance on which use cases are safe and compelling to demonstrate in a real business setting, which is what a responsible practitioner needs before promising a client a “face-based” product.

235.1.1 1.1 A precise definition of the verification / inference axis

The intuitive split above can be made formal, and the formalization is what makes the rest of the chapter rigorous rather than rhetorical. Let $X$ denote a facial image (or short video), and let $Y$ denote the target a vendor wants to predict.

Definition (verification task). A task is a verification when $Y$ is a function of an externally recorded fact $f$, and a ground-truth oracle exists that returns $Y$ independently of the image. Identity matching has $Y = \mathbb{1}[\text{same person as enrolled template}]$, with the enrollment record as oracle. Age verification has $Y = \mathbb{1}[\text{age} \ge t]$, with a birth date as oracle. The defining property is falsifiability: every prediction $\hat Y(X)$ can be compared against a true label returned by the oracle, so the error rate is a measurable quantity.

Definition (inference task). A task is a latent-trait inference when $Y$ is an internal or socially ascribed property (creditworthiness as character, personality, felt emotion, criminality, sexual orientation) for which the only available “ground truth” is itself a human judgement or a downstream proxy, not an independent reading of the trait. There is no oracle that returns the trait directly; what looks like ground truth is a label generated by raters or by an institutional process.

The hazard of the inference case is captured by a single decomposition. Write the trait the vendor claims to predict as $T$, and the label actually used for training and evaluation as $L$. A reported accuracy measures agreement between $\hat Y(X)$ and $L$, never between $\hat Y(X)$ and $T$. These coincide only if $L$ is an unbiased measurement of $T$. When $L$ is a rater impression, a mugshot-versus-headshot data source, or an enforcement outcome, the model can score very well against $L$ while telling us nothing about $T$. Most of this chapter is the working out of that one gap.
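The gap between $L$ and $T$ can be illustrated with a small synthetic simulation. This is a sketch under stated assumptions, not data from any study: the trait is made independent of everything visible, raters assign the label mostly from a presentation cue (a hypothetical `smile` variable), and the "model" has learned that cue perfectly. The 90 percent rater agreement is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Assumption: the trait T is independent of anything visible in the image.
trait = rng.integers(0, 2, size=n)

# A visible presentation cue (for example smiling), also independent of T.
smile = rng.integers(0, 2, size=n)

# Assumption: raters assign the label L from the cue, agreeing with it 90% of the time.
label = np.where(rng.random(n) < 0.9, smile, 1 - smile)

# A "model" that has learned the raters' cue perfectly.
prediction = smile

agreement_with_label = float(np.mean(prediction == label))
agreement_with_trait = float(np.mean(prediction == trait))

print(f"agreement with L: {agreement_with_label:.3f}")
print(f"agreement with T: {agreement_with_trait:.3f}")
```

Under these assumptions the reported accuracy (agreement with $L$) is high while agreement with $T$ sits at chance, which is the situation the decomposition describes.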

235.1.2 1.2 A structural model of where the “signal” comes from

It helps to name the channels through which an image can correlate with a label even when no face-to-trait mechanism exists. Decompose any measured association between $X$ and $L$ into four additive sources:

Genuine signal, a real causal path from the trait to the appearance ($T \to X$). For age this path is strong (ageing changes the face). For criminality it is absent.
Demographic proxy, a common cause $D$ (age, sex, ancestry, socioeconomic status) that influences both the face and the label, inducing $X \leftarrow D \rightarrow L$. This is the path that turns a “trait” score into laundered protected-class discrimination.
Self-presentation, behaviour the subject controls (grooming, expression, makeup, glasses, pose) that responds to context $S$ and correlates with the label, $X \leftarrow S \rightarrow L$. Whether someone smiles or wears a professional headshot is presentation, not physiology.
Acquisition leakage, properties of the capture pipeline $C$ (camera, lighting, image source, compression) that differ systematically between label classes, $X \leftarrow C \rightarrow L$. The mugshot-versus-ID-photo split is pure leakage.

A useful informal accounting is \[ \text{measured association}(X, L) \;=\; \underbrace{a_T}_{\text{genuine } T\to X} \;+\; \underbrace{a_D + a_S + a_C}_{\text{confound and leakage}} . \] The central empirical claim of this chapter is that the three tiers are distinguished by which terms dominate. In Tier 1 the genuine term $a_T$ dominates and the confound terms are either small or controllable. In Tier 2 a small $a_T$ is real but is swamped by $a_D + a_S$. In Tier 3 $a_T \approx 0$ and the entire reported accuracy is $a_D + a_S + a_C$. High accuracy alone cannot tell these cases apart, which is exactly why “the model gets 90 percent” is not an argument.

A complementary way to see the danger is the proxy-discrimination inequality. Suppose a face score $\hat Y$ is used in a decision and a protected attribute $A$ is a common cause of both face and label. Even if $A$ is never an input, the mutual information $I(\hat Y; A)$ can be large whenever the face encodes $A$, so $\hat Y$ can reproduce a disparate impact while appearing attribute-blind. Blinding the model to $A$ does not blind it to a face that encodes $A$. This is the formal reason “we did not use race as a feature” is not a defence under disparate-impact law.
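One way to make the word "inequality" concrete, offered as a sketch rather than as a derivation taken from the sources cited in this chapter, is the data-processing inequality. It assumes the score is computed from the image alone, so that $A \to X \to \hat Y$ forms a Markov chain.

```latex
\[
  0 \;\le\; I(\hat Y; A) \;\le\; I(X; A).
\]
```

Removing $A$ from the feature list changes neither $X$ nor $I(X; A)$, so the ceiling on how much the score can reveal about $A$ is set by the face, not by the list of inputs. The bound is an upper limit only: how close a trained model comes to it depends on whether encoding $A$ lowers the training loss, which it does whenever $A$ predicts the label.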

flowchart LR
    T["Trait T"] --> X["Face image X"]
    D["Demographics D"] --> X
    D --> L["Training label L"]
    S["Self presentation S"] --> X
    S --> L
    C["Acquisition C"] --> X
    C --> L
    T --> L
    X --> P["Model prediction"]

Figure 235.1: Four channels by which a face image can correlate with a label. Only the first is a genuine trait signal.

235.2 2. Tier 1, Legitimate, Defensible Applications

These are well-grounded because the task is a 1:1 match or a bounded estimation with objective ground truth, not an inference about who someone is inside.

Customer onboarding (eKYC) and fraud prevention. The dominant legitimate use is selfie-to-document matching with liveness detection during remote account opening, standard at banks, neobanks, fintechs, crypto exchanges, gig-economy marketplaces, and telecoms (SIM registration). The business case is compliance (KYC/AML) plus fraud-loss reduction, and the credible academic backbone is NIST’s ongoing, public, demographic-stratified biometric evaluations. Vendors report large fraud-catch and manual-review-cost improvements; present these as industry claims, not peer-reviewed effect sizes.

Age verification and estimation. Facial age estimation, a regression on apparent age, distinct from identification, is now a regulator-recognized “highly effective age check” under the UK Online Safety Act. The most transparent vendor publishes a mean absolute error around 1.1 years for ages 13 to 17 and about 2.1 years for 18 to 24, independently evaluated in NIST’s Face Analysis Technology Evaluation. It is defensible precisely because the output is a bounded number with a published error distribution, and because privacy-preserving deployments delete the image immediately and never link it to an identity.

The quality metric here is concrete and worth stating precisely. With predictions $\hat a_i$ and true ages $a_i$, the mean absolute error is $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^n |\hat a_i - a_i|$, a quantity in years that any auditor can reproduce given dates of birth. A threshold check at age $t$ does not use the point estimate directly. Instead it accepts a user only when the estimated age clears the threshold by a safety buffer $b$, that is when $\hat a_i \ge t + b$. The buffer trades the two error types: raising $b$ drives the false-acceptance rate of underage users toward zero at the cost of a higher false-rejection rate of legitimate adults, who then fall back to a document check. Because the error distribution is published, an operator can choose $b$ to hit a target underage-pass rate. This is the signature of a defensible Tier 1 task: a tunable operating point on a curve with measurable axes, rather than an unfalsifiable claim about a person.
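The buffer mechanism can be sketched in a few lines. The example below is illustrative only: the ages are synthetic, the Gaussian error with a 2-year standard deviation is an assumption rather than any vendor's published distribution, and the helper names (`mean_absolute_error`, `error_rates`, `smallest_buffer`) are hypothetical.

```python
import numpy as np

def mean_absolute_error(estimated, true):
    """MAE in years between estimated and true ages."""
    return float(np.mean(np.abs(np.asarray(estimated) - np.asarray(true))))

def error_rates(estimated, true, threshold, buffer):
    """Return (false-acceptance rate of underage users, false-rejection rate of adults)."""
    estimated = np.asarray(estimated)
    true = np.asarray(true)
    accepted = estimated >= threshold + buffer
    underage = true < threshold
    far = float(np.mean(accepted[underage]))    # underage users wrongly accepted
    frr = float(np.mean(~accepted[~underage]))  # adults sent to the document fallback
    return far, frr

def smallest_buffer(estimated, true, threshold, target_far, candidates):
    """Smallest candidate buffer whose underage-pass rate meets the target, or None."""
    for b in sorted(candidates):
        far, _ = error_rates(estimated, true, threshold, b)
        if far <= target_far:
            return float(b)
    return None

# Synthetic population (assumption): ages uniform on [13, 30), Gaussian estimation error.
rng = np.random.default_rng(0)
true_age = rng.uniform(13, 30, size=100_000)
estimated_age = true_age + rng.normal(0.0, 2.0, size=true_age.size)

mae = mean_absolute_error(estimated_age, true_age)
chosen = smallest_buffer(estimated_age, true_age, threshold=18,
                         target_far=0.01, candidates=np.arange(0.0, 10.5, 0.5))
far_no_buffer, frr_no_buffer = error_rates(estimated_age, true_age, 18, 0.0)
far_buffered, frr_buffered = error_rates(estimated_age, true_age, 18, chosen)

print(f"MAE = {mae:.2f} years, chosen buffer = {chosen} years")
print(f"no buffer: FAR = {far_no_buffer:.3f}, FRR = {frr_no_buffer:.3f}")
print(f"buffered:  FAR = {far_buffered:.3f}, FRR = {frr_buffered:.3f}")
```

Every quantity printed here can be recomputed by an auditor who holds the dates of birth, which is the property that separates this task from a latent-trait inference.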

Access control, payment authentication, account recovery, returning-user re-verification. Face unlock, face-based payment confirmation, biometric re-authentication, and re-verifying a returning driver or gig worker are all 1:1 verification against an enrolled template, defensible when consented, tightly governed, and offered with a non-biometric fallback.

Why Tier 1 is safe. The ground truth is objective and checkable, error rates are independently measurable, and nothing is inferred about character. The genuine risks are operational and equity risks (the demographic accuracy gaps documented earlier, spoofing and deepfakes, and exclusion of people the system cannot read), not the epistemic risk of inferring an unmeasurable trait.

235.3 3. Tier 2, Commercially Deployed but Scientifically Contested

Here a real published literature claims predictive signal, products ship, and serious methodological critiques and regulation exist. The job is to separate “a paper reports a correlation” from “this is a valid, deployable, non-discriminatory inference.” Requests such as “credit scoring from face or video” land in this tier, so it deserves the most careful treatment.

235.3.1 3.1 Credit Scoring and Default Prediction from Faces

There is a genuine peer-reviewed thread in top finance and management journals:

Duarte, Siegel & Young (2012), Review of Financial Studies, “Trust and Credit.” Borrowers who appear more trustworthy in peer-to-peer-lending photos are more likely to be funded, get lower rates, and actually default less. Perceived trustworthiness carried some real signal, but this is human raters scoring photos, and the effect is small relative to hard financial variables.
Chen, Liu, Meng & Wang (2023), Management Science, “What’s in a Face?” The most important paper for a balanced view. A machine-learning model can predict repayment from facial images to some degree, but giving human loan officers the photos does not improve their decisions, humans hold biased facial priors and over-weight facial information. The lesson is double-edged: even where a weak algorithmic signal exists, injecting faces into the human decision pipeline degrades judgment and imports bias.
CFO facial-trustworthiness studies find that firms whose executives have more trustworthy-looking faces obtain better loan terms, evidence of an appearance premium to be controlled, not a feature to productize.

Commercially, microlending vendors and patents have promoted “read the applicant’s face to score repayment,” sometimes blended with smartphone/digital-footprint scoring, but note that the digital-footprint signal (behavioral exhaust predicting default) is far better validated and is not facial inference.

Why deploying this is hazardous. Four problems compound. The first three map onto the three confound channels of section 1.2, and the fourth is legal: (1) demographic proxy ($a_D$), facial “signal” for default is confounded with age, gender, race, and socioeconomic markers, so a face score can launder protected-class discrimination into a credit decision; (2) reverse causality / self-presentation ($a_S$), a “trustworthy” photo reflects grooming, income, and access, not bone structure; (3) leakage ($a_C$), the image source (professional headshot vs. webcam vs. mugshot) carries the apparent signal; (4) legality, in the US this collides with fair-lending disparate-impact doctrine (ECOA), and in the EU with GDPR special-category rules and the AI Act’s high-risk classification of creditworthiness AI. The correlation in a research dataset is real but small and dominated by confounds, and deploying it as a credit feature is both ethically and legally dangerous.

Worked example: how a small real signal becomes a discriminatory feature. Consider a stylised lending population split by an unobserved socioeconomic factor that also shapes appearance. Suppose the true default rate is 8 percent in group $A$ and 16 percent in group $B$, and that group membership is itself 75 percent recoverable from the photo (because grooming, attire, and background track socioeconomic status). A face model trained to predict default will discover that the cheapest way to reduce its loss is to infer the group and copy its base rate: it learns the $X \leftarrow D \rightarrow L$ path, not any face-to-character path. The model can report a respectable AUC, yet decompose its score and you find almost all of the lift comes from separating $A$ from $B$, which is precisely the protected-class proxy a fair-lending audit forbids. Now suppose the genuinely face-intrinsic signal $a_T$ is real but tiny, lifting AUC from 0.50 to perhaps 0.52 once group is held fixed. The deployable conclusion is stark: the useful part of the score is the part you are not allowed to use, and the part you are allowed to use is too small to matter. A correct audit therefore does not ask “what is the AUC”; it asks “what is the residual AUC after conditioning on protected attributes”, and for face-based credit scoring that residual is close to noise.
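The worked example can be reproduced as a small simulation. This is a sketch under stated assumptions: the two groups are equal in size, "75 percent recoverable" is read as the photo revealing the correct group 75 percent of the time, and the face-intrinsic signal $a_T$ is set to exactly zero so that everything the score learns is proxy. The `auc` helper is a plain rank-based implementation written for this illustration.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney), with tied scores sharing their average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                  # highest 1-based rank in each tie group
    average_rank = last_rank - (counts - 1) / 2.0  # mean rank within each tie group
    ranks = average_rank[inverse]
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    rank_sum = ranks[labels == 1].sum()
    return float((rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))

rng = np.random.default_rng(0)
n = 200_000

# Group membership (0 = group A, 1 = group B) and the default rates from the text.
group_b = rng.integers(0, 2, size=n)
default = (rng.random(n) < np.where(group_b == 1, 0.16, 0.08)).astype(int)

# Assumption: the photo reveals the correct group 75% of the time.
inferred_b = np.where(rng.random(n) < 0.75, group_b, 1 - group_b)

# The "face model" simply copies the base rate of the inferred group (a_T = 0).
score = np.where(inferred_b == 1, 0.16, 0.08)

overall_auc = auc(score, default)
auc_within_a = auc(score[group_b == 0], default[group_b == 0])
auc_within_b = auc(score[group_b == 1], default[group_b == 1])

print(f"overall AUC:        {overall_auc:.3f}")
print(f"AUC within group A: {auc_within_a:.3f}")
print(f"AUC within group B: {auc_within_b:.3f}")
```

With these inputs the overall AUC sits above chance while both within-group AUCs fall back to roughly 0.5, so all of the lift comes from separating the groups. Conditioning on group is the "residual AUC" audit described above.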

235.3.2 3.2 Automated Video Interviewing and Hireability Inference

The product story is cautionary. A pioneer of AI-scored asynchronous video interviews included automated facial-expression analysis to infer traits, then, after an FTC complaint, ACLU criticism, and scrutiny under the Illinois Artificial Intelligence Video Interview Act, publicly dropped facial analysis in 2021, stating it “no longer significantly added value” relative to language analysis. Regulation has since tightened (Illinois AIVIA, NYC Local Law 144’s bias-audit mandate, EEOC guidance under Title VII/ADA).

The academic evidence is genuinely mixed and label-dependent. Hickman et al. (2022, Journal of Applied Psychology) trained models on ~1,073 video interviews to predict Big Five personality: models trained on observer-rated personality explained on average R² ≈ 0.16, but models trained on self-reported personality explained essentially nothing (R² ≈ 0.01). The algorithm partly learns to reproduce raters’ impressions, not the construct itself. The defensible residue is structured, content/verbal scoring with bias audits; facial-expression-to-hireability inference is the part the market itself retreated from.

235.3.3 3.3 Affect / Emotion Recognition

The market is large and real: ad-testing, market research, call-center voice analytics, and automotive driver-state monitoring. But the foundational scientific critique is devastating. Barrett et al. (2019), Psychological Science in the Public Interest, “Emotional Expressions Reconsidered,” is a ~60-page review concluding that the assumed one-to-one mapping from facial configurations to internal emotional states is not supported: how people move their faces for a given emotion varies widely within a person, across contexts, and across cultures. The inference “this face = this felt emotion” is therefore scientifically unreliable. A useful nuance: driver drowsiness/distraction monitoring is more defensible than “customer emotion,” because it targets observable physiological states (eyelid closure, gaze) with a safety rationale, and the EU AI Act carves out exactly such safety uses while banning workplace emotion inference.

235.4 4. Tier 3, Scientifically Discredited and Banned

These claim to read inner character or protected status from facial structure. This is physiognomy, the long-discredited pseudoscience (Lavater, Lombroso) that fed scientific racism, re-skinned with deep learning.

Wu & Zhang (2016), “Automated Inference on Criminality Using Face Images,” claimed classifiers distinguish “criminal” from “non-criminal” faces. The fatal flaw: the “non-criminal” images were professional/ID photos (often smiling) while “criminal” images were government mugshots. The model learned expression and photo-source artifacts, not criminality, and “criminal” is a socially constructed, enforcement-biased label with no causal link to face geometry.
Wang & Kosinski (2018), “Detecting Sexual Orientation From Facial Images,” reported high AUC distinguishing gay from straight in dating-profile photos. Critiques (notably Agüera y Arcas and colleagues) showed the signal comes overwhelmingly from self-presentation and grooming (makeup, facial hair, glasses, camera angle), not innate structure, demolishing the authors’ prenatal-hormone story. The classifier exposes grooming norms and stereotypes, not biology.
Kosinski (2021), “Facial Recognition Technology Can Expose Political Orientation,” drew the same family of objections: self-presentation, demographic and regional confounds, and an unfounded leap from correlation to essence.
The umbrella critique, “Physiognomy’s New Clothes” (Agüera y Arcas, Mitchell & Todorov, 2017), is the definitive accessible takedown: these systems revive the exact logic historically used to justify discrimination, and their apparent accuracy reflects confounds, not any real face-to-character mapping.

In one line: there is no validated causal mechanism linking facial morphology to criminality, sexuality, or politics; the “accuracy” is real pattern-matching on confounds, and high AUC on a biased dataset is not evidence of a true relationship.

In the language of section 1.2, every Tier 3 result has $a_T \approx 0$ and a reported score equal to $a_S + a_C$ (plus $a_D$). The test that exposes this is not internal cross-validation, which preserves the confound, but invariance under a presentation or acquisition swap: re-photograph the same people under the other group’s conditions (same camera, same expression, same crop) and watch the accuracy collapse toward chance. A genuine $T \to X$ signal survives that swap; a confound does not. This is the single experiment every Tier 3 claim fails and every Tier 1 task passes, and it is far more informative than any headline accuracy number.
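The swap test can be demonstrated on synthetic data. The sketch below assumes a two-feature "image" in which one feature is driven entirely by capture conditions and the other is pure noise (so $a_T = 0$), and it uses a nearest-centroid classifier as a stand-in for whatever model a vendor might train. All names and numeric settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def photograph(acquisition):
    """Synthetic image features: one driven by capture conditions, one pure noise."""
    capture_feature = 3.0 * acquisition + rng.normal(size=acquisition.size)
    face_feature = rng.normal(size=acquisition.size)  # carries no label information
    return np.column_stack([capture_feature, face_feature])

def fit_centroids(features, labels):
    """Class means for a nearest-centroid classifier."""
    return np.stack([features[labels == 0].mean(axis=0),
                     features[labels == 1].mean(axis=0)])

def predict(centroids, features):
    """Assign each row to the class with the nearest centroid."""
    distances = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

n = 20_000
train_labels = rng.integers(0, 2, size=n)
test_labels = rng.integers(0, 2, size=n)

# Biased collection: capture conditions coincide with the label (pure leakage).
centroids = fit_centroids(photograph(train_labels.astype(float)), train_labels)

# Internal validation draws from the same biased collection.
internal_features = photograph(test_labels.astype(float))
internal_accuracy = float(np.mean(predict(centroids, internal_features) == test_labels))

# Swap: everyone is re-photographed under the class-0 capture conditions.
swapped_features = photograph(np.zeros(n))
swapped_accuracy = float(np.mean(predict(centroids, swapped_features) == test_labels))

print(f"internal accuracy: {internal_accuracy:.3f}")
print(f"accuracy after acquisition swap: {swapped_accuracy:.3f}")
```

Held-out accuracy from the same collection stays high because the held-out set shares the leak. Once the capture conditions are equalised, accuracy falls to chance.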

Regulatory bans. The EU AI Act (Article 5, prohibitions effective February 2025) bans social scoring; emotion recognition in workplaces and education (narrow safety/medical carve-outs); biometric categorisation inferring race, political opinions, religion, or sexual orientation; and untargeted facial-image scraping. Tier-3 use cases map almost exactly onto these prohibitions, while creditworthiness and employment AI (Tier 2) are separately classified high-risk, permitted but heavily constrained.

235.5 5. How to Demo Responsibly in a Real Business Case

The practical need, compelling demos for real business cases, has a clear, safe answer: lean entirely on Tier 1.

Before committing to any face-based feature, the following decision procedure separates the buildable from the indefensible. It operationalises the verification / inference axis and the invariance test into questions a product team can actually answer.

flowchart TD
    Q1["Is there an oracle that returns the true label independently of the image?"]
    Q1 -- "No" --> STOP["Latent-trait inference. Treat as cautionary analysis only."]
    Q1 -- "Yes" --> Q2["Does accuracy survive a presentation and acquisition swap?"]
    Q2 -- "No" --> STOP
    Q2 -- "Yes" --> Q3["Is residual signal small after conditioning on protected attributes?"]
    Q3 -- "Yes, signal is mostly proxy" --> STOP
    Q3 -- "No, genuine residual signal" --> BUILD["Buildable Tier 1 task. Verify, audit, deploy with fallback."]

Figure 235.2: A go / no-go screen for any proposed face-based feature.

The three gates correspond exactly to the chapter’s machinery: gate one is the verification / inference definition, gate two is the invariance test that kills confounds, and gate three is the proxy-discrimination check that prevents laundering protected attributes. A feature that clears all three is a verification task with a real and lawful signal. A feature that fails any one of them belongs in analysis, not in a product.
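The three gates can be encoded as a small checklist function that a product team fills in per feature. This is a minimal sketch. The class name, field names, and return strings are hypothetical, and the answers themselves still have to come from real audits such as the swap and residual-signal tests.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureAssessment:
    """Answers to the three gates for one proposed face-based feature."""
    has_independent_oracle: bool      # gate 1: verification vs inference
    survives_swap: bool               # gate 2: presentation / acquisition invariance
    residual_signal_is_genuine: bool  # gate 3: signal remains after conditioning

def screen(assessment: FeatureAssessment) -> str:
    """Apply the gates in order and return the first failure, or a build decision."""
    if not assessment.has_independent_oracle:
        return "stop: latent-trait inference"
    if not assessment.survives_swap:
        return "stop: confound or leakage"
    if not assessment.residual_signal_is_genuine:
        return "stop: protected-attribute proxy"
    return "build: verify, audit, deploy with fallback"

# Illustrative assessments (hypothetical answers, for demonstration only).
selfie_to_id_match = FeatureAssessment(True, True, True)
credit_from_face = FeatureAssessment(False, False, False)

print(screen(selfie_to_id_match))
print(screen(credit_from_face))
```

Ordering the checks this way means a feature with no oracle is rejected before any modelling effort is spent on the later gates.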

Best demo, identity verification with deepfake defense. A live selfie-to-ID match with liveness/PAD, ideally showing a deepfake spoof being caught, framed around fraud-loss reduction and KYC/AML compliance. It has objective ground truth, independent (NIST) evaluation, and no character inference. This is the demo that wins enterprise trust.
Strong second, privacy-preserving age estimation. Show the published mean-absolute-error on screen, estimate an age, and delete the image immediately. A concrete, regulator-aligned use case (UK Online Safety Act) with honest error bars.
Other safe demos, payment authentication, account-recovery re-verification, returning-user matching, and document-fraud detection from the Document AI chapter.

Present Tier 2 only as analysis, never as a live product pitch. If credit-from-face or video-interview scoring must be shown, show it as a cautionary case: demonstrate the confound directly, for instance, that a “risk” model’s output flips when a mugshot-style photo is swapped for a smiling headshot, or that interview scores track rater impressions rather than job performance. Pair every Tier 2 example with its critique (Chen et al. on humans over-weighting faces; Hickman et al. on label dependence; Barrett et al. on emotion), and note that workplace emotion recognition is prohibited in the EU.

Never demo Tier 3 as if it works. Use the criminality, sexuality, and political-orientation papers only as worked examples of how confounds, leakage, biased labels, and reverse causality manufacture spurious accuracy, a teaching device for “why high accuracy ≠ a real relationship,” with the EU AI Act bans flagged explicitly.

235.5.1 5.1 Pitfalls checklist

Even within Tier 1, the following failure modes recur often enough to be worth naming explicitly.

Reading internal cross-validation as proof of a real signal. Cross-validation that draws train and test from the same biased collection preserves every confound. Only an out-of-distribution or invariance test is diagnostic.
Reporting a single aggregate accuracy. A face system can hit 99 percent overall while failing badly on a demographic subgroup. The classic demonstration is Buolamwini and Gebru’s Gender Shades, where commercial gender classifiers that looked strong in aggregate had error rates up to about 34 percent on darker-skinned women. Always report error stratified by the attributes documented in the bias-and-fairness literature, the way NIST does.
Treating “we did not use protected attributes as features” as a fairness guarantee. As the proxy-discrimination inequality shows, a face encodes those attributes, so blinding the inputs does not blind the model.
Confusing a regulator-recognised category with approval of a product. Age estimation being a recognised method does not mean a given vendor’s error distribution meets the threshold; the published MAE and operating point still have to be checked.
Shipping without a non-biometric fallback. Any face system excludes some users it cannot read; an accessible alternative path is both an equity requirement and, increasingly, a legal one.
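The aggregate-accuracy pitfall in the list above is easy to guard against in code. The following sketch reports error per subgroup alongside the overall figure. The function name and the toy data are hypothetical and are not taken from Gender Shades or NIST.

```python
from collections import defaultdict

def error_rates_by_group(y_true, y_pred, groups):
    """Error rate per subgroup plus the overall rate, as a dict."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        errors[group] += int(truth != prediction)
    report = {group: errors[group] / totals[group] for group in totals}
    report["overall"] = sum(errors.values()) / sum(totals.values())
    return report

# Toy data: 90 samples in group "A" with no errors, 10 in group "B" with 4 errors.
y_true = [1] * 100
y_pred = [1] * 90 + [0] * 4 + [1] * 6
groups = ["A"] * 90 + ["B"] * 10

report = error_rates_by_group(y_true, y_pred, groups)
print(report)
```

In this toy case the overall error is 4 percent while group B fails 40 percent of the time, which is the pattern a single headline number hides.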

235.6 6. Conclusion

The commercial value of AI on faces and video is real but lives almost entirely in verification, confirming a checkable claim of identity or age, not in inference of latent traits. The verification-versus-inference axis predicts both the science and the law: Tier 1 verifies an objective fact and is defensible; Tier 3 infers an unobservable essence and is pseudoscience the EU now bans; Tier 2 is the contested middle, where weak real correlations exist but are dominated by confounds and constrained by fair-lending and high-risk-AI regulation. For a practitioner building a demo or a product, the discipline is simple to state and hard to hold: if the question has an objective, checkable answer, you can build on it; if it requires reading character from a face, the accuracy is an artifact and the deployment is a liability. That principle, more than any model, is the takeaway of this cluster.

235.7 References

Duarte, J., Siegel, S., Young, L. “Trust and Credit: The Role of Appearance in Peer-to-Peer Lending.” Review of Financial Studies, 2012. https://academic.oup.com/rfs/article-abstract/25/8/2455/1570804
Chen, Z., Liu, B., Meng, Y., Wang, Z. “What’s in a Face? An Experiment on Facial Information and Loan-Approval Decision.” Management Science, 2023. https://pubsonline.informs.org/doi/10.1287/mnsc.2022.4436
Hickman, L. et al. “Automated Video Interview Personality Assessments: Reliability, Validity, and Generalizability.” Journal of Applied Psychology, 2022. https://pubmed.ncbi.nlm.nih.gov/34110849/
Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A., Pollak, S. “Emotional Expressions Reconsidered.” Psychological Science in the Public Interest, 2019. https://journals.sagepub.com/doi/10.1177/1529100619832930
Wu, X., Zhang, X. “Automated Inference on Criminality Using Face Images.” 2016. https://arxiv.org/abs/1611.04135
Wang, Y., Kosinski, M. “Deep Neural Networks Are More Accurate Than Humans at Detecting Sexual Orientation From Facial Images.” J. Personality and Social Psychology, 2018. https://www.gsb.stanford.edu/faculty-research/publications/deep-neural-networks-are-more-accurate-humans-detecting-sexual
Agüera y Arcas, B., Mitchell, M., Todorov, A. “Physiognomy’s New Clothes.” Medium, 2017.
Yoti. “Facial Age Estimation” (accuracy / NIST FATE evaluation). https://www.yoti.com/business/age-verification/
EU AI Act, Article 5 (prohibited practices). https://artificialintelligenceact.eu/article/5/
NIST. Face Analysis Technology Evaluation (FATE), Age Estimation. https://pages.nist.gov/frvt/html/frvt_age_estimation.html
Buolamwini, J., Gebru, T. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of Machine Learning Research (FAT* 2018), 81:77-91. https://proceedings.mlr.press/v81/buolamwini18a.html

# Business Applications of Visual Inference: From eKYC to Physiognomy ## 1. Introduction The previous chapters built systems that *verify a claimed identity*. This chapter widens the lens to the broader commercial question that practitioners and executives actually ask: **what business value can AI extract from a face, a voice, or a video, and where does that value turn into liability, pseudoscience, or illegality?** The organizing idea is a single distinction that cuts through the entire field: > **Verification asks a checkable question** ("is this the person in the document?" "is this person plausibly over 18?"). **Inference asks an unanswerable one** ("is this person creditworthy, employable, honest, or gay, from their face?"). As applications move from the first kind to the second, the "signal" the model finds increasingly turns out to be *self-presentation, demographic proxy, or dataset artifact* rather than the trait claimed. This chapter organizes the landscape into three tiers, legitimate, contested, and discredited, with concrete business and academic references for each, and closes with practical guidance on which use cases are safe and compelling to demonstrate in a real business setting, which is exactly what a responsible practitioner needs before promising a client a "face-based" product. ### 1.1 A precise definition of the verification / inference axis The intuitive split above can be made formal, and the formalization is what makes the rest of the chapter rigorous rather than rhetorical. Let $X$ denote a facial image (or short video), and let $Y$ denote the target a vendor wants to predict. **Definition (verification task).** A task is a *verification* when $Y$ is a function of an externally recorded fact $f$, and a ground-truth oracle exists that returns $Y$ independently of the image. Identity matching has $Y = \mathbb{1}[\text{same person as enrolled template}]$, with the enrollment record as oracle. 
Age verification has $Y = \mathbb{1}[\text{age} \ge t]$, with a birth date as oracle. The defining property is **falsifiability**: for any prediction $\hat Y(X)$ there is an instance whose true label can be checked, so an error rate is a measurable quantity. **Definition (inference task).** A task is a *latent-trait inference* when $Y$ is an internal or socially ascribed property (creditworthiness as character, personality, felt emotion, criminality, sexual orientation) for which the only available "ground truth" is itself a human judgement or a downstream proxy, not an independent reading of the trait. There is no oracle that returns the trait directly; what looks like ground truth is a *label* generated by raters or by an institutional process. The hazard of the inference case is captured by a single decomposition. Write the trait the vendor *claims* to predict as $T$, and the label actually used for training and evaluation as $L$. A reported accuracy measures agreement between $\hat Y(X)$ and $L$, never between $\hat Y(X)$ and $T$. These coincide only if $L$ is an unbiased measurement of $T$. When $L$ is a rater impression, a mugshot-versus-headshot data source, or an enforcement outcome, the model can score very well against $L$ while telling us nothing about $T$. Most of this chapter is the working out of that one gap. ### 1.2 A structural model of where the "signal" comes from It helps to name the channels through which an image can correlate with a label even when no face-to-trait mechanism exists. Decompose any measured association between $X$ and $L$ into four additive sources: 1. **Genuine signal**, a real causal path from the trait to the appearance ($T \to X$). For age this path is strong (ageing changes the face). For criminality it is absent. 2. **Demographic proxy**, a common cause $D$ (age, sex, ancestry, socioeconomic status) that influences both the face and the label, inducing $X \leftarrow D \rightarrow L$. 
This is the path that turns a "trait" score into laundered protected-class discrimination.
3. **Self-presentation**, behaviour the subject controls (grooming, expression, makeup, glasses, pose) that responds to context $S$ and correlates with the label, $X \leftarrow S \rightarrow L$. Whether someone smiles or wears a professional headshot is presentation, not physiology.
4. **Acquisition leakage**, properties of the capture pipeline $C$ (camera, lighting, image source, compression) that differ systematically between label classes, $X \leftarrow C \rightarrow L$. The mugshot-versus-ID-photo split is pure leakage.

A useful informal accounting is

$$
\text{measured association}(X, L) \;=\; \underbrace{a_T}_{\text{genuine } T\to X} \;+\; \underbrace{a_D + a_S + a_C}_{\text{confound and leakage}} .
$$

The central empirical claim of this chapter is that the three tiers are distinguished by which terms dominate. In Tier 1 the genuine term $a_T$ dominates and the confound terms are either small or controllable. In Tier 2 a small $a_T$ is real but is swamped by $a_D + a_S$. In Tier 3 $a_T \approx 0$ and the entire reported accuracy is $a_D + a_S + a_C$. High accuracy alone cannot tell these cases apart, which is exactly why "the model gets 90 percent" is not an argument.

A complementary way to see the danger is the *proxy-discrimination inequality*. Suppose a face score $\hat Y$ is used in a decision and a protected attribute $A$ is a common cause of both face and label. Even if $A$ is never an input, the mutual information $I(\hat Y; A)$ can be large whenever the face encodes $A$, so $\hat Y$ can reproduce a disparate impact while appearing attribute-blind. Blinding the model to $A$ does not blind it to a face that *encodes* $A$. This is the formal reason "we did not use race as a feature" is not a defence under disparate-impact law.

```{mermaid}
%%| label: fig-signal-channels
%%| fig-cap: "Four channels by which a face image can correlate with a label. Only the first is a genuine trait signal."
flowchart LR
  T["Trait T"] --> X["Face image X"]
  D["Demographics D"] --> X
  D --> L["Training label L"]
  S["Self presentation S"] --> X
  S --> L
  C["Acquisition C"] --> X
  C --> L
  T --> L
  X --> P["Model prediction"]
```

## 2. Tier 1, Legitimate, Defensible Applications

These are well-grounded because the task is a **1:1 match** or a **bounded estimation with objective ground truth**, not an inference about who someone *is* inside.

**Customer onboarding (eKYC) and fraud prevention.** The dominant legitimate use is selfie-to-document matching with liveness detection during remote account opening, standard at banks, neobanks, fintechs, crypto exchanges, gig-economy marketplaces, and telecoms (SIM registration). The business case is compliance (KYC/AML) plus fraud-loss reduction, and the credible academic backbone is NIST's ongoing, public, demographic-stratified biometric evaluations. Vendors report large fraud-catch and manual-review-cost improvements; present these as *industry claims*, not peer-reviewed effect sizes.

**Age verification and estimation.** Facial age *estimation*, a regression on apparent age, distinct from identification, is now a regulator-recognized "highly effective age check" under the UK Online Safety Act. The most transparent vendor publishes a mean absolute error around 1.1 years for ages 13 to 17 and about 2.1 years for 18 to 24, independently evaluated in NIST's Face Analysis Technology Evaluation. It is defensible precisely *because* the output is a bounded number with a published error distribution, and because privacy-preserving deployments delete the image immediately and never link it to an identity.

The quality metric here is concrete and worth stating precisely. With predictions $\hat a_i$ and true ages $a_i$, the mean absolute error is $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^n |\hat a_i - a_i|$, a quantity in *years* that any auditor can reproduce given dates of birth.
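
As a minimal sketch of how an auditor might reproduce this figure, the Python below computes the overall MAE and the MAE within true-age bands, mirroring the per-band reporting quoted above. The function names, the band boundaries passed in, and the six records are illustrative assumptions, not any vendor's data or API.

```python
def mean_absolute_error(predicted, actual):
    """MAE in years: the average of |a_hat_i - a_i| over all records."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mae_by_band(predicted, actual, bands):
    """MAE restricted to each inclusive (low, high) band of TRUE age."""
    result = {}
    for low, high in bands:
        pairs = [(p, a) for p, a in zip(predicted, actual) if low <= a <= high]
        if pairs:  # skip bands with no records rather than divide by zero
            preds, trues = zip(*pairs)
            result[(low, high)] = mean_absolute_error(preds, trues)
    return result


# Made-up records: true ages would come from dates of birth (the oracle).
true_ages = [14, 16, 17, 19, 22, 24]
estimated = [15.0, 15.5, 18.5, 21.0, 20.0, 27.0]

overall = mean_absolute_error(estimated, true_ages)
per_band = mae_by_band(estimated, true_ages, [(13, 17), (18, 24)])

print(f"overall MAE: {overall:.2f} years")
for band, value in per_band.items():
    print(f"MAE for true ages {band[0]} to {band[1]}: {value:.2f} years")
```

With these made-up records the overall MAE is about 1.67 years, 1.00 for the 13 to 17 band and about 2.33 for the 18 to 24 band. The point of the per-band view is that a single pooled number can hide a band where the estimator is noticeably worse.
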
A threshold check at age $t$ does not use the point estimate directly. Instead it accepts a user only when the estimated age clears the threshold by a safety **buffer** $b$, that is when $\hat a_i \ge t + b$. The buffer trades the two error types: raising $b$ drives the false-acceptance rate of underage users toward zero at the cost of a higher false-rejection rate of legitimate adults, who then fall back to a document check. Because the error distribution is published, an operator can *choose* $b$ to hit a target underage-pass rate. This is the signature of a defensible Tier 1 task: a tunable operating point on a curve with measurable axes, rather than an unfalsifiable claim about a person.

**Access control, payment authentication, account recovery, returning-user re-verification.** Face unlock, face-based payment confirmation, biometric re-authentication, and re-verifying a returning driver or gig worker are all 1:1 verification against an enrolled template, defensible when consented, tightly governed, and offered with a non-biometric fallback.

**Why Tier 1 is safe.** The ground truth is objective and checkable, error rates are independently measurable, and nothing is inferred about character. The genuine risks are *operational and equity* risks (the demographic accuracy gaps documented earlier, spoofing and deepfakes, and exclusion of people the system cannot read), not the *epistemic* risk of inferring an unmeasurable trait.

## 3. Tier 2, Commercially Deployed but Scientifically Contested

Here a real published literature claims predictive signal, products ship, *and* serious methodological critiques and regulation exist. The job is to separate "a paper reports a correlation" from "this is a valid, deployable, non-discriminatory inference." This is the tier in which "credit scoring from face or video" lands, so it deserves the most careful treatment.


### 3.1 Credit Scoring and Default Prediction from Faces

There is a genuine peer-reviewed thread in top finance and management journals:

- **Duarte, Siegel & Young (2012),** *Review of Financial Studies*, "Trust and Credit." Borrowers who *appear* more trustworthy in peer-to-peer-lending photos are more likely to be funded, get lower rates, **and** actually default less. Perceived trustworthiness carried *some* real signal, but this is human raters scoring photos, and the effect is small relative to hard financial variables.
- **Chen, Liu, Meng & Wang (2023),** *Management Science*, "What's in a Face?" The most important paper for a balanced view. A machine-learning model *can* predict repayment from facial images to some degree, **but** giving human loan officers the photos does *not* improve their decisions: humans hold biased facial priors and *over-weight* facial information. The lesson is double-edged: even where a weak algorithmic signal exists, injecting faces into the human decision pipeline degrades judgment and imports bias.
- **CFO facial-trustworthiness studies** find that firms whose executives have more trustworthy-*looking* faces obtain better loan terms, evidence of an *appearance premium* to be controlled, not a feature to productize.

Commercially, microlending vendors and patents have promoted "read the applicant's face to score repayment," sometimes blended with smartphone/digital-footprint scoring. Note, however, that the digital-footprint signal (behavioral exhaust predicting default) is far better validated and is *not* facial inference.


**Why deploying this is hazardous.** Four problems compound. The first three are the three confound channels of section 1.2, and the fourth is legal: (1) **demographic proxy** ($a_D$), facial "signal" for default is confounded with age, gender, race, and socioeconomic markers, so a face score can launder protected-class discrimination into a credit decision; (2) **reverse causality / self-presentation** ($a_S$), a "trustworthy" photo reflects grooming, income, and access, not bone structure; (3) **leakage** ($a_C$), the image *source* (professional headshot vs. webcam vs. mugshot) carries the apparent signal; (4) **legality**, in the US this collides with fair-lending disparate-impact doctrine (ECOA), and in the EU with GDPR special-category rules and the AI Act's high-risk classification of creditworthiness AI. The correlation in a research dataset is real but small and dominated by confounds, and deploying it as a credit feature is both ethically and legally dangerous.

**Worked example: how a small real signal becomes a discriminatory feature.** Consider a stylised lending population split by an unobserved socioeconomic factor that also shapes appearance. Suppose the true default rate is 8 percent in group $A$ and 16 percent in group $B$, and that group membership is itself 75 percent recoverable from the photo (because grooming, attire, and background track socioeconomic status). A face model trained to predict default will discover that the cheapest way to reduce its loss is to *infer the group and copy its base rate*: it learns the $X \leftarrow D \rightarrow L$ path, not any face-to-character path. The model can report a respectable AUC, yet decompose its score and you find almost all of the lift comes from separating $A$ from $B$, which is precisely the protected-class proxy a fair-lending audit forbids. Now suppose the genuinely face-intrinsic signal $a_T$ is real but tiny, lifting AUC from 0.50 to perhaps 0.52 once group is held fixed.
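
The proxy mechanism in this example can be checked with a small simulation. The sketch below is stdlib-only Python under stated assumptions: the two groups are equally sized, "75 percent recoverable" is read as a photo cue that matches the true group 75 percent of the time, the face-intrinsic term $a_T$ is set to zero, and all names (`simulate`, `auc`, `DEFAULT_RATE`, `CUE_ACCURACY`) are hypothetical. It scores each applicant with the base rate of the group the photo suggests, then compares the pooled AUC with the AUC inside each group.

```python
import random

# Stylised numbers from the worked example; everything else is an assumption.
DEFAULT_RATE = {"A": 0.08, "B": 0.16}
CUE_ACCURACY = 0.75  # assumed reading of "75 percent recoverable from the photo"


def auc(scores, labels):
    """Chance a random defaulter outscores a random non-defaulter (ties count half)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one defaulter and one non-defaulter")
    rank_sum, i = 0.0, 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    # Mann-Whitney U statistic divided by the number of (pos, neg) pairs.
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def simulate(n=40000, seed=0):
    """Return (group, score, defaulted) rows for a proxy-only face score."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = "A" if rng.random() < 0.5 else "B"
        defaulted = 1 if rng.random() < DEFAULT_RATE[group] else 0
        cue_correct = rng.random() < CUE_ACCURACY
        seen_group = group if cue_correct else ("B" if group == "A" else "A")
        # The "model" infers the group from the photo and copies its base rate.
        rows.append((group, DEFAULT_RATE[seen_group], defaulted))
    return rows


rows = simulate()
pooled_auc = auc([s for _, s, _ in rows], [d for _, _, d in rows])
within_auc = {
    g: auc([s for grp, s, _ in rows if grp == g],
           [d for grp, _, d in rows if grp == g])
    for g in DEFAULT_RATE
}

print(f"pooled AUC: {pooled_auc:.3f}")
for g, value in within_auc.items():
    print(f"AUC within group {g}: {value:.3f}")
```

Under these assumptions the pooled AUC sits a few points above 0.5 (analytically about 0.55) while each within-group AUC stays near 0.5, so the entire lift is group separation. The lift is modest with these particular numbers and grows as the photo cue becomes more reliable or the base rates diverge; exact printed values depend on the seed and sample size.
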
The deployable conclusion is stark: the *useful* part of the score is the part you are not allowed to use, and the part you are allowed to use is too small to matter. A correct audit therefore does not ask "what is the AUC"; it asks "what is the residual AUC after conditioning on protected attributes", and for face-based credit scoring that residual is close to noise.

### 3.2 Automated Video Interviewing and Hireability Inference

The product story is cautionary. A pioneer of AI-scored asynchronous video interviews included automated **facial-expression analysis** to infer traits, then, after an FTC complaint, ACLU criticism, and scrutiny under the Illinois Artificial Intelligence Video Interview Act, **publicly dropped facial analysis in 2021**, stating it "no longer significantly added value" relative to language analysis. Regulation has since tightened (Illinois AIVIA, NYC Local Law 144's bias-audit mandate, EEOC guidance under Title VII/ADA).

The academic evidence is genuinely mixed and *label-dependent*. Hickman et al. (2022, *Journal of Applied Psychology*) trained models on ~1,073 video interviews to predict Big Five personality: models trained on **observer-rated** personality explained on average R² ≈ 0.16, but models trained on **self-reported** personality explained essentially nothing (R² ≈ 0.01). The algorithm partly learns to reproduce *raters' impressions*, not the construct itself. The defensible residue is structured, content/verbal scoring with bias audits; facial-expression-to-hireability inference is the part the market itself retreated from.

### 3.3 Affect / Emotion Recognition

The market is large and real: ad-testing, market research, call-center voice analytics, and automotive driver-state monitoring. But the foundational scientific critique is devastating: **Barrett et al. (2019),** *Psychological Science in the Public Interest*, "Emotional Expressions Reconsidered," a ~60-page review concluding that the assumed one-to-one mapping from facial configurations to internal emotional states is **not supported**: how people move their faces for a given emotion varies widely within a person, across contexts, and across cultures. The inference "this face = this felt emotion" is therefore scientifically unreliable.

A useful nuance: **driver drowsiness/distraction monitoring is more defensible** than "customer emotion," because it targets observable physiological states (eyelid closure, gaze) with a safety rationale, and the EU AI Act carves out exactly such safety uses while banning workplace emotion inference.

## 4. Tier 3, Scientifically Discredited and Banned

These claim to read **inner character or protected status from facial structure**. This is **physiognomy**, the long-discredited pseudoscience (Lavater, Lombroso) that fed scientific racism, re-skinned with deep learning.

- **Wu & Zhang (2016),** "Automated Inference on Criminality Using Face Images," claimed classifiers distinguish "criminal" from "non-criminal" faces. The fatal flaw: the "non-criminal" images were professional/ID photos (often smiling) while "criminal" images were government mugshots. The model learned **expression and photo-source artifacts**, not criminality, and "criminal" is a socially constructed, enforcement-biased label with no causal link to face geometry.
- **Wang & Kosinski (2018),** "Detecting Sexual Orientation From Facial Images," reported high AUC distinguishing gay from straight in dating-profile photos. Critiques (notably Agüera y Arcas and colleagues) showed the signal comes overwhelmingly from **self-presentation and grooming** (makeup, facial hair, glasses, camera angle), not innate structure, demolishing the authors' prenatal-hormone story. The classifier exposes *grooming norms and stereotypes*, not biology.
- **Kosinski (2021),** "Facial Recognition Technology Can Expose Political Orientation," drew the same family of objections: self-presentation, demographic and regional confounds, and an unfounded leap from correlation to essence.
- **The umbrella critique, "Physiognomy's New Clothes"** (Agüera y Arcas, Mitchell & Todorov, 2017), is the definitive accessible takedown: these systems revive the exact logic historically used to justify discrimination, and their apparent accuracy reflects confounds, not any real face-to-character mapping.

**In one line:** there is no validated causal mechanism linking facial morphology to criminality, sexuality, or politics; the "accuracy" is real *pattern-matching on confounds*, and high AUC on a biased dataset is not evidence of a true relationship.

In the language of section 1.2, every Tier 3 result has $a_T \approx 0$ and a reported score equal to $a_S + a_C$ (plus $a_D$). The test that exposes this is not internal cross-validation, which preserves the confound, but **invariance under a presentation or acquisition swap**: re-photograph the same people under the *other* group's conditions (same camera, same expression, same crop) and watch the accuracy collapse toward chance. A genuine $T \to X$ signal survives that swap; a confound does not. This is the single experiment every Tier 3 claim fails and every Tier 1 task passes, and it is far more informative than any headline accuracy number.

**Regulatory bans.** The EU AI Act (Article 5, prohibitions effective February 2025) bans social scoring; emotion recognition in workplaces and education (with narrow safety and medical carve-outs); biometric *categorisation* inferring race, political opinions, religion, or sexual orientation; and untargeted facial-image scraping. Tier-3 use cases map almost exactly onto these prohibitions, while creditworthiness and employment AI (Tier 2) are separately classified *high-risk*, permitted but heavily constrained.

## 5. How to Demo Responsibly in a Real Business Case

The practical need behind this chapter, compelling demos for real business cases, has a clear, safe answer: **lean entirely on Tier 1.**

Before committing to any face-based feature, the following decision procedure separates the buildable from the indefensible. It operationalises the verification / inference axis and the invariance test into questions a product team can actually answer.

```{mermaid}
%%| label: fig-decision
%%| fig-cap: "A go / no-go screen for any proposed face-based feature."
flowchart TD
  Q1["Is there an oracle that returns the true label independently of the image?"]
  Q1 -- "No" --> STOP["Latent-trait inference. Treat as cautionary analysis only."]
  Q1 -- "Yes" --> Q2["Does accuracy survive a presentation and acquisition swap?"]
  Q2 -- "No" --> STOP
  Q2 -- "Yes" --> Q3["Is residual signal small after conditioning on protected attributes?"]
  Q3 -- "Yes, signal is mostly proxy" --> STOP
  Q3 -- "No, genuine residual signal" --> BUILD["Buildable Tier 1 task. Verify, audit, deploy with fallback."]
```

The three gates correspond exactly to the chapter's machinery: gate one is the verification / inference definition, gate two is the invariance test that kills confounds, and gate three is the proxy-discrimination check that prevents laundering protected attributes. A feature that clears all three is a verification task with a real and lawful signal. A feature that fails any one of them belongs in analysis, not in a product.

- **Best demo, identity verification with deepfake defense.** A live selfie-to-ID match with liveness/PAD, ideally *showing a deepfake spoof being caught*, framed around fraud-loss reduction and KYC/AML compliance. It has objective ground truth, independent (NIST) evaluation, and no character inference. This is the demo that wins enterprise trust.
- **Strong second, privacy-preserving age estimation.** Show the published mean-absolute-error on screen, estimate an age, and delete the image immediately.
A concrete, regulator-aligned use case (UK Online Safety Act) with honest error bars.
- **Other safe demos**, payment authentication, account-recovery re-verification, returning-user matching, and document-fraud detection from the Document AI chapter.

**Present Tier 2 only as analysis, never as a live product pitch.** If credit-from-face or video-interview scoring must be shown, show it as a *cautionary case*: demonstrate the confound directly, for instance, that a "risk" model's output flips when a mugshot-style photo is swapped for a smiling headshot, or that interview scores track rater impressions rather than job performance. Pair every Tier 2 example with its critique (Chen et al. on humans over-weighting faces; Hickman et al. on label dependence; Barrett et al. on emotion), and note that workplace emotion recognition is *prohibited* in the EU.

**Never demo Tier 3 as if it works.** Use the criminality, sexuality, and political-orientation papers *only* as worked examples of how confounds, leakage, biased labels, and reverse causality manufacture spurious accuracy, a teaching device for "why high accuracy ≠ a real relationship," with the EU AI Act bans flagged explicitly.

### 5.1 Pitfalls checklist

Even within Tier 1, the following failure modes recur often enough to be worth naming explicitly.

- **Reading internal cross-validation as proof of a real signal.** Cross-validation that draws train and test from the *same* biased collection preserves every confound. Only an out-of-distribution or invariance test is diagnostic.
- **Reporting a single aggregate accuracy.** A face system can hit 99 percent overall while failing badly on a demographic subgroup. The classic demonstration is Buolamwini and Gebru's *Gender Shades*, where commercial gender classifiers that looked strong in aggregate had error rates up to about 34 percent on darker-skinned women.
Always report error stratified by the attributes documented in the bias-and-fairness literature, the way NIST does.
- **Treating "we did not use protected attributes as features" as a fairness guarantee.** As the proxy-discrimination inequality shows, a face encodes those attributes, so blinding the inputs does not blind the model.
- **Confusing a regulator-recognised *category* with approval of a *product*.** Age estimation being a recognised method does not mean a given vendor's error distribution meets the threshold; the published MAE and operating point still have to be checked.
- **Shipping without a non-biometric fallback.** Any face system excludes some users it cannot read; an accessible alternative path is both an equity requirement and, increasingly, a legal one.

## 6. Conclusion

The commercial value of AI on faces and video is real but lives almost entirely in **verification**, confirming a checkable claim of identity or age, not in **inference** of latent traits. The verification-versus-inference axis predicts both the science and the law: Tier 1 verifies an objective fact and is defensible; Tier 3 infers an unobservable essence and is pseudoscience the EU now bans; Tier 2 is the contested middle, where weak real correlations exist but are dominated by confounds and constrained by fair-lending and high-risk-AI regulation.

For a practitioner building a demo or a product, the discipline is simple to state and hard to hold: *if the question has an objective, checkable answer, you can build on it; if it requires reading character from a face, the accuracy is an artifact and the deployment is a liability.* That principle, more than any model, is the takeaway of this cluster.

## References

1. Duarte, J., Siegel, S., Young, L. "Trust and Credit: The Role of Appearance in Peer-to-Peer Lending." *Review of Financial Studies*, 2012. https://academic.oup.com/rfs/article-abstract/25/8/2455/1570804
2. Chen, Z., Liu, B., Meng, Y., Wang, Z. "What's in a Face?
An Experiment on Facial Information and Loan-Approval Decision." *Management Science*, 2023. https://pubsonline.informs.org/doi/10.1287/mnsc.2022.4436
3. Hickman, L. et al. "Automated Video Interview Personality Assessments: Reliability, Validity, and Generalizability." *Journal of Applied Psychology*, 2022. https://pubmed.ncbi.nlm.nih.gov/34110849/
4. Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A., Pollak, S. "Emotional Expressions Reconsidered." *Psychological Science in the Public Interest*, 2019. https://journals.sagepub.com/doi/10.1177/1529100619832930
5. Wu, X., Zhang, X. "Automated Inference on Criminality Using Face Images." 2016. https://arxiv.org/abs/1611.04135
6. Wang, Y., Kosinski, M. "Deep Neural Networks Are More Accurate Than Humans at Detecting Sexual Orientation From Facial Images." *Journal of Personality and Social Psychology*, 2018. https://www.gsb.stanford.edu/faculty-research/publications/deep-neural-networks-are-more-accurate-humans-detecting-sexual
7. Agüera y Arcas, B., Mitchell, M., Todorov, A. "Physiognomy's New Clothes." 2017. https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59fdd6a
8. Yoti. "Facial Age Estimation" (accuracy / NIST FATE evaluation). https://www.yoti.com/business/age-verification/
9. EU AI Act, Article 5 (prohibited practices). https://artificialintelligenceact.eu/article/5/
10. NIST. *Face Analysis Technology Evaluation (FATE), Age Estimation.* https://pages.nist.gov/frvt/html/frvt_age_estimation.html
11. Buolamwini, J., Gebru, T. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." *Proceedings of Machine Learning Research* (FAT* 2018), 81:77-91. https://proceedings.mlr.press/v81/buolamwini18a.html