234 Face Verification, Liveness, and Presentation-Attack Detection

234.1 1. Introduction

The previous chapter extracted a portrait, from a document’s printed photo or, better, from its NFC chip. This chapter answers the second half of identity verification: is the person presenting the document its genuine, living owner? That decomposes into two distinct technical problems. Face verification asks whether a live selfie matches the reference portrait (a 1:1 comparison). Liveness detection, or presentation-attack detection (PAD), asks whether the selfie comes from a real, present human rather than a photo, a video replay, a mask, or an injected deepfake. A system that solves the first but not the second is trivially defeated by holding up a printout of the victim’s face, so in practice the two must always travel together.

The two problems are also logically independent and must be evaluated separately. A perfectly accurate verifier with no liveness check accepts a printed photo of the genuine owner; a perfect liveness detector with a weak verifier accepts a live impostor. The end-to-end security of an eKYC pipeline is bounded by the weaker of the two stages, and the two failure modes have to be measured with their own metrics: recognition error rates for the matcher (Section 2.3) and attack/bona-fide error rates for the PAD subsystem (Section 6).

We develop the embedding paradigm and the margin losses that define modern face recognition, the precise error metrics and their geometry, the independent NIST benchmarks and their sobering demographic findings, the PAD threat taxonomy and methods, the rising deepfake and injection threats of 2024 to 2025, the certification standards, and the privacy and legal constraints that govern any deployment.

```{mermaid}
%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart LR
  A["Live selfie capture"] --> B["PAD liveness check"]
  B -->|"attack detected"| R["Reject"]
  B -->|"bona fide"| C["Face encoder"]
  D["Reference portrait from doc or chip"] --> E["Face encoder"]
  C --> F["Cosine similarity score"]
  E --> F
  F -->|"score above threshold"| P["Accept match"]
  F -->|"score below threshold"| R
```

The diagram makes the ordering explicit: liveness is a gate that runs before, or jointly with, the match, so that a spoofed sample never reaches the comparison stage.
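
The gate-then-match ordering can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Decision`, `verify`, and the two callables `is_bona_fide` and `encode` are hypothetical names standing in for a real PAD model and face encoder, and the threshold value used in the usage note is an assumed example.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Decision:
    accepted: bool
    reason: str
    score: Optional[float] = None  # stays None when the PAD gate rejects


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(selfie, reference,
           is_bona_fide: Callable[[object], bool],
           encode: Callable[[object], Sequence[float]],
           tau: float) -> Decision:
    """Run PAD first; only a bona fide sample reaches the matcher."""
    if not is_bona_fide(selfie):
        # The spoofed sample is rejected before any embedding is computed.
        return Decision(False, "presentation attack suspected")
    score = cosine(encode(selfie), encode(reference))
    if score >= tau:
        return Decision(True, "match", score)
    return Decision(False, "no match", score)
```

A caller would pass its own PAD and encoder callables, for example `verify(selfie, portrait, pad_model, face_encoder, tau=0.45)`; the point of the sketch is only that a rejected liveness check returns before the encoder is ever invoked.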

234.2 2. Recognition, Verification (1:1), and Identification (1:N)

Three operational modes must be distinguished, because their error behaviour differs sharply:

Verification (1:1), “is this person who they claim to be?” A probe (selfie) is compared against a single claimed reference (the ID photo), yielding one score thresholded to accept or reject. This is the eKYC mode.
Identification (1:N), “who is this?” A probe is searched against a gallery of N enrolled identities. The probability of a false positive scales with N, so an algorithm safe at 1:1 can be unsafe when searching millions of identities.
Recognition is the umbrella term for both.

The scaling argument deserves a precise statement. Let $\text{FMR}_1$ be the per-comparison false-match rate of a 1:1 verifier. In an open-set 1:N search the probe is compared against all $N$ gallery entries, and if the comparisons behaved independently the probability of at least one false match among the non-mates would be

\[ \text{FPIR}(N) \;=\; 1 - (1 - \text{FMR}_1)^{N} \;\approx\; N \cdot \text{FMR}_1 \quad \text{for } N\,\text{FMR}_1 \ll 1, \]

where FPIR is the false-positive identification rate. A verifier comfortable at $\text{FMR}_1 = 10^{-6}$ produces about $10^{-6}\cdot N$ expected false matches per non-mated probe. At $N = 10^7$ enrolled identities that is about $10$ expected false matches, far outside the regime where the linear approximation is a probability; the exact expression gives $\text{FPIR} = 1 - (1 - 10^{-6})^{10^7} \approx 1 - e^{-10} \approx 0.99995$, that is, a near-certain false alarm per innocent probe. Real galleries violate the independence assumption (correlated look-alikes, families, twins), so the formula is only a guide, but it explains why 1:N deployments must run at much stricter thresholds than 1:1 eKYC and why NIST reports the two regimes with separate metrics.
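
The independence formula is easy to evaluate numerically, which makes the breakdown of the linear approximation visible. The sketch below assumes independent comparisons, exactly as the formula does; the function names are illustrative.

```python
import math


def fpir_exact(fmr: float, n: int) -> float:
    """1 - (1 - fmr)^n, computed stably for tiny fmr and large n."""
    return -math.expm1(n * math.log1p(-fmr))


def fpir_linear(fmr: float, n: int) -> float:
    """Linear approximation n * fmr; only meaningful when n * fmr << 1.

    The same product is also the expected number of false matches per probe.
    """
    return n * fmr


# Compare the two forms across gallery sizes at FMR_1 = 1e-6.
for n in (10**3, 10**5, 10**7):
    print(n, fpir_exact(1e-6, n), fpir_linear(1e-6, n))
```

At small $N\,\text{FMR}_1$ the two columns agree closely; once the product exceeds 1 the linear value stops being a probability and should be read as an expected count of false matches instead.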

234.2.1 2.1 The Embedding Paradigm

A modern system maps a face to a fixed-length vector, an embedding or template, typically 128 to 512 dimensions, via a deep encoder (historically a CNN such as ResNet or MobileFaceNet, increasingly a Vision Transformer). The network is trained so that embeddings of the same identity cluster tightly (intra-class compactness) while different identities are pushed apart (inter-class separability). At inference, matching reduces to a distance computation between two embeddings; the encoder itself is fixed. This is the same representational idea developed in the embeddings chapter, specialized to faces.

Formally, the encoder $f_\theta : \mathcal{X} \to \mathbb{R}^d$ maps an image to an embedding that is then $L_2$-normalized to $\hat{e} = f_\theta(x) / \lVert f_\theta(x) \rVert_2$, so every template lives on the unit hypersphere $\mathbb{S}^{d-1}$. On that sphere cosine similarity and squared Euclidean distance are monotonically related,

\[ \lVert \hat{e}_a - \hat{e}_b \rVert_2^2 \;=\; 2 - 2\,\cos\theta_{ab}, \qquad \cos\theta_{ab} = \hat{e}_a^\top \hat{e}_b, \]

so thresholding cosine similarity and thresholding distance are equivalent decisions. This identity is why face systems report a single scalar score and why the geometry of the angle $\theta_{ab}$ is the object the margin losses below manipulate.
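
The identity can be checked numerically. The following sketch uses plain Python lists as stand-ins for embeddings; the helper names and the example vectors are illustrative assumptions, not part of any particular face library.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals the cosine only for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_threshold(tau: float) -> float:
    """Squared-distance threshold equivalent to cosine threshold tau."""
    return 2.0 - 2.0 * tau


e_a = l2_normalize([3.0, 4.0])
e_b = l2_normalize([4.0, 3.0])
cos_ab = cosine_similarity(e_a, e_b)   # 24/25 for these two vectors
dist_ab = squared_distance(e_a, e_b)   # matches 2 - 2 * cos_ab up to rounding
```

Accepting when `cos_ab >= tau` and accepting when `dist_ab <= distance_threshold(tau)` give the same decision, which is why a system can store either score without changing its behaviour.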

234.2.2 2.2 Margin Losses

The defining innovation of the 2015 to 2019 era was the loss function that shapes the embedding space. The common starting point is the softmax cross-entropy reinterpreted on normalized features and weights. With $L_2$-normalized feature $\hat{e}$ and per-class normalized weight vectors $W_j$, the logit for class $j$ is $s\cos\theta_j$, where $\theta_j$ is the angle between the feature and the class center and $s$ is a fixed scale (radius). Triplet loss is the metric-learning precursor and acts on distances between samples rather than on class logits; the three softmax-based margin variants that follow it differ only in how they penalize the target class angle $\theta_y$:

FaceNet / triplet loss (Schroff et al., 2015) optimizes triplets of (anchor, positive, negative) so that, with margin $\alpha$, $\lVert f(x^a) - f(x^p)\rVert_2^2 + \alpha \le \lVert f(x^a) - f(x^n)\rVert_2^2$. Powerful but sensitive to triplet mining and slow to converge.
SphereFace (2017) introduced a multiplicative angular margin $m$, replacing $\cos\theta_y$ with $\cos(m\theta_y)$.
CosFace (2018) applies an additive cosine margin, using $\cos\theta_y - m$, subtracting a fixed margin from the target-class cosine on a hypersphere of fixed radius $s$.
ArcFace (Deng et al., CVPR 2019) adds the margin directly to the angle, using $\cos(\theta_y + m)$. Because the angle is the geodesic distance on the unit hypersphere, this gives an exact, constant, geometrically interpretable margin. ArcFace became the de-facto standard and remains a strong baseline.

Writing the three penalized targets together makes the unification clear. The ArcFace objective for a batch of $B$ samples is

\[ \mathcal{L}_{\text{ArcFace}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{e^{\,s\cos(\theta_{y_i} + m)}} {e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j \ne y_i} e^{\,s\cos\theta_{j}}}, \]

and SphereFace ($\cos m\theta_{y_i}$) and CosFace ($\cos\theta_{y_i} - m$) are obtained by substituting their target terms. All three share one goal: penalize the target logit to force a gap at the decision boundary. They operate on $L_2$-normalized features and weights, which is why cosine similarity is the natural scoring metric, and why the scale $s$ (often $30$ to $64$) is needed so that the normalized logits retain enough dynamic range for cross-entropy to train.
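
To make the substitution concrete, the sketch below computes the per-sample loss for one feature given its angles to every class center, swapping only the target term. It is a simplified illustration under stated assumptions: angles are supplied directly rather than derived from a network, all variants share the same scale $s$, and the angles are assumed small enough that $\theta_y + m$ and $m\theta_y$ stay within $[0, \pi]$ (published implementations add special handling outside that range). The function names are hypothetical.

```python
import math
from typing import Sequence


def target_logit(theta: float, s: float, m: float, kind: str) -> float:
    """Penalized target-class logit for each margin variant."""
    if kind == "softmax":      # no margin: plain normalized softmax
        return s * math.cos(theta)
    if kind == "sphereface":   # multiplicative angular margin
        return s * math.cos(m * theta)
    if kind == "cosface":      # additive cosine margin
        return s * (math.cos(theta) - m)
    if kind == "arcface":      # additive angular margin
        return s * math.cos(theta + m)
    raise ValueError(f"unknown margin kind: {kind}")


def margin_loss(thetas: Sequence[float], y: int,
                s: float, m: float, kind: str) -> float:
    """Cross-entropy for one sample; thetas[j] is the angle to class j."""
    logits = [s * math.cos(t) for t in thetas]
    logits[y] = target_logit(thetas[y], s, m, kind)
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[y]


angles = [0.5, 1.2, 1.4]  # hypothetical angles; class 0 is the target
plain = margin_loss(angles, 0, s=30.0, m=0.0, kind="softmax")
arc = margin_loss(angles, 0, s=30.0, m=0.5, kind="arcface")
cosf = margin_loss(angles, 0, s=30.0, m=0.35, kind="cosface")
```

For the same angles, the margin variants report a larger loss than plain softmax, so the optimizer keeps shrinking $\theta_y$ even after the sample is already classified correctly; that extra pressure is what produces the gap at the decision boundary.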

234.2.3 2.3 Scoring and Error Trade-offs

A selfie-to-ID match runs both images through the encoder, $L_2$-normalizes the two embeddings, computes cosine similarity, and thresholds it at $\tau$: accept if $s(\hat{e}_a, \hat{e}_b) \ge \tau$, where $s(\cdot,\cdot)$ here denotes the cosine similarity score $\hat{e}_a^\top \hat{e}_b$ (not the training scale $s$ of Section 2.2). The threshold sets the trade-off between two error rates, defined as conditional probabilities:

False Match Rate, $\text{FMR}(\tau) = \Pr(\text{score} \ge \tau \mid \text{impostor pair})$, impostors wrongly accepted (a security failure).
False Non-Match Rate, $\text{FNMR}(\tau) = \Pr(\text{score} < \tau \mid \text{genuine pair})$, genuine users wrongly rejected (a usability failure, and in eKYC an exclusion failure).

These are monotone in $\tau$: raising $\tau$ lowers FMR and raises FNMR, and vice versa. If $G$ and $I$ denote the genuine and impostor score densities, then $\text{FMR}(\tau) = \int_\tau^\infty I(s)\,ds$ and $\text{FNMR}(\tau) = \int_{-\infty}^\tau G(s)\,ds$. The full trade-off is visualized by the ROC curve (true-accept vs. false-accept) or, more diagnostically in the low-error regime, the DET curve, which plots FNMR against FMR on logarithmic (or normal-deviate) axes so that the operationally interesting tail near $\text{FMR}=10^{-6}$ is legible rather than crushed into a corner.

Two summary points are worth naming. The equal error rate (EER) is the common error value at the threshold where $\text{FMR}=\text{FNMR}$, a convenient single number for comparing systems but not a sensible operating point, since real deployments deliberately run away from it. Instead operators fix the security side (say $\text{FMR}=10^{-6}$) and report the resulting FNMR, because the costs of a false accept (admitting a fraudster) and a false reject (excluding a legitimate user) are rarely symmetric.
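
The definitions translate directly into empirical estimators over labelled score lists. The sketch below is a toy illustration: the score lists are invented, the function names are hypothetical, and candidate thresholds are restricted to observed scores. Estimating an operating point near $\text{FMR}=10^{-6}$ in practice needs far more than a million impostor comparisons, which this example does not attempt.

```python
from typing import List, Sequence, Tuple


def fmr(impostor: Sequence[float], tau: float) -> float:
    """Fraction of impostor pairs scoring at or above tau."""
    return sum(s >= tau for s in impostor) / len(impostor)


def fnmr(genuine: Sequence[float], tau: float) -> float:
    """Fraction of genuine pairs scoring below tau."""
    return sum(s < tau for s in genuine) / len(genuine)


def _candidates(genuine, impostor) -> List[float]:
    # Observed scores plus +inf, so that FMR = 0 is always reachable.
    return sorted(set(genuine) | set(impostor)) + [float("inf")]


def equal_error_rate(genuine, impostor) -> Tuple[float, float]:
    """Return (tau, EER) at the candidate where FMR and FNMR are closest."""
    best = min(_candidates(genuine, impostor),
               key=lambda t: abs(fmr(impostor, t) - fnmr(genuine, t)))
    return best, (fmr(impostor, best) + fnmr(genuine, best)) / 2


def fnmr_at_fmr(genuine, impostor, target_fmr: float) -> Tuple[float, float]:
    """Fix FMR, report FNMR: lowest threshold meeting the FMR target."""
    for t in _candidates(genuine, impostor):
        if fmr(impostor, t) <= target_fmr:
            return t, fnmr(genuine, t)
    raise ValueError("unreachable: +inf always gives FMR = 0")


genuine_scores = [0.9, 0.8, 0.7, 0.6, 0.35]   # hypothetical data
impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.5]   # hypothetical data
eer_tau, eer = equal_error_rate(genuine_scores, impostor_scores)
op_tau, op_fnmr = fnmr_at_fmr(genuine_scores, impostor_scores, 0.0)
```

`equal_error_rate` is useful for comparing systems, while `fnmr_at_fmr` mirrors how an operating point is actually chosen: the security target goes in and the resulting usability cost comes out.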

234.2.3.1 Worked example: choosing a threshold

Suppose calibration on held-out data gives, at the candidate thresholds below, the following empirical rates for a verifier:

Threshold $\tau$	FMR	FNMR
$0.30$	$1\times10^{-3}$	$0.2\%$
$0.40$	$1\times10^{-5}$	$0.8\%$
$0.45$	$1\times10^{-6}$	$1.5\%$
$0.50$	$1\times10^{-7}$	$3.0\%$

If the eKYC policy requires at most one impostor accepted per million attempts, choose $\tau = 0.45$, accepting that roughly $1.5\%$ of genuine users are rejected on the first try and routed to a fallback (a retry or manual review). Tightening to $\tau = 0.50$ buys a further factor of ten in security at the cost of doubling the genuine-user rejection rate, a trade that is only worthwhile if the downstream fraud cost dominates the friction cost. This is the concrete meaning of “fix FMR, report FNMR”: the business, not the algorithm, chooses the row of the table.
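
The selection rule in the worked example can be written as a small lookup. The rows below copy the table above, with FNMR converted from percent to a fraction; `choose_threshold` is a hypothetical helper name.

```python
from typing import List, Tuple

# (tau, FMR, FNMR) rows copied from the calibration table above.
CALIBRATION: List[Tuple[float, float, float]] = [
    (0.30, 1e-3, 0.002),
    (0.40, 1e-5, 0.008),
    (0.45, 1e-6, 0.015),
    (0.50, 1e-7, 0.030),
]


def choose_threshold(rows, max_fmr: float) -> Tuple[float, float, float]:
    """Among rows meeting the FMR policy, pick the one with lowest FNMR."""
    eligible = [row for row in rows if row[1] <= max_fmr]
    if not eligible:
        raise ValueError("no calibrated threshold satisfies the FMR policy")
    return min(eligible, key=lambda row: row[2])


tau, policy_fmr, policy_fnmr = choose_threshold(CALIBRATION, max_fmr=1e-6)
```

With the policy set to one impostor per million, the helper returns the $\tau = 0.45$ row; a stricter policy that no calibrated row satisfies raises an error rather than silently extrapolating beyond the measured data.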

234.3 3. NIST Benchmarks and the Demographic Reality

NIST’s Face Recognition Vendor Test (FRVT), now the Face Recognition Technology Evaluation (FRTE), with morph and age strands under FATE, is the authoritative independent benchmark, evaluating hundreds of vendor algorithms on sequestered operational datasets. This independence matters: it is the credible alternative to vendor self-report, because the test data is never released, the protocol is fixed, and every submitted algorithm is scored on identical terms.

State of the art (1:1). The best algorithms achieve roughly FNMR ≈ 0.0001 to 0.002 at FMR = 10⁻⁶ on a ~12-million-image mugshot dataset, that is, a miss rate of about 0.01% to 0.2% at a one-in-a-million false-match setting. (Specific numbers are leaderboard snapshots that shift continuously and should be read from the live FRTE report, not memorized.)

Demographic differentials, NISTIR 8280 (FRVT Part 3, December 2019). This landmark study should be cited precisely, because it is frequently exaggerated and frequently dismissed:

In 1:1 verification, false positives were higher for Asian and African American faces than for Caucasian faces, with differentials “often ranging from a factor of 10 to 100×,” depending on the algorithm.
For US-developed algorithms, the highest false positives appeared for the American Indian group, with elevated rates for Asian and African American faces.
A crucial nuance: algorithms developed in Asian countries did not show the Asian-versus-Caucasian disparity, evidence that training-data composition, not any immutable property, drives much of the gap.
In 1:N identification, the highest false positives were for African American females, operationally the most serious finding, since 1:N false positives in law-enforcement search can implicate innocent people.

NIST’s follow-up work (NISTIR 8429) clarifies the mechanism: false-negative inequities are largely an image-quality problem (under-exposure of darker skin), correctable at the capture stage, whereas the larger false-positive variations persist even in high-quality images and must be addressed in algorithm design. The distinction matters operationally because the two failures have different owners: a capture-stage problem is fixed by better cameras, exposure metering, and capture guidance, while a false-match disparity is the algorithm vendor’s responsibility and is invisible unless you measure it per group.

The asymmetry also means that a single aggregate FMR threshold delivers unequal security and usability across groups: a threshold tuned to $\text{FMR}=10^{-6}$ on the population average may sit at $10^{-5}$ for one subgroup and $10^{-7}$ for another. The best modern algorithms are markedly more equitable than the 2019 cohort, but the differentials have not vanished, which is why an eKYC deployment must monitor error rates by demographic group, not just in aggregate, and should consider whether a per-group or worst-group threshold policy is required by its fairness obligations.
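
Per-group monitoring amounts to computing the same error rate separately for each group at the deployed threshold. The sketch below assumes impostor comparisons arrive as `(group_label, score)` pairs; how group labels are obtained, and the labels `"A"` and `"B"` themselves, are hypothetical and outside the scope of the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fmr_by_group(impostor_records: Iterable[Tuple[str, float]],
                 tau: float) -> Dict[str, float]:
    """False-match rate per group at a single shared threshold."""
    counts = defaultdict(lambda: [0, 0])  # group -> [false matches, total]
    for group, score in impostor_records:
        counts[group][1] += 1
        if score >= tau:
            counts[group][0] += 1
    return {g: hits / total for g, (hits, total) in counts.items()}


def worst_to_best_ratio(rates: Dict[str, float]) -> float:
    """Differential between the highest and lowest per-group rate."""
    worst, best = max(rates.values()), min(rates.values())
    return float("inf") if best == 0 else worst / best


records = [  # hypothetical impostor comparisons
    ("A", 0.2), ("A", 0.5), ("A", 0.1), ("A", 0.3),
    ("B", 0.5), ("B", 0.6), ("B", 0.1), ("B", 0.2),
]
rates = fmr_by_group(records, tau=0.45)
ratio = worst_to_best_ratio(rates)
```

An aggregate FMR over all eight records would hide the fact that one group sees twice the false-match rate of the other; tracking the ratio over time is one simple way to make that differential visible.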

234.4 4. Liveness and Presentation-Attack Detection

PAD determines whether the biometric sample comes from a live, present human. Passive liveness analyzes a single image or short clip with no user action (texture, moiré, reflectance, micro-physiology). Active (challenge-response) liveness prompts the user to blink, smile, turn the head, or follow on-screen color flashes whose reflectance is then checked. It is more robust against static prints and simple replays, but it is higher-friction and can still be spoofed by interactive deepfakes that respond to the prompt in real time. The friction trade-off is real: passive liveness preserves a one-tap user experience and is preferred for high-volume consumer onboarding, while active challenges raise the bar against simple replays at the cost of abandonment and accessibility burden.

234.4.1 4.1 The Attack Taxonomy (ISO/IEC 30107)

ISO/IEC 30107-1 frames the threat. Presentation-attack instruments (PAIs) include print attacks (a photo on paper), replay attacks (a video on a screen), 3D masks (paper, resin, silicone), and increasingly digital/synthetic artefacts. An important scope point: classic PAD addresses artefacts presented to the sensor; injection attacks that bypass the sensor entirely (Section 5) fall outside the original presentation model. The standard also distinguishes PAI species (a class of instruments sharing a production method, for example “silicone masks”, as opposed to one particular mask of subject X) from broader categories, because a detector that defeats one species says nothing about an unseen species, the generalization problem that recurs throughout this section.

234.4.2 4.2 Methods

Texture and image-quality cues detect print and screen artefacts (moiré, banding, reduced micro-texture, color distortion).
Remote photoplethysmography (rPPG) recovers the subtle pulse-driven skin-color signal, present in live faces, absent in prints and masks, but is computationally heavy and sensitive to motion and lighting.
Depth exploits the planarity of prints and replays versus genuine 3D facial structure (stereo, structured light, or learned depth).
Deep-learning PAD dominates: CNN/transformer classifiers, auxiliary-supervision models that jointly regress depth maps and rPPG signals (Liu et al., 2018), and generalization-focused approaches (domain adaptation, anomaly detection) to handle unseen attack types, the central open problem, since PAD models generalize poorly across datasets.

A useful way to read these methods is as a defense-in-depth stack rather than competitors: texture cues are cheap and catch crude prints, depth defeats flat replays, rPPG and motion defeat static masks, and a learned classifier fuses them. No single cue is robust on its own, and the strongest production systems combine several so that defeating the system requires simultaneously fooling independent physical signals.

234.4.3 4.3 Benchmarks

Canonical datasets include CASIA-FASD (50 subjects), Replay-Attack (Idiap; 1,200 videos), OULU-NPU (4,950 clips across four protocols isolating illumination, instrument, and camera), SiW (165 subjects), the multi-modal cross-ethnicity CASIA-SURF and CeFA, and CelebA-Spoof (~625k images, 10,177 subjects). The recurring lesson across community competitions is poor cross-domain generalization: a detector tuned on one dataset’s attacks often fails on another’s. The standard guard against fooling oneself is a cross-dataset protocol (train on one corpus, test on a different one), since within-dataset accuracy systematically overstates real-world robustness.

234.5 5. Deepfakes, Morphing, and Injection Attacks

234.5.1 5.1 Face Morphing (the ID-issuance attack)

A morph blends two faces into one image that matches both contributing identities above threshold. Concretely, if $\hat{e}_A$ and $\hat{e}_B$ are the embeddings of two accomplices, a successful morph image $x_M$ has an embedding $\hat{e}_M$ for which both $s(\hat{e}_M, \hat{e}_A) \ge \tau$ and $s(\hat{e}_M, \hat{e}_B) \ge \tau$. If a passport is issued from a morphed photo, the two contributors share one credential, defeating downstream verification. The Morphing Attack Detection (MAD) literature splits into single-image MAD (detect artefacts in one image) and differential MAD (compare the document image against a trusted live capture of the presenter, against which a morph typically scores lower than a genuine photo would). NIST runs FATE MORPH for independent benchmarking and in 2025 released a lay-language guide (NISTIR 8584). The primary operational defense is live, supervised enrolment at issuance, which the EU mandated for national identity cards under Regulation (EU) 2019/1157, removing the applicant-supplied photo that morphing exploits. This is the cleanest illustration of a theme in this chapter: a process change at capture time can neutralize an attack that is hard to detect algorithmically after the fact.
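
The double-match condition is simple to state in code, and a geometric idealization shows why it is attainable. The sketch below is an assumption-laden illustration: a real morph is built in image space and passed through a nonlinear encoder, whereas here the morph embedding is replaced by the normalized midpoint of the two contributors' embeddings. On the unit sphere that midpoint sits at half the angle to each contributor, so two identities that do not match each other can both match it. All names and vectors are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of unit vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def matches_both(e_m, e_a, e_b, tau: float) -> bool:
    """Success condition for a morph: above threshold against both."""
    return cosine(e_m, e_a) >= tau and cosine(e_m, e_b) >= tau


def embedding_midpoint(e_a, e_b) -> List[float]:
    """Idealized stand-in for a morph embedding (NOT how morphs are made)."""
    return l2_normalize([x + y for x, y in zip(e_a, e_b)])


TAU = 0.45                                   # assumed operating threshold
e_a = [1.0, 0.0]
e_b = l2_normalize([0.2, math.sqrt(0.96)])   # cosine 0.2 with e_a
e_m = embedding_midpoint(e_a, e_b)

contributors_match = cosine(e_a, e_b) >= TAU      # the two people differ
morph_succeeds = matches_both(e_m, e_a, e_b, TAU)
```

In this toy geometry the contributors score 0.2 against each other, yet each scores $\sqrt{0.6} \approx 0.77$ against the midpoint, which is the half-angle relation $\cos(\theta/2) = \sqrt{(1+\cos\theta)/2}$. It also hints at why differential MAD helps: a genuine document photo would typically score higher against a trusted live capture than a morph does.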

234.5.2 5.2 Injection Attacks, the 2024 to 2025 Shift

The fastest-growing threat is not holding an artefact to a camera but injecting a synthetic video stream directly into the verification pipeline through a virtual camera, bypassing the physical sensor. Industry threat-intelligence reporting (for example iProov’s 2025 report) documents steep year-over-year rises in virtual-camera and face-swap attacks and a growing share of verification attempts involving deepfakes. These vendor figures are directional threat intelligence, not peer-reviewed measurements, and should be cited with that caveat. The structural point, however, is robust: because injection sidesteps the sensor, traditional presentation-attack detection does not cover it, since there is no physical instrument and no moiré, depth, or reflectance artefact to detect. Defenses therefore shift the trust anchor away from the image content and toward the provenance of the capture: trusted capture, device attestation, server-side imagery-integrity checks, and one-time challenge schemes (such as randomized screen-illumination patterns) that are hard to pre-render or replay, rather than analysis of the image alone.

The randomized-illumination idea is worth stating as a principle. If the server emits a fresh, unpredictable challenge (for example a sequence of screen colors) and verifies that the returned video shows the correct corresponding reflectance on the face in real time, then a pre-recorded or pre-rendered injection cannot anticipate the challenge and fails. The security rests on the challenge being unguessable and the response being verified within a tight time window, the same freshness logic that defeats replay in cryptographic authentication.
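
The freshness logic can be sketched as a one-time, time-boxed challenge. This is a minimal illustration under stated assumptions: the color palette, the five-second window, and the function names are invented for the example, and `observed_sequence` stands for the output of a hypothetical reflectance analyzer that recovers the flashed colors from the returned video, which is the hard part and is not shown.

```python
import secrets
import time
from typing import Dict, List, Optional, Sequence

COLORS = ("red", "green", "blue", "white")  # assumed palette


def issue_challenge(length: int = 6, ttl_seconds: float = 5.0,
                    now: Optional[float] = None) -> Dict:
    """Create an unpredictable color sequence with a short lifetime."""
    issued = time.monotonic() if now is None else now
    return {
        "nonce": secrets.token_hex(16),
        # secrets.choice draws from the OS CSPRNG, so the sequence is
        # not guessable the way a seeded pseudo-random sequence would be.
        "sequence": [secrets.choice(COLORS) for _ in range(length)],
        "expires_at": issued + ttl_seconds,
        "used": False,
    }


def verify_response(challenge: Dict, observed_sequence: Sequence[str],
                    now: Optional[float] = None) -> bool:
    """Accept only a correct, first-time response inside the time window."""
    current = time.monotonic() if now is None else now
    if challenge["used"]:
        return False                 # replay of an already-answered challenge
    challenge["used"] = True         # burn the challenge on first attempt
    if current > challenge["expires_at"]:
        return False                 # too slow: likely rendered offline
    return list(observed_sequence) == challenge["sequence"]
```

Burning the challenge on the first attempt, whether or not it succeeds, is a deliberate choice in this sketch: it stops an attacker from probing the same challenge repeatedly, at the cost of forcing a legitimate user who fails once to request a new one.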

234.6 6. Standards and Certification

ISO/IEC 30107-3 specifies the testing and reporting methodology for PAD, defining APCER (attack presentation classification error rate, the fraction of attack presentations wrongly accepted as bona fide) and BPCER (bona-fide presentation classification error rate, the fraction of genuine presentations wrongly rejected) against defined attack species. These are the PAD analogues of FMR and FNMR: APCER is the security error and BPCER is the usability error. Critically, APCER is reported per attack species and the worst species governs, because a system is only as strong as its weakest covered attack; an average over species would let a strong defense against prints hide a weak defense against masks.
iBeta (an NVLAP-accredited lab) is the dominant PAD test house. Level 1 covers 2D attacks (prints, cutouts, screen replays) and requires 0% successful attacks across the battery; Level 2 adds 3D attacks (silicone/resin/latex masks, wrapped 3D paper) with higher material budgets. Conformance is reported per ISO/IEC 30107-3.
FIDO Alliance Biometric Component Certification independently certifies biometric subcomponents for both PAD and recognition performance via accredited labs, giving a vendor-neutral assurance bar.
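
The per-species reporting rule is worth seeing numerically, because it changes the headline figure. The sketch below follows the definitions given above; the species names and trial outcomes are invented for illustration, and the helper names are hypothetical.

```python
from typing import Dict, Sequence


def apcer_by_species(attacks: Dict[str, Sequence[bool]]) -> Dict[str, float]:
    """APCER per species; True marks an attack accepted as bona fide."""
    return {species: sum(results) / len(results)
            for species, results in attacks.items()}


def worst_case_apcer(attacks: Dict[str, Sequence[bool]]) -> float:
    """The governing figure: the maximum APCER over all tested species."""
    return max(apcer_by_species(attacks).values())


def bpcer(bona_fide: Sequence[bool]) -> float:
    """BPCER; True marks a genuine presentation rejected as an attack."""
    return sum(bona_fide) / len(bona_fide)


attack_trials = {  # hypothetical test battery
    "print": [False] * 100,
    "replay": [False] * 98 + [True] * 2,
    "silicone_mask": [False] * 80 + [True] * 20,
}
genuine_trials = [False] * 95 + [True] * 5

per_species = apcer_by_species(attack_trials)
worst = worst_case_apcer(attack_trials)
pooled = sum(sum(r) for r in attack_trials.values()) / 300
usability_error = bpcer(genuine_trials)
```

Pooling all 300 attack trials gives roughly 7%, which looks tolerable, while the worst-species figure exposes a 20% acceptance rate for masks; that gap is exactly what reporting per species is meant to surface.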

A practical caution on reading certificates: a PAD certification is scoped to the species tested, the capture hardware, and the software version. It is evidence of competence against a defined battery, not a guarantee against novel attacks (injection in particular often sits outside the tested scope), so a certificate should be treated as a floor to build on, not a finished security argument.

234.7 7. Privacy, Bias, and Regulation

Template protection. Raw embeddings are sensitive and partially invertible, so ISO/IEC 24745 (biometric information protection) and cancelable biometrics, irreversible, revocable transforms of the template, plus secure-element and fuzzy-vault schemes are the standard mitigations. The defining requirement is unlinkability and renewability: a protected template must be hard to invert back to a face, and if compromised it must be revocable and re-issuable as a fresh transform, which a raw face image can never be, since you cannot reissue someone’s face.
EU AI Act (Article 5). Prohibits certain biometric practices: real-time remote biometric identification in public spaces for law enforcement (with narrow, court-authorized exceptions); biometric categorisation inferring sensitive attributes; and emotion recognition in workplaces and education (with safety/medical carve-outs). Most verification and permitted remote-identification uses are pushed into the high-risk tier with conformity-assessment duties, a theme developed in the eKYC and business-applications chapters.
Biometric privacy law (US). Illinois’ Biometric Information Privacy Act (BIPA, 2008) is the template statute: it mandates informed written consent and retention/destruction policies for face-geometry data and uniquely provides a private right of action with statutory damages ($1,000 per negligent, $5,000 per reckless violation). It has driven nine-figure settlements (Facebook $650M; Clearview AI), and a 2024 amendment limiting per-scan accumulation has shaped how damages are counted. Texas and Washington have analogous regimes enforced by the state attorney general rather than a private right of action.

234.7.1 When to use, and pitfalls

Face verification with liveness is the right tool for remote, document-bound identity proofing (account opening, regulated onboarding) where a trusted reference portrait exists and a brief capture is acceptable. It is a poor fit where no trusted reference exists, where capture conditions cannot be controlled, or where the law forbids the use outright (Section 7). Recurring pitfalls:

Liveness omitted or bolted on late. Verification without PAD is not security; PAD must gate the match, not annotate it afterward.
Aggregate-only metrics. A single population FMR/FNMR hides the demographic differentials of Section 3; monitor by group.
Within-dataset PAD evaluation. Reporting accuracy on the same corpus the detector was tuned on overstates robustness; require cross-dataset and unseen-species testing.
Treating a certificate as complete. iBeta or FIDO conformance covers tested species and versions only, and typically not injection.
Storing raw embeddings. Templates are partially invertible and irrevocable; protect them under ISO/IEC 24745 so a breach can be remediated.

234.8 8. Conclusion

Face verification is a mature, independently benchmarked technology: margin-loss embeddings compared by cosine similarity, with error trade-offs made explicit by the DET curve and the discipline of fixing FMR and reporting FNMR, and a sobering, well-documented demographic gap that any deployment must monitor by group rather than in aggregate. But verification without liveness is no defense at all, and liveness is now an arms race: the threat moved from printed photos to 3D masks and then, in 2024 to 2025, to deepfake injection that bypasses the camera entirely, pushing the trust anchor toward device attestation and trusted capture. The two subsystems are measured with parallel metric pairs, FMR/FNMR for the matcher and APCER/BPCER for PAD, and the security of the whole is bounded by the weaker of them. With document reading (previous chapter) and biometric verification (this chapter) in hand, the next chapter assembles them into a complete eKYC system and confronts the regulation, risk scoring, and fairness obligations that govern it.

234.9 References

Deng, J. et al. “ArcFace: Additive Angular Margin Loss for Deep Face Recognition.” CVPR 2019. https://arxiv.org/abs/1801.07698
Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015. https://arxiv.org/abs/1503.03832
Wang, H. et al. “CosFace: Large Margin Cosine Loss for Deep Face Recognition.” CVPR 2018. https://arxiv.org/abs/1801.09414
Liu, W. et al. “SphereFace: Deep Hypersphere Embedding for Face Recognition.” CVPR 2017. https://arxiv.org/abs/1704.08063
NIST. “FRVT Part 3: Demographic Effects” (NISTIR 8280). December 2019. https://nvlpubs.nist.gov/nistpubs/ir/2019/nist.ir.8280.pdf
NIST. “FRVT Part 8 / Demographic Effects update” (NISTIR 8429). https://pages.nist.gov/frvt/reports/demographics/nistir_8429.pdf
ISO/IEC 30107-3. Biometric presentation attack detection, Testing and reporting. https://www.iso.org/standard/79520.html
Liu, Y., Jourabloo, A., Liu, X. “Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision.” CVPR 2018. https://arxiv.org/abs/1803.11097
NIST FATE MORPH and NISTIR 8584 (morph-attack detection). https://pages.nist.gov/frvt/html/frvt_morph.html
iProov. “Threat Intelligence Report 2025” (injection/deepfake trends; vendor threat intelligence). https://www.iproov.com/reports/threat-intelligence-report-2025-remote-identity-attack
FIDO Alliance. “Biometric Component Certification.” https://fidoalliance.org/certification/biometric-component-certification/
EU AI Act, Article 5 (prohibited practices). https://artificialintelligenceact.eu/article/5/

# Face Verification, Liveness, and Presentation-Attack Detection ## 1. Introduction The previous chapter extracted a portrait, from a document's printed photo or, better, from its NFC chip. This chapter answers the second half of identity verification: **is the person presenting the document its genuine, living owner?** That decomposes into two distinct technical problems. *Face verification* asks whether a live selfie matches the reference portrait (a 1:1 comparison). *Liveness detection*, or presentation-attack detection (PAD), asks whether the selfie comes from a real, present human rather than a photo, a video replay, a mask, or an injected deepfake. A system that solves the first but not the second is trivially defeated by holding up a printout of the victim's face, so in practice the two must always travel together. The two problems are also logically independent and must be evaluated separately. A verifier with a perfect match score but no liveness check accepts a printed photo of the genuine owner; a perfect liveness detector with a weak verifier accepts a live impostor. The end-to-end security of an eKYC pipeline is bounded by the weaker of the two stages, and the two failure modes have to be measured with their own metrics, recognition error rates for the matcher (Section 2.3) and attack/bona-fide error rates for the PAD subsystem (Section 6). We develop the embedding paradigm and the margin losses that define modern face recognition, the precise error metrics and their geometry, the independent NIST benchmarks and their sobering demographic findings, the PAD threat taxonomy and methods, the rising deepfake and injection threats of 2024 to 2025, the certification standards, and the privacy and legal constraints that govern any deployment. 
```{mermaid} %%{init: {"flowchart": {"htmlLabels": false}} }%% flowchart LR A["Live selfie capture"] --> B["PAD liveness check"] B -->|"attack detected"| R["Reject"] B -->|"bona fide"| C["Face encoder"] D["Reference portrait from doc or chip"] --> E["Face encoder"] C --> F["Cosine similarity score"] E --> F F -->|"score above threshold"| P["Accept match"] F -->|"score below threshold"| R ``` The diagram makes the ordering explicit: liveness is a gate that runs before, or jointly with, the match, so that a spoofed sample never reaches the comparison stage. ## 2. Recognition, Verification (1:1), and Identification (1:N) Three operational modes must be distinguished, because their error behaviour differs sharply: - **Verification (1:1)**, "is this person who they claim to be?" A probe (selfie) is compared against a *single* claimed reference (the ID photo), yielding one score thresholded to accept or reject. This is the eKYC mode. - **Identification (1:N)**, "who is this?" A probe is searched against a gallery of *N* enrolled identities. The probability of a false positive scales with *N*, so an algorithm safe at 1:1 can be unsafe when searching millions of identities. - **Recognition** is the umbrella term for both. The scaling argument deserves a precise statement. Let $\text{FMR}_1$ be the per-comparison false-match rate of a 1:1 verifier. In an open-set 1:N search the probe is compared against all $N$ gallery entries, and if the comparisons behaved independently the probability of *at least one* false match among the non-mates would be $$ \text{FPIR}(N) \;=\; 1 - (1 - \text{FMR}_1)^{N} \;\approx\; N \cdot \text{FMR}_1 \quad \text{for } N\,\text{FMR}_1 \ll 1, $$ where FPIR is the false-positive identification rate. A verifier comfortable at $\text{FMR}_1 = 10^{-6}$ produces a roughly $10^{-6}\cdot N$ chance of a spurious hit, which at $N = 10^7$ enrolled identities is about $10$, that is, a near-certain false alarm per innocent probe. 
Real galleries violate the independence assumption (correlated look-alikes, families, twins), so the linear approximation is only a guide, but it explains why 1:N deployments must run at much stricter thresholds than 1:1 eKYC and why NIST reports the two regimes with separate metrics. ### 2.1 The Embedding Paradigm A modern system maps a face to a fixed-length vector, an *embedding* or *template*, typically 128 to 512 dimensions, via a deep encoder (historically a CNN such as ResNet or MobileFaceNet, increasingly a Vision Transformer). The network is trained so that embeddings of the same identity cluster tightly (intra-class compactness) while different identities are pushed apart (inter-class separability). At inference, matching reduces to a distance computation between two embeddings; the encoder itself is fixed. This is the same representational idea developed in the embeddings chapter, specialized to faces. Formally, the encoder $f_\theta : \mathcal{X} \to \mathbb{R}^d$ maps an image to an embedding that is then $L_2$-normalized to $\hat{e} = f_\theta(x) / \lVert f_\theta(x) \rVert_2$, so every template lives on the unit hypersphere $\mathbb{S}^{d-1}$. On that sphere cosine similarity and squared Euclidean distance are monotonically related, $$ \lVert \hat{e}_a - \hat{e}_b \rVert_2^2 \;=\; 2 - 2\,\cos\theta_{ab}, \qquad \cos\theta_{ab} = \hat{e}_a^\top \hat{e}_b, $$ so thresholding cosine similarity and thresholding distance are equivalent decisions. This identity is why face systems report a single scalar score and why the geometry of the *angle* $\theta_{ab}$ is the object the margin losses below manipulate. ### 2.2 Margin Losses The defining innovation of the 2015 to 2019 era was the loss function that shapes the embedding space. The common starting point is the softmax cross-entropy reinterpreted on normalized features and weights. 
With $L_2$-normalized feature $\hat{e}$ and per-class normalized weight vectors $W_j$, the logit for class $j$ is $s\cos\theta_j$, where $\theta_j$ is the angle between the feature and the class center and $s$ is a fixed scale (radius). The margin variants differ only in how they penalize the *target* class angle $\theta_y$:

- **FaceNet / triplet loss** (Schroff et al., 2015) optimizes triplets of (anchor, positive, negative) so that, with margin $\alpha$, $\lVert f(x^a) - f(x^p)\rVert_2^2 + \alpha \le \lVert f(x^a) - f(x^n)\rVert_2^2$. Powerful but sensitive to triplet mining and slow to converge.
- **SphereFace** (2017) introduced a *multiplicative angular margin* $m$, replacing $\cos\theta_y$ with $\cos(m\theta_y)$.
- **CosFace** (2018) applies an *additive cosine margin*, using $\cos\theta_y - m$: it subtracts a fixed margin from the target-class cosine on a hypersphere of fixed radius $s$.
- **ArcFace** (Deng et al., CVPR 2019) adds the margin directly to the *angle* (the geodesic distance on the hypersphere), using $\cos(\theta_y + m)$, which gives an exact, constant, geometrically interpretable margin. ArcFace became the de-facto standard and remains a strong baseline.

Writing the three penalized targets together makes the unification clear. The ArcFace objective for a batch of $B$ samples is

$$
\mathcal{L}_{\text{ArcFace}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{e^{\,s\cos(\theta_{y_i} + m)}} {e^{\,s\cos(\theta_{y_i} + m)} + \sum_{j \ne y_i} e^{\,s\cos\theta_{j}}},
$$

and SphereFace ($\cos(m\theta_{y_i})$) and CosFace ($\cos\theta_{y_i} - m$) are obtained by substituting their target terms. All three share one goal: penalize the target logit to force a gap at the decision boundary. They operate on $L_2$-normalized features and weights, which is why **cosine similarity** is the natural scoring metric, and why the scale $s$ (often $30$ to $64$) is needed so that the normalized logits retain enough dynamic range for cross-entropy to train.
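
The additive margins above can be sketched in a few lines of NumPy. This is an illustrative forward pass only, under stated assumptions: the helper names (`l2_normalize`, `margin_logits`, `cross_entropy`) are hypothetical, the default values of `s` and `m` are placeholders, SphereFace's multiplicative margin is left out, and the sketch omits the special handling that training code needs when $\theta_y + m$ exceeds $\pi$.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def margin_logits(features, weights, labels, s=64.0, m=0.5, kind="arcface"):
    """Scaled logits s*cos(theta_j), with a margin applied to the target class.

    features: (B, d) raw embeddings; weights: (C, d) class centers;
    labels: (B,) integer class ids.
    """
    cos = np.clip(l2_normalize(features) @ l2_normalize(weights).T, -1.0, 1.0)
    rows = np.arange(len(labels))
    target = cos[rows, labels]
    if kind == "arcface":      # cos(theta_y + m): margin added to the angle
        penalized = np.cos(np.arccos(target) + m)
    elif kind == "cosface":    # cos(theta_y) - m: margin subtracted from the cosine
        penalized = target - m
    else:
        raise ValueError(f"unknown margin kind: {kind}")
    out = cos.copy()
    out[rows, labels] = penalized
    return s * out


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed with the usual max-shift for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Tiny demo: one sample aligned with its class center, two classes.
feats = np.array([[1.0, 0.0]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0])
for kind in ("arcface", "cosface"):
    print(kind, margin_logits(feats, centers, y, s=4.0, m=0.35, kind=kind))
```

Even for a perfectly aligned sample the margin lowers the target logit, so the loss stays positive and keeps pulling same-identity embeddings closer to their center.
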

### 2.3 Scoring and Error Trade-offs

A selfie-to-ID match runs both images through the encoder, $L_2$-normalizes the two embeddings, computes cosine similarity, and thresholds it at $\tau$: accept if $s(\hat{e}_a, \hat{e}_b) \ge \tau$. The threshold sets the trade-off between two error rates, defined as conditional probabilities:

- **False Match Rate**, $\text{FMR}(\tau) = \Pr(\text{score} \ge \tau \mid \text{impostor pair})$: impostors wrongly accepted (a *security* failure).
- **False Non-Match Rate**, $\text{FNMR}(\tau) = \Pr(\text{score} < \tau \mid \text{genuine pair})$: genuine users wrongly rejected (a *usability* failure, and in eKYC an *exclusion* failure).

These are monotone in $\tau$: raising $\tau$ lowers FMR and raises FNMR, and vice versa. If $G$ and $I$ denote the genuine and impostor score densities, then $\text{FMR}(\tau) = \int_\tau^\infty I(s)\,ds$ and $\text{FNMR}(\tau) = \int_{-\infty}^\tau G(s)\,ds$. The full trade-off is visualized by the **ROC curve** (true-accept vs. false-accept) or, more diagnostically in the low-error regime, the **DET curve**, which plots FNMR against FMR on logarithmic (or normal-deviate) axes so that the operationally interesting tail near $\text{FMR}=10^{-6}$ is legible rather than crushed into a corner.

Two summary points are worth naming. The **equal error rate** (EER) is the common error rate at the threshold where $\text{FMR}=\text{FNMR}$, a convenient single number for comparing systems but *not* a sensible operating point, since real deployments deliberately operate away from it. Instead operators **fix the security side** (say $\text{FMR}=10^{-6}$) and report the resulting FNMR, because the costs of a false accept (admitting a fraudster) and a false reject (excluding a legitimate user) are rarely symmetric.
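
The "fix FMR, report FNMR" discipline can be sketched directly from the definitions. The example below is a minimal illustration on synthetic Gaussian scores, which are an assumption and not real matcher output; `error_rates` and `threshold_at_fmr` are hypothetical helper names. Note that an empirical FMR cannot be resolved below one over the number of impostor pairs, so a target such as $10^{-6}$ needs millions of impostor comparisons.

```python
import numpy as np


def error_rates(genuine, impostor, tau):
    """Empirical (FMR, FNMR) for the decision rule 'accept if score >= tau'."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    fmr = float(np.mean(impostor >= tau))   # impostors wrongly accepted
    fnmr = float(np.mean(genuine < tau))    # genuine users wrongly rejected
    return fmr, fnmr


def threshold_at_fmr(impostor, target_fmr):
    """Lowest threshold whose empirical FMR does not exceed target_fmr."""
    scores = np.sort(np.asarray(impostor, dtype=float))[::-1]  # descending
    k = int(np.floor(target_fmr * len(scores)))  # false matches we may allow
    if k >= len(scores):
        return float("-inf")                     # everything may be accepted
    # Sit just above the (k+1)-th highest impostor score, so that at most
    # k impostor scores satisfy score >= tau.
    return float(np.nextafter(scores[k], np.inf))


# Synthetic calibration data (assumed distributions, for illustration only).
rng = np.random.default_rng(0)
genuine = rng.normal(0.65, 0.10, size=5_000)
impostor = rng.normal(0.05, 0.10, size=200_000)

tau = threshold_at_fmr(impostor, target_fmr=1e-4)
fmr, fnmr = error_rates(genuine, impostor, tau)
print(f"tau={tau:.3f}  FMR={fmr:.2e}  FNMR={fnmr:.2e}")
```

Sweeping `target_fmr` over several decades and plotting the resulting (FMR, FNMR) pairs on logarithmic axes gives an empirical DET curve.
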

#### Worked example: choosing a threshold

Suppose calibration on held-out data gives, at the candidate thresholds below, the following empirical rates for a verifier:

| Threshold $\tau$ | FMR | FNMR |
|---|---|---|
| $0.30$ | $1\times10^{-3}$ | $0.2\%$ |
| $0.40$ | $1\times10^{-5}$ | $0.8\%$ |
| $0.45$ | $1\times10^{-6}$ | $1.5\%$ |
| $0.50$ | $1\times10^{-7}$ | $3.0\%$ |

If the eKYC policy requires at most one impostor accepted per million attempts, choose $\tau = 0.45$, accepting that roughly $1.5\%$ of genuine users are rejected on the first try and routed to a fallback (a retry or manual review). Tightening to $\tau = 0.50$ buys a further factor of ten in security at the cost of doubling the genuine-user rejection rate, a trade that is only worthwhile if the downstream fraud cost dominates the friction cost. This is the concrete meaning of "fix FMR, report FNMR": the business, not the algorithm, chooses the operating row.

## 3. NIST Benchmarks and the Demographic Reality

NIST's **Face Recognition Vendor Test (FRVT)**, now the **Face Recognition Technology Evaluation (FRTE)**, with morph and age strands under **FATE**, is the authoritative independent benchmark, evaluating hundreds of vendor algorithms on sequestered operational datasets. This independence matters: it is the credible alternative to vendor self-report, because the test data is never released, the protocol is fixed, and every submitted algorithm is scored on identical terms.

**State of the art (1:1).** The best algorithms achieve roughly FNMR ≈ 0.0001 to 0.002 at $\text{FMR} = 10^{-6}$ on a ~12-million-image mugshot dataset, that is, a miss rate well under 1% at a one-in-a-million false-match setting. (Specific numbers are leaderboard snapshots that shift continuously and should be read from the live FRTE report, not memorized.)

**Demographic differentials, NISTIR 8280 (FRVT Part 3, December 2019).** This landmark study should be cited precisely, because it is frequently exaggerated *and* frequently dismissed:

- In **1:1 verification**, false positives were higher for Asian and African American faces than for Caucasian faces, with differentials "often ranging from a factor of 10 to 100×," depending on the algorithm.
- For **US-developed** algorithms, the highest false positives appeared for the American Indian group, with elevated rates for Asian and African American faces.
- A crucial nuance: algorithms developed **in Asian countries** did *not* show the Asian-versus-Caucasian disparity, evidence that training-data composition, not any immutable property, drives much of the gap.
- In **1:N identification**, the highest false positives were for African American females, operationally the most serious finding, since 1:N false positives in law-enforcement search can implicate innocent people.

NIST's follow-up work (NISTIR 8429) clarifies the mechanism: **false-negative** inequities are largely an *image-quality* problem (under-exposure of darker skin), correctable at the capture stage, whereas the larger **false-positive** variations persist even in high-quality images and must be addressed in algorithm design. The distinction matters operationally because the two failures have different owners: a capture-stage problem is fixed by better cameras, exposure metering, and capture guidance, while a false-match disparity is the algorithm vendor's responsibility and is invisible unless you measure it per group. The asymmetry also means that a single aggregate FMR threshold delivers *unequal* security and usability across groups: a threshold tuned to $\text{FMR}=10^{-6}$ on the population average may sit at $10^{-5}$ for one subgroup and $10^{-7}$ for another.
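
The effect of one shared threshold on different groups can be measured with a few lines of code. This is a minimal sketch on synthetic impostor scores; the group labels `"A"` and `"B"`, the score distributions, and the helper names `fmr_by_group` and `worst_group_ratio` are all hypothetical, chosen only to illustrate the bookkeeping.

```python
import numpy as np


def fmr_by_group(impostor_scores, groups, tau):
    """Empirical FMR per group at one shared threshold tau."""
    scores = np.asarray(impostor_scores, dtype=float)
    groups = np.asarray(groups)
    return {
        str(g): float(np.mean(scores[groups == g] >= tau))
        for g in np.unique(groups)
    }


def worst_group_ratio(rates):
    """Ratio of the highest to the lowest per-group rate (inf if the lowest is 0)."""
    lo, hi = min(rates.values()), max(rates.values())
    return float("inf") if lo == 0 else hi / lo


# Synthetic impostor scores: group B's impostor distribution sits slightly
# higher, so the same threshold admits more false matches for B.
rng = np.random.default_rng(1)
scores = np.concatenate([
    rng.normal(0.05, 0.10, size=100_000),  # group A (assumed)
    rng.normal(0.10, 0.10, size=100_000),  # group B (assumed)
])
labels = np.array(["A"] * 100_000 + ["B"] * 100_000)

rates = fmr_by_group(scores, labels, tau=0.40)
print(rates, "worst/best ratio:", worst_group_ratio(rates))
```

The same per-group breakdown applies to FNMR on genuine pairs. A large worst-to-best ratio is the signal that an aggregate figure is hiding a disparity.
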
The best modern algorithms are markedly more equitable than the 2019 cohort, but the differentials have not vanished, which is why an eKYC deployment must monitor error rates *by demographic group*, not just in aggregate, and should consider whether a per-group or worst-group threshold policy is required by its fairness obligations.

## 4. Liveness and Presentation-Attack Detection

PAD determines whether the biometric sample comes from a live, present human. **Passive liveness** analyzes a single image or short clip with no user action (texture, moiré, reflectance, micro-physiology). **Active (challenge-response) liveness** prompts the user to blink, smile, turn the head, or follow on-screen color flashes (for reflectance analysis). It is more robust against simple replays but higher-friction, and it can still be spoofed by interactive deepfakes. The friction trade-off is real: passive liveness preserves a one-tap user experience and is preferred for high-volume consumer onboarding, while active challenges raise the bar against simple replays at the cost of abandonment and accessibility burden.

### 4.1 The Attack Taxonomy (ISO/IEC 30107)

ISO/IEC 30107-1 frames the threat. Presentation-attack instruments (PAIs) include print attacks (a photo on paper), replay attacks (a video on a screen), 3D masks (paper, resin, silicone), and increasingly digital/synthetic artefacts. An important scope point: classic PAD addresses artefacts presented *to the sensor*; **injection** attacks that bypass the sensor entirely (Section 5) fall outside the original presentation model. The standard also distinguishes **PAI species** (a specific kind of instrument, for example "silicone mask of subject X") from broader categories, because a detector that defeats one species says nothing about an unseen species. This is the generalization problem that recurs throughout this section.

### 4.2 Methods

- **Texture and image-quality** cues detect print and screen artefacts (moiré, banding, reduced micro-texture, color distortion).
- **Remote photoplethysmography (rPPG)** recovers the subtle pulse-driven skin-color signal, present in live faces and absent in prints and masks, but is computationally heavy and sensitive to motion and lighting.
- **Depth** exploits the planarity of prints and replays versus genuine 3D facial structure (stereo, structured light, or learned depth).
- **Deep-learning PAD** dominates: CNN/transformer classifiers, auxiliary-supervision models that jointly regress depth maps and rPPG signals (Liu et al., 2018), and generalization-focused approaches (domain adaptation, anomaly detection) to handle *unseen* attack types. This is the central open problem, since PAD models generalize poorly across datasets.

A useful way to read these methods is as a **defense-in-depth stack** rather than competitors: texture cues are cheap and catch crude prints, depth defeats flat replays, rPPG and motion defeat static masks, and a learned classifier fuses them. No single cue is robust on its own, and the strongest production systems combine several so that defeating the system requires simultaneously fooling independent physical signals.

### 4.3 Benchmarks

Canonical datasets include CASIA-FASD (50 subjects), Replay-Attack (Idiap; 1,200 videos), OULU-NPU (4,950 clips across four protocols isolating illumination, instrument, and camera), SiW (165 subjects), the multi-modal cross-ethnicity CASIA-SURF and CeFA, and CelebA-Spoof (~625k images, 10,177 subjects). The recurring lesson across community competitions is poor *cross-domain* generalization: a detector tuned on one dataset's attacks often fails on another's. The standard guard against fooling oneself is a **cross-dataset protocol**, which trains on one corpus and tests on a different one, since within-dataset accuracy systematically overstates real-world robustness.

## 5. Deepfakes, Morphing, and Injection Attacks

### 5.1 Face Morphing (the ID-issuance attack)

A *morph* blends two faces into one image that matches *both* contributing identities above threshold. Concretely, if $\hat{e}_A$ and $\hat{e}_B$ are the embeddings of two accomplices, a successful morph image $x_M$ has an embedding $\hat{e}_M$ for which both $s(\hat{e}_M, \hat{e}_A) \ge \tau$ and $s(\hat{e}_M, \hat{e}_B) \ge \tau$. If a passport is issued from a morphed photo, the two contributors share one credential, defeating downstream verification.

The **Morphing Attack Detection (MAD)** literature splits into *single-image MAD* (detect artefacts in one image) and *differential MAD* (compare the document image against a trusted live capture, which a morph matches less well than a genuine photo would). NIST runs FATE MORPH for independent benchmarking and in 2025 released a lay-language guide (NISTIR 8584). The primary *operational* defense is **live, supervised enrolment** at issuance, which removes the applicant-supplied photo that morphing exploits. The EU's Regulation 2019/1157, which covers national identity cards rather than passports, moves in this direction by requiring that biometric identifiers be collected by authorised staff. This is the cleanest illustration of a theme in this chapter: a process change at capture time can neutralize an attack that is hard to detect algorithmically after the fact.

### 5.2 Injection Attacks, the 2024 to 2025 Shift

The fastest-growing threat is not holding an artefact to a camera but **injecting** a synthetic video stream directly into the verification pipeline through a virtual camera, bypassing the physical sensor. Industry threat-intelligence reporting (for example iProov's 2025 report) documents steep year-over-year rises in virtual-camera and face-swap attacks and a growing share of verification attempts involving deepfakes.
*These vendor figures are directional threat intelligence, not peer-reviewed measurements, and should be cited with that caveat.* The structural point, however, is robust: because injection sidesteps the sensor, traditional presentation-attack detection does not cover it, since there is no physical instrument and no moiré, depth, or reflectance artefact to detect. Defenses therefore shift the trust anchor away from the image content and toward the *provenance* of the capture: **trusted capture**, device attestation, server-side imagery-integrity checks, and one-time challenge schemes (such as randomized screen-illumination patterns) that are hard to pre-render or replay, rather than analysis of the image alone.

The randomized-illumination idea is worth stating as a principle. If the server emits a fresh, unpredictable challenge (for example a sequence of screen colors) and verifies that the returned video shows the *correct* corresponding reflectance on the face in real time, then a pre-recorded or pre-rendered injection cannot anticipate the challenge and fails. The security rests on the challenge being unguessable and the response being verified within a tight time window, the same freshness logic that defeats replay in cryptographic authentication.

## 6. Standards and Certification

- **ISO/IEC 30107-3** specifies the testing and reporting methodology for PAD, defining **APCER** (attack presentation classification error rate, the fraction of attack presentations wrongly accepted as bona fide) and **BPCER** (bona-fide presentation classification error rate, the fraction of genuine presentations wrongly rejected) against defined attack species. These are the PAD analogues of FMR and FNMR: APCER is the security error and BPCER is the usability error.
  Critically, **APCER is reported per attack species and the worst species governs**, because a system is only as strong as its weakest covered attack; an average over species would let a strong defense against prints hide a weak defense against masks.
- **iBeta** (an NVLAP-accredited lab) is the dominant PAD test house. **Level 1** covers 2D attacks (prints, cutouts, screen replays) and requires 0% successful attacks across the battery; **Level 2** adds 3D attacks (silicone/resin/latex masks, wrapped 3D paper) with higher material budgets. Conformance is reported per ISO/IEC 30107-3.
- **FIDO Alliance Biometric Component Certification** independently certifies biometric subcomponents for both PAD and recognition performance via accredited labs, giving a vendor-neutral assurance bar.

A practical caution on reading certificates: a PAD certification is scoped to the **species tested**, the capture hardware, and the software version. It is evidence of competence against a defined battery, not a guarantee against novel attacks (injection in particular often sits outside the tested scope), so a certificate should be treated as a floor to build on, not a finished security argument.

## 7. Privacy, Bias, and Regulation

- **Template protection.** Raw embeddings are sensitive and partially invertible, so ISO/IEC 24745 (biometric information protection) and *cancelable biometrics* (irreversible, revocable transforms of the template), plus secure-element and fuzzy-vault schemes, are the standard mitigations. The defining requirements are **irreversibility, unlinkability, and renewability**: a protected template must be hard to invert back to a face, templates held by different services must not be linkable to one another, and a compromised template must be revocable and re-issuable as a fresh transform, which a raw face image can never be, since you cannot reissue someone's face.
- **EU AI Act (Article 5).** Prohibits certain biometric practices: real-time remote biometric identification in public spaces for law enforcement (with narrow, court-authorized exceptions); biometric *categorisation* inferring sensitive attributes; and emotion recognition in workplaces and education (with safety/medical carve-outs). Most verification and permitted remote-identification uses are pushed into the high-risk tier with conformity-assessment duties, a theme developed in the eKYC and business-applications chapters.
- **Biometric privacy law (US).** Illinois' Biometric Information Privacy Act (BIPA, 2008) is the template statute: it mandates informed written consent and retention/destruction policies for face-geometry data and uniquely provides a private right of action with statutory damages (\$1,000 per negligent, \$5,000 per reckless violation). It has driven nine-figure settlements (Facebook \$650M; Clearview AI), and a 2024 amendment limiting per-scan accumulation has shaped how damages are counted. Texas and Washington have analogous regimes enforced by the state attorney general rather than a private right of action.

### When to use, and pitfalls

Face verification with liveness is the right tool for **remote, document-bound identity proofing** (account opening, regulated onboarding) where a trusted reference portrait exists and a brief capture is acceptable. It is a poor fit where no trusted reference exists, where capture conditions cannot be controlled, or where the law forbids the use outright (Section 7).

Recurring pitfalls:

- **Liveness omitted or bolted on late.** Verification without PAD is not security; PAD must gate the match, not annotate it afterward.
- **Aggregate-only metrics.** A single population FMR/FNMR hides the demographic differentials of Section 3; monitor by group.
- **Within-dataset PAD evaluation.** Reporting accuracy on the same corpus the detector was tuned on overstates robustness; require cross-dataset and unseen-species testing.
- **Treating a certificate as complete.** iBeta or FIDO conformance covers tested species and versions only, and typically not injection.
- **Storing raw embeddings.** Templates are partially invertible and irrevocable; protect them under ISO/IEC 24745 so a breach can be remediated.

## 8. Conclusion

Face verification is a mature, independently benchmarked technology: margin-loss embeddings compared by cosine similarity, with error trade-offs made explicit by the DET curve and the discipline of fixing FMR and reporting FNMR, and a sobering, well-documented demographic gap that any deployment must monitor by group rather than in aggregate. But verification without liveness is no defense at all, and liveness is now an arms race: the threat moved from printed photos to 3D masks to, in 2024 to 2025, deepfake *injection* that bypasses the camera entirely, pushing the trust anchor toward device attestation and trusted capture. The two subsystems are measured with parallel metric pairs, FMR/FNMR for the matcher and APCER/BPCER for PAD, and the security of the whole is bounded by the weaker of them. With document reading (previous chapter) and biometric verification (this chapter) in hand, the next chapter assembles them into a complete eKYC system and confronts the regulation, risk scoring, and fairness obligations that govern it.

## References

1. Deng, J. et al. "ArcFace: Additive Angular Margin Loss for Deep Face Recognition." CVPR 2019. https://arxiv.org/abs/1801.07698
2. Schroff, F., Kalenichenko, D., Philbin, J. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015. https://arxiv.org/abs/1503.03832
3. Wang, H. et al. "CosFace: Large Margin Cosine Loss for Deep Face Recognition." CVPR 2018. https://arxiv.org/abs/1801.09414
4. Liu, W. et al. "SphereFace: Deep Hypersphere Embedding for Face Recognition." CVPR 2017. https://arxiv.org/abs/1704.08063
5. NIST. "FRVT Part 3: Demographic Effects" (NISTIR 8280). December 2019. https://nvlpubs.nist.gov/nistpubs/ir/2019/nist.ir.8280.pdf
6. NIST. "FRVT Part 8 / Demographic Effects update" (NISTIR 8429). https://pages.nist.gov/frvt/reports/demographics/nistir_8429.pdf
7. ISO/IEC 30107-3. *Biometric presentation attack detection, Testing and reporting.* https://www.iso.org/standard/79520.html
8. Liu, Y., Jourabloo, A., Liu, X. "Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision." CVPR 2018. https://arxiv.org/abs/1803.11097
9. NIST FATE MORPH and NISTIR 8584 (morph-attack detection). https://pages.nist.gov/frvt/html/frvt_morph.html
10. iProov. "Threat Intelligence Report 2025" (injection/deepfake trends; vendor threat intelligence). https://www.iproov.com/reports/threat-intelligence-report-2025-remote-identity-attack
11. FIDO Alliance. "Biometric Component Certification." https://fidoalliance.org/certification/biometric-component-certification/
12. EU AI Act, Article 5 (prohibited practices). https://artificialintelligenceact.eu/article/5/
