190 Loss Functions Beyond Cross-Entropy

Cross-entropy is the default objective for classification, and for good reason. It is the maximum likelihood objective for a categorical model, it is convex in the logits for a fixed target, and its gradient with respect to the logits has the clean form $\hat{p} - y$. Yet a great many modern learning problems are poorly served by plain cross-entropy. Detectors must learn from images where background pixels outnumber objects by a thousand to one. Retrieval systems must place semantically similar items near one another in an embedding space without any fixed label set. Recommender systems care only about the relative order of items, not their absolute scores. This chapter develops a family of loss functions that address these regimes: focal loss for extreme class imbalance, contrastive and triplet losses for metric learning, the InfoNCE loss that underpins modern self-supervised and multimodal representation learning, and the broader class of margin and ranking losses.

190.1 1. The Limits of Cross-Entropy

For a single example with one-hot label $y$ and predicted distribution $\hat{p} = \mathrm{softmax}(z)$ over $K$ classes, cross-entropy is

\[ \mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_{c}, \]

where $c$ is the true class. The objective is calibrated and well behaved, but it carries two implicit assumptions that fail in practice. First, it treats every example as equally important. Under heavy class imbalance, the aggregate loss is dominated by the majority class, and the gradient signal from rare classes is drowned out even when each rare example is poorly classified. Second, cross-entropy is fundamentally a per example classification objective tied to a fixed label vocabulary. It says nothing about the geometry of the representation space and cannot express a goal like “embed these two augmented views of the same image close together.”

To see the first failure quantitatively, suppose a detector sees $10^4$ background anchors, each confidently correct at $\hat{p}_t = 0.99$, alongside $10$ foreground objects that the model finds hard at $\hat{p}_t = 0.3$. Each easy negative contributes $-\log 0.99 \approx 0.01$ to the loss, so the negatives together contribute about $100$. The ten hard positives contribute $10 \times (-\log 0.3) \approx 12$. The background, though individually trivial, supplies roughly eight times the loss of the hard positives and dominates the gradient. No tuning of the learning rate fixes this; the problem is the shape of the objective, not its scale. This single observation motivates focal loss in Section 190.2.
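The arithmetic above is easy to reproduce. The following sketch uses the same counts and probabilities as the paragraph; the variable names are illustrative only.

```python
import math

n_neg, p_neg = 10_000, 0.99   # easy background anchors
n_pos, p_pos = 10, 0.3        # hard foreground objects

loss_neg = n_neg * -math.log(p_neg)   # total cross-entropy from the negatives
loss_pos = n_pos * -math.log(p_pos)   # total cross-entropy from the positives
ratio = loss_neg / loss_pos

print(f"negatives: {loss_neg:.1f}, positives: {loss_pos:.1f}, ratio: {ratio:.1f}")
# prints "negatives: 100.5, positives: 12.0, ratio: 8.3"
```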

The losses below relax one or both assumptions. Focal loss keeps the classification framing but reweights examples by difficulty. Metric and contrastive losses abandon the fixed label set entirely and instead shape pairwise or higher order relationships in embedding space.

190.2 2. Focal Loss for Imbalance

190.2.1 2.1 Motivation and definition

Consider binary detection, in which each anchor is labeled foreground ($y = 1$) or background ($y = 0$), and let $p_t$ denote the model’s estimated probability of the ground-truth class,

\[ p_t = \begin{cases} \hat{p} & \text{if } y = 1, \\ 1 - \hat{p} & \text{if } y = 0. \end{cases} \]

Binary cross-entropy is $\mathcal{L}_{\mathrm{CE}} = -\log p_t$. The trouble in dense detection is that a flood of easy negatives, each with $p_t$ near $1$, still contributes a small but nonzero loss $-\log p_t$. Summed over tens of thousands of background anchors, these small contributions overwhelm the loss from a handful of hard, informative examples.

Focal loss, introduced by Lin et al. for the RetinaNet detector, multiplies cross-entropy by a modulating factor that decays as confidence grows:

\[ \mathcal{L}_{\mathrm{FL}} = -\alpha_t (1 - p_t)^{\gamma} \log p_t . \]

The focusing parameter $\gamma \geq 0$ controls how aggressively easy examples are down-weighted, and $\alpha_t \in [0,1]$ is an optional class-balancing weight analogous to a per-class prior correction.

190.2.2 2.2 How the modulating factor behaves

When an example is misclassified and $p_t$ is small, the factor $(1 - p_t)^{\gamma}$ is close to $1$ and the loss is essentially unchanged from cross-entropy. When an example is easy and $p_t \to 1$, the factor goes to zero and the loss is sharply suppressed. With $\gamma = 2$, a confident example at $p_t = 0.9$ has its loss scaled by $(0.1)^2 = 0.01$, a hundredfold reduction, while a hard example at $p_t = 0.1$ is scaled by $(0.9)^2 = 0.81$, almost untouched. The net effect is to refocus training on the hard minority. Setting $\gamma = 0$ recovers ordinary $\alpha$-weighted cross-entropy.
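To see what the modulating factor does to the imbalance example from Section 190.1, the sketch below recomputes the two totals with $\gamma = 2$. It assumes $\alpha_t = 1$ so that only the focusing term is at work; `focal_term` is an illustrative helper, not a library function.

```python
import math

def focal_term(p_t, gamma=2.0):
    """Focal loss with alpha_t = 1 for a true-class probability p_t."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_factor = (1 - 0.9) ** 2   # modulating factor at p_t = 0.9
hard_factor = (1 - 0.1) ** 2   # modulating factor at p_t = 0.1

# Same setup as before: 10^4 negatives at 0.99 and 10 positives at 0.3.
fl_neg = 10_000 * focal_term(0.99)
fl_pos = 10 * focal_term(0.3)

print(f"negatives: {fl_neg:.4f}, positives: {fl_pos:.2f}")
```

Under these assumptions the negatives' total falls from about $100$ to about $0.01$, while the positives retain roughly half of their cross-entropy total, so the balance of the loss flips toward the hard examples.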

The gradient also illuminates the mechanism. Writing $\mathcal{L}_{\mathrm{FL}}$ as a function of the logit $z$ for the positive class, one finds

\[ \frac{\partial \mathcal{L}_{\mathrm{FL}}}{\partial z} = \alpha_t (1 - p_t)^{\gamma}\Big( \gamma\, p_t \log p_t + p_t - 1 \Big), \]

so the per-example gradient magnitude is itself attenuated by $(1 - p_t)^{\gamma}$ for easy examples, ensuring they dominate neither the loss nor the update direction. To make the derivation concrete, recall that for a binary logistic output $p_t = \sigma(z)$ on a positive example we have $\partial p_t / \partial z = p_t (1 - p_t)$ and $\partial \log p_t / \partial z = 1 - p_t$. Differentiating $-(1-p_t)^{\gamma}\log p_t$ by the product rule gives a term from the modulating factor, $\gamma (1-p_t)^{\gamma-1} p_t (1-p_t) \log p_t = \gamma (1-p_t)^{\gamma} p_t \log p_t$, and a term from the log, $-(1-p_t)^{\gamma}(1-p_t)$. Collecting them and multiplying by $\alpha_t$ yields the expression above. Both terms are nonpositive, since $\log p_t \leq 0$ and $p_t \leq 1$, so a gradient step always raises the true-class logit. For $\gamma > 0$ the overall factor $(1-p_t)^{\gamma}$ guarantees that as $p_t \to 1$ the gradient vanishes faster than under cross-entropy, which is precisely the desired behavior.
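A quick way to gain confidence in the closed-form gradient is to compare it against a central finite difference. The sketch below does this for a positive example with $p_t = \sigma(z)$; the function names and the step size `h` are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def focal(z, gamma=2.0, alpha_t=1.0):
    """Focal loss of a positive example as a function of its logit z."""
    p_t = sigmoid(z)
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

def focal_grad(z, gamma=2.0, alpha_t=1.0):
    """Closed-form derivative of focal(z) with respect to z."""
    p_t = sigmoid(z)
    return alpha_t * (1 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1)

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference approximation of df/dz."""
    return (f(z + h) - f(z - h)) / (2 * h)

test_points = (-3.0, -1.0, 0.0, 1.0, 3.0)
max_err = max(abs(focal_grad(z) - numeric_grad(focal, z)) for z in test_points)
print(f"largest discrepancy: {max_err:.2e}")
```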

The pipeline that turns raw logits into a focal contribution is summarized below.

flowchart LR
  A["logit z"] --> B["probability p_t"]
  B --> C["cross-entropy term: minus log p_t"]
  B --> D["modulating factor: one minus p_t, raised to gamma"]
  D --> E["class weight alpha_t"]
  C --> F["focal loss: alpha_t times factor times CE"]
  E --> F

Figure 190.1: How focal loss reshapes the per example contribution from logits to weighted loss.

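# Binary focal loss for predicted probabilities p and labels y in {0, 1} (NumPy sketch)
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)                 # keep log() finite at p = 0 or 1
    p_t = p * y + (1 - p) * (1 - y)              # probability of the true class
    alpha_t = alpha * y + (1 - alpha) * (1 - y)  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)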

190.2.3 2.3 Practical notes

A typical RetinaNet configuration uses $\gamma = 2$ and $\alpha = 0.25$. Because $\alpha$ weights the positive class, positives receive weight $0.25$ and negatives $1 - \alpha = 0.75$. The $\alpha$ term is set below $0.5$ because, somewhat counterintuitively, once the focusing term has down-weighted easy negatives, a mild up-weighting of the abundant negative class stabilizes the total loss scale. Focal loss has since been generalized: the quality focal and varifocal variants extend the idea to continuous targets and joint classification plus localization-quality estimation, and class-balanced focal loss combines it with an effective-number-of-samples reweighting for long-tailed recognition.

190.3 3. Metric Learning with Contrastive and Triplet Losses

When the goal is a representation in which distance encodes semantic similarity, we leave classification behind. Let $f_\theta(\cdot)$ map an input to an embedding $\mathbf{e} \in \mathbb{R}^d$, often $\ell_2$ normalized so that $\lVert \mathbf{e} \rVert = 1$. Metric learning objectives operate on pairs or triplets of embeddings.

190.3.1 3.1 The contrastive (pairwise) loss

The classic contrastive loss of Hadsell, Chopra and LeCun takes a pair $(\mathbf{e}_i, \mathbf{e}_j)$ with a binary label $Y = 0$ if the pair is similar and $Y = 1$ if dissimilar, and a distance $D = \lVert \mathbf{e}_i - \mathbf{e}_j \rVert_2$:

\[ \mathcal{L}_{\mathrm{contrast}} = (1 - Y)\,\tfrac{1}{2} D^2 \;+\; Y\,\tfrac{1}{2}\,\big[\max(0,\, m - D)\big]^2 . \]

Similar pairs are simply pulled together by minimizing $D^2$. Dissimilar pairs are pushed apart, but only until their distance reaches the margin $m$; beyond that, a dissimilar pair contributes no loss and no gradient. The margin prevents the model from wasting capacity scattering already well separated negatives infinitely far apart.
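A minimal sketch of the pairwise loss on plain Python lists follows. The helper names and the default margin of $1.0$ are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(e_i, e_j, Y, margin=1.0):
    """Y = 0 for a similar pair, Y = 1 for a dissimilar pair."""
    D = euclidean(e_i, e_j)
    pull = 0.5 * D ** 2                        # similar pairs: shrink the distance
    push = 0.5 * max(0.0, margin - D) ** 2     # dissimilar pairs: active only inside the margin
    return (1 - Y) * pull + Y * push

near = contrastive_loss([0.0, 0.0], [0.3, 0.4], Y=1)   # D = 0.5, inside the margin
far = contrastive_loss([0.0, 0.0], [3.0, 4.0], Y=1)    # D = 5.0, beyond the margin
```

Here `near` is positive and `far` is exactly zero, which is the no-loss, no-gradient regime described above.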

190.3.2 3.2 The triplet loss

The triplet loss, popularized by FaceNet, replaces absolute distances with a relative comparison. Each training unit is an anchor $a$, a positive $p$ of the same class, and a negative $n$ of a different class. The objective requires the anchor to positive distance to be smaller than the anchor to negative distance by at least a margin $m$:

\[ \mathcal{L}_{\mathrm{triplet}} = \max\!\big(0,\; D(a,p)^2 - D(a,n)^2 + m \big). \]

This relative formulation is more flexible than the pairwise loss because it never imposes an absolute target distance; it only constrains orderings, which is usually what downstream retrieval cares about.
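The triplet objective is equally short to write down. The sketch below follows the squared-distance form of the equation; the default margin of $0.2$ is an illustrative choice.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(a, p, n, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative squared distances."""
    return max(0.0, sq_dist(a, p) - sq_dist(a, n) + margin)

satisfied = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])   # negative far away
violated = triplet_loss([0.0, 0.0], [0.5, 0.0], [0.6, 0.0])    # negative barely farther
```

The first triplet already satisfies the margin and yields zero loss; the second has the correct ordering but too small a gap, so it still contributes.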

190.3.3 3.3 The central role of mining

The triplet loss is only as good as the triplets fed to it. The vast majority of randomly sampled triplets already satisfy the margin and yield zero gradient, so naive sampling stalls. Define a triplet as hard when $D(a,n) < D(a,p)$ (the negative is closer than the positive) and semi-hard when

\[ D(a,p) < D(a,n) < D(a,p) + m, \]

meaning the negative is farther than the positive but still inside the margin. Training exclusively on the hardest negatives tends to be unstable and can collapse the embedding, because the hardest negatives are often label noise or genuinely ambiguous. FaceNet therefore mines semi-hard negatives within each minibatch, a robust compromise that supplies a useful gradient without chasing pathological examples. Batch-hard mining, in which one selects the hardest positive and hardest negative for each anchor within the batch, is another widely used strategy in person re-identification.
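The hard and semi-hard definitions translate directly into a small classifier over distances. In the sketch below, the label "easy" for triplets that already satisfy the margin, the treatment of exact ties as hard, and both function names are assumptions made for illustration.

```python
def triplet_category(d_ap, d_an, margin):
    """Classify a triplet from its anchor-positive and anchor-negative distances."""
    if d_an <= d_ap:
        return "hard"        # negative at least as close as the positive (ties counted here)
    if d_an < d_ap + margin:
        return "semi-hard"   # correct order, but still inside the margin
    return "easy"            # margin satisfied: zero loss, zero gradient

def semi_hard_negatives(d_ap, d_an_list, margin):
    """Indices of candidate negatives that are semi-hard for one anchor-positive pair."""
    return [j for j, d_an in enumerate(d_an_list)
            if triplet_category(d_ap, d_an, margin) == "semi-hard"]

picked = semi_hard_negatives(1.0, [0.8, 1.1, 1.5, 1.19], margin=0.2)
```

A minibatch miner in the FaceNet style would apply such a filter per anchor and then sample from the surviving indices.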

190.4 4. The InfoNCE Loss

190.4.1 4.1 From pairs to a classification over a set

Triplet loss compares each anchor against exactly one negative at a time. Modern contrastive representation learning generalizes this to many negatives at once and frames the problem as a softmax classification: given an anchor, identify its single positive among a pool of one positive and many negatives. This is the InfoNCE loss (Noise Contrastive Estimation in the information-theoretic sense), introduced by van den Oord et al. for Contrastive Predictive Coding.

Let $\mathbf{q}$ be a query embedding, $\mathbf{k}^{+}$ its positive key, and $\{\mathbf{k}^{-}_j\}_{j=1}^{N}$ a set of negative keys. With a similarity function $s(\cdot,\cdot)$, usually cosine similarity, and a temperature $\tau > 0$,

\[ \mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\big(s(\mathbf{q}, \mathbf{k}^{+})/\tau\big)} {\exp\!\big(s(\mathbf{q}, \mathbf{k}^{+})/\tau\big) + \sum_{j=1}^{N} \exp\!\big(s(\mathbf{q}, \mathbf{k}^{-}_j)/\tau\big)} . \]

This is exactly the cross-entropy of an $(N+1)$ way classifier whose logits are scaled similarities and whose correct answer is always the positive. The structure unifies the earlier sections: InfoNCE is a softmax cross-entropy, but over a dynamically constructed set of candidates in embedding space rather than over a fixed label vocabulary.

190.4.2 4.2 The mutual information bound

InfoNCE is not just a convenient surrogate; minimizing it maximizes a lower bound on the mutual information between query and positive. Specifically,

\[ I(\mathbf{q}; \mathbf{k}^{+}) \;\geq\; \log(N+1) - \mathcal{L}_{\mathrm{InfoNCE}} . \]

The bound tightens as the number of negatives $N$ grows, which gives a principled reason to want many negatives. The term $\log(N+1)$ also caps how much information any single InfoNCE term can certify: with a batch of $256$ candidates the bound saturates near $\log 256 \approx 5.5$ nats, so contrastive pretraining that needs to capture high mutual information must either grow the candidate pool or accept that the bound is loose. This insight motivated two responses: scaling the batch itself, and architectures that decouple the negative pool size from the gradient batch size. SimCLR takes the first route, using very large batches so that all other examples serve as in-batch negatives. MoCo takes the second, maintaining a momentum-updated encoder and a queue of past keys, which provides a large and consistent dictionary of negatives without the memory cost of large batches.
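The cap is simple to compute. The helper below evaluates the right-hand side of the bound in nats; its name is illustrative, and the example loss value of $1.2$ is made up purely to show the arithmetic.

```python
import math

def mi_lower_bound(loss, num_negatives):
    """Right-hand side of the InfoNCE bound, log(N + 1) - loss, in nats."""
    return math.log(num_negatives + 1) - loss

cap_256 = mi_lower_bound(0.0, 255)      # best case with 256 candidates
cap_65536 = mi_lower_bound(0.0, 65535)  # best case with 65,536 candidates
example = mi_lower_bound(1.2, 255)      # hypothetical loss value of 1.2 nats
```

Because the loss is nonnegative, the certified value can never exceed $\log(N+1)$, which grows only logarithmically with the size of the candidate pool.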

190.4.3 4.3 Temperature and its effect

The temperature $\tau$ controls the sharpness of the distribution over candidates. A small $\tau$ sharpens the softmax, concentrating gradient on the hardest negatives, those most similar to the query, which improves discrimination but risks instability. A large $\tau$ softens the distribution and treats negatives more uniformly. Empirically $\tau$ in the range $0.05$ to $0.2$ is common, and the choice materially affects the learned uniformity and separability of the embedding space.

A small worked example makes the role of $\tau$ tangible. Take an anchor with cosine similarity $0.8$ to its positive and $0.6$ and $0.1$ to two negatives. At $\tau = 0.1$ the scaled logits are $8.0$, $6.0$, and $1.0$; the softmax probability mass on the positive is about $0.88$, so the loss is $-\log 0.88 \approx 0.13$ and the hard negative at $0.6$ absorbs almost all of the residual gradient. At $\tau = 0.5$ the logits become $1.6$, $1.2$, and $0.2$; the positive probability falls to about $0.52$, the loss rises to $\approx 0.65$, and the easy negative at $0.1$ now receives a nontrivial share of the push. Lower temperature thus concentrates effort on the most confusable negatives, at the cost of larger gradients that can destabilize early training.
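The worked example can be checked with a few lines of standard-library Python. The single-query helper below is an illustrative sketch that takes precomputed similarities rather than embeddings.

```python
import math

def info_nce(pos_sim, neg_sims, tau):
    """InfoNCE for one query, given its positive and negative similarities."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)                                   # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]                # -log softmax probability of the positive

loss_sharp = info_nce(0.8, [0.6, 0.1], tau=0.1)
loss_soft = info_nce(0.8, [0.6, 0.1], tau=0.5)
print(f"tau=0.1: {loss_sharp:.2f}, tau=0.5: {loss_soft:.2f}")
# prints "tau=0.1: 0.13, tau=0.5: 0.65"
```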

# InfoNCE for a batch of L2-normalized embeddings (PyTorch sketch)
import torch
import torch.nn.functional as F

def info_nce_batch(q, k, tau=0.1):
    # q, k: (batch, dim); positives are the diagonal pairs
    logits = (q @ k.T) / tau                             # (batch, batch) similarity matrix
    labels = torch.arange(q.shape[0], device=q.device)   # positive for row i is column i
    return F.cross_entropy(logits, labels)               # standard softmax CE, averaged over rows

190.4.4 4.4 Supervised contrastive and multimodal variants

When labels are available, the supervised contrastive loss (SupCon) treats all examples sharing a label as positives for one another, generalizing InfoNCE to multiple positives per anchor and often outperforming cross-entropy on classification. In the multimodal setting, CLIP applies a symmetric InfoNCE over image and text embeddings: each image is matched against all texts in the batch and vice versa, with a learnable temperature. This single objective, scaled to hundreds of millions of pairs, yields the transferable image-text representations that anchor much of contemporary multimodal modeling.
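The symmetric form can be sketched over a precomputed similarity matrix. This is a simplified illustration, not CLIP's implementation: the temperature is held fixed rather than learned, and the function names are assumptions.

```python
import math

def log_softmax_row(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(sim, tau=0.07):
    """sim[i][j] is the similarity of image i and text j; matched pairs lie on the diagonal."""
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]     # transpose: text-to-image direction
    loss_i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)                 # average of the two directions

aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]])
uninformative = symmetric_info_nce([[0.5, 0.5], [0.5, 0.5]])
```

A perfectly aligned batch drives the loss toward zero, while a similarity matrix with no structure gives $\log 2$ for a batch of two, the loss of a uniform guess.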

190.5 5. Margin and Ranking Losses

190.5.1 5.1 The hinge and margin principle

Several losses above share a common ingredient: a margin enforced through a hinge $\max(0, \cdot)$. The canonical example is the binary hinge loss of the support vector machine. For label $y \in \{-1, +1\}$ and score $s$,

\[ \mathcal{L}_{\mathrm{hinge}} = \max(0,\; 1 - y\,s). \]

The loss is zero once the example is on the correct side of the decision boundary by at least a unit margin, and grows linearly inside the margin. This produces sparse, robust gradients and is the discriminative counterpart to the smooth, always nonzero gradient of cross-entropy.
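The contrast in gradient behavior is visible in a few lines. The sketch below compares the hinge subgradient with the gradient of the logistic loss $\log(1 + e^{-ys})$, which is binary cross-entropy written in terms of a score; the function names are illustrative.

```python
import math

def hinge_loss(y, s):
    return max(0.0, 1.0 - y * s)

def hinge_grad(y, s):
    """Subgradient of the hinge loss with respect to the score s."""
    return -y if y * s < 1.0 else 0.0

def logistic_grad(y, s):
    """Gradient of log(1 + exp(-y * s)) with respect to the score s."""
    return -y / (1.0 + math.exp(y * s))

inside = (hinge_loss(1, 0.5), hinge_grad(1, 0.5))     # within the margin: active
outside = (hinge_loss(1, 2.0), hinge_grad(1, 2.0))    # beyond the margin: silent
still_pushing = logistic_grad(1, 2.0)                 # small but nonzero
```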

190.5.2 5.2 Pairwise ranking losses

Ranking problems care about relative order. Given a query and a pair of items, one preferred (positive) with score $s^{+}$ and one not (negative) with score $s^{-}$, the margin ranking loss enforces a gap:

\[ \mathcal{L}_{\mathrm{rank}} = \max\!\big(0,\; m - (s^{+} - s^{-})\big). \]

A probabilistic alternative is the RankNet loss, which models the probability that the positive outranks the negative with a logistic function of the score difference and applies cross-entropy to it:

\[ \mathcal{L}_{\mathrm{RankNet}} = -\log \sigma\big(s^{+} - s^{-}\big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} . \]
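Both pairwise losses depend only on the score difference, as the following sketch shows. The RankNet form uses the identity $-\log \sigma(x) = \log(1 + e^{-x})$, rearranged so that large score gaps do not overflow; the default margin is an illustrative choice.

```python
import math

def margin_ranking_loss(s_pos, s_neg, margin=1.0):
    """Zero once the preferred item leads by at least the margin."""
    return max(0.0, margin - (s_pos - s_neg))

def ranknet_loss(s_pos, s_neg):
    """-log(sigmoid(s_pos - s_neg)), computed in an overflow-safe form."""
    x = s_pos - s_neg
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

tie = ranknet_loss(0.0, 0.0)             # equal scores: a coin flip
correct = ranknet_loss(2.0, 0.5)         # preferred item ahead
badly_wrong = ranknet_loss(-1000.0, 0.0) # stays finite despite the huge gap
```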

190.5.3 5.3 From pairs to lists

Optimizing pairwise order does not directly optimize list-level retrieval metrics such as normalized discounted cumulative gain (NDCG), which weight the top of the ranking far more heavily than the bottom. LambdaRank addresses this by scaling each pairwise gradient by the change in the target metric that would result from swapping the two items, so that pairs whose reordering most affects the top of the list receive the strongest updates. LambdaMART instantiates this idea inside gradient-boosted trees and long remained a strong baseline for learning to rank. For applications where only a single relevant item exists per query, such as entity retrieval, the multiclass softmax over candidates, which is exactly InfoNCE, can be read as a listwise objective.
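The swap-based weighting can be illustrated with a small sketch. It rests on several assumptions that are not taken from the text: the lists are already sorted by current score so that index equals rank, the gain is $2^{\mathrm{rel}} - 1$ with a $\log_2$ discount (one common NDCG convention), at least one item is relevant, and all function names are hypothetical.

```python
import math

def dcg_gain(rel, rank):
    """Contribution of an item with relevance rel at 1-based rank."""
    return (2 ** rel - 1) / math.log2(rank + 1)

def delta_ndcg(rels, i, j):
    """Absolute change in NDCG if the items at 0-based positions i and j swap places."""
    ideal = sum(dcg_gain(r, k + 1) for k, r in enumerate(sorted(rels, reverse=True)))
    before = dcg_gain(rels[i], i + 1) + dcg_gain(rels[j], j + 1)
    after = dcg_gain(rels[j], i + 1) + dcg_gain(rels[i], j + 1)
    return abs(after - before) / ideal

def lambda_weight(scores, rels, i, j):
    """Pairwise gradient magnitude for a pair in which item i is more relevant than item j."""
    ranknet_grad = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))   # |d/ds_i of -log sigmoid|
    return ranknet_grad * delta_ndcg(rels, i, j)

rels = [1, 0, 0, 1]              # relevance labels in current ranked order
scores = [3.0, 2.0, 1.0, 0.0]
top_swap = delta_ndcg(rels, 0, 1)      # reordering at ranks 1 and 2
bottom_swap = delta_ndcg(rels, 3, 2)   # reordering at ranks 3 and 4
```

In this toy list, the swap near the top changes NDCG several times more than the swap near the bottom, which is the effect the weighting is meant to capture.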

190.5.4 5.4 Margins on the sphere for face recognition

A distinct and influential line of work injects margins directly into the softmax classifier for embedding learning. ArcFace normalizes both the weight vectors and the features, so each logit becomes the cosine of the angle between a feature and its class prototype, then adds an additive angular margin $m$ to the true class angle $\theta_y$:

\[ \mathcal{L}_{\mathrm{ArcFace}} = -\log \frac{\exp\!\big(s \cos(\theta_y + m)\big)} {\exp\!\big(s \cos(\theta_y + m)\big) + \sum_{j \neq y} \exp\!\big(s \cos \theta_j\big)} . \]

The scale $s$ plays the role of an inverse temperature on the unit sphere. By demanding a geodesic margin between classes, ArcFace produces highly discriminative, well separated embeddings and became a standard for face recognition. It is a revealing synthesis: a softmax cross-entropy at heart, but reshaped by the margin and metric learning ideas of the preceding sections.
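For a single example the ArcFace loss reduces to a softmax cross-entropy with one modified logit. The sketch below takes precomputed cosines; the defaults $s = 64$ and $m = 0.5$ follow values commonly reported for ArcFace, and the handling of angles where $\theta_y + m$ exceeds $\pi$, which practical implementations address, is deliberately omitted.

```python
import math

def arcface_loss(cosines, y, s=64.0, m=0.5):
    """cosines[j] = cos(theta_j) between a normalized feature and class prototype j."""
    theta_y = math.acos(max(-1.0, min(1.0, cosines[y])))   # clamp guards against rounding
    logits = [s * c for c in cosines]
    logits[y] = s * math.cos(theta_y + m)                  # additive angular margin on the true class
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return lse - logits[y]

cosines = [0.8, 0.1, -0.2]
with_margin = arcface_loss(cosines, y=0, s=10.0, m=0.5)
without_margin = arcface_loss(cosines, y=0, s=10.0, m=0.0)   # plain normalized softmax
```

With the margin switched on, the same embedding incurs a larger loss, so the model must pull the feature closer to its prototype to reach the same confidence.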

190.6 6. Choosing an Objective

The following table maps problem structure to a default objective and its dominant practical lever.

| Problem structure | Default loss | Key knob | Main pitfall |
|---|---|---|---|
| Balanced classification, fixed labels | Cross-entropy | label smoothing | overconfidence |
| Severe imbalance, dense prediction | Focal loss | focusing $\gamma$, balance $\alpha$ | over-down-weighting hard positives |
| Embedding geometry, scarce labels | Contrastive or triplet | margin $m$, negative mining | mining collapse, dead triplets |
| Embedding geometry, many negatives | InfoNCE | temperature $\tau$, negatives $N$ | small $N$ loosens the MI bound |
| Relative order matters | Ranking or LambdaRank | margin, metric weighting | pairwise objective ignores list position |
| Open-set verification | ArcFace | angular margin $m$, scale $s$ | overlarge margin stalls convergence |

A few cross-cutting pitfalls deserve emphasis. Margin-based losses produce exactly zero gradient once the margin is satisfied, so a poorly chosen margin or a stale set of pairs can silently halt learning while the loss curve looks healthy; monitoring the fraction of active (nonzero-loss) pairs is a cheap and reliable diagnostic. Contrastive objectives are acutely sensitive to false negatives, since two samples that share a latent class but no label are pushed apart by construction, which is one reason supervised positives help when labels exist. Finally, focal loss assumes the easy examples really are correct; under heavy label noise its down-weighting can starve the model of signal, so it pairs best with reasonably clean annotations. Mature open-source implementations of many of these losses ship with torchvision, timm, pytorch-metric-learning, and sentence-transformers, and reaching for a tested implementation is preferable to re-deriving the numerically delicate stable forms by hand.
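The active-fraction diagnostic mentioned above takes only a few lines. The helper name and the optional tolerance are illustrative assumptions.

```python
def active_fraction(losses, tol=0.0):
    """Share of pairs or triplets whose margin loss is still above tol."""
    if not losses:
        return 0.0
    return sum(1 for value in losses if value > tol) / len(losses)

batch_losses = [0.0, 0.3, 0.0, 0.1]        # per-triplet losses from one minibatch
share = active_fraction(batch_losses)      # 0.5: half the triplets still carry gradient
```

Logging this value alongside the mean loss makes it apparent when a falling loss reflects genuine progress and when it merely reflects a shrinking set of active examples.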

The losses in this chapter are not competitors so much as tools matched to problem structure. If the task is classification with a fixed label set and roughly balanced classes, cross-entropy remains the right default. Under severe imbalance, especially dense prediction, focal loss restores a useful gradient from the rare, hard examples. When the goal is an embedding space whose geometry encodes similarity, and labels are scarce or open ended, contrastive, triplet, and InfoNCE losses shape that geometry directly, with the number and quality of negatives being the dominant practical lever. When relative order is what matters, margin and ranking losses optimize it explicitly, and listwise refinements align training with the metrics that actually govern retrieval quality. A recurring theme ties them together. Most of these objectives can be read as a softmax cross-entropy that has been reweighted by difficulty, restructured over a dynamic candidate set, or augmented with an explicit margin. Understanding that common skeleton is the surest guide to selecting and adapting a loss for a new problem.

190.7 References

Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P. “Focal Loss for Dense Object Detection.” ICCV 2017. https://arxiv.org/abs/1708.02002
Hadsell, R., Chopra, S., LeCun, Y. “Dimensionality Reduction by Learning an Invariant Mapping.” CVPR 2006. https://ieeexplore.ieee.org/document/1640964
Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015. https://arxiv.org/abs/1503.03832
Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” 2017. https://arxiv.org/abs/1703.07737
van den Oord, A., Li, Y., Vinyals, O. “Representation Learning with Contrastive Predictive Coding.” 2018. https://arxiv.org/abs/1807.03748
Chen, T., Kornblith, S., Norouzi, M., Hinton, G. “A Simple Framework for Contrastive Learning of Visual Representations (SimCLR).” ICML 2020. https://arxiv.org/abs/2002.05709
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R. “Momentum Contrast for Unsupervised Visual Representation Learning (MoCo).” CVPR 2020. https://arxiv.org/abs/1911.05722
Khosla, P., et al. “Supervised Contrastive Learning.” NeurIPS 2020. https://arxiv.org/abs/2004.11362
Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision (CLIP).” ICML 2021. https://arxiv.org/abs/2103.00020
Burges, C., et al. “Learning to Rank using Gradient Descent (RankNet).” ICML 2005. https://www.microsoft.com/en-us/research/publication/learning-to-rank-using-gradient-descent/
Burges, C. “From RankNet to LambdaRank to LambdaMART: An Overview.” Microsoft Research Technical Report, 2010. https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/
Deng, J., Guo, J., Xue, N., Zafeiriou, S. “ArcFace: Additive Angular Margin Loss for Deep Face Recognition.” CVPR 2019. https://arxiv.org/abs/1801.07698

# Loss Functions Beyond Cross-Entropy Cross-entropy is the default objective for classification, and for good reason. It is the maximum likelihood objective for a categorical model, it is convex in the logits for a fixed target, and its gradient with respect to the logits has the clean form $\hat{p} - y$. Yet a great many modern learning problems are poorly served by plain cross-entropy. Detectors must learn from images where background pixels outnumber objects by a thousand to one. Retrieval systems must place semantically similar items near one another in an embedding space without any fixed label set. Recommender systems care only about the relative order of items, not their absolute scores. This chapter develops a family of loss functions that address these regimes: focal loss for extreme class imbalance, contrastive and triplet losses for metric learning, the InfoNCE loss that underpins modern self-supervised and multimodal representation learning, and the broader class of margin and ranking losses. ## 1. The Limits of Cross-Entropy For a single example with one-hot label $y$ and predicted distribution $\hat{p} = \mathrm{softmax}(z)$ over $K$ classes, cross-entropy is $$ \mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k = -\log \hat{p}_{c}, $$ where $c$ is the true class. The objective is calibrated and well behaved, but it carries two implicit assumptions that fail in practice. First, it treats every example as equally important. Under heavy class imbalance, the aggregate loss is dominated by the majority class, and the gradient signal from rare classes is drowned out even when each rare example is poorly classified. Second, cross-entropy is fundamentally a per example classification objective tied to a fixed label vocabulary. It says nothing about the geometry of the representation space and cannot express a goal like "embed these two augmented views of the same image close together." 
To see the first failure quantitatively, suppose a detector sees $10^4$ background anchors, each confidently correct at $\hat{p}_t = 0.99$, alongside $10$ foreground objects that the model finds hard at $\hat{p}_t = 0.3$. Each easy negative contributes $-\log 0.99 \approx 0.01$ to the loss, so the negatives together contribute about $100$. The ten hard positives contribute $10 \times (-\log 0.3) \approx 12$. The background, though individually trivial, supplies roughly eight times the total loss and dominates the gradient. No tuning of the learning rate fixes this; the problem is the shape of the objective, not its scale. This single observation motivates focal loss in @sec-focal. The losses below relax one or both assumptions. Focal loss keeps the classification framing but reweights examples by difficulty. Metric and contrastive losses abandon the fixed label set entirely and instead shape pairwise or higher order relationships in embedding space. ## 2. Focal Loss for Imbalance {#sec-focal} ### 2.1 Motivation and definition Consider one sided binary detection where $p_t$ denotes the model's estimated probability of the ground truth class, $$ p_t = \begin{cases} \hat{p} & \text{if } y = 1, \\ 1 - \hat{p} & \text{if } y = 0. \end{cases} $$ Binary cross-entropy is $\mathcal{L}_{\mathrm{CE}} = -\log p_t$. The trouble in dense detection is that a flood of easy negatives, each with $p_t$ near $1$, still contributes a small but nonzero loss $-\log p_t$. Summed over tens of thousands of background anchors, these small contributions overwhelm the loss from a handful of hard, informative examples. Focal loss, introduced by Lin et al. for the RetinaNet detector, multiplies cross-entropy by a modulating factor that decays as confidence grows: $$ \mathcal{L}_{\mathrm{FL}} = -\alpha_t (1 - p_t)^{\gamma} \log p_t . 
$$ The focusing parameter $\gamma \geq 0$ controls how aggressively easy examples are down weighted, and $\alpha_t \in [0,1]$ is an optional class balancing weight analogous to a per class prior correction. ### 2.2 How the modulating factor behaves When an example is misclassified and $p_t$ is small, the factor $(1 - p_t)^{\gamma}$ is close to $1$ and the loss is essentially unchanged from cross-entropy. When an example is easy and $p_t \to 1$, the factor goes to zero and the loss is sharply suppressed. With $\gamma = 2$, a confident example at $p_t = 0.9$ has its loss scaled by $(0.1)^2 = 0.01$, a hundredfold reduction, while a hard example at $p_t = 0.1$ is scaled by $(0.9)^2 \approx 0.81$, almost untouched. The net effect is to refocus training on the hard minority. Setting $\gamma = 0$ recovers ordinary weighted cross-entropy. The gradient also illuminates the mechanism. Writing $\mathcal{L}_{\mathrm{FL}}$ as a function of the logit $z$ for the positive class, one finds $$ \frac{\partial \mathcal{L}_{\mathrm{FL}}}{\partial z} = \alpha_t (1 - p_t)^{\gamma}\Big( \gamma\, p_t \log p_t + p_t - 1 \Big), $$ so the per example gradient magnitude is itself attenuated by $(1 - p_t)^{\gamma}$ for easy examples, ensuring they neither dominate the loss nor the update direction. To make the derivation concrete, recall that for binary logistic output $p_t = \sigma(z)$ on a positive example we have $\partial p_t / \partial z = p_t (1 - p_t)$ and $\partial \log p_t / \partial z = 1 - p_t$. Differentiating $-(1-p_t)^{\gamma}\log p_t$ by the product rule gives a term from the modulating factor, $\gamma (1-p_t)^{\gamma-1} p_t (1-p_t) \log p_t = \gamma (1-p_t)^{\gamma} p_t \log p_t$, and a term from the log, $-(1-p_t)^{\gamma}(1-p_t)$. Collecting them yields the expression above. The two terms have opposite sign and the overall factor $(1-p_t)^{\gamma}$ guarantees that as $p_t \to 1$ the gradient vanishes faster than under cross-entropy, which is precisely the desired behavior. 
The pipeline that turns raw logits into a focal contribution is summarized below. ```{mermaid} %%| label: fig-focal-flow %%| fig-cap: "How focal loss reshapes the per example contribution from logits to weighted loss." flowchart LR A["logit z"] --> B["probability p_t"] B --> C["cross-entropy term: minus log p_t"] B --> D["modulating factor: one minus p_t, raised to gamma"] D --> E["class weight alpha_t"] C --> F["focal loss: alpha_t times factor times CE"] E --> F ``` ```python # Binary focal loss (illustrative, not runnable) def focal_loss(p, y, gamma=2.0, alpha=0.25): p_t = p * y + (1 - p) * (1 - y) alpha_t = alpha * y + (1 - alpha) * (1 - y) return -alpha_t * (1 - p_t) ** gamma * log(p_t) ``` ### 2.3 Practical notes A typical RetinaNet configuration uses $\gamma = 2$ and $\alpha = 0.25$. The $\alpha$ term is set below $0.5$ because, somewhat counterintuitively, once the focusing term has down weighted easy negatives, a mild up weighting of the abundant negative class stabilizes the total loss scale. Focal loss has since been generalized: the quality focal and varifocal variants extend the idea to continuous targets and joint classification plus localization quality estimation, and class balanced focal loss combines it with an effective number of samples reweighting for long tailed recognition. ## 3. Metric Learning with Contrastive and Triplet Losses When the goal is a representation in which distance encodes semantic similarity, we leave classification behind. Let $f_\theta(\cdot)$ map an input to an embedding $\mathbf{e} \in \mathbb{R}^d$, often $\ell_2$ normalized so that $\lVert \mathbf{e} \rVert = 1$. Metric learning objectives operate on pairs or triplets of embeddings. 
### 3.1 The contrastive (pairwise) loss The classic contrastive loss of Hadsell, Chopra and LeCun takes a pair $(\mathbf{e}_i, \mathbf{e}_j)$ with a binary label $Y = 0$ if the pair is similar and $Y = 1$ if dissimilar, and a distance $D = \lVert \mathbf{e}_i - \mathbf{e}_j \rVert_2$: $$ \mathcal{L}_{\mathrm{contrast}} = (1 - Y)\,\tfrac{1}{2} D^2 \;+\; Y\,\tfrac{1}{2}\,\big[\max(0,\, m - D)\big]^2 . $$ Similar pairs are simply pulled together by minimizing $D^2$. Dissimilar pairs are pushed apart, but only until their distance reaches the margin $m$; beyond that, a dissimilar pair contributes no loss and no gradient. The margin prevents the model from wasting capacity scattering already well separated negatives infinitely far apart. ### 3.2 The triplet loss The triplet loss, popularized by FaceNet, replaces absolute distances with a relative comparison. Each training unit is an anchor $a$, a positive $p$ of the same class, and a negative $n$ of a different class. The objective requires the anchor to positive distance to be smaller than the anchor to negative distance by at least a margin $m$: $$ \mathcal{L}_{\mathrm{triplet}} = \max\!\big(0,\; D(a,p)^2 - D(a,n)^2 + m \big). $$ This relative formulation is more flexible than the pairwise loss because it never imposes an absolute target distance; it only constrains orderings, which is usually what downstream retrieval cares about. ### 3.3 The central role of mining The triplet loss is only as good as the triplets fed to it. The vast majority of randomly sampled triplets already satisfy the margin and yield zero gradient, so naive sampling stalls. Define a triplet as **hard** when $D(a,n) < D(a,p)$ (the negative is closer than the positive) and **semi-hard** when $$ D(a,p) < D(a,n) < D(a,p) + m, $$ meaning the negative is farther than the positive but still inside the margin. 
Training exclusively on the hardest negatives tends to be unstable and can collapse the embedding, because the hardest negatives are often label noise or genuinely ambiguous. FaceNet therefore mines semi-hard negatives within each minibatch, a robust compromise that supplies a useful gradient without chasing pathological examples. Batch hard mining, in which one selects the hardest positive and hardest negative for each anchor within the batch, is another widely used strategy in person re identification. ## 4. The InfoNCE Loss ### 4.1 From pairs to a classification over a set Triplet loss uses exactly one negative per update. Modern contrastive representation learning generalizes this to many negatives at once and frames the problem as a softmax classification: given an anchor, identify its single positive among a pool of one positive and many negatives. This is the InfoNCE loss (Noise Contrastive Estimation in the information theoretic sense), introduced by van den Oord et al. for Contrastive Predictive Coding. Let $\mathbf{q}$ be a query embedding, $\mathbf{k}^{+}$ its positive key, and $\{\mathbf{k}^{-}_j\}_{j=1}^{N}$ a set of negative keys. With a similarity function $s(\cdot,\cdot)$, usually cosine similarity, and a temperature $\tau > 0$, $$ \mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\big(s(\mathbf{q}, \mathbf{k}^{+})/\tau\big)} {\exp\!\big(s(\mathbf{q}, \mathbf{k}^{+})/\tau\big) + \sum_{j=1}^{N} \exp\!\big(s(\mathbf{q}, \mathbf{k}^{-}_j)/\tau\big)} . $$ This is exactly the cross-entropy of an $(N+1)$ way classifier whose logits are scaled similarities and whose correct answer is always the positive. The structure unifies the earlier sections: InfoNCE is a softmax cross-entropy, but over a dynamically constructed set of candidates in embedding space rather than over a fixed label vocabulary. 
### 4.2 The mutual information bound

InfoNCE is not just a convenient surrogate; minimizing it maximizes a lower bound on the mutual information between query and positive. Specifically,

$$
I(\mathbf{q}; \mathbf{k}^{+}) \;\geq\; \log(N+1) - \mathcal{L}_{\mathrm{InfoNCE}} .
$$

The bound tightens as the number of negatives $N$ grows, which gives a principled reason to want many negatives. Because the loss is nonnegative, the term $\log(N+1)$ also caps how much information any single InfoNCE term can certify: with a pool of $256$ candidates ($N + 1 = 256$) the bound saturates near $\log 256 \approx 5.5$ nats, so contrastive pretraining that needs to capture high mutual information must either grow the candidate pool or accept that the bound is loose.

This insight drove two strategies for enlarging the negative pool. SimCLR uses very large batches so that all other examples serve as in-batch negatives. MoCo instead decouples the negative pool size from the gradient batch size: it maintains a momentum-updated encoder and a queue of past keys, providing a large and consistent dictionary of negatives without the memory cost of large batches.

### 4.3 Temperature and its effect

The temperature $\tau$ controls the sharpness of the distribution over candidates. A small $\tau$ sharpens the softmax, concentrating gradient on the hardest negatives (those most similar to the query), which improves discrimination but risks instability. A large $\tau$ softens the distribution and treats negatives more uniformly. Empirically, $\tau$ in the range $0.05$ to $0.2$ is common, and the choice materially affects the learned uniformity and separability of the embedding space.

A small worked example makes the role of $\tau$ tangible. Take an anchor with cosine similarity $0.8$ to its positive and $0.6$ and $0.1$ to two negatives. At $\tau = 0.1$ the scaled logits are $8.0$, $6.0$, and $1.0$; the softmax probability mass on the positive is about $0.88$, so the loss is $-\log 0.88 \approx 0.13$, and the hard negative at $0.6$ absorbs almost all of the residual gradient.
At $\tau = 0.5$ the logits become $1.6$, $1.2$, and $0.2$; the positive probability falls to about $0.52$, the loss rises to $\approx 0.65$, and the easy negative at $0.1$ now receives a non-trivial share of the push. Lower temperature thus concentrates effort on the most confusable negatives, at the cost of larger gradients that can destabilize early training.

```python
import torch
import torch.nn.functional as F

# InfoNCE with in-batch negatives (illustrative); q, k: (batch, dim) float tensors.
def info_nce(q, k, tau=0.1):
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)  # L2-normalize: dot product = cosine
    logits = (q @ k.T) / tau                             # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)    # positive for row i is column i
    return F.cross_entropy(logits, labels)               # standard softmax CE
```

### 4.4 Supervised contrastive and multimodal variants

When labels are available, the supervised contrastive loss (SupCon) treats all examples sharing a label as positives for one another, generalizing InfoNCE to multiple positives per anchor; Khosla et al. report that it often outperforms cross-entropy on image classification. In the multimodal setting, CLIP applies a symmetric InfoNCE over image and text embeddings: each image is matched against all texts in the batch and vice versa, with a learnable temperature. This single objective, scaled to hundreds of millions of pairs, yields the transferable image-text representations that anchor much of contemporary multimodal modeling.

## 5. Margin and Ranking Losses

### 5.1 The hinge and margin principle

Several losses above share a common ingredient: a margin enforced through a hinge $\max(0, \cdot)$. The canonical example is the binary hinge loss of the support vector machine. For label $y \in \{-1, +1\}$ and score $s$,

$$
\mathcal{L}_{\mathrm{hinge}} = \max(0,\; 1 - y\,s).
$$

The loss is zero once the example is on the correct side of the decision boundary by at least a unit margin, and it grows linearly as the example moves inside the margin or onto the wrong side. This produces sparse, robust gradients and is the discriminative counterpart to the smooth, always nonzero gradient of cross-entropy.
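
The contrast between the sparse hinge gradient and the always nonzero cross-entropy gradient can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the binary cross-entropy is written in its margin form $\log(1 + e^{-ys})$ for labels in $\{-1, +1\}$, and at the kink $y s = 1$ the hinge subgradient is arbitrarily taken to be zero.

```python
import math

def hinge_loss(y, score):
    """Binary hinge loss max(0, 1 - y*s) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def hinge_grad(y, score):
    """A subgradient of the hinge loss with respect to the score."""
    return -float(y) if y * score < 1.0 else 0.0

def logistic_loss(y, score):
    """Binary cross-entropy in margin form: log(1 + exp(-y*s))."""
    return math.log1p(math.exp(-y * score))

def logistic_grad(y, score):
    """Derivative of the logistic loss with respect to the score."""
    return -y / (1.0 + math.exp(y * score))

# A positive example (y = +1) scored at several points relative to the margin.
for s in (-1.0, 0.5, 1.0, 3.0):
    print(
        f"s={s:+.1f}  hinge={hinge_loss(1, s):.2f}  "
        f"hinge_grad={hinge_grad(1, s):+.1f}  "
        f"logistic_grad={logistic_grad(1, s):+.4f}"
    )
```

Beyond the margin the hinge gradient is exactly zero, while the logistic gradient shrinks toward zero but never reaches it.
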
### 5.2 Pairwise ranking losses

Ranking problems care about relative order. Given a query and a pair of items, one preferred (positive) with score $s^{+}$ and one not (negative) with score $s^{-}$, the margin ranking loss enforces a gap:

$$
\mathcal{L}_{\mathrm{rank}} = \max\!\big(0,\; m - (s^{+} - s^{-})\big).
$$

A probabilistic alternative is the RankNet loss, which models the probability that the positive outranks the negative with a logistic function of the score difference and applies cross-entropy to it:

$$
\mathcal{L}_{\mathrm{RankNet}} = -\log \sigma\big(s^{+} - s^{-}\big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} .
$$

### 5.3 From pairs to lists

Optimizing pairwise order does not directly optimize list-level retrieval metrics such as normalized discounted cumulative gain (NDCG), which weight the top of the ranking far more heavily than the bottom. LambdaRank addresses this by scaling each pairwise gradient by the change in the target metric that would result from swapping the two items, so that pairs whose reordering most affects the top of the list receive the strongest updates. LambdaMART instantiates this idea inside gradient-boosted trees and long remained a strong baseline for learning to rank. For applications where only a single relevant item exists per query, such as entity retrieval, the multiclass softmax over candidates, which is exactly InfoNCE, can be read as a listwise objective.

### 5.4 Margins on the sphere for face recognition

A distinct and influential line of work injects margins directly into the softmax classifier for embedding learning. ArcFace normalizes both the weight vectors and the features, so each logit becomes the cosine of the angle between a feature and its class prototype, then adds an additive angular margin $m$ to the true class angle $\theta_y$:

$$
\mathcal{L}_{\mathrm{ArcFace}} = -\log \frac{\exp\!\big(s \cos(\theta_y + m)\big)} {\exp\!\big(s \cos(\theta_y + m)\big) + \sum_{j \neq y} \exp\!\big(s \cos \theta_j\big)} .
$$

The scale $s$ plays the role of an inverse temperature on the unit sphere. By demanding a geodesic margin between classes, ArcFace produces highly discriminative, well-separated embeddings and became a standard for face recognition. It is a revealing synthesis: a softmax cross-entropy at heart, but reshaped by the margin and metric learning ideas of the preceding sections.

## 6. Choosing an Objective

The following table maps problem structure to a default objective and its dominant practical lever.

| Problem structure | Default loss | Key knob | Main pitfall |
|---|---|---|---|
| Balanced classification, fixed labels | Cross-entropy | label smoothing | overconfidence |
| Severe imbalance, dense prediction | Focal loss | focusing $\gamma$, balance $\alpha$ | excessive down-weighting of hard positives |
| Embedding geometry, scarce labels | Contrastive or triplet | margin $m$, negative mining | mining collapse, dead triplets |
| Embedding geometry, many negatives | InfoNCE | temperature $\tau$, negatives $N$ | small $N$ loosens the MI bound |
| Relative order matters | Ranking or LambdaRank | margin, metric weighting | pairwise objective ignores list position |
| Open-set verification | ArcFace | angular margin $m$, scale $s$ | overlarge margin stalls convergence |

A few cross-cutting pitfalls deserve emphasis. Margin-based losses produce exactly zero gradient once the margin is satisfied, so a poorly chosen margin or a stale set of pairs can silently halt learning while the loss curve looks healthy; monitoring the fraction of active (nonzero-loss) pairs is a cheap and reliable diagnostic. Contrastive objectives are acutely sensitive to false negatives, since two samples that share a latent class but no label are pushed apart by construction, which is one reason supervised positives help when labels exist.
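
The active-pair diagnostic mentioned above is simple to compute. The sketch below applies it to the margin ranking loss of Section 5.2; the function names, the tolerance `eps`, and the toy scores are hypothetical and chosen only for illustration.

```python
def margin_ranking_losses(pos_scores, neg_scores, margin=1.0):
    """Per-pair margin ranking loss max(0, m - (s+ - s-))."""
    return [max(0.0, margin - (sp - sn)) for sp, sn in zip(pos_scores, neg_scores)]

def active_fraction(losses, eps=0.0):
    """Share of pairs whose loss exceeds eps, i.e. pairs that still yield gradient."""
    if not losses:
        return 0.0
    return sum(1 for loss in losses if loss > eps) / len(losses)

# Toy scores for four (positive, negative) pairs.
pos = [2.0, 1.5, 0.2, 3.0]
neg = [0.5, 1.0, 0.4, 0.1]
losses = margin_ranking_losses(pos, neg, margin=1.0)
print("per-pair losses:", losses)
print("active fraction:", active_fraction(losses))
```

Logging this fraction alongside the mean loss makes it visible when a falling loss reflects pairs going inactive rather than genuine progress on the pairs that remain.
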
Finally, focal loss assumes that the easy examples really are correct; under heavy label noise its down-weighting can starve the model of signal, so it pairs best with reasonably clean annotations. Mature open-source implementations of many of these losses are available in libraries such as `torchvision`, `timm`, `pytorch-metric-learning`, and `sentence-transformers`, and reaching for a tested implementation is preferable to re-deriving the numerically delicate stable forms by hand.

The losses in this chapter are not competitors so much as tools matched to problem structure. If the task is classification with a fixed label set and roughly balanced classes, cross-entropy remains the right default. Under severe imbalance, especially in dense prediction, focal loss restores a useful gradient from the rare, hard examples. When the goal is an embedding space whose geometry encodes similarity, and labels are scarce or open-ended, contrastive, triplet, and InfoNCE losses shape that geometry directly, with the number and quality of negatives being the dominant practical lever. When relative order is what matters, margin and ranking losses optimize it explicitly, and listwise refinements align training with the metrics that actually govern retrieval quality.

A recurring theme ties them together. Most of these objectives can be read as a softmax cross-entropy that has been reweighted by difficulty, restructured over a dynamic candidate set, or augmented with an explicit margin. Understanding that common skeleton is the surest guide to selecting and adapting a loss for a new problem.

## References

1. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P. "Focal Loss for Dense Object Detection." ICCV 2017. https://arxiv.org/abs/1708.02002
2. Hadsell, R., Chopra, S., LeCun, Y. "Dimensionality Reduction by Learning an Invariant Mapping." CVPR 2006. https://ieeexplore.ieee.org/document/1640964
3. Schroff, F., Kalenichenko, D., Philbin, J. "FaceNet: A Unified Embedding for Face Recognition and Clustering."
   CVPR 2015. https://arxiv.org/abs/1503.03832
4. Hermans, A., Beyer, L., Leibe, B. "In Defense of the Triplet Loss for Person Re-Identification." 2017. https://arxiv.org/abs/1703.07737
5. van den Oord, A., Li, Y., Vinyals, O. "Representation Learning with Contrastive Predictive Coding." 2018. https://arxiv.org/abs/1807.03748
6. Chen, T., Kornblith, S., Norouzi, M., Hinton, G. "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR). ICML 2020. https://arxiv.org/abs/2002.05709
7. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R. "Momentum Contrast for Unsupervised Visual Representation Learning" (MoCo). CVPR 2020. https://arxiv.org/abs/1911.05722
8. Khosla, P., et al. "Supervised Contrastive Learning." NeurIPS 2020. https://arxiv.org/abs/2004.11362
9. Radford, A., et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). ICML 2021. https://arxiv.org/abs/2103.00020
10. Burges, C., et al. "Learning to Rank using Gradient Descent" (RankNet). ICML 2005. https://www.microsoft.com/en-us/research/publication/learning-to-rank-using-gradient-descent/
11. Burges, C. "From RankNet to LambdaRank to LambdaMART: An Overview." Microsoft Research Technical Report, 2010. https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/
12. Deng, J., Guo, J., Xue, N., Zafeiriou, S. "ArcFace: Additive Angular Margin Loss for Deep Face Recognition." CVPR 2019. https://arxiv.org/abs/1801.07698