188 Loss Functions for Regression in Neural Networks

Regression asks a neural network to map an input $x$ to a continuous target $y \in \mathbb{R}^d$. The architecture proposes a prediction $\hat{y} = f_\theta(x)$, but the loss function decides what counts as a good prediction. This choice is not cosmetic. Each loss encodes an implicit assumption about how the observed targets deviate from the underlying signal, and that assumption determines which statistic of the conditional distribution the network learns, how it responds to outliers, and how its gradients behave during optimization. This chapter develops the standard regression losses from first principles, connects each to a probabilistic noise model, and gives practical guidance for matching the loss to the data.

188.1 1. The Statistical View of Regression Loss

Suppose the data are drawn from a joint distribution $p(x, y)$. A loss $L(y, \hat{y})$ measures the cost of predicting $\hat{y}$ when the truth is $y$. Training minimizes the empirical risk

\[ \hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f_\theta(x_i)\big), \]

which approximates the population risk $R(\theta) = \mathbb{E}_{(x,y) \sim p}\,[L(y, f_\theta(x))]$. For a flexible enough network, the minimizer at each input $x$ is the constant $c$ that minimizes the conditional expected loss $\mathbb{E}[L(y, c) \mid x]$. The functional form of $L$ therefore selects which property of $p(y \mid x)$ the network estimates. Squared error recovers the conditional mean, absolute error recovers the conditional median, and quantile loss recovers an arbitrary conditional quantile. Understanding a loss means knowing the statistic it targets and the noise model that makes it the maximum likelihood objective.

188.1.1 1.1 The pointwise minimizer and elicitability

The reason a loss can be analyzed one input at a time is a decomposition of the population risk. By the tower property of conditional expectation, for any prediction function $g$,

\[ R(g) = \mathbb{E}_{x}\Big[\, \mathbb{E}_{y}\big[L(y, g(x)) \mid x\big] \,\Big]. \]

The inner conditional risk $\rho(c \mid x) = \mathbb{E}[L(y, c) \mid x]$ depends on $g$ only through the single number $c = g(x)$, and different inputs are coupled only through the capacity of the network. An unconstrained, infinitely flexible predictor can therefore set $g^\star(x) = \arg\min_c \rho(c \mid x)$ independently at every $x$. The statistic $T[p(y \mid x)] = \arg\min_c \rho(c \mid x)$ that a loss recovers in this idealized limit is called the functional elicited by the loss, and a functional that arises this way from some loss is called elicitable. The mean, the median, and every quantile are elicitable; the variance and the conditional mode are not elicitable on their own, which is one reason they are awkward to target with a plain regression loss. Real networks have finite capacity, so the recovered statistic is an approximation, but the elicited functional is what training aims at and is the right way to reason about what a loss does.
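
The pointwise argument can be checked numerically. The sketch below is a minimal illustration, not part of the original derivation: it assumes a skewed synthetic sample (an exponential, chosen only so that mean and median differ) standing in for draws from $p(y \mid x)$ at one fixed $x$, and finds the constant $c$ that minimizes each empirical conditional risk by brute-force grid search.

```python
import numpy as np

# Illustrative sample standing in for draws from p(y | x) at one fixed x.
# The exponential distribution is an assumption chosen so that mean != median.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=2001)

# Candidate constant predictions c, spaced 0.01 apart.
grid = np.linspace(0.0, 10.0, 1001)
resid = y[None, :] - grid[:, None]            # shape (n_grid, n_samples)

# Empirical conditional risk of each candidate under the two losses.
risk_se = np.mean(0.5 * resid**2, axis=1)
risk_ae = np.mean(np.abs(resid), axis=1)

c_se = grid[np.argmin(risk_se)]               # should sit next to the sample mean
c_ae = grid[np.argmin(risk_ae)]               # should sit next to the sample median

print(f"squared error  -> {c_se:.2f} (sample mean   {y.mean():.2f})")
print(f"absolute error -> {c_ae:.2f} (sample median {np.median(y):.2f})")
```

Up to the grid resolution, the squared-error minimizer coincides with the sample mean and the absolute-error minimizer with the sample median, which is the elicitation statement in empirical form.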

188.1.2 1.2 Negative log likelihood as the bridge

The bridge between losses and probability is negative log likelihood. If we posit $p(y \mid x) = q(y \mid f_\theta(x))$ for some parametric density $q$, then minimizing $-\sum_i \log q(y_i \mid f_\theta(x_i))$ is equivalent, up to constants, to minimizing a particular loss. Reading the loss off the density, and the density off the loss, is the central skill of this chapter. The general rule is

\[ L(y, \hat{y}) = -\log q(y \mid \hat{y}) + \text{const}, \]

so any loss of the form $L(y, \hat{y}) = \ell(y - \hat{y})$ that depends only on the residual corresponds to an additive noise model $y = \hat{y} + \varepsilon$ whose density is $q(\varepsilon) \propto \exp(-\ell(\varepsilon))$, provided $\exp(-\ell)$ is integrable. Squared error gives a Gaussian, absolute error gives a Laplace, and the compromise losses of Section 4 give bell-shaped cores with heavier-than-Gaussian tails. This correspondence also explains a subtlety: a loss is only a proper likelihood if the implied density integrates to one for a fixed scale, which is why losses with a free scale parameter (the $\delta$ of Huber, the $b$ of Laplace) should jointly estimate or fix that scale to remain interpretable as maximum likelihood.
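
The integrability condition can be made concrete by computing the normalizing constant $Z = \int \exp(-\ell(\varepsilon))\,d\varepsilon$ for a few residual losses. The following is a rough numerical sketch; the helper name `normalizer` and the integration range are illustrative choices, and the closed-form values in the comments ($\sqrt{2\pi}$, $2$, and $\pi$) are standard integrals rather than results stated in the text.

```python
import numpy as np

def normalizer(loss, half_width=60.0, n=1_200_001):
    """Riemann-sum estimate of Z = integral of exp(-loss(e)) de on [-half_width, half_width]."""
    e = np.linspace(-half_width, half_width, n)
    de = e[1] - e[0]
    return float(np.sum(np.exp(-loss(e))) * de)

z_gauss = normalizer(lambda e: 0.5 * e**2)             # Gaussian: sqrt(2*pi)
z_laplace = normalizer(lambda e: np.abs(e))            # Laplace with b = 1: 2
z_logcosh = normalizer(lambda e: np.log(np.cosh(e)))   # hyperbolic secant: pi

print(z_gauss, z_laplace, z_logcosh)
```

Each constant is finite, so $q(\varepsilon) = \exp(-\ell(\varepsilon))/Z$ is a proper density for these three losses.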

188.2 2. Squared Error and the Gaussian Model

188.2.1 2.1 Definition and the conditional mean

The squared error loss, often called L2 loss, is

\[ L_{\text{SE}}(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^2, \]

and its average over a dataset is the mean squared error (MSE). To see which statistic it targets, fix $x$ and minimize the conditional risk $\rho(c) = \tfrac{1}{2}\mathbb{E}[(y - c)^2 \mid x]$ over $c$. The function is strictly convex in $c$ with $\rho''(c) = 1 > 0$, so its unique minimizer is found by setting the derivative to zero:

\[ \rho'(c) = \mathbb{E}\big[-(y - c) \mid x\big] = c - \mathbb{E}[y \mid x] = 0 \;\Longrightarrow\; c^\star = \mathbb{E}[y \mid x]. \]

Squared error trains the network to output the conditional mean. A useful companion identity is the bias-variance style decomposition $\mathbb{E}[(y - c)^2 \mid x] = \operatorname{Var}(y \mid x) + (c - \mathbb{E}[y\mid x])^2$, which shows that the irreducible part of the risk equals the conditional noise variance and that no predictor can drive the squared-error risk below it.
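
The decomposition follows by adding and subtracting the conditional mean. Writing $\mu = \mathbb{E}[y \mid x]$ as a shorthand introduced only for this step, and using $(y - c) = (y - \mu) + (\mu - c)$,

\[ \mathbb{E}\big[(y - c)^2 \mid x\big] = \mathbb{E}\big[(y - \mu)^2 \mid x\big] + 2\,(\mu - c)\,\mathbb{E}\big[y - \mu \mid x\big] + (\mu - c)^2 = \operatorname{Var}(y \mid x) + (c - \mu)^2, \]

where the cross term vanishes because $\mathbb{E}[y - \mu \mid x] = 0$. The second term is zero exactly when $c = \mu$, which recovers the minimizer found above.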

188.2.2 2.2 The Gaussian likelihood

Assume the target is the signal plus homoscedastic Gaussian noise, $y = f_\theta(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. The negative log likelihood of one observation is

\[ -\log q(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2). \]

With $\sigma$ fixed, minimizing this over $\theta$ is exactly minimizing squared error. MSE is therefore the maximum likelihood objective under additive Gaussian noise. If the noise is heteroscedastic, the network can also predict $\sigma^2(x)$ and minimize the full Gaussian negative log likelihood, which reweights each residual by its inverse variance.
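
A minimal sketch of the heteroscedastic objective is given below. It assumes the network outputs a mean and a log-variance for each example; the log parameterization is an illustrative choice made here to keep the variance positive, and the function name `gaussian_nll` is hypothetical.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Per-example Gaussian negative log likelihood.

    log_var is the predicted log(sigma^2); predicting the log keeps the
    variance positive (an assumption of this sketch, not of the text).
    """
    y, mean, log_var = (np.asarray(v, dtype=float) for v in (y, mean, log_var))
    inv_var = np.exp(-log_var)                      # 1 / sigma^2
    return 0.5 * (inv_var * (y - mean) ** 2 + log_var + np.log(2.0 * np.pi))

# Same residual of 2 under two predicted variances: the larger variance
# down-weights the squared term by 1/sigma^2 but pays a log(sigma^2) penalty.
nll_tight = gaussian_nll(2.0, 0.0, np.log(1.0))     # sigma^2 = 1
nll_wide = gaussian_nll(2.0, 0.0, np.log(4.0))      # sigma^2 = 4
print(nll_tight, nll_wide)
```

With $\sigma^2 = 1$ the expression reduces to $\tfrac{1}{2}(y - \hat{y})^2$ plus a constant, matching the fixed-variance case.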

188.2.3 2.3 Gradient behavior

The gradient with respect to the prediction is $\partial L_{\text{SE}} / \partial \hat{y} = \hat{y} - y$, linear in the residual $r = \hat{y} - y$. Large residuals produce large gradients. This is efficient when errors are genuinely Gaussian, since big residuals are rare, but it makes MSE sensitive to outliers and to mislabeled targets, because a single gross error can dominate the batch gradient.

def squared_error(y, y_hat):
    r = y_hat - y
    return 0.5 * r**2          # gradient w.r.t. y_hat is r

188.3 3. Absolute Error and the Laplace Model

188.3.1 3.1 Definition and the conditional median

The absolute error loss, or L1 loss, is

\[ L_{\text{AE}}(y, \hat{y}) = |y - \hat{y}|, \]

and its average is the mean absolute error (MAE). Minimizing $\rho(c) = \mathbb{E}[\,|y - c|\, \mid x]$ over $c$ yields the conditional median rather than the mean. To see this, differentiate under the expectation using $\tfrac{d}{dc}|y - c| = -\operatorname{sign}(y - c)$:

\[ \rho'(c) = \mathbb{E}\big[-\operatorname{sign}(y - c) \mid x\big] = P(y < c \mid x) - P(y > c \mid x). \]

Setting this to zero requires $P(y < c \mid x) = P(y > c \mid x)$, which is the defining property of a median. The median is robust: moving a far-away target even farther does not move the optimal prediction, because the subgradient of $|r|$ saturates at $\pm 1$ and the optimality condition counts points on each side rather than weighting them by distance.

188.3.2 3.2 The Laplace likelihood

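Absolute error is, up to a scale factor and an additive constant, the negative log likelihood of a Laplace distribution. With $q(y \mid \hat{y}) = \tfrac{1}{2b}\exp(-|y - \hat{y}| / b)$, the negative log likelihood is $|y - \hat{y}|/b + \log(2b)$, so for a fixed scale $b$ maximizing the likelihood is the same as minimizing MAE. The Laplace density has heavier tails than the Gaussian, so it assigns more probability to large deviations. Choosing MAE is therefore a statement that the data contain occasional large errors that should not dominate fitting.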

188.3.3 3.3 Gradient behavior

The subgradient is $\partial L_{\text{AE}} / \partial \hat{y} = \operatorname{sign}(\hat{y} - y)$. Its magnitude is constant, so outliers contribute no more to the gradient than small errors do. This delivers robustness but causes two difficulties. First, the gradient does not shrink as the prediction approaches the target, which can cause the optimizer to oscillate around the minimum unless the learning rate decays. Second, the loss is nondifferentiable at $r = 0$, though autodiff frameworks define a subgradient there.
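
The oscillation can be seen in a one-parameter toy problem. The sketch below is illustrative only: it runs subgradient descent on $|c - y|$ for a single target, once with a constant step and once with a $1/\sqrt{t}$ decay; the starting point, step sizes, and the helper `descend_abs` are all assumptions made for the example.

```python
import numpy as np

def descend_abs(c0, target, lr_schedule, steps):
    """Subgradient descent on |c - target| using sign(c - target)."""
    c = c0
    for t in range(steps):
        c -= lr_schedule(t) * np.sign(c - target)
    return c

# Constant step: the iterate ends up hopping between +0.05 and -0.05 forever.
c_const = descend_abs(1.05, 0.0, lambda t: 0.1, steps=200)

# Decaying step: the hop size shrinks with the learning rate.
c_decay = descend_abs(1.05, 0.0, lambda t: 0.1 / np.sqrt(1.0 + t), steps=2000)

print(abs(c_const), abs(c_decay))
```

With the constant step the final error stays at about half the step size, whereas with the decaying schedule it is bounded by the current, much smaller, step.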

188.4 4. The Huber and Log-Cosh Compromises

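Squared error has smooth, vanishing gradients near the optimum but is fragile to outliers. Absolute error is robust but has a kink at zero and constant gradient magnitude. Huber and log-cosh interpolate between these regimes; the epsilon-insensitive loss of Section 4.3 is a related robust variant that adds a tolerance band instead.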

188.4.1 4.1 Huber loss

The Huber loss is quadratic for small residuals and linear for large ones, with a threshold $\delta$ marking the transition:

\[ L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| \le \delta, \\[4pt] \delta\big(|r| - \tfrac{1}{2}\delta\big), & |r| > \delta, \end{cases} \qquad r = \hat{y} - y. \]

The pieces are stitched so that both the value and the first derivative are continuous at $|r| = \delta$. The derivative is

\[ \frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} r, & |r| \le \delta, \\ \delta \operatorname{sign}(r), & |r| > \delta, \end{cases} \]

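which is clipped at magnitude $\delta$. Small residuals receive MSE-style gradients that vanish at the optimum, while large residuals receive bounded MAE-style gradients that limit the influence of outliers. The hyperparameter $\delta$ sets the residual scale at which a point is treated as an outlier. As $\delta \to \infty$ every residual falls in the quadratic branch and the loss becomes squared error. As $\delta \to 0$ the loss itself shrinks to zero, but the rescaled loss $L_\delta(r)/\delta = |r| - \tfrac{1}{2}\delta$ for $|r| > \delta$ approaches absolute error, so a small $\delta$ behaves like MAE up to a factor that can be absorbed into the learning rate. In practice $\delta$ is tuned to the expected noise scale, sometimes adaptively from a robust estimate of the residual spread such as the median absolute deviation.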

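def huber(y, y_hat, delta=1.0):
    # Scalar inputs only: the return line uses a plain Python conditional.
    r = y_hat - y                      # residual
    a = abs(r)
    quad = 0.5 * r**2                  # branch for |r| <= delta
    lin  = delta * (a - 0.5 * delta)   # branch for |r| > delta; equals quad at |r| = delta
    return quad if a <= delta else lin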

Probabilistically, Huber loss corresponds to a density that is Gaussian in the core and Laplace in the tails, which is the maximum likelihood view of a contamination model where most points are clean and a minority are heavy-tailed.
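
For batched training the piecewise definition and its clipped derivative are usually evaluated elementwise. The following is a small NumPy sketch of that idea; the function names `huber_loss` and `huber_grad` are illustrative, and the gradient is written out by hand rather than obtained from an autodiff framework.

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Elementwise Huber loss for array inputs."""
    r = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def huber_grad(y, y_hat, delta=1.0):
    """Derivative with respect to y_hat: the residual clipped to [-delta, delta]."""
    r = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return np.clip(r, -delta, delta)

y = np.zeros(5)
y_hat = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
losses = huber_loss(y, y_hat)   # quadratic for |r| <= 1, linear beyond
grads = huber_grad(y, y_hat)    # residuals of -3 and 3 are clipped to -1 and 1
print(losses, grads)
```

Writing the gradient as `np.clip(r, -delta, delta)` makes the bounded-influence property explicit: no single example can contribute more than $\delta$ in magnitude.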

188.4.2 4.2 Log-cosh loss

The log-cosh loss is a smooth surrogate with similar behavior but no threshold to tune:

\[ L_{\text{lc}}(r) = \log\!\big(\cosh r\big). \]

For small $r$, a Taylor expansion gives $\log\cosh r \approx \tfrac{1}{2} r^2$, so it behaves like squared error near the optimum. For large $|r|$, $\log\cosh r \approx |r| - \log 2$, so it grows linearly like absolute error. Its derivative is

\[ \frac{\partial L_{\text{lc}}}{\partial \hat{y}} = \tanh(r), \]

which is smooth everywhere, bounded in $(-1, 1)$, and vanishes at $r = 0$. Log-cosh is infinitely differentiable, which can help second-order optimizers: it avoids both the kink of MAE at $r = 0$ and the jump in the second derivative of Huber at $|r| = \delta$ (Huber itself is only once continuously differentiable). The cost is that its outlier resistance is fixed by the $\tanh$ saturation scale and cannot be tuned the way $\delta$ tunes Huber.
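
One practical caveat is that evaluating $\log(\cosh r)$ literally overflows for large $|r|$, because $\cosh r$ grows exponentially. A common workaround, sketched here as an illustration rather than as any particular library's implementation, uses the identity $\log\cosh r = |r| + \log\!\big(1 + e^{-2|r|}\big) - \log 2$.

```python
import numpy as np

def log_cosh(r):
    """log(cosh(r)) computed as |r| + log1p(exp(-2|r|)) - log(2).

    The direct form overflows in float64 once |r| is a little above 710;
    this rearrangement only ever exponentiates a non-positive number.
    """
    a = np.abs(np.asarray(r, dtype=float))
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

small = np.array([0.0, 0.5, -3.0])
print(log_cosh(small))      # agrees with np.log(np.cosh(small))
print(log_cosh(1.0e4))      # finite, close to 1e4 - log(2)
```

The large-residual value also confirms the asymptote $|r| - \log 2$ quoted above.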

188.4.3 4.3 The epsilon-insensitive loss

A third robust loss treats small residuals as free rather than merely cheap. The epsilon-insensitive loss, central to support vector regression, charges nothing inside a tube of half-width $\varepsilon$ around the target and grows linearly outside it:

\[ L_\varepsilon(r) = \max\big(0,\; |r| - \varepsilon\big), \qquad r = \hat{y} - y. \]

Inside the tube the gradient is zero, so residuals smaller than $\varepsilon$ exert no pull at all and the fit is sparse in the sense that only points on or outside the tube boundary influence the solution. Outside the tube the gradient has constant magnitude one, giving the same outlier resistance as MAE. The flat interior makes the loss tolerant of small measurement noise and yields models that depend on few support points, which is attractive when a tolerance band is acceptable and a compact model is wanted. The price is an additional hyperparameter $\varepsilon$ and a region of exactly zero gradient that can stall learning if $\varepsilon$ is set too wide. Setting $\varepsilon = 0$ recovers MAE.
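
The tube can be sketched in a few lines. The code below is a minimal NumPy illustration with hypothetical function names; it uses the zero subgradient inside the tube and on its boundary, which is one valid choice among several at the boundary points.

```python
import numpy as np

def eps_insensitive(y, y_hat, eps=0.5):
    """max(0, |r| - eps) with r = y_hat - y, evaluated elementwise."""
    r = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return np.maximum(0.0, np.abs(r) - eps)

def eps_insensitive_grad(y, y_hat, eps=0.5):
    """Subgradient w.r.t. y_hat: zero inside the tube, sign(r) outside it."""
    r = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return np.where(np.abs(r) > eps, np.sign(r), 0.0)

y = np.zeros(5)
y_hat = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
tube_loss = eps_insensitive(y, y_hat)        # only the two outer points are charged
tube_grad = eps_insensitive_grad(y, y_hat)   # only the two outer points pull
mae_like = eps_insensitive(y, y_hat, eps=0.0)
print(tube_loss, tube_grad, mae_like)
```

Only the residuals of magnitude 2 produce a nonzero loss or gradient, and with `eps=0.0` the values coincide with the absolute error.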

188.5 5. Quantile Loss and Conditional Quantiles

188.5.1 5.1 The pinball loss

The losses so far predict a central tendency. Many applications need a range. Quantile regression estimates the conditional quantile at a chosen level $\tau \in (0, 1)$ using the pinball loss

\[ L_\tau(y, \hat{y}) = \begin{cases} \tau\,(y - \hat{y}), & y \ge \hat{y}, \\ (1 - \tau)\,(\hat{y} - y), & y < \hat{y}. \end{cases} \]

This is an asymmetric absolute error. When the prediction is too low (an underestimate, $y \ge \hat{y}$), the error is weighted by $\tau$. When the prediction is too high, it is weighted by $1 - \tau$. For $\tau = 0.5$ both weights equal $\tfrac{1}{2}$, the loss reduces to half the absolute error, and the target is the median. For $\tau = 0.9$ underestimates cost nine times as much as overestimates, pushing the prediction up until only ten percent of targets exceed it.

Minimizing $\rho(c) = \mathbb{E}[L_\tau(y, c) \mid x]$ yields exactly the $\tau$-th conditional quantile of $p(y \mid x)$. Differentiating the two branches and combining gives

\[ \rho'(c) = (1 - \tau)\,P(y < c \mid x) - \tau\,P(y > c \mid x) = P(y < c \mid x) - \tau, \]

using $P(y < c \mid x) + P(y > c \mid x) = 1$ for a continuous target. Setting $\rho'(c) = 0$ gives $P(y < c \mid x) = \tau$, the definition of the $\tau$-th quantile. The median proof of Section 3.1 is the special case $\tau = \tfrac{1}{2}$.

def pinball(y, y_hat, tau=0.5):
    r = y - y_hat
    return max(tau * r, (tau - 1.0) * r)
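
The quantile property can be verified on a small deterministic sample. The sketch below is illustrative: it assumes the targets $1, 2, \dots, 99$, uses a vectorized version of the pinball loss (named `pinball_loss` here to keep it distinct from the scalar function above), and searches a grid of constants for the minimizer at $\tau = 0.9$.

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Elementwise pinball loss with r = y - y_hat."""
    r = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.maximum(tau * r, (tau - 1.0) * r)

y = np.arange(1.0, 100.0)                    # targets 1, 2, ..., 99
grid = np.linspace(0.0, 100.0, 10001)        # candidate constants, step 0.01

# Mean pinball loss of each constant prediction at tau = 0.9.
risk = np.array([pinball_loss(y, c, 0.9).mean() for c in grid])
c_star = grid[np.argmin(risk)]
print(c_star)
```

The minimizer is $90$: it is the smallest value with at least $90\%$ of the $99$ targets at or below it, since $89/99 < 0.9 \le 90/99$.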

188.5.2 5.2 Prediction intervals

Training one network with multiple output heads, each with its own $\tau$, produces a set of conditional quantiles. A pair such as $\tau = 0.05$ and $\tau = 0.95$ forms a ninety percent prediction interval, giving a distribution-free measure of uncertainty without any Gaussian assumption. Because the heads are fit independently, their outputs can in principle cross, with a lower quantile exceeding a higher one. Sorting the outputs, penalizing crossings, or using a monotone architecture restores a coherent ordering. Quantile loss thus moves regression from point estimation to a partial picture of the conditional distribution.
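
Sorting is the simplest of the three repairs. The sketch below uses made-up head outputs and targets (all numbers are hypothetical) to show how a per-row sort restores the ordering and how the empirical coverage of the resulting interval is measured.

```python
import numpy as np

# Hypothetical raw outputs of three quantile heads for four inputs.
# Columns correspond to tau = 0.05, 0.5, 0.95; the second row crosses.
q_raw = np.array([
    [1.0, 2.0, 3.0],
    [2.5, 2.0, 4.0],
    [0.0, 1.0, 5.0],
    [3.0, 3.5, 3.9],
])
y_true = np.array([2.2, 4.5, 0.5, 3.6])     # hypothetical observed targets

# Post-hoc repair: sort each row so the quantiles are non-decreasing.
q_sorted = np.sort(q_raw, axis=1)
lower, upper = q_sorted[:, 0], q_sorted[:, -1]

# Fraction of targets that fall inside the nominal 90% interval.
coverage = np.mean((y_true >= lower) & (y_true <= upper))
print(q_sorted[1], coverage)
```

On real data the coverage would be computed on a held-out set and compared with the nominal level; a large gap suggests the quantile heads are miscalibrated.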

188.6 6. A Worked Example: One Outlier, Four Answers

A small numerical example makes the difference between the central-tendency losses concrete. Consider a single input $x$ for which five targets have been observed,

\[ y \in \{1,\; 2,\; 3,\; 4,\; 30\}, \]

where the value $30$ is a gross outlier and the clean signal sits near $3$. We ask what constant prediction $c$ each loss would settle on, treating this as the pointwise problem of Section 1.

The squared-error optimum is the mean, $c^\star_{\text{SE}} = (1 + 2 + 3 + 4 + 30)/5 = 8$. The single outlier has dragged the prediction far above every clean point. The absolute-error optimum is the median, $c^\star_{\text{AE}} = 3$, which ignores the magnitude of the outlier entirely and lands on the clean signal. Huber loss with $\delta = 1$ also lands on $3$ here: at $c = 3$ the residuals $c - y_i$ are $2, 1, 0, -1, -27$, their clipped scores are $1, 1, 0, -1, -1$, and these sum to zero, so the outlier contributes only a bounded pull of magnitude $\delta$. A larger threshold lets the outlier pull harder (with $\delta = 3$ the optimum moves to $3.25$), and once $\delta \ge 22$ no residual is clipped and the answer is the mean, $8$. Quantile loss at $\tau = 0.9$ instead targets the upper tail and returns the largest observation, $30$: any smaller value has at most four of the five targets ($80\%$) below it, short of the required $90\%$. The answer is deliberately high because the application asked for an upper bound rather than a center.

The table summarizes the elicited statistic, the implied noise model, and the answer each loss gives on this dataset.

Loss	Elicited statistic	Implied noise	Answer here
Squared error	conditional mean	Gaussian	$8$
Absolute error	conditional median	Laplace	$3$
Huber, $\delta = 1$	robust mean	Gaussian core, Laplace tails	$3$
Quantile, $\tau = 0.9$	$0.9$ quantile	distribution-free	$30$

The lesson is not that one answer is correct and the others wrong. Each loss answers a different question, and the gap between $8$ and $3$ is exactly the gap between “what is the average of these numbers, outlier included” and “what is a typical value.” Choosing a loss is choosing which question to ask.
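
The four answers can be reproduced by brute force. The following sketch treats the example as the pointwise problem and grid-searches the constant $c$ that minimizes each summed loss over the five targets; the grid range and resolution are arbitrary choices, and the helper functions are redefined locally so the block stands alone.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 30.0])
grid = np.linspace(0.0, 35.0, 35001)          # candidate constants, step 0.001
r = grid[:, None] - y[None, :]                # residuals c - y_i, shape (n_grid, 5)

def huber(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def pinball(r, tau):
    e = -r                                    # y - c, the pinball sign convention
    return np.maximum(tau * e, (tau - 1.0) * e)

answers = {
    "squared error": grid[np.argmin((0.5 * r**2).sum(axis=1))],
    "absolute error": grid[np.argmin(np.abs(r).sum(axis=1))],
    "huber, delta=1": grid[np.argmin(huber(r).sum(axis=1))],
    "quantile, tau=0.9": grid[np.argmin(pinball(r, 0.9).sum(axis=1))],
}
for name, c in answers.items():
    print(f"{name:18s} -> {c:.3f}")
```

On this dataset the search returns $8$ for squared error, $3$ for absolute error, $3$ for Huber with $\delta = 1$, and $30$ for the $0.9$ quantile.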

188.7 7. Robustness, M-Estimation, and Influence

The losses of Sections 2 through 4 are instances of M-estimation, the framework of estimators defined by minimizing a sum of a function $\rho$ of the residuals. Within that framework the influence function measures how much a single observation can move the estimate, and it is proportional to the derivative $\psi(r) = \rho'(r)$, often called the score. The shape of $\psi$ explains the robustness ordering of the losses at a glance.

For squared error $\psi(r) = r$ is unbounded, so a single point arbitrarily far away has arbitrarily large influence and can move the estimate without limit. For absolute error $\psi(r) = \operatorname{sign}(r)$ is bounded by one, so any single point has at most a fixed influence no matter how far it lies. Huber and log-cosh share this bounded-influence property, with $\psi(r) = \delta\operatorname{sign}(r)$ in the tails and $\psi(r) = \tanh(r)$ respectively, which is precisely why they resist outliers while keeping smooth behavior near zero. A bounded score is the formal statement of robustness.
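
The contrast between unbounded and bounded influence shows up directly if one observation is pushed farther and farther away. The sketch below reuses the four clean targets of the worked example and varies the outlier; the specific outlier values are illustrative.

```python
import numpy as np

clean = np.array([1.0, 2.0, 3.0, 4.0])
outliers = [30.0, 300.0, 3000.0]

# Squared-error estimate (mean) and absolute-error estimate (median)
# as the single contaminated point moves away.
means = [float(np.append(clean, o).mean()) for o in outliers]
medians = [float(np.median(np.append(clean, o))) for o in outliers]

for o, m, med in zip(outliers, means, medians):
    print(f"outlier={o:7.1f}  mean={m:6.1f}  median={med:.1f}")
```

The mean grows without limit, following the unbounded score $\psi(r) = r$, while the median does not move at all, as the bounded score $\operatorname{sign}(r)$ predicts.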

flowchart TD
    A["Pick a loss for regression"] --> B{"Need uncertainty bounds or asymmetric cost"}
    B -->|"yes"| Q["Quantile loss, fit several tau"]
    B -->|"no"| C{"Outliers or heavy tails in residuals"}
    C -->|"no"| SE["Squared error, targets the mean"]
    C -->|"yes"| D{"Want one tunable robustness scale"}
    D -->|"yes"| H["Huber loss, tune delta"]
    D -->|"no"| E{"Need a smooth twice differentiable loss"}
    E -->|"yes"| LC["Log cosh loss"]
    E -->|"no"| AE["Absolute error, targets the median"]

This diagram encodes the same logic as the decision guide below: first decide whether you need a distribution summary or a point, then whether the noise is clean, and finally how much tuning and smoothness you want.

188.8 8. Matching the Loss to the Noise Model

The choice of loss is the choice of an implicit noise model, and the right choice follows from the statistic you want and the shape of the noise you expect.

188.8.1 8.1 A decision guide

Squared error is the default when residuals are roughly symmetric, light-tailed, and free of gross outliers, and when the conditional mean is the quantity of interest. It is the maximum likelihood estimator under Gaussian noise and yields the most efficient estimates when that assumption holds.

Absolute error suits data with heavy-tailed noise or a meaningful fraction of corrupt labels, when the conditional median is acceptable or preferred. It trades statistical efficiency under clean Gaussian noise for robustness under contamination.

Huber and log-cosh are pragmatic defaults when outliers are present but not dominant. They keep the smooth, well-behaved gradients of MSE near the optimum while bounding the influence of large residuals like MAE. Huber is preferred when the outlier scale is known or tunable through $\delta$; log-cosh is convenient when a single hyperparameter-free smooth loss is wanted.

Quantile loss is the tool when the application needs uncertainty bounds or an asymmetric cost structure, for example when underprediction and overprediction carry different real-world penalties, as in inventory, staffing, or risk-sensitive forecasting.

188.8.2 8.2 Practical considerations

Several engineering points cut across the choice. Target scaling matters because most of these losses are not scale invariant. Squared error scales as the square of the target units and absolute error scales linearly, so the relative weight of MSE and MAE in a combined objective shifts if you rescale $y$; standardizing $y$ to zero mean and unit variance makes the meaning of $\delta$ in Huber and of the learning rate consistent across problems. The threshold $\delta$ is itself a scale, and a common practice is to set it from a robust estimate of the residual spread such as the median absolute deviation, $\widehat{\sigma} \approx 1.4826 \cdot \operatorname{median}_i |r_i - \operatorname{median}_j r_j|$, which tracks the noise level without being inflated by outliers.
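
The median absolute deviation formula translates directly into code. In the sketch below the residuals are made up for illustration, `mad_scale` is a hypothetical helper name, and the multiplier $1.345$ used to turn the scale estimate into a Huber threshold is a conventional choice from the robust statistics literature, not something prescribed by the text above.

```python
import numpy as np

def mad_scale(residuals):
    """Robust scale estimate: 1.4826 * median(|r - median(r)|)."""
    r = np.asarray(residuals, dtype=float)
    return 1.4826 * np.median(np.abs(r - np.median(r)))

# Five well-behaved residuals and one gross outlier (illustrative values).
r = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 100.0])

sigma_mad = mad_scale(r)        # set by the clean residuals
sigma_std = r.std()             # inflated by the single outlier
delta = 1.345 * sigma_mad       # assumed multiplier; tune for the problem at hand
print(sigma_mad, sigma_std, delta)
```

The standard deviation is dominated by the outlier, while the MAD-based estimate stays near the spread of the clean residuals, which is the behavior wanted when the estimate feeds the Huber threshold.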

Reduction matters: averaging the per-example loss over the batch keeps gradient magnitudes independent of batch size, whereas summing couples them. Heteroscedastic noise can be addressed directly by letting the network output both a mean and a variance and minimizing the Gaussian negative log likelihood, which down-weights residuals where the model is uncertain; care is needed because the variance head can collapse, and clamping or a softplus parameterization of the variance keeps training stable. Finally, the loss interacts with optimization: the constant gradient magnitude of MAE often calls for a decaying learning rate, while the vanishing gradients of MSE near the optimum behave well with standard schedules. The mature open-source frameworks expose these losses directly, for example torch.nn.MSELoss, L1Loss, HuberLoss, and SmoothL1Loss in PyTorch and mean_squared_error, mean_absolute_error, and Huber in Keras, so the engineering choice usually reduces to selecting and parameterizing a built-in.

The unifying principle is that a regression loss is a likelihood in disguise. When you select squared error you assert Gaussian noise and ask for the mean; when you select absolute error you assert Laplace noise and ask for the median; when you select Huber or log-cosh you assert a Gaussian-like core with heavier, exponentially decaying tails; and when you select quantile loss you ask directly for a quantile of the conditional distribution. Making that assertion explicit, and checking it against the empirical residuals, is the most reliable way to match the loss to the data.

188.9 9. Summary

Regression losses are not interchangeable knobs. Each defines a target statistic of the conditional distribution, an implied noise model, and through its score function a degree of robustness to outliers. Squared error gives the mean under Gaussian noise with unbounded influence, absolute error gives the median under Laplace noise with bounded influence, Huber and log-cosh blend the two for robustness with smooth optimization, the epsilon-insensitive loss adds a tolerance tube and sparse fits, and quantile loss reaches any conditional quantile and supports prediction intervals. The disciplined workflow is to decide which statistic the application needs, characterize the noise from residual diagnostics, and select the loss whose likelihood and influence function match that noise. The loss, not the architecture alone, determines what the network ultimately learns to predict.

188.10 References

Huber, P. J. “Robust Estimation of a Location Parameter.” Annals of Mathematical Statistics, 1964. https://doi.org/10.1214/aoms/1177703732
Koenker, R., and Bassett, G. “Regression Quantiles.” Econometrica, 1978. https://doi.org/10.2307/1913643
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. 2nd ed. Springer, 2009. https://hastie.su.domains/ElemStatLearn/
Nix, D. A., and Weigend, A. S. “Estimating the Mean and Variance of the Target Probability Distribution.” IEEE ICNN, 1994. https://doi.org/10.1109/ICNN.1994.374138
Koenker, R. Quantile Regression. Cambridge University Press, 2005. https://doi.org/10.1017/CBO9780511754098
Hampel, F. R. “The Influence Curve and Its Role in Robust Estimation.” Journal of the American Statistical Association, 1974. https://doi.org/10.1080/01621459.1974.10482962
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., and Vapnik, V. “Support Vector Regression Machines.” Advances in Neural Information Processing Systems, 1996. https://papers.nips.cc/paper/1996/hash/d38901788c533e8286cb6400b40b386d-Abstract.html
PyTorch Documentation. “Loss Functions.” https://pytorch.org/docs/stable/nn.html#loss-functions

# Loss Functions for Regression in Neural Networks Regression asks a neural network to map an input $x$ to a continuous target $y \in \mathbb{R}^d$. The architecture proposes a prediction $\hat{y} = f_\theta(x)$, but the loss function decides what counts as a good prediction. This choice is not cosmetic. Each loss encodes an implicit assumption about how the observed targets deviate from the underlying signal, and that assumption determines which statistic of the conditional distribution the network learns, how it responds to outliers, and how its gradients behave during optimization. This chapter develops the standard regression losses from first principles, connects each to a probabilistic noise model, and gives practical guidance for matching the loss to the data. ## 1. The Statistical View of Regression Loss Suppose the data are drawn from a joint distribution $p(x, y)$. A loss $L(y, \hat{y})$ measures the cost of predicting $\hat{y}$ when the truth is $y$. Training minimizes the empirical risk $$ \hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f_\theta(x_i)\big), $$ which approximates the population risk $R(\theta) = \mathbb{E}_{(x,y) \sim p}\,[L(y, f_\theta(x))]$. For a flexible enough network, the minimizer at each input $x$ is the constant $c$ that minimizes the conditional expected loss $\mathbb{E}[L(y, c) \mid x]$. The functional form of $L$ therefore selects which property of $p(y \mid x)$ the network estimates. Squared error recovers the conditional mean, absolute error recovers the conditional median, and quantile loss recovers an arbitrary conditional quantile. Understanding a loss means knowing the statistic it targets and the noise model that makes it the maximum likelihood objective. ### 1.1 The pointwise minimizer and elicitability The reason a loss can be analyzed one input at a time is a decomposition of the population risk. 
Because expectation towers, for any prediction function $g$, $$ R(g) = \mathbb{E}_{x}\Big[\, \mathbb{E}_{y}\big[L(y, g(x)) \mid x\big] \,\Big]. $$ The inner conditional risk $\rho(c \mid x) = \mathbb{E}[L(y, c) \mid x]$ depends on $g$ only through the single number $c = g(x)$, and different inputs are coupled only through the capacity of the network. An unconstrained, infinitely flexible predictor can therefore set $g^\star(x) = \arg\min_c \rho(c \mid x)$ independently at every $x$. The statistic $T[p(y \mid x)] = \arg\min_c \rho(c \mid x)$ that a loss recovers in this idealized limit is called the *functional elicited* by the loss, and a functional that arises this way from some loss is called *elicitable*. The mean, the median, and every quantile are elicitable; the variance and the conditional mode are not elicitable on their own, which is one reason they are awkward to target with a plain regression loss. Real networks have finite capacity, so the recovered statistic is an approximation, but the elicited functional is what training aims at and is the right way to reason about what a loss does. ### 1.2 Negative log likelihood as the bridge The bridge between losses and probability is negative log likelihood. If we posit $p(y \mid x) = q(y \mid f_\theta(x))$ for some parametric density $q$, then minimizing $-\sum_i \log q(y_i \mid f_\theta(x_i))$ is equivalent, up to constants, to minimizing a particular loss. Reading the loss off the density, and the density off the loss, is the central skill of this chapter. The general rule is $$ L(y, \hat{y}) = -\log q(y \mid \hat{y}) + \text{const}, $$ so any loss of the form $L(y, \hat{y}) = \ell(y - \hat{y})$ that depends only on the residual corresponds to an additive noise model $y = \hat{y} + \varepsilon$ whose density is $q(\varepsilon) \propto \exp(-\ell(\varepsilon))$, provided $\exp(-\ell)$ is integrable. 
Squared error gives a Gaussian, absolute error gives a Laplace, and the compromise losses of Section 4 give bell-shaped cores with heavier-than-Gaussian tails. This correspondence also explains a subtlety: a loss is only a proper likelihood if the implied density integrates to one for a fixed scale, which is why losses with a free scale parameter (the $\delta$ of Huber, the $b$ of Laplace) should jointly estimate or fix that scale to remain interpretable as maximum likelihood. ## 2. Squared Error and the Gaussian Model ### 2.1 Definition and the conditional mean The squared error loss, often called L2 loss, is $$ L_{\text{SE}}(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^2, $$ and its average over a dataset is the mean squared error (MSE). To see which statistic it targets, fix $x$ and minimize the conditional risk $\rho(c) = \tfrac{1}{2}\mathbb{E}[(y - c)^2 \mid x]$ over $c$. The function is strictly convex in $c$ with $\rho''(c) = 1 > 0$, so its unique minimizer is found by setting the derivative to zero: $$ \rho'(c) = \mathbb{E}\big[-(y - c) \mid x\big] = c - \mathbb{E}[y \mid x] = 0 \;\Longrightarrow\; c^\star = \mathbb{E}[y \mid x]. $$ Squared error trains the network to output the conditional mean. A useful companion identity is the bias-variance style decomposition $\mathbb{E}[(y - c)^2 \mid x] = \operatorname{Var}(y \mid x) + (c - \mathbb{E}[y\mid x])^2$, which shows that the irreducible part of the risk equals the conditional noise variance and that no predictor can drive the squared-error risk below it. ### 2.2 The Gaussian likelihood Assume the target is the signal plus homoscedastic Gaussian noise, $y = f_\theta(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. The negative log likelihood of one observation is $$ -\log q(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2). $$ With $\sigma$ fixed, minimizing this over $\theta$ is exactly minimizing squared error. 
MSE is therefore the maximum likelihood objective under additive Gaussian noise. If the noise is heteroscedastic, the network can also predict $\sigma^2(x)$ and minimize the full Gaussian negative log likelihood, which reweights each residual by its inverse variance. ### 2.3 Gradient behavior The gradient with respect to the prediction is $\partial L_{\text{SE}} / \partial \hat{y} = \hat{y} - y$, linear in the residual $r = \hat{y} - y$. Large residuals produce large gradients. This is efficient when errors are genuinely Gaussian, since big residuals are rare, but it makes MSE sensitive to outliers and to mislabeled targets, because a single gross error can dominate the batch gradient. ```python def squared_error(y, y_hat): r = y_hat - y return 0.5 * r**2 # gradient w.r.t. y_hat is r ``` ## 3. Absolute Error and the Laplace Model ### 3.1 Definition and the conditional median The absolute error loss, or L1 loss, is $$ L_{\text{AE}}(y, \hat{y}) = |y - \hat{y}|, $$ and its average is the mean absolute error (MAE). Minimizing $\rho(c) = \mathbb{E}[\,|y - c|\, \mid x]$ over $c$ yields the conditional median rather than the mean. To see this, differentiate under the expectation using $\tfrac{d}{dc}|y - c| = -\operatorname{sign}(y - c)$: $$ \rho'(c) = \mathbb{E}\big[-\operatorname{sign}(y - c) \mid x\big] = P(y < c \mid x) - P(y > c \mid x). $$ Setting this to zero requires $P(y < c \mid x) = P(y > c \mid x)$, which is the defining property of a median. The median is robust: moving a far-away target even farther does not move the optimal prediction, because the subgradient of $|r|$ saturates at $\pm 1$ and the optimality condition counts points on each side rather than weighting them by distance. ### 3.2 The Laplace likelihood Absolute error is the negative log likelihood of a Laplace distribution, $q(y \mid \hat{y}) \propto \exp(-|y - \hat{y}| / b)$. The Laplace density has heavier tails than the Gaussian, so it assigns more probability to large deviations. 
Choosing MAE is therefore a statement that the data contain occasional large errors that should not dominate fitting. ### 3.3 Gradient behavior The subgradient is $\partial L_{\text{AE}} / \partial \hat{y} = \operatorname{sign}(\hat{y} - y)$. Its magnitude is constant, so outliers contribute no more to the gradient than small errors do. This delivers robustness but causes two difficulties. First, the gradient does not shrink as the prediction approaches the target, which can cause the optimizer to oscillate around the minimum unless the learning rate decays. Second, the loss is nondifferentiable at $r = 0$, though autodiff frameworks define a subgradient there. ## 4. The Huber and Log-Cosh Compromises Squared error has smooth, vanishing gradients near the optimum but is fragile to outliers. Absolute error is robust but has a kink at zero and constant gradient magnitude. Two losses interpolate between these regimes. ### 4.1 Huber loss The Huber loss is quadratic for small residuals and linear for large ones, with a threshold $\delta$ marking the transition: $$ L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| \le \delta, \\[4pt] \delta\big(|r| - \tfrac{1}{2}\delta\big), & |r| > \delta, \end{cases} \qquad r = \hat{y} - y. $$ The pieces are stitched so that both the value and the first derivative are continuous at $|r| = \delta$. The derivative is $$ \frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} r, & |r| \le \delta, \\ \delta \operatorname{sign}(r), & |r| > \delta, \end{cases} $$ which is clipped at magnitude $\delta$. Small residuals receive MSE-style gradients that vanish at the optimum, while large residuals receive bounded MAE-style gradients that limit the influence of outliers. The hyperparameter $\delta$ sets the residual scale at which a point is treated as an outlier. As $\delta \to 0$ the loss approaches MAE, and as $\delta \to \infty$ it approaches MSE. 
In practice $\delta$ is tuned to the expected noise scale, sometimes adaptively from a robust estimate of the residual spread such as the median absolute deviation.

```python
def huber(y, y_hat, delta=1.0):
    # Scalar version: y and y_hat are plain numbers.
    r = y_hat - y
    a = abs(r)
    quad = 0.5 * r**2                # quadratic branch, used when |r| <= delta
    lin = delta * (a - 0.5 * delta)  # linear branch, used when |r| > delta
    return quad if a <= delta else lin
```

Probabilistically, Huber loss corresponds to a density that is Gaussian in the core and Laplace in the tails, which is the maximum likelihood view of a contamination model where most points are clean and a minority are heavy-tailed.

### 4.2 Log-cosh loss

The log-cosh loss is a smooth surrogate with similar behavior but no threshold to tune:

$$
L_{\text{lc}}(r) = \log\!\big(\cosh r\big).
$$

For small $r$, a Taylor expansion gives $\log\cosh r \approx \tfrac{1}{2} r^2$, so it behaves like squared error near the optimum. For large $|r|$, $\log\cosh r \approx |r| - \log 2$, so it grows linearly like absolute error. Its derivative is

$$
\frac{\partial L_{\text{lc}}}{\partial \hat{y}} = \tanh(r),
$$

which is smooth everywhere, bounded in $(-1, 1)$, and vanishes at $r = 0$. Log-cosh is infinitely differentiable, which can help second-order optimizers: it avoids both the kink of MAE at $r = 0$ and the jump in the second derivative of Huber at $|r| = \delta$ (Huber itself is only once continuously differentiable). The cost is that its outlier resistance is fixed by the $\tanh$ saturation scale and cannot be tuned the way $\delta$ tunes Huber.

### 4.3 The epsilon-insensitive loss

A third robust loss treats small residuals as free rather than merely cheap. The epsilon-insensitive loss, central to support vector regression, charges nothing inside a tube of half-width $\varepsilon$ around the target and grows linearly outside it:

$$
L_\varepsilon(r) = \max\big(0,\; |r| - \varepsilon\big), \qquad r = \hat{y} - y.
$$

Inside the tube the gradient is zero, so residuals smaller than $\varepsilon$ exert no pull at all and the fit is sparse in the sense that only points on or outside the tube boundary influence the solution.
Outside the tube the gradient has constant magnitude one, giving the same outlier resistance as MAE. The flat interior makes the loss tolerant of small measurement noise and yields models that depend on few support points, which is attractive when a tolerance band is acceptable and a compact model is wanted. The price is an additional hyperparameter, $\varepsilon$, and a region of exactly zero gradient that can stall learning if $\varepsilon$ is set too wide. Setting $\varepsilon = 0$ recovers MAE.

## 5. Quantile Loss and Conditional Quantiles

### 5.1 The pinball loss

The losses so far predict a central tendency. Many applications need a range. Quantile regression estimates the conditional quantile at a chosen level $\tau \in (0, 1)$ using the pinball loss

$$
L_\tau(y, \hat{y}) = \begin{cases} \tau\,(y - \hat{y}), & y \ge \hat{y}, \\ (1 - \tau)\,(\hat{y} - y), & y < \hat{y}. \end{cases}
$$

This is an asymmetric absolute error. When the prediction is too low (an underestimate, $y \ge \hat{y}$), the error is weighted by $\tau$. When the prediction is too high, it is weighted by $1 - \tau$. For $\tau = 0.5$ both weights equal $\tfrac{1}{2}$, the loss reduces to half the absolute error, and the target is the median. For $\tau = 0.9$ underestimates cost nine times as much as overestimates, pushing the prediction up until only ten percent of targets exceed it.

Minimizing $\rho(c) = \mathbb{E}[L_\tau(y, c) \mid x]$ yields exactly the $\tau$-th conditional quantile of $p(y \mid x)$. Differentiating the two branches and combining gives

$$
\rho'(c) = (1 - \tau)\,P(y < c \mid x) - \tau\,P(y > c \mid x) = P(y < c \mid x) - \tau,
$$

using $P(y < c \mid x) + P(y > c \mid x) = 1$ for a continuous target. Setting $\rho'(c) = 0$ gives $P(y < c \mid x) = \tau$, the definition of the $\tau$-th quantile. The median proof of Section 3.1 is the special case $\tau = \tfrac{1}{2}$.
```python
def pinball(y, y_hat, tau=0.5):
    r = y - y_hat
    # r >= 0 (underestimate) costs tau * r; r < 0 (overestimate) costs (1 - tau) * |r|
    return max(tau * r, (tau - 1.0) * r)
```

### 5.2 Prediction intervals

Training one network with multiple output heads, each with its own $\tau$, produces a set of conditional quantiles. A pair such as $\tau = 0.05$ and $\tau = 0.95$ forms a ninety percent prediction interval, giving a distribution-free measure of uncertainty without any Gaussian assumption. Because each head is penalized only by its own pinball loss, nothing forces the outputs to be ordered, and they can in principle cross, with a lower quantile exceeding a higher one. Sorting the outputs, penalizing crossings, or using a monotone architecture restores a coherent ordering. Quantile loss thus moves regression from point estimation to a partial picture of the conditional distribution.

## 6. A Worked Example: One Outlier, Four Answers

A small numerical example makes the difference between the losses concrete. Consider a single input $x$ for which five targets have been observed,

$$
y \in \{1,\; 2,\; 3,\; 4,\; 30\},
$$

where the value $30$ is a gross outlier and the clean signal sits near $3$. We ask what constant prediction $c$ each loss would settle on, treating this as the pointwise problem of Section 1.

The squared-error optimum is the mean, $c^\star_{\text{SE}} = (1 + 2 + 3 + 4 + 30)/5 = 8$. The single outlier has dragged the prediction far above every clean point. The absolute-error optimum is the median, $c^\star_{\text{AE}} = 3$, which ignores the magnitude of the outlier entirely and lands on the clean signal. Huber loss with $\delta = 1$ also settles on $3$ for this dataset: at $c = 3$ the residuals $c - y_i$ are $2, 1, 0, -1, -27$, their clipped values are $1, 1, 0, -1, -1$, and these sum to zero. The outlier contributes only a bounded pull of magnitude $\delta$, which the clean points balance, so the optimum stays far from $8$. Quantile loss at $\tau = 0.9$ instead targets the upper tail. Only four of the five targets (eighty percent) lie at or below $4$, so the prediction must climb to the largest observation, $30$, before ninety percent of the mass is covered. The answer is deliberately high because the application asked for an upper bound rather than a center.
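
The pointwise optima for this five-point dataset can be confirmed by brute force. The sketch below assumes NumPy; the grid range and step are arbitrary choices made for illustration, and the helper names are hypothetical. It evaluates the summed loss for every candidate constant $c$ on the grid and reports the minimizer for each loss.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 30.0])
grid = np.linspace(0.0, 31.0, 3101)   # candidate constants c, step 0.01 (illustrative)
r = grid[:, None] - y[None, :]        # residuals c - y_i, shape (3101, 5)

def huber(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def pinball(r, tau=0.9):
    # r = c - y, so y - c = -r; underestimates (y > c) are weighted by tau
    return np.maximum(tau * (-r), (tau - 1.0) * (-r))

losses = {
    "squared error": 0.5 * r**2,
    "absolute error": np.abs(r),
    "huber, delta=1": huber(r),
    "pinball, tau=0.9": pinball(r),
}

# Sum each loss over the five targets and pick the grid point with the smallest total.
best = {name: float(grid[np.argmin(L.sum(axis=1))]) for name, L in losses.items()}
for name, c in best.items():
    print(f"{name:>17}: c* = {c:.2f}")   # 8.00, 3.00, 3.00, 30.00
```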
The table summarizes the elicited statistic, the implied noise model, and the answer each loss gives on this dataset.

| Loss | Elicited statistic | Implied noise | Answer here |
|------|--------------------|---------------|-------------|
| Squared error | conditional mean | Gaussian | $8$ |
| Absolute error | conditional median | Laplace | $3$ |
| Huber, $\delta = 1$ | robust mean | Gaussian core, Laplace tails | $3$ |
| Quantile, $\tau = 0.9$ | $0.9$ quantile | distribution-free | $30$ |

The lesson is not that one answer is correct and the others wrong. Each loss answers a different question, and the gap between $8$ and $3$ is exactly the gap between "what is the average of these numbers, outlier included" and "what is a typical value." Choosing a loss is choosing which question to ask.

## 7. Robustness, M-Estimation, and Influence

The losses of Sections 2 through 4 are instances of M-estimation, the framework of estimators defined by minimizing a sum of a function $\rho$ of the residuals. Within that framework the influence function measures how much a single observation can move the estimate, and it is proportional to the derivative $\psi(r) = \rho'(r)$, often called the score. The shape of $\psi$ explains the robustness ordering of the losses at a glance.

For squared error $\psi(r) = r$ is unbounded, so a single point arbitrarily far away has arbitrarily large influence and can move the estimate without limit. For absolute error $\psi(r) = \operatorname{sign}(r)$ is bounded by one, so any single point has at most a fixed influence no matter how far it lies. Huber and log-cosh share this bounded-influence property, with $\psi(r) = \delta\operatorname{sign}(r)$ in the tails and $\psi(r) = \tanh(r)$ respectively, which is precisely why they resist outliers while keeping smooth behavior near zero. A bounded score is the formal statement of robustness.
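
The contrast between bounded and unbounded scores can be seen by evaluating $\psi$ at a few residuals. This is a minimal sketch assuming NumPy; the function names and the sample residuals are hypothetical and chosen only to include one very large value on each side.

```python
import numpy as np

def psi_squared(r):
    return r                          # unbounded: grows with the residual

def psi_absolute(r):
    return np.sign(r)                 # bounded by 1

def psi_huber(r, delta=1.0):
    return np.clip(r, -delta, delta)  # bounded by delta

def psi_logcosh(r):
    return np.tanh(r)                 # bounded by 1

r = np.array([-1e3, -1.0, 0.0, 1.0, 1e3])   # illustrative residuals
scores = {
    "squared": psi_squared(r),
    "absolute": psi_absolute(r),
    "huber": psi_huber(r),
    "log-cosh": psi_logcosh(r),
}
for name, psi in scores.items():
    print(f"{name:>9}: max |psi| = {np.max(np.abs(psi)):g}")
```

Only the squared-error score grows with the residual; the other three saturate, which is the bounded-influence property in numerical form.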
```{mermaid}
flowchart TD
    A["Pick a loss for regression"] --> B{"Need uncertainty bounds or asymmetric cost"}
    B -->|"yes"| Q["Quantile loss, fit several tau"]
    B -->|"no"| C{"Outliers or heavy tails in residuals"}
    C -->|"no"| SE["Squared error, targets the mean"]
    C -->|"yes"| D{"Want one tunable robustness scale"}
    D -->|"yes"| H["Huber loss, tune delta"]
    D -->|"no"| E{"Need a smooth twice differentiable loss"}
    E -->|"yes"| LC["Log cosh loss"]
    E -->|"no"| AE["Absolute error, targets the median"]
```

This diagram encodes the same logic as the decision guide below: first decide whether you need a distribution summary or a point, then whether the noise is clean, and finally how much tuning and smoothness you want.

## 8. Matching the Loss to the Noise Model

The choice of loss is the choice of an implicit noise model, and the right choice follows from the statistic you want and the shape of the noise you expect.

### 8.1 A decision guide

Squared error is the default when residuals are roughly symmetric, light-tailed, and free of gross outliers, and when the conditional mean is the quantity of interest. It is the maximum likelihood estimator under Gaussian noise and yields the most efficient estimates when that assumption holds.

Absolute error suits data with heavy-tailed noise or a meaningful fraction of corrupt labels, when the conditional median is acceptable or preferred. It trades statistical efficiency under clean Gaussian noise for robustness under contamination.

Huber and log-cosh are pragmatic defaults when outliers are present but not dominant. They keep the smooth, well-behaved gradients of MSE near the optimum while bounding the influence of large residuals like MAE. Huber is preferred when the outlier scale is known or tunable through $\delta$; log-cosh is convenient when a single hyperparameter-free smooth loss is wanted.

Quantile loss is the tool when the application needs uncertainty bounds or an asymmetric cost structure, for example when underprediction and overprediction carry different real-world penalties, as in inventory, staffing, or risk-sensitive forecasting.

### 8.2 Practical considerations

Several engineering points cut across the choice.

Target scaling matters because most of these losses are not scale invariant. Squared error scales as the square of the target units and absolute error scales linearly, so the relative weight of MSE and MAE in a combined objective shifts if you rescale $y$; standardizing $y$ to zero mean and unit variance makes the meaning of $\delta$ in Huber and of the learning rate consistent across problems. The threshold $\delta$ is itself a scale, and a common practice is to set it from a robust estimate of the residual spread such as the median absolute deviation, $\widehat{\sigma} \approx 1.4826 \cdot \operatorname{median}_i |r_i - \operatorname{median}_j r_j|$, which tracks the noise level without being inflated by outliers.

Reduction matters: averaging the per-example loss over the batch keeps gradient magnitudes independent of batch size, whereas summing makes them grow in proportion to it.

Heteroscedastic noise can be addressed directly by letting the network output both a mean and a variance and minimizing the Gaussian negative log likelihood, which down-weights residuals where the predicted variance is large; care is needed because the variance head can collapse, and clamping or a softplus parameterization of the variance keeps training stable.

Finally, the loss interacts with optimization: the constant gradient magnitude of MAE often calls for a decaying learning rate, while the vanishing gradients of MSE near the optimum behave well with standard schedules.
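
The median-absolute-deviation rule for setting $\delta$ can be sketched as follows. This assumes NumPy; the synthetic residuals (Gaussian noise with standard deviation 2 plus a few gross outliers), the helper name `mad_scale`, and the multiplier `k` are all hypothetical choices made for illustration, not values prescribed by this chapter.

```python
import numpy as np

def mad_scale(residuals):
    """Robust noise-scale estimate: 1.4826 * median(|r - median(r)|)."""
    r = np.asarray(residuals, dtype=float)
    return 1.4826 * np.median(np.abs(r - np.median(r)))

# Synthetic residuals: clean Gaussian noise (std 2) plus 2% gross outliers.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 2.0, size=1000)
residuals = np.concatenate([clean, np.full(20, 100.0)])

sigma_hat = mad_scale(residuals)   # stays close to the clean noise scale
std_hat = residuals.std()          # inflated by the outliers

k = 1.0                            # hypothetical multiplier; tune for the application
delta = k * sigma_hat              # Huber threshold on the residual scale
```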
The mature open-source frameworks expose these losses directly, for example `torch.nn.MSELoss`, `L1Loss`, `HuberLoss`, and `SmoothL1Loss` in PyTorch and `mean_squared_error`, `mean_absolute_error`, and `Huber` in Keras, so the engineering choice usually reduces to selecting and parameterizing a built-in.

The unifying principle is that a regression loss is a likelihood in disguise. When you select squared error you assert Gaussian noise and ask for the mean; when you select absolute error you assert Laplace noise and ask for the median; when you select Huber or log-cosh you assert a Gaussian core with heavy tails; and when you select quantile loss you ask directly for a quantile of the conditional distribution. Making that assertion explicit, and checking it against the empirical residuals, is the most reliable way to match the loss to the data.

## 9. Summary

Regression losses are not interchangeable knobs. Each defines a target statistic of the conditional distribution, an implied noise model, and through its score function a degree of robustness to outliers. Squared error gives the mean under Gaussian noise with unbounded influence, absolute error gives the median under Laplace noise with bounded influence, Huber and log-cosh blend the two for robustness with smooth optimization, the epsilon-insensitive loss adds a tolerance tube and sparse fits, and quantile loss reaches any conditional quantile and supports prediction intervals. The disciplined workflow is to decide which statistic the application needs, characterize the noise from residual diagnostics, and select the loss whose likelihood and influence function match that noise. The loss, not the architecture alone, determines what the network ultimately learns to predict.

## References

1. Huber, P. J. "Robust Estimation of a Location Parameter." Annals of Mathematical Statistics, 1964. https://doi.org/10.1214/aoms/1177703732
2. Koenker, R., and Bassett, G. "Regression Quantiles." Econometrica, 1978. https://doi.org/10.2307/1913643
3. Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
4. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/
5. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. 2nd ed. Springer, 2009. https://hastie.su.domains/ElemStatLearn/
6. Nix, D. A., and Weigend, A. S. "Estimating the Mean and Variance of the Target Probability Distribution." IEEE ICNN, 1994. https://doi.org/10.1109/ICNN.1994.374138
7. Koenker, R. Quantile Regression. Cambridge University Press, 2005. https://doi.org/10.1017/CBO9780511754098
8. Hampel, F. R. "The Influence Curve and Its Role in Robust Estimation." Journal of the American Statistical Association, 1974. https://doi.org/10.1080/01621459.1974.10482962
9. Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., and Vapnik, V. "Support Vector Regression Machines." Advances in Neural Information Processing Systems, 1996. https://papers.nips.cc/paper/1996/hash/d38901788c533e8286cb6400b40b386d-Abstract.html
10. PyTorch Documentation. "Loss Functions." https://pytorch.org/docs/stable/nn.html#loss-functions