188 Loss Functions for Regression in Neural Networks
Regression asks a neural network to map an input \(x\) to a continuous target \(y \in \mathbb{R}^d\). The architecture proposes a prediction \(\hat{y} = f_\theta(x)\), but the loss function decides what counts as a good prediction. This choice is not cosmetic. Each loss encodes an implicit assumption about how the observed targets deviate from the underlying signal, and that assumption determines which statistic of the conditional distribution the network learns, how it responds to outliers, and how its gradients behave during optimization. This chapter develops the standard regression losses from first principles, connects each to a probabilistic noise model, and gives practical guidance for matching the loss to the data.
188.1 1. The Statistical View of Regression Loss
Suppose the data are drawn from a joint distribution \(p(x, y)\). A loss \(L(y, \hat{y})\) measures the cost of predicting \(\hat{y}\) when the truth is \(y\). Training minimizes the empirical risk
\[ \hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f_\theta(x_i)\big), \]
which approximates the population risk \(R(\theta) = \mathbb{E}_{(x,y) \sim p}\,[L(y, f_\theta(x))]\). For a flexible enough network, the minimizer at each input \(x\) is the constant \(c\) that minimizes the conditional expected loss \(\mathbb{E}[L(y, c) \mid x]\). The functional form of \(L\) therefore selects which property of \(p(y \mid x)\) the network estimates. Squared error recovers the conditional mean, absolute error recovers the conditional median, and quantile loss recovers an arbitrary conditional quantile. Understanding a loss means knowing the statistic it targets and the noise model that makes it the maximum likelihood objective.
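As an informal check that the form of \(L\) selects the statistic, the sketch below brute-forces the best constant prediction for a small made-up sample. The sample values, the grid bounds, and the helper name `best_constant` are illustrative assumptions, not part of the chapter's method.

```python
import statistics

def best_constant(ys, loss, lo=0.0, hi=12.0, steps=1201):
    """Return the grid value c that minimizes the average loss over ys."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda c: sum(loss(y, c) for y in ys) / len(ys))

# Hypothetical targets observed at one fixed input x.
ys = [1.0, 2.0, 3.0, 4.0, 10.0]

c_se = best_constant(ys, lambda y, c: 0.5 * (y - c) ** 2)  # squared error
c_ae = best_constant(ys, lambda y, c: abs(y - c))          # absolute error

print(c_se, statistics.mean(ys))    # the squared-error minimizer sits at the mean
print(c_ae, statistics.median(ys))  # the absolute-error minimizer sits at the median
```

A grid search is used only to keep the check independent of any closed-form result; in training, the network plays the role of the constant at each input.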
The bridge between losses and probability is negative log likelihood. If we posit \(p(y \mid x) = q(y \mid f_\theta(x))\) for some parametric density \(q\), then minimizing \(-\sum_i \log q(y_i \mid f_\theta(x_i))\) is equivalent, up to constants, to minimizing a particular loss. Reading the loss off the density, and the density off the loss, is the central skill of this chapter.
188.2 2. Squared Error and the Gaussian Model
188.2.1 2.1 Definition and the conditional mean
The squared error loss, often called L2 loss, is
\[ L_{\text{SE}}(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^2, \]
and its average over a dataset is the mean squared error (MSE). To see which statistic it targets, fix \(x\) and minimize \(\mathbb{E}[(y - c)^2 \mid x]\) over \(c\). Differentiating and setting the result to zero gives \(c^\star = \mathbb{E}[y \mid x]\). Squared error trains the network to output the conditional mean.
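The differentiation step can be written out explicitly, assuming the conditional second moment of \(y\) is finite so that the derivative and the expectation can be exchanged:
\[ \frac{d}{dc}\,\mathbb{E}\big[(y - c)^2 \mid x\big] = -2\,\mathbb{E}[\,y - c \mid x\,] = -2\big(\mathbb{E}[y \mid x] - c\big) = 0 \quad\Longrightarrow\quad c^\star = \mathbb{E}[y \mid x]. \]
The second derivative is \(2 > 0\), so the stationary point is a minimum. Equivalently, \(\mathbb{E}[(y - c)^2 \mid x] = \operatorname{Var}(y \mid x) + \big(\mathbb{E}[y \mid x] - c\big)^2\): the variance term does not depend on \(c\), and the remaining term vanishes only at the conditional mean.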
188.2.2 2.2 The Gaussian likelihood
Assume the target is the signal plus homoscedastic Gaussian noise, \(y = f_\theta(x) + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\). The negative log likelihood of one observation is
\[ -\log q(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2). \]
With \(\sigma\) fixed, minimizing this over \(\theta\) is exactly minimizing squared error. MSE is therefore the maximum likelihood objective under additive Gaussian noise. If the noise is heteroscedastic, the network can also predict \(\sigma^2(x)\) and minimize the full Gaussian negative log likelihood, which reweights each residual by its inverse variance.
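The heteroscedastic case can be sketched as follows. The function names and the choice to parameterize the variance through `log_var` (which keeps \(\sigma^2\) positive) are assumptions made for illustration; the arguments `mu` and `log_var` stand in for two hypothetical network outputs.

```python
import math

def gaussian_nll(y, mu, log_var):
    """Per-example Gaussian negative log likelihood with predicted mean and log-variance."""
    var = math.exp(log_var)
    return 0.5 * (y - mu) ** 2 / var + 0.5 * (log_var + math.log(2.0 * math.pi))

def gaussian_nll_grad_mu(y, mu, log_var):
    """Gradient w.r.t. mu: the residual (mu - y) scaled by the inverse variance."""
    return (mu - y) / math.exp(log_var)

# The same residual of 2.0 under two assumed variance predictions.
nll_confident = gaussian_nll(y=2.0, mu=0.0, log_var=0.0)            # sigma^2 = 1
nll_uncertain = gaussian_nll(y=2.0, mu=0.0, log_var=math.log(4.0))  # sigma^2 = 4

g_confident = gaussian_nll_grad_mu(2.0, 0.0, 0.0)
g_uncertain = gaussian_nll_grad_mu(2.0, 0.0, math.log(4.0))
print(g_confident, g_uncertain)  # the larger variance shrinks the gradient on mu
```

With `log_var` held constant, the second term is a constant and the first is squared error divided by \(\sigma^2\), which recovers the homoscedastic case above.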
188.2.3 2.3 Gradient behavior
The gradient with respect to the prediction is \(\partial L_{\text{SE}} / \partial \hat{y} = \hat{y} - y\), linear in the residual \(r = \hat{y} - y\). Large residuals produce large gradients. This is efficient when errors are genuinely Gaussian, since big residuals are rare, but it makes MSE sensitive to outliers and to mislabeled targets, because a single gross error can dominate the batch gradient.
def squared_error(y, y_hat):
    r = y_hat - y
    return 0.5 * r**2  # gradient w.r.t. y_hat is r
188.3 3. Absolute Error and the Laplace Model
188.3.1 3.1 Definition and the conditional median
The absolute error loss, or L1 loss, is
\[ L_{\text{AE}}(y, \hat{y}) = |y - \hat{y}|, \]
and its average is the mean absolute error (MAE). Minimizing \(\mathbb{E}[\,|y - c|\, \mid x]\) over \(c\) yields the conditional median rather than the mean. The median is robust: moving a far-away target even farther does not move the optimal prediction, because every target contributes a subgradient of magnitude 1 no matter how large its residual is.
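A small numeric illustration of this robustness, using made-up targets at a single input: pushing the largest value much farther away leaves the median untouched while dragging the mean with it.

```python
import statistics

# Hypothetical targets at a fixed input; the last value plays the outlier.
clean = [1.0, 2.0, 3.0, 4.0, 10.0]
worse = [1.0, 2.0, 3.0, 4.0, 1000.0]  # same data, outlier pushed much farther

median_clean, median_worse = statistics.median(clean), statistics.median(worse)
mean_clean, mean_worse = statistics.mean(clean), statistics.mean(worse)

print(median_clean, median_worse)  # 3.0 3.0
print(mean_clean, mean_worse)      # 4.0 202.0
```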
188.3.2 3.2 The Laplace likelihood
Absolute error is, up to a scale factor and an additive constant, the negative log likelihood of a Laplace distribution, \(q(y \mid \hat{y}) \propto \exp(-|y - \hat{y}| / b)\): with the scale \(b\) fixed, \(-\log q(y \mid \hat{y}) = |y - \hat{y}| / b + \log(2b)\). The Laplace density has heavier tails than the Gaussian, so it assigns more probability to large deviations. Choosing MAE is therefore a statement that the data contain occasional large errors that should not dominate fitting.
188.3.3 3.3 Gradient behavior
The subgradient is \(\partial L_{\text{AE}} / \partial \hat{y} = \operatorname{sign}(\hat{y} - y)\). Its magnitude is constant, so outliers contribute no more to the gradient than small errors do. This delivers robustness but causes two difficulties. First, the gradient does not shrink as the prediction approaches the target, which can cause the optimizer to oscillate around the minimum unless the learning rate decays. Second, the loss is nondifferentiable at \(r = 0\), though autodiff frameworks define a subgradient there.
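The oscillation can be reproduced on a toy problem with one scalar prediction and one target. The target value, learning rate, and the \(1/(1+t)\) decay schedule below are arbitrary choices for illustration, and `descend_mae` is a hypothetical helper.

```python
def sign(v):
    """Return -1, 0, or 1; 0 is the subgradient chosen at the kink."""
    return (v > 0) - (v < 0)

def descend_mae(y, y_hat, lr, steps, decay=False):
    """Subgradient descent on |y_hat - y| for a single scalar prediction."""
    for t in range(steps):
        step = lr / (1 + t) if decay else lr
        y_hat -= step * sign(y_hat - y)
    return y_hat

y = 0.25
fixed = descend_mae(y, y_hat=0.0, lr=0.1, steps=100)
decayed = descend_mae(y, y_hat=0.0, lr=0.1, steps=100, decay=True)

print(abs(fixed - y))    # stays about 0.05 away, bouncing between ~0.2 and ~0.3
print(abs(decayed - y))  # ends within the final step size of the target
```

With a fixed step the iterate can never get closer than the distance to the nearest multiple of the step, whereas the shrinking step tightens the band around the target.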
188.4 4. The Huber and Log-Cosh Compromises
Squared error has smooth, vanishing gradients near the optimum but is fragile to outliers. Absolute error is robust but has a kink at zero and constant gradient magnitude. Two losses interpolate between these regimes.
188.4.1 4.1 Huber loss
The Huber loss is quadratic for small residuals and linear for large ones, with a threshold \(\delta\) marking the transition:
\[ L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| \le \delta, \\[4pt] \delta\big(|r| - \tfrac{1}{2}\delta\big), & |r| > \delta, \end{cases} \qquad r = \hat{y} - y. \]
The pieces are stitched so that both the value and the first derivative are continuous at \(|r| = \delta\). The derivative is
\[ \frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} r, & |r| \le \delta, \\ \delta \operatorname{sign}(r), & |r| > \delta, \end{cases} \]
which is clipped at magnitude \(\delta\). Small residuals receive MSE-style gradients that vanish at the optimum, while large residuals receive bounded MAE-style gradients that limit the influence of outliers. The hyperparameter \(\delta\) sets the residual scale at which a point is treated as an outlier. As \(\delta \to \infty\) the loss approaches squared error. As \(\delta \to 0\) the loss itself shrinks to zero, but the rescaled loss \(L_\delta / \delta\) approaches absolute error, so the minimizer behaves like the MAE minimizer. In practice \(\delta\) is tuned to the expected noise scale, sometimes adaptively from a robust estimate of the residual spread such as the median absolute deviation.
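The clipped derivative and one possible way of setting \(\delta\) from the median absolute deviation (MAD) can be sketched as below. The helper names are hypothetical. The factor 1.4826 makes the MAD a consistent estimate of \(\sigma\) for Gaussian data, and \(k = 1.345\) is a commonly quoted tuning constant for Huber's estimator; treating their product as the threshold is an assumption of this sketch, not a rule from the text.

```python
import statistics

def huber_grad(y, y_hat, delta=1.0):
    """Derivative of the Huber loss w.r.t. y_hat: the residual clipped to [-delta, delta]."""
    r = y_hat - y
    return max(-delta, min(delta, r))

def delta_from_mad(residuals, k=1.345):
    """Heuristic threshold: k times a MAD-based estimate of the residual scale."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    return k * 1.4826 * mad

# Hypothetical residuals from a partially trained model; 8.0 is a gross error.
residuals = [-0.5, -0.2, 0.0, 0.1, 0.3, 8.0]
delta = delta_from_mad(residuals)

print(delta)                         # roughly 0.5, barely affected by the 8.0
print(huber_grad(0.0, 0.1, delta))   # small residual passes through unchanged
print(huber_grad(0.0, 8.0, delta))   # large residual is clipped to delta
```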
def huber(y, y_hat, delta=1.0):
    r = y_hat - y
    a = abs(r)
    quad = 0.5 * r**2                 # quadratic branch, used when |r| <= delta
    lin = delta * (a - 0.5 * delta)   # linear branch, used when |r| > delta
    return quad if a <= delta else lin
Probabilistically, Huber loss corresponds to a density that is Gaussian in the core and Laplace in the tails, which is the maximum likelihood view of a contamination model where most points are clean and a minority are heavy-tailed.
188.4.2 4.2 Log-cosh loss
The log-cosh loss is a smooth surrogate with similar behavior but no threshold to tune:
\[ L_{\text{lc}}(r) = \log\!\big(\cosh r\big). \]
For small \(r\), a Taylor expansion gives \(\log\cosh r \approx \tfrac{1}{2} r^2\), so it behaves like squared error near the optimum. For large \(|r|\), \(\log\cosh r \approx |r| - \log 2\), so it grows linearly like absolute error. Its derivative is
\[ \frac{\partial L_{\text{lc}}}{\partial \hat{y}} = \tanh(r), \]
which is smooth everywhere, bounded in \((-1, 1)\), and vanishes at \(r = 0\). Log-cosh is infinitely differentiable, which can help second order optimizers: it avoids both the kink of MAE at \(r = 0\) and the jump in Huber's second derivative at \(|r| = \delta\). The cost is that its outlier resistance is fixed by the \(\tanh\) saturation scale and cannot be tuned the way \(\delta\) tunes Huber.
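Evaluating \(\log\cosh r\) literally overflows for large residuals, because \(\cosh r\) grows like \(e^{|r|}/2\). A minimal sketch of a stable form, using the identity \(\log\cosh r = |r| + \log(1 + e^{-2|r|}) - \log 2\), is shown below; the function names are illustrative.

```python
import math

def log_cosh(y, y_hat):
    """Numerically stable log(cosh(r)) with r = y_hat - y."""
    a = abs(y_hat - y)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def log_cosh_grad(y, y_hat):
    """Derivative w.r.t. y_hat, which is tanh(r)."""
    return math.tanh(y_hat - y)

small = log_cosh(0.0, 0.01)    # close to 0.5 * 0.01**2, the squared-error regime
large = log_cosh(0.0, 1000.0)  # close to 1000 - log 2, the absolute-error regime
# math.log(math.cosh(1000.0)) would raise OverflowError instead.
```

The two evaluations match the small-\(r\) and large-\(|r|\) approximations given above.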
188.5 5. Quantile Loss and Conditional Quantiles
188.5.1 5.1 The pinball loss
The losses so far predict a central tendency. Many applications need a range. Quantile regression estimates the conditional quantile at a chosen level \(\tau \in (0, 1)\) using the pinball loss
\[ L_\tau(y, \hat{y}) = \begin{cases} \tau\,(y - \hat{y}), & y \ge \hat{y}, \\ (1 - \tau)\,(\hat{y} - y), & y < \hat{y}. \end{cases} \]
This is an asymmetric absolute error. When the prediction is too low (an underestimate, \(y \ge \hat{y}\)), the error is weighted by \(\tau\). When the prediction is too high, it is weighted by \(1 - \tau\). For \(\tau = 0.5\) both weights equal \(\tfrac{1}{2}\), the loss reduces to half the absolute error, and the target is the median. For \(\tau = 0.9\) underestimates cost nine times as much as overestimates, pushing the prediction up until only ten percent of targets exceed it. Minimizing \(\mathbb{E}[L_\tau(y, c) \mid x]\) yields exactly the \(\tau\)-th conditional quantile of \(p(y \mid x)\).
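To see why the minimizer is the quantile, assume for simplicity that the conditional CDF \(F(c) = P(y \le c \mid x)\) is continuous, an assumption added here to keep the derivative clean. The pinball loss has slope \(-\tau\) in \(c\) when \(y \ge c\) and slope \(1 - \tau\) when \(y < c\), so
\[ \frac{d}{dc}\,\mathbb{E}[L_\tau(y, c) \mid x] = (1 - \tau)\,F(c) - \tau\,\big(1 - F(c)\big) = F(c) - \tau. \]
Setting this to zero gives \(F(c^\star) = \tau\). The derivative is negative below that point and positive above it, so \(c^\star\) is a minimum and equals the \(\tau\)-th conditional quantile.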
def pinball(y, y_hat, tau=0.5):
    r = y - y_hat
    return max(tau * r, (tau - 1.0) * r)
188.5.2 5.2 Prediction intervals
Training one network with multiple output heads, each with its own \(\tau\), produces a set of conditional quantiles. A pair such as \(\tau = 0.05\) and \(\tau = 0.95\) forms a nominal ninety percent prediction interval, giving a measure of uncertainty that requires no Gaussian or other parametric assumption. Because nothing in the pinball loss ties the heads to one another, their outputs can cross, with a lower quantile exceeding a higher one. Sorting the outputs, penalizing crossings, or using a monotone architecture restores a coherent ordering. Quantile loss thus moves regression from point estimation to a partial picture of the conditional distribution.
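A minimal sketch of the multi-head objective and the sorting repair follows. The head outputs are made-up numbers, and `multi_quantile_loss` and `fix_crossing` are hypothetical helper names; a real model would produce `preds` from its output layer.

```python
def pinball(y, y_hat, tau):
    """Pinball loss for one target and one quantile level."""
    r = y - y_hat
    return max(tau * r, (tau - 1.0) * r)

def multi_quantile_loss(y, preds, taus):
    """Average pinball loss over several heads; preds[k] is the head for taus[k]."""
    return sum(pinball(y, p, t) for p, t in zip(preds, taus)) / len(taus)

def fix_crossing(preds):
    """Post-hoc repair: sort the outputs, assuming preds is ordered by increasing tau."""
    return sorted(preds)

taus = [0.05, 0.5, 0.95]
raw = [1.2, 1.0, 3.0]         # hypothetical head outputs; the first two cross
ordered = fix_crossing(raw)   # [1.0, 1.2, 3.0]
lower, upper = ordered[0], ordered[-1]  # nominal 90% interval for this input

loss_raw = multi_quantile_loss(2.0, raw, taus)
loss_ordered = multi_quantile_loss(2.0, ordered, taus)
```

Sorting is applied only at prediction time here; penalizing crossings or constraining the architecture would instead change training.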
188.6 6. Matching the Loss to the Noise Model
The choice of loss is the choice of an implicit noise model, and the right choice follows from the statistic you want and the shape of the noise you expect.
188.6.1 6.1 A decision guide
Squared error is the default when residuals are roughly symmetric, light-tailed, and free of gross outliers, and when the conditional mean is the quantity of interest. It is the maximum likelihood objective under Gaussian noise and yields the most efficient estimates when that assumption holds.
Absolute error suits data with heavy-tailed noise or a meaningful fraction of corrupt labels, when the conditional median is acceptable or preferred. It trades statistical efficiency under clean Gaussian noise for robustness under contamination.
Huber and log-cosh are pragmatic defaults when outliers are present but not dominant. They keep the smooth, well-behaved gradients of MSE near the optimum while bounding the influence of large residuals like MAE. Huber is preferred when the outlier scale is known or tunable through \(\delta\); log-cosh is convenient when a single hyperparameter-free smooth loss is wanted.
Quantile loss is the tool when the application needs uncertainty bounds or an asymmetric cost structure, for example when underprediction and overprediction carry different real-world penalties, as in inventory, staffing, or risk-sensitive forecasting.
188.6.2 6.2 Practical considerations
Several engineering points cut across the choice. Target scaling matters: standardizing \(y\) to unit variance makes the meaning of \(\delta\) in Huber and of the learning rate consistent across problems. Reduction matters: averaging the per-example loss over the batch keeps gradient magnitudes independent of batch size, whereas summing makes them grow in proportion to it. Heteroscedastic noise can be addressed directly by letting the network output both a mean and a variance and minimizing the Gaussian negative log likelihood, which down-weights residuals wherever the predicted variance is large. Finally, the loss interacts with optimization: the constant gradient magnitude of MAE often calls for a decaying learning rate, while the vanishing gradients of MSE near the optimum behave well with standard schedules.
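The point about reduction can be checked with a toy batch in which every example shares one scalar prediction. The helper `batch_grad` and the numbers are illustrative assumptions; squared error is used so that each per-example gradient is simply the residual.

```python
def batch_grad(ys, y_hat, reduction="mean"):
    """Gradient of the batch squared-error loss w.r.t. one shared scalar prediction."""
    per_example = [y_hat - y for y in ys]  # each example contributes its residual
    total = sum(per_example)
    return total / len(ys) if reduction == "mean" else total

batch = [1.0, 2.0, 3.0]
doubled = batch * 2  # the same examples repeated, so twice the batch size

mean_small, mean_large = batch_grad(batch, 0.0), batch_grad(doubled, 0.0)
sum_small, sum_large = batch_grad(batch, 0.0, "sum"), batch_grad(doubled, 0.0, "sum")

print(mean_small, mean_large)  # -2.0 -2.0
print(sum_small, sum_large)    # -6.0 -12.0
```

Under summation, doubling the batch doubles the gradient, so the effective step size changes with batch size unless the learning rate is rescaled.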
The unifying principle is that a regression loss is a likelihood in disguise. When you select squared error you assert Gaussian noise and ask for the mean; when you select absolute error you assert Laplace noise and ask for the median; when you select Huber or log-cosh you assert a roughly Gaussian core with heavier, Laplace-like tails; and when you select quantile loss you ask directly for a quantile of the conditional distribution. Making that assertion explicit, and checking it against the empirical residuals, is the most reliable way to match the loss to the data.
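One simple way to check the assertion against residuals is the sample excess kurtosis, which is 0 for a Gaussian and 3 for a Laplace distribution. The sketch below applies it to synthetic residuals; using kurtosis as the diagnostic, the sample size, and the seed are all assumptions of this example, and in practice the residuals would come from a held-out set.

```python
import random
import statistics

def excess_kurtosis(residuals):
    """Sample excess kurtosis: near 0 for Gaussian residuals, near 3 for Laplace."""
    m = statistics.fmean(residuals)
    var = statistics.fmean((r - m) ** 2 for r in residuals)
    m4 = statistics.fmean((r - m) ** 4 for r in residuals)
    return m4 / var ** 2 - 3.0

rng = random.Random(0)
n = 20000
gaussian_res = [rng.gauss(0.0, 1.0) for _ in range(n)]
# The difference of two independent exponentials is Laplace distributed.
laplace_res = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]

k_gauss = excess_kurtosis(gaussian_res)
k_laplace = excess_kurtosis(laplace_res)
print(k_gauss, k_laplace)
```

A value well above zero is a hint, not a proof, that a robust loss such as absolute error or Huber may fit the noise better than squared error; a histogram or quantile plot of the residuals is a useful complement.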
188.7 7. Summary
Regression losses are not interchangeable knobs. Each defines a target statistic of the conditional distribution and an implied noise model. Squared error gives the mean under Gaussian noise, absolute error gives the median under Laplace noise, Huber and log-cosh blend the two for robustness with smooth optimization, and quantile loss reaches any conditional quantile and supports prediction intervals. The disciplined workflow is to decide which statistic the application needs, characterize the noise from residual diagnostics, and select the loss whose likelihood matches that noise. The loss, not the architecture alone, determines what the network ultimately learns to predict.