192 Backpropagation Derivation

Backpropagation is the algorithm that makes training deep neural networks computationally feasible. At its core it is nothing more than a disciplined application of the chain rule of multivariable calculus, organized so that every partial derivative needed for gradient descent is computed exactly once. This chapter derives the algorithm from first principles. We build the computational graph, state the chain rule on that graph, introduce the delta terms that summarize per layer error signals, prove the four backpropagation equations, and then assemble a complete derivation for a multilayer network. The treatment is rigorous and self contained, assuming only familiarity with partial derivatives, matrix algebra, and the structure of a feedforward network.

192.1 1. Setup and Notation

We consider a feedforward network with $L$ layers. Layer $l$ holds a vector of activations $a^l \in \mathbb{R}^{n_l}$, where $n_l$ is the number of units in that layer. The input layer is $a^0 = x$, the raw feature vector, and the output layer is $a^L$, the network prediction. Between consecutive layers we have a weight matrix $W^l \in \mathbb{R}^{n_l \times n_{l-1}}$ and a bias vector $b^l \in \mathbb{R}^{n_l}$.

The forward computation for each layer $l = 1, \dots, L$ proceeds in two steps. First the affine pre activation, often called the weighted input:

\[ z^l = W^l a^{l-1} + b^l . \]

Then the elementwise nonlinearity:

\[ a^l = \sigma(z^l), \]

where $\sigma$ is an activation function such as the logistic sigmoid, the hyperbolic tangent, or the rectified linear unit, applied componentwise. In scalar form, the $j$ th unit of layer $l$ computes

\[ z^l_j = \sum_{k=1}^{n_{l-1}} W^l_{jk}\, a^{l-1}_k + b^l_j, \qquad a^l_j = \sigma(z^l_j). \]

The index convention deserves attention. The entry $W^l_{jk}$ is the weight on the connection from unit $k$ in layer $l-1$ to unit $j$ in layer $l$. Writing the destination index first looks backward at a glance, but it is exactly what lets us write the forward pass as the clean matrix product $W^l a^{l-1}$ without a transpose. This choice pays off repeatedly in what follows.

Training minimizes a scalar loss $C$ that compares the prediction $a^L$ to a target $y$. For a single example we might use the squared error $C = \tfrac{1}{2}\lVert a^L - y \rVert^2$ or, for classification, the cross entropy. The entire purpose of backpropagation is to compute the gradients $\partial C / \partial W^l_{jk}$ and $\partial C / \partial b^l_j$ for every weight and bias, so that a gradient based optimizer can update the parameters.

192.2 2. The Computational Graph

A neural network is a composition of simple functions, and that composition is naturally a directed acyclic graph. Each node represents an intermediate quantity, and each edge represents a functional dependency. For our feedforward network the dependency chain for any single training example reads

\[ x \to z^1 \to a^1 \to z^2 \to a^2 \to \cdots \to z^L \to a^L \to C . \]

Every arrow is a differentiable map. The arrow $a^{l-1} \to z^l$ is the affine map parameterized by $W^l$ and $b^l$. The arrow $z^l \to a^l$ is the elementwise nonlinearity. The final arrow $a^L \to C$ is the loss.

The graph view is more than decoration. It tells us precisely which quantities influence which others, and therefore which partial derivatives are nonzero. Because the graph is feedforward, $z^l$ depends on the parameters only through layers $1$ through $l$, and $C$ depends on $z^l$ only through the downstream chain $z^l \to a^l \to z^{l+1} \to \cdots \to C$. This locality is what backpropagation exploits.

The key insight is that the loss is a function of the weighted inputs at the last layer, which are functions of the weighted inputs at the second to last layer, and so on. To differentiate $C$ with respect to a parameter buried deep in the network, we must trace every path from that parameter to the output. The chain rule on a graph automates this tracing.

192.3 3. The Chain Rule on the Graph

The multivariable chain rule is the engine of backpropagation. Suppose a scalar $C$ depends on variables $u_1, \dots, u_m$, and each $u_i$ depends on a variable $t$. Then

\[ \frac{\partial C}{\partial t} = \sum_{i=1}^{m} \frac{\partial C}{\partial u_i}\, \frac{\partial u_i}{\partial t}. \]

The sum runs over all intermediate variables through which $t$ influences $C$. On the computational graph this is the statement that the total derivative along a node is the sum of contributions over all outgoing edges.

Apply this to a weighted input. The scalar $z^l_j$ feeds into every weighted input of the next layer, since $z^{l+1}_i = \sum_j W^{l+1}_{ij} \sigma(z^l_j) + b^{l+1}_i$. Therefore the loss depends on $z^l_j$ only through the vector $z^{l+1}$, and the chain rule gives

\[ \frac{\partial C}{\partial z^l_j} = \sum_{i=1}^{n_{l+1}} \frac{\partial C}{\partial z^{l+1}_i}\, \frac{\partial z^{l+1}_i}{\partial z^l_j}. \]

This single equation is the heart of the backward pass. It expresses the sensitivity of the loss to a weighted input in layer $l$ in terms of the sensitivities at layer $l+1$. The recursion runs from the output back toward the input, which is exactly why the algorithm is called backpropagation. Notice that the only derivatives we ever need are local ones, $\partial z^{l+1}_i / \partial z^l_j$, plus the already computed downstream sensitivities. Nothing requires us to expand the full composite function symbolically.

192.4 4. The Delta Terms

The quantity that appears on both sides of the recursion above deserves its own name. Define the error of unit $j$ in layer $l$ as the partial derivative of the cost with respect to the weighted input of that unit:

\[ \delta^l_j \;\equiv\; \frac{\partial C}{\partial z^l_j}. \]

We collect these into a vector $\delta^l \in \mathbb{R}^{n_l}$. The choice to differentiate with respect to $z^l_j$ rather than the activation $a^l_j$ is deliberate and it is what makes the algebra clean. Because the bias $b^l_j$ enters the network only through $z^l_j$ with a coefficient of one, the error $\delta^l_j$ turns out to equal the bias gradient directly, as we will see. It also separates the contribution of the nonlinearity from the contribution of the linear weights in a tidy way.

The name “error” comes from an interpretation. Imagine perturbing the weighted input by a small amount $\Delta z^l_j$. The loss changes by approximately $\delta^l_j\, \Delta z^l_j$. If $\delta^l_j$ is large in magnitude, the unit is far from a configuration that locally minimizes the loss, and adjusting it can yield a large improvement. If $\delta^l_j$ is near zero, the unit is at a local plateau and contributes little to the gradient. The delta terms are the carriers of the error signal as it flows backward through the network.

With this definition, the entire backpropagation algorithm reduces to four equations: one that initializes the delta at the output layer, one that propagates delta from layer $l+1$ to layer $l$, and two that read off the parameter gradients from the deltas. We now derive each.

192.5 5. The Four Backpropagation Equations

192.5.1 5.1 BP1: Error at the Output Layer

We begin at the end of the graph, where $C$ is a direct function of $a^L$, and $a^L = \sigma(z^L)$. By the chain rule, the loss depends on $z^L_j$ only through $a^L_j$ (the nonlinearity is elementwise, so $z^L_j$ affects only $a^L_j$):

\[ \delta^L_j = \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\, \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j). \]

The first factor $\partial C / \partial a^L_j$ measures how fast the loss changes with the output activation, and depends only on the choice of loss function. The second factor $\sigma'(z^L_j)$ measures how fast the activation changes with the weighted input, and depends only on the choice of nonlinearity. In vector form, writing $\nabla_a C$ for the vector of output activation derivatives and $\odot$ for the elementwise Hadamard product,

\[ \boxed{\;\delta^L = \nabla_a C \odot \sigma'(z^L).\;} \tag{BP1} \]

For the squared error $C = \tfrac{1}{2}\lVert a^L - y \rVert^2$ we have $\nabla_a C = a^L - y$, so $\delta^L = (a^L - y) \odot \sigma'(z^L)$. A worthwhile special case arises with softmax output combined with cross entropy loss, where the $\sigma'$ factor and the loss derivative combine so that $\delta^L = a^L - y$ exactly, with no leftover activation derivative. This cancellation is one reason that pairing is ubiquitous in classification.

192.5.2 5.2 BP2: Backpropagating the Error

Next we propagate the error from layer $l+1$ to layer $l$. We start from the chain rule expression derived in Section 3:

\[ \delta^l_j = \sum_{i=1}^{n_{l+1}} \frac{\partial C}{\partial z^{l+1}_i}\, \frac{\partial z^{l+1}_i}{\partial z^l_j} = \sum_{i=1}^{n_{l+1}} \delta^{l+1}_i\, \frac{\partial z^{l+1}_i}{\partial z^l_j}. \]

It remains to compute the local derivative $\partial z^{l+1}_i / \partial z^l_j$. Recall that

\[ z^{l+1}_i = \sum_{k=1}^{n_l} W^{l+1}_{ik}\, a^l_k + b^{l+1}_i = \sum_{k=1}^{n_l} W^{l+1}_{ik}\, \sigma(z^l_k) + b^{l+1}_i. \]

Differentiating with respect to $z^l_j$, only the term $k = j$ survives, since the activations $\sigma(z^l_k)$ for $k \neq j$ do not depend on $z^l_j$:

\[ \frac{\partial z^{l+1}_i}{\partial z^l_j} = W^{l+1}_{ij}\, \sigma'(z^l_j). \]

Substituting back,

\[ \delta^l_j = \sum_{i=1}^{n_{l+1}} \delta^{l+1}_i\, W^{l+1}_{ij}\, \sigma'(z^l_j) = \sigma'(z^l_j) \sum_{i=1}^{n_{l+1}} W^{l+1}_{ij}\, \delta^{l+1}_i. \]

The inner sum $\sum_i W^{l+1}_{ij} \delta^{l+1}_i$ is precisely the $j$ th component of the matrix vector product $(W^{l+1})^\top \delta^{l+1}$, because transposing swaps the row and column indices. In vector form,

\[ \boxed{\;\delta^l = \big( (W^{l+1})^\top \delta^{l+1} \big) \odot \sigma'(z^l).\;} \tag{BP2} \]

This is the workhorse recursion. Read it as a two stage operation. First, $(W^{l+1})^\top \delta^{l+1}$ transports the error backward across the linear layer, using the transpose of the same weights that carried activations forward. Then the Hadamard product with $\sigma'(z^l)$ passes that transported error through the local slope of the nonlinearity. The forward pass moves signal through $W^{l+1}$ and then $\sigma$; the backward pass moves error through $\sigma'$ and then $(W^{l+1})^\top$. The transpose is the discrete analogue of an adjoint, and its appearance is not a coincidence but a structural feature of differentiating a linear map.

192.5.3 5.3 BP3 and BP4: Gradients with Respect to Parameters

We now extract the gradients we actually want. Consider the bias $b^l_j$ first. It enters the network only through $z^l_j$, with $\partial z^l_j / \partial b^l_j = 1$. By the chain rule,

\[ \frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\, \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \cdot 1 = \delta^l_j. \]

So the bias gradient is simply the delta, which retroactively justifies our decision to define the error in terms of $z^l$ rather than $a^l$. In vector form,

\[ \boxed{\;\frac{\partial C}{\partial b^l} = \delta^l.\;} \tag{BP3} \]

Now the weight $W^l_{jk}$. It enters through $z^l_j$ alone, multiplying the upstream activation $a^{l-1}_k$, so $\partial z^l_j / \partial W^l_{jk} = a^{l-1}_k$. The chain rule yields

\[ \frac{\partial C}{\partial W^l_{jk}} = \frac{\partial C}{\partial z^l_j}\, \frac{\partial z^l_j}{\partial W^l_{jk}} = \delta^l_j\, a^{l-1}_k. \]

Each weight gradient is the product of the error at the destination unit and the activation at the source unit. Collecting all $j, k$ pairs into a matrix, this is the outer product

\[ \boxed{\;\frac{\partial C}{\partial W^l} = \delta^l \,(a^{l-1})^\top.\;} \tag{BP4} \]

Equations BP1 through BP4 constitute the complete algorithm. BP1 seeds the recursion at the output. BP2 carries it backward layer by layer. BP3 and BP4 convert the per layer deltas into parameter gradients. Nothing else is needed.

A reading of BP4 worth internalizing: a weight gradient is small when either the input activation is small or the output error is small. A neuron with $a^{l-1}_k \approx 0$ learns slowly on its outgoing weights regardless of error, and a unit whose $\delta^l_j \approx 0$ learns slowly on its incoming weights regardless of input. These observations explain phenomena such as slow learning under saturated sigmoids, where $\sigma'(z) \approx 0$ drives the deltas toward zero and stalls the gradient.

192.6 6. Complete Derivation for a Multilayer Network

We now assemble the full algorithm for a network of arbitrary depth $L$, combining the equations above into a forward pass followed by a backward pass. We then verify the logic by unrolling the recursion explicitly.

192.6.1 6.1 The Algorithm

Given a single training example $(x, y)$, the procedure is as follows.

Forward pass:
  a[0] = x
  for l = 1, ..., L:
      z[l] = W[l] @ a[l-1] + b[l]
      a[l] = sigma(z[l])

Backward pass:
  delta[L] = grad_a C  *  sigma_prime(z[L])          # BP1
  for l = L-1, ..., 1:
      delta[l] = (W[l+1].T @ delta[l+1]) * sigma_prime(z[l])   # BP2

Gradients:
  for l = 1, ..., L:
      dC_db[l] = delta[l]                            # BP3
      dC_dW[l] = delta[l] @ a[l-1].T                 # BP4

The forward pass caches every $z^l$ and $a^l$, because the backward pass needs them: BP2 requires $\sigma'(z^l)$ and BP4 requires $a^{l-1}$. This caching is the memory cost of backpropagation, and it is why training a deep network consumes far more memory than mere inference.

192.6.2 6.2 Unrolling the Recursion

To convince ourselves that BP2 truly computes $\partial C / \partial z^l_j$ for an arbitrary layer, we unroll it. Consider a three layer network and ask for $\delta^1$. Applying BP2 once,

\[ \delta^2 = \big( (W^3)^\top \delta^3 \big) \odot \sigma'(z^2), \]

and applying it again,

\[ \delta^1 = \big( (W^2)^\top \delta^2 \big) \odot \sigma'(z^1). \]

Substituting and writing $S^l = \operatorname{diag}(\sigma'(z^l))$ for the diagonal matrix of activation slopes, the Hadamard products become matrix multiplications and the deltas chain into a clean product:

\[ \delta^1 = S^1 (W^2)^\top S^2 (W^3)^\top \delta^3 . \]

With $\delta^3 = S^3 \nabla_a C$ from BP1, the full sensitivity of the loss to the first layer weighted inputs is

\[ \delta^1 = S^1 (W^2)^\top S^2 (W^3)^\top S^3\, \nabla_a C . \]

This product form is illuminating. The gradient that reaches the early layers is a chain of weight transposes interleaved with diagonal slope matrices. If the spectral norms of these factors are consistently less than one, the product shrinks geometrically with depth and the gradient vanishes; if they consistently exceed one, the product grows and the gradient explodes. The vanishing and exploding gradient problems that motivate careful initialization, normalization layers, and residual connections are visible directly in this unrolled expression. The same product, read forward, is exactly the Jacobian of the loss with respect to the deep weighted inputs, confirming that backpropagation computes a genuine derivative and not an approximation.

192.6.3 6.3 Correctness and the Total Derivative

Why does the layer by layer recursion give the correct total derivative, even though a parameter in an early layer influences the loss through an enormous number of distinct paths through the network? The answer is that BP2 implicitly sums over all those paths. Each application of the chain rule at a node sums over its immediate successors, and composing these single step sums across layers produces a sum over every directed path from the parameter to the output. This is the path sum interpretation of the chain rule: the total derivative equals the sum, over all paths from input node to output node, of the product of local derivatives along each path. Backpropagation never enumerates paths explicitly. By storing one delta per node and reusing it for all predecessors, it computes the same path sum in time linear in the number of edges. A naive forward mode that recomputed the influence of each parameter separately would cost time proportional to the number of parameters times the number of outputs, which is wasteful when there are many parameters and few outputs, exactly the regime of deep learning.

192.6.4 6.4 Batching and the Loss over a Dataset

The derivation above used a single example. In practice we minimize an average loss over a mini batch of $m$ examples, $C = \tfrac{1}{m} \sum_{i=1}^{m} C_i$. Because differentiation is linear, the gradient of the average is the average of the gradients:

\[ \frac{\partial C}{\partial W^l} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial C_i}{\partial W^l}. \]

Implementations realize this by stacking the $m$ examples as columns of an activation matrix $A^{l-1} \in \mathbb{R}^{n_{l-1} \times m}$ and likewise stacking the deltas into $\Delta^l \in \mathbb{R}^{n_l \times m}$. Then BP4 becomes a single matrix product that sums the per example outer products automatically:

\[ \frac{\partial C}{\partial W^l} = \frac{1}{m}\, \Delta^l (A^{l-1})^\top, \]

while BP3 becomes a row sum, $\partial C / \partial b^l = \tfrac{1}{m} \Delta^l \mathbf{1}$, with $\mathbf{1}$ the all ones vector. The bias gradient sums the delta over the batch dimension. These vectorized forms are what GPU based frameworks execute, and they are algebraically identical to averaging the four equations over the batch.

192.6.5 6.5 Gradient Checking

Since the derivation has many places to slip a sign or a transpose, it is standard practice to verify an implementation against a numerical estimate. For any scalar parameter $\theta$ the central difference approximation gives

\[ \frac{\partial C}{\partial \theta} \approx \frac{C(\theta + \epsilon) - C(\theta - \epsilon)}{2\epsilon}, \]

with error of order $\epsilon^2$. Choosing $\epsilon$ around $10^{-7}$ and comparing the analytic gradient from BP1 through BP4 against this estimate, one expects agreement to several significant digits. A relative discrepancy larger than roughly $10^{-5}$ signals a bug, most often a missing transpose in BP2 or a mismatched activation derivative in BP1. Gradient checking is slow, costing two forward passes per parameter, so it is used only to validate code and never inside the training loop.

192.7 7. Reference Implementation

The companion aiinaction library ships the four equations as a small, validated multilayer perceptron in Python, Julia, and Rust. The network uses sigmoid hidden layers and a linear output with the one-half squared-error loss, so BP1 reduces to the bare residual $\delta^L = a^L - y$ and the activation-derivative factor is isolated entirely in the hidden layers. All three implementations agree to floating-point tolerance on the shared fixture below, and each ships a central-difference numerical_gradient so the analytic gradients from BP1 through BP4 can be checked exactly as Section 6.5 prescribes.

We run one fixed example through a two-layer network ($3 \to 2 \to 2$) and read off the loss, the analytic gradients, and the gradient-checking residual.

Code

from aiinaction.ch187_backprop import (
    make_mlp, forward, squared_error_loss, backprop, numerical_gradient,
)
import numpy as np

# A 3 -> 2 -> 2 network: sigmoid hidden layer, linear output.
net = make_mlp(
    weights=[
        [[0.10, 0.20, -0.30], [0.40, -0.50, 0.60]],   # W^1 (2 x 3)
        [[0.70, -0.80], [-0.10, 0.30]],               # W^2 (2 x 2)
    ],
    biases=[[0.10, -0.20], [0.05, -0.05]],
)
x = [1.0, -2.0, 0.5]
y = [0.3, -0.7]

zs, acts = forward(net, x)
print("output a^L :", np.round(acts[-1], 6).tolist())
print("loss C     :", round(squared_error_loss(net, x, y), 6))

grad_W, grad_b = backprop(net, x, y)
print("dC/dW^1    :", np.round(grad_W[0], 6).tolist())
print("dC/db^1    :", np.round(grad_b[0], 6).tolist())

# Gradient check: analytic vs central difference (Section 6.5).
nW, nb = numerical_gradient(net, x, y, eps=1e-6)
max_err = max(
    float(np.max(np.abs(a - b)))
    for a, b in zip(grad_W + grad_b, nW + nb)
)
print("max |analytic - numerical| :", f"{max_err:.2e}")

output a^L : [-0.314692, 0.153934]
loss C     : 0.553525
dC/dW^1    : [[-0.125051, 0.250101, -0.062525], [0.111552, -0.223103, 0.055776]]
dC/db^1    : [-0.125051, 0.111552]
max |analytic - numerical| : 9.68e-11

using AIInAction.Ch187Backprop

# A 3 -> 2 -> 2 network: sigmoid hidden layer, linear output.
net = make_mlp(
    [[0.10 0.20 -0.30; 0.40 -0.50 0.60],   # W^1 (2 x 3)
     [0.70 -0.80; -0.10 0.30]],            # W^2 (2 x 2)
    [[0.10, -0.20], [0.05, -0.05]],
)
x = [1.0, -2.0, 0.5]
y = [0.3, -0.7]

zs, acts = forward(net, x)
println("output a^L : ", acts[end])
println("loss C     : ", squared_error_loss(net, x, y))

grad_W, grad_b = backprop(net, x, y)
println("dC/dW^1    : ", grad_W[1])
println("dC/db^1    : ", grad_b[1])

nW, nb = numerical_gradient(net, x, y)
max_err = maximum(maximum(abs.(a .- b)) for (a, b) in zip(vcat(grad_W, grad_b), vcat(nW, nb)))
println("max |analytic - numerical| : ", max_err)

use aiinaction::ch187_backprop::*;

fn main() {
    // A 3 -> 2 -> 2 network: sigmoid hidden layer, linear output.
    let net = Mlp::new(
        vec![
            vec![vec![0.10, 0.20, -0.30], vec![0.40, -0.50, 0.60]], // W^1 (2 x 3)
            vec![vec![0.70, -0.80], vec![-0.10, 0.30]],             // W^2 (2 x 2)
        ],
        vec![vec![0.10, -0.20], vec![0.05, -0.05]],
    )
    .unwrap();
    let x = [1.0, -2.0, 0.5];
    let y = [0.3, -0.7];

    let (_zs, acts) = forward(&net, &x).unwrap();
    println!("output a^L : {:?}", acts.last().unwrap());
    println!("loss C     : {}", squared_error_loss(&net, &x, &y).unwrap());

    let (grad_w, grad_b) = backprop(&net, &x, &y).unwrap();
    println!("dC/dW^1    : {:?}", grad_w[0]);
    println!("dC/db^1    : {:?}", grad_b[0]);

    let (nw, nb) = numerical_gradient(&net, &x, &y, 1e-6).unwrap();
    let mut max_err = 0.0_f64;
    for (a, b) in grad_w.iter().zip(nw.iter()) {
        for (ar, br) in a.iter().zip(b.iter()) {
            for (av, bv) in ar.iter().zip(br.iter()) {
                max_err = max_err.max((av - bv).abs());
            }
        }
    }
    for (a, b) in grad_b.iter().zip(nb.iter()) {
        for (av, bv) in a.iter().zip(b.iter()) {
            max_err = max_err.max((av - bv).abs());
        }
    }
    println!("max |analytic - numerical| : {:.2e}", max_err);
}

192.8 8. Summary

Backpropagation is the chain rule applied systematically to the computational graph of a neural network, with intermediate sensitivities cached as delta terms so that no derivative is computed twice. The four equations capture the entire method. BP1, $\delta^L = \nabla_a C \odot \sigma'(z^L)$, seeds the error at the output. BP2, $\delta^l = ((W^{l+1})^\top \delta^{l+1}) \odot \sigma'(z^l)$, propagates it backward through weight transposes and activation slopes. BP3, $\partial C / \partial b^l = \delta^l$, and BP4, $\partial C / \partial W^l = \delta^l (a^{l-1})^\top$, read off the parameter gradients as a delta and as a delta times input outer product respectively. Unrolling the recursion exposes the gradient as a product of weight transposes and slope matrices, which makes the vanishing and exploding gradient problems transparent and motivates much of modern architecture design. The algorithm runs in time linear in the number of edges of the graph, which is what makes the training of deep networks with millions or billions of parameters tractable at all.

192.9 References

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. “Learning representations by back propagating errors.” Nature, 323, 533 to 536, 1986. https://www.nature.com/articles/323533a0
Nielsen, M. “Neural Networks and Deep Learning,” Chapter 2: How the backpropagation algorithm works. 2015. http://neuralnetworksanddeeplearning.com/chap2.html
Goodfellow, I., Bengio, Y., and Courville, A. “Deep Learning,” Chapter 6: Deep Feedforward Networks. MIT Press, 2016. https://www.deeplearningbook.org/contents/mlp.html
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. “Automatic Differentiation in Machine Learning: a Survey.” Journal of Machine Learning Research, 18, 1 to 43, 2018. https://jmlr.org/papers/v18/17-468.html
LeCun, Y., Bottou, L., Orr, G. B., and Muller, K. R. “Efficient BackProp.” In Neural Networks: Tricks of the Trade, Springer, 1998. https://link.springer.com/chapter/10.1007/3-540-49430-8_2
Werbos, P. J. “Backpropagation through time: what it does and how to do it.” Proceedings of the IEEE, 78(10), 1550 to 1560, 1990. https://ieeexplore.ieee.org/document/58337

# Backpropagation Derivation Backpropagation is the algorithm that makes training deep neural networks computationally feasible. At its core it is nothing more than a disciplined application of the chain rule of multivariable calculus, organized so that every partial derivative needed for gradient descent is computed exactly once. This chapter derives the algorithm from first principles. We build the computational graph, state the chain rule on that graph, introduce the delta terms that summarize per layer error signals, prove the four backpropagation equations, and then assemble a complete derivation for a multilayer network. The treatment is rigorous and self contained, assuming only familiarity with partial derivatives, matrix algebra, and the structure of a feedforward network. ## 1. Setup and Notation We consider a feedforward network with $L$ layers. Layer $l$ holds a vector of activations $a^l \in \mathbb{R}^{n_l}$, where $n_l$ is the number of units in that layer. The input layer is $a^0 = x$, the raw feature vector, and the output layer is $a^L$, the network prediction. Between consecutive layers we have a weight matrix $W^l \in \mathbb{R}^{n_l \times n_{l-1}}$ and a bias vector $b^l \in \mathbb{R}^{n_l}$. The forward computation for each layer $l = 1, \dots, L$ proceeds in two steps. First the affine pre activation, often called the weighted input: $$ z^l = W^l a^{l-1} + b^l . $$ Then the elementwise nonlinearity: $$ a^l = \sigma(z^l), $$ where $\sigma$ is an activation function such as the logistic sigmoid, the hyperbolic tangent, or the rectified linear unit, applied componentwise. In scalar form, the $j$ th unit of layer $l$ computes $$ z^l_j = \sum_{k=1}^{n_{l-1}} W^l_{jk}\, a^{l-1}_k + b^l_j, \qquad a^l_j = \sigma(z^l_j). $$ The index convention deserves attention. The entry $W^l_{jk}$ is the weight on the connection from unit $k$ in layer $l-1$ to unit $j$ in layer $l$. Writing the destination index first looks backward at a glance, but it is exactly what lets us write the forward pass as the clean matrix product $W^l a^{l-1}$ without a transpose. This choice pays off repeatedly in what follows. Training minimizes a scalar loss $C$ that compares the prediction $a^L$ to a target $y$. For a single example we might use the squared error $C = \tfrac{1}{2}\lVert a^L - y \rVert^2$ or, for classification, the cross entropy. The entire purpose of backpropagation is to compute the gradients $\partial C / \partial W^l_{jk}$ and $\partial C / \partial b^l_j$ for every weight and bias, so that a gradient based optimizer can update the parameters. ## 2. The Computational Graph A neural network is a composition of simple functions, and that composition is naturally a directed acyclic graph. Each node represents an intermediate quantity, and each edge represents a functional dependency. For our feedforward network the dependency chain for any single training example reads $$ x \to z^1 \to a^1 \to z^2 \to a^2 \to \cdots \to z^L \to a^L \to C . $$ Every arrow is a differentiable map. The arrow $a^{l-1} \to z^l$ is the affine map parameterized by $W^l$ and $b^l$. The arrow $z^l \to a^l$ is the elementwise nonlinearity. The final arrow $a^L \to C$ is the loss. The graph view is more than decoration. It tells us precisely which quantities influence which others, and therefore which partial derivatives are nonzero. Because the graph is feedforward, $z^l$ depends on the parameters only through layers $1$ through $l$, and $C$ depends on $z^l$ only through the downstream chain $z^l \to a^l \to z^{l+1} \to \cdots \to C$. This locality is what backpropagation exploits. The key insight is that the loss is a function of the weighted inputs at the last layer, which are functions of the weighted inputs at the second to last layer, and so on. To differentiate $C$ with respect to a parameter buried deep in the network, we must trace every path from that parameter to the output. The chain rule on a graph automates this tracing. ## 3. The Chain Rule on the Graph The multivariable chain rule is the engine of backpropagation. Suppose a scalar $C$ depends on variables $u_1, \dots, u_m$, and each $u_i$ depends on a variable $t$. Then $$ \frac{\partial C}{\partial t} = \sum_{i=1}^{m} \frac{\partial C}{\partial u_i}\, \frac{\partial u_i}{\partial t}. $$ The sum runs over all intermediate variables through which $t$ influences $C$. On the computational graph this is the statement that the total derivative along a node is the sum of contributions over all outgoing edges. Apply this to a weighted input. The scalar $z^l_j$ feeds into every weighted input of the next layer, since $z^{l+1}_i = \sum_j W^{l+1}_{ij} \sigma(z^l_j) + b^{l+1}_i$. Therefore the loss depends on $z^l_j$ only through the vector $z^{l+1}$, and the chain rule gives $$ \frac{\partial C}{\partial z^l_j} = \sum_{i=1}^{n_{l+1}} \frac{\partial C}{\partial z^{l+1}_i}\, \frac{\partial z^{l+1}_i}{\partial z^l_j}. $$ This single equation is the heart of the backward pass. It expresses the sensitivity of the loss to a weighted input in layer $l$ in terms of the sensitivities at layer $l+1$. The recursion runs from the output back toward the input, which is exactly why the algorithm is called backpropagation. Notice that the only derivatives we ever need are local ones, $\partial z^{l+1}_i / \partial z^l_j$, plus the already computed downstream sensitivities. Nothing requires us to expand the full composite function symbolically. ## 4. The Delta Terms The quantity that appears on both sides of the recursion above deserves its own name. Define the error of unit $j$ in layer $l$ as the partial derivative of the cost with respect to the weighted input of that unit: $$ \delta^l_j \;\equiv\; \frac{\partial C}{\partial z^l_j}. $$ We collect these into a vector $\delta^l \in \mathbb{R}^{n_l}$. The choice to differentiate with respect to $z^l_j$ rather than the activation $a^l_j$ is deliberate and it is what makes the algebra clean. Because the bias $b^l_j$ enters the network only through $z^l_j$ with a coefficient of one, the error $\delta^l_j$ turns out to equal the bias gradient directly, as we will see. It also separates the contribution of the nonlinearity from the contribution of the linear weights in a tidy way. The name "error" comes from an interpretation. Imagine perturbing the weighted input by a small amount $\Delta z^l_j$. The loss changes by approximately $\delta^l_j\, \Delta z^l_j$. If $\delta^l_j$ is large in magnitude, the unit is far from a configuration that locally minimizes the loss, and adjusting it can yield a large improvement. If $\delta^l_j$ is near zero, the unit is at a local plateau and contributes little to the gradient. The delta terms are the carriers of the error signal as it flows backward through the network. With this definition, the entire backpropagation algorithm reduces to four equations: one that initializes the delta at the output layer, one that propagates delta from layer $l+1$ to layer $l$, and two that read off the parameter gradients from the deltas. We now derive each. ## 5. The Four Backpropagation Equations ### 5.1 BP1: Error at the Output Layer We begin at the end of the graph, where $C$ is a direct function of $a^L$, and $a^L = \sigma(z^L)$. By the chain rule, the loss depends on $z^L_j$ only through $a^L_j$ (the nonlinearity is elementwise, so $z^L_j$ affects only $a^L_j$): $$ \delta^L_j = \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\, \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j). $$ The first factor $\partial C / \partial a^L_j$ measures how fast the loss changes with the output activation, and depends only on the choice of loss function. The second factor $\sigma'(z^L_j)$ measures how fast the activation changes with the weighted input, and depends only on the choice of nonlinearity. In vector form, writing $\nabla_a C$ for the vector of output activation derivatives and $\odot$ for the elementwise Hadamard product, $$ \boxed{\;\delta^L = \nabla_a C \odot \sigma'(z^L).\;} \tag{BP1} $$ For the squared error $C = \tfrac{1}{2}\lVert a^L - y \rVert^2$ we have $\nabla_a C = a^L - y$, so $\delta^L = (a^L - y) \odot \sigma'(z^L)$. A worthwhile special case arises with softmax output combined with cross entropy loss, where the $\sigma'$ factor and the loss derivative combine so that $\delta^L = a^L - y$ exactly, with no leftover activation derivative. This cancellation is one reason that pairing is ubiquitous in classification. ### 5.2 BP2: Backpropagating the Error Next we propagate the error from layer $l+1$ to layer $l$. We start from the chain rule expression derived in Section 3: $$ \delta^l_j = \sum_{i=1}^{n_{l+1}} \frac{\partial C}{\partial z^{l+1}_i}\, \frac{\partial z^{l+1}_i}{\partial z^l_j} = \sum_{i=1}^{n_{l+1}} \delta^{l+1}_i\, \frac{\partial z^{l+1}_i}{\partial z^l_j}. $$ It remains to compute the local derivative $\partial z^{l+1}_i / \partial z^l_j$. Recall that $$ z^{l+1}_i = \sum_{k=1}^{n_l} W^{l+1}_{ik}\, a^l_k + b^{l+1}_i = \sum_{k=1}^{n_l} W^{l+1}_{ik}\, \sigma(z^l_k) + b^{l+1}_i. $$ Differentiating with respect to $z^l_j$, only the term $k = j$ survives, since the activations $\sigma(z^l_k)$ for $k \neq j$ do not depend on $z^l_j$: $$ \frac{\partial z^{l+1}_i}{\partial z^l_j} = W^{l+1}_{ij}\, \sigma'(z^l_j). $$ Substituting back, $$ \delta^l_j = \sum_{i=1}^{n_{l+1}} \delta^{l+1}_i\, W^{l+1}_{ij}\, \sigma'(z^l_j) = \sigma'(z^l_j) \sum_{i=1}^{n_{l+1}} W^{l+1}_{ij}\, \delta^{l+1}_i. $$ The inner sum $\sum_i W^{l+1}_{ij} \delta^{l+1}_i$ is precisely the $j$ th component of the matrix vector product $(W^{l+1})^\top \delta^{l+1}$, because transposing swaps the row and column indices. In vector form, $$ \boxed{\;\delta^l = \big( (W^{l+1})^\top \delta^{l+1} \big) \odot \sigma'(z^l).\;} \tag{BP2} $$ This is the workhorse recursion. Read it as a two stage operation. First, $(W^{l+1})^\top \delta^{l+1}$ transports the error backward across the linear layer, using the transpose of the same weights that carried activations forward. Then the Hadamard product with $\sigma'(z^l)$ passes that transported error through the local slope of the nonlinearity. The forward pass moves signal through $W^{l+1}$ and then $\sigma$; the backward pass moves error through $\sigma'$ and then $(W^{l+1})^\top$. The transpose is the discrete analogue of an adjoint, and its appearance is not a coincidence but a structural feature of differentiating a linear map. ### 5.3 BP3 and BP4: Gradients with Respect to Parameters We now extract the gradients we actually want. Consider the bias $b^l_j$ first. It enters the network only through $z^l_j$, with $\partial z^l_j / \partial b^l_j = 1$. By the chain rule, $$ \frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\, \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j \cdot 1 = \delta^l_j. $$ So the bias gradient is simply the delta, which retroactively justifies our decision to define the error in terms of $z^l$ rather than $a^l$. In vector form, $$ \boxed{\;\frac{\partial C}{\partial b^l} = \delta^l.\;} \tag{BP3} $$ Now the weight $W^l_{jk}$. It enters through $z^l_j$ alone, multiplying the upstream activation $a^{l-1}_k$, so $\partial z^l_j / \partial W^l_{jk} = a^{l-1}_k$. The chain rule yields $$ \frac{\partial C}{\partial W^l_{jk}} = \frac{\partial C}{\partial z^l_j}\, \frac{\partial z^l_j}{\partial W^l_{jk}} = \delta^l_j\, a^{l-1}_k. $$ Each weight gradient is the product of the error at the destination unit and the activation at the source unit. Collecting all $j, k$ pairs into a matrix, this is the outer product $$ \boxed{\;\frac{\partial C}{\partial W^l} = \delta^l \,(a^{l-1})^\top.\;} \tag{BP4} $$ Equations BP1 through BP4 constitute the complete algorithm. BP1 seeds the recursion at the output. BP2 carries it backward layer by layer. BP3 and BP4 convert the per layer deltas into parameter gradients. Nothing else is needed. A reading of BP4 worth internalizing: a weight gradient is small when either the input activation is small or the output error is small. A neuron with $a^{l-1}_k \approx 0$ learns slowly on its outgoing weights regardless of error, and a unit whose $\delta^l_j \approx 0$ learns slowly on its incoming weights regardless of input. These observations explain phenomena such as slow learning under saturated sigmoids, where $\sigma'(z) \approx 0$ drives the deltas toward zero and stalls the gradient. ## 6. Complete Derivation for a Multilayer Network We now assemble the full algorithm for a network of arbitrary depth $L$, combining the equations above into a forward pass followed by a backward pass. We then verify the logic by unrolling the recursion explicitly. ### 6.1 The Algorithm Given a single training example $(x, y)$, the procedure is as follows. ```text Forward pass: a[0] = x for l = 1, ..., L: z[l] = W[l] @ a[l-1] + b[l] a[l] = sigma(z[l]) Backward pass: delta[L] = grad_a C * sigma_prime(z[L]) # BP1 for l = L-1, ..., 1: delta[l] = (W[l+1].T @ delta[l+1]) * sigma_prime(z[l]) # BP2 Gradients: for l = 1, ..., L: dC_db[l] = delta[l] # BP3 dC_dW[l] = delta[l] @ a[l-1].T # BP4 ``` The forward pass caches every $z^l$ and $a^l$, because the backward pass needs them: BP2 requires $\sigma'(z^l)$ and BP4 requires $a^{l-1}$. This caching is the memory cost of backpropagation, and it is why training a deep network consumes far more memory than mere inference. ### 6.2 Unrolling the Recursion To convince ourselves that BP2 truly computes $\partial C / \partial z^l_j$ for an arbitrary layer, we unroll it. Consider a three layer network and ask for $\delta^1$. Applying BP2 once, $$ \delta^2 = \big( (W^3)^\top \delta^3 \big) \odot \sigma'(z^2), $$ and applying it again, $$ \delta^1 = \big( (W^2)^\top \delta^2 \big) \odot \sigma'(z^1). $$ Substituting and writing $S^l = \operatorname{diag}(\sigma'(z^l))$ for the diagonal matrix of activation slopes, the Hadamard products become matrix multiplications and the deltas chain into a clean product: $$ \delta^1 = S^1 (W^2)^\top S^2 (W^3)^\top \delta^3 . $$ With $\delta^3 = S^3 \nabla_a C$ from BP1, the full sensitivity of the loss to the first layer weighted inputs is $$ \delta^1 = S^1 (W^2)^\top S^2 (W^3)^\top S^3\, \nabla_a C . $$ This product form is illuminating. The gradient that reaches the early layers is a chain of weight transposes interleaved with diagonal slope matrices. If the spectral norms of these factors are consistently less than one, the product shrinks geometrically with depth and the gradient vanishes; if they consistently exceed one, the product grows and the gradient explodes. The vanishing and exploding gradient problems that motivate careful initialization, normalization layers, and residual connections are visible directly in this unrolled expression. The same product, read forward, is exactly the Jacobian of the loss with respect to the deep weighted inputs, confirming that backpropagation computes a genuine derivative and not an approximation. ### 6.3 Correctness and the Total Derivative Why does the layer by layer recursion give the correct total derivative, even though a parameter in an early layer influences the loss through an enormous number of distinct paths through the network? The answer is that BP2 implicitly sums over all those paths. Each application of the chain rule at a node sums over its immediate successors, and composing these single step sums across layers produces a sum over every directed path from the parameter to the output. This is the path sum interpretation of the chain rule: the total derivative equals the sum, over all paths from input node to output node, of the product of local derivatives along each path. Backpropagation never enumerates paths explicitly. By storing one delta per node and reusing it for all predecessors, it computes the same path sum in time linear in the number of edges. A naive forward mode that recomputed the influence of each parameter separately would cost time proportional to the number of parameters times the number of outputs, which is wasteful when there are many parameters and few outputs, exactly the regime of deep learning. ### 6.4 Batching and the Loss over a Dataset The derivation above used a single example. In practice we minimize an average loss over a mini batch of $m$ examples, $C = \tfrac{1}{m} \sum_{i=1}^{m} C_i$. Because differentiation is linear, the gradient of the average is the average of the gradients: $$ \frac{\partial C}{\partial W^l} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial C_i}{\partial W^l}. $$ Implementations realize this by stacking the $m$ examples as columns of an activation matrix $A^{l-1} \in \mathbb{R}^{n_{l-1} \times m}$ and likewise stacking the deltas into $\Delta^l \in \mathbb{R}^{n_l \times m}$. Then BP4 becomes a single matrix product that sums the per example outer products automatically: $$ \frac{\partial C}{\partial W^l} = \frac{1}{m}\, \Delta^l (A^{l-1})^\top, $$ while BP3 becomes a row sum, $\partial C / \partial b^l = \tfrac{1}{m} \Delta^l \mathbf{1}$, with $\mathbf{1}$ the all ones vector. The bias gradient sums the delta over the batch dimension. These vectorized forms are what GPU based frameworks execute, and they are algebraically identical to averaging the four equations over the batch. ### 6.5 Gradient Checking Since the derivation has many places to slip a sign or a transpose, it is standard practice to verify an implementation against a numerical estimate. For any scalar parameter $\theta$ the central difference approximation gives $$ \frac{\partial C}{\partial \theta} \approx \frac{C(\theta + \epsilon) - C(\theta - \epsilon)}{2\epsilon}, $$ with error of order $\epsilon^2$. Choosing $\epsilon$ around $10^{-7}$ and comparing the analytic gradient from BP1 through BP4 against this estimate, one expects agreement to several significant digits. A relative discrepancy larger than roughly $10^{-5}$ signals a bug, most often a missing transpose in BP2 or a mismatched activation derivative in BP1. Gradient checking is slow, costing two forward passes per parameter, so it is used only to validate code and never inside the training loop. ## 7. Reference Implementation The companion `aiinaction` library ships the four equations as a small, validated multilayer perceptron in Python, Julia, and Rust. The network uses sigmoid hidden layers and a *linear* output with the one-half squared-error loss, so BP1 reduces to the bare residual $\delta^L = a^L - y$ and the activation-derivative factor is isolated entirely in the hidden layers. All three implementations agree to floating-point tolerance on the shared fixture below, and each ships a central-difference `numerical_gradient` so the analytic gradients from BP1 through BP4 can be checked exactly as Section 6.5 prescribes. We run one fixed example through a two-layer network ($3 \to 2 \to 2$) and read off the loss, the analytic gradients, and the gradient-checking residual. ::: {.panel-tabset} ## Python ```{python} from aiinaction.ch187_backprop import ( make_mlp, forward, squared_error_loss, backprop, numerical_gradient, ) import numpy as np # A 3 -> 2 -> 2 network: sigmoid hidden layer, linear output. net = make_mlp( weights=[ [[0.10, 0.20, -0.30], [0.40, -0.50, 0.60]], # W^1 (2 x 3) [[0.70, -0.80], [-0.10, 0.30]], # W^2 (2 x 2) ], biases=[[0.10, -0.20], [0.05, -0.05]], ) x = [1.0, -2.0, 0.5] y = [0.3, -0.7] zs, acts = forward(net, x) print("output a^L :", np.round(acts[-1], 6).tolist()) print("loss C :", round(squared_error_loss(net, x, y), 6)) grad_W, grad_b = backprop(net, x, y) print("dC/dW^1 :", np.round(grad_W[0], 6).tolist()) print("dC/db^1 :", np.round(grad_b[0], 6).tolist()) # Gradient check: analytic vs central difference (Section 6.5). nW, nb = numerical_gradient(net, x, y, eps=1e-6) max_err = max( float(np.max(np.abs(a - b))) for a, b in zip(grad_W + grad_b, nW + nb) ) print("max |analytic - numerical| :", f"{max_err:.2e}") ``` ## Julia ```julia using AIInAction.Ch187Backprop # A 3 -> 2 -> 2 network: sigmoid hidden layer, linear output. net = make_mlp( [[0.10 0.20 -0.30; 0.40 -0.50 0.60], # W^1 (2 x 3) [0.70 -0.80; -0.10 0.30]], # W^2 (2 x 2) [[0.10, -0.20], [0.05, -0.05]], ) x = [1.0, -2.0, 0.5] y = [0.3, -0.7] zs, acts = forward(net, x) println("output a^L : ", acts[end]) println("loss C : ", squared_error_loss(net, x, y)) grad_W, grad_b = backprop(net, x, y) println("dC/dW^1 : ", grad_W[1]) println("dC/db^1 : ", grad_b[1]) nW, nb = numerical_gradient(net, x, y) max_err = maximum(maximum(abs.(a .- b)) for (a, b) in zip(vcat(grad_W, grad_b), vcat(nW, nb))) println("max |analytic - numerical| : ", max_err) ``` ## Rust ```rust use aiinaction::ch187_backprop::*; fn main() { // A 3 -> 2 -> 2 network: sigmoid hidden layer, linear output. let net = Mlp::new( vec![ vec![vec![0.10, 0.20, -0.30], vec![0.40, -0.50, 0.60]], // W^1 (2 x 3) vec![vec![0.70, -0.80], vec![-0.10, 0.30]], // W^2 (2 x 2) ], vec![vec![0.10, -0.20], vec![0.05, -0.05]], ) .unwrap(); let x = [1.0, -2.0, 0.5]; let y = [0.3, -0.7]; let (_zs, acts) = forward(&net, &x).unwrap(); println!("output a^L : {:?}", acts.last().unwrap()); println!("loss C : {}", squared_error_loss(&net, &x, &y).unwrap()); let (grad_w, grad_b) = backprop(&net, &x, &y).unwrap(); println!("dC/dW^1 : {:?}", grad_w[0]); println!("dC/db^1 : {:?}", grad_b[0]); let (nw, nb) = numerical_gradient(&net, &x, &y, 1e-6).unwrap(); let mut max_err = 0.0_f64; for (a, b) in grad_w.iter().zip(nw.iter()) { for (ar, br) in a.iter().zip(b.iter()) { for (av, bv) in ar.iter().zip(br.iter()) { max_err = max_err.max((av - bv).abs()); } } } for (a, b) in grad_b.iter().zip(nb.iter()) { for (av, bv) in a.iter().zip(b.iter()) { max_err = max_err.max((av - bv).abs()); } } println!("max |analytic - numerical| : {:.2e}", max_err); } ``` ::: ## 8. Summary Backpropagation is the chain rule applied systematically to the computational graph of a neural network, with intermediate sensitivities cached as delta terms so that no derivative is computed twice. The four equations capture the entire method. BP1, $\delta^L = \nabla_a C \odot \sigma'(z^L)$, seeds the error at the output. BP2, $\delta^l = ((W^{l+1})^\top \delta^{l+1}) \odot \sigma'(z^l)$, propagates it backward through weight transposes and activation slopes. BP3, $\partial C / \partial b^l = \delta^l$, and BP4, $\partial C / \partial W^l = \delta^l (a^{l-1})^\top$, read off the parameter gradients as a delta and as a delta times input outer product respectively. Unrolling the recursion exposes the gradient as a product of weight transposes and slope matrices, which makes the vanishing and exploding gradient problems transparent and motivates much of modern architecture design. The algorithm runs in time linear in the number of edges of the graph, which is what makes the training of deep networks with millions or billions of parameters tractable at all. ## References 1. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. "Learning representations by back propagating errors." Nature, 323, 533 to 536, 1986. https://www.nature.com/articles/323533a0 2. Nielsen, M. "Neural Networks and Deep Learning," Chapter 2: How the backpropagation algorithm works. 2015. http://neuralnetworksanddeeplearning.com/chap2.html 3. Goodfellow, I., Bengio, Y., and Courville, A. "Deep Learning," Chapter 6: Deep Feedforward Networks. MIT Press, 2016. https://www.deeplearningbook.org/contents/mlp.html 4. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. "Automatic Differentiation in Machine Learning: a Survey." Journal of Machine Learning Research, 18, 1 to 43, 2018. https://jmlr.org/papers/v18/17-468.html 5. LeCun, Y., Bottou, L., Orr, G. B., and Muller, K. R. "Efficient BackProp." In Neural Networks: Tricks of the Trade, Springer, 1998. https://link.springer.com/chapter/10.1007/3-540-49430-8_2 6. Werbos, P. J. "Backpropagation through time: what it does and how to do it." Proceedings of the IEEE, 78(10), 1550 to 1560, 1990. https://ieeexplore.ieee.org/document/58337