191 Forward Propagation
Forward propagation is the computational engine that turns an input vector into a prediction by passing it through the successive layers of a neural network. Every quantity used in training, including the loss and its gradients, is downstream of this single forward sweep. Mastery of forward propagation is therefore the prerequisite for understanding backpropagation, optimization, and the practical performance characteristics of deep learning systems. This chapter develops the layer-by-layer computation from first principles, derives its vectorized and batched forms, isolates the distinct roles played by weights and biases, and grounds the abstractions in a fully worked numerical example.
191.1 1. The Computational Setting
A feedforward neural network defines a function \(f: \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}\) as a composition of \(L\) parametric layers. Given an input \(\mathbf{x} \in \mathbb{R}^{d_0}\), the network produces an output \(\hat{\mathbf{y}} = f(\mathbf{x})\) through a strictly ordered sequence of transformations. Each layer applies an affine map followed by a nonlinear activation. The “forward” qualifier signals direction: information flows from inputs toward outputs, and each layer consumes only quantities that earlier layers have already produced.
We index layers by \(\ell \in \{1, \dots, L\}\). Layer \(\ell\) has \(d_\ell\) units, so the network is fully specified by the widths \(d_0, d_1, \dots, d_L\) and by the parameters attached to each layer. The choice of \(d_0\) is fixed by the data dimensionality and \(d_L\) by the task, while the hidden widths \(d_1, \dots, d_{L-1}\) are design decisions.
191.1.1 1.1 Notation
For each layer \(\ell\) we collect the following objects.
- \(\mathbf{W}^{[\ell]} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\), the weight matrix.
- \(\mathbf{b}^{[\ell]} \in \mathbb{R}^{d_\ell}\), the bias vector.
- \(\mathbf{z}^{[\ell]} \in \mathbb{R}^{d_\ell}\), the pre-activation (also called the linear combination; at the output layer of a classifier it is usually called the logit vector).
- \(\mathbf{a}^{[\ell]} \in \mathbb{R}^{d_\ell}\), the post-activation output of the layer.
- \(g^{[\ell]}: \mathbb{R} \to \mathbb{R}\), the activation function applied elementwise.
By convention the input is treated as the zeroth activation, \(\mathbf{a}^{[0]} = \mathbf{x}\), and the network output is the final activation, \(\hat{\mathbf{y}} = \mathbf{a}^{[L]}\).
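The shape conventions above can be checked mechanically. The following is a minimal sketch, assuming NumPy; the helper name `init_params`, the small random initialization, and the widths `[4, 3, 2]` are illustrative choices, not something the text prescribes.

```python
import numpy as np

def init_params(widths, seed=0):
    """Allocate W[l] with shape (d_l, d_{l-1}) and b[l] with shape (d_l,) for l = 1..L.

    `widths` is the list [d_0, d_1, ..., d_L]. The initialization scheme
    (small Gaussian weights, zero biases) is an arbitrary placeholder.
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(widths)):
        params[f"W{l}"] = 0.1 * rng.standard_normal((widths[l], widths[l - 1]))
        params[f"b{l}"] = np.zeros(widths[l])
    return params

widths = [4, 3, 2]  # d_0, d_1, d_2 (illustrative values)
params = init_params(widths)
shapes = {name: p.shape for name, p in params.items()}
print(shapes)
```

Each weight matrix has as many rows as the layer it feeds and as many columns as the layer it reads from, which is what makes the product \(\mathbf{W}^{[\ell]} \mathbf{a}^{[\ell-1]}\) well defined.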
191.2 2. The Layer-by-Layer Computation
The defining recurrence of forward propagation is the pair of equations that map one layer’s output to the next. For \(\ell = 1, 2, \dots, L\),
\[ \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]}, \]
\[ \mathbf{a}^{[\ell]} = g^{[\ell]}\!\left(\mathbf{z}^{[\ell]}\right). \]
The first equation is affine in the previous activation (linear up to the additive bias) and produces the pre-activation \(\mathbf{z}^{[\ell]}\). The second equation introduces nonlinearity by applying \(g^{[\ell]}\) componentwise. Writing both equations componentwise makes the elementwise nature explicit: for unit \(i\) in layer \(\ell\),
\[ z_i^{[\ell]} = \sum_{j=1}^{d_{\ell-1}} W_{ij}^{[\ell]} \, a_j^{[\ell-1]} + b_i^{[\ell]}, \qquad a_i^{[\ell]} = g^{[\ell]}\!\left(z_i^{[\ell]}\right). \]
The scalar \(W_{ij}^{[\ell]}\) is the strength of the connection from unit \(j\) of layer \(\ell-1\) to unit \(i\) of layer \(\ell\). The bias \(b_i^{[\ell]}\) shifts the pre-activation of unit \(i\) independently of any input.
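The componentwise formula can be transcribed almost literally. This is a sketch in plain Python with made-up numbers; the function names `unit_preactivation` and `relu` are hypothetical, and ReLU is chosen only as an example activation.

```python
def unit_preactivation(W, a_prev, b, i):
    """Return z_i = sum_j W[i][j] * a_prev[j] + b[i] for a single unit i."""
    return sum(W[i][j] * a_prev[j] for j in range(len(a_prev))) + b[i]

def relu(z):
    """Rectified linear unit applied to one scalar."""
    return max(0.0, z)

# Illustrative layer with two inputs and two units (values are arbitrary).
W = [[1.0, -2.0],
     [0.5, 0.5]]
a_prev = [3.0, 1.0]
b = [0.5, -3.0]

z = [unit_preactivation(W, a_prev, b, i) for i in range(len(W))]
a = [relu(z_i) for z_i in z]
print(z, a)
```

Each unit reads the whole previous activation but writes only its own entry of \(\mathbf{z}^{[\ell]}\), so the units of a layer can be evaluated in any order or in parallel.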
191.2.1 2.1 Why Nonlinearity Is Essential
If every activation were the identity, \(g^{[\ell]}(z) = z\), the entire network would collapse into a single affine map. Composing affine maps yields an affine map: \(\mathbf{W}^{[2]}(\mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}) + \mathbf{b}^{[2]} = \mathbf{W}'\mathbf{x} + \mathbf{b}'\) with \(\mathbf{W}' = \mathbf{W}^{[2]}\mathbf{W}^{[1]}\) and \(\mathbf{b}' = \mathbf{W}^{[2]}\mathbf{b}^{[1]} + \mathbf{b}^{[2]}\). Depth would buy nothing in representational power. The nonlinear activations \(g^{[\ell]}\) are precisely what allow a deep network to represent functions that no single affine layer can. Common choices include the logistic sigmoid \(\sigma(z) = 1 / (1 + e^{-z})\), the hyperbolic tangent \(\tanh(z)\), and the rectified linear unit \(\mathrm{ReLU}(z) = \max(0, z)\).
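The collapse of a purely linear stack can be verified numerically. The sketch below assumes NumPy and uses random parameters with arbitrary shapes; it compares a two-layer identity-activation network against the single affine map obtained by multiplying the weights together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with identity activations (shapes chosen arbitrarily).
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)
x = rng.standard_normal(2)

# Layer-by-layer evaluation without any nonlinearity.
two_layer = W2 @ (W1 @ x + b1) + b2

# Equivalent single affine map: W' = W2 W1, b' = W2 b1 + b2.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layer, one_layer))
```

Inserting any nonlinear activation between the two layers breaks this equivalence, which is the point of the argument.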
191.2.2 2.2 The Forward Sweep as Pseudocode
The recurrence translates directly into an iterative procedure. The following non-executable sketch captures the control flow.
    a = x                      # a^[0]
    for l in 1..L:
        z = W[l] @ a + b[l]    # pre-activation
        a = g[l](z)            # post-activation
    y_hat = a                  # a^[L]
Each iteration depends only on the activation produced by the previous iteration, which is why the loop is inherently sequential across layers even though the work inside a layer is highly parallel.
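A runnable counterpart of the sketch above might look as follows. It assumes NumPy, stores the per-layer parameters in Python lists, and uses a hypothetical function name `forward`; the small network at the bottom is illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activations):
    """Propagate one input vector through all layers and return a^[L]."""
    a = x  # a^[0]
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b  # pre-activation of the current layer
        a = g(z)       # post-activation, fed to the next layer
    return a

# Illustrative 2-2-1 network with hand-picked parameters.
weights = [np.array([[1.0, -1.0], [0.0, 2.0]]), np.array([[1.0, 1.0]])]
biases = [np.array([0.0, -1.0]), np.array([0.0])]
y_hat = forward(np.array([1.0, 2.0]), weights, biases, [relu, sigmoid])
print(y_hat)
```

The loop variable `a` is overwritten on every iteration, which is sufficient for inference; a training implementation would typically keep every `z` and `a` for the backward pass.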
191.3 3. The Role of Weights and Biases
The weights and biases are the learnable parameters; the activations are computed quantities. It is worth separating their roles precisely, because they contribute to the model in different ways.
191.3.1 3.1 Weights as Learned Feature Detectors
Row \(i\) of \(\mathbf{W}^{[\ell]}\), written \(\mathbf{w}_i^{[\ell]}\), is a vector in the input space of layer \(\ell\). For an incoming activation \(\mathbf{a}^{[\ell-1]}\) of fixed norm, the pre-activation \(z_i^{[\ell]} = \langle \mathbf{w}_i^{[\ell]}, \mathbf{a}^{[\ell-1]} \rangle + b_i^{[\ell]}\) is largest when \(\mathbf{a}^{[\ell-1]}\) points in the same direction as \(\mathbf{w}_i^{[\ell]}\). Each unit therefore acts as a template matcher or feature detector, and the weight matrix as a whole projects the previous representation onto a new set of learned coordinates. The total parameter count of the affine map at layer \(\ell\) is \(d_\ell \times d_{\ell-1}\) weights plus \(d_\ell\) biases.
191.3.2 3.2 Biases as Threshold Shifts
The bias \(b_i^{[\ell]}\) controls the activation threshold of unit \(i\). With a ReLU activation, the unit fires only when \(\langle \mathbf{w}_i^{[\ell]}, \mathbf{a}^{[\ell-1]} \rangle > -b_i^{[\ell]}\), so the bias sets the offset of the decision boundary. Geometrically, the weight vector orients a hyperplane in the input space and the bias translates that hyperplane away from the origin. Without biases, every unit's hyperplane would be forced through the origin of its input space, a severe and usually unjustified restriction on the hypothesis class.
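The threshold interpretation is easy to see on a single ReLU unit. The sketch below is plain Python with invented numbers; `relu_unit` is a hypothetical helper, and the bias of \(-1.5\) is chosen so that the unit fires only when the inner product exceeds \(1.5\).

```python
def relu_unit(w, a_prev, b):
    """Output of one ReLU unit: max(0, <w, a_prev> + b)."""
    dot = sum(w_j * a_j for w_j, a_j in zip(w, a_prev))
    return max(0.0, dot + b)

w = [1.0, 1.0]
b = -1.5  # the unit is active only when <w, a> > 1.5

out_below = relu_unit(w, [1.0, 0.0], b)  # inner product 1.0, below threshold
out_above = relu_unit(w, [1.0, 1.0], b)  # inner product 2.0, above threshold
print(out_below, out_above)
```

Changing `b` moves the threshold without changing which direction in input space the unit responds to; that direction is set by `w` alone.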
191.3.3 3.3 The Bias Trick
A common notational simplification absorbs the bias into the weight matrix. Augment the activation with a constant entry, \(\tilde{\mathbf{a}}^{[\ell-1]} = [\mathbf{a}^{[\ell-1]}; 1]\), and append the bias as an extra column, \(\tilde{\mathbf{W}}^{[\ell]} = [\mathbf{W}^{[\ell]} \mid \mathbf{b}^{[\ell]}]\). Then \(\mathbf{z}^{[\ell]} = \tilde{\mathbf{W}}^{[\ell]} \tilde{\mathbf{a}}^{[\ell-1]}\), and the affine map becomes a single matrix product. This trick is convenient for analysis, though most production frameworks keep weights and biases separate so that the two can be regularized and initialized independently.
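The augmentation can be checked in a few lines. This sketch assumes NumPy and uses arbitrary illustrative values; the variable names `W_tilde` and `a_tilde` mirror the tilde notation above.

```python
import numpy as np

W = np.array([[0.5, -0.2],
              [0.1, 0.4]])
b = np.array([0.1, -0.3])
a_prev = np.array([1.0, 2.0])

# Append the bias as an extra column and a constant 1 to the activation.
W_tilde = np.hstack([W, b[:, None]])  # shape (2, 3)
a_tilde = np.append(a_prev, 1.0)      # shape (3,)

z_trick = W_tilde @ a_tilde   # single matrix product
z_affine = W @ a_prev + b     # ordinary affine map

print(np.allclose(z_trick, z_affine))
```

The extra column multiplies the constant entry, so it contributes exactly \(\mathbf{b}^{[\ell]}\) to the product.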
191.4 4. The Vectorized Form
Writing the per-unit sum \(z_i^{[\ell]} = \sum_j W_{ij}^{[\ell]} a_j^{[\ell-1]} + b_i^{[\ell]}\) as an explicit loop over \(i\) and \(j\) is both notationally heavy and computationally wasteful. The vectorized form expresses the entire layer as one matrix-vector product plus a vector addition,
\[ \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]} \in \mathbb{R}^{d_\ell}. \]
This is not merely cosmetic. Numerical libraries dispatch the matrix-vector product to highly optimized BLAS routines that exploit cache locality and SIMD instructions, so the vectorized form typically runs orders of magnitude faster than an interpreted double loop while computing the same numbers up to floating-point rounding. The dimensional bookkeeping is worth stating explicitly: \(\mathbf{W}^{[\ell]}\) is \(d_\ell \times d_{\ell-1}\), \(\mathbf{a}^{[\ell-1]}\) is \(d_{\ell-1} \times 1\), the product is \(d_\ell \times 1\), and the bias \(\mathbf{b}^{[\ell]}\) is \(d_\ell \times 1\), so the sum is well defined and yields a \(d_\ell \times 1\) pre-activation.
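The equivalence between the explicit double loop and the vectorized expression can be confirmed directly. The sketch assumes NumPy; the layer sizes and the helper name `preactivation_loop` are illustrative.

```python
import numpy as np

def preactivation_loop(W, a_prev, b):
    """Explicit double loop over units i and inputs j."""
    d_out, d_in = W.shape
    z = np.zeros(d_out)
    for i in range(d_out):
        for j in range(d_in):
            z[i] += W[i, j] * a_prev[j]
        z[i] += b[i]
    return z

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
a_prev = rng.standard_normal(32)
b = rng.standard_normal(64)

z_loop = preactivation_loop(W, a_prev, b)
z_vec = W @ a_prev + b  # one matrix-vector product plus a vector add

print(np.allclose(z_loop, z_vec))
```

The comparison uses `np.allclose` rather than exact equality because the two versions may accumulate the sum in a different order and therefore differ in the last few bits.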
191.5 5. The Batched Form
In practice we rarely propagate a single example at a time. Training proceeds over mini-batches, and inference is frequently performed on batches for throughput. Let \(\mathbf{X} \in \mathbb{R}^{d_0 \times m}\) be a batch of \(m\) examples stacked as columns, so column \(k\) is the \(k\)-th input. Define the batched activation \(\mathbf{A}^{[\ell]} \in \mathbb{R}^{d_\ell \times m}\) analogously. The forward recurrence generalizes cleanly,
\[ \mathbf{Z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{A}^{[\ell-1]} + \mathbf{b}^{[\ell]} \mathbf{1}_m^\top, \]
\[ \mathbf{A}^{[\ell]} = g^{[\ell]}\!\left(\mathbf{Z}^{[\ell]}\right), \]
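where \(\mathbf{1}_m \in \mathbb{R}^m\) is the all-ones vector and the outer product \(\mathbf{b}^{[\ell]} \mathbf{1}_m^\top \in \mathbb{R}^{d_\ell \times m}\) replicates the bias across all \(m\) columns. In array-programming frameworks this replication is performed implicitly by broadcasting, so the code simply adds a \(d_\ell \times 1\) column to a \(d_\ell \times m\) matrix. Under NumPy-style broadcasting rules the bias must be stored with that column shape in this layout, because a flat length-\(d_\ell\) array would be aligned with the last axis (the batch axis) instead.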
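The batched recurrence and its broadcast form can be compared side by side. The sketch assumes NumPy, the column-major batch layout used in this section, and a bias stored as a \(d_\ell \times 1\) column; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_prev, d, m = 3, 4, 5  # illustrative layer widths and batch size

W = rng.standard_normal((d, d_prev))
b = rng.standard_normal((d, 1))            # bias kept as a column
A_prev = rng.standard_normal((d_prev, m))  # one example per column

# Broadcasting replicates the bias column across all m examples.
Z_broadcast = W @ A_prev + b

# Explicit outer product with the all-ones vector, as in the equation above.
Z_explicit = W @ A_prev + b @ np.ones((1, m))

# Column k of the batched result equals the single-example computation.
columns_match = all(
    np.allclose(Z_broadcast[:, k], W @ A_prev[:, k] + b[:, 0]) for k in range(m)
)
print(np.allclose(Z_broadcast, Z_explicit), columns_match)
```

The explicit outer product materializes a full \(d_\ell \times m\) matrix of repeated biases, which broadcasting avoids.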
191.5.1 5.1 Why Batching Helps
The single matrix-matrix product \(\mathbf{W}^{[\ell]} \mathbf{A}^{[\ell-1]}\) of shape \(d_\ell \times m\) amortizes the cost of loading \(\mathbf{W}^{[\ell]}\) from memory across all \(m\) examples. On parallel hardware such as a GPU, a matrix-matrix multiply achieves far higher arithmetic intensity than \(m\) separate matrix-vector multiplies, because the weight matrix is reused many times once resident in fast memory. Batching thus improves hardware utilization without altering the mathematics: the result for column \(k\) is identical to propagating example \(k\) alone.
191.5.2 5.2 A Convention Caveat
Many deep learning frameworks adopt the transposed convention, storing the batch as \(\mathbf{X} \in \mathbb{R}^{m \times d_0}\) with examples in rows and computing \(\mathbf{Z}^{[\ell]} = \mathbf{A}^{[\ell-1]} \mathbf{W}^{[\ell]\top} + \mathbf{b}^{[\ell]\top}\), where \(\mathbf{A}^{[\ell-1]} \in \mathbb{R}^{m \times d_{\ell-1}}\) and the bias is broadcast across the \(m\) rows. The two conventions are transposes of one another and describe the same computation. The reader should always confirm which layout a given codebase uses before reasoning about shapes.
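That the two layouts describe the same computation can be checked numerically. The sketch assumes NumPy and random parameters; `A_cols` and `A_rows` are hypothetical names for the same batch stored in the two layouts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_prev, d, m = 3, 4, 5  # illustrative sizes

W = rng.standard_normal((d, d_prev))
b = rng.standard_normal(d)
A_cols = rng.standard_normal((d_prev, m))  # examples as columns
A_rows = A_cols.T                          # same examples as rows

# Column convention: bias reshaped to a column so it broadcasts over examples.
Z_cols = W @ A_cols + b[:, None]

# Row convention: a flat bias broadcasts over rows without reshaping.
Z_rows = A_rows @ W.T + b

print(np.allclose(Z_rows, Z_cols.T))
```

Note that the flat bias works unchanged in the row layout because broadcasting aligns it with the last axis, which there is the feature axis.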
191.6 6. A Worked Numerical Example
We now propagate a concrete input through a small network by hand. Consider a network with \(d_0 = 2\), one hidden layer of width \(d_1 = 2\) using ReLU, and an output layer of width \(d_2 = 1\) using the logistic sigmoid. The parameters are
\[ \mathbf{W}^{[1]} = \begin{bmatrix} 0.5 & -0.2 \\ 0.1 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}, \quad \mathbf{W}^{[2]} = \begin{bmatrix} 0.7 & -0.6 \end{bmatrix}, \quad \mathbf{b}^{[2]} = \begin{bmatrix} 0.2 \end{bmatrix}. \]
Let the input be \(\mathbf{x} = \mathbf{a}^{[0]} = [1.0,\ 2.0]^\top\).
191.6.1 6.1 First Layer
The pre-activation of the hidden layer is
\[ \mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]} = \begin{bmatrix} 0.5(1.0) + (-0.2)(2.0) \\ 0.1(1.0) + 0.4(2.0) \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}. \]
Computing the matrix product gives \([0.5 - 0.4,\ 0.1 + 0.8]^\top = [0.1,\ 0.9]^\top\), and adding the bias yields
\[ \mathbf{z}^{[1]} = \begin{bmatrix} 0.1 + 0.1 \\ 0.9 - 0.3 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.6 \end{bmatrix}. \]
Applying the ReLU activation elementwise, both entries are positive, so
\[ \mathbf{a}^{[1]} = \mathrm{ReLU}\!\left(\mathbf{z}^{[1]}\right) = \begin{bmatrix} 0.2 \\ 0.6 \end{bmatrix}. \]
191.6.2 6.2 Second Layer
The output pre-activation is a scalar,
\[ z^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + b^{[2]} = 0.7(0.2) + (-0.6)(0.6) + 0.2. \]
Evaluating term by term gives \(0.14 - 0.36 + 0.2 = -0.02\). Applying the logistic sigmoid,
\[ \hat{y} = \sigma(z^{[2]}) = \frac{1}{1 + e^{-(-0.02)}} = \frac{1}{1 + e^{0.02}}. \]
Since \(e^{0.02} \approx 1.0202\), we obtain
\[ \hat{y} \approx \frac{1}{2.0202} \approx 0.4950. \]
The network maps the input \([1.0, 2.0]^\top\) to a prediction of approximately \(0.4950\), which in a binary classification setting reads as just under even odds for the positive class.
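The hand computation can be reproduced in a few lines as a check. The sketch assumes NumPy and uses the parameters and input stated above; only the variable names are introduced here.

```python
import numpy as np

W1 = np.array([[0.5, -0.2],
               [0.1, 0.4]])
b1 = np.array([0.1, -0.3])
W2 = np.array([[0.7, -0.6]])
b2 = np.array([0.2])
x = np.array([1.0, 2.0])

z1 = W1 @ x + b1                   # hidden pre-activation
a1 = np.maximum(0.0, z1)           # ReLU
z2 = W2 @ a1 + b2                  # output pre-activation
y_hat = 1.0 / (1.0 + np.exp(-z2))  # logistic sigmoid

print(z1, a1, z2, y_hat)
```

The printed values should agree with the hand-derived \([0.2,\ 0.6]^\top\), \(-0.02\), and \(0.4950\) up to floating-point rounding.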
191.6.3 6.3 The Same Example, Batched
Suppose we also feed a second input \([0.0, 1.0]^\top\) alongside the first. Stacking both as columns gives
\[ \mathbf{A}^{[0]} = \begin{bmatrix} 1.0 & 0.0 \\ 2.0 & 1.0 \end{bmatrix}, \qquad \mathbf{Z}^{[1]} = \mathbf{W}^{[1]} \mathbf{A}^{[0]} + \mathbf{b}^{[1]} \mathbf{1}_2^\top. \]
The matrix product is
\[ \mathbf{W}^{[1]} \mathbf{A}^{[0]} = \begin{bmatrix} 0.1 & -0.2 \\ 0.9 & 0.4 \end{bmatrix}, \]
where the first column reproduces the single-example result and the second column is \([0.5(0) - 0.2(1),\ 0.1(0) + 0.4(1)]^\top = [-0.2,\ 0.4]^\top\). Broadcasting the bias across both columns yields
\[ \mathbf{Z}^{[1]} = \begin{bmatrix} 0.2 & -0.1 \\ 0.6 & 0.1 \end{bmatrix}. \]
After ReLU the negative entry \(-0.1\) is clamped to zero, giving \(\mathbf{A}^{[1]} = \begin{bmatrix} 0.2 & 0.0 \\ 0.6 & 0.1 \end{bmatrix}\). The first column is identical to the unbatched computation, confirming that batching changes throughput but not results.
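The batched version of the example can be checked the same way. The sketch assumes NumPy and stores the biases as columns so that they broadcast across the batch; it also carries the batch through the output layer, a step the text above does not work by hand.

```python
import numpy as np

W1 = np.array([[0.5, -0.2],
               [0.1, 0.4]])
b1 = np.array([[0.1], [-0.3]])  # column, broadcast across the batch
W2 = np.array([[0.7, -0.6]])
b2 = np.array([[0.2]])

A0 = np.array([[1.0, 0.0],
               [2.0, 1.0]])     # two examples stacked as columns

Z1 = W1 @ A0 + b1
A1 = np.maximum(0.0, Z1)        # ReLU clamps the single negative entry
Z2 = W2 @ A1 + b2
Y_hat = 1.0 / (1.0 + np.exp(-Z2))

print(Z1, A1, Z2, Y_hat, sep="\n")
```

The first column of every intermediate matrix matches the single-example computation, and the second column gives the corresponding quantities for the input \([0.0, 1.0]^\top\).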
191.7 7. Computational Cost
The dominant cost of forward propagation is the matrix multiplications. For a single example, layer \(\ell\) requires \(d_\ell \, d_{\ell-1}\) multiply-add operations for the affine map plus \(O(d_\ell)\) for the bias and activation. Summing over layers, one forward pass costs approximately \(\sum_{\ell=1}^{L} d_\ell \, d_{\ell-1}\) multiply-adds, which equals the total number of weights in the network. For a batch of \(m\) examples the cost scales by \(m\), becoming \(m \sum_\ell d_\ell d_{\ell-1}\). The memory footprint during training is also governed by the forward pass, since every activation \(\mathbf{a}^{[\ell]}\) must be retained for use in the subsequent backward pass. This coupling between forward propagation and backpropagation is the reason activation memory, rather than parameter count, often limits the batch size that fits on an accelerator.
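The cost formula depends only on the layer widths, so it can be evaluated without building a network. The sketch below is plain Python; `forward_macs` and `parameter_count` are hypothetical helper names, and the widths are those of the worked example.

```python
def forward_macs(widths, batch_size=1):
    """Multiply-adds in the affine maps of one forward pass: m * sum_l d_l * d_{l-1}."""
    per_example = sum(d_out * d_in for d_in, d_out in zip(widths[:-1], widths[1:]))
    return batch_size * per_example

def parameter_count(widths):
    """Total learnable parameters: all weights plus all biases."""
    n_weights = sum(d_out * d_in for d_in, d_out in zip(widths[:-1], widths[1:]))
    n_biases = sum(widths[1:])
    return n_weights + n_biases

widths = [2, 2, 1]  # d_0, d_1, d_2 from the worked example
macs_single = forward_macs(widths)
macs_batch = forward_macs(widths, batch_size=2)
n_params = parameter_count(widths)
print(macs_single, macs_batch, n_params)
```

The count ignores the \(O(d_\ell)\) bias additions and activation evaluations, which are negligible next to the matrix products for layers of realistic width.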
191.8 8. Summary
Forward propagation is the repeated application of an affine transform followed by a nonlinear activation, sweeping from input to output. The weights orient and scale each layer’s learned projection, the biases shift activation thresholds, and the nonlinearities give depth its representational value. The vectorized form expresses a layer as a single matrix-vector product, and the batched form generalizes this to a matrix-matrix product with broadcast biases, exposing the parallelism that modern hardware exploits. The worked example demonstrates that these forms agree exactly, differing only in how many examples flow through at once. Because the loss and all gradients are functions of the activations computed here, a precise understanding of forward propagation is the foundation on which training rests.