191 Forward Propagation

Forward propagation is the computational engine that turns an input vector into a prediction by passing it through the successive layers of a neural network. Every quantity used in training, including the loss and its gradients, is downstream of this single forward sweep. Mastery of forward propagation is therefore the prerequisite for understanding backpropagation, optimization, and the practical performance characteristics of deep learning systems. This chapter develops the layer-by-layer computation from first principles, derives its vectorized and batched forms, isolates the distinct roles played by weights and biases, and grounds the abstractions in a fully worked numerical example.

191.1 1. The Computational Setting

A feedforward neural network defines a function $f: \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$ as a composition of $L$ parametric layers. Given an input $\mathbf{x} \in \mathbb{R}^{d_0}$, the network produces an output $\hat{\mathbf{y}} = f(\mathbf{x})$ through a strictly ordered sequence of transformations. Each layer applies an affine map followed by a nonlinear activation. The “forward” qualifier signals direction: information flows from inputs toward outputs, with no layer consuming a quantity it has not yet produced.

We index layers by $\ell \in \{1, \dots, L\}$. Layer $\ell$ has $d_\ell$ units, so the network is fully specified by the widths $d_0, d_1, \dots, d_L$ and by the parameters attached to each layer. The choice of $d_0$ is fixed by the data dimensionality and $d_L$ by the task, while the hidden widths $d_1, \dots, d_{L-1}$ are design decisions.

191.1.1 1.1 Notation

For each layer $\ell$ we collect the following objects.

$\mathbf{W}^{[\ell]} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$, the weight matrix.
$\mathbf{b}^{[\ell]} \in \mathbb{R}^{d_\ell}$, the bias vector.
$\mathbf{z}^{[\ell]} \in \mathbb{R}^{d_\ell}$, the pre-activation (also called the linear combination; at the output layer of a classifier it is often called the logit vector).
$\mathbf{a}^{[\ell]} \in \mathbb{R}^{d_\ell}$, the post-activation output of the layer.
$g^{[\ell]}: \mathbb{R} \to \mathbb{R}$, the activation function applied elementwise.

By convention the input is treated as the zeroth activation, $\mathbf{a}^{[0]} = \mathbf{x}$, and the network output is the final activation, $\hat{\mathbf{y}} = \mathbf{a}^{[L]}$.

191.2 2. The Layer-by-Layer Computation

The defining recurrence of forward propagation is the pair of equations that map one layer’s output to the next. For $\ell = 1, 2, \dots, L$,

\[ \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]}, \]

\[ \mathbf{a}^{[\ell]} = g^{[\ell]}\!\left(\mathbf{z}^{[\ell]}\right). \]

The first equation is an affine function of the previous activation (linear up to the additive bias) and produces the pre-activation $\mathbf{z}^{[\ell]}$. The second equation introduces nonlinearity by applying $g^{[\ell]}$ componentwise. Writing the recurrence in components makes this explicit: for unit $i$ in layer $\ell$,

\[ z_i^{[\ell]} = \sum_{j=1}^{d_{\ell-1}} W_{ij}^{[\ell]} \, a_j^{[\ell-1]} + b_i^{[\ell]}, \qquad a_i^{[\ell]} = g^{[\ell]}\!\left(z_i^{[\ell]}\right). \]

The scalar $W_{ij}^{[\ell]}$ is the strength of the connection from unit $j$ of layer $\ell-1$ to unit $i$ of layer $\ell$. The bias $b_i^{[\ell]}$ shifts the pre-activation of unit $i$ independently of any input.

191.2.1 2.1 Why Nonlinearity Is Essential

If every activation were the identity, $g^{[\ell]}(z) = z$, the entire network would collapse into a single affine map. Composing affine maps yields an affine map: $\mathbf{W}^{[2]}(\mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}) + \mathbf{b}^{[2]} = \mathbf{W}'\mathbf{x} + \mathbf{b}'$ for some $\mathbf{W}'$ and $\mathbf{b}'$. Depth would buy nothing in representational power. The nonlinear activations $g^{[\ell]}$ are precisely what allow a deep network to represent functions that no single affine layer can. Common choices include the logistic sigmoid $\sigma(z) = 1 / (1 + e^{-z})$, the hyperbolic tangent $\tanh(z)$, and the rectified linear unit $\mathrm{ReLU}(z) = \max(0, z)$.
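The collapse can be checked numerically. The sketch below assumes NumPy and uses small hand-picked parameters chosen only for illustration. It composes two identity-activation layers, compares the result with the single affine map $\mathbf{W}' = \mathbf{W}^{[2]}\mathbf{W}^{[1]}$, $\mathbf{b}' = \mathbf{W}^{[2]}\mathbf{b}^{[1]} + \mathbf{b}^{[2]}$, and then shows that inserting a ReLU between the layers breaks the equality for this input.

```python
import numpy as np

# Illustrative 2-2-1 parameters (not taken from any dataset).
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 0.0])

# Identity activations: layer 2 applied directly to layer 1's pre-activation.
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single affine map W' x + b'.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

# With a ReLU in between, the hidden value -1 is clamped to 0,
# so the single affine map no longer reproduces the output.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(two_layer, one_layer, with_relu)
```

Here the two identity-activation layers and the collapsed map both return $0.5$, while the ReLU variant returns $1.5$.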

191.2.2 2.2 The Forward Sweep as Pseudocode

The recurrence translates directly into an iterative procedure. The following non-executable sketch captures the control flow.

a = x                      # a^[0]
for l in 1..L:
    z = W[l] @ a + b[l]    # pre-activation
    a = g[l](z)            # post-activation
y_hat = a                  # a^[L]

Each iteration depends only on the activation produced by the previous iteration, which is why the loop is inherently sequential across layers even though the work inside a layer is highly parallel.
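A runnable counterpart of the sketch is given below, assuming NumPy. The function name `forward_sweep` and the representation of a layer as a `(W, b, g)` tuple are illustrative choices made here, not the API of the reference implementation in Section 8, and the 3-2-1 network is a hypothetical example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_sweep(layers, x):
    """Propagate one input vector through a list of (W, b, g) layers."""
    a = np.asarray(x, dtype=float)   # a^[0]
    for W, b, g in layers:
        z = W @ a + b                # pre-activation z^[l]
        a = g(z)                     # post-activation a^[l]
    return a                         # a^[L]

# Hypothetical 3-2-1 network with hand-picked parameters.
layers = [
    (np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]), np.array([0.0, -1.0]), relu),
    (np.array([[2.0, -1.0]]), np.array([0.0]), sigmoid),
]
y_hat = forward_sweep(layers, [3.0, 1.0, 1.0])
print(y_hat)
```

For this input the hidden pre-activations are $[2.0,\ 1.5]^\top$, the output pre-activation is $2.5$, and the result is $\sigma(2.5)$.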

191.3 3. The Role of Weights and Biases

The weights and biases are the learnable parameters; the activations are computed quantities. It is worth separating their roles precisely, because they contribute to the model in different ways.

191.3.1 3.1 Weights as Learned Feature Detectors

Row $i$ of $\mathbf{W}^{[\ell]}$, written $\mathbf{w}_i^{[\ell]}$, is a vector in the input space of layer $\ell$. The pre-activation $z_i^{[\ell]} = \langle \mathbf{w}_i^{[\ell]}, \mathbf{a}^{[\ell-1]} \rangle + b_i^{[\ell]}$ is largest, for a fixed norm of the incoming activation, when $\mathbf{a}^{[\ell-1]}$ points in the same direction as $\mathbf{w}_i^{[\ell]}$. Each unit therefore acts as a template matcher or feature detector, and the weight matrix as a whole projects the previous representation onto a new set of learned coordinates. The total parameter count of the affine map at layer $\ell$ is $d_\ell \times d_{\ell-1}$ weights plus $d_\ell$ biases.

191.3.2 3.2 Biases as Threshold Shifts

The bias $b_i^{[\ell]}$ controls the activation threshold of unit $i$. With a ReLU activation, the unit fires only when $\langle \mathbf{w}_i^{[\ell]}, \mathbf{a}^{[\ell-1]} \rangle > -b_i^{[\ell]}$, so the bias sets the offset of the decision boundary. Geometrically, the weight orients a hyperplane in the input space and the bias translates that hyperplane away from the origin. Without biases, every layer’s decision surface would be forced through the origin, a severe and usually unjustified restriction on the hypothesis class.
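To see the threshold concretely, the following sketch evaluates a single ReLU unit for a fixed weight vector and input while varying the bias. The helper `relu_unit` and all numbers are illustrative; only the standard library is used.

```python
def relu_unit(w, a, b):
    """Return max(0, <w, a> + b) for one unit."""
    z = sum(wi * ai for wi, ai in zip(w, a)) + b
    return max(0.0, z)

w = [1.0, 2.0]
a = [0.5, 0.25]   # <w, a> = 1.0

# The unit is active only when <w, a> > -b, which here means b > -1.0.
outputs = {b: relu_unit(w, a, b) for b in (-2.0, -1.0, -0.5, 0.0)}
print(outputs)
```

The unit stays silent for $b = -2.0$ and at the boundary $b = -1.0$, and it produces $0.5$ and $1.0$ for $b = -0.5$ and $b = 0.0$.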

191.3.3 3.3 The Bias Trick

A common notational simplification absorbs the bias into the weight matrix. Augment the activation with a constant entry, $\tilde{\mathbf{a}}^{[\ell-1]} = [\mathbf{a}^{[\ell-1]}; 1]$, and append the bias as an extra column, $\tilde{\mathbf{W}}^{[\ell]} = [\mathbf{W}^{[\ell]} \mid \mathbf{b}^{[\ell]}]$. Then $\mathbf{z}^{[\ell]} = \tilde{\mathbf{W}}^{[\ell]} \tilde{\mathbf{a}}^{[\ell-1]}$, and the affine map becomes a single matrix product. This trick is convenient for analysis, though most production frameworks keep weights and biases separate so that the two can be regularized and initialized independently.
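The equivalence is easy to confirm numerically. This sketch assumes NumPy, and the parameter values are arbitrary. It builds $\tilde{\mathbf{W}}$ and $\tilde{\mathbf{a}}$ as described and compares the single product with the explicit affine map.

```python
import numpy as np

W = np.array([[0.5, -0.2], [0.1, 0.4]])
b = np.array([0.1, -0.3])
a = np.array([1.0, 2.0])

# Append the bias as an extra column, and a constant 1 to the activation.
W_tilde = np.hstack([W, b[:, None]])   # shape (2, 3)
a_tilde = np.append(a, 1.0)            # shape (3,)

z_affine = W @ a + b
z_trick = W_tilde @ a_tilde

print(W_tilde.shape, np.allclose(z_affine, z_trick))
```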

191.4 4. The Vectorized Form

Writing the per-unit sum $z_i^{[\ell]} = \sum_j W_{ij}^{[\ell]} a_j^{[\ell-1]} + b_i^{[\ell]}$ as an explicit loop over $i$ and $j$ is both notationally heavy and computationally wasteful. The vectorized form expresses the entire layer as one matrix-vector product plus a vector addition,

\[ \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]} \in \mathbb{R}^{d_\ell}. \]

This is not merely cosmetic. Numerical libraries dispatch the matrix-vector product to highly optimized BLAS routines that exploit cache locality and SIMD instructions, so the vectorized form can run orders of magnitude faster than an interpreted double loop while computing the same numbers up to floating-point rounding (the order of summation may differ). The dimensional bookkeeping is worth stating explicitly: $\mathbf{W}^{[\ell]}$ is $d_\ell \times d_{\ell-1}$, $\mathbf{a}^{[\ell-1]}$ is $d_{\ell-1} \times 1$, the product is $d_\ell \times 1$, and the bias $\mathbf{b}^{[\ell]}$ is $d_\ell \times 1$, so the sum is well defined and yields a $d_\ell \times 1$ pre-activation.
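The two formulations can be placed side by side. The sketch below assumes NumPy; the function names `affine_loop` and `affine_vectorized` are ad hoc, and the layer size and random values are arbitrary. It checks agreement only and makes no timing claim.

```python
import numpy as np

def affine_loop(W, a, b):
    """Per-unit sum written as an explicit double loop over i and j."""
    d_out, d_in = W.shape
    z = np.zeros(d_out)
    for i in range(d_out):
        total = b[i]
        for j in range(d_in):
            total += W[i, j] * a[j]
        z[i] = total
    return z

def affine_vectorized(W, a, b):
    """The same layer as one matrix-vector product plus a vector addition."""
    return W @ a + b

rng = np.random.default_rng(0)   # fixed seed; the values themselves are arbitrary
W = rng.normal(size=(64, 32))
a = rng.normal(size=32)
b = rng.normal(size=64)

z_loop = affine_loop(W, a, b)
z_vec = affine_vectorized(W, a, b)

# Agreement is up to floating-point rounding, hence allclose rather than ==.
print(np.allclose(z_loop, z_vec))
```

The speed difference can be measured with the standard `timeit` module. The ratio depends on the layer size, the hardware, and the BLAS build in use.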

191.5 5. The Batched Form

In practice we rarely propagate a single example at a time. Training proceeds over mini-batches, and inference is frequently performed on batches for throughput. Let $\mathbf{X} \in \mathbb{R}^{d_0 \times m}$ be a batch of $m$ examples stacked as columns, so column $k$ is the $k$-th input. Define the batched activation $\mathbf{A}^{[\ell]} \in \mathbb{R}^{d_\ell \times m}$ analogously. The forward recurrence generalizes cleanly,

\[ \mathbf{Z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{A}^{[\ell-1]} + \mathbf{b}^{[\ell]} \mathbf{1}_m^\top, \]

\[ \mathbf{A}^{[\ell]} = g^{[\ell]}\!\left(\mathbf{Z}^{[\ell]}\right), \]

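where $\mathbf{1}_m \in \mathbb{R}^m$ is the all-ones vector and the outer product $\mathbf{b}^{[\ell]} \mathbf{1}_m^\top \in \mathbb{R}^{d_\ell \times m}$ replicates the bias across all $m$ columns. In array-programming frameworks this replication is performed implicitly by broadcasting, so the code simply adds the bias, shaped as a $d_\ell \times 1$ column, to a $d_\ell \times m$ matrix. The column shape matters: under NumPy-style rules, which align trailing axes, a flat length-$d_\ell$ vector would be matched against the $m$ columns rather than the $d_\ell$ rows.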
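The batched recurrence can be written out in a few lines. This sketch assumes NumPy and uses arbitrary small matrices. It computes the pre-activation once with the explicit outer product from the equation and once with broadcasting, and confirms that each column equals the single-example result.

```python
import numpy as np

W = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])   # d_l = 3, d_{l-1} = 2
b = np.array([0.5, 0.0, -1.0])                        # length d_l
A_prev = np.array([[1.0, 0.0, 2.0, -1.0],
                   [0.0, 1.0, 1.0, 1.0]])             # d_{l-1} x m with m = 4
m = A_prev.shape[1]

# Explicit outer product b 1_m^T, exactly as in the equation.
Z_outer = W @ A_prev + np.outer(b, np.ones(m))

# Broadcasting: view b as a (d_l, 1) column so it is added to every column.
Z_bcast = W @ A_prev + b[:, None]

# Column k of the batch equals the single-example computation for example k.
column_ok = all(
    np.allclose(Z_bcast[:, k], W @ A_prev[:, k] + b) for k in range(m)
)

print(Z_bcast.shape, np.allclose(Z_outer, Z_bcast), column_ok)
```

In NumPy, adding the flat `b` of shape `(3,)` to the `(3, 4)` product directly raises a `ValueError`, which is why the reshape to a column is written explicitly.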

191.5.1 5.1 Why Batching Helps

The single matrix-matrix product $\mathbf{W}^{[\ell]} \mathbf{A}^{[\ell-1]}$ of shape $d_\ell \times m$ amortizes the cost of loading $\mathbf{W}^{[\ell]}$ from memory across all $m$ examples. On parallel hardware such as a GPU, a matrix-matrix multiply achieves far higher arithmetic intensity than $m$ separate matrix-vector multiplies, because the weight matrix is reused many times once resident in fast memory. Batching thus improves hardware utilization without altering the mathematics: the result for column $k$ is identical to propagating example $k$ alone.

191.5.2 5.2 A Convention Caveat

Many deep learning frameworks adopt the transposed convention, storing the batch as $\mathbf{X} \in \mathbb{R}^{m \times d_0}$ with examples in rows and computing $\mathbf{Z}^{[\ell]} = \mathbf{A}^{[\ell-1]} \mathbf{W}^{[\ell]\top} + \mathbf{1}_m \mathbf{b}^{[\ell]\top}$, where $\mathbf{A}^{[\ell-1]} \in \mathbb{R}^{m \times d_{\ell-1}}$ and the bias is broadcast as a row across all $m$ examples. The two conventions are transposes of one another and describe the same computation. The reader should always confirm which layout a given codebase uses before reasoning about shapes.
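A short check makes the relationship between the two layouts concrete. The sketch assumes NumPy and arbitrary parameter values. The same layer is evaluated with examples as columns and with examples as rows, and the results are compared after a transpose.

```python
import numpy as np

W = np.array([[0.5, -0.2], [0.1, 0.4], [1.0, 1.0]])   # (d_l, d_{l-1}) = (3, 2)
b = np.array([0.1, -0.3, 0.0])

A_cols = np.array([[1.0, 0.0], [2.0, 1.0]])   # examples as columns: (d_{l-1}, m)
A_rows = A_cols.T                             # examples as rows:    (m, d_{l-1})

Z_cols = W @ A_cols + b[:, None]   # column layout, shape (d_l, m)
Z_rows = A_rows @ W.T + b          # row layout, shape (m, d_l); b broadcasts over rows

print(Z_cols.shape, Z_rows.shape, np.allclose(Z_cols, Z_rows.T))
```

Note that the bias needs no reshape in the row layout, because its length matches the trailing axis of the product.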

191.6 6. A Worked Numerical Example

We now propagate a concrete input through a small network by hand. Consider a network with $d_0 = 2$, one hidden layer of width $d_1 = 2$ using ReLU, and an output layer of width $d_2 = 1$ using the logistic sigmoid. The parameters are

\[ \mathbf{W}^{[1]} = \begin{bmatrix} 0.5 & -0.2 \\ 0.1 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}, \quad \mathbf{W}^{[2]} = \begin{bmatrix} 0.7 & -0.6 \end{bmatrix}, \quad \mathbf{b}^{[2]} = \begin{bmatrix} 0.2 \end{bmatrix}. \]

Let the input be $\mathbf{x} = \mathbf{a}^{[0]} = [1.0,\ 2.0]^\top$.

191.6.1 6.1 First Layer

The pre-activation of the hidden layer is

\[ \mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]} = \begin{bmatrix} 0.5(1.0) + (-0.2)(2.0) \\ 0.1(1.0) + 0.4(2.0) \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}. \]

Computing the matrix product gives $[0.5 - 0.4,\ 0.1 + 0.8]^\top = [0.1,\ 0.9]^\top$, and adding the bias yields

\[ \mathbf{z}^{[1]} = \begin{bmatrix} 0.1 + 0.1 \\ 0.9 - 0.3 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.6 \end{bmatrix}. \]

Applying the ReLU activation elementwise, both entries are positive, so

\[ \mathbf{a}^{[1]} = \mathrm{ReLU}\!\left(\mathbf{z}^{[1]}\right) = \begin{bmatrix} 0.2 \\ 0.6 \end{bmatrix}. \]

191.6.2 6.2 Second Layer

The output pre-activation is a scalar,

\[ z^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + b^{[2]} = 0.7(0.2) + (-0.6)(0.6) + 0.2. \]

Evaluating term by term gives $0.14 - 0.36 + 0.2 = -0.02$. Applying the logistic sigmoid,

\[ \hat{y} = \sigma(z^{[2]}) = \frac{1}{1 + e^{-(-0.02)}} = \frac{1}{1 + e^{0.02}}. \]

Since $e^{0.02} \approx 1.0202$, we obtain

\[ \hat{y} \approx \frac{1}{2.0202} \approx 0.4950. \]

The network maps the input $[1.0, 2.0]^\top$ to a prediction of approximately $0.4950$, which in a binary classification setting reads as just under even odds for the positive class.
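The arithmetic above can be checked in a few lines. This sketch uses only the standard library and writes each unit's sum out explicitly; the variable names are ad hoc.

```python
import math

x = [1.0, 2.0]

# Hidden layer: z^[1] = W^[1] x + b^[1], followed by ReLU.
z1 = [0.5 * x[0] + (-0.2) * x[1] + 0.1,
      0.1 * x[0] + 0.4 * x[1] + (-0.3)]
a1 = [max(0.0, z) for z in z1]

# Output layer: scalar pre-activation, then the logistic sigmoid.
z2 = 0.7 * a1[0] + (-0.6) * a1[1] + 0.2
y_hat = 1.0 / (1.0 + math.exp(-z2))

print([round(z, 6) for z in z1], round(z2, 6), round(y_hat, 6))
```

Binary floating point does not represent values such as $0.2$ exactly, so the intermediate results are rounded for display rather than compared for exact equality.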

191.6.3 6.3 The Same Example, Batched

Suppose we also feed a second input $[0.0, 1.0]^\top$ alongside the first. Stacking both as columns gives

\[ \mathbf{A}^{[0]} = \begin{bmatrix} 1.0 & 0.0 \\ 2.0 & 1.0 \end{bmatrix}, \qquad \mathbf{Z}^{[1]} = \mathbf{W}^{[1]} \mathbf{A}^{[0]} + \mathbf{b}^{[1]} \mathbf{1}_2^\top. \]

The matrix product is

\[ \mathbf{W}^{[1]} \mathbf{A}^{[0]} = \begin{bmatrix} 0.1 & -0.2 \\ 0.9 & 0.4 \end{bmatrix}, \]

where the first column reproduces the single-example result and the second column is $[0.5(0) - 0.2(1),\ 0.1(0) + 0.4(1)]^\top = [-0.2,\ 0.4]^\top$. Broadcasting the bias across both columns yields

\[ \mathbf{Z}^{[1]} = \begin{bmatrix} 0.2 & -0.1 \\ 0.6 & 0.1 \end{bmatrix}. \]

After ReLU the negative entry $-0.1$ is clamped to zero, giving $\mathbf{A}^{[1]} = \begin{bmatrix} 0.2 & 0.0 \\ 0.6 & 0.1 \end{bmatrix}$. The first column is identical to the unbatched computation, confirming that batching changes throughput but not results.
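For completeness, the batch can be carried through the output layer in the same way. Writing $\hat{\mathbf{Y}}$ for the row of batched predictions (a symbol introduced here for convenience), the second-layer parameters applied to $\mathbf{A}^{[1]}$ give

\[ \mathbf{Z}^{[2]} = \mathbf{W}^{[2]} \mathbf{A}^{[1]} + b^{[2]} \mathbf{1}_2^\top = \begin{bmatrix} 0.7(0.2) - 0.6(0.6) + 0.2 & 0.7(0.0) - 0.6(0.1) + 0.2 \end{bmatrix} = \begin{bmatrix} -0.02 & 0.14 \end{bmatrix}, \]

\[ \hat{\mathbf{Y}} = \sigma\!\left(\mathbf{Z}^{[2]}\right) = \begin{bmatrix} \dfrac{1}{1 + e^{0.02}} & \dfrac{1}{1 + e^{-0.14}} \end{bmatrix} \approx \begin{bmatrix} 0.4950 & 0.5349 \end{bmatrix}, \]

using $e^{-0.14} \approx 0.8694$. The first entry matches the single-example prediction, and the second is the prediction for the input $[0.0, 1.0]^\top$.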

191.7 7. Computational Cost

The dominant cost of forward propagation lies in the matrix multiplications. For a single example, layer $\ell$ requires $d_\ell \, d_{\ell-1}$ multiply-add operations for the affine map plus $O(d_\ell)$ for the bias and activation. Summing over layers, one forward pass costs approximately $\sum_{\ell=1}^{L} d_\ell \, d_{\ell-1}$ multiply-adds, which equals the total number of weights (excluding biases) in the network. For a batch of $m$ examples the cost scales by $m$, becoming $m \sum_\ell d_\ell d_{\ell-1}$. The memory footprint during training is also governed by the forward pass, since every activation $\mathbf{a}^{[\ell]}$ must be retained for use in the subsequent backward pass. This coupling between forward propagation and backpropagation is the reason activation memory, rather than parameter count, often limits the batch size that fits on an accelerator.
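These counts are simple enough to tabulate directly from the layer widths. The helper `forward_cost` below is a hypothetical utility written for this illustration, and the widths 784-128-10 are an arbitrary example. It counts scalars only and ignores any pre-activations a framework might also keep.

```python
def forward_cost(widths, m=1):
    """Count multiply-adds, parameters, and stored activations for widths d_0..d_L."""
    pairs = list(zip(widths[:-1], widths[1:]))      # (d_{l-1}, d_l) for each layer
    weights = sum(d_in * d_out for d_in, d_out in pairs)
    biases = sum(widths[1:])
    return {
        "multiply_adds": m * weights,               # m * sum_l d_l * d_{l-1}
        "parameters": weights + biases,
        "stored_activations": m * sum(widths),      # a^[0], ..., a^[L] per example
    }

cost = forward_cost([784, 128, 10], m=32)
print(cost)
```

For this example the stored activations grow linearly with the batch size $m$ while the parameter count does not, which is the coupling described above.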

191.8 8. Reference Implementation

The aiinaction libraries ship a small, validated forward-propagation engine in three languages: Python, Julia, and Rust. A network is a list of dense Layer objects, each holding a weight matrix W of shape $(d_\ell \times d_{\ell-1})$, a bias vector b of length $d_\ell$, and the name of an elementwise activation. The forward function runs the sweep of Section 2 over a batch stored with examples as columns, the same layout as the worked example of Section 6. The sigmoid is evaluated in the numerically stable form $\sigma(z) = e^{z}/(1 + e^{z})$ for $z < 0$, which avoids evaluating $e^{-z}$ at a large positive argument, so it cannot overflow on large negative pre-activations.

Python

from aiinaction.ch186_forward_propagation import make_layer, forward

# The 2-2-1 network from the worked example: ReLU hidden, sigmoid output.
net = [
    make_layer([[0.5, -0.2], [0.1, 0.4]], [0.1, -0.3], "relu"),
    make_layer([[0.7, -0.6]], [0.2], "sigmoid"),
]

# A two-example batch, features down the rows, examples across the columns.
X = [[1.0, 0.0],
     [2.0, 1.0]]

Y = forward(net, X)
print("output shape:", Y.shape)
print("predictions :", [round(float(v), 6) for v in Y.ravel()])

# A single example reproduces the first column exactly.
single = forward(net, [1.0, 2.0])
print("single x    :", round(float(single[0, 0]), 6))

Output:

output shape: (1, 2)
predictions : [0.495, 0.534943]
single x    : 0.495

Julia

using AIInAction.Ch186ForwardPropagation

# The 2-2-1 network from the worked example: ReLU hidden, sigmoid output.
net = [
    make_layer([0.5 -0.2; 0.1 0.4], [0.1, -0.3], "relu"),
    make_layer([0.7 -0.6], [0.2], "sigmoid"),
]

# A two-example batch, features down the rows, examples across the columns.
X = [1.0 0.0; 2.0 1.0]

Y = forward(net, X)
println("output shape: ", size(Y))
println("predictions : ", round.(vec(Y); digits=6))

# A single example reproduces the first column exactly.
single = forward(net, [1.0, 2.0])
println("single x    : ", round(single[1, 1]; digits=6))

Rust

use aiinaction::ch186_forward_propagation::{forward, make_layer, Matrix};

// The 2-2-1 network from the worked example: ReLU hidden, sigmoid output.
let net = vec![
    make_layer(&[vec![0.5, -0.2], vec![0.1, 0.4]], &[0.1, -0.3], "relu").unwrap(),
    make_layer(&[vec![0.7, -0.6]], &[0.2], "sigmoid").unwrap(),
];

// A two-example batch, features down the rows, examples across the columns.
let x = Matrix::from_rows(&[vec![1.0, 0.0], vec![2.0, 1.0]]).unwrap();

let y = forward(&net, &x).unwrap();
println!("output shape: {}x{}", y.rows, y.cols);
println!("predictions : {:?}", y.data); // [0.49500016666, 0.534942945158]

// A single example reproduces the first column exactly.
let single = forward(&net, &Matrix::from_vec(&[1.0, 2.0]).unwrap()).unwrap();
println!("single x    : {}", single.data[0]); // 0.49500016666

All three implementations agree on the shared fixtures to within $10^{-9}$: the batch output is $[0.495000,\ 0.534943]$ and the single-example prediction is $0.495000$, matching the hand computation of Section 6.
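The piecewise evaluation of the sigmoid mentioned at the start of this section can be sketched as follows. This is an illustration of the idea using only the standard library, not the code of the aiinaction libraries; the names `stable_sigmoid` and `naive_sigmoid` are chosen here.

```python
import math

def stable_sigmoid(z):
    """Logistic sigmoid that never exponentiates a large positive number."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))   # exp(-z) <= 1
    ez = math.exp(z)                        # z < 0, so exp(z) < 1
    return ez / (1.0 + ez)

def naive_sigmoid(z):
    """Direct formula; exp(-z) overflows for large negative z."""
    return 1.0 / (1.0 + math.exp(-z))

print(stable_sigmoid(-0.02), stable_sigmoid(-1000.0))

# The direct formula fails on the same large negative input.
try:
    naive_sigmoid(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True   # math.exp(1000.0) exceeds the float range

print(naive_overflowed)
```

Both branches compute the same function. The branch is chosen so that the argument passed to the exponential is never positive.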

191.9 9. Summary

Forward propagation is the repeated application of an affine transform followed by a nonlinear activation, sweeping from input to output. The weights orient and scale each layer’s learned projection, the biases shift activation thresholds, and the nonlinearities give depth its representational value. The vectorized form expresses a layer as a single matrix-vector product, and the batched form generalizes this to a matrix-matrix product with broadcast biases, exposing the parallelism that modern hardware exploits. The worked example demonstrates that these forms agree exactly, differing only in how many examples flow through at once. Because the loss and all gradients are functions of the activations computed here, a precise understanding of forward propagation is the foundation on which training rests.


# Forward Propagation Forward propagation is the computational engine that turns an input vector into a prediction by passing it through the successive layers of a neural network. Every quantity used in training, including the loss and its gradients, is downstream of this single forward sweep. Mastery of forward propagation is therefore the prerequisite for understanding backpropagation, optimization, and the practical performance characteristics of deep learning systems. This chapter develops the layer-by-layer computation from first principles, derives its vectorized and batched forms, isolates the distinct roles played by weights and biases, and grounds the abstractions in a fully worked numerical example. ## 1. The Computational Setting A feedforward neural network defines a function $f: \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$ as a composition of $L$ parametric layers. Given an input $\mathbf{x} \in \mathbb{R}^{d_0}$, the network produces an output $\hat{\mathbf{y}} = f(\mathbf{x})$ through a strictly ordered sequence of transformations. Each layer applies an affine map followed by a nonlinear activation. The "forward" qualifier signals direction: information flows from inputs toward outputs, with no layer consuming a quantity it has not yet produced. We index layers by $\ell \in \{1, \dots, L\}$. Layer $\ell$ has $d_\ell$ units, so the network is fully specified by the widths $d_0, d_1, \dots, d_L$ and by the parameters attached to each layer. The choice of $d_0$ is fixed by the data dimensionality and $d_L$ by the task, while the hidden widths $d_1, \dots, d_{L-1}$ are design decisions. ### 1.1 Notation For each layer $\ell$ we collect the following objects. - $\mathbf{W}^{[\ell]} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$, the weight matrix. - $\mathbf{b}^{[\ell]} \in \mathbb{R}^{d_\ell}$, the bias vector. - $\mathbf{z}^{[\ell]} \in \mathbb{R}^{d_\ell}$, the pre-activation (also called the linear combination or logit vector). 
- $\mathbf{a}^{[\ell]} \in \mathbb{R}^{d_\ell}$, the post-activation output of the layer. - $g^{[\ell]}: \mathbb{R} \to \mathbb{R}$, the activation function applied elementwise. By convention the input is treated as the zeroth activation, $\mathbf{a}^{[0]} = \mathbf{x}$, and the network output is the final activation, $\hat{\mathbf{y}} = \mathbf{a}^{[L]}$. ## 2. The Layer-by-Layer Computation The defining recurrence of forward propagation is the pair of equations that map one layer's output to the next. For $\ell = 1, 2, \dots, L$, $$ \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]}, $$ $$ \mathbf{a}^{[\ell]} = g^{[\ell]}\!\left(\mathbf{z}^{[\ell]}\right). $$ The first equation is purely linear in the previous activation, up to the additive bias, and produces the pre-activation $\mathbf{z}^{[\ell]}$. The second equation introduces nonlinearity by applying $g^{[\ell]}$ componentwise. Writing the activation componentwise makes the elementwise nature explicit: for unit $i$ in layer $\ell$, $$ z_i^{[\ell]} = \sum_{j=1}^{d_{\ell-1}} W_{ij}^{[\ell]} \, a_j^{[\ell-1]} + b_i^{[\ell]}, \qquad a_i^{[\ell]} = g^{[\ell]}\!\left(z_i^{[\ell]}\right). $$ The scalar $W_{ij}^{[\ell]}$ is the strength of the connection from unit $j$ of layer $\ell-1$ to unit $i$ of layer $\ell$. The bias $b_i^{[\ell]}$ shifts the pre-activation of unit $i$ independently of any input. ### 2.1 Why Nonlinearity Is Essential If every activation were the identity, $g^{[\ell]}(z) = z$, the entire network would collapse into a single affine map. Composing affine maps yields an affine map: $\mathbf{W}^{[2]}(\mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}) + \mathbf{b}^{[2]} = \mathbf{W}'\mathbf{x} + \mathbf{b}'$ for some $\mathbf{W}'$ and $\mathbf{b}'$. Depth would buy nothing in representational power. The nonlinear activations $g^{[\ell]}$ are precisely what allow a deep network to represent functions that no single affine layer can. 
Common choices include the logistic sigmoid $\sigma(z) = 1 / (1 + e^{-z})$, the hyperbolic tangent $\tanh(z)$, and the rectified linear unit $\mathrm{ReLU}(z) = \max(0, z)$. ### 2.2 The Forward Sweep as Pseudocode The recurrence translates directly into an iterative procedure. The following non-executable sketch captures the control flow. ``` a = x # a^[0] for l in 1..L: z = W[l] @ a + b[l] # pre-activation a = g[l](z) # post-activation y_hat = a # a^[L] ``` Each iteration depends only on the activation produced by the previous iteration, which is why the loop is inherently sequential across layers even though the work inside a layer is highly parallel. ## 3. The Role of Weights and Biases The weights and biases are the learnable parameters; the activations are computed quantities. It is worth separating their roles precisely, because they contribute to the model in different ways. ### 3.1 Weights as Learned Feature Detectors Row $i$ of $\mathbf{W}^{[\ell]}$ is a vector in the input space of layer $\ell$. The pre-activation $z_i^{[\ell]} = \langle \mathbf{w}_i^{[\ell]}, \mathbf{a}^{[\ell-1]} \rangle + b_i^{[\ell]}$ is largest when the incoming activation $\mathbf{a}^{[\ell-1]}$ aligns with $\mathbf{w}_i^{[\ell]}$. Each unit therefore acts as a template matcher or feature detector, and the weight matrix as a whole projects the previous representation onto a new set of learned coordinates. The total parameter count of the affine map at layer $\ell$ is $d_\ell \times d_{\ell-1}$ weights plus $d_\ell$ biases. ### 3.2 Biases as Threshold Shifts The bias $b_i^{[\ell]}$ controls the activation threshold of unit $i$. With a ReLU activation, the unit fires only when $\langle \mathbf{w}_i^{[\ell]}, \mathbf{a}^{[\ell-1]} \rangle > -b_i^{[\ell]}$, so the bias sets the offset of the decision boundary. Geometrically, the weight orients a hyperplane in the input space and the bias translates that hyperplane away from the origin. 
Without biases, every layer's decision surface would be forced through the origin, a severe and usually unjustified restriction on the hypothesis class. ### 3.3 The Bias Trick A common notational simplification absorbs the bias into the weight matrix. Augment the activation with a constant entry, $\tilde{\mathbf{a}}^{[\ell-1]} = [\mathbf{a}^{[\ell-1]}; 1]$, and append the bias as an extra column, $\tilde{\mathbf{W}}^{[\ell]} = [\mathbf{W}^{[\ell]} \mid \mathbf{b}^{[\ell]}]$. Then $\mathbf{z}^{[\ell]} = \tilde{\mathbf{W}}^{[\ell]} \tilde{\mathbf{a}}^{[\ell-1]}$, and the affine map becomes a single matrix product. This trick is convenient for analysis, though most production frameworks keep weights and biases separate so that the two can be regularized and initialized independently. ## 4. The Vectorized Form Writing the per-unit sum $z_i^{[\ell]} = \sum_j W_{ij}^{[\ell]} a_j^{[\ell-1]} + b_i^{[\ell]}$ as an explicit loop over $i$ and $j$ is both notationally heavy and computationally wasteful. The vectorized form expresses the entire layer as one matrix-vector product plus a vector addition, $$ \mathbf{z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{a}^{[\ell-1]} + \mathbf{b}^{[\ell]} \in \mathbb{R}^{d_\ell}. $$ This is not merely cosmetic. Numerical libraries dispatch the matrix-vector product to highly optimized BLAS routines that exploit cache locality and SIMD instructions, so the vectorized form runs orders of magnitude faster than an interpreted double loop while computing exactly the same numbers. The dimensional bookkeeping is worth stating explicitly: $\mathbf{W}^{[\ell]}$ is $d_\ell \times d_{\ell-1}$, $\mathbf{a}^{[\ell-1]}$ is $d_{\ell-1} \times 1$, the product is $d_\ell \times 1$, and the bias $\mathbf{b}^{[\ell]}$ is $d_\ell \times 1$, so the sum is well defined and yields a $d_\ell \times 1$ pre-activation. ## 5. The Batched Form In practice we rarely propagate a single example at a time. 
Training proceeds over mini-batches, and inference is frequently performed on batches for throughput. Let $\mathbf{X} \in \mathbb{R}^{d_0 \times m}$ be a batch of $m$ examples stacked as columns, so column $k$ is the $k$-th input. Define the batched activation $\mathbf{A}^{[\ell]} \in \mathbb{R}^{d_\ell \times m}$ analogously. The forward recurrence generalizes cleanly, $$ \mathbf{Z}^{[\ell]} = \mathbf{W}^{[\ell]} \mathbf{A}^{[\ell-1]} + \mathbf{b}^{[\ell]} \mathbf{1}_m^\top, $$ $$ \mathbf{A}^{[\ell]} = g^{[\ell]}\!\left(\mathbf{Z}^{[\ell]}\right), $$ where $\mathbf{1}_m \in \mathbb{R}^m$ is the all-ones vector and the outer product $\mathbf{b}^{[\ell]} \mathbf{1}_m^\top \in \mathbb{R}^{d_\ell \times m}$ replicates the bias across all $m$ columns. In array-programming frameworks this replication is performed implicitly by broadcasting, so the code simply adds a length $d_\ell$ vector to a $d_\ell \times m$ matrix. ### 5.1 Why Batching Helps The single matrix-matrix product $\mathbf{W}^{[\ell]} \mathbf{A}^{[\ell-1]}$ of shape $d_\ell \times m$ amortizes the cost of loading $\mathbf{W}^{[\ell]}$ from memory across all $m$ examples. On parallel hardware such as a GPU, a matrix-matrix multiply achieves far higher arithmetic intensity than $m$ separate matrix-vector multiplies, because the weight matrix is reused many times once resident in fast memory. Batching thus improves hardware utilization without altering the mathematics: the result for column $k$ is identical to propagating example $k$ alone. ### 5.2 A Convention Caveat Many deep learning frameworks adopt the transposed convention, storing the batch as $\mathbf{X} \in \mathbb{R}^{m \times d_0}$ with examples in rows and computing $\mathbf{Z} = \mathbf{A}^{[\ell-1]} \mathbf{W}^{[\ell]\top} + \mathbf{b}^{[\ell]}$. The two conventions are transposes of one another and describe the same computation. The reader should always confirm which layout a given codebase uses before reasoning about shapes. ## 6. 
A Worked Numerical Example We now propagate a concrete input through a small network by hand. Consider a network with $d_0 = 2$, one hidden layer of width $d_1 = 2$ using ReLU, and an output layer of width $d_2 = 1$ using the logistic sigmoid. The parameters are $$ \mathbf{W}^{[1]} = \begin{bmatrix} 0.5 & -0.2 \\ 0.1 & 0.4 \end{bmatrix}, \quad \mathbf{b}^{[1]} = \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}, \quad \mathbf{W}^{[2]} = \begin{bmatrix} 0.7 & -0.6 \end{bmatrix}, \quad \mathbf{b}^{[2]} = \begin{bmatrix} 0.2 \end{bmatrix}. $$ Let the input be $\mathbf{x} = \mathbf{a}^{[0]} = [1.0,\ 2.0]^\top$. ### 6.1 First Layer The pre-activation of the hidden layer is $$ \mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]} = \begin{bmatrix} 0.5(1.0) + (-0.2)(2.0) \\ 0.1(1.0) + 0.4(2.0) \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}. $$ Computing the matrix product gives $[0.5 - 0.4,\ 0.1 + 0.8]^\top = [0.1,\ 0.9]^\top$, and adding the bias yields $$ \mathbf{z}^{[1]} = \begin{bmatrix} 0.1 + 0.1 \\ 0.9 - 0.3 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.6 \end{bmatrix}. $$ Applying the ReLU activation elementwise, both entries are positive, so $$ \mathbf{a}^{[1]} = \mathrm{ReLU}\!\left(\mathbf{z}^{[1]}\right) = \begin{bmatrix} 0.2 \\ 0.6 \end{bmatrix}. $$ ### 6.2 Second Layer The output pre-activation is a scalar, $$ z^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + b^{[2]} = 0.7(0.2) + (-0.6)(0.6) + 0.2. $$ Evaluating term by term gives $0.14 - 0.36 + 0.2 = -0.02$. Applying the logistic sigmoid, $$ \hat{y} = \sigma(z^{[2]}) = \frac{1}{1 + e^{-(-0.02)}} = \frac{1}{1 + e^{0.02}}. $$ Since $e^{0.02} \approx 1.0202$, we obtain $$ \hat{y} \approx \frac{1}{2.0202} \approx 0.4950. $$ The network maps the input $[1.0, 2.0]^\top$ to a prediction of approximately $0.4950$, which in a binary classification setting reads as just under even odds for the positive class. 
### 6.3 The Same Example, Batched Suppose we also feed a second input $[0.0, 1.0]^\top$ alongside the first. Stacking both as columns gives $$ \mathbf{A}^{[0]} = \begin{bmatrix} 1.0 & 0.0 \\ 2.0 & 1.0 \end{bmatrix}, \qquad \mathbf{Z}^{[1]} = \mathbf{W}^{[1]} \mathbf{A}^{[0]} + \mathbf{b}^{[1]} \mathbf{1}_2^\top. $$ The matrix product is $$ \mathbf{W}^{[1]} \mathbf{A}^{[0]} = \begin{bmatrix} 0.1 & -0.2 \\ 0.9 & 0.4 \end{bmatrix}, $$ where the first column reproduces the single-example result and the second column is $[0.5(0) - 0.2(1),\ 0.1(0) + 0.4(1)]^\top = [-0.2,\ 0.4]^\top$. Broadcasting the bias across both columns yields $$ \mathbf{Z}^{[1]} = \begin{bmatrix} 0.2 & -0.1 \\ 0.6 & 0.1 \end{bmatrix}. $$ After ReLU the negative entry $-0.1$ is clamped to zero, giving $\mathbf{A}^{[1]} = \begin{bmatrix} 0.2 & 0.0 \\ 0.6 & 0.1 \end{bmatrix}$. The first column is identical to the unbatched computation, confirming that batching changes throughput but not results. ## 7. Computational Cost The dominant cost of forward propagation is the matrix multiplications. For a single example, layer $\ell$ requires $d_\ell \, d_{\ell-1}$ multiply-add operations for the affine map plus $O(d_\ell)$ for the bias and activation. Summing over layers, one forward pass costs approximately $\sum_{\ell=1}^{L} d_\ell \, d_{\ell-1}$ multiply-adds, which equals the total number of weights in the network. For a batch of $m$ examples the cost scales by $m$, becoming $m \sum_\ell d_\ell d_{\ell-1}$. The memory footprint during training is also governed by the forward pass, since every activation $\mathbf{a}^{[\ell]}$ must be retained for use in the subsequent backward pass. This coupling between forward propagation and backpropagation is the reason activation memory, rather than parameter count, often limits the batch size that fits on an accelerator. ## 8. Reference Implementation The `aiinaction` libraries ship a small, validated forward-propagation engine in all three languages. 
A network is a list of dense `Layer` objects, each holding a weight matrix `W` of shape $(d_\ell \times d_{\ell-1})$, a bias vector `b` of length $d_\ell$, and the name of an elementwise activation. The `forward` function runs the sweep of Section 2 over a batch stored with examples as columns, exactly matching the worked example above. The sigmoid is evaluated in the numerically stable form $\sigma(z) = e^{z}/(1 + e^{z})$ for $z < 0$, so it cannot overflow on large negative pre-activations.

::: {.panel-tabset}

## Python

```{python}
from aiinaction.ch186_forward_propagation import make_layer, forward

# The 2-2-1 network from the worked example: ReLU hidden, sigmoid output.
net = [
    make_layer([[0.5, -0.2], [0.1, 0.4]], [0.1, -0.3], "relu"),
    make_layer([[0.7, -0.6]], [0.2], "sigmoid"),
]

# A two-example batch, features down the rows, examples across the columns.
X = [[1.0, 0.0], [2.0, 1.0]]
Y = forward(net, X)
print("output shape:", Y.shape)
print("predictions :", [round(float(v), 6) for v in Y.ravel()])

# A single example reproduces the first column exactly.
single = forward(net, [1.0, 2.0])
print("single x :", round(float(single[0, 0]), 6))
```

## Julia

```julia
using AIInAction.Ch186ForwardPropagation

# The 2-2-1 network from the worked example: ReLU hidden, sigmoid output.
net = [
    make_layer([0.5 -0.2; 0.1 0.4], [0.1, -0.3], "relu"),
    make_layer([0.7 -0.6], [0.2], "sigmoid"),
]

# A two-example batch, features down the rows, examples across the columns.
X = [1.0 0.0; 2.0 1.0]
Y = forward(net, X)
println("output shape: ", size(Y))
println("predictions : ", round.(vec(Y); digits=6))

# A single example reproduces the first column exactly.
single = forward(net, [1.0, 2.0])
println("single x : ", round(single[1, 1]; digits=6))
```

## Rust

```rust
use aiinaction::ch186_forward_propagation::{forward, make_layer, Matrix};

// The 2-2-1 network from the worked example: ReLU hidden, sigmoid output.
let net = vec![
    make_layer(&[vec![0.5, -0.2], vec![0.1, 0.4]], &[0.1, -0.3], "relu").unwrap(),
    make_layer(&[vec![0.7, -0.6]], &[0.2], "sigmoid").unwrap(),
];

// A two-example batch, features down the rows, examples across the columns.
let x = Matrix::from_rows(&[vec![1.0, 0.0], vec![2.0, 1.0]]).unwrap();
let y = forward(&net, &x).unwrap();
println!("output shape: {}x{}", y.rows, y.cols);
println!("predictions : {:?}", y.data); // [0.49500016666, 0.534942945158]

// A single example reproduces the first column exactly.
let single = forward(&net, &Matrix::from_vec(&[1.0, 2.0]).unwrap()).unwrap();
println!("single x : {}", single.data[0]); // 0.49500016666
```

:::

All three implementations agree on the shared fixtures to within $10^{-9}$: the batch output is $[0.495000,\ 0.534943]$ and the single-example prediction is $0.495000$, matching the hand computation of Section 6.

## 9. Summary

Forward propagation is the repeated application of an affine transform followed by a nonlinear activation, sweeping from input to output. The weights orient and scale each layer's learned projection, the biases shift activation thresholds, and the nonlinearities give depth its representational value. The vectorized form expresses a layer as a single matrix-vector product, and the batched form generalizes this to a matrix-matrix product with broadcast biases, exposing the parallelism that modern hardware exploits. The worked example demonstrates that these forms agree exactly, differing only in how many examples flow through at once. Because the loss and all gradients are functions of the activations computed here, a precise understanding of forward propagation is the foundation on which training rests.

## References

1. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 6: Deep Feedforward Networks. MIT Press, 2016. https://www.deeplearningbook.org/contents/mlp.html
2. Nielsen, M. Neural Networks and Deep Learning, Chapter 2: How the Backpropagation Algorithm Works. 2015. http://neuralnetworksanddeeplearning.com/chap2.html
3. Ng, A. et al. Deep Learning Specialization, Course 1: Neural Networks and Deep Learning. DeepLearning.AI. https://www.deeplearning.ai/courses/deep-learning-specialization/
4. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning Representations by Back-Propagating Errors. Nature, 323, 533-536, 1986. https://www.nature.com/articles/323533a0
5. Bishop, C. M. Pattern Recognition and Machine Learning, Chapter 5: Neural Networks. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
6. Karpathy, A. CS231n Convolutional Neural Networks for Visual Recognition: Neural Networks Part 1. Stanford University. https://cs231n.github.io/neural-networks-1/