The forward pass is the part of a neural network that actually produces a prediction. You feed inputs into the model, the model applies a sequence of mathematical operations, and an output comes out the other side.
That sounds trivial, but it is one of the most important ideas in deep learning because everything else depends on it:
- the loss compares the forward-pass output to the target
- backpropagation differentiates through the forward pass
- training repeats the forward pass and adjusts the weights so that its output improves
The easiest way to understand the forward pass is to start with a single neuron and then scale it up into a full layer written in matrix form.
This follows the same teaching sequence used in the chapter 1 PDF: inputs feed into a weighted sum, the bias shifts the pre-activation, and the activation turns that value into the final output.
What the Forward Pass Means
A forward pass is simply:
input -> transformations -> output
For a neural network, those transformations are usually:
- weighted sums
- bias additions
- activation functions
- repeated layer by layer
You can think of the forward pass as the model saying:
Given these current weights, this is my current answer.
Step 1: A Single Artificial Neuron
One neuron takes several inputs, multiplies each by a weight, adds them together, adds a bias, and then usually applies a nonlinear activation function.
The core equation is:
z = w1x1 + w2x2 + ... + wnxn + b
Then:
a = f(z)
Where:
- x1, x2, ..., xn are inputs
- w1, w2, ..., wn are weights
- b is the bias
- z is the pre-activation
- f is the activation function
- a is the final output of the neuron
This is the smallest useful forward pass.
The PDF’s key move is to show that the left-to-right flow stays the same even when we stop thinking about one neuron and start thinking in vectors and matrices.
A Worked Numerical Example
Suppose:
- x1 = 1.5
- x2 = -2.0
- x3 = 0.8
- w1 = 0.4
- w2 = -0.5
- w3 = 0.3
- b = 0.1
First compute the weighted sum:
z = (1.5 * 0.4) + (-2.0 * -0.5) + (0.8 * 0.3) + 0.1
z = 0.6 + 1.0 + 0.24 + 0.1 = 1.94
Now apply ReLU:
a = max(0, 1.94) = 1.94
So the forward pass output of this neuron is 1.94.
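The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch, not a reference implementation; the names `relu` and `neuron_forward` are illustrative and do not come from any library.

```python
def relu(z):
    """ReLU activation: negative values become 0, positive values pass through."""
    return max(0.0, z)


def neuron_forward(inputs, weights, bias, activation=relu):
    """Forward pass of one neuron: weighted sum, plus bias, then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # pre-activation
    a = activation(z)                                        # post-activation
    return z, a


# Values from the worked example
z, a = neuron_forward([1.5, -2.0, 0.8], [0.4, -0.5, 0.3], 0.1)
print(round(z, 2), round(a, 2))  # prints: 1.94 1.94
```

Returning both `z` and `a` keeps the pre-activation available, which is the value backpropagation later needs.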
Why We Need the Bias
Without the bias term, the pre-activation would always be 0 whenever all inputs are 0, so the neuron's output would be stuck at f(0) no matter what the weights are.
The bias gives the neuron a learnable offset. It lets the model shift thresholds and decision boundaries instead of forcing every computation to pass through the origin.
In practice, the bias is small in notation and extremely important in behavior.
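A small sketch of the effect, assuming a sigmoid activation (chosen here only for illustration). With all-zero inputs the weighted sum is 0 regardless of the weights, so the bias is the only term that can move the result. The function names and the bias value 2.0 are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One neuron with a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


zero_input = [0.0, 0.0, 0.0]
weights = [0.4, -0.5, 0.3]  # any weights give the same result for zero input

no_bias = neuron(zero_input, weights, bias=0.0)    # stuck at sigmoid(0) = 0.5
with_bias = neuron(zero_input, weights, bias=2.0)  # shifted above 0.5

print(no_bias, with_bias)
```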
Pre-Activation vs Post-Activation
This distinction matters:
- z is the pre-activation
- a = f(z) is the post-activation
Why keep them separate?
Because:
- the weighted sum carries the linear combination
- the activation function introduces nonlinearity
Without that nonlinearity, stacking many layers would still collapse into one big linear transformation.
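The collapse can be verified numerically. The sketch below uses made-up weights and plain Python lists. It stacks two layers with no activation in between, then builds a single layer with W = W2 W1 and b = W2 b1 + b2, and compares the two outputs. All helper names are illustrative.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Elementwise vector sum."""
    return [a + b for a, b in zip(u, v)]


# Illustrative parameters (not from the text)
W1 = [[0.5, 0.2], [-0.3, 0.8], [1.0, -0.5]]  # shape (3, 2)
b1 = [0.1, -0.2, 0.0]
W2 = [[1.0, -1.0, 0.5], [0.2, 0.4, -0.6]]    # shape (2, 3)
b2 = [0.3, -0.1]
x = [2.0, -1.0]

# Two stacked layers with no activation in between
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent layer
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers)
print(one_layer)
```

Up to floating-point rounding the two printed vectors agree, which is why depth adds nothing without a nonlinearity between layers.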
Why Activations Matter in the Forward Pass
If every layer were just:
z = Wx + b
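then even many stacked layers would still be equivalent to a single linear (strictly, affine) map. For two layers, W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2), which has the same form as one layer.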
Activation functions break that limitation.
Common examples:
- Sigmoid maps values into (0, 1)
- Tanh maps values into (-1, 1)
- ReLU maps negative values to 0 and leaves positive values unchanged
- GELU smoothly gates inputs and is widely used in transformers
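For reference, the four activations listed above can be written as scalar functions using only the standard library. This is a sketch for intuition. The GELU here is the exact form z * Phi(z), where Phi is the standard normal CDF; frameworks may also offer faster approximations.

```python
import math


def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Maps any real number into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)


def gelu(z):
    """Exact GELU: z times the standard normal CDF evaluated at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), gelu(z))
```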
For many modern neural networks, the forward pass is really:
linear transform -> activation -> linear transform -> activation -> ...
Step 2: From One Neuron to One Layer
A layer is just many neurons operating on the same input.
Suppose the input has 3 features:
x = [x1, x2, x3]
And the layer has 4 neurons. Each neuron has its own weights and bias.
That means:
- neuron 1 computes its own weighted sum
- neuron 2 computes a different weighted sum
- neuron 3 computes another one
- neuron 4 does the same
The outputs are stacked into a vector.
So instead of one scalar output, the layer produces:
z = [z1, z2, z3, z4]
Then:
a = f(z)
where the activation function is applied elementwise.
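Before switching to matrix notation, the same layer can be written as an explicit loop over neurons. This sketch assumes ReLU as the activation and uses hypothetical weights and biases. The first neuron reuses the values from the single-neuron example.

```python
def relu(z):
    return max(0.0, z)


x = [1.5, -2.0, 0.8]  # 3 input features

# Hypothetical parameters: one (weights, bias) pair per neuron
neurons = [
    ([0.4, -0.5, 0.3], 0.1),
    ([0.1, 0.2, -0.4], -0.2),
    ([-0.6, 0.1, 0.5], 0.0),
    ([0.2, 0.2, 0.2], 0.3),
]

# Every neuron sees the same input x but applies its own weights and bias
z = [sum(w * xi for w, xi in zip(weights, x)) + bias for weights, bias in neurons]
a = [relu(zj) for zj in z]  # activation applied elementwise

print(z)
print(a)
```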
The Matrix Form
Writing each neuron separately gets tedious fast. Matrix notation is cleaner and more powerful.
For a dense layer:
z = Wx + b
Where:
- x has shape (m,)
- W has shape (n, m)
- b has shape (n,)
- z has shape (n,)
If the input has m features and the layer has n neurons, then the layer contains:
- one row of weights per neuron
- one bias per neuron
Then the activation gives:
a = f(z)
So the full forward pass of the layer is:
a = f(Wx + b)
This one expression captures an entire layer of neurons.
A Concrete Matrix Example
Let:
x = [2, -1]
W = [[0.5, 0.2],
[-0.3, 0.8],
[1.0, -0.5]]
b = [0.1, -0.2, 0.0]
Then:
Wx =
[
(0.5 * 2) + (0.2 * -1),
(-0.3 * 2) + (0.8 * -1),
(1.0 * 2) + (-0.5 * -1)
]
= [0.8, -1.4, 2.5]
Add the bias:
z = [0.9, -1.6, 2.5]
Apply ReLU:
a = [0.9, 0, 2.5]
So this 3-neuron layer converts a 2-dimensional input into a 3-dimensional output.
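The matrix example can be reproduced in plain Python. The helper name `dense_forward` is illustrative; the sketch stores W as a list of rows, one row per neuron.

```python
def relu(z):
    return max(0.0, z)


def dense_forward(W, x, b, activation):
    """Compute a = f(Wx + b) for one input vector x.

    W is a list of n rows with m entries each, x has m entries, b has n entries.
    """
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    a = [activation(z_i) for z_i in z]
    return z, a


x = [2.0, -1.0]
W = [[0.5, 0.2],
     [-0.3, 0.8],
     [1.0, -0.5]]
b = [0.1, -0.2, 0.0]

z, a = dense_forward(W, x, b, relu)
print([round(v, 2) for v in z])  # pre-activations
print([round(v, 2) for v in a])  # post-activations
```

Rounded to two decimals, the printed values match the hand calculation: z is [0.9, -1.6, 2.5] and ReLU zeroes the negative entry.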
Why Matrix Form Matters
The matrix form is not just mathematically elegant. It is how real neural networks are implemented efficiently.
Reasons it matters:
- GPUs are extremely good at matrix multiplication
- whole layers can be computed in parallel
- frameworks like PyTorch and TensorFlow are built around tensor operations
- the same formula scales cleanly from tiny demos to giant models
This is one of the key transitions from “I understand the idea” to “I understand how this is actually implemented.”
Adding a Batch Dimension
In real training, we usually process many examples at once.
Instead of one input vector, we have a batch:
X with shape (batch_size, input_dim)
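Then, with each example stored as a row of X, the layer computation becomes:
Z = XW^T + b
If examples are stored as columns instead, so that X has shape (input_dim, batch_size), the same computation is written:
Z = WX + b
In both cases b is broadcast across the batch. The exact formula changes based on whether examples are row vectors or column vectors, but the idea does not: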
- each example is run through the same weights
- the outputs are computed together in one batched matrix operation
For example, if:
- batch size = 32
- input dimension = 128
- hidden size = 256
then:
- X might be (32, 128)
- W might be (256, 128)
- output Z becomes (32, 256)
That is one forward pass through one dense layer for 32 examples at once.
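The shape bookkeeping can be made concrete with a pure-Python sketch of the row-vector convention, Z = XW^T + b. The names `dense_batch` and `shape` are illustrative, and the zero-filled parameters are placeholders because only the shapes matter here. A real implementation would use a tensor library rather than nested lists.

```python
def dense_batch(X, W, b):
    """Compute Z = X W^T + b for examples stored as rows.

    X: (batch_size, input_dim), W: (n_neurons, input_dim), b: (n_neurons,)
    Returns Z: (batch_size, n_neurons).
    """
    return [
        [sum(w * x for w, x in zip(w_row, x_row)) + b_j
         for w_row, b_j in zip(W, b)]
        for x_row in X
    ]


def shape(M):
    """Shape of a rectangular list-of-rows matrix."""
    return (len(M), len(M[0]))


# Shapes from the text: 32 examples, 128 features, 256 neurons
X = [[0.0] * 128 for _ in range(32)]
W = [[0.0] * 128 for _ in range(256)]
b = [0.0] * 256

Z = dense_batch(X, W, b)
print(shape(X), shape(W), shape(Z))  # (32, 128) (256, 128) (32, 256)
```

Each row of Z is the layer's pre-activation for one example, computed with the same W and b.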
Stacking Layers
Once one layer produces an output, the next layer treats that output as its input.
For a 2-layer network:
a1 = f(W1x + b1)
a2 = g(W2a1 + b2)
This is how representation learning happens:
- the first layer extracts simple patterns
- later layers transform those patterns into more useful abstractions
Even very large models are still built from this repeating idea.
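Stacking can be sketched as a loop in which each layer's output becomes the next layer's input. The layer parameters below are hypothetical, and `forward`, `dense`, `relu_vec`, and `identity` are illustrative names.

```python
def relu_vec(z):
    """Elementwise ReLU on a vector."""
    return [max(0.0, v) for v in z]


def identity(z):
    """No activation; returns a copy of the vector."""
    return list(z)


def dense(W, b, x):
    """Pre-activation z = Wx + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, layers):
    """Run x through a list of (W, b, activation) layers in order."""
    a = x
    for W, b, activation in layers:
        a = activation(dense(W, b, a))  # output of one layer feeds the next
    return a


layers = [
    ([[0.5, 0.2], [-0.3, 0.8], [1.0, -0.5]], [0.1, -0.2, 0.0], relu_vec),  # 2 -> 3
    ([[1.0, 1.0, 1.0]], [0.0], identity),                                   # 3 -> 1
]

out = forward([2.0, -1.0], layers)
print(out)
```

Adding depth is just appending another (W, b, activation) tuple, as long as each layer's input size matches the previous layer's output size.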
Forward Pass in a Tiny Neural Network
Imagine:
- input dimension = 3
- hidden layer = 4 neurons
- output layer = 2 neurons
Then the forward pass looks like:
x                         # (3,)
-> z1 = W1x + b1          # (4,)
-> a1 = ReLU(z1)          # (4,)
-> z2 = W2a1 + b2         # (2,)
-> y_hat = softmax(z2)    # (2,)
If this is a classifier, the final output might be:
y_hat = [0.91, 0.09]
meaning the model currently believes class 1 is much more likely than class 2.
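Here is a sketch of that 3 -> 4 -> 2 network in plain Python. All weights and biases are made up for illustration, so the resulting probabilities will not match the [0.91, 0.09] example above. The softmax subtracts the maximum logit before exponentiating, a common trick for numerical stability.

```python
import math


def dense(W, b, x):
    """Pre-activation z = Wx + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu_vec(z):
    return [max(0.0, v) for v in z]


def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters
W1 = [[0.4, -0.5, 0.3],
      [0.1, 0.2, -0.4],
      [-0.6, 0.1, 0.5],
      [0.2, 0.2, 0.2]]            # (4, 3)
b1 = [0.1, -0.2, 0.0, 0.3]        # (4,)
W2 = [[1.0, 0.0, 0.0, 0.5],
      [-0.5, 0.0, 0.0, 1.0]]      # (2, 4)
b2 = [0.0, 0.0]                   # (2,)

x = [1.5, -2.0, 0.8]              # (3,)
z1 = dense(W1, b1, x)             # (4,)
a1 = relu_vec(z1)                 # (4,)
z2 = dense(W2, b2, a1)            # (2,)
y_hat = softmax(z2)               # (2,)

print(y_hat)
```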
A Minimal PyTorch Example
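import torch
import torch.nn.functional as F
# One example with 3 features; the leading dimension is the batch (size 1)
x = torch.tensor([[1.5, -2.0, 0.8]])
# Two neurons, one row of weights each: shape (2, 3)
W1 = torch.tensor([
    [0.4, -0.5, 0.3],
    [0.1, 0.2, -0.4],
], dtype=torch.float32)
b1 = torch.tensor([0.1, -0.2], dtype=torch.float32)  # one bias per neuron
z1 = x @ W1.T + b1  # pre-activation, shape (1, 2)
a1 = F.relu(z1)     # post-activation, shape (1, 2)
print("z1:", z1)
print("a1:", a1)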
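This performs the same kind of forward pass we computed by hand, just in batched tensor form. The first row of W1 reuses the weights from the single-neuron worked example, so the first entry of z1 is 1.94. The second neuron's pre-activation is -0.77, which ReLU maps to 0.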
Common Mistakes When Learning This
Mixing up scalar and matrix notation
The neuron equation and the matrix equation describe the same idea at different scales.
Forgetting the activation function
The weighted sum alone is not enough for a deep network to be expressive.
Ignoring tensor shapes
A lot of confusion in neural network code is really shape confusion. It is worth checking shapes constantly.
Thinking the forward pass is only for inference
Training also depends on the forward pass. You cannot compute loss or gradients without first producing predictions.
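On the shape point in particular, one habit that helps is validating shapes before computing anything. The sketch below does this for a single dense layer with list-based parameters. The helper `check_dense_shapes` is hypothetical; in a tensor library you would inspect each tensor's shape attribute instead.

```python
def check_dense_shapes(x, W, b):
    """Validate shapes for z = Wx + b and return the shape of z.

    Expects x: (m,), W: (n, m), b: (n,). Raises ValueError on a mismatch.
    """
    m, n = len(x), len(W)
    for i, row in enumerate(W):
        if len(row) != m:
            raise ValueError(f"W row {i} has {len(row)} entries, expected {m}")
    if len(b) != n:
        raise ValueError(f"b has {len(b)} entries, expected {n}")
    return (n,)


W = [[0.5, 0.2], [-0.3, 0.8], [1.0, -0.5]]
b = [0.1, -0.2, 0.0]

print(check_dense_shapes([2.0, -1.0], W, b))  # (3,)

try:
    check_dense_shapes([2.0, -1.0, 0.5], W, b)  # 3 features, but W expects 2
except ValueError as err:
    print("shape error:", err)
```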
Summary
- The forward pass is the process of turning inputs into outputs using the model’s current parameters.
- A single neuron computes a weighted sum plus bias, then applies an activation.
- A dense layer is many neurons applied in parallel.
- Matrix form compresses the layer computation into a = f(Wx + b).
- Batched matrix operations make neural networks fast on GPUs.
- Every large neural network is still built from repeated forward-pass blocks like these.
Once matrix form feels natural, a lot of neural-network architecture becomes much easier to read and implement.