Forward Pass Explained: From a Single Neuron to Matrix Form

The forward pass is the part of a neural network that actually produces a prediction. You feed inputs into the model, the model applies a sequence of mathematical operations, and an output comes out the other side.

That sounds trivial, but it is one of the most important ideas in deep learning because everything else depends on it:

the loss compares the forward-pass output to the target
backpropagation differentiates through the forward pass
training repeats the forward pass, adjusting the weights between passes so the output improves

The easiest way to understand the forward pass is to start with a single neuron and then scale it up into a full layer written in matrix form.

Figure: Forward pass neuron anatomy

This follows the same teaching sequence used in the chapter 1 PDF: inputs feed into a weighted sum, the bias shifts the pre-activation, and the activation turns that value into the final output.

What the Forward Pass Means

A forward pass is simply:

input -> transformations -> output

For a neural network, those transformations are usually:

weighted sums
bias additions
activation functions
repeated layer by layer

You can think of the forward pass as the model saying:

Given these current weights, this is my current answer.

Step 1: A Single Artificial Neuron

One neuron takes several inputs, multiplies each by a weight, adds them together, adds a bias, and then usually applies a nonlinear activation function.

The core equation is:

z = w1x1 + w2x2 + ... + wnxn + b

Then:

a = f(z)

Where:

x1, x2, ..., xn are inputs
w1, w2, ..., wn are weights
b is the bias
z is the pre-activation
f is the activation function
a is the final output of the neuron

This is the smallest useful forward pass.
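As a minimal sketch, the two equations above can be written as a small pure-Python function. The names neuron_forward and relu, and the input values, are made up for this illustration.

```python
def relu(z):
    """ReLU activation: negative values become 0."""
    return max(0.0, z)


def neuron_forward(inputs, weights, bias, activation=relu):
    """Return (z, a) for one neuron: z = sum(w_i * x_i) + b, a = f(z)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # pre-activation
    return z, activation(z)                                 # post-activation


# Illustrative values, chosen only for this sketch.
z, a = neuron_forward(inputs=[1.0, 2.0], weights=[0.5, -1.0], bias=0.25)
print("z =", z, "a =", a)
```

Returning both z and a keeps the pre-activation available, which matters later when gradients are computed.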

Figure: Forward pass flow and matrix scaling

The PDF’s key move is to show that the left-to-right flow stays the same even when we stop thinking about one neuron and start thinking in vectors and matrices.

A Worked Numerical Example

Suppose:

x1 = 1.5
x2 = -2.0
x3 = 0.8
w1 = 0.4
w2 = -0.5
w3 = 0.3
b = 0.1

First compute the weighted sum:

z = (1.5 * 0.4) + (-2.0 * -0.5) + (0.8 * 0.3) + 0.1

z = 0.6 + 1.0 + 0.24 + 0.1 = 1.94

Now apply ReLU:

a = max(0, 1.94) = 1.94

So the forward pass output of this neuron is 1.94.
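If you want to check the arithmetic yourself, a few lines of plain Python reproduce it term by term. This is only a quick sketch; the variable names are arbitrary.

```python
inputs = [1.5, -2.0, 0.8]
weights = [0.4, -0.5, 0.3]
bias = 0.1

# Each product w_i * x_i, in the same order as the hand calculation.
products = [w * x for w, x in zip(weights, inputs)]

z = sum(products) + bias   # pre-activation
a = max(0.0, z)            # ReLU

print("products:", [round(p, 2) for p in products])
print("z =", round(z, 2), "a =", round(a, 2))
```

The values are rounded for printing because floating-point arithmetic can leave tiny errors in the last digits.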

Why We Need the Bias

Without the bias term, the pre-activation z would always be 0 whenever all inputs are 0, so the neuron's output would be stuck at f(0) no matter what its weights are.

The bias gives the neuron a learnable offset. It lets the model shift thresholds and decision boundaries instead of forcing every computation to pass through the origin.

In practice, the bias is small in notation and extremely important in behavior.
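To see the threshold shift concretely, here is a small sketch of a one-input ReLU neuron. The weight, the bias of -2, and the test inputs are assumptions picked for illustration.

```python
def relu_neuron(x, w, b):
    """One-input neuron with ReLU: a = max(0, w*x + b)."""
    return max(0.0, w * x + b)


xs = [0.0, 1.0, 2.0, 3.0]

# Without a bias, the neuron turns on as soon as x rises above 0.
no_bias = [relu_neuron(x, w=1.0, b=0.0) for x in xs]

# With b = -2, the same neuron stays off until x rises above 2.
with_bias = [relu_neuron(x, w=1.0, b=-2.0) for x in xs]

print("no bias:  ", no_bias)
print("with bias:", with_bias)
```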

Pre-Activation vs Post-Activation

This distinction matters:

z is the pre-activation
a = f(z) is the post-activation

Why keep them separate?

Because:

the weighted sum carries the linear combination
the activation function introduces nonlinearity

Without that nonlinearity, stacking many layers would still collapse into one big linear transformation.

Why Activations Matter in the Forward Pass

If every layer were just:

z = Wx + b

then even many stacked layers would still be equivalent to a single linear (more precisely, affine) map.

Activation functions break that limitation.
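The collapse can be checked numerically. The sketch below uses NumPy rather than PyTorch only to stay lightweight, and the weights are small hand-picked values, not anything from a real model. It relies on the identity W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2).

```python
import numpy as np

# Two linear layers with no activation between them (illustrative values).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.0])
x = np.array([2.0, 1.0])

# Computed layer by layer.
stacked = W2 @ (W1 @ x + b1) + b2

# Computed as one combined layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
collapsed = W_combined @ x + b_combined

# With a ReLU between the layers, the combined layer no longer matches here.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print("stacked:  ", stacked)
print("collapsed:", collapsed)
print("with ReLU:", with_relu)
```

For this input the two purely linear computations agree, while the version with a ReLU in the middle gives a different answer.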

Common examples:

Sigmoid maps values into (0, 1)
Tanh maps values into (-1, 1)
ReLU maps negative values to 0 and leaves positive values unchanged
GELU smoothly gates inputs and is widely used in transformers
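As a rough sketch, the four activations can be written for scalar inputs with Python's math module. The GELU here is the exact form based on the normal CDF; some frameworks also offer a tanh-based approximation, which is not shown.

```python
import math


def sigmoid(z):
    """Squashes z into (0, 1). Not hardened against very large negative z."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes z into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Zero for negative z, identity for positive z."""
    return max(0.0, z)


def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), gelu(z))
```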

For many modern neural networks, the forward pass is really:

linear transform -> activation -> linear transform -> activation -> ...

Step 2: From One Neuron to One Layer

A layer is just many neurons operating on the same input.

Suppose the input has 3 features:

x = [x1, x2, x3]

And the layer has 4 neurons. Each neuron has its own weights and bias.

That means:

neuron 1 computes its own weighted sum
neuron 2 computes a different weighted sum
neuron 3 computes another one
neuron 4 does the same

The outputs are stacked into a vector.

So instead of one scalar output, the layer produces:

z = [z1, z2, z3, z4]

Then:

a = f(z)

where the activation function is applied elementwise.
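Before moving to matrices, it can help to write the layer as an explicit loop over neurons. This is a sketch in plain Python; layer_forward is a hypothetical helper and the weights are chosen so the results are easy to check by eye.

```python
def layer_forward(x, weight_rows, biases):
    """Compute z_j = sum_i(w_ji * x_i) + b_j for each neuron j, then apply ReLU."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b
         for row, b in zip(weight_rows, biases)]
    a = [max(0.0, zj) for zj in z]   # activation applied elementwise
    return z, a


x = [1.0, 2.0, 3.0]                  # 3 input features
weight_rows = [                      # 4 neurons, 3 weights each
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, -1.0],
    [1.0, 1.0, 1.0],
]
biases = [0.0, 0.5, 0.0, -1.0]

z, a = layer_forward(x, weight_rows, biases)
print("z =", z)
print("a =", a)
```

Every neuron sees the same x; only its row of weights and its bias differ.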

The Matrix Form

Writing each neuron separately gets tedious fast. Matrix notation is cleaner and more powerful.

For a dense layer:

z = Wx + b

Where:

x has shape (m,)
W has shape (n, m)
b has shape (n,)
z has shape (n,)

If the input has m features and the layer has n neurons, then the layer contains:

one row of weights per neuron
one bias per neuron

Then the activation gives:

a = f(z)

So the full forward pass of the layer is:

a = f(Wx + b)

This one expression captures an entire layer of neurons.
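One way to convince yourself that the matrix form is the same computation is to compare it against the per-neuron version. The sketch below uses NumPy with random illustrative weights; dense_forward is a made-up helper name.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def dense_forward(W, x, b, f):
    """a = f(Wx + b) for a single input vector x."""
    return f(W @ x + b)


m, n = 3, 4                          # m input features, n neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(n, m))          # one row of weights per neuron
b = rng.normal(size=n)               # one bias per neuron
x = rng.normal(size=m)

a = dense_forward(W, x, b, relu)

# Row j of W together with b[j] is neuron j from the scalar equation.
per_neuron = np.array([max(0.0, W[j] @ x + b[j]) for j in range(n)])

print(a.shape, np.allclose(a, per_neuron))
```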

A Concrete Matrix Example

Let:

x = [2, -1]

W = [[0.5, 0.2],
     [-0.3, 0.8],
     [1.0, -0.5]]

b = [0.1, -0.2, 0.0]

Then:

Wx =
[
  (0.5 * 2) + (0.2 * -1),
  (-0.3 * 2) + (0.8 * -1),
  (1.0 * 2) + (-0.5 * -1)
]

= [0.8, -1.4, 2.5]

Add the bias:

z = [0.9, -1.6, 2.5]

Apply ReLU:

a = [0.9, 0, 2.5]

So this 3-neuron layer converts a 2-dimensional input into a 3-dimensional output.
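The same numbers can be reproduced in a few lines of NumPy. This is a sketch for checking the hand calculation, with NumPy assumed only for convenience.

```python
import numpy as np

x = np.array([2.0, -1.0])
W = np.array([[0.5, 0.2],
              [-0.3, 0.8],
              [1.0, -0.5]])
b = np.array([0.1, -0.2, 0.0])

Wx = W @ x                 # matrix-vector product, shape (3,)
z = Wx + b                 # pre-activations
a = np.maximum(0.0, z)     # ReLU applied elementwise

print("Wx =", Wx)
print("z  =", z)
print("a  =", a)
```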

Why Matrix Form Matters

The matrix form is not just mathematically elegant. It is how real neural networks are implemented efficiently.

Reasons it matters:

GPUs are extremely good at matrix multiplication
whole layers can be computed in parallel
frameworks like PyTorch and TensorFlow are built around tensor operations
the same formula scales cleanly from tiny demos to giant models

This is one of the key transitions from “I understand the idea” to “I understand how this is actually implemented.”

Adding a Batch Dimension

In real training, we usually process many examples at once.

Instead of one input vector, we have a batch:

X with shape (batch_size, input_dim)

Then the layer computation becomes:

Z = XW^T + b

or, when each example is stored as a column, so that X has shape (input_dim, batch_size):

Z = WX + b

The exact formula changes based on whether examples are row vectors or column vectors (in both cases b is broadcast across the batch), but the idea does not:

each example is run through the same weights
the outputs are computed together in one batched matrix operation

For example, if:

batch size = 32
input dimension = 128
hidden size = 256

then:

X might be (32, 128)
W might be (256, 128)
output Z becomes (32, 256)

That is one forward pass through one dense layer for 32 examples at once.
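Here is a sketch of those shapes in NumPy, with random data standing in for real inputs. It also checks that the row and column conventions give the same numbers, just transposed.

```python
import numpy as np

batch_size, input_dim, hidden_size = 32, 128, 256
rng = np.random.default_rng(0)

X = rng.normal(size=(batch_size, input_dim))
W = rng.normal(size=(hidden_size, input_dim))
b = rng.normal(size=hidden_size)

# Row convention: each example is a row of X.
Z = X @ W.T + b                  # b broadcasts across the 32 rows

# Column convention: each example is a column, so the input is X.T.
Z_cols = W @ X.T + b[:, None]    # shape (256, 32)

# Row 0 of Z is what the single-example formula gives for X[0].
z0 = W @ X[0] + b

print(Z.shape, Z_cols.shape)
print(np.allclose(Z, Z_cols.T), np.allclose(Z[0], z0))
```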

Stacking Layers

Once one layer produces an output, the next layer treats that output as its input.

For a 2-layer network:

a1 = f(W1x + b1)

a2 = g(W2a1 + b2)

This is how representation learning happens:

the first layer extracts simple patterns
later layers transform those patterns into more useful abstractions

Even very large models are still built from this repeating idea.
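The repeating idea fits in a short loop. In this sketch a layer is represented as a (W, b, activation) tuple, which is a convention invented here for illustration, and the weights are small hand-picked numbers.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def identity(z):
    return z


def forward(x, layers):
    """Run x through a list of (W, b, activation) layers in order."""
    a = x
    for W, b, f in layers:
        a = f(W @ a + b)   # this layer's output is the next layer's input
    return a


layers = [
    # W1, b1, f
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -1.0]), relu),
    # W2, b2, g
    (np.array([[2.0, 1.0]]), np.array([0.5]), identity),
]

y = forward(np.array([3.0, 1.0]), layers)
print(y)
```

Adding a third layer means appending one more tuple; the loop itself does not change.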

Forward Pass in a Tiny Neural Network

Imagine:

input dimension = 3
hidden layer = 4 neurons
output layer = 2 neurons

Then the forward pass looks like:

x (3,)
 -> z1 = W1x + b1      # (4,)
 -> a1 = ReLU(z1)      # (4,)
 -> z2 = W2a1 + b2     # (2,)
 -> y_hat = softmax(z2) # (2,), probabilities that sum to 1

If this is a classifier, the final output might be:

y_hat = [0.91, 0.09]

meaning the model currently believes class 1 is much more likely than class 2.
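A sketch of this tiny network in NumPy might look like the following. The weights are random and untrained, so the probabilities it prints are meaningless; the point is only the sequence of shapes. The softmax subtracts the maximum before exponentiating, a common trick to avoid overflow.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer: 4 -> 2

x = np.array([1.5, -2.0, 0.8])    # (3,)
z1 = W1 @ x + b1                  # (4,)
a1 = np.maximum(0.0, z1)          # (4,)
z2 = W2 @ a1 + b2                 # (2,)
y_hat = softmax(z2)               # (2,)

print("y_hat =", y_hat, "sum =", y_hat.sum())
```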

A Minimal PyTorch Example

import torch
import torch.nn.functional as F

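x = torch.tensor([[1.5, -2.0, 0.8]])  # batch of 1, shape (1, 3)

# Two neurons, one row of weights each.
W1 = torch.tensor([
    [0.4, -0.5, 0.3],
    [0.1,  0.2, -0.4]
], dtype=torch.float32)  # shape (2, 3)

b1 = torch.tensor([0.1, -0.2], dtype=torch.float32)  # one bias per neuron

z1 = x @ W1.T + b1  # pre-activations, shape (1, 2)
a1 = F.relu(z1)     # post-activations, shape (1, 2)

print("z1:", z1)
print("a1:", a1)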

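This performs the same kind of forward pass we computed by hand, just in batched tensor form. The first neuron reuses the weights and bias from the single-neuron worked example, so its pre-activation comes out to 1.94 (up to float32 rounding).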

Common Mistakes When Learning This

Mixing up scalar and matrix notation

The neuron equation and the matrix equation describe the same idea at different scales.

Forgetting the activation function

The weighted sum alone is not enough for a deep network to be expressive.

Ignoring tensor shapes

A lot of confusion in neural network code is really shape confusion. It is worth checking shapes constantly.
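A typical shape bug is forgetting the transpose in the batched formula. The sketch below, in NumPy with placeholder data, shows the error being raised and then the corrected computation with an explicit shape check.

```python
import numpy as np

X = np.ones((32, 128))     # (batch_size, input_dim)
W = np.ones((256, 128))    # (hidden_size, input_dim)
b = np.zeros(256)

try:
    X @ W                  # (32, 128) @ (256, 128): inner dimensions 128 and 256 differ
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print("shape error:", err)

Z = X @ W.T + b            # (32, 128) @ (128, 256) -> (32, 256)
assert Z.shape == (32, 256), Z.shape
print("Z shape:", Z.shape)
```

Cheap assertions like the one on Z.shape catch most of these mistakes at the line where they happen.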

Thinking the forward pass is only for inference

Training also depends on the forward pass. You cannot compute loss or gradients without first producing predictions.

Summary

The forward pass is the process of turning inputs into outputs using the model’s current parameters.
A single neuron computes a weighted sum plus bias, then applies an activation.
A dense layer is many neurons applied in parallel.
Matrix form compresses the layer computation into a = f(Wx + b).
Batched matrix operations make neural networks fast on GPUs.
Every large neural network is still built from repeated forward-pass blocks like these.

Once matrix form feels natural, a lot of neural-network architecture becomes much easier to read and implement.
