The forward pass is the part of a neural network that actually produces a prediction. You feed inputs into the model, the model applies a sequence of mathematical operations, and an output comes out the other side.

That sounds trivial, but it is one of the most important ideas in deep learning because everything else depends on it:

  • the loss compares the forward-pass output to the target
  • backpropagation differentiates through the forward pass
  • training repeats the forward pass and adjusts the weights so that its output improves

The easiest way to understand the forward pass is to start with a single neuron and then scale it up into a full layer written in matrix form.

Figure: Forward pass neuron anatomy. This follows the same teaching sequence used in the chapter 1 PDF: inputs feed into a weighted sum, the bias shifts the pre-activation, and the activation turns that value into the final output.

What the Forward Pass Means

A forward pass is simply:

input -> transformations -> output

For a neural network, those transformations are usually:

  • weighted sums
  • bias additions
  • activation functions
  • repeated layer by layer

You can think of the forward pass as the model saying:

Given these current weights, this is my current answer.

Step 1: A Single Artificial Neuron

One neuron takes several inputs, multiplies each by a weight, adds them together, adds a bias, and then usually applies a nonlinear activation function.

The core equation is:

z = w1x1 + w2x2 + ... + wnxn + b

Then:

a = f(z)

Where:

  • x1, x2, ..., xn are inputs
  • w1, w2, ..., wn are weights
  • b is the bias
  • z is the pre-activation
  • f is the activation function
  • a is the final output of the neuron

This is the smallest useful forward pass.

Figure: Forward pass flow and matrix scaling. The PDF’s key move is to show that the left-to-right flow stays the same even when we stop thinking about one neuron and start thinking in vectors and matrices.

A Worked Numerical Example

Suppose:

  • x1 = 1.5
  • x2 = -2.0
  • x3 = 0.8
  • w1 = 0.4
  • w2 = -0.5
  • w3 = 0.3
  • b = 0.1

First compute the weighted sum:

z = (1.5 * 0.4) + (-2.0 * -0.5) + (0.8 * 0.3) + 0.1

z = 0.6 + 1.0 + 0.24 + 0.1 = 1.94

Now apply ReLU:

a = max(0, 1.94) = 1.94

So the forward pass output of this neuron is 1.94.
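
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the helper names `relu` and `neuron_forward` are made up for illustration, and a real project would normally use a tensor library instead of Python lists.

```python
def relu(z):
    """ReLU activation: negative values become 0, positive values pass through."""
    return max(0.0, z)


def neuron_forward(inputs, weights, bias, activation=relu):
    """Return (z, a): the pre-activation and the activated output of one neuron."""
    # Weighted sum of the inputs plus the bias gives the pre-activation z.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation turns z into the neuron's output a.
    return z, activation(z)


# Values from the worked example.
x = [1.5, -2.0, 0.8]
w = [0.4, -0.5, 0.3]
b = 0.1

z, a = neuron_forward(x, w, b)
print(f"z = {z:.2f}, a = {a:.2f}")  # z = 1.94, a = 1.94
```

Returning both z and a keeps the pre-activation available for inspection, which is handy when debugging.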

Why We Need the Bias

Without the bias term, the pre-activation z would always be 0 whenever all inputs are 0, so the neuron's output would be stuck at f(0) no matter what the weights are.

The bias gives the neuron a learnable offset. It lets the model shift thresholds and decision boundaries instead of forcing every computation to pass through the origin.

In practice, the bias is small in notation and extremely important in behavior.
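
To see the threshold shift concretely, here is a small sketch with a one-input neuron. The `step` activation (output 1 when z is positive, otherwise 0) and the chosen weight and bias values are assumptions made for this illustration only.

```python
def step(z):
    """Hypothetical threshold activation: 1 when z is positive, else 0."""
    return 1 if z > 0 else 0


def neuron(x, w, b):
    """One-input neuron: weighted input plus bias, then the step activation."""
    return step(w * x + b)


w = 1.0
inputs = [0.0, 1.0, 2.0, 3.0]

# Without a bias the neuron switches on as soon as x is above 0.
no_bias = [neuron(x, w, 0.0) for x in inputs]
# A bias of -1.5 moves the switching point from x = 0 to x = 1.5.
with_bias = [neuron(x, w, -1.5) for x in inputs]

print(no_bias)    # [0, 1, 1, 1]
print(with_bias)  # [0, 0, 1, 1]
```

The weight is the same in both runs; only the bias moves the point at which the neuron turns on.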

Pre-Activation vs Post-Activation

This distinction matters:

  • z is the pre-activation
  • a = f(z) is the post-activation

Why keep them separate?

Because:

  • the weighted sum carries the linear combination
  • the activation function introduces nonlinearity

Without that nonlinearity, stacking many layers would still collapse into one big linear transformation.

Why Activations Matter in the Forward Pass

If every layer were just:

z = Wx + b

then even many stacked layers would still be equivalent to a single linear map.

Activation functions break that limitation.
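
The collapse can be verified numerically. The sketch below uses made-up 2x2 weights and plain-Python helpers (`matvec`, `matmul`, `add`, and `relu` are illustrative names, not library functions). It shows that two linear layers with no activation equal one layer with W = W2 W1 and b = W2 b1 + b2, and that inserting ReLU breaks the equality.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Elementwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, a) for a in v]


# Made-up parameters for illustration.
W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[2.0, 1.0], [-1.0, 3.0]], [0.5, 0.0]
x = [1.0, 2.0]

# Two linear layers with no activation in between.
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same computation folded into a single linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)

# With ReLU between the layers, the result no longer matches.
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print(stacked)    # [2.0, 11.5]
print(collapsed)  # [2.0, 11.5]
print(with_relu)  # [4.0, 10.5]
```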

Common examples:

  • Sigmoid maps values into (0, 1)
  • Tanh maps values into (-1, 1)
  • ReLU maps negative values to 0 and leaves positive values unchanged
  • GELU smoothly gates inputs and is widely used in transformers
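
For reference, these four activations can be written as scalar functions using only Python's standard `math` module. This is a sketch for intuition. The GELU shown is the exact form z * Phi(z), with Phi the standard normal CDF; libraries may also offer a tanh-based approximation.

```python
import math


def sigmoid(z):
    """Squash z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squash z into the interval (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Zero out negative values, pass positive values through."""
    return max(0.0, z)


def gelu(z):
    """Exact GELU: z times the standard normal CDF evaluated at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


for z in (-2.0, 0.0, 2.0):
    print(
        f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={tanh(z):+.3f}  "
        f"relu={relu(z):.1f}  gelu={gelu(z):+.3f}"
    )
```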

For many modern neural networks, the forward pass is really:

linear transform -> activation -> linear transform -> activation -> ...

Step 2: From One Neuron to One Layer

A layer is just many neurons operating on the same input.

Suppose the input has 3 features:

x = [x1, x2, x3]

And the layer has 4 neurons. Each neuron has its own weights and bias.

That means:

  • neuron 1 computes its own weighted sum
  • neuron 2 computes a different weighted sum
  • neuron 3 computes another one
  • neuron 4 does the same

The outputs are stacked into a vector.

So instead of one scalar output, the layer produces:

z = [z1, z2, z3, z4]

Then:

a = f(z)

where the activation function is applied elementwise.
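
Before moving to matrix notation, it can help to see the layer written neuron by neuron. The sketch below uses 3 input features and 4 neurons. The weights and biases are made-up values, and `neurons` is just an illustrative list of (weights, bias) pairs.

```python
def relu(z):
    """ReLU for a single value."""
    return max(0.0, z)


x = [1.0, 2.0, 3.0]

# One (weights, bias) pair per neuron; values are made up for illustration.
neurons = [
    ([0.5, 0.0, 0.0], 0.0),
    ([0.0, -1.0, 0.0], 0.5),
    ([1.0, 1.0, 1.0], -2.0),
    ([0.0, 0.0, -0.5], 1.0),
]

# Every neuron sees the same input x but uses its own weights and bias.
z = [sum(w * xi for w, xi in zip(weights, x)) + bias for weights, bias in neurons]
# The activation is applied to each entry of z independently.
a = [relu(zi) for zi in z]

print(z)  # [0.5, -1.5, 4.0, -0.5]
print(a)  # [0.5, 0.0, 4.0, 0.0]
```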

The Matrix Form

Writing each neuron separately gets tedious fast. Matrix notation is cleaner and more powerful.

For a dense layer:

z = Wx + b

Where:

  • x has shape (m,)
  • W has shape (n, m)
  • b has shape (n,)
  • z has shape (n,)

If the input has m features and the layer has n neurons, then the layer contains:

  • one row of weights per neuron
  • one bias per neuron

Then the activation gives:

a = f(z)

So the full forward pass of the layer is:

a = f(Wx + b)

This one expression captures an entire layer of neurons.

A Concrete Matrix Example

Let:

x = [2, -1]

W = [[0.5, 0.2],
     [-0.3, 0.8],
     [1.0, -0.5]]

b = [0.1, -0.2, 0.0]

Then:

Wx =
[
  (0.5 * 2) + (0.2 * -1),
  (-0.3 * 2) + (0.8 * -1),
  (1.0 * 2) + (-0.5 * -1)
]

= [0.8, -1.4, 2.5]

Add the bias:

z = [0.9, -1.6, 2.5]

Apply ReLU:

a = [0.9, 0, 2.5]

So this 3-neuron layer converts a 2-dimensional input into a 3-dimensional output.
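
The same numbers can be reproduced in code. This is a plain-Python sketch, and `dense_forward` is an illustrative helper name. It computes z = Wx + b row by row and then applies ReLU, with two assertions that catch mismatched shapes early.

```python
def dense_forward(W, x, b):
    """Compute z = Wx + b and a = ReLU(z) for one input vector."""
    # W needs one row per neuron and one column per input feature.
    assert all(len(row) == len(x) for row in W), "each row of W must match len(x)"
    assert len(W) == len(b), "need one bias per neuron"
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    a = [max(0.0, zi) for zi in z]
    return z, a


# Values from the example above.
x = [2.0, -1.0]
W = [[0.5, 0.2], [-0.3, 0.8], [1.0, -0.5]]
b = [0.1, -0.2, 0.0]

z, a = dense_forward(W, x, b)
# Rounding hides tiny floating-point differences in the printed values.
print([round(v, 2) for v in z])  # [0.9, -1.6, 2.5]
print([round(v, 2) for v in a])  # [0.9, 0.0, 2.5]
```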

Why Matrix Form Matters

The matrix form is not just mathematically elegant. It is how real neural networks are implemented efficiently.

Reasons it matters:

  • GPUs are extremely good at matrix multiplication
  • whole layers can be computed in parallel
  • frameworks like PyTorch and TensorFlow are built around tensor operations
  • the same formula scales cleanly from tiny demos to giant models

This is one of the key transitions from “I understand the idea” to “I understand how this is actually implemented.”

Adding a Batch Dimension

In real training, we usually process many examples at once.

Instead of one input vector, we have a batch:

X with shape (batch_size, input_dim)

Then the layer computation becomes:

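Z = XW^T + b

That is the row-vector convention: each row of X is one example, and b is added to every row. If examples are stored as columns instead, so that X has shape (input_dim, batch_size), the same computation is written:

Z = WX + b

with b added to every column, and Z comes out as the transpose of the row-convention result. The exact formula changes based on whether examples are row vectors or column vectors, but the idea does not: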

  • each example is run through the same weights
  • the outputs are computed together in one batched matrix operation

For example, if:

  • batch size = 32
  • input dimension = 128
  • hidden size = 256

then:

  • X might be (32, 128)
  • W might be (256, 128)
  • output Z becomes (32, 256)

That is one forward pass through one dense layer for 32 examples at once.
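
The shapes in this example can be checked with a short sketch. It uses plain Python lists and random made-up values, following the row-vector convention Z = XW^T + b. The names `batched_dense` and `shape` are illustrative helpers. In practice a tensor library would do this in one optimized matrix multiplication.

```python
import random


def batched_dense(X, W, b):
    """Row-vector convention: Z = X W^T + b, with b added to every row."""
    return [
        [sum(w * xi for w, xi in zip(w_row, x_row)) + bi for w_row, bi in zip(W, b)]
        for x_row in X
    ]


def shape(M):
    """Shape of a rectangular list-of-lists matrix."""
    return (len(M), len(M[0]))


random.seed(0)
batch_size, input_dim, hidden_size = 32, 128, 256

# Random inputs and weights, purely to exercise the shapes.
X = [[random.gauss(0, 1) for _ in range(input_dim)] for _ in range(batch_size)]
W = [[random.gauss(0, 0.1) for _ in range(input_dim)] for _ in range(hidden_size)]
b = [0.0] * hidden_size

Z = batched_dense(X, W, b)
print(shape(X), shape(W), shape(Z))  # (32, 128) (256, 128) (32, 256)
```

Each row of Z is exactly what the single-example formula would produce for the matching row of X, because every example passes through the same W and b.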

Stacking Layers

Once one layer produces an output, the next layer treats that output as its input.

For a 2-layer network:

a1 = f(W1x + b1)

a2 = g(W2a1 + b2)

This layering is what makes representation learning possible:

  • early layers tend to extract simple patterns
  • later layers combine those patterns into more useful abstractions

Even very large models are still built from this repeating idea.

Forward Pass in a Tiny Neural Network

Imagine:

  • input dimension = 3
  • hidden layer = 4 neurons
  • output layer = 2 neurons

Then the forward pass looks like:

x (3,)
 -> z1 = W1x + b1      # (4,)
 -> a1 = ReLU(z1)      # (4,)
 -> z2 = W2a1 + b2     # (2,)
 -> y_hat = softmax(z2)  # (2,)

If this is a classifier, the final output might be:

y_hat = [0.91, 0.09]

meaning the model currently believes class 1 is much more likely than class 2.
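
The tiny network above can be sketched end to end in plain Python. All parameter values here are made up, so the printed probabilities will not match the [0.91, 0.09] example. The point is the sequence of shapes and the fact that softmax outputs sum to 1. The helper names `dense`, `relu`, and `softmax` are illustrative.

```python
import math


def dense(W, x, b):
    """z = Wx + b for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, vi) for vi in v]


def softmax(v):
    """Turn scores into probabilities; subtracting the max avoids overflow."""
    m = max(v)
    exps = [math.exp(vi - m) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up parameters: W1 is (4, 3), W2 is (2, 4).
W1 = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5], [-0.6, 0.8, 0.1], [0.05, -0.3, 0.9]]
b1 = [0.0, 0.1, -0.1, 0.2]
W2 = [[0.5, -0.4, 0.3, 0.8], [-0.2, 0.6, -0.7, 0.1]]
b2 = [0.1, -0.1]

x = [1.0, 0.5, -1.5]     # (3,)
z1 = dense(W1, x, b1)    # (4,)
a1 = relu(z1)            # (4,)
z2 = dense(W2, a1, b2)   # (2,)
y_hat = softmax(z2)      # (2,)

print(len(z1), len(a1), len(z2), len(y_hat))  # 4 4 2 2
print("y_hat:", y_hat, "sum:", sum(y_hat))
```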

A Minimal PyTorch Example

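import torch
import torch.nn.functional as F

# One example with 3 features; the leading dimension is the batch.
x = torch.tensor([[1.5, -2.0, 0.8]])  # shape (1, 3)

# Two neurons, one row of weights each.
W1 = torch.tensor([
    [0.4, -0.5, 0.3],
    [0.1,  0.2, -0.4]
], dtype=torch.float32)  # shape (2, 3)

b1 = torch.tensor([0.1, -0.2], dtype=torch.float32)  # one bias per neuron

z1 = x @ W1.T + b1  # pre-activations, shape (1, 2)
a1 = F.relu(z1)     # ReLU zeroes out the negative entry

print("z1:", z1)  # approximately [[1.94, -0.77]]
print("a1:", a1)  # approximately [[1.94, 0.00]]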

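This performs the same kind of forward pass we computed by hand, just in batched tensor form. The first row of W1 and the first entry of b1 are the neuron from the worked example, so the first entry of z1 is 1.94.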

Common Mistakes When Learning This

Mixing up scalar and matrix notation

The neuron equation and the matrix equation describe the same idea at different scales.

Forgetting the activation function

The weighted sum alone is not enough for a deep network to be expressive.

Ignoring tensor shapes

A lot of confusion in neural network code is really shape confusion. It is worth checking shapes constantly.
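
One way to build the habit is a small helper that states the shape contract of a dense layer explicitly. The function below is a hypothetical sketch, assuming the row-vector convention Z = XW^T + b, and the name `dense_output_shape` is made up. It works on shape tuples only, so it can run before any real computation.

```python
def dense_output_shape(X_shape, W_shape, b_shape):
    """Return the output shape of Z = X W^T + b, or raise if shapes disagree."""
    batch_size, input_dim = X_shape
    n_neurons, w_input_dim = W_shape
    if input_dim != w_input_dim:
        raise ValueError(f"X has {input_dim} features but W expects {w_input_dim}")
    if b_shape != (n_neurons,):
        raise ValueError(f"b has shape {b_shape}, expected {(n_neurons,)}")
    return (batch_size, n_neurons)


print(dense_output_shape((32, 128), (256, 128), (256,)))  # (32, 256)

try:
    # W accidentally stored as (input_dim, n_neurons) instead of (n_neurons, input_dim).
    dense_output_shape((32, 128), (128, 256), (256,))
except ValueError as err:
    print("shape error:", err)  # shape error: X has 128 features but W expects 256
```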

Thinking the forward pass is only for inference

Training also depends on the forward pass. You cannot compute loss or gradients without first producing predictions.

Summary

  • The forward pass is the process of turning inputs into outputs using the model’s current parameters.
  • A single neuron computes a weighted sum plus bias, then applies an activation.
  • A dense layer is many neurons applied in parallel.
  • Matrix form compresses the layer computation into a = f(Wx + b).
  • Batched matrix operations make neural networks fast on GPUs.
  • Every large neural network is still built from repeated forward-pass blocks like these.

Once matrix form feels natural, a lot of neural-network architecture becomes much easier to read and implement.