Backpropagation is the core algorithm that makes neural networks trainable. The forward pass tells the model what prediction it currently makes. Backpropagation tells the model how each weight contributed to the error so the optimizer can update those weights in the right direction.
People often hear that backpropagation is “just the chain rule,” which is true but not especially helpful. The useful mental model is this:
- the forward pass computes values
- the backward pass computes sensitivities
- each node only needs its own local derivative
- the full gradient is built by multiplying those local derivatives along the path
If that sounds abstract, it becomes much clearer once you look at one neuron first and then scale up.
The chapter 1 PDF frames backpropagation in exactly the right order: start with the chain rule, then show how that rule becomes a graph algorithm.
The High-Level Idea
Training a neural network repeats the same loop:
- Run a forward pass and compute a prediction.
- Compare that prediction to the true answer using a loss function.
- Run a backward pass to compute gradients.
- Update weights using gradient descent or one of its variants.
Backpropagation is step 3.
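The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a real framework: the one-weight model `pred = w * x`, the squared-error loss, the function name `train`, and the learning rate are all assumptions chosen to keep the arithmetic visible.

```python
def train(w, x, y, lr=0.1, steps=20):
    """Fit a hypothetical one-weight model pred = w * x with squared error."""
    for _ in range(steps):
        pred = w * x                   # step 1: forward pass
        loss = (pred - y) ** 2         # step 2: compare prediction to the target
        grad_w = 2 * (pred - y) * x    # step 3: backward pass (chain rule by hand)
        w = w - lr * grad_w            # step 4: gradient descent update
    return w

w_final = train(w=0.0, x=2.0, y=4.0)
print(w_final)  # approaches 2.0, because 2.0 * 2.0 equals the target 4.0
```

In a real network, step 3 is the part that backpropagation automates.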
When we say “gradient,” we mean:
How much would the loss change if I nudged this parameter a little?
If increasing a weight increases the loss, we want to push that weight down. If increasing a weight decreases the loss, we want to push it up.
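One way to make the "nudge" concrete is a finite-difference estimate: change the weight slightly in both directions and measure how the loss responds. The toy loss `loss_fn` and the helper `numerical_gradient` below are illustrative names, not part of any library, and the values of `x` and `y` are arbitrary.

```python
def loss_fn(w, x=2.0, y=4.0):
    # Hypothetical toy loss: squared error of the prediction w * x.
    return (w * x - y) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # Central difference: nudge w both ways and measure the change in loss.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

grad_up = numerical_gradient(loss_fn, 3.0)    # prediction 6 is above the target
grad_down = numerical_gradient(loss_fn, 1.0)  # prediction 2 is below the target

print(grad_up)    # positive (about 8): increasing w raises the loss, so push w down
print(grad_down)  # negative (about -8): increasing w lowers the loss, so push w up
```

Backpropagation computes the same quantity analytically, without rerunning the forward pass for every parameter.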
A Single Neuron: The Smallest Useful Example
Consider one neuron:
z = wx + b
a = ReLU(z)
L = loss(a, y)
Where:
- x is the input
- w is the weight
- b is the bias
- z is the pre-activation
- a is the output after activation
- L is the loss
The forward pass flows left to right:
x, w, b
-> z = wx + b
-> a = ReLU(z)
-> L
The backward pass flows right to left:
L
-> dL/da
-> dL/dz
-> dL/dw, dL/db, dL/dx
This reverse flow is why it is called backpropagation.
Why the Chain Rule Is the Whole Story
Suppose L depends on a, a depends on z, and z depends on w.
Then:
dL/dw = (dL/da) * (da/dz) * (dz/dw)
That is the chain rule.
This matters because large neural networks are just many small operations glued together:
- multiply
- add
- activation
- matrix multiply
- normalization
Each operation knows how to compute its own local derivative. Backpropagation stitches those pieces together.
A single neuron already contains the full backprop pattern: forward values move left to right, gradients move right to left, and the multiply node routes gradients differently to each input.
A Concrete Numerical Example
Take:
x = 2
w = 3
b = 1
Forward pass:
z = wx + b = 3*2 + 1 = 7
a = ReLU(7) = 7
Suppose the target is y = 4, and we use squared error:
L = (a - y)^2 = (7 - 4)^2 = 9
Now let us go backward.
Step 1: Derivative of the loss with respect to the output
L = (a - y)^2
So:
dL/da = 2(a - y) = 2(7 - 4) = 6
This means: if a increases a little, the loss increases about 6 times that amount.
Step 2: Derivative through ReLU
a = ReLU(z)
For ReLU:
- derivative is 1 if z > 0
- derivative is 0 if z < 0
At exactly z = 0 the derivative is undefined; implementations usually pick 0 by convention.
Here z = 7, so:
da/dz = 1
Then:
dL/dz = (dL/da) * (da/dz) = 6 * 1 = 6
Step 3: Derivative with respect to the weight
z = wx + b
So:
dz/dw = x = 2
Then:
dL/dw = (dL/dz) * (dz/dw) = 6 * 2 = 12
Step 4: Derivative with respect to the bias
dz/db = 1
So:
dL/db = (dL/dz) * (dz/db) = 6
Step 5: Derivative with respect to the input
dz/dx = w = 3
So:
dL/dx = (dL/dz) * (dz/dx) = 6 * 3 = 18
This last one is not used to update the data input, but it becomes important when one layer’s output is another layer’s input.
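The five steps can be collected into one small function as a check on the arithmetic. This is a sketch in plain Python; the function name `neuron_forward_backward` and the choice to treat the ReLU derivative at exactly z = 0 as 0 are assumptions made here.

```python
def relu(z):
    return max(0.0, z)

def neuron_forward_backward(x, w, b, y):
    # Forward pass: keep the intermediates, the backward pass reuses them.
    z = w * x + b
    a = relu(z)
    loss = (a - y) ** 2

    # Backward pass: walk the same chain in reverse.
    dL_da = 2 * (a - y)              # step 1
    da_dz = 1.0 if z > 0 else 0.0    # step 2 (assumed convention: 0 at z == 0)
    dL_dz = dL_da * da_dz
    dL_dw = dL_dz * x                # step 3: dz/dw = x
    dL_db = dL_dz * 1.0              # step 4: dz/db = 1
    dL_dx = dL_dz * w                # step 5: dz/dx = w
    return loss, dL_dw, dL_db, dL_dx

print(neuron_forward_backward(x=2.0, w=3.0, b=1.0, y=4.0))
# (9.0, 12.0, 6.0, 18.0)
```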
Why the Multiply Node “Swaps” Gradients
For a multiplication m = xw:
dm/dx = w
dm/dw = x
That is why during backpropagation the multiply node seems to “swap” the values:
- gradient sent to x is scaled by w
- gradient sent to w is scaled by x
This pattern appears constantly in neural networks.
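As a small sketch of that swap, the backward rule for a multiply node fits in one function. The name `mul_backward` and the `upstream` argument are illustrative; the numbers reuse the example above, where the gradient arriving at the multiply node is 6.

```python
def mul_backward(x, w, upstream):
    # Each input's gradient is the upstream gradient scaled by the *other* input.
    grad_x = upstream * w
    grad_w = upstream * x
    return grad_x, grad_w

grad_x, grad_w = mul_backward(x=2.0, w=3.0, upstream=6.0)
print(grad_x, grad_w)  # 18.0 12.0
```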
Computation Graphs Make Backpropagation Much Easier to Understand
A neural network can be viewed as a computation graph. Every node is an operation, and every edge carries a value.
For our neuron:
x ----\
       * ----\
w ----/       \
               + ---- z ---- ReLU ---- a ---- loss ---- L
b ------------/
In the forward pass, you compute values at each node.
In the backward pass, you compute:
- the upstream gradient coming into a node
- the local derivative of that node
- the downstream gradients sent to its inputs
That is the entire algorithm.
General Backpropagation Pattern
Every node follows the same recipe:
- Receive gradient from the node above.
- Multiply by the node’s local derivative.
- Pass the result to the node’s inputs.
In equation form:
upstream gradient * local derivative = downstream gradient
For an addition node:
z = u + v
The derivatives are:
dz/du = 1
dz/dv = 1
So the upstream gradient gets copied to both inputs.
For a multiplication node:
z = uv
The derivatives are:
dz/du = v
dz/dv = u
So the upstream gradient gets scaled differently for each input.
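The recipe can be sketched as tiny node classes, each knowing only its own local derivative. The class names `Add`, `Mul`, and `ReLU` and their `forward`/`backward` methods are hypothetical, written for this illustration and not taken from any library. Wiring them together reproduces the single-neuron gradients.

```python
class Add:
    def forward(self, u, v):
        return u + v
    def backward(self, upstream):
        # Local derivatives are both 1, so the gradient is copied to each input.
        return upstream, upstream

class Mul:
    def forward(self, u, v):
        self.u, self.v = u, v          # store inputs for the backward pass
        return u * v
    def backward(self, upstream):
        # Each input is scaled by the other input's value.
        return upstream * self.v, upstream * self.u

class ReLU:
    def forward(self, z):
        self.z = z
        return max(0.0, z)
    def backward(self, upstream):
        return upstream * (1.0 if self.z > 0 else 0.0)

mul, add, relu = Mul(), Add(), ReLU()
x, w, b, y = 2.0, 3.0, 1.0, 4.0

# Forward pass through the graph: (x * w) + b, then ReLU.
a = relu.forward(add.forward(mul.forward(x, w), b))

# Backward pass: start from dL/da for squared error and relay it node by node.
dL_da = 2 * (a - y)
dL_dz = relu.backward(dL_da)
dL_dm, dL_db = add.backward(dL_dz)
dL_dx, dL_dw = mul.backward(dL_dm)
print(dL_dw, dL_db, dL_dx)  # 12.0 6.0 18.0
```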
Extending to a Layer
A layer is just many neurons processed together.
Instead of:
z = wx + b
we write:
z = Wx + b
Where:
- x is a vector of inputs
- W is a weight matrix
- b is a bias vector
- z is the vector of pre-activations
Backpropagation still follows the same logic, just with vectors and matrices instead of single numbers.
For a dense layer:
- the gradient with respect to W depends on the input activations
- the gradient with respect to b is the accumulated gradient over the batch
- the gradient with respect to x is what gets passed to the previous layer
This is why modern frameworks can train huge networks efficiently. The same chain-rule logic becomes large, optimized matrix math.
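For a single example, the three dense-layer gradients can be written out with plain Python lists. This is a sketch under stated assumptions: the helper names `dense_forward` and `dense_backward` are made up here, the incoming gradient `dL_dz` is an arbitrary example value, and a real framework would express the same math as batched matrix operations.

```python
def dense_forward(W, x, b):
    # z[i] = sum_j W[i][j] * x[j] + b[i]
    return [sum(W_ij * x_j for W_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def dense_backward(W, x, dL_dz):
    # dL/dW[i][j] = dL/dz[i] * x[j]: an outer product with the input activations.
    dL_dW = [[dz_i * x_j for x_j in x] for dz_i in dL_dz]
    # dL/db[i] = dL/dz[i] for one example; over a batch these are accumulated.
    dL_db = list(dL_dz)
    # dL/dx[j] = sum_i W[i][j] * dL/dz[i]: the transpose of W applied to dL/dz.
    dL_dx = [sum(W[i][j] * dL_dz[i] for i in range(len(W)))
             for j in range(len(x))]
    return dL_dW, dL_db, dL_dx

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 outputs, 2 inputs
x = [1.0, -1.0]
b = [0.0, 0.0, 0.0]

z = dense_forward(W, x, b)
dL_dW, dL_db, dL_dx = dense_backward(W, x, dL_dz=[1.0, 0.0, 2.0])
print(z)       # [-1.0, -1.0, -1.0]
print(dL_dx)   # [11.0, 14.0]
```

Note that `dL_dW` has the same shape as `W`, and `dL_dx` has the same shape as `x`, which is a useful sanity check when deriving these by hand.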
Backpropagation Through Multiple Layers
Imagine a 2-layer network:
x
-> z1 = W1x + b1
-> a1 = ReLU(z1)
-> z2 = W2a1 + b2
-> y_hat
-> L
Backpropagation works from the end toward the start:
- Compute dL/dy_hat
- Push it through the output layer to get gradients for W2 and b2
- Compute the gradient with respect to a1
- Push that through ReLU to get dL/dz1
- Use that to compute gradients for W1 and b1
Earlier layers do not directly see the loss. They receive a gradient signal passed backward through later layers.
Once the graph spans multiple layers, the logic does not change. The loss gradient simply gets relayed backward, one local derivative at a time, until it reaches earlier weights.
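The five backward steps can be traced in a scalar version of this network. The sketch below assumes one unit per layer, squared-error loss, and `y_hat = z2` with no output activation; none of those choices come from the text, and the function name `two_layer_grads` is illustrative.

```python
def relu(z):
    return max(0.0, z)

def two_layer_grads(x, y, w1, b1, w2, b2):
    # Forward pass, storing intermediates.
    z1 = w1 * x + b1
    a1 = relu(z1)
    y_hat = w2 * a1 + b2
    loss = (y_hat - y) ** 2

    # Backward pass, from the loss toward the input.
    dL_dyhat = 2 * (y_hat - y)                   # compute dL/dy_hat
    dL_dw2 = dL_dyhat * a1                       # gradients for the output layer
    dL_db2 = dL_dyhat
    dL_da1 = dL_dyhat * w2                       # gradient with respect to a1
    dL_dz1 = dL_da1 * (1.0 if z1 > 0 else 0.0)   # push through ReLU
    dL_dw1 = dL_dz1 * x                          # gradients for the first layer
    dL_db1 = dL_dz1
    return loss, {"w1": dL_dw1, "b1": dL_db1, "w2": dL_dw2, "b2": dL_db2}

loss, grads = two_layer_grads(x=1.0, y=0.0, w1=2.0, b1=1.0, w2=0.5, b2=0.0)
print(loss)   # 2.25
print(grads)  # {'w1': 1.5, 'b1': 1.5, 'w2': 9.0, 'b2': 3.0}
```

The first-layer gradients never touch the loss formula directly; they are built entirely from `dL_da1`, the signal handed back by the second layer.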
Why Activations Matter for Gradient Flow
Activation functions affect not just the forward pass but also the backward pass.
Sigmoid and Tanh
Sigmoid and tanh can saturate. In their flat regions, the derivative becomes very small. When many such small derivatives are multiplied together across layers, gradients can shrink toward zero.
That is the vanishing gradient problem.
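A rough numerical sketch shows how quickly this shrinkage happens. It deliberately ignores the weights and assumes every layer sits at the same pre-activation, which is a simplification; the helper `chained_gradient` is a made-up name for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z == 0

def chained_gradient(z, depth):
    # Product of `depth` identical local derivatives (weights ignored).
    return sigmoid_grad(z) ** depth

best_case = chained_gradient(0.0, depth=10)   # 0.25 ** 10, roughly 9.5e-07
saturated = chained_gradient(5.0, depth=10)   # many orders of magnitude smaller
print(best_case, saturated)
```

Even in the best case for sigmoid, ten layers scale the gradient by about one millionth.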
ReLU
ReLU avoids this problem for positive activations because its derivative is 1 there. That is one reason ReLU made deep networks much easier to train.
But ReLU has its own issue:
- if a neuron stays on the negative side, its gradient can become 0
- then that neuron may stop learning
This is often called a dead ReLU.
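A small sketch of the effect: with weights that keep the pre-activation negative for every input, the gradient is zero and the update does nothing. The starting weights, inputs, and learning rate below are hypothetical values chosen to trigger the behavior.

```python
def relu_neuron_grad(x, w, b, y):
    z = w * x + b
    a = max(0.0, z)
    dL_da = 2 * (a - y)
    dL_dz = dL_da * (1.0 if z > 0 else 0.0)   # zero whenever z is not positive
    return dL_dz * x, dL_dz                   # dL/dw, dL/db

w, b = -3.0, -1.0            # hypothetical weights that keep z negative
inputs = [0.5, 1.0, 2.0]     # all positive, so z = w * x + b < 0 for each

for x in inputs:
    dL_dw, dL_db = relu_neuron_grad(x, w, b, y=4.0)
    w -= 0.1 * dL_dw         # gradient descent step with zero gradient
    b -= 0.1 * dL_db

print(w, b)  # -3.0 -1.0: the parameters never moved
```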
Why Backpropagation Is Efficient
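A naive approach would measure the effect of each weight separately, for example by nudging one weight at a time and rerunning the forward pass. For a network with millions of weights, that means millions of forward passes for a single update.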
Backpropagation is efficient because it reuses intermediate results.
The forward pass stores values like:
- inputs
- pre-activations
- activations
The backward pass reuses them to compute gradients locally.
This dynamic-programming flavor is what makes deep learning practical.
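The reuse can be seen in a toy chain of multiplications, output = x * w_1 * w_2 * ... * w_n. This is a hypothetical example, not a neural network: the forward pass stores each stage's input, and the backward pass carries one running upstream gradient, so all n gradients cost a single sweep instead of n separate recomputations.

```python
def forward(ws, x):
    # values[i] is the input to stage i; values[-1] is the final output.
    values = [x]
    for w in ws:
        values.append(values[-1] * w)
    return values

def backward(ws, values):
    grads = [0.0] * len(ws)
    upstream = 1.0                          # d(output)/d(output)
    for i in reversed(range(len(ws))):
        grads[i] = upstream * values[i]     # local derivative w.r.t. w_i is the stored input
        upstream = upstream * ws[i]         # gradient relayed to the previous stage
    return grads

ws = [2.0, 3.0, 4.0]
values = forward(ws, x=1.0)
print(values)                # [1.0, 2.0, 6.0, 24.0]
print(backward(ws, values))  # [12.0, 8.0, 6.0]
```

Each gradient equals the product of all the other weights times x, but no product is ever recomputed from scratch.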
How the Optimizer Uses These Gradients
Backpropagation itself does not update the weights. It only computes gradients.
Then an optimizer such as SGD or Adam applies an update like:
w <- w - eta * dL/dw
Where eta is the learning rate.
So the separation is:
- backpropagation computes gradient information
- optimization uses that information to move parameters
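A plain gradient descent step is short enough to write out. The sketch below takes the gradients from the single-neuron example (12 for w, 6 for b) as given; the helper `sgd_step` and the learning rate of 0.1 are assumptions made for this illustration.

```python
def sgd_step(params, grads, lr):
    # Plain gradient descent: move each parameter against its own gradient.
    return {name: value - lr * grads[name] for name, value in params.items()}

params = {"w": 3.0, "b": 1.0}
grads = {"w": 12.0, "b": 6.0}    # computed by backpropagation, not by the optimizer

new_params = sgd_step(params, grads, lr=0.1)

# Rerun the forward pass with the updated parameters.
x, y = 2.0, 4.0
new_loss = (max(0.0, new_params["w"] * x + new_params["b"]) - y) ** 2
print(new_params)  # w is about 1.8, b is about 0.4
print(new_loss)    # close to 0, down from 9
```

That the loss lands almost exactly on zero here is a coincidence of the chosen learning rate; in general, a single step only reduces the loss somewhat.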
A Minimal PyTorch Example
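import torch
import torch.nn as nn

# Same values as the worked example: x = 2, target y = 4
x = torch.tensor([[2.0]])
y = torch.tensor([[4.0]])

linear = nn.Linear(1, 1)
# Set w = 3 and b = 1 without recording these writes in the autograd graph
with torch.no_grad():
    linear.weight[:] = torch.tensor([[3.0]])
    linear.bias[:] = torch.tensor([1.0])

pred = torch.relu(linear(x))       # forward pass: ReLU(3*2 + 1) = 7
loss = (pred - y).pow(2).mean()    # squared error: (7 - 4)^2 = 9
loss.backward()                    # backward pass fills .grad on each parameter

print("prediction:", pred.item())
print("loss:", loss.item())
print("dL/dw:", linear.weight.grad.item())  # 12, matching the manual result
print("dL/db:", linear.bias.grad.item())    # 6, matching the manual result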
Autograd performs the same chain-rule computation we walked through manually.
Common Misunderstandings
“Backpropagation and gradient descent are the same thing”
They are related but different.
- backpropagation computes gradients
- gradient descent uses gradients to update weights
“The model learns by changing all weights equally”
No. Each parameter gets its own gradient, so each parameter gets its own update.
“Backpropagation is specific to neural networks”
Not really. Backpropagation is reverse-mode automatic differentiation applied to a computation graph, and it works for any differentiable computation expressed that way. Neural networks are simply its best-known use case.
Intuition to Keep
If you remember only one thing, make it this:
Backpropagation answers the question:
Which earlier choices caused the final error, and by how much?
That answer is computed by moving backward through the graph, one local derivative at a time.
Summary
- The forward pass computes predictions.
- The backward pass computes gradients of the loss with respect to parameters.
- Backpropagation is just the chain rule applied efficiently over a computation graph.
- Each node uses local derivatives and the upstream gradient.
- The resulting gradients tell the optimizer how to adjust weights.
- The same process scales from a single neuron to networks with billions of parameters, which is what makes deep learning practical.
Once this clicks, a lot of deep learning stops feeling magical. It becomes a structured pipeline of matrix operations, local derivatives, and careful bookkeeping.