Backpropagation Explained Visually: How Neural Networks Actually Learn

Backpropagation is the core algorithm that makes neural networks trainable. The forward pass tells the model what prediction it currently makes. Backpropagation tells the model how each weight contributed to the error so the optimizer can update those weights in the right direction.

People often hear that backpropagation is “just the chain rule,” which is true but not especially helpful. The useful mental model is this:

the forward pass computes values
the backward pass computes sensitivities
each node only needs its own local derivative
the full gradient is built by multiplying those local derivatives along the path

If that sounds abstract, it becomes much clearer once you look at one neuron first and then scale up.

[Figure: Backpropagation chain rule visual]

The chapter 1 PDF frames backpropagation in exactly the right order: start with the chain rule, then show how that rule becomes a graph algorithm.

The High-Level Idea

Training a neural network repeats the same loop:

1. Run a forward pass and compute a prediction.
2. Compare that prediction to the true answer using a loss function.
3. Run a backward pass to compute gradients.
4. Update weights using gradient descent or one of its variants.

Backpropagation is step 3.

When we say “gradient,” we mean:

How much would the loss change if I nudged this parameter a little?

If increasing a weight increases the loss, we want to push that weight down. If increasing a weight decreases the loss, we want to push it up.
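One way to make the "nudge" question concrete is a finite-difference check: move the parameter by a tiny amount in each direction and see how the loss responds. The sketch below assumes a made-up one-parameter loss, `loss(w) = (2w - 4)^2`, chosen only for illustration; the names `loss` and `numerical_gradient` are hypothetical.

```python
def loss(w):
    # Hypothetical toy loss: the prediction is 2*w and the target is 4.
    return (2.0 * w - 4.0) ** 2


def numerical_gradient(f, w, eps=1e-6):
    # Central difference: nudge w a little in both directions and
    # measure how much the loss changes per unit of nudge.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


w = 3.0
grad = numerical_gradient(loss, w)
print("loss:", loss(w))
print("approximate dL/dw:", round(grad, 4))

# The gradient is positive here, so a small step in the negative
# direction should lower the loss.
print("loss after nudging w down:", loss(w - 0.1))
```

Here the estimated gradient is positive, so decreasing `w` reduces the loss, which matches the rule of thumb above.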

A Single Neuron: The Smallest Useful Example

Consider one neuron:

z = wx + b

a = ReLU(z)

L = loss(a, y)

Where:

x is the input
w is the weight
b is the bias
z is the pre-activation
a is the output after activation
L is the loss

The forward pass flows left to right:

x, w, b
  -> z = wx + b
  -> a = ReLU(z)
  -> L

The backward pass flows right to left:

L
  -> dL/da
  -> dL/dz
  -> dL/dw, dL/db, dL/dx

This reverse flow is why it is called backpropagation.

Why the Chain Rule Is the Whole Story

Suppose L depends on a, a depends on z, and z depends on w.

Then:

dL/dw = (dL/da) * (da/dz) * (dz/dw)

That is the chain rule.

This matters because large neural networks are just many small operations glued together:

multiply
add
activation
matrix multiply
normalization

Each operation knows how to compute its own local derivative. Backpropagation stitches those pieces together.
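As a rough sketch of that stitching, the snippet below chains three arbitrary scalar operations, each paired with its own local derivative, and multiplies the local derivatives in reverse order. The operations and the helper name `forward_and_gradient` are assumptions made for illustration, not part of any library.

```python
import math

# Each operation is a (forward, local_derivative) pair. The local
# derivative is evaluated at the input that operation received.
ops = [
    (lambda v: 3.0 * v, lambda v: 3.0),                           # scale by 3
    (lambda v: v + 1.0, lambda v: 1.0),                           # add 1
    (lambda v: math.tanh(v), lambda v: 1.0 - math.tanh(v) ** 2),  # tanh
]


def forward_and_gradient(x):
    # Forward pass: apply each operation and remember its input.
    inputs = []
    value = x
    for fwd, _ in ops:
        inputs.append(value)
        value = fwd(value)

    # Backward pass: multiply local derivatives from last to first.
    grad = 1.0
    for (_, local), inp in zip(reversed(ops), reversed(inputs)):
        grad *= local(inp)
    return value, grad


value, grad = forward_and_gradient(0.2)
print("output:", value)
print("d(output)/dx:", grad)
```

No operation needs to know about the others. Each one only supplies its own derivative, and the product along the path gives the full gradient.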

[Figure: Single neuron computation graph]

A single neuron already contains the full backprop pattern: forward values move left to right, gradients move right to left, and the multiply node routes gradients differently to each input.

A Concrete Numerical Example

Take:

x = 2
w = 3
b = 1

Forward pass:

z = wx + b = 3*2 + 1 = 7

a = ReLU(7) = 7

Suppose the target is y = 4, and we use squared error:

L = (a - y)^2 = (7 - 4)^2 = 9

Now let us go backward.

Step 1: Derivative of the loss with respect to the output

L = (a - y)^2

So:

dL/da = 2(a - y) = 2(7 - 4) = 6

This means: if a increases a little, the loss increases about 6 times that amount.

Step 2: Derivative through ReLU

a = ReLU(z)

For ReLU:

derivative is 1 if z > 0
derivative is 0 if z < 0
at z = 0 the derivative is undefined; implementations pick a value by convention, commonly 0

Here z = 7, so:

da/dz = 1

Then:

dL/dz = (dL/da) * (da/dz) = 6 * 1 = 6

Step 3: Derivative with respect to the weight

z = wx + b

So:

dz/dw = x = 2

Then:

dL/dw = (dL/dz) * (dz/dw) = 6 * 2 = 12

Step 4: Derivative with respect to the bias

dz/db = 1

So:

dL/db = (dL/dz) * (dz/db) = 6

Step 5: Derivative with respect to the input

dz/dx = w = 3

So:

dL/dx = (dL/dz) * (dz/dx) = 6 * 3 = 18

This last one is not used to update the data input, but it becomes important when one layer’s output is another layer’s input.
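The five steps above can be replayed in a few lines of plain Python. This is a sketch of the same arithmetic, with a finite-difference estimate added as a sanity check; the helper names `relu` and `neuron_loss` are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def neuron_loss(w, x, b, y):
    # Full forward pass in one expression, used only for the check below.
    return (relu(w * x + b) - y) ** 2


x, w, b, y = 2.0, 3.0, 1.0, 4.0

# Forward pass
z = w * x + b                    # 7.0
a = relu(z)                      # 7.0
L = (a - y) ** 2                 # 9.0

# Backward pass, one local derivative at a time
dL_da = 2.0 * (a - y)            # Step 1: 6.0
da_dz = 1.0 if z > 0 else 0.0    # Step 2: 1.0
dL_dz = dL_da * da_dz            #         6.0
dL_dw = dL_dz * x                # Step 3: 12.0
dL_db = dL_dz * 1.0              # Step 4: 6.0
dL_dx = dL_dz * w                # Step 5: 18.0

print(dL_dw, dL_db, dL_dx)       # 12.0 6.0 18.0

# Sanity check: estimate dL/dw by nudging w directly.
eps = 1e-6
num_dL_dw = (neuron_loss(w + eps, x, b, y) - neuron_loss(w - eps, x, b, y)) / (2 * eps)
print("numerical dL/dw:", round(num_dL_dw, 4))
```

The numerical estimate should agree with the hand-derived value of 12 up to floating-point noise.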

Why the Multiply Node “Swaps” Gradients

For a multiplication m = xw:

dm/dx = w
dm/dw = x

That is why during backpropagation the multiply node seems to “swap” the values:

gradient sent to x is scaled by w
gradient sent to w is scaled by x

This pattern appears constantly in neural networks.
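A minimal sketch of a multiply node makes the swap explicit. The function names `mul_forward` and `mul_backward` are hypothetical, and the upstream gradient of 6 is taken from the worked example.

```python
def mul_forward(x, w):
    return x * w


def mul_backward(x, w, upstream):
    # d(xw)/dx = w and d(xw)/dw = x, so each input receives the
    # upstream gradient scaled by the *other* input.
    grad_x = upstream * w
    grad_w = upstream * x
    return grad_x, grad_w


m = mul_forward(2.0, 3.0)
grad_x, grad_w = mul_backward(x=2.0, w=3.0, upstream=6.0)
print(m)                 # 6.0
print(grad_x, grad_w)    # 18.0 12.0
```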

Computation Graphs Make Backpropagation Much Easier to Understand

A neural network can be viewed as a computation graph. Every node is an operation, and every edge carries a value.

For our neuron:

x ----\
       * ----\
w ----/       \
               + ---- z ---- ReLU ---- a ---- loss ---- L
b -------------/

In the forward pass, you compute values at each node.

In the backward pass, you compute:

the upstream gradient coming into a node
the local derivative of that node
the downstream gradients sent to its inputs

That is the entire algorithm.

General Backpropagation Pattern

Every node follows the same recipe:

1. Receive the upstream gradient from the node that consumes this node's output.
2. Multiply by the node’s local derivative.
3. Pass the result to the node’s inputs.

In short:

upstream gradient * local derivative = downstream gradient

For an addition node:

z = u + v

The derivatives are:

dz/du = 1
dz/dv = 1

So the upstream gradient gets copied to both inputs.

For a multiplication node:

z = uv

The derivatives are:

dz/du = v
dz/dv = u

So the upstream gradient gets scaled differently for each input.
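To show how these per-node rules combine into a working algorithm, here is a small sketch of a scalar computation graph, loosely in the style of educational autograd engines. The `Value` class and its methods are hypothetical names invented for this example, and only addition, multiplication, and ReLU are supported.

```python
class Value:
    """A scalar node in a computation graph (illustrative helper)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Addition: both local derivatives are 1, so copy the gradient.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # Multiplication: scale by the other input's value.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))

        def _backward():
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order nodes so each one is processed after every node that
        # consumes its output (reverse topological order).
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


x, w, b = Value(2.0), Value(3.0), Value(1.0)
neg_y = Value(-4.0)            # the target y = 4, stored negated
a = (x * w + b).relu()
diff = a + neg_y               # a - y
loss = diff * diff             # squared error
loss.backward()
print(loss.data)               # 9.0
print(w.grad, b.grad, x.grad)  # 12.0 6.0 18.0
```

Gradients are accumulated with `+=` because a node that feeds several consumers receives one contribution from each of them, as `diff` does here.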

Extending to a Layer

A layer is just many neurons processed together.

Instead of:

z = wx + b

we write:

z = Wx + b

Where:

x is a vector of inputs
W is a weight matrix
b is a bias vector
z is the vector of pre-activations

Backpropagation still follows the same logic, just with vectors and matrices instead of single numbers.

For a dense layer:

the gradient with respect to W combines the upstream gradient dL/dz with the input activations x
the gradient with respect to b is the upstream gradient dL/dz, summed over the examples in the batch
the gradient with respect to x is what gets passed to the previous layer

This is why modern frameworks can train huge networks efficiently. The same chain-rule logic becomes large, optimized matrix math.
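As a sketch of what that matrix math looks like, the NumPy snippet below computes the three gradients for one dense layer. It assumes a batch stored as rows of `X`, a weight matrix `W` of shape (outputs, inputs), and an upstream gradient `dZ` that is simply made up here in place of a real loss; all shapes and names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 3, 2

X = rng.normal(size=(batch, n_in))    # one input vector per row
W = rng.normal(size=(n_out, n_in))    # weight matrix
b = rng.normal(size=n_out)            # bias vector

# Forward pass: z = Wx + b, applied to every row of X at once.
Z = X @ W.T + b

# Stand-in for dL/dZ, which the layers after this one would supply.
dZ = rng.normal(size=(batch, n_out))

# Backward pass for the layer.
dW = dZ.T @ X          # uses the input activations X
db = dZ.sum(axis=0)    # summed over the batch
dX = dZ @ W            # passed back to the previous layer

print(dW.shape, db.shape, dX.shape)   # (2, 3) (2,) (4, 3)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick way to catch transposition mistakes.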

Backpropagation Through Multiple Layers

Imagine a 2-layer network:

x
 -> z1 = W1x + b1
 -> a1 = ReLU(z1)
 -> z2 = W2a1 + b2
 -> y_hat
 -> L

Backpropagation works from the end toward the start:

1. Compute dL/dy_hat
2. Push it through the output layer to get gradients for W2 and b2
3. Compute the gradient with respect to a1
4. Push that through ReLU to get dL/dz1
5. Use that to compute gradients for W1 and b1

Earlier layers do not directly see the loss. They receive a gradient signal passed backward through later layers.

[Figure: Two-layer network backpropagation]

Once the graph spans multiple layers, the logic does not change. The loss gradient simply gets relayed backward, one local derivative at a time, until it reaches earlier weights.
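The backward sweep through the two-layer network can be sketched in NumPy as follows. Two things are assumed here that the text leaves open: the output is taken to be linear (y_hat = z2) and the loss is a summed squared error. Layer sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)


def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0.0)          # ReLU
    y_hat = W2 @ a1 + b2              # assumed linear output layer
    loss = np.sum((y_hat - y) ** 2)   # assumed squared-error loss
    return z1, a1, y_hat, loss


z1, a1, y_hat, loss = forward(W1, b1, W2, b2)

# Backward pass, from the loss toward the input.
d_y_hat = 2.0 * (y_hat - y)       # dL/dy_hat
dW2 = np.outer(d_y_hat, a1)       # gradients for the output layer
db2 = d_y_hat
d_a1 = W2.T @ d_y_hat             # gradient with respect to a1
d_z1 = d_a1 * (z1 > 0)            # through ReLU
dW1 = np.outer(d_z1, x)           # gradients for the first layer
db1 = d_z1

print(dW1.shape, db1.shape, dW2.shape, db2.shape)   # (4, 3) (4,) (2, 4) (2,)
```

Note that `dW1` is computed without ever touching the loss formula directly. It only uses `d_z1`, the signal relayed back from the later layer.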

Why Activations Matter for Gradient Flow

Activation functions affect not just the forward pass but also the backward pass.

Sigmoid and Tanh

Sigmoid and tanh can saturate. In their flat regions, the derivative becomes very small. When many such small derivatives are multiplied together across layers, gradients can shrink toward zero.

That is the vanishing gradient problem.
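A rough back-of-the-envelope sketch shows how quickly this shrinkage compounds. It deliberately ignores the weights and looks only at the product of sigmoid derivatives, using the best case where every pre-activation is 0 and the derivative is at its maximum of 0.25.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative peaks at z = 0 and is much smaller in the flat regions.
print(sigmoid_derivative(0.0))   # 0.25
print(sigmoid_derivative(5.0))   # well below 0.01

# Product of the per-layer derivatives for increasing depth,
# assuming the best case (z = 0) at every layer.
products = {depth: sigmoid_derivative(0.0) ** depth for depth in (1, 5, 10, 20)}
for depth, product in products.items():
    print(depth, product)
```

Even in this best case the factor after 10 layers is below one in a million. In a real network the weights also enter the product, so the actual behavior depends on their scale.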

ReLU

ReLU avoids this problem for positive activations because its derivative is 1 there. That is one reason ReLU made deep networks much easier to train.

But ReLU has its own issue:

if a neuron's pre-activation stays negative for every input, its gradient is 0
then its weights receive no updates and the neuron may stop learning

This is often called a dead ReLU.
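The following sketch reuses the single-neuron setup with squared error to show the effect. The weight and bias values are hypothetical, picked so that the pre-activation is negative for every input in the list.

```python
def relu_neuron_grads(w, b, x, y):
    z = w * x + b
    a = max(0.0, z)
    dL_da = 2.0 * (a - y)
    da_dz = 1.0 if z > 0 else 0.0     # ReLU blocks the gradient when z <= 0
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz           # dL/dw, dL/db


# Hypothetical parameters that keep z negative for all inputs below.
w, b = -3.0, -1.0
inputs = (0.5, 1.0, 2.0)

grads = [relu_neuron_grads(w, b, x, y=4.0) for x in inputs]
stuck = all(gw == 0.0 and gb == 0.0 for gw, gb in grads)
print("all gradients zero:", stuck)   # True
```

The loss is far from zero for each of these inputs, yet no gradient reaches `w` or `b`, so gradient descent has nothing to work with.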

Why Backpropagation Is Efficient

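A naive approach would estimate the effect of each weight separately, for example by nudging one weight at a time and rerunning the forward pass. With N weights, that costs on the order of N forward passes for a single update, which is impractical for large networks.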

Backpropagation is efficient because it reuses intermediate results.

The forward pass stores values like:

inputs
pre-activations
activations

The backward pass reuses them to compute gradients locally.

This dynamic-programming flavor is what makes deep learning practical.
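To make the cost difference tangible, the sketch below compares the two approaches on a hypothetical chain of scalar ReLU layers, counting how many forward passes each one needs. The toy network, the `calls` counter, and the function names are all assumptions for illustration.

```python
calls = {"forward": 0}


def forward(params, x):
    # Toy network: a_{k+1} = relu(w_k * a_k). Returns all activations.
    calls["forward"] += 1
    activations = [x]
    for w in params:
        activations.append(max(0.0, w * activations[-1]))
    return activations


def backprop(params, x):
    acts = forward(params, x)           # one forward pass, values cached
    grads = [0.0] * len(params)
    upstream = 1.0                      # d(output)/d(output)
    for k in reversed(range(len(params))):
        local = 1.0 if acts[k + 1] > 0 else 0.0   # ReLU derivative from cache
        d_z = upstream * local
        grads[k] = d_z * acts[k]        # gradient for w_k
        upstream = d_z * params[k]      # gradient passed to the layer below
    return grads


def finite_differences(params, x, eps=1e-6):
    base = forward(params, x)[-1]
    grads = []
    for k in range(len(params)):
        bumped = list(params)
        bumped[k] += eps                # nudge one weight, rerun everything
        grads.append((forward(bumped, x)[-1] - base) / eps)
    return grads


params, x = [0.5, 1.5, 2.0], 1.0

calls["forward"] = 0
grads_bp = backprop(params, x)
bp_calls = calls["forward"]             # 1

calls["forward"] = 0
grads_fd = finite_differences(params, x)
fd_calls = calls["forward"]             # len(params) + 1 = 4

print(grads_bp)                         # [3.0, 1.0, 0.75]
print("forward passes:", bp_calls, "vs", fd_calls)
```

With three weights the gap is small, but the finite-difference count grows linearly with the number of parameters while backpropagation stays at one forward and one backward sweep.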

How the Optimizer Uses These Gradients

Backpropagation itself does not update the weights. It only computes gradients.

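Then an optimizer applies an update. For plain gradient descent, the update is:

w <- w - eta * dL/dw

Where eta is the learning rate. Variants such as SGD with momentum or Adam start from the same gradients but modify how the step is computed.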

So the separation is:

backpropagation computes gradient information
optimization uses that information to move parameters
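That split can be sketched as a tiny training loop on the single neuron from the worked example. The `gradients` function plays the role of backpropagation and the two update lines play the role of a plain gradient-descent optimizer; the learning rate of 0.05 and the step count are arbitrary choices.

```python
def gradients(w, b, x, y):
    # Backpropagation for one ReLU neuron with squared-error loss.
    z = w * x + b
    a = max(0.0, z)
    dL_dz = 2.0 * (a - y) * (1.0 if z > 0 else 0.0)
    loss = (a - y) ** 2
    return loss, dL_dz * x, dL_dz      # loss, dL/dw, dL/db


x, y = 2.0, 4.0
w, b = 3.0, 1.0
eta = 0.05                             # learning rate (arbitrary)

losses = []
for step in range(5):
    loss, dL_dw, dL_db = gradients(w, b, x, y)   # compute gradient information
    losses.append(loss)
    w -= eta * dL_dw                             # move the parameters
    b -= eta * dL_db

print(losses)
```

The recorded losses start at 9 and fall at every step, since each update moves `w` and `b` against their gradients.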

A Minimal PyTorch Example

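import torch
import torch.nn as nn

# Same numbers as the manual example: x = 2, y = 4, w = 3, b = 1.
x = torch.tensor([[2.0]])
y = torch.tensor([[4.0]])

linear = nn.Linear(1, 1)

# Overwrite the random initial parameters without recording the
# assignment in the autograd graph.
with torch.no_grad():
    linear.weight.copy_(torch.tensor([[3.0]]))
    linear.bias.copy_(torch.tensor([1.0]))

pred = torch.relu(linear(x))       # forward pass: ReLU(3*2 + 1) = 7
loss = (pred - y).pow(2).mean()    # squared error: (7 - 4)^2 = 9

loss.backward()                    # backward pass: fills in .grad

print("prediction:", pred.item())             # 7.0
print("loss:", loss.item())                   # 9.0
print("dL/dw:", linear.weight.grad.item())    # 12.0
print("dL/db:", linear.bias.grad.item())      # 6.0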

Autograd performs the same chain-rule computation we walked through manually.

Common Misunderstandings

“Backpropagation and gradient descent are the same thing”

They are related but different.

backpropagation computes gradients
gradient descent uses gradients to update weights

“The model learns by changing all weights equally”

No. Each parameter gets its own gradient, so each parameter gets its own update.

“Backpropagation is specific to neural networks”

Not really. It is a general reverse-mode automatic differentiation technique applied to computation graphs. Neural networks just happen to be the most famous use case.
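As a small illustration with no neural network involved, the sketch below applies the same forward-then-backward recipe to an arbitrary function, f(x, y) = x*y + sin(x). The function and the helper name `f_and_grads` are chosen only for this example.

```python
import math


def f_and_grads(x, y):
    # Forward pass: record each intermediate value.
    p = x * y
    s = math.sin(x)
    f = p + s

    # Backward pass: start from df/df = 1 and apply local derivatives.
    df = 1.0
    dp = df                            # addition copies the gradient
    ds = df
    dx = dp * y + ds * math.cos(x)     # x feeds two nodes, so contributions add
    dy = dp * x
    return f, dx, dy


f, dx, dy = f_and_grads(2.0, 3.0)
print("f:", f)
print("df/dx:", dx)    # equals y + cos(x)
print("df/dy:", dy)    # equals x
```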

Intuition to Keep

If you remember only one thing, make it this:

Backpropagation answers the question:

Which earlier choices caused the final error, and by how much?

That answer is computed by moving backward through the graph, one local derivative at a time.

Summary

The forward pass computes predictions.
The backward pass computes gradients of the loss with respect to parameters.
Backpropagation is just the chain rule applied efficiently over a computation graph.
Each node uses local derivatives and the upstream gradient.
The resulting gradients tell the optimizer how to adjust weights.
Deep learning works because this process scales from a single neuron to networks with billions of parameters.

Once this clicks, a lot of deep learning stops feeling magical. It becomes a structured pipeline of matrix operations, local derivatives, and careful bookkeeping.
