Training Loop Explained: Batches, Epochs, Iterations, and Convergence

Once you understand neurons, activations, loss functions, and backpropagation, the next piece is the training loop, the repetitive engine of deep learning.

At a high level, training is boring in the best possible way. It is the same four steps repeated over and over:

make a prediction
measure the error
compute gradients
update the weights

The interesting part is not the loop itself. The interesting part is how concepts like batch size, epoch, iteration, and convergence affect the behavior of that loop in practice.

The Training Loop in One Picture

Every iteration of training follows this pattern:

Input batch
  -> Forward pass
  -> Loss
  -> Backward pass
  -> Optimizer update
  -> Repeat

Or in code:

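for batch in data_loader:
    predictions = model(batch.inputs)            # forward pass
    loss = loss_fn(predictions, batch.targets)   # scalar loss for this batch
    optimizer.zero_grad()                        # clear gradients left over from the previous step
    loss.backward()                              # backward pass: compute gradients
    optimizer.step()                             # update the weights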

That is the whole training loop in miniature.

Step 1: Forward Pass

The model takes the current batch of inputs and computes predictions using its current weights.

Examples:

a regression model predicts prices
a classifier predicts class logits
an LLM predicts next-token logits

At this moment, the model is not learning yet. It is just answering:

Given my current parameters, what do I predict for these examples?

Step 2: Compute Loss

The loss function measures how wrong those predictions are.

Examples:

MSE for regression
cross-entropy for classification

The output is a scalar value. That scalar becomes the thing training tries to reduce over time.
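
As a rough sketch of what these two losses compute, here is a plain-Python version with no framework. The function names mse and cross_entropy and the sample numbers are illustrative assumptions, not a library API; real frameworks use vectorized implementations of the same formulas.

```python
import math

def mse(predictions, targets):
    """Mean squared error over a batch of scalar predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(logits, target_index):
    """Cross-entropy for one example: -log(softmax(logits)[target_index])."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Made-up predictions and targets, purely for illustration.
regression_loss = mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
classification_loss = cross_entropy([2.0, 0.5, -1.0], target_index=0)

# Both results are single numbers, which is what the optimizer needs.
print(regression_loss, classification_loss)
```

In both cases a whole batch of predictions collapses into one number. For a classification batch, the per-example values would typically be averaged.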

Step 3: Backward Pass

The backward pass computes gradients of the loss with respect to model parameters.

This is where backpropagation happens.

After loss.backward() in PyTorch, each trainable parameter that contributed to the loss has a gradient stored in its .grad attribute.

Those gradients answer:

How should this weight change if I want to reduce the loss?
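
To make the idea of a gradient concrete, the sketch below works out the gradients by hand for a tiny linear model with an MSE loss, then checks them against finite differences. The model, the data, and the function names are assumptions chosen for illustration; a framework's automatic differentiation does the same job for arbitrary networks.

```python
def loss_fn(w, b, xs, ys):
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def analytic_grads(w, b, xs, ys):
    """Gradients of the MSE with respect to w and b, derived with the chain rule."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def numeric_grads(w, b, xs, ys, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    dw = (loss_fn(w + eps, b, xs, ys) - loss_fn(w - eps, b, xs, ys)) / (2 * eps)
    db = (loss_fn(w, b + eps, xs, ys) - loss_fn(w, b - eps, xs, ys)) / (2 * eps)
    return dw, db

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # the data follows y = 2x
w, b = 0.5, 0.0                              # current (poor) parameters

dw, db = analytic_grads(w, b, xs, ys)
ndw, ndb = numeric_grads(w, b, xs, ys)
print(dw, db)     # analytic gradients
print(ndw, ndb)   # numerical estimates, which should agree closely
```

Both gradients come out negative here, which says that increasing w and b would reduce the loss. That matches the data, since the true slope is larger than 0.5.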

Step 4: Optimizer Update

The optimizer uses the gradients to update weights.

In the simplest case:

w <- w - eta * grad

Where eta is the learning rate.

In practice, optimizers like Adam keep extra state, but the central idea is the same:

gradients tell you the direction
optimizer decides how to step
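
A minimal sketch of the plain update rule above, written without a framework. The function name sgd_step and the gradient values are hypothetical; optimizers such as Adam replace the body of this function with something that also uses running statistics.

```python
def sgd_step(weights, grads, lr):
    """Plain SGD: move each weight a small step against its gradient."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.5, -1.0]
grads = [-14.0, 2.0]   # hypothetical gradients from a backward pass

new_weights = sgd_step(weights, grads, lr=0.01)
print(new_weights)
```

The first weight has a negative gradient, so it increases. The second has a positive gradient, so it decreases. The learning rate scales both moves.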

What Is a Batch?

A batch is a subset of training examples processed together before one weight update.

If the dataset has 10,000 examples, you usually do not feed all 10,000 at once. Instead, you split them into smaller groups of a fixed size, the batch size.

If the batch size is 64, the model processes 64 examples, computes one average loss, computes gradients, and then updates weights once.

This is called mini-batch training, and it is the standard way neural networks are trained.
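
The splitting itself is simple. Here is a sketch in plain Python, assuming the dataset is just a list and using a hypothetical helper named iterate_batches; in PyTorch the DataLoader does this for you.

```python
def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = list(range(10_000))   # stand-in for 10,000 training examples
batches = list(iterate_batches(examples, batch_size=64))

# Number of batches, size of the first batch, size of the last batch.
print(len(batches), len(batches[0]), len(batches[-1]))
```

With 10,000 examples and a batch size of 64 this gives 157 batches: 156 full ones and a final partial batch of 16, because 10,000 is not a multiple of 64.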

Why Not Use the Entire Dataset at Once?

You could, in principle, use full-batch gradient descent, but mini-batches are usually better in practice.

Reasons:

full-batch training can be too memory-intensive
mini-batches map well onto GPU parallelism without requiring the whole dataset on the device
mini-batch noise can actually help optimization
weight updates happen more frequently

So mini-batches are a practical and often optimization-friendly compromise.

What Is an Epoch?

An epoch means one complete pass through the entire training dataset.

Example:

dataset size = 12,000
batch size = 100

Then:

one epoch contains 120 batches
therefore one epoch contains 120 optimizer updates

An epoch is about dataset coverage, not about one single update.

What Is an Iteration or Step?

An iteration or step usually means:

one batch processed -> one weight update

So with:

dataset size = 12,000
batch size = 100

you get:

120 iterations per epoch

This distinction matters because:

epochs tell you how many times you have seen the dataset
steps tell you how many times you have updated the weights

Many modern training setups, especially for large models, are tracked more in steps than in epochs.

Worked Example: Dataset, Batch Size, Epochs

Suppose:

dataset has 1,000 samples
batch size = 100
training runs for 5 epochs

Then:

batches per epoch = 1,000 / 100 = 10
total iterations = 5 * 10 = 50
total weight updates = 50

This is the simplest arithmetic behind training schedules.
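
The same arithmetic as a small helper, including the case where the dataset size is not a multiple of the batch size. The function names are illustrative; the drop_last flag is named after the DataLoader option that discards a final partial batch.

```python
import math

def steps_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of batches (and optimizer updates) in one pass over the data."""
    if drop_last:
        return dataset_size // batch_size        # discard the partial final batch
    return math.ceil(dataset_size / batch_size)  # keep the partial final batch

def total_steps(dataset_size, batch_size, epochs, drop_last=False):
    """Total optimizer updates over the whole training run."""
    return epochs * steps_per_epoch(dataset_size, batch_size, drop_last)

print(steps_per_epoch(1_000, 100))         # the worked example above
print(total_steps(1_000, 100, epochs=5))
print(steps_per_epoch(1_000, 32))          # not evenly divisible
print(steps_per_epoch(1_000, 32, drop_last=True))
```

The uneven case matters in practice: with 1,000 samples and a batch size of 32, an epoch has 32 steps if the last partial batch is kept and 31 if it is dropped.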

Why Batch Size Matters

Batch size changes both compute behavior and optimization behavior.

Small batches

Pros:

more frequent updates
less memory usage
noisier gradients can help escape poor regions

Cons:

training can be less stable
GPU utilization may be worse

Large batches

Pros:

better hardware efficiency
smoother gradient estimates

Cons:

more memory required
fewer updates per epoch
can sometimes generalize worse without proper tuning

There is no universal best batch size. It depends on the model, hardware, and objective.

Why Training Loss Goes Down but Not Always Smoothly

People often expect loss curves to drop smoothly. In reality, mini-batch training introduces noise.

Why?

each batch is only a sample of the dataset
different batches have different difficulty
the gradient is only an estimate of the full-data gradient

So it is normal for batch-level loss to bounce around even while the broader trend is improving.

That is not failure. That is just stochastic optimization doing its thing.
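
The sampling effect can be seen without training anything. The sketch below assumes a fixed set of made-up per-example losses (so the model is not changing at all) and measures how much the batch-mean loss varies from batch to batch at different batch sizes. The distribution and the function name are assumptions for illustration only.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-example losses for a 10,000-example dataset.
per_example_loss = [random.expovariate(1.0) for _ in range(10_000)]
full_data_loss = statistics.fmean(per_example_loss)

def batch_loss_spread(batch_size, num_batches=200):
    """Standard deviation of the mean loss across randomly drawn batches."""
    batch_means = [
        statistics.fmean(random.sample(per_example_loss, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.stdev(batch_means)

print("full-data loss:", round(full_data_loss, 4))
for batch_size in (8, 64, 512):
    print(batch_size, round(batch_loss_spread(batch_size), 4))
```

Even though nothing about the "model" changes between draws, small batches report noticeably different losses from one another, and the spread shrinks as the batch grows. The same sampling noise is layered on top of the real trend in a training log.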

What Convergence Means

Convergence does not necessarily mean reaching the perfect global minimum.

In practical deep learning, convergence usually means something closer to:

training loss is no longer improving much
validation performance has stabilized
further training gives diminishing returns

That is enough to stop in many real workflows.
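
One common way to turn "no longer improving much" into a rule is a patience check on the validation loss. The sketch below is one possible version, not a standard definition: the function name has_plateaued, the patience and min_delta parameters, and the loss history are all assumptions chosen for illustration.

```python
def has_plateaued(val_losses, patience=3, min_delta=1e-3):
    """True if the last `patience` epochs failed to beat the earlier best by min_delta."""
    if len(val_losses) <= patience:
        return False                      # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Hypothetical validation losses, one per epoch.
history = [0.90, 0.62, 0.48, 0.41, 0.4098, 0.4101, 0.4099]

print(has_plateaued(history[:5]))   # still improving quickly
print(has_plateaued(history))       # last three epochs barely moved
```

The thresholds are judgment calls. A larger patience tolerates noisy validation curves, while a larger min_delta stops earlier.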

Too Small, Too Large, and Just Right Learning Rates

The training loop is highly sensitive to learning rate.

Too small

training progresses very slowly
loss decreases, but painfully
many epochs may be wasted

Too large

loss may oscillate or diverge
updates overshoot good regions
training may become numerically unstable

Reasonable learning rate

loss decreases efficiently
training is stable enough to make progress
convergence is much faster

A lot of “model problems” are really optimization-setting problems, especially bad learning-rate choices.
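
The three regimes show up even on the simplest possible problem. The sketch below runs plain gradient descent on f(w) = w^2, whose gradient is 2w, with three learning rates. The function and the specific rates are assumptions picked to illustrate each regime; real networks need different values.

```python
def run_gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return abs(w)

too_small = run_gradient_descent(0.001)
reasonable = run_gradient_descent(0.1)
too_large = run_gradient_descent(1.1)

for name, value in [("too small", too_small),
                    ("reasonable", reasonable),
                    ("too large", too_large)]:
    print(f"{name}: |w| after 50 steps = {value:.3g}")
```

Each step multiplies w by (1 - 2 * lr). With a rate of 0.001 that factor is 0.998, so w barely moves. With 0.1 it is 0.8, so w shrinks quickly toward the minimum at 0. With 1.1 it is -1.2, so w flips sign and grows at every step, which is the overshoot-and-diverge pattern.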

Training Loss vs Validation Loss

To understand convergence properly, you usually track both:

training loss
validation loss

Good fit

both losses go down
both stay reasonably close

Underfitting

both losses stay high
model is not learning enough

Overfitting

training loss keeps dropping
validation loss stops improving or starts rising

This is why “lowest training loss” is not always the best checkpoint.

The goal is not memorization. The goal is generalization.
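
A small sketch of why the checkpoint choice matters, using made-up loss curves that follow the overfitting pattern described above. The numbers and the helper name best_epoch are hypothetical.

```python
# Hypothetical per-epoch losses showing an overfitting pattern.
train_loss = [0.90, 0.60, 0.42, 0.30, 0.21, 0.15, 0.10]
val_loss = [0.95, 0.68, 0.52, 0.47, 0.49, 0.55, 0.63]

def best_epoch(losses):
    """1-based index of the epoch with the lowest loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

best_by_train = best_epoch(train_loss)
best_by_val = best_epoch(val_loss)

# The gap between validation and training loss widens as overfitting sets in.
gaps = [round(v - t, 2) for t, v in zip(train_loss, val_loss)]

print(best_by_train, best_by_val)
print(gaps)
```

Selecting by training loss picks the final epoch. Selecting by validation loss picks epoch 4, before the validation curve turns upward.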

Why Shuffle the Data

Training data is usually shuffled each epoch.

Why?

avoids pathological ordering effects
makes batches more representative
improves stochastic optimization behavior

If similar examples are clustered together and never shuffled, many consecutive updates push the weights toward the same kind of example, and training can behave badly.
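
To see the ordering problem concretely, the sketch below assumes a hypothetical dataset stored sorted by label and counts how many batches contain only one class, with and without shuffling. The helper name and the dataset are illustrative.

```python
import random

random.seed(0)

# Hypothetical dataset sorted by label: 50 examples of class 0, then 50 of class 1.
labels = [0] * 50 + [1] * 50

def batch_label_sets(labels, batch_size):
    """Set of distinct labels in each consecutive batch."""
    return [set(labels[i:i + batch_size]) for i in range(0, len(labels), batch_size)]

unshuffled = batch_label_sets(labels, batch_size=10)

shuffled_labels = labels[:]
random.shuffle(shuffled_labels)
shuffled = batch_label_sets(shuffled_labels, batch_size=10)

single_class_unshuffled = sum(len(s) == 1 for s in unshuffled)
single_class_shuffled = sum(len(s) == 1 for s in shuffled)

print(single_class_unshuffled, "of", len(unshuffled), "batches are single-class without shuffling")
print(single_class_shuffled, "of", len(shuffled), "batches are single-class with shuffling")
```

Without shuffling, every batch here contains a single class, so the model sees five updates toward class 0 followed by five toward class 1. In PyTorch, passing shuffle=True to the DataLoader reshuffles the data each epoch.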

Why We Zero Gradients

In PyTorch, gradients accumulate by default.

That is why a typical loop includes:

optimizer.zero_grad()
loss.backward()
optimizer.step()

If you skip zero_grad(), gradients from previous steps will accumulate, which is usually wrong unless you are intentionally doing gradient accumulation.
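
The accumulation is easy to observe directly. The sketch below uses a single scalar parameter and a toy loss, both assumptions for illustration, and resets the gradient by hand. In a real loop, optimizer.zero_grad() plays that role for every parameter the optimizer manages.

```python
import torch

# A single trainable scalar; the toy loss 3 * w has gradient 3 with respect to w.
w = torch.tensor(1.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()      # gradient after the first backward pass

(3.0 * w).backward()       # no reset in between: the new gradient is added
second = w.grad.item()

w.grad.zero_()             # reset the stored gradient by hand
(3.0 * w).backward()
third = w.grad.item()

print(first, second, third)
```

This prints 3.0 6.0 3.0. The second value is the sum of two backward passes, which is what an optimizer would wrongly step on if the reset were skipped.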

Gradient Accumulation

Sometimes the ideal batch size does not fit in memory.

One workaround is gradient accumulation:

process several smaller micro-batches
call loss.backward() on each
delay optimizer.step()
update only after accumulating enough gradients

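This simulates a larger effective batch size without needing all examples in memory at once. When the loss is a per-batch mean, divide each micro-batch loss by the number of accumulation steps so that the summed gradients match the large-batch average.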
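
The steps above can be sketched in PyTorch as follows. The model, the random data, and the names accum_steps and micro_batch_size are assumptions for illustration. The full-batch reference gradient is computed only to show that the two approaches agree, and would not appear in a real training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 10)              # one "large" batch of 64 examples
y = torch.randint(0, 2, (64,))

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()      # averages over the batch by default
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4                      # 4 micro-batches of 16 = effective batch of 64
micro_batch_size = X.size(0) // accum_steps

# Accumulate gradients over the micro-batches without stepping.
optimizer.zero_grad()
for i in range(accum_steps):
    start = i * micro_batch_size
    xb = X[start:start + micro_batch_size]
    yb = y[start:start + micro_batch_size]
    # Divide by accum_steps so the summed gradients equal the full-batch mean.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()
accumulated_grad = model.weight.grad.clone()

# Reference only: the gradient from processing all 64 examples at once.
optimizer.zero_grad()
loss_fn(model(X), y).backward()
full_batch_grad = model.weight.grad.clone()

print(torch.allclose(accumulated_grad, full_batch_grad, atol=1e-6))

optimizer.step()                     # one weight update for the whole effective batch
```

The equivalence holds here because the micro-batches are equal in size and the model has no batch-dependent layers. Layers such as batch normalization compute statistics per micro-batch, so accumulation is then only an approximation of a larger batch.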

A Minimal Training Loop in PyTorch

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 2)
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

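for epoch in range(5):
    running_loss = 0.0
    for batch_x, batch_y in loader:
        logits = model(batch_x)            # forward pass
        loss = loss_fn(logits, batch_y)    # mean loss over this batch

        optimizer.zero_grad()              # clear gradients from the previous step
        loss.backward()                    # backward pass
        optimizer.step()                   # weight update

        # Weight by batch size so the final partial batch is counted correctly.
        running_loss += loss.item() * batch_x.size(0)

    # Report the mean loss over the epoch, not just the last batch's loss.
    print(f"epoch={epoch+1}, loss={running_loss / len(dataset):.4f}")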

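The inputs and labels here are random, so the loss should stay near ln 2 (about 0.69) rather than drop toward zero. The point is the structure of the loop, not the result.

This example has all the standard pieces: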

mini-batches
forward pass
loss
backward pass
optimizer step
multiple epochs

Common Misunderstandings

“One epoch means one gradient update”

No. One epoch may contain hundreds or thousands of updates depending on dataset size and batch size.

“One batch means one sample”

No. A batch usually contains many samples.

“Convergence means zero loss”

Not in practice. It usually means the model has stopped improving meaningfully.

“More epochs always means better results”

No. More epochs can eventually lead to overfitting.

“Batch size is only a hardware choice”

No. It affects both hardware efficiency and optimization dynamics.

Practical Mental Model

Think of training like repeated course correction:

each batch gives the model a noisy hint about what it is doing wrong
each backward pass converts that hint into gradients
each optimizer step nudges the parameters
many nudges, accumulated over time, produce learning

That is why deep learning is iterative by nature. One step learns almost nothing. A well-tuned sequence of steps learns a lot.

Summary

The training loop repeats: forward pass, loss, backward pass, optimizer update.
A batch is a subset of examples processed together before one weight update.
An epoch is one full pass through the dataset.
An iteration or step is usually one batch processed and one optimizer update.
Batch size affects memory, throughput, stability, and generalization.
Convergence means training has mostly stopped improving in a meaningful way, not necessarily that loss is zero.
Good training practice depends on monitoring both optimization progress and validation behavior.

Once these terms are clear, training logs stop looking like random numbers and start looking like a readable story of how the model is learning.
