Once you understand neurons, activations, loss functions, and backpropagation, the next piece is the training loop: the repetitive engine of deep learning.

At a high level, training is boring in the best possible way. It is the same four steps repeated over and over:

  1. make a prediction
  2. measure the error
  3. compute gradients
  4. update the weights

The interesting part is not the loop itself. The interesting part is how concepts like batch size, epoch, iteration, and convergence affect the behavior of that loop in practice.

The Training Loop in One Picture

Every iteration of training follows this pattern:

Input batch
  -> Forward pass
  -> Loss
  -> Backward pass
  -> Optimizer update
  -> Repeat

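Or in PyTorch-style code, assuming model, loss_fn, optimizer, and data_loader are already defined:

for inputs, targets in data_loader:
    predictions = model(inputs)             # forward pass
    loss = loss_fn(predictions, targets)    # scalar error for this batch
    optimizer.zero_grad()                   # clear gradients left over from the previous step
    loss.backward()                         # backward pass: compute fresh gradients
    optimizer.step()                        # update the weights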

That is the whole training loop in miniature.

Step 1: Forward Pass

The model takes the current batch of inputs and computes predictions using its current weights.

Examples:

  • a regression model predicts prices
  • a classifier predicts class logits
  • an LLM predicts next-token logits

At this moment, the model is not learning yet. It is just answering:

Given my current parameters, what do I predict for these examples?
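
To make the forward pass concrete, here is a minimal sketch in plain Python with no framework. The 2 -> 2 -> 1 network, its fixed parameters (W1, b1, W2, b2), and the helper names are all made up for illustration.

```python
def linear(inputs, weights, bias):
    """One dense layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

# Hypothetical fixed parameters for a 2 -> 2 -> 1 network.
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]
W2, b2 = [[2.0, 1.0]], [0.5]

def forward(x):
    """Compute a prediction from the current parameters. Nothing is updated here."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

print(forward([3.0, 1.0]))   # [5.5]
```

Calling forward() twice with the same input gives the same answer, because the parameters only change in the optimizer step.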

Step 2: Compute Loss

The loss function measures how wrong those predictions are.

Examples:

  • MSE for regression
  • cross-entropy for classification

The output is a scalar value. That scalar becomes the thing training tries to reduce over time.
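
As a rough sketch of those two losses in plain Python (the function names mse and cross_entropy and the sample numbers are illustrative, not a library API):

```python
import math

def mse(predictions, targets):
    """Mean squared error over a batch of scalar predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(logits, target_index):
    """Cross-entropy for one example: -log(softmax(logits)[target_index])."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

print(mse([2.0, 4.0], [1.0, 6.0]))     # (1 + 4) / 2 = 2.5
print(cross_entropy([0.0, 0.0], 0))    # log(2), about 0.693
```

Both functions return a single number, which is what the rest of the loop needs.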

Step 3: Backward Pass

The backward pass computes gradients of the loss with respect to model parameters.

This is where backpropagation happens.

After loss.backward() in PyTorch, each trainable parameter that contributed to the loss has its gradient stored in its .grad attribute.

Those gradients answer:

How should this weight change if I want to reduce the loss?
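
Frameworks compute these gradients automatically. For intuition, here is a hand-derived sketch for a one-weight model y_hat = w * x with mean squared error, checked against a finite-difference estimate. The model, data, and function names are assumptions chosen to keep the example small.

```python
def loss(w, xs, ys):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w, xs, ys):
    """Analytic derivative of the loss above with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # generated by y = 2x
w = 0.5

# Central finite difference as an independent check on the analytic gradient.
eps = 1e-6
numeric = (loss(w + eps, xs, ys) - loss(w - eps, xs, ys)) / (2 * eps)

print(grad(w, xs, ys), numeric)   # both close to -14
```

The gradient is negative, so increasing w (toward the true value 2) lowers the loss.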

Step 4: Optimizer Update

The optimizer uses the gradients to update weights.

In the simplest case:

w <- w - eta * grad

where eta is the learning rate and grad is the gradient of the loss with respect to w.

In practice, optimizers like Adam keep extra state, but the central idea is the same:

  • gradients tell you which direction lowers the loss
  • the optimizer decides how far, and exactly how, to step
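
The simplest update rule above can be sketched in a few lines. This is plain gradient descent only. The helper name sgd_step and the gradient values are made up, and optimizers such as Adam add per-parameter state that is not shown here.

```python
def sgd_step(weights, grads, lr):
    """Return new weights after one plain gradient-descent update."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
grads = [0.5, -1.0]      # pretend these came from a backward pass
weights = sgd_step(weights, grads, lr=0.1)
print(weights)           # approximately [0.95, -1.9]
```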

What Is a Batch?

A batch is a subset of training examples processed together before one weight update.

If the dataset has 10,000 examples, you usually do not feed all 10,000 at once. Instead, you split them into smaller groups such as:

  • 16
  • 32
  • 64
  • 128
  • 256

If the batch size is 64, the model processes 64 examples, computes one average loss, computes gradients, and then updates weights once.

This is called mini-batch training, and it is the standard way neural networks are trained.
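
A minimal sketch of splitting a dataset into mini-batches, assuming the data fits in a Python list (iter_batches is a hypothetical helper, not a library function):

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of `examples`, each at most `batch_size` long."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

dataset = list(range(10))          # stand-in for 10 training examples
batches = list(iter_batches(dataset, batch_size=4))
print(batches)                     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

When the dataset size is not a multiple of the batch size, the final batch is simply smaller.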

Why Not Use the Entire Dataset at Once?

You could, in principle, use full-batch gradient descent, but mini-batches are usually better in practice.

Reasons:

  • full-batch training can be too memory-intensive
  • each mini-batch update is far cheaper than a full-dataset pass, yet large enough to use GPU parallelism well
  • mini-batch noise can actually help optimization
  • weight updates happen more frequently

So mini-batches are a practical and often optimization-friendly compromise.

What Is an Epoch?

An epoch means one complete pass through the entire training dataset.

Example:

  • dataset size = 12,000
  • batch size = 100

Then:

  • one epoch contains 120 batches
  • therefore one epoch contains 120 optimizer updates

An epoch is about dataset coverage, not about one single update.

What Is an Iteration or Step?

An iteration or step usually means:

one batch processed -> one weight update

So with:

  • dataset size = 12,000
  • batch size = 100

you get:

  • 120 iterations per epoch

This distinction matters because:

  • epochs tell you how many times you have seen the dataset
  • steps tell you how many times you have updated the weights

Many modern training setups, especially for large models, are tracked more in steps than in epochs.
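
One way to keep the two counters apart is to track a global step alongside the epoch number. The sketch below uses the numbers from the example above and leaves the actual training work as a comment. The variable names are illustrative.

```python
num_examples, batch_size, num_epochs = 12_000, 100, 2

global_step = 0
for epoch in range(1, num_epochs + 1):
    for start in range(0, num_examples, batch_size):
        # ... forward pass, loss, backward pass, optimizer update go here ...
        global_step += 1
    print(f"epoch={epoch} global_step={global_step}")
```

After two epochs the step counter reads 240, which is 120 updates per epoch.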

Worked Example: Dataset, Batch Size, Epochs

Suppose:

  • dataset has 1,000 samples
  • batch size = 100
  • training runs for 5 epochs

Then:

  • batches per epoch = 1,000 / 100 = 10
  • total iterations = 5 * 10 = 50
  • total weight updates = 50

This is the simplest arithmetic behind training schedules.
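
The same arithmetic as a small function, extended to the case where the batch size does not divide the dataset evenly. This is a sketch assuming one optimizer update per batch. The function name is made up, and the drop_last flag mirrors the DataLoader option of the same name.

```python
import math

def total_steps(num_examples, batch_size, num_epochs, drop_last=False):
    """Number of optimizer updates, assuming one update per batch."""
    if drop_last:
        steps_per_epoch = num_examples // batch_size             # discard the partial batch
    else:
        steps_per_epoch = math.ceil(num_examples / batch_size)   # keep it
    return steps_per_epoch * num_epochs

print(total_steps(1_000, 100, 5))                  # 50
print(total_steps(1_000, 64, 5))                   # 16 * 5 = 80
print(total_steps(1_000, 64, 5, drop_last=True))   # 15 * 5 = 75
```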

Why Batch Size Matters

Batch size changes both compute behavior and optimization behavior.

Small batches

Pros:

  • more frequent updates
  • less memory usage
  • noisier gradients can help escape poor regions

Cons:

  • training can be less stable
  • GPU utilization may be worse

Large batches

Pros:

  • better hardware efficiency
  • smoother gradient estimates

Cons:

  • more memory required
  • fewer updates per epoch
  • can sometimes generalize worse unless the learning rate and schedule are retuned

There is no universal best batch size. It depends on the model, hardware, and objective.
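
The "noisier" versus "smoother" trade-off can be seen in a small simulation. The sketch below treats a list of random numbers as stand-in per-example gradients for a single weight and measures how much the mini-batch average varies from batch to batch. The distribution and sizes are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)
# Stand-in for per-example gradients of one weight: mean 1.0, plenty of spread.
per_example_grads = [random.gauss(1.0, 2.0) for _ in range(10_000)]

def batch_gradient_spread(batch_size, trials=500):
    """Standard deviation of the mini-batch mean gradient across random batches."""
    estimates = [
        statistics.fmean(random.sample(per_example_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

for size in (4, 32, 256):
    print(size, round(batch_gradient_spread(size), 3))
```

The spread shrinks as the batch grows, roughly with the square root of the batch size, which is why large batches give smoother but more expensive gradient estimates.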

Why Training Loss Goes Down but Not Always Smoothly

People often expect loss curves to drop smoothly. In reality, mini-batch training introduces noise.

Why?

  • each batch is only a sample of the dataset
  • different batches have different difficulty
  • the gradient is only an estimate of the full-data gradient

So it is normal for batch-level loss to bounce around even while the broader trend is improving.

That is not failure. That is just stochastic optimization doing its thing.
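
A common way to read a bouncy curve is to smooth it. Here is a minimal sketch using an exponential moving average. The batch losses are made-up numbers and smooth is a hypothetical helper.

```python
def smooth(values, beta=0.9):
    """Exponential moving average; higher beta means heavier smoothing."""
    smoothed, running = [], values[0]
    for v in values:
        running = beta * running + (1 - beta) * v
        smoothed.append(running)
    return smoothed

# Made-up batch losses: a downward trend with bounce on top.
batch_losses = [2.0, 2.4, 1.7, 2.1, 1.5, 1.9, 1.2, 1.6, 1.0, 1.3]
for raw, avg in zip(batch_losses, smooth(batch_losses)):
    print(f"raw={raw:.2f}  smoothed={avg:.2f}")
```

The raw values jump by several tenths from batch to batch, while the smoothed values move slowly downward.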

What Convergence Means

Convergence does not necessarily mean reaching the perfect global minimum.

In practical deep learning, convergence usually means something closer to:

  • training loss is no longer improving much
  • validation performance has stabilized
  • further training gives diminishing returns

That is enough to stop in many real workflows.
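
One simple way to turn "no longer improving much" into a rule is a patience check on validation loss. This is a sketch of one possible criterion, not a standard definition. The patience and min_delta thresholds and the loss history are assumptions.

```python
def has_plateaued(val_losses, patience=3, min_delta=1e-3):
    """True if the last `patience` epochs failed to beat the earlier best by `min_delta`."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

history = [0.90, 0.70, 0.61, 0.58, 0.5796, 0.5801, 0.5797]   # made-up numbers
print(has_plateaued(history[:4]))   # False: still improving
print(has_plateaued(history))       # True: the last 3 epochs gained less than min_delta
```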

Too Small, Too Large, and Just Right Learning Rates

The training loop is highly sensitive to learning rate.

Too small

  • training progresses very slowly
  • loss decreases, but painfully
  • many epochs may be wasted

Too large

  • loss may oscillate or diverge
  • updates overshoot good regions
  • training may become numerically unstable

Reasonable learning rate

  • loss decreases efficiently
  • training is stable enough to make progress
  • convergence takes far fewer steps than with a too-small rate

A lot of “model problems” are really optimization-setting problems, especially bad learning-rate choices.
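
The three regimes show up even on a toy problem. The sketch below minimizes f(w) = w**2 with plain gradient descent. The function and the three learning rates are chosen purely for illustration.

```python
def run_gradient_descent(lr, steps=50, w=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final |w|."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return abs(w)

for label, lr in [("too small", 0.001), ("reasonable", 0.3), ("too large", 1.1)]:
    print(f"{label:>10}  lr={lr:<5}  |w| after 50 steps = {run_gradient_descent(lr):.3g}")
```

With the small rate w has barely moved from 1.0, with the reasonable rate it is essentially 0, and with the large rate each step overshoots and |w| grows into the thousands.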

Training Loss vs Validation Loss

To understand convergence properly, you usually track both:

  • training loss
  • validation loss

Good fit

  • both losses go down
  • both stay reasonably close

Underfitting

  • both losses stay high
  • model is not learning enough

Overfitting

  • training loss keeps dropping
  • validation loss stops improving or starts rising

This is why “lowest training loss” is not always the best checkpoint.

The goal is not memorization. The goal is generalization.
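
A minimal sketch of picking a checkpoint by validation loss rather than training loss. The two loss curves are made-up numbers shaped like the overfitting pattern above, and best_epoch is a hypothetical helper.

```python
def best_epoch(losses):
    """1-based index of the epoch with the lowest loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

# Made-up curves: training keeps improving, validation turns around.
train_losses = [1.20, 0.80, 0.55, 0.40, 0.28, 0.19]
val_losses = [1.25, 0.90, 0.72, 0.70, 0.76, 0.85]

print(best_epoch(train_losses))  # 6: training loss is still falling
print(best_epoch(val_losses))    # 4: validation loss bottoms out, then rises
```

Here the checkpoint worth keeping is epoch 4, even though epoch 6 has the lowest training loss.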

Why Shuffle the Data

Training data is usually shuffled each epoch.

Why?

  • avoids pathological ordering effects
  • makes batches more representative
  • improves stochastic optimization behavior

If similar examples are clustered together and never shuffled, long runs of consecutive updates all pull the weights toward one slice of the data, so training can oscillate or drift toward whatever it saw most recently.
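
A small sketch of the effect, assuming a dataset stored sorted by class. The labels, batch size, and helper name are made up. In PyTorch the equivalent is passing shuffle=True to the DataLoader.

```python
import random

labels = [0] * 6 + [1] * 6        # dataset stored sorted by class
batch_size = 4

def batches_for_epoch(items, shuffle, rng):
    """Return the epoch's batches, reshuffling a copy of the data if asked."""
    order = list(items)
    if shuffle:
        rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

rng = random.Random(0)
print(batches_for_epoch(labels, shuffle=False, rng=rng))  # first and last batch are single-class
print(batches_for_epoch(labels, shuffle=True, rng=rng))   # classes typically mixed within batches
print(batches_for_epoch(labels, shuffle=True, rng=rng))   # reshuffled again for the next epoch
```

Without shuffling, the first batch teaches the model only about class 0 and the last only about class 1, every epoch, in the same order.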

Why We Zero Gradients

In PyTorch, gradients accumulate by default.

That is why a typical loop includes:

optimizer.zero_grad()
loss.backward()
optimizer.step()

If you skip zero_grad(), gradients from previous steps will accumulate, which is usually wrong unless you are intentionally doing gradient accumulation.
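
The accumulation behavior can be mimicked with a toy class. This is a plain-Python stand-in written for illustration, not PyTorch's implementation. ToyParam and its methods are hypothetical.

```python
class ToyParam:
    """Plain-Python stand-in for a parameter whose gradient buffer accumulates."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def backward(self, new_grad):
        self.grad += new_grad      # accumulate, do not overwrite

    def zero_grad(self):
        self.grad = 0.0

p = ToyParam(1.0)
p.backward(0.5)
p.backward(0.5)          # no zero_grad() in between
print(p.grad)            # 1.0: twice the gradient of a single step

p.zero_grad()
p.backward(0.5)
print(p.grad)            # 0.5: the gradient of this step only
```

An optimizer step taken on the first value would move the weight twice as far as intended.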

Gradient Accumulation

Sometimes the ideal batch size does not fit in memory.

One workaround is gradient accumulation:

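  • process several smaller micro-batches
  • scale each micro-batch loss by 1 / (number of micro-batches) so the accumulated gradient is an average, not a sum
  • call loss.backward() on each
  • delay optimizer.step()
  • update only after accumulating enough gradients, then zero the gradients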

This simulates a larger effective batch size without needing all examples in memory at once.
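
The sketch below checks the idea numerically in plain Python: averaging scaled micro-batch gradients gives the same gradient as one large batch. The one-weight model, data, and names are assumptions. In a PyTorch loop the same scaling is commonly done by dividing the loss by the number of micro-batches before calling loss.backward().

```python
def mean_grad(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w, micro_batch_size = 0.5, 2
accum_steps = len(xs) // micro_batch_size          # 4 micro-batches per update

accumulated = 0.0
for i in range(0, len(xs), micro_batch_size):
    micro_grad = mean_grad(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
    accumulated += micro_grad / accum_steps        # scale so the total is a mean, not a sum

print(accumulated, mean_grad(w, xs, ys))           # both -76.5
# Only now would the optimizer step and the gradient buffer be reset.
```

The equality relies on every micro-batch having the same size. With uneven micro-batches the scaling has to be weighted by example count.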

A Minimal Training Loop in PyTorch

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

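# Synthetic data: 1,000 examples, 10 features, random binary labels.
# The labels carry no signal, so expect the loss to stay near ln(2), about 0.69.
X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))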

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 2)
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

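for epoch in range(5):
    running_loss = 0.0
    for batch_x, batch_y in loader:
        logits = model(batch_x)             # forward pass
        loss = loss_fn(logits, batch_y)     # mean cross-entropy over the batch

        optimizer.zero_grad()               # clear gradients from the previous step
        loss.backward()                     # backward pass
        optimizer.step()                    # weight update

        # Weight by batch size so the smaller final batch is averaged correctly.
        running_loss += loss.item() * batch_x.size(0)

    # Report the epoch average rather than the last, noisy batch loss.
    print(f"epoch={epoch+1}, loss={running_loss / len(dataset):.4f}")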

This example has all the standard pieces:

  • mini-batches
  • forward pass
  • loss
  • backward pass
  • optimizer step
  • multiple epochs

Common Misunderstandings

“One epoch means one gradient update”

No. One epoch may contain hundreds or thousands of updates depending on dataset size and batch size.

“One batch means one sample”

No. A batch usually contains many samples.

“Convergence means zero loss”

Not in practice. It usually means the model has stopped improving meaningfully.

“More epochs always means better results”

No. More epochs can eventually lead to overfitting.

“Batch size is only a hardware choice”

No. It affects both hardware efficiency and optimization dynamics.

Practical Mental Model

Think of training like repeated course correction:

  • each batch gives the model a noisy hint about what it is doing wrong
  • each backward pass converts that hint into gradients
  • each optimizer step nudges the parameters
  • many nudges, accumulated over time, produce learning

That is why deep learning is iterative by nature. One step learns almost nothing. A well-tuned sequence of steps learns a lot.

Summary

  • The training loop repeats: forward pass, loss, backward pass, optimizer update.
  • A batch is a subset of examples processed together before one weight update.
  • An epoch is one full pass through the dataset.
  • An iteration or step is usually one batch processed and one optimizer update.
  • Batch size affects memory, throughput, stability, and generalization.
  • Convergence means training has mostly stopped improving in a meaningful way, not necessarily that loss is zero.
  • Good training practice depends on monitoring both optimization progress and validation behavior.

Once these terms are clear, training logs stop looking like random numbers and start looking like a readable story of how the model is learning.