Loss functions answer one basic question:

How wrong is the model right now?

Without a loss function, a neural network has no way to measure its own mistakes, and without that measurement, gradient-based training has nothing to optimize.

Two of the most important losses in machine learning are:

  • Mean Squared Error (MSE)
  • Cross-Entropy

They are both common. They are both differentiable. But they solve different kinds of problems, and using the wrong one makes training harder than it needs to be.

The Core Distinction

Use MSE when the target is a continuous numeric value.

Examples:

  • house price
  • temperature
  • demand forecast
  • stock volatility estimate

Use cross-entropy when the target is a class or probability distribution.

Examples:

  • cat vs dog classification
  • spam vs not spam
  • predicting the next token in an LLM
  • multi-class image classification

That is the big split:

  • regression -> MSE
  • classification -> cross-entropy

[Figure: Decision guide for MSE vs cross-entropy]

A quick rule that works most of the time: if the target is a number, start with MSE; if the target is a class distribution, start with cross-entropy.

Mean Squared Error (MSE)

The formula is:

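MSE = (1/n) * sum_i (y_i - y_hat_i)^2

Here y_i is the target for example i, y_hat_i is the model's prediction, and n is the number of examples.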

For each example:

  1. subtract prediction from target
  2. square the error
  3. average across examples
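The three steps above can be sketched in plain Python. The function name `mse` and the sample values are illustrative assumptions, not part of any library.

```python
def mse(targets, predictions):
    """Mean squared error over two equal-length lists of numbers."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    # Step 1: subtract prediction from target.
    errors = [y - y_hat for y, y_hat in zip(targets, predictions)]
    # Step 2: square each error.
    squared = [e ** 2 for e in errors]
    # Step 3: average across examples.
    return sum(squared) / len(squared)

loss = mse([1.0, 2.0], [1.0, 4.0])  # errors are 0 and -2
print(loss)  # 2.0
```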

Why square the error?

  • it makes all errors positive
  • it penalizes larger mistakes more heavily
  • it is smooth and easy to differentiate

MSE Intuition

Suppose the true values are:

[3, 5, 2]

and the model predicts:

[4, 2, 3]

Then the errors (target minus prediction) are:

[-1, 3, -1]

Squared errors:

[1, 9, 1]

Average:

MSE = (1 + 9 + 1) / 3 = 11/3 ≈ 3.67

Notice what happened:

  • the error of 3 became 9
  • large mistakes dominate the loss

That is often desirable in regression.
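The worked example can be checked in a few lines. This sketch also computes how much of the total squared error comes from the single largest miss; the variable names are illustrative.

```python
targets = [3, 5, 2]
predictions = [4, 2, 3]

squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)]
mse_value = sum(squared_errors) / len(squared_errors)

# Fraction of the total squared error contributed by the largest miss.
largest_share = max(squared_errors) / sum(squared_errors)

print(squared_errors)           # [1, 9, 1]
print(round(mse_value, 4))      # 3.6667
print(round(largest_share, 3))  # 0.818
```

One of three predictions accounts for roughly 82% of the loss.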

[Figure: Mean squared error visual]

MSE does not treat every miss equally. Once errors are squared, larger misses dominate the average.

Where MSE Fits Best

MSE is a natural choice when:

  • outputs are real-valued numbers
  • distance between prediction and truth matters directly
  • large deviations should be punished strongly

Typical applications:

  • forecasting
  • regression benchmarks
  • continuous control targets
  • reconstruction losses in some autoencoder setups

Cross-Entropy Loss

Cross-entropy is used when the model outputs probabilities over classes.

For binary classification, the loss is:

L = -[y * log(p) + (1 - y) * log(1 - p)]

Where:

  • y is the true label (0 or 1)
  • p is the predicted probability of class 1
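The binary formula translates almost directly into code. This is a minimal sketch for a single example; the function name and the `eps` clamp are assumptions added here so that log() never receives 0.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example: y is 0 or 1, p is the predicted P(class 1)."""
    # Clamp p away from 0 and 1 (assumed safeguard, not part of the formula).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(round(binary_cross_entropy(1, 0.9), 4))  # 0.1054
print(round(binary_cross_entropy(0, 0.9), 4))  # 2.3026
```

The same prediction of 0.9 is cheap when the label is 1 and expensive when the label is 0.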

For multi-class classification, the idea generalizes:

  • the model outputs a probability distribution
  • cross-entropy measures how far that predicted distribution is from the true one

For a one-hot target, every term except the one for the correct class is multiplied by zero, so the loss reduces exactly to:

negative log probability assigned to the correct class
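To see the reduction concretely, the sketch below computes the full sum over classes and compares it with the shortcut. The probabilities and the function name are made up for illustration.

```python
import math

def cross_entropy(target_dist, predicted_dist):
    """-sum_i t_i * log(q_i), skipping classes whose target weight is 0."""
    return -sum(t * math.log(q)
                for t, q in zip(target_dist, predicted_dist) if t > 0)

predicted = [0.2, 0.5, 0.3]  # illustrative predicted probabilities
one_hot = [0, 1, 0]          # correct class is index 1

full = cross_entropy(one_hot, predicted)
shortcut = -math.log(predicted[1])

print(round(full, 4), round(shortcut, 4))  # 0.6931 0.6931
```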

Cross-Entropy Intuition

Suppose the true class is cat.

The model predicts:

cat: 0.70
dog: 0.20
bird: 0.10

That is decent. The model assigned fairly high probability to the correct class.

Now compare with:

cat: 0.05
dog: 0.90
bird: 0.05

This is much worse, and cross-entropy punishes it heavily because the model was not just wrong. It was confidently wrong.

That is one of cross-entropy’s key properties.
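Putting numbers on the two predictions makes the gap visible. This sketch assumes a one-hot target on cat and uses the natural log.

```python
import math

decent = {"cat": 0.70, "dog": 0.20, "bird": 0.10}
confidently_wrong = {"cat": 0.05, "dog": 0.90, "bird": 0.05}
true_class = "cat"

loss_decent = -math.log(decent[true_class])
loss_wrong = -math.log(confidently_wrong[true_class])

print(round(loss_decent, 3))  # 0.357
print(round(loss_wrong, 3))   # 2.996
```

The second prediction costs more than eight times as much as the first.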

[Figure: Cross-entropy probability penalty visual]

With a one-hot target, cross-entropy depends only on the probability assigned to the true class. As that probability approaches 0, the loss grows without bound.

Why MSE Feels Natural for Regression

If your model predicts a house price of $410,000 and the truth is $400,000, then the most natural question is:

How far off was the number?

MSE measures exactly that kind of numeric error.

There is no notion of “class probability” here. You do not care whether the model was “70% house-price-ish.” You care about distance in value.

Why Cross-Entropy Feels Natural for Classification

If your model predicts whether an email is spam, the key question is not:

How numerically far was my output from 1?

The key question is:

How much probability did the model assign to the correct class?

Cross-entropy is built for exactly that.

It aligns the loss with the probabilistic nature of classification.

Why Cross-Entropy Usually Beats MSE for Classification

You technically can use MSE for classification in some settings, but it is usually the wrong tool.

Reasons:

1. Cross-entropy gives better gradients

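When classification outputs come from sigmoid or softmax, the gradient of cross-entropy with respect to the logits simplifies to the predicted probability minus the target. A badly wrong prediction therefore always produces a large gradient, which is not guaranteed with MSE.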

2. Cross-entropy matches the probabilistic objective

Classification is fundamentally about assigning high probability to the correct class. Cross-entropy measures that directly.

3. MSE can slow learning in classification

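When a sigmoid output p saturates near 0 or 1, its derivative p(1 - p) is close to zero. MSE multiplies the error by that derivative, so even a confidently wrong prediction can produce a tiny gradient and slow optimization.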

This is one reason logistic regression and neural classifiers are normally trained with cross-entropy, not MSE.
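The gradient difference can be shown for a single sigmoid output. This is a sketch using the closed-form derivatives with respect to the logit z; the function names and the chosen logit are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_bce_wrt_logit(z, y):
    # d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z) is p - y.
    return sigmoid(z) - y

def grad_mse_wrt_logit(z, y):
    # d/dz of (p - y)^2 with p = sigmoid(z) is 2*(p - y)*p*(1 - p).
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

z, y = -6.0, 1  # confidently wrong: p is about 0.0025, label is 1
g_bce = grad_bce_wrt_logit(z, y)
g_mse = grad_mse_wrt_logit(z, y)

print(round(g_bce, 4))  # close to -1
print(round(g_mse, 4))  # close to 0
```

Both losses agree the prediction is wrong, but only cross-entropy passes a strong correction back to the logit.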

Binary Classification: Sigmoid + Cross-Entropy

In binary classification, a common setup is:

  • final linear layer
  • sigmoid activation
  • binary cross-entropy loss

The sigmoid converts the output into a probability between 0 and 1.

Then binary cross-entropy asks:

How much probability did the model assign to the correct answer?

This pairing is standard because the modeling assumptions and the loss fit each other.
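Because sigmoid and binary cross-entropy fit together, the two steps can be merged into one expression that works directly on the logit. The sketch below shows one common numerically stable rearrangement. It is an illustration, not any particular library's implementation.

```python
import math

def bce_with_logits(z, y):
    """Binary cross-entropy from a raw logit z and a label y in {0, 1}."""
    # Algebraically equal to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z),
    # rearranged so exp() is only ever called on a non-positive number.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

print(round(bce_with_logits(2.0, 1), 4))  # 0.1269
print(bce_with_logits(-800.0, 1))         # 800.0
```

Computing sigmoid(-800) the naive way would call math.exp(800), which raises OverflowError in Python. The merged form avoids that.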

Multi-Class Classification: Softmax + Cross-Entropy

For multiple mutually exclusive classes, the common setup is:

  • output layer with one logit per class
  • softmax to convert logits into probabilities
  • cross-entropy loss

Example:

Classes: cat, dog, bird
Logits -> softmax -> probabilities

Then cross-entropy penalizes the model based on how much probability it assigned to the true class.
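The logits -> softmax -> loss pipeline can be sketched without any framework. The logit values are illustrative, and the helper name `softmax` is defined here rather than imported.

```python
import math

def softmax(logits):
    # Subtracting the max logit leaves the result unchanged but avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "bird"]
logits = [2.0, 0.5, -1.0]  # illustrative raw scores
probs = softmax(logits)

true_index = classes.index("cat")
loss = -math.log(probs[true_index])

print([round(p, 3) for p in probs])  # [0.786, 0.175, 0.039]
print(round(loss, 3))                # 0.241
```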

This is also exactly what happens in next-token prediction for language models, except the “classes” are vocabulary tokens.

MSE vs Cross-Entropy by Example

Suppose true label = 1.

Two models output:

  • Model A: 0.9
  • Model B: 0.6

With MSE:

  • Model A loss = (1 - 0.9)^2 = 0.01
  • Model B loss = (1 - 0.6)^2 = 0.16

So Model B is worse, as expected.

But now imagine a badly wrong prediction:

  • Model C: 0.01

MSE gives:

(1 - 0.01)^2 = 0.9801

Cross-entropy gives:

-log(0.01) ≈ 4.605

Cross-entropy punishes confident wrongness much more aggressively, which is exactly what we usually want in classification.
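The comparison for all three models can be reproduced with a short loop. This sketch assumes the label is 1, so binary cross-entropy is just -log(p).

```python
import math

outputs = {"Model A": 0.9, "Model B": 0.6, "Model C": 0.01}  # predicted P(label = 1)
y = 1

results = {}
for name, p in outputs.items():
    mse_loss = (y - p) ** 2
    ce_loss = -math.log(p)  # binary cross-entropy when y = 1
    results[name] = (mse_loss, ce_loss)
    print(f"{name}: MSE={mse_loss:.4f}  CE={ce_loss:.4f}")

# Model A: MSE=0.0100  CE=0.1054
# Model B: MSE=0.1600  CE=0.5108
# Model C: MSE=0.9801  CE=4.6052
```

For outputs between 0 and 1, MSE can never exceed 1, while cross-entropy keeps growing as the probability on the correct class approaches 0.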

Regression with MSE: Why Squaring Helps and Hurts

MSE has a useful property:

  • bigger mistakes matter much more than smaller ones

That helps when large errors are genuinely much worse.

But it also means MSE is sensitive to outliers. If your dataset has a few extreme targets, they can dominate training.

In such settings, people sometimes prefer alternatives like MAE or Huber loss. Still, MSE remains the default baseline for regression.
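A small comparison shows the outlier effect. The residuals below are made up, and the Huber threshold `delta=1.0` is an assumed default for illustration.

```python
def mse_from_errors(errors):
    return sum(e ** 2 for e in errors) / len(errors)

def mae_from_errors(errors):
    return sum(abs(e) for e in errors) / len(errors)

def huber_from_errors(errors, delta=1.0):
    # Quadratic for |e| <= delta, linear beyond it.
    def one(e):
        a = abs(e)
        return 0.5 * a ** 2 if a <= delta else delta * (a - 0.5 * delta)
    return sum(one(e) for e in errors) / len(errors)

clean = [0.5, -0.5, 0.5, -0.5]
with_outlier = [0.5, -0.5, 0.5, -20.0]  # one extreme residual

print(mse_from_errors(clean), mse_from_errors(with_outlier))      # 0.25 100.1875
print(mae_from_errors(clean), mae_from_errors(with_outlier))      # 0.5 5.375
print(huber_from_errors(clean), huber_from_errors(with_outlier))  # 0.125 4.96875
```

One outlier multiplies MSE by about 400, while MAE rises by a factor of about 11 and Huber by about 40.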

The Probabilistic View

Another way to think about this:

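  • minimizing MSE is equivalent to maximum likelihood estimation when the targets are assumed to equal the prediction plus Gaussian noise of constant variance
  • minimizing cross-entropy is equivalent to maximum likelihood estimation under a Bernoulli (binary) or categorical (multi-class) model of the labels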

You do not need to work from those probabilistic derivations every day, but they explain why these losses are not arbitrary formulas. They correspond to different output assumptions.
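The Gaussian link can be checked numerically. In this sketch the noise level `sigma` is an assumed constant. The Gaussian negative log-likelihood and a scaled MSE differ only by a term that does not depend on the predictions, so both are minimized by the same model.

```python
import math

def gaussian_nll(targets, predictions, sigma=1.0):
    """Negative log-likelihood of targets under N(prediction, sigma^2) noise."""
    n = len(targets)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse / (2 * sigma ** 2)

def mse(targets, predictions):
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / len(targets)

targets = [3.0, 5.0, 2.0]
n, sigma = len(targets), 1.0

gaps = []
for preds in ([4.0, 2.0, 3.0], [3.0, 4.0, 2.5]):
    gap = gaussian_nll(targets, preds, sigma) - n / (2 * sigma ** 2) * mse(targets, preds)
    gaps.append(gap)
    print(round(gap, 6))  # the same constant for both sets of predictions
```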

A Tiny PyTorch Example

MSE for regression

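import torch
import torch.nn.functional as F

# Predicted and true values for three regression examples.
pred = torch.tensor([2.5, 0.0, 2.1])
target = torch.tensor([3.0, -0.5, 2.0])

# Mean of the squared differences: (0.25 + 0.25 + 0.01) / 3
loss = F.mse_loss(pred, target)
print(loss.item())  # approximately 0.17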

Cross-entropy for classification

import torch
import torch.nn.functional as F

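# One example with raw, unnormalized scores (logits) for three classes.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])  # correct class index

# Log-softmax over the logits, then negative log-probability of the target class.
loss = F.cross_entropy(logits, target)
print(loss.item())  # approximately 0.2413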

Notice that cross_entropy expects raw logits, not already-softmaxed probabilities. The function applies a numerically stable log-softmax internally and then computes the negative log-likelihood of the target class.
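Why work from logits at all? A plain-Python sketch of the log-sum-exp idea shows the benefit. It illustrates the technique only and is not PyTorch's actual implementation; the function name is an assumption.

```python
import math

def cross_entropy_from_logits(logits, target_index):
    # log-sum-exp trick: shift by the max so every exponent is <= 0.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[target]) == log_sum_exp - logits[target]
    return log_sum_exp - logits[target_index]

print(round(cross_entropy_from_logits([2.0, 0.5, -1.0], 0), 4))  # 0.2413
print(cross_entropy_from_logits([1000.0, 0.0, -1000.0], 0))      # 0.0
```

A naive softmax on the second input would call math.exp(1000.0), which raises OverflowError. Working in log space on the raw logits avoids the problem.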

Common Mistakes

Using MSE for a standard classification model

This usually trains more slowly than cross-entropy, because the MSE gradient shrinks toward zero when a sigmoid or softmax output saturates, even if the prediction is wrong.

Applying softmax before cross_entropy in PyTorch

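Do not do that. torch.nn.functional.cross_entropy expects logits and applies log-softmax itself, so passing probabilities means softmax is applied twice. No error is raised, but the loss is computed from a flattened distribution.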
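The effect of the extra softmax can be imitated in plain Python. This sketch mimics the softmax-then-negative-log behavior with a hand-written helper, so the function names are assumptions rather than PyTorch calls.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target_index):
    # Softmax over the input, then -log of the target entry.
    return -math.log(softmax(logits)[target_index])

logits = [2.0, 0.5, -1.0]
correct = cross_entropy_from_logits(logits, 0)
# The mistake: probabilities are passed where logits are expected.
double = cross_entropy_from_logits(softmax(logits), 0)

print(round(correct, 3))  # 0.241
print(round(double, 3))   # noticeably larger, about 0.70
```

Because probabilities lie between 0 and 1, the second softmax squeezes them toward a uniform distribution, so the loss stays high even when the model is right.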

Treating cross-entropy like “just another error distance”

It is not primarily measuring numeric distance. It is measuring mismatch between predicted probabilities and the target distribution.

Forgetting the task-output-loss alignment

Your final layer, activation choice, and loss function need to fit together.

Quick Decision Rule

Use:

  • MSE for predicting numbers
  • cross-entropy for predicting classes

That rule alone will get you through most practical cases.

Summary

  • MSE measures squared numeric error and is the standard loss for regression.
  • Cross-entropy measures how much probability the model assigns to the correct class and is the standard loss for classification.
  • MSE is intuitive for continuous targets.
  • Cross-entropy is better aligned with probabilistic classification and usually produces better gradients there.
  • In modern deep learning, picking the right loss is not a detail. It defines what “good prediction” even means.

If the model output is a number, think MSE first. If the model output is a class probability distribution, think cross-entropy first.