Loss functions answer one basic question:
How wrong is the model right now?
Without a loss function, a neural network has no way to measure its own mistakes, and without that measurement, gradient-based training has nothing to optimize.
Two of the most important losses in machine learning are:
- Mean Squared Error (MSE)
- Cross-Entropy
They are both common. They are both differentiable. But they solve different kinds of problems, and using the wrong one makes training harder than it needs to be.
The Core Distinction
Use MSE when the target is a continuous numeric value.
Examples:
- house price
- temperature
- demand forecast
- stock volatility estimate
Use cross-entropy when the target is a class or probability distribution.
Examples:
- cat vs dog classification
- spam vs not spam
- predicting the next token in an LLM
- multi-class image classification
That is the big split:
- regression -> MSE
- classification -> cross-entropy
A quick rule that works most of the time: if the target is a number, start with MSE; if the target is a class distribution, start with cross-entropy.
Mean Squared Error (MSE)
The formula is:
MSE = (1/n) * sum((y_i - y_hat_i)^2)
For each example:
- subtract prediction from target
- square the error
- average across examples
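The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation; the function name `mean_squared_error`, its argument names, and the sample values are chosen here for the example.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    # Step 1: subtract prediction from target; step 2: square the error
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)]
    # Step 3: average across examples
    return sum(squared_errors) / len(squared_errors)


# Illustrative values: two perfect predictions and one that is off by 2
loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(loss)
```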
Why square the error?
- it makes all errors positive
- it penalizes larger mistakes more heavily
- it is smooth and easy to differentiate
MSE Intuition
Suppose the true values are:
[3, 5, 2]
and the model predicts:
[4, 2, 3]
Then the errors (target minus prediction) are:
[-1, 3, -1]
Squared errors:
[1, 9, 1]
Average:
MSE = (1 + 9 + 1) / 3 = 11/3 ≈ 3.67
Notice what happened:
- the error of 3 became 9
- large mistakes dominate the loss
That is often desirable in regression.
MSE does not treat every miss equally. Once errors are squared, larger misses dominate the average.
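To make that dominance concrete, the sketch below takes the true values and predictions from the walkthrough and computes what share of the total squared error each example contributes. The variable names are illustrative.

```python
targets = [3, 5, 2]
predictions = [4, 2, 3]

# Squared error per example
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)]
total = sum(squared_errors)

# Fraction of the total loss contributed by each example
shares = [se / total for se in squared_errors]

mse = total / len(targets)
print(squared_errors)  # [1, 9, 1]
print(mse)
print(shares)
```

The middle example is one of three data points, yet it accounts for 9/11 (about 82%) of the loss.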
Where MSE Fits Best
MSE is a natural choice when:
- outputs are real-valued numbers
- distance between prediction and truth matters directly
- large deviations should be punished strongly
Typical applications:
- forecasting
- regression benchmarks
- continuous control targets
- reconstruction losses in some autoencoder setups
Cross-Entropy Loss
Cross-entropy is used when the model outputs probabilities over classes.
For binary classification, the loss is:
L = -[y * log(p) + (1 - y) * log(1 - p)]
Where:
- y is the true label (0 or 1)
- p is the predicted probability of class 1
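A direct translation of the binary formula into Python might look like the following. It is only a sketch: the function name `binary_cross_entropy` and the clamping constant `eps` are assumptions added here so that `log(0)` is never evaluated, and they are not part of the formula itself.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example: y is the true label (0 or 1), p is P(class 1)."""
    # Clamp p away from 0 and 1 so the logarithm stays finite (illustrative choice)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(binary_cross_entropy(1, 0.9))  # true class 1, confident and right: small loss
print(binary_cross_entropy(1, 0.1))  # true class 1, confident and wrong: large loss
print(binary_cross_entropy(0, 0.1))  # true class 0, confident and right: small loss
```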
For multi-class classification, the idea generalizes:
- the model outputs a probability distribution
- cross-entropy measures how far that predicted distribution is from the true one
In practice, for a one-hot target, every term except the one for the correct class vanishes, so the loss reduces to:
negative log probability assigned to the correct class
Cross-Entropy Intuition
Suppose the true class is cat.
The model predicts:
cat: 0.70
dog: 0.20
bird: 0.10
That is decent. The model assigned fairly high probability to the correct class.
Now compare with:
cat: 0.05
dog: 0.90
bird: 0.05
This is much worse, and cross-entropy punishes it heavily because the model was not just wrong. It was confidently wrong.
That is one of cross-entropy’s key properties.
Cross-entropy mainly cares about the probability on the true class. If that probability is tiny, the loss gets large fast.
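The two cat predictions above can be scored with a short sketch of the general multi-class formula. The function name `cross_entropy` and the variable names are illustrative; the probabilities are the ones from the example.

```python
import math


def cross_entropy(target_dist, predicted_dist):
    """General form: -sum over classes of target * log(predicted)."""
    return -sum(
        t * math.log(q)
        for t, q in zip(target_dist, predicted_dist)
        if t > 0  # classes with zero target probability contribute nothing
    )


classes = ["cat", "dog", "bird"]
one_hot_cat = [1.0, 0.0, 0.0]  # the true class is cat

decent = [0.70, 0.20, 0.10]
confidently_wrong = [0.05, 0.90, 0.05]

loss_decent = cross_entropy(one_hot_cat, decent)
loss_wrong = cross_entropy(one_hot_cat, confidently_wrong)

print(loss_decent)  # equals -log(0.70)
print(loss_wrong)   # equals -log(0.05)
```

The first loss is roughly 0.36 and the second roughly 3.0, more than eight times larger, even though both predictions are valid probability distributions.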
Why MSE Feels Natural for Regression
If your model predicts a house price of $410,000 and the truth is $400,000, then the most natural question is:
How far off was the number?
MSE measures exactly that kind of numeric error.
There is no notion of “class probability” here. You do not care whether the model was “70% house-price-ish.” You care about distance in value.
Why Cross-Entropy Feels Natural for Classification
If your model predicts whether an email is spam, the key question is not:
How numerically far was my output from 1?
The key question is:
How much probability did the model assign to the correct class?
Cross-entropy is built for exactly that.
It aligns the loss with the probabilistic nature of classification.
Why Cross-Entropy Usually Beats MSE for Classification
You technically can use MSE for classification in some settings, but it is usually the wrong tool.
Reasons:
1. Cross-entropy gives better gradients
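When classification outputs come from sigmoid or softmax, the gradient of cross-entropy with respect to the logits is simply the predicted probability minus the target, so a badly wrong prediction always produces a large gradient. MSE offers no such guarantee.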
2. Cross-entropy matches the probabilistic objective
Classification is fundamentally about assigning high probability to the correct class. Cross-entropy measures that directly.
3. MSE can slow learning in classification
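If the model output saturates (for example, a sigmoid output very close to 0 or 1), the MSE gradient is scaled by the activation's derivative, which is near zero there. Learning then slows down exactly when the model is confidently wrong.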
This is one reason logistic regression and neural classifiers are normally trained with cross-entropy, not MSE.
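The gradient argument can be checked numerically for a single sigmoid output. The sketch below uses the closed-form derivatives with respect to the logit `z`; the function names are illustrative, and the per-example MSE is taken as `(p - y)^2`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_cross_entropy(z, y):
    """d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y


def grad_mse(z, y):
    """d/dz of (p - y)^2 with p = sigmoid(z); note the extra p*(1-p) factor."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


y = 1.0  # true label
for z in [0.0, -2.0, -6.0]:  # increasingly confident wrong predictions
    print(z, sigmoid(z), grad_cross_entropy(z, y), grad_mse(z, y))
```

At `z = -6` the prediction is almost certainly wrong, yet the MSE gradient is close to zero while the cross-entropy gradient is close to -1.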
Binary Classification: Sigmoid + Cross-Entropy
In binary classification, a common setup is:
- final linear layer
- sigmoid activation
- binary cross-entropy loss
The sigmoid converts the output into a probability between 0 and 1.
Then binary cross-entropy asks:
How much probability did the model assign to the correct answer?
This pairing is standard because the modeling assumptions and the loss fit each other.
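Because sigmoid and binary cross-entropy fit together so closely, the two are often fused into one computation on the raw logit, which also avoids overflow. The sketch below shows one common rearrangement. The function names are made up for this example. PyTorch exposes a fused version as `F.binary_cross_entropy_with_logits`, but this code is not a copy of its implementation.

```python
import math


def bce_with_logit(z, y):
    """Binary cross-entropy computed from the raw logit z and label y (0 or 1).

    Algebraically equal to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z),
    but rearranged so that exp() never overflows for large |z|.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_naive(z, y):
    """Same loss via an explicit sigmoid; fine for moderate z only."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(bce_with_logit(2.0, 1), bce_naive(2.0, 1))  # the two agree
print(bce_with_logit(-800.0, 1))                  # still finite for an extreme logit
```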
Multi-Class Classification: Softmax + Cross-Entropy
For multiple mutually exclusive classes, the common setup is:
- output layer with one logit per class
- softmax to convert logits into probabilities
- cross-entropy loss
Example:
Classes: cat, dog, bird
Logits -> softmax -> probabilities
Then cross-entropy penalizes the model based on how much probability it assigned to the true class.
This is also exactly what happens in next-token prediction for language models, except the “classes” are vocabulary tokens.
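The logits -> softmax -> cross-entropy pipeline can be sketched without any framework. The logit values below are made up for illustration, and the helper names are not from any library. Subtracting the maximum logit is a common stability trick rather than part of the mathematical definition.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target_index):
    """-log softmax(logits)[target_index], computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


classes = ["cat", "dog", "bird"]
logits = [2.0, 0.5, -1.0]  # hypothetical raw scores, one per class
probs = softmax(logits)
loss = cross_entropy_from_logits(logits, classes.index("cat"))

print(dict(zip(classes, probs)))
print(loss)
```

For a language model the same code applies unchanged; the list of logits simply has one entry per vocabulary token.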
MSE vs Cross-Entropy by Example
Suppose true label = 1.
Two models output:
- Model A: 0.9
- Model B: 0.6
With MSE:
- Model A loss = (1 - 0.9)^2 = 0.01
- Model B loss = (1 - 0.6)^2 = 0.16
So Model B is worse, as expected.
But now imagine a badly wrong prediction:
- Model C: 0.01
MSE gives:
(1 - 0.01)^2 = 0.9801
Cross-entropy gives:
-log(0.01) ≈ 4.605
Cross-entropy punishes confident wrongness much more aggressively, which is exactly what we usually want in classification.
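The comparison for all three models can be reproduced with a short script. It is a sketch that assumes the true label is 1, so cross-entropy reduces to `-log(p)`; the dictionary and variable names are illustrative.

```python
import math

true_label = 1.0
outputs = {"Model A": 0.9, "Model B": 0.6, "Model C": 0.01}

results = {}
for name, p in outputs.items():
    mse = (true_label - p) ** 2
    ce = -math.log(p)  # cross-entropy when the true label is 1
    results[name] = (mse, ce)
    print(f"{name}: p={p:<5} MSE={mse:.4f}  CE={ce:.4f}")
```

For outputs between 0 and 1, the squared error can never exceed 1, while cross-entropy grows without bound as the probability on the true class approaches 0.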
Regression with MSE: Why Squaring Helps and Hurts
MSE has a useful property:
- bigger mistakes matter much more than smaller ones
That helps when large errors are genuinely much worse.
But it also means MSE is sensitive to outliers. If your dataset has a few extreme targets, they can dominate training.
In such settings, people sometimes prefer alternatives like MAE or Huber loss. Still, MSE remains the default baseline for regression.
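The outlier sensitivity is easy to see on a toy set of residuals. The sketch below compares MSE, MAE, and Huber loss with and without one extreme value. The residuals are made-up data and the threshold `delta=1.0` is an arbitrary choice for illustration. The Huber form used is the standard one, quadratic inside `delta` and linear outside.

```python
def mse(residuals):
    return sum(r ** 2 for r in residuals) / len(residuals)


def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)


def huber(residuals, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    total = 0.0
    for r in residuals:
        a = abs(r)
        total += 0.5 * a ** 2 if a <= delta else delta * (a - 0.5 * delta)
    return total / len(residuals)


clean = [0.5, -0.3, 0.2, -0.4]
with_outlier = clean + [20.0]  # one extreme residual (made-up data)

for name, fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(name, fn(clean), fn(with_outlier))
```

Adding the single outlier multiplies MSE by a factor of several hundred, while MAE and Huber grow far less.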
The Probabilistic View
Another way to think about this:
- minimizing MSE corresponds to maximum likelihood under Gaussian noise with a fixed variance
- minimizing cross-entropy corresponds to maximum likelihood under a Bernoulli or categorical model of the labels
You do not need to work from those probabilistic derivations every day, but they explain why these losses are not arbitrary formulas. They correspond to different output assumptions.
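For readers who want the correspondence in symbols, it can be written out as follows. This is a standard textbook sketch. The noise scale σ, the class count K, the one-hot target t, and the predicted probabilities p are notation introduced here for the derivation.

```latex
% Regression: assume y = \hat{y} + \varepsilon with
% \varepsilon \sim \mathcal{N}(0, \sigma^2) and \sigma fixed
-\log p(y \mid \hat{y})
  = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\!\left(2\pi\sigma^2\right)

% Averaged over n examples, the second term is constant and the first is a
% fixed multiple of the squared error, so minimizing the negative
% log-likelihood is equivalent to minimizing
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2

% Classification: assume the label follows a categorical distribution with
% probabilities p_1, \dots, p_K and one-hot target t
-\log p(t \mid p) = -\sum_{k=1}^{K} t_k \log p_k
```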
A Tiny PyTorch Example
MSE for regression
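import torch
import torch.nn.functional as F
# Model predictions and ground-truth targets for three examples
pred = torch.tensor([2.5, 0.0, 2.1])
target = torch.tensor([3.0, -0.5, 2.0])
# Mean of the squared differences (the default reduction is "mean")
loss = F.mse_loss(pred, target)
print(loss.item())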
Cross-entropy for classification
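import torch
import torch.nn.functional as F
# Raw, unnormalized scores for one example over three classes (batch of 1)
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])  # correct class index
# cross_entropy applies log-softmax to the logits internally
loss = F.cross_entropy(logits, target)
print(loss.item())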
Notice that cross_entropy expects raw logits, not already-softmaxed probabilities. The function applies a numerically stable log-softmax internally.
Common Mistakes
Using MSE for a standard classification model
This usually trains more slowly than cross-entropy, because the gradients shrink when the outputs saturate.
Applying softmax before cross_entropy in PyTorch
Do not do that. torch.nn.functional.cross_entropy expects raw logits and applies log-softmax itself, so an extra softmax distorts the loss.
Treating cross-entropy like “just another error distance”
It is not primarily measuring numeric distance. It is measuring mismatch between predicted probabilities and the target distribution.
Forgetting the task-output-loss alignment
Your final layer, activation choice, and loss function need to fit together.
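One concrete way the pieces can fail to fit together is the extra softmax mentioned above. The sketch below mimics it in plain Python. `cross_entropy_expecting_logits` is a hypothetical stand-in for any loss that applies log-softmax internally; it is not PyTorch's actual implementation.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax."""
    m = max(values)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum_exp for v in values]


def cross_entropy_expecting_logits(inputs, target_index):
    """Stand-in for a loss that applies log-softmax to whatever it receives."""
    return -log_softmax(inputs)[target_index]


logits = [2.0, 0.5, -1.0]
target_index = 0

# Correct: pass raw logits
correct_loss = cross_entropy_expecting_logits(logits, target_index)

# Mistake: apply softmax first, so softmax is effectively applied twice
probs = [math.exp(v) for v in log_softmax(logits)]
double_softmax_loss = cross_entropy_expecting_logits(probs, target_index)

print(correct_loss)
print(double_softmax_loss)
```

The second value is noticeably larger than the first. Because the inputs have been squeezed into the range 0 to 1, the loss can no longer approach zero however confident the model becomes, and that blunts the training signal.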
Quick Decision Rule
Use:
- MSE for predicting numbers
- cross-entropy for predicting classes
That rule alone will get you through most practical cases.
Summary
- MSE measures squared numeric error and is the standard loss for regression.
- Cross-entropy measures how much probability the model assigns to the correct class and is the standard loss for classification.
- MSE is intuitive for continuous targets.
- Cross-entropy is better aligned with probabilistic classification and usually produces better gradients there.
- In modern deep learning, picking the right loss is not a detail. It defines what “good prediction” even means.
If the model output is a number, think MSE first. If the model output is a class probability distribution, think cross-entropy first.