Code Beginner easy · 6 min

Loss functions: nn.CrossEntropyLoss, nn.MSELoss

What you will learn

Loss functions measure how wrong your model's predictions are, and PyTorch provides built-in functions optimized for classification and regression tasks.

Why this matters

Every model needs a loss function to learn: it's the signal that tells the optimizer which direction to adjust weights. Choosing the wrong loss function will make your model fail to train or converge slowly, even if everything else is correct.

Skip if: Don't use <code>nn.CrossEntropyLoss</code> if your outputs are already probabilities from a softmax layer (it expects raw logits). Don't use <code>nn.MSELoss</code> for classification tasks: it treats class labels as continuous values, which is mathematically wrong. Don't use either with default settings if you need a weighted loss (imbalanced datasets): <code>nn.CrossEntropyLoss</code> takes per-class weights through its <code>weight</code> argument, while <code>nn.MSELoss</code> has no such argument, so any weighting must be applied by hand.

Explanation

What it is: A loss function is a mathematical function that quantifies the difference between your model's predictions and the true labels. PyTorch provides optimized implementations: nn.CrossEntropyLoss for multi-class classification and nn.MSELoss for regression.

How it works: CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss internally: it expects raw logits (unbounded outputs) and class indices as integers, then outputs a single scalar loss value. MSELoss computes the mean squared difference between predictions and targets, useful when outputs and labels are continuous values. Both are differentiable, so gradients flow backward through them during backpropagation.

When to use: Use CrossEntropyLoss when you have discrete classes (cats vs dogs vs birds); use MSELoss when predicting continuous values (house price, temperature).
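The "How it works" description can be checked directly. The sketch below, using small made-up tensors, compares CrossEntropyLoss against an explicit nn.LogSoftmax followed by nn.NLLLoss, and compares MSELoss against a hand-written mean of squared differences. The variable names are illustrative only.

```python
import torch
import torch.nn as nn

# Made-up logits for 2 samples and 3 classes, with their true class indices
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 3.0, 0.2]])
labels = torch.tensor([0, 1])

# One step: raw logits in, scalar loss out
ce = nn.CrossEntropyLoss()(logits, labels)

# Two steps: log-softmax over the class dimension, then negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)

# MSELoss with the default reduction is the mean of the squared differences
preds = torch.tensor([1.5, 2.1, 3.0])
targets = torch.tensor([1.0, 2.5, 2.8])
mse = nn.MSELoss()(preds, targets)
mse_by_hand = ((preds - targets) ** 2).mean()

print(torch.allclose(ce, nll))           # True
print(torch.allclose(mse, mse_by_hand))  # True
```

Both comparisons use torch.allclose rather than exact equality, because the fused and the two-step computations can differ in the last floating-point digits.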

Analogy

Imagine you're teaching someone to throw darts. The loss function is how far each dart landed from the bullseye. <code>CrossEntropyLoss</code> is like saying 'you hit section 5 instead of section 1: that's wrong.' <code>MSELoss</code> is like measuring the exact distance in centimeters from where it landed to the center. The optimizer uses this feedback to adjust the person's next throw.

Code

python

import torch
import torch.nn as nn

# Example 1: CrossEntropyLoss for classification
print('=== CrossEntropyLoss (Classification) ===')

# Model outputs 3 raw logits per sample (unbounded)
model_outputs = torch.tensor([[2.0, 1.0, 0.1],
                               [0.5, 3.0, 0.2],
                               [1.0, 1.0, 2.5]])

# True class labels (0, 1, or 2)
true_labels = torch.tensor([0, 1, 2])

# Create loss function
ce_loss = nn.CrossEntropyLoss()

# Compute loss
loss_value = ce_loss(model_outputs, true_labels)
print(f'Loss: {loss_value.item():.4f}')
print(f'Loss shape: {loss_value.shape}')
print()

# Example 2: MSELoss for regression
print('=== MSELoss (Regression) ===')

# Model outputs continuous predictions
predictions = torch.tensor([[1.5], [2.1], [3.0]])

# True continuous values
targets = torch.tensor([[1.0], [2.5], [2.8]])

# Create loss function
mse_loss = nn.MSELoss()

# Compute loss
loss_value = mse_loss(predictions, targets)
print(f'Loss: {loss_value.item():.4f}')
print(f'Loss shape: {loss_value.shape}')
print()

# Example 3: Inside a training loop (simplified)
print('=== Loss in a Training Step ===')

batch_size = 4
num_classes = 3

# Random batch (unseeded, so the printed loss changes from run to run)
batch_logits = torch.randn(batch_size, num_classes)
batch_labels = torch.randint(0, num_classes, (batch_size,))

# Compute loss
ce_loss = nn.CrossEntropyLoss()
loss = ce_loss(batch_logits, batch_labels)

print(f'Batch loss: {loss.item():.4f}')
print('This scalar is what the optimizer minimizes')

Output

=== CrossEntropyLoss (Classification) ===
Loss: 0.3065
Loss shape: torch.Size([])

=== MSELoss (Regression) ===
Loss: 0.1500
Loss shape: torch.Size([])

=== Loss in a Training Step ===
Batch loss: 1.1285
This scalar is what the optimizer minimizes

What just happened?

We created two loss function objects, fed them model outputs and true labels, and got back scalar loss values. <code>CrossEntropyLoss</code> took raw logits and integer class indices, then internally applied log-softmax followed by negative log-likelihood. <code>MSELoss</code> computed the mean squared error between predictions and targets. Both returned a single scalar (0-dimensional tensor) that, with the default <code>reduction='mean'</code>, is the average error over the batch. The third example uses an unseeded random batch, so the value you see will differ from the one printed above.
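To see why a scalar is enough for learning, here is a minimal sketch of the backward pass. It reuses the logits from Example 1, marks them with <code>requires_grad=True</code> (standing in for the output of a real model), and checks the gradient against the closed form for mean-reduced cross-entropy, which is (softmax(logits) - one_hot(labels)) / batch_size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same logits as Example 1; requires_grad=True stands in for a real model's output
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 3.0, 0.2],
                       [1.0, 1.0, 2.5]], requires_grad=True)
labels = torch.tensor([0, 1, 2])

loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()  # populates logits.grad with d(loss)/d(logits)

# Closed-form gradient of mean-reduced cross-entropy with respect to the logits
probs = torch.softmax(logits.detach(), dim=1)
one_hot = F.one_hot(labels, num_classes=3).float()
expected_grad = (probs - one_hot) / logits.shape[0]

print(logits.grad.shape)                                       # torch.Size([3, 3])
print(torch.allclose(logits.grad, expected_grad, atol=1e-6))   # True
```

The gradient has one entry per logit, even though the loss is a single number. That per-logit gradient is what later flows back into the model's weights.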

Common gotcha

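The most common mistake: passing softmax-normalized probabilities to CrossEntropyLoss. You must pass raw logits (unbounded outputs directly from your final linear layer). If you apply softmax first, the loss applies it a second time: nothing crashes, but the inputs are squashed into [0, 1], the reported loss is wrong, and the gradients shrink, so training slows down or stalls. Similarly, beginners often use CrossEntropyLoss with float label targets instead of integer class indices: class-index targets must have torch.long dtype. (Float targets are only valid when they are per-class probabilities of shape (batch_size, num_classes).)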
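The softmax gotcha is easy to miss because it raises no error. The sketch below feeds the Example 1 logits to CrossEntropyLoss twice, once raw and once after an extra softmax, and prints both losses. Exact numbers are omitted on purpose. The point is only that the two values differ and that the double-softmax value is the larger one for this data.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 3.0, 0.2],
                       [1.0, 1.0, 2.5]])
labels = torch.tensor([0, 1, 2])
ce_loss = nn.CrossEntropyLoss()

# Intended usage: raw logits
correct = ce_loss(logits, labels)

# Mistake: softmax applied by hand, then applied again inside the loss
double = ce_loss(torch.softmax(logits, dim=1), labels)

print(f'raw logits:    {correct.item():.4f}')
print(f'softmax first: {double.item():.4f}')
print(double.item() > correct.item())  # True for this data
```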

Error recovery

RuntimeError: expected scalar type Long but found Float (exact wording varies by version and device)

Your target labels are floats, not integers. Convert with <code>targets.long()</code> or create them with <code>torch.tensor([...], dtype=torch.long)</code>.

RuntimeError: Expected 2D input

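Your model output has the wrong shape. For a batch, <code>CrossEntropyLoss</code> expects logits of shape (batch_size, num_classes) and class-index targets of shape (batch_size,). If your output is (batch_size,), the class dimension is missing. The exact error text depends on your PyTorch version.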

IndexError: Target 3 is out of bounds. (on CPU; on CUDA the same problem usually surfaces as a device-side assert)

A label value is >= num_classes (or negative). If you have 3 classes (0, 1, 2), don't pass label 3. Check your data pipeline.

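Gradients are NaN

An extra softmax before <code>CrossEntropyLoss</code> usually slows training rather than producing NaN, but remove it anyway: <code>CrossEntropyLoss</code> includes it. NaN more often comes from taking <code>torch.log</code> of a softmax output by hand (log of 0), from NaN or inf values in the inputs, or from a learning rate that is too high.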
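The dtype, shape, and range problems above can be caught before the loss is called. The helper below is a hypothetical sketch, not part of PyTorch: <code>check_ce_inputs</code> is an illustrative name, and it assumes batched (batch_size, num_classes) logits, class-index targets, and no use of <code>ignore_index</code>.

```python
import torch
import torch.nn as nn

def check_ce_inputs(logits: torch.Tensor, targets: torch.Tensor) -> None:
    """Hypothetical pre-flight check for batched class-index targets."""
    if logits.dim() != 2:
        raise ValueError(
            f'expected logits of shape (batch_size, num_classes), got {tuple(logits.shape)}')
    if targets.dtype != torch.long:
        raise TypeError(
            f'expected class indices with dtype torch.long, got {targets.dtype}')
    if targets.shape != (logits.shape[0],):
        raise ValueError(
            f'expected targets of shape ({logits.shape[0]},), got {tuple(targets.shape)}')
    num_classes = logits.shape[1]
    if targets.numel() > 0:
        lo, hi = int(targets.min()), int(targets.max())
        if lo < 0 or hi >= num_classes:
            raise ValueError(
                f'labels must be in [0, {num_classes - 1}], found range [{lo}, {hi}]')

logits = torch.randn(4, 3)                           # 4 samples, 3 classes
float_labels = torch.tensor([0.0, 2.0, 1.0, 1.0])    # wrong dtype for class indices
labels = float_labels.long()                         # fix: convert to torch.long

check_ce_inputs(logits, labels)                      # passes silently
loss = nn.CrossEntropyLoss()(logits, labels)
print(loss.shape)                                    # torch.Size([])
```

Calling <code>check_ce_inputs(logits, float_labels)</code> raises a TypeError, and a label of 3 with three classes raises a ValueError, each with a message that names the actual problem.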

Experienced dev note

Always check which reduction you're using: reduction='mean' (the default), 'sum', or 'none'. When you have an imbalanced dataset (e.g., 90% class 0, 10% class 1), the default loss will be dominated by the majority class. Use nn.CrossEntropyLoss(weight=torch.tensor([1.0, 9.0])) to upweight minority classes. Also: loss functions return tensors, not Python floats. Call .item() when you need a plain number for logging or for a Python conditional, and never accumulate the loss tensor itself across iterations, since that keeps the autograd graph alive; accumulate loss.item() or loss.detach() instead. Finally, MSELoss gradients grow with the size of the error, so targets with a large range can produce very large gradients; normalizing your targets first usually helps.
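A short sketch of these options, using made-up two-class logits. It shows the per-sample losses from reduction='none', the class-weighted loss, and how the weighted mean is normalized: by the sum of the weights of the targets, not by the batch size.

```python
import torch
import torch.nn as nn

# Made-up batch: 3 samples, 2 classes, class 1 is the minority
logits = torch.tensor([[2.0, 0.5],
                       [1.5, 0.2],
                       [0.3, 0.1]])
labels = torch.tensor([0, 0, 1])

# reduction='none' keeps one loss value per sample
per_sample = nn.CrossEntropyLoss(reduction='none')(logits, labels)
mean_loss = nn.CrossEntropyLoss()(logits, labels)  # default reduction='mean'

# Upweight the minority class (weights taken from the note above)
class_weights = torch.tensor([1.0, 9.0])
weighted = nn.CrossEntropyLoss(weight=class_weights)(logits, labels)

# The weighted mean divides by the summed weights of the targets
w = class_weights[labels]
weighted_by_hand = (w * per_sample).sum() / w.sum()

print(per_sample.shape)                             # torch.Size([3])
print(torch.allclose(per_sample.mean(), mean_loss)) # True
print(torch.allclose(weighted, weighted_by_hand))   # True

# For logging, accumulate Python floats rather than graph-attached tensors
running_total = 0.0
running_total += mean_loss.item()
```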

Check your understanding

If your classification model outputs logits of shape (32, 10) for a batch of 32 images across 10 classes, and you have target labels as a tensor of shape (32,) with integer class indices, why would using nn.MSELoss instead of nn.CrossEntropyLoss be mathematically wrong here?

Answer hint

A correct answer explains that MSELoss treats class indices as continuous numerical values (0, 1, 2...) and penalizes distance, but classes are discrete categories with no inherent ordering. The distance from class 0 to class 1 is not 'closer' than class 0 to class 9: they're just different. CrossEntropyLoss uses probability distributions, which correctly model the mutually-exclusive nature of classes.
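The ordering argument can be made concrete with a small sketch. It scores two wrong guesses for a sample whose true class is 0: MSELoss on class indices punishes "class 9" far more than "class 1", while CrossEntropyLoss charges the same for two equally confident wrong guesses. The logit value 5.0 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

# MSELoss on class indices: the penalty depends on numeric distance
mse = nn.MSELoss()
true_class = torch.tensor([0.0])
near = mse(torch.tensor([1.0]), true_class)  # guessed "class 1"
far = mse(torch.tensor([9.0]), true_class)   # guessed "class 9"
print(near.item(), far.item())               # 1.0 81.0

# CrossEntropyLoss: two equally confident wrong guesses cost the same
ce = nn.CrossEntropyLoss()
label = torch.tensor([0])
guess_1 = torch.zeros(1, 10)
guess_1[0, 1] = 5.0                          # confident in class 1
guess_9 = torch.zeros(1, 10)
guess_9[0, 9] = 5.0                          # confident in class 9
print(torch.allclose(ce(guess_1, label), ce(guess_9, label)))  # True
```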

VERSION This lesson targets PyTorch 2.11.x. The call signatures of nn.CrossEntropyLoss and nn.MSELoss have been stable across many releases, so the code above should also run on older versions. On any version, predictions and targets must be on the same device, and targets must have the dtype the loss expects.

Now that you understand loss functions, learn how to use an optimizer like <code>torch.optim.Adam</code> to actually minimize that loss and update your model's weights.
