Loss functions: nn.CrossEntropyLoss, nn.MSELoss
Why this matters
Every model needs a loss function to learn: it's the signal that tells the optimizer in which direction to adjust the weights. Choosing the wrong loss function can cause your model to fail to train or to converge slowly, even if everything else is correct.
Explanation
What it is: A loss function is a mathematical function that quantifies the difference between your model's predictions and the true labels. PyTorch provides optimized implementations: nn.CrossEntropyLoss for multi-class classification and nn.MSELoss for regression.
How it works: CrossEntropyLoss combines nn.LogSoftmax and nn.NLLLoss internally: it expects raw logits (unbounded outputs) and class indices as integers, then outputs a single scalar loss value. MSELoss computes the mean squared difference between predictions and targets, useful when outputs and labels are continuous values. Both are differentiable, so gradients flow backward through them during backpropagation.
When to use: Use CrossEntropyLoss when you have discrete classes (cats vs dogs vs birds); use MSELoss when predicting continuous values (house price, temperature).
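The decomposition described under "How it works" can be checked numerically. The following is a minimal sketch rather than a definitive reference: the tensor values are arbitrary and the variable names are illustrative assumptions. It compares nn.CrossEntropyLoss with an explicit nn.LogSoftmax followed by nn.NLLLoss, and nn.MSELoss with a hand-written mean of squared differences.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative values (an assumption, not output from a real model)
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 3.0, 0.2]])
labels = torch.tensor([0, 1])  # integer class indices (dtype torch.long)

# Cross-entropy in a single call
ce_direct = nn.CrossEntropyLoss()(logits, labels)

# The same computation as two explicit steps
log_probs = nn.LogSoftmax(dim=1)(logits)
ce_manual = nn.NLLLoss()(log_probs, labels)
print(torch.allclose(ce_direct, ce_manual))

# Mean squared error, built in versus written by hand
preds = torch.tensor([1.5, 2.1, 3.0])
targets = torch.tensor([1.0, 2.5, 2.8])
mse_direct = nn.MSELoss()(preds, targets)
mse_manual = ((preds - targets) ** 2).mean()
print(torch.allclose(mse_direct, mse_manual))
```

Both comparisons should agree up to floating-point tolerance, which is why you never need to add a softmax layer yourself before CrossEntropyLoss.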
Analogy
Imagine you're teaching someone to throw darts. The loss function is how far each dart landed from the bullseye. CrossEntropyLoss is like saying 'you hit section 5 instead of section 1: that's wrong.' MSELoss is like measuring the exact distance in centimeters from where it landed to the center. The optimizer uses this feedback to adjust the person's next throw.
Code
import torch
import torch.nn as nn
# Example 1: CrossEntropyLoss for classification
print('=== CrossEntropyLoss (Classification) ===')
# Model outputs 3 raw logits per sample (unbounded)
model_outputs = torch.tensor([[2.0, 1.0, 0.1],
                              [0.5, 3.0, 0.2],
                              [1.0, 1.0, 2.5]])
# True class labels (0, 1, or 2)
true_labels = torch.tensor([0, 1, 2])
# Create loss function
ce_loss = nn.CrossEntropyLoss()
# Compute loss
loss_value = ce_loss(model_outputs, true_labels)
print(f'Loss: {loss_value.item():.4f}')
print(f'Loss shape: {loss_value.shape}')
print()
# Example 2: MSELoss for regression
print('=== MSELoss (Regression) ===')
# Model outputs continuous predictions
predictions = torch.tensor([[1.5], [2.1], [3.0]])
# True continuous values
targets = torch.tensor([[1.0], [2.5], [2.8]])
# Create loss function
mse_loss = nn.MSELoss()
# Compute loss
loss_value = mse_loss(predictions, targets)
print(f'Loss: {loss_value.item():.4f}')
print(f'Loss shape: {loss_value.shape}')
print()
# Example 3: Inside a training loop (simplified)
print('=== Loss in a Training Step ===')
batch_size = 4
num_classes = 3
# Random batch (no seed is set, so the batch loss varies from run to run)
batch_logits = torch.randn(batch_size, num_classes)
batch_labels = torch.randint(0, num_classes, (batch_size,))
# Compute loss
ce_loss = nn.CrossEntropyLoss()
loss = ce_loss(batch_logits, batch_labels)
print(f'Batch loss: {loss.item():.4f}')
print('This scalar is what the optimizer minimizes')
Output
=== CrossEntropyLoss (Classification) ===
Loss: 0.3065
Loss shape: torch.Size([])

=== MSELoss (Regression) ===
Loss: 0.1500
Loss shape: torch.Size([])

=== Loss in a Training Step ===
Batch loss: 1.1285
This scalar is what the optimizer minimizes
What just happened?
We created two loss function objects, fed them model outputs and true labels, and got back scalar loss values. CrossEntropyLoss took raw logits and integer class indices, then internally applied log-softmax followed by the negative log-likelihood. MSELoss computed the mean squared error between predictions and targets. Both returned a single scalar (a 0-dimensional tensor) that represents the average error over the batch, because the default reduction is 'mean'.
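To connect the scalar loss to actual learning, here is a minimal sketch of a few optimization steps. The toy data, the nn.Linear model, the learning rate, and the step count are all hypothetical choices made for illustration; they are not part of the example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so repeated runs behave the same

# Hypothetical toy data: 8 samples, 4 features, 3 classes
inputs = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))

model = nn.Linear(4, 3)  # final layer produces raw logits
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(20):
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = criterion(model(inputs), labels)  # scalar loss for this batch
    loss.backward()                          # backpropagate through the loss
    optimizer.step()                         # adjust weights to reduce the loss
    losses.append(loss.item())               # store a Python float, not the tensor

print(f'first step: {losses[0]:.4f}, last step: {losses[-1]:.4f}')
```

On this fixed batch the recorded loss should fall from the first step to the last, which is the "signal" the optimizer follows.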
Common gotcha
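The most common mistake is passing softmax-normalized probabilities to CrossEntropyLoss. Pass raw logits instead (the unbounded outputs of your final linear layer). If you apply softmax first, the loss applies log-softmax a second time: no error is raised, but the value is no longer the true cross-entropy and the gradients are squashed, so training slows or stalls. Similarly, beginners often pass class labels as floats instead of integer class indices: class-index targets must have dtype torch.long. (Recent PyTorch versions do accept float targets, but they are interpreted as per-class probabilities and must have the same shape as the logits.)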
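The softmax mistake is easy to see on a single confident prediction. This is a small sketch with made-up logits; the variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# A very confident and correct prediction for class 0 (illustrative values)
logits = torch.tensor([[10.0, 0.0, 0.0]])
label = torch.tensor([0])

# Correct usage: raw logits go straight into the loss
loss_from_logits = criterion(logits, label)

# Mistake: normalizing first, so softmax is effectively applied twice
probs = torch.softmax(logits, dim=1)
loss_from_probs = criterion(probs, label)

print(f'raw logits in:    {loss_from_logits.item():.4f}')
print(f'probabilities in: {loss_from_probs.item():.4f}')

# Labels that arrive as floats must be cast to torch.long class indices
float_label = torch.tensor([0.0])
loss_cast = criterion(logits, float_label.long())
```

With raw logits the loss is close to zero, as it should be for a confident correct prediction. With probabilities as input it stays above 0.5, because values confined to [0, 1] cannot be pushed far apart by a second softmax.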
Error recovery
ValueError: Expected integer label
RuntimeError: Expected 2D input
RuntimeError: Class index out of range
Gradients are NaN
Experienced dev note
In PyTorch 2.11.x, always check which reduction you are using: 'mean' (the default), 'sum', or 'none'. When you have an imbalanced dataset (e.g., 90% class 0, 10% class 1), the default loss will be dominated by the majority class. Use nn.CrossEntropyLoss(weight=torch.tensor([1.0, 9.0])) to upweight the minority class. Also: loss functions return tensors, not Python floats. Call .item() to get a float for logging or for use in conditionals; if you need to keep the value as a tensor, call .detach() first so you do not hold on to the computation graph. Finally, MSELoss gradients grow in proportion to the prediction error, so targets with a large range can produce very large gradients; normalize or standardize your targets first.
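The reduction modes and class weights mentioned above can be compared side by side. This is a sketch under stated assumptions: the logits and labels are invented, and the manual weighted average is written out only to show what the weighted 'mean' reduction computes.

```python
import torch
import torch.nn as nn

# Invented batch: two samples of class 0, one sample of class 1
logits = torch.tensor([[2.0, 0.5],
                       [1.5, 0.3],
                       [0.2, 1.0]])
labels = torch.tensor([0, 0, 1])
class_weights = torch.tensor([1.0, 9.0])  # the weights used in the note above

# reduction='none' keeps one loss value per sample
per_sample = nn.CrossEntropyLoss(reduction='none')(logits, labels)
print(per_sample.shape)

# The default reduction='mean' averages those per-sample values
mean_loss = nn.CrossEntropyLoss()(logits, labels)

# With class weights, 'mean' becomes a weighted average over the batch
weighted_loss = nn.CrossEntropyLoss(weight=class_weights)(logits, labels)
sample_weights = class_weights[labels]
manual_weighted = (sample_weights * per_sample).sum() / sample_weights.sum()

# .item() yields a plain Python float that is safe to log or accumulate
running_total = 0.0
running_total += mean_loss.item()
```

Here the single class-1 sample carries nine times the weight of each class-0 sample, so it contributes most of the weighted loss.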
Check your understanding
If your classification model outputs logits of shape (32, 10) for a batch of 32 images across 10 classes, and you have target labels as a tensor of shape (32,) with integer class indices, why would using nn.MSELoss instead of nn.CrossEntropyLoss be mathematically wrong here?
Answer hint
A correct answer explains that MSELoss treats class indices as continuous numerical values (0, 1, 2...) and penalizes distance, but classes are discrete categories with no inherent ordering. Class 1 is not 'closer' to class 0 than class 9 is: they are simply different. CrossEntropyLoss instead compares a predicted probability distribution over the classes with the true class, which correctly models the mutually exclusive nature of classes.
Available since 1.0.0: nn.CrossEntropyLoss, nn.MSELoss.