Code Beginner easy · 5 min

loss.backward(): computing gradients

What you will learn

loss.backward() computes gradients of the loss with respect to all model parameters so the optimizer can update weights.

Why this matters

Without calling backward(), your model never learns: the optimizer has no gradients to follow. This is the bridge between computing a loss and actually updating weights.

Skip if: You are running inference or validation. Calling backward() on a validation loss wastes computation, and building the graph at all costs memory. Wrap those passes in a torch.no_grad() context instead.
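As a rough sketch of that advice, the snippet below evaluates a validation batch inside torch.no_grad(). The names X_val and y_val are hypothetical stand-ins for real validation data, and the tiny model is only there to make the example self-contained.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()

# Hypothetical validation batch (random data, for illustration only)
X_val = torch.randn(4, 3)
y_val = torch.randn(4, 1)

model.eval()
with torch.no_grad():
    # No graph is recorded here, so there is nothing to backpropagate through
    val_loss = loss_fn(model(X_val), y_val)

print(val_loss.requires_grad)  # False
print(val_loss.grad_fn)        # None
```

Because val_loss has no grad_fn, calling backward() on it would raise an error rather than silently waste work.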

Explanation

loss.backward() computes the gradient (derivative) of your loss with respect to every parameter that took part in computing it. It does this by backpropagation, that is, reverse-mode automatic differentiation through the computation graph.

Mechanically: PyTorch builds a computation graph as you run forward passes. When you call backward(), it traces backwards from the loss tensor through every operation, computing partial derivatives using the chain rule. These gradients are stored in the .grad attribute of each parameter tensor. The optimizer then reads these gradients to update weights in the direction that reduces loss.
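To make the chain-rule step concrete, here is a minimal sketch with a single hypothetical parameter w, chosen so the derivative can be checked by hand. The values 2.0, 3.0, and 1.0 are arbitrary.

```python
import torch

# Hypothetical one-parameter "model": prediction = w * x
w = torch.tensor(2.0, requires_grad=True)  # leaf tensor tracked by autograd
x = torch.tensor(3.0)
target = torch.tensor(1.0)

loss = (w * x - target) ** 2  # forward pass records the graph
loss.backward()               # reverse pass applies the chain rule

# By hand: d(loss)/dw = 2 * (w*x - target) * x = 2 * 5 * 3 = 30
manual_grad = 2 * (w.item() * x.item() - target.item()) * x.item()

print(w.grad.item())  # 30.0
print(manual_grad)    # 30.0
```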

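This is essential during training. Without it, the .grad attributes stay empty and the optimizer has nothing to apply. Called with no arguments, backward() expects a scalar loss; for a non-scalar tensor, either reduce it first (for example with .sum() or .mean()) or pass a gradient argument of the same shape. It fills .grad for every leaf tensor with requires_grad=True that contributed to the loss, which includes all nn.Module parameters by default.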
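The scalar requirement can be illustrated with a small sketch. The tensor names here (w, scale, per_sample) are made up for the example; it shows the failure on a non-scalar tensor and two common ways around it.

```python
import torch

w = torch.ones(3, requires_grad=True)
scale = torch.tensor([1.0, 2.0, 3.0])

per_sample = w * scale  # shape (3,), not a scalar

error_message = None
try:
    per_sample.backward()  # no implicit gradient for a non-scalar output
except RuntimeError as err:
    error_message = str(err)
    print("RuntimeError:", error_message)

# Option 1: reduce to a scalar first
per_sample.sum().backward()
grad_from_sum = w.grad.clone()

# Option 2: pass an explicit gradient of the same shape (fresh graph)
w.grad = None
per_sample = w * scale
per_sample.backward(gradient=torch.ones_like(per_sample))
grad_from_arg = w.grad.clone()

print(grad_from_sum)  # tensor([1., 2., 3.])
print(grad_from_arg)  # tensor([1., 2., 3.])
```

Passing a vector of ones as the gradient argument gives the same result as summing first, which is why the two options agree here.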

Analogy

Think of backward() as a surveyor measuring the slope of the terrain at your current position. The loss is your altitude. backward() tells you: 'at this point, if you nudge this parameter slightly, the loss changes at this rate.' The optimizer then takes that slope information and steps downhill.

Code

python

import torch
import torch.nn as nn

torch.manual_seed(42)

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()

X = torch.randn(4, 3)
y = torch.randn(4, 1)

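# Forward pass: autograd records the graph from parameters to loss
output = model(X)
loss = loss_fn(output, y)

print(f"Loss value: {loss.item():.4f}")
# .grad is None until the first backward() call
print(f"Weight gradients before backward: {model.weight.grad}")

# Backward pass: fills .grad on every parameter that requires gradients
loss.backward()

print("Weight gradients after backward:")
print(model.weight.grad)
print("Bias gradient after backward:")
print(model.bias.grad)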

Output

Loss value: 0.9284
Weight gradients before backward: None
Weight gradients after backward:
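tensor([[-0.4892, -0.3128,  0.1245]])
Bias gradient after backward:
tensor([-0.2447])

The shapes follow from nn.Linear(3, 1): the weight gradient is (1, 3) and the bias gradient is (1,). Exact values may differ slightly across PyTorch versions and platforms.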

What just happened?

We created a simple linear model and ran a forward pass, producing a loss of 0.9284. Before backward(), the weight and bias gradients were None because nothing had been computed yet. After loss.backward(), PyTorch traced back through the MSE loss and the linear layer and computed the partial derivative of the loss with respect to each parameter. Those values are now stored in model.weight.grad and model.bias.grad. Each entry's sign says which way the loss moves as that parameter increases, and its magnitude says how fast.
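One way to convince yourself that these are ordinary derivatives is to compare them against a hand-derived formula. The sketch below assumes the same setup as the example (a linear layer with mean-squared-error loss) and uses the identity d(loss)/d(output) = 2 * (output - y) / N; the manual_* names are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(3, 1)
X, y = torch.randn(4, 3), torch.randn(4, 1)

output = model(X)
loss = nn.MSELoss()(output, y)
loss.backward()

# Hand-derived gradients for MSE averaged over N elements
residual = (output - y).detach()                    # shape (4, 1)
n = residual.numel()
manual_weight_grad = (2.0 / n) * residual.T @ X     # shape (1, 3)
manual_bias_grad = (2.0 / n) * residual.sum(dim=0)  # shape (1,)

print(torch.allclose(model.weight.grad, manual_weight_grad, atol=1e-6))  # True
print(torch.allclose(model.bias.grad, manual_bias_grad, atol=1e-6))      # True
```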

Common gotcha

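Gradients accumulate by default. Every backward() call adds to the existing .grad values rather than overwriting them, so two backward passes on the same batch without zeroing in between leave you with doubled gradients. In a training loop, call optimizer.zero_grad() once per iteration, either right after optimizer.step() or right before the next backward(), or set param.grad = None manually. Forgetting this makes the optimizer step along a sum of stale and fresh gradients, and the loss can behave erratically.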
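The accumulation is easy to observe directly. This sketch runs two backward passes on the same batch and compares the stored gradients; model.zero_grad() is used here in place of optimizer.zero_grad() only because no optimizer is needed for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(4, 3), torch.randn(4, 1)

# First backward pass
loss_fn(model(X), y).backward()
first = model.weight.grad.clone()

# Second pass on a fresh graph, same data, without zeroing
loss_fn(model(X), y).backward()
accumulated = model.weight.grad.clone()
print(torch.allclose(accumulated, 2 * first))  # True: gradients were summed

# Reset, then a single pass gives the original gradient again
model.zero_grad()
loss_fn(model(X), y).backward()
print(torch.allclose(model.weight.grad, first))  # True
```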

Error recovery

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

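The loss tensor is not connected to any tensor that tracks gradients. Model outputs track gradients automatically because nn.Module parameters have requires_grad=True, so the inputs themselves do not need it. Check that the forward pass was not run inside torch.no_grad(), that you did not call detach() or .item() or convert to NumPy along the way, and that the parameters have not all been frozen with requires_grad=False.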
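A quick diagnostic is to inspect loss.requires_grad and loss.grad_fn before calling backward(). The sketch below reproduces the error with a stray detach() (one possible cause among several) and then shows the intact version; detached_loss and caught are illustrative names.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
X, y = torch.randn(4, 3), torch.randn(4, 1)

# Broken: detach() cuts the output off from the graph
detached_loss = ((model(X).detach() - y) ** 2).mean()
print(detached_loss.requires_grad, detached_loss.grad_fn)  # False None

caught = None
try:
    detached_loss.backward()
except RuntimeError as err:
    caught = err
    print("RuntimeError:", err)

# Fixed: keep the graph intact from parameters to loss
loss = ((model(X) - y) ** 2).mean()
print(loss.requires_grad, loss.grad_fn is not None)  # True True
loss.backward()
```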

RuntimeError: Trying to backward through the graph a second time

You called backward() a second time on a graph whose intermediate buffers were already freed by the first call. Either pass retain_graph=True to the first call, or rerun the forward pass to build a fresh graph before the second backward().
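A minimal sketch of both behaviours, using a hypothetical scalar parameter w so the numbers can be checked by hand (loss = (3w)^2, so d(loss)/dw = 18w = 36 at w = 2):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w * 3.0) ** 2

# First call keeps the graph alive for one more backward pass
loss.backward(retain_graph=True)
first_grad = w.grad.item()
print(first_grad)   # 36.0

# Second call is allowed; note that the gradient accumulates
loss.backward()
second_grad = w.grad.item()
print(second_grad)  # 72.0

# Third call fails: the graph's saved tensors were freed by the second call
third_error = None
try:
    loss.backward()
except RuntimeError as err:
    third_error = err
    print("RuntimeError:", err)
```

In most training loops neither retain_graph nor a second backward() is needed; rerunning the forward pass each iteration builds a fresh graph.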

CUDA out of memory

backward() with retain_graph=True keeps the whole computation graph and its saved activations in memory. If memory is tight, avoid retain_graph, or drop every reference to the graph (for example del loss and del output) once you are done with it.

Experienced dev note

The subtle fact: backward() doesn't update weights. It only computes gradients; the optimizer does the updating. New developers often assume backward() modifies weights. The separation matters: backward() is purely mathematical (differentiation), while optimizer.step() is the actual weight change. Gradients are also only valid for one step, because they depend on the current weights. Once optimizer.step() changes the weights, the old gradients are stale, which is why you zero them.
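The separation can be checked by snapshotting the weights at each stage. This sketch assumes plain SGD with a learning rate of 0.1, picked only for illustration; with that optimizer the update is weight minus lr times gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
X, y = torch.randn(4, 3), torch.randn(4, 1)

before = model.weight.detach().clone()

loss = nn.functional.mse_loss(model(X), y)
loss.backward()
after_backward = model.weight.detach().clone()

optimizer.step()
after_step = model.weight.detach().clone()

print(torch.equal(before, after_backward))  # True: backward() left weights alone
print(torch.equal(before, after_step))      # False: step() changed them
```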

Check your understanding

If you call loss.backward() twice on the same loss tensor (passing retain_graph=True to the first call so the second is allowed) without zeroing gradients in between, what will the gradient values be, and why is that a problem for training?

Show answer hint

A correct answer explains that gradients accumulate (add together), so the values in .grad are doubled. The optimizer then takes larger steps than intended, which can destabilize training and make the loss rise instead of fall. This is why zero_grad() belongs in every training loop.

VERSION Mixed-precision training with float16 autocast still calls for gradient scaling (a GradScaler) around backward() in PyTorch 2.x, as in earlier releases; bfloat16 autocast usually does not need it. For profiling the backward pass, prefer torch.profiler, available since the PyTorch 1.8 series, over the older torch.autograd.profiler.

After mastering backward(), you'll want to learn about optimizer.step() and the training loop pattern (forward → loss → backward → step → zero_grad), the fundamental cycle of training.
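As a preview, the cycle might look like the sketch below. The synthetic data, the SGD optimizer, the learning rate, and the step count are all assumptions made for the example, not recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Synthetic regression data with a known linear relationship
X = torch.randn(64, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.3

losses = []
for step in range(50):
    output = model(X)          # forward
    loss = loss_fn(output, y)  # loss
    loss.backward()            # backward: compute gradients
    optimizer.step()           # step: update weights
    optimizer.zero_grad()      # zero_grad: clear gradients for the next pass
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Storing loss.item() rather than the loss tensor keeps only a Python float, so no computation graph is held across iterations.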
