loss.backward(): computing gradients
Why this matters
Without calling backward(), your model never learns: the optimizer has no gradients to follow. This is the bridge between computing a loss and actually updating weights.
Explanation
loss.backward() computes the gradients (derivatives) of your loss with respect to every parameter that requires gradients and took part in computing it. It runs backpropagation, that is, reverse-mode automatic differentiation, through the entire computation graph.
Mechanically: PyTorch builds a computation graph as you run forward passes. When you call backward(), it traces backwards from the loss tensor through every operation, computing partial derivatives using the chain rule. These gradients are stored in the .grad attribute of each parameter tensor. The optimizer then reads these gradients to update weights in the direction that reduces loss.
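To make the chain-rule step concrete, here is a minimal sketch that compares the gradient autograd produces with one worked out by hand. The one-parameter model and the values of w, x, and y are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import torch

# Hypothetical one-parameter model: loss = (w * x - y) ** 2
w = torch.tensor(3.0, requires_grad=True)  # the only trainable value
x = torch.tensor(2.0)
y = torch.tensor(1.0)

# Forward pass: autograd records each operation as it runs
pred = w * x            # 3 * 2 = 6
loss = (pred - y) ** 2  # (6 - 1) ** 2 = 25

# Backward pass: walks the recorded graph from loss back to w
loss.backward()

# The same derivative by hand, using the chain rule:
# dloss/dw = dloss/dpred * dpred/dw = 2 * (pred - y) * x = 2 * 5 * 2 = 20
manual_grad = 2 * (pred.detach() - y) * x

print(w.grad)       # tensor(20.)
print(manual_grad)  # tensor(20.)
```

The two numbers agree because backward() applies exactly this product of local derivatives, one recorded operation at a time.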
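This is essential during training: without it, your parameters never change. By default, backward() can only be called on a scalar loss; for a non-scalar tensor you must either reduce it first (for example with .sum() or .mean()) or pass an explicit gradient argument. retain_graph=True is a separate special case: it keeps the graph alive so you can call backward() through it more than once. The call populates .grad on every parameter that has requires_grad=True and took part in computing the loss.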
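The scalar requirement is easiest to see by breaking it on purpose. The sketch below uses a made-up three-element tensor w (not part of the lesson's model) and shows two ways to backpropagate from a non-scalar result: reducing it to a scalar, or passing the gradient argument explicitly.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
per_element = w ** 2  # shape [3], not a scalar

# Calling backward() with no arguments on a non-scalar tensor fails
try:
    per_element.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Option 1: reduce to a scalar first
per_element.sum().backward()
grad_from_sum = w.grad.clone()

# Option 2: pass the upstream gradient explicitly.
# Reset .grad and rebuild the graph, since the first backward freed it.
w.grad = None
per_element = w ** 2
per_element.backward(gradient=torch.ones_like(per_element))
grad_from_arg = w.grad.clone()

print(grad_from_sum)  # tensor([2., 4., 6.])
print(grad_from_arg)  # tensor([2., 4., 6.])
```

Both options give 2 * w here, because summing the elements is the same as weighting each one by 1 in the backward pass.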
Analogy
Think of backward() as a surveyor measuring the slope of the terrain at your current position. The loss is your altitude. backward() tells you: 'at this point, if you move this parameter slightly, the loss changes at this rate.' The optimizer then takes that slope information and steps downhill.
Code
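import torch
import torch.nn as nn

# Fix the seed so parameter initialization and data are repeatable
torch.manual_seed(42)

# One linear layer: 3 inputs -> 1 output (weight shape [1, 3], bias shape [1])
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()

# A toy batch of 4 samples
X = torch.randn(4, 3)
y = torch.randn(4, 1)

# Forward pass: builds the computation graph and produces a scalar loss
output = model(X)
loss = loss_fn(output, y)
print(f"Loss value: {loss.item():.4f}")

# No backward pass has run yet, so .grad is still None
print(f"Weight gradients before backward: {model.weight.grad}")

# Backward pass: fills .grad on every parameter that requires gradients
loss.backward()
print("Weight gradients after backward:")
print(model.weight.grad)
print("Bias gradient after backward:")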
print(model.bias.grad)
Output
Loss value: 0.9284
Weight gradients before backward: None
Weight gradients after backward:
tensor([[-0.4892, -0.3128,  0.1245]])
Bias gradient after backward:
tensor([-0.2447])
What just happened?
We created a simple linear model and ran a forward pass, producing a loss of 0.9284. Before calling backward(), the weight and bias gradients were None (not yet computed). After calling loss.backward(), PyTorch traced back through the MSE loss and the linear layer, computing how sensitive the loss is to each parameter. Those values are now stored in model.weight.grad and model.bias.grad. Each gradient has the same shape as its parameter: (1, 3) for the weight and (1,) for the bias of nn.Linear(3, 1). Together they give the slope of the loss surface at the current weights: the sign says which way the loss increases, and the magnitude says how steeply. The exact numbers can differ across PyTorch versions and platforms, but the shapes will not.
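As a sanity check on what ends up in .grad, the sketch below recomputes the same gradients by hand for this model. It assumes the default mean reduction of nn.MSELoss; the names residual, d_output, manual_weight_grad, and manual_bias_grad are introduced here for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
model = nn.Linear(3, 1)
X = torch.randn(4, 3)
y = torch.randn(4, 1)

output = model(X)
loss = nn.MSELoss()(output, y)
loss.backward()

# Manual derivation for loss = mean((output - y) ** 2), output = X @ W.T + b
with torch.no_grad():
    residual = output - y
    d_output = 2 * residual / residual.numel()  # dloss/doutput, shape [4, 1]
    manual_weight_grad = d_output.T @ X         # shape [1, 3], same as weight
    manual_bias_grad = d_output.sum(dim=0)      # shape [1], same as bias

print(model.weight.grad.shape, model.bias.grad.shape)
print(torch.allclose(model.weight.grad, manual_weight_grad, atol=1e-6))
print(torch.allclose(model.bias.grad, manual_bias_grad, atol=1e-6))
```

The check relies on shapes and closeness rather than specific numbers, so it holds regardless of the values the seed produces.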
Common gotcha
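Gradients accumulate by default. Each backward() call adds to whatever is already stored in .grad, so running forward and backward twice on the same batch without zeroing in between doubles the gradients. In a training loop, call optimizer.zero_grad() once per iteration, either right after optimizer.step() or just before the next backward(), or set each param.grad = None manually. Forgetting this mixes stale gradients from earlier batches into every update, so weights update incorrectly and the loss behaves erratically.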
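The accumulation behavior, and the usual loop ordering that avoids it, can be sketched as follows. The toy data, the SGD optimizer, and the learning rate of 0.1 are arbitrary choices for the example, not recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
X = torch.randn(4, 3)
y = torch.randn(4, 1)

# First forward/backward: .grad goes from None to the gradient
loss_fn(model(X), y).backward()
first = model.weight.grad.clone()

# Second forward/backward with no zeroing: the new gradient is ADDED
loss_fn(model(X), y).backward()
accumulated = model.weight.grad.clone()
print(torch.allclose(accumulated, 2 * first))  # True

# A loop that clears gradients at the start of every iteration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
losses = []
for step in range(5):
    optimizer.zero_grad()        # discard gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass
    loss.backward()              # compute fresh gradients
    optimizer.step()             # update weights using those gradients
    losses.append(loss.item())
print(losses)
```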
Error recovery
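The two RuntimeErrors covered in this section can be reproduced on CPU in a few lines, which helps when you need to recognize them in a real traceback. This is a sketch with made-up tensors x and w. The CUDA out-of-memory case is left out because it depends on your hardware.

```python
import torch

# Error 1: nothing in the graph requires gradients
x = torch.ones(3)  # requires_grad defaults to False
loss = (x * 2).sum()
no_grad_error = None
try:
    loss.backward()
except RuntimeError as err:
    no_grad_error = str(err)
print(no_grad_error)

# Error 2: backward through the same graph twice
w = torch.ones(3, requires_grad=True)
loss = (w * w).sum()
loss.backward()  # first call frees the graph's saved tensors
second_backward_error = None
try:
    loss.backward()
except RuntimeError as err:
    second_backward_error = str(err)
print(second_backward_error)

# One way out for error 2: keep the graph alive on the first call
w.grad = None
loss = (w * w).sum()
loss.backward(retain_graph=True)
loss.backward()  # works now; gradients accumulate: 2*w + 2*w
print(w.grad)    # tensor([4., 4., 4.])
```

In most training code the better fix for the second error is to recompute the forward pass each iteration. retain_graph=True is only needed when you really do have to backpropagate through one graph more than once.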
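RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
The loss is not connected to any tensor with requires_grad=True. Typical causes: the forward pass ran inside torch.no_grad(), a tensor was detached along the way, or every parameter is frozen.
RuntimeError: Trying to backward through the graph a second time
backward() was called twice on the same graph, whose saved tensors are freed after the first call. Recompute the forward pass before each backward(), or pass retain_graph=True on the first call if you really need a second one.
CUDA out of memory
Autograd keeps intermediate activations in memory until backward() runs, so training needs more memory than inference. Reduce the batch size, and log loss.item() rather than the loss tensor so the graph is not kept alive.
Experienced dev note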
The subtle fact: backward() doesn't update weights; it only computes gradients. The optimizer does the updating. New developers often think backward() modifies weights. Understanding this separation is critical: backward() is purely mathematical (differentiation), while optimizer.step() performs the actual weight change. Also, gradients are only valid for one step, because they depend on the current weights. After optimizer.step() changes the weights, the old gradients are stale, which is why you zero them.
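The separation between computing gradients and applying them can be checked directly by snapshotting the weights around each call. This sketch assumes plain SGD with an arbitrary learning rate of 0.1, so the update rule is simply new = old - lr * grad. Other optimizers apply different rules.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
X = torch.randn(4, 3)
y = torch.randn(4, 1)

before = model.weight.detach().clone()

loss = nn.MSELoss()(model(X), y)
loss.backward()
after_backward = model.weight.detach().clone()

optimizer.step()
after_step = model.weight.detach().clone()

# backward() filled .grad but left the weights untouched
print(torch.equal(before, after_backward))  # True

# step() is what moved the weights
print(torch.equal(before, after_step))      # False

# For plain SGD the move is exactly -lr * grad
print(torch.allclose(after_step, before - 0.1 * model.weight.grad))  # True
```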
Check your understanding
If you run the forward pass and call loss.backward() twice on the same batch without zeroing gradients in between, what will the gradient values be, and why is that a problem for training? (Calling backward() twice on the very same loss tensor raises a RuntimeError unless the first call used retain_graph=True.)
Answer hint
A correct answer explains that gradients accumulate (add together), resulting in doubled gradient values. The optimizer then takes larger steps than intended, which can destabilize training and make the loss rise instead of fall. This is why zero_grad() belongs in every training loop.