forward pass → loss → backward → step
Why this matters
This loop is the heartbeat of PyTorch training. Every model you train runs it thousands of times, so understanding each step helps you avoid gradient bugs, memory leaks, and surprises in the backward pass.
Explanation
What it is: The training loop has exactly four steps: (1) feed data forward through the model, (2) compute a loss value comparing predictions to targets, (3) compute gradients via backpropagation, (4) update weights using those gradients. This cycle repeats for every batch.
How it works mechanically: output = model(x) runs the model's forward() method and records a computation graph of the operations that involve its parameters. loss = criterion(output, y) reduces predictions and targets to a scalar. loss.backward() walks that graph in reverse and adds the gradient of the loss into the .grad attribute of every parameter that has requires_grad=True. optimizer.step() then moves each parameter in the direction that reduces the loss, with the learning rate controlling the step size. The optimizer never clears gradients on its own: you must call optimizer.zero_grad() yourself, once per iteration and before the next backward(), or the old gradients are added to the new ones.
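The mechanics above can be checked one step at a time. The following is a minimal sketch, not part of the original example: it assumes a single nn.Linear layer, random data, and a learning rate of 0.1 chosen only for illustration, and it inspects the state that each call leaves behind.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative setup (an assumption, not the tutorial's model)
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 2)
y = torch.randn(4, 1)

# 1. Forward: the output carries a grad_fn, a handle into the recorded graph.
output = model(x)
has_graph = output.grad_fn is not None

# 2. Loss: a scalar, i.e. a 0-dimensional tensor.
loss = criterion(output, y)
loss_dim = loss.dim()

# 3. Backward: .grad is None beforehand and a weight-shaped tensor afterwards.
grad_before = model.weight.grad
loss.backward()
grad_after = model.weight.grad.clone()

# 4. Step: parameter values change, but the gradients are NOT cleared.
weight_before = model.weight.detach().clone()
optimizer.step()
weight_changed = not torch.equal(weight_before, model.weight.detach())
grad_survives_step = model.weight.grad is not None

# 5. Clearing is an explicit call. Depending on the PyTorch version,
#    zero_grad() either sets .grad to None or fills it with zeros.
optimizer.zero_grad()
grad_cleared = model.weight.grad is None or not bool(model.weight.grad.any())

print(has_graph, loss_dim, grad_before, weight_changed,
      grad_survives_step, grad_cleared)
```

The check in step 4 is the one worth remembering: after step() the old gradients are still attached to the parameters, which is why the loop needs its own zero_grad() call.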
When to use it: Nearly every supervised training loop in PyTorch follows this pattern. Get it right first, then layer on validation splits, learning rate schedules, or gradient clipping.
Analogy
Think of it like tuning a guitar by ear. Forward pass: pluck the string and hear the note. Loss: measure how far off from the target pitch it is. Backward: figure out which tuning pegs caused the error. Step: turn the pegs slightly in the right direction. Repeat until in tune.
Code
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)  # make the random init and data reproducible

# Two linear layers with a ReLU in between: 2 inputs -> 4 hidden -> 1 output
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.ReLU(),
    nn.Linear(4, 1)
)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Synthetic data: 8 samples, 2 features, 1 target each
X = torch.randn(8, 2)
y = torch.randn(8, 1)

print("Before training:")
print(f"Loss: {criterion(model(X), y).item():.4f}")
print(f"First weight grad: {model[0].weight.grad}")  # None: no backward() yet

for epoch in range(100):
    optimizer.zero_grad()          # clear gradients from the previous iteration
    output = model(X)              # forward pass
    loss = criterion(output, y)    # scalar loss
    loss.backward()                # fill .grad on every parameter
    optimizer.step()               # update the weights
    if (epoch + 1) % 25 == 0:
        print(f"Epoch {epoch + 1}: Loss = {loss.item():.4f}")

print("\nAfter training:")
print(f"Loss: {criterion(model(X), y).item():.4f}")
print(f"First weight grad: {model[0].weight.grad}")
Output
Before training:
Loss: 0.4894
First weight grad: None
Epoch 25: Loss = 0.3421
Epoch 50: Loss = 0.2156
Epoch 75: Loss = 0.1248
Epoch 100: Loss = 0.0712
After training:
Loss: 0.0712
First weight grad: tensor([[-0.0142, -0.0205],
[-0.0099, -0.0141],
[ 0.0088, 0.0126],
[-0.0034, -0.0049]])
What just happened?
The code defined a small network (two linear layers with a ReLU between them) and trained it for 100 iterations on random synthetic data. In each iteration the model processed all 8 samples forward, computed the MSE loss (about 0.49 at the start), backpropagated to fill in the gradient of every parameter (each .grad was None before the first backward), and the SGD optimizer subtracted lr × gradient from each weight. In the run shown, the loss fell from 0.4894 to 0.0712. The gradients printed at the end are non-zero because they are left over from the last backward() call: nothing clears them after the final step, and evaluating the loss afterwards does not call backward. If you called backward again without zero_grad, the new gradients would be added to these (a common bug).
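The "subtract lr × gradient" rule can be verified directly. This is a small sketch under stated assumptions: a single nn.Linear layer and random data stand in for the tutorial's model, and the optimizer is plain SGD with no momentum or weight decay, which is the only case where the update is exactly this simple.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lr = 0.01
model = nn.Linear(2, 1)  # illustrative stand-in for the tutorial's network
optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # no momentum, no weight decay
X = torch.randn(8, 2)
y = torch.randn(8, 1)

loss = nn.MSELoss()(model(X), y)
loss.backward()

# Snapshot the weight and its gradient before the optimizer touches them.
w_old = model.weight.detach().clone()
grad = model.weight.grad.detach().clone()

optimizer.step()

# Plain SGD: w_new = w_old - lr * grad
expected = w_old - lr * grad
matches = torch.allclose(model.weight.detach(), expected)
print(matches)
```

With momentum, weight decay, or an adaptive optimizer such as Adam, the update is no longer a bare multiple of the current gradient, so this equality would not hold.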
Common gotcha
Forgetting optimizer.zero_grad() before backward(). PyTorch accumulates gradients by default: if you call backward twice without zeroing, the new values are added to the existing .grad tensors instead of replacing them. Each step then uses the sum of all gradients computed so far, which typically shows up as unstable training or a loss that blows up. Zero the gradients once per iteration, before backward().
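The accumulation is easy to see on a tensor whose gradient can be computed by hand. A minimal sketch, assuming a two-element tensor w and the loss sum(w²), neither of which appears in the original example; the gradient of that loss is 2w.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)  # used here only for zero_grad()

def loss_fn():
    # d/dw sum(w^2) = 2w, so the gradient at w = [1, 2] is [2, 4]
    return (w ** 2).sum()

loss_fn().backward()
first = w.grad.clone()          # [2, 4]

loss_fn().backward()            # no zero_grad() in between
accumulated = w.grad.clone()    # [4, 8]: the two gradients were summed

optimizer.zero_grad()           # clear, then compute again
loss_fn().backward()
fresh = w.grad.clone()          # back to [2, 4]

print(first, accumulated, fresh)
```

The same behaviour is what makes deliberate gradient accumulation over several small batches possible, so it is a feature that becomes a bug only when it is unintended.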
Error recovery
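Two of the RuntimeErrors covered in this section can be reproduced in a few lines, which helps when matching them against a real traceback. This is a hedged sketch: the tensors x, w, and h are made up for illustration, and the exact wording of the messages may vary slightly between PyTorch versions.

```python
import torch

# Error 1: calling backward() on a result that never involved a tensor
# with requires_grad=True, so there is no graph to walk.
x = torch.ones(3)  # requires_grad defaults to False for plain tensors
try:
    x.sum().backward()
    msg1 = ""
except RuntimeError as e:
    msg1 = str(e)

# Error 2: changing requires_grad on a non-leaf tensor, i.e. one that was
# computed from other tensors rather than created directly.
w = torch.ones(3, requires_grad=True)
h = w * 2  # non-leaf: it has a grad_fn
try:
    h.requires_grad = False
    msg2 = ""
except RuntimeError as e:
    msg2 = str(e)

# To cut a computed tensor out of the graph, detach() it instead.
h_const = h.detach()

print(msg1)
print(msg2)
print(h_const.requires_grad)
```

In training code the first error usually means the loss was computed inside torch.no_grad(), from detached tensors, or from a model whose parameters are all frozen.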
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
RuntimeError: you can only change requires_grad flags of leaf variables
Loss staying constant or NaN
Experienced dev note
zero_grad() can sit anywhere between step() and the next backward(), but put it at the START of each iteration, not the end. Why? If an exception or an early continue interrupts an iteration after backward() but before an end-of-loop zero_grad(), the stale gradients survive into the next iteration; zeroing at the top guarantees every iteration starts clean no matter how the previous one ended. Also, backward() only computes gradients for tensors with requires_grad=True, which is the default for model parameters. If you freeze part of a model by setting requires_grad=False on its parameters, backward() skips them and they never update. This is the standard trick for transfer learning, but it is silent: nothing warns you that a layer has stopped learning.
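The freezing behaviour described above can be observed on a small model. A minimal sketch, assuming a network shaped like the one in the Code section with its first layer frozen; the data and the learning rate of 0.1 are illustrative choices, not taken from the original.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))

# Freeze the first layer: backward() will skip these parameters.
for p in model[0].parameters():
    p.requires_grad = False

# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

X = torch.randn(8, 2)
y = torch.randn(8, 1)

frozen_before = model[0].weight.detach().clone()
head_bias_before = model[2].bias.detach().clone()

optimizer.zero_grad()
loss = nn.MSELoss()(model(X), y)
loss.backward()
optimizer.step()

frozen_grad = model[0].weight.grad  # stays None: no gradient was computed
frozen_unchanged = torch.equal(frozen_before, model[0].weight.detach())
head_changed = not torch.equal(head_bias_before, model[2].bias.detach())

print(frozen_grad, frozen_unchanged, head_changed)
```

Printing which parameters have requires_grad=True before training is a cheap way to make the otherwise silent freeze visible.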
Check your understanding
If you run the loop twice in a row without creating a new model, what happens to the loss on the second loop? Would it continue decreasing from where it left off, reset, or behave differently?
Answer hint
The answer hinges on the fact that model weights, and any optimizer state such as momentum buffers, persist between loops. (The plain SGD used here has no momentum, so in this example only the weights carry over.) The loss picks up where the first loop left off rather than resetting. Because the model has already learned, the second loop either refines it further or pushes it toward overfitting, depending on the data.
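The hint can be confirmed by running the loop twice on the same model object. This is a sketch under assumptions: train is a hypothetical helper written for this check, and the step count of 50 is arbitrary; the model, loss, optimizer, and data shapes mirror the Code section.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X = torch.randn(8, 2)
y = torch.randn(8, 1)

def train(steps):
    """Hypothetical helper: run the four-step loop and return the losses."""
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

first_run = train(50)

# Loss of the trained model, measured between the two runs.
with torch.no_grad():
    loss_between = criterion(model(X), y).item()

second_run = train(50)

# The second run starts from the trained weights, not from scratch.
print(f"first run started at  {first_run[0]:.4f}")
print(f"between the runs      {loss_between:.4f}")
print(f"second run started at {second_run[0]:.4f}")
```

To actually reset training, you would need to build a new model (or reload its initial state_dict) and create a new optimizer for it.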