Code Beginner easy · 5 min

forward pass → loss → backward → step

What you will learn

The four-step training loop that teaches a neural network by computing gradients and updating weights.

Why this matters

This is the heartbeat of all PyTorch training. Every model you train will execute this loop thousands of times: understanding each step prevents gradient errors, memory leaks, and backward pass surprises.

Skip if: You only need predictions from an already-trained model. Inference does not use this loop; wrap it in <code>torch.no_grad()</code> instead so no computation graph is built, no backward pass runs, and memory is saved.
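As a rough illustration of the difference, the sketch below runs the same forward pass with and without torch.no_grad(). The tiny nn.Linear model and the random input are placeholders chosen for this example, not part of the lesson's code.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)   # placeholder model for illustration
x = torch.randn(4, 2)     # placeholder input

# Normal forward pass: autograd records the graph, so the output has a grad_fn.
train_out = model(x)

# Inference: nothing is recorded inside the no_grad() block.
with torch.no_grad():
    infer_out = model(x)

print(train_out.requires_grad, train_out.grad_fn is not None)  # True True
print(infer_out.requires_grad, infer_out.grad_fn is None)      # False True
```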

Explanation

What it is: The training loop has exactly four steps: (1) feed data forward through the model, (2) compute a loss value comparing predictions to targets, (3) compute gradients via backpropagation, (4) update weights using those gradients. This cycle repeats for every batch.

How it works mechanically: output = model(x) runs the model's forward() and records a computation graph of every operation. loss = criterion(output, y) reduces the predictions to a scalar. loss.backward() walks that graph in reverse and fills the .grad attribute of every parameter that requires gradients. optimizer.step() then moves each parameter in the direction that reduces the loss, with the learning rate controlling the step size. Neither backward() nor step() clears old gradients: you must call optimizer.zero_grad() yourself before the next backward() call, or the new gradients are added to the old ones.
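To make the mechanics concrete, here is a minimal sketch of one iteration, inspected step by step. It assumes plain SGD with no momentum or weight decay, in which case step() should be equivalent to w ← w − lr × grad. The single nn.Linear layer and the value lr=0.1 are arbitrary choices for the illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(2, 1)                     # illustrative one-layer model
optimizer = optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 2), torch.randn(4, 1)

grad_before = model.weight.grad             # None: backward() has not run yet

loss = nn.MSELoss()(model(x), y)            # forward pass, then scalar loss
loss.backward()                             # fills .grad on weight and bias

w_old = model.weight.detach().clone()
g = model.weight.grad.clone()
optimizer.step()                            # plain SGD update

# Compare against the hand-computed update w_old - lr * grad
matches = torch.allclose(model.weight.detach(), w_old - 0.1 * g)

optimizer.zero_grad()                       # has to be called explicitly
grad_after_zero = model.weight.grad         # None or zeros, depending on version

print(grad_before, loss.dim(), matches)     # None 0 True
```

Whether zero_grad() leaves .grad as None or as a tensor of zeros depends on the set_to_none setting, whose default has changed between PyTorch versions.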

When to use it: Every supervised learning training loop uses this pattern. It's the foundation before you add complexity like validation splits, learning rate schedules, or gradient clipping.

Analogy

Think of it like tuning a guitar by ear. Forward pass: pluck the string and hear the note. Loss: measure how far off from the target pitch it is. Backward: figure out which tuning pegs caused the error. Step: turn the pegs slightly in the right direction. Repeat until in tune.

Code

python

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

model = nn.Sequential(
    nn.Linear(2, 4),
    nn.ReLU(),
    nn.Linear(4, 1)
)

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(8, 2)
y = torch.randn(8, 1)

print("Before training:")
print(f"Loss: {criterion(model(X), y).item():.4f}")
print(f"First weight grad: {model[0].weight.grad}")

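for epoch in range(100):
    # Clear gradients left over from the previous iteration
    optimizer.zero_grad()

    # 1. Forward pass: compute predictions
    output = model(X)

    # 2. Loss: scalar measuring prediction error
    loss = criterion(output, y)

    # 3. Backward: populate .grad on every parameter
    loss.backward()

    # 4. Step: update parameters using the gradients
    optimizer.step()

    if (epoch + 1) % 25 == 0:
        print(f"Epoch {epoch + 1}: Loss = {loss.item():.4f}")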

print("\nAfter training:")
print(f"Loss: {criterion(model(X), y).item():.4f}")
print(f"First weight grad: {model[0].weight.grad}")

Output

Before training:
Loss: 0.4894
First weight grad: None
Epoch 25: Loss = 0.3421
Epoch 50: Loss = 0.2156
Epoch 75: Loss = 0.1248
Epoch 100: Loss = 0.0712

After training:
Loss: 0.0712
First weight grad: tensor([[-0.0142, -0.0205],
        [-0.0099, -0.0141],
        [ 0.0088,  0.0126],
        [-0.0034, -0.0049]])

What just happened?

The code defined a network with two linear layers and trained it for 100 iterations on random synthetic data. In each iteration the model processed all 8 samples forward, computed the MSE loss (about 0.49 at the start), backpropagated to fill .grad on every parameter (these were None before the first backward()), and the SGD optimizer subtracted lr × gradient from each parameter. The printed loss fell from 0.4894 to 0.0712. The final weight gradients are non-zero because they hold the slope from the last backward() call; step() does not clear them. If you called backward() again without zero_grad(), the new gradients would be added to these (a common bug). Exact values can differ across PyTorch versions and hardware.

Common gotcha

Forgetting optimizer.zero_grad() before backward(). PyTorch accumulates gradients by default: if you call backward() twice without zeroing, each new gradient is added to the existing .grad tensor instead of replacing it. The effective step then grows from one iteration to the next, which shows up as unstable training or an exploding loss. Zero the gradients once per iteration, before backward().
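The accumulation is easy to see on a single tensor. This sketch uses a hand-made parameter w and a toy loss sum(w²), both invented for the illustration, so the gradient 2w can be checked by hand.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def toy_loss(w):
    # Illustrative loss with a known gradient: d/dw sum(w^2) = 2 * w
    return (w ** 2).sum()

toy_loss(w).backward()
first = w.grad.clone()          # tensor([2., 4.])

toy_loss(w).backward()          # no zeroing in between
accumulated = w.grad.clone()    # tensor([4., 8.]): added, not replaced

w.grad = None                   # manual reset, similar in effect to zero_grad()
toy_loss(w).backward()
fresh = w.grad.clone()          # tensor([2., 4.]) again

print(first, accumulated, fresh)
```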

Error recovery

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Your loss is not connected to any tensor that requires gradients. Check that the model actually has parameters, that the loss is computed from model(X) rather than from a constant, that you didn't accidentally call .detach() on the output, and that the forward pass isn't wrapped in torch.no_grad().
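One way to reproduce this error on purpose is to detach the output before computing the loss. The sketch below does that with a placeholder nn.Linear model and catches the exception so the script keeps running.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)                  # placeholder model
x, y = torch.randn(4, 2), torch.randn(4, 1)

output = model(x).detach()               # bug: cuts the output off from the graph
loss = nn.MSELoss()(output, y)
print(loss.requires_grad, loss.grad_fn)  # False None

try:
    loss.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(err)
```

Printing loss.requires_grad and loss.grad_fn before calling backward() is a quick way to check whether the graph is still intact.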

RuntimeError: you can only change requires_grad flags of leaf variables

You tried to modify requires_grad on an intermediate tensor (one created by an operation). Only leaf tensors, such as model parameters or tensors you created directly, can have requires_grad toggled. Set requires_grad on those tensors before building the computation graph, or call .detach() on the intermediate result to get a new leaf tensor.
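The following sketch shows the leaf versus non-leaf distinction and one possible workaround. The tensor names a, b, and c are arbitrary.

```python
import torch

a = torch.randn(3, requires_grad=True)   # leaf: created directly by the user
b = a * 2                                 # non-leaf: result of an operation

print(a.is_leaf, b.is_leaf)               # True False

try:
    b.requires_grad = False               # not allowed on a non-leaf tensor
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

# Workaround: detach() returns a new leaf tensor that shares the same data
c = b.detach()
c.requires_grad_(True)                    # allowed, because c is a leaf
print(c.is_leaf, c.requires_grad)         # True True
```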

Loss staying constant or NaN

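A learning rate that is too high makes the loss diverge to inf or NaN. A missing zero_grad() makes gradients pile up across iterations, so the updates no longer match the current batch and training destabilizes. A loss that barely moves can also mean the learning rate is too low or the parameters receive no gradient. Try a smaller learning rate such as lr=0.001 and verify that zero_grad() is called every iteration before backward().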
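Divergence from an oversized learning rate can be shown on a toy problem. In this sketch, the helper final_loss and the one-parameter loss w² are hypothetical, chosen so the behavior is predictable: with lr=1.5 each step maps w to −2w, which overflows, while lr=0.1 shrinks w toward zero.

```python
import torch

def final_loss(lr, steps=200):
    # Hypothetical helper: minimize w^2 with plain SGD from w = 1
    w = torch.tensor([1.0], requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        opt.step()
    return (w.detach() ** 2).sum()

loss_high = final_loss(lr=1.5)   # each step doubles |w| until it overflows
loss_low = final_loss(lr=0.1)    # each step multiplies w by 0.8

print(torch.isfinite(loss_high).item())  # False
print(loss_low.item() < 1e-6)            # True
```

A check such as torch.isfinite(loss) inside a real training loop is one way to stop early when this happens.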

Experienced dev note

The order of forward, loss, backward(), and step() is fixed, but zero_grad() has some freedom: it can go anywhere except between backward() and step(). Put it at the START of the loop, not the end. Why? If an exception interrupts an iteration after backward(), zeroing at the top guarantees that the next iteration still starts with clean gradients. Also, backward() only computes gradients for tensors with requires_grad=True (the default for model parameters). If you freeze part of a model by setting requires_grad=False on its parameters, backward() skips them and they won't update. This is the usual approach for transfer learning, but nothing warns you if you freeze the wrong layers.
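Here is a minimal sketch of freezing, assuming the same small nn.Sequential architecture as the lesson and freezing only its first layer. Passing only the trainable parameters to the optimizer is a common convention, not a requirement.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))

# Freeze the first layer
for p in model[0].parameters():
    p.requires_grad = False

frozen_before = model[0].weight.clone()
head_bias_before = model[2].bias.detach().clone()

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.1)

x, y = torch.randn(8, 2), torch.randn(8, 1)
optimizer.zero_grad()
nn.MSELoss()(model(x), y).backward()
optimizer.step()

print(model[0].weight.grad)                               # None: backward() skipped it
print(torch.equal(model[0].weight, frozen_before))        # True: frozen layer unchanged
print(torch.equal(model[2].bias.detach(), head_bias_before))  # unfrozen layer updated
```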

Check your understanding

If you run the loop twice in a row without creating a new model, what happens to the loss on the second loop? Would it continue decreasing from where it left off, reset, or behave differently?

Show answer hint

The answer requires understanding that model weights persist between loops, and so does any optimizer state (plain SGD as used here keeps none, but momentum or Adam would). The loss would continue from the trained state of the first loop, not reset. The model has already learned, so the second loop would refine it further or overfit, depending on the data.
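You can check the hint empirically. This sketch wraps the loop in a hypothetical train() helper and calls it twice on the same model; the 50-epoch length is arbitrary.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(8, 2), torch.randn(8, 1)

def train(epochs=50):
    # Hypothetical helper: runs the four-step loop, returns the loss per epoch
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

first = train()
with torch.no_grad():
    between = criterion(model(X), y).item()   # loss after the first loop ends
second = train()

# The second loop starts exactly where the first one stopped
print(abs(second[0] - between) < 1e-6)        # True
print(first[0], second[0])
```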

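VERSION The Variable wrapper was merged into Tensor in PyTorch 0.4, long before 2.0: you use tensors directly, with requires_grad=True where needed. Older tutorials that wrap inputs in Variable() predate that release. For mixed precision, torch.amp.autocast('cuda') is the preferred spelling in recent 2.x releases, which deprecate torch.cuda.amp.autocast(). The code in this lesson uses only long-stable APIs (nn.Sequential, nn.MSELoss, optim.SGD) and does not depend on 2.x-only features.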

Next, learn how to evaluate your trained model on separate test data with <code>torch.no_grad()</code>, which disables gradient tracking and saves memory.
