Code Beginner easy · 5 min

What autograd does: computing gradients

What you will learn

PyTorch's autograd automatically computes gradients by tracking operations on tensors and building a computational graph.

Why this matters

Gradients are how neural networks learn: they tell you which direction to adjust weights to reduce loss. Without autograd, you'd compute derivatives by hand for every model. Understanding how it works prevents you from accidentally breaking gradient flow in production.

Skip if: You don't need autograd when doing inference (prediction) on a trained model: use <code>torch.no_grad()</code> to disable it and save memory. You also don't need it for data preprocessing, or for computing a loss you never backpropagate through, such as a validation loss.

Explanation

Autograd is PyTorch's automatic differentiation engine. When you perform operations on tensors with requires_grad=True, PyTorch records them in a computational graph. Later, when you call .backward() on a scalar loss, PyTorch walks backward through that graph, applying the chain rule to compute the derivative of the loss with respect to each tracked tensor. The result is stored in the .grad attribute of each leaf tensor (a tensor you created directly, such as a model parameter) that has requires_grad=True; intermediate results do not keep their gradients by default.

Mechanically: the graph's leaves are the input tensors, and each operation (addition, multiplication, ReLU, etc.) is a node that connects its inputs to its output. Every tensor produced by an operation stores a reference to that operation in its .grad_fn attribute. When you call .backward(), PyTorch uses the chain rule: if loss depends on y, and y depends on x, then ∂loss/∂x = (∂loss/∂y) × (∂y/∂x). It traces backward from loss to every leaf, multiplying these partial derivatives along each path and summing over paths when a tensor is used more than once.
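The backward walk can be inspected directly. The sketch below reuses the x → y → z computation from this lesson's code section. Calling `retain_grad()` on the intermediate `y` is added here for illustration only, since PyTorch normally discards the gradients of intermediate tensors.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # leaf tensor created by the user
y = x ** 2                                  # intermediate result
y.retain_grad()                             # keep dz/dy, which is normally discarded
z = y * 4

# Each result remembers the operation that produced it
print(y.grad_fn)   # a PowBackward0 node
print(z.grad_fn)   # a MulBackward0 node
print(x.grad_fn)   # None: x was not produced by an operation

z.backward()

dz_dy = y.grad             # 4
dy_dx = 2 * x.detach()     # 2x = 6, computed by hand
print(dz_dy * dy_dx)       # product of the two partial derivatives
print(x.grad)              # the same value, computed by autograd
```

The hand-computed product and `x.grad` agree, which is the chain rule from the paragraph above applied to a two-step graph.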

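When to use it: Keep autograd enabled on model parameters during training (nn.Parameter tensors have requires_grad=True by default). Inputs only need requires_grad=True if you want gradients with respect to the inputs themselves. Disable autograd during inference with torch.no_grad() to avoid wasting memory and computation.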
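As a rough illustration of the difference between the two modes, the sketch below runs the same forward pass with and without `torch.no_grad()`. The `Linear` layer is a hypothetical stand-in for a trained model, and the input is random data.

```python
import torch

# Hypothetical stand-in for a trained model
model = torch.nn.Linear(4, 2)
inputs = torch.randn(3, 4)

# Training-style forward pass: the graph is recorded
train_out = model(inputs)
print(train_out.requires_grad)   # True, because the parameters require grad

# Inference: nothing is recorded inside the context manager
model.eval()
with torch.no_grad():
    infer_out = model(inputs)
print(infer_out.requires_grad)   # False
print(infer_out.grad_fn)         # None
```

The values in `train_out` and `infer_out` are the same; only the bookkeeping differs.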

Analogy

Think of autograd as a detailed receipt of every calculation your model made. If the final bill (loss) is higher than expected, the receipt tells you exactly which ingredient (parameter) contributed most to that cost. You can then adjust those ingredients for next time.

Code

python

import torch

# Create a tensor with gradient tracking enabled
x = torch.tensor(3.0, requires_grad=True)
print(f"x: {x}")
print(f"x.requires_grad: {x.requires_grad}")
print(f"x.grad before backward: {x.grad}")

# Perform operations — PyTorch records them in a graph
y = x ** 2
print(f"\ny = x^2 = {y}")

z = y * 4
print(f"z = y * 4 = {z}")

# Compute gradients: dz/dx
z.backward()
print(f"\nAfter z.backward():")
print(f"x.grad (dz/dx): {x.grad}")

# Verify by hand: z = 4x^2, so dz/dx = 8x = 8*3 = 24
print(f"\nManual check: dz/dx = 8*x = 8*{x.item()} = {8 * x.item()}")

Output

x: 3.0
x.requires_grad: True
x.grad before backward: None

y = x^2 = 9.0
z = y * 4 = 36.0

After z.backward():
x.grad (dz/dx): 24.0

Manual check: dz/dx = 8*x = 8*3.0 = 24.0

What just happened?

We created a tensor with <code>requires_grad=True</code>, which told PyTorch to track operations on it. We then performed two operations: squaring x to get y, then multiplying y by 4 to get z. PyTorch built an internal graph recording these steps. When we called <code>z.backward()</code>, PyTorch applied the chain rule: dz/dy = 4, dy/dx = 2x = 6, so dz/dx = 4 * 6 = 24. This gradient was stored in <code>x.grad</code>.

Common gotcha

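The most common mistake is forgetting that .grad accumulates across .backward() calls. If you run a second backward pass (for example, on the next batch's loss) without clearing .grad first, the new gradients are added to the old ones. Note that calling .backward() twice on the very same graph usually raises an error instead, unless the first call used retain_graph=True. In training loops, call optimizer.zero_grad() before each backward pass, or manually set x.grad = None.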
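A minimal sketch of the accumulation behavior, using a single scalar tensor in place of model parameters (an assumption made to keep the example short):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 6
(x ** 2).backward()
first = x.grad.item()

# Second pass on a freshly computed loss, without clearing .grad
(x ** 2).backward()
accumulated = x.grad.item()   # 6 + 6

# Clear the gradient, then run again
x.grad = None
(x ** 2).backward()
after_reset = x.grad.item()

print(first, accumulated, after_reset)   # 6.0 12.0 6.0
```

In a real training loop, `optimizer.zero_grad()` plays the role of `x.grad = None` for every parameter the optimizer manages.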

Error recovery

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

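You called <code>.backward()</code> on a tensor that has no graph behind it: none of the tensors it was computed from has <code>requires_grad=True</code>, or it was computed inside <code>torch.no_grad()</code>. Either create the input tensor with <code>requires_grad=True</code>, or make sure gradient tracking stays enabled along the whole path from the inputs to the loss.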
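The snippet below is one way to reproduce this error and then fix it. The tensor names are arbitrary.

```python
import torch

a = torch.tensor(2.0)   # requires_grad defaults to False
try:
    (a * 3).backward()
except RuntimeError as err:
    message = str(err)
    print(message)

# Fix: track gradients on the input
b = torch.tensor(2.0, requires_grad=True)
(b * 3).backward()
print(b.grad)   # d(3b)/db = 3
```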

RuntimeError: grad can be implicitly created only for scalar outputs

You called <code>.backward()</code> with no arguments on a tensor that has more than one element. Without arguments, <code>.backward()</code> works only on single-element tensors such as a scalar loss. If you have a vector, reduce it first: <code>loss.sum().backward()</code>.
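A short sketch that triggers this error on a three-element tensor and then applies the `sum()` fix:

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = w * 2   # shape (3,), not a scalar

try:
    out.backward()
except RuntimeError as err:
    message = str(err)
    print(message)

# Fix: reduce to a scalar before calling backward
(w * 2).sum().backward()
print(w.grad)   # every element has derivative 2
```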

AttributeError: 'NoneType' object has no attribute 'backward'

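The variable you called <code>.backward()</code> on is <code>None</code>, not a tensor. This usually means a function in your code (for example a custom loss or forward function) computed a result but did not return it. Running inside <code>torch.no_grad()</code> does not cause this error: it still produces a tensor, just one without a <code>grad_fn</code>, which leads to the "does not require grad" error above instead.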
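One way to end up with `None` where a loss tensor was expected is a helper that forgets its `return`. The two loss functions below are hypothetical and exist only to show the bug and its fix.

```python
import torch

def broken_loss(pred, target):
    # Bug: the result is computed but never returned, so the call yields None
    ((pred - target) ** 2).mean()

def fixed_loss(pred, target):
    return ((pred - target) ** 2).mean()

pred = torch.tensor([1.0, 2.0], requires_grad=True)
target = torch.tensor([0.0, 0.0])

loss = broken_loss(pred, target)
print(loss)   # None, so loss.backward() would raise AttributeError

loss = fixed_loss(pred, target)
loss.backward()
print(pred.grad)   # 2 * (pred - target) / 2 for a mean over two elements
```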

Experienced dev note

In PyTorch 2.11.x, you do not need torch.autograd.Variable or volatile=True: both were deprecated in PyTorch 0.4, when Tensor and Variable were merged, so every tensor can take part in autograd. Also, requires_grad propagates through operations such as slicing and reshaping: x[0] still requires grad if x does, whereas x.detach() returns a tensor that does not. Use .detach() explicitly when you want to break the gradient flow, so that it never happens by accident. In production, profile your code with torch.profiler (with profile_memory=True) to see how much memory the recorded graph is holding: you'll often find that disabling autograd during inference saves more memory than you'd expect.
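A quick check of how `requires_grad` behaves under slicing, reshaping, and `.detach()`, using a small example tensor:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

sliced = x[0]                # still part of the graph
reshaped = x.reshape(3, 1)   # still part of the graph
detached = x.detach()        # shares data with x, but cut off from the graph

print(sliced.requires_grad, reshaped.requires_grad, detached.requires_grad)
# True True False
```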

Check your understanding

If you have a model where loss = model(x), and you want to compute gradients with respect to x (not the model parameters), what would prevent you from getting x.grad after calling loss.backward(), and how would you fix it?

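Answer hint

A correct answer explains that x must have <code>requires_grad=True</code> before it is passed to the model, and that this is independent of whether the model parameters require grad. By default, input tensors are created with <code>requires_grad=False</code>, so <code>x.grad</code> stays <code>None</code>. The fix is to create x with <code>requires_grad=True</code> or to call <code>x.requires_grad_()</code> before the forward pass.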
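The scenario in this question can be sketched as follows. The `Linear` layer and the `.sum()` reduction are assumptions standing in for a real model and loss.

```python
import torch

model = torch.nn.Linear(3, 1)   # hypothetical stand-in model

# Default input: requires_grad is False, so no gradient is stored for it
x = torch.randn(2, 3)
loss = model(x).sum()
loss.backward()
print(x.grad)   # None

# Tracked input: the gradient with respect to x is now available
x = torch.randn(2, 3, requires_grad=True)
loss = model(x).sum()
loss.backward()
print(x.grad.shape)   # torch.Size([2, 3])
```

For this linear stand-in, each row of `x.grad` equals the layer's weight vector, because the loss is a sum of `weight · x + bias` terms.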

VERSION In PyTorch < 0.4, autograd required wrapping tensors in torch.autograd.Variable and using volatile=True to disable gradients. PyTorch 0.4 merged Tensor and Variable: every tensor can now track gradients, and you use the torch.no_grad() context manager instead of volatile. If you're reading old code, watch for these patterns.

Next, learn how to use <code>torch.no_grad()</code> and <code>.detach()</code> to control when gradients are computed: essential for separating training and inference.
