What autograd does: computing gradients
Why this matters
Gradients are how neural networks learn: they tell you which direction to adjust weights to reduce loss. Without autograd, you'd compute derivatives by hand for every model. Understanding how it works prevents you from accidentally breaking gradient flow in production.
Explanation
Autograd is PyTorch's automatic differentiation engine. When you perform operations on tensors with requires_grad=True, PyTorch records what you did in a computational graph. Later, when you call .backward() on a scalar loss, PyTorch walks backwards through that graph using the chain rule to compute how sensitive the loss is to each tensor that requires grad. The result is stored in the .grad attribute of each leaf tensor (a tensor you created directly, such as a model parameter); intermediate results do not keep a .grad unless you call .retain_grad() on them.
Mechanically: the graph's nodes are the operations (addition, multiplication, ReLU, etc.), and the tensors flowing between them are its edges. Every tensor produced by an operation keeps a reference to that operation in its grad_fn attribute; tensors you create yourself are leaves and have grad_fn set to None. When you call .backward(), PyTorch uses the chain rule: if loss depends on y, and y depends on x, then ∂loss/∂x = (∂loss/∂y) × (∂y/∂x). It traces backward from loss to all parameters, multiplying these partial derivatives together along each path and summing over paths when a tensor is used more than once.
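The chain-rule description above can be checked directly. The following is a minimal sketch, assuming the same toy function used later in this lesson (loss = 4x^2); the variable names are illustrative only. It inspects grad_fn and uses torch.autograd.grad to compute each factor of the chain rule on its own.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # leaf tensor created by the user
y = x ** 2                                  # intermediate result
loss = y * 4                                # scalar result

# Leaf tensors have no grad_fn; results of operations do.
print(x.is_leaf, x.grad_fn)
print(type(y.grad_fn).__name__)     # name of the backward node for the power op
print(type(loss.grad_fn).__name__)  # name of the backward node for the multiply op

# Each factor of the chain rule, computed separately.
# retain_graph=True keeps the graph alive so it can be walked again.
(dloss_dy,) = torch.autograd.grad(loss, y, retain_graph=True)
(dy_dx,) = torch.autograd.grad(y, x, retain_graph=True)
(dloss_dx,) = torch.autograd.grad(loss, x)

# dloss/dy = 4, dy/dx = 2x = 6, and their product is dloss/dx = 24
print(dloss_dy.item(), dy_dx.item(), dloss_dx.item())
```

torch.autograd.grad returns the gradients as a tuple instead of writing them into .grad, which makes it convenient for this kind of inspection.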
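When to use it: During training, model parameters must have requires_grad=True (nn.Parameter sets this by default). Inputs normally do not need it; enable it on an input only when you want gradients with respect to that input, for example for saliency maps or adversarial examples. Disable autograd during inference with torch.no_grad() to avoid wasting memory and computation.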
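As a small illustration of the difference between a training-style and an inference-style forward pass, here is a sketch in which a single scalar w stands in for a model parameter (an assumption made for brevity, not a real model).

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # stands in for a model parameter
x = torch.tensor(5.0)                      # input data; no gradient needed

# Training-style forward pass: the result is attached to a graph.
out_train = w * x
print(out_train.requires_grad)   # True

# Inference-style forward pass: nothing is recorded.
with torch.no_grad():
    out_infer = w * x
print(out_infer.requires_grad)   # False
print(out_infer.grad_fn)         # None
```

Because out_infer has no grad_fn, calling .backward() on it would fail, which is the intended trade-off during inference.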
Analogy
Think of autograd as a detailed receipt of every calculation your model made. If the final bill (loss) is higher than expected, the receipt tells you exactly which ingredient (parameter) contributed most to that cost. You can then adjust those ingredients for next time.
Code
import torch
# Create a tensor with gradient tracking enabled
x = torch.tensor(3.0, requires_grad=True)
print(f"x: {x}")
print(f"x.requires_grad: {x.requires_grad}")
print(f"x.grad before backward: {x.grad}")
# Perform operations — PyTorch records them in a graph
y = x ** 2
print(f"\ny = x^2 = {y}")
z = y * 4
print(f"z = y * 4 = {z}")
# Compute gradients: dz/dx
z.backward()
print(f"\nAfter z.backward():")
print(f"x.grad (dz/dx): {x.grad}")
# Verify by hand: z = 4x^2, so dz/dx = 8x = 8*3 = 24
print(f"\nManual check: dz/dx = 8*x = 8*{x.item()} = {8 * x.item()}")
Output
x: 3.0
x.requires_grad: True
x.grad before backward: None

y = x^2 = 9.0
z = y * 4 = 36.0

After z.backward():
x.grad (dz/dx): 24.0

Manual check: dz/dx = 8*x = 8*3.0 = 24.0
What just happened?
We created a tensor with requires_grad=True, which told PyTorch to track operations on it. We then performed two operations: squaring x to get y, then multiplying y by 4 to get z. PyTorch built an internal graph recording these steps. When we called z.backward(), PyTorch applied the chain rule: dz/dy = 4, dy/dx = 2x = 6, so dz/dx = 4 * 6 = 24. This gradient was stored in x.grad.
Common gotcha
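The most common mistake is forgetting that .grad accumulates across multiple .backward() calls. If you run a second backward pass without clearing .grad first, the new gradients are added to the old ones instead of replacing them. In training loops, you must call optimizer.zero_grad() before each backward pass, or manually set x.grad = None. (Calling .backward() a second time on the very same loss tensor raises a RuntimeError unless the first call used retain_graph=True, because the graph's saved buffers are freed after the first pass.)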
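The accumulation behaviour is easy to see on a single scalar. This sketch rebuilds the graph for each pass (so no retain_graph is needed) and resets the gradient by hand, which is one of the two options mentioned above.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 6
(x ** 2).backward()
first = x.grad.item()

# Second backward pass on a fresh graph, without clearing x.grad:
# the new gradient is added to the stored one.
(x ** 2).backward()
accumulated = x.grad.item()

# Clear the stored gradient, then run a third pass.
x.grad = None
(x ** 2).backward()
after_reset = x.grad.item()

print(first, accumulated, after_reset)  # 6.0 12.0 6.0
```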
Error recovery
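RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
You called .backward() on a tensor that is not attached to any graph. Typical causes: no input had requires_grad=True, the forward pass ran inside torch.no_grad(), or the tensor was detached along the way.
RuntimeError: grad can be implicitly created only for scalar outputs
You called .backward() on a tensor with more than one element. Reduce it to a scalar first (for example with .sum() or .mean()), or pass an explicit gradient argument to .backward().
AttributeError: 'NoneType' object has no attribute 'backward'
The object you called .backward() on is None rather than a tensor. Typically a function that computes the loss forgot to return it.
Experienced dev note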
In PyTorch 2.11.x, as in every release since 0.4.0, you no longer wrap tensors in torch.autograd.Variable or pass volatile=True. Variable and Tensor were merged in 0.4.0, so every tensor can track gradients directly, and the torch.no_grad() context manager replaced volatile; you will still see both in old code. Also, gradient tracking survives slicing and reshaping: if x requires grad, then x[0] and x.reshape(...) still require grad and stay connected to the graph, whereas x.detach() returns a tensor that does not. Use .detach() explicitly when you want to break the gradient flow, not by accident. In production, profile your code with torch.profiler to see how much memory autograd is using: you'll often find that disabling it during inference saves more memory than you'd expect.
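To make the slicing versus .detach() distinction concrete, here is a short sketch on a three-element tensor (the values are arbitrary and chosen only for illustration).

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

sliced = x[0]               # indexing is a tracked operation
reshaped = x.reshape(3, 1)  # so is reshaping
detached = x.detach()       # same data, cut off from the graph

print(sliced.requires_grad, reshaped.requires_grad, detached.requires_grad)
# True True False

# Gradients flow back through the slice to the original tensor.
# Only element 0 took part, so the other entries receive zero.
(sliced * 10).backward()
print(x.grad)  # tensor([10.,  0.,  0.])
```

Note that detached shares storage with x, so in-place changes to one are visible in the other; only the graph connection is removed.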
Check your understanding
If you have a model where loss = model(x), and you want to compute gradients with respect to x (not the model parameters), what would prevent you from getting x.grad after calling loss.backward(), and how would you fix it?
Answer hint
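A correct answer explains that x must have requires_grad=True before the forward pass (it does not have to be set at creation time), and that this is independent of whether the model parameters require grad. Without it, loss.backward() still succeeds because the parameters require grad, but x.grad stays None. The fix is to call x.requires_grad_() (or set x.requires_grad = True) before passing x to the model; x must be a floating-point leaf tensor for this to work.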
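A minimal sketch of this answer follows. The model here is a hypothetical stand-in (a single torch.nn.Linear layer) and the loss is just the sum of its outputs; both are assumptions chosen to keep the example short.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # hypothetical stand-in for a real model

# Case 1: a plain input. backward() works because the parameters
# require grad, but nothing is stored for x.
x = torch.randn(2, 4)
loss = model(x).sum()
loss.backward()
grad_without = x.grad
print(grad_without)  # None

# Case 2: enable tracking on the input before the forward pass.
x = torch.randn(2, 4).requires_grad_()
loss = model(x).sum()
loss.backward()
print(x.grad.shape)  # torch.Size([2, 4])
```

For this linear model with a summed output, each row of x.grad equals the layer's weight vector, which is a quick sanity check that the gradient is taken with respect to the input.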