torch.no_grad(): disabling gradient tracking
Why this matters
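During inference (evaluation, prediction) you don't need gradients, but by default PyTorch still records the information required to compute them. torch.no_grad() skips that bookkeeping, which usually speeds up the forward pass and frees the memory that activations saved for the backward pass would otherwise occupy. How much you gain depends on the model and the hardware.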
Explanation
What it is: torch.no_grad() is a context manager (or decorator) that disables automatic differentiation. When active, PyTorch skips building the computational graph that tracks operations for gradient computation.
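Both forms can be sketched side by side. This is a minimal illustration, assuming a toy nn.Linear model; the function name predict is hypothetical and not part of PyTorch.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)   # toy model, for illustration only
x = torch.randn(2, 10)

# Form 1: context manager around a block of code
with torch.no_grad():
    out_ctx = model(x)

# Form 2: decorator applied to a whole function (hypothetical helper)
@torch.no_grad()
def predict(m, batch):
    return m(batch)

out_dec = predict(model, x)

# Neither result is attached to a computational graph
print(out_ctx.requires_grad, out_ctx.grad_fn)
print(out_dec.requires_grad, out_dec.grad_fn)
```

The decorator form is convenient when an entire function, such as an evaluation routine, should never track gradients.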
How it works mechanically: Every tensor has a requires_grad flag, and PyTorch also keeps a thread-local grad mode. Inside torch.no_grad() the grad mode is off, so the result of every operation has requires_grad=False and no grad_fn, even when its inputs have requires_grad=True. The flags on existing tensors are not changed. When you exit the context, the previous grad mode is restored. This is why it's safe to use: it only affects code that runs inside the block.
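One way to observe the mechanics is to query the grad mode with torch.is_grad_enabled() and inspect what an operation returns inside and outside the block. The sketch below assumes grad mode is on when the script starts, which is the default; the variable names are illustrative.

```python
import torch

w = torch.randn(3, requires_grad=True)   # leaf tensor that tracks gradients

mode_before = torch.is_grad_enabled()    # grad mode is normally on
tracked = w * 2                          # recorded: has a grad_fn

with torch.no_grad():
    mode_inside = torch.is_grad_enabled()
    untracked = w * 2                    # not recorded: no grad_fn
    input_flag_inside = w.requires_grad  # flag on the existing tensor

mode_after = torch.is_grad_enabled()     # previous mode is restored

print(mode_before, mode_inside, mode_after)
print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn)
print(input_flag_inside)
```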
When to use it: Use torch.no_grad() during validation, testing, and inference. Also use it when you need to modify weights manually (e.g., custom weight decay) without triggering gradients.
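The manual weight modification case might look like the following sketch. It assumes a toy model, a made-up loss, and hypothetical lr and decay values; it illustrates the pattern and is not a replacement for a real optimizer.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)   # toy model, for illustration only
lr = 0.1                   # hypothetical learning rate
decay = 0.01               # hypothetical weight decay factor

# Ordinary forward and backward pass: gradients are tracked here
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()

before = model.weight.detach().clone()

# In-place updates on parameters must not be recorded by autograd
with torch.no_grad():
    for p in model.parameters():
        p.mul_(1 - lr * decay)   # shrink the weights (custom decay)
        p.sub_(lr * p.grad)      # plain gradient descent step

# Parameters are still leaf tensors that require grad for the next step
print(model.weight.requires_grad, model.weight.is_leaf)
```

Outside the no_grad block, the same in-place operations on a leaf tensor that requires grad would raise a RuntimeError.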
Analogy
Think of gradient tracking as a security camera recording every move a tensor makes through your network. During training, you need that recording to replay and understand what caused the loss. During inference, you already know what the network does: you just want the answer. torch.no_grad() turns off the camera, making everything run faster.
Code
import torch
import torch.nn as nn

# Create a simple model
model = nn.Linear(10, 5)
input_tensor = torch.randn(2, 10, requires_grad=True)

# WITHOUT torch.no_grad() - tracks gradients
output_with_grad = model(input_tensor)
print(f"Output requires grad: {output_with_grad.requires_grad}")
print(f"Output shape: {output_with_grad.shape}")

# WITH torch.no_grad() - no gradient tracking
with torch.no_grad():
    output_no_grad = model(input_tensor)
    print(f"Output requires grad (inside no_grad): {output_no_grad.requires_grad}")
    print(f"Output shape: {output_no_grad.shape}")

print(f"Output requires grad (after no_grad): {output_no_grad.requires_grad}")

Output:
Output requires grad: True
Output shape: torch.Size([2, 5])
Output requires grad (inside no_grad): False
Output shape: torch.Size([2, 5])
Output requires grad (after no_grad): False
What just happened?
We created a model and tensor with gradient tracking enabled. When we ran the model normally, the output had requires_grad=True. Inside the torch.no_grad() context, the same model forward pass produced output with requires_grad=False. After exiting the context, the output still has requires_grad=False because it was already computed without gradients.
Common gotcha
torch.no_grad() affects every operation inside the block, including operations on tensors that have requires_grad=True: their results come out with requires_grad=False. The one exception is factory functions that accept a requires_grad argument. Calling torch.ones(3, requires_grad=True) inside the block still returns a tensor with requires_grad=True, although operations on it inside the block remain untracked. Also, exiting the context doesn't retroactively enable gradients for results computed inside: they have no graph behind them, so calling backward() on anything derived only from them fails.
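A short sketch of the "stays frozen" behavior, using illustrative variable names. It also shows the error you get when you later try to backpropagate through a result that was computed inside the block.

```python
import torch

w = torch.ones(3, requires_grad=True)

with torch.no_grad():
    frozen = w * 2              # input requires grad, result does not

after_exit = w * 2              # same op outside the block is tracked again

print(frozen.requires_grad)     # still False after leaving the block
print(after_exit.requires_grad)

error_message = None
try:
    frozen.sum().backward()     # nothing to backpropagate through
except RuntimeError as err:
    error_message = str(err)
print(error_message)
```

If a training loss is accidentally computed under torch.no_grad(), this is the failure you would typically see at the backward() call.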
Error recovery
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Unexpectedly low model accuracy or loss not decreasing
Experienced dev note
In PyTorch 2.11.x, torch.no_grad() is still the standard, but torch.inference_mode() (added in 1.9.0) can be slightly faster for inference because it also skips view tracking and version-counter updates. For production inference pipelines, consider torch.inference_mode() instead; the extra gain is usually modest and depends on the workload. The trade-off is that tensors created under inference mode cannot later be used in computations recorded by autograd. torch.no_grad() is the safer choice if you're doing anything unusual (manual gradient computation, weight updates), because its outputs are ordinary tensors that can still take part in autograd afterwards.
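The difference between the two modes can be inspected with Tensor.is_inference(). A minimal sketch, assuming a toy nn.Linear model and PyTorch 1.9 or later:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)   # toy model, for illustration only
x = torch.randn(2, 10)

with torch.no_grad():
    out_ng = model(x)      # ordinary tensor, just not tracked

with torch.inference_mode():
    out_im = model(x)      # "inference tensor" with stricter rules

# Both are untracked, but only one is flagged as an inference tensor
print(out_ng.requires_grad, out_ng.is_inference())
print(out_im.requires_grad, out_im.is_inference())
```

Both outputs hold the same values; what differs is how freely they can be reused in code that later needs gradients.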
Check your understanding
If you have a validation loop that processes 1000 batches, and you wrap only the model forward pass in torch.no_grad() but not the loss computation, will your GPU memory usage increase compared to wrapping both forward and loss together? Explain why or why not.
Answer hint
Think about what happens after loss is computed: does the loss tensor need gradients? What about intermediate activations from the model?