Code Intermediate medium · 6 min

Mixed precision: torch.autocast

What you will learn

Use torch.autocast to run eligible forward-pass operations in lower precision (float16 or bfloat16), reducing memory use and speeding up training while model parameters and their gradients stay in float32.

Why this matters

Mixed precision training can roughly halve activation memory and speed up training by 2–4× on modern GPUs (A100, H100, RTX 40-series), usually without sacrificing model accuracy. For large models, this can be the difference between fitting in VRAM and hitting out-of-memory errors.

Skip if: (1) your model relies on operations that overflow or lose precision in float16 and that autocast does not already keep in float32 (for example, custom kernels or intermediate values outside float16's range), (2) you are on CPU-only hardware where the speedup is negligible, or (3) you need results that match a float32 baseline exactly and float16 rounding differences matter for your use case.

Explanation

torch.autocast automatically casts eligible operations to float16 (or bfloat16) during the forward pass. This hybrid approach, often called "mixed precision", exploits the fact that compute-heavy operations tolerate lower precision, while the parameters, their gradients, and the optimizer update stay in float32 for stability. Mechanically, autocast wraps your forward pass and intercepts operations, casting inputs and weights to float16 for compute-heavy layers (matmul, conv2d) and running precision-sensitive ops (softmax, layer normalization, loss functions) in float32. During backward(), each op runs in the same dtype autocast chose for its forward counterpart, and the gradients that reach the float32 parameters are float32. You enable it with the context manager torch.amp.autocast('cuda') or torch.amp.autocast('cpu'); the default lower-precision dtype is float16 on CUDA and bfloat16 on CPU. Use it when training large models on GPU; it requires no changes to the model code itself.
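To make the casting behavior concrete, here is a minimal sketch that inspects dtypes around a single linear layer. The CPU fallback with bfloat16 is an assumption added so the snippet runs without a GPU; the names `low_dtype`, `param_dtype`, `out_dtype`, and `grad_dtype` are illustrative only.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU + bfloat16 when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
low_dtype = torch.float16 if device == "cuda" else torch.bfloat16

layer = nn.Linear(8, 4).to(device)       # parameters are created in float32
x = torch.randn(2, 8, device=device)     # float32 input

with torch.amp.autocast(device, dtype=low_dtype):
    y = layer(x)                          # linear runs in the lower-precision dtype

param_dtype = layer.weight.dtype          # autocast casts a temporary copy; the stored weight is untouched
out_dtype = y.dtype

y.float().sum().backward()
grad_dtype = layer.weight.grad.dtype      # the gradient takes the dtype of its float32 parameter

print(param_dtype, out_dtype, grad_dtype)
```

The stored weight and its gradient should report float32, while the layer output reports the lower-precision dtype.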

Analogy

Think of a film crew: the camera (float16) captures footage fast and cheap, but the color grading (float32) happens in high-precision post-production. You save on compute without losing quality in the final output.

Code

python

import torch
import torch.nn as nn
import torch.optim as optim

# Simple model for demonstration
model = nn.Sequential(
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 10)
).cuda()

optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch
x = torch.randn(32, 1024, device='cuda')
y = torch.randint(0, 10, (32,), device='cuda')

# WITHOUT autocast: forward in float32
with torch.no_grad():
    out_fp32 = model(x)
    print(f"Weight dtype without autocast: {model[0].weight.dtype}")
    print(f"Output dtype without autocast: {out_fp32.dtype}")

# WITH autocast: eligible forward ops run in float16; parameters stay float32
with torch.amp.autocast('cuda'):
    logits = model(x)
    loss = loss_fn(logits, y)
    print(f"\nWeight dtype inside autocast: {model[0].weight.dtype}")
    print(f"Logits dtype inside autocast: {logits.dtype}")

optimizer.zero_grad()
loss.backward()
print(f"Gradient dtype: {model[0].weight.grad.dtype}")
optimizer.step()

print("\nTraining step completed with mixed precision.")

Output

Weight dtype without autocast: torch.float32
Output dtype without autocast: torch.float32

Weight dtype inside autocast: torch.float32
Logits dtype inside autocast: torch.float16
Gradient dtype: torch.float32

Training step completed with mixed precision.

What just happened?

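The code demonstrated the difference between normal float32 training and mixed precision. Outside autocast, all tensors stay float32. Inside the autocast context, the outputs of the linear layers become float16 (you can see the logits are float16), while the stored weights stay float32 because autocast casts temporary copies rather than the parameters themselves. The gradients are float32 because each gradient takes the dtype of the parameter it belongs to; the backward ops themselves run in whatever dtype autocast chose for the matching forward ops. The optimizer then updates float32 weights with float32 gradients.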

Common gotcha

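A common misconception is that loss.backward() must sit outside the autocast context to keep gradients in float32. In fact, backward ops run in the dtype autocast used for the corresponding forward ops, regardless of where backward() is called. PyTorch recommends wrapping only the forward pass and the loss computation, so keep backward() outside the block anyway. The real float16 hazard is gradient underflow: small gradient values round to zero in float16's narrow range, so float16 autocast should be paired with torch.amp.GradScaler. Also, autocast('cuda') on a machine without CUDA is disabled with a warning rather than falling back to CPU; use autocast('cpu') if you want casting on CPU, where the default dtype is bfloat16 and the speedup is usually small.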

Error recovery

RuntimeError: autocast not available for device

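You called autocast('cuda') but CUDA is not available. Check torch.cuda.is_available() or use autocast('cpu') instead. Note that recent PyTorch versions usually emit a warning and disable autocast in this situation rather than raising, so also watch for that warning in your logs.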

NaN loss during training

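The usual causes are float16 overflow in the forward pass (values above about 65,504 become inf) or training in float16 without loss scaling. Wrap the backward pass and optimizer step with torch.amp.GradScaler, which skips steps whose gradients contain inf or NaN, or switch to bfloat16 if your hardware supports it. Keeping loss.backward() outside the `with torch.amp.autocast():` block is the recommended layout, but moving it does not by itself change gradient dtypes.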

TypeError: argument of type 'NoneType' is not iterable

Rare: occurs if autocast tries to cast a None tensor. Check that all model outputs are valid tensors, not None.

Experienced dev note

In PyTorch 2.11.x, the API is torch.amp.autocast(), not the deprecated torch.cuda.amp.autocast() from earlier versions. The newer API is device-agnostic and integrates better with torch.amp.GradScaler for loss scaling (which you'll need for stability in larger models). Also, autocast is scoped: used as a context manager (or as a decorator on a function), it only affects code inside that region, and everything else runs at normal precision. This is safer than global dtype switches and easier to reason about.
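The scoping can be checked directly. The sketch below assumes a CPU fallback with bfloat16 so it runs without a GPU; the variable names are illustrative.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
low_dtype = torch.float16 if device == "cuda" else torch.bfloat16  # assumption: bfloat16 on CPU

a = torch.randn(4, 4, device=device)
b = torch.randn(4, 4, device=device)

before = torch.mm(a, b).dtype                  # outside any autocast region: float32

with torch.amp.autocast(device, dtype=low_dtype):
    inside = torch.mm(a, b).dtype              # matmul is cast to the lower-precision dtype
    with torch.amp.autocast(device, enabled=False):
        nested_off = torch.mm(a, b).dtype      # locally disabled: float32 inputs give float32

after = torch.mm(a, b).dtype                   # region has exited: float32 again

print(before, inside, nested_off, after)
```

Nesting a region with enabled=False is one way to force a sensitive sub-computation back to full precision; if its inputs were produced under autocast, cast them with .float() first.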

Check your understanding

After a forward pass under float16 autocast, what dtype do the parameter gradients have, and why can training still become unstable without loss scaling?

Show answer hint

A correct answer explains that gradients for float32 parameters are float32, wherever backward() is called, because each gradient takes the dtype of its parameter. The instability comes from the intermediate backward ops, which run in float16 like their forward counterparts: small gradient values underflow to zero and large ones overflow to inf before they reach the parameters. The answer should also name loss scaling (torch.amp.GradScaler) as the remedy.

VERSION PyTorch < 1.10: torch.cuda.amp.autocast() was the only API and was CUDA-only. PyTorch 1.10+: torch.amp.autocast() became the standard and works on both 'cuda' and 'cpu'. PyTorch 2.0+: old torch.cuda.amp.autocast() is soft-deprecated in favor of torch.amp.autocast(). Use torch.amp.autocast() in all new code for 2.11.x.

Next, learn torch.amp.GradScaler to dynamically scale loss before backward(), which prevents gradient underflow when using float16 and is critical for stable large-model training.
