Intermediate · 4 min read

How to use mixed precision training for fine-tuning

Quick answer
Enable automatic mixed precision (AMP) in PyTorch during fine-tuning to speed up training and cut GPU memory use with little to no loss in accuracy. Wrap the forward pass and loss computation in torch.cuda.amp.autocast() and use a GradScaler so small float16 gradients don't underflow during the backward pass.

PREREQUISITES

  • Python 3.8+
  • PyTorch 1.10+
  • CUDA-enabled GPU
  • pip install torch torchvision

Setup

Install PyTorch with CUDA support and ensure your GPU drivers are up to date. Use the following command to install PyTorch with CUDA 11.7 support:

bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu117
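After installing, a quick sanity check confirms that PyTorch can see your GPU; AMP silently runs in float32 on CPU, so this matters before you benchmark anything:

```python
import torch

# Report the installed version and whether a CUDA device is visible;
# mixed precision gives real speedups only on a CUDA-capable GPU.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```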

Step by step

This example shows how to fine-tune a simple model using mixed precision training with PyTorch's AMP API. It includes model, optimizer, loss, and the AMP context manager.

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler

# Dummy dataset
inputs = torch.randn(64, 3, 224, 224).cuda()
targets = torch.randint(0, 10, (64,)).cuda()

# Simple model
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 111 * 111, 10)
).cuda()

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Initialize GradScaler for mixed precision
scaler = GradScaler()

model.train()
for epoch in range(1):
    optimizer.zero_grad()
    with autocast():  # Enables mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    # Scale the loss so small gradients don't underflow, then backprop
    scaler.scale(loss).backward()
    # Unscale gradients and step the optimizer (skipped if grads are non-finite)
    scaler.step(optimizer)
    # Adjust the scale factor for the next iteration
    scaler.update()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
output
Epoch 1, Loss: 2.3025

The exact loss varies with random initialization; an untrained 10-class classifier starts near ln(10) ≈ 2.303, so a value in that neighborhood is expected.
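If fine-tuning spans multiple sessions, the scaler's state should be checkpointed alongside the model and optimizer so the learned loss scale survives a restart. A minimal sketch (the file name is illustrative):

```python
import torch
from torch.cuda.amp import GradScaler

use_amp = torch.cuda.is_available()
scaler = GradScaler(enabled=use_amp)

# Save the scaler's state together with model/optimizer state dicts
# (path name is illustrative)
torch.save({"scaler": scaler.state_dict()}, "checkpoint.pt")

# On resume, restore it into a fresh GradScaler
restored = GradScaler(enabled=use_amp)
restored.load_state_dict(torch.load("checkpoint.pt")["scaler"])
```

Note that a disabled GradScaler saves an empty state dict and load_state_dict becomes a no-op, so the same code works on CPU-only machines.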

Common variations

  • Use torch.cuda.amp.autocast() selectively on forward passes for custom models.
  • For distributed training, AMP integrates cleanly with DistributedDataParallel (DDP) setups.
  • Other frameworks like TensorFlow have similar mixed precision APIs (e.g., tf.keras.mixed_precision).
  • Adjust GradScaler parameters for stability on different hardware.
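The first and last variations can be sketched together: selectively disabling autocast for a numerically sensitive op, and passing custom GradScaler parameters. A hedged sketch with a CPU fallback that disables AMP so the code still runs without a GPU:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

use_amp = torch.cuda.is_available()  # AMP only affects CUDA ops
device = "cuda" if use_amp else "cpu"

# A more conservative scaler: lower starting scale and more frequent
# growth checks, which can help on hardware prone to overflows
scaler = GradScaler(init_scale=2.0 ** 12, growth_interval=1000, enabled=use_amp)

layer = torch.nn.Linear(32, 32).to(device)
x = torch.randn(8, 32, device=device)

with autocast(enabled=use_amp):
    y = layer(x)  # the matmul runs in float16 when AMP is active
    with autocast(enabled=False):
        # keep a numerically sensitive reduction in float32
        stable_norm = y.float().norm()
```

Nesting autocast(enabled=False) inside an enabled region is the standard way to pin specific ops to full precision without giving up AMP elsewhere.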

Troubleshooting

  • If you see RuntimeError: CUDA out of memory, reduce the batch size; since AMP already lowers activation memory, an OOM usually means the batch or model is simply too large for the GPU.
  • Non-finite gradients can cause GradScaler to skip steps; monitor scaler warnings and adjust learning rate.
  • Ensure your GPU supports Tensor Cores (NVIDIA Volta or newer) for best AMP performance.
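For the second issue, a skipped step can be detected by watching the scale: GradScaler shrinks it when step() finds non-finite gradients. A minimal sketch (falls back to CPU, where scaling is disabled):

```python
import torch
from torch.cuda.amp import GradScaler

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = torch.nn.Linear(4, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=use_amp)

loss = model(torch.randn(2, 4, device=device)).sum()

prev_scale = scaler.get_scale()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()

# update() shrinks the scale if step() found inf/nan gradients,
# so a drop in get_scale() means the optimizer step was skipped
step_skipped = scaler.get_scale() < prev_scale
```

Logging how often steps are skipped is a cheap way to spot instability early; frequent skips suggest lowering the learning rate or init_scale.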

Key Takeaways

  • Enable mixed precision with torch.cuda.amp.autocast() to speed up fine-tuning and reduce GPU memory usage.
  • Use GradScaler to safely scale gradients and avoid underflow during backpropagation.
  • Test and adjust batch size and learning rate when using mixed precision to maintain training stability.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022