Accelerator class in training code
Why this matters
Training large transformer models requires handling distributed GPUs, TPUs, mixed precision, and gradient accumulation, tasks that are error-prone and hardware-specific when done manually. The Accelerator class abstracts away most of this boilerplate, letting you write single-device code that scales to multi-node clusters with little or no modification.
Explanation
The Accelerator class is a hardware-agnostic wrapper for PyTorch training code, provided by the Accelerate library. It manages device placement, distributed training setup, mixed precision (AMP), and gradient synchronization. You write your training loop as if it runs on a single GPU, and Accelerator handles multi-GPU, multi-node, TPU, or CPU fallback.
Mechanically: you pass your model, optimizer, and dataloaders to accelerator.prepare(), which returns device-aware versions. When you call accelerator.backward(loss) instead of loss.backward(), it scales the loss for FP16 mixed precision before backpropagating, and gradients are synchronized across processes during that backward pass. The prepared optimizer then unscales the gradients and skips the step if they overflowed. The key is that the same code works unchanged whether you run it on 1 GPU, 8 GPUs on one node, or 64 GPUs across 8 nodes, because the distributed setup is determined by environment variables at launch time (set by the accelerate launch CLI).
When to use it: reach for Accelerator in any training loop you're writing from scratch where you want multi-GPU or TPU support without reimplementing distributed logic.
Analogy
Accelerator is like writing your code in a portable high-level language (your single-GPU training loop) and letting a compiler (Accelerator) translate it into the specific hardware instructions (multi-GPU AllReduce, NCCL synchronization, TPU XLA) for whatever machine you launch on. You describe what you want to train, not how to orchestrate it across hardware.
Code
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# fp16 mixed precision typically requires a GPU or other accelerator;
# use mixed_precision='no' on a CPU-only machine.
accelerator = Accelerator(mixed_precision='fp16')

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 2)
)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data: 100 samples, 10 features, 2 classes.
X_train = torch.randn(100, 10)
y_train = torch.randint(0, 2, (100,))
dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Rebind all three names to the prepared (device-aware) objects.
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

model.train()
total_loss = 0.0  # summed over every batch of both epochs
for epoch in range(2):
    for batch_X, batch_y in train_loader:  # batches arrive on the right device
        optimizer.zero_grad()
        logits = model(batch_X)
        loss = loss_fn(logits, batch_y)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        total_loss += loss.item()

# Report once, from the main process only.
if accelerator.is_main_process:
    print(f'Training complete. Total loss: {total_loss:.4f}')

Example output (the exact total depends on the random data and initial weights):
Training complete. Total loss: 61.2847
What just happened?
Because the script was started with plain `python` rather than `accelerate launch`, Accelerator ran it as a single process on one device, and prepare() placed the model and each batch on that device. The training loop ran for 2 epochs. With mixed_precision='fp16', the forward pass runs under autocast in FP16 where that is safe, the weights stay in FP32 for the parameter updates, and accelerator.backward() scales the loss before backpropagating so that small gradients don't underflow. The print only runs on the main process (rank 0) to avoid duplicate output in distributed setups. If launched with `accelerate launch --multi_gpu --num_processes=4 script.py`, the exact same code would distribute across 4 GPUs with automatic gradient synchronization, with no code changes needed.
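To see why FP16 training needs loss scaling at all, the following standard-library sketch rounds a very small gradient to half precision with and without scaling. It illustrates the idea only and is not how Accelerate or PyTorch implement it; the helper `to_fp16` and the scale factor of 65536 are assumptions chosen for the illustration.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8      # smaller than anything FP16 can represent (about 6e-8)
loss_scale = 65536.0  # illustrative scale factor (2**16), an assumption

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled_fp16 = to_fp16(tiny_grad)

# With scaling, the value survives the trip through FP16; dividing by the
# scale in higher precision recovers (approximately) the original gradient.
scaled_fp16 = to_fp16(tiny_grad * loss_scale)
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16)
print(recovered)
```

In a real run the scale factor is adjusted dynamically and steps with overflowed gradients are skipped, which is the bookkeeping Accelerator takes off your hands.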
Common gotcha
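Developers often forget that you must keep using the objects accelerator.prepare() returns. In a distributed run the returned model is a wrapper around your module, so calling `accelerator.prepare(model)` without rebinding the name and then training through the old `model` variable bypasses that wrapper, and with it gradient synchronization and mixed-precision autocasting. Also, when saving checkpoints to resume training, prefer `accelerator.save_state(output_dir)` and `accelerator.load_state(output_dir)` over raw `torch.save()`: they save and restore the model, optimizer, random-number-generator states, and the mixed-precision gradient scaler together. To export only the final weights, unwrap the model first with `accelerator.unwrap_model(model)`.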
Error recovery
RuntimeError: expected scalar type Float but found Half
CUDA out of memory
Expected all tensors to be on the same device
Rank 0 hangs waiting for other processes
Experienced dev note
The biggest mistake is thinking Accelerator is just for multi-GPU training. Its real power is that it handles mixed precision *correctly* (FP16 compute with FP32 parameter updates and loss scaling) without you managing the scaling factor. The second mistake is launching with `python script.py` and wondering why multi-GPU code doesn't work: a plain Python launch starts a single process, so to use several GPUs you must start the script with `accelerate launch` (or another distributed launcher), even on a single machine. Third: many developers still manually call `model.to(device)` with a hard-coded device before handing the model to Accelerator; in a multi-process run this can put tensors on the wrong GPU and cause device mismatch bugs. Declare your model, call prepare(), and let Accelerator handle all device logic; if you need the device explicitly, use `accelerator.device`. Also, if a multi-GPU run behaves differently from your single-GPU run, look at the DataLoader. prepare() does not disable shuffling. Instead it shards the batches so that each process sees a different slice of every epoch, which means each process iterates over fewer batches and the global batch size becomes batch_size × num_processes.
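As a conceptual illustration of how a prepared DataLoader divides an epoch in a multi-process run, the sketch below simulates the sharding in plain Python. It is a simplified model, not Accelerate's implementation: the function `shard_batches` is hypothetical, it assumes every process shuffles with the same seed, and it ignores how uneven final batches are handled.

```python
import random


def shard_batches(num_samples, batch_size, num_processes, seed):
    """Conceptual model: every process shuffles with the same seed, then
    process i takes batches i, i + num_processes, i + 2 * num_processes, ..."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # same permutation on every process
    batches = [
        indices[start:start + batch_size]
        for start in range(0, num_samples, batch_size)
    ]
    return [batches[rank::num_processes] for rank in range(num_processes)]


shards = shard_batches(num_samples=128, batch_size=16, num_processes=4, seed=0)
for rank, batches in enumerate(shards):
    print(f"process {rank}: {len(batches)} batches of {len(batches[0])} samples")

# Every sample is visited exactly once per epoch across all processes.
seen = [index for batches in shards for batch in batches for index in batch]

# Samples consumed per optimizer step, summed over all processes.
global_batch_size = 16 * 4
```

The single-process loop would have run 8 steps of 16 samples; here each process runs 2 steps, and each step consumes 64 samples in total.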
Check your understanding
You have a training loop that works on a single GPU with batch_size=32. You want to scale it to 4 GPUs using Accelerator while keeping the learning rate and the effective (global) batch size unchanged. What must you change in your code, and why would changing only the launch command (to use 4 GPUs) be insufficient?
Show answer hint
A correct answer explains that the DataLoader's batch_size is per process: after prepare(), each of the 4 GPUs draws its own batch of 32, so the global batch grows to 128. The key insight is that Accelerator doesn't automatically reduce batch_size per GPU: that's your responsibility. To keep the global batch at 32, set batch_size=8 in the DataLoader (or enable Accelerate's split_batches option). If you instead keep 32 per GPU, the learning rate tuned for a global batch of 32 may need rescaling, by num_gpus (linear rule) or by sqrt(num_gpus), depending on your scaling rule. The launch command alone is insufficient because it only changes how many processes run your script; it leaves the per-process batch size and the learning rate untouched. Splitting the data across ranks needs no code change, since prepare() handles it.
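The batch-size arithmetic behind this question can be sketched in a few lines of plain Python. The helper names below are hypothetical, and the two learning-rate rules are common heuristics rather than anything Accelerate applies for you.

```python
import math


def global_batch_size(per_process_batch: int, num_processes: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_process_batch * num_processes


def per_process_batch_for(target_global_batch: int, num_processes: int) -> int:
    """DataLoader batch_size needed to hit a target global batch."""
    if target_global_batch % num_processes != 0:
        raise ValueError("target global batch must divide evenly across processes")
    return target_global_batch // num_processes


def scaled_lr(base_lr: float, batch_growth: float, rule: str = "linear") -> float:
    """Rescale a learning rate when the global batch grows by `batch_growth`."""
    if rule == "linear":
        return base_lr * batch_growth
    if rule == "sqrt":
        return base_lr * math.sqrt(batch_growth)
    raise ValueError(f"unknown rule: {rule}")


# Keeping batch_size=32 in the DataLoader on 4 GPUs quadruples the global batch.
print(global_batch_size(32, 4))

# To keep the global batch at 32 on 4 GPUs, each process needs a smaller batch.
print(per_process_batch_for(32, 4))

# If the global batch does grow 4x, these are two candidate learning rates.
print(scaled_lr(1e-3, 4, rule="linear"), scaled_lr(1e-3, 4, rule="sqrt"))
```

Which rule suits a given model is an empirical question; the sketch only makes the bookkeeping explicit.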