Code Advanced hard · 8 min

Accelerator class in training code

What you will learn

Use HuggingFace Accelerate's Accelerator class to handle distributed training, mixed precision, and device management automatically across any hardware setup.

Why this matters

Training large transformer models requires handling distributed GPUs, TPUs, mixed precision, and gradient accumulation: tasks that are error-prone and hardware-specific when done manually. The Accelerator class abstracts away most of this boilerplate, letting you write single-device code that scales to multi-node clusters with little or no modification.

Skip if: You are using a high-level trainer class like HuggingFace's Trainer (which wraps Accelerator internally), or you are doing inference-only work. It is also overkill for toy scripts under 100 lines or proof-of-concept code where training time doesn't matter.

Explanation

The Accelerator class is a hardware-agnostic wrapper for PyTorch from the Accelerate library that manages device placement, distributed training setup, mixed precision (AMP), and gradient synchronization. You write your training loop as if it runs on a single GPU, and Accelerator handles multi-GPU, multi-node, TPU, or CPU fallback.

Mechanically: you pass your model, optimizer, and dataloaders to accelerator.prepare(), which returns device-aware versions. When you call accelerator.backward(loss), it applies loss scaling for mixed precision before backpropagating; in distributed runs, gradients are synchronized across processes during that backward pass, and a step whose gradients overflowed in FP16 is skipped when you call optimizer.step(). The key is that the same code works whether you run it on 1 GPU, 8 GPUs on one node, or 64 GPUs across 8 nodes: the distributed setup is determined by environment variables at launch time (set by the accelerate launch CLI).

When to use it: Reach for Accelerator in any training loop you're writing from scratch where you want multi-GPU or TPU support without reimplementing distributed logic.

Analogy

Accelerator is like writing your program in a portable high-level language (your single-GPU training loop) and letting a compiler (Accelerator) translate it into hardware-specific instructions (multi-GPU all-reduce over NCCL, XLA on TPU) at launch time. You describe what you want to train, not how to orchestrate it across hardware.

Code

python

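import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# fp16 mixed precision needs a GPU; fall back to full precision on CPU-only machines.
mixed_precision = 'fp16' if torch.cuda.is_available() else 'no'
accelerator = Accelerator(mixed_precision=mixed_precision)

# A small classifier: 10 input features, 2 output classes.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 2)
)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data: 100 random samples with random binary labels.
X_train = torch.randn(100, 10)
y_train = torch.randint(0, 2, (100,))
dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Rebind the names: always train with the objects that prepare() returns.
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

model.train()
total_loss = 0.0
for epoch in range(2):
    # Batches from a prepared loader are already on accelerator.device.
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        logits = model(batch_X)
        loss = loss_fn(logits, batch_y)
        # Use accelerator.backward() instead of loss.backward() so loss scaling is applied.
        accelerator.backward(loss)
        optimizer.step()
        total_loss += loss.item()  # sum of per-batch losses seen by this process

# Print once, from the main process only.
if accelerator.is_main_process:
    print(f'Training complete. Total loss: {total_loss:.4f}')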

Output

Training complete. Total loss: <roughly 10; the exact value varies per run because no seed is set>

What just happened?

Accelerator prepared the model, optimizer, and dataloader for single-process execution (since we didn't launch with `accelerate launch`, it ran on 1 GPU or on the CPU). The training loop ran for 2 epochs. With mixed precision enabled, the prepared model runs its forward pass under autocast, so eligible operations compute in FP16 while the parameters stay in FP32, and accelerator.backward() scales the loss before backpropagating so that small FP16 gradients do not underflow. The print only runs on the main process (rank 0) to avoid duplicate output in distributed setups. If launched with `accelerate launch --multi_gpu --num_processes=4 script.py`, the same code would distribute across 4 GPUs with automatic gradient synchronization: no code changes needed.

Common gotcha

Developers often forget to use the objects that accelerator.prepare() returns. Calling `accelerator.prepare(model)` without rebinding the result and then training with the old `model` variable means you may be training an unwrapped model, with no gradient synchronization across processes. Also, when saving checkpoints for resuming, prefer `accelerator.save_state(output_dir)` and `accelerator.load_state(output_dir)` over raw `torch.save()` on the prepared objects: save_state captures the model, optimizer, mixed-precision scaler, and random number generator states together. To export weights only, unwrap the model first with `accelerator.unwrap_model(model)`.

Error recovery

RuntimeError: expected scalar type Float but found Half

Part of your model or data was cast to FP16 by hand (for example with .half()) while other tensors stayed in FP32, or a layer doesn't support FP16. Fix: remove manual .half() calls, pass mixed_precision='fp16' to Accelerator() so autocast picks the dtype per operation, and check that all custom layers handle FP16 rather than hard-coding a dtype (a common issue with custom normalization layers or activations).

CUDA out of memory

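By default, Accelerator treats the DataLoader's batch_size as the per-process batch size and does not divide it across GPUs. If batch_size=32 on 4 GPUs, each GPU gets 32 samples per step and the global batch is 128, so adding GPUs does not lower per-GPU memory use. Fix: reduce your DataLoader batch_size until one batch fits on a single device. If you need a larger effective batch, pass gradient_accumulation_steps to Accelerator() and wrap each training step in accelerator.accumulate(model).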

Expected all tensors to be on the same device

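You called accelerator.prepare() but a tensor ended up elsewhere: either it was moved by hand to a hard-coded device (for example .to('cuda:0')), or it was created inside the loop and stayed on the CPU. Fix: remove hard-coded .to() calls for the model and for batches from prepared dataloaders, since prepare() places those. For any tensor you create yourself (masks, class weights, manually loaded batches), place it on accelerator.device.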

Rank 0 hangs waiting for other processes

Your code has an if/else that runs differently on rank 0 vs other ranks, and one rank exits early. Fix: ensure all processes execute the same backward() and optimizer.step() calls; use accelerator.is_main_process only for I/O and logging, not control flow in the training loop.

Experienced dev note

The biggest mistake is thinking Accelerator is just for multi-GPU training. Its real power is that it handles mixed precision *correctly* (FP16 compute, FP32 parameter updates, and loss scaling) without you managing the scaling factor. The second mistake is launching locally with `python script.py` and wondering why only one GPU is used; multi-GPU runs need a distributed launcher such as `accelerate launch`, even on a single machine, to set up the distributed environment. Third: many developers still hard-code `model.to('cuda')` before calling prepare(); in a multi-process run this can put every process on the same GPU or cause device mismatch errors. Declare your model, pass it to prepare(), and use accelerator.device for anything you place yourself. Also, if the data each rank sees looks wrong, check that the DataLoader actually went through prepare(): Accelerator does not disable shuffling, it shards the batches across processes and synchronizes the sampler's random state so that every process shuffles the same way and takes a different slice. A DataLoader that bypasses prepare() feeds the full dataset to every rank.

Check your understanding

You have a training loop that works on a single GPU with batch_size=32. You want to scale it to 4 GPUs using Accelerator without changing the learning rate or effective batch size per GPU. What must you change in your code, and why would changing only the launch command (to use 4 GPUs) be insufficient?

Show answer hint

A correct answer explains that you keep batch_size=32 in the DataLoader (not scale it), because Accelerator treats it as the per-GPU batch size by default and does not divide it across processes. The global batch therefore grows to 128 (32 × 4), and each epoch has a quarter as many optimizer steps. That is why changing only the launch command can be insufficient: the code runs unchanged, but the optimization does not. You either keep the learning rate and accept the larger global batch, or rescale it (linearly or by sqrt(num_gpus), depending on your scaling rule). Sharding and shuffling the data across ranks is handled by prepare(); call set_seed before creating the loader if you need reproducible runs.
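The arithmetic behind this answer can be worked through in a few lines. The helper below is a hypothetical illustration of the two common scaling rules, not part of Accelerate; the base learning rate of 1e-3 is an assumed value.

```python
import math


def scaled_learning_rate(base_lr, num_processes, rule="linear"):
    """Return a learning rate adjusted for a larger global batch.

    Hypothetical helper: 'linear' multiplies by the number of processes,
    'sqrt' multiplies by its square root.
    """
    if rule == "linear":
        return base_lr * num_processes
    if rule == "sqrt":
        return base_lr * math.sqrt(num_processes)
    raise ValueError(f"unknown scaling rule: {rule}")


per_gpu_batch = 32
num_gpus = 4
base_lr = 1e-3  # assumed single-GPU learning rate

# The DataLoader batch size is per process, so the global batch grows.
global_batch = per_gpu_batch * num_gpus

linear_lr = scaled_learning_rate(base_lr, num_gpus, "linear")
sqrt_lr = scaled_learning_rate(base_lr, num_gpus, "sqrt")

print(f"global batch: {global_batch}")
print(f"linear rule: {linear_lr:.4f}, sqrt rule: {sqrt_lr:.4f}")
```

Which rule works better depends on the model and optimizer, so treat either result as a starting point for tuning.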

VERSION In transformers < 5.0, Accelerate was a separate optional dependency. In transformers 5.5.x (April 2026), it's bundled and the API is stable. Note that the number of processes is not an argument to Accelerator(): it is detected from environment variables set by `accelerate launch` (through `--num_processes` or the saved `accelerate config`). Always rely on the launcher, not constructor arguments.

Learn how to save and resume training with Accelerator's checkpointing API, which correctly handles distributed state and enables fault-tolerant training on multi-node clusters.
