Code Advanced hard · 8 min

What accelerate adds: multi-GPU, multi-node

What you will learn
The Hugging Face Accelerate library abstracts away distributed training complexity, letting you write single-GPU code that runs on multi-GPU and multi-node setups without rewrites.

Why this matters

In production, you'll train on 8 GPUs or 16 nodes. Without Accelerate, you rewrite your entire training loop for DistributedDataParallel, handle gradient accumulation per device, manage device placement, and debug rank-specific code. Accelerate eliminates this boilerplate. Your single-GPU training script becomes multi-GPU with ~3 lines changed.

Skip if: If you're using a high-level framework like Hugging Face Trainer (which wraps Accelerate internally), you don't need Accelerate directly: the Trainer handles distribution for you. Also skip Accelerate if your workload is inference-only on a single GPU; the overhead isn't worth it.

Explanation

What it is: Accelerate is a library that handles the boilerplate of distributed training (multi-GPU on one machine, multi-node across machines). It detects your hardware setup automatically and distributes your model, data, and optimizer across devices without you manually writing DistributedDataParallel or setting up rank-specific logic.

How it works mechanically: You wrap your model, optimizer, and dataloader with Accelerate objects (accelerator.prepare(model, optimizer, train_dataloader)). Accelerate detects the number of GPUs, TPUs, or nodes and automatically shards the model (using FSDP or DDP under the hood). When you call loss.backward(), Accelerate divides the loss by the number of processes, and all gradients are averaged across devices. You call accelerator.backward(loss) instead of loss.backward(), and Accelerate handles synchronization.

When to use it: Use Accelerate when you're writing custom training loops and need to scale from 1 GPU to 8 GPUs or from 1 node to 4 nodes without rewriting. If you're using Trainer or PyTorch Lightning, those already integrate Accelerate internally.

Analogy

Think of Accelerate as a postal service for your tensors. Instead of you manually deciding which letter goes to which mailbox (rank 0, rank 1, etc.) and ensuring all recipients acknowledge receipt, the postal service (Accelerate) figures out the delivery network, handles synchronization, and delivers to all ranks transparently. You just ship the letter.

Code

python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 1)
)

optimizer = AdamW(model.parameters(), lr=1e-4)

X_train = torch.randn(100, 10)
y_train = torch.randn(100, 1)
dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset, batch_size=8)

model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

model.train()
for epoch in range(2):
    total_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        logits = model(X_batch)
        loss = nn.functional.mse_loss(logits, y_batch)
        accelerator.backward(loss)
        optimizer.step()
        total_loss += loss.item()
    if accelerator.is_main_process:
        print(f"Epoch {epoch + 1}, Loss: {total_loss / len(train_loader):.4f}")
Output
Epoch 1, Loss: 0.8234
Epoch 2, Loss: 0.7891

What just happened?

The code created a simple 2-layer neural network, wrapped it with Accelerate using <code>accelerator.prepare()</code>, and ran a 2-epoch training loop. On a single GPU, Accelerate detected the setup and ran normally. If you run this with <code>accelerate launch --multi_gpu --num_processes 4 script.py</code>, Accelerate automatically shards the model across 4 GPUs, divides the batch across devices, and synchronizes gradients after <code>accelerator.backward()</code>. The code itself never changes.

Common gotcha

Developers forget to replace loss.backward() with accelerator.backward(loss). If you use loss.backward() directly, Accelerate doesn't synchronize gradients across devices, and each GPU trains independently with inconsistent weights. Also, gradient accumulation logic changes: with Accelerate, you check if (step + 1) % accumulation_steps == 0, but Accelerate divides the loss by accumulation_steps automatically if you set it during initialization.

Error recovery

RuntimeError: Expected all tensors to be on the same device
Your model was prepared with Accelerate but you're manually moving some tensors to a device. After <code>accelerator.prepare()</code>, let Accelerate handle device placement. Never call <code>.to(device)</code> on prepared objects: Accelerate already placed them.
NCCL timeout or hang during training
A process died or one GPU is slower than others. Check that all GPUs have the same CUDA version and PyTorch build. If one GPU is different (e.g., A100 vs V100), reduce batch size or use gradient accumulation so slower GPUs don't bottleneck. Also verify you called <code>accelerator.backward(loss)</code> not <code>loss.backward()</code>.
Output printed multiple times (once per GPU)
You're printing in the training loop without checking <code>if accelerator.is_main_process</code>. Only the main rank (rank 0) should log. All other ranks run the same code, so they print too unless gated.

Experienced dev note

The critical insight: Accelerate's power isn't in the code you write: it's in the code you *don't* write. Before Accelerate, multi-GPU training meant learning DistributedDataParallel, process groups, rank-specific logic, and debugging why rank 2 crashed but ranks 0, 1, 3 didn't. Accelerate hides all that. But it also means you must trust it: if you add manual .cuda() calls or try to optimize 'manually,' you'll break Accelerate's invariants. Write naive single-GPU code and let Accelerate optimize.

Check your understanding

Your single-GPU training script trains in 10 minutes on 1 A100. You want to scale to 4 A100s using Accelerate. You expect 4x speedup (2.5 minutes). After wrapping with accelerator.prepare() and replacing loss.backward(), you measure 2.8x speedup instead. What likely happened, and why does Accelerate alone not guarantee perfect scaling?

Show answer hint

A correct answer identifies that communication overhead (gradient synchronization across devices) and any GPU memory limitations that force smaller batch sizes per GPU (even though total batch size is larger) reduce scaling efficiency below theoretical max. Accelerate distributes the work but doesn't eliminate the cost of AllReduce operations or memory constraints. You may also need to adjust learning rate or use gradient accumulation: Accelerate doesn't auto-tune hyperparameters.

VERSION In transformers 5.5.x (April 2026), Accelerate 1.2.x removed the deprecated dispatch_model API; use init_empty_weights() and load_checkpoint_and_dispatch() instead for efficient CPU offloading. Also, Accelerate now integrates tightly with torch.distributed.fsdp: prefer FSDP for models >7B parameters rather than DDP.
NEXT

Once you master Accelerate for training, the next step is using it with quantization (BitsAndBytesConfig) to fit massive models into multi-GPU VRAM: this is where Accelerate's device_map='auto' shines for inference.

Community Notes

No notes yetBe the first to share a version-specific fix or tip.