Code Advanced hard · 8 min

accelerate launch: distributed run

What you will learn

Use `accelerate launch` to automatically handle distributed training across GPUs, TPUs, or mixed hardware without rewriting your training loop.

Why this matters

Training large transformer models on a single device becomes prohibitively slow: distributed training is how you actually ship models. `accelerate launch`, together with the `Accelerator` API used inside your script, abstracts away device placement, gradient accumulation, mixed precision, and synchronization so you write once and run everywhere.

Skip if: Don't use `accelerate launch` if you're fine-tuning a tiny model on a single GPU with no memory constraints, or if you're doing inference-only work. You also don't need it for simple single-script experiments where hardcoding device placement is acceptable.

Explanation

What it is: `accelerate launch` is a command-line launcher that runs a training script written against the `Accelerator` API in a distributed setting, with no further code changes between hardware setups. It reads your configuration (from `accelerate config` or command-line flags), sets up the environment for your hardware (single GPU, multi-GPU, TPU), and launches the worker processes.

How it works mechanically: When you run `accelerate launch script.py`, the launcher (1) reads the launch configuration and the available devices, (2) sets the environment variables each worker needs, and (3) spawns worker processes, typically one per GPU. Inside each process, your script (4) creates an `Accelerator` object and passes the model, optimizer, and dataloader through `accelerator.prepare()`, and (5) gradients are averaged across workers during the backward pass, before `optimizer.step()`, with communication handled by a backend such as NCCL or gloo. The training loop itself stays device-agnostic: it calls `accelerator.backward(loss)` instead of `loss.backward()`, and the Accelerator handles the rest.
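To make the synchronization step concrete, here is a minimal pure-Python sketch of data-parallel gradient averaging. It is not Accelerate's implementation (real runs use an all-reduce over a backend such as NCCL or gloo). The one-parameter model and the helper names `local_gradient` and `all_reduce_mean` are hypothetical, chosen only to show that averaging the gradients of equal-sized shards reproduces the full-batch gradient.

```python
# Toy data-parallel step: model y = w * x, loss = mean squared error.
# All names here are illustrative; real training uses torch.distributed.

def local_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the mean."""
    return sum(values) / len(values)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Split the batch into two equal shards, one per simulated worker.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
per_worker = [local_gradient(w, sx, sy) for sx, sy in shards]

synced_grad = all_reduce_mean(per_worker)
full_batch_grad = local_gradient(w, xs, ys)

print(per_worker, synced_grad, full_batch_grad)
```

Both `synced_grad` and `full_batch_grad` come out to -22.5, which is why every worker can apply the same update after synchronization. The equality relies on the shards being the same size.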

When to use it: Multi-GPU fine-tuning, distributed training on cloud clusters, mixed-precision training, or any scenario where you want code portability across hardware configurations.

Analogy

Imagine writing a book chapter that could be split across multiple writers (workers) working in parallel, with one person coordinating edits. `accelerate launch` is that coordinator: it splits the work, ensures everyone stays synchronized, and recombines the results without your script needing to know about the other writers.

Code

python

import torch
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from accelerate import Accelerator

# Picks up the device and distributed settings provided by `accelerate launch`.
accelerator = Accelerator()

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

optimizer = Adam(model.parameters(), lr=2e-5)

data = ["This is great!", "This is terrible!"]
labels = [1, 0]

encoded = tokenizer(data, padding=True, truncation=True, return_tensors="pt")
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]

dataset = TensorDataset(input_ids, attention_mask, torch.tensor(labels))
dataloader = DataLoader(dataset, batch_size=2)

# prepare() returns wrapped objects: keep using the returned values from here on.
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)

model.train()
num_epochs = 1

for epoch in range(num_epochs):
    total_loss = 0
    for batch_idx, (input_ids_batch, attention_mask_batch, labels_batch) in enumerate(dataloader):
        optimizer.zero_grad()
        
        outputs = model(
            input_ids=input_ids_batch,
            attention_mask=attention_mask_batch,
            labels=labels_batch
        )
        loss = outputs.loss
        
        accelerator.backward(loss)
        optimizer.step()
        
        total_loss += loss.item()
        
        if batch_idx == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

print(f"Training complete. Final loss: {total_loss:.4f}")

Output (exact loss values will vary between runs, since the classification head is randomly initialized)

Epoch 0, Batch 0, Loss: 0.6931
Training complete. Final loss: 0.6931

What just happened?

The code defined a simple sentiment classification model, wrapped it with `accelerator.prepare()` (which moves it to the appropriate device and enables gradient synchronization), then ran one training epoch. Even though this runs on a single GPU here, the exact same code works unchanged on 8 GPUs or a TPU pod: the Accelerator handles device placement and communication internally.
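A rough picture of what the prepared dataloader does in a multi-process run: each process receives a different subset of the batches. The pure-Python sketch below assumes a simple round-robin assignment and uses a hypothetical helper, `batches_for_rank`. Accelerate's actual sharding logic also handles uneven dataset lengths and other options that are ignored here.

```python
def batches_for_rank(batches, rank, world_size):
    """Batches one process would see under simple round-robin sharding."""
    return batches[rank::world_size]

samples = list(range(12))
batch_size = 2
batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

world_size = 3  # number of simulated processes
for rank in range(world_size):
    print(rank, batches_for_rank(batches, rank, world_size))
```

Every batch goes to exactly one process, so each process runs fewer steps per epoch while the per-process batch size stays at `batch_size`.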

Common gotcha

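Developers forget that `accelerator.prepare()` returns wrapped objects: you must assign the result: `model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)`. If you forget the assignment, you keep using the originals, which lack the distributed wrapping: the dataloader is not sharded across processes and does not move batches to the right device, and the model is not wrapped for gradient synchronization. The result is either a device mismatch error or, worse, replicas that train independently without raising an error.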

Error recovery

RuntimeError: Expected all tensors to be on the same device

You passed raw tensors to the model instead of letting Accelerator handle device placement. Always call `accelerator.prepare()` on your dataloader and model, or manually move tensors using `batch = {k: v.to(accelerator.device) for k, v in batch.items()}`.

CUDA out of memory

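Your per-device batch size is too large. Accelerate doesn't divide the batch size across workers: if you specify batch_size=32 on 8 GPUs, each GPU sees 32 samples (not 4), for an effective batch of 256. Reduce batch_size, and if you need to keep the effective batch size, compensate with gradient accumulation, for example `Accelerator(gradient_accumulation_steps=2)` combined with the `accelerator.accumulate(model)` context manager. Accumulation on its own does not lower per-step memory use.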
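The batch-size arithmetic is easy to get wrong, so here is a small sketch of it. The helper names `global_batch_size` and `per_device_batch_for` are hypothetical and not part of Accelerate; they only encode the multiplication described above.

```python
def global_batch_size(per_device_batch, num_processes, grad_accum_steps=1):
    """Samples contributing to one optimizer step across all workers."""
    return per_device_batch * num_processes * grad_accum_steps

def per_device_batch_for(target_global, num_processes, grad_accum_steps=1):
    """Per-device batch size needed to hit a target global batch size."""
    denom = num_processes * grad_accum_steps
    if target_global % denom != 0:
        raise ValueError("target_global must be divisible by "
                         "num_processes * grad_accum_steps")
    return target_global // denom

print(global_batch_size(32, 8))                      # 256
print(per_device_batch_for(32, 8))                   # 4
print(global_batch_size(16, 8, grad_accum_steps=2))  # 256
```

The last line shows the memory-saving trade: halving the per-device batch to 16 and accumulating over 2 steps keeps the same 256-sample optimizer step.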

NotImplementedError: 'spawn' start method not supported

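You're on a system where the multiprocessing `spawn` start method isn't available. Either call `torch.multiprocessing.set_start_method('fork')` before creating the Accelerator, or use `accelerate config` to review and adjust the launch settings. Be aware that `fork` generally cannot be combined with CUDA once CUDA has been initialized in the parent process.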

Experienced dev note

The biggest time sink isn't learning Accelerator: it's debugging distributed training bugs that don't happen on a single GPU. Always test locally with `accelerate launch --num_processes 2 script.py` on a machine with at least two GPUs before deploying to a cluster; this does not simulate GPUs you don't have, so for a CPU-only check look at Accelerate's `debug_launcher` helper instead. Also, mixed precision (`mixed_precision='bf16'`) is nearly free performance on hardware that supports bf16: enable it in production, but test convergence first because some models are numerically sensitive.
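To see why convergence needs testing under bf16, the sketch below emulates bfloat16 precision in pure Python by keeping only the top 16 bits of a float32. This is a simplification: the helper `to_bf16` is hypothetical and truncates, whereas real hardware typically rounds to nearest, and mixed-precision training normally keeps master weights in float32 to avoid exactly this effect.

```python
import struct

def to_bf16(x):
    """Emulate bfloat16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bf16(1.0 + 0.001))  # 1.0: the small update is lost
print(to_bf16(1.0 + 0.01))   # 1.0078125: coarse, but it registers

# A weight stored purely in bf16 never moves under tiny updates.
w = 1.0
for _ in range(100):
    w = to_bf16(w + 0.001)
print(w)  # 1.0
```

With only 8 bits of significand precision, updates much smaller than the weight itself vanish, which is one way a model can turn out to be numerically sensitive.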

Check your understanding

Why does your training script need to call `accelerator.backward(loss)` instead of `loss.backward()`, and what goes wrong if you mix the two approaches in the same loop?

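Answer hint

A correct answer explains that `accelerator.backward()` applies whatever the current configuration requires before running the backward pass: loss scaling for fp16 mixed precision, dividing the loss by the number of gradient accumulation steps, and delegating to backends such as DeepSpeed. Gradient synchronization itself happens during the backward pass, through the distributed model wrapper that `prepare()` returns. If you mix the two calls, a plain `loss.backward()` skips that scaling, so gradients in the same loop are scaled inconsistently and training can become unstable or silently wrong.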
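One concrete piece of bookkeeping that `accelerator.backward()` takes care of when gradient accumulation is configured is dividing the loss by the number of accumulation steps. The pure-Python sketch below, using a hypothetical one-parameter model and helper `grad_mse`, shows why that matters: without the division, the accumulated gradient is too large by a factor equal to the number of steps.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
micro_batches = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
steps = len(micro_batches)

scaled = 0.0    # loss divided by `steps` before each backward pass
unscaled = 0.0  # plain backward pass on every micro-batch
for xs, ys in micro_batches:
    g = grad_mse(w, xs, ys)
    scaled += g / steps
    unscaled += g

full = grad_mse(w, [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(scaled, unscaled, full)
```

Here `scaled` matches the full-batch gradient (-22.5), while `unscaled` is twice as large (-45.0), which is the kind of silent mismatch that mixing the two backward calls can introduce.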

VERSION The default for `device_map` in `from_pretrained()` is reported to differ between transformers 4.x (omitting it left the model on CPU) and 5.5.x (defaults to 'auto'); confirm the behavior against the release notes for your installed version. When using Accelerator, make sure the model is not loaded with an automatic device map: load it on CPU, then let Accelerator handle device placement via `prepare()`. Mixing an explicit `device_map` with Accelerator causes device mismatch errors.

Once you've mastered distributed training, explore gradient checkpointing and activation recomputation to fit larger models into the same GPU memory by trading compute for memory.
