accelerate launch: distributed run
Why this matters
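Training large transformer models on a single device quickly becomes prohibitively slow, so distributed training is how models actually get shipped. The Accelerate library abstracts away device placement, gradient accumulation, mixed precision, and synchronization, and `accelerate launch` starts your script with the right processes and environment, so you write the training loop once and run it on whatever hardware you have.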
Explanation
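What it is: `accelerate launch` is a command-line launcher that runs a training script in a distributed setting. The script has to be written against the `Accelerator` API (a handful of changed lines), but after that it needs no further changes to move between hardware setups (single GPU, multi-GPU, TPU, mixed). The launcher reads the configuration saved by `accelerate config` or passed as command-line flags such as `--num_processes` and `--mixed_precision`, falls back to defaults based on the visible hardware when neither is given, sets up the environment, and starts the worker processes.
How it works mechanically: When you run `accelerate launch script.py`, the launcher (1) resolves the configuration and the devices to use, (2) sets the environment variables that describe the run (rank, world size, master address), and (3) spawns worker processes, typically one per GPU, each running your script. Inside each process, your script (4) creates an `Accelerator` and passes the model, optimizer, and dataloader through `accelerator.prepare()`, which places them on the right device, wraps the model for distributed training, and shards the dataloader, and (5) calls `accelerator.backward(loss)` instead of `loss.backward()`. Gradients are averaged across workers during the backward pass, over a communication backend such as NCCL or gloo, so every worker applies the same update in `optimizer.step()`. The script contains no explicit distributed logic: the `Accelerator` handles it.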
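To make the gradient synchronization step concrete, here is a minimal pure-Python simulation of data-parallel gradient averaging. It is not how Accelerate or PyTorch implement it; `local_gradient` and `all_reduce_mean` are hypothetical helpers, and the toy loss and data are assumptions chosen so the arithmetic is easy to follow.

```python
# Pure-Python simulation of data-parallel gradient averaging.
# Illustration only: real all-reduce runs over NCCL/gloo between processes.

def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
num_workers = 2

# Each worker sees a disjoint, equal-sized slice of the batch.
shards = [data[i::num_workers] for i in range(num_workers)]

w = 0.0  # every worker starts from the same weight
local_grads = [local_gradient(w, shard) for shard in shards]
synced_grads = all_reduce_mean(local_grads)
full_batch_grad = local_gradient(w, data)

print("local gradients: ", local_grads)      # differ per worker
print("after all-reduce:", synced_grads)     # identical on every worker
print("full-batch grad: ", full_batch_grad)  # matches the synced value
```

With equal-sized shards, the averaged gradient equals the gradient of the full batch, which is why every worker can apply the same update and keep its model replica identical to the others.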
When to use it: Multi-GPU fine-tuning, distributed training on cloud clusters, mixed-precision training, or any scenario where you want code portability across hardware configurations.
Analogy
Imagine writing a book chapter that could be split across multiple writers (workers) working in parallel, with one person coordinating edits. `accelerate launch` is that coordinator: it splits the work, ensures everyone stays synchronized, and recombines the results without your script needing to know about the other writers.
Code
import torch
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from accelerate import Accelerator

# The Accelerator picks up the device and distributed setup from the launcher.
accelerator = Accelerator()

# Load a small pretrained model with a two-class classification head.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
optimizer = Adam(model.parameters(), lr=2e-5)

# Toy dataset: two sentences with sentiment labels (1 = positive, 0 = negative).
data = ["This is great!", "This is terrible!"]
labels = [1, 0]
encoded = tokenizer(data, padding=True, truncation=True, return_tensors="pt")
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
dataset = TensorDataset(input_ids, attention_mask, torch.tensor(labels))
dataloader = DataLoader(dataset, batch_size=2)

# prepare() returns wrapped objects; keep the returned values.
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)

model.train()
num_epochs = 1
for epoch in range(num_epochs):
    total_loss = 0
    for batch_idx, (input_ids_batch, attention_mask_batch, labels_batch) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(
            input_ids=input_ids_batch,
            attention_mask=attention_mask_batch,
            labels=labels_batch
        )
        loss = outputs.loss
        # Use accelerator.backward() instead of loss.backward().
        accelerator.backward(loss)
        optimizer.step()
        total_loss += loss.item()
        if batch_idx == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
print(f"Training complete. Final loss: {total_loss:.4f}")

Output:
Epoch 0, Batch 0, Loss: 0.6931
Training complete. Final loss: 0.6931
What just happened?
The code loaded a pretrained DistilBERT model with a two-class classification head, passed it through `accelerator.prepare()` (which moves it to the appropriate device and enables gradient synchronization), then ran one training epoch. Even though this runs on a single GPU here, the same script runs unchanged on 8 GPUs or a TPU pod: the Accelerator handles device placement and communication internally.
Common gotcha
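Developers forget that `accelerator.prepare()` returns wrapped objects, so you must assign the result: `model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)`. If you drop the assignment, the training loop keeps using the originals: the dataloader still yields CPU tensors and is not sharded across processes, and the model is not the wrapped one that synchronizes gradients. On a GPU you will usually hit a device-mismatch error; in a multi-process run, the worse outcome is that each worker quietly trains its own copy on the full dataset without raising an error.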
Error recovery
RuntimeError: Expected all tensors to be on the same device
CUDA out of memory
NotImplementedError: 'spawn' start method not supported
Experienced dev note
The biggest time sink isn't learning Accelerator: it's debugging distributed training bugs that don't happen on a single GPU. Before deploying to a cluster, test with two processes on a machine that has at least two GPUs, for example `accelerate launch --num_processes 2 script.py`. Also, mixed precision (`mixed_precision='bf16'`) is nearly free performance on hardware that supports it: plan to enable it in production, but verify convergence first because some models are numerically sensitive.
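To see why convergence is worth checking, it helps to look at how coarse bfloat16 is. The sketch below emulates bf16 rounding with the standard library only; `to_bf16` is a hypothetical helper, it ignores NaN and infinity, and it illustrates the number format rather than what Accelerate does internally (mixed-precision training normally keeps the parameters themselves in fp32).

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the top 16 bits of a float32: same exponent range,
    but only 7 stored mantissa bits. NaN/inf are not handled here.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)           # ties go to even
    bits = ((bits + rounding_bias) >> 16) << 16           # drop low 16 bits
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

# Near 1.0, neighbouring bf16 values are 2**-7 = 0.0078125 apart.
print(to_bf16(1.001))  # rounds back down to 1.0
print(to_bf16(1.01))   # rounds to 1.0078125

# A thousand small updates applied directly in bf16 never move the value.
w_bf16 = 1.0
w_full = 1.0
for _ in range(1000):
    w_bf16 = to_bf16(w_bf16 + 0.001)
    w_full = w_full + 0.001

print(w_bf16)  # still 1.0
print(w_full)  # close to 2.0
```

Any computation that depends on differences smaller than this spacing is a candidate for numerical trouble, which is the kind of sensitivity a short convergence comparison against fp32 will surface.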
Check your understanding
Why does your training script need to call `accelerator.backward(loss)` instead of `loss.backward()`, and what goes wrong if you mix the two approaches in the same loop?
Show answer hint
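A correct answer explains that `accelerator.backward()` is the backend-aware version of `loss.backward()`: it applies the loss scaling needed for fp16 mixed precision, divides the loss by the configured number of gradient accumulation steps, and routes the call to the active backend (for example DeepSpeed). Gradient averaging across workers happens during the backward pass itself, through the wrapped model. If you mix the two calls in one loop, some gradients are scaled and normalized while others are not, so the accumulated gradient is wrong, fp16 runs can underflow or overflow, and some backends break outright.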
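One concrete piece of bookkeeping that `accelerator.backward()` takes care of is dividing the loss by the number of gradient accumulation steps when that is configured on the `Accelerator`. The pure-Python sketch below shows what happens to the accumulated gradient when only some micro-batches go through that scaling; `grad`, the toy loss, and the data are assumptions for illustration, not library code.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

micro_batches = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
accumulation_steps = len(micro_batches)
w = 0.0

# Consistent: every micro-batch gradient is scaled by 1/accumulation_steps,
# so the accumulated value is the mean gradient over all micro-batches.
consistent = sum(grad(w, x, y) / accumulation_steps for x, y in micro_batches)

# Mixed: the first two micro-batches take the scaled path, the last two
# use an unscaled backward call, so they contribute at full weight.
mixed = sum(grad(w, x, y) / accumulation_steps for x, y in micro_batches[:2])
mixed += sum(grad(w, x, y) for x, y in micro_batches[2:])

print("consistent:", consistent)
print("mixed:     ", mixed)
```

The mixed version produces a gradient several times larger than intended, which behaves like an unplanned learning-rate increase on part of the data.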