How to · Intermediate · 4 min read

How to use Accelerate for distributed fine-tuning

Quick answer
Use the Accelerate library to simplify distributed fine-tuning: it handles device placement, mixed precision, and multi-GPU gradient synchronization automatically. Initialize an Accelerator, pass your model, optimizer, and dataloaders through accelerator.prepare(), then run your training loop as usual, replacing loss.backward() with accelerator.backward(loss). This abstracts away the distributed setup, enabling scalable fine-tuning with minimal code changes.

PREREQUISITES

  • Python 3.8+
  • pip install accelerate transformers datasets torch
  • Access to multiple GPUs or nodes for distributed training

Setup

Install the required libraries and configure accelerate for your environment. Run accelerate config to set up distributed training parameters like number of processes and mixed precision.

bash
pip install accelerate transformers datasets torch
accelerate config
output
Welcome to Accelerate configuration!

How many different machines will you use for distributed training? [1]: 1
How many processes in total will you use? [1]: 2
Do you want to use FP16 mixed precision? (yes/NO) [NO]: yes

Config saved to ~/.cache/huggingface/accelerate/default_config.yaml

Step by step

This example shows distributed fine-tuning of a Hugging Face transformer model using Accelerate. It prepares the model, optimizer, and dataloaders, then runs a simple training loop.

python
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
import torch

# Initialize accelerator
accelerator = Accelerator()

# Load dataset and tokenizer
raw_datasets = load_dataset('glue', 'mrpc')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length', max_length=128)

# Prepare datasets: drop the raw text columns and return PyTorch tensors,
# otherwise the default DataLoader collate function cannot batch the examples
encoded_datasets = raw_datasets.map(tokenize_function, batched=True)
encoded_datasets = encoded_datasets.remove_columns(['sentence1', 'sentence2', 'idx'])
encoded_datasets.set_format('torch')
train_dataset = encoded_datasets['train']

# Create dataloader
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

# Load model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Prepare everything with accelerator
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

# Training loop
model.train()
for epoch in range(3):
    for batch in train_dataloader:
        outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['label'])
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    accelerator.print(f"Epoch {epoch + 1} completed")  # prints once, not once per process
output
Epoch 1 completed
Epoch 2 completed
Epoch 3 completed

Common variations

  • Use the accelerate launch CLI to run your script across multiple GPUs or nodes without manual setup.
  • Enable mixed precision by passing mixed_precision='fp16' to Accelerator() or via accelerate config.
  • Add gradient accumulation by passing gradient_accumulation_steps to Accelerator() and wrapping each step in accelerator.accumulate(model).
  • Use different models by changing the Hugging Face model checkpoint.
python
from accelerate import Accelerator

# Enable mixed precision
accelerator = Accelerator(mixed_precision='fp16')

# Or run script with CLI:
# accelerate launch train_script.py

Troubleshooting

  • If you see CUDA out of memory errors, reduce batch size or enable gradient accumulation.
  • Ensure accelerate config matches your hardware setup (number of GPUs, nodes).
  • Check that all inputs to the model are on the correct device by using accelerator.prepare().
  • Use accelerator.wait_for_everyone() to synchronize processes if needed.

Key Takeaways

  • Use Hugging Face Accelerate to abstract distributed training complexities and scale fine-tuning easily.
  • Always prepare your model, optimizer, and dataloaders with accelerator.prepare() before training.
  • Configure your environment with accelerate config to match your hardware and precision needs.
  • Run your training script with accelerate launch for seamless multi-GPU or multi-node execution.
  • Troubleshoot memory and device placement issues by adjusting batch size and verifying accelerator usage.
Verified 2026-04 · bert-base-uncased, AutoModelForSequenceClassification