How to use Accelerate for distributed fine-tuning
Quick answer
Use the Accelerate library to simplify distributed fine-tuning by managing device placement, mixed precision, and multi-GPU synchronization automatically. Initialize Accelerator, prepare your model, optimizer, and dataloaders with accelerator.prepare(), then run your training loop as usual. This abstracts away complex distributed setup, enabling scalable fine-tuning with minimal code changes.

Prerequisites
- Python 3.8+
- pip install accelerate transformers datasets torch
- Access to multiple GPUs or nodes for distributed training
Setup
Install the required libraries and configure accelerate for your environment. Run accelerate config to set up distributed training parameters like number of processes and mixed precision.
pip install accelerate transformers datasets torch
accelerate config output
Welcome to Accelerate configuration!

How many different machines will you use for distributed training? [1]: 1

How many processes in total will you use? [1]: 2

Do you want to use FP16 mixed precision? (yes/NO) [NO]: yes

Config saved to ~/.cache/huggingface/accelerate/default_config.yaml
Step by step
This example shows distributed fine-tuning of a Hugging Face transformer model using Accelerate. It prepares the model, optimizer, and dataloaders, then runs a simple training loop.
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
import torch

# Initialize accelerator
accelerator = Accelerator()

# Load dataset and tokenizer
raw_datasets = load_dataset('glue', 'mrpc')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'],
                     truncation=True, padding='max_length', max_length=128)

# Prepare datasets: tokenize, drop the raw string columns (the default collate
# cannot batch them), rename 'label' to 'labels' as the model expects, and
# return PyTorch tensors
encoded_datasets = raw_datasets.map(tokenize_function, batched=True)
encoded_datasets = encoded_datasets.remove_columns(['sentence1', 'sentence2', 'idx'])
encoded_datasets = encoded_datasets.rename_column('label', 'labels')
encoded_datasets.set_format('torch')
train_dataset = encoded_datasets['train']

# Create dataloader
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

# Load model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Prepare everything with accelerator
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

# Training loop
model.train()
for epoch in range(3):
    for batch in train_dataloader:
        outputs = model(input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels'])
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    print(f"Epoch {epoch + 1} completed")

output
Epoch 1 completed
Epoch 2 completed
Epoch 3 completed
Common variations
- Use the accelerate launch CLI to run your script across multiple GPUs or nodes without manual setup.
- Enable mixed precision by passing mixed_precision='fp16' to Accelerator() or via accelerate config.
- Switch to asynchronous dataloading or gradient accumulation by adjusting the dataloader and training loop.
- Use different models by changing the Hugging Face model checkpoint.
from accelerate import Accelerator
# Enable mixed precision
accelerator = Accelerator(mixed_precision='fp16')  # the old fp16=True argument was removed
# Or run script with CLI:
# accelerate launch train_script.py

Troubleshooting
- If you see CUDA out of memory errors, reduce batch size or enable gradient accumulation.
- Ensure accelerate config matches your hardware setup (number of GPUs, nodes).
- Check that all inputs to the model are on the correct device by using accelerator.prepare().
- Use accelerator.wait_for_everyone() to synchronize processes if needed.
Key Takeaways
- Use Hugging Face Accelerate to abstract distributed training complexities and scale fine-tuning easily.
- Always prepare your model, optimizer, and dataloaders with accelerator.prepare() before training.
- Configure your environment with accelerate config to match your hardware and precision needs.
- Run your training script with accelerate launch for seamless multi-GPU or multi-node execution.
- Troubleshoot memory and device placement issues by adjusting batch size and verifying accelerator usage.