How to use gradient accumulation for fine-tuning
Quick answer
Use gradient accumulation during fine-tuning to simulate a larger batch size by accumulating gradients over multiple smaller batches before performing an optimizer step. This technique lets you train large models on limited GPU memory while preserving the effective batch size and model quality.
Prerequisites
- Python 3.8+
- PyTorch 1.12+
- Transformers library (pip install transformers)
- Access to a GPU with CUDA
- Basic knowledge of model fine-tuning
Setup
Install the necessary libraries and set environment variables for GPU usage.
pip install torch transformers
Step by step
This example shows how to implement gradient accumulation in PyTorch for fine-tuning a Hugging Face transformer model. The key is to accumulate gradients over accumulation_steps batches before calling optimizer.step().
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.train()

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy data (replace with a real DataLoader)
texts = ["Hello world!", "Gradient accumulation example."] * 32
labels = torch.tensor([0, 1] * 32)
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Hyperparameters
batch_size = 4
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps = 16

# Simulate batches
num_batches = len(texts) // batch_size

for epoch in range(1):
    optimizer.zero_grad()
    for i in range(num_batches):
        batch_inputs = {k: v[i*batch_size:(i+1)*batch_size].to(device) for k, v in inputs.items()}
        batch_labels = labels[i*batch_size:(i+1)*batch_size].to(device)

        outputs = model(**batch_inputs, labels=batch_labels)
        loss = outputs.loss / accumulation_steps  # normalize so accumulated gradients match a full batch
        loss.backward()  # accumulate gradients

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()       # update weights
            optimizer.zero_grad()  # reset accumulated gradients
            print(f"Step {i+1}, loss: {loss.item()*accumulation_steps:.4f}")
Output
Step 4, loss: 0.6931
Step 8, loss: 0.6931
Step 12, loss: 0.6931
Step 16, loss: 0.6931
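One pitfall in the loop above: if the number of batches is not divisible by accumulation_steps, the gradients from the final partial window are never applied. A minimal sketch of the fix, using a toy nn.Linear model in place of BERT (the step_count variable is illustrative, not part of the original example), is to also step on the very last batch:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

num_batches = 10          # NOT divisible by accumulation_steps
accumulation_steps = 4
data = torch.randn(num_batches, 4, 8)
targets = torch.randint(0, 2, (num_batches, 4))

step_count = 0
optimizer.zero_grad()
for i in range(num_batches):
    loss = loss_fn(model(data[i]), targets[i]) / accumulation_steps
    loss.backward()
    # Step on every full accumulation window AND on the final partial window
    if (i + 1) % accumulation_steps == 0 or (i + 1) == num_batches:
        optimizer.step()
        optimizer.zero_grad()
        step_count += 1
```

With 10 batches and accumulation_steps = 4, this performs optimizer steps after batches 4, 8, and 10, so no gradients are silently discarded. Note that the last step averages over fewer samples; for strict equivalence you could rescale the loss in the final window, but in practice the effect is usually negligible.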
Common variations
- Use the accelerate or deepspeed libraries for built-in gradient accumulation support.
- Adjust accumulation_steps based on GPU memory constraints.
- Apply gradient clipping before optimizer.step() to stabilize training.
- Use mixed precision training (AMP) with gradient accumulation for faster fine-tuning.
Troubleshooting
- If loss does not decrease, verify that the loss is divided by accumulation_steps before loss.backward().
- If GPU memory is still insufficient, reduce batch_size or increase accumulation_steps.
- Ensure optimizer.zero_grad() is called only after the optimizer step, not after every batch.
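A quick way to check that your accumulation logic is correct is to compare accumulated gradients against a single full-batch backward pass; with the loss divided by the number of micro-batches, they should match. A minimal sketch with a toy nn.Linear model (the variable names are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()  # reduction="mean" (the default)
x = torch.randn(16, 8)
y = torch.randint(0, 2, (16,))

# Gradient from one full batch of 16
model.zero_grad()
loss_fn(model(x), y).backward()
full_grads = [p.grad.clone() for p in model.parameters()]

# Gradient accumulated over 4 micro-batches of 4, with loss / 4
k = 4
model.zero_grad()
for i in range(k):
    xb, yb = x[i*4:(i+1)*4], y[i*4:(i+1)*4]
    (loss_fn(model(xb), yb) / k).backward()
acc_grads = [p.grad.clone() for p in model.parameters()]

grads_match = all(torch.allclose(g1, g2, atol=1e-6)
                  for g1, g2 in zip(full_grads, acc_grads))
print(grads_match)
```

If grads_match is False, the usual culprits are a missing division by the accumulation count or a stray optimizer.zero_grad() inside the accumulation window.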
Key takeaways
- Gradient accumulation simulates large batch sizes by summing gradients over multiple smaller batches before updating model weights.
- Always divide the loss by the number of accumulation steps to keep gradient scale consistent.
- Call optimizer.step() and optimizer.zero_grad() only after accumulating gradients for the specified number of steps.
- Adjust accumulation steps based on your GPU memory to balance training speed and resource constraints.