How to use gradient accumulation for fine-tuning
Quick answer
Use gradient accumulation during fine-tuning to simulate a larger batch size by accumulating gradients over multiple smaller batches before performing an optimizer step. This technique lets you train large models on limited GPU memory while preserving the effective batch size and model quality.
Prerequisites
- Python 3.8+
- PyTorch 1.12+
- Transformers library (pip install transformers)
- Access to a GPU with CUDA
- Basic knowledge of model fine-tuning
Setup
Install the necessary libraries and set environment variables for GPU usage.
pip install torch transformers
Step by step
This example shows how to implement gradient accumulation in PyTorch for fine-tuning a Hugging Face transformer model. The key is to accumulate gradients over accumulation_steps batches before calling optimizer.step().
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.train()

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy data (replace with a real DataLoader)
texts = ["Hello world!", "Gradient accumulation example."] * 32
labels = torch.tensor([0, 1] * 32)
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Hyperparameters
batch_size = 4
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps = 16

# Simulate batches
num_batches = len(texts) // batch_size

for epoch in range(1):
    optimizer.zero_grad()
    for i in range(num_batches):
        batch_inputs = {k: v[i*batch_size:(i+1)*batch_size].to(device) for k, v in inputs.items()}
        batch_labels = labels[i*batch_size:(i+1)*batch_size].to(device)

        outputs = model(**batch_inputs, labels=batch_labels)
        loss = outputs.loss / accumulation_steps  # normalize so accumulated gradients match a full batch
        loss.backward()  # accumulate gradients

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()       # update weights
            optimizer.zero_grad()  # reset accumulated gradients
            print(f"Step {i+1}, loss: {loss.item()*accumulation_steps:.4f}")
Output
Step 4, loss: 0.6931
Step 8, loss: 0.6931
Step 12, loss: 0.6931
Step 16, loss: 0.6931
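One pitfall in the loop above: if the number of batches is not divisible by accumulation_steps, the gradients from the final partial window are never applied. A minimal sketch of the fix, using a toy nn.Linear model in place of BERT (the step_count variable is illustrative, not part of the original example), is to also step on the very last batch:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

num_batches = 10          # NOT divisible by accumulation_steps
accumulation_steps = 4
data = torch.randn(num_batches, 4, 8)
targets = torch.randint(0, 2, (num_batches, 4))

step_count = 0
optimizer.zero_grad()
for i in range(num_batches):
    loss = loss_fn(model(data[i]), targets[i]) / accumulation_steps
    loss.backward()
    # Step on every full accumulation window AND on the final partial window
    if (i + 1) % accumulation_steps == 0 or (i + 1) == num_batches:
        optimizer.step()
        optimizer.zero_grad()
        step_count += 1
```

With 10 batches and accumulation_steps = 4, this performs optimizer steps after batches 4, 8, and 10, so no gradients are silently discarded. Note that the last step averages over fewer samples; for strict equivalence you could rescale the loss in the final window, but in practice the effect is usually negligible.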
Common variations
- Use the accelerate or deepspeed libraries for built-in gradient accumulation support.
- Adjust accumulation_steps based on GPU memory constraints.
- Apply gradient clipping before optimizer.step() to stabilize training.
- Use mixed precision training (AMP) with gradient accumulation for faster fine-tuning.
Troubleshooting
- If loss does not decrease, verify that the loss is divided by accumulation_steps before loss.backward().
- If GPU memory is still insufficient, reduce batch_size or increase accumulation_steps.
- Ensure optimizer.zero_grad() is called only after the optimizer step, not after every batch.
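A quick way to check that your accumulation logic is correct is to compare accumulated gradients against a single full-batch backward pass; with the loss divided by the number of micro-batches, they should match. A minimal sketch with a toy nn.Linear model (the variable names are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()  # reduction="mean" (the default)
x = torch.randn(16, 8)
y = torch.randint(0, 2, (16,))

# Gradient from one full batch of 16
model.zero_grad()
loss_fn(model(x), y).backward()
full_grads = [p.grad.clone() for p in model.parameters()]

# Gradient accumulated over 4 micro-batches of 4, with loss / 4
k = 4
model.zero_grad()
for i in range(k):
    xb, yb = x[i*4:(i+1)*4], y[i*4:(i+1)*4]
    (loss_fn(model(xb), yb) / k).backward()
acc_grads = [p.grad.clone() for p in model.parameters()]

grads_match = all(torch.allclose(g1, g2, atol=1e-6)
                  for g1, g2 in zip(full_grads, acc_grads))
print(grads_match)
```

If grads_match is False, the usual culprits are a missing division by the accumulation count or a stray optimizer.zero_grad() inside the accumulation window.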
Key takeaways
- Gradient accumulation simulates large batch sizes by summing gradients over multiple smaller batches before updating model weights.
- Always divide the loss by the number of accumulation steps to keep gradient scale consistent.
- Call optimizer.step() and optimizer.zero_grad() only after accumulating gradients for the specified number of steps.
- Adjust accumulation steps based on your GPU memory to balance training speed and resource constraints.