How to reduce memory usage during fine-tuning
Quick answer
To reduce memory usage during fine-tuning, use techniques such as gradient checkpointing (trading compute for memory), mixed precision training (e.g., FP16) to shrink tensor sizes, and parameter-efficient fine-tuning methods such as LoRA or prefix tuning that update only a small fraction of the parameters. Together these techniques can substantially cut GPU memory requirements with little or no loss in model quality.

Prerequisites

- Python 3.8+
- PyTorch or TensorFlow installed
- Access to a GPU with CUDA
- `pip install transformers>=4.30`
- `pip install accelerate`
Setup
Install the necessary libraries for fine-tuning and memory optimization:
- `transformers` for the model and tokenizer
- `accelerate` for efficient training and mixed precision

```
pip install transformers accelerate
```

Step by step
This example shows how to enable gradient checkpointing and mixed precision while fine-tuning a Hugging Face `transformers` model to reduce memory usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "gpt2"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Enable gradient checkpointing to save memory
model.gradient_checkpointing_enable()

# Prepare a tiny dummy dataset: one dict per example, with labels for the causal LM loss
texts = ["Hello world!", "Fine-tuning with less memory."]
encodings = tokenizer(texts, padding=True, truncation=True)
train_dataset = [
    {"input_ids": ids, "attention_mask": mask, "labels": ids}
    for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])
]

# Set up training arguments with mixed precision (fp16 requires a CUDA GPU)
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    fp16=True,  # enable mixed precision
    logging_steps=1,
    save_strategy="no",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Start training
trainer.train()
```

Output
```
***** Running training *****
  Num examples = 2
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient checkpointing enabled
Mixed precision enabled (fp16)
...
Training completed.
```
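The fp16 saving is easy to verify directly: a half-precision tensor occupies exactly half the bytes of a float32 tensor of the same shape. A quick sanity check in plain PyTorch (no GPU required):

```python
import torch

# Same shape, different dtypes
x32 = torch.zeros(1024, 1024, dtype=torch.float32)
x16 = torch.zeros(1024, 1024, dtype=torch.float16)

bytes32 = x32.element_size() * x32.nelement()  # 4 bytes per element
bytes16 = x16.element_size() * x16.nelement()  # 2 bytes per element

print(bytes32, bytes16)  # the fp16 tensor uses half the memory
```

Activations, gradients, and any optimizer state kept in fp16 shrink by the same factor, which is where the bulk of the savings during training comes from.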
Common variations
Other memory-saving methods include:
- Parameter-efficient fine-tuning: use LoRA or prefix tuning to update only a small subset of parameters, drastically reducing memory.
- Gradient accumulation: use smaller batches and accumulate gradients over several steps to simulate larger batch sizes without extra memory.
- Offloading: offload model weights or optimizer states to CPU or NVMe using libraries such as `accelerate` or DeepSpeed.
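To illustrate why LoRA saves memory, here is a minimal from-scratch sketch (this is not the `peft` library API; the class name `LoRALinear` and rank `r=4` are illustrative choices): the original linear layer is frozen, and only a low-rank update `B @ A` is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        # Low-rank factors: only these receive gradients
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start

    def forward(self, x):
        # y = base(x) + x (B A)^T
        return self.base(x) + x @ (self.lora_B @ self.lora_A).T

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # the low-rank factors are ~1% of the full layer
```

Because gradients and optimizer states are only kept for the trainable parameters, this roughly shrinks optimizer memory in proportion to the fraction of parameters trained. In practice you would use the `peft` library rather than hand-rolling this.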
| Technique | Description |
|---|---|
| LoRA | Low-rank adaptation updates fewer parameters, reducing memory. |
| Gradient checkpointing | Saves memory by recomputing activations during backward pass. |
| Mixed precision | Uses FP16 to halve tensor memory usage. |
| Gradient accumulation | Simulates large batch sizes with small memory footprint. |
| Offloading | Moves parts of model or optimizer to CPU/NVMe to save GPU memory. |
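Gradient accumulation from the table above is usually enabled via `TrainingArguments(gradient_accumulation_steps=...)`, but the underlying mechanics are simple enough to sketch in plain PyTorch (the model and data here are toy placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 4  # 4 small micro-batches = 1 effective batch
micro_batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated gradients average out
    loss.backward()                            # gradients accumulate in .grad across calls
    if step % accum_steps == 0:
        optimizer.step()                       # one optimizer update per effective batch
        optimizer.zero_grad()

print("updates applied:", len(micro_batches) // accum_steps)
```

Only one micro-batch of activations is ever in memory at a time, while the optimizer sees gradients equivalent to a batch of `2 * accum_steps` examples.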
Troubleshooting
If you encounter CUDA out of memory errors:
- Reduce batch size further.
- Ensure `fp16=True` is set in your training arguments.
- Enable gradient checkpointing by calling `model.gradient_checkpointing_enable()`.
- Try a parameter-efficient fine-tuning method such as LoRA.
- Check for memory leaks by restarting your runtime environment.
Key takeaways
- Enable gradient checkpointing to trade compute for significant memory savings during fine-tuning.
- Use mixed precision (FP16) training to reduce tensor memory footprint without losing model quality.
- Apply parameter-efficient fine-tuning methods like LoRA to update fewer parameters and save memory.
- Use gradient accumulation to handle large effective batch sizes with limited GPU memory.
- Offload model or optimizer states to CPU/NVMe when GPU memory is constrained.