How to reduce memory usage during fine-tuning
Quick answer
To reduce memory usage during fine-tuning, use techniques such as gradient checkpointing (trading compute for memory), mixed precision training (e.g., FP16) to shrink tensor sizes, and parameter-efficient fine-tuning methods such as LoRA or prefix tuning that update only a small fraction of the parameters. Together these techniques can substantially cut GPU memory requirements with little or no loss in model quality.

Prerequisites

- Python 3.8+
- PyTorch or TensorFlow installed
- Access to a GPU with CUDA
- `pip install transformers>=4.30`
- `pip install accelerate`
Setup
Install the necessary libraries for fine-tuning and memory optimization:
- `transformers` for the model and tokenizer
- `accelerate` for efficient training and mixed precision

```
pip install transformers accelerate
```

Step by step
This example shows how to enable gradient checkpointing and mixed precision while fine-tuning a Hugging Face `transformers` model to reduce memory usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "gpt2"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Enable gradient checkpointing to save memory
model.gradient_checkpointing_enable()

# Prepare a tiny dummy dataset: one dict per example, with labels for the causal LM loss
texts = ["Hello world!", "Fine-tuning with less memory."]
encodings = tokenizer(texts, padding=True, truncation=True)
train_dataset = [
    {"input_ids": ids, "attention_mask": mask, "labels": ids}
    for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])
]

# Set up training arguments with mixed precision (fp16 requires a CUDA GPU)
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    fp16=True,  # enable mixed precision
    logging_steps=1,
    save_strategy="no",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Start training
trainer.train()
```

Output
```
***** Running training *****
  Num examples = 2
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient checkpointing enabled
Mixed precision enabled (fp16)
...
Training completed.
```
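The fp16 saving is easy to verify directly: a half-precision tensor occupies exactly half the bytes of a float32 tensor of the same shape. A quick sanity check in plain PyTorch (no GPU required):

```python
import torch

# Same shape, different dtypes
x32 = torch.zeros(1024, 1024, dtype=torch.float32)
x16 = torch.zeros(1024, 1024, dtype=torch.float16)

bytes32 = x32.element_size() * x32.nelement()  # 4 bytes per element
bytes16 = x16.element_size() * x16.nelement()  # 2 bytes per element

print(bytes32, bytes16)  # the fp16 tensor uses half the memory
```

Activations, gradients, and any optimizer state kept in fp16 shrink by the same factor, which is where the bulk of the savings during training comes from.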
Common variations
Other memory-saving methods include:
- Parameter-efficient fine-tuning: use LoRA or prefix tuning to update only a small subset of parameters, drastically reducing memory.
- Gradient accumulation: use smaller batches and accumulate gradients over several steps to simulate larger batch sizes without extra memory.
- Offloading: offload model weights or optimizer states to CPU or NVMe using libraries such as `accelerate` or DeepSpeed.
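To illustrate why LoRA saves memory, here is a minimal from-scratch sketch (this is not the `peft` library API; the class name `LoRALinear` and rank `r=4` are illustrative choices): the original linear layer is frozen, and only a low-rank update `B @ A` is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        # Low-rank factors: only these receive gradients
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start

    def forward(self, x):
        # y = base(x) + x (B A)^T
        return self.base(x) + x @ (self.lora_B @ self.lora_A).T

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # the low-rank factors are ~1% of the full layer
```

Because gradients and optimizer states are only kept for the trainable parameters, this roughly shrinks optimizer memory in proportion to the fraction of parameters trained. In practice you would use the `peft` library rather than hand-rolling this.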
| Technique | Description |
|---|---|
| LoRA | Low-rank adaptation updates fewer parameters, reducing memory. |
| Gradient checkpointing | Saves memory by recomputing activations during backward pass. |
| Mixed precision | Uses FP16 to halve tensor memory usage. |
| Gradient accumulation | Simulates large batch sizes with small memory footprint. |
| Offloading | Moves parts of model or optimizer to CPU/NVMe to save GPU memory. |
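Gradient accumulation from the table above is usually enabled via `TrainingArguments(gradient_accumulation_steps=...)`, but the underlying mechanics are simple enough to sketch in plain PyTorch (the model and data here are toy placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 4  # 4 small micro-batches = 1 effective batch
micro_batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated gradients average out
    loss.backward()                            # gradients accumulate in .grad across calls
    if step % accum_steps == 0:
        optimizer.step()                       # one optimizer update per effective batch
        optimizer.zero_grad()

print("updates applied:", len(micro_batches) // accum_steps)
```

Only one micro-batch of activations is ever in memory at a time, while the optimizer sees gradients equivalent to a batch of `2 * accum_steps` examples.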
Troubleshooting
If you encounter CUDA out of memory errors:
- Reduce batch size further.
- Ensure `fp16=True` is set in your training arguments.
- Enable gradient checkpointing by calling `model.gradient_checkpointing_enable()`.
- Try a parameter-efficient fine-tuning method such as LoRA.
- Check for memory leaks by restarting your runtime environment.
Key takeaways
- Enable gradient checkpointing to trade compute for significant memory savings during fine-tuning.
- Use mixed precision (FP16) training to reduce tensor memory footprint without losing model quality.
- Apply parameter-efficient fine-tuning methods like LoRA to update fewer parameters and save memory.
- Use gradient accumulation to handle large effective batch sizes with limited GPU memory.
- Offload model or optimizer states to CPU/NVMe when GPU memory is constrained.