QLoRA memory requirements
Quick answer
QLoRA reduces memory usage by quantizing large language models to 4-bit precision, enabling fine-tuning on GPUs with as little as 8-12GB VRAM for 7B-13B parameter models. Total memory depends on base model size, batch size, and optimizer states, typically requiring 12-24GB VRAM for smooth training of 13B models with moderate batch sizes.
Prerequisites
- Python 3.8+
- pip install transformers>=4.30.0
- pip install bitsandbytes
- A GPU with CUDA support and at least 8GB VRAM
Setup
Install the necessary libraries for QLoRA fine-tuning, including transformers and bitsandbytes for 4-bit quantization support.
pip install transformers bitsandbytes accelerate

Step by step
QLoRA uses 4-bit quantization to reduce memory footprint. For example, a 7B parameter model can be fine-tuned on a single 8GB GPU, while 13B models typically require 12-24GB VRAM depending on batch size and optimizer.
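The arithmetic behind those figures can be sketched with a rough back-of-envelope estimate. The helper below is illustrative only (not part of any library) and covers just the quantized base weights; activations, gradients, optimizer states, and LoRA adapters account for the rest of the budget:

```python
def qlora_weight_memory_gb(num_params: int, bits: int = 4) -> float:
    """Rough memory for the quantized base weights alone (excludes
    activations, gradients, optimizer states, and LoRA adapters)."""
    return num_params * bits / 8 / 1e9

# 7B model: ~3.5 GB of 4-bit weights; activations and optimizer
# states push the practical requirement toward 8 GB.
print(qlora_weight_memory_gb(7_000_000_000))   # 3.5
print(qlora_weight_memory_gb(13_000_000_000))  # 6.5
```

This is why a 7B model fits on an 8GB card: the weights themselves take only about half the VRAM, leaving headroom for the training state.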
Here is a minimal example to load a 4-bit quantized model with transformers and bitsandbytes:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Llama-2-13b-chat-hf"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_4bit=True,
device_map="auto"
)
# Example input
inputs = tokenizer("Hello, QLoRA!", return_tensors="pt").to(model.device)
# Forward pass
outputs = model(**inputs)
print(outputs.logits.shape)

Output:
torch.Size([1, 6, 32000])
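For tighter control over the quantization scheme, transformers also accepts a BitsAndBytesConfig. The settings below mirror the QLoRA recipe (NF4 quantization, double quantization, bfloat16 compute); this is a sketch assuming transformers>=4.30 and a CUDA GPU with bfloat16 support:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style quantization config: NF4 data type, double
# quantization of the quantization constants, bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Double quantization shaves additional memory by quantizing the quantization constants themselves, which is part of how QLoRA reaches the figures quoted above.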
Common variations
Memory usage varies with:
- Model size: 7B models need ~8GB VRAM, 13B models 12-24GB.
- Batch size: Larger batches increase VRAM linearly.
- Optimizer: using 8-bit optimizers like bitsandbytes.optim.Adam8bit reduces memory.
- Gradient checkpointing: saves memory at the cost of compute.
import bitsandbytes as bnb
# Use 8-bit Adam optimizer to save memory
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=3e-4)
# Enable gradient checkpointing
model.gradient_checkpointing_enable()

Troubleshooting
If you encounter CUDA out of memory errors:
- Reduce batch size.
- Enable gradient checkpointing.
- Use mixed precision training (fp16).
- Ensure device_map="auto" is set for optimal GPU memory allocation.
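A common pattern when out-of-memory errors persist is to halve the batch size until a training step succeeds. Below is a minimal sketch of that backoff loop; run_with_oom_backoff and fake_step are hypothetical helpers (not a transformers API), and MemoryError stands in for torch.cuda.OutOfMemoryError so the logic is runnable without a GPU:

```python
def run_with_oom_backoff(train_step, batch_size: int, min_batch: int = 1):
    """Retry a training step, halving the batch size whenever it
    raises an out-of-memory error."""
    while batch_size >= min_batch:
        try:
            return train_step(batch_size), batch_size
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            batch_size //= 2
    raise MemoryError("even the minimum batch size does not fit")

# Toy usage: pretend batches larger than 4 do not fit in VRAM.
def fake_step(bs):
    if bs > 4:
        raise MemoryError
    return f"trained with batch size {bs}"

print(run_with_oom_backoff(fake_step, 16))  # ('trained with batch size 4', 4)
```

In real training the same loop would wrap the forward/backward pass and clear the CUDA cache between retries.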
Key takeaways
- QLoRA enables fine-tuning large models with 4-bit quantization, drastically reducing VRAM needs.
- A 7B parameter model can fit on 8GB GPU VRAM; 13B models require 12-24GB depending on batch size and optimizer.
- Use 8-bit optimizers and gradient checkpointing to further reduce memory footprint.
- Adjust batch size and precision to avoid CUDA out of memory errors during training.
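To go from loading the quantized model to actually fine-tuning it, LoRA adapters are typically attached with the peft library (an extra dependency not listed in the prerequisites above). A sketch, where the adapter hyperparameters (r, lora_alpha, lora_dropout, target_modules) are illustrative defaults to tune for your task, and model is the 4-bit model loaded earlier:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Illustrative adapter settings; adjust r/alpha/dropout per task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# `model` is the 4-bit quantized model from the earlier example.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the small adapter matrices receive gradients, which is why optimizer state stays tiny even for 13B base models.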