How-to · Intermediate · 3 min read

QLoRA memory requirements

Quick answer
QLoRA cuts memory use by quantizing the frozen base model's weights to 4-bit precision and training only small LoRA adapters on top, so a 7B-parameter model can be fine-tuned on a GPU with roughly 8GB of VRAM. Total memory depends on base model size, batch size, sequence length, and optimizer states; a 13B model typically needs 12-24GB of VRAM for training with moderate batch sizes.
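The dominant memory terms can be checked with back-of-envelope arithmetic. This sketch uses illustrative numbers (the 250M LoRA-parameter count and the simple Adam accounting are assumptions, not measurements) and ignores activations, which grow with batch size and sequence length:

```python
# Rough VRAM estimate for QLoRA fine-tuning (illustrative numbers only)

def qlora_vram_estimate_gb(n_params_b, lora_params_m=250, bytes_per_weight=0.5):
    """Back-of-envelope VRAM estimate in GB, excluding activations."""
    base_weights = n_params_b * 1e9 * bytes_per_weight  # 4-bit ~ 0.5 bytes/param
    lora_weights = lora_params_m * 1e6 * 2              # adapters kept in fp16/bf16
    optimizer_states = lora_params_m * 1e6 * 8          # Adam: two fp32 moments per trainable param
    return (base_weights + lora_weights + optimizer_states) / 1e9

# A 13B base model: ~6.5 GB of 4-bit weights plus adapter/optimizer overhead
print(round(qlora_vram_estimate_gb(13), 1))  # → 9.0
```

The estimate lands below the quoted 12-24GB range because activations, gradients for the adapters, and CUDA overhead consume the rest; that gap is exactly what batch size and sequence length control.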

PREREQUISITES

  • Python 3.8+
  • pip install transformers>=4.30.0
  • pip install bitsandbytes
  • pip install peft
  • A GPU with CUDA support and at least 8GB VRAM

Setup

Install the necessary libraries for QLoRA fine-tuning: transformers and accelerate for model loading, bitsandbytes for 4-bit quantization, and peft for the LoRA adapters.

bash
pip install transformers bitsandbytes accelerate peft

Step by step

QLoRA uses 4-bit quantization of the frozen base weights to reduce the memory footprint. For example, a 7B-parameter model can be fine-tuned on a single 8GB GPU, while 13B models typically require 12-24GB of VRAM depending on batch size, sequence length, and optimizer choice.

Here is a minimal example to load a 4-bit quantized model with transformers and bitsandbytes:

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "meta-llama/Llama-2-13b-chat-hf"

# QLoRA-style quantization: NF4 4-bit weights, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the 4-bit quantized model, spreading weights across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Example input
inputs = tokenizer("Hello, QLoRA!", return_tensors="pt").to(model.device)

# Forward pass (no gradients needed for this smoke test)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)
output
torch.Size([1, 6, 32000])

Common variations

Memory usage varies with:

  • Model size: 7B models need ~8GB VRAM, 13B models 12-24GB.
  • Batch size: Larger batches increase VRAM linearly.
  • Optimizer: 8-bit optimizers such as bitsandbytes' bnb.optim.Adam8bit store optimizer states in 8-bit, reducing memory.
  • Gradient checkpointing: Saves memory at the cost of compute.

python
import bitsandbytes as bnb

# 8-bit Adam keeps optimizer states in 8-bit, cutting their memory roughly 4x
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=3e-4)

# Recompute activations during the backward pass instead of storing them
model.gradient_checkpointing_enable()

Troubleshooting

If you encounter CUDA out of memory errors:

  • Reduce batch size.
  • Enable gradient checkpointing.
  • Use mixed precision training (fp16, or bf16 on Ampere and newer GPUs).
  • Set device_map="auto" so weights are spread across available GPUs, with CPU offload when needed.
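The first two fixes compose well: shrink the per-step batch until it fits, then recover the original effective batch size with gradient accumulation. A minimal sketch of the bookkeeping (the batch sizes are illustrative):

```python
# Keep the effective batch size constant while lowering peak activation memory

def accumulation_steps(target_batch, micro_batch):
    """How many gradient-accumulation steps recover the target batch size."""
    if target_batch % micro_batch != 0:
        raise ValueError("target batch must be a multiple of the micro batch")
    return target_batch // micro_batch

# OOM at batch 32? Run micro-batches of 4 and accumulate over 8 steps
steps = accumulation_steps(target_batch=32, micro_batch=4)
print(steps)  # → 8
```

In transformers this corresponds to per_device_train_batch_size=4 with gradient_accumulation_steps=8 in TrainingArguments; gradients are summed across micro-batches before each optimizer step, so the update matches the larger batch at a fraction of the activation memory.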

Key Takeaways

  • QLoRA enables fine-tuning large models with 4-bit quantization, drastically reducing VRAM needs.
  • A 7B parameter model can fit on 8GB GPU VRAM; 13B models require 12-24GB depending on batch size and optimizer.
  • Use 8-bit optimizers and gradient checkpointing to further reduce memory footprint.
  • Adjust batch size and precision to avoid CUDA out of memory errors during training.
Verified 2026-04 · meta-llama/Llama-2-13b-chat-hf