
Fix QLoRA out of memory error

Quick answer
A QLoRA out of memory error occurs when the GPU memory is insufficient due to large batch sizes, lack of 4-bit quantization, or improper device mapping. Fix this by enabling BitsAndBytesConfig for 4-bit loading, reducing batch size, and using device_map="auto" to optimize memory allocation.
ERROR TYPE config_error
⚡ QUICK FIX
Reduce batch size and enable 4-bit quantization with BitsAndBytesConfig to fix QLoRA out of memory errors.

Why this happens

Out of memory errors during QLoRA fine-tuning typically arise because the model and training parameters exceed the available GPU memory. Common triggers include large batch_size, loading the model in full precision instead of 4-bit quantization, and not using device mapping to spread the model across GPUs.
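A quick back-of-the-envelope calculation shows why precision matters: the memory needed just to hold the weights scales with parameter count times bytes per parameter. This sketch (plain Python, no GPU required; it ignores activations, gradients, and optimizer state) compares an 8B-parameter model across precisions:

```python
def model_memory_gib(n_params, bytes_per_param):
    """Approximate memory to hold model weights only,
    ignoring activations, gradients, and optimizer state."""
    return n_params * bytes_per_param / 2**30

n = 8e9  # 8B parameters, as in Llama-3.1-8B
print(f"fp32:  {model_memory_gib(n, 4):.1f} GiB")    # ~29.8 GiB
print(f"fp16:  {model_memory_gib(n, 2):.1f} GiB")    # ~14.9 GiB
print(f"4-bit: {model_memory_gib(n, 0.5):.1f} GiB")  # ~3.7 GiB
```

A full-precision 8B model alone already exceeds a 16 GiB GPU, before any activations or optimizer state, which is why 4-bit quantization is the first lever to pull.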

Example error output:

RuntimeError: CUDA out of memory. Tried to allocate ...

Broken code example loading the model in full precision on a single GPU with a large batch:

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Full-precision (float32) load: the weights of an 8B model alone
# need roughly 30 GiB, more than a 16 GiB GPU can hold
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct"
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Large batch placed on the same already-full GPU
inputs = tokenizer(["Hello world"] * 16, return_tensors="pt", padding=True).to("cuda")
outputs = model(**inputs)
output
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 16.00 GiB total capacity; 14.00 GiB already allocated; 1.50 GiB free; 14.50 GiB reserved in total)

The fix

Use BitsAndBytesConfig to load the model in 4-bit precision, which drastically reduces memory usage. Also, set device_map="auto" to automatically distribute model layers across available GPUs. Reduce batch_size to fit within memory limits.

This code snippet shows the correct setup:

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Standard QLoRA quantization settings: 4-bit NF4 with double quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",  # spread layers across available GPUs (and CPU if needed)
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Smaller batch, moved to the device the model's first layers were placed on
inputs = tokenizer(["Hello world"] * 4, return_tensors="pt", padding=True).to(model.device)
outputs = model(**inputs)
print("Model output shape:", outputs.logits.shape)
output
Model output shape: torch.Size([4, sequence_length, vocab_size])
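With the quantized model loaded, the QLoRA step itself attaches small trainable LoRA adapters via the peft library. A minimal sketch, assuming peft is installed and model is the 4-bit model from above; the target_modules and rank shown are a common choice for Llama-style architectures, not the only valid one:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit base model for training (casts norms, enables input grads)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common choice for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```

Because only the adapter weights receive gradients and optimizer state, the training memory footprint stays close to the 4-bit inference footprint.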

Preventing it in production

  • Implement dynamic batch sizing or gradient accumulation to keep memory usage stable.
  • Validate model loading with 4-bit quantization and device mapping before training.
  • Use monitoring tools to track GPU memory and trigger fallbacks or retries if memory limits are approached.
  • Consider mixed precision training and offloading techniques if supported.
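Gradient accumulation from the first bullet trades time for memory: average the gradients of several small micro-batches and apply one optimizer step, so an effective batch of 16 fits in memory as 16 batches of 1. A pure-Python sketch of the bookkeeping (the function name and the scalar "gradients" are illustrative stand-ins for real tensors):

```python
def accumulate_steps(micro_grads, accum_steps):
    """Average micro-batch gradients; emit one 'optimizer step'
    per accum_steps micro-batches."""
    steps, buffer = [], 0.0
    for i, g in enumerate(micro_grads, start=1):
        buffer += g / accum_steps      # scale each micro-batch's gradient
        if i % accum_steps == 0:       # effective batch boundary reached
            steps.append(buffer)       # apply the accumulated gradient
            buffer = 0.0
    return steps

# 8 micro-batches accumulated 4 at a time -> 2 optimizer steps
print(accumulate_steps([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4))
# [2.5, 6.5]
```

In the Hugging Face Trainer this corresponds to setting gradient_accumulation_steps alongside a small per_device_train_batch_size.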

Key Takeaways

  • Enable 4-bit quantization with BitsAndBytesConfig to reduce GPU memory usage.
  • Use device_map="auto" to optimize model layer placement across GPUs.
  • Reduce batch_size or use gradient accumulation to fit training within memory limits.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct