
Fix QLoRA out of memory error

Quick answer
A QLoRA out of memory error occurs when the GPU memory is insufficient due to large batch sizes, lack of 4-bit quantization, or improper device mapping. Fix this by enabling BitsAndBytesConfig for 4-bit loading, reducing batch size, and using device_map="auto" to optimize memory allocation.
ERROR TYPE config_error
⚡ QUICK FIX
Reduce batch size and enable 4-bit quantization with BitsAndBytesConfig to fix QLoRA out of memory errors.

Why this happens

Out of memory errors during QLoRA fine-tuning typically arise because the model and training parameters exceed the available GPU memory. Common triggers include large batch_size, loading the model in full precision instead of 4-bit quantization, and not using device mapping to spread the model across GPUs.
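A quick back-of-the-envelope calculation shows why precision matters: the memory needed just to hold the weights scales with parameter count times bytes per parameter. This sketch (plain Python, no GPU required; it ignores activations, gradients, and optimizer state) compares an 8B-parameter model across precisions:

```python
def model_memory_gib(n_params, bytes_per_param):
    """Approximate memory to hold model weights only,
    ignoring activations, gradients, and optimizer state."""
    return n_params * bytes_per_param / 2**30

n = 8e9  # 8B parameters, as in Llama-3.1-8B
print(f"fp32:  {model_memory_gib(n, 4):.1f} GiB")    # ~29.8 GiB
print(f"fp16:  {model_memory_gib(n, 2):.1f} GiB")    # ~14.9 GiB
print(f"4-bit: {model_memory_gib(n, 0.5):.1f} GiB")  # ~3.7 GiB
```

A full-precision 8B model alone already exceeds a 16 GiB GPU, before any activations or optimizer state, which is why 4-bit quantization is the first lever to pull.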

Example error output:

RuntimeError: CUDA out of memory. Tried to allocate ...

Broken code example loading the model in full precision on a single GPU with a large batch:

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Full-precision (float32) load: the weights of an 8B model alone
# need roughly 30 GiB, more than a 16 GiB GPU can hold
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct"
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Large batch placed on the same already-full GPU
inputs = tokenizer(["Hello world"] * 16, return_tensors="pt", padding=True).to("cuda")
outputs = model(**inputs)
output
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 16.00 GiB total capacity; 14.00 GiB already allocated; 1.50 GiB free; 14.50 GiB reserved in total)

The fix

Use BitsAndBytesConfig to load the model in 4-bit precision, which drastically reduces memory usage. Also, set device_map="auto" to automatically distribute model layers across available GPUs. Reduce batch_size to fit within memory limits.

This code snippet shows the correct setup:

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Standard QLoRA quantization settings: 4-bit NF4 with double quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",  # spread layers across available GPUs (and CPU if needed)
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Smaller batch, moved to the device the model's first layers were placed on
inputs = tokenizer(["Hello world"] * 4, return_tensors="pt", padding=True).to(model.device)
outputs = model(**inputs)
print("Model output shape:", outputs.logits.shape)
output
Model output shape: torch.Size([4, sequence_length, vocab_size])
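With the quantized model loaded, the QLoRA step itself attaches small trainable LoRA adapters via the peft library. A minimal sketch, assuming peft is installed and model is the 4-bit model from above; the target_modules and rank shown are a common choice for Llama-style architectures, not the only valid one:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit base model for training (casts norms, enables input grads)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common choice for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```

Because only the adapter weights receive gradients and optimizer state, the training memory footprint stays close to the 4-bit inference footprint.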

Preventing it in production

  • Implement dynamic batch sizing or gradient accumulation to keep memory usage stable.
  • Validate model loading with 4-bit quantization and device mapping before training.
  • Use monitoring tools to track GPU memory and trigger fallbacks or retries if memory limits are approached.
  • Consider mixed precision training and offloading techniques if supported.
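Gradient accumulation from the first bullet trades time for memory: average the gradients of several small micro-batches and apply one optimizer step, so an effective batch of 16 fits in memory as 16 batches of 1. A pure-Python sketch of the bookkeeping (the function name and the scalar "gradients" are illustrative stand-ins for real tensors):

```python
def accumulate_steps(micro_grads, accum_steps):
    """Average micro-batch gradients; emit one 'optimizer step'
    per accum_steps micro-batches."""
    steps, buffer = [], 0.0
    for i, g in enumerate(micro_grads, start=1):
        buffer += g / accum_steps      # scale each micro-batch's gradient
        if i % accum_steps == 0:       # effective batch boundary reached
            steps.append(buffer)       # apply the accumulated gradient
            buffer = 0.0
    return steps

# 8 micro-batches accumulated 4 at a time -> 2 optimizer steps
print(accumulate_steps([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4))
# [2.5, 6.5]
```

In the Hugging Face Trainer this corresponds to setting gradient_accumulation_steps alongside a small per_device_train_batch_size.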

Key Takeaways

  • Enable 4-bit quantization with BitsAndBytesConfig to reduce GPU memory usage.
  • Use device_map="auto" to optimize model layer placement across GPUs.
  • Reduce batch_size or use gradient accumulation to fit training within memory limits.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct