Critical severity · Intermediate · Fix: 5-15 min

RuntimeError

torch.cuda.OutOfMemoryError (RuntimeError during LoRA merge_and_unload)

What this error means
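This error occurs when the GPU runs out of VRAM during the merge_and_unload operation in lora-qlora. PyTorch reports it as torch.cuda.OutOfMemoryError, a subclass of RuntimeError, so it appears as a RuntimeError in the stack trace.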

Stack trace

traceback
RuntimeError: CUDA out of memory. Tried to allocate XX GiB (GPU 0; XX GiB total capacity; XX GiB already allocated; XX GiB free; XX GiB reserved in total by PyTorch)
  File "/path/to/lora_qlora.py", line XX, in merge_and_unload
    model = lora_model.merge_and_unload()
QUICK FIX
Move the model to CPU before calling merge_and_unload so the merge uses system RAM instead of VRAM, and keep the returned model: model = model.to('cpu').merge_and_unload()

Why it happens

The merge_and_unload method folds the LoRA weights into the base model and then removes the LoRA adapters. The merge needs extra VRAM on top of the loaded model, because merged weights are materialized while the original weights are still resident. If the GPU does not have enough free VRAM to hold both at once, PyTorch raises a CUDA out-of-memory RuntimeError.
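To get a rough sense of the numbers involved, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not a measurement. The helper names (`weights_gib`, `lora_delta_gib`) are hypothetical, the 7B parameter count, fp16 dtype, and 4096 x 4096 layer shape are example inputs, and the sketch assumes the merge builds one full-size update per adapted layer, which may not match how lora-qlora implements it.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(n_params: int, bytes_per_param: int) -> float:
    """Approximate size of a set of weights in GiB."""
    return n_params * bytes_per_param / GIB


def lora_delta_gib(d_out: int, d_in: int, bytes_per_param: int) -> float:
    """Size of one merged update (a d_out x d_in matrix) in GiB.

    Assumption: the merge builds the full-size update for a layer
    before adding it to that layer's weight.
    """
    return d_out * d_in * bytes_per_param / GIB


if __name__ == "__main__":
    base = weights_gib(7_000_000_000, 2)   # example: 7B parameters in fp16
    delta = lora_delta_gib(4096, 4096, 2)  # example: one 4096 x 4096 projection
    print(f"base weights: {base:.2f} GiB, one layer update: {delta:.4f} GiB")
```

If the base weights alone are close to the card's capacity, even small temporary buffers can be enough to trigger the error.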

Detection

Check free VRAM before calling merge_and_unload, for example with torch.cuda.mem_get_info(). Wrap the call in try/except RuntimeError and log VRAM statistics on failure, so the out-of-memory condition is recorded together with the numbers needed to diagnose it.
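A minimal sketch of such a preflight check follows. The function names (`cuda_memory_stats`, `has_headroom`, `log_stats`) and the 1.2 safety factor are illustrative assumptions, not part of lora-qlora. The PyTorch calls used (`torch.cuda.mem_get_info`, `torch.cuda.memory_allocated`, `torch.cuda.memory_reserved`) are standard, and PyTorch is imported lazily so the pure comparison helper works without a GPU.

```python
import logging

log = logging.getLogger("merge_preflight")


def cuda_memory_stats(device: int = 0):
    """Return VRAM statistics in bytes, or None if CUDA is unavailable."""
    try:
        import torch  # lazy import: has_headroom() below does not need PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free, total = torch.cuda.mem_get_info(device)
    return {
        "free": free,
        "total": total,
        "allocated": torch.cuda.memory_allocated(device),
        "reserved": torch.cuda.memory_reserved(device),
    }


def has_headroom(free_bytes: float, needed_bytes: float, safety: float = 1.2) -> bool:
    """True if free memory covers the estimated need plus a safety margin."""
    return free_bytes >= needed_bytes * safety


def log_stats(tag: str, device: int = 0) -> None:
    """Log current VRAM statistics under a short tag such as 'before merge'."""
    stats = cuda_memory_stats(device)
    if stats is None:
        log.info("%s: CUDA not available", tag)
    else:
        log.info("%s: %s", tag, {k: f"{v / 2**30:.2f} GiB" for k, v in stats.items()})
```

Calling `log_stats("before merge")` ahead of the merge and `log_stats("after OOM")` inside the `except RuntimeError` handler gives two data points to compare when diagnosing a failure.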

Causes & fixes

1

GPU VRAM is insufficient to hold both base model and merged LoRA weights simultaneously during merge_and_unload.

✓ Fix

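Use a smaller base model or a GPU with more VRAM. Batch size does not matter here, because merging runs no forward pass. Alternatively, perform the merge on CPU by moving the model to CPU before calling merge_and_unload. This is slower and requires enough system RAM to hold the model.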

2

Multiple large models or processes occupy VRAM concurrently, leaving insufficient free memory for merging.

✓ Fix

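Close other GPU processes, delete references to models you no longer need, and then clear the cache with torch.cuda.empty_cache() before calling merge_and_unload. empty_cache() only releases cached blocks that no live tensor is using, so it cannot free memory that is still referenced.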

3

merge_and_unload is called without offloading or memory optimization strategies enabled.

✓ Fix

Use a memory-efficient offloading strategy, such as Hugging Face Accelerate, or load the model with device_map='auto' to reduce VRAM usage during merging.
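For cause 2, the cleanup step can be sketched as a small helper. The name `release_gpu_cache` is hypothetical. It relies only on `gc.collect()` and `torch.cuda.empty_cache()`, and it assumes you have already dropped your own references (for example with `del other_model`) to anything that should be freed.

```python
import gc


def release_gpu_cache() -> bool:
    """Run garbage collection, then return cached CUDA blocks to the driver.

    Returns True if the CUDA cache was cleared, False if PyTorch or CUDA
    is not available.
    """
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    try:
        import torch  # lazy import so the helper is safe on CPU-only machines
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # frees cached blocks not used by any live tensor
    return True
```

Call it immediately before merge_and_unload. Memory held by other processes is not affected and has to be released by stopping those processes.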

Code: broken vs fixed

Broken - triggers the error
python
from lora_qlora import LoraModel
import torch

model = LoraModel.load_from_checkpoint('checkpoint.pt')
# This line triggers VRAM out of memory error
model.merge_and_unload()
Fixed - works correctly
python
import os
import torch
from lora_qlora import LoraModel

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
model = LoraModel.load_from_checkpoint('checkpoint.pt')
# Move model to CPU before merging to avoid VRAM overflow
model.to('cpu')
# merge_and_unload returns the merged model (see the stack trace), so keep it
model = model.merge_and_unload()
print('Merge and unload completed successfully')
Moved the model to CPU before calling merge_and_unload to prevent GPU VRAM out of memory errors during the merge process.

Workaround

Wrap merge_and_unload in try/except RuntimeError; on failure, move model to CPU and retry merging to bypass VRAM limits temporarily.
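The retry logic can be sketched as follows. The helper name `merge_with_cpu_fallback` is hypothetical, and the sketch assumes that `model.to("cpu")` moves the model in place (as `torch.nn.Module.to` does) and that the model is still usable after a failed merge. If a failed merge can leave the model partially merged, reloading it from the checkpoint before retrying is the more conservative choice.

```python
def merge_with_cpu_fallback(model):
    """Try the merge on the current device; on CUDA OOM, retry on CPU.

    `model` is any object exposing merge_and_unload() and to(device).
    """
    try:
        return model.merge_and_unload()
    except RuntimeError as err:
        # Only handle out-of-memory failures; re-raise anything else.
        if "out of memory" not in str(err).lower():
            raise
    # Retry outside the except block so the failed attempt's traceback,
    # and any tensors its frames reference, can be released first.
    model.to("cpu")
    return model.merge_and_unload()
```

Non-OOM RuntimeErrors are re-raised unchanged, so genuine bugs such as shape mismatches are not hidden by the fallback.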

Prevention

Use memory offloading strategies and monitor VRAM usage proactively. Prefer models and hardware configurations that fit your VRAM budget before merging LoRA weights.

Python 3.9+ · lora-qlora >=0.1.0 · tested on 0.1.5
Verified 2026-04