High severity · Intermediate · Fix: 5-15 min

RuntimeError

torch.cuda.OutOfMemoryError (RuntimeError)

What this error means
The Hugging Face accelerate library's automatic device mapping (device_map='auto') raises a CUDA out-of-memory error when the model is too large for the available GPU memory.

Stack trace

traceback
RuntimeError: CUDA out of memory. Tried to allocate XX GiB (GPU 0; XX GiB total capacity; XX GiB already allocated; XX GiB free; XX GiB reserved)
  File "/path/to/accelerate/utils.py", line XXX, in device_map_auto
    ...
  File "/path/to/transformers/modeling_utils.py", line XXX, in to
    ...
QUICK FIX
Manually specify device_map instead of 'auto' or reduce model size to fit GPU memory.

Why it happens

With device_map='auto', accelerate splits the model's layers across the available GPUs, filling GPU memory first and then offloading to CPU RAM and disk. The placement is computed from estimated weight sizes, so if the model exceeds the available memory, or the heuristic leaves too little headroom for activations and other runtime allocations, loading or the first forward pass raises a CUDA out-of-memory error. This is most common with very large models or limited GPU resources.
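A rough back-of-the-envelope check can show whether a model has any chance of fitting. The sketch below estimates the memory taken by the weights alone from the parameter count and dtype; the helper name `weight_footprint_gib` and the 7-billion-parameter figure are illustrative assumptions, not values from this page.

```python
# Bytes used per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_footprint_gib(num_params, dtype="float32"):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


# Hypothetical 7-billion-parameter model: compare full and half precision.
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_footprint_gib(7_000_000_000, dtype):.1f} GiB")
```

This counts weights only. Activations, the CUDA context, and any generation cache need additional memory, so treat the result as a lower bound when comparing it with the free memory on each GPU.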

Detection

Monitor GPU memory usage before and during model loading with device_map='auto'. Catch RuntimeError exceptions related to CUDA out-of-memory and log memory stats to detect imminent failures.
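One possible way to log memory stats is sketched below, assuming PyTorch is installed. `format_gpu_memory` and `log_gpu_memory` are hypothetical helper names; `torch.cuda.mem_get_info` is the PyTorch call that returns free and total bytes for a device.

```python
def format_gpu_memory(index, free_bytes, total_bytes):
    """Render one GPU's free/total memory as a readable log line."""
    gib = 1024 ** 3
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    return (f"GPU {index}: {free_bytes / gib:.2f} GiB free of "
            f"{total_bytes / gib:.2f} GiB ({used_pct:.0f}% in use)")


def log_gpu_memory():
    """Print free/total memory for every visible GPU."""
    import torch  # imported lazily so the formatter above works without PyTorch

    if not torch.cuda.is_available():
        print("No CUDA device visible")
        return
    for index in range(torch.cuda.device_count()):
        free_bytes, total_bytes = torch.cuda.mem_get_info(index)
        print(format_gpu_memory(index, free_bytes, total_bytes))
```

Calling `log_gpu_memory()` just before `from_pretrained` and again inside an `except RuntimeError` block gives a before/after picture of which GPU ran out of room.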

Causes & fixes

1

Model size exceeds total available GPU memory across devices

✓ Fix

Use a smaller model, load the weights in half precision (for example torch_dtype=torch.float16), or offload part of the model to CPU to reduce the GPU memory footprint.

2

Automatic device map heuristic misestimates layer memory requirements

✓ Fix

Manually specify device_map to control layer placement and avoid overloading any single GPU.

3

GPU memory is fragmented or already reserved, leaving no free block large enough for the allocation

✓ Fix

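Restart the Python process to clear GPU memory, or delete references to unused tensors and models and call torch.cuda.empty_cache() before loading the model. empty_cache() only releases cached blocks that no live tensor is using.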

4

Using device_map='auto' with insufficient GPUs or incompatible hardware

✓ Fix

Verify GPU availability and compatibility; consider using device_map='balanced' or 'sequential' for better control.
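For fixes 2 and 4, a manual device map is a plain dict from module names to device indices. The sketch below spreads transformer blocks across GPUs in contiguous chunks. The helper `build_layer_device_map` is hypothetical, and the module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) are assumptions based on Llama-style architectures; other models use different names.

```python
def build_layer_device_map(num_layers, num_gpus, prefix="model.layers"):
    """Assign transformer blocks to GPUs in contiguous, near-equal chunks."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    device_map = {
        # Module names are assumptions; check the names your own model uses.
        "model.embed_tokens": 0,
        "model.norm": num_gpus - 1,
        "lm_head": num_gpus - 1,
    }
    for layer in range(num_layers):
        device_map[f"{prefix}.{layer}"] = layer // per_gpu
    return device_map


# Example: 32 blocks over 2 GPUs puts blocks 0-15 on GPU 0 and 16-31 on GPU 1.
device_map = build_layer_device_map(num_layers=32, num_gpus=2)
```

The resulting dict can be passed as `device_map=device_map` to `from_pretrained`. If an earlier load with `device_map='auto'` succeeded on another machine, `model.hf_device_map` shows the actual module names and is a convenient starting point for a hand-edited map.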

Code: broken vs fixed

Broken - triggers the error
python
from transformers import AutoModelForCausalLM

model_name = "big-model"
# Automatic placement: accelerate decides where each layer goes
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')  # triggers OOM error
Fixed - works correctly
python
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # limit to one GPU; set before CUDA is initialized

import torch
from transformers import AutoModelForCausalLM

model_name = "big-model"
# Manual device map ({'': 0} = whole model on GPU 0) plus fp16 to halve the weight footprint
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map={'': 0}, torch_dtype=torch.float16
)
print("Model loaded successfully")
Replaced device_map='auto' with a manual device_map that places the whole model on GPU 0, and loaded the weights in half precision to reduce memory pressure. This prevents the OOM only if the fp16 weights fit on that GPU; for larger models, split the layers across several GPUs or offload some of them to CPU.

Workaround

Catch the RuntimeError, call torch.cuda.empty_cache(), and retry loading with a smaller model or manual device_map to avoid OOM.
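The retry pattern can be sketched as a small helper that tries loading strategies in order. `load_with_fallback` and `free_cuda_cache` are hypothetical names introduced for illustration; the only library calls assumed are `gc.collect()` and `torch.cuda.empty_cache()`.

```python
import gc


def load_with_fallback(loaders, cleanup=None):
    """Try each zero-argument loader in order; run cleanup after every OOM."""
    last_message = None
    for load in loaders:
        try:
            return load()
        except RuntimeError as error:  # torch.cuda.OutOfMemoryError is a RuntimeError
            if "out of memory" not in str(error).lower():
                raise  # unrelated RuntimeError: do not mask it
            last_message = str(error)  # keep only the text, not the traceback
        # Reached only after an OOM; the exception has been released by now.
        if cleanup is not None:
            cleanup()
    raise RuntimeError(f"every loading strategy ran out of memory: {last_message}")


def free_cuda_cache():
    """Drop unreachable Python objects, then release PyTorch's cached GPU blocks."""
    import torch  # lazy import: only needed when CUDA cleanup actually runs

    gc.collect()
    torch.cuda.empty_cache()


# Usage sketch (model name and strategies are placeholders):
# model = load_with_fallback(
#     [lambda: AutoModelForCausalLM.from_pretrained(name, device_map="auto"),
#      lambda: AutoModelForCausalLM.from_pretrained(name, device_map={"": 0},
#                                                   torch_dtype=torch.float16)],
#     cleanup=free_cuda_cache,
# )
```

Cleanup runs outside the `except` block on purpose: while the exception is still being handled, its traceback can keep references to partially loaded tensors alive, which would stop the cache from being released.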

Prevention

Use explicit device_map settings or model parallelism strategies and monitor GPU memory before loading large models to prevent automatic mapping OOM errors.
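One way to act on a pre-load memory check is to cap what the automatic mapper may use on each device through the `max_memory` argument of `from_pretrained`, which accepts a dict such as `{0: "10GiB", "cpu": "30GiB"}`. The sketch below builds that dict from measured free bytes; `build_max_memory` and the 20% default headroom are illustrative assumptions, not library defaults.

```python
def build_max_memory(free_bytes_per_gpu, headroom=0.2, cpu_gib=None):
    """Turn per-GPU free bytes into a max_memory dict, reserving headroom."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    gib = 1024 ** 3
    max_memory = {}
    for index, free_bytes in enumerate(free_bytes_per_gpu):
        usable_gib = int(free_bytes * (1 - headroom) / gib)  # round down
        max_memory[index] = f"{usable_gib}GiB"
    if cpu_gib is not None:
        max_memory["cpu"] = f"{cpu_gib}GiB"  # budget for CPU offload
    return max_memory


# Example: two GPUs with 24 GiB and 16 GiB free, keeping 25% in reserve.
budget = build_max_memory([24 * 1024 ** 3, 16 * 1024 ** 3], headroom=0.25, cpu_gib=48)
```

The result can be passed as `from_pretrained(model_name, device_map="auto", max_memory=budget)`. The free-byte values would typically come from `torch.cuda.mem_get_info`, and the reserved headroom leaves room for activations that the weight-based placement does not account for.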

Python 3.9+ · transformers >=4.0.0 · tested on 4.30.x
Verified 2026-04