Critical severity · Intermediate · Fix: 5-15 min

torch.cuda.OutOfMemoryError

What this error means

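PyTorch raises torch.cuda.OutOfMemoryError when a CUDA allocation asks for more memory than the GPU has free, typically while a large HuggingFace model is being moved onto the device or run. torch.cuda.OutOfMemoryError is a subclass of RuntimeError introduced in PyTorch 1.13, so older versions report the same failure as a plain "RuntimeError: CUDA out of memory".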

Stack trace

traceback

Traceback (most recent call last):
  File "run_model.py", line 12, in <module>
    model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased')
  File "/usr/local/lib/python3.9/site-packages/transformers/models/auto/modeling_auto.py", line 1017, in from_pretrained
    model = model_class.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1410, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location="cpu")
  File "/usr/local/lib/python3.9/site-packages/torch/cuda/memory.py", line 123, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 8.00 GiB already allocated; 1.50 GiB free; 8.50 GiB reserved in total by PyTorch)

QUICK FIX

Pass device_map='auto' to from_pretrained (this requires the accelerate package to be installed) so that layers that do not fit on the GPU are placed on the CPU or on disk instead of exhausting GPU memory.

Why it happens

Loading large HuggingFace models requires significant GPU memory. If the available GPU memory is insufficient to hold the model weights and intermediate tensors, PyTorch raises torch.cuda.OutOfMemoryError. This often happens on GPUs with limited VRAM or when multiple processes compete for GPU resources.
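As a rough illustration of why the weights alone can exhaust VRAM, the footprint can be approximated as parameter count times bytes per parameter. The sketch below is an estimate only: the figure of roughly 340 million parameters for bert-large-uncased is an assumed approximation, the helper name weight_memory_gib is hypothetical, and activations, the CUDA context, and allocator overhead are ignored.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str = "float32") -> float:
    """Approximate memory for the model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: approximate parameter count, used for illustration only.
approx_params = 340_000_000

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: ~{weight_memory_gib(approx_params, dtype):.2f} GiB for weights")
```

Halving the bytes per parameter halves the weight footprint, which is why lower-precision loading is one of the suggested fixes further down.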

Detection

Monitor GPU memory usage with tools like nvidia-smi before and during model loading; catching torch.cuda.OutOfMemoryError exceptions allows graceful fallback or retry with smaller models.
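The same check that nvidia-smi provides can be done from inside the process before loading. The following is a minimal sketch using torch.cuda.mem_get_info; the helper names free_gpu_memory_gib and has_room_for are hypothetical, and the 2 GiB threshold in the usage lines is an arbitrary example value.

```python
import torch


def free_gpu_memory_gib(device: int = 0):
    """Return (free, total) GPU memory in GiB, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1024**3, total_bytes / 1024**3


def has_room_for(required_gib: float, device: int = 0) -> bool:
    """True only if a CUDA device exists and reports enough free memory."""
    info = free_gpu_memory_gib(device)
    return info is not None and info[0] >= required_gib


# Example: decide where to place a model before loading it.
target = "cuda" if has_room_for(2.0) else "cpu"
print(f"Placing model on: {target}")
```

Free memory reported here can change between the check and the actual load if other processes share the GPU, so this reduces but does not eliminate the need to handle the exception.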

Causes & fixes

The GPU does not have enough free VRAM to load the full model weights.

✓ Fix

Use a smaller model variant or reduce batch size; alternatively, free GPU memory by closing other applications or processes using the GPU.

Multiple processes or models are loaded simultaneously, exhausting GPU memory.

✓ Fix

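Serialize model loading or run models on separate GPUs. Before loading the next model, delete all references to the previous one and then call torch.cuda.empty_cache(); this only returns cached, unused blocks held by PyTorch in the current process and does not free memory used by live tensors or by other processes.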
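A minimal sketch of that cleanup step is shown below. The helper name free_cached_gpu_memory is hypothetical; the calls it uses (gc.collect, torch.cuda.empty_cache, torch.cuda.memory_allocated) are standard. The caller must drop its own references first, because a function cannot delete variables held elsewhere.

```python
import gc

import torch


def free_cached_gpu_memory() -> int:
    """Collect garbage, release cached CUDA blocks, and return the bytes
    still held by live tensors (0 when CUDA is unavailable)."""
    gc.collect()  # drop unreachable Python objects that still pin tensors
    if not torch.cuda.is_available():
        return 0
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    return torch.cuda.memory_allocated()


# Usage: remove every reference to the old model, then clean up.
old_model = torch.nn.Linear(1024, 1024)
if torch.cuda.is_available():
    old_model = old_model.to("cuda")
del old_model
print(f"Bytes still allocated by live tensors: {free_cached_gpu_memory()}")
```

If the returned value stays high, some object (an optimizer, a cached output, a traceback) is probably still holding a reference to GPU tensors.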

Model is loaded entirely on GPU without offloading or quantization.

✓ Fix

Offload layers with device_map='auto' (backed by HuggingFace Accelerate), load the weights in half precision with torch_dtype=torch.float16, or load 8-bit or 4-bit quantized weights (for example through the bitsandbytes integration) to reduce the memory footprint.

Code: broken vs fixed

Broken - triggers the error

python

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased')  # loads float32 weights on the CPU
model = model.to('cuda')  # raises OutOfMemoryError if the GPU lacks enough free memory

Fixed - works correctly

python

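import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # set before torch initializes CUDA

import torch
from transformers import AutoModelForSequenceClassification

# device_map='auto' requires the accelerate package
model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased', device_map='auto', torch_dtype=torch.float16)
print('Model loaded successfully')

device_map='auto' lets Accelerate fill the GPU first and offload the remaining layers to the CPU or disk, and torch_dtype=torch.float16 loads the weights in half precision, roughly halving their memory footprint. torch_dtype='auto' only reuses the dtype stored in the checkpoint, so it saves nothing when the checkpoint is float32. device_map='auto' also depends on the architecture supporting it in your transformers version; if it does not, from_pretrained raises an error saying so.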

⚠

Workaround

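Catch torch.cuda.OutOfMemoryError and fall back to the CPU. from_pretrained has no device argument; a model loaded without device_map stays on the CPU by default, so the fallback is simply to skip (or undo) the move to the GPU. The program then continues without GPU acceleration.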
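One way to structure this fallback is sketched below. The helper load_with_cpu_fallback is hypothetical, and it assumes PyTorch 1.13 or newer, where torch.cuda.OutOfMemoryError exists. It takes a zero-argument loader so it works with any model class.

```python
import torch


def load_with_cpu_fallback(load_model):
    """Load a model on the CPU, try to move it to the GPU, and fall back
    to the CPU if the GPU is missing or runs out of memory.

    load_model: zero-argument callable returning a torch.nn.Module.
    Returns (model, device_name).
    """
    model = load_model()  # from_pretrained places weights on the CPU by default
    if not torch.cuda.is_available():
        return model, "cpu"
    try:
        return model.to("cuda"), "cuda"
    except torch.cuda.OutOfMemoryError:
        model = model.to("cpu")   # move back any parameters already transferred
        torch.cuda.empty_cache()  # release the blocks freed by the move
        return model, "cpu"


# Usage with transformers (commented out to avoid a download here):
# from transformers import AutoModelForSequenceClassification
# model, device = load_with_cpu_fallback(
#     lambda: AutoModelForSequenceClassification.from_pretrained("bert-large-uncased")
# )
model, device = load_with_cpu_fallback(lambda: torch.nn.Linear(8, 2))
print(f"Model is on: {device}")
```

Inputs must be moved to the returned device as well, otherwise the forward pass fails with a device mismatch.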

✓

Prevention

Design your system to use model offloading, mixed precision, and monitor GPU memory usage; prefer smaller or quantized models for limited VRAM environments to avoid OOM errors.

Python 3.8+ · transformers >=4.0.0 · tested on 4.30.x

Verified 2026-04
