Severity: critical · Level: intermediate · Fix: 5-15 min

RuntimeError

torch.cuda.OutOfMemoryError

What this error means
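The GPU runs out of memory while loading or running a Hugging Face Transformers model. PyTorch then raises torch.cuda.OutOfMemoryError (a subclass of RuntimeError in recent PyTorch releases) with the message "CUDA out of memory".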

Stack trace

traceback
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 9.00 GiB already allocated; 1.50 GiB free; 9.50 GiB reserved in total by PyTorch)
QUICK FIX
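Reduce the batch size and run inference under torch.no_grad(). Calling torch.cuda.empty_cache() beforehand releases cached blocks that PyTorch no longer uses, but it cannot free memory still held by live tensors.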

Why it happens

Hugging Face Transformers models need GPU memory for their weights and for the intermediate tensors (activations, and during training also gradients and optimizer state) produced on each forward pass. When a single allocation request exceeds the memory still free on the device, PyTorch raises a CUDA out of memory error. Typical triggers are a large model, a large batch size or sequence length, or a GPU that is already partly occupied.
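To make the weight cost concrete, here is a rough back-of-envelope estimate. It assumes about 340 million parameters for bert-large-uncased (an approximate figure, not taken from this page) and the common rule of thumb that float32 training with Adam keeps weights, gradients, and two optimizer states per parameter. Activations come on top and grow with batch size and sequence length. The helper name estimate_gib is illustrative.

```python
GIB = 2**30  # bytes per GiB


def estimate_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory in GiB for n_params values stored at bytes_per_param each."""
    return n_params * bytes_per_param / GIB


# Assumption: approximate parameter count for bert-large-uncased.
N_PARAMS = 340_000_000

inference_fp32 = estimate_gib(N_PARAMS, 4)  # float32 weights only
inference_fp16 = estimate_gib(N_PARAMS, 2)  # float16 weights only
# Rule of thumb: weights (4) + gradients (4) + two Adam moments (8) per parameter.
training_fp32_adam = estimate_gib(N_PARAMS, 4 + 4 + 8)

print(f"fp32 weights:         {inference_fp32:.2f} GiB")
print(f"fp16 weights:         {inference_fp16:.2f} GiB")
print(f"fp32 training (Adam): {training_fp32_adam:.2f} GiB, before activations")
```

Under these assumptions the training figure is four times the float32 inference figure, which is why a model that runs inference comfortably can still fail during fine-tuning.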

Detection

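Check GPU memory with nvidia-smi before running your code to see how much is free and whether other processes hold memory. Inside the process, query PyTorch's own counters to spot usage creeping toward the limit. Catching the out-of-memory exception only tells you a failure has already happened, so use it for logging and recovery rather than for early warning.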
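The in-process check might look like the sketch below. It uses torch.cuda.mem_get_info, torch.cuda.memory_allocated, and torch.cuda.memory_reserved; the function name gpu_memory_report and the choice of GiB units are illustrative. It returns None on machines without CUDA.

```python
import torch


def gpu_memory_report(device: int = 0):
    """Return GPU memory figures in GiB, or None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    gib = 2**30
    # Device-wide view: includes memory used by other processes.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        "free": free_bytes / gib,
        "total": total_bytes / gib,
        # This process only: memory held by live tensors.
        "allocated": torch.cuda.memory_allocated(device) / gib,
        # This process only: live tensors plus blocks cached by the allocator.
        "reserved": torch.cuda.memory_reserved(device) / gib,
    }


report = gpu_memory_report()
if report is None:
    print("CUDA not available")
else:
    print(", ".join(f"{name}={value:.2f} GiB" for name, value in report.items()))
```

The four numbers correspond to the figures in the error message: total capacity, free, already allocated, and reserved by PyTorch.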

Causes & fixes

1

Batch size too large for available GPU memory

✓ Fix

Reduce the batch size in your DataLoader or model input to fit within GPU memory limits.
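One way to keep the batch size small without dropping inputs is to run inference in chunks and move each chunk's result back to the CPU. The sketch below assumes the model is already on the target device and returns an object with a .logits attribute, as Transformers sequence-classification models do. The function name predict_in_batches is illustrative.

```python
import torch


def predict_in_batches(model, tokenizer, texts, batch_size=8, device=None):
    """Run the model over texts in chunks of batch_size and return logits on CPU."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model.eval()  # disable dropout for inference
    all_logits = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        encoded = {name: tensor.to(device) for name, tensor in encoded.items()}
        # No autograd graph is built, so activations are freed as soon as possible.
        with torch.inference_mode():
            logits = model(**encoded).logits
        # Move results off the GPU so only one chunk lives there at a time.
        all_logits.append(logits.cpu())
    return torch.cat(all_logits)
```

Peak GPU memory then depends on batch_size rather than on the total number of inputs.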

2

Model size too large for GPU memory

✓ Fix

Use a smaller pretrained model, load the weights in half precision (float16 or bfloat16) to roughly halve their footprint, or fall back to CPU when speed is not critical.
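The effect of half precision on weight memory can be checked directly. The sketch below uses a single nn.Linear layer as a stand-in for a real model, and param_bytes is an illustrative helper. With Transformers 4.x, the same reduction is usually obtained at load time by passing torch_dtype=torch.float16 to from_pretrained.

```python
import torch
from torch import nn


def param_bytes(module: nn.Module) -> int:
    """Total bytes occupied by the module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


layer = nn.Linear(1024, 1024)  # stand-in for a real model; float32 by default
fp32_bytes = param_bytes(layer)

# .half() converts the parameters to float16 in place and returns the module.
fp16_bytes = param_bytes(layer.half())

print(f"float32: {fp32_bytes / 2**20:.2f} MiB, float16: {fp16_bytes / 2**20:.2f} MiB")
```

Half precision changes numerical behavior, so it is worth comparing a few outputs against the float32 model before relying on it.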

3

GPU memory fragmentation or memory not freed from previous runs

✓ Fix

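Delete references to models and tensors you no longer need, then call torch.cuda.empty_cache() so PyTorch returns its unused cached blocks to the driver. If memory is still held, restart the Python kernel, and use nvidia-smi to check whether another process is occupying the GPU.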
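The order of operations matters: empty_cache() can only release blocks that no live tensor refers to. A minimal sketch, with release_cached_gpu_memory as an illustrative helper name and a scratch tensor standing in for leftovers from a previous run:

```python
import gc

import torch


def release_cached_gpu_memory() -> int:
    """Collect garbage, empty PyTorch's CUDA cache, and return bytes still reserved."""
    gc.collect()  # drop objects kept alive only by reference cycles
    if not torch.cuda.is_available():
        return 0
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    return torch.cuda.memory_reserved()


device = "cuda" if torch.cuda.is_available() else "cpu"
scratch = torch.empty(1024, 1024, device=device)  # stand-in for a leftover tensor

# Drop the last reference first; otherwise the block stays allocated.
del scratch
reserved_after = release_cached_gpu_memory()
print(f"Still reserved by PyTorch: {reserved_after / 2**20:.1f} MiB")
```

In a notebook, variables from earlier cells (including the special output history) can keep tensors alive, which is why restarting the kernel is sometimes the only reliable reset.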

4

Not using gradient checkpointing or mixed precision for large models

✓ Fix

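For training, enable gradient checkpointing (in Transformers, model.gradient_checkpointing_enable()) and run the forward pass under torch.autocast(device_type='cuda') or the older torch.cuda.amp.autocast() so activations are computed in reduced precision. Gradient checkpointing only helps when a backward pass is run. For inference, use autocast together with torch.no_grad() instead.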
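Both techniques can be seen on a toy network. The sketch below is not a Transformers model: CheckpointedMLP is a hypothetical stand-in that wraps each block in torch.utils.checkpoint.checkpoint and runs the forward pass under torch.autocast. It falls back to CPU with bfloat16 when no GPU is present. Real float16 training would typically also use a gradient scaler, which is omitted here for brevity.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is the usual CUDA autocast dtype; bfloat16 is used for the CPU fallback.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16


class CheckpointedMLP(nn.Module):
    """Toy model whose blocks recompute activations during backward."""

    def __init__(self, width: int = 64, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )
        self.head = nn.Linear(width, 2)

    def forward(self, x):
        for block in self.blocks:
            # Activations inside the block are not stored; they are recomputed in backward.
            x = checkpoint(block, x, use_reentrant=False)
        return self.head(x)


model = CheckpointedMLP().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 64, device=device)
y = torch.randint(0, 2, (8,), device=device)

# Forward pass in reduced precision; the loss is computed in float32 for stability.
with torch.autocast(device_type=device, dtype=amp_dtype):
    logits = model(x)
loss = nn.functional.cross_entropy(logits.float(), y)

loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

Checkpointing trades extra compute for lower activation memory, so expect each training step to run somewhat slower.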

Code: broken vs fixed

Broken - triggers the error
python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = 'bert-large-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda()

inputs = tokenizer(['Hello world!'] * 64, return_tensors='pt', padding=True, truncation=True)
outputs = model(**{k: v.cuda() for k, v in inputs.items()})  # This line triggers CUDA out of memory error
Fixed - works correctly
python
import os
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Pin the process to GPU 0; set before CUDA is initialized
model_name = 'bert-large-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda()
model.eval()  # Disable dropout for inference

inputs = tokenizer(['Hello world!'] * 8, return_tensors='pt', padding=True, truncation=True)  # Reduced batch size

torch.cuda.empty_cache()  # Release cached blocks left over from earlier work; live tensors are unaffected
with torch.no_grad():  # Skip autograd bookkeeping so activations are not kept for a backward pass
    outputs = model(**{k: v.cuda() for k, v in inputs.items()})
print('Inference succeeded without CUDA OOM')
The batch size is reduced to fit GPU memory, the forward pass runs under torch.no_grad() so activations are not retained, and torch.cuda.empty_cache() releases any unused cached blocks beforehand.

Workaround

Catch torch.cuda.OutOfMemoryError (or RuntimeError with an "out of memory" message on older PyTorch versions), call torch.cuda.empty_cache(), and retry with a smaller batch size or fall back to CPU.
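A retry loop along these lines could be sketched as follows. The names is_cuda_oom and run_with_backoff are illustrative, and run_batch stands for any callable that takes a list of inputs and returns a list of results. Cleanup happens outside the except block because the exception's traceback can otherwise keep GPU tensors alive.

```python
import gc

import torch


def is_cuda_oom(exc: BaseException) -> bool:
    """True if exc looks like a CUDA out-of-memory error."""
    oom_type = getattr(torch.cuda, "OutOfMemoryError", None)  # absent on older PyTorch
    if oom_type is not None and isinstance(exc, oom_type):
        return True
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_with_backoff(run_batch, items, batch_size, min_batch_size=1):
    """Process items in chunks, halving the chunk size after each OOM error."""
    results = []
    start = 0
    while start < len(items):
        chunk = items[start:start + batch_size]
        hit_oom = False
        try:
            results.extend(run_batch(chunk))
        except RuntimeError as exc:
            # Re-raise anything that is not OOM, or OOM at the smallest allowed size.
            if not is_cuda_oom(exc) or batch_size <= min_batch_size:
                raise
            hit_oom = True
        if hit_oom:
            # Clean up after leaving the except block so the traceback is released.
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller chunk
        start += len(chunk)
    return results
```

The smaller batch size is kept for the remaining chunks, on the assumption that later chunks are about as expensive as the one that failed.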

Prevention

Check free GPU memory before launching a job, choose model and batch sizes that fit your GPU, and use mixed precision and, for training, gradient checkpointing to lower the memory footprint.

Python 3.8+ · transformers >=4.0.0 · tested on 4.30.0
Verified 2026-04