RuntimeError
torch.cuda.OutOfMemoryError
Stack trace
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 9.00 GiB already allocated; 1.50 GiB free; 9.50 GiB reserved in total by PyTorch)
Why it happens
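Hugging Face Transformers models need GPU memory for their weights and for the intermediate tensors created on each pass (activations, plus gradients and optimizer state during training). When a requested allocation does not fit in the remaining GPU memory, PyTorch raises a CUDA out of memory error. In the trace above, the 2.00 GiB request exceeds the 1.50 GiB reported as free. Typical triggers are a large model, a large batch size or sequence length, or a GPU with little memory.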
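As a rough illustration of the weight portion of that budget, the sketch below multiplies a parameter count by the bytes per parameter. The function name and the 340 million parameter figure are hypothetical, chosen only for the example. The result covers weights alone, so activations, gradients, and optimizer state come on top of it.

```python
def weight_memory_gib(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 1024**3


# Hypothetical example: a model with 340 million parameters.
params = 340_000_000
fp32 = weight_memory_gib(params)     # float32 stores 4 bytes per parameter
fp16 = weight_memory_gib(params, 2)  # float16 stores 2 bytes per parameter
print(f"float32 weights: {fp32:.2f} GiB, float16 weights: {fp16:.2f} GiB")
```

For a loaded PyTorch model, the parameter count can be obtained with sum(p.numel() for p in model.parameters()).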
Detection
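Check GPU memory usage with a tool such as nvidia-smi before running your code to see how much headroom is left. At runtime, catch the RuntimeError raised for CUDA memory (torch.cuda.OutOfMemoryError in recent PyTorch releases) to detect the failure when it occurs.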
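One way to script the nvidia-smi check is sketched below. It assumes nvidia-smi is on the PATH and uses its CSV query mode; the helper names parse_gpu_memory and query_gpu_memory are hypothetical. Parsing is kept separate from the subprocess call so it can be exercised without a GPU.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for the used and total memory (MiB) of each GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if nvidia-smi fails
    )
    return parse_gpu_memory(result.stdout)
```

Calling query_gpu_memory() on a machine with an NVIDIA driver returns one (used, total) tuple per GPU. From inside a PyTorch process, torch.cuda.mem_get_info() reports free and total bytes for the current device and may be a simpler choice.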
Causes & fixes
Batch size too large for available GPU memory
Reduce the batch size in your DataLoader or model input to fit within GPU memory limits.
Model size too large for GPU memory
Use a smaller pretrained model, load the weights in half precision (float16) to reduce the memory footprint, or run on the CPU if speed is not critical.
GPU memory fragmentation or memory not freed from previous runs
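Restart your Python kernel to release memory held by a previous run. torch.cuda.empty_cache() returns unused cached blocks to the driver, but it does not free memory still referenced by live tensors, so delete those references first.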
Not using gradient checkpointing or mixed precision for large models
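For training, enable gradient checkpointing and run the forward pass under mixed precision with torch.autocast(device_type='cuda'); the older torch.cuda.amp.autocast() is deprecated in recent PyTorch releases. For inference, also wrap the forward pass in torch.no_grad() so activations are not kept for a backward pass.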
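The inference-side fixes above can be combined in a small helper. This is a minimal sketch, not the library's prescribed method: the function name memory_lean_forward is hypothetical, and the bfloat16 fallback on CPU is an assumption made so the same code runs without a GPU.

```python
import torch
from torch import nn


def memory_lean_forward(model: nn.Module, **inputs: torch.Tensor):
    """Inference-only forward pass with autograd disabled and autocast enabled."""
    device = next(model.parameters()).device
    # float16 is the usual choice on CUDA; bfloat16 on CPU is an assumption of this sketch.
    dtype = torch.float16 if device.type == "cuda" else torch.bfloat16
    model.eval()  # disable dropout and other training-only behavior
    batch = {name: tensor.to(device) for name, tensor in inputs.items()}
    # no_grad skips storing activations for backward; autocast runs eligible ops in lower precision.
    with torch.no_grad(), torch.autocast(device_type=device.type, dtype=dtype):
        return model(**batch)
```

With a Transformers model this could be called as memory_lean_forward(model, **inputs), where inputs is the tokenizer output. For training, where gradients are needed, torch.no_grad() does not apply; Transformers models expose model.gradient_checkpointing_enable() to trade extra compute for lower activation memory.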
Code: broken vs fixed
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model_name = 'bert-large-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda()
inputs = tokenizer(['Hello world!'] * 64, return_tensors='pt', padding=True, truncation=True)
outputs = model(**{k: v.cuda() for k, v in inputs.items()})  # This line triggers the CUDA out of memory error

import os
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
model_name = 'bert-large-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda()
inputs = tokenizer(['Hello world!'] * 8, return_tensors='pt', padding=True, truncation=True) # Reduced batch size
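torch.cuda.empty_cache()  # Release unused cached blocks (does not free memory held by live tensors)
with torch.no_grad():  # Inference only: do not keep activations for a backward pass
    outputs = model(**{k: v.cuda() for k, v in inputs.items()})  # Fixed: smaller batch and no autograd buffers
print('Inference succeeded without CUDA OOM')
Workaround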
Catch the out-of-memory RuntimeError (torch.cuda.OutOfMemoryError in recent PyTorch releases), call torch.cuda.empty_cache(), and retry inference with a smaller batch size or fall back to the CPU.
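The retry logic could look like the sketch below. The names run_with_backoff, run_batch, and on_oom are hypothetical, and the block deliberately avoids importing torch so the control flow is easy to follow. It relies on the out-of-memory error being a RuntimeError whose message contains "out of memory", as in the trace above. Recovery happens outside the except block so the exception's traceback no longer holds references to the failed batch's tensors.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_backoff(
    run_batch: Callable[[Sequence[T]], R],
    items: Sequence[T],
    batch_size: int,
    on_oom: Optional[Callable[[], None]] = None,
) -> list[R]:
    """Process items in batches, halving the batch size after each OOM error."""
    results = []
    start = 0
    while start < len(items):
        oom = False
        try:
            results.append(run_batch(items[start:start + batch_size]))
            start += batch_size
        except RuntimeError as err:
            # Re-raise unrelated errors, and give up once a single item does not fit.
            if "out of memory" not in str(err).lower() or batch_size == 1:
                raise
            oom = True
        if oom:
            if on_oom is not None:
                on_oom()  # for example, torch.cuda.empty_cache
            batch_size //= 2
    return results
```

In a PyTorch pipeline, run_batch would tokenize the slice and call the model, and on_oom could be torch.cuda.empty_cache. Halving is only one possible policy; a CPU fallback could be added at the point where the batch size reaches 1.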
Prevention
Design your pipeline to monitor GPU memory usage, use mixed precision for training and inference, enable gradient checkpointing when training, and choose model and batch sizes that fit your GPU's capacity.