torch.cuda.OutOfMemoryError
Stack trace
Traceback (most recent call last):
  File "run_model.py", line 12, in <module>
    model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased')
  File "/usr/local/lib/python3.9/site-packages/transformers/models/auto/modeling_auto.py", line 1017, in from_pretrained
    model = model_class.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1410, in from_pretrained
    state_dict = torch.load(resolved_archive_file, map_location="cpu")
  File "/usr/local/lib/python3.9/site-packages/torch/cuda/memory.py", line 123, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 11.17 GiB total capacity; 8.00 GiB already allocated; 1.50 GiB free; 8.50 GiB reserved in total by PyTorch)
Why it happens
Loading a large Hugging Face model onto a GPU requires enough free VRAM to hold the model weights plus the intermediate tensors created during the forward pass. When an allocation request exceeds the free memory, PyTorch raises torch.cuda.OutOfMemoryError; the message reports the requested size alongside the allocated, reserved, and free amounts. This typically happens on GPUs with limited VRAM or when several processes compete for the same GPU.
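As a rough way to reason about whether the weights alone will fit, the sketch below multiplies a parameter count by the bytes each parameter occupies in a given dtype. The function name and the example parameter count are hypothetical and chosen only for illustration; the estimate ignores activations, the CUDA context, and allocator overhead, so real usage will be higher.

```python
# Bytes occupied by one parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_gib(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Hypothetical parameter count, used only to illustrate the arithmetic.
example_params = 340_000_000
for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {estimate_weight_gib(example_params, dtype):.2f} GiB")
```

Comparing this figure against the "free" value in the error message gives a first indication of whether a lower-precision dtype would be enough or whether offloading is needed.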
Detection
Monitor GPU memory usage with a tool such as nvidia-smi before and during model loading. Catching torch.cuda.OutOfMemoryError in code lets the program fall back gracefully or retry with a smaller model.
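The same check can be done from inside the process. The sketch below is one possible pre-flight test: it reads free memory with torch.cuda.mem_get_info and compares it with an estimated requirement. The helper names and the 20% headroom default are assumptions for illustration, not recommended values.

```python
def has_enough_free_memory(free_bytes: int, required_bytes: int,
                           headroom: float = 0.2) -> bool:
    """True if required_bytes plus a safety margin fits into free_bytes.

    The headroom fraction is an arbitrary illustrative default.
    """
    return free_bytes >= required_bytes * (1 + headroom)


def gpu_free_bytes(device: int = 0) -> int:
    """Free memory on the given GPU, as reported by the CUDA driver."""
    import torch  # imported lazily so the helper above works without PyTorch

    free, _total = torch.cuda.mem_get_info(device)
    return free


def can_load_on_gpu(required_bytes: int, device: int = 0) -> bool:
    """Combine the two helpers; returns False when no GPU is available."""
    import torch

    if not torch.cuda.is_available():
        return False
    return has_enough_free_memory(gpu_free_bytes(device), required_bytes)
```

Because other processes can allocate memory between the check and the load, this reduces the chance of an OOM error but does not replace catching the exception.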
Causes & fixes
The GPU does not have enough free VRAM to load the full model weights.
Use a smaller model variant or reduce batch size; alternatively, free GPU memory by closing other applications or processes using the GPU.
Multiple processes or models are loaded simultaneously, exhausting GPU memory.
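Serialize model loading or run the models on separate GPUs. Calling torch.cuda.empty_cache() before loading returns cached but unused blocks from PyTorch's allocator to the driver; it does not free memory held by tensors that are still referenced, so delete models that are no longer needed first.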
Model is loaded entirely on GPU without offloading or quantization.
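Pass device_map='auto' to from_pretrained (this relies on the Hugging Face Accelerate package) so that layers which do not fit on the GPU are placed in CPU memory; load the weights with torch_dtype=torch.float16 to halve their size relative to float32; or use a quantized model to reduce the memory footprint further.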
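The fixes above can be combined into a simple decision rule. The following is a minimal sketch, assuming the float32 weight size and the free GPU memory are already known; the function names, the strategy labels, and the 1.5x margin are all hypothetical choices made for illustration.

```python
def choose_strategy(free_gib: float, fp32_weights_gib: float,
                    margin: float = 1.5) -> str:
    """Pick a loading strategy from free GPU memory and fp32 weight size.

    margin is an assumed multiplier leaving room for activations.
    """
    if free_gib >= fp32_weights_gib * margin:
        return "fp32"      # full precision fits with room to spare
    if free_gib >= (fp32_weights_gib / 2) * margin:
        return "fp16"      # half precision halves the weight size
    return "offload"       # let device_map='auto' place layers off-GPU


def load_kwargs_for(strategy: str) -> dict:
    """Translate a strategy label into from_pretrained keyword arguments."""
    import torch  # imported lazily; only needed when actually loading

    if strategy == "fp32":
        return {}  # loads on CPU; the caller still moves the model to the GPU
    if strategy == "fp16":
        return {"torch_dtype": torch.float16}
    # device_map='auto' requires the accelerate package to be installed.
    return {"torch_dtype": torch.float16, "device_map": "auto"}
```

The returned dictionary would be unpacked into the call, for example from_pretrained('bert-large-uncased', **load_kwargs_for(strategy)).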
Code: broken vs fixed
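# Broken: moves full-precision weights onto a GPU without enough free memory
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased')
model.to('cuda')  # triggers OutOfMemoryError when free VRAM is insufficient

# Fixed: half precision plus automatic device placement (requires the accelerate package)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # set before CUDA is initialized
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased', device_map='auto', torch_dtype=torch.float16)
print('Model loaded successfully')
Workaround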
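Catch torch.cuda.OutOfMemoryError and fall back to loading on the CPU, either by passing device_map='cpu' to from_pretrained or by loading without any device placement and leaving the model on the CPU. The program can then continue without GPU acceleration, at the cost of slower inference.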
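One way to structure this fallback is sketched below. The generic helper takes the exception type as a parameter so it can be exercised without a GPU; load_bert_with_fallback is a hypothetical wrapper showing how it might be wired to PyTorch and Transformers, and assumes a PyTorch version that exposes torch.cuda.OutOfMemoryError.

```python
def load_with_fallback(load_on_gpu, load_on_cpu, oom_error):
    """Try the GPU loader first; on an OOM error, use the CPU loader.

    Returns a (model, device_name) tuple.
    """
    try:
        return load_on_gpu(), "cuda"
    except oom_error:
        return load_on_cpu(), "cpu"


def load_bert_with_fallback(name: str = "bert-large-uncased"):
    """Hypothetical wiring of the helper to PyTorch and Transformers."""
    import torch
    from transformers import AutoModelForSequenceClassification

    def on_gpu():
        model = AutoModelForSequenceClassification.from_pretrained(name)
        return model.to("cuda")  # the OOM error would surface here

    def on_cpu():
        torch.cuda.empty_cache()  # release cached blocks from the failed attempt
        return AutoModelForSequenceClassification.from_pretrained(name)

    return load_with_fallback(on_gpu, on_cpu, torch.cuda.OutOfMemoryError)
```

Returning the device name alongside the model lets the caller move input tensors to the matching device.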
Prevention
Build model offloading, mixed precision, and GPU memory monitoring into the system from the start. On hardware with limited VRAM, prefer smaller or quantized models to avoid OOM errors.