RuntimeError
torch.cuda.OutOfMemoryError
Stack trace
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 8.00 GiB total capacity; 6.50 GiB already allocated; 256.00 MiB free; 6.70 GiB reserved in total by PyTorch)
Why it happens
This error occurs when the GPU does not have enough free memory for a tensor that PyTorch needs to allocate, typically for model weights, activations, or gradients during training. In the trace above, PyTorch requested 512 MiB while only 256 MiB was free. Large models, large batch sizes, and memory fragmentation are the usual causes.
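To make the numbers in the trace easier to compare, the message can be parsed into a common unit. The following is a minimal sketch that assumes the message layout shown in the stack trace above; the wording differs between PyTorch versions, and `parse_oom_message` is a hypothetical helper, not part of PyTorch.

```python
import re

# The message from the stack trace above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 8.00 GiB total "
    "capacity; 6.50 GiB already allocated; 256.00 MiB free; 6.70 GiB reserved "
    "in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
SIZE = r"(\d+(?:\.\d+)?) (KiB|MiB|GiB)"


def parse_oom_message(message):
    """Return the sizes in the message as a dict of MiB values."""
    fields = {}
    requested = re.search(r"Tried to allocate " + SIZE, message)
    if requested:
        fields["requested"] = float(requested.group(1)) * UNIT_TO_MIB[requested.group(2)]
    pattern = SIZE + r" (total capacity|already allocated|free|reserved)"
    for value, unit, label in re.findall(pattern, message):
        # "total capacity" -> "capacity", "already allocated" -> "allocated"
        fields[label.split()[-1]] = float(value) * UNIT_TO_MIB[unit]
    return fields


info = parse_oom_message(MESSAGE)
shortfall = info["requested"] - info["free"]
# Memory PyTorch has reserved from the driver but is not using for live tensors.
cached_unused = info["reserved"] - info["allocated"]
print(f"requested {info['requested']:.0f} MiB, free {info['free']:.0f} MiB, "
      f"short by {shortfall:.0f} MiB, cached but unused {cached_unused:.1f} MiB")
```

A large gap between reserved and allocated memory suggests fragmentation; a small gap, as here, suggests the workload itself is too large for the device.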
Detection
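Monitor GPU memory usage with tools such as nvidia-smi during training to spot usage approaching the limit. To detect the failure itself, catch RuntimeError exceptions whose message contains 'CUDA out of memory' (recent PyTorch releases raise torch.cuda.OutOfMemoryError, a subclass of RuntimeError).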
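Monitoring with nvidia-smi can be scripted. The sketch below queries used and total memory in CSV form and flags GPUs above a threshold. The function names and the 0.9 threshold are illustrative assumptions, and `gpu_memory_usage` requires nvidia-smi on the PATH.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    usage = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def gpu_memory_usage():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def near_limit(usage, threshold=0.9):
    """Return indices of GPUs whose used/total ratio is at or above threshold."""
    return [i for i, (used, total) in enumerate(usage) if used / total >= threshold]
```

Calling `near_limit(gpu_memory_usage())` periodically from the training script gives an early warning before an allocation fails.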
Causes & fixes
Batch size is too large for the available GPU memory
Reduce the batch size to fit within the GPU memory limits.
Model architecture is too large or uses excessive memory
Use a smaller model, or reduce the memory footprint of the existing one, for example with fewer or narrower layers.
GPU memory fragmentation from previous allocations
Restart the training process, or call torch.cuda.empty_cache() before training to release cached blocks that PyTorch is no longer using. This does not free memory held by live tensors.
Not using mixed precision training to reduce memory usage
Enable mixed precision training with torch.cuda.amp (torch.amp in recent PyTorch releases) to lower memory consumption, mainly for activations.
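To judge whether a model is simply too large for the device, a rough estimate of its fixed training memory helps. The sketch below assumes float32 values and the Adam optimizer, which keeps two moment estimates per parameter. It ignores activations, which grow with batch size, and the 500M-parameter model is a hypothetical example.

```python
def adam_training_bytes(num_params, bytes_per_value=4):
    """Estimate bytes for weights, gradients, and Adam state (no activations)."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_state = 2 * num_params * bytes_per_value  # first and second moments
    return weights + gradients + adam_state


def to_gib(num_bytes):
    return num_bytes / 1024**3


# Hypothetical 500M-parameter model in float32.
estimate = adam_training_bytes(500_000_000)
print(f"{to_gib(estimate):.2f} GiB before any activations are stored")
```

If this figure is already close to the GPU's capacity, reducing the batch size alone is unlikely to be enough.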
Code: broken vs fixed
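# Broken: batch size too large for the GPU
# MyLargeModel, dataloader, and loss_fn are placeholders defined elsewhere
import torch

model = MyLargeModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
batch_size = 128  # Too large for GPU memory (assumed to be passed to the DataLoader)

for inputs, targets in dataloader:  # Assumes the loader yields (inputs, targets) pairs
    inputs, targets = inputs.to('cuda'), targets.to('cuda')
    optimizer.zero_grad()
    outputs = model(inputs)  # RuntimeError: CUDA out of memory here
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

# Fixed: smaller batches plus mixed precision
import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Takes effect only if set before CUDA is initialized
model = MyLargeModel().cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()  # Scales the loss to avoid float16 gradient underflow
batch_size = 32  # Reduced batch size to fit GPU memory
torch.cuda.empty_cache()  # Release unused cached memory before training

for inputs, targets in dataloader:
    inputs, targets = inputs.to('cuda'), targets.to('cuda')
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # Enable mixed precision for the forward pass
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

print('Training completed with reduced memory usage')
Workaround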
Catch the RuntimeError, call torch.cuda.empty_cache(), reduce the batch size, and retry the training step. This keeps the run alive but does not address the underlying memory pressure.
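The retry logic can be sketched as a small wrapper. `run_with_backoff`, `step_fn`, and `on_oom` are hypothetical names: `step_fn` is assumed to run one training step for a given batch size, and `on_oom` is an optional cleanup callback such as torch.cuda.empty_cache. Halving the batch size is one possible policy, not the only one.

```python
def run_with_backoff(step_fn, batch_size, min_batch_size=1, on_oom=None):
    """Call step_fn(batch_size), halving batch_size after each out-of-memory error.

    Returns (result, batch_size_that_worked).
    """
    while True:
        try:
            return step_fn(batch_size), batch_size
        except RuntimeError as exc:
            if "out of memory" not in str(exc):
                raise  # Unrelated error: do not swallow it
        # Reached only after an out-of-memory error. Cleanup runs outside the
        # except block so the exception no longer keeps tensors referenced.
        if on_oom is not None:
            on_oom()
        if batch_size <= min_batch_size:
            raise RuntimeError("out of memory even at the minimum batch size")
        batch_size = max(min_batch_size, batch_size // 2)
```

In a training script this might be called as `run_with_backoff(train_step, 128, on_oom=torch.cuda.empty_cache)`, where `train_step` builds a batch of the requested size and runs the forward and backward passes.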
Prevention
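Profile memory while designing models and training loops, use mixed precision training and gradient checkpointing (recomputing activations during the backward pass instead of storing them), and monitor GPU memory during training to catch growth before it causes an out-of-memory error.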
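As an illustration of gradient checkpointing, the sketch below wraps each block of a small network in torch.utils.checkpoint.checkpoint. `CheckpointedMLP` and its sizes are hypothetical; the example runs on CPU and only shows the wiring, not the memory savings, which depend on the model.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Stack of Linear+ReLU blocks whose activations are recomputed in backward."""

    def __init__(self, width=64, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Intermediate activations inside the block are not kept;
            # they are recomputed when gradients are needed.
            x = checkpoint(block, x, use_reentrant=False)
        return x


model = CheckpointedMLP()
inputs = torch.randn(8, 64)
outputs = model(inputs)
outputs.sum().backward()  # Triggers recomputation of each block's forward pass
```

The trade-off is extra compute: each checkpointed block runs its forward pass twice per training step.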