Critical severity · Intermediate · Fix: 5-15 min

RuntimeError

torch.cuda.OutOfMemoryError

What this error means
On Modal, a CUDA out-of-memory error occurs when the GPU does not have enough memory for the requested workload, which causes a runtime failure.

Stack trace

traceback
Traceback (most recent call last):
  File "/app/main.py", line 42, in run_model
    output = model(input_tensor)  # triggers CUDA OOM error
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in forward
    # model forward logic
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total capacity; 5.50 GiB already allocated; 1.00 GiB free; 6.00 GiB reserved in total)
QUICK FIX
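Reduce the batch size. If earlier work in the same process left memory cached, call torch.cuda.empty_cache() before the model run to release cached blocks that are no longer in use; it does not free memory held by live tensors.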

Why it happens

This error occurs when the GPU does not have enough free memory to allocate the tensors or model parameters the computation needs. Modal runs your code in GPU-enabled containers; if the workload exceeds the memory of the GPU attached to the container, PyTorch raises this out-of-memory error.
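
The numbers in the example error message show why the allocation fails. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not part of any PyTorch API, and the "best case" ignores fragmentation, so the real usable amount can be lower.

```python
GIB = 1024 ** 3

# Values copied from the example error message in the stack trace.
requested = 2.00 * GIB   # "Tried to allocate 2.00 GiB"
total = 8.00 * GIB       # "8.00 GiB total capacity"
allocated = 5.50 * GIB   # "5.50 GiB already allocated" (live tensors)
free = 1.00 * GIB        # "1.00 GiB free"
reserved = 6.00 * GIB    # "6.00 GiB reserved in total" (caching allocator)

# Memory PyTorch has reserved but that backs no live tensor.
cached_unused = reserved - allocated

# Optimistic upper bound on what a new allocation could draw from.
best_case_available = free + cached_unused

# Positive shortfall means the request cannot be satisfied.
shortfall = requested - best_case_available

print(f"cached but unused:   {cached_unused / GIB:.2f} GiB")
print(f"best-case available: {best_case_available / GIB:.2f} GiB")
print(f"shortfall:           {shortfall / GIB:.2f} GiB")
```

Under these assumptions the request is still half a gibibyte short even if every cached block were reusable, which is why shrinking the workload is the more dependable fix.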

Detection

Check GPU memory usage before running your model with torch.cuda.memory_allocated() or nvidia-smi. Catch RuntimeError exceptions whose message reports a CUDA out-of-memory condition so you can log them and handle them gracefully.
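
One way to recognize and log these errors is to inspect the exception message. The helpers below are a hypothetical sketch (is_cuda_oom and requested_bytes are names invented here, not PyTorch functions) and assume the message format shown in the stack trace. Newer PyTorch releases raise torch.cuda.OutOfMemoryError, which is a RuntimeError subclass, so the isinstance check should cover both cases.

```python
import re

# Matches the "Tried to allocate 2.00 GiB" part of the example message.
_ALLOC_PATTERN = re.compile(r"Tried to allocate ([\d.]+) ([KMG])iB")
_UNIT_BYTES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def is_cuda_oom(exc):
    """Return True if exc looks like a CUDA out-of-memory RuntimeError."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def requested_bytes(message):
    """Return the size of the failed allocation in bytes, or None if absent."""
    match = _ALLOC_PATTERN.search(message)
    if match is None:
        return None
    amount, unit = match.groups()
    return int(float(amount) * _UNIT_BYTES[unit])


example = (
    "CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total "
    "capacity; 5.50 GiB already allocated; 1.00 GiB free; 6.00 GiB reserved "
    "in total)"
)
error = RuntimeError(example)
if is_cuda_oom(error):
    print(f"CUDA OOM while allocating {requested_bytes(str(error))} bytes")
```

Matching on the message text is fragile across PyTorch versions, so treat the parsed size as a logging aid rather than a value to build logic on.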

Causes & fixes

1

Model or batch size too large for the available GPU memory

✓ Fix

Reduce the batch size or use a smaller model to fit within the GPU memory limits.

2

GPU memory fragmentation from previous allocations not freed

✓ Fix

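Drop references to tensors you no longer need, then call torch.cuda.empty_cache() before running your workload to return unused cached blocks to the driver. The call does not free memory held by live tensors. If fragmentation persists, tuning the allocator through the PYTORCH_CUDA_ALLOC_CONF environment variable (for example its max_split_size_mb option) may help.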

3

Multiple processes competing for the same GPU memory

✓ Fix

Ensure exclusive GPU access or reduce concurrent GPU workloads to free memory.

4

Not using mixed precision or memory optimization techniques

✓ Fix

Use mixed precision training (e.g., torch.cuda.amp) or gradient checkpointing to reduce memory footprint.
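
To get a feel for how batch size and precision affect memory, a rough estimate can be worked out by hand. The sketch below uses a hypothetical helper, linear_footprint_bytes, for a single linear layer like the one in the code example. It counts only weights, bias, input, and output in one dtype, and ignores gradients, optimizer state, the CUDA context, and allocator overhead. It is also idealized for mixed precision: autocast keeps parameters in float32, so real savings come mainly from activations.

```python
# Bytes needed to store one element of each dtype.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def linear_footprint_bytes(batch_size, in_features, out_features, dtype="float32"):
    """Rough forward-pass footprint of one linear layer, in bytes."""
    element_size = BYTES_PER_ELEMENT[dtype]
    parameters = in_features * out_features + out_features  # weight + bias
    activations = batch_size * (in_features + out_features)  # input + output
    return (parameters + activations) * element_size


MIB = 1024 ** 2
for batch_size, dtype in [(128, "float32"), (64, "float32"), (128, "float16")]:
    size = linear_footprint_bytes(batch_size, 1024, 1024, dtype)
    print(f"batch={batch_size:>3} dtype={dtype}: {size / MIB:.2f} MiB")
```

For a layer this small the parameters dominate, so halving the batch saves little. With large batches or long sequences the activation term dominates, and both a smaller batch and a narrower dtype have a much larger effect.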

Code: broken vs fixed

Broken - triggers the error
python
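import modal
import torch

# Assumes a recent Modal release that provides modal.App.
app = modal.App("cuda-oom-example")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def run_model(input_tensor):
    model = torch.nn.Linear(1024, 1024).cuda()
    # Shapes are illustrative; a real OOM needs a far larger model or batch.
    output = model(input_tensor.cuda())  # triggers CUDA OOM error
    return output.cpu()  # return a CPU tensor so the caller can load it

@app.local_entrypoint()
def main():
    input_tensor = torch.randn(128, 1024)
    run_model.remote(input_tensor)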
Fixed - works correctly
python
import modal
import torch

# Assumes a recent Modal release that provides modal.App.
# Authentication is handled by the Modal CLI/token setup, not in this script.
app = modal.App("cuda-oom-example")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def run_model(input_tensor):
    torch.cuda.empty_cache()  # Release cached, unused blocks before the model run
    model = torch.nn.Linear(1024, 1024).cuda()
    output = model(input_tensor.cuda())
    return output.cpu()  # return a CPU tensor so the caller can load it

@app.local_entrypoint()
def main():
    input_tensor = torch.randn(64, 1024)  # Reduced batch size to fit GPU memory
    run_model.remote(input_tensor)
    print("Model run completed without CUDA OOM error.")
The fixed version halves the batch size (128 to 64) so the workload fits in available GPU memory, and calls torch.cuda.empty_cache() first to release cached blocks left over from earlier work.

Workaround

Wrap the model call in try/except RuntimeError, catch CUDA OOM errors, then reduce batch size dynamically or retry after torch.cuda.empty_cache().
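
The retry logic described above can be sketched without any GPU-specific code. run_with_backoff is a hypothetical helper, not part of Modal or PyTorch; it assumes the caller supplies a run_batch function that takes a batch size and an optional on_oom cleanup hook. The cleanup runs after the except block has exited, so the exception and any tensors its traceback references have been released by then.

```python
def run_with_backoff(run_batch, batch_size, min_batch_size=1, on_oom=None):
    """Call run_batch(batch_size), halving batch_size after each CUDA OOM.

    Returns a (result, batch_size_used) tuple. Re-raises unrelated errors,
    and re-raises the OOM once batch_size cannot shrink any further.
    """
    while True:
        try:
            return run_batch(batch_size), batch_size
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower():
                raise  # not an OOM: let it propagate unchanged
            if batch_size <= min_batch_size:
                raise  # already at the floor: give up
        # Reached only after an OOM, outside the except block.
        if on_oom is not None:
            on_oom()
        batch_size = max(min_batch_size, batch_size // 2)


# Stand-in workload that "fits" only at batch sizes of 32 or less.
def fake_run_batch(batch_size):
    if batch_size > 32:
        raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")
    return f"processed {batch_size} items"


cleanup_calls = []
result, used = run_with_backoff(
    fake_run_batch, 128, on_oom=lambda: cleanup_calls.append("cleanup")
)
print(result, used, len(cleanup_calls))
```

On a real GPU you would pass on_oom=torch.cuda.empty_cache and have run_batch slice the input to the given size before calling the model.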

Prevention

Profile the memory use of your Modal GPU workloads before deployment, use mixed precision where it applies, and monitor GPU usage so a workload does not exceed the memory limit.

Python 3.9+ · modal >=0.1.0 · tested on 0.4.x
Verified 2026-04