RuntimeError
torch._C._RuntimeError
Stack trace
Traceback (most recent call last):
  File "train.py", line 45, in <module>
    output = model(input_tensor)  # triggers CUDA illegal memory access
RuntimeError: CUDA error: an illegal memory access was encountered
Why it happens
This error occurs when a CUDA kernel reads or writes an invalid or out-of-bounds GPU address, typically because of indexing errors, race conditions, or corrupted tensors. Because CUDA kernels run asynchronously, the error is often reported by a later call than the one that caused it, so the line shown in the stack trace may not be the real culprit.
Detection
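Call torch.cuda.synchronize() after suspect operations and catch RuntimeError, so that a pending error is raised close to the kernel that caused it. While debugging, setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so the stack trace points at the failing call.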
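A minimal sketch of this check, assuming a hypothetical helper named run_checked (it is not part of PyTorch). It runs a callable, synchronizes when a GPU is present, and hands back any RuntimeError instead of letting it surface at an unrelated later call. On a machine without CUDA it simply runs on the CPU.

```python
import torch


def run_checked(fn, *args, **kwargs):
    """Run fn, then synchronize so a pending CUDA error is raised here.

    Returns (result, None) on success or (None, error) on RuntimeError.
    run_checked is a hypothetical helper name used for illustration.
    """
    try:
        result = fn(*args, **kwargs)
        if torch.cuda.is_available():
            # Blocks until all queued kernels finish; asynchronous errors surface here.
            torch.cuda.synchronize()
        return result, None
    except RuntimeError as err:
        return None, err


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(4, device=device)
result, err = run_checked(torch.add, x, x)
if err is not None:
    print(f"CUDA operation failed: {err}")
```

Synchronizing after every call slows training noticeably, so a wrapper like this is probably best kept to debugging runs.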
Causes & fixes
Out-of-bounds indexing in custom CUDA kernels or PyTorch operations
Check all tensor indexing and slicing to ensure indices are within valid ranges; add bounds checks in custom CUDA code.
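One way to apply this check from Python is to validate an index tensor on the host before using it. The sketch below assumes a hypothetical helper, check_index_bounds, and accepts negative indices in the same way PyTorch indexing does. Reading the minimum and maximum forces a device-to-host copy, so it is meant as a debugging aid rather than something to leave in a hot loop.

```python
import torch


def check_index_bounds(index: torch.Tensor, size: int, name: str = "index") -> None:
    """Raise IndexError if any entry of `index` falls outside [-size, size).

    check_index_bounds is a hypothetical helper, not a PyTorch API.
    """
    if index.numel() == 0:
        return
    # int(...) copies each scalar to the host, which synchronizes with the GPU.
    lo, hi = int(index.min()), int(index.max())
    if lo < -size or hi >= size:
        raise IndexError(
            f"{name} has values in [{lo}, {hi}]; valid range is [{-size}, {size - 1}]"
        )


device = "cuda" if torch.cuda.is_available() else "cpu"
table = torch.randn(10, 4, device=device)
rows = torch.tensor([0, 3, 9], device=device)

check_index_bounds(rows, table.size(0), "rows")  # passes: every row is in [0, 9]
selected = table[rows]
```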
Use of uninitialized or corrupted GPU tensors
Initialize all tensors properly before use and verify tensor shapes and device placement before CUDA operations.
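A hedged sketch of such a pre-flight check follows. The helper validate_input and its expected_last_dim parameter are assumptions made for illustration; the checks that matter will depend on the model.

```python
import torch


def validate_input(model: torch.nn.Module, x: torch.Tensor, expected_last_dim: int) -> None:
    """Check shape, device placement, and finiteness before a forward pass.

    validate_input is a hypothetical helper, not a PyTorch API.
    """
    if x.shape[-1] != expected_last_dim:
        raise ValueError(f"expected last dimension {expected_last_dim}, got {x.shape[-1]}")
    param = next(model.parameters(), None)
    if param is not None and x.device != param.device:
        raise ValueError(f"input is on {x.device} but the model is on {param.device}")
    if x.is_floating_point() and not torch.isfinite(x).all():
        raise ValueError("input contains NaN or Inf values")


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)
# torch.zeros gives defined contents; torch.empty would leave them uninitialized.
batch = torch.zeros(4, 8, device=device)

validate_input(model, batch, expected_last_dim=8)
output = model(batch)
```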
Race conditions or improper synchronization between CUDA streams
Use torch.cuda.synchronize() to enforce proper synchronization and avoid concurrent writes to the same memory.
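When work runs on more than one stream, a narrower alternative to a full torch.cuda.synchronize() is to make streams wait on each other. The sketch below assumes a hypothetical function, double_on_side_stream, that computes on a side stream and orders it against the current stream. It falls back to plain computation on the CPU.

```python
import torch


def double_on_side_stream(x: torch.Tensor) -> torch.Tensor:
    """Compute x * 2 on a side stream with explicit ordering (illustrative only)."""
    if not x.is_cuda:
        return x * 2  # CPU fallback: no streams involved

    current = torch.cuda.current_stream(x.device)
    side = torch.cuda.Stream(device=x.device)

    side.wait_stream(current)        # x must be fully written before the side stream reads it
    with torch.cuda.stream(side):
        y = x * 2                    # kernel is queued on the side stream
    current.wait_stream(side)        # later work on the current stream waits for y
    y.record_stream(current)         # tell the caching allocator y is also used on this stream
    return y


device = "cuda" if torch.cuda.is_available() else "cpu"
result = double_on_side_stream(torch.ones(3, device=device))
```

Without the two wait_stream calls, the side stream could read x before it is written, or the caller could read y before it is computed.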
Previous CUDA errors not cleared, causing cascading failures
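Treat an illegal memory access as fatal for the process: the error is sticky, so later CUDA calls in the same process keep failing. torch.cuda.empty_cache() and torch.cuda.synchronize() do not repair the CUDA context; fix the root cause and restart the process before retrying.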
Code: broken vs fixed
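# Broken: index 15 is out of range for a 10-element tensor
import torch
tensor = torch.randn(10, device='cuda')
index = 15
# PyTorch checks a plain integer index on the host and raises IndexError here;
# the same unchecked index inside a custom CUDA kernel is an illegal memory access.
value = tensor[index]
# Fixed: keep the index within bounds
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Use first GPU; set before CUDA is initialized
import torch
tensor = torch.randn(10, device='cuda')
index = 9  # Fixed index within bounds
value = tensor[index]  # No error
print(value)
Workaround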
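Wrap CUDA operations in try/except RuntimeError and call torch.cuda.synchronize() so the error surfaces at a known point, then log it and exit. torch.cuda.empty_cache() only releases cached memory and does not restore a context that has hit an illegal memory access, so retry in a new process rather than in the same one.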
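Since a CUDA context that has hit an illegal memory access generally cannot be reused, one possible pattern is to run the job as a child process and retry by starting a new one. The sketch below assumes a hypothetical helper, run_in_fresh_process. The command it runs is a stand-in for a real entry point such as train.py.

```python
import subprocess
import sys


def run_in_fresh_process(cmd, max_attempts=2):
    """Run cmd up to max_attempts times, each in a new process.

    Returns the exit code of the last attempt. Hypothetical helper for illustration.
    """
    returncode = 1
    for attempt in range(1, max_attempts + 1):
        # Each attempt gets its own process and therefore its own CUDA context.
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            break
        print(f"attempt {attempt} exited with code {returncode}", file=sys.stderr)
    return returncode


# Stand-in for a real job, e.g. [sys.executable, "train.py"]
exit_code = run_in_fresh_process([sys.executable, "-c", "print('training step ok')"])
```

A retry only helps when the failure is transient. A deterministic indexing bug will fail again on every attempt and has to be fixed at the source.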
Prevention
Validate tensor shapes and indices before they reach the GPU, synchronize CUDA streams correctly, and test custom CUDA kernels extensively (for example, under NVIDIA's compute-sanitizer) to avoid illegal memory access errors.