RuntimeError
torch._C._RuntimeError: CUDA error: device-side assert triggered
Stack trace
Traceback (most recent call last):
  File "train.py", line 45, in <module>
    outputs = model(input_ids)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1234, in forward
    outputs = self.base_model(input_ids, attention_mask=attention_mask)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stack trace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Why it happens
This error occurs when an assertion inside a CUDA kernel fails on the GPU, most often because an index is out of bounds: a token ID beyond the embedding table, or a label outside the range the loss function expects. It usually traces back to mismatched label values or corrupted input tensors. Because kernels run asynchronously, the Python traceback may point at a later call rather than the one that failed.
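One way to see the underlying problem is to run the same kind of lookup on the CPU, where PyTorch raises an ordinary Python exception instead of a device-side assert. The snippet below is a minimal sketch with a toy embedding table; the sizes and variable names are illustrative assumptions, not taken from the failing script.

```python
import torch

# Toy embedding table with 10 rows, so valid indices are 0..9 (assumed sizes).
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_ids = torch.tensor([[1, 10]])  # 10 is one past the last valid row

try:
    # Nothing was moved to CUDA, so this lookup runs on the CPU.
    embedding(bad_ids)
    cpu_error = None
except IndexError as exc:
    # On the CPU the out-of-range index surfaces as a readable IndexError.
    cpu_error = str(exc)

print(cpu_error)
```

The same out-of-range index on a CUDA tensor is what typically surfaces as "device-side assert triggered".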
Detection
Set the environment variable CUDA_LAUNCH_BLOCKING=1 before the process initializes CUDA. Kernel launches then run synchronously, so the error is raised at the call that actually triggered the device-side assert and the traceback points to the right line.
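A minimal sketch of setting the variable from Python rather than from the shell. The helper name enable_cuda_launch_blocking is hypothetical, and the check for an already-imported torch is a conservative heuristic: the variable has to be in place before CUDA is initialized, and setting it before importing torch is the simplest way to be sure of that.

```python
import os
import sys


def enable_cuda_launch_blocking() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns False when torch is already imported, as a hint that the
    setting may have come too late to take effect (conservative check).
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_imported


# Call this at the very top of the script, before `import torch`.
set_in_time = enable_cuda_launch_blocking()
if not set_in_time:
    print("Warning: torch was imported first; restart with the variable set earlier.")
```

Synchronous launches slow training down, so this is best kept to debugging runs.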
Causes & fixes
Label indices for classification are out of the valid range (e.g., label >= num_classes)
Ensure all label tensors contain only valid class indices within [0, num_classes - 1] before passing to the model.
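Input token IDs fall outside the model's embedding table (for example, the tokenizer does not match the model, or tokens were added without resizing the embeddings)
Verify that input_ids come from the tokenizer that belongs to the model and that every ID is below the model's vocabulary size. If tokens were added to the tokenizer, resize the embeddings with model.resize_token_embeddings(len(tokenizer)).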
Mismatch between model output dimension and target labels during loss calculation
Confirm that the model's output dimension matches the number of classes and that loss functions receive correctly shaped inputs.
Using GPU tensors with inconsistent device placement or corrupted tensors
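Check that index tensors are fully initialized and hold sensible values, and that all tensors involved in the computation are on the same CUDA device. A plain device mismatch usually raises its own "Expected all tensors to be on the same device" error rather than a device-side assert, but uninitialized or corrupted index tensors can contain arbitrary out-of-range values.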
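The range checks described in the causes above can be sketched as a small helper that runs before the model call. This is an illustrative assumption, not part of any library: check_index_range is a hypothetical name, and it takes plain Python integers so that it can be fed from tensor.flatten().tolist(). The ignore_index parameter covers the -100 value that torch.nn.CrossEntropyLoss skips by default.

```python
from typing import Iterable, Optional


def check_index_range(values: Iterable[int], upper: int, name: str,
                      ignore_index: Optional[int] = None) -> None:
    """Raise ValueError if any value falls outside [0, upper - 1].

    Values equal to `ignore_index` are skipped when it is given.
    """
    bad = sorted({v for v in values
                  if v != ignore_index and not 0 <= v < upper})
    if bad:
        raise ValueError(
            f"{name}: {len(bad)} distinct out-of-range value(s) {bad[:10]}; "
            f"expected 0 <= value < {upper}"
        )


# Hypothetical usage with the tensors from a training script:
# check_index_range(labels.flatten().tolist(), model.config.num_labels,
#                   "labels", ignore_index=-100)
# check_index_range(inputs["input_ids"].flatten().tolist(),
#                   model.config.vocab_size, "input_ids")
```

Converting tensors to lists copies them to the host, so a check like this suits debugging and data validation more than the inner training loop.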
Code: broken vs fixed
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).cuda()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(['Hello world'], return_tensors='pt')
inputs = {k: v.cuda() for k, v in inputs.items()}
labels = torch.tensor([2]).cuda() # Invalid label index causing device-side assert
outputs = model(**inputs, labels=labels) # RuntimeError: CUDA error: device-side assert triggered

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1' # Enable synchronous CUDA error reporting
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).cuda()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer(['Hello world'], return_tensors='pt')
inputs = {k: v.cuda() for k, v in inputs.items()}
labels = torch.tensor([1]).cuda() # Fixed label index within valid range
outputs = model(**inputs, labels=labels)
print('Model output computed successfully')
Workaround
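Catching the RuntimeError does not rescue the run: after a device-side assert the CUDA context is typically unusable, so the process has to be restarted. To identify the exact failing operation, rerun with CUDA_LAUNCH_BLOCKING=1, or run the same batch on the CPU, where an out-of-range index raises an ordinary Python exception. Alternatively, validate all label and input indices before each model call to avoid the assert altogether.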
Prevention
Always validate input tensors and label indices against model configuration before training or inference, and use CUDA_LAUNCH_BLOCKING=1 during development to catch device-side errors early.
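As one way to apply this before training starts, the sketch below scans a whole label column once and reports which invalid values occur and how often. The function name summarize_invalid_labels and the toy data are assumptions for illustration; -100 is treated as an ignore value because that is the default ignore_index of torch.nn.CrossEntropyLoss.

```python
from collections import Counter
from typing import Iterable


def summarize_invalid_labels(labels: Iterable[int], num_classes: int,
                             ignore_index: int = -100) -> Counter:
    """Count label values outside [0, num_classes - 1], skipping ignore_index."""
    return Counter(
        label for label in labels
        if label != ignore_index and not 0 <= label < num_classes
    )


# Toy data standing in for a real dataset's label column.
toy_labels = [0, 1, 1, 2, 2, -1, -100]
invalid = summarize_invalid_labels(toy_labels, num_classes=2)

if invalid:
    print(f"Invalid labels found (value: count): {dict(invalid)}")
```

An empty result means every label is either a valid class index or the ignore value, so the loss computation should not trip the assert on account of the labels.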