Critical severity · Intermediate · Fix: 5-15 min

RuntimeError

RuntimeError: CUDA error: device-side assert triggered

What this error means

This error means an assertion inside a CUDA kernel failed while the model was training or running inference. The usual cause is an out-of-bounds index on the GPU, such as a label or token ID outside the range the kernel expects.

Stack trace

traceback

Traceback (most recent call last):
  File "train.py", line 45, in <module>
    outputs = model(input_ids)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1234, in forward
    outputs = self.base_model(input_ids, attention_mask=attention_mask)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stack trace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

QUICK FIX

Set the environment variable CUDA_LAUNCH_BLOCKING=1 before the process initializes CUDA to get an accurate stack trace, then verify that label indices are within the valid range.

Why it happens

This error occurs when a CUDA kernel hits a failed assertion on the GPU, such as an out-of-bounds index in an embedding layer or an invalid label index during loss computation. It usually results from label values or input tensors that fall outside the range the model expects. Because CUDA kernels run asynchronously, the Python stack trace often points at a later, unrelated call.

Detection

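Make CUDA kernel launches synchronous by setting CUDA_LAUNCH_BLOCKING=1 so the stack trace points at the operation that triggered the device-side assert. Set it before CUDA is initialized, for example on the command line when launching the script or at the very top of the script before importing torch.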

Causes & fixes

Label indices for classification are out of the valid range (e.g., label >= num_classes)

✓ Fix

Ensure all label tensors contain only valid class indices within [0, num_classes - 1] before passing to the model.
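
One way to apply this check is a small helper that reports any offending values before the batch reaches the GPU. This is a minimal sketch: find_invalid_labels is a hypothetical name, and skipping -100 assumes the labels feed torch.nn.CrossEntropyLoss with its default ignore_index.

```python
def find_invalid_labels(labels, num_classes, ignore_index=-100):
    """Return the sorted distinct label values outside [0, num_classes - 1].

    Hypothetical helper. Accepts a torch tensor or a flat iterable of ints.
    """
    if hasattr(labels, "detach"):
        # Tensor input: copy to the CPU and flatten so plain ints can be inspected.
        values = labels.detach().cpu().flatten().tolist()
    else:
        values = list(labels)
    return sorted({int(v) for v in values
                   if v != ignore_index and not 0 <= v < num_classes})


# Example: a two-class problem with one bad label and one ignored position.
bad = find_invalid_labels([0, 1, 2, -100], num_classes=2)
if bad:
    print(f"Labels outside [0, 1]: {bad}")
```

Returning the offending values instead of a boolean makes it easier to trace them back to a mislabeled row or a label map that starts at 1 instead of 0.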

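Input token IDs fall outside the model's embedding table (for example, tokens were added to the tokenizer but the model's embeddings were not resized)

✓ Fix

Verify that every value in input_ids is a valid index into the model's embedding table. If you added tokens to the tokenizer, call model.resize_token_embeddings(len(tokenizer)) before training.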
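
The range check on token IDs could look like the sketch below. check_token_ids is a hypothetical helper, and the embedding size is passed in explicitly; with a Hugging Face model it could plausibly be read from model.get_input_embeddings().num_embeddings. The value 30522 is the vocabulary size of bert-base-uncased and is used here only as an example.

```python
def check_token_ids(input_ids, num_embeddings):
    """Raise ValueError if any token ID cannot index a table with num_embeddings rows.

    Hypothetical helper. Accepts a torch tensor or a batch given as a list of lists.
    """
    if hasattr(input_ids, "detach"):
        # Tensor input: copy to the CPU and flatten to plain ints.
        ids = input_ids.detach().cpu().flatten().tolist()
    else:
        ids = [int(i) for row in input_ids for i in row]
    bad = sorted({i for i in ids if not 0 <= i < num_embeddings})
    if bad:
        raise ValueError(
            f"token IDs {bad} are outside [0, {num_embeddings - 1}]"
        )


VOCAB_SIZE = 30522  # assumed embedding size; read it from your own model

check_token_ids([[101, 7592, 2088, 102]], VOCAB_SIZE)  # in range, passes silently
try:
    check_token_ids([[101, 30522, 102]], VOCAB_SIZE)  # 30522 is one past the last row
except ValueError as err:
    print(err)
```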

Mismatch between model output dimension and target labels during loss calculation

✓ Fix

Confirm that the model's output dimension matches the number of classes and that loss functions receive correctly shaped inputs.
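
A shape check along these lines can run before the loss is computed. It is a sketch under stated assumptions: single-label classification with logits of shape (batch, num_classes) and integer labels of shape (batch,). check_classification_shapes is a hypothetical name.

```python
def check_classification_shapes(logits_shape, labels_shape, num_classes):
    """Return a list of problems; an empty list means the shapes are consistent.

    Hypothetical helper. Assumes logits of shape (batch, num_classes) and
    integer class-index labels of shape (batch,).
    """
    logits_shape, labels_shape = tuple(logits_shape), tuple(labels_shape)
    if len(logits_shape) != 2:
        return [f"expected 2-D logits, got shape {logits_shape}"]
    problems = []
    batch, width = logits_shape
    if width != num_classes:
        problems.append(
            f"logits have {width} columns but num_classes is {num_classes}"
        )
    if labels_shape != (batch,):
        problems.append(f"labels have shape {labels_shape}, expected {(batch,)}")
    return problems


# With tensors, pass logits.shape and labels.shape.
for problem in check_classification_shapes((4, 3), (4, 1), num_classes=2):
    print(problem)
```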

Using GPU tensors with inconsistent device placement or corrupted tensors

✓ Fix

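Check that all tensors involved in the computation are on the same CUDA device and are fully initialized (for example, not created with torch.empty and then used as indices). A plain device mismatch usually raises its own "Expected all tensors to be on the same device" error, so suspect garbage index values first.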

Code: broken vs fixed

Broken - triggers the error

python

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).cuda()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

inputs = tokenizer(['Hello world'], return_tensors='pt')
inputs = {k: v.cuda() for k, v in inputs.items()}
labels = torch.tensor([2]).cuda()  # Invalid label index causing device-side assert

outputs = model(**inputs, labels=labels)  # RuntimeError: CUDA error: device-side assert triggered

Fixed - works correctly

python

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # Set before CUDA is initialized; makes errors surface at the failing call

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).cuda()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

inputs = tokenizer(['Hello world'], return_tensors='pt')
inputs = {k: v.cuda() for k, v in inputs.items()}
labels = torch.tensor([1]).cuda()  # Fixed label index within valid range

outputs = model(**inputs, labels=labels)
print('Model output computed successfully')

The fixed version sets CUDA_LAUNCH_BLOCKING=1 before importing torch so errors are reported at the failing call, and changes the label to a valid class index below the model's num_labels.

⚠ Workaround

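A device-side assert leaves the CUDA context unusable, so catching the RuntimeError does not let the process continue on the GPU. Restart the process with CUDA_LAUNCH_BLOCKING=1 to identify the exact failing operation. Alternatively, run the failing batch on the CPU, which typically raises a readable IndexError, or validate all label and input indices before model calls to avoid the assert.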
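
Running the same operations on the CPU is one way to get a readable message instead of the opaque assert. The sketch below needs torch installed but no GPU. It uses a tiny stand-in embedding and random logits in place of a real model, and run_on_cpu is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F


def run_on_cpu(fn):
    """Call fn() and return the exception it raises, or None if it succeeds."""
    try:
        fn()
    except (IndexError, RuntimeError) as exc:
        return exc
    return None


# Stand-in for a model's embedding layer: 10 rows, so valid IDs are 0..9.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_ids = torch.tensor([[1, 10]])  # 10 is one past the last row
embed_error = run_on_cpu(lambda: embedding(bad_ids))

# Stand-in for a two-class classifier's output with an out-of-range label.
logits = torch.randn(1, 2)
bad_label = torch.tensor([2])
loss_error = run_on_cpu(lambda: F.cross_entropy(logits, bad_label))

print(type(embed_error).__name__, embed_error)
print(type(loss_error).__name__, loss_error)
```

On the CPU the exception is raised synchronously at the offending call, so the stack trace points at the right line without any environment variable.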

✓ Prevention

Always validate input tensors and label indices against model configuration before training or inference, and use CUDA_LAUNCH_BLOCKING=1 during development to catch device-side errors early.

Python 3.9+ · transformers >=4.0.0 · tested on 4.30.0

Verified 2026-04
