RuntimeError
torch._C._RuntimeError
Stack trace
Traceback (most recent call last):
  File "train.py", line 45, in <module>
    loss.backward()  # RuntimeError: grad can be NaN or Inf
RuntimeError: Function 'SomeFunctionBackward' returned nan values in its 0th output.
Why it happens
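During backpropagation, gradients become NaN when the forward pass produces invalid values (Inf or NaN) or when an operation is numerically unstable, for example a division by zero or an exploding gradient. PyTorch raises this particular error when autograd anomaly detection is enabled (torch.autograd.set_detect_anomaly(True) or the torch.autograd.detect_anomaly() context manager) and a backward function returns NaN. Without anomaly detection the NaN is not reported and instead propagates silently into the parameters on the next optimizer step.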
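As a minimal illustration (a toy case, not taken from the training script in the trace), the sketch below shows a forward pass that is perfectly finite while the backward pass evaluates 0 / 0. It also assumes anomaly detection is what turns the silent NaN into the RuntimeError shown above; the variable names are illustrative only.

```python
import torch

# Toy input: sqrt has an unbounded derivative at 0.
x = torch.tensor([0.0], requires_grad=True)
y = (torch.sqrt(x) * 0.0).sum()   # forward value is a finite 0.0
y.backward()                      # sqrt's backward divides 0 by 2 * sqrt(0)

forward_is_finite = bool(torch.isfinite(y))
grad_is_nan = bool(torch.isnan(x.grad).all())
print(forward_is_finite, grad_is_nan)

# With anomaly detection enabled, the same backward pass raises instead
# of silently writing NaN into x.grad.
x2 = torch.tensor([0.0], requires_grad=True)
raised = False
with torch.autograd.detect_anomaly():
    y2 = (torch.sqrt(x2) * 0.0).sum()
    try:
        y2.backward()
    except RuntimeError as err:
        raised = "nan" in str(err).lower()
print(raised)
```

Anomaly detection adds overhead to every step, so it is usually enabled only while hunting for the failing operation.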
Detection
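Log loss and gradient values during training, using hooks or explicit checks. Check the loss with torch.isfinite() before calling backward(), and check each parameter's .grad with torch.isnan() or torch.isinf() afterwards, so that the first bad step is caught before it corrupts the weights.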
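One way to implement the gradient check is sketched below. The helper name find_nonfinite_grads and the nn.Linear stand-in model are hypothetical; torch.isfinite is used because it covers NaN and Inf in a single call.

```python
from typing import List

import torch
from torch import nn


def find_nonfinite_grads(model: nn.Module) -> List[str]:
    """Return the names of parameters whose .grad contains NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


# Toy model standing in for the real one.
model = nn.Linear(3, 1)
loss = model(torch.ones(2, 3)).sum()
loss.backward()
clean = find_nonfinite_grads(model)      # no bad gradients expected here

# Simulate a corrupted gradient to show what the check reports.
model.weight.grad[0, 0] = float("nan")
corrupted = find_nonfinite_grads(model)
print(clean, corrupted)
```

Calling the helper right after loss.backward() and before optimizer.step() lets the training loop skip or log the offending batch.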
Causes & fixes
Exploding gradients grow until they overflow to Inf, and later operations such as Inf - Inf or 0 * Inf turn them into NaN during the backward pass.
Apply gradient clipping using torch.nn.utils.clip_grad_norm_ or clip_grad_value_ to keep gradients within a stable range.
Invalid input data or labels (e.g., NaNs or Infs) propagate through the model causing NaN loss and gradients.
Validate and clean input tensors before training; use torch.isnan() and torch.isinf() to filter or replace invalid values.
Numerical instability in model operations such as division by zero, log of zero, or sqrt of negative values.
Add a small epsilon to denominators and to the inputs of log and sqrt, or clamp those inputs to a safe range; use stable built-ins such as torch.nn.functional.softplus rather than a hand-written log(1 + exp(x)).
Learning rate too high causing parameter updates to diverge and produce NaNs.
Reduce the learning rate and use learning rate schedulers to stabilize training.
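The input-validation and epsilon fixes above can be sketched as follows. The helper names has_invalid_values and safe_log, the epsilon of 1e-8, and the replacement values passed to torch.nan_to_num are illustrative assumptions, not recommended defaults.

```python
import torch


def has_invalid_values(t: torch.Tensor) -> bool:
    """True if the tensor contains any NaN or Inf."""
    return not bool(torch.isfinite(t).all())


def safe_log(t: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """log with the input clamped away from zero (eps is an illustrative choice)."""
    return torch.log(t.clamp_min(eps))


# Input validation: detect bad values, then replace them explicitly.
batch = torch.tensor([1.0, float("nan"), float("inf")])
cleaned = torch.nan_to_num(batch, nan=0.0, posinf=1e4, neginf=-1e4)
print(has_invalid_values(batch), has_invalid_values(cleaned))

# Stable log: an input of exactly 0 no longer yields an Inf gradient.
p = torch.tensor([0.0, 0.5], requires_grad=True)
safe_log(p).sum().backward()
print(p.grad)
```

Replacing invalid inputs hides the underlying data problem, so dropping or logging the affected samples may be preferable in practice.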
Code: broken vs fixed
# Broken: high learning rate and unbounded gradients.
# MyModel, loss_fn and dataloader are assumed to be defined elsewhere.
import torch

model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()  # gradients may become NaN or Inf here
    optimizer.step()

# Fixed: reduced learning rate plus gradient clipping.
import torch

model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # reduced learning rate
for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    # Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    print("Training step completed without NaN gradients.")
Workaround
Wrap loss.backward() in try/except RuntimeError (the error is only raised while anomaly detection is enabled), skip optimizer.step() for the failing batch, and log its inputs for offline debugging.
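A minimal sketch of this workaround follows. The nn.Linear model, the SGD optimizer, the squared-output loss and the hand-built batches are hypothetical stand-ins; the second batch deliberately contains a NaN so that one step is skipped.

```python
import torch
from torch import nn

model = nn.Linear(2, 1)                          # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [
    torch.ones(4, 2),
    torch.tensor([[float("nan"), 1.0]]),         # deliberately corrupted batch
    torch.ones(4, 2),
]

skipped = []                                     # step indices kept for offline debugging
for step, data in enumerate(batches):
    optimizer.zero_grad()
    try:
        # Anomaly detection makes backward() raise when a gradient is NaN.
        with torch.autograd.detect_anomaly():
            loss = model(data).pow(2).mean()
            loss.backward()
    except RuntimeError as err:
        skipped.append(step)
        print(f"step {step}: skipped batch after backward error")
        continue                                 # do not call optimizer.step()
    optimizer.step()

print(skipped)
```

Because the bad batch never reaches optimizer.step(), the parameters stay finite; the next zero_grad() call clears any partial gradients left behind by the failed backward pass.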
Prevention
Use gradient clipping, validate inputs for NaNs/Infs, apply stable numerical operations, and tune learning rate to maintain stable gradients throughout training.