Exploding gradients: gradient clipping
Why this matters
Gradient explosion causes NaN loss and training collapse, especially in RNNs and deep networks. Clipping is the fastest fix that stabilizes training without redesigning your architecture.
Explanation
Gradient explosion occurs when backpropagation multiplies gradients through many layers (or many time steps in an RNN) and the repeated products grow exponentially. The resulting weight updates are massive: they overshoot optima and can produce NaN or Inf values. Gradient clipping caps the magnitude of the gradients before the optimizer step: if the norm exceeds a threshold, all gradients are scaled down by the same factor. Mechanically, PyTorch's `clip_grad_norm_` computes a single L2 norm over all parameter gradients taken together, then divides each gradient by `max(1, norm / max_norm)` (up to a small epsilon added for numerical safety). This preserves the gradient's direction while constraining its magnitude. Use it when training deep networks, RNNs, or Transformers, where gradient norms can spike, but combine it with other stability measures such as proper initialization and learning rate scheduling for production robustness.
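To make the scaling rule concrete, here is a minimal pure-Python sketch of the arithmetic described above. The helper name `clip_by_global_norm` and the list-of-lists gradient layout are assumptions made for illustration; PyTorch's own implementation operates on tensors and differs in details such as the small epsilon.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Return (pre-clip norm, scaled copy of grads).

    `grads` is a list of flat lists of floats, one list per parameter
    (a hypothetical layout chosen to keep the example dependency-free).
    """
    # One L2 norm over every element of every gradient.
    total_norm = math.sqrt(sum(x * x for g in grads for x in g))
    # The divisor is 1 when the norm is already within bounds,
    # so small gradients pass through unchanged.
    divisor = max(1.0, total_norm / max_norm)
    clipped = [[x / divisor for x in g] for g in grads]
    return total_norm, clipped

grads = [[3.0, 4.0], [12.0]]  # joint norm: sqrt(9 + 16 + 144) = 13
norm, clipped = clip_by_global_norm(grads, max_norm=1.0)
new_norm = math.sqrt(sum(x * x for g in clipped for x in g))
print(f"norm before: {norm:.4f}, norm after: {new_norm:.4f}")
```

Every element is divided by the same number, so the ratios between elements (the direction of the gradient) are unchanged; only the overall length shrinks.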
Analogy
Like a car's governor that prevents the engine from exceeding a safe RPM: the direction of travel stays the same, but maximum speed is enforced.
Code
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

X = torch.randn(32, 10)
y = torch.randn(32, 1)

print("Training WITHOUT gradient clipping:")
for epoch in range(3):
    optimizer.zero_grad()
    output = model(X)
    loss = loss_fn(output, y)
    loss.backward()
    # max_norm=inf never clips; the call is used only to measure the norm.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), float('inf'))
    print(f" Epoch {epoch + 1}: loss = {loss.item():.6f}, gradient norm = {total_norm:.4f}")
    optimizer.step()

# Re-seed so the second model starts from the same weights as the first.
torch.manual_seed(42)
model2 = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)
optimizer2 = optim.SGD(model2.parameters(), lr=0.1)

print("\nTraining WITH gradient clipping (max_norm=1.0):")
for epoch in range(3):
    optimizer2.zero_grad()
    output = model2(X)
    loss = loss_fn(output, y)
    loss.backward()
    # Clip after backward() and before step(); the return value is the pre-clip norm.
    total_norm = torch.nn.utils.clip_grad_norm_(model2.parameters(), max_norm=1.0)
    print(f" Epoch {epoch + 1}: loss = {loss.item():.6f}, gradient norm = {total_norm:.4f}")
    optimizer2.step()

Output:
Training WITHOUT gradient clipping:
 Epoch 1: loss = 0.831465, gradient norm = 0.8432
 Epoch 2: loss = 0.720932, gradient norm = 0.7241
 Epoch 3: loss = 0.591047, gradient norm = 0.6105

Training WITH gradient clipping (max_norm=1.0):
 Epoch 1: loss = 0.831465, gradient norm = 0.8432
 Epoch 2: loss = 0.720932, gradient norm = 0.7241
 Epoch 3: loss = 0.591047, gradient norm = 0.6105
What just happened?
The code trains the same architecture twice: once without clipping (max_norm=inf, so no actual clipping occurs) and once with clipping at max_norm=1.0. In both runs the gradient norms stay below 1.0, so clipping never activates. Note that `clip_grad_norm_` returns the total norm measured before clipping in both cases, so the printed value is never the clipped one. Starting from the same weights, the two runs produce identical losses because no clipping was needed for this stable toy problem. With genuinely exploding gradients, both runs would report large norms, but only the second would scale the gradients down to norm 1.0 before the optimizer step.
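The toy problem above never triggers clipping, so here is a small sketch of a case where it does. The oversized targets (`1000.0 * torch.randn(...)`) are an assumption chosen only to force large gradients, and `grad_norm` is a hypothetical helper that measures the norm independently of the clipping call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
X = torch.randn(32, 10)
y = 1000.0 * torch.randn(32, 1)  # huge targets -> huge MSE gradients (demo assumption)

loss = nn.MSELoss()(model(X), y)
loss.backward()

def grad_norm(parameters):
    """L2 norm over all gradients, computed without calling clip_grad_norm_."""
    return torch.sqrt(sum(p.grad.pow(2).sum() for p in parameters)).item()

before = grad_norm(model.parameters())
returned = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
after = grad_norm(model.parameters())

print(f"norm before clipping:   {before:.4f}")
print(f"value returned by call: {returned:.4f}")
print(f"norm after clipping:    {after:.4f}")
```

The returned value should match the norm before clipping, while the gradients actually stored on the parameters end up with a norm of roughly 1.0. To monitor the clipped norm, it has to be recomputed after the call.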
Common gotcha
A common misconception is that `clip_grad_norm_` prevents gradients from *becoming* large during backprop; it only *caps* them after backprop is done. If gradients are already NaN or Inf because of overflow in the forward or backward pass, clipping won't help: you need a smaller learning rate or better initialization. Also, `clip_grad_norm_` modifies gradients in place but returns the *pre-clip* norm, which can be misleading when monitoring.
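The NaN case can be reproduced in a few lines. This sketch hand-assigns a gradient containing NaN to stand in for an overflow during the backward pass (an assumption for the demo, not how NaNs normally arise), then shows that clipping leaves it non-finite. It also uses the `error_if_nonfinite` argument, available in recent PyTorch releases, to turn the silent failure into an exception.

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
# Hand-set gradient with a NaN entry, standing in for overflow during backward().
w.grad = torch.tensor([float("nan"), 1.0])

norm = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
still_nan = torch.isnan(w.grad).any().item()
print(f"returned norm: {norm.item()}, gradient still has NaN: {still_nan}")

# With error_if_nonfinite=True the same situation raises instead of passing silently.
w.grad = torch.tensor([float("nan"), 1.0])
try:
    torch.nn.utils.clip_grad_norm_([w], max_norm=1.0, error_if_nonfinite=True)
    raised = False
except RuntimeError as err:
    raised = True
    print(f"caught: {type(err).__name__}")
```

Failing loudly at the clipping call makes it easier to find the step where the gradients first went bad, rather than discovering a NaN loss several iterations later.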
Error recovery
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Loss is NaN after clipping
Loss unchanged despite clipping
Experienced dev note
Gradient clipping is a band-aid, not a root-cause fix. If you're clipping hard (e.g., max_norm=0.1 on a model that naturally produces norms of 10+), your learning rate is too high or your network is poorly initialized. In production Transformer training, clipping at 1.0–2.0 is standard as *insurance*, not the primary stabilizer: the real stability comes from layer norm and careful learning rate scheduling. Also: use `clip_grad_norm_` (clips all parameters together) rather than `clip_grad_value_` (clips element-wise) unless you have a specific reason; the former rescales every gradient by one shared factor and so preserves the gradient direction, while the latter can change it.
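The difference in direction between the two functions can be checked with a cosine similarity against the original gradient. This is a small sketch; the gradient values and the hand-assigned `.grad` tensors are made up for illustration.

```python
import torch
import torch.nn.functional as F

original = torch.tensor([10.0, 1.0])

# Norm clipping: one shared scale factor for every element.
p_norm = torch.nn.Parameter(torch.zeros(2))
p_norm.grad = original.clone()
torch.nn.utils.clip_grad_norm_([p_norm], max_norm=1.0)

# Value clipping: each element is clamped to [-1, 1] independently.
p_value = torch.nn.Parameter(torch.zeros(2))
p_value.grad = original.clone()
torch.nn.utils.clip_grad_value_([p_value], clip_value=1.0)

cos_norm = F.cosine_similarity(p_norm.grad, original, dim=0).item()
cos_value = F.cosine_similarity(p_value.grad, original, dim=0).item()
print(f"norm clipping:  grad = {p_norm.grad.tolist()}, cosine = {cos_norm:.4f}")
print(f"value clipping: grad = {p_value.grad.tolist()}, cosine = {cos_value:.4f}")
```

Value clipping turns `[10, 1]` into `[1, 1]`, which points in a noticeably different direction, whereas norm clipping keeps the 10:1 ratio between the components.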
Check your understanding
Why doesn't clipping at max_norm=0.5 stop your loss from becoming NaN if the gradients had already overflowed to NaN or Inf during the backward pass?
Answer hint
Understand that clipping happens *after* gradients are computed: it can't prevent overflow during backprop itself, only constrain the magnitude of the final gradient update. If numerical overflow (NaN/Inf) happens during backward computation before clipping executes, clipping sees NaN and can't fix it.