Code Intermediate medium · 6 min

Gradient exploding: gradient clipping

What you will learn

Prevent training instability by clamping gradient magnitudes when they grow too large during backpropagation.

Why this matters

Gradient explosion causes NaN loss and training collapse, especially in RNNs and deep networks. Clipping is the fastest fix that stabilizes training without redesigning your architecture.

Skip if: your gradients are naturally stable (confirmed by monitoring their norms). Clipping can also mask the real problem: if training still diverges with clipping enabled, you likely have a weight initialization or learning rate issue that needs fixing instead.

Explanation

Gradient explosion occurs when backpropagation multiplies gradients through many layers and the repeated products grow exponentially. The resulting weight updates become massive, overshoot optima, and eventually produce NaN or Inf values. Gradient clipping caps the magnitude of the gradients before the optimizer step: if their norm exceeds a threshold, all gradients are scaled down proportionally. Mechanically, `clip_grad_norm_` computes a single L2 norm over all parameter gradients taken together, then divides each gradient by `max(1, norm / max_norm)` (up to a small epsilon added for numerical safety), so gradients whose norm is already below `max_norm` are left unchanged. This preserves gradient direction while constraining magnitude. Use it when training deep networks, RNNs, or Transformers, where gradients tend to grow across layers or time steps, and combine it with other stability measures such as proper initialization and learning rate scheduling for production robustness.
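
The scaling rule described above can be sketched in plain Python. This is an illustration only, not PyTorch's implementation: `clip_by_global_norm` is a hypothetical helper, nested lists stand in for gradient tensors, and the `eps` term is an assumed guard against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients together so their combined L2 norm is at most max_norm.

    `grads` is a list of flat float lists, one per parameter (a stand-in for tensors).
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is capped at 1 so gradients below the threshold are untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(f"norm before: {norm_before:.4f}, norm after: {norm_after:.4f}")
```

Every element is multiplied by the same factor, so the ratios between gradient components (the direction) are unchanged; only the overall length shrinks.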

Analogy

Like a car's governor that prevents the engine from exceeding a safe RPM: the direction of travel stays the same, but maximum speed is enforced.

Code

python

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)

optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

X = torch.randn(32, 10)
y = torch.randn(32, 1)

print("Training WITHOUT gradient clipping:")
for epoch in range(3):
    optimizer.zero_grad()
    output = model(X)
    loss = loss_fn(output, y)
    loss.backward()
    
    # max_norm=inf never rescales anything; the call is used here only to measure the norm
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), float('inf'))
    print(f"  Epoch {epoch + 1}: loss = {loss.item():.6f}, gradient norm = {total_norm:.4f}")
    
    optimizer.step()

# Re-seed so model2 starts from the same initial weights as model
torch.manual_seed(42)
model2 = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)

optimizer2 = optim.SGD(model2.parameters(), lr=0.1)
print("\nTraining WITH gradient clipping (max_norm=1.0):")
for epoch in range(3):
    optimizer2.zero_grad()
    output = model2(X)
    loss = loss_fn(output, y)
    loss.backward()
    
    # clip_grad_norm_ returns the total norm measured BEFORE clipping
    total_norm = torch.nn.utils.clip_grad_norm_(model2.parameters(), max_norm=1.0)
    print(f"  Epoch {epoch + 1}: loss = {loss.item():.6f}, gradient norm = {total_norm:.4f}")
    
    optimizer2.step()

Output

Training WITHOUT gradient clipping:
  Epoch 1: loss = 0.831465, gradient norm = 0.8432
  Epoch 2: loss = 0.720932, gradient norm = 0.7241
  Epoch 3: loss = 0.591047, gradient norm = 0.6105

Training WITH gradient clipping (max_norm=1.0):
  Epoch 1: loss = 0.831465, gradient norm = 0.8432
  Epoch 2: loss = 0.720932, gradient norm = 0.7241
  Epoch 3: loss = 0.591047, gradient norm = 0.6105

What just happened?

The code trains two models with the same architecture: one without clipping (max_norm=inf, so nothing is ever rescaled) and one with clipping at max_norm=1.0. In both runs the gradient norms stay below 1.0, so clipping never activates. Note that `clip_grad_norm_` returns the total norm measured before clipping in both cases, so the printed value is the pre-clip norm for both models. The losses match because no clipping was needed for this stable toy problem. With genuinely exploding gradients, the first model would apply the huge gradients as-is, while the second would rescale them to a norm of 1.0 before the optimizer step, even though the returned (pre-clip) value would still report the large norm.

Common gotcha

Developers think `clip_grad_norm_` prevents gradients from *becoming* large during backprop, but it only *caps* them after backprop is done. If gradients are NaN due to overflow in forward/backward pass itself, clipping won't help: you need smaller learning rates or better initialization. Also, `clip_grad_norm_` modifies gradients in-place but returns the *pre-clipped* norm, which can be misleading when monitoring.
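
To see the difference between the returned value and the gradients actually applied, the sketch below forces a large gradient and then recomputes the norm from the modified `.grad` tensors. The tiny `layer`, the inflated `target`, and the variable names are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
# Very large targets make the gradient norm comfortably exceed max_norm.
target = 1000.0 * torch.ones(8, 1)

loss = torch.nn.functional.mse_loss(layer(x), target)
loss.backward()

max_norm = 1.0
# The return value is the total norm measured before any rescaling.
returned_norm = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=max_norm)

# Recompute the norm from the (now modified) .grad tensors.
post_clip_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in layer.parameters()]), 2
)
print(f"returned: {returned_norm.item():.2f}, after clipping: {post_clip_norm.item():.4f}")
```

If you log the returned value, label it as the pre-clip norm; to log what the optimizer actually sees, recompute the norm after the call as shown.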

Error recovery

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

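This error is raised by `loss.backward()`, not by `clip_grad_norm_`. It means the loss is not connected to any tensor that requires gradients, for example because every parameter has `requires_grad=False` or the forward pass ran under `torch.no_grad()`. Make sure the model parameters have `requires_grad=True` and call `.backward()` before clipping. Note that `clip_grad_norm_` itself skips parameters whose `.grad` is `None`, so calling it before `.backward()` does nothing rather than raising an error.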

loss is nan after clipping

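Clipping doesn't prevent NaN: it only constrains magnitude. NaN comes from values overflowing during the forward/backward pass itself, and once the total norm is NaN or Inf, rescaling cannot recover finite gradients. Reduce the learning rate and check for poor weight initialization before blaming clipping. Passing `error_if_nonfinite=True` to `clip_grad_norm_` makes it raise an error as soon as the norm is non-finite, which helps locate the step where things went wrong.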
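
A minimal sketch of how non-finite gradients pass through clipping, and two ways to detect them. The parameter `w` and its hand-written gradient are made up for the demonstration; `error_if_nonfinite` is a real argument of `clip_grad_norm_` in recent PyTorch releases.

```python
import torch

w = torch.nn.Parameter(torch.ones(3))
w.grad = torch.tensor([1.0, float("inf"), 2.0])  # simulate an overflowed gradient

total_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
norm_is_finite = bool(torch.isfinite(total_norm))
grads_are_finite = bool(torch.isfinite(w.grad).all())
print(f"norm finite: {norm_is_finite}, grads finite after clipping: {grads_are_finite}")

# Option 1: inspect the returned norm and skip optimizer.step() when it is not finite.
should_step = norm_is_finite

# Option 2: let PyTorch raise as soon as the total norm is NaN or Inf.
w.grad = torch.tensor([1.0, float("nan"), 2.0])
try:
    torch.nn.utils.clip_grad_norm_([w], max_norm=1.0, error_if_nonfinite=True)
    raised = False
except RuntimeError:
    raised = True
print(f"error_if_nonfinite raised: {raised}")
```

Skipping the step (option 1) keeps training alive through an occasional bad batch, while raising (option 2) is usually more useful when debugging where the overflow originates.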

loss unchanged despite clipping

If gradient norm is already below max_norm, clipping doesn't activate. Monitor the returned norm value to confirm clipping is actually happening. If norms are stable, gradient explosion isn't your problem.
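
One way to confirm whether clipping is actually doing anything is to count the steps on which the returned norm exceeds `max_norm`. The loop below is a sketch under assumed settings (a small linear model, random data, 20 steps); the names `norm_history` and `clipped_steps` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(32, 10), torch.randn(32, 1)

max_norm = 1.0
num_steps = 20
norm_history = []
clipped_steps = 0

for step in range(num_steps):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    # The returned value is the total norm before clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm).item()
    norm_history.append(total_norm)
    if total_norm > max_norm:
        clipped_steps += 1  # clipping rescaled the gradients on this step
    optimizer.step()

print(f"clipping activated on {clipped_steps}/{num_steps} steps; "
      f"max pre-clip norm = {max(norm_history):.4f}")
```

A count near zero suggests clipping is acting only as insurance; a count near `num_steps` suggests the threshold, learning rate, or initialization deserves a second look.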

Experienced dev note

Gradient clipping is a band-aid, not a root cause fix. If you're clipping hard (e.g., max_norm=0.1 on a model that naturally produces norms of 10+), your learning rate is too high or your network is poorly initialized. In production Transformer training, clipping at 1.0–2.0 is standard as *insurance*, not the primary stabilizer: the real stability comes from layer norm and careful learning rate scheduling. Also: use `clip_grad_norm_` (clips all parameters together) not `clip_grad_value_` (clips element-wise) unless you have a specific reason; the former preserves gradient direction better.
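
The direction-preservation point can be checked with a two-element gradient. This is a small illustration: the gradient values and the parameter names `p_norm` and `p_value` are made up for the comparison.

```python
import torch
import torch.nn.functional as F

original = torch.tensor([0.5, 10.0])

# Two parameters holding the same gradient, clipped in two different ways.
p_norm = torch.nn.Parameter(torch.zeros(2))
p_value = torch.nn.Parameter(torch.zeros(2))
p_norm.grad = original.clone()
p_value.grad = original.clone()

torch.nn.utils.clip_grad_norm_([p_norm], max_norm=1.0)      # rescales the whole vector
torch.nn.utils.clip_grad_value_([p_value], clip_value=1.0)  # clamps each element to [-1, 1]

# Cosine similarity of 1.0 means the direction is unchanged.
cos_norm = F.cosine_similarity(original, p_norm.grad, dim=0).item()
cos_value = F.cosine_similarity(original, p_value.grad, dim=0).item()
print(f"norm clipping:  grad = {p_norm.grad.tolist()}, cosine = {cos_norm:.4f}")
print(f"value clipping: grad = {p_value.grad.tolist()}, cosine = {cos_value:.4f}")
```

Value clipping only touches the large component, so the update tilts toward the small one; norm clipping shrinks both by the same factor.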

Check your understanding

Why can't clipping at max_norm=0.5 rescue training if the gradients already overflowed to NaN or Inf during the backward pass?

Show answer hint

Understand that clipping happens *after* gradients are computed: it can't prevent overflow during backprop itself, only constrain the magnitude of the final gradient update. If numerical overflow (NaN/Inf) happens during backward computation before clipping executes, clipping sees NaN and can't fix it.

VERSION torch.nn.utils.clip_grad_norm_ has been stable since PyTorch 0.4.0 (2018). PyTorch 2.0+ added torch.compile support, but clipping still works identically. No breaking changes relevant to this feature in PyTorch 2.11.x.

Explore layer normalization and batch normalization as complementary stabilization techniques that prevent gradient explosion at the architectural level, rather than constraining it after the fact.
