Code Beginner easy · 4 min

.detach(): stopping gradient flow

What you will learn

<code>.detach()</code> creates a new tensor that is disconnected from the computational graph, preventing gradients from flowing backward through it.

Why this matters

You need to stop gradient flow when you want to use a tensor's values in a computation without updating the parameters that produced it. This comes up in loss calculations, target values, and when freezing part of a model. Without it, you'll train things you meant to keep fixed.

Skip if: Do NOT use <code>.detach()</code> when you want gradients to flow through a tensor. If you're computing a loss that should update your model weights, the tensor must stay attached to the graph.

Explanation

What it is: .detach() returns a new tensor that shares the same data as the original, but is no longer part of PyTorch's autograd graph. Gradients will not flow backward through it.

How it works mechanically: When you call loss.backward(), PyTorch traces back through every operation in the computational graph, computing gradients. If a tensor is detached, that backward trace stops: no gradient is computed for operations that depend only on the detached tensor. The tensor's values are still usable for forward computation; only the gradient tracking is severed.
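To make the mechanics concrete, here is a minimal sketch (the variable names are illustrative, not from any particular codebase) that differentiates the same product twice: once fully tracked, and once with one factor detached so autograd treats it as a constant.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Fully tracked: y = x * x, so dy/dx = 2x = 6
y = x * x
y.backward()
grad_tracked = x.grad.clone()

x.grad = None  # reset the accumulated gradient before the second pass

# One factor detached: autograd sees x * (constant 3), so the gradient is 3
y_partial = x * x.detach()
y_partial.backward()
grad_partial = x.grad.clone()

print(grad_tracked)        # tensor(6.)
print(grad_partial)        # tensor(3.)
print(x.detach().grad_fn)  # None: a detached tensor has no graph history
```

The forward values of `y` and `y_partial` are identical (both 9); only the backward trace differs.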

When to use it: Use .detach() when a tensor's values matter for the computation but the parameters that produced it should not be updated through this path. Classic cases: target values that are themselves computed by a model (we want to predict them, not adjust them), or a two-network architecture where only one network's weights should be updated.
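The two-network case can be sketched as follows. The names `online` and `target_net` and the layer sizes are hypothetical, chosen only for illustration; the point is that detaching the second network's output keeps its weights out of the backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

online = nn.Linear(4, 1)      # hypothetical network we want to train
target_net = nn.Linear(4, 1)  # hypothetical network we want to keep fixed

inputs = torch.randn(8, 4)

prediction = online(inputs)
# Detach so the loss treats the target as a plain constant
target = target_net(inputs).detach()

loss = ((prediction - target) ** 2).mean()
loss.backward()

online_has_grad = online.weight.grad is not None
target_has_grad = target_net.weight.grad is not None

print(online_has_grad)  # True
print(target_has_grad)  # False
```

Without the `.detach()` call, `target_net` would receive gradients too and an optimizer covering both networks would update it.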

Analogy

Imagine a student solving a math problem with a calculator. The calculator gives a number (the tensor's value). Normally, if the answer is wrong, you debug both the student's logic and the calculator (backprop updates both). With <code>.detach()</code>, you cut off the cable to the calculator: its number still appears in the work, but blame for errors never flows back to fix it.

Code

python

import torch

# Create a simple tensor that requires gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(f"Original x: {x}")
print(f"Original x.requires_grad: {x.requires_grad}")

# Detach the tensor
x_detached = x.detach()
print(f"\nDetached x_detached: {x_detached}")
print(f"Detached x_detached.requires_grad: {x_detached.requires_grad}")

# Compute a loss using the detached tensor
y = x.sum()  # This WILL accumulate gradients
z = x_detached.sum()  # This will NOT accumulate gradients

loss = y + z
print(f"\nLoss: {loss}")

# Backpropagate
loss.backward()
print(f"\nGradient of x after backward: {x.grad}")
print("Note: gradient exists because y = x.sum() kept the connection")
print("z = x_detached.sum() did not contribute to x's gradient")

Output

Original x: tensor([2., 3.], requires_grad=True)
Original x.requires_grad: True

Detached x_detached: tensor([2., 3.])
Detached x_detached.requires_grad: False

Loss: tensor(10., grad_fn=<AddBackward0>)

Gradient of x after backward: tensor([1., 1.])
Note: gradient exists because y = x.sum() kept the connection
z = x_detached.sum() did not contribute to x's gradient

What just happened?

We created a tensor with gradient tracking enabled. When we called <code>.detach()</code>, we got a new tensor with the same values but <code>requires_grad=False</code> and no connection to the autograd graph. Both tensors were used in the loss, but only the original <code>x</code> received gradients during backprop because only <code>y = x.sum()</code> kept the computational graph intact. Each entry of <code>x.grad</code> is 1, the derivative of <code>x.sum()</code> with respect to that element; had <code>z</code> stayed attached, each entry would have been 2. The detached operation <code>z = x_detached.sum()</code> contributed to the loss value but not to any gradient computation.

Common gotcha

The most common mistake is assuming .detach() creates an independent copy. It doesn't: the detached tensor shares the same underlying storage as the original, so modifying it in-place changes the original too. When you need a truly independent copy that cannot affect the original, use .clone().detach().
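A short sketch of the difference, using illustrative variable names: an in-place write through the detached tensor shows up in the original, while a write to the cloned copy does not.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

shared = x.detach()               # same storage as x
independent = x.clone().detach()  # separate storage

shared[0] = 100.0      # in-place write: also changes x
independent[1] = -1.0  # only changes the copy

print(x)            # tensor([100.,   2.,   3.], requires_grad=True)
print(independent)  # tensor([ 1., -1.,  3.])

same_storage = shared.data_ptr() == x.data_ptr()
copy_storage = independent.data_ptr() == x.data_ptr()
print(same_storage, copy_storage)  # True False
```

`x.detach().clone()` gives the same result and avoids recording the clone in the graph.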

Error recovery

TypeError: 'NoneType' object is not subscriptable when accessing grad

Your tensor was detached and has <code>requires_grad=False</code>, so <code>.grad</code> is <code>None</code>, and indexing <code>None</code> raises this error. Only leaf tensors with <code>requires_grad=True</code> have <code>.grad</code> populated after backprop. Read <code>.grad</code> from the original tensor, or remove the <code>.detach()</code> call if you need gradients.
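A minimal reproduction, with illustrative names: the original tensor has a gradient after backprop, the detached one has `None`, and indexing that `None` raises Python's `TypeError`.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
x_detached = x.detach()

(x * 2).sum().backward()

print(x.grad)           # tensor([2., 2.])
print(x_detached.grad)  # None

message = ""
try:
    x_detached.grad[0]  # indexing None
except TypeError as err:
    message = str(err)

print(message)
```

Checking `tensor.grad is not None` before indexing is a cheap guard while debugging.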

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Every tensor feeding your loss was detached (or never required gradients), so there is no computational graph to trace. A loss that mixes attached and detached tensors is fine; this error means nothing upstream of the loss requires gradients. Remove <code>.detach()</code> from tensors that should contribute to learning.
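The sketch below, using an illustrative tensor `w`, triggers this error by building a loss only from a detached tensor, then shows the same loss working once the tensor stays attached.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Loss built only from a detached tensor: nothing to backpropagate through
broken_loss = (w.detach() * 3).sum()

error_text = ""
try:
    broken_loss.backward()
except RuntimeError as err:
    error_text = str(err)

print(error_text)

# Same computation with w left attached
fixed_loss = (w * 3).sum()
fixed_loss.backward()
print(w.grad)  # tensor([3., 3.])
```

Printing `loss.requires_grad` before calling `backward()` is a quick way to spot this: it is `False` for `broken_loss` and `True` for `fixed_loss`.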

Experienced dev note

In production, .detach() is how you implement target networks in reinforcement learning (compute loss against a frozen copy of your network), freeze pre-trained layers during transfer learning, or implement custom loss functions that mix learnable and fixed components. The alternative, with torch.no_grad():, is for a different use case: it disables gradient tracking for an entire block of code (useful for inference), whereas .detach() is surgical: it cuts one specific tensor out of the graph while the surrounding code still tracks gradients. Learn to distinguish them early; mixing them up is a hidden cause of training failures.
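The contrast can be sketched side by side, with illustrative variable names. Inside `torch.no_grad()` nothing is tracked, while `.detach()` removes a single tensor and leaves everything around it tracked.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Block-level: every result computed inside is untracked
with torch.no_grad():
    a = w * 2
    b = a + 1
print(a.requires_grad, b.requires_grad)  # False False

# Tensor-level: only d is cut out; c and e are still tracked
c = w * 2
d = c.detach()
e = c + d
print(c.requires_grad, d.requires_grad, e.requires_grad)  # True False True

e.sum().backward()
print(w.grad)  # tensor([2., 2.]): gradient flows through c only
```

If `d` had not been detached, both branches would contribute and `w.grad` would be `tensor([4., 4.])`.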

Check your understanding

If you compute a loss as loss = model_output - detached_target (reduced to a scalar) and then call loss.backward(), will the model's parameters receive gradients, and so be updated by the next optimizer step? Why or why not?

Answer hint

Yes. The model's parameters receive gradients and will be updated at the next optimizer step. The detached tensor is only one operand in the subtraction. The <code>model_output</code> side of the graph is still connected and will receive gradients. The detached target simply doesn't contribute to any gradient: only the model output's computation graph is traced backward.
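You can check the answer yourself with a sketch like the one below. The model, optimizer settings, and tensor shapes are hypothetical stand-ins. Note that `backward()` only fills in `.grad`; the weights change when the optimizer steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(3, 1)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(5, 3)
raw_target = torch.randn(5, 1, requires_grad=True)
detached_target = raw_target.detach()

bias_before = model.bias.detach().clone()

# Reduce to a scalar so backward() can be called without arguments
loss = (model(inputs) - detached_target).sum()
loss.backward()

grads_filled = all(p.grad is not None for p in model.parameters())
unchanged_after_backward = torch.equal(model.bias.detach(), bias_before)

optimizer.step()
changed_after_step = not torch.equal(model.bias.detach(), bias_before)

print(grads_filled)              # True: the model side is still attached
print(raw_target.grad)           # None: the target side was cut off
print(unchanged_after_backward)  # True: backward() alone updates nothing
print(changed_after_step)        # True: the optimizer applies the update
```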

VERSION Before PyTorch 0.4.0, the pattern was Variable(tensor, requires_grad=True), and developers commonly used .data to get an untracked tensor. Since PyTorch 0.4.0 (April 2018), tensors and variables are merged, and .detach() is the standard. Current PyTorch 2.11.x (March 2026) heavily discourages .data access; always use .detach().
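One reason .data is discouraged can be sketched as follows, with illustrative names. An in-place change made through .data is invisible to autograd and can silently produce wrong gradients, whereas the same change made through .detach() is caught when you call backward().

```python
import torch

# Through .data: autograd does not notice the in-place change
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
out.data.zero_()
out.sum().backward()
print(a.grad)  # tensor([0., 0., 0.]): silently wrong

# Through .detach(): the in-place change is detected at backward time
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out2 = b.sigmoid()
out2.detach().zero_()

detach_error = ""
try:
    out2.sum().backward()
except RuntimeError as err:
    detach_error = str(err)

print(detach_error)
```

Sigmoid's backward pass reuses its own output, which is why zeroing that output corrupts the gradient in the first case.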

Next, explore <code>torch.no_grad()</code> to understand how to disable gradient tracking for entire code blocks during inference or when you want to avoid memory overhead entirely.
