How to use gradient checkpointing in Hugging Face
Quick answer
Use model.gradient_checkpointing_enable() in Hugging Face Transformers to activate gradient checkpointing, which reduces GPU memory usage by trading compute for memory during backpropagation. Call it after loading your model and before training.

Prerequisites
- Python 3.8+
- pip install transformers>=4.30.0
- PyTorch installed (gradient_checkpointing_enable() is part of the PyTorch model API in Transformers)
- Basic knowledge of Hugging Face Transformers
Setup
Install the transformers library along with PyTorch, since the gradient checkpointing API is implemented on Transformers' PyTorch models.
pip install transformers torch

Step by step
Load a Hugging Face model, enable gradient checkpointing, and run a simple training loop to verify memory savings.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Enable gradient checkpointing to save memory
model.gradient_checkpointing_enable()
# Prepare dummy input
inputs = tokenizer("Hello, Hugging Face!", return_tensors="pt")
labels = torch.tensor([1])  # shape (1,): one label for batch size 1
# Set model to train mode
model.train()
# Forward pass
outputs = model(**inputs, labels=labels)
loss = outputs.loss
# Backward pass
loss.backward()
print(f"Loss: {loss.item():.4f}")

Output
Loss: 0.6931
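Under the hood, Transformers wraps its layers in torch.utils.checkpoint.checkpoint, which discards intermediate activations during the forward pass and recomputes them during backward. A minimal pure-PyTorch sketch of the same trade-off (the layer names and sizes here are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Two small layers; the checkpointed block discards its intermediate
# activations on the forward pass and recomputes them during backward.
layer1 = torch.nn.Linear(16, 16)
layer2 = torch.nn.Linear(16, 1)

x = torch.randn(4, 16)

def block(t):
    return torch.relu(layer1(t))

# use_reentrant=False selects PyTorch's recommended non-reentrant
# checkpointing implementation
hidden = checkpoint(block, x, use_reentrant=False)
loss = layer2(hidden).sum()
loss.backward()  # block() is re-run here to rebuild the activations

print(layer1.weight.grad is not None)  # prints True
```

The gradient arrives intact; the only cost is running block() twice.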
Common variations
- Use model.gradient_checkpointing_disable() to turn off checkpointing.
- The enable/disable API applies to PyTorch models; TensorFlow models in Transformers do not expose these methods.
- Combine with mixed precision training for further memory optimization.
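The mixed-precision combination can be sketched in plain PyTorch with torch.autocast. This example uses bfloat16 on CPU purely so it runs anywhere; on a GPU you would typically use torch.autocast("cuda") with float16 plus a GradScaler. Layer names and sizes are illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

body = torch.nn.Linear(8, 8)
head = torch.nn.Linear(8, 1)
x = torch.randn(2, 8)

# Autocast runs eligible ops in low precision; checkpointing recomputes
# the wrapped block during backward instead of storing its activations.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    hidden = checkpoint(lambda t: torch.relu(body(t)), x, use_reentrant=False)
    loss = head(hidden).float().sum()

loss.backward()
print("grad dtype:", head.weight.grad.dtype)
```

The two techniques compose cleanly: autocast shrinks the activations that are kept, and checkpointing avoids keeping most of them at all.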
Troubleshooting
- If you get errors about unsupported layers, check if your model architecture supports gradient checkpointing.
- Ensure model.gradient_checkpointing_enable() is called before training starts.
- Expect slower training, since checkpointing trades compute for memory.
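The architecture check above can be automated: PyTorch models in Transformers expose a supports_gradient_checkpointing flag and an is_gradient_checkpointing property. The tiny randomly initialized BERT below (hypothetical sizes) is used only to keep the example download-free:

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny random model so the example needs no network access
config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForSequenceClassification(config)

if model.supports_gradient_checkpointing:
    model.gradient_checkpointing_enable()
    print("enabled:", model.is_gradient_checkpointing)  # prints enabled: True
else:
    print("this architecture does not support gradient checkpointing")
```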
Key Takeaways
- Call model.gradient_checkpointing_enable() after loading your Hugging Face model to reduce GPU memory usage.
- Gradient checkpointing trades increased computation time for lower memory consumption during backpropagation.
- It works with most PyTorch model architectures in the Hugging Face ecosystem.
- Always enable checkpointing before starting training to avoid runtime errors.
- Combine with other optimization techniques like mixed precision for best results.