How-to · Intermediate · 3 min read

How to use gradient checkpointing in Hugging Face Transformers

Quick answer
Call model.gradient_checkpointing_enable() on a Hugging Face Transformers model to activate gradient checkpointing, which reduces GPU memory usage by discarding intermediate activations during the forward pass and recomputing them during backpropagation. Enable it after loading your model and before training starts.

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0" (quote the requirement so the shell does not interpret >=)
  • PyTorch or TensorFlow installed
  • Basic knowledge of Hugging Face Transformers

Setup

Install the latest transformers library and ensure you have a compatible deep learning framework like PyTorch or TensorFlow installed.

bash
pip install transformers torch

Step by step

Load a Hugging Face model, enable gradient checkpointing, and run a simple training loop to verify memory savings.

python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Enable gradient checkpointing to save memory
model.gradient_checkpointing_enable()

# Prepare dummy input
inputs = tokenizer("Hello, Hugging Face!", return_tensors="pt")
labels = torch.tensor([1])  # shape (batch_size,): a batch of one example

# Set model to train mode
model.train()

# Forward pass
outputs = model(**inputs, labels=labels)
loss = outputs.loss

# Backward pass
loss.backward()

print(f"Loss: {loss.item():.4f}")
output
Loss: 0.6931  (your value will differ: the classification head is randomly initialized)
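Checkpointing changes memory usage, not the math: the recomputed activations produce the same loss and gradients. A quick way to convince yourself is to compare the loss with and without checkpointing. The sketch below uses a tiny, randomly initialized BERT config (illustrative values chosen so nothing needs to be downloaded), with dropout zeroed so both forward passes are deterministic:

```python
# Sketch: a tiny, randomly initialized BERT (illustrative config values,
# chosen so nothing needs to be downloaded). Dropout is zeroed so both
# forward passes are deterministic.
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
)
model = BertForSequenceClassification(config)
model.train()

input_ids = torch.randint(0, config.vocab_size, (1, 16))
labels = torch.tensor([1])

# Forward pass without checkpointing
loss_plain = model(input_ids=input_ids, labels=labels).loss

# Same forward pass with checkpointing: less memory held, same numbers
model.gradient_checkpointing_enable()
loss_ckpt = model(input_ids=input_ids, labels=labels).loss
loss_ckpt.backward()

print(torch.allclose(loss_plain, loss_ckpt))  # True: only memory use changes
```

On a real GPU workload you would verify the memory side of the trade with torch.cuda.max_memory_allocated() before and after enabling checkpointing.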

Common variations

  • Use model.gradient_checkpointing_disable() to turn off checkpointing.
  • gradient_checkpointing_enable() is implemented for PyTorch models in Transformers; the TensorFlow model classes do not expose this method.
  • Combine with mixed precision training for further memory optimization.
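To illustrate the last point, gradient checkpointing composes with PyTorch autocast: checkpointed segments are recomputed during backward under the same autocast state. A hedged sketch, again using a tiny randomly initialized BERT config (an illustrative assumption, so nothing is downloaded; on a GPU you would use device_type="cuda", usually with torch.float16 and a GradScaler):

```python
# Sketch: checkpointing + autocast on a tiny randomly initialized BERT
# (illustrative config, no download needed). On a GPU you would use
# device_type="cuda", usually with torch.float16 and a GradScaler.
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
)
model = BertForSequenceClassification(config)
model.gradient_checkpointing_enable()
model.train()

input_ids = torch.randint(0, config.vocab_size, (1, 16))
labels = torch.tensor([0])

# Autocast lowers the precision of the forward pass; during backward,
# checkpointed segments are recomputed under the same autocast state.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
print(f"Loss: {loss.item():.4f}")
```

If you train with the Trainer API instead, the equivalent switch is gradient_checkpointing=True in TrainingArguments, alongside fp16=True or bf16=True for mixed precision.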

Troubleshooting

  • If you get errors about unsupported layers, check whether your model architecture supports gradient checkpointing (most, but not all, architectures in Transformers do).
  • Ensure model.gradient_checkpointing_enable() is called before the training loop starts.
  • For decoder models you may see a warning that use_cache=True is incompatible with gradient checkpointing; Transformers disables the cache for you, and passing use_cache=False silences the warning.
  • Expect slower training: checkpointing trades extra forward-pass recomputation for lower memory.
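You can check support programmatically: PreTrainedModel exposes a supports_gradient_checkpointing class attribute and an is_gradient_checkpointing property. Shown here on a tiny randomly initialized BERT so nothing is downloaded (the small config values are illustrative):

```python
# Sketch: tiny randomly initialized BERT (illustrative config, no download).
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
)
model = BertForSequenceClassification(config)

print(model.supports_gradient_checkpointing)  # True: BERT implements it
print(model.is_gradient_checkpointing)        # False: not enabled yet

if model.supports_gradient_checkpointing:
    model.gradient_checkpointing_enable()

print(model.is_gradient_checkpointing)        # True after enabling
```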

Key Takeaways

  • Call model.gradient_checkpointing_enable() after loading your Hugging Face model to reduce GPU memory usage.
  • Gradient checkpointing trades increased computation time for lower memory consumption during backpropagation.
  • In the Hugging Face ecosystem it is implemented for PyTorch models; check supports_gradient_checkpointing before relying on it for a given architecture.
  • Always enable checkpointing before starting training to avoid runtime errors.
  • Combine with other optimization techniques like mixed precision for best results.
Verified 2026-04 · bert-base-uncased