How to use gradient checkpointing in Hugging Face
Quick answer
Use model.gradient_checkpointing_enable() in Hugging Face Transformers to activate gradient checkpointing, which reduces GPU memory usage by trading compute for memory during backpropagation. Call it after loading your model and before training.

Prerequisites
- Python 3.8+
- pip install transformers>=4.30.0
- PyTorch installed (gradient_checkpointing_enable() is part of the PyTorch model API in Transformers)
- Basic knowledge of Hugging Face Transformers
Setup
Install the transformers library along with PyTorch, since the gradient checkpointing API is implemented on Transformers' PyTorch models.
pip install transformers torch

Step by step
Load a Hugging Face model, enable gradient checkpointing, and run a simple training loop to verify memory savings.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Enable gradient checkpointing to save memory
model.gradient_checkpointing_enable()
# Prepare dummy input
inputs = tokenizer("Hello, Hugging Face!", return_tensors="pt")
labels = torch.tensor([1])  # shape (1,): one label for batch size 1
# Set model to train mode
model.train()
# Forward pass
outputs = model(**inputs, labels=labels)
loss = outputs.loss
# Backward pass
loss.backward()
print(f"Loss: {loss.item():.4f}")

Output
Loss: 0.6931
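Under the hood, Transformers wraps its layers in torch.utils.checkpoint.checkpoint, which discards intermediate activations during the forward pass and recomputes them during backward. A minimal pure-PyTorch sketch of the same trade-off (the layer names and sizes here are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Two small layers; the checkpointed block discards its intermediate
# activations on the forward pass and recomputes them during backward.
layer1 = torch.nn.Linear(16, 16)
layer2 = torch.nn.Linear(16, 1)

x = torch.randn(4, 16)

def block(t):
    return torch.relu(layer1(t))

# use_reentrant=False selects PyTorch's recommended non-reentrant
# checkpointing implementation
hidden = checkpoint(block, x, use_reentrant=False)
loss = layer2(hidden).sum()
loss.backward()  # block() is re-run here to rebuild the activations

print(layer1.weight.grad is not None)  # prints True
```

The gradient arrives intact; the only cost is running block() twice.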
Common variations
- Use model.gradient_checkpointing_disable() to turn off checkpointing.
- The enable/disable API applies to PyTorch models; TensorFlow models in Transformers do not expose these methods.
- Combine with mixed precision training for further memory optimization.
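The mixed-precision combination can be sketched in plain PyTorch with torch.autocast. This example uses bfloat16 on CPU purely so it runs anywhere; on a GPU you would typically use torch.autocast("cuda") with float16 plus a GradScaler. Layer names and sizes are illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

body = torch.nn.Linear(8, 8)
head = torch.nn.Linear(8, 1)
x = torch.randn(2, 8)

# Autocast runs eligible ops in low precision; checkpointing recomputes
# the wrapped block during backward instead of storing its activations.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    hidden = checkpoint(lambda t: torch.relu(body(t)), x, use_reentrant=False)
    loss = head(hidden).float().sum()

loss.backward()
print("grad dtype:", head.weight.grad.dtype)
```

The two techniques compose cleanly: autocast shrinks the activations that are kept, and checkpointing avoids keeping most of them at all.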
Troubleshooting
- If you get errors about unsupported layers, check if your model architecture supports gradient checkpointing.
- Ensure model.gradient_checkpointing_enable() is called before training starts.
- Expect slower training, since checkpointing trades compute for memory.
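The architecture check above can be automated: PyTorch models in Transformers expose a supports_gradient_checkpointing flag and an is_gradient_checkpointing property. The tiny randomly initialized BERT below (hypothetical sizes) is used only to keep the example download-free:

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny random model so the example needs no network access
config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForSequenceClassification(config)

if model.supports_gradient_checkpointing:
    model.gradient_checkpointing_enable()
    print("enabled:", model.is_gradient_checkpointing)  # prints enabled: True
else:
    print("this architecture does not support gradient checkpointing")
```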
Key Takeaways
- Call model.gradient_checkpointing_enable() after loading your Hugging Face model to reduce GPU memory usage.
- Gradient checkpointing trades increased computation time for lower memory consumption during backpropagation.
- It works with most PyTorch model architectures in the Hugging Face ecosystem.
- Always enable checkpointing before starting training to avoid runtime errors.
- Combine with other optimization techniques like mixed precision for best results.