How-to · Intermediate · 3 min read

How to reduce model memory usage in Hugging Face Transformers

Quick answer
To reduce the memory usage of Hugging Face models, apply quantization (8-bit or 4-bit), prune the model, or load it with device_map="auto" so that layers that do not fit on the GPU are offloaded to the CPU. Using smaller or distilled checkpoints also saves memory.

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0"
  • pip install accelerate
  • Basic knowledge of PyTorch or TensorFlow

Setup

Install the latest transformers and accelerate libraries to enable memory optimization features.

bash
pip install transformers accelerate

Step by step

Use Hugging Face's transformers together with bitsandbytes for 8-bit quantization, and device_map="auto" to reduce GPU memory usage. Note that 8-bit loading requires a CUDA GPU and the bitsandbytes package (pip install bitsandbytes).

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "gpt2"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model with 8-bit quantization and automatic device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # quantize to 8-bit
    device_map="auto"  # automatically place layers on devices
)

# Tokenize the prompt and move it to the model's first device
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# Generate up to 20 new tokens after the prompt
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Hello, how are you? … (the prompt followed by up to 20 newly generated tokens; the exact continuation depends on the model and decoding settings)
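If the model still does not fit on the GPU, device_map="auto" can be combined with an explicit max_memory budget so that accelerate spills the remaining layers to CPU RAM. A minimal sketch; the 2 GiB and 8 GiB budgets below are arbitrary placeholders to tune for your hardware:

```python
import torch
from transformers import AutoModelForCausalLM

# Per-device memory budget: key 0 is the first CUDA device, "cpu" is system RAM.
# The exact values are placeholders, not recommendations.
max_memory = {0: "2GiB", "cpu": "8GiB"}

# Offloading only makes sense with a GPU present, so guard the load.
if torch.cuda.is_available():
    # Layers that exceed the 2 GiB GPU budget are kept in CPU RAM.
    model = AutoModelForCausalLM.from_pretrained(
        "gpt2",
        device_map="auto",
        max_memory=max_memory,
    )
```

Offloaded layers are moved to the GPU on demand during the forward pass, so this trades speed for capacity.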

Common variations

  • Use load_in_4bit=True (requires a recent bitsandbytes) for more aggressive quantization.
  • Use torch_dtype=torch.float16 to load the model in half precision.
  • Use distilled or smaller models such as distilbert-base-uncased to reduce the memory footprint.
  • Use accelerate for model offloading and mixed-precision training.
python
import torch
from transformers import AutoModelForSequenceClassification

# Load the model in half precision with automatic device placement
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    torch_dtype=torch.float16,
    device_map="auto"
)

Troubleshooting

  • If you hit CUDA out-of-memory errors, reduce the batch size or use device_map to offload layers to the CPU.
  • Ensure bitsandbytes is installed for 8-bit loading: pip install bitsandbytes.
  • If a model does not support quantization, fall back to half precision or a smaller checkpoint.
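To check whether an optimization actually helped, PreTrainedModel exposes get_memory_footprint(), which returns the size of parameters and buffers in bytes. The sketch below builds a small randomly initialized DistilBERT from its config (no download needed) and compares full and half precision:

```python
from transformers import DistilBertConfig, DistilBertModel

# Randomly initialized model -- no weights are downloaded, so this runs offline.
model = DistilBertModel(DistilBertConfig())

fp32_bytes = model.get_memory_footprint()
fp16_bytes = model.half().get_memory_footprint()  # .half() converts in place

print(f"float32: {fp32_bytes / 1e6:.0f} MB, float16: {fp16_bytes / 1e6:.0f} MB")
# Half precision should cut the footprint roughly in half.
```

The same call works on quantized models, so it is a quick way to compare 8-bit, 4-bit, and half-precision variants of the same checkpoint.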

Key Takeaways

  • Use 8-bit or 4-bit quantization with load_in_8bit or load_in_4bit to reduce memory usage drastically.
  • Leverage device_map="auto" to automatically distribute model layers across CPU and GPU.
  • Choose smaller or distilled models to minimize memory footprint without large accuracy loss.
Verified 2026-04 · gpt2, distilbert-base-uncased