How-to · Intermediate · 3 min read

How to reduce model memory usage in Hugging Face Transformers

Quick answer
To reduce the memory usage of Hugging Face models, apply quantization (8-bit or 4-bit), prune the model, or load it with device_map="auto" so that layers that do not fit on the GPU are offloaded to the CPU. Using smaller or distilled checkpoints also saves memory.

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0"
  • pip install accelerate
  • Basic knowledge of PyTorch or TensorFlow

Setup

Install the latest transformers and accelerate libraries to enable memory optimization features.

bash
pip install transformers accelerate

Step by step

Use Hugging Face's transformers together with bitsandbytes for 8-bit quantization, and device_map="auto" to reduce GPU memory usage. Note that 8-bit loading requires a CUDA GPU and the bitsandbytes package (pip install bitsandbytes).

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "gpt2"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model with 8-bit quantization and automatic device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # quantize to 8-bit
    device_map="auto"  # automatically place layers on devices
)

# Tokenize the prompt and move it to the model's first device
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# Generate up to 20 new tokens after the prompt
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Hello, how are you? … (the prompt followed by up to 20 newly generated tokens; the exact continuation depends on the model and decoding settings)
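If the model still does not fit on the GPU, device_map="auto" can be combined with an explicit max_memory budget so that accelerate spills the remaining layers to CPU RAM. A minimal sketch; the 2 GiB and 8 GiB budgets below are arbitrary placeholders to tune for your hardware:

```python
import torch
from transformers import AutoModelForCausalLM

# Per-device memory budget: key 0 is the first CUDA device, "cpu" is system RAM.
# The exact values are placeholders, not recommendations.
max_memory = {0: "2GiB", "cpu": "8GiB"}

# Offloading only makes sense with a GPU present, so guard the load.
if torch.cuda.is_available():
    # Layers that exceed the 2 GiB GPU budget are kept in CPU RAM.
    model = AutoModelForCausalLM.from_pretrained(
        "gpt2",
        device_map="auto",
        max_memory=max_memory,
    )
```

Offloaded layers are moved to the GPU on demand during the forward pass, so this trades speed for capacity.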

Common variations

  • Use load_in_4bit=True (requires a recent bitsandbytes) for more aggressive quantization.
  • Use torch_dtype=torch.float16 to load the model in half precision.
  • Use distilled or smaller models such as distilbert-base-uncased to reduce the memory footprint.
  • Use accelerate for model offloading and mixed-precision training.
python
import torch
from transformers import AutoModelForSequenceClassification

# Load the model in half precision with automatic device placement
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    torch_dtype=torch.float16,
    device_map="auto"
)

Troubleshooting

  • If you hit CUDA out-of-memory errors, reduce the batch size or use device_map to offload layers to the CPU.
  • Ensure bitsandbytes is installed for 8-bit loading: pip install bitsandbytes.
  • If a model does not support quantization, fall back to half precision or a smaller checkpoint.
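To check whether an optimization actually helped, PreTrainedModel exposes get_memory_footprint(), which returns the size of parameters and buffers in bytes. The sketch below builds a small randomly initialized DistilBERT from its config (no download needed) and compares full and half precision:

```python
from transformers import DistilBertConfig, DistilBertModel

# Randomly initialized model -- no weights are downloaded, so this runs offline.
model = DistilBertModel(DistilBertConfig())

fp32_bytes = model.get_memory_footprint()
fp16_bytes = model.half().get_memory_footprint()  # .half() converts in place

print(f"float32: {fp32_bytes / 1e6:.0f} MB, float16: {fp16_bytes / 1e6:.0f} MB")
# Half precision should cut the footprint roughly in half.
```

The same call works on quantized models, so it is a quick way to compare 8-bit, 4-bit, and half-precision variants of the same checkpoint.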

Key Takeaways

  • Use 8-bit or 4-bit quantization with load_in_8bit or load_in_4bit to reduce memory usage drastically.
  • Leverage device_map="auto" to automatically distribute model layers across CPU and GPU.
  • Choose smaller or distilled models to minimize memory footprint without large accuracy loss.
Verified 2026-04 · gpt2, distilbert-base-uncased