How to load an 8-bit model with Hugging Face
Quick answer
Use Hugging Face Transformers with BitsAndBytesConfig to load models in 8-bit precision: set load_in_8bit=True in the config and pass it to from_pretrained(). This roughly halves the memory needed for model weights compared with fp16, with minimal accuracy loss.

Prerequisites
- Python 3.8+
- pip install transformers>=4.30.0
- pip install bitsandbytes
- A CUDA-capable GPU
- PyTorch installed with CUDA support
Setup
Install transformers and bitsandbytes to enable 8-bit quantization support, and make sure your environment has a CUDA-capable GPU with a CUDA-enabled PyTorch build.
pip install transformers bitsandbytes

Step by step
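Before configuring anything, it helps to confirm that PyTorch can actually see a CUDA device, since the bitsandbytes int8 kernels require one (a minimal sketch; assumes torch is installed):

```python
import torch

# bitsandbytes' int8 kernels run on CUDA GPUs, so check availability up front.
if torch.cuda.is_available():
    print(f"CUDA device found: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; 8-bit loading will fail on this machine.")
```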
Use BitsAndBytesConfig from transformers to configure 8-bit loading, then load the model with load_in_8bit=True. This example loads a causal language model in 8-bit mode.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Configure 8-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Load model with 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=quantization_config,
    device_map="auto"
)
# Test inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output
Hello, how are you? I am fine, thank you for asking.
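The memory savings can be estimated with simple arithmetic. The sketch below uses a rough parameter count for gpt2 (about 124M; an assumption, not measured from the loaded model) to compare weight storage at different precisions:

```python
# Back-of-envelope estimate of weight storage at different precisions.
# Real usage is higher: activations, the KV cache, and framework
# overhead are not included.

def weight_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in MiB."""
    return num_params * bytes_per_param / 1024**2

params = 124_000_000  # rough parameter count of gpt2 (assumption)

fp32 = weight_memory_mb(params, 4)  # 32-bit floats: 4 bytes per weight
fp16 = weight_memory_mb(params, 2)  # 16-bit floats: 2 bytes per weight
int8 = weight_memory_mb(params, 1)  # 8-bit ints:    1 byte per weight

print(f"fp32: {fp32:.0f} MiB, fp16: {fp16:.0f} MiB, int8: {int8:.0f} MiB")
```

For a measured number rather than an estimate, Transformers models also expose model.get_memory_footprint(), which reports the bytes actually used by the loaded weights.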
Common variations
- Use load_in_4bit=True in BitsAndBytesConfig for 4-bit quantization instead.
- When loading in 4-bit, set bnb_4bit_use_double_quant=True in BitsAndBytesConfig to save additional memory (there is no 8-bit double-quant option).
- Use device_map="auto" to automatically place model layers on available GPUs.
- For CPU inference, bitsandbytes 8-bit loading is not supported; use full precision instead.
Troubleshooting
- If you get an ImportError for bitsandbytes, ensure it is installed and compatible with your CUDA version.
- If device_map="auto" fails, try assigning devices manually (for example, pinning the whole model to a single GPU); note that an 8-bit model cannot simply fall back to CPU.
- Out-of-memory errors may require reducing the batch size or switching to 4-bit quantization.
Key Takeaways
- Use BitsAndBytesConfig(load_in_8bit=True) to enable 8-bit quantization in Hugging Face models.
- Loading models in 8-bit roughly halves weight memory compared with fp16, with minimal accuracy loss; int8 inference is often somewhat slower than fp16, however, not faster.
- Ensure bitsandbytes and compatible CUDA drivers are installed for 8-bit support.
- Use device_map="auto" to automatically distribute model layers across GPUs.
- If memory issues occur, consider 4-bit quantization or smaller batch sizes.