How to load a model in 8-bit quantization with Hugging Face
Quick answer
Use the `transformers` library with its bitsandbytes integration to load models in 8-bit quantization by setting `load_in_8bit=True` in `from_pretrained()`. This roughly halves the weight memory footprint compared to fp16 while largely preserving accuracy; inference speed is typically comparable, not faster.
Prerequisites
- Python 3.8+
- `pip install transformers bitsandbytes accelerate`
- A compatible NVIDIA GPU with CUDA support for 8-bit quantization
Setup
Install the required libraries to enable 8-bit quantization support in Hugging Face models. You need transformers, bitsandbytes, and accelerate for efficient loading and inference.
```shell
pip install transformers bitsandbytes accelerate
```

Step by step
Load a Hugging Face transformer model in 8-bit quantization mode using from_pretrained() with load_in_8bit=True. This example loads facebook/opt-1.3b in 8-bit on GPU.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # enable 8-bit quantization
    device_map="auto",  # automatically place model layers on available GPUs
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Output
Hello, how are you? I am doing well, thank you.
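To see why 8-bit storage loses so little accuracy, here is a minimal, self-contained sketch of absmax quantization, the per-tensor scheme that bitsandbytes' LLM.int8() approach builds on (plain Python for illustration only; the library's real implementation is vectorized and handles outlier features separately):

```python
# Toy absmax 8-bit quantization: scale weights so the largest magnitude
# maps to 127, round to integers, then dequantize with the same scale.
def quantize_absmax(weights):
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [v * scale for v in quantized]

weights = [0.31, -1.2, 0.005, 0.77, -0.42]
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

assert all(-127 <= v <= 127 for v in quantized)  # fits in a signed byte
assert max_err <= scale / 2  # rounding error is at most half a quantization step
```

Each weight is stored as one signed byte plus a single shared scale factor, which is where the memory saving comes from; the round-trip error stays bounded by half a quantization step.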
Common variations
- Use `device_map="auto"` to automatically distribute model layers across available GPUs.
- For CPU-only usage, 8-bit quantization via bitsandbytes is not supported; use full precision instead.
- Use `load_in_4bit=True` with compatible models and libraries for an even smaller memory footprint.
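One more variation worth knowing: recent transformers releases group quantization flags into a `BitsAndBytesConfig` object passed via `quantization_config`; passing `load_in_8bit=True` directly still works on many versions but may emit a deprecation warning. A sketch of the newer form (it still requires bitsandbytes and a CUDA GPU to actually load):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Group the quantization settings into a config object.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quant_config,  # replaces load_in_8bit=True
    device_map="auto",
)
```

The same `BitsAndBytesConfig` object is also where 4-bit options live, so switching schemes means changing the config rather than the `from_pretrained()` call.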
Troubleshooting
- If you get `bitsandbytes` import errors, ensure it is installed and your CUDA version is compatible.
- Out-of-memory errors can be mitigated by using smaller models or by offloading layers with a custom `device_map`.
- Check GPU compatibility; 8-bit quantization via bitsandbytes requires an NVIDIA GPU with CUDA.
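When debugging import errors, it helps to first confirm which of the three packages your interpreter can actually see. A small helper for that (hypothetical, not part of any library):

```python
import importlib.util

def missing_8bit_packages():
    """Return the 8-bit quantization prerequisites that are not importable."""
    required = ("transformers", "bitsandbytes", "accelerate")
    return [name for name in required if importlib.util.find_spec(name) is None]

# An empty list means all three packages are visible to this interpreter;
# anything listed here still needs `pip install <name>`.
print(missing_8bit_packages())
```

If a package appears here despite being installed, you are likely running a different Python environment than the one you installed into.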
Key takeaways
- Set `load_in_8bit=True` in `from_pretrained()` to enable 8-bit quantization in Hugging Face models.
- Install `bitsandbytes` and `accelerate` to support efficient 8-bit loading and device mapping.
- 8-bit quantization roughly halves GPU memory usage compared to fp16 without significant accuracy loss.
- Ensure your environment has a CUDA-compatible GPU and a matching `bitsandbytes` version.
- Use `device_map="auto"` to automatically place model layers on available GPUs.
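As a back-of-the-envelope check on the memory savings: weights alone take about 2 bytes per parameter in fp16 and 1 byte in int8 (activations, the KV cache, and quantization overhead come on top; the 1.3e9 parameter count below is approximate):

```python
def approx_weight_memory_gib(num_params, bits):
    """Rough weight-only memory estimate in GiB; ignores runtime overhead."""
    return num_params * bits / 8 / 1024**3

opt_1_3b = 1_300_000_000  # approximate parameter count of facebook/opt-1.3b
fp16_gib = approx_weight_memory_gib(opt_1_3b, 16)
int8_gib = approx_weight_memory_gib(opt_1_3b, 8)
print(f"fp16: {fp16_gib:.2f} GiB, int8: {int8_gib:.2f} GiB")  # int8 is half of fp16
```

This is why a model that barely fits a GPU in fp16 often loads comfortably in 8-bit.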