How to load a model in 8-bit quantization with Hugging Face
Quick answer
Use the `transformers` library with its bitsandbytes integration to load models in 8-bit quantization by setting `load_in_8bit=True` in `from_pretrained()`. This roughly halves the weight memory footprint compared to fp16 while largely preserving accuracy; inference speed is typically comparable, not faster.
Prerequisites
- Python 3.8+
- `pip install transformers bitsandbytes accelerate`
- A compatible NVIDIA GPU with CUDA support for 8-bit quantization
Setup
Install the required libraries to enable 8-bit quantization support in Hugging Face models. You need transformers, bitsandbytes, and accelerate for efficient loading and inference.
```shell
pip install transformers bitsandbytes accelerate
```

Step by step
Load a Hugging Face transformer model in 8-bit quantization mode using from_pretrained() with load_in_8bit=True. This example loads facebook/opt-1.3b in 8-bit on GPU.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # enable 8-bit quantization
    device_map="auto",  # automatically place model layers on available GPUs
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Output
Hello, how are you? I am doing well, thank you.
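To see why 8-bit storage loses so little accuracy, here is a minimal, self-contained sketch of absmax quantization, the per-tensor scheme that bitsandbytes' LLM.int8() approach builds on (plain Python for illustration only; the library's real implementation is vectorized and handles outlier features separately):

```python
# Toy absmax 8-bit quantization: scale weights so the largest magnitude
# maps to 127, round to integers, then dequantize with the same scale.
def quantize_absmax(weights):
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    return [v * scale for v in quantized]

weights = [0.31, -1.2, 0.005, 0.77, -0.42]
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

assert all(-127 <= v <= 127 for v in quantized)  # fits in a signed byte
assert max_err <= scale / 2  # rounding error is at most half a quantization step
```

Each weight is stored as one signed byte plus a single shared scale factor, which is where the memory saving comes from; the round-trip error stays bounded by half a quantization step.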
Common variations
- Use `device_map="auto"` to automatically distribute model layers across available GPUs.
- For CPU-only usage, 8-bit quantization via bitsandbytes is not supported; use full precision instead.
- Use `load_in_4bit=True` with compatible models and libraries for an even smaller memory footprint.
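One more variation worth knowing: recent transformers releases group quantization flags into a `BitsAndBytesConfig` object passed via `quantization_config`; passing `load_in_8bit=True` directly still works on many versions but may emit a deprecation warning. A sketch of the newer form (it still requires bitsandbytes and a CUDA GPU to actually load):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Group the quantization settings into a config object.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quant_config,  # replaces load_in_8bit=True
    device_map="auto",
)
```

The same `BitsAndBytesConfig` object is also where 4-bit options live, so switching schemes means changing the config rather than the `from_pretrained()` call.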
Troubleshooting
- If you get `bitsandbytes` import errors, ensure it is installed and your CUDA version is compatible.
- Out-of-memory errors can be mitigated by using smaller models or by offloading layers with a custom `device_map`.
- Check GPU compatibility; 8-bit quantization via bitsandbytes requires an NVIDIA GPU with CUDA.
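When debugging import errors, it helps to first confirm which of the three packages your interpreter can actually see. A small helper for that (hypothetical, not part of any library):

```python
import importlib.util

def missing_8bit_packages():
    """Return the 8-bit quantization prerequisites that are not importable."""
    required = ("transformers", "bitsandbytes", "accelerate")
    return [name for name in required if importlib.util.find_spec(name) is None]

# An empty list means all three packages are visible to this interpreter;
# anything listed here still needs `pip install <name>`.
print(missing_8bit_packages())
```

If a package appears here despite being installed, you are likely running a different Python environment than the one you installed into.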
Key takeaways
- Set `load_in_8bit=True` in `from_pretrained()` to enable 8-bit quantization in Hugging Face models.
- Install `bitsandbytes` and `accelerate` to support efficient 8-bit loading and device mapping.
- 8-bit quantization roughly halves GPU memory usage compared to fp16 without significant accuracy loss.
- Ensure your environment has a CUDA-compatible GPU and a matching `bitsandbytes` version.
- Use `device_map="auto"` to automatically place model layers on available GPUs.
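As a back-of-the-envelope check on the memory savings: weights alone take about 2 bytes per parameter in fp16 and 1 byte in int8 (activations, the KV cache, and quantization overhead come on top; the 1.3e9 parameter count below is approximate):

```python
def approx_weight_memory_gib(num_params, bits):
    """Rough weight-only memory estimate in GiB; ignores runtime overhead."""
    return num_params * bits / 8 / 1024**3

opt_1_3b = 1_300_000_000  # approximate parameter count of facebook/opt-1.3b
fp16_gib = approx_weight_memory_gib(opt_1_3b, 16)
int8_gib = approx_weight_memory_gib(opt_1_3b, 8)
print(f"fp16: {fp16_gib:.2f} GiB, int8: {int8_gib:.2f} GiB")  # int8 is half of fp16
```

This is why a model that barely fits a GPU in fp16 often loads comfortably in 8-bit.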