How to load an 8-bit model with Hugging Face
Quick answer
Use Hugging Face Transformers with BitsAndBytesConfig to load models in 8-bit precision: set load_in_8bit=True in the config and pass it to from_pretrained(). This roughly halves the memory needed for model weights compared with fp16, with minimal accuracy loss.

Prerequisites
- Python 3.8+
- pip install transformers>=4.30.0
- pip install bitsandbytes
- A CUDA-capable GPU
- PyTorch installed with CUDA support
Setup
Install transformers and bitsandbytes to enable 8-bit quantization support, and make sure your environment has a CUDA-capable GPU with a CUDA-enabled PyTorch build.
pip install transformers bitsandbytes

Step by step
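Before configuring anything, it helps to confirm that PyTorch can actually see a CUDA device, since the bitsandbytes int8 kernels require one (a minimal sketch; assumes torch is installed):

```python
import torch

# bitsandbytes' int8 kernels run on CUDA GPUs, so check availability up front.
if torch.cuda.is_available():
    print(f"CUDA device found: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; 8-bit loading will fail on this machine.")
```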
Use BitsAndBytesConfig from transformers to configure 8-bit loading, then load the model with load_in_8bit=True. This example loads a causal language model in 8-bit mode.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Configure 8-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Load model with 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=quantization_config,
    device_map="auto"
)
# Test inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output
Hello, how are you? I am fine, thank you for asking.
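The memory savings can be estimated with simple arithmetic. The sketch below uses a rough parameter count for gpt2 (about 124M; an assumption, not measured from the loaded model) to compare weight storage at different precisions:

```python
# Back-of-envelope estimate of weight storage at different precisions.
# Real usage is higher: activations, the KV cache, and framework
# overhead are not included.

def weight_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in MiB."""
    return num_params * bytes_per_param / 1024**2

params = 124_000_000  # rough parameter count of gpt2 (assumption)

fp32 = weight_memory_mb(params, 4)  # 32-bit floats: 4 bytes per weight
fp16 = weight_memory_mb(params, 2)  # 16-bit floats: 2 bytes per weight
int8 = weight_memory_mb(params, 1)  # 8-bit ints:    1 byte per weight

print(f"fp32: {fp32:.0f} MiB, fp16: {fp16:.0f} MiB, int8: {int8:.0f} MiB")
```

For a measured number rather than an estimate, Transformers models also expose model.get_memory_footprint(), which reports the bytes actually used by the loaded weights.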
Common variations
- Use load_in_4bit=True in BitsAndBytesConfig for 4-bit quantization instead.
- When loading in 4-bit, set bnb_4bit_use_double_quant=True in BitsAndBytesConfig to save additional memory (there is no 8-bit double-quant option).
- Use device_map="auto" to automatically place model layers on available GPUs.
- For CPU inference, bitsandbytes 8-bit loading is not supported; use full precision instead.
Troubleshooting
- If you get an ImportError for bitsandbytes, ensure it is installed and compatible with your CUDA version.
- If device_map="auto" fails, try assigning devices manually (for example, pinning the whole model to a single GPU); note that an 8-bit model cannot simply fall back to CPU.
- Out-of-memory errors may require reducing the batch size or switching to 4-bit quantization.
Key Takeaways
- Use BitsAndBytesConfig(load_in_8bit=True) to enable 8-bit quantization in Hugging Face models.
- Loading models in 8-bit roughly halves weight memory compared with fp16, with minimal accuracy loss; int8 inference is often somewhat slower than fp16, however, not faster.
- Ensure bitsandbytes and compatible CUDA drivers are installed for 8-bit support.
- Use device_map="auto" to automatically distribute model layers across GPUs.
- If memory issues occur, consider 4-bit quantization or smaller batch sizes.