How to load a model in 4-bit quantization with Hugging Face
Quick answer
Use the transformers library with bitsandbytes to load Hugging Face models in 4-bit quantization by setting load_in_4bit=True in from_pretrained. This cuts weight memory to roughly a quarter of fp16 with only a small accuracy trade-off.

Prerequisites
- Python 3.8+
- pip install transformers>=4.30.0
- pip install bitsandbytes
- pip install accelerate
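To see why 4-bit loading matters, here is a back-of-the-envelope estimate of weight memory at different precisions (a sketch: the 1.3e9 parameter count is an approximation for facebook/opt-1.3b, and real usage adds activations, the KV cache, and quantization overhead on top):

```python
# Rough memory-footprint estimate for model weights at different precisions.
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB (weights only, no overhead)."""
    return num_params * bits_per_param / 8 / 1024**3

params = 1.3e9  # approximate size of facebook/opt-1.3b
for bits, label in [(16, "fp16"), (8, "int8"), (4, "4-bit")]:
    print(f"{label:>5}: ~{weight_memory_gib(params, bits):.2f} GiB")
```

Halving the bit width halves the weight footprint, so 4-bit weights need about a quarter of the fp16 memory.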
Setup
Install the required libraries transformers, bitsandbytes, and accelerate to enable 4-bit quantization support.
pip install transformers bitsandbytes accelerate

Step by step
Load a Hugging Face model with 4-bit quantization using transformers and bitsandbytes. This example loads the facebook/opt-1.3b model in 4-bit mode for efficient inference.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "facebook/opt-1.3b"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_4bit=True,
device_map="auto"
)
# Encode input and generate output
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output
Hello, how are you? I am doing well, thank you for asking.
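The bare load_in_4bit=True flag works, but newer transformers releases route these options through a BitsAndBytesConfig object, which also exposes the 4-bit tuning knobs (quantization type, compute dtype, double quantization). A sketch of the equivalent call; this is a configuration fragment that assumes a CUDA-capable GPU and downloads the model weights:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 4-bit setup expressed as an explicit quantization config.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the common default
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls at runtime
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Passing the config explicitly makes the quantization settings visible in one place and lets you change the compute dtype or quantization type without touching the loading call.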
Common variations
- Use load_in_8bit=True for 8-bit quantization instead of 4-bit.
- Specify device_map manually for multi-GPU setups.
- Use the transformers pipeline with 4-bit models by passing model_kwargs={"load_in_4bit": True}.
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="facebook/opt-1.3b",
model_kwargs={"load_in_4bit": True, "device_map": "auto"}
)
print(pipe("Hello, world!", max_new_tokens=20)[0]['generated_text'])

Output
Hello, world! I hope you are having a great day.
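For the multi-GPU variation above, device_map can also be an explicit dict mapping module names to device indices. A hypothetical split of facebook/opt-1.3b's 24 decoder layers across two GPUs; the module names follow the OPT architecture, but verify them against model.named_modules() before relying on this layout:

```python
# Hypothetical manual device_map: first half of the decoder on GPU 0,
# second half plus the output head on GPU 1. Module names are specific
# to the OPT architecture and will differ for other model families.
device_map = {
    "model.decoder.embed_tokens": 0,
    "model.decoder.embed_positions": 0,
    **{f"model.decoder.layers.{i}": (0 if i < 12 else 1) for i in range(24)},
    "model.decoder.final_layer_norm": 1,
    "lm_head": 1,
}
# Used as: AutoModelForCausalLM.from_pretrained(..., device_map=device_map)
```

A manual map like this is useful when 'auto' places layers unevenly or when one GPU must also hold other workloads.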
Troubleshooting
- If you see bitsandbytes import errors, ensure it is installed and your CUDA version is compatible.
- For device_map errors, try setting device_map='auto' or manually assign devices.
- If memory errors occur, verify your GPU has enough VRAM, or try 8-bit quantization.
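The first troubleshooting step can be automated with a small dependency probe (a sketch: it only checks that each package is importable, not that its CUDA build actually works on your hardware):

```python
# Minimal pre-flight check before loading a quantized model.
import importlib.util

def has_module(name: str) -> bool:
    """Return True if a module can be located without fully importing it."""
    return importlib.util.find_spec(name) is not None

for mod in ("transformers", "bitsandbytes", "accelerate", "torch"):
    status = "found" if has_module(mod) else "MISSING -> pip install " + mod
    print(f"{mod:>13}: {status}")
```

Using find_spec avoids triggering each library's import-time CUDA initialization, so the probe stays fast and cannot itself crash on a broken GPU setup.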
Key Takeaways
- Use load_in_4bit=True in from_pretrained to enable 4-bit quantized loading.
- Install bitsandbytes and accelerate to support quantized model loading.
- Set device_map='auto' to automatically place model layers on available GPUs.
- 4-bit quantization cuts weight memory to roughly a quarter of fp16 with minimal accuracy loss.
- Troubleshoot by verifying CUDA compatibility and GPU memory availability.