Fix quantized model slower than expected
Quick answer
If your quantized model is slower than expected, make sure you are running on a GPU with 4-bit or 8-bit kernel support, tune your batch size, and verify that your BitsAndBytesConfig is set up for efficient quantization. Also avoid CPU-only inference, and check that model loading uses device mapping and a mixed-precision compute dtype to maximize speed.
PREREQUISITES
- Python 3.8+
- pip install transformers bitsandbytes torch
- Access to a GPU with CUDA support
- Basic knowledge of Hugging Face Transformers
Setup quantized model
Install required libraries and prepare your environment for quantized model inference using bitsandbytes and transformers. Ensure you have a CUDA-enabled GPU for best performance.
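Before loading anything, it is worth confirming that PyTorch can actually see a CUDA GPU; a silent CPU fallback is the most common cause of "quantization made it slower". A minimal sketch of that check:

```python
import torch

# Quick environment check before loading a quantized model.
# bitsandbytes 4-bit/8-bit kernels need a CUDA-capable GPU to be fast.
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected -- quantized inference will fall back to CPU and be slow.")
```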
```
pip install transformers bitsandbytes torch
```

Step by step example
Load a quantized model with proper BitsAndBytesConfig and device mapping to ensure fast inference on GPU.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

output (the decoded text begins with the prompt; the generated continuation follows)

```
Hello, how are you?
```
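To decide whether the model is genuinely slow, it helps to measure throughput rather than guess. A minimal timing sketch (the `tokens_per_second` helper is hypothetical; in practice you would pass `lambda: model.generate(**inputs, max_new_tokens=20)` from the example above, while here a `time.sleep` stand-in keeps the sketch runnable on its own):

```python
import time
import torch

def tokens_per_second(generate_fn, n_new_tokens):
    # Synchronize so CUDA kernel launches are included in the measurement.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    generate_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_new_tokens / (time.perf_counter() - start)

# Toy stand-in for model.generate: sleeping 0.1 s for 20 "tokens"
rate = tokens_per_second(lambda: time.sleep(0.1), n_new_tokens=20)
print(round(rate))  # roughly 200 when the sleep dominates
```

Comparing this number against an fp16 baseline of the same model tells you whether quantization itself is the bottleneck.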
Common variations
You can switch to 8-bit quantization by replacing load_in_4bit with load_in_8bit and updating the config accordingly. In CPU-only environments, bitsandbytes quantization generally will not speed up inference and can even slow it down.
```python
quantization_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_8bit,
    device_map="auto"
)
```

Troubleshooting performance issues
- If inference is slow, verify your GPU is being used with nvidia-smi.
- Ensure batch sizes are optimized; very small batches can underutilize the GPU.
- Check that bnb_4bit_compute_dtype is set to torch.float16 or torch.bfloat16 for faster mixed-precision computation.
- Avoid running quantized models on CPU-only hardware, as it can degrade performance.
- Update bitsandbytes and transformers to the latest versions for bug fixes and optimizations.
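With device_map="auto", layers that do not fit on the GPU are silently offloaded to CPU (or disk), which is another frequent cause of slow quantized inference. A minimal sketch for checking where a model's weights actually live (the `summarize_device_placement` helper is hypothetical, and a tiny plain nn.Linear stands in for the real model so the sketch runs anywhere):

```python
import torch
from torch import nn

def summarize_device_placement(model):
    # Count parameters per device; anything on 'cpu' alongside 'cuda:0'
    # indicates offloading and a likely slowdown.
    devices = {}
    for name, param in model.named_parameters():
        key = str(param.device)
        devices[key] = devices.get(key, 0) + 1
    return devices

toy = nn.Linear(4, 4)  # stand-in for a model loaded with device_map="auto"
print(summarize_device_placement(toy))  # {'cpu': 2} (weight and bias)
```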
Key Takeaways
- Use GPU with CUDA and device mapping for quantized model inference to maximize speed.
- Set bnb_4bit_compute_dtype to torch.float16 for efficient mixed precision.
- Avoid CPU-only inference for quantized models to prevent slowdowns.
- Optimize batch size to better utilize hardware during inference.
- Keep bitsandbytes and transformers libraries updated for best performance.
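The batch-size takeaway can be illustrated with a toy sketch (hand-made token IDs instead of a real tokenizer, so it needs no model download): padding several sequences into one tensor lets the GPU process them in a single forward pass instead of one underutilized pass per prompt.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two "tokenized prompts" of different lengths (hypothetical token IDs).
seqs = [torch.tensor([101, 7, 42]), torch.tensor([101, 9])]

# Pad to a single (batch_size, max_seq_len) tensor for one batched forward pass.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([2, 3])
```

In practice, tokenizer(prompts, return_tensors="pt", padding=True) does this padding for you.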