
Fixing a quantized model that is slower than expected

Quick answer
If your quantized model is slower than expected, make sure inference runs on a GPU with 4-bit or 8-bit kernel support, that your BitsAndBytesConfig is set up for efficient compute (for example bnb_4bit_compute_dtype=torch.float16), and that the model is loaded with device_map so the weights actually land on the GPU. Avoid CPU-only inference: bitsandbytes quantization on CPU typically makes inference slower, not faster.

PREREQUISITES

  • Python 3.8+
  • pip install transformers bitsandbytes torch
  • Access to GPU with CUDA support
  • Basic knowledge of Hugging Face Transformers

Set up the quantized model

Install required libraries and prepare your environment for quantized model inference using bitsandbytes and transformers. Ensure you have a CUDA-enabled GPU for best performance.

bash
pip install transformers bitsandbytes torch
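Before loading anything, it helps to confirm that PyTorch can actually see a CUDA device, since the bitsandbytes 4-bit/8-bit kernels only pay off on GPU. A quick sanity check:

```python
import torch

# bitsandbytes quantization only accelerates inference on a CUDA GPU;
# on CPU it typically makes things slower, not faster.
if torch.cuda.is_available():
    print(f"CUDA GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected -- quantized inference will be slow.")
```

If this reports no GPU, fix your driver/CUDA setup before debugging anything else.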

Step by step example

Load a quantized model with proper BitsAndBytesConfig and device mapping to ensure fast inference on GPU.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Hello, how are you? ...

The decoded text starts with the prompt, followed by up to 20 newly generated tokens; the exact continuation depends on the model and generation settings.

Common variations

You can adjust quantization to 8-bit by changing load_in_4bit to load_in_8bit and updating the config accordingly. For CPU-only environments, quantization may not speed up inference and can even slow it down.

python
quantization_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_8bit,
    device_map="auto"
)

Troubleshooting performance issues

  • If inference is slow, verify your GPU is being used with nvidia-smi.
  • Ensure batch sizes are optimized; very small batches can underutilize GPU.
  • Check that bnb_4bit_compute_dtype is set to torch.float16 or torch.bfloat16 for faster mixed precision computation.
  • Avoid running quantized models on CPU-only hardware as it can degrade performance.
  • Update bitsandbytes and transformers to latest versions for bug fixes and optimizations.
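To put a number on "slow", a small timing helper is enough to compare configurations. The tokens_per_second helper below is illustrative, not part of any library:

```python
import time

def tokens_per_second(generate_fn, num_new_tokens):
    """Time one generation call and return throughput in tokens/sec."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return num_new_tokens / elapsed

# Example usage with the model loaded earlier (hypothetical):
# tps = tokens_per_second(
#     lambda: model.generate(**inputs, max_new_tokens=64), 64
# )
# print(f"{tps:.1f} tokens/s")
```

Comparing this figure across 4-bit, 8-bit, and unquantized loading on the same prompt usually makes the bottleneck obvious: if throughput barely changes with batch size or precision, the model is probably running on CPU.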

Key Takeaways

  • Use GPU with CUDA and device mapping for quantized model inference to maximize speed.
  • Set bnb_4bit_compute_dtype to torch.float16 for efficient mixed precision.
  • Avoid CPU-only inference for quantized models to prevent slowdowns.
  • Optimize batch size to better utilize hardware during inference.
  • Keep bitsandbytes and transformers libraries updated for best performance.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct