Fix quantized model slower than expected
Quick answer
If your quantized model is slower than expected, make sure you are running on a GPU with 4-bit or 8-bit kernel support, tune your batch size, and verify that your BitsAndBytesConfig is set up for efficient quantization. Also avoid CPU-only inference, and check that model loading uses device mapping and a mixed-precision compute dtype to maximize speed.
PREREQUISITES
- Python 3.8+
- pip install transformers bitsandbytes torch
- Access to a GPU with CUDA support
- Basic knowledge of Hugging Face Transformers
Setup quantized model
Install required libraries and prepare your environment for quantized model inference using bitsandbytes and transformers. Ensure you have a CUDA-enabled GPU for best performance.
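Before loading anything, it is worth confirming that PyTorch can actually see a CUDA GPU; a silent CPU fallback is the most common cause of "quantization made it slower". A minimal sketch of that check:

```python
import torch

# Quick environment check before loading a quantized model.
# bitsandbytes 4-bit/8-bit kernels need a CUDA-capable GPU to be fast.
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected -- quantized inference will fall back to CPU and be slow.")
```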
```
pip install transformers bitsandbytes torch
```

Step by step example
Load a quantized model with proper BitsAndBytesConfig and device mapping to ensure fast inference on GPU.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

output (the decoded text begins with the prompt; the generated continuation follows)

```
Hello, how are you?
```
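To decide whether the model is genuinely slow, it helps to measure throughput rather than guess. A minimal timing sketch (the `tokens_per_second` helper is hypothetical; in practice you would pass `lambda: model.generate(**inputs, max_new_tokens=20)` from the example above, while here a `time.sleep` stand-in keeps the sketch runnable on its own):

```python
import time
import torch

def tokens_per_second(generate_fn, n_new_tokens):
    # Synchronize so CUDA kernel launches are included in the measurement.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    generate_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_new_tokens / (time.perf_counter() - start)

# Toy stand-in for model.generate: sleeping 0.1 s for 20 "tokens"
rate = tokens_per_second(lambda: time.sleep(0.1), n_new_tokens=20)
print(round(rate))  # roughly 200 when the sleep dominates
```

Comparing this number against an fp16 baseline of the same model tells you whether quantization itself is the bottleneck.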
Common variations
You can switch to 8-bit quantization by replacing load_in_4bit with load_in_8bit and updating the config accordingly. In CPU-only environments, bitsandbytes quantization generally will not speed up inference and can even slow it down.
```python
quantization_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config_8bit,
    device_map="auto"
)
```

Troubleshooting performance issues
- If inference is slow, verify your GPU is being used with nvidia-smi.
- Ensure batch sizes are optimized; very small batches can underutilize the GPU.
- Check that bnb_4bit_compute_dtype is set to torch.float16 or torch.bfloat16 for faster mixed-precision computation.
- Avoid running quantized models on CPU-only hardware, as it can degrade performance.
- Update bitsandbytes and transformers to the latest versions for bug fixes and optimizations.
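With device_map="auto", layers that do not fit on the GPU are silently offloaded to CPU (or disk), which is another frequent cause of slow quantized inference. A minimal sketch for checking where a model's weights actually live (the `summarize_device_placement` helper is hypothetical, and a tiny plain nn.Linear stands in for the real model so the sketch runs anywhere):

```python
import torch
from torch import nn

def summarize_device_placement(model):
    # Count parameters per device; anything on 'cpu' alongside 'cuda:0'
    # indicates offloading and a likely slowdown.
    devices = {}
    for name, param in model.named_parameters():
        key = str(param.device)
        devices[key] = devices.get(key, 0) + 1
    return devices

toy = nn.Linear(4, 4)  # stand-in for a model loaded with device_map="auto"
print(summarize_device_placement(toy))  # {'cpu': 2} (weight and bias)
```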
Key Takeaways
- Use GPU with CUDA and device mapping for quantized model inference to maximize speed.
- Set bnb_4bit_compute_dtype to torch.float16 for efficient mixed precision.
- Avoid CPU-only inference for quantized models to prevent slowdowns.
- Optimize batch size to better utilize hardware during inference.
- Keep bitsandbytes and transformers libraries updated for best performance.
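The batch-size takeaway can be illustrated with a toy sketch (hand-made token IDs instead of a real tokenizer, so it needs no model download): padding several sequences into one tensor lets the GPU process them in a single forward pass instead of one underutilized pass per prompt.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two "tokenized prompts" of different lengths (hypothetical token IDs).
seqs = [torch.tensor([101, 7, 42]), torch.tensor([101, 9])]

# Pad to a single (batch_size, max_seq_len) tensor for one batched forward pass.
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([2, 3])
```

In practice, tokenizer(prompts, return_tensors="pt", padding=True) does this padding for you.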