Quantization for CPU inference
Quick answer
Quantization reduces the precision of model weights (e.g., from 16-bit floats to 8-bit or 4-bit integers) to cut memory usage and speed up inference. The bitsandbytes integration in transformers can load models in 4-bit or 8-bit precision; note that bitsandbytes has historically targeted CUDA GPUs, so check that your installed version supports CPU-only deployment before relying on it.
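To make the idea concrete, here is a minimal, illustrative sketch of symmetric per-tensor int8 quantization in plain Python. This is not how bitsandbytes implements it internally; the function names are mine.

```python
# Illustrative symmetric int8 quantization: map floats to [-127, 127]
# with a single per-tensor scale, then map back.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to 127
    q = [round(w / scale) for w in weights]       # integer codes
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]                 # approximate originals

weights = [0.42, -1.27, 0.08, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q)         # integer codes, e.g. [42, -127, 8, 90]
print(restored)  # close to the original floats
```

Each weight is stored as one small integer plus a shared scale, which is where the memory savings come from; the rounding step is the source of the (usually small) accuracy loss.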
Prerequisites
- Python 3.8+
- pip install transformers bitsandbytes torch
- Basic knowledge of PyTorch and Hugging Face Transformers
Setup
Install the required Python packages for quantization and CPU inference. bitsandbytes enables 4-bit and 8-bit quantization, while transformers provides model loading and tokenization.
pip install transformers bitsandbytes torch

Step by step
Load a Hugging Face model with 4-bit quantization using BitsAndBytesConfig. This shrinks the memory footprint and can speed up inference on memory-bound hardware, though actual gains depend on your backend support.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# Configure 4-bit quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
# Load tokenizer and model with quantization config
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)
# Prepare input
input_text = "Explain quantization for CPU inference."
inputs = tokenizer(input_text, return_tensors="pt")
# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output
Explain quantization for CPU inference. Quantization reduces the precision of model weights from floating point to lower bit integers, which reduces memory usage and speeds up computation on CPUs.
Common variations
You can use 8-bit quantization by setting load_in_8bit=True in BitsAndBytesConfig; it typically preserves accuracy better than 4-bit at the cost of roughly twice the memory. Async inference or streaming outputs require additional frameworks such as vLLM or custom wrappers. Not all architectures support every quantization scheme, so always check model compatibility.
from transformers import BitsAndBytesConfig
# 8-bit quantization config example
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
# Load model with 8-bit quantization
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config_8bit,
    device_map="auto"
)

Troubleshooting
- If you see RuntimeError: CUDA not available on a CPU-only machine, set device_map={"": "cpu"} explicitly rather than relying on device_map="auto". Note that older bitsandbytes releases require a CUDA GPU and will fail on CPU-only systems regardless of device_map.
- Quantization may reduce model accuracy; test outputs carefully.
- Ensure bitsandbytes is installed correctly; it has historically required a CUDA GPU on Linux, and CPU/multi-backend support is newer, so consult the project's documentation for your platform.
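The device_map fix above can be sketched as a small helper. Here, pick_device_map is a hypothetical name of my own, not a transformers API; the returned value is what you would pass to from_pretrained(..., device_map=...).

```python
# Hypothetical helper: choose a device_map for from_pretrained().
# Mapping the empty module name "" to "cpu" forces every submodule
# of the model onto the CPU on CPU-only machines.
def pick_device_map(cuda_available: bool):
    return "auto" if cuda_available else {"": "cpu"}

print(pick_device_map(False))  # {'': 'cpu'}
print(pick_device_map(True))   # auto
```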
Key Takeaways
- Use bitsandbytes with transformers to load models in 4-bit or 8-bit precision; confirm that your bitsandbytes build supports your target hardware before deploying to CPU-only systems.
- Quantization reduces memory and speeds up inference but may slightly impact accuracy.
- Always specify device_map correctly to avoid runtime errors on CPU-only systems.
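As a back-of-envelope check of the memory savings mentioned above, weight storage scales linearly with bits per weight. The sketch below ignores activations, the KV cache, and quantization overhead such as stored scales, so real footprints will be somewhat larger.

```python
# Approximate weight memory in GiB for a model at a given precision.
def weight_gib(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 2**30

params = 8_000_000_000  # an 8B-parameter model
fp16 = weight_gib(params, 16)  # ~14.9 GiB
int8 = weight_gib(params, 8)   # ~7.5 GiB
int4 = weight_gib(params, 4)   # ~3.7 GiB
print(f"fp16 ~{fp16:.1f} GiB, int8 ~{int8:.1f} GiB, 4-bit ~{int4:.1f} GiB")
```

This is why 4-bit loading can bring an 8B model within reach of commodity RAM where the fp16 weights alone would not fit.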