Quantization for CPU inference
Quick answer
Quantization reduces the precision of model weights (e.g., from 16-bit floats to 8-bit or 4-bit integers) to cut memory usage and speed up inference. The bitsandbytes integration in transformers can load models in 4-bit or 8-bit precision; note that bitsandbytes has historically targeted CUDA GPUs, so check that your installed version supports CPU-only deployment before relying on it.
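To make the idea concrete, here is a minimal, illustrative sketch of symmetric per-tensor int8 quantization in plain Python. This is not how bitsandbytes implements it internally; the function names are mine.

```python
# Illustrative symmetric int8 quantization: map floats to [-127, 127]
# with a single per-tensor scale, then map back.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to 127
    q = [round(w / scale) for w in weights]       # integer codes
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]                 # approximate originals

weights = [0.42, -1.27, 0.08, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q)         # integer codes, e.g. [42, -127, 8, 90]
print(restored)  # close to the original floats
```

Each weight is stored as one small integer plus a shared scale, which is where the memory savings come from; the rounding step is the source of the (usually small) accuracy loss.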
Prerequisites
- Python 3.8+
- pip install transformers bitsandbytes torch
- Basic knowledge of PyTorch and Hugging Face Transformers
Setup
Install the required Python packages for quantization and CPU inference. bitsandbytes enables 4-bit and 8-bit quantization, while transformers provides model loading and tokenization.
pip install transformers bitsandbytes torch

Step by step
Load a Hugging Face model with 4-bit quantization using BitsAndBytesConfig. This shrinks the memory footprint and can speed up inference on memory-bound hardware, though actual gains depend on your backend support.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# Configure 4-bit quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
# Load tokenizer and model with quantization config
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)
# Prepare input
input_text = "Explain quantization for CPU inference."
inputs = tokenizer(input_text, return_tensors="pt")
# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output
Explain quantization for CPU inference. Quantization reduces the precision of model weights from floating point to lower bit integers, which reduces memory usage and speeds up computation on CPUs.
Common variations
You can use 8-bit quantization by setting load_in_8bit=True in BitsAndBytesConfig; it typically preserves accuracy better than 4-bit at the cost of roughly twice the memory. Async inference or streaming outputs require additional frameworks such as vLLM or custom wrappers. Not all architectures support every quantization scheme, so always check model compatibility.
from transformers import BitsAndBytesConfig
# 8-bit quantization config example
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
# Load model with 8-bit quantization
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quant_config_8bit,
    device_map="auto"
)

Troubleshooting
- If you see RuntimeError: CUDA not available on a CPU-only machine, set device_map={"": "cpu"} explicitly rather than relying on device_map="auto". Note that older bitsandbytes releases require a CUDA GPU and will fail on CPU-only systems regardless of device_map.
- Quantization may reduce model accuracy; test outputs carefully.
- Ensure bitsandbytes is installed correctly; it has historically required a CUDA GPU on Linux, and CPU/multi-backend support is newer, so consult the project's documentation for your platform.
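The device_map fix above can be sketched as a small helper. Here, pick_device_map is a hypothetical name of my own, not a transformers API; the returned value is what you would pass to from_pretrained(..., device_map=...).

```python
# Hypothetical helper: choose a device_map for from_pretrained().
# Mapping the empty module name "" to "cpu" forces every submodule
# of the model onto the CPU on CPU-only machines.
def pick_device_map(cuda_available: bool):
    return "auto" if cuda_available else {"": "cpu"}

print(pick_device_map(False))  # {'': 'cpu'}
print(pick_device_map(True))   # auto
```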
Key Takeaways
- Use bitsandbytes with transformers to load models in 4-bit or 8-bit precision; confirm that your bitsandbytes build supports your target hardware before deploying to CPU-only systems.
- Quantization reduces memory and speeds up inference but may slightly impact accuracy.
- Always specify device_map correctly to avoid runtime errors on CPU-only systems.
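As a back-of-envelope check of the memory savings mentioned above, weight storage scales linearly with bits per weight. The sketch below ignores activations, the KV cache, and quantization overhead such as stored scales, so real footprints will be somewhat larger.

```python
# Approximate weight memory in GiB for a model at a given precision.
def weight_gib(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8 / 2**30

params = 8_000_000_000  # an 8B-parameter model
fp16 = weight_gib(params, 16)  # ~14.9 GiB
int8 = weight_gib(params, 8)   # ~7.5 GiB
int4 = weight_gib(params, 4)   # ~3.7 GiB
print(f"fp16 ~{fp16:.1f} GiB, int8 ~{int8:.1f} GiB, 4-bit ~{int4:.1f} GiB")
```

This is why 4-bit loading can bring an 8B model within reach of commodity RAM where the fp16 weights alone would not fit.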