How to · Intermediate · 4 min read

How to benchmark quantized models

Quick answer
To benchmark a quantized model, measure its inference latency, memory usage, and accuracy against the original full-precision model on representative inputs, profiling on the target hardware. In PyTorch, load the quantized model through transformers' BitsAndBytesConfig and collect the metrics with a standard benchmarking script.

PREREQUISITES

  • Python 3.8+
  • pip install torch transformers bitsandbytes
  • Basic knowledge of PyTorch and model quantization

Set up the environment

Install necessary Python packages including torch, transformers, and bitsandbytes for quantized model loading and benchmarking.

bash
pip install torch transformers bitsandbytes

Step-by-step benchmarking

Load a quantized model with BitsAndBytesConfig, run inference on sample inputs, and measure latency and accuracy compared to the original model.

python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load quantized model config
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# Load tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load quantized model
model_quant = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)
model_quant.eval()

# Load full-precision model for comparison
# (fp32 weights need ~4 bytes per parameter; on limited GPU memory,
# pass torch_dtype=torch.float16 for a half-precision baseline instead)
model_fp = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model_fp.eval()

@torch.no_grad()  # inference only: disable autograd bookkeeping
def benchmark_model(model, inputs, runs=10):
    # Warmup runs to trigger kernel compilation and caching
    for _ in range(3):
        _ = model(**inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure warmup kernels have finished

    start = time.perf_counter()
    for _ in range(runs):
        _ = model(**inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    avg_latency = (time.perf_counter() - start) / runs
    return avg_latency

# Prepare input
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model_quant.device) for k, v in inputs.items()}

# Benchmark quantized model
latency_quant = benchmark_model(model_quant, inputs)

# Benchmark full precision model
inputs_fp = {k: v.to(model_fp.device) for k, v in inputs.items()}
latency_fp = benchmark_model(model_fp, inputs_fp)

print(f"Quantized model latency: {latency_quant:.4f} seconds")
print(f"Full precision model latency: {latency_fp:.4f} seconds")
output
Quantized model latency: 0.1200 seconds
Full precision model latency: 0.3500 seconds
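Average latency per call can also be converted into throughput by timing batched calls. A minimal, framework-agnostic sketch (the helper name measure_throughput is ours, not a library API; in practice pass a lambda wrapping model(**inputs) and use the batch size as the item count):

```python
import time

def measure_throughput(fn, batch, runs=10):
    # Time `runs` calls of `fn` on `batch` and report items processed per second.
    start = time.perf_counter()
    for _ in range(runs):
        fn(batch)
    elapsed = time.perf_counter() - start
    return (len(batch) * runs) / elapsed

# Stand-in workload for illustration; substitute the model call in practice.
print(f"{measure_throughput(lambda b: [x * x for x in b], list(range(32))):.0f} items/s")
```

For GPU models, add torch.cuda.synchronize() before reading the clock, as in benchmark_model above, so queued kernels are included in the measurement.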

Common variations

  • Use load_in_8bit=True in BitsAndBytesConfig for 8-bit quantization.
  • Benchmark on GPU vs CPU by moving models and inputs accordingly.
  • Measure memory usage with tools like torch.cuda.memory_allocated() or psutil.
  • Use async inference or batch inputs for throughput benchmarking.
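For a quick memory comparison, the static weight footprint can be summed directly from the model's tensors; runtime activation memory additionally needs torch.cuda.max_memory_allocated() on the target device. A sketch, assuming param_memory_bytes is our own helper (not a transformers API), demonstrated here on a small fp32 vs fp16 layer:

```python
import torch
import torch.nn as nn

def param_memory_bytes(model):
    # Total bytes held by parameters and buffers (ignores activations).
    return sum(t.numel() * t.element_size()
               for t in list(model.parameters()) + list(model.buffers()))

fp32 = nn.Linear(1024, 1024)          # 1024*1024 weights + 1024 bias = 1,049,600 params
fp16 = nn.Linear(1024, 1024).half()   # same params at 2 bytes each

print(param_memory_bytes(fp32))  # 4198400
print(param_memory_bytes(fp16))  # 2099200
```

Note that bitsandbytes stores 4-bit weights in packed integer tensors, so this sum reflects the packed size rather than a logical element count.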

Troubleshooting tips

  • If you get CUDA out-of-memory errors, reduce batch size or use smaller quantized models.
  • Ensure bitsandbytes is installed correctly with GPU support.
  • Check device placement of inputs and model to avoid device mismatch errors.
  • Validate accuracy by comparing outputs with the full precision model on a test set.
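One cheap accuracy check along these lines is the fraction of positions where the quantized and full-precision models predict the same next token. A sketch (top1_agreement is a hypothetical helper; feed it the logits returned by each model on the same inputs):

```python
import torch

def top1_agreement(logits_a, logits_b):
    # Fraction of positions where the two models pick the same next token.
    return (logits_a.argmax(dim=-1) == logits_b.argmax(dim=-1)).float().mean().item()

# Sanity check: a model always agrees with itself.
logits = torch.randn(2, 5, 100)  # (batch, sequence, vocab)
print(top1_agreement(logits, logits))  # 1.0
```

For a fuller picture, also compare perplexity on a held-out test set, since top-1 agreement ignores how far the underlying distributions have drifted.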

Key Takeaways

  • Benchmark quantized models by measuring latency, memory, and accuracy against full precision baselines.
  • Use BitsAndBytesConfig in PyTorch to load 4-bit or 8-bit quantized models efficiently.
  • Profile on the target hardware (GPU or CPU) to get realistic performance metrics.
  • Validate output quality to ensure quantization does not degrade model accuracy significantly.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct