How-to · Intermediate · 4 min read

How to evaluate quantized model quality

Quick answer
Evaluate quantized model quality by comparing it against the original full-precision model on the same benchmarks. Use perplexity (for language modeling), accuracy (for classification), and inference latency to quantify the quality and speed trade-offs introduced by 4-bit or 8-bit quantization.

PREREQUISITES

  • Python 3.8+
  • pip install torch transformers bitsandbytes datasets
  • Basic knowledge of PyTorch and Hugging Face Transformers

Setup environment

Install necessary Python packages for loading and evaluating quantized models, including transformers, bitsandbytes for quantization support, and datasets for benchmark data.

bash
pip install torch transformers bitsandbytes datasets

Step by step evaluation

Load the original and quantized models, run them on a benchmark dataset, and compare metrics like perplexity or accuracy. Measure inference speed to assess latency impact.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset

# Load tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load full precision model
model_fp = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model_fp.eval()

# Load quantized model (4-bit)
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_q = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map="auto")
model_q.eval()

# Load evaluation dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# Prepare inputs (skip the empty lines that wikitext contains)
texts = [t for t in dataset["text"] if t.strip()][:10]  # small sample for demo

# Function to compute perplexity
import math

def perplexity(model, texts):
    model.eval()
    losses = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            losses.append(outputs.loss.item())
    # Perplexity is exp of the mean loss, not the mean of per-text perplexities
    return math.exp(sum(losses) / len(losses))

# Evaluate full precision
ppl_fp = perplexity(model_fp, texts)
print(f"Full precision model perplexity: {ppl_fp:.2f}")

# Evaluate quantized
ppl_q = perplexity(model_q, texts)
print(f"Quantized model perplexity: {ppl_q:.2f}")

# Measure inference latency
import time

def measure_latency(model, text, n_runs=3):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=20)  # warm-up: load kernels, fill caches
        start = time.time()
        for _ in range(n_runs):
            _ = model.generate(**inputs, max_new_tokens=20)
    return (time.time() - start) / n_runs

lat_fp = measure_latency(model_fp, texts[0])
lat_q = measure_latency(model_q, texts[0])
print(f"Full precision latency: {lat_fp*1000:.1f} ms")
print(f"Quantized latency: {lat_q*1000:.1f} ms")
output
Full precision model perplexity: 15.23
Quantized model perplexity: 16.87
Full precision latency: 350.2 ms
Quantized latency: 180.5 ms
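Raw numbers like these are easiest to interpret as relative figures. A small helper (shown here with the illustrative values from the run above) turns the two metric pairs into a perplexity-degradation percentage and a speedup factor:

```python
def summarize_tradeoff(ppl_fp, ppl_q, lat_fp, lat_q):
    """Summarize the quantization trade-off from paired metrics."""
    ppl_delta = (ppl_q - ppl_fp) / ppl_fp * 100.0  # % increase; higher perplexity is worse
    speedup = lat_fp / lat_q                        # >1 means the quantized model is faster
    return ppl_delta, speedup

# Using the example numbers from the output above (latencies in seconds):
delta, speedup = summarize_tradeoff(15.23, 16.87, 0.3502, 0.1805)
print(f"Perplexity degradation: {delta:.1f}%")  # Perplexity degradation: 10.8%
print(f"Speedup: {speedup:.2f}x")               # Speedup: 1.94x
```

A common rule of thumb is to accept quantization when the quality drop stays within a few percent while latency or memory improves substantially; the acceptable threshold depends on your application.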

Common variations

  • Use different quantization bit widths like 8-bit by adjusting BitsAndBytesConfig.
  • Evaluate on classification tasks using accuracy instead of perplexity.
  • Run asynchronous or batched inference for throughput benchmarking.
  • Test other open-weight models such as meta-llama/Llama-3.3-70B-Instruct or mistralai/Mistral-7B-Instruct-v0.3 (API-only models like gpt-4o-mini cannot be quantized locally with bitsandbytes).
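The first bullet only requires a different BitsAndBytesConfig. Below is a sketch of the 8-bit variant alongside a common higher-quality 4-bit setup (NF4 with double quantization); all parameters shown are standard transformers config options:

```python
import torch
from transformers import BitsAndBytesConfig

# 8-bit variant: swap load_in_4bit for load_in_8bit
quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit NF4 with double quantization, often closer to full-precision quality
quant_config_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
```

Pass either config as `quantization_config` to `from_pretrained`, exactly as in the 4-bit example above, and rerun the same perplexity and latency comparison.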

Troubleshooting tips

  • If quantized model accuracy drops significantly, try higher-bit quantization (e.g. 8-bit) or a different 4-bit scheme such as NF4.
  • Ensure device compatibility (e.g., CUDA GPU) for best performance.
  • Check for tokenizer mismatches causing input errors.
  • Use smaller batch sizes if running out of memory during evaluation.
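The last tip can be as simple as slicing the evaluation texts into smaller groups before passing them to the model. A minimal, model-free sketch of the batching helper:

```python
def batched(items, batch_size):
    """Split a list into batches; lower batch_size if evaluation hits OOM."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# With batch_size=4, ten texts become groups of 4, 4, and 2:
sizes = [len(b) for b in batched(list(range(10)), 4)]
print(sizes)  # [4, 4, 2]
```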

Key Takeaways

  • Compare quantized model metrics directly against full precision baselines for meaningful evaluation.
  • Use perplexity for language modeling and accuracy for classification tasks to measure quality.
  • Measure inference latency to understand performance gains from quantization.
  • Adjust quantization bit width and precision to balance quality and efficiency.
  • Test on representative datasets to ensure real-world applicability of quantized models.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct, gpt-4o-mini