What is post-training quantization?
How it works
Post-training quantization works by converting the high-precision floating-point weights of a trained neural network into lower-precision formats such as 8-bit integers. Imagine you have a detailed color photo (32-bit floats) and you convert it to a simpler pixel art version (8-bit integers) to save space and load faster. The model's parameters are mapped to a smaller numeric range, which reduces memory usage and speeds up computation during inference. This process happens after the model is fully trained, so no additional training or fine-tuning is needed.
Because the model is not retrained, some accuracy loss can occur, but modern quantization methods minimize this impact. It’s a practical way to deploy large models on resource-constrained devices or speed up cloud inference.
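The core mapping can be sketched in a few lines of NumPy. This is a minimal illustration of symmetric per-tensor int8 quantization (one common scheme; production libraries also use per-channel scales and zero points), not what any particular library does internally:

```python
import numpy as np

# Simulate a layer's trained float32 weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1000).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto int8's [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to measure how much precision the round trip lost.
dequantized = q_weights.astype(np.float32) * scale
max_error = np.abs(weights - dequantized).max()

print(f"int8 storage: {q_weights.nbytes} bytes vs float32: {weights.nbytes} bytes")
print(f"max round-trip error: {max_error:.6f}")
```

The int8 copy takes a quarter of the memory, and the worst-case error per weight is bounded by half the scale, which is why accuracy loss is usually small.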
Concrete example
Here is a simple example using Hugging Face Transformers with the bitsandbytes backend (install with `pip install bitsandbytes accelerate`) to apply 8-bit post-training quantization to a language model:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load model with 8-bit quantization (post-training quantization)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Encode input and generate output
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=20)
print(tokenizer.decode(outputs[0]))
When to use it
Use post-training quantization when you want to reduce model size and speed up inference without the cost and complexity of retraining. It is ideal for deploying large models on edge devices, mobile phones, or in latency-sensitive applications where hardware resources are limited.
Do not use post-training quantization if you require the absolute highest accuracy and can afford retraining, as quantization-aware training (QAT) can yield better accuracy by incorporating quantization effects during training.
Key terms
| Term | Definition |
|---|---|
| Post-training quantization | Converting a trained model's weights to lower precision after training to reduce size and speed up inference. |
| Quantization-aware training (QAT) | Training a model with quantization effects simulated to improve accuracy after quantization. |
| 8-bit integer (int8) | A numeric format using 8 bits to represent values, reducing memory compared to 32-bit floats. |
| Inference | The process of running a trained model to generate predictions or outputs. |
| Model size | The amount of memory required to store a model's parameters. |
Key Takeaways
- Post-training quantization reduces model size and speeds up inference by lowering weight precision after training.
- It requires no retraining, making it fast and practical for deployment on limited hardware.
- Accuracy loss is typically small, and can be reduced further with quantization-aware training when retraining is feasible.