What is post-training quantization?
How it works
Post-training quantization works by converting the high-precision floating-point weights of a trained neural network into lower-precision formats such as 8-bit integers. Imagine you have a detailed color photo (32-bit floats) and you convert it to a simpler pixel art version (8-bit integers) to save space and load faster. The model's parameters are mapped to a smaller numeric range, which reduces memory usage and speeds up computation during inference. This process happens after the model is fully trained, so no additional training or fine-tuning is needed.
Because the model is not retrained, some accuracy loss can occur, but modern quantization methods minimize this impact. It’s a practical way to deploy large models on resource-constrained devices or speed up cloud inference.
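The core mapping can be sketched in a few lines of NumPy. This is a minimal illustration of symmetric per-tensor int8 quantization (one common scheme; production libraries also use per-channel scales and zero points), not what any particular library does internally:

```python
import numpy as np

# Simulate a layer's trained float32 weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1000).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto int8's [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to measure how much precision the round trip lost.
dequantized = q_weights.astype(np.float32) * scale
max_error = np.abs(weights - dequantized).max()

print(f"int8 storage: {q_weights.nbytes} bytes vs float32: {weights.nbytes} bytes")
print(f"max round-trip error: {max_error:.6f}")
```

The int8 copy takes a quarter of the memory, and the worst-case error per weight is bounded by half the scale, which is why accuracy loss is usually small.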
Concrete example
Here is a simple example using Hugging Face Transformers with the bitsandbytes backend (install with `pip install bitsandbytes accelerate`) to apply 8-bit post-training quantization to a language model:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load model with 8-bit quantization (post-training quantization)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Encode input and generate output
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=20)
print(tokenizer.decode(outputs[0]))
When to use it
Use post-training quantization when you want to reduce model size and speed up inference without the cost and complexity of retraining. It is ideal for deploying large models on edge devices, mobile phones, or in latency-sensitive applications where hardware resources are limited.
Do not use post-training quantization if you require the absolute highest accuracy and can afford retraining, as quantization-aware training (QAT) can yield better accuracy by incorporating quantization effects during training.
Key terms
| Term | Definition |
|---|---|
| Post-training quantization | Converting a trained model's weights to lower precision after training to reduce size and speed up inference. |
| Quantization-aware training (QAT) | Training a model with quantization effects simulated to improve accuracy after quantization. |
| 8-bit integer (int8) | A numeric format using 8 bits to represent values, reducing memory compared to 32-bit floats. |
| Inference | The process of running a trained model to generate predictions or outputs. |
| Model size | The amount of memory required to store a model's parameters. |
Key Takeaways
- Post-training quantization reduces model size and speeds up inference by lowering weight precision after training.
- It requires no retraining, making it fast and practical for deployment on limited hardware.
- Accuracy loss is typically small, and can be reduced further with quantization-aware training when retraining is feasible.