Quantization of fine-tuned models
Why this matters
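Fine-tuned models are expensive to serve. Quantizing float32 weights to int8 cuts weight memory by about 75% (4-bit formats cut it further) and can speed up inference, often by 2-4x when memory bandwidth is the bottleneck, which makes production deployment viable on resource-constrained hardware. Post-training quantization preserves most task-specific knowledge without the cost of retraining.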
Explanation
Quantization converts model weights from float32 or float16 to a lower precision (int8, int4, or fp8) after fine-tuning completes. This reduces model size and the memory bandwidth needed during inference.
How it works: weights are scaled to fit the target precision range, then rounded. With weight-only methods such as GPTQ, the compressed weights are dequantized back to float for compute at inference time. The loss of precision is small when quantization is done carefully: GPTQ, for example, uses calibration data to choose quantized weights that minimize each layer's output error.
When to use: after fine-tuning is complete and validated, especially for deployments where latency or memory is the bottleneck. Post-training quantization (PTQ) requires no retraining; quantization-aware training (QAT) during fine-tuning usually gives better accuracy but adds training overhead.
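The scale, round, and dequantize steps described above can be sketched in plain Python. This is a minimal illustration of symmetric quantization with a single shared scale, not what GPTQ does internally; the function names and sample weights are made up for the example.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Scale into the integer range, round to nearest, then clamp.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights for compute."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -0.75, 0.31, 1.27, -0.004]
q, scale = quantize_symmetric(weights, bits=8)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # → [2, -75, 31, 127, 0]
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

The round-trip error is at most half the scale, so one large outlier weight widens the scale and costs precision for every other weight. That is the motivation for giving each small group of weights its own scale, as the group_size setting does.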
Analogy
Quantization is like compressing a high-resolution photo to JPEG for web delivery: you lose some detail, but the image remains usable and the file can be around 10x smaller. Your fine-tuned model is the high-res original; the quantized model is the JPEG.
Code
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Replace with the path or hub ID of your own fine-tuned checkpoint.
model_name = "meta-llama/Llama-2-7b-hf"
quantized_dir = "./llama-2-7b-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weights, with one set of quantization parameters per group of 128 weights.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Load the unquantized weights (auto_gptq loads them in float16 by default).
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
)
print(f"Model dtype before quantization: {model.model.dtype}")

# Calibration samples: use text representative of your deployment traffic.
calibration_texts = [
    "The fine-tuned model handles domain-specific text efficiently.",
    "Quantization preserves knowledge while reducing memory footprint.",
    "Post-training quantization requires calibration on representative samples.",
]

# Each sample is tokenized on its own, so padding is unnecessary
# (and the Llama 2 tokenizer ships without a pad token).
examples = [
    tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    for text in calibration_texts
]

# Run GPTQ calibration and quantize the weights.
model.quantize(examples, use_triton=False)

# save_quantized writes the packed weights together with the quantization config.
model.save_quantized(quantized_dir)
tokenizer.save_pretrained(quantized_dir)
print(f"Model saved to {quantized_dir}")

# Quantized checkpoints are loaded with from_quantized, not from_pretrained.
inference_model = AutoGPTQForCausalLM.from_quantized(
    quantized_dir,
    device="cuda:0",
)

test_prompt = "The advantage of quantization is"
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda:0")
outputs = inference_model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,  # temperature and top_p only take effect when sampling
    temperature=0.7,
    top_p=0.95,
)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated output: {generated}")

Output (the generated text varies between runs because sampling is enabled):
Model dtype before quantization: torch.float16
Model saved to ./llama-2-7b-gptq
Generated output: The advantage of quantization is that it reduces the model size and inference latency while maintaining task-specific knowledge learned during fine-tuning.
What just happened?
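The code loaded a pretrained Llama-2-7B model, configured GPTQ quantization to 4-bit precision with group size 128, calibrated the quantization using three example sentences, saved the quantized model to disk, then loaded it back and generated text. Three sentences are enough to demonstrate the API, but real calibration needs many more representative samples. The quantized weights take roughly 3.5 to 4 GB instead of about 14 GB at float16 (about 28 GB at float32); actual VRAM use is higher once activations and the KV cache are included.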
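The memory figures follow from simple arithmetic on parameter count and bits per weight. The sketch below is a rough estimate under stated assumptions: a nominal 7 billion parameters, one 16-bit scale per group, and no accounting for zero points, unquantized layers such as embeddings, activations, or the KV cache. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bits, group_size=None, scale_bits=16):
    """Estimate weight storage in GB (10^9 bytes)."""
    total_bits = num_params * bits
    if group_size:
        # One scale per group of weights; zero points are ignored here.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 1e9


num_params = 7e9  # nominal size of a "7B" model (assumption)

fp32 = weight_memory_gb(num_params, 32)
fp16 = weight_memory_gb(num_params, 16)
int4 = weight_memory_gb(num_params, 4, group_size=128)

print(f"float32:        {fp32:.2f} GB")
print(f"float16:        {fp16:.2f} GB")
print(f"4-bit, g=128:   {int4:.2f} GB")
```

Real checkpoints come out somewhat larger than the 4-bit estimate because some layers usually stay in float16.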
Common gotcha
Developers often quantize a fine-tuned model immediately after training without benchmarking accuracy first, then discover a 3-4% drop in task performance in production. Always evaluate your fine-tuned model on your validation set before and after quantization, using the exact prompts and metrics you'll use in production. A 2% drop on general-knowledge benchmarks might be a 5% drop on your specific task. Second gotcha: quantization calibration is data-dependent, so calibrate on examples representative of your actual deployment distribution, not random data.
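A before-and-after check can be as small as the sketch below. It assumes exact-match accuracy and a 2-point budget, both placeholders for your own metric and threshold. The two predict callables stand in for calls to the unquantized and quantized models, and the validation pairs are invented for illustration.

```python
def accuracy(predict, examples):
    """Fraction of (prompt, expected) pairs answered exactly right."""
    correct = sum(1 for prompt, expected in examples if predict(prompt) == expected)
    return correct / len(examples)


def compare(baseline_predict, quantized_predict, examples, max_drop=0.02):
    """Score both models on the same examples and flag drops over budget."""
    base = accuracy(baseline_predict, examples)
    quant = accuracy(quantized_predict, examples)
    drop = base - quant
    return {"baseline": base, "quantized": quant, "drop": drop, "ok": drop <= max_drop}


# Hypothetical validation set and stand-ins for real model calls.
validation = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]
baseline_answers = dict(validation)
quantized_answers = {**baseline_answers, "3*3": "6"}  # one simulated regression

report = compare(baseline_answers.get, quantized_answers.get, validation)
print(report)
```

Running both models over the same examples with the same scoring function keeps the comparison fair; the only variable is the quantization.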
Error recovery
RuntimeError: CUDA out of memory
ValueError: bits must be in [2, 3, 4, 8]
AssertionError: calibration data is empty
Experienced dev note
In production, quantization is where theory meets reality: a model that passes your internal benchmarks might fail on edge cases in quantized form. Use auto-gptq or bitsandbytes for production; they handle scale factor selection better than naive int8 conversion. Always version your quantized models separately from unquantized checkpoints (e.g., 'llama-2-7b-gptq-v1'), because requantizing with different calibration data can produce different behavior. Finally, if you're quantizing right after fine-tuning and the accuracy drop is more than 2%, your fine-tuning itself may have been unstable, so check your training loss curve before assuming quantization is the problem.
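One way to make the versioning advice concrete is to derive the tag from the quantization settings and a hash of the calibration data, so a changed calibration set always yields a new name. The naming scheme and function below are assumptions for illustration, not a convention from auto-gptq or bitsandbytes.

```python
import hashlib
import json


def quantized_version_tag(base_name, bits, group_size, calibration_texts):
    """Build a tag that changes whenever the config or calibration data changes."""
    payload = json.dumps(
        {"bits": bits, "group_size": group_size, "calibration": calibration_texts},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    return f"{base_name}-gptq-{bits}bit-g{group_size}-{digest}"


# Hypothetical calibration sets that differ in one sample.
tag_a = quantized_version_tag("llama-2-7b", 4, 128, ["sample one", "sample two"])
tag_b = quantized_version_tag("llama-2-7b", 4, 128, ["sample one", "a different sample"])

print(tag_a)
print(tag_b)
```

Storing the tag alongside evaluation results makes it possible to trace a production regression back to the exact calibration set that produced the checkpoint.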
Check your understanding
Your fine-tuned model scores 92% accuracy on your validation set unquantized. After 4-bit GPTQ quantization, it scores 89% on the same set. Should you deploy the quantized version to production, and why or why not?
Answer hint
A correct answer recognizes that a 3-point drop is significant and that whether it is acceptable depends on your SLA. It also recognizes that the gap likely points to either insufficient or unrepresentative calibration data, or a fine-tuning task that is particularly sensitive to precision loss. The answer should mention benchmarking on the actual deployment use case, not just the validation set, and should weigh whether the latency and cost savings justify the accuracy trade-off for your specific product.