Code Advanced hard · 8 min

Quantization of fine-tuned models

What you will learn

Reduce fine-tuned model size and inference latency by quantizing weights to lower precision without retraining.

Why this matters

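Fine-tuned models are expensive to serve. Quantizing 16-bit weights to 4-bit cuts weight memory by roughly 75%, and because token generation is usually limited by memory bandwidth, it often speeds up inference as well. The actual speedup depends on the kernels, batch size, and hardware, so measure it rather than assume a fixed 2-4x. Together these gains can make production deployment viable on resource-constrained hardware. Post-training quantization preserves most task-specific knowledge without the cost of retraining.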

Skip if: (1) your model is already small (<1B parameters); (2) you need a guaranteed <1% accuracy drop and can't afford to benchmark; (3) you're still actively experimenting with hyperparameters (quantize only when fine-tuning is final); or (4) you lack calibration data representative of your actual deployment distribution, which GPTQ and AWQ both require.

Explanation

Quantization converts model weights from a floating-point format (float32, or more commonly float16/bfloat16 for LLMs) to lower precision (int8, int4, or fp8) after fine-tuning completes. This reduces model size and the memory bandwidth needed during inference.

How it works: weights are scaled to fit the target precision range, then rounded; during inference, the compressed weights are dequantized back to floating point for compute. The loss of precision is usually small when quantization is done carefully: methods like GPTQ use calibration data to choose quantized weights that keep each layer's outputs close to the original.

When to use: after fine-tuning is complete and validated, especially for deployment where latency or memory is a bottleneck. Post-training quantization (PTQ) requires no retraining; quantization-aware training (QAT) during fine-tuning gives better accuracy but adds training overhead.
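The scale, round, and dequantize cycle described above can be sketched in a few lines. This is a minimal illustration using plain round-to-nearest with a single scale per tensor, not GPTQ itself; the function names `quantize_absmax` and `dequantize` are hypothetical and exist only for this example.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric round-to-nearest quantization with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Scale into the integer range, round, and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_absmax(weights, bits=8)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

The round-trip error is bounded by half the scale, which is why a single large outlier weight (which inflates the scale) hurts every other weight in the same tensor or group.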

Analogy

Quantization is like compressing a high-resolution photo to JPG for web delivery: you lose some detail, but the image remains usable, and the file is 10x smaller. Your fine-tuned model is the high-res original; quantization is the JPG.

Code

Illustrative only - not runnable without a CUDA GPU, the auto-gptq package, and Hugging Face access to the gated Llama-2 weights

python

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Base model used for illustration; point this at your own fine-tuned checkpoint.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

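# 4-bit GPTQ with one set of quantization parameters per group of 128 weights.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

# Loads the unquantized weights; from_pretrained takes no `device` argument,
# and layers are moved to the GPU during quantize().
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
)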

print(f"Model dtype before quantization: {model.model.dtype}")

train_data = [
    "The fine-tuned model handles domain-specific text efficiently.",
    "Quantization preserves knowledge while reducing memory footprint.",
    "Post-training quantization requires calibration on representative samples.",
]

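# Tokenize each calibration sample on its own. auto-gptq expects a list of
# dicts with input_ids and attention_mask; no padding is needed for single
# samples (the Llama tokenizer defines no pad token by default).
examples = [
    tokenizer(text, truncation=True, max_length=512)
    for text in train_data
]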

model.quantize(
    examples,
    use_triton=False,
)

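# save_quantized writes the packed low-bit weights plus the quantize config.
model.save_quantized("./llama-2-7b-gptq", use_safetensors=True)
tokenizer.save_pretrained("./llama-2-7b-gptq")

print("Model saved to ./llama-2-7b-gptq")

# Quantized checkpoints are reloaded with from_quantized, not from_pretrained.
inference_model = AutoGPTQForCausalLM.from_quantized(
    "./llama-2-7b-gptq",
    device="cuda:0",
    use_safetensors=True,
)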

test_prompt = "The advantage of quantization is"
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda:0")

outputs = inference_model.generate(
    **inputs,
    max_length=40,
    do_sample=True,  # temperature and top_p only take effect when sampling
    temperature=0.7,
    top_p=0.95,
)

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated output: {generated}")

Output

Model dtype before quantization: torch.float16
Model saved to ./llama-2-7b-gptq
Generated output: The advantage of quantization is that it reduces the model size and inference latency while maintaining task-specific knowledge learned during fine-tuning.

What just happened?

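The code loaded a pretrained Llama-2-7B model (a stand-in for your fine-tuned checkpoint), configured GPTQ quantization to 4-bit precision with group size 128, calibrated the quantization using three example sentences, saved the quantized model to disk, then loaded it back and generated text. Three sentences keep the example short; a real run needs far more calibration samples. At 4 bits the 7B weights take roughly 3.5-4GB instead of ~14GB in float16 (~28GB in float32); activations and the KV cache add to that at inference time.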
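A back-of-the-envelope estimate of weight memory is parameter count times bits per weight, divided by 8. The sketch below assumes a nominal 7e9 parameters and, for the grouped 4-bit case, an assumed 20 bits of per-group metadata (a 16-bit scale plus a 4-bit zero-point); `weight_memory_gb` is a hypothetical helper, and real checkpoints differ somewhat because some layers are typically left unquantized.

```python
def weight_memory_gb(num_params, bits_per_weight, group_size=None, overhead_bits_per_group=0):
    """Estimate weight storage in GB (1 GB = 1e9 bytes); ignores activations and KV cache."""
    total_bits = num_params * bits_per_weight
    if group_size:
        # Assumed per-group metadata (scale and zero-point) shared by the group.
        total_bits += (num_params / group_size) * overhead_bits_per_group
    return total_bits / 8 / 1e9


params = 7e9  # nominal parameter count for a "7B" model
fp32 = weight_memory_gb(params, 32)
fp16 = weight_memory_gb(params, 16)
# Assumption: 16-bit scale + 4-bit zero-point per group of 128 weights.
int4 = weight_memory_gb(params, 4, group_size=128, overhead_bits_per_group=20)
print(f"fp32: {fp32:.1f} GB, fp16: {fp16:.1f} GB, 4-bit (g128): {int4:.2f} GB")
```

Treat the result as a lower bound for VRAM planning: the serving framework, activations, and KV cache all sit on top of the weights.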

Common gotcha

Developers often quantize a fine-tuned model immediately after training without benchmarking accuracy first, then discover a 3-4 point drop in task performance in production. Always evaluate your fine-tuned model on your validation set before and after quantization, using the exact prompts and metrics you'll use in production. A 2% drop on a general benchmark can turn into a 5% drop on your specific task. Second gotcha: quantization calibration is data-dependent, so calibrate on examples representative of your actual deployment distribution, not random data.
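The before/after check can be as simple as running both models over the same labeled examples and comparing the scores against a budget. The sketch below is one way to structure it, assuming exact-match accuracy is your metric; `accuracy`, `compare_before_after`, and the stub predictors are hypothetical names, and in practice each `predict` callable would wrap your real unquantized or quantized model.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the prediction matches the expected answer exactly."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for prompt, expected in examples if predict(prompt).strip() == expected)
    return correct / len(examples)


def compare_before_after(
    baseline_predict: Callable[[str], str],
    quantized_predict: Callable[[str], str],
    examples: Sequence[Example],
    max_drop: float = 0.01,
) -> Dict[str, object]:
    """Score both models on the same examples and flag drops above max_drop."""
    base = accuracy(baseline_predict, examples)
    quant = accuracy(quantized_predict, examples)
    drop = base - quant
    return {"baseline": base, "quantized": quant, "drop": drop, "within_budget": drop <= max_drop}


# Stub predictors standing in for real models (illustration only).
validation = [("2+2=", "4"), ("capital of France?", "Paris"), ("3*3=", "9"), ("opposite of hot?", "cold")]
answers = dict(validation)
baseline = lambda prompt: answers[prompt]
quantized = lambda prompt: "8" if prompt == "3*3=" else answers[prompt]

report = compare_before_after(baseline, quantized, validation, max_drop=0.02)
print(report)
```

Running the same harness on production-style prompts, not only the validation set, is what catches the task-specific regressions described above.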

Error recovery

RuntimeError: CUDA out of memory

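Quantization needs GPU memory for the layer being processed plus the calibration inputs and intermediate activations, which auto-gptq caches on the GPU by default. Use fewer or shorter calibration samples (lower max_length), pass cache_examples_on_gpu=False to model.quantize(), and stop other processes that are holding GPU memory. Changing group_size or use_triton is unlikely to reduce memory use during quantization.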

ValueError: bits must be in [2, 3, 4, 8]

BaseQuantizeConfig accepts only these bit widths; the exact error text may differ between auto-gptq versions. Use bits=4 for the most common balance between size reduction and accuracy preservation; bits=3 is more aggressive but riskier.
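To get a feel for why lower bit widths are riskier, the sketch below measures the round-trip error of plain round-to-nearest quantization on synthetic Gaussian weights at each supported width. It is an illustration under stated assumptions (random weights with standard deviation 0.02, one scale for the whole tensor, hypothetical helper `mean_roundtrip_error`), not a prediction of GPTQ accuracy, which uses calibration data to do better than naive rounding.

```python
import random


def mean_roundtrip_error(weights, bits):
    """Mean absolute error after symmetric round-to-nearest quantization at `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    restored = [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


random.seed(0)  # fixed seed so the comparison is reproducible
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
errors = {bits: mean_roundtrip_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"bits={bits}: mean abs error {err:.6f}")
```

Each bit removed roughly doubles the step size between representable values, so the error grows quickly below 4 bits.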

AssertionError: calibration data is empty

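The examples list passed to model.quantize() must not be empty. The three sentences in train_data are placeholders: in practice, use on the order of a hundred or more samples (the GPTQ paper used 128) drawn from your deployment distribution, and check that they tokenize successfully.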
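One way to assemble a calibration set is to sample distinct, non-trivial texts from real deployment prompts. The helper below is a sketch under assumptions: `build_calibration_texts`, the `min_chars` filter, and the synthetic `corpus` are all hypothetical, and the default of 128 samples is only a starting point to tune for your task.

```python
import random


def build_calibration_texts(corpus, n_samples=128, min_chars=20, seed=0):
    """Pick up to n_samples distinct, non-trivial texts from a corpus of real prompts."""
    seen = set()
    candidates = []
    for text in corpus:
        cleaned = text.strip()
        # Drop empty or very short entries and exact duplicates.
        if len(cleaned) >= min_chars and cleaned not in seen:
            seen.add(cleaned)
            candidates.append(cleaned)
    if not candidates:
        raise ValueError("no usable calibration texts; check the corpus and min_chars")
    rng = random.Random(seed)  # fixed seed so the calibration set is reproducible
    rng.shuffle(candidates)
    return candidates[:n_samples]


# Synthetic stand-in for logged production prompts (illustration only).
corpus = [f"Customer ticket {i}: the device fails to sync after the update." for i in range(500)]
corpus += ["", "hi", corpus[0]]  # junk and a duplicate that should be filtered out
calibration_texts = build_calibration_texts(corpus)
print(len(calibration_texts))
```

The returned strings can then be tokenized one by one and passed to model.quantize() in place of the placeholder sentences.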

Experienced dev note

In production, quantization is where theory meets reality: a model that passes your internal benchmarks might fail on edge cases in quantized form. Use auto-gptq or bitsandbytes for production; they handle scale factor selection better than naive int8 conversion. Always version your quantized models separately from unquantized checkpoints (e.g., 'llama-2-7b-gptq-v1'), because requantizing with different calibration data can produce different behavior. Finally, if you're quantizing right after fine-tuning and the accuracy drop is >2%, your fine-tuning itself may have been unstable, so check your training loss curve before assuming quantization is the problem.
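Since the same base checkpoint can yield different quantized models depending on the calibration data, it can help to bake a fingerprint of that data into the artifact name. The naming scheme and the `quantized_model_tag` helper below are hypothetical, shown only as one possible convention.

```python
import hashlib


def quantized_model_tag(base_name, bits, group_size, calibration_texts):
    """Build a name that changes whenever the settings or calibration data change."""
    digest = hashlib.sha256()
    for text in calibration_texts:
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] hash differently
    return f"{base_name}-gptq-{bits}bit-g{group_size}-cal{digest.hexdigest()[:8]}"


tag_a = quantized_model_tag("llama-2-7b", 4, 128, ["support ticket one", "support ticket two"])
tag_b = quantized_model_tag("llama-2-7b", 4, 128, ["support ticket one", "a different sample"])
print(tag_a)
print(tag_b)
```

Two runs with identical settings and data produce the same tag, so a changed suffix signals that the calibration set (and possibly the model's behavior) has changed.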

Check your understanding

Your fine-tuned model scores 92% accuracy on your validation set unquantized. After 4-bit GPTQ quantization, it scores 89% on the same set. Should you deploy the quantized version to production, and why or why not?

Show answer hint

A correct answer requires understanding that a 3-point drop is significant and that whether it is acceptable depends on your SLA. It also requires knowing that this gap likely points to either insufficient or unrepresentative calibration data, or a fine-tuning task that is particularly sensitive to precision loss. The answer should mention benchmarking on the actual deployment use case, not just the validation set, and consider whether the latency and cost savings justify the accuracy trade-off for your specific product.

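VERSION auto-gptq quantizes with AutoGPTQForCausalLM.from_pretrained() plus a quantize_config and reloads quantized checkpoints with AutoGPTQForCausalLM.from_quantized(); defaults have changed between releases, so check the release notes for the version you have installed. transformers provides integrated GPTQ support through GPTQConfig (backed by optimum and auto-gptq); note that load_in_4bit selects bitsandbytes quantization, not GPTQ. Explicit auto-gptq is more configurable for advanced users. peft can train adapters on top of GPTQ- and bitsandbytes-quantized models, but support for other backends depends on the peft version, so check its documentation.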
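Because behavior differs across releases, it is worth recording which versions are actually installed before debugging a quantization issue. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list is an assumption you should adjust to your own stack.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


versions = installed_versions(["auto-gptq", "transformers", "peft", "optimum"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

Saving this output next to each quantized checkpoint makes it easier to reproduce the artifact later.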

Explore quantization-aware training (QAT) during fine-tuning to achieve lower accuracy loss than post-training quantization at the cost of extra training overhead.
