Code Advanced medium · 6 min

Dynamic quantization: simplest approach

What you will learn

Convert a trained model to int8 weights with a single function call, reducing size by ~4x with minimal accuracy loss.

Why this matters

Model deployment often requires smaller file sizes and faster inference on CPU. Dynamic quantization is the fastest path to both: no retraining, no calibration dataset, just one line of code.

Skip if: Do not use dynamic quantization if your model is already on GPU in production (quantization benefits CPU inference), or if you need sub-1% accuracy loss and cannot afford to benchmark first. Also skip this if your model uses custom ops that don't support quantization.

Explanation

Dynamic quantization converts a model's weights from float32 to int8 once, when you call the conversion function, and quantizes activations on the fly at inference time. This is the simplest quantization method because it requires no calibration data, no retraining, and no model modification: just call torch.quantization.quantize_dynamic() on a trained model.

Mechanically, PyTorch replaces the float32 weight matrices of the selected layers with int8 equivalents and stores the scale and zero-point needed to map them back. When you run inference, activations are quantized on the fly using min/max statistics from each batch. The quantized layers shrink by about 4x (float32 → int8 for weights; biases and unquantized layers stay float32), and CPU inference is often faster because int8 matrix multiplies are cheaper than float32 on many CPUs. The actual speedup depends on the model and the hardware.
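To make the min/max step concrete, here is a minimal pure-Python sketch of affine int8 quantization. It illustrates the idea only and is not PyTorch's internal kernel; the helper names `quantize_affine` and `dequantize` are hypothetical, and the signed int8 range is an assumption made for the example.

```python
def quantize_affine(values, qmin=-128, qmax=127):
    """Map floats to int8 codes using this batch's min/max (hypothetical helper)."""
    lo = min(min(values), 0.0)  # the representable range must include 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = round(qmin - lo / scale)
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(c - zero_point) * scale for c in codes]


# Two batches with different ranges get different scales: that is the "dynamic" part.
small_batch = [-0.5, 0.1, 0.25]
large_batch = [-4.0, 1.0, 8.0]

for batch in (small_batch, large_batch):
    codes, scale, zero_point = quantize_affine(batch)
    restored = dequantize(codes, scale, zero_point)
    worst = max(abs(a - b) for a, b in zip(batch, restored))
    print(f"scale={scale:.5f} zero_point={zero_point} codes={codes} max_error={worst:.5f}")
```

The rounding error per value is bounded by roughly one quantization step (`scale`), so batches with a wider range are stored more coarsely.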

Use this when you're deploying to CPU, have a model that's already trained, and can tolerate 0.5–2% accuracy drop. It's a production-safe path from float to quantized without the complexity of static quantization or QAT.

Analogy

It's like switching from high-resolution RGB photos to compressed JPEG format: you lose some detail, but the file is much smaller and loads faster, and most viewers won't notice the difference.

Code

python

import torch
import torch.nn as nn
import torch.quantization as quantization

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

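model = SimpleNet()
model.eval()  # inference mode, so outputs are deterministic

test_input = torch.randn(1, 1, 28, 28)

# Float32 baseline: 4 bytes per parameter.
print("Original model size:")
original_bytes = sum(p.numel() * 4 for p in model.parameters())
print(f"{original_bytes / (1024 * 1024):.2f} MB (float32)")

# Swap every nn.Linear for a dynamically quantized version (returns a new model).
quantized_model = quantization.quantize_dynamic(
    model,
    {nn.Linear},
    dtype=torch.qint8
)

# Quantized layers keep their weights in packed buffers, not in parameters(),
# so read them through weight() and bias(). Weights are int8 (1 byte each);
# biases stay float32 (4 bytes each). Scale/zero-point overhead is ignored.
print("\nQuantized model size:")
quantized_bytes = sum(
    layer.weight().numel() * 1 + layer.bias().numel() * 4
    for layer in (quantized_model.fc1, quantized_model.fc2, quantized_model.fc3)
)
print(f"{quantized_bytes / (1024 * 1024):.2f} MB (int8 weights)")

with torch.no_grad():
    original_output = model(test_input)
    quantized_output = quantized_model(test_input)

print("\nInference outputs match:", torch.allclose(original_output, quantized_output, atol=0.5))
print(f"Max difference: {(original_output - quantized_output).abs().max().item():.4f}")

Output

Original model size:
0.90 MB (float32)

Quantized model size:
0.23 MB (int8 weights)

Inference outputs match: True
Max difference: (small nonzero value; varies with the random weights and input)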

What just happened?

We created a simple 3-layer neural network, then passed it through <code>torch.quantization.quantize_dynamic()</code>, which replaced all <code>nn.Linear</code> layers with dynamically quantized versions. The model shrinks because weights are stored as int8 instead of float32; biases remain float32, so the reduction is slightly under 4x. We then verified that inference still produces nearly identical outputs (the small difference comes from quantization rounding). Note that <code>atol=0.5</code> is a loose tolerance chosen for this toy check; on a real model, compare task accuracy on a validation set instead.
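Counting bytes per parameter is only an estimate. A closer proxy for the file you would ship is the serialized state_dict, which the following sketch measures in memory. The helper name `serialized_size_bytes` and the small two-layer model are assumptions made for illustration, not part of the lesson's code.

```python
import io

import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic


def serialized_size_bytes(module: nn.Module) -> int:
    """Bytes produced by torch.save() on the module's state_dict (hypothetical helper)."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Illustrative model; substitute your own trained network.
float_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
int8_model = quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)

float_bytes = serialized_size_bytes(float_model)
int8_bytes = serialized_size_bytes(int8_model)
print(
    f"float32: {float_bytes / 1024:.0f} KiB, "
    f"int8: {int8_bytes / 1024:.0f} KiB, "
    f"ratio: {float_bytes / int8_bytes:.2f}x"
)
```

The measured ratio usually lands a little below 4x because biases, quantization parameters, and serialization overhead are not compressed.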

Common gotcha

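Put the model in .eval() mode before quantizing and comparing outputs, so that dropout and batch norm behave the way they will in production. quantize_dynamic() only swaps the module types you pass in (such as nn.Linear); it does not quantize batch norm or any other layer, so those stay float32 and limit the size reduction. Also, by default (inplace=False) quantize_dynamic() returns a new model and leaves the original untouched, so keep and use the return value.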
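A quick way to confirm the "returns a new model" behavior is to inspect the layer types before and after the call. This is a minimal sketch on a throwaway model; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic

float_model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))
float_model.eval()

# Default inplace=False: the original model is copied, not modified.
int8_model = quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)

print("same object:", int8_model is float_model)
print("original layer: ", type(float_model[0]).__module__, type(float_model[0]).__name__)
print("quantized layer:", type(int8_model[0]).__module__, type(int8_model[0]).__name__)
```

If the second and third lines printed the same class, the quantized model was discarded somewhere and the float model is still the one in use.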

Error recovery

RuntimeError: Could not run 'quantized::linear' with arguments...

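This error usually means no quantized kernel exists for the backend or input type named in the message. Dynamically quantized layers run on the CPU and expect ordinary float32 input, so check that the model and input are on the CPU (<code>input.device.type == "cpu"</code>) and that the input is float32 (<code>input.dtype == torch.float32</code>): quantization handles the rest.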

Module not quantized

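You passed a module type to the second argument that has no dynamically quantized implementation. Stick to <code>nn.Linear</code>, <code>nn.LSTM</code>, or <code>nn.GRU</code> for most models. <code>nn.Conv2d</code> is not covered by dynamic quantization (convolutions need static quantization), and custom modules won't quantize.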
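To see which layers were actually converted, you can list each child module and check where its class comes from. The sketch below uses a hypothetical `TinyConvNet` and assumes that unsupported types such as nn.Conv2d are left as-is rather than raising an error; verify that on your installed version.

```python
import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic


class TinyConvNet(nn.Module):
    """Hypothetical model mixing a conv layer with a linear layer."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 4, kernel_size=3)
        self.fc = nn.Linear(4 * 26 * 26, 10)

    def forward(self, x):
        return self.fc(torch.relu(self.conv(x)).flatten(1))


model = TinyConvNet().eval()
# Requesting nn.Conv2d here is deliberate: it has no dynamic quantized version.
quantized = quantize_dynamic(model, {nn.Conv2d, nn.Linear}, dtype=torch.qint8)

for name, module in quantized.named_children():
    is_quantized = "quantized" in type(module).__module__
    status = "quantized" if is_quantized else "still float32"
    print(f"{name}: {type(module).__name__} ({status})")
```

Running a check like this after every quantize_dynamic() call is cheap and catches layers that were skipped.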

Experienced dev note

Dynamic quantization is deceptively cheap: one-line-of-code cheap. The trap is thinking 'this must be worse than proper static quantization.' Often it isn't: for many real models (especially transformers and language models), dynamic quantization gives you 70–80% of the speedup and file-size benefit with zero effort. Always benchmark it first on your actual hardware before jumping to more complex approaches like QAT or static quantization. Also: dynamic quantization on CPU is where this shines; on GPU, quantization has less ROI and adds complexity.
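Benchmarking first can be as simple as timing both models on the same input. The following is a rough sketch, not a rigorous benchmark: the helper `mean_latency_ms`, the layer sizes, and the batch size are all assumptions, and results vary with CPU, thread count, and PyTorch build.

```python
import time

import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic


def mean_latency_ms(module: nn.Module, batch: torch.Tensor, warmup: int = 5, runs: int = 50) -> float:
    """Average forward-pass time in milliseconds (hypothetical helper)."""
    with torch.no_grad():
        for _ in range(warmup):  # warm caches before timing
            module(batch)
        start = time.perf_counter()
        for _ in range(runs):
            module(batch)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1000


# Stand-in model and input; replace with your real model and a representative batch.
float_model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()
int8_model = quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)
batch = torch.randn(32, 1024)

float_ms = mean_latency_ms(float_model, batch)
int8_ms = mean_latency_ms(int8_model, batch)
print(f"float32: {float_ms:.3f} ms, int8: {int8_ms:.3f} ms, speedup: {float_ms / int8_ms:.2f}x")
```

Pair the timing with an accuracy check on a validation set; a speedup is only useful if the accuracy drop stays within your budget.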

Check your understanding

Why does dynamic quantization not require a calibration dataset, whereas static quantization does? What is the quantized model actually doing differently during inference compared to the original float32 model when it encounters a new batch of data it has never seen before?

Show answer hint

A correct answer explains that dynamic quantization computes scale/zero-point statistics per batch at runtime (no offline calibration needed), and that during inference the quantized model is quantizing weights once and activations per batch, whereas float32 just uses the original values directly. The key insight is 'dynamic' means the quantization parameters are computed on-the-fly, not pre-computed from a calibration set.

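VERSION torch.quantization.quantize_dynamic() defaults to dtype=torch.qint8, which is correct for most use cases; passing dtype=torch.qint8 explicitly is harmless and documents intent. Newer releases also expose the function as torch.ao.quantization.quantize_dynamic(), with torch.quantization kept as an alias. Check the quantization documentation for your installed version, since the recommended APIs have changed across releases.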

Static quantization: when you have a calibration dataset and need tighter control over quantization parameters for even better accuracy preservation.
