Dynamic quantization: simplest approach
Why this matters
Model deployment often requires smaller file sizes and faster inference on CPU. Dynamic quantization is the fastest path to both: no retraining, no calibration dataset, just one line of code.
Explanation
Dynamic quantization converts a model's weights from float32 to int8 once, ahead of time, when you call torch.quantization.quantize_dynamic(). Activations stay in float32 between layers and are quantized on the fly, per batch, at inference time. This is the simplest quantization method because it requires no calibration data, no retraining, and no model modification: just call torch.quantization.quantize_dynamic() on a trained model.
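Mechanically, PyTorch replaces each float32 weight matrix in the targeted layers with an int8 equivalent and stores the scaling factors needed to map it back. When you run inference, each layer's input activations are quantized on the fly using the min/max range observed in the current batch, so no ranges have to be fixed in advance. Weight storage for the quantized layers shrinks roughly 4x (4 bytes per value down to 1), and CPU inference is often faster because int8 matrix multiplies are cheaper than float32 on many CPUs and move less data through memory.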
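To make the mechanics concrete, here is a minimal pure-Python sketch of affine int8 quantization. It is not PyTorch's actual kernel: the helper names quantize_int8 and dequantize are hypothetical, and the simple min/max scheme is an assumption chosen for readability. It shows weights being quantized once while the activation scale is recomputed for each batch.

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to int8 using the list's own min/max."""
    lo = min(min(values), 0.0)  # widen the range so that 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


# Weights: quantized once, ahead of time, and stored with their scale.
weights = [0.42, -1.3, 0.07, 0.9]
w_q, w_scale, w_zp = quantize_int8(weights)
print("int8 weights:", w_q)

# Activations: the range is recomputed for every incoming batch.
batch_scales = []
for batch in ([0.1, 0.2, 0.3, 0.4], [10.0, -20.0, 5.0, 0.0]):
    a_q, a_scale, a_zp = quantize_int8(batch)
    batch_scales.append(a_scale)
    print(f"batch range [{min(batch)}, {max(batch)}] -> scale={a_scale:.4f}, zero_point={a_zp}")
```

The second batch spans a much wider range, so it gets a larger scale. That per-batch adjustment is what makes the scheme "dynamic".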
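Use this when you're deploying to CPU, have a model that's already trained, and can tolerate a small accuracy drop (often in the 0.5–2% range, though the actual figure depends on the model and task, so measure it). It's a low-risk path from float to quantized without the complexity of static quantization or quantization-aware training (QAT).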
Analogy
It's like switching from high-resolution RGB photos to compressed JPEG format: you lose some detail, but the file is much smaller and loads faster, and most viewers won't notice the difference.
Code
import io

import torch
import torch.nn as nn
import torch.quantization as quantization


class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten (N, 1, 28, 28) to (N, 784)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x


def serialized_size_mb(module):
    # Measure the saved state_dict. Quantized Linear layers keep their weights
    # in packed buffers, so summing module.parameters() would miss them.
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes / (1024 * 1024)


model = SimpleNet()
model.eval()
test_input = torch.randn(1, 1, 28, 28)

print("Original model size:")
print(f"{serialized_size_mb(model):.2f} MB (float32)")

# Returns a new model; the original float32 model is left unchanged.
quantized_model = quantization.quantize_dynamic(
    model,
    {nn.Linear},
    dtype=torch.qint8
)

print("\nQuantized model size:")
print(f"{serialized_size_mb(quantized_model):.2f} MB (int8)")

with torch.no_grad():
    original_output = model(test_input)
    quantized_output = quantized_model(test_input)

print("\nInference outputs match:", torch.allclose(original_output, quantized_output, atol=0.5))
print(f"Max difference: {(original_output - quantized_output).abs().max().item():.4f}")

Exact numbers depend on the random initial weights and on your PyTorch build. The float32 model should come out near 0.90 MB (235,146 parameters × 4 bytes), the quantized one at roughly a quarter of that, and the max difference should be small but not exactly zero.
What just happened?
We created a simple 3-layer neural network, then passed it through torch.quantization.quantize_dynamic() which replaced all nn.Linear layers with quantized versions. The model shrinks because weights are stored as int8 instead of float32. We then verified that inference still produces nearly identical outputs (the small difference comes from quantization rounding). The output shows both models produce the same result within tolerance.
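The size reduction can be estimated by hand from the layer shapes in SimpleNet. The arithmetic below is a rough sketch under two assumptions: biases stay in float32 after quantization, and serialization overhead plus the stored scale factors are ignored.

```python
# (in_features, out_features) for fc1, fc2, fc3 in SimpleNet
layers = [(784, 256), (256, 128), (128, 10)]

weight_count = sum(i * o for i, o in layers)
bias_count = sum(o for _, o in layers)

# float32 model: every weight and bias takes 4 bytes.
float32_bytes = (weight_count + bias_count) * 4

# Assumption: int8 weights take 1 byte each, biases remain float32.
int8_bytes = weight_count * 1 + bias_count * 4

ratio = float32_bytes / int8_bytes

print(f"weights: {weight_count:,}  biases: {bias_count:,}")
print(f"float32 estimate: {float32_bytes / 1024**2:.2f} MB")
print(f"int8 estimate:    {int8_bytes / 1024**2:.2f} MB")
print(f"ratio:            {ratio:.2f}x")
```

Because almost all of the bytes in this network are Linear weights, the estimate lands just under 4x. A model with many layers that are not quantized would see a smaller ratio.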
Common gotcha
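Put the model in .eval() mode before quantizing and comparing outputs: dynamic quantization is an inference-only technique, and layers such as dropout behave differently in training mode. Also note that quantize_dynamic() only converts the module types you list (here nn.Linear). Other layers, such as convolutions or batch norm, are silently left in float32, so a model dominated by those layers will shrink far less than 4x. Finally, quantize_dynamic() returns a new model by default: it does not modify the original in place, so keep and use the returned model.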
Error recovery
RuntimeError: Could not run 'quantized::linear' with arguments...
Module not quantized
Experienced dev note
Dynamic quantization is deceptively cheap: one-line-of-code cheap. The trap is thinking 'this must be worse than proper static quantization.' Often it isn't: for many real models (especially transformers and language models, where Linear layers dominate), dynamic quantization can deliver most of the speedup and file-size benefit (figures around 70–80% are typical claims, not guarantees) with almost no effort. Always benchmark it first on your actual hardware before jumping to more complex approaches like QAT or static quantization. Also, CPU is where dynamic quantization shines: the int8 kernels this API relies on target CPU backends, and on GPU, quantization has less ROI and adds complexity.
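A benchmark for that comparison does not need to be elaborate. The helper below is a hypothetical sketch (the name benchmark and its parameters are not from any library): it warms up, times repeated calls, and reports the median. It uses a stand-in workload so it runs without PyTorch.

```python
import statistics
import time


def benchmark(fn, *args, warmup=5, runs=30):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):  # warm caches and lazy initialization first
        fn(*args)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Stand-in workload so the sketch is self-contained. In practice, pass the
# float32 model and the quantized model with a realistic input batch.
def dummy_workload(n):
    return sum(i * i for i in range(n))


print(f"median: {benchmark(dummy_workload, 10_000):.3f} ms")
```

With the models from the Code section, the comparison would be benchmark(model, test_input) against benchmark(quantized_model, test_input), run inside torch.no_grad() and with a batch size that matches production traffic.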
Check your understanding
Why does dynamic quantization not require a calibration dataset, whereas static quantization does? What is the quantized model actually doing differently during inference compared to the original float32 model when it encounters a new batch of data it has never seen before?
Answer hint
A correct answer explains that dynamic quantization computes scale/zero-point statistics for activations per batch at runtime, so no offline calibration is needed. Weights are quantized once, when quantize_dynamic() is called, while each new batch of activations is quantized with parameters derived from that batch. The float32 model just uses the original values directly. The key insight is that 'dynamic' means the activation quantization parameters are computed on the fly, not pre-computed from a calibration set.
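The contrast with static quantization can be illustrated with a toy example. This is a deliberately exaggerated sketch, not how PyTorch's observers work: scale_for and fake_quantize are hypothetical helpers, the symmetric scheme is an assumption, and the calibration batch is chosen to be unrepresentative on purpose.

```python
def scale_for(values):
    """Symmetric int8 scale covering the largest magnitude in values."""
    return max(abs(v) for v in values) / 127 or 1.0


def fake_quantize(values, scale):
    """Quantize to int8 with the given scale, then map back to floats."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


calibration_batch = [0.5, -1.0, 0.8]  # what a static scheme sees offline
unseen_batch = [4.0, -0.2, 2.5]       # arrives later, at inference time

static_scale = scale_for(calibration_batch)  # fixed before deployment
dynamic_scale = scale_for(unseen_batch)      # recomputed when the batch arrives

static_result = fake_quantize(unseen_batch, static_scale)
dynamic_result = fake_quantize(unseen_batch, dynamic_scale)

print("static :", static_result)
print("dynamic:", dynamic_result)
```

With the fixed scale, values outside the calibrated range are clipped, which is why static quantization needs a representative calibration set. The dynamic scale adapts to the new batch at the cost of computing the range at runtime.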
Version note: in older PyTorch versions, torch.quantization.quantize_dynamic() existed but had different defaults for dtype. PyTorch 2.11.x (current) defaults to torch.qint8, which is correct for most use cases. If you are using an older version, explicitly pass dtype=torch.qint8.