How to · Beginner · 3 min read

PyTorch dynamic quantization guide

Quick answer
Dynamic quantization in PyTorch reduces model size and speeds up CPU inference by converting the weights of supported layers to lower precision (typically int8) ahead of time, while activations are quantized on the fly at inference. Use torch.quantization.quantize_dynamic() on supported layers like nn.Linear for an easy, effective quantization approach that requires no retraining or calibration data.

PREREQUISITES

  • Python 3.8+
  • pip install "torch>=1.7.0" (quote the requirement so the shell doesn't treat > as a redirect)

Setup

Install PyTorch if you haven't already. Dynamic quantization requires PyTorch 1.7.0 or newer. Use the following command to install or upgrade:

bash
pip install torch --upgrade
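To verify the installed version meets the minimum, compare version tuples rather than raw strings (string comparison would rank "1.10" below "1.7"). A minimal stdlib sketch; the hard-coded installed version is an illustrative placeholder for what you would read from torch.__version__:

```python
# Hypothetical installed version; in practice use torch.__version__
installed = "2.1.0+cpu"
minimum = "1.7.0"

def version_tuple(v):
    # Strip any local suffix like "+cpu", then compare numeric parts
    return tuple(int(part) for part in v.split("+")[0].split("."))

assert version_tuple(installed) >= version_tuple(minimum), (
    f"PyTorch {installed} is older than required {minimum}"
)
print("version OK:", installed)
```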

Step by step

This example shows how to apply dynamic quantization to a simple nn.Linear model. It demonstrates loading the model, applying quantization, and comparing model sizes.

python
import torch
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(20, 5)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Instantiate and evaluate original model
model = SimpleModel()
model.eval()

# Input tensor
input_tensor = torch.randn(1, 10)

# Run original model
original_output = model(input_tensor)

# Apply dynamic quantization to Linear layers
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Run quantized model
quantized_output = quantized_model(input_tensor)

# Compare outputs
print("Original output:", original_output)
print("Quantized output:", quantized_output)

# Compare model sizes
import io

def get_size_of_model(m):
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes

original_size = get_size_of_model(model)
quantized_size = get_size_of_model(quantized_model)
print(f"Original model size: {original_size} bytes")
print(f"Quantized model size: {quantized_size} bytes")
output
Original output: tensor([[...]], grad_fn=<AddmmBackward0>)
Quantized output: tensor([[...]])
Original model size: ... bytes
Quantized model size: ... bytes

Note that the quantized output has no grad_fn: dynamic quantized kernels don't participate in autograd. Exact byte counts vary with the PyTorch version, and torch.save serialization overhead dominates for a model this small; the Linear weights themselves shrink roughly 4x (fp32 to int8).
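Saved-file sizes understate what happens to the weights themselves, because torch.save adds serialization overhead that dominates for a model this small. Counting raw parameter bytes for the example model gives a clearer back-of-the-envelope picture (the metadata term is an approximation, not PyTorch output):

```python
# Parameter counts for SimpleModel: Linear(10, 20) and Linear(20, 5)
fc1_weights, fc1_bias = 10 * 20, 20
fc2_weights, fc2_bias = 20 * 5, 5
weights = fc1_weights + fc2_weights   # 300 weight values
biases = fc1_bias + fc2_bias          # 25 bias values

# fp32: every value takes 4 bytes
fp32_bytes = (weights + biases) * 4

# Dynamic int8 quantization: weights become 1 byte each; biases and
# per-tensor scale/zero-point metadata stay in fp32 (approximated here)
int8_bytes = weights * 1 + biases * 4 + 2 * 2 * 4  # 2 tensors x (scale, zp)

print(f"fp32 parameter bytes: {fp32_bytes}")   # 1300
print(f"int8 parameter bytes: {int8_bytes}")   # 416
```

So even for this toy model, the parameters alone shrink by roughly 3x; the gap to the theoretical 4x comes from the biases and quantization metadata that remain in fp32.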

Common variations

You can apply dynamic quantization to other layer types like nn.LSTM or nn.GRU by including them in the set of target layers. For example:

python
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)

Dynamic quantization targets CPU inference only; the quantized kernels have no GPU implementation. For GPU inference, consider reduced-precision approaches such as fp16/bf16 autocast instead.
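To confirm which layers were actually swapped, print the quantized model: converted layers show up as DynamicQuantizedLinear (or DynamicQuantizedLSTM), while unsupported layers keep their original type. A small self-contained sketch using an nn.Sequential stand-in for the model above:

```python
import torch
import torch.nn as nn

# Same shape as SimpleModel, built with nn.Sequential for brevity
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Converted layers print as DynamicQuantizedLinear; nn.ReLU is untouched
print(quantized_model)
```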

Troubleshooting

  • Output differs significantly: Dynamic quantization can introduce small numerical differences; verify your model supports quantization and test accuracy.
  • Model size not reduced: Only supported layers are quantized; ensure your model uses nn.Linear or quantizable layers.
  • Runtime errors: Confirm PyTorch version is 1.7.0 or newer and that you are running on CPU.
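The numeric differences in the first bullet come from rounding weights through int8. A stdlib sketch of the symmetric per-tensor scheme used for the weights (scale chosen so the largest magnitude maps to 127), with hypothetical weight values:

```python
weights = [0.42, -1.3, 0.07, 0.9, -0.55]

# Symmetric per-tensor quantization: map max |w| to 127
scale = max(abs(w) for w in weights) / 127

quantized = [round(w / scale) for w in weights]   # int8 values
dequantized = [q * scale for q in quantized]      # back to float

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print("int8 values:", quantized)
print("max round-trip error:", max_error)
assert max_error <= scale / 2  # error bounded by half a quantization step
```

The per-weight error is bounded by half a quantization step (scale / 2), which is why small output drift is expected and why you should still validate end-to-end accuracy.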

Key Takeaways

  • Use torch.quantization.quantize_dynamic() to easily apply dynamic quantization on supported layers.
  • Dynamic quantization reduces model size and speeds up CPU inference without retraining.
  • Supported layers include nn.Linear, nn.LSTM, and nn.GRU.
  • Dynamic quantization works only on CPU; GPU inference requires other quantization methods.
  • Always test model accuracy after quantization to ensure acceptable performance.
Verified 2026-04