RuntimeError
torch._C._RuntimeError
Stack trace
RuntimeError: Expected tensor dtype torch.qint8 but got torch.float32 during quantized model inference.
Why it happens
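Quantized models store weights and activations in low-precision quantized dtypes such as torch.qint8 and torch.quint8, each paired with a scale and a zero point. These are distinct from plain integer dtypes like torch.int8. If an input tensor or a model weight reaches a quantized operator without having been converted to the quantized dtype it expects, PyTorch raises a dtype mismatch error during inference.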
Detection
Inspect the dtype of model inputs and weights before inference (tensor.dtype and tensor.is_quantized), and assert that each tensor matches the dtype the model expects so that mismatches surface early.
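A minimal sketch of such a check, assuming the model expects torch.quint8 inputs. The helper name check_quantized_input and its default dtype are illustrative assumptions, not part of PyTorch:

```python
import torch

def check_quantized_input(tensor, expected_dtype=torch.quint8):
    """Raise TypeError early if `tensor` is not quantized with the expected dtype.

    Hypothetical helper for illustration; set expected_dtype to match your model.
    """
    if not tensor.is_quantized:
        raise TypeError(f"expected a quantized tensor, got dtype {tensor.dtype}")
    if tensor.dtype != expected_dtype:
        raise TypeError(f"expected {expected_dtype}, got {tensor.dtype}")
    return tensor

x_float = torch.randn(1, 10)  # float32 tensor
# scale and zero_point are placeholder values for the example
x_quant = torch.quantize_per_tensor(x_float, scale=0.1, zero_point=128, dtype=torch.quint8)

check_quantized_input(x_quant)    # passes and returns the tensor
# check_quantized_input(x_float)  # would raise TypeError before the model is called
```

Calling the helper right before inference turns a late failure inside a quantized kernel into an early, readable one.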
Causes & fixes
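Input tensors are float32 but the model expects a quantized dtype such as torch.quint8 or torch.qint8.
Quantize the input with torch.quantize_per_tensor, supplying the scale and zero point the model was calibrated with, before passing it to the model. A plain cast such as tensor.to(torch.int8) does not produce a quantized tensor. Dynamically quantized models are the exception: they take float32 inputs and quantize activations internally.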
Model weights were not properly quantized or loaded with the correct dtype.
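Rebuild the model with the same prepare and convert steps used when it was quantized, then load the quantized state_dict into that converted model. Loading quantized weights into a float32 model, or float32 weights into a quantized one, typically fails or leaves mismatched dtypes.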
Mixing quantized and non-quantized layers or tensors in the model pipeline.
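Verify that every boundary between quantized and float32 components converts explicitly: quantize before a quantized layer and dequantize before a float32 layer. In eager-mode static quantization, QuantStub and DeQuantStub mark these boundaries.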
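The quantize and dequantize conversions mentioned in these fixes can be sketched at the tensor level. The scale and zero_point values below are placeholder assumptions; in a real pipeline they would come from calibration.

```python
import torch

x = torch.tensor([[-1.0, 0.0, 0.5, 1.0]])  # float32 activations

# Placeholder quantization parameters, normally produced by calibration
scale, zero_point = 0.01, 128

# Quantize: stored value = round(x / scale) + zero_point, clamped to [0, 255] for quint8
xq = torch.quantize_per_tensor(x, scale=scale, zero_point=zero_point, dtype=torch.quint8)

print(xq.dtype)         # quantized dtype of the tensor
print(xq.is_quantized)  # whether the tensor carries scale and zero point
print(xq.int_repr())    # underlying uint8 storage

# Dequantize before handing the tensor to a float32 layer
x_back = xq.dequantize()
print(x_back.dtype)
```

The round trip is lossy in general: values are recovered only to within one quantization step (scale).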
Code: broken vs fixed
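# Broken: a statically quantized layer receives a float32 tensor
import torch
import torch.ao.nn.quantized as nnq
model = nnq.Linear(10, 5)  # quantized layer with qint8 weights (placeholder values here)
input_tensor = torch.randn(1, 10)  # float32 tensor
output = model(input_tensor)  # raises: the quantized kernel does not accept a float32 input
# Fixed: quantize the input before calling the layer
import torch
import torch.ao.nn.quantized as nnq
model = nnq.Linear(10, 5)
input_tensor = torch.randn(1, 10)
# Activations are typically quint8 on the CPU backends; scale and zero_point are
# placeholders here and should come from calibration in a real pipeline
input_quantized = torch.quantize_per_tensor(input_tensor, scale=0.05, zero_point=128, dtype=torch.quint8)
output = model(input_quantized)  # quantized tensor in, quantized tensor out
print(output.dequantize())  # convert back to float32 for inspection
# Note: models from torch.quantization.quantize_dynamic take float32 inputs directly
# and quantize activations internally, so they do not need this conversion
Workaround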
Catch the RuntimeError and manually convert input tensors to the expected quantized dtype using torch.quantize_per_tensor before retrying inference.
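A rough sketch of this retry pattern follows. The helper run_with_quantized_retry and the stand-in fake_quantized_model are hypothetical names introduced for illustration, and the scale and zero_point are placeholders that must match the values the real model was calibrated with:

```python
import torch

def run_with_quantized_retry(model, x, scale, zero_point, dtype=torch.quint8):
    """Call model(x); on RuntimeError with a float input, quantize x and retry once.

    Hypothetical helper. If scale and zero_point do not match the model's
    calibration, the retry may run but produce wrong values.
    """
    try:
        return model(x)
    except RuntimeError:
        if x.is_quantized:
            raise  # input was already quantized, so the error has another cause
        xq = torch.quantize_per_tensor(x, scale=scale, zero_point=zero_point, dtype=dtype)
        return model(xq)

def fake_quantized_model(x):
    # Stand-in for a real quantized model, used only to keep the sketch self-contained
    if not x.is_quantized:
        raise RuntimeError(f"Expected quantized input but got {x.dtype}")
    return x.dequantize()

out = run_with_quantized_retry(
    fake_quantized_model, torch.randn(1, 10), scale=0.05, zero_point=128
)
print(out.dtype)
```

Treat this as a stopgap: a broad except RuntimeError can hide unrelated failures, so converting inputs up front is usually the cleaner fix.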
Prevention
Use consistent quantization-aware training and inference pipelines that enforce dtype compatibility, and validate tensor dtypes at each stage to avoid mismatches.
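One way to validate dtypes at each stage is a forward pre-hook that checks a layer's inputs before the layer runs. The sketch below uses a float32 layer so that it does not depend on a quantized backend; make_dtype_guard is a hypothetical helper name, and for a quantized layer the expected dtype would be a quantized one such as torch.quint8.

```python
import torch

def make_dtype_guard(expected_dtype):
    """Build a forward pre-hook that rejects tensor inputs with an unexpected dtype."""
    def guard(module, args):
        for arg in args:
            if isinstance(arg, torch.Tensor) and arg.dtype != expected_dtype:
                raise TypeError(
                    f"{module.__class__.__name__} expected {expected_dtype}, got {arg.dtype}"
                )
    return guard

# Demonstrated on a float32 layer; pass a quantized dtype for a quantized layer
layer = torch.nn.Linear(10, 5)
handle = layer.register_forward_pre_hook(make_dtype_guard(torch.float32))

layer(torch.randn(1, 10))  # passes the guard
# layer(torch.randn(1, 10, dtype=torch.float64))  # would raise TypeError from the guard

handle.remove()  # detach the hook once validation is no longer needed
```

Hooks like this are most useful in tests or staging runs, where a clear error at the offending layer is worth the small per-call overhead.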