Concept · Intermediate · 4 min read

What is model quantization in LLMs?

Quick answer
Model quantization in LLMs is the process of converting a model's weights (and often activations) from high-precision formats, such as 32-bit floats, to lower-precision formats, such as 8-bit integers. This shrinks memory usage and speeds up inference, making it practical to deploy large language models on resource-constrained hardware with little loss of accuracy.

How it works

Model quantization works by mapping the original high-precision floating-point numbers (e.g., 32-bit floats) used in LLM weights and activations to lower-precision formats such as 16-bit floats, 8-bit integers, or even 4-bit integers. Imagine converting a detailed color photo into a simpler pixel art version: you lose some detail but keep the overall image recognizable. Similarly, quantization reduces the numerical precision to save memory and computation, trading off a small amount of accuracy for efficiency.

This process often involves scaling and zero-point adjustments to preserve the dynamic range of values. Quantization can be done post-training (post-training quantization) or during training (quantization-aware training) to better maintain model accuracy.
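To make the quantization-aware training idea concrete, here is a minimal sketch of "fake quantization", the forward-pass trick QAT frameworks commonly simulate: values are rounded to the integer grid and immediately dequantized, so the model trains against the precision it will have after deployment. The function name `fake_quantize` and the 8-bit range are illustrative assumptions, not a specific library's API, and the gradient handling real QAT needs (e.g. a straight-through estimator) is omitted:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize round trip, as simulated during QAT.

    Illustrative sketch only: real QAT frameworks also define how
    gradients flow through the rounding step, which is omitted here.
    """
    qmax = 2 ** num_bits - 1                 # 255 for 8 bits
    scale = (x.max() - x.min()) / qmax       # width of one integer step
    zero_point = np.round(-x.min() / scale)  # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale          # back to float, snapped to the grid

x = np.array([0.1, -0.5, 0.3, 0.9], dtype=np.float32)
print(fake_quantize(x))  # close to x, but snapped to the 8-bit grid
```

During training the network sees these snapped values, so its weights drift toward settings that lose little when the real integer conversion happens at deployment.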

Concrete example

Here is a simple Python example using numpy to quantize a floating-point tensor to 8-bit integers and then dequantize it back:

python
import numpy as np

# Original float32 tensor (weights)
weights = np.array([0.1, -0.5, 0.3, 0.9], dtype=np.float32)

# Define quantization parameters
scale = (weights.max() - weights.min()) / 255
zero_point = np.round(-weights.min() / scale).astype(np.uint8)

# Quantize: float32 -> uint8 (clip so rounding cannot leave the uint8 range,
# since casting a negative float to uint8 would wrap around)
quantized = np.clip(np.round(weights / scale + zero_point), 0, 255).astype(np.uint8)

# Dequantize: uint8 -> float32
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print("Original weights:", weights)
print("Quantized weights:", quantized)
print("Dequantized weights:", dequantized)
output
Original weights: [ 0.1 -0.5  0.3  0.9]
Quantized weights: [109   0 146 255]
Dequantized weights: [ 0.09882353 -0.49960783  0.30196078  0.90039214]
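Continuing this example, the payoff is easy to measure: uint8 storage is one quarter of float32, and the round-trip error stays below one quantization step (the scale). A small check, repeating the same setup so it runs standalone:

```python
import numpy as np

# Same weights and parameters as the example above
weights = np.array([0.1, -0.5, 0.3, 0.9], dtype=np.float32)
scale = (weights.max() - weights.min()) / 255
zero_point = np.round(-weights.min() / scale).astype(np.uint8)
quantized = np.clip(np.round(weights / scale + zero_point), 0, 255).astype(np.uint8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale

# uint8 uses 1 byte per value vs 4 bytes for float32: a 4x reduction
print("float32 bytes:", weights.nbytes)   # 16
print("uint8 bytes:", quantized.nbytes)   # 4

# Round-trip error is bounded by the quantization step size
max_error = np.abs(weights - dequantized).max()
print("max error:", max_error, "<= scale:", float(scale))
```

For a 4-element toy tensor the savings are trivial, but the same 4x ratio applies to every weight matrix in a multi-billion-parameter model.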

When to use it

Use model quantization when you need to deploy LLMs on hardware with limited memory or compute power, such as edge devices, mobile phones, or cost-sensitive cloud environments. It is ideal for speeding up inference and reducing storage without retraining the model from scratch.

Avoid quantization if your application demands the highest possible accuracy or if the model is already small and fast enough, as quantization can introduce slight accuracy degradation.

Key terms

Term | Definition
Quantization | Reducing the numerical precision of model weights/activations to save memory and compute.
Post-training quantization | Applying quantization after the model is fully trained.
Quantization-aware training | Training the model with quantization effects simulated to preserve accuracy.
Scale | A factor used to map floating-point values to the integer range during quantization.
Zero-point | An integer offset used in quantization to align zero in the floating-point and integer domains.

Key Takeaways

  • Quantization reduces LLM model size and speeds up inference by lowering numerical precision.
  • Post-training quantization is quick but may reduce accuracy; quantization-aware training preserves accuracy better.
  • Use quantization to deploy large models on limited hardware or reduce cloud compute costs.
  • Quantization involves scale and zero-point to map floats to integers effectively.
  • Not suitable when maximum model accuracy is critical or model size is already minimal.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, llama-3.1-405b