Concept Intermediate · 4 min read

What is model quantization for local LLMs

Quick answer
Model quantization is the process of converting a large language model's weights from high-precision floating-point numbers to lower-precision formats such as int8 or int4. This reduces the model's size and compute requirements, making efficient local deployment practical with runtimes such as Ollama.

How it works

Model quantization works by converting the model's parameters from 32-bit floating-point numbers to lower-bit representations such as 8-bit integers (int8) or 4-bit integers (int4). This compression reduces the memory footprint and speeds up inference by enabling faster arithmetic on CPUs and GPUs. Analogous to compressing a high-resolution image into a smaller file without losing essential details, quantization preserves most of the model's accuracy while making it lightweight enough for local use.
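The core idea can be shown with a minimal sketch of symmetric per-tensor int8 quantization: scale the weights so the largest magnitude maps to 127, round to integers, and keep the scale for dequantization. The helper names here are illustrative; production runtimes use more elaborate schemes (per-channel or group-wise scales, k-quants, etc.).

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map floats into int8 [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Each int8 weight needs 1 byte instead of 4; the rounding error per weight
# is at most half the scale step.
print(q, float(np.max(np.abs(w - w_hat))))
```

Note the trade-off this makes explicit: storage drops 4x, and the only cost is a bounded rounding error on each weight.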

Concrete example

For example, a 7-billion parameter model stored in 32-bit floats requires about 28 GB of memory (7B × 4 bytes). Quantizing it to 8-bit integers reduces this to approximately 7 GB, and 4-bit quantization halves it further to about 3.5 GB. This enables running the model on consumer-grade hardware.
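The arithmetic above can be checked directly; memory is simply the parameter count times bytes per weight:

```python
# Approximate memory for a 7B-parameter model at different precisions
params = 7_000_000_000

for name, bits in [("fp32", 32), ("int8", 8), ("int4", 4)]:
    gb = params * bits / 8 / 1e9  # bits / 8 = bytes per weight
    print(f"{name}: {gb:.1f} GB")

# fp32: 28.0 GB
# int8: 7.0 GB
# int4: 3.5 GB
```

These are weight-only figures; actual runtime memory is somewhat higher because of activations and the KV cache.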

python
import ollama

# Model tag is illustrative: Ollama names quantized builds with suffixes
# like "q8_0", e.g. "llama3:8b-instruct-q8_0"
model_name = "llama3:8b-instruct-q8_0"

response = ollama.chat(
    model=model_name,
    messages=[{"role": "user", "content": "Explain model quantization."}],
)

# The ollama Python client returns the reply under response['message'],
# not an OpenAI-style 'choices' list
print(response['message']['content'])
output
Model quantization reduces the precision of model weights, lowering memory and compute needs while maintaining accuracy.

When to use it

Use model quantization when deploying large language models locally on hardware with limited memory or compute power, such as laptops or edge devices. It is ideal for reducing latency and cost without cloud dependency. Avoid quantization if maximum model accuracy is critical or if you have abundant GPU resources, as quantization can slightly degrade performance.

Key terms

Model quantization: Reducing model weight precision to lower memory and compute requirements.
int8: An 8-bit integer format used for quantized weights.
int4: A 4-bit integer format for more aggressive quantization.
Inference: The process of running a model to generate predictions or outputs.
Local LLM: A large language model deployed and run on local hardware rather than in the cloud.

Key Takeaways

  • Quantization compresses model weights to int8 or int4, drastically reducing memory use.
  • It enables running large LLMs locally on consumer hardware with faster inference.
  • Quantization trades minimal accuracy loss for significant efficiency gains.
  • Use quantization when cloud access is limited or low latency is required.
  • Avoid quantization if absolute top accuracy is essential or hardware is abundant.