How-to · Intermediate · 3 min read

vLLM quantization options

Quick answer
vLLM supports several quantization methods, including 4-bit weight formats such as AWQ, GPTQ, and bitsandbytes, and 8-bit formats such as FP8 and INT8, to reduce memory usage and speed up inference. You enable quantization by passing the quantization argument to the LLM constructor, e.g., quantization="bitsandbytes" for in-flight 4-bit quantization; checkpoints that were already quantized offline are usually detected automatically from their config.

PREREQUISITES

  • Python 3.9+ (recent vLLM releases no longer support Python 3.8)
  • pip install vllm
  • An NVIDIA GPU with CUDA (most of vLLM's quantized kernels target CUDA)
  • Access to a vLLM-compatible model checkpoint

Setup

Install the vllm package via pip and ensure you have a compatible model checkpoint downloaded locally or accessible remotely.

bash
pip install vllm
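
To confirm the install works and that a CUDA GPU is visible before loading a model, you can run a quick check. This is a minimal sketch; knowing the exact vLLM version helps because quantization support varies between releases.

python
# Check the installed vLLM version and that a CUDA-capable GPU is visible.
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())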

Step by step

Load a model with quantization enabled by passing the quantization argument to the LLM constructor. Supported methods include awq and gptq (for checkpoints quantized offline to 4-bit), bitsandbytes (in-flight 4-bit quantization of an unquantized checkpoint), and fp8 (8-bit floating point on GPUs that support it); pre-quantized checkpoints are usually detected from the model config even without the argument.

This example quantizes the model to 4 bits in flight with bitsandbytes and generates text from a prompt.

python
from vllm import LLM, SamplingParams

# Load the model with in-flight 4-bit (bitsandbytes) quantization;
# older vLLM releases also require load_format="bitsandbytes"
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="bitsandbytes")

# Generate text
outputs = llm.generate(["Explain quantization in vLLM."], SamplingParams(temperature=0.7))

print(outputs[0].outputs[0].text)
output
Quantization in vLLM reduces model size and memory usage by representing weights with fewer bits, enabling faster inference with minimal accuracy loss.
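
If you start from a checkpoint that was already quantized offline (for example with AWQ or GPTQ), vLLM normally reads the quantization method from the checkpoint's config, so passing the quantization argument is optional. A minimal sketch follows, assuming a community AWQ 4-bit build of this model is available; the repository name is illustrative, so substitute whichever quantized checkpoint you actually use.

python
from vllm import LLM, SamplingParams

# Illustrative AWQ 4-bit checkpoint name; replace with the quantized
# checkpoint you actually have access to.
llm_awq = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",  # optional; vLLM usually detects AWQ from the config
)

outputs = llm_awq.generate(["Explain AWQ quantization briefly."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)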

Common variations

You can switch methods by changing the quantization argument, for example quantization="fp8" for 8-bit floating-point weights and activations on GPUs with FP8 support, or omit the argument for full precision (or to let vLLM auto-detect a pre-quantized checkpoint). The same setting is exposed as the --quantization flag of the vllm serve CLI. Async generation is also supported for quantized models, but it is unrelated to quantization itself.

python
from vllm import LLM, SamplingParams

# FP8 (8-bit) dynamic quantization of an unquantized checkpoint;
# requires a GPU with FP8 support (e.g. Hopper or Ada Lovelace)
llm_fp8 = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

outputs = llm_fp8.generate(["Benefits of 8-bit quantization."], SamplingParams())
print(outputs[0].outputs[0].text)
output
8-bit quantization balances memory savings and model accuracy, making it suitable for deployment on resource-constrained hardware.

Troubleshooting

  • If you encounter errors loading quantized models, verify that your vLLM version supports the quantization method you requested.
  • Ensure the checkpoint format matches the method; awq and gptq expect checkpoints quantized offline, while fp8 and bitsandbytes can quantize an unquantized checkpoint on the fly.
  • Check GPU memory availability; quantization reduces but does not eliminate memory needs (see the sketch below for ways to trim it further).
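
If a quantized model still does not fit, the LLM constructor's gpu_memory_utilization and max_model_len arguments can shrink vLLM's memory budget and KV-cache size. This is a rough sketch; the values are illustrative placeholders, not tuned recommendations.

python
from vllm import LLM

# Combine quantization with a smaller memory budget and context window.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="bitsandbytes",   # in-flight 4-bit weight quantization
    gpu_memory_utilization=0.80,   # cap vLLM's share of GPU memory (placeholder value)
    max_model_len=4096,            # shorter max context means a smaller KV cache (placeholder)
)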

Key Takeaways

  • Use the quantization argument of the LLM constructor (e.g., "awq", "gptq", "bitsandbytes", "fp8") to enable quantization in vLLM.
  • Quantization reduces memory usage and speeds up inference with minimal accuracy loss.
  • Verify model compatibility and vllm version when using quantization options.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct