How to use vLLM with quantized models
Quick answer
Serve a quantized model with vLLM by passing quantization="bitsandbytes" to the LLM constructor (or --quantization bitsandbytes to the vllm serve CLI); vLLM then quantizes the weights to 4-bit at load time. This reduces memory usage while largely preserving accuracy, and can improve throughput on memory-bound hardware.

Prerequisites
- Python 3.9+
- pip install vllm bitsandbytes torch
- A model checkpoint compatible with vLLM (a standard Hugging Face checkpoint, or one pre-quantized with a supported method such as AWQ or GPTQ)
Setup
Install vllm, bitsandbytes, and torch to enable quantized model loading and inference.
pip install vllm bitsandbytes torch

Step by step
vLLM loads and quantizes the model itself; you do not load it first with transformers' AutoModelForCausalLM and BitsAndBytesConfig, and an LLM instance cannot be built from an already-loaded transformers model object. Below is a Python example that loads a model with in-flight 4-bit bitsandbytes quantization and generates text. Note: some older vLLM releases also require load_format="bitsandbytes".
from vllm import LLM, SamplingParams

# Load the model with in-flight 4-bit bitsandbytes quantization.
# vLLM downloads the checkpoint and quantizes the weights at load time.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="bitsandbytes",
    dtype="float16",
)

# Prepare prompt
prompt = "Explain the benefits of quantization in LLMs."

# Generate with sampling parameters
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=100))

# Print the generated completion (the prompt itself is not included)
print(outputs[0].outputs[0].text)

Output
Quantization reduces the precision of model weights, which lowers memory usage and speeds up inference without significantly impacting accuracy. This enables running large models on limited hardware.
Common variations
You can serve quantized models with the vllm CLI, which exposes an OpenAI-compatible HTTP server. The Python API also supports async generation and streaming outputs. Besides bitsandbytes, vLLM can load checkpoints pre-quantized with methods such as AWQ, GPTQ, and FP8 via the same quantization option.
# CLI example to serve a model with 4-bit bitsandbytes quantization
vllm serve meta-llama/Llama-3.1-8B-Instruct --quantization bitsandbytes --port 8000
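Once the server is up, any OpenAI-compatible client can query it. As a minimal sketch, the request body below targets the server's /v1/completions endpoint (the model name and port match the serve command above); actually sending it is left as a commented curl command so the snippet runs without a live server.

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain quantization benefits.",
    "max_tokens": 50,
    "temperature": 0.7,
}
print(json.dumps(payload, indent=2))

# With the server running on port 8000, send it with e.g.:
# curl http://localhost:8000/v1/completions \
#   -H "Content-Type: application/json" -d "$(python this_script.py)"
```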
# Python async example. LLM.generate is synchronous and cannot be awaited;
# async and streaming generation go through AsyncLLMEngine instead
# (the async API has varied across vLLM versions).
import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def async_generate():
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
        model="meta-llama/Llama-3.1-8B-Instruct", quantization="bitsandbytes"))
    # generate() yields partial RequestOutput objects as tokens stream in
    final = None
    async for out in engine.generate("Explain quantization benefits.",
                                     SamplingParams(max_tokens=50), request_id="req-0"):
        final = out
    print(final.outputs[0].text)

asyncio.run(async_generate())

Troubleshooting
- If you see CUDA out of memory, lower gpu_memory_utilization or max_model_len, reduce the batch size, or use a more aggressive quantization level (4-bit weights take roughly half the memory of 8-bit).
- Ensure bitsandbytes is installed correctly and is compatible with your CUDA version.
- Model loading errors often mean the checkpoint is not compatible with the selected quantization method or device mapping.
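The memory pressure behind these errors is easy to estimate. The sketch below uses a hypothetical helper (not part of vLLM) that counts weight memory only; real usage adds KV cache, activations, and quantization overhead on top.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    # Weights only: parameters * bits, converted to bytes, then to decimal GB.
    return n_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at different precisions:
print(weight_memory_gb(8e9, 16))  # fp16  -> 16.0 GB
print(weight_memory_gb(8e9, 8))   # 8-bit ->  8.0 GB
print(weight_memory_gb(8e9, 4))   # 4-bit ->  4.0 GB
```

This is why 4-bit quantization can fit an 8B model on a single consumer GPU where fp16 cannot.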
Key Takeaways
- Pass quantization="bitsandbytes" to the LLM constructor (or the --quantization CLI flag) to load models in quantized form for efficient memory use.
- vLLM supports serving quantized models via the CLI and the Python API, with sampling and streaming.
- Adjust the quantization method and memory settings to fit your hardware and avoid out-of-memory errors.