How to use vLLM with quantized models
Quick answer
Serve a quantized model with vLLM by passing quantization="bitsandbytes" to the LLM constructor (or --quantization bitsandbytes to the vllm serve CLI); vLLM then quantizes the weights to 4-bit at load time. This reduces memory usage while largely preserving accuracy, and can improve throughput on memory-bound hardware.

Prerequisites
- Python 3.9+
- pip install vllm bitsandbytes torch
- A model checkpoint compatible with vLLM (a standard Hugging Face checkpoint, or one pre-quantized with a supported method such as AWQ or GPTQ)
Setup
Install vllm, bitsandbytes, and torch to enable quantized model loading and inference.
pip install vllm bitsandbytes torch

Step by step
vLLM loads and quantizes the model itself; you do not load it first with transformers' AutoModelForCausalLM and BitsAndBytesConfig, and an LLM instance cannot be built from an already-loaded transformers model object. Below is a Python example that loads a model with in-flight 4-bit bitsandbytes quantization and generates text. Note: some older vLLM releases also require load_format="bitsandbytes".
from vllm import LLM, SamplingParams

# Load the model with in-flight 4-bit bitsandbytes quantization.
# vLLM downloads the checkpoint and quantizes the weights at load time.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="bitsandbytes",
    dtype="float16",
)

# Prepare prompt
prompt = "Explain the benefits of quantization in LLMs."

# Generate with sampling parameters
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=100))

# Print the generated completion (the prompt itself is not included)
print(outputs[0].outputs[0].text)

Output
Quantization reduces the precision of model weights, which lowers memory usage and speeds up inference without significantly impacting accuracy. This enables running large models on limited hardware.
Common variations
You can serve quantized models with the vllm CLI, which exposes an OpenAI-compatible HTTP server. The Python API also supports async generation and streaming outputs. Besides bitsandbytes, vLLM can load checkpoints pre-quantized with methods such as AWQ, GPTQ, and FP8 via the same quantization option.
# CLI example to serve a model with 4-bit bitsandbytes quantization
vllm serve meta-llama/Llama-3.1-8B-Instruct --quantization bitsandbytes --port 8000
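Once the server is up, any OpenAI-compatible client can query it. As a minimal sketch, the request body below targets the server's /v1/completions endpoint (the model name and port match the serve command above); actually sending it is left as a commented curl command so the snippet runs without a live server.

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain quantization benefits.",
    "max_tokens": 50,
    "temperature": 0.7,
}
print(json.dumps(payload, indent=2))

# With the server running on port 8000, send it with e.g.:
# curl http://localhost:8000/v1/completions \
#   -H "Content-Type: application/json" -d "$(python this_script.py)"
```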
# Python async example. LLM.generate is synchronous and cannot be awaited;
# async and streaming generation go through AsyncLLMEngine instead
# (the async API has varied across vLLM versions).
import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def async_generate():
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(
        model="meta-llama/Llama-3.1-8B-Instruct", quantization="bitsandbytes"))
    # generate() yields partial RequestOutput objects as tokens stream in
    final = None
    async for out in engine.generate("Explain quantization benefits.",
                                     SamplingParams(max_tokens=50), request_id="req-0"):
        final = out
    print(final.outputs[0].text)

asyncio.run(async_generate())

Troubleshooting
- If you see CUDA out of memory, lower gpu_memory_utilization or max_model_len, reduce the batch size, or use a more aggressive quantization level (4-bit weights take roughly half the memory of 8-bit).
- Ensure bitsandbytes is installed correctly and is compatible with your CUDA version.
- Model loading errors often mean the checkpoint is not compatible with the selected quantization method or device mapping.
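The memory pressure behind these errors is easy to estimate. The sketch below uses a hypothetical helper (not part of vLLM) that counts weight memory only; real usage adds KV cache, activations, and quantization overhead on top.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    # Weights only: parameters * bits, converted to bytes, then to decimal GB.
    return n_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model at different precisions:
print(weight_memory_gb(8e9, 16))  # fp16  -> 16.0 GB
print(weight_memory_gb(8e9, 8))   # 8-bit ->  8.0 GB
print(weight_memory_gb(8e9, 4))   # 4-bit ->  4.0 GB
```

This is why 4-bit quantization can fit an 8B model on a single consumer GPU where fp16 cannot.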
Key Takeaways
- Pass quantization="bitsandbytes" to the LLM constructor (or the --quantization CLI flag) to load models in quantized form for efficient memory use.
- vLLM supports serving quantized models via the CLI and the Python API, with sampling and streaming.
- Adjust the quantization method and memory settings to fit your hardware and avoid out-of-memory errors.