How to use vLLM to serve LLMs
Quick answer
Use vLLM to serve large language models by installing the vllm Python package, loading a model for offline inference with vllm.LLM, or launching an OpenAI-compatible HTTP server with the vllm serve command. vLLM provides efficient, continuously batched, low-latency serving optimized for GPU usage.
Prerequisites
- Python 3.9+
- NVIDIA GPU with CUDA support
- pip install vllm
- Basic knowledge of Python
Setup
Install vllm via pip and ensure you have a CUDA-enabled GPU. Set up environment variables if needed for CUDA and model weights.
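For example, these environment variables are commonly set before launching vLLM (the GPU index and cache path are illustrative; the Hugging Face token is only needed for gated models such as Llama):

```shell
# Select which GPU vLLM may use (illustrative: GPU 0)
export CUDA_VISIBLE_DEVICES=0
# Directory where Hugging Face model weights are cached (illustrative path)
export HF_HOME="$HOME/.cache/huggingface"
# Access token for gated models (uncomment and set only if required)
# export HF_TOKEN=...
echo "$CUDA_VISIBLE_DEVICES"
```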
pip install vllm
Step by step
Start a vLLM server that handles requests over an OpenAI-compatible HTTP API. The example below serves a Llama-2-7B model locally.
# Start an OpenAI-compatible server (blocking call)
vllm serve meta-llama/Llama-2-7b-hf --host 0.0.0.0 --port 8000 --tensor-parallel-size 1
output
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
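Once the server is up, clients talk to it with standard OpenAI-style completion requests. A minimal sketch of such a request payload (the model name and prompt are illustrative; actually sending it requires the running server from the step above):

```python
import json

# OpenAI-style completion request for the server's /v1/completions endpoint
# (model name and prompt are illustrative)
payload = {
    "model": "meta-llama/Llama-2-7b-hf",
    "prompt": "Hello, world!",
    "max_tokens": 32,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body)

# With the server running, the same payload can be sent with curl:
#   curl http://0.0.0.0:8000/v1/completions \
#        -H "Content-Type: application/json" -d '<body>'
```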
Common variations
- Use different open-weight models by changing the model argument (e.g., mistralai/Mistral-7B-Instruct-v0.2 or meta-llama/Llama-3.1-70B-Instruct).
- Run offline batched inference with vllm.LLM and llm.generate() instead of standing up a server.
- Customize server options such as the maximum batch size (--max-num-seqs) and maximum context length (--max-model-len) for latency tuning.
from vllm import LLM, SamplingParams
# Load the model for offline batched inference
llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
# generate() processes all prompts together as one batch
outputs = llm.generate(["Hello, world!", "How are you?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Troubleshooting
- If you see CUDA out-of-memory errors, increase tensor_parallel_size to shard the model across more GPUs, lower gpu_memory_utilization, reduce max_model_len, or use a smaller model.
- For slow responses, tune the maximum batch size (--max-num-seqs) and use a half-precision dtype.
- Ensure your CUDA drivers and PyTorch version are compatible with your vllm release.
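The memory pressure behind these out-of-memory errors is dominated by the KV cache, and a quick back-of-the-envelope calculation shows why capping the context length helps. A sketch using Llama-2-7B's published shape parameters, assuming fp16 storage:

```python
# Approximate KV-cache footprint for Llama-2-7B in fp16
num_layers = 32      # transformer layers
num_kv_heads = 32    # key/value heads (Llama-2-7B uses full multi-head attention)
head_dim = 128       # dimension per head
bytes_per_value = 2  # fp16

# Both keys and values are cached for every layer and head
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# A single 4096-token sequence then needs ~2 GiB of KV cache, which is why
# lowering max_model_len (or the batch size) relieves memory pressure
seq_bytes = kv_bytes_per_token * 4096
print(seq_bytes / 2**30)  # 2.0 GiB
```

Multiply that per-sequence figure by the number of concurrent sequences in a batch to estimate total KV-cache demand on top of the ~13 GiB of fp16 model weights.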
Key Takeaways
- Install the vllm Python package to serve LLMs efficiently on GPUs.
- Start an OpenAI-compatible server with vllm serve for scalable, low-latency inference.
- Use batched generation to maximize throughput and reduce latency.
- Tune server parameters and model size to fit your hardware constraints and workload.
- Check CUDA compatibility and memory limits to avoid common runtime errors.