How to · Intermediate · 3 min read

How to use vLLM to serve LLMs

Quick answer
Use vLLM to serve large language models by installing the vllm Python package, loading a model with vllm.LLM for offline batched inference, or launching the bundled OpenAI-compatible API server for online serving. vLLM's PagedAttention and continuous batching give high-throughput, low-latency inference on GPUs.

Prerequisites

  • Python 3.8+
  • NVIDIA GPU with CUDA support
  • pip install vllm
  • Basic knowledge of Python async programming

Setup

Install vllm via pip and ensure you have a CUDA-enabled GPU. Set up environment variables if needed for CUDA and model weights.

bash
pip install vllm
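
If your weights cache lives in a non-default location or the checkpoint is gated, a few environment variables are worth setting before launching. The values below are placeholders, not recommendations:

```shell
# Pin the engine to a specific GPU
export CUDA_VISIBLE_DEVICES=0
# Cache downloaded weights in a custom directory (placeholder path)
export HF_HOME=/data/hf-cache
# Access token for gated checkpoints such as Llama 2 (placeholder value)
export HUGGING_FACE_HUB_TOKEN=hf_xxx
```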

Step by step

Load a model with vllm.LLM for offline batched inference, or launch vLLM's OpenAI-compatible HTTP server for online serving. The example below runs Llama 2 7B locally; the meta-llama checkpoints on the Hugging Face Hub are gated, so request access first.

python
from vllm import LLM, SamplingParams

# Load the model (weights are fetched from the Hugging Face Hub on first run)
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)

# Generate completions for a batch of prompts
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)

To serve the model over HTTP instead, launch vLLM's OpenAI-compatible API server from the shell:

bash
vllm serve meta-llama/Llama-2-7b-hf --host 0.0.0.0 --port 8000
output
INFO:     Started server process
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
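
Once the server from the step above is listening on port 8000, clients talk to it through the OpenAI-style REST API. The sketch below builds a request body for the /v1/completions endpoint; the prompt and sampling parameters are illustrative, and the model field must match the name the server was started with:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint
payload = {
    "model": "meta-llama/Llama-2-7b-hf",  # must match the served model name
    "prompt": "San Francisco is a",
    "max_tokens": 32,
    "temperature": 0.8,
}

# POST this JSON to http://localhost:8000/v1/completions with any HTTP
# client (e.g. curl); the response follows the OpenAI completions schema.
body = json.dumps(payload)
print(body)
```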

Common variations

  • Use different open-weight models by changing the model parameter (e.g., mistralai/Mistral-7B-Instruct-v0.2 or meta-llama/Llama-3.1-70B-Instruct). Proprietary models such as gpt-4o cannot be served by vLLM.
  • Handle concurrent requests programmatically with AsyncLLMEngine; for offline batch processing, simply pass a list of prompts to llm.generate().
  • Customize engine options such as max_num_seqs, gpu_memory_utilization, and max_model_len for latency and memory tuning.
python
import asyncio
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

async def async_inference():
    # Build the async engine from the same arguments LLM accepts
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="meta-llama/Llama-2-7b-hf"))
    params = SamplingParams(max_tokens=64)
    # generate() streams partial RequestOutputs; the last one is complete
    final = None
    async for request_output in engine.generate(
            "Hello, world!", params, request_id="request-0"):
        final = request_output
    print(final.outputs[0].text)

asyncio.run(async_inference())

The completion is sampled, so the printed text varies from run to run.

Troubleshooting

  • If you see CUDA out-of-memory errors, lower gpu_memory_utilization or max_model_len, switch to a smaller or quantized model, or shard the model across more GPUs by increasing tensor_parallel_size.
  • For slow responses, tune max_num_seqs and the batching limits; vLLM already runs in fp16/bf16 by default, so quantized checkpoints (e.g., AWQ or GPTQ) are the next lever.
  • Ensure your CUDA drivers and PyTorch versions are compatible with vllm.
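
Putting the memory and latency knobs together, a tuned server launch might look like the following. The flag values are illustrative starting points, not recommendations:

```shell
# Cap vLLM's share of GPU memory (default is 0.9), bound the context
# length, and limit how many sequences are batched concurrently
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --max-num-seqs 64
```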

Key Takeaways

  • Install and use the vllm Python package to serve LLMs efficiently on GPUs.
  • Launch the OpenAI-compatible API server (vllm serve) for scalable, low-latency online inference.
  • Rely on vLLM's continuous batching for throughput, and use AsyncLLMEngine when you need streaming or concurrent programmatic access.
  • Tune server parameters and model size to fit your hardware constraints and workload.
  • Check CUDA compatibility and memory limits to avoid common runtime errors.
Verified 2026-04 · llama-2-7b, llama-3.1-70b