How to use vLLM to serve LLMs
Quick answer
Use vLLM to serve large language models by installing the vllm Python package, loading a model for offline inference with vllm.LLM, or launching an OpenAI-compatible HTTP server with the vllm serve command. vLLM provides efficient, continuously batched, low-latency serving optimized for GPU usage.
Prerequisites
- Python 3.9+
- NVIDIA GPU with CUDA support
- pip install vllm
- Basic knowledge of Python
Setup
Install vllm via pip and ensure you have a CUDA-enabled GPU. Set up environment variables if needed for CUDA and model weights.
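For example, these environment variables are commonly set before launching vLLM (the GPU index and cache path are illustrative; the Hugging Face token is only needed for gated models such as Llama):

```shell
# Select which GPU vLLM may use (illustrative: GPU 0)
export CUDA_VISIBLE_DEVICES=0
# Directory where Hugging Face model weights are cached (illustrative path)
export HF_HOME="$HOME/.cache/huggingface"
# Access token for gated models (uncomment and set only if required)
# export HF_TOKEN=...
echo "$CUDA_VISIBLE_DEVICES"
```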
pip install vllm
Step by step
Start a vLLM server that handles requests over an OpenAI-compatible HTTP API. The example below serves a Llama-2-7B model locally.
# Start an OpenAI-compatible server (blocking call)
vllm serve meta-llama/Llama-2-7b-hf --host 0.0.0.0 --port 8000 --tensor-parallel-size 1
output
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
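Once the server is up, clients talk to it with standard OpenAI-style completion requests. A minimal sketch of such a request payload (the model name and prompt are illustrative; actually sending it requires the running server from the step above):

```python
import json

# OpenAI-style completion request for the server's /v1/completions endpoint
# (model name and prompt are illustrative)
payload = {
    "model": "meta-llama/Llama-2-7b-hf",
    "prompt": "Hello, world!",
    "max_tokens": 32,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body)

# With the server running, the same payload can be sent with curl:
#   curl http://0.0.0.0:8000/v1/completions \
#        -H "Content-Type: application/json" -d '<body>'
```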
Common variations
- Use different open-weight models by changing the model argument (e.g., mistralai/Mistral-7B-Instruct-v0.2 or meta-llama/Llama-3.1-70B-Instruct).
- Run offline batched inference with vllm.LLM and llm.generate() instead of standing up a server.
- Customize server options such as the maximum batch size (--max-num-seqs) and maximum context length (--max-model-len) for latency tuning.
from vllm import LLM, SamplingParams
# Load the model for offline batched inference
llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
# generate() processes all prompts together as one batch
outputs = llm.generate(["Hello, world!", "How are you?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Troubleshooting
- If you see CUDA out-of-memory errors, increase tensor_parallel_size to shard the model across more GPUs, lower gpu_memory_utilization, reduce max_model_len, or use a smaller model.
- For slow responses, tune the maximum batch size (--max-num-seqs) and use a half-precision dtype.
- Ensure your CUDA drivers and PyTorch version are compatible with your vllm release.
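The memory pressure behind these out-of-memory errors is dominated by the KV cache, and a quick back-of-the-envelope calculation shows why capping the context length helps. A sketch using Llama-2-7B's published shape parameters, assuming fp16 storage:

```python
# Approximate KV-cache footprint for Llama-2-7B in fp16
num_layers = 32      # transformer layers
num_kv_heads = 32    # key/value heads (Llama-2-7B uses full multi-head attention)
head_dim = 128       # dimension per head
bytes_per_value = 2  # fp16

# Both keys and values are cached for every layer and head
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# A single 4096-token sequence then needs ~2 GiB of KV cache, which is why
# lowering max_model_len (or the batch size) relieves memory pressure
seq_bytes = kv_bytes_per_token * 4096
print(seq_bytes / 2**30)  # 2.0 GiB
```

Multiply that per-sequence figure by the number of concurrent sequences in a batch to estimate total KV-cache demand on top of the ~13 GiB of fp16 model weights.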
Key Takeaways
- Install the vllm Python package to serve LLMs efficiently on GPUs.
- Start an OpenAI-compatible server with vllm serve for scalable, low-latency inference.
- Use batched generation to maximize throughput and reduce latency.
- Tune server parameters and model size to fit your hardware constraints and workload.
- Check CUDA compatibility and memory limits to avoid common runtime errors.