How to run a vLLM server
Quick answer
Start a vLLM server locally with the CLI command
vllm serve <model-name> --port 8000. Then query it from Python using the openai SDK with base_url="http://localhost:8000/v1" for fast, low-latency inference.

Prerequisites
- Python 3.8+
- pip install vllm "openai>=1.0"
- A model supported by vLLM (e.g., meta-llama/Llama-3.1-8B-Instruct)
Setup
Install the vllm package and openai SDK via pip. Download a supported model checkpoint for vLLM, such as meta-llama/Llama-3.1-8B-Instruct. Ensure Python 3.8 or higher is installed.
pip install vllm "openai>=1.0"

Step by step
Start the vLLM server locally on port 8000 with your chosen model. Then use Python and the openai SDK to send chat completion requests to the server's REST API endpoint.
# Start the vLLM server in a terminal
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Python client code to query the running server
from openai import OpenAI

# The local vLLM server does not check API keys by default (unless started
# with --api-key), but the openai client requires a non-empty string.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello, vLLM!"}]
)
print(response.choices[0].message.content)

Output
Hello, vLLM! How can I assist you today?
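Under the hood, the SDK call above POSTs a JSON body to the server's /v1/chat/completions endpoint. A minimal sketch of that request body (model name taken from the steps above):

```python
import json

# Sketch of the JSON body the openai SDK sends to
# POST http://localhost:8000/v1/chat/completions for the call above.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, vLLM!"}],
}
print(json.dumps(payload, indent=2))
```

Knowing this shape means any HTTP client (curl, requests, urllib) can talk to the server, not just the openai SDK.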
Common variations
- Use a different model by changing the model name in the CLI and Python code.
- Run the server on a different port by modifying the --port argument.
- Send several requests concurrently for batch throughput; vLLM automatically batches in-flight requests on the server side.
- Implement async calls by integrating with async HTTP clients if needed.
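The batch and concurrency variations above can be sketched with the standard library alone (no extra SDK). The endpoint and model name below mirror the earlier steps; ThreadPoolExecutor is just one way to overlap requests client-side so vLLM's server-side batching can kick in:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumes the server from the steps above is running on this port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_payload(prompt: str) -> dict:
    # Request body shape for POST /v1/chat/completions
    return {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    prompts = ["Hello!", "What is vLLM?", "Name three planets."]
    # Issuing requests concurrently lets the server batch them together.
    with ThreadPoolExecutor(max_workers=3) as pool:
        for answer in pool.map(ask, prompts):
            print(answer)
```

For fully async code, the openai package's AsyncOpenAI client with the same base_url is the drop-in alternative.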
Troubleshooting
- If the server fails to start, verify the model path and that dependencies are installed.
- If Python requests time out, confirm the server is running and accessible at the specified port.
- Check the client's API key: the vLLM server does not require one by default, but the openai client insists on a non-empty placeholder string.
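For the timeout case, a quick health probe helps separate "server down" from "generation is slow": vLLM's OpenAI-compatible server exposes a plain GET /health endpoint. A small stdlib-only helper (a sketch, not part of vLLM itself):

```python
import urllib.error
import urllib.request

def server_healthy(base: str = "http://localhost:8000") -> bool:
    # vLLM's OpenAI-compatible server returns HTTP 200 from GET /health
    # once the model is loaded and the server is accepting requests.
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If this returns False while the terminal running vllm serve shows no errors, double-check the port and any firewall rules.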
Key Takeaways
- Use the CLI command vllm serve <model> --port 8000 to start the server locally.
- Query the running server with the openai Python SDK by setting base_url to the server endpoint.
- You can switch models or ports easily by changing CLI arguments and client parameters.
- Ensure your environment has Python 3.8+, and install the vllm and openai packages.
- Troubleshoot by verifying server status, model availability, and environment variables.