How to run a vLLM server
Quick answer
Start a vLLM server locally with the CLI command
vllm serve <model-name> --port 8000. Then query it from Python using the openai SDK with base_url="http://localhost:8000/v1" for fast, low-latency inference.

Prerequisites
- Python 3.8+
- pip install vllm "openai>=1.0"
- A model supported by vLLM (e.g., meta-llama/Llama-3.1-8B-Instruct)
Setup
Install the vllm package and openai SDK via pip. Download a supported model checkpoint for vLLM, such as meta-llama/Llama-3.1-8B-Instruct. Ensure Python 3.8 or higher is installed.
pip install vllm "openai>=1.0"

Step by step
Start the vLLM server locally on port 8000 with your chosen model. Then use Python and the openai SDK to send chat completion requests to the server's REST API endpoint.
# Start the vLLM server in a terminal
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Python client code to query the running server
from openai import OpenAI

# The local vLLM server does not check API keys by default (unless started
# with --api-key), but the openai client requires a non-empty string.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello, vLLM!"}]
)
print(response.choices[0].message.content)

Output
Hello, vLLM! How can I assist you today?
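Under the hood, the SDK call above POSTs a JSON body to the server's /v1/chat/completions endpoint. A minimal sketch of that request body (model name taken from the steps above):

```python
import json

# Sketch of the JSON body the openai SDK sends to
# POST http://localhost:8000/v1/chat/completions for the call above.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, vLLM!"}],
}
print(json.dumps(payload, indent=2))
```

Knowing this shape means any HTTP client (curl, requests, urllib) can talk to the server, not just the openai SDK.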
Common variations
- Use a different model by changing the model name in the CLI and Python code.
- Run the server on a different port by modifying the --port argument.
- Send several requests concurrently for batch throughput; vLLM automatically batches in-flight requests on the server side.
- Implement async calls by integrating with async HTTP clients if needed.
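The batch and concurrency variations above can be sketched with the standard library alone (no extra SDK). The endpoint and model name below mirror the earlier steps; ThreadPoolExecutor is just one way to overlap requests client-side so vLLM's server-side batching can kick in:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumes the server from the steps above is running on this port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def build_payload(prompt: str) -> dict:
    # Request body shape for POST /v1/chat/completions
    return {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    prompts = ["Hello!", "What is vLLM?", "Name three planets."]
    # Issuing requests concurrently lets the server batch them together.
    with ThreadPoolExecutor(max_workers=3) as pool:
        for answer in pool.map(ask, prompts):
            print(answer)
```

For fully async code, the openai package's AsyncOpenAI client with the same base_url is the drop-in alternative.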
Troubleshooting
- If the server fails to start, verify the model path and that dependencies are installed.
- If Python requests time out, confirm the server is running and accessible at the specified port.
- Check the client's API key: the vLLM server does not require one by default, but the openai client insists on a non-empty placeholder string.
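For the timeout case, a quick health probe helps separate "server down" from "generation is slow": vLLM's OpenAI-compatible server exposes a plain GET /health endpoint. A small stdlib-only helper (a sketch, not part of vLLM itself):

```python
import urllib.error
import urllib.request

def server_healthy(base: str = "http://localhost:8000") -> bool:
    # vLLM's OpenAI-compatible server returns HTTP 200 from GET /health
    # once the model is loaded and the server is accepting requests.
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If this returns False while the terminal running vllm serve shows no errors, double-check the port and any firewall rules.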
Key Takeaways
- Use the CLI command vllm serve <model> --port 8000 to start the server locally.
- Query the running server with the openai Python SDK by setting base_url to the server endpoint.
- You can switch models or ports easily by changing CLI arguments and client parameters.
- Ensure your environment has Python 3.8+, and install the vllm and openai packages.
- Troubleshoot by verifying server status, model availability, and environment variables.