How to use vLLM on RunPod
Quick answer
Use vLLM on RunPod by first starting the vLLM server with the CLI command
vllm serve on your RunPod instance, then query it from Python using the OpenAI SDK with base_url set to your RunPod server endpoint. This lets you send chat completion or text generation requests to your hosted vLLM model.
Prerequisites
- Python 3.8+
- RunPod account and instance with GPU
- vLLM installed on the RunPod instance
- OpenAI Python SDK (pip install openai>=1.0)
- OpenAI API key or RunPod API key for authentication
Set up the vLLM server on RunPod
First, launch a RunPod instance with GPU support and install vLLM. Then start the vLLM server with the CLI command to serve your desired model. This exposes an HTTP API compatible with the OpenAI SDK.
pip install vllm
# On your RunPod instance terminal:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Output:
Serving model meta-llama/Llama-3.1-8B-Instruct on port 8000
Ready to accept requests...
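Before wiring up the client, it can help to confirm the server is reachable. The helper below is a hypothetical sketch (not part of vLLM or the OpenAI SDK) that probes the OpenAI-compatible /v1/models endpoint and returns False on any connection failure:

```python
import json
import urllib.error
import urllib.request

def server_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if an OpenAI-compatible server answers at base_url/models."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as resp:
            # A healthy vLLM server returns a JSON listing of served models
            json.load(resp)
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False

# Example: probing an address with nothing listening returns False
print(server_is_up("http://127.0.0.1:9/v1"))
```

Replace the address with your own RunPod endpoint; a True result means the /v1 API is ready for client requests.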
Step by step Python client usage
Use the OpenAI Python SDK with base_url pointing to your RunPod vLLM server endpoint. This example sends a chat completion request and prints the response.
import os
from openai import OpenAI
# Set your RunPod vLLM server URL, e.g. http://your-runpod-ip:8000/v1
RUNPOD_VLLM_URL = os.environ.get("RUNPOD_VLLM_URL")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=RUNPOD_VLLM_URL)
messages = [
{"role": "user", "content": "Hello, how are you?"}
]
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=messages
)
print("Response:", response.choices[0].message.content)
Output:
Response: Hello! I'm your vLLM model running on RunPod. How can I assist you today?
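For multi-turn chat, each reply is appended back onto the messages list before the next request, so the model sees the full conversation. A minimal sketch (the append_turn helper is illustrative, not part of the OpenAI SDK):

```python
def append_turn(messages, role, content):
    """Append one chat turn as a role/content dict, in place."""
    messages.append({"role": role, "content": content})
    return messages

# Build up a conversation: the assistant's answer is fed back as context
history = [{"role": "user", "content": "Hello, how are you?"}]
append_turn(history, "assistant", "I'm doing well, thanks!")
append_turn(history, "user", "What model are you?")

# The accumulated history would be passed as messages=history in the next
# client.chat.completions.create(...) call.
print(len(history))
```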
Common variations
- Async usage: use the AsyncOpenAI client with async/await for non-blocking calls.
- Streaming: pass stream=True to chat.completions.create to receive tokens incrementally.
- Different models: change the model parameter to any model you have deployed on the vLLM server.
import asyncio
import os
from openai import AsyncOpenAI

async def async_chat():
    # Use the async client so the request does not block the event loop
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=os.environ["RUNPOD_VLLM_URL"])
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Stream tokens please."}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_chat())
Output:
(tokens appear incrementally in the console)
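When streaming, each chunk carries only a token delta, and the full reply is the concatenation of those deltas. The sketch below shows the accumulation pattern using stand-in chunk objects; real chunks come from the SDK but expose the same choices[0].delta.content shape:

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Concatenate delta contents from a stream of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content is typically None
            parts.append(delta)
    return "".join(parts)

def fake_chunk(text):
    # Mimics the chunk shape the OpenAI SDK yields when stream=True
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

stream = [fake_chunk("Hel"), fake_chunk("lo!"), fake_chunk(None)]
print(collect_stream(stream))  # Hello!
```

The same collect_stream function works unchanged on a real (synchronous) stream from the SDK, since it only iterates and reads the delta field.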
Troubleshooting tips
- If you get connection errors, verify your RunPod instance IP and port are accessible and vllm serve is running.
- Ensure your RUNPOD_VLLM_URL environment variable includes the full path, e.g. http://ip-address:8000/v1.
- Check that your API key is set in the OPENAI_API_KEY environment variable even if vLLM does not require it locally.
- For model loading issues, confirm the model path is correct and the model files are downloaded on the RunPod instance.
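A frequent cause of connection errors is a base URL missing the /v1 suffix. This hypothetical helper (not part of any SDK) normalizes whatever is in RUNPOD_VLLM_URL before handing it to the client:

```python
from urllib.parse import urlparse

def normalize_base_url(url: str) -> str:
    """Ensure the vLLM base URL has an http(s) scheme and ends with /v1."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"URL must start with http:// or https://: {url!r}")
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url

print(normalize_base_url("http://1.2.3.4:8000"))     # http://1.2.3.4:8000/v1
print(normalize_base_url("http://1.2.3.4:8000/v1/")) # http://1.2.3.4:8000/v1
```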
Key Takeaways
- Run the vLLM server on RunPod with the CLI command vllm serve, which exposes an OpenAI-compatible API.
- Use the OpenAI Python SDK with base_url set to your RunPod server URL to query vLLM models.
- Enable streaming or async calls in the SDK for efficient token handling.
- Verify network access and environment variables to avoid connection errors.
- Deploy and manage models on RunPod for flexible vLLM usage.