How to · Intermediate · 4 min read

How to use vLLM on RunPod

Quick answer
Start the vLLM server on your RunPod instance with the CLI command vllm serve, then query it from Python using the OpenAI SDK with base_url pointed at your RunPod endpoint. This lets you send chat completion or text generation requests to your hosted vLLM model.

PREREQUISITES

  • Python 3.8+
  • RunPod account and instance with GPU
  • vLLM installed on RunPod instance
  • OpenAI Python SDK (pip install "openai>=1.0")
  • An API key value for the client (vLLM does not validate it unless the server was started with --api-key, so a placeholder such as EMPTY works)

Set up the vLLM server on RunPod

First, launch a RunPod instance with GPU support and install vLLM. Then start the vLLM server using the CLI command to serve your desired model. This exposes an HTTP API compatible with OpenAI SDK calls.

bash
pip install vllm

# On your RunPod instance terminal:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
output
Serving model meta-llama/Llama-3.1-8B-Instruct on port 8000
Ready to accept requests...
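Once the server reports ready, you can sanity-check it from any machine that can reach the instance by listing the served models via the OpenAI-compatible /v1/models endpoint. A minimal check using only the Python standard library (the default localhost URL is an assumption; substitute your instance's address):

```python
import json
import os
import urllib.error
import urllib.request

def server_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the vLLM server answers GET {base_url}/models, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            data = json.load(resp)
            # The OpenAI-style response lists served models under "data".
            for model in data.get("data", []):
                print("Serving:", model.get("id"))
            return True
    except (urllib.error.URLError, OSError):
        return False

# Example: check the server configured in RUNPOD_VLLM_URL.
base_url = os.environ.get("RUNPOD_VLLM_URL", "http://localhost:8000/v1")
print("Server reachable:", server_is_up(base_url))
```

If this prints False, fix connectivity before debugging your Python client code.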

Step-by-step Python client usage

Use the OpenAI Python SDK with base_url pointing to your RunPod vLLM server endpoint. This example sends a chat completion request and prints the response.

python
import os
from openai import OpenAI

# Set your RunPod vLLM server URL, e.g. http://your-runpod-ip:8000/v1
RUNPOD_VLLM_URL = os.environ.get("RUNPOD_VLLM_URL")

# vLLM does not validate the key unless the server was started with
# --api-key, so a placeholder such as "EMPTY" works.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url=RUNPOD_VLLM_URL,
)

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages
)

print("Response:", response.choices[0].message.content)
output
Response: Hello! I'm your vLLM model running on RunPod. How can I assist you today?
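Note that chat completions are stateless: the server sees only the messages you send, so multi-turn conversations require resending the full history with each request. A small sketch of the pattern (extend_conversation is a hypothetical helper for illustration, not part of the SDK):

```python
def extend_conversation(messages, assistant_reply, next_user_message):
    """Return a new message list with the assistant's reply and the next user turn appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]

# After receiving a response, fold it back into the history before the next call.
history = [{"role": "user", "content": "Hello, how are you?"}]
history = extend_conversation(history, "I'm doing well, thanks!", "Great - what is vLLM?")
print(len(history))  # 3
```

The grown history list is what you would pass as messages in the next chat.completions.create call.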

Common variations

  • Async usage: Use the AsyncOpenAI client with async/await for non-blocking calls.
  • Streaming: Pass stream=True to chat.completions.create to receive tokens incrementally.
  • Different models: Change the model parameter to whichever model you launched with vllm serve (vLLM loads Hugging Face models natively; GGUF support is experimental).
python
import asyncio
import os
from openai import AsyncOpenAI  # the sync OpenAI client cannot be awaited

async def async_chat():
    client = AsyncOpenAI(
        api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
        base_url=os.environ["RUNPOD_VLLM_URL"],
    )
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Stream tokens please."}],
        stream=True
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_chat())
output
(streamed response tokens print incrementally to the console)

Troubleshooting tips

  • If you get connection errors, verify your RunPod instance IP and port are accessible and vllm serve is running.
  • Ensure your RUNPOD_VLLM_URL environment variable includes the /v1 path, e.g. http://ip-address:8000/v1.
  • Set some value for OPENAI_API_KEY (or pass api_key="EMPTY"): the OpenAI SDK requires a key even when the vLLM server does not check it.
  • For model loading issues, confirm the model path is correct and the model files are downloaded on the RunPod instance.
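The missing /v1 suffix is the most common stumbling block. A tiny helper (hypothetical, for illustration) can guard against it before you construct the client:

```python
def normalize_base_url(url: str) -> str:
    """Ensure the base URL ends with /v1, as the OpenAI SDK expects for vLLM."""
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url

# 203.0.113.5 is a documentation-only placeholder address.
print(normalize_base_url("http://203.0.113.5:8000"))      # http://203.0.113.5:8000/v1
print(normalize_base_url("http://203.0.113.5:8000/v1/"))  # http://203.0.113.5:8000/v1
```

Passing the normalized URL as base_url makes the client robust to either form of the environment variable.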

Key Takeaways

  • Run vLLM server on RunPod with the CLI command vllm serve exposing an OpenAI-compatible API.
  • Use the OpenAI Python SDK with base_url set to your RunPod server URL to query vLLM models.
  • Enable streaming or async calls in the SDK for efficient token handling.
  • Verify network access and environment variables to avoid connection errors.
  • Deploy and manage models on RunPod for flexible vLLM usage.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct