How to run vLLM on Modal
Quick answer
Run vLLM on Modal by starting the vLLM server with the vllm serve CLI command inside a Modal GPU function, then querying it via the OpenAI SDK with base_url="http://localhost:8000/v1". Use the modal CLI to deploy the app and invoke the function remotely with standard OpenAI API calls.
Prerequisites
- Python 3.8+
- Modal account and CLI installed
- pip install "openai>=1.0"
- vLLM installed (pip install vllm)
Setup
Install the vllm, openai, and modal Python packages, and make sure your Modal account is set up and the CLI is authenticated (modal setup). The vLLM server runs inside a Modal function (or locally for testing) and exposes an OpenAI-compatible API endpoint.
pip install vllm openai modal
output
Collecting vllm... Collecting openai... Collecting modal... Successfully installed vllm openai modal
Step by step
Start the vLLM server inside a Modal GPU function and query it using the OpenAI client with base_url set to the local server URL. Note that localhost is not shared across Modal containers, so the client call must run in the same container as the server. This example runs a Modal app whose GPU function starts vLLM, waits for the model to load, and then sends a chat completion request.
import subprocess
import time

import modal
from openai import OpenAI

app = modal.App("vllm-modal-app")
image = modal.Image.debian_slim().pip_install("vllm", "openai")

@app.function(gpu="A10G", image=image, timeout=1800)
def query_vllm(prompt: str) -> str:
    # Start the vLLM server on port 8000 in the background.
    subprocess.Popen(
        ["vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
    )
    # vLLM accepts any API key by default; poll until the model has loaded,
    # which can take several minutes.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for _ in range(120):
        try:
            client.models.list()
            break
        except Exception:
            time.sleep(5)
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with app.run():
        print("Response:", query_vllm.remote("Hello from vLLM on Modal!"))
output
Response: Hello! How can I assist you today?
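Because model loading can take several minutes, a fixed sleep before the first request is fragile. A more robust approach polls the server's /health endpoint until it responds. This is an illustrative sketch (the helper names and retry counts are assumptions, not part of Modal or vLLM):

```python
import time
import urllib.request
from urllib.error import URLError

def server_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers an HTTP request at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def wait_for_server(url: str, attempts: int = 120, delay: float = 5.0) -> None:
    """Poll until the server is up, raising if it never becomes ready."""
    for _ in range(attempts):
        if server_ready(url):
            return
        time.sleep(delay)
    raise TimeoutError(f"server at {url} never became ready")

# Inside the Modal function, after starting the subprocess:
# wait_for_server("http://localhost:8000/health")
```

Polling a cheap health endpoint rather than sleeping means the function proceeds as soon as the model is loaded and fails loudly if it never does.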
Common variations
- Use async Modal functions for concurrency.
- Stream responses by passing stream=True to client.chat.completions.create and iterating over the chunks.
- Change the model by passing a different Hugging Face model ID to the vllm serve command (GGUF models are supported experimentally).
import asyncio
from openai import AsyncOpenAI

async def query_vllm_stream(prompt: str):
    # Assumes the vLLM server is already running on localhost:8000.
    # AsyncOpenAI is required here: the sync client's stream cannot be
    # consumed with `async for`.
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# Usage in an async context:
# asyncio.run(query_vllm_stream("Stream this response"))
output
Streamed tokens printed progressively in the console
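If you also need the complete text (for logging or downstream use), accumulate the streamed deltas as you consume them. A minimal sketch, using stand-in chunk objects that mimic the shape of the SDK's streaming chunks (the fake chunks are for illustration only):

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Accumulate streamed delta contents into the full completion text."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content is often None
            parts.append(delta)
    return "".join(parts)

# Fake chunks mimicking the SDK's streaming objects, for demonstration.
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Hel", "lo", None]
]
print(collect_stream(fake))  # -> Hello
```

The same loop body works inside the `async for` above: print each delta for live output and append it to a list for the final string.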
Troubleshooting
- If the vLLM server fails to start, check GPU availability, the model name, and Hugging Face access (gated models such as Llama require a token).
- Ensure the base_url matches the server address and port.
- Modal GPU quota errors require upgrading your Modal plan or using CPU mode (much slower).
- Timeouts may require increasing the Modal function's timeout parameter or adding retries.
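For transient failures such as timeouts while the model is still loading, a small bounded-retry wrapper is often enough. This sketch is illustrative; the helper and its defaults are assumptions, not part of Modal or vLLM:

```python
import time

def with_retries(fn, attempts: int = 3, delay: float = 2.0):
    """Call fn(), retrying up to `attempts` times with a fixed delay."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# Example: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("not ready yet")
    return "ok"

print(with_retries(flaky, attempts=5, delay=0.0))  # -> ok
```

Wrapping the chat completion call (e.g. with_retries(lambda: client.chat.completions.create(...))) smooths over the window where the server is up but the model is not yet loaded.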
Key Takeaways
- Use the vLLM CLI to serve models with an OpenAI-compatible API on port 8000.
- Deploy vLLM inside a Modal function with GPU support for scalable inference.
- Query the vLLM server via the OpenAI SDK by setting base_url to the local server.
- Enable streaming by passing stream=True and iterating over response chunks.
- Troubleshoot by verifying GPU access, model paths, and Modal resource limits.