How to run vLLM on Modal
Quick answer
Run vLLM on Modal by starting the vLLM server with the vllm serve CLI command inside a Modal GPU function, then querying it via the OpenAI SDK with base_url="http://localhost:8000/v1". Use the modal CLI to deploy the app and invoke the function remotely with standard OpenAI API calls.
Prerequisites
- Python 3.8+
- Modal account and CLI installed
- pip install "openai>=1.0"
- vLLM installed (pip install vllm)
Setup
Install the vllm, openai, and modal Python packages, and make sure your Modal account is set up and the CLI is authenticated (modal setup). The vLLM server runs inside a Modal function (or locally for testing) and exposes an OpenAI-compatible API endpoint.
pip install vllm openai modal
output
Collecting vllm... Collecting openai... Collecting modal... Successfully installed vllm openai modal
Step by step
Start the vLLM server inside a Modal GPU function and query it using the OpenAI client with base_url set to the local server URL. Note that localhost is not shared across Modal containers, so the client call must run in the same container as the server. This example runs a Modal app whose GPU function starts vLLM, waits for the model to load, and then sends a chat completion request.
import subprocess
import time

import modal
from openai import OpenAI

app = modal.App("vllm-modal-app")
image = modal.Image.debian_slim().pip_install("vllm", "openai")

@app.function(gpu="A10G", image=image, timeout=1800)
def query_vllm(prompt: str) -> str:
    # Start the vLLM server on port 8000 in the background.
    subprocess.Popen(
        ["vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
    )
    # vLLM accepts any API key by default; poll until the model has loaded,
    # which can take several minutes.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for _ in range(120):
        try:
            client.models.list()
            break
        except Exception:
            time.sleep(5)
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with app.run():
        print("Response:", query_vllm.remote("Hello from vLLM on Modal!"))
output
Response: Hello! How can I assist you today?
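Because model loading can take several minutes, a fixed sleep before the first request is fragile. A more robust approach polls the server's /health endpoint until it responds. This is an illustrative sketch (the helper names and retry counts are assumptions, not part of Modal or vLLM):

```python
import time
import urllib.request
from urllib.error import URLError

def server_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers an HTTP request at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def wait_for_server(url: str, attempts: int = 120, delay: float = 5.0) -> None:
    """Poll until the server is up, raising if it never becomes ready."""
    for _ in range(attempts):
        if server_ready(url):
            return
        time.sleep(delay)
    raise TimeoutError(f"server at {url} never became ready")

# Inside the Modal function, after starting the subprocess:
# wait_for_server("http://localhost:8000/health")
```

Polling a cheap health endpoint rather than sleeping means the function proceeds as soon as the model is loaded and fails loudly if it never does.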
Common variations
- Use async Modal functions for concurrency.
- Stream responses by passing stream=True to client.chat.completions.create and iterating over the chunks.
- Change the model by passing a different Hugging Face model ID to the vllm serve command (GGUF models are supported experimentally).
import asyncio
from openai import AsyncOpenAI

async def query_vllm_stream(prompt: str):
    # Assumes the vLLM server is already running on localhost:8000.
    # AsyncOpenAI is required here: the sync client's stream cannot be
    # consumed with `async for`.
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

# Usage in an async context:
# asyncio.run(query_vllm_stream("Stream this response"))
output
Streamed tokens printed progressively in the console
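If you also need the complete text (for logging or downstream use), accumulate the streamed deltas as you consume them. A minimal sketch, using stand-in chunk objects that mimic the shape of the SDK's streaming chunks (the fake chunks are for illustration only):

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Accumulate streamed delta contents into the full completion text."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content is often None
            parts.append(delta)
    return "".join(parts)

# Fake chunks mimicking the SDK's streaming objects, for demonstration.
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Hel", "lo", None]
]
print(collect_stream(fake))  # -> Hello
```

The same loop body works inside the `async for` above: print each delta for live output and append it to a list for the final string.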
Troubleshooting
- If the vLLM server fails to start, check GPU availability, the model name, and Hugging Face access (gated models such as Llama require a token).
- Ensure the base_url matches the server address and port.
- Modal GPU quota errors require upgrading your Modal plan or using CPU mode (much slower).
- Timeouts may require increasing the Modal function's timeout parameter or adding retries.
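For transient failures such as timeouts while the model is still loading, a small bounded-retry wrapper is often enough. This sketch is illustrative; the helper and its defaults are assumptions, not part of Modal or vLLM:

```python
import time

def with_retries(fn, attempts: int = 3, delay: float = 2.0):
    """Call fn(), retrying up to `attempts` times with a fixed delay."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# Example: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("not ready yet")
    return "ok"

print(with_retries(flaky, attempts=5, delay=0.0))  # -> ok
```

Wrapping the chat completion call (e.g. with_retries(lambda: client.chat.completions.create(...))) smooths over the window where the server is up but the model is not yet loaded.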
Key Takeaways
- Use the vLLM CLI to serve models with an OpenAI-compatible API on port 8000.
- Deploy vLLM inside a Modal function with GPU support for scalable inference.
- Query the vLLM server via the OpenAI SDK by setting base_url to the local server.
- Enable streaming by passing stream=True and iterating over response chunks.
- Troubleshoot by verifying GPU access, model paths, and Modal resource limits.