How to expose vLLM as a FastAPI endpoint
Quick answer
Use the vllm Python package to serve a Llama-3.1-8B-Instruct model locally via the CLI, then create a FastAPI app that forwards requests to the running vLLM server through its OpenAI-compatible API endpoint. This approach exposes low-latency, local LLM inference as a REST API.
Prerequisites
- Python 3.8+
- pip install vllm fastapi uvicorn openai
- vLLM model files downloaded or accessible
- Basic knowledge of FastAPI and Python async
Setup
Install the required Python packages and make sure the vLLM model is available for serving. You need vllm for the model server, fastapi for the API, uvicorn as the ASGI server, and openai as the client library for talking to the vLLM server.
pip install vllm fastapi uvicorn openai
Step by step
Start the vLLM server locally using the CLI, then create a FastAPI app that forwards requests to this server using the OpenAI-compatible API interface.
import os
import asyncio

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()

# Configure the OpenAI client to point at the local vLLM server.
# vLLM does not validate the API key, but the client needs a non-empty value.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://localhost:8000/v1",
)

class ChatRequest(BaseModel):
    model: str
    messages: list

@app.post("/chat/completions")
async def chat_completions(request: ChatRequest):
    try:
        # The OpenAI client is synchronous, so run the call in a worker
        # thread to avoid blocking the event loop.
        response = await asyncio.to_thread(
            lambda: client.chat.completions.create(
                model=request.model,
                messages=request.messages,
            )
        )
        # model_dump() converts the response object into a JSON-serializable dict.
        return response.model_dump()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=9000)

# To start the vLLM server, run this CLI command in your terminal:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Then run this FastAPI app script.
# Example request JSON to POST /chat/completions:
# {
#   "model": "meta-llama/Llama-3.1-8B-Instruct",
#   "messages": [{"role": "user", "content": "Hello from FastAPI!"}]
# }

Common variations
- Use an async HTTP client like httpx instead of asyncio.to_thread for better concurrency.
- Change the model by specifying a different meta-llama model, or any other vLLM-compatible model, in both the CLI command and the API calls.
- Enable streaming responses by adapting the FastAPI endpoint to stream tokens from the vLLM server.
Troubleshooting
- If you get connection errors, ensure the vLLM server is running on localhost:8000 and is accessible.
- Check that the model path or name in the CLI command matches the one used in the API requests.
- For permission or import errors, verify that the packages were installed into the active Python environment and that their versions are compatible.
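For the first two points, a small stdlib-only health check against the vLLM server's /v1/models endpoint (part of the OpenAI-compatible API) can confirm in one step that the server is reachable and which model IDs it is serving:

```python
import json
import urllib.error
import urllib.request

def vllm_is_up(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the vLLM server answers on /v1/models,
    False on any connection, HTTP, or parsing error."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            models = json.loads(resp.read())
            # The listed IDs must match the "model" field used in API requests.
            print([m["id"] for m in models.get("data", [])])
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Run it before starting the FastAPI proxy: a False result points at the server, not at your endpoint code.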
Key Takeaways
- Use the vLLM CLI to serve models locally with OpenAI-compatible API endpoints.
- FastAPI can proxy requests to the vLLM server for easy REST API integration.
- Adjust model names and ports to fit your deployment environment.
- Async HTTP clients improve concurrency in production-grade APIs.
- Verify server availability and model correctness to avoid runtime errors.
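Putting it all together, an end-to-end smoke test of the proxy could look like the following stdlib-only client. It assumes the vLLM server on port 8000 and the FastAPI app on port 9000, as configured in the steps above:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body expected by the /chat/completions proxy."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")

def chat(base_url: str, model: str, prompt: str) -> dict:
    """POST a chat request to the FastAPI proxy and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Requires both servers from this guide to be running.
    result = chat(
        "http://localhost:9000",
        "meta-llama/Llama-3.1-8B-Instruct",
        "Hello from FastAPI!",
    )
    print(result["choices"][0]["message"]["content"])
```

The response dict follows the OpenAI chat-completions shape, since the proxy returns the vLLM server's response unchanged.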