How to expose vLLM as a FastAPI endpoint
Quick answer
Use the vllm Python package to serve a Llama-3.1-8B-Instruct model locally via the CLI, then create a FastAPI app that forwards requests to the running vLLM server through its OpenAI-compatible API endpoint. This approach exposes low-latency, local LLM inference as a REST API.
Prerequisites
- Python 3.8+
- pip install vllm fastapi uvicorn openai
- vLLM model files downloaded or accessible
- Basic knowledge of FastAPI and Python async
Setup
Install the required Python packages and make sure the vLLM model is available for serving. You need vllm for the model server, fastapi for the API, uvicorn as the ASGI server, and openai as the client library for talking to the vLLM server.
pip install vllm fastapi uvicorn openai
Step by step
Start the vLLM server locally using the CLI, then create a FastAPI app that forwards requests to this server using the OpenAI-compatible API interface.
import os
import asyncio

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()

# Configure the OpenAI client to point at the local vLLM server.
# vLLM does not validate the API key, but the client needs a non-empty value.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://localhost:8000/v1",
)

class ChatRequest(BaseModel):
    model: str
    messages: list

@app.post("/chat/completions")
async def chat_completions(request: ChatRequest):
    try:
        # The OpenAI client is synchronous, so run the call in a worker
        # thread to avoid blocking the event loop.
        response = await asyncio.to_thread(
            lambda: client.chat.completions.create(
                model=request.model,
                messages=request.messages,
            )
        )
        # model_dump() converts the response object into a JSON-serializable dict.
        return response.model_dump()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=9000)

# To start the vLLM server, run this CLI command in your terminal:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Then run this FastAPI app script.
# Example request JSON to POST /chat/completions:
# {
#   "model": "meta-llama/Llama-3.1-8B-Instruct",
#   "messages": [{"role": "user", "content": "Hello from FastAPI!"}]
# }

Common variations
- Use an async HTTP client like httpx instead of asyncio.to_thread for better concurrency.
- Change the model by specifying a different meta-llama model, or any other vLLM-compatible model, in both the CLI command and the API calls.
- Enable streaming responses by adapting the FastAPI endpoint to stream tokens from the vLLM server.
Troubleshooting
- If you get connection errors, ensure the vLLM server is running on localhost:8000 and is accessible.
- Check that the model path or name in the CLI command matches the one used in the API requests.
- For permission or import errors, verify that the packages were installed into the active Python environment and that their versions are compatible.
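For the first two points, a small stdlib-only health check against the vLLM server's /v1/models endpoint (part of the OpenAI-compatible API) can confirm in one step that the server is reachable and which model IDs it is serving:

```python
import json
import urllib.error
import urllib.request

def vllm_is_up(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the vLLM server answers on /v1/models,
    False on any connection, HTTP, or parsing error."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            models = json.loads(resp.read())
            # The listed IDs must match the "model" field used in API requests.
            print([m["id"] for m in models.get("data", [])])
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Run it before starting the FastAPI proxy: a False result points at the server, not at your endpoint code.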
Key Takeaways
- Use the vLLM CLI to serve models locally with OpenAI-compatible API endpoints.
- FastAPI can proxy requests to the vLLM server for easy REST API integration.
- Adjust model names and ports to fit your deployment environment.
- Async HTTP clients improve concurrency in production-grade APIs.
- Verify server availability and model correctness to avoid runtime errors.
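Putting it all together, an end-to-end smoke test of the proxy could look like the following stdlib-only client. It assumes the vLLM server on port 8000 and the FastAPI app on port 9000, as configured in the steps above:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body expected by the /chat/completions proxy."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")

def chat(base_url: str, model: str, prompt: str) -> dict:
    """POST a chat request to the FastAPI proxy and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Requires both servers from this guide to be running.
    result = chat(
        "http://localhost:9000",
        "meta-llama/Llama-3.1-8B-Instruct",
        "Hello from FastAPI!",
    )
    print(result["choices"][0]["message"]["content"])
```

The response dict follows the OpenAI chat-completions shape, since the proxy returns the vLLM server's response unchanged.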