How-to · Beginner · 3 min read

FastAPI StreamingResponse for LLM

Quick answer
Use FastAPI's StreamingResponse to stream tokens from an LLM: call client.chat.completions.create with stream=True on an AsyncOpenAI client, iterate over the resulting stream in an async generator, and yield each token delta as a Server-Sent Event for real-time streaming in Python.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install fastapi uvicorn "openai>=1.0" (quote the version specifier so the shell doesn't treat >= as a redirect)

Setup

Install the required packages and set your OpenAI API key as an environment variable.

  • Install FastAPI, Uvicorn, and the OpenAI SDK:
bash
pip install fastapi uvicorn "openai>=1.0"
output
Collecting fastapi
Collecting uvicorn
Collecting openai
Successfully installed fastapi uvicorn openai

Step by step

This example shows a complete FastAPI app that streams LLM chat completions using the OpenAI SDK's stream=True parameter and returns a StreamingResponse with Server-Sent Events (SSE).

python
import os
import json
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
# AsyncOpenAI (not the sync OpenAI client) is required for `async for` below
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def event_stream(messages):
    # Create a streaming chat completion; the async client must be awaited
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        # Some chunks carry no choices (e.g. usage-only chunks); skip them
        if not chunk.choices:
            continue
        # Extract the token delta for this chunk
        delta = chunk.choices[0].delta.content
        if delta:
            # Format as an SSE data frame
            yield f"data: {json.dumps(delta)}\n\n"

@app.post("/chat/stream")
async def chat_stream(request: Request):
    data = await request.json()
    user_message = data.get("message", "")
    messages = [{"role": "user", "content": user_message}]
    return StreamingResponse(event_stream(messages), media_type="text/event-stream")

# To run:
# uvicorn filename:app --reload
output
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
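On the client side, each SSE frame arrives as a data: line followed by a blank line. A minimal sketch of the parsing step, using illustrative sample frames rather than real API output (the parse_sse_frames helper is not part of any SDK):

```python
import json

def parse_sse_frames(raw: str):
    """Extract JSON payloads from the `data: ...` lines of an SSE stream."""
    tokens = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            # Each frame carries one JSON-encoded token delta
            tokens.append(json.loads(line[len("data: "):]))
    return tokens

# Example: two frames in the shape the server above emits
raw = 'data: "Hello"\n\ndata: " world"\n\n'
print(parse_sse_frames(raw))  # ['Hello', ' world']
```

A real client would read these frames incrementally from the HTTP response body instead of a string, but the framing logic is the same.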

Common variations

  • Async vs sync: Use AsyncOpenAI with async iteration for streaming in FastAPI; iterating a sync stream inside the handler blocks the event loop.
  • Different models: Change model="gpt-4o-mini" to any supported OpenAI chat model.
  • Non-SSE streaming: You can adapt the generator to other streaming protocols if needed.

Troubleshooting

  • If streaming hangs, verify your API key and network connectivity.
  • Ensure stream=True is set; otherwise, the response is not streamed.
  • Check that the code uses the OpenAI SDK v1+ pattern with AsyncOpenAI(api_key=...); the sync OpenAI client cannot be iterated with async for.

Key takeaways

  • Use stream=True with client.chat.completions.create to get token streams.
  • Wrap the async token stream in a FastAPI StreamingResponse with text/event-stream media type for SSE.
  • Always use async iteration to avoid blocking FastAPI's event loop during streaming.
Verified 2026-04 · gpt-4o-mini