How-to · Beginner · 3 min read

How to stream LLM responses with FastAPI

Quick answer
Use the OpenAI SDK's chat.completions.create method with stream=True to receive streamed tokens. Integrate this with FastAPI by yielding chunks in an async generator and returning a StreamingResponse with text/event-stream media type for real-time updates.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai fastapi uvicorn

Setup

Install the required packages and set your OpenAI API key as an environment variable.

  • Install packages: pip install openai fastapi uvicorn
  • Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or set OPENAI_API_KEY=your_api_key (Windows)
bash
pip install openai fastapi uvicorn
output
Collecting openai
Collecting fastapi
Collecting uvicorn
Successfully installed openai fastapi uvicorn

Step by step

This example shows a complete FastAPI app that streams tokens from the gpt-4o model using the OpenAI SDK's async client (AsyncOpenAI). An async generator yields each streamed chunk, and StreamingResponse delivers them to the client as Server-Sent Events (SSE).

python
import os
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def stream_llm_response():
    # With AsyncOpenAI, stream=True returns an async iterator of chunks
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Tell me a joke."}],
        stream=True,
    )
    async for chunk in response:
        token = chunk.choices[0].delta.content or ""
        if token:
            yield f"data: {token}\n\n"  # SSE framing: "data: " prefix + blank line

@app.get("/stream")
async def stream():
    return StreamingResponse(stream_llm_response(), media_type="text/event-stream")

# To run:
# uvicorn filename:app --reload --port 8000
output
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

# When you open http://127.0.0.1:8000/stream in a browser or SSE client,
# you receive streamed tokens as Server-Sent Events in real-time.
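Each event the endpoint emits is just a data: line followed by a blank line. A minimal client-side helper to strip that framing (a sketch; dedicated SSE clients such as the browser's EventSource handle this for you, and the function name here is illustrative):

```python
def parse_sse_tokens(raw: str) -> list[str]:
    """Extract token payloads from raw SSE text ("data: ..." framing)."""
    tokens = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            tokens.append(line[len("data: "):])
    return tokens

# Reassemble the streamed tokens into the full message
chunks = "data: Why\n\ndata:  did\n\ndata:  the\n\n"
print("".join(parse_sse_tokens(chunks)))  # Why did the
```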

Common variations

  • Sync client usage: the synchronous OpenAI client returns a regular iterator when stream=True; iterate it with a plain for loop in a non-async generator (StreamingResponse accepts sync generators too).
  • Different models: Replace model="gpt-4o" with any supported streaming model like gpt-4o-mini or claude-3-5-sonnet-20241022 (Anthropic requires a different SDK).
  • Non-SSE streaming: You can adapt the generator to yield plain text or JSON chunks for WebSocket or other protocols.
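The non-SSE variation only changes how each token is framed before it is yielded. A sketch of interchangeable framing helpers (the function names are illustrative, not part of any SDK):

```python
import json

def frame_sse(token: str) -> str:
    """Frame a token for Server-Sent Events: "data: " prefix plus blank line."""
    return f"data: {token}\n\n"

def frame_ndjson(token: str) -> str:
    """Frame a token as newline-delimited JSON, e.g. for WebSocket or fetch clients."""
    return json.dumps({"token": token}) + "\n"

print(repr(frame_sse("Hi")))       # 'data: Hi\n\n'
print(frame_ndjson("Hi").strip())  # {"token": "Hi"}
```

Swapping the framing function is the only change needed in the generator; the model call and iteration stay identical.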

Troubleshooting

  • If you see no streamed output, ensure your client supports SSE and you are using stream=True.
  • Check your OPENAI_API_KEY environment variable is set correctly.
  • For Windows users, use set OPENAI_API_KEY=your_api_key in CMD or $env:OPENAI_API_KEY='your_api_key' in PowerShell.
  • If the server hangs, confirm you are awaiting the create call and iterating the stream with async for; mixing the synchronous client with async iteration will fail or block the event loop.
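A quick way to confirm the key is actually visible to the server process before debugging further (a small sketch; the helper name is illustrative):

```python
import os

def check_api_key(env) -> str:
    """Return a human-readable status for the OPENAI_API_KEY variable."""
    if env.get("OPENAI_API_KEY"):
        return "OPENAI_API_KEY is set"
    return "OPENAI_API_KEY is missing; export it before starting the server"

print(check_api_key(os.environ))
```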

Key Takeaways

  • Use stream=True in client.chat.completions.create to enable streaming.
  • Integrate streaming with FastAPI using async generators and StreamingResponse with text/event-stream.
  • Prefix each SSE event with 'data: ' and terminate it with a blank line (\n\n) for SSE compliance.
  • Test streaming endpoints with SSE-compatible clients or browsers.
  • Set your API key securely via environment variables to avoid leaks.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022