How-to · Beginner · 3 min read

How to stream LLM responses with FastAPI

Quick answer
Use the OpenAI SDK's chat.completions.create method with stream=True to receive streamed tokens. Integrate this with FastAPI by yielding chunks in an async generator and returning a StreamingResponse with text/event-stream media type for real-time updates.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai fastapi uvicorn

Setup

Install the required packages and set your OpenAI API key as an environment variable.

  • Install packages: pip install openai fastapi uvicorn
  • Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or set OPENAI_API_KEY=your_api_key (Windows)
bash
pip install openai fastapi uvicorn
output
Collecting openai
Collecting fastapi
Collecting uvicorn
Successfully installed openai fastapi uvicorn

Step by step

This example shows a complete FastAPI app that streams tokens from the gpt-4o model using the OpenAI SDK's async client (AsyncOpenAI). An async generator yields each streamed chunk, and StreamingResponse delivers them to the client as Server-Sent Events (SSE).

python
import os
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def stream_llm_response():
    # With AsyncOpenAI, stream=True returns an async iterator of chunks
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Tell me a joke."}],
        stream=True,
    )
    async for chunk in response:
        token = chunk.choices[0].delta.content or ""
        if token:
            yield f"data: {token}\n\n"  # SSE framing: "data: " prefix + blank line

@app.get("/stream")
async def stream():
    return StreamingResponse(stream_llm_response(), media_type="text/event-stream")

# To run:
# uvicorn filename:app --reload --port 8000
output
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

# When you open http://127.0.0.1:8000/stream in a browser or SSE client,
# you receive streamed tokens as Server-Sent Events in real-time.
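Each event the endpoint emits is just a data: line followed by a blank line. A minimal client-side helper to strip that framing (a sketch; dedicated SSE clients such as the browser's EventSource handle this for you, and the function name here is illustrative):

```python
def parse_sse_tokens(raw: str) -> list[str]:
    """Extract token payloads from raw SSE text ("data: ..." framing)."""
    tokens = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            tokens.append(line[len("data: "):])
    return tokens

# Reassemble the streamed tokens into the full message
chunks = "data: Why\n\ndata:  did\n\ndata:  the\n\n"
print("".join(parse_sse_tokens(chunks)))  # Why did the
```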

Common variations

  • Sync client usage: the synchronous OpenAI client returns a regular iterator when stream=True; iterate it with a plain for loop in a non-async generator (StreamingResponse accepts sync generators too).
  • Different models: Replace model="gpt-4o" with any supported streaming model like gpt-4o-mini or claude-3-5-sonnet-20241022 (Anthropic requires a different SDK).
  • Non-SSE streaming: You can adapt the generator to yield plain text or JSON chunks for WebSocket or other protocols.
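The non-SSE variation only changes how each token is framed before it is yielded. A sketch of interchangeable framing helpers (the function names are illustrative, not part of any SDK):

```python
import json

def frame_sse(token: str) -> str:
    """Frame a token for Server-Sent Events: "data: " prefix plus blank line."""
    return f"data: {token}\n\n"

def frame_ndjson(token: str) -> str:
    """Frame a token as newline-delimited JSON, e.g. for WebSocket or fetch clients."""
    return json.dumps({"token": token}) + "\n"

print(repr(frame_sse("Hi")))       # 'data: Hi\n\n'
print(frame_ndjson("Hi").strip())  # {"token": "Hi"}
```

Swapping the framing function is the only change needed in the generator; the model call and iteration stay identical.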

Troubleshooting

  • If you see no streamed output, ensure your client supports SSE and you are using stream=True.
  • Check your OPENAI_API_KEY environment variable is set correctly.
  • For Windows users, use set OPENAI_API_KEY=your_api_key in CMD or $env:OPENAI_API_KEY='your_api_key' in PowerShell.
  • If the server hangs, confirm you are awaiting the create call and iterating the stream with async for; mixing the synchronous client with async iteration will fail or block the event loop.
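A quick way to confirm the key is actually visible to the server process before debugging further (a small sketch; the helper name is illustrative):

```python
import os

def check_api_key(env) -> str:
    """Return a human-readable status for the OPENAI_API_KEY variable."""
    if env.get("OPENAI_API_KEY"):
        return "OPENAI_API_KEY is set"
    return "OPENAI_API_KEY is missing; export it before starting the server"

print(check_api_key(os.environ))
```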

Key Takeaways

  • Use stream=True in client.chat.completions.create to enable streaming.
  • Integrate streaming with FastAPI using async generators and StreamingResponse with text/event-stream.
  • Prefix each SSE event with 'data: ' and terminate it with a blank line (\n\n) for SSE compliance.
  • Test streaming endpoints with SSE-compatible clients or browsers.
  • Set your API key securely via environment variables to avoid leaks.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022