
How to stream LLM responses with FastAPI

Quick answer
Use FastAPI's StreamingResponse together with the OpenAI SDK's streaming mode. Implement an async generator that yields chunks from client.chat.completions.create called with stream=True, and return it from your endpoint as a StreamingResponse.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key
  • pip install fastapi uvicorn "openai>=1.0"

Setup

Install required packages and set your OpenAI API key as an environment variable.

  • Install FastAPI, Uvicorn, and the OpenAI SDK (quote the version specifier so the shell does not treat >= as a redirect):

```bash
pip install fastapi uvicorn "openai>=1.0"
```
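Then export your API key as an environment variable, as the code below expects (shown for a Unix shell; the key value is a placeholder):

```shell
# Placeholder value; substitute the real key from your OpenAI dashboard.
export OPENAI_API_KEY="sk-your-key-here"
```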

Step by step

This example demonstrates streaming LLM responses using FastAPI and the OpenAI SDK. It defines an async generator that yields partial response chunks as they arrive from the API, then streams them to the client.

```python
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
# AsyncOpenAI keeps the event loop free while chunks arrive;
# the sync client would block other requests during iteration.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def stream_chat_completion(prompt: str):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

@app.get("/stream")
async def stream_endpoint(prompt: str = "Hello, world!"):
    return StreamingResponse(stream_chat_completion(prompt), media_type="text/plain")

# To run:
# uvicorn filename:app --reload
```

Common variations

The same pattern adapts to other providers and output formats. Any OpenAI-compatible endpoint works with the same SDK calls, and you can trade off cost against latency by switching models. You can also stream JSON chunks or Server-Sent Events (SSE) instead of plain text for richer client-side handling.

```python
import os

from openai import AsyncOpenAI

# Gemini models are reachable through Google's OpenAI-compatible endpoint;
# note the Gemini API key and base_url instead of the OpenAI defaults.
client = AsyncOpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

async def stream_gemini(prompt: str):
    response = await client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

Troubleshooting

  • If streaming hangs or returns no data, verify your API key and network connectivity.
  • Some HTTP clients and reverse proxies buffer the whole response before showing it; test with curl -N to rule out client-side buffering.
  • Check for rate limits or quota exhaustion in your OpenAI dashboard.
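One failure mode worth handling explicitly: if the upstream API call fails mid-stream, the generator dies and the client sees a silently truncated body. A sketch of one way to surface the error in the stream itself (safe_stream and flaky_source are illustrative names, not part of any SDK):

```python
import asyncio

async def safe_stream(source):
    """Forward chunks, turning a mid-stream failure into a visible message."""
    try:
        async for chunk in source:
            yield chunk
    except Exception as exc:
        yield f"\n[stream error: {exc}]"

# Simulated upstream that fails after one chunk.
async def flaky_source():
    yield "partial output"
    raise RuntimeError("rate limit exceeded")

async def main():
    chunks = []
    async for chunk in safe_stream(flaky_source()):
        chunks.append(chunk)
    return chunks

chunks = asyncio.run(main())
print(chunks)  # -> ['partial output', '\n[stream error: rate limit exceeded]']
```

Wrapping the real generator this way keeps the HTTP status at 200 (headers are already sent once streaming starts), so the error has to travel in the body.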

Key Takeaways

  • Use FastAPI's StreamingResponse with an async generator to stream LLM output efficiently.
  • Set stream=True in client.chat.completions.create to receive partial tokens.
  • Always read your API key from environment variables for security and flexibility.
Verified 2026-04 · gpt-4o-mini, gemini-2.0-flash