How to stream LLM responses with FastAPI
Quick answer
Use FastAPI's StreamingResponse together with the OpenAI SDK's stream=True option to stream LLM output in real time: implement an async generator that yields content chunks from chat.completions.create and return it wrapped in a StreamingResponse.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install fastapi uvicorn "openai>=1.0"
Setup
Install required packages and set your OpenAI API key as an environment variable.
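Setting the key in a POSIX shell might look like this (placeholder value; substitute your real key):

```shell
# Placeholder value, not a real key. On Windows, use `setx` instead of `export`.
export OPENAI_API_KEY="your-key-here"
```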
- Install FastAPI, Uvicorn, and OpenAI SDK:
pip install fastapi uvicorn "openai>=1.0"

Step by step
This example demonstrates streaming LLM responses using FastAPI and the OpenAI SDK. It defines an async generator that yields partial response chunks as they arrive from the API, then streams them to the client.
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
# Use the async client so streaming does not block FastAPI's event loop.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def stream_chat_completion(prompt: str):
    # stream=True makes the API return an async iterator of partial chunks.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        # Each chunk carries a small delta of the generated text; skip empty deltas.
        content = chunk.choices[0].delta.content
        if content:
            yield content

@app.get("/stream")
async def stream_endpoint(prompt: str = "Hello, world!"):
    return StreamingResponse(stream_chat_completion(prompt), media_type="text/plain")
# To run:
# uvicorn filename:app --reload

Common variations
The same pattern adapts to other providers and transports. The OpenAI SDK can target any OpenAI-compatible endpoint by changing the client's base_url and model name, and you can emit JSON chunks or Server-Sent Events (SSE) instead of plain text for richer client-side handling.
import os

from openai import AsyncOpenAI

# Gemini exposes an OpenAI-compatible endpoint, so the same SDK works by
# pointing base_url at it and supplying a Gemini API key.
client = AsyncOpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

async def stream_gemini(prompt: str):
    response = await client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            yield content

Troubleshooting
- If streaming hangs or returns no data, verify your API key and network connectivity.
- Ensure the consuming client reads the response incrementally; HTTP tools and proxies that buffer the whole body make streaming appear to hang even when the server is sending chunks.
- Check for rate limits or quota exhaustion in your OpenAI dashboard.
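To check whether the endpoint is genuinely streaming, read the body in small pieces and watch for incremental output; a rough stdlib-only sketch, assuming the server above is running on localhost:8000:

```python
from urllib.request import urlopen

def check_stream(url: str = "http://localhost:8000/stream?prompt=Say+hi") -> None:
    # Read the body in small pieces: with a working stream, text appears
    # incrementally rather than all at once after the model finishes.
    with urlopen(url, timeout=30) as response:
        while True:
            piece = response.read(64)
            if not piece:
                break
            print(piece.decode("utf-8", errors="replace"), end="", flush=True)
```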
Key Takeaways
- Use FastAPI's StreamingResponse with an async generator to stream LLM output efficiently.
- Set stream=True in client.chat.completions.create to receive partial tokens.
- Always read your API key from environment variables for security and flexibility.
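For the last point, a small startup check (load_api_key is an illustrative helper, not an SDK function) turns a missing key into an actionable error instead of a KeyError mid-request:

```python
import os

def load_api_key() -> str:
    # Fail fast at startup with a clear message instead of a KeyError
    # (or an opaque 500 response) on the first request.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before starting the server")
    return key
```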