How to stream LLM response to frontend
Quick answer
Use the OpenAI SDK's chat.completions.create method with stream=True to receive partial LLM responses as they are generated. On the backend, forward these chunks to the frontend in real time via Server-Sent Events (SSE) using a framework such as FastAPI.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai fastapi uvicorn
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
- Install packages: pip install openai fastapi uvicorn
- Set the environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or set OPENAI_API_KEY=your_api_key (Windows)

Output of pip install openai fastapi uvicorn:

Collecting openai
Collecting fastapi
Collecting uvicorn
Successfully installed openai fastapi uvicorn
Step by step
This example shows a minimal FastAPI server that streams LLM responses to the frontend using Server-Sent Events (SSE). The backend calls client.chat.completions.create with stream=True and yields chunks as they arrive.
```python
import os

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def stream_llm_response(messages):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        # The final chunk may carry no content, so fall back to "".
        delta = chunk.choices[0].delta.content or ""
        yield f"data: {delta}\n\n"

@app.get("/stream")
async def stream(request: Request):
    messages = [{"role": "user", "content": "Tell me a joke."}]
    return StreamingResponse(stream_llm_response(messages), media_type="text/event-stream")

# To run:
# uvicorn filename:app --reload
```
Output:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

When you access http://127.0.0.1:8000/stream in a browser or SSE client, you receive the LLM response in streamed chunks.
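Browsers typically consume this stream with EventSource, which handles the SSE framing for you. To see what that framing amounts to, the server's `data: ...\n\n` frames can be parsed with a small helper. This is a simplified sketch (real SSE also allows `event:`, `id:`, and multi-line data fields, and `parse_sse_events` is a hypothetical name, not part of any library):

```python
def parse_sse_events(raw):
    """Extract the data payloads from a raw SSE stream.

    Each SSE event is one or more "data: ..." lines followed by a
    blank line; this keeps only the payload text of each event.
    """
    events = []
    for block in raw.split("\n\n"):
        payloads = [
            line[len("data: "):]
            for line in block.split("\n")
            if line.startswith("data: ")
        ]
        if payloads:
            events.append("\n".join(payloads))
    return events

# Reassemble the full completion from streamed frames.
frames = "data: Why did the\n\ndata:  chicken cross\n\ndata:  the road?\n\n"
print("".join(parse_sse_events(frames)))  # Why did the chicken cross the road?
```

Note that a delta containing a newline would break this naive framing, which is one reason production code usually JSON-encodes each payload before putting it in a data: field.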
Common variations
- Async streaming: Use AsyncOpenAI and async for if your stack is fully async (see below).
- Different models: Replace model="gpt-4o-mini" with any model that supports streaming, such as gpt-4o.
- Other frameworks: Use the same SSE pattern in Flask or Django with the appropriate SSE libraries.
```python
import os
import asyncio

from openai import AsyncOpenAI

# AsyncOpenAI (not OpenAI) is required for `await` and `async for`.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_stream():
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello async streaming"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_stream())
```

Output: the model's reply prints token by token as chunks arrive.
Troubleshooting
- If streaming hangs or returns no data, verify your API key and network connectivity.
- Ensure the openai library version is >= 1.0; earlier versions use a different streaming API.
- Check that the frontend supports SSE and correctly handles text/event-stream responses.
Key Takeaways
- Use stream=True in client.chat.completions.create to receive partial LLM outputs.
- Stream data to the frontend via Server-Sent Events (SSE) for a real-time user experience.
- FastAPI with StreamingResponse is a simple and effective backend for streaming.
- Always handle empty or missing delta.content safely when streaming.
- Test streaming with different models and async patterns to find the best fit for your integration.
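The "handle delta.content safely" takeaway can be sketched as a small guard. extract_delta is a hypothetical helper name, and the stub objects below only mimic the shape of the SDK's streaming chunks for illustration:

```python
from types import SimpleNamespace

def extract_delta(chunk):
    """Safely pull text out of a streaming chunk.

    Guards against chunks with no choices (e.g. a final usage-only
    chunk) and against a None `content` field.
    """
    if not chunk.choices:
        return ""
    return chunk.choices[0].delta.content or ""

# Stub chunks mimicking the SDK's shape (assumption for illustration).
text_chunk = SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hi"))])
empty_chunk = SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))])
final_chunk = SimpleNamespace(choices=[])

print(extract_delta(text_chunk))   # Hi
print(extract_delta(empty_chunk))  # (empty string)
print(extract_delta(final_chunk))  # (empty string)
```

Centralizing this check in one helper keeps the SSE generator itself free of edge-case clutter.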