How to stream LLM response to frontend
Quick answer
Use the OpenAI SDK's chat.completions.create method with stream=True to receive partial LLM responses as they are generated. On the backend, forward these chunks to the frontend in real time via Server-Sent Events (SSE) using a framework such as FastAPI.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai fastapi uvicorn
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
- Install packages: pip install openai fastapi uvicorn
- Set the environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or set OPENAI_API_KEY=your_api_key (Windows)

Output of pip install openai fastapi uvicorn:

Collecting openai
Collecting fastapi
Collecting uvicorn
Successfully installed openai fastapi uvicorn
Step by step
This example shows a minimal FastAPI server that streams LLM responses to the frontend using Server-Sent Events (SSE). The backend calls client.chat.completions.create with stream=True and yields chunks as they arrive.
```python
import os

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def stream_llm_response(messages):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        # The final chunk may carry no content, so fall back to "".
        delta = chunk.choices[0].delta.content or ""
        yield f"data: {delta}\n\n"

@app.get("/stream")
async def stream(request: Request):
    messages = [{"role": "user", "content": "Tell me a joke."}]
    return StreamingResponse(stream_llm_response(messages), media_type="text/event-stream")

# To run:
# uvicorn filename:app --reload
```
Output:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

When you access http://127.0.0.1:8000/stream in a browser or SSE client, you receive the LLM response in streamed chunks.
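Browsers typically consume this stream with EventSource, which handles the SSE framing for you. To see what that framing amounts to, the server's `data: ...\n\n` frames can be parsed with a small helper. This is a simplified sketch (real SSE also allows `event:`, `id:`, and multi-line data fields, and `parse_sse_events` is a hypothetical name, not part of any library):

```python
def parse_sse_events(raw):
    """Extract the data payloads from a raw SSE stream.

    Each SSE event is one or more "data: ..." lines followed by a
    blank line; this keeps only the payload text of each event.
    """
    events = []
    for block in raw.split("\n\n"):
        payloads = [
            line[len("data: "):]
            for line in block.split("\n")
            if line.startswith("data: ")
        ]
        if payloads:
            events.append("\n".join(payloads))
    return events

# Reassemble the full completion from streamed frames.
frames = "data: Why did the\n\ndata:  chicken cross\n\ndata:  the road?\n\n"
print("".join(parse_sse_events(frames)))  # Why did the chicken cross the road?
```

Note that a delta containing a newline would break this naive framing, which is one reason production code usually JSON-encodes each payload before putting it in a data: field.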
Common variations
- Async streaming: Use AsyncOpenAI and async for if your stack is fully async (see below).
- Different models: Replace model="gpt-4o-mini" with any model that supports streaming, such as gpt-4o.
- Other frameworks: Use the same SSE pattern in Flask or Django with the appropriate SSE libraries.
```python
import os
import asyncio

from openai import AsyncOpenAI

# AsyncOpenAI (not OpenAI) is required for `await` and `async for`.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_stream():
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello async streaming"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(async_stream())
```

Output: the model's reply prints token by token as chunks arrive.
Troubleshooting
- If streaming hangs or returns no data, verify your API key and network connectivity.
- Ensure the openai library version is >= 1.0; earlier versions use a different streaming API.
- Check that the frontend supports SSE and correctly handles text/event-stream responses.
Key Takeaways
- Use stream=True in client.chat.completions.create to receive partial LLM outputs.
- Stream data to the frontend via Server-Sent Events (SSE) for a real-time user experience.
- FastAPI with StreamingResponse is a simple and effective backend for streaming.
- Always handle empty or missing delta.content safely when streaming.
- Test streaming with different models and async patterns to find the best fit for your integration.
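The "handle delta.content safely" takeaway can be sketched as a small guard. extract_delta is a hypothetical helper name, and the stub objects below only mimic the shape of the SDK's streaming chunks for illustration:

```python
from types import SimpleNamespace

def extract_delta(chunk):
    """Safely pull text out of a streaming chunk.

    Guards against chunks with no choices (e.g. a final usage-only
    chunk) and against a None `content` field.
    """
    if not chunk.choices:
        return ""
    return chunk.choices[0].delta.content or ""

# Stub chunks mimicking the SDK's shape (assumption for illustration).
text_chunk = SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hi"))])
empty_chunk = SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))])
final_chunk = SimpleNamespace(choices=[])

print(extract_delta(text_chunk))   # Hi
print(extract_delta(empty_chunk))  # (empty string)
print(extract_delta(final_chunk))  # (empty string)
```

Centralizing this check in one helper keeps the SSE generator itself free of edge-case clutter.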