How to stream LlamaIndex response in web app
Quick answer
Use LlamaIndex with an OpenAI client configured for streaming, then integrate it into a Python web app (e.g., FastAPI) using Server-Sent Events or WebSockets to stream the response tokens in real time. This approach enables responsive, incremental display of AI-generated content in your web UI.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install llama-index openai fastapi uvicorn
Setup
Install required packages and set your OpenAI API key as an environment variable.
- Install packages: pip install llama-index openai fastapi uvicorn
- Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or set OPENAI_API_KEY=your_api_key (Windows)
Step by step
This example demonstrates streaming LlamaIndex responses from a FastAPI web app. The server sends tokens incrementally to the client via Server-Sent Events (SSE).
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

app = FastAPI()

# Configure the LLM and chunk size globally (replaces the deprecated ServiceContext)
Settings.llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
Settings.chunk_size = 512

# Load documents and build the index once at startup (example with local docs)
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# A query engine created with streaming=True yields response tokens incrementally
query_engine = index.as_query_engine(streaming=True)

def event_generator(query: str):
    # Query the index and relay each token as a Server-Sent Event
    streaming_response = query_engine.query(query)
    for token in streaming_response.response_gen:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream_response(q: str):
    return StreamingResponse(event_generator(q), media_type="text/event-stream")

# To run: uvicorn this_file_name:app --reload
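Once the server is running, the stream can be exercised from a small test client. This is a sketch: the `sse_data` helper is illustrative, and it assumes the httpx library is installed and the server is listening on localhost:8000.

```python
# Helper that extracts token payloads from SSE "data:" lines,
# stopping when the [DONE] sentinel arrives.
def sse_data(lines):
    for line in lines:
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                return
            yield payload

if __name__ == "__main__":
    import httpx  # third-party HTTP client with response streaming support

    # Assumes the FastAPI app above is running on localhost:8000
    with httpx.stream("GET", "http://localhost:8000/stream",
                      params={"q": "What is in the docs?"}, timeout=None) as resp:
        for token in sse_data(resp.iter_lines()):
            print(token, end="", flush=True)
```

The parsing helper is separated from the network call so the same logic works with any line source, including a WebSocket or a local test fixture.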
Client-side code can connect to /stream?q=your+query and receive the streamed tokens.
Common variations
- Async LlamaIndex calls: Use async methods if supported by your LlamaIndex version for better concurrency.
- WebSocket streaming: Replace SSE with WebSocket for bidirectional streaming in complex apps.
- Different models: Swap gpt-4o for other OpenAI or Anthropic models that support streaming.
- Frontend integration: Use the JavaScript EventSource API to consume SSE and update the UI live.
Troubleshooting
- If streaming stalls or hangs, verify your OpenAI API key and network connectivity.
- Ensure your LlamaIndex and OpenAI packages are up to date to support streaming.
- Check that your client properly handles SSE events and closes connections gracefully.
- For large documents, adjust the chunk size (chunk_size_limit in a ServiceContext on older LlamaIndex versions, Settings.chunk_size on newer ones) to optimize token streaming.
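On LlamaIndex 0.10+, chunking is configured globally through the Settings object rather than a ServiceContext. A configuration sketch (the values are illustrative, not recommendations):

```python
from llama_index.core import Settings

Settings.chunk_size = 512     # tokens per chunk when splitting documents
Settings.chunk_overlap = 20   # overlap between consecutive chunks
```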
Key Takeaways
- Use OpenAI's streaming API with LlamaIndex to deliver incremental AI responses in web apps.
- FastAPI with Server-Sent Events is a simple, effective way to stream tokens to frontend clients.
- Adjust LlamaIndex chunk sizes and model parameters to optimize streaming performance.