How to stream LlamaIndex response in web app
Quick answer
Use LlamaIndex with an OpenAI client configured for streaming, then integrate it into a Python web app (e.g., FastAPI) using Server-Sent Events or WebSockets to stream the response tokens in real time. This approach enables responsive, incremental display of AI-generated content in your web UI.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install llama-index openai fastapi uvicorn
Setup
Install required packages and set your OpenAI API key as an environment variable.
- Install packages: pip install llama-index openai fastapi uvicorn
- Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or set OPENAI_API_KEY=your_api_key (Windows)
Step by step
This example demonstrates streaming LlamaIndex responses from a FastAPI web app. The server sends tokens incrementally to the client via Server-Sent Events (SSE).
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

app = FastAPI()

# Configure the LLM and chunk size globally (replaces the deprecated ServiceContext)
Settings.llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
Settings.chunk_size = 512

# Load documents and build the index once at startup (example with local docs)
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# A query engine created with streaming=True yields response tokens incrementally
query_engine = index.as_query_engine(streaming=True)

def event_generator(query: str):
    # Query the index and relay each token as a Server-Sent Event
    streaming_response = query_engine.query(query)
    for token in streaming_response.response_gen:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream_response(q: str):
    return StreamingResponse(event_generator(q), media_type="text/event-stream")

# To run: uvicorn this_file_name:app --reload
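Once the server is running, the stream can be exercised from a small test client. This is a sketch: the `sse_data` helper is illustrative, and it assumes the httpx library is installed and the server is listening on localhost:8000.

```python
# Helper that extracts token payloads from SSE "data:" lines,
# stopping when the [DONE] sentinel arrives.
def sse_data(lines):
    for line in lines:
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                return
            yield payload

if __name__ == "__main__":
    import httpx  # third-party HTTP client with response streaming support

    # Assumes the FastAPI app above is running on localhost:8000
    with httpx.stream("GET", "http://localhost:8000/stream",
                      params={"q": "What is in the docs?"}, timeout=None) as resp:
        for token in sse_data(resp.iter_lines()):
            print(token, end="", flush=True)
```

The parsing helper is separated from the network call so the same logic works with any line source, including a WebSocket or a local test fixture.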
Client-side code can connect to /stream?q=your+query and receive the streamed tokens.
Common variations
- Async LlamaIndex calls: Use async methods if supported by your LlamaIndex version for better concurrency.
- WebSocket streaming: Replace SSE with WebSocket for bidirectional streaming in complex apps.
- Different models: Swap gpt-4o for other OpenAI or Anthropic models that support streaming.
- Frontend integration: Use the JavaScript EventSource API to consume SSE and update the UI live.
Troubleshooting
- If streaming stalls or hangs, verify your OpenAI API key and network connectivity.
- Ensure your LlamaIndex and OpenAI packages are up to date to support streaming.
- Check that your client properly handles SSE events and closes connections gracefully.
- For large documents, adjust the chunk size (chunk_size_limit in a ServiceContext on older LlamaIndex versions, Settings.chunk_size on newer ones) to optimize token streaming.
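On LlamaIndex 0.10+, chunking is configured globally through the Settings object rather than a ServiceContext. A configuration sketch (the values are illustrative, not recommendations):

```python
from llama_index.core import Settings

Settings.chunk_size = 512     # tokens per chunk when splitting documents
Settings.chunk_overlap = 20   # overlap between consecutive chunks
```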
Key Takeaways
- Use OpenAI's streaming API with LlamaIndex to deliver incremental AI responses in web apps.
- FastAPI with Server-Sent Events is a simple, effective way to stream tokens to frontend clients.
- Adjust LlamaIndex chunk sizes and model parameters to optimize streaming performance.