How to · Intermediate · 4 min read

How to stream a LlamaIndex response in a web app

Quick answer
Build a LlamaIndex index with a streaming-enabled query engine (backed by an OpenAI model), then serve it from a Python web app (e.g., FastAPI) over Server-Sent Events or WebSockets to deliver response tokens in real time. This approach enables responsive, incremental display of AI-generated content in your web UI.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install llama-index openai fastapi uvicorn

Setup

Install required packages and set your OpenAI API key as an environment variable.

  • Install packages: pip install llama-index openai fastapi uvicorn
  • Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or set OPENAI_API_KEY=your_api_key (Windows)
```bash
pip install llama-index openai fastapi uvicorn
```
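Before starting the server, it can help to fail fast if the key is missing rather than surfacing an opaque authentication error on the first request. A minimal sketch (the `require_api_key` helper is hypothetical, not part of any library):

```python
import os

def require_api_key(env=os.environ):
    """Return the OpenAI API key, or fail fast with a clear error."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the server"
        )
    return key
```

Call this once at startup, before building the index, so misconfiguration is reported immediately.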

Step by step

This example streams a LlamaIndex query response from a FastAPI web app. The index is built once at startup, queried through a streaming query engine, and the resulting tokens are pushed to the client incrementally via Server-Sent Events (SSE).

```python
# Requires llama-index >= 0.10 (llama_index.core namespace)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai import OpenAI

app = FastAPI()

# Configure the LLM (reads OPENAI_API_KEY from the environment) and chunking
Settings.llm = OpenAI(model="gpt-4o")
Settings.chunk_size = 512

# Load documents and build the index once at startup (example with local docs)
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

# A query engine with streaming enabled yields tokens as they arrive
query_engine = index.as_query_engine(streaming=True)

def event_generator(query: str):
    # query() returns a StreamingResponse whose response_gen yields text deltas
    streaming_response = query_engine.query(query)
    for token in streaming_response.response_gen:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

@app.get("/stream")
def stream_response(q: str):
    # FastAPI runs sync generators in a threadpool, so this does not block the loop
    return StreamingResponse(event_generator(q), media_type="text/event-stream")

# To run: uvicorn this_file_name:app --reload
# Clients can connect to /stream?q=your+query and receive streamed tokens.
```
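Each yielded chunk in the generator above is a minimal SSE frame: one or more `data:` lines followed by a blank line. If a token itself contains a newline, the SSE format requires splitting it across multiple `data:` lines within the same frame. A small framing helper makes this explicit (`sse_frame` is a hypothetical name, shown for illustration):

```python
def sse_frame(token: str) -> str:
    """Encode one token as a Server-Sent Events frame.

    Multi-line tokens become multiple data: lines in the same frame,
    as the SSE wire format requires.
    """
    lines = token.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```

In the generator, `yield f"data: {token}\n\n"` could then become `yield sse_frame(token)`.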

Common variations

  • Async LlamaIndex calls: Use the query engine's async methods (e.g., aquery) where your LlamaIndex version supports them, for better concurrency under load.
  • WebSocket streaming: Replace SSE with WebSocket for bidirectional streaming in complex apps.
  • Different models: Swap gpt-4o for another streaming-capable model; non-OpenAI providers (e.g., Anthropic) require the matching LlamaIndex LLM integration.
  • Frontend integration: Use JavaScript EventSource API to consume SSE and update UI live.
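In the browser, the EventSource API handles SSE parsing for you; a non-browser client has to split the stream on blank lines and strip the `data:` prefixes itself. A minimal pure-Python parser sketch (`parse_sse` is a hypothetical helper, shown on a canned stream rather than a live connection; it handles the common `data: ` form emitted by the server above):

```python
def parse_sse(raw: str):
    """Yield the data payload of each SSE frame in a raw text stream."""
    for frame in raw.split("\n\n"):
        data_lines = [
            line[len("data: "):]
            for line in frame.split("\n")
            if line.startswith("data: ")
        ]
        if data_lines:
            # Multiple data: lines in one frame rejoin with newlines
            yield "\n".join(data_lines)

stream = "data: Hello\n\ndata: world\n\ndata: [DONE]\n\n"
tokens = list(parse_sse(stream))
# tokens == ["Hello", "world", "[DONE]"]
```

A real client would read the HTTP response incrementally and buffer until each blank-line frame boundary; the parsing logic stays the same.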

Troubleshooting

  • If streaming stalls or hangs, verify your OpenAI API key and network connectivity.
  • Ensure your LlamaIndex and OpenAI packages are up to date to support streaming.
  • Check that your client properly handles SSE events and closes connections gracefully.
  • For large documents, adjust Settings.chunk_size (the successor to ServiceContext's chunk_size_limit) to balance retrieval quality against streaming latency.

Key Takeaways

  • Use OpenAI's streaming API with LlamaIndex to deliver incremental AI responses in web apps.
  • FastAPI with Server-Sent Events is a simple, effective way to stream tokens to frontend clients.
  • Adjust LlamaIndex chunk sizes and model parameters to optimize streaming performance.
Verified 2026-04 · gpt-4o