How to stream query response in LlamaIndex
Quick answer
To stream query responses in LlamaIndex, enable streaming on the query engine and use an LLM that supports streaming. This yields real-time, token-by-token output instead of waiting for the full completion, which improves responsiveness in applications.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install llama-index openai
Setup
Install the llama-index and openai Python packages and set your OpenAI API key as an environment variable.
- Install packages:
pip install llama-index openai
- Set the environment variable in your shell:
export OPENAI_API_KEY='your_api_key'
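Before running the examples, a quick sanity check that the key is actually visible from Python (assuming the variable name above; the helper function is illustrative, not part of LlamaIndex):

```python
import os

def has_openai_key() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running the examples.")
```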
Step by step
This example demonstrates streaming a query response from LlamaIndex using OpenAI's gpt-4o model. streaming=True is passed when creating the query engine, and tokens are printed as they arrive. (The imports below use the ServiceContext-era API; newer LlamaIndex releases replace ServiceContext with Settings and move core imports to llama_index.core.)
import os
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
# Load documents (replace with your data source)
docs = SimpleDirectoryReader('data').load_data()
# Initialize the OpenAI LLM (streaming is requested on the query engine below)
llm = OpenAI(model='gpt-4o', temperature=0, api_key=os.environ['OPENAI_API_KEY'])
# Create service context with streaming LLM
service_context = ServiceContext.from_defaults(llm=llm)
# Build the index
index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
# Define a query
query_str = "Explain the benefits of streaming responses in LlamaIndex."
# Stream the query response
print("Streaming response:")
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query(query_str)
for token in streaming_response.response_gen:
    print(token, end='', flush=True)
print()

Output
Streaming response:
Streaming responses allow real-time token-by-token output, improving user experience and reducing latency in applications.
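The responsiveness gain is independent of LlamaIndex: any consumer that processes tokens as the generator yields them can display partial output immediately, rather than blocking until the whole string is built. A minimal sketch with a plain Python generator standing in for streaming_response.response_gen:

```python
from typing import Iterator, List

def fake_token_stream() -> Iterator[str]:
    """Stand-in for a streaming response generator."""
    for token in ["Streaming ", "delivers ", "tokens ", "incrementally."]:
        yield token

def consume_stream(tokens: Iterator[str]) -> str:
    """Print tokens as they arrive and return the assembled text."""
    parts: List[str] = []
    for token in tokens:
        print(token, end="", flush=True)  # partial output appears immediately
        parts.append(token)
    print()
    return "".join(parts)

full_text = consume_stream(fake_token_stream())
```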
Common variations
- Async streaming: Use async query methods, where supported, for non-blocking streaming.
- Different LLMs: Replace OpenAI with other LlamaIndex LLMs that support streaming.
- Custom callbacks: Handle tokens in a callback for UI updates or logging.
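The custom-callbacks idea can be sketched generically: wrap the token generator and forward each token to a handler as it passes through (the stream_with_handler helper and on_token callback here are illustrative names, not part of LlamaIndex's API):

```python
from typing import Callable, Iterable, Iterator

def stream_with_handler(tokens: Iterable[str],
                        on_token: Callable[[str], None]) -> Iterator[str]:
    """Yield tokens unchanged while invoking on_token for each one."""
    for token in tokens:
        on_token(token)  # e.g. push to a UI widget or append to a log
        yield token

# Usage: log tokens while still streaming them to the caller
logged = []
streamed = list(stream_with_handler(["Hello, ", "world!"], logged.append))
```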
For example, async streaming (the exact async streaming API varies across LlamaIndex versions):

import asyncio

async def async_stream_query(index, query_str):
    query_engine = index.as_query_engine(streaming=True)
    streaming_response = await query_engine.aquery(query_str)
    async for token in streaming_response.async_response_gen():
        print(token, end='', flush=True)

# Usage example
# asyncio.run(async_stream_query(index, query_str))

Troubleshooting
- If streaming does not output tokens, verify your LLM supports streaming and streaming=True is set on the query engine.
- Ensure your environment variable OPENAI_API_KEY is correctly set and accessible.
- Check network connectivity and API usage limits if responses fail.
Key Takeaways
- Enable streaming in LlamaIndex by passing streaming=True when creating the query engine.
- Use an LLM that supports streaming, such as OpenAI's gpt-4o, for real-time token output.
- Streaming improves responsiveness by delivering tokens incrementally instead of waiting for the full completion.