How to stream query response in LlamaIndex
Quick answer
To stream query responses in LlamaIndex, enable streaming on the query engine and use an LLM that supports streaming. This yields real-time, token-by-token output instead of waiting for the full completion, which improves responsiveness in applications.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install llama-index openai
Setup
Install the llama-index and openai Python packages and set your OpenAI API key as an environment variable.
- Install packages:
pip install llama-index openai
- Set the environment variable in your shell:
export OPENAI_API_KEY='your_api_key'
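Before running the examples, a quick sanity check that the key is actually visible from Python (assuming the variable name above; the helper function is illustrative, not part of LlamaIndex):

```python
import os

def has_openai_key() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running the examples.")
```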
Step by step
This example demonstrates streaming a query response from LlamaIndex using OpenAI's gpt-4o model. streaming=True is passed when creating the query engine, and tokens are printed as they arrive. (The imports below use the ServiceContext-era API; newer LlamaIndex releases replace ServiceContext with Settings and move core imports to llama_index.core.)
import os
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
# Load documents (replace with your data source)
docs = SimpleDirectoryReader('data').load_data()
# Initialize the OpenAI LLM (streaming is requested on the query engine below)
llm = OpenAI(model='gpt-4o', temperature=0, api_key=os.environ['OPENAI_API_KEY'])
# Create service context with streaming LLM
service_context = ServiceContext.from_defaults(llm=llm)
# Build the index
index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
# Define a query
query_str = "Explain the benefits of streaming responses in LlamaIndex."
# Stream the query response
print("Streaming response:")
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query(query_str)
for token in streaming_response.response_gen:
    print(token, end='', flush=True)
print()

Output
Streaming response:
Streaming responses allow real-time token-by-token output, improving user experience and reducing latency in applications.
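The responsiveness gain is independent of LlamaIndex: any consumer that processes tokens as the generator yields them can display partial output immediately, rather than blocking until the whole string is built. A minimal sketch with a plain Python generator standing in for streaming_response.response_gen:

```python
from typing import Iterator, List

def fake_token_stream() -> Iterator[str]:
    """Stand-in for a streaming response generator."""
    for token in ["Streaming ", "delivers ", "tokens ", "incrementally."]:
        yield token

def consume_stream(tokens: Iterator[str]) -> str:
    """Print tokens as they arrive and return the assembled text."""
    parts: List[str] = []
    for token in tokens:
        print(token, end="", flush=True)  # partial output appears immediately
        parts.append(token)
    print()
    return "".join(parts)

full_text = consume_stream(fake_token_stream())
```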
Common variations
- Async streaming: Use async query methods, where supported, for non-blocking streaming.
- Different LLMs: Replace OpenAI with other LlamaIndex LLMs that support streaming.
- Custom callbacks: Handle tokens in a callback for UI updates or logging.
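The custom-callbacks idea can be sketched generically: wrap the token generator and forward each token to a handler as it passes through (the stream_with_handler helper and on_token callback here are illustrative names, not part of LlamaIndex's API):

```python
from typing import Callable, Iterable, Iterator

def stream_with_handler(tokens: Iterable[str],
                        on_token: Callable[[str], None]) -> Iterator[str]:
    """Yield tokens unchanged while invoking on_token for each one."""
    for token in tokens:
        on_token(token)  # e.g. push to a UI widget or append to a log
        yield token

# Usage: log tokens while still streaming them to the caller
logged = []
streamed = list(stream_with_handler(["Hello, ", "world!"], logged.append))
```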
For example, async streaming (the exact async streaming API varies across LlamaIndex versions):

import asyncio

async def async_stream_query(index, query_str):
    query_engine = index.as_query_engine(streaming=True)
    streaming_response = await query_engine.aquery(query_str)
    async for token in streaming_response.async_response_gen():
        print(token, end='', flush=True)

# Usage example
# asyncio.run(async_stream_query(index, query_str))

Troubleshooting
- If streaming does not output tokens, verify your LLM supports streaming and streaming=True is set on the query engine.
- Ensure your environment variable OPENAI_API_KEY is correctly set and accessible.
- Check network connectivity and API usage limits if responses fail.
Key Takeaways
- Enable streaming in LlamaIndex by passing streaming=True when creating the query engine.
- Use an LLM that supports streaming, such as OpenAI's gpt-4o, for real-time token output.
- Streaming improves responsiveness by delivering tokens incrementally instead of waiting for the full completion.