How to stream LangChain agent output
Quick answer
To stream LangChain agent output, enable the `streaming` parameter on the underlying LLM client and handle the streamed tokens via callbacks or async iterators. Use LangChain's CallbackManager or AsyncCallbackManager to capture and process output chunks as they arrive.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install langchain openai>=1.0
Setup
Install the required packages and set your environment variable for the OpenAI API key.
- Install LangChain and the OpenAI SDK: `pip install langchain openai`
- Set your API key in the environment: `export OPENAI_API_KEY='your_api_key'` (Linux/macOS) or `setx OPENAI_API_KEY "your_api_key"` (Windows)
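Before running the examples, a quick sanity check (a sketch; the variable name is illustrative) confirms the key is visible to Python:

```python
import os

# Check whether OPENAI_API_KEY is visible to this Python process.
key_present = "OPENAI_API_KEY" in os.environ
print("API key set:", key_present)
```

If this prints `False`, re-export the variable in the same shell you use to run the scripts.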
Step by step
This example creates a LangChain agent with streaming enabled using the OpenAI gpt-4o model. A custom callback handler captures streamed tokens and prints them in real time.
```python
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType
from langchain.callbacks.base import BaseCallbackHandler

class StreamHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(token, end='', flush=True)

# Initialize the streaming LLM client with the callback handler attached
llm = ChatOpenAI(
    model="gpt-4o",
    streaming=True,
    callbacks=[StreamHandler()],
)

# Initialize a simple agent with the streaming LLM
agent = initialize_agent(
    tools=[],
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    llm=llm,
    verbose=True,
)

# Run the agent with streaming output
print("Agent output:")
agent.run("Write a short poem about AI streaming output.")
```

Output
```text
Agent output:
AI streams in flowing code,
Tokens dance, ideas explode.
Real-time thoughts on screen displayed,
LangChain guides the streaming parade.
```
Common variations
You can also stream asynchronously by using an AsyncCallbackHandler with async LLM calls. Other models, such as gpt-4o-mini or Anthropic's claude-3-5-sonnet-20241022, support streaming in the same way. Adjust the callback handlers to process tokens differently, e.g., buffering them or sending them to a UI.
```python
import asyncio
from langchain_openai import ChatOpenAI
from langchain.callbacks.base import AsyncCallbackHandler

class AsyncStreamHandler(AsyncCallbackHandler):
    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(token, end='', flush=True)

async def main():
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        streaming=True,
        callbacks=[AsyncStreamHandler()],
    )
    response = await llm.ainvoke("Explain streaming in LangChain.")
    print("\nFull response:", response.content)

asyncio.run(main())
```

Output
```text
Streaming in LangChain means tokens arrive incrementally, allowing real-time output display.
Full response: Streaming in LangChain means tokens arrive incrementally, allowing real-time output display.
```
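A handler does not have to print tokens: it can buffer them for batching or for pushing to a UI. A minimal sketch of that idea (the class name is hypothetical, and the langchain import is omitted so the snippet runs standalone; in real use the class would subclass BaseCallbackHandler and be attached via callbacks=[...]):

```python
class BufferingStreamHandler:
    """Collects streamed tokens instead of printing them."""
    def __init__(self):
        self._tokens = []

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Buffer tokens, e.g. to push batches to a UI instead of the console.
        self._tokens.append(token)

    def text(self) -> str:
        return "".join(self._tokens)

# Simulate the callback firing as tokens arrive:
handler = BufferingStreamHandler()
for tok in ["Stream", "ing ", "works"]:
    handler.on_llm_new_token(tok)
print(handler.text())  # Streaming works
```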
Troubleshooting
- If streaming output does not appear, ensure `streaming=True` is set on the LLM client.
- Check that your callback handlers are properly attached (e.g. via the `callbacks` parameter).
- For async streaming, confirm you are running inside an async event loop.
- If output is buffered and delayed, verify your print or UI update calls flush output immediately.
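The event-loop requirement above can be sketched offline (hypothetical class name, no API call made; in real use the class would subclass langchain's AsyncCallbackHandler):

```python
import asyncio

class AsyncTokenCollector:
    """Async handlers must be coroutines and must run inside an event loop."""
    def __init__(self):
        self.tokens = []

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.tokens.append(token)

async def demo():
    handler = AsyncTokenCollector()
    # Simulate tokens arriving from a streaming LLM call:
    for tok in ["async ", "streaming ", "demo"]:
        await handler.on_llm_new_token(tok)
    return "".join(handler.tokens)

# asyncio.run provides the event loop; calling demo() bare would only
# create a coroutine object and never fire the handler.
result = asyncio.run(demo())
print(result)  # async streaming demo
```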
Key Takeaways
- Enable streaming on the LangChain LLM client with `streaming=True` to get incremental output.
- Use `CallbackManager` or `AsyncCallbackManager` with custom handlers to process streamed tokens.
- Streaming works with multiple models and can be integrated into agents for real-time interaction.
- Flush output immediately in handlers to avoid buffering delays in the console or UI.
- Async streaming requires running inside an async event loop and using async callback handlers.