Token-level streaming from LLM nodes
Why this matters
Real-time token streaming creates responsive user experiences where output appears word-by-word instead of all at once. This is critical for chat interfaces, live dashboards, and any system where latency to first token matters more than total response time.
Explanation
Token streaming means yielding LLM output as each token arrives from the model, rather than collecting the full response and returning it all at once.
How it works: LangChain's chat model classes expose a stream() method that yields partial output chunks, and provider SDKs such as Anthropic's offer an equivalent. In a LangGraph node, you iterate over that stream inside the node function. Note that graph.stream() with stream_mode="updates" emits one state update per completed node, not one per token, so the tokens have to be surfaced from inside the node itself (for example by printing them or by forwarding them through a callback).
When to use it: When end-users are waiting for a response (chat, search results, completions) and perceived latency matters. Avoid it when you need a complete, validated response before proceeding (batch processing, validation-first workflows).
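The difference between buffering and streaming can be sketched without any LLM at all. The snippet below is a minimal illustration, assuming a hypothetical fake_token_stream() generator that stands in for a model's streaming API; it is not LangChain or Anthropic code, and the function names are made up for this example.

```python
from typing import Callable, Iterator, List


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM streaming API: yields one chunk at a time."""
    for i, word in enumerate(text.split(" ")):
        # Re-attach the space that split() removed, except before the first word.
        yield word if i == 0 else " " + word


def buffered(text: str) -> str:
    """Non-streaming: the caller sees nothing until the whole response is ready."""
    return "".join(fake_token_stream(text))


def streamed(text: str, on_token: Callable[[str], None]) -> str:
    """Streaming: hand each chunk to the caller as it arrives, while still
    accumulating the full response for whatever runs next."""
    parts: List[str] = []
    for token in fake_token_stream(text):
        on_token(token)  # e.g. push to a UI, a websocket, or stdout
        parts.append(token)
    return "".join(parts)


seen: List[str] = []
full = streamed("tokens arrive one by one", seen.append)
```

Both functions end up with the same full string; the only difference is that streamed() lets the caller act on every chunk before the response is complete.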
Analogy
Think of it like watching water fill a glass through a faucet vs. someone handing you a full glass. With streaming, you see the water level rise continuously (responsive feedback). Without streaming, you wait for someone to fill the entire glass in a back room, then hand it to you all at once (higher latency perception).
Code
import anthropic
from langgraph.graph import StateGraph, START, END
from typing import TypedDict


class State(TypedDict):
    query: str
    response: str


def stream_llm_node(state: State):
    # Reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()
    response_text = ""

    # messages.stream() is a context manager that yields text as it is generated.
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": state["query"]
            }
        ]
    ) as stream:
        for text in stream.text_stream:
            response_text += text  # accumulate the full response
            print(f"Token: {repr(text)}", flush=True)  # show each chunk immediately

    # The graph only sees this final state update, after the stream has finished.
    return {"response": response_text}


# Build a one-node graph: START -> llm -> END.
graph = StateGraph(State)
graph.add_node("llm", stream_llm_node)
graph.add_edge(START, "llm")
graph.add_edge("llm", END)
compiled_graph = graph.compile()

result = compiled_graph.invoke(
    {"query": "Explain quantum computing in one sentence."}
)
print("\n--- Final Response ---")
print(result["response"])

Output:
Token: ' '
Token: 'Quantum'
Token: ' '
Token: 'computing'
Token: ' '
Token: 'harnesses'
Token: ' '
Token: 'the'
Token: ' '
Token: 'principles'
Token: ' '
Token: 'of'
Token: ' '
Token: 'quantum'
Token: ' '
Token: 'mechanics'
Token: ','
Token: ' '
Token: 'such'
Token: ' '
Token: 'as'
Token: ' '
Token: 'superposition'
Token: ' '
Token: 'and'
Token: ' '
Token: 'entanglement'
Token: ','
Token: ' '
Token: 'to'
Token: ' '
Token: 'perform'
Token: ' '
Token: 'computations'
Token: ' '
Token: 'exponentially'
Token: ' '
Token: 'faster'
Token: ' '
Token: 'than'
Token: ' '
Token: 'classical'
Token: ' '
Token: 'computers'
Token: '.'

--- Final Response ---
Quantum computing harnesses the principles of quantum mechanics, such as superposition and entanglement, to perform computations exponentially faster than classical computers.
What just happened?
The code builds a LangGraph state graph with a single LLM node. The node uses Anthropic's streaming context manager, client.messages.stream(), to receive text chunks from the Claude model as they are generated. Each chunk is printed as soon as it arrives and appended to a final response string. The graph is compiled and invoked as usual; the streaming happens entirely inside the node function, so the graph itself needs no special streaming configuration. Note that the tokens appear in real time only on the stdout of the process running the graph: the code that calls invoke() receives just the final state once the node returns.
Common gotcha
Developers often assume that calling graph.stream() with stream_mode="updates" gives token-level granularity. It does not: that mode emits one state update per completed node. (The keyword argument is stream_mode, not mode.) Token-level streaming happens inside the node function, where you consume the LLM's stream yourself. If those tokens need to reach a web client in real time, the node has to forward them as they arrive, for example by running the graph asynchronously with astream() or by invoking a custom callback inside the node.
Error recovery
AttributeError: 'APIResponse' object has no attribute 'text_stream'
TypeError: 'NoneType' object is not iterable
KeyError when accessing state["response"]
Experienced dev note
A common misconception among beginners is that streaming at the graph level means token streaming. It doesn't: graph.stream() returns node-level updates. Token streaming requires an LLM that supports it (most modern ones do), and you handle it in the node function itself. If you need tokens to reach an HTTP client in real time, combine this with async nodes and astream_events() on the graph. For production chat systems, pair streaming with error recovery: if an LLM call fails mid-stream, the client has already received partial output, so you need a retry strategy that either resumes or restarts gracefully. In practice, most production failures come not from individual tokens but from incomplete responses being sent to the user before they have been validated.
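A restart-style recovery strategy can be sketched in plain Python. This is a minimal illustration under stated assumptions: stream_with_restart(), the event kinds ("token", "reset", "done"), and the flaky_stream() generator are all hypothetical names invented here, and a real system would catch the exception types its LLM client actually raises.

```python
from typing import Callable, Iterator, List, Tuple

Event = Tuple[str, str]  # (kind, payload); kind is "token", "reset", or "done"


def stream_with_restart(
    make_stream: Callable[[], Iterator[str]],
    max_attempts: int = 3,
) -> Iterator[Event]:
    """Relay tokens; if the stream dies mid-way, tell the client to discard
    the partial output and start again from the beginning."""
    for attempt in range(1, max_attempts + 1):
        parts: List[str] = []
        try:
            for token in make_stream():
                parts.append(token)
                yield ("token", token)
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            yield ("reset", str(exc))  # client should clear what it rendered
            continue
        yield ("done", "".join(parts))
        return


calls = {"n": 0}


def flaky_stream() -> Iterator[str]:
    """Hypothetical stream that drops the connection on its first call only."""
    calls["n"] += 1
    yield "Hello"
    if calls["n"] == 1:
        raise ConnectionError("connection dropped")
    yield " world"


events = list(stream_with_restart(flaky_stream))
```

The explicit "reset" event is the design choice worth noting: because the client has already rendered partial output, the server has to say when that output is no longer valid.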
Check your understanding
If a user is watching a chat interface and sees tokens appear one by one, where is the actual streaming happening: at the graph layer, the node layer, or the client layer? What would change in your code if you needed the graph itself to yield intermediate state updates between nodes (not just tokens within a node)?
Answer hint
Streaming happens inside the node function (at the LLM library level), not at the graph layer. Streaming tokens from an LLM and streaming graph state updates are separate concerns. To receive graph-level state updates between nodes, call graph.stream() with stream_mode="updates" instead of invoke(); each completed node then yields its own state update. To observe a node while it is still running, use `astream_events()` or have the node emit its own partial output.