Code Intermediate medium · 6 min

Token-level streaming from LLM nodes

What you will learn

Stream individual tokens as they arrive from an LLM within a LangGraph node instead of waiting for the complete response.

Why this matters

Real-time token streaming creates responsive user experiences where output appears word-by-word instead of all at once. This is critical for chat interfaces, live dashboards, and any system where latency to first token matters more than total response time.

Skip if: You need to process or validate the complete LLM response before sending it to the user. Streaming commits you to sending data immediately; you can't revise or filter tokens after they've been sent.

Explanation

Token streaming means capturing and yielding LLM output as individual tokens arrive from the model, rather than collecting the full response and returning it all at once.

How it works: Most LLM client libraries expose a streaming call that yields partial text as it is generated. LangChain chat models have a stream() method, and the Anthropic SDK used in this lesson has client.messages.stream(). In a LangGraph node, you iterate over that stream inside the node function and handle each chunk as it arrives. Note that graph.stream() with stream_mode="updates" emits one state update per node, not one event per token, so it does not produce token-level events on its own.

When to use it: When you have end-users waiting for responses (chat, search results, completions) and perceived latency matters. Avoid it when you need deterministic, complete responses before proceeding (batch processing, validation-first workflows).
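To make the difference concrete without calling a real model, here is a minimal sketch using only the standard library. The names fake_llm_stream, render_streaming, and render_blocking are hypothetical stand-ins, not part of LangGraph or the Anthropic SDK; the generator simply mimics a stream that yields word-sized chunks.

```python
from typing import Iterator, List


def fake_llm_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model stream: yields one word-sized chunk at a time."""
    for i, word in enumerate(text.split(" ")):
        # Re-attach the separating space to every chunk except the first.
        yield word if i == 0 else " " + word


def render_streaming(text: str) -> List[str]:
    """Record what a UI would show after each chunk arrives."""
    frames: List[str] = []
    partial = ""
    for chunk in fake_llm_stream(text):
        partial += chunk
        frames.append(partial)  # a real UI would repaint here
    return frames


def render_blocking(text: str) -> List[str]:
    """Collect the whole response first, then render once."""
    return ["".join(fake_llm_stream(text))]


streamed = render_streaming("Quantum computing uses qubits")
blocking = render_blocking("Quantum computing uses qubits")

print(streamed)  # one frame per chunk, each longer than the last
print(blocking)  # a single frame containing the full text
```

Both approaches end with the same text. The streaming version just gives the user something to read after the first chunk instead of after the last one.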

Analogy

Think of it like watching water fill a glass through a faucet vs. someone handing you a full glass. With streaming, you see the water level rise continuously (responsive feedback). Without streaming, you wait for someone to fill the entire glass in a back room, then hand it to you all at once (higher latency perception).

Code

python

import anthropic
from langgraph.graph import StateGraph, START, END
from typing import TypedDict


class State(TypedDict):
    query: str
    response: str


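def stream_llm_node(state: State):
    # Reads the API key from the ANTHROPIC_API_KEY environment variable.
    client = anthropic.Anthropic()
    response_text = ""

    # messages.stream() is a context manager; the connection closes on exit.
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": state["query"]
            }
        ]
    ) as stream:
        # text_stream yields text chunks as the model generates them.
        for text in stream.text_stream:
            response_text += text
            print(f"Token: {repr(text)}", flush=True)

    # The graph only sees this final state update, not the individual chunks.
    return {"response": response_text}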


graph = StateGraph(State)
graph.add_node("llm", stream_llm_node)
graph.add_edge(START, "llm")
graph.add_edge("llm", END)

compiled_graph = graph.compile()

result = compiled_graph.invoke(
    {"query": "Explain quantum computing in one sentence."}
)

print("\n--- Final Response ---")
print(result["response"])

Output

Token: ' '
Token: 'Quantum'
Token: ' '
Token: 'computing'
Token: ' '
Token: 'harnesses'
Token: ' '
Token: 'the'
Token: ' '
Token: 'principles'
Token: ' '
Token: 'of'
Token: ' '
Token: 'quantum'
Token: ' '
Token: 'mechanics'
Token: ','
Token: ' '
Token: 'such'
Token: ' '
Token: 'as'
Token: ' '
Token: 'superposition'
Token: ' '
Token: 'and'
Token: ' '
Token: 'entanglement'
Token: ','
Token: ' '
Token: 'to'
Token: ' '
Token: 'perform'
Token: ' '
Token: 'computations'
Token: ' '
Token: 'exponentially'
Token: ' '
Token: 'faster'
Token: ' '
Token: 'than'
Token: ' '
Token: 'classical'
Token: ' '
Token: 'computers'
Token: '.'

--- Final Response ---
 Quantum computing harnesses the principles of quantum mechanics, such as superposition and entanglement, to perform computations exponentially faster than classical computers.

What just happened?

The code created a LangGraph state graph with one LLM node. The node used Anthropic's streaming context manager (`client.messages.stream()`) to receive text chunks from the Claude model as they were generated. Each chunk was printed immediately as it arrived and accumulated into a final response string. Chunk boundaries are decided by the API and may not line up with single tokens or words, so your output may be split differently. The graph compiled and invoked normally, and the streaming happened inside the node function: the graph itself doesn't need special streaming configuration. Anyone watching the process's stdout sees each chunk appear in real time via the print statements, while the caller of invoke() still receives only the final state.

Common gotcha

Developers often assume they need graph.stream(stream_mode="updates") to get token-level granularity. In reality, stream_mode="updates" emits state updates at the node level, not the token level. Token-level streaming happens inside the node function itself: you handle the stream within your LLM call, not at the graph layer. The print statements in the example only write to the server's stdout. If you want those tokens to reach a web client in real time, the node has to forward them to the caller, for example through the graph's async streaming API (astream() or astream_events()) or a custom callback inside the node.

Error recovery

AttributeError: 'APIResponse' object has no attribute 'text_stream'

You're iterating over a non-streaming response. Make sure you call `client.messages.stream()` as a context manager and read `text_stream` from the object it yields. The object returned by `client.messages.create()` has no `text_stream` attribute.

TypeError: 'NoneType' object is not iterable

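Something you are iterating over is None. Check that any helper wrapping the LLM call actually returns the stream or text it produces. An invalid API key or an unknown model name typically surfaces as an API error raised by the SDK rather than as None, so if you see one of those instead, verify your API credentials and that the model name matches your provider's available models.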

KeyError when accessing state["response"]

This usually means the node never returned a "response" key, so it was never written to state. Ensure your node function returns a dictionary whose keys are declared in your State TypedDict, for example {"response": value}.

Experienced dev note

A common misconception: beginners think 'streaming' at the graph level means token streaming. It doesn't. graph.stream() returns node-level updates. Token streaming requires an LLM client that supports it (most modern ones do), and you handle it in the node function itself. If you need tokens to reach an HTTP client in real time, combine this with async nodes and astream_events() on the graph. For production chat systems, pair streaming with error recovery: if an LLM call fails mid-stream, you have already sent partial output, so you need a retry strategy that resumes or restarts gracefully. In practice, the harder failures are often not about individual tokens but about incomplete responses that reached the user before they could be validated.
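As one possible shape for a restart strategy, here is a standard-library sketch. It is not a LangGraph or Anthropic feature: StreamInterrupted, stream_with_restart, and the ("chunk" / "reset" / "done") event names are all hypothetical, chosen only to show how a client could be told to discard partial text when a stream fails and the generation starts over.

```python
from typing import Callable, Iterator, List, Tuple

Event = Tuple[str, str]


class StreamInterrupted(Exception):
    """Hypothetical error raised when the model stream drops mid-response."""


def stream_with_restart(
    open_stream: Callable[[], Iterator[str]],
    emit: Callable[[Event], None],
    max_attempts: int = 3,
) -> str:
    """Restart the whole generation on failure, telling the client to drop partial text."""
    for _ in range(max_attempts):
        text = ""
        try:
            for chunk in open_stream():
                text += chunk
                emit(("chunk", chunk))
            emit(("done", text))
            return text
        except StreamInterrupted:
            # The client has already rendered `text`; signal that it is void.
            emit(("reset", text))
    raise RuntimeError(f"stream failed after {max_attempts} attempts")


# Demo: a fake stream that fails on the first attempt, then succeeds.
attempts = {"count": 0}


def flaky_stream() -> Iterator[str]:
    attempts["count"] += 1
    yield "Hello"
    if attempts["count"] == 1:
        raise StreamInterrupted("connection dropped")
    yield ", world"


sent: List[Event] = []
final = stream_with_restart(flaky_stream, sent.append)
print(sent)
print(final)
```

Restarting from scratch is the simplest policy. Resuming from the partial text is also possible but depends on how your provider handles continuation, which this sketch does not attempt.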

Check your understanding

If a user is watching a chat interface and sees tokens appear one by one, where is the actual streaming happening: at the graph layer, the node layer, or the client layer? What would change in your code if you needed the graph itself to yield intermediate state updates between nodes (not just tokens within a node)?

Show answer hint

Streaming happens inside the node function (at the LLM library level), not at the graph layer. Streaming tokens from an LLM and streaming graph state updates are different concerns. To get graph-level state updates between nodes, call graph.stream() with stream_mode="updates" instead of invoke(); each node's returned update is then yielded as that node completes. To observe in-progress execution inside a node, use `astream_events()` or forward partial output from the node yourself.

VERSION In langgraph < 0.2.0, streaming was less integrated with state updates. Current 0.2.x has stable support for both sync streaming (shown here) and async streaming via `astream_events()`. If you're on 0.1.x, token streaming still works at the node level but state event streaming is less mature.

Once tokens stream from a node, you'll need to handle errors gracefully mid-stream and manage partial state recovery. The next step is implementing error handling within streaming nodes and checkpointing partial responses.
