Replay: re-running from a checkpoint
Why this matters
In production, a graph may run for hours or fail partway through. Replaying from a checkpoint lets you resume without paying again for work that already finished, which matters for expensive LLM calls and long-running tasks. It also lets you inspect the state at a specific step when debugging.
Explanation
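Replay means resuming graph execution from a stored checkpoint, a frozen snapshot of the graph's state at a specific step, rather than starting fresh from the input. A checkpointer such as MemorySaver() saves state after each step of the graph completes (in a linear graph, after each node). You read the latest checkpoint for a thread with graph.get_state(config), which returns a snapshot whose .values holds the saved state and whose .next names the nodes that would run next. To resume, call graph.invoke(None, config={'configurable': {'thread_id': 'id'}}) on the existing thread: passing None as the input tells LangGraph to continue from the last saved checkpoint, skipping already-completed nodes. Passing a new input instead starts a new run from START on top of the saved state. If the saved state needs correcting first, patch it with graph.update_state(config, values) before resuming. This is essential for long pipelines: if node 3 crashes, you replay from node 3's input without re-running nodes 1–2.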
Analogy
Like pausing a video game, saving your exact position and inventory, then loading from that save 3 days later. You don't replay the entire game: you resume from the checkpoint. If you died at a boss, you load the save and retry only the boss fight.
Code
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    messages: list[str]
    step: int


def node_a(state: State) -> State:
    print(f"Running node_a, step={state['step']}")
    return {"messages": state["messages"] + ["A completed"], "step": state["step"] + 1}


def node_b(state: State) -> State:
    print(f"Running node_b, step={state['step']}")
    # Simulated bug: fails whenever it sees step == 2.
    if state["step"] == 2:
        raise ValueError("Intentional error at node_b")
    return {"messages": state["messages"] + ["B completed"], "step": state["step"] + 1}


def node_c(state: State) -> State:
    print(f"Running node_c, step={state['step']}")
    return {"messages": state["messages"] + ["C completed"], "step": state["step"] + 1}


# Linear graph: START -> a -> b -> c -> END
graph = StateGraph(State)
graph.add_node("a", node_a)
graph.add_node("b", node_b)
graph.add_node("c", node_c)
graph.add_edge(START, "a")
graph.add_edge("a", "b")
graph.add_edge("b", "c")
graph.add_edge("c", END)

# The checkpointer is what makes replay possible.
checkpoint_saver = MemorySaver()
compiled_graph = graph.compile(checkpointer=checkpoint_saver)

initial_state = {"messages": [], "step": 1}

print("=== First run: will fail at node_b ===")
try:
    result = compiled_graph.invoke(
        initial_state,
        config={"configurable": {"thread_id": "thread_1"}}
    )
except ValueError as e:
    print(f"Error caught: {e}")

print("\n=== Inspect checkpoint after failure ===")
state_snapshot = compiled_graph.get_state({"configurable": {"thread_id": "thread_1"}})
print(f"Checkpoint state: {state_snapshot.values}")
print(f"Next nodes to execute: {state_snapshot.next}")
print("\n=== Replay from checkpoint (simulating fix) ===")
print("(In production, you'd fix the bug, then replay)")
print("Resuming from thread_id='thread_1'...")
# Patch the saved state so node_b no longer trips its error condition.
# as_node="a" records the update as if node_a had written it, so node_b stays next.
compiled_graph.update_state(
    {"configurable": {"thread_id": "thread_1"}},
    {"step": 3},
    as_node="a",
)
# Passing None as the input resumes from the latest checkpoint
# instead of starting a new run from START.
result = compiled_graph.invoke(
    None,
    config={"configurable": {"thread_id": "thread_1"}}
)
print(f"Final result: {result}")
Output
=== First run: will fail at node_b ===
Running node_a, step=1
Running node_b, step=2
Error caught: Intentional error at node_b

=== Inspect checkpoint after failure ===
Checkpoint state: {'messages': ['A completed'], 'step': 2}
Next nodes to execute: ('b',)

=== Replay from checkpoint (simulating fix) ===
(In production, you'd fix the bug, then replay)
Resuming from thread_id='thread_1'...
Running node_b, step=3
Running node_c, step=4
Final result: {'messages': ['A completed', 'B completed', 'C completed'], 'step': 5}
What just happened?
The first run completed node_a but failed in node_b, leaving the checkpoint saved after node_a as the latest one. get_state() retrieved that checkpoint, showing 'A completed' and step=2, plus the node that was about to run (node_b). We then patched the saved state with update_state() (step=3) and called invoke(None, ...) with the same thread_id, which resumed from the saved point and executed only node_b and node_c without re-running node_a. Passing a state dict as the input instead would have started a new run from START and executed node_a again.
Common gotcha
Developers assume that calling invoke() with the same thread_id and the original input automatically replays from the failure. It doesn't: a non-None input starts a new run from START, with that input applied on top of the saved state. To resume, pass None as the input. Even then, execution continues from the last successfully saved checkpoint, and a checkpoint is written only after a step completes. If a node crashes mid-execution, the latest checkpoint is from before that node ran, so the whole node runs again. Inspect get_state().next to see which node will run next, and fix the state or the bug before resuming. Resuming with nothing changed will hit the same error.
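The inspection step described above can be sketched as a small helper that decides what a resume would do before you attempt it. This is only an illustration: it assumes nothing beyond a snapshot object exposing .values and .next, as the snapshot returned by get_state() does in the example above. The names resume_plan and FakeSnapshot are hypothetical, and FakeSnapshot is a stand-in used so the sketch runs without a graph.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FakeSnapshot:
    """Hypothetical stand-in exposing the two fields this helper reads."""
    values: dict
    next: tuple


def resume_plan(snapshot: Optional[Any]) -> str:
    """Summarise what resuming from this snapshot would do."""
    # No snapshot, or a snapshot with no saved values: nothing to resume from.
    if snapshot is None or not snapshot.values:
        return "no-checkpoint"
    # Saved values but no pending nodes: the run already finished.
    if not snapshot.next:
        return "finished"
    # Otherwise these are the nodes that would run on resume.
    return "resume:" + ",".join(snapshot.next)


failed_run = FakeSnapshot({"messages": ["A completed"], "step": 2}, ("b",))
print(resume_plan(failed_run))
print(resume_plan(FakeSnapshot({}, ())))
```

In real code you would pass the object returned by get_state() instead of FakeSnapshot, and only resume when the plan names the node you expect.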
Error recovery
KeyError on thread_id
AttributeError: 'NoneType' object has no attribute 'values'
ValueError during node re-execution
Experienced dev note
Replay is not 'resume' in the async sense; it is re-execution from a saved state snapshot. Every node that runs after the checkpoint executes again, including a node that failed partway through, so any side effects in those nodes (API calls, database writes) are repeated. Keep replayed nodes idempotent, or isolate side effects in a separate, checkpointed 'write' node that runs after your main logic. In production, tie thread_ids to user sessions or request IDs so you can replay a specific user's workflow. Also, never assume a checkpoint exists: before resuming, check that get_state() returned a snapshot that actually contains values. For a thread that has never run, the snapshot may simply be empty (no values, nothing in next) rather than an error, so handle that case explicitly.
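One way to keep a side-effecting node safe under replay is an idempotency key, sketched below. This is a minimal illustration under stated assumptions, not a LangGraph feature: the names notify_node, _completed, and delivered are hypothetical, the state is assumed to carry its own thread_id, and the ledger is an in-memory set where a real system would need a durable store.

```python
from typing import TypedDict


class State(TypedDict):
    thread_id: str
    step: int


# Hypothetical ledger of side effects already performed.
# In-memory here for illustration; a real ledger must survive restarts.
_completed: set = set()
# Stand-in for an external system (email service, database, ...).
delivered: list = []


def notify_node(state: State) -> dict:
    """Perform the side effect at most once per (thread, step)."""
    key = f"{state['thread_id']}:{state['step']}:notify"
    if key not in _completed:
        delivered.append(f"notice for {state['thread_id']}")
        # Note: a crash between the side effect and this line would still
        # allow one duplicate; a real store should record both atomically.
        _completed.add(key)
    return {"step": state["step"] + 1}


saved_state: State = {"thread_id": "thread_1", "step": 2}
first = notify_node(saved_state)
second = notify_node(saved_state)  # simulates a replay from the same checkpoint
print(len(delivered), first, second)
```

Because the key is derived from the checkpointed state, a replay from the same checkpoint produces the same key and skips the repeated write, while the node still returns the same state update.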
Check your understanding
If a graph fails at node_b after successfully running node_a, and you resume with invoke(None, ...) on the same thread_id without changing the saved state or fixing the bug, which nodes will execute and why? What changes if you pass the original input again instead of None?
Answer hint
The answer must address (1) that resuming with None re-runs only node_b (and the nodes after it), not node_a, because the latest checkpoint was saved after node_a completed; (2) that the same error recurs at node_b unless you modified the saved state or fixed the bug; and (3) that passing the original input again starts a new run from START, so node_a executes again.