SqliteSaver: persistent checkpointing across restarts
Why this matters
In production, graphs often run long enough to time out, crash, or need pausing. Without persistent checkpointing, you restart from zero. SqliteSaver lets you resume from the exact state where you left off, which is critical for multi-step agentic workflows, batch processing, and cost-sensitive LLM applications.
Explanation
What it is: SqliteSaver is a LangGraph checkpointer that writes graph state snapshots to a local SQLite database. Each node execution creates a checkpoint: a frozen copy of the entire graph state at that moment.
How it works mechanically: When you compile a graph with a checkpointer, for example graph.compile(checkpointer=SqliteSaver(conn)) where conn is a sqlite3 connection, every .invoke() or .stream() call that carries a thread_id saves a checkpoint after each step completes. If the process crashes, you call graph.invoke(None, config={"configurable": {"thread_id": "xyz"}}) with the same thread_id, and the graph loads the last checkpoint from disk and resumes from there, not from the beginning.
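The checkpoint-and-resume mechanism can be sketched with nothing but the standard library. This is a toy model, not LangGraph's implementation: the table layout, the run function, and its crash_before parameter are all made up for illustration. It only shows the idea of saving state after each successful step, keyed by thread_id, and picking up from the saved position later.

```python
import json
import sqlite3

# Toy model only: this table layout is invented for illustration and is NOT
# the schema that LangGraph's SqliteSaver uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoints ("
    "thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
)


def step_a(state):
    return {"value": state["value"] + 10, "steps": state["steps"] + ["a"]}


def step_b(state):
    return {"value": state["value"] * 2, "steps": state["steps"] + ["b"]}


def step_c(state):
    return {"value": state["value"] - 5, "steps": state["steps"] + ["c"]}


STEPS = [step_a, step_b, step_c]


def run(thread_id, initial_state=None, crash_before=None):
    """Run the steps in order, saving a checkpoint after each one succeeds.

    initial_state=None means "resume from the saved checkpoint".
    crash_before is a hypothetical test hook that raises before that step index.
    """
    if initial_state is None:
        row = conn.execute(
            "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
            (thread_id,),
        ).fetchone()
        if row is None:
            raise ValueError(f"no checkpoint for thread_id {thread_id!r}")
        next_step, state = row[0], json.loads(row[1])
    else:
        next_step, state = 0, initial_state

    for index in range(next_step, len(STEPS)):
        if index == crash_before:
            raise RuntimeError("simulated crash")
        state = STEPS[index](state)
        # The checkpoint is written only after the step has succeeded.
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, index + 1, json.dumps(state)),
        )
        conn.commit()
    return state


# First attempt: steps a and b succeed, then the run dies before step c.
try:
    run("job-1", {"value": 5, "steps": []}, crash_before=2)
except RuntimeError:
    pass

# Resume: same thread_id, no new input. Only step c runs.
resumed = run("job-1")
print(resumed)  # {'value': 25, 'steps': ['a', 'b', 'c']}
```

Steps a and b each appear once in the final steps list, which shows that the resume did not repeat them.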
When to use it: Use SqliteSaver for any production graph that runs long tasks, needs to retry after failures, or costs money per step (like token-based LLM calls). One local SQLite write per step is usually cheap compared to repeating the step it protects.
Analogy
Think of SqliteSaver like a save-game in a video game. Before each boss fight (node), the game writes your exact HP, inventory, and position to disk. If your console crashes mid-fight, you restart from that checkpoint instead of from level 1.
Code
import sqlite3

from typing_extensions import TypedDict

# SqliteSaver ships in the separate langgraph-checkpoint-sqlite package.
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    value: int
    steps: list[str]


# Each node returns a new state dict instead of mutating its input in place.
def node_a(state: State) -> State:
    return {"value": state["value"] + 10, "steps": state["steps"] + ["node_a"]}


def node_b(state: State) -> State:
    return {"value": state["value"] * 2, "steps": state["steps"] + ["node_b"]}


def node_c(state: State) -> State:
    return {"value": state["value"] - 5, "steps": state["steps"] + ["node_c"]}


# Linear graph: START -> a -> b -> c -> END
graph = StateGraph(State)
graph.add_node("a", node_a)
graph.add_node("b", node_b)
graph.add_node("c", node_c)
graph.add_edge(START, "a")
graph.add_edge("a", "b")
graph.add_edge("b", "c")
graph.add_edge("c", END)

# SqliteSaver wraps a sqlite3 connection. ":memory:" keeps this demo
# self-contained but does not survive a process restart; connect to a
# database file instead when you need real persistence.
conn = sqlite3.connect(":memory:", check_same_thread=False)
checkpointer = SqliteSaver(conn)
compiled_graph = graph.compile(checkpointer=checkpointer)

thread_id = "user_123_session"
initial_state = {"value": 5, "steps": []}

print("=== First run (full execution) ===")
result_1 = compiled_graph.invoke(
    initial_state,
    config={"configurable": {"thread_id": thread_id}},
)
print(f"After run 1: value={result_1['value']}, steps={result_1['steps']}")

print("\n=== Simulated crash and resume ===")
print("[Imagine the process crashed here]")
print("[Now we resume with the same thread_id]")
# Passing None as the input means "continue from the saved checkpoint".
result_2 = compiled_graph.invoke(
    None,
    config={"configurable": {"thread_id": thread_id}},
)
print(f"After resume: value={result_2['value']}, steps={result_2['steps']}")

print("\n=== Check: are they identical? ===")
print(f"Results match: {result_1 == result_2}")

Output:

=== First run (full execution) ===
After run 1: value=25, steps=['node_a', 'node_b', 'node_c']

=== Simulated crash and resume ===
[Imagine the process crashed here]
[Now we resume with the same thread_id]
After resume: value=25, steps=['node_a', 'node_b', 'node_c']

=== Check: are they identical? ===
Results match: True
What just happened?
The code created a three-node graph, compiled it with an in-memory SQLite checkpointer, and ran it with a thread_id. The first invoke() executed all three nodes (a→b→c) and saved a checkpoint after each step. When we called invoke(None, ...) with the same thread_id, the graph didn't re-run: the thread was already complete, so it loaded the final checkpoint and returned the identical result. (In a real scenario with an actual crash mid-execution, the second invoke would resume from the last checkpoint, not the beginning.) Note that an in-memory database is only suitable for a demo like this one: it disappears when the process exits, so surviving a real restart requires a database file on disk.
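The difference between an in-memory database and a file-backed one is easy to check with plain sqlite3, independent of LangGraph. The sketch below is only an illustration: the notes table, the write_then_reopen helper, and the checkpoints.db filename are hypothetical, and closing the connection stands in for the process exiting.

```python
import os
import sqlite3
import tempfile


def write_then_reopen(db_path):
    """Write one row, close the connection, reopen, and count surviving rows."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
    conn.execute("INSERT INTO notes VALUES ('saved')")
    conn.commit()
    conn.close()  # stands in for the process exiting

    # A fresh connection plays the role of the restarted process.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
    count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
    conn.close()
    return count


# ":memory:" creates a brand-new empty database on every connect.
memory_rows = write_then_reopen(":memory:")

# A file path points both connections at the same on-disk database.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_rows = write_then_reopen(os.path.join(tmp_dir, "checkpoints.db"))

print(memory_rows, file_rows)  # 0 1
```

The in-memory row is gone after reconnecting, while the file-backed row survives, which is the property a checkpointer needs to resume after a restart.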
Common gotcha
Passing None as input on resume doesn't mean 'no input'; it means 'continue from the state in the checkpoint.' If you pass new input data on resume, the graph applies it on top of the saved state and starts again from the entry point instead of continuing the interrupted run, so you lose your recovery guarantee. Always pass None as the input when resuming. Also, forgetting to set the same thread_id creates a new execution thread: you'll have parallel histories and won't actually resume.
Error recovery
ValueError: Invalid thread_id
FileNotFoundError: No such file or directory
sqlite3.OperationalError: database is locked
Experienced dev note
A senior engineer would tell you: SqliteSaver is not about being fancy; it's about respecting the reality that production systems fail. The cost of disk I/O is tiny compared to re-running expensive operations (LLM calls, database queries, long computations). Also, use thread_id intelligently: for user-facing systems, thread_id = user_id; for batch jobs, thread_id = job_id; for agents, thread_id = conversation_id. This makes debugging and recovery straightforward. One more thing: test your resume path in dev. It's easy to ship code that handles a crash in theory but has never actually exercised recovery, and then you discover the gap at 3am in production.
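One way to keep thread_id values consistent is to build every config through a single helper. The sketch below is an assumption-laden illustration: make_thread_config and the "kind:identifier" prefix scheme are hypothetical conventions, not part of LangGraph. Only the {"configurable": {"thread_id": ...}} shape comes from the examples in this lesson.

```python
def make_thread_config(kind: str, identifier: str) -> dict:
    """Build an invoke() config with a deterministic thread_id.

    The "kind:identifier" format is a hypothetical naming convention; the
    prefix keeps a user id and a job id with the same value from colliding.
    """
    if not identifier:
        raise ValueError("identifier must be a non-empty string")
    return {"configurable": {"thread_id": f"{kind}:{identifier}"}}


user_config = make_thread_config("user", "123")
job_config = make_thread_config("job", "123")

print(user_config)  # {'configurable': {'thread_id': 'user:123'}}
print(job_config)   # {'configurable': {'thread_id': 'job:123'}}
```

Because the thread_id is derived only from stable identifiers, a restarted process that knows the user or job can rebuild the same config and find the existing checkpoints.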
Check your understanding
If a graph with SqliteSaver crashes after node_b completes but before node_c starts, and you resume with the same thread_id, which nodes re-execute and why?
Answer hint
Only node_c runs on resume, because a checkpoint was saved after node_b completed. node_a and node_b do not re-execute: the graph resumes from that checkpoint, not from START. The key insight is that checkpoints are written *after* each node succeeds, not before.