
SqliteSaver: persistent checkpointing across restarts

What you will learn

SqliteSaver persists graph execution state to disk so you can resume interrupted workflows without losing progress.

Why this matters

In production, graphs often run long enough to time out, crash, or need pausing. Without persistent checkpointing, you restart from zero. SqliteSaver lets you resume from the last saved state, which matters for multi-step agentic workflows, batch processing, and cost-sensitive LLM applications.

Skip if: your graph completes in under 5 seconds, you have stateless request-response patterns (like API endpoints), or you're building a prototype and don't care about recovery. Use in-memory checkpointing (MemorySaver) for development and short-lived processes.

Explanation

What it is: SqliteSaver is a LangGraph checkpointer that writes graph state snapshots to a local SQLite database. Each node execution creates a checkpoint: a frozen copy of the entire graph state at that moment.

How it works mechanically: When you compile a graph with checkpointer=SqliteSaver(conn), where conn is a sqlite3 connection to a database file, every .invoke() or .stream() call saves a checkpoint as each step of the graph completes. If the process crashes, you call graph.invoke(None, config={"configurable": {"thread_id": "xyz"}}) with the same thread_id, and the graph loads the last checkpoint from disk and resumes from there, not from the beginning.
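The save-then-resume mechanic described above can be sketched with nothing but the standard library. This is a toy model, not LangGraph's implementation: the checkpoints table, the run_with_checkpoints helper, and the fail_at parameter are all hypothetical names invented for illustration.

```python
import json
import sqlite3

def run_with_checkpoints(conn, thread_id, steps, initial_state, fail_at=None):
    """Run steps in order, saving state after each one; resume if a checkpoint exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
    )
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    # Resume from the saved position if this thread has run before.
    if row:
        next_step, state = row[0], json.loads(row[1])
    else:
        next_step, state = 0, initial_state

    for index in range(next_step, len(steps)):
        if index == fail_at:
            raise RuntimeError(f"simulated crash before step {index}")
        state = steps[index](state)
        # Write the checkpoint only after the step has succeeded.
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, index + 1, json.dumps(state)),
        )
        conn.commit()
    return state

# Three toy "nodes" operating on a single integer.
steps = [lambda s: s + 10, lambda s: s * 2, lambda s: s - 5]
conn = sqlite3.connect(":memory:")

try:
    run_with_checkpoints(conn, "job-1", steps, 5, fail_at=2)
except RuntimeError as exc:
    print(exc)

# Same thread_id: only the third step runs, starting from the saved state.
resumed = run_with_checkpoints(conn, "job-1", steps, 5)
print(resumed)
```

The key design point is the ordering: the row is written after a step returns, so a crash inside a step leaves the previous checkpoint intact and that step is simply retried on resume.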

When to use it: Use SqliteSaver for any production graph that runs long tasks, needs to retry after failures, or costs money per step (like token-based LLM calls). The overhead of a small SQLite write per step is usually minor compared to the cost of repeating completed work.

Analogy

Think of SqliteSaver like a save-game in a video game. Before each boss fight (node), the game writes your exact HP, inventory, and position to disk. If your console crashes mid-fight, you restart from that checkpoint instead of from level 1.

Code

python

import sqlite3
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.sqlite import SqliteSaver
from typing_extensions import TypedDict

class State(TypedDict):
    value: int
    steps: list[str]

# Each node returns a new partial update instead of mutating the input state.
def node_a(state: State) -> dict:
    return {"value": state["value"] + 10, "steps": state["steps"] + ["node_a"]}

def node_b(state: State) -> dict:
    return {"value": state["value"] * 2, "steps": state["steps"] + ["node_b"]}

def node_c(state: State) -> dict:
    return {"value": state["value"] - 5, "steps": state["steps"] + ["node_c"]}

graph = StateGraph(State)
graph.add_node("a", node_a)
graph.add_node("b", node_b)
graph.add_node("c", node_c)

graph.add_edge(START, "a")
graph.add_edge("a", "b")
graph.add_edge("b", "c")
graph.add_edge("c", END)

# In-memory database for the demo; pass a file path such as "checkpoints.db"
# to sqlite3.connect() to persist checkpoints across process restarts.
# check_same_thread=False lets the checkpointer use the connection from other threads.
conn = sqlite3.connect(":memory:", check_same_thread=False)
checkpointer = SqliteSaver(conn)
compiled_graph = graph.compile(checkpointer=checkpointer)

thread_id = "user_123_session"
initial_state = {"value": 5, "steps": []}

print("=== First run (full execution) ===")
result_1 = compiled_graph.invoke(
    initial_state,
    config={"configurable": {"thread_id": thread_id}}
)
print(f"After run 1: value={result_1['value']}, steps={result_1['steps']}")

print("\n=== Simulated crash and resume ===")
print("[Imagine the process crashed here]")
print("[Now we resume with the same thread_id]")

# Passing None as input means "continue from the saved checkpoint".
result_2 = compiled_graph.invoke(
    None,
    config={"configurable": {"thread_id": thread_id}}
)
print(f"After resume: value={result_2['value']}, steps={result_2['steps']}")

print("\n=== Check: are they identical? ===")
print(f"Results match: {result_1 == result_2}")

Output

=== First run (full execution) ===
After run 1: value=25, steps=['node_a', 'node_b', 'node_c']

=== Simulated crash and resume ===
[Imagine the process crashed here]
[Now we resume with the same thread_id]
After resume: value=25, steps=['node_a', 'node_b', 'node_c']

=== Check: are they identical? ===
Results match: True

What just happened?

The code created a three-node graph, compiled it with an in-memory SQLite checkpointer, and ran it with a thread_id. The first invoke() executed all three nodes (a→b→c) and saved checkpoints after each step. When we called invoke(None, ...) with the same thread_id, the graph didn't re-run: it resumed from the final checkpoint and returned the identical result because it was already complete. (In a real scenario with an actual crash mid-execution, the second invoke would resume from the last checkpoint, not the beginning.)

Common gotcha

Passing None as input on resume doesn't mean 'no input': it means 'continue from the state in the checkpoint.' If you pass new input data instead, the graph treats it as a fresh run on that thread and writes the new values over the recovered state, so you lose the progress you meant to recover. Always pass None as the input when resuming. Also, forgetting to set the same thread_id creates a new execution thread: you'll have parallel histories and won't actually resume.

Error recovery

ValueError: Invalid thread_id

thread_id must be a string. Use a consistent, meaningful identifier such as a user ID or a session UUID that you store. Don't generate a fresh random UUID on every start, because after a restart you won't know which thread to resume.
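One way to get a thread_id that is stable across restarts is to derive it from an identifier you already have. The sketch below uses uuid.uuid5 from the standard library; the namespace value and the thread_id_for helper are assumptions for illustration, not part of LangGraph.

```python
import uuid

# Hypothetical application namespace; any fixed UUID works as long as it never changes.
THREAD_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "checkpoints.example.com")

def thread_id_for(kind: str, identifier: str) -> str:
    """Derive a deterministic thread_id string from a stable business identifier."""
    return str(uuid.uuid5(THREAD_NAMESPACE, f"{kind}:{identifier}"))

# The same inputs always produce the same thread_id, even in a new process.
config = {"configurable": {"thread_id": thread_id_for("job", "nightly-report")}}
print(config["configurable"]["thread_id"])
```

Using the raw identifier (for example "job:nightly-report") as the thread_id works just as well; the UUID form is only useful when you want a uniform, opaque key.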

sqlite3.OperationalError: unable to open database file

The directory that should contain the database file doesn't exist. SQLite creates the file itself but not its parent directories. For testing, use an in-memory database (:memory:); otherwise make sure the directory exists before opening a file path like '/tmp/checkpoints.db'.
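A small helper can create the parent directory before opening the database. This is a minimal sketch using only the standard library; open_checkpoint_db is a hypothetical name, and the temporary directory stands in for wherever you keep checkpoints.

```python
import sqlite3
import tempfile
from pathlib import Path

def open_checkpoint_db(db_path) -> sqlite3.Connection:
    """Create missing parent directories, then open the SQLite file."""
    path = Path(db_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # check_same_thread=False so the connection can be shared across threads.
    return sqlite3.connect(path, check_same_thread=False)

# Demo path inside a fresh temporary directory; "nested/dir" does not exist yet.
demo_path = Path(tempfile.mkdtemp()) / "nested" / "dir" / "checkpoints.db"
conn = open_checkpoint_db(demo_path)
print(conn.execute("SELECT 1").fetchone())
```

The returned connection is a plain sqlite3.Connection, so it can be handed to SqliteSaver(conn) in the same way as in the main example.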

sqlite3.OperationalError: database is locked

Another process or connection is holding a write lock on the database. In production, share a single long-lived connection (or a connection pool) instead of constructing a new SqliteSaver for every request. For testing, use :memory:, which is private to the connection that opened it.
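The following sketch shows one common way to reduce lock errors with plain sqlite3: open one connection at startup, give it a busy timeout, and enable write-ahead logging. The helper name and the 30-second timeout are assumptions; tune them for your workload.

```python
import sqlite3
import tempfile
from pathlib import Path

def open_shared_connection(db_path, busy_timeout_s: float = 30.0) -> sqlite3.Connection:
    """Open one long-lived connection that waits for locks instead of failing at once."""
    conn = sqlite3.connect(
        db_path,
        timeout=busy_timeout_s,   # how long to wait for a lock before raising
        check_same_thread=False,  # allow use from worker threads
    )
    # WAL mode lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn

# Create the connection once (for example at application startup) and reuse it.
db_file = Path(tempfile.mkdtemp()) / "checkpoints.db"
shared_conn = open_shared_connection(db_file)
journal_mode = shared_conn.execute("PRAGMA journal_mode").fetchone()[0]
print(journal_mode)
```

WAL mode needs a file-backed database on a local filesystem; it does not apply to :memory: databases and is unreliable on network shares.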

Experienced dev note

A senior engineer would tell you that SqliteSaver is not about being fancy; it's about respecting the reality that production systems fail. The cost of disk I/O is tiny compared to re-running expensive operations (LLM calls, database queries, long computations). Also, choose thread_id deliberately: for user-facing systems, thread_id = user_id; for batch jobs, thread_id = job_id; for agents, thread_id = conversation_id. This makes debugging and recovery straightforward. One more thing: test your resume path in dev. It's easy to ship code whose recovery path has never been exercised, and then you discover the gap at 3am in production.

Check your understanding

If a graph with SqliteSaver crashes after node_b completes but before node_c starts, and you resume with the same thread_id, which nodes run on resume and why?

Answer hint

Only node_c runs. Nodes a and b are not re-executed because a checkpoint was saved after node_b completed, and the graph resumes from that checkpoint, not from START. The key insight is that checkpoints are written *after* each node succeeds, not before.

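VERSION Since langgraph 0.2, SqliteSaver is distributed in a separate package, langgraph-checkpoint-sqlite, which you install alongside langgraph; the import path langgraph.checkpoint.sqlite stays the same. In those releases SqliteSaver.from_conn_string() is a context manager (use it in a with block), while SqliteSaver(conn) accepts an existing sqlite3 connection. Check the release notes for the version you have installed before relying on either form.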

Learn how to build multi-turn conversations with persistent state using thread_id and checkpointing: combining SqliteSaver with message history for stateful agents.
