Code Advanced hard · 8 min

Horizontal scaling: stateless graph servers

What you will learn
Design langgraph applications to run as stateless servers that scale horizontally by externalizing state to persistent storage.

Why this matters

As graph complexity grows and traffic increases, you need to run multiple instances without sharing in-memory state. Stateless servers with external persistence let you scale to thousands of concurrent conversations without any single server becoming a bottleneck; the shared database becomes the component you size and monitor.

Skip if: Do not use stateless external persistence if your application is a single-machine prototype, a local CLI tool, or has sub-millisecond latency requirements where network roundtrips to a database are unacceptable. For these, in-memory state is faster and simpler.

Explanation

What it is: A stateless graph server design where each langgraph instance holds no conversation state in memory. Instead, all state (message history, intermediate values, checkpoints) lives in an external database. This allows you to spawn identical graph servers and route requests to any instance: if one dies, another picks up the conversation seamlessly.

How it works mechanically: You attach a checkpointer backed by persistent storage (PostgreSQL, Redis, etc.) when you compile the graph, and on every invocation you pass a config containing a thread_id. LangGraph retrieves the previous state using that thread ID, processes the new input, saves the updated state back, and returns the output. The server itself keeps no memory of prior interactions. Any server can process a request for any conversation because they all read from and write to the same backend.
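The load, process, save cycle described above can be sketched without LangGraph at all. This is a minimal illustration, not LangGraph's implementation: `FakeStore` and `GraphServer` are hypothetical names, and a dictionary holding JSON strings stands in for the external database.

```python
import json


class FakeStore:
    """Hypothetical stand-in for the external database shared by all servers."""

    def __init__(self):
        self._rows = {}

    def load(self, thread_id):
        raw = self._rows.get(thread_id)
        return json.loads(raw) if raw else {"messages": [], "turn_count": 0}

    def save(self, thread_id, state):
        # Serializing forces a real copy, as a network round trip would.
        self._rows[thread_id] = json.dumps(state)


class GraphServer:
    """Hypothetical server that keeps no conversation state between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, thread_id, user_text):
        state = self.store.load(thread_id)          # 1. retrieve prior state
        state["messages"].append({"role": "user", "content": user_text})
        state["turn_count"] += 1                    # 2. process the new input
        reply = f"{self.name} handled turn {state['turn_count']}"
        state["messages"].append({"role": "assistant", "content": reply})
        self.store.save(thread_id, state)           # 3. write updated state back
        return reply


store = FakeStore()
server_a = GraphServer("server-a", store)
server_b = GraphServer("server-b", store)

first = server_a.handle("conversation_456", "Hello")
second = server_b.handle("conversation_456", "Continue")
print(first)   # server-a handled turn 1
print(second)  # server-b handled turn 2
```

The second request lands on a different server object, yet the turn counter continues from 1 to 2 because the only thing the two servers share is the store.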

When to use it: Use stateless graph servers in production when you need uptime guarantees, auto-scaling based on load, or rolling deployments without dropping conversations. Once a conversation must survive being routed to a different machine, its state has to live outside the server process, and this pattern is the standard way to achieve that.

Analogy

Think of a restaurant with multiple cashiers. Each cashier (graph server) doesn't remember your order history: they check a notebook (database) to see what you've ordered before, process your new request, and write the update back to the notebook. Any cashier can serve you because they all reference the same notebook. If one cashier gets sick, another can take over immediately.

Code

Illustrative only - not runnable without a reachable PostgreSQL database and the langgraph-checkpoint-postgres package installed
python
import operator
from typing import Annotated

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph, START, END
from pydantic import BaseModel


class ConversationState(BaseModel):
    # operator.add is the reducer: incoming messages are appended to the stored list.
    messages: Annotated[list[dict], operator.add]
    user_id: str
    turn_count: int = 0


def node_receive_input(state: ConversationState) -> dict:
    """Receives user input and increments the turn counter."""
    return {"turn_count": state.turn_count + 1}


def node_process(state: ConversationState) -> dict:
    """Simulates processing (in production: call an LLM)."""
    last_msg = state.messages[-1]["content"] if state.messages else ""
    response = f"Turn {state.turn_count}: Processed '{last_msg}'"
    return {"messages": [{"role": "assistant", "content": response}]}


def node_save_context(state: ConversationState) -> dict:
    """Placeholder node: checkpointing is automatic, so this re-emits user_id unchanged."""
    return {"user_id": state.user_id}


builder = StateGraph(ConversationState)
builder.add_node("receive", node_receive_input)
builder.add_node("process", node_process)
builder.add_node("save", node_save_context)

builder.add_edge(START, "receive")
builder.add_edge("receive", "process")
builder.add_edge("process", "save")
builder.add_edge("save", END)

DB_URI = "postgresql://user:password@localhost:5432/langgraph_db"
config = {"configurable": {"thread_id": "conversation_456"}}

# from_conn_string is a context manager that owns the database connection.
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)

    # Turn 1: full initial input. invoke() returns a dict, not a model instance.
    result_1 = graph.invoke(
        {"messages": [{"role": "user", "content": "Hello"}], "user_id": "user_123"},
        config=config,
    )

    print(f"Turn 1 - Messages: {len(result_1['messages'])}")
    print(f"Turn 1 - Turn count: {result_1['turn_count']}")
    print(f"Turn 1 - Last message: {result_1['messages'][-1]['content']}")
    print()

    # Turn 2: send only the new message. History, user_id, and turn_count
    # are loaded from the checkpoint for this thread_id.
    result_2 = graph.invoke(
        {"messages": [{"role": "user", "content": "Continue"}]},
        config=config,
    )

    print(f"Turn 2 - Messages: {len(result_2['messages'])}")
    print(f"Turn 2 - Turn count: {result_2['turn_count']}")
    print(f"Turn 2 - Last message: {result_2['messages'][-1]['content']}")
Output
Turn 1 - Messages: 2
Turn 1 - Turn count: 1
Turn 1 - Last message: Turn 1: Processed 'Hello'

Turn 2 - Messages: 4
Turn 2 - Turn count: 2
Turn 2 - Last message: Turn 2: Processed 'Continue'

What just happened?

The code created a stateless graph with three nodes (receive, process, save) that increments a turn counter and builds a message history. It compiled the graph with PostgresSaver as the checkpointer, meaning all state is persisted to a PostgreSQL database keyed by thread_id. When invoked twice with the same config (same thread_id), the second invocation retrieved the saved state from the database, continued from turn_count=1 to turn_count=2, and appended new messages to the existing message history. Each invocation was independent: the graph server held no memory between calls.

Common gotcha

Developers often forget that the checkpointer is set at compile time, not invocation time. If you compile without a checkpointer and then pass a thread_id in config, LangGraph runs the graph without saving anything: every invocation starts from a blank state, giving you zero persistence despite thinking you're safe. Always verify that graph = builder.compile(checkpointer=your_checkpointer) is in your deployment code, not commented out or conditionally skipped.
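One way to catch this at startup rather than in production is a fail-fast guard. The sketch below assumes the compiled graph exposes its checkpointer as a `checkpointer` attribute; `require_checkpointer` is a hypothetical helper, and `SimpleNamespace` objects stand in for compiled graphs so the example runs without LangGraph installed.

```python
from types import SimpleNamespace


def require_checkpointer(graph):
    """Hypothetical startup guard: refuse to serve a graph that cannot persist state."""
    # Assumption: the compiled graph stores its checkpointer on `.checkpointer`.
    if getattr(graph, "checkpointer", None) is None:
        raise RuntimeError(
            "graph was compiled without a checkpointer; thread state will not persist"
        )
    return graph


# Stand-ins for compiled graphs, used only to exercise the guard.
persistent = SimpleNamespace(checkpointer=object())
ephemeral = SimpleNamespace(checkpointer=None)

require_checkpointer(persistent)  # passes silently

try:
    require_checkpointer(ephemeral)
    guard_tripped = False
except RuntimeError:
    guard_tripped = True

print(guard_tripped)  # True
```

Calling a guard like this once, right after compile(), turns a silent loss of persistence into an immediate deployment failure.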

Error recovery

psycopg.OperationalError
PostgreSQL connection string is wrong or database is unreachable (the Postgres checkpointer is built on psycopg 3, not psycopg2). Fix: verify connection string format (postgresql://user:pass@host:port/db) and that postgres is running. Test with `psql` CLI first.
ValueError: thread_id must be a string
You passed a non-string thread_id in config. Fix: ensure config={"configurable": {"thread_id": str(some_value)}} is always a string.
ValueError: No checkpointer set
Graph was compiled without a checkpointer but code calls a thread-state method such as get_state(). The exact message may differ between versions. Fix: add checkpointer=PostgresSaver(...) or other backend to compile() call.
psycopg.errors.UndefinedTable
PostgreSQL tables for langgraph don't exist. Fix: run `checkpointer.setup()` once during initialization to create schema.

Experienced dev note

Stateless servers feel inefficient at first: you're writing state to disk after every step. But this is a feature, not a bug. It buys you automatic fault tolerance, the freedom to route any request for any thread_id to any server, and instant horizontal scaling. It does not by itself make simultaneous requests for the same thread_id safe; those still need to be serialized or conflict-checked. The real cost isn't the database roundtrip: it's managing a separate database infrastructure. For production, accept this cost upfront; it scales to millions of conversations where in-memory approaches fail entirely. Also: in langgraph 0.2+, checkpointing is not optional if you want any durability: there is no 'session' mode anymore.

Check your understanding

If you have two identical graph servers both serving the same conversation (same thread_id) and both receive a user message simultaneously, how does the system prevent message loss or duplication?

Answer hint

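A correct answer explains that the graph servers do not coordinate with each other: protection has to come from the persistence layer or from the routing in front of the servers. Database transactions make each checkpoint write atomic, but atomic writes alone do not stop two servers from starting at the same saved state. A strong answer therefore names a concrete mechanism, such as a version check where only one write succeeds and the loser re-reads the latest state, a per-thread lock, or a queue that serializes requests per thread_id. Mention that this is also why you can't use a local file-based checkpointer in multi-server deployments.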
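The "one write succeeds, the loser re-reads" idea is a generic optimistic-concurrency technique, sketched below with the standard-library sqlite3 module. It is not a description of how any LangGraph checkpointer behaves internally: the `threads` table, its `version` column, and the helper names are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE threads (thread_id TEXT PRIMARY KEY, version INTEGER, state TEXT)"
)
conn.execute(
    "INSERT INTO threads VALUES (?, ?, ?)",
    ("conversation_456", 1, json.dumps({"messages": ["Hello"]})),
)
conn.commit()


def read_state(thread_id):
    row = conn.execute(
        "SELECT version, state FROM threads WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    return row[0], json.loads(row[1])


def try_write(thread_id, expected_version, state):
    """Compare-and-swap: succeeds only if nobody wrote since we read."""
    cur = conn.execute(
        "UPDATE threads SET version = version + 1, state = ? "
        "WHERE thread_id = ? AND version = ?",
        (json.dumps(state), thread_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1


def append_message(thread_id, text, max_attempts=5):
    """Read, modify, write; on conflict re-read the latest state and retry."""
    for _ in range(max_attempts):
        version, state = read_state(thread_id)
        state["messages"].append(text)
        if try_write(thread_id, version, state):
            return state
    raise RuntimeError("too many conflicting writers")


# Simulate two servers that both read the same starting state.
tid = "conversation_456"
version_a, state_a = read_state(tid)
version_b, state_b = read_state(tid)
state_a["messages"].append("from server A")
state_b["messages"].append("from server B")

a_won = try_write(tid, version_a, state_a)  # first writer wins
b_won = try_write(tid, version_b, state_b)  # stale version, rejected

# The loser re-reads and retries, so neither message is lost.
final_state = append_message(tid, "from server B")
print(a_won, b_won, final_state["messages"])
```

The version column is what prevents the lost update: server B's first write is rejected instead of silently overwriting server A's message.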

VERSION This example uses langgraph 0.2.x patterns: StateGraph with the imported START and END constants, and a checkpointer created with PostgresSaver.from_conn_string, which is used as a context manager. From 0.2, checkpointer backends ship as separately installed packages (PostgresSaver lives in langgraph-checkpoint-postgres). Code written for releases before 0.2.0 used different graph and checkpointer setup and may break.
NEXT

Cross-region replication and conflict resolution: how to sync checkpoint state across multiple databases for true geographic redundancy within langgraph.
