Code Advanced hard · 8 min

Deployment as a LangGraph service

What you will learn
Package and serve a compiled LangGraph as a production API using LangServe, with state persistence and concurrent request handling.

Why this matters

A graph is useless if it only runs on your laptop: you need to deploy it as a service that handles multiple users, maintains state across sessions, and survives restarts. This is the gap between prototype and production.

Skip if: LangServe is a poor fit if you need sub-millisecond latency (for example, high-frequency trading) or if your deployment infrastructure forbids extra Python processes (use native API wrappers instead). Also skip it if your graph is stateless and simple enough to run on AWS Lambda, where the overhead may not justify it.

Explanation

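What it is: LangServe is a framework that wraps a compiled LangGraph (or any LangChain Runnable) as a REST API with built-in support for streaming, async operations, and batch calls. It handles the HTTP and server-sent-events plumbing so you can focus on graph logic. State persistence itself comes from the graph's checkpointer, which is selected through the config sent with each request.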

How it works mechanically: You create a FastAPI application, instantiate your graph, register it with add_routes(), and LangServe automatically exposes endpoints like /invoke (one input, one response), /stream (output streamed as it is produced), and /batch (several inputs in one request). State is selected by the config dictionary sent with each request: the thread_id under configurable tells the checkpointer which thread to load and save, so two users with different thread_id values keep isolated conversation histories.

When to use it: When you need a multi-user graph service with persistent memory, concurrent request handling, and streaming support. It's the production-grade bridge between LangGraph's execution model and HTTP clients.

Analogy

Think of LangServe like converting your laptop into a restaurant. Your graph is the kitchen (the logic). LangServe is the host, waiters, and reservation system: it takes orders (HTTP requests), routes them to the kitchen with metadata (thread_id for which table), and streams back plates (events) as they're ready. Multiple customers (threads) can order simultaneously without interfering.

Code

python
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import MemorySaver
from langchain_core.messages import BaseMessage
from langchain_openai import ChatOpenAI
from fastapi import FastAPI
from langserve import add_routes
import uvicorn

class ChatState(TypedDict):
    # add_messages is a reducer: it appends each node's new messages to the
    # history held by the checkpointer instead of overwriting it.
    messages: Annotated[list[BaseMessage], add_messages]
    user_name: str

# Create the model once at import time (requires OPENAI_API_KEY to be set).
llm = ChatOpenAI(model="gpt-4o-mini")

def chat_node(state: ChatState) -> dict:
    # Send the full history so the model can use earlier turns.
    response = llm.invoke(state["messages"])
    # Return only the new message; the reducer merges it into the history.
    # user_name is left out, so its stored value is kept unchanged.
    return {"messages": [response]}

graph = StateGraph(ChatState)
graph.add_node("chat", chat_node)
graph.add_edge(START, "chat")
graph.add_edge("chat", END)

# MemorySaver keeps checkpoints in process memory only.
compiled_graph = graph.compile(checkpointer=MemorySaver())

app = FastAPI(title="LangGraph Chat Service")

# Expose /chat/invoke, /chat/stream and /chat/batch.
add_routes(
    app,
    compiled_graph.with_types(input_type=ChatState, output_type=ChatState),
    path="/chat"
)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Output
INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete
INFO:     Uvicorn running on http://0.0.0.0:8000

What just happened?

The code defined a stateful chat graph with a MemorySaver checkpointer, compiled it, wrapped it in FastAPI, and exposed it via LangServe under the <code>/chat</code> path. Uvicorn started listening on port 8000. Any client sending a POST to <code>http://localhost:8000/chat/invoke</code> with a config containing <code>thread_id</code> will have its conversation persisted across requests for as long as the server process stays up, since MemorySaver holds state in memory.
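A client call against that endpoint might look like the sketch below, using only the standard library. The `build_invoke_payload` and `invoke` helpers are hypothetical names. The body shape (graph state under `input`, thread selection under `config`) and the `{"type": "human", ...}` message encoding are assumptions about LangServe's request schema; check them against the OpenAPI docs your service generates before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/chat"  # matches path="/chat" on the server


def build_invoke_payload(text: str, user_name: str, thread_id: str) -> dict:
    """Build the JSON body for POST /chat/invoke (assumed schema)."""
    return {
        # Graph input: the ChatState fields.
        "input": {
            "messages": [{"type": "human", "content": text}],
            "user_name": user_name,
        },
        # Per-request config: selects the checkpointer thread.
        "config": {"configurable": {"thread_id": thread_id}},
    }


def invoke(text: str, user_name: str, thread_id: str, base_url: str = BASE_URL) -> dict:
    """POST one message to the running service and return the decoded JSON."""
    body = json.dumps(build_invoke_payload(text, user_name, thread_id)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running, two calls on the same thread share history:
#   invoke("My name is Alice.", "alice", "alice_session_1")
#   invoke("What is my name?", "alice", "alice_session_1")
```

Reusing the same thread_id on the second call is what lets the graph see the first message.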

Common gotcha

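If the request body omits "config": {"configurable": {"thread_id": "user_123"}}, the checkpointer has no thread to load from or save to, so nothing carries over from earlier requests. Recent LangGraph versions reject such a call with an error about missing configurable keys rather than silently starting from a fresh state. Developers assume the checkpointer works automatically. It doesn't: the client must specify the thread on every request.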

Error recovery

ModuleNotFoundError: No module named 'langserve'
LangServe is not bundled with langgraph. Install it separately: <code>pip install langserve</code> (the <code>langserve[server]</code> extra also pulls in the server-side dependencies).
pydantic.ValidationError: input_type and output_type
The graph type signature must match ChatState exactly. Use <code>.with_types(input_type=ChatState, output_type=ChatState)</code> on the compiled graph before adding routes.
ValueError: Checkpointer requires one or more of the following 'configurable' keys
The <code>config</code> key in the request payload must be a dict with a <code>configurable</code> key containing <code>thread_id</code>, and the graph state goes under <code>input</code>. Pass JSON like <code>{"input": {"messages": [...], "user_name": "alice"}, "config": {"configurable": {"thread_id": "alice_session_1"}}}</code>
asyncio.CancelledError during stream
Client disconnected mid-stream. LangServe cancels the graph gracefully, but ensure your nodes handle cancellation (use <code>try/finally</code> for cleanup). This is expected in production.
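
The try/finally advice for cancellation can be sketched with plain asyncio. `slow_node`, `cleanup_log`, and `simulate_disconnect` are hypothetical names; the explicit `task.cancel()` stands in for what happens when a streaming client goes away, and this is an illustration of the pattern rather than LangServe's own cancellation path.

```python
import asyncio

cleanup_log: list[str] = []


async def slow_node(state: dict) -> dict:
    """Hypothetical async node that holds a resource while it works."""
    cleanup_log.append("acquired")
    try:
        await asyncio.sleep(10)  # stands in for a long LLM or tool call
        return {"result": "done"}
    finally:
        # Runs on normal completion AND when the task is cancelled.
        cleanup_log.append("released")


async def simulate_disconnect() -> bool:
    """Start the node, cancel it mid-flight, and report whether it was cancelled."""
    task = asyncio.create_task(slow_node({}))
    await asyncio.sleep(0.01)  # give the node time to start
    task.cancel()              # stand-in for a client disconnect
    try:
        await task
    except asyncio.CancelledError:
        return True
    return False


was_cancelled = asyncio.run(simulate_disconnect())
```

The `finally` block releases the resource even though the node never reaches its `return`, which is the behaviour you want for database connections, locks, or temp files.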

Experienced dev note

The invisible gotcha: MemorySaver is in-process RAM only. If your service restarts, all conversation history is lost. For production, swap MemorySaver() for a persistent checkpointer such as PostgresSaver or SqliteSaver. Also, thread_id is the only key the checkpointer isolates on, so it is not an access control: any client that sends another user's thread_id reads and extends that user's conversation. Predictable IDs therefore cause both a security problem and a usability problem. Use UUIDs, or derive the ID on the server from the authenticated user.

Check your understanding

If two different clients send requests to the same LangServe endpoint with identical thread_ids, what happens to their conversation histories, and why would this be a problem in a real multi-tenant application?

Answer hint

They share the exact same state because the checkpointer isolates by thread_id only. In production, you'd need to namespace thread_ids by user (e.g., <code>user_123_thread_1</code>) or use separate services per tenant.
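
One way to namespace thread IDs per user is to derive them on the server from the authenticated identity instead of trusting a client-supplied value. The sketch below is an assumption-laden illustration: `SERVER_SECRET` and `make_thread_id` are hypothetical names, and how `user_id` gets authenticated is outside its scope.

```python
import hashlib
import hmac

# Hypothetical secret; in a real service load it from configuration,
# never from source code.
SERVER_SECRET = b"replace-with-a-real-secret"


def make_thread_id(user_id: str, conversation: str) -> str:
    """Derive a thread_id that is bound to one user and hard to guess."""
    message = f"{user_id}:{conversation}".encode("utf-8")
    digest = hmac.new(SERVER_SECRET, message, hashlib.sha256).hexdigest()
    # The user prefix namespaces the thread; the keyed digest makes the
    # ID unpredictable to anyone who does not hold the secret.
    return f"{user_id}:{digest[:32]}"


alice_thread = make_thread_id("alice", "1")
bob_thread = make_thread_id("bob", "1")
```

Two users asking for conversation "1" get different threads, and the same user always gets the same thread back, so history persists without the ID being guessable.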

VERSION LangServe 0.1.x used as_runnable() instead of with_types(). In langgraph 0.2.x + langserve 0.3.x+, use .with_types() for proper type validation.
NEXT

Implementing human-in-the-loop interrupts in a deployed graph: pausing execution when a node requires human decision-making and resuming from exactly that point.
