Chain as a FastAPI endpoint: the standard pattern
Why this matters
Most deployed LangChain applications sit behind a web server. FastAPI is a common choice for Python services because it supports async natively, validates input with Pydantic, and can serve both sync and async chains with very little boilerplate. Getting this pattern wrong costs you latency, failed requests, and debugging time at scale.
Explanation
What it is: A FastAPI route that accepts user input, executes a LangChain LCEL chain, and returns the result, with async support, error handling, and optional streaming.
How it works mechanically: FastAPI receives a POST request with JSON input, validates it using Pydantic models, passes the parsed data to a pre-constructed chain by awaiting .ainvoke(), and returns the output. Because the route is async def and awaits the chain, the event loop can serve other requests while the LLM call is in flight. Calling the synchronous .invoke() inside an async def route would block the event loop instead. The chain itself is instantiated once (at server startup) and reused across requests, so the LLM client and prompt are not rebuilt on every call. Streaming works by wrapping an async generator over .astream() in FastAPI's StreamingResponse.
When to use it: This is the standard pattern for production LangChain deployments. Use it whenever you need an HTTP API for a chain, whether serving a chatbot, a document QA system, or a multi-step agent. It scales because async I/O doesn't block the event loop, and Pydantic validation catches malformed requests before they reach your chain.
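The concurrency claim above can be checked without FastAPI or an LLM. The following is a minimal sketch using only the standard library: slow_sync_call and slow_async_call are hypothetical stand-ins for chain.invoke() and chain.ainvoke(), and the three handlers mimic three ways a route might call them. It assumes Python 3.9+ for asyncio.to_thread.

```python
import asyncio
import time


def slow_sync_call(seconds: float) -> str:
    """Hypothetical stand-in for chain.invoke(): blocks the calling thread."""
    time.sleep(seconds)
    return "done"


async def slow_async_call(seconds: float) -> str:
    """Hypothetical stand-in for chain.ainvoke(): yields to the event loop."""
    await asyncio.sleep(seconds)
    return "done"


async def blocking_handler(seconds: float) -> str:
    # Sync call inside a coroutine: the event loop is stuck until it returns.
    return slow_sync_call(seconds)


async def awaiting_handler(seconds: float) -> str:
    # Awaited call: other coroutines run while this one waits.
    return await slow_async_call(seconds)


async def threaded_handler(seconds: float) -> str:
    # Sync call pushed to a worker thread, similar in spirit to how
    # FastAPI treats plain `def` routes.
    return await asyncio.to_thread(slow_sync_call, seconds)


async def time_concurrent(handler, n: int = 3, seconds: float = 0.2) -> float:
    """Run n handler calls concurrently and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(seconds) for _ in range(n)))
    return time.perf_counter() - start


def run_comparison() -> dict:
    async def main() -> dict:
        return {
            "blocking": await time_concurrent(blocking_handler),
            "awaiting": await time_concurrent(awaiting_handler),
            "threaded": await time_concurrent(threaded_handler),
        }

    return asyncio.run(main())


if __name__ == "__main__":
    for name, elapsed in run_comparison().items():
        print(f"{name}: {elapsed:.2f}s")
```

On a typical machine the blocking variant takes roughly three times as long as the other two, because its three calls run one after another even though they were started together.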
Analogy
Think of it like a restaurant kitchen: the chain is your recipe and execution process, FastAPI is the counter that takes orders and formats them, and the async event loop is your ability to take the next order while the first one is still cooking. Without async, you'd be standing at the counter waiting for every dish to finish before accepting the next order.
Code
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import json
import os

app = FastAPI()


class ChainInput(BaseModel):
    question: str


class ChainOutput(BaseModel):
    answer: str


# Built once at import time and reused by every request.
llm = ChatOpenAI(
    model="gpt-4o",
    api_key=os.environ.get("OPENAI_API_KEY"),
    temperature=0.7,
)
prompt = ChatPromptTemplate.from_template(
    "Answer this question concisely: {question}"
)
chain = prompt | llm | StrOutputParser()


@app.post("/invoke", response_model=ChainOutput)
async def invoke_chain(input_data: ChainInput):
    try:
        # Await ainvoke() so the event loop stays free during the LLM call.
        result = await chain.ainvoke({"question": input_data.question})
        return ChainOutput(answer=result)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/stream")
async def stream_chain(input_data: ChainInput):
    async def generate():
        try:
            # Emit one JSON object per line as chunks arrive.
            async for chunk in chain.astream({"question": input_data.question}):
                yield json.dumps({"chunk": chunk}) + "\n"
        except Exception as e:
            # Headers are already sent, so report the error in-band.
            yield json.dumps({"error": str(e)}) + "\n"

    return StreamingResponse(generate(), media_type="application/x-ndjson")


@app.get("/health")
async def health():
    return {"status": "ok"}


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
Running the script produces no application output; it runs without error. The FastAPI server starts and listens on 0.0.0.0:8000. You then POST to /invoke with {"question": "What is Python?"} and receive {"answer": "..."}. Or POST to /stream and receive newline-delimited JSON chunks as they arrive.
What just happened?
We created a FastAPI application with three endpoints: /invoke (returns the full response in one piece), /stream (async streaming, yields chunks as they arrive), and /health (liveness probe). The chain is instantiated once at server startup and reused across all requests. FastAPI validates the input using Pydantic, passes it to the chain, catches exceptions, and formats responses. The server starts immediately and waits for HTTP requests.
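On the client side, the /stream response has to be reassembled from its newline-delimited JSON lines. The sketch below shows one way to do that with the standard library only. collect_stream is a hypothetical helper, not part of FastAPI or LangChain, and it assumes each line is either {"chunk": ...} or {"error": ...} as produced by the endpoint above.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the "chunk" fields of an NDJSON stream into the full answer.

    Raises RuntimeError if the server reported an error mid-stream.
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        message = json.loads(line)
        if "error" in message:
            raise RuntimeError(message["error"])
        parts.append(message["chunk"])
    return "".join(parts)


# Sample lines shaped like the /stream output (made-up content).
sample = ['{"chunk": "Python "}\n', '{"chunk": "is a language."}\n']
print(collect_stream(sample))  # → Python is a language.
```

Because a streaming response has already sent its status line by the time a chunk fails, the client cannot rely on the HTTP status code alone and should check each line for an "error" key, as this helper does.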
Common gotcha
The most common mistake is calling the synchronous chain.invoke() inside an async def route. The call runs on the event loop thread, so no other request is served until it returns. Either await chain.ainvoke(), or declare the route with plain def so that FastAPI runs it in its thread pool. Also, instantiating the chain inside the route (instead of once at startup) rebuilds the LLM client and prompt on every request, which can add 500ms–2s of latency per call. A third gotcha is mishandling streaming: .astream() returns an async generator, so you must iterate it with async for, not a regular for.
Error recovery
RuntimeError: no running event loop
ValueError: Missing required input variable 'question'
ModuleNotFoundError: No module named 'langchain_openai'
Timeout on /stream endpoint
413 Request Entity Too Large
Experienced dev note
Building the chain once at startup, not once per request, saves you hours of debugging. Instantiate the chain at module level or in a FastAPI lifespan context manager (an @asynccontextmanager function passed as FastAPI(lifespan=...)). This avoids reinitializing the OpenAI client and prompt template on every request. Also, use a Pydantic response_model for automatic OpenAPI schema generation and client type safety. And always add a /health endpoint; your load balancer and Kubernetes probes will thank you. One more insight: if you need the chain to maintain conversation history (e.g., for a chatbot), use a message history store keyed by session ID, not a global variable. Otherwise one user's conversation pollutes another's.
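The session-keyed history idea can be sketched in plain Python. SessionHistoryStore below is a hypothetical, in-memory illustration, not a LangChain class. In a LangChain application the same role is usually played by a get_session_history callable passed to RunnableWithMessageHistory. A process-local dict like this one would not be shared across multiple server workers, so treat it as a sketch of the keying, not a production store.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


class SessionHistoryStore:
    """Hypothetical in-memory store: one message list per session ID."""

    def __init__(self) -> None:
        self._histories: Dict[str, List[Message]] = defaultdict(list)

    def append(self, session_id: str, role: str, content: str) -> None:
        self._histories[session_id].append((role, content))

    def get(self, session_id: str) -> List[Message]:
        # Return a copy so callers cannot mutate stored history by accident.
        return list(self._histories.get(session_id, []))


store = SessionHistoryStore()
store.append("alice", "user", "What is Python?")
store.append("bob", "user", "What is Rust?")

print(store.get("alice"))  # → [('user', 'What is Python?')]
print(store.get("carol"))  # → []
```

Keying every read and write by session ID is what keeps one user's conversation out of another's. A single global list would mix both users' messages.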
Check your understanding
Why should an async def FastAPI route await chain.ainvoke() instead of calling the synchronous chain.invoke()? What would happen if you kept chain.invoke() but removed async and declared the route as def invoke_chain? (Assume the chain makes external HTTP calls.)
Show answer hint
A correct answer recognizes that <code>async def</code> only helps when the route awaits. Awaiting <code>chain.ainvoke()</code> hands control back to FastAPI's event loop while the request waits (on I/O, LLM latency, etc.), so other requests can proceed. Calling the synchronous <code>chain.invoke()</code> inside <code>async def</code> blocks the event loop for the whole call, and no other concurrent requests can be processed. With plain <code>def</code>, FastAPI runs the route in a thread pool, so the event loop stays free and concurrency is limited by the pool size instead. The chain itself doesn't need to be async, but a route that calls it synchronously should be declared <code>def</code>, not <code>async def</code>.