Code Advanced hard · 7 min

Chain as a FastAPI endpoint: the standard pattern

What you will learn

Deploy a LangChain LCEL chain as a production FastAPI endpoint with streaming, error handling, and async support.

Why this matters

Most deployed LangChain applications sit behind a web server. FastAPI is a common production choice for Python services because it supports async handlers natively, validates input automatically with Pydantic, and can serve both sync and async chains with very little boilerplate. Getting this pattern wrong costs you in latency, failed requests, and debugging time at scale.

Skip if: you are building a simple CLI tool, a Jupyter notebook experiment, or a synchronous batch job that doesn't need HTTP access. Also avoid this pattern if your chain must integrate with frameworks that don't support async (legacy Django without async views, some older database drivers). For simple prototypes, use LangServe (a higher-level abstraction); this pattern is for when you need control over request/response handling or custom middleware.

Explanation

What it is: A FastAPI route that accepts user input, executes a LangChain LCEL chain, and returns the result, with async support, proper error handling, and optional streaming.

How it works mechanically: FastAPI receives a POST request with JSON input, validates it using Pydantic models, passes the parsed data to a pre-constructed chain, and returns the output. Inside an async def route the chain call must be awaited (await chain.ainvoke(...)); a blocking .invoke() there would stall the event loop, so it belongs in a plain def route, which FastAPI runs in a thread pool. With awaited calls, FastAPI can handle multiple requests concurrently. The chain itself is instantiated once (at server startup) and reused across requests, so the LLM client and prompt objects are not rebuilt on every call. Streaming works by passing an async generator that yields chunks from .astream() to FastAPI's StreamingResponse.
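The streaming half of that description can be sketched without a server or an API key. This is a minimal illustration, not the lesson's endpoint: fake_astream is a hypothetical stand-in for chain.astream(), and ndjson_lines is an assumed helper name that plays the role of the generator handed to StreamingResponse.

```python
import asyncio
import json
from typing import AsyncIterator


async def fake_astream(question: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for chain.astream(); yields text chunks."""
    for piece in ["Python ", "is a ", "language."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def ndjson_lines(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each chunk in a JSON object terminated by a newline (NDJSON)."""
    async for chunk in chunks:
        yield json.dumps({"chunk": chunk}) + "\n"


async def collect(question: str) -> list:
    # A real server would hand ndjson_lines(...) to StreamingResponse
    # instead of collecting the lines into a list.
    return [line async for line in ndjson_lines(fake_astream(question))]


lines = asyncio.run(collect("What is Python?"))
print(lines)
```

Each yielded string is one complete JSON document followed by a newline, which is what lets a client parse the response incrementally instead of waiting for the whole body.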

When to use it: This is the standard pattern for production LangChain deployments. Use it whenever you need an HTTP API for a chain, whether serving a chatbot, a document QA system, or a multi-step agent. It scales because async I/O doesn't block the event loop, and Pydantic validation catches malformed requests before they reach your chain.

Analogy

Think of it like a restaurant kitchen: the chain is your recipe and execution process, FastAPI is the counter that takes orders and formats them, and the async event loop is your ability to take the next order while the first one is still cooking. Without async, you'd be standing at the counter waiting for every dish to finish before accepting the next order.

Code

python

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
import json
import os

app = FastAPI()

class ChainInput(BaseModel):
    question: str

class ChainOutput(BaseModel):
    answer: str

llm = ChatOpenAI(
    model="gpt-4o",
    api_key=os.environ.get("OPENAI_API_KEY"),
    temperature=0.7
)

prompt = ChatPromptTemplate.from_template(
    "Answer this question concisely: {question}"
)

chain = prompt | llm | StrOutputParser()

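@app.post("/invoke", response_model=ChainOutput)
async def invoke_chain(input_data: ChainInput):
    try:
        # Await the async variant so the event loop stays free during the LLM call;
        # a plain chain.invoke() here would block every other request.
        result = await chain.ainvoke({"question": input_data.question})
        return ChainOutput(answer=result)
    except Exception as e:
        # str(e) can expose internal details; consider a generic message in production.
        raise HTTPException(status_code=500, detail=str(e))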

@app.post("/stream")
async def stream_chain(input_data: ChainInput):
    async def generate():
        try:
            async for chunk in chain.astream({"question": input_data.question}):
                yield json.dumps({"chunk": chunk}) + "\n"
        except Exception as e:
            yield json.dumps({"error": str(e)}) + "\n"
    return StreamingResponse(generate(), media_type="application/x-ndjson")

@app.get("/health")
async def health():
    return {"status": "ok"}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Output

Running the script prints no chain output, only the server's startup log. The FastAPI server starts and listens on 0.0.0.0:8000. You then POST to /invoke with {"question": "What is Python?"} and receive {"answer": "..."}. Or POST to /stream and receive newline-delimited JSON chunks as they arrive.
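On the client side, the /stream response has to be reassembled line by line. The sketch below assumes the {"chunk": ...} and {"error": ...} line shapes produced by the endpoint above; assemble_answer is a hypothetical helper name, and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def assemble_answer(lines: Iterable[str]) -> str:
    """Join streamed NDJSON chunks into one string, raising on an error line."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        message = json.loads(line)
        if "error" in message:
            raise RuntimeError(message["error"])
        parts.append(message["chunk"])
    return "".join(parts)


# Made-up sample of what the endpoint might send.
sample = ['{"chunk": "Python "}\n', '{"chunk": "is a language."}\n']
answer = assemble_answer(sample)
print(answer)
```

In a real client the lines would come from an HTTP library's line iterator (for example iter_lines() on a streamed httpx or requests response) rather than from a list.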

What just happened?

We created a FastAPI application with three endpoints: /invoke (runs the chain to completion and returns the full response in one JSON body), /stream (async streaming, yields chunks as they arrive), and /health (liveness probe). The chain is instantiated once at server startup and reused across all requests. FastAPI validates the input using Pydantic, passes it to the chain, catches exceptions, and formats responses. The server starts immediately and waits for HTTP requests.

Common gotcha

The most common mistake is calling the blocking chain.invoke() inside an async def route. The call runs directly on the event loop, so no other request can be handled until it returns. Either await chain.ainvoke() in an async def route, or declare the route with plain def so FastAPI runs it in a thread pool. Also, instantiating the chain inside the route (instead of once at startup) rebuilds the LLM client on every request and adds avoidable setup cost to each call. A third gotcha is mishandling streaming: .astream() returns an async generator, so you must iterate it with async for, not a regular for.
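The event-loop effect is easy to measure without FastAPI. This is a minimal sketch under stated assumptions: slow_sync_call is a hypothetical stand-in for a blocking chain.invoke(), and asyncio.to_thread plays roughly the role of the thread pool FastAPI uses for plain def routes.

```python
import asyncio
import time


def slow_sync_call() -> str:
    """Hypothetical stand-in for a blocking chain.invoke() call."""
    time.sleep(0.2)
    return "done"


async def blocking_handler() -> str:
    # Blocking call made directly inside a coroutine: the loop is stuck here.
    return slow_sync_call()


async def offloaded_handler() -> str:
    # Same call moved to a worker thread; the loop stays free meanwhile.
    return await asyncio.to_thread(slow_sync_call)


async def timed(handler) -> float:
    """Run three 'requests' concurrently and return the wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(handler(), handler(), handler())
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed(blocking_handler))
offloaded_elapsed = asyncio.run(timed(offloaded_handler))
print(f"blocking: {blocking_elapsed:.2f}s, offloaded: {offloaded_elapsed:.2f}s")
```

The three blocking handlers run one after another, so their total is roughly three times a single call, while the offloaded handlers overlap and finish in about the time of one.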

Error recovery

RuntimeError: no running event loop

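You're driving async chain methods (like .astream() or .ainvoke()) from a sync context. Use async def for your route and consume the stream with async for chunk in chain.astream(...) (or await chain.ainvoke(...) for a single result). Note that .astream() is iterated, not awaited. Alternatively, stay fully synchronous and call chain.invoke() in a plain def route.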

ValueError: Missing required input variable 'question'

Your prompt template expects 'question' but you're passing a different key. Check that the keys of the dict you hand to the chain (built from your ChainInput fields) match the prompt template variables exactly.
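A quick way to catch this mismatch before the first request is to compare the template's placeholders with the keys you intend to pass. The sketch below is a standard-library approximation for f-string-style templates; missing_fields is a hypothetical helper. With the real objects you could compare prompt.input_variables against the field names of your Pydantic model instead.

```python
from string import Formatter


def template_variables(template: str) -> set:
    """Return the placeholder names used in an f-string-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_fields(template: str, available_fields) -> set:
    """Placeholders the template needs that the input would not supply."""
    return template_variables(template) - set(available_fields)


TEMPLATE = "Answer this question concisely: {question}"

print(missing_fields(TEMPLATE, {"question"}))  # set()
print(missing_fields(TEMPLATE, {"query"}))     # {'question'}
```

Running a check like this at startup turns a per-request 500 into a failure you see immediately when the server boots.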

ModuleNotFoundError: No module named 'langchain_openai'

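Install the OpenAI provider: pip install langchain-openai. Make sure the installed langchain, langchain-core, and langchain-openai versions are compatible with each other (this lesson targets LangChain 1.2.x); installing them together lets pip resolve a matching set.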

Timeout on /stream endpoint

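A stalled /stream request usually means the upstream LLM call is slow or hanging, or a proxy between client and server is buffering the response. A missing method would raise an AttributeError rather than time out. astream() is part of the standard Runnable interface in current LangChain releases; a component without native streaming support typically returns its whole output as a single chunk. Add a timeout around the stream or on the client, and check proxy buffering. Alternatively, use synchronous chain.stream() and run it in an executor.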
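One way to keep a stalled stream from hanging the request is a per-chunk timeout. This is a sketch using only asyncio, not a LangChain or FastAPI feature: with_chunk_timeout is a hypothetical wrapper, and stalling_stream is a stand-in for a chain.astream() whose upstream stops responding.

```python
import asyncio
from typing import AsyncIterator


async def stalling_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for chain.astream() that stalls after one chunk."""
    yield "first"
    await asyncio.sleep(1.0)  # simulates an upstream that stops responding
    yield "never reached in this demo"


async def with_chunk_timeout(stream, seconds: float):
    """Re-yield chunks, raising asyncio.TimeoutError if one takes too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=seconds)
        except StopAsyncIteration:
            return  # the wrapped stream finished normally
        yield chunk


async def demo():
    received, timed_out = [], False
    try:
        async for chunk in with_chunk_timeout(stalling_stream(), seconds=0.05):
            received.append(chunk)
    except asyncio.TimeoutError:
        timed_out = True
    return received, timed_out


received, timed_out = asyncio.run(demo())
print(received, timed_out)
```

Inside the /stream generator, the except branch would be the place to emit a final {"error": ...} line so the client learns why the stream ended.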

413 Request Entity Too Large

Your JSON payload is too large for something in the request path. FastAPI has no max_upload_size setting, so a 413 usually comes from a reverse proxy or gateway in front of the app (for example nginx's client_max_body_size). Raise that limit, or send smaller payloads by trimming or paginating the input.

Experienced dev note

Building the chain once at startup, rather than once per request, saves you hours of debugging. Instantiate it at module level or in a FastAPI lifespan handler (an async context manager passed as FastAPI(lifespan=...)). This avoids rebuilding the OpenAI client and prompt template on every request. Also, use a Pydantic response_model for automatic OpenAPI schema generation and client type safety. And always add a /health endpoint: your load balancer and Kubernetes probes will thank you. One more insight: if you need the chain to maintain conversation history (e.g., for a chatbot), use a message history store keyed by session ID, not a single global variable; otherwise one user's conversation pollutes another's.
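The session-keyed history idea can be sketched in a few lines. This is an in-memory illustration only: SessionHistoryStore and the session IDs are hypothetical, and a deployment with several worker processes would need a shared backend (a database or cache) instead of a Python dict.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (role, content)


class SessionHistoryStore:
    """Keeps one message list per session ID so conversations never mix."""

    def __init__(self) -> None:
        self._histories: Dict[str, List[Message]] = defaultdict(list)

    def append(self, session_id: str, role: str, content: str) -> None:
        self._histories[session_id].append((role, content))

    def get(self, session_id: str) -> List[Message]:
        # Return a copy so callers cannot mutate the stored history by accident.
        return list(self._histories.get(session_id, []))


store = SessionHistoryStore()
store.append("alice", "user", "What is Python?")
store.append("bob", "user", "What is Rust?")
store.append("alice", "assistant", "A programming language.")

print(store.get("alice"))
print(store.get("bob"))
```

The store itself can still be created once at startup like the chain; what must not be shared is the message list, which is looked up per request from the session ID.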

Check your understanding

What changes when a FastAPI route that calls a LangChain chain is declared with async def instead of plain def? What would happen if you kept async def but called the blocking chain.invoke() inside it? (Assume the chain makes external HTTP calls.)

Show answer hint

A correct answer recognizes that an async def route runs directly on FastAPI's event loop, so it only allows concurrency if every slow step is awaited (for example await chain.ainvoke(...)). While one request is waiting on I/O or LLM latency, the loop can serve others. Calling the blocking chain.invoke() inside async def stalls the loop, and no other request is processed until it returns. A plain def route does not block the loop: FastAPI runs sync functions in a thread pool, so a synchronous chain is safe there, with concurrency bounded by the pool size. Prefer async def with the chain's async methods for I/O-bound work such as LLM calls, and use def when you have to call blocking code.

VERSION LangChain 1.2.x uses LCEL (LangChain Expression Language) as the standard. The deprecated LLMChain (from langchain.chains import LLMChain) was removed in 1.0.0. Always use the pipe operator: prompt | llm | parser. Also, langchain_openai is a separate package (pip install langchain-openai); the old from langchain.chat_models import ChatOpenAI no longer works as of 1.0.0.

Next, learn how to add request/response logging and tracing to your FastAPI chain endpoint using LangSmith instrumentation for production observability.
