Code Advanced hard · 8 min

LlamaIndex behind FastAPI: the standard pattern

What you will learn

Serve a LlamaIndex RAG pipeline as production HTTP endpoints using FastAPI with proper async handling and index reuse.

Why this matters

In production, your RAG system runs as a service, not a script. FastAPI + LlamaIndex is the standard pattern for teams deploying retrieval pipelines at scale: you need to know how to structure index initialization, handle concurrent requests, and avoid rebuilding the index on every call.

Skip if: you're building a single-user notebook or a CLI tool, or you're running entirely serverless (Lambda, Cloud Functions), where you don't control the runtime lifecycle. The pattern is also unnecessary if your index is tiny (under roughly 100KB) and you can afford to reload it on every request.

Explanation

LlamaIndex is a retrieval orchestration layer: it wraps your documents, embeddings, and LLM calls into a query engine. FastAPI is a web framework that exposes that query engine over HTTP. The standard production pattern connects them by initializing the index once at startup (not per-request) and reusing it across all incoming queries.

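Mechanically: FastAPI's lifespan context manager builds the index at startup and keeps it in memory for the life of the process. When a request arrives, the endpoint passes the user's input to the shared query engine and returns the result. Declaring the endpoint `async def` is not enough to keep the event loop free: the loop stays unblocked while the LLM responds only if the endpoint awaits the engine's async method (`aquery`), because a synchronous `query()` call inside an `async def` endpoint holds the loop for the whole round trip. The index itself is built once from documents (file, vector database, or API), stored in a plain module-level holder object (`IndexManager` in the code below), and shared by every request handler. FastAPI's dependency injection (`Depends`) is an alternative way to hand it to endpoints.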
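The event-loop point can be checked without FastAPI or LlamaIndex. The sketch below is a stand-in, not LlamaIndex code: `blocking_query` and `async_query` are hypothetical functions that simulate a 50 ms LLM call with `time.sleep` and `asyncio.sleep` respectively, and the delay value is an assumption chosen for the demo. Five concurrent "requests" that call the blocking version from a coroutine run one after another, while the awaited version lets them overlap.

```python
import asyncio
import time

DELAY = 0.05  # simulated LLM latency in seconds (assumption for the demo)

def blocking_query(text: str) -> str:
    """Hypothetical stand-in for a synchronous query_engine.query() call."""
    time.sleep(DELAY)  # holds the event loop for the whole wait
    return f"answer to {text}"

async def async_query(text: str) -> str:
    """Hypothetical stand-in for an awaited, non-blocking query call."""
    await asyncio.sleep(DELAY)  # yields to the event loop while waiting
    return f"answer to {text}"

async def handler_blocking(text: str) -> str:
    # An async handler that calls sync code directly: other requests must wait.
    return blocking_query(text)

async def handler_async(text: str) -> str:
    # An async handler that awaits: other requests proceed during the wait.
    return await async_query(text)

async def timed(handler, n: int = 5) -> float:
    """Run n concurrent 'requests' through handler and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(f"q{i}") for i in range(n)))
    return time.perf_counter() - start

def compare() -> tuple[float, float]:
    """Return (elapsed with blocking handler, elapsed with awaited handler)."""
    blocking = asyncio.run(timed(handler_blocking))
    awaited = asyncio.run(timed(handler_async))
    return blocking, awaited
```

Calling `compare()` should show the blocking run taking roughly five times the delay and the awaited run taking roughly one delay, though exact timings vary by machine.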

Use this when you're deploying a multi-user system that needs to field dozens of concurrent queries without reloading documents or re-embedding. It's also the foundation for production observability, rate limiting, and authentication layering.

Analogy

Think of the index as a loaded database connection pool. You don't create a new connection per HTTP request: you maintain one shared pool and give each request a handle to it. FastAPI's lifespan is your connection pool manager.

Code

python

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
import os
from contextlib import asynccontextmanager

class QueryRequest(BaseModel):
    query: str

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]

class IndexManager:
    def __init__(self):
        self.index = None
        self.query_engine = None

index_manager = IndexManager()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize index once
    print("Loading documents and building index...")
    
    Settings.llm = OpenAI(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))
    Settings.embed_model = OpenAIEmbedding(
        model="text-embedding-3-small",
        api_key=os.getenv("OPENAI_API_KEY")
    )
    
    documents = SimpleDirectoryReader("./documents").load_data()
    index_manager.index = VectorStoreIndex.from_documents(documents)
    index_manager.query_engine = index_manager.index.as_query_engine(similarity_top_k=3)
    
    print(f"Index ready. Indexed {len(documents)} documents.")
    yield
    
    # Shutdown: cleanup (if needed)
    print("Shutting down...")

app = FastAPI(title="LlamaIndex RAG API", lifespan=lifespan)

@app.post("/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest) -> QueryResponse:
    if index_manager.query_engine is None:
        raise HTTPException(status_code=503, detail="Index not initialized")
    
    try:
        # Await the async query method so the event loop stays free during the LLM call
        response = await index_manager.query_engine.aquery(request.query)
        source_nodes = response.source_nodes if hasattr(response, 'source_nodes') else []
        sources = [str(node.node.get_content()[:100]) for node in source_nodes]
        
        return QueryResponse(
            answer=str(response),
            sources=sources
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Query failed: {str(e)}")

@app.get("/health")
async def health() -> dict:
    return {"status": "ok", "index_ready": index_manager.index is not None}

Output

No output: this is a runnable FastAPI application. When run with `uvicorn main:app --reload`, it starts the server, loads the index during startup, and waits for POST requests to /query. Health check returns: {"status": "ok", "index_ready": true}
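Once the server is running, any HTTP client can send a request. The sketch below uses only the standard library. The base URL `http://127.0.0.1:8000` is uvicorn's default bind address and is an assumption about your setup; `build_query_request` and `ask` are illustrative helper names, not part of FastAPI or LlamaIndex.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: uvicorn's default host and port

def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request whose JSON body matches the QueryRequest model."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question: str, base_url: str = BASE_URL, timeout: float = 60.0) -> dict:
    """Send the question and return the decoded QueryResponse as a dict."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server up, `ask("your question")` returns a dict with the `answer` and `sources` keys defined by `QueryResponse`. The generous timeout is a guess at LLM latency; tune it for your model.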

What just happened?

The code defines a FastAPI application with an async lifespan context. On startup, it configures the OpenAI LLM and embedding model on LlamaIndex's global `Settings`, loads documents from a directory, builds a VectorStoreIndex, and creates a query engine. The index and query engine are stored in a shared module-level `IndexManager` instance. When a request hits `/query`, it uses the already-initialized query engine to answer without rebuilding. The `/health` endpoint reports whether the index is ready. On shutdown, the code after `yield` runs; here it only prints a message, but that is where cleanup would go.

Common gotcha

The biggest mistake: rebuilding the index inside the endpoint function. Developers often write `index = VectorStoreIndex.from_documents(docs)` inside `@app.post("/query")`, thinking each request needs a fresh index. This will time out under production traffic, because every request re-embeds every document. The index is expensive to build: build it once at startup and reuse it for the life of the process.
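The cost difference is easy to see with a counter. In the sketch below, `build_index` is a hypothetical stand-in for `VectorStoreIndex.from_documents` that only records how often it is called; no real embedding happens, and the "index" is just a dict.

```python
from dataclasses import dataclass

@dataclass
class BuildCounter:
    """Counts how many times the (simulated) expensive build runs."""
    builds: int = 0

def build_index(counter: BuildCounter, docs: list[str]) -> dict[str, int]:
    """Hypothetical stand-in for VectorStoreIndex.from_documents(docs)."""
    counter.builds += 1  # in the real pipeline, each build re-embeds every document
    return {doc: len(doc) for doc in docs}

def serve_rebuilding(counter: BuildCounter, docs: list[str], questions: list[str]) -> list[int]:
    """Anti-pattern: build inside the request handler."""
    answers = []
    for question in questions:
        index = build_index(counter, docs)  # rebuilt on every request
        answers.append(index.get(question, 0))
    return answers

def serve_shared(counter: BuildCounter, docs: list[str], questions: list[str]) -> list[int]:
    """Pattern from this lesson: build once at startup, reuse for every request."""
    index = build_index(counter, docs)  # runs once
    return [index.get(question, 0) for question in questions]
```

Both functions return the same answers; only the number of builds differs, growing with request count in the first case and staying at one in the second.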

Error recovery

RuntimeError: Event loop is closed

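This typically happens when the lifespan runs outside an ASGI server. Start the app with `uvicorn main:app`. Running `python main.py` only works if the script itself calls `uvicorn.run(...)`, and FastAPI has no `app.run()` method (that is Flask's API).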

AttributeError: 'IndexManager' object has no attribute 'query_engine'

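The endpoint ran before lifespan startup finished, or in a test that never ran the lifespan. With the code above, `IndexManager.__init__` sets `query_engine = None`, so this case surfaces as a 503 "Index not initialized" response; the AttributeError appears only if the attribute is first assigned during startup. Make sure you hit the endpoint after server startup (check the console for 'Index ready'). In tests, use FastAPI's TestClient as a context manager (`with TestClient(app) as client:`), because the lifespan runs only inside the `with` block.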

AuthenticationError from OpenAI API

Your OPENAI_API_KEY environment variable is missing or invalid. Set it before starting the server: `export OPENAI_API_KEY=sk-...` (Linux/Mac), `set OPENAI_API_KEY=sk-...` (Windows Command Prompt), or `$env:OPENAI_API_KEY = "sk-..."` (PowerShell).
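A missing key is easier to diagnose if startup fails immediately with a clear message instead of failing on the first OpenAI call. The helper below is a hypothetical sketch (`require_env` is not part of FastAPI or LlamaIndex) that could be called at the top of `lifespan`.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Export it before starting the server.")
    return value
```

For example, `api_key = require_env("OPENAI_API_KEY")` would stop the server from starting when the variable is absent or blank. It cannot tell whether the key is valid; an invalid key still surfaces as an AuthenticationError from the API.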

FileNotFoundError: ./documents directory not found

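Create the ./documents directory and add .txt, .pdf, or .md files. SimpleDirectoryReader raises an error on a missing or empty directory (depending on your llama-index-core version, this may surface as a ValueError rather than a FileNotFoundError). If you have no documents yet, create a test.txt file with sample content.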

Experienced dev note

In production, don't rely on the in-memory index alone: persist it as well. Use LlamaIndex's built-in persistence: call `index.storage_context.persist('./index_storage')` after building (or at shutdown), then load it at startup with `load_index_from_storage(StorageContext.from_defaults(persist_dir='./index_storage'))`. Both `StorageContext` and `load_index_from_storage` are imported from `llama_index.core`. For large indices, this avoids re-embedding on every restart and can cut startup time from minutes to seconds. Also keep `similarity_top_k` small (the code above uses 3) to limit LLM context and keep costs down: more sources doesn't always mean better answers.

Check your understanding

Why does the code store the index in a module-level object (an `IndexManager` instance) instead of creating it fresh inside the endpoint function? What problem does this solve in a real production system with 100 concurrent requests per second?

Answer hint

A correct answer must touch on: (1) the cost of rebuilding the index (re-embedding all documents), (2) that the lifespan runs once at startup, and (3) that concurrent requests share the same index in memory. Bonus: mention that, for a fixed embedding model and unchanged documents, rebuilding yields effectively the same vectors, so it wastes compute without changing the results.

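VERSION The `lifespan` context manager is a FastAPI feature (available from FastAPI 0.93), not a LlamaIndex one. Older FastAPI versions used `@app.on_event('startup')` or `app.add_event_handler('startup', ...)` instead. On the LlamaIndex side, the `llama_index.core` import paths used here require llama-index 0.10.0 or later; 0.9.x used `from llama_index import ...`. This course assumes llama-index-core 0.12.x.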
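Because version mismatches are a common cause of import and startup errors, it can help to print what is actually installed before debugging further. The sketch below uses only the standard library; `installed_version` and `report` are illustrative helper names, and the default package list is an assumption about what this lesson's code depends on.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def report(packages=("fastapi", "llama-index-core", "uvicorn")) -> dict:
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}
```

Printing `report()` in the environment that runs the server shows at a glance whether the installed versions match the ones this lesson assumes.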

Learn how to add query filtering and metadata-driven retrieval to your FastAPI RAG endpoint: this lets users ask 'show me only articles from 2025' and have the index respect it.
