LlamaIndex behind FastAPI: the standard pattern
Why this matters
In production, your RAG system runs as a service, not a script. FastAPI + LlamaIndex is the standard pattern for teams deploying retrieval pipelines at scale: you need to know how to structure index initialization, handle concurrent requests, and avoid rebuilding the index on every call.
Explanation
LlamaIndex is a retrieval orchestration layer: it wraps your documents, embeddings, and LLM calls into a query engine. FastAPI is a web framework that exposes that query engine over HTTP. The standard production pattern connects them by initializing the index once at startup (not per-request) and reusing it across all incoming queries.
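Mechanically: a function registered as FastAPI's lifespan context manager builds the index at startup and keeps it in memory for the life of the process. When a request arrives, the endpoint passes the user's input to the query engine and returns the result. Declaring the endpoint `async def` is not enough to keep the event loop free: the query call itself must be awaited (`await query_engine.aquery(user_input)`), because a synchronous `query_engine.query(user_input)` inside an `async def` endpoint blocks every other request until the LLM responds. The index is built once from documents (files, a vector database, or an API), stored on a shared holder object (a plain `IndexManager` class in the code below), and reused by every request; the same object could equally be handed to endpoints through FastAPI's dependency injection.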
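To see why the blocking distinction matters, here is a minimal standard-library sketch that involves no FastAPI or LlamaIndex at all. `slow_query` is a hypothetical stand-in for a synchronous LLM call, and a background "ticker" task counts how often the event loop gets a chance to run while the query is in flight.

```python
import asyncio
import time


def slow_query(text: str) -> str:
    """Hypothetical stand-in for a blocking LLM call (about 0.2 s)."""
    time.sleep(0.2)
    return f"answer to {text!r}"


async def count_ticks(stop: asyncio.Event) -> int:
    """Count how many times the event loop lets this task run."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def measure(offload: bool) -> int:
    """Run one query and report how many ticks happened meanwhile."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker a chance to start
    if offload:
        # The blocking call runs in a worker thread; the loop stays free.
        await asyncio.to_thread(slow_query, "q")
    else:
        # The blocking call runs on the event loop; nothing else can run.
        slow_query("q")
    stop.set()
    return await ticker


if __name__ == "__main__":
    print("ticks while blocking:  ", asyncio.run(measure(offload=False)))
    print("ticks while offloaded: ", asyncio.run(measure(offload=True)))
```

When the call blocks the loop, the ticker barely advances; when it is offloaded, the ticker keeps running. In a real service the ticker corresponds to every other in-flight request. Awaiting a natively async call such as `aquery` has the same effect as offloading without needing a thread.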
Use this when you're deploying a multi-user system that needs to field dozens of concurrent queries without reloading documents or re-embedding. It's also the foundation for production observability, rate limiting, and authentication layering.
Analogy
Think of the index as a loaded database connection pool. You don't create a new connection per HTTP request: you maintain one shared pool and give each request a handle to it. FastAPI's lifespan is your connection pool manager.
Code
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding


class QueryRequest(BaseModel):
    query: str


class QueryResponse(BaseModel):
    answer: str
    sources: list[str]


class IndexManager:
    """Holds the index and query engine shared by all requests."""

    def __init__(self):
        self.index = None
        self.query_engine = None


index_manager = IndexManager()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize index once
    print("Loading documents and building index...")
    Settings.llm = OpenAI(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))
    Settings.embed_model = OpenAIEmbedding(
        model="text-embedding-3-small",
        api_key=os.getenv("OPENAI_API_KEY"),
    )
    documents = SimpleDirectoryReader("./documents").load_data()
    index_manager.index = VectorStoreIndex.from_documents(documents)
    index_manager.query_engine = index_manager.index.as_query_engine(similarity_top_k=3)
    print(f"Index ready. Indexed {len(documents)} documents.")
    yield
    # Shutdown: cleanup (if needed)
    print("Shutting down...")


app = FastAPI(title="LlamaIndex RAG API", lifespan=lifespan)


@app.post("/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest) -> QueryResponse:
    if index_manager.query_engine is None:
        raise HTTPException(status_code=503, detail="Index not initialized")
    try:
        # Use the async variant: a synchronous .query() call here would
        # block the event loop for every other request.
        response = await index_manager.query_engine.aquery(request.query)
        source_nodes = response.source_nodes if hasattr(response, "source_nodes") else []
        # Return the first 100 characters of each retrieved chunk
        sources = [str(node.node.get_content()[:100]) for node in source_nodes]
        return QueryResponse(
            answer=str(response),
            sources=sources,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Query failed: {str(e)}")


@app.get("/health")
async def health() -> dict:
    return {"status": "ok", "index_ready": index_manager.index is not None}

No output: this is a runnable FastAPI application. When run with `uvicorn main:app --reload`, it starts the server, loads the index during startup, and waits for POST requests to /query. Health check returns: {"status": "ok", "index_ready": true}
What just happened?
The code defines a FastAPI application with an async lifespan context. On startup, it configures the OpenAI LLM and embedding model on LlamaIndex's global `Settings`, loads documents from a directory, builds a `VectorStoreIndex`, and creates a query engine. The index and query engine are stored on a shared `IndexManager` object. When a request hits `/query`, it uses the already-initialized query engine to answer without rebuilding. The `/health` endpoint reports whether the index is ready. On shutdown, the code after `yield` runs; here it only logs a message, but it is the place to release resources.
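The startup sequence described above only discovers a missing API key or a missing documents directory once it is already calling OpenAI or reading files. One option is a fail-fast check at the top of the lifespan function. This is a sketch under assumptions, not part of the original code: the helper names `startup_problems` and `require_startup_ready` are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping


def startup_problems(docs_dir: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable reasons why startup would fail (empty if none)."""
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    path = Path(docs_dir)
    if not path.is_dir():
        problems.append(f"{docs_dir} is not a directory")
    elif not any(p.is_file() for p in path.iterdir()):
        problems.append(f"{docs_dir} contains no files")
    return problems


def require_startup_ready(docs_dir: str, env: Mapping[str, str] = os.environ) -> None:
    """Raise one clear error before any expensive work begins."""
    problems = startup_problems(docs_dir, env)
    if problems:
        raise RuntimeError("Startup check failed: " + "; ".join(problems))
```

Calling `require_startup_ready("./documents")` as the first statement of the lifespan function would turn a late `AuthenticationError` or `FileNotFoundError` into a single descriptive message at boot.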
Common gotcha
The biggest mistake: rebuilding the index inside the endpoint function. Developers often write `index = VectorStoreIndex.from_documents(docs)` inside `@app.post("/query")` thinking each request needs a fresh index. This will time out under production traffic. The index is expensive to build because every document has to be embedded, so build it once at startup and reuse it for the life of the process.
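The difference can be made concrete by counting builds. The sketch below uses only the standard library: `build_index` is a hypothetical stand-in for `VectorStoreIndex.from_documents(...)`, and a plain `asynccontextmanager` plays the role of FastAPI's lifespan.

```python
import asyncio
from contextlib import asynccontextmanager

build_count = 0  # how many times the "index" has been built
state = {}       # shared holder, playing the role of IndexManager


def build_index() -> dict:
    """Hypothetical stand-in for an expensive index build."""
    global build_count
    build_count += 1
    return {"doc1": "FastAPI serves HTTP", "doc2": "LlamaIndex retrieves"}


async def query_rebuilding(key: str) -> str:
    # Anti-pattern: every request pays the full build cost.
    index = build_index()
    return index.get(key, "not found")


@asynccontextmanager
async def lifespan():
    # Build once before serving, drop the reference on shutdown.
    state["index"] = build_index()
    try:
        yield
    finally:
        state.clear()


async def query_shared(key: str) -> str:
    # Every request reuses the index built at startup.
    return state["index"].get(key, "not found")


async def serve(n_requests: int, shared: bool) -> int:
    """Handle n_requests concurrently and return the number of builds."""
    global build_count
    build_count = 0
    if shared:
        async with lifespan():
            await asyncio.gather(*(query_shared("doc1") for _ in range(n_requests)))
    else:
        await asyncio.gather(*(query_rebuilding("doc1") for _ in range(n_requests)))
    return build_count


if __name__ == "__main__":
    print("builds, rebuilt per request:", asyncio.run(serve(100, shared=False)))
    print("builds, shared via lifespan:", asyncio.run(serve(100, shared=True)))
```

With a real index, each of those extra builds would mean re-embedding the whole document set, which is where the timeouts come from.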
Error recovery
RuntimeError: Event loop is closed
AttributeError: 'IndexManager' object has no attribute 'query_engine'
AuthenticationError from OpenAI API
FileNotFoundError: ./documents directory not found
Experienced dev note
In production, don't rely only on an in-memory index that is rebuilt at every startup: persist it as well. Use LlamaIndex's built-in persistence: call `index.storage_context.persist('./index_storage')` once the index is built, then on later startups load it with `load_index_from_storage(StorageContext.from_defaults(persist_dir='./index_storage'))`. For large indices this can cut startup time from minutes to seconds, because loading skips re-embedding every document. Also keep `similarity_top_k` small in your query engine (the code above uses 3) to limit LLM context and keep costs down: more sources doesn't always mean better answers.
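The load-or-build decision can be sketched independently of LlamaIndex. In the sketch below, `load_or_build` is a hypothetical helper, and the three callables are placeholders for the LlamaIndex calls named above, so the control flow can be read and tested on its own.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_dir: str,
    build: Callable[[], T],
    load: Callable[[str], T],
    persist: Callable[[T, str], None],
) -> T:
    """Load a persisted index if one exists; otherwise build and persist it."""
    path = Path(persist_dir)
    if path.is_dir() and any(path.iterdir()):
        return load(persist_dir)
    index = build()
    persist(index, persist_dir)  # save right away so a crash does not lose the work
    return index


# Assumed wiring to the LlamaIndex calls mentioned in the text (not run here):
#   build   = lambda: VectorStoreIndex.from_documents(
#                 SimpleDirectoryReader("./documents").load_data())
#   load    = lambda d: load_index_from_storage(
#                 StorageContext.from_defaults(persist_dir=d))
#   persist = lambda idx, d: idx.storage_context.persist(persist_dir=d)
```

Persisting immediately after the build, rather than at shutdown, means the saved copy exists even if the process is killed before the shutdown code runs. The sketch does not detect changed source documents, so a stale `./index_storage` would need to be deleted or rebuilt by some other mechanism.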
Check your understanding
Why does the code store the index in a shared, module-level object (`IndexManager`) instead of creating it fresh inside the endpoint function? What problem does this solve in a real production system with 100 concurrent requests per second?
Answer hint
A correct answer must touch on: (1) the cost of rebuilding the index (re-embedding all documents), (2) that the lifespan runs once at startup, and (3) that concurrent requests share the same index in memory. Bonus: mention that embedding the same documents with the same model yields effectively the same vectors, so rebuilding gives the same results but wastes compute.