Intermediate · 3 min read

How to build a multi-document RAG system

Quick answer
A multi-document RAG system combines vector search over multiple documents with an LLM like gpt-4o to retrieve relevant context and generate accurate answers. It indexes documents using embeddings, performs similarity search on queries, then feeds retrieved text as context to the LLM for generation.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" faiss-cpu numpy

Setup environment

Install the required Python packages and export your OpenAI API key so the client can read it from the environment. Quote the version specifier so the shell doesn't interpret `>=` as a redirect.

bash
pip install "openai>=1.0" faiss-cpu numpy
export OPENAI_API_KEY="your-key-here"

Step by step implementation

This example shows how to embed multiple documents, build a FAISS vector index, query it, and generate answers with gpt-4o.

python
import os
import numpy as np
import faiss
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents
documents = [
    "Python is a versatile programming language.",
    "OpenAI provides powerful LLM APIs.",
    "FAISS is a library for efficient similarity search.",
    "RAG combines retrieval with generation for better answers."
]

# Step 1: Embed all documents in one batched API call
# (the embeddings endpoint accepts a list of strings; results
#  come back in the same order as the input)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=documents
)
embeddings = np.array([d.embedding for d in response.data], dtype="float32")

# Step 2: Build FAISS index
dimension = len(embeddings[0])
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Step 3: Query embedding
query = "What library helps with similarity search?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input=query
).data[0].embedding
query_embedding = np.array([query_embedding]).astype('float32')

# Step 4: Search top 2 relevant docs
k = 2
D, I = index.search(query_embedding, k)
retrieved_docs = [documents[i] for i in I[0]]

# Step 5: Generate answer with context
context = "\n".join(retrieved_docs)
prompt = f"Use the following context to answer the question:\n{context}\nQuestion: {query}\nAnswer:"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

print("Answer:", response.choices[0].message.content)

output
Answer: FAISS is a library for efficient similarity search.

Common variations

  • Swap in other LLMs such as mistral-large-latest (via the Mistral API) or claude-3-5-sonnet-20241022 (via the Anthropic API); each requires its own client library.
  • Switch to async calls with asyncio for higher throughput.
  • Use persistent vector stores like Chroma or FAISS with disk storage for large corpora.
  • Expand retrieval to more documents or add metadata filtering.
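
The metadata-filtering variation above can be sketched without FAISS at all, using plain NumPy L2 distances over a toy corpus. The vectors and `source` tags here are illustrative stand-ins, not real embeddings, and the `search` helper is mine, not part of any library:

```python
import numpy as np

# Toy corpus with metadata; "vec" holds hand-picked 2-D stand-ins for embeddings.
docs = [
    {"text": "FAISS is a library for similarity search.", "source": "docs", "vec": [1.0, 0.0]},
    {"text": "Python is a versatile language.",           "source": "blog", "vec": [0.0, 1.0]},
    {"text": "RAG combines retrieval with generation.",   "source": "docs", "vec": [0.8, 0.6]},
]

def search(query_vec, source=None, k=2):
    """L2 search over the corpus, optionally restricted to one metadata source."""
    pool = [d for d in docs if source is None or d["source"] == source]
    dists = [float(np.linalg.norm(np.array(d["vec"]) - np.array(query_vec))) for d in pool]
    ranked = sorted(zip(dists, pool), key=lambda pair: pair[0])
    return [d["text"] for _, d in ranked[:k]]

print(search([1.0, 0.1], source="docs", k=1))
# ['FAISS is a library for similarity search.']
```

The same pre-filtering idea carries over to FAISS: filter document IDs by metadata first, then search only within that subset (or search a wider `k` and filter the results).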

Troubleshooting tips

  • If embeddings are empty or errors occur, verify your API key and network connection.
  • Ensure FAISS index dimension matches embedding size exactly.
  • If answers are irrelevant, increase the number of retrieved documents (k) or improve prompt clarity.
  • Monitor token usage to avoid exceeding API limits.
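
The dimension-mismatch tip above can be enforced with a small guard before calling index.add. The helper name below is mine, not part of FAISS:

```python
import numpy as np

def check_dims(index_dim: int, embeddings) -> np.ndarray:
    """Validate that embeddings form an (n, index_dim) float32 matrix before indexing."""
    emb = np.asarray(embeddings, dtype="float32")
    if emb.ndim != 2 or emb.shape[1] != index_dim:
        raise ValueError(f"expected shape (n, {index_dim}), got {emb.shape}")
    return emb

# text-embedding-3-large returns 3072-dimensional vectors
vecs = check_dims(3072, np.zeros((4, 3072)))
print(vecs.shape)  # (4, 3072)
```

Failing fast here gives a readable error instead of the lower-level assertion FAISS raises when the shapes disagree.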

Key takeaways

  • Use vector embeddings and FAISS to index and search multiple documents efficiently.
  • Feed retrieved relevant documents as context to an LLM like gpt-4o for accurate generation.
  • Adjust retrieval count and prompt design to improve answer relevance and completeness.

Verified 2026-04 · gpt-4o, text-embedding-3-large, mistral-large-latest, claude-3-5-sonnet-20241022