Intermediate · 3 min read

How to build a multi-document RAG system

Quick answer
A multi-document RAG system combines vector search over multiple documents with an LLM like gpt-4o to retrieve relevant context and generate accurate answers. It indexes documents using embeddings, performs similarity search on queries, then feeds retrieved text as context to the LLM for generation.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" faiss-cpu numpy

Setup environment

Install the required Python packages and export your OpenAI API key so the client can read it from the environment. Quote the version specifier so the shell doesn't interpret `>=` as a redirect.

bash
pip install "openai>=1.0" faiss-cpu numpy
export OPENAI_API_KEY="your-key-here"

Step by step implementation

This example shows how to embed multiple documents, build a FAISS vector index, query it, and generate answers with gpt-4o.

python
import os
import numpy as np
import faiss
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents
documents = [
    "Python is a versatile programming language.",
    "OpenAI provides powerful LLM APIs.",
    "FAISS is a library for efficient similarity search.",
    "RAG combines retrieval with generation for better answers."
]

# Step 1: Embed all documents in one batched API call
# (the embeddings endpoint accepts a list of strings; results
#  come back in the same order as the input)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=documents
)
embeddings = np.array([d.embedding for d in response.data], dtype="float32")

# Step 2: Build FAISS index
dimension = len(embeddings[0])
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Step 3: Query embedding
query = "What library helps with similarity search?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input=query
).data[0].embedding
query_embedding = np.array([query_embedding]).astype('float32')

# Step 4: Search top 2 relevant docs
k = 2
D, I = index.search(query_embedding, k)
retrieved_docs = [documents[i] for i in I[0]]

# Step 5: Generate answer with context
context = "\n".join(retrieved_docs)
prompt = f"Use the following context to answer the question:\n{context}\nQuestion: {query}\nAnswer:"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

print("Answer:", response.choices[0].message.content)

output
Answer: FAISS is a library for efficient similarity search.

Common variations

  • Swap in other LLMs such as mistral-large-latest (via the Mistral API) or claude-3-5-sonnet-20241022 (via the Anthropic API); each requires its own client library.
  • Switch to async calls with asyncio for higher throughput.
  • Use persistent vector stores like Chroma or FAISS with disk storage for large corpora.
  • Expand retrieval to more documents or add metadata filtering.
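
The metadata-filtering variation above can be sketched without FAISS at all, using plain NumPy L2 distances over a toy corpus. The vectors and `source` tags here are illustrative stand-ins, not real embeddings, and the `search` helper is mine, not part of any library:

```python
import numpy as np

# Toy corpus with metadata; "vec" holds hand-picked 2-D stand-ins for embeddings.
docs = [
    {"text": "FAISS is a library for similarity search.", "source": "docs", "vec": [1.0, 0.0]},
    {"text": "Python is a versatile language.",           "source": "blog", "vec": [0.0, 1.0]},
    {"text": "RAG combines retrieval with generation.",   "source": "docs", "vec": [0.8, 0.6]},
]

def search(query_vec, source=None, k=2):
    """L2 search over the corpus, optionally restricted to one metadata source."""
    pool = [d for d in docs if source is None or d["source"] == source]
    dists = [float(np.linalg.norm(np.array(d["vec"]) - np.array(query_vec))) for d in pool]
    ranked = sorted(zip(dists, pool), key=lambda pair: pair[0])
    return [d["text"] for _, d in ranked[:k]]

print(search([1.0, 0.1], source="docs", k=1))
# ['FAISS is a library for similarity search.']
```

The same pre-filtering idea carries over to FAISS: filter document IDs by metadata first, then search only within that subset (or search a wider `k` and filter the results).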

Troubleshooting tips

  • If embeddings are empty or errors occur, verify your API key and network connection.
  • Ensure FAISS index dimension matches embedding size exactly.
  • If answers are irrelevant, increase the number of retrieved documents (k) or improve prompt clarity.
  • Monitor token usage to avoid exceeding API limits.
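
The dimension-mismatch tip above can be enforced with a small guard before calling index.add. The helper name below is mine, not part of FAISS:

```python
import numpy as np

def check_dims(index_dim: int, embeddings) -> np.ndarray:
    """Validate that embeddings form an (n, index_dim) float32 matrix before indexing."""
    emb = np.asarray(embeddings, dtype="float32")
    if emb.ndim != 2 or emb.shape[1] != index_dim:
        raise ValueError(f"expected shape (n, {index_dim}), got {emb.shape}")
    return emb

# text-embedding-3-large returns 3072-dimensional vectors
vecs = check_dims(3072, np.zeros((4, 3072)))
print(vecs.shape)  # (4, 3072)
```

Failing fast here gives a readable error instead of the lower-level assertion FAISS raises when the shapes disagree.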

Key takeaways

  • Use vector embeddings and FAISS to index and search multiple documents efficiently.
  • Feed retrieved relevant documents as context to an LLM like gpt-4o for accurate generation.
  • Adjust retrieval count and prompt design to improve answer relevance and completeness.

Verified 2026-04 · gpt-4o, text-embedding-3-large, mistral-large-latest, claude-3-5-sonnet-20241022