How to build a multi-document RAG system
Quick answer
A multi-document RAG system combines vector search over multiple documents with an LLM such as gpt-4o to retrieve relevant context and generate accurate answers. It indexes documents using embeddings, performs similarity search on queries, then feeds the retrieved text as context to the LLM for generation.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0 faiss-cpu numpy
Setup environment
Install required Python packages and set your OpenAI API key as an environment variable.
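On macOS or Linux, the key can be exported for the current shell session. The `sk-your-key-here` value below is a placeholder; substitute your own key:

```shell
# Placeholder value: replace with your real OpenAI API key.
export OPENAI_API_KEY="sk-your-key-here"

# Confirm the variable is visible to child processes.
echo "${OPENAI_API_KEY:+key is set}"
```

On Windows, the equivalent is `setx OPENAI_API_KEY "sk-your-key-here"` in a new terminal.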
pip install openai faiss-cpu numpy

Step-by-step implementation
This example shows how to embed multiple documents, build a FAISS vector index, query it, and generate answers with gpt-4o.
import os
import numpy as np
import faiss
from openai import OpenAI
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample documents
documents = [
"Python is a versatile programming language.",
"OpenAI provides powerful LLM APIs.",
"FAISS is a library for efficient similarity search.",
"RAG combines retrieval with generation for better answers."
]
# Step 1: Embed documents
embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=doc
    )
    embeddings.append(response.data[0].embedding)
embeddings = np.array(embeddings).astype('float32')
# Step 2: Build FAISS index
dimension = len(embeddings[0])
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
# Step 3: Query embedding
query = "What library helps with similarity search?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input=query
).data[0].embedding
query_embedding = np.array([query_embedding]).astype('float32')
# Step 4: Search top 2 relevant docs
k = 2
D, I = index.search(query_embedding, k)
retrieved_docs = [documents[i] for i in I[0]]
# Step 5: Generate answer with context
context = "\n".join(retrieved_docs)
prompt = f"Use the following context to answer the question:\n{context}\nQuestion: {query}\nAnswer:"
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
print("Answer:", response.choices[0].message.content) output
Answer: FAISS is a library for efficient similarity search.
Common variations
- Use mistral-large-latest or claude-3-5-sonnet-20241022 for different LLMs.
- Switch to async calls with asyncio for higher throughput.
- Use persistent vector stores like Chroma, or FAISS with disk storage, for large corpora.
- Expand retrieval to more documents or add metadata filtering.
Troubleshooting tips
- If embeddings are empty or errors occur, verify your API key and network connection.
- Ensure FAISS index dimension matches embedding size exactly.
- If answers are irrelevant, increase number of retrieved documents or improve prompt clarity.
- Monitor token usage to avoid exceeding API limits.
Key takeaways
- Use vector embeddings and FAISS to index and search multiple documents efficiently.
- Feed retrieved relevant documents as context to an LLM like gpt-4o for accurate generation.
- Adjust retrieval count and prompt design to improve answer relevance and completeness.