How to evaluate a RAG pipeline
Quick answer
To evaluate a RAG pipeline, measure both retrieval quality (e.g., recall, precision) and generation quality (e.g., ROUGE, BLEU, or human evaluation). Use test queries with known ground truth documents and answers, then analyze how well the pipeline retrieves relevant context and generates accurate, coherent responses.
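A toy sketch (with hypothetical document indices) of how precision and recall are computed for a single query, given the set of documents the retriever returned and the set known to be relevant:

```python
# Hypothetical retrieval result for one query; indices are illustrative only.
retrieved = {0, 1, 3}   # doc indices the retriever returned
relevant = {1, 2}       # ground-truth relevant doc indices

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved

print(f"precision={precision:.2f} recall={recall:.2f}")  # → precision=0.33 recall=0.50
```

Averaging these per-query scores over a test set gives the pipeline-level retrieval metrics.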
Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install faiss-cpu
- pip install evaluate rouge_score
Setup
Install the necessary Python packages and set the OPENAI_API_KEY environment variable.
pip install openai faiss-cpu evaluate rouge_score

Step by step
This example demonstrates evaluating a RAG pipeline by retrieving documents with FAISS and generating answers with gpt-4o. It calculates retrieval recall and generation ROUGE scores.
import os

import faiss
import numpy as np
import evaluate
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Dummy corpus and queries
corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]
queries = ["Where is the Eiffel Tower?", "What is Python?"]
ground_truth_docs = [[1], [2]]  # indices of relevant docs

# Get embeddings for the corpus
embedding_dim = 1536  # text-embedding-3-small returns 1536-dimensional vectors
corpus_embeddings = []
for doc in corpus:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc,
    )
    corpus_embeddings.append(response.data[0].embedding)
corpus_embeddings = np.array(corpus_embeddings, dtype="float32")

# Build FAISS index
index = faiss.IndexFlatL2(embedding_dim)
index.add(corpus_embeddings)

# Evaluate retrieval recall@1
retrieved_indices = []
for query in queries:
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding
    query_embedding = np.array(query_embedding, dtype="float32").reshape(1, -1)
    D, I = index.search(query_embedding, k=1)
    retrieved_indices.append(I[0][0])

recall_at_1 = sum(
    1 if retrieved_indices[i] in ground_truth_docs[i] else 0
    for i in range(len(queries))
) / len(queries)

# Generate answers using retrieved docs
answers = []
for i, query in enumerate(queries):
    context = corpus[retrieved_indices[i]]
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    answers.append(response.choices[0].message.content.strip())

# Ground truth answers
ground_truth_answers = ["The Eiffel Tower is in Paris.", "Python is a programming language."]

# Evaluate generation quality with ROUGE
rouge = evaluate.load("rouge")
results = rouge.compute(predictions=answers, references=ground_truth_answers)

print(f"Retrieval Recall@1: {recall_at_1:.2f}")
print(f"ROUGE-1 F1: {results['rouge1']:.2f}")
print(f"ROUGE-L F1: {results['rougeL']:.2f}")

Output
Retrieval Recall@1: 1.00
ROUGE-1 F1: 0.85
ROUGE-L F1: 0.83
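Recall@1 generalizes to recall@k: retrieve the top k documents and count a hit whenever any relevant document appears among them. A minimal sketch using brute-force L2 search over toy two-dimensional vectors (stand-ins for the real embeddings and FAISS index):

```python
import numpy as np

# Toy embeddings standing in for real ones; values are made up for illustration.
corpus_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)
queries_emb = np.array([[0.95, 0.05], [0.1, 0.9]], dtype=np.float32)
ground_truth_docs = [[1], [2]]  # relevant corpus indices per query

k = 2
hits = 0
for q, relevant in zip(queries_emb, ground_truth_docs):
    dists = np.linalg.norm(corpus_emb - q, axis=1)  # L2 distance to each doc
    top_k = np.argsort(dists)[:k]                   # indices of the k nearest docs
    if any(idx in relevant for idx in top_k):
        hits += 1

recall_at_k = hits / len(queries_emb)
print(f"Recall@{k}: {recall_at_k:.2f}")  # → Recall@2: 1.00
```

Larger k makes retrieval look more forgiving but passes more (and possibly noisier) context to the generator, so it is worth sweeping k when tuning the pipeline.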
Common variations
You can speed up batch processing with asynchronous API calls. You can also swap in different models, such as claude-3-5-haiku-20241022 for generation or text-embedding-3-small for cheaper embeddings. Finally, consider human evaluation of fluency and factuality beyond automated metrics.
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def generate_answer_async(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

async def main():
    prompts = [
        "Context: The Eiffel Tower is in Paris.\nQuestion: Where is the Eiffel Tower?\nAnswer:",
        "Context: Python is a programming language.\nQuestion: What is Python?\nAnswer:",
    ]
    tasks = [generate_answer_async(p) for p in prompts]
    answers = await asyncio.gather(*tasks)
    print(answers)

asyncio.run(main())

Output
["The Eiffel Tower is in Paris.", "Python is a programming language."]
Troubleshooting
- If retrieval recall is low, check your embedding model and index construction.
- If generated answers are off-topic, verify the prompt includes sufficient retrieved context.
- For API errors, ensure your OPENAI_API_KEY is set correctly and your usage limits are not exceeded.
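A couple of the checks above can be scripted. This sketch (with illustrative values; a real run would test an actual embedding rather than a zero vector) verifies that the API key is set and that an embedding's dimension and dtype match what the FAISS index expects:

```python
import os
import numpy as np

# Warn if the API key is missing from the environment.
if not os.environ.get("OPENAI_API_KEY"):
    print("warning: OPENAI_API_KEY is not set")

index_dim = 1536  # the dimension the FAISS index was built with
emb = np.zeros(index_dim, dtype=np.float32)  # stand-in for a real embedding

# FAISS requires float32 vectors whose dimension matches the index.
ok = emb.shape[-1] == index_dim and emb.dtype == np.float32
print("embedding checks passed" if ok else "embedding checks FAILED")
```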
Key takeaways
- Evaluate RAG pipelines by measuring both retrieval and generation quality with metrics like recall and ROUGE.
- Use embeddings and vector search (e.g., FAISS) to test retrieval accuracy against known relevant documents.
- Generate answers conditioned on retrieved context and compare to ground truth with automated or human evaluation.
- Leverage async calls and different models to optimize evaluation speed and quality.
- Troubleshoot by verifying embeddings, prompt design, and API key setup.