How to evaluate a RAG pipeline
Quick answer
To evaluate a RAG pipeline, measure both retrieval quality (e.g., recall, precision) and generation quality (e.g., ROUGE, BLEU, or human evaluation). Use test queries with known ground truth documents and answers, then analyze how well the pipeline retrieves relevant context and generates accurate, coherent responses.
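A toy sketch (with hypothetical document indices) of how precision and recall are computed for a single query, given the set of documents the retriever returned and the set known to be relevant:

```python
# Hypothetical retrieval result for one query; indices are illustrative only.
retrieved = {0, 1, 3}   # doc indices the retriever returned
relevant = {1, 2}       # ground-truth relevant doc indices

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved

print(f"precision={precision:.2f} recall={recall:.2f}")  # → precision=0.33 recall=0.50
```

Averaging these per-query scores over a test set gives the pipeline-level retrieval metrics.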
Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install faiss-cpu
- pip install evaluate rouge_score
Setup
Install the necessary Python packages and set the OPENAI_API_KEY environment variable.
pip install openai faiss-cpu evaluate rouge_score

Step by step
This example demonstrates evaluating a RAG pipeline by retrieving documents with FAISS and generating answers with gpt-4o. It calculates retrieval recall and generation ROUGE scores.
import os

import faiss
import numpy as np
import evaluate
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Dummy corpus and queries
corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]
queries = ["Where is the Eiffel Tower?", "What is Python?"]
ground_truth_docs = [[1], [2]]  # indices of relevant docs

# Get embeddings for the corpus
embedding_dim = 1536  # text-embedding-3-small returns 1536-dimensional vectors
corpus_embeddings = []
for doc in corpus:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=doc,
    )
    corpus_embeddings.append(response.data[0].embedding)
corpus_embeddings = np.array(corpus_embeddings, dtype="float32")

# Build FAISS index
index = faiss.IndexFlatL2(embedding_dim)
index.add(corpus_embeddings)

# Evaluate retrieval recall@1
retrieved_indices = []
for query in queries:
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding
    query_embedding = np.array(query_embedding, dtype="float32").reshape(1, -1)
    D, I = index.search(query_embedding, k=1)
    retrieved_indices.append(I[0][0])

recall_at_1 = sum(
    1 if retrieved_indices[i] in ground_truth_docs[i] else 0
    for i in range(len(queries))
) / len(queries)

# Generate answers using retrieved docs
answers = []
for i, query in enumerate(queries):
    context = corpus[retrieved_indices[i]]
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    answers.append(response.choices[0].message.content.strip())

# Ground truth answers
ground_truth_answers = ["The Eiffel Tower is in Paris.", "Python is a programming language."]

# Evaluate generation quality with ROUGE
rouge = evaluate.load("rouge")
results = rouge.compute(predictions=answers, references=ground_truth_answers)

print(f"Retrieval Recall@1: {recall_at_1:.2f}")
print(f"ROUGE-1 F1: {results['rouge1']:.2f}")
print(f"ROUGE-L F1: {results['rougeL']:.2f}")

Output
Retrieval Recall@1: 1.00
ROUGE-1 F1: 0.85
ROUGE-L F1: 0.83
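Recall@1 generalizes to recall@k: retrieve the top k documents and count a hit whenever any relevant document appears among them. A minimal sketch using brute-force L2 search over toy two-dimensional vectors (stand-ins for the real embeddings and FAISS index):

```python
import numpy as np

# Toy embeddings standing in for real ones; values are made up for illustration.
corpus_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)
queries_emb = np.array([[0.95, 0.05], [0.1, 0.9]], dtype=np.float32)
ground_truth_docs = [[1], [2]]  # relevant corpus indices per query

k = 2
hits = 0
for q, relevant in zip(queries_emb, ground_truth_docs):
    dists = np.linalg.norm(corpus_emb - q, axis=1)  # L2 distance to each doc
    top_k = np.argsort(dists)[:k]                   # indices of the k nearest docs
    if any(idx in relevant for idx in top_k):
        hits += 1

recall_at_k = hits / len(queries_emb)
print(f"Recall@{k}: {recall_at_k:.2f}")  # → Recall@2: 1.00
```

Larger k makes retrieval look more forgiving but passes more (and possibly noisier) context to the generator, so it is worth sweeping k when tuning the pipeline.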
Common variations
You can speed up batch processing with asynchronous API calls. You can also swap in different models, such as claude-3-5-haiku-20241022 for generation or text-embedding-3-small for cheaper embeddings. Finally, consider human evaluation of fluency and factuality beyond automated metrics.
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def generate_answer_async(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

async def main():
    prompts = [
        "Context: The Eiffel Tower is in Paris.\nQuestion: Where is the Eiffel Tower?\nAnswer:",
        "Context: Python is a programming language.\nQuestion: What is Python?\nAnswer:",
    ]
    tasks = [generate_answer_async(p) for p in prompts]
    answers = await asyncio.gather(*tasks)
    print(answers)

asyncio.run(main())

Output
["The Eiffel Tower is in Paris.", "Python is a programming language."]
Troubleshooting
- If retrieval recall is low, check your embedding model and index construction.
- If generated answers are off-topic, verify the prompt includes sufficient retrieved context.
- For API errors, ensure your OPENAI_API_KEY is set correctly and your usage limits are not exceeded.
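A couple of the checks above can be scripted. This sketch (with illustrative values; a real run would test an actual embedding rather than a zero vector) verifies that the API key is set and that an embedding's dimension and dtype match what the FAISS index expects:

```python
import os
import numpy as np

# Warn if the API key is missing from the environment.
if not os.environ.get("OPENAI_API_KEY"):
    print("warning: OPENAI_API_KEY is not set")

index_dim = 1536  # the dimension the FAISS index was built with
emb = np.zeros(index_dim, dtype=np.float32)  # stand-in for a real embedding

# FAISS requires float32 vectors whose dimension matches the index.
ok = emb.shape[-1] == index_dim and emb.dtype == np.float32
print("embedding checks passed" if ok else "embedding checks FAILED")
```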
Key takeaways
- Evaluate RAG pipelines by measuring both retrieval and generation quality with metrics like recall and ROUGE.
- Use embeddings and vector search (e.g., FAISS) to test retrieval accuracy against known relevant documents.
- Generate answers conditioned on retrieved context and compare to ground truth with automated or human evaluation.
- Leverage async calls and different models to optimize evaluation speed and quality.
- Troubleshoot by verifying embeddings, prompt design, and API key setup.