Intermediate · 3 min read

How to build an eval dataset for RAG

Quick answer
To build an eval dataset for RAG, collect queries paired with the documents that answer them and with ground-truth answers, so that each example exercises both retrieval and generation. Annotating every query with its relevant documents and expected answer lets you measure retrieval accuracy and generation quality separately.

PREREQUISITES

  • Python 3.8+
  • Basic knowledge of NLP and retrieval systems
  • pip install pandas datasets

Setup

Install necessary Python packages for dataset handling and annotation.
bash
pip install pandas datasets

Step by step

Create a simple eval dataset with queries, relevant documents, and ground truth answers. This dataset will test both retrieval and generation components of RAG.
python
import pandas as pd

# Example data: queries, documents, and answers
queries = [
    "Who wrote the novel '1984'?",
    "What is the capital of France?",
    "Explain photosynthesis in plants."
]
documents = [
    "George Orwell wrote the novel '1984' in 1949.",
    "Paris is the capital city of France.",
    "Photosynthesis is the process by which green plants use sunlight to synthesize foods from carbon dioxide and water."
]
answers = [
    "George Orwell",
    "Paris",
    "Photosynthesis is how plants convert sunlight into energy."
]

# Build a DataFrame representing the eval dataset
eval_df = pd.DataFrame({
    "query": queries,
    "document": documents,
    "answer": answers
})

print(eval_df)
output
                               query  \
0        Who wrote the novel '1984'?
1     What is the capital of France?
2  Explain photosynthesis in plants.

                                              document  \
0       George Orwell wrote the novel '1984' in 1949.
1                Paris is the capital city of France.
2  Photosynthesis is the process by which green pl...

                                               answer
0                                      George Orwell
1                                              Paris
2  Photosynthesis is how plants convert sunlight i...

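With the dataset in place, you can already score a retriever end to end. The sketch below uses a toy keyword-overlap retriever purely for illustration; in practice you would plug in your real retriever. The `tokenize` and `keyword_retrieve` helpers are illustrative, not part of any library.

```python
import re
import pandas as pd

# Same toy eval dataset as above
eval_df = pd.DataFrame({
    "query": [
        "Who wrote the novel '1984'?",
        "What is the capital of France?",
        "Explain photosynthesis in plants.",
    ],
    "document": [
        "George Orwell wrote the novel '1984' in 1949.",
        "Paris is the capital city of France.",
        "Photosynthesis is the process by which green plants use "
        "sunlight to synthesize foods from carbon dioxide and water.",
    ],
    "answer": [
        "George Orwell",
        "Paris",
        "Photosynthesis is how plants convert sunlight into energy.",
    ],
})

def tokenize(text):
    """Lowercase word tokens with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def keyword_retrieve(query, corpus):
    """Toy retriever: pick the document with the largest word overlap."""
    q_tokens = tokenize(query)
    return max(corpus, key=lambda doc: len(q_tokens & tokenize(doc)))

# Score retrieval: did the retriever return the annotated gold document?
corpus = eval_df["document"].tolist()
hits = sum(
    keyword_retrieve(q, corpus) == gold
    for q, gold in zip(eval_df["query"], eval_df["document"])
)
print(f"Retrieval accuracy: {hits / len(eval_df):.2f}")
```

Swapping `keyword_retrieve` for your production retriever turns this into a real retrieval-accuracy check; generation quality needs a separate comparison of model outputs against the `answer` column.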
Common variations

You can extend the dataset by adding multiple relevant documents per query, or by including metadata such as document source and retrieval scores. If your eval harness loads data asynchronously or as a stream, adapt the data-loading step accordingly.
python
from datasets import Dataset

# Convert pandas DataFrame to Hugging Face Dataset for advanced usage
hf_dataset = Dataset.from_pandas(eval_df)

print(hf_dataset)
output
Dataset({
    features: ['query', 'document', 'answer'],
    num_rows: 3
})

Troubleshooting

If your eval results show poor retrieval, verify document relevance and coverage. For generation errors, check answer consistency and annotation quality. Ensure queries are clear and unambiguous.
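Many of these problems can be caught before running a single model call. Below is a minimal sanity check over the three-column schema used above; the specific checks (empty documents, missing answers, ungrounded short answers) are illustrative examples, not an exhaustive validation suite:

```python
import pandas as pd

# Deliberately broken example rows: the second row is missing its document
eval_df = pd.DataFrame({
    "query": ["Who wrote the novel '1984'?", "What is the capital of France?"],
    "document": ["George Orwell wrote the novel '1984' in 1949.", ""],
    "answer": ["George Orwell", "Paris"],
})

problems = []
for i, row in eval_df.iterrows():
    if not row["document"].strip():
        problems.append((i, "empty gold document"))
    if not row["answer"].strip():
        problems.append((i, "missing ground-truth answer"))
    # Short, extractive answers should normally appear verbatim in the document
    elif len(row["answer"].split()) <= 3 and row["answer"] not in row["document"]:
        problems.append((i, "short answer not grounded in its document"))

print(problems)
```

Running checks like these on every dataset revision keeps annotation drift from silently degrading your eval scores.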

Key Takeaways

  • Include queries paired with relevant documents and ground truth answers to evaluate both retrieval and generation.
  • Use structured formats like pandas DataFrame or Hugging Face Dataset for easy manipulation and scaling.
  • Annotate documents carefully to ensure retrieval relevance and answer correctness for reliable evaluation.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022