How to · Intermediate · 3 min read

How to evaluate RAG with RAGAS

Quick answer
Use RAGAS to evaluate RAG by scoring both halves of the pipeline: retrieval quality with context precision and context recall, and generation quality with faithfulness and answer relevancy. RAGAS computes these metrics with an LLM judge over your questions, retrieved contexts, generated answers, and reference answers, giving a single comprehensive assessment of the RAG pipeline.

PREREQUISITES

  • Python 3.8+
  • pip install ragas
  • Basic knowledge of RAG systems
  • Your RAG pipeline's questions, retrieved contexts, generated answers, and reference answers
  • An OpenAI API key (or another judge LLM configured), since the core RAGAS metrics call an LLM

Setup

Install the ragas package; it pulls in the Hugging Face datasets library used to hold the evaluation data. Because the core RAGAS metrics are computed by an LLM judge, make sure an API key is available (for example, export OPENAI_API_KEY if you use the default OpenAI backend), and have your questions, retrieved contexts, generated answers, and reference answers ready.

bash
pip install ragas

Step by step

Load your questions, retrieved contexts, generated answers, and reference answers into a Hugging Face Dataset with the columns RAGAS expects, then pass it to ragas.evaluate along with the metrics you want: context_precision and context_recall for retrieval, faithfulness and answer_relevancy for generation. (Exact column names have shifted slightly between RAGAS releases; the schema below follows the widely documented question/contexts/answer/ground_truth layout.)

python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# One row per question: the question, the retrieved contexts (a list of
# strings), the generated answer, and a reference answer for the
# reference-based metrics.
data = {
    "question": [
        "What is the capital of France?",
        "At what temperature does water boil?",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["At sea level, water boils at 100 degrees Celsius (212 °F)."],
    ],
    "answer": [
        "The capital of France is Paris.",
        "Water boils at 100 degrees Celsius.",
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Water boils at 100°C.",
    ],
}
dataset = Dataset.from_dict(data)

# Retrieval quality: context_precision, context_recall.
# Generation quality: faithfulness, answer_relevancy.
result = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

print(result)
output
{'context_precision': 1.0000, 'context_recall': 1.0000, 'faithfulness': 0.9500, 'answer_relevancy': 0.9200}
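
The printed scores are dataset-level aggregates, and real values will differ from the illustrative output above depending on your data and the judge model. For per-question scores, the result object can be converted to a pandas DataFrame; the sketch below assumes the per-row metric columns are named after the metrics, which is how recent RAGAS releases label them.

python
# Per-question scores make it easy to find which queries drag the averages down.
df = result.to_pandas()
print(df[["question", "context_precision", "faithfulness"]])

# Flag answers that are poorly grounded in their retrieved contexts.
print(df[df["faithfulness"] < 0.5])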

Common variations

Rather than always running the full metric suite, you can pick the subset that fits the question you are asking: retrieval metrics only while tuning the retriever, or reference-based generation metrics such as answer_correctness and answer_similarity when you have gold answers. You can also replace the judge LLM and embeddings RAGAS uses internally, which helps with self-hosted models, asynchronous pipelines, or cost control (see the sketch after the example below).

python
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, answer_correctness

# Retrieval-only run: useful while tuning the retriever in isolation.
retrieval_result = evaluate(dataset, metrics=[context_precision, context_recall])

# Reference-based generation run: answer_correctness compares each answer
# against the ground_truth column.
generation_result = evaluate(dataset, metrics=[answer_correctness])

print("Retrieval scores:", retrieval_result)
print("Generation scores:", generation_result)
output
Retrieval scores: {'context_precision': 1.0000, 'context_recall': 1.0000}
Generation scores: {'answer_correctness': 0.7800}
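
The runs above still use RAGAS's default judge. evaluate also accepts explicit llm and embeddings arguments, which is how you plug in a self-hosted or non-OpenAI judge. A minimal sketch, assuming a LangChain chat model and embeddings object; the exact wrapper behaviour can differ between RAGAS releases, so treat this as a starting point rather than the definitive API.

python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Any LangChain-compatible chat model and embeddings can act as the judge.
judge_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
judge_embeddings = OpenAIEmbeddings()

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,
    embeddings=judge_embeddings,
)
print(result)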

Troubleshooting

  • If retrieval scores are unexpectedly low, check that each row's contexts really contain the passages retrieved for that question and that the ground_truth answer matches the question.
  • If generation scores are poor, make sure answers and references use consistent wording and formats, and check whether low faithfulness simply reflects answers that are not grounded in the retrieved contexts.
  • Make sure your installed ragas version matches the documentation you are following; column names and imports have changed between releases (a quick version check is shown below).
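
To see which release you have installed before chasing an API mismatch:

python
from importlib.metadata import version

# Compare against the release targeted by the documentation you are following.
print(version("ragas"))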

Key Takeaways

  • Use RAGAS to jointly evaluate retrieval and generation in RAG systems for a comprehensive quality assessment.
  • Measure retrieval with context_precision and context_recall, and generation with faithfulness, answer_relevancy, or answer_correctness.
  • Tailor the metric subset, judge LLM, and embeddings to your pipeline rather than always running the full suite.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022