How-to · Intermediate · 3 min read

How to evaluate generation quality in RAG

Quick answer
To evaluate generation quality in RAG, use automatic metrics like ROUGE, BLEU, and METEOR to compare generated text against references, combined with human evaluation for factual accuracy and relevance. Additionally, assess retrieval quality since it directly impacts generation correctness and coherence.

PREREQUISITES

  • Python 3.8+
  • pip install rouge-score nltk
  • Basic knowledge of retrieval-augmented generation (RAG)

Set up the evaluation environment

Install the Python packages needed for the text-similarity metrics used below.

```bash
pip install rouge-score nltk
```

Step-by-step evaluation code

This example shows how to compute ROUGE scores to evaluate RAG output quality by comparing generated text to reference answers.

```python
from rouge_score import rouge_scorer

# Sample reference and generated texts
reference = "The capital of France is Paris."
generated = "Paris is the capital city of France."

# Initialize the ROUGE scorer with unigram (rouge1) and
# longest-common-subsequence (rougeL) variants
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# score(target, prediction): the reference comes first
scores = scorer.score(reference, generated)

print(f"ROUGE-1 F1 score: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1 score: {scores['rougeL'].fmeasure:.3f}")
```

Output:
```
ROUGE-1 F1 score: 0.923
ROUGE-L F1 score: 0.615
```

Note that ROUGE-L is lower than ROUGE-1 here: the two sentences share almost all their words, but word order differs, so the longest common subsequence is shorter than the unigram overlap.

Common variations

You can extend evaluation by:

  • Using BLEU or METEOR metrics from nltk for n-gram precision and recall.
  • Incorporating human evaluation for factual correctness and relevance, especially important in RAG where retrieval affects output.
  • Evaluating retrieval quality separately using metrics like Recall@k or MRR to understand its impact on generation.
nltk's `sentence_bleu` is case-sensitive and defaults to 4-gram precision, which is zero for sentences this short, so the example below lowercases the tokens and weights only unigrams and bigrams:

```python
from nltk.translate.bleu_score import sentence_bleu

# BLEU matches exact tokens, so normalize case first
reference = [['the', 'capital', 'of', 'france', 'is', 'paris']]
generated = ['paris', 'is', 'the', 'capital', 'city', 'of', 'france']

# Weight unigram and bigram precision equally
# (the default 4-gram weights would score 0 on sentences this short)
bleu_score = sentence_bleu(reference, generated, weights=(0.5, 0.5))
print(f"BLEU score: {bleu_score:.2f}")
```

Output:
```
BLEU score: 0.53
```
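
The retrieval-side metrics mentioned above, Recall@k and MRR, take only a few lines to compute. The document IDs and relevance judgments in this sketch are hypothetical:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents found in the top-k retrieved results."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

def mrr(retrieved_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)

# Hypothetical retrieval results for two queries
retrieved = [["d3", "d1", "d7"], ["d5", "d2", "d9"]]
relevant = [{"d1", "d4"}, {"d2"}]

print(recall_at_k(retrieved[0], relevant[0], k=3))  # 0.5 (found d1 but not d4)
print(mrr(retrieved, relevant))                     # 0.5 (first hit at rank 2 for both queries)
```

Tracking these alongside ROUGE/BLEU tells you whether a low generation score is the generator's fault or the retriever's.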

Troubleshooting evaluation issues

If automatic metrics show low scores but outputs seem correct, consider:

  • Human evaluation for factuality and relevance, since metrics like ROUGE focus on lexical overlap.
  • Checking retrieval results quality, as poor retrieval leads to poor generation.
  • Using multiple references to improve metric reliability.
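
To illustrate the multiple-references point: `sentence_bleu` accepts several references per candidate and counts each n-gram against the best-matching reference. The second reference phrasing here is invented for the example:

```python
from nltk.translate.bleu_score import sentence_bleu

# Two hypothetical reference phrasings of the same answer (lowercased tokens)
references = [
    ['paris', 'is', 'the', 'capital', 'of', 'france'],
    ['the', 'capital', 'of', 'france', 'is', 'paris'],
]
generated = ['paris', 'is', 'the', 'capital', 'city', 'of', 'france']

# Unigram + bigram weights, as in the earlier example
score = sentence_bleu(references, generated, weights=(0.5, 0.5))
print(f"Multi-reference BLEU: {score:.2f}")  # 0.76
```

The same candidate scores noticeably higher than against a single reference, because more of its bigrams now match at least one acceptable phrasing.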

Key Takeaways

  • Use automatic metrics like ROUGE and BLEU to quantify generation quality in RAG.
  • Human evaluation is essential to assess factual accuracy and relevance beyond lexical similarity.
  • Evaluate retrieval quality separately since it directly impacts generation correctness.
  • Combine multiple metrics and human judgment for robust RAG evaluation.
  • Low automatic scores may not always indicate poor quality; verify with human review.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022