What is answer relevance in RAG evaluation?
In Retrieval-Augmented Generation (RAG) evaluation, answer relevance measures how well the generated answer aligns with the retrieved documents and the original query, ensuring the response is factually grounded and contextually appropriate. It quantifies how accurately a generated answer reflects the information retrieved from external knowledge sources, which makes it a key metric for assessing the quality and trustworthiness of RAG outputs.
How it works
Answer relevance in RAG evaluation assesses whether the generated answer correctly uses the retrieved documents to respond to the user's query. Imagine a librarian (retriever) fetching books (documents) for a question, and a writer (generator) composing an answer based on those books. Answer relevance checks if the writer's answer truly reflects the content of the fetched books and addresses the question accurately, rather than hallucinating or deviating.
This involves comparing the generated answer against the retrieved passages for factual consistency and topical alignment, often using metrics like exact match, ROUGE, or embedding similarity.
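To make the comparison step concrete, here is a minimal sketch of two lexical scores computed between an answer and a single retrieved passage. It is an illustration under stated assumptions, not a reference implementation: `tokenize`, `unigram_f1`, and `bow_cosine` are hypothetical helper names, the F1 score is only in the spirit of ROUGE-1 (no stemming or stopword handling), and the bag-of-words cosine is a crude stand-in for similarity over learned embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(answer: str, passage: str) -> float:
    """ROUGE-1-style F1: clipped unigram overlap between answer and passage."""
    a, p = Counter(tokenize(answer)), Counter(tokenize(passage))
    overlap = sum((a & p).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(p.values())
    return 2 * precision * recall / (precision + recall)


def bow_cosine(answer: str, passage: str) -> float:
    """Cosine similarity of bag-of-words count vectors (stand-in for embeddings)."""
    a, p = Counter(tokenize(answer)), Counter(tokenize(passage))
    dot = sum(a[t] * p[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


answer = "Alexander Graham Bell invented the telephone in 1876."
passage = "The telephone was invented in 1876 by Alexander Graham Bell."
print(f"unigram F1: {unigram_f1(answer, passage):.2f}")
print(f"bag-of-words cosine: {bow_cosine(answer, passage):.2f}")
```

Lexical scores like these reward shared wording rather than shared meaning, so a paraphrased but correct answer can score low. That is one reason embedding similarity or a model-based judge is often used alongside them.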
Concrete example
Suppose a RAG system retrieves two documents for the query "Who invented the telephone?" and generates the answer "Alexander Graham Bell invented the telephone in 1876." Answer relevance evaluates if this answer is supported by the retrieved documents.
from openai import OpenAI
import os

# The API key is read from the OPENAI_API_KEY environment variable.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

query = "Who invented the telephone?"
retrieved_docs = [
    "Alexander Graham Bell was a Scottish-born inventor credited with inventing the first practical telephone.",
    "The telephone was invented in 1876 by Alexander Graham Bell.",
]
answer = "Alexander Graham Bell invented the telephone in 1876."

# Simple relevance check using an LLM as the judge (conceptual example)
messages = [
    {"role": "system", "content": "You are an evaluator that checks if the answer is relevant to the retrieved documents and query."},
    {"role": "user", "content": f"Query: {query}\nDocuments: {retrieved_docs}\nAnswer: {answer} Is this answer relevant and supported by the documents? Reply yes or no with explanation."},
]

# Send the evaluation prompt and print the judge's free-text verdict.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
print(response.choices[0].message.content)

Example output (the exact wording varies between runs):
Yes, the answer is relevant and supported by the documents because both retrieved passages explicitly state that Alexander Graham Bell invented the telephone in 1876.
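The judge in this example replies in free text, which is awkward to aggregate over many queries. One possible way to turn such a reply into a label is sketched below. `parse_verdict` is a hypothetical helper, and it assumes the model follows the "Reply yes or no" instruction and leads with its verdict; replies that do not are returned as `None` so they can be reviewed by hand.

```python
import re
from typing import Optional


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a judge reply to True ('yes'), False ('no'), or None (unclear)."""
    # Only the leading word is inspected; the explanation after it is ignored.
    match = re.match(r"\s*(yes|no)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


reply = "Yes, the answer is relevant and supported by the documents."
print(parse_verdict(reply))
```

Asking the model for a constrained output format would make parsing more robust than matching on the first word, at the cost of a more elaborate prompt.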
When to use it
Use answer relevance evaluation when deploying or benchmarking RAG systems to ensure generated answers are factually grounded and trustworthy. It is critical in domains like healthcare, legal, or customer support where accuracy is paramount.
Avoid relying solely on answer relevance if you need to evaluate creativity or open-ended generation, as it focuses on factual alignment rather than style or novelty.
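For the benchmarking use case, per-query scores usually need to be rolled up into a summary. The sketch below shows one way this could look. `EvalResult` and `summarize` are hypothetical names, the 0.7 threshold is an arbitrary assumption to be tuned per application, and the scores are placeholders for illustration rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    query: str
    score: float  # relevance score in [0, 1] from any metric or judge


def summarize(results: list[EvalResult], threshold: float = 0.7) -> dict:
    """Return the mean score, the pass rate, and the queries below the threshold."""
    if not results:
        raise ValueError("no results to summarize")
    failing = [r.query for r in results if r.score < threshold]
    return {
        "mean_score": sum(r.score for r in results) / len(results),
        "pass_rate": 1 - len(failing) / len(results),
        "failing_queries": failing,
    }


# Placeholder scores for illustration only, not measured results.
results = [
    EvalResult("Who invented the telephone?", 0.9),
    EvalResult("When was the telephone invented?", 0.5),
]
report = summarize(results)
print(report)
```

Listing the failing queries alongside the averages keeps the summary actionable, since a single mean can hide a small number of badly grounded answers.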
Key terms
| Term | Definition |
|---|---|
| Answer relevance | Metric measuring how well a generated answer aligns with retrieved documents and query. |
| Retrieval-Augmented Generation (RAG) | AI architecture combining retrieval of documents with language model generation. |
| Retriever | Component that fetches relevant documents from a knowledge base. |
| Generator | Language model that produces answers based on retrieved documents. |
| Factual grounding | Ensuring generated content is supported by real-world data or documents. |
Key takeaways
- Answer relevance ensures RAG outputs are factually supported by retrieved documents.
- It is essential for trustworthiness in knowledge-intensive AI applications.
- Evaluation often involves comparing generated answers to retrieved passages for consistency.
- Use answer relevance metrics when factual accuracy is critical, not for creative tasks.