CorrectnessEvaluator: answer accuracy
Why this matters
RAG systems retrieve and generate, but they can hallucinate or miss the truth. You need automated metrics to catch when your pipeline returns plausible-sounding but wrong answers before users see them. CorrectnessEvaluator gives you a programmatic gate.
Explanation
What it is: CorrectnessEvaluator is a llama-index evaluation module that takes a generated response and a reference (ground-truth) answer, then uses an LLM to determine if the generated answer is factually correct relative to the reference.
How it works mechanically: You initialize it with an LLM (usually a strong model such as GPT-4, for reliability), then call its evaluate() method with three things: the query, the generated response, and the reference answer. The evaluator passes these to the LLM with a prompt that asks it to rate how relevant and correct the generated answer is given the reference. It returns a score from 1 to 5, a passing flag (true when the score meets the threshold, 4.0 by default), and the LLM's reasoning as feedback. Higher scores mean higher correctness.
When to use it: Use this in evaluation loops after you build your retrieval chain. Run it over a test set of 50–200 queries where you have reference answers. Use the scores to flag regressions before pushing to production, or to compare different retriever configurations.
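The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration, not the library's own tooling: EvalCase, Summary, summarize, is_regression, and the tolerance value are hypothetical names chosen for this example. The only thing assumed about the evaluator is that it exposes evaluate(query=..., response=..., reference=...) and returns an object with score and passing, as CorrectnessEvaluator does.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class EvalCase:
    """One test-set entry (hypothetical container for this sketch)."""
    query: str
    response: str   # answer produced by your pipeline
    reference: str  # ground-truth answer


@dataclass
class Summary:
    pass_rate: float
    mean_score: float
    failures: List[str]  # queries whose result did not pass


def summarize(evaluator, cases: Sequence[EvalCase]) -> Summary:
    """Run the evaluator over every case and aggregate the results."""
    if not cases:
        raise ValueError("test set is empty")
    results = [
        evaluator.evaluate(query=c.query, response=c.response, reference=c.reference)
        for c in cases
    ]
    # A score can be missing if the judge's output could not be parsed.
    scores = [r.score for r in results if r.score is not None]
    failures = [c.query for c, r in zip(cases, results) if not r.passing]
    return Summary(
        pass_rate=(len(cases) - len(failures)) / len(cases),
        mean_score=mean(scores) if scores else float("nan"),
        failures=failures,
    )


def is_regression(current: Summary, baseline: Summary, tolerance: float = 0.02) -> bool:
    """Flag a run whose pass rate drops more than `tolerance` below the baseline."""
    return current.pass_rate < baseline.pass_rate - tolerance
```

To compare two retriever configurations, you could call summarize once per configuration on the same test set and compare the two summaries; the failures list shows which queries to inspect first.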
Analogy
Think of it like peer review in academic publishing: you give a reviewer (the LLM) the paper (the generated answer), the gold standard (the reference), and a rubric (the evaluator's prompt). The reviewer compares them and returns a rating, a pass or fail verdict, and detailed reasoning.
Code
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

# Judge LLM; temperature=0 keeps scoring as repeatable as possible.
llm = OpenAI(model='gpt-4.1', temperature=0)
evaluator = CorrectnessEvaluator(llm=llm)

query = "What year was Python first released?"
generated_answer = "Python was created in 1989."  # disagrees with the reference
reference_answer = "Python was first released in 1991."  # ground truth

# Ask the judge LLM to grade the generated answer against the reference.
result = evaluator.evaluate(
    query=query,
    response=generated_answer,
    reference=reference_answer,
)

print(f"Score: {result.score}")
print(f"Passing: {result.passing}")
print(f"Feedback: {result.feedback}")
Example output (the exact score and wording vary by model and run):
Score: 2.0
Passing: False
Feedback: The generated answer states that Python was created in 1989, but the reference answer clearly indicates that Python was first released in 1991. The generated answer is incorrect.
What just happened?
The evaluator received a generated answer that conflicts with the reference (1989 vs. 1991) and sent it to the LLM along with the reference answer and the query. The LLM compared them and returned a low score on the evaluator's 1-to-5 scale, below the default passing threshold of 4.0, so passing is False. The feedback explains why the generated answer was judged wrong.
Common gotcha
Developers often forget that CorrectnessEvaluator depends entirely on the quality of your reference answers. If your reference is wrong or ambiguous, the evaluator will give you confidently incorrect feedback. The LLM doing the evaluation is also sensitive to how the reference answer is phrased, so always write clear, unambiguous reference answers. If the evaluator seems too rigid, consider a slightly higher temperature (0.1–0.3), keeping in mind that any temperature above 0 makes scores less reproducible between runs.
Error recovery
ValueError: 'response' must be a string
OpenAIError: model 'gpt-3.5-turbo' does not support function calling
AttributeError: 'EvaluationResult' object has no attribute 'score'
Experienced dev note
CorrectnessEvaluator calls the LLM on every evaluation: that's expensive at scale. In production, batch evaluate on a representative subset (100 queries) on each deploy, not every query. Also, correctness alone isn't enough. Pair it with faithfulness (does the response use only retrieved context?) and relevance (did you retrieve the right documents?). A response can be correct but irrelevant to what the user asked. Use a suite of evaluators, not just this one.
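One way to act on both points, sketched under stated assumptions: sample a fixed-size subset reproducibly so each deploy is graded on the same queries, then report a pass rate per evaluator. The names sample_cases and run_suite, the Case dictionary shape, and the default seed are hypothetical; each check is assumed to be a small function you write that wraps one evaluator call (for correctness, something like returning evaluator.evaluate(query=..., response=..., reference=...).passing).

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical shape of one logged interaction; adapt the keys to your own data.
Case = Dict[str, object]


def sample_cases(cases: Sequence[Case], k: int = 100, seed: int = 0) -> List[Case]:
    """Pick at most k cases with a fixed seed so every deploy sees the same subset."""
    if len(cases) <= k:
        return list(cases)
    return random.Random(seed).sample(list(cases), k)


def run_suite(
    cases: Sequence[Case],
    checks: Dict[str, Callable[[Case], bool]],
) -> Dict[str, float]:
    """Return the pass rate for each named check over the given cases.

    Each check wraps one evaluator (correctness, faithfulness, relevance, ...)
    and returns True when that evaluator passes the case.
    """
    if not cases:
        raise ValueError("no cases to evaluate")
    return {
        name: sum(1 for case in cases if check(case)) / len(cases)
        for name, check in checks.items()
    }
```

Keeping the pass rates separate per evaluator, rather than averaging them into one number, makes it easier to see whether a regression comes from retrieval or from generation.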
Check your understanding
Your RAG system returns the answer 'Paris' to 'What is the capital of France?' and your reference answer is 'The capital of France is Paris, located on the Seine river.' If you run CorrectnessEvaluator on these, it will likely pass. Now your system returns 'The capital is in Europe' to the same query. What would the evaluator likely return, and why? What would be wrong with relying only on this evaluator to measure your system's quality?
Answer hint
A correct answer explains that the first response would pass because the core fact (Paris) is correct even though it is less detailed. The second would fail because it is too vague: it never names the capital, so it does not match the reference. The insight is that you need multiple evaluators because correctness alone doesn't measure relevance, specificity, or retrieval quality, all of which matter for user satisfaction.