Code Intermediate medium · 6 min

CorrectnessEvaluator: answer accuracy

What you will learn

CorrectnessEvaluator measures whether your RAG pipeline's answers are factually accurate by comparing generated responses against reference answers.

Why this matters

RAG systems retrieve and generate, but they can hallucinate or miss the truth. You need automated metrics to catch when your pipeline returns plausible-sounding but wrong answers before users see them. CorrectnessEvaluator gives you a programmatic gate.

Skip if: You have no ground-truth reference answers to compare against, or your domain allows multiple valid answers that the evaluator can't reason about (e.g., creative writing, code review styles). For those cases, use semantic similarity or custom judges instead.

Explanation

What it is: CorrectnessEvaluator is a LlamaIndex evaluation module that takes a generated response and a reference (ground-truth) answer, then uses an LLM to judge whether the generated answer is factually correct relative to the reference.

How it works mechanically: You initialize it with an LLM (usually a strong model such as GPT-4, for reliability), then call its evaluate() method with three things: the query, the generated response, and the reference answer. The evaluator passes these to the LLM with a prompt that asks, in effect, "is the generated answer correct given the reference?" It returns a score from 1 to 5 and the reasoning behind it; higher scores mean higher correctness. The result also carries a passing flag, which is true when the score meets the threshold (score_threshold, 4.0 by default).
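To make the score-and-threshold step concrete, here is a minimal sketch in plain Python. It does not call LlamaIndex or an LLM: `Verdict` and `parse_judge_output` are hypothetical names, and the assumed judge output format (score on the first line, reasoning after it) is a simplification for illustration, not the library's actual parser. The 1-to-5 scale and the 4.0 threshold used in the example are assumptions about the library's defaults.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical container mirroring the score / passing / feedback fields."""
    score: float
    passing: bool
    feedback: str


def parse_judge_output(raw: str, threshold: float = 4.0) -> Verdict:
    """Split a judge's raw text into a numeric score and free-text reasoning.

    Assumes the first line holds the score and everything after it is reasoning.
    """
    first_line, _, rest = raw.strip().partition("\n")
    score = float(first_line.strip())  # raises ValueError on a malformed score
    return Verdict(score=score, passing=score >= threshold, feedback=rest.strip())


# Hand-written judge output, standing in for a real LLM response.
verdict = parse_judge_output("2.0\nThe answer gives 1989, but the reference says 1991.")
print(verdict.score, verdict.passing)
```

The point of the sketch is that "passing" is nothing more than the numeric score compared against a threshold, so tightening or loosening the gate is a one-parameter change.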

When to use it: Use this in evaluation loops after you build your retrieval chain. Run it over a test set of 50–200 queries where you have reference answers. Use the scores to flag regressions before pushing to production, or to compare different retriever configurations.
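The evaluation loop described above might be organized along these lines. This is a sketch under stated assumptions: `summarize_run`, `is_regression`, and `toy_judge` are hypothetical helpers, the 4.0 threshold and 0.02 tolerance are illustrative values, and the judge is a plain function so the example runs without an LLM.

```python
from statistics import mean
from typing import Callable, Dict, List

# A judge takes (query, response, reference) and returns a numeric score.
Judge = Callable[[str, str, str], float]


def summarize_run(cases: List[Dict[str, str]], judge: Judge, threshold: float = 4.0) -> Dict[str, float]:
    """Score every test case and aggregate into a mean score and a pass rate."""
    scores = [judge(c["query"], c["response"], c["reference"]) for c in cases]
    return {
        "mean_score": mean(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }


def is_regression(baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.02) -> bool:
    """Flag a candidate whose pass rate drops more than `tolerance` below the baseline."""
    return candidate["pass_rate"] < baseline["pass_rate"] - tolerance


def toy_judge(query: str, response: str, reference: str) -> float:
    # Stand-in for an LLM judge: exact match only, purely for demonstration.
    return 5.0 if response == reference else 1.0


cases = [
    {"query": "q1", "response": "a", "reference": "a"},
    {"query": "q2", "response": "b", "reference": "c"},
]
summary = summarize_run(cases, toy_judge)
print(summary)
```

In real use, the judge would wrap a call such as evaluator.evaluate(query=..., response=..., reference=...).score, and the baseline summary would come from the last known-good configuration.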

Analogy

Think of it like peer review in academic publishing: you give a reviewer (the LLM) the paper (generated answer), the gold standard (reference), and a rubric (the evaluator's prompt), and they compare them and give you a thumbs up or down with detailed reasoning.

Code

python

from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

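# Requires OPENAI_API_KEY in the environment.
# temperature=0 keeps the judge's scoring as repeatable as possible.
llm = OpenAI(model='gpt-4.1', temperature=0)
evaluator = CorrectnessEvaluator(llm=llm)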

query = "What year was Python first released?"
generated_answer = "Python was created in 1989."
reference_answer = "Python was first released in 1991."

result = evaluator.evaluate(
    query=query,
    response=generated_answer,
    reference=reference_answer
)

print(f"Score: {result.score}")
print(f"Passing: {result.passing}")
print(f"Feedback: {result.feedback}")

Output

Score: 2.0
Passing: False
Feedback: The generated answer states that Python was created in 1989, but the reference answer clearly indicates that Python was first released in 1991. The generated answer is incorrect.

What just happened?

The evaluator received a generated answer that did not match the reference (1989 vs. 1991) and sent it to the LLM (gpt-4.1 here) along with the reference answer and the query. The LLM compared them and returned a low score with its reasoning. Scores run from 1 to 5 and the default passing threshold is 4.0, so Passing is False. The exact score and feedback wording vary by model and run, so treat the output above as illustrative.

Common gotcha

Developers often forget that CorrectnessEvaluator depends entirely on the quality of your reference answers. If your reference is wrong or ambiguous, the evaluator will give you confidently incorrect feedback. The LLM doing the evaluation can also be influenced by how the reference answer is phrased, so always use clear, unambiguous reference answers. Keep the evaluator's temperature at or near 0 so that scores are repeatable between runs; if the evaluator seems too rigid, adjust the reference answers or the evaluation prompt rather than raising the temperature.

Error recovery

ValueError: 'response' must be a string

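You passed a non-string object (e.g., a list or dict) to the response parameter. Convert it to a string first: response=str(your_response). If you have a Response object from a query engine, you can instead pass it to evaluate_response(), which accepts the object directly.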

OpenAIError: model 'gpt-3.5-turbo' does not support function calling

CorrectnessEvaluator needs a capable model for reasoning. Use gpt-4.1 or gpt-4-turbo instead. Older models may fail silently or return uninformative scores.

AttributeError: 'EvaluationResult' object has no attribute 'score'

You're using the old evaluation API. Upgrade to llama-index-core >=0.12.0: pip install --upgrade llama-index-core

Experienced dev note

CorrectnessEvaluator calls the LLM on every evaluation: that's expensive at scale. In production, batch evaluate on a representative subset (100 queries) on each deploy, not every query. Also, correctness alone isn't enough. Pair it with faithfulness (does the response use only retrieved context?) and relevance (did you retrieve the right documents?). A response can be correct but irrelevant to what the user asked. Use a suite of evaluators, not just this one.
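A sketch of the two ideas above, assuming nothing beyond the standard library: `sample_eval_subset` and `suite_passes` are hypothetical helpers, and the evaluator names in the dictionary are illustrative labels rather than library identifiers. Fixing the random seed keeps the sampled subset identical from one deploy to the next, so score changes reflect the pipeline rather than the sample.

```python
import random
from typing import Dict, List, Sequence


def sample_eval_subset(queries: Sequence[str], k: int = 100, seed: int = 0) -> List[str]:
    """Pick a fixed-size subset; the seed makes the choice repeatable across deploys."""
    if len(queries) <= k:
        return list(queries)
    return random.Random(seed).sample(list(queries), k)


def suite_passes(verdicts: Dict[str, bool]) -> bool:
    """A response passes the suite only if every evaluator passed it."""
    return all(verdicts.values())


all_queries = [f"query-{i}" for i in range(1000)]
subset = sample_eval_subset(all_queries, k=100, seed=42)
print(len(subset))

# Correct and faithful, but flagged as irrelevant: the suite as a whole fails.
print(suite_passes({"correctness": True, "faithfulness": True, "relevancy": False}))
```

Requiring every evaluator to pass is the strictest way to combine them; a weighted score is an alternative if a single failing dimension should not block a deploy.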

Check your understanding

Your RAG system returns the answer 'Paris' to 'What is the capital of France?' and your reference answer is 'The capital of France is Paris, located on the Seine river.' If you run CorrectnessEvaluator on these, it will likely pass. Now your system returns 'The capital is in Europe' to the same query. What would the evaluator likely return, and why? What would be wrong with relying only on this evaluator to measure your system's quality?

Answer hint

A correct answer explains that the first response would pass because the core fact (Paris) is correct even though it's less detailed. The second would fail because it's too vague and not factually complete. The insight is that you need multiple evaluators because correctness alone doesn't measure relevance, specificity, or retrieval quality, all of which matter for user satisfaction.

VERSION The evaluation modules have lived under llama_index.core.evaluation since llama-index 0.10.0, the release that introduced the llama-index-core package. In versions before 0.10.0, the import path was from llama_index.evaluation import CorrectnessEvaluator. If you're on 0.9.x or earlier, upgrade or adjust your imports.
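For code that has to run against both the newer and the older import path mentioned above, one option is a guarded import. This is a sketch rather than a recommended pattern: `import_path` is a hypothetical variable used only to report which path succeeded, and upgrading is usually the simpler fix.

```python
# Try the newer namespaced import first, then fall back to the older flat path.
try:
    from llama_index.core.evaluation import CorrectnessEvaluator
    import_path = "llama_index.core.evaluation"
except ImportError:
    try:
        from llama_index.evaluation import CorrectnessEvaluator
        import_path = "llama_index.evaluation"
    except ImportError:
        # Neither path is importable: the library is probably not installed.
        CorrectnessEvaluator = None
        import_path = None

print(f"CorrectnessEvaluator imported from: {import_path}")
```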

Now that you can measure correctness, learn FaithfulnessEvaluator to ensure your generated answers actually use the retrieved documents and don't hallucinate facts outside your knowledge base.
