Code Intermediate medium · 7 min

Batch evaluation across many queries

What you will learn

Process multiple evaluation queries at once using BatchEvalRunner to measure retrieval and generation quality efficiently across your entire dataset.

Why this matters

When you deploy a RAG system, you need to know if it's actually working well across diverse questions. Running evaluations one-by-one is slow and wasteful; batch evaluation lets you assess hundreds of queries in parallel, catching quality issues before production.

Skip if: You have only a handful of test queries (fewer than 10), you're still in exploratory development, or your evaluation metrics are custom and can't run concurrently. For simple spot-checks, single synchronous calls are cleaner.

Explanation

Batch evaluation runs a set of test queries against your index concurrently and scores every response, measuring metrics such as answer relevancy and faithfulness across the whole test set. Instead of looping through queries one by one, you submit them together and collect the results; because the work is dominated by waiting on LLM API calls, overlapping those calls is much faster than running them sequentially. Mechanically, you choose one or more evaluators (for example RelevancyEvaluator and FaithfulnessEvaluator), register them by name with BatchEvalRunner, and pass it your query engine and queries; the runner caps concurrency with its workers setting and returns EvaluationResult objects grouped by evaluator name. This matters for real RAG systems because a single good retrieval doesn't prove your system works: you need evidence across diverse queries, edge cases, and domains before deploying.
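The concurrency pattern behind a batch run can be sketched with the standard library alone. This is a minimal sketch, not BatchEvalRunner's actual implementation: fake_evaluate, run_batch, and the scoring rule are hypothetical stand-ins, and the semaphore plays the role that the workers setting plays in the runner.

```python
import asyncio


async def fake_evaluate(name: str, query: str) -> float:
    """Hypothetical stand-in for one LLM-judged evaluation call."""
    await asyncio.sleep(0.01)  # simulates waiting on a network call
    return 1.0 if "?" in query else 0.0  # arbitrary, deterministic rule


async def run_batch(evaluator_names, queries, workers=4):
    """Run every (query, evaluator) pair with at most `workers` jobs in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(name, query):
        async with semaphore:
            return name, await fake_evaluate(name, query)

    jobs = [bounded(name, q) for q in queries for name in evaluator_names]
    pairs = await asyncio.gather(*jobs)  # gather returns results in submission order

    # Group scores by evaluator name, keeping query order within each list.
    results = {name: [] for name in evaluator_names}
    for name, score in pairs:
        results[name].append(score)
    return results


batch_results = asyncio.run(
    run_batch(["relevancy", "faithfulness"], ["What is ML?", "Explain SGD"], workers=2)
)
print(batch_results)
```

The point of the sketch is the shape of the result: one list of scores per evaluator, with each list aligned to the order of the queries.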

Analogy

Like load testing: instead of checking if your website works by clicking once, you send 1000 concurrent requests to catch bottlenecks and failures that single requests would miss.

Code

Illustrative only - not runnable without a valid API key

python

import asyncio
import os

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Placeholder: prefer exporting OPENAI_API_KEY in your shell over hard-coding it.
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# The same LLM answers the queries and acts as the judge inside both evaluators.
Settings.llm = OpenAI(model="gpt-4.1", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader(
    input_files=["document.txt"]
).load_data()

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

queries = [
    "What is machine learning?",
    "How does backpropagation work?",
    "What are neural networks used for?",
    "Explain gradient descent",
    "What is overfitting in machine learning?"
]

reference_answers = [
    "Machine learning is a subset of AI where systems learn patterns from data without explicit programming.",
    "Backpropagation is an algorithm that computes gradients by propagating errors backward through a neural network.",
    "Neural networks are used for classification, regression, image recognition, natural language processing, and many other tasks.",
    "Gradient descent is an optimization algorithm that iteratively adjusts parameters to minimize a loss function.",
    "Overfitting occurs when a model learns noise in training data and fails to generalize to new data."
]

relevancy_eval = RelevancyEvaluator()
faithfulness_eval = FaithfulnessEvaluator()

# The names given here become the keys of the results dict.
runner = BatchEvalRunner(
    {"relevancy": relevancy_eval, "faithfulness": faithfulness_eval},
    workers=4,  # maximum number of jobs in flight at once
    show_progress=True
)


async def main():
    # Extra keyword lists such as `reference` are forwarded to the evaluators,
    # one item per query. RelevancyEvaluator and FaithfulnessEvaluator judge the
    # response against the retrieved context and ignore `reference`; add a
    # CorrectnessEvaluator if you want answers compared with the references.
    evaluation_results = await runner.aevaluate_queries(
        query_engine,
        queries=queries,
        reference=reference_answers
    )

    # Results are a dict keyed by evaluator name; each value is a list of
    # EvaluationResult objects, one per query, in query order.
    relevancy_results = evaluation_results["relevancy"]
    faithfulness_results = evaluation_results["faithfulness"]

    for query_idx, query in enumerate(queries):
        print(f"\nQuery {query_idx}: {query}")
        print(f"  Relevancy: {relevancy_results[query_idx].score}")
        print(f"  Faithfulness: {faithfulness_results[query_idx].score}")

    relevancy_scores = [r.score for r in relevancy_results]
    faithfulness_scores = [r.score for r in faithfulness_results]

    print(f"\nAverage Relevancy: {sum(relevancy_scores) / len(relevancy_scores):.2f}")
    print(f"Average Faithfulness: {sum(faithfulness_scores) / len(faithfulness_scores):.2f}")


# In a notebook, run `await main()` instead of asyncio.run(main()).
asyncio.run(main())

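Output

Illustrative values only; real scores depend on your documents and model. The library's default relevancy and faithfulness evaluators return pass/fail judgments, so per-query scores are normally 1.0 or 0.0 and the averages read as pass rates.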

Query 0: What is machine learning?
  Relevancy: 0.95
  Faithfulness: 0.92

Query 1: How does backpropagation work?
  Relevancy: 0.88
  Faithfulness: 0.89

Query 2: What are neural networks used for?
  Relevancy: 0.91
  Faithfulness: 0.94

Query 3: Explain gradient descent
  Relevancy: 0.93
  Faithfulness: 0.90

Query 4: What is overfitting in machine learning?
  Relevancy: 0.89
  Faithfulness: 0.87

Average Relevancy: 0.91
Average Faithfulness: 0.90

What just happened?

The code built a vector index from the documents, defined 5 queries with reference answers, created two evaluators (relevancy and faithfulness), and submitted all 5 queries to BatchEvalRunner with up to 4 jobs running concurrently. The runner got an answer for each query from the query engine, then each evaluator had the LLM judge that answer against the query and the retrieved context. These two evaluators don't compare against reference answers; that is the job of an evaluator such as CorrectnessEvaluator. Scores were printed per query, then averaged to show performance across the test set.
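Batch results come back grouped by evaluator name rather than by query, so a per-query view needs a small transposition step. The helper below is a sketch under that assumption: FakeResult is a hypothetical stand-in for an evaluation result that exposes a score field, and rows_per_query is not a library function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for an evaluation result with a `score` field."""
    score: Optional[float]


def rows_per_query(queries, results_by_evaluator):
    """Turn {evaluator: [result, ...]} into one {evaluator: score} row per query."""
    rows = []
    for idx, query in enumerate(queries):
        row = {"query": query}
        for name, results in results_by_evaluator.items():
            row[name] = results[idx].score
        rows.append(row)
    return rows


sample_queries = ["What is machine learning?", "Explain gradient descent"]
sample_results = {
    "relevancy": [FakeResult(1.0), FakeResult(1.0)],
    "faithfulness": [FakeResult(1.0), FakeResult(0.0)],
}
rows = rows_per_query(sample_queries, sample_results)
for row in rows:
    print(row)
```

Rows in this shape are easy to dump to CSV or a dataframe when you want to sort by the weakest metric.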

Common gotcha

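Developers often assume that because one query returns a great answer, the whole system is working. Batch evaluation can expose a system that handles simple questions but fails on complex ones, or that retrieves well but hallucinates during generation. Also check which evaluators actually consume your reference answers: RelevancyEvaluator and FaithfulnessEvaluator judge against the retrieved context, so references only affect evaluators that compare against ground truth, such as CorrectnessEvaluator. When you do use references, keep them aligned with the queries by position; a reference paired with the wrong query lowers scores without raising any error.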

Error recovery

Mismatched number of queries and reference answers (the exception type and message vary by version)

You passed 5 queries but 4 reference answers. Make sure len(queries) == len(reference_answers): each query needs exactly one ground-truth answer at the same position.
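A cheap way to catch this before spending any API calls is to validate the two lists up front. The helper below is a hypothetical sketch (check_aligned is not part of the library); it can only check counts and empty strings, not whether a reference actually matches its query in meaning.

```python
def check_aligned(queries, reference_answers):
    """Raise early if queries and references cannot be paired one-to-one."""
    if len(queries) != len(reference_answers):
        raise ValueError(
            f"{len(queries)} queries but {len(reference_answers)} reference answers"
        )
    # An empty or whitespace-only reference usually signals a data-loading bug.
    empty = [i for i, ref in enumerate(reference_answers) if not ref or not ref.strip()]
    if empty:
        raise ValueError(f"empty reference answers at positions {empty}")
    return list(zip(queries, reference_answers))


pairs = check_aligned(
    ["What is overfitting?"],
    ["When a model memorizes noise and fails to generalize."],
)
print(pairs)
```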

Unsure how high to set workers

workers caps how many evaluation jobs are in flight at once; it is not a count of CPU processes. The work is mostly waiting on API calls, so the practical limit is your provider's rate limit rather than your core count. Use workers=1 to run jobs one at a time while debugging.

OpenAIError: rate_limit_exceeded

A high worker count sends many requests at once and can hit rate limits. Reduce workers to 1-2, add retries with backoff, or request a higher rate limit. Both the query engine and the evaluators make API calls: 5 queries × 2 evaluators is 10 judging calls on top of the 5 answer-generating calls, with up to 4 in flight at a time.
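Retrying with exponential backoff is one common way to ride out rate limits. The sketch below uses only the standard library; RateLimited, with_backoff, and flaky_call are hypothetical names for illustration, and in real code you would catch your provider's own rate-limit exception instead.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


async def with_backoff(make_call, max_attempts=5, base_delay=0.01):
    """Await make_call(), retrying on RateLimited with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the wait each attempt and add jitter to avoid retry bursts.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


attempts = {"count": 0}


async def flaky_call():
    """Fails twice with RateLimited, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited("slow down")
    return "ok"


outcome = asyncio.run(with_backoff(flaky_call))
print(outcome, attempts["count"])
```

The delays here are tiny so the example finishes quickly; against a real API a base delay of a second or more is more typical.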

AttributeError: 'NoneType' object has no attribute 'score'

An evaluation failed silently and left None where a result was expected. Check that your query engine is working (test it manually first) and that your evaluators are initialized with the correct LLM. Note also that a result's score field is optional and may itself be None, so filter out missing scores before averaging.
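Averaging that tolerates missing scores keeps one failed evaluation from crashing the whole report. This is a small sketch with a hypothetical helper name, mean_score; it also reports how many scores were skipped so failures stay visible.

```python
def mean_score(scores):
    """Average the non-None scores and report how many were skipped."""
    valid = [s for s in scores if s is not None]
    skipped = len(scores) - len(valid)
    average = sum(valid) / len(valid) if valid else None
    return average, skipped


average, skipped = mean_score([1.0, None, 0.0, 1.0])
print(f"average={average:.2f} skipped={skipped}")
```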

Experienced dev note

Most teams run evaluation once before deploying, then ignore it. Set up batch evaluation as a continuous gate in your development loop: run it on every document update, every retriever change, and monthly on new production queries. It should be repeatable and cheap enough to run frequently: use a cheaper judge model (such as gpt-3.5-turbo) for continuous checks and save a stronger model (such as gpt-4) for final sign-offs. If your scores drop suddenly, it's not always the index: check whether the hosted LLM's behavior changed, your documents drifted, or your reference answers went stale.
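A continuous gate can be as simple as comparing averaged scores with minimum thresholds and failing the build when any metric falls short. The sketch below is illustrative: the gate function, the metric names, and the threshold values are assumptions, not library features or recommended numbers.

```python
def gate(averages, thresholds):
    """Return a message for each metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = averages.get(metric)
        if value is None or value < minimum:
            failures.append(f"{metric}: {value} is below {minimum}")
    return failures


# Hypothetical averages from a batch run and hypothetical thresholds.
failures = gate(
    {"relevancy": 0.92, "faithfulness": 0.71},
    {"relevancy": 0.85, "faithfulness": 0.80},
)
for message in failures:
    print("FAILED", message)
# In CI, exit with a non-zero status when `failures` is non-empty.
```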

Check your understanding

You're evaluating a RAG system with batch evaluation and notice that relevancy scores are 0.92 but faithfulness scores are 0.71. What does this tell you about your system, and what should you investigate first?

Answer hint

A correct answer recognizes that high relevancy with low faithfulness means the retriever is pulling the right documents, but the generation step is hallucinating or contradicting them. Investigation should focus on the LLM's generation quality, prompt engineering, or context length limits, not on retrieval or the index.
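To act on that diagnosis, it helps to list the specific queries where the two metrics disagree. The sketch below assumes per-query scores between 0.0 and 1.0 held in plain lists aligned with the queries; relevant_but_unfaithful and the 0.5 cutoff are hypothetical choices.

```python
def relevant_but_unfaithful(queries, relevancy_scores, faithfulness_scores, cutoff=0.5):
    """Return queries judged relevant but not faithful to the retrieved context."""
    flagged = []
    for query, rel, faith in zip(queries, relevancy_scores, faithfulness_scores):
        if rel is None or faith is None:
            continue  # skip queries with a missing score
        if rel >= cutoff and faith < cutoff:
            flagged.append(query)
    return flagged


flagged = relevant_but_unfaithful(
    ["What is machine learning?", "Explain gradient descent", "What is overfitting?"],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
)
print(flagged)
```

Reading the generated answers for the flagged queries next to their retrieved context is usually the fastest way to see whether the prompt or the model is the cause.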

VERSION Before 0.10.0, LlamaIndex shipped as a single llama-index package and BatchEvalRunner was imported from llama_index.evaluation; from 0.10.0 onward it lives in llama-index-core and is imported from llama_index.core.evaluation, as in the code above. The runner offers both the async aevaluate_queries() and a synchronous evaluate_queries() wrapper; use the synchronous one when you can't await.

Once you've measured quality across queries with batch evaluation, you can use those scores to curate the best retrieval results. Coming next: ranking and reranking retrieved documents by relevance signals.
