Batch evaluation across many queries
Why this matters
When you deploy a RAG system, you need to know whether it actually works well across diverse questions. Running evaluations one by one is slow and wasteful; batch evaluation lets you assess hundreds of queries concurrently and catch quality issues before production.
Explanation
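Batch evaluation runs a whole test set of queries against your index concurrently and scores each response with one or more evaluators, measuring qualities such as answer relevancy and faithfulness to the retrieved context. Instead of looping through queries one by one, you submit them together and collect the results. Most of the time is spent waiting on LLM calls, so overlapping those calls is usually much faster, although it makes the same number of API calls and can hit rate limits sooner. Mechanically, you instantiate the evaluators you want (for example RelevancyEvaluator and FaithfulnessEvaluator), pass them to BatchEvalRunner as a dictionary keyed by name, and call aevaluate_queries with your query engine and queries. The runner handles the concurrency and returns a dictionary that maps each evaluator name to a list of EvaluationResult objects, one per query. Reference answers are only needed by reference-based evaluators such as CorrectnessEvaluator. This is essential for real RAG systems because a single good retrieval doesn't prove your system works: you need statistical evidence across diverse queries, edge cases, and domains before deploying.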
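The concurrency idea behind a batch runner can be sketched without any LLM calls. The example below is a toy illustration of the general pattern (a semaphore that caps how many jobs are in flight), not LlamaIndex's implementation; `fake_evaluate` and `run_batch` are hypothetical names, and the score is a deterministic placeholder.

```python
import asyncio


async def fake_evaluate(query: str) -> float:
    """Hypothetical stand-in for one LLM-judged evaluation."""
    await asyncio.sleep(0.01)  # simulate waiting on a network call
    return float(len(query) % 2)  # deterministic placeholder score


async def run_batch(queries: list[str], workers: int = 4) -> list[float]:
    """Evaluate all queries concurrently, at most `workers` at a time."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(query: str) -> float:
        # The semaphore caps how many evaluations are in flight at once.
        async with semaphore:
            return await fake_evaluate(query)

    # gather preserves input order, so scores[i] belongs to queries[i].
    return await asyncio.gather(*(bounded(q) for q in queries))


if __name__ == "__main__":
    print(asyncio.run(run_batch(["a", "bb", "ccc"], workers=2)))
    # prints [1.0, 0.0, 1.0]
```

Raising `workers` shortens wall-clock time until you reach your provider's rate limit; it does not reduce the number of calls made.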
Analogy
Like load testing: instead of checking if your website works by clicking once, you send 1000 concurrent requests to catch bottlenecks and failures that single requests would miss.
Code
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.evaluation import RelevancyEvaluator, FaithfulnessEvaluator, BatchEvalRunner
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
Settings.llm = OpenAI(model="gpt-4.1", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
documents = SimpleDirectoryReader(
    input_files=["document.txt"]
).load_data()
index = VectorStoreIndex.from_documents(documents)
queries = [
    "What is machine learning?",
    "How does backpropagation work?",
    "What are neural networks used for?",
    "Explain gradient descent",
    "What is overfitting in machine learning?"
]
reference_answers = [
    "Machine learning is a subset of AI where systems learn patterns from data without explicit programming.",
    "Backpropagation is an algorithm that computes gradients by propagating errors backward through a neural network.",
    "Neural networks are used for classification, regression, image recognition, natural language processing, and many other tasks.",
    "Gradient descent is an optimization algorithm that iteratively adjusts parameters to minimize a loss function.",
    "Overfitting occurs when a model learns noise in training data and fails to generalize to new data."
]
relevancy_eval = RelevancyEvaluator()
faithfulness_eval = FaithfulnessEvaluator()
runner = BatchEvalRunner(
    {"relevancy": relevancy_eval, "faithfulness": faithfulness_eval},
    workers=4,
    show_progress=True
)
query_engine = index.as_query_engine()
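# Top-level await works in a notebook; in a script, wrap this call in an
# async function and run it with asyncio.run().
# Extra keyword lists are forwarded to the evaluators as per-query arguments.
# Reference-based evaluators such as CorrectnessEvaluator read `reference`;
# the relevancy and faithfulness evaluators do not use it.
evaluation_results = await runner.aevaluate_queries(
    query_engine=query_engine,
    queries=queries,
    reference=reference_answers,
)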
# aevaluate_queries returns a dict keyed by evaluator name; each value is a
# list of EvaluationResult objects in the same order as `queries`.
relevancy_results = evaluation_results["relevancy"]
faithfulness_results = evaluation_results["faithfulness"]
for query_idx, query in enumerate(queries):
    print(f"\nQuery {query_idx}: {query}")
    print(f"  Relevancy: {relevancy_results[query_idx].score}")
    print(f"  Faithfulness: {faithfulness_results[query_idx].score}")
relevancy_scores = [r.score for r in relevancy_results]
faithfulness_scores = [r.score for r in faithfulness_results]
print(f"\nAverage Relevancy: {sum(relevancy_scores) / len(relevancy_scores):.2f}")
print(f"Average Faithfulness: {sum(faithfulness_scores) / len(faithfulness_scores):.2f}")
Output (illustrative; exact values depend on your documents and model, and these two evaluators typically report a pass/fail score of 1.0 or 0.0 for each query):
Query 0: What is machine learning?
  Relevancy: 0.95
  Faithfulness: 0.92
Query 1: How does backpropagation work?
  Relevancy: 0.88
  Faithfulness: 0.89
Query 2: What are neural networks used for?
  Relevancy: 0.91
  Faithfulness: 0.94
Query 3: Explain gradient descent
  Relevancy: 0.93
  Faithfulness: 0.90
Query 4: What is overfitting in machine learning?
  Relevancy: 0.89
  Faithfulness: 0.87
Average Relevancy: 0.91
Average Faithfulness: 0.90
What just happened?
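The code built a vector index from the documents, defined 5 queries with reference answers, instantiated two evaluators (relevancy and faithfulness), and submitted all 5 queries to BatchEvalRunner with workers=4, which caps how many jobs run concurrently. The runner got a response for each query from the query engine and had each evaluator judge it: relevancy checks that the response and its retrieved context address the query, and faithfulness checks that the response is supported by that context. Neither of these two evaluators consults the reference answers; those only come into play with a reference-based evaluator such as CorrectnessEvaluator. Individual scores were printed per query, then averaged to show system-wide performance across the test set.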
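Because the runner's results are grouped by evaluator rather than by query, it can help to regroup them into one row per query and pull out the failures. The sketch below assumes only that each result exposes a numeric `score`; `Result`, `per_query_rows`, and `failing` are hypothetical names used for illustration, with `Result` standing in for the library's result object.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in exposing only the `score` field used here."""
    score: float


def per_query_rows(queries, results):
    """Turn {metric: [result, ...]} into one {metric: score} row per query."""
    rows = []
    for i, query in enumerate(queries):
        row = {"query": query}
        for metric, metric_results in results.items():
            row[metric] = metric_results[i].score
        rows.append(row)
    return rows


def failing(rows, threshold=1.0):
    """Return the rows where any metric falls below `threshold`."""
    return [
        row for row in rows
        if any(value < threshold for key, value in row.items() if key != "query")
    ]


results = {
    "relevancy": [Result(1.0), Result(0.0)],
    "faithfulness": [Result(1.0), Result(1.0)],
}
rows = per_query_rows(["q1", "q2"], results)
print(failing(rows))
# prints [{'query': 'q2', 'relevancy': 0.0, 'faithfulness': 1.0}]
```

Reading the failing rows alongside the averages shows which specific questions pull a metric down, which an average alone hides.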
Common gotcha
Developers often assume that because one query returns a great answer, the whole system is working. BatchEvalRunner can reveal that your system handles simple questions but fails on complex ones, or that it retrieves well but hallucinates during generation. Also, reference answers only affect reference-based evaluators such as CorrectnessEvaluator. When you use one, keep `reference_answers` aligned with your queries: a wrong count tends to fail loudly, but answers that are misordered or stale raise no error and simply drag your scores down.
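One way to guard against the mismatch described above is to store each question next to its reference answer and derive both lists from that single structure, then run a cheap sanity check before spending any API calls. This is a minimal sketch; `check_eval_inputs` and `qa_pairs` are hypothetical names, and the check cannot detect answers that are wrong in meaning, only structural problems.

```python
def check_eval_inputs(queries: list[str], reference_answers: list[str]) -> None:
    """Raise ValueError if the two lists cannot be paired one-to-one."""
    if len(queries) != len(reference_answers):
        raise ValueError(
            f"{len(queries)} queries but {len(reference_answers)} reference answers"
        )
    for i, (query, reference) in enumerate(zip(queries, reference_answers)):
        if not query.strip():
            raise ValueError(f"query {i} is empty")
        if not reference.strip():
            raise ValueError(f"reference answer {i} is empty")


# Keeping each pair together makes misordering much harder to introduce.
qa_pairs = [
    (
        "What is overfitting in machine learning?",
        "Overfitting occurs when a model learns noise in training data "
        "and fails to generalize to new data.",
    ),
    (
        "Explain gradient descent",
        "Gradient descent is an optimization algorithm that iteratively "
        "adjusts parameters to minimize a loss function.",
    ),
]
queries = [question for question, _ in qa_pairs]
reference_answers = [answer for _, answer in qa_pairs]
check_eval_inputs(queries, reference_answers)  # passes silently
```

Stale answers still need a human review whenever the underlying documents change.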
Error recovery
ValueError: number of queries must match number of reference_answers
RuntimeError: workers=X exceeds available CPU cores
OpenAIError: rate_limit_exceeded
AttributeError: 'NoneType' object has no attribute 'score'
Experienced dev note
Most teams run evaluation once before deploying, then ignore it. Set up batch evaluation as a continuous gate in your development loop: run it on every document update, every retriever change, and monthly on new production queries. The real insight is that batch evaluation should be deterministic and cheap enough to run frequently, so use cheaper models (gpt-3.5) for continuous checks and save gpt-4 evaluation for final sign-offs. Also, if your scores drop suddenly, it's not always the index: check whether your LLM's behavior changed (gpt-4 had a noticeable shift in April 2026), your documents drifted, or your reference answers became stale.
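A continuous gate can be as small as a function that compares mean scores with minimum thresholds and reports every metric that falls short. The sketch below is one possible shape, not a prescribed setup: `gate` is a hypothetical helper, and the threshold values in `THRESHOLDS` are placeholders you would tune for your own system.

```python
# Hypothetical minimum acceptable mean score per metric.
THRESHOLDS = {"relevancy": 0.85, "faithfulness": 0.85}


def gate(mean_scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message for every metric whose mean is below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        actual = mean_scores.get(metric)
        if actual is None:
            # A missing metric should fail the gate rather than pass silently.
            failures.append(f"{metric}: no score reported")
        elif actual < minimum:
            failures.append(f"{metric}: {actual:.2f} < {minimum:.2f}")
    return failures


problems = gate({"relevancy": 0.92, "faithfulness": 0.71}, THRESHOLDS)
print(problems)
# prints ['faithfulness: 0.71 < 0.85']
```

In a CI job you would exit with a non-zero status whenever the returned list is non-empty, so a regression blocks the change instead of just being logged.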
Check your understanding
You're evaluating a RAG system with batch evaluation and notice that relevancy scores are 0.92 but faithfulness scores are 0.71. What does this tell you about your system, and what should you investigate first?
Answer hint
A correct answer recognizes that high relevancy combined with low faithfulness means the retriever is pulling the right documents, but the generation step is hallucinating or contradicting them. Investigation should focus first on the LLM's generation quality, prompt engineering, or context length limits, not on retrieval or the index.