How to · Intermediate · 4 min read

How to use LlamaIndex evaluation for RAG

Quick answer
Use LlamaIndex evaluation by running your Retrieval-Augmented Generation (RAG) pipeline's responses through its evaluator classes. This involves loading your documents, building an index, querying the RAG system, and then applying LlamaIndex evaluators such as correctness, faithfulness, or relevancy to score the output.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install llama-index>=0.10
  • pip install openai>=1.0

Setup

Install the llama-index and openai Python packages, then make your OpenAI API key available as an environment variable.

bash
pip install llama-index openai
export OPENAI_API_KEY="sk-..."  # your key

Step by step

This example shows how to build a simple RAG pipeline with LlamaIndex, query it, and evaluate the results using LlamaIndex evaluation tools.

python
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

# The OpenAI LLM reads OPENAI_API_KEY from the environment
llm = OpenAI(model="gpt-4o")

# Load documents from a directory
documents = SimpleDirectoryReader("data").load_data()

# Build the vector index
index = VectorStoreIndex.from_documents(documents)

# Query the index (RAG step)
query = "What is Retrieval-Augmented Generation?"
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query(query)
print("Response:", response.response)

# Evaluate against a ground-truth reference answer
ground_truth = (
    "Retrieval-Augmented Generation (RAG) combines retrieval of "
    "documents with generation by LLMs."
)

evaluator = CorrectnessEvaluator(llm=llm)
result = evaluator.evaluate(
    query=query,
    response=response.response,
    reference=ground_truth,
)
print(f"Evaluation score: {result.score}")  # 1-5 scale; >= 4 passes by default
output
Response: Retrieval-Augmented Generation (RAG) is a technique that combines document retrieval with language model generation to improve answer accuracy.
Evaluation score: 4.5

Exact wording and score will vary from run to run.
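A single score on one query is noisy; in practice you evaluate a batch of queries and aggregate the results. The aggregation step can be sketched in plain Python (the score list and the 4.0 passing threshold below are illustrative placeholders, not LlamaIndex API):

```python
# Aggregate per-query scores into a pass rate and a mean score.
# The values stand in for correctness scores on a 1-5 scale; the
# 4.0 threshold mirrors a common passing cutoff but is an assumption.
results = [4.5, 3.0, 5.0, 4.0]  # one score per evaluated query
threshold = 4.0

pass_rate = sum(s >= threshold for s in results) / len(results)
mean_score = sum(results) / len(results)

print(f"Pass rate: {pass_rate:.2f}")
print(f"Mean score: {mean_score:.2f}")
```

Tracking both numbers matters: a high mean can hide a few badly failed queries, which the pass rate exposes.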

Common variations

  • Swap in other index types, such as SummaryIndex or KeywordTableIndex, for different data scales and retrieval styles.
  • Run evaluations asynchronously via the evaluators' async methods if your environment supports it.
  • Swap OpenAI client with Anthropic or Google Gemini clients for different LLM providers.
  • Customize evaluation metrics by subclassing BaseEvaluator.
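As a sketch of what a custom metric computes, here is a hand-rolled lexical-overlap scorer. It is my own stand-in, not part of LlamaIndex (whose built-in evaluators use an LLM judge), and it needs no API calls:

```python
# A simple lexical-overlap score: the fraction of reference tokens
# that also appear in the response. Crude, but deterministic and
# free -- useful as a sanity check alongside LLM-judged metrics.
def overlap_score(response: str, reference: str) -> float:
    resp_tokens = set(response.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(resp_tokens & ref_tokens) / len(ref_tokens)

print(overlap_score(
    "RAG combines retrieval with generation",
    "RAG combines retrieval and generation",
))
```

A real custom evaluator would wrap logic like this in the evaluator interface so it plugs into the same batch-evaluation tooling as the built-in metrics.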

Troubleshooting

  • If you get API authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • If evaluation scores are unexpectedly low, check that your ground truth answers are accurate and well-aligned with the queries.
  • For slow queries, consider caching your index or using smaller models.
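The first bullet can be turned into a fail-fast guard at the top of your script. The helper below is a sketch of my own, not part of LlamaIndex:

```python
import os

# Fail fast with a clear message if the key is missing -- a blank or
# unset OPENAI_API_KEY is the most common cause of auth errors.
def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running.")
    return key
```

Call `require_api_key()` before building the index so a missing key surfaces immediately rather than mid-pipeline.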

Key Takeaways

  • Use LlamaIndex evaluators to quantitatively assess RAG output quality.
  • Build your index and query it first; evaluators score a generated response, not the index itself.
  • Customize evaluation by providing accurate ground truth references.
  • Switch LLM clients easily to test different model providers.
  • Handle API keys securely via environment variables to avoid auth errors.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022