Intermediate · 4 min read

How to use RAGAS for prompt evaluation

Quick answer
RAGAS stands for Retrieval-Augmented Generation Assessment, a methodology for evaluating RAG pipelines. This guide implements a lightweight, RAGAS-style workflow in five steps: retrieve context, generate an answer with your prompt, grade the output, annotate errors, and score overall prompt quality. Each step is implemented in Python by combining retrieval with LLM calls to systematically assess prompt performance.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quotes prevent the shell from interpreting >= as redirection)
  • Basic knowledge of prompt engineering and retrieval-augmented generation

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-api-key"

Step by step

This example demonstrates the RAGAS workflow: retrieve relevant context, generate an answer with a prompt, grade the answer, annotate errors, and compute a final score.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Retrieve relevant context (mocked here)
def retrieve_context(query):
    # In practice, integrate a vector DB or search engine
    return "The capital of France is Paris."

# Step 2: Generate answer using prompt + context
def generate_answer(question, context):
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

# Step 3: Grade the answer (simple exact match for demo)
def grade_answer(answer, expected):
    return 1.0 if answer.lower() == expected.lower() else 0.0

# Step 4: Annotate errors (basic example)
def annotate(answer, expected):
    if answer.lower() != expected.lower():
        return f"Error: Expected '{expected}', but got '{answer}'."
    return "No errors."

# Step 5: Score prompt quality (aggregate grades)
def score_prompt(grades):
    return sum(grades) / len(grades) if grades else 0

# Example usage
question = "What is the capital of France?"
expected_answer = "Paris"
context = retrieve_context(question)
answer = generate_answer(question, context)
grade = grade_answer(answer, expected_answer)
annotation = annotate(answer, expected_answer)
final_score = score_prompt([grade])

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Grade: {grade}")
print(f"Annotation: {annotation}")
print(f"Final prompt score: {final_score}")
output
Question: What is the capital of France?
Answer: Paris
Grade: 1.0
Annotation: No errors.
Final prompt score: 1.0
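Exact matching fails as soon as the model answers in a full sentence ("The capital of France is Paris."), which is why the demo grader above is marked as such. A slightly more forgiving approach normalizes both strings and checks containment. The sketch below is one possible refinement, not part of any library; `normalize` and `grade_answer_fuzzy` are hypothetical helper names.

```python
import re

def normalize(text):
    # Lowercase, strip punctuation, and collapse runs of whitespace
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def grade_answer_fuzzy(answer, expected):
    # Full credit if the normalized expected answer appears in the answer
    return 1.0 if normalize(expected) in normalize(answer) else 0.0

print(grade_answer_fuzzy("The capital of France is Paris.", "Paris"))  # 1.0
print(grade_answer_fuzzy("I don't know.", "Paris"))                    # 0.0
```

Containment grading still rewards verbose answers that happen to mention the right phrase, so for open-ended questions you may want an LLM-as-judge grader instead.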

Common variations

You can extend this workflow by using asynchronous calls for higher throughput, switching to other models like claude-3-5-haiku-20241022 (requires pip install anthropic), or integrating streaming responses for real-time grading.

python
import asyncio
import os
import anthropic

client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

async def generate_answer_async(question, context):
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    # The async client exposes the same messages.create method, awaited
    response = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=100,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

async def main():
    question = "What is the capital of France?"
    context = "The capital of France is Paris."
    answer = await generate_answer_async(question, context)
    print(f"Async answer: {answer}")

asyncio.run(main())
output
Async answer: Paris
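The payoff of going async is evaluating many test cases concurrently with asyncio.gather. The sketch below mocks the generation step so it runs without an API key; in practice you would substitute the Anthropic or OpenAI call shown earlier. `evaluate_batch` and the mock are illustrative names, not library APIs.

```python
import asyncio

# Mocked async generation -- swap in a real SDK call in practice
async def generate_answer_async(question, context):
    await asyncio.sleep(0.01)  # simulate network latency
    return "Paris" if "France" in question else "Unknown"

def grade_answer(answer, expected):
    return 1.0 if answer.lower() == expected.lower() else 0.0

async def evaluate_batch(cases):
    # Fire off all generations concurrently, then grade and aggregate
    answers = await asyncio.gather(
        *(generate_answer_async(q, ctx) for q, ctx, _ in cases)
    )
    grades = [grade_answer(a, exp) for a, (_, _, exp) in zip(answers, cases)]
    return sum(grades) / len(grades) if grades else 0.0

cases = [
    ("What is the capital of France?", "The capital of France is Paris.", "Paris"),
    ("What is the capital of Spain?", "The capital of Spain is Madrid.", "Madrid"),
]
print(asyncio.run(evaluate_batch(cases)))  # 0.5 with this mock
```

Because the calls run concurrently, wall-clock time scales with the slowest request rather than the sum of all requests.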

Troubleshooting

  • If the answer is irrelevant, verify your retrieval step returns accurate context.
  • If grading is always zero, improve your grading logic beyond exact matches.
  • For API errors, ensure your environment variable OPENAI_API_KEY or ANTHROPIC_API_KEY is set correctly.
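Transient API errors (rate limits, timeouts) are usually worth retrying with exponential backoff rather than failing the whole evaluation run. Below is a minimal, generic retry sketch; `flaky_call` is a stand-in for a real SDK call, and in practice you would catch a specific exception type such as openai.RateLimitError instead of a bare Exception.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    # Retry fn with exponential backoff; re-raise after the final attempt
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for an API call that fails twice, then succeeds
state = {"calls": 0}
def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky_call))  # "ok" after two retries
```

Both official SDKs also retry some errors automatically, so tune your own retry count to avoid compounding delays.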

Key Takeaways

  • Implement a RAGAS-style evaluation by combining retrieval, generation, grading, annotation, and scoring steps programmatically.
  • Use the OpenAI or Anthropic SDKs with environment-secured API keys for reliable prompt evaluation.
  • Customize grading and annotation logic to fit your domain and evaluation criteria.
  • Leverage async and streaming APIs for scalable and real-time prompt assessment.
  • Validate retrieval quality first to ensure meaningful prompt evaluation results.
Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022