How-to · Intermediate · 3 min read

How to automate LLM evaluation

Quick answer
Automate LLM evaluation by scripting calls to AI APIs like OpenAI or Anthropic to send prompts and parse responses programmatically. Use Python to run batch tests, compare outputs, and measure metrics such as accuracy or relevance.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • Anthropic API key (optional)
  • pip install "openai>=1.0" "anthropic>=0.20"

Setup

Install the required Python packages and set your API keys as environment variables for secure access.

  • Install OpenAI and Anthropic SDKs:
bash
pip install "openai>=1.0" "anthropic>=0.20"
output
Collecting openai
Collecting anthropic
Installing collected packages: openai, anthropic
Successfully installed anthropic-0.20.0 openai-1.0.0
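
The keys can then be exported as environment variables so the scripts below never hardcode secrets; the values here are placeholders, not real keys:

```bash
# Placeholder values -- substitute your actual keys
export OPENAI_API_KEY="sk-your-key-here"
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
```

Add these lines to your shell profile (e.g. ~/.bashrc) to persist them across sessions.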

Step by step

Use Python to automate evaluation by sending prompts to an LLM, collecting responses, and comparing them against expected outputs or metrics.

python
import os
from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define evaluation dataset: list of prompts and expected answers
evaluation_data = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Solve 2 + 2.", "expected": "4"},
    {"prompt": "Who wrote '1984'?", "expected": "George Orwell"}
]

# Function to evaluate a single prompt
def evaluate_prompt(prompt, expected):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    # Simple exact match metric
    passed = answer.lower() == expected.lower()
    return answer, passed

# Run evaluation
results = []
for item in evaluation_data:
    answer, passed = evaluate_prompt(item["prompt"], item["expected"])
    results.append({"prompt": item["prompt"], "answer": answer, "passed": passed})

# Print results
for r in results:
    print(f"Prompt: {r['prompt']}")
    print(f"Answer: {r['answer']}")
    print(f"Passed: {r['passed']}\n")
output
Prompt: What is the capital of France?
Answer: Paris
Passed: True

Prompt: Solve 2 + 2.
Answer: 4
Passed: True

Prompt: Who wrote '1984'?
Answer: George Orwell
Passed: True
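
Once the loop finishes, the per-prompt booleans can be rolled up into a single accuracy figure, which is what you would track across model or prompt versions. A minimal sketch, with the `results` list hardcoded for illustration in the shape the loop above produces:

```python
# Example results in the shape produced by the evaluation loop
results = [
    {"prompt": "What is the capital of France?", "answer": "Paris", "passed": True},
    {"prompt": "Solve 2 + 2.", "answer": "4", "passed": True},
    {"prompt": "Who wrote '1984'?", "answer": "Orwell", "passed": False},
]

# Accuracy = fraction of prompts whose answer matched
accuracy = sum(r["passed"] for r in results) / len(results)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 67%
```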

Common variations

You can run evaluations asynchronously, stream responses for real-time scoring, or switch providers, for example claude-3-5-sonnet-20241022 via the Anthropic SDK. You can also extend the metric beyond exact match to semantic similarity, or track token usage per prompt.

python
import asyncio
import os
from openai import AsyncOpenAI

# Iterating a stream with `async for` requires the async client
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

evaluation_data = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Solve 2 + 2.", "expected": "4"}
]

async def evaluate_async(prompt, expected):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    answer = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        answer += delta
    passed = answer.strip().lower() == expected.lower()
    return answer.strip(), passed

async def main():
    for item in evaluation_data:
        answer, passed = await evaluate_async(item["prompt"], item["expected"])
        print(f"Prompt: {item['prompt']}")
        print(f"Answer: {answer}")
        print(f"Passed: {passed}\n")

asyncio.run(main())
output
Prompt: What is the capital of France?
Answer: Paris
Passed: True

Prompt: Solve 2 + 2.
Answer: 4
Passed: True
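
Exact matching is brittle: a model that answers "The capital of France is Paris." fails even though it is correct. As a lightweight stand-in for full semantic similarity, a substring check combined with difflib's similarity ratio scores near-matches; the 0.8 threshold below is an illustrative choice, not a standard:

```python
from difflib import SequenceMatcher

def soft_match(answer: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass if the expected answer appears inside the response,
    or the two strings are nearly identical by edit similarity."""
    a, e = answer.strip().lower(), expected.strip().lower()
    return e in a or SequenceMatcher(None, a, e).ratio() >= threshold

print(soft_match("The capital of France is Paris.", "Paris"))  # True
print(soft_match("Pariss", "Paris"))                           # True
print(soft_match("London", "Paris"))                           # False
```

For genuine semantic similarity (paraphrases with no shared wording), swap this helper for an embedding-based comparison.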

Troubleshooting

  • If you get authentication errors, verify your API key is set correctly in os.environ.
  • For rate limits, add retry logic or reduce request frequency.
  • If responses are inconsistent across runs, set temperature=0 for more deterministic output, or use a stronger model.
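
For the rate-limit case, a small retry helper with exponential backoff is usually enough. A sketch using only the standard library; the flaky function below is a stand-in for an API call so the example is self-contained (in real code you would catch the SDK's rate-limit exception rather than bare Exception):

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn, retrying on failure with exponential backoff:
    waits base_delay, then 2x, then 4x between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for a flaky API call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

Wrap each `client.chat.completions.create` call this way, e.g. `with_retries(lambda: evaluate_prompt(p, e))`.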

Key Takeaways

  • Use Python scripts with OpenAI or Anthropic SDKs to automate LLM evaluation efficiently.
  • Batch prompts and parse responses programmatically to measure accuracy or custom metrics.
  • Leverage async and streaming APIs for scalable and real-time evaluation workflows.
Verified 2026-04 · gpt-4o-mini, claude-3-5-sonnet-20241022