How to automate LLM evaluation
Quick answer
Automate LLM evaluation by scripting calls to AI APIs like
OpenAI or Anthropic to send prompts and parse responses programmatically. Use Python to run batch tests, compare outputs, and measure metrics such as accuracy or relevance.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- Anthropic API key (optional)
- pip install "openai>=1.0" "anthropic>=0.20"
Setup
Install the required Python packages and set your API keys as environment variables for secure access.
- Install OpenAI and Anthropic SDKs:
```shell
pip install "openai>=1.0" "anthropic>=0.20"
```
Output:
```
Collecting openai
Collecting anthropic
Installing collected packages: openai, anthropic
Successfully installed anthropic-0.20.0 openai-1.0.0
```
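The scripts below read the keys from environment variables. On macOS or Linux you can export them in the shell before running anything (the key values here are placeholders, not real keys):

```shell
# Export placeholder keys for the current shell session (replace with real keys)
export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"
```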
Step by step
Use Python to automate evaluation by sending prompts to an LLM, collecting responses, and comparing them against expected outputs or metrics.
```python
import os

from openai import OpenAI

# Initialize the OpenAI client from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Evaluation dataset: prompts paired with expected answers
evaluation_data = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Solve 2 + 2.", "expected": "4"},
    {"prompt": "Who wrote '1984'?", "expected": "George Orwell"},
]

# Send one prompt to the model and score the response
def evaluate_prompt(prompt, expected):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.strip()
    # Simple exact-match metric
    passed = answer.lower() == expected.lower()
    return answer, passed

# Run the evaluation over the whole dataset
results = []
for item in evaluation_data:
    answer, passed = evaluate_prompt(item["prompt"], item["expected"])
    results.append({"prompt": item["prompt"], "answer": answer, "passed": passed})

# Print results
for r in results:
    print(f"Prompt: {r['prompt']}")
    print(f"Answer: {r['answer']}")
    print(f"Passed: {r['passed']}\n")
```
Output:
```
Prompt: What is the capital of France?
Answer: Paris
Passed: True

Prompt: Solve 2 + 2.
Answer: 4
Passed: True

Prompt: Who wrote '1984'?
Answer: George Orwell
Passed: True
```
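Once the results are collected, it is usually worth aggregating them into a single pass rate. A minimal sketch (the `summarize` helper and the hard-coded sample results are illustrative, not part of the script above):

```python
# Hypothetical helper: aggregate a results list like the one built above
def summarize(results):
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": len(results),
        "passed": passed,
        "pass_rate": passed / len(results) if results else 0.0,
    }

# Sample results in the same shape the evaluation loop produces
results = [
    {"prompt": "What is the capital of France?", "answer": "Paris", "passed": True},
    {"prompt": "Solve 2 + 2.", "answer": "4", "passed": True},
    {"prompt": "Who wrote '1984'?", "answer": "Orwell", "passed": False},
]

s = summarize(results)
print(f"{s['passed']}/{s['total']} passed ({s['pass_rate']:.0%})")  # prints: 2/3 passed (67%)
```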
Common variations
You can automate evaluation asynchronously, use streaming responses for real-time scoring, or switch models like claude-3-5-sonnet-20241022 for Anthropic. Adjust metrics to include semantic similarity or token usage.
```python
import asyncio
import os

from openai import AsyncOpenAI

# Use the async client so streamed chunks can be consumed with `async for`
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

evaluation_data = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Solve 2 + 2.", "expected": "4"},
]

async def evaluate_async(prompt, expected):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    answer = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        answer += delta
    passed = answer.strip().lower() == expected.lower()
    return answer.strip(), passed

async def main():
    for item in evaluation_data:
        answer, passed = await evaluate_async(item["prompt"], item["expected"])
        print(f"Prompt: {item['prompt']}")
        print(f"Answer: {answer}")
        print(f"Passed: {passed}\n")

asyncio.run(main())
```
Output:
```
Prompt: What is the capital of France?
Answer: Paris
Passed: True

Prompt: Solve 2 + 2.
Answer: 4
Passed: True
```
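Exact match is brittle when answers differ only in punctuation or phrasing. As a rough stand-in for the semantic-similarity metric mentioned above, the standard library's difflib can score string overlap; embedding-based cosine similarity is the usual production choice, and the 0.8 threshold here is an arbitrary assumption:

```python
from difflib import SequenceMatcher

# Score two strings by character overlap: a crude proxy for semantic similarity
def fuzzy_match(answer, expected, threshold=0.8):
    ratio = SequenceMatcher(
        None, answer.strip().lower(), expected.strip().lower()
    ).ratio()
    return ratio >= threshold, ratio

# "Paris." vs "Paris" fails exact match but passes the fuzzy check
passed, score = fuzzy_match("Paris.", "Paris")
print(passed, round(score, 2))  # True 0.91
```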
Troubleshooting
- If you get authentication errors, verify that your API key is set correctly in os.environ.
- For rate limits, add retry logic or reduce request frequency.
- If responses are cut off, increase max_tokens; if they vary between runs, lower the temperature (e.g. temperature=0) or use a stronger model.
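For the rate-limit case, a small wrapper with exponential backoff is usually enough. A sketch (the `with_retries` name and defaults are illustrative, and a real version would catch the SDK's specific rate-limit exception rather than bare Exception):

```python
import random
import time

# Retry a callable with exponential backoff plus jitter; `call` stands in for
# any API request that raises on a transient failure such as a 429 response
def with_retries(call, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Wait base_delay * 2^attempt seconds, plus random jitter
            time.sleep(base_delay * 2 ** attempt + random.random())
```

For example, `with_retries(lambda: evaluate_prompt(p, e))` would retry a flaky evaluation call up to five times before giving up.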
Key Takeaways
- Use Python scripts with the OpenAI or Anthropic SDKs to automate LLM evaluation efficiently.
- Batch prompts and parse responses programmatically to measure accuracy or custom metrics.
- Leverage async and streaming APIs for scalable and real-time evaluation workflows.