How to run LLM evals with Python
Direct answer
Use Python with the OpenAI or Anthropic SDKs to run LLM evals by sending benchmark prompts to models like gpt-4o or claude-sonnet-4-5 and parsing their responses for scoring.
Setup
Install
```shell
pip install openai anthropic
```
Env vars
OPENAI_API_KEY, ANTHROPIC_API_KEY
Imports
```python
from openai import OpenAI
import anthropic
import os
import json
```
Examples
In: Run MMLU benchmark on gpt-4o with prompt "What is the capital of France?"
Out: Model gpt-4o answered: Paris
In: Evaluate coding task on claude-sonnet-4-5: "Write a Python function to reverse a string."
Out: Model claude-sonnet-4-5 generated correct Python code.
In: Test math reasoning on gpt-4o-mini with "Solve 12 * 15."
Out: Model gpt-4o-mini answered: 180
Integration steps
- Set your API keys in environment variables for OpenAI and/or Anthropic.
- Import the OpenAI and Anthropic SDK clients in Python.
- Initialize the client with the API key from os.environ.
- Prepare benchmark prompts or test questions as messages.
- Call the chat completions endpoint with the chosen model and messages.
- Extract and analyze the response text to score or validate the output.
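The final step, scoring the extracted text, can be sketched as a small helper. `grade_exact_match` is a hypothetical name and the normalization rules (case, whitespace, trailing punctuation) are an assumption, not part of either SDK:

```python
def grade_exact_match(expected: str, actual: str) -> bool:
    """Return True when the model's answer matches the expected answer.

    Normalizes case, surrounding whitespace, and trailing punctuation so
    that "Paris." and "paris" both count as correct.
    """
    def normalize(s: str) -> str:
        return s.strip().strip(".!").lower()
    return normalize(expected) == normalize(actual)

# Score a model response extracted from choices[0].message.content
print(grade_exact_match("Paris", "  paris.  "))  # True
```

Exact match works for closed-form questions like capitals or arithmetic; open-ended tasks usually need a rubric or an LLM-as-judge pass instead.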
Full code
```python
import os
import json
from openai import OpenAI
import anthropic

# Initialize clients
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Example benchmark prompt
benchmark_prompt = "What is the capital of France?"

# OpenAI gpt-4o eval
response_openai = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": benchmark_prompt}]
)
answer_openai = response_openai.choices[0].message.content
print(f"OpenAI gpt-4o answered: {answer_openai.strip()}")

# Anthropic claude-sonnet-4-5 eval
response_anthropic = anthropic_client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=100,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": benchmark_prompt}]
)
answer_anthropic = response_anthropic.content[0].text
print(f"Anthropic claude-sonnet-4-5 answered: {answer_anthropic.strip()}")
```
Output
```
OpenAI gpt-4o answered: Paris
Anthropic claude-sonnet-4-5 answered: Paris
```
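Once answers are collected, per-model accuracy is a simple aggregate. The result pairs below are illustrative placeholders, not real benchmark data:

```python
# (expected, model_answer) pairs collected from eval runs
results = [
    ("Paris", "Paris"),
    ("180", "180"),
    ("Berlin", "Munich"),
]

# Count case-insensitive exact matches and report the fraction correct
correct = sum(
    1 for expected, got in results
    if expected.strip().lower() == got.strip().lower()
)
accuracy = correct / len(results)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 67%
```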
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is the capital of France?"}]} Response
{"choices": [{"message": {"content": "Paris"}}], "usage": {"total_tokens": 15}} Extract
response.choices[0].message.contentVariants
Streaming LLM Eval ›
Use streaming when you want to display partial results in real-time for long benchmark prompts.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True
)
print("Streaming response:")
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
```
Async LLM Eval with Anthropic ›
Use async calls to run multiple benchmark queries concurrently for faster evaluation.
```python
import os
import asyncio
import anthropic

async def run_async_eval():
    # Use the async client; the synchronous Anthropic client cannot be awaited
    client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=100,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": "Write a Python function to add two numbers."}]
    )
    print("Async Anthropic answer:", response.content[0].text.strip())

asyncio.run(run_async_eval())
```
Alternative Model for Cost Efficiency ›
Use smaller models like gpt-4o-mini for cheaper, faster benchmarks with slightly lower accuracy.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Solve 12 * 15."}]
)
print("gpt-4o-mini answer:", response.choices[0].message.content.strip())
```
Performance
- Latency: ~800ms for gpt-4o non-streaming; ~400ms for gpt-4o-mini
- Cost: ~$0.002 per 500 tokens on gpt-4o; ~50% less on gpt-4o-mini
- Rate limits: Tier 1 on OpenAI is 500 RPM / 30K TPM; Anthropic has comparable tiered limits
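When a benchmark run hits those rate limits, both SDKs raise retryable errors; a generic exponential-backoff wrapper can be sketched as below. `with_retries` and `flaky_call` are hypothetical names, and `flaky_call` is a stand-in for a real API call:

```python
import time

def with_retries(fn, max_attempts=4, base_delay=0.01):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for an API call that fails twice before succeeding
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky_call))  # ok
```

In production you would catch the SDKs' specific rate-limit exceptions rather than bare `Exception`, and use a base delay of a second or more.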
- Keep prompts concise to reduce token usage.
- Use smaller models for preliminary benchmarks.
- Cache repeated benchmark prompts and responses.
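The caching tip above can be sketched with a dict keyed by model and prompt. `cached_eval` and `call_model` are hypothetical names, with a stub in place of a real API call:

```python
_cache: dict[tuple[str, str], str] = {}
calls = {"n": 0}

def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real chat-completions call."""
    calls["n"] += 1
    return f"answer from {model}"

def cached_eval(model: str, prompt: str) -> str:
    # Only hit the API the first time a (model, prompt) pair is seen
    key = (model, prompt)
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]

cached_eval("gpt-4o", "What is the capital of France?")
cached_eval("gpt-4o", "What is the capital of France?")  # served from cache
print(calls["n"])  # 1
```

Caching is safe for deterministic scoring runs; skip it when you deliberately sample multiple completions per prompt.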
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| OpenAI gpt-4o full eval | ~800ms | ~$0.002 | High accuracy benchmarks |
| Streaming eval | Starts ~200ms, ongoing | Same as full | Real-time feedback |
| Async Anthropic eval | ~700ms per call | ~$0.002 | Concurrent benchmark runs |
| OpenAI gpt-4o-mini eval | ~400ms | ~$0.001 | Cost-effective quick tests |
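The concurrent pattern in the table's async row can be sketched with asyncio.gather; `fake_eval` below is a placeholder coroutine standing in for an awaited AsyncOpenAI or AsyncAnthropic call:

```python
import asyncio

async def fake_eval(prompt: str) -> str:
    # Placeholder for an awaited client.messages.create(...) call
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def run_benchmark(prompts):
    # Launch all eval calls concurrently; gather preserves input order
    return await asyncio.gather(*(fake_eval(p) for p in prompts))

answers = asyncio.run(run_benchmark(["Q1", "Q2", "Q3"]))
print(answers)
```

With real API calls, bound the concurrency (e.g. an `asyncio.Semaphore`) so a large benchmark does not blow through the rate limits listed above.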
Quick tip
Use the latest v1+ SDK clients, and remember the extraction paths differ: OpenAI responses come from choices[0].message.content, Anthropic responses from content[0].text.
Common mistake
Beginners often use deprecated pre-v1 SDK methods (e.g. openai.ChatCompletion.create) or hardcode API keys instead of reading them from environment variables.