Code intermediate · 3 min read

How to run LLM evals with Python

Direct answer
Use Python with the OpenAI or Anthropic SDKs to run LLM evals by sending benchmark prompts to models like gpt-4o or claude-sonnet-4-5 and parsing their responses for scoring.

Setup

Install
```bash
pip install openai anthropic
```
Env vars
  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
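A small guard can fail fast when a key is missing, rather than erroring on the first API call; `require_env` is a hypothetical helper name, not part of either SDK:

```python
import os

def require_env(name: str) -> str:
    """Return the environment variable's value, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

# Usage: openai_key = require_env("OPENAI_API_KEY")
```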
Imports
```python
from openai import OpenAI
import anthropic
import os
import json
```

Examples

In: Run MMLU benchmark on gpt-4o with prompt 'What is the capital of France?'
Out: Model gpt-4o answered: Paris

In: Evaluate coding task on claude-sonnet-4-5: 'Write a Python function to reverse a string.'
Out: Model claude-sonnet-4-5 generated correct Python code.

In: Test math reasoning on gpt-4o-mini with 'Solve 12 * 15.'
Out: Model gpt-4o-mini answered: 180

Integration steps

  1. Set your API keys in environment variables for OpenAI and/or Anthropic.
  2. Import the OpenAI and Anthropic SDK clients in Python.
  3. Initialize the client with the API key from os.environ.
  4. Prepare benchmark prompts or test questions as messages.
  5. Call the chat completions endpoint with the chosen model and messages.
  6. Extract and analyze the response text to score or validate the output.
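The steps above can be sketched as a minimal eval loop. `ask_model` stands in for any SDK call (stubbed here so the sketch runs without an API key), and case-insensitive exact match is just one simple scoring choice among many:

```python
def exact_match(expected: str, actual: str) -> bool:
    """Score a response by case-insensitive exact match."""
    return expected.strip().lower() == actual.strip().lower()

def run_eval(ask_model, benchmarks):
    """ask_model: callable(prompt) -> str; benchmarks: list of (prompt, expected)."""
    results = []
    for prompt, expected in benchmarks:
        answer = ask_model(prompt)
        results.append({
            "prompt": prompt,
            "answer": answer,
            "correct": exact_match(expected, answer),
        })
    return results

# Stubbed model for illustration; swap in a real chat-completions call.
fake_model = lambda prompt: "Paris"
results = run_eval(fake_model, [("What is the capital of France?", "Paris")])
print(results[0]["correct"])  # True
```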

Full code

```python
import os
from openai import OpenAI
import anthropic
import json

# Initialize clients
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Example benchmark prompt
benchmark_prompt = "What is the capital of France?"

# OpenAI GPT-4o eval
response_openai = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": benchmark_prompt}]
)
answer_openai = response_openai.choices[0].message.content
print(f"OpenAI gpt-4o answered: {answer_openai.strip()}")

# Anthropic Claude-sonnet eval
response_anthropic = anthropic_client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=100,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": benchmark_prompt}]
)
answer_anthropic = response_anthropic.content[0].text
print(f"Anthropic claude-sonnet-4-5 answered: {answer_anthropic.strip()}")
```

Output:

```
OpenAI gpt-4o answered: Paris
Anthropic claude-sonnet-4-5 answered: Paris
```

API trace

Request:

```json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is the capital of France?"}]}
```

Response:

```json
{"choices": [{"message": {"content": "Paris"}}], "usage": {"total_tokens": 15}}
```

Extract: `response.choices[0].message.content`
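Scoring code often works from the raw JSON rather than SDK objects; parsing the trace above with the standard library looks like this:

```python
import json

# Raw response body as returned by the chat completions endpoint.
raw = '{"choices": [{"message": {"content": "Paris"}}], "usage": {"total_tokens": 15}}'
data = json.loads(raw)

answer = data["choices"][0]["message"]["content"]
total_tokens = data["usage"]["total_tokens"]
print(answer, total_tokens)  # Paris 15
```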

Variants

Streaming LLM Eval

Use streaming when you want to display partial results in real-time for long benchmark prompts.

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True
)

print("Streaming response:")
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
```
Async LLM Eval with Anthropic

Use async calls to run multiple benchmark queries concurrently for faster evaluation.

```python
import os
import asyncio
import anthropic

async def run_async_eval():
    # Use the async client; the synchronous Anthropic client cannot be awaited.
    client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=100,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": "Write a Python function to add two numbers."}]
    )
    print("Async Anthropic answer:", response.content[0].text.strip())

asyncio.run(run_async_eval())
```
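A single async call gains nothing on its own; the speedup comes from fanning out many prompts with `asyncio.gather`. The `fake_model` stub below only illustrates the concurrency pattern so the sketch runs offline; for real runs, substitute an `AsyncAnthropic` `messages.create` call:

```python
import asyncio

async def eval_many(ask_model, prompts):
    # Launch all requests at once and wait for every result.
    return await asyncio.gather(*(ask_model(p) for p in prompts))

# Stub standing in for an async SDK call.
async def fake_model(prompt):
    await asyncio.sleep(0)
    return f"answer to: {prompt}"

answers = asyncio.run(eval_many(fake_model, ["Solve 12 * 15.", "Capital of France?"]))
print(answers)
```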
Alternative Model for Cost Efficiency

Use smaller models like gpt-4o-mini for cheaper, faster benchmarks with slightly lower accuracy.

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Solve 12 * 15."}]
)
print("gpt-4o-mini answer:", response.choices[0].message.content.strip())
```

Performance

Latency: ~800ms for gpt-4o non-streaming; ~400ms for gpt-4o-mini
Cost: ~$0.002 per 500 tokens on gpt-4o; ~50% less on gpt-4o-mini
Rate limits: Tier 1: 500 RPM / 30K TPM on OpenAI; Anthropic similar
  • Keep prompts concise to reduce token usage.
  • Use smaller models for preliminary benchmarks.
  • Cache repeated benchmark prompts and responses.
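The caching tip can be as simple as a dict keyed on (model, prompt); `cached_call` is a hypothetical helper wrapping whichever SDK call you use, stubbed here for illustration:

```python
_cache = {}

def cached_call(model, prompt, call_fn):
    """Reuse a stored response for repeated (model, prompt) pairs."""
    key = (model, prompt)
    if key not in _cache:
        _cache[key] = call_fn(model, prompt)
    return _cache[key]

calls = []
def fake_api(model, prompt):  # stand-in for a real SDK call
    calls.append(prompt)
    return "Paris"

cached_call("gpt-4o", "What is the capital of France?", fake_api)
cached_call("gpt-4o", "What is the capital of France?", fake_api)
print(len(calls))  # only one real call was made
```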
| Approach | Latency | Cost/call | Best for |
| --- | --- | --- | --- |
| OpenAI gpt-4o full eval | ~800ms | ~$0.002 | High accuracy benchmarks |
| Streaming eval | Starts ~200ms, ongoing | Same as full | Real-time feedback |
| Async Anthropic eval | ~700ms per call | ~$0.002 | Concurrent benchmark runs |
| OpenAI gpt-4o-mini eval | ~400ms | ~$0.001 | Cost-effective quick tests |

Quick tip

Use the current v1+ SDK clients, and extract responses consistently: choices[0].message.content for OpenAI, content[0].text for Anthropic.

Common mistake

Beginners often use deprecated SDK methods or hardcode API keys instead of using environment variables.

Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-sonnet-4-5