How to run LLM evals with Python
Direct answer
Use Python with the OpenAI or Anthropic SDKs to run LLM evals by sending benchmark prompts to models like gpt-4o or claude-sonnet-4-5 and parsing their responses for scoring.
Setup
Install
```shell
pip install openai anthropic
```
Env vars
OPENAI_API_KEY, ANTHROPIC_API_KEY
Imports
```python
from openai import OpenAI
import anthropic
import os
import json
```
Examples
In: Run MMLU benchmark on gpt-4o with prompt "What is the capital of France?"
Out: Model gpt-4o answered: Paris
In: Evaluate coding task on claude-sonnet-4-5: "Write a Python function to reverse a string."
Out: Model claude-sonnet-4-5 generated correct Python code.
In: Test math reasoning on gpt-4o-mini with "Solve 12 * 15."
Out: Model gpt-4o-mini answered: 180
Integration steps
- Set your API keys in environment variables for OpenAI and/or Anthropic.
- Import the OpenAI and Anthropic SDK clients in Python.
- Initialize the client with the API key from os.environ.
- Prepare benchmark prompts or test questions as messages.
- Call the chat completions endpoint with the chosen model and messages.
- Extract and analyze the response text to score or validate the output.
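The final step, scoring the extracted text, can be sketched as a small helper. `grade_exact_match` is a hypothetical name and the normalization rules (case, whitespace, trailing punctuation) are an assumption, not part of either SDK:

```python
def grade_exact_match(expected: str, actual: str) -> bool:
    """Return True when the model's answer matches the expected answer.

    Normalizes case, surrounding whitespace, and trailing punctuation so
    that "Paris." and "paris" both count as correct.
    """
    def normalize(s: str) -> str:
        return s.strip().strip(".!").lower()
    return normalize(expected) == normalize(actual)

# Score a model response extracted from choices[0].message.content
print(grade_exact_match("Paris", "  paris.  "))  # True
```

Exact match works for closed-form questions like capitals or arithmetic; open-ended tasks usually need a rubric or an LLM-as-judge pass instead.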
Full code
```python
import os
import json
from openai import OpenAI
import anthropic

# Initialize clients
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Example benchmark prompt
benchmark_prompt = "What is the capital of France?"

# OpenAI gpt-4o eval
response_openai = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": benchmark_prompt}]
)
answer_openai = response_openai.choices[0].message.content
print(f"OpenAI gpt-4o answered: {answer_openai.strip()}")

# Anthropic claude-sonnet-4-5 eval
response_anthropic = anthropic_client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=100,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": benchmark_prompt}]
)
answer_anthropic = response_anthropic.content[0].text
print(f"Anthropic claude-sonnet-4-5 answered: {answer_anthropic.strip()}")
```
Output
```
OpenAI gpt-4o answered: Paris
Anthropic claude-sonnet-4-5 answered: Paris
```
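Once answers are collected, per-model accuracy is a simple aggregate. The result pairs below are illustrative placeholders, not real benchmark data:

```python
# (expected, model_answer) pairs collected from eval runs
results = [
    ("Paris", "Paris"),
    ("180", "180"),
    ("Berlin", "Munich"),
]

# Count case-insensitive exact matches and report the fraction correct
correct = sum(
    1 for expected, got in results
    if expected.strip().lower() == got.strip().lower()
)
accuracy = correct / len(results)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 67%
```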
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is the capital of France?"}]} Response
{"choices": [{"message": {"content": "Paris"}}], "usage": {"total_tokens": 15}} Extract
response.choices[0].message.contentVariants
Streaming LLM Eval ›
Use streaming when you want to display partial results in real-time for long benchmark prompts.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    stream=True
)
print("Streaming response:")
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
```
Async LLM Eval with Anthropic ›
Use async calls to run multiple benchmark queries concurrently for faster evaluation.
```python
import os
import asyncio
import anthropic

async def run_async_eval():
    # Use the async client; the synchronous Anthropic client cannot be awaited
    client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=100,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": "Write a Python function to add two numbers."}]
    )
    print("Async Anthropic answer:", response.content[0].text.strip())

asyncio.run(run_async_eval())
```
Alternative Model for Cost Efficiency ›
Use smaller models like gpt-4o-mini for cheaper, faster benchmarks with slightly lower accuracy.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Solve 12 * 15."}]
)
print("gpt-4o-mini answer:", response.choices[0].message.content.strip())
```
Performance
- Latency: ~800ms for gpt-4o non-streaming; ~400ms for gpt-4o-mini
- Cost: ~$0.002 per 500 tokens on gpt-4o; ~50% less on gpt-4o-mini
- Rate limits: Tier 1 on OpenAI is 500 RPM / 30K TPM; Anthropic has comparable tiered limits
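When a benchmark run hits those rate limits, both SDKs raise retryable errors; a generic exponential-backoff wrapper can be sketched as below. `with_retries` and `flaky_call` are hypothetical names, and `flaky_call` is a stand-in for a real API call:

```python
import time

def with_retries(fn, max_attempts=4, base_delay=0.01):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for an API call that fails twice before succeeding
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky_call))  # ok
```

In production you would catch the SDKs' specific rate-limit exceptions rather than bare `Exception`, and use a base delay of a second or more.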
- Keep prompts concise to reduce token usage.
- Use smaller models for preliminary benchmarks.
- Cache repeated benchmark prompts and responses.
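The caching tip above can be sketched with a dict keyed by model and prompt. `cached_eval` and `call_model` are hypothetical names, with a stub in place of a real API call:

```python
_cache: dict[tuple[str, str], str] = {}
calls = {"n": 0}

def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real chat-completions call."""
    calls["n"] += 1
    return f"answer from {model}"

def cached_eval(model: str, prompt: str) -> str:
    # Only hit the API the first time a (model, prompt) pair is seen
    key = (model, prompt)
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]

cached_eval("gpt-4o", "What is the capital of France?")
cached_eval("gpt-4o", "What is the capital of France?")  # served from cache
print(calls["n"])  # 1
```

Caching is safe for deterministic scoring runs; skip it when you deliberately sample multiple completions per prompt.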
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| OpenAI gpt-4o full eval | ~800ms | ~$0.002 | High accuracy benchmarks |
| Streaming eval | Starts ~200ms, ongoing | Same as full | Real-time feedback |
| Async Anthropic eval | ~700ms per call | ~$0.002 | Concurrent benchmark runs |
| OpenAI gpt-4o-mini eval | ~400ms | ~$0.001 | Cost-effective quick tests |
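The concurrent pattern in the table's async row can be sketched with asyncio.gather; `fake_eval` below is a placeholder coroutine standing in for an awaited AsyncOpenAI or AsyncAnthropic call:

```python
import asyncio

async def fake_eval(prompt: str) -> str:
    # Placeholder for an awaited client.messages.create(...) call
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def run_benchmark(prompts):
    # Launch all eval calls concurrently; gather preserves input order
    return await asyncio.gather(*(fake_eval(p) for p in prompts))

answers = asyncio.run(run_benchmark(["Q1", "Q2", "Q3"]))
print(answers)
```

With real API calls, bound the concurrency (e.g. an `asyncio.Semaphore`) so a large benchmark does not blow through the rate limits listed above.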
Quick tip
Use the latest v1+ SDK clients, and remember the extraction paths differ: OpenAI responses come from choices[0].message.content, Anthropic responses from content[0].text.
Common mistake
Beginners often use deprecated pre-v1 SDK methods (e.g. openai.ChatCompletion.create) or hardcode API keys instead of reading them from environment variables.