How to compare LLM performance
Quick answer
Compare LLM performance using standardized benchmarks such as MMLU for knowledge, HumanEval for coding, and MATH for reasoning. Use official SDKs to run benchmark prompts and measure accuracy, speed, and cost programmatically.

Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai>=1.0

Output:
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
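Setting the key as an environment variable might look like this (the `sk-...` value is a placeholder for your own key):

```shell
# macOS/Linux: run in your terminal, or add to ~/.bashrc / ~/.zshrc to persist
export OPENAI_API_KEY="sk-..."   # replace with your real key

# Windows (PowerShell):
# setx OPENAI_API_KEY "sk-..."
```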
Step by step
Use the OpenAI SDK to run benchmark prompts on models like gpt-4o and evaluate results against benchmark datasets.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example: Run a HumanEval coding benchmark prompt
prompt = """def add(a, b):\n # Add two numbers\n"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=64,
)
print("Model output:", response.choices[0].message.content)

Output:
Model output: def add(a, b):
    return a + b

Common variations
You can test models from other providers, such as claude-sonnet-4-5 (Anthropic) or gemini-2.5-pro (Google), by running the same prompts through each provider's own SDK. You can also use async calls or stream outputs for real-time evaluation. With the OpenAI SDK, async calls go through AsyncOpenAI:

import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_async_benchmark():
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain recursion in Python."}],
        max_tokens=100,
    )
    print("Async model output:", response.choices[0].message.content)

asyncio.run(run_async_benchmark())

Output:
Async model output: Recursion in Python is a function calling itself to solve smaller instances of a problem until a base case is reached.
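Async calls are most useful for fanning one prompt out to several models at once and timing each response. Here is a minimal sketch; `compare_models` and `time_completion` are illustrative names, and the async callable you pass in would wrap a real SDK call (for example `client.chat.completions.create` from AsyncOpenAI):

```python
import asyncio
import time

async def time_completion(call_model, model, prompt):
    """Run one model call and record its wall-clock latency.

    call_model is any async callable (model, prompt) -> text; in a real
    benchmark it would wrap a provider SDK call.
    """
    start = time.perf_counter()
    text = await call_model(model, prompt)
    return {
        "model": model,
        "latency_s": time.perf_counter() - start,
        "output": text,
    }

async def compare_models(call_model, models, prompt):
    # Fan one prompt out to every model concurrently; results come back
    # in the same order as the models list.
    tasks = [time_completion(call_model, m, prompt) for m in models]
    return await asyncio.gather(*tasks)
```

With AsyncOpenAI, the callable could be an async function that awaits `client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])` and returns `response.choices[0].message.content`; for Anthropic or Google models you would swap in that provider's SDK.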
Troubleshooting
- If you get authentication errors, verify that OPENAI_API_KEY is set correctly in your environment.
- For rate limit errors, implement retries with exponential backoff.
- Use current model names such as gpt-4o or claude-sonnet-4-5 to avoid errors from deprecated models.
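The backoff advice above can be sketched as a small retry wrapper. `with_backoff` and its parameters are illustrative names, not part of any SDK; in practice you would pass something like `retry_on=(openai.RateLimitError,)` so only rate-limit errors are retried:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn() and retry with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Sleep base_delay * 1, 2, 4, ... plus random jitter so many
            # clients retrying at once do not hit the API in lockstep.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

A call site would then look like `with_backoff(lambda: client.chat.completions.create(...))`.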
Key takeaways
- Use standardized benchmarks like MMLU, HumanEval, and MATH to compare LLMs objectively.
- Run benchmark prompts via official SDKs with current model names for accurate results.
- Test multiple models and configurations including async and streaming for comprehensive evaluation.
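To make the scoring side concrete: HumanEval-style benchmarks grade a completion by executing it against hidden unit tests. Here is a minimal sketch of that check; `passes_tests` and `pass_at_1` are illustrative names, and real harnesses run the `exec` calls in a sandbox rather than the main process, since model output is untrusted code:

```python
def passes_tests(candidate_code, test_code):
    """Execute a model-generated solution and its unit tests in one namespace.

    Returns True only if both the solution and the asserts run cleanly.
    WARNING: exec on untrusted model output belongs in a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

def pass_at_1(completions, test_code):
    # Fraction of sampled completions that pass: the simplest pass@1 estimate.
    return sum(passes_tests(c, test_code) for c in completions) / len(completions)
```

Aggregating this pass rate over a benchmark's problems, alongside the latency and cost you log per request, gives you a directly comparable score for each model.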