How-to · Beginner · 3 min read

How to compare LLM performance

Quick answer
Compare LLM performance using standardized benchmarks such as MMLU for knowledge, HumanEval for coding, and MATH for reasoning. Use official SDKs to run benchmark prompts and evaluate accuracy, speed, and cost metrics programmatically.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

Use the OpenAI SDK to run benchmark prompts on models like gpt-4o and evaluate results against benchmark datasets.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example: Run a HumanEval coding benchmark prompt
prompt = """def add(a, b):\n    # Add two numbers\n"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=64
)

print("Model output:", response.choices[0].message.content)
output
Model output: def add(a, b):
    return a + b
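
To turn raw outputs into a comparable score, check each completion against the dataset's reference answer and track latency. The sketch below uses a simple exact-match metric for illustration; the real HumanEval harness scores generated code by executing it against unit tests, and `timed` is a hypothetical helper, not part of the SDK.

```python
import time

def exact_match_accuracy(predictions, references):
    """Fraction of model outputs that exactly match the reference answers."""
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

def timed(call):
    """Return (result, latency_in_seconds) for a single model call."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start

# Hypothetical outputs from two models on the same three prompts
refs = ["4", "6", "apple"]
model_a = ["4", "6", "apple"]
model_b = ["4", "7", "apple"]
print(exact_match_accuracy(model_a, refs))  # 1.0
print(exact_match_accuracy(model_b, refs))
```

The same `timed` wrapper works around any `client.chat.completions.create(...)` call, which lets you compare speed alongside accuracy.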

Common variations

You can test models from other providers such as claude-sonnet-4-5 or gemini-2.5-pro (each requires its own SDK or an OpenAI-compatible endpoint via base_url), use async calls, or stream outputs for real-time evaluation.

python
import asyncio
import os
from openai import AsyncOpenAI

# Use the async client for awaitable calls. To benchmark another provider's
# model (e.g. claude-sonnet-4-5), use that provider's SDK or point base_url
# at an OpenAI-compatible endpoint.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_async_benchmark():
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain recursion in Python."}],
        max_tokens=100
    )
    print("Async model output:", response.choices[0].message.content)

asyncio.run(run_async_benchmark())
output
Async model output: Recursion in Python is a function calling itself to solve smaller instances of a problem until a base case is reached.
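
Streaming also lets you measure time to first token, a latency metric that batch calls hide. Below is a minimal sketch using the SDK's `stream=True` flag; `join_deltas` and `stream_benchmark` are illustrative names, not part of the SDK.

```python
import time

def join_deltas(deltas):
    """Assemble streamed text deltas into the full completion."""
    return "".join(d for d in deltas if d)

def stream_benchmark(model, prompt):
    """Stream one completion; return (text, time_to_first_token, total_time)."""
    from openai import OpenAI  # imported here so join_deltas stays SDK-free
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    start = time.perf_counter()
    first_token_at = None
    deltas = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter() - start
            deltas.append(delta)
    return join_deltas(deltas), first_token_at, time.perf_counter() - start
```

Comparing time to first token across models is especially relevant for interactive applications, where perceived responsiveness matters more than total completion time.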

Troubleshooting

  • If you get authentication errors, verify your API key is set correctly in os.environ["OPENAI_API_KEY"].
  • For rate limits, implement exponential backoff retries.
  • Ensure you use current model names like gpt-4o or claude-sonnet-4-5 to avoid errors from deprecated model names.
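
The backoff advice above can be sketched as a small retry wrapper. `backoff_delays` and `with_retries` are illustrative names; in real code you would catch `openai.RateLimitError` rather than the broad `Exception`.

```python
import time

def backoff_delays(max_retries, base=1.0, cap=30.0):
    """Exponentially growing sleep intervals: base, 2*base, 4*base, ... capped."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]

def with_retries(call, max_retries=5, base=1.0):
    """Retry `call` with exponential backoff; re-raise after the last attempt."""
    for attempt, delay in enumerate(backoff_delays(max_retries, base=base)):
        try:
            return call()
        except Exception:  # in practice: except openai.RateLimitError
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
```

Production code usually adds random jitter to each delay so that many clients hitting the same rate limit do not retry in lockstep.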

Key Takeaways

  • Use standardized benchmarks like MMLU, HumanEval, and MATH to compare LLMs objectively.
  • Run benchmark prompts via official SDKs with current model names for accurate results.
  • Test multiple models and configurations including async and streaming for comprehensive evaluation.
Verified 2026-04 · gpt-4o, claude-sonnet-4-5, gemini-2.5-pro