How to compare LLM performance
Quick answer
Compare LLM performance using standardized benchmarks such as MMLU for knowledge, HumanEval for coding, and MATH for reasoning. Use official SDKs to run benchmark prompts and measure accuracy, speed, and cost programmatically.

Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai>=1.0

Output:
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
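Setting the key as an environment variable might look like this (the `sk-...` value is a placeholder for your own key):

```shell
# macOS/Linux: run in your terminal, or add to ~/.bashrc / ~/.zshrc to persist
export OPENAI_API_KEY="sk-..."   # replace with your real key

# Windows (PowerShell):
# setx OPENAI_API_KEY "sk-..."
```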
Step by step
Use the OpenAI SDK to run benchmark prompts on models like gpt-4o and evaluate results against benchmark datasets.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example: Run a HumanEval coding benchmark prompt
prompt = """def add(a, b):\n # Add two numbers\n"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=64,
)
print("Model output:", response.choices[0].message.content)

Output:
Model output: def add(a, b):
    return a + b

Common variations
You can test models from other providers, such as claude-sonnet-4-5 (Anthropic) or gemini-2.5-pro (Google), by running the same prompts through each provider's own SDK. You can also use async calls or stream outputs for real-time evaluation. With the OpenAI SDK, async calls go through AsyncOpenAI:

import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_async_benchmark():
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain recursion in Python."}],
        max_tokens=100,
    )
    print("Async model output:", response.choices[0].message.content)

asyncio.run(run_async_benchmark())

Output:
Async model output: Recursion in Python is a function calling itself to solve smaller instances of a problem until a base case is reached.
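Async calls are most useful for fanning one prompt out to several models at once and timing each response. Here is a minimal sketch; `compare_models` and `time_completion` are illustrative names, and the async callable you pass in would wrap a real SDK call (for example `client.chat.completions.create` from AsyncOpenAI):

```python
import asyncio
import time

async def time_completion(call_model, model, prompt):
    """Run one model call and record its wall-clock latency.

    call_model is any async callable (model, prompt) -> text; in a real
    benchmark it would wrap a provider SDK call.
    """
    start = time.perf_counter()
    text = await call_model(model, prompt)
    return {
        "model": model,
        "latency_s": time.perf_counter() - start,
        "output": text,
    }

async def compare_models(call_model, models, prompt):
    # Fan one prompt out to every model concurrently; results come back
    # in the same order as the models list.
    tasks = [time_completion(call_model, m, prompt) for m in models]
    return await asyncio.gather(*tasks)
```

With AsyncOpenAI, the callable could be an async function that awaits `client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])` and returns `response.choices[0].message.content`; for Anthropic or Google models you would swap in that provider's SDK.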
Troubleshooting
- If you get authentication errors, verify that OPENAI_API_KEY is set correctly in your environment.
- For rate limit errors, implement retries with exponential backoff.
- Use current model names such as gpt-4o or claude-sonnet-4-5 to avoid errors from deprecated models.
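The backoff advice above can be sketched as a small retry wrapper. `with_backoff` and its parameters are illustrative names, not part of any SDK; in practice you would pass something like `retry_on=(openai.RateLimitError,)` so only rate-limit errors are retried:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn() and retry with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Sleep base_delay * 1, 2, 4, ... plus random jitter so many
            # clients retrying at once do not hit the API in lockstep.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

A call site would then look like `with_backoff(lambda: client.chat.completions.create(...))`.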
Key takeaways
- Use standardized benchmarks like MMLU, HumanEval, and MATH to compare LLMs objectively.
- Run benchmark prompts via official SDKs with current model names for accurate results.
- Test multiple models and configurations including async and streaming for comprehensive evaluation.
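To make the scoring side concrete: HumanEval-style benchmarks grade a completion by executing it against hidden unit tests. Here is a minimal sketch of that check; `passes_tests` and `pass_at_1` are illustrative names, and real harnesses run the `exec` calls in a sandbox rather than the main process, since model output is untrusted code:

```python
def passes_tests(candidate_code, test_code):
    """Execute a model-generated solution and its unit tests in one namespace.

    Returns True only if both the solution and the asserts run cleanly.
    WARNING: exec on untrusted model output belongs in a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

def pass_at_1(completions, test_code):
    # Fraction of sampled completions that pass: the simplest pass@1 estimate.
    return sum(passes_tests(c, test_code) for c in completions) / len(completions)
```

Aggregating this pass rate over a benchmark's problems, alongside the latency and cost you log per request, gives you a directly comparable score for each model.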