How-to · Intermediate · 3 min read

How to build a custom LLM benchmark

Quick answer
Build a custom LLM benchmark by defining a representative dataset of prompts and expected outputs, then use the OpenAI SDK (or another LLM SDK) to generate model responses. Automate evaluation by comparing outputs against the ground truth with metrics such as accuracy or BLEU in a Python script.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python package and export your API key as an environment variable so it never appears in your code.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

Create a Python script that loads a dataset of prompts and expected answers, queries the LLM with the OpenAI SDK, and scores the responses with a simple exact-match accuracy metric.

python
import os
from openai import OpenAI

# Initialize client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample benchmark dataset: list of dicts with prompt and expected answer
benchmark_data = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Translate 'Hello' to Spanish.", "expected": "Hola"},
    {"prompt": "Solve: 2 + 2", "expected": "4"}
]

correct = 0

for item in benchmark_data:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": item["prompt"]}]
    )
    answer = response.choices[0].message.content.strip()
    print(f"Prompt: {item['prompt']}")
    print(f"Model answer: {answer}")
    print(f"Expected answer: {item['expected']}")
    if answer.lower() == item["expected"].lower():
        correct += 1
        print("Result: Correct\n")
    else:
        print("Result: Incorrect\n")

accuracy = correct / len(benchmark_data) * 100
print(f"Benchmark accuracy: {accuracy:.2f}%")
output
Prompt: What is the capital of France?
Model answer: Paris
Expected answer: Paris
Result: Correct

Prompt: Translate 'Hello' to Spanish.
Model answer: Hola
Expected answer: Hola
Result: Correct

Prompt: Solve: 2 + 2
Model answer: 4
Expected answer: 4
Result: Correct

Benchmark accuracy: 100.00%
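
The exact-match check above is strict: a model that answers "The capital of France is Paris." would be scored incorrect even though the answer is right. A common refinement is to normalize both strings before comparing. The helper below is a minimal sketch of that idea; the function names `normalize` and `is_correct` are our own, not part of any SDK.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def is_correct(answer, expected):
    # Count the answer as correct if the normalized expected string
    # appears anywhere in the normalized model answer.
    return normalize(expected) in normalize(answer)

print(is_correct("The capital of France is Paris.", "Paris"))  # True
print(is_correct("Madrid", "Paris"))                           # False
```

Swap `is_correct(answer, item["expected"])` in for the `answer.lower() == item["expected"].lower()` comparison to make the benchmark tolerant of conversational phrasing.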

Common variations

  • Use asynchronous calls with asyncio for faster batch querying.
  • Evaluate with advanced metrics like BLEU, ROUGE, or custom scoring functions.
  • Test multiple models by looping over different model names.
  • Stream responses using stream=True for real-time feedback.
python
import asyncio
import os
from openai import AsyncOpenAI

# Use the async client; the sync OpenAI client cannot be awaited
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

benchmark_data = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Translate 'Hello' to Spanish.", "expected": "Hola"},
    {"prompt": "Solve: 2 + 2", "expected": "4"}
]

async def query_model(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

async def run_benchmark():
    # Fire all requests concurrently instead of awaiting them one at a time
    answers = await asyncio.gather(
        *(query_model(item["prompt"]) for item in benchmark_data)
    )
    correct = 0
    for item, answer in zip(benchmark_data, answers):
        print(f"Prompt: {item['prompt']}")
        print(f"Model answer: {answer}")
        print(f"Expected answer: {item['expected']}")
        if answer.lower() == item["expected"].lower():
            correct += 1
            print("Result: Correct\n")
        else:
            print("Result: Incorrect\n")
    accuracy = correct / len(benchmark_data) * 100
    print(f"Benchmark accuracy: {accuracy:.2f}%")

asyncio.run(run_benchmark())
output
Prompt: What is the capital of France?
Model answer: Paris
Expected answer: Paris
Result: Correct

Prompt: Translate 'Hello' to Spanish.
Model answer: Hola
Expected answer: Hola
Result: Correct

Prompt: Solve: 2 + 2
Model answer: 4
Expected answer: 4
Result: Correct

Benchmark accuracy: 100.00%
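
For open-ended prompts, even substring matching is too coarse. The "advanced metrics" bullet can be made concrete with a token-level F1 score, the metric popularized by QA benchmarks such as SQuAD. The sketch below is one common formulation, not an official implementation.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital", "the capital is Paris"))  # 1.0
print(token_f1("Madrid", "Paris"))                               # 0.0
```

Replace the exact-match check with a threshold such as `token_f1(answer, item["expected"]) >= 0.5` to give partial credit for near-miss answers; for translation-style tasks, libraries like nltk (BLEU) or rouge-score (ROUGE) offer more rigorous alternatives.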

Troubleshooting

  • If you get authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • For rate limit errors, add delays or use smaller batch sizes.
  • If model responses are inconsistent, set temperature to 0 for near-deterministic output; raise max_tokens if answers are being cut off mid-sentence.
  • Check network connectivity if API calls fail.
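
The rate-limit bullet above is usually handled with exponential backoff: retry the call after a delay that doubles on each failure. The wrapper below is a minimal standard-library sketch; the helper name `with_backoff` is our own, and in production you might prefer a retry library such as tenacity, catching `openai.RateLimitError` specifically rather than a broad exception class.

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            time.sleep(base_delay * 2 ** attempt)

# Usage sketch: wrap the chat completion call
# answer = with_backoff(
#     lambda: client.chat.completions.create(...),
#     retry_on=(openai.RateLimitError,),
# )
```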

Key Takeaways

  • Use a representative dataset of prompts and expected outputs for meaningful benchmarking.
  • Automate querying and evaluation with the OpenAI SDK using environment-secured API keys.
  • Leverage async calls and advanced metrics for scalable and precise benchmark results.
Verified 2026-04 · gpt-4o-mini