How-to · Intermediate · 3 min read

How to compare model outputs across prompt versions

Quick answer
Send each prompt version to a model such as gpt-4o via the OpenAI SDK, then compare the outputs using metrics such as exact match, semantic similarity, or human evaluation. Automate this with a Python script so you can rank the versions and select the best one.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (plus pip install anthropic for the Claude variation)

Setup

Install the openai Python package (and anthropic, if you plan to run the Claude variation) and set your API keys as environment variables for secure access.

bash
pip install "openai>=1.0"

Step by step

Send multiple prompt versions to the gpt-4o model using the OpenAI SDK, collect outputs, and compare them using simple string comparison or semantic similarity.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt_versions = [
    "Explain the benefits of AI in healthcare.",
    "Describe how AI improves healthcare outcomes.",
    "What are the advantages of AI in the medical field?"
]

outputs = []
for prompt in prompt_versions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    outputs.append(response.choices[0].message.content)

for i, output in enumerate(outputs):
    print(f"Output for prompt version {i+1}:\n{output}\n{'-'*40}")
output
Output for prompt version 1:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Output for prompt version 2:
AI improves healthcare outcomes through predictive analytics, automation, and enhanced decision-making support.
----------------------------------------
Output for prompt version 3:
The advantages of AI in the medical field include increased accuracy, efficiency, and the ability to analyze large datasets.
----------------------------------------
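Before reaching for embedding models, a quick first pass at comparing these outputs is pairwise string similarity with Python's standard-library difflib. This is a rough character-overlap measure, not true semantic similarity; the outputs below are copied from the sample run above:

```python
from difflib import SequenceMatcher
from itertools import combinations

outputs = [
    "AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.",
    "AI improves healthcare outcomes through predictive analytics, automation, and enhanced decision-making support.",
    "The advantages of AI in the medical field include increased accuracy, efficiency, and the ability to analyze large datasets.",
]

def similarity(a, b):
    # Ratio of matching characters (0.0 to 1.0); 1.0 means identical strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Score every pair of prompt versions
for (i, a), (j, b) in combinations(enumerate(outputs, 1), 2):
    print(f"versions {i} vs {j}: {similarity(a, b):.2f}")
```

Low pairwise scores suggest the prompt versions are steering the model in genuinely different directions; very high scores mean the wording change had little effect.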

Common variations

You can extend the comparison with semantic-similarity libraries such as sentence-transformers, or automate human evaluation with scoring rubrics. You can also try different models, such as claude-3-5-haiku-20241022, or use asynchronous calls for batch processing.

python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

prompt_versions = [
    "Explain the benefits of AI in healthcare.",
    "Describe how AI improves healthcare outcomes.",
    "What are the advantages of AI in the medical field?"
]

outputs = []
for prompt in prompt_versions:
    message = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}]
    )
    # content is a list of content blocks; take the text of the first one
    outputs.append(message.content[0].text)

for i, output in enumerate(outputs):
    print(f"Claude output for prompt version {i+1}:\n{output}\n{'-'*40}")
output
Claude output for prompt version 1:
AI in healthcare offers faster diagnosis, personalized care, and better patient outcomes.
----------------------------------------
Claude output for prompt version 2:
Healthcare benefits from AI through improved predictions, automation, and decision support.
----------------------------------------
Claude output for prompt version 3:
AI advantages in medicine include accuracy, efficiency, and data-driven insights.
----------------------------------------
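The scoring-rubric idea mentioned above can be automated with a simple keyword-coverage check. This is a minimal sketch: the rubric keywords below are illustrative placeholders, not a validated evaluation rubric.

```python
def rubric_score(output, keywords):
    # Fraction of rubric keywords found in the output (case-insensitive).
    text = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

# Illustrative rubric for the healthcare prompts above
rubric = ["diagnosis", "personalized", "monitoring", "accuracy"]

sample = "AI in healthcare offers faster diagnosis, personalized care, and better patient outcomes."
print(f"rubric coverage: {rubric_score(sample, rubric):.2f}")
```

Apply the same rubric to each prompt version's output and the coverage scores give you a crude but repeatable ranking, which you can refine with human review.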

Troubleshooting

  • If outputs are too similar, increase prompt variation or use more sensitive semantic similarity metrics.
  • If API calls fail, verify your API key and network connectivity.
  • For inconsistent outputs, set the temperature to a low value (e.g., 0) to reduce randomness.
python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain AI benefits."}],
    temperature=0
)
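For the failing-call case, a minimal retry wrapper with exponential backoff can smooth over transient network errors. This is a generic sketch, not an SDK feature; `make_request` below is a stand-in for your real API call:

```python
import time

def with_retries(call, attempts=3, base_delay=1.0):
    # Retry transient failures, doubling the wait between attempts.
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for a real call, e.g.
# lambda: client.chat.completions.create(model="gpt-4o", messages=[...])
def make_request():
    return "ok"

print(with_retries(make_request, attempts=3, base_delay=0.1))  # prints "ok"
```

In production you would catch only the SDK's transient error types (rate limits, timeouts) rather than a bare `Exception`, so that genuine bugs still fail fast.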

Key Takeaways

  • Use consistent API calls to collect outputs from different prompt versions for fair comparison.
  • Apply semantic similarity or human evaluation to measure output quality beyond exact text matches.
  • Control randomness with temperature settings to get stable outputs during comparison.
Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022