How-to · Beginner · 3 min read

How to compare prompts with A/B testing

Quick answer
Run an A/B test by sending each prompt variant to the same model via the chat.completions.create API, then compare the outputs on metrics like relevance or accuracy. Automate this in Python by running the variants sequentially or in parallel and analyzing the results.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.

bash
pip install "openai>=1.0"

Step by step

This example sends two prompt variants to gpt-4o and prints their outputs for manual comparison.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = [
    "Explain the benefits of AI in healthcare.",
    "Describe how AI improves healthcare outcomes."
]

responses = []
for i, prompt in enumerate(prompts, 1):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    print(f"Prompt Variant {i}: {prompt}\nResponse:\n{text}\n{'-'*40}")
    responses.append(text)
output
Prompt Variant 1: Explain the benefits of AI in healthcare.
Response:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt Variant 2: Describe how AI improves healthcare outcomes.
Response:
AI improves healthcare outcomes by analyzing data to predict diseases early, optimizing treatments, and reducing errors.
----------------------------------------
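Manual inspection works for two variants, but scoring responses automatically scales better. A minimal sketch, assuming a simple keyword-coverage metric — the KEYWORDS list and score_response helper are illustrative, not part of any API:

```python
# Score each response by the fraction of target keywords it contains.
# The keyword list and scoring rule are illustrative assumptions,
# not an official evaluation metric.
KEYWORDS = ["diagnosis", "treatment", "monitoring", "errors"]

def score_response(text: str) -> float:
    """Return the fraction of target keywords present in the response."""
    lower = text.lower()
    hits = sum(1 for kw in KEYWORDS if kw in lower)
    return hits / len(KEYWORDS)

responses = [
    "AI enhances healthcare by enabling faster diagnosis, "
    "personalized treatment, and improved patient monitoring.",
    "AI improves healthcare outcomes by analyzing data to predict "
    "diseases early, optimizing treatments, and reducing errors.",
]

for i, text in enumerate(responses, 1):
    print(f"Variant {i} keyword coverage: {score_response(text):.2f}")
```

In practice you would replace keyword coverage with whatever metric matters for your use case, such as an LLM-as-judge score or human ratings.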

Common variations

You can run A/B tests concurrently with asyncio, or use streaming responses for faster feedback. You can also compare across models, such as claude-3-5-sonnet-20241022 or gemini-1.5-pro.

python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_response(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain the benefits of AI in healthcare.",
        "Describe how AI improves healthcare outcomes."
    ]
    tasks = [get_response(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for i, text in enumerate(results, 1):
        print(f"Prompt Variant {i} Response:\n{text}\n{'-'*40}")

if __name__ == "__main__":
    asyncio.run(main())
output
Prompt Variant 1 Response:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt Variant 2 Response:
AI improves healthcare outcomes by analyzing data to predict diseases early, optimizing treatments, and reducing errors.
----------------------------------------
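When you repeat runs across several variants or models, it helps to aggregate results into a small table before drawing conclusions. A sketch using only the standard library — the model names, variant labels, and per-run scores here are placeholders:

```python
from collections import defaultdict

# results[(model, variant)] -> list of per-run scores (placeholder
# values standing in for whatever metric you collect per run)
results = defaultdict(list)
results[("gpt-4o", "variant_1")].append(0.75)
results[("gpt-4o", "variant_1")].append(0.80)
results[("gpt-4o", "variant_2")].append(0.50)

def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)

for (model, variant), scores in sorted(results.items()):
    print(f"{model:12s} {variant:10s} mean={mean(scores):.2f} n={len(scores)}")
```

Running each variant several times and comparing means, rather than single responses, reduces the chance that one lucky completion decides the test.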

Troubleshooting

If you receive rate limit errors, add delays between requests, retry with exponential backoff, or reduce max_tokens to lower token usage. For inconsistent outputs, set temperature=0 and optionally pass a fixed seed; this reduces randomness, though responses are still not guaranteed to be fully deterministic.

python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0
)
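For rate limits, a retry wrapper with exponential backoff is more robust than fixed delays. A minimal sketch: the helper below is generic, and the commented usage assumes openai.RateLimitError from the openai>=1.0 SDK, with illustrative retry parameters:

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff on the given errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))

# Usage against the API (openai.RateLimitError exists in openai>=1.0):
# result = with_backoff(
#     lambda: client.chat.completions.create(
#         model="gpt-4o",
#         messages=[{"role": "user", "content": "Your prompt here"}],
#     ),
#     retry_on=(openai.RateLimitError,),
# )
```

Doubling the delay on each attempt (1s, 2s, 4s, ...) gives the rate limiter time to reset without hammering the endpoint.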

Key Takeaways

  • Use the same model and environment to fairly compare prompt variants.
  • Automate prompt testing with scripts to collect and analyze outputs efficiently.
  • Control randomness with temperature=0 for consistent A/B test results.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, gemini-1.5-pro