How-to · Beginner · 3 min read

How to compare prompts with A/B testing

Quick answer
Run an A/B test by sending each prompt variant to the same model via the chat.completions.create API, then compare the outputs on metrics like relevance or accuracy. Automate this in Python by running the variants sequentially or in parallel and analyzing the results.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.

bash
pip install "openai>=1.0"

Step by step

This example sends two prompt variants to gpt-4o and prints their outputs for manual comparison.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = [
    "Explain the benefits of AI in healthcare.",
    "Describe how AI improves healthcare outcomes."
]

responses = []
for i, prompt in enumerate(prompts, 1):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    print(f"Prompt Variant {i}: {prompt}\nResponse:\n{text}\n{'-'*40}")
    responses.append(text)
output
Prompt Variant 1: Explain the benefits of AI in healthcare.
Response:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt Variant 2: Describe how AI improves healthcare outcomes.
Response:
AI improves healthcare outcomes by analyzing data to predict diseases early, optimizing treatments, and reducing errors.
----------------------------------------
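Manual inspection works for two variants, but scoring responses automatically scales better. A minimal sketch, assuming a simple keyword-coverage metric — the KEYWORDS list and score_response helper are illustrative, not part of any API:

```python
# Score each response by the fraction of target keywords it contains.
# The keyword list and scoring rule are illustrative assumptions,
# not an official evaluation metric.
KEYWORDS = ["diagnosis", "treatment", "monitoring", "errors"]

def score_response(text: str) -> float:
    """Return the fraction of target keywords present in the response."""
    lower = text.lower()
    hits = sum(1 for kw in KEYWORDS if kw in lower)
    return hits / len(KEYWORDS)

responses = [
    "AI enhances healthcare by enabling faster diagnosis, "
    "personalized treatment, and improved patient monitoring.",
    "AI improves healthcare outcomes by analyzing data to predict "
    "diseases early, optimizing treatments, and reducing errors.",
]

for i, text in enumerate(responses, 1):
    print(f"Variant {i} keyword coverage: {score_response(text):.2f}")
```

In practice you would replace keyword coverage with whatever metric matters for your use case, such as an LLM-as-judge score or human ratings.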

Common variations

You can run A/B tests concurrently with asyncio, or use streaming responses for faster feedback. You can also compare across models, such as claude-3-5-sonnet-20241022 or gemini-1.5-pro.

python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_response(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain the benefits of AI in healthcare.",
        "Describe how AI improves healthcare outcomes."
    ]
    tasks = [get_response(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for i, text in enumerate(results, 1):
        print(f"Prompt Variant {i} Response:\n{text}\n{'-'*40}")

if __name__ == "__main__":
    asyncio.run(main())
output
Prompt Variant 1 Response:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt Variant 2 Response:
AI improves healthcare outcomes by analyzing data to predict diseases early, optimizing treatments, and reducing errors.
----------------------------------------
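When you repeat runs across several variants or models, it helps to aggregate results into a small table before drawing conclusions. A sketch using only the standard library — the model names, variant labels, and per-run scores here are placeholders:

```python
from collections import defaultdict

# results[(model, variant)] -> list of per-run scores (placeholder
# values standing in for whatever metric you collect per run)
results = defaultdict(list)
results[("gpt-4o", "variant_1")].append(0.75)
results[("gpt-4o", "variant_1")].append(0.80)
results[("gpt-4o", "variant_2")].append(0.50)

def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)

for (model, variant), scores in sorted(results.items()):
    print(f"{model:12s} {variant:10s} mean={mean(scores):.2f} n={len(scores)}")
```

Running each variant several times and comparing means, rather than single responses, reduces the chance that one lucky completion decides the test.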

Troubleshooting

If you receive rate limit errors, add delays between requests, retry with exponential backoff, or reduce max_tokens to lower token usage. For inconsistent outputs, set temperature=0 and optionally pass a fixed seed; this reduces randomness, though responses are still not guaranteed to be fully deterministic.

python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0
)
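For rate limits, a retry wrapper with exponential backoff is more robust than fixed delays. A minimal sketch: the helper below is generic, and the commented usage assumes openai.RateLimitError from the openai>=1.0 SDK, with illustrative retry parameters:

```python
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff on the given errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))

# Usage against the API (openai.RateLimitError exists in openai>=1.0):
# result = with_backoff(
#     lambda: client.chat.completions.create(
#         model="gpt-4o",
#         messages=[{"role": "user", "content": "Your prompt here"}],
#     ),
#     retry_on=(openai.RateLimitError,),
# )
```

Doubling the delay on each attempt (1s, 2s, 4s, ...) gives the rate limiter time to reset without hammering the endpoint.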

Key Takeaways

  • Use the same model and environment to fairly compare prompt variants.
  • Automate prompt testing with scripts to collect and analyze outputs efficiently.
  • Control randomness with temperature=0 for consistent A/B test results.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, gemini-1.5-pro