How to A/B test prompts
Quick answer
To A/B test prompts, send multiple prompt variants to an LLM such as gpt-4o through the API, then collect and compare the outputs using metrics like relevance or accuracy. Automate the process by running requests in parallel and analyzing the results statistically to identify the best-performing prompt.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
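For example, on macOS or Linux you can export the key in your shell; the key value below is a placeholder, not a real credential.

```shell
# Store the key in an environment variable so it never appears in your code.
# Replace the placeholder with your actual key from the OpenAI dashboard.
export OPENAI_API_KEY="sk-your-key-here"
```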
pip install openai>=1.0
Step by step
This example sends two prompt variants to gpt-4o and compares their outputs side-by-side.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = [
    "Explain the benefits of AI in healthcare.",
    "Describe how AI improves healthcare outcomes."
]

# Send each prompt variant and collect the model's reply
responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    responses.append(text)

# Print the outputs side by side for comparison
for i, output in enumerate(responses, 1):
    print(f"Prompt {i} output:\n{output}\n{'-'*40}")
output
Prompt 1 output:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt 2 output:
AI improves healthcare outcomes through predictive analytics, automating routine tasks, and supporting clinical decisions.
----------------------------------------
Common variations
- Use asynchronous calls to speed up testing multiple prompts concurrently.
- Test with different models like claude-3-5-haiku-20241022 for comparison.
- Incorporate automated scoring metrics such as BLEU or ROUGE for quantitative evaluation.
import asyncio
import os

from openai import AsyncOpenAI

# Use the async client: in openai>=1.0 there is no acreate method --
# the AsyncOpenAI client's create method is awaitable instead.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_response(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain the benefits of AI in healthcare.",
        "Describe how AI improves healthcare outcomes."
    ]
    # Fire both requests concurrently and wait for all results
    tasks = [get_response(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for i, res in enumerate(results, 1):
        print(f"Prompt {i} output:\n{res}\n{'-'*40}")

asyncio.run(main())
output
Prompt 1 output:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt 2 output:
AI improves healthcare outcomes through predictive analytics, automating routine tasks, and supporting clinical decisions.
----------------------------------------
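To move beyond eyeballing outputs, you can attach a quantitative score to each one. Below is a minimal sketch using a simple keyword-coverage heuristic as a stand-in for metrics like BLEU or ROUGE (which require extra packages such as nltk or rouge-score); the keyword list and outputs here are illustrative, not real API responses.

```python
# Minimal scoring sketch: fraction of expected keywords present in each output.
# This is a crude proxy for relevance, not a substitute for BLEU/ROUGE.
def keyword_coverage(text, keywords):
    """Return the fraction of keywords that appear in text (case-insensitive)."""
    text_lower = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text_lower)
    return hits / len(keywords)

keywords = ["diagnosis", "treatment", "patient", "outcomes"]
outputs = [
    "AI enhances healthcare by enabling faster diagnosis and personalized treatment.",
    "AI improves healthcare outcomes through predictive analytics.",
]
scores = [keyword_coverage(o, keywords) for o in outputs]
best = max(range(len(scores)), key=lambda i: scores[i])
print(f"Scores: {scores}, best prompt: {best + 1}")
```

In a real pipeline you would feed the collected API responses into `scores` instead of hard-coded strings, and run each prompt many times so the comparison is statistical rather than anecdotal.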
Troubleshooting
- If you receive rate limit errors, reduce request frequency or upgrade your API plan.
- Empty or irrelevant outputs may indicate prompts are too vague; refine prompt clarity.
- Check environment variable setup if authentication fails.
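For the rate-limit case, a common pattern is retrying with exponential backoff. Below is a minimal, generic sketch; `with_backoff` is a hypothetical helper, not part of the openai package, and the delay values are illustrative.

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(); on a listed exception, wait base_delay * 2**attempt and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

With the OpenAI client you would wrap the call, e.g. `with_backoff(lambda: client.chat.completions.create(...), retry_on=(openai.RateLimitError,))`; `openai.RateLimitError` is the v1.x exception raised for HTTP 429 responses.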
Key takeaways
- Use parallel API calls to efficiently compare multiple prompt variants.
- Automate output evaluation with quantitative metrics for objective A/B testing.
- Test across different models to find the best prompt-model combination.