How to · Beginner · 3 min read

How to do A/B testing for prompts

Quick answer
Perform A/B testing for prompts by sending two or more prompt variants to an AI model with the OpenAI Python SDK and comparing their outputs. Automate the comparison by running the requests in parallel and scoring the responses, either qualitatively or with metrics such as token usage.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python package and set your API key as an environment variable for secure authentication.

bash
pip install openai>=1.0
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example demonstrates how to run A/B testing by sending two different prompt variants to the gpt-4o model and printing their responses for comparison.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = {
    "A": "Explain the benefits of renewable energy.",
    "B": "List the advantages of using renewable energy sources."
}

responses = {}
for variant, prompt in prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    responses[variant] = response.choices[0].message.content

print("A/B Testing Results:")
for variant, text in responses.items():
    print(f"Variant {variant}:\n{text}\n")
output
A/B Testing Results:
Variant A:
Renewable energy offers benefits such as reducing greenhouse gas emissions, lowering energy costs over time, and promoting energy independence.

Variant B:
Using renewable energy sources provides advantages including environmental protection, sustainable power supply, and decreased reliance on fossil fuels.
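
Beyond printing results, it helps to persist each variant's output so runs can be compared side by side later. The sketch below is one lightweight way to do that: a hypothetical `log_results` helper (not part of the SDK) that appends the `responses` dict built above to a CSV file.

```python
import csv
from datetime import datetime, timezone

def log_results(responses, path="ab_results.csv"):
    """Append one row per variant so multiple runs can be compared later."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for variant, text in responses.items():
            writer.writerow([timestamp, variant, text])

# Illustrative data standing in for real model responses
responses = {
    "A": "Renewable energy reduces greenhouse gas emissions.",
    "B": "Advantages include cost savings and energy independence.",
}
log_results(responses)
```

Each run adds a timestamped block of rows, so the same prompts can be re-tested over time and compared across model versions.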

Common variations

  • Use asynchronous calls with asyncio for parallel prompt testing.
  • Test more than two prompt variants by expanding the prompts dictionary.
  • Switch models, e.g., gpt-4o-mini, to compare prompt effectiveness across models (models from other providers, such as claude-3-5-sonnet-20241022, require that provider's own SDK).
  • Incorporate automated metrics like token usage or sentiment analysis to quantify prompt performance.
python
import asyncio
import os
from openai import AsyncOpenAI

# Async requests require the AsyncOpenAI client; its create() method is awaitable
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = {
    "A": "Explain the benefits of renewable energy.",
    "B": "List the advantages of using renewable energy sources.",
    "C": "Why is renewable energy important for the environment?"
}

async def get_response(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    # Fire all requests concurrently and collect results in prompt order
    tasks = [get_response(p) for p in prompts.values()]
    results = await asyncio.gather(*tasks)
    for variant, text in zip(prompts.keys(), results):
        print(f"Variant {variant}:\n{text}\n")

if __name__ == "__main__":
    asyncio.run(main())
output
Variant A:
Renewable energy reduces carbon emissions and supports sustainable development.

Variant B:
Advantages of renewable energy include cost savings and environmental benefits.

Variant C:
Renewable energy is crucial for protecting ecosystems and reducing pollution.
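
The automated-metrics variation can start as a simple scoring pass over the collected texts. The sketch below is illustrative only: it assumes responses have already been gathered into a dict as in the examples above, and the `score` function and its keyword criteria are made-up stand-ins for whatever quality signal matters to you.

```python
def score(text, keywords):
    """Toy quality score: keyword coverage, plus a capped length bonus."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw in lowered)
    return hits * 10 + min(len(lowered.split()), 50)

# Illustrative responses standing in for real model output
responses = {
    "A": "Renewable energy reduces carbon emissions and supports sustainability.",
    "B": "Advantages include cost savings and environmental benefits.",
}
keywords = ["emissions", "cost", "sustainab"]

# Rank variants from highest to lowest score
ranked = sorted(responses, key=lambda v: score(responses[v], keywords), reverse=True)
for variant in ranked:
    print(variant, score(responses[variant], keywords))
```

Swapping in a real metric (token counts from `response.usage`, an embedding similarity to a reference answer, or a model-graded rubric) keeps the same shape: score each variant, then rank.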

Troubleshooting

  • If you receive authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • For rate limit errors, add delays between requests or reduce concurrency.
  • If responses are inconsistent between runs, set temperature=0 (or a low value) for more deterministic outputs, or adjust prompt phrasing for clarity.
  • If async methods or client classes are missing, check your installed SDK version (pip show openai); the 1.x SDK restructured the client interface.
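
For the rate-limit case, the usual fix is a retry wrapper with exponential backoff. Below is a minimal sketch; `flaky_call` is a made-up stand-in that simulates a rate-limited request (with the real SDK you would catch openai.RateLimitError around the create call instead of a bare Exception).

```python
import time

def with_backoff(fn, retries=4, base_delay=1.0):
    """Retry fn, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky request: fails twice, then succeeds
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky_call, base_delay=0.01))  # prints "ok" after two retries
```

Wrapping each variant's request this way lets a batch of A/B tests survive transient 429s without manual babysitting.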

Key Takeaways

  • Use the OpenAI Python SDK to send multiple prompt variants and compare outputs for A/B testing.
  • Automate prompt testing with asynchronous calls to speed up evaluation of many variants.
  • Analyze responses qualitatively or with metrics like token usage to select the best prompt.
  • Adjust concurrency and API parameters to avoid rate limits and improve consistency.
  • Test across different models to find the optimal prompt-model combination.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022