How to · Beginner · 3 min read

How to do A/B testing for prompts

Quick answer
Perform A/B testing for prompts by sending two or more prompt variants to an AI model with the OpenAI Python SDK and comparing their outputs. Automate the comparison by running the requests in parallel and scoring the responses, either qualitatively or with metrics such as token usage.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python package and set your API key as an environment variable for secure authentication.

bash
pip install openai>=1.0
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example demonstrates how to run A/B testing by sending two different prompt variants to the gpt-4o model and printing their responses for comparison.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = {
    "A": "Explain the benefits of renewable energy.",
    "B": "List the advantages of using renewable energy sources."
}

responses = {}
for variant, prompt in prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    responses[variant] = response.choices[0].message.content

print("A/B Testing Results:")
for variant, text in responses.items():
    print(f"Variant {variant}:\n{text}\n")
output
A/B Testing Results:
Variant A:
Renewable energy offers benefits such as reducing greenhouse gas emissions, lowering energy costs over time, and promoting energy independence.

Variant B:
Using renewable energy sources provides advantages including environmental protection, sustainable power supply, and decreased reliance on fossil fuels.
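
Beyond printing results, it helps to persist each variant's output so runs can be compared side by side later. The sketch below is one lightweight way to do that: a hypothetical `log_results` helper (not part of the SDK) that appends the `responses` dict built above to a CSV file.

```python
import csv
from datetime import datetime, timezone

def log_results(responses, path="ab_results.csv"):
    """Append one row per variant so multiple runs can be compared later."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for variant, text in responses.items():
            writer.writerow([timestamp, variant, text])

# Illustrative data standing in for real model responses
responses = {
    "A": "Renewable energy reduces greenhouse gas emissions.",
    "B": "Advantages include cost savings and energy independence.",
}
log_results(responses)
```

Each run adds a timestamped block of rows, so the same prompts can be re-tested over time and compared across model versions.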

Common variations

  • Use asynchronous calls with asyncio for parallel prompt testing.
  • Test more than two prompt variants by expanding the prompts dictionary.
  • Switch models, e.g., gpt-4o-mini, to compare prompt effectiveness across models (models from other providers, such as claude-3-5-sonnet-20241022, require that provider's own SDK).
  • Incorporate automated metrics like token usage or sentiment analysis to quantify prompt performance.
python
import asyncio
import os
from openai import AsyncOpenAI

# Async requests require the AsyncOpenAI client; its create() method is awaitable
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = {
    "A": "Explain the benefits of renewable energy.",
    "B": "List the advantages of using renewable energy sources.",
    "C": "Why is renewable energy important for the environment?"
}

async def get_response(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    # Fire all requests concurrently and collect results in prompt order
    tasks = [get_response(p) for p in prompts.values()]
    results = await asyncio.gather(*tasks)
    for variant, text in zip(prompts.keys(), results):
        print(f"Variant {variant}:\n{text}\n")

if __name__ == "__main__":
    asyncio.run(main())
output
Variant A:
Renewable energy reduces carbon emissions and supports sustainable development.

Variant B:
Advantages of renewable energy include cost savings and environmental benefits.

Variant C:
Renewable energy is crucial for protecting ecosystems and reducing pollution.
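
The automated-metrics variation can start as a simple scoring pass over the collected texts. The sketch below is illustrative only: it assumes responses have already been gathered into a dict as in the examples above, and the `score` function and its keyword criteria are made-up stand-ins for whatever quality signal matters to you.

```python
def score(text, keywords):
    """Toy quality score: keyword coverage, plus a capped length bonus."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw in lowered)
    return hits * 10 + min(len(lowered.split()), 50)

# Illustrative responses standing in for real model output
responses = {
    "A": "Renewable energy reduces carbon emissions and supports sustainability.",
    "B": "Advantages include cost savings and environmental benefits.",
}
keywords = ["emissions", "cost", "sustainab"]

# Rank variants from highest to lowest score
ranked = sorted(responses, key=lambda v: score(responses[v], keywords), reverse=True)
for variant in ranked:
    print(variant, score(responses[variant], keywords))
```

Swapping in a real metric (token counts from `response.usage`, an embedding similarity to a reference answer, or a model-graded rubric) keeps the same shape: score each variant, then rank.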

Troubleshooting

  • If you receive authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • For rate limit errors, add delays between requests or reduce concurrency.
  • If responses are inconsistent between runs, set temperature=0 (or a low value) for more deterministic outputs, or adjust prompt phrasing for clarity.
  • If async methods or client classes are missing, check your installed SDK version (pip show openai); the 1.x SDK restructured the client interface.
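
For the rate-limit case, the usual fix is a retry wrapper with exponential backoff. Below is a minimal sketch; `flaky_call` is a made-up stand-in that simulates a rate-limited request (with the real SDK you would catch openai.RateLimitError around the create call instead of a bare Exception).

```python
import time

def with_backoff(fn, retries=4, base_delay=1.0):
    """Retry fn, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky request: fails twice, then succeeds
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky_call, base_delay=0.01))  # prints "ok" after two retries
```

Wrapping each variant's request this way lets a batch of A/B tests survive transient 429s without manual babysitting.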

Key Takeaways

  • Use the OpenAI Python SDK to send multiple prompt variants and compare outputs for A/B testing.
  • Automate prompt testing with asynchronous calls to speed up evaluation of many variants.
  • Analyze responses qualitatively or with metrics like token usage to select the best prompt.
  • Adjust concurrency and API parameters to avoid rate limits and improve consistency.
  • Test across different models to find the optimal prompt-model combination.
Verified 2026-04 · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022