How to compare prompts with A/B testing
Quick answer
Use A/B testing by sending multiple prompt variants to the same model via the
chat.completions.create API, then compare the outputs on metrics such as relevance or accuracy. Automate this with Python by running each prompt variant sequentially or in parallel and analyzing the results.

Prerequisites

- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.
pip install openai>=1.0

Step by step
This example sends two prompt variants to gpt-4o and prints their outputs for manual comparison.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = [
    "Explain the benefits of AI in healthcare.",
    "Describe how AI improves healthcare outcomes.",
]

responses = []
for i, prompt in enumerate(prompts, 1):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    print(f"Prompt Variant {i}: {prompt}\nResponse:\n{text}\n{'-'*40}")
    responses.append(text)

Output
Prompt Variant 1: Explain the benefits of AI in healthcare.
Response:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt Variant 2: Describe how AI improves healthcare outcomes.
Response:
AI improves healthcare outcomes by analyzing data to predict diseases early, optimizing treatments, and reducing errors.
----------------------------------------
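Manual inspection works for two variants, but automated scoring scales better as variants multiply. One simple (and admittedly crude) metric is keyword coverage: the fraction of expected terms each response mentions. The keyword list and sample responses below are illustrative, not part of the original example:

```python
def keyword_score(text, keywords):
    """Return the fraction of expected keywords found in the text (case-insensitive)."""
    text_lower = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text_lower)
    return hits / len(keywords)

# Illustrative keywords and responses; in practice, score the texts collected above.
keywords = ["diagnosis", "treatment", "monitoring"]
responses = [
    "AI enhances healthcare by enabling faster diagnosis, personalized "
    "treatment, and improved patient monitoring.",
    "AI improves healthcare outcomes by analyzing data to predict diseases "
    "early, optimizing treatments, and reducing errors.",
]
for i, text in enumerate(responses, 1):
    print(f"Variant {i} keyword score: {keyword_score(text, keywords):.2f}")
```

Substring matching is deliberately naive; for real evaluations you would likely use a rubric, an embedding similarity, or an LLM-as-judge score instead.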
Common variations
You can run A/B tests asynchronously or use streaming responses for faster feedback. You can also compare across providers, for example claude-3-5-sonnet-20241022 via Anthropic's SDK or gemini-1.5-pro via Google's SDK, though each requires its own client library.
import asyncio
import os
from openai import AsyncOpenAI

# openai>=1.0 has no acreate(); use the AsyncOpenAI client and await create().
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_response(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain the benefits of AI in healthcare.",
        "Describe how AI improves healthcare outcomes.",
    ]
    # Fire both requests concurrently and wait for all results.
    tasks = [get_response(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for i, text in enumerate(results, 1):
        print(f"Prompt Variant {i} Response:\n{text}\n{'-'*40}")

if __name__ == "__main__":
    asyncio.run(main())

Output
Prompt Variant 1 Response:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt Variant 2 Response:
AI improves healthcare outcomes by analyzing data to predict diseases early, optimizing treatments, and reducing errors.
----------------------------------------
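Because model outputs vary from run to run, a single response per variant can mislead; scoring each variant over several runs and comparing the averages is more robust. The scores below are placeholder numbers standing in for whatever metric you choose:

```python
from statistics import mean

# Placeholder scores from repeated runs of each variant
# (e.g., keyword-coverage or rubric scores per run).
scores = {
    "variant_1": [0.80, 0.90, 0.85],
    "variant_2": [0.60, 0.70, 0.65],
}
averages = {name: mean(vals) for name, vals in scores.items()}
winner = max(averages, key=averages.get)
print(f"Winner: {winner} (avg score {averages[winner]:.2f})")
```

For a serious comparison you would also want enough runs per variant to distinguish a real difference from noise, for example via a simple t-test on the two score lists.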
Troubleshooting
If you receive rate limit errors, add delays between requests, retry with exponential backoff, or reduce max_tokens to lower per-request token usage. For inconsistent outputs, set the seed parameter or use temperature=0; this makes responses much more consistent, though the API does not guarantee strict determinism.
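For the rate-limit case, a small retry helper with exponential backoff is a common pattern. This sketch retries any callable; in practice you would catch openai.RateLimitError specifically rather than bare Exception, and the helper name is ours, not part of the SDK:

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(); on failure, wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Usage (hypothetical):
# text = with_retries(lambda: client.chat.completions.create(...))
```

Note that the openai client also retries some failures itself (configurable via its max_retries setting), so a wrapper like this is mainly useful for coarser, longer backoffs.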
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0,
)

Key Takeaways
- Use the same model and environment to fairly compare prompt variants.
- Automate prompt testing with scripts to collect and analyze outputs efficiently.
- Control randomness with temperature=0 for consistent A/B test results.