How to A/B test prompts
Quick answer
To A/B test prompts, send multiple prompt variants to an LLM such as gpt-4o through the API, then collect and compare the outputs using metrics like relevance or accuracy. Automate the process by running requests in parallel and analyzing the results statistically to identify the best-performing prompt.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
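For example, on macOS or Linux you can export the key in your shell; the key value below is a placeholder, not a real credential.

```shell
# Store the key in an environment variable so it never appears in your code.
# Replace the placeholder with your actual key from the OpenAI dashboard.
export OPENAI_API_KEY="sk-your-key-here"
```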
pip install openai>=1.0
Step by step
This example sends two prompt variants to gpt-4o and compares their outputs side-by-side.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompts = [
    "Explain the benefits of AI in healthcare.",
    "Describe how AI improves healthcare outcomes."
]

# Send each prompt variant and collect the model's reply
responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    responses.append(text)

# Print the outputs side by side for comparison
for i, output in enumerate(responses, 1):
    print(f"Prompt {i} output:\n{output}\n{'-'*40}")
output
Prompt 1 output:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt 2 output:
AI improves healthcare outcomes through predictive analytics, automating routine tasks, and supporting clinical decisions.
----------------------------------------
Common variations
- Use asynchronous calls to speed up testing multiple prompts concurrently.
- Test with different models like claude-3-5-haiku-20241022 for comparison.
- Incorporate automated scoring metrics such as BLEU or ROUGE for quantitative evaluation.
import asyncio
import os

from openai import AsyncOpenAI

# Use the async client: in openai>=1.0 there is no acreate method --
# the AsyncOpenAI client's create method is awaitable instead.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_response(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Explain the benefits of AI in healthcare.",
        "Describe how AI improves healthcare outcomes."
    ]
    # Fire both requests concurrently and wait for all results
    tasks = [get_response(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for i, res in enumerate(results, 1):
        print(f"Prompt {i} output:\n{res}\n{'-'*40}")

asyncio.run(main())
output
Prompt 1 output:
AI enhances healthcare by enabling faster diagnosis, personalized treatment, and improved patient monitoring.
----------------------------------------
Prompt 2 output:
AI improves healthcare outcomes through predictive analytics, automating routine tasks, and supporting clinical decisions.
----------------------------------------
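To move beyond eyeballing outputs, you can attach a quantitative score to each one. Below is a minimal sketch using a simple keyword-coverage heuristic as a stand-in for metrics like BLEU or ROUGE (which require extra packages such as nltk or rouge-score); the keyword list and outputs here are illustrative, not real API responses.

```python
# Minimal scoring sketch: fraction of expected keywords present in each output.
# This is a crude proxy for relevance, not a substitute for BLEU/ROUGE.
def keyword_coverage(text, keywords):
    """Return the fraction of keywords that appear in text (case-insensitive)."""
    text_lower = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text_lower)
    return hits / len(keywords)

keywords = ["diagnosis", "treatment", "patient", "outcomes"]
outputs = [
    "AI enhances healthcare by enabling faster diagnosis and personalized treatment.",
    "AI improves healthcare outcomes through predictive analytics.",
]
scores = [keyword_coverage(o, keywords) for o in outputs]
best = max(range(len(scores)), key=lambda i: scores[i])
print(f"Scores: {scores}, best prompt: {best + 1}")
```

In a real pipeline you would feed the collected API responses into `scores` instead of hard-coded strings, and run each prompt many times so the comparison is statistical rather than anecdotal.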
Troubleshooting
- If you receive rate limit errors, reduce request frequency or upgrade your API plan.
- Empty or irrelevant outputs may indicate prompts are too vague; refine prompt clarity.
- Check environment variable setup if authentication fails.
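For the rate-limit case, a common pattern is retrying with exponential backoff. Below is a minimal, generic sketch; `with_backoff` is a hypothetical helper, not part of the openai package, and the delay values are illustrative.

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(); on a listed exception, wait base_delay * 2**attempt and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

With the OpenAI client you would wrap the call, e.g. `with_backoff(lambda: client.chat.completions.create(...), retry_on=(openai.RateLimitError,))`; `openai.RateLimitError` is the v1.x exception raised for HTTP 429 responses.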
Key takeaways
- Use parallel API calls to efficiently compare multiple prompt variants.
- Automate output evaluation with quantitative metrics for objective A/B testing.
- Test across different models to find the best prompt-model combination.