How to A/B test chatbot responses
Quick answer
Use A/B testing by generating multiple chatbot response variants with different prompts or models via the chat.completions.create API. Randomly assign users to variants, collect their feedback or engagement metrics, and analyze which variant performs better to optimize your chatbot.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable.
- Install the SDK: `pip install openai`
- Set the environment variable: `export OPENAI_API_KEY='your_api_key'` (Linux/macOS) or `setx OPENAI_API_KEY "your_api_key"` (Windows)

Step by step
This example demonstrates generating two chatbot response variants (A and B) using gpt-4o. It randomly assigns a user to a variant, sends the prompt, and prints the response. You can extend this by logging user feedback for analysis.
```python
import os
import random
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define two prompt variants
prompt_variants = {
    "A": "You are a friendly assistant. Answer briefly.",
    "B": "You are a detailed assistant. Provide thorough explanations."
}

# Simulate assigning a user to a variant randomly
def get_variant():
    return random.choice(["A", "B"])

variant = get_variant()
prompt = prompt_variants[variant]

messages = [
    {"role": "system", "content": prompt},
    {"role": "user", "content": "How do I reset my password?"}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

print(f"Variant: {variant}")
print("Response:", response.choices[0].message.content)
```

Output:

```
Variant: A
Response: To reset your password, go to the login page and click on "Forgot Password." Follow the instructions sent to your email.
```
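Printing a single response only shows the assignment working; to decide between variants you need to compare logged outcomes. Below is a minimal sketch of the analysis side, assuming you record one binary outcome per interaction (e.g., thumbs-up = 1). The `results` data and the `two_proportion_z` helper are invented for illustration; a stats library would also work.

```python
import math

# Hypothetical logged outcomes per variant: 1 = positive feedback, 0 = negative
results = {
    "A": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    "B": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
}

def two_proportion_z(outcomes_a, outcomes_b):
    """Two-proportion z-test for the difference in success rates."""
    n_a, n_b = len(outcomes_a), len(outcomes_b)
    p_a, p_b = sum(outcomes_a) / n_a, sum(outcomes_b) / n_b
    # Pooled proportion under the null hypothesis of equal rates
    p_pool = (sum(outcomes_a) + sum(outcomes_b)) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, z, p_value

p_a, p_b, z, p_value = two_proportion_z(results["A"], results["B"])
print(f"A rate: {p_a:.2f}, B rate: {p_b:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```

With only ten samples per arm the p-value will be large; in practice you would collect hundreds of interactions per variant before drawing a conclusion.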
Common variations
You can implement A/B testing with different models (e.g., gpt-4o-mini vs gpt-4o), asynchronous calls, or streaming responses for real-time feedback.
- Use async with `asyncio`, `AsyncOpenAI`, and `await client.chat.completions.create(...)`
- Stream responses by setting `stream=True` and iterating over chunks
- Test different prompt styles or system instructions

The example below combines async calls with streaming:
```python
import os
import random
import asyncio
from openai import AsyncOpenAI

# Use the async client so the request can be awaited
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def ab_test_stream():
    variant = random.choice(["A", "B"])
    prompt_variants = {
        "A": "You are a friendly assistant. Answer briefly.",
        "B": "You are a detailed assistant. Provide thorough explanations."
    }
    messages = [
        {"role": "system", "content": prompt_variants[variant]},
        {"role": "user", "content": "Explain how to reset my password."}
    ]
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True
    )
    print(f"Variant: {variant}")
    print("Response:", end=" ")
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

asyncio.run(ab_test_stream())
```

Output:

```
Variant: B
Response: To reset your password, first navigate to the login page. Click on the "Forgot Password" link, then enter your registered email address. You will receive an email with instructions to create a new password.
```
Troubleshooting
- If you get authentication errors, verify your `OPENAI_API_KEY` environment variable is set correctly.
- If responses are slow, consider using a smaller model like `gpt-4o-mini` for faster turnaround.
- Ensure your random assignment logic is unbiased to get statistically valid A/B test results.
- Log user feedback and engagement metrics externally for proper analysis.
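Per-request `random.choice` means a returning user can flip between variants mid-conversation. One common way to keep assignment both unbiased and sticky is to hash a stable user ID into a bucket. A minimal sketch, assuming a string user ID is available; `assign_variant` is a hypothetical helper, not part of the OpenAI SDK:

```python
import hashlib

def assign_variant(user_id: str, variants=("A", "B")) -> str:
    """Deterministically map a user ID to a variant bucket.

    The same user always lands in the same bucket, and SHA-256
    spreads IDs roughly evenly across buckets, keeping the
    assignment unbiased across the user population.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("user-42"))  # same variant on every call for this ID
```

You would then look up the system prompt with `prompt_variants[assign_variant(user_id)]` instead of drawing a fresh random variant per request.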
Key Takeaways
- Use the chat.completions.create API to generate multiple response variants for A/B testing.
- Randomly assign users to different variants and collect feedback or engagement data for analysis.
- Leverage async and streaming features for real-time or large-scale testing scenarios.