How-to · Beginner · 3 min read

How to A/B test chatbot responses

Quick answer
A/B test chatbot responses by generating variants with different prompts or models via the chat.completions.create API. Randomly assign each user to a variant, collect feedback or engagement metrics, and analyze which variant performs better to optimize your chatbot.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quotes keep the shell from treating > as a redirect)

Setup

Install the openai Python SDK and set your API key as an environment variable.

  • Install SDK: pip install openai
  • Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows; takes effect in new terminal sessions)
bash
pip install openai
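Before making API calls, you can confirm the variable is actually visible to Python. This quick sanity check is optional and not part of the SDK:

```python
import os

# True if the key was exported in the current shell session
configured = "OPENAI_API_KEY" in os.environ
print("API key set:", configured)
```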

Step by step

This example demonstrates generating two chatbot response variants (A and B) using gpt-4o. It randomly assigns a user to a variant, sends the prompt, and prints the response. You can extend this by logging user feedback for analysis.

python
import os
import random
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define two prompt variants
prompt_variants = {
    "A": "You are a friendly assistant. Answer briefly.",
    "B": "You are a detailed assistant. Provide thorough explanations."
}

# Simulate assigning user to variant randomly
def get_variant():
    return random.choice(["A", "B"])

variant = get_variant()
prompt = prompt_variants[variant]

messages = [
    {"role": "system", "content": prompt},
    {"role": "user", "content": "How do I reset my password?"}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

print(f"Variant: {variant}")
print("Response:", response.choices[0].message.content)
output
Variant: A
Response: To reset your password, go to the login page and click on "Forgot Password." Follow the instructions sent to your email.
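The walkthrough above mentions logging user feedback for analysis. Here is a minimal sketch of tallying per-variant feedback; the feedback list is hypothetical example data, and in practice you would persist records to a database or analytics tool:

```python
from collections import defaultdict

# Hypothetical feedback log: (variant, thumbs_up) pairs collected from users
feedback = [
    ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", True), ("B", True), ("B", False),
]

# Tally positive feedback and total impressions per variant
counts = defaultdict(lambda: {"up": 0, "total": 0})
for variant, thumbs_up in feedback:
    counts[variant]["total"] += 1
    if thumbs_up:
        counts[variant]["up"] += 1

for variant in sorted(counts):
    c = counts[variant]
    print(f"Variant {variant}: {c['up']}/{c['total']} positive ({c['up'] / c['total']:.0%})")
```

With real traffic, run a significance test (e.g., a two-proportion z-test) before declaring a winner; raw rates on small samples are noisy.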

Common variations

You can implement A/B testing with different models (e.g., gpt-4o-mini vs gpt-4o), asynchronous calls, or streaming responses for real-time feedback.

  • Use async with asyncio and await client.chat.completions.create(...)
  • Stream responses by setting stream=True and iterating over chunks
  • Test different prompt styles or system instructions
python
import os
import random
import asyncio
from openai import AsyncOpenAI

# AsyncOpenAI is required to await create() and iterate the stream with async for
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def ab_test_stream():
    variant = random.choice(["A", "B"])
    prompt_variants = {
        "A": "You are a friendly assistant. Answer briefly.",
        "B": "You are a detailed assistant. Provide thorough explanations."
    }
    messages = [
        {"role": "system", "content": prompt_variants[variant]},
        {"role": "user", "content": "Explain how to reset my password."}
    ]

    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True
    )

    print(f"Variant: {variant}")
    print("Response:", end=" ")
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

asyncio.run(ab_test_stream())
output
Variant: B
Response: To reset your password, first navigate to the login page. Click on the "Forgot Password" link, then enter your registered email address. You will receive an email with instructions to create a new password.
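To A/B test models rather than prompts, map each variant to a model and build the request from that mapping. This is a sketch; model_variants and build_request are illustrative names, not part of the SDK:

```python
import random

# Each variant uses the same prompt but a different model
model_variants = {"A": "gpt-4o-mini", "B": "gpt-4o"}

def build_request(variant: str, user_message: str) -> dict:
    """Build the kwargs for client.chat.completions.create for a given variant."""
    return {
        "model": model_variants[variant],
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }

variant = random.choice(list(model_variants))
request = build_request(variant, "How do I reset my password?")
# response = client.chat.completions.create(**request)
```

Keeping the prompt fixed isolates the model as the only variable, so any difference in feedback can be attributed to the model choice.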

Troubleshooting

  • If you get authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • If responses are slow, consider using smaller models like gpt-4o-mini for faster turnaround.
  • Ensure your random assignment logic is unbiased to get statistically valid A/B test results.
  • Log user feedback and engagement metrics externally for proper analysis.
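For the unbiased-assignment point above, a common alternative to random.choice is deterministic hash-based bucketing, which keeps each user in the same variant across sessions while staying roughly uniform. A sketch (assign_variant is an illustrative helper):

```python
import hashlib

def assign_variant(user_id: str, variants=("A", "B")) -> str:
    """Deterministically map a user ID to a variant.

    Hashing the ID makes assignment stable (the same user always sees the
    same variant) and spreads users roughly evenly across variants.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user always lands in the same bucket
print(assign_variant("user-123"), assign_variant("user-123"))
```

Stable bucketing also prevents a user's experience from flickering between variants mid-conversation, which would contaminate engagement metrics.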

Key takeaways

  • Use the chat.completions.create API to generate multiple response variants for A/B testing.
  • Randomly assign users to different variants and collect feedback or engagement data for analysis.
  • Leverage async and streaming features for real-time or large-scale testing scenarios.
Verified 2026-04 · gpt-4o, gpt-4o-mini