How to A/B test LLM providers
Quick answer
Use Python to send identical prompts to different LLM providers via their APIs (e.g., OpenAI, Anthropic, Google Vertex AI). Collect and compare responses on metrics like relevance, latency, and cost to determine the best provider for your use case.
Prerequisites
- Python 3.8+
- API keys for LLM providers (e.g., OpenAI, Anthropic, Google Vertex AI)
- pip install openai anthropic vertexai
Setup
Install required Python SDKs and set environment variables for API keys. This example uses OpenAI, Anthropic, and Google Vertex AI clients.
pip install openai anthropic vertexai
output
Collecting openai
Collecting anthropic
Collecting vertexai
Successfully installed openai anthropic vertexai
Step by step
Send the same prompt to multiple LLM providers, measure response time, and print outputs for side-by-side comparison.
import os
import time

from openai import OpenAI
import anthropic
import vertexai
from vertexai.generative_models import GenerativeModel

# Initialize clients
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
gemini_model = GenerativeModel("gemini-2.0-flash")

prompt = "Explain the benefits of A/B testing LLM providers."

# OpenAI request
start = time.time()
response_openai = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
openai_time = time.time() - start
openai_text = response_openai.choices[0].message.content

# Anthropic request
start = time.time()
response_anthropic = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": prompt}],
)
anthropic_time = time.time() - start
anthropic_text = response_anthropic.content[0].text

# Google Vertex AI request
start = time.time()
response_vertex = gemini_model.generate_content(prompt)
vertex_time = time.time() - start
vertex_text = response_vertex.text

# Print results
print(f"OpenAI (gpt-4o-mini) response in {openai_time:.2f}s:\n{openai_text}\n")
print(f"Anthropic (claude-3-5-sonnet) response in {anthropic_time:.2f}s:\n{anthropic_text}\n")
print(f"Google Vertex AI (gemini-2.0-flash) response in {vertex_time:.2f}s:\n{vertex_text}\n")
output
OpenAI (gpt-4o-mini) response in 1.23s:
A/B testing LLM providers helps identify the best model for your needs by comparing output quality, latency, and cost.

Anthropic (claude-3-5-sonnet) response in 1.45s:
A/B testing allows you to evaluate different LLMs side-by-side to optimize performance and reduce expenses.

Google Vertex AI (gemini-2.0-flash) response in 1.10s:
By A/B testing LLM providers, you can select the most effective model based on accuracy, speed, and pricing.
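To turn the printed responses into a quick side-by-side comparison, a small helper can tabulate latency and response length per provider. The function below (`summarize_results` is a hypothetical name, not part of any SDK) takes a dict of the values collected above:

```python
def summarize_results(results):
    """results: dict mapping provider name -> (latency_seconds, response_text)."""
    rows = sorted(results.items(), key=lambda kv: kv[1][0])  # fastest first
    lines = [f"{'provider':<24} {'latency':>8}  {'chars':>6}"]
    for provider, (latency, text) in rows:
        # Response length is only a rough proxy for verbosity, not quality
        lines.append(f"{provider:<24} {latency:>7.2f}s  {len(text):>6}")
    return "\n".join(lines)

print(summarize_results({
    "openai/gpt-4o-mini": (1.23, "A/B testing LLM providers helps..."),
    "anthropic/claude-3-5": (1.45, "A/B testing allows you to..."),
    "vertex/gemini-2.0-flash": (1.10, "By A/B testing LLM providers..."),
}))
```

For output quality you would still score the texts separately (human review or an LLM-as-judge rubric); this table only covers the mechanical metrics.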
Common variations
- Use async calls for parallel requests to reduce total latency.
- Test different models within the same provider (e.g., gpt-4o-mini vs gpt-4o).
- Include cost tracking by logging token usage and pricing per provider.
import asyncio
import os

from openai import OpenAI
import anthropic
import vertexai
from vertexai.generative_models import GenerativeModel

def fetch_openai(client, prompt):
    # Blocking SDK call; run in a worker thread via asyncio.to_thread
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def fetch_anthropic(client, prompt):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def fetch_vertex(model, prompt):
    return model.generate_content(prompt).text

async def main():
    openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
    gemini_model = GenerativeModel("gemini-2.0-flash")

    prompt = "Explain the benefits of A/B testing LLM providers."

    # Run all three blocking calls concurrently in worker threads
    openai_text, anthropic_text, vertex_text = await asyncio.gather(
        asyncio.to_thread(fetch_openai, openai_client, prompt),
        asyncio.to_thread(fetch_anthropic, anthropic_client, prompt),
        asyncio.to_thread(fetch_vertex, gemini_model, prompt),
    )

    print("OpenAI response:", openai_text)
    print("Anthropic response:", anthropic_text)
    print("Google Vertex AI response:", vertex_text)

asyncio.run(main())
output
OpenAI response: A/B testing LLM providers helps identify the best model for your needs by comparing output quality, latency, and cost.
Anthropic response: A/B testing allows you to evaluate different LLMs side-by-side to optimize performance and reduce expenses.
Google Vertex AI response: By A/B testing LLM providers, you can select the most effective model based on accuracy, speed, and pricing.
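For the cost-tracking variation, both the OpenAI and Anthropic responses expose token counts (response.usage), which you can multiply by published per-token prices. A minimal sketch, with placeholder prices (check each provider's pricing page for current rates) and a hypothetical helper name:

```python
# (input, output) USD per 1M tokens -- illustrative placeholders only
PRICE_PER_MILLION = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet": (3.00, 15.00),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate request cost in USD from token counts and a price table."""
    in_price, out_price = PRICE_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: 1,000 prompt tokens and 500 completion tokens on gpt-4o-mini
print(f"${estimate_cost('gpt-4o-mini', 1000, 500):.6f}")  # $0.000450
```

In the scripts above you would feed it response_openai.usage.prompt_tokens / completion_tokens and response_anthropic.usage.input_tokens / output_tokens respectively.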
Troubleshooting
- If API calls fail, verify your API keys and environment variables are correctly set.
- Check network connectivity and provider status pages for outages.
- For rate limits, implement exponential backoff retries.
- Ensure SDK versions are up to date to avoid deprecated method errors.
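The exponential-backoff suggestion above can be sketched as a small wrapper; in real use you would catch the SDK-specific rate-limit exception (e.g., openai.RateLimitError) rather than a bare Exception:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage with the earlier helpers, e.g.:
# text = with_backoff(lambda: fetch_openai(openai_client, prompt))
```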
Key takeaways
- Use consistent prompts and metrics to fairly compare LLM providers.
- Measure latency, output quality, and cost to determine the best provider for your use case.
- Leverage async calls to speed up parallel testing across providers.
- Track token usage and pricing to evaluate cost-effectiveness.
- Keep SDKs updated and handle API errors gracefully for reliable testing.