Code Beginner easy · 6 min

A/B testing methodology

What you will learn

A/B testing compares two fine-tuned model variants on the same evaluation set to determine which performs better for your specific use case.

Why this matters

After fine-tuning an LLM, you need an objective way to pick between competing configurations (learning rates, number of epochs, LoRA ranks) rather than guessing. A/B testing prevents shipping an inferior model to production.

Skip if: Don't use A/B testing when you have a single model and no alternatives to compare, or when your evaluation metric is undefined (you must know what 'better' means before testing). Also skip it when training two model variants is prohibitively expensive. Cross-validation is not a cheaper substitute, since it trains one model per fold; a more economical option is to evaluate a single configuration against an existing baseline on a held-out set.

Explanation

What it is: A/B testing means training two different model variants (A and B) with different hyperparameters or configurations, then evaluating both on an identical test set to see which one produces better results. The winner is determined by a pre-defined metric (accuracy, BLEU score, human preference rating).

How it works: You define your metric upfront, train model A with config A (e.g., learning_rate=5e-5), train model B with config B (e.g., learning_rate=2e-5), run both on the same held-out test set, collect per-example scores, and compare them statistically. The model with the higher score is the candidate to deploy, provided the gap is larger than what noise (random seeds, sampling settings, a small test set) could explain. Because both models see identical inputs, a difference in scores cannot come from a difference in the data, which leaves the configuration and run-to-run noise as the remaining explanations.
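The "compare statistically" step can be sketched as a paired bootstrap over per-example scores. This is one common choice rather than the required method; the function name `paired_bootstrap_diff`, the toy score lists, and the 95% interval are assumptions made for illustration.

```python
import random
from statistics import mean

def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference B - A, lower, upper) for a 95% bootstrap interval.

    scores_a[i] and scores_b[i] must be the scores of models A and B
    on the same test example i.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-example differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int(0.025 * n_resamples)]
    upper = resampled_means[int(0.975 * n_resamples) - 1]
    return mean(diffs), lower, upper

# Hypothetical per-example correctness (1 = correct, 0 = wrong) on 10 shared examples.
scores_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
scores_b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
diff, lower, upper = paired_bootstrap_diff(scores_a, scores_b)
print(f"mean difference (B - A): {diff:.2f}")
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")
```

If the interval includes 0, the test set does not distinguish the two models, and a larger test set is usually the next step.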

When to use: Use A/B testing whenever you're uncertain which hyperparameter choice is best, or when you're experimenting with new techniques. It's especially valuable in fine-tuning because small configuration changes can have large downstream effects on model behavior.

Analogy

A/B testing is like having two chefs cook the same recipe with different heat settings, then serving both dishes to the same diners. Because the diners are the same, their feedback tells you which heat setting produces better food: any difference comes from the cooking method, not from who was tasting.

Code

Illustrative only: the scores are hard-coded in place of a real model call. The script runs as written and needs no API key.

python

from dataclasses import dataclass
from typing import Callable

@dataclass
class ABTestResult:
    model_a_score: float
    model_b_score: float
    winner: str
    improvement_percent: float

def evaluate_model_on_dataset(model_name: str, test_samples: list[dict]) -> float:
    """
    Simulate evaluation of a fine-tuned model on test samples.
    In reality, this would run your model on the test set and compute metrics.
    """
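    # Hard-coded scores stand in for a real evaluation run.
    scores = {
        'model_a': 0.78,
        'model_b': 0.82
    }
    # Unknown model names fall back to 0.0 instead of raising.
    return scores.get(model_name, 0.0)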

def run_ab_test(
    model_a_name: str,
    model_b_name: str,
    test_dataset: list[dict],
    metric_fn: Callable[[str, list[dict]], float]
) -> ABTestResult:
    """
    Run an A/B test comparing two model variants.
    Returns the winner and improvement percentage.
    """
    score_a = metric_fn(model_a_name, test_dataset)
    score_b = metric_fn(model_b_name, test_dataset)
    
    if score_a > score_b:
        winner = 'A'
        improvement = ((score_a - score_b) / score_b) * 100
    elif score_b > score_a:
        winner = 'B'
        improvement = ((score_b - score_a) / score_a) * 100
    else:
        winner = 'TIE'
        improvement = 0.0
    
    return ABTestResult(
        model_a_score=score_a,
        model_b_score=score_b,
        winner=winner,
        improvement_percent=improvement
    )

test_samples = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Translate hello to Spanish', 'expected': 'hola'},
    {'input': 'Summarize: The sky is blue', 'expected': 'Sky is blue'},
]

result = run_ab_test(
    model_a_name='model_a',
    model_b_name='model_b',
    test_dataset=test_samples,
    metric_fn=evaluate_model_on_dataset
)

print(f'Model A score: {result.model_a_score}')
print(f'Model B score: {result.model_b_score}')
print(f'Winner: Model {result.winner}')
print(f'Improvement: {result.improvement_percent:.2f}%')

Output

Model A score: 0.78
Model B score: 0.82
Winner: Model B
Improvement: 5.13%

What just happened?

The code defined a function that simulates model evaluation by returning hard-coded scores (in production you'd call your actual fine-tuned model and compute a real metric such as accuracy). run_ab_test then requested a score for Model A and Model B on the same test dataset (0.78 and 0.82), determined that Model B won, and calculated the relative improvement: (0.82 - 0.78) / 0.78 ≈ 5.13%. With real scores, a gap like this would suggest that Model B's configuration is better, but a three-example test set is far too small to confirm it on its own.
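To replace the hard-coded scores with a real metric, the evaluation function can call the model on each sample and count exact matches. The following is a minimal sketch, assuming each model is exposed as a plain Python callable from prompt to text; `exact_match_accuracy`, `fake_model_a`, and `fake_model_b` are hypothetical names with canned answers, not part of any library.

```python
from typing import Callable

def exact_match_accuracy(predict: Callable[[str], str], test_samples: list[dict]) -> float:
    """Fraction of samples whose prediction equals the expected answer
    after trimming whitespace and lowercasing."""
    if not test_samples:
        raise ValueError("test_samples must not be empty")
    correct = sum(
        predict(s["input"]).strip().lower() == s["expected"].strip().lower()
        for s in test_samples
    )
    return correct / len(test_samples)

# Stand-ins for two fine-tuned models (hypothetical canned answers).
def fake_model_a(prompt: str) -> str:
    return {"What is 2+2?": "4"}.get(prompt, "")

def fake_model_b(prompt: str) -> str:
    return {"What is 2+2?": "4", "Translate hello to Spanish": "Hola"}.get(prompt, "")

test_samples = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "Translate hello to Spanish", "expected": "hola"},
    {"input": "Summarize: The sky is blue", "expected": "Sky is blue"},
]

acc_a = exact_match_accuracy(fake_model_a, test_samples)
acc_b = exact_match_accuracy(fake_model_b, test_samples)
print(f"A: {acc_a:.2f}, B: {acc_b:.2f}")
```

Exact match suits short factual answers; for open-ended tasks such as summarization it is usually too strict, and a task-appropriate metric or human rating would take its place.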

Common gotcha

The most common mistake is reusing your training set as your test set for the A/B comparison. The scores then mostly measure how well each model fits data it has already seen, not how it generalizes. Use a held-out validation or test set that neither model saw during fine-tuning. A second mistake applies when latency or throughput is part of your metric: comparing models on different hardware or under different load conditions. If Model A runs on CPU and Model B on GPU, the hardware difference is confounded with the model difference and the speed numbers are not comparable.
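One way to keep a test set held out is to assign each example to a split deterministically from a stable ID, then check for overlap before running the comparison. This is a sketch under stated assumptions: the `assign_split` helper, the `id` field, and the 20% test fraction are all illustrative choices.

```python
import hashlib

def assign_split(example_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign an example to 'train' or 'test' from its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical dataset with stable IDs.
examples = [{"id": f"ex-{i}", "input": f"prompt {i}"} for i in range(1000)]
train = [e for e in examples if assign_split(e["id"]) == "train"]
test = [e for e in examples if assign_split(e["id"]) == "test"]

# Guard against leakage before running the comparison.
overlap = {e["id"] for e in train} & {e["id"] for e in test}
assert not overlap, f"{len(overlap)} examples appear in both splits"
print(len(train), len(test))
```

Because the split depends only on the ID, both models are fine-tuned on the same training examples and later scored on the same unseen ones, even if the dataset is reloaded or reordered.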

Error recovery

KeyError on model name

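This occurs when an evaluation function indexes its scores dictionary directly (scores[model_name]) with a name that is not a key. The sample above uses scores.get(model_name, 0.0) instead, so an unknown name does not raise; it silently returns 0.0, which can then surface as the ZeroDivisionError below. In both cases, ensure the model_a_name and model_b_name arguments exactly match the keys in your evaluation function.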
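If you would rather fail loudly on a mistyped model name than receive a silent 0.0, the lookup can be made explicit. A minimal sketch, assuming the same simulated scores as the sample; `evaluate_strict` and `SCORES` are hypothetical names.

```python
SCORES = {"model_a": 0.78, "model_b": 0.82}  # same simulated scores as the sample

def evaluate_strict(model_name: str, test_samples: list[dict]) -> float:
    """Like the sample's evaluation function, but reject unknown model names."""
    if model_name not in SCORES:
        known = ", ".join(sorted(SCORES))
        raise KeyError(f"unknown model {model_name!r}; known models: {known}")
    return SCORES[model_name]

try:
    evaluate_strict("model_c", [])
except KeyError as err:
    print(f"caught: {err}")
```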

ZeroDivisionError in improvement calculation

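This happens when the losing model's score is 0.0 and the other score is positive, because the improvement is computed relative to the lower score. (If both scores are 0.0, the code reports a tie and never divides.) A score of exactly 0.0 usually means the model produced completely invalid output or the model name was not recognized by the evaluation function. Also check that your evaluation function returns a number: returning None raises a TypeError at the comparison instead.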
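A guarded version of the improvement calculation might look like the sketch below. The helper name `relative_improvement_percent` and the choice to return None for an undefined ratio are assumptions; raising an error would be an equally reasonable design.

```python
from typing import Optional

def relative_improvement_percent(winner_score: float, loser_score: float) -> Optional[float]:
    """Percent improvement of the winner over the loser, or None when the
    loser scored 0 and a relative change is undefined."""
    if loser_score == 0:
        return None
    return (winner_score - loser_score) / loser_score * 100

normal = relative_improvement_percent(0.82, 0.78)
undefined = relative_improvement_percent(0.82, 0.0)
print(f"{normal:.2f}%")  # 5.13%
print(undefined)         # None
```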

Experienced dev note

A/B tests are only meaningful if your metric matches what you actually care about in production. If you optimize for BLEU score but users care about fluency, you'll pick the wrong model. Spend 30 minutes defining your metric upfront (often this means collecting human feedback on a small set of outputs); it can save weeks lost to shipping the wrong model. Also, always record which configuration is A and which is B before running the test, because after the fact it's easy to forget which hyperparameter config produced which score.
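Recording the A/B mapping can be as simple as writing a small JSON manifest before the test starts. The sketch below assumes a hypothetical `write_manifest` helper and file name, and reuses the two example learning rates from the explanation above.

```python
import json
import tempfile
from pathlib import Path

def write_manifest(path: Path, config_a: dict, config_b: dict, metric: str) -> None:
    """Persist which configuration is A, which is B, and the chosen metric."""
    manifest = {"metric": metric, "A": config_a, "B": config_b}
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")

# Hypothetical location; in practice, store it next to the evaluation results.
manifest_path = Path(tempfile.gettempdir()) / "ab_test_manifest.json"
write_manifest(
    manifest_path,
    config_a={"learning_rate": 5e-5},
    config_b={"learning_rate": 2e-5},
    metric="exact_match_accuracy",
)

loaded = json.loads(manifest_path.read_text(encoding="utf-8"))
print(loaded["A"], loaded["B"])
```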

Check your understanding

Why is it essential to evaluate both models on the same test set rather than different test sets? What could go wrong if you used a different test set for each model?

Answer hint

A correct answer explains that different test sets may have different difficulty distributions or characteristics, which would make the scores incomparable: one model might score higher just because its test set was easier, not because the model is actually better. The test set must be identical so differences in scores reflect differences in model quality, not test set properties.

VERSION This methodology is stable across transformers 5.5.x and trl 1.x. There are no breaking changes to evaluation patterns in these versions.

Once you understand how to compare two models, learn about setting up your evaluation metric: defining exactly what you'll measure (accuracy, loss, custom scoring functions) before you start fine-tuning.
