A/B testing methodology
Why this matters
After fine-tuning an LLM, you need an objective way to pick between competing configurations (learning rates, number of epochs, LoRA ranks) rather than guessing. A/B testing prevents shipping an inferior model to production.
Explanation
What it is: A/B testing means training two different model variants (A and B) with different hyperparameters or configurations, then evaluating both on an identical test set to see which one produces better results. The winner is determined by a pre-defined metric (accuracy, BLEU score, human preference rating).
How it works: You define your metric upfront, train model A with config A (e.g., learning_rate=5e-5), train model B with config B (e.g., learning_rate=2e-5), run both on your validation set, collect scores, and compare statistically. The model with the higher score wins: that's the one you deploy. The key insight is that both models see identical inputs, so differences in output quality are due to the configuration, not randomness.
When to use: Use A/B testing whenever you're uncertain which hyperparameter choice is best, or when you're experimenting with new techniques. It's especially valuable in fine-tuning because small configuration changes can have large downstream effects on model behavior.
Analogy
A/B testing is like having two chefs cook the same recipe with different heat settings, then serving both dishes to the same diners. The diners' consistent feedback tells you which heat setting produces better food: not because the diners are different, but because the cooking method was different.
Code
import json
from dataclasses import dataclass
from typing import Callable
@dataclass
class ABTestResult:
model_a_score: float
model_b_score: float
winner: str
improvement_percent: float
def evaluate_model_on_dataset(model_name: str, test_samples: list[dict]) -> float:
"""
Simulate evaluation of a fine-tuned model on test samples.
In reality, this would run your model on the test set and compute metrics.
"""
scores = {
'model_a': 0.78,
'model_b': 0.82
}
return scores.get(model_name, 0.0)
def run_ab_test(
model_a_name: str,
model_b_name: str,
test_dataset: list[dict],
metric_fn: Callable[[str, list[dict]], float]
) -> ABTestResult:
"""
Run an A/B test comparing two model variants.
Returns the winner and improvement percentage.
"""
score_a = metric_fn(model_a_name, test_dataset)
score_b = metric_fn(model_b_name, test_dataset)
if score_a > score_b:
winner = 'A'
improvement = ((score_a - score_b) / score_b) * 100
elif score_b > score_a:
winner = 'B'
improvement = ((score_b - score_a) / score_a) * 100
else:
winner = 'TIE'
improvement = 0.0
return ABTestResult(
model_a_score=score_a,
model_b_score=score_b,
winner=winner,
improvement_percent=improvement
)
test_samples = [
{'input': 'What is 2+2?', 'expected': '4'},
{'input': 'Translate hello to Spanish', 'expected': 'hola'},
{'input': 'Summarize: The sky is blue', 'expected': 'Sky is blue'},
]
result = run_ab_test(
model_a_name='model_a',
model_b_name='model_b',
test_dataset=test_samples,
metric_fn=evaluate_model_on_dataset
)
print(f'Model A score: {result.model_a_score}')
print(f'Model B score: {result.model_b_score}')
print(f'Winner: Model {result.winner}')
print(f'Improvement: {result.improvement_percent:.2f}%') Model A score: 0.78 Model B score: 0.82 Winner: Model B Improvement: 5.13%
What just happened?
The code defined a function to simulate model evaluation (in production you'd call your actual fine-tuned model and compute a real metric like accuracy). It then ran both Model A and Model B on the same test dataset, extracted their scores (0.78 and 0.82), determined that Model B won, and calculated the percentage improvement (5.13%). The test confirmed that the configuration change in Model B made it measurably better.
Common gotcha
The most common mistake is reusing your training set as your test set for A/B comparison. This tells you nothing: both models have memorized that data. You must use a held-out validation or test set that neither model saw during fine-tuning. A second mistake: comparing models on different hardware or under different load conditions. If Model A runs on CPU and Model B on GPU, latency differences will confound quality differences.
Error recovery
KeyError on model nameZeroDivisionError in improvement calculationExperienced dev note
A/B tests are only meaningful if your metric matches what you actually care about in production. If you optimize for BLEU score but users care about fluency, you'll pick the wrong model. Spend 30 minutes defining your metric upfront (often this means collecting human feedback on a small set of outputs): it saves weeks of shipping the wrong model. Also: always log which model is A and which is B before running the test, because after the fact it's easy to forget which hyperparameter config produced which score.
Check your understanding
Why is it essential to evaluate both models on the same test set rather than different test sets? What could go wrong if you used a different test set for each model?
Show answer hint
A correct answer explains that different test sets may have different difficulty distributions or characteristics, which would make the scores incomparable: one model might score higher just because its test set was easier, not because the model is actually better. The test set must be identical so differences in scores reflect differences in model quality, not test set properties.