Evaluation suite design
Why this matters
Fine-tuning is supposed to make your model perform better, but better at what? Without evaluation metrics, you're flying blind: you won't know if your model learned what you wanted or just memorized the training data. A proper evaluation suite catches overfitting and gives you evidence that your model actually works before you ship it.
Explanation
An evaluation suite is a set of test examples and metrics that measure whether your fine-tuned model does what you trained it to do. It's separate from your training data. Because your model has never seen these examples before, the suite tells you how well the model generalizes to new inputs.
Mechanically, you: (1) prepare a held-out test dataset, (2) run your fine-tuned model on it, (3) compare the model's output to the correct answer, and (4) calculate a metric such as accuracy, BLEU, F1, or a custom score. The metric tells you whether your fine-tuning actually worked. Without this, you only know that your model fit the training set, not whether it's useful.
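Steps (3) and (4) can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `accuracy` and `f1_for_label` and the sample labels are made up for this example, and it assumes each model output has already been reduced to a single label string.

```python
def accuracy(expected, predicted):
    """Fraction of positions where the prediction equals the expected label."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and the same length")
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


def f1_for_label(expected, predicted, label):
    """F1 for one label: the harmonic mean of precision and recall."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == label and p == label for e, p in pairs)  # true positives
    fp = sum(e != label and p == label for e, p in pairs)  # false positives
    fn = sum(e == label and p != label for e, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only.
expected = ["positive", "negative", "neutral", "negative"]
predicted = ["positive", "negative", "negative", "negative"]

acc = accuracy(expected, predicted)
f1_negative = f1_for_label(expected, predicted, "negative")
print(f"accuracy={acc:.2f} f1(negative)={f1_negative:.2f}")
```

Accuracy treats every mistake the same way, while per-label F1 shows which class the model over-predicts or misses, which is one reason to track more than one metric.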
For beginners, start with one simple metric on 20–50 hand-picked test examples. As you grow confident, add more examples and multiple metrics to catch different kinds of failures.
Analogy
Think of fine-tuning like coaching a sports team. Training is practice; your team gets good at practice drills. But you need a scrimmage (evaluation) against a team that's not in your practice squad to know if the coaching actually worked. If your team only plays practice buddies, you'll never know if they're actually championship-ready.
Code
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Base (not fine-tuned) checkpoint, used here as the baseline.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Held-out test cases: a prompt plus the label expected in the answer.
test_suite = [
    {
        "input": "Classify the sentiment: 'I love this product'",
        "expected": "positive"
    },
    {
        "input": "Classify the sentiment: 'This is terrible'",
        "expected": "negative"
    },
    {
        "input": "Classify the sentiment: 'It is okay'",
        "expected": "neutral"
    }
]


def evaluate_model(model, tokenizer, test_suite):
    """Run every test case through the model and score it by substring match."""
    correct = 0
    results = []
    for test_case in test_suite:
        prompt = test_case["input"]
        expected = test_case["expected"].lower()
        inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=10,
                temperature=0.1  # only has an effect when sampling is enabled
            )
        # outputs[0] holds the prompt tokens followed by the new tokens.
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Score only the newly generated tokens, so a word that happens to
        # appear in the prompt can never count as a correct answer.
        completion = tokenizer.decode(
            outputs[0][inputs.shape[-1]:], skip_special_tokens=True
        )
        is_correct = expected in completion.lower()
        if is_correct:
            correct += 1
        results.append({
            "input": prompt,
            "expected": expected,
            "model_output": response[:100],
            "correct": is_correct
        })
    accuracy = correct / len(test_suite)
    return {
        "accuracy": accuracy,
        "correct": correct,
        "total": len(test_suite),
        "results": results
    }


eval_results = evaluate_model(model, tokenizer, test_suite)
print(f"Accuracy: {eval_results['accuracy']:.1%}")
print(f"Correct: {eval_results['correct']}/{eval_results['total']}")
for i, result in enumerate(eval_results["results"], 1):
    status = "✓" if result["correct"] else "✗"
    print(f"{i}. {status} Expected: {result['expected']} | Got: {result['model_output'][:50]}")

Output
Accuracy: 33.3%
Correct: 1/3
1. ✓ Expected: positive | Got: Classify the sentiment: 'I love this product'
2. ✗ Expected: negative | Got: Classify the sentiment: 'This is terrible'
3. ✗ Expected: neutral | Got: Classify the sentiment: 'It is okay'
What just happened?
The code loaded a base Llama-2 model (not fine-tuned) and ran it on three sentiment classification test cases. For each test, it generated up to 10 new tokens and checked whether the expected sentiment word appeared in the response. Only 1 of 3 passed. This base model isn't trained for sentiment tasks, so the accuracy is low. Note that the decoded output starts with the prompt itself, which is why each "Got" entry begins with the prompt text. This is what your evaluation suite will measure: whether your fine-tuned model does better than this baseline.
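A plain substring check is forgiving: a response that lists every option ("positive, negative, or neutral") contains whichever label you expected. One possible tightening, sketched below, is to pull a single label out of the response and compare that instead. The helper `extract_label` and the constant `LABELS` are hypothetical names introduced for this example, and taking the first label that appears as a whole word is only one of several reasonable rules (it still ignores negation, for instance).

```python
import re

# Assumed label set for the sentiment task in this lesson.
LABELS = ("positive", "negative", "neutral")


def extract_label(completion, labels=LABELS):
    """Return the label that appears earliest as a whole word, or None."""
    text = completion.lower()
    best = None  # (position, label) of the earliest match so far
    for label in labels:
        match = re.search(rf"\b{re.escape(label)}\b", text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), label)
    return best[1] if best else None


print(extract_label("Negative. The reviewer is unhappy."))  # negative
print(extract_label("Sentiment: neutral (not negative)"))   # neutral
print(extract_label("I cannot tell"))                       # None
```

With a helper like this, a test case counts as correct only when the extracted label equals the expected one, and a `None` result can be reported separately as "no answer".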
Common gotcha
The most common mistake is making your test examples too similar to your training data, or worse, accidentally including training examples in your test set. When you do this, your accuracy scores look great (90%+) but your model fails on real new inputs. Always: (1) create test examples before you see the training data, or (2) explicitly hold out a random 10–15% of your data before training starts. Never evaluate on data your model trained on.
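Option (2) and the leakage check can be sketched with the standard library alone. The helper names (`split_holdout`, `normalize`, `find_leaks`) and the toy examples are assumptions for illustration; the records follow the same "input"/"expected" shape as the test suite in the code above. Exact matching after normalization only catches verbatim duplicates, not paraphrases.

```python
import random


def split_holdout(examples, test_fraction=0.1, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leaks(train, test):
    """Return test examples whose normalized input also appears in train."""
    train_inputs = {normalize(ex["input"]) for ex in train}
    return [ex for ex in test if normalize(ex["input"]) in train_inputs]


# Toy data, for illustration only.
examples = [{"input": f"Review number {i}", "expected": "neutral"} for i in range(20)]
train, test = split_holdout(examples, test_fraction=0.15, seed=42)
leaks = find_leaks(train, test)
print(f"train={len(train)} test={len(test)} leaks={len(leaks)}")
```

Running the split once, before any training, and saving both halves to disk avoids re-splitting with a different seed later and silently moving training examples into the test set.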
Error recovery
OutOfMemoryError
KeyError: 'input'
AttributeError: 'NoneType' has no attribute 'device'
Experienced dev note
You'll be tempted to skip evaluation and just look at training loss. Don't. Falling training loss only shows that the model is fitting the training data, which may be plain memorization. It tells you almost nothing about whether the model works on unseen data. A model with near-zero training loss is worthless if it can't solve the task. Spend 10% of fine-tuning time building evaluation, 90% on the actual fine-tuning. Also: save your test suite in version control, separate from your training code. Future you will thank you when you need to compare fine-tuned models or debug why performance regressed.
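One way to make the test suite easy to version and to spot regressions between two models is sketched below. The function names (`save_suite`, `load_suite`, `find_regressions`) and the file name `test_suite.json` are assumptions for this example. The result records are assumed to carry the "input" and "correct" keys that the evaluation function above returns.

```python
import json
import tempfile
from pathlib import Path


def save_suite(test_suite, path):
    """Write the suite as pretty-printed JSON so diffs stay readable."""
    text = json.dumps(test_suite, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_suite(path):
    """Read a suite previously written by save_suite."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_regressions(old_results, new_results):
    """Inputs the old run answered correctly and the new run got wrong."""
    old_correct = {r["input"] for r in old_results if r["correct"]}
    return [
        r["input"]
        for r in new_results
        if r["input"] in old_correct and not r["correct"]
    ]


suite = [{"input": "Classify the sentiment: 'I love this product'", "expected": "positive"}]
with tempfile.TemporaryDirectory() as tmp:
    suite_path = Path(tmp) / "test_suite.json"  # hypothetical file name
    save_suite(suite, suite_path)
    reloaded = load_suite(suite_path)

# Toy results from two hypothetical evaluation runs.
old_run = [{"input": "a", "correct": True}, {"input": "b", "correct": False}]
new_run = [{"input": "a", "correct": False}, {"input": "b", "correct": True}]
regressions = find_regressions(old_run, new_run)
print(regressions)
```

Here both runs score 1 of 2, so the overall accuracy is unchanged even though one previously correct case now fails. Comparing per-example results catches that where a single number would not.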
Check your understanding
If your evaluation suite shows 95% accuracy on test data but your model performs poorly on real user inputs in production, what are two possible causes, and how would you distinguish between them?
Answer hint
A correct answer identifies: (1) distribution mismatch, where your test data doesn't reflect real usage patterns, and (2) data leakage, where your test set accidentally overlaps with training data. To distinguish them, compare the linguistic properties of test vs. real inputs (length, vocabulary, domain), and audit your data pipeline to confirm the train/test split was done correctly. The two problems can occur at the same time, so check for both.
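The comparison of test inputs against real inputs can start with something as simple as the sketch below. It is a rough heuristic, not an established drift test: the function names, the whitespace tokenization, and the sample production inputs are all assumptions for illustration.

```python
def vocabulary(texts):
    """Set of lowercase whitespace-separated words across all texts."""
    return {word for text in texts for word in text.lower().split()}


def average_words(texts):
    """Mean number of whitespace-separated words per text."""
    return sum(len(text.split()) for text in texts) / len(texts)


def compare_inputs(test_inputs, prod_inputs):
    """Summarize how similar two sets of inputs are in length and vocabulary."""
    test_vocab = vocabulary(test_inputs)
    prod_vocab = vocabulary(prod_inputs)
    union = test_vocab | prod_vocab
    return {
        "avg_words_test": average_words(test_inputs),
        "avg_words_prod": average_words(prod_inputs),
        # Jaccard overlap: shared words divided by all distinct words.
        "vocab_jaccard": len(test_vocab & prod_vocab) / len(union) if union else 1.0,
    }


# Toy inputs, for illustration only.
test_inputs = ["I love this product", "This is terrible"]
prod_inputs = ["ngl this product kinda slaps", "meh"]
report = compare_inputs(test_inputs, prod_inputs)
print(report)
```

A low vocabulary overlap or a large gap in average length points toward distribution mismatch. If the inputs look alike and accuracy is still suspiciously high, auditing the split for overlapping examples is the next thing to check.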