Code Beginner easy · 6 min

Evaluation suite design

What you will learn

Build a simple evaluation suite to measure your fine-tuned model's performance on tasks that matter to you.

Why this matters

Fine-tuning makes your model perform better, but better at what? Without evaluation metrics, you're flying blind: you won't know if your model learned what you wanted or just memorized the training data. A proper evaluation suite catches overfitting and proves your model actually works before shipping.

Skip if: You're prototyping a throwaway proof-of-concept in a notebook, or you're fine-tuning for a production system that already has ground-truth feedback loops (e.g., user click data). Even then, a basic sanity check is worth the 5 minutes.

Explanation

An evaluation suite is a set of test examples and metrics that measure whether your fine-tuned model does what you trained it to do. It's separate from your training data (your model has never seen these examples before), so it tells you how well the model generalizes to new inputs.

Mechanically, you: (1) prepare a held-out test dataset, (2) run your fine-tuned model on it, (3) compare the model's output to the correct answer, and (4) calculate a metric (accuracy, BLEU, F1, or a custom score). The metric tells you whether your fine-tuning actually worked. Without this, you only know your model fit the training set, not whether it's useful.
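Steps (3) and (4) can be sketched in plain Python. This is a minimal illustration, assuming the model's outputs have already been reduced to one label per example; the `accuracy` and `macro_f1` helpers and the sample labels are hypothetical, not part of any library.

```python
def accuracy(expected, predicted):
    """Fraction of predictions that exactly match the expected label."""
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)

def macro_f1(expected, predicted):
    """Average the per-label F1 scores so rare labels count as much as common ones."""
    labels = sorted(set(expected) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(expected, predicted))
        tp = sum(e == label and p == label for e, p in pairs)
        fp = sum(e != label and p == label for e, p in pairs)
        fn = sum(e == label and p != label for e, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Hypothetical labels for four test cases
expected = ["positive", "negative", "neutral", "negative"]
predicted = ["positive", "negative", "negative", "negative"]

print(f"accuracy: {accuracy(expected, predicted):.2f}")  # accuracy: 0.75
print(f"macro F1: {macro_f1(expected, predicted):.2f}")  # macro F1: 0.60
```

Accuracy looks acceptable here, while macro F1 exposes that the "neutral" label is never predicted. That gap is one reason to add a second metric as your suite grows.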

For beginners, start with one simple metric on 20–50 hand-picked test examples. As you grow confident, add more examples and multiple metrics to catch different kinds of failures.

Analogy

Think of fine-tuning like coaching a sports team. Training is practice; your team gets good at practice drills. But you need a scrimmage (evaluation) against a team that's not in your practice squad to know if the coaching actually worked. If your team only plays practice buddies, you'll never know if they're actually championship-ready.

Code

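Illustrative only - not runnable without access to the gated Llama-2 weights on Hugging Face Hub (an approved access request plus a login token) and enough GPU memory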

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

test_suite = [
    {
        "input": "Classify the sentiment: 'I love this product'",
        "expected": "positive"
    },
    {
        "input": "Classify the sentiment: 'This is terrible'",
        "expected": "negative"
    },
    {
        "input": "Classify the sentiment: 'It is okay'",
        "expected": "neutral"
    }
]

def evaluate_model(model, tokenizer, test_suite):
    correct = 0
    results = []
    
    for test_case in test_suite:
        prompt = test_case["input"]
        expected = test_case["expected"].lower()
        
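        # Tokenize the prompt and move the tensor to the model's device
        inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=10,
                do_sample=False  # greedy decoding; temperature only applies when do_sample=True
            )

        # The decoded text contains the prompt followed by the generated tokens
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        response_lower = response.lower()

        # Substring match: passes if the expected label appears anywhere in the text
        is_correct = expected in response_lower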
        if is_correct:
            correct += 1
        
        results.append({
            "input": prompt,
            "expected": expected,
            "model_output": response[:100],
            "correct": is_correct
        })
    
    accuracy = correct / len(test_suite)
    return {
        "accuracy": accuracy,
        "correct": correct,
        "total": len(test_suite),
        "results": results
    }

eval_results = evaluate_model(model, tokenizer, test_suite)
print(f"Accuracy: {eval_results['accuracy']:.1%}")
print(f"Correct: {eval_results['correct']}/{eval_results['total']}")
for i, result in enumerate(eval_results["results"], 1):
    status = "✓" if result["correct"] else "✗"
    print(f"{i}. {status} Expected: {result['expected']} | Got: {result['model_output'][:50]}")

Output

Accuracy: 33.3%
Correct: 1/3
1. ✓ Expected: positive | Got: Classify the sentiment: 'I love this product'
2. ✗ Expected: negative | Got: Classify the sentiment: 'This is terrible'
3. ✗ Expected: neutral | Got: Classify the sentiment: 'It is okay'

What just happened?

The code loaded a base Llama-2 model (not fine-tuned) and ran it on three sentiment classification test cases. For each test, it generated output and checked whether the expected sentiment word appeared in the response. Only 1 of 3 passed, because this base model isn't trained for sentiment tasks. The "Got" column starts with the prompt because the decoded output of a causal model contains the prompt followed by the generated tokens, and the printout is cut to 50 characters. This is what your evaluation suite will measure: whether your fine-tuned model does better than this baseline.
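The evaluation code above matches the expected word against the full decoded text, which includes the prompt. If a prompt happens to contain a label word, that check can pass for the wrong reason. The sketch below shows one way to score only the generated part; `extract_label`, the `LABELS` tuple, and the sample output are hypothetical and chosen for illustration.

```python
LABELS = ("positive", "negative", "neutral")  # assumed label set for this task

def extract_label(prompt, decoded):
    """Return the first known label in the generated part of the output, or None."""
    # Causal LMs return the prompt followed by the continuation, so drop the echo first.
    completion = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    completion = completion.lower()
    # Keep only labels that occur, then pick the one that appears earliest.
    positions = {label: completion.find(label) for label in LABELS}
    found = {label: pos for label, pos in positions.items() if pos != -1}
    return min(found, key=found.get) if found else None

prompt = "Classify the sentiment: 'This is not positive at all'"
decoded = prompt + "\nSentiment: Negative"  # hypothetical model output

print("positive" in decoded.lower())   # True: the naive check is fooled by the prompt
print(extract_label(prompt, decoded))  # negative
```

An alternative that avoids string matching on the prompt is to decode only the new tokens, for example `outputs[0][inputs.shape[1]:]` when `inputs` is the prompt's token tensor as in the code above.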

Common gotcha

The most common mistake is making your test examples too similar to your training data, or worse, accidentally including training examples in your test set. When you do this, your accuracy scores look great (90%+) but your model fails on real new inputs. Always: (1) create test examples before you see the training data, or (2) explicitly hold out a random 10–15% of your data before training starts. Never evaluate on data your model trained on.
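Option (2) and a basic leakage check can be sketched with the standard library. This is a minimal example, assuming your data is a list of dicts with an "input" key as in the test suite above; `split_holdout`, `find_leaks`, the 15% fraction, and the seed are illustrative choices.

```python
import random

def split_holdout(examples, test_fraction=0.15, seed=42):
    """Shuffle once with a fixed seed, then carve off the test set before any training."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants count as duplicates."""
    return " ".join(text.lower().split())

def find_leaks(train, test):
    """Return test inputs that also appear in the training set after normalization."""
    train_inputs = {normalize(ex["input"]) for ex in train}
    return [ex["input"] for ex in test if normalize(ex["input"]) in train_inputs]

# Hypothetical dataset of 20 distinct examples
data = [{"input": f"Review number {i}", "expected": "positive"} for i in range(20)]
train, test = split_holdout(data)

print(len(train), len(test))    # 17 3
print(find_leaks(train, test))  # []
```

This only catches exact duplicates after normalization. Near-duplicates (paraphrases, templated variants) need a fuzzier comparison, which is outside the scope of this sketch.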

Error recovery

OutOfMemoryError

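You're running on a GPU that's too small for inference. Either use a smaller model (7B instead of 70B), quantize to 8-bit with `load_in_8bit=True` (recent transformers releases expect this inside a `BitsAndBytesConfig` passed as `quantization_config`, and it needs the bitsandbytes package), or use a CPU with enough RAM. For beginners, start with 7B models.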

KeyError: 'input'

Your test_suite list has dictionaries without an 'input' key. Check that each dict in test_suite has both 'input' and 'expected' keys spelled exactly as referenced in the code.
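A quick pre-flight check can surface this before any model is loaded. The sketch below is one possible approach; `validate_test_suite` and `REQUIRED_KEYS` are hypothetical names.

```python
REQUIRED_KEYS = {"input", "expected"}

def validate_test_suite(test_suite):
    """Return a list of problems; an empty list means every case has the required keys."""
    problems = []
    for index, case in enumerate(test_suite):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing keys {sorted(missing)}")
    return problems

suite = [
    {"input": "Classify the sentiment: 'I love this product'", "expected": "positive"},
    {"Input": "Classify the sentiment: 'This is terrible'", "expected": "negative"},  # typo: capital I
]

print(validate_test_suite(suite))  # ["case 1: missing keys ['input']"]
```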

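AttributeError: 'NoneType' object has no attribute 'device'

`model` is `None` when the code reaches `model.device`. `from_pretrained` raises an error rather than returning `None` when a load fails, so confirm that the load line ran without an exception (the `model_name` exists on Hugging Face Hub and you have internet access) and that `model` wasn't reassigned afterward. Verify with: `model is not None` after the load line.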

Experienced dev note

You'll be tempted to skip evaluation and just look at training loss. Don't. Falling training loss only shows that the model is fitting the training set; it tells you almost nothing about whether the model works on unseen data. A model with near-zero training loss is worthless if it can't solve the task. Spend 10% of fine-tuning time building evaluation, 90% on the actual fine-tuning. Also: save your test suite in version control separate from training code. Future you will thank you when you need to compare fine-tuned models or debug why performance regressed.
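One way to keep the test suite under version control is to store it as a JSON Lines file next to, but separate from, the training code. The sketch below assumes that layout; the helper names and the file name `sentiment_eval_v1.jsonl` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_test_suite(test_suite, path):
    """Write one JSON object per line so diffs in version control stay readable."""
    with open(path, "w", encoding="utf-8") as f:
        for case in test_suite:
            f.write(json.dumps(case, ensure_ascii=False, sort_keys=True) + "\n")

def load_test_suite(path):
    """Read the suite back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

test_suite = [
    {"input": "Classify the sentiment: 'I love this product'", "expected": "positive"},
    {"input": "Classify the sentiment: 'This is terrible'", "expected": "negative"},
]

# A temporary directory stands in for your repository here.
path = Path(tempfile.mkdtemp()) / "sentiment_eval_v1.jsonl"
save_test_suite(test_suite, path)
print(load_test_suite(path) == test_suite)  # True
```

Putting a version in the file name (or tagging the commit) makes it easier to tell which suite produced which score when you compare fine-tuned models later.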

Check your understanding

If your evaluation suite shows 95% accuracy on test data but your model performs poorly on real user inputs in production, what are two possible causes, and how would you distinguish between them?

Show answer hint

A correct answer identifies: (1) distribution mismatch: your test data doesn't reflect real usage patterns, and (2) data leakage: your test set accidentally overlaps with training data. To distinguish: compare the linguistic properties of test vs. real inputs (length, vocabulary, domain), and audit your data pipeline to confirm the train/test split was done correctly. Real production issues often involve both at once.
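The comparison of linguistic properties can start very simply. The sketch below compares average length and vocabulary overlap between test inputs and a sample of real inputs; the helper names and both sample lists are hypothetical, and whitespace splitting is a crude stand-in for real tokenization.

```python
from statistics import mean

def describe(inputs):
    """Summarize a list of input strings by average word count and vocabulary."""
    tokens = [text.lower().split() for text in inputs]
    return {
        "avg_words": mean(len(t) for t in tokens),
        "vocab": {word for t in tokens for word in t},
    }

def vocab_overlap(test_inputs, real_inputs):
    """Fraction of the real-input vocabulary that also occurs in the test set."""
    test_vocab = describe(test_inputs)["vocab"]
    real_vocab = describe(real_inputs)["vocab"]
    return len(test_vocab & real_vocab) / len(real_vocab)

# Hypothetical samples: curated test inputs vs. messages logged from real users
test_inputs = ["I love this product", "This is terrible"]
real_inputs = ["ngl this thing kinda slaps", "meh, would not buy again tbh"]

print(describe(test_inputs)["avg_words"])  # 3.5
print(describe(real_inputs)["avg_words"])  # 5.5
print(f"{vocab_overlap(test_inputs, real_inputs):.2f}")
```

A low overlap or a large length gap points toward distribution mismatch. If the two sets look alike and production quality is still poor, the data-pipeline audit for leakage becomes the more likely lead.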

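VERSION In transformers < 5.0.0, `.generate()` required the `attention_mask` parameter in some cases. Current versions (5.5.x) infer it automatically, but if you're backporting to older code, add `attention_mask=inputs.not_equal(tokenizer.pad_token_id)` to the generate call. This only works when `tokenizer.pad_token_id` is set; the Llama-2 tokenizer ships without a pad token, so `pad_token_id` is `None` unless you assign one.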

Now that you know how to measure model performance, learn how to actually run the fine-tuning loop using SFTTrainer and watch your evaluation metrics improve across epochs.
