Code Intermediate medium · 7 min

Evaluating instruction-following quality after tuning

What you will learn

Systematically measure whether your fine-tuned model actually follows instructions better than the base model using quantitative metrics.

Why this matters

After fine-tuning, you need objective proof that instruction-following improved. Eyeballing a few examples is how models ship broken to production. Metrics let you catch degradation, compare checkpoint quality, and decide when to stop training.

Skip if: You plan to use automated metrics as your only evaluation gate for a task that is domain-specific, creative, or dependent on nuanced judgment (e.g., poetry generation, creative writing). Humans must still evaluate those outputs; metrics only help you pre-filter candidates. Also skip this if your instruction dataset is so small (<50 examples) that a held-out test set is statistically meaningless.

Explanation

What it is: Instruction-following quality evaluation measures whether your tuned model correctly interprets and executes written instructions. This is distinct from general capability metrics: a model might be good at math but terrible at following the instruction "answer in exactly 10 words."

How it works: You split your instruction dataset into train/test (typically 80/20 or 90/10), fine-tune on train, then score test predictions. Reference-overlap metrics such as BLEU and ROUGE measure similarity to a reference answer, not whether constraints were obeyed, so pair them with custom instruction-adherence checkers. For instruction-following specifically, you also report instruction_following_strict (binary: did the raw response satisfy *all* constraints?) and instruction_following_loose (the same checks applied after lenient normalization of the response, such as stripping markdown or a preamble line, so formatting noise is not counted as a failure). Many evaluations also use a separate judge model (often GPT-4 or another instruction-tuned model) to score whether outputs match instructions. A judge can assess intent that exact-match metrics miss, but its scores should be spot-checked against human labels.

When to use it: Before shipping any instruction-tuned model to production, and periodically during development to catch training regressions.
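The strict/loose distinction can be made concrete with a small sketch. It assumes "loose" means re-running the same checks on lightly normalized variants of the response; the variant set in `lenient_variants` and the function names are illustrative choices, not a fixed standard.

```python
def follows_all(text, checks):
    """Strict: every constraint check must pass on the text as given."""
    return all(check(text) for check in checks)

def lenient_variants(text):
    """Yield the raw text plus a few normalized variants (an assumed set)."""
    lines = text.strip().split("\n")
    yield text
    yield text.replace("*", "")        # drop markdown emphasis markers
    if len(lines) > 1:
        yield "\n".join(lines[1:])     # drop a possible preamble line
        yield "\n".join(lines[:-1])    # drop a possible sign-off line

def follows_loose(text, checks):
    """Loose: pass if any normalized variant satisfies every check."""
    return any(follows_all(variant, checks) for variant in lenient_variants(text))

def exactly_three_lines(text):
    return len([ln for ln in text.strip().split("\n") if ln.strip()]) == 3

response = "Sure, here are three fruits:\nApple\nBanana\nOrange"
strict = follows_all(response, [exactly_three_lines])
loose = follows_loose(response, [exactly_three_lines])
print(strict, loose)  # False True
```

Reporting both numbers shows how much of the failure rate comes from chatty preambles or formatting noise and how much from constraints that are actually violated.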

Analogy

Evaluating instruction-following is like testing a kitchen timer. You don't just ask "does it beep?" You set it to specific intervals and verify it actually rings at 3:00, not 2:45. The model might generate fluent text (the beep works) but ignore your exact constraints (the timing is wrong).

Code

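Illustrative only - no API key is required, but running it downloads the TinyLlama weights from the Hugging Face Hub, and generated text will vary by model version and hardware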

python

import torch
from transformers import AutoTokenizer, pipeline

# 1. Create a small instruction-following test set
test_instructions = [
    {
        "instruction": "List 3 fruits. Answer in exactly 3 lines, one fruit per line.",
        "expected_constraints": ["exactly 3 lines", "one fruit per line"],
        "reference": "Apple\nBanana\nOrange"
    },
    {
        "instruction": "Explain photosynthesis in under 30 words.",
        "expected_constraints": ["under 30 words"],
        "reference": "Plants convert sunlight into chemical energy using chlorophyll, producing glucose and oxygen."
    },
    {
        "instruction": "Write a haiku about rain.",
        "expected_constraints": ["haiku format (5-7-5 syllables)"],
        "reference": "Drops fall from grey sky\nEarth drinks deeply, grass turns green\nLife returns once more"
    }
]

# 2. Load a small instruction-tuned model (using a real model)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    max_new_tokens=256
)

# 3. Generate responses for each instruction
responses = []
for item in test_instructions:
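    prompt = f"Instruction: {item['instruction']}\nResponse:"
    # Greedy decoding keeps the evaluation reproducible across runs;
    # return_full_text=False returns only the completion, not the prompt.
    output = pipe(prompt, do_sample=False, return_full_text=False)
    generated_text = output[0]["generated_text"].strip()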
    responses.append({
        "instruction": item["instruction"],
        "generated": generated_text,
        "reference": item["reference"],
        "constraints": item["expected_constraints"]
    })

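# 4. Helper functions for the constraint checks
def count_words(text):
    """Count whitespace-separated tokens."""
    return len(text.split())

def count_lines(text):
    """Count non-empty lines, ignoring blank separators."""
    return len([line for line in text.strip().split('\n') if line.strip()])

def check_word_limit(text, limit):
    """True when the text has strictly fewer than `limit` words ("under N words")."""
    return count_words(text) < limit

def check_line_count(text, count):
    """True when the text has exactly `count` non-empty lines."""
    return count_lines(text) == count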

# 5. Manual evaluation of each response
print("=" * 80)
print("INSTRUCTION-FOLLOWING EVALUATION")
print("=" * 80)

results = []
for i, response in enumerate(responses, 1):
    print(f"\n[Test {i}]")
    print(f"Instruction: {response['instruction']}")
    generated = response['generated']
    # Truncate long generations for display only; the checks use the full text
    preview = (generated[:100] + "...") if len(generated) > 100 else generated
    print(f"Generated:   {preview}")
    print(f"Constraints: {', '.join(response['constraints'])}")
    
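    checks = {}
    constraints = response['constraints']
    if "exactly 3 lines" in constraints:
        lines = count_lines(response['generated'])
        checks['3_lines'] = lines == 3
        mark = "✓" if checks['3_lines'] else "✗"
        print(f"  {mark} Line count: {lines} (expected 3) → {checks['3_lines']}")

    if "under 30 words" in constraints:
        words = count_words(response['generated'])
        checks['under_30_words'] = words < 30
        mark = "✓" if checks['under_30_words'] else "✗"
        print(f"  {mark} Word count: {words} (expected < 30) → {checks['under_30_words']}")

    # The stored constraint is "haiku format (5-7-5 syllables)", so match on its prefix.
    # This only checks the line count; it does not verify the syllable pattern.
    if any(c.startswith("haiku format") for c in constraints):
        lines = count_lines(response['generated'])
        checks['haiku_lines'] = lines == 3
        mark = "✓" if checks['haiku_lines'] else "✗"
        print(f"  {mark} Haiku lines: {lines} (expected 3) → {checks['haiku_lines']}")

    # Fraction of checks passed; 0.0 when no check applied to this example
    instruction_following_score = sum(checks.values()) / len(checks) if checks else 0.0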
    results.append({
        "instruction": response['instruction'],
        "instruction_following_score": instruction_following_score,
        "constraint_checks": checks
    })

# 6. Aggregate metrics
print("\n" + "=" * 80)
print("AGGREGATE METRICS")
print("=" * 80)

average_score = sum(r["instruction_following_score"] for r in results) / len(results)
print(f"\nAverage instruction-following score: {average_score:.2%}")
print(f"Test set size: {len(results)} examples")
print(f"Models evaluated: {model_name}")
print(f"\nDetailed results:")
for i, result in enumerate(results, 1):
    print(f"  Example {i}: {result['instruction_following_score']:.0%} ({result['constraint_checks']})")

Output

================================================================================
INSTRUCTION-FOLLOWING EVALUATION
================================================================================

[Test 1]
Instruction: List 3 fruits. Answer in exactly 3 lines, one fruit per line.
Generated:   Apple
Banana
Orange
Constraints: exactly 3 lines, one fruit per line
  ✓ Line count: 3 (expected 3) → True

[Test 2]
Instruction: Explain photosynthesis in under 30 words.
Generated:   Photosynthesis is the process where plants use sunlight to create food. It happens in leaves with chlorophyll.
Constraints: under 30 words
  ✓ Word count: 17 (expected < 30) → True

[Test 3]
Instruction: Write a haiku about rain.
Generated:   Drops fall from the sky
Earth drinks deeply, grass turns green
Life returns once more
Constraints: haiku format (5-7-5 syllables)
  ✓ Haiku lines: 3 (expected 3) → True

================================================================================
AGGREGATE METRICS
================================================================================

Average instruction-following score: 100%
Test set size: 3 examples
Models evaluated: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Detailed results:
  Example 1: 100% ({'3_lines': True})
  Example 2: 100% ({'under_30_words': True})
  Example 3: 100% ({'haiku_lines': True})

What just happened?

The code loaded TinyLlama, a small instruction-tuned model, then generated responses to three instructions covering a line-count format constraint, a word limit, and a structural requirement. For each response, it applied constraint-specific checks based on line counting and word counting. Note that the haiku check only verifies that the response has three lines; it does not count syllables, so it cannot confirm the 5-7-5 pattern. Each check returned True/False, aggregated into a per-example score, then averaged across all test cases. The final result is 100% because this already-tuned model happens to follow these specific instructions correctly, but the same pipeline would show gaps if the model failed constraint checks.
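As the test set grows, matching constraint strings by hand becomes brittle. One possible alternative, sketched here under assumptions, stores each constraint as structured data and looks up its checker in a registry. The `CHECKERS` table, the `type`/`value` keys, and `score_response` are hypothetical names for illustration, not part of any library.

```python
def count_words(text):
    return len(text.split())

def count_nonempty_lines(text):
    return len([ln for ln in text.split("\n") if ln.strip()])

# Hypothetical registry: constraint type -> function(text, value) -> bool
CHECKERS = {
    "under_words": lambda text, n: count_words(text) < n,
    "exact_lines": lambda text, n: count_nonempty_lines(text) == n,
}

def score_response(text, constraints):
    """Return (per-check results, fraction of checks passed) for one response."""
    checks = {
        f"{c['type']}={c['value']}": CHECKERS[c["type"]](text, c["value"])
        for c in constraints
    }
    score = sum(checks.values()) / len(checks) if checks else 0.0
    return checks, score

checks, score = score_response(
    "Apple\nBanana\nOrange\nMango",
    [{"type": "exact_lines", "value": 3}, {"type": "under_words", "value": 30}],
)
print(checks, score)  # {'exact_lines=3': False, 'under_words=30': True} 0.5
```

Adding a new constraint type then means adding one registry entry, and an unknown type fails loudly with a KeyError instead of being silently skipped.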

Common gotcha

Developers often measure instruction-following only on test instructions *similar* to training examples. If you fine-tuned on 100 "list N items" instructions, your evaluation set will be biased toward high scores on list-formatting tasks. You won't catch that the model fails at completely different instruction types (e.g., "write in passive voice"). Always include out-of-distribution instruction types in your test set (different styles, lengths, and constraint combinations), or your metrics will overstate how well the model generalizes.
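A quick coverage check can flag this bias before you trust a score. The sketch below assumes every example has been hand-tagged with an instruction type; the type labels and the `coverage_gaps` helper are hypothetical.

```python
from collections import Counter

def coverage_gaps(train_types, test_types):
    """Compare instruction-type coverage between the train and test sets."""
    train, test = Counter(train_types), Counter(test_types)
    ood_types = sorted(set(test) - set(train))        # tested but never trained on
    untested_types = sorted(set(train) - set(test))   # trained on but never tested
    total = sum(test.values())
    ood_share = sum(test[t] for t in ood_types) / total if total else 0.0
    return {
        "ood_types": ood_types,
        "untested_types": untested_types,
        "ood_share": ood_share,
    }

# Hypothetical type labels for illustration
train_types = ["list_n_items"] * 100
test_types = ["list_n_items"] * 15 + ["passive_voice"] * 3 + ["word_limit"] * 2
report = coverage_gaps(train_types, test_types)
print(report)
```

Here a quarter of the test set is out-of-distribution; an `ood_share` of zero would mean the score says nothing about instruction types the model did not see during tuning.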

Error recovery

ValueError: Input too large for model

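The prompt (template plus instruction) exceeds the model's context window. Fix: Truncate the prompt so that it and max_new_tokens together fit within the context window (2,048 tokens for TinyLlama-1.1B-Chat), or use a model with a longer context. Check tokenizer.model_max_length for the limit the tokenizer reports.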

torch.cuda.OutOfMemoryError

Model doesn't fit on GPU during generation. Fix: Set device=-1 in pipeline to use CPU (slower but works), or load the model in 8-bit by passing quantization_config=BitsAndBytesConfig(load_in_8bit=True) to from_pretrained (requires the bitsandbytes package).

AttributeError: 'NoneType' object has no attribute 'split'

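A None value reached a string operation, which usually means a generation result was missing or overwritten before the checks ran. Fix: Print the raw pipeline output and confirm that output[0]["generated_text"] is a string before scoring. If the response is merely empty because the model emitted its stop token immediately, the checks will run but fail; in that case verify the prompt format and that tokenizer.eos_token is set correctly, or adjust temperature/top_p.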

Experienced dev note

The score of 100% on your test set doesn't mean the model is production-ready. Three things senior devs check: (1) **test set size**: 3 examples is no signal; use at least 100 carefully curated examples. (2) **metric-gaming**: the model might satisfy the *letter* of a constraint check while missing the *spirit* of the instruction; use a secondary judge model (e.g., GPT-4) to score instruction alignment, not just constraint checkers. (3) **drift in production**: instruction-following degrades on instructions outside the training distribution. Tag test examples by instruction class and report per-class scores, not just averages. A model that scores 95% overall but 30% on "reasoning" instructions is a landmine waiting to explode.
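Points (1) and (3) can be combined in one report: a per-class pass rate with a confidence interval that widens as the class gets smaller. This is a minimal sketch using the Wilson score interval at 95% (z = 1.96); the class labels and the `per_class_report` helper are hypothetical.

```python
import math
from collections import defaultdict

def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95%."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def per_class_report(results):
    """results: iterable of (instruction_class, passed) pairs."""
    buckets = defaultdict(list)
    for cls, passed in results:
        buckets[cls].append(bool(passed))
    report = {}
    for cls, outcomes in sorted(buckets.items()):
        n, k = len(outcomes), sum(outcomes)
        report[cls] = {"n": n, "pass_rate": k / n, "ci95": wilson_interval(k, n)}
    return report

# Three passes out of three, as in the walkthrough above (class label is hypothetical)
report = per_class_report([("format", True)] * 3)
low, high = report["format"]["ci95"]
print(f"pass rate 100%, 95% interval {low:.2f} to {high:.2f}")
```

With only three examples the interval runs from about 0.44 to 1.00, so a perfect score is statistically compatible with a model that fails more than half the time.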

Check your understanding

Your fine-tuned model scores 98% on your held-out test set for instruction-following, but users report it fails to follow specific constraints 30% of the time in production. What is the most likely cause: (A) your test set is too small, (B) your constraints in the test set don't match production instructions, (C) the model is hallucinating, or (D) you need more training data?

Show answer hint

A correct answer focuses on the gap between test-set instructions and real-world instructions. The model can follow *your* constraints perfectly (it learned them during fine-tuning), but production instructions have constraint types, phrasings, or combinations you didn't include in evaluation. This points to (B), and the fix is to audit production failures, extract the constraint types that break, and add them to your test set before the next fine-tuning round.

VERSION Device handling in the pipeline API has changed across transformers releases, so check the documentation for your installed version; the code above uses integer indices (device=0 for GPU 0, device=-1 for CPU). If using trl, pass eval_dataset to SFTTrainer and set eval_strategy in SFTConfig to run evaluation at intervals during training. By default this reports held-out loss, not constraint adherence, so it complements rather than replaces the post-hoc checks shown here.

Once you know your model's instruction-following baseline, you'll want to compare checkpoints during training using callbacks: learn how to set up evaluation-during-training with SFTTrainer to catch performance degradation before the final checkpoint.
