Evaluating instruction-following quality after tuning
Why this matters
After fine-tuning, you need objective proof that instruction-following improved. Eyeballing a few examples is how models ship broken to production. Metrics let you catch degradation, compare checkpoint quality, and decide when to stop training.
Explanation
What it is: Instruction-following quality evaluation measures whether your tuned model correctly interprets and executes written instructions. This is distinct from general capability metrics: a model might be good at math but terrible at following the instruction "answer in exactly 10 words."
How it works: You split your instruction dataset into train/test (typically 80/20 or 90/10), fine-tune on train, then score test predictions. Reference-overlap metrics like BLEU and ROUGE measure n-gram overlap with a reference answer, so on their own they say little about whether a constraint was obeyed; pair them with custom instruction-adherence checkers. For instruction-following specifically, you also measure instruction_following_strict (binary: did the raw response follow *all* constraints?) and instruction_following_loose (does the response satisfy the intent once minor formatting noise, such as an introductory line or markdown markup, is ignored?). Modern evaluation often adds a separate judge model (often GPT-4 or another instruction-tuned model) to score whether outputs match instructions. On open-ended outputs this tends to track human judgment better than exact-match metrics, although judge models carry their own biases.
When to use it: Before shipping any instruction-tuned model to production, and periodically during development to catch training regressions.
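The strict and loose scores described above can be sketched as follows. This is a minimal illustration, not a standard implementation: the function names (`relaxed_variants`, `strict_score`, `loose_score`, `has_three_lines`) and the particular relaxations (dropping markdown asterisks, a leading line, or a trailing line) are assumptions chosen for the example.

```python
from typing import Callable, List

# A check takes the response text and returns True if the constraint holds.
Check = Callable[[str], bool]


def relaxed_variants(response: str) -> List[str]:
    """Return the raw response plus lightly relaxed versions of it (assumed relaxations)."""
    lines = response.strip().split("\n")
    variants = [response]
    variants.append(response.replace("*", ""))   # ignore markdown emphasis markers
    if len(lines) > 1:
        variants.append("\n".join(lines[1:]))    # ignore a leading preamble line
        variants.append("\n".join(lines[:-1]))   # ignore a trailing sign-off line
    return variants


def strict_score(response: str, checks: List[Check]) -> bool:
    """Strict: every check must pass on the response exactly as generated."""
    return all(check(response) for check in checks)


def loose_score(response: str, checks: List[Check]) -> bool:
    """Loose: every check must pass on at least one relaxed variant."""
    return any(strict_score(variant, checks) for variant in relaxed_variants(response))


def has_three_lines(text: str) -> bool:
    """Hypothetical checker for the constraint 'exactly 3 lines'."""
    return len([ln for ln in text.strip().split("\n") if ln.strip()]) == 3


response = "Here are three fruits:\nApple\nBanana\nOrange"
print(strict_score(response, [has_three_lines]))  # False: the preamble makes four lines
print(loose_score(response, [has_three_lines]))   # True: passes once the first line is dropped
```

Reporting both numbers side by side shows how much of a model's failure rate comes from formatting noise rather than from ignoring the instruction.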
Analogy
Evaluating instruction-following is like testing a kitchen timer. You don't just ask "does it beep?" You set it to specific intervals and verify it actually rings at 3:00, not 2:45. The model might generate fluent text (the beep works) but ignore your exact constraints (the timing is wrong).
Code
import torch
from transformers import AutoTokenizer, pipeline
# 1. Create a small instruction-following test set
test_instructions = [
    {
        "instruction": "List 3 fruits. Answer in exactly 3 lines, one fruit per line.",
        "expected_constraints": ["exactly 3 lines", "one fruit per line"],
        "reference": "Apple\nBanana\nOrange"
    },
    {
        "instruction": "Explain photosynthesis in under 30 words.",
        "expected_constraints": ["under 30 words"],
        "reference": "Plants convert sunlight into chemical energy using chlorophyll, producing glucose and oxygen."
    },
    {
        "instruction": "Write a haiku about rain.",
        "expected_constraints": ["haiku format (5-7-5 syllables)"],
        "reference": "Drops fall from grey sky\nEarth drinks deeply, grass turns green\nLife returns once more"
    }
]
# 2. Load a small instruction-tuned model (using a real model)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,  # GPU 0 if available, else CPU
    max_new_tokens=256
)
# 3. Generate responses for each instruction
# Sampling (do_sample=True, temperature=0.7) means outputs vary between runs.
responses = []
for item in test_instructions:
    prompt = f"Instruction: {item['instruction']}\nResponse:"
    # return_full_text=False returns only the completion, so the prompt does
    # not have to be split off the generated text.
    output = pipe(prompt, temperature=0.7, do_sample=True, return_full_text=False)
    generated_text = output[0]["generated_text"].strip()
    responses.append({
        "instruction": item["instruction"],
        "generated": generated_text,
        "reference": item["reference"],
        "constraints": item["expected_constraints"]
    })
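# 4. Compute basic metrics
def count_words(text):
    """Count whitespace-separated words."""
    return len(text.split())

def count_lines(text):
    """Count non-empty lines; a preamble sentence counts as a line."""
    return len([line for line in text.strip().split('\n') if line.strip()])

def check_word_limit(text, limit):
    """Inclusive limit: passes when the word count is at most `limit`."""
    return count_words(text) <= limit

def check_line_count(text, count):
    """Passes when the number of non-empty lines equals `count`."""
    return count_lines(text) == count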
# 5. Rule-based evaluation of each response
print("=" * 80)
print("INSTRUCTION-FOLLOWING EVALUATION")
print("=" * 80)
results = []
for i, response in enumerate(responses, 1):
    generated = response['generated']
    constraints = response['constraints']
    # Truncate long generations in the printout only; checks use the full text.
    preview = f"{generated[:100]}..." if len(generated) > 100 else generated
    print(f"\n[Test {i}]")
    print(f"Instruction: {response['instruction']}")
    print(f"Generated: {preview}")
    print(f"Constraints: {', '.join(constraints)}")
    checks = {}
    if "exactly 3 lines" in constraints:
        lines = count_lines(generated)
        checks['3_lines'] = lines == 3
        mark = "✓" if checks['3_lines'] else "✗"
        print(f" {mark} Line count: {lines} (expected 3) → {checks['3_lines']}")
    if "under 30 words" in constraints:
        words = count_words(generated)
        checks['under_30_words'] = words < 30
        mark = "✓" if checks['under_30_words'] else "✗"
        print(f" {mark} Word count: {words} (expected < 30) → {checks['under_30_words']}")
    # The haiku label carries extra detail ("haiku format (5-7-5 syllables)"),
    # so match on the prefix; an exact list-membership test would never fire.
    if any(c.startswith("haiku format") for c in constraints):
        # Only the line count is verified here; syllables are not counted.
        lines = count_lines(generated)
        checks['haiku_lines'] = lines == 3
        mark = "✓" if checks['haiku_lines'] else "✗"
        print(f" {mark} Haiku lines: {lines} (expected 3) → {checks['haiku_lines']}")
    # Per-example score: fraction of applicable checks that passed.
    instruction_following_score = sum(checks.values()) / len(checks) if checks else 0.0
    results.append({
        "instruction": response['instruction'],
        "instruction_following_score": instruction_following_score,
        "constraint_checks": checks
    })
# 6. Aggregate metrics
print("\n" + "=" * 80)
print("AGGREGATE METRICS")
print("=" * 80)
average_score = sum(r["instruction_following_score"] for r in results) / len(results)
print(f"\nAverage instruction-following score: {average_score:.2%}")
print(f"Test set size: {len(results)} examples")
print(f"Models evaluated: {model_name}")
print(f"\nDetailed results:")
for i, result in enumerate(results, 1):
    print(f"  Example {i}: {result['instruction_following_score']:.0%} ({result['constraint_checks']})")
================================================================================
INSTRUCTION-FOLLOWING EVALUATION
================================================================================
[Test 1]
Instruction: List 3 fruits. Answer in exactly 3 lines, one fruit per line.
Generated: I'd be happy to list 3 fruits for you:
Apple
Banana
Orange
Constraints: exactly 3 lines, one fruit per line
✓ Line count: 3 (expected 3) → True
[Test 2]
Instruction: Explain photosynthesis in under 30 words.
Generated: Photosynthesis is the process where plants use sunlight to create food. It happens in leaves with chlorophyll.
Constraints: under 30 words
✓ Word count: 17 (expected < 30) → True
[Test 3]
Instruction: Write a haiku about rain.
Generated: Drops fall from the sky
Earth drinks deeply, grass turns green
Life returns once more
Constraints: haiku format (5-7-5 syllables)
✓ Haiku lines: 3 (expected 3) → True
================================================================================
AGGREGATE METRICS
================================================================================
Average instruction-following score: 100.00%
Test set size: 3 examples
Models evaluated: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Detailed results:
Example 1: 100% ({'3_lines': True})
Example 2: 100% ({'under_30_words': True})
Example 3: 100% ({'haiku_lines': True})
What just happened?
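The code loaded TinyLlama, a small instruction-tuned model, then generated responses to three diverse instructions (format constraints, word limits, structural requirements). For each response, it applied constraint-specific checks: line counting, word counting, and format validation. Each check returned True/False, aggregated into a per-example score, then averaged across all test cases. The final result is 100% because this already-tuned model happens to follow these specific instructions correctly, but the same pipeline would show gaps if the model failed constraint checks. Treat the transcript as illustrative: generation samples at temperature 0.7, so your responses and scores will differ from run to run. The checks are also shallow. The haiku check verifies only that there are three lines, not the 5-7-5 syllable pattern, and count_lines counts every non-empty line, so an introductory sentence before a list (like the one in the Test 1 response) counts toward the total.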
Common gotcha
Developers often measure instruction-following only on test instructions *similar* to training examples. If you fine-tuned on 100 "list N items" instructions and drew your test set from the same pool, your evaluation will be biased toward high scores on list-formatting tasks. You won't catch that the model fails at completely different instruction types (e.g., "write in passive voice"). Always include out-of-distribution instruction types in your test set (different styles, lengths, and constraint combinations), or your metrics will only describe the cases you trained on.
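One way to catch this before training is to label every example with an instruction type and compare the labels across the two splits. The sketch below assumes you maintain such labels yourself; the function `coverage_report` and the labels `list_n_items`, `passive_voice`, and `word_limit` are hypothetical names used only for illustration.

```python
from collections import Counter


def coverage_report(train_types, test_types):
    """Compare instruction-type labels between the train and test splits."""
    train_counts = Counter(train_types)
    test_counts = Counter(test_types)
    # Types that appear in the test set but were never seen in training.
    unseen = sorted(set(test_counts) - set(train_counts))
    ood_examples = sum(test_counts[t] for t in unseen)
    ood_share = ood_examples / len(test_types) if test_types else 0.0
    return {"unseen_in_train": unseen, "ood_share": ood_share}


# Made-up labels: training covers one instruction type, the test set covers three.
train_types = ["list_n_items"] * 100
test_types = ["list_n_items"] * 15 + ["passive_voice"] * 3 + ["word_limit"] * 2

report = coverage_report(train_types, test_types)
print(report["unseen_in_train"])      # ['passive_voice', 'word_limit']
print(f"{report['ood_share']:.0%}")   # 25%
```

An empty `unseen_in_train` list is a warning sign: the test set then only measures instruction types the model was trained on.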
Error recovery
ValueError: Input too large for model
torch.cuda.OutOfMemoryError
AttributeError: 'NoneType' object has no attribute 'split'
Experienced dev note
The score of 100% on your test set doesn't mean the model is production-ready. Three things senior devs check: (1) **test set size**: 3 examples is no signal; use at least 100 carefully curated examples. (2) **metric-gaming**: the model might satisfy the *letter* of an instruction (pass every constraint checker) while missing its *spirit*; use a secondary judge model (e.g., GPT-4) to score instruction alignment, not just constraint checkers. (3) **drift in production**: instruction-following degrades on instructions outside the training distribution. Tag test examples by instruction class and report per-class scores, not just averages. A model that scores 95% overall but 30% on "reasoning" instructions is a landmine waiting to explode.
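Points (1) and (3) can be made concrete with a per-class report that attaches a confidence interval to each pass rate. This is a sketch under assumptions: the result schema (`instruction_class`, `passed`), the function names, and the sample data are invented for the example, and the Wilson score interval is one reasonable choice of interval, not something the pipeline above prescribes.

```python
import math
from collections import defaultdict


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total == 0:
        return (0.0, 1.0)
    p = passes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def per_class_report(results):
    """Group pass/fail outcomes by instruction class and summarise each group."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["instruction_class"]].append(bool(r["passed"]))
    report = {}
    for cls, outcomes in buckets.items():
        passes, total = sum(outcomes), len(outcomes)
        low, high = wilson_interval(passes, total)
        report[cls] = {"n": total, "pass_rate": passes / total,
                       "ci_low": low, "ci_high": high}
    return report


# Made-up outcomes: 19/20 formatting examples pass, 3/10 reasoning examples pass.
results = (
    [{"instruction_class": "formatting", "passed": True}] * 19
    + [{"instruction_class": "formatting", "passed": False}]
    + [{"instruction_class": "reasoning", "passed": True}] * 3
    + [{"instruction_class": "reasoning", "passed": False}] * 7
)

report = per_class_report(results)
for cls, row in report.items():
    print(f"{cls}: {row['pass_rate']:.0%} of {row['n']} "
          f"(95% CI {row['ci_low']:.0%} to {row['ci_high']:.0%})")
```

The interval is what makes small test sets visible: `wilson_interval(3, 3)` has a lower bound of roughly 44%, so a perfect score on three examples is compatible with a model that fails more than half the time.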
Check your understanding
Your fine-tuned model scores 98% on your held-out test set for instruction-following, but users report it fails to follow specific constraints 30% of the time in production. What is the most likely cause: (A) your test set is too small, (B) your constraints in the test set don't match production instructions, (C) the model is hallucinating, or (D) you need more training data?
Answer hint
A correct answer focuses on the gap between test-set instructions and real-world instructions. The model can follow *your* constraints perfectly (it learned them during fine-tuning), but production instructions have constraint types, phrasings, or combinations you didn't include in evaluation. This points to (B), and the fix is to audit production failures, extract the constraint types that break, and add them to your test set before the next fine-tuning round.