Skipping capability regression checks
Why this matters
Fine-tuning can make a model worse at tasks it previously handled well: a phenomenon called capability regression. Catching this early prevents shipping a degraded model to production.
Explanation
Capability regression happens when fine-tuning on a specific task causes the model to lose performance on other tasks or general knowledge it had before. During fine-tuning, the model's weights shift to optimize for your training data, and this shift can degrade its original abilities.
To detect it, evaluate the model at regular intervals on a held-out set from the original domain, not just on your fine-tuning task. If the loss on your fine-tuning task keeps dropping while performance on the original domain stalls or falls, you've caught regression. The usual fixes are to reduce the learning rate, use LoRA with a smaller rank, or mix examples from the original domain into your training data.
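The detection rule above can be sketched in plain Python. This is a minimal illustration, not a standard API: the function name `detect_regression`, the 5% `tolerance`, and the convention that index 0 holds the pre-fine-tuning baseline are all assumptions made for this example.

```python
def detect_regression(task_losses, original_losses, tolerance=0.05):
    """Return the first eval index where regression is suspected, else None.

    Both lists hold eval losses recorded at the same steps; index 0 is
    assumed to be the baseline measured before fine-tuning started.
    """
    if len(task_losses) != len(original_losses):
        raise ValueError("both histories must cover the same eval steps")
    baseline = original_losses[0]
    for i in range(1, len(task_losses)):
        # The fine-tuning task is still improving at this step...
        task_improving = task_losses[i] < task_losses[i - 1]
        # ...while original-domain loss has risen past the allowed margin.
        original_worse = original_losses[i] > baseline * (1 + tolerance)
        if task_improving and original_worse:
            return i
    return None


# Hypothetical loss histories, made up for illustration only
task_history = [3.0, 2.0, 1.5, 1.0]
original_history = [2.0, 2.02, 2.2, 2.6]
print(detect_regression(task_history, original_history))
```

The tolerance keeps ordinary eval noise from raising a false alarm; how wide to set it depends on how noisy your original-domain eval set is.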
This is especially important when fine-tuning large models on small datasets: the model has more capacity to memorize your specific data and forget everything else.
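One of the fixes mentioned above, mixing original-domain examples into the training data, might look like the sketch below. The helper name `mix_with_replay` and the 20% default `replay_fraction` are assumptions for illustration; the right ratio depends on your data and is something to tune.

```python
import random


def mix_with_replay(task_examples, original_examples, replay_fraction=0.2, seed=0):
    """Blend task data with original-domain examples.

    replay_fraction is the share of the final mix that should come from
    original_examples (capped by how many are available).
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay
    n_replay = round(len(task_examples) * replay_fraction / (1 - replay_fraction))
    n_replay = min(n_replay, len(original_examples))
    mixed = list(task_examples) + rng.sample(list(original_examples), n_replay)
    rng.shuffle(mixed)  # avoid long runs of a single domain
    return mixed
```

The returned list can be wrapped in a dataset the same way as any other list of training examples.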
Analogy
It's like a musician spending weeks perfecting a single jazz improvisation. They get really good at that one riff, but if they spend all their practice time on it, they might lose fluency in the classical pieces they used to play well. You need to occasionally play the classical pieces mid-training to notice if you've gotten rusty.
Code
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

# Simulate original knowledge: model should answer simple questions
original_knowledge_eval = [
    "What is 2+2?",
    "What is the capital of France?",
    "How do you boil water?",
]

# Fine-tuning data: domain-specific task
fine_tune_data = [
    {"text": "Customer: I need help with my subscription. Agent: Let me check your account."},
    {"text": "Customer: How do I reset my password? Agent: Click Settings, then Account Recovery."},
    {"text": "Customer: What's your return policy? Agent: 30 days for most items."},
]

# Load a small model for demonstration
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 has no pad token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Prepare fine-tuning dataset
fine_tune_dataset = Dataset.from_dict({"text": [item["text"] for item in fine_tune_data]})

# eval_strategy="steps" needs an eval dataset. The probe prompts serve as a
# crude original-domain stand-in; a real check would use a larger held-out set.
original_eval_dataset = Dataset.from_dict({"text": original_knowledge_eval})

# LoRA config: training a small adapter reduces (but does not prevent) forgetting
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Training config with evaluation at every step
training_args = SFTConfig(
    output_dir="./capability_regression_test",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    eval_strategy="steps",
    eval_steps=1,
    logging_steps=1,
    save_steps=1,
    max_seq_length=128,  # some newer TRL releases call this max_length
)

# Trainer with LoRA (smaller rank helps preserve original knowledge)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=fine_tune_dataset,
    eval_dataset=original_eval_dataset,
    peft_config=lora_config,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)


def show_responses(label):
    """Greedy-decode each original-knowledge prompt and print the result."""
    print(f"\n=== {label} ===")
    eval_model = trainer.model  # the LoRA-wrapped model the trainer updates
    eval_model.eval()
    with torch.no_grad():
        for prompt in original_knowledge_eval:
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=50)
            inputs = inputs.to(eval_model.device)  # the trainer may have moved the model to GPU
            output_ids = eval_model.generate(
                **inputs,  # passes input_ids and attention_mask
                max_length=60,
                num_beams=1,
                pad_token_id=tokenizer.eos_token_id,
            )
            response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
            print(f"Prompt: {prompt}")
            print(f"Response: {response[:80]}...\n")


# Manual evaluation on original knowledge before fine-tuning
show_responses("BEFORE FINE-TUNING")

# Fine-tune
print("\n=== FINE-TUNING ===")
trainer.train()

# Manual evaluation after fine-tuning
show_responses("AFTER FINE-TUNING")
print("Check: Did the model's responses change significantly? That would indicate capability drift.")
Output
=== BEFORE FINE-TUNING ===
Prompt: What is 2+2?
Response: What is 2+2? I'm not sure if you're asking about a mathematical...

Prompt: What is the capital of France?
Response: What is the capital of France? The capital of France is Paris...

Prompt: How do you boil water?
Response: How do you boil water? Heat it in a pot until it reaches...

=== FINE-TUNING ===
TrainableParams: 294912 || AllParams: 124447488 || Trainable%: 0.24

=== AFTER FINE-TUNING ===
Prompt: What is 2+2?
Response: What is 2+2? Customer: I need help with my subscription...

Prompt: What is the capital of France?
Response: What is the capital of France? Agent: Let me check your account...

Prompt: How do you boil water?
Response: How do you boil water? Agent: Click Settings, then Account Recovery...

Check: Did the model's responses change significantly? That would indicate capability drift.
What just happened?
We loaded GPT-2, evaluated it on general knowledge questions, then fine-tuned it on customer service examples using LoRA (low-rank adaptation). The responses after fine-tuning shifted toward customer service language, showing the model is being pulled toward the new domain. LoRA freezes the base weights and trains only a small low-rank adapter (r=8 here), which limits how far the model's behavior can move. Without LoRA, or with a higher learning rate, you would typically see even more regression on the original knowledge questions.
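Eyeballing the before and after responses works for three prompts but not for three hundred. As a rough sketch, the comparison can be scored with the standard library's `difflib`. The helper names `response_drift` and `flag_drifted` and the 0.5 threshold are assumptions for illustration, and character-level similarity is only a crude proxy for a real task metric.

```python
from difflib import SequenceMatcher


def response_drift(before, after):
    """Return 1 - similarity: 0.0 means identical, 1.0 means nothing in common."""
    return 1.0 - SequenceMatcher(None, before, after).ratio()


def flag_drifted(prompts, before_responses, after_responses, threshold=0.5):
    """Return (prompt, drift) pairs whose continuation changed by more than threshold."""
    flagged = []
    for prompt, before, after in zip(prompts, before_responses, after_responses):
        # Decoded responses echo the prompt, so compare only the continuations
        drift = response_drift(before.removeprefix(prompt), after.removeprefix(prompt))
        if drift > threshold:
            flagged.append((prompt, round(drift, 2)))
    return flagged
```

To use it with the script above, collect the decoded responses into two lists (one per evaluation pass) instead of only printing them, then pass both lists along with `original_knowledge_eval`.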
Common gotcha
The biggest mistake is fine-tuning without evaluating on anything except your new task. You train your loss to near-zero on customer service dialogues, think you're done, then ship it to production only to discover it can no longer do basic reasoning or answer simple questions. You need a separate eval set from the original domain that you check during training, not after.
Error recovery
RuntimeError: expected scalar type Float but found Long
CUDA out of memory during fine-tuning
eval_strategy not working
Experienced dev note
The hard truth: capability regression is invisible until you look for it. Many teams discover it weeks after deployment, when users report the model is worse at general tasks. The fix isn't complicated; you just have to measure it intentionally. Set up your eval harness before fine-tuning starts, not after. Also, LoRA is not a magic bullet: it reduces regression, but a learning rate that's too high (above roughly 1e-4 in most cases) will still cause it, even with LoRA. Start conservative on learning rate and increase only if training plateaus.
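The "start conservative, raise only on a plateau" advice can be written down as a small rule. This is a sketch under stated assumptions: the function name `next_learning_rate`, the doubling `factor`, and the 1% `min_improvement` plateau test are all hypothetical choices, and the 1e-4 cap simply mirrors the rule of thumb above.

```python
def next_learning_rate(current_lr, recent_task_losses, max_lr=1e-4,
                       factor=2.0, min_improvement=0.01):
    """Suggest the learning rate for the next run.

    Raise it only when the task loss has plateaued, and never above max_lr.
    """
    if len(recent_task_losses) < 2 or recent_task_losses[0] <= 0:
        return current_lr  # not enough evidence to change anything
    first, last = recent_task_losses[0], recent_task_losses[-1]
    relative_gain = (first - last) / first
    if relative_gain >= min_improvement:
        return current_lr  # still improving: leave it alone
    return min(current_lr * factor, max_lr)  # plateaued: step up, but stay capped
```

Whatever rule you use, re-check the original-domain eval after every increase, since a higher learning rate is exactly what makes regression more likely.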
Check your understanding
If you fine-tune a model on a new task and notice that your fine-tuning loss keeps dropping but your evaluation loss on a held-out test set from the original domain plateaus or increases, what has likely happened and what single parameter adjustment (other than stopping early) could help?
Answer hint
The model is experiencing capability regression. The answer should mention that a lower learning rate or a smaller LoRA rank (r) could help, because both limit how far the model's effective weights move away from the base model. Bonus insight: the mismatch between train loss and eval loss is the symptom you're looking for.