Code Beginner easy · 4 min

Skipping capability regression

What you will learn

Use evaluation metrics during fine-tuning to detect when your model forgets knowledge it had before training started.

Why this matters

Fine-tuning can make a model worse at tasks it previously handled well, a phenomenon called capability regression. Catching this early prevents shipping a degraded model to production.

Skip if: The base model has no capability you need to preserve, for example when you're adapting GPT-2 to a narrow medical-terminology task and will never use it for anything else. You can also skip this for toy experiments on small datasets where regression doesn't matter.

Explanation

Capability regression happens when fine-tuning on a specific task causes the model to lose performance on other tasks or general knowledge it had before. During fine-tuning, the model's weights shift to optimize for your training data, and this shift can degrade its original abilities.

To detect it, you evaluate the model at regular intervals on a held-out test set from your original domain, not just on your fine-tuning task. If the loss on your fine-tuning task keeps dropping while performance on the original task falls, you've caught regression. The usual fixes are to reduce the learning rate, use LoRA with a smaller rank, or mix examples from the original domain into your training data.
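The detection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a standard API: the function name `first_regression_step`, the `tolerance` parameter, and the numbers are all assumptions made for the example. It takes two histories recorded at the same checkpoints (fine-tuning loss, and a higher-is-better score on the original domain) and reports the first checkpoint where the task loss has improved but the original-domain score has dropped below its starting value by more than the tolerance.

```python
from typing import List, Optional


def first_regression_step(
    task_losses: List[float],
    original_scores: List[float],
    tolerance: float = 0.02,
) -> Optional[int]:
    """Return the index of the first checkpoint that looks like regression.

    Index 0 is treated as the baseline measured before fine-tuning.
    A checkpoint counts as regression when the fine-tuning loss is below
    its baseline AND the original-domain score has fallen more than
    `tolerance` below its baseline. Returns None if that never happens.
    """
    if len(task_losses) != len(original_scores):
        raise ValueError("both histories must have the same length")
    if not task_losses:
        return None

    baseline_loss = task_losses[0]
    baseline_score = original_scores[0]
    for i in range(1, len(task_losses)):
        task_improved = task_losses[i] < baseline_loss
        original_dropped = original_scores[i] < baseline_score - tolerance
        if task_improved and original_dropped:
            return i
    return None


# Made-up numbers for illustration only.
task_losses = [2.10, 1.40, 0.90, 0.50]       # fine-tuning loss keeps falling
original_scores = [0.80, 0.79, 0.74, 0.61]   # original-domain accuracy erodes
print(first_regression_step(task_losses, original_scores))  # 2
```

The tolerance is there so that ordinary evaluation noise does not trigger a false alarm; how large it should be depends on the size of your eval set.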

This is especially important when fine-tuning large models on small datasets: the model has more capacity to memorize your specific data and forget everything else.
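The mixed-training-data fix mentioned above can be sketched as a small helper that blends a fraction of original-domain examples into the fine-tuning set. The name `mix_with_replay`, the `replay_fraction` parameter, and the example strings are hypothetical, and the 20% ratio is only a placeholder, not a recommendation.

```python
import random
from typing import List


def mix_with_replay(
    task_examples: List[str],
    original_examples: List[str],
    replay_fraction: float = 0.2,
    seed: int = 0,
) -> List[str]:
    """Return a shuffled training list in which roughly `replay_fraction`
    of the examples are drawn from the original domain."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")

    rng = random.Random(seed)
    # Solve replay / (task + replay) = replay_fraction for the replay count.
    n_replay = round(len(task_examples) * replay_fraction / (1.0 - replay_fraction))
    # Never ask for more original examples than are available.
    n_replay = min(n_replay, len(original_examples))

    mixed = list(task_examples) + rng.sample(original_examples, n_replay)
    rng.shuffle(mixed)
    return mixed


task = [f"support dialogue {i}" for i in range(8)]
original = [f"general text {i}" for i in range(5)]
mixed = mix_with_replay(task, original, replay_fraction=0.2)
print(len(mixed), sum(x.startswith("general") for x in mixed))  # 10 2
```

The resulting list can be turned into a training dataset in the same way as the fine-tuning data in the example below.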

Analogy

It's like a musician spending weeks perfecting a single jazz improvisation. They get really good at that one riff, but if they spend all their practice time on it, they might lose fluency in the classical pieces they used to play well. You need to occasionally play the classical pieces mid-training to notice if you've gotten rusty.

Code

Illustrative only. No API key is needed, but the GPT-2 weights are downloaded from the Hugging Face Hub on first run.

python

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

# Simulate original knowledge: model should answer simple questions
original_knowledge_eval = [
    "What is 2+2?",
    "What is the capital of France?",
    "How do you boil water?"
]

# Fine-tuning data: domain-specific task
fine_tune_data = [
    {"text": "Customer: I need help with my subscription. Agent: Let me check your account."},
    {"text": "Customer: How do I reset my password? Agent: Click Settings, then Account Recovery."},
    {"text": "Customer: What's your return policy? Agent: 30 days for most items."}
]

# Load a small model for demonstration
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 ships without a pad token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Prepare fine-tuning dataset
fine_tune_dataset = Dataset.from_dict({"text": [item["text"] for item in fine_tune_data]})

# LoRA config: base weights stay frozen, which limits (but does not eliminate) forgetting
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training config. Regression is checked manually below, so automatic
# evaluation is off; eval_strategy="steps" would require an eval_dataset.
training_args = SFTConfig(
    output_dir="./capability_regression_test",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    eval_strategy="no",
    logging_steps=1,
    save_steps=1,
    max_length=128  # named max_seq_length in older trl releases
)

# Trainer with LoRA (smaller rank helps preserve original knowledge)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=fine_tune_dataset,
    peft_config=lora_config,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

def show_responses(header, lm):
    """Greedy-decode each original-knowledge prompt and print the result."""
    print(header)
    lm.eval()
    with torch.no_grad():
        for prompt in original_knowledge_eval:
            # The trainer may have moved the model to a GPU, so move the inputs too
            inputs = tokenizer(
                prompt, return_tensors="pt", truncation=True, max_length=50
            ).to(lm.device)
            output_ids = lm.generate(
                **inputs,  # passes attention_mask along with input_ids
                max_length=60,
                num_beams=1,
                do_sample=False,  # greedy decoding keeps before/after comparable
                pad_token_id=tokenizer.eos_token_id
            )
            response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
            print(f"Prompt: {prompt}")
            print(f"Response: {response[:80]}...\n")

# Manual evaluation on original knowledge before fine-tuning
show_responses("=== BEFORE FINE-TUNING ===", model)

# Fine-tune
print("\n=== FINE-TUNING ===")
trainer.model.print_trainable_parameters()  # how many weights LoRA actually trains
trainer.train()

# Manual evaluation after fine-tuning (trainer.model includes the LoRA adapters)
show_responses("\n=== AFTER FINE-TUNING ===", trainer.model)

print("Check: Did the model's responses change significantly? That would indicate capability drift.")

Output

=== BEFORE FINE-TUNING ===
Prompt: What is 2+2?
Response: What is 2+2? I'm not sure if you're asking about a mathematical...

Prompt: What is the capital of France?
Response: What is the capital of France? The capital of France is Paris...

Prompt: How do you boil water?
Response: How do you boil water? Heat it in a pot until it reaches...

=== FINE-TUNING ===
trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364

=== AFTER FINE-TUNING ===
Prompt: What is 2+2?
Response: What is 2+2? Customer: I need help with my subscription...

Prompt: What is the capital of France?
Response: What is the capital of France? Agent: Let me check your account...

Prompt: How do you boil water?
Response: How do you boil water? Agent: Click Settings, then Account Recovery...

Check: Did the model's responses change significantly? That would indicate capability drift.

What just happened?

We loaded GPT-2, sampled its answers to a few general knowledge questions, then fine-tuned it on customer service examples using LoRA (low-rank adaptation) and asked the same questions again. The responses after fine-tuning shifted toward customer service language, showing the model is being pulled toward the new domain. LoRA freezes the base weights and trains only small adapter matrices, so a small rank (r=8) limits how far the model's behavior can move. With full fine-tuning, or with a higher learning rate, you would typically expect more regression on the original knowledge questions.
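Comparing generations by eye does not scale past a handful of prompts. One rough way to turn the comparison into a number is a keyword check, sketched below. The function `keyword_accuracy` and the expected keywords are assumptions made for this illustration, and the response strings are copied from the Output section above; a real eval set would be larger and use a stricter scoring rule.

```python
from typing import Dict, List


def keyword_accuracy(responses: Dict[str, str], expected: Dict[str, List[str]]) -> float:
    """Fraction of prompts whose response contains at least one expected keyword."""
    hits = 0
    for prompt, keywords in expected.items():
        answer = responses.get(prompt, "")
        # GPT-2 echoes the prompt, so strip it before searching for keywords.
        if answer.startswith(prompt):
            answer = answer[len(prompt):]
        answer = answer.lower()
        if any(keyword.lower() in answer for keyword in keywords):
            hits += 1
    return hits / len(expected)


# Hypothetical expected keywords for the three original-knowledge prompts.
expected = {
    "What is 2+2?": ["4", "four"],
    "What is the capital of France?": ["paris"],
    "How do you boil water?": ["heat", "boil"],
}

# Responses copied from the Output section above.
before = {
    "What is 2+2?": "What is 2+2? I'm not sure if you're asking about a mathematical...",
    "What is the capital of France?": "What is the capital of France? The capital of France is Paris...",
    "How do you boil water?": "How do you boil water? Heat it in a pot until it reaches...",
}
after = {
    "What is 2+2?": "What is 2+2? Customer: I need help with my subscription...",
    "What is the capital of France?": "What is the capital of France? Agent: Let me check your account...",
    "How do you boil water?": "How do you boil water? Agent: Click Settings, then Account Recovery...",
}

print(f"{keyword_accuracy(before, expected):.2f} -> {keyword_accuracy(after, expected):.2f}")
# 0.67 -> 0.00
```

A drop like this is the signal to act on; the absolute values matter less than the change between the two runs on the same prompts.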

Common gotcha

The biggest mistake is fine-tuning without evaluating on anything except your new task. You train your loss to near-zero on customer service dialogues, think you're done, then ship it to production only to discover it can no longer do basic reasoning or answer simple questions. You need a separate eval set from the original domain that you check during training, not after.
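Checking during training, rather than after, amounts to interleaving an original-domain evaluation with the training steps. The sketch below shows only that control flow, using stand-in functions in place of a real model: `train_step` and `eval_original_loss` are hypothetical callables you would supply, and the loss values they return here are invented.

```python
import math
from typing import Callable, List, Optional, Tuple

History = List[Tuple[int, Optional[float], float]]


def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


def train_with_periodic_checks(
    num_steps: int,
    eval_every: int,
    train_step: Callable[[], float],
    eval_original_loss: Callable[[], float],
) -> History:
    """Run `num_steps` training steps, evaluating on the original domain
    before training starts and then every `eval_every` steps.

    Returns (step, task_loss, original_domain_loss) tuples; the baseline
    entry at step 0 has no task loss yet.
    """
    history: History = [(0, None, eval_original_loss())]
    for step in range(1, num_steps + 1):
        task_loss = train_step()
        if step % eval_every == 0:
            history.append((step, task_loss, eval_original_loss()))
    return history


# Stand-ins for a real training step and a real held-out evaluation.
state = {"step": 0}

def fake_train_step() -> float:
    state["step"] += 1
    return 2.0 / state["step"]          # task loss shrinks over time

def fake_eval_original_loss() -> float:
    return 3.0 + 0.1 * state["step"]    # original-domain loss creeps up

for step, task_loss, original_loss in train_with_periodic_checks(
    6, 2, fake_train_step, fake_eval_original_loss
):
    print(step, task_loss, round(perplexity(original_loss), 1))
```

Recording the step-0 baseline is the part that is easiest to forget, and without it there is nothing to compare later checkpoints against.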

Error recovery

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

This happens when the trainer has moved the model to the GPU but the tokenized inputs are still on the CPU. Move the inputs to the model's device after tokenizing: `inputs = tokenizer(...).to(model.device)`

CUDA out of memory during fine-tuning

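Memory use during fine-tuning is driven mostly by model size, batch size, and sequence length; the LoRA rank contributes little. Reduce per_device_train_batch_size from 2 to 1 first, then shorten the maximum sequence length. Lowering r from 8 to 4 saves only a small amount of memory, although a smaller rank does further limit how far the model can drift.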
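To gauge how much changing r affects the number of trainable parameters, you can count them directly. The helper below is a sketch (the name `lora_trainable_params` is made up for this example); it assumes the setup used earlier, where LoRA adapts only GPT-2's `c_attn` projection, which maps 768 features to 2304 in each of the 12 transformer blocks.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, num_layers: int) -> int:
    """Each adapted layer adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return num_layers * rank * (d_in + d_out)


# GPT-2 small: c_attn maps 768 -> 2304 in each of 12 blocks.
for rank in (4, 8, 16):
    print(rank, lora_trainable_params(768, 2304, rank, 12))
# 4 147456
# 8 294912
# 16 589824
```

Going from r=8 to r=4 removes 147,456 trainable parameters, which is small next to GPT-2's roughly 124 million frozen weights, so it is worth checking batch size and sequence length as well.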

eval_strategy not working

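Setting eval_strategy="steps" does nothing useful without an eval_dataset, and recent transformers releases raise an error when the trainer is constructed without one. In this example, we're evaluating manually instead; for automatic eval, add `eval_dataset=eval_dataset_subset` to the SFTTrainer constructor, where eval_dataset_subset is a held-out dataset you build from the original domain.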

Experienced dev note

The hard truth: capability regression is invisible until you look for it. Many teams discover it weeks after deployment when users report the model is worse at general tasks. The fix isn't complicated; you just have to measure it deliberately. Set up your eval harness before fine-tuning starts, not after. Also, LoRA is not a magic bullet: it reduces regression, but a learning rate that's too high will still cause it. As a rough rule of thumb, treat anything above about 1e-4 with suspicion, though the right value depends on the model, rank, and batch size. Start conservative on learning rate and increase only if training plateaus.

Check your understanding

If you fine-tune a model on a new task and notice that your fine-tuning loss keeps dropping but your evaluation loss on a held-out test set from the original domain starts increasing, what has likely happened and what single parameter adjustment (other than stopping early) could help?

Show answer hint

The model is experiencing capability regression. The answer should mention that a lower learning rate or smaller LoRA rank (r) could help, because both limit how far the fine-tuned model moves away from the base model. Bonus insight: the divergence between train loss and original-domain eval loss is the symptom you're looking for.

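VERSION transformers exports `DataCollatorForLanguageModeling`; there is no `TextDataCollatorForLanguageModeling` class. The `eval_strategy` argument replaced `evaluation_strategy` in transformers 4.41, and SFTConfig inherits it from TrainingArguments. Recent trl releases also renamed SFTConfig's `max_seq_length` to `max_length`. Argument names in trl have changed between releases, so check the SFTConfig and SFTTrainer documentation for the versions you have installed and pin them once the code runs.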

Now that you can detect regression, learn how to reduce it by mixing original and new data in the same training batch. This is a rehearsal (or replay) technique from continual learning that helps preserve original capabilities while the model learns new ones.
