Code Beginner easy · 5 min

Continuous improvement loop

What you will learn

Evaluate your fine-tuned model, identify what's wrong, adjust training, and repeat until performance meets your needs.

Why this matters

Fine-tuning is rarely a one-shot operation. An improvement loop of measuring, diagnosing, and iterating is how you ship models that actually work on your data.

Skip if: You're fine-tuning only as a one-time academic exercise, or your model's performance is already sufficient for your use case. Do not iterate endlessly on tiny metric improvements: set a stopping criterion (e.g., 'eval loss plateaus for 3 epochs') to avoid diminishing returns.
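A stopping criterion like the one mentioned above can be written down as a small check. The sketch below is one possible reading of it, not a standard API: the function name should_stop and the patience and min_delta parameters are hypothetical, and the loss history is made up for illustration.

```python
def should_stop(eval_losses, patience=3, min_delta=0.01):
    """Return True when the best eval loss of the last `patience` evaluations
    is not at least `min_delta` better than the best loss seen before them."""
    if len(eval_losses) <= patience:
        return False  # not enough history to judge a plateau
    best_before = min(eval_losses[:-patience])
    best_recent = min(eval_losses[-patience:])
    return best_before - best_recent < min_delta


# Made-up eval losses, one per evaluation
plateaued = [3.48, 3.35, 3.349, 3.351, 3.348]
still_improving = [3.48, 3.35, 3.20, 3.10, 3.00]

print(should_stop(plateaued))
print(should_stop(still_improving))
```

Choosing min_delta is a judgment call: set it to the smallest improvement that would actually matter for your use case.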

Explanation

What it is: A continuous improvement loop is a cycle of train → evaluate → diagnose → adjust → repeat. After each training run, you measure how well your fine-tuned model performs on held-out data, identify failure patterns, change a hyperparameter or training strategy, and retrain.

How it works: You split your data into train and eval sets. After training with SFTTrainer, you compute metrics (loss, accuracy, or task-specific scores). You inspect the eval results: Are losses diverging? Is the model overfitting? Are certain types of examples failing? Then you adjust (maybe increase dropout, add more training data, reduce the learning rate, or change the number of epochs) and retrain. This cycle compounds: each iteration builds on lessons from the previous one.

When to use: Always, on any real fine-tuning project. Start with a baseline run, then iterate at least 2–3 times before declaring the model ready.
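The first step of the loop, splitting data into train and eval sets, can be sketched with the standard library alone. The helper name split_train_eval and its parameters are hypothetical; if your data is already a datasets.Dataset, its train_test_split method does the same job.

```python
import random


def split_train_eval(examples, eval_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and hold out a fraction for evaluation."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


examples = [f"example {i}" for i in range(10)]
train_split, eval_split = split_train_eval(examples)
print(len(train_split), len(eval_split))  # 8 2
```

Keeping the seed fixed matters for the loop: if the eval set changes between iterations, you can no longer compare eval losses across runs.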

Analogy

Like tuning a guitar: first you tighten the string (train), then check if it's in tune (evaluate), hear it's too sharp (diagnose), turn the tuning peg slightly (adjust hyperparameter), and check again. You don't get a perfect note on the first turn: you spiral toward it.

Code

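Illustrative only. No API key is needed, but running it requires the datasets, transformers, peft, and trl packages and downloads the gpt2 weights on first use. Some argument names differ between trl releases.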

python

import json
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Minimal dataset: 4 examples
train_texts = [
    "Classify: positive. The product is amazing and works great.",
    "Classify: negative. Terrible quality, broke after one day.",
    "Classify: positive. Excellent service and fast delivery.",
    "Classify: negative. Not worth the price."
]
eval_texts = [
    "Classify: positive. Best purchase I've made this year.",
    "Classify: negative. Waste of money."
]

train_dataset = Dataset.from_dict({"text": train_texts})
eval_dataset = Dataset.from_dict({"text": eval_texts})

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# The one hyperparameter changed between runs
learning_rates = {1: 2e-4, 2: 1e-4}
results_log = []

for iteration in range(1, 3):
    print(f"\n--- Iteration {iteration} ---")
    learning_rate = learning_rates[iteration]

    # Reload the base model so every run starts from the same weights;
    # reusing one model object would carry the previous run's adapter into the next.
    model = AutoModelForCausalLM.from_pretrained(model_name)

    sft_config = SFTConfig(
        output_dir=f"./output_iter_{iteration}",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        learning_rate=learning_rate,
        eval_strategy="epoch",
        save_strategy="no",
        logging_steps=1,
        report_to=[],
        max_seq_length=128,  # named max_length in newer trl releases
    )

    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=lora_config,
        tokenizer=tokenizer,  # named processing_class in newer trl releases
    )

    train_result = trainer.train()
    eval_result = trainer.evaluate()

    print(f"Train Loss: {train_result.training_loss:.4f}")
    print(f"Eval Loss: {eval_result['eval_loss']:.4f}")

    # Record what was tried, what was measured, and what to do next
    iteration_data = {
        "iteration": iteration,
        "learning_rate": learning_rate,
        "train_loss": round(train_result.training_loss, 4),
        "eval_loss": round(eval_result['eval_loss'], 4),
        "decision": "Lower LR for next iteration" if iteration == 1 else "Done"
    }
    results_log.append(iteration_data)
    print(f"Decision: {iteration_data['decision']}")

print("\n--- Improvement Loop Summary ---")
for record in results_log:
    print(f"Iteration {record['iteration']}: LR={record['learning_rate']}, Train Loss={record['train_loss']}, Eval Loss={record['eval_loss']}")

Output

--- Iteration 1 ---
Train Loss: 3.2156
Eval Loss: 3.4821
Decision: Lower LR for next iteration

--- Iteration 2 ---
Train Loss: 3.1654
Eval Loss: 3.3547
Decision: Done

--- Improvement Loop Summary ---
Iteration 1: LR=0.0002, Train Loss=3.2156, Eval Loss=3.4821
Iteration 2: LR=0.0001, Train Loss=3.1654, Eval Loss=3.3547

What just happened?

The code ran two training iterations. In iteration 1, it trained the model with learning rate 2e-4 and measured both training loss (3.2156) and eval loss (3.4821). Eval loss sitting above train loss is a possible sign of overfitting, although a small gap is normal and four training examples are far too few to draw firm conclusions. In iteration 2, the learning rate was lowered to 1e-4 in response. The model retrained, both losses improved slightly (train 3.1654, eval 3.3547), and the train/eval gap narrowed from about 0.27 to about 0.19. That is the feedback cycle: measure → diagnose → adjust → retrain. Your exact numbers will differ from run to run.

Common gotcha

Developers often train once, check eval metrics, then declare done. The real gotcha is that a single aggregate eval metric is not diagnostic: you must inspect *which examples* your model fails on. In iteration 1, eval loss is higher than train loss. That points toward reducing overfitting (lower LR, more dropout, more data); it does not by itself mean the model is bad. If you don't diagnose *why* eval loss is high, you'll make the wrong adjustment and waste iterations.
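Inspecting which examples fail can start as a simple grouping of mispredictions. The sketch below assumes you have already collected the model's predicted label for each eval example; the eval_records data and the summarize_failures helper are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter

# Hypothetical eval records: in practice "predicted" would come from running
# the fine-tuned model on each eval prompt and parsing its output.
eval_records = [
    {"text": "Best purchase I've made this year.", "expected": "positive", "predicted": "positive"},
    {"text": "Waste of money.", "expected": "negative", "predicted": "positive"},
    {"text": "Not bad at all.", "expected": "positive", "predicted": "negative"},
    {"text": "Broke after a week.", "expected": "negative", "predicted": "positive"},
]


def summarize_failures(records):
    """Return the failing records and a count of failures per expected label."""
    failures = [r for r in records if r["expected"] != r["predicted"]]
    by_expected = Counter(r["expected"] for r in failures)
    return failures, by_expected


failures, by_expected = summarize_failures(eval_records)
print(f"{len(failures)} of {len(eval_records)} eval examples failed")
for label, count in by_expected.most_common():
    print(f"  expected '{label}': {count} failure(s)")
for record in failures:
    print(f"  MISS: {record['text']!r} -> {record['predicted']}")
```

A skew like this one (most misses on negative reviews) suggests a different fix, such as adding more negative training examples, than a uniform spread of errors would.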

Error recovery

CUDA out of memory

The model is too large for your GPU, or the batch size is too high. Reduce per_device_train_batch_size (try 1 or 2), reduce max_seq_length, or use a smaller base model (e.g., gpt2 instead of gpt2-large).

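eval_dataset=None causes an error on evaluate()

You must pass eval_dataset to SFTTrainer if you want to call trainer.evaluate(); without one, recent transformers releases raise a ValueError saying that evaluation requires an eval_dataset. If you don't have eval data, either skip evaluate() or create a small held-out set.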

learning_rate too high causes NaN loss

If loss becomes NaN, the learning rate is usually too aggressive. Reduce it by 10x (2e-4 → 2e-5) and retrain.
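The NaN recovery rule can be folded into the loop as a small guard. This is a sketch under the assumption that you read the loss after each run and choose the next learning rate yourself; next_learning_rate is a hypothetical helper, not part of trl or transformers.

```python
import math


def next_learning_rate(loss, learning_rate, factor=10.0):
    """Return a reduced learning rate if the loss is NaN or infinite,
    otherwise return the learning rate unchanged."""
    if math.isnan(loss) or math.isinf(loss):
        return learning_rate / factor
    return learning_rate


# A diverged run: cut the learning rate by 10x before retraining
print(next_learning_rate(float("nan"), 2e-4))

# A healthy run: keep the current learning rate
print(next_learning_rate(3.2156, 2e-4))
```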

Experienced dev note

The real insight: track your iterations as structured data (JSON, CSV, or a simple list of dicts like results_log above). Senior devs version-control their improvement logs. This lets you spot patterns (e.g., 'learning rate 1e-3 always diverges') and avoid repeating the same mistake. Also: eval loss should *stabilize* after a few iterations. If it keeps dropping every time you tune against the same eval set, you may be overfitting your choices to that eval data: keep a separate test set that you never tune against, and confirm the final model on it.
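Persisting the log takes only a few lines of standard-library code. The sketch below writes the same records to JSON and CSV so they can be committed alongside the training script; the save_log helper, the file names, and the improvement_logs directory are hypothetical choices, and the loss values are copied from the example output above.

```python
import csv
import json
from pathlib import Path

# Values copied from the example output earlier in this lesson
results_log = [
    {"iteration": 1, "learning_rate": 2e-4, "train_loss": 3.2156, "eval_loss": 3.4821},
    {"iteration": 2, "learning_rate": 1e-4, "train_loss": 3.1654, "eval_loss": 3.3547},
]


def save_log(records, directory="."):
    """Write the iteration records to JSON and CSV files and return both paths."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / "improvement_log.json"
    csv_path = out_dir / "improvement_log.csv"
    json_path.write_text(json.dumps(records, indent=2))
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
    return json_path, csv_path


json_path, csv_path = save_log(results_log, "improvement_logs")

# With the log in one place, picking the best run is a one-liner
best = min(results_log, key=lambda r: r["eval_loss"])
print(f"Best run so far: iteration {best['iteration']} (eval loss {best['eval_loss']})")
```

JSON keeps nested settings intact if you later log more than one hyperparameter; CSV is easier to diff and to open in a spreadsheet.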

Check your understanding

In the code above, eval loss improved from 3.4821 (iteration 1) to 3.3547 (iteration 2) after lowering the learning rate from 2e-4 to 1e-4. Why might a lower learning rate have helped, and what would you check next if eval loss *stopped* improving in iteration 3?

Show answer hint

A correct answer should mention: (1) lower learning rate reduces overfitting or training instability, allowing the model to converge more smoothly; (2) if eval loss plateaus, you'd check whether train loss is still dropping (if yes, more data or regularization; if no, training has converged and further iteration is pointless) or inspect which eval examples are still mispredicted (to diagnose systematic failures).
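The plateau check described in the hint can be expressed as a small decision function. This is one possible encoding, assuming you keep one train loss and one eval loss per iteration; diagnose_plateau and its min_delta threshold are hypothetical, and the third-iteration losses are made up.

```python
def diagnose_plateau(train_losses, eval_losses, min_delta=0.01):
    """Suggest a next step by comparing the last two train and eval losses."""
    if len(train_losses) < 2 or len(eval_losses) < 2:
        return "need at least two iterations to compare"
    train_improving = train_losses[-2] - train_losses[-1] >= min_delta
    eval_improving = eval_losses[-2] - eval_losses[-1] >= min_delta
    if eval_improving:
        return "eval loss still improving: keep iterating"
    if train_improving:
        return "train improving, eval flat: likely overfitting; add data or regularization"
    return "both flat: training has converged; inspect failing eval examples"


# Iterations 1 and 2 use the example output; iteration 3 is invented
train_losses = [3.2156, 3.1654, 3.10]
eval_losses = [3.4821, 3.3547, 3.3540]
print(diagnose_plateau(train_losses, eval_losses))
```

The function only automates the comparison; deciding what counts as a meaningful improvement still depends on your task.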

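VERSION The eval_strategy argument comes from transformers' TrainingArguments, which SFTConfig extends. Older transformers releases (before 4.41) call it evaluation_strategy; eval_steps is a separate argument that sets the evaluation interval when the strategy is 'steps'. eval_strategy accepts 'no', 'steps', or 'epoch'. Newer trl releases also rename SFTTrainer's tokenizer argument to processing_class and SFTConfig's max_seq_length to max_length. Always check trainer.args after init to confirm the strategy was set correctly.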

Next, learn how to compute custom metrics (accuracy, F1, exact-match) during evaluation so you can diagnose failures beyond just loss numbers.
