Continuous improvement loop
Why this matters
Fine-tuning is never a one-shot operation. A real improvement loop (measure, diagnose, iterate) is the only reliable way to ship models that actually work on your data.
Explanation
What it is: A continuous improvement loop is a cycle of train → evaluate → diagnose → adjust → repeat. After each training run, you measure how well the fine-tuned model performs on held-out data, identify failure patterns, change a hyperparameter or training strategy, and retrain.
How it works: You split your data into train and eval sets. After training with SFTTrainer, you compute metrics (loss, accuracy, or task-specific scores). You inspect the eval results: Are losses diverging? Is the model overfitting? Are certain types of examples failing? Then you adjust (for example, increase dropout, add training data, reduce the learning rate, or change the number of epochs) and retrain. The cycle compounds: each iteration builds on lessons from the previous one.
When to use: Always, on any real fine-tuning project. Start with a baseline run, then iterate at least 2–3 times before declaring the model ready.
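The cycle described above can be sketched as a plain Python loop. This is a minimal sketch under stated assumptions, not a prescribed implementation: `run_experiment` is a hypothetical callback standing in for a real training run (for example one built on SFTTrainer), `fake_run` is a stand-in so the sketch executes without a model, and "halve the learning rate after a non-improving run" is only one illustrative adjustment policy.

```python
from typing import Callable, Dict, List


def improvement_loop(
    run_experiment: Callable[[float], Dict[str, float]],  # hypothetical: trains and evaluates at a given LR
    initial_lr: float = 2e-4,
    max_iterations: int = 3,
) -> List[Dict[str, float]]:
    """Train -> evaluate -> diagnose -> adjust, repeated max_iterations times."""
    log: List[Dict[str, float]] = []
    lr = initial_lr
    best_eval = float("inf")
    for iteration in range(1, max_iterations + 1):
        metrics = run_experiment(lr)  # train + evaluate
        log.append({"iteration": iteration, "learning_rate": lr, **metrics})
        # Diagnose: did eval loss beat the best run so far?
        if metrics["eval_loss"] < best_eval:
            best_eval = metrics["eval_loss"]
        else:
            # Adjust: illustrative policy only, halve the LR after a non-improving run.
            lr = lr / 2
    return log


def fake_run(lr: float) -> Dict[str, float]:
    """Stand-in for a real training run; returns made-up losses that depend on the LR."""
    return {"train_loss": 3.0 + lr * 100, "eval_loss": 3.2 + lr * 100}


if __name__ == "__main__":
    for row in improvement_loop(fake_run):
        print(row)
```

Keeping the adjustment rule in one place makes it easy to swap in a different policy (more dropout, more data, different epochs) without touching the logging.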
Analogy
Like tuning a guitar: first you tighten the string (train), then check if it's in tune (evaluate), hear it's too sharp (diagnose), turn the tuning peg slightly (adjust hyperparameter), and check again. You don't get a perfect note on the first turn; you spiral toward it.
Code
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Minimal dataset: 4 training examples, 2 eval examples
train_texts = [
    "Classify: positive. The product is amazing and works great.",
    "Classify: negative. Terrible quality, broke after one day.",
    "Classify: positive. Excellent service and fast delivery.",
    "Classify: negative. Not worth the price.",
]
eval_texts = [
    "Classify: positive. Best purchase I've made this year.",
    "Classify: negative. Waste of money.",
]
train_dataset = Dataset.from_dict({"text": train_texts})
eval_dataset = Dataset.from_dict({"text": eval_texts})

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

results_log = []
for iteration in range(1, 3):
    print(f"\n--- Iteration {iteration} ---")
    learning_rate = 2e-4 if iteration == 1 else 1e-4

    # Reload the base model each iteration so the runs are comparable and
    # LoRA adapters are not applied on top of an already-adapted model.
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Depending on your TRL version, `max_seq_length` may be named `max_length`
    # and the trainer's `tokenizer` argument may be named `processing_class`.
    sft_config = SFTConfig(
        output_dir=f"./output_iter_{iteration}",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        learning_rate=learning_rate,
        eval_strategy="epoch",
        save_strategy="no",
        logging_steps=1,
        report_to=[],
        max_seq_length=128,
    )
    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=lora_config,
        tokenizer=tokenizer,
    )

    train_result = trainer.train()
    eval_result = trainer.evaluate()
    print(f"Train Loss: {train_result.training_loss:.4f}")
    print(f"Eval Loss: {eval_result['eval_loss']:.4f}")

    # Record the run as structured data so iterations can be compared later
    iteration_data = {
        "iteration": iteration,
        "learning_rate": learning_rate,
        "train_loss": round(train_result.training_loss, 4),
        "eval_loss": round(eval_result["eval_loss"], 4),
        "decision": "Lower LR for next iteration" if iteration == 1 else "Done",
    }
    results_log.append(iteration_data)
    print(f"Decision: {iteration_data['decision']}")

print("\n--- Improvement Loop Summary ---")
for record in results_log:
    print(f"Iteration {record['iteration']}: LR={record['learning_rate']}, Train Loss={record['train_loss']}, Eval Loss={record['eval_loss']}")

Output (exact loss values will vary between runs):

--- Iteration 1 ---
Train Loss: 3.2156
Eval Loss: 3.4821
Decision: Lower LR for next iteration

--- Iteration 2 ---
Train Loss: 3.1654
Eval Loss: 3.3547
Decision: Done

--- Improvement Loop Summary ---
Iteration 1: LR=0.0002, Train Loss=3.2156, Eval Loss=3.4821
Iteration 2: LR=0.0001, Train Loss=3.1654, Eval Loss=3.3547
What just happened?
The code ran two training iterations. In iteration 1, it trained with learning rate 2e-4 and measured both training loss (3.2156) and eval loss (3.4821). Eval loss sitting above train loss is a possible sign of overfitting, although some gap is normal and with only a handful of examples it is mostly noise. In iteration 2, the learning rate was lowered to 1e-4 in response to that diagnosis. Both losses came out slightly lower (train 3.1654, eval 3.3547), which illustrates the feedback cycle: measure → diagnose → adjust → retrain.
Common gotcha
Developers often train once, check eval metrics, and declare the model done. The actual gotcha: a single aggregate eval metric is not diagnostic. You must inspect *which examples* your model fails on. In iteration 1, eval loss is higher than train loss. Treat that as a prompt to check for overfitting (and, if confirmed, to try a lower LR, more dropout, or more data), not as proof that the model is bad. If you don't diagnose *why* eval loss is high, you'll make the wrong adjustment and waste iterations.
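One way to see which examples fail is to tag each eval example with a category and compute an error rate per category. The sketch below assumes a hypothetical record layout (`category`, `expected`, `predicted`) and hand-written sample records; in a real project the `predicted` field would come from the model's generations on the eval set.

```python
from collections import Counter
from typing import Dict, List


def failure_breakdown(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Return the error rate per category, worst category first."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for rec in records:
        totals[rec["category"]] += 1
        if rec["predicted"] != rec["expected"]:
            failures[rec["category"]] += 1
    rates = {cat: failures[cat] / totals[cat] for cat in totals}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical eval records, written by hand for illustration.
eval_records = [
    {"category": "short_review", "expected": "negative", "predicted": "positive"},
    {"category": "short_review", "expected": "negative", "predicted": "negative"},
    {"category": "long_review", "expected": "positive", "predicted": "positive"},
    {"category": "long_review", "expected": "negative", "predicted": "negative"},
]

print(failure_breakdown(eval_records))
# {'short_review': 0.5, 'long_review': 0.0}
```

A breakdown like this points the next adjustment at a specific weakness (here, short reviews) instead of at the aggregate loss.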
Error recovery
CUDA out of memory
eval_dataset=None causes AttributeError
learning_rate too high causes NaN loss
Experienced dev note
The real insight: track your iterations as structured data (JSON, CSV, or a simple dict like results_log above). Senior devs version-control their improvement logs. This lets you spot patterns (e.g., 'learning rate 1e-3 always diverges') and avoid repeating the same mistake. Also: eval loss should *stabilize* after a few iterations. If it keeps dropping as you keep tuning against the same eval set, you are likely overfitting your choices to that eval data: hold out a separate test set that you never tune against, or use a proper validation split.
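A minimal sketch of such a log, assuming a JSON Lines file (one record per line) that can be committed alongside the code. The helper names and the `runs.jsonl` file name are hypothetical; the two sample records reuse the losses reported in the run above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def append_run(log_path: Path, record: Dict) -> None:
    """Append one iteration record as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(log_path: Path) -> List[Dict]:
    """Read every logged iteration back into a list of dicts."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(runs: List[Dict]) -> Dict:
    """Return the run with the lowest eval loss."""
    return min(runs, key=lambda r: r["eval_loss"])


# Demo in a throwaway directory; a real project would keep the file in the repo.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "runs.jsonl"  # hypothetical file name
    append_run(log_path, {"iteration": 1, "learning_rate": 2e-4, "eval_loss": 3.4821})
    append_run(log_path, {"iteration": 2, "learning_rate": 1e-4, "eval_loss": 3.3547})
    runs = load_runs(log_path)

print(best_run(runs)["iteration"])  # prints 2
```

Appending one line per run keeps diffs small under version control, and the file loads directly into analysis tools when the log grows.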
Check your understanding
In the code above, eval loss improved from 3.4821 (iteration 1) to 3.3547 (iteration 2) after lowering the learning rate from 2e-4 to 1e-4. Why might a lower learning rate have helped, and what would you check next if eval loss *stopped* improving in iteration 3?
Answer hint
A correct answer should mention: (1) lower learning rate reduces overfitting or training instability, allowing the model to converge more smoothly; (2) if eval loss plateaus, you'd check whether train loss is still dropping (if yes, more data or regularization; if no, training has converged and further iteration is pointless) or inspect which eval examples are still mispredicted (to diagnose systematic failures).
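The plateau check in point (2) can be sketched as a small helper that compares the last two iterations. This is an illustrative heuristic, not a standard rule: the function name and the `tol` threshold (the minimum drop in loss that counts as improvement) are assumptions chosen for the example.

```python
from typing import List


def diagnose_plateau(train_losses: List[float], eval_losses: List[float], tol: float = 0.01) -> str:
    """Compare the last two iterations and suggest what to look at next."""
    if len(train_losses) < 2 or len(eval_losses) < 2:
        return "need at least two iterations"
    eval_improved = eval_losses[-2] - eval_losses[-1] > tol
    train_improved = train_losses[-2] - train_losses[-1] > tol
    if eval_improved:
        return "still improving: keep iterating"
    if train_improved:
        return "train improving, eval flat: add data or regularization"
    return "both flat: converged, inspect remaining eval failures"


# Losses from the two iterations reported above.
print(diagnose_plateau([3.2156, 3.1654], [3.4821, 3.3547]))
# still improving: keep iterating
```

With a hypothetical third iteration where train loss keeps falling but eval loss stays at 3.35, the helper would return the "add data or regularization" branch instead.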