Checkpoint saving strategy: avoiding lost runs
Why this matters
Training large models takes hours or days. A single GPU OOM, network disconnect, or power failure loses everything unless you save checkpoints strategically. This is the difference between a run you can resume and a run you restart from zero.
Explanation
What it is: SFTTrainer saves model snapshots at intervals you control. Each checkpoint includes the full model state, optimizer state, and training progress. You can resume from any checkpoint or load the best-performing one.
How it works mechanically: The save_strategy parameter controls when checkpoints are written ('no', 'steps', or 'epoch'). With save_strategy='steps', save_steps sets the interval (e.g., save every 100 steps). Each checkpoint is a folder containing model weights, tokenizer, and metadata. The save_total_limit parameter deletes old checkpoints to prevent disk exhaustion, keeping only the most recent N.
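The rotation behaviour of save_total_limit can be pictured with a small simulation. This is a plain-Python sketch, not the trainer's actual implementation: `simulate_checkpoints` is a hypothetical helper, and it ignores any special cases the real trainer handles (such as an extra save at the end of training).

```python
def simulate_checkpoints(total_steps, save_steps, save_total_limit=None):
    """Return the checkpoint steps left on disk after a run of total_steps."""
    kept = []
    for step in range(1, total_steps + 1):
        if step % save_steps == 0:
            kept.append(step)  # a new checkpoint-<step> folder is written
            if save_total_limit is not None and len(kept) > save_total_limit:
                kept.pop(0)    # the oldest checkpoint is deleted
    return kept


print(simulate_checkpoints(total_steps=4, save_steps=2, save_total_limit=2))   # [2, 4]
print(simulate_checkpoints(total_steps=10, save_steps=2, save_total_limit=2))  # [8, 10]
```

With no limit set, every checkpoint stays on disk, which is how checkpoint folders grow without bound.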
When to use it: Always enable checkpoint saving in production runs. Set save_steps to 10–20% of total steps so you have recovery points without excessive disk I/O. For long training runs (> 2 hours), set save_total_limit=3 to keep disk usage bounded.
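To turn the percentage rule of thumb into a concrete number, you first need the total step count. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; `suggest_save_steps` is a hypothetical helper, not part of TRL.

```python
import math


def suggest_save_steps(num_examples, batch_size, num_epochs, percent=10):
    """Return (total_steps, save_steps) with save_steps about `percent`% of the run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # Integer arithmetic avoids float rounding surprises; never go below 1.
    save_steps = max(1, total_steps * percent // 100)
    return total_steps, save_steps


print(suggest_save_steps(num_examples=50_000, batch_size=16, num_epochs=3))  # (9375, 937)
```

For very short runs the helper falls back to saving every step, which is acceptable only because such runs finish quickly.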
Analogy
Think of checkpoints like version control commits for your model. Each commit (checkpoint) is a snapshot of the full project state. If your laptop crashes, you lose only the work done since the last commit, and you resume from there. `save_total_limit` is like a cleanup rule: keep only the last 3 commits to avoid filling your hard drive.
Code
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
import torch
import os
import tempfile
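# Load a small base model and its tokenizer (CPU is enough for this demo).
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained(
    'gpt2',
    torch_dtype=torch.float32,
    device_map='cpu'
)
# GPT-2 has no pad token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
# Three toy examples: at batch size 2 this gives 2 steps per epoch.
train_data = {
    'text': [
        'The future of AI is transformers transformers transformers',
        'Machine learning requires good data and patience patience patience',
        'Fine-tuning adapts models to specific tasks efficiently efficiently'
    ]
}
dataset = Dataset.from_dict(train_data)
# Train a small LoRA adapter instead of the full model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)
# Checkpoints are written under a throwaway temp directory.
temp_dir = tempfile.mkdtemp()
training_config = SFTConfig(
    output_dir=temp_dir,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    logging_steps=1,
    save_strategy='steps',  # save on a step interval
    save_steps=2,           # write a checkpoint every 2 steps
    save_total_limit=2,     # keep at most 2 checkpoints on disk
    learning_rate=5e-4,
    bf16=False,
    max_seq_length=128
)
# Note: argument names such as tokenizer, dataset_text_field, and max_seq_length
# depend on the installed TRL release; other releases may expect different names.
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    peft_config=lora_config,
    tokenizer=tokenizer,
    dataset_text_field='text'
)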
trainer.train()
print('Training completed.')
print(f'Checkpoint directory: {temp_dir}')
# List the checkpoint folders the trainer left behind and their contents.
checkpoint_dirs = [d for d in os.listdir(temp_dir) if d.startswith('checkpoint-')]
print(f'Saved checkpoints: {sorted(checkpoint_dirs)}')
for cp_dir in sorted(checkpoint_dirs):
    files = os.listdir(os.path.join(temp_dir, cp_dir))
    print(f'{cp_dir}: {sorted(files)}')
Output:
Training completed.
Checkpoint directory: /tmp/tmpXXXXXXXX
Saved checkpoints: ['checkpoint-2', 'checkpoint-4']
checkpoint-2: ['adapter_config.json', 'adapter_model.bin', 'training_args.bin']
checkpoint-4: ['adapter_config.json', 'adapter_model.bin', 'training_args.bin']
What just happened?
The trainer ran 2 epochs. With 3 examples and a batch size of 2, each epoch takes 2 steps, so the run had 4 steps in total. Because `save_steps=2`, it saved checkpoints at step 2 and step 4. Because `save_total_limit=2`, it keeps at most the 2 most recent checkpoints; only two were ever written, so nothing was deleted in this run. Had training continued to step 6, checkpoint-2 would have been removed to make room for checkpoint-6. Each checkpoint folder listed above contains the LoRA adapter weights (adapter_model.bin), the adapter config, and the saved training arguments.
Common gotcha
Developers often set `save_steps=1`, thinking more checkpoints means more safety. This slows training dramatically because every step pays the checkpoint I/O cost, and without `save_total_limit` it can fill your disk in hours. Also remember that `save_total_limit` removes old checkpoints silently, so if you're debugging and want to inspect checkpoint-1, it might already be deleted. Set `save_steps` to 50–100 for typical runs, not 1–5.
Error recovery
OutOfMemoryError during checkpoint save
OSError: [Errno 28] No space left on device
FileNotFoundError when resuming from checkpoint
Experienced dev note
Save checkpoints more frequently than you think you need, but not too frequently. The sweet spot is every 50–200 steps. Why? If training crashes at step 487, you'd rather resume from checkpoint-400 (87 steps lost) than restart entirely (487 steps lost). But saving every step can cost as much as 30–40% of training time in I/O. Also, always set `save_total_limit`, and set it low enough that you never depend on manual cleanup. If you set it to 10 and forget about it, you can blow past your disk quota. I've seen 500GB+ checkpoint folders that should have been 50GB.
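The lost-work arithmetic in this note is easy to check for your own settings. A minimal sketch, assuming checkpoints land exactly on multiples of save_steps and that the last one was written completely before the crash; `resume_point` is a hypothetical helper.

```python
def resume_point(crash_step, save_steps):
    """Return (last_checkpoint_step, steps_lost) for a crash at crash_step."""
    last_saved = (crash_step // save_steps) * save_steps
    return last_saved, crash_step - last_saved


print(resume_point(crash_step=487, save_steps=100))  # (400, 87)
print(resume_point(crash_step=487, save_steps=50))   # (450, 37)
```

In the worst case you lose just under save_steps steps, so the interval is effectively the amount of recomputation you are willing to accept.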
Check your understanding
Your fine-tuning run crashed at step 8,500 of 10,000 total steps. Checkpoints were written every 1,000 steps. You want to resume and finish training. (1) If `save_total_limit=3` was set, which checkpoints are still on disk, and which one should you resume from? (2) What single change would you make to the `trainer.train()` call to resume?
Show answer hint
A correct answer recognizes that `save_total_limit=3` keeps only the 3 most recent checkpoints, so checkpoint-6000, checkpoint-7000, and checkpoint-8000 remain and everything older has been deleted. Resume from checkpoint-8000, which costs 500 steps of repeated work. The resume line is `trainer.train(resume_from_checkpoint='path/to/checkpoint-8000')`; passing `resume_from_checkpoint=True` instead picks up the latest checkpoint in `output_dir`.
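If you prefer to pass an explicit path when resuming, you need to find the newest checkpoint folder. The sketch below uses a hypothetical helper, `latest_checkpoint`, and assumes the `checkpoint-<step>` folder naming seen earlier. It compares step numbers as integers, because a plain string sort would rank checkpoint-900 after checkpoint-8000. The transformers library ships a similar utility, `get_last_checkpoint` in `transformers.trainer_utils`, which may be the better choice in real code.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N folder, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Usage (assumes `trainer` and `training_config` exist as in the Code section):
# path = latest_checkpoint(training_config.output_dir)
# trainer.train(resume_from_checkpoint=path)
```

If the helper returns None, no checkpoint was written and the run has to start from the beginning.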