Code Intermediate medium · 6 min

Checkpoint saving strategy: avoiding lost runs

What you will learn
Configure checkpoint saving in SFTTrainer to persist your model state at regular intervals so training crashes don't erase hours of work.

Why this matters

Training large models takes hours or days. A single GPU OOM, network disconnect, or power failure loses everything unless you save checkpoints strategically. This is the difference between a run you can resume and a run you restart from zero.

Skip if: You can skip aggressive checkpoint saving only when (1) you are fine-tuning a small model on a laptop for < 30 minutes, (2) disk space is severely constrained (< 5GB free) and you're using quantization, or (3) you're actively debugging hyperparameters in short throwaway runs and don't need recovery. Even then, save at least the final model.

Explanation

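What it is: SFTTrainer saves model snapshots at intervals you control. Each checkpoint includes the model weights (only the adapter weights when you train with PEFT/LoRA, as in the example below), the optimizer state, and the training progress. You can resume from any checkpoint that is still on disk, or load the best-performing one if you also run evaluation.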

How it works mechanically: The save_strategy parameter controls when checkpoints are written. When save_strategy is 'steps', save_steps determines the interval (e.g., save every 100 steps). Each checkpoint is a folder named checkpoint-<step> containing model weights, tokenizer, and metadata. The save_total_limit parameter prevents disk exhaustion by deleting older checkpoints and keeping only the most recent N.
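The rotation behaviour described above can be sketched in plain Python. This is a simplified model, not the Trainer's actual implementation: `simulate_checkpoints` is a hypothetical helper that assumes a checkpoint is written every `save_steps` steps and that the oldest folders are pruned as soon as more than `save_total_limit` exist. Some library versions may also write a checkpoint at the final step, which this sketch ignores.

```python
def simulate_checkpoints(total_steps, save_steps, save_total_limit=None):
    """Return the checkpoint folder names left on disk after training.

    Simplified model (assumption): save every `save_steps` steps, then
    drop the oldest checkpoints beyond `save_total_limit`.
    """
    on_disk = []
    for step in range(save_steps, total_steps + 1, save_steps):
        on_disk.append(f"checkpoint-{step}")
        if save_total_limit is not None and len(on_disk) > save_total_limit:
            # Keep only the newest `save_total_limit` entries.
            on_disk = on_disk[-save_total_limit:]
    return on_disk


print(simulate_checkpoints(total_steps=1000, save_steps=100, save_total_limit=3))
# prints ['checkpoint-800', 'checkpoint-900', 'checkpoint-1000']
```

With no limit set, every checkpoint stays on disk, which is why long runs without `save_total_limit` tend to fill the drive.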

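When to use it: Always enable checkpoint saving in production runs. As a starting point, set save_steps to 10–20% of total steps so you have recovery points without excessive disk I/O; on very long runs, shrink the interval so that a crash never costs more wall-clock time than you are willing to redo. For long training runs (> 2 hours), set save_total_limit=3 to keep disk usage bounded.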
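The 10–20% rule of thumb is easy to turn into a number. The helper below is a minimal sketch; `pick_save_steps` is a hypothetical name, and the fraction is a starting point to tune, not a library default.

```python
def pick_save_steps(total_steps, fraction=0.1, minimum=1):
    """Checkpoint interval as a fraction of the run (hypothetical helper)."""
    # Never return 0: a run shorter than 1/fraction steps still saves every step.
    return max(minimum, round(total_steps * fraction))


total_steps = 10_000
print(pick_save_steps(total_steps, 0.10))  # prints 1000
print(pick_save_steps(total_steps, 0.20))  # prints 2000
```

The result can be passed as `save_steps` alongside `save_strategy='steps'`.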

Analogy

Think of checkpoints like version control commits for your model. Each commit (checkpoint) is a snapshot of the full project state. If your laptop crashes, you lose only the work done since the last commit, and you resume from there. `save_total_limit` is like a cleanup rule: keep only the last 3 commits to avoid filling your hard drive.

Code

python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
import torch
import os
import tempfile

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained(
    'gpt2',
    torch_dtype=torch.float32,
    device_map='cpu'
)
tokenizer.pad_token = tokenizer.eos_token

train_data = {
    'text': [
        'The future of AI is transformers transformers transformers',
        'Machine learning requires good data and patience patience patience',
        'Fine-tuning adapts models to specific tasks efficiently efficiently'
    ]
}
dataset = Dataset.from_dict(train_data)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM'
)

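# Demo only: a temp directory is fine here, but use a persistent path for real runs.
temp_dir = tempfile.mkdtemp()

training_config = SFTConfig(
    output_dir=temp_dir,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    logging_steps=1,
    save_strategy='steps',  # save on a step interval rather than once per epoch
    save_steps=2,           # write a checkpoint every 2 optimizer steps
    save_total_limit=2,     # keep at most the 2 newest checkpoints
    learning_rate=5e-4,
    bf16=False,
    max_seq_length=128      # may be named max_length in newer trl releases
)

# Argument names follow the trl release this lesson was written against.
# Newer releases take processing_class= instead of tokenizer= and expect
# dataset_text_field on SFTConfig rather than on SFTTrainer.
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset,
    peft_config=lora_config,
    tokenizer=tokenizer,
    dataset_text_field='text'
)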

trainer.train()

print('Training completed.')
print(f'Checkpoint directory: {temp_dir}')
checkpoint_dirs = [d for d in os.listdir(temp_dir) if d.startswith('checkpoint-')]
print(f'Saved checkpoints: {sorted(checkpoint_dirs)}')
for cp_dir in sorted(checkpoint_dirs):
    files = os.listdir(os.path.join(temp_dir, cp_dir))
    print(f'{cp_dir}: {sorted(files)}')
Output
Training completed.
Checkpoint directory: /tmp/tmpXXXXXXXX
Saved checkpoints: ['checkpoint-2', 'checkpoint-4']
checkpoint-2: ['adapter_config.json', 'adapter_model.bin', 'training_args.bin']
checkpoint-4: ['adapter_config.json', 'adapter_model.bin', 'training_args.bin']

What just happened?

The trainer ran 2 epochs of 2 steps each (3 examples at batch size 2 make 2 batches per epoch), for 4 steps in total. Because `save_steps=2`, it saved checkpoints at step 2 and step 4. Because `save_total_limit=2`, both fit within the limit and nothing was deleted in this run; a third checkpoint would have pushed out checkpoint-2. Each checkpoint folder contains the LoRA adapter weights (adapter_model.bin), config, and training state. The exact file list depends on your library versions: recent peft releases write adapter_model.safetensors, and checkpoints usually also hold optimizer.pt, scheduler.pt, and trainer_state.json, which resuming relies on.
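The step count for this run can be checked with a few lines of arithmetic. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not a trl function.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device with no gradient accumulation (assumption)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs


steps = total_optimizer_steps(num_examples=3, batch_size=2, epochs=2)
print(steps)                          # prints 4
print(list(range(2, steps + 1, 2)))   # steps at which save_steps=2 triggers: [2, 4]
```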

Common gotcha

Developers often set `save_steps=1` thinking more checkpoints = safer. Without `save_total_limit` this fills your disk in hours, and even with a limit it slows training dramatically because every step waits on I/O. They also forget that `save_total_limit` removes old checkpoints silently, so if you're debugging and want to inspect an early checkpoint, it might already be deleted. Set `save_steps` to 50–100 for typical runs, not 1–5 (the tiny value in the example above only suits a 4-step demo).
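A rough back-of-the-envelope model shows why very small intervals hurt. The numbers below are made up for illustration, and `checkpoint_overhead` is a hypothetical helper that assumes each save blocks training for a fixed time.

```python
def checkpoint_overhead(save_steps, step_seconds, save_seconds):
    """Fraction of wall-clock time spent writing checkpoints.

    Simplified model (assumption): training pauses for `save_seconds`
    once every `save_steps` steps.
    """
    train_time = save_steps * step_seconds
    return save_seconds / (train_time + save_seconds)


for interval in (1, 10, 100):
    share = checkpoint_overhead(interval, step_seconds=1.0, save_seconds=5.0)
    print(f"save_steps={interval}: {share:.1%} of time spent saving")
# prints:
# save_steps=1: 83.3% of time spent saving
# save_steps=10: 33.3% of time spent saving
# save_steps=100: 4.8% of time spent saving
```

Measure your own step time and save time before trusting any specific percentage; the shape of the curve is the point.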

Error recovery

OutOfMemoryError during checkpoint save
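Checkpoint saving writes to disk synchronously. Saving more often does not raise peak memory, so increasing `save_steps` only makes the failure rarer. A save can briefly need extra memory while state is gathered and serialized, which tips over a run that was already near its limit. Fix: create headroom by reducing `per_device_train_batch_size` (raise `gradient_accumulation_steps` to keep the effective batch size), then resume from the last good checkpoint.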
OSError: [Errno 28] No space left on device
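Each checkpoint is often ~200MB–2GB depending on model size, and larger for a fully fine-tuned model with optimizer state. `save_total_limit` only prunes `checkpoint-*` folders inside the current `output_dir`, and only when a new checkpoint is written. Fix: delete older checkpoints but keep the newest one (for example `rm -rf {output_dir}/checkpoint-100`), then resume training. Do not run `rm -rf {output_dir}/checkpoint-*`, because that also removes the checkpoint you need to resume from. Or set `save_total_limit=1` to keep only the latest.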
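Manual cleanup is safer when it is scripted. The sketch below deletes everything except the newest checkpoints; `prune_checkpoints` is a hypothetical helper, and it assumes the usual `checkpoint-<step>` folder naming. It sorts by step number, so `checkpoint-1000` counts as newer than `checkpoint-200`.

```python
import re
import shutil
from pathlib import Path


def prune_checkpoints(output_dir, keep=1):
    """Delete all but the `keep` highest-step checkpoint-<step> folders.

    Returns the names of the folders that were removed, oldest first.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1 so a resume point survives")
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    found.sort()  # numeric order, oldest step first
    removed = []
    for _, path in found[:-keep]:
        shutil.rmtree(path)
        removed.append(path.name)
    return removed


# Usage (path is a placeholder):
# prune_checkpoints("path/to/output_dir", keep=1)
```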
FileNotFoundError when resuming from checkpoint
If you move or delete the checkpoint directory after training starts, the trainer can't find it. Fix: pass an absolute path to `output_dir` and don't delete it until training is completely done. Resume explicitly with `trainer.train(resume_from_checkpoint='path/to/checkpoint-100')`, or pass `resume_from_checkpoint=True` to pick up the latest checkpoint in `output_dir`.
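To avoid hard-coding a checkpoint path, you can look up the newest one before resuming. The sketch below uses only the standard library; `latest_checkpoint` is a hypothetical helper that assumes `checkpoint-<step>` folder names. transformers ships a similar utility, `get_last_checkpoint` in `transformers.trainer_utils`, which you may prefer in real code.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the absolute path of the highest-step checkpoint folder, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # max() compares the step numbers first, so checkpoint-10 beats checkpoint-2.
    return str(max(candidates)[1].resolve())


# Usage with the trainer from the example above:
# ckpt = latest_checkpoint(training_config.output_dir)
# trainer.train(resume_from_checkpoint=ckpt) if ckpt else trainer.train()
```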

Experienced dev note

Save checkpoints more frequently than you think you need, but not too frequently. The sweet spot is often every 50–200 steps. Why? If training crashes at step 487 with `save_steps=100`, you resume from checkpoint-400 and lose 87 steps instead of restarting and losing all 487. Saving every step, on the other hand, can waste 30–40% of training time on I/O. Also: always set `save_total_limit` lower than your patience for manual cleanup. If you set it to 10 and forget about it, you'll blow past your disk quota. I've seen 500GB+ checkpoint folders that should have been 50GB.
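The trade-off in the crash example is simple integer arithmetic. The sketch below is illustrative only; `resume_point` is a hypothetical helper that assumes checkpoints land exactly on multiples of `save_steps`.

```python
def resume_point(crash_step, save_steps):
    """Return (last_saved_step, steps_lost) for a crash at `crash_step`."""
    last_saved = (crash_step // save_steps) * save_steps
    return last_saved, crash_step - last_saved


print(resume_point(487, 100))   # prints (400, 87)
print(resume_point(487, 50))    # prints (450, 37)
print(resume_point(487, 1000))  # prints (0, 487): no checkpoint yet, full restart
```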

Check your understanding

Your fine-tuning run crashed at step 8,500 of 10,000 total steps. You saved a checkpoint every 1,000 steps with `save_total_limit=3`. You want to resume and finish training. (1) Which checkpoints are still on disk, and which one should you resume from? (2) What single argument would you pass to `trainer.train()` to resume?

Show answer hint

A correct answer recognizes that `save_total_limit=3` keeps only the 3 most recent checkpoints, so checkpoint-6000, checkpoint-7000, and checkpoint-8000 remain and everything older was deleted. Resume from checkpoint-8000, which costs 500 steps of repeated work. The resume line is `trainer.train(resume_from_checkpoint='path/to/checkpoint-8000')`.

VERSION `SFTConfig` subclasses `transformers.TrainingArguments`, so `save_strategy`, `save_steps`, and `save_total_limit` come from transformers and behave the same way in plain `Trainer` runs. Older TRL releases that predate `SFTConfig` took a `TrainingArguments` object with these same parameters. `save_strategy` accepts 'no', 'steps', or 'epoch'. Prefer `save_strategy='steps'` for LLM fine-tuning: an epoch over a large dataset can take hours, which leaves long gaps between recovery points.

NEXT

Next, learn how to evaluate your checkpoints in-flight using validation datasets and early stopping, so you automatically select the best-performing model instead of guessing which checkpoint to use.
