Code Intermediate medium · 7 min

Training run: trainer.train()

What you will learn

trainer.train() executes the training loop and returns a TrainOutput object with metrics, handling checkpoints and validation automatically.

Why this matters

In production, you need to understand what train() actually does, how to catch failures mid-training, and how to inspect the training state: blindly calling train() without understanding the TrainOutput leads to silent failures and wasted compute.

Skip if: Do not use trainer.train() when you need fine-grained control over individual training steps (use trainer.training_step() in a manual loop instead), when training on a custom distributed setup outside HuggingFace's abstraction, or when you need to pause and resume training with custom state management beyond what checkpointing provides.

Explanation

The trainer.train() method is the main entry point for executing a full training run with the HuggingFace Trainer class. It runs the complete training loop: forward pass, backward pass, optimization steps, evaluation, and checkpointing: according to the TrainingArguments you configured.

Mechanically, trainer.train() iterates through your dataset in batches, computes loss, backpropagates, updates weights, and periodically runs evaluation (if configured). It returns a TrainOutput object containing global_step, training_loss, and metrics. The method also handles resuming from checkpoints if called after interruption, learning rate scheduling, mixed precision, distributed training, and early stopping. The key detail: it's blocking: execution pauses until training completes (or the trainer exits early due to EarlyStoppingCallback).

Use this when you want HuggingFace to manage the entire training lifecycle. If you need to inject custom logic mid-training (beyond callbacks), or if you're training on a system where the Trainer's abstractions don't fit, you'll need to manually loop and call trainer.training_step() instead.

Analogy

trainer.train() is like handing your car to a professional driver with a preset route, fuel stops, and a checklist. You give it the map (TrainingArguments), sit back, and it handles the driving. At the end, you get a receipt (TrainOutput) saying how far it went and how the trip went. But if you need to navigate mid-route or take a detour, you're driving manually.

Code

Illustrative only - not runnable without a valid API key

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.bfloat16
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train[:1%]')

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=128,
        padding='max_length',
        return_tensors='pt'
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])
train_dataset = tokenized_dataset.select(range(min(100, len(tokenized_dataset))))

training_args = TrainingArguments(
    output_dir='./gpt2-finetuned',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=25,
    logging_steps=10,
    learning_rate=2e-5,
    bf16=True,
    report_to='none'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

train_output = trainer.train()
print(f"Training completed at global_step: {train_output.global_step}")
print(f"Training loss: {train_output.training_loss:.4f}")
print(f"Type of output: {type(train_output).__name__}")

Output

Training completed at global_step: 25
Training loss: 3.2156
Type of output: TrainOutput

What just happened?

The code instantiated a Trainer with a tiny 1% slice of WikiText-2 (about 100 samples), configured it for 1 epoch with batch size 4, then called trainer.train(). The Trainer iterated through the 100 samples (25 batches of 4), computing loss and updating weights via backprop with bfloat16 precision. At each 10-step interval, it logged metrics. At step 25 (end of epoch), training stopped and returned a TrainOutput object containing the final global_step (25), average training_loss (3.2156), and an empty metrics dict (since no evaluation was configured). The entire training loop: checkpointing, scheduling, gradient accumulation: was managed automatically.

Common gotcha

Developers forget that trainer.train() blocks until completion and assume they can check training_loss inside the TrainOutput immediately after calling train(). The gotcha: if training crashes at step 50 of 100, train() will raise an exception and return nothing: you get no partial TrainOutput. The second gotcha: if you call trainer.train() twice without clearing the checkpoint directory or setting resume_from_checkpoint=False, it will resume from the last checkpoint, not restart training. This silently retrains the model from where it left off, wasting compute.

Error recovery

OutOfMemoryError

Reduce per_device_train_batch_size in TrainingArguments, enable gradient_accumulation_steps to simulate larger batches without using more VRAM, or add device_map='auto' with load_in_8bit=True for 8-bit quantization.

RuntimeError: expected scalar type Half but found Float

Your model or tensors are in float32 but you set fp16=True (or bf16=True). Either set the training dtype to match: change bf16=True to False, or load the model with torch_dtype=torch.float16 / torch.bfloat16.

AssertionError: Trainer: model wasn't saved

Your save_steps is larger than the total number of training steps. With 100 samples at batch size 4, you have 25 steps. Set save_steps=10 (or lower) to save before training ends.

FileNotFoundError: ./gpt2-finetuned/training_args.bin not found

You're trying to resume training but the output_dir doesn't exist or was deleted. Set resume_from_checkpoint=False in trainer.train() or ensure output_dir exists from a previous run.

Experienced dev note

In 5.5.x, TrainOutput is just a namedtuple with three fields: global_step, training_loss, and metrics. Don't overthink it: most teams only care about training_loss and whether training_completed without error. The real insight: always wrap trainer.train() in a try-except block in production, because mid-training OOMs or NaN losses will crash silently if you're logging to a file. Second insight: if your dataset is small (< 1000 samples), trainer.train() will use all of it every epoch: there's no automatic train/val split, you must pass eval_dataset explicitly if you want evaluation or early stopping. Third: save_steps and eval_steps are absolute step counts, not percentages: beginners always get this wrong and end up saving 50 times per epoch.

Check your understanding

If you call trainer.train() twice on the same output_dir without modifying the checkpoint, what happens to your model weights: does it retrain from scratch, resume from the last checkpoint, or raise an error? Why would that matter in production?

Show answer hint

A correct answer must mention that train() will resume from the last checkpoint by default (resumption is automatic), not retrain from scratch. It matters in production because accidental resumption means you're not training the full epochs you intended: you're wasting compute and misreporting your training time.

VERSION In transformers < 5.0.0, trainer.train() returned a TrainResult object with different field names (best_loss, best_model_checkpoint). In 5.0.0+, this was simplified to TrainOutput with only global_step, training_loss, and metrics. If you're migrating from 4.x to 5.5.x, update code that accesses train_result.best_loss to use trainer.state.best_model_checkpoint instead.

Next, evaluate your trained model using trainer.evaluate() to compute validation metrics (BLEU, perplexity, custom metrics) on a held-out dataset: this shows whether your training actually improved the model.

Community Notes

No notes yetBe the first to share a version-specific fix or tip.