How to optimize training speed with Hugging Face Transformers
Quick answer
To optimize training speed with
Hugging Face Transformers, use mixed precision training with accelerate or the Trainer API, enable gradient accumulation to simulate larger batch sizes, and leverage efficient data loading with datasets and caching. Additionally, use distributed training and adjust learning rate schedules for faster convergence.
Prerequisites
- Python 3.8+
- pip install transformers datasets accelerate
- Access to GPU(s) for hardware acceleration
Setup
Install the necessary libraries to train Hugging Face models efficiently. Use transformers for model and training utilities, datasets for efficient data handling, and accelerate for optimized hardware usage.
pip install transformers datasets accelerate
Step by step
This example demonstrates optimizing training speed by enabling mixed precision, gradient accumulation, and efficient data loading using the Hugging Face Trainer API.
import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
# Load dataset
raw_datasets = load_dataset('glue', 'mrpc')
# Load tokenizer and tokenize dataset
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
def preprocess_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length')
encoded_datasets = raw_datasets.map(preprocess_function, batched=True)
# Load model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Define training arguments with optimization
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    save_strategy='epoch',  # Must match evaluation_strategy when load_best_model_at_end=True
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,  # Accumulate gradients to simulate an effective batch size of 32
    fp16=True,  # Enable mixed precision training
    save_total_limit=1,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True
)
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_datasets['train'],
    eval_dataset=encoded_datasets['validation'],
    tokenizer=tokenizer
)
# Train model
trainer.train()
Output
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Gradient Accumulation steps = 2
  Total optimization steps = 345
[INFO|trainer.py:train] ***** Training completed *****
[INFO|trainer.py:train] Best model saved to ./results
Common variations
- Use the accelerate CLI to launch distributed training across multiple GPUs or machines.
- Switch to Trainer with deepspeed integration for very large models.
- Adjust learning rate schedules, such as cosine decay or warmup, for faster convergence.
- Use DataLoader with num_workers > 0 for parallel data loading.
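For the deepspeed variation, the Trainer accepts a config file (or dict) via TrainingArguments(deepspeed='ds_config.json'). A minimal sketch assuming ZeRO stage 2; the "auto" values let Trainer fill them in from the corresponding TrainingArguments fields:

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```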
A manual training loop with accelerate (this fragment assumes model, optimizer, train_dataloader, and num_epochs are already defined):
from accelerate import Accelerator
# Recent accelerate versions take mixed_precision instead of the removed fp16 flag
accelerator = Accelerator(mixed_precision='fp16')
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
Troubleshooting
- If you encounter out-of-memory errors, reduce per_device_train_batch_size or increase gradient_accumulation_steps.
- If mixed precision causes instability, disable fp16 or try bf16 if supported by your hardware.
- Slow data loading? Increase num_workers in your DataLoader or cache datasets locally.
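The parallel data loading fix above can be sketched as follows, assuming PyTorch is installed; the tensor dataset here is a toy stand-in for a tokenized corpus:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a tokenized corpus
dataset = TensorDataset(torch.arange(100).unsqueeze(1))

# num_workers > 0 loads batches in background worker processes;
# pin_memory=True speeds up host-to-GPU transfers
loader = DataLoader(dataset, batch_size=16, num_workers=2, pin_memory=True)

num_batches = sum(1 for _ in loader)
print(num_batches)  # 100 examples / batch size 16 -> 7 batches
```

When training through Trainer instead of a manual loop, the equivalent knob is the dataloader_num_workers field of TrainingArguments.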
Key Takeaways
- Enable mixed precision training with fp16=True to speed up GPU utilization.
- Use gradient accumulation to simulate larger batch sizes without extra memory.
- Leverage efficient data loading and caching with the datasets library.
- Distributed training with accelerate scales training speed across GPUs.
- Adjust learning rate schedules and batch sizes to balance speed and convergence.
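The learning-rate-schedule takeaway can be sketched with transformers' built-in cosine schedule with linear warmup; the dummy parameter, base learning rate, and step counts below are illustrative:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# A dummy parameter so we can build an optimizer to schedule
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=5e-5)

# Linear warmup for 10 steps, then cosine decay toward 0 over the remaining 90
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=10, num_training_steps=100
)

for step in range(10):
    optimizer.step()
    scheduler.step()

# After warmup, the learning rate has climbed back up to its base value
print(scheduler.get_last_lr()[0])  # 5e-05
```

With Trainer, the same effect comes from setting lr_scheduler_type and warmup_steps (or warmup_ratio) in TrainingArguments.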