How to optimize training speed with Hugging Face Transformers
Quick answer
To optimize training speed with
Hugging Face Transformers, use mixed precision training with accelerate or the Trainer API, enable gradient accumulation to simulate larger batch sizes, and leverage efficient data loading with datasets and caching. Additionally, use distributed training and adjust learning rate schedules for faster convergence.
Prerequisites
- Python 3.8+
- pip install transformers datasets accelerate
- Access to GPU(s) for hardware acceleration
Setup
Install the necessary libraries to train Hugging Face models efficiently. Use transformers for model and training utilities, datasets for efficient data handling, and accelerate for optimized hardware usage.
pip install transformers datasets accelerate
Step by step
This example demonstrates optimizing training speed by enabling mixed precision, gradient accumulation, and efficient data loading using the Hugging Face Trainer API.
import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
# Load dataset
raw_datasets = load_dataset('glue', 'mrpc')
# Load tokenizer and tokenize dataset
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
def preprocess_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length')
encoded_datasets = raw_datasets.map(preprocess_function, batched=True)
# Load model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Define training arguments with optimization
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    save_strategy='epoch',  # Must match evaluation_strategy when load_best_model_at_end=True
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,  # Accumulate gradients to simulate an effective batch size of 32
    fp16=True,  # Enable mixed precision training
    save_total_limit=1,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True
)
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_datasets['train'],
    eval_dataset=encoded_datasets['validation'],
    tokenizer=tokenizer
)
# Train model
trainer.train()
Output
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Gradient Accumulation steps = 2
  Total optimization steps = 345
[INFO|trainer.py:train] ***** Training completed *****
[INFO|trainer.py:train] Best model saved to ./results
Common variations
- Use the accelerate CLI to launch distributed training across multiple GPUs or machines.
- Switch to Trainer with deepspeed integration for very large models.
- Adjust learning rate schedules, such as cosine decay or warmup, for faster convergence.
- Use DataLoader with num_workers > 0 for parallel data loading.
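For the deepspeed variation, the Trainer accepts a config file (or dict) via TrainingArguments(deepspeed='ds_config.json'). A minimal sketch assuming ZeRO stage 2; the "auto" values let Trainer fill them in from the corresponding TrainingArguments fields:

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```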
A manual training loop with accelerate (this fragment assumes model, optimizer, train_dataloader, and num_epochs are already defined):
from accelerate import Accelerator
# Recent accelerate versions take mixed_precision instead of the removed fp16 flag
accelerator = Accelerator(mixed_precision='fp16')
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
Troubleshooting
- If you encounter out-of-memory errors, reduce per_device_train_batch_size or increase gradient_accumulation_steps.
- If mixed precision causes instability, disable fp16 or try bf16 if supported by your hardware.
- Slow data loading? Increase num_workers in your DataLoader or cache datasets locally.
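The parallel data loading fix above can be sketched as follows, assuming PyTorch is installed; the tensor dataset here is a toy stand-in for a tokenized corpus:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a tokenized corpus
dataset = TensorDataset(torch.arange(100).unsqueeze(1))

# num_workers > 0 loads batches in background worker processes;
# pin_memory=True speeds up host-to-GPU transfers
loader = DataLoader(dataset, batch_size=16, num_workers=2, pin_memory=True)

num_batches = sum(1 for _ in loader)
print(num_batches)  # 100 examples / batch size 16 -> 7 batches
```

When training through Trainer instead of a manual loop, the equivalent knob is the dataloader_num_workers field of TrainingArguments.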
Key Takeaways
- Enable mixed precision training with fp16=True to speed up GPU utilization.
- Use gradient accumulation to simulate larger batch sizes without extra memory.
- Leverage efficient data loading and caching with the datasets library.
- Distributed training with accelerate scales training speed across GPUs.
- Adjust learning rate schedules and batch sizes to balance speed and convergence.
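The learning-rate-schedule takeaway can be sketched with transformers' built-in cosine schedule with linear warmup; the dummy parameter, base learning rate, and step counts below are illustrative:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# A dummy parameter so we can build an optimizer to schedule
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=5e-5)

# Linear warmup for 10 steps, then cosine decay toward 0 over the remaining 90
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=10, num_training_steps=100
)

for step in range(10):
    optimizer.step()
    scheduler.step()

# After warmup, the learning rate has climbed back up to its base value
print(scheduler.get_last_lr()[0])  # 5e-05
```

With Trainer, the same effect comes from setting lr_scheduler_type and warmup_steps (or warmup_ratio) in TrainingArguments.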