How to fine-tune a Hugging Face model on a custom dataset
Quick answer
Use the transformers and datasets libraries from Hugging Face to load your custom dataset, preprocess it, and fine-tune a pretrained model with Trainer. Prepare your dataset as a Dataset object, tokenize it, then call Trainer.train() to fine-tune on your data.
Prerequisites
- Python 3.8+
- pip install transformers datasets
- Basic knowledge of Python and PyTorch or TensorFlow
Setup
Install the required libraries and import necessary modules for fine-tuning.
pip install transformers datasets
Step-by-step fine-tuning
This example shows how to fine-tune a Hugging Face bert-base-uncased model on a custom text classification dataset using the Trainer API.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Example custom dataset
raw_data = {
    "text": ["I love AI", "I hate bugs", "Python is great", "Debugging is hard"],
    "label": [1, 0, 1, 0],
}

# Convert to a Hugging Face Dataset
dataset = Dataset.from_dict(raw_data)

# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize dataset
encoded_dataset = dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Training arguments
# (note: evaluation_strategy was renamed to eval_strategy in newer
# transformers releases)
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,
)

# Initialize Trainer (evaluating on the training set here only because this
# toy dataset has no separate validation split)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset,
    eval_dataset=encoded_dataset,
)

# Train model
trainer.train()
Output
***** Running training *****
  Num examples = 4
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 6
[...training logs...]
Training completed. Model saved to ./results
Common variations
- Use Trainer with a validation split for better evaluation.
- Fine-tune other models like roberta-base or distilbert-base-uncased.
- Use mixed precision training with fp16=True in TrainingArguments for faster training on GPUs.
- For large datasets, load from CSV or JSON files using datasets.load_dataset.
Troubleshooting
- If you get tokenization errors, verify your dataset text fields are strings and not empty.
- Out of memory errors? Reduce batch size or enable gradient checkpointing.
- Model not improving? Check learning rate and dataset quality.
- Ensure the transformers and datasets libraries are up to date.
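For the out-of-memory case, the fixes mentioned above can all be applied through TrainingArguments alone. A sketch of a memory-saving configuration (the parameter values are illustrative, not tuned):

```python
from transformers import TrainingArguments

# Memory-saving variant of the training arguments used above.
# Smaller per-device batches plus gradient accumulation keep the
# effective batch size at 2, while gradient checkpointing trades
# extra compute for lower activation memory.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,   # halve per-step memory
    gradient_accumulation_steps=2,   # effective batch size stays 2
    gradient_checkpointing=True,     # recompute activations on backward
    # fp16=True,                     # enable on CUDA GPUs for further savings
)
```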
Key Takeaways
- Use Hugging Face datasets to load and preprocess your custom data efficiently.
- Tokenize your dataset with the model's tokenizer before training.
- Use Trainer and TrainingArguments for streamlined fine-tuning.
- Adjust batch size, learning rate, and epochs based on your dataset size and hardware.
- Troubleshoot common issues by checking data types, memory limits, and library versions.