How-to · Beginner · 3 min read

How to prepare a dataset for fine-tuning Hugging Face models

Quick answer
To prepare a dataset for fine-tuning Hugging Face models, format your data as a list of dictionaries with input and target fields, then convert it into a datasets.Dataset or datasets.DatasetDict. Use the datasets library to load, preprocess, tokenize, and save the dataset in a compatible format for training.

PREREQUISITES

  • Python 3.8+
  • pip install datasets transformers
  • Basic knowledge of Python and JSON

Setup

Install the necessary libraries to handle datasets and tokenization for Hugging Face fine-tuning.

bash
pip install datasets transformers

Step by step

Prepare your dataset as a list of dictionaries with keys like "text" and "label", then load it into a datasets.Dataset. Tokenize the inputs using a Hugging Face tokenizer and format the dataset for training.

python
from datasets import Dataset
from transformers import AutoTokenizer

# Example raw data
raw_data = [
    {"text": "Hello, how are you?", "label": 0},
    {"text": "Fine-tuning is easy!", "label": 1}
]

# Load raw data into a Dataset
dataset = Dataset.from_list(raw_data)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply tokenization
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Set the output format to PyTorch tensors; indexing then returns only the listed columns
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

print(tokenized_dataset[0])
output
{'label': tensor(0), 'input_ids': tensor([...]), 'attention_mask': tensor([...])}

Common variations

You can adapt the dataset to different tasks by changing the keys (e.g., "input_text" and "target_text" for seq2seq models). Use datasets.DatasetDict to manage train/validation splits. Parallel tokenization (via the num_proc argument to map) and streaming for large datasets are also supported.

python
from datasets import DatasetDict

# Example train and validation splits
train_data = [{"text": "Train example", "label": 0}]
val_data = [{"text": "Validation example", "label": 1}]

# Create DatasetDict
full_dataset = DatasetDict({
    "train": Dataset.from_list(train_data),
    "validation": Dataset.from_list(val_data)
})

print(full_dataset)
output
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1
    })
})

Troubleshooting

  • If tokenization fails with a KeyError, ensure the column names in your dataset match the keys your tokenize function reads (e.g., 'text' or 'input_text').
  • For large datasets, pass streaming=True to datasets.load_dataset to avoid loading everything into memory.
  • Check that labels match the format your model expects — integer class IDs for classification, token ID sequences for seq2seq.

Key Takeaways

  • Format your dataset as a list of dictionaries with consistent keys before loading into Hugging Face datasets.
  • Use the Hugging Face tokenizer to preprocess text inputs for model compatibility.
  • Leverage DatasetDict for managing train and validation splits efficiently.
  • Streaming datasets help handle large data without memory overload.
  • Always verify label formats and tokenizer input keys to avoid errors.
Verified 2026-04 · bert-base-uncased