How-to · Beginner · 3 min read

How to prepare a dataset for fine-tuning Hugging Face models

Quick answer
To prepare a dataset for fine-tuning Hugging Face models, format your data as a list of dictionaries with input and target fields, then convert it into a datasets.Dataset or datasets.DatasetDict. Use the datasets library to load, preprocess, tokenize, and save the dataset in a compatible format for training.

PREREQUISITES

  • Python 3.8+
  • pip install datasets transformers
  • Basic knowledge of Python and JSON

Setup

Install the necessary libraries to handle datasets and tokenization for Hugging Face fine-tuning.

bash
pip install datasets transformers

Step by step

Prepare your dataset as a list of dictionaries with keys like "text" and "label", then load it into a datasets.Dataset. Tokenize the inputs using a Hugging Face tokenizer and format the dataset for training.

python
from datasets import Dataset
from transformers import AutoTokenizer

# Example raw data
raw_data = [
    {"text": "Hello, how are you?", "label": 0},
    {"text": "Fine-tuning is easy!", "label": 1}
]

# Load raw data into a Dataset
dataset = Dataset.from_list(raw_data)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply tokenization
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Set the output format to PyTorch tensors; indexing then returns only the listed columns
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

print(tokenized_dataset[0])
output
{'label': tensor(0), 'input_ids': tensor([...]), 'attention_mask': tensor([...])}

Common variations

You can adapt the dataset to different tasks by changing the keys (e.g., "input_text" and "target_text" for seq2seq models). Use datasets.DatasetDict to manage train/validation splits. Parallel tokenization (via the num_proc argument to map) and streaming for large datasets are also supported.

python
from datasets import DatasetDict

# Example train and validation splits
train_data = [{"text": "Train example", "label": 0}]
val_data = [{"text": "Validation example", "label": 1}]

# Create DatasetDict
full_dataset = DatasetDict({
    "train": Dataset.from_list(train_data),
    "validation": Dataset.from_list(val_data)
})

print(full_dataset)
output
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1
    })
})

Troubleshooting

  • If tokenization fails with a KeyError, ensure the column names in your dataset match the keys your tokenize function reads (e.g., 'text' or 'input_text').
  • For large datasets, pass streaming=True to datasets.load_dataset to avoid loading everything into memory.
  • Check that labels match the format your model expects — integer class IDs for classification, token ID sequences for seq2seq.

Key Takeaways

  • Format your dataset as a list of dictionaries with consistent keys before loading into Hugging Face datasets.
  • Use the Hugging Face tokenizer to preprocess text inputs for model compatibility.
  • Leverage DatasetDict for managing train and validation splits efficiently.
  • Streaming datasets help handle large data without memory overload.
  • Always verify label formats and tokenizer input keys to avoid errors.
Verified 2026-04 · bert-base-uncased