How-to · Beginner · 3 min read

How to create a custom dataset for Hugging Face

Quick answer
Use the Hugging Face datasets library: define your data in Python (e.g., a dictionary of column lists) and convert it with datasets.Dataset.from_dict(), or load it from files with load_dataset(). The resulting Dataset object integrates directly with Hugging Face tools for training and evaluation.

PREREQUISITES

  • Python 3.8+
  • pip install datasets
  • Basic knowledge of Python data structures

Setup

Install the Hugging Face datasets library with pip, then import it in your Python script.

bash
pip install datasets

Step by step

Create a custom dataset by defining your data as a dictionary of lists (one list per column), then convert it to a Hugging Face Dataset object with datasets.Dataset.from_dict(). This example shows a simple text classification dataset.

python
from datasets import Dataset

# Define your data as a dictionary of lists
data = {
    "text": ["Hello world", "Hugging Face is great", "Custom datasets are easy"],
    "label": [0, 1, 1]
}

# Create a Dataset object
custom_dataset = Dataset.from_dict(data)

# Inspect the dataset
print(custom_dataset)
print(custom_dataset[0])
output
Dataset({
    features: ['text', 'label'],
    num_rows: 3
})
{'text': 'Hello world', 'label': 0}

Common variations

You can also create datasets from CSV, JSON, or Parquet files by passing the format name and a data_files argument to load_dataset(). For complex datasets, Hugging Face has historically supported custom loading scripts following its dataset loading script template, though recent versions of the library favor the built-in loaders.

python
from datasets import load_dataset

# Load dataset from a CSV file
csv_dataset = load_dataset('csv', data_files='path/to/your_file.csv')

print(csv_dataset['train'][0])
output
{'text': 'Example sentence', 'label': 1}

Troubleshooting

If you encounter errors loading your dataset, verify that your data matches the expected structure (e.g., every column has the same number of rows and consistent names). For large datasets, ensure you have sufficient memory or use streaming mode with load_dataset(..., streaming=True). When loading from files, also check file paths and permissions.

Key Takeaways

  • Use datasets.Dataset.from_dict() to create custom datasets from Python data structures.
  • Load datasets from files with load_dataset() and data_files parameter for CSV, JSON, or Parquet.
  • Write custom loading scripts for complex datasets to integrate with the Hugging Face ecosystem.
  • Check data consistency and file paths to avoid common loading errors.
  • Streaming mode helps handle large datasets without loading all data into memory.
Verified 2026-04