How-to · Beginner · 3 min read

How to create a custom dataset for Hugging Face

Quick answer
Use the Hugging Face datasets library: define your data in Python (e.g., a dictionary of column lists) and convert it with datasets.Dataset.from_dict(), or load it from files with load_dataset(). The resulting Dataset object integrates directly with Hugging Face tools for training and evaluation.

PREREQUISITES

  • Python 3.8+
  • pip install datasets
  • Basic knowledge of Python data structures

Setup

Install the Hugging Face datasets library with pip, then import it in your Python script.

bash
pip install datasets

Step by step

Create a custom dataset by defining your data as a dictionary of lists (one list per column), then convert it to a Hugging Face Dataset object with datasets.Dataset.from_dict(). This example shows a simple text classification dataset.

python
from datasets import Dataset

# Define your data as a dictionary of lists
data = {
    "text": ["Hello world", "Hugging Face is great", "Custom datasets are easy"],
    "label": [0, 1, 1]
}

# Create a Dataset object
custom_dataset = Dataset.from_dict(data)

# Inspect the dataset
print(custom_dataset)
print(custom_dataset[0])
output
Dataset({
    features: ['text', 'label'],
    num_rows: 3
})
{'text': 'Hello world', 'label': 0}

Common variations

You can also create datasets from CSV, JSON, or Parquet files by passing the format name and a data_files argument to load_dataset(). For complex datasets, Hugging Face has historically supported custom loading scripts following its dataset loading script template, though recent versions of the library favor the built-in loaders.

python
from datasets import load_dataset

# Load dataset from a CSV file
csv_dataset = load_dataset('csv', data_files='path/to/your_file.csv')

print(csv_dataset['train'][0])
output
{'text': 'Example sentence', 'label': 1}

Troubleshooting

If you encounter errors loading your dataset, verify that your data matches the expected structure (e.g., every column has the same number of rows and consistent names). For large datasets, ensure you have sufficient memory or use streaming mode with load_dataset(..., streaming=True). When loading from files, also check file paths and permissions.

Key Takeaways

  • Use datasets.Dataset.from_dict() to create custom datasets from Python data structures.
  • Load datasets from files with load_dataset() and data_files parameter for CSV, JSON, or Parquet.
  • Write custom loading scripts for complex datasets to integrate with the Hugging Face ecosystem.
  • Check data consistency and file paths to avoid common loading errors.
  • Streaming mode helps handle large datasets without loading all data into memory.
Verified 2026-04