
How to use Hugging Face datasets library

Quick answer
Use the Hugging Face datasets library to load and preprocess datasets through a single function, load_dataset. Install it with pip install datasets, then load a dataset by its Hub name or from local files and pass it into your ML workflow.

PREREQUISITES

  • Python 3.8+
  • pip install "datasets>=2.0" (quote the specifier so the shell does not treat > as a redirect)

Setup

Install the Hugging Face datasets library using pip. No API key is required for basic usage.

bash
pip install datasets

Step by step

Load a dataset by name, explore its features, and access data samples with the following Python code.

python
from datasets import load_dataset

# Load the 'imdb' dataset
dataset = load_dataset('imdb')

# Print dataset splits
print(dataset)

# Access the first training example
print(dataset['train'][0])
output
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
{'text': 'I rented I AM CURIOUS-YELLOW from my video store...', 'label': 0}

Common variations

You can load datasets from local files, use streaming for large datasets, or load specific dataset configurations.

python
from datasets import load_dataset

# Load dataset from local CSV file
local_dataset = load_dataset('csv', data_files='data/my_data.csv')

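# Stream a large dataset instead of downloading it in full (memory efficient).
# The legacy 'wikipedia' ID relied on a loading script, which newer releases of
# the library may refuse to run; 'wikimedia/wikipedia' is the script-free version.
streamed_dataset = load_dataset('wikimedia/wikipedia', '20231101.en', streaming=True)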

# Load a specific configuration of a dataset
config_dataset = load_dataset('glue', 'mrpc')

Troubleshooting

  • If you see ModuleNotFoundError, ensure datasets is installed with pip install datasets.
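  • For slow downloads, check your internet connection. Downloaded datasets are cached locally (by default under ~/.cache/huggingface/datasets), so only the first load is slow; for very large datasets, consider streaming=True.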
  • If loading local files, verify the file path and format are correct.
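
The first and last checks above can be scripted before calling load_dataset. The sketch below uses only the standard library; datasets_is_installed and check_local_file are hypothetical helper names for this example, not part of the datasets API, and the extension check is only a rough proxy for verifying the file format.

```python
import importlib.util
from pathlib import Path
from typing import List


def datasets_is_installed() -> bool:
    """Return True if the `datasets` package can be found by the import system."""
    return importlib.util.find_spec("datasets") is not None


def check_local_file(path: str, expected_suffix: str = ".csv") -> List[str]:
    """Return a list of problems found with a local data file (hypothetical helper)."""
    problems = []
    p = Path(path)
    if not p.is_file():
        problems.append(f"file not found: {p}")
    elif p.suffix.lower() != expected_suffix:
        problems.append(
            f"unexpected extension {p.suffix!r}, expected {expected_suffix!r}"
        )
    return problems


if not datasets_is_installed():
    print("datasets is missing; run: pip install datasets")

# Path taken from the local CSV example; adjust it to your own file.
for problem in check_local_file("data/my_data.csv"):
    print(problem)
```

An empty list from check_local_file means the path exists and has the expected extension; it does not guarantee that the file contents parse correctly.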

Key Takeaways

  • Use load_dataset to quickly access thousands of datasets by name, or to load your own local files.
  • The library supports streaming and dataset configurations for flexible data handling.
  • Install with pip install datasets; no API key is needed for public datasets.
Verified 2026-04