How-to · Beginner · 3 min read

How to split a dataset into train and test sets with Hugging Face

Quick answer
Use the train_test_split() method from the Hugging Face datasets library to split a Dataset into training and testing subsets. The method returns a DatasetDict with train and test keys containing the respective splits.

PREREQUISITES

  • Python 3.8+
  • pip install "datasets>=2.0" (quote the requirement so the shell doesn't treat >= as redirection)

Setup

Install the Hugging Face datasets library if you haven't already. This library provides easy access to many datasets and utilities for dataset manipulation.

bash
pip install datasets

Step by step

Load a dataset and split it with train_test_split(). The example below uses the built-in imdb dataset; its 25,000-sample train split is re-divided into new 80/20 train and test subsets, which is why the output shows 20,000 and 5,000 samples.

python
from datasets import load_dataset

# Load the dataset
raw_dataset = load_dataset('imdb')

# Split the training set into train (80%) and test (20%)
split_dataset = raw_dataset['train'].train_test_split(test_size=0.2)

# Access train and test splits
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

# Print number of samples
print(f"Train samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
output
Train samples: 20000
Test samples: 5000

Common variations

  • Use train_test_split(train_size=0.7) to specify train size instead of test size.
  • Split a custom dataset loaded from files by converting it to a Dataset object first.
  • Use shuffle=True (default) to shuffle before splitting for randomized splits.
python
from datasets import Dataset

# Example custom dataset
custom_data = {'text': ['sample1', 'sample2', 'sample3', 'sample4', 'sample5'], 'label': [0, 1, 0, 1, 0]}
custom_dataset = Dataset.from_dict(custom_data)

# Split with 60% train, 40% test
split_custom = custom_dataset.train_test_split(train_size=0.6)

print(f"Train: {len(split_custom['train'])}, Test: {len(split_custom['test'])}")
output
Train: 3, Test: 2

Troubleshooting

  • If you get an error about a missing dataset, make sure you have internet access or that the dataset is cached locally.
  • For very large datasets, consider streaming or loading subsets to avoid memory issues.
  • If splits are not random, verify shuffle=True is set (default behavior).

Key Takeaways

  • Use train_test_split() from Hugging Face datasets to easily split datasets.
  • Specify test_size or train_size to control split proportions.
  • Shuffling (on by default) ensures randomized train/test sets; pass a seed for reproducibility.
  • You can split both built-in and custom datasets loaded as Dataset objects.
  • Check for dataset availability and memory constraints when working with large datasets.