How to filter a Hugging Face dataset

Quick answer

Call the datasets library's Dataset.filter() method with a function that returns True or False for each example. The method returns a new dataset containing only the examples for which the function returned True; the original dataset is left unchanged.

Prerequisites

Python 3.8+
pip install datasets
Basic Python knowledge

Setup

Install the datasets library from Hugging Face and import it in your Python script.

bash

pip install datasets

Step by step

Load a dataset from Hugging Face, then use the filter() method with a filtering function to select examples that meet your criteria.

python

from datasets import load_dataset

# Load the 'ag_news' dataset
dataset = load_dataset('ag_news', split='train')

# Define a filter function to keep only examples with label 1
# (the 'Sports' category; ag_news labels are 0=World, 1=Sports, 2=Business, 3=Sci/Tech)
def filter_label(example):
    return example['label'] == 1

# Apply the filter
filtered_dataset = dataset.filter(filter_label)

# Print number of examples before and after filtering
print(f"Original dataset size: {len(dataset)}")
print(f"Filtered dataset size: {len(filtered_dataset)}")

# Show first filtered example
print(filtered_dataset[0])

output

Original dataset size: 120000
Filtered dataset size: 30000
{'text': '...', 'label': 1}

Common variations

You can pass a lambda for inline filtering, combine several conditions in one function, or apply the same filter to a different split. To spread the work across processes, pass num_proc to filter().

python

from datasets import load_dataset

# Load dataset
dataset = load_dataset('ag_news', split='test')

# Inline lambda filter: keep examples with label 2 or 3 (Business or Sci/Tech)
filtered = dataset.filter(lambda x: x['label'] in [2, 3])

print(f"Filtered test split size: {len(filtered)}")

output

Filtered test split size: 3800

Troubleshooting

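If filtering is slow, pass batched=True to filter() so the function is called once per batch instead of once per example. In that mode the function receives a dict of column lists and must return a list of booleans, one per example.
If you get a KeyError for a missing column, check which columns exist with dataset.column_names.
For datasets too large to download in full, load them with streaming=True and call filter() on the resulting iterable dataset, which applies the function lazily as you iterate.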

python

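# With batched=True the function receives a dict of lists and must return one boolean per example
filtered = dataset.filter(lambda batch: [label == 1 for label in batch['label']], batched=True)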

Key Takeaways

Use Dataset.filter() with a boolean function to select dataset subsets.
filter() works per example by default; with batched=True the function receives a batch and must return a list of booleans, which is usually faster.
You can filter on any dataset column by accessing example fields in the filter function.

Verified 2026-04