How to filter a Hugging Face dataset

Quick answer
Use the datasets library's Dataset.filter() method to filter Hugging Face datasets by applying a boolean function to each example. This method returns a new dataset containing only the examples that satisfy the filter condition.

PREREQUISITES

  • Python 3.8+
  • pip install datasets
  • Basic Python knowledge

Setup

Install the datasets library from Hugging Face and import it in your Python script.

bash
pip install datasets

Step by step

Load a dataset from Hugging Face, then use the filter() method with a filtering function to select examples that meet your criteria.

python
from datasets import load_dataset

# Load the 'ag_news' dataset
dataset = load_dataset('ag_news', split='train')

# Define a filter function to keep only examples with label 1
# (the 'Sports' category; ag_news labels are 0=World, 1=Sports, 2=Business, 3=Sci/Tech)
def filter_label(example):
    return example['label'] == 1

# Apply the filter
filtered_dataset = dataset.filter(filter_label)

# Print number of examples before and after filtering
print(f"Original dataset size: {len(dataset)}")
print(f"Filtered dataset size: {len(filtered_dataset)}")

# Show first filtered example
print(filtered_dataset[0])
output
Original dataset size: 120000
Filtered dataset size: 30000
{'text': '...', 'label': 1}

Common variations

You can use a lambda function for inline filtering, combine multiple conditions in one function, or filter a different dataset split. filter() does not run asynchronously, but it accepts a num_proc argument to spread the work across several processes.

python
from datasets import load_dataset

# Load dataset
dataset = load_dataset('ag_news', split='test')

# Inline lambda filter: keep examples with label 2 or 3
filtered = dataset.filter(lambda x: x['label'] in [2, 3])

print(f"Filtered test split size: {len(filtered)}")
output
Filtered test split size: 3800

Troubleshooting

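  • If filtering is slow, try batched=True in filter() to process multiple examples at once. In batched mode the function receives a dict of lists and must return one boolean per example, so a single-example function such as filter_label cannot be reused unchanged.
  • If you get errors about missing columns, verify the dataset schema with dataset.column_names.
  • For large datasets, consider streaming mode (load_dataset(..., streaming=True)), which filters lazily as examples are read instead of materializing the whole dataset first.

python
# Batched filter: batch['label'] is a list, so return a list of booleans
filtered = dataset.filter(
    lambda batch: [label == 1 for label in batch['label']],
    batched=True,
)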

Key Takeaways

  • Use Dataset.filter() with a boolean function to select dataset subsets.
  • Filtering supports both single-example and batched processing for efficiency.
  • You can filter on any dataset column by accessing example fields in the filter function.
Verified 2026-04