
How to map a function over a Hugging Face dataset

Use the map method on a Hugging Face Dataset to apply a function to every example. Define a processing function and pass it to dataset.map(); the function receives one example at a time by default, or a batch of examples when you set batched=True.

PREREQUISITES

Python 3.8+
pip install datasets
Basic Python knowledge

Setup

Install the Hugging Face datasets library, then import it in your script. No API key is needed to process data locally or to download a public dataset such as ag_news.

bash

pip install datasets

Step by step

Load a dataset, define a function to transform each example, and apply it using dataset.map(). The example below adds a new field with the length of the text.

python

from datasets import load_dataset

# Load a sample dataset
dataset = load_dataset('ag_news', split='train[:5]')

# Define a function to add a new field

def add_length(example):
    example['text_length'] = len(example['text'])
    return example

# Map the function over the dataset
new_dataset = dataset.map(add_length)

# Print the updated dataset
for item in new_dataset:
    print(item)

output

{'text': 'Wall St. Bears Claw Back Into the Black (Reuters)', 'label': 2, 'text_length': 49}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters)', 'label': 2, 'text_length': 51}
{'text': "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'label': 2, 'text_length': 47}
{'text': 'Iraq Halts Oil Exports from Main Southern Pipeline', 'label': 2, 'text_length': 50}
{'text': 'Oil prices rise on Middle East tension, U.S. stock falls', 'label': 2, 'text_length': 56}

Common variations

Use batched=True in map() to process batches of examples for better performance.
Use remove_columns to drop unwanted columns after mapping.
Use with_indices=True to access example indices in your function.

python

def add_length_batch(batch):
    batch['text_length'] = [len(text) for text in batch['text']]
    return batch

batched_dataset = dataset.map(add_length_batch, batched=True)
print(batched_dataset[0])

output

{'text': 'Wall St. Bears Claw Back Into the Black (Reuters)', 'label': 2, 'text_length': 49}

Troubleshooting

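If map() raises a TypeError about the function's return type, make sure your function returns a dictionary (the example or batch). If the new column is silently missing instead, check that the function does not return None, in which case map() leaves the dataset unchanged.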
For large datasets, use batched=True to avoid slow processing.
Check column names carefully to avoid key errors in your function.


Key Takeaways

Use dataset.map() to apply transformations efficiently over Hugging Face datasets.
Batch processing with batched=True improves performance on large datasets.
Always return the modified example or batch from your mapping function; otherwise your changes are not applied to the dataset.

Verified 2026-04
