How-to · Beginner · 3 min read

How to map function over Hugging Face dataset

Quick answer
Use the map() method on a Hugging Face Dataset to apply a function to every example. Define a processing function and pass it to dataset.map() for efficient single-example or batched transformations.

Prerequisites

  • Python 3.8+
  • pip install datasets
  • Basic Python knowledge

Setup

Install the Hugging Face datasets library and import it. No API keys are required for local dataset processing.

bash
pip install datasets

Step by step

Load a dataset, define a function to transform each example, and apply it using dataset.map(). The example below adds a new field with the length of the text.

python
from datasets import load_dataset

# Load a sample dataset
dataset = load_dataset('ag_news', split='train[:5]')

# Define a function to add a new field

def add_length(example):
    example['text_length'] = len(example['text'])
    return example

# Map the function over the dataset
new_dataset = dataset.map(add_length)

# Print the updated dataset
for item in new_dataset:
    print(item)
output
{'text': 'Wall St. Bears Claw Back Into the Black (Reuters)', 'label': 2, 'text_length': 49}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters)', 'label': 2, 'text_length': 51}
{'text': "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'label': 2, 'text_length': 47}
{'text': 'Iraq Halts Oil Exports from Main Southern Pipeline', 'label': 2, 'text_length': 50}
{'text': 'Oil prices rise on Middle East tension, U.S. stock falls', 'label': 2, 'text_length': 56}

Common variations

  • Use batched=True in map() to process batches of examples for better performance.
  • Pass remove_columns to map() to drop columns you no longer need; they are removed in the same pass.
  • Use with_indices=True to access example indices in your function.
python
def add_length_batch(batch):
    batch['text_length'] = [len(text) for text in batch['text']]
    return batch

batched_dataset = dataset.map(add_length_batch, batched=True)
print(batched_dataset[0])
output
{'text': 'Wall St. Bears Claw Back Into the Black (Reuters)', 'label': 2, 'text_length': 49}

Troubleshooting

  • If you get a TypeError, ensure your function returns a dictionary or example, not None.
  • For large datasets, use batched=True to avoid slow processing.
  • Check column names carefully to avoid key errors in your function.

Key Takeaways

  • Use dataset.map() to apply transformations efficiently over Hugging Face datasets.
  • Batch processing with batched=True improves performance on large datasets.
  • Always return the modified example or batch from your mapping function to avoid errors.
Verified 2026-04