How to map a function over a Hugging Face dataset
Quick answer
Use the map method on a Hugging Face Dataset to apply a function to each example. Define a processing function and pass it to dataset.map() for efficient batch or single-example transformations.
Prerequisites
- Python 3.8+
- pip install datasets
- Basic Python knowledge
Setup
Install the Hugging Face datasets library and import it. No API keys are required for local dataset processing.
pip install datasets
Step by step
Load a dataset, define a function to transform each example, and apply it using dataset.map(). The example below adds a new field with the length of the text.
from datasets import load_dataset

# Load a sample dataset
dataset = load_dataset('ag_news', split='train[:5]')

# Define a function to add a new field
def add_length(example):
    example['text_length'] = len(example['text'])
    return example

# Map the function over the dataset
new_dataset = dataset.map(add_length)

# Print the updated dataset
for item in new_dataset:
    print(item)
Output
{'text': 'Wall St. Bears Claw Back Into the Black (Reuters)', 'label': 2, 'text_length': 49}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters)', 'label': 2, 'text_length': 51}
{'text': "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'label': 2, 'text_length': 47}
{'text': 'Iraq Halts Oil Exports from Main Southern Pipeline', 'label': 2, 'text_length': 50}
{'text': 'Oil prices rise on Middle East tension, U.S. stock falls', 'label': 2, 'text_length': 56}
Common variations
- Use batched=True in map() to process batches of examples for better performance.
- Use remove_columns to drop unwanted columns after mapping.
- Use with_indices=True to access example indices in your function.
def add_length_batch(batch):
    # In batched mode, batch['text'] is a list of strings
    batch['text_length'] = [len(text) for text in batch['text']]
    return batch

batched_dataset = dataset.map(add_length_batch, batched=True)
print(batched_dataset[0])
Output
{'text': 'Wall St. Bears Claw Back Into the Black (Reuters)', 'label': 2, 'text_length': 49}
Troubleshooting
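- If you get a TypeError, ensure your function returns a dictionary (the example or batch), not another type such as a list or a number. If it returns None, map() applies no update, so new fields will be missing.
- For large datasets, use batched=True to avoid slow processing.
- Check column names carefully (for example with dataset.column_names) to avoid key errors in your function.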
Key Takeaways
- Use dataset.map() to apply transformations efficiently over Hugging Face datasets.
- Batch processing with batched=True improves performance on large datasets.
- Always return the modified example or batch from your mapping function to avoid errors.