How to stream large datasets from Hugging Face
Quick answer
Use the Hugging Face datasets library with the streaming=True parameter in load_dataset to stream large datasets efficiently without downloading them fully. This lets you iterate over dataset examples on the fly, saving memory and disk space.

Prerequisites
- Python 3.8+
- pip install "datasets>=2.0.0"
- Internet connection
Setup
Install the Hugging Face datasets library if you haven't already. This library supports streaming large datasets directly from the Hugging Face Hub.
pip install "datasets>=2.0.0"

Step by step
Use the load_dataset function with streaming=True to stream the dataset. This example streams the imdb dataset and prints the first 5 examples without downloading the entire dataset.
from datasets import load_dataset
# Stream the dataset without downloading
streamed_dataset = load_dataset('imdb', split='train', streaming=True)
# Iterate over the first 5 examples
for i, example in enumerate(streamed_dataset):
    print(f"Example {i+1}:", example)
    if i >= 4:
        break

Output
Example 1: {'text': 'I rented I AM CURIOUS YELLOW from my video store ...', 'label': 0}
Example 2: {'text': 'This movie was just brilliant. ...', 'label': 1}
Example 3: {'text': 'The film had a great cast, ...', 'label': 1}
Example 4: {'text': 'I thought this was a wonderful movie. ...', 'label': 1}
Example 5: {'text': 'The story was a little slow, ...', 'label': 0}

Common variations
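Instead of a manual counter and break, you can slice the stream lazily with itertools.islice. The generator below is a stand-in so the sketch runs offline; with the real library you would pass the streamed dataset itself:

```python
from itertools import islice

# Stand-in for a streamed dataset; with the real library you would use
# load_dataset('imdb', split='train', streaming=True) here.
def example_stream():
    for i in range(1000):
        yield {'text': f'review {i} ...', 'label': i % 2}

# islice pulls only the first 5 examples; the rest are never generated
first_five = list(islice(example_stream(), 5))
for example in first_five:
    print(example)
```

The datasets library also offers IterableDataset.take(n), which returns a new lazy stream limited to the first n examples.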
- Streaming other splits or datasets: change the split or dataset name in load_dataset.
- Using streaming with dataset processing: you can map functions over streamed datasets, but avoid operations that require knowing the full dataset length.
- Async streaming: the datasets library does not natively support async iteration, but you can integrate streaming with async code using threads or async wrappers.
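The async bullet above can be sketched as a thin wrapper that runs the blocking next() call in a worker thread via asyncio.to_thread (Python 3.9+). This is a minimal sketch, not a library API; the generator below is a stand-in for iter(load_dataset(..., streaming=True)):

```python
import asyncio

async def aiter_blocking(iterable):
    # Wrap a blocking iterator (e.g. a streamed dataset) for async code.
    it = iter(iterable)
    sentinel = object()
    while True:
        # next() may block on network I/O, so run it in a worker thread
        example = await asyncio.to_thread(next, it, sentinel)
        if example is sentinel:
            return
        yield example

async def main():
    # Stand-in stream; replace with load_dataset(..., streaming=True)
    fake_stream = ({'label': i} for i in range(3))
    async for example in aiter_blocking(fake_stream):
        print(example)

asyncio.run(main())
```

This keeps the event loop responsive while each example is fetched, at the cost of one thread hop per example.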
For example, to stream the test split of the ag_news dataset:

from datasets import load_dataset

# Stream the test split of a different dataset
streamed_dataset = load_dataset('ag_news', split='test', streaming=True)
for i, example in enumerate(streamed_dataset):
    print(example)
    if i >= 2:
        break

Output
{'text': 'Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sell...', 'label': 3}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private equity...', 'label': 3}
{'text': 'Oil and Economy Cloud Stocks', 'label': 2}

Troubleshooting
- If you see ValueError: Streaming is only supported for datasets hosted on the Hugging Face Hub, ensure the dataset is available on the Hub and that you are using streaming=True.
- If streaming is slow, check your internet connection or try a smaller dataset split.
- For memory errors, avoid converting streamed datasets to lists or pandas DataFrames without limiting size.
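One way to follow the memory advice above is to consume the stream in bounded batches rather than materializing it all at once. A minimal sketch with a stand-in generator in place of the streamed dataset (each batch could then be passed to, e.g., pandas.DataFrame):

```python
from itertools import islice

def batched(iterable, batch_size):
    # Yield lists of at most batch_size items; never holds the full stream
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Stand-in for a streamed dataset
stream = ({'label': i} for i in range(10))

for batch in batched(stream, 4):
    # Only one small batch is in memory at a time
    print(len(batch))
```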
Key Takeaways
- Use streaming=True in load_dataset to handle large datasets efficiently.
- Streaming avoids full dataset downloads, saving disk space and memory.
- You can iterate over streamed datasets like normal Python iterators, but avoid operations that need the full dataset length.
- Streaming works only with datasets hosted on the Hugging Face Hub.
- Check your internet connection and dataset availability if streaming fails.