How to stream large datasets from Hugging Face
Quick answer
Use the Hugging Face datasets library with the streaming=True parameter in load_dataset to stream large datasets efficiently without downloading them fully. This lets you iterate over dataset examples on the fly, saving memory and disk space.

Prerequisites
- Python 3.8+
- pip install "datasets>=2.0.0"
- Internet connection
Setup
Install the Hugging Face datasets library if you haven't already. This library supports streaming large datasets directly from the Hugging Face Hub.
pip install "datasets>=2.0.0"

Step by step
Use the load_dataset function with streaming=True to stream the dataset. This example streams the imdb dataset and prints the first 5 examples without downloading the entire dataset.
from datasets import load_dataset
# Stream the dataset without downloading
streamed_dataset = load_dataset('imdb', split='train', streaming=True)
# Iterate over the first 5 examples
for i, example in enumerate(streamed_dataset):
    print(f"Example {i+1}:", example)
    if i >= 4:
        break

Output
Example 1: {'text': 'I rented I AM CURIOUS YELLOW from my video store ...', 'label': 0}
Example 2: {'text': 'This movie was just brilliant. ...', 'label': 1}
Example 3: {'text': 'The film had a great cast, ...', 'label': 1}
Example 4: {'text': 'I thought this was a wonderful movie. ...', 'label': 1}
Example 5: {'text': 'The story was a little slow, ...', 'label': 0}

Common variations
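Instead of a manual counter and break, you can slice the stream lazily with itertools.islice. The generator below is a stand-in so the sketch runs offline; with the real library you would pass the streamed dataset itself:

```python
from itertools import islice

# Stand-in for a streamed dataset; with the real library you would use
# load_dataset('imdb', split='train', streaming=True) here.
def example_stream():
    for i in range(1000):
        yield {'text': f'review {i} ...', 'label': i % 2}

# islice pulls only the first 5 examples; the rest are never generated
first_five = list(islice(example_stream(), 5))
for example in first_five:
    print(example)
```

The datasets library also offers IterableDataset.take(n), which returns a new lazy stream limited to the first n examples.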
- Streaming other splits or datasets: change the split or dataset name in load_dataset.
- Using streaming with dataset processing: you can map functions over streamed datasets, but avoid operations that require knowing the full dataset length.
- Async streaming: the datasets library does not natively support async iteration, but you can integrate streaming with async code using threads or async wrappers.
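The async bullet above can be sketched as a thin wrapper that runs the blocking next() call in a worker thread via asyncio.to_thread (Python 3.9+). This is a minimal sketch, not a library API; the generator below is a stand-in for iter(load_dataset(..., streaming=True)):

```python
import asyncio

async def aiter_blocking(iterable):
    # Wrap a blocking iterator (e.g. a streamed dataset) for async code.
    it = iter(iterable)
    sentinel = object()
    while True:
        # next() may block on network I/O, so run it in a worker thread
        example = await asyncio.to_thread(next, it, sentinel)
        if example is sentinel:
            return
        yield example

async def main():
    # Stand-in stream; replace with load_dataset(..., streaming=True)
    fake_stream = ({'label': i} for i in range(3))
    async for example in aiter_blocking(fake_stream):
        print(example)

asyncio.run(main())
```

This keeps the event loop responsive while each example is fetched, at the cost of one thread hop per example.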
For example, to stream the test split of the ag_news dataset:

from datasets import load_dataset

# Stream the test split of a different dataset
streamed_dataset = load_dataset('ag_news', split='test', streaming=True)
for i, example in enumerate(streamed_dataset):
    print(example)
    if i >= 2:
        break

Output
{'text': 'Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sell...', 'label': 3}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private equity...', 'label': 3}
{'text': 'Oil and Economy Cloud Stocks', 'label': 2}

Troubleshooting
- If you see ValueError: Streaming is only supported for datasets hosted on the Hugging Face Hub, ensure the dataset is available on the Hub and that you are using streaming=True.
- If streaming is slow, check your internet connection or try a smaller dataset split.
- For memory errors, avoid converting streamed datasets to lists or pandas DataFrames without limiting size.
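One way to follow the memory advice above is to consume the stream in bounded batches rather than materializing it all at once. A minimal sketch with a stand-in generator in place of the streamed dataset (each batch could then be passed to, e.g., pandas.DataFrame):

```python
from itertools import islice

def batched(iterable, batch_size):
    # Yield lists of at most batch_size items; never holds the full stream
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Stand-in for a streamed dataset
stream = ({'label': i} for i in range(10))

for batch in batched(stream, 4):
    # Only one small batch is in memory at a time
    print(len(batch))
```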
Key Takeaways
- Use streaming=True in load_dataset to handle large datasets efficiently.
- Streaming avoids full dataset downloads, saving disk space and memory.
- You can iterate over streamed datasets like normal Python iterators, but avoid operations that need the full dataset length.
- Streaming works only with datasets hosted on the Hugging Face Hub.
- Check your internet connection and dataset availability if streaming fails.