How to version datasets with wandb
Quick answer
Use wandb.Artifact to version datasets: log each dataset as a named artifact, and wandb assigns version labels (v0, v1, ...) automatically as the logged contents change. This lets you track dataset changes and retrieve a specific version in your ML projects.
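To make the versioning idea concrete, here is a minimal standard-library sketch of content-based versioning, in which a new version label is issued only when a file's checksum changes. It illustrates the concept only and is not wandb's implementation. `file_digest`, `VersionLog`, and the sample CSV contents are hypothetical names and data chosen for this sketch.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            sha.update(chunk)
    return sha.hexdigest()


class VersionLog:
    """Hypothetical in-memory log with one entry per distinct file content."""

    def __init__(self):
        self.digests = []  # position in this list is the version number

    def log(self, path):
        """Record the file and return its version label (v0, v1, ...)."""
        digest = file_digest(path)
        if digest not in self.digests:
            self.digests.append(digest)  # changed content gets a new version
        return f"v{self.digests.index(digest)}"


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "sample_dataset.csv"
    versions = VersionLog()

    data.write_text("id,label\n1,cat\n")
    first = versions.log(data)    # first time this content is seen
    repeat = versions.log(data)   # unchanged content keeps its version

    data.write_text("id,label\n1,cat\n2,dog\n")
    second = versions.log(data)   # edited content gets the next version

print(first, repeat, second)  # prints: v0 v0 v1
```

Keying versions on a checksum rather than a timestamp means that re-logging an unchanged file does not create a spurious new version.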
Prerequisites
- Python 3.8+
- wandb account and API key
- pip install wandb
Setup
Install the wandb Python package and configure your API key to enable dataset versioning.
pip install wandb
Step by step
This example shows how to create, log, and retrieve a versioned dataset artifact with wandb.
import wandb
# Log in to wandb (set the WANDB_API_KEY env var or run `wandb login` beforehand)
wandb.login()
# Initialize a wandb run
run = wandb.init(project="dataset-versioning", job_type="dataset_upload")
# Path to your dataset file
dataset_path = "data/sample_dataset.csv"
# Create an artifact for the dataset
artifact = wandb.Artifact(name="sample-dataset", type="dataset")
# Add the dataset file to the artifact
artifact.add_file(dataset_path)
# Log the artifact to wandb
run.log_artifact(artifact)
# Finish the run
run.finish()
# Later, to use a specific version of the dataset:
run = wandb.init(project="dataset-versioning", job_type="dataset_download")
# Get the artifact by name and version (e.g., "sample-dataset:v0")
dataset_artifact = run.use_artifact("sample-dataset:v0")
# Download the artifact files locally
dataset_dir = dataset_artifact.download()
print(f"Dataset downloaded to: {dataset_dir}")
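run.finish()
Output
Dataset downloaded to: ./artifacts/sample-dataset:v0
By default, download() places files in an artifacts/ directory under the current working directory; the exact path printed depends on your wandb version and platform.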
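Once download() returns, the dataset is an ordinary local directory, and a file added with add_file() keeps its base name inside it. The sketch below shows one way to read the CSV back. `load_rows` is a hypothetical helper, and a temporary directory with made-up rows stands in for the real download so the example runs without a network connection.

```python
import csv
import tempfile
from pathlib import Path


def load_rows(dataset_dir, filename="sample_dataset.csv"):
    """Read a CSV file from a downloaded artifact directory into a list of dicts."""
    csv_path = Path(dataset_dir) / filename
    with csv_path.open(newline="") as fh:
        return list(csv.DictReader(fh))


# Stand-in for the directory returned by dataset_artifact.download().
# The file name matches the one logged earlier; the rows are invented sample data.
with tempfile.TemporaryDirectory() as dataset_dir:
    (Path(dataset_dir) / "sample_dataset.csv").write_text("id,label\n1,cat\n2,dog\n")
    rows = load_rows(dataset_dir)

print(len(rows), rows[0])
```

In a real workflow, pass the value returned by dataset_artifact.download() as dataset_dir instead of the temporary directory.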
Common variations
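- Use artifact.add_dir() to version entire dataset directories.
- Pass aliases to run.log_artifact(), for example aliases=["latest"], to label dataset versions for easy reference. By default, wandb applies the latest alias to the newest version.
- log_artifact() uploads in the background. Call wait() on the returned artifact when a later step, such as another process in a distributed training setup, needs the upload to be complete.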
import wandb
run = wandb.init(project="dataset-versioning")
artifact = wandb.Artifact(name="full-dataset", type="dataset")
# Add entire directory
artifact.add_dir("data/full_dataset")
# Attach the alias at log time; artifact.aliases is only available once the artifact has been logged
run.log_artifact(artifact, aliases=["latest"])
run.finish()
Troubleshooting
- If you see wandb.errors.CommError, check your internet connection and API key validity.
- Ensure dataset paths exist before adding them to artifacts to avoid path errors such as FileNotFoundError (some wandb versions raise ValueError for a missing path instead).
- Use the wandb artifact ls CLI command with your project path to verify logged artifacts and versions.
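The path check mentioned above can be sketched as a small guard that runs before anything is added to an artifact. `require_dataset_path` and `add_dataset` are hypothetical helpers, not part of wandb. `add_dataset` assumes only that the object it receives exposes the add_file() and add_dir() methods used earlier in this guide.

```python
from pathlib import Path


def require_dataset_path(path):
    """Return the path as a Path, or raise FileNotFoundError with a clear message."""
    resolved = Path(path).expanduser()
    if not resolved.exists():
        raise FileNotFoundError(f"Dataset path not found: {resolved}")
    return resolved


def add_dataset(artifact, path):
    """Add a file or directory to an artifact after checking that it exists.

    `artifact` is expected to be a wandb.Artifact; only add_file and add_dir are called.
    """
    resolved = require_dataset_path(path)
    if resolved.is_dir():
        artifact.add_dir(str(resolved))
    else:
        artifact.add_file(str(resolved))
    return resolved


# Demonstrate the failure mode with a path that is assumed not to exist.
message = None
try:
    require_dataset_path("data/does_not_exist.csv")
except FileNotFoundError as err:
    message = str(err)

print(message)
```

Failing early with the offending path in the message is usually easier to debug than an error raised partway through an upload.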
Key Takeaways
- Use wandb.Artifact to version datasets and track changes systematically.
- Log datasets as artifacts with unique names and optional aliases for easy retrieval.
- Download specific dataset versions in your workflows to ensure reproducibility.