How to version datasets with wandb

Quick answer

Use wandb.Artifact to version datasets: log each dataset as a named artifact, and W&B assigns a new version (v0, v1, ...) whenever the logged contents change. This lets you track dataset changes and retrieve a specific version in your ML projects.

Prerequisites

Python 3.8+
wandb account and API key
pip install wandb

Setup

Install the wandb Python package, then configure your API key by running wandb login or by setting the WANDB_API_KEY environment variable.

bash

pip install wandb
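Before a script calls wandb.login(), it can help to confirm that a key is actually configured. The following is a minimal sketch using only the standard library. The helper name api_key_is_configured is hypothetical, and it only inspects the WANDB_API_KEY environment variable, so it will not see a key stored by wandb login and it does not validate the key against the server.

```python
import os


def api_key_is_configured(env=None):
    """Return True if WANDB_API_KEY is set to a non-empty value.

    Hypothetical helper for illustration only: it checks the environment
    and does not contact the W&B server.
    """
    env = os.environ if env is None else env
    return bool(env.get("WANDB_API_KEY", "").strip())


if __name__ == "__main__":
    if api_key_is_configured():
        print("WANDB_API_KEY found")
    else:
        print("WANDB_API_KEY not set; run `wandb login` or export the variable")
```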

Step by step

This example shows how to create, log, and retrieve a versioned dataset artifact with wandb.

python

import wandb

# Log in to wandb (set the WANDB_API_KEY env var or run `wandb login` first)
wandb.login()

# Initialize a wandb run
run = wandb.init(project="dataset-versioning", job_type="dataset_upload")

# Path to your dataset file
dataset_path = "data/sample_dataset.csv"

# Create an artifact for the dataset
artifact = wandb.Artifact(name="sample-dataset", type="dataset")

# Add the dataset file to the artifact
artifact.add_file(dataset_path)

# Log the artifact to wandb
run.log_artifact(artifact)

# Finish the run
run.finish()

# Later, to use a specific version of the dataset:
run = wandb.init(project="dataset-versioning", job_type="dataset_download")

# Get the artifact by name and version (e.g., "sample-dataset:v0")
dataset_artifact = run.use_artifact("sample-dataset:v0")

# Download the artifact files locally
dataset_dir = dataset_artifact.download()

print(f"Dataset downloaded to: {dataset_dir}")

run.finish()

output

Dataset downloaded to: ./artifacts/sample-dataset:v0
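The string passed to use_artifact() has the form name:version or name:alias, and it may be prefixed with project/ or entity/project/ when the artifact lives elsewhere. As a rough sketch, a small helper can build these references consistently. The name artifact_ref is hypothetical, and "my-team" below is a placeholder entity, not something from this guide.

```python
def artifact_ref(name, version="latest", project=None, entity=None):
    """Build an artifact reference such as 'sample-dataset:v0'.

    Hypothetical helper: `version` may be a version like 'v0' or an alias
    like 'latest'. An entity is only valid together with a project.
    """
    if not name or ":" in name or "/" in name:
        raise ValueError("name must be a bare artifact name")
    if entity and not project:
        raise ValueError("entity requires project")
    parts = [p for p in (entity, project) if p]
    parts.append(f"{name}:{version}")
    return "/".join(parts)


if __name__ == "__main__":
    print(artifact_ref("sample-dataset", "v0"))
    print(artifact_ref("sample-dataset", "v0",
                       project="dataset-versioning", entity="my-team"))
```

The result can be passed directly, for example run.use_artifact(artifact_ref("sample-dataset", "v0")).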

Common variations

Use artifact.add_dir() to version entire dataset directories.
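Pass aliases when logging, for example run.log_artifact(artifact, aliases=["baseline"]), to give a version a readable name. W&B moves the "latest" alias to the newest version automatically.
log_artifact() uploads in the background; call artifact.wait() when a later step needs the finished version. For distributed training setups where several processes contribute to one dataset version, see run.upsert_artifact() and run.finish_artifact().

python

import wandb

run = wandb.init(project="dataset-versioning")

artifact = wandb.Artifact(name="full-dataset", type="dataset")

# Add entire directory
artifact.add_dir("data/full_dataset")

# Attach a custom alias at log time ("baseline" is an example name).
# W&B applies the "latest" alias to the newest version automatically.
run.log_artifact(artifact, aliases=["baseline"])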
run.finish()
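To check locally whether a dataset directory changed between uploads, one option is to fingerprint it before logging. The sketch below is illustrative only: it hashes relative paths and file bytes with SHA-256, which is not W&B's internal digest algorithm, and directory_fingerprint is a hypothetical helper. It reads each file fully into memory, so it suits small datasets.

```python
import hashlib
from pathlib import Path


def directory_fingerprint(root):
    """Return a SHA-256 hex digest over every file under `root`.

    Files are visited in sorted order so the result is deterministic.
    Both the relative path and the file contents feed the digest.
    """
    root = Path(root)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

Comparing the fingerprint from the previous upload with the current one gives a quick signal of whether the directory contents differ.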

Troubleshooting

If you see wandb.errors.CommError, check your internet connection and API key validity.
Ensure dataset paths exist before adding to artifacts to avoid FileNotFoundError.
Use the wandb artifact ls CLI command (for example, wandb artifact ls <entity>/<project>) to verify logged artifacts and versions.
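The path check suggested above can be sketched as a small pre-flight helper that fails early with a clear message and picks add_file() or add_dir() accordingly. The function names dataset_kind and add_dataset are hypothetical. The artifact argument is assumed to be a wandb.Artifact, though any object with add_file and add_dir methods would work.

```python
from pathlib import Path


def dataset_kind(path):
    """Return 'file' or 'dir' for an existing path, else raise FileNotFoundError."""
    p = Path(path)
    if p.is_file():
        return "file"
    if p.is_dir():
        return "dir"
    raise FileNotFoundError(f"Dataset path does not exist: {p}")


def add_dataset(artifact, path):
    """Add `path` to the artifact with add_file() or add_dir().

    The existence check runs first, so a missing path raises before the
    artifact is touched.
    """
    if dataset_kind(path) == "file":
        artifact.add_file(str(path))
    else:
        artifact.add_dir(str(path))
```

With this in place, add_dataset(artifact, "data/sample_dataset.csv") and add_dataset(artifact, "data/full_dataset") can share one code path.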

Key Takeaways

Use wandb.Artifact to version datasets and track changes systematically.
Log datasets as artifacts with unique names and optional aliases for easy retrieval.
Download specific dataset versions in your workflows to ensure reproducibility.

Verified 2026-04
