Dataset registry and versioning
Why this matters
Without dataset versioning, you cannot reproduce the exact fine-tuning run that produced your best model. A colleague retraining the same model may use different data splits or preprocessing, invalidating comparisons. Production models require audit trails showing which exact dataset version was used.
Explanation
Dataset registry and versioning is a pattern where you maintain a centralized, immutable record of datasets (content hash, creation date, preprocessing applied, and split composition) so that any fine-tuning run can declare dataset_version='v2.1' and get identical data every time it runs. Mechanically, this works by: (1) storing dataset metadata (checksums, splits, shape) in a registry file or database, (2) tagging each dataset version with a semantic version or hash, and (3) loading datasets by reference to that tag rather than by file path. This decouples the dataset from the training code: you can move data to cloud storage or add new splits without breaking existing experiment configs, and a change to preprocessing is published as a new version rather than silently altering an old one. When to use it: as soon as you are sharing fine-tuning work with a team, comparing results across weeks, or preparing models for production validation.
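The three steps above can be sketched with nothing but the standard library. This is a minimal illustration, not a definitive implementation: the helper names <code>publish</code> and <code>resolve</code>, the in-memory dict used as the registry, and the file name are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish(registry: dict, name: str, version: str, path: Path) -> None:
    """Steps 1 and 2: record where a tagged version lives and its checksum."""
    registry.setdefault(name, {})[version] = {
        "path": str(path),
        "sha256": file_sha256(path),
    }


def resolve(registry: dict, name: str, version: str) -> Path:
    """Step 3: look a dataset up by tag and refuse it if the bytes changed."""
    entry = registry[name][version]
    path = Path(entry["path"])
    if file_sha256(path) != entry["sha256"]:
        raise RuntimeError(f"{name} {version} no longer matches its checksum")
    return path


# Hypothetical data file used only for this demonstration.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "feedback_v2_1.jsonl"
data_file.write_text('{"text": "Hello world", "label": 0}\n')

registry = {}
publish(registry, "customer_feedback", "v2.1", data_file)
resolved = resolve(registry, "customer_feedback", "v2.1")
print(f"Training code receives: {resolved}")
```

Training code only ever asks for ('customer_feedback', 'v2.1'), so the file can be relocated by updating the registry entry, and any in-place modification of the file is caught at load time.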
Analogy
Think of it like Docker image tagging. Instead of building and running whatever happens to be in a local folder, you run <code>docker run myrepo/model:v1.2.3</code>, and as long as nobody reassigns the tag, everyone gets the exact same layers. With datasets, instead of <code>load_from_folder('data')</code>, you use something like <code>load_dataset('my_registry', version='v1.2.3')</code> (an illustrative call, not an exact library signature) and get the exact same rows, splits, and preprocessing.
Code
from __future__ import annotations

import hashlib
import json
from pathlib import Path
from typing import Optional

from datasets import Dataset, DatasetDict


class DatasetRegistry:
    """Stores dataset fingerprints and metadata in a JSON file, keyed by name and version."""

    def __init__(self, registry_path: str = "dataset_registry.json"):
        self.registry_path = Path(registry_path)
        self.registry = self._load_registry()

    def _load_registry(self) -> dict:
        # Start from an empty registry if the file does not exist yet.
        if self.registry_path.exists():
            with open(self.registry_path, "r") as f:
                return json.load(f)
        return {}

    def _save_registry(self) -> None:
        with open(self.registry_path, "w") as f:
            json.dump(self.registry, f, indent=2)

    def _compute_dataset_hash(self, dataset: Dataset) -> str:
        # Cheap fingerprint: row count, column names, and the first row only.
        # Edits to later rows are NOT detected (see "Common gotcha").
        first_row = dataset[0] if len(dataset) > 0 else {}
        dataset_str = json.dumps(
            {
                "num_rows": len(dataset),
                "columns": dataset.column_names,
                "first_row_hash": hashlib.sha256(
                    json.dumps(first_row, default=str).encode()
                ).hexdigest()[:16],
            },
            sort_keys=True,
        )
        return hashlib.sha256(dataset_str.encode()).hexdigest()[:12]

    def _compute_hash(self, dataset: Dataset | DatasetDict) -> str:
        # For a DatasetDict, combine the per-split fingerprints into one hash.
        if isinstance(dataset, DatasetDict):
            hashes = {split: self._compute_dataset_hash(ds) for split, ds in dataset.items()}
            return hashlib.sha256(
                json.dumps(hashes, sort_keys=True).encode()
            ).hexdigest()[:12]
        return self._compute_dataset_hash(dataset)

    def register(
        self,
        name: str,
        dataset: Dataset | DatasetDict,
        version: str,
        description: str = "",
        metadata: Optional[dict] = None,
    ) -> str:
        combined_hash = self._compute_hash(dataset)
        # Keep versions immutable: a tag may not be reused for different data.
        existing = self.registry.get(name, {}).get(version)
        if existing is not None and existing["hash"] != combined_hash:
            raise ValueError(
                f"Dataset {name} version {version} is already registered with a different hash"
            )
        self.registry.setdefault(name, {})[version] = {
            "hash": combined_hash,
            "description": description,
            # A single Dataset stores one count; a DatasetDict stores one per split.
            "num_samples": len(dataset)
            if isinstance(dataset, Dataset)
            else {k: len(v) for k, v in dataset.items()},
            "metadata": metadata or {},
        }
        self._save_registry()
        return combined_hash

    def get_version(self, name: str, version: str) -> dict:
        if name not in self.registry or version not in self.registry[name]:
            raise ValueError(f"Dataset {name} version {version} not found in registry")
        return self.registry[name][version]

    def list_versions(self, name: str) -> list[str]:
        if name not in self.registry:
            return []
        return list(self.registry[name].keys())

    def verify(self, name: str, version: str, dataset: Dataset | DatasetDict) -> bool:
        # Recompute the fingerprint and compare it with the one stored at registration.
        stored_hash = self.get_version(name, version)["hash"]
        return self._compute_hash(dataset) == stored_hash

if __name__ == "__main__":
    registry = DatasetRegistry()

    # Build a tiny two-split dataset to register.
    train_data = Dataset.from_dict({
        "text": ["Hello world", "Fine-tuning is fun", "Dataset versioning rocks"],
        "label": [0, 1, 1],
    })
    test_data = Dataset.from_dict({
        "text": ["Test sample"],
        "label": [0],
    })
    dataset_split = DatasetDict({"train": train_data, "test": test_data})

    # Register the splits under a name and version tag.
    hash_v1 = registry.register(
        name="customer_feedback",
        dataset=dataset_split,
        version="v1.0",
        description="Initial customer feedback dataset, balanced labels",
        metadata={"source": "internal_survey", "date_collected": "2026-04-01"},
    )
    print(f"Registered customer_feedback v1.0 with hash: {hash_v1}")
    print(f"Versions available: {registry.list_versions('customer_feedback')}")

    # Read the stored entry back and check the dataset against it.
    metadata = registry.get_version("customer_feedback", "v1.0")
    print("\nMetadata for v1.0:")
    print(json.dumps(metadata, indent=2))
    is_valid = registry.verify("customer_feedback", "v1.0", dataset_split)
    print(f"\nDataset integrity check: {is_valid}")

Output
Registered customer_feedback v1.0 with hash: 8a9f2c1b4e7d
Versions available: ['v1.0']

Metadata for v1.0:
{
  "hash": "8a9f2c1b4e7d",
  "description": "Initial customer feedback dataset, balanced labels",
  "num_samples": {
    "train": 3,
    "test": 1
  },
  "metadata": {
    "source": "internal_survey",
    "date_collected": "2026-04-01"
  }
}

Dataset integrity check: True
What just happened?
We created a <code>DatasetRegistry</code> class that stores dataset metadata (hashes, descriptions, split sizes) in a JSON file. We then registered a DatasetDict with two splits (train and test) under the name 'customer_feedback' version 'v1.0'. The registry computed a deterministic fingerprint for each split from its row count, column names, and first row, combined the two into a single hash, and stored it along with the metadata. Finally, we verified that the same dataset produces the same hash, which confirms that the fingerprint is reproducible.
Common gotcha
Computing a hash over the entire dataset content is expensive for large datasets, so developers often hash only the first row or the metadata. That is fast, but it misses any change that leaves those fields untouched. The code above fingerprints the row count, the column names, and the first row as a compromise: it notices dropped or added rows, but not edits to any row after the first. The subtler gotcha is that hash stability depends on row order: if your preprocessing shuffles rows differently on a second run, a different row lands first and the hash changes even though the data is logically identical. Pass a fixed seed (for example seed=42) to every shuffle and random split so that hashes stay stable.
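When a full pass over the data is affordable, one alternative is to hash every row and make the result independent of row order. The sketch below is an illustration under assumptions: <code>content_hash</code> and <code>row_digest</code> are hypothetical helpers, not part of the registry class above, and rows are assumed to be dicts with JSON-encodable values.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """Hash one row; sort_keys makes the digest independent of column order."""
    encoded = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def content_hash(rows) -> str:
    """Hash every row, then sort the row digests so row order does not matter."""
    digests = sorted(row_digest(row) for row in rows)
    combined = hashlib.sha256()
    for digest in digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()[:12]


rows = [
    {"text": "Hello world", "label": 0},
    {"text": "Fine-tuning is fun", "label": 1},
    {"text": "Dataset versioning rocks", "label": 1},
]
shuffled = list(reversed(rows))
edited = rows[:2] + [{"text": "Dataset versioning rocks!", "label": 1}]

print(content_hash(rows) == content_hash(shuffled))  # same rows, different order
print(content_hash(rows) == content_hash(edited))    # last row was changed
```

The trade-off is cost: every row is read and one digest per row is held in memory for sorting, so very large datasets may still need a sampled or chunked scheme.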
Error recovery
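ValueError: Dataset {name} version {version} not found in registry
Raised by <code>get_version</code> (and therefore by <code>verify</code>) when the name or version has not been registered. Check the spelling against <code>list_versions(name)</code> and confirm that you are reading the same registry file that was written at registration time.
TypeError: Object of type Dataset is not JSON serializable
Raised while saving the registry when an entry contains something json cannot encode, typically a Dataset object placed in the metadata dict. Keep metadata to plain strings, numbers, lists, and dicts.
Experienced dev note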
In production, do not keep the registry only as a JSON file on one machine. Use a simple key-value store (Redis, DynamoDB), or at minimum a registry file that lives in cloud storage or in a git repository so that every change to it is tracked. More importantly, version your preprocessing code as well as your data. Dataset 'v1.0' run through preprocessing 'v2.0' will not reproduce a result that was obtained with preprocessing 'v1.0'. Store the preprocessing function name and hash in the registry metadata, or pin both together in your training config. Many teams miss this and end up unable to reproduce experiments because they versioned only the raw data, not the transformations applied to it.
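One way to record the preprocessing alongside the data is to fingerprint the function's source and put the result in the metadata dict passed at registration. This is a rough sketch under assumptions: <code>function_fingerprint</code> and the two example preprocessing functions are hypothetical, and the "preprocessing" metadata key is only a suggested layout.

```python
import hashlib
import inspect


def function_fingerprint(func) -> str:
    """Return a short hash that changes when the function's code changes."""
    try:
        source = inspect.getsource(func)
    except (OSError, TypeError):
        # Source text is unavailable (e.g. code run via exec or a REPL).
        # Fall back to bytecode details, which can differ across Python versions.
        code = func.__code__
        source = repr((code.co_code, code.co_names, code.co_consts))
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]


def lowercase_text(example: dict) -> dict:
    """Example preprocessing step: lowercase the text column."""
    return {**example, "text": example["text"].lower()}


def strip_text(example: dict) -> dict:
    """A different preprocessing step: trim surrounding whitespace."""
    return {**example, "text": example["text"].strip()}


# Metadata dict suitable for passing as `metadata=` when registering a version.
metadata = {
    "source": "internal_survey",
    "preprocessing": {
        "function": lowercase_text.__name__,
        "fingerprint": function_fingerprint(lowercase_text),
    },
}
print(metadata["preprocessing"])
```

The fingerprint covers only the function's own code, not the helpers it calls or the library versions it runs against, so pinning dependencies is still needed for full reproducibility.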
Check your understanding
You register 'medical_notes' v1.0 today with 5,000 samples. Three weeks later, you add 2,000 more samples and call it v1.1. A colleague loads v1.0 and runs fine-tuning. Will their results be identical to yours if you both use the registry to load the same version? Why or why not?
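Answer hint
Only if v1.0 still resolves to the original 5,000 samples. The registry shown here stores a fingerprint and metadata, not the rows themselves, so your colleague has to load the v1.0 snapshot and call <code>verify</code> to confirm that it matches what you registered. If the 2,000 new samples were appended to the same file in place, verification fails because the row count changed, and v1.0 can no longer be recovered. That is why each version should be written to its own location and never modified afterward. Matching data is also only part of the story: identical results additionally require the same preprocessing, random seeds, and hyperparameters. The key insight is that versioning must be immutable: once v1.0 is registered, it locks in exactly what data was used.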
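Note: DatasetDict is part of the Hugging Face datasets library and is not specific to version 3.0+, so the code above does not need a fallback for older releases. To build train/test splits from a single Dataset, use .train_test_split() or .select(indices) and wrap the results in a DatasetDict.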