Code Intermediate medium · 7 min

Dataset registry and versioning

What you will learn

Track, version, and register datasets so that fine-tuning experiments stay reproducible across team members and over time.

Why this matters

Without dataset versioning, you cannot reproduce the exact fine-tuning run that produced your best model. A colleague retraining the same model may use different data splits or preprocessing, invalidating comparisons. Production models require audit trails showing which exact dataset version was used.

Skip if: you are building a throwaway prototype or running a one-off notebook experiment where reproducibility is not a concern. If you are the only person who will ever touch this code and the only question is 'does it work', a registry adds overhead without benefit.

Explanation

Dataset registry and versioning is a pattern where you maintain a centralized, immutable record of each dataset (its content hash, creation date, applied preprocessing, and split composition) so that any fine-tuning run can declare dataset_version='v2.1' and be sure it gets identical data every time it runs. Mechanically, this works by: (1) storing dataset metadata (checksums, splits, shape) in a registry file or database, (2) tagging each dataset version with a semantic version or hash, and (3) loading datasets by reference to that tag rather than by file path. This decouples the dataset from the training code: you can move data to cloud storage, update preprocessing, or add new splits (each change registered as a new version) without breaking existing experiment configs. When to use it: as soon as you are sharing fine-tuning work with a team, comparing results across weeks, or preparing models for production validation.
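
To make step (3) concrete, here is a minimal sketch of loading by reference. It is an illustration under stated assumptions, not part of the lesson's class: the <code>location</code> field, the placeholder hash, and the <code>resolve_dataset</code> helper are all hypothetical, and the registry is an in-memory dict rather than a file or database.

```python
# Hypothetical registry contents; in practice this would be read from a file or database.
# The "location" field and the hash value are placeholders for illustration only.
REGISTRY = {
    "customer_feedback": {
        "v1.0": {
            "hash": "0123456789ab",
            "location": "data/customer_feedback/v1.0",
            "splits": {"train": 3, "test": 1},
        }
    }
}

# The training config names the dataset by tag only; it never contains a file path.
TRAIN_CONFIG = {"dataset_name": "customer_feedback", "dataset_version": "v1.0"}


def resolve_dataset(registry: dict, name: str, version: str) -> dict:
    """Return the registry entry for (name, version) or raise a clear error."""
    try:
        return registry[name][version]
    except KeyError:
        raise ValueError(f"Dataset {name} version {version} not found in registry") from None


entry = resolve_dataset(
    REGISTRY, TRAIN_CONFIG["dataset_name"], TRAIN_CONFIG["dataset_version"]
)
print(entry["location"])  # where the loader should read the data from
```

Because the config holds only the tag, moving the data means updating one registry entry rather than every experiment config.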

Analogy

Think of it like Docker image tagging. Instead of <code>docker run path/to/folder</code>, you run <code>docker run myrepo/model:v1.2.3</code>. Everyone gets the exact same layers. With datasets, instead of <code>load_from_folder('data')</code>, you use <code>load_dataset('my_registry', version='v1.2.3')</code> and get the exact same rows, splits, and preprocessing.

Code

python

import hashlib
import json
from pathlib import Path
from typing import Optional

from datasets import Dataset, DatasetDict

class DatasetRegistry:
    def __init__(self, registry_path: str = "dataset_registry.json"):
        self.registry_path = Path(registry_path)
        self.registry = self._load_registry()
    
    def _load_registry(self) -> dict:
        if self.registry_path.exists():
            with open(self.registry_path, "r") as f:
                return json.load(f)
        return {}
    
    def _save_registry(self):
        with open(self.registry_path, "w") as f:
            json.dump(self.registry, f, indent=2)
    
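    def _compute_dataset_hash(self, dataset: Dataset) -> str:
        # Cheap fingerprint: row count, column names, and the first row only.
        # Edits to later rows are NOT detected (see "Common gotcha").
        first_row = dataset[0] if len(dataset) > 0 else {}  # do not index an empty split
        dataset_str = json.dumps(
            {
                "num_rows": len(dataset),
                "columns": dataset.column_names,
                "first_row_hash": hashlib.sha256(
                    json.dumps(first_row, default=str).encode()
                ).hexdigest()[:16]
            },
            sort_keys=True
        )
        return hashlib.sha256(dataset_str.encode()).hexdigest()[:12]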
    
    def register(
        self,
        name: str,
        dataset: Dataset | DatasetDict,
        version: str,
        description: str = "",
        metadata: Optional[dict] = None
    ) -> str:
        if isinstance(dataset, DatasetDict):
            hashes = {split: self._compute_dataset_hash(ds) for split, ds in dataset.items()}
            combined_hash = hashlib.sha256(
                json.dumps(hashes, sort_keys=True).encode()
            ).hexdigest()[:12]
        else:
            combined_hash = self._compute_dataset_hash(dataset)
        
        if name not in self.registry:
            self.registry[name] = {}
        
        self.registry[name][version] = {
            "hash": combined_hash,
            "description": description,
            "num_samples": len(dataset) if isinstance(dataset, Dataset) else {k: len(v) for k, v in dataset.items()},
            "metadata": metadata or {}
        }
        self._save_registry()
        return combined_hash
    
    def get_version(self, name: str, version: str) -> dict:
        if name not in self.registry or version not in self.registry[name]:
            raise ValueError(f"Dataset {name} version {version} not found in registry")
        return self.registry[name][version]
    
    def list_versions(self, name: str) -> list[str]:
        if name not in self.registry:
            return []
        return list(self.registry[name].keys())
    
    def verify(self, name: str, version: str, dataset: Dataset | DatasetDict) -> bool:
        metadata = self.get_version(name, version)
        stored_hash = metadata["hash"]
        
        if isinstance(dataset, DatasetDict):
            hashes = {split: self._compute_dataset_hash(ds) for split, ds in dataset.items()}
            computed_hash = hashlib.sha256(
                json.dumps(hashes, sort_keys=True).encode()
            ).hexdigest()[:12]
        else:
            computed_hash = self._compute_dataset_hash(dataset)
        
        return computed_hash == stored_hash

if __name__ == "__main__":
    from datasets import Dataset as HFDataset
    
    registry = DatasetRegistry()
    
    train_data = HFDataset.from_dict({
        "text": ["Hello world", "Fine-tuning is fun", "Dataset versioning rocks"],
        "label": [0, 1, 1]
    })
    test_data = HFDataset.from_dict({
        "text": ["Test sample"],
        "label": [0]
    })
    
    dataset_split = DatasetDict({"train": train_data, "test": test_data})
    
    hash_v1 = registry.register(
        name="customer_feedback",
        dataset=dataset_split,
        version="v1.0",
        description="Initial customer feedback dataset, balanced labels",
        metadata={"source": "internal_survey", "date_collected": "2026-04-01"}
    )
    
    print(f"Registered customer_feedback v1.0 with hash: {hash_v1}")
    print(f"Versions available: {registry.list_versions('customer_feedback')}")
    
    metadata = registry.get_version("customer_feedback", "v1.0")
    print(f"\nMetadata for v1.0:")
    print(json.dumps(metadata, indent=2))
    
    is_valid = registry.verify("customer_feedback", "v1.0", dataset_split)
    print(f"\nDataset integrity check: {is_valid}")

Output

Registered customer_feedback v1.0 with hash: 8a9f2c1b4e7d
Versions available: ['v1.0']

Metadata for v1.0:
{
  "hash": "8a9f2c1b4e7d",
  "description": "Initial customer feedback dataset, balanced labels",
  "num_samples": {
    "train": 3,
    "test": 1
  },
  "metadata": {
    "source": "internal_survey",
    "date_collected": "2026-04-01"
  }
}

Dataset integrity check: True

What just happened?

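We created a <code>DatasetRegistry</code> class that stores dataset metadata (hashes, descriptions, split sizes) in a JSON file. It stores only this metadata, not the data itself. We then registered a DatasetDict with two splits (train and test) under the name 'customer_feedback' version 'v1.0'. The registry computed a deterministic fingerprint of each split (row count, column names, and a hash of the first row), combined the per-split fingerprints into one hash, and stored it along with the metadata. Finally, we verified that the same dataset produces the same hash, which is the check a colleague would run before training on data they loaded themselves.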
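
One thing <code>register()</code> does not enforce is immutability: calling it again with the same name and version silently replaces the stored entry. A minimal sketch of a guard is shown below. The <code>add_version</code> helper is hypothetical and works on a plain dict shaped like the registry JSON; the hash values are placeholders.

```python
def add_version(registry: dict, name: str, version: str, entry: dict) -> None:
    """Add an entry, refusing to change a version that is already registered."""
    existing = registry.get(name, {}).get(version)
    if existing is not None:
        if existing["hash"] != entry["hash"]:
            raise ValueError(
                f"{name} {version} is already registered with hash {existing['hash']}; "
                "register the changed data under a new version instead"
            )
        return  # identical content was registered before: keep the stored entry
    registry.setdefault(name, {})[version] = entry


registry = {}
add_version(registry, "customer_feedback", "v1.0", {"hash": "aaaaaaaaaaaa"})
add_version(registry, "customer_feedback", "v1.0", {"hash": "aaaaaaaaaaaa"})  # no-op

try:
    # Different content under an existing version is rejected.
    add_version(registry, "customer_feedback", "v1.0", {"hash": "bbbbbbbbbbbb"})
except ValueError as err:
    print(err)
```

The same check could be placed at the top of <code>register()</code> before the entry is written.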

Common gotcha

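Computing a hash over the entire dataset content is expensive for large datasets, so developers often hash only the first row or some metadata. That is fast, but it misses changes elsewhere in the data. The code above is such a compromise: it hashes the row count, the column names, and the first row, so it notices dropped or added rows but not an edit to row 500 or a reordering that leaves the first row in place. A second gotcha appears once you hash more of the content: the hash then depends on row order, so if your preprocessing shuffles rows differently on a second run, the hash changes even though the data is logically identical. Fix the random seed (for example seed=42) in every shuffle and split so that row order, and therefore the hash, stays stable.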
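
If the cost is acceptable, a full-content hash avoids both problems. The sketch below is one possible approach, not the lesson's implementation: the <code>content_hash</code> function and its <code>order_sensitive</code> flag are hypothetical. It accepts any iterable of dict rows, which includes a plain list and a Hugging Face <code>Dataset</code> (iterating one yields dicts).

```python
import hashlib
import json


def content_hash(rows, order_sensitive: bool = False) -> str:
    """Hash every row; by default the result does not depend on row order."""
    row_digests = [
        hashlib.sha256(json.dumps(row, sort_keys=True, default=str).encode()).hexdigest()
        for row in rows
    ]
    if not order_sensitive:
        row_digests.sort()  # the same set of rows gives the same hash in any order
    combined = hashlib.sha256()
    for digest in row_digests:
        combined.update(digest.encode())
    return combined.hexdigest()[:12]


rows = [{"text": "Hello world", "label": 0}, {"text": "Fine-tuning is fun", "label": 1}]
shuffled = list(reversed(rows))
edited = [rows[0], {"text": "Fine-tuning is hard", "label": 1}]

print(content_hash(rows) == content_hash(shuffled))  # True: reordering is ignored
print(content_hash(rows) == content_hash(edited))    # False: an edit past row 0 is caught
```

This keeps one digest per row in memory, so for very large datasets you may prefer to hash in order with a fixed seed instead of sorting.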

Error recovery

ValueError: Dataset {name} version {version} not found in registry

You are trying to load a dataset version that was never registered. Call <code>registry.register()</code> first with that exact name and version string, or check the spelling matches what is in <code>registry_path</code>.
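
A small helper can make the spelling check quicker. This is a sketch only: <code>explain_missing</code> is a hypothetical function, and the <code>registry</code> dict stands in for the parsed contents of the registry JSON file.

```python
import difflib


def explain_missing(registry: dict, name: str, version: str) -> str:
    """Describe why a (name, version) lookup failed and what is available instead."""
    if name not in registry:
        close = difflib.get_close_matches(name, list(registry), n=3)
        return f"No dataset named {name!r}. Close matches: {close}"
    return f"{name!r} has versions {sorted(registry[name])}, not {version!r}"


# Stand-in for json.load() of the registry file.
registry = {"customer_feedback": {"v1.0": {}, "v1.1": {}}}

print(explain_missing(registry, "customer_feedbak", "v1.0"))   # misspelled name
print(explain_missing(registry, "customer_feedback", "1.0"))   # missing "v" prefix
```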

TypeError: Object of type Dataset is not JSON serializable

You are trying to store the actual Dataset object in metadata. Store only JSON-serializable values (strings, numbers, booleans, and lists or dicts of these). Save the dataset to disk separately and store the path in metadata instead.
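
It can help to validate metadata before calling <code>register()</code>. In the class above, <code>_save_registry()</code> opens the file for writing before serializing, so an error raised partway through can leave a partially written registry file. The <code>check_metadata</code> helper below is a hypothetical sketch of such a pre-check.

```python
import json


def check_metadata(metadata: dict) -> dict:
    """Raise early, naming the offending keys, if metadata cannot be written as JSON."""
    bad_keys = []
    for key, value in metadata.items():
        try:
            json.dumps(value)
        except TypeError:
            bad_keys.append(key)
    if bad_keys:
        raise TypeError(f"Metadata values are not JSON serializable: {bad_keys}")
    return metadata


# Plain strings (including a path to the saved dataset) pass the check.
check_metadata({"source": "internal_survey", "path": "data/customer_feedback/v1.0"})

try:
    check_metadata({"dataset": object()})  # object() stands in for a Dataset instance
except TypeError as err:
    print(err)
```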

Experienced dev note

In production, do not keep the registry as an unversioned local JSON file. Use a shared store such as a simple key-value database (Redis, DynamoDB), or even a plain registry file kept under git, so that every team member reads the same entries. More importantly: version your preprocessing code separately from your dataset version. A dataset 'v1.0' loaded with preprocessing 'v2.0' does not reproduce a run that used preprocessing 'v1.0'. Store the preprocessing function name/hash in the registry metadata, or couple the two in your training config. Many teams miss this and end up unable to reproduce experiments because they versioned only the raw data, not the transformations applied to it.
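
One way to record the preprocessing alongside the data is to fingerprint the function's source and pass the result as registry metadata. The sketch below is an assumption-laden illustration: <code>preprocessing_fingerprint</code> and <code>lowercase_text</code> are hypothetical names, and the hash covers only the function's own source, not helpers it calls or library versions.

```python
import hashlib
import inspect


def preprocessing_fingerprint(func) -> dict:
    """Return the function's name plus a short hash of its source text."""
    try:
        source = inspect.getsource(func)
    except (OSError, TypeError):
        # Source is unavailable (for example, defined in a REPL). Fall back to
        # bytecode, which is only comparable within one Python version.
        source = func.__code__.co_code.hex()
    return {
        "preprocessing_fn": func.__name__,
        "preprocessing_hash": hashlib.sha256(source.encode()).hexdigest()[:12],
    }


def lowercase_text(example: dict) -> dict:  # hypothetical preprocessing step
    return {**example, "text": example["text"].lower()}


# Merge the fingerprint into the metadata dict passed to registry.register().
metadata = {"source": "internal_survey", **preprocessing_fingerprint(lowercase_text)}
print(metadata)
```

Editing the body of <code>lowercase_text</code> changes <code>preprocessing_hash</code>, which signals that the dataset should be registered under a new version.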

Check your understanding

You register 'medical_notes' v1.0 today with 5,000 samples. Three weeks later, you add 2,000 more samples and call it v1.1. A colleague loads v1.0 and runs fine-tuning. Will their results be identical to yours if you both use the registry to load the same version? Why or why not?

Show answer hint

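Yes, provided v1.0 was stored immutably. The registry entry for v1.0 records the hash and split sizes of the original 5,000 samples, and registering v1.1 adds a new entry rather than changing the old one. The registry in this lesson stores only metadata, not the data itself, so the original 5,000-sample snapshot must still exist unchanged, and your colleague should call <code>verify()</code> on whatever they load. If the files were overwritten in place when the 2,000 samples were added, <code>verify()</code> returns False for v1.0 because the row count no longer matches. The key insight is that versioning is immutable: once v1.0 is registered, neither its entry nor the data it describes should change.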

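VERSION This pattern is compatible with transformers 5.5.x and trl 1.x. DatasetDict has been part of the Hugging Face datasets library since its early releases, so the split handling shown here does not need a particular datasets version. The type hints in the code (Dataset | DatasetDict, list[str]) require Python 3.10 or newer; on older interpreters, use typing.Union and typing.List instead.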

Next, learn how to load datasets from your registry into a Hugging Face Dataset object, then pass it directly to <code>SFTTrainer</code> with proper train/eval splits configured.
