Tool Intermediate medium · 8 min config

Phase 2: reproducibility

What you will learn
Lock data, models, and pipelines with DVC to make your experiments reproducible across environments.

Why this matters

Without version control on data and models, your training code may behave differently tomorrow because a dataset shifted or a dependency changed. DVC records the exact version of every data and model artifact in your pipeline alongside the code. A teammate who checks out your commit and runs dvc pull gets the same inputs and can re-run the same pipeline; results match as long as the code itself is deterministic (fixed seeds, pinned packages).

Skip if: your dataset never changes, your model weights are static, and you never collaborate. In that case you don't need DVC yet. The moment you iterate on data cleaning, experiment with feature engineering, or share work across a team, DVC helps prevent the silent divergence that can cost weeks of debugging.

Explanation

DVC (Data Version Control) is a Git-like system for data and models rather than code. It keeps large files in a local cache and pushes them to remote storage (S3, Azure Blob Storage, GCS), while Git tracks only small pointer files: `.dvc` files for individually added data, and dvc.lock for pipeline outputs. The dvc.yaml file defines your pipeline stages (data fetch → preprocess → train → evaluate), and dvc.lock records the content hashes of each stage's dependencies and outputs. When you run dvc repro, DVC checks whether a stage's command or any of its dependencies changed; if not, it skips the stage and reuses the existing outputs. This turns a manual, error-prone process into a reproducible pipeline in which every experiment is traceable: which data version, which code commit, and which hyperparameters produced a given model.
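
The skip decision can be pictured as a hash comparison. The sketch below is a simplified illustration, not DVC's actual implementation: hash_file, stage_is_up_to_date, and the lock dictionary are hypothetical names, SHA-256 is used only for convenience, and real DVC also accounts for directories, parameters, and the stage command.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return a hex digest of the file's contents (SHA-256, for illustration only)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stage_is_up_to_date(deps: list[Path], lock: dict[str, str]) -> bool:
    """True when every dependency's current hash matches the recorded one."""
    return all(lock.get(str(dep)) == hash_file(dep) for dep in deps)


def demo() -> tuple[bool, bool]:
    """Record hashes, then check the stage before and after the data changes."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"
        raw.write_text("a,b\n1,2\n")
        lock = {str(raw): hash_file(raw)}  # hypothetical stand-in for dvc.lock
        before = stage_is_up_to_date([raw], lock)  # nothing changed: skip
        raw.write_text("a,b\n1,3\n")  # the dataset shifts
        after = stage_is_up_to_date([raw], lock)  # hash differs: re-run
        return before, after
```

Calling demo() returns one flag for the untouched file and one for the edited file, which is the same distinction dvc repro draws between a stage it can skip and a stage it must re-run.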

Configuration

yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw
      - src/prepare.py
    outs:
      - data/prepared:
          cache: true
    metrics:
      - data/prepared.json:
          cache: false

  featurize:
    cmd: python src/featurize.py
    deps:
      - data/prepared
      - src/featurize.py
    outs:
      - data/features:
          cache: true

  train:
    cmd: python src/train.py --epochs 10
    deps:
      - data/features
      - src/train.py
    outs:
      - models/model.pkl:
          cache: true
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - models/model.pkl
      - data/features
      - src/evaluate.py
    metrics:
      - eval_metrics.json:
          cache: false

Why this order?

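DVC derives execution order from the dependency graph, not from the order in which stages appear in dvc.yaml. prepare runs before featurize because featurize declares data/prepared, an output of prepare, as a dependency. DVC topologically sorts the stages, so listing train above featurize in the file changes nothing. What does break the pipeline is a missing link: if train depends on data/features and no stage produces it (and it does not exist on disk), dvc repro fails.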
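
To see how an execution order falls out of deps and outs alone, here is a minimal sketch using Python's standard graphlib module. It is not DVC's code: STAGES is a hand-written dictionary mirroring the configuration above (listed out of order on purpose), and execution_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Stages listed deliberately out of order; deps/outs mirror the dvc.yaml above.
STAGES = {
    "train": {"deps": ["data/features", "src/train.py"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["models/model.pkl", "data/features"], "outs": []},
    "featurize": {"deps": ["data/prepared", "src/featurize.py"], "outs": ["data/features"]},
    "prepare": {"deps": ["data/raw", "src/prepare.py"], "outs": ["data/prepared"]},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so each one runs after the stages that produce its deps."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage's predecessors are the producers of its deps; deps with no
    # producer (raw data, scripts) are plain files and add no edge.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

Because the graph is built from paths rather than from position in the file, shuffling the entries in STAGES does not change the result.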

Wrong vs Right

Wrong way
yaml
stages:
  train:
    # No deps declared: nothing records which data or code produced model.pkl.
    cmd: python train.py
    outs:
      - model.pkl

Right way
yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - data/raw
      - prepare.py      # track the script too, so code edits trigger a re-run
    outs:
      - data/prepared
  featurize:
    cmd: python featurize.py
    deps:
      - data/prepared
      - featurize.py
    outs:
      - data/features
  train:
    cmd: python train.py
    deps:
      - data/features
      - train.py
    outs:
      - model.pkl

Tool vitals

Primary command
bash
dvc repro
Config file dvc.yaml
Verify
bash
dvc dag

Integration notes

DVC can be used alongside MLflow: log the paths or hashes of DVC-tracked models as MLflow run tags or artifacts, and note the MLflow run ID next to the corresponding pipeline change (for example in the commit message or a dvc.yaml comment) for an audit trail. Use DVC for pipeline orchestration and data versioning; use MLflow for experiment tracking and the model registry. Together they cover end-to-end reproducibility: data, code, and metrics are all pinned.
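
One way to link the two tools is to copy output hashes from dvc.lock into run tags. The sketch below assumes dvc.lock has already been parsed into a dictionary shaped like EXAMPLE_LOCK (stages → outs → path/md5); that shape, the lock_to_tags helper, the tag naming scheme, and the placeholder hash are all assumptions made for illustration, so check them against your own dvc.lock.

```python
# Hypothetical, hand-written stand-in for a parsed dvc.lock; the hash is a placeholder.
EXAMPLE_LOCK = {
    "stages": {
        "train": {
            "cmd": "python src/train.py --epochs 10",
            "outs": [{"path": "models/model.pkl", "md5": "0123abcd"}],
        }
    }
}


def lock_to_tags(lock: dict) -> dict[str, str]:
    """Flatten every stage output's hash into a flat {tag_name: hash} mapping."""
    tags = {}
    for stage, spec in lock.get("stages", {}).items():
        for out in spec.get("outs", []):
            # Tag name format "dvc.<stage>.<path>" is an arbitrary choice.
            tags[f"dvc.{stage}.{out['path']}"] = out.get("md5", "unknown")
    return tags
```

The resulting dictionary could then be attached to an active MLflow run, for example with mlflow.set_tags, so that each run points back to the exact artifacts it used.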

Migration path

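If you outgrow DVC (for example, very large datasets or complex distributed training), consider an orchestrator such as Kubeflow or Airflow. Keep the DVC-tracked artifacts in object storage (for example S3 with lifecycle policies) and re-implement the DVC stages as orchestrator tasks. The stage structure in dvc.yaml maps naturally onto Airflow tasks (one task per stage, with deps becoming upstream edges), though scheduling, retries, and caching have to be configured separately.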

Common gotcha

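Misreading cache: false on metrics files. The flag does not control whether a stage re-runs; it controls where the file is stored. With cache: false, DVC leaves the file out of its cache and does not add it to .gitignore, so small files such as eval_metrics.json can be committed to Git and diffed in pull requests. Whether a stage re-runs depends on whether its command or dependencies changed. Keep large intermediate data (like data/features) cached, which is the default, so DVC can reuse it instead of repeating expensive transformations when nothing upstream changed. To re-run a stage regardless, use dvc repro --force, or set always_changed: true on the stage.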
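
For reference, the metrics file that a stage such as evaluate writes is plain JSON. A minimal sketch, assuming flat scalar metrics; write_metrics and the metric name in the usage line are hypothetical, not part of DVC:

```python
import json
from pathlib import Path


def write_metrics(path: str, **metrics: float) -> Path:
    """Write scalar metrics as JSON, creating parent directories if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and a trailing newline keep Git diffs small and stable.
    out.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return out
```

A call such as write_metrics("eval_metrics.json", accuracy=0.91) at the end of src/evaluate.py would produce the file the evaluate stage declares under metrics (the 0.91 is a made-up value).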

Team adoption

On day one, commit .dvc/config (with the storage remote set), dvc.yaml, dvc.lock, and the .gitignore entries that keep DVC-tracked data and models (such as /data and /models) out of Git. All team members run dvc pull to fetch data from remote storage. Enforce: every data mutation goes through a DVC stage in dvc.yaml, never manual file edits. Reference dvc dag in your PR template to show the pipeline structure, and dvc status or dvc metrics diff to show what changed. Block merges that add outputs without corresponding stages.

Experienced dev note

Use dvc repro --dry to preview which stages would run without executing them, and dvc repro --no-commit to run the pipeline without saving outputs to the DVC cache. Both help catch missing dependencies or broken commands before they propagate. Pair them with dvc dag to visualize your pipeline: if a stage other than an entry stage has no upstream stage, a dependency is probably missing from its deps list. Also: keep cache: false on small metrics and evaluation outputs so they are versioned in Git and show up in diffs; dvc metrics diff compares them across commits.

Check your understanding

Why does DVC skip the featurize stage on a second dvc repro run, even though your model performs worse? What did you change?

Answer hint

DVC skips a stage when its command and all of its dependencies (files, code) are unchanged, so featurize is skipped because neither data/prepared nor src/featurize.py changed. If model performance dropped anyway, the change is downstream: you probably edited src/train.py or the --epochs value in the train command, which re-runs train but not featurize, or your training randomness isn't controlled (no fixed random seed). The cached features are still valid; the difference comes from the training stage.
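
Uncontrolled randomness is the case that DVC cannot detect for you. The sketch below shows the idea with the standard library only; train_once is a hypothetical stand-in for a real training run, and real frameworks (NumPy, PyTorch, and so on) each need their own seed set as well.

```python
import random


def train_once(seed: int) -> float:
    """Stand-in for a training run: draws 'weights' and returns their sum."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    weights = [rng.gauss(0.0, 1.0) for _ in range(5)]
    return sum(weights)
```

With a fixed seed, two calls return the identical value, which is what makes a re-run of the train stage comparable to the previous one.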
