Phase 2: reproducibility
Why this matters
Without version control on data and models, the same training code can produce different results tomorrow because a dataset shifted or a dependency changed. DVC gives every artifact in your pipeline a single source of truth. A teammate who checks out your commit and pulls the data runs the same pipeline on the same inputs, and should get the same results as long as the code itself is deterministic.
Explanation
DVC (Data Version Control) is a Git-like system for data and models rather than code. It stores the large files themselves in remote storage (S3, Azure, GCS) and commits small pointer files (`.dvc` files) to Git. The dvc.yaml file defines your pipeline stages (data fetch → preprocess → train → evaluate), and dvc.lock records the content hashes of each stage's inputs and outputs. When you run dvc repro, DVC checks whether a stage's command or dependencies changed; if not, it skips that stage and reuses the cached outputs. This turns machine learning from a manual, error-prone process into a reproducible pipeline. Every experiment is traceable: which data version, which code commit, and which hyperparameters produced this model.
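The skip behaviour can be pictured with a toy model. The sketch below is not DVC's implementation; it only assumes that a stage counts as stale when the content hash of any dependency differs from the hash recorded after the previous run, which is the role dvc.lock plays. The names `file_hash`, `record`, and `stage_is_stale` are hypothetical, and SHA-256 is an arbitrary choice for the sketch.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record(deps: list[Path]) -> dict[str, str]:
    """Build a lock mapping (path -> hash) after a successful run."""
    return {str(dep): file_hash(dep) for dep in deps}


def stage_is_stale(deps: list[Path], lock: dict[str, str]) -> bool:
    """True if any dependency is absent from the lock or its hash changed."""
    return any(lock.get(str(dep)) != file_hash(dep) for dep in deps)
```

A stage would be re-run only when `stage_is_stale` returns True, after which `record` would refresh the lock.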
Configuration
    stages:
      prepare:
        cmd: python src/prepare.py
        deps:
          - data/raw
          - src/prepare.py
        outs:
          - data/prepared:
              cache: true
        metrics:
          - data/prepared.json:
              cache: false
      featurize:
        cmd: python src/featurize.py
        deps:
          - data/prepared
          - src/featurize.py
        outs:
          - data/features:
              cache: true
      train:
        cmd: python src/train.py --epochs 10
        deps:
          - data/features
          - src/train.py
        outs:
          - models/model.pkl:
              cache: true
        metrics:
          - metrics.json:
              cache: false
      evaluate:
        cmd: python src/evaluate.py
        deps:
          - models/model.pkl
          - data/features
          - src/evaluate.py
        metrics:
          - eval_metrics.json:
              cache: false

Why this order?
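Stages execute in dependency order, not in the order they appear in the file. prepare runs before featurize because featurize declares data/prepared, an output of prepare, as a dependency. DVC builds a dependency graph from deps and outs and sorts it topologically, so reordering the train and featurize entries in dvc.yaml changes nothing. What does break the pipeline is a missing link: if no stage produces data/features and the path does not exist, train fails on a missing dependency.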
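As a rough illustration of how an execution order can be derived from deps and outs alone, the sketch below mirrors the pipeline above in a plain dictionary and sorts it with Python's standard `graphlib`. It is a toy model, not DVC's code; `STAGES` and `execution_order` are hypothetical names, and the stages are listed out of order on purpose.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory mirror of the dvc.yaml above (paths only),
# deliberately listed in reverse order.
STAGES = {
    "evaluate": {"deps": ["models/model.pkl", "data/features"], "outs": []},
    "train": {"deps": ["data/features"], "outs": ["models/model.pkl"]},
    "featurize": {"deps": ["data/prepared"], "outs": ["data/features"]},
    "prepare": {"deps": ["data/raw"], "outs": ["data/prepared"]},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so every producer runs before its consumers."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage's predecessors are the producers of its dependencies.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
# ['prepare', 'featurize', 'train', 'evaluate']
```

Dependencies such as data/raw that no stage produces are treated as external inputs and add no edge to the graph.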
Wrong vs Right
Wrong (no deps declared, so DVC cannot tell when inputs change):

    stages:
      train:
        cmd: python train.py
        outs:
          - model.pkl

Right:

    stages:
      prepare:
        cmd: python prepare.py
        deps:
          - data/raw
        outs:
          - data/prepared
      featurize:
        cmd: python featurize.py
        deps:
          - data/prepared
        outs:
          - data/features
      train:
        cmd: python train.py
        deps:
          - data/features
        outs:
          - model.pkl

Tool vitals
dvc repro
dvc.yaml
dvc dag

Integration notes
DVC integrates with MLflow: store DVC-tracked model paths in MLflow artifacts, and log MLflow run IDs into dvc.yaml comments for audit trails. Use DVC for pipeline orchestration and data versioning; use MLflow for experiment tracking and model registry. Together, they create end-to-end reproducibility: data + code + metrics all pinned.
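One way to tie the two tools together is to attach the hashes DVC records to the MLflow run as tags. The sketch below assumes dvc.lock has already been parsed into a dictionary (for example with a YAML library) and that each output entry carries `path` and `md5` keys; that shape may differ across DVC versions. `lock_to_tags` and `example_lock` are hypothetical, and the hash value is a placeholder rather than a real digest.

```python
def lock_to_tags(lock: dict) -> dict[str, str]:
    """Flatten output hashes from a parsed dvc.lock into MLflow-style tags."""
    tags = {}
    for stage_name, stage in lock.get("stages", {}).items():
        for out in stage.get("outs", []):
            # Assumed entry shape: {"path": ..., "md5": ...}
            if "md5" in out:
                tags[f"dvc.{stage_name}.{out['path']}"] = out["md5"]
    return tags


# Hypothetical parsed lock content; the hash is a placeholder.
example_lock = {
    "schema": "2.0",
    "stages": {
        "train": {
            "cmd": "python src/train.py --epochs 10",
            "outs": [{"path": "models/model.pkl", "md5": "<md5-of-model>"}],
        }
    },
}

tags = lock_to_tags(example_lock)
# With MLflow installed, the tags could be attached inside an active run:
#   mlflow.set_tags(tags)
```

Keeping the MLflow call outside the helper means the mapping can be checked without a tracking server.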
Migration path
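If you outgrow DVC (for example, datasets well beyond 100 GB or complex distributed training), migrate to Kubeflow or Airflow. Export DVC-tracked artifacts to a distributed storage backend (S3 with lifecycle policies) and replace DVC stages with Airflow DAGs. Each stage in dvc.yaml maps roughly onto one Airflow task, with deps becoming upstream task dependencies, but the translation is manual and Airflow does not provide DVC's hash-based stage skipping out of the box.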
Common gotcha
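Misreading what cache: false does. The cache flag controls where an output is stored, not whether a stage re-runs. With cache: true (the default), DVC stores the output in its cache and adds the path to .gitignore; with cache: false, the file stays in the workspace so you can commit it to Git. Set cache: false on small text outputs such as metrics.json and eval_metrics.json so they are versioned and diffed alongside your code. Keep cache: true on large intermediate data (like data/features) so it lives in the DVC cache and can be pushed to remote storage. Whether a stage is skipped depends only on whether its command and dependencies changed.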
Team adoption
On day one, commit a .gitignore that excludes /data and /models, and commit .dvc/config with the storage backend set. All team members run dvc pull to fetch data from remote storage. Enforce one rule: every data mutation goes through a DVC stage in dvc.yaml, never through manual file edits. Include the output of dvc dag in your PR template so reviewers can see how a change fits into the pipeline. Block merges that add outputs without corresponding stages.
Experienced dev note
Use dvc repro --no-commit to test pipeline changes without saving outputs to the DVC cache. This catches missing dependencies or broken commands before they propagate. Pair it with dvc dag to visualize your pipeline: a stage that you expected to consume another stage's output but that shows no upstream connection usually has a missing deps entry. Also: set cache: false on metrics and evaluation outputs so they stay in Git, where they can be diffed in review.
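The check for stages without dependencies can also be scripted. The sketch below assumes dvc.yaml has already been parsed into a dictionary; `stages_without_deps` is a hypothetical helper, not part of DVC, and the two example pipelines are abbreviated versions of the Wrong and Right configurations.

```python
def stages_without_deps(pipeline: dict) -> list[str]:
    """Return the names of stages that declare no deps at all."""
    return [
        name
        for name, stage in pipeline.get("stages", {}).items()
        if not stage.get("deps")
    ]


# Abbreviated "Wrong" pipeline: train declares only an output.
wrong = {"stages": {"train": {"cmd": "python train.py", "outs": ["model.pkl"]}}}

# Abbreviated "Right" pipeline: every stage declares its inputs.
right = {
    "stages": {
        "prepare": {"cmd": "python prepare.py", "deps": ["data/raw"]},
        "train": {"cmd": "python train.py", "deps": ["data/features"]},
    }
}

print(stages_without_deps(wrong))  # ['train']
print(stages_without_deps(right))  # []
```

A flagged stage is not always a bug, so a check like this is better suited to a review warning than a hard failure.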
Check your understanding
Why does DVC skip the featurize stage on a second dvc repro run, even though your model performs worse? What did you change?
Answer hint
DVC skips a stage when its command and all declared dependencies (files, code) are unchanged. `featurize` depends only on `data/prepared` and `src/featurize.py`, so it is skipped whenever those are untouched, regardless of what happens downstream. If model performance dropped without `featurize` re-running, you probably changed hyperparameters in `src/train.py` or in the `train` command, which re-runs `train` but not `featurize`, or your training code's randomness isn't controlled (no fixed random seed). The stage cache was valid; the change happened downstream of it.
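The random-seed point can be shown with a stand-in for a training run. This is a minimal sketch using only Python's standard library; `train_stub` is hypothetical, and a real project would also need to seed whichever ML framework it uses.

```python
import random


def train_stub(seed: int, n: int = 5) -> list[float]:
    """Stand-in for a training run: returns pseudo-random 'weights'."""
    rng = random.Random(seed)  # local generator, independent of global state
    return [rng.random() for _ in range(n)]


# Same seed, same result: the run is repeatable.
print(train_stub(42) == train_stub(42))  # True
```

Passing the seed as an explicit argument keeps it visible in the code that DVC tracks, so changing it changes a declared dependency.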