Level 1: automated training
Why this matters
Without pipeline versioning, retraining becomes manual and fragile: you can't reproduce which data version trained which model, regressions go undetected, and team members accidentally retrain on stale datasets. DVC pipelines make training deterministic and auditable.
Explanation
DVC pipelines declaratively specify how data flows through training stages (preprocessing → train → evaluate) and record the hash of every input and output artifact. Unlike CI/CD pipelines, which trigger on code changes, a DVC pipeline is run on demand with dvc repro and reruns a stage when its data, code, or declared parameters have changed. The lineage lives in Git as two small text files: dvc.yaml (the stage definitions) and dvc.lock (the recorded hashes). When you run dvc repro, DVC compares each stage's current inputs against dvc.lock; if nothing has changed, it skips the stage and reuses the cached outputs. Stage outputs are stored in the DVC cache and remote storage rather than in Git, so large model artifacts never run into Git hosting limits (GitHub, for example, rejects files over 100 MB). This creates an audit trail: you can ask "what data trained model v2.3?" and answer it from the dvc.lock committed alongside that model version.
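The skip decision can be pictured as a hash comparison. The sketch below is a simplified illustration, not DVC's actual implementation: file_md5, stage_changed, demo, and the recorded dictionary (a stand-in for the hashes kept in a lock file) are all hypothetical names.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the hex MD5 digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_changed(deps, recorded):
    """Return True if any dependency's current hash differs from the recorded one.

    `recorded` maps dependency path -> hash from the last successful run
    (a stand-in for what a lock file would hold).
    """
    return any(file_md5(path) != recorded.get(path) for path in deps)


def demo():
    """Record a hash, check an unchanged file, then edit the file and check again."""
    with tempfile.TemporaryDirectory() as workdir:
        data = os.path.join(workdir, "train.csv")
        with open(data, "w") as handle:
            handle.write("x,y\n1,2\n")
        recorded = {data: file_md5(data)}
        unchanged = stage_changed([data], recorded)  # nothing edited yet
        with open(data, "a") as handle:
            handle.write("3,4\n")
        changed = stage_changed([data], recorded)  # content now differs
    return unchanged, changed
```

A stage whose check returns False can be skipped; one that returns True must rerun, along with every stage that consumes its outputs.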
Configuration
# dvc.yaml - complete training pipeline with three stages
# Hashes, sizes, and file counts are not written here; DVC records them in dvc.lock.
stages:
  prepare:
    cmd: python src/prepare.py data/raw data/prepared
    deps:
      - data/raw/train.csv
      - src/prepare.py
    params:
      - preprocess.test_split
    outs:
      - data/prepared
  train:
    cmd: python src/train.py data/prepared models/model.pkl --epochs ${train.epochs}
    deps:
      - data/prepared
      - src/train.py
      - src/config.yaml
    params:
      - train.learning_rate
      - train.batch_size
      - train.epochs
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py models/model.pkl data/prepared metrics.json
    deps:
      - models/model.pkl
      - data/prepared
      - src/evaluate.py
    # metrics.json is declared only here: two stages cannot share an output path
    metrics:
      - metrics.json:
          cache: false

# params.yaml - hyperparameters kept separate so that changes trigger reruns
train:
  learning_rate: 0.001
  batch_size: 32
  epochs: 10
preprocess:
  test_split: 0.2
Why this order?
Stages run in dependency order. prepare runs first because train depends on its outputs, and train runs before evaluate because the metrics depend on the trained model. DVC derives this order from the deps and outs declarations, so the order in which stages appear in dvc.yaml does not matter and you never specify it explicitly. The params.yaml file is separate because DVC tracks individual keys in it: if you change train.learning_rate and the train stage lists that key under params, the next dvc repro reruns training and everything downstream.
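To make the ordering rule concrete, here is a minimal sketch of deriving an execution order from deps and outs with a topological sort. It is an illustration under stated assumptions rather than DVC's own code: STAGES is a hypothetical, abbreviated in-memory summary of the three stages (deliberately listed out of order), and execution_order is a made-up helper name.

```python
from graphlib import TopologicalSorter

# Hypothetical summary of the pipeline's deps/outs, listed out of order on purpose.
STAGES = {
    "evaluate": {
        "deps": ["models/model.pkl", "data/prepared", "src/evaluate.py"],
        "outs": ["metrics.json"],
    },
    "train": {
        "deps": ["data/prepared", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "prepare": {
        "deps": ["data/raw/train.csv", "src/prepare.py"],
        "outs": ["data/prepared"],
    },
}


def execution_order(stages):
    """Order stages so that each one runs after the stages producing its deps."""
    # Map every output path to the stage that produces it.
    producers = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on another stage when one of its deps is that stage's output.
    graph = {
        name: {producers[dep] for dep in spec["deps"] if dep in producers}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

Dependencies such as src/train.py have no producing stage, so they add no edges to the graph; they only affect whether a stage is considered changed.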
Wrong vs Right
# Wrong
stages:
  train:
    cmd: python train.py
    outs:
      - models/model.pkl
  prepare:
    cmd: python prepare.py
# Missing deps, and prepare has no outs: DVC can't detect the prepare→train dependency
# No params section: hyperparameter changes don't trigger retraining
# No metrics entry: metrics.json is not tracked, so DVC's metrics commands have nothing to compare
# outs without deps: DVC cannot tell when models/model.pkl is out of date with its data or code
# Right
stages:
  prepare:
    cmd: python src/prepare.py data/raw data/prepared
    deps:
      - data/raw/train.csv
      - src/prepare.py
    outs:
      - data/prepared
  train:
    cmd: python src/train.py data/prepared models/model.pkl
    deps:
      - data/prepared
      - src/train.py
    params:
      - train.learning_rate
      - train.batch_size
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
# Each stage declares what it needs (deps) and what it creates (outs)
# The params section triggers a rerun if the listed hyperparams change
# params.yaml is tracked through params, not deps, so only the listed keys matter
# Metrics use cache: false so metrics.json is committed to Git instead of the DVC cache
Tool vitals
dvc repro (run the pipeline)
dvc.yaml (pipeline definition)
dvc dag (print the stage graph)
Integration notes
DVC pipelines work alongside MLflow: while dvc repro runs the train stage, your training script logs metrics to MLflow via mlflow.log_metrics(). MLflow experiment tracking answers "which hyperparams produced the best metrics?" while DVC pipeline lineage answers "what data produced this model?". Use DVC for data versioning and pipeline orchestration, and MLflow for experiment tracking and the model registry. In production, use dvc dag to document the pipeline (dvc dag --mermaid emits a Mermaid diagram), then run dvc repro from your CI/CD system (GitHub Actions or GitLab CI) so retraining happens automatically when code or data changes.
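One possible way to connect the two tools is to log the pipeline's metrics file to MLflow at the end of a run. This is a sketch under assumptions: it presumes metrics.json holds a (possibly nested) dictionary of numbers, and flatten_metrics and log_metrics_file are hypothetical helper names. Only mlflow.start_run() and mlflow.log_metrics() are MLflow calls.

```python
import json


def flatten_metrics(metrics, prefix=""):
    """Flatten nested metrics into {"loss.train": 0.2} form; non-numeric values are dropped."""
    flat = {}
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, prefix=f"{name}."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


def log_metrics_file(path="metrics.json"):
    """Log the metrics file written by the pipeline to a new MLflow run."""
    import mlflow  # imported lazily so flatten_metrics works without MLflow installed

    with open(path) as handle:
        flat = flatten_metrics(json.load(handle))
    with mlflow.start_run():
        mlflow.log_metrics(flat)
    return flat
```

Calling log_metrics_file() at the end of the script that writes metrics.json keeps the values stored by DVC and the values shown in MLflow identical, since both come from the same file.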
Migration path
If you outgrow DVC (for example, pipelines with 100+ stages or complex resource constraints), migrate orchestration to Airflow or Kubeflow. DVC then becomes the data versioning layer (dvc pull / dvc push) while Airflow or Kubeflow orchestrates the stages. Alternatively, use MLflow Projects with Kubernetes for distributed training without DVC pipelines, at the cost of the local reproducibility that dvc repro provides. For most teams, DVC pipelines are sufficient.
Common gotcha
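If your code reads a hyperparameter from params.yaml but the stage does not list it in its params: section, DVC will NOT rerun training when that value changes. The stage appears up to date, and the model and metrics you keep using come from the old value without any warning. dvc dag only shows the stage graph, so check the stage declaration itself (params: - train.learning_rate and so on); git diff params.yaml shows what you changed, and dvc status shows whether DVC noticed. Second gotcha: metrics files default to cache: true, which stores metrics.json in the DVC cache and git-ignores it, so it does not appear in Git diffs or code review. Use cache: false for small metrics files so they are committed to Git next to dvc.lock, and declare each metrics file as an output of exactly one stage, because DVC rejects a pipeline in which two stages write the same path.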
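A small check can surface this gotcha before it bites. The sketch below is a hypothetical helper, not a DVC feature: it takes the already-parsed contents of params.yaml and dvc.yaml as dictionaries and lists every parameter key that no stage declares. It assumes parameters are declared as plain strings against the default params.yaml; entries for other parameter files are ignored.

```python
def flatten_keys(tree, prefix=""):
    """Yield dotted keys (for example "train.epochs") for every leaf in a nested dict."""
    for key, value in tree.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten_keys(value, prefix=f"{name}.")
        else:
            yield name


def undeclared_params(params, pipeline):
    """Return the sorted parameter keys that no stage lists under params."""
    declared = set()
    for stage in pipeline.get("stages", {}).values():
        for entry in stage.get("params", []):
            if isinstance(entry, str):  # dict entries refer to other params files
                declared.add(entry)
    missing = []
    for key in flatten_keys(params):
        # A declared group such as "train" covers every key beneath it.
        covered = any(key == name or key.startswith(name + ".") for name in declared)
        if not covered:
            missing.append(key)
    return sorted(missing)


# Hypothetical example: a train stage that declares only two of its hyperparameters.
EXAMPLE_PARAMS = {
    "train": {"learning_rate": 0.001, "batch_size": 32, "epochs": 10},
    "preprocess": {"test_split": 0.2},
}
EXAMPLE_PIPELINE = {
    "stages": {
        "prepare": {"cmd": "python src/prepare.py data/raw data/prepared"},
        "train": {"params": ["train.learning_rate", "train.batch_size"]},
    }
}
```

In a real repository the two dictionaries would come from parsing the files, for example with yaml.safe_load if PyYAML is installed. An undeclared key is not always a bug, since a value may be unused by any stage, but each one deserves a look.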
Team adoption
On day one, have the team commit dvc.yaml and dvc.lock to Git and push the raw data to a DVC remote (S3/GCS) with dvc push. Run dvc repro locally to verify the pipeline works. Keep a template dvc.yaml in a shared repo for new team members to copy. Enforce that every hyperparameter lives in params.yaml and is listed under the params: of the stage that reads it, and that every new stage declares deps and outs. Validate dvc.yaml before commit with a pre-commit hook in .pre-commit-config.yaml, for example a YAML or schema check. A syntax check alone will not notice a parameter missing from params:, so pair it with a review checklist or a small custom check. This prevents the silent-failure gotcha where someone forgets to declare a parameter and keeps using a model trained on stale hyperparams.
Experienced dev note
Use dvc plots to visualize metrics across experiment runs without context-switching to Jupyter. After dvc repro, run dvc plots show to render the plots for the current workspace, or dvc plots diff to compare accuracy/loss curves across commits. Plots need series data (for example one row per epoch in a CSV or JSON file), which is separate from the scalar values in metrics.json. Also, do not hand-write hash, md5, size, or nfiles under outs in dvc.yaml: DVC records them in dvc.lock itself. For a directory output it hashes every file inside, so a change to any file in the directory marks the output as changed.
Check your understanding
You change the learning rate in params.yaml and run dvc repro. DVC reports every stage, including train, as unchanged and skips it. What went wrong, and how would you verify the issue?
Show answer hint
The train stage likely doesn't declare the learning_rate parameter in its params: section, so DVC doesn't know to rerun it when the hyperparameter changes. Verify by inspecting the train stage in dvc.yaml and running dvc status: it reports the pipeline as up to date even though git diff params.yaml shows the changed value. The fix is to add params: - train.learning_rate to the train stage.