Tool Advanced medium · 8 min config

Level 1: automated training

What you will learn

Define reproducible training pipelines with DVC that track data, code, and outputs as a DAG for automated retraining.

Why this matters

Without pipeline versioning, retraining becomes manual and fragile: you can't tell which data version trained which model, regressions go undetected, and team members accidentally retrain on stale datasets. DVC pipelines make training reproducible and auditable.

Skip if: you're in pure research mode with no reproducibility requirements; ad-hoc scripts are fine there. Once you need to retrain in production, ship model versions to another team, or audit what data trained a model, you need a versioned pipeline such as DVC's.

Explanation

DVC pipelines declaratively specify how data flows through training stages (preprocessing → train → evaluate) and record a hash for every input and output. Unlike CI/CD pipelines, which are typically triggered by code changes, a DVC pipeline reruns a stage when its data, code, or parameters have changed. The stage definitions live in dvc.yaml and the recorded hashes in dvc.lock; both files are committed to Git. When you run dvc repro, DVC compares the current inputs against the hashes in dvc.lock; stages whose inputs are unchanged are skipped, so expensive work is not repeated. The outputs themselves are stored in the DVC cache and remote storage rather than in Git, which sidesteps hosting limits on large files (GitHub, for example, rejects files over 100 MB). The result is an audit trail: to answer "what data trained model v2.3?", check out that version's commit and read the dependency hashes in dvc.lock.
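The skip decision described above can be pictured with a small sketch. This is a simplified model of the idea, not DVC's actual implementation: file_md5 and stage_is_up_to_date are hypothetical helper names, and the recorded dictionary stands in for the hashes a previous run would have saved.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stage_is_up_to_date(deps: list[Path], recorded: dict[str, str]) -> bool:
    """True only if every dependency's current hash matches the recorded one."""
    return all(recorded.get(str(p)) == file_md5(p) for p in deps)


# Demo with a throwaway file standing in for data/raw/train.csv.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("x,y\n1,2\n")
    recorded = {str(data): file_md5(data)}  # what a previous run saved

    unchanged = stage_is_up_to_date([data], recorded)  # same bytes: skip the stage

    data.write_text("x,y\n1,2\n3,4\n")  # the data changes
    changed = not stage_is_up_to_date([data], recorded)  # hash differs: rerun
```

The point of the sketch is that the decision depends only on content hashes, not on timestamps or on who edited the file.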

Configuration

yaml

# dvc.yaml - Complete training pipeline with three stages
# Hashes, sizes, and file counts are not written here: DVC records them in dvc.lock.
stages:
  prepare:
    cmd: python src/prepare.py data/raw data/prepared
    deps:
      - data/raw/train.csv
      - src/prepare.py
    outs:
      - data/prepared

  train:
    # ${train.epochs} is filled in from params.yaml
    cmd: python src/train.py data/prepared models/model.pkl --epochs ${train.epochs}
    deps:
      - data/prepared
      - src/train.py
      - src/config.yaml
    params:
      - train.learning_rate
      - train.batch_size
      - train.epochs
    outs:
      - models/model.pkl

  evaluate:
    # metrics.json is declared once, in the stage that writes it;
    # DVC rejects the same output declared in two stages
    cmd: python src/evaluate.py models/model.pkl data/prepared metrics.json
    deps:
      - models/model.pkl
      - data/prepared
      - src/evaluate.py
    metrics:
      - metrics.json:
          cache: false

# params.yaml - Separate hyperparameters that trigger retraining
train:
  learning_rate: 0.001
  batch_size: 32
  epochs: 10
preprocess:
  test_split: 0.2

Why this order?

The order in which stages appear in dvc.yaml does not matter. DVC builds the DAG from the deps and outs declarations: prepare runs first because train depends on its output (data/prepared), and train runs before evaluate because evaluation needs the trained model. The params.yaml file is kept separate because DVC tracks the specific keys listed under each stage's params: section; if you bump learning_rate, the next dvc repro reruns train and every stage downstream of it.
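To see how an execution order can fall out of deps and outs alone, here is a minimal sketch using Python's standard graphlib module. It illustrates the idea rather than DVC's own code; the stages dictionary is a hand-written stand-in for a parsed dvc.yaml, deliberately listed in the "wrong" order, and execution_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Stand-in for a parsed dvc.yaml, listed out of order on purpose.
stages = {
    "evaluate": {
        "deps": ["models/model.pkl", "data/prepared", "src/evaluate.py"],
        "outs": ["metrics.json"],
    },
    "train": {
        "deps": ["data/prepared", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "prepare": {
        "deps": ["data/raw/train.csv", "src/prepare.py"],
        "outs": ["data/prepared"],
    },
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that each runs after the stages producing its deps."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage's predecessors are the producers of its deps, if any.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Dependencies that no stage produces (source files, raw data) simply contribute no edge, which is why they can be listed freely under deps.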

Wrong vs Right

Wrong way

yaml

stages:
  train:
    cmd: python train.py
  prepare:
    cmd: python prepare.py
# Missing deps/outs: DVC can't detect the prepare→train dependency
# No params section: hyperparameter changes don't trigger retraining
# No metrics declared: DVC doesn't track metrics.json as a pipeline output
outs:
  - models/model.pkl
# Top-level outs is not valid in dvc.yaml: outputs must be declared under a stage

Right way

yaml

stages:
  prepare:
    cmd: python src/prepare.py data/raw data/prepared
    deps:
      - data/raw/train.csv
      - src/prepare.py
    outs:
      - data/prepared

  train:
    cmd: python src/train.py data/prepared models/model.pkl
    deps:
      - data/prepared
      - src/train.py
    params:
      - train.learning_rate
      - train.batch_size
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
# Each stage declares what it needs (deps) and what it creates (outs)
# Params section triggers rerun if hyperparams change
# Metrics use cache: false so the small metrics file is committed to Git instead of the DVC cache
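DVC does not produce metrics.json itself; the stage's script has to write it. A minimal sketch of that step follows, assuming flat accuracy and loss values. The write_metrics helper and the metric names are illustrative, not taken from the original scripts.

```python
import json
from pathlib import Path


def write_metrics(path, accuracy: float, loss: float) -> Path:
    """Write a flat JSON metrics file and return its path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create output dir if needed
    payload = {"accuracy": accuracy, "loss": loss}
    path.write_text(json.dumps(payload, indent=2) + "\n")
    return path
```

A training or evaluation script would call write_metrics("metrics.json", accuracy, loss) as its last step, so the file is rewritten every time the stage runs.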

Tool vitals

Primary command

bash

dvc repro

Config file dvc.yaml

Verify

bash

dvc dag

Integration notes

DVC pipelines pair well with MLflow: while dvc repro runs, your training script can log parameters and metrics to MLflow via mlflow.log_params() and mlflow.log_metrics(). MLflow experiment tracking answers "which hyperparams produced the best metrics?" while DVC pipeline lineage answers "what data produced this model?". Use DVC for data versioning and pipeline orchestration, MLflow for experiment tracking and model registry. In production, use dvc dag to document the pipeline (it prints an ASCII graph by default; dvc dag --dot emits Graphviz DOT), then run dvc repro from your CI/CD (GitHub Actions / GitLab CI) so retraining happens automatically when code or data changes.
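One way to connect the two tools is to log the values from metrics.json to MLflow at the end of a stage. The sketch below assumes the mlflow package is installed and a tracking location is configured; load_metrics and log_to_mlflow are hypothetical helpers, and mlflow is imported lazily so the file-reading part works without it.

```python
import json
from pathlib import Path


def load_metrics(path) -> dict:
    """Read a metrics file and keep only its top-level numeric values."""
    raw = json.loads(Path(path).read_text())
    return {
        key: float(value)
        for key, value in raw.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def log_to_mlflow(params: dict, metrics: dict) -> None:
    """Log one run's params and metrics (assumes mlflow is installed)."""
    import mlflow  # lazy import: only needed when actually logging

    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
```

A script could then call log_to_mlflow({"learning_rate": 0.001, "batch_size": 32}, load_metrics("metrics.json")) once the metrics file has been written.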

Migration path

If you outgrow DVC (scaling to 100+ pipeline stages, complex resource constraints), migrate to Airflow or Kubeflow for orchestration. DVC then becomes the data versioning layer (dvc pull/push) while Airflow orchestrates the stages. Alternatively, use MLflow Projects + Kubernetes for distributed training without DVC pipelines, at the cost of local reproducibility. For most teams, DVC pipelines are enough.

Common gotcha

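If you change a hyperparameter in params.yaml but the key is not listed in the stage's params: section, DVC will NOT rerun training. The stage appears up to date and you keep the model trained with the old hyperparameters without realizing it. dvc dag won't reveal this, because it shows stages, not parameters: run dvc params diff to confirm the value changed, then check that the stage declaration lists it (params: - train.learning_rate etc.). Second gotcha: cache: false does not control freshness. A metrics file is rewritten whenever its stage reruns either way. With the default cache: true the file is stored in the DVC cache and git-ignored; cache: false leaves it in the workspace so you can commit it to Git. Use cache: false for small files such as metrics.json that you want versioned in Git, and keep the default for large artifacts such as models.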
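The first gotcha can be checked mechanically. The sketch below is a hypothetical lint, not a DVC feature: it takes params.yaml and dvc.yaml as already-parsed dictionaries (loading them with a YAML library is left out) and reports parameter keys that no stage declares. It only handles the plain string form of params entries.

```python
def flatten(d: dict, prefix: str = "") -> set[str]:
    """Flatten nested dicts into dotted keys, e.g. {'train': {'lr': 1}} -> {'train.lr'}."""
    keys = set()
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            keys |= flatten(value, name + ".")
        else:
            keys.add(name)
    return keys


def undeclared_params(params: dict, pipeline: dict) -> list[str]:
    """Return params.yaml keys that no stage lists under params:."""
    declared = set()
    for stage in pipeline.get("stages", {}).values():
        for entry in stage.get("params", []):
            if isinstance(entry, str):  # dict entries (custom files) are ignored here
                declared.add(entry)
    return sorted(
        key
        for key in flatten(params)
        # A declared prefix such as "train" covers every key beneath it.
        if not any(key == d or key.startswith(d + ".") for d in declared)
    )


params = {
    "train": {"learning_rate": 0.001, "batch_size": 32, "epochs": 10},
    "preprocess": {"test_split": 0.2},
}
pipeline = {
    "stages": {
        "prepare": {"cmd": "python src/prepare.py data/raw data/prepared"},
        "train": {"params": ["train.learning_rate", "train.batch_size"]},
    }
}
missing = undeclared_params(params, pipeline)
print(missing)
```

With these inputs the check flags train.epochs and preprocess.test_split: changing either value would not cause any stage to rerun.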

Team adoption

On day one, commit dvc.yaml and dvc.lock to Git and push the raw data to a DVC remote (S3/GCS). Have each engineer run dvc pull and dvc repro locally to verify the pipeline works. Create a template dvc.yaml in a shared repo so new team members can copy it. Enforce that hyperparameters live in params.yaml and are listed under the relevant stage's params: section, and that every new stage declares deps and outs. Pre-commit hooks can help: dvc install --use-pre-commit-tool adds DVC's own hooks to .pre-commit-config.yaml, and a custom hook can validate dvc.yaml or flag params.yaml keys that no stage declares. This guards against the silent-failure gotcha where someone forgets to declare a parameter and keeps a model trained on stale hyperparams.

Experienced dev note

Use dvc plots to visualize metrics without context-switching to Jupyter. After dvc repro, run dvc plots show to render the current plots, or dvc plots diff with Git revisions to compare accuracy/loss curves across commits. Also, you don't need to set hash: md5 on outputs in dvc.yaml: DVC writes the hash fields into dvc.lock itself, and a directory output is hashed from the files inside it, so changed contents are detected automatically. If a stage looks up to date when it shouldn't, check that the file you changed is actually declared under deps.

Check your understanding

You change the learning rate in params.yaml and run dvc repro. DVC reports the train stage as up to date and skips it. What went wrong, and how would you verify the issue?

Answer hint

The train stage likely doesn't declare the learning_rate parameter in its params: section, so DVC doesn't know to rerun it when the hyperparameter changes. Verify by running dvc params diff, which shows that the value changed, and then inspecting the train stage in dvc.yaml to confirm that train.learning_rate is not listed. The fix is to add params: - train.learning_rate to the train stage.
