Tool Intermediate medium · 8 min integration

Phase 5: Automation: DVC Pipelines & GitOps

What you will learn

Automate end-to-end ML workflows with DVC pipelines triggered by Git commits, eliminating manual retraining and tracking data lineage automatically.

Why this matters

Without pipeline automation, teams manually run training scripts, forget to track which data version produced which model, lose reproducibility between commits, and waste compute resources on redundant runs. DVC pipelines couple Git versioning to data and model artifacts, ensuring every commit has a traceable, reproducible workflow.

Skip if: you are running one-off experiments or a single-person project, where cron jobs or plain shell scripts are enough. DVC pipelines are also unnecessary if your workflow has no data versioning needs, or if a managed ML platform (Vertex AI, SageMaker) already handles orchestration. For simple feature engineering without model retraining, a Makefile may suffice.

Explanation

DVC pipelines define stages (fetch data, preprocess, train, evaluate) in a dvc.yaml file, each with a command, dependencies, and outputs. Git does not run anything by itself: a CI job or webhook fires on each commit and calls dvc repro. DVC then compares the hashes of each stage's dependencies and parameters against those recorded in dvc.lock and re-runs only the stages whose inputs changed, plus any downstream stages whose inputs change as a result. This replaces manual python train.py calls and makes runs reproducible: the same commit and data version rebuild the same model, provided the training code is deterministic (fixed seeds, pinned package versions). MLflow complements this: the training script logs metrics and models to MLflow, while DVC tracks data lineage. The workflow becomes: develop → commit → CI trigger → dvc repro → MLflow logs → model registry update. This is the backbone of GitOps for ML: the Git repository is the source of truth for code, parameters, and pointers to data and model versions.
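
The "re-run only what changed" idea can be sketched in a few lines of Python. This is a simplified illustration, not DVC's actual implementation: the function names are hypothetical, and a plain dictionary stands in for the hashes DVC records in dvc.lock.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Content hash of one file; a missing file maps to a fixed marker."""
    if not path.exists():
        return "missing"
    return hashlib.md5(path.read_bytes()).hexdigest()


def record_hashes(deps: list[Path]) -> dict[str, str]:
    """Snapshot of dependency hashes (plays the role of dvc.lock here)."""
    return {str(dep): file_md5(dep) for dep in deps}


def stage_is_stale(deps: list[Path], recorded: dict[str, str]) -> bool:
    """True if any dependency's current hash differs from the recorded one."""
    return any(file_md5(dep) != recorded.get(str(dep)) for dep in deps)
```

A stage whose dependencies all match the snapshot is skipped; editing any one of them makes stage_is_stale return True, which corresponds to the stage being re-run.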

Configuration

yaml

stages:
  fetch_data:
    cmd: python src/fetch_data.py
    deps:
      - src/fetch_data.py
    outs:
      # A path may be declared only once per stage, so the dataset is listed
      # under outs and not repeated under plots.
      - data/raw/dataset.csv

  preprocess:
    cmd: python src/preprocess.py --input data/raw/dataset.csv --output data/processed/train.parquet
    deps:
      - data/raw/dataset.csv
      - src/preprocess.py
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet
    params:
      - preprocess.train_split
      - preprocess.random_seed

  train:
    cmd: python src/train.py --train data/processed/train.parquet --model models/model.pkl
    deps:
      - data/processed/train.parquet
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
    params:
      - train.learning_rate
      - train.max_depth

  evaluate:
    cmd: python src/evaluate.py --model models/model.pkl --test data/processed/test.parquet --output eval.json
    deps:
      - models/model.pkl
      - data/processed/test.parquet
      - src/evaluate.py
    metrics:
      - eval.json:
          cache: false

# Stage-level keys such as train.learning_rate are read from params.yaml by default.
params:
  - params.yaml

Why this order?

DVC derives execution order from each stage's deps and outs, not from the order of keys in dvc.yaml: fetch_data produces raw data → preprocess consumes it and produces train/test splits → train consumes train.parquet → evaluate consumes both the model and test.parquet. Declaring stages in that same order is optional, but it makes dvc.yaml easier to read alongside the dvc dag output. The position of the top-level params block does not matter either.
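
To illustrate how an execution order can be recovered from deps and outs alone, here is a minimal sketch using Python's standard graphlib module. The stage paths are copied from the configuration above; the stages are deliberately listed in scrambled order, and the helper name execution_order is hypothetical. DVC's own graph builder is more involved than this.

```python
from graphlib import TopologicalSorter

# Stages listed out of order on purpose; only deps and outs matter.
stages = {
    "evaluate": {
        "deps": ["models/model.pkl", "data/processed/test.parquet", "src/evaluate.py"],
        "outs": [],
    },
    "train": {
        "deps": ["data/processed/train.parquet", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "fetch_data": {
        "deps": ["src/fetch_data.py"],
        "outs": ["data/raw/dataset.csv"],
    },
    "preprocess": {
        "deps": ["data/raw/dataset.csv", "src/preprocess.py"],
        "outs": ["data/processed/train.parquet", "data/processed/test.parquet"],
    },
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so that every producer runs before its consumers."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # A stage depends on whichever stages produce its deps; plain source
    # files have no producer and are ignored.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
```

Because this pipeline is a single chain, the order is the same no matter how the dictionary is arranged: fetch_data, preprocess, train, evaluate.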

Wrong vs Right

Wrong way

yaml

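stages:
  train:
    cmd: python src/train.py
    # No deps: DVC cannot link train to preprocess or notice script changes.
    outs:
      - models/model.pkl

  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/dataset.csv
      # src/preprocess.py is not listed, so editing it does not trigger a re-run.
    outs:
      - data/processed/train.parquet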

Right way

yaml

stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/dataset.csv
      - src/preprocess.py
    outs:
      - data/processed/train.parquet

  train:
    cmd: python src/train.py --train data/processed/train.parquet
    deps:
      - data/processed/train.parquet
      - src/train.py
    outs:
      - models/model.pkl

Tool vitals

Primary command

bash

dvc repro

Config file dvc.yaml

Verify

bash

dvc dag

Integration notes

DVC pipelines feed directly into MLflow: while dvc repro runs the train stage, the training script can call mlflow.log_metrics() and a flavor-specific logger such as mlflow.sklearn.log_model() to record the run; passing registered_model_name also registers the model in MLflow's registry. A CI/CD job (GitHub Actions, GitLab CI) can run dvc pull and dvc repro on every commit, with results published to MLflow. Docker containerizes the pipeline environment, so the same dvc.yaml runs in CI/CD containers with the same dependencies as on a developer machine.
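
A minimal sketch of the logging step, assuming the train stage has already written metrics.json and models/model.pkl as in the configuration above. The helper names and the experiment name "dvc-pipeline" are hypothetical. The model is logged as a plain artifact here, which records the file with the run but does not register it in the model registry.

```python
import json
from pathlib import Path


def load_metrics(path: str) -> dict[str, float]:
    """Read a flat metrics file and keep only its numeric values."""
    raw = json.loads(Path(path).read_text())
    return {
        key: float(value)
        for key, value in raw.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def log_run(metrics_path: str, model_path: str, experiment: str = "dvc-pipeline") -> None:
    """Log metrics and the model file to MLflow at the end of train.py."""
    # Imported lazily so load_metrics works without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        mlflow.log_metrics(load_metrics(metrics_path))
        mlflow.log_artifact(model_path)
```

Calling log_run("metrics.json", "models/model.pkl") as the last step of src/train.py would give one MLflow run per execution of the train stage.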

Migration path

If moving to Airflow or Kubeflow: export the task graph with dvc dag --dot, then map each DVC stage to an Airflow task with the same dependencies. Use Airflow's S3/GCS operators to fetch the artifacts DVC tracked instead of dvc pull; because a DVC remote stores files under their content hashes, look up each artifact's hash in dvc.lock to locate it. DVC remains useful for local reproduction; remove it from production orchestration.

Cost model

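DVC itself is free (open source). Cloud storage costs apply: S3, GCS, or Azure Blob Storage charge for stored artifacts. Estimate: 1 GB model × 100 versions = 100 GB of storage (about $2.30/month on S3 Standard at roughly $0.023 per GB-month). DVC's local cache avoids re-downloading files that are already present. DVC adds no per-run cost, though the compute that runs each stage and any data transfer are billed by your provider.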
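
The estimate above is simple enough to script for your own model sizes. The default rate of $0.023 per GB-month is an assumption implied by the $2.30 figure; check current pricing for your provider and region before relying on it.

```python
def monthly_storage_cost(size_gb: float, versions: int, usd_per_gb_month: float = 0.023) -> float:
    """Storage cost per month, assuming every version is stored in full."""
    return round(size_gb * versions * usd_per_gb_month, 2)


# The estimate from the text: a 1 GB model kept in 100 versions.
print(monthly_storage_cost(1, 100))
```

This is an upper bound for the storage line item: DVC stores files by content hash, so versions that are byte-identical are stored only once.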

Common gotcha

If a stage's script (e.g., src/train.py) is not listed under deps:, DVC does not know the stage depends on it: edit the script and dvc repro still skips the stage and reuses the cached output. Many teams forget to list source files as dependencies, then discover that stale models are being used. Always include - src/script_name.py in every stage's deps section, along with any local modules the script imports.
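
A check for this gotcha can run in CI. The sketch below is a hypothetical lint, not a DVC feature: it assumes the stages mapping of dvc.yaml has already been parsed into a dictionary (for example with PyYAML's yaml.safe_load) and that each cmd is a single string.

```python
import shlex


def missing_script_deps(stages: dict) -> dict[str, list[str]]:
    """Map stage name to .py files named in cmd but absent from deps."""
    problems = {}
    for name, stage in stages.items():
        # Any command token ending in .py is treated as a script the stage runs.
        scripts = [tok for tok in shlex.split(stage.get("cmd", "")) if tok.endswith(".py")]
        deps = set(stage.get("deps", []))
        missing = [script for script in scripts if script not in deps]
        if missing:
            problems[name] = missing
    return problems
```

An empty result means every script named in a command is also tracked as a dependency. The check does not see modules that a script imports, so those still need a manual review.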

Team adoption

1) Add a remote-storage step to onboarding: commit the shared remote in .dvc/config so all team members push and pull from the same S3 or GCS bucket.
2) Pin the DVC version in requirements.txt (dvc==3.50.0) to prevent version mismatch issues.
3) Run dvc dag in CI/CD to visualize the pipeline on every PR; include it in code review checklists.
4) Make dvc repro mandatory before committing: add a pre-commit hook that runs dvc status and warns if outputs are stale.
5) Document which stage produces which MLflow experiment run; teams often forget the mapping.
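
The pre-commit check in step 4 could look roughly like the sketch below. It assumes that dvc status --json prints a JSON object keyed by stale stage or file, and an empty object when everything is up to date; verify this against your pinned DVC version. The function names are hypothetical, and the run parameter exists only so the logic can be exercised without DVC installed.

```python
import json
import subprocess
import sys


def stale_entries(run=subprocess.run) -> list[str]:
    """Names reported by `dvc status --json`; an empty list means up to date."""
    result = run(["dvc", "status", "--json"], capture_output=True, text=True, check=True)
    report = json.loads(result.stdout or "{}")
    return sorted(report)


def warn_if_stale(run=subprocess.run) -> int:
    """Print one warning per stale entry and return a hook exit code."""
    for name in stale_entries(run):
        print(f"warning: {name} is out of date; run `dvc repro`", file=sys.stderr)
    return 0  # warn only; return 1 instead to block the commit
```

A hook script would finish with sys.exit(warn_if_stale()). Returning 0 keeps the hook advisory, which matches the "warns if outputs are stale" wording above.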

Experienced dev note

Use params.yaml or separate YAML files and reference them in dvc.yaml with params: keys instead of hardcoding hyperparameters. This decouples config from pipeline definition: changing train.learning_rate in params.yaml automatically invalidates the train stage on the next dvc repro, forcing a retrain. Without this pattern, teams edit hyperparameters in Python scripts, forget the pipeline is cached, and train with old settings. Also: for outputs that should survive between runs, such as training checkpoints, set persist: true on the entry under outs: so DVC does not delete the file before re-running the stage; set cache: false on outputs you do not want stored in the DVC cache.
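
Only the keys a stage lists under params: affect that stage. The sketch below models that behavior with plain dictionaries standing in for two versions of params.yaml; it is an illustration with hypothetical helper names, not DVC's code.

```python
from functools import reduce


def lookup(params: dict, dotted_key: str):
    """Resolve a dotted key such as 'train.learning_rate' in nested params."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), params)


def tracked_values(params: dict, keys: list[str]) -> dict:
    """Values of just the keys a stage declares under params:."""
    return {key: lookup(params, key) for key in keys}


def params_changed(old: dict, new: dict, keys: list[str]) -> bool:
    """True if any tracked key differs between two versions of the params."""
    return tracked_values(old, keys) != tracked_values(new, keys)
```

With the train stage tracking train.learning_rate and train.max_depth, a new learning rate marks the stage as changed, while an edit to preprocess.random_seed leaves it untouched.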

Check your understanding

You modify src/preprocess.py and commit it. When you run dvc repro, which stages will re-execute and why?

Answer hint

The preprocess stage will re-execute because src/preprocess.py is listed in its deps:. The train and evaluate stages re-execute as well if preprocess writes different train.parquet or test.parquet files, because those files are their dependencies. If the new outputs are byte-for-byte identical to the old ones, DVC skips the downstream stages. fetch_data does not re-run, since none of its dependencies changed.
