Phase 5: Automation: DVC Pipelines & GitOps
Why this matters
Without pipeline automation, teams manually run training scripts, forget to track which data version produced which model, lose reproducibility between commits, and waste compute resources on redundant runs. DVC pipelines couple Git versioning to data and model artifacts, ensuring every commit has a traceable, reproducible workflow.
Explanation
DVC pipelines define stages (fetch data, preprocess, train, evaluate) in a dvc.yaml file, each with a command, dependencies, and outputs. Git does not run anything by itself: you run dvc repro locally, or a CI job runs it on each push. DVC then compares the hashes of every stage's dependencies and parameters against those recorded in dvc.lock and re-runs only the stages whose inputs changed, along with any downstream stages whose inputs change as a result. This replaces ad hoc python train.py calls and makes runs reproducible: the same commit and data version re-create the same model, provided the code is deterministic (fixed seeds, pinned library versions). MLflow complements this: each pipeline run logs metrics and models to MLflow, while DVC tracks data lineage. The workflow becomes: develop → commit → CI trigger → dvc repro → MLflow logging → model registry update. This is the backbone of GitOps for ML: your Git repository becomes the source of truth for code, parameters, and pointers to data and model versions.
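To make the "re-run only what changed" idea concrete, here is a minimal sketch of hash-based staleness detection. It is a simplified illustration, not DVC's actual implementation: the stage mapping, the in-memory file contents, and the helper names `fingerprint` and `stale_stages` are all hypothetical, and it does not model how a re-run stage changes the inputs of downstream stages.

```python
import hashlib


def fingerprint(contents: dict, paths: list) -> dict:
    """Map each dependency path to a hash of its current contents."""
    return {p: hashlib.sha256(contents[p]).hexdigest() for p in paths}


def stale_stages(stages: dict, contents: dict, lock: dict) -> list:
    """Return stages whose dependency hashes differ from the recorded ones."""
    return [
        name
        for name, deps in stages.items()
        if fingerprint(contents, deps) != lock.get(name)
    ]


# Hypothetical stage -> deps mapping, mirroring a dvc.yaml.
stages = {
    "preprocess": ["data/raw/dataset.csv", "src/preprocess.py"],
    "train": ["data/processed/train.parquet", "src/train.py"],
}
# In-memory stand-ins for file contents (assumption: real files on disk).
contents = {
    "data/raw/dataset.csv": b"id,feature_count\n1,3\n",
    "src/preprocess.py": b"print('preprocess v1')",
    "data/processed/train.parquet": b"train-split-v1",
    "src/train.py": b"print('train v1')",
}
# Record hashes after a first successful run (the role dvc.lock plays).
lock = {name: fingerprint(contents, deps) for name, deps in stages.items()}

contents["src/train.py"] = b"print('train v2')"  # edit the training script
print(stale_stages(stages, contents, lock))  # ['train']
```

Only train is reported because only one of its recorded hashes no longer matches; preprocess is untouched and would be skipped.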
Configuration
stages:
  fetch_data:
    cmd: python src/fetch_data.py
    deps:
      - src/fetch_data.py
    # Listed under plots only: plots entries are tracked as stage outputs,
    # so repeating the path under outs would declare the same output twice.
    plots:
      - data/raw/dataset.csv:
          x: id
          y: feature_count
  preprocess:
    cmd: python src/preprocess.py --input data/raw/dataset.csv --output data/processed/train.parquet
    deps:
      - data/raw/dataset.csv
      - src/preprocess.py
    outs:
      - data/processed/train.parquet
      - data/processed/test.parquet
    params:
      - preprocess.train_split
      - preprocess.random_seed
  train:
    cmd: python src/train.py --train data/processed/train.parquet --model models/model.pkl
    deps:
      - data/processed/train.parquet
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
    params:
      - train.learning_rate
      - train.max_depth
  evaluate:
    cmd: python src/evaluate.py --model models/model.pkl --test data/processed/test.parquet --output eval.json
    deps:
      - models/model.pkl
      - data/processed/test.parquet
      - src/evaluate.py
    metrics:
      - eval.json:
          cache: false
    # No params here: parameter changes reach this stage through the
    # regenerated model and test split listed in deps.
Why this order?
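The stages form a dependency chain: fetch_data produces the raw data → preprocess consumes it and produces the train/test splits → train consumes train.parquet → evaluate consumes both the model and test.parquet. DVC derives the execution order from each stage's deps and outs, not from the order of declaration in dvc.yaml, so listing stages out of order does not break anything. Declaring them in logical order simply makes the file easier to review against the dvc dag output. Parameters are declared per stage under params:, and their position in the file has no effect on ordering either.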
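The ordering can be derived mechanically from deps and outs. The sketch below uses Python's standard graphlib to do so; it is an illustration of the idea, not DVC's code, and the `execution_order` helper and the trimmed stage dictionary (scripts omitted, stages deliberately scrambled) are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Stages deliberately declared out of order; scripts omitted for brevity.
stages = {
    "evaluate": {
        "deps": ["models/model.pkl", "data/processed/test.parquet"],
        "outs": ["eval.json"],
    },
    "train": {
        "deps": ["data/processed/train.parquet"],
        "outs": ["models/model.pkl"],
    },
    "fetch_data": {"deps": [], "outs": ["data/raw/dataset.csv"]},
    "preprocess": {
        "deps": ["data/raw/dataset.csv"],
        "outs": ["data/processed/train.parquet", "data/processed/test.parquet"],
    },
}


def execution_order(stages: dict) -> list:
    """Order stages so that each runs after the stages producing its deps."""
    # Which stage produces each output path.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For every stage, the set of upstream stages it must wait for.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
# ['fetch_data', 'preprocess', 'train', 'evaluate']
```

The result is the same regardless of how the dictionary is ordered, which is why declaration order is a readability choice only.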
Wrong vs Right
Wrong: train declares no deps, so DVC cannot link it to preprocess or notice changes to the script or data; preprocess omits its own script.
stages:
  train:
    cmd: python src/train.py
    outs:
      - models/model.pkl
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/dataset.csv
    outs:
      - data/processed/train.parquet
Right: every stage lists both its script and its input data in deps.
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw/dataset.csv
      - src/preprocess.py
    outs:
      - data/processed/train.parquet
  train:
    cmd: python src/train.py --train data/processed/train.parquet
    deps:
      - data/processed/train.parquet
      - src/train.py
    outs:
      - models/model.pkl
Tool vitals
Run the pipeline: dvc repro
Pipeline definition: dvc.yaml
Show the stage graph: dvc dag
Integration notes
DVC pipelines feed directly into MLflow: while dvc repro runs the train stage, the training script can call mlflow.log_metrics() and log the model with a flavor-specific call such as mlflow.sklearn.log_model(), optionally registering it in the MLflow Model Registry. A CI workflow (GitHub Actions, GitLab CI) can run dvc repro on every push, with results published to MLflow. Docker containerizes the pipeline environment, so the same dvc.yaml runs in CI containers as it does locally.
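As a rough sketch of how a training script might serve both tools, the helper below writes the metrics file that dvc.yaml declares and, when asked, logs the same values to MLflow. The function name `record_results` and the `use_mlflow` switch are hypothetical; the MLflow calls used (`mlflow.start_run`, `mlflow.log_params`, `mlflow.log_metrics`) are part of its tracking API, and MLflow is imported only when the switch is on so the example runs without it installed.

```python
import json
from pathlib import Path


def record_results(
    params: dict,
    metrics: dict,
    metrics_path: str = "metrics.json",
    use_mlflow: bool = False,
) -> None:
    """Write metrics for DVC and optionally mirror them to MLflow."""
    # DVC side: the file declared under metrics: in dvc.yaml.
    Path(metrics_path).write_text(json.dumps(metrics, indent=2))

    if use_mlflow:
        # Optional dependency; assumes a tracking URI is already configured.
        import mlflow

        with mlflow.start_run():
            mlflow.log_params(params)
            mlflow.log_metrics(metrics)
```

A call such as record_results({"learning_rate": 0.1}, {"accuracy": 0.9}, use_mlflow=True) at the end of src/train.py would keep the DVC metrics file and the MLflow run in step; the parameter and metric values here are placeholders.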
Migration path
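If moving to Airflow or Kubeflow: export the stage graph with dvc dag --dot and map each DVC stage to one task with the same upstream dependencies. Tasks can fetch artifacts with Airflow's S3/GCS operators instead of dvc pull, but a DVC remote stores files under content-hash names, so either publish artifacts to plain paths or resolve the hash from dvc.lock first. DVC remains useful for local reproduction; remove it from production orchestration.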
Cost model
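DVC itself is free (open source). Cloud storage costs apply: S3, GCS, or Azure Blob Storage charge for stored artifacts. Estimate: 1 GB model × 100 versions = 100 GB of storage, about $2.30/month on S3 Standard at roughly $0.023 per GB-month. DVC stores each unique file once by content hash, and the local cache avoids re-downloading artifacts already fetched. DVC adds no per-run charges; request and data-transfer fees from the storage provider still apply.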
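The estimate above is a single multiplication, sketched here so the inputs can be swapped for your own. The price of $0.023 per GB-month is an assumption implied by the $2.30 figure; check your provider's current pricing and region before relying on it.

```python
def monthly_storage_cost(size_gb: float, versions: int, price_per_gb_month: float) -> float:
    """Upper-bound monthly cost if every version is stored in full."""
    return size_gb * versions * price_per_gb_month


# Assumed price: 0.023 USD per GB-month.
cost = monthly_storage_cost(size_gb=1, versions=100, price_per_gb_month=0.023)
print(f"${cost:.2f}/month")  # $2.30/month
```

This treats every version as a distinct full-size file, so it is a ceiling: versions with identical content are stored only once.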
Common gotcha
DVC only watches files listed under deps:. If you modify a script such as src/train.py that the stage's cmd runs but that is not declared as a dependency, dvc repro sees no change, considers the stage up to date, and keeps the cached output. Many teams forget to list source code files as dependencies, then discover stale models are being used. Always include the script (- src/script_name.py) in every stage's deps section, along with any local modules it imports.
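A small lint check can catch this before it bites. The sketch below scans each stage's command for .py files and reports any that are missing from deps. It is a hypothetical helper, not a DVC feature: it assumes dvc.yaml has already been parsed into a dictionary (for example with a YAML library) and that each cmd is a single string.

```python
import shlex


def missing_script_deps(stages: dict) -> dict:
    """Return {stage: [scripts run by cmd but absent from deps]}."""
    problems = {}
    for name, stage in stages.items():
        deps = set(stage.get("deps", []))
        # Any token in the command that looks like a Python script.
        scripts = [tok for tok in shlex.split(stage["cmd"]) if tok.endswith(".py")]
        missing = [script for script in scripts if script not in deps]
        if missing:
            problems[name] = missing
    return problems


# Hypothetical parsed dvc.yaml content: train forgets its script.
stages = {
    "preprocess": {
        "cmd": "python src/preprocess.py",
        "deps": ["data/raw/dataset.csv", "src/preprocess.py"],
    },
    "train": {
        "cmd": "python src/train.py --train data/processed/train.parquet",
        "deps": ["data/processed/train.parquet"],
    },
}

print(missing_script_deps(stages))  # {'train': ['src/train.py']}
```

The check only sees scripts named on the command line; modules imported by those scripts still have to be listed by hand.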
Team adoption
1) Add DVC remote setup to onboarding: commit the shared remote in .dvc/config so all team members push to and pull from the same S3 or GCS bucket.
2) Pin the DVC version in requirements.txt (dvc==3.50.0) to prevent version mismatch issues.
3) Run dvc dag in CI/CD to visualize the pipeline on every PR; include it in code review checklists.
4) Run dvc repro before committing: add a pre-commit hook that runs dvc status and warns if outputs are stale.
5) Document which stage produces which MLflow experiment run; teams often forget the mapping.
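The pre-commit check in step 4 could look roughly like the sketch below, which runs dvc status --json and prints a warning when any stage is reported. The function names are hypothetical, and the sample JSON is an assumed shape of the output (an object keyed by stage name, empty when everything is up to date); verify it against your DVC version.

```python
import json
import subprocess
import sys


def stale_from_status(status_json: str) -> list:
    """Return the stage names reported by `dvc status --json`."""
    status = json.loads(status_json.strip() or "{}")
    return sorted(status)


def main() -> int:
    """Warn about stale stages without blocking the commit."""
    try:
        result = subprocess.run(
            ["dvc", "status", "--json"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        print("dvc is not installed; skipping pipeline check", file=sys.stderr)
        return 0
    stale = stale_from_status(result.stdout)
    if stale:
        print("Warning: stale DVC stages: " + ", ".join(stale), file=sys.stderr)
    return 0  # return 1 here instead to make the hook blocking


# Assumed sample of the JSON output for one changed dependency.
sample = '{"train": [{"changed deps": {"src/train.py": "modified"}}]}'
print(stale_from_status(sample))  # ['train']
```

A hook entry point would call sys.exit(main()). Keeping the exit code at 0 matches the "warn" behaviour described above; teams that want a hard gate can return a non-zero code when stages are stale.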
Experienced dev note
Use params.yaml or separate YAML files and reference them in dvc.yaml with params: keys instead of hardcoding hyperparameters. This decouples config from pipeline definition: changing train.learning_rate in params.yaml invalidates the train stage on the next dvc repro, forcing a retrain. Without this pattern, teams edit hyperparameters in Python scripts that are not tracked as dependencies, the stage stays cached, and the model reflects the old settings. Also: for outputs that must survive between runs (checkpoints), set persist: true on the entry under outs: so DVC does not delete it before re-running the stage; to keep a large intermediate file out of the cache, set cache: false on it.
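The invalidation rule can be pictured as a comparison of tracked keys between two versions of params.yaml. The sketch below is an illustration under assumptions, not DVC's code: the helpers `get_param` and `changed_params` are hypothetical, the parameter values are made up, and the dictionaries stand in for the parsed YAML file.

```python
import copy
from functools import reduce


def get_param(params: dict, dotted_key: str):
    """Resolve a dotted key such as 'train.learning_rate' in nested dicts."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), params)


def changed_params(tracked: list, old: dict, new: dict) -> list:
    """Return the tracked keys whose values differ between two versions."""
    return [key for key in tracked if get_param(old, key) != get_param(new, key)]


# Hypothetical params.yaml content before and after an edit.
old = {
    "preprocess": {"train_split": 0.8, "random_seed": 42},
    "train": {"learning_rate": 0.1, "max_depth": 6},
}
new = copy.deepcopy(old)
new["train"]["learning_rate"] = 0.05

train_keys = ["train.learning_rate", "train.max_depth"]
preprocess_keys = ["preprocess.train_split", "preprocess.random_seed"]

print(changed_params(train_keys, old, new))       # ['train.learning_rate']
print(changed_params(preprocess_keys, old, new))  # []
```

Only the train stage tracks a key that changed, so only train would be considered out of date; preprocess keeps its cached outputs.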
Check your understanding
You modify src/preprocess.py and commit it. When you run dvc repro, which stages will re-execute and why?
Answer hint
The preprocess stage re-executes because src/preprocess.py is listed in its deps:. The train and evaluate stages re-execute too when preprocess writes different outputs, since train depends on train.parquet and evaluate depends on test.parquet and on the model. DVC walks the whole dependency chain rather than stopping at preprocess; a downstream stage is skipped only if its inputs come out byte-for-byte identical.