Data quality monitoring with DVC
Why this matters
Without data quality checks, poisoned or corrupted data flows silently into training, producing models that look fine in CI but fail badly in production. A validation stage in a DVC pipeline surfaces these problems when the pipeline runs, before a model trained on bad data reaches the registry.
Explanation
Data quality monitoring in DVC works by defining metric collection points in your pipeline. You write small Python scripts that compute statistics (row counts, null percentages, value distributions) and output JSON or YAML. DVC then tracks these metrics across commits (dvc metrics show, dvc metrics diff) and plots them (dvc plots). DVC does not enforce thresholds itself: your script compares each metric to a threshold or baseline and fails when it is out of range, so you catch the problem before the model sees bad data. This is different from experiment tracking (MLflow): DVC monitors the *input* pipeline health, while MLflow monitors model *output* performance.
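The metric collection script described above could look roughly like the following. This is a minimal sketch using only the standard library; the function names, the metric keys (row_count, null_percent), and the rule that blank cells count as nulls are assumptions made for illustration, not anything DVC prescribes.

```python
import csv
import json
from pathlib import Path


def compute_quality_metrics(csv_path):
    """Return the row count and per-column null percentage of a CSV file."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        null_counts = {name: 0 for name in columns}
        row_count = 0
        for row in reader:
            row_count += 1
            for name in columns:
                value = row.get(name)
                # Assumption: a missing or blank cell counts as a null.
                if value is None or value.strip() == "":
                    null_counts[name] += 1
    null_percent = {
        name: (100.0 * count / row_count if row_count else 0.0)
        for name, count in null_counts.items()
    }
    return {"row_count": row_count, "null_percent": null_percent}


def write_metrics(metrics, out_path):
    """Write metrics as JSON so DVC can track the file as a metrics output."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2, sort_keys=True))
```

In a script such as src/validate_data.py, you would call compute_quality_metrics on the path passed on the command line and then write_metrics(metrics, "metrics/data_quality.json").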
Configuration
# dvc.yaml - data quality monitoring pipeline
stages:
  fetch_data:
    cmd: python src/fetch_data.py
    outs:
      - data/raw.csv
  validate_data:
    cmd: python src/validate_data.py data/raw.csv
    deps:
      - data/raw.csv
      - src/validate_data.py
    metrics:
      - metrics/data_quality.json:
          cache: false
    plots:
      - metrics/column_dist.csv:
          x: column_name
          y: missing_percent
  prepare_data:
    cmd: python src/prepare_data.py data/raw.csv
    deps:
      - data/raw.csv
      - src/prepare_data.py
    outs:
      - data/prepared.csv
Why this order?
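The fetch_data stage pulls raw data. validate_data depends on that output and writes the quality metrics. Note that prepare_data, as configured above, depends only on data/raw.csv, so DVC has no reason to run it after validate_data. To turn validation into a real gate, make the validation script exit with a non-zero status when a check fails and add metrics/data_quality.json to the deps of prepare_data. dvc repro then runs validation first and stops there on failure, so quality gates sit *between* raw data and model input.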
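The check logic inside the validation script can be sketched as follows. DVC treats a non-zero exit status as a failed stage, so the gate only needs to call sys.exit(1). The metrics layout (row_count, null_percent) and the function names are hypothetical; the threshold keys min-row-count and max-null-percent follow the names used under Team adoption.

```python
import sys


def find_violations(metrics, thresholds):
    """Return human-readable threshold violations (empty list if clean)."""
    violations = []
    min_rows = thresholds["min-row-count"]
    max_null = thresholds["max-null-percent"]
    if metrics["row_count"] < min_rows:
        violations.append(
            f"row_count {metrics['row_count']} < min-row-count {min_rows}"
        )
    for column, percent in sorted(metrics["null_percent"].items()):
        if percent > max_null:
            violations.append(
                f"{column}: {percent:.1f}% nulls > max-null-percent {max_null}"
            )
    return violations


def gate(metrics, thresholds):
    """Exit with status 1 on any violation so the DVC stage fails."""
    violations = find_violations(metrics, thresholds)
    for line in violations:
        print(f"DATA QUALITY FAILURE: {line}", file=sys.stderr)
    if violations:
        sys.exit(1)
```

Writing the metrics file before calling gate keeps the failing numbers on disk for inspection after the pipeline stops.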
Wrong vs Right
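# Wrong: training reads raw data with no validation stage
stages:
  train:
    cmd: python train.py data/raw.csv
    deps:
      - data/raw.csv
    outs:
      - model.pkl
    metrics:
      - metrics/accuracy.json

# Right: a validation stage runs before training
stages:
  validate_data:
    cmd: python validate.py data/raw.csv
    deps:
      - data/raw.csv
    metrics:
      - metrics/data_quality.json:
          cache: false
  train:
    cmd: python train.py data/raw.csv
    deps:
      - data/raw.csv
      # depending on the validation output makes train run after validate_data
      - metrics/data_quality.json
    outs:
      - model.pkl
    metrics:
      - metrics/accuracy.json
Tool vitals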
dvc stage add
dvc.yaml
dvc dag
Integration notes
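DVC and MLflow cover different halves of the pipeline. After dvc repro succeeds (data quality passed), your training script logs model metrics to MLflow. DVC does not enforce thresholds on its own: the validation script reads the thresholds from params.yaml (listed under the stage's params in dvc.yaml) and exits with a non-zero status when a metric violates one. dvc repro then stops at the failed stage, so a training stage that depends on the validation output never runs and nothing is logged to MLflow.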
Migration path
If you outgrow DVC for data quality (e.g., need real-time streaming validation), migrate the metric collection scripts to Great Expectations (which DVC can call) or move validation to your feature store (Feast). The core idea stays the same: compute + compare + alert.
Common gotcha
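cache: false does not make validation re-run. It only tells DVC not to store the metrics file in its cache, so the small JSON file is committed to Git and can be compared across commits with dvc metrics diff. Whether a stage re-runs is decided by its deps: DVC hashes their content, so if data/raw.csv changes, validate_data re-runs on the next dvc repro whether or not the metrics are cached. The real risk is data that changes without DVC seeing it, for example a script that queries a database instead of reading a tracked file. In that case set always_changed: true on the stage so it runs on every dvc repro and catches the drift you are trying to monitor.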
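Change detection for a file dependency comes down to comparing a content hash with the one recorded in dvc.lock. The sketch below is a simplified illustration of that idea using hashlib, not DVC's actual implementation; the function names are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks and return the hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dep_changed(path, recorded_md5):
    """Return True when the file content no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5
```

Because the hash covers the file's bytes, any added row or modified value changes it, which is why a stage with data/raw.csv in its deps re-runs when that file's content changes.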
Team adoption
Create a src/validate_data.py template that all team members copy and customize. Define a team threshold file params.yaml with max-null-percent and min-row-count. Enforce dvc repro as part of CI: if validation fails, the PR build fails. This prevents bad data PRs from merging.
Experienced dev note
Use dvc plots with a baseline branch. Run dvc plots diff main --targets metrics/column_dist.csv to visually compare data distributions between your feature branch and production. This is how you spot subtle data drift that a single threshold would miss.
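The same comparison can be made numerically, for instance to fail a CI job on drift. The following is a minimal sketch that assumes metrics/column_dist.csv has the column_name and missing_percent columns named in the plots configuration; the function names and the tolerance of 2 percentage points are hypothetical.

```python
import csv
import io


def read_missing_percent(csv_text):
    """Parse column_dist.csv content into {column_name: missing_percent}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["column_name"]: float(row["missing_percent"]) for row in reader}


def missing_percent_drift(baseline, current, tolerance=2.0):
    """Return columns whose missing percentage moved by more than tolerance.

    Values are the change in percentage points, or None for a column that
    exists on only one side (added or removed).
    """
    drifted = {}
    for column in sorted(set(baseline) | set(current)):
        if column not in baseline or column not in current:
            drifted[column] = None
            continue
        delta = current[column] - baseline[column]
        if abs(delta) > tolerance:
            drifted[column] = delta
    return drifted
```

The baseline text would come from the main revision of the file. If the file is tracked by DVC, dvc.api.read with rev="main" should be able to return it, though that depends on your remote setup.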
Check your understanding
What does cache: false on the validate_data metrics actually control, and what decides whether validation re-runs?
Answer hint
cache: false only keeps the metrics file out of the DVC cache so that it can be committed to Git and compared across commits. Re-running is decided by the stage's deps: DVC hashes the *content* of data/raw.csv, so added rows or modified values change the hash and trigger validation again. If the data can change without any tracked dep changing, mark the stage always_changed: true so quality metrics are recomputed on every pipeline run.