Tool Beginner easy · 6 min config

Data quality monitoring with DVC

What you will learn

Use DVC metrics and plots to track data schema, missing values, and statistical drift across pipeline stages.

Why this matters

Without data quality checks, poisoned or corrupted data flows silently into training, producing models that look fine in CI but fail in production. A validation stage in your DVC pipeline surfaces these problems when the pipeline runs, before a model trained on bad data reaches the registry.

Skip if: Your data comes from a single, highly controlled system (for example, an internal database with strong constraints). In that case, basic schema validation in your training script may suffice. DVC-based checks matter most when data comes from multiple sources, user uploads, or APIs where drift is expected.

Explanation

Data quality monitoring in DVC works by defining metric collection points in your pipeline. You write small Python scripts that compute statistics (row counts, null percentages, value distributions) and output JSON or YAML. DVC then tracks these metrics across commits and plots them over time. DVC itself does not enforce thresholds: your validation script compares each metric against a threshold or baseline and fails when the value is out of range, so you catch the problem before the model sees bad data. This is different from experiment tracking (MLflow): DVC monitors the *input* pipeline health, while MLflow monitors model *output* performance.
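As a rough illustration of such a metric collection script, the sketch below computes a row count and per-column missing percentages using only the standard library. The function names and the JSON layout (row_count, missing_percent) are assumptions made for this example, not a format DVC requires; the output paths match the ones used in the configuration in this guide.

```python
import csv
import json
import sys
from pathlib import Path


def compute_quality_metrics(csv_path):
    """Return the row count and the percentage of empty cells per column."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        missing = {name: 0 for name in columns}
        row_count = 0
        for row in reader:
            row_count += 1
            for name in columns:
                # Treat absent, empty, or whitespace-only cells as missing.
                if not (row.get(name) or "").strip():
                    missing[name] += 1
    missing_percent = {
        name: round(100.0 * count / row_count, 2) if row_count else 0.0
        for name, count in missing.items()
    }
    return {"row_count": row_count, "missing_percent": missing_percent}


def write_outputs(metrics, metrics_dir="metrics"):
    """Write the JSON metrics file and a CSV that a plot can read."""
    out = Path(metrics_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "data_quality.json").write_text(json.dumps(metrics, indent=2))
    with open(out / "column_dist.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["column_name", "missing_percent"])
        for name, pct in metrics["missing_percent"].items():
            writer.writerow([name, pct])


# Only run as a script when a CSV path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    write_outputs(compute_quality_metrics(sys.argv[1]))
```

Keeping every leaf value numeric makes the JSON straightforward to display and diff as metrics.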

Configuration

yaml

# dvc.yaml - data quality monitoring pipeline
stages:
  fetch_data:
    cmd: python src/fetch_data.py
    outs:
      - data/raw.csv

  validate_data:
    cmd: python src/validate_data.py data/raw.csv
    deps:
      - data/raw.csv
      - src/validate_data.py
    metrics:
      - metrics/data_quality.json:
          cache: false
    plots:
      - metrics/column_dist.csv:
          x: column_name
          y: missing_percent

  prepare_data:
    cmd: python src/prepare_data.py data/raw.csv
    deps:
      - data/raw.csv
      - src/prepare_data.py
      # Depending on the validation output makes DVC run validate_data first
      - metrics/data_quality.json
    outs:
      - data/prepared.csv

Why this order?

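The fetch_data stage pulls raw data. validate_data depends on that output and produces metrics. DVC orders stages by their dependencies, not by their position in the file, so prepare_data waits for validation only if it depends on something validate_data produces, such as metrics/data_quality.json. The check itself lives in your script: when validation fails, the script must exit with a non-zero status, which makes dvc repro stop before any downstream stage runs. Wired this way, the quality gate sits *between* raw data and model input.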
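The check logic mentioned above could look something like the following sketch. It assumes the metrics JSON has a row_count number and a missing_percent mapping of column name to percentage; that layout, the threshold names, and the threshold values are all hypothetical and chosen for illustration.

```python
import json
import sys

# Hypothetical thresholds; a team would normally keep these in params.yaml.
THRESHOLDS = {"min_row_count": 1000, "max_null_percent": 5.0}


def find_violations(metrics, thresholds):
    """Return human-readable threshold violations (empty list if clean)."""
    violations = []
    if metrics["row_count"] < thresholds["min_row_count"]:
        violations.append(
            f"row_count {metrics['row_count']} is below {thresholds['min_row_count']}"
        )
    for column, pct in metrics["missing_percent"].items():
        if pct > thresholds["max_null_percent"]:
            violations.append(
                f"{column} is {pct}% missing, above {thresholds['max_null_percent']}%"
            )
    return violations


def main(metrics_path):
    """Load the metrics file and return a process exit code."""
    with open(metrics_path) as handle:
        metrics = json.load(handle)
    violations = find_violations(metrics, THRESHOLDS)
    for line in violations:
        print(f"data quality violation: {line}", file=sys.stderr)
    # A non-zero exit code is what makes the pipeline stage fail.
    return 1 if violations else 0


# Only run as a script when a metrics path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Returning the violations as a list, rather than exiting at the first one, lets a single failed run report every problem at once.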

Wrong vs Right

Wrong way

yaml

stages:
  train:
    cmd: python train.py data/raw.csv
    deps:
      - data/raw.csv
    outs:
      - model.pkl
    metrics:
      - metrics/accuracy.json

Right way

yaml

stages:
  validate_data:
    cmd: python validate.py data/raw.csv
    deps:
      - data/raw.csv
    metrics:
      - metrics/data_quality.json:
          cache: false
  train:
    cmd: python train.py data/raw.csv
    deps:
      - data/raw.csv
      # Ties train to validate_data so a failed check stops training
      - metrics/data_quality.json
    outs:
      - model.pkl
    metrics:
      - metrics/accuracy.json

Tool vitals

Primary command

bash

dvc stage add

Config file dvc.yaml

Verify

bash

dvc dag

Integration notes

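DVC and MLflow cover different halves of the workflow: DVC stages produce data quality metrics, and your training script logs model metrics to MLflow. When dvc repro succeeds (data quality passed), training runs and logs to MLflow as usual. DVC has no built-in threshold check, so keep thresholds in a file such as params.yaml and have the validation script exit with a non-zero status when a metric is out of range. dvc repro then stops before any downstream stage runs, and nothing is logged to MLflow.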

Migration path

If you outgrow DVC for data quality (e.g., need real-time streaming validation), migrate the metric collection scripts to Great Expectations (which DVC can call) or move validation to your feature store (Feast). The core idea stays the same: compute + compare + alert.

Common gotcha

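cache: false does not control when validation re-runs. It only tells DVC not to store the metrics file in its cache, so the small JSON file can be committed to Git and compared across commits with dvc metrics diff. Whether validate_data re-runs is decided by its deps: if data/raw.csv and src/validate_data.py are unchanged, dvc repro skips the stage, and the existing metrics still describe that data. The real risk is validation that reads something not listed under deps (a database, an API): DVC cannot see that input change. In that case, mark the stage always_changed: true or run dvc repro --force so the checks are recomputed.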

Team adoption

Create a src/validate_data.py template that all team members copy and customize. Define a team threshold file params.yaml with max-null-percent and min-row-count. Enforce dvc repro as part of CI: if validation fails, the PR build fails. This prevents bad data PRs from merging.

Experienced dev note

Use dvc plots with a baseline branch. Run dvc plots diff main --targets metrics/column_dist.csv to visually compare data distributions between your feature branch and production. This is how you spot subtle data drift that a single threshold would miss.
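For a numeric complement to the visual diff, a small helper can list the columns whose missing percentage moved between a baseline file and the current one. This is only a sketch: the function names and the default min_delta of 2 percentage points are assumptions, and it expects CSV text with the column_name and missing_percent headers used in the plot configuration.

```python
import csv
import io


def load_missing_percent(csv_text):
    """Parse column_dist.csv content into {column_name: missing_percent}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["column_name"]: float(row["missing_percent"]) for row in reader}


def missing_percent_shift(baseline, current, min_delta=2.0):
    """Return columns whose missing_percent moved by at least min_delta points."""
    shifts = {}
    for column in sorted(set(baseline) | set(current)):
        # A column absent from one side is treated as 0% missing there.
        delta = current.get(column, 0.0) - baseline.get(column, 0.0)
        if abs(delta) >= min_delta:
            shifts[column] = round(delta, 2)
    return shifts


baseline = load_missing_percent("column_name,missing_percent\nage,1.0\nid,0.0\n")
current = load_missing_percent("column_name,missing_percent\nage,4.5\nid,0.5\n")
print(missing_percent_shift(baseline, current))
```

In this example only age is reported, because its missing percentage rose by 3.5 points while id moved by 0.5.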

Check your understanding

What does cache: false on the validate_data metrics do, and what actually decides whether validation re-runs?

Answer hint

cache: false keeps the metrics file out of the DVC cache so it can be committed to Git and diffed across commits. It has no effect on re-runs. DVC re-runs validate_data when the hash of a dependency changes, and any change to the content of data/raw.csv (rows added, values modified) changes its hash. If validation depends on inputs DVC cannot hash, use always_changed: true or dvc repro --force to recompute the metrics.
