Tool Advanced hard · 8 min integration

Data lineage tracking with DVC

What you will learn

Track complete data and model lineage across your ML pipeline using DVC's dag and metrics system to answer 'where did this model come from?' in production.

Why this matters

In production, you'll face questions like: 'Which dataset version trained this model?' 'Did we retrain after fixing that data bug?' 'Why did accuracy drop 2%?' Without lineage tracking, you're debugging blind. DVC records the dependencies you declare between data, code, and outputs, together with their content hashes, so you can reproduce any result and audit decisions months later.

Skip if: your pipeline is a single Jupyter notebook, you never retrain models, your data never changes, or your regulatory environment doesn't require auditability. For academic research or one-off analyses, manual documentation is often enough. DVC becomes essential once you have multiple stages, team collaboration, or production model updates.

Explanation

Data lineage in DVC works by tracking dependencies and outputs in dvc.yaml pipelines. Each stage declares its inputs (data, code, parameters) and outputs (models, metrics, artifacts). DVC builds a directed acyclic graph (DAG) from these declarations and records the content hash of every dependency and output in dvc.lock, which you commit to Git alongside dvc.yaml. When you run dvc dag, you see the exact flow: raw data → preprocessing → feature engineering → training → evaluation. When you later ask 'what data trained model v3?', you check out the Git commit that produced v3 and follow the hashes in dvc.lock back through the DAG to the raw data.

DVC also pairs well with MLflow, although the two are separate tools: MLflow logs model parameters and metrics; DVC tracks which data versions produced those metrics. Together they create an auditable record: Git controls code versions, DVC controls data lineage, MLflow controls experiment metadata. This matters in production because regulations and audits (GDPR, financial audits) can require you to show which data a model was built on, and when a bug is discovered you need to know which models are affected.
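
The backward trace described above can be sketched in a few lines of Python. This is an illustration of the idea, not DVC's implementation: the STAGES dict is a hand-written, hypothetical mirror of the deps and outs a dvc.yaml file would declare, and trace_lineage is a name invented for this example.

```python
from typing import Dict, List, Set

# Hypothetical mirror of the stages in a dvc.yaml file:
# each stage lists the paths it reads (deps) and writes (outs).
STAGES: Dict[str, Dict[str, List[str]]] = {
    "prepare": {
        "deps": ["data/raw.csv", "src/prepare.py"],
        "outs": ["data/prepared.csv"],
    },
    "featurize": {
        "deps": ["data/prepared.csv", "src/featurize.py"],
        "outs": ["data/features.pkl"],
    },
    "train": {
        "deps": ["data/features.pkl", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
}


def trace_lineage(target: str, stages: Dict[str, Dict[str, List[str]]]) -> Set[str]:
    """Return every path that `target` depends on, directly or transitively."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    seen: Set[str] = set()
    stack = [target]
    while stack:
        path = stack.pop()
        stage = producer.get(path)
        if stage is None:
            continue  # a source file (raw data or code): nothing upstream
        for dep in stages[stage]["deps"]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    for path in sorted(trace_lineage("models/model.pkl", STAGES)):
        print(path)
```

Calling trace_lineage("models/model.pkl", STAGES) returns the raw data, both intermediate files, and the three scripts, which is the set of inputs an auditor would ask about.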

Configuration

yaml

stages:
  prepare:
    cmd: python src/prepare.py --input data/raw.csv --output data/prepared.csv
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/prepared.csv
    params:
      - prepare.test_size

  featurize:
    cmd: python src/featurize.py --input data/prepared.csv --output data/features.pkl
    deps:
      - data/prepared.csv
      - src/featurize.py
    outs:
      - data/features.pkl
    params:
      - featurize.n_features

  train:
    cmd: python src/train.py --features data/features.pkl --model models/model.pkl --metrics metrics.json
    deps:
      - data/features.pkl
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
    params:
      - train.epochs
      - train.learning_rate

  evaluate:
    cmd: python src/evaluate.py --model models/model.pkl --features data/features.pkl --output eval_metrics.json
    deps:
      - models/model.pkl
      - data/features.pkl
      - src/evaluate.py
    metrics:
      - eval_metrics.json:
          cache: false

params:
  - params.yaml

Why this order?

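The stages are listed in the order they run: prepare first (consumes raw data), featurize second (consumes prepared data), train third (consumes features), evaluate last (consumes the model and the features). That ordering is for human readers. DVC derives the execution order from each stage's deps and outs, not from its position in the file, and dvc repro runs the stages in dependency order. The params section at the bottom references params.yaml, which must exist (top-level params is accepted by recent DVC releases; stage-level params read from params.yaml by default anyway). Setting cache: false on metrics keeps the metric files out of the DVC cache so that they are committed to Git directly: they are small, and this makes metric changes visible in ordinary Git diffs. It does not control whether a stage reruns; that depends on whether the stage's deps and params still match the hashes in dvc.lock.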
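
Execution order follows from which stage's outs feed which stage's deps. A minimal sketch of that derivation, using the standard-library graphlib module (Python 3.9+); the STAGES dict and the execution_order helper are hypothetical stand-ins for what DVC does internally, with the stages deliberately listed in reverse.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stages written in reverse on purpose; only deps and outs carry ordering.
# (Metrics are left out of `outs` here to keep the example small.)
STAGES = {
    "evaluate": {"deps": ["models/model.pkl", "data/features.pkl"], "outs": []},
    "train": {"deps": ["data/features.pkl"], "outs": ["models/model.pkl"]},
    "featurize": {"deps": ["data/prepared.csv"], "outs": ["data/features.pkl"]},
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/prepared.csv"]},
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    # Which stage writes each path?
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, the set of stages that must run before it.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

Because this pipeline is a single chain, the result is ['prepare', 'featurize', 'train', 'evaluate'] regardless of how the dict is ordered.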

Wrong vs Right

Wrong way

yaml

stages:
  train:
    cmd: python train.py
    outs:
      - model.pkl

stages:
  evaluate:
    cmd: python evaluate.py
    metrics:
      - metrics.json

Right way

yaml

stages:
  prepare:
    cmd: python src/prepare.py --input data/raw.csv --output data/prepared.csv
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/prepared.csv

  train:
    cmd: python src/train.py --features data/prepared.csv --model models/model.pkl
    deps:
      - data/prepared.csv
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py --model models/model.pkl --features data/prepared.csv --output eval_metrics.json
    deps:
      - models/model.pkl
      - data/prepared.csv
      - src/evaluate.py
    metrics:
      - eval_metrics.json:
          cache: false
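
The difference between the two versions can be checked mechanically. The sketch below assumes the file has already been parsed into a dict (for example with PyYAML's yaml.safe_load, which is not used here so the example stays standard-library only); stages_missing_deps is a hypothetical helper, not a DVC command.

```python
def stages_missing_deps(pipeline: dict) -> list:
    """Return the names of stages that declare no deps at all."""
    return sorted(
        name
        for name, stage in pipeline.get("stages", {}).items()
        if not stage.get("deps")
    )


# Parsed form of the "wrong" pipeline: no stage records what it reads.
WRONG = {
    "stages": {
        "train": {"cmd": "python train.py", "outs": ["model.pkl"]},
        "evaluate": {"cmd": "python evaluate.py", "metrics": ["metrics.json"]},
    }
}

# Abbreviated parsed form of the "right" pipeline.
RIGHT = {
    "stages": {
        "prepare": {
            "deps": ["data/raw.csv", "src/prepare.py"],
            "outs": ["data/prepared.csv"],
        },
        "train": {
            "deps": ["data/prepared.csv", "src/train.py"],
            "outs": ["models/model.pkl"],
        },
    }
}

if __name__ == "__main__":
    print(stages_missing_deps(WRONG))
    print(stages_missing_deps(RIGHT))
```

A check like this fits naturally into a pre-commit hook: a stage with no declared inputs has no recorded upstream, so it cannot contribute to lineage.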

Tool vitals

Primary command

bash

dvc dag

Config file dvc.yaml

Verify

bash

dvc dag && dvc metrics show

Integration notes

DVC lineage connects to MLflow like this: (1) DVC describes the pipeline in dvc.yaml and records data hashes in dvc.lock; both files are committed to Git. (2) When training starts, your training script logs to MLflow with mlflow.log_param() and mlflow.log_metrics(). (3) At the end, log the Git commit that holds that dvc.lock: mlflow.log_param('dvc_commit', subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()). Now MLflow runs are linked to exact data versions, provided the working tree was committed before the run. DVC has no built-in command that connects to the MLflow UI; if you also want the link recorded from the pipeline side, add a final stage (for example post_train) whose cmd runs a script of your own that tags the MLflow run. Together: Git = code lineage, DVC = data lineage, MLflow = experiment lineage.
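
A minimal sketch of step (3), under stated assumptions: the tag names git_commit and dvc_lock_sha256 and all three function names are hypothetical, and hashing dvc.lock is an addition of this example (it catches runs started with an uncommitted lock file). MLflow is imported lazily so the helper functions can be used and tested without it installed.

```python
import hashlib
import subprocess
from pathlib import Path


def build_lineage_tags(git_commit: str, lock_bytes: bytes) -> dict:
    """Combine the Git commit and a digest of dvc.lock into run tags."""
    return {
        "git_commit": git_commit,
        "dvc_lock_sha256": hashlib.sha256(lock_bytes).hexdigest(),
    }


def current_lineage_tags(repo: Path = Path(".")) -> dict:
    """Read the current commit and dvc.lock from a repository on disk."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo, text=True
    ).strip()
    return build_lineage_tags(commit, (repo / "dvc.lock").read_bytes())


def log_lineage_to_mlflow(repo: Path = Path(".")) -> None:
    """Attach the lineage tags to the active MLflow run."""
    import mlflow  # lazy import: only needed when actually logging

    mlflow.set_tags(current_lineage_tags(repo))
```

In a training script this would be called once inside the run, for example within a with mlflow.start_run(): block, after the metrics have been logged.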

Migration path

To move away from DVC: there is no single export command, but the metadata is already plain text. dvc dag --dot prints the pipeline graph in Graphviz DOT format, dvc metrics show --json prints metrics as JSON (older releases spelled the flag --show-json), and dvc.yaml and dvc.lock are YAML files that any parser can read. Migrate to a data catalog tool like Collibra or Alation by parsing these outputs. For simple cases, you can manually archive dvc.yaml and dvc.lock in S3, then query them directly. DVC's biggest unique value is automatic DAG inference; switching to manual pipelines (Airflow, Prefect) means writing dependency graphs by hand.

Common gotcha

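cache: true is the default for metrics, and it does not do what most people expect. It tells DVC to store the metric file in its cache and keep it out of Git, so metric changes never show up in a Git diff or a pull request, and a teammate who clones the repo without running dvc pull sees no metrics at all. It has no effect on whether a stage reruns. If dvc metrics show reports accuracy numbers from three runs ago, the usual cause is a missing entry in deps or params: DVC compares the declared inputs against dvc.lock, finds nothing changed, and skips the stage. Use cache: false for metric outputs so they are versioned in Git, and list every input in deps.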

Team adoption

Day 1: (1) Create a template dvc.yaml with all stages (prepare, featurize, train, evaluate). (2) Make it team standard: run dvc init in every new repo and commit dvc.yaml, dvc.lock, and the .dvc/ configuration directory to Git. (3) Document the rule: 'Every data input and every code script goes in deps. Every model or artifact goes in outs, every metric file goes in metrics, and every hyperparameter goes in params.yaml.'

Day 2: Show the dvc dag visualization in onboarding so engineers see the full pipeline.

Day 3: Link DVC to MLflow: each train run logs to MLflow with the Git commit hash of the DVC pipeline. Use a pre-commit hook to validate dvc.yaml syntax before pushing.

Within a week, dvc repro && dvc metrics show becomes the standard 'run the full pipeline' command, replacing ad-hoc script runs.

Experienced dev note

After weeks of fighting DVC, experienced engineers discover: use dvc plots to track metric evolution across runs, not just final values. Declare a plots entry on your train stage that points at a file of per-step records (a list of epoch and loss values, rather than the single summary in metrics.json), then run dvc plots show on that file to see loss over time. This integrates lineage with visualization: you see not just 'which data trained this?', but 'how did loss evolve during training for that data?' Also, run dvc dag in CI/CD logs so reviewers can see the full pipeline dependency graph before merging; it prints an ASCII graph by default, and dvc dag --dot emits Graphviz DOT if you prefer a rendered image.
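
To plot loss over time, the training script has to write a series rather than one final number. A minimal sketch, assuming a JSON list of per-epoch records is an acceptable plots input; the file name training_log.json and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Sequence


def epoch_records(losses: Sequence[float]) -> List[dict]:
    """Turn a sequence of per-epoch losses into one record per epoch."""
    return [
        {"epoch": epoch, "loss": float(loss)}
        for epoch, loss in enumerate(losses, start=1)
    ]


def write_training_log(path: str, losses: Sequence[float]) -> None:
    """Write the per-epoch records as a JSON list, creating parent folders."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(epoch_records(losses), indent=2))


if __name__ == "__main__":
    # Hypothetical losses collected during training.
    write_training_log("training_log.json", [0.92, 0.61, 0.47])
```

The train stage would then list training_log.json under plots, keeping metrics.json for the final summary values.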

Check your understanding

You run dvc repro and it completes instantly without retraining. Your teammate says 'good, no changes needed', but you suspect the raw data was updated yesterday. How do you verify if the model actually retrained on new data, and what's the most common reason DVC skipped training?

Answer hint

Run dvc status to see which stages DVC considers changed, and dvc diff to compare tracked data hashes between commits. DVC skips a stage when all of its deps and params still match the hashes recorded in dvc.lock. The most common reason: the raw data file wasn't added to deps in the prepare stage, so DVC never detects changes. Always explicitly list every input file in deps.
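
The skip decision can be illustrated with a toy hash comparison. This is a sketch of the idea only: it keeps file contents in memory, uses SHA-256 as a stand-in for the hashes DVC stores in dvc.lock, and file_digest and changed_deps are hypothetical names.

```python
import hashlib
from typing import Dict, List


def file_digest(data: bytes) -> str:
    """Content hash of a dependency (a stand-in for the hash in dvc.lock)."""
    return hashlib.sha256(data).hexdigest()


def changed_deps(current: Dict[str, bytes], recorded: Dict[str, str]) -> List[str]:
    """Return declared deps whose content no longer matches the recorded hash."""
    return sorted(
        path
        for path, data in current.items()
        if recorded.get(path) != file_digest(data)
    )


old_raw = b"id,label\n1,0\n"
new_raw = b"id,label\n1,1\n"  # the data was updated yesterday
code = b"print('prepare')\n"

# Case 1: only the script was declared, so only the script was recorded.
lock_without_data = {"src/prepare.py": file_digest(code)}
declared_without_data = {"src/prepare.py": code}

# Case 2: the raw data was declared too, so its old hash was recorded.
lock_with_data = {
    "src/prepare.py": file_digest(code),
    "data/raw.csv": file_digest(old_raw),
}
declared_with_data = {"src/prepare.py": code, "data/raw.csv": new_raw}

if __name__ == "__main__":
    print(changed_deps(declared_without_data, lock_without_data))
    print(changed_deps(declared_with_data, lock_with_data))
```

In the first case the list is empty even though the data changed, so the stage would be skipped; in the second the changed file is reported and the stage would rerun.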
