Tool Beginner easy · 6 min config

Dataset lineage tracking with DVC

What you will learn

Track how datasets flow through your pipeline stages so you know exactly which data version produced which model.

Why this matters

Without lineage tracking, you can't answer 'which training data created this model?', a question that becomes critical when you debug model failures or when compliance audits require proof of data provenance. DVC records the exact data versions, code dependencies, and outputs at each pipeline stage.

Skip if: you have only one static dataset and never retrain. Lineage tracking is also unnecessary for exploratory notebooks that won't become production pipelines. As soon as you have two or more pipeline stages, though, it is worth setting up.

Explanation

DVC's dvc.yaml file defines pipeline stages that consume and produce data. Each stage declares its dependencies (input data and code) and its outputs (datasets and models). From these declarations DVC builds a directed acyclic graph (DAG) of the data flow: raw data → preprocessing → training data → model. When you run dvc repro, DVC executes the stages in dependency order and writes dvc.lock, which captures the exact hash of every dependency and output. Running dvc dag shows the complete pipeline structure. When model accuracy drops, you can trace back: which training dataset version was used? Was the preprocessing code the same? Because dvc.lock is committed to Git, you answer these questions by comparing dvc.lock across commits, for example with git diff.
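The comparison of dvc.lock across commits can be sketched in a few lines of Python. This is a minimal illustration, not a DVC feature: the dictionaries stand in for two dvc.lock files that have already been parsed (for instance with a YAML library), the hash values are made up, and the helper names index_hashes and changed_entries are hypothetical.

```python
# Sketch: compare two parsed dvc.lock files and report which hashes changed.
# The dicts mimic the dvc.lock layout (stages -> deps/outs -> path, md5);
# the md5 values are invented for illustration.

def index_hashes(lock):
    """Map (stage, kind, path) -> md5 for every dep and out in a lock dict."""
    index = {}
    for stage, body in lock.get("stages", {}).items():
        for kind in ("deps", "outs"):
            for entry in body.get(kind, []):
                index[(stage, kind, entry["path"])] = entry.get("md5")
    return index


def changed_entries(old_lock, new_lock):
    """Return sorted (stage, kind, path) keys whose recorded hash differs."""
    old, new = index_hashes(old_lock), index_hashes(new_lock)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


old_lock = {"stages": {"train": {
    "deps": [{"path": "data/processed/train.csv", "md5": "aaa111"},
             {"path": "scripts/train.py", "md5": "bbb222"}],
    "outs": [{"path": "models/model.pkl", "md5": "ccc333"}],
}}}
new_lock = {"stages": {"train": {
    "deps": [{"path": "data/processed/train.csv", "md5": "ddd444"},
             {"path": "scripts/train.py", "md5": "bbb222"}],
    "outs": [{"path": "models/model.pkl", "md5": "eee555"}],
}}}

for stage, kind, path in changed_entries(old_lock, new_lock):
    print(f"{stage}: {kind[:-1]} {path} changed")
```

Here the training data and the model changed while scripts/train.py did not, which points at the data rather than the code as the source of the difference.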

DVC integrates with Git: dvc.yaml and dvc.lock are committed to version control, while actual large files stay in remote storage (S3, GCS, local NAS). This separation lets you track lineage without bloating your Git repo.

For a beginner MLOps engineer, lineage tracking is the bridge between 'I ran some code and got results' and 'I can reproduce and audit any result from my pipeline.'

Configuration

yaml

# dvc.yaml - define pipeline stages with clear input/output lineage
stages:
  download_data:
    cmd: python scripts/download.py
    outs:
      - data/raw/dataset.csv
  preprocess:
    cmd: python scripts/preprocess.py data/raw/dataset.csv data/processed/
    deps:
      - data/raw/dataset.csv
      - scripts/preprocess.py
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python scripts/train.py data/processed/train.csv models/
    deps:
      - data/processed/train.csv
      - scripts/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

Why this order?

Write stages in logical order: raw data first, then transformations, then final outputs. This is a readability convention, not a requirement. DVC derives execution order from the dependency graph, so if stage B depends on stage A's output, DVC always runs A before B regardless of where each stage appears in dvc.yaml.
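To see why file order is irrelevant, here is a small Python sketch of dependency-based ordering. It is a simplified model rather than DVC's own code: the stages dict mirrors the deps and outs from the configuration above but lists the stages in reverse, and execution_order is a hypothetical helper built on the standard library's graphlib.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stages deliberately written in reverse order; paths follow the dvc.yaml above.
stages = {
    "train": {"deps": ["data/processed/train.csv", "scripts/train.py"],
              "outs": ["models/model.pkl"]},
    "preprocess": {"deps": ["data/raw/dataset.csv", "scripts/preprocess.py"],
                   "outs": ["data/processed/train.csv", "data/processed/test.csv"]},
    "download_data": {"deps": [], "outs": ["data/raw/dataset.csv"]},
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    # Which stage produces each output path?
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, the set of stages whose outputs it depends on.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))  # ['download_data', 'preprocess', 'train']
```

Dependencies that no stage produces, such as the script files, simply add no edge to the graph.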

Wrong vs Right

Wrong way

yaml

# WRONG: No dependencies declared, DVC has no lineage visibility
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    outs:
      - data/processed/train.csv
  train:
    cmd: python scripts/train.py
    outs:
      - models/model.pkl

# Without deps, DVC doesn't know preprocess output feeds train input.
# dvc dag shows disconnected boxes, not a pipeline.
# dvc repro also cannot tell when a changed input file or script should trigger a re-run.

Right way

yaml

# RIGHT: Explicit deps link stages, creating visible lineage
stages:
  preprocess:
    cmd: python scripts/preprocess.py data/raw/dataset.csv data/processed/
    deps:
      - data/raw/dataset.csv
      - scripts/preprocess.py
    outs:
      - data/processed/train.csv
  train:
    cmd: python scripts/train.py data/processed/train.csv models/
    deps:
      - data/processed/train.csv
      - scripts/train.py
    outs:
      - models/model.pkl

# Now dvc dag shows: preprocess → train
# (add the download_data stage from the Configuration section to extend the chain).
# DVC skips preprocess if its deps haven't changed, even if train is re-run.
# dvc.lock captures exact file hashes; once committed to Git, it serves as the audit trail.

Tool vitals

Primary command

bash

dvc dag

Config file dvc.yaml

Verify

bash

dvc dag --ascii

Integration notes

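DVC lineage integrates with Git: commit dvc.yaml and dvc.lock together. Other tools do not read these files on their own, so integration means a little glue code. With MLflow, you can log the hashes from dvc.lock as run tags or parameters through custom logging. In a Docker image, run dvc pull and dvc repro inside the container to rebuild the same pipeline. Kubernetes orchestrators such as Kubeflow do not parse dvc.yaml; mapping DVC stages to jobs is something you script yourself.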

Migration path

To move away from DVC lineage, translate the stages in dvc.yaml by hand into a Makefile or an Airflow DAG (dvc dag --dot exports the graph for reference). You lose automatic hash-based caching and the reproducibility record in dvc.lock. Teams already using Airflow can define lineage there instead and keep DVC for data versioning only.

Common gotcha

Forgetting to list script files as deps breaks lineage integrity. If you update preprocess.py but haven't listed it as a dependency, DVC won't know to re-run preprocessing, because it only checks the hashes of the dependencies you declared. The pipeline then silently reuses stale outputs. Always include any .py or .sh file that affects an output as a dependency.
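A simple check can catch this gotcha before it bites. The sketch below is a hypothetical lint, not part of DVC: it assumes each stage's cmd is a single string, treats only .py and .sh tokens as scripts, and flags any script named in cmd that is missing from deps. It will not notice modules that a script imports.

```python
import shlex

SCRIPT_SUFFIXES = (".py", ".sh")  # assumption: only these count as scripts


def missing_script_deps(stages):
    """Return {stage: [scripts named in cmd but absent from deps]}."""
    problems = {}
    for name, stage in stages.items():
        tokens = shlex.split(stage.get("cmd", ""))
        scripts = [t for t in tokens if t.endswith(SCRIPT_SUFFIXES)]
        missing = [s for s in scripts if s not in stage.get("deps", [])]
        if missing:
            problems[name] = missing
    return problems


stages = {
    "preprocess": {
        "cmd": "python scripts/preprocess.py data/raw/dataset.csv data/processed/",
        "deps": ["data/raw/dataset.csv"],  # script forgotten
    },
    "train": {
        "cmd": "python scripts/train.py data/processed/train.csv models/",
        "deps": ["data/processed/train.csv", "scripts/train.py"],
    },
}

print(missing_script_deps(stages))  # {'preprocess': ['scripts/preprocess.py']}
```

Feeding it the parsed stages section of dvc.yaml in a pre-commit hook or CI step is one way to enforce the convention.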

Team adoption

Establish a team convention: always run dvc dag --ascii after writing a new stage to visualize the graph. Store dvc.yaml and dvc.lock in Git alongside .dvc/config (which points to shared remote storage). New team members clone the repo and run dvc pull to fetch data, then dvc repro to verify lineage works on their machine.

Experienced dev note

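Use the params field of a stage in dvc.yaml to externalize hyperparameters. Keep the values in params.yaml (for example preprocess: test_split: 0.2) and list the key under the stage as params: [preprocess.test_split], alongside deps: [data/raw/dataset.csv, scripts/preprocess.py]. Parameters are declared under params, not deps. DVC records the parameter value in dvc.lock and re-runs the stage when it changes, so lineage becomes sensitive to parameter changes even when the script file is untouched.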
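A dotted key such as preprocess.test_split addresses a nested value in the parameters file. As a rough sketch of how a script might read the same value that DVC tracks, assuming the file has already been parsed into a dict (the lookup helper is hypothetical and is not DVC's own resolver):

```python
from functools import reduce

# Assumed contents of params.yaml after parsing, using the example value above.
params = {"preprocess": {"test_split": 0.2}}


def lookup(params, dotted_key):
    """Resolve a dotted key like 'preprocess.test_split' in nested dicts."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), params)


print(lookup(params, "preprocess.test_split"))  # 0.2
```

Reading the value from the file, rather than hard-coding it in the script, keeps the script and the tracked parameter in agreement.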

Check your understanding

Why should you list scripts/preprocess.py as a dependency in dvc.yaml? If the script is unchanged between two pipeline runs, DVC skips the stage either way, so what does the entry protect you against?

Answer hint

DVC decides whether to re-run a stage by comparing content hashes (MD5) of its declared dependencies against the values recorded in dvc.lock. If the script isn't listed as a dependency, DVC only monitors the data inputs. It cannot detect that the script logic changed, so it keeps the stale outputs. Listing the script ensures that any code modification triggers re-execution.
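The effect can be reproduced with a toy model of hash-based change detection. This is a simplified sketch under stated assumptions, not DVC's implementation: md5_of and stage_changed are hypothetical helpers, and the files are throwaway stand-ins created in a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_changed(recorded, deps):
    """True if any listed dependency's current hash differs from the recorded one."""
    return any(md5_of(p) != recorded.get(str(p)) for p in deps)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp, "dataset.csv")
    script = Path(tmp, "preprocess.py")
    data.write_text("a,b\n1,2\n")
    script.write_text("print('v1')\n")

    # Record hashes at the "first run" for two alternative dependency lists.
    data_only = {str(data): md5_of(data)}
    data_and_script = {str(data): md5_of(data), str(script): md5_of(script)}

    script.write_text("print('v2')\n")  # edit the script; data is untouched

    unlisted = stage_changed(data_only, [data])              # script not in deps
    listed = stage_changed(data_and_script, [data, script])  # script in deps
    print(unlisted, listed)  # False True
```

With the script unlisted, the edit goes unnoticed and the stage would be skipped; with it listed, the hash mismatch forces a re-run.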
