Dataset lineage tracking with DVC
Why this matters
Without lineage tracking, you can't answer the question 'which training data created this model?' That question becomes critical when you debug a model failure or when a compliance audit requires proof of data provenance. DVC records the exact data versions, transformations, and outputs at each pipeline stage.
Explanation
A dvc.yaml file defines pipeline stages that consume and produce data. Each stage declares its dependencies (input data and code) and its outputs (datasets and models). From these declarations DVC builds a directed acyclic graph (DAG) of the data flow: raw data → preprocessing → training data → model. When you run dvc repro, DVC executes the stages in dependency order and records the content hash of every dependency and output in dvc.lock. Running dvc dag prints the pipeline structure. When model accuracy drops, you can trace back: which training dataset version was used? Was the preprocessing code the same? You answer these questions by comparing dvc.lock across Git commits.
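Comparing dvc.lock across commits can be sketched in a few lines of Python. This is an illustrative sketch, not DVC's own tooling: it assumes both versions of dvc.lock have already been parsed into dictionaries (for example with a YAML parser) and that each locked entry carries a path and an md5 field. The function name diff_locks and the hash values are hypothetical.

```python
def entry_hashes(stage, section):
    """Map path -> md5 for the 'deps' or 'outs' list of one locked stage."""
    return {entry["path"]: entry.get("md5") for entry in stage.get(section, [])}


def diff_locks(old_lock, new_lock):
    """Return (stage, section, path, old_md5, new_md5) for every changed entry."""
    changes = []
    old_stages = old_lock.get("stages", {})
    new_stages = new_lock.get("stages", {})
    for name in sorted(set(old_stages) | set(new_stages)):
        for section in ("deps", "outs"):
            old = entry_hashes(old_stages.get(name, {}), section)
            new = entry_hashes(new_stages.get(name, {}), section)
            for path in sorted(set(old) | set(new)):
                if old.get(path) != new.get(path):
                    changes.append((name, section, path, old.get(path), new.get(path)))
    return changes


# Hypothetical parsed dvc.lock contents; the hashes are invented and shortened.
lock_at_good_commit = {"stages": {"train": {
    "deps": [{"path": "data/processed/train.csv", "md5": "aaa111"},
             {"path": "scripts/train.py", "md5": "bbb222"}],
    "outs": [{"path": "models/model.pkl", "md5": "ccc333"}],
}}}
lock_at_bad_commit = {"stages": {"train": {
    "deps": [{"path": "data/processed/train.csv", "md5": "ddd444"},
             {"path": "scripts/train.py", "md5": "bbb222"}],
    "outs": [{"path": "models/model.pkl", "md5": "eee555"}],
}}}

changes = diff_locks(lock_at_good_commit, lock_at_bad_commit)
for change in changes:
    print(change)
```

In this made-up case the training script hash is identical in both commits, so the report points at the training data rather than the code as the source of the new model.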
DVC integrates with Git: dvc.yaml and dvc.lock are committed to version control, while actual large files stay in remote storage (S3, GCS, local NAS). This separation lets you track lineage without bloating your Git repo.
For a beginner MLOps engineer, lineage tracking is the bridge between 'I ran some code and got results' and 'I can reproduce and audit any result from my pipeline.'
Configuration
# dvc.yaml - define pipeline stages with clear input/output lineage
stages:
  download_data:
    cmd: python scripts/download.py
    outs:
      - data/raw/dataset.csv
  preprocess:
    cmd: python scripts/preprocess.py data/raw/dataset.csv data/processed/
    deps:
      - data/raw/dataset.csv
      - scripts/preprocess.py
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python scripts/train.py data/processed/train.csv models/
    deps:
      - data/processed/train.csv
      - scripts/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
Why this order?
The stages are listed in logical order: raw data first, then transformations, then outputs. This is a readability convention, not a requirement. DVC derives execution order from the dependency graph: if stage B depends on stage A's output, DVC always runs A before B, regardless of where each stage appears in dvc.yaml.
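The idea of ordering by dependencies rather than by file position can be sketched with Python's standard library. This is not DVC's implementation, only a minimal illustration: the stages dictionary mirrors the configuration above but is deliberately written in reverse order, and execution_order is a hypothetical helper. It requires Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stages deliberately listed "backwards" to show that position is irrelevant.
stages = {
    "train": {
        "deps": ["data/processed/train.csv", "scripts/train.py"],
        "outs": ["models/model.pkl"],
    },
    "preprocess": {
        "deps": ["data/raw/dataset.csv", "scripts/preprocess.py"],
        "outs": ["data/processed/train.csv", "data/processed/test.csv"],
    },
    "download_data": {
        "deps": [],
        "outs": ["data/raw/dataset.csv"],
    },
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    # Which stage produces each output path?
    producer = {out: name for name, s in stages.items() for out in s.get("outs", [])}
    # A stage depends on every stage that produces one of its deps.
    # Deps with no producer (such as scripts) are plain files, not graph edges.
    graph = {
        name: {producer[dep] for dep in s.get("deps", []) if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)  # ['download_data', 'preprocess', 'train']
```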
Wrong vs Right
# WRONG: No dependencies declared, DVC has no lineage visibility
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    outs:
      - data/processed/train.csv
  train:
    cmd: python scripts/train.py
    outs:
      - models/model.pkl
# Without deps, DVC doesn't know preprocess output feeds train input.
# dvc dag shows disconnected boxes, not a pipeline.
# dvc repro cannot order the stages or detect that train's input data or code
# has changed, so it cannot decide correctly what to re-run.
# RIGHT: Explicit deps link stages, creating visible lineage
stages:
  preprocess:
    cmd: python scripts/preprocess.py data/raw/dataset.csv data/processed/
    deps:
      - data/raw/dataset.csv
      - scripts/preprocess.py
    outs:
      - data/processed/train.csv
  train:
    cmd: python scripts/train.py data/processed/train.csv models/
    deps:
      - data/processed/train.csv
      - scripts/train.py
    outs:
      - models/model.pkl
# Now dvc dag shows: preprocess → train
# DVC skips preprocess if its deps haven't changed, even if train is re-run.
# dvc.lock captures exact file hashes, so each commit carries its own audit trail.
Tool vitals
Command: dvc dag
Config file: dvc.yaml
Integration notes
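DVC lineage integrates with Git: commit dvc.yaml and dvc.lock together so that every commit pins both the pipeline definition and the hashes it produced. MLflow does not read dvc.lock on its own; to connect a run to its data you need custom logging, for example recording the Git commit or the relevant dvc.lock hashes as run tags. A Docker image can reproduce the pipeline by checking out the repo and running dvc pull and dvc repro against the committed dvc.yaml. Orchestrators such as Kubeflow do not parse dvc.yaml natively, so mapping stages onto their job DAGs requires custom tooling.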
Migration path
To move away from DVC lineage, translate the stages in dvc.yaml into a Makefile or an Airflow DAG by hand (dvc dag --dot exports the graph in DOT format as a reference). You lose DVC's automatic hash-based caching and its reproducibility guarantees. For teams already using Airflow, define lineage there instead; DVC then remains useful for data versioning only.
Common gotcha
Forgetting to list script files as deps breaks lineage integrity. If you update preprocess.py but haven't listed it as a dependency, DVC won't know to re-run preprocessing, because it only checks the hashes of declared dependencies. The pipeline then silently reuses stale outputs. Always include any .py or .sh file that affects an output as a dependency.
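A simple pre-commit style check can catch this mistake. The sketch below is a hypothetical helper, not a DVC feature: it assumes dvc.yaml has been parsed into a dictionary and that each cmd is a single string, then flags any .py or .sh file mentioned in cmd that is missing from deps.

```python
import shlex

SCRIPT_SUFFIXES = (".py", ".sh")


def missing_script_deps(stages):
    """Return {stage_name: [scripts named in cmd but absent from deps]}."""
    problems = {}
    for name, stage in stages.items():
        deps = set(stage.get("deps", []))
        # Split the command like a shell would and keep tokens that look like scripts.
        tokens = shlex.split(stage.get("cmd", ""))
        scripts = [tok for tok in tokens if tok.endswith(SCRIPT_SUFFIXES)]
        missing = [script for script in scripts if script not in deps]
        if missing:
            problems[name] = missing
    return problems


# Hypothetical parsed stages: preprocess forgets its script, train is correct.
stages_to_check = {
    "preprocess": {
        "cmd": "python scripts/preprocess.py data/raw/dataset.csv data/processed/",
        "deps": ["data/raw/dataset.csv"],
    },
    "train": {
        "cmd": "python scripts/train.py data/processed/train.csv models/",
        "deps": ["data/processed/train.csv", "scripts/train.py"],
    },
}

problems = missing_script_deps(stages_to_check)
print(problems)  # {'preprocess': ['scripts/preprocess.py']}
```

The check only sees scripts named directly on the command line; modules imported by those scripts would still need to be listed by hand.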
Team adoption
Establish a team convention: always run dvc dag after writing a new stage to check the graph (it prints an ASCII diagram by default). Store dvc.yaml and dvc.lock in Git alongside .dvc/config, which points to the shared remote storage. New team members clone the repo, run dvc pull to fetch the data, then run dvc repro to verify that the pipeline reproduces on their machine.
Experienced dev note
Use a params.yaml file to externalize hyperparameters and data paths, and track individual keys through a stage's params field in dvc.yaml. Example: params.yaml contains
preprocess:
  test_split: 0.2
and the preprocess stage declares params: [preprocess.test_split] next to deps: [data/raw/dataset.csv]. This makes lineage sensitive to parameter changes without rewriting the script.
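The effect of tracking a single dotted key such as preprocess.test_split can be sketched as follows. This is an illustration of the idea only, not DVC's code: lookup and changed_params are hypothetical helpers, and the dictionaries stand in for a parsed params.yaml and the values recorded at the last run.

```python
def lookup(params, dotted_key):
    """Resolve a key like 'preprocess.test_split' in a nested dictionary."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def changed_params(tracked_keys, current_params, locked_values):
    """Return the tracked keys whose current value differs from the recorded one."""
    return [
        key for key in tracked_keys
        if lookup(current_params, key) != locked_values.get(key)
    ]


# Hypothetical state: test_split was 0.2 at the last run and is now 0.3.
# The 'seed' key exists in the file but is not tracked by the stage.
current_params = {"preprocess": {"test_split": 0.3, "seed": 42}}
locked_values = {"preprocess.test_split": 0.2}
tracked_keys = ["preprocess.test_split"]

changed = changed_params(tracked_keys, current_params, locked_values)
print(changed)  # ['preprocess.test_split']
```

Only tracked keys count: editing seed in this example would not mark the stage as changed, which is why each parameter that affects an output should be tracked explicitly.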
Check your understanding
Why should you list scripts/preprocess.py as a dependency in dvc.yaml? If the script hasn't changed between two pipeline runs, DVC skips the stage either way, so what does the declaration add?
Answer hint
DVC decides whether to re-run a stage by comparing the content hashes (MD5) of its declared dependencies against the hashes recorded in dvc.lock. If the script isn't listed as a dependency, DVC only monitors the data inputs: it won't detect that the script logic changed, so it keeps the stale outputs. Listing the script ensures that any code modification triggers re-execution.
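The hash comparison behind this answer can be demonstrated with temporary files. This is a simplified sketch, not DVC's implementation: md5_of and stage_is_stale are hypothetical helpers, and the recorded dictionary stands in for the hashes stored in dvc.lock.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_is_stale(recorded, dep_paths):
    """True if any listed dependency's current hash differs from the recorded one."""
    return any(md5_of(p) != recorded.get(str(p)) for p in dep_paths)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "dataset.csv"
    script = Path(tmp) / "preprocess.py"
    data.write_text("a,b\n1,2\n")
    script.write_text("SPLIT = 0.2\n")

    # Hashes as they were at the last pipeline run.
    recorded = {str(data): md5_of(data), str(script): md5_of(script)}

    # Change the code but leave the data untouched.
    script.write_text("SPLIT = 0.3\n")

    stale_data_only = stage_is_stale(recorded, [data])            # script not listed
    stale_with_script = stage_is_stale(recorded, [data, script])  # script listed

print(stale_data_only, stale_with_script)  # False True
```

With only the data file listed, the code change goes unnoticed and the stage looks up to date; once the script is listed, the same change marks the stage as stale.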