Model metadata: training run, data version
Why this matters
Without linking models to data versions, you cannot reproduce a training run, debug why accuracy dropped, or verify which data a model saw. In production, this breaks your ability to audit model decisions or roll back to a known-good version.
Explanation
Model metadata connects three pieces: the training code, the specific dataset version, and the model artifact. MLflow tracks parameters and metrics; DVC tracks data and model file versions; together they form a traceable record of 'this model was trained on dataset-v2.3 with learning_rate=0.001 on 2026-04-15 at 14:32 UTC'. This is the foundation of reproducible MLOps. Without it, a year later you cannot answer 'where did this model come from?' When you log a training run with MLflow and commit the dvc.lock file, future developers can check out that git commit to restore the code, then run dvc pull (or dvc checkout, if the files are already in the local DVC cache) to restore exactly the data that generated the model.
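As a rough illustration of the record described above, the sketch below hashes a data file and bundles the hash with a commit id and the hyperparameters. The names file_md5, ModelRecord, and build_record are hypothetical and belong to neither MLflow nor DVC; the hashing only approximates what DVC writes to dvc.lock.

```python
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file's bytes, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical provenance record: which code, data, and params made a model."""
    code_commit: str
    data_md5: str
    params: dict
    trained_at: str


def build_record(data_path, code_commit, params):
    """Bundle the three pieces of metadata plus a UTC timestamp."""
    return ModelRecord(
        code_commit=code_commit,
        data_md5=file_md5(data_path),
        params=dict(params),
        trained_at=datetime.now(timezone.utc).isoformat(),
    )
```

In practice MLflow and DVC keep this information for you; the sketch only shows what has to be captured. A call such as asdict(build_record('data/prepared/train.pkl', sha, params['train'])) would yield a dictionary you could log or store next to the model.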
Configuration
# dvc.yaml: Define training pipeline with data version tracking
# DVC records the md5 of every dep and out in dvc.lock when the stage runs
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw/train.csv
      - src/prepare.py
    outs:
      - data/prepared/train.pkl
  train:
    # Hyperparameters are read from params.yaml, so they are not repeated here
    cmd: python src/train.py
    deps:
      - data/prepared/train.pkl
      - src/train.py
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
# params.yaml: Store hyperparameters for DVC to track
train:
  epochs: 10
  lr: 0.001
  batch_size: 32
# .dvc/config: Point DVC at the remote storage that holds versioned data
[core]
    remote = myremote
    autostage = true
['remote "myremote"']
    url = /mnt/shared/dvc-storage
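# src/train.py: Log everything to MLflow
import json
import pickle

import mlflow
import mlflow.sklearn
import yaml

# train_model() and evaluate() stand in for your own training and scoring
# functions; the module name below is a placeholder.
from model import train_model, evaluate

with open('params.yaml') as f:
    params = yaml.safe_load(f)

mlflow.set_experiment('model-training')
with mlflow.start_run():
    mlflow.log_params(params['train'])

    with open('data/prepared/train.pkl', 'rb') as f:
        X_train = pickle.load(f)

    model = train_model(X_train, epochs=params['train']['epochs'])
    # Scored on the training data here; use a held-out set for a real estimate
    accuracy = evaluate(model, X_train)

    mlflow.log_metrics({'accuracy': accuracy})
    mlflow.sklearn.log_model(model, 'model')

    # Written for DVC: metrics.json and models/model.pkl are stage outputs
    with open('metrics.json', 'w') as f:
        json.dump({'accuracy': accuracy}, f)
    with open('models/model.pkl', 'wb') as f:
        pickle.dump(model, f)
Why this order?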
DVC runs pipeline stages in dependency order: prepare runs first and outputs data/prepared/train.pkl, which train depends on. DVC works this order out from each stage's deps and outs when you run dvc repro, regardless of the order in which the stages appear in dvc.yaml. MLflow logging happens inside the training script, so it records the exact parameters and metrics after the data preparation is complete. params.yaml must exist before a stage that references it can run, and .dvc/config determines where DVC stores versioned data.
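The ordering rule can be sketched with a topological sort. This is not DVC's implementation, only a minimal model of the idea: a stage must run after whichever stage produces one of its dependencies. The STAGES dictionary and the execution_order function are illustrative names.

```python
from graphlib import TopologicalSorter

# Simplified view of the two stages, deliberately listed out of order
STAGES = {
    "train": {
        "deps": ["data/prepared/train.pkl", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "prepare": {
        "deps": ["data/raw/train.csv"],
        "outs": ["data/prepared/train.pkl"],
    },
}


def execution_order(stages):
    """Order stages so that every producer runs before its consumers."""
    # Map each output path to the stage that produces it
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # For each stage, collect the stages whose outputs it depends on
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
```

Dependencies that no stage produces, such as data/raw/train.csv and src/train.py, are treated as plain input files and do not affect the order.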
Wrong vs Right
# WRONG: Hardcoded paths, no tracking
python train.py  # Which dataset version? Which code commit?
# WRONG: Training script with no metadata
import pickle
model = train()
with open('models/model.pkl', 'wb') as f:
    pickle.dump(model, f)
# No link to data, no parameters logged, no reproduction path
# WRONG: DVC pipeline without MLflow logging
stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
    outs:
      - model.pkl
# DVC tracks what changed but doesn't record why (parameters, accuracy, timestamp)
# RIGHT: Use dvc.yaml for pipeline reproducibility
stages:
  train:
    cmd: python src/train.py
    deps:
      - data/train.csv
      - src/train.py
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
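# RIGHT: Log everything to MLflow inside training script
import mlflow
import mlflow.sklearn
import yaml

with open('params.yaml') as f:
    params = yaml.safe_load(f)

# The context manager ends the run even if training raises
with mlflow.start_run():
    mlflow.log_params(params['train'])
    model = train(epochs=params['train']['epochs'])  # train() is your own function
    accuracy = evaluate(model)  # evaluate() is your own function; log the measured value
    mlflow.log_metrics({'accuracy': accuracy})
    mlflow.sklearn.log_model(model, 'model')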
# RIGHT: Commit dvc.lock to git
# dvc repro generates dvc.lock with exact file hashes
# Commit it to git so teammates can reproduce the exact run later
Tool vitals
mlflow.log_param(), mlflow.log_metrics(), dvc stage add, dvc repro
dvc.yaml, .dvc/config, MLflow tracking server URI
mlflow ui, dvc dag, dvc status
Integration notes
MLflow and DVC are complementary: DVC versions your data by content hash (the md5 entries in dvc.lock) and stores it in remote storage; MLflow logs training metadata and can package and serve models. When you run dvc pull, you restore the exact dataset; when you call mlflow.start_run(), you record what happened during training. Together, dvc.lock (data version) + MLflow UI (training parameters and metrics) + git history (code version) = a reproducible run, provided the environment and random seeds are pinned as well.
Migration path
If you outgrow MLflow's local tracking server, migrate to a managed MLflow service (for example on Databricks or AWS SageMaker) or to a self-hosted server with a database backend. If DVC's remote storage is too slow, switch to a different remote, such as an S3 bucket instead of local NFS. Neither migration changes your metadata format: your MLflow logging code and DVC pipelines remain unchanged, and only the tracking URI and the DVC remote URL differ.
Common gotcha
DVC records data file hashes in dvc.lock, and dvc.lock MUST be committed to git. If you add dvc.lock to .gitignore, teammates cannot reproduce your run because they have no record of which data version you used. Also, dvc repro skips a stage when its command, deps, params, and outs all still match what dvc.lock recorded: use dvc repro --force if you need to re-run it anyway.
Team adoption
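Day 1: Have one person set up dvc.yaml, run dvc repro, and commit dvc.lock to git. Day 2: Every teammate clones the repo, runs dvc pull (downloads data), then runs dvc repro --force to retrain and confirms they get matching metrics. A plain dvc repro after dvc pull reports every stage as up to date, because the pulled outputs already match dvc.lock. Exact matches also assume that training is deterministic, for example with fixed random seeds. This proves reproducibility before code review. Document the MLflow tracking server URI in README so everyone logs to the same place. Use dvc dag to visualize the pipeline in team meetings: it clarifies dependencies better than text.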
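One way to make the Day 2 check concrete is to compare two copies of metrics.json programmatically. The sketch below assumes a flat JSON object of numeric values, as in the metrics.json written by the training script; load_metrics and metrics_match are hypothetical helper names, and the tolerance is an arbitrary choice.

```python
import json
import math


def load_metrics(path):
    """Read a metrics file such as metrics.json into a dict."""
    with open(path) as f:
        return json.load(f)


def metrics_match(mine, theirs, rel_tol=1e-9):
    """Return True when both runs report the same metric names and values.

    Values are compared with a small relative tolerance to absorb
    floating-point noise; lower rel_tol to demand a stricter match.
    """
    if mine.keys() != theirs.keys():
        return False
    return all(math.isclose(mine[k], theirs[k], rel_tol=rel_tol) for k in mine)
```

A teammate could call metrics_match(load_metrics('metrics.json'), load_metrics('reference_metrics.json')), where the second file name is a placeholder for wherever the reference values are kept.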
Experienced dev note
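Use dvc metrics show and dvc metrics diff to compare the scalar values in metrics.json across commits without writing custom code; dvc plots show renders plot files (for example per-epoch series in JSON or CSV) as graphs. Set cache: false on metrics files in dvc.yaml so that DVC leaves them out of its cache and you can commit them to git, where they are small enough to diff and review. And tag your MLflow runs with the git commit hash: mlflow.set_tag('git_sha', os.popen('git rev-parse HEAD').read().strip()) (with os imported, inside an active run) so you can always trace a model back to exact source code.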
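If you prefer not to shell out through os.popen, a small helper built on subprocess gives you explicit error handling. This is a sketch under assumptions: current_git_sha is a hypothetical name, and returning 'unknown' on failure is one possible policy, not a convention of MLflow or git.

```python
import subprocess


def current_git_sha(repo_dir="."):
    """Return the commit hash of HEAD, or 'unknown' if it cannot be read.

    Failure covers both a missing git executable and a directory that is
    not inside a git repository.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


# Inside an active MLflow run you could then call:
#   mlflow.set_tag('git_sha', current_git_sha())
```

You may prefer to raise an error instead of returning 'unknown', so that an untraceable model is never logged silently.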
Check your understanding
You run dvc repro and it says 'Stage train is up to date' even though your training script changed. Why didn't it re-run, and how do you force it?
Answer hint
DVC only re-runs a stage if its command, deps, params, or outs changed. Since you didn't list src/train.py in deps, DVC doesn't know the script changed. Always add training code as a dependency: deps: [data/train.csv, src/train.py]. To force re-run: dvc repro --force.
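The behavior in this question can be modeled in a few lines. The sketch below is not DVC's code: it compares recorded hashes against current ones for the tracked paths only, which is why a change to an untracked script goes unnoticed. The function name and the hash strings are invented for illustration.

```python
def changed_entries(recorded, current):
    """List tracked paths whose hash differs between the lock record and now.

    Both arguments map path -> hash. A path missing from either side
    counts as changed.
    """
    paths = sorted(set(recorded) | set(current))
    return [p for p in paths if recorded.get(p) != current.get(p)]


# Script NOT listed in deps: editing it changes nothing DVC looks at
lock_without_script = {"data/train.csv": "aaa"}
print(changed_entries(lock_without_script, {"data/train.csv": "aaa"}))

# Script listed in deps: the edit shows up as a changed hash
lock_with_script = {"data/train.csv": "aaa", "src/train.py": "v1"}
now = {"data/train.csv": "aaa", "src/train.py": "v2"}
print(changed_entries(lock_with_script, now))
```

An empty list corresponds to 'Stage train is up to date'; a non-empty list is what triggers a re-run.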