Tool Beginner easy · 6 min cli_command

Model metadata: training run, data version

What you will learn

Track which dataset version trained which model, using MLflow and DVC, so that you can recreate any experiment.

Why this matters

Without linking models to data versions, you cannot reproduce a training run, debug why accuracy dropped, or trust which data a model saw. In production, this breaks your ability to audit model decisions or roll back to known-good versions.

Skip if: you're prototyping locally with a single dataset and will discard the model immediately. In that case metadata can wait, but the moment you save a model or share code with teammates, it becomes mandatory.

Explanation

Model metadata connects three pieces: the training code, the specific dataset version, and the model artifact. MLflow tracks parameters and metrics; DVC tracks data and model file versions; Git tracks the code. Together they form a record such as 'this model was trained on dataset-v2.3 with learning_rate=0.001 on 2026-04-15 at 14:32 UTC'. This is the foundation of reproducible MLOps. Without it, a year later you cannot answer 'where did this model come from?' When you log a training run with MLflow and commit dvc.lock to Git, future developers can check out that commit with git checkout, then run dvc pull (or dvc checkout if the data is already in the local cache) to restore exactly the code and data that generated the model.
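One way to make that record explicit is to attach a data fingerprint to the MLflow run itself. The sketch below is a minimal illustration and not something the configuration in this lesson already does: the helper names file_md5 and build_run_tags and the tag keys data_path and data_md5 are hypothetical. It hashes the dataset file with the standard library and builds a dictionary you could pass to mlflow.set_tags() inside an active run.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_tags(data_path):
    """Build tags that tie a training run to the exact data file it read.

    The tag names are illustrative; use whatever convention your team prefers.
    """
    return {
        "data_path": Path(data_path).as_posix(),
        "data_md5": file_md5(data_path),
    }


# Inside a training script (assumes mlflow is installed and a run is active):
#     mlflow.set_tags(build_run_tags("data/prepared/train.pkl"))
```

Treat the digest as an independent fingerprint: it lets you search MLflow runs by data version without opening dvc.lock, but dvc.lock remains the source of truth for restoring the data.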

Configuration

yaml

# dvc.yaml: Define training pipeline with data version tracking
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw/train.csv
      - src/prepare.py
    outs:
      - data/prepared/train.pkl  # DVC records its content hash in dvc.lock
  train:
    cmd: python src/train.py  # hyperparameters are read from params.yaml
    deps:
      - data/prepared/train.pkl
      - src/train.py
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl  # DVC records its content hash in dvc.lock
    metrics:
      - metrics.json:
          cache: false

# params.yaml: Store hyperparameters for DVC to track
train:
  epochs: 10
  lr: 0.001
  batch_size: 32

# .dvc/config: Point DVC at the remote storage that holds versioned data
[core]
    remote = myremote
    autostage = true

['remote "myremote"']
    url = /mnt/shared/dvc-storage

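# src/train.py: Log everything to MLflow
import json
import os
import pickle

import mlflow
import mlflow.sklearn
import yaml

# train_model() and evaluate() stand in for your own training and scoring
# code; they are not defined in this snippet.

with open('params.yaml') as f:
    params = yaml.safe_load(f)['train']

mlflow.set_experiment('model-training')
with mlflow.start_run():
    mlflow.log_params(params)

    with open('data/prepared/train.pkl', 'rb') as f:
        X_train = pickle.load(f)

    model = train_model(X_train, epochs=params['epochs'], lr=params['lr'])
    # Note: this scores on the training data; use a held-out set for real evaluation
    accuracy = evaluate(model, X_train)

    mlflow.log_metrics({'accuracy': accuracy})
    mlflow.sklearn.log_model(model, 'model')

    # Write the files that dvc.yaml declares as metrics and outs
    with open('metrics.json', 'w') as f:
        json.dump({'accuracy': accuracy}, f)

    os.makedirs('models', exist_ok=True)
    with open('models/model.pkl', 'wb') as f:
        pickle.dump(model, f)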

Why this order?

The order in which stages appear in dvc.yaml does not matter: DVC builds a dependency graph from deps and outs. Because prepare outputs data/prepared/train.pkl and train lists that file as a dependency, dvc repro always runs prepare first. MLflow logging happens inside the training script, so it records the parameters and metrics of a run that used the prepared data. params.yaml must exist before you run dvc repro, because the train stage references keys in it, and .dvc/config determines where DVC stores versioned data.
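The idea of deriving run order from declared dependencies can be sketched with the standard library. This is an illustration of the concept only, not DVC's implementation: the stages dictionary is a simplified stand-in for dvc.yaml, and execution_order is a hypothetical helper. Note that train is deliberately listed first to show that the written order has no effect.

```python
from graphlib import TopologicalSorter

# Simplified stand-in for the stages in dvc.yaml: only deps and outs matter here.
stages = {
    "train": {
        "deps": ["data/prepared/train.pkl", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "prepare": {
        "deps": ["data/raw/train.csv"],
        "outs": ["data/prepared/train.pkl"],
    },
}


def execution_order(stages):
    """Order stages so that each one runs after the stages producing its deps."""
    # Map every output path to the stage that produces it.
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # For each stage, collect the stages whose outputs it consumes.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))  # ['prepare', 'train']
```

Dependencies that no stage produces, such as data/raw/train.csv and src/train.py, add no edges to the graph; they only affect whether a stage is considered changed.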

Wrong vs Right

Wrong way

yaml

# WRONG: Hardcoded paths, no tracking
python train.py  # Which dataset version? Which code commit?

# WRONG: Training script with no metadata
import pickle
model = train()
with open('models/model.pkl', 'wb') as f:
    pickle.dump(model, f)
# No link to data, no parameters logged, no reproduction path

# WRONG: DVC pipeline without MLflow logging
stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
    outs:
      - model.pkl
# DVC tracks what changed but doesn't record why (parameters, accuracy, timestamp)

Right way

yaml

# RIGHT: Use dvc.yaml for pipeline reproducibility
stages:
  train:
    cmd: python src/train.py
    deps:
      - data/train.csv
      - src/train.py
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

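# RIGHT: Log everything to MLflow inside training script
import mlflow
import mlflow.sklearn
import yaml

with open('params.yaml') as f:
    params = yaml.safe_load(f)

# The context manager ends the run even if training raises an exception
with mlflow.start_run():
    mlflow.log_params(params['train'])
    model = train(epochs=params['train']['epochs'])  # your training function
    accuracy = evaluate(model)  # your evaluation function; log the measured value
    mlflow.log_metrics({'accuracy': accuracy})
    mlflow.sklearn.log_model(model, 'model')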

# RIGHT: Commit dvc.lock to git
# dvc repro generates dvc.lock with exact file hashes
# Commit it to git so teammates can reproduce the exact run later

Tool vitals

Primary command

bash

mlflow.log_param(), mlflow.log_metrics(), dvc stage add, dvc repro

Config file: dvc.yaml, .dvc/config, MLflow tracking server URI

Verify

bash

mlflow ui, dvc dag, dvc status

Integration notes

MLflow and DVC are complementary: DVC versions your data by content hash (MD5) and stores it in remote storage; MLflow logs training metadata and serves models. When you run dvc pull, you restore the exact dataset; when you call mlflow.start_run(), you record what happened during training. Together, dvc.lock (data version) + MLflow UI (training metrics) + git history (code version) = complete reproducibility.

Migration path

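If you outgrow MLflow's local tracking server, move to a remote tracking server, either managed (for example Databricks or Amazon SageMaker) or self-hosted. If DVC's remote storage is too slow, switch to a faster object store (for example S3 instead of local NFS). Neither migration changes your metadata format: your logging code and dvc.yaml stay the same, and you only update the MLflow tracking URI and the DVC remote URL. Existing runs and stored data still have to be copied to the new location.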

Common gotcha

DVC records data file hashes in dvc.lock, and dvc.lock MUST be committed to git. If you add dvc.lock to .gitignore, teammates cannot reproduce your run because they have no record of which data version you used. Also, dvc repro skips a stage whenever its dependencies, parameters, and outputs still match what dvc.lock records, whether or not you expected a re-run. Use dvc repro --force if you need to re-run anyway.
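A rough mental model for when a stage is skipped is a comparison between recorded hashes and current hashes. The sketch below is a conceptual simplification and not DVC's actual code: changed_entries and should_rerun are hypothetical helpers, and the hash strings are made-up placeholders.

```python
def changed_entries(recorded, current):
    """Return the paths whose hash differs from, or is missing in, the lock record."""
    paths = set(recorded) | set(current)
    return sorted(p for p in paths if recorded.get(p) != current.get(p))


def should_rerun(recorded, current, force=False):
    """A stage re-runs when forced or when any tracked hash no longer matches."""
    return force or bool(changed_entries(recorded, current))


# Placeholder hashes standing in for what a lock file would record.
recorded = {"data/prepared/train.pkl": "aaa111", "src/train.py": "bbb222"}
current = {"data/prepared/train.pkl": "aaa111", "src/train.py": "ccc333"}

print(changed_entries(recorded, current))  # ['src/train.py']
print(should_rerun(recorded, recorded))    # False
```

A file that is not tracked never appears in either dictionary, which is why an unlisted script can change without triggering a re-run.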

Team adoption

Day 1: Have one person set up dvc.yaml, run dvc repro, and commit dvc.lock to git. Day 2: Every teammate clones the repo, runs dvc pull (downloads data), then runs dvc repro --force to re-execute the pipeline and confirms they get the same metrics (a plain dvc repro would report everything as up to date right after a pull). Identical metrics require deterministic training, so fix your random seeds. This proves reproducibility before code review. Document the MLflow tracking server URI in the README so everyone logs to the same place. Use dvc dag to visualize the pipeline in team meetings: it clarifies dependencies better than text.
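The "same metrics" check can be automated with a small comparison script. This is a minimal sketch under stated assumptions: load_metrics and metrics_match are hypothetical helpers, the metrics files are assumed to be flat JSON objects of numbers like the metrics.json in this lesson, and the tolerance is an arbitrary starting point.

```python
import json
import math


def load_metrics(path):
    """Read a flat JSON metrics file such as metrics.json."""
    with open(path) as f:
        return json.load(f)


def metrics_match(expected, actual, rel_tol=1e-6):
    """True when both dicts have the same keys and numerically close values."""
    if expected.keys() != actual.keys():
        return False
    return all(
        math.isclose(expected[key], actual[key], rel_tol=rel_tol)
        for key in expected
    )


# Example with in-memory values; in practice both sides come from load_metrics().
print(metrics_match({"accuracy": 0.91}, {"accuracy": 0.91}))  # True
print(metrics_match({"accuracy": 0.91}, {"accuracy": 0.85}))  # False
```

One way to obtain the committed baseline is to save the output of git show HEAD:metrics.json to a separate file before re-running, then compare it with the freshly written metrics.json.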

Experienced dev note

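Use dvc metrics show and dvc metrics diff to compare scalar metrics such as those in metrics.json across commits without writing custom code; dvc plots show renders graphs, but it needs series data (for example per-epoch values in a CSV or JSON file), not a single scalar. Also, set cache: false on metrics files in dvc.yaml so that DVC leaves them out of its cache and you can commit them to Git, where they show up in diffs and code review. And tag your MLflow runs with the git commit hash: mlflow.set_tag('git_sha', os.popen('git rev-parse HEAD').read().strip()) so you can always trace a model back to exact source code.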
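A slightly more defensive variant of the commit-hash lookup uses subprocess instead of os.popen, so that a missing git binary or a directory outside a repository does not silently produce an empty tag. The helper name current_git_sha and the fallback value 'unknown' are assumptions made for this sketch.

```python
import subprocess


def current_git_sha(default="unknown"):
    """Return the commit hash of HEAD, or `default` outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # CalledProcessError: not a repository; FileNotFoundError: git not installed.
        return default
    return result.stdout.strip()


# Inside an active MLflow run (assumes mlflow is installed):
#     mlflow.set_tag("git_sha", current_git_sha())
```

The hash only identifies committed code, so a run started with uncommitted changes will still point at the previous commit.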

Check your understanding

You run dvc repro and it says 'Stage train is up to date' even though your training script changed. Why didn't it re-run, and how do you force it?

Answer hint

DVC only re-runs a stage if its command, deps, params, or outputs changed. Since you didn't list src/train.py in deps, DVC doesn't know the script changed. Always add training code as a dependency: deps: [data/train.csv, src/train.py]. To force a re-run: dvc repro --force.
