
Model governance process

What you will learn

Enforce reproducible model lifecycle tracking via MLflow Model Registry + DVC pipeline versioning + signed approval gates.

Why this matters

Without governance, teams deploy untraceable models, lose reproducibility across environments, and cannot audit which model trained on which data with which hyperparameters. This causes production failures blamed on 'stale' models and compliance violations in regulated industries.

Skip if: you are running one-off experiments or exploratory notebooks, where governance overhead is unnecessary. Use governance when multiple people deploy models, regulatory audit trails are required, models run in production for more than a month, or data pipelines have more than three dependencies.

Explanation

Model governance is a layered process: (1) track experiments in MLflow, logging the parameters and metrics of every run; (2) version data and code together with DVC so any run can be reproduced; (3) gate production promotion through Model Registry stages (None → Staging → Production) with a recorded approval; (4) version the entire pipeline (code + data + model) as one atomic unit. The result is an audit trail: given a model name and version, you can walk backward to the exact training data, code commit hash, and human approver. Governance does not stop model drift by itself, but it makes a degraded model traceable to a specific data or code change, enables rollback, and satisfies compliance requirements by proving who deployed what and when.

Configuration

yaml

# File: dvc.yaml
# DVC pipeline definition: ensures data → model → artifact versioning
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/processed.csv
  train:
    cmd: python src/train.py --config params.yaml
    deps:
      - data/processed.csv
      - src/train.py
      - params.yaml
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - models/model.pkl
      - data/processed.csv
    metrics:
      - eval_metrics.json:
          cache: false

---

# File: params.yaml
# Hyperparameters locked in version control
train:
  learning_rate: 0.001
  batch_size: 32
  epochs: 50
  random_seed: 42
model:
  type: xgboost
  max_depth: 6

---

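# File: src/train.py
# MLflow tracking + model logging (registration is a separate step)
import json
import pickle
import subprocess
from pathlib import Path

import mlflow
import mlflow.xgboost
import pandas as pd
import xgboost as xgb
import yaml
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

with open('params.yaml') as f:
    params = yaml.safe_load(f)

mlflow.set_tracking_uri('http://mlflow-server:5000')
mlflow.set_experiment('taxi-demand-production')

with mlflow.start_run(run_name='train-v2.3.1'):
    # Load the output of the DVC "prepare" stage.
    # Assumption: the label column is called 'target'; adjust to your schema.
    df = pd.read_csv('data/processed.csv')
    X_train = df.drop(columns=['target'])
    y_train = df['target']

    # Log hyperparameters and the code version before fitting,
    # so a run that fails mid-training still records them.
    mlflow.log_params(params['train'])
    mlflow.log_params(params['model'])
    git_commit = subprocess.check_output(['git', 'rev-parse', 'HEAD'], text=True).strip()
    mlflow.set_tag('git_commit', git_commit)

    model = xgb.XGBRegressor(
        learning_rate=params['train']['learning_rate'],
        max_depth=params['model']['max_depth'],
        random_state=params['train']['random_seed']
    )
    model.fit(X_train, y_train)

    # Compute training-set metrics rather than hardcoding them;
    # held-out metrics belong to the evaluate stage.
    preds = model.predict(X_train)
    train_metrics = {
        'rmse': float(mean_squared_error(y_train, preds) ** 0.5),
        'mae': float(mean_absolute_error(y_train, preds)),
        'r2': float(r2_score(y_train, preds))
    }
    mlflow.log_metrics(train_metrics)

    # Write the outputs that dvc.yaml declares for the train stage.
    with open('metrics.json', 'w') as f:
        json.dump(train_metrics, f, indent=2)
    Path('models').mkdir(exist_ok=True)
    with open('models/model.pkl', 'wb') as f:
        pickle.dump(model, f)

    mlflow.xgboost.log_model(model, 'model')

    model_uri = mlflow.get_artifact_uri('model')
    print(f'Model logged to: {model_uri}')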

---

# Environment variables (export in your shell or CI; MLflow does not read a .mlflow/config file)
# Point to centralized MLflow server
MLFLOW_TRACKING_URI=http://mlflow-server:5000
MLFLOW_REGISTRY_URI=http://mlflow-server:5000

---

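# File: .dvc/config (DVC configuration, committed to git)
[core]
    analytics = false
    autostage = true
    remote = storage
['remote "storage"']
    url = s3://my-org-mlops/dvc-storage
# Do not commit credentials. DVC does not expand ${...} placeholders in this file;
# it reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment, or from
# .dvc/config.local when set with `dvc remote modify --local storage ...`.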

---

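# File: register_model.py
# Register model in MLflow Registry with approval gate.
# The `mlflow models` CLI has no registry subcommands; use the Python (or REST) API.
# Stages are deprecated since MLflow 2.9 in favor of aliases, but the calls below still work in 2.x.
import mlflow
from mlflow import MlflowClient

mlflow.set_tracking_uri('http://mlflow-server:5000')
client = MlflowClient()

# New versions start in stage "None"
mlflow.register_model('runs:/abc123def456/model', 'taxi-demand-v2')

# Inspect what currently sits in Production and Staging
for v in client.get_latest_versions('taxi-demand-v2', stages=['Production', 'Staging']):
    print(v.version, v.current_stage)

# Record the approval on the version, then promote it.
# Open-source MLflow has no built-in approval request, so run this only from a reviewed CI job.
client.update_model_version(
    name='taxi-demand-v2', version='3',
    description='JIRA-4521: approved for production after stress testing'
)
client.transition_model_version_stage(
    name='taxi-demand-v2', version='3', stage='Production',
    archive_existing_versions=True
)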

---

# File: governance_audit.py
# Generate audit trail: model → training run → code version
import json
from datetime import datetime, timezone

import mlflow

mlflow.set_tracking_uri('http://mlflow-server:5000')
client = mlflow.tracking.MlflowClient()

model_name = 'taxi-demand-v2'
versions = client.get_latest_versions(name=model_name, stages=['Production'])
if not versions:
    raise SystemExit(f'No Production version found for {model_name}')
model = versions[0]

run = client.get_run(model.run_id)
params = run.data.params
metrics = run.data.metrics
tags = run.data.tags

audit_log = {
    'timestamp': datetime.now(timezone.utc).isoformat(),
    'model_name': model_name,
    'model_version': model.version,
    'stage': model.current_stage,
    'run_id': model.run_id,
    # 'mlflow.user' is set by MLflow; 'git_commit' is a custom tag your training script must set
    'created_by': tags.get('mlflow.user', 'unknown'),
    'training_commit': tags.get('git_commit') or tags.get('mlflow.source.git.commit', 'untagged'),
    'hyperparameters': params,
    'metrics': metrics,
    'approval_comment': model.description or 'no approval recorded'
}

print(json.dumps(audit_log, indent=2))

Why this order?

DVC derives execution order from each stage's deps and outs, not from its position in dvc.yaml; listing stages as prepare → train → evaluate simply keeps the file readable. Log MLflow parameters before model.fit() so that a run which crashes mid-training still records its hyperparameters. Register the model only after the training run has logged it, because registration needs the run ID and the model artifact. Approval gates come last: they review a finished, registered version, whose lineage is already fixed, before it reaches production.
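To make the dependency point concrete, here is a minimal sketch (not DVC's actual implementation) that derives a run order purely from deps and outs using Python's standard `graphlib`. The `STAGES` dictionary is a hand-copied, simplified version of the dvc.yaml above, deliberately listed out of order; the function name `execution_order` is hypothetical.

```python
from graphlib import TopologicalSorter

# Simplified copy of dvc.yaml, listed out of order on purpose (illustrative only).
STAGES = {
    "evaluate": {"deps": ["models/model.pkl", "data/processed.csv"], "outs": []},
    "train": {"deps": ["data/processed.csv", "params.yaml"], "outs": ["models/model.pkl"]},
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/processed.csv"]},
}


def execution_order(stages):
    """Return stage names so each stage runs after the stages that produce its deps."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on the producers of its deps; raw inputs have no producer.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

The order of keys in `STAGES` has no effect on the result, which is the same reason reordering stages in dvc.yaml does not change what `dvc repro` runs first.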

Wrong vs Right

Wrong way

yaml

# Wrong: No governance
python train.py
python predict.py
cp model.pkl /prod/models/
# No way to know: what data trained this? who approved? can we rollback? was it validated?

---

# Wrong: MLflow without DVC
mlflow.xgboost.log_model(model, 'model')
# Problem: model is versioned but data is not. You have v3 of the model but which v of the data did it train on?

---

# Wrong: DVC without MLflow
dvc push
git commit -m "train new model"
# Problem: pipeline is versioned but experiments are invisible. No metrics stored, no hyperparameter lineage.

Right way

yaml

# Right: Layered governance

# 1. Track everything in DVC
dvc repro
dvc push
git add dvc.yaml dvc.lock params.yaml
git commit -m "train taxi-demand v2.3.1 (rmse=0.42)"

# 2. Track metrics + hyperparameters in MLflow during training
mlflow.log_params(params)
mlflow.log_metrics(metrics)
mlflow.xgboost.log_model(model, 'model')

# 3. Register in Model Registry (new versions start in stage "None"; there is no CLI for this)
mlflow.register_model('runs:/xyz/model', 'taxi-demand-v2')

# 4. Gate promotion to Production with a recorded approval (run from a reviewed CI job)
client = mlflow.MlflowClient()
client.update_model_version('taxi-demand-v2', '5', description='APPROVED by Data team: validation RMSE=0.39')
client.transition_model_version_stage('taxi-demand-v2', '5', stage='Production')

# 5. Audit trail is complete
mv = client.get_model_version('taxi-demand-v2', '5')
run = client.get_run(mv.run_id)
# mv + run give: run_id → git commit tag → training params → metrics → approval note → timestamps

Tool vitals

Primary command

bash

dvc dag && dvc repro && dvc push
# Registry operations have no CLI subcommand: use mlflow.register_model() and MlflowClient from Python

Config file dvc.yaml, params.yaml, .dvc/config, MLFLOW_TRACKING_URI environment variable

Verify

bash

dvc status && python -c "from mlflow import MlflowClient; print([m.name for m in MlflowClient().search_registered_models()])"

Integration notes

This governance process chains three tools: (1) DVC versions data + code + pipeline, commits to git; (2) MLflow tracks experiments during training, stores metrics/params; (3) Model Registry gates promotion with approval workflow. Together they create atomic versions: one git commit = one DVC lock file = one MLflow run = one model version. When deploying to Kubernetes, reference the git commit hash and DVC data version, not loose files.
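One way to carry the "one commit = one lock file = one run = one model version" idea into a deployment is a small manifest that refuses to ship with gaps. The sketch below is an assumption about how a team might encode it: the `ReleaseManifest` class, its field names, and the placeholder values are all hypothetical, not part of MLflow or DVC.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """One deployable unit: a model version plus the run, code and data it came from."""
    model_name: str
    model_version: str
    run_id: str
    git_commit: str
    dvc_lock_sha256: str

    def validate(self):
        """Reject manifests with any empty field, so partial lineage never ships."""
        missing = [field for field, value in asdict(self).items() if not value]
        if missing:
            raise ValueError(f"incomplete lineage, missing: {', '.join(missing)}")
        return self


if __name__ == "__main__":
    # Placeholder values for illustration only.
    manifest = ReleaseManifest(
        model_name="taxi-demand-v2",
        model_version="5",
        run_id="abc123def456",
        git_commit="0123abc",
        dvc_lock_sha256="deadbeef",
    ).validate()
    print(json.dumps(asdict(manifest), indent=2))
```

A Kubernetes deployment could then mount or annotate this JSON instead of pointing at a loose model file.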

Migration path

If moving away from MLflow, Weights & Biases (W&B) covers experiment tracking with a similar param/metric logging pattern. For the Model Registry, consider cloud-native registries (Google Vertex AI Model Registry, AWS SageMaker Model Registry). DVC can be replaced by Pachyderm or Delta Lake, but its integration with git is hard to replicate, so expect to run a separate artifact versioning system alongside the replacement.

Cost model

MLflow server (self-hosted): free software, but it needs a host such as EC2 or a Docker container (roughly $20–100/month of infrastructure). Managed MLflow (Databricks-hosted): $0 free tier, roughly $200+/month for teams. DVC: free; the costs are S3 storage (~$0.023/GB/month) and egress. Model Registry: free (part of MLflow). Approval workflow: built-in transition requests are a feature of Databricks-managed MLflow; on the open-source server, implement approvals with manual API calls and CI checks (free).

Common gotcha

MLflow Model Registry stages (Staging, Production) are NOT automatically enforced at prediction time. A model in 'Staging' can still be loaded and used if someone has the model URI. Governance requires: (1) enforce via API policy (check model.current_stage before serving), (2) separate S3 buckets or IAM roles for Production artifacts, or (3) deny deployments that reference non-Production models in your CI/CD.
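Option (1) can be sketched as a small guard that runs before any model is served. This is a hedged example: `require_stage` and `StageGateError` are hypothetical names, and `FakeClient` stands in for `MlflowClient` (whose real `get_model_version(name, version)` returns an object with a `current_stage` attribute) so the snippet runs without a server.

```python
from types import SimpleNamespace


class StageGateError(RuntimeError):
    """Raised when a model version is not in the required Registry stage."""


def require_stage(client, name, version, required="Production"):
    """Return the model version if it is in the `required` stage, else raise."""
    mv = client.get_model_version(name, version)
    if mv.current_stage != required:
        raise StageGateError(
            f"{name} v{version} is in stage {mv.current_stage!r}, not {required!r}"
        )
    return mv


class FakeClient:
    """Stand-in for MlflowClient with made-up registry contents."""
    _stages = {("taxi-demand-v2", "4"): "Production", ("taxi-demand-v2", "5"): "Staging"}

    def get_model_version(self, name, version):
        return SimpleNamespace(
            name=name, version=version, current_stage=self._stages[(name, version)]
        )


if __name__ == "__main__":
    client = FakeClient()
    print(require_stage(client, "taxi-demand-v2", "4").current_stage)
    try:
        require_stage(client, "taxi-demand-v2", "5")
    except StageGateError as err:
        print(f"blocked: {err}")
```

In a real service, pass an `MlflowClient()` instead of `FakeClient()` and call the guard in the model-loading path, so a Staging URI fails loudly instead of being served.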

Team adoption

Day 1: Enforce that every training script opens an MLflow run (ideally with mlflow.start_run() as a context manager, so the run always closes) and logs params + metrics. Require a git commit for every model promotion, with no manual S3 uploads. Day 2: Add a Slack notification and a JIRA comment triggered by model stage transitions. Day 3: In CI/CD, add a gate that blocks deployment of any model not in the Production stage (check via the MLflow API). Day 4: Send a weekly audit report to compliance: registered models and versions (from MlflowClient.search_registered_models()) plus the output of dvc dag --md. Resistance point: developers dislike approval latency. Address it with a 'fast-track' path that auto-approves when metrics meet an agreed SLA threshold.
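The 'fast-track' rule could be as simple as the check below. It is a sketch under stated assumptions: the function name `fast_track_eligible`, the shape of the `SLA` mapping, and the threshold values are all hypothetical and would need to be agreed with the model owners.

```python
def fast_track_eligible(metrics, sla):
    """Return True only if every SLA metric is present and within its bound.

    `sla` maps a metric name to ("max" | "min", threshold).
    """
    for name, (kind, threshold) in sla.items():
        value = metrics.get(name)
        if value is None:
            return False  # a missing metric never auto-approves
        if kind == "max" and value > threshold:
            return False
        if kind == "min" and value < threshold:
            return False
    return True


# Hypothetical thresholds for illustration.
SLA = {"rmse": ("max", 0.45), "r2": ("min", 0.85)}

if __name__ == "__main__":
    print(fast_track_eligible({"rmse": 0.42, "mae": 0.31, "r2": 0.876}, SLA))
    print(fast_track_eligible({"rmse": 0.50, "r2": 0.876}, SLA))
```

Versions that fail the check fall back to the normal human approval, so the fast track shortens the common case without removing the gate.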

Experienced dev note

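Use mlflow.start_run() as a context manager, or call mlflow.end_run() explicitly after logging: in notebooks and long-lived processes, a run left open silently absorbs later log calls. More importantly, tag every run with its code and data versions: mlflow.set_tag('git_commit', subprocess.check_output(['git', 'rev-parse', 'HEAD'], text=True).strip()) and mlflow.set_tag('dvc_lock_sha256', hashlib.sha256(open('dvc.lock', 'rb').read()).hexdigest()). Tag the hash of dvc.lock rather than the output of dvc dag --md, which prints the pipeline graph and says nothing about data versions. Without these tags, you have a model version number but no way to fetch the exact training code six months later. Also, Model Registry stages are not enforced server-side: they are labels, not access controls. Production safety requires either (a) a CI/CD gate that checks the stage via the MLflow API, or (b) a separate S3 bucket for Production models with an IAM policy blocking non-approved versions.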
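The tagging advice can be wrapped in one helper so every training script records the same lineage. This is a minimal sketch: `lineage_tags` and the tag name `dvc_lock_sha256` are assumptions, not MLflow or DVC conventions, and the helper returns "unknown" or "missing" rather than failing when git or dvc.lock is absent.

```python
import hashlib
import subprocess
from pathlib import Path


def lineage_tags(repo_dir="."):
    """Collect tags that pin a run to its code commit and DVC pipeline state."""
    repo = Path(repo_dir)
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo, text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = "unknown"  # not a git repository, or git is not installed

    lock = repo / "dvc.lock"
    # dvc.lock records the hashes of every dep and out, so its digest pins the data version.
    lock_hash = hashlib.sha256(lock.read_bytes()).hexdigest() if lock.exists() else "missing"
    return {"git_commit": commit, "dvc_lock_sha256": lock_hash}


if __name__ == "__main__":
    print(lineage_tags())
```

Inside a training run, the result can be passed straight to `mlflow.set_tags(lineage_tags())`.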

Check your understanding

You have MLflow model version 5 in Staging and version 4 in Production. Your team trained v5 two days ago with better hyperparameters, but has not requested a transition yet. A junior developer runs mlflow.pyfunc.load_model('models:/taxi-demand-v2/Staging') and deploys the result to the production API server. Does this violate governance, and why?

Answer hint

MLflow stages are metadata only: the API call succeeds and v5 is deployed. This violates governance because: (1) no approval was recorded, (2) nobody documented why v5 supersedes v4, (3) there's no audit trail of who deployed when. Governance requires either a CI/CD policy check (reject deployments referencing non-Production models) or enforcement that prediction code only accepts Production-stage models via MLflow API call.
