Model governance process
Why this matters
Without governance, teams deploy untraceable models, lose reproducibility across environments, fail audits, and can't roll back bad versions. Governance bridges the gap between experiment tracking and production deployment: it's what prevents 'which model is actually running?' questions at 2 AM.
Explanation
Model governance is the structured process that ensures every model in production is versioned, approved, traceable, and reproducible. It builds on MLflow Tracking (experiments) and the MLflow Model Registry (versions and stages), adding workflow gates such as validation, approval, and audit logging. DVC complements this by versioning the data and code that produced the model, so each model version has a reproducible lineage trail.
A governance process answers four questions: Who trained this model? What data was used? Was it approved for production? Can we reproduce it exactly? The process typically flows: experiment → register candidate → validate → approve → deploy → monitor → retire. Without it, you have orphaned models, unclear ownership, and no way to answer 'why did this model behave differently yesterday?'
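The lifecycle flow can be treated as a small state machine that rejects skipped steps. The sketch below is illustrative only: the `Phase` names mirror the flow in the text, and the rule that any phase may jump straight to retirement is an assumption, not something MLflow or DVC enforces.

```python
from enum import Enum


class Phase(Enum):
    EXPERIMENT = "experiment"
    CANDIDATE = "register candidate"
    VALIDATED = "validate"
    APPROVED = "approve"
    DEPLOYED = "deploy"
    MONITORED = "monitor"
    RETIRED = "retire"


# Enum members iterate in definition order, which matches the flow in the text.
ORDER = list(Phase)


def can_advance(current: Phase, target: Phase) -> bool:
    """Allow only the next phase, or retirement from any live phase."""
    if target is Phase.RETIRED:
        # Assumption: a model can be retired at any point (e.g. a rejected candidate).
        return current is not Phase.RETIRED
    return ORDER.index(target) == ORDER.index(current) + 1


def advance(current: Phase, target: Phase) -> Phase:
    """Return the new phase, or raise if the transition skips a gate."""
    if not can_advance(current, target):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

With this in place, `advance(Phase.CANDIDATE, Phase.DEPLOYED)` raises, which is the behavior a governance gate is meant to guarantee.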
Configuration
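#!/bin/bash
# 1. Register a model in the MLflow Registry, then
# 2. move the new version to Staging.
# The open-source mlflow CLI has no registry subcommands, so call the Python API.
python - <<'EOF'
import mlflow
from mlflow.tracking import MlflowClient

# Creates the registered model if needed and adds a new version
mv = mlflow.register_model(
    model_uri="s3://my-bucket/models/linear-regression/v1",
    name="linear-regression-prod",
)
# Stages are fixed by MLflow: None, Staging, Production, Archived
MlflowClient().transition_model_version_stage(
    name="linear-regression-prod",
    version=mv.version,
    stage="Staging",
)
EOF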
# 3. DVC pipeline with governance gates (dvc.yaml)
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/train.csv
    params:
      - model.learning_rate
      - model.max_depth
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
  validate:
    cmd: python validate.py --model models/model.pkl --threshold 0.85
    deps:
      - validate.py
      - data/test.csv
      - models/model.pkl
    metrics:
      - validation.json:
          cache: false
# Model metadata is a top-level section in dvc.yaml, not a stage field
artifacts:
  linear-regression:
    path: models/model.pkl
    type: model
# 4. GitHub Actions workflow for approval gate (.github/workflows/model-approval.yml)
name: Model Approval Gate
on:
  workflow_dispatch:
    inputs:
      model_version:
        description: Model version to promote
        required: true
        type: string
permissions:
  contents: write  # lets the job push the AUDIT.log commit
env:
  # Assumption: the tracking server URL is stored as a repository secret
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  MODEL_NAME: linear-regression-prod
  # Passed through env rather than inlined into scripts, to avoid shell injection
  MODEL_VERSION: ${{ github.event.inputs.model_version }}
jobs:
  approval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install MLflow
        run: pip install mlflow boto3
      - name: Validate model metrics
        run: |
          python - <<'EOF'
          import os
          from mlflow.tracking import MlflowClient
          client = MlflowClient()
          # Check the run that produced this version, not just the latest run
          mv = client.get_model_version(os.environ['MODEL_NAME'], os.environ['MODEL_VERSION'])
          metrics = client.get_run(mv.run_id).data.metrics
          if metrics.get('accuracy', 0) < 0.85:
              raise ValueError('Model accuracy below threshold')
          print('✓ Model validated')
          EOF
      - name: Promote to Production
        run: |
          python - <<'EOF'
          import os
          from mlflow.tracking import MlflowClient
          MlflowClient().transition_model_version_stage(
              name=os.environ['MODEL_NAME'],
              version=os.environ['MODEL_VERSION'],
              stage='Production',
          )
          EOF
      - name: Log audit entry
        run: |
          echo "Model ${MODEL_VERSION} promoted by ${GITHUB_ACTOR} at $(date -u +%FT%TZ)" >> AUDIT.log
      - name: Commit audit log
        run: |
          git config user.email "governance@example.com"
          git config user.name "Governance Bot"
          git add AUDIT.log
          git commit -m "Audit: Promote model v${MODEL_VERSION}"
          git push
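# 5. MLflow server settings and approval policy
# MLflow does not read a .mlflow/config.yaml; pass the stores as server flags:
mlflow server \
  --backend-store-uri postgresql://user:password@localhost/mlflow \
  --default-artifact-root s3://my-bucket/mlflow-artifacts
# The registry's stages are fixed (None, Staging, Production, Archived) and carry no
# approver settings, so keep the approval policy in a team-owned file that CI enforces.
# The schema below is a team convention, not an MLflow feature:
stages:
  - name: "None"
    description: Experimental models
    requires_approval: false
  - name: Staging
    description: Validated candidates
    requires_approval: false
  - name: Production
    description: Live models
    requires_approval: true
    approvers:
      - data-science-lead
      - mlops-engineer
Why this order?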
Register the model first (this creates the version that every later step refers to), transition through stages sequentially (new version → Staging → Production), validate metrics before promotion (the gate keeps bad models out), and commit the audit log last (a record of who approved what and when). The GitHub Actions workflow enforces this order programmatically: a failed validation step stops the job, so the promotion step never runs.
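The same ordering can be expressed as one function, which makes the gate easy to unit-test outside CI. This is a minimal sketch: `promote_with_gates` and its `promote` and `audit` callbacks are hypothetical names, standing in for the registry call and the AUDIT.log write, and the 0.85 default mirrors the threshold used in the workflow.

```python
from datetime import datetime, timezone
from typing import Callable


def promote_with_gates(
    version: str,
    metrics: dict,
    actor: str,
    promote: Callable[[str], None],
    audit: Callable[[str], None],
    threshold: float = 0.85,
) -> None:
    """Validate, then promote, then audit; stop at the first failure."""
    # Gate 1: validation runs before anything is changed.
    accuracy = metrics.get("accuracy", 0.0)
    if accuracy < threshold:
        raise ValueError(f"accuracy {accuracy} below threshold {threshold}")
    # Gate 2: promotion happens only after validation passed.
    promote(version)
    # Last: record who promoted what and when (UTC timestamp).
    stamp = datetime.now(timezone.utc).isoformat()
    audit(f"Model {version} promoted by {actor} at {stamp}")
```

Because the side effects are injected, a test can pass plain list-appending functions and confirm that a failing metric leaves both lists empty.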
Wrong vs Right
# ❌ No governance: train.py directly uploads to S3 and deploys
import boto3
import joblib
model = train_model(data)
joblib.dump(model, '/tmp/model.pkl')
s3 = boto3.client('s3')
s3.put_object(Bucket='prod-bucket', Key='model.pkl', Body=open('/tmp/model.pkl', 'rb'))
# Kubernetes deployment pod just loads from S3: no versioning, approval, or audit trail
# Result: Nobody knows which training run produced this model. Rolling back is manual and error-prone.
# ✅ Governance: MLflow Registry + DVC versioning + approval workflow
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

mlflow.set_experiment("production-models")
with mlflow.start_run() as run:
    model = train_model(data)
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_metrics({"accuracy": 0.92, "f1": 0.89})
    mlflow.log_param("learning_rate", 0.01)
    run_id = run.info.run_id

# register_model creates the registered model if it does not exist yet
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="linear-regression-prod",
)
client = MlflowClient()
client.transition_model_version_stage(
    name="linear-regression-prod",
    version=model_version.version,
    stage="Staging",
)
# Now a human (or automated validation) approves before moving to Production
# DVC tracks data: dvc add data/train.csv → creates data/train.csv.dvc with the file hash
# GitHub Actions enforces: if metrics < threshold, block promotion
# AUDIT.log records: who, what, when for compliance
Tool vitals
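Health check (the mlflow CLI has no registry commands, so use the Python client): python -c "from mlflow.tracking import MlflowClient; print([m.name for m in MlflowClient().search_registered_models()])"
Config file: .github/workflows/model-approval.yml or MLproject YAML
Stage check: python -c "from mlflow.tracking import MlflowClient; print(MlflowClient().get_model_version('your-model', '1').current_stage)"
Integration notes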
The MLflow Model Registry is the approval gate. DVC tracks the exact data, code, and parameters that produced each model version: create dvc.yaml with train and validate stages, and commit dvc.lock to pin the versions. GitHub Actions (or an equivalent CI/CD system) enforces the workflow. S3 or a similar artifact store holds the serialized model files. In Kubernetes, the model-serving pod should resolve models through the MLflow Registry API rather than reading S3 paths directly, so that it only ever loads Production-stage versions.
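In serving code, resolving through the registry usually means loading by a stage URI instead of a bucket path. The sketch below assumes the tracking URI is already configured in the pod's environment; the helper names are hypothetical, while `models:/<name>/<stage>` and `mlflow.pyfunc.load_model` are the registry URI scheme and loader MLflow provides.

```python
def production_model_uri(name: str) -> str:
    """Build the registry URI that points at the current Production version."""
    return f"models:/{name}/Production"


def load_production_model(name: str):
    """Load whatever version is currently in Production for this model."""
    # Imported lazily so the URI helper stays usable without MLflow installed.
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(production_model_uri(name))
```

A pod would call `load_production_model("linear-regression-prod")` at startup; promoting or rolling back in the registry then changes what the next restart loads, with no change to the deployment manifest.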
Migration path
If moving away from MLflow Registry: use a custom governance system (database + API) or switch to cloud-native registries (SageMaker Model Registry, Vertex AI Model Registry). DVC governance is harder to replace without losing lineage: migrate to Git-based model storage (model.pkl checked into Git LFS) only for small models (<1 GB). For most teams, MLflow stays; you might layer on additional approval tools (Slack bots, custom dashboards) but the core MLflow registry remains.
Cost model
MLflow open-source is free. Backend store: free on a local filesystem, roughly $15/month for a small managed PostgreSQL database (AWS RDS or Heroku Postgres). S3 artifact store: about $0.023 per GB per month. A tracking server in production: roughly $50-200/month for a small EC2 instance. These are ballpark list prices that vary by region and over time. Hidden cost: keeping the audit log in Git costs nothing extra on GitHub, but DVC-tracked datasets need their own remote storage; only the small .dvc and dvc.lock pointer files belong in Git, and committing large data files directly will slow Git operations.
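As a rough worked example, the figures above combine into a monthly range. The constants are copied from this section and should be treated as placeholders; `monthly_cost` is a hypothetical helper, and the 200 GB artifact volume in the usage note is an assumed input.

```python
POSTGRES_MONTHLY = 15.0          # managed PostgreSQL, ballpark from the text
S3_PER_GB_MONTHLY = 0.023        # artifact storage per GB per month
SERVER_MONTHLY = (50.0, 200.0)   # tracking server range (low, high)


def monthly_cost(artifact_gb: float) -> tuple[float, float]:
    """Return the (low, high) monthly estimate in dollars."""
    storage = artifact_gb * S3_PER_GB_MONTHLY
    low = POSTGRES_MONTHLY + storage + SERVER_MONTHLY[0]
    high = POSTGRES_MONTHLY + storage + SERVER_MONTHLY[1]
    return round(low, 2), round(high, 2)
```

For 200 GB of artifacts, storage is 200 × 0.023 = $4.60, so `monthly_cost(200)` gives a range of about $69.60 to $219.60; the server, not storage, dominates at this scale.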
Common gotcha
Stage reads can look stale right after a promotion. In open-source MLflow the transition call itself returns the updated version, but registering a new version can be asynchronous (a version may sit in `PENDING_REGISTRATION` before it becomes `READY`), and a serving layer that caches the registry may lag behind. If a script queries the stage immediately after a change and acts on the result, poll with exponential backoff instead of relying on a fixed 2-3 second sleep. Also, with a PostgreSQL backend store, MLflow has no `postgresql_driver` setting: the driver is selected by the URI scheme, and the default `postgresql://` scheme needs psycopg2, so run `pip install psycopg2-binary` on the server. Without it the backend store cannot be opened, and registrations and transitions never land.
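A polling helper with exponential backoff might look like the following. It is a sketch under stated assumptions: `wait_for_stage` is a hypothetical name, `get_stage` is any zero-argument callable you supply, and the `sleep` parameter exists only so tests can run without real delays.

```python
import time
from typing import Callable


def wait_for_stage(
    get_stage: Callable[[], str],
    expected: str,
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until get_stage() returns the expected stage, doubling the wait each time."""
    for attempt in range(attempts):
        if get_stage() == expected:
            return True
        # No point sleeping after the final attempt.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

With an MLflow client, `get_stage` could be `lambda: client.get_model_version(name, version).current_stage`. A `False` return should fail the pipeline rather than be ignored.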
Team adoption
Day 1: Pick one model and register it manually using the commands above. Walk through one promotion (new version → Staging → Production) to build intuition.
Day 2: Set up the GitHub Actions workflow and require all new models to go through it: no exceptions, even for 'quick experiments.'
Day 3: Add Slack notifications to the workflow (post approval requests to the #ml-approvals channel). Enforce role-based approvers (the `approvers` list in the config above) so data leads sign off before production.
Incentivize: Show the team how fast rollbacks shorten incidents. After week 1, it feels like overhead; after the first production bug traced to bad data, it feels essential.
Experienced dev note
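Point `MLFLOW_TRACKING_URI` at a shared tracking server backed by a persistent store (PostgreSQL, not the file system) from day one. A file-system backend on a laptop works in tutorials, but when several team members register models this way, each ends up with a separate, non-shareable registry and nothing warns them. Also, resolve production models by stage in your serving code, for example with `client.get_latest_versions("linear-regression-prod", stages=["Production"])` or the `models:/linear-regression-prod/Production` URI; never hardcode version numbers. Note that `search_model_versions` filters on fields such as `name` and `run_id`, and a `stage='Production'` filter is not among the documented ones, so check it against your MLflow version before relying on it.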
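Resolving the stage to a concrete version number is also useful for logging which version a pod actually loaded. A minimal sketch, assuming a client object that exposes `get_latest_versions` the way `MlflowClient` does (newer MLflow releases mark stage-based calls as deprecated, so treat this as version-dependent); `resolve_production_version` is a hypothetical helper name.

```python
from typing import Any


def resolve_production_version(client: Any, name: str) -> str:
    """Return the version number currently in Production, or raise if none."""
    versions = client.get_latest_versions(name, stages=["Production"])
    if not versions:
        # Fail loudly: serving an arbitrary fallback version defeats governance.
        raise LookupError(f"no Production version registered for {name!r}")
    return str(versions[0].version)
```

Logging the returned value at startup gives on-call engineers a direct answer to 'which model is actually running?'.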
Check your understanding
Why does the GitHub Actions workflow commit the AUDIT.log to Git instead of just logging to MLflow's built-in history? What would break if you relied only on MLflow's audit trail without the Git commit?
Answer hint
Git commits are tamper-evident (history can't be rewritten without changing commit hashes) and decentralized (every clone has a copy). MLflow's history is centralized in the backend store: if that store is compromised or data is accidentally deleted, the audit trail vanishes with it. The Git log provides an independent record that is easier to present in compliance audits (SOC 2, HIPAA). MLflow's logs are operational; the Git log serves as the compliance record.