
Model governance process

What you will learn

Define version control, approval workflows, and audit trails for models using MLflow Model Registry and DVC stages to enforce reproducibility and compliance.

Why this matters

Without governance, teams deploy untraceable models, lose reproducibility across environments, fail audits, and can't roll back bad versions. Governance bridges the gap between experiment tracking and production deployment: it's what prevents 'which model is actually running?' questions at 2 AM.

Skip if: you are working on a personal project or an internal POC, where lightweight governance is enough. For production systems, especially in regulated industries (healthcare, finance), proper governance is mandatory: skipping it creates liability and operational chaos. Even startups benefit from basic governance after their first production incident.

Explanation

Model governance is the structured process that ensures every model in production is versioned, approved, traceable, and reproducible. It sits between experiment tracking (MLflow Tracking) and deployment, and uses the MLflow Model Registry as the place where workflow gates such as approval, validation, and audit logging are applied. DVC complements this by versioning the data and pipeline definitions that produced the model, so each model version can be traced back to content-hashed inputs. A governance process answers: Who trained this model? What data was used? Was it approved for production? Can we reproduce it exactly? The process typically flows: experiment → register candidate → validate → approve → deploy → monitor → retire. Without it, you have orphaned models, unclear ownership, and no way to answer 'why did this model behave differently yesterday?'
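
The lifecycle flow above can be read as a small state machine. The following is a minimal sketch in plain Python, not part of MLflow or DVC: the `Stage` enum, the `ALLOWED` table, and the `advance` helper are hypothetical names, and letting any registered model be retired early is an assumption made for illustration.

```python
from enum import Enum


class Stage(Enum):
    # Lifecycle steps named in the text; the enum itself is illustrative.
    EXPERIMENT = "experiment"
    CANDIDATE = "register candidate"
    VALIDATED = "validate"
    APPROVED = "approve"
    DEPLOYED = "deploy"
    MONITORED = "monitor"
    RETIRED = "retire"


# Each stage may only advance to the next one in the documented flow.
ORDER = list(Stage)
ALLOWED = {s: {ORDER[i + 1]} for i, s in enumerate(ORDER[:-1])}
ALLOWED[Stage.RETIRED] = set()

# Assumption: any registered model may also be retired early.
for s in ORDER[1:-1]:
    ALLOWED[s].add(Stage.RETIRED)


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.name} -> {target.name} skips a governance gate")
    return target
```

Encoding the flow as data makes the rule "no stage may be skipped" something a CI job can check, rather than a convention people have to remember.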

Configuration

bash

#!/bin/bash

# 1. Register a model in the MLflow Model Registry
# (the mlflow CLI has no register command, so this calls the Python API)
python -c "import mlflow; mlflow.register_model('s3://my-bucket/models/linear-regression/v1', 'linear-regression-prod')"

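# 2. Move version 1 into the Staging stage
# (stages are fixed: None, Staging, Production, Archived; MLflow 2.9+ deprecates
# them in favour of aliases, so check the release you run)
python -c "from mlflow.tracking import MlflowClient; MlflowClient().transition_model_version_stage('linear-regression-prod', '1', 'Staging')"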

# 3. DVC pipeline with governance gates (dvc.yaml)
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/train.csv
    params:
      - model.learning_rate
      - model.max_depth
    metrics:
      - metrics.json:
          cache: false
    # Stage outputs are declared under outs; artifacts is a separate top-level section
    outs:
      - models/model.pkl
  
  validate:
    cmd: python validate.py --model models/model.pkl --threshold 0.85
    deps:
      - validate.py
      - data/test.csv
      - models/model.pkl
    metrics:
      - validation.json:
          cache: false

# 4. GitHub Actions workflow for approval gate
name: Model Approval Gate
on:
  workflow_dispatch:
    inputs:
      model_version:
        description: Model version to promote
        required: true
        type: string
jobs:
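  approval:
    runs-on: ubuntu-latest
    # Assumption: a GitHub environment named "production" with required reviewers
    # is configured in the repository settings; this is what adds the human approval
    environment: production
    permissions:
      contents: write  # needed for the audit-log commit pushed in the last step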
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install MLflow
        run: pip install mlflow boto3
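      - name: Validate model metrics
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}  # assumed secret name
          MODEL_VERSION: ${{ github.event.inputs.model_version }}
        run: |
          python -c "
          import os
          from mlflow.tracking import MlflowClient
          client = MlflowClient()
          # Check the run behind the version being promoted (it must have been registered from a run)
          mv = client.get_model_version('linear-regression-prod', os.environ['MODEL_VERSION'])
          metrics = client.get_run(mv.run_id).data.metrics
          if metrics.get('accuracy', 0) < 0.85:
            raise ValueError('Model accuracy below threshold')
          print('✓ Model validated')
          "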
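      - name: Promote to Production
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}  # assumed secret name
          MODEL_VERSION: ${{ github.event.inputs.model_version }}
        run: |
          # The MLflow CLI has no stage-transition command, so call the Python client
          python -c "
          import os; from mlflow.tracking import MlflowClient
          MlflowClient().transition_model_version_stage('linear-regression-prod', os.environ['MODEL_VERSION'], 'Production')
          "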
      - name: Log audit entry
        run: |
          echo "Model ${{ github.event.inputs.model_version }} promoted by ${{ github.actor }} at $(date)" >> AUDIT.log
      - name: Commit audit log
        run: |
          git config user.email "governance@example.com"
          git config user.name "Governance Bot"
          git add AUDIT.log
          git commit -m "Audit: Promote model v${{ github.event.inputs.model_version }}"
          git push

# 5. Tracking server with a shared backend store
# MLflow has no .mlflow/config.yaml and no configurable stage list: the store
# locations are server flags, and the registry stages are fixed
# (None, Staging, Production, Archived).
mlflow server \
  --backend-store-uri postgresql://user:password@localhost/mlflow \
  --default-artifact-root s3://my-bucket/mlflow-artifacts \
  --host 0.0.0.0 --port 5000

# Approval policy, enforced outside MLflow (for example with required reviewers
# on a protected GitHub environment):
#   None (experimental models) -> Staging (validated candidates): no approval required
#   Staging -> Production (live models): approval required from
#     data-science-lead, mlops-engineer

Why this order?

Register the model first (this creates the model version that every later step refers to), then move it through the stages in sequence (None → Staging → Production). Validate metrics before promotion so the gate keeps bad models out, and commit the audit log last so it records who promoted what and when, after the promotion has actually succeeded. The GitHub Actions workflow enforces this order programmatically: steps run in sequence, and a failed validation step stops the job before promotion.
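
The validate-then-record ordering described above can be sketched in plain Python. This is an illustration under stated assumptions, not the workflow's actual implementation: `promotion_decision`, `audit_entry`, and `gate` are hypothetical helpers, the JSON-lines log format is an assumption, and the 0.85 threshold is taken from the workflow's accuracy check.

```python
import json
from datetime import datetime, timezone


def promotion_decision(metrics: dict, threshold: float = 0.85) -> bool:
    """Gate: allow promotion only if accuracy was logged and meets the threshold."""
    accuracy = metrics.get("accuracy")
    return accuracy is not None and accuracy >= threshold


def audit_entry(model: str, version: str, actor: str, approved: bool) -> str:
    """Build one JSON line recording who decided what, and when (UTC)."""
    return json.dumps({
        "model": model,
        "version": version,
        "actor": actor,
        "approved": approved,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })


def gate(model, version, actor, metrics, log_path="AUDIT.log"):
    """Validate first, then record the outcome; return whether to promote."""
    approved = promotion_decision(metrics)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(audit_entry(model, version, actor, approved) + "\n")
    return approved
```

Recording rejected promotions as well as approved ones is a design choice: it leaves a trace of attempts that failed the gate, which a free-text log written only on success would miss.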

Wrong vs Right

Wrong way

python

# ❌ No governance: train.py directly uploads to S3 and deploys
import boto3
import joblib

model = train_model(data)
joblib.dump(model, '/tmp/model.pkl')

s3 = boto3.client('s3')
s3.put_object(Bucket='prod-bucket', Key='model.pkl', Body=open('/tmp/model.pkl', 'rb'))

# Kubernetes deployment pod just loads from S3: no versioning, approval, or audit trail
# Result: Nobody knows which training run produced this model. Rolling back is manual and error-prone.

Right way

python

# ✅ Governance: MLflow Registry + DVC versioning + approval workflow
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_experiment("production-models")

with mlflow.start_run():
    model = train_model(data)
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_metrics({"accuracy": 0.92, "f1": 0.89})
    mlflow.log_param("learning_rate", 0.01)
    
    run_id = mlflow.active_run().info.run_id

# register_model creates the registered model on first use and resolves the runs:/ URI
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="linear-regression-prod",
)

client = MlflowClient()

client.transition_model_version_stage(
    name="linear-regression-prod",
    version=model_version.version,
    stage="Staging"
)

# Now human (or automated validation) approves before moving to Production
# DVC tracks data: dvc add data/train.csv → creates data/train.csv.dvc with the file's hash
# GitHub Actions enforces: if metrics < threshold, block promotion
# AUDIT.log records: who, what, when for compliance

Tool vitals

Primary command

bash

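# The mlflow CLI has no registry subcommands; query the registry through the Python client
python -c "from mlflow.tracking import MlflowClient; [print(v.name, v.version, v.current_stage) for v in MlflowClient().get_latest_versions('your-model')]"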

Config file .github/workflows/model-approval.yml or MLProject YAML

Verify

bash

python -c "from mlflow.tracking import MlflowClient; print(MlflowClient().get_model_version('your-model', '1').current_stage)"

Integration notes

MLflow Model Registry is the approval gate; DVC tracks the exact data/code/parameters that produced each model version (create dvc.yaml with train and validate stages, commit dvc.lock to lock versions); GitHub Actions or equivalent CI/CD system enforces the workflow. S3 or similar artifact store holds the model pickle files. In Kubernetes, your model-serving pod reads from MLflow Registry API (not S3 directly) to fetch only Production-stage models, ensuring no stale models are deployed.
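
One way to make the link between a model version and its data explicit is to copy a data fingerprint onto the MLflow run. The sketch below is an assumption-laden illustration: `file_md5` and `lineage_tags` are hypothetical helpers, the tag names are invented, and the hash is recomputed from the file rather than read from DVC's own metadata in `dvc.lock`.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Stream the file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lineage_tags(data_path, git_commit):
    """Tags that tie a run to its data and code; the key names are hypothetical."""
    return {
        "data_path": str(data_path),
        "data_md5": file_md5(data_path),
        "git_commit": git_commit,
    }
```

Inside a `with mlflow.start_run():` block, `mlflow.set_tags(lineage_tags("data/train.csv", commit))` would attach these values to the run, so the registry entry can be traced back to a specific data file and commit.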

Migration path

If moving away from MLflow Registry: use a custom governance system (database + API) or switch to cloud-native registries (SageMaker Model Registry, Vertex AI Model Registry). DVC governance is harder to replace without losing lineage: migrate to Git-based model storage (model.pkl checked into Git LFS) only for small models (<1 GB). For most teams, MLflow stays; you might layer on additional approval tools (Slack bots, custom dashboards) but the core MLflow registry remains.

Cost model

MLflow open source is free. Backend store: free on a local filesystem, roughly $15/month for a small managed PostgreSQL database (AWS RDS or Heroku Postgres). S3 artifact store: about $0.023 per GB/month. Running an MLflow tracking server in production: roughly $50-200/month for a small EC2 instance. These are ballpark figures that vary by region and instance size. Hidden cost: the audit log itself is tiny and costs nothing extra to keep in Git, but committing large datasets or model files to Git instead of tracking them with DVC will slow Git operations.
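
As a rough worked example, the figures above combine as follows. The function is a hypothetical helper, and its defaults are simply the approximate numbers quoted in this section (low end of the server range), not current price quotes.

```python
def monthly_cost(artifact_gb, s3_per_gb=0.023, database=15.0, server=50.0):
    """Rough monthly USD estimate: artifact storage plus database plus server."""
    return artifact_gb * s3_per_gb + database + server


# Example: 100 GB of model artifacts with the default database and server figures
estimate = monthly_cost(100)
print(f"Estimated monthly cost: ${estimate:.2f}")
```

At this scale the fixed costs (database and server) dominate; artifact storage only starts to matter once models and datasets reach the terabyte range.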

Common gotcha

Stage transitions are synchronous: when `transition_model_version_stage` returns, the backend store has been updated, so a stale stage right after a transition usually means the client is talking to a different store than the server (for example, `MLFLOW_TRACKING_URI` is unset and MLflow silently falls back to a local `./mlruns` directory). What can lag is registration: on some backends a new version stays in `PENDING_REGISTRATION` while artifacts are copied, so poll the version's `status` with exponential backoff until it is `READY` before transitioning it. Also, a PostgreSQL backend store needs a driver installed alongside MLflow (`pip install psycopg2-binary`); without it the server fails with an import error rather than silently.
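
Polling with exponential backoff, as suggested above, can be sketched with the standard library only. `wait_until` is a hypothetical helper, and the `check` callable is where a registry lookup would go; the timeout and delay defaults are arbitrary illustrative values.

```python
import time


def wait_until(check, timeout=30.0, initial_delay=0.5, factor=2.0, sleep=time.sleep):
    """Call check() until it returns True, growing the delay between attempts.

    Returns True on success, False if the timeout budget would be exceeded.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        if check():
            return True
        if waited + delay > timeout:
            return False
        sleep(delay)
        waited += delay
        delay *= factor
```

With an `MlflowClient` instance named `client`, a call such as `wait_until(lambda: client.get_model_version(name, version).status == "READY")` would block until the version is usable or the timeout budget runs out. The `sleep` parameter exists so tests can inject a fake and run instantly.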

Team adoption

Day 1: Pick one model and register it manually using the commands above, then walk through one promotion (None → Staging → Production) to build intuition. Day 2: Set up the GitHub Actions workflow and require all new models to go through it: no exceptions, even for 'quick experiments.' Day 3: Add Slack notifications to the workflow (post approval requests to the #ml-approvals channel). Enforce named approvers so data leads sign off before production. Incentivize: show the team that governance saves time on every incident by enabling fast rollbacks. After week 1, it feels like overhead; after the first production bug traced to bad data, it feels essential.

Experienced dev note

Point `MLFLOW_TRACKING_URI` at a shared tracking server backed by a persistent store (PostgreSQL, not the file system) from day one. A file-system backend on a local laptop works in tutorials but breaks down when multiple team members register models: each person ends up with a separate, non-shareable registry, and nothing warns you. Also, resolve the production model by stage in your serving code, for example with `client.get_latest_versions("linear-regression-prod", stages=["Production"])` or the model URI `models:/linear-regression-prod/Production`; never hardcode version numbers. Note that `search_model_versions` filters on fields such as `name` and `run_id`, and a `stage='Production'` filter is not part of its documented syntax.
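
Resolving the production model by stage rather than by a hardcoded version number might look like the sketch below. It relies on `get_latest_versions`, an `MlflowClient` method, and takes the client as a parameter so the logic can be exercised without a tracking server; `production_version` is a hypothetical helper name.

```python
def production_version(client, name):
    """Return the version string currently in the Production stage.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    """
    versions = client.get_latest_versions(name, stages=["Production"])
    if not versions:
        # Fail loudly instead of silently serving a stale or default model.
        raise LookupError(f"no Production version registered for {name!r}")
    return versions[0].version
```

In serving code, the result can feed a model URI such as `f"models:/{name}/{production_version(client, name)}"` passed to `mlflow.pyfunc.load_model`, which also makes the exact version being served easy to log.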

Check your understanding

Why does the GitHub Actions workflow commit the AUDIT.log to Git instead of just logging to MLflow's built-in history? What would break if you relied only on MLflow's audit trail without the Git commit?

Answer hint

Git history is hard to alter unnoticed (changing an entry means rewriting history, which branch protection can block) and it is decentralized (every clone has a copy). MLflow keeps its registry records in the backend store: if that store is compromised or data is accidentally deleted, the record of who promoted what vanishes with it. A Git-committed log gives auditors an independent copy that can support compliance reviews (SOC 2, HIPAA). MLflow's records are operational; the Git log is the evidence trail.
