Tool Beginner easy · 6 min config

Automated evaluation gate

What you will learn

Set up a gate in MLflow that automatically blocks model promotion if metrics fall below a defined threshold.

Why this matters

Without an evaluation gate, you can accidentally promote a degraded model to production because no one manually checked the metrics. An automated gate prevents this by making metric thresholds enforceable and auditable in your CI/CD pipeline.

Skip if: your team manually reviews every metric before promotion, or you're in experimentation-only mode with no production serving. Once you automate model deployment, gates become mandatory.

Explanation

An automated evaluation gate is a policy layer between model training and production deployment. After a model finishes training and evaluation, the gate compares the metrics logged to MLflow against thresholds you define, either in a YAML config or, as in the script in this section, as shell variables. If any metric fails (for example, accuracy drops below 0.92 or recall falls below 0.85), the gate rejects the model and prevents promotion to the registry's 'Production' stage. This works by querying MLflow's REST API or Python client in your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins) and failing the job if thresholds aren't met. The gate runs at the model registration step, after training has finished, so the check itself costs almost no compute: a failed gate simply stops the pipeline before the model reaches production.
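The comparison step described above can be sketched in a few lines of Python. This is a minimal illustration, not MLflow's own API: the `THRESHOLDS` dictionary and the `check_thresholds` helper are hypothetical names, and the values mirror the ones used in the configuration script in this section. In practice the dictionary might be loaded from a YAML file with a parser of your choice.

```python
from typing import Dict, List, Optional

# Hypothetical thresholds (assumption): in a real pipeline these might be
# parsed from eval_gate.yaml instead of being hard-coded.
THRESHOLDS: Dict[str, float] = {"accuracy": 0.92, "precision": 0.88, "recall": 0.85}


def check_thresholds(
    metrics: Dict[str, Optional[float]],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that was never logged is treated as a failure,
            # not as a silent pass.
            failures.append(f"{name} was not logged")
        elif value < minimum:
            failures.append(f"{name} {value} < {minimum}")
    return failures


# Example metrics shaped like the dict returned by run.data.metrics.
logged = {"accuracy": 0.94, "precision": 0.90, "recall": 0.81}
for problem in check_thresholds(logged, THRESHOLDS):
    print(f"GATE FAILED: {problem}")
```

In a CI job you would typically call `sys.exit(1)` when the returned list is non-empty, since the non-zero exit code is what actually fails the pipeline step.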

Configuration

bash

#!/bin/bash
# eval_gate.sh: evaluation gate script run in CI/CD pipeline
# This script queries the latest run, checks metrics, and promotes or rejects

set -e

MLFLOW_TRACKING_URI="http://mlflow-server:5000"
EXPERIMENT_NAME="customer-churn"
MODEL_NAME="churn-classifier"
MIN_ACCURACY=0.92
MIN_PRECISION=0.88
MIN_RECALL=0.85

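# Query the highest-accuracy run in the experiment
export MLFLOW_TRACKING_URI EXPERIMENT_NAME  # make the settings above visible to Python
BEST_RUN=$(python3 << 'EOF'
import json
import os
import sys
from mlflow.tracking import MlflowClient

# Read the settings from the environment instead of duplicating them here
client = MlflowClient(tracking_uri=os.environ["MLFLOW_TRACKING_URI"])
experiment = client.get_experiment_by_name(os.environ["EXPERIMENT_NAME"])
if experiment is None:
    print("Experiment not found", file=sys.stderr)
    sys.exit(1)

runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"],
    max_results=1
)

if not runs:
    # Report on stderr so the message is not captured into BEST_RUN
    print("No runs found", file=sys.stderr)
    sys.exit(1)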

run = runs[0]
metrics = run.data.metrics
run_id = run.info.run_id

print(json.dumps({
    "run_id": run_id,
    "accuracy": metrics.get("accuracy"),
    "precision": metrics.get("precision"),
    "recall": metrics.get("recall")
}))
EOF
)

echo "Best run metrics: $BEST_RUN"

# Parse and validate metrics
ACCURACY=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['accuracy'] or 0)")
PRECISION=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['precision'] or 0)")
RECALL=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['recall'] or 0)")
RUN_ID=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['run_id'])")

echo "Gate thresholds: accuracy>=$MIN_ACCURACY, precision>=$MIN_PRECISION, recall>=$MIN_RECALL"
echo "Measured: accuracy=$ACCURACY, precision=$PRECISION, recall=$RECALL"

# Enforce thresholds
if (( $(echo "$ACCURACY < $MIN_ACCURACY" | bc -l) )); then
  echo "❌ GATE FAILED: accuracy $ACCURACY < $MIN_ACCURACY"
  exit 1
fi

if (( $(echo "$PRECISION < $MIN_PRECISION" | bc -l) )); then
  echo "❌ GATE FAILED: precision $PRECISION < $MIN_PRECISION"
  exit 1
fi

if (( $(echo "$RECALL < $MIN_RECALL" | bc -l) )); then
  echo "❌ GATE FAILED: recall $RECALL < $MIN_RECALL"
  exit 1
fi

echo "✅ GATE PASSED: All metrics above thresholds"

# Promote to Production stage
python3 << PROMOTE_EOF
import sys
from mlflow.tracking import MlflowClient

# This heredoc is unquoted, so the shell substitutes its variables
# before Python sees the code.
client = MlflowClient(tracking_uri="$MLFLOW_TRACKING_URI")

try:
    # Create a model version from the run (the registered model must already exist)
    model_version = client.create_model_version(
        name="$MODEL_NAME",
        source="runs:/$RUN_ID/model",
        run_id="$RUN_ID"
    )

    # Transition to Production
    # Note: stage transitions are deprecated in recent MLflow releases in favor of aliases
    client.transition_model_version_stage(
        name="$MODEL_NAME",
        version=model_version.version,
        stage="Production"
    )

    print(f"✅ Model $MODEL_NAME version {model_version.version} promoted to Production")
except Exception as e:
    print(f"❌ Promotion failed: {e}", file=sys.stderr)
    sys.exit(1)
PROMOTE_EOF

Why this order?

The script queries MLflow first (establish the current state), then validates metrics in order from strictest to loosest threshold (fail fast on the most critical check), then promotes only if all gates pass. If you promote before validating, you lose the safety net.

Wrong vs Right

Wrong way

bash

#!/bin/bash
# ❌ WRONG: No thresholds, promotes blindly
RUN_ID="abc123"
MODEL_NAME="churn-classifier"

python3 << 'EOF'
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_version = client.create_model_version(
    name="churn-classifier",
    source="runs:/abc123/model"
)
client.transition_model_version_stage(
    name="churn-classifier",
    version=model_version.version,
    stage="Production"
)
EOF
# Metrics are never checked. Bad model goes to production silently.

Right way

python

# ✅ CORRECT: Query metrics, validate, then promote
import sys

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Query the run and its metrics
run = client.get_run(run_id="abc123")
accuracy = run.data.metrics.get("accuracy", 0)

# Enforce gate: a non-zero exit code fails the CI job
if accuracy < 0.92:
    print(f"Gate failed: accuracy {accuracy} < 0.92")
    sys.exit(1)

# Only promote if gate passes
model_version = client.create_model_version(
    name="churn-classifier",
    source="runs:/abc123/model"
)
client.transition_model_version_stage(
    name="churn-classifier",
    version=model_version.version,
    stage="Production"
)

Tool vitals

Primary command

bash

mlflow runs list --experiment-id <EXPERIMENT_ID>
# In Python: MlflowClient.search_runs(..., order_by=["metrics.accuracy DESC"], max_results=1)

Config file eval_gate.yaml or .github/workflows/eval_gate.yml

Verify

bash

python -c "from mlflow.tracking import MlflowClient; client = MlflowClient(); run = client.get_run('run_id'); print(run.data.metrics)"

Integration notes

This gate integrates with your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins) and MLflow Model Registry. The gate runs after training completes, queries MLflow's tracking server, and gates the transition to the Production stage in the registry. DVC can version the training data and model artifacts, but MLflow owns the metric thresholds and promotion logic.

Migration path

If you outgrow simple numeric thresholds, migrate to Kubeflow Pipelines or BentoML's governance system, which support complex model validation rules, slicing metrics by cohort, and multi-armed bandit promotions. For now, a gate at the Python/bash level is sufficient and cheaper.

Common gotcha

MLflow's REST API and Python client can lag behind live runs: metrics may still be writing to the backend when your gate queries them. Add a 5-10 second sleep after your training job completes and before the gate runs, or, more robustly, poll the run until every expected metric is present. Also, the run ID must be exact. A stale or leaked environment variable can point the gate at a different, valid run, so the gate passes on that run's metrics while a bad model gets promoted.
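One way to avoid depending on a fixed sleep is to poll until the expected metrics show up. The sketch below is illustrative only: `wait_for_metrics` is a hypothetical helper, and `fetch_metrics` stands in for any callable that returns the current metrics dictionary, so the logic can be tested without a tracking server.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_metrics(
    fetch_metrics: Callable[[], Dict[str, float]],
    expected: Iterable[str],
    timeout_s: float = 30.0,
    interval_s: float = 2.0,
) -> Dict[str, float]:
    """Poll fetch_metrics() until every expected metric name is present.

    Raises TimeoutError if metrics are still missing after timeout_s seconds.
    """
    expected_names = set(expected)
    deadline = time.monotonic() + timeout_s
    while True:
        metrics = fetch_metrics()
        missing = expected_names - set(metrics)
        if not missing:
            return metrics
        if time.monotonic() >= deadline:
            # Fail loudly rather than letting the gate judge partial metrics.
            raise TimeoutError(f"metrics still missing: {sorted(missing)}")
        time.sleep(interval_s)
```

With the MLflow client, `fetch_metrics` could be something like `lambda: client.get_run(run_id).data.metrics`, where `client` and `run_id` come from your own gate script.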

Team adoption

Keep a shared eval_gate.sh template in your repo and call it from your .github/workflows files or .gitlab-ci.yml. Document the threshold values for each model in a README, and require one human approval before the first promotion. After 3-5 successful promotions, the team will trust the gate and can make it fully automated.

Experienced dev note

Use MLflow's `search_runs()` with `order_by` and `max_results=1` rather than hiding run selection inside a custom helper such as `get_best_run()`, which is less transparent. Also, always set `tracking_uri` explicitly in CI/CD. Never rely on the MLFLOW_TRACKING_URI environment variable alone: if it's unset, the client silently falls back to a local ./mlruns directory, and the gate ends up querying an empty local store instead of your tracking server.
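A small guard can make the missing-URI case fail loudly instead of falling back to a local store. This is a sketch under assumptions: `require_tracking_uri` is a hypothetical helper, not part of MLflow.

```python
import os
from typing import Mapping, Optional


def require_tracking_uri(env: Optional[Mapping[str, str]] = None) -> str:
    """Return MLFLOW_TRACKING_URI, or raise if it is unset or blank.

    The env parameter exists only to make the function easy to test;
    by default the real process environment is used.
    """
    source = os.environ if env is None else env
    uri = source.get("MLFLOW_TRACKING_URI", "").strip()
    if not uri:
        raise RuntimeError(
            "MLFLOW_TRACKING_URI is not set; refusing to fall back to ./mlruns"
        )
    return uri
```

The gate script could then build its client with `MlflowClient(tracking_uri=require_tracking_uri())`, so a misconfigured CI job stops before any run is queried.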

Check your understanding

Why does querying the run immediately after training completes sometimes return incomplete metrics, and what's the safest fix?

Show answer hint

Metrics can be written asynchronously to MLflow's backend, so a query issued right after training may see only some of them. The safest fix is to poll the run until all expected metrics appear. A small sleep (5-10 seconds) after training completes and before the gate queries is a simpler but less reliable fallback.
