Automated evaluation gate
Why this matters
Without an evaluation gate, you can accidentally promote a degraded model to production because no one manually checked the metrics. An automated gate prevents this by making metric thresholds enforceable and auditable in your CI/CD pipeline.
Explanation
An automated evaluation gate is a policy layer between model training and production deployment. After a model finishes training and evaluation, a gate script compares the metrics logged to MLflow against thresholds you define, for example in a YAML config or as variables in the script. If any metric fails (say, accuracy drops below 0.92 or recall falls below 0.85), the gate rejects the model and prevents promotion to the registry's 'Production' stage. The script queries MLflow through its REST API or Python client from your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins) and fails the job if thresholds aren't met. The gate runs at the model registration step, after training has finished, so it adds no training compute; a failed gate simply stops the process before the model reaches production.
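The threshold comparison at the heart of the gate can be sketched in a few lines of Python. This is a minimal sketch, not part of MLflow's API: `check_thresholds` and the `THRESHOLDS` mapping are hypothetical names, and the values mirror the ones used in the Configuration script. It assumes a metric that was never logged should count as a failure.

```python
from typing import Dict, List, Optional

# Hypothetical thresholds; in practice these could be loaded from a YAML file.
THRESHOLDS: Dict[str, float] = {"accuracy": 0.92, "precision": 0.88, "recall": 0.85}


def check_thresholds(
    metrics: Dict[str, Optional[float]], thresholds: Dict[str, float]
) -> List[str]:
    """Return one message per failed metric; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # Assumption: a metric that was never logged fails the gate.
            failures.append(f"{name} was not logged")
        elif value < minimum:
            failures.append(f"{name} {value} < {minimum}")
    return failures


if __name__ == "__main__":
    logged = {"accuracy": 0.93, "precision": 0.90, "recall": 0.80}
    for message in check_thresholds(logged, THRESHOLDS):
        print("GATE FAILED:", message)
```

Returning every failure, rather than stopping at the first, gives the CI log a complete picture of why a model was rejected.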
Configuration
#!/bin/bash
# eval_gate.sh: evaluation gate script run in CI/CD pipeline
# This script selects the highest-accuracy run, checks its metrics, and promotes or rejects it
set -e
MLFLOW_TRACKING_URI="http://mlflow-server:5000"
EXPERIMENT_NAME="customer-churn"
MODEL_NAME="churn-classifier"
MIN_ACCURACY=0.92
MIN_PRECISION=0.88
MIN_RECALL=0.85
# Select the run with the highest accuracy in the experiment
BEST_RUN=$(python3 << 'EOF'
import json
import sys
from mlflow.tracking import MlflowClient
# The quoted heredoc ('EOF') disables shell expansion, so values are hard-coded here
client = MlflowClient(tracking_uri="http://mlflow-server:5000")
experiment = client.get_experiment_by_name("customer-churn")
if experiment is None:
    # Errors go to stderr so they are not swallowed by the command substitution
    print("Experiment not found", file=sys.stderr)
    sys.exit(1)
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"],
    max_results=1
)
if not runs:
    print("No runs found", file=sys.stderr)
    sys.exit(1)
run = runs[0]
metrics = run.data.metrics
run_id = run.info.run_id
print(json.dumps({
    "run_id": run_id,
    "accuracy": metrics.get("accuracy"),
    "precision": metrics.get("precision"),
    "recall": metrics.get("recall")
}))
EOF
)
echo "Selected run metrics: $BEST_RUN"
# Parse and validate metrics (a missing metric becomes 0, so the gate fails)
ACCURACY=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['accuracy'] or 0)")
PRECISION=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['precision'] or 0)")
RECALL=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['recall'] or 0)")
RUN_ID=$(echo "$BEST_RUN" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d['run_id'])")
echo "Gate thresholds: accuracy>=$MIN_ACCURACY, precision>=$MIN_PRECISION, recall>=$MIN_RECALL"
echo "Measured: accuracy=$ACCURACY, precision=$PRECISION, recall=$RECALL"
# Enforce thresholds
if (( $(echo "$ACCURACY < $MIN_ACCURACY" | bc -l) )); then
    echo "❌ GATE FAILED: accuracy $ACCURACY < $MIN_ACCURACY"
    exit 1
fi
if (( $(echo "$PRECISION < $MIN_PRECISION" | bc -l) )); then
    echo "❌ GATE FAILED: precision $PRECISION < $MIN_PRECISION"
    exit 1
fi
if (( $(echo "$RECALL < $MIN_RECALL" | bc -l) )); then
    echo "❌ GATE FAILED: recall $RECALL < $MIN_RECALL"
    exit 1
fi
echo "✅ GATE PASSED: All metrics above thresholds"
# Promote to Production stage
python3 << PROMOTE_EOF
import sys
from mlflow.tracking import MlflowClient
# Shell variables below are expanded by the unquoted heredoc before Python runs
client = MlflowClient(tracking_uri="$MLFLOW_TRACKING_URI")
try:
    # Create model version from run
    model_version = client.create_model_version(
        name="$MODEL_NAME",
        source="runs:/$RUN_ID/model",
        run_id="$RUN_ID"
    )
    # Transition to Production
    # Note: recent MLflow releases deprecate stages in favor of aliases
    client.transition_model_version_stage(
        name="$MODEL_NAME",
        version=model_version.version,
        stage="Production"
    )
    print(f"✅ Model $MODEL_NAME version {model_version.version} promoted to Production")
except Exception as e:
    print(f"❌ Promotion failed: {e}")
    sys.exit(1)
PROMOTE_EOF
Why this order?
The script queries MLflow first (establish the current state), then validates metrics in order from strictest to loosest threshold (fail fast on the most critical check), then promotes only if all gates pass. If you promote before validating, you lose the safety net.
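The ordering can be made explicit by passing the three steps into one function. This is a sketch under assumptions: `run_gate`, `fetch_metrics`, `validate`, and `promote` are hypothetical names, and in a real pipeline the callables would wrap the MLflow client calls used in the script.

```python
from typing import Callable, Dict, List


def run_gate(
    fetch_metrics: Callable[[], Dict[str, float]],
    validate: Callable[[Dict[str, float]], List[str]],
    promote: Callable[[], None],
) -> bool:
    """Query, then validate, then promote. Promotion is unreachable on failure."""
    metrics = fetch_metrics()      # 1. establish the current state
    failures = validate(metrics)   # 2. check the thresholds
    if failures:
        for message in failures:
            print("GATE FAILED:", message)
        return False
    promote()                      # 3. runs only when validation found no failures
    return True
```

Because `promote` is called only after `validate` returns an empty list, there is no code path that registers a model whose metrics were never checked.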
Wrong vs Right
#!/bin/bash
# ❌ WRONG: No thresholds, promotes blindly
RUN_ID="abc123"
MODEL_NAME="churn-classifier"
python3 << 'EOF'
from mlflow.tracking import MlflowClient
client = MlflowClient()
model_version = client.create_model_version(
    name="churn-classifier",
    source="runs:/abc123/model"
)
client.transition_model_version_stage(
    name="churn-classifier",
    version=model_version.version,
    stage="Production"
)
EOF
# Metrics are never checked. Bad model goes to production silently.
# ✅ CORRECT: Query metrics, validate, then promote
import sys
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Query the run and its metrics
run = client.get_run(run_id="abc123")
accuracy = run.data.metrics.get("accuracy", 0)
# Enforce gate
if accuracy < 0.92:
    print(f"Gate failed: accuracy {accuracy} < 0.92")
    sys.exit(1)
# Only promote if gate passes
model_version = client.create_model_version(
    name="churn-classifier",
    source="runs:/abc123/model",
    run_id="abc123"
)
client.transition_model_version_stage(
    name="churn-classifier",
    version=model_version.version,
    stage="Production"
)
Tool vitals
MLflow Python client: MlflowClient.search_runs() to select the run, MlflowClient.get_run() to read its metrics
eval_gate.yaml or .github/workflows/eval_gate.yml
python -c "from mlflow.tracking import MlflowClient; client = MlflowClient(); run = client.get_run('run_id'); print(run.data.metrics)"
Integration notes
This gate sits between your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins) and the MLflow Model Registry. It runs after training completes, queries MLflow's tracking server, and controls the transition to the Production stage in the registry. DVC can version the training data and model artifacts; MLflow stores the metrics and registry state, while the gate script owns the thresholds and promotion logic.
Migration path
If you outgrow simple numeric thresholds, migrate to Kubeflow Pipelines or BentoML's governance system, which support complex model validation rules, slicing metrics by cohort, and multi-armed bandit promotions. For now, a gate at the Python/bash level is sufficient and cheaper.
Common gotcha
The gate can query a run before all of its metrics have reached the backend, for example when the training job is still flushing its final logs. Add a 5-10 second sleep after the training job completes and before the gate runs, or better, poll the run until every expected metric is present. Also double-check where the run ID comes from: a mistyped ID typically fails with an error, but a stale or leaked environment variable that holds another valid run ID will query the wrong run silently and can pass a bad model.
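As an alternative to a fixed sleep, the gate can poll until the metrics it needs are present. The sketch below is deliberately independent of MLflow: `wait_for_metrics` is a hypothetical helper, and `fetch_metrics` stands for any callable that returns the run's current metrics as a dict. The timeout and interval defaults are illustrative assumptions.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_metrics(
    fetch_metrics: Callable[[], Dict[str, float]],
    expected: Iterable[str],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
) -> Dict[str, float]:
    """Poll until every expected metric is present, or raise TimeoutError."""
    expected_names = set(expected)
    deadline = time.monotonic() + timeout_s
    while True:
        metrics = fetch_metrics()
        missing = expected_names - set(metrics)
        if not missing:
            return metrics
        if time.monotonic() >= deadline:
            raise TimeoutError(f"metrics still missing: {sorted(missing)}")
        time.sleep(interval_s)
```

With the MLflow client, the callable could be `lambda: client.get_run(run_id).data.metrics`, so the gate proceeds as soon as accuracy, precision, and recall are all visible and fails loudly if they never appear.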
Team adoption
Keep a shared eval_gate.sh template in your repo and call it from your .github/workflows job or .gitlab-ci.yml, document the threshold values for each model in a README, and require one human approval before the first promotion. After 3-5 successful promotions, the team will usually trust the gate enough to make it fully automated.
Experienced dev note
Call `search_runs()` with `order_by` and `max_results=1` directly rather than hiding run selection behind a `get_best_run()`-style helper, which is less transparent about how the run was chosen. Also, always set `tracking_uri` explicitly in CI/CD: never rely on the MLFLOW_TRACKING_URI environment variable alone, because if it's unset, the client silently falls back to a local ./mlruns directory and the gate evaluates an empty local store instead of your tracking server.
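One way to guard against the silent fallback is to resolve the tracking URI up front and abort when it is missing. This is a sketch, not an MLflow feature: `resolve_tracking_uri` is a hypothetical helper, and the check that the URI starts with `http://` or `https://` assumes your CI always talks to a remote tracking server.

```python
import os
from typing import Mapping, Optional


def resolve_tracking_uri(
    explicit: Optional[str] = None, env: Mapping[str, str] = os.environ
) -> str:
    """Return the tracking URI to use, refusing to fall back to a local store."""
    uri = explicit or env.get("MLFLOW_TRACKING_URI", "")
    if not uri.strip():
        raise RuntimeError("No MLflow tracking URI configured; refusing to run the gate")
    # Assumption: the gate must query a remote server, never a local directory.
    if not uri.startswith(("http://", "https://")):
        raise RuntimeError(f"Expected a remote tracking server, got {uri!r}")
    return uri
```

The result can then be passed straight to the client, as in `MlflowClient(tracking_uri=resolve_tracking_uri())`, so a misconfigured job fails before any run is queried.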
Check your understanding
Why does querying the run immediately after training completes sometimes return incomplete metrics, and what's the safest fix?
Answer hint
Metrics may still be in flight to MLflow's backend when the gate queries the run. A small sleep (5-10 seconds) after training completes usually helps, but the safest fix is to poll the run until all expected metrics appear.