Retraining triggers
Why this matters
Models degrade in production when data shifts. Without retraining triggers, you deploy once and hope it stays accurate. Triggers let you catch drift early, retrain automatically, and keep models relevant. Without them, you're manually monitoring dashboards and guessing when to retrain.
Explanation
A retraining trigger is a decision rule that says 'now is the time to retrain.' It monitors one or more signals (data drift, model performance drop, time elapsed, or new data volume) and fires when a threshold is crossed. The trigger then kicks off a pipeline: fetch new data, retrain, validate, and push to the registry.
In an MLflow + DVC setup, triggers live in your orchestration layer (Airflow, GitHub Actions, cron). DVC tracks which data version went into which model. MLflow logs the retrain run. Together, they let you trace back: 'why did this model retrain on March 15th?' and 'what data did it use?' The trigger itself is usually a simple conditional check in your orchestrator that examines metrics (like accuracy drop) or signals (like 'new data arrived'). Define thresholds in config files or environment variables so non-engineers can adjust sensitivity without touching code.
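The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not part of MLflow or DVC: the Signals fields, the RETRAIN_* environment variable names, and the default values are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of the signals a trigger might watch."""
    drift_score: float           # e.g. a distribution-distance statistic
    accuracy: float              # latest evaluation accuracy
    days_since_last_train: int
    new_rows: int                # rows collected since the last training run


def load_thresholds(env=os.environ):
    """Read thresholds from environment variables, with illustrative defaults."""
    return {
        "max_drift": float(env.get("RETRAIN_MAX_DRIFT", "0.2")),
        "min_accuracy": float(env.get("RETRAIN_MIN_ACCURACY", "0.85")),
        "max_age_days": int(env.get("RETRAIN_MAX_AGE_DAYS", "30")),
        "min_new_rows": int(env.get("RETRAIN_MIN_NEW_ROWS", "10000")),
    }


def retrain_reasons(signals, thresholds):
    """Return the names of crossed thresholds; an empty list means 'do not retrain'."""
    reasons = []
    if signals.drift_score > thresholds["max_drift"]:
        reasons.append("data_drift")
    if signals.accuracy < thresholds["min_accuracy"]:
        reasons.append("performance_drop")
    if signals.days_since_last_train > thresholds["max_age_days"]:
        reasons.append("time_elapsed")
    if signals.new_rows >= thresholds["min_new_rows"]:
        reasons.append("new_data_volume")
    return reasons


if __name__ == "__main__":
    current = Signals(drift_score=0.05, accuracy=0.81, days_since_last_train=12, new_rows=2500)
    print(retrain_reasons(current, load_thresholds()))
```

Returning the list of reasons rather than a bare boolean lets the orchestrator record why a retrain fired, which is the information you want later when tracing a model back to its trigger.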
Configuration
name: Retrain on Data Drift
on:
  schedule:
    - cron: '0 2 * * 1' # Every Monday at 2 AM UTC
  workflow_dispatch: # Manual trigger
jobs:
  check-drift:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install mlflow dvc pandas scikit-learn
      - name: Check for data drift
        id: drift_check
        run: |
          python3 << 'EOF'
          import json
          import os
          metrics_file = 'metrics.json'
          try:
              with open(metrics_file, 'r') as f:
                  metrics = json.load(f)
          except FileNotFoundError:
              # No metrics yet: assume perfect accuracy so no retrain fires
              metrics = {'accuracy': 1.0}
          threshold = 0.85
          accuracy = metrics.get('accuracy', 1.0)
          drift = 'true' if accuracy < threshold else 'false'
          # Step outputs must be written to the file named by $GITHUB_OUTPUT;
          # printing to stdout does not set them
          with open(os.environ['GITHUB_OUTPUT'], 'a') as out:
              out.write(f'DATA_DRIFT_DETECTED={drift}\n')
              out.write(f'ACCURACY_BEFORE={accuracy}\n')
          EOF
      - name: Trigger retraining pipeline
        if: steps.drift_check.outputs.DATA_DRIFT_DETECTED == 'true'
        run: |
          dvc repro
      - name: Push updated model to DVC remote
        if: steps.drift_check.outputs.DATA_DRIFT_DETECTED == 'true'
        run: |
          dvc push
      - name: Log retrain event to MLflow
        if: steps.drift_check.outputs.DATA_DRIFT_DETECTED == 'true'
        env:
          # Each step runs in a fresh Python process, so pass the value through a step output
          ACCURACY_BEFORE: ${{ steps.drift_check.outputs.ACCURACY_BEFORE }}
        run: |
          python3 << 'EOF'
          import os
          import mlflow
          # Point this at your tracking server; localhost only works if one runs on the runner
          mlflow.set_tracking_uri('http://localhost:5000')
          with mlflow.start_run(run_name='retrain_trigger_drift'):
              mlflow.log_param('trigger_type', 'data_drift')
              mlflow.log_param('threshold', 0.85)
              mlflow.log_metric('accuracy_before', float(os.environ['ACCURACY_BEFORE']))
          EOF
Why this order?
Schedule comes first (defines when to run). The check-drift step must complete before the conditional trigger. Outputs from drift_check must be available to downstream conditional steps (if: conditions). The push to DVC and MLflow logging happen last so you only log successful retrains.
Wrong vs Right
name: Bad Retraining
on:
  push:
    branches:
      - main
jobs:
  retrain:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - run: dvc repro
      - run: dvc push
This triggers on every commit, even if data hasn't changed. It wastes compute, floods MLflow with redundant runs, and can create merge conflicts in dvc.lock. The correct approach: check a metric or drift signal first, and trigger only if the threshold is crossed. Use schedule + conditional, or a webhook listening for data arrival events.
Tool vitals
dvc dag (visualize pipeline), airflow dags trigger (manual; airflow trigger_dag in Airflow 1.x), or GitHub Actions workflow_run event (automatic)
dvc.yaml (pipeline definition), .github/workflows/*.yml (CI/CD triggers), or airflow DAG Python file
dvc dag && dvc repro --dry
Integration notes
DVC records which data version each retraining run used in dvc.lock (committed to Git), and dvc push uploads the matching data and model files to the remote. MLflow records the retrain as a new run labeled with its trigger (the workflow above logs trigger_type = data_drift as a parameter; a tag works equally well). Together: you can query MLflow ('show me all retrain runs triggered by drift in the last 30 days') and trace them back to data versions in DVC. This is your audit trail.
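The 'last 30 days' query could look roughly like the sketch below. It assumes the retrain run logs trigger_type as a parameter with the value data_drift, as the workflow above does; if you log it as a tag instead, the filter would start with tags.trigger_type. The function names and the experiment name passed in are hypothetical.

```python
import time


def drift_retrain_filter(days=30, now_ms=None):
    """Build an MLflow search filter for drift-triggered runs started in the last `days` days."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # MLflow stores run start_time as milliseconds since the Unix epoch.
    since_ms = now_ms - days * 24 * 60 * 60 * 1000
    return f"params.trigger_type = 'data_drift' and attributes.start_time >= {since_ms}"


def find_drift_retrains(experiment_name, days=30):
    """Return matching runs as a pandas DataFrame.

    Requires mlflow and a reachable tracking server (set via MLFLOW_TRACKING_URI
    or mlflow.set_tracking_uri).
    """
    import mlflow  # imported lazily so the filter helper works without mlflow installed

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=drift_retrain_filter(days),
        order_by=["attributes.start_time DESC"],
    )
```

Each returned row carries the run ID and start time; pairing that with the Git commit that holds the matching dvc.lock gives you the data version behind the retrain.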
Migration path
If you outgrow cron + GitHub Actions, move to Apache Airflow (DAGs with rich scheduling) or Kubernetes CronJobs (for containerized workflows). The trigger logic stays the same; only the orchestrator changes. DVC and MLflow don't care how the trigger fires: they only log the result.
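One way to keep the trigger logic portable is to put it in a small script whose exit code any orchestrator can act on. The sketch below is an assumption about how you might package it, not a DVC or MLflow feature; the function names and the --metrics-file and --threshold flags are hypothetical.

```python
import argparse
import json


def needs_retrain(metrics_path, threshold):
    """Return True when the recorded accuracy is below the threshold."""
    try:
        with open(metrics_path) as f:
            accuracy = json.load(f).get("accuracy", 1.0)
    except FileNotFoundError:
        accuracy = 1.0  # same fallback as the workflow above: no metrics, no retrain
    return accuracy < threshold


def main(argv=None):
    """Parse arguments and return 0 if a retrain is needed, 1 otherwise."""
    parser = argparse.ArgumentParser(description="Exit 0 if a retrain is needed, 1 otherwise.")
    parser.add_argument("--metrics-file", default="metrics.json")
    parser.add_argument("--threshold", type=float, default=0.85)
    args = parser.parse_args(argv)
    return 0 if needs_retrain(args.metrics_file, args.threshold) else 1
```

Saved as a hypothetical check_retrain.py that ends with raise SystemExit(main()), the same check can gate dvc repro from a cron entry, an Airflow task, or a Kubernetes CronJob container, since each of them can branch on a process exit code.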
Common gotcha
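Values printed by a GitHub Actions step don't automatically become step outputs. You must write them as name=value lines to the file named by $GITHUB_OUTPUT, give the step an id, and reference them in later steps through that id, for example steps.drift_check.outputs.DATA_DRIFT_DETECTED.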
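When the check is written in Python, a small helper can handle the write. This is a minimal sketch: set_step_output is a hypothetical name, and it covers single-line values only (multi-line values need GitHub's delimiter syntax).

```python
import os


def set_step_output(name, value, output_path=None):
    """Append `name=value` to the file GitHub Actions reads step outputs from."""
    # GitHub Actions sets GITHUB_OUTPUT to a per-step file path; outside Actions it is unset.
    path = output_path or os.environ.get("GITHUB_OUTPUT")
    if path is None:
        print(f"{name}={value}")  # local fallback for debugging; does NOT set an output
        return
    with open(path, "a", encoding="utf-8") as out:
        out.write(f"{name}={value}\n")
```

Inside a workflow step you would call set_step_output("DATA_DRIFT_DETECTED", "true"); the output is only visible to later steps if the writing step declares an id.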
Team adoption
Ship a template workflow in your repo under .github/workflows/retrain.yml. Add a metrics.json file with baseline values to version control. Document the threshold in a README: 'Retrains when accuracy drops below 0.85.' Use branch protection to prevent direct pushes to main: all retrains go through the workflow. Add a #ml-ops Slack notification in the workflow so the team sees when retrains fire. This builds confidence that the system is working.
Experienced dev note
Most teams hardcode thresholds in Python (accuracy < 0.85). The single biggest mistake: they forget to update the threshold when the baseline model improves. Use a reference run in MLflow (tagged 'baseline') and calculate the threshold dynamically: 'retrain if current accuracy drops >5% from baseline.' This survives model improvements without code changes.
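A minimal sketch of the dynamic threshold, under two stated assumptions: '5%' is read as a relative drop (baseline 0.90 gives a threshold of about 0.855), and the baseline run carries a hypothetical tag baseline = 'true' with an accuracy metric. The function names are illustrative.

```python
def dynamic_threshold(baseline_accuracy, max_relative_drop=0.05):
    """Lowest acceptable accuracy, expressed relative to the baseline."""
    return baseline_accuracy * (1.0 - max_relative_drop)


def should_retrain(current_accuracy, baseline_accuracy, max_relative_drop=0.05):
    """True when current accuracy has fallen more than the allowed fraction below baseline."""
    return current_accuracy < dynamic_threshold(baseline_accuracy, max_relative_drop)


def fetch_baseline_accuracy(experiment_name, metric="accuracy"):
    """Look up the most recent run tagged as baseline (tag key and value are assumptions).

    Requires mlflow and a reachable tracking server.
    """
    import mlflow  # imported lazily so the pure functions above work without mlflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string="tags.baseline = 'true'",
        order_by=["attributes.start_time DESC"],
        max_results=1,
    )
    if runs.empty:
        raise LookupError(f"no run tagged baseline in experiment {experiment_name!r}")
    return float(runs.iloc[0][f"metrics.{metric}"])
```

When a better model is promoted, you move the baseline tag to its run and the threshold follows automatically; only max_relative_drop remains as a tunable setting.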
Check your understanding
Why is the 'Check for data drift' step mandatory even if you want to retrain every Monday? What breaks if you skip it and run dvc repro unconditionally on schedule?
Answer hint
Without the check, you retrain even when data hasn't changed. This creates identical models (wasting compute), pollutes your run history, and makes it impossible to correlate retrains with actual data changes. The check lets you retrain only when needed and leaves a clean audit trail.