Retraining triggers

What you will learn

Define when and how your ML model automatically retrains based on data drift, performance degradation, or schedule.

Why this matters

Models degrade in production when data shifts. Without retraining triggers, you deploy once and hope it stays accurate. Triggers let you catch drift early, retrain automatically, and keep models relevant. Without them, you're manually monitoring dashboards and guessing when to retrain.

Skip if: Your model runs on static, unchanging data (rare), or retraining takes weeks and drift is slow. In those cases, manual retraining on a quarterly schedule may be acceptable. For any production system with new data flowing in daily, triggers are non-negotiable.

Explanation

A retraining trigger is a decision rule that says 'now is the time to retrain.' It monitors one or more signals (data drift, model performance drop, time elapsed, or new data volume) and fires when a threshold is crossed. The trigger then kicks off a pipeline: fetch new data, retrain, validate, and push to the registry.

In an MLflow + DVC setup, triggers live in your orchestration layer (Airflow, GitHub Actions, cron). DVC tracks which data version went into which model. MLflow logs the retrain run. Together, they let you trace back: 'why did this model retrain on March 15th?' and 'what data did it use?'

The trigger itself is usually a simple conditional check in your orchestrator that examines metrics (like an accuracy drop) or signals (like 'new data arrived'). Define thresholds in config files or environment variables so non-engineers can adjust sensitivity without touching code.
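The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Signals` fields, the `RETRAIN_*` environment variable names, and the default values are all assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Signals:
    """Hypothetical snapshot of the signals a trigger might watch."""
    accuracy: float
    drift_score: float
    days_since_last_train: int
    new_rows: int


def load_thresholds(env: Mapping[str, str] = os.environ) -> dict:
    """Read thresholds from environment variables (names are assumptions)."""
    return {
        "min_accuracy": float(env.get("RETRAIN_MIN_ACCURACY", "0.85")),
        "max_drift_score": float(env.get("RETRAIN_MAX_DRIFT", "0.2")),
        "max_age_days": int(env.get("RETRAIN_MAX_AGE_DAYS", "30")),
        "min_new_rows": int(env.get("RETRAIN_MIN_NEW_ROWS", "10000")),
    }


def fired_triggers(signals: Signals, thresholds: dict) -> list:
    """Return the name of every rule whose threshold is crossed."""
    reasons = []
    if signals.accuracy < thresholds["min_accuracy"]:
        reasons.append("performance_drop")
    if signals.drift_score > thresholds["max_drift_score"]:
        reasons.append("data_drift")
    if signals.days_since_last_train > thresholds["max_age_days"]:
        reasons.append("schedule")
    if signals.new_rows >= thresholds["min_new_rows"]:
        reasons.append("new_data_volume")
    return reasons
```

An orchestrator step would retrain whenever `fired_triggers(...)` returns a non-empty list, and could log the returned reasons so the audit trail records why the retrain fired.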

Configuration

yaml

name: Retrain on Data Drift
on:
  schedule:
    - cron: '0 2 * * 1'  # Every Monday at 2 AM UTC
  workflow_dispatch:  # Manual trigger

jobs:
  check-drift:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install mlflow dvc pandas scikit-learn
      
      - name: Check for data drift
        run: |
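          python3 << 'EOF'
          import json
          import os

          # This check uses accuracy from metrics.json as a proxy for drift.
          metrics_file = 'metrics.json'
          try:
              with open(metrics_file, 'r') as f:
                  metrics = json.load(f)
          except FileNotFoundError:
              # No metrics yet: fall back to 1.0, which means "do not retrain".
              metrics = {'accuracy': 1.0}

          threshold = 0.85
          drift = metrics.get('accuracy', 1.0) < threshold

          # Printing is not enough: step outputs must be appended to the file
          # named by $GITHUB_OUTPUT to be visible as steps.drift_check.outputs.*
          with open(os.environ['GITHUB_OUTPUT'], 'a') as out:
              out.write(f"DATA_DRIFT_DETECTED={'true' if drift else 'false'}\n")
          EOF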
        id: drift_check
      
      - name: Trigger retraining pipeline
        if: steps.drift_check.outputs.DATA_DRIFT_DETECTED == 'true'
        run: |
          dvc repro
      
      - name: Push updated model to DVC remote
        if: steps.drift_check.outputs.DATA_DRIFT_DETECTED == 'true'
        run: |
          dvc push
      
      - name: Log retrain event to MLflow
        if: steps.drift_check.outputs.DATA_DRIFT_DETECTED == 'true'
        run: |
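          python3 << 'EOF'
          import json
          import os
          import mlflow

          # Each step is a fresh Python process, so reload the metrics here.
          # If `dvc repro` rewrites metrics.json, this is the post-retrain value.
          with open('metrics.json') as f:
              metrics = json.load(f)

          # localhost only works if a tracking server runs on the same runner.
          mlflow.set_tracking_uri(os.environ.get('MLFLOW_TRACKING_URI', 'http://localhost:5000'))
          with mlflow.start_run(run_name='retrain_trigger_drift'):
              mlflow.log_param('trigger_type', 'data_drift')
              mlflow.log_param('threshold', 0.85)
              mlflow.log_metric('accuracy_before', metrics.get('accuracy', 1.0))
          EOF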

Why this order?

Schedule comes first (defines when to run). The check-drift step must complete before the conditional trigger. Outputs from drift_check must be available to downstream conditional steps (if: conditions). The push to DVC and MLflow logging happen last so you only log successful retrains.

Wrong vs Right

Wrong way

yaml

name: Bad Retraining
on:
  push:
    branches:
      - main
jobs:
  retrain:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - run: dvc repro
      - run: dvc push

Right way

The wrong version above triggers on every commit to main, even if the data hasn't changed. It wastes compute, floods MLflow with redundant runs, and can create merge conflicts in dvc.lock. The correct approach: check a metric or drift signal first, and retrain only if a threshold is crossed. Use a schedule plus a conditional step, as in the Configuration workflow above, or a webhook that listens for data-arrival events.
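One way to produce a drift signal to gate on is the Population Stability Index (PSI), which compares how a feature is distributed in a reference sample and in recent data. The sketch below is a plain-Python illustration under stated assumptions: the function name, the equal-width binning over the reference range, and the `1e-6` floor for empty bins are choices made for this example, not something the workflow above prescribes.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """Return the PSI between a reference sample and a current sample."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if lo == hi:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def bin_fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.2 as a significant shift, but the right cutoff depends on the feature, so treat any threshold as a tunable setting rather than a fixed constant.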

Tool vitals

Primary command

bash

dvc dag (visualize pipeline), airflow dags trigger (manual; airflow trigger_dag in Airflow 1.x), or GitHub Actions workflow_run event (automatic)

Config file

dvc.yaml (pipeline definition), .github/workflows/*.yml (CI/CD triggers), or airflow DAG Python file

Verify

bash

dvc dag && dvc repro --dry

Integration notes

DVC records the hashes of the data and outputs used in each retraining run in dvc.lock, and dvc push uploads those versions to the remote. MLflow records the retrain as a new run, labeled with a parameter or tag such as trigger_type = data_drift. Together, you can query MLflow ('show me all retrain runs triggered by drift in the last 30 days') and trace each run back to a data version in DVC. This is your audit trail.
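The 'last 30 days' query can be sketched with `mlflow.search_runs`. This assumes the retrain run logged `trigger_type` as a parameter with the value `data_drift`, as the workflow in the Configuration section does; the helper names and the `experiment_name` argument are hypothetical, and the `experiment_names` keyword requires a reasonably recent MLflow release.

```python
import time
from typing import Optional


def drift_retrain_filter(days: int = 30, now_ms: Optional[int] = None) -> str:
    """Build an MLflow filter string for drift-triggered runs in the last `days`."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # MLflow stores run start times as milliseconds since the Unix epoch.
    since_ms = now_ms - days * 24 * 60 * 60 * 1000
    return (
        "params.trigger_type = 'data_drift' "
        f"and attributes.start_time >= {since_ms}"
    )


def recent_drift_retrains(experiment_name: str, days: int = 30):
    """Return matching runs as a pandas DataFrame (needs a reachable tracking server)."""
    import mlflow  # imported lazily so the filter helper works without MLflow installed

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=drift_retrain_filter(days),
    )
```

Each returned row carries a run ID and start time. If you also record the Git commit that contains dvc.lock on the run, you can join the two sides of the audit trail.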

Migration path

If you outgrow cron + GitHub Actions, move to Apache Airflow (DAGs with rich scheduling) or Kubernetes CronJobs (for containerized workflows). The trigger logic stays the same; only the orchestrator changes. DVC and MLflow don't care how the trigger fires: they only log the result.

Common gotcha

Values printed in one GitHub Actions step don't automatically propagate to later steps. You must append them as name=value lines to the file at $GITHUB_OUTPUT, give the step an id, and reference them as steps.<step_id>.outputs.<name>. If you write 'if: ${{ env.DATA_DRIFT_DETECTED }}', it will always be empty, because a variable exported in a run step is local to that step unless it is written to $GITHUB_ENV. Always use step outputs for cross-step communication.
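When the check is written in Python, setting a step output comes down to appending a `name=value` line to the file whose path GitHub Actions provides in the `GITHUB_OUTPUT` environment variable. A minimal sketch follows; the helper name `set_step_output` and its optional `path` argument are hypothetical conveniences for local testing.

```python
import os
from typing import Optional


def set_step_output(name: str, value: str, path: Optional[str] = None) -> None:
    """Append one single-line output for the current GitHub Actions step."""
    # On a runner, GITHUB_OUTPUT holds the path of a per-step file.
    target = path if path is not None else os.environ["GITHUB_OUTPUT"]
    with open(target, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
```

With `id: drift_check` on the step, calling `set_step_output("DATA_DRIFT_DETECTED", "true")` makes the value available to later steps as `steps.drift_check.outputs.DATA_DRIFT_DETECTED`.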

Team adoption

Ship a template workflow in your repo under .github/workflows/retrain.yml. Add a metrics.json file with baseline values to version control. Document the threshold in a README: 'Retrains when accuracy drops below 0.85.' Use branch protection to prevent direct pushes to main: all retrains go through the workflow. Add a #ml-ops Slack notification in the workflow so the team sees when retrains fire. This builds confidence that the system is working.

Experienced dev note

Most teams hardcode thresholds in Python (accuracy < 0.85). The single biggest mistake: they forget to update the threshold when the baseline model improves. Use a reference run in MLflow (tagged 'baseline') and calculate the threshold dynamically: 'retrain if current accuracy drops >5% from baseline.' This survives model improvements without code changes.
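A baseline-relative rule might look like the sketch below. It reads 'drops >5% from baseline' as a relative drop, which is an assumption. The tag filter `tags.baseline = 'true'`, the metric name `accuracy`, and both function names are also hypothetical, so adapt them to however your baseline run is actually tagged.

```python
def should_retrain(
    current_accuracy: float,
    baseline_accuracy: float,
    max_relative_drop: float = 0.05,
) -> bool:
    """Return True when accuracy fell more than max_relative_drop below baseline."""
    if baseline_accuracy <= 0:
        raise ValueError("baseline_accuracy must be positive")
    relative_drop = (baseline_accuracy - current_accuracy) / baseline_accuracy
    return relative_drop > max_relative_drop


def fetch_baseline_accuracy(experiment_name: str) -> float:
    """Look up the newest run tagged as baseline (needs a reachable tracking server)."""
    import mlflow  # imported lazily so should_retrain works without MLflow installed

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string="tags.baseline = 'true'",  # hypothetical tag key and value
        order_by=["attributes.start_time DESC"],
        max_results=1,
    )
    if runs.empty:
        raise LookupError("no run tagged as baseline was found")
    return float(runs.iloc[0]["metrics.accuracy"])
```

Because the threshold is derived from the tagged run, promoting a better model only requires moving the baseline tag; the trigger code stays unchanged.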

Check your understanding

Why is the 'Check for data drift' step mandatory even if you want to retrain every Monday? What breaks if you skip it and run dvc repro unconditionally on schedule?

Show answer hint

Without the check, you retrain even when data hasn't changed. This creates identical models (wasting compute), pollutes your run history, and makes it impossible to correlate retrains with actual data changes. The check lets you retrain only when needed and leaves a clean audit trail.
