Tool Advanced hard · 8 min integration

Retraining triggers

What you will learn
Automatically retrain models when data drifts, metrics degrade, or schedules fire using DVC pipelines + MLflow callbacks + orchestration.

Why this matters

Without retraining triggers, stale models silently degrade in production. You ship a 92% accuracy model that becomes 71% accurate in 6 months because the data distribution shifted. Manual retraining is unsustainable at scale. Triggers catch drift before it hits users.

Skip if: (1) your model is a one-time batch prediction job with no new data, (2) regulatory requirements forbid automated retraining without human sign-off (use approval gates instead), or (3) you have <1GB of training data and can afford daily full retrains without cost concern. For small, stable datasets, a weekly cron job may be simpler than drift detection.

Explanation

Retraining triggers are decision points that automatically launch a DVC pipeline or an Airflow DAG when a condition is met. There are three main trigger types: data drift (detected via statistical tests), performance degradation (detected via holdout validation or production metrics), and time-based (cron schedules).

Each tool has one job. DVC pipelines execute the retraining workflow (data prep → train → evaluate → register). MLflow records every retraining run, so you can see which runs were launched by a trigger; that record is your audit trail. The orchestrator (cron, Airflow, Kubernetes CronJob) checks the trigger conditions and kicks off the pipeline.

The key architecture: DVC owns the retraining pipeline definition, MLflow owns the experiment tracking and model registry, and the orchestrator owns scheduling and condition checking. Without this separation, trigger logic ends up scattered across all three systems and no one owns its reliability.
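The three trigger types can be sketched as one small decision function. This is a minimal illustration, not the lesson's actual check_drift.py: the `TriggerInputs` fields and the `retrain_reasons` name are hypothetical, and the inputs are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TriggerInputs:
    """Hypothetical container for the values a trigger check needs."""
    drift_score: float                # e.g. a KS statistic or PSI from the drift check
    drift_threshold: float
    current_accuracy: float           # latest holdout or production accuracy
    min_accuracy: float
    last_retrain: Optional[datetime]  # None if the model was never retrained
    max_age: timedelta                # time-based trigger: retrain if older than this


def retrain_reasons(inputs: TriggerInputs, now: datetime) -> List[str]:
    """Return the trigger types that fired; an empty list means no retrain."""
    reasons = []
    if inputs.drift_score > inputs.drift_threshold:
        reasons.append("data_drift")
    if inputs.current_accuracy < inputs.min_accuracy:
        reasons.append("performance_degradation")
    if inputs.last_retrain is None or now - inputs.last_retrain >= inputs.max_age:
        reasons.append("schedule")
    return reasons
```

Returning the list of reasons rather than a bare boolean lets the orchestrator log why a retrain was launched, which is what makes the audit trail useful.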

Configuration

yaml
kind: CronJob
apiVersion: batch/v1
metadata:
  name: model-retraining-trigger
  namespace: mlops
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      # backoffLimit is a Job-level field, so it belongs here, not in the pod spec.
      backoffLimit: 2
      template:
        spec:
          serviceAccountName: mlops-sa
          containers:
          - name: trigger
            image: myregistry.azurecr.io/ml-trainer:latest
            imagePullPolicy: Always
            env:
            - name: MLFLOW_TRACKING_URI
              value: "http://mlflow-server:5000"
            - name: DVC_REMOTE
              value: "s3://my-bucket/dvc-store"
            - name: DRIFT_THRESHOLD
              value: "0.08"
            command:
            - /bin/bash
            - -c
            - |
              set -e
              echo "[$(date)] Starting drift detection..."
              # Exit code contract: 0 = drift detected, non-zero = no drift.
              # The check runs inside `if` so that `set -e` does not abort the
              # script (and fail the Job) when the check exits non-zero.
              if python3 /app/check_drift.py \
                --baseline-model-uri "models:/prod-classifier/Production" \
                --recent-data s3://my-bucket/data/recent/ \
                --threshold "${DRIFT_THRESHOLD}"; then
                echo "[$(date)] Drift detected. Triggering retraining pipeline."
                cd /app && dvc repro dvc.yaml --force
                python3 /app/register_model.py
              else
                echo "[$(date)] No drift detected. Skipping retraining."
              fi
            volumeMounts:
            - name: dvc-config
              mountPath: /root/.dvc
              readOnly: true
            - name: aws-creds
              mountPath: /root/.aws
              readOnly: true
            resources:
              requests:
                memory: "4Gi"
                cpu: "2"
              limits:
                memory: "8Gi"
                cpu: "4"
          volumes:
          - name: dvc-config
            secret:
              secretName: dvc-remote-config
          - name: aws-creds
            secret:
              secretName: aws-credentials
          restartPolicy: OnFailure
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dvc-pipeline-config
  namespace: mlops
data:
  dvc.yaml: |
    stages:
      fetch_recent_data:
        cmd: python3 src/fetch_data.py --output data/recent.parquet --days 7
        deps:
          - src/fetch_data.py
        outs:
          - data/recent.parquet
      
      train:
        cmd: python3 src/train.py --data data/recent.parquet --output models/candidate.pkl
        deps:
          - src/train.py
          - data/recent.parquet
        outs:
          - models/candidate.pkl
        metrics:
          - metrics.json:
              cache: false
      
      evaluate:
        cmd: python3 src/evaluate.py --model models/candidate.pkl --holdout data/holdout.parquet
        deps:
          - src/evaluate.py
          - models/candidate.pkl
          - data/holdout.parquet
        metrics:
          - eval_metrics.json:
              cache: false
  
  params.yaml: |
    retraining:
      min_accuracy_threshold: 0.88
      max_latency_ms: 150
      drift_pvalue_threshold: 0.05
      data_volume_min_samples: 5000
    model:
      learning_rate: 0.001
      batch_size: 32
      epochs: 20

Why this order?

The CronJob definition comes first because it is the orchestration trigger itself: the scheduler that fires at 2 AM daily. Inside the job, the trigger script (check_drift.py) runs first to determine whether conditions are met; only if drift is detected does the job call `dvc repro`. The dvc.yaml pipeline defines the retraining stages in dependency order: fetch data → train → evaluate. params.yaml holds the retraining thresholds, which keeps them auditable and version-controlled. Note that the CronJob passes the drift threshold through the DRIFT_THRESHOLD environment variable (0.08), while params.yaml carries drift_pvalue_threshold (0.05); these appear to be different quantities (a drift-score cutoff and a p-value cutoff), so decide which one check_drift.py uses and keep a single source of truth. This ordering ensures: (1) scheduling is independent of pipeline logic, (2) drift detection is decoupled from training, (3) thresholds are centralized and parameterized, (4) the pipeline is deterministic and reproducible.

Wrong vs Right

Wrong way
yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: retrain
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: trainer
            image: myregistry.azurecr.io/ml-trainer:latest
            command:
            - /bin/bash
            - -c
            - |
              cd /app
              dvc repro dvc.yaml
              if grep -q 'accuracy.*0\.[0-8][0-9]' metrics.json; then
                echo "Bad accuracy, but we already trained. Oops."
              fi
          restartPolicy: OnFailure

Right way
Separate drift detection from pipeline execution. First run a Python script that reads production metrics from MLflow and the holdout set from S3, computes a Kolmogorov-Smirnov statistic or Population Stability Index on the input distributions, and exits with code 0 only if drift exceeds the threshold. Only then invoke `dvc repro`. This way: (1) you don't waste compute retraining when data is stable, (2) you have a clear decision point with logging, (3) you can alert on drift without actually retraining, (4) thresholds live in params.yaml under version control, not hardcoded in the job.
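A drift check along these lines could look like the sketch below. It computes a Population Stability Index with equal-width bins using only the standard library. The function names, the bin count, and the epsilon floor are illustrative assumptions, not the lesson's check_drift.py; both samples are assumed to be non-empty lists of numbers for a single feature.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float], recent: Sequence[float], bins: int = 10
) -> float:
    """PSI between two samples, binned over the baseline's value range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        eps = 1e-6  # floor so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(recent)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def drift_exit_code(psi: float, threshold: float) -> int:
    """Follow the convention above: 0 means drift detected, 1 means no drift."""
    return 0 if psi > threshold else 1
```

A script built on this would end with `sys.exit(drift_exit_code(psi, threshold))`. Because exit code 0 means "drift", a crash (non-zero) looks like "no drift" to the caller, so the script should catch its own errors and decide explicitly what to return.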

Tool vitals

Primary command
bash
dvc repro <pipeline.yaml>
Run it behind an external condition check, with the MLflow client tracking runs and an orchestrator (K8s CronJob, Airflow task, or GitHub Actions) executing the trigger.
Config file
dvc.yaml (pipeline), params.yaml (thresholds), and the Kubernetes CronJob YAML or Dockerfile for the containerized trigger
Verify
bash
dvc dag                                  # print the pipeline graph
mlflow experiments search                # find the experiment ID
mlflow runs list --experiment-id <id>    # confirm the triggered runs were logged
kubectl logs -f <cronjob-pod>            # watch trigger execution

Integration notes

The CronJob orchestrates the retraining trigger. check_drift.py queries MLflow's tracking server for recent holdout metrics and fetches recent production data from S3 (via DVC remote). DVC pipeline (dvc.yaml) runs the actual training, logging experiments to MLflow. After training, register_model.py promotes the candidate model to the Model Registry if it passes threshold checks. The entire flow is: K8s scheduler → drift detector → DVC pipeline → MLflow tracking + registry. If using Airflow instead of K8s, replace the CronJob with an Airflow DAG that calls a Python operator running the same check_drift.py and dvc repro commands.
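The promotion gate in register_model.py might be sketched as follows. The threshold names come from the `retraining` section of params.yaml above; the metric keys `accuracy` and `latency_ms` and the function name are assumptions about what eval_metrics.json contains. The actual MLflow registry call is deliberately left out.

```python
from typing import Dict, List, Tuple


def promotion_decision(
    eval_metrics: Dict[str, float], thresholds: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """Return (promote?, reasons for rejection) for a candidate model."""
    failures = []
    # Assumed metric keys; adjust to whatever evaluate.py actually writes.
    if eval_metrics["accuracy"] < thresholds["min_accuracy_threshold"]:
        failures.append(
            f"accuracy {eval_metrics['accuracy']:.3f} < "
            f"{thresholds['min_accuracy_threshold']:.3f}"
        )
    if eval_metrics["latency_ms"] > thresholds["max_latency_ms"]:
        failures.append(
            f"latency {eval_metrics['latency_ms']:.0f}ms > "
            f"{thresholds['max_latency_ms']:.0f}ms"
        )
    return (not failures, failures)
```

In a real script the two dictionaries would be loaded from eval_metrics.json and params.yaml, and the candidate would be registered only when the first element of the result is true.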

Migration path

Three common moves: (1) Away from cron-based triggers: Airflow and Prefect offer richer condition logic and retry strategies; migrate by converting the bash script into Python operators and using Airflow's branching. (2) Away from DVC pipelines: keep the trigger detector (drift check), but replace `dvc repro` with direct Python calls to your train/evaluate functions; this loses reproducibility but gains flexibility. (3) Away from MLflow: the trigger logic stays the same, but change the metrics queries from the MLflow API to your new registry's API.

Cost model

Free in licensing terms. A K8s CronJob costs only the compute of the nodes it runs on, and MLflow and DVC are open-source. Hidden cost: uncontrolled retraining on expensive hardware (GPUs) can balloon costs if drift triggers too frequently. Mitigate by setting a minimum time between retrains (for example, a timestamp file on persistent storage that the job checks, skipping retraining within 24 hours of the last one) and by monitoring how often drift is detected. At scale (>100 models), running a separate CronJob per model is noisy; consider a central trigger manager that batches retraining requests.
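The minimum-interval guard could be sketched like this. The function names and the 24-hour default are illustrative; the timestamp file is assumed to live on storage that outlives the pod (a mounted volume, for instance), since a file written inside the container disappears with it.

```python
import time
from pathlib import Path
from typing import Optional

MIN_INTERVAL_SECONDS = 24 * 60 * 60  # assumed cooldown: 24 hours


def in_cooldown(
    stamp_file: Path,
    now: Optional[float] = None,
    min_interval: float = MIN_INTERVAL_SECONDS,
) -> bool:
    """True if the last recorded retrain is more recent than min_interval."""
    now = time.time() if now is None else now
    try:
        last = float(stamp_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no usable record: do not block retraining
    return now - last < min_interval


def record_retrain(stamp_file: Path, now: Optional[float] = None) -> None:
    """Write the current time so later runs can enforce the cooldown."""
    now = time.time() if now is None else now
    stamp_file.parent.mkdir(parents=True, exist_ok=True)
    stamp_file.write_text(str(now))
```

The trigger would call `in_cooldown` before `dvc repro` and `record_retrain` after a successful run. A missing or corrupt file is treated as "no cooldown" so that a lost record never blocks retraining indefinitely.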

Common gotcha

Drift detection scripts time out or crash silently, and the Job retries in backoff (up to the Kubernetes default backoffLimit of 6) without anyone noticing. Always wrap your trigger logic in a try-except that logs to stderr, set backoffLimit explicitly to 2 or 3, and add a liveness probe or external health check (e.g., a Prometheus metric tracking the last successful trigger run).

Second gotcha: if your drift detection reads production metrics from MLflow and MLflow is unavailable at trigger time, the job fails, the pipeline never runs, and the model never retrains. Use a fallback: if you can't fetch recent metrics, assume drift is present and retrain anyway.

Third gotcha: DVC pipelines are deterministic by design, but if your trigger script modifies the git branch or DVC remote between checks, `dvc repro` may use stale cache. Always run the trigger in an isolated working directory and reset it before executing `dvc repro`.
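The try-except and the "assume drift" fallback can be combined in one wrapper. This is a sketch under assumptions: `fetch_drift_score` stands in for whatever function queries MLflow and computes the score, and `should_retrain` is a hypothetical name.

```python
import sys
from typing import Callable


def should_retrain(fetch_drift_score: Callable[[], float], threshold: float) -> bool:
    """Decide whether to retrain, failing open if the score cannot be computed."""
    try:
        score = fetch_drift_score()
    except Exception as exc:
        # Fallback from the text: if metrics are unavailable, assume drift.
        print(f"drift check failed ({exc!r}); assuming drift", file=sys.stderr)
        return True
    return score > threshold
```

Failing open trades extra compute for safety. If retraining is expensive, pair this with a minimum interval between retrains so that a long tracking-server outage does not cause a retrain on every run.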

Team adoption

Day 1: Have one senior engineer own the trigger orchestration (K8s + CronJob setup).
Day 2: Have a data scientist write and test check_drift.py locally with holdout data; code review it against params.yaml thresholds.
Day 3: Integrate into CI/CD so that any change to params.yaml or check_drift.py triggers a staging CronJob run first.
Day 4: Deploy to production with alerting on CronJob failures and on drift detection (even if retraining succeeds: seeing how often drift fires tells you if thresholds need tuning).
Day 5: Start a weekly review of retraining frequency and model accuracy trends. Use the MLflow UI to show the team which experiments were triggered by the automated pipeline vs. manual runs: this builds confidence in the automation.

Experienced dev note

After months of debugging production, the most valuable pattern is to separate the drift detection exit code from the pipeline execution. Make check_drift.py idempotent and test it offline with holdout data: never rely on production metrics alone. Log the drift score (e.g., KS statistic = 0.087) and threshold (0.08) to stdout at the moment of decision, not just a boolean. This makes root-causing false positives trivial. Second insight: use `dvc repro --no-commit` first to validate that the pipeline runs, then `dvc commit` to store the outputs in the cache once you are satisfied. Third: parameterize everything in params.yaml, including the drift threshold, holdout split ratio, and model architecture. This decouples the trigger logic from training logic and lets data scientists tune thresholds without touching the orchestration code.
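Logging the score alongside the threshold might look like the sketch below, which computes a two-sample Kolmogorov-Smirnov statistic in pure Python and formats the decision as one line. The function names and the log format are illustrative assumptions; a production script would more likely use a statistics library for the test itself.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def decision_line(score: float, threshold: float) -> str:
    """One log line carrying the score, the threshold, and the verdict."""
    verdict = "RETRAIN" if score > threshold else "SKIP"
    return f"ks_statistic={score:.3f} threshold={threshold:.3f} decision={verdict}"
```

Printing `decision_line(score, threshold)` right before the script exits puts the number and the cutoff next to each other in `kubectl logs`, so a borderline trigger is visible without rerunning anything.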

Check your understanding

Why does the CronJob example run check_drift.py before `dvc repro`, rather than embedding the drift check inside the dvc.yaml pipeline itself?

Answer hint

If drift detection were a DVC stage, DVC would skip it on subsequent runs whenever its declared dependencies had not changed, reusing the previous result. You'd never re-detect drift unless you forced the stage (for example with `--force`, or by marking it `always_changed: true` in dvc.yaml). By running drift detection outside the pipeline, it executes every time the CronJob fires, making the trigger stateless and reliable. It also keeps a fast, always-run decision step out of DVC's dependency tracking and caching, which exist to make expensive stages reproducible.
