Tool Beginner easy · 6 min concept

Drift detection fundamentals

What you will learn
Detect when your model's input data or predictions diverge from training conditions using statistical monitoring.

Why this matters

Models trained on historical data degrade silently when production data shifts. Without drift detection, a deployed model can appear healthy while making wrong predictions on new patterns, and you won't know it. Drift detection lets you trigger retraining before users notice failures.

Skip if: (1) your data is truly stationary (rare: time series almost always drift), (2) you have human review gates on all predictions (expensive), or (3) the model is retrained automatically every day regardless (wastes compute). For most production systems, drift detection is non-negotiable.

Explanation

Drift detection monitors whether the data flowing into your model in production matches the statistical properties of the training data. Two types matter: data drift (input features changed) and prediction drift (model outputs shifted). You detect both by comparing distributions: if production Feature X has a different mean, variance, or categorical distribution than the training data, you flag it.

Neither MLflow nor DVC detects drift on its own. The check lives in a monitoring script (run as a DVC stage or a scheduled batch job), while MLflow records the resulting metrics so they can be tied back to a registered model version. The workflow is: log baseline statistics during training → collect production data in batches → compare distributions using statistical tests (Kolmogorov-Smirnov, chi-square, or custom thresholds) → alert and trigger retraining if a threshold is breached.
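The compare step can be sketched on synthetic data. This is a minimal illustration, not a production check: the helper name `detect_drift`, the `alpha` parameter, and the generated samples are all assumptions made for the example. It draws a "training" sample plus two "production" batches, one from the same distribution and one with a shifted mean, and runs the two-sample KS test on each.

```python
import numpy as np
from scipy.stats import ks_2samp


def detect_drift(baseline, current, alpha=0.05):
    """Two-sample KS test; returns (statistic, p_value, drifted)."""
    result = ks_2samp(baseline, current)
    return float(result.statistic), float(result.pvalue), bool(result.pvalue < alpha)


# Synthetic stand-ins for one numeric feature (illustrative only).
rng = np.random.default_rng(seed=0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_same = rng.normal(loc=0.0, scale=1.0, size=5000)     # same distribution
prod_shifted = rng.normal(loc=0.5, scale=1.0, size=5000)  # mean moved by half a std

same_result = detect_drift(train_feature, prod_same)
shifted_result = detect_drift(train_feature, prod_shifted)
print("same distribution:", same_result)
print("shifted mean:     ", shifted_result)
```

The KS statistic is the largest gap between the two empirical CDFs, so it doubles as a rough effect size: the shifted batch produces a much larger statistic than the unshifted one, independent of where the p-value threshold is set.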

Configuration

yaml
# dvc.yaml: Drift detection pipeline
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/train.csv
    outs:
      - models/model.pkl
      - stats/baseline_stats.json
    metrics:
      - metrics/train_metrics.json:
          cache: false

  monitor_drift:
    cmd: python monitor_drift.py
    deps:
      - monitor_drift.py
      - stats/baseline_stats.json
      - data/train.csv
      - data/production_data.csv
    metrics:
      - metrics/drift_report.json:
          cache: false
    plots:
      - plots/drift_comparison.csv:
          x: feature
          y: ks_statistic

python
# monitor_drift.py
import json
import os
import sys

import mlflow
import pandas as pd
from scipy.stats import ks_2samp

ALPHA = 0.05  # significance threshold for the KS test

# baseline_stats.json is written by the train stage and lists the monitored columns
with open('stats/baseline_stats.json') as f:
    baseline_stats = json.load(f)

train_data = pd.read_csv('data/train.csv')
prod_data = pd.read_csv('data/production_data.csv')

drift_results = {}
ks_tests = []

for col in baseline_stats['columns']:
    if col in prod_data.columns:
        # drop missing values so NaNs do not distort the test
        stat, p_value = ks_2samp(train_data[col].dropna(), prod_data[col].dropna())
        drift_results[col] = {
            'ks_statistic': float(stat),
            'p_value': float(p_value),
            # cast to a built-in bool: json cannot serialize numpy.bool_
            'drifted': bool(p_value < ALPHA)
        }
        ks_tests.append({'feature': col, 'ks_statistic': float(stat)})

# make sure the output directories exist before writing
os.makedirs('metrics', exist_ok=True)
os.makedirs('plots', exist_ok=True)

with open('metrics/drift_report.json', 'w') as f:
    json.dump(drift_results, f, indent=2)

df_plots = pd.DataFrame(ks_tests)
df_plots.to_csv('plots/drift_comparison.csv', index=False)

n_drifted = sum(1 for v in drift_results.values() if v['drifted'])

mlflow.log_metrics({
    'drifted_features': n_drifted,
    'total_features': len(drift_results)
})

if n_drifted > 0:
    print('ALERT: Data drift detected')
    # a non-zero exit code fails the stage so the scheduler can react
    sys.exit(1)

Why this order?

The train stage must run first to generate baseline statistics. The monitor_drift stage depends on those statistics and production data, so it runs second. In DVC pipelines, dependencies and outputs define execution order automatically.
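The monitoring script reads `baseline_stats['columns']`, but the training step that writes that file is not shown. One possible shape for it, as a sketch: the helper name `build_baseline_stats` and the JSON layout (a `columns` list plus per-column summaries) are assumptions chosen to match what `monitor_drift.py` reads, not a format defined by DVC or MLflow.

```python
import json

import pandas as pd


def build_baseline_stats(train_df: pd.DataFrame) -> dict:
    """Summarize numeric training columns for later drift comparison."""
    numeric = train_df.select_dtypes(include="number")
    summary = numeric.describe(percentiles=[0.05, 0.5, 0.95])
    return {
        "columns": list(numeric.columns),
        "n_rows": int(len(train_df)),
        "summary": {
            col: {stat: float(value) for stat, value in summary[col].items()}
            for col in numeric.columns
        },
    }


# Tiny illustrative frame; train.py would pass its real training data instead.
train_df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "income": [40.0, 52.5, 61.0, 75.5],
        "city": ["a", "b", "a", "c"],  # non-numeric, so excluded from the KS columns
    }
)
baseline_stats = build_baseline_stats(train_df)
baseline_json = json.dumps(baseline_stats, indent=2)  # content for stats/baseline_stats.json
print(baseline_json)
```

Restricting `columns` to numeric features keeps categorical columns out of the KS loop; those need a different test.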

Wrong vs Right

Wrong way
bash
# DON'T: Check drift manually after each batch
cd /data
manually_inspect.sh production_data.csv
# No version control and no reproducibility: drift goes unnoticed for 3 weeks until a customer complains

python
# DON'T: Log metrics without statistical tests
mlflow.log_metric('feature_x_mean_prod', prod_mean)
# You log the value but have no threshold, so you get either alert fatigue or silent failures

# DON'T: Use production baseline for comparison
baseline = production_data.sample(1000)  # Week 1 production
current = production_data_week_2
# You're comparing production to production, not training to production: this detects week-over-week shifts, not divergence from what the model was trained on

Right way
python
# DO: Baseline from training data, compare to production
baseline_stats = train_data.describe().to_dict()  # Capture during training
# In monitoring: compare the production_data distribution to the training baseline
stat, p_value = ks_2samp(train_data[col], prod_data[col])
if p_value < 0.05:  # Statistical significance threshold
    mlflow.log_metric(f'drift_{col}', 1)  # Record the drift flag; alerting reads this metric

# DO: Version baseline statistics in DVC
# Listing stats/baseline_stats.json under outs: in dvc.yaml already tracks it.
# Use `dvc add stats/baseline_stats.json` only when the file is not produced by a pipeline stage.
# Ensures reproducibility and audit trail

# DO: Separate data drift detection from model performance monitoring
# Data drift: "Did inputs change?" → Monitor KS, Wasserstein distance
# Performance drift: "Did prediction quality change?" → Monitor accuracy, precision on labeled holdout
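
The split between the two signals in the last DO can be illustrated with a short sketch. Everything here is synthetic and hypothetical (the generated feature values, the eight holdout labels and predictions); the point is only that the input-side check needs no labels, while the performance check does.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(seed=1)
train_feature = rng.normal(0.0, 1.0, size=5000)
prod_feature = rng.normal(0.5, 1.0, size=5000)  # inputs shifted by 0.5

# Data drift signal: needs only input values, no ground truth.
input_shift = wasserstein_distance(train_feature, prod_feature)

# Performance signal: needs ground-truth labels for a holdout slice.
holdout_labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])
holdout_preds = np.array([1, 0, 0, 1, 0, 1, 1, 0])
accuracy = float((holdout_labels == holdout_preds).mean())

print(f"input shift (Wasserstein): {input_shift:.3f}")
print(f"holdout accuracy: {accuracy:.3f}")
```

Unlike a p-value, the Wasserstein distance is expressed in the feature's own units, which makes it easier to set a threshold that reflects a shift the team actually cares about.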

Tool vitals

Primary command
bash
dvc metrics diff (after dvc repro monitor_drift), or custom Python monitoring scripts with MLflow logging

Config file
dvc.yaml for pipeline drift checks, or standalone monitoring configuration

Verify
bash
dvc plots show or MLflow UI to inspect logged drift metrics

Integration notes

With the MLflow 2.x Model Registry, log drift metrics to a run and link them to the model version they describe (for example through a model version tag) so retraining decisions are auditable. DVC has no scheduler of its own, so have an external tool (Airflow, GitHub Actions) run the monitor_drift stage on a cadence (hourly, daily). When drift is detected, trigger an MLflow experiment run to retrain. This closes the loop: production monitoring → MLflow experiment → new model version → registry promotion.

Migration path

If moving away from manual DVC pipelines, use Evidently AI (purpose-built for drift; its reports can be logged to MLflow) or integrate monitoring into Airflow DAGs. For real-time detection, stream data to a monitoring system (Datadog, CloudWatch) alongside batch checks.

Common gotcha

The Kolmogorov-Smirnov test (ks_2samp) assumes continuous data. If your features are categorical or discrete, use a chi-square test instead. KS depends on the ordering of values, which is meaningless for unordered categories, and heavy ties in discrete data make its p-values unreliable, so real drift can slip through. Also: compute baseline statistics from exactly the data the model was trained on, before any production data is folded into the training set. Otherwise the baseline already contains the shift you are trying to detect.
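
For a categorical feature, the chi-square alternative might look like the sketch below. It uses `scipy.stats.chi2_contingency` on a table of category counts; the helper name `categorical_drift` and the example `plan` values are hypothetical. Categories with very low expected counts weaken the test, so rare levels are often grouped before running it.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_drift(baseline: pd.Series, current: pd.Series, alpha: float = 0.05):
    """Chi-square test of homogeneity on category counts from two samples."""
    counts = pd.concat(
        [baseline.value_counts(), current.value_counts()],
        axis=1,
        keys=["baseline", "current"],
    ).fillna(0)  # a category missing from one sample counts as zero
    chi2, p_value, _dof, _expected = chi2_contingency(counts.to_numpy())
    return float(chi2), float(p_value), bool(p_value < alpha)


# Hypothetical subscription-plan feature: production has moved toward paid plans.
train_plan = pd.Series(["free"] * 700 + ["pro"] * 250 + ["team"] * 50)
prod_plan = pd.Series(["free"] * 450 + ["pro"] * 400 + ["team"] * 150)

chi2, p_value, drifted = categorical_drift(train_plan, prod_plan)
print(f"chi2={chi2:.2f}, p={p_value:.3g}, drifted={drifted}")
```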

Team adoption

On day one, embed baseline stats capture in training scripts (one line: mlflow.log_dict or dvc add). Assign one person to write the monitoring script, then reuse it across all models. Add the drift report to the weekly model health dashboard. Alert on drift, but don't auto-retrain until the team has seen 2-3 cases and trusts the signal.

Experienced dev note

Use a relative threshold (e.g., 'alert if 20% of features drift') instead of absolute p-value thresholds. With many features, p < 0.05 triggers false alarms: across 100 independent features with no real drift, about 5 would be flagged by chance. Also: log the baseline distribution itself (mean, std, percentiles) alongside the drift metric. When retraining, you need to know what the training distribution was, not just whether production drifted.
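
One way to combine both ideas is sketched below: first tighten the per-feature threshold, then alert only when enough features cross it. The Bonferroni correction is a common choice for the first step but is an assumption here, not something the note prescribes; the function name `summarize_drift`, its parameters, and the p-values are hypothetical.

```python
def summarize_drift(p_values: dict, alpha: float = 0.05, max_drift_fraction: float = 0.2) -> dict:
    """Apply a Bonferroni-corrected threshold, then a fraction-of-features rule."""
    if not p_values:
        raise ValueError("p_values must contain at least one feature")
    n_features = len(p_values)
    corrected_alpha = alpha / n_features  # Bonferroni: stricter as features grow
    drifted = sorted(name for name, p in p_values.items() if p < corrected_alpha)
    fraction = len(drifted) / n_features
    return {
        "corrected_alpha": corrected_alpha,
        "drifted_features": drifted,
        "drift_fraction": fraction,
        "alert": fraction >= max_drift_fraction,
    }


# Hypothetical per-feature p-values, e.g. collected from ks_2samp.
p_values = {"age": 0.0001, "income": 0.03, "tenure": 0.40, "visits": 0.002, "score": 0.75}
report = summarize_drift(p_values)
print(report)
```

Here `income` (p = 0.03) would have been flagged at the raw 0.05 threshold but is not after the correction, which is exactly the kind of borderline alarm the relative rule is meant to suppress.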

Check your understanding

Why is it wrong to use Week 1 production data as your baseline for Week 2 production drift detection?

Answer hint

Baseline must reflect training conditions. Production data at Week 1 may already differ from training (gradual shift, seasonal patterns, user behavior changes). Comparing production-to-production only detects change relative to an arbitrary point, not deterioration relative to model assumptions.
