Tool Advanced hard · 8 min best_practice

On-call for ML systems

What you will learn

Set up monitoring, alerting, and incident response for production ML pipelines using MLflow, DVC, and Kubernetes observability tools.

Why this matters

ML systems fail silently: a model's accuracy can degrade without a crash, data drift can corrupt predictions, and training jobs can hang indefinitely. Without on-call infrastructure, you discover production failures hours or days later, when business metrics drop. On-call for ML means catching failures within minutes, with automated rollbacks and clear incident playbooks.

Skip if: your model runs offline (batch reports generated weekly and reviewed before publishing), in which case on-call monitoring is overkill. If predictions don't affect critical business logic (novelty features, A/B test candidate arms), simpler dashboards suffice. For hobby projects or internal tools with no SLA, skip this entirely.

Explanation

On-call for ML systems requires three layers: (1) Pipeline monitoring: detect training and inference failures, data drift, and model staleness from MLflow and DVC; (2) Model performance monitoring: track prediction latency, error rates, and the model's output distribution in real time; (3) Incident response: automated alerts to Slack/PagerDuty, runbooks for common failures, and quick rollback paths. Teams that skip this tend to burn out on-call engineers, who end up debugging production issues with no historical context. The core insight is that ML failures are domain-specific (drift, label leakage, numerical instability) and need ML-aware monitoring, not just application health checks. Kubernetes liveness probes catch hung processes, but they won't catch a model that returns NaN for 10% of requests or has silently degraded to 40% accuracy.
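To make the difference between a liveness probe and an ML-aware check concrete, here is a minimal sketch of an output health check over a window of recent predictions. The function name, the `OutputHealth` type, and the 1% threshold are illustrative assumptions, not part of any library or of the configuration below.

```python
import math
from dataclasses import dataclass


@dataclass
class OutputHealth:
    total: int
    nan_rate: float
    healthy: bool


def check_output_health(predictions, max_nan_rate=0.01):
    """Summarise a window of model outputs.

    max_nan_rate is an assumed default; tune it to your own tolerance.
    """
    total = len(predictions)
    if total == 0:
        # No traffic is reported as unhealthy so an idle model does not look fine.
        return OutputHealth(total=0, nan_rate=0.0, healthy=False)
    # Count outputs that are missing, NaN, or infinite.
    bad = sum(1 for p in predictions if p is None or math.isnan(p) or math.isinf(p))
    nan_rate = bad / total
    return OutputHealth(total=total, nan_rate=nan_rate, healthy=nan_rate <= max_nan_rate)


if __name__ == "__main__":
    # The process is alive and answering, yet 10% of its outputs are NaN.
    window = [0.7] * 90 + [float("nan")] * 10
    print(check_output_health(window))
```

A pod serving this window would still pass its liveness probe; only a check on the outputs themselves reports the problem. In practice the resulting rate would be exported as a metric and alerted on, rather than printed.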

Configuration

yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-ml-rules
  namespace: monitoring
data:
  ml-alerts.yaml: |
    groups:
    - name: ml_pipeline_alerts
      interval: 30s
      rules:
      - alert: ModelInferenceLatenessHigh
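        # histogram_quantile needs per-bucket rates (the _bucket series), aggregated by le.
        expr: histogram_quantile(0.95, sum by (le) (rate(model_inference_duration_seconds_bucket[5m]))) > 2.0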
        for: 5m
        labels:
          severity: warning
          component: inference
        annotations:
          summary: "Model inference p95 latency > 2s"
          runbook: "https://wiki.internal/runbooks/inference-latency.md"
      - alert: DataDriftDetected
        expr: dvc_metric_data_drift_kolmogorov_smirnov > 0.15
        for: 10m
        labels:
          severity: critical
          component: data
        annotations:
          summary: "Data distribution shifted significantly from training set"
          runbook: "https://wiki.internal/runbooks/data-drift.md"
      - alert: ModelStalenessHigh
        expr: (time() - mlflow_model_last_trained_timestamp) / 86400 > 7
        for: 1m
        labels:
          severity: warning
          component: model
        annotations:
          summary: "Model not retrained in 7+ days"
          runbook: "https://wiki.internal/runbooks/model-staleness.md"
      - alert: TrainingJobFailed
        expr: increase(dvc_training_job_failures_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
          component: training
        annotations:
          summary: "DVC training job failed: check logs"
          runbook: "https://wiki.internal/runbooks/training-failure.md"
      - alert: ModelRegistryPushFailed
        expr: increase(mlflow_model_push_errors_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
          component: registry
        annotations:
          summary: "MLflow model registry push error"
          runbook: "https://wiki.internal/runbooks/registry-push.md"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-ml-config
  namespace: monitoring
data:
  config.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    route:
      receiver: ml-on-call
      group_by: [component, severity]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
      - match:
          component: data
          severity: critical
        receiver: ml-data-critical
        group_wait: 10s
        repeat_interval: 30m
      - match:
          component: inference
          severity: critical
        receiver: ml-inference-critical
        repeat_interval: 1h
      - match:
          component: training
          severity: critical
        receiver: ml-training-critical
        repeat_interval: 2h
    receivers:
    - name: ml-on-call
      slack_configs:
      - channel: "#ml-alerts"
        title: "ML System Alert"
        text: "{{ .GroupLabels.component }}: {{ .CommonAnnotations.summary }}"
        actions:
        - type: button
          text: "View Runbook"
          url: "{{ .CommonAnnotations.runbook }}"
    - name: ml-data-critical
      slack_configs:
      - channel: "#ml-critical"
        title: "🚨 Data Drift Critical"
        text: "Data distribution shifted. {{ .CommonAnnotations.summary }}"
      pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .CommonAnnotations.summary }}"
    - name: ml-inference-critical
      slack_configs:
      - channel: "#ml-critical"
        title: "🚨 Inference Degradation"
        text: "{{ .CommonAnnotations.summary }}"
      pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_INFERENCE_KEY"
    - name: ml-training-critical
      slack_configs:
      - channel: "#ml-critical"
        title: "🚨 Training Pipeline Failed"
        text: "{{ .CommonAnnotations.summary }}"
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-server
  namespace: ml-prod
spec:
  selector:
    app: mlflow
  ports:
  - port: 5000
    targetPort: 5000
    name: ui
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: ml-prod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:v2.12.1
        command:
        - mlflow
        - server
        - --backend-store-uri
        - postgresql://mlflow_user:password@postgres.ml-prod:5432/mlflow
        - --default-artifact-root
        - s3://ml-artifacts/
        - --host
        - "0.0.0.0"
        - --port
        - "5000"
        ports:
        - containerPort: 5000
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: secret-key
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dvc-pipeline-monitor
  namespace: ml-prod
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dvc-monitor
          containers:
          - name: monitor
            image: python:3.11-slim
            command:
            - /bin/bash
            - -c
            - |
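              # Quote the extra so bash does not treat [s3] as a glob pattern.
              pip install -q 'dvc[s3]' prometheus-client
              # The script expects a DVC repository at /workspace; mount or clone it before this step.
              python /scripts/dvc_monitor.py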
            volumeMounts:
            - name: monitor-script
              mountPath: /scripts
            - name: dvc-config
              mountPath: /root/.dvc
            env:
            - name: PROMETHEUS_PUSHGATEWAY
              value: "http://prometheus-pushgateway:9091"
          restartPolicy: OnFailure
          volumes:
          - name: monitor-script
            configMap:
              name: dvc-monitor-script
          - name: dvc-config
            secret:
              secretName: dvc-s3-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dvc-monitor-script
  namespace: ml-prod
data:
  dvc_monitor.py: |
    import json
    import os
    import subprocess
    from datetime import datetime
    from pathlib import Path

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    registry = CollectorRegistry()

    model_staleness = Gauge('dvc_model_staleness_days', 'Days since model was trained', registry=registry)
    pipeline_status = Gauge('dvc_pipeline_last_status', 'Last pipeline exit code (0=success)', registry=registry)
    data_size = Gauge('dvc_tracked_data_size_bytes', 'Total size of DVC-tracked data', registry=registry)

    # Assumes the DVC repository is available at /workspace inside the container.
    repo_dir = Path('/workspace')
    METRICS_FILE = 'training/metrics.json'


    def find_key(node, key):
        """Depth-first search for key; the nesting of DVC's JSON output varies by version."""
        if isinstance(node, dict):
            if key in node:
                return node[key]
            for child in node.values():
                found = find_key(child, key)
                if found is not None:
                    return found
        return None


    try:
        os.chdir(repo_dir)

        # Record the exit code of `dvc status` as a coarse pipeline health signal.
        result = subprocess.run(['dvc', 'status', '--json'], capture_output=True, text=True, timeout=30)
        pipeline_status.set(result.returncode)

        # Derive model staleness from the timestamp stored in the metrics file.
        result = subprocess.run(['dvc', 'metrics', 'show', '--json'], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            metric_data = find_key(json.loads(result.stdout), METRICS_FILE)
            if isinstance(metric_data, dict):
                # Some DVC versions wrap the values in a nested "data" object.
                metric_data = metric_data.get('data', metric_data)
                if 'timestamp' in metric_data:
                    last_train = datetime.fromisoformat(metric_data['timestamp'])
                    staleness = (datetime.now(last_train.tzinfo) - last_train).days
                    model_staleness.set(staleness)

        # Total on-disk size of the repository, in bytes.
        result = subprocess.run(['du', '-sb', str(repo_dir)], capture_output=True, text=True, timeout=30)
        data_size.set(int(result.stdout.split()[0]))

    except Exception as e:
        # Log and continue so that metrics gathered so far are still pushed.
        print(f"Monitor error: {e}", flush=True)

    finally:
        gateway = os.getenv('PROMETHEUS_PUSHGATEWAY', 'localhost:9091')
        push_to_gateway(gateway, job='dvc-monitor', registry=registry)

Why this order?

Prometheus alerting rules must define the alert conditions before Alertmanager has anything to route. The Alertmanager configuration routes alerts by label (component, severity) to different receivers: critical data-drift alerts reach PagerDuty after a 10-second group wait, while critical training alerts go to Slack and re-notify only every 2 hours. The MLflow Deployment's health checks keep the experiment-tracking backend available. The DVC monitor CronJob runs every 5 minutes and pushes staleness metrics to the Pushgateway, which Prometheus then scrapes: the schedule matters because a stale model is detected only after the next push and scrape.
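The label-based routing described here can be sketched as a small tree walk. This is a simplified model of how Alertmanager picks receivers (child routes are tried in order, the first match wins unless it sets `continue`, and the parent receiver is the fallback), not Alertmanager's actual code. The `match_route` function and the `ROUTES` dictionary are illustrative names; the receivers and labels are copied from the configuration above.

```python
# Route tree mirroring the alertmanager-ml-config ConfigMap.
ROUTES = {
    "receiver": "ml-on-call",
    "routes": [
        {"match": {"component": "data", "severity": "critical"}, "receiver": "ml-data-critical"},
        {"match": {"component": "inference", "severity": "critical"}, "receiver": "ml-inference-critical"},
        {"match": {"component": "training", "severity": "critical"}, "receiver": "ml-training-critical"},
    ],
}


def match_route(labels, route, inherited_receiver=None):
    """Return the receivers that would handle an alert with these labels."""
    receiver = route.get("receiver", inherited_receiver)
    # A route applies only if every one of its matchers equals the alert's label.
    if any(labels.get(k) != v for k, v in route.get("match", {}).items()):
        return []
    matched = []
    for child in route.get("routes", []):
        hits = match_route(labels, child, receiver)
        matched.extend(hits)
        # Without `continue: true`, the first matching child ends the search.
        if hits and not child.get("continue", False):
            break
    # If no child matched, the current route handles the alert itself.
    return matched or [receiver]


if __name__ == "__main__":
    print(match_route({"component": "data", "severity": "critical"}, ROUTES))
    print(match_route({"component": "inference", "severity": "warning"}, ROUTES))
```

One thing the sketch makes visible: the latency rule above is labelled `severity: warning`, so it falls through to the default `ml-on-call` receiver rather than `ml-inference-critical`.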

Wrong vs Right

Wrong way

yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-bad
data:
  alerts.yaml: |
    groups:
    - name: ml_alerts
      rules:
      - alert: HighLatency
        expr: inference_latency > 2
        labels:
          severity: warning
      - alert: DataDrift
        expr: drift_score > 0.15

Right way

yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-ml-rules
data:
  ml-alerts.yaml: |
    groups:
    - name: ml_pipeline_alerts
      interval: 30s
      rules:
      - alert: ModelInferenceLatenessHigh
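        # histogram_quantile needs per-bucket rates (the _bucket series), aggregated by le.
        expr: histogram_quantile(0.95, sum by (le) (rate(model_inference_duration_seconds_bucket[5m]))) > 2.0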
        for: 5m
        labels:
          severity: warning
          component: inference
        annotations:
          summary: "Model inference p95 latency > 2s"
          runbook: "https://wiki.internal/runbooks/inference-latency.md"
      - alert: DataDriftDetected
        expr: dvc_metric_data_drift_kolmogorov_smirnov > 0.15
        for: 10m
        labels:
          severity: critical
          component: data
        annotations:
          summary: "Data distribution shifted from training"
          runbook: "https://wiki.internal/runbooks/data-drift.md"

Tool vitals

Primary command

bash

kubectl apply -f monitoring-stack.yaml && mlflow server --backend-store-uri postgresql://... --default-artifact-root s3://...

Config file: monitoring-stack.yaml, prometheus-rules.yaml, alert-config.yaml

Verify

bash

kubectl get pods -n monitoring && curl http://prometheus:9090/api/v1/rules && mlflow ui

Integration notes

MLflow tracks experiments and models: expose metrics like mlflow_model_last_trained_timestamp and model_validation_accuracy to Prometheus via a small custom exporter that reads them from the MLflow tracking API. DVC keeps pipeline metrics in the repository and its S3 remote: Prometheus cannot query them there, so a job such as the CronJob above has to read them and push them. Kubernetes liveness probes keep pods alive but won't catch model degradation: add Prometheus scrape targets for your inference service and your training orchestrator (Airflow, Kubeflow). PagerDuty escalation policies should route ML-specific alerts to people who can judge whether an accuracy drop is expected or catastrophic, typically data scientists rather than SREs.
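As a rough illustration of what such an exporter has to produce, the sketch below renders a gauge in the Prometheus text exposition format using only the standard library, and mirrors the staleness expression from the alert rule. The helper names `render_gauge` and `is_stale` and the `model="churn"` label are hypothetical. A real exporter would more likely use the `prometheus_client` package and read the timestamp from MLflow instead of hard-coding it.

```python
def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, value, labels=None):
    """Render one gauge sample with its HELP and TYPE lines."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {float(value)}\n"
    )


def is_stale(now, last_trained, max_days=7):
    """Same arithmetic as the rule: (time() - timestamp) / 86400 > 7."""
    return (now - last_trained) / 86400 > max_days


if __name__ == "__main__":
    last_trained = 1700000000  # hypothetical Unix timestamp of the last training run
    print(render_gauge(
        "mlflow_model_last_trained_timestamp",
        "Unix time of the last training run",
        last_trained,
        {"model": "churn"},
    ))
```

Exporting the raw timestamp rather than a precomputed age keeps the staleness threshold in the alert rule, where it can be changed without redeploying the exporter.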

Migration path

If moving to a simpler observability stack: Datadog and Grafana Cloud offer pre-built ML dashboards (data drift, model latency, feature importance); they cost more but need no custom Prometheus setup. If moving away from Kubernetes: use AWS CloudWatch for ECS-based inference, or Google Cloud Monitoring for Vertex AI. If replacing DVC: switch to MLflow's built-in dataset tracking and model versioning (dataset tracking arrived in MLflow 2.4): you lose DVC's data lineage but gain simpler single-tool management.

Cost model

Prometheus + Alertmanager: free, self-hosted. PagerDuty: $10-30/user/month for plans with escalation policies (the on-call engineers themselves are the real cost). S3 for MLflow artifacts: ~$0.02/GB/month storage + ~$0.0004 per 1,000 GET requests, with writes costing more (costs spike with frequent model push/pull, ~$50-200/month for active teams). Datadog alternative: $15-50/host/month if you use it for ML observability (cheaper than hiring a second on-call person). Hidden cost: time spent writing custom monitoring scripts such as the DVC monitor above. Pre-built solutions like Grafana ML cost more but save weeks of engineering.

Common gotcha

Prometheus alert rules need both a 'for:' duration and sensible label grouping to be usable for paging. 'for:' is optional in the rule syntax, but if you omit 'for: 5m', the alert fires on the first rule evaluation where the condition is true: a single slow query triggers PagerDuty and wakes up on-call engineers for blips. Alertmanager's 'group_wait: 30s' batches alerts that share the group_by labels: with group_wait: 0s, each alert is sent immediately, spamming Slack 100+ times during a cascade failure. Finally, DVC metrics must be pushed to Prometheus via a CronJob or persistent sidecar. DVC has no native Prometheus exporter, so many teams skip this entirely and miss data drift until it corrupts the model.
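The effect of 'for:' is easiest to see in a small simulation. The sketch below models the inactive, pending, and firing states of a single alert across successive rule evaluations. It is a simplified model of Prometheus behaviour, and the `alert_states` function is an illustrative name rather than any Prometheus API.

```python
def alert_states(samples, threshold, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    samples holds the metric value seen at each evaluation, spaced
    interval_seconds apart. for_seconds=0 models a rule without 'for:'.
    """
    states = []
    active_since = None
    for i, value in enumerate(samples):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now  # condition just became true
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # One blip at the second evaluation, then a sustained breach.
    p95 = [1.0, 3.0, 1.0, 3.0, 3.0, 3.0, 3.0]
    print(alert_states(p95, threshold=2.0, for_seconds=0, interval_seconds=60))
    print(alert_states(p95, threshold=2.0, for_seconds=180, interval_seconds=60))
```

With `for_seconds=0` the blip fires immediately; with a 180-second hold it only reaches pending and is reset by the next healthy evaluation, while the sustained breach still fires.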

Team adoption

Day 1: Set up Prometheus + AlertManager YAML and route critical alerts to a dedicated #ml-critical Slack channel (not #general). Day 2: Write a runbook for each alert (one page: symptom → diagnosis command → fix command → rollback). Day 3: Do a fire drill: page on-call, run the runbook, measure MTTR (mean time to recovery). Week 2: Add DVC metrics export and model staleness checks. Week 3: Train the team on reading Prometheus dashboards and dismissing vs. resolving false alarms. Without this structure, on-call engineers will override alerts ('mute for 1 hour') instead of investigating, and you'll ship broken models because no one trusted the monitoring.

Experienced dev note

Tune the staleness alert to your retraining frequency: if you retrain weekly, keep 'for: 1m' on the model staleness alert so a stale model triggers almost immediately after 7 days; if you retrain daily, a stale-model alert adds little and can be skipped. Also, histogram_quantile(0.95, ...) catches tail latency (what users actually experience); an average can sit at 200 ms while 50 requests/sec take 10 s. Finally, use Alertmanager's 'continue' routing to send critical alerts to multiple routes (for example one for Slack and one for PagerDuty): without 'continue: true', the first matching route swallows the alert and later routes, including the one that pages PagerDuty, never see it.
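A short worked example shows how an average hides the tail. The latency values below are made up for illustration, and the nearest-rank percentile is a simple stand-in: histogram_quantile interpolates within histogram buckets, so its result on real data will differ somewhat.

```python
import math
import statistics


def nearest_rank_percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical window: 94% of requests are fast, 6% hit a slow path.
latencies = [0.1] * 940 + [3.0] * 60

mean = statistics.fmean(latencies)
p95 = nearest_rank_percentile(latencies, 0.95)

if __name__ == "__main__":
    print(f"mean={mean:.3f}s p95={p95:.1f}s")
```

Here the mean stays well under the 2-second threshold used in the latency rule, while the p95 is above it, so only the percentile-based rule would alert.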

Check your understanding

Why does the config set 'for: 10m' on the DataDriftDetected alert but 'for: 5m' on ModelInferenceLatenessHigh, and what happens if you remove both 'for:' clauses?

Answer hint

Data drift is gradual: telling a true distribution shift from a temporary anomaly takes time, so the rule waits 10 minutes. Latency problems show up faster, so 5 minutes of sustained high p95 is enough to conclude that the model or hardware is degraded. Without 'for:', an alert fires on the first rule evaluation where the condition is true, and you page engineers for every temporary network hiccup. With 'for: 10m', the drift alert fires only if drift persists for 10 minutes, which separates real data shifts from measurement noise.
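For context on the number the drift rule compares against 0.15, here is a minimal standard-library sketch of the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distribution functions of a reference sample and a current sample. The `ks_statistic` name and the sample values are illustrative. A production job would more likely use `scipy.stats.ks_2samp` and compute one score per feature.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute difference between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(ref) | set(cur):
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(f_ref - f_cur))
    return d


if __name__ == "__main__":
    training = [1, 2, 3, 4]
    live = [3, 4, 5, 6]  # hypothetical shifted feature values
    print(ks_statistic(training, live))
```

Because the statistic is computed on whatever window of live data the job samples, it fluctuates from run to run, which is one reason the rule waits for the score to stay above the threshold before firing.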
