On-call for ML systems
Why this matters
ML systems degrade gradually: predictions can drift, data pipelines fail silently, and GPU memory leaks go unnoticed for weeks. Without structured on-call, you discover problems hours or days after users see them. Modern on-call for ML requires both traditional infrastructure monitoring (CPU, memory, latency) AND model-specific metrics (prediction drift, data drift, feature missing rates). Without this, you inherit production incidents with no runbook to follow.
Explanation
On-call for ML systems is the operational discipline of monitoring model health, detecting degradation before it impacts users, and having a documented response plan when things break. Unlike traditional software, ML failures are often silent: accuracy drops 2% over weeks, a data source changes format, or feature engineering code breaks on edge cases. You need both real-time metrics (latency, prediction distribution) and statistical metrics (data drift, prediction drift). The runbook defines who gets paged, what they check first, how to roll back, and when to escalate. This is typically owned by the team that built the model, using tools like MLflow for model versioning and metric tracking (drift scores are computed by a separate job and logged or exported from there), Prometheus/Grafana for infrastructure metrics and alerting, and an incident tracker (PagerDuty, Opsgenie) for paging and escalation. The on-call engineer doesn't need to fix the model: they need to detect, communicate, and either roll back or open a war room.
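To make "prediction drift" concrete, here is a minimal sketch of one common drift score, the two-sample Kolmogorov-Smirnov statistic, computed between a baseline window of predictions and a recent window. The function name `ks_statistic` and the sample scores are hypothetical, and the choice of KS as the drift measure is an assumption (it matches the KS-based alert used later in this section); a real pipeline would more likely call a statistics library.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(ref) | set(cur):
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


# Hypothetical prediction scores: a baseline window vs. the most recent window.
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
recent_scores = [0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0, 1.0]

drift = ks_statistic(baseline_scores, recent_scores)
print(f"KS statistic: {drift:.2f}")
```

A score of 0 means the two windows have identical distributions and 1 means they do not overlap at all. The score says the distribution changed, not that the model got worse.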
Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-on-call-runbook
  namespace: ml-platform
data:
  runbook.md: |
    # ML Model On-Call Runbook

    ## Incident: Prediction Latency > 500ms
    1. Check pod CPU/memory: `kubectl top pod -l app=inference-service`
    2. Check model serving logs: `kubectl logs -l app=inference-service --tail=100`
    3. Verify feature service is healthy: `curl http://feature-service:8000/health`
    4. If feature service down: page feature-platform team
    5. If model OOM: scale replicas or reduce batch size
    6. Rollback to last known good: `kubectl set image deployment/inference-service inference-service=registry.example.com/inference:v2.1.4`

    ## Incident: Prediction Drift Detected
    1. Check MLflow: run anomaly detection notebook at https://mlflow.example.com/notebook/drift-check
    2. Inspect raw predictions vs expected distribution from past 7 days
    3. Check data schema: did any upstream data sources change?
    4. If feature values out of range: escalate to data-engineering team
    5. If model predictions unexpected: trigger retraining in staging
    6. Do NOT immediately roll back: model may be correct, data may have legitimately shifted

    ## Incident: OOM Killed Pod
    1. Check memory request/limit: `kubectl describe pod <pod-name>`
    2. Check batch size in config: `kubectl get configmap inference-config -o yaml | grep batch_size`
    3. Reduce batch_size by 50%, apply: `kubectl set env deployment/inference-service BATCH_SIZE=8`
    4. Monitor memory for 10 minutes
    5. If stable, update deployment YAML and redeploy
    6. Create ticket to profile model memory usage

    ## Escalation
    - Latency + memory pressure: ML Platform On-Call
    - Data quality issues: Data Engineering On-Call
    - Predictions don't match business logic: ML Team Lead
    - Cannot contact anyone after 5 min: @ml-platform-channel in Slack
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-model-alerts
  namespace: ml-platform
spec:
  groups:
    - name: ml-inference
      interval: 30s
      rules:
        - alert: InferencePodCrashLooping
          # increase() counts restarts over the window; rate() would be per second.
          # The threshold of 2 restarts in 15m is illustrative: tune it per service.
          expr: increase(kube_pod_container_status_restarts_total{pod=~"inference-.*"}[15m]) > 2
          for: 5m
          annotations:
            summary: "Inference pod restarting {{ $labels.pod }}"
            runbook_url: "https://wiki.example.com/ml-oncall#pod-crashloop"
        - alert: PredictionLatencyHigh
          expr: histogram_quantile(0.99, rate(inference_request_duration_seconds_bucket[5m])) > 0.5
          for: 10m
          annotations:
            summary: "p99 latency > 500ms for model {{ $labels.model_name }}"
        - alert: FeatureServiceDown
          expr: up{job="feature-service"} == 0
          for: 1m
          annotations:
            summary: "Feature service unreachable"
            runbook_url: "https://wiki.example.com/ml-oncall#feature-service-down"
        - alert: PredictionDriftDetected
          expr: ml_prediction_drift_score{model="production"} > 0.15
          for: 1h
          annotations:
            summary: "Prediction drift > 0.15 for {{ $labels.model }}"
            dashboard: "https://mlflow.example.com/experiments/{{ $labels.experiment_id }}"
        - alert: InputDataMissing
          # rate() of a counter is missing values per second, not a fraction of requests.
          expr: rate(feature_missing_value_total{job="inference-service"}[5m]) > 0.01
          for: 5m
          annotations:
            summary: "{{ $value | humanize }} missing feature values per second"

Why this order?
The ConfigMap runbook comes first because on-call is about human decision-making during incidents: that's your north star. The PrometheusRule alerts come second because they define the CONDITIONS that trigger the runbook. Alerts without runbooks are noise; runbooks without alerts are wishful thinking.
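One way to keep alerts and runbooks paired is a small lint step that flags any alert rule without a runbook link. The sketch below is an illustration, not part of any tool: the helper name `alerts_missing_runbook` is hypothetical, and it assumes the rules have already been parsed from `spec.groups[].rules` into plain dictionaries (the YAML parsing step is left out).

```python
def alerts_missing_runbook(rules):
    """Return names of alert rules whose annotations lack a non-empty runbook_url."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        if not annotations.get("runbook_url"):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


# Rules as they might look after parsing the PrometheusRule manifest.
rules = [
    {"alert": "InferencePodCrashLooping",
     "annotations": {"runbook_url": "https://wiki.example.com/ml-oncall#pod-crashloop"}},
    {"alert": "PredictionLatencyHigh",
     "annotations": {"summary": "p99 latency > 500ms"}},
    {"alert": "FeatureServiceDown",
     "annotations": {"runbook_url": "https://wiki.example.com/ml-oncall#feature-service-down"}},
]

print(alerts_missing_runbook(rules))
```

Running a check like this in CI would fail the build whenever someone adds an alert that has no runbook section to follow.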
Wrong vs Right
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  ports:
    - port: 8000
  selector:
    app: inference-service
---
# Just monitoring CPU and memory, no ML-specific metrics
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bad-ml-alerts
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: HighCPU
          expr: node_cpu_usage > 0.8
          annotations:
            summary: "CPU high, restart pod manually if needed"

# Same service deployment, but WITH comprehensive runbook and model-aware alerts
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-runbook
data:
  on_call.md: |
    # Step 1: Verify which alert fired
    # Step 2: Follow corresponding section below
    # Step 3: Run verify commands before declaring "incident over"
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: model-health-alerts
spec:
  groups:
    - name: ml-model-health
      rules:
        - alert: InferencePodCrashLooping
          # increase() counts restarts over the window; the threshold is illustrative.
          expr: increase(kube_pod_container_status_restarts_total{pod=~"inference-.*"}[15m]) > 2
          for: 5m
          annotations:
            summary: "{{ $labels.pod }} restarting: check memory/logs"
        - alert: PredictionDriftDetected
          expr: ml_prediction_drift_ks_statistic > 0.1
          for: 1h
          annotations:
            summary: "KS stat {{ $value }} indicates a shift in the prediction distribution"
        - alert: FeatureMissingRate
          # rate() of a counter is missing values per second, not a fraction.
          expr: rate(model_feature_missing_total[5m]) > 0.001
          for: 5m
          annotations:
            summary: "{{ $value | humanize }} missing features per second: upstream data issue?"

Tool vitals
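Verify: kubectl get pods -l app=inference-service -o wide && curl http://inference-service:8000/health
Files: prometheus-rules.yaml, incident-runbook.md
Check alerts: kubectl get prometheusrules -n ml-platform && promtool check rules prometheus-rules.yaml (promtool validates plain rule files with a top-level groups: key, so run it on the contents of spec rather than on the full PrometheusRule manifest)
Integration notes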
This connects to MLflow (model versioning and metric tracking), Kubernetes (where your inference service runs and where Prometheus scrapes metrics), and your incident tracking tool (PagerDuty, Opsgenie, or Slack). MLflow does not publish drift scores to Prometheus out of the box, so you need a small job or exporter that computes ml_prediction_drift_ks_statistic and exposes it as a Prometheus metric. The runbook should link directly to your MLflow instance, Grafana dashboards, and status pages. If you use DVC for data versioning, the runbook should reference the DVC revision that trained the current model so you can quickly compare against a known-good commit if a rollback is needed.
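As a rough sketch of the exporter side, the snippet below renders a drift score in the Prometheus text exposition format, which is what a scrape endpoint returns. The helper `render_gauge` is hypothetical and hand-rolled for illustration; in practice a client library such as prometheus_client would normally handle the formatting and the HTTP endpoint. The metric name comes from the alert rule above, while the `model="production"` label and the value are assumptions.

```python
def render_gauge(name, value, labels=None, help_text=""):
    """Render one gauge sample in the Prometheus text exposition format."""
    def escape(text):
        # Label values escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_part = ""
    if labels:
        pairs = ",".join(
            f'{key}="{escape(str(val))}"' for key, val in sorted(labels.items())
        )
        label_part = "{" + pairs + "}"
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} gauge")
    lines.append(f"{name}{label_part} {float(value)}")
    return "\n".join(lines) + "\n"


# Hypothetical drift score produced by a scheduled drift job.
exposition = render_gauge(
    "ml_prediction_drift_ks_statistic",
    0.18,
    labels={"model": "production"},
    help_text="Two-sample KS statistic of recent vs. baseline predictions",
)
print(exposition, end="")
```

A gauge fits here because the drift score can go up or down between runs, unlike a counter.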
Migration path
If you outgrow Prometheus-based alerting (too much manual tuning, too many false positives), migrate to ML-specific monitoring such as Arize, Evidently AI, or WhyLabs. These tools are built around drift detection, whereas Prometheus is a generic time-series database that only alerts on whatever numbers you feed it. As a rule of thumb, Prometheus is sufficient for teams of fewer than 10 people running fewer than 5 models; use it as your learning ground before buying specialized tooling. To migrate: export Prometheus metrics to a data lake (e.g., S3 + Athena), rebuild alerts in the new platform, run both in parallel for 1 week to validate, then deprecate the Prometheus exporters.
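The parallel-run week is easier to judge with a simple comparison of which alerts each system fired. The sketch below assumes each firing has been reduced to an (alert name, hour bucket) pair; the function name `compare_alert_runs`, the bucketing scheme, and the sample data are all hypothetical.

```python
def compare_alert_runs(old_alerts, new_alerts):
    """Compare (alert_name, hour_bucket) pairs fired by two monitoring systems."""
    old_set, new_set = set(old_alerts), set(new_alerts)
    both = old_set & new_set
    union = old_set | new_set
    return {
        # Share of all firings that both systems agree on (1.0 if neither fired).
        "agreement": len(both) / len(union) if union else 1.0,
        "only_old": sorted(old_set - new_set),
        "only_new": sorted(new_set - old_set),
    }


# Hypothetical firings collected during the parallel-run week.
prometheus_firings = [
    ("PredictionDriftDetected", "2024-01-01T10"),
    ("InputDataMissing", "2024-01-01T12"),
    ("PredictionLatencyHigh", "2024-01-02T03"),
]
new_platform_firings = [
    ("PredictionDriftDetected", "2024-01-01T10"),
    ("InputDataMissing", "2024-01-01T12"),
    ("PredictionDriftDetected", "2024-01-03T08"),
]

report = compare_alert_runs(prometheus_firings, new_platform_firings)
print(report)
```

Each entry in `only_old` or `only_new` is worth a manual look before cutover: it is either a missed incident on one side or a false positive on the other.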
Cost model
The software is free with Prometheus + Grafana. If you self-host on K8s, the cost is cluster compute, which is usually small (roughly 5-10% CPU for Prometheus/Alertmanager, depending on metric cardinality and retention). Managed services charge differently: Datadog bills per custom metric, and model-specific metrics like drift are custom, so budget around $500–2000/month for 3–5 models; New Relic lands in a similar range, around $1000/month for model-level insights. PagerDuty and Opsgenie charge per user per month (roughly $5–$29 per user depending on vendor and plan, with limited free tiers). Vendor pricing changes often, so treat these figures as orders of magnitude. The hidden cost is people: someone must be on-call 24/7, and that is a $200k–400k salary cost, not a tool cost. Optimize your alert tuning AGGRESSIVELY or on-call becomes a burnout factory.
Common gotcha
Alerts fire when models are CORRECT but data changes. A 5% accuracy drop sounds bad until you check and find that user behavior shifted completely: the model is right, the world changed. The dangerous pattern is automatic rollback on drift alerts: you'll roll back to an older model that's now WORSE on new data. ALWAYS require a human to verify that the alert correlates with actual user impact before touching production. Also, a drift alert with a 1-hour "for" duration can catch problems too late; set the duration to match your SLA (if users care about accuracy every 10 minutes, a 1-hour window is useless). Finally, test your runbook: if the person on-call has never seen the dashboard or can't SSH into a pod, the runbook is theater.
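The "verify impact before acting" rule can be written down as a triage helper that only suggests a next step and never rolls anything back itself. This is a sketch under stated assumptions: the function name `triage_drift_alert`, its inputs, and the wording of the suggestions are hypothetical, and `accuracy_delta` assumes you have some recent labeled data to compare against.

```python
def triage_drift_alert(drift_score, drift_threshold, accuracy_delta, user_impact_confirmed):
    """Suggest a next step for a drift alert; it never performs a rollback itself.

    accuracy_delta is the recent change in accuracy (positive = improved),
    or None when no fresh labels are available yet.
    """
    if drift_score <= drift_threshold:
        return "no action: drift below threshold"
    if accuracy_delta is None:
        return "investigate: no fresh labels yet, check upstream data and schema"
    if accuracy_delta >= 0:
        return "investigate data shift: model is holding up, do not roll back"
    if user_impact_confirmed:
        return "escalate: confirmed user impact, a human decides on rollback or retraining"
    return "investigate: accuracy dropped, confirm user impact before touching production"


# Drift fired, but accuracy improved slightly: the suggestion is to investigate, not roll back.
print(triage_drift_alert(0.18, 0.15, accuracy_delta=0.012, user_impact_confirmed=False))
```

Even in the worst branch the helper escalates to a person, which keeps the rollback decision with a human as the text above requires.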
Team adoption
Day 1: Write the runbook as a Markdown file (not in a tool). Day 2: Post it in your on-call Slack channel and have the team read it. Day 3: Have the person going on-call TOMORROW walk through the runbook with you for 30 minutes: they'll find 5 broken links. Day 4: Update those links and run the verify commands together. Day 1 of actual on-call: Whoever gets paged runs the runbook DURING the incident (not after), which forces them to update it again. Repeat this cycle every rotation. The runbook is a living document: it stays useful only if on-call engineers are forced to use it and fix it immediately.
Experienced dev note
The pattern that pays dividends: separate ALERT DEFINITION (what condition triggers) from RUNBOOK ACTION (what to do). New teams bake both into the alert message ('CPU high, restart deployment'), then can't reuse runbooks across services. Instead, define alerts by symptom (latency, availability, drift), keep runbooks generic ('Is upstream healthy? Is this expected? Can we roll back?'), and let the human operator map symptom→root cause→action. Also: ALWAYS test your runbook. Run the verify commands yourself before going on-call. A runbook that assumes SSH access you don't have, or links to a dashboard that no longer exists, is worse than no runbook because it wastes time during the incident and demoralizes the on-call engineer.
Check your understanding
You're on-call and receive an alert: 'PredictionDriftDetected: KS statistic 0.18'. You check MLflow and see that accuracy on the holdout test set actually improved by 1.2% in the past 6 hours. Should you roll back the model? Why or why not?
Answer hint
No: drift is a CHANGE in the prediction distribution, not necessarily degradation. If accuracy improved, the model is responding correctly to data that shifted. Rolling back would move you backward. The correct action: investigate WHY the data shifted (new user segment? data pipeline change?), update monitoring thresholds if this is expected, and if the data shift is BAD (e.g., a pipeline bug), fix the pipeline, not the model. This is why runbooks must say 'verify impact before acting,' not 'roll back on drift.'