Phase 4: Monitoring: MLflow + Prometheus + Grafana Stack
Why this matters
Without monitoring, you deploy a model that works in staging and silently degrades in production. Prometheus + Grafana catch latency spikes, prediction failures, and data drift. MLflow's tracking integrates model versions with live metrics so you know which model caused a degradation. Teams that skip this deploy hot-fixes at 3 AM; teams with monitoring sleep.
Explanation
In this stack, the model server (serve.py) exposes serving metrics (latency, predictions per second, inference errors) on a Prometheus endpoint. Prometheus scrapes that endpoint every 30 seconds in this setup; its built-in default scrape_interval is 1 minute, so the 30-second interval has to be set in prometheus.yml. Grafana queries Prometheus and visualizes trends. Together, they form the production observability layer: MLflow tracks which model version is deployed, the model server reports what the model is doing, Prometheus stores the time-series data, and Grafana makes it visible. The key insight is that monitoring must be tied to model versions: when latency rises, you need to know whether it's because you deployed Model v5 or because traffic spiked. Tagging each metric with the model name and version as labels makes that distinction possible.
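To make the "Grafana reads Prometheus" link concrete, here is a minimal sketch of a Grafana data source provisioning file. It assumes Prometheus is reachable from the Grafana container at the hostname prometheus on port 9090 and that the file is mounted under Grafana's provisioning/datasources directory; adjust both to your setup.

```yaml
# grafana/datasources.yml (sketch, not a definitive configuration)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend proxies queries to Prometheus
    url: http://prometheus:9090   # assumption: a service named "prometheus" on the same network
    isDefault: true
```

With a provisioned data source, dashboards can query Prometheus as soon as Grafana starts, without anyone clicking through the UI.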
Configuration
version: '3.8'
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      MLFLOW_TRACKING_URI: "sqlite:///mlflow.db"
      MLFLOW_BACKEND_STORE_URI: "sqlite:///mlflow.db"
    command: mlflow server --host 0.0.0.0 --port 5000
    networks:
      - monitoring
  model-server:
    image: python:3.11-slim
    ports:
      - "8000:8000"
      - "8001:8001"
    environment:
      MLFLOW_TRACKING_URI: "http://mlflow:5000"
    working_dir: /app
    volumes:
      - ./serve.py:/app/serve.py
      - ./requirements-serve.txt:/app/requirements-serve.txt
    command: bash -c "pip install -r requirements-serve.txt && python serve.py"
    networks:
      - monitoring
    depends_on:
      - mlflow
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "admin"
      GF_PATHS_PROVISIONING: "/etc/grafana/provisioning"
    volumes:
      - ./grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
      - ./grafana/dashboard.yml:/etc/grafana/provisioning/dashboards/dashboard.yml
      - ./grafana/dashboards:/etc/grafana/dashboards
      - grafana_storage:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus
volumes:
  prometheus_data:
  grafana_storage:
networks:
  monitoring:
    driver: bridge

Why this order?
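The model-server loads models from MLflow, so MLflow must start first. depends_on enforces that start order, but it only waits for the container to start, not for the service inside it to be ready. Prometheus does not need the model-server to be up at startup: it retries failed scrapes on every cycle, so the target turns healthy once the /metrics endpoint responds. Grafana lists Prometheus in depends_on, but its panels stay empty until Prometheus has completed at least one scrape cycle (30 seconds in this setup). All four services join the shared monitoring network, which lets them reach each other by hostname.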
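Because depends_on controls start order only and does not wait for a service to be ready, one option is to pair it with a healthcheck. The fragment below is a sketch under three assumptions: your Compose version supports the long depends_on form with condition, the MLflow image ships a python interpreter, and the tracking server answers on /health. Verify each against your versions before relying on it.

```yaml
# Fragment to merge into docker-compose.yml (sketch)
services:
  mlflow:
    healthcheck:
      # Assumption: the tracking server responds on /health once it is up;
      # urlopen raises on connection errors and HTTP errors, failing the check.
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')"]
      interval: 10s
      timeout: 5s
      retries: 5
  model-server:
    depends_on:
      mlflow:
        condition: service_healthy   # wait for the healthcheck, not just container start
```

The trade-off is a slower startup in exchange for a model-server that does not crash-loop while MLflow is still initializing.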
Wrong vs Right
Wrong:
version: '3.8'
services:
  model-server:
    image: python:3.11-slim
    ports:
      - "8000:8000"
    command: mlflow models serve -m models:/my-model
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"

Right:
Add explicit environment variables linking model-server to the MLflow tracking URI. Add networks: monitoring to all services so Prometheus can reach model-server by hostname, not just localhost. Set depends_on to ensure MLflow starts before model-server tries to log metrics. Use volumes for Prometheus data persistence; without them, metrics disappear on container restart.
Tool vitals
mlflow models serve with --env-manager conda and Prometheus client enabled
prometheus.yml for scrape config; docker-compose.yml for local stack
curl 'http://localhost:9090/api/v1/query?query=mlflow_inference_latency_seconds'
Integration notes
The model server in this stack exports three application-level Prometheus metrics: mlflow_inference_latency_seconds, mlflow_predictions_total, and mlflow_prediction_errors_total. These come from the Prometheus client in serve.py; mlflow models serve does not emit them on its own. DVC doesn't participate in runtime monitoring: it versions data and models at rest. Docker and Kubernetes add infrastructure metrics (CPU, memory) through their own exporters (cAdvisor, node-exporter), while this stack monitors application-level metrics (model performance). Wire both: Prometheus plus Node Exporter for infrastructure gives you the full picture. In Kubernetes, use Prometheus Operator (installed via Helm chart) to auto-discover model-server pods; in Docker Compose, list targets manually.
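Listing targets manually in Docker Compose could look like the following prometheus.yml sketch. The job names are illustrative, and the node-exporter service with its port 9100 is an assumption: that service is not part of the Compose file in this guide and would have to be added.

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 30s        # Prometheus falls back to 1m when this is omitted
scrape_configs:
  - job_name: model-server    # application-level metrics
    metrics_path: /metrics
    static_configs:
      - targets: ["model-server:8001"]
  - job_name: node-exporter   # infrastructure metrics; hypothetical extra service
    static_configs:
      - targets: ["node-exporter:9100"]
```

Targets use Compose service names as hostnames, which resolve only for containers on the same network.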
Migration path
If you move to a managed service: Datadog can scrape Prometheus/OpenMetrics endpoints through its Agent, and New Relic accepts both Prometheus remote_write and OTLP. Export MLflow metrics to CloudWatch by wrapping serve.py with a CloudWatch exporter. For on-prem migrations, Thanos or Cortex layer long-term storage on top of Prometheus. For any backend that accepts remote_write, you don't need to rewrite monitoring code: just point the Prometheus remote_write endpoint at it.
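Redirecting remote_write is a change to prometheus.yml only. The sketch below uses a placeholder URL, a placeholder username, and a hypothetical password file path; substitute the endpoint and credentials your backend documents.

```yaml
# Fragment of prometheus.yml (sketch; URL and credentials are placeholders)
remote_write:
  - url: "https://metrics.example.com/api/v1/write"
    basic_auth:
      username: "tenant-id-placeholder"
      password_file: /etc/prometheus/remote_write_password   # hypothetical mounted secret
```

Prometheus keeps scraping and storing locally as before; remote_write forwards a copy of each sample, so dashboards can be migrated gradually.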
Cost model
Prometheus, Grafana, and MLflow are free and open-source. Costs come from infrastructure: Docker Compose on a single machine is free, but production Kubernetes on EKS adds ~$40/month per node minimum. Storage: Prometheus retains 15 days of metrics by default; add Thanos or external storage for long-term retention. Grafana Cloud (managed) has a free tier up to 3 dashboards, then $10–100/month depending on active series volume. Hidden cost: cardinality explosion. If you label metrics with high-dimensional values (user_id, model_version, time_bucket), every distinct combination becomes its own time series, and Prometheus storage can reach 100GB+ in weeks. Use metric relabeling to drop high-cardinality labels.
Common gotcha
Prometheus scrapes http://model-server:8001/metrics inside the Docker network, so the model-server process must listen on port 8001 AND have the Prometheus client library running. Publishing 8001 under ports: only matters for reaching /metrics from the host; containers on the same network connect to it directly. mlflow models serve with --env-manager conda does NOT automatically expose /metrics: you must wrap it with a Prometheus client. The silent failure: Prometheus shows up=0 for your model-server job, but you don't notice because you're looking at Grafana dashboards that only show successful scrapes. Check the targets page (http://localhost:9090/targets from the host) manually to catch failed scrapes.
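Rather than relying on someone remembering to check the targets page, a failed scrape can raise an alert. This is a sketch of a Prometheus rule file; the file name, group name, and alert name are illustrative, and it assumes the scrape job is called model-server. The file must also be mounted into the container and listed under rule_files in prometheus.yml.

```yaml
# alerts/scrape-health.yml (sketch; hypothetical file name)
groups:
  - name: scrape-health
    rules:
      - alert: ModelServerScrapeFailing
        expr: up{job="model-server"} == 0   # up is 0 whenever the last scrape failed
        for: 2m                             # tolerate a brief restart before firing
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape model-server:8001/metrics"
```

The for: clause keeps a single missed scrape from paging anyone while still catching an endpoint that never came up.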
Team adoption
Day 1: Run docker-compose up; show the team Grafana at :3000 (login admin/admin). Make one dashboard showing inference latency and error rate: small wins build momentum.
Day 2: Create an alert rule in Prometheus (alert if latency > 500ms or error_rate > 1%) and test it by restarting the model-server. Alert fatigue kills adoption; start with 2–3 critical alerts, not 20.
Week 1: Add a runbook link in Grafana alerts (e.g., 'If latency is high, check model-server logs: kubectl logs -l app=model-server').
Week 2: Tie alerts to on-call rotation (PagerDuty, Opsgenie) so they matter. Teams ignore dashboards they don't act on; alerts they can't ignore drive adoption.
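The Day 2 alert thresholds could be expressed as Prometheus rules along these lines. This is a sketch under assumptions: mlflow_inference_latency_seconds is exported as a histogram (so a _bucket series exists), the two counters use the names given in this guide, and the runbook URL is a placeholder. The group and alert names are illustrative.

```yaml
# alerts/model-serving.yml (sketch; hypothetical file name)
groups:
  - name: model-serving
    rules:
      - alert: HighInferenceLatency
        # p95 latency over the last 5 minutes, in seconds (0.5 = 500 ms)
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(mlflow_inference_latency_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "p95 inference latency above 500 ms"
          runbook_url: "https://example.com/runbooks/model-server-latency"   # placeholder
      - alert: HighPredictionErrorRate
        # share of failed predictions over the last 5 minutes
        expr: |
          sum(rate(mlflow_prediction_errors_total[5m]))
            / sum(rate(mlflow_predictions_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prediction error rate above 1%"
```

Prometheus only evaluates these rules; routing them to PagerDuty or Opsgenie requires an Alertmanager (or Grafana alerting) configured separately.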
Experienced dev note
Use metric_relabel_configs, not relabel_configs, to drop high-cardinality labels: relabel_configs rewrites target labels before the scrape, while metric_relabel_configs runs on scraped samples before ingestion. Under the job in scrape_configs, add a metric_relabel_configs rule with action: labeldrop for labels like user_id or request_id that explode cardinality. Example: mlflow_inference_latency_seconds{user_id=~".+"} loses its user_id label before it is stored. Also: keep WAL compression on (--storage.tsdb.wal-compression, enabled by default in recent Prometheus releases) and set a hard size limit (--storage.tsdb.retention.size=50GB) to prevent the disk from filling silently. Most teams discover this after their Prometheus disk is full and they lose 2 weeks of metrics history.
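A minimal sketch of dropping those labels at ingestion time follows. It assumes a scrape job named model-server and that user_id and request_id are the offending label names; both are illustrative.

```yaml
# Fragment of prometheus.yml (sketch)
scrape_configs:
  - job_name: model-server
    static_configs:
      - targets: ["model-server:8001"]
    metric_relabel_configs:
      # labeldrop matches label NAMES against the regex and removes them
      # from every scraped sample before it is written to the TSDB.
      - action: labeldrop
        regex: user_id|request_id
```

One caution: after the labels are removed, the remaining labels must still identify each series uniquely, otherwise samples that differed only by user_id collide. Where that is a risk, the cleaner fix is to stop emitting the label in serve.py.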
Check your understanding
Why does the model-server service need both port 8000 (model serving) and port 8001 (Prometheus metrics) exposed? What happens if you only expose 8000 to external clients?
Answer hint
Port 8000 handles prediction requests; port 8001 serves Prometheus metrics. Prometheus scrapes :8001 inside the Docker network, so external clients never need to call it. Publishing only :8000 to the host does not break scraping, because containers on the shared network can reach any port the model-server process listens on; what breaks scraping is the process not listening on :8001, which shows up as up=0 in the Prometheus UI. The ports serve different audiences: :8000 is for the application, :8001 is for observability infrastructure.