Tool Intermediate medium · 8 min integration

Phase 4: Monitoring: MLflow + Prometheus + Grafana Stack

What you will learn

Set up production monitoring for model serving with MLflow metrics, Prometheus scraping, and Grafana dashboards to catch model drift and inference failures before users do.

Why this matters

Without monitoring, you deploy a model that works in staging and silently degrades in production. Prometheus + Grafana catch latency spikes, prediction failures, and data drift. MLflow's tracking integrates model versions with live metrics so you know which model caused a degradation. Teams that skip this deploy hot-fixes at 3 AM; teams with monitoring sleep.

Skip if: Your model runs once per week in a batch job with no time-sensitive SLAs and failures don't cost money; basic log aggregation (CloudWatch, ELK) is sufficient. You can also skip self-hosted Prometheus if you already use a managed monitoring service (Datadog, New Relic) with Kubernetes support: its agent can collect the metrics endpoint for you, usually after a small amount of configuration.

Explanation

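In this stack, the model server exposes serving metrics (latency, predictions per second, inference errors) on a Prometheus /metrics endpoint. mlflow models serve does not expose that endpoint on its own (see Common gotcha), so serve.py has to instrument the model with a Prometheus client library. Prometheus scrapes the endpoint at the configured scrape_interval; the built-in default is 1 minute, and this guide assumes 30 seconds. Grafana queries Prometheus and visualizes trends. Together, they form the production observability layer: MLflow tells you which model version is deployed, the instrumented server reports what that model is doing, Prometheus stores the time-series data, and Grafana makes it visible. The key insight is that monitoring must be tied to model versions: when latency rises, you need to know whether it's because you deployed Model v5 or because traffic spiked. Attach the model name and version as labels when you record each metric so every series can be traced back to a registered MLflow model.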

Configuration

yaml

version: '3.8'
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      MLFLOW_TRACKING_URI: "sqlite:///mlflow.db"
      MLFLOW_BACKEND_STORE_URI: "sqlite:///mlflow.db"
    command: mlflow server --host 0.0.0.0 --port 5000
    networks:
      - monitoring

  model-server:
    image: python:3.11-slim
    ports:
      - "8000:8000"
      - "8001:8001"
    environment:
      MLFLOW_TRACKING_URI: "http://mlflow:5000"
    working_dir: /app
    volumes:
      - ./serve.py:/app/serve.py
      - ./requirements-serve.txt:/app/requirements-serve.txt
    command: bash -c "pip install -r requirements-serve.txt && python serve.py"
    networks:
      - monitoring
    depends_on:
      - mlflow

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "admin"
      GF_PATHS_PROVISIONING: "/etc/grafana/provisioning"
    volumes:
      - ./grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
      - ./grafana/dashboard.yml:/etc/grafana/provisioning/dashboards/dashboard.yml
      - ./grafana/dashboards:/etc/grafana/dashboards
      - grafana_storage:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

volumes:
  prometheus_data:
  grafana_storage:

networks:
  monitoring:
    driver: bridge
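
The Compose file mounts ./grafana/datasources.yml and ./grafana/dashboard.yml without showing them. A minimal sketch of both follows, assuming Grafana's file-based provisioning format; the data source name and provider name are illustrative choices, not values from this guide. The two YAML documents belong in separate files, as the comments indicate.

```yaml
# ./grafana/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus               # illustrative name
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090    # Compose service name on the monitoring network
    isDefault: true
---
# ./grafana/dashboard.yml
apiVersion: 1
providers:
  - name: default                  # illustrative name
    type: file
    options:
      path: /etc/grafana/dashboards   # matches the dashboards volume mount
```

With these in place, any dashboard JSON saved under ./grafana/dashboards should be loaded when Grafana starts.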

Why this order?

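MLflow starts first because model-server talks to it through MLFLOW_TRACKING_URI. depends_on only controls start order: Compose starts mlflow before model-server and prometheus before grafana, but it does not wait for a service to be ready, so serve.py should retry if MLflow is not accepting connections yet. Prometheus has no hard dependency on model-server; it reports the target as down until /metrics responds. Grafana starts fine on its own too, but dashboards stay empty until Prometheus has completed at least one scrape cycle (30 seconds in this setup). All services must share a network: the monitoring network lets them reach each other by service name.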
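
The Compose file mounts ./prometheus.yml, but the scrape configuration itself is not shown. A minimal sketch, assuming serve.py serves /metrics on port 8001 and that a 30-second interval is wanted; the job names are assumptions.

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 30s        # Prometheus's built-in default is 1m
  evaluation_interval: 30s    # how often alerting/recording rules are evaluated

scrape_configs:
  # Application-level metrics from the model server.
  - job_name: "model-server"
    metrics_path: /metrics
    static_configs:
      - targets: ["model-server:8001"]   # Compose service name + metrics port

  # Prometheus scraping itself, useful as a sanity check.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```

The target uses the service name rather than localhost because, inside the Prometheus container, localhost refers to Prometheus itself.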

Wrong vs Right

Wrong way

yaml

version: '3.8'
services:
  model-server:
    image: python:3.11-slim
    ports:
      - "8000:8000"
    command: mlflow models serve -m models:/my-model
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"

Right way

Start from the Compose file in the Configuration section. Set explicit environment variables that point model-server at the MLflow tracking URI. Add networks: monitoring to every service so Prometheus can reach model-server by hostname, not just localhost. Set depends_on so MLflow starts before model-server tries to contact it. Mount a named volume for Prometheus data; without it, metrics disappear when the container is removed or recreated.

Tool vitals

Primary command

bash

mlflow models serve with --env-manager conda and Prometheus client enabled

Config file

prometheus.yml for scrape config; docker-compose.yml for local stack

Verify

bash

curl 'http://localhost:9090/api/v1/query?query=mlflow_inference_latency_seconds'

Integration notes

The metrics this stack relies on (mlflow_inference_latency_seconds, mlflow_predictions_total, mlflow_prediction_errors_total) come from the Prometheus instrumentation in serve.py; mlflow models serve does not emit them by itself. DVC doesn't participate in runtime monitoring: it versions data and models at rest. Docker and Kubernetes expose infrastructure metrics (CPU, memory) through their own exporters (cAdvisor, node-exporter). This stack monitors application-level metrics (model performance). Wire both: application metrics plus Node Exporter for infrastructure give you the full picture. In Kubernetes, use the Prometheus Operator (installable via Helm chart) to auto-discover model-server pods; in Docker Compose, list targets manually.
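
Listing infrastructure targets manually in Docker Compose might look like the fragment below. It is a sketch that assumes node-exporter and cAdvisor have been added as Compose services named node-exporter and cadvisor on the monitoring network; neither service appears in the Compose file above.

```yaml
# prometheus.yml fragment (sketch): infrastructure jobs alongside model-server
scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]   # node-exporter's default port

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]        # cAdvisor's default port
```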

Migration path

If you move to a managed service: the Datadog Agent can scrape Prometheus/OpenMetrics endpoints, and New Relic accepts Prometheus remote_write as well as OTLP. To export metrics to CloudWatch, wrap serve.py with a CloudWatch exporter. For on-prem migrations, Thanos or Cortex layer long-term storage on top of Prometheus. Where the destination accepts remote_write, you don't need to rewrite monitoring code: just point Prometheus's remote_write at the new endpoint.
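
Redirecting remote_write is a configuration change only. A hedged sketch: the URL is a placeholder rather than a real endpoint, and the keep rule assumes the serving metrics share the mlflow_ prefix used in this guide.

```yaml
# prometheus.yml fragment (sketch)
remote_write:
  - url: "https://long-term-storage.example.com/api/v1/write"   # placeholder URL
    # Forward only the model-serving series to limit remote storage volume.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "mlflow_.*"
        action: keep
```

The exact path and authentication settings depend on the receiving system, so check its documentation before copying this.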

Cost model

Prometheus, Grafana, and MLflow are free and open source. Costs come from infrastructure: Docker Compose on a single machine you already own adds nothing, but production Kubernetes on EKS adds roughly $40/month per node at minimum, plus the per-cluster control plane fee. Storage: Prometheus keeps 15 days of metrics by default; add Thanos or external storage for long-term retention. Grafana Cloud (managed) has a free tier with usage caps, then roughly $10–100/month depending on active series volume; check current pricing before budgeting. Hidden cost: cardinality explosion. If you label metrics with high-dimensional values (one time series per user_id, model_version, time_bucket combination), the Prometheus TSDB can consume 100GB+ of disk in weeks. Use metric relabeling to drop high-cardinality labels.

Common gotcha

Prometheus scrapes http://model-server:8001/metrics inside the Docker network, so serve.py must listen on port 8001 (on 0.0.0.0, not just 127.0.0.1) AND serve metrics through a Prometheus client library. mlflow models serve with --env-manager conda does NOT automatically expose /metrics: you must wrap it with a Prometheus client. The silent failure: Prometheus shows up=0 for your model-server job, but you don't notice because you're looking at Grafana dashboards that only plot data from successful scrapes. Check http://localhost:9090/targets manually to catch failed scrapes.

Team adoption

Day 1: Run docker-compose up; show the team Grafana at :3000 (login admin/admin). Make one dashboard showing inference latency and error rate: small wins build momentum.

Day 2: Create an alert rule in Prometheus (alert if latency > 500ms or error_rate > 1%) and test it by restarting the model-server. Alert fatigue kills adoption; start with 2–3 critical alerts, not 20.

Week 1: Add a runbook link in Grafana alerts (e.g., 'If latency is high, check model-server logs: kubectl logs -l app=model-server').

Week 2: Tie alerts to on-call rotation (PagerDuty, Opsgenie) so they matter.

Teams ignore dashboards they don't act on; alerts they can't ignore drive adoption.
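
The Day 2 thresholds could be written as a Prometheus rules file along these lines. This is a sketch under assumptions: it treats mlflow_inference_latency_seconds as a histogram (so a _bucket series exists), assumes the scrape job is named model-server, and uses a placeholder runbook URL. The file name alerts.yml and the alert names are hypothetical.

```yaml
# alerts.yml (sketch)
groups:
  - name: model-server-alerts
    rules:
      # p95 latency above 500 ms for 5 minutes.
      - alert: HighInferenceLatency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(mlflow_inference_latency_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p95 inference latency above 500 ms"
          runbook_url: "https://example.com/runbooks/model-server-latency"  # placeholder

      # More than 1% of predictions failing over 5 minutes.
      - alert: HighPredictionErrorRate
        expr: |
          sum(rate(mlflow_prediction_errors_total[5m]))
            / sum(rate(mlflow_predictions_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page

      # Fires when the scrape target is unreachable, e.g. during a restart test.
      - alert: ModelServerDown
        expr: up{job="model-server"} == 0
        for: 1m
        labels:
          severity: page
```

To load it, list the file under rule_files in prometheus.yml and mount it into the Prometheus container. Prometheus only evaluates the rules; routing notifications to PagerDuty or Opsgenie additionally requires Alertmanager.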

Experienced dev note

Drop high-cardinality labels before they are stored. In prometheus.yml, metric_relabel_configs (under each scrape_configs job) runs on scraped samples before ingestion; relabel_configs only rewrites target labels before the scrape, so it cannot remove labels such as user_id or request_id that the application attaches. Example: use a labeldrop action whose regex matches user_id so the label is stripped from mlflow_inference_latency_seconds and every other scraped series. Also: keep WAL compression on (--storage.tsdb.wal-compression, enabled by default in recent Prometheus releases) and set a hard size cap (--storage.tsdb.retention.size=50GB) so the disk cannot fill silently. Most teams discover this after their Prometheus disk is full and they lose 2 weeks of metrics history.
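
A sketch of the label drop described above, assuming the scrape job is named model-server and the offending labels are user_id and request_id.

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: "model-server"
    static_configs:
      - targets: ["model-server:8001"]
    metric_relabel_configs:
      # labeldrop matches label NAMES against the regex and removes them
      # from every series scraped by this job.
      - action: labeldrop
        regex: "user_id|request_id"
```

One caveat: once a label is dropped, series that differed only by that label become identical, and Prometheus requires scraped series to stay uniquely labeled. If that applies to your metrics, the cleaner fix is to stop attaching the label in serve.py.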

Check your understanding

Why does the model-server service need both port 8000 (model serving) and port 8001 (Prometheus metrics) exposed? What happens if you only expose 8000 to external clients?

Answer hint

Port 8000 handles prediction requests; port 8001 serves Prometheus metrics. Prometheus scrapes :8001 over the monitoring network by service name, so external clients never need to call it, and that scrape works whether or not 8001 is published under ports:. Publishing 8001 to the host only lets you curl /metrics from your machine for debugging. Scrapes fail (you'll see up=0 in the Prometheus UI) when serve.py is not listening on 8001 or the services do not share a network. The ports serve different audiences: :8000 is for the application, :8001 is for observability infrastructure.
