How to use Prometheus for ML monitoring
Quick answer

Use Prometheus to monitor ML models by instrumenting your model serving code with Prometheus client libraries to expose metrics endpoints. Then configure the Prometheus server to scrape these metrics, enabling real-time tracking of model performance, resource usage, and alerting.

Prerequisites

- Python 3.8+
- Prometheus installed (server and client libraries)
- Basic knowledge of ML model serving
- pip install prometheus-client
- Access to Grafana (optional, for visualization)
Setup Prometheus and client library
Install the prometheus-client Python package to expose metrics from your ML model server. Also, install and run the Prometheus server to scrape these metrics.
Define scrape targets in the Prometheus server's configuration file (prometheus.yml) so it knows which endpoints to poll.
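A minimal scrape configuration might look like the following sketch; the job name and target are illustrative and should be adjusted to match your deployment:

```yaml
# prometheus.yml -- minimal example; job name and target are placeholders
global:
  scrape_interval: 15s              # how often Prometheus scrapes targets

scrape_configs:
  - job_name: 'ml-model-server'     # hypothetical job name
    static_configs:
      - targets: ['localhost:8000'] # host:port of the /metrics endpoint
```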
```shell
pip install prometheus-client
```

output

```
Collecting prometheus-client
  Downloading prometheus_client-0.17.0-py3-none-any.whl (58 kB)
Installing collected packages: prometheus-client
Successfully installed prometheus-client-0.17.0
```
Step by step: instrument ML model server
Use prometheus_client to create metrics like counters, gauges, and histograms in your ML serving code. Expose a /metrics HTTP endpoint for Prometheus to scrape.
```python
from prometheus_client import start_http_server, Summary, Counter
import random
import time

# Create a metric to track prediction latency
PREDICTION_LATENCY = Summary('ml_prediction_latency_seconds', 'Time spent processing prediction')
# Counter for total predictions
PREDICTIONS_TOTAL = Counter('ml_predictions_total', 'Total number of predictions made')

@PREDICTION_LATENCY.time()
def process_prediction():
    # Simulate prediction processing time
    time.sleep(random.uniform(0.1, 0.5))
    PREDICTIONS_TOTAL.inc()

if __name__ == '__main__':
    # Start Prometheus metrics server on port 8000
    start_http_server(8000)
    print('Prometheus metrics available at http://localhost:8000/metrics')
    while True:
        process_prediction()
```

output

```
Prometheus metrics available at http://localhost:8000/metrics
# The /metrics endpoint then serves data like:
# ml_prediction_latency_seconds_count 10
# ml_predictions_total 10
```
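A Summary records only a count and a sum; if you want latency quantiles computed in Prometheus itself, a Histogram with explicit buckets is the usual alternative. Here is a minimal sketch; the metric name, bucket boundaries, and the model_version label are illustrative choices, not fixed conventions:

```python
from prometheus_client import Histogram, generate_latest

# Buckets should bracket your expected latency range.
# The 'model_version' label is illustrative; keep label values low-cardinality.
LATENCY_HIST = Histogram(
    'ml_prediction_latency_hist_seconds',
    'Prediction latency in seconds',
    ['model_version'],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

# Record one observation for a specific label value
LATENCY_HIST.labels(model_version='v1').observe(0.2)

# generate_latest renders the current exposition text, as served at /metrics
exposition = generate_latest().decode()
```

With bucketed data in place, a p95 can then be computed in PromQL with `histogram_quantile(0.95, rate(ml_prediction_latency_hist_seconds_bucket[5m]))`.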
Common variations
- Use async frameworks like FastAPI with prometheus_client middleware for non-blocking metrics.
- Integrate with Grafana for dashboards visualizing ML metrics.
- Configure alerting rules in Prometheus to notify on anomalies like high latency or error rates.
- Use exporters or sidecars if your model server language lacks native Prometheus clients.
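As a sketch of the alerting-rule idea, a rule file flagging sustained high average latency, derived from the ml_prediction_latency_seconds Summary above, might look like this (the rule name, threshold, and durations are placeholders to tune against your own SLOs):

```yaml
# alert_rules.yml -- illustrative rule; threshold and durations are examples
groups:
  - name: ml-model-alerts
    rules:
      - alert: HighPredictionLatency
        # average latency over 5m, from the Summary's sum/count series
        expr: >
          rate(ml_prediction_latency_seconds_sum[5m])
            / rate(ml_prediction_latency_seconds_count[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average prediction latency above 500 ms for 10 minutes"
```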
Troubleshooting
- If the metrics endpoint is not reachable, check firewall and port configuration.
- If Prometheus shows no data, verify scrape configs and endpoint URL.
- High cardinality metrics can cause performance issues; avoid labels with high variability.
- Log exceptions raised during metric registration to debug errors; prometheus_client raises ValueError when a metric name is registered twice.
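To check an endpoint independently of Prometheus, you can fetch it directly and inspect the exposition text. A quick stdlib sketch, assuming the server from the example above is embedded here on the arbitrary port 8001:

```python
from urllib.request import urlopen
from prometheus_client import start_http_server

# Expose the default registry on an arbitrary free port
start_http_server(8001)

# Fetch the exposition text exactly as Prometheus would
body = urlopen('http://localhost:8001/metrics', timeout=5).read().decode()
print(body.splitlines()[0])  # first line is a "# HELP ..." comment
```

If this request fails or returns something other than plain-text metric lines, Prometheus scrapes will fail too, so fix the endpoint before debugging scrape configs.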
Key Takeaways
- Instrument ML serving code with Prometheus client libraries to expose metrics endpoints.
- Configure Prometheus server to scrape these endpoints regularly for real-time monitoring.
- Use Grafana dashboards and Prometheus alerting rules to visualize and get notified on ML model health.
- Avoid high cardinality labels in metrics to maintain Prometheus performance.
- Test metrics endpoints independently to ensure Prometheus can scrape them successfully.