Model performance metrics in production

What you will learn

Use MLflow to log, track, and compare model performance metrics across training runs and production deployments.

Why this matters

Without production metrics tracking, you won't know when your model degrades in the real world until users complain. MLflow gives you a central place to log metrics during training and inference so you can spot data drift, performance drops, and regressions before they cause business damage.

Skip if: you're running a one-off batch job or proof-of-concept that will never be compared with other runs; logging to stdout is fine there. For a simple project with a single model file and no versioning needs, a basic metrics.json file might suffice, but you'll outgrow it quickly. Once you have multiple models or versions, or need to compare performance over time, MLflow becomes essential.

Explanation

MLflow Tracking is the component that records metrics (accuracy, precision, loss, latency) and stores them in a tracking store, either a local directory or a centralized server. When you train a model, MLflow creates a run and stores all metrics, parameters, and artifacts under that run. In production, you log inference-time metrics (prediction latency, error rates) to a new run in the same experiment, or to the original run by resuming it with its run ID, which makes it easy to compare training performance against live performance. The MLflow UI lets you chart and compare metrics across runs, so you can pick a baseline run and see when a model's performance dips below an acceptable threshold. You configure this with a Python client that sets the tracking URI and experiment name, logs metrics, and tags runs for organization.
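As a rough illustration of the inference-time half of this, the sketch below reduces a batch of request latencies to a few summary metrics and logs them under a run tagged as production. The helper names (summarize_batch, log_production_batches), the metric names, the run name, and the env tag are assumptions made for this example; only mlflow.set_experiment, mlflow.start_run, and mlflow.log_metrics are MLflow calls.

```python
import math


def summarize_batch(latencies_ms, error_count):
    """Reduce one batch of requests to scalar metrics (hypothetical helper).

    Assumes every request, failed or not, contributed one latency sample.
    """
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: smallest sample with >= 95% of samples at or below it.
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "latency_ms_mean": sum(ordered) / len(ordered),
        "latency_ms_p95": ordered[p95_index],
        "error_rate": error_count / len(ordered),
    }


def log_production_batches(batches, experiment="iris-classification"):
    """Log one metric point per batch; `step` orders the points on the chart.

    `batches` is an iterable of (latencies_ms, error_count) pairs.
    """
    # Imported lazily so summarize_batch stays usable without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment)
    # The run name and the 'env' tag are assumed conventions, not MLflow defaults.
    with mlflow.start_run(run_name="rf-baseline-production", tags={"env": "production"}):
        for step, (latencies_ms, error_count) in enumerate(batches):
            mlflow.log_metrics(summarize_batch(latencies_ms, error_count), step=step)
```

Calling log_production_batches([([12.0, 15.5, 11.2, 40.1], 0), ([13.3, 14.8, 90.0, 12.1], 1)]) would record two points per metric, which the UI can then chart next to the training run's metrics.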

Configuration

python

#!/usr/bin/env python3
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('iris-classification')

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name='rf-baseline'):
    # Log the configuration first so it is recorded even if training fails.
    mlflow.log_param('n_estimators', 50)
    mlflow.log_param('random_state', 42)

    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    mlflow.log_metric('accuracy', accuracy)
    mlflow.log_metric('precision', precision)
    mlflow.log_metric('recall', recall)
    mlflow.log_metric('f1_score', f1)

    # Log the fitted model last, once training and evaluation have succeeded.
    mlflow.sklearn.log_model(model, 'model')

    print(f'Run ID: {mlflow.active_run().info.run_id}')
    print(f'Accuracy: {accuracy:.4f}')

# No mlflow.end_run() call is needed: the with block ends the run on exit.

Why this order?

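Initialize MLflow and set the experiment name first so that all subsequent logs go to the correct location. MLflow does not enforce an order between parameters and metrics within a run, but logging parameters first is a useful habit: they define the configuration, and recording them early keeps the run identifiable if training fails before any metrics exist. Log the model last because it is the largest artifact and is only worth storing once training and evaluation have completed.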

Wrong vs Right

Wrong way

python

import mlflow

model = train_model()
acc = evaluate(model)
print(f'Accuracy: {acc}')
with open('metrics.txt', 'w') as f:
    f.write(f'accuracy={acc}\n')

Right way

python

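import mlflow
import mlflow.sklearn

# train_model() and evaluate() stand in for your own training and evaluation code.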

mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('model-training')

with mlflow.start_run(run_name='baseline'):
    model = train_model()
    acc = evaluate(model)
    mlflow.log_metric('accuracy', acc)
    mlflow.sklearn.log_model(model, 'model')
    print(f'Accuracy: {acc}')

Tool vitals

Primary command

bash

mlflow ui

Config file

None. The tracking URI is set through the MLFLOW_TRACKING_URI environment variable or with mlflow.set_tracking_uri() in code.

Verify

bash

# The experiment ID is shown in the MLflow UI.
mlflow runs list --experiment-id <experiment-id>

Integration notes

MLflow integrates with DVC for model versioning: MLflow tracks metrics during training, then DVC versions the final model artifacts in your data lake. In production, you deploy via the MLflow Model Registry (which stores model URI and metadata), then BentoML or vLLM wraps that model for serving while logging inference metrics back to MLflow. This closes the loop: training metrics → model registry → production serving → production metrics all in one system.

Migration path

If you outgrow MLflow, you can export runs as JSON through the REST API (a POST to /api/2.0/mlflow/runs/search), export an experiment's runs to CSV with the CLI (mlflow experiments csv --experiment-id <experiment-id>), or use the Python SDK (mlflow.search_runs()) to migrate programmatically to a data warehouse. However, MLflow 2.x with a database-backed store generally copes with thousands of runs, so migration is rarely needed.
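A minimal sketch of a JSON export over the REST API, using only the standard library. It assumes an unauthenticated tracking server and that you already know the experiment IDs; the function names and the output path are hypothetical. The endpoint path and the experiment_ids, max_results, page_token, and next_page_token fields follow MLflow's runs/search REST API.

```python
import json
import urllib.request


def build_search_request(tracking_uri, experiment_ids, page_token=None, max_results=1000):
    """Build a POST request for the runs/search endpoint (no auth assumed)."""
    payload = {"experiment_ids": list(experiment_ids), "max_results": max_results}
    if page_token:
        payload["page_token"] = page_token
    return urllib.request.Request(
        url=tracking_uri.rstrip("/") + "/api/2.0/mlflow/runs/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def export_runs(tracking_uri, experiment_ids, out_path):
    """Page through every run in the given experiments and write them to one JSON file."""
    runs, token = [], None
    while True:
        request = build_search_request(tracking_uri, experiment_ids, token)
        with urllib.request.urlopen(request) as response:
            body = json.load(response)
        runs.extend(body.get("runs", []))
        # The server returns next_page_token only while more pages remain.
        token = body.get("next_page_token")
        if not token:
            break
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(runs, f, indent=2)
    return len(runs)
```

For example, export_runs('http://localhost:5000', ['1'], 'runs.json') would write all runs of experiment 1 to runs.json and return how many were exported.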

Cost model

MLflow itself is open source and free. Running mlflow ui locally or on a self-hosted server costs only compute. Managed MLflow on Databricks is billed as part of Databricks platform usage rather than as a standalone per-run fee, so check current Databricks pricing before budgeting. For teams, self-hosted MLflow on a t3.medium EC2 instance costs roughly $30/month at on-demand rates, plus storage for the backend database and artifacts, and is typically enough for a small team's tracking load.

Common gotcha

If you log without calling mlflow.start_run(), MLflow silently starts a run for you on the first log call and keeps it active until the process exits or mlflow.end_run() is called. In a notebook or other long-lived process, every training session then lands in that one run, under a name you did not choose, and you cannot tell the sessions apart in the UI. Always wrap your training code in with mlflow.start_run(): and give each run a meaningful name. Also, when no tracking URI is set, MLflow writes to a local ./mlruns directory. If you're running on multiple machines, set MLFLOW_TRACKING_URI to point to a shared server, or your production metrics will end up in a different store from your training metrics.
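One way to guard against the silent ./mlruns fallback is to fail fast when the environment variable is missing. The sketch below is an assumption about how a team might enforce this; resolve_tracking_uri is a hypothetical helper, not an MLflow function.

```python
import os


def resolve_tracking_uri(environ=None):
    """Return MLFLOW_TRACKING_URI, or raise instead of falling back to ./mlruns."""
    environ = os.environ if environ is None else environ
    uri = environ.get("MLFLOW_TRACKING_URI", "").strip()
    if not uri:
        raise RuntimeError(
            "MLFLOW_TRACKING_URI is not set; refusing to log to the local default store"
        )
    return uri
```

A training or serving script could then call mlflow.set_tracking_uri(resolve_tracking_uri()) before mlflow.start_run(), so a misconfigured machine stops with an error instead of logging locally.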

Team adoption

On day one, set a team standard: (1) always use mlflow.set_experiment() with a consistent naming scheme (e.g., 'team-project-v1'), (2) require a run_name that includes a timestamp and the author (e.g., f'{getpass.getuser()}-{datetime.now():%Y%m%d-%H%M%S}'), (3) enforce an env tag (training vs production), (4) share the MLflow UI URL and give all team members read access to the tracking server. Add MLflow server startup to your onboarding docs and CI/CD pipeline initialization so no one accidentally logs to ./mlruns locally.
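The naming and tagging rules above can be captured in two small helpers so that nobody has to remember the format. This is a sketch under the conventions just listed; build_run_name and build_run_tags are hypothetical names, and the set of allowed env values is an assumption.

```python
import getpass
from datetime import datetime

ALLOWED_ENVS = {"training", "production"}  # assumed team convention


def build_run_name(user=None, now=None):
    """Return '<author>-<YYYYmmdd-HHMMSS>', defaulting to the current user and time."""
    user = user or getpass.getuser()
    now = now or datetime.now()
    return f"{user}-{now:%Y%m%d-%H%M%S}"


def build_run_tags(env):
    """Return the tags every run must carry, rejecting unknown environments."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"env must be one of {sorted(ALLOWED_ENVS)}, got {env!r}")
    return {"env": env}
```

Both results can be passed straight to MLflow, for example mlflow.start_run(run_name=build_run_name(), tags=build_run_tags('training')).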

Experienced dev note

Keep mlflow.log_param() for simple scalars, and use mlflow.log_dict() for complex hyperparameters (dicts, lists). Parameters in MLflow are searchable, but they are stored as strings, so logging a 500-line config dict as a parameter produces an unreadable value that is useless for search. mlflow.log_dict() writes the structure as a JSON or YAML artifact; if the config already exists as a YAML file, mlflow.log_artifact() stores that file as-is. Also, tag runs with mlflow.set_tag('env', 'production') or mlflow.set_tag('model_type', 'random_forest'). Tags are the most practical way to filter the UI when you have thousands of runs.
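A small sketch of that split: scalar values go to parameters, everything else goes to a single artifact. split_config and log_config are hypothetical helpers, and treating only str, int, float, and bool as scalars is an assumption; mlflow.log_params and mlflow.log_dict are the MLflow calls.

```python
SCALAR_TYPES = (str, int, float, bool)  # assumed definition of a "simple scalar"


def split_config(config):
    """Separate scalar hyperparameters from nested values such as dicts and lists."""
    params = {k: v for k, v in config.items() if isinstance(v, SCALAR_TYPES)}
    nested = {k: v for k, v in config.items() if not isinstance(v, SCALAR_TYPES)}
    return params, nested


def log_config(config, artifact_file="config.json"):
    """Log scalars as searchable params and the rest as one JSON artifact."""
    # Imported lazily so split_config stays usable without MLflow installed.
    import mlflow

    params, nested = split_config(config)
    mlflow.log_params(params)
    if nested:
        # The file extension decides the format: .json here, .yaml also works.
        mlflow.log_dict(nested, artifact_file)
```

Inside a with mlflow.start_run(): block, log_config({'n_estimators': 50, 'class_weights': {'a': 1.0}}) would produce one searchable parameter and a config.json artifact holding the nested part.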

Check your understanding

Why would logging metrics only to a local metrics.txt file fail in a production environment where multiple services log predictions on different machines?

Answer hint

Each machine writes to its own local filesystem, so metrics are scattered and never aggregated in one place. MLflow's tracking server centralizes all metrics from all machines to a single database, making production comparisons possible.
