Model performance metrics in production
Why this matters
Without production metrics tracking, you won't know when your model degrades in the real world until users complain. MLflow gives you a central place to log metrics during training and inference so you can spot data drift, performance drops, and regressions before they cause business damage.
Explanation
MLflow Tracking is the component that records metrics (accuracy, precision, loss, latency) and sends them to a central tracking server. When you train a model, MLflow creates a run and stores that run's metrics, parameters, and artifacts together. In production, you log inference-time metrics (prediction latency, error rates) to the same run or to a new run in the same experiment, which makes it easy to compare training performance against live performance. The MLflow UI lets you chart and compare metrics across runs, so you can treat a training run as the baseline and see when live performance dips below an acceptable threshold. The UI does not raise alerts on its own, so automated threshold checks belong in your own code or monitoring stack. You set this up with the Python client: point it at the tracking server, set the experiment name, log metrics, and tag runs for organization.
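As a minimal sketch of the inference-time logging described above, the following reduces a batch of request latencies and failures to a few scalars and logs them to a run tagged as production. The function names, metric names, the `env` tag, and the assumption that failed requests are counted separately from the latency samples are all illustrative choices, not part of the original configuration.

```python
import math


def summarize_requests(latencies_ms, error_count):
    """Reduce one batch of request timings to a few scalar metrics."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile: smallest sample with >= 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    # Assumption: failed requests are NOT included in latencies_ms.
    total_requests = len(ordered) + error_count
    return {
        "latency_mean_ms": sum(ordered) / len(ordered),
        "latency_p95_ms": ordered[p95_index],
        "error_rate": error_count / total_requests,
    }


def log_production_batch(latencies_ms, error_count, step, run_id=None):
    """Log one batch summary to MLflow (needs mlflow and a reachable tracking server)."""
    import mlflow  # imported here so summarize_requests works without mlflow installed

    mlflow.set_experiment("iris-classification")  # same experiment as the training run
    # run_id=None starts a new run; passing an existing run_id resumes that run.
    with mlflow.start_run(run_id=run_id):
        mlflow.set_tag("env", "production")
        mlflow.log_metrics(summarize_requests(latencies_ms, error_count), step=step)
```

Passing the same `run_id` with an increasing `step` keeps one production run per deployment, so the latency and error-rate curves can be charted next to the training run in the UI.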
Configuration
#!/usr/bin/env python3
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Point the client at the tracking server and choose the experiment first.
mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('iris-classification')

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name='rf-baseline'):
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Parameters first: they describe the configuration.
    mlflow.log_param('n_estimators', 50)
    mlflow.log_param('random_state', 42)

    # Metrics next: they are the outcomes of that configuration.
    mlflow.log_metric('accuracy', accuracy)
    mlflow.log_metric('precision', precision)
    mlflow.log_metric('recall', recall)
    mlflow.log_metric('f1_score', f1)

    # Model last: it is the largest artifact.
    mlflow.sklearn.log_model(model, 'model')

    print(f'Run ID: {mlflow.active_run().info.run_id}')
    print(f'Accuracy: {accuracy:.4f}')
# The with block ends the run on exit, so no explicit mlflow.end_run() is needed.
Why this order?
Initialize MLflow and set experiment name first so all subsequent logs go to the correct location. Log parameters before metrics because parameters define the configuration, then log metrics as outcomes. Log the model last because it's the largest artifact and gives you a checkpoint once training is complete.
Wrong vs Right
# Wrong: the metric only reaches stdout and a local file.
# (train_model() and evaluate() stand in for your own code.)
import mlflow
model = train_model()
acc = evaluate(model)
print(f'Accuracy: {acc}')
with open('metrics.txt', 'w') as f:
    f.write(f'accuracy={acc}\n')

# Right: the metric and the model are logged to a named run on the tracking server.
import mlflow
mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('model-training')
with mlflow.start_run(run_name='baseline'):
    model = train_model()
    acc = evaluate(model)
    mlflow.log_metric('accuracy', acc)
    mlflow.sklearn.log_model(model, 'model')
    print(f'Accuracy: {acc}')
Tool vitals
Start the UI: mlflow ui
Configuration: MLflow tracking URI (MLFLOW_TRACKING_URI environment variable or mlflow.set_tracking_uri() in code)
List runs: mlflow runs list --experiment-id <experiment-id>
Integration notes
MLflow can be paired with DVC for model versioning: MLflow tracks metrics during training, then DVC versions the final model artifacts in your data lake. In production, you deploy via the MLflow Model Registry (which stores the model URI, version, and metadata), then a serving layer such as BentoML or vLLM serves that model while your serving code logs inference metrics back to MLflow. This closes the loop: training metrics → model registry → production serving → production metrics, all visible in one system.
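The hand-off from a training run to the Model Registry can be sketched as follows. This assumes the model was logged under the artifact path `model`, as in the configuration example, and that the tracking server supports the registry. The registered name `iris-classifier` and both helper names are hypothetical.

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI that points at a model logged under a given run."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path}"


def register_run_model(run_id, registered_name="iris-classifier"):
    """Register the run's model (needs mlflow and a registry-capable tracking server)."""
    import mlflow  # imported here so run_model_uri works without mlflow installed

    # mlflow.register_model creates a new version under registered_name.
    return mlflow.register_model(run_model_uri(run_id), registered_name)
```

A serving layer can then load the model by its registry name and version rather than by run ID, which keeps deployment decoupled from individual training runs.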
Migration path
If you outgrow MLflow, you can export runs as JSON through the REST API (POST to /api/2.0/mlflow/runs/search on the tracking server) or use the Python SDK (mlflow.search_runs() returns a pandas DataFrame) to migrate run data to a data warehouse programmatically. That said, an MLflow 2.x server backed by a database is designed to handle thousands of runs, so migration is rarely needed.
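A REST-based export might look like the sketch below, which uses only the standard library. It assumes an unauthenticated tracking server and that you already know the experiment IDs; the helper names are hypothetical, and the request builder is kept separate so it can be inspected without a live server.

```python
import json
import urllib.request


def build_runs_search_request(tracking_uri, experiment_ids, max_results=1000, page_token=None):
    """Build (but do not send) a POST request for the runs/search REST endpoint."""
    body = {"experiment_ids": list(experiment_ids), "max_results": max_results}
    if page_token:
        body["page_token"] = page_token
    return urllib.request.Request(
        url=tracking_uri.rstrip("/") + "/api/2.0/mlflow/runs/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def export_runs(tracking_uri, experiment_ids):
    """Page through all matching runs and return them as a list of dicts (needs a live server)."""
    runs, token = [], None
    while True:
        request = build_runs_search_request(tracking_uri, experiment_ids, page_token=token)
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        runs.extend(payload.get("runs", []))
        token = payload.get("next_page_token")
        if not token:
            return runs
```

The returned list can be written out with `json.dump` and loaded into whatever warehouse you are migrating to.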
Cost model
MLflow itself is open source and free. Running mlflow ui locally or on a self-hosted server costs only compute and storage. Managed MLflow on Databricks is billed as part of Databricks platform usage, so check current Databricks pricing rather than assuming a fixed per-run rate. For teams, self-hosted MLflow on a t3.medium EC2 instance costs roughly $30/month at on-demand rates, which is typically enough for a small team's tracking workload.
Common gotcha
If you log without calling mlflow.start_run(), MLflow silently starts a run for you on the first log call and keeps it active until mlflow.end_run() is called or the process exits. Everything logged in that process lands in one run with no meaningful name, so you cannot tell training sessions apart in the UI. Always wrap your training code in a with mlflow.start_run(): block and give each run a meaningful name. Also, when no tracking URI is set, logs default to a local ./mlruns directory. If you run on multiple machines, set MLFLOW_TRACKING_URI to point at a shared server, or your production metrics will end up in a different store from your training metrics.
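One way to guard against the local-store fallback is to fail fast when the tracking URI is missing. The sketch below assumes your shared server is reached over HTTP(S); other URI schemes would need their own check. The function name and error messages are hypothetical.

```python
import os


def require_remote_tracking_uri(environ=None):
    """Return MLFLOW_TRACKING_URI, refusing to fall back to a local store."""
    environ = os.environ if environ is None else environ
    uri = environ.get("MLFLOW_TRACKING_URI", "").strip()
    if not uri:
        raise RuntimeError("MLFLOW_TRACKING_URI is not set; logs would go to a local directory")
    # Assumption: the shared tracking server is an HTTP(S) endpoint.
    if not uri.startswith(("http://", "https://")):
        raise RuntimeError(f"expected an http(s) tracking server, got {uri!r}")
    return uri
```

Calling this at the top of a training or serving script, before any logging, turns a silent misconfiguration into an immediate error.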
Team adoption
On day one, set a team standard: (1) always use mlflow.set_experiment() with a consistent naming scheme (e.g., 'team-project-v1'), (2) require a run_name that includes the author and a timestamp (f'{getpass.getuser()}-{datetime.now():%Y%m%d-%H%M%S}'), (3) enforce an env tag (training vs. production), (4) share the MLflow UI URL and give all team members read access to the tracking server. Add MLflow server startup to your onboarding docs and CI/CD pipeline initialization so no one accidentally logs to ./mlruns locally.
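The naming and tagging conventions above can be captured in two small helpers so nobody has to retype the format string. This is a sketch under the conventions listed in the text; the helper names and the set of allowed `env` values are assumptions.

```python
import getpass
from datetime import datetime


def make_run_name(user=None, now=None):
    """Build '<user>-<YYYYMMDD-HHMMSS>' using the current user and time by default."""
    user = user or getpass.getuser()
    now = now or datetime.now()
    return f"{user}-{now:%Y%m%d-%H%M%S}"


def standard_tags(env):
    """Return the team's required tags, rejecting unknown environments."""
    if env not in ("training", "production"):
        raise ValueError(f"env must be 'training' or 'production', got {env!r}")
    return {"env": env}
```

Both values plug directly into a run: `mlflow.start_run(run_name=make_run_name(), tags=standard_tags("training"))`.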
Experienced dev note
Keep mlflow.log_param() for simple scalars and use mlflow.log_dict() for complex hyperparameters (dicts, lists). Parameters in MLflow are searchable, but they are stored as strings, so logging a 500-line config dict as a parameter produces unreadable clutter and can exceed the parameter value length limit. If the full config already exists as a YAML file, save it with mlflow.log_artifact() instead. Also, tag runs with mlflow.set_tag('env', 'production') or mlflow.set_tag('model_type', 'random_forest'). Tags are the easiest way to filter the UI when you have thousands of runs.
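The split between scalar params and a full-config artifact can be sketched like this. The helper names and the artifact file name `config.yaml` are hypothetical, and `log_config` is meant to be called inside an active run.

```python
def split_config(config):
    """Separate top-level scalar settings (good params) from nested values."""
    scalars, nested = {}, {}
    for key, value in config.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            scalars[key] = value
        else:
            nested[key] = value
    return scalars, nested


def log_config(config, artifact_file="config.yaml"):
    """Log scalars as params and the whole config as an artifact (call inside an active run)."""
    import mlflow  # imported here so split_config works without mlflow installed

    scalars, _ = split_config(config)
    mlflow.log_params(scalars)              # searchable in the UI
    mlflow.log_dict(config, artifact_file)  # full config, stored as a YAML artifact
```

With this split, the runs table stays filterable on a handful of scalars while the complete configuration remains attached to the run.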
Check your understanding
Why would logging metrics only to a local metrics.txt file fail in a production environment where multiple services log predictions on different machines?
Answer hint
Each machine writes to its own local filesystem, so metrics are scattered and never aggregated in one place. MLflow's tracking server centralizes all metrics from all machines to a single database, making production comparisons possible.