Phase 1: experiment tracking
Why this matters
Without centralized experiment tracking, ML teams lose the ability to compare model performance across runs, reproduce results, or trace which hyperparameters produced the best model. Ad-hoc logging (print statements, CSV files, Slack messages) holds up for only two or three experiments before results become impossible to compare.
Explanation
MLflow Tracking is a lightweight experiment logger that captures metrics, parameters, and artifacts (models, plots) from each run and stores them in a central backend. The workflow is: (1) start a local or remote tracking server, (2) configure your training script to log to MLflow, (3) query runs via the CLI or UI to compare results. In MLflow 2.x the configuration model is simple: you define a backend store (where run metadata lives) and an artifact store (where large files live), then point your client at the tracking URI. The server runs as a standalone process listening on a port (default 5000), and training scripts talk to it over HTTP. By default these logging calls are synchronous: the client retries failed requests and then raises an exception if the server stays unreachable, so a down server will interrupt training unless you handle the error. This still separates concerns: experiments run as independent processes while the server stores and indexes their results centrally.
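Step (2) of the workflow can be sketched as follows. This is a minimal illustration, not a reference implementation: `baseline-v1` is the experiment name used in the configuration script in this section, while `flatten_params`, `log_training_run`, and the sample values in the usage note are hypothetical. It assumes `mlflow` is installed and that `MLFLOW_TRACKING_URI` points at a running server.

```python
from typing import Any


def flatten_params(config: dict[str, Any], prefix: str = "") -> dict[str, str]:
    """Flatten a nested config into flat string key/value pairs for MLflow params."""
    flat: dict[str, str] = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = str(value)
    return flat


def log_training_run(config: dict[str, Any], metrics: dict[str, float]) -> str:
    """Log one run's parameters and metrics; returns the MLflow run ID."""
    # Imported here so flatten_params stays usable without MLflow installed.
    import mlflow

    # No URI is hardcoded: the client reads MLFLOW_TRACKING_URI from the environment.
    mlflow.set_experiment("baseline-v1")
    with mlflow.start_run() as run:
        mlflow.log_params(flatten_params(config))
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
        return run.info.run_id
```

A training script might call `log_training_run({"model": {"type": "logreg", "C": 1.0}, "seed": 42}, {"accuracy": 0.92})` after evaluation. Flattening first keeps nested settings readable in the UI as `model.type` and `model.C`.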
Configuration
#!/bin/bash
# Step 1: Start the MLflow server with a SQLite backend store and a local artifact root
mlflow server \
  --backend-store-uri sqlite:///./mlruns.db \
  --default-artifact-root ./artifacts \
  --host 0.0.0.0 \
  --port 5000 &
echo "MLflow server starting on http://localhost:5000"
# Give the server a moment to bind the port (increase on slow machines)
sleep 2
# Step 2: Verify server is responding (-f makes curl fail on HTTP error codes)
curl -sf http://localhost:5000/health > /dev/null && echo "✓ Tracking server healthy" || echo "✗ Failed to connect"
# Step 3: Set environment variable so Python client auto-connects
export MLFLOW_TRACKING_URI="http://localhost:5000"
# Step 4: Create the experiment, a named container for runs
mlflow experiments create --experiment-name "baseline-v1"
# Step 5: Verify the experiment exists (valid --view values: active_only, deleted_only, all)
mlflow experiments search --view all
Why this order?
The server must start first because Python clients contact it on their first logging call and will error out if nothing is listening. Setting MLFLOW_TRACKING_URI before running training scripts ensures they know where to send logs. Creating the experiment explicitly lets you organize runs by project or iteration instead of piling everything into the default experiment.
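The fixed `sleep 2` in the script above is a guess about startup time. A sketch of a more robust wait is to poll the server until it answers. The helper name `wait_for_server` and its defaults are hypothetical, and the sketch assumes the server exposes a `/health` endpoint that returns HTTP 200, as recent MLflow releases do.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll base_url + '/health' until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    health_url = base_url.rstrip("/") + "/health"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet; try again
        time.sleep(interval_s)
    return False
```

A launcher script could call `wait_for_server("http://localhost:5000")` and abort when it returns `False`, rather than starting training against a server that never came up.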
Wrong vs Right
#!/bin/bash
# WRONG: Starting the server without specifying stores
# (falls back to a local ./mlruns file store)
mlflow ui
# WRONG: Tracking URI hardcoded in Python scripts instead of env var
# (forces you to edit code when moving servers)
# WRONG: Using file-based SQLite on NFS without synchronization
# (causes corruption on concurrent writes)
# WRONG: Forgetting to set MLFLOW_TRACKING_URI
# (Python client creates new local mlruns/ directory, fragmenting experiments)

#!/bin/bash
# RIGHT: Explicit configuration with environment variable
mlflow server \
  --backend-store-uri sqlite:///./mlruns.db \
  --default-artifact-root s3://my-bucket/artifacts \
  --host 0.0.0.0 \
  --port 5000 &
# The trailing & backgrounds the server so the script continues to the export
export MLFLOW_TRACKING_URI="http://localhost:5000"
# In Python script, no hardcoded URI needed:
# import mlflow
# mlflow.log_metric("accuracy", 0.92)
# MLflow automatically reads MLFLOW_TRACKING_URI from the environment
Tool vitals
mlflow ui --backend-store-uri sqlite:///mlruns.db --default-artifact-root ./artifacts
.mlflow/config
curl http://localhost:5000 && echo 'Tracking server is running'
Integration notes
MLflow Tracking is the upstream input to the MLflow Model Registry (Phase 2). The artifacts logged here (pickled models, ONNX files) become the source of truth for production deployments. DVC (Phase 3) handles large data versioning separately; MLflow handles experiment metadata and model artifacts. When you tag a run as production-ready in MLflow, a CI/CD pipeline (Phase 4) pulls the artifact from the artifact store (S3, MinIO, or local), builds it into a Docker image, and pushes the image to a registry. The trigger is typically a webhook where your MLflow deployment supports one, or otherwise a scheduled job that polls for newly tagged runs.
Migration path
If you switch to Weights & Biases or Neptune later, check their current documentation for MLflow import tooling before writing your own. A manual fallback is to run mlflow artifacts download for each run (it takes a run ID or artifact URI, so it pulls one run's artifacts at a time), then use the target platform's Python SDK to re-log them. That said, the MLflow 2.x line has kept its tracking API largely backward-compatible, so migrations are rare for teams already using it.
Cost model
The MLflow server itself is free and open source. Storage costs depend on your artifact store: local disk (free but limited to a single machine), S3 (typically $0.023 per GB-month for Standard storage), or managed MLflow on Databricks (billed in DBUs, roughly $0.40 per DBU for small deployments). For 100 experiments × 10 MB average artifacts = 1 GB, S3 costs are negligible (~$0.02/month). Storage cost only becomes noticeable once you reach thousands of runs or large model artifacts (>1 GB each).
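The storage arithmetic above can be written as a small estimator. This is a rough sketch: the function name is hypothetical, the default rate is the S3 Standard figure quoted in the text, and 1 GB is treated as 1000 MB to match the text's rounding.

```python
def monthly_storage_cost_usd(
    num_runs: int,
    avg_artifact_mb: float,
    usd_per_gb_month: float = 0.023,
) -> float:
    """Estimate monthly artifact storage cost, treating 1 GB as 1000 MB."""
    total_gb = num_runs * avg_artifact_mb / 1000
    return total_gb * usd_per_gb_month
```

With the text's numbers, `monthly_storage_cost_usd(100, 10)` comes to about $0.023 per month, while 5,000 runs at 1 GB each (`monthly_storage_cost_usd(5000, 1000)`) comes to about $115 per month, which is where artifact retention policies start to matter.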
Common gotcha
If MLFLOW_TRACKING_URI is not set, MLflow silently falls back to a local mlruns/ directory in the current working directory. The same happens if the value is mistyped as a plain path rather than an http(s) URL. This scatters experiments across the filesystem: a training script will run, log metrics to a local store, and you'll see nothing in the UI. Always verify echo $MLFLOW_TRACKING_URI before training, and check that curl succeeds against the tracking server. If you're using the SQLite backend (fine for a single machine), never point training scripts directly at the database file (sqlite:///...) while the server is using it: concurrent writers contend for SQLite's file lock, which causes hangs and "database is locked" errors. Send all logging through the server's HTTP URI instead.
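One way to turn the "always verify" advice into a guard is a preflight check at the top of each training script. The sketch below is an assumption about team policy, not MLflow behavior: the function name is hypothetical, and it deliberately rejects anything that is not an http(s) URL, which suits the self-hosted server in this section but would also reject other schemes MLflow itself accepts.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def require_remote_tracking_uri(env: Optional[Mapping[str, str]] = None) -> str:
    """Return MLFLOW_TRACKING_URI if it is an http(s) URL; raise otherwise."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    if not uri:
        raise RuntimeError(
            "MLFLOW_TRACKING_URI is not set; runs would land in a local mlruns/ directory"
        )
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise RuntimeError(f"MLFLOW_TRACKING_URI={uri!r} is not an http(s) server URL")
    return uri
```

Calling `require_remote_tracking_uri()` before any MLflow logging makes a misconfigured environment fail loudly at startup instead of quietly writing to a local directory.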
Team adoption
Day 1: A senior engineer starts the tracking server as a service (systemd unit or Docker container).
Day 1: Add export MLFLOW_TRACKING_URI=... to the team's shared `.bashrc` or CI config.
Week 1: Agree on an experiment naming standard (e.g., mlflow.set_experiment(f"v{model_version}")).
Week 2: Set up artifact storage on S3/MinIO so artifacts persist across machine restarts.
Track adoption by running mlflow experiments search --view all | wc -l weekly (subtract any header lines from the count): healthy adoption shows 10+ experiments in the first month.
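A naming standard is easier to keep when a helper builds and validates the name. The sketch below assumes a hypothetical `<project>-v<version>` convention, which extends the `v{model_version}` example above with a project slug; the pattern and the function name are illustrative choices, not anything MLflow requires.

```python
import re

# Hypothetical team convention: lowercase slug, hyphen-separated, ending in -v<number>.
_NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*-v\d+$")


def experiment_name(project: str, model_version: int) -> str:
    """Build a name such as 'churn-model-v3' and reject anything off-convention."""
    slug = project.strip().lower().replace(" ", "-")
    name = f"{slug}-v{model_version}"
    if not _NAME_PATTERN.match(name):
        raise ValueError(f"experiment name {name!r} does not match the team convention")
    return name
```

In a training script this would be used as `mlflow.set_experiment(experiment_name("churn model", 3))`, so every run lands in a consistently named experiment.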
Experienced dev note
Set MLFLOW_TRACKING_URI once in your shell profile or CI environment file, not per script. If the UI and artifact storage live on different infrastructure, run a separate mlflow server process with the --artifacts-only flag to handle artifact traffic, so the tracking server does not also keep artifacts on its local disk. Most importantly, use mlflow.set_experiment() in Python to namespace runs by project rather than creating separate tracking servers: one server can hold many experiments, and you can search runs across all of them from a single place.
Check your understanding
Why would logging metrics to a local mlruns/ directory instead of a server cause problems on a team, and how does setting MLFLOW_TRACKING_URI prevent this?
Answer hint
Each script instance that doesn't know about the server creates its own local directory, fragmenting the single source of truth. The env var centralizes the endpoint so all scripts write to the same backend automatically.