Tool Beginner easy · 5 min concept

The ML engineering gap: models don't deploy themselves

What you will learn
Models built in notebooks fail in production because deployment requires versioning, environment isolation, and monitoring, none of which Jupyter provides.

Why this matters

A trained model sitting on your laptop is not a product. Crossing the gap between 'model works in my notebook' and 'model serves predictions in production' can cost teams weeks of debugging silent failures and undetected data drift. Without tools such as MLflow, DVC, and Docker, it is easy to deploy the wrong version with the wrong dependencies and have no idea when it breaks.

Skip if: If you're building a one-off prototype for a Kaggle competition or internal proof-of-concept that runs once and never updates, you might skip MLflow experiment tracking. But the moment a model touches real data or runs on a schedule, these tools become non-negotiable. Even for 'small' projects, the debugging cost of missing versioning outweighs setup time.

Explanation

The ML engineering gap is the chasm between data science and production. A data scientist trains a model in a Jupyter notebook with scikit-learn 1.3.0, pandas 2.0, and Python 3.10. Six months later, a different engineer tries to load that model on a server running Python 3.11 with scikit-learn 1.5.0, and it either fails to load or behaves differently. No one knows which version of the training data was used. The model scores worse on production data than in development. There's no history of which hyperparameters were tried or why this version was chosen.
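One way to avoid the "which versions was this trained with?" question is to record the environment at training time. The following is a minimal sketch using only the standard library; the function names and the idea of writing a JSON file next to model.pkl are assumptions for illustration, not something MLflow, DVC, or Docker does for you in this form.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


def save_snapshot(path, packages):
    """Write the snapshot as JSON (hypothetical location: next to model.pkl)."""
    snapshot = snapshot_environment(packages)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2, sort_keys=True)
    return snapshot
```

Calling `save_snapshot("environment.json", ["scikit-learn", "pandas", "numpy"])` right after training would leave a small file that a later engineer can compare against the server.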

Three tools solve this: MLflow tracks experiments and registers model versions, DVC versions data and models alongside code, and Docker freezes the entire environment (dependencies, Python version, OS libraries) so the model runs identically everywhere. Together, they answer the three questions production needs: Which code and data created this model? Does it still work? Can I reproduce it?

This item establishes the conceptual foundation: the gap exists because notebooks are interactive and isolated, but production is automated and shared. These three tools bridge that gap by making experiments reproducible, data trackable, and deployments portable.
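To make "data trackable" concrete: data versioning tools identify a dataset by a hash of its contents, so any change to the file yields a new version identifier. The sketch below illustrates the idea with SHA-256 from the standard library; it is not DVC's own implementation, and the function name is hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes get the same hash, and a single changed row produces a different one, which is what lets a small pointer file in Git stand in for a large dataset stored elsewhere.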

Configuration

dockerfile
# Dockerfile: Freeze the environment so the model runs the same everywhere
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    python3-venv \
    git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model.pkl .
COPY app.py .

EXPOSE 8000
CMD ["python3", "app.py"]

---
# requirements.txt: Pin exact versions so dependencies never drift
scikit-learn==1.3.0
pandas==2.0.3
mlflow==2.10.0
flask==3.0.0
numpy==1.24.3

---
# dvc.yaml: Track data and model versions with DVC
stages:
  preprocess:
    cmd: python3 preprocess.py
    deps:
      - data/raw.csv
    outs:
      - data/processed.csv
  
  train:
    cmd: python3 train.py
    deps:
      - data/processed.csv
      - train.py
    outs:
      - model.pkl
    metrics:
      - metrics.json

---
# .dvc/config: Tell DVC where to store model and data versions
[core]
    remote = myremote

['remote "myremote"']
    url = s3://my-bucket/dvc-storage

---
# MLproject: Describe the experiment for MLflow reproducibility
name: iris_classifier

version: 1.0

entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 100}
    command: "python3 train.py --lr {learning_rate} --epochs {epochs}"

Why this order?

The Dockerfile comes first because it defines the runtime environment that everything else (requirements.txt, dvc.yaml, MLproject) runs in. requirements.txt is second because the model's behavior is locked to specific library versions. dvc.yaml is third because it orchestrates data and model lineage, and .dvc/config specifies where those artifacts live. MLproject comes last because it is the glue that makes experiments reproducible across the team.
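The MLproject entry point above runs train.py with `--lr` and `--epochs`, but train.py itself is not shown. A minimal sketch of the argument parsing that command assumes might look like this; the defaults mirror the MLproject file, and everything else (the function name, the help text) is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the MLproject command passes to train.py."""
    parser = argparse.ArgumentParser(description="Train the iris classifier")
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--epochs", type=int, default=100, help="training epochs")
    # argv=None makes argparse read sys.argv[1:], which is what a script wants.
    return parser.parse_args(argv)
```

Keeping the defaults identical in both places means running `python3 train.py` by hand and running the MLproject entry point without overrides train with the same hyperparameters.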

Wrong vs Right

Wrong way
python
# The notebook-to-production way that fails.
import pickle

# 1. Export the model from the notebook as a pickle
#    (trained_model is the estimator fitted earlier in the notebook)
with open('model.pkl', 'wb') as f:
    pickle.dump(trained_model, f)

# 2. Upload to the server with no version tracking (shell, not Python):
#    $ scp model.pkl user@prod-server:/models/

# 3. On the server: load and serve (environment is whatever the server happens to have)
with open('/models/model.pkl', 'rb') as f:
    model = pickle.load(f)

# Problems: no record of which code or data built this. The server might have an incompatible scikit-learn. No experiment history. No monitoring.

Right way
python
# The MLOps way.
import mlflow
import mlflow.sklearn

# 1. Track the experiment with MLflow (model is the fitted estimator)
with mlflow.start_run():
    mlflow.log_param('learning_rate', 0.01)
    mlflow.log_metric('accuracy', 0.95)
    mlflow.sklearn.log_model(model, 'model')

# 2. Register the logged model in the MLflow Model Registry
#    (each registered version is numbered and linked to the run that produced it)

# 3. Version data with DVC (shell):
#    $ dvc add data/raw.csv
#    $ git add data/raw.csv.dvc data/.gitignore
#    $ git commit -m "Add training data v1.2"

# 4. Freeze the environment in the Dockerfile
#    (the model always runs with scikit-learn 1.3.0)

# 5. Deploy via a Docker image (shell; <registry> is a placeholder for your image registry):
#    $ docker build -t <registry>/iris-classifier:1.2.3 .
#    $ docker push <registry>/iris-classifier:1.2.3

Tool vitals

Primary command
bash
mlflow ui
Config file MLproject, dvc.yaml, or Dockerfile
Verify
bash
docker run --rm <image_name> python3 -c "import mlflow; print(mlflow.__version__)"

Integration notes

These three tools work together: MLflow tracks which code and hyperparameters created the model, DVC versions the data that trained it, and Docker locks in the exact environment. A Docker image pushed to production is only tied to a specific DVC data version and MLflow run if you record that link, for example by logging the Git commit and data hash with the run and tagging the image to match. Kubernetes (next item) orchestrates running these Docker images at scale.
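One way to make the link between an image, its code commit, and its data version explicit is to stamp them onto the image as labels at build time. The sketch below only composes the `docker build` argument list (using Docker's `-t` and `--label` options); the function name and the label keys are hypothetical, and actually running the command is left to the caller.

```python
def docker_build_command(image, tag, labels, context="."):
    """Compose a `docker build` argument list with one --label per lineage field."""
    command = ["docker", "build", "-t", f"{image}:{tag}"]
    for key in sorted(labels):  # sorted so the command is stable across runs
        command += ["--label", f"{key}={labels[key]}"]
    command.append(context)
    return command
```

For example, `docker_build_command("iris-classifier", "1.2.3", {"git_commit": commit, "data_hash": data_hash})` yields a list that can be passed to `subprocess.run`, and the labels can later be read back with `docker inspect`.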

Migration path

If you outgrow MLflow's tracking (too many experiments, complex dependencies), consider Weights & Biases or Neptune. If DVC's remote storage becomes a bottleneck, consider moving large datasets to a data lake built on a table format such as Delta Lake or Iceberg. If maintaining Docker images yourself is too much overhead (for example on FaaS/serverless platforms), a serving framework can reduce the work: BentoML packages a model with its dependencies, and vLLM provides a ready-made inference server for LLMs. But all three tools address the same gap: switching one doesn't eliminate the need for the others.

Cost model

MLflow: free and open-source, but hosting the tracking server costs compute (typically <$50/month on a small EC2 instance). DVC: free and open-source; the cost is the remote storage, roughly $5-50/month on S3 or similar depending on data size. Docker: the engine is free, but registries (Docker Hub, ECR, GCR) charge for storage and bandwidth (roughly $5-20/month for small teams). These figures are rough estimates, and costs stay near zero only while you remain within free-tier limits.

Common gotcha

The silent failure: you upload model.pkl to production, it loads fine with pickle.load() and runs fine on test data, then misbehaves on real production data because the server has scikit-learn 1.5.0 and the model was trained on 1.3.0. The model loads and the server doesn't error (at most a version-mismatch warning appears in the logs), but the predictions are wrong or the behavior changes. Without MLflow tracking and Docker environment freezing, you won't know why until production metrics degrade and users complain.
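A guard against this gotcha is to compare the package versions recorded at training time with those found on the server, and refuse to serve on a mismatch. The sketch below assumes both sides are available as plain dictionaries of package name to version string; how they are recorded and loaded is left open, and the function names are hypothetical.

```python
def find_mismatches(recorded, current):
    """Return (name, recorded_version, current_version) for every differing package."""
    mismatches = []
    for name, wanted in sorted(recorded.items()):
        found = current.get(name)  # None when the package is missing entirely
        if found != wanted:
            mismatches.append((name, wanted, found))
    return mismatches


def assert_same_environment(recorded, current):
    """Raise before serving if the environment differs from training time."""
    mismatches = find_mismatches(recorded, current)
    if mismatches:
        details = ", ".join(
            f"{name}: trained with {wanted}, found {found}"
            for name, wanted, found in mismatches
        )
        raise RuntimeError(f"environment mismatch: {details}")
```

Failing loudly at startup turns the silent failure into an immediate, explainable one, at the cost of blocking deployments until the environment is rebuilt to match.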

Team adoption

Day one: require all model training to log to MLflow, even if the backend is just a local SQLite database. No exceptions; make mlflow.start_run() a habit. Week one: set up a shared DVC remote (an S3 bucket or shared NAS) so data versions are accessible to the team, not siloed on one laptop. Week two: create a Dockerfile template that all models inherit from, and make it hard to bypass. Enforce pinning in code review: if requirements.txt uses >= instead of ==, the PR doesn't merge. Month one: run a postmortem on the last model that failed in production and show how MLflow, DVC, and Docker would have caught it.
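The pinning rule is easy to automate so that reviewers don't have to spot it by eye. Below is a minimal sketch of such a check; the function name is hypothetical, and the pattern is deliberately strict (it also flags lines with environment markers or URLs), so treat it as a starting point, not a full requirements parser.

```python
import re

# A pinned line looks like "name==1.2.3", optionally with extras: "name[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,\s-]+\])?==[^=\s*]+$")


def unpinned_requirements(text):
    """Return requirement lines that are not pinned to one exact version."""
    offenders = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            offenders.append(line)
    return offenders
```

A CI job could read requirements.txt, call `unpinned_requirements`, and fail the build when the returned list is not empty.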

Experienced dev note

Experienced MLOps engineers freeze the Python version in the Dockerfile even when it seems overkill: FROM python:3.10-slim instead of relying on the system Python. Likewise, they commit the .dvc files and dvc.lock that DVC generates to Git in the same commit as the code, so each data version is tied to an exact code commit. And they log not just metrics but the git commit hash and data hash to MLflow: mlflow.log_param('git_commit', subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()). This costs 30 seconds upfront but saves weeks of debugging 'which version was this?'
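The one-liner above raises an exception when training runs outside a Git checkout (for example inside a container built without the .git directory). A slightly more defensive sketch is shown below; the function names and the "unknown" fallback are assumptions for illustration.

```python
import subprocess


def current_git_commit(cwd=None):
    """Return the HEAD commit hash, or None when git or a repository is unavailable."""
    try:
        output = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=cwd, stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git is not installed, or cwd is not inside a repository
    return output.decode().strip()


def lineage_params(data_hash, cwd=None):
    """Collect the values the note suggests logging with each run."""
    return {"git_commit": current_git_commit(cwd) or "unknown", "data_hash": data_hash}
```

The resulting dictionary can be passed to `mlflow.log_params` inside an active run, so every run carries both identifiers.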

Check your understanding

You train a model locally with scikit-learn 1.3.0, log it to MLflow, version the training data with DVC, and commit both to Git. Six months later, someone pulls your code, runs docker build, and the built image has scikit-learn 1.5.0 because requirements.txt says scikit-learn>=1.3.0. The model loads and makes predictions, but they differ from your original test metrics. Why is this a problem, and which of the three tools failed to prevent it?

Answer hint

The problem is loose pinning: scikit-learn>=1.3.0 allows any newer version, so a rebuild six months later pulls 1.5.0. Only scikit-learn==1.3.0 (an exact version in requirements.txt) prevents this. The layer that failed is the Docker environment freeze: Docker built exactly what requirements.txt allowed, so the image was never actually frozen. MLflow logged the metrics but doesn't enforce the environment, and DVC tracked the data but doesn't control Python packages. The three tools only guarantee reproducibility together when the dependencies are pinned exactly.
