Reproducibility requirements
Why this matters
Without reproducibility, you cannot debug why yesterday's model performs differently today, you cannot hand off experiments to a teammate, and you cannot satisfy regulatory requirements (GDPR, FDA) that demand audit trails. Untracked changes to data or dependencies silently break pipelines and cause data drift that goes unnoticed.
Explanation
Reproducibility in MLOps means that given the same code, data, and environment, your model training produces identical results every time. This requires three pillars: (1) environment pinning (lock all dependency versions), (2) data versioning (track exact data snapshots with checksums), and (3) seed control (fix random seeds). MLflow tracks experiment metadata and parameters, DVC versions data and models, and Docker locks the OS and system libraries. Without these, you hit the 'it works on my machine' wall immediately.
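To make the checksum idea behind data versioning concrete, here is a minimal sketch using only the standard library. It is not how DVC is implemented internally; the helper names `file_md5` and `data_unchanged` are hypothetical. MD5 is used because that is the hash DVC records by default.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_unchanged(path: Path, recorded_digest: str) -> bool:
    """True if the file on disk still matches a digest recorded earlier."""
    return file_md5(path) == recorded_digest
```

A digest recorded next to the code makes a silent data change detectable: if even one byte of the file differs, `data_unchanged` returns `False`.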
Configuration
# requirements.txt: Python dependency lock
mlflow==2.16.0
dvc==3.53.0
scikit-learn==1.5.1
numpy==1.26.4
pandas==2.2.0
requests==2.31.0
# .gitignore: prevent large files in git
*.csv
*.parquet
data/raw/
models/
.dvc/cache/
# dvc.yaml: DVC pipeline with data versioning
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/train.csv
    outs:
      - data/processed/train.pkl  # DVC records the md5 hash in dvc.lock
  train:
    cmd: python src/train.py --seed 42
    deps:
      - src/train.py
      - data/processed/train.pkl
    params:  # names only; the values (lr: 0.001, epochs: 10) live in params.yaml
      - lr
      - epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
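# Dockerfile for reproducible environment
FROM python:3.11-slim-bookworm
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ src/
# Assumption: an MLproject file defining the `main` entry point exists in the repo
COPY MLproject dvc.yaml dvc.lock ./
# `dvc pull` needs the .dvc/ config and remote credentials available at build time
RUN dvc pull
CMD ["mlflow", "run", ".", "-e", "main", "--env-manager=local"]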
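# Train script with seed control (src/train.py)
import argparse
import random

import mlflow
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=42)
seed = parser.parse_args().seed

# Seed every RNG before splitting data or creating the model
np.random.seed(seed)
random.seed(seed)

# Assumption: prepare.py wrote a DataFrame with a "label" target column
df = pd.read_pickle("data/processed/train.pkl")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="label"), df["label"], test_size=0.2, random_state=seed
)

with mlflow.start_run():
    mlflow.log_param("seed", seed)
    mlflow.log_param("n_estimators", 100)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", score)
# Writing models/model.pkl and metrics.json (declared in dvc.yaml) is omitted here

Why this order?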
requirements.txt must exist before the Docker image is built, because the Dockerfile copies it in and installs from it. dvc.yaml defines the pipeline order through dependencies: train consumes the output of prepare, so prepare runs first. The random seed must be set before the model is initialized. MLflow logs parameters before training so that they are in the experiment record even if the run fails partway through.
Wrong vs Right
# ❌ NO pinned versions: resolves to different versions depending on when you install
pip install mlflow scikit-learn pandas
# ❌ NO seed: different results on every run
model = RandomForestClassifier(n_estimators=100)
# ❌ NO data versioning: data changes silently
df = pd.read_csv('data/raw/train.csv')  # might be different next month
# ❌ NO Docker: environment varies between machines
python src/train.py

# ✅ Pin exact versions
pip freeze > requirements.txt
# ✅ Set seed before model creation
np.random.seed(42)
random.seed(42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# ✅ Track data with DVC
dvc add data/raw/train.csv
# ✅ Run in Docker with locked environment
docker build -t ml-train:latest . && docker run ml-train:latest

Tool vitals
Setup: pip freeze > requirements.txt && dvc add data/ && mlflow run .
Config files: requirements.txt, dvc.yaml, MLproject
Verify: docker run --rm <image> python src/train.py && docker run --rm <image> python src/train.py (compare metrics)

Integration notes
MLflow, DVC, and Docker work together as a layered pyramid: Docker provides the reproducible OS/library layer, requirements.txt freezes Python packages inside it, DVC versions the data and pipeline, and MLflow tracks experiments and model lineage. Drop any one of the three tools and you lose reproducibility at that level: without Docker, OS libraries vary; without DVC, data drifts silently; without MLflow, you can't tell which data produced which model.
Migration path
If you outgrow DVC (for example, with very large datasets over 500 GB), consider moving data versioning to a hosted service (e.g., Weights & Biases Artifacts, Neptune). If you outgrow MLflow, move to Kubeflow or Databricks-managed MLflow. Docker stays: it's foundational. requirements.txt can give way to Poetry (pyproject.toml plus poetry.lock) for better dependency resolution, but the principle of pinning remains.
Cost model
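MLflow is free and open source. DVC is also free and open source; costs arise only from the remote storage backend you choose (S3, GCS, Azure). Docker Engine is free, though Docker Desktop requires a paid subscription for larger organizations. S3 Standard storage runs about $0.023/GB/month, so a 10 GB dataset costs roughly $0.23/month in DVC remote storage, before request and data-transfer fees.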
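The storage arithmetic can be sketched as a small helper. The function name and the `retained_versions` parameter are hypothetical. The rate is the S3 figure quoted above and should be checked against current pricing. Because DVC stores files by content hash, unchanged files are not duplicated across versions, so multiplying by the number of versions gives an upper bound.

```python
# Rate quoted in the text (S3 Standard, USD per GB per month); an assumption to verify
S3_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost(
    dataset_gb: float,
    retained_versions: int = 1,
    rate: float = S3_USD_PER_GB_MONTH,
) -> float:
    """Upper-bound monthly storage cost, assuming every version rewrites every file."""
    return dataset_gb * retained_versions * rate


# One 10 GB snapshot, as in the text
print(f"${monthly_storage_cost(10):.2f}/month")
```

Request and data-transfer fees are not modeled here.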
Common gotcha
Setting random seeds in Python does NOT guarantee reproducibility if your framework runs operations on a GPU (CUDA) or if you use threading/multiprocessing without controlling worker seeds. NumPy itself runs on the CPU, but many GPU kernels in deep learning frameworks are non-deterministic by default. Set `random.seed()`, `np.random.seed()`, and `torch.manual_seed()` (if using PyTorch), and set `PYTHONHASHSEED=0` in your Docker environment so it is in place before the interpreter starts. The DVC `dvc.lock` file records exact data hashes: if you update data without running `dvc repro`, the lock file becomes stale and teammates will use different data.
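A minimal sketch of a single seeding helper that covers the libraries named above. `seed_everything` and `worker_seed` are hypothetical names, and NumPy and PyTorch are treated as optional so the helper also runs where they are not installed. `PYTHONHASHSEED` is not set here because it only takes effect if it is defined before the interpreter starts.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG that is importable in this process."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        # Ask PyTorch to raise an error on ops that lack a deterministic implementation
        torch.use_deterministic_algorithms(True)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def worker_seed(base_seed: int, worker_id: int) -> int:
    """Derive a distinct, repeatable seed for each worker process."""
    return base_seed + worker_id
```

Calling `seed_everything(worker_seed(42, worker_id))` at the start of each worker gives every worker its own stream while keeping the whole run repeatable. This does not by itself make every GPU kernel deterministic.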
Team adoption
Day 1: Distribute a `requirements.txt` and `Dockerfile` in the repo template. Day 2: Add a CI/CD check that rebuilds the Docker image and runs `dvc repro` to catch environment drift. Day 3: Document that all new dependencies must be added to requirements.txt, with pinned versions, before merging. Week 1: Run a 30-minute workshop showing 'why my model changed' (spoiler: environment) and demo the difference between a pinned and unpinned run.
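One way to enforce the pinning rule in CI is a small script that flags any requirement without an exact `==` version. This is a sketch under simplifying assumptions: the function name is hypothetical, and the pattern handles plain `name==version` lines (with optional extras and environment markers) but not URLs or editable installs.

```python
import re
from typing import List

# Package name, optional [extras], an exact == version without wildcards,
# and an optional environment marker after ";"
_PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^*\s;]+(\s*;.*)?$"
)


def unpinned_requirements(text: str) -> List[str]:
    """Return the requirement lines that are not pinned to one exact version."""
    offenders = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if not _PINNED.match(line):
            offenders.append(line)
    return offenders
```

In a CI job, the script could read `requirements.txt`, print the offenders, and exit with a non-zero status if the list is not empty.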
Experienced dev note
Commit `dvc.lock` to Git, not just `dvc.yaml`. The lock file records the exact hashes of each stage's dependencies and outputs: without it, teammates can reproduce only the latest code, not your exact pipeline run. Also, use `dvc repro --force` in CI/CD pipelines so that every stage is re-executed regardless of cached results; this helps catch data drift that a stale lock file would otherwise hide by silently running against old data.
Check your understanding
If you pin Python packages with requirements.txt but don't use Docker, can you guarantee another developer running the code on Ubuntu 20.04 gets the same results as you on macOS Sonoma?
Answer hint
No: system-level libraries (libopenblas, libpq, CUDA runtime) vary between OS versions and machines. Docker locks the OS userland layer, not just Python, although the host kernel and CPU architecture still come from the machine running the container. requirements.txt alone is necessary but insufficient.
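To see what differs between two machines, it can help to record a small environment fingerprint alongside each run. The sketch below uses only the standard library; the function name and the default package list are assumptions. It captures Python-level information only, so system libraries such as the BLAS build are not covered.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable


def environment_fingerprint(
    packages: Iterable[str] = ("numpy", "scikit-learn"),
) -> Dict[str, str]:
    """Collect interpreter, OS, architecture, and selected package versions."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),  # CPU architecture as reported by the OS
    }
    for name in packages:
        try:
            info[f"pkg:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[f"pkg:{name}"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in environment_fingerprint().items():
        print(f"{key}: {value}")
```

Comparing the output from the Ubuntu and macOS machines in the question shows the `os` and possibly `machine` entries differing even when every package version matches.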