Tool Beginner easy · 6 min concept

Reproducibility requirements

What you will learn

Pin versions and track data snapshots so any team member can rebuild your exact model and results months later.

Why this matters

Without reproducibility, you cannot debug why yesterday's model performs differently today, hand off experiments to a teammate, or produce the audit trails that regulated settings (for example GDPR or FDA-governed work) demand. Untracked changes to data or dependencies silently break pipelines and let data drift go unnoticed.

Skip if: You are writing a small one-off script or a Jupyter notebook for personal exploration. The moment code reaches production, a team, or a second run, reproducibility becomes non-negotiable. Even for a pipeline that runs in 30 seconds, 10 minutes of reproducibility setup can save hours of debugging later.

Explanation

Reproducibility in MLOps means that given the same code, data, and environment, your model training produces identical results every time. This requires three pillars: (1) environment pinning (lock all dependency versions), (2) data versioning (track exact data snapshots with checksums), and (3) seed control (fix random seeds). MLflow tracks experiment metadata and parameters, DVC versions data and models, and Docker locks the OS and system libraries. Without these, you hit the 'it works on my machine' wall immediately.
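
To make the "checksum" idea behind data versioning concrete, here is a minimal sketch using only the standard library. The helper name `file_md5` and the chunk size are illustrative assumptions, not part of DVC; DVC computes and stores its own hashes, and this only shows why a content hash pins a data snapshot.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large datasets never load fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical demo data: a tiny CSV written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "train.csv"
    snapshot.write_text("id,label\n1,0\n2,1\n")
    before = file_md5(snapshot)

    # Changing a single value changes the digest, so a silent edit is detectable.
    snapshot.write_text("id,label\n1,0\n2,0\n")
    after = file_md5(snapshot)

    print("snapshot changed:", before != after)
```

Recording a digest like this next to a model is what lets you later say which exact data produced it.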

Configuration

bash

# requirements.txt: Python dependency lock
mlflow==2.16.0
dvc==3.53.0
scikit-learn==1.5.1
numpy==1.26.4
pandas==2.2.0
requests==2.31.0

# .gitignore: keep large files out of Git, but keep DVC pointer files (*.dvc) tracked
*.csv
*.parquet
data/raw/*
!data/raw/*.dvc
models/
.dvc/cache/

# dvc.yaml: DVC pipeline with data versioning
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/train.csv
    outs:
      - data/processed/train.pkl
  train:
    cmd: python src/train.py --seed 42
    deps:
      - src/train.py
      - data/processed/train.pkl
    params:
      - lr        # value lives in params.yaml (e.g. lr: 0.001)
      - epochs    # value lives in params.yaml (e.g. epochs: 10)
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

# Dockerfile for reproducible environment
FROM python:3.11-slim-bookworm
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ src/
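COPY dvc.yaml dvc.lock ./
# dvc pull needs the repo's .dvc/config and remote credentials available at build time
RUN dvc pull
# mlflow run needs a project URI (here "."); this assumes an MLproject file with a "main" entry point in /app
CMD ["mlflow", "run", ".", "-e", "main", "--env-manager=local"]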

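# Train script with seed control (src/train.py)
import random

import mlflow
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42

# Seed every source of randomness before any random operation runs
np.random.seed(SEED)
random.seed(SEED)

# Assumption: train.pkl holds a pandas DataFrame with a "label" target column
df = pd.read_pickle("data/processed/train.pkl")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["label"]), df["label"], test_size=0.2, random_state=SEED
)

with mlflow.start_run():
    # Log parameters first so they are recorded even if training fails
    mlflow.log_param("seed", SEED)
    mlflow.log_param("n_estimators", 100)
    model = RandomForestClassifier(n_estimators=100, random_state=SEED)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", score)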

Why this order?

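requirements.txt must exist before the Docker image is built, because the Dockerfile copies and installs it. dvc.yaml defines the stage order (prepare before train) through each stage's deps and outs. Random seeds must be set before any random operation, including model initialization, because a draw made before seeding is not repeatable. MLflow logs parameters before training so they are recorded in the experiment even if training fails.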
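
The point about seeding before any random operation can be sketched with the standard library alone. The helper `shuffled_split` is a hypothetical name for illustration; it is a simplified stand-in for a library splitter, not the behavior of any specific one.

```python
import random


def shuffled_split(rows, seed, test_fraction=0.2):
    """Shuffle rows with a dedicated seeded generator and split into train/test."""
    rng = random.Random(seed)  # local generator, unaffected by other code's random calls
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))

# Same seed, same split, no matter how often it is called.
first = shuffled_split(rows, seed=42)
second = shuffled_split(rows, seed=42)
print("splits identical:", first == second)

# The global generator only repeats itself from the point where it was seeded.
random.seed(42)
draw_a = random.random()
random.seed(42)
draw_b = random.random()
print("seeded draws identical:", draw_a == draw_b)
```

Passing the seed explicitly, as `random_state=42` does for scikit-learn estimators, is generally more robust than relying on global state that other code may consume.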

Wrong vs Right

Wrong way

bash

# ❌ NO pinned versions: installs whatever is newest on the day you run it
pip install mlflow scikit-learn pandas

# ❌ NO seed: different results on every run
model = RandomForestClassifier(n_estimators=100)

# ❌ NO data versioning: data changes silently
df = pd.read_csv('data/raw/train.csv')  # might be different next month

# ❌ NO Docker: environment varies between machines
python train.py

Right way

bash

# ✅ Pin exact versions
pip freeze > requirements.txt

# ✅ Set seed before model creation
np.random.seed(42)
random.seed(42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# ✅ Track data with DVC
dvc add data/raw/train.csv

# ✅ Run in Docker with locked environment
docker build -t ml-train:latest . && docker run ml-train:latest
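
A pinning rule is easier to keep when a script enforces it. The following is a minimal sketch of a check that could run in CI; the function name `unpinned_requirements` is an assumption for illustration, and the regular expression is deliberately simplified (it is not a full requirement-specifier parser, so lines with environment markers or URLs will be reported as unpinned).

```python
import re

# Matches "name==version" with optional extras, e.g. "mlflow==2.16.0" or "dvc[s3]==3.53.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[^=\s]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r or -e
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


example = """
mlflow==2.16.0
scikit-learn==1.5.1
pandas          # no version at all
numpy>=1.26     # a range, not a pin
"""
print(unpinned_requirements(example))
```

An empty list means every dependency line carries an exact version.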

Tool vitals

Primary command

bash

pip freeze > requirements.txt && dvc add data/raw/train.csv && mlflow run .

Config file: requirements.txt, dvc.yaml, MLproject

Verify

bash

# Run training twice in fresh containers, then compare the metrics from the two runs
docker run --rm <image> python src/train.py && docker run --rm <image> python src/train.py
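
Comparing the two runs by eye is error-prone, so the comparison can be scripted. This is a minimal sketch that assumes each run produced a flat JSON object of metric names to numbers (similar in shape to a metrics.json file); the function name `metric_mismatches` and the sample values are hypothetical.

```python
import json
import math


def metric_mismatches(run_a, run_b, rel_tol=0.0, abs_tol=0.0):
    """Return {metric: (value_a, value_b)} for every metric that differs between runs."""
    mismatches = {}
    for name in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(name), run_b.get(name)
        # A metric missing from one run counts as a mismatch.
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches[name] = (a, b)
    return mismatches


# Hypothetical metric files from two runs of the same image.
first_run = json.loads('{"accuracy": 0.9125}')
second_run = json.loads('{"accuracy": 0.9125}')

print(metric_mismatches(first_run, second_run))
```

With both tolerances left at zero the check demands exact equality, which is the right default for CPU-only, fully seeded training; a small `abs_tol` may be a more realistic bar where hardware nondeterminism remains.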

Integration notes

MLflow, DVC, and Docker work together as layers: Docker provides the reproducible OS and system-library layer, requirements.txt freezes Python packages, DVC versions the data and pipeline, and MLflow tracks experiments and model lineage. Remove any one and reproducibility breaks at that level: drop Docker and OS libraries vary; drop DVC and data changes silently; drop MLflow and you can't trace which data and parameters produced which model.

Migration path

If DVC becomes unwieldy for very large datasets (roughly above 500GB), consider a hosted tracking service with artifact versioning (e.g., Weights & Biases, Neptune). If you outgrow self-hosted MLflow, move to Kubeflow or Databricks' managed MLflow. Docker stays: it's foundational. requirements.txt can move to Poetry (pyproject.toml plus a lock file) for better dependency resolution, but the principle of pinning remains.

Cost model

MLflow (free, open-source), DVC (free, open-source; you pay only for the remote storage you configure), Docker (free). Storage costs emerge only if you push data to a cloud remote such as S3, GCS, or Azure, and they are billed by the provider (about $0.023/GB/month for S3 Standard storage). For a 10GB dataset, expect ~$0.23/month in storage.
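
The storage arithmetic above is simple enough to script when sizing a budget. This sketch uses the per-GB rate quoted in the text as a default; the function name is an assumption, and real bills also include request and transfer charges that are ignored here.

```python
def monthly_storage_cost(size_gb, price_per_gb_month=0.023):
    """Estimate monthly object-storage cost in USD, rounded to cents."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return round(size_gb * price_per_gb_month, 2)


# The 10GB example from the text.
print(monthly_storage_cost(10))
```

Keep in mind that the stored size is the total of all distinct file versions pushed to the remote, not only the latest snapshot, so the estimate grows as the dataset is revised.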

Common gotcha

Setting random seeds in Python does NOT guarantee reproducibility if your framework runs operations on GPU (CUDA) or if you use threading/multiprocessing without controlling worker seeds. NumPy itself runs only on CPU, but many GPU kernels in frameworks such as PyTorch are non-deterministic by default. Set `np.random.seed()` AND `random.seed()` AND `torch.manual_seed()` (if PyTorch, optionally together with `torch.use_deterministic_algorithms(True)`), and set `PYTHONHASHSEED=0` in your Docker environment. The DVC `dvc.lock` file records exact data hashes: if you update data without running `dvc repro`, the lock file becomes stale and teammates will use different data.
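
Why `PYTHONHASHSEED` belongs in the environment rather than in the script can be shown with a short experiment. The sketch below starts fresh interpreters with and without the variable; the helper name is hypothetical, and the only assumption is that `sys.executable` points at a working Python.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed=None):
    """Start a new Python process and return hash('reproducibility') from it."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean slate
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('reproducibility'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# With the variable fixed, two separate processes agree on string hashes.
pinned_a = string_hash_in_fresh_interpreter(hash_seed=0)
pinned_b = string_hash_in_fresh_interpreter(hash_seed=0)
print("pinned hashes identical:", pinned_a == pinned_b)

# Without it, each process picks its own random hash seed, so these usually differ.
free_a = string_hash_in_fresh_interpreter()
free_b = string_hash_in_fresh_interpreter()
print("unpinned hashes identical:", free_a == free_b)
```

The variable is read at interpreter startup, so assigning `os.environ["PYTHONHASHSEED"]` inside an already running script has no effect on that process. That is why it is set in the container environment instead.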

Team adoption

Day 1: Distribute a `requirements.txt` and `Dockerfile` in the repo template. Day 2: Add a CI/CD check that rebuilds the Docker image and runs `dvc repro` to catch environment drift. Day 3: Document that all new dependencies must be added to requirements.txt before merging. Week 1: Run a 30-minute workshop showing 'why my model changed' (spoiler: environment) and demo the difference between a pinned and unpinned run.

Experienced dev note

Commit `dvc.lock` to Git, not just `dvc.yaml`. The lock file records the exact hashes of every stage's dependencies and outputs: without it, teammates can only rerun the latest code, not reproduce your exact pipeline run. In CI/CD, run `dvc repro --force` so that every stage is re-executed instead of being skipped as unchanged; otherwise a stale lock file can hide the fact that the pipeline ran against old data.

Check your understanding

If you pin Python packages with requirements.txt but don't use Docker, can you guarantee another developer running the code on Ubuntu 20.04 gets the same results as you on macOS Sonoma?

Answer hint

No: system-level libraries (libopenblas, libpq, CUDA runtime) vary between OS versions and machines. Docker pins the OS userland and system libraries, not just Python packages. requirements.txt alone is necessary but insufficient.
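
One inexpensive way to notice the differences described above is to record an environment fingerprint with every run. This is a minimal sketch using only the standard library; the function name, the key names, and the choice of fields are assumptions, and the fingerprint does not capture system libraries such as BLAS builds.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages=()):
    """Collect OS, CPU architecture, Python version, and selected package versions."""
    info = {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[f"pkg:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[f"pkg:{name}"] = "not installed"
    return info


# Package names are examples; list whichever ones matter for your model.
for key, value in environment_fingerprint(["numpy", "scikit-learn"]).items():
    print(f"{key}: {value}")
```

Two developers comparing these dictionaries will see an OS or architecture mismatch immediately. Storing the dictionary alongside the run's parameters keeps that evidence with the model.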
