Phase 3: CI/CD: GitHub Actions for ML Pipelines
Why this matters
Without CI/CD automation, model development drifts untracked: experiments run locally on different machines, training scripts diverge from the main branch, and nobody knows which version is production-ready. GitHub Actions makes the process repeatable: every qualifying push triggers a controlled training run, metrics are logged to MLflow automatically, and only validated models reach the registry.
Explanation
GitHub Actions listens to repository events (push, pull request, schedule) and runs workflow jobs. Each job runs on a fresh runner (a virtual machine for GitHub-hosted runners) where you check out code, pull data via DVC, train models, log metrics to MLflow, and conditionally push to the model registry. The workflow file (YAML) lives in .github/workflows/ and defines triggers, environment variables (MLflow tracking URI, DVC remote credentials), and sequential steps. The key difference from local development is that the runner is stateless and ephemeral, so all artifacts (model weights, metrics, data) must be persisted externally (MLflow backend, DVC remote storage, or GitHub Artifacts). You typically trigger on push to main (always train), on pull requests (validate on the test set), or on a schedule (nightly retraining). The workflow passes only if training succeeds and validation metrics meet their thresholds. Failures alert the team and, when branch protection requires the check, block merges.
Configuration
name: Train and Register Model
on:
  push:
    branches:
      - main
    paths:
      - 'src/**'
      - 'data/**'
      - '.github/workflows/train.yml'
  pull_request:
    branches:
      - main
jobs:
  train:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    environment: ml-prod
    permissions:
      contents: write  # lets the "Commit DVC lock updates" step push
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install dvc dvc-s3 mlflow
      - name: Configure DVC remote
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc remote modify myremote url s3://my-bucket/dvc-storage
          dvc pull
      - name: Train model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_USERNAME }}
          MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_PASSWORD }}
        # Assumes src/train.py appends RUN_ID=<MLflow run ID> to $GITHUB_ENV;
        # otherwise env.RUN_ID in the registration step is empty.
        run: |
          python src/train.py \
            --data-path data/train.csv \
            --model-output models/model.pkl \
            --experiment-name production
      - name: Evaluate model
        run: |
          python src/evaluate.py \
            --model-path models/model.pkl \
            --test-path data/test.csv
      - name: Check validation thresholds
        run: |
          ACCURACY=$(python src/get_metric.py --metric accuracy)
          if (( $(echo "$ACCURACY < 0.85" | bc -l) )); then
            echo "Accuracy $ACCURACY below threshold 0.85"
            exit 1
          fi
      - name: Register model to MLflow
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_USERNAME }}
          MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_PASSWORD }}
        run: |
          python src/register_model.py \
            --model-uri "runs:/${{ env.RUN_ID }}/model" \
            --stage Production
      - name: Commit DVC lock updates
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        run: |
          git config user.name "ML Bot"
          git config user.email "ml-bot@company.com"
          git add dvc.lock
          git diff --quiet && git diff --staged --quiet || git commit -m "[ci] Update DVC lock after training"
          git push
      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: |
            metrics.json
            confusion_matrix.png
          retention-days: 30
Why this order?
Steps execute sequentially, and each depends on the one before it: checkout (get the code), set up Python (runtime), install dependencies (DVC and MLflow CLIs), pull data (DVC needs the remote credentials), train (needs the data), evaluate (needs the model), check thresholds (a failure here stops the job before registration), register (main branch only, after validation has passed), commit the lock file (updates the reproducibility artifact), and upload artifacts (diagnostic output). A failed step stops the job by default, so later steps are skipped. The if: conditions additionally keep registration and the lock-file commit off pull request runs, and if: always() makes the artifact upload run even after a failure.
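The threshold step calls src/get_metric.py, which this section does not show. A minimal sketch of what such a script might look like, assuming src/evaluate.py writes a flat JSON object such as {"accuracy": 0.91} to metrics.json. The file name matches the uploaded artifact, but the JSON layout and the --metrics-file flag are assumptions made for illustration:

```python
"""Hypothetical src/get_metric.py: print one metric from a JSON metrics file."""
import argparse
import json
import sys
from pathlib import Path


def read_metric(metrics_file: Path, name: str) -> float:
    """Return the named metric as a float; raises KeyError if it is absent."""
    with metrics_file.open() as fh:
        metrics = json.load(fh)
    return float(metrics[name])


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Print a single metric value")
    parser.add_argument("--metric", required=True)
    # Assumed default: evaluate.py writes metrics.json in the working directory.
    parser.add_argument("--metrics-file", type=Path, default=Path("metrics.json"))
    args = parser.parse_args(argv)
    try:
        value = read_metric(args.metrics_file, args.metric)
    except (OSError, KeyError, TypeError, ValueError) as exc:
        # Report the problem on stderr and signal failure through the exit code,
        # so the shell never compares an empty string against the threshold.
        print(f"cannot read metric {args.metric!r}: {exc}", file=sys.stderr)
        return 1
    print(value)
    return 0


# To use this as a script, end the file with:
#     if __name__ == "__main__":
#         sys.exit(main())
```

Returning a non-zero exit code when the metric is missing matters here: GitHub Actions runs bash steps with errexit enabled, so the ACCURACY=$(...) assignment should fail the step rather than pass an empty value to bc.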
Wrong vs Right
name: Bad CI Pipeline
on:
  push:
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python src/train.py
      - run: python src/register_model.py
Use environment secrets for credentials, add DVC pull with remote config, include validation thresholds, conditionally register only on main branch, cache Python dependencies, and set explicit timeout to avoid runners hanging.
Tool vitals
Run: gh workflow run <workflow-name>, or automatically on git push
Config file: .github/workflows/train.yml
Check status: gh run list --repo <owner>/<repo> --limit 1 && gh run view <run-id> --log
Integration notes
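This workflow bridges three tools: GitHub Actions (orchestration) → DVC (data versioning, pulls from S3/GCS) → MLflow (metrics, model registry). MLflow runs should be tagged with github.sha and github.ref for an audit trail; the training script has to set these tags itself. The model artifact URI has the form runs:/<run-id>/model, which MLflow resolves against its artifact store (local or remote). Committing dvc.lock after training supports reproducibility: the next dvc repro run locally will use the exact same data versions. Note that dvc.lock only changes when a stage runs through DVC (for example with dvc repro), so if the workflow calls src/train.py directly, as above, the commit step has nothing to commit.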
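The tagging and the hand-off of the run ID to the registration step both have to happen inside the training script. A minimal sketch of how src/train.py might do this, assuming the model is logged under the artifact path "model". GITHUB_SHA, GITHUB_REF, and GITHUB_ENV are variables GitHub Actions sets on every run; the function names are hypothetical:

```python
"""Sketch: tag an MLflow run with the commit and pass its ID to later steps."""
import os


def github_tags(env=os.environ) -> dict:
    """Build audit tags from variables GitHub Actions sets on every run."""
    return {
        "github.sha": env.get("GITHUB_SHA", "unknown"),
        "github.ref": env.get("GITHUB_REF", "unknown"),
    }


def export_run_id(run_id: str, env=os.environ) -> bool:
    """Append RUN_ID to the file named by GITHUB_ENV so later steps see env.RUN_ID.

    Returns False when GITHUB_ENV is unset (for example on a laptop).
    """
    env_file = env.get("GITHUB_ENV")
    if not env_file:
        return False
    with open(env_file, "a", encoding="utf-8") as fh:
        fh.write(f"RUN_ID={run_id}\n")
    return True


def train_with_audit_trail(experiment_name: str) -> str:
    """Open an MLflow run, tag it, and export its ID. Training itself is elided."""
    import mlflow  # imported here so the helpers above work without MLflow installed

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run() as run:
        mlflow.set_tags(github_tags())
        # ... fit the model and log it under the artifact path "model" here ...
        export_run_id(run.info.run_id)
        return run.info.run_id
```

Values written to the GITHUB_ENV file become available to subsequent steps of the same job, which is what lets the registration step build runs:/${{ env.RUN_ID }}/model.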
Migration path
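Start with GitHub Actions for training runs of about 10 minutes or less. If training approaches the 6-hour job limit on GitHub-hosted runners, move to self-hosted runners (which you must provision and maintain), or to an orchestrator such as Airflow (complex scheduling) or Kubeflow (requires Kubernetes). Matrix builds run independent jobs in parallel, for example one job per hyperparameter configuration; they do not spread a single training job across several GPUs. For multi-GPU training, use GPU-equipped self-hosted runners or switch to a cloud-native CI service (AWS CodePipeline, GCP Cloud Build) with GPU machines.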
Cost model
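GitHub Actions is free for public repos on standard runners and includes 2,000 minutes/month for private repos on the Free plan (then $0.008/minute for a standard Linux runner). Costs scale with training frequency and run time. Hidden costs: MLflow backend storage (AWS RDS ~$30/mo for a small instance), DVC remote storage (S3: $0.023/GB/month), and GitHub Artifacts storage, which has a plan-dependent free quota (500 MB on the Free plan) and is billed at roughly $0.25/GB/month beyond it. These figures change over time, so confirm them against the providers' current pricing pages. For high-volume retraining (daily), consider reducing workflow triggers (e.g., skip docs-only changes with a paths: filter) or batching runs.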
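To see how training frequency drives the bill, here is a rough estimator. It is a sketch that uses the free quota and per-minute rate quoted above as defaults, both of which are assumptions that may be out of date; the function name is hypothetical:

```python
def monthly_actions_cost(runs_per_month, minutes_per_run,
                         free_minutes=2000, rate_per_minute=0.008):
    """Estimate private-repo runner usage for one month.

    Returns (total_minutes, billable_minutes, cost_in_dollars).
    The defaults are the figures quoted in this section, not live prices.
    """
    total_minutes = runs_per_month * minutes_per_run
    billable_minutes = max(0, total_minutes - free_minutes)
    cost = round(billable_minutes * rate_per_minute, 2)
    return total_minutes, billable_minutes, cost


# One nightly 30-minute run stays inside the free quota: 900 minutes, $0.
print(monthly_actions_cost(runs_per_month=30, minutes_per_run=30))

# Twenty 30-minute runs a day (600 a month) use 18,000 minutes,
# of which 16,000 are billable: $128.
print(monthly_actions_cost(runs_per_month=600, minutes_per_run=30))
```

The gap between the two cases is why a paths: filter or a nightly schedule is usually the first cost lever to pull.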
Common gotcha
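GitHub Actions masks secrets in logs, but inside the runner they are ordinary plaintext environment variables for every step that receives them. A step can leak them by printing its environment, by transforming the value (masking only matches the exact string), or by writing it to an artifact. Use environment: ml-prod with a deployment branch rule to restrict secret access to the main branch, and never echo ${{ secrets.* }} directly. Be aware that a job-level environment restricted to main also rejects pull_request runs of that job, so PR validation then needs a separate job that does not use the environment. Second gotcha: dvc pull can fail partway through if remote credentials are missing or expire mid-run. dvc remote list only prints the configured remotes without contacting them, so it does not validate credentials; a command that talks to the remote, such as dvc status --cloud, is a better check before the train step. Third gotcha: the registration step does not see changes that training made to dvc.lock until they are committed. If the registered model should reference the updated lock file, run git add dvc.lock && git commit in a separate step before registration.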
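A cheap guard against the first two gotchas is a preflight step that checks the expected variables are present and reports only their names. The sketch below is hypothetical (the script and function names are not part of the workflow above) and relies on the fact that an unset secret reaches the step as an empty string. It catches missing credentials, not expired ones:

```python
"""Hypothetical preflight check: fail fast when required variables are absent."""
import os

# Names taken from the workflow's env: blocks.
REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "MLFLOW_TRACKING_URI")


def missing_secrets(required=REQUIRED, env=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


def preflight(required=REQUIRED, env=os.environ) -> int:
    """Print the names (never the values) of missing variables; return an exit code."""
    missing = missing_secrets(required, env)
    for name in missing:
        print(f"missing required variable: {name}")
    return 1 if missing else 0
```

Calling sys.exit(preflight()) from a small script before dvc pull turns a confusing mid-run failure into an immediate, readable one.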
Team adoption
1. Create .github/workflows/train.yml (the .github directory sits in the repo root) and require code review for workflow changes.
2. Add a CODEOWNERS file that requires ML team sign-off on .github/workflows/.
3. Document expected failure modes in your wiki: what to do when validation fails (debug locally with dvc repro, fix data, push again).
4. Set up GitHub branch protection to require workflow success before merge: this forces CI quality.
5. Create a Slack/Teams integration to notify the team on workflow failures (for Slack, for example, with the slackapi/slack-github-action action).
6. Schedule nightly training runs to catch data drift early: add a schedule: - cron: '0 2 * * *' trigger (02:00 UTC every day).
Experienced dev note
Most teams discover this too late: always set timeout-minutes at the job level. GitHub's default job timeout is 6 hours (360 minutes), so a hung training process burns runner minutes unnoticed. Also use cache: 'pip' in the setup-python action to avoid re-downloading 500MB of deps on every run. And set environment: at the job level so that the environment's protection rules gate access to its secrets. By default, workflows triggered by pull requests from forks do not receive repository secrets at all.
Check your understanding
Why does this workflow use if: github.event_name == 'push' && github.ref == 'refs/heads/main' for the registration step instead of always registering? What would break if you removed this condition?
Answer hint
PR branches should not register models because their code is unreviewed and has not been validated against production data. Removing the condition would register a model on every PR run and, because the step passes --stage Production, could promote an unreviewed model to production while cluttering the registry with throwaway versions.