Tool Intermediate medium · 8 min integration

Phase 3: CI/CD: GitHub Actions for ML Pipelines

What you will learn

Automate model training, validation, and registry pushes on every commit using GitHub Actions with MLflow and DVC.

Why this matters

Without CI/CD automation, models drift untracked: experiments run locally on different machines, training scripts diverge from the main branch, and nobody knows which version is production-ready. GitHub Actions enforces reproducibility: every qualifying push triggers a controlled training run, metrics are logged to MLflow automatically, and only validated models reach the registry.

Skip if: simpler triggers (e.g., cron jobs or manual runs) are enough when you're prototyping alone with <5 experiments/week, your training takes >1 hour (GitHub-hosted runners stop a job after 6 hours), or your data is >100GB (standard GitHub-hosted runners provide about 14GB of disk, and DVC will need remote storage anyway). For production at scale, migrate to Kubeflow or Airflow.

Explanation

GitHub Actions listens to repository events (push, PR, schedule) and runs workflow jobs. Each job runs on a fresh runner (a virtual machine on GitHub-hosted runners, not a container unless you configure one) where you check out code, pull data via DVC, train models, log metrics to MLflow, and conditionally push to the model registry. The workflow file (YAML) lives in .github/workflows/ and defines triggers, environment variables (MLflow tracking URI, DVC remote credentials), and sequential steps. Key difference from local development: the runner is stateless and ephemeral, so all artifacts (model weights, metrics, data) must be persisted externally (MLflow backend, DVC remote storage, or GitHub Artifacts). You typically trigger on push to main (always train), pull requests (validate on test set), or a schedule (nightly retraining). The workflow passes only if training succeeds and validation metrics meet thresholds; with branch protection enabled, failures block merges and alert the team.

Configuration

yaml

name: Train and Register Model

on:
  push:
    branches:
      - main
    paths:
      - 'src/**'
      - 'data/**'
      - '.github/workflows/train.yml'
  pull_request:
    branches:
      - main

jobs:
  train:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    environment: ml-prod
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install dvc dvc-s3 mlflow
      
      - name: Configure DVC remote
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc remote modify myremote url s3://my-bucket/dvc-storage
          dvc pull
      
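      - name: Train model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_USERNAME }}
          MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_PASSWORD }}
        run: |
          python src/train.py \
            --data-path data/train.csv \
            --model-output models/model.pkl \
            --experiment-name production
          # Assumption: train.py writes the MLflow run ID to run_id.txt.
          # Export it so the registration step can read env.RUN_ID.
          echo "RUN_ID=$(cat run_id.txt)" >> "$GITHUB_ENV"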
      
      - name: Evaluate model
        run: |
          python src/evaluate.py \
            --model-path models/model.pkl \
            --test-path data/test.csv
      
      - name: Check validation thresholds
        run: |
          ACCURACY=$(python src/get_metric.py --metric accuracy)
          if (( $(echo "$ACCURACY < 0.85" | bc -l) )); then
            echo "Accuracy $ACCURACY below threshold 0.85"
            exit 1
          fi
      
      - name: Register model to MLflow
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_USERNAME }}
          MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_PASSWORD }}
        run: |
          python src/register_model.py \
            --model-uri runs:/${{ env.RUN_ID }}/model \
            --stage Production
      
      - name: Commit DVC lock updates
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        run: |
          git config user.name "ML Bot"
          git config user.email "ml-bot@company.com"
          git add dvc.lock
          # Commit and push only when dvc.lock actually changed.
          # Pushing requires the job's GITHUB_TOKEN to have contents: write permission.
          if ! git diff --staged --quiet; then
            git commit -m "[ci] Update DVC lock after training"
            git push
          fi
      
      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: |
            metrics.json
            confusion_matrix.png
          retention-days: 30

Why this order?

Steps execute sequentially, and each depends on the one before it: checkout (get code), set up Python (runtime), install deps (DVC and MLflow CLIs), pull data (DVC needs remote creds), train (depends on data), evaluate (depends on model), validate thresholds (blocks registration if metrics fail), register (only on main, and only after validation passes), commit lock (updates the reproducibility artifact), upload artifacts (diagnostic output). A failed step stops the steps after it by default, so registration never runs after a failed threshold check. The if: conditions additionally prevent registration on pull requests, while if: always() keeps the artifact upload running even after a failure.
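The threshold step calls src/get_metric.py, which is not shown here. A minimal sketch of what such a script could look like follows, assuming evaluate.py writes a flat metrics.json such as {"accuracy": 0.91}. The file layout, the --file and --min flags, and the function names are assumptions made for illustration, not part of the original pipeline.

```python
import argparse
import json
import sys
from pathlib import Path


def read_metric(path, name):
    """Return one numeric metric from a flat JSON file like {"accuracy": 0.91}."""
    metrics = json.loads(Path(path).read_text())
    if name not in metrics:
        raise KeyError(f"metric {name!r} not found in {path}")
    return float(metrics[name])


def meets_threshold(value, minimum):
    """True when the metric is at or above the required minimum."""
    return value >= minimum


def main(argv=None):
    parser = argparse.ArgumentParser(description="Print a metric and optionally gate on it.")
    parser.add_argument("--metric", required=True)
    parser.add_argument("--file", default="metrics.json")  # assumed output of evaluate.py
    parser.add_argument("--min", type=float, default=None)  # hypothetical threshold flag
    args = parser.parse_args(argv)

    value = read_metric(args.file, args.metric)
    print(value)  # stdout stays a bare number so ACCURACY=$(...) keeps working
    if args.min is not None and not meets_threshold(value, args.min):
        print(f"{args.metric} {value} below threshold {args.min}", file=sys.stderr)
        return 1
    return 0
```

To use this as the script itself, end the file with sys.exit(main()). Passing --min 0.85 would let the script's exit code fail the step directly, which avoids the bc comparison in the shell.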

Wrong vs Right

Wrong way

yaml

name: Bad CI Pipeline

on:
  push:

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python src/train.py
      - run: python src/register_model.py

Right way

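The wrong version triggers on every push to every branch, pulls no data, and registers whatever it trains without validation. Instead, use environment secrets for credentials, add DVC pull with remote config, include validation thresholds, conditionally register only on the main branch, cache Python dependencies, and set an explicit timeout so a hung run is cancelled instead of holding a runner. The workflow in the Configuration section above applies all of these.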

Tool vitals

Primary command

bash

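gh workflow run <workflow-name>

Runs also start automatically on git push. Manual dispatch with gh workflow run requires a workflow_dispatch trigger under on:, which the configuration above does not include.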

Config file .github/workflows/train.yml

Verify

bash

gh run list --repo <owner>/<repo> --limit 1 && gh run view <run-id> --log

Integration notes

This workflow bridges three tools: GitHub Actions (orchestration) → DVC (data versioning, pulls from S3/GCS) → MLflow (metrics, model registry). Tag MLflow runs with github.sha and github.ref for an audit trail; the training script has to set these tags, for example from the GITHUB_SHA and GITHUB_REF environment variables that the runner provides. The model artifact URI has the form runs:/<run_id>/model, which resolves to the artifact MLflow stored for that run (in a local or remote artifact store). Committing the updated dvc.lock after training preserves reproducibility: the next dvc repro locally will use the exact same data versions.
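The tagging and URI conventions described here can be sketched in a few lines. This is an illustration only: GITHUB_SHA and GITHUB_REF are default variables on GitHub Actions runners, while the helper names, the "unknown" fallback, and the tag keys are assumptions.

```python
import os


def github_run_tags(env=None):
    """Build MLflow tag values from the default variables GitHub Actions sets."""
    env = os.environ if env is None else env
    return {
        "github.sha": env.get("GITHUB_SHA", "unknown"),  # commit that triggered the run
        "github.ref": env.get("GITHUB_REF", "unknown"),  # e.g. refs/heads/main
    }


def model_uri(run_id, artifact_path="model"):
    """Format the runs:/<run_id>/<artifact_path> URI passed to the registry step."""
    return f"runs:/{run_id}/{artifact_path}"
```

Inside an active MLflow run in train.py, calling mlflow.set_tags(github_run_tags()) would attach both tags to the run. Outside CI the fallback value keeps local runs distinguishable from pipeline runs.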

Migration path

Start with GitHub Actions for ≤10 minute trains. If training exceeds the GitHub-hosted runner limit (6 hours per job), migrate to self-hosted runners (expensive, requires infrastructure) or move to Airflow (complex scheduling) or Kubeflow (requires Kubernetes). Matrix builds run independent jobs in parallel (for example, one per hyperparameter configuration) but do not give a single job multiple GPUs, and standard GitHub-hosted runners have no GPU. For multi-GPU training, use self-hosted GPU runners or switch to cloud-native CI (AWS CodePipeline, GCP Cloud Build) with GPU-backed compute.

Cost model

GitHub Actions: free for public repos; private repos on the Free plan get 2,000 minutes/month (then $0.008/minute on standard Linux runners). Costs scale with training frequency and run time. Hidden costs: MLflow backend storage (AWS RDS ~$30/mo for a small instance), DVC remote storage (S3: $0.023/GB/month), GitHub Artifacts storage (the included allowance depends on the plan, 500MB on the Free plan, then roughly $0.25/GB/month). For high-volume retraining (daily), consider reducing workflow triggers (e.g., skip docs-only changes with a paths: filter) or batching runs.
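A rough estimate of the runner-minute bill follows from the figures quoted here. The sketch below assumes the 2,000 free minutes and $0.008/minute rate from this section and ignores storage, runner-size multipliers, and GitHub's per-job rounding of minutes; the function name is hypothetical.

```python
def monthly_actions_cost(runs_per_month, minutes_per_run,
                         free_minutes=2000, rate_per_minute=0.008):
    """Estimate the monthly charge for runner minutes beyond the free allowance."""
    total_minutes = runs_per_month * minutes_per_run
    billable = max(0, total_minutes - free_minutes)  # nothing is owed inside the allowance
    return billable * rate_per_minute


# Nightly 30-minute retraining: 30 runs * 30 min = 900 minutes, inside the allowance.
nightly = monthly_actions_cost(runs_per_month=30, minutes_per_run=30)

# Twenty 30-minute runs per day: 600 runs * 30 min = 18,000 minutes, 16,000 billable.
busy = monthly_actions_cost(runs_per_month=600, minutes_per_run=30)
```

Under these assumptions the nightly schedule costs nothing for compute, while the busy schedule comes to about $128 per month (16,000 billable minutes at $0.008), which is where a paths: filter starts to pay off.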

Common gotcha

GitHub Actions secrets are masked in logs, but inside the runner they are ordinary environment variables: any step can leak them by printing them in a transformed form or writing them to artifacts. Use environment: ml-prod with a deployment branch rule so that only main can access the secrets, and never echo ${{ secrets.* }} directly. Second gotcha: credentials that expire mid-run can make dvc pull fail partway through, and dvc remote list does not catch this because it only prints the configured remotes. Run dvc status --cloud before the train step so an authentication problem surfaces early. Third gotcha: registering the model does not capture uncommitted changes that training made to dvc.lock. Run git add dvc.lock && git commit in a separate step if the lock file should be versioned alongside the registered model.
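One way to reduce the risk of leaking secrets through debug output is to mask sensitive-looking variables before anything is printed or written to an artifact. The sketch below is a name-based heuristic, not a guarantee: the marker list and function name are assumptions, and a value such as a tracking URI with embedded credentials would still pass through unmasked.

```python
import os

# Substrings that mark a variable name as sensitive; an assumed heuristic, extend as needed.
SENSITIVE_MARKERS = ("SECRET", "PASSWORD", "TOKEN", "KEY")


def redacted_env(env=None):
    """Return a copy of the environment with sensitive-looking values masked."""
    env = os.environ if env is None else env
    return {
        name: "***" if any(marker in name.upper() for marker in SENSITIVE_MARKERS) else value
        for name, value in env.items()
    }
```

A training script that dumps its configuration for debugging could log redacted_env() instead of os.environ, so AWS_SECRET_ACCESS_KEY and MLFLOW_TRACKING_PASSWORD never reach metrics.json or the uploaded artifacts in readable form.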

Team adoption

1. Create .github/workflows/train.yml (the .github directory sits at the repo root) and require code review for workflow changes.
2. Add a CODEOWNERS file that requires ML team sign-off on .github/workflows/.
3. Document expected failure modes in your wiki: what to do when validation fails (debug locally with dvc repro, fix data, push again).
4. Set up GitHub branch protection to require workflow success before merge: this forces CI quality.
5. Create a Slack/Teams integration to notify the team on workflow failures (for Slack, for example, the slackapi/slack-github-action action).
6. Schedule nightly training runs to catch data drift early: add a schedule: - cron: '0 2 * * *' trigger (cron times are in UTC).

Experienced dev note

Most teams discover this too late: always add timeout-minutes at the job level. GitHub's default timeout is 6 hours (360 minutes), so a hung training process burns runner minutes silently. Also use cache: 'pip' in the setup-python action to avoid re-downloading 500MB of deps on every run. Finally, set environment: at the job level so the environment's protection rules gate access to its secrets, which prevents accidental exposure to runs from unreviewed branches or PR forks.

Check your understanding

Why does this workflow use if: github.event_name == 'push' && github.ref == 'refs/heads/main' for the registration step instead of always registering? What would break if you removed this condition?

Answer hint

PR branches should not register models because they are unreviewed and have not been validated on main. Removing the condition would register a model, and promote it to the Production stage, on every pull request run, polluting the registry and risking an unreviewed model replacing the production version.
