Tool Beginner easy · 8 min integration

GitHub Actions for ML

What you will learn

Automate ML training, testing, and model registration whenever code changes using GitHub Actions workflows.

Why this matters

Without CI/CD, ML experiments and model updates are manual and error-prone. GitHub Actions lets you automatically run experiments, log metrics to MLflow, version datasets with DVC, and trigger deployments, all without leaving your GitHub repo. Without it, you risk undetected breaking changes, inconsistent training runs, and an unmaintained model registry.

Skip if: You're a single developer on a small personal project, where a notebook plus a manual MLflow run is fine. If your team uses GitLab or Bitbucket natively, use their CI instead. If you need complex scheduling across a Kubernetes cluster, jump to K8s CronJobs: by default GitHub Actions runs on GitHub-hosted runners outside your cluster, and each run adds queueing and startup latency.

Explanation

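GitHub Actions is GitHub's built-in CI/CD system. It runs workflows in response to events (push, pull request, schedule). For ML, the usual workflow pattern is: (1) check out the code, (2) install dependencies and pull data with DVC, (3) run the training script with logging to MLflow, (4) if metrics pass, register the model in the MLflow Model Registry, (5) optionally deploy. The workflow lives in .github/workflows/ as a YAML file, so it is versioned with your code and needs no configuration in the GitHub UI. Steps within a job run in order on the same runner and share its filesystem, but each run step starts a fresh shell, so pass values between steps through files or by writing to $GITHUB_ENV. Separate jobs run on separate runners and exchange files through artifacts. Standard GitHub-hosted runners are free for public repos, and the Free plan includes 2,000 minutes/month for private repos, which is enough for most small teams.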

Configuration

yaml

name: ML Training Pipeline
on:
  push:
    branches:
      - main
    paths:
      - 'src/train.py'
      - 'data/raw/**'
  pull_request:
    branches:
      - main
  schedule:
    - cron: '0 2 * * MON'
jobs:
  train-and-register:
    runs-on: ubuntu-latest
    env:
      MLFLOW_TRACKING_URI: https://mlflow.example.com
      MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_USER }}
      MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-dvc@v1
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
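      - name: Pull data with DVC
        run: |
          dvc remote add -d myremote s3://my-bucket/dvc-storage
          dvc pull
        env:
          # The S3 remote needs credentials; these secret names are assumptions
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}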
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run training
        run: |
          python src/train.py \
            --epochs 50 \
            --batch-size 32 \
            --log-mlflow
        env:
          PYTHONUNBUFFERED: 1
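      - name: Register model if metrics pass
        # Skip registration on pull requests so unreviewed code cannot publish a model.
        if: github.event_name != 'pull_request'
        run: |
          python -c "
          import mlflow
          client = mlflow.tracking.MlflowClient()
          # Experiment 0 is the MLflow default; this picks the best run in it, which may be an older run.
          runs = client.search_runs(['0'], order_by=['metrics.accuracy DESC'], max_results=1)
          accuracy = runs[0].data.metrics.get('accuracy', 0.0) if runs else 0.0
          if accuracy > 0.88:
            mlflow.register_model(
              model_uri=f'runs:/{runs[0].info.run_id}/model',
              name='iris-classifier'
            )
            print('Model registered')
          else:
            print('Accuracy below threshold, skipping registration')
          "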
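      - name: Push updated model to DVC (optional)
        # Pull request checkouts are detached and cannot be pushed back to a branch.
        if: github.event_name != 'pull_request'
        run: |
          dvc add models/current.pkl
          dvc push  # upload the artifact itself; git only tracks the .dvc pointer file
          git add models/current.pkl.dvc
          git config user.email 'ci@example.com'
          git config user.name 'CI Bot'
          git commit -m 'Update model artifact' || echo 'No changes'
          git push  # requires write permission on repository contents for the token
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}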

Why this order?

The on: block comes first because it defines WHEN the workflow runs. jobs then defines the execution graph, and within each job the steps run sequentially, top to bottom. That ordering drives the rest: secrets must be in the environment before any step uses them, Python setup must precede pip install, and dvc pull must finish before the training script tries to read data. If you pull after training, the data files are missing or stale when the script runs.
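The registration gate in the workflow can also live in a small script instead of an inline python -c string, which makes it easy to unit-test. The following is a minimal sketch, assuming the metrics arrive as a plain dict (the shape of run.data.metrics in MLflow). The function name should_register is hypothetical; the 0.88 default is the threshold used in the configuration above.

```python
from typing import Mapping


def should_register(metrics: Mapping[str, float], metric: str = "accuracy",
                    threshold: float = 0.88) -> bool:
    """Return True only when the metric was logged and strictly exceeds the threshold."""
    value = metrics.get(metric)
    if value is None:
        # A missing metric counts as a failed gate instead of raising KeyError.
        return False
    return value > threshold


if __name__ == "__main__":
    print(should_register({"accuracy": 0.91}))  # True
    print(should_register({"accuracy": 0.88}))  # False: the comparison is strict
    print(should_register({}))                  # False: metric was never logged
```

Treating a missing metric as a failure is a design choice: a training run that forgot to log accuracy should not be able to publish a model.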

Wrong vs Right

Wrong way

yaml

name: Train Model
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python src/train.py
      - run: mlflow ui &
      - run: sleep 5 && curl http://localhost:5000

Right way

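The wrong version installs no dependencies, configures no tracking server, and starts <code>mlflow ui</code> on the runner, where it is destroyed as soon as the job ends. To fix it, add the tracking settings to the env block, e.g. <code>MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}</code>. Set up Python with the official <code>actions/setup-python@v5</code> action (its built-in pip caching speeds up installs) and install your requirements before training. Replace <code>mlflow ui</code> with a separate tracking server that runs 24/7, reached through the MLflow client or REST API. Remove the localhost calls: anything started on the runner is ephemeral, so the runs it records are lost when the job finishes.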

Tool vitals

Primary command

bash

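Push to the GitHub repo, or trigger a run manually with <code>gh workflow run train-and-register.yml</code>. A manual trigger only works if the workflow's <code>on:</code> block includes <code>workflow_dispatch</code>.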

Config file .github/workflows/train-and-register.yml

Verify

bash

gh workflow list --repo <owner>/<repo>

Integration notes

This workflow assumes MLflow is reachable at a stable URL (e.g., a managed service or a persistent EC2 instance). If the MLflow server sits on a private network, containerized or not, GitHub-hosted runners cannot reach it without a VPN, a bastion host, or a self-hosted runner inside that network. For DVC, the remote storage (S3, GCS, Azure Blob) must be accessible via credentials stored in GitHub Secrets. For K8s deployments, add a step that calls kubectl apply -f deployment.yaml after model registration, assuming your runner has kubeconfig access to the cluster.

Migration path

If you outgrow GitHub Actions (complex DAGs, long-running jobs, custom resource needs), migrate to Airflow, Prefect, or GitLab CI. There is no automatic export: you rewrite the workflow by hand, for example as a Python DAG using Airflow's @task decorator, where each GitHub Actions step becomes an Airflow task. Your MLflow and DVC calls stay the same; only the trigger and execution context change.

Cost model

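GitHub Actions is free for public repos on standard GitHub-hosted runners, and the Free plan includes 2,000 minutes/month for private repos. Each job consumes minutes based on OS: Linux is 1x, Windows is 2x, macOS is 10x. If you train a model on Linux for 30 minutes 10 times a month, you use 300 minutes: well within the free quota. The same schedule on macOS would count as 3,000 minutes. Once the included minutes run out, you pay a per-minute rate that depends on runner OS and size; for standard Linux runners it has been on the order of a cent per minute or less, so check GitHub's billing documentation for current figures.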
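The quota arithmetic is easy to check with a few lines of Python. This is a rough sketch that uses only the multipliers and the 2,000-minute quota quoted in this section; the function names are hypothetical, and it deliberately leaves out any price per minute because that figure changes.

```python
# Minute multipliers quoted in this section (Linux 1x, Windows 2x, macOS 10x).
MULTIPLIERS = {"linux": 1, "windows": 2, "macos": 10}


def billed_minutes(runs_per_month: int, minutes_per_run: int, os_name: str = "linux") -> int:
    """Minutes counted against the monthly quota for one recurring job."""
    return runs_per_month * minutes_per_run * MULTIPLIERS[os_name]


def overage_minutes(billed: int, included: int = 2000) -> int:
    """Minutes beyond the included quota (0 while still inside it)."""
    return max(0, billed - included)


if __name__ == "__main__":
    linux = billed_minutes(10, 30)           # the 30-minute, 10-runs example
    macos = billed_minutes(10, 30, "macos")  # same schedule on macOS
    print(linux, overage_minutes(linux))     # 300 0
    print(macos, overage_minutes(macos))     # 3000 1000
```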

Common gotcha

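Secrets like MLFLOW_TRACKING_PASSWORD are masked as *** in logs, and a secret that is missing or misspelled resolves to an empty string instead of raising an error. An authentication failure therefore shows up as a generic 401/403 or connection error with no hint about which credential was wrong. Secrets are also not passed to workflows triggered by pull requests from forks. Always test locally with real credentials first. On caching: setup-python with cache: 'pip' keys the cache on a hash of your requirements file, so changing a pinned version does invalidate it. The cache will not help if your dependencies live in a file the action does not hash (set cache-dependency-path in that case), and it stores pip's download cache, not installed packages, so pip install still runs on every job.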

Team adoption

Create a template repo with this workflow pre-configured. In your org, publish it as a workflow template (GitHub's starter-workflow mechanism, defined in the org's .github repository) to make it discoverable. Run a 30-minute sync: show the team how to add --log-mlflow to their scripts, store credentials under Settings → Secrets and variables → Actions, and watch runs auto-execute on push. Start with just training; add model registration and deployment once confidence grows.

Experienced dev note

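For finer control than setup-python's built-in caching, use actions/cache@v4 with a custom key based on your Python version and requirements hash: key: python-${{ matrix.python-version }}-${{ hashFiles('**/requirements.txt') }}. Note that matrix.python-version only exists if the job defines a matrix strategy; otherwise hard-code the version. A cache hit can cut install time substantially on repeat runs, though the saving depends on how heavy your dependencies are. Also, split your workflow into multiple jobs. Jobs run in parallel by default, and needs: [train] makes a job wait until train has succeeded. Use needs to order stages that depend on each other (train, then validate, then deploy) while independent jobs such as linting and unit tests run at the same time.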
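To see why a content hash makes a good cache key, the idea can be mimicked locally. This sketch is only an illustration: hashFiles is based on SHA-256, but its exact algorithm for combining files is not reproduced here, and the cache_key helper is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(python_version: str, requirements_path: str) -> str:
    """Build a key that changes when the Python version or the file contents change."""
    digest = hashlib.sha256(Path(requirements_path).read_bytes()).hexdigest()
    return f"python-{python_version}-{digest}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        req = Path(tmp) / "requirements.txt"
        req.write_text("mlflow==2.9.0\n")  # illustrative pin, not from the tutorial
        before = cache_key("3.11", str(req))
        req.write_text("mlflow==2.10.0\n")  # bumping the pin changes the digest
        after = cache_key("3.11", str(req))
        print(before != after)  # True: a new key means the old cache is not reused
```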

Check your understanding

Why does the workflow pull DVC data BEFORE running training, and not after? What happens if you reverse this order?

Answer hint

Your training script reads data from disk (e.g., <code>pd.read_csv('data/raw/train.csv')</code>). If DVC hasn't pulled yet, that file doesn't exist locally: your script fails immediately. Reversing the order means your training runs against stale or missing data, or fails silently if error handling is weak.
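One way to avoid the silent-failure case is to have the training script check for its inputs up front. This is a minimal sketch; require_data is a hypothetical helper, and data/raw/train.csv is the example path from the hint above.

```python
from pathlib import Path


def require_data(path: str) -> Path:
    """Return the path if the file exists; otherwise fail fast with an actionable message."""
    data_path = Path(path)
    if not data_path.is_file():
        raise FileNotFoundError(
            f"{data_path} not found. Run `dvc pull` before training."
        )
    return data_path


if __name__ == "__main__":
    try:
        require_data("data/raw/train.csv")
        print("Data present, safe to train")
    except FileNotFoundError as err:
        print(f"Aborting: {err}")
```

Called at the top of the training script, this turns a reversed step order into an immediate, clearly labelled failure in the workflow log.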
