GitHub Actions for ML
Why this matters
Without CI/CD, ML experiments and model updates are manual and error-prone. GitHub Actions lets you automatically run experiments, log metrics to MLflow, version datasets with DVC, and trigger deployments, all without leaving your GitHub repo. Without it, breaking changes slip through unnoticed, training runs become inconsistent, and model registries go unmaintained.
Explanation
GitHub Actions is GitHub's native CI/CD system: it runs workflows in response to events (push, pull request, schedule). For ML, the workflow pattern is: (1) check out the code, (2) install dependencies and pull DVC-tracked data, (3) run the training script with logging to MLflow, (4) if metrics pass, register the model in the MLflow Model Registry, (5) optionally deploy. The workflow lives in .github/workflows/ and is pure YAML: no GitHub UI required. Jobs run in isolation on separate runners, while the steps within a job share one runner and its filesystem; you pass values between steps through files, step outputs, or environment variables, and between jobs through artifacts or job outputs. GitHub provides free usage for public repos and 2,000 minutes/month for private repos on the free plan, which is enough for most small teams.
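Step (4) of the pattern above, the metric gate, can be sketched in plain Python. This is a minimal illustration and not MLflow's API: the `Run` tuple and the `best_run_above` helper are hypothetical names, and the 0.88 threshold is borrowed from the workflow in the Configuration section.

```python
from typing import NamedTuple, Optional, Sequence


class Run(NamedTuple):
    """Hypothetical stand-in for a tracked run: an ID plus its logged accuracy."""
    run_id: str
    accuracy: float


def best_run_above(runs: Sequence[Run], threshold: float) -> Optional[Run]:
    """Return the most accurate run if it beats the threshold, else None."""
    if not runs:
        return None  # nothing was logged, so nothing can be registered
    best = max(runs, key=lambda r: r.accuracy)
    return best if best.accuracy > threshold else None


if __name__ == "__main__":
    # Illustrative values only, not measured results.
    candidates = [Run("a1", 0.84), Run("b2", 0.91), Run("c3", 0.87)]
    chosen = best_run_above(candidates, threshold=0.88)
    print(chosen.run_id if chosen else "skip registration")  # b2
```

Keeping the decision in a small pure function makes the gate unit-testable without a tracking server; the CI step then only has to fetch the runs and act on the result.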
Configuration
name: ML Training Pipeline

on:
  push:
    branches:
      - main
    paths:
      - 'src/train.py'
      - 'data/raw/**'
  pull_request:
    branches:
      - main
  schedule:
    - cron: '0 2 * * MON'  # every Monday at 02:00 UTC
  workflow_dispatch:  # allows manual runs, e.g. via gh workflow run

jobs:
  train-and-register:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # lets the final step push with GITHUB_TOKEN
    env:
      MLFLOW_TRACKING_URI: https://mlflow.example.com
      MLFLOW_TRACKING_USERNAME: ${{ secrets.MLFLOW_USER }}
      MLFLOW_TRACKING_PASSWORD: ${{ secrets.MLFLOW_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: iterative/setup-dvc@v1
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Pull data with DVC
        run: |
          dvc remote add -d myremote s3://my-bucket/dvc-storage
          dvc pull
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run training
        run: |
          python src/train.py \
            --epochs 50 \
            --batch-size 32 \
            --log-mlflow
        env:
          PYTHONUNBUFFERED: '1'
      - name: Register model if metrics pass
        if: success()
        run: |
          python -c "
          import mlflow
          client = mlflow.tracking.MlflowClient()
          # search_runs expects a list of experiment IDs; '0' is the default experiment
          best_run = client.search_runs(['0'], order_by=['metrics.accuracy DESC'])[0]
          if best_run.data.metrics['accuracy'] > 0.88:
              mlflow.register_model(
                  model_uri=f'runs:/{best_run.info.run_id}/model',
                  name='iris-classifier'
              )
              print('Model registered')
          else:
              print('Accuracy below threshold, skipping registration')
          "
      - name: Push updated model to DVC (optional)
        # Skipped on pull requests, where the checkout is a detached merge ref
        if: success() && github.event_name != 'pull_request'
        run: |
          dvc add models/current.pkl
          git add models/current.pkl.dvc
          git config user.email 'ci@example.com'
          git config user.name 'CI Bot'
          git commit -m 'Update model artifact' || echo 'No changes'
          git push
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Why this order?
The on: block comes first because it defines when the workflow runs. The jobs block then defines the execution graph, and within each job, steps run sequentially from top to bottom. That is why secrets are injected through env before any step uses them, Python is set up before pip install, and dvc pull runs before the training script tries to read data. If you pulled after training, the script would either fail on missing files or train on whatever stale data happened to be in the workspace.
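One way to make an ordering mistake fail loudly is a preflight check at the top of the training script. The sketch below is a hedged illustration: `missing_inputs` and `require_inputs` are hypothetical helpers, and `data/raw/train.csv` is an assumed input path that you would replace with your own.

```python
from pathlib import Path
from typing import Iterable, List


def missing_inputs(paths: Iterable[str]) -> List[str]:
    """Return the input paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


def require_inputs(paths: Iterable[str]) -> None:
    """Raise early, with a hint, if any expected data file is absent."""
    missing = missing_inputs(paths)
    if missing:
        raise FileNotFoundError(
            f"Missing input data: {', '.join(missing)}. Did `dvc pull` run first?"
        )


if __name__ == "__main__":
    # Assumed path; substitute the files your training script actually reads.
    try:
        require_inputs(["data/raw/train.csv"])
        print("inputs present, safe to train")
    except FileNotFoundError as err:
        print(err)
```

Failing before any training work starts keeps the CI log short and points directly at the step that was skipped or misordered.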
Wrong vs Right
Wrong:
name: Train Model
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python src/train.py
      - run: mlflow ui &
      - run: sleep 5 && curl http://localhost:5000
Right:
Add secrets to the env block: <code>MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}</code>. Set up Python with an official action such as <code>actions/setup-python@v5</code>, which pins the interpreter version and enables dependency caching, and install dependencies before training. Replace <code>mlflow ui</code> with direct MLflow REST API calls or a separate tracking server running 24/7. Remove the localhost calls: anything served on the runner disappears when the job ends, so runs logged there are lost.
Tool vitals
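Trigger: push to the GitHub repo, or run manually via <code>gh workflow run</code> (manual runs require a workflow_dispatch trigger in the on: block)
Workflow file: .github/workflows/train-and-register.yml
List workflows: gh workflow list --repo <owner>/<repo>
Integration notes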
This workflow assumes MLflow is reachable at a static URL (e.g., a managed service or a persistent EC2 instance). If MLflow runs only inside a private network (for example, as a container on an internal host), GitHub-hosted runners cannot reach it without a VPN, a bastion host, or a self-hosted runner. For DVC, the remote storage (S3, GCS, Azure Blob) must be accessible via credentials stored in GitHub Secrets and exposed to the DVC step as environment variables. For Kubernetes deployments, add a step that calls kubectl apply -f deployment.yaml after model registration, assuming your runner has kubeconfig access.
Migration path
If you outgrow GitHub Actions (complex DAGs, long-running jobs, custom resource needs), migrate to Airflow, Prefect, or GitLab CI. There is no automatic export: you rewrite the workflow by hand, for example as a Python DAG using Airflow's @task decorator, where each GitHub Actions step becomes an Airflow task. Your MLflow and DVC calls stay the same; only the trigger and execution context change.
Cost model
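GitHub Actions is free for public repos and includes 2,000 minutes/month for private repos on the free plan. Each job consumes minutes based on OS: Linux counts 1x, Windows 2x, macOS 10x. If you train a model for 30 minutes 10 times a month on Linux, you use 300 minutes: well within the free quota. Beyond the quota, usage is billed per minute at a rate that depends on the runner type; a standard Linux runner has historically cost under one cent per minute, so check GitHub's current pricing page before budgeting (the author considers this generous compared with CircleCI or hosted Jenkins offerings).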
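The quota arithmetic above can be sketched as a small calculator. It uses only the multipliers and the 2,000-minute quota quoted in the text; `billed_minutes` and `remaining_quota` are hypothetical helper names, and per-minute prices are deliberately left out because they change.

```python
# Per-OS minute multipliers as quoted in the text above.
MULTIPLIERS = {"linux": 1, "windows": 2, "macos": 10}


def billed_minutes(runs_per_month: int, minutes_per_run: int, os_name: str = "linux") -> int:
    """Minutes counted against the quota: runs x duration x OS multiplier."""
    return runs_per_month * minutes_per_run * MULTIPLIERS[os_name]


def remaining_quota(used: int, quota: int = 2000) -> int:
    """Quota left after usage; a negative value is the overage that gets billed."""
    return quota - used


if __name__ == "__main__":
    used = billed_minutes(runs_per_month=10, minutes_per_run=30)
    print(used, remaining_quota(used))  # 300 1700
```

The same ten 30-minute runs on macOS would count as 3,000 minutes and exceed the quota, which is why the runner OS matters as much as the run length.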
Common gotcha
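Secrets like MLFLOW_TRACKING_PASSWORD are masked in logs, and a secret that is missing or misnamed reaches the job as an empty string, so an authentication failure often surfaces as a generic error such as 'Connection refused' or a 401 with little detail. Always test locally with real credentials first. Also, the pip cache from actions/setup-python is keyed on the hash of requirements.txt: editing a pinned version changes the key, so the next run starts from a cold cache, while unpinned dependencies can resolve to different versions between runs without any change to the file, which produces ghost failures. Pin your versions, and run pip install --upgrade pip in workflows even if you don't need it locally.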
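Because masked secrets give so little feedback, a short preflight that reports which settings are blank (without ever printing their values) can save a debugging round. This is a sketch under assumptions: `unset_settings` is a hypothetical helper, and the variable names are the ones from the workflow's env block.

```python
from typing import List, Mapping, Sequence

# Variable names taken from the workflow's env block.
REQUIRED = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def unset_settings(env: Mapping[str, str], required: Sequence[str] = REQUIRED) -> List[str]:
    """Names that are absent or blank; the values themselves are never returned."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    # A secret that is not defined in the repo arrives as an empty string.
    example_env = {
        "MLFLOW_TRACKING_URI": "https://mlflow.example.com",
        "MLFLOW_TRACKING_USERNAME": "ci",
        "MLFLOW_TRACKING_PASSWORD": "",
    }
    print(unset_settings(example_env))  # ['MLFLOW_TRACKING_PASSWORD']
```

In a real job you would pass `os.environ` instead of the example dictionary and exit with a non-zero status when the returned list is not empty.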
Team adoption
Create a template repo with this workflow pre-configured. In your org, use GitHub's 'Starter Workflows' feature to make it discoverable. Run a 30-minute sync: show the team how to add --log-mlflow to their scripts, store credentials in Settings → Secrets, and watch runs auto-execute on push. Start with just training; add model registration and deployment once confidence grows.
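For the sync, it helps to show what adding --log-mlflow to a script might look like. The sketch below is one possible shape, not the workflow's actual src/train.py (which is not shown here): the `train` function is a placeholder returning a dummy metric, and MLflow is imported lazily so the script still runs offline without the flag.

```python
import argparse
from typing import Dict, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags the workflow passes to the training script."""
    parser = argparse.ArgumentParser(description="Train a model")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=32)
    # Opt-in flag: local runs stay offline unless logging is requested.
    parser.add_argument("--log-mlflow", action="store_true")
    return parser.parse_args(argv)


def train(epochs: int, batch_size: int) -> Dict[str, float]:
    """Placeholder training loop; returns a dummy metric, not a real result."""
    return {"accuracy": 0.0}


def main(argv: Optional[List[str]] = None) -> Dict[str, float]:
    args = parse_args(argv)
    metrics = train(args.epochs, args.batch_size)
    if args.log_mlflow:
        import mlflow  # lazy import: only needed when logging is enabled

        with mlflow.start_run():
            mlflow.log_params({"epochs": args.epochs, "batch_size": args.batch_size})
            mlflow.log_metrics(metrics)
    return metrics


if __name__ == "__main__":
    # Explicit argv for the demo; a real script would call main() to read sys.argv.
    print(main(["--epochs", "50", "--batch-size", "32"]))
```

Making logging opt-in lets teammates adopt the flag one script at a time, which matches the advice to start with training only.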
Experienced dev note
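Use actions/cache@v4 with a custom key based on your Python version and requirements hash: key: python-${{ matrix.python-version }}-${{ hashFiles('**/requirements.txt') }} (this assumes a matrix that defines python-version; otherwise hard-code the version). This can reduce workflow time on repeat runs, by as much as 70% when dependency installation dominates. Also, split your workflow into multiple jobs: jobs run in parallel by default, and needs: [train] makes a job wait for training to finish. For example, linting and unit tests can run alongside training, while validation and deployment declare needs so they start only after the job they depend on succeeds.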
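To see why a key built from the Python version and a requirements hash behaves well, here is a rough Python analogue. It is an illustration only: `cache_key` is a hypothetical function, the pinned versions are made up, and the digest will not match what GitHub's hashFiles produces.

```python
import hashlib


def cache_key(python_version: str, requirements_text: str) -> str:
    """Build a key that changes whenever the interpreter or the pins change."""
    digest = hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()
    return f"python-{python_version}-{digest}"


if __name__ == "__main__":
    # Hypothetical pins, used only to show the key changing.
    old = cache_key("3.11", "mlflow==2.9.0\n")
    new = cache_key("3.11", "mlflow==2.9.1\n")
    print(old != new)  # True: editing a pin yields a different key
```

An unchanged requirements file always maps to the same key, so repeat runs hit the cache, while any edit to the pins forces a fresh install.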
Check your understanding
Why does the workflow pull DVC data BEFORE running training, and not after? What happens if you reverse this order?
Answer hint
Your training script reads data from disk (e.g., <code>pd.read_csv('data/raw/train.csv')</code>). If DVC hasn't pulled yet, that file doesn't exist locally: your script fails immediately. Reversing the order means your training runs against stale or missing data, or fails silently if error handling is weak.