Data Validation in CI
Why this matters
Bad data silently breaks models. Without validation in CI, corrupted, missing, or schema-mismatched data reaches production undetected, causing model failures, retraining loops, and wasted compute. Catching data issues at commit time prevents hours of debugging.
Explanation
Data validation in CI is a pre-training check that runs automatically whenever code or data changes. It verifies the schema (column names and types), detects missing values, checks statistical bounds (e.g., no prices > $1M), and flags drift against historical baselines. Tools such as Great Expectations plug into GitHub Actions or GitLab CI and fail the pipeline when data quality thresholds are unmet. Validation sits upstream of training and MLflow experiment tracking and works on data tracked with DVC: it ensures only valid data enters your training runs. In beginner MLOps, you'll combine a YAML-based CI config (GitHub Actions or GitLab CI) with a simple validation script or a Great Expectations suite.
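To make the checks listed above concrete, here is a minimal sketch of schema, missing-value, and bounds checks using only the Python standard library. It is not how Great Expectations implements them; the column names, EXPECTED_SCHEMA, BOUNDS, and both function names are hypothetical, and drift detection is left out.

```python
import csv
import io

# Hypothetical schema and bounds; the column names are illustrative only.
EXPECTED_SCHEMA = {"id": int, "price": float, "city": str}
BOUNDS = {"price": (0.0, 1_000_000.0)}


def validate_rows(rows, schema=EXPECTED_SCHEMA, bounds=BOUNDS):
    """Return a list of problems; an empty list means the data passed."""
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        # Schema check: every expected column must be present.
        missing_cols = [c for c in schema if c not in row]
        if missing_cols:
            problems.append(f"line {line_no}: missing columns {missing_cols}")
            continue
        for col, col_type in schema.items():
            raw = row[col]
            # Missing-value check (short rows yield None, blank cells yield "").
            if raw is None or raw.strip() == "":
                problems.append(f"line {line_no}: {col} is empty")
                continue
            # Type check: the raw string must parse as the expected type.
            try:
                value = col_type(raw)
            except ValueError:
                problems.append(
                    f"line {line_no}: {col}={raw!r} is not {col_type.__name__}"
                )
                continue
            # Bounds check, only for columns that declare bounds.
            if col in bounds:
                low, high = bounds[col]
                if not low <= value <= high:
                    problems.append(
                        f"line {line_no}: {col}={value} outside [{low}, {high}]"
                    )
    return problems


def validate_csv_text(text):
    """Convenience wrapper that validates CSV content held in a string."""
    return validate_rows(csv.DictReader(io.StringIO(text)))
```

In a CI step, a script built on this would print the problems and exit with a non-zero status when the list is non-empty, which is what makes the pipeline fail.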
Configuration
name: Data Validation
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        # The great_expectations CLI used below was removed in GX 1.0, so pin a pre-1.0 release.
        run: |
          pip install "great-expectations<1.0" dvc pandas
      - name: Pull data with DVC
        run: |
          dvc pull data/raw
      - name: Run data validation
        run: |
          great_expectations checkpoint run production_data_checks
      - name: Check for schema drift
        run: |
          python scripts/validate_schema.py data/raw/train.csv schema/expected_schema.json
      - name: Profile data statistics
        run: |
          python scripts/profile_data.py data/raw/train.csv
      - name: Upload validation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: validation-report
          path: great_expectations/uncommitted/validations/
Why this order?
The workflow fires on push/PR → checks out code → sets up Python → installs validation tools → pulls versioned data with DVC → runs Great Expectations suite → checks schema consistency → profiles stats → uploads artifacts. This order ensures data is available (DVC pull) before validation runs, and reports are always captured (if: always()) even if checks fail.
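The workflow calls scripts/validate_schema.py, but its contents are not shown. One possible shape for such a script is sketched below. The JSON layout ({"columns": [...]}) and the function names schema_drift and main are assumptions made for illustration, not the actual script.

```python
import csv
import json
import sys


def schema_drift(actual_columns, expected_columns):
    """Compare header columns with the expected list; return (missing, unexpected)."""
    missing = [c for c in expected_columns if c not in actual_columns]
    unexpected = [c for c in actual_columns if c not in expected_columns]
    return missing, unexpected


def main(argv):
    """Return a process exit code: 0 = schema OK, 1 = drift, 2 = bad usage."""
    if len(argv) != 3:
        print("usage: validate_schema.py <data.csv> <expected_schema.json>")
        return 2
    csv_path, schema_path = argv[1], argv[2]
    with open(schema_path, encoding="utf-8") as f:
        # Assumed layout of the schema file: {"columns": ["a", "b"]}
        expected = json.load(f)["columns"]
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Only the header row is needed; an empty file yields an empty header.
        header = next(csv.reader(f), [])
    missing, unexpected = schema_drift(header, expected)
    if missing or unexpected:
        print(f"schema drift: missing={missing} unexpected={unexpected}")
        return 1  # a non-zero exit code fails the CI step
    print("schema OK")
    return 0
```

To use it as the CI step, end the file with the usual __main__ guard calling sys.exit(main(sys.argv)); the non-zero return value is what stops the job before any later step runs.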
Wrong vs Right
Wrong:
name: Train
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install torch scikit-learn
      - name: Train model
        run: python train.py
      - name: Upload model
        # Illustrative upload step; `mlflow models push` is not an actual MLflow CLI subcommand.
        run: mlflow models push model.pkl
Right:
Add validation steps before training. Pull the data with dvc pull, then run great_expectations checkpoint run and the schema checks ahead of the training step. Do not set continue-on-error on the validation steps, so a failed check fails the job and training never runs on bad data.
Tool vitals
Command: great_expectations checkpoint run <checkpoint_name>
Config file: .github/workflows/data-validation.yml (GitHub Actions) or .gitlab-ci.yml (GitLab CI)
Trigger: git commit && git push (triggers the CI workflow)
Integration notes
Data validation sits between data ingestion and DVC versioning. After validation passes, run dvc add and commit the resulting .dvc file to Git to version the validated data. MLflow experiment tracking downstream depends on validated data reaching the training scripts. Great Expectations has no built-in DVC integration, but a validation step can write its results to a JSON file that you declare as an output or metrics file of a stage in dvc.yaml and log as an artifact in MLflow runs.
Migration path
Start with Great Expectations for flexibility. If your validation rules are simple (schema-only, no statistical checks), migrate to lightweight validators like Pydantic or pandera to reduce dependencies. If you outgrow CLI workflows, move validation logic into a dedicated service (e.g., BentoML) or Airflow DAG.
Cost model
Great Expectations is open source and free. The hosted offering, GX Cloud, has paid team plans; check current pricing before budgeting. GitHub Actions includes 2,000 free minutes per month for private repositories on the Free plan. Running validation on every push can consume minutes quickly when the data is large, because pull time dominates. Reduce the cost by pulling only the targets you validate (dvc pull data/raw rather than a bare dvc pull) or by running conditionally, for example with a paths filter on the workflow trigger so validation runs only when data/ or its .dvc pointer files change.
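When the decision to validate needs more logic than a trigger filter can express, the conditional run can be sketched as a small Python gate. The watch lists, the function names, and the origin/main base ref are hypothetical choices for illustration; adjust them to your repository layout.

```python
import subprocess

# Hypothetical watch lists: changes to these paths should trigger validation.
WATCHED_PREFIXES = ("data/", "schema/", "great_expectations/")
WATCHED_FILES = ("dvc.lock", "dvc.yaml")


def needs_validation(changed_files):
    """Return True if any changed path can affect the data or its validation rules."""
    for path in changed_files:
        if (
            path.startswith(WATCHED_PREFIXES)
            or path.endswith(".dvc")  # DVC pointer files change when data changes
            or path in WATCHED_FILES
        ):
            return True
    return False


def changed_files(base_ref="origin/main"):
    """List files changed between base_ref and HEAD (needs full Git history)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

The check on .dvc files matters because DVC-tracked data is not committed to Git; a data change shows up in a diff only as a changed pointer file or dvc.lock.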
Common gotcha
Great Expectations checkpoints validate whatever is in the working directory; they do not fetch DVC-tracked files for you. You must explicitly run dvc pull before great_expectations checkpoint run, or GE will validate stale or missing data. Additionally, GE checkpoints are stateless: they don't inherit context from prior runs. To audit what changed in the validation rules themselves, keep the expectation suites and checkpoint YAML under version control; the config_version key in a checkpoint YAML identifies the checkpoint config format, not the version of your rules.
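A cheap guard against this gotcha is to confirm that the data files exist and are non-empty before the checkpoint runs. The sketch below assumes you know which paths the checkpoint reads; the function name unmaterialized is hypothetical.

```python
import os


def unmaterialized(paths):
    """Return the paths that are missing or zero bytes, i.e. likely not pulled yet."""
    problems = []
    for path in paths:
        # A path that is absent, is a directory, or is empty cannot be validated.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(path)
    return problems
```

A CI step could call this with the files the checkpoint expects (for example data/raw/train.csv) and exit with a non-zero status if the returned list is non-empty, so a forgotten dvc pull fails fast with a clear message.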
Team adoption
1. Initialize Great Expectations once (great_expectations init) and commit the config.
2. Have one teammate build the first checkpoint using the auto-generation command.
3. Add the CI workflow to your repo's main branch.
4. Require validation to pass before merging PRs (GitHub: enable the branch protection rule 'Require status checks to pass before merging').
5. Document the checkpoint in your README so new team members know how to run validation locally (great_expectations checkpoint run <checkpoint_name>).
Experienced dev note
Use great_expectations checkpoint new --datasource to auto-generate your first checkpoint instead of writing YAML by hand. Most teams forget to version the great_expectations/ directory itself: it must be in Git so all developers share the same validation rules. Add great_expectations/uncommitted/ to .gitignore (it is by default), but commit great_expectations/expectations/ and great_expectations/checkpoints/.
Check your understanding
Why must dvc pull run before great_expectations checkpoint run in the CI workflow? What happens if you skip that step?
Answer hint
Great Expectations validates the files on disk in the working directory. If dvc pull is skipped, only the small .dvc pointer files are present; the data files themselves are gitignored and missing. GE will then fail with a 'file not found' error or validate stale or stub files, so a passing result says nothing about the real data.