Tool Beginner easy · 6 min integration

Data Validation in CI

What you will learn

Automatically check data quality and schema integrity in your CI pipeline before training or deployment.

Why this matters

Bad data silently breaks models. Without validation in CI, corrupted, missing, or schema-mismatched data reaches production undetected, causing model failures, retraining loops, and wasted compute. Catching data issues at commit time prevents hours of debugging.

Skip if: you are prototyping internally or working in one-off notebooks where you inspect the data manually first. Validation in CI becomes essential once data flows into shared pipelines or production systems. If your data source is a trusted API with contractual schema guarantees, basic schema checks may suffice instead of full profiling.

Explanation

Data validation in CI is a pre-training check that runs automatically when code or data changes. It verifies the schema (column names, types), detects missing values, checks statistical bounds (e.g., no prices > $1M), and flags data drift against historical baselines. Validation tools such as Great Expectations or pandera run inside GitHub Actions or GitLab CI and fail the pipeline when data quality thresholds are unmet; DVC supplies the versioned data they check (dvc plots only visualizes results, it does not validate). This sits upstream of DVC data versioning and MLflow experiment tracking: it ensures only valid data enters your training runs. In beginner MLOps, you'll combine a YAML-based CI config (GitHub Actions, GitLab CI) with a simple validation script or a Great Expectations suite.
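To make the checks above concrete, here is a minimal sketch of missing-value, bound, and drift checks on a single column. It uses only the Python standard library so it runs without extra installs; in practice pandas, pandera, or a Great Expectations suite would do this work. The column name, the thresholds, and the function name check_prices are assumptions chosen for illustration, and the drift test (relative shift of the mean) is just one simple option.

```python
import csv
import io
import statistics

# Hypothetical thresholds, for illustration only.
MAX_PRICE = 1_000_000.0    # statistical bound, taken from the example in the text
MAX_MISSING_RATE = 0.05    # assumed tolerance for empty cells
MAX_MEAN_SHIFT = 0.25      # assumed: flag drift if the mean moves more than 25%


def check_prices(rows, baseline_mean):
    """Return a list of problems for the 'price' column of an iterable of dict rows.

    Assumes a schema check has already confirmed the column is numeric and
    that baseline_mean is a positive number from a historical run.
    """
    raw = [row.get("price") or "" for row in rows]
    if not raw:
        return ["no rows to validate"]
    problems = []
    present = [float(value) for value in raw if value.strip()]
    # Missing-value check: share of empty cells.
    missing_rate = 1 - len(present) / len(raw)
    if missing_rate > MAX_MISSING_RATE:
        problems.append(f"missing rate {missing_rate:.0%} exceeds {MAX_MISSING_RATE:.0%}")
    # Bound check: no value may exceed the configured maximum.
    too_large = [value for value in present if value > MAX_PRICE]
    if too_large:
        problems.append(f"{len(too_large)} price(s) above {MAX_PRICE:,.0f}")
    # Drift check: compare the current mean with the historical baseline.
    if present:
        shift = abs(statistics.fmean(present) - baseline_mean) / baseline_mean
        if shift > MAX_MEAN_SHIFT:
            problems.append(f"mean shifted {shift:.0%} from baseline")
    return problems


sample = "id,price\n1,10\n2,12\n3,\n4,11\n"
for problem in check_prices(csv.DictReader(io.StringIO(sample)), baseline_mean=10.0):
    print(problem)
```

With the sample above, one of four prices is empty, so only the missing-rate check reports a problem; the mean (11) stays within 25% of the assumed baseline (10).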

Configuration

yaml

name: Data Validation
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          # The great_expectations CLI used below belongs to 0.x releases, so pin below 1.0
          pip install "great_expectations<1.0" dvc pandas
      - name: Pull data with DVC
        run: |
          dvc pull data/raw
      - name: Run data validation
        run: |
          great_expectations checkpoint run production_data_checks
      - name: Check for schema drift
        run: |
          python scripts/validate_schema.py data/raw/train.csv schema/expected_schema.json
      - name: Profile data statistics
        run: |
          python scripts/profile_data.py data/raw/train.csv
      - name: Upload validation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: validation-report
          path: great_expectations/uncommitted/validations/

Why this order?

The workflow fires on push/PR → checks out code → sets up Python → installs validation tools → pulls versioned data with DVC → runs Great Expectations suite → checks schema consistency → profiles stats → uploads artifacts. This order ensures data is available (DVC pull) before validation runs, and reports are always captured (if: always()) even if checks fail.
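The workflow calls scripts/validate_schema.py without showing it. The following is a minimal sketch of what such a script might look like, not the actual script. It assumes a hypothetical schema file of the form {"columns": {"id": "int", "price": "float"}}; that layout, the CASTERS table, and the function names are illustrative. The key point is the return value of main: a non-zero exit code is what makes the CI step fail.

```python
import csv
import json
import sys

# Assumed mapping from type names in the (hypothetical) schema file to Python casts.
CASTERS = {"int": int, "float": float, "str": str}


def schema_errors(csv_path, schema_path):
    """Compare a CSV's header and cell types against the expected schema."""
    with open(schema_path, encoding="utf-8") as handle:
        expected = json.load(handle)["columns"]
    errors = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        # Column names must match exactly, including order.
        if header != list(expected):
            return [f"expected columns {list(expected)}, found {header}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            for name, type_name in expected.items():
                value = row[name]
                if value in (None, ""):
                    continue  # missing values belong to a separate check
                try:
                    CASTERS[type_name](value)
                except ValueError:
                    errors.append(f"line {line_no}: {name}={value!r} is not {type_name}")
    return errors


def main(argv):
    """Return the process exit code: 0 on success, 1 on schema errors, 2 on bad usage."""
    if len(argv) != 3:
        print("usage: validate_schema.py DATA.csv SCHEMA.json", file=sys.stderr)
        return 2
    errors = schema_errors(argv[1], argv[2])
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0


# In a real script, finish with: sys.exit(main(sys.argv))
```

Returning the code from main instead of calling sys.exit inside it keeps the function easy to unit test.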

Wrong vs Right

Wrong way

yaml

name: Train
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install torch scikit-learn
      - name: Train model
        run: python train.py
      - name: Upload model
        uses: actions/upload-artifact@v4
        with:
          name: model
          path: model.pkl

Right way

Add validation steps before training. After dvc pull, run great_expectations checkpoint run <checkpoint_name> and the schema check as separate steps ahead of the training step. Do not set continue-on-error on those steps, so that a failed check fails the job. This ensures training never runs on bad data.

Tool vitals

Primary command

bash

great_expectations checkpoint run <checkpoint_name>

Config file: .github/workflows/data-validation.yml (GitHub Actions) or .gitlab-ci.yml (GitLab CI)

Verify

bash

git commit && git push  # pushing triggers the CI workflow

Integration notes

Data validation sits between data ingestion and DVC versioning. After validation passes, run dvc add and commit the resulting .dvc file to Git to version the validated data. MLflow experiment tracking downstream depends on validated data reaching the training scripts. Rather than relying on a direct link between Great Expectations and DVC, write a validation summary to a JSON file, declare it as a metrics or plots output in dvc.yaml, and log the same file as an artifact in your MLflow runs.
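One way to keep validation metadata next to DVC is to write a small JSON summary that a dvc.yaml stage can declare as a metrics file. The sketch below shows only the writing side. The file name, the keys, and the function write_validation_metrics are assumptions for illustration, not a Great Expectations or DVC API.

```python
import json
from pathlib import Path


def write_validation_metrics(path, rows_checked, problems):
    """Write a numeric summary of a validation run as JSON and return it."""
    summary = {
        "rows_checked": rows_checked,
        "problem_count": len(problems),
    }
    Path(path).write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary


# Example: record that 100 rows were checked and one problem was found.
write_validation_metrics("validation_metrics.json", 100, ["line 3: missing price"])
```

Keeping the values numeric makes the file straightforward to compare across commits.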

Migration path

Start with Great Expectations for flexibility. If your validation rules are simple (schema-only, no statistical checks), migrate to lightweight validators like Pydantic or pandera to reduce dependencies. If you outgrow CLI workflows, move validation logic into a dedicated validation service or an orchestrated pipeline such as an Airflow DAG.

Cost model

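Great Expectations is open source and free to use. The hosted product, GX Cloud, has paid team plans; pricing changes, so check the vendor's current rates rather than relying on a fixed figure. GitHub Actions includes 2,000 free minutes per month for private repositories on the Free plan. Running validation on every push can consume minutes quickly when data is large, because pull time dominates. To reduce cost, pull only the targets you validate (dvc pull data/raw rather than a full dvc pull), cache the DVC cache directory between runs, or run the workflow conditionally (for example, only when files under data/ or .dvc pointer files changed).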
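The conditional-run idea can be sketched as a small helper that decides whether a commit touched data at all. This is one possible approach under assumptions: data lives under data/, DVC pointer files end in .dvc, and pipeline state is in dvc.lock. The list of changed paths would typically come from git diff --name-only; the function name data_changed is hypothetical.

```python
from pathlib import PurePosixPath


def data_changed(changed_paths):
    """Return True if any changed path suggests the dataset itself changed."""
    for raw in changed_paths:
        path = PurePosixPath(raw)
        in_data_dir = path.parts[:1] == ("data",)   # anything under data/
        is_pointer = path.suffix == ".dvc"          # DVC pointer file
        is_lockfile = path.name == "dvc.lock"       # DVC pipeline lock file
        if in_data_dir or is_pointer or is_lockfile:
            return True
    return False


# A code-only commit does not need a data pull; a pointer update does.
print(data_changed(["README.md", "src/train.py"]))
print(data_changed(["src/train.py", "data/raw.dvc"]))
```

A CI step could skip the DVC pull and validation when this returns False. The trade-off is that changes to the validation rules themselves would then go unchecked unless those paths are added to the test.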

Common gotcha

Great Expectations checkpoints validate whatever is in the working directory; they do not fetch DVC-tracked files. Run dvc pull before great_expectations checkpoint run, or GE will validate stale or missing data. Checkpoints are also stateless: they do not inherit context from prior runs. Note that config_version in a checkpoint YAML is a key identifying the checkpoint config format, not a command-line flag and not a version of your rules. To audit how validation rules change, commit the expectation suites and checkpoint files to Git and review their history.
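A cheap guard against this gotcha is a pre-flight check that the data files actually exist and are non-empty before the checkpoint runs. A minimal sketch follows; the function name missing_or_empty is hypothetical, and the data/raw/train.csv path mirrors the workflow above.

```python
from pathlib import Path


def missing_or_empty(paths):
    """Return the paths that are not regular files or contain zero bytes."""
    return [
        str(path)
        for path in map(Path, paths)
        if not path.is_file() or path.stat().st_size == 0
    ]


# In CI, run this after dvc pull and before the checkpoint.
problems = missing_or_empty(["data/raw/train.csv"])
if problems:
    print("data not pulled or empty:", ", ".join(problems))
```

This does not detect stale data from a cached runner; it only catches the case where the pull was skipped or produced nothing.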

Team adoption

1. Initialize Great Expectations once (great_expectations init) and commit the config.
2. Have one teammate build the first checkpoint with great_expectations checkpoint new.
3. Add the CI workflow file to your repo's main branch.
4. Require validation to pass before merging PRs (GitHub: set a branch protection rule with 'require status checks').
5. Document the checkpoint in your README so new team members know how to run validation locally (great_expectations checkpoint run <checkpoint_name>).

Experienced dev note

Use great_expectations checkpoint new <checkpoint_name> (legacy CLI in 0.x releases) to scaffold your first checkpoint instead of writing YAML by hand. The available options vary by version, so check great_expectations checkpoint new --help for your install. Most teams forget to version the great_expectations/ directory itself: it must be in Git so all developers share the same validation rules. Keep great_expectations/uncommitted/ in .gitignore (it is there by default), but commit great_expectations/expectations/ and great_expectations/checkpoints/.

Check your understanding

Why must dvc pull run before great_expectations checkpoint run in the CI workflow? What happens if you skip that step?

Answer hint

Great Expectations validates files on disk in the current working directory. If dvc pull is skipped, the checkout contains only the small .dvc pointer files; the data itself is git-ignored and absent. GE will then fail with a 'file not found' error or, on a runner with leftover files, validate stale or stub data, so the result says nothing about the data you meant to check.
