Tool Intermediate medium · 7 min integration

Data validation with Great Expectations before training

What you will learn
Automatically validate dataset quality and schema before training runs using Great Expectations checkpoints integrated with MLflow.

Why this matters

Silent data quality issues cause models to train on corrupted, missing, or drifted data, which leads to poor production performance that is nearly impossible to debug after the fact. Great Expectations catches these issues before they reach your training pipeline, which avoids wasted compute and surfaces data regressions early.

Skip if: your data is entirely synthetic, fully controlled, and never changes. Validation is also unnecessary for one-off experiments where you inspect the data manually first. For any production pipeline or team collaboration, validation is non-negotiable.

Explanation

Great Expectations is a framework for defining and enforcing data quality rules (expectations) on datasets. Rather than writing custom validation scripts, you define a suite of expectations (e.g., 'column X has no nulls', 'column Y is between 0 and 100'), store the suite in version control, and run it through a checkpoint before training. The validation results can be logged to MLflow as artifacts, so validation reports sit alongside experiment metrics. This catches schema drift, missing values, out-of-range values, and type errors automatically, as long as an expectation covers them. When integrated into a DVC pipeline, validation becomes a gate: if expectations fail, the pipeline halts before training begins, which saves time and keeps bad models out of the registry.
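To make the idea of a declarative suite concrete, here is a minimal sketch in plain Python. It is not how Great Expectations is implemented and uses none of its API: the rule names, the CHECKS registry, and run_suite are all hypothetical, chosen only to show rules stored as data and evaluated into a single success flag.

```python
from typing import Any, Callable

Row = dict[str, Any]


def not_null(rows: list[Row], column: str) -> bool:
    """Every row has a non-None value in `column`."""
    return all(row.get(column) is not None for row in rows)


def between(rows: list[Row], column: str, min_value: float, max_value: float) -> bool:
    """Every non-None value in `column` lies in [min_value, max_value]."""
    # None values are skipped here; the not_null rule is responsible for them.
    values = [row[column] for row in rows if row.get(column) is not None]
    return all(min_value <= v <= max_value for v in values)


# Hypothetical registry mapping a rule name to the function that checks it.
CHECKS: dict[str, Callable[..., bool]] = {
    "not_null": not_null,
    "between": between,
}


def run_suite(rows: list[Row], suite: list[dict[str, Any]]) -> dict[str, Any]:
    """Evaluate every rule and report per-rule and overall success."""
    results = []
    for rule in suite:
        passed = CHECKS[rule["type"]](rows, **rule["kwargs"])
        results.append({"type": rule["type"], "kwargs": rule["kwargs"], "success": passed})
    return {"success": all(r["success"] for r in results), "results": results}


# The suite is plain data, so it can live in version control next to the code.
suite = [
    {"type": "not_null", "kwargs": {"column": "feature_age"}},
    {"type": "between", "kwargs": {"column": "feature_age", "min_value": 0, "max_value": 120}},
]
clean = [{"feature_age": 34}, {"feature_age": 58}]
dirty = [{"feature_age": 34}, {"feature_age": None}, {"feature_age": 140}]

clean_report = run_suite(clean, suite)
dirty_report = run_suite(dirty, suite)
```

The overall success flag is what a pipeline gate would act on, while the per-rule results are what you would keep as the audit trail.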

Configuration

json
# great_expectations/expectations/train_data.json
# Define expectations for your training dataset
{
  "expectation_suite_name": "train_data",
  "meta": {
    "api_version": "1.0.0"
  },
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {
        "column": "feature_age"
      }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "feature_age"
      }
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "feature_age",
        "min_value": 0,
        "max_value": 120
      }
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {
        "column": "label",
        "value_set": ["positive", "negative"]
      }
    },
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": {
        "min_value": 1000,
        "max_value": 1000000
      }
    }
  ]
}

yaml
# great_expectations/checkpoints/train_data_checkpoint.yml
name: train_data_checkpoint
config_version: 1.0
template_name: null
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-train-validation"
expectation_suite_name: train_data
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
batch_request:
  datasource_name: training_data
  data_connector_name: csv_connector
  data_asset_name: train_dataset
  batch_identifiers:
    batch_id: latest

Why this order?

The expectations file defines what to validate (the contract). The checkpoint file defines how and when to validate (the execution plan). Checkpoints reference the expectations suite by name, so the suite must exist first.

Wrong vs Right

Wrong way
python
# DON'T: Ad-hoc validation scattered in training script
import pandas as pd
df = pd.read_csv('data.csv')
if df['age'].isnull().any():
    raise ValueError("nulls found")
if not set(df['label'].unique()).issubset({'pos', 'neg'}):
    raise ValueError("unexpected labels")
# Problem: No audit trail, no integration with MLflow, no version control, breaks on next schema change
Right way
python
# DO: Define expectations in version control, run via checkpoint
# First, in the shell (legacy 0.x CLI; exits non-zero when validation fails):
#   great_expectations checkpoint run train_data_checkpoint
# Then in the training script:
import json

import mlflow

# Assumes the validation stage wrote its result to this path
RESULT_PATH = 'great_expectations/uncommitted/run_results.json'

with open(RESULT_PATH, 'r') as f:
    validation_result = json.load(f)

if not validation_result['success']:
    raise ValueError(f"Data validation failed: {validation_result}")

# Log the validation report to MLflow alongside the run
mlflow.log_artifact(RESULT_PATH)
# Advantage: Reproducible, versioned, auditable, integrated

Tool vitals

Primary command
bash
great_expectations checkpoint run
Config file great_expectations/checkpoints/train_data_checkpoint.yml
Verify
bash
great_expectations checkpoint run train_data_checkpoint

Integration notes

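In a DVC pipeline, add validation as a stage that runs before training: dvc stage add -n validate -d data/train.csv -o great_expectations/uncommitted/run_results.json 'great_expectations checkpoint run train_data_checkpoint'. A non-zero exit status from the checkpoint command fails the stage, which halts the pipeline before training. DVC also expects the stage command to produce the declared output, so make sure the command actually writes run_results.json (for example, through a small wrapper script), and declare that file as a dependency of the training stage. Log the validation report to MLflow using mlflow.log_artifact() to keep a complete audit trail alongside model metrics.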
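One way to make the validate stage produce the run_results.json file that the DVC command declares as its output is a small wrapper script. The sketch below is an assumption, not part of Great Expectations: the file name validate.py, the functions run_checkpoint, write_summary, and main, and the shape of the summary JSON are all hypothetical. It relies only on the legacy CLI returning a non-zero exit status when validation fails.

```python
import json
import subprocess
from pathlib import Path
from typing import Any, Callable

# Path taken from the DVC stage definition; the summary format is hypothetical.
SUMMARY_PATH = Path("great_expectations/uncommitted/run_results.json")


def run_checkpoint(name: str, runner: Callable[..., Any] = subprocess.run) -> int:
    """Run the legacy CLI checkpoint command and return its exit status."""
    completed = runner(["great_expectations", "checkpoint", "run", name], check=False)
    return completed.returncode


def write_summary(name: str, exit_code: int, path: Path = SUMMARY_PATH) -> dict[str, Any]:
    """Write a small pass/fail summary that later stages can read."""
    summary = {"checkpoint": name, "success": exit_code == 0, "exit_code": exit_code}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summary, indent=2))
    return summary


def main(
    name: str = "train_data_checkpoint",
    runner: Callable[..., Any] = subprocess.run,
    path: Path = SUMMARY_PATH,
) -> int:
    """Run the checkpoint, record the outcome, and return the exit status."""
    exit_code = run_checkpoint(name, runner)
    write_summary(name, exit_code, path)
    return exit_code
```

To use it as a script, call sys.exit(main()) under an if __name__ == "__main__": guard and point the DVC stage at python validate.py instead of the bare CLI command. The summary is written before the exit status is returned, so the file exists for inspection even when the stage fails, and the training script can pass the same path to mlflow.log_artifact().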

Migration path

If Great Expectations becomes too heavyweight, you can migrate to Pandera (Python-first, lighter weight) or custom Pydantic validators. Your expectations are already stored as JSON, so you can read the suite file and convert its column constraints to Pandera schemas. The MLflow logging step stays the same; only the validation layer is swapped out.
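As a starting point for such a migration, the sketch below collapses a suite like the one in the Configuration section into one neutral constraint record per column. The function name column_constraints and the record keys (nullable, min_value, max_value, allowed) are hypothetical, and only the three column expectation types used in this lesson are handled.

```python
from typing import Any


def column_constraints(suite: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Group a suite's column-level expectations into one record per column."""
    columns: dict[str, dict[str, Any]] = {}
    for expectation in suite.get("expectations", []):
        kwargs = expectation.get("kwargs", {})
        column = kwargs.get("column")
        if column is None:
            continue  # table-level expectations (e.g. row count) have no column
        spec = columns.setdefault(column, {"nullable": True})
        kind = expectation["expectation_type"]
        if kind == "expect_column_values_to_not_be_null":
            spec["nullable"] = False
        elif kind == "expect_column_values_to_be_between":
            spec["min_value"] = kwargs.get("min_value")
            spec["max_value"] = kwargs.get("max_value")
        elif kind == "expect_column_values_to_be_in_set":
            spec["allowed"] = list(kwargs.get("value_set", []))
    return columns


# Same expectations as the train_data suite shown in the Configuration section.
train_suite = {
    "expectations": [
        {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "feature_age"}},
        {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "feature_age"}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "feature_age", "min_value": 0, "max_value": 120}},
        {"expectation_type": "expect_column_values_to_be_in_set",
         "kwargs": {"column": "label", "value_set": ["positive", "negative"]}},
        {"expectation_type": "expect_table_row_count_to_be_between",
         "kwargs": {"min_value": 1000, "max_value": 1000000}},
    ]
}

constraints = column_constraints(train_suite)
```

Each record then maps onto a Pandera column fairly directly: nullable becomes the nullable argument of Column, a min/max pair becomes Check.in_range, and allowed becomes Check.isin. Table-level rules such as the row count are skipped by this sketch and would need a dataframe-level check.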

Cost model

Great Expectations is open source and free, with no hidden costs. The hosted GX Cloud service is optional and paid (for centralized validation dashboards), but the core CLI tool used here is entirely free.

Common gotcha

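By default, Great Expectations stores validation results under uncommitted/ (not in git), which is correct for transient run output. The risk is a failure that goes unnoticed: the checkpoint command reports failure through its output and a non-zero exit status, so a wrapper that discards the exit status, or a training script that never reads the success field of the result, lets training continue on bad data. During development, feed the checkpoint a deliberately broken file and confirm that the pipeline actually stops. Note that the great_expectations CLI belongs to the 0.x releases; it was removed in version 1.0, where checkpoints are run from Python.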

Team adoption

On day one: put a single shared expectation suite in version control instead of regenerating it by profiling on every run. Have one engineer bootstrap the suite once with great_expectations suite new --profile, then review and edit the JSON together as a team. Store it in great_expectations/expectations/ alongside dvc.yaml. Require the checkpoint run to pass before any training PR merges. Set up Slack alerts for validation failures by adding a notification action to the checkpoint's action_list.

Experienced dev note

Keep profiling out of routine checkpoint runs unless you are actively debugging (for example, profiling_enabled: false, if your configuration exposes that setting). Profiling, which auto-computes statistics, can add 30-60 seconds per run. Also, store expectations under a git branch or tag aligned with your data schema version. When a data contract changes, you then update the expectations explicitly rather than silently tolerating new nulls or outliers.

Check your understanding

Your team adds a new feature to the training dataset without updating expectations. The data now contains nulls in this column. Will Great Expectations halt the pipeline, and why?

Answer

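Great Expectations will NOT halt the pipeline, because the new column has no expectation defined for it. Only explicitly declared expectations are validated. This keeps existing suites backward compatible, but it is a gotcha in practice: you must actively add expectations for new columns. expect_column_to_exist only guards against a column going missing; to fail when an unexpected column appears, add a table-level expectation such as expect_table_columns_to_match_set with exact_match enabled.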
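A cheap way to catch this situation is a coverage check that compares the dataset's columns with the columns the suite mentions. The sketch below assumes the suite has the same JSON shape as the Configuration example; uncovered_columns is a hypothetical helper, and feature_income stands in for the newly added column.

```python
from typing import Any


def uncovered_columns(dataset_columns: list[str], suite: dict[str, Any]) -> list[str]:
    """Return dataset columns that no expectation in the suite refers to."""
    covered = {
        expectation["kwargs"]["column"]
        for expectation in suite.get("expectations", [])
        if "column" in expectation.get("kwargs", {})
    }
    return [column for column in dataset_columns if column not in covered]


suite = {
    "expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "feature_age"}},
        {"expectation_type": "expect_column_values_to_be_in_set",
         "kwargs": {"column": "label", "value_set": ["positive", "negative"]}},
        {"expectation_type": "expect_table_row_count_to_be_between",
         "kwargs": {"min_value": 1000, "max_value": 1000000}},
    ]
}

# feature_income is a hypothetical column added without updating the suite.
missing = uncovered_columns(["feature_age", "label", "feature_income"], suite)
```

Failing the build when the returned list is non-empty turns "someone forgot to add expectations" into an explicit error instead of a silent gap.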
