Data validation with Great Expectations before training
Why this matters
Silent data quality issues cause models to train on corrupted, missing, or drifted data, leading to poor production performance that is nearly impossible to debug retrospectively. Great Expectations catches these issues before they reach your training pipeline, which prevents wasted compute and surfaces data regressions early.
Explanation
Great Expectations is a framework for defining and enforcing data quality rules (expectations) on datasets. Rather than writing custom validation scripts, you define a suite of expectations (e.g., 'column X has no nulls', 'column Y is between 0 and 100'), store it in version control, and run it through a checkpoint before training. The validation report can be logged to MLflow as an artifact, so it sits alongside experiment metrics. The suite catches schema drift, missing values, out-of-range values, and type errors for every column you have declared expectations on. When integrated into a DVC pipeline, validation becomes a gate: if expectations fail, the pipeline halts before training begins, saving time and preventing bad models from reaching the registry.
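To make the idea of a suite concrete, here is a minimal sketch in plain Python of what "declare rules as data, then evaluate them" means. It is not how Great Expectations is implemented; the `CHECKS` table and the `validate` function are hypothetical names used only for illustration, and rows are assumed to be plain dictionaries.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def not_null(rows: List[Row], column: str) -> bool:
    """Pass only if every row has a non-null value in the column."""
    return all(row.get(column) is not None for row in rows)


def between(rows: List[Row], column: str, min_value: float, max_value: float) -> bool:
    """Pass if every non-null value lies in [min_value, max_value]."""
    # Nulls are skipped here; pair this rule with not_null to reject them.
    values = [row[column] for row in rows if row.get(column) is not None]
    return all(min_value <= value <= max_value for value in values)


def in_set(rows: List[Row], column: str, value_set: List[Any]) -> bool:
    """Pass if every non-null value is one of the allowed values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return all(value in value_set for value in values)


# Hypothetical lookup table from expectation name to check function.
CHECKS: Dict[str, Callable[..., bool]] = {
    "expect_column_values_to_not_be_null": not_null,
    "expect_column_values_to_be_between": between,
    "expect_column_values_to_be_in_set": in_set,
}


def validate(rows: List[Row], suite: Dict[str, Any]) -> Dict[str, Any]:
    """Evaluate every expectation in the suite and report an overall verdict."""
    results = []
    for expectation in suite["expectations"]:
        check = CHECKS[expectation["expectation_type"]]
        results.append({
            "expectation_type": expectation["expectation_type"],
            "kwargs": expectation["kwargs"],
            "success": check(rows, **expectation["kwargs"]),
        })
    return {"success": all(r["success"] for r in results), "results": results}
```

The point of the design is that the rules live in a data structure (the suite) rather than in `if` statements, so they can be reviewed, versioned, and reused across pipelines.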
Configuration
# great_expectations/expectations/train_data.json
# Define expectations for your training dataset
{
  "expectation_suite_name": "train_data",
  "meta": {
    "api_version": "1.0.0"
  },
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {
        "column": "feature_age"
      }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "feature_age"
      }
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "feature_age",
        "min_value": 0,
        "max_value": 120
      }
    },
    {
      "expectation_type": "expect_column_values_to_be_in_set",
      "kwargs": {
        "column": "label",
        "value_set": ["positive", "negative"]
      }
    },
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": {
        "min_value": 1000,
        "max_value": 1000000
      }
    }
  ]
}
# great_expectations/checkpoints/train_data_checkpoint.yml
name: train_data_checkpoint
config_version: 1.0
template_name: null
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-train-validation"
expectation_suite_name: train_data
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
batch_request:
  datasource_name: training_data
  data_connector_name: csv_connector
  data_asset_name: train_dataset
  batch_identifiers:
    - batch_id: latest
Why this order?
The expectations file defines what to validate (the contract). The checkpoint file defines how and when to validate (the execution plan). Checkpoints reference the expectations suite by name, so the suite must exist first.
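Because the checkpoint only holds the suite's name, a missing or misnamed suite file surfaces late. A small preflight check can catch that earlier. The sketch below assumes the checkpoint config has already been parsed into a dictionary and that suites are stored as `<name>.json` under `great_expectations/expectations/`, as in the layout above; `suite_file_for` and `check_suite_exists` are hypothetical helper names.

```python
from pathlib import Path
from typing import Any, Dict, Union


def suite_file_for(checkpoint: Dict[str, Any], ge_root: Path) -> Path:
    """Return the path where the checkpoint's suite is assumed to live."""
    return ge_root / "expectations" / f"{checkpoint['expectation_suite_name']}.json"


def check_suite_exists(checkpoint: Dict[str, Any], ge_root: Union[str, Path]) -> Path:
    """Raise early if the suite named by the checkpoint has no file on disk."""
    path = suite_file_for(checkpoint, Path(ge_root))
    if not path.is_file():
        raise FileNotFoundError(
            f"Checkpoint {checkpoint.get('name')!r} references suite "
            f"{checkpoint['expectation_suite_name']!r}, but {path} does not exist"
        )
    return path
```

Running a check like this in CI gives a clearer error than a failed checkpoint run when someone renames a suite without updating the checkpoint.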
Wrong vs Right
# DON'T: Ad-hoc validation scattered in training script
import pandas as pd
df = pd.read_csv('data.csv')
if df['age'].isnull().any():
    raise ValueError("nulls found")
if not set(df['label'].unique()).issubset({'pos', 'neg'}):
    raise ValueError("unexpected labels")
# Problem: No audit trail, no integration with MLflow, no version control, breaks on next schema change
# DO: Define expectations in version control, run via checkpoint
# The checkpoint is looked up by name in the project's checkpoints/ directory
great_expectations checkpoint run train_data_checkpoint
# Then in training script:
import json
import mlflow

# Assumes the validation step wrote its result to this path
with open('great_expectations/uncommitted/run_results.json', 'r') as f:
    validation_result = json.load(f)
if not validation_result['success']:
    raise ValueError(f"Data validation failed: {validation_result}")
# Log to MLflow
mlflow.log_artifact('great_expectations/uncommitted/run_results.json')
# Advantage: Reproducible, versioned, auditable, integrated
Tool vitals
great_expectations checkpoint run train_data_checkpoint
Integration notes
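In a DVC pipeline, add the checkpoint as a stage that runs before training: dvc stage add -n validate -d data/train.csv -o great_expectations/uncommitted/run_results.json 'great_expectations checkpoint run train_data_checkpoint'. DVC stops the pipeline when a stage command exits with a non-zero status, which the checkpoint command does when validation fails, so training never starts on bad data. The -o path assumes the stage actually writes its result to that file; list the same file as a dependency of the training stage so the ordering is explicit. Log the validation report to MLflow using mlflow.log_artifact() to keep a complete audit trail alongside model metrics.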
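One way to guarantee that the result file exists is to wrap the CLI call in a small script and use that script as the stage command. The sketch below is a hypothetical `validate.py`, not part of Great Expectations: it assumes the checkpoint command exits with a non-zero status on failed validation, records a simple `success` flag at the path used above, and passes the exit code through so DVC still halts.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import List


def run_gate(command: List[str], result_path: Path) -> int:
    """Run a validation command, record the outcome as JSON, return its exit code."""
    completed = subprocess.run(command, capture_output=True, text=True)
    result = {
        "success": completed.returncode == 0,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
    }
    # Create the target directory if needed so the write cannot fail on a fresh clone.
    result_path.parent.mkdir(parents=True, exist_ok=True)
    result_path.write_text(json.dumps(result, indent=2))
    return completed.returncode


# Intended use as the DVC stage command (hypothetical wiring):
#   code = run_gate(
#       ["great_expectations", "checkpoint", "run", "train_data_checkpoint"],
#       Path("great_expectations/uncommitted/run_results.json"),
#   )
#   sys.exit(code)
```

The training script can then read the `success` field from that file, and a failed validation still stops the pipeline because the wrapper exits with the same status as the CLI.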
Migration path
If Great Expectations becomes too heavyweight, you can migrate to Pandera (Python-first, lighter weight) or custom Pydantic validators. Your expectations are already stored as JSON, so the migration is mostly a matter of converting each column constraint into the equivalent Pandera schema entry. The MLflow logging step remains unchanged; only the validation layer swaps out.
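As a rough sketch of the first half of that conversion, the function below groups a suite's column-level expectations into a neutral per-column dictionary. The output keys (`nullable`, `in_range`, `isin`) are hypothetical names chosen to line up with Pandera's `nullable=` argument and its `Check.in_range` and `Check.isin` checks; building the actual Pandera schema from this dictionary is left out, and only the three expectation types used in this section are handled.

```python
from typing import Any, Dict


def to_column_constraints(suite: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Group column-level expectations into one constraint dict per column."""
    columns: Dict[str, Dict[str, Any]] = {}
    for expectation in suite["expectations"]:
        kwargs = expectation["kwargs"]
        column = kwargs.get("column")
        if column is None:
            # Table-level expectations (e.g. row counts) need separate handling.
            continue
        spec = columns.setdefault(column, {"nullable": True})
        kind = expectation["expectation_type"]
        if kind == "expect_column_values_to_not_be_null":
            spec["nullable"] = False
        elif kind == "expect_column_values_to_be_between":
            spec["in_range"] = (kwargs["min_value"], kwargs["max_value"])
        elif kind == "expect_column_values_to_be_in_set":
            spec["isin"] = list(kwargs["value_set"])
        # expect_column_to_exist needs no extra key: the column entry itself records it.
    return columns
```

Keeping the intermediate form library-neutral means the same dictionary could also drive Pydantic validators if you choose that route instead.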
Cost model
Great Expectations is open source and free, with no hidden costs. The hosted Great Expectations Cloud service is optional and paid (for centralized validation dashboards), but the core CLI tool used here is entirely free.
Common gotcha
By default, Great Expectations stores validation results in uncommitted/ (not in git), which is correct for transient data. However, a failed validation is easy to miss if nothing checks the outcome. Running a checkpoint from Python returns a result object rather than raising an exception, so you must check its success field explicitly; when you use the CLI, check the exit status, which is non-zero on failure. During development, print the failed expectations or open the generated Data Docs so that failures are visible immediately.
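One way to make failures visible is to summarize the failed expectations from a saved validation result. The sketch below assumes the JSON has the shape of a suite validation result: a top-level `results` list whose items carry a `success` flag and an `expectation_config` with `expectation_type` and `kwargs`. `failed_expectations` is a hypothetical helper name, and a result saved in a different shape would need a different accessor.

```python
from typing import Any, Dict, List


def failed_expectations(validation_result: Dict[str, Any]) -> List[str]:
    """Return one human-readable line per expectation that did not pass."""
    failures = []
    for item in validation_result.get("results", []):
        if not item.get("success", False):
            config = item.get("expectation_config", {})
            name = config.get("expectation_type", "<unknown>")
            failures.append(f"{name} {config.get('kwargs', {})}")
    return failures
```

Printing this list before raising an error in the training script turns an opaque "validation failed" into a short list of the rules that broke.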
Team adoption
On day one: create a single shared expectation suite in version control, and do not commit raw profiler output unreviewed. Have one engineer generate a first draft once using great_expectations suite new --profile, then have the team review and edit the JSON together. Store it in great_expectations/expectations/ alongside dvc.yaml. Require the checkpoint run to pass before any training PR merges. Set up Slack alerts for validation failures by adding a notification action to the checkpoint's action_list (Great Expectations ships a SlackNotificationAction).
Experienced dev note
If your setup exposes a profiling toggle such as profiling_enabled, set it to false in your checkpoint config unless you're actively debugging. Profiling (auto-computing statistics) adds 30-60 seconds per run. Also, store expectations in a separate git branch or tag aligned with your data schema version, so that when data contracts change you explicitly update expectations rather than silently tolerating new nulls or outliers.
Check your understanding
Your team adds a new feature to the training dataset without updating expectations. The data now contains nulls in this column. Will Great Expectations halt the pipeline, and why?
Answer hint
Great Expectations will NOT halt the pipeline, because the new column has no expectation defined for it. Only explicitly declared expectations are validated. This keeps suites backward compatible, but it is a gotcha in practice: you must actively add expectations for new columns. Note that expect_column_to_exist only guards against a known column disappearing; to make an unexpected new column fail validation, add a table-level expectation such as expect_table_columns_to_match_set.
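A lightweight complement is a coverage check that lists dataset columns with no expectation at all. The sketch below is plain Python with a hypothetical function name; it assumes the suite is the parsed JSON shown in the Configuration section and that you can supply the dataset's column names (for example from a DataFrame's columns).

```python
from typing import Any, Dict, Iterable, List


def columns_without_expectations(dataset_columns: Iterable[str], suite: Dict[str, Any]) -> List[str]:
    """Return dataset columns that no column-level expectation mentions."""
    covered = {
        expectation["kwargs"]["column"]
        for expectation in suite["expectations"]
        if "column" in expectation["kwargs"]
    }
    # Preserve the dataset's column order in the report.
    return [column for column in dataset_columns if column not in covered]
```

Failing a PR check when this list is non-empty forces the conversation about a new column's data contract to happen before training, not after.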