Regulatory compliance for data
Why this matters
Regulators audit your data pipeline. Without lineage tracking and access logs, you cannot prove data consent, deletion, or transformation accuracy during an audit. This results in failed compliance checks, fines, or data removal orders that break production models.
Explanation
Regulatory compliance for data in MLOps means keeping an immutable audit trail of three things: (1) data provenance: where data came from and how it was transformed, (2) access logs: who accessed what data and when, and (3) retention policies: proof that data deletion happens on schedule. Combining MLflow's model registry, DVC's data versioning, and explicit RBAC (role-based access control) in your data store gives you the building blocks of a compliance-ready pipeline. The key is that every data transformation and model training run must be tagged with compliance metadata: data version, consent flags, PII masking status, and retention deadline. DVC records a content hash for each dataset version, so any later change to the data yields a different hash; MLflow stores whatever data version and tags your training code logs with each run; and container logs with centralized aggregation (ELK stack or cloud logging) form the access audit trail. This prevents the common scenario where a regulator asks 'what data did model V3 train on?' and you cannot answer.
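The compliance metadata described above can be sketched as a small record that is built once and flattened into string tags. This is a minimal illustration using only the standard library; the class name, field names, and the choice of SHA-256 are assumptions made for the example, not part of DVC or MLflow.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


def file_sha256(path: str) -> str:
    """Content hash of a dataset file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ComplianceMetadata:
    """Hypothetical record of the four items every run should carry."""
    data_version: str
    consent_verified: bool
    pii_masked: bool
    validated_on: date
    retention_days: int

    @property
    def retention_deadline(self) -> date:
        # Deadline is counted from the day the data passed validation.
        return self.validated_on + timedelta(days=self.retention_days)

    def as_tags(self) -> dict:
        """Flatten to string values so the record can be stored as run tags."""
        return {
            "data_version": self.data_version,
            "gdpr_consent_verified": str(self.consent_verified).lower(),
            "pii_masking_applied": str(self.pii_masked).lower(),
            "retention_deadline": self.retention_deadline.isoformat(),
        }
```

With validated_on set to 2026-04-15 and retention_days set to 365, the derived deadline is 2027-04-15, which matches the retention_deadline tag used in the configuration below.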
Configuration
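# .dvc/config - Data versioning with immutable audit trail
['remote "prod-storage"']
    url = s3://compliance-vault/data
    # Keep credentials out of this committed file: DVC does not expand ${VAR}
    # placeholders here. Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the
    # environment, or set access_key_id / secret_access_key in .dvc/config.local.
[core]
    remote = prod-storage
    autostage = true
    check_update = false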
# dvc.yaml - Pipeline with compliance metadata
stages:
  ingest_raw:
    cmd: python ingest.py --source sales_db --output data/raw --log-access
    deps:
      - ingest.py
    outs:
      - data/raw:
          hash: md5
          persist: false
    params:
      - ingest.source
  validate_consent:
    cmd: python validate_consent.py --input data/raw --output data/consented
    deps:
      - data/raw
      - validate_consent.py
    outs:
      - data/consented
    meta:
      compliance_check: gdpr_consent
      retention_days: 365
      pii_masked: true
  train_model:
    cmd: python train.py --data data/consented --output models/v1
    deps:
      - data/consented
      - train.py
    outs:
      - models/v1
    plots:
      - metrics.json
    meta:
      compliance_approved: true
      data_version: ${data_commit}
      audit_trail_enabled: true
# MLflowfile.yml - Experiment tracking with compliance tags
# (a project-specific file that your training script reads; MLflow does not load it itself)
experiment_name: production_pipeline
run_name: training_run_2026_04_15
tags:
  compliance_approved: "true"
  gdpr_consent_verified: "true"
  pii_masking_applied: "true"
  retention_deadline: "2027-04-15"
  audit_log_enabled: "true"
  data_version: "e3b0c44298fc1c149afbf4c8996fb924"
params:
  learning_rate: 0.001
  batch_size: 32
metrics:
  accuracy: 0.94
  auc: 0.87
Why this order?
Ingest first (raw data capture), then validate consent (a GDPR requirement), then train (only on consented data). DVC derives the execution order from each stage's deps and outs, so train_model cannot run before validate_consent has produced data/consented. MLflow tags are applied at training time so the compliance metadata lives alongside the model artifact.
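The consent step in the middle of that chain can be pictured as a filter that lets only consented rows through. The sketch below is one possible shape for the core of validate_consent.py; the column name consent_given and the sample rows are hypothetical, since the source does not show the raw schema.

```python
import csv
import io


def split_by_consent(rows, consent_field="consent_given"):
    """Separate rows with an explicit 'true' consent flag from all others.

    Missing or empty flags are treated as no consent, so a row is only
    kept when consent is stated explicitly.
    """
    consented, rejected = [], []
    for row in rows:
        flag = str(row.get(consent_field, "")).strip().lower()
        if flag == "true":
            consented.append(row)
        else:
            rejected.append(row)
    return consented, rejected


# Hypothetical raw extract: one consented user, one refusal, one blank flag.
raw_csv = "user_id,consent_given\n1,true\n2,false\n3,\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
consented, rejected = split_by_consent(rows)
print(len(consented), "consented;", len(rejected), "excluded from training")
```

Treating a blank flag as a refusal is a deliberate choice in this sketch: only rows with explicit consent reach data/consented.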
Wrong vs Right
# ❌ No compliance tracking
stages:
  train:
    cmd: python train.py --input /tmp/data --output model.pkl
    outs:
      - model.pkl
# Later: regulator asks where this model's data came from
# Answer: 'I don't know, it was in /tmp'
# ✅ Full compliance chain
stages:
  train:
    cmd: python train.py --input data/consented --output models/v1 --log-compliance
    deps:
      - data/consented  # DVC tracks this hash immutably
    outs:
      - models/v1
    meta:
      compliance_approved: true
      data_version: ${data_commit}  # Immutable link to source
      audit_enabled: true
# MLflow logs compliance tags at runtime (Python, inside train.py)
import mlflow

# data_hash and deadline_date are computed earlier in train.py
mlflow.set_tags({
    'compliance_approved': 'true',
    'data_version': data_hash,
    'consent_verified': 'true',
    'retention_deadline': deadline_date
})
mlflow.log_artifact('audit_log.json')
Tool vitals
dvc remote list && dvc dag && mlflow.log_param('compliance_data_version', ...)
.dvc/config, dvc.yaml, MLflowfile.yml
dvc dag --full, then in Python: mlflow.search_runs(experiment_names=['prod'], filter_string="tags.compliance_approved = 'true'")
Integration notes
DVC provides immutable data versioning (guards against 'someone deleted the source data'), MLflow provides run-time audit logs (what model was trained on what data), and container logging (centralized access logs for data stores) completes the picture. Together: DVC = data lineage, MLflow = model-to-data traceability, logging = access audit trail.
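The access audit trail is the piece neither DVC nor MLflow supplies, so it helps to see what one entry might contain. The sketch below emits a single JSON line per access, a format that ELK or a cloud logging service can typically ingest; the field names and the sample values are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def access_event(user, action, resource, when=None):
    """Serialize one 'who accessed what, and when' record as a JSON line."""
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": when.isoformat(),
            "user": user,
            "action": action,
            "resource": resource,
        },
        sort_keys=True,  # Stable key order keeps log lines easy to diff.
    )


# Hypothetical usage: the ingest step reading from the sales database.
print(access_event("svc-ingest", "read", "sales_db"))
```

Writing these lines to stdout inside the container is usually enough for a log shipper to pick them up and forward them to the central store.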
Migration path
If moving away: Delta Lake (Databricks) provides similar ACID + lineage for data warehouses. For model registry, Neptune or Weights & Biases offer compliance-first audit trails. The pattern (versioned data + tagged experiments + access logs) is universal; only the tooling changes.
Cost model
DVC Community is free; DVC Cloud is $0.03/GB/month for remote storage. MLflow OSS is free; MLflow Databricks Enterprise adds RBAC and SOC 2 compliance features ($3-5/DBU). Cloud object storage (S3, GCS, Azure Blob) charges per GB stored. Budget for 50-200 GB of new storage per month for typical ML pipelines with compliance logging.
Common gotcha
DVC identifies each dataset version by content hash, but if you push model artifacts to MLflow without recording the exact DVC data version, auditors cannot verify which dataset trained which model. Always log the DVC data version (for example, the Git commit that pins dvc.lock) to MLflow tags at training time: `mlflow.set_tag('dvc_data_commit', dvc_commit_hash)`. Without this bridge, you cannot demonstrate model-to-data traceability.
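One way to build that bridge is sketched below. It assumes the data version is the Git commit of the repository holding dvc.lock, and it takes the tag-setting function as an argument so the helper can be exercised without a tracking server. The helper names are hypothetical; in a real run you would pass mlflow.set_tags.

```python
import subprocess


def current_git_commit(repo_dir="."):
    """Return the HEAD commit of the repo that pins dvc.lock, or None."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, or repo_dir is not a Git repository
    return result.stdout.strip()


def lineage_tags(commit):
    """Build the tag that ties a training run to its data version."""
    if not commit:
        # Refuse to train untraceably rather than log an empty tag.
        raise ValueError("no data commit available; refusing to log lineage")
    return {"dvc_data_commit": commit}


def log_lineage(set_tags, commit=None, repo_dir="."):
    """Resolve the data commit and hand the tags to the tracking client."""
    tags = lineage_tags(commit or current_git_commit(repo_dir))
    set_tags(tags)
    return tags
```

Inside train.py this would be called as log_lineage(mlflow.set_tags) while a run is active. Raising when no commit can be resolved makes a missing link fail loudly at training time instead of surfacing during an audit.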
Team adoption
Day 1: Add a .dvc/config with your compliance vault (S3 bucket with object lock enabled). Week 1: Require every pipeline to be reproducible in CI/CD, for example by failing the build when dvc status reports stages or data that are out of date. Week 2: Add MLflow compliance tag validation to your model promotion gate: no model can move to production without 'compliance_approved=true'. Week 4: Centralize access logs to ELK or CloudWatch and wire them to your compliance dashboard.
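The Week 2 promotion gate can be a small check over a run's tags. The sketch below works on a plain dictionary, which is what mlflow.get_run(run_id).data.tags returns. Requiring dvc_data_commit in addition to compliance_approved is an assumption of this example, not something the plan above mandates.

```python
def promotion_blockers(tags, lineage_tag="dvc_data_commit"):
    """Return the reasons a run may not be promoted; empty means it may."""
    blockers = []
    if tags.get("compliance_approved") != "true":
        blockers.append("compliance_approved must be 'true'")
    if not tags.get(lineage_tag):
        blockers.append(f"{lineage_tag} is missing, so the data is untraceable")
    return blockers


# Hypothetical tag sets for two candidate runs.
approved = {"compliance_approved": "true", "dvc_data_commit": "abc123"}
unreviewed = {"compliance_approved": "false"}

for name, tags in (("approved", approved), ("unreviewed", unreviewed)):
    problems = promotion_blockers(tags)
    print(name, "->", "promote" if not problems else problems)
```

Returning the list of reasons, rather than a bare boolean, gives the CI job something concrete to print when it rejects a model.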
Experienced dev note
Add a compliance_checkpoint stage before training that logs 'this data passed GDPR validation at time T with hash H.' Then MLflow tags reference that checkpoint ID instead of the raw data version. This decouples compliance validation from training: you can retrain on already-approved data without re-validating consent.
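A checkpoint of that kind could look like the sketch below: a record stating which check passed, for which data hash, at what time, with an ID derived from those fields. The function name, field names, and the 16-character ID length are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_checkpoint(data_hash, check, validated_at):
    """Record 'data with hash H passed check C at time T' under a stable ID."""
    record = {
        "check": check,
        "data_hash": data_hash,
        "validated_at": validated_at.isoformat(),
    }
    # Derive the ID from the record itself so identical inputs always map
    # to the same checkpoint and any change produces a different one.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["checkpoint_id"] = hashlib.sha256(payload).hexdigest()[:16]
    return record


# Hypothetical usage: the hash value here is the data_version from the config.
checkpoint = make_checkpoint(
    data_hash="e3b0c44298fc1c149afbf4c8996fb924",
    check="gdpr_consent",
    validated_at=datetime(2026, 4, 15, tzinfo=timezone.utc),
)
print(json.dumps(checkpoint, indent=2))
```

A training run then tags itself with checkpoint["checkpoint_id"], and any number of retraining runs can point at the same approved checkpoint.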
Check your understanding
Why must the DVC data commit hash be logged to MLflow tags? What audit scenario does this solve that DVC alone cannot?
Answer hint
DVC proves 'what data version exists' but not 'which specific data version trained model X.' MLflow proves 'what code trained model X' but not 'what data it used.' Only linking them (via tags) allows a regulator to ask 'show me the exact data that trained model V3' and get a verifiable answer.