
Regulatory compliance for data

What you will learn
Track data provenance, access logs, and transformation history to meet GDPR, HIPAA, and SOC 2 audit requirements.

Why this matters

Regulators audit your data pipeline. Without lineage tracking and access logs, you cannot prove data consent, deletion, or transformation accuracy during an audit. This results in failed compliance checks, fines, or data removal orders that break production models.

Skip if: You're building a toy project or internal prototype with no external data; in that case compliance tooling is overhead. For any production system touching customer, financial, or healthcare data, compliance tracking is mandatory.

Explanation

Regulatory compliance for data in MLOps means establishing an immutable audit trail of three things: (1) data provenance: where data came from and how it was transformed, (2) access logs: who accessed what data and when, and (3) retention policies: ensuring data deletion happens on schedule. MLflow's model registry combined with DVC's data versioning and explicit RBAC (role-based access control) in your data store creates a compliance-ready pipeline. The key is that every data transformation and model training run must be tagged with compliance metadata: data version, consent flags, PII masking status, and retention deadline. Tools like DVC track the immutable hash of each dataset version, MLflow logs the exact data inputs to each model, and container logs with centralized aggregation (ELK stack or cloud logging) create the access audit trail. This prevents the common scenario where a regulator asks 'what data did model V3 train on?' and you cannot answer.
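As a minimal sketch of the compliance metadata described above, the following builds the four values (data version, consent flag, PII masking status, retention deadline) as string tags. The function names and the single-file dataset are assumptions for illustration; DVC computes its own hashes for tracked files and directories, and the resulting dictionary is only shaped so that it could be passed to `mlflow.set_tags`.

```python
import hashlib
from datetime import date, timedelta
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_compliance_tags(data_path: Path, consent_verified: bool,
                          pii_masked: bool, retention_days: int,
                          today: date) -> dict[str, str]:
    """Assemble the compliance metadata for one training run as string tags."""
    deadline = today + timedelta(days=retention_days)
    return {
        "data_version": file_md5(data_path),
        "gdpr_consent_verified": str(consent_verified).lower(),
        "pii_masking_applied": str(pii_masked).lower(),
        "retention_deadline": deadline.isoformat(),
    }
```

Passing `today` explicitly rather than reading the clock inside the function keeps the retention deadline reproducible, which helps when an auditor asks how a given deadline was derived.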

Configuration

yaml
# .dvc/config - Data versioning with immutable audit trail (INI syntax, not YAML)
['remote "prod-storage"']
    url = s3://compliance-vault/data
    # Keep credentials out of this file: DVC does not expand ${VAR} here.
    # The S3 client reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment.

['core']
    remote = prod-storage
    autostage = true
    check_update = false

# dvc.yaml - Pipeline with compliance metadata
stages:
  ingest_raw:
    cmd: python ingest.py --source sales_db --output data/raw --log-access
    deps:
      - ingest.py
    outs:
      - data/raw:
          persist: false  # the content hash is recorded automatically in dvc.lock
    params:
      - ingest.source
  
  validate_consent:
    cmd: python validate_consent.py --input data/raw --output data/consented
    deps:
      - data/raw
      - validate_consent.py
    outs:
      - data/consented
    meta:
      compliance_check: gdpr_consent
      retention_days: 365
      pii_masked: true
  
  train_model:
    cmd: python train.py --data data/consented --output models/v1
    deps:
      - data/consented
      - train.py
    outs:
      - models/v1
    metrics:
      - metrics.json:
          cache: false
    meta:
      compliance_approved: true
      data_version: ${data_commit}  # data_commit must be defined in params.yaml or vars
      audit_trail_enabled: true

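# MLflowfile.yml - Project-specific run metadata with compliance tags.
# Not a native MLflow format: the training script reads it and applies it through the MLflow API.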
experiment_name: production_pipeline
run_name: training_run_2026_04_15
tags:
  compliance_approved: "true"
  gdpr_consent_verified: "true"
  pii_masking_applied: "true"
  retention_deadline: "2027-04-15"
  audit_log_enabled: "true"
  data_version: "e3b0c44298fc1c149afbf4c8996fb924"
params:
  learning_rate: 0.001
  batch_size: 32
metrics:
  accuracy: 0.94
  auc: 0.87

Why this order?

Ingest first (raw data capture), then validate consent (GDPR requirement), then train (only on consented data). DVC's DAG enforces this order through declared dependencies: train_model depends on data/consented, so it cannot run before validate_consent has produced it. MLflow tags are applied at training time so the compliance metadata lives alongside the model artifact.
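The consent step in the middle of that chain could look roughly like the sketch below. The column name `consent_given` and the CSV layout are hypothetical; the source does not show what validate_consent.py actually does. The point illustrated is that anything without an explicit positive flag is excluded, including blank values.

```python
import csv
import io


def filter_consented(rows, consent_field="consent_given"):
    """Split rows into (consented, rejected) by an explicit consent flag."""
    consented, rejected = [], []
    for row in rows:
        flag = str(row.get(consent_field, "")).strip().lower()
        # Only an explicit "true" counts; missing or blank values are rejected.
        (consented if flag == "true" else rejected).append(row)
    return consented, rejected


# Hypothetical raw extract: user 3 has no recorded consent decision.
raw_csv = "user_id,consent_given\n1,true\n2,false\n3,\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
kept, dropped = filter_consented(rows)
print(len(kept), len(dropped))
```

Returning the rejected rows as well as the kept ones makes it possible to log how many records were excluded, which is usually part of the evidence an audit asks for.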

Wrong vs Right

Wrong way
yaml
# ❌ No compliance tracking
stages:
  train:
    cmd: python train.py --input /tmp/data --output model.pkl
    outs:
      - model.pkl

# Later: regulator asks where this model's data came from
# Answer: 'I don't know, it was in /tmp'

Right way
yaml
# ✅ Full compliance chain
stages:
  train:
    cmd: python train.py --input data/consented --output models/v1 --log-compliance
    deps:
      - data/consented  # DVC records this content hash in dvc.lock
    outs:
      - models/v1
    meta:
      compliance_approved: true
      data_version: ${data_commit}  # Defined in params.yaml; links the stage to its source data
      audit_enabled: true

python
# MLflow logs compliance tags at runtime.
# data_hash and deadline_date are computed earlier in train.py.
import mlflow

with mlflow.start_run():
    mlflow.set_tags({
        'compliance_approved': 'true',
        'data_version': data_hash,
        'consent_verified': 'true',
        'retention_deadline': deadline_date
    })
    mlflow.log_artifact('audit_log.json')

Tool vitals

Primary command
bash
dvc remote list && dvc dag
# In the training script (Python): mlflow.log_param('compliance_data_version', ...)
Config files: .dvc/config, dvc.yaml, MLflowfile.yml
Verify
bash
dvc dag --full && python -c "import mlflow; print(mlflow.search_runs(experiment_names=['production_pipeline'], filter_string=\"tags.compliance_approved = 'true'\"))"

Integration notes

DVC provides content-addressed data versioning (combined with object lock on the remote, this guards against 'someone deleted the source data'), MLflow provides run-time audit records (what model was trained on what data), and container logging (centralized access logs for data stores) completes the picture. Together: DVC = data lineage, MLflow = model-to-data traceability, logging = access audit trail.
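For the access-log leg of that picture, a minimal sketch of a structured entry follows. The field names and the helper functions are assumptions, not a format required by ELK or any cloud logging service; the idea is one JSON object per line recording who accessed what and when, so a log shipper can parse it without custom rules.

```python
import json
import logging
from datetime import datetime, timezone


def access_record(user, dataset, action, when=None):
    """Build one structured access-log entry: who, what, and when (UTC)."""
    when = when or datetime.now(timezone.utc)
    return {
        "user": user,
        "dataset": dataset,
        "action": action,
        "timestamp": when.isoformat(),
    }


def log_access(logger, user, dataset, action, when=None):
    """Emit the entry as a single JSON line and return it for further use."""
    record = access_record(user, dataset, action, when)
    logger.info(json.dumps(record, sort_keys=True))
    return record


logging.basicConfig(level=logging.INFO, format="%(message)s")
audit_logger = logging.getLogger("data_access_audit")
entry = log_access(audit_logger, "alice", "data/consented", "read",
                   datetime(2026, 4, 15, 12, 0, tzinfo=timezone.utc))
```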

Migration path

If moving away: Delta Lake (Databricks) provides similar ACID + lineage for data warehouses. For model registry, Neptune or Weights & Biases offer compliance-first audit trails. The pattern (versioned data + tagged experiments + access logs) is universal; only the tooling changes.

Cost model

DVC Community is free; DVC Cloud is quoted at $0.03/GB/month for remote storage. MLflow OSS is free; managed MLflow on Databricks adds RBAC and SOC 2 compliance features and is billed per DBU (quoted at $3-5/DBU). Cloud object storage (S3, GCS, Azure Blob) charges per GB stored. Vendor pricing changes often, so confirm current rates before budgeting. Plan for 50-200 GB of storage per month for typical ML pipelines with compliance logging.

Common gotcha

DVC identifies each dataset version by content hash, but if you push model artifacts to MLflow without recording which version was used, auditors cannot verify which dataset trained which model. DVC has no commits of its own, so the value to log is either the Git commit that pins dvc.lock or the dataset's content hash from dvc.lock. Always log it to MLflow tags at training time: `mlflow.set_tag('dvc_data_commit', dvc_commit_hash)`. Without this bridge, the model-to-data link cannot be proven.
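One way to obtain the value for that tag is to read it from dvc.lock, where DVC records the hash of every dependency of each stage. The sketch below assumes the lock file has already been parsed into a dictionary (for example with a YAML loader); the helper name and the hash values are made up for illustration.

```python
def dep_hash_from_lock(lock: dict, stage: str, dep_path: str) -> str:
    """Return the recorded md5 of one dependency of a stage in a parsed dvc.lock."""
    for dep in lock["stages"][stage].get("deps", []):
        if dep.get("path") == dep_path:
            return dep["md5"]
    raise KeyError(f"{dep_path!r} is not a dependency of stage {stage!r}")


# Example mapping in the shape of a parsed dvc.lock (hash values are invented).
lock = {
    "schema": "2.0",
    "stages": {
        "train_model": {
            "cmd": "python train.py --data data/consented --output models/v1",
            "deps": [
                {"path": "data/consented", "hash": "md5",
                 "md5": "0123456789abcdef0123456789abcdef.dir"},
                {"path": "train.py", "hash": "md5",
                 "md5": "fedcba9876543210fedcba9876543210"},
            ],
        }
    },
}

dvc_data_hash = dep_hash_from_lock(lock, "train_model", "data/consented")
# Inside an active MLflow run, the bridge is then a single call:
#   mlflow.set_tag("dvc_data_commit", dvc_data_hash)
```

Raising on a missing dependency, rather than returning an empty string, keeps a training run from silently logging a blank data version.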

Team adoption

Day 1: Add a .dvc/config pointing at your compliance vault (S3 bucket with object lock enabled). Week 1: Require all pipelines to pass a consistency check in CI/CD, for example dvc status --quiet, which exits non-zero when stages or data are out of date. Week 2: Add MLflow compliance tag validation to your model promotion gate: no model can move to production without 'compliance_approved=true'. Week 4: Centralize access logs to ELK or CloudWatch and wire them to your compliance dashboard.
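The Week 2 promotion gate can be sketched as a pure function over a run's tags. The tag names follow the configuration earlier in this section; the function name and the exact set of required tags are assumptions to adapt to your own policy. In practice the tags dictionary would come from the MLflow run being promoted.

```python
from datetime import date

# Tags that must equal the string "true" (MLflow stores tag values as strings).
REQUIRED_TRUE_TAGS = ("compliance_approved", "gdpr_consent_verified",
                      "pii_masking_applied")


def promotion_blockers(tags: dict, today: date) -> list:
    """Return the reasons a run may not be promoted; an empty list means it passes."""
    problems = [f"{name} must be 'true'" for name in REQUIRED_TRUE_TAGS
                if tags.get(name) != "true"]
    if not tags.get("data_version"):
        problems.append("data_version is missing")
    deadline = tags.get("retention_deadline", "")
    try:
        if date.fromisoformat(deadline) < today:
            problems.append("retention_deadline has passed")
    except ValueError:
        problems.append("retention_deadline is missing or not an ISO date")
    return problems


run_tags = {
    "compliance_approved": "true",
    "gdpr_consent_verified": "true",
    "pii_masking_applied": "true",
    "retention_deadline": "2027-04-15",
    "data_version": "e3b0c44298fc1c149afbf4c8996fb924",
}
blockers = promotion_blockers(run_tags, today=date(2026, 5, 1))
```

Returning a list of reasons instead of a boolean lets the CI job print exactly which tag blocked the promotion.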

Experienced dev note

Add a compliance_checkpoint stage before training that logs 'this data passed GDPR validation at time T with hash H.' Then MLflow tags reference that checkpoint ID instead of the raw data version. This decouples compliance validation from training: you can retrain on already-approved data without re-validating consent.
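A checkpoint record of that kind might be built as follows. This is a sketch under assumptions: the field names, the 16-character ID, and the choice of SHA-256 over the canonical JSON are illustrative, not something the source prescribes. Deriving the ID from the record's content means the same validation facts always produce the same ID.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_checkpoint(data_hash: str, check_name: str, validated_at: datetime) -> dict:
    """Record 'data with hash H passed check C at time T' with a derived ID."""
    body = {
        "check": check_name,
        "data_hash": data_hash,
        "validated_at": validated_at.isoformat(),
    }
    # Canonical JSON (sorted keys) so the ID does not depend on key order.
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return {**body, "checkpoint_id": hashlib.sha256(canonical).hexdigest()[:16]}


checkpoint = make_checkpoint(
    "e3b0c44298fc1c149afbf4c8996fb924",
    "gdpr_consent",
    datetime(2026, 4, 15, 9, 30, tzinfo=timezone.utc),
)
# A training run would then tag itself with checkpoint["checkpoint_id"]
# rather than with the raw data version.
```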

Check your understanding

Why must the DVC data commit hash be logged to MLflow tags? What audit scenario does this solve that DVC alone cannot?

Answer hint

DVC proves 'what data version exists' but not 'which specific data version trained model X.' MLflow proves 'what code trained model X' but not 'what data it used.' Only linking them (via tags) allows a regulator to ask 'show me the exact data that trained model V3' and get a verifiable answer.
