
Point-in-time correctness: preventing data leakage in features

What you will learn
Use DVC data versioning and MLflow run metadata to ensure features are computed only from historical data available at prediction time, preventing data leakage.

Why this matters

Data leakage silently inflates offline metrics, then fails in production, where the future data the model relied on does not exist yet. Without point-in-time correctness, the model learns patterns it can never observe at prediction time, so live performance falls well below what validation promised, often with no error to signal why.

Skip if: you only run offline batch inference on static historical data where all features are precomputed and immutable. Even then, DVC versioning helps prevent accidentally retraining on the wrong feature version.

Explanation

Point-in-time correctness means: at training time, features are computed only from data available before the observation date; at inference time, features come only from data available before the prediction request. DVC's data versioning combined with MLflow's run metadata makes this auditable. You tag each MLflow run with the observation cutoff date and version feature datasets with DVC, keeping a computation timestamp column in the data itself. When retraining, you explicitly load the feature version computed up to that cutoff. This prevents the common leakage pattern where tomorrow's aggregates sneak into today's training data.
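To make the training-time half of that definition concrete, here is a minimal sketch of a cutoff-aware aggregation in pandas. The column names (`event_time`, `amount`), the feature names, and the function itself are assumptions for illustration; they are not taken from the pipeline described here.

```python
import pandas as pd


def compute_customer_features(transactions, cutoff_date, lookback_days=90):
    """Aggregate per-customer features from rows strictly before the cutoff."""
    cutoff = pd.Timestamp(cutoff_date)
    window_start = cutoff - pd.Timedelta(days=lookback_days)
    # Keep only events inside [window_start, cutoff). Anything at or after
    # the cutoff would be future information for this training set.
    in_window = transactions[
        (transactions["event_time"] >= window_start)
        & (transactions["event_time"] < cutoff)
    ]
    features = (
        in_window.groupby("customer_id")["amount"]
        .agg(txn_count="count", txn_total="sum")
        .reset_index()
    )
    features["cutoff_date"] = cutoff  # stamp the cutoff into the data itself
    return features


# Toy data: one future row and one row older than the lookback window.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2025-12-01", "2026-01-05", "2025-11-15", "2025-08-01"]
    ),
    "amount": [10.0, 99.0, 5.0, 7.0],
})
features = compute_customer_features(transactions, "2025-12-31", lookback_days=90)
```

The 2026-01-05 transaction is excluded because it falls after the cutoff, and the 2025-08-01 transaction is excluded because it is older than the 90-day window, so each customer ends up with exactly one counted transaction.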

The mechanism relies on DVC's content-addressable storage: each data version is identified by a content hash, which DVC records in .dvc files or dvc.lock together with the output's path and size. These files do not store computation timestamps; the time dimension comes from the Git commit that contains them and from the cutoff date you log. MLflow runs store the Git commit hash and the cutoff, creating an audit trail. During inference, your feature pipeline resolves the data version (for example, a Git tag or commit) that corresponds to the prediction timestamp, so no future data contaminates predictions.
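Content addressing simply means that the identifier is derived from the bytes of the data. The sketch below computes an MD5 digest of a file in that spirit, so a recomputed feature file gets a different identifier. It is illustrative only: the helper name is hypothetical, and the digest is not guaranteed to equal the value DVC records for every DVC version or for directory outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two stand-in "feature files" with different contents.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "features_v1.bin"
    second = Path(tmp) / "features_v2.bin"
    first.write_bytes(b"hello")
    second.write_bytes(b"hello, but recomputed")
    hash_v1 = file_md5(first)
    hash_v2 = file_md5(second)
```

Because the identifier depends only on content, logging it next to the cutoff date tells you later whether two runs really trained on the same bytes.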

This is essential in time-series problems (fraud detection, demand forecasting) where leakage is subtle: a simple join on customer_id without time-window boundaries leaks tomorrow's transactions into today's training labels.
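One way to add the missing time-window boundary to such a join is `pandas.merge_asof`, which matches each observation with the most recent feature row at or before it. The sketch below is an assumption-laden illustration: the frames, the column names (`observed_at`, `available_at`, `txn_count_90d`), and the choice to exclude exact matches are all hypothetical.

```python
import pandas as pd

# Observations to build training rows for: one per (customer, time).
observations = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "observed_at": pd.to_datetime(["2025-12-10", "2025-12-20", "2025-12-15"]),
})

# Feature snapshots, each stamped with the time it became available.
snapshots = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "available_at": pd.to_datetime(["2025-12-05", "2025-12-15", "2025-12-18"]),
    "txn_count_90d": [3, 4, 7],
})

# For each observation, take the latest snapshot of the same customer that
# became available strictly before the observation time.
joined = pd.merge_asof(
    observations.sort_values("observed_at"),
    snapshots.sort_values("available_at"),
    left_on="observed_at",
    right_on="available_at",
    by="customer_id",
    direction="backward",
    allow_exact_matches=False,
)
```

Customer 2's only snapshot became available on 2025-12-18, after the 2025-12-15 observation, so that row gets a missing value instead of a leaked future feature. A plain merge on `customer_id` would have attached it.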

Configuration

yaml
# dvc.yaml - Define feature pipeline with explicit cutoff tracking.
# params.yaml is assumed to contain:
#   prepare_training_features:
#     cutoff_date: "2025-12-31"
#     lookback_days: 90
# DVC interpolates ${...} from params.yaml and reruns a stage when a tracked param changes.
stages:
  prepare_training_features:
    cmd: python scripts/compute_features.py --cutoff-date ${prepare_training_features.cutoff_date} --output-dir features/train
    deps:
      - data/raw/transactions.parquet
      - scripts/compute_features.py
    params:
      - prepare_training_features.cutoff_date
      - prepare_training_features.lookback_days
    outs:
      # Hashes, sizes, and file counts are written to dvc.lock by DVC, not declared here.
      - features/train/customer_features.parquet
    metrics:
      - metrics/feature_stats.json:
          cache: false

  train_model:
    cmd: python scripts/train.py --features features/train/customer_features.parquet --cutoff ${prepare_training_features.cutoff_date}
    deps:
      - features/train/customer_features.parquet
      - scripts/train.py
    params:
      - prepare_training_features.cutoff_date
    outs:
      - models/model.pkl
    plots:
      - plots/confusion_matrix.csv:
          template: confusion

ini
# .dvc/config - Track data lineage
[core]
  remote = s3storage
  autostage = true

['remote "s3storage"']
  url = s3://ml-bucket/features

python
# Python training script (scripts/train.py) snippet:
import mlflow
import mlflow.sklearn
import pandas as pd
import yaml

# train_model() is a project-specific helper assumed to be defined or imported above.

with open("params.yaml") as f:
    params = yaml.safe_load(f)

CUTOFF_DATE = params["prepare_training_features"]["cutoff_date"]
LOOKBACK_DAYS = params["prepare_training_features"]["lookback_days"]

# Read the content hash DVC recorded for the feature file instead of hard-coding it.
# dvc.lock exists once the pipeline has been run with `dvc repro`.
with open("dvc.lock") as f:
    lock = yaml.safe_load(f)
DATA_VERSION = lock["stages"]["prepare_training_features"]["outs"][0]["md5"]

mlflow.set_experiment("feature_pipeline_v2")
with mlflow.start_run() as run:
    mlflow.log_param("cutoff_date", CUTOFF_DATE)
    mlflow.log_param("lookback_days", LOOKBACK_DAYS)
    mlflow.log_param("data_version", DATA_VERSION)

    features_df = pd.read_parquet("features/train/customer_features.parquet")
    print(f"Loaded {len(features_df)} rows with cutoff {CUTOFF_DATE}")

    model = train_model(features_df)
    mlflow.sklearn.log_model(model, "model")
    # Placeholder value: log a metric computed on held-out data here.
    mlflow.log_metric("train_auc", 0.87)

    print(f"Run ID: {run.info.run_id}")
    print(f"Cutoff logged: {CUTOFF_DATE}")

Why this order?

DVC orders stages by their dependency graph, not by their position in dvc.yaml: train_model depends on features/train/customer_features.parquet, so prepare_training_features must complete first. Both stages track the same cutoff_date parameter, which keeps them consistent. MLflow logging happens at training runtime to capture the exact data version and cutoff applied.

Wrong vs Right

Wrong way
python
# LEAKAGE: No cutoff enforcement
features_df = pd.read_parquet("features/all_data.parquet")
model = train_model(features_df)  # Could contain future data
mlflow.log_param("features_version", "latest")  # No timestamp tracking

# Later at inference:
features_inference = pd.read_parquet("features/all_data.parquet")  # Different version, no audit trail
predictions = model.predict(features_inference)

Right way
python
# CORRECT: Explicit cutoff + DVC versioning + MLflow audit
# At training time (scripts/train.py):
import mlflow
import mlflow.sklearn
import pandas as pd

CUTOFF_DATE = "2025-12-31"
features_path = "features/train/customer_features.parquet"
features_df = pd.read_parquet(features_path)
# Fail fast if any row was computed after the cutoff.
assert (features_df["computation_timestamp"] <= pd.Timestamp(CUTOFF_DATE)).all()

mlflow.log_param("cutoff_date", CUTOFF_DATE)
mlflow.log_artifact("dvc.lock")  # Log DVC metadata (output hashes and params)
model = train_model(features_df)  # project-specific helper
mlflow.sklearn.log_model(model, "model")

# At inference time (scripts/inference.py):
# get_dvc_version_before() and load_from_dvc() are hypothetical project helpers.
# Naive UTC, assuming computation_timestamp is stored as naive UTC.
prediction_time = pd.Timestamp.now(tz="UTC").tz_localize(None)
feature_version = get_dvc_version_before(prediction_time)  # Load only features computed before now
features_inference = load_from_dvc(feature_version)
assert (features_inference["computation_timestamp"] <= prediction_time).all()
predictions = model.predict(features_inference)

Tool vitals

Primary command
bash
dvc stage add -n feature_stage -d data/ -o features/ -M metrics.json python scripts/compute_features.py --cutoff-date 2025-12-31
dvc repro feature_stage
Config file .dvc/config and dvc.yaml
Verify
bash
dvc dag && dvc status && mlflow experiments search | grep feature_pipeline_v2

Integration notes

DVC + MLflow partnership here: DVC manages feature data versioning and reproducibility via dvc.yaml, dvc.lock, and .dvc files; MLflow logs the run-level metadata (cutoff_date, data_version hash) so you can trace which training run used which feature version. In production, your inference service queries MLflow to fetch the last-known-good run, retrieves its logged data version, and loads features from that exact DVC version. This creates a traceable chain: commit → DVC data version → MLflow run → production prediction.
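The lookup step can be sketched as follows. This is a minimal illustration under stated assumptions: it treats the newest finished run as "last-known-good" (your promotion rule may differ), it relies on the `cutoff_date` and `data_version` params logged by the training script, and both function names are hypothetical. `mlflow.search_runs` returns a pandas DataFrame whose param columns are prefixed with `params.`.

```python
import pandas as pd


def data_version_from_runs(runs: pd.DataFrame) -> dict:
    """Pick the newest run and return the lineage params it logged."""
    if runs.empty:
        raise LookupError("no MLflow runs found")
    newest = runs.sort_values("start_time", ascending=False).iloc[0]
    return {
        "run_id": newest["run_id"],
        "cutoff_date": newest["params.cutoff_date"],
        "data_version": newest["params.data_version"],
    }


def latest_data_version(experiment_name: str) -> dict:
    """Query MLflow for finished runs of one experiment (needs mlflow installed)."""
    import mlflow  # imported lazily so the helper above works without it

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string="attributes.status = 'FINISHED'",
    )
    return data_version_from_runs(runs)
```

An inference service would call something like `latest_data_version("feature_pipeline_v2")` at startup and then load features pinned to the returned `data_version`, rather than reading whatever file happens to be in the workspace.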

Migration path

If you move away from DVC (e.g., to a feature store like Tecton or Feast), the same principle applies: the feature store must support point-in-time queries and version immutability. Replace DVC's versioning with your feature store's time-travel API and keep MLflow logging the feature store query timestamp and version ID.

Common gotcha

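DVC decides whether to rerun a stage from the hashes of its dependencies, its tracked params, and its command, not from what the cutoff means. If the cutoff reaches the script through something DVC does not track (a shell environment variable, a hard-coded constant, an untracked config file), changing it leaves the stage "up to date" and dvc repro silently reuses the old cached file. Declare cutoff_date under params so a change invalidates the stage, or force recomputation with dvc repro --force. Including the cutoff date in the output path (e.g., features/train/features_cutoff_2025_12_31.parquet) is still worthwhile, because it keeps feature sets for different cutoffs from overwriting each other in the workspace. MLflow can record cutoff_date=2025-12-31, but that record is only true if the data on disk was actually recomputed for that cutoff.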
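A cutoff-stamped output path like the one mentioned above can be generated rather than typed by hand. This is a small sketch; the helper name and the default directory are assumptions for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def features_path_for_cutoff(cutoff_date: str, base_dir: str = "features/train") -> str:
    """Build an output path that embeds the cutoff date."""
    parsed = date.fromisoformat(cutoff_date)  # raises ValueError on a malformed date
    filename = f"features_cutoff_{parsed:%Y_%m_%d}.parquet"
    return str(PurePosixPath(base_dir) / filename)


path = features_path_for_cutoff("2025-12-31")
```

Parsing the date first means a typo such as "2025-13-01" fails immediately instead of producing a plausible-looking file name.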

Team adoption

1. Document the cutoff_date parameter prominently in your feature pipeline README.
2. Add a unit test that confirms all feature rows have computation_timestamp ≤ cutoff_date.
3. Track cutoff_date as a DVC param and include it in your feature output path, so a changed cutoff always triggers recomputation and never overwrites another cutoff's features.
4. In code review, require that every retraining run has MLflow logs showing the cutoff_date and DVC version.
5. Set up a weekly validation that reruns training on historical cutoff dates and compares metrics, to catch leakage bugs early.
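The unit test from step 2 might look like the sketch below, written in pytest style. The helper name and the toy frame are assumptions; a real test would load the pipeline's actual output.

```python
import pandas as pd


def find_leaky_rows(features, cutoff_date, timestamp_col="computation_timestamp"):
    """Return the rows whose timestamp is later than the cutoff."""
    cutoff = pd.Timestamp(cutoff_date)
    return features[features[timestamp_col] > cutoff]


def test_features_respect_cutoff():
    # In a real test, read the feature file produced by the pipeline instead.
    features = pd.DataFrame({
        "customer_id": [1, 2],
        "computation_timestamp": pd.to_datetime(["2025-12-30", "2025-12-31"]),
    })
    leaky = find_leaky_rows(features, "2025-12-31")
    assert leaky.empty, f"{len(leaky)} rows computed after the cutoff"
```

Returning the offending rows, rather than a boolean, makes a failing test easier to debug because the assertion message can report how many rows leaked and which ones.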

Experienced dev note

The real win is storing the Git commit hash (which pins dvc.yaml, dvc.lock, and any .dvc files), not just a data version string, on the MLflow run, e.g. mlflow.set_tag("git_commit", subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()). This creates a short reproducibility path: open an MLflow run → read its git_commit tag → git checkout that commit → dvc checkout to restore the exact features → rerun training. Without this, you have feature versions sitting in S3 with no easy way to tell which code and parameters produced them.
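A slightly more defensive way to read the commit hash is to use `subprocess.run` and handle the case where the code runs outside a Git checkout. This is a sketch; the function name and the choice to return `None` on failure are assumptions.

```python
import subprocess


def current_git_commit(repo_dir="."):
    """Return the HEAD commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing git binary; CalledProcessError covers
        # running outside a repository.
        return None
    return result.stdout.strip()


commit = current_git_commit()
```

Inside an active MLflow run, the returned value can be attached with `mlflow.set_tag` or `mlflow.log_param`. Failing the training job when it is `None` is a reasonable policy if every run must be reproducible.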

Check your understanding

Why does simply declaring cutoff_date: "2025-12-31" as a DVC parameter not prevent leakage if the feature computation script ignores the cutoff_date parameter?

Answer hint

DVC orchestrates the pipeline and tracks outputs, but doesn't enforce data semantics. The script itself must filter the input data to only include rows with timestamps ≤ cutoff_date. DVC ensures reproducibility; your code ensures correctness.
