Point-in-time correctness: preventing data leakage in features
Why this matters
Data leakage silently inflates model metrics during training, then causes failures in production, where the future data the model relied on is not available at prediction time. Without point-in-time correctness, the model learns patterns that cannot be reproduced in production, so performance drops sharply after deployment and the cause is hard to trace.
Explanation
Point-in-time correctness means: at training time, features are computed only from data available before the observation date; at inference time, features come only from data available before the prediction request. DVC's data versioning combined with MLflow's run tracking supports this. You log the observation cutoff date on each MLflow run and store feature datasets in DVC together with their computation timestamps. When retraining, you explicitly load the feature version computed up to that cutoff. This prevents the common leakage pattern where tomorrow's aggregates sneak into today's training data.
The mechanism works through DVC's content-addressable storage (each data version is identified by a hash of its contents) paired with .dvc and dvc.lock files that record those hashes. Committing these files to Git ties each feature version to a commit and its timestamp. MLflow runs store the commit hash and the cutoff timestamp, creating an audit trail. During inference, your feature pipeline references the DVC version tag corresponding to the prediction timestamp, so no future data contaminates predictions.
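To make the content-addressing idea concrete, here is a minimal sketch using only the standard library. It is not DVC's implementation: content_hash and version_record are hypothetical helpers, and the file names are made up. It only illustrates that identical bytes yield the same version ID, that any change yields a different one, and that the cutoff can travel alongside the hash.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def version_record(path: Path, cutoff_date: str) -> dict:
    """Pair a data version (content hash) with the cutoff it was built for."""
    return {"path": path.name, "md5": content_hash(path), "cutoff_date": cutoff_date}


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "features_a.csv"
    b = Path(tmp) / "features_b.csv"
    a.write_text("customer_id,txn_count\n1,3\n")
    b.write_text("customer_id,txn_count\n1,4\n")  # one value differs
    record_a = version_record(a, "2025-12-31")
    record_a_again = version_record(a, "2025-12-31")
    record_b = version_record(b, "2025-12-31")
```

Hashing the same file twice gives the same record, while the one-character change in the second file gives a different hash, which is what lets a logged hash identify exactly one version of the data.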
This is essential in time-series problems (fraud detection, demand forecasting) where leakage is subtle: a simple join on customer_id without time-window boundaries pulls tomorrow's transactions into the features for today's training rows.
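One common way to put time boundaries on such a join is an as-of merge. The sketch below uses pandas' merge_asof; the frames, column names (observed_at, computed_at, txn_count_90d), and values are invented for illustration and are not taken from the pipeline described here.

```python
import pandas as pd

# Hypothetical labelled observations: one row per (customer, observation time).
observations = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "observed_at": pd.to_datetime(["2025-12-10", "2025-12-20", "2025-12-15"]),
    }
)

# Hypothetical feature snapshots, each stamped with when it was computed.
snapshots = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2],
        "computed_at": pd.to_datetime(
            ["2025-12-05", "2025-12-18", "2025-12-01", "2025-12-16"]
        ),
        "txn_count_90d": [4, 9, 2, 7],
    }
)

# merge_asof requires both frames to be sorted by their time keys.
point_in_time = pd.merge_asof(
    observations.sort_values("observed_at"),
    snapshots.sort_values("computed_at"),
    left_on="observed_at",
    right_on="computed_at",
    by="customer_id",
    direction="backward",       # only look into the past
    allow_exact_matches=False,  # snapshot must be strictly earlier
)
```

Each observation is matched to the most recent snapshot computed before it, so the 2025-12-18 snapshot for customer 1 never reaches the 2025-12-10 observation. A plain merge on customer_id would have attached both snapshots to both observations.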
Configuration
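# dvc.yaml - Define feature pipeline with explicit cutoff tracking
stages:
  prepare_training_features:
    # ${...} is filled in from params.yaml by DVC templating
    cmd: python scripts/compute_features.py --cutoff-date ${prepare_training_features.cutoff_date} --output-dir features/train
    deps:
      - data/raw/transactions.parquet
      - scripts/compute_features.py
    params:
      - prepare_training_features.cutoff_date
      - prepare_training_features.lookback_days
    outs:
      - features/train/customer_features.parquet
    metrics:
      - metrics/feature_stats.json:
          cache: false
  train_model:
    cmd: python scripts/train.py --features features/train/customer_features.parquet --cutoff ${prepare_training_features.cutoff_date}
    deps:
      - features/train/customer_features.parquet
      - scripts/train.py
    params:
      - prepare_training_features.cutoff_date
    outs:
      - models/model.pkl
    plots:
      - plots/confusion_matrix.csv:
          template: confusion
# params.yaml - Values read by the stages above and by scripts/train.py
prepare_training_features:
  cutoff_date: "2025-12-31"
  lookback_days: 90
# dvc.lock (generated by DVC, not edited by hand) records each output's hash and size, e.g.:
#   md5: abc123def456
#   size: 1024000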
# .dvc/config - Remote storage for versioned feature data
[core]
    remote = s3storage
    autostage = true
['remote "s3storage"']
    url = s3://ml-bucket/features
# Python training script (scripts/train.py) snippet:
import mlflow
import mlflow.sklearn
import pandas as pd
import yaml

# Read the same params.yaml that DVC uses, so the cutoff has a single source.
with open("params.yaml") as f:
    params = yaml.safe_load(f)

CUTOFF_DATE = params["prepare_training_features"]["cutoff_date"]
LOOKBACK_DAYS = params["prepare_training_features"]["lookback_days"]

mlflow.set_experiment("feature_pipeline_v2")
with mlflow.start_run() as run:
    mlflow.log_param("cutoff_date", CUTOFF_DATE)
    mlflow.log_param("lookback_days", LOOKBACK_DAYS)
    # Placeholder value: in practice, log the output hash recorded by DVC.
    mlflow.log_param("data_version", "dvc_abc123def456")

    features_df = pd.read_parquet("features/train/customer_features.parquet")
    print(f"Loaded {len(features_df)} rows with cutoff {CUTOFF_DATE}")

    model = train_model(features_df)  # train_model is assumed to be defined elsewhere
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_metric("train_auc", 0.87)  # placeholder; log the AUC you actually compute

    print(f"Run ID: {run.info.run_id}")
    print(f"Cutoff enforced: {CUTOFF_DATE}")
Why this order?
DVC runs stages in dependency order: because train_model lists the features file as a dependency, prepare_training_features must complete before train_model uses its output. Both stages read the same cutoff_date parameter, which keeps them consistent. MLflow logging happens at training runtime so that the run records the exact data version and cutoff applied.
Wrong vs Right
# LEAKAGE: No cutoff enforcement
features_df = pd.read_parquet("features/all_data.parquet")
model = train_model(features_df)  # Could contain future data
mlflow.log_param("features_version", "latest")  # No timestamp tracking
# Later at inference:
features_inference = pd.read_parquet("features/all_data.parquet")  # Different version, no audit trail
predictions = model.predict(features_inference)

# CORRECT: Explicit cutoff + DVC versioning + MLflow audit
# At training time (scripts/train.py):
CUTOFF_DATE = "2025-12-31"
features_path = "features/train/customer_features.parquet"
features_df = pd.read_parquet(features_path)
assert (features_df["computation_timestamp"] <= pd.Timestamp(CUTOFF_DATE)).all()
mlflow.log_param("cutoff_date", CUTOFF_DATE)
mlflow.log_artifact("dvc.lock")  # Log DVC metadata (output hashes)
model = train_model(features_df)
mlflow.sklearn.log_model(model, "model")

# At inference time (scripts/inference.py):
# get_dvc_version_before and load_from_dvc are project-specific helpers, not DVC APIs.
# Assumes computation_timestamp is stored as naive UTC.
prediction_time = pd.Timestamp.now(tz="UTC").tz_localize(None)
feature_version = get_dvc_version_before(prediction_time)  # Load only features computed before now
features_inference = load_from_dvc(feature_version)
assert (features_inference["computation_timestamp"] <= prediction_time).all()
predictions = model.predict(features_inference)
Tool vitals
dvc stage add -n feature_stage -d data/ -o features/ -M metrics.json python scripts/compute_features.py --cutoff-date
.dvc/config and dvc.yaml
dvc dag && dvc status && mlflow experiments search --experiment-names feature_pipeline
Integration notes
DVC and MLflow divide the work here: DVC manages feature data versioning and reproducibility via dvc.yaml and .dvc files; MLflow logs the run-level metadata (cutoff_date, data_version hash) so you can trace which training run used which feature version. In production, your inference service queries MLflow for the last-known-good run, retrieves its logged DVC version tag, and loads features from that exact DVC version. This creates a traceable chain: commit → DVC data version → MLflow run → production prediction.
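The lookup step can be pictured as a search over recorded feature versions. The sketch below is a hypothetical stand-in: FEATURE_VERSIONS, its tag names, and resolve_feature_version are illustrative and are not part of DVC or MLflow. A real service would more likely populate the list from MLflow run params or Git tags.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional

# Hypothetical registry of feature versions, sorted by computation time.
FEATURE_VERSIONS = [
    (datetime(2025, 10, 31), "features-2025-10-31"),
    (datetime(2025, 11, 30), "features-2025-11-30"),
    (datetime(2025, 12, 31), "features-2025-12-31"),
]


def resolve_feature_version(prediction_time: datetime) -> Optional[str]:
    """Return the newest version computed at or before prediction_time."""
    computed_times = [computed_at for computed_at, _ in FEATURE_VERSIONS]
    idx = bisect_right(computed_times, prediction_time)
    if idx == 0:
        return None  # nothing had been computed yet at that time
    return FEATURE_VERSIONS[idx - 1][1]
```

Returning None when no version exists yet, rather than falling back to the newest one, keeps a missing version visible instead of quietly serving features from the future.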
Migration path
If you move away from DVC (e.g., to a feature store like Tecton or Feast), the same principle applies: the feature store must support point-in-time queries and version immutability. Replace DVC's versioning with your feature store's time-travel API and keep MLflow logging the feature store query timestamp and version ID.
Common gotcha
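DVC identifies cached outputs by content hash and decides whether to re-run a stage by comparing its command, dependencies, and declared params against dvc.lock; it knows nothing about cutoff_date beyond that. If the cutoff reaches the script by a route DVC does not track (an environment variable, a hard-coded default), changing it will not trigger recomputation, and DVC will silently reuse the old file. Declare cutoff_date under params, and include the cutoff date in the output path (e.g., features/train/features_cutoff_2025_12_31.parquet) so outputs for different cutoffs cannot be mistaken for one another, or force recomputation with dvc repro --force. MLflow run params can say cutoff_date=2025-12-31, but that is only true if the file on disk was actually computed with that cutoff.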
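A small helper can keep the path convention consistent across scripts. This is a sketch only; features_path_for_cutoff and its default directory are assumptions modelled on the example path above.

```python
from datetime import date
from pathlib import PurePosixPath


def features_path_for_cutoff(
    cutoff: date, base_dir: str = "features/train"
) -> PurePosixPath:
    """Build an output path that embeds the cutoff date in the file name."""
    return PurePosixPath(base_dir) / f"features_cutoff_{cutoff:%Y_%m_%d}.parquet"
```

Because the path is derived from the cutoff, two runs with different cutoffs write to different files and can never overwrite each other.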
Team adoption
1. Document the cutoff_date parameter prominently in your feature pipeline README.
2. Add a unit test that confirms all feature rows have computation_timestamp ≤ cutoff_date.
3. Include cutoff_date in your feature output path so that each cutoff produces its own file and a changed cutoff is never served from an old output.
4. In code review, require that every retraining run has MLflow logs showing the cutoff_date and DVC version.
5. Set up a weekly validation that re-runs training on historical cutoff dates and compares metrics, to catch leakage bugs early.
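The unit test from step 2 might look like the sketch below. It assumes the features carry a computation_timestamp column, as in the examples above; rows_after_cutoff is a hypothetical helper, and the tests are written so that pytest would collect them.

```python
import pandas as pd


def rows_after_cutoff(features: pd.DataFrame, cutoff_date: str) -> pd.DataFrame:
    """Return the rows whose computation_timestamp is later than the cutoff."""
    cutoff = pd.Timestamp(cutoff_date)
    return features[features["computation_timestamp"] > cutoff]


def test_no_rows_after_cutoff() -> None:
    features = pd.DataFrame(
        {
            "customer_id": [1, 2],
            "computation_timestamp": pd.to_datetime(["2025-12-30", "2025-12-31"]),
        }
    )
    assert rows_after_cutoff(features, "2025-12-31").empty


def test_leaky_row_is_detected() -> None:
    features = pd.DataFrame(
        {
            "customer_id": [1, 2],
            "computation_timestamp": pd.to_datetime(["2025-12-30", "2026-01-02"]),
        }
    )
    leaked = rows_after_cutoff(features, "2025-12-31")
    assert leaked["customer_id"].tolist() == [2]
```

Returning the offending rows, rather than a bare boolean, makes a failing test show which customers leaked.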
Experienced dev note
The real win is storing the Git commit hash that pins the DVC state (not just the data version) on the MLflow run, for example via mlflow.log_param("dvc_commit", os.popen("git rev-parse HEAD").read().strip()). This creates a short reproducibility path: open an MLflow run → read its dvc_commit param → git checkout that commit → dvc checkout to restore the exact features → rerun training. Without this, you have feature versions floating in S3 with no record of which code and pipeline state produced them.
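A slightly more defensive way to read the commit hash is sketched below, using subprocess so that a missing Git binary or repository yields None instead of an empty string. current_git_commit and is_full_sha are hypothetical helper names.

```python
import re
import subprocess
from typing import Optional


def current_git_commit() -> Optional[str]:
    """Return the HEAD commit hash, or None when Git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def is_full_sha(value: str) -> bool:
    """True for a 40-character (SHA-1) or 64-character (SHA-256) hex commit ID."""
    return re.fullmatch(r"[0-9a-f]{40}|[0-9a-f]{64}", value) is not None
```

In a training script, the result would be logged only when it is not None, so a run started outside a repository fails the review check in an obvious way rather than recording a blank value.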
Check your understanding
Why does simply setting cutoff_date: "2025-12-31" in dvc.yaml not prevent leakage if the feature computation script ignores the cutoff_date parameter?
Answer hint
DVC orchestrates the pipeline and tracks outputs, but doesn't enforce data semantics. The script itself must filter the input data to only include rows with timestamps ≤ cutoff_date. DVC ensures reproducibility; your code ensures correctness.
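The filtering the hint describes could look like the following sketch. The column names (timestamp, amount), the aggregates, and the choice to stamp computation_timestamp with the cutoff are assumptions made for illustration; the actual scripts/compute_features.py is not shown in this section.

```python
import pandas as pd


def compute_customer_features(
    transactions: pd.DataFrame, cutoff_date: str, lookback_days: int
) -> pd.DataFrame:
    """Aggregate per-customer features from rows inside (cutoff - lookback, cutoff]."""
    cutoff = pd.Timestamp(cutoff_date)
    window_start = cutoff - pd.Timedelta(days=lookback_days)
    in_window = transactions[
        (transactions["timestamp"] > window_start)
        & (transactions["timestamp"] <= cutoff)
    ]
    features = (
        in_window.groupby("customer_id")
        .agg(txn_count=("amount", "size"), total_amount=("amount", "sum"))
        .reset_index()
    )
    # Assumption: stamp rows with the cutoff so downstream checks can verify it.
    features["computation_timestamp"] = cutoff
    return features


transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2],
        "timestamp": pd.to_datetime(
            ["2025-06-01", "2025-12-01", "2026-01-05", "2025-12-20"]
        ),
        "amount": [10.0, 20.0, 99.0, 5.0],
    }
)
features = compute_customer_features(transactions, "2025-12-31", lookback_days=90)
```

With a cutoff of 2025-12-31, the 2026-01-05 transaction is excluded as future data and the 2025-06-01 transaction falls outside the 90-day lookback, so customer 1 ends up with a single 20.0 transaction.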