Versioning features alongside model versions
Why this matters
A model trained on v1.2 of your features behaves differently from the same model code trained on v1.3 features. Without versioning features alongside models, you lose reproducibility, struggle to debug models (was it the model or the features?), and can't reliably A/B test feature changes in production. Teams frequently retrain 'the same model' and are surprised when performance drops because of untracked feature changes.
Explanation
DVC pipelines connect data inputs → feature transformation code → feature outputs → model training in a single, reproducible DAG. Each stage's dependencies, parameters, and outputs are tracked by content hash. When you commit dvc.yaml and dvc.lock, you capture not just a pointer to the model weights but the exact feature engineering steps, their inputs, and their outputs. You can then check out any Git commit and either run dvc checkout (or dvc pull) to restore the recorded features and model, or run dvc repro to regenerate them. Regeneration matches the original only if every stage is deterministic (fixed seeds, pinned library versions). The key insight: dvc.lock records the hash of features.csv, so if the features change, the lock file changes, and the change is visible in Git diffs. MLflow can then record which Git commit (and therefore which dvc.lock) was used, linking each model to its exact feature lineage. Without this, you have models but no way to know which features produced them.
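To make the hashing idea concrete, here is a minimal sketch of content hashing with Python's standard library. It is not DVC's implementation; the file name and column values are hypothetical stand-ins, and DVC's own hashing may differ in detail. It only shows why a one-value change in a features file produces a different hash, which is what makes the change show up in a diff of dvc.lock.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo() -> tuple[str, str, str]:
    """Hash a toy features file before and after a one-value change."""
    with tempfile.TemporaryDirectory() as tmp:
        features = Path(tmp) / "features.csv"  # hypothetical stand-in file
        features.write_text("age,income\n0.5,1.2\n")
        original = file_md5(features)
        unchanged = file_md5(features)  # same content, same hash
        features.write_text("age,income\n0.5,1.3\n")
        modified = file_md5(features)  # one value changed, new hash
    return original, unchanged, modified
```

Identical content always maps to the same digest, so an unchanged stage leaves the lock file untouched, while any edit to the features yields a new digest.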
Configuration
# dvc.yaml - Define the feature engineering pipeline
# Output hashes are not written here; DVC records them in dvc.lock after dvc repro.
stages:
  fetch_raw_data:
    cmd: python src/fetch_data.py
    outs:
      - data/raw.csv
  engineer_features:
    cmd: python src/transform.py --input data/raw.csv --output data/features.csv
    deps:
      - data/raw.csv
      - src/transform.py
    params:
      - src/params.yaml:
          - feature_engineering
    outs:
      - data/features.csv
  train_model:
    cmd: python src/train.py --features data/features.csv --model models/model.pkl
    deps:
      - data/features.csv
      - src/train.py
    params:
      - src/params.yaml:
          - training
    outs:
      - models/model.pkl
      - models/model_metrics.json

# src/params.yaml
feature_engineering:
  normalize: true
  remove_outliers: true
  outlier_threshold: 3.0
  polynomial_features: false
training:
  model_type: xgboost
  test_split: 0.2
  random_state: 42
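
# src/transform.py (excerpt)
import argparse

import pandas as pd
import yaml

# Load pipeline parameters; DVC tracks the feature_engineering section of this file.
with open('src/params.yaml') as f:
    params = yaml.safe_load(f)

parser = argparse.ArgumentParser()
parser.add_argument('--input', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()

# Assumes every column in the input is numeric.
df = pd.read_csv(args.input)

if params['feature_engineering']['normalize']:
    # Z-score normalization: zero mean, unit standard deviation per column.
    df = (df - df.mean()) / df.std()

if params['feature_engineering']['remove_outliers']:
    # Keep only rows where every value lies within the threshold
    # (measured in standard deviations when normalize is enabled).
    threshold = params['feature_engineering']['outlier_threshold']
    df = df[(df.abs() <= threshold).all(axis=1)]

df.to_csv(args.output, index=False)
print(f"Features saved to {args.output}")

Why this order?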
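The pipeline executes in dependency order: fetch_raw_data has no dependencies so it runs first, engineer_features depends on raw.csv so it runs second, and train_model depends on features.csv so it runs last. DVC derives this order from the deps and outs declared in dvc.yaml by topologically sorting the DAG. dvc.lock then records the hashes of each stage's dependencies, parameters, and outputs, so DVC can skip stages whose inputs are unchanged and you can detect when a rerun produces different outputs.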
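The ordering rule can be sketched with Python's standard graphlib module. This is an illustration of topological sorting, not DVC's internal code: the stages dictionary is a hand-copied, simplified version of the dvc.yaml in this section, and execution_order is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Simplified copy of the pipeline: each stage lists the files it reads and writes.
# The stages are deliberately listed out of order.
stages = {
    "train_model": {
        "deps": ["data/features.csv", "src/train.py"],
        "outs": ["models/model.pkl", "models/model_metrics.json"],
    },
    "engineer_features": {
        "deps": ["data/raw.csv", "src/transform.py"],
        "outs": ["data/features.csv"],
    },
    "fetch_raw_data": {"deps": [], "outs": ["data/raw.csv"]},
}


def execution_order(stages: dict) -> list[str]:
    """Order stages so each runs after the stages that produce its inputs."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on another stage if it reads a file that stage writes.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Dependencies such as src/train.py that no stage produces are treated as plain input files and do not add edges to the graph.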
Wrong vs Right
# WRONG: Hardcoded features, no versioning
python src/train.py # Modifies data/features.csv in place
# Git tracks only the model, not features. No way to know what features were used.
# WRONG: Features in multiple scripts with manual dependencies
bash scripts/step1.sh # Generates features_v1
bash scripts/step2.sh # Generates features_v2
python train.py # Which features did this use? Unknown.
# No lineage. Features change silently.
# WRONG: Committing features to Git
git add data/features.csv
# Large files bloat the repo and slow every clone, and old feature versions stay in Git history unless you rewrite it.
# RIGHT: Define pipeline in dvc.yaml with explicit dependencies
dvc stage add -n engineer_features \
  -d data/raw.csv \
  -d src/transform.py \
  -p src/params.yaml:feature_engineering \
  -o data/features.csv \
  python src/transform.py --input data/raw.csv --output data/features.csv
# RIGHT: Run the full pipeline
dvc repro
# RIGHT: Commit the pipeline definition and lock file
git add dvc.yaml dvc.lock
git commit -m "Add feature normalization and outlier removal"
# RIGHT: Track dvc.lock changes to see exactly what changed
git diff HEAD~1 dvc.lock # Shows which stage outputs changed
Tool vitals
dvc stage add -n feature_engineering -d raw_data.csv -o features/train.csv python src/transform.py
dvc.yaml and dvc.lock
dvc dag && dvc status
Integration notes
When a run is launched from inside a Git repository, MLflow can record the commit SHA as the mlflow.source.git.commit tag. If you commit dvc.lock alongside your code, you can recover the exact model and features behind any MLflow run by checking out that commit and running dvc checkout (to restore cached artifacts) or dvc repro (to regenerate them). To make the link explicit, log the SHA yourself: mlflow.log_param('dvc_commit', os.popen('git rev-parse HEAD').read().strip()). When serving the model (via BentoML or vLLM), package the feature transformation code and params.yaml with the model artifact, and keep dvc.yaml with it as the lineage record, so the serving system applies the same feature logic at inference time.
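A slightly more defensive version of that one-liner might look like the sketch below. It assumes the training script runs from the repository root with git on the PATH. The dvc_lock_md5 tag and the helper names are hypothetical additions for illustration, not part of MLflow or DVC.

```python
import hashlib
import subprocess
from pathlib import Path


def current_git_commit() -> str:
    """Return the SHA of HEAD; raises CalledProcessError outside a Git repo."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def lineage_tags(commit: str, lock_text: str) -> dict[str, str]:
    """Build the tags that tie a run to a commit and a dvc.lock state."""
    return {
        "dvc_commit": commit,
        # Hypothetical extra tag: a fingerprint of dvc.lock itself, useful
        # for spotting runs made with uncommitted lock changes.
        "dvc_lock_md5": hashlib.md5(lock_text.encode("utf-8")).hexdigest(),
    }


def log_lineage(lock_path: str = "dvc.lock") -> None:
    """Attach lineage tags to the active MLflow run."""
    import mlflow  # imported lazily so the helpers above work without MLflow

    tags = lineage_tags(current_git_commit(), Path(lock_path).read_text())
    mlflow.set_tags(tags)
```

Calling log_lineage() inside the training script, within an active MLflow run, would attach both values to that run. Comparing dvc_lock_md5 across runs shows whether two models were trained from the same recorded pipeline state.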
Migration path
If you outgrow DVC (e.g., you need enterprise feature store capabilities), consider migrating to Feast or Tecton. Translate your DVC pipeline stages into Feast feature definitions by hand, using one Feast FeatureView per DVC stage as a rule of thumb. Your dvc.lock history remains the source of truth for historical feature versions. Keep DVC for raw data and validation; use Feast for feature serving. The migration can be gradual: both can coexist for months.
Cost model
DVC is open-source and free. Remote storage (S3, GCS, Azure) charges apply only if you store feature artifacts there. Managed add-ons such as DVC Studio offer collaboration features on paid plans (check current pricing), but the open-source CLI is sufficient for most teams. Hidden cost: without a shared remote or cache, every teammate and CI job has to recompute large feature datasets with dvc repro instead of fetching them with dvc pull.
Common gotcha
If a stage's output depends on a file that is not declared with -d (for example, a helper module imported by src/transform.py), DVC won't rerun engineer_features when that file changes. The pipeline will appear up to date even though the features are stale. Always list every file that affects a stage's output as a dependency. Similarly, if you manually edit data/features.csv instead of regenerating it via dvc repro, the file on disk no longer matches the hash in dvc.lock. dvc status will report the output as modified, but if nobody checks, a model can be trained on features that no commit describes, and the next dvc repro will typically overwrite the manual edit. Always regenerate via the pipeline.
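One way to catch undeclared code dependencies early is to scan a stage's script for imports of local modules and compare them against the stage's declared deps. The sketch below uses Python's ast module. The helper names, the cleaning module, and the local_modules mapping are hypothetical, and relative imports are ignored for brevity.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def undeclared_local_deps(
    source: str, local_modules: dict[str, str], declared_deps: set[str]
) -> list[str]:
    """List local module files the script imports but the stage does not declare.

    local_modules maps a module name to its file path inside the repository.
    """
    used = imported_modules(source) & set(local_modules)
    return sorted(
        local_modules[name] for name in used if local_modules[name] not in declared_deps
    )


# Hypothetical example: transform.py imports a helper that the stage never declares.
example_source = "import pandas as pd\nfrom cleaning import clip_outliers\n"
missing = undeclared_local_deps(
    example_source,
    local_modules={"cleaning": "src/cleaning.py"},
    declared_deps={"data/raw.csv", "src/transform.py"},
)
print(missing)
```

A check like this could run in CI next to dvc status. Third-party imports such as pandas are skipped because they are not in the local_modules mapping.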
Team adoption
On day one, set up a Git pre-commit hook that runs dvc status and blocks the commit when the workspace does not match dvc.lock. Plain dvc status exits with 0 even when stages have changed, so use the --quiet flag, which exits non-zero when anything is out of date. Add to .git/hooks/pre-commit (and make the file executable):
#!/bin/bash
dvc status --quiet || exit 1
Require all team members to run dvc repro before committing, not just python train.py. Document the rule: 'If you change any file in data/ or src/, run dvc repro and commit the updated dvc.lock. If you don't, your teammates will have stale features.' In code review, look for commits that touch src/transform.py but not dvc.lock: that's a red flag.
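The code review rule can also be automated. The sketch below assumes you already have the list of files changed in a commit (for example, from git diff --name-only). The function name and the watched prefixes are hypothetical choices that mirror the rule stated above.

```python
def needs_lock_update(
    changed_files: list[str],
    watched_prefixes: tuple[str, ...] = ("src/", "data/"),
    lock_file: str = "dvc.lock",
) -> bool:
    """Return True when pipeline inputs changed but dvc.lock did not."""
    touches_pipeline = any(path.startswith(watched_prefixes) for path in changed_files)
    return touches_pipeline and lock_file not in changed_files


# A commit that edits the transform script without updating the lock file.
print(needs_lock_update(["src/transform.py", "README.md"]))
# The same change committed together with the regenerated lock file.
print(needs_lock_update(["src/transform.py", "dvc.lock"]))
```

Because dvc.lock stores the hash of every declared dependency, any real change to a tracked script should change the lock file after dvc repro, so a True result is a strong hint that the pipeline was not rerun.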
Experienced dev note
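Most teams miss that dvc.lock, not dvc.yaml or the Git log on its own, is the source of truth for reproducibility: dvc.yaml says how to build the features, while dvc.lock pins exactly which inputs, parameters, and outputs a commit corresponds to. Commit dvc.lock every time; it is a small text file of hashes regardless of dataset size. Also consider using dvc plots to track feature distributions: dvc plots modify data/features.csv -x age -y income sets the plot axes (the file must be declared as a plots output, and the setting is saved in dvc.yaml, so commit that file), and dvc plots diff then compares the plot across Git revisions. This lets you see feature drift across commits in CI/CD without manual scripts.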
Check your understanding
You change the outlier_threshold parameter in params.yaml from 3.0 to 2.5 and commit both params.yaml and dvc.lock to Git. A teammate checks out your commit and runs dvc status. What will they see, and why?
Answer hint
Assuming they have also run dvc pull (or dvc checkout) so the outputs in their workspace match the lock file, they will see 'Data and pipelines are up to date.' because dvc.lock was committed with the new parameter value and output hashes. The key insight: DVC decides whether a stage needs rerunning by comparing the current dependencies, parameters, and outputs against what dvc.lock records, not by the timestamp of params.yaml. If you change params but don't regenerate and commit the updated dvc.lock, teammates who pull your code will see engineer_features reported as changed. This is why dvc repro && git add dvc.lock must always happen together.
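The comparison behind that answer can be modeled in a few lines. This is a deliberately simplified sketch: it only compares parameter values, whereas DVC also compares dependency and output hashes, and changed_params is a hypothetical helper name.

```python
def changed_params(locked: dict, current: dict) -> dict:
    """Return {name: (locked_value, current_value)} for every differing parameter."""
    keys = locked.keys() | current.keys()
    return {
        key: (locked.get(key), current.get(key))
        for key in sorted(keys)
        if locked.get(key) != current.get(key)
    }


# Case 1: params.yaml and dvc.lock were committed together after dvc repro.
in_sync = changed_params(
    locked={"normalize": True, "outlier_threshold": 2.5},
    current={"normalize": True, "outlier_threshold": 2.5},
)

# Case 2: params.yaml was edited but dvc.lock still records the old value.
out_of_sync = changed_params(
    locked={"normalize": True, "outlier_threshold": 3.0},
    current={"normalize": True, "outlier_threshold": 2.5},
)
print(in_sync, out_of_sync)
```

An empty result corresponds to the up-to-date case in the quiz. A non-empty result is the situation in which a teammate would see the stage reported as changed.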