Versioning features alongside model versions

What you will learn

Track feature engineering code and transformations alongside model versions using DVC pipelines so you can reproduce any model's exact feature set.

Why this matters

A model trained on v1.2 of your features behaves differently than the same model code trained on v1.3 features. Without versioning features alongside models, you lose reproducibility, struggle with model debugging (was it the model or the features?), and can't reliably A/B test feature changes in production. Teams frequently retrain 'the same model' and get shocked when performance drops due to untracked feature changes.

Skip if: you have a tiny feature set (< 5 transformations) that never changes and fits in a single Python file you track manually in Git; explicit DVC feature versioning would only add overhead. If you're using a feature store (Feast, Tecton) that handles versioning internally, you don't need DVC for that layer. If features are generated on the fly in your training script and never reused, basic Git tracking of the training code is sufficient.

Explanation

DVC pipelines connect data inputs → feature transformation code → feature outputs → model training in a single, reproducible DAG. Each stage declares its own dependencies, parameters, and outputs in dvc.yaml, and dvc.lock records a content hash for each of them. When you commit dvc.yaml and dvc.lock, you capture not just a pointer to the model weights but the exact feature engineering steps, their inputs, and their outputs. You can then check out any Git commit and either restore the recorded artifacts with dvc pull or dvc checkout, or rerun the pipeline with dvc repro; a rerun reproduces the same features and model only as far as the code itself is deterministic (fixed seeds, pinned library versions). The key insight: dvc.lock records the hash of features.csv, so if the features change, the lock file changes, and the change shows up in Git diffs. MLflow can then record which Git commit (and therefore which dvc.lock) a run used, linking each model to its feature lineage. Without this, you have models but no way to know which features produced them.
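To make the role of dvc.lock concrete, here is a minimal sketch that compares the output hashes recorded in two versions of the lock file. It assumes the lock file has already been parsed into a dict with the usual stages → outs → path/md5 layout; the helper names (output_hashes, changed_outputs, load_lock) are illustrative and not part of DVC.

```python
from typing import Dict, List, Tuple


def output_hashes(lock: dict) -> Dict[str, Dict[str, str]]:
    """Map each stage name to {output path: recorded md5}."""
    result = {}
    for name, stage in lock.get("stages", {}).items():
        result[name] = {
            out["path"]: out.get("md5", "")
            for out in stage.get("outs", [])
        }
    return result


def changed_outputs(old_lock: dict, new_lock: dict) -> List[Tuple[str, str]]:
    """List (stage, path) pairs whose recorded hash differs between two locks."""
    old, new = output_hashes(old_lock), output_hashes(new_lock)
    changed = []
    for stage, outs in new.items():
        for path, digest in outs.items():
            if old.get(stage, {}).get(path) != digest:
                changed.append((stage, path))
    return changed


def load_lock(path: str = "dvc.lock") -> dict:
    """Parse a lock file from disk (hypothetical helper, needs PyYAML)."""
    import yaml  # imported lazily so the helpers above work without PyYAML

    with open(path) as f:
        return yaml.safe_load(f)
```

Calling changed_outputs on the lock files from two commits gives the same information as reading git diff on dvc.lock by eye, in a form a CI job could act on.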

Configuration

yaml

# dvc.yaml - Define the feature engineering pipeline
# Content hashes are not written here; DVC records them in dvc.lock after dvc repro.
stages:
  fetch_raw_data:
    cmd: python src/fetch_data.py
    outs:
      - data/raw.csv

  engineer_features:
    cmd: python src/transform.py --input data/raw.csv --output data/features.csv
    deps:
      - data/raw.csv
      - src/transform.py
    outs:
      - data/features.csv
    params:
      - src/params.yaml:
          - feature_engineering
  
  train_model:
    cmd: python src/train.py --features data/features.csv --model models/model.pkl
    deps:
      - data/features.csv
      - src/train.py
    outs:
      - models/model.pkl
      - models/model_metrics.json
    params:
      - src/params.yaml:
          - training

# src/params.yaml
feature_engineering:
  normalize: true
  remove_outliers: true
  outlier_threshold: 3.0
  polynomial_features: false

training:
  model_type: xgboost
  test_split: 0.2
  random_state: 42

python

# src/transform.py (excerpt)
import pandas as pd
import yaml
import argparse

with open('src/params.yaml') as f:
    params = yaml.safe_load(f)

parser = argparse.ArgumentParser()
parser.add_argument('--input', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()

df = pd.read_csv(args.input)

if params['feature_engineering']['normalize']:
    df = (df - df.mean()) / df.std()

if params['feature_engineering']['remove_outliers']:
    threshold = params['feature_engineering']['outlier_threshold']
    df = df[(df.abs() <= threshold).all(axis=1)]

df.to_csv(args.output, index=False)
print(f"Features saved to {args.output}")

Why this order?

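The pipeline executes in dependency order: fetch_raw_data depends on no other stage's output, so it runs first; engineer_features depends on raw.csv, so it runs second; train_model depends on features.csv, so it runs last. DVC derives this order by topologically sorting the DAG, so the order in which stages appear in dvc.yaml does not matter. dvc.lock records the hashes of every dependency and output, which lets DVC skip stages whose inputs are unchanged; whether a rerun is bit-for-bit identical still depends on the stage code being deterministic.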
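The ordering logic can be sketched in a few lines with the standard library. This is an illustration of the idea, not DVC's implementation; it assumes each stage lists its deps and outs as plain path strings, and the function name execution_order is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(stages: dict) -> list:
    """Return stage names in an order that respects deps -> outs links."""
    # Map every output path to the stage that produces it.
    producer = {}
    for name, stage in stages.items():
        for out in stage.get("outs", []):
            producer[out] = name

    sorter = TopologicalSorter()
    for name, stage in stages.items():
        # A stage must run after every stage that produces one of its deps.
        upstream = {producer[d] for d in stage.get("deps", []) if d in producer}
        sorter.add(name, *upstream)
    return list(sorter.static_order())


# Stages listed in deliberately "wrong" order to show that file order is irrelevant.
STAGES = {
    "train_model": {
        "deps": ["data/features.csv", "src/train.py"],
        "outs": ["models/model.pkl", "models/model_metrics.json"],
    },
    "engineer_features": {
        "deps": ["data/raw.csv", "src/transform.py"],
        "outs": ["data/features.csv"],
    },
    "fetch_raw_data": {"deps": [], "outs": ["data/raw.csv"]},
}

print(execution_order(STAGES))
```

Dependencies such as src/train.py that no stage produces are treated as plain files: they affect whether a stage is stale but not where it sits in the order.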

Wrong vs Right

Wrong way

bash

# WRONG: Hardcoded features, no versioning
python src/train.py  # Modifies data/features.csv in place
# Git tracks only the model, not features. No way to know what features were used.

# WRONG: Features in multiple scripts with manual dependencies
bash scripts/step1.sh  # Generates features_v1
bash scripts/step2.sh  # Generates features_v2
python train.py  # Which features did this use? Unknown.
# No lineage. Features change silently.

# WRONG: Committing features to Git
git add data/features.csv
# Large files bloat the repo and every clone. Old feature versions can't be dropped without rewriting Git history.

Right way

bash

# RIGHT: Define pipeline in dvc.yaml with explicit dependencies
dvc stage add -n engineer_features \
  -d data/raw.csv \
  -d src/transform.py \
  -p src/params.yaml:feature_engineering \
  -o data/features.csv \
  python src/transform.py --input data/raw.csv --output data/features.csv

# RIGHT: Run the full pipeline
dvc repro

# RIGHT: Commit the pipeline definition and lock file
git add dvc.yaml dvc.lock
git commit -m "Add feature normalization and outlier removal"

# RIGHT: Track dvc.lock changes to see exactly what changed
git diff HEAD~1 dvc.lock  # Shows which stage outputs changed

Tool vitals

Primary command

bash

dvc stage add -n feature_engineering -d raw_data.csv -d src/transform.py -o features/train.csv python src/transform.py

Config file dvc.yaml and dvc.lock

Verify

bash

dvc dag && dvc status

Integration notes

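When you launch a run from inside a Git repository, MLflow records the current commit SHA as the mlflow.source.git.commit tag. If dvc.lock is committed alongside your code, you can recover the exact model and features behind any MLflow run by checking out that commit and running dvc pull (or dvc checkout if the artifacts are already in your local cache), with dvc repro as the fallback when the artifacts were never pushed. To make the link explicit, log it yourself: mlflow.log_param('dvc_commit', os.popen('git rev-parse HEAD').read().strip()). When serving the model (via BentoML or vLLM), package the same transformation code and parameter values with the model artifact, and keep the commit SHA with it, so the serving system applies the same feature logic at inference time. dvc.yaml on its own only describes the pipeline; it cannot regenerate features without the code and data it references.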
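A slightly more defensive version of that linking step might look like the sketch below. It logs the commit SHA plus a digest of dvc.lock, so a run started from a dirty working tree is still distinguishable. The helper names and the dvc_lock_sha256 parameter name are assumptions for illustration; only the mlflow calls shown in the trailing comment are MLflow API.

```python
import hashlib
import subprocess


def current_git_commit(repo_dir: str = ".") -> str:
    """Return the SHA of HEAD for the repository at repo_dir."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()


def lineage_params(commit_sha: str, lock_text: str) -> dict:
    """Build the parameters that tie a run to its pipeline state."""
    return {
        "dvc_commit": commit_sha,
        # Digest of the lock file contents; changes whenever any recorded
        # dependency, parameter, or output hash changes.
        "dvc_lock_sha256": hashlib.sha256(lock_text.encode("utf-8")).hexdigest(),
    }


# Usage inside a training script (requires mlflow and a Git checkout):
#
#   import mlflow
#   with open("dvc.lock") as f:
#       params = lineage_params(current_git_commit(), f.read())
#   with mlflow.start_run():
#       mlflow.log_params(params)
```

Using subprocess with an argument list instead of os.popen avoids going through a shell and raises an error if git fails, rather than silently logging an empty string.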

Migration path

If you outgrow DVC (e.g., you need enterprise feature store capabilities), migrate to Feast or Tecton. There is no automatic exporter, so translate the stages by hand; a reasonable starting point is one Feast FeatureView per feature-producing DVC stage. Your dvc.lock history remains the record of historical feature versions. Keep DVC for raw data and validation; use Feast for feature serving. The migration can be gradual: both tools can coexist for months.

Cost model

DVC is open-source and free. Remote storage (S3, GCS, Azure) charges apply only if you store feature artifacts there. The optional managed service (DVC Studio) adds collaboration features at roughly $50+/month/user (check current pricing), but the open-source CLI is sufficient for most teams. Hidden cost: large feature datasets can make dvc repro slow if you don't share cached outputs through a remote, because every machine then recomputes the stages itself.

Common gotcha

If you forget to declare src/transform.py as a dependency (-d src/transform.py), DVC won't rerun engineer_features when the code changes. The pipeline will appear 'up to date' even though the features are stale. Always list every file that affects a stage's output as a dependency. Similarly, if you manually edit data/features.csv instead of regenerating it via dvc repro, the file on disk no longer matches the hash in dvc.lock. dvc status will flag the output as modified, but any model you train before noticing uses features that no commit describes. Always regenerate via the pipeline.
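A lightweight lint for the first half of this gotcha could look like the following sketch. It assumes the stage definition has been loaded from dvc.yaml into a dict and that scripts appear in cmd as literal .py paths; undeclared_scripts is a hypothetical helper, not a DVC feature.

```python
import shlex


def undeclared_scripts(stage: dict) -> list:
    """Return .py files named in a stage's cmd that are missing from its deps."""
    deps = set(stage.get("deps", []))
    tokens = shlex.split(stage.get("cmd", ""))
    return [tok for tok in tokens if tok.endswith(".py") and tok not in deps]


# Example: the stage forgets to list its own script as a dependency.
stage = {
    "cmd": "python src/transform.py --input data/raw.csv --output data/features.csv",
    "deps": ["data/raw.csv"],
}
print(undeclared_scripts(stage))
```

The check only catches scripts named directly on the command line; modules imported by those scripts still have to be listed as dependencies by hand.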

Team adoption

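On day one, set a Git pre-commit hook that blocks commits when the pipeline is out of date. Put a two-line script in .git/hooks/pre-commit: #!/bin/bash on the first line, then dvc status -q || exit 1 on the second. Plain dvc status exits with 0 even when stages are stale, so the original one-liner without -q never blocks anything; with -q the command exits non-zero when something has changed. (dvc install also sets up DVC's own Git hooks, but its pre-commit hook only reports the status.) Require all team members to run dvc repro before committing, not just python train.py. Document the rule: 'If you change any file in data/ or src/, run dvc repro and commit the updated dvc.lock. If you don't, your teammates will have stale features.' In code review, look for commits that touch src/transform.py but not dvc.lock: that's a red flag.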
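The code-review red flag can be automated with a small check over the list of files a commit or pull request touches. This is a sketch under the assumption that pipeline inputs live under src/ and data/, as in this tutorial; lock_update_missing is a hypothetical name.

```python
from typing import Iterable, Tuple


def lock_update_missing(
    changed_files: Iterable[str],
    watched_prefixes: Tuple[str, ...] = ("src/", "data/"),
    lock_file: str = "dvc.lock",
) -> bool:
    """True if pipeline inputs changed but the lock file did not."""
    files = list(changed_files)
    touched = any(path.startswith(watched_prefixes) for path in files)
    return touched and lock_file not in files


# Example: transform code changed, dvc.lock untouched -> flagged.
print(lock_update_missing(["src/transform.py", "README.md"]))
```

In CI, the file list could come from git diff --name-only against the target branch. The check is deliberately coarse and will also flag changes to files under src/ that no stage depends on, so treat a hit as a prompt to look, not as proof of stale features.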

Experienced dev note

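Most teams miss that dvc.lock, not the commit message or the code diff, is the source of truth for reproducibility. Commit dvc.lock every time: it stores only hashes and sizes, so it stays small even for large datasets. You can also use dvc plots to compare feature distributions across commits: declare data/features.csv as a plots output of the stage, set its axes with dvc plots modify data/features.csv -x age -y income (age and income are example column names), commit the resulting change to dvc.yaml, and run dvc plots diff against an earlier revision. This lets you see feature drift across Git commits in CI/CD without manual scripts.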

Check your understanding

You change the outlier_threshold parameter in params.yaml from 3.0 to 2.5 and commit both params.yaml and dvc.lock to Git. A teammate checks out your commit and runs dvc status. What will they see, and why?

Answer hint

Assuming you ran dvc repro before committing and your teammate has fetched the outputs with dvc pull (or dvc checkout), they will see 'Data and pipelines are up to date.' because the parameter values and output hashes recorded in dvc.lock match what is in their workspace. The key insight: DVC decides whether a stage needs rerunning by comparing dvc.lock with the current dependencies, parameters, and outputs, not by looking at the timestamp of params.yaml. If you change params but don't regenerate and commit the updated dvc.lock, the engineer_features stage will show as changed when others pull your code. This is why dvc repro && git add dvc.lock must always happen together.
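The comparison behind that answer can be sketched as a plain dictionary check. This is a simplification of what DVC does, assuming the tracked parameter sections are stored in dvc.lock as nested values keyed by name; changed_params is a hypothetical helper.

```python
def changed_params(locked: dict, current: dict) -> list:
    """Return tracked parameter keys whose current value differs from the lock."""
    return [key for key, value in locked.items() if current.get(key) != value]


# Values recorded in dvc.lock at the last dvc repro (illustrative).
locked = {
    "feature_engineering": {
        "normalize": True,
        "remove_outliers": True,
        "outlier_threshold": 3.0,
        "polynomial_features": False,
    }
}

# params.yaml after editing the threshold without rerunning the pipeline.
edited = {
    "feature_engineering": {
        "normalize": True,
        "remove_outliers": True,
        "outlier_threshold": 2.5,
        "polynomial_features": False,
    },
    "training": {"model_type": "xgboost", "test_split": 0.2, "random_state": 42},
}

print(changed_params(locked, edited))
```

Only keys that the stage tracks are compared, which is why editing the training section would not mark engineer_features as changed.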
