Tool Beginner easy · 7 min integration

Feature Engineering with DVC Tracking

What you will learn

Track feature engineering scripts and outputs with DVC to version your feature pipelines and make experiments reproducible.

Why this matters

Feature engineering is often the highest-ROI ML work, but it's fragile: a small change to feature logic breaks reproducibility and makes it impossible to trace why model performance changed. Without tracking, you lose the lineage between raw data → engineered features → model performance. DVC solves this by versioning your feature code and outputs together, so you can always rebuild the exact same features.

Skip if: you're doing one-off exploratory analysis in a notebook for yourself; the DVC overhead isn't worth it. But the moment you share code with teammates or train a model for real use, add DVC tracking immediately: the cost of untracked features grows with every person and every experiment that depends on them.

Explanation

Feature engineering in MLOps means writing reproducible Python scripts that transform raw data into ML-ready features, then tracking those scripts and their outputs with DVC. DVC does not watch files continuously: each time you run dvc repro or dvc status, it hashes every stage's dependencies (scripts and input data) and compares them with the hashes recorded from the last run. When you change feature logic, that comparison tells DVC which stages are out of date and must re-run. This creates an auditable chain: raw data → feature script → engineered features → model training. Without DVC, you end up with scattered Python scripts, no way to know which features are in which model, and engineers asking "which version of features did you use?" DVC answers that question from its recorded hashes.

The core command is dvc stage add, which defines a stage: the command to run (your feature engineering script), its dependencies, and its outputs. It writes that definition to a dvc.yaml file that lives in your repo; the script itself only runs when you call dvc repro.
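
A feature script works well as a DVC stage when it takes its input and output paths as arguments and produces the same bytes for the same input. Here is a minimal sketch of that shape. The column names (price, quantity) and derived features are hypothetical, and it writes CSV with the standard library only; the pipeline in this guide writes Parquet, which would need a library such as pandas with pyarrow.

```python
import csv
import math


def add_features(row):
    """Return a copy of one raw record with two derived columns."""
    price = float(row["price"])        # hypothetical raw column
    quantity = float(row["quantity"])  # hypothetical raw column
    out = dict(row)
    out["revenue"] = price * quantity
    out["log_price"] = math.log1p(price)
    return out


def engineer(in_path, out_path):
    """Read raw rows, derive features, and write them in input order.

    No randomness and no timestamps: the same input file always gives
    the same output bytes, so the output hash only changes when the
    data or the logic changes.
    """
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src)
        rows = [add_features(r) for r in reader]
        fieldnames = list(reader.fieldnames or []) + ["revenue", "log_price"]
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

To use something like this as a stage command, you would call engineer(sys.argv[1], sys.argv[2]) under the usual `if __name__ == "__main__":` guard.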

Configuration

yaml

stages:
  prepare_features:
    cmd: python src/engineer_features.py data/raw/train.csv data/processed/features.parquet
    deps:
      - data/raw/train.csv
      - src/engineer_features.py
    outs:
      - data/processed/features.parquet
  train_model:
    cmd: python src/train.py data/processed/features.parquet models/model.pkl
    deps:
      - data/processed/features.parquet
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

Why this order?

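The order of keys inside a stage does not matter to DVC: dvc.yaml is ordinary YAML, so deps (dependencies) and outs (outputs) can appear in either order. Listing deps first is a readability convention: inputs, then outputs. The order of the stages in the file doesn't matter either. DVC works out the execution order from the dependency graph: train_model lists data/processed/features.parquet in its deps, prepare_features lists the same file in its outs, so prepare_features has to run first. Writing the stages in data-flow order just makes the file easier to read: you can't train on features you haven't engineered yet.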
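
To see how an execution order can fall out of deps and outs alone, here is an illustrative sketch (not DVC's actual implementation) that rebuilds the order for the two stages above. The stages are deliberately listed in the "wrong" order; the function name execution_order is made up for this example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Same stages as the dvc.yaml above, listed out of order on purpose.
stages = {
    "train_model": {
        "deps": ["data/processed/features.parquet", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "prepare_features": {
        "deps": ["data/raw/train.csv", "src/engineer_features.py"],
        "outs": ["data/processed/features.parquet"],
    },
}


def execution_order(stages):
    """Order stages so each runs after the stages that produce its deps."""
    # Map every output path to the stage that writes it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on another stage when one of its deps is that stage's out.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

The printed order puts prepare_features before train_model, regardless of how the stages are arranged in the file.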

Wrong vs Right

Wrong way

bash

python src/engineer_features.py
# features.parquet is created but not tracked
# You change engineer_features.py
# You re-run the script manually
# Now you have no idea if the new features.parquet matches the old code or new code
# Your teammate asks which version of features you used in the model: you can't answer

Right way

bash

dvc stage add -n prepare_features -d data/raw/train.csv -d src/engineer_features.py -o data/processed/features.parquet python src/engineer_features.py data/raw/train.csv data/processed/features.parquet
dvc repro
# dvc stage add only writes the stage definition to dvc.yaml; dvc repro actually runs it
# After the run, dvc.lock records: "this exact script + this exact input data = this exact output"
# If you change the script, run "dvc repro" again and DVC rebuilds the features
# You can inspect dvc.lock to see the hashes of the code and data that produced them

Tool vitals

Primary command

bash

dvc stage add -n <stage_name> -d <input_file> -o <output_file> python <script.py>

Config file dvc.yaml

Verify

bash

dvc dag

Integration notes

DVC tracks the feature outputs, but MLflow tracks the model and metrics that result from those features. In practice: DVC pipeline produces features.parquet → MLflow experiment logs the model trained on those features → the dvc.lock committed alongside that run shows exactly which feature code and data produced the features the model was trained on. They're complementary: DVC is the lineage of data/code, MLflow is the lineage of models/metrics.

Migration path

If you outgrow DVC (rare for features alone), migrate to Airflow or Prefect for orchestration: they can run your feature scripts on schedules across machines. But they don't replace DVC's versioning; you'd use DVC + Airflow together. For data-heavy pipelines, some teams move to Spark with DVC, or to dedicated feature stores like Tecton. Start with DVC; only migrate when you hit real bottlenecks.

Cost model

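DVC itself is free and open-source. Cost comes from remote storage: if you push features to S3, you pay AWS storage + bandwidth (~$0.023/GB/month for S3 Standard; check current pricing for your region). DVC's cache is content-addressed, so an unchanged file is stored once no matter how many commits reference it, but every changed version of a file is stored in full. Storage therefore grows with the number of distinct versions you keep, which makes it easy to estimate. Requests and data transfer out of S3 are billed separately.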
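
As a rough worked example of the storage arithmetic, the sketch below multiplies the total size of the stored versions by the per-GB price quoted above. The sizes are made-up numbers, the function name is hypothetical, and request and transfer charges are left out.

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023  # figure quoted above; check current pricing


def monthly_storage_cost(version_sizes_gb, price=S3_STANDARD_USD_PER_GB_MONTH):
    """Estimate monthly storage cost for a list of distinct file versions.

    Each distinct version counts in full, because a changed file is
    stored as a whole new object rather than as a diff.
    """
    return sum(version_sizes_gb) * price


# Hypothetical: 20 distinct versions of a 2 GB features file.
cost = monthly_storage_cost([2.0] * 20)
print(f"${cost:.2f} per month")
```

With these assumed numbers the estimate comes to about $0.92 per month for 40 GB of stored versions.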

Common gotcha

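DVC caches every file it tracks. The feature file (features.parquet) goes into the cache because it is a stage output, and if you also track the raw data file (train.csv) with dvc add, every version of it is cached too. When those files are huge (>1GB), the local cache consumes massive disk space. Solution: configure cloud storage with dvc remote add, upload with dvc push, and periodically clear unused local versions with dvc gc (for example dvc gc -w, which keeps only what the current workspace references). Note that dvc push alone copies files to the remote; it does not free local disk space. If you skip this, your .dvc folder will swell to 50GB and teammates will curse your name.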
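
To check whether the cache is actually the problem, you can measure it. A minimal sketch, assuming the default cache location .dvc/cache inside the repo (the location is configurable, so adjust the path if your setup differs):

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def format_gb(num_bytes):
    """Render a byte count as decimal gigabytes."""
    return f"{num_bytes / 1e9:.2f} GB"


# Assumed default cache path; a missing directory simply reports 0.00 GB.
print(format_gb(dir_size_bytes(".dvc/cache")))
```

Running this before and after a cleanup gives a quick view of how much space the old versions were taking.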

Team adoption

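Day 1: Have one engineer create the dvc.yaml with the feature pipeline, run dvc repro, commit dvc.yaml and dvc.lock, and run dvc push to upload the outputs to remote storage. Push to main. Day 2: All engineers clone the repo and run dvc pull: DVC downloads the cached features from remote storage so they don't rebuild locally. A dvc repro afterwards should report that everything is up to date. Day 3: When someone changes a feature, dvc repro is the standard workflow, not manual Python script runs. Put this in your team wiki: "Never run feature scripts directly. Always use dvc repro." Enforce in code review: if a PR changes src/engineer_features.py but doesn't update dvc.lock, ask why. A changed script with an unchanged dvc.lock usually means the pipeline was not re-run.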
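
The review rule can be automated as a small check over the list of files a PR changes (for example, the output of git diff --name-only). This is a sketch, not an existing DVC feature: it relies on the fact that re-running dvc repro after a code change rewrites dvc.lock, so a changed pipeline file without a changed dvc.lock suggests the pipeline was not re-run. The function name and the set of watched files are assumptions for this guide's pipeline.

```python
# Files whose changes should be accompanied by an updated lock file
# (assumed from the dvc.yaml in this guide).
PIPELINE_INPUTS = {"src/engineer_features.py", "src/train.py", "dvc.yaml"}


def missing_lock_update(changed_files, pipeline_inputs=PIPELINE_INPUTS,
                        lock_file="dvc.lock"):
    """Return the pipeline files changed without a matching lock update.

    An empty list means the PR passes the check.
    """
    changed = set(changed_files)
    touched = sorted(changed & set(pipeline_inputs))
    if touched and lock_file not in changed:
        return touched
    return []


# Example: a PR that edits the feature script but not dvc.lock.
problems = missing_lock_update(["src/engineer_features.py", "README.md"])
print(problems)
```

A CI job could fail the build when the returned list is non-empty and print it as the reason.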

Experienced dev note

Most engineers miss the dvc.lock file: it's boring and looks like a lock file you should ignore. Don't. It's gold: for each stage's last run it records the exact command plus the hash and size of every input and every output. It does not store timestamps; the git history of dvc.lock tells you when things changed. When someone asks "what data was used to train model X?", you check out the commit that produced the model, read dvc.lock, find the feature stage hash, and you know exactly which raw data it came from. Commit dvc.lock to git. This single file is what makes your pipeline auditable.
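
You can check a file on disk against the hash stored in dvc.lock by hashing it yourself. The sketch below assumes the recorded value is the plain MD5 of the file's bytes, which is believed to be how recent DVC versions (3.x) hash single files; older versions handled text files differently and directories get a combined hash, so verify this against your DVC version before relying on it.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to keep memory flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing file_md5("data/raw/train.csv") with the md5 listed for that path in dvc.lock tells you whether the file you have is the one the recorded run used.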

Check your understanding

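You change your feature engineering script to add a new feature (e.g., add polynomial features). You run dvc repro. Which stages does DVC re-run, and how does it decide?

Show answer hint

DVC hashes the feature script file (src/engineer_features.py) and sees that the hash no longer matches the one recorded in dvc.lock. The prepare_features stage depends on this script (it's in deps), so DVC marks it as out-of-date and re-runs it. train_model lists features.parquet in its deps, so once the rebuilt file's hash differs from the recorded one, DVC re-runs train_model in the same dvc repro. It skips train_model only if the regenerated features.parquet is byte-for-byte identical to the previous version. If you want to review the new features before retraining, run dvc repro prepare_features, which limits the run to that stage and anything upstream of it.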
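
The staleness rule can be modelled in a few lines. This is a toy model, not DVC's code: a stage re-runs when the current hash of any of its deps differs from the recorded one, and the outputs of a re-run stage count as changed for the stages downstream. The hash values are placeholders and plan_repro is a made-up name.

```python
STAGES = {
    "prepare_features": {
        "deps": ["data/raw/train.csv", "src/engineer_features.py"],
        "outs": ["data/processed/features.parquet"],
    },
    "train_model": {
        "deps": ["data/processed/features.parquet", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
}
ORDER = ["prepare_features", "train_model"]  # upstream first


def plan_repro(locked, current, stages=STAGES, order=ORDER):
    """Return the stages that would re-run, in execution order.

    locked and current map path -> content hash (recorded vs. on disk).
    """
    changed = {path for path, h in current.items() if locked.get(path) != h}
    to_run = []
    for name in order:
        if changed & set(stages[name]["deps"]):
            to_run.append(name)
            # Assumption: a re-run writes outputs with new content. If the
            # output were byte-identical, downstream stages could be skipped.
            changed |= set(stages[name]["outs"])
    return to_run


locked = {
    "data/raw/train.csv": "a1",
    "src/engineer_features.py": "b1",
    "src/train.py": "c1",
    "data/processed/features.parquet": "d1",
}
edited_features = dict(locked, **{"src/engineer_features.py": "b2"})
edited_training = dict(locked, **{"src/train.py": "c2"})

print(plan_repro(locked, edited_features))
print(plan_repro(locked, edited_training))
```

Editing the feature script schedules both stages, while editing only src/train.py schedules train_model alone.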
