Tool Beginner easy · 8 min cli_command

DVC for data version control

What you will learn
DVC tracks large datasets and model files in Git-compatible version control without bloating your repository.

Why this matters

Machine learning projects have data files too large for Git. Without DVC, teams either lose data history, commit gigabytes to Git (killing repo performance), or use ad-hoc naming conventions like `dataset_v2_final_REAL.csv`. DVC solves this by storing data separately while keeping pointers in Git.

Skip if: Use plain Git if your entire dataset is under 50MB total. Use a data warehouse (Snowflake, BigQuery) if your data lives in cloud storage and multiple teams query it. Use a simple shared S3 bucket if you're a solo researcher with no collaboration needs.

Explanation

DVC (Data Version Control) is a Git-like system for data and models. When you run `dvc add data.csv`, DVC computes a hash of the file, stores the data in its local cache, and writes a small `data.csv.dvc` file that you commit to Git. This `.dvc` file acts as a pointer: it records the hash, so it's tiny and safe to commit. Running `dvc push` uploads the cached data to remote storage; later, teammates run `dvc pull` to fetch it. Because each Git commit pins the exact data hash, you can tell which version of the data trained which model, which prevents the "which dataset made this result?" problem. DVC fits Git's workflow: after you switch branches, `dvc checkout` updates the data files to match the `.dvc` files on that branch, so code and data stay in sync. DVC also supports pipelines (`dvc.yaml`), so you can define how raw data becomes training data and tie each data version to the transformation that produced it.
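To make the pointer idea concrete, here is a toy sketch in plain shell of a content-addressed cache plus a small pointer file. It is not DVC's real on-disk format: the `cache/` directory, the `data.csv.pointer` name, and the pointer's two fields are all made up for illustration, and it assumes GNU `md5sum` is available.

```shell
#!/bin/sh
# Toy model of the pointer idea: a content-addressed cache plus a small pointer file.
# NOT DVC's actual format; file names and layout are hypothetical.
set -eu

demo=$(mktemp -d)
cd "$demo"
mkdir -p cache
printf 'id,label\n1,cat\n2,dog\n' > data.csv

# 1. Hash the file contents (md5sum is GNU coreutils; macOS ships `md5` instead).
hash=$(md5sum data.csv | cut -d ' ' -f 1)

# 2. Store the data under its hash and write a tiny pointer that records the hash.
cp data.csv "cache/$hash"
printf 'md5: %s\npath: data.csv\n' "$hash" > data.csv.pointer

# 3. Simulate a fresh checkout: the data is gone, only the pointer remains.
rm data.csv

# 4. Restore the data by looking up the hash recorded in the pointer.
wanted=$(sed -n 's/^md5: //p' data.csv.pointer)
cp "cache/$wanted" data.csv

restored_hash=$(md5sum data.csv | cut -d ' ' -f 1)
echo "pointer hash:  $wanted"
echo "restored hash: $restored_hash"
```

The two printed hashes match, which is the property that lets a small text file in Git stand in for a large file stored elsewhere.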

Configuration

bash
# Initialize DVC in your Git repo
# Run once per repository
git init
dvc init

# Configure remote storage (S3 example)
dvc remote add -d myremote s3://my-bucket/dvc-storage
dvc remote modify myremote profile myprofile

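# Track a dataset file
dvc add data/raw/dataset.csv

# This creates data/raw/dataset.csv.dvc (commit this to Git)
# and lists dataset.csv in data/raw/.gitignore (don't commit the actual file)

# Commit the pointer and the DVC config so teammates receive them
git add .dvc .dvcignore data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Track dataset with DVC"

# Push data to remote storage
dvc push

# Later: a colleague clones the repository
git clone <repo>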
cd <repo>

# Fetch the data locally
dvc pull

# Verify data is present
ls -lh data/raw/dataset.csv

Why this order?

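`git init` comes first because DVC expects to run inside a Git repository by default. `dvc init` creates the `.dvc/` config directory. The remote must be configured before `dvc push` or `dvc pull` so DVC knows where to send and fetch data. `dvc add` itself only needs the local cache, so it works without a remote, but configuring the remote first means `dvc push` works as soon as the data is tracked.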

Wrong vs Right

Wrong way
bash
# WRONG: Committing large files directly to Git
git add data/raw/dataset.csv
git commit -m "add dataset"
git push

# Result: repo is 2GB, cloning takes 10 minutes, every teammate downloads the whole history even if they only need the latest version.

Right way
bash
# RIGHT: Use DVC to decouple data from Git
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "add dataset pointer"
git push
dvc push

# Result: the Git repo stays small (each pointer is a few hundred bytes), clones are fast, and data is fetched only when needed with dvc pull.
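
If a repository may already contain data committed the wrong way, a quick audit helps before adopting DVC. The sketch below lists the largest blobs anywhere in Git history using standard Git plumbing; the `largest_blobs` helper name and the throwaway demo repository are assumptions for illustration, and the helper assumes file names without spaces.

```shell
#!/bin/sh
# List the largest blobs in a repository's full history (size in bytes, then path).
largest_blobs() {
  # $1 = how many entries to show (default 5)
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $2, $3 }' |
    sort -rn |
    head -n "${1:-5}"
}

# Demo in a throwaway repository with one small and one larger file.
demo=$(mktemp -d) && cd "$demo" && git init -q .
printf 'small\n' > notes.txt
head -c 2048 /dev/zero > big.bin
git add notes.txt big.bin
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add files"
largest_blobs 3
```

Run `largest_blobs` from inside your own repository; anything near the top that looks like a dataset or model file is a candidate for `dvc add`.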

Tool vitals

Primary command
bash
dvc add <file>
Config file .dvc/config
Verify
bash
dvc status

Integration notes

DVC pipelines (`dvc.yaml`) pair well with MLflow: define data processing in DVC, log the resulting model in MLflow, and pin both versions in Git. This creates an audit trail from raw data → processed data → trained model. With Docker, run `dvc pull` during the image build or at container start to reproduce the exact same dataset inside containers.

Migration path

If you outgrow DVC (petabyte-scale data, 100+ team members), move the bulk data to a cloud data warehouse (BigQuery, Snowflake) and keep DVC for file-based artifacts such as models and exported snapshots. A warehouse can't act as a DVC remote: remotes are file or object stores (S3, GCS, and similar), not database connection strings. Existing `.dvc` files remain valid as long as their remote stays reachable.

Cost model

DVC itself is free (open source). Storage is pay-as-you-go: S3 standard storage costs ~$0.023/GB/month, and AWS data transfer out costs ~$0.09/GB. A 10GB dataset costs ~$0.23/month to store and ~$0.90 to download once. No surprise billing; you control the bucket.
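
The arithmetic can be sketched as a small helper that multiplies a dataset size by the per-GB rates quoted above. The `estimate_cost` name is hypothetical, and the rates are approximations that vary by region and change over time, so treat them as placeholders.

```shell
#!/bin/sh
# Back-of-the-envelope S3 cost for a dataset.
# Rates are the approximate figures from the text, not authoritative pricing.
estimate_cost() {
  # $1 = dataset size in GB
  awk -v gb="$1" -v storage_rate=0.023 -v egress_rate=0.09 'BEGIN {
    printf "storage per month: $%.2f\n", gb * storage_rate
    printf "one full download: $%.2f\n", gb * egress_rate
  }'
}

estimate_cost 10
```

For 10 GB this prints $0.23 for monthly storage and $0.90 for one full download. Multiply the download figure by the number of pulls per month if CI or teammates fetch the data repeatedly.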

Common gotcha

If you run `dvc add` but forget `dvc push`, teammates get the `.dvc` file from Git, but the actual data doesn't exist on the remote. Their `dvc pull` then fails with an error about files missing from the cache and the remote, which is easy to misread as a local setup problem. Always test the full flow: `dvc add` → `git add` → `git commit` → `dvc push` → (new terminal/user) → `git clone` → `dvc pull`.
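
The full flow above can be rehearsed end to end without touching S3. The sketch below is a smoke test under stated assumptions: a local directory stands in for the remote, the remote name `localremote` and all paths are made up, and the test reports `skipped` when `dvc` or `git` is not installed.

```shell
#!/bin/sh
# Round-trip smoke test: publish a file through DVC, then restore it in a fresh clone.
# Assumptions: a local directory replaces the S3 remote; names and paths are hypothetical.

dvc_roundtrip() (
  work=$(mktemp -d) || exit 1
  trap 'rm -rf "$work"' EXIT
  mkdir -p "$work/remote" "$work/author/data/raw" || exit 1
  cd "$work/author" || exit 1
  printf 'id,label\n1,cat\n2,dog\n' > data/raw/dataset.csv

  # Author: dvc add -> git add -> git commit -> dvc push
  git init -q . &&
    dvc init -q &&
    dvc remote add -q -d localremote "$work/remote" &&
    dvc add -q data/raw/dataset.csv &&
    git add .dvc .dvcignore data/raw/dataset.csv.dvc data/raw/.gitignore &&
    git -c user.name=demo -c user.email=demo@example.com commit -q -m "Track dataset" &&
    dvc push -q || exit 1

  # Teammate: git clone -> dvc pull, then compare with the author's copy
  git clone -q "$work/author" "$work/teammate" &&
    cd "$work/teammate" &&
    dvc pull -q &&
    cmp -s data/raw/dataset.csv "$work/author/data/raw/dataset.csv"
)

if command -v dvc >/dev/null 2>&1 && command -v git >/dev/null 2>&1; then
  if dvc_roundtrip; then roundtrip_status=ok; else roundtrip_status=failed; fi
else
  roundtrip_status=skipped
fi
echo "round trip: $roundtrip_status"
```

Removing the `dvc push -q` line reproduces the gotcha: the clone still receives the `.dvc` pointer, but `dvc pull` has nothing to fetch and the test reports `failed`.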

Team adoption

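On day one: (1) Initialize DVC and pick a remote storage (S3 bucket or GCS) before anyone adds data. (2) Add a `dvc pull` step to the team's onboarding docs. (3) Set up a pre-commit hook to prevent accidental `git add` of large files, for example the `check-added-large-files` hook from the pre-commit framework's `pre-commit-hooks` repository. DVC's own Git hooks (`dvc install`) complement this by running `dvc checkout` after `git checkout` and `dvc push` before `git push`. (4) Document the rule: "Never commit data files directly; always use DVC." Enforce in code review.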
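
For teams that prefer not to add the pre-commit framework, step (3) can be approximated with a hand-rolled Git hook. This is a minimal sketch, not a replacement for a maintained hook: the `check_staged_sizes` name and the 10 MB default for `MAX_BYTES` are assumptions, and it assumes file names without newlines. The demo at the end runs it in a throwaway repository with a tiny threshold.

```shell
#!/bin/sh
# Sketch of a pre-commit check: refuse commits that stage files above a size limit.
# MAX_BYTES is a hypothetical threshold (10 MB by default).
MAX_BYTES=${MAX_BYTES:-10485760}

check_staged_sizes() {
  too_big=0
  # Paths staged as added or modified, relative to the repository root.
  staged=$(git diff --cached --name-only --diff-filter=AM) || return 2
  # Read from a here-document (not a pipe) so too_big survives the loop.
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    # Size of the staged blob, which is what the commit would actually contain.
    size=$(git cat-file -s ":$path") || continue
    if [ "$size" -gt "$MAX_BYTES" ]; then
      echo "blocked: $path is $size bytes; track it with 'dvc add' instead" >&2
      too_big=1
    fi
  done <<EOF
$staged
EOF
  return "$too_big"
}

# Demo in a throwaway repository with a 1 KB limit.
demo=$(mktemp -d) && cd "$demo" && git init -q .
printf 'small\n' > notes.txt
head -c 2048 /dev/zero > big.bin
git add notes.txt big.bin
MAX_BYTES=1024
if check_staged_sizes; then hook_result=allowed; else hook_result=blocked; fi
echo "commit would be $hook_result"
```

To use it for real, save the function in `.git/hooks/pre-commit`, make the file executable, and end the script with `check_staged_sizes || exit 1` in place of the demo lines.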

Experienced dev note

Run `dvc config core.autostage true` once per repository. This makes `dvc add` automatically stage the `.dvc` file in Git, saving the extra `git add` step. Also use `dvc config cache.type copy` in CI/CD to avoid file-link issues (such as symlinks) on some runners.

Check your understanding

Why does committing the `.dvc` file to Git but not running dvc push cause teammates' dvc pull to fail? What's actually missing?

Answer hint

The `.dvc` file is just a pointer with a hash. The actual data lives in your local cache until you push it. If you don't push, the remote has no object for that hash, so DVC has nothing to download.
