How to use DVC for data version control

Quick answer

Use DVC to track and version datasets and machine learning models: initialize DVC inside a Git repository with dvc init, track data files with dvc add, and upload them to remote storage with dvc push. The large files live outside Git, while the small .dvc metadata files that point to them are versioned in Git. This keeps experiments reproducible and makes collaboration practical.

PREREQUISITES

Python 3.8+
Git installed and configured
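pip install dvc, plus the extra for your storage backend (e.g., pip install "dvc[s3]" for AWS S3 or "dvc[gdrive]" for Google Drive)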
Access to remote storage (e.g., AWS S3, Google Drive, or local filesystem)

Setup

Install DVC with pip, then initialize it inside a Git repository so it can start tracking data files. The s3 extra is included here because the steps that follow use an S3 remote.

bash

pip install "dvc[s3]"

git init my-ml-project
cd my-ml-project
dvc init

output

Initialized empty Git repository in /path/to/my-ml-project/.git/
Initialized DVC repository.
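
As a quick sanity check after initialization, the sketch below reports whether the files that git init and dvc init normally create are present. The function name check_dvc_setup is illustrative and not part of DVC, and the list of paths assumes a default dvc init run at the root of the Git repository.

```shell
# Illustrative helper (not a DVC command): report which of the files
# normally created by `git init` and `dvc init` exist in a directory.
check_dvc_setup() {
    dir="${1:-.}"
    for path in .git .dvc .dvc/config .dvcignore; do
        if [ -e "$dir/$path" ]; then
            echo "found   $path"
        else
            echo "missing $path"
        fi
    done
}

# Check the current directory.
check_dvc_setup .
```

Any "missing" line for .dvc or .dvc/config suggests that dvc init did not run in this directory.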

Step by step

Track a dataset file with DVC, commit changes to Git, and push data to remote storage for version control.

bash

echo 'sample data' > data.csv

dvc add data.csv

git add data.csv.dvc .gitignore

git commit -m "Add data.csv with DVC tracking"

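dvc remote add -d myremote s3://mybucket/dvcstore

# The remote is stored in .dvc/config; commit it so collaborators use the same remote
git add .dvc/config

git commit -m "Configure DVC remote"

dvc push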

output (illustrative; exact messages vary by DVC and Git version)

Adding data.csv to DVC tracking.
Data file 'data.csv' is now tracked by DVC.
[master (root-commit) abc1234] Add data.csv with DVC tracking
 2 files changed, 20 insertions(+)
Remote 'myremote' has been added.
Uploading data.csv to remote storage...
Upload complete.
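
To try the same workflow without cloud credentials, a directory on the local filesystem can serve as the remote. The sketch below runs the full round trip in a scratch directory: track a file, push it, delete the local copy and cache, then restore it with dvc pull. The remote name localremote, the scratch paths, and the demo Git identity are assumptions made for illustration, and the commands are skipped if dvc or git is not installed.

```shell
# Round trip against a local-directory remote (all paths are scratch locations).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/dvcstore" "$demo_root/project"

if command -v dvc >/dev/null 2>&1 && command -v git >/dev/null 2>&1; then
    (
        set -e
        cd "$demo_root/project"
        git init -q
        dvc init -q

        echo 'sample data' > data.csv
        dvc add -q data.csv

        # A plain directory path works as a remote; -d makes it the default.
        dvc remote add -d localremote "$demo_root/dvcstore"

        git add data.csv.dvc .gitignore .dvc/config
        git -c user.name="Demo" -c user.email="demo@example.com" \
            commit -q -m "Track data.csv with a local DVC remote"

        dvc push -q

        # Simulate a fresh clone: remove the data file and the local cache,
        # then restore the file from the remote.
        rm -rf data.csv .dvc/cache
        dvc pull -q
        cat data.csv
    )
else
    echo "dvc or git not found; skipping the round trip" >&2
fi
```

On a real second machine the equivalent steps are a git clone of the project followed by dvc pull in the cloned directory.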

Common variations

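You can use other storage backends, such as Google Drive or Azure Blob Storage, by pointing dvc remote add at a different URL; each backend needs its matching pip extra (for example dvc[gdrive]). To re-run a pipeline defined in dvc.yaml, use dvc repro, which re-executes only the stages whose dependencies have changed. dvc push also accepts --jobs to set the number of parallel transfers.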

bash

dvc remote add -d gdrive_remote gdrive://folder_id

dvc push --jobs 4

dvc repro

output (illustrative; exact messages vary by DVC version)

Remote 'gdrive_remote' has been added.
Uploading data files in parallel...
Pipeline reproduced successfully.
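
dvc repro only has something to run once a pipeline is defined in a dvc.yaml file. The sketch below writes a minimal single-stage pipeline and runs it when DVC and the input file are available. The stage name prepare, the output file prepared.csv, and the tr command are placeholders chosen for illustration, not part of the workflow above.

```shell
# Write a minimal one-stage pipeline (stage and file names are placeholders).
cat > dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: tr 'a-z' 'A-Z' < data.csv > prepared.csv
    deps:
      - data.csv
    outs:
      - prepared.csv
EOF

# Run the pipeline only inside an initialized DVC project that has the input file.
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ] && [ -f data.csv ]; then
    dvc repro
else
    echo "dvc.yaml written; run 'dvc repro' inside a DVC project to execute it"
fi
```

After a successful run, DVC records the stage's state in dvc.lock; committing dvc.yaml and dvc.lock to Git lets others reproduce the same result.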

Troubleshooting

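If dvc push fails, check your remote storage credentials and network connection, and confirm that the extra for your backend is installed (for example dvc[s3]). dvc doctor prints the DVC version, platform, and supported remote types, which helps spot a missing backend. If data files are missing after a checkout or clone, run dvc pull. If a file is not being tracked at all, check that it is not excluded by .dvcignore and that its .dvc file is committed (git status).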

bash

dvc doctor

git status
cat .dvcignore

output (illustrative and abbreviated; the real dvc doctor report has more fields)

DVC version: 2.x.x
Git version: 2.x.x
Remote storage: configured
On branch master
nothing to commit, working tree clean
# Contents of .dvcignore
# Ignore temporary files
*.tmp
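
When data appears to be missing, comparing the local cache with the remote is often more direct than inspecting Git. A minimal sketch, assuming a default remote is already configured; the wrapper name dvc_sync_report is hypothetical.

```shell
# Hypothetical wrapper: show configured remotes and cache/remote differences.
dvc_sync_report() {
    if ! command -v dvc >/dev/null 2>&1; then
        echo "dvc not found on PATH" >&2
        return 1
    fi
    dvc remote list       # names and URLs of configured remotes
    dvc status            # workspace changes relative to the .dvc files
    dvc status --cloud    # local cache compared with the default remote
}
```

Call dvc_sync_report from the project root. Files that exist in the local cache but not on the remote call for dvc push, and the reverse calls for dvc pull.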

Key Takeaways

Use dvc init to enable data version control in your Git project.
Track large data files with dvc add and commit the generated .dvc files to Git.
Configure remote storage with dvc remote add to share and back up data, and commit .dvc/config so collaborators use the same remote.
Use dvc push and dvc pull to sync data with remote storage.
Run dvc repro to re-execute a pipeline defined in dvc.yaml and keep results reproducible.

Verified 2026-04
