How to use DVC for data version control

Quick answer
Use DVC to track and version datasets and machine learning models: initialize DVC inside a Git repository, add data files with dvc add, and upload the data to remote storage with dvc push. DVC keeps the large files outside Git and commits only small metadata files to Git, which makes experiments reproducible and lets collaborators fetch the exact data that belongs to any commit.

Prerequisites

  • Python 3.8+
  • Git installed and configured
  • pip install dvc (add the extra for your storage backend, e.g. pip install "dvc[s3]" for Amazon S3)
  • Access to remote storage (e.g., AWS S3, Google Drive, or local filesystem)
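
Before continuing, it can help to confirm that the command-line tools listed above are actually on your PATH. The following is a minimal sketch, not part of DVC: the helper name check_tool is hypothetical, and the script only checks that each command exists, not which version is installed.

```bash
# Report whether each prerequisite command is available on PATH.
# check_tool is a hypothetical helper, not a DVC or Git command.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Count how many of the required tools could not be found.
missing=0
for tool in python3 git dvc; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```

If dvc is reported as missing right after pip install, the directory pip installs scripts into is probably not on your PATH.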

Setup

Install dvc via pip and initialize it in your Git project to start tracking data files.

bash
# The [s3] extra installs the S3 backend used later; pick the extra that matches your remote
pip install "dvc[s3]"

git init my-ml-project
cd my-ml-project
dvc init
output
Initialized empty Git repository in /path/to/my-ml-project/.git/
Initialized DVC repository.

Step by step

Track a dataset file with DVC, commit changes to Git, and push data to remote storage for version control.

bash
echo 'sample data' > data.csv

dvc add data.csv

git add data.csv.dvc .gitignore

git commit -m "Add data.csv with DVC tracking"

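# mybucket is a placeholder bucket name. The remote is saved in .dvc/config;
# commit that file afterwards so collaborators get the same remote.
dvc remote add -d myremote s3://mybucket/dvcstore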

dvc push
output
To track the changes with git, run:

	git add data.csv.dvc .gitignore
[master (root-commit) abc1234] Add data.csv with DVC tracking
 2 files changed, 20 insertions(+)
Setting 'myremote' as a default remote.
1 file pushed
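
To make concrete what Git ends up versioning: dvc add writes a small pointer file, data.csv.dvc, that records a content hash of the data, and the data itself is stored in DVC's cache under a path derived from that hash. The sketch below recomputes such a hash by hand. It assumes DVC's default MD5 hashing, GNU md5sum on PATH (macOS ships md5 instead), and the cache layout used by DVC 3.x; the variable names are illustrative only.

```bash
# Create the sample file from the steps above if it does not exist yet.
[ -f data.csv ] || printf 'sample data\n' > data.csv

# Hash the file contents; md5sum prints "<hash>  <filename>".
hash=$(md5sum data.csv | cut -d ' ' -f 1)
echo "md5: $hash"

# Assumed DVC 3.x cache layout: the first two hex characters name a directory.
prefix=$(printf '%s' "$hash" | cut -c 1-2)
rest=$(printf '%s' "$hash" | cut -c 3-)
echo "expected cache path: .dvc/cache/files/md5/$prefix/$rest"

# If dvc add has already run, the pointer file should record the same hash.
if [ -f data.csv.dvc ]; then
  grep "md5: $hash" data.csv.dvc || echo "hash differs from data.csv.dvc"
fi
```

Because only the pointer file is committed, checking out an older commit and running dvc checkout (or dvc pull, if the data is not in the local cache) restores the matching version of data.csv.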

Common variations

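You can switch to another storage backend, such as Google Drive or Azure Blob Storage, by changing the URL passed to dvc remote add; each backend needs its matching extra (for example, pip install "dvc[gdrive]"). dvc push accepts --jobs (-j) to set the number of parallel transfers. To re-run a pipeline defined in dvc.yaml, use dvc repro, which executes only the stages whose dependencies have changed.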

bash
dvc remote add -d gdrive_remote gdrive://folder_id

dvc push --jobs 4

# Requires a pipeline definition (dvc.yaml) in the project
dvc repro
output
Setting 'gdrive_remote' as a default remote.
1 file pushed
Data and pipelines are up to date.
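
dvc repro only has something to run once the project defines a pipeline in dvc.yaml. Here is a minimal sketch of such a file, written from the shell to match the other examples. The stage name count_lines, its command, and the output file lines.txt are hypothetical and chosen only because they work with the data.csv file created earlier.

```bash
# Write a one-stage pipeline definition (hypothetical stage, for illustration).
cat > dvc.yaml <<'EOF'
stages:
  count_lines:
    cmd: wc -l < data.csv > lines.txt
    deps:
      - data.csv
    outs:
      - lines.txt
EOF

# Show the file that dvc repro will read.
cat dvc.yaml
```

With this file in place, dvc repro runs the stage and records the result in dvc.lock; later runs skip the stage until data.csv changes. Commit both dvc.yaml and dvc.lock to Git.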

Troubleshooting

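If dvc push fails, check your remote storage credentials and network connection. dvc doctor prints environment details (DVC version, platform, supported remote types, configured remotes), which helps spot a missing backend plugin. If data files are missing, check with git status that the corresponding .dvc files are committed to Git, and make sure no pattern in .dvcignore excludes them.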

bash
dvc doctor

git status
cat .dvcignore
output
DVC version: 3.x.x (pip)
Platform: Python 3.x.x on ...
Supports: http (...), https (...), s3 (...)
Remotes: s3
Repo: dvc, git
On branch master
nothing to commit, working tree clean
# Contents of .dvcignore
# Ignore temporary files
*.tmp

Key Takeaways

  • Use dvc init to enable data version control in your Git project.
  • Track large data files with dvc add and commit the generated .dvc files to Git.
  • Configure remote storage with dvc remote add to share and back up data.
  • Use dvc push and dvc pull to sync data with remote storage.
  • Run dvc repro to re-run the pipeline stages in dvc.yaml whose dependencies have changed.
Verified 2026-04