How to use DVC data version control

Quick answer
Use DVC to version datasets and machine learning models: initialize a DVC repository, track data files with dvc add, and push the data to remote storage. Git stores only the small .dvc pointer files, so data changes are tracked alongside code, which keeps experiments reproducible and makes collaboration easier.

PREREQUISITES

  • Python 3.8+
  • Git installed and configured
  • pip install dvc
  • Access to remote storage (e.g., AWS S3, Google Drive, or local filesystem)

Setup

Install DVC via pip and initialize a DVC project inside a Git repository. Run git init only if the project is not already a Git repository.

bash
pip install dvc

git init

dvc init
output
Initialized empty Git repository in /path/to/repo/.git/
Initialized DVC repository.
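
To confirm that both initialization steps took effect, one option is to look for the .git and .dvc entries that git init and dvc init create in the project root. The helper below is a minimal sketch of that check; is_dvc_project is a hypothetical name used for illustration and is not part of DVC.

```python
from pathlib import Path


def is_dvc_project(root: str) -> bool:
    """Return True if root contains both a .git entry and a .dvc directory."""
    base = Path(root)
    # .git is usually a directory, but it can be a file in worktrees or submodules
    return (base / ".git").exists() and (base / ".dvc").is_dir()


# Check the current working directory
print(is_dvc_project("."))
```

If this prints False inside your project, rerun git init and dvc init from the project root.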

Step by step

Track a dataset file with DVC, commit changes to Git, and push data to remote storage.

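python
import subprocess


def run(*args):
    """Run one command and raise CalledProcessError if it fails."""
    subprocess.run(args, check=True)


# 1. Add a data file to DVC tracking (writes data/dataset.csv.dvc and data/.gitignore)
run("dvc", "add", "data/dataset.csv")

# 2. Commit the .dvc pointer file and the .gitignore entry to Git
run("git", "add", "data/dataset.csv.dvc", "data/.gitignore")
run("git", "commit", "-m", "Add dataset with DVC tracking")

# 3. Configure remote storage (example: local directory)
run("dvc", "remote", "add", "-d", "myremote", "/path/to/remote/storage")

# 4. Push data to remote
run("dvc", "push")

# Confirmation message; git and dvc print their own output above it
print("Data tracked and pushed to remote storage.")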
output
Data tracked and pushed to remote storage.
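
The .dvc file written by dvc add is a small text pointer: it records a content hash of the data rather than the data itself, which is why Git can version it cheaply. The sketch below illustrates the idea with an MD5 hash computed in chunks. It assumes MD5 as the hash and uses a hypothetical helper name, file_md5. It is not DVC's own code, and the value may not match what a given DVC version records.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a large dataset is never fully loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5("data/dataset.csv") before and after editing the file gives two different digests, which is the signal a content-addressed tool uses to detect that the data changed.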

Common variations

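You can use other remote storage backends, such as AWS S3, Google Drive, or Azure Blob Storage, by passing the matching URL to dvc remote add. Each backend needs its optional dependency installed (for example, pip install "dvc[s3]" for S3). DVC also supports pipelines that automate ML workflows: stages are defined in a dvc.yaml file, and dvc repro re-runs the stages whose dependencies have changed.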

bash
dvc remote add -d myremote s3://mybucket/path

dvc push

dvc repro
output (illustrative; exact messages vary by DVC version)
Pushing data to s3://mybucket/path
Reproducing pipeline stages...
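
dvc repro needs at least one pipeline stage to work with. One way to define a stage is dvc stage add, which takes a stage name (-n), dependencies (-d), outputs (-o), and the command to run. The sketch below only assembles that command line so you can inspect it before running it. The stage name train, the script train.py, and the output model.pkl are hypothetical placeholders, and stage_add_command is an illustrative helper rather than a DVC API.

```python
import shlex


def stage_add_command(name, deps, outs, command):
    """Build the argument list for `dvc stage add` without executing it."""
    cmd = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        cmd += ["-d", dep]  # files the stage reads
    for out in outs:
        cmd += ["-o", out]  # files the stage produces
    return cmd + shlex.split(command)


cmd = stage_add_command(
    name="train",                            # hypothetical stage name
    deps=["data/dataset.csv", "train.py"],   # train.py is a placeholder script
    outs=["model.pkl"],                      # placeholder output file
    command="python train.py",
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True), then run `dvc repro`
```

Building the list separately from running it keeps the example safe to try in a directory that is not yet a DVC project.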

Troubleshooting

  • If dvc push fails, check your remote storage credentials and network connection.
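  • If data files are missing after cloning or checking out a commit, ensure the .dvc files and .gitignore updates were committed to Git, then run dvc pull to download the data.
  • Use dvc status -c (short for --cloud) to compare the local cache with remote storage. Plain dvc status only compares the workspace with what DVC has tracked.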
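
For the missing-data case, a quick way to see what is absent is to look for .dvc pointer files whose data file is not on disk. The sketch below relies on the naming convention shown earlier, where data/dataset.csv.dvc points at data/dataset.csv. missing_data is a hypothetical helper written for illustration, not a DVC command.

```python
from pathlib import Path


def missing_data(root: str):
    """List tracked paths whose .dvc pointer exists but whose data does not."""
    missing = []
    for pointer in sorted(Path(root).rglob("*.dvc")):
        if not pointer.is_file():
            continue  # skip the .dvc/ configuration directory itself
        target = pointer.with_suffix("")  # data/dataset.csv.dvc -> data/dataset.csv
        if not target.exists():
            missing.append(target)
    return missing


for path in missing_data("."):
    print(f"missing: {path}")
```

Any path it lists is a candidate for dvc pull, provided the data was pushed to the remote beforehand.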

Key Takeaways

  • Use dvc add to track large data files without storing them in Git.
  • Configure remote storage to share datasets and models across teams.
  • Commit .dvc files and .gitignore changes to keep data versioning consistent.
  • Use dvc repro to re-run the pipeline stages defined in dvc.yaml and reproduce results.
  • Use dvc status -c, dvc push, and dvc pull to keep local data in sync with remote storage.
Verified 2026-04