How to use DVC data version control

Quick answer

Use DVC to version datasets and machine learning models: initialize a DVC repository, track data files with dvc add, and push the data to remote storage. Git stores only the small .dvc pointer files, so data changes are versioned alongside code, which makes experiments reproducible and collaboration easier.

PREREQUISITES

Python 3.8+
Git installed and configured
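pip install dvc (for cloud remotes, install the matching extra, e.g. pip install "dvc[s3]" or pip install "dvc[gdrive]")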
Access to remote storage (e.g., AWS S3, Google Drive, or local filesystem)

Setup

Install DVC with pip, then initialize a DVC project inside a Git repository. Run git init only if the directory is not already a Git repository; dvc init must be run after it.

bash

pip install dvc

git init

dvc init

output

Initialized empty Git repository in /path/to/repo/.git/
Initialized DVC repository.
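
Before running the setup commands, it can help to confirm that both tools are actually on your PATH. The following is a minimal sketch using only the Python standard library; missing_tools is a hypothetical helper name chosen for this example, not part of DVC.

```python
import shutil


def missing_tools(tools=("git", "dvc")):
    """Return the names of command-line tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("git and dvc are both available.")
```

If dvc is reported missing right after pip install dvc, the usual cause is that pip installed into a different environment than the one your shell is using.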

Step by step

Track a dataset file with DVC, commit changes to Git, and push data to remote storage.

python

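import subprocess


def run(*args):
    """Run one command and stop at the first failure."""
    subprocess.run(args, check=True)


# 1. Add a data file to DVC tracking
#    (creates data/dataset.csv.dvc and adds the file to data/.gitignore)
run("dvc", "add", "data/dataset.csv")

# 2. Commit the DVC pointer file and the .gitignore that DVC wrote next to it
run("git", "add", "data/dataset.csv.dvc", "data/.gitignore")
run("git", "commit", "-m", "Add dataset with DVC tracking")

# 3. Configure remote storage (example: local directory) and commit the config
run("dvc", "remote", "add", "-d", "myremote", "/path/to/remote/storage")
run("git", "add", ".dvc/config")
run("git", "commit", "-m", "Configure DVC remote")

# 4. Push data to remote
run("dvc", "push")

# Output example
print("Data tracked and pushed to remote storage.")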

output

Data tracked and pushed to remote storage.
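
To make the idea behind dvc add more concrete: DVC identifies file contents by a hash and stores the data in a content-addressed cache, while Git keeps only the pointer file. The sketch below hashes a file with MD5 and derives a cache location from the digest. The files/md5/<two chars>/<rest> layout is an assumption based on DVC 3.x (older versions differ), and file_md5 and cache_relpath are hypothetical helper names, not DVC functions.

```python
import hashlib
import sys
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex):
    """Assumed DVC 3.x cache layout: files/md5/<first 2 chars>/<remaining chars>."""
    return Path("files") / "md5" / md5_hex[:2] / md5_hex[2:]


if __name__ == "__main__" and len(sys.argv) > 1:
    digest = file_md5(sys.argv[1])
    print("md5:", digest)
    print("assumed cache path: .dvc/cache/" + cache_relpath(digest).as_posix())
```

Because the location depends only on the contents, two identical copies of a dataset map to the same cache entry, and any change to the file produces a new digest that the .dvc file records.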

Common variations

You can use other remote storage backends such as AWS S3, Google Drive, or Azure Blob Storage by passing the matching URL to dvc remote add. DVC also supports pipelines: define stages in a dvc.yaml file, then run dvc repro to re-execute only the stages whose dependencies have changed.

bash

dvc remote add -d myremote s3://mybucket/path

dvc push

dvc repro

output (illustrative; exact messages depend on the DVC version)

Pushing data to s3://mybucket/path
Reproducing pipeline stages...
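
Each cloud backend needs its own optional dependency, so a remote URL tells you which pip extra to install. The sketch below maps a remote URL's scheme to the corresponding extra; pip_requirement and the EXTRAS table are hypothetical names for this example, and the table covers only a handful of backends.

```python
from urllib.parse import urlparse

# URL scheme -> pip requirement that provides the matching DVC remote support.
EXTRAS = {
    "s3": "dvc[s3]",
    "gs": "dvc[gs]",
    "azure": "dvc[azure]",
    "gdrive": "dvc[gdrive]",
    "ssh": "dvc[ssh]",
}


def pip_requirement(remote_url):
    """Return the pip requirement for a remote URL; plain 'dvc' for local paths."""
    scheme = urlparse(remote_url).scheme
    return EXTRAS.get(scheme, "dvc")


if __name__ == "__main__":
    for url in ("s3://mybucket/path", "/path/to/remote/storage"):
        print(url, "->", "pip install", repr(pip_requirement(url)))
```

A local directory remote, as used in the step-by-step example, needs no extra at all.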

Troubleshooting

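If dvc push fails, check your remote storage credentials and network connection, and confirm that the extra for your backend (for example dvc[s3]) is installed.
If data files are missing after cloning or checking out a commit, ensure the .dvc files and .gitignore updates were committed to Git, then run dvc pull to download the data.
Use dvc status -c to compare the local cache with the default remote; plain dvc status only reports changes in your workspace.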
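
A quick way to spot data that still needs a dvc pull is to look for .dvc pointer files whose data file is absent. The following is a rough sketch that assumes the default layout created by dvc add, where the pointer sits next to the data and has .dvc appended to its name; missing_tracked_data is a hypothetical helper, and the path recorded inside the .dvc file remains the authoritative source.

```python
from pathlib import Path


def missing_tracked_data(repo_root="."):
    """List data paths that have a .dvc pointer file but no data in the workspace."""
    root = Path(repo_root)
    missing = []
    for pointer in sorted(root.rglob("*.dvc")):
        if not pointer.is_file():
            continue  # skips the .dvc/ directory itself
        data_path = pointer.with_suffix("")  # data/dataset.csv.dvc -> data/dataset.csv
        if not data_path.exists():
            missing.append(data_path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    for path in missing_tracked_data():
        print("missing (try dvc pull):", path)
```

An empty result does not prove the data is current, only that every pointer has something at the expected location; dvc status gives the authoritative answer.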

Key Takeaways

Use dvc add to track large data files without storing them in Git.
Configure remote storage to share datasets and models across teams.
Commit .dvc files and .gitignore changes to keep data versioning consistent.
Use dvc repro to re-run pipelines defined in dvc.yaml and reproduce results.
Use dvc push and dvc pull to synchronize data with the remote, and dvc status -c to check what is out of sync.

Verified 2026-04
