Data Version Control Explained
If you intend to study and reproduce the results of machine learning research, you'll need to implement a version control system for your data. Here's an overview of how data version control works.
6 October 2021
In today’s data-driven world, machine learning experts and data scientists deal with a large volume of datasets, files, and metrics to carry out day-to-day operations. The varying versions of these artifacts need to be tracked and managed as experiments are performed on them in multiple iterations. Data Version Control is a great practice for managing numerous datasets, machine learning models, and files in addition to keeping a record of multiple iterations – i.e. when, why, and what was altered.
DVC is an open-source system that ensures reproducibility in machine learning experiments: its users do not have to manually remember which model uses which dataset or what actions were taken to get a given result.
Furthermore, DVC users do not have to rebuild previous models or data modeling techniques to recreate a past state of results. Even with tons of models and data metrics on hand, DVC eliminates the effort required to know which model was trained with what type or version of data, since every distinct state is preserved for later reference.
DVC basically consists of a bundle of tools and processes that track changing versions of data and collections of previous data (in other words, no more digging around for files with names like “old2-v2.html”). Repositories in DVC usually refer to files or directories that are under the control of the versioning system. A categorized state is maintained for each change committed to a file (e.g. add, delete, move, or modify).
DVC was released in 2017 as a simple command-line tool; its most recent release at the time of writing is version 1.11.2. It has been adopted by thousands of users and boasts 150+ contributors. Development on DVC is guided by discussions among community members, most of whom are ML engineers, software developers, and data scientists.
It took almost three years of planning to release DVC (which bills itself as “Git for Machine Learning Projects”) with stabilized commands and file formats. An upgraded version of DVC is under active development, which will further improve its data management layers and make it a simpler tool for tracking ML history.
Controlling large-scale data without DVC or a similar tool is almost inconceivable today. Before such tools evolved, ML artifacts were handled manually with good old CTRL-C, CTRL-V, and conventional file trees, tactics that break down quickly once data files grow large.
To ensure accuracy in projects, data scientists frequently spend weeks or months on time-consuming experimentation, including the tedious task of tracking which model was trained with which dataset. DVC takes much of this bookkeeping off their hands.
In a production ML environment, scientists face challenges like versioning in a collaborative environment and maintaining enough storage space. To simplify data management and tackle these issues head-on, you can utilize tools like the following:
| Tool | Pros | Cons | Open Source | Convenient to Use | Supports Cloud |
| --- | --- | --- | --- | --- | --- |
|  | Lightweight pipelines | Tightly coupled, redundancy | Yes | Yes | Yes |
|  | SQL interface, lightweight | Does not support images and free-form text; still evolving | Yes | No | No |
|  | Portable, robust | Integrating with existing tools | Yes | No | Yes |
|  | Effective for data processing, allows ACID transactions | Less flexible; built for Spark and big data | Yes | No | Yes |
|  | Smooth integration | Non-scalable servers | Yes | Yes | No |
DVC takes advantage of Git and runs on top of it. It uses a remote repository like Google Cloud, Azure, or S3 for storing large files. It could be said that Git provides version control for code, while DVC provides it for data. In simpler terms, you can say that DVC = “Machine Learning Git.”
Git is used as a foundation by DVC to track how a model was produced and which commands were used to generate its metrics. Lightweight .dvc pointer files come along when you download a Git repository. Small data files can stay in Git, while large data files go into DVC's remote storage. Git is not strictly required, though, as DVC can work effectively even without it.
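As a minimal sketch of that split (the paths and commit message here are placeholders), tracking a large dataset inside an existing Git repository might look like this:

```bash
# One-time setup: create DVC's internal .dvc/ config directory, tracked by Git
dvc init

# Move the dataset into DVC's cache and write a small data/images.dvc pointer
dvc add data/images

# Commit only the pointer file; the data itself never enters Git history
git add data/images.dvc .gitignore
git commit -m "Track image dataset with DVC"
```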
A critical challenge in deep learning experiments is to manage, store, and reuse models and algorithms. To minimize the complexity of these challenges, some advantages of DVC for data scientists are listed as follows:
By centralizing data storage, teams find it easier to perform experiments using a shared single machine, which in turn promotes better resource utilization. DVC allows teams to manage a development server for shared data usage.
Servers in this case can be any type of cloud or remote storage (Microsoft Azure, Amazon S3, Google Cloud Storage, or even a plain SSH server). Just as we do a git checkout for our code, we can do the same for our data and models in DVC, which makes version switching and workspace restoration fast for all users sharing models through the cloud.
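A quick sketch of that setup (the bucket name and remote name are placeholders):

```bash
# Register a shared remote as the project default
dvc remote add -d storage s3://my-team-bucket/dvc-store
git add .dvc/config
git commit -m "Configure shared DVC remote"

# One teammate uploads the cached data...
dvc push

# ...and everyone else pulls down exactly the same version
dvc pull
```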
Datasets and models in DVC are versioned in ordinary data repositories, and versioning is achieved through regular Git workflows such as pull requests. To store all ML artifacts, DVC uses a built-in cache, which is synchronized with remote cloud storage. This is how DVC tracks data and models across versions. The basic step for building tracked artifacts is to write a dvc.yaml file describing how each one is produced.
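A minimal dvc.yaml might look like the sketch below; the script, data, and model file names are placeholders:

```yaml
stages:
  train:
    cmd: python train.py        # command that produces the artifact
    deps:
      - train.py                # code dependency
      - data/train.csv          # data dependency
    outs:
      - model.pkl               # artifact DVC caches and versions
```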
When using ML models in cross-project experiments, DVC data registries can be helpful. These act like a package management system for data, boosting reproducibility and reusability. DVC repositories store the history of all artifacts, including what was changed and when, and an update can be propagated downstream with a single commit. A simple command-line interface enables users to reproduce and organize feature stores with the `dvc get` and `dvc import` commands.
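For example, pulling a dataset out of Iterative's public dataset registry (any DVC repository works the same way):

```bash
# Download an artifact without recording where it came from
dvc get https://github.com/iterative/dataset-registry get-started/data.xml

# Import it instead, keeping a link to the source repository
dvc import https://github.com/iterative/dataset-registry get-started/data.xml

# Later, bring the imported artifact up to date with its source
dvc update data.xml.dvc
```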
Data is the main asset for ML engineers, so proper organization of data is necessary to train models effectively. DVC uses the concept of a data pipeline to version data using Git. These pipelines are lightweight and allow you to organize and reproduce your workflows. Dataset versioning promotes automation, reproducibility, and CI/CD for machine learning.
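Under the 1.x command set mentioned above, a pipeline stage can be created and reproduced like this (the stage and file names are placeholders):

```bash
# Define a "train" stage; DVC records it in dvc.yaml and runs it once
dvc run -n train -d train.py -d data/train.csv -o model.pkl python train.py

# Re-run only the stages whose dependencies have changed
dvc repro

# Inspect the stage dependency graph
dvc dag
```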
A stack of modern features enables fast-paced machine learning innovation: versioned metafiles, fast tracking of metrics in simple text form, quick version switching, data sharing through a centralized development server, lightweight pipelines, and data-driven navigation through project directories.
Imagine switching between versions of a 100GB dataset with a simple `git checkout`, using `git clone` to pull up large metafiles and models within seconds, or using the same small set of commands to train systems in less time and generate results faster.
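That switching works because Git and DVC each restore their half of the workspace; a typical sequence, assuming a tag named v1.0, looks like:

```bash
# Git restores the code and the small .dvc pointer files...
git checkout v1.0

# ...and DVC syncs the matching large files out of its cache
dvc checkout
```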
DVC is not a one-stop solution for all ML problems. It comes with its own set of pitfalls, which are outlined below:
Using a separate pipeline tool can cause redundancy, because DVC is tightly coupled with its own pipeline management.
There is a risk of misconfiguring your pipeline in DVC (for example, if your team forgets to declare an output file). It is also wrong to assume that a DVC-produced version of a project from a year ago will behave the same under current circumstances. Checking for missing dependencies in DVC is tough, because faulty data munging does not readily surface as an error.
DVC works along with Git, so without properly defined metrics and datasets for a given architecture, teams will not get the full benefit of this version control system. Teams may also have to manually develop extra features on top of DVC to meet certain ML demands.
Note: DVC does not help you with full system design and does not control non-deterministic behavior of your model.
Christopher Samiullah, a freelance software consultant, shared his experience using DVC and documented the changes that followed from incorporating it into his project. His model was a convolutional neural network for image classification, trained on a plant seedlings dataset.
He walked through his workflow both without DVC and with it, noting the modifications he made along the way.
DVC is useful wherever you need to manage datasets with less storage space and track changes made by multiple team members at the same time. If you run a large ML team working with complex datasets, it's recommended to implement data version control. And if your model outputs require debugging, adopting a data version control tool will enhance reproducibility.
If you’re an ML team, you should know that Crowdbotics provides managed app development services by vetted developers, including ML expertise and business intelligence implementations. Our developers can add ML features and analytics to an existing product or build ML and data tools (including a DVC pipeline) from the ground up. Get in touch with our experts today to learn more.