Data Version Control Explained

In today’s data-driven world, machine learning experts and data scientists deal with a large volume of datasets, files, and metrics to carry out day-to-day operations. The varying versions of these artifacts need to be tracked and managed as experiments are performed on them in multiple iterations. Data Version Control is a great practice for managing numerous datasets, machine learning models, and files in addition to keeping a record of multiple iterations – i.e. when, why, and what was altered.

Introduction to Data Version Control

DVC is an open-source system that ensures reproducibility within machine learning experiments, since its users do not have to manually remember which data model uses which dataset and what actions were conducted to get the desired result.

Furthermore, DVC users do not have to rebuild previous models or repeat earlier data modeling steps to get back to a past state of results. Even with a large number of models and data metrics, DVC removes the effort of working out which model was trained with which type or version of data. A separate record is kept for each iteration.

DVC consists of a bundle of tools and processes that track changing versions of data and collections of previous data (in other words, no more digging around for files with names like “old2-v2.html”). Repositories in DVC usually refer to the files or directories that are under version control. A categorized state is maintained for each change committed to a file (e.g. add, delete, move, or modify).

A Brief History of DVC

DVC was released in 2017 as a simple command line tool; at the time of writing, its most recent release was version 1.11.2. It has been adopted by thousands of users and boasts 150+ contributors. Development on DVC is guided by discussions between community members, most of whom are ML engineers, software developers, and data scientists.

It took almost three years of planning to release DVC (which bills itself as “Git for Machine Learning Projects”) with stabilized commands and file formats. An upgraded version of DVC is under active development and aims to further improve the data management layer and make the tool less complicated to use.

Accessing and Preserving Large Datasets before DVC

Controlling large-scale data without DVC or a similar tool is almost inconceivable today. Before these tools evolved, ML data was handled manually with good old CTRL-C, CTRL-V, and conventional file trees. Here are a few tactics that were used to manage unwieldy large data files:

Allocate more memory: One method was to use a tool to increase the available memory for your machine learning libraries.
Smaller samples: DVC enables you to track large data files (from 10 GB to 100 GB). Before DVC, however, tracking such large files was not practical, so engineers had to work on smaller samples of thousands or hundreds of thousands of rows at a time.
Allocate more hardware: Separate computing machines or cloud services were used to run algorithms so as to avoid crashes.
Change data format: Another approach was to alter your data format from CSV to binary to save large data in a more compact form.
Relational databases: These were used to store and retrieve big datasets in a less taxing fashion.
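
To make the “smaller samples” tactic above concrete, here is a minimal Python sketch of drawing a fixed-size random sample from a CSV stream without loading the whole file into memory. It uses reservoir sampling, which is one common way to do this and not something the tactics above prescribe; the function name sample_rows and the generated data are hypothetical stand-ins.

```python
import csv
import io
import random


def sample_rows(lines, k, seed=0):
    """Return (header, up to k uniformly sampled data rows) from a CSV stream.

    Reservoir sampling keeps at most k rows in memory at any time,
    no matter how many rows the stream contains.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Hypothetical in-memory stand-in for a file too large to load at once.
big_csv = io.StringIO(
    "id,value\n" + "".join(f"{n},{n * n}\n" for n in range(10_000))
)
header, rows = sample_rows(big_csv, k=100)
```

With a real dataset, the same function could be handed an open file object instead of the in-memory stand-in.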

Effect of DVC on Workflows

To ensure accuracy in projects, data scientists frequently spend weeks and months on time-consuming experimentation. They perform the tedious task of configuring which model to train with what dataset. DVC impacts this process in the following ways:

Comparing models can be costly if models are not finalized during training. DVC can help to manage the complexity of ML pipelines so that you can train the same model again and again.
In training a large number of models, it is difficult for teams to remember or keep track of which model was trained with which type of data. DVC helps teams to maintain version files and to easily reference ML models and their results.
In ML, as data is accessed by a massive audience, it can cause confusion among team members if datasets are not properly labelled following proper conventions. DVC, however, facilitates proper labeling to enable broader experimentation.
DVC helps teams answer questions like “How was X model built?” as it offers reproducibility and the ability to share results between projects.
DVC’s distributed nature allows users to work on desktops with GPUs, on laptops, and on cloud resources in the event that they need more memory.
DVC aims to exclude the need for spreadsheets, tools, and ad hoc scripts to share documents for communication. Working on ML projects requires you to run a bunch of experiments and communicate their output with your colleagues, which becomes hectic when you forget to track what you did yesterday or a week ago.

DVC Tools and Frameworks

In a productive ML environment, scientists face many challenges like versioning in a collaborative environment and maintaining enough storage space. To simplify data management and tackle these issues head-on, you can utilize the following tools:

Tool	Pros	Cons	Open Source	Convenient to Use	Supports Cloud
DVC	Lightweight pipelines, supports cloud storage	Tightly coupled with pipeline management, which can cause redundancy	Yes	Yes	Yes
DOLT	SQL interface, lightweight	Does not support images or free text, still evolving	Yes	No	No
Pachyderm	Portable, robust, and offers scalability options	Integrating with existing infrastructure is complicated	Yes	No	Yes
Delta Lake	Effective for data processing, allows ACID transactions	Less flexible, built for Spark and big data	Yes	No	Yes
Git LFS	Smooth integration, same permissions as the Git repository	Servers do not scale well	Yes	Yes	No

DVC on Git

DVC takes advantage of Git and runs on top of it. It uses remote storage such as Google Cloud, Azure, or S3 for large files. It could be said that Git provides version control for code, while DVC provides it for data. In simpler terms, you can say that DVC = “Machine Learning Git.”

DVC uses Git as a foundation to track the lifecycle of how a model was produced and which commands were used to produce its metrics. The .dvc metafiles live in the Git repository, so they are downloaded whenever the repository is cloned. Small files are meant for Git, while large data files go into DVC’s remote storage. Pairing DVC with Git is not required, as DVC can also work without it.
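
The split between small metafiles in Git and large files in separate storage can be illustrated with a short Python sketch. This is a simplified model of the idea and not DVC’s actual implementation: the track and restore functions, the .pointer.json file name, and the cache layout are all assumptions made for illustration. MD5 is used only because DVC’s metafiles record an md5 field.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file."""
    # Reads the whole file for brevity; a real tool would hash in chunks.
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Shard the cache by the first two hex characters of the hash.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # The pointer is tiny, so it is the part that would be committed to Git.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file named in the pointer from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["md5"][:2] / meta["md5"][2:], target)
    return target


workspace = Path(tempfile.mkdtemp())
cache = workspace / "cache"
data = workspace / "train.csv"

data.write_text("id,label\n1,cat\n2,dog\n")
pointer_v1 = track(data, cache)
v1_meta = pointer_v1.read_text()  # contents of the "committed" pointer, version 1

data.write_text("id,label\n1,cat\n2,dog\n3,bird\n")
track(data, cache)  # version 2 gets a new hash and a new cache entry

pointer_v1.write_text(v1_meta)  # stand-in for checking out the old pointer in Git
restored = restore(pointer_v1, cache)
```

Because the cache is keyed by content hash, both versions of the file sit side by side, and switching versions only requires swapping the small pointer and copying the matching file back into the workspace.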

Advantages of DVC

A critical challenge in deep learning experiments is to manage, store, and reuse models and algorithms. To minimize the complexity of these challenges, some advantages of DVC for data scientists are listed as follows:

1. Centralized Data Sharing

By centralizing data storage, teams find it easier to perform experiments using a shared single machine, which in turn promotes better resource utilization. DVC allows teams to manage a development server for shared data usage.

Storage in this case can be any supported remote (Microsoft Azure, Amazon S3, Google Cloud Storage, SSH, etc.). Just as we use git checkout for our code, we can do the same for our data and models in DVC, which offers fast switching and workspace restoration so that all users can share models through the cloud.

2. Track & Visualize ML Models

Data science features in DVC are versioned in data repositories. Versioning is achieved through regular Git workflows such as pull requests. To store all ML artifacts, DVC uses a built-in cache, which is further synchronized with remote cloud storage. This way, DVC allows for the tracking of data and models for further versioning. A basic step to build artifacts by tracking ML models is to write a dvc.yaml file.

3. Reproducibility

When using ML models in cross-project experiments, DVC data registries can be helpful. These act like a package management system that boosts reproducibility and reusability. DVC repositories store the history of all artifacts, including what was changed and when, and an update can be applied with a single commit. A simple command line interface enables users to reuse and organize feature stores with the dvc get and dvc import commands.

4. Organized ML data

Data is the main asset for ML engineers, so proper organization of data is necessary to train models effectively. DVC uses the concept of a data pipeline to version data using Git. These pipelines are lightweight and allow you to organize and reproduce your workflows. Dataset versioning promotes automation, reproducibility, and CI/CD for machine learning.
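
To show why a lightweight pipeline helps with reproducibility, here is a minimal Python sketch of a stage that re-runs only when one of its dependencies has changed. It is an illustration in the spirit of a pipeline tool that records dependency hashes in a lock file, not DVC’s actual implementation; run_stage, train, and stages.lock are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's contents (whole-file read, for brevity)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def run_stage(name, deps, out, func, lock_path: Path) -> bool:
    """Run func only if a dependency changed since the last recorded run.

    Returns True when the stage was executed, False when it was skipped.
    """
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(d): file_hash(d) for d in deps}
    if lock.get(name) == current and out.exists():
        return False  # inputs unchanged and output present: nothing to do
    func(deps, out)
    lock[name] = current
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


def train(deps, out):
    # Hypothetical "training" step: count data rows and record the count.
    rows = deps[0].read_text().strip().splitlines()[1:]
    out.write_text(f"rows_seen={len(rows)}\n")


work = Path(tempfile.mkdtemp())
data = work / "data.csv"
model = work / "model.txt"
lock_file = work / "stages.lock"

data.write_text("x,y\n1,2\n3,4\n")
first = run_stage("train", [data], model, train, lock_file)   # runs
second = run_stage("train", [data], model, train, lock_file)  # skipped
data.write_text("x,y\n1,2\n3,4\n5,6\n")
third = run_stage("train", [data], model, train, lock_file)   # runs again
```

Note that this sketch only notices changes to the dependencies it was told about. A dependency or output that was never declared goes untracked.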

5. Increase the pace of data science

A stack of modernized features enables fast-paced machine learning innovation. The features include versioning metafiles, fast tracking of metrics in simple text form, switching, sharing data through a centralized development server, lightweight pipelines and data-driven navigation through the directory.

Imagine switching between versions of a 100 GB file with a simple git checkout command (followed by dvc checkout to sync the data), using git clone to fetch the metafiles that describe large datasets and models within seconds, or using sets of similar commands to train systems in less time and get results faster.

Disadvantages

DVC is not a one-size-fits-all solution for every ML problem. It comes with its own set of pitfalls, which are described below:

1. Redundancy

Using a separate pipeline tool can cause redundancy because DVC is firmly coupled with pipeline management.

2. Incorrect Configuration Risk

There is a risk of misconfiguring your pipeline in DVC (for example, if your team forgets to declare an output file). It is also wrong to assume that a project version produced with DVC a year ago will behave the same way under current circumstances. Checking for missing dependencies in DVC is tough because the resulting data problems do not readily surface as an error.

3. Poor Performance in Sloppy Architecture

DVC works along with Git, so without the proper definition of metrics and datasets for a given architecture, teams will not be able to get the full benefit of this version control system. Teams may have to manually develop extra features in DVC to meet certain demands of ML.

Note: DVC does not help you with full system design and does not control non-deterministic behavior of your model.

Case Study

Christopher Samiullah, a freelance software consultant, shared his experience using the DVC platform and posted changes that occurred after incorporating DVC in his project. His model was a convolutional neural network for image classification taking data from a plant seedlings dataset.

Working Without DVC

Every time a feature branch was merged into master, a new version of the trained model was published to a package index (Gemfury) in CI.
Gemfury only permitted one copy of each model version in order to prevent conflicts. Uploading the same version again caused a failure, so the version file had to be incremented manually.
The package contained a text file with the reference URL where the training dataset was hosted. Before DVC, this was a simple Kaggle URL rather than an AWS S3 location.
With no cloud tooling, it was tough to share results and outcomes among teams at a centralized point.

Working With DVC

The modifications he made to his workflow are listed below:

AWS credentials were added with DVC. Cloud tooling made resource sharing possible.
The Kaggle fetch script was replaced with a dvc pull.
dvc repro train.dvc was used to run the training pipeline instead of calling .py scripts directly.
The versioning feature of DVC allowed Christopher to link tags with model files and outputs to track everything.
DVC improved testing of datasets and eliminated the need to track them through a published package.
DVC offered neatly stored and accessible models and provided long-term improvements.

Conclusion

DVC is useful wherever teams need to manage datasets with less storage overhead and track changes made by multiple team members at the same time. If you run a large ML team working with complex datasets, it’s recommended to implement Data Version Control. If your model outputs require debugging, consider adopting a data version control tool to improve reproducibility.

If you’re an ML team, you should know that Crowdbotics provides managed app development services by vetted developers, including ML expertise and business intelligence implementations. Our developers can add ML features and analytics to an existing product or build ML and data tools (including a DVC pipeline) from the ground up. Get in touch with our experts today to learn more.