Simplify Machine Learning Version Control

Artificial Intelligence (AI), especially Machine Learning (ML) and Deep Learning (DL), is all about experimentation: the more, the merrier. Each experiment is composed of different hyper-parameters, code and input data. It is extremely important to monitor and keep track of all the experiments in order to replicate, share, roll back and so on. This can be time-consuming and frustrating if not done properly. To make the whole process efficient, cost-effective and easy to manage, Machine Learning version control plays a crucial role. The purpose of this blog post is to understand the importance of versioning in Machine Learning, creating versions that work with the existing ML workflow, and managing data together with the associated trained models that go into production.

Why should we care about ML versioning?

ML researchers and practitioners spend endless hours on bringing intelligence to machines. ML modelling is a highly iterative process and consists of multiple steps: the data we use and how we process it, algorithm selection and architecture design, and the hyper-parameters and other parameters used in the experiment. Experimentation is the key to finding the optimal combination to solve the problem at hand. This is true not only for training multiple models, evaluating, adjusting and re-training during development, but also after models are deployed in production. In order to upgrade the models with the latest data and state-of-the-art ML algorithms, we have to roll new models out to production and possibly retrain the old models.

All those iterations also need to be documented, both in terms of code and data. Management quickly becomes a nightmare unless much of the logistics is efficiently handled by the data platform.

With this in mind, here are some of the reasons that highlight the need for and importance of versioning in ML projects:

  • Keeping track of best models

What if we have accidentally deleted the best-performing model, which was trained a few weeks back? How can we train our model with the exact configuration and train/test data that were used before? How can we share our experiments with colleagues and peers so that they can contribute to the project from the last best-known state? These are some of the real-world concerns that every ML researcher and practitioner deals with on a regular basis. In order to keep track of the best models we've created and the associated code, configuration and data, we need a versioning system.

  • Inherently complex nature of ML and associated data

Traditional code versioning is capable of keeping track of code, configurations and project dependencies. In ML, things get more complex: versioning the code that implements a model is not enough, because the model may behave drastically differently from one input data set to another. We also need to version the training data and the generated models in order to capture the complete training state. With a versioning system like Git, it is impractical to upload the data to the server, as we are dealing with TB- and PB-scale data. And if our versioning resides in the cloud, privacy issues kick in. That's where having an efficient data-fabric layer can make a huge difference: versioning training and testing data should be a capability built into our platform.

Most Machine Learning / Deep Learning frameworks produce model files that we need to keep track of. These trained models are often written in different file formats and rely on multiple frameworks, which makes dependency tracking even more complex.

  • ML model deployment on production

If and when we make significant updates to our production models, these changes are rarely deployed immediately and in one shot. In order to test properly and ensure failure tolerance, new models are typically rolled out gradually, until teams can be sure that they're working as expected. If we encounter any issue with a production model, we need to be able to revert quickly to the previous working version. Versioning makes it easy and hassle-free to deploy the right version at the right time.

  • Need for data provenance

Machine Learning is extending its influence across almost all verticals. With increased user impact come increased regulations, like GDPR. Organizations need to make sure that ML-generated decisions are fully explainable in terms of the data and algorithms used.

Simple versioning of the data isn’t enough to ensure a sufficient explanation. Even if we know the input data set used for a certain run of our model, can we explain where that data came from, at what times it was generated, how it was combined, and how it was processed? Our trained models must be auditable and easy to explain.

Machine Learning Versioning using NetApp Snapshots 

A NetApp Snapshot is a sophisticated mechanism for capturing point-in-time versions of data, trained models and logs, which makes it a perfect fit for Machine Learning versioning. Snapshots don't create new copies but always refer to the original data; hence they don't consume any extra space and are extremely fast (~1 sec even for PB-scale data).

Figure: Multiple data scientists working collaboratively, using a shared data lake for data, models and logs. NetApp Snapshots perform versioning by capturing the point-in-time state of the Machine Learning process.

By design, snapshots only capture incremental file changes: if data is updated, deleted or inserted, only the new changes are recorded. This makes data versioning very efficient when dealing with large amounts of data. Snapshots keep track of the origin of the data and the changes associated with it throughout the process, which makes the whole ML process explainable and compliant with data-provenance requirements.

ML versioning using NetApp Snapshots makes sharing and collaboration between teams easy, as it is part of the underlying data platform. Nothing needs to be re-engineered: with built-in CLI and API support, it can be integrated with any code-versioning platform, like Git, and with automation tools, such as Jenkins. It also works out of the box with multiple Machine Learning / Deep Learning frameworks (e.g. TensorFlow, PyTorch) that generate models in different file formats (e.g. h5, pickle).

Demo of NetApp Snapshots with GitHub for Machine Learning Versioning 

As most Machine Learning researchers and practitioners use Git for code versioning, we used GitHub and integrated it with NetApp Snapshots for Machine Learning versioning. In this demo, we trained a simple Deep Neural Network on the MNIST dataset and created versions of the trained models and the respective code. Whenever code is pushed to GitHub, the state of the trained models is captured and synced using snapshots. Each snapshot carries the same identifier as the Git commit ID, which links the state of the code with the associated trained models.
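The linking step can be sketched as a Git post-commit hook that names a snapshot after the current commit ID. This is a minimal sketch, not the demo's actual code: the vserver and volume names are placeholders, and the ONTAP command is printed rather than executed, since it must run against a real storage system.

```shell
#!/bin/sh
# Sketch of a .git/hooks/post-commit hook: derive a snapshot name from
# the Git commit ID so code and data versions share one identifier.
snap_name_for_commit() {
    echo "ml-snap-$1"
}

# In a real hook the ID would come from: git rev-parse --short HEAD
COMMIT_ID="a1b2c3d"
SNAP_NAME=$(snap_name_for_commit "$COMMIT_ID")

# Illustrative ONTAP CLI call; vserver/volume names are placeholders
# for your own environment. Printed here instead of executed.
echo "volume snapshot create -vserver svm1 -volume ml_data -snapshot ${SNAP_NAME}"
```

With this naming scheme in place, finding the data and model state for any commit is just a matter of looking up the snapshot with the matching suffix.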

Figure: Machine Learning versioning using Snapshots and Git. The commit ID is used to link the code state with the versioned data and models.

This concept makes it easy to share and collaborate with others: a normal git pull retrieves the latest complete Machine Learning version, not just the code. There is no more need to manually copy the data; NetApp Snapshots take care of it.

In case we want to roll back to the previous best-known model, we just need to pull the specific version from Git, and the associated models will be automatically updated. The same concept holds true when trained models are accidentally deleted and need to be recovered.
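The rollback can be sketched as two paired steps: check out the known-good commit, then restore the snapshot that carries the same ID. Again, this is an illustrative sketch, not the demo's code: the commit ID, vserver and volume names are placeholders, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Roll back code and data together: check out the known-good commit,
# then restore the snapshot named after the same commit ID.
# All identifiers below are illustrative placeholders.
COMMIT_ID="a1b2c3d"

CHECKOUT_CMD="git checkout ${COMMIT_ID}"
RESTORE_CMD="volume snapshot restore -vserver svm1 -volume ml_data -snapshot ml-snap-${COMMIT_ID}"

# Printed instead of executed: the restore must run against an
# actual ONTAP system, and the checkout against the actual repo.
echo "${CHECKOUT_CMD}"
echo "${RESTORE_CMD}"
```

Because the snapshot name embeds the commit ID, the two steps can never drift apart: whatever commit is checked out determines exactly which data and model state is restored.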

If you are interested in learning more, don't miss the demo video. [demo credit: Muneer Ahmad Dedmari, NetApp and Steve Guhr, NetApp]
