Why metadata is important for your Machine Learning


Data scientists seek to run reproducible and comparable experiments when building Machine Learning and Deep Learning models. To achieve that reproducibility and comparability, it is important to capture metadata about each experiment.

In this article, let us discuss why it is important to maintain model metadata for AI.

Model building and production is an iterative process.

“Deep Learning is often disturbingly sensitive to dataset and hyperparameter choice”

Step 1: Typically, a data scientist starts with an idea and a hypothesis about which datasets are needed, builds a set of features, and keeps training models.

Step 2: You tune hyperparameters to get the best model possible, but the model does not perform the way you anticipated.

Step 3: You start consulting the business and domain experts in the organization about which additional features need to be added and built to get better prediction accuracy.

Step 4: Weeks or months later, you arrive at a high-performance model with high prediction accuracy.

Step 5: The model is considered production ready. It is now validated with new datasets and different AI chips, and, surprisingly, the model performance drops and you go back to Step 2.


Impact on Business

  1. Low gross margins due to high infrastructure cost and ongoing human support
  2. Longer development and production cycles mean longer TTM (Time To Market) for AI models. Any slight change in business logic or data results in huge time delays
  3. High TCO, as this takes a heavy toll on your AI infrastructure and your data scientists' time, resulting in low margins when pursuing AI for your business

There are numerous reasons behind the high cost and long TTM of AI models. One solution to consider is shortening the model build process by reducing the number of experiments and trials so you can move to production quickly.

Having "metadata of the model" comes to the rescue to a certain extent.

Why model metadata?

Comparability & reproducibility 


Comparability

As mentioned earlier, Machine Learning is an iterative process. We need to be able to compare results between experiments, and to do that we need information about the model from both the current and previous trials. This becomes much harder when you have a team of data scientists experimenting: without a standard way of collecting information about model building, comparing results is difficult.
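
Below is a minimal sketch, in Python, of what a standard per-run record could look like and how two runs could be diffed. The `ExperimentRecord` fields and the values are purely illustrative and not tied to any particular tool or to rapt.ai's platform.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative experiment record; field names are hypothetical, not tied to any tool.
@dataclass
class ExperimentRecord:
    run_id: str
    dataset_version: str
    features: list
    hyperparameters: dict
    metrics: dict

def diff_runs(a: ExperimentRecord, b: ExperimentRecord) -> dict:
    """Return only the fields that differ between two runs, for quick comparison."""
    diffs = {}
    for field, value_a in asdict(a).items():
        value_b = asdict(b)[field]
        if value_a != value_b:
            diffs[field] = {"run_a": value_a, "run_b": value_b}
    return diffs

run1 = ExperimentRecord("run-001", "sales_v1", ["age", "region"], {"lr": 0.01}, {"auc": 0.81})
run2 = ExperimentRecord("run-002", "sales_v2", ["age", "region"], {"lr": 0.001}, {"auc": 0.86})
print(json.dumps(diff_runs(run1, run2), indent=2))
```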

Reproducibility

Imagine you have put the model into production. Now a slight change in the dataset results in completely retraining the model again and again. You need a stored model object, along with its metadata, that you can refer to from production and use to rebuild the model quickly with fewer trials.
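
As a rough illustration, and assuming a plain Python workflow rather than any particular framework, the model object and its metadata could be stored side by side so a production model can always be traced back to how it was built:

```python
import json
import pickle
from pathlib import Path

def save_run(model, metadata: dict, out_dir: str) -> None:
    """Persist the trained model object next to the metadata used to build it."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    with open(path / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(path / "metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

def load_run(out_dir: str):
    """Reload the model together with the metadata needed to reproduce it."""
    path = Path(out_dir)
    with open(path / "model.pkl", "rb") as f:
        model = pickle.load(f)
    with open(path / "metadata.json") as f:
        metadata = json.load(f)
    return model, metadata
```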

What can be considered model metadata?

Data


A model is only as good as its dataset. But the key thing about a dataset is that it has to vary: you cannot run every experiment on the same dataset with exactly the same features. The best models are developed with variance in datasets between runs and between development and production. Any slight change in the dataset can cause large variations in model performance; this concept is called "data drift" and deserves a separate discussion.

So we need metadata, or a model definition, that records which datasets were used in a particular run and which features were extracted. With that dataset metadata you can compare the data between two runs and reproduce the model much faster.
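
One simple, illustrative way to capture dataset metadata is to fingerprint the data file and record the feature list. The file name and feature names below are hypothetical:

```python
import hashlib
import json

def dataset_metadata(csv_path: str, features: list) -> dict:
    """Fingerprint a dataset file and record the features used, so two runs
    can be compared without re-reading the raw data."""
    sha256 = hashlib.sha256()
    with open(csv_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    return {
        "path": csv_path,
        "sha256": sha256.hexdigest(),  # changes whenever the data changes
        "features": sorted(features),
    }

# Example usage; the path and feature names are hypothetical:
# meta = dataset_metadata("train_v1.csv", ["age", "region", "spend"])
# print(json.dumps(meta, indent=2))
```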

Hyperparameters


Hyperparameter metadata is used to reproduce a model. Hyperparameters are the variables that determine the network structure and how the network is trained. They are set before training and are a major factor in model performance: for example, the number of hidden layers, activation functions, learning rate, number of epochs, batch size, etc.

Imagine having metadata for these parameters so that models can be compared and reproduced quickly and efficiently.
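
A minimal sketch of logging hyperparameters per run, assuming a simple directory-per-run layout; the parameter values and the `runs/` path are illustrative:

```python
import json
from pathlib import Path

# Hypothetical hyperparameter set for one training run; values are illustrative.
hyperparameters = {
    "hidden_layers": 3,
    "activation": "relu",
    "learning_rate": 1e-3,
    "epochs": 50,
    "batch_size": 128,
}

def log_hyperparameters(run_id: str, params: dict, root: str = "runs") -> None:
    """Write the hyperparameters next to the run so they can be diffed later."""
    run_dir = Path(root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    with open(run_dir / "hyperparameters.json", "w") as f:
        json.dump(params, f, indent=2, sort_keys=True)

log_hyperparameters("run-001", hyperparameters)
```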

Context

Context is information about the environment of the experiment, which may or may not affect the result. What is the program context? Typically you start by initializing variables and running the model against subsets of the dataset. To reproduce the results you need access to the code and the values used by things like the random number generator and the parameter sweep. In addition, you rely on the source code, system information, library and package dependencies, etc.
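
Here is a hedged sketch of capturing such context in Python. The package list is an assumption and should be replaced with your project's actual dependencies:

```python
import json
import platform
import random
import sys
from importlib import metadata

def capture_context(seed: int, packages=("numpy", "torch", "scikit-learn")) -> dict:
    """Record the environment details needed to rerun an experiment.
    The package list here is illustrative; record whatever your project depends on."""
    random.seed(seed)  # fix the random number generator so runs are repeatable
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return {
        "random_seed": seed,
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }

print(json.dumps(capture_context(seed=42), indent=2))
```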

AI chip/GPU characteristics


GPUs and AI chips/accelerators are essential for deep learning, enabling faster data processing and high model performance within a short period of time. GPUs and AI chips are typically vector based, and variance in memory and form factor determines the model size you can handle, how the model fits (distributed training), and the number of GPUs used. Was the training experiment run on Nvidia V100 GPUs vs. T4 vs. Tesla K80 vs. FPGAs?

Streaming multiprocessors, number of threads, block dimensions, shared memory, etc. dictate how model training behaves on a particular GPU.

Metadata about GPU characteristics enables data scientists and the infrastructure team to plan resources and make precise allocations for future experiments.

A model may break during production deployment because of a mismatch in the GPU/AI chip used. On the other hand, you may not need the heavy-duty GPU/AI chip that you used during model building and training.
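
As an illustration, GPU characteristics can be captured with PyTorch's CUDA device query; any framework, or `nvidia-smi`, could supply the same information. This is a sketch, not a description of any specific platform's implementation:

```python
import json
import torch

def gpu_metadata() -> list:
    """Capture the characteristics of each visible GPU for the run's metadata.
    Requires PyTorch with CUDA; returns an empty list on CPU-only machines."""
    devices = []
    if not torch.cuda.is_available():
        return devices
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        devices.append({
            "index": i,
            "name": props.name,  # e.g. "Tesla V100" or "T4"
            "total_memory_gb": round(props.total_memory / 1e9, 2),
            "streaming_multiprocessors": props.multi_processor_count,
            "compute_capability": f"{props.major}.{props.minor}",
        })
    return devices

print(json.dumps(gpu_metadata(), indent=2))
```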

GPU utilization

Model underfit and overfit issues arise from workload variations.

GPU utilization in the model build/production workflow changes over the course of training.


GPU utilization varies from model to model and from experiment to experiment. As datasets and hyperparameters change, the GPU device used will show variance in utilization. In some cases it may be underutilized, and in others more resources are needed depending on the datasets and hyperparameters. GPU utilization will also vary based on your business workflows: build, train, and production workloads.

GPU utilization is itself a topic for a separate discussion.

Metadata about the utilization factor would help avoid complex model underfit and overfit scenarios.
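
One possible way to record utilization, sketched with the NVIDIA management library bindings (`pynvml`, installable as `nvidia-ml-py`); the sampling count and interval are arbitrary choices here:

```python
import time
import pynvml  # NVIDIA management library bindings; pip install nvidia-ml-py

def sample_gpu_utilization(samples: int = 5, interval_s: float = 1.0) -> list:
    """Periodically sample GPU and memory utilization so it can be stored
    alongside the run's metadata."""
    pynvml.nvmlInit()
    readings = []
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust as needed
        for _ in range(samples):
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            readings.append({
                "gpu_percent": util.gpu,
                "memory_percent": round(100 * mem.used / mem.total, 1),
            })
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
    return readings

print(sample_gpu_utilization())
```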


The rapt.ai platform's "model definition layer" simplifies the business workflows of model build, training, and production. It helps with faster training, fewer trials, and transferring or reusing configurations between data scientists.

This is one of the feature offerings of our rapt product**.



Data and Compute are paramount for AI strategy.

But organizations struggle to manage data and compute in their workflows, resulting in:

  1. Lower gross margins due to high infrastructure cost and ongoing human support
  2. High time-cost of AI products
  3. AI system scaling @cloud, data center, and edge; each has its own challenges, unlike traditional software systems

**rapt.ai has built a unique "DL-defined converged compute & data platform" which provides elastic, multi-cloud, distributed Virtual AI@scale with unprecedented acceleration at reduced cost, and scales up and out easily in public cloud, private cloud, and edge computing deployments.

With rapt.ai's solution, organizations can be flexible, nimble, and agile in their AI adoption and strategy.

We are currently running an early access program for customers to try our product.

If you would like to know more about our product and its various other features, you can reach me privately at anil@rapt.ai or info@rapt.ai.

