Scalable Machine Learning Modelling

In the ML process, models are created from input data sets. Usually only one ML model is created, with the inherent assumption that the entire input data set comes from a single data distribution. This is over-generalization, and it is common practice. This article explains the pitfalls of over-generalization. In past years, when data was limited and collected from one or a few sources over a very limited time, this did not cause an issue.

This article is not about crunching Big Data for machine learning. It discusses the idea of representing or processing data with a large number of ML models.

For example, a well-known ML problem is flight delay prediction. ML practitioners usually think of creating only one ML model for all the records in the input data set. This is over-generalization: the data set actually represents many different data distributions. In this example, each airline is different, each airport is different, and each route a flight operates on is different. Predicting delay with just one ML model over-generalizes and makes learning inaccurate. Instead, we can create one ML model for each airport, each day of the week, each airline, or each flight route.
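A minimal sketch of this per-segment idea, using a per-route mean-delay predictor as a stand-in for a real model (the segment key, the mean predictor, and the global fallback are illustrative assumptions, not the article's exact method):

```python
from collections import defaultdict


class PerSegmentModel:
    """Train one tiny model per segment (e.g. per flight route)
    instead of a single global model. Here each "model" is just the
    mean delay of its segment, standing in for any real learner."""

    def __init__(self):
        self.segment_means = {}
        self.global_mean = 0.0

    def fit(self, records):
        # records: iterable of (segment_key, delay_minutes)
        sums, counts = defaultdict(float), defaultdict(int)
        total, n = 0.0, 0
        for key, delay in records:
            sums[key] += delay
            counts[key] += 1
            total += delay
            n += 1
        # Global model: the over-generalized single-model baseline.
        self.global_mean = total / n if n else 0.0
        # One model per segment, each trained only on its own data.
        self.segment_means = {k: sums[k] / counts[k] for k in sums}
        return self

    def predict(self, key):
        # Fall back to the global model for segments never seen in training.
        return self.segment_means.get(key, self.global_mean)
```

Each segment model sees only data from its own distribution, so it is not pulled toward the behavior of unrelated routes the way a single global model is.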

Before we explain further, one may argue that prediction modelling is often performed over clustered data, where each cluster is homogeneous. But this is not general practice, and moreover clusters are not a perfect partition of the underlying data distributions.

Can we divide the data into smaller data sets and create many models, one for each (sub) data set? This way each model will learn better. In the flight delay example, flights may be delayed for reasons specific to certain routes; separate models for different routes will capture this extremely well. The bias-variance trade-off comes into play here as well.

Now, how do we divide the input data set? If we divide it using a variable of high cardinality, we end up creating a large number of models, each trained on few examples and probably overfit. At the other end, we have one model using all the data.
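One simple guard against the high-cardinality extreme is to split only when every resulting segment would keep enough training examples. A sketch, where the threshold of 30 examples is an arbitrary illustrative choice rather than anything from the article:

```python
from collections import Counter


def viable_split(segment_keys, min_examples=30):
    """Decide whether splitting on a candidate variable leaves every
    segment with at least min_examples training rows.

    Returns (ok, undersized) where undersized maps each too-small
    segment to its example count, so the caller can merge those
    segments or fall back to a single global model for them."""
    counts = Counter(segment_keys)
    undersized = {k: c for k, c in counts.items() if c < min_examples}
    return len(undersized) == 0, undersized
```

A caller might test several candidate variables this way and pick the highest-cardinality one that still passes, balancing many specialized models against overfitting on tiny segments.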

To divide the data, an appropriate dimension, call it a rupture line, is required. It should divide the data into many homogeneous data sets that are dissimilar to one another. Two data sets may still have some degree of similarity, and in statistics we use hypothesis testing to measure it. Different tests exist; Bartlett's test is one I remember from my college days. In information theory we have K-L divergence.
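K-L divergence between two candidate segments can be estimated directly from their empirical distributions. A minimal sketch for categorical samples, where the additive smoothing constant is an assumption made here to keep zero-count bins from producing infinities:

```python
import math
from collections import Counter


def kl_divergence(sample_p, sample_q, smoothing=1e-9):
    """Estimate D_KL(P || Q) from two categorical samples.

    A value near zero suggests the two segments follow similar
    distributions (splitting them gains little); a large value
    suggests they are genuinely different populations."""
    support = set(sample_p) | set(sample_q)
    cp, cq = Counter(sample_p), Counter(sample_q)
    n_p, n_q = len(sample_p), len(sample_q)
    kl = 0.0
    for x in support:
        # Smoothed empirical probabilities over the joint support.
        p = (cp[x] + smoothing) / (n_p + smoothing * len(support))
        q = (cq[x] + smoothing) / (n_q + smoothing * len(support))
        kl += p * math.log(p / q)
    return kl
```

Note that K-L divergence is asymmetric; if a symmetric similarity score is preferred, averaging D_KL(P || Q) and D_KL(Q || P) is a common workaround.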

Experiments are currently in progress, and I will update my findings in the next version.






