Let the Machine Tune Itself: DIY AutoML Using Dask & BayesianOpt
Proof-of-concept development is a major part of a data science career. When we develop machine learning models for prediction in a decision flow, we usually test many different models for the same prediction task. In the process, the majority of the time is consumed testing each model and tuning it to get the best out of the algorithm. Finally, we compare the best results to select a final model, and then the process of digging deep into that model begins, to improve accuracy as much as possible.
But sequential training imposes a time constraint on testing and tuning each model, which becomes a bottleneck in the development process. So I started developing a DIY AutoML setup using Dask and bayes_opt that trains different models in parallel and tunes each one automatically. The components used are Dask and bayes_opt.
Dask is a flexible parallel computing library for analytic computing, and bayes_opt applies Bayesian optimization to auto-tune each model.
The objective is to use Bayesian optimization to tune the hyperparameters of each model automatically, and to parallelize the auto-tuning so that different models are tested and tuned in parallel. This helped speed up the proof-of-concept development process. The idea works as follows.
Step1: Let's define the models to be used and their respective hyperparameters to be tuned. For simplicity, only two models (logistic regression and k-nearest neighbors) are included here, but the list can be as big as you want. We define a Python list of dictionaries, where each dictionary holds the parameters to be optimized, their corresponding ranges, and the model name.
model1={}
model1['params']={'C':(0.1,5)}
model1['model_type']='logistic'
model2={}
model2['params']={'n_neighbors':(1,10),'leaf_size':(10,40)}
model2['model_type']='knn'
models_list=[model1,model2]
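The same structure extends to any estimator. As a sketch, a hypothetical third entry for a random forest (the model name and the ranges below are illustrative, not part of the original list; build_model would also need a matching branch) would look like this:

```python
# The list from Step 1, plus a hypothetical random-forest entry
# whose hyperparameter ranges the optimizer is allowed to search.
model1 = {'params': {'C': (0.1, 5)}, 'model_type': 'logistic'}
model2 = {'params': {'n_neighbors': (1, 10), 'leaf_size': (10, 40)}, 'model_type': 'knn'}
model3 = {'params': {'n_estimators': (10, 200), 'max_depth': (2, 15)}, 'model_type': 'random_forest'}
models_list = [model1, model2, model3]
```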
Step2: Now let's define a function that makes the Bayesian optimization call with the corresponding parameters, and another function that runs the model training steps for the parameters chosen by the Bayesian optimization function, as follows.
from bayes_opt import BayesianOptimization
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def build_model(train_x, train_y, model_type, **model_param):
    # Build the requested model with the hyperparameters chosen by
    # the optimizer, fit it, and return the score to be maximized.
    if model_type == 'logistic':
        model = LogisticRegression(**model_param)
    elif model_type == 'knn':
        # KNN hyperparameters must be integers, but the optimizer
        # proposes floats, so round them first.
        model_param = {k: int(round(v)) for k, v in model_param.items()}
        model = KNeighborsClassifier(**model_param)
    model.fit(train_x, train_y)
    return model.score(train_x, train_y)
def bayesOpt(data_dict):
    train_x = data_dict['train_x']
    train_y = data_dict['train_y']
    param_range = data_dict['params']
    model_type = data_dict['model_type']
    # Wrap the training step so the optimizer only sees the hyperparameters.
    model_func = lambda **params: build_model(train_x, train_y, model_type, **params)
    bo = BayesianOptimization(model_func, param_range)
    bo.maximize(init_points=10, n_iter=20, kappa=2, acq="ei")
    best_result = bo.res['max']['max_val']
    best_params = bo.res['max']['max_params']
    return best_result, best_params
In the bayesOpt function we receive, in a Python dict, the training data, the type of model to build, and the parameter ranges to explore. The training step is wrapped in a lambda function and handed to the Bayesian optimization call, which builds a model on each iteration, evaluates the result (here we return the accuracy score), and adapts the hyperparameters to maximize that result. The details and arguments of Bayesian optimization are not discussed here, but can be read at https://github.com/fmfn/BayesianOptimization
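To see the shape of this maximize loop without installing bayes_opt, here is a minimal stand-in that swaps the Gaussian-process surrogate for plain random search; the interface (a black-box function plus per-parameter ranges, maximized over a fixed number of trials) mirrors the BayesianOptimization call above, but the sampling strategy is deliberately simpler:

```python
import random

def maximize(func, param_ranges, n_iter=30, seed=0):
    # Random-search stand-in for BayesianOptimization.maximize():
    # sample each hyperparameter uniformly from its range, evaluate
    # the black-box function, and keep the best score seen so far.
    # (The real library instead picks each next point via a surrogate
    # model and an acquisition function such as expected improvement.)
    rng = random.Random(seed)
    best_val, best_params = float('-inf'), None
    for _ in range(n_iter):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}
        val = func(**params)
        if val > best_val:
            best_val, best_params = val, params
    return best_val, best_params

# Toy objective with a known maximum of 1.0 at x=2, y=-1.
score = lambda x, y: 1.0 - (x - 2) ** 2 - (y + 1) ** 2
best_val, best_params = maximize(score, {'x': (0, 4), 'y': (-3, 1)})
```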
Step3: The data_dict for bayesOpt will be passed from the Dask scheduler, along with the bayesOpt function, to each Dask worker. The code to distribute the model building process is given below.
from dask.distributed import Client

client = Client(n_workers=3)
each_model_best_result_futures = client.map(bayesOpt, models_list)
best_result_of_models = client.gather(each_model_best_result_futures)
Now best_result_of_models is a Python list holding the best result of each model built and the corresponding parameters that achieved it. In-depth details about the Dask framework can be found at http://dask.pydata.org/en/latest/
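The map/gather pattern above can be sketched with the standard library's concurrent.futures, which is a reasonable mental model for what client.map and client.gather do (Dask adds a scheduler, distributed workers, and task graphs on top). The tune function below is a hypothetical stand-in for bayesOpt with hard-coded scores, just to show the fan-out/fan-in shape:

```python
from concurrent.futures import ThreadPoolExecutor  # stand-in for dask workers

def tune(model_spec):
    # Stand-in for bayesOpt: pretend each model type tuned to a fixed score.
    scores = {'logistic': 0.91, 'knn': 0.88}
    return scores[model_spec['model_type']], model_spec['params']

models_list = [
    {'params': {'C': (0.1, 5)}, 'model_type': 'logistic'},
    {'params': {'n_neighbors': (1, 10), 'leaf_size': (10, 40)}, 'model_type': 'knn'},
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map fans the specs out to workers; list() collects the results,
    # mirroring client.map() followed by client.gather().
    best_result_of_models = list(pool.map(tune, models_list))
```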
This completes our small DIY AutoML framework. We can increase the number of different models to be tested, with their corresponding parameter ranges, by adding them to models_list; the number of Dask workers should be increased accordingly to increase parallelism.
Deploying this kind of framework, coupled with adaptive optimization techniques, helps improve productivity and at the same time expands the horizon that can be explored. It did so for me :). Hope it is useful for everyone!
Leave your comments or any Qs below :) RG