Let the Machine Tune Itself: DIY AutoML Using Dask & BayesianOpt
Proof-of-concept development is a major part of a data science career. When we develop machine learning models for prediction in a decision flow, we usually test many different models for the same prediction task. In the process, the majority of the time is consumed testing each model and tuning it to get the best out of the algorithm. Finally, we compare the best results to select a final model, and then the process of digging deep into that model begins, to improve accuracy as much as possible.
But sequential training imposes a time constraint on testing and tuning each model, which becomes a bottleneck in the development process. So I started developing a DIY AutoML setup using Dask and bayes_opt that trains different models in parallel and tunes each one automatically. The components used are Dask and bayes_opt.
Dask is a flexible parallel computing library for analytic computing, and bayes_opt applies Bayesian optimization to auto-tune each model.
The objective is to use Bayesian optimization to tune the hyperparameters of each model automatically, and to parallelize the auto-tuning so that different models are tested and tuned in parallel. This helped speed up the proof-of-concept development process. The idea works as follows.
Step1: Let's define the models to be used and their respective hyperparameters to be tuned. For simplicity, only two models (logistic regression and k-nearest neighbors) are included here, but the list can be as big as you want. We define a Python list of dictionaries, where each dictionary holds the parameters to be optimized, their corresponding ranges, and the model name.
model1={}
model1['params']={'C':(0.1,5)}
model1['model_type']='logistic'
model2={}
model2['params']={'n_neighbors':(1,10),'leaf_size':(10,40)}
model2['model_type']='knn'
models_list=[model1,model2]
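The same structure extends to any estimator. As a sketch, a hypothetical third entry for a random forest (the model name and the ranges below are illustrative, not part of the original list; build_model would also need a matching branch) would look like this:

```python
# The list from Step 1, plus a hypothetical random-forest entry
# whose hyperparameter ranges the optimizer is allowed to search.
model1 = {'params': {'C': (0.1, 5)}, 'model_type': 'logistic'}
model2 = {'params': {'n_neighbors': (1, 10), 'leaf_size': (10, 40)}, 'model_type': 'knn'}
model3 = {'params': {'n_estimators': (10, 200), 'max_depth': (2, 15)}, 'model_type': 'random_forest'}
models_list = [model1, model2, model3]
```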
Step2: Now let's define a function that makes the Bayesian optimization call with the corresponding parameters, and another function that runs the model training steps for the parameters chosen by the Bayesian optimization function, as follows.
from bayes_opt import BayesianOptimization
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def build_model(train_x, train_y, model_type, **model_param):
    # Build the requested model with the hyperparameters chosen by
    # the optimizer, fit it, and return the score to be maximized.
    if model_type == 'logistic':
        model = LogisticRegression(**model_param)
    elif model_type == 'knn':
        # KNN hyperparameters must be integers, but the optimizer
        # proposes floats, so round them first.
        model_param = {k: int(round(v)) for k, v in model_param.items()}
        model = KNeighborsClassifier(**model_param)
    model.fit(train_x, train_y)
    return model.score(train_x, train_y)
def bayesOpt(data_dict):
    train_x = data_dict['train_x']
    train_y = data_dict['train_y']
    param_range = data_dict['params']
    model_type = data_dict['model_type']
    # Wrap the training step so the optimizer only sees the hyperparameters.
    model_func = lambda **params: build_model(train_x, train_y, model_type, **params)
    bo = BayesianOptimization(model_func, param_range)
    bo.maximize(init_points=10, n_iter=20, kappa=2, acq="ei")
    best_result = bo.res['max']['max_val']
    best_params = bo.res['max']['max_params']
    return best_result, best_params
In the bayesOpt function we receive, in a Python dict, the training data, the type of model to build, and the parameter ranges to explore. The training step is wrapped in a lambda function and handed to the Bayesian optimization call, which builds a model on each iteration, evaluates the result (here we return the accuracy score), and adapts the hyperparameters to maximize that result. The details and arguments of Bayesian optimization are not discussed here, but can be read at https://github.com/fmfn/BayesianOptimization
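To see the shape of this maximize loop without installing bayes_opt, here is a minimal stand-in that swaps the Gaussian-process surrogate for plain random search; the interface (a black-box function plus per-parameter ranges, maximized over a fixed number of trials) mirrors the BayesianOptimization call above, but the sampling strategy is deliberately simpler:

```python
import random

def maximize(func, param_ranges, n_iter=30, seed=0):
    # Random-search stand-in for BayesianOptimization.maximize():
    # sample each hyperparameter uniformly from its range, evaluate
    # the black-box function, and keep the best score seen so far.
    # (The real library instead picks each next point via a surrogate
    # model and an acquisition function such as expected improvement.)
    rng = random.Random(seed)
    best_val, best_params = float('-inf'), None
    for _ in range(n_iter):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}
        val = func(**params)
        if val > best_val:
            best_val, best_params = val, params
    return best_val, best_params

# Toy objective with a known maximum of 1.0 at x=2, y=-1.
score = lambda x, y: 1.0 - (x - 2) ** 2 - (y + 1) ** 2
best_val, best_params = maximize(score, {'x': (0, 4), 'y': (-3, 1)})
```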
Step3: The data_dict for bayesOpt will be passed from the Dask scheduler, along with the bayesOpt function, to each Dask worker. The code to distribute the model building process is given below.
from dask.distributed import Client

client = Client(n_workers=3)
each_model_best_result_futures = client.map(bayesOpt, models_list)
best_result_of_models = client.gather(each_model_best_result_futures)
Now best_result_of_models is a Python list holding the best result of each model built and the corresponding parameters that achieved it. In-depth details about the Dask framework can be found at http://dask.pydata.org/en/latest/
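The map/gather pattern above can be sketched with the standard library's concurrent.futures, which is a reasonable mental model for what client.map and client.gather do (Dask adds a scheduler, distributed workers, and task graphs on top). The tune function below is a hypothetical stand-in for bayesOpt with hard-coded scores, just to show the fan-out/fan-in shape:

```python
from concurrent.futures import ThreadPoolExecutor  # stand-in for dask workers

def tune(model_spec):
    # Stand-in for bayesOpt: pretend each model type tuned to a fixed score.
    scores = {'logistic': 0.91, 'knn': 0.88}
    return scores[model_spec['model_type']], model_spec['params']

models_list = [
    {'params': {'C': (0.1, 5)}, 'model_type': 'logistic'},
    {'params': {'n_neighbors': (1, 10), 'leaf_size': (10, 40)}, 'model_type': 'knn'},
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map fans the specs out to workers; list() collects the results,
    # mirroring client.map() followed by client.gather().
    best_result_of_models = list(pool.map(tune, models_list))
```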
This completes our small DIY AutoML framework. We can increase the number of different models to be tested, with their corresponding parameter ranges, by adding them to models_list; the number of Dask workers should be increased accordingly to increase parallelism.
Deploying this kind of framework, coupled with adaptive optimization techniques, helps improve productivity and at the same time expands the horizon that can be explored. It did so for me :). Hope it is useful for everyone!
Leave your comments or any Qs below :) RG