MLOps
I completed one of the tasks assigned by Sir Vimal Daga, a world record holder, during my industrial training under LinuxWorld India Pvt. Ltd.
The task is to automate the process of training and tuning an ML or DL model until it reaches the desired accuracy. This is achieved by integrating the concepts of Deep Learning with the operational practices of DevOps.
The work starts with creating two Dockerfiles: one with the libraries required for Machine Learning models preinstalled, and the other for Deep Learning models. I created the below-mentioned Dockerfiles to build those images.
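As a rough illustration, the Deep Learning Dockerfile can look like the sketch below. The base image and the exact package list are assumptions, not the originals; the ML image would swap the DL stack for something like scikit-learn.

```dockerfile
# Illustrative DL training image (package set is assumed, not the original)
FROM python:3.8-slim
RUN pip install --no-cache-dir numpy pandas keras tensorflow
WORKDIR /code
# The training script is copied/mounted in by the Jenkins job at run time
CMD ["python3", "project.py"]
```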
Now the main task begins, in which I created 5 Jenkins jobs chained via Build Pipeline, as described below.
Job1: As soon as the developer pushes the code and dataset to GitHub, this job pulls them and copies both into our training environment.
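A Jenkins build step for Job1 could look like the following sketch. The repo URL is the one from this write-up, but the clone directory and the staging directory name are assumptions, and the network-dependent clone is guarded so the sketch degrades to a no-op offline.

```shell
# Hypothetical Job1 build step: fetch the latest push and stage it for training.
REPO=https://github.com/akash335saini/DLwork.git
DEST=training_env                                 # assumed staging directory
git clone "$REPO" repo_copy 2>/dev/null || true   # no-op here if offline
mkdir -p "$DEST"
cp repo_copy/*.py "$DEST"/ 2>/dev/null || true    # copy code and dataset files
echo "Job1: staged code in $DEST"
```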
Job2: It first checks whether the "project.py" file pushed by the developer contains Machine Learning or Deep Learning code, by analyzing the libraries the code imports. The job then launches a container from the appropriate Docker image, and as soon as the container is up, the training process of the model starts.
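The ML-versus-DL check can be as simple as grepping project.py's imports. In this sketch a sample project.py is written first so the snippet is self-contained (in the real job it arrives from Job1); the image names and grep patterns are assumptions, and the container launch is shown as a comment.

```shell
# Demo input: in the real pipeline, Job1 already placed project.py here.
printf 'from keras.models import Sequential\n' > project.py

# Classify the code by the libraries it imports (patterns are assumptions)
if grep -qE 'keras|tensorflow' project.py; then
  IMAGE=dl_env          # hypothetical DL image name
else
  IMAGE=ml_env          # hypothetical ML image name
fi
echo "selected image: $IMAGE"

# In the real job, training then starts inside the chosen container, e.g.:
# docker run --name trainer -v "$PWD":/code "$IMAGE" python3 /code/project.py
```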
Job3: This job copies the trained model (model.h5) and the accuracy (accuracy.txt) from the Docker container to the base OS of the training environment for further processing.
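Job3 boils down to two `docker cp` commands. The container name `trainer` and the `/code` path inside it are assumptions, and the commands are guarded so the sketch is a harmless no-op on a machine without the container.

```shell
# Hypothetical Job3 step: pull the artifacts out of the training container.
if command -v docker >/dev/null 2>&1; then
  docker cp trainer:/code/model.h5 .     || true   # trained model
  docker cp trainer:/code/accuracy.txt . || true   # recorded accuracy
fi
echo "Job3: artifacts copied to $(pwd)"
```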
Job4: Here comes the interesting part. I created a tweek.py file that is meant to make changes to the main code file, i.e., project.py. As soon as Job4 runs, tweek.py executes and first checks whether the accuracy we got is less than the required accuracy (in this case I set 90%). If it is, the script adds one more convolution and max-pooling layer to the code and returns a failure response, which triggers Job2 to start the training process again. We could tune many things, such as the number of neurons or epochs, but here I am just adding a convolution and max-pooling layer in each iteration.
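The core of tweek.py can be sketched like this. The 90% threshold matches the write-up, but the marker comment assumed to exist in project.py and the exact layer parameters are my own illustration, not the original file.

```python
# Sketch of tweek.py's logic: read the achieved accuracy, and if it falls
# short of the target, splice one more Conv2D + MaxPooling2D layer into the
# training script and report failure so Jenkins re-triggers Job2.
REQUIRED = 0.90  # target accuracy from the write-up

EXTRA_LAYERS = (
    "model.add(Conv2D(64, (3, 3), activation='relu'))\n"
    "model.add(MaxPooling2D(pool_size=(2, 2)))\n"
)

def tweak(code_path="project.py", acc_path="accuracy.txt"):
    with open(acc_path) as f:
        accuracy = float(f.read().strip())
    if accuracy >= REQUIRED:
        return True  # success: Job4 passes and Job5 publishes the model
    with open(code_path) as f:
        src = f.read()
    # Assumed marker comment left in project.py where new layers belong
    src = src.replace("# ADD-LAYERS-HERE\n", EXTRA_LAYERS + "# ADD-LAYERS-HERE\n")
    with open(code_path, "w") as f:
        f.write(src)
    return False  # failure: a non-zero exit makes Jenkins re-run Job2
```

In the Jenkins job, the build step would run this and exit non-zero on the `False` path so that the failure status propagates to the pipeline.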
Job5: If Job4 is successful, this job is triggered; it copies the model from the Jenkins workspace to the root directory and notifies that the model is trained.
I created one more job, Job6, which keeps monitoring Job2, where the main training work goes on. If the training container goes down or Job2 throws a failure, Job6 triggers Job2 again.
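Job6 is essentially a watchdog. A sketch of its scheduled check is below; the container name, the Jenkins URL, and the build-trigger token are all assumptions, and both external calls are guarded so the sketch degrades to a no-op where Docker or Jenkins is unreachable.

```shell
# Hypothetical Job6 check, run periodically (e.g. a cron-style build trigger).
RUNNING=$(docker ps --format '{{.Names}}' 2>/dev/null | grep -c '^trainer$' || true)
if [ "$RUNNING" -eq 0 ]; then
  echo "training container down - re-triggering Job2"
  # Jenkins remote build trigger; URL and token are assumptions
  curl -fs "http://localhost:8080/job/Job2/build?token=RETRAIN" || true
fi
```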
I used the MNIST dataset for this work, so with just 2 convolution layers and 1 epoch I got around 92% accuracy. I have uploaded both the initial code I used and the tweek.py file to GitHub.
GitHub URL: https://github.com/akash335saini/DLwork.git
A Final View: