Streamlining Machine Learning Operations: A Guide to MLOps Processes
MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production and then maintaining and monitoring them.
MLOps = ML system development (Dev) + ML system operation (Ops)
MLOps is used for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management.
The goal is to build an integrated ML system and continuously operate it in production.
Figure 1. Elements for ML systems
MLOps vs DevOps
ML systems differ from other software systems in the following ways:
Continuous Integration (CI): merging all working copies to a shared mainline several times a day, which triggers an automated build with testing. In ML systems, CI also includes testing and validating data, data schemas, and models (a sketch follows this list).
Continuous Delivery (CD): producing software in short cycles so that updates can be released incrementally rather than manually. In ML systems, CD is not only about deploying a single package or service, but about a system (an ML training pipeline) that automatically deploys another service (the model prediction service).
Continuous Training (CT): automatically retraining and serving the models; this practice is unique to ML systems.
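For example, the ML-specific part of CI can include an automated check that incoming training data still matches the expected schema. Below is a minimal sketch in Python using pandas; the column names, dtypes, and value ranges are illustrative assumptions, not taken from the article.

```python
import pandas as pd

# Expected schema: column name -> dtype kind (illustrative assumption)
EXPECTED_SCHEMA = {"user_id": "i", "age": "i", "purchase_amount": "f", "label": "i"}

def validate_training_data(df: pd.DataFrame) -> None:
    """Fail fast in CI if the training data drifts from the expected schema."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    assert not missing, f"Missing columns: {missing}"

    for col, kind in EXPECTED_SCHEMA.items():
        assert df[col].dtype.kind == kind, f"{col} has dtype {df[col].dtype}, expected kind '{kind}'"

    # Simple value-range and null checks
    assert df["age"].between(0, 120).all(), "age out of expected range"
    assert not df["label"].isna().any(), "label column contains nulls"

if __name__ == "__main__":
    sample = pd.DataFrame(
        {"user_id": [1, 2], "age": [34, 57], "purchase_amount": [12.5, 3.0], "label": [0, 1]}
    )
    validate_training_data(sample)
    print("data schema validation passed")
```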
Data science steps for ML (performed manually or by an automated pipeline)
8. Model monitoring: the model's predictive performance is monitored to potentially invoke a new iteration of the ML process.
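As a hedged sketch of what such monitoring could look like in code, the snippet below tracks the model's recent live accuracy over a sliding window and flags when a new training iteration may be needed; the window size and accuracy threshold are illustrative assumptions.

```python
from collections import deque

class PerformanceMonitor:
    """Track recent prediction outcomes and flag when retraining may be needed."""

    def __init__(self, window: int = 1000, min_accuracy: float = 0.90):
        self.outcomes = deque(maxlen=window)   # 1 = correct, 0 = incorrect
        self.min_accuracy = min_accuracy       # assumed acceptable accuracy floor

    def record(self, prediction, ground_truth) -> None:
        self.outcomes.append(1 if prediction == ground_truth else 0)

    def needs_retraining(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.min_accuracy

# Usage: the serving path records outcomes as ground-truth labels arrive,
# and a scheduler calls needs_retraining() to decide whether to start a new run.
```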
MLOps level 0: Manual process
- The basic level of maturity: every step, from data preparation and model training to validation and deployment, is a manual process.
Figure 2. Manual ML steps to serve the model as a prediction service.
Characteristics-
Production deployment of a new version of an ML model usually goes through A/B testing or online experiments before the model is promoted to serve all of the prediction request traffic.
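One common way to run such an online experiment is to route a small, deterministic fraction of prediction traffic to the candidate model while the rest continues to hit the current production model. A minimal sketch follows; the 10% split and the request-id based bucketing are assumptions for illustration.

```python
import hashlib

CANDIDATE_TRAFFIC_FRACTION = 0.10  # assumed split for the experiment

def pick_model(request_id: str) -> str:
    """Deterministically assign a request to 'candidate' or 'production'."""
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < CANDIDATE_TRAFFIC_FRACTION * 100 else "production"

# The same request_id always lands in the same bucket, so a given user's
# experience stays consistent for the duration of the experiment.
print(pick_model("user-42"), pick_model("user-42"))
```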
CHALLENGES
This level is sufficient when models are rarely changed, but such models often break when deployed in the real world because they fail to adapt to changes in the dynamics of the environment or in the data that describes it. To maintain the accuracy of the model, we need to:
1. Actively monitor the quality of the model in production.
2. Frequently retrain your production models: retraining on the most recent data captures evolving patterns (a scheduling sketch follows this list).
3. Continuously experiment with new implementations to produce the model: to make use of the latest ideas and advances in technology, try out new implementations of the pipeline.
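As a sketch of point 2, retraining on a sliding window of recent data can be wired to a simple schedule. Everything here is hypothetical: the 30-day window, the `load_labeled_events` loader, and the `train_model` routine stand in for whatever data access and training code a real project would use.

```python
from datetime import datetime, timedelta
from typing import Optional

import pandas as pd

RETRAIN_WINDOW_DAYS = 30  # assumed: train on the most recent month of data

def load_labeled_events(since: datetime) -> pd.DataFrame:
    """Hypothetical loader for labeled production events newer than `since`."""
    raise NotImplementedError

def train_model(df: pd.DataFrame):
    """Hypothetical training routine; returns a fitted model artifact."""
    raise NotImplementedError

def scheduled_retrain(now: Optional[datetime] = None):
    """Meant to be invoked by a scheduler (e.g. daily) to keep the model fresh."""
    now = now or datetime.utcnow()
    recent = load_labeled_events(since=now - timedelta(days=RETRAIN_WINDOW_DAYS))
    return train_model(recent)
```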
To address the challenges of this manual process, MLOps practices for CI/CD and CT are helpful.
MLOps level 1: ML pipeline automation
Goal- to perform continuous training of the model by automating the ML pipeline, achieve continuous delivery of the model prediction service, and introduce automated data and model validation steps into the pipeline.
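The model validation step mentioned in this goal can be expressed as a simple gate: the newly trained model is promoted only if it beats the currently deployed model on the same evaluation data. A minimal sketch, where the metric and the promotion margin are assumptions:

```python
PROMOTION_MARGIN = 0.01  # assumed: candidate must improve accuracy by at least 1 point

def validate_model(candidate_accuracy: float, production_accuracy: float) -> bool:
    """Automated model validation gate run before the deployment step."""
    return candidate_accuracy >= production_accuracy + PROMOTION_MARGIN

# A pipeline would compute both accuracies on the same evaluation split and
# only hand the candidate to the deployment step if this returns True.
print(validate_model(candidate_accuracy=0.93, production_accuracy=0.91))  # True
```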
Figure 3. ML pipeline automation for CT.
Characteristics-
5. Continuous delivery of models: An ML pipeline in production continuously delivers prediction services to new models that are trained on new data.
6. Pipeline deployment: In level 1, you deploy a whole training pipeline, which automatically and recurrently runs to serve the trained model as the prediction service.
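In other words, the deployable artifact at level 1 is the pipeline itself rather than a single trained model. The sketch below shows such a pipeline as a plain sequence of step functions; the step names and stub implementations are illustrative and do not correspond to any particular orchestrator's API.

```python
# Illustrative stubs; a real pipeline would call data, training, and serving services.
def extract_data(source):
    return {"rows": [1, 2, 3], "source": source}

def validate_data(raw):
    assert raw["rows"], "empty dataset"

def prepare_features(raw):
    return [r * 2 for r in raw["rows"]]

def train(features):
    return {"weights": sum(features)}

def evaluate(model, features):
    return {"accuracy": 0.95}  # placeholder metric

def passes_validation(metrics):
    return metrics["accuracy"] >= 0.9

def deploy_as_prediction_service(model):
    print("deployed new model:", model)

def run_training_pipeline(data_source: str):
    """The deployed artifact is this pipeline; each run serves a fresh model."""
    raw = extract_data(data_source)
    validate_data(raw)
    features = prepare_features(raw)
    model = train(features)
    metrics = evaluate(model, features)
    if passes_validation(metrics):
        deploy_as_prediction_service(model)
    return metrics

if __name__ == "__main__":
    run_training_pipeline("warehouse://training_events")  # hypothetical data source
```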
ADDITIONAL COMPONENTS-
-Metadata management- Information about each execution of the ML pipeline is recorded to help with data and artifact lineage, reproducibility, and comparisons, and to debug errors and anomalies. Each time you execute the pipeline, the ML metadata store records metadata such as the pipeline and component versions, the execution's start and end times, the parameter arguments passed to the pipeline, and pointers to the artifacts produced by each step (a minimal sketch follows this list).
-ML pipeline triggers- We can automate the ML production pipelines to retrain the models with new data, depending on the use case: on demand, on a schedule, on the availability of new training data, on model performance degradation, or on significant changes in the data distribution (concept drift).
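A hedged sketch of what recording execution metadata could look like, with a JSON-lines file standing in for the ML metadata store; the field names and storage format are assumptions for illustration.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class PipelineRunRecord:
    """One entry in the ML metadata store, written per pipeline execution."""
    run_id: str
    pipeline_version: str
    started_at: float
    finished_at: float
    parameters: dict
    artifact_uris: dict   # e.g. {"model": "...", "eval_report": "..."}
    metrics: dict

def record_run(record: PipelineRunRecord, path: str = "ml_metadata.jsonl") -> None:
    """Append the run's metadata so lineage, reproducibility, and debugging are possible."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

if __name__ == "__main__":
    record_run(PipelineRunRecord(
        run_id=str(uuid.uuid4()),
        pipeline_version="v1.3.0",                        # illustrative version tag
        started_at=time.time() - 120,
        finished_at=time.time(),
        parameters={"learning_rate": 0.01},
        artifact_uris={"model": "gs://bucket/models/run/model.pkl"},  # hypothetical URI
        metrics={"accuracy": 0.94},
    ))
```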
Challenges
This setup is suitable when you deploy new models based on new data rather than on new ML ideas. However, if you need to rapidly try out new ideas and deploy new implementations of the pipeline components, you need a CI/CD setup to automate the build, testing, and deployment of the ML pipelines.
MLOps level 2: CI/CD pipeline automation
Figure 4. CI/CD and automated ML pipeline.
This MLOps setup includes the following components: source control, test and build services, deployment services, a model registry, a feature store, an ML metadata store, and an ML pipeline orchestrator.
Stages of CI/CD automated ML pipelines
The pipeline moves through the following stages: development and experimentation, pipeline continuous integration, pipeline continuous delivery, automated triggering, model continuous delivery, and monitoring.
Figure 5. Stages of the CI/CD automated ML pipeline.
Reference - https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
Well elaborated. In addition to DataOps and MLOps, MLDevOps is essential for effective management of AI systems. MLDevOps encompasses Continuous Integration, Deployment, and Monitoring, which is akin to traditional software development practices, and which ensures early issue detection and swift integration of changes. Automated monitoring helps in rapid bug identification and resolution. Infrastructure Management in AI systems requires flexibility for location, computation, network, and storage needs, crucial for adapting to varying data influx rates. Unlike traditional software, AI systems may experience fluctuations in computational, storage, or memory demands, necessitating rapid adjustments in the underlying infrastructure. The complexity of infrastructure management is expected to increase with emerging hardware architectures prioritizing specialized hardware for training and inferencing in specific use cases or applications. More about this topic: https://lnkd.in/gPjFMgy7