Streamlining Machine Learning Operations: A Guide to MLOps Processes



MLOps (Machine Learning Operations) is a core function of machine learning engineering, focused on streamlining the process of taking machine learning models to production and then maintaining and monitoring them.

MLOps = ML system development (Dev) + ML system operation (Ops)

MLOps applies automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management.

The goal is to build an integrated ML system and continuously operate it in production.


Figure 1. Elements for ML systems.

MLOps vs DevOps

ML systems differ from other software systems in the following ways:

  1. Team skills: teams are typically made up of data scientists or ML researchers rather than software engineers.
  2. Development: ML is experimental in nature.
  3. Testing: beyond typical unit and integration tests, ML systems need data validation, trained model quality evaluation, and model validation.
  4. Deployment: you deploy a multi-step pipeline that automatically retrains and deploys the model, not just a single model artifact.
  5. Production: because the data constantly changes, model performance can degrade. You need to track summary statistics of your data and monitor the online performance of your model, so you can send notifications or roll back when values deviate from your expectations.

Continuous Integration (CI): merging all working copies into a shared mainline several times a day, where each merge triggers an automated build with testing. In ML, CI also means testing and validating data, data schemas, and models, not just code.

Continuous Delivery (CD): producing software in short cycles, allowing more incremental updates without manual work. In ML, CD is not only about deploying a single package: the system (an ML training pipeline) should automatically deploy another service (the model prediction service).

Continuous training (CT): automatically retraining and serving the models; this concern is unique to ML systems.
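To make the ML-specific side of CI concrete, here is a minimal sketch of such tests, written for pytest with pandas and scikit-learn; the dataset, expected schema, and quality threshold are synthetic stand-ins, not a real project's values:

```python
# Minimal sketch of ML-specific CI tests (run with `pytest`).
# The dataset, expected schema, and quality threshold below are
# illustrative stand-ins, not a real project's values.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

EXPECTED_COLUMNS = ["f0", "f1", "f2", "f3"]  # the schema the pipeline expects

def load_data() -> pd.DataFrame:
    # Synthetic stand-in for the real data extraction step.
    X, y = make_classification(n_samples=500, n_features=4, random_state=0)
    df = pd.DataFrame(X, columns=EXPECTED_COLUMNS)
    df["label"] = y
    return df

def test_data_schema():
    # CI here validates the data, not just the code.
    df = load_data()
    assert list(df.columns[:-1]) == EXPECTED_COLUMNS
    assert df.notna().all().all(), "unexpected missing values"

def test_trained_model_quality():
    # CI also validates the trained model against a quality gate.
    df = load_data()
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[EXPECTED_COLUMNS], df["label"], random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    assert model.score(X_te, y_te) >= 0.8, "model below quality gate"
```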

Data science steps for ML (performed manually or by an automated pipeline; a minimal code sketch follows this list):

  1. Data extraction
  2. Data analysis: exploratory data analysis (EDA), understanding the dataset through summary statistics and visualization.
  3. Data preparation: data cleaning, then splitting the data into training, validation, and test sets.
  4. Model training: training plus hyperparameter tuning.
  5. Model evaluation: the model is evaluated on a test set that has never been used in training.
  6. Model validation: confirming that the model is adequate for deployment.
  7. Model serving: the validated model can be deployed as one of the following:

  • Microservices with a REST API to serve online predictions.
  • An embedded model on an edge or mobile device.
  • Part of a batch prediction system.

8. Model monitoring: the model's predictive performance is monitored, potentially triggering a new iteration of the ML process.
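The following is a minimal sketch of steps 1-6 using scikit-learn on a synthetic dataset; the validation threshold is illustrative, not a recommended value:

```python
# Minimal sketch of the manual data science steps, using scikit-learn
# and a synthetic dataset as stand-ins for real data and real EDA.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# 1-2. Data extraction and analysis (synthetic here; EDA omitted).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 3. Data preparation: split into training and test sets
#    (validation happens inside cross-validation below).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4. Model training with hyperparameter tuning.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Model evaluation on held-out test data never used in training.
test_score = search.score(X_test, y_test)

# 6. Model validation: an illustrative quality gate before deployment.
if test_score >= 0.85:
    print(f"validated for deployment (accuracy={test_score:.3f})")
else:
    print("model rejected; keep the previous version in production")
```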

MLOps level 0: Manual process

This is the most basic level of maturity.


  • In level 0, you deploy a trained model as a prediction service to production.

Figure 2. Manual ML steps to serve the model as a prediction service.

Characteristics:

  1. A manual, script-driven, interactive process: every step is manual, including data analysis, data preparation, model training, and validation, and so is the transition from one step to the next.
  2. Disconnection between ML and operations: training the model and deploying it on the API infrastructure are two separate tasks, handled by data scientists and by engineers respectively.
  3. Infrequent release iterations: models don't change frequently, so a new model version is deployed only a couple of times per year.
  4. No CI: because few implementation changes are assumed, CI is ignored.
  5. No CD: because there aren't frequent model version deployments, CD isn't considered.
  6. Deployment refers to the prediction service: the process is concerned only with deploying the trained model as a prediction service (for example, a microservice with a REST API), rather than deploying the entire ML system. A minimal serving sketch follows this list.
  7. Lack of active performance monitoring: the process doesn't track or log the model's predictions and actions, which are needed to detect model performance degradation and other behavioral drift.
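As a concrete picture of item 6, here is a minimal sketch of such a prediction microservice; Flask is one common choice, and `model.joblib` is a hypothetical artifact saved earlier by the training script:

```python
# Minimal sketch of level 0 "deployment": the trained model wrapped in
# a REST microservice. `model.joblib` is a hypothetical artifact path
# produced by the (separate, manual) training process.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # saved by the training script

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[0.1, 2.3, ...], ...]}.
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```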

Before the model is promoted to serve all prediction request traffic, the production deployment of a new model version usually goes through A/B testing or online experiments.

CHALLENGES

This level may be sufficient when models rarely change, but models often break when deployed to the real world: they fail to adapt to changes in the dynamics of the environment. To maintain the accuracy of the model, you need to:

  1. Actively monitor the quality of your model in production, even if only manually at first.
  2. Frequently retrain your production models on recent data.
  3. Continuously experiment with new implementations: try out new ideas and advances in technology to improve the model.

To address these challenges of manual processes, CI/CD and CT are helpful.

MLOps level 1: ML pipeline automation

Goal: perform continuous training of the model by automating the ML pipeline, achieve continuous delivery of the model prediction service, and introduce automated data and model validation steps into the pipeline.


Figure 3. ML pipeline automation for CT.

Characteristics:

  1. Rapid experimentation: the steps of the ML experiment are orchestrated and automated.
  2. CT of the model in production: the model is automatically trained in production using fresh data.
  3. Experimental-operational symmetry: the same pipeline implementation is used in the development/experiment environment and in production, which is a key aspect of MLOps practice for unifying DevOps.
  4. Modularized code for components and pipelines: to construct ML pipelines, components need to be reusable, composable, and potentially shareable across ML pipelines (a minimal sketch follows this list). In particular:

  • Decouple the execution environment from the custom code runtime.
  • Make code reproducible between development and production environments.
  • Isolate each component in the pipeline. Components can have their own version of the runtime environment and have different languages and libraries.

5. Continuous delivery of models: an ML pipeline in production continuously delivers prediction services based on new models that are trained on new data.

6. Pipeline deployment: In level 1, you deploy a whole training pipeline, which automatically and recurrently runs to serve the trained model as the prediction service.
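To illustrate the modularity idea from item 4, here is a minimal sketch in plain Python: each component is an isolated function with explicit inputs and outputs, and the orchestrator only chains them. The names and string "artifacts" are purely illustrative, not a real framework API:

```python
# Minimal sketch of modularized pipeline components: each step is an
# isolated, reusable function, and the orchestrator only chains them.
# In a real setup each component could run in its own container with
# its own runtime environment; names here are illustrative only.
from typing import Callable

def extract(artifacts: dict) -> dict:
    artifacts["raw_data"] = f"rows from {artifacts['source']}"
    return artifacts

def prepare(artifacts: dict) -> dict:
    artifacts["train_set"] = f"cleaned({artifacts['raw_data']})"
    return artifacts

def train(artifacts: dict) -> dict:
    artifacts["model"] = f"model({artifacts['train_set']})"
    return artifacts

def run_pipeline(steps: list[Callable[[dict], dict]], config: dict) -> dict:
    # The orchestrator passes an artifact dict from component to component.
    artifacts = dict(config)
    for step in steps:
        artifacts = step(artifacts)
    return artifacts

print(run_pipeline([extract, prepare, train], {"source": "warehouse.table"}))
```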

ADDITIONAL COMPONENTS

  • Data and model validation: this step runs before model training and decides whether to retrain the model or stop the pipeline execution; the trained model is also evaluated and validated before it's promoted to production. The retraining decision is based on the following checks (sketched after this list):
  • Data schema skews: downstream pipeline steps, including data processing and model training, receive data that doesn't comply with the expected schema. Schema skews include receiving unexpected features, not receiving all the expected features, or receiving features with unexpected values.
  • Data values skews: significant changes in the statistical properties of the data, which means that data patterns are changing and you need to trigger retraining of the model to capture these changes.
  • Feature store: a centralized repository where you standardize the definition, storage, and access of features for training and serving.
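A minimal sketch of the skew checks above, assuming the training-time schema and baseline statistics were saved alongside the model; the features, baselines, and tolerance are illustrative:

```python
# Minimal sketch of data validation: compare an incoming batch against
# the schema and baseline statistics captured at training time, and use
# the result to decide whether to retrain or stop. Thresholds are
# illustrative, not recommended values.
import pandas as pd

TRAINING_SCHEMA = {"age": "int64", "income": "float64"}  # expected features/dtypes
BASELINE_STATS = {"age": 38.0, "income": 52.0}           # feature means at training time

def schema_skew(df: pd.DataFrame) -> bool:
    # Unexpected, missing, or mistyped features count as schema skew.
    actual = {c: str(df[c].dtype) for c in df.columns}
    return actual != TRAINING_SCHEMA

def values_skew(df: pd.DataFrame, tolerance: float = 0.25) -> bool:
    # Flag significant shifts in the statistical properties of the data.
    return any(
        abs(df[col].mean() - base) / abs(base) > tolerance
        for col, base in BASELINE_STATS.items())

batch = pd.DataFrame({"age": [30, 45], "income": [50.0, 61.5]})
if schema_skew(batch) or values_skew(batch):
    print("skew detected: trigger retraining or stop the pipeline run")
```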

  • Metadata management: information about each execution of the ML pipeline is recorded to help with data and artifact lineage, reproducibility, and comparisons, and to debug errors and anomalies. Each time you execute the pipeline, the ML metadata store records details such as the pipeline and component versions that ran, the start and end times, the parameter arguments passed to the pipeline, and pointers to the artifacts produced by each step. A minimal sketch follows.
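The following sketch persists one row per pipeline run to SQLite; real setups use the metadata store bundled with a pipeline orchestrator, and the fields here are a small illustrative subset:

```python
# Minimal sketch of ML metadata recording: one row per pipeline run,
# persisted to SQLite. The fields are an illustrative subset of what a
# real ML metadata store captures.
import sqlite3
import time
import uuid

conn = sqlite3.connect("ml_metadata.db")
conn.execute("""CREATE TABLE IF NOT EXISTS runs
    (run_id TEXT, started REAL, ended REAL,
     params TEXT, model_uri TEXT, metric REAL)""")

def record_run(params: str, model_uri: str, metric: float, started: float) -> None:
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), started, time.time(), params, model_uri, metric))
    conn.commit()

started = time.time()
# ... pipeline steps would execute here ...
record_run("C=1.0", "models/churn/123", 0.91, started)  # hypothetical values
```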

  • ML pipeline triggers: you can automate the ML production pipeline to retrain the models with new data. Retraining can be triggered (a minimal sketch of trigger logic follows this list):

  • On-demand
  • On a schedule
  • On the availability of new training data
  • On model performance degradation
  • On significant changes in the data distributions
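A minimal sketch of trigger logic combining two of the conditions above (new-data volume and performance degradation); the thresholds are illustrative, and launching the pipeline is stubbed out:

```python
# Minimal sketch of retraining-trigger logic. Thresholds are
# illustrative, and the pipeline launch is stubbed out with a print.
def should_retrain(new_rows: int, live_accuracy: float,
                   min_new_rows: int = 10_000,
                   accuracy_floor: float = 0.85) -> bool:
    if new_rows >= min_new_rows:        # enough fresh training data arrived
        return True
    if live_accuracy < accuracy_floor:  # online performance degraded
        return True
    return False

if should_retrain(new_rows=25_000, live_accuracy=0.88):
    print("trigger: launch the training pipeline")  # stand-in for a real launch call
```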

Challenges

  1. You need to try out new ML ideas and rapidly deploy new implementations of the ML components, not only retrain on new data.
  2. A CI/CD setup is needed to automate the build, test, and deployment of ML pipelines.


MLOps level 2: CI/CD pipeline automation

Figure 4. CI/CD and automated ML pipeline.

This MLOps setup includes the following components:

  • Source control
  • Test and build services
  • Deployment services
  • Model registry
  • Feature store
  • ML metadata store
  • ML pipeline orchestrator

Stages of CI/CD automated ML pipelines


Figure 5. Stages of the CI/CD automated ML pipeline.


  1. Development and experimentation: you try out new ML algorithms and new modeling ideas, where the experiment steps are orchestrated. The output of this stage is the source code of the pipeline steps, which is pushed to a source repository.
  2. Pipeline continuous integration: source code is built and tested; the outputs of this stage are pipeline components (packages, executables, and artifacts).
  3. Pipeline continuous delivery: the output of this stage is a deployed pipeline with the new implementation of the model.
  4. Automated triggering: the pipeline is automatically executed in production, and the output of this stage is a trained model that is pushed to the model registry (a minimal registry sketch follows this list).
  5. Model continuous delivery: the trained model is served as a prediction service. The output of this stage is a deployed model prediction service.
  6. Monitoring: statistics on model performance are collected from live data. The output of this stage is a trigger to execute the pipeline or to start a new experiment cycle.
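To make stage 4's output concrete, here is a minimal sketch of registering a trained model so the model CD stage can pick it up. A local JSON file stands in for a real model registry service, and the model name, URI, and metric are hypothetical:

```python
# Minimal sketch of a model registry: stage 4 pushes each trained model
# here, and stage 5 (model continuous delivery) reads from it. A local
# JSON file stands in for a dedicated model registry service.
import json
import pathlib
import time

REGISTRY = pathlib.Path("model_registry.json")

def register_model(name: str, version: str, uri: str, metrics: dict) -> None:
    # Append a new version entry under the model's name.
    registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    registry.setdefault(name, []).append({
        "version": version, "uri": uri,
        "metrics": metrics, "registered_at": time.time()})
    REGISTRY.write_text(json.dumps(registry, indent=2))

# Hypothetical model name, artifact URI, and metric.
register_model("churn-classifier", "1.4.0", "models/churn/1.4.0", {"auc": 0.93})
```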


Reference - https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning


#mlops #ml #pipeline #flow #architecture #design #databasemanagement #data #database #machinelearning #deeplearning #ai #artificialintelligence
