MLOps - Production Models

Machine learning is an amazing tool for tackling some of the biggest problems we now face, but how do you release and maintain models that are constantly changing, growing in size, or even running in different 'run-modes'?

A 'typical' machine learning pipeline is by its nature an iterative cycle with many possible release candidates, upgrades and production models. The following is what I consider the lifecycle of a model.

Stages in this lifecycle can overlap, and the flow depends on the model's build/test/promote cycle (some pipelines run continuously and create 'snapshots' as releases).

Ingesting Data

Ingestion depends on the data, but most pipelines apply some level of validation, labelling, bias correction and anonymization. Typically the raw data is collected and then either labelled or moved into separate storage for the actual training.
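As a minimal sketch of the anonymization part of that step, the following redacts email addresses from raw records before they move into training storage (the paths, file format and redaction rule are all illustrative assumptions, not part of any real pipeline):

```shell
# Hypothetical anonymization pass: redact email addresses from raw records
# before copying them into the training storage. Paths and CSV layout are
# illustrative assumptions.
mkdir -p raw training
printf 'alice@example.com,cat\nbob@example.com,dog\n' > raw/records.csv   # sample raw data

for f in raw/*.csv; do
  # Replace anything that looks like an email address with a placeholder.
  sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/<redacted>/g' "$f" \
    > "training/$(basename "$f")"
done

cat training/records.csv
```

A real pipeline would of course need far more careful PII handling (names, IDs, free text), but the pattern of transforming raw storage into a separate training store is the same.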

Build Model

The model build can produce a release candidate either at the end of an iteration or as part of a 'checkpoint'. This candidate should be placed in the release candidate storage, and the version history used to determine what the next increment should be. The example above uses semantic versioning; obviously there are other schemes (datetime, SHA, UUID4).
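A minimal sketch of that version increment, assuming the release candidate history is a plain list of semantic versions in a file called versions.txt (an illustrative stand-in for whatever your storage actually records):

```shell
# Hypothetical version bump: find the latest semver in the release-candidate
# history and increment the patch number. versions.txt is an assumed format.
printf '1.43.11\n1.43.12\n1.43.13\n' > versions.txt   # sample version history

# Sort numerically on each dot-separated field to find the latest version.
latest=$(sort -t. -k1,1n -k2,2n -k3,3n versions.txt | tail -n 1)
major=$(echo "$latest" | cut -d. -f1)
minor=$(echo "$latest" | cut -d. -f2)
patch=$(echo "$latest" | cut -d. -f3)
next="${major}.${minor}.$((patch + 1))"

echo "next release candidate: model:$next"
```

In practice the major/minor parts would be bumped by whatever signals a breaking or feature-level change in the model; only the patch increment is automated here.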

Test Model

The model testing stage can be built into the training, or it can be a separate, external validation of the model. In either case, a successful test should promote the model into validation. This is a gatekeeping step at which a model can be deleted or allowed to expire (depending on your storage).
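That gatekeeping decision can be sketched as a simple promote-or-expire script. The accuracy value, threshold, file names and directory layout are all illustrative assumptions:

```shell
# Hypothetical gatekeeper: promote the candidate model into validation
# storage if its test accuracy meets a threshold, otherwise delete it.
mkdir -p candidate validation
echo "model-weights" > candidate/model.bin    # stand-in for a real model file

ACCURACY="0.93"    # assumed to come from the test run
THRESHOLD="0.90"   # assumed promotion threshold

# awk exits 0 when the comparison holds, making it usable in the if test.
if awk -v a="$ACCURACY" -v t="$THRESHOLD" 'BEGIN { exit !(a >= t) }'; then
  mv candidate/model.bin validation/model.bin   # promote to validation storage
  echo "promoted"
else
  rm -f candidate/model.bin                     # expire / delete the candidate
  echo "expired"
fi
```

Storage with built-in object expiry (TTL on the candidate bucket, for example) can replace the explicit delete.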

Validate Model

The validation stage could be combined with testing, but it is specifically less about accuracy or model improvements and more about the performance and size of the model. Ideally a performance/size threshold is set so that any model exceeding it is never promoted to production and will expire or be deleted. If the model is successful, the version information should be marked in some way to denote a new production model, and a copy placed into the production model storage. Ideally a dashboard or similar report on model performance should surface any sudden changes, or changes over time (which is important for production service level agreements etc.).
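The size half of that threshold check can be sketched as follows; the 500 MiB limit, file names and directory layout are illustrative assumptions (a real gate would also check latency/throughput numbers from a benchmark run):

```shell
# Hypothetical size gate: only copy a validated model into production
# storage if it is under a size limit. Limit and paths are assumptions.
mkdir -p validation production
head -c 1024 /dev/zero > validation/model.bin   # 1 KiB stand-in model

MAX_BYTES=$((500 * 1024 * 1024))                # assumed 500 MiB limit
SIZE=$(wc -c < validation/model.bin)

if [ "$SIZE" -le "$MAX_BYTES" ]; then
  cp validation/model.bin production/model.bin
  echo "promoted to production"
else
  echo "rejected: $SIZE bytes exceeds limit"    # model expires / is deleted
fi
```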

Release Model

This stage could simply be the ML app's release process, but if containers are used to ship the model data then they can be built here and released to production. The release process could include a staged release strategy or a blue/green style deployment; it is at this stage that any release management on the system(s) takes effect.

Model Performance

The performance of the model in the production systems should be continuously evaluated from a host/application metrics point of view. This helps with improvements to both the model engine and the model itself, and as the application matures you can start checking for edge cases or regressions in performance at scale. Feedback is a general catch-all for any customer/developer feedback on the production system, which can be gathered in any number of ways.

Example Deployments

The deploy.sh script in both examples checks for the existence of the model in the shared volume and, if it is not found, copies the model content from the container into it.
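A minimal sketch of what such a deploy.sh could look like — the source path inside the image and the model file name are illustrative assumptions (here local directories stand in for the image path and the /model shared volume so the sketch is self-contained):

```shell
#!/bin/sh
# Hypothetical deploy.sh: copy the model baked into the bootstrap container
# into the shared volume, but only if it is not already present there.
MODEL_SRC="./image-model"    # stand-in for the model path inside the image
MODEL_DEST="./model"         # stand-in for the /model shared volume mount

mkdir -p "$MODEL_SRC" "$MODEL_DEST"
echo "weights" > "$MODEL_SRC/model.bin"   # stand-in model file in the image

if [ ! -f "$MODEL_DEST/model.bin" ]; then
  cp -R "$MODEL_SRC"/. "$MODEL_DEST"/     # first deploy: copy the model over
  echo "model deployed to shared volume"
else
  echo "model already present, skipping copy"
fi
```

The existence check makes the bootstrap task idempotent, so a restarted pod or allocation does not re-copy a model that is already in place.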

Kubernetes

apiVersion: v1
kind: Pod
metadata:
  name: simple-ml-app
  labels:
    app: mlapp
spec:
  containers:
    - name: ml-application
      image: mlapp
      volumeMounts:
        - name: model
          mountPath: /model/
  initContainers:
    - name: model-container
      image: model:1.43.13
      command: ['sh', '-c', '/deploy.sh']
      volumeMounts:
        - name: model
          mountPath: /model/
  volumes:
    - name: model
      emptyDir: {}

Nomad

job "mlapp1" {
  datacenters = ["dc1"]
  type        = "service"

  group "mlapp" {
    volume "models" {
      type      = "host"
      source    = "models"
      read_only = false
    }

    task "ml-bootstrap" {
      driver = "docker"

      config {
        image   = "model:1.43.13"
        command = "sh"
        args    = ["-c", "/deploy.sh"]
      }

      resources {
        cpu    = 200
        memory = 128
      }

      volume_mount {
        volume      = "models"
        destination = "/model"
      }
    }

    task "ml-service" {
      driver = "docker"

      config {
        image   = "mlapp"
        command = "sh"
        args    = ["-c", "echo The service is running! && while true; do sleep 2; done"]
      }

      resources {
        cpu    = 200
        memory = 128
      }

      service {
        name = "mlapp1"
      }

      volume_mount {
        volume      = "models"
        destination = "/model"
      }
    }
  }
}

Hopefully this gives you some ideas on how a system for machine learning models can operate at scale without large human resource overheads. Obviously these steps could be part of a CI/CD pipeline, part of an ML pipeline, or both.
