Predictive Maintenance - A Machine Learning application in production

Business context 

I recently led a team commissioned to design, build, and deploy to production a predictive maintenance application that anticipates the breakdown of freight wagons over a 4-month time horizon. The aim of this article is to outline the technical solution we implemented, from the ingestion of the source data to the publication of the prediction results to end users.

Preamble 

I will focus only on the technical aspects of the project; the discussion of the Machine Learning approach and the algorithm will be the subject of another article.

All the components of the application are hosted on the Azure cloud for this project. We used three separate environments for development, testing, and production, as well as a GitLab repository with a classic versioning workflow.

Data workflow

The raw data come from an on-premises Oracle database hosted in our customer's information system. A couple of Apache NiFi jobs, scheduled on a daily or weekly basis according to the type of data being ingested, perform the extractions and write the CSV files to a Data Lake (see Fig. 1).


Fig. 1 A full view of the technical architecture, entirely hosted on the Cloud. 

We use two Python jobs, one for each ingestion frequency, running on the Databricks (DBKS) platform to read these files (see Fig. 2), transform their content, and write the output to a relational database. At this point, we essentially reshape the source data into a data model that fits our needs.
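For illustration, here is a minimal sketch of what one of these ingestion jobs could look like; the paths, column names, and JDBC settings are all hypothetical, not taken from the actual project:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the CSV extracts dropped by NiFi on the Data Lake.
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/mnt/datalake/raw/daily/"))

    # Reshape the source data into the model we need downstream,
    # e.g. one row per wagon per day (illustrative columns).
    transformed = (raw
                   .withColumn("event_date", F.to_date("event_ts"))
                   .select("wagon_id", "event_date", "mileage", "load_tons"))

    # Write the result to the relational database over JDBC.
    (transformed.write
     .format("jdbc")
     .option("url", "jdbc:sqlserver://<host>:1433;database=<db>")
     .option("dbtable", "dbo.wagon_events")
     .option("user", "<user>")
     .option("password", "<password>")
     .mode("append")
     .save())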

As for model training, a job is launched every week on DBKS to produce a new model, which is put into production automatically. At this stage, we leverage the MLflow Model Registry API to version and store every model.
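As a rough sketch of this step, assuming a hypothetical model name and synthetic stand-in features (the actual algorithm is out of scope here), the weekly job could look like this:

    import mlflow
    import mlflow.sklearn
    from mlflow.tracking import MlflowClient
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in features; the real job builds them from the database.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(X_train, y_train)

        # Log the model and register a new version in the MLflow Model Registry.
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="wagon-breakdown-predictor",  # hypothetical name
        )

    # Promote the freshly registered version to the Production stage.
    client = MlflowClient()
    latest = client.get_latest_versions("wagon-breakdown-predictor", stages=["None"])[0]
    client.transition_model_version_stage(
        "wagon-breakdown-predictor", latest.version, stage="Production"
    )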

Fig. 2 Five different jobs run on Databricks, each one assigned a specific task.

For each newly trained model, we track its performance metrics and a plot of the calibration curve in order to have a first level of monitoring. Thanks to the fully managed integration of MLflow on DBKS, all this information is readily accessible from the user interface.
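Continuing the training sketch above (still inside the mlflow.start_run() context, reusing model, X_test, and y_test), this monitoring boils down to a few MLflow calls; the metric choice is illustrative:

    import matplotlib.pyplot as plt
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import roc_auc_score

    proba = model.predict_proba(X_test)[:, 1]
    mlflow.log_metric("roc_auc", roc_auc_score(y_test, proba))

    # Plot the calibration curve and attach it to the run as an artifact.
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    fig, ax = plt.subplots()
    ax.plot(mean_pred, frac_pos, marker="o", label="model")
    ax.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Fraction of positives")
    ax.legend()
    fig.savefig("calibration_curve.png")
    mlflow.log_artifact("calibration_curve.png")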

For each wagon, the algorithm returns a prediction score between 0 and 1, which can be equated to a breakdown probability (according to the corresponding calibration curve). The prediction job loads the latest registered version of the model (tagged as "in Production") via MLflow, scores the whole fleet of supervised wagons, and writes the results to a database table.
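A minimal sketch of that scoring job, keeping the hypothetical registry name from above; "models:/<name>/Production" is the standard MLflow URI for the version currently in the Production stage, and the connection details are placeholders:

    import mlflow.pyfunc
    import pandas as pd
    from sqlalchemy import create_engine

    # Load the latest version tagged "Production" from the Model Registry.
    model = mlflow.pyfunc.load_model("models:/wagon-breakdown-predictor/Production")

    # Score the whole fleet of supervised wagons (features read from the database).
    engine = create_engine("mssql+pyodbc://<user>:<password>@<dsn>")  # hypothetical connection
    fleet = pd.read_sql("SELECT * FROM wagon_features", engine)
    fleet["breakdown_score"] = model.predict(fleet.drop(columns=["wagon_id"]))

    # Write the results to the predictions table.
    fleet[["wagon_id", "breakdown_score"]].to_sql(
        "predictions", con=engine, if_exists="append", index=False
    )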

Note: a simple improvement, requiring little technical effort, would be to add a step that automatically decides whether or not to put a newly trained model into production, based on a set of criteria (some business rules should also be applied to account for external circumstances that may affect model performance).
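One possible shape for such a gate, sketched with an illustrative metric and an arbitrary tolerance (the business rules would be layered on top): compare the candidate's tracked metric to the current Production model's, and promote only if it clears the bar.

    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    name = "wagon-breakdown-predictor"  # hypothetical name from above

    def run_metric(version, metric="roc_auc"):
        # Fetch the metric logged on the run that produced this model version.
        run = client.get_run(version.run_id)
        return run.data.metrics.get(metric, float("-inf"))

    candidate = client.get_latest_versions(name, stages=["None"])[0]
    production = client.get_latest_versions(name, stages=["Production"])

    # Promote only if the candidate is at least as good as (or close to) the
    # current Production model; 0.01 is an arbitrary tolerance.
    if not production or run_metric(candidate) >= run_metric(production[0]) - 0.01:
        client.transition_model_version_stage(
            name, candidate.version, stage="Production",
            archive_existing_versions=True,
        )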

Once the prediction tables contain the latest results, another job enriches them with contextual data, for instance the last known geographical location of each wagon. All this information is rendered in a Power BI report (see Fig. 3), built by one of our team members and published by the customer administrator on their own workspace.
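The enrichment itself is essentially a join; here is a sketch with made-up table and column names, reusing the hypothetical engine from the scoring step:

    import pandas as pd

    # Join the fresh predictions with the last known location of each wagon.
    enriched = pd.read_sql(
        """
        SELECT p.wagon_id,
               p.breakdown_score,
               l.last_known_location,
               l.last_seen_at
        FROM predictions p
        LEFT JOIN wagon_locations l ON l.wagon_id = p.wagon_id
        """,
        engine,
    )
    enriched.to_sql("predictions_enriched", con=engine, if_exists="replace", index=False)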

Fig. 3 The Power BI report contains the predictions and contextual information for each wagon.

Running the application 

Although DBKS offers scheduling functionality, we used Rundeck for this task due to organizational constraints.

The way DBKS integrates with Rundeck in our case is the following: jobs are created beforehand on DBKS (see below for more details). For each of these jobs, a Python script, parametrized with the job name, runs on Rundeck at the scheduled times. This script retrieves the list of all existing jobs on Databricks through the DBKS REST API, filters it to find the matching job, and extracts its id. It then passes this id in the API call that launches the job.
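The script boils down to two calls to the standard Databricks Jobs REST API (version 2.0 here); the workspace URL and token are placeholders:

    import sys
    import requests

    HOST = "https://<workspace>.azuredatabricks.net"
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}

    def launch_job(job_name: str) -> None:
        # List all jobs and keep the one whose name matches.
        jobs = requests.get(f"{HOST}/api/2.0/jobs/list", headers=HEADERS).json().get("jobs", [])
        matches = [j for j in jobs if j["settings"]["name"] == job_name]
        if len(matches) != 1:
            raise RuntimeError(f"Expected one job named {job_name!r}, found {len(matches)}")

        # Launch the job by id.
        resp = requests.post(
            f"{HOST}/api/2.0/jobs/run-now",
            headers=HEADERS,
            json={"job_id": matches[0]["job_id"]},
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        launch_job(sys.argv[1])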

Fig. 4 Jobs are created beforehand on Databricks and launched via Rundeck at scheduled times 

Doing so brings a big advantage: if for some reason we recreate a given job on DBKS, which would assign it a new id, we don't need to reconfigure Rundeck (our team does not actually operate Rundeck). Of course, we must make sure that job names are unique on DBKS.

All application logs are written to the standard output and are therefore easily accessible on the DBKS User Interface. 

In case a job fails, DBKS sends an email to the support team. A link to the job (and to the logs) is included in the mail. 

Deployment 

In terms of deployment, our CI runs entirely on Jenkins and includes: 

  • A pipeline that builds the wheel package and pushes it to DBKS as part of an automated release. 
  • A pipeline that deploys the entry point, which is also a Python script. 
  • A pipeline that creates the jobs on DBKS (see the sketch below). Note that jobs are created by Jenkins but launched via Rundeck.  
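For the last pipeline, the job-creation step essentially posts a settings payload to the Jobs API. Here is a sketch with an illustrative cluster spec, package path, and notification address; this is also where the failure email mentioned above is configured:

    import requests

    HOST = "https://<workspace>.azuredatabricks.net"
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}

    job_settings = {
        "name": "predict-weekly",  # job names must stay unique (see above)
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
            # Init script that installs the ODBC driver (see below).
            "init_scripts": [{"dbfs": {"destination": "dbfs:/init/install_odbc.sh"}}],
        },
        "libraries": [{"whl": "dbfs:/wheels/predictive_maintenance-py3-none-any.whl"}],
        "spark_python_task": {"python_file": "dbfs:/entrypoints/predict.py"},
        "email_notifications": {"on_failure": ["support-team@example.com"]},
    }

    resp = requests.post(f"{HOST}/api/2.0/jobs/create", headers=HEADERS, json=job_settings)
    resp.raise_for_status()
    print("Created job", resp.json()["job_id"])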

When creating the cluster, we use an init script to install the ODBC driver through an apt-get command.

Team 

As you probably guessed, I did not do all this work alone. A good part of it was done by a truly engaged team (Thomas Elie, Arnaud Capitaine, and Amine Souiki) that worked hard and smart. Each team member was assigned a specific part of the project, essentially divided into the data engineering part, the data science elements, and the construction of the visualizations and reports.

I would like to thank Romain Gouron for proofreading the manuscript and for his useful remarks.

PS: This project was entirely executed during the lockdown period in March 2020, so the whole team worked from home.
