Predictive Maintenance - A Machine Learning application in production

Business context 

I recently led a team commissioned to design, build, and deploy to production a predictive maintenance application that anticipates the breakdown of freight wagons over a 4-month time horizon. The aim of this article is to outline the technical solution we implemented, from the ingestion of the source data to the publication of the prediction results to end users.

Preamble 

I will focus only on the technical aspects of the project; the discussion of the Machine Learning approach and the algorithm will be the subject of another article.

All the components of the application are hosted on the Azure cloud for this project. We used three separate environments for development, testing, and production, as well as a GitLab repository with a classic versioning workflow.

Data workflow

The raw data come from an on-premises Oracle database hosted in our customer's information system. A couple of Apache NiFi jobs, scheduled on a daily or weekly basis according to the type of data being ingested, perform the extractions and write the CSV files to a Data Lake (see Fig. 1).


Fig. 1 A full view of the technical architecture, entirely hosted on the Cloud. 

We use two Python jobs, one for each ingestion frequency, running on the Databricks (DBKS) platform to read these files (see Fig. 2), transform their content, and write the output to a relational database. At this point, we essentially reshape the source data into a data model that fits our needs.
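For illustration, here is a minimal sketch of what one of these ingestion jobs could look like; the paths, column names, and JDBC settings are all hypothetical, not taken from the actual project:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the CSV extracts dropped by NiFi on the Data Lake.
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/mnt/datalake/raw/daily/"))

    # Reshape the source data into the model we need downstream,
    # e.g. one row per wagon per day (illustrative columns).
    transformed = (raw
                   .withColumn("event_date", F.to_date("event_ts"))
                   .select("wagon_id", "event_date", "mileage", "load_tons"))

    # Write the result to the relational database over JDBC.
    (transformed.write
     .format("jdbc")
     .option("url", "jdbc:sqlserver://<host>:1433;database=<db>")
     .option("dbtable", "dbo.wagon_events")
     .option("user", "<user>")
     .option("password", "<password>")
     .mode("append")
     .save())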

As for model training, a job is launched every week on DBKS to produce a new model, which is put into production automatically. At this stage, we leverage the MLflow Model Registry API to version and store every model.
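As a rough sketch of this step, assuming a hypothetical model name and synthetic stand-in features (the actual algorithm is out of scope here), the weekly job could look like this:

    import mlflow
    import mlflow.sklearn
    from mlflow.tracking import MlflowClient
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in features; the real job builds them from the database.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(X_train, y_train)

        # Log the model and register a new version in the MLflow Model Registry.
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="wagon-breakdown-predictor",  # hypothetical name
        )

    # Promote the freshly registered version to the Production stage.
    client = MlflowClient()
    latest = client.get_latest_versions("wagon-breakdown-predictor", stages=["None"])[0]
    client.transition_model_version_stage(
        "wagon-breakdown-predictor", latest.version, stage="Production"
    )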

Fig. 2 Five different jobs run on Databricks, each one assigned a specific task.

For each newly trained model, we track its performance metrics and a plot of the calibration curve in order to have a first level of monitoring. Thanks to the fully managed integration of MLflow on DBKS, all this information is readily accessible from the user interface.
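Continuing the training sketch above (still inside the mlflow.start_run() context, reusing model, X_test, and y_test), this monitoring boils down to a few MLflow calls; the metric choice is illustrative:

    import matplotlib.pyplot as plt
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import roc_auc_score

    proba = model.predict_proba(X_test)[:, 1]
    mlflow.log_metric("roc_auc", roc_auc_score(y_test, proba))

    # Plot the calibration curve and attach it to the run as an artifact.
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    fig, ax = plt.subplots()
    ax.plot(mean_pred, frac_pos, marker="o", label="model")
    ax.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Fraction of positives")
    ax.legend()
    fig.savefig("calibration_curve.png")
    mlflow.log_artifact("calibration_curve.png")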

For each wagon, the algorithm returns a prediction score between 0 and 1, which can be equated to a breakdown probability (according to the corresponding calibration curve). The prediction job loads the latest registered version of the model (tagged as "in Production") via MLflow, scores the whole fleet of supervised wagons, and writes the results to a database table.
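A minimal sketch of that scoring job, keeping the hypothetical registry name from above; "models:/<name>/Production" is the standard MLflow URI for the version currently in the Production stage, and the connection details are placeholders:

    import mlflow.pyfunc
    import pandas as pd
    from sqlalchemy import create_engine

    # Load the latest version tagged "Production" from the Model Registry.
    model = mlflow.pyfunc.load_model("models:/wagon-breakdown-predictor/Production")

    # Score the whole fleet of supervised wagons (features read from the database).
    engine = create_engine("mssql+pyodbc://<user>:<password>@<dsn>")  # hypothetical connection
    fleet = pd.read_sql("SELECT * FROM wagon_features", engine)
    fleet["breakdown_score"] = model.predict(fleet.drop(columns=["wagon_id"]))

    # Write the results to the predictions table.
    fleet[["wagon_id", "breakdown_score"]].to_sql(
        "predictions", con=engine, if_exists="append", index=False
    )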

Note: a simple improvement, requiring little technical effort, would be to add a step that automatically decides whether or not to put a newly trained model into production, based on a set of criteria (some business rules should also be applied to account for external circumstances that may affect model performance).
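One possible shape for such a gate, sketched with an illustrative metric and an arbitrary tolerance (the business rules would be layered on top): compare the candidate's tracked metric to the current Production model's, and promote only if it clears the bar.

    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    name = "wagon-breakdown-predictor"  # hypothetical name from above

    def run_metric(version, metric="roc_auc"):
        # Fetch the metric logged on the run that produced this model version.
        run = client.get_run(version.run_id)
        return run.data.metrics.get(metric, float("-inf"))

    candidate = client.get_latest_versions(name, stages=["None"])[0]
    production = client.get_latest_versions(name, stages=["Production"])

    # Promote only if the candidate is at least as good as (or close to) the
    # current Production model; 0.01 is an arbitrary tolerance.
    if not production or run_metric(candidate) >= run_metric(production[0]) - 0.01:
        client.transition_model_version_stage(
            name, candidate.version, stage="Production",
            archive_existing_versions=True,
        )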

Once the prediction tables contain the latest results, another job enriches them with contextual data, for instance the last known geographical location of each wagon. All this information is rendered in a Power BI report (see Fig. 3), built by one of our team members and published by the customer administrator on their own workspace.
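The enrichment itself is essentially a join; here is a sketch with made-up table and column names, reusing the hypothetical engine from the scoring step:

    import pandas as pd

    # Join the fresh predictions with the last known location of each wagon.
    enriched = pd.read_sql(
        """
        SELECT p.wagon_id,
               p.breakdown_score,
               l.last_known_location,
               l.last_seen_at
        FROM predictions p
        LEFT JOIN wagon_locations l ON l.wagon_id = p.wagon_id
        """,
        engine,
    )
    enriched.to_sql("predictions_enriched", con=engine, if_exists="replace", index=False)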

Fig. 3 The Power BI report contains the predictions and contextual information for each wagon.

Running the application 

Although DBKS offers scheduling functionality, we used Rundeck for this task due to organizational constraints.

The way DBKS integrates with Rundeck in our case is the following: jobs are created beforehand on DBKS (see below for more details). For each of these jobs, a Python script, parametrized with the job name, runs on Rundeck at the scheduled times. This script retrieves the list of all existing jobs on Databricks through the DBKS REST API, filters it to find the matching job, and extracts its id. It then passes this id in the API call that launches the job.
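The script boils down to two calls to the standard Databricks Jobs REST API (version 2.0 here); the workspace URL and token are placeholders:

    import sys
    import requests

    HOST = "https://<workspace>.azuredatabricks.net"
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}

    def launch_job(job_name: str) -> None:
        # List all jobs and keep the one whose name matches.
        jobs = requests.get(f"{HOST}/api/2.0/jobs/list", headers=HEADERS).json().get("jobs", [])
        matches = [j for j in jobs if j["settings"]["name"] == job_name]
        if len(matches) != 1:
            raise RuntimeError(f"Expected one job named {job_name!r}, found {len(matches)}")

        # Launch the job by id.
        resp = requests.post(
            f"{HOST}/api/2.0/jobs/run-now",
            headers=HEADERS,
            json={"job_id": matches[0]["job_id"]},
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        launch_job(sys.argv[1])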

Fig. 4 Jobs are created beforehand on Databricks and launched via Rundeck at scheduled times 

Doing so brings a big advantage: if for some reason we recreate a given job on DBKS, which would assign it a new id, we don't need to reconfigure Rundeck (our team does not actually operate Rundeck). Of course, we must make sure that job names are unique on DBKS.

All application logs are written to the standard output and are therefore easily accessible on the DBKS User Interface. 

In case a job fails, DBKS sends an email to the support team. A link to the job (and to the logs) is included in the mail. 

Deployment 

In terms of deployment, our CI runs entirely on Jenkins and includes: 

  • A pipeline that builds the wheel package and pushes it to DBKS as part of an automated release. 
  • A pipeline that deploys the entry point, which is also a Python script. 
  • A pipeline that creates the jobs on DBKS (see the sketch below). Note that jobs are created by Jenkins but launched via Rundeck.  
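For the last pipeline, the job-creation step essentially posts a settings payload to the Jobs API. Here is a sketch with an illustrative cluster spec, package path, and notification address; this is also where the failure email mentioned above is configured:

    import requests

    HOST = "https://<workspace>.azuredatabricks.net"
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}

    job_settings = {
        "name": "predict-weekly",  # job names must stay unique (see above)
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
            # Init script that installs the ODBC driver (see below).
            "init_scripts": [{"dbfs": {"destination": "dbfs:/init/install_odbc.sh"}}],
        },
        "libraries": [{"whl": "dbfs:/wheels/predictive_maintenance-py3-none-any.whl"}],
        "spark_python_task": {"python_file": "dbfs:/entrypoints/predict.py"},
        "email_notifications": {"on_failure": ["support-team@example.com"]},
    }

    resp = requests.post(f"{HOST}/api/2.0/jobs/create", headers=HEADERS, json=job_settings)
    resp.raise_for_status()
    print("Created job", resp.json()["job_id"])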

When creating the cluster, we use an init script to install the ODBC driver through an apt-get command.

Team 

As you probably guessed, I did not do all this work alone. A good part of it was done by a truly engaged team (Thomas Elie, Arnaud Capitaine, and Amine Souiki) that worked hard and smart. Each team member was assigned a specific part of the project, essentially divided into the data engineering part, the data science elements, and the construction of the visualizations and reports.

I would like to thank Romain Gouron for proofreading the manuscript and for his useful remarks.

PS: This project was entirely executed during the lockdown period in March 2020, so the whole team worked from home.
