Machine Learning Project Architecture
Everyone believes that Machine Learning is going to change the entire landscape of technology and infrastructure we live on. So it is no surprise that every company wants to adopt this technology and prosper. But how many companies are really succeeding in this journey?
Welcome to my article. Here I am trying to explain the evolution of machine learning architecture patterns.
In any typical ML project there are three major roles. Data Engineers are responsible for complete data management in the project. Data Scientists are responsible for understanding and analysing the data for the business problem. Machine Learning Engineers are responsible for the design and implementation of a model that works for the business.
The ML engineer is responsible for setting up the whole body around the designed model, where the model is like a brain designed from the analysis provided by the Data Scientist, and the sensory perceptions are like the external data provided by the Data Engineer.
Generally a project starts with data: we have data and we want to solve a specific business problem. For this purpose we collect the data from different sources, analyse it, build a model and serve it through an interface. So far so good. But do we need to follow any architecture pattern for this project? Yes, because over the past two to three decades we saw how software architectures evolved: Client-Server, SOA, REST, Microservices, Cloud. So there must be an architecture for ML projects too.
Basic Architecture
Above is the basic architecture of an ML project for a prediction task. In the experimental (development) phase of the project we extract the data and analyse it. Once analysis is done, we prepare the data for model training. With this data we train a model, then evaluate and validate it. Once the model is ready, we store it in a registry (here a registry is a kind of repo to store models). And once the model is finalised, we deploy it to production with an interface to serve. All good. We get predictions with business-acceptable results. Everyone is happy.
But are these results consistent? Because we know how data is changing nowadays. I guess I don't need to pull up the stats on how much data is generated every minute and how many types of data are evolving in the real world. So what about our model? Can it cope with these changes in data? Maybe not, because our model was trained on experimental or simulated data and then rolled out. Our assumption is that production data will be the same (at least schema-wise and value-wise) as the experimental data, with no major changes. If any change happens in the data, we can take the model and train it again with that new data.
Data is changing continuously, so your model also needs to be trained continuously
OK, agreed. Actually, here we need to understand something about how a model learns from data and makes predictions. Training a model involves understanding the statistical properties of data collected over a period of time, so that, using statistical algorithms, the model can predict the next outcome in that data. So the model depends completely on statistical properties, as in time-series forecasting. But what happens when these statistical properties change significantly over a short interval of time? Model predictions can go wrong, which affects the business. This is called model deterioration. For example, take a phishing predictor: we see how many new phishing techniques come from different hackers daily. So in order to adapt to these changes, our model also needs to be trained continuously. Here comes the modern architecture.
Modern Architecture
The maturity of an ML project is the level of automation we apply to the above architecture. The advantage of mature projects is loose coupling, which facilitates fast, flexible and accurate changes in the project.
Yeah, I know it’s huge, but bear with me. Let’s first understand why we need this kind of setup.
We have heard about DevOps CI/CD for software projects: continuous integration and continuous delivery/deployment. We integrate software development with continuous delivery. Why? Because the business is changing rapidly, so we need to deliver product changes rapidly. In order to minimise the gap between development and operations, we adopted DevOps. Good. Since we are also facing model deterioration because of changes in data, can we apply DevOps to an ML project? Yes, but it is not going to be CI/CD; it is going to be CI/CD/CT, where CT means continuous training. The model needs training along with development, which is not required in traditional software products.
There are two major parts in this diagram: Experimentation/Dev/Test and Staging/Pre-Prod/Prod. Here a pipeline is the complete set of steps we perform in model building, like data extraction/analysis/validation/preparation and model building/evaluation/validation. The steps of experimentation are modularised and orchestrated, and the transitions between steps are automated. If you read the modern architecture diagram from the left, you see the following important components.
Feature store
In a large organisation we get data from different sources, not just from one database or FTP source. We need to collect data from different departments. In these cases we generally build a feature store. It is a complete repository of features (not just raw data) derived from data across the whole organisation. These features are decided globally by data architects, so that we don’t need to rebuild a feature in every ML project; we can just query the feature store to get it. Generally a feature store provides an API to extract feature data.
- For experimentation, data scientists can get an offline extract from the feature store to run their experiments.
- For continuous training, the automated ML training pipeline can fetch a batch of the up-to-date feature values of the dataset that are used for the training task.
- For online prediction, the prediction service can fetch a batch of the feature values related to the requested entity, such as customer demographic features, product features, and current session aggregation features.
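The three access patterns above can be sketched with a toy in-memory feature store. This is a minimal illustration, not a real product: actual feature stores add versioning, point-in-time correctness, and low-latency serving backends, and all class and method names here are my own invention.

```python
class FeatureStore:
    """Toy in-memory feature store: feature values keyed by entity id."""

    def __init__(self):
        self._rows = {}  # entity_id -> {feature_name: value}

    def ingest(self, entity_id, features):
        # Data engineers populate features computed across the organisation.
        self._rows.setdefault(entity_id, {}).update(features)

    def offline_extract(self, feature_names):
        # Batch extract: used for experimentation and continuous training.
        return [
            {"entity_id": eid, **{f: row.get(f) for f in feature_names}}
            for eid, row in self._rows.items()
        ]

    def online_lookup(self, entity_id, feature_names):
        # Per-entity fetch: used by the prediction service at request time.
        row = self._rows.get(entity_id, {})
        return {f: row.get(f) for f in feature_names}


store = FeatureStore()
store.ingest("cust-1", {"age": 34, "avg_basket": 52.0})
store.ingest("cust-2", {"age": 27, "avg_basket": 18.5})

training_batch = store.offline_extract(["age", "avg_basket"])  # for training
online_features = store.online_lookup("cust-1", ["age"])       # for serving
```

The key point is that both the training pipeline and the prediction service read from the same store, so training and serving see features computed the same way.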
Experimentation and Automated Pipelines
These are essentially Python packages containing different modules. Below are the steps involved in the pipelines.
- Data extraction: You select and integrate the relevant data from various data sources for the ML task. If you have a feature store, you can directly extract from it.
- Data analysis: You perform exploratory data analysis (EDA) to understand the available data for building the ML model. This process leads to the following:
Understanding the data schema and characteristics that are expected by the model.
Identifying the data preparation and feature engineering that are needed for the model.
- Data validation: Generally this phase should come before model training. Validation here means we check for issues in the schema (issues in the data schema are called data schema skews) and in the values (issues in data values are called data value skews).
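A minimal sketch of what such a validation step might check, assuming incoming rows are plain dicts (the field names and reference schema here are made up for illustration):

```python
def check_skews(reference_schema, batch):
    """Flag rows whose fields (schema skew) or value types (value skew)
    differ from the reference schema seen during experimentation."""
    issues = []
    for i, row in enumerate(batch):
        if set(row) != set(reference_schema):
            issues.append((i, "schema skew: unexpected or missing fields"))
            continue
        for field, expected_type in reference_schema.items():
            if row[field] is not None and not isinstance(row[field], expected_type):
                issues.append((i, f"value skew in '{field}'"))
    return issues


reference = {"age": int, "income": float}
batch = [
    {"age": 41, "income": 55000.0},    # clean row
    {"age": "41", "income": 55000.0},  # value skew: age arrived as a string
    {"age": 29},                       # schema skew: 'income' is missing
]
problems = check_skews(reference, batch)
```

In a real pipeline the reference schema and value statistics would themselves be artefacts produced during data analysis, not hard-coded.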
- Data preparation: Here the data is prepared for the ML task. This preparation involves data cleaning, and you split the data into training, validation, and test sets. You also apply the data transformations and feature engineering (identified during data analysis) for the model that solves the target task. The output of this step is the data splits in the prepared format.
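The splitting part of this step can be sketched in a few lines of plain Python. The fractions and seed below are arbitrary choices for illustration, not recommendations:

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle deterministically, then split into train/validation/test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps splits reproducible
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


data = [{"x": i, "y": i % 2} for i in range(100)]
train, val, test = train_val_test_split(data)
```

Making the split deterministic matters in an automated pipeline: a re-run of the same pipeline version on the same data should produce the same splits.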
- Model training: The data scientist implements different algorithms with the prepared data to train various ML models. In addition, you subject the implemented algorithms to hyperparameter tuning to get the best-performing ML model. The output of this step is a trained model.
- Model evaluation: The model is evaluated on a holdout test set to evaluate the model quality. The output of this step is a set of metrics to assess the quality of the model.
- Model validation: The model is confirmed to be adequate for deployment, i.e. its predictive performance is better than a certain baseline. Generally we compare the model's performance with that of previous models.
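The evaluation and validation steps can be sketched as a pair of small functions: one computes a quality metric on the holdout set, the other gates promotion against the baseline. Accuracy and the numbers used here are illustrative placeholders:

```python
def accuracy(y_true, y_pred):
    """Fraction of holdout-set predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def validate_model(candidate_metric, baseline_metric, min_improvement=0.0):
    """Promote the candidate only if it meets or beats the current
    production model's recorded metric (plus an optional margin)."""
    return candidate_metric >= baseline_metric + min_improvement


y_true = [1, 0, 1, 1, 0]
y_candidate = [1, 0, 1, 0, 0]            # 4 of 5 predictions correct
candidate_acc = accuracy(y_true, y_candidate)
promote = validate_model(candidate_acc, baseline_metric=0.75)
```

In the modern architecture the baseline metric would be read from the ML metadata store rather than passed in by hand.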
- Model serving: The validated model is deployed to a target environment to serve predictions. This deployment can be any of the following:
- Microservices with a REST API to serve online predictions.
- An embedded model to an edge or mobile device.
- Part of a batch prediction system.
Deciding model deployment depends on the business process.
Batch: processing historical data over time. An example is credit-scoring customers on a nightly or quarterly cycle. Such daily or quarterly jobs need large processing power, for example Spark ML. You have a timeline to meet, so decide the infrastructure accordingly.
Near real time: speed matters more than processing volume. An example is capturing events and feeding a dashboard that refreshes at a specific interval.
Real time: run the model as soon as data arrives. The application needs to respond within milliseconds. Examples are credit card swipes and intruder detection in networks.
Edge: deployment on devices like a Raspberry Pi, which have less computing power and so need efficient code.
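Whatever the deployment mode, the same trained model is just wrapped differently. A minimal sketch, with a stand-in scoring rule invented for illustration:

```python
def model_predict(features):
    """Stand-in for a trained model: score one feature dict.
    A real system would load the model from the registry instead."""
    return 1 if features["amount"] > 100 else 0

def online_predict(features):
    # Real-time mode: score a single request as it arrives,
    # e.g. behind a REST endpoint handling a card swipe.
    return model_predict(features)

def batch_predict(rows):
    # Batch mode: score accumulated historical records on a schedule,
    # e.g. a nightly credit-scoring job.
    return [model_predict(r) for r in rows]


single = online_predict({"amount": 250})
scores = batch_predict([{"amount": 50}, {"amount": 250}, {"amount": 99}])
```

The model logic is identical in both paths; what changes is the surrounding infrastructure (request handler vs. scheduled job), which is exactly why the deployment decision is driven by the business process rather than by the model itself.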
- Model monitoring: The model's predictive performance is monitored to potentially invoke a new iteration in the ML process.
ML Metadata Store
This is a kind of log management. Whenever you run the production pipeline, it stores the following data, which helps you track errors and the lineage of artefacts.
- The pipeline and component versions that were executed.
- The start and end date, time, and how long the pipeline took to complete each of the steps.
- The executor of the pipeline.
- The parameter arguments that were passed to the pipeline.
- The pointers to the artefacts produced by each step of the pipeline, such as the location of prepared data, validation anomalies, computed statistics, and extracted vocabulary from the categorical features. Tracking these intermediate outputs helps you resume the pipeline from the most recent step if the pipeline stopped due to a failed step, without having to re-execute the steps that have already completed.
- A pointer to the previous trained model if you need to roll back to a previous model version or if you need to produce evaluation metrics for a previous model version when the pipeline is given new test data during the model validation step.
- The model evaluation metrics produced during the model evaluation step for both the training and the testing sets. These metrics help you compare the performance of a newly trained model to the recorded performance of the previous model during the model validation step.
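A metadata store like the one described above can be sketched with a single SQLite table. The schema, run ids, paths, and metric values below are all invented for illustration; real systems (such as the metadata stores bundled with MLflow or Kubeflow) track far richer lineage:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipeline_runs (
        run_id       TEXT,
        step         TEXT,
        started_at   REAL,
        ended_at     REAL,
        params       TEXT,
        artifact_uri TEXT,   -- pointer to the step's output artefact
        metric       REAL    -- evaluation metric, if the step produced one
    )
""")

def log_step(run_id, step, started_at, ended_at, params, artifact_uri, metric=None):
    conn.execute(
        "INSERT INTO pipeline_runs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, step, started_at, ended_at, params, artifact_uri, metric),
    )

t0 = time.time()
log_step("run-001", "data_preparation", t0, t0 + 12.5,
         params="split=70/15/15", artifact_uri="/artifacts/run-001/prepared/")
log_step("run-001", "model_evaluation", t0 + 13.0, t0 + 90.0,
         params="threshold=0.85", artifact_uri="/artifacts/run-001/model/",
         metric=0.91)

steps = conn.execute(
    "SELECT step, metric FROM pipeline_runs WHERE run_id = ? ORDER BY started_at",
    ("run-001",),
).fetchall()
```

Because each step records its artefact pointer and timings, a failed pipeline can resume from the last completed step, and model validation can look up the previous model's metric for comparison.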
Key points in modern architecture
- There are two pipelines: one for experimentation (manual) and another for production (automated). You may wonder why we set up a complete pipeline in the production environment. This solves our first problem, the difference between experimental data and production data. The automated pipeline running in the production environment runs against production data, meaning our model gets trained on production data; that is continuous training. Experimentation is for designing the ML steps, and we use this design in production to build the model.
- There are no data analysis and model analysis phases in the production pipeline. Why? Because we do these analyses before deciding on the features and model types; only based on these analyses do we move forward to the critical feature transformations. These phases are time-consuming and require discussions among data scientists, so we cannot automate them.
- There is no data extraction module in the experimental phase. We use offline extracts from the feature store, meaning we pull data whenever we want it; it is not automated. But since our pipeline in production is automated, we need an extraction module there.
- There is no ML metadata store for experimentation. Why? The ML metadata store is a kind of log that records every metric, parameter and timing of the production pipeline, which are used for investigating issues and improvements later.
- The feature store is common to both the experimentation and production pipelines. This ensures no unexpected data enters production to give surprises.
- The deployment code for experimentation and production is the same, except for extra/missing modules in the pipeline. This ensures that the steps we settled on during feature engineering and hyperparameter tuning in the experimentation phase are the same ones used for training in the production environment.
- There is no model registry for the experimental phase; it exists only for the production phase. We want to keep track of models for later use in production, and it makes it easy to roll back models in case of issues.
- There is a feedback loop from the prediction service to the performance-monitoring phase. This collects the prediction service's results and the model's accuracy and provides them to the monitoring phase. The monitoring phase analyses model performance and triggers alerts for the production and experimental pipelines. Generally we set a threshold on the model's deterioration. For example, if the model's predictions are not doing well, monitoring triggers an alert to run the production pipeline so the model learns the changes in the data. If the model is doing much worse or crashing because of skews in the data, it triggers an alert for the experimental phase, which involves some kind of email service, so that data scientists re-iterate the experiment.
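The two-threshold feedback loop described in the last point can be sketched as a small decision function. The threshold values and action names here are illustrative assumptions, not prescriptions:

```python
def monitor(recent_accuracies, retrain_threshold=0.85, alert_threshold=0.70):
    """Map rolling model accuracy to an action for the feedback loop:
    - below alert_threshold: severe deterioration, alert the data
      scientists to re-run experimentation (e.g. via an email service);
    - below retrain_threshold: moderate drift, trigger the automated
      production pipeline to retrain on fresh data;
    - otherwise: no action."""
    rolling = sum(recent_accuracies) / len(recent_accuracies)
    if rolling < alert_threshold:
        return "alert_data_scientists"
    if rolling < retrain_threshold:
        return "trigger_training_pipeline"
    return "ok"


action_ok = monitor([0.92, 0.90, 0.91])       # healthy model
action_retrain = monitor([0.84, 0.80, 0.82])  # drifting: retrain
action_alert = monitor([0.60, 0.55, 0.65])    # broken: back to experimentation
```

A production system would compute the rolling metric from logged predictions joined with delayed ground-truth labels, but the branching logic is the essence of the feedback loop.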
Advanced Architecture
OK, so far so good. But what are the open source tools available for these phases? Everyone loves open source, correct?
Tools
Below is a breakdown of open source tools for each phase.
Tools for each phase
OK. Are there any open source frameworks that support this kind of architecture (or at least most of the phases)?
ML frameworks
- MLFlow
- Kubeflow
There are few mature architecture patterns for ML projects, as it is still an evolving field. The pattern explained above is one of them. As a machine learning engineer, it is really important to understand the complete ML project architecture, and it is worth implementing all the phases.
Thank you.