Lessons Learned from a Machine Learning API Deployment

Despite having seen hundreds of similar posts, I still believe every Machine Learning API in production has a different story and its own lessons learned. In this article, I would like to share my main takeaways from our recent deployment. I won't go into details like how to set up Jenkins jobs or which Machine Learning libraries to use; instead, I want to elaborate on how to ensure data integrity, identify data skew and set up proper monitoring in production.

Trust nobody but yourself

The header might seem aggressive, but the point is this: do not rely 100% on any data source. During the development stages, other teams or departments may assure you that they will always format their API requests in such a way that you don't need to double-check incoming data. In reality, your data sources manage only a small part of a larger infrastructure. Some other department might change a data schema in a database without notice, and your "trusted" data source will start sending skewed data without even noticing it. No matter how much they assure you, implement your own data validation to identify possible data corruption over time.
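As a concrete illustration, here is a minimal sketch of request-level validation using pydantic (v2). The field names and rules are hypothetical placeholders for whatever your own request schema looks like.

```python
from pydantic import BaseModel, ValidationError, field_validator

class PredictionRequest(BaseModel):
    # Hypothetical fields; replace with your actual request schema.
    user_id: str
    country_code: str  # expected to be an ISO alpha-2 code
    price: float

    @field_validator("country_code")
    @classmethod
    def check_country_code(cls, value: str) -> str:
        if len(value) != 2 or not value.isalpha():
            raise ValueError(f"unexpected country_code: {value!r}")
        return value.upper()

def parse_request(payload: dict) -> PredictionRequest | None:
    """Validate an incoming payload instead of trusting the upstream source."""
    try:
        return PredictionRequest(**payload)
    except ValidationError as err:
        # Reject and log rather than silently feeding bad data to the model.
        print(f"rejected request: {err}")
        return None
```

Rejecting a malformed request at the door is almost always cheaper than debugging a skewed prediction weeks later.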

Ensure Feature Validity

Continuing from the previous point, you should always validate the features that are input to the model. Whether you receive them directly from the incoming request or compute them within the API, validate all features against a reference so you can catch corruption at an early stage. This is especially important for categorical features that feed embedding layers: a corrupt value will still be transformed by the embedding layer and the model will run inference as usual, but the output will most likely be skewed by the corrupt feature value(s). These problems are very hard to identify and cause hidden biases in the output of the API, and the subsequent stages of your pipeline will also be affected by these biases until you find them.
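A simple reference check goes a long way here. The sketch below validates categorical values against a known vocabulary before they ever reach the embedding layer; the feature names and allowed values are purely illustrative.

```python
# Reference vocabularies, e.g. exported from the training pipeline.
KNOWN_CATEGORIES: dict[str, set[str]] = {
    "device_type": {"mobile", "desktop", "tablet"},
    "country": {"SE", "DE", "US"},
}

def find_invalid_categoricals(features: dict[str, str]) -> list[str]:
    """Return the names of categorical features with out-of-vocabulary values.

    An unknown category would still map to *some* embedding vector, so the
    model would run without error but produce silently skewed output.
    """
    return [
        name
        for name, allowed in KNOWN_CATEGORIES.items()
        if features.get(name) not in allowed
    ]

corrupt = find_invalid_categoricals({"device_type": "mobiel", "country": "SE"})
if corrupt:
    raise ValueError(f"out-of-vocabulary categorical features: {corrupt}")
```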

Setup Proper Logging

I believe logging should be at the heart of any Machine Learning API deployment. If you cannot trace back what was sent in, what was sent out and what happened in between, you will have a hard time debugging the unexpected behaviour of your API. In our case, the Elastic Stack was extremely useful for storing and inspecting the logs, but whatever platform you use, set up proper logging at every stage possible. If you are concerned about storing sensitive data in the logs, discuss with legal advisors which parts of the data can be stored, because during the production cycle every single log can make a huge difference in identifying problems at an early stage.
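As an illustration, here is a rough sketch of per-stage structured JSON logging, which ships well to a stack like Elastic. The stage names and fields are made up for the example; redact anything your legal advisors flag as sensitive.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("ml_api")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_stage(request_id: str, stage: str, **fields) -> None:
    """Emit one JSON log line per pipeline stage, keyed by request id."""
    logger.info(json.dumps({
        "request_id": request_id,
        "stage": stage,
        "timestamp": time.time(),
        **fields,
    }))

# One id ties together everything that happened to a single request.
request_id = str(uuid.uuid4())
log_stage(request_id, "request_received", payload_keys=["user_id", "country"])
log_stage(request_id, "features_computed", n_features=42)
log_stage(request_id, "prediction_returned", score=0.87, latency_ms=12.3)
```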

Get notified ASAP

Following from the previous point, it is beneficial to get notified about potential problems in the deployment as early as possible. Whether you choose to send an SMS or develop a chatbot, you should continuously monitor the logs and notify the people concerned whenever an incident occurs. In our case, we set up Elasticsearch for storing the logs, a cronjob for continuously monitoring them and a Slack chatbot for notifying the development team about incidents. This setup significantly improved our incident response time.
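To make this concrete, below is a sketch of such a monitoring job, run for example from cron: it counts recent error logs through Elasticsearch's _count API and posts to a Slack incoming webhook. The index name, log fields and webhook URL are placeholders for your own configuration, not our exact setup.

```python
import requests

# Placeholders: point these at your own Elasticsearch index and Slack webhook.
ES_COUNT_URL = "http://localhost:9200/ml-api-logs/_count"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def count_recent_errors() -> int:
    """Count ERROR-level log entries from the last five minutes."""
    query = {"query": {"bool": {"must": [
        {"match": {"level": "ERROR"}},
        {"range": {"timestamp": {"gte": "now-5m"}}},
    ]}}}
    resp = requests.post(ES_COUNT_URL, json=query, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"]

def notify_slack(count: int) -> None:
    """Ping the team channel through an incoming webhook."""
    text = f"{count} errors in the ML API logs over the last 5 minutes"
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10)

if __name__ == "__main__":
    errors = count_recent_errors()
    if errors > 0:
        notify_slack(errors)
```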

The main lesson I learned during our deployment process is that you shouldn't rush things until you have properly set up logging, tested data validity extensively and put incident monitoring in place. Otherwise, it will be much more costly to trace back an error that occurred months ago, along with the collateral damage it caused. Good luck with your machine learning deployments.

