Data science from a software perspective
Data science projects fail regularly, not because the models underperform but because the gap between developing a model and running it in production is too large. Data science is often seen as a field too different from regular software development. Because many data scientists come from an academic background where scripting and statistics are the tools of the trade, development best practices are discarded too easily, partly out of unfamiliarity and partly because of genuine differences between the fields.
In practice, most machine learning projects are not only about modeling but also about deployment, monitoring, robustness, reproducibility, lineage and trust. Because of this, we should treat these solutions not as machine learning models but as software. There are significant differences between traditional software and data science solutions, and acknowledging them is critical. Because a model is the product of both code and a dataset, relying purely on Git is difficult, and monitoring and alerting for machine learning systems differ substantially from those for regular applications.
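To make the "code plus data" point concrete, here is a minimal sketch of how a model identity could be derived from both. The function names and the choice of hashing are illustrative assumptions, not a description of any particular tool: the idea is simply that a Git commit alone does not pin down a model, because the same code trained on different data is a different model.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path: str) -> str:
    """Return a SHA-256 hash of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def model_version(code_path: str, data_path: str, params: dict) -> str:
    """Derive a reproducible model identifier from code, data and parameters.

    Git tracks the code, but the same commit trained on a different
    dataset or with different hyperparameters yields a different model,
    so all three ingredients go into the identifier.
    """
    payload = json.dumps(
        {
            "code": fingerprint(code_path),
            "data": fingerprint(data_path),
            "params": params,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Any change to the training data or parameters produces a new identifier even when the code is untouched, which is exactly the lineage information that a pure-Git workflow loses.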
Despite these major differences, there is significant overlap between model-based software and conventional development. Software development has changed a lot in the past five years. CI/CD significantly reduced the time between development and production; unit and integration tests are run automatically every time somebody pushes new code. Containerization means that the probability of your code running everywhere approaches one. Cloud environments abstract away much of the complexity on the operations side of the equation.
By approaching data science with a software development mindset, I believe we can capture many of these benefits. The catch is precisely that the differences between the two fields make the traditional tooling not directly usable. New tooling will have to arrive to bridge the gap between these two worlds.
While I still cannot share all the details, I can tell you that we are working on this type of tooling: CI/CD for machine learning, where the goal is to reduce the time between the release of your code and the deployment of your model to the time it takes to train. Where you can deploy the same model in batch or in API mode. A simple deployment, A/B testing, or shadow running multiple versions in the background, so that you can push the next version to production with full confidence and roll back instantly when the system detects that something is wrong.
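Shadow running can be illustrated with a small sketch. This is not our product's implementation, just an assumed minimal router: the caller only ever receives the live model's answer, while the candidate model runs on the same traffic and its disagreements are logged for offline comparison.

```python
import logging
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")


class ShadowRouter:
    """Serve the live model while silently running a candidate alongside it.

    The candidate's output never reaches the caller, so a bad candidate
    cannot affect production traffic; its behavior is only observed.
    """

    def __init__(self, live: Callable[[Any], Any], shadow: Callable[[Any], Any]):
        self.live = live
        self.shadow = shadow
        self.disagreements = 0

    def predict(self, features: Any) -> Any:
        result = self.live(features)
        try:  # a failing shadow must never break the live path
            candidate = self.shadow(features)
            if candidate != result:
                self.disagreements += 1
                log.info("shadow disagrees: live=%s shadow=%s", result, candidate)
        except Exception:
            log.exception("shadow model failed")
        return result
```

Once the disagreement rate (and the offline comparison behind it) looks acceptable, promoting the candidate is a one-line swap, which is what makes instant rollback cheap: the previous model is still there, untouched.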
We believe that by borrowing a few key concepts from software engineering, both the maturity of the data science field and the confidence stakeholders have in our solutions can be increased significantly. Iterating over new versions in a live environment should take days, not months. And once a model is in production, alerts should be in place and your models should update automatically. This will allow our data scientists, and yours, to focus on what data scientists are good at, thereby maximizing the utility of our data science efforts.
We are talking to a number of companies to see how we could help them as early as Q1, whether on our own cloud, their private clouds, or on-premise. If you are interested in learning more, send me a message.