Data Engineering and Data Science - Yin and Yang: Lessons from Data Science Projects

Mainak Sarkar

Published Feb 8, 2021

I recently came across a couple of AI/ML and data science (DS) projects, and I felt compelled to write this article based on what I learned from them. It is fairly well established that most data science projects fail because production deployment and operationalization are not considered from the start. According to Gartner analyst Nick Heudecker, over 85% of data science projects fail. A report from Dimensional Research indicated that only 4% of companies have succeeded in deploying ML models to a production environment. I wrote an article calling out the distinction between Exploratory AI and Operational AI, and why data scientists need to think about Operational AI: for these projects to deliver value to organizations, they need to come out of their sandboxes and operate in the real world.

Despite this being common knowledge, it was evident that the teams on these projects lacked "product thinking" and neglected to include the user stories that would address the operational needs of the DS models. As a result, the data engineering team and the DS team worked in silos right up to the point of deploying the model in production. This put undue stress on project timelines and outcomes. It even called the success of the projects into question, although the DS model had been developed to meet the business requirements and was running successfully, in isolation, in a sandbox environment. Here are some lessons learned from these projects that might help you avoid the mistakes the teams made.

Apply Product thinking:

This approach to implementing an AI/ML or DS project forces you to think in terms of Operational AI from the beginning. You have to create user stories that capture not only functional requirements but also production deployment and operational (model maintenance) requirements. As a result, you plan better for production deployment and maintenance.

Break down Siloed teams:

Siloed data teams (engineering and DS) can hinder the iterative model development process and slow innovation, namely the development of Operational AI. It is imperative that leaders understand that AI/ML and DS projects require effective collaboration between the data engineering and DS teams to be implemented efficiently. If the teams do not understand each other's requirements and constraints, the model may turn out to be unsuitable for production deployment, or may require significant effort on both sides to deploy. This adds to the timeline and the overall effort. In one of the projects, when it came time to deploy the model, the engineering team had to try several architectural approaches before finding one that fit the DS model's requirements for running in production. This could have been avoided had the two teams collaborated to understand the integration points and requirements of the operational DS model.
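One way to make those integration points concrete early is a small, shared "contract" that both teams review during the first sprints. The sketch below is only an illustration under my own assumptions: `ModelContract`, its fields, and `missing_inputs` are hypothetical names, not something the projects above actually used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelContract:
    """Hypothetical record of what a DS model needs to run in production."""
    name: str
    input_columns: Dict[str, str]  # column name -> expected type name
    max_latency_ms: int            # agreed response-time budget
    refresh_schedule: str          # how often the model expects fresh data


def missing_inputs(contract: ModelContract,
                   pipeline_columns: Dict[str, str]) -> List[str]:
    """Return the contract columns that the engineering pipeline does not
    supply, or supplies with a different type, sorted by name."""
    return sorted(
        column
        for column, expected_type in contract.input_columns.items()
        if pipeline_columns.get(column) != expected_type
    )


# Illustrative values only
contract = ModelContract(
    name="churn_model",
    input_columns={"tenure_months": "int", "monthly_spend": "float"},
    max_latency_ms=200,
    refresh_schedule="daily",
)
gaps = missing_inputs(contract, {"tenure_months": "int"})
print(gaps)
```

Running a check like this in a shared pipeline surfaces mismatches while they are still cheap to fix, rather than at deployment time.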

Integration points between the engineering team and the data science team should exist throughout the life cycle of the project

The two teams will run in parallel. They might feel like distinct streams of work, but they are complementary to each other, like Yin and Yang. Yin and Yang describes how seemingly opposite or contrary forces may actually be complementary, interconnected, and interdependent in the natural world, and how they may give rise to each other as they interrelate to one another. (Source: Wikipedia).

This picture shows the various activities of a DS team and a data engineering team running in parallel but having various integration points along the project life cycle. Source: Slalom

Matei Zaharia, who is the chief technologist at Databricks and started the Apache Spark project, said in his recent article How to empower data teams in 3 critical ways:

We need to break silos down, enable collaboration between data engineering and data science teams, and build a new data team structure.

He also argues that converging roles will be on the rise in the future: hybrid roles focused more on delivering business value and less on specialized skill sets. This will become possible as platforms evolve to support the convergence of such roles. ML engineers will have full-stack data experience, and data engineers will have DS experience.

Think MLOps from the beginning:

For Operational AI to deliver business value successfully, MLOps thinking is needed from the beginning of the project. Danny Farah did an excellent job explaining the need for MLOps, as well as its blueprint, in his article The Modern MLOps Blueprint. Once deployed in production, the model needs to be monitored, updated as the needs of the business change (because the model must deliver value at all times), and checked for edge cases and biases. Without an MLOps framework in place, this can become a very difficult and cumbersome process.
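As a small illustration of what "monitored" can mean in practice, the sketch below compares the distribution of a feature at training time with what the model sees in production, using the Population Stability Index (PSI). This is one common drift check, not the method from the article mentioned above. The function names, the ten equal-width bins, and the 0.2 alert threshold (a frequently quoted rule of thumb) are my own assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               bins: int = 10) -> float:
    """PSI between a baseline sample and a production sample, using
    equal-width bins derived from the baseline's range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back if the baseline is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range go into the edge bins
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_review(expected: Sequence[float],
                 actual: Sequence[float],
                 threshold: float = 0.2) -> bool:
    """Flag the feature when PSI exceeds the (assumed) alert threshold."""
    return population_stability_index(expected, actual) > threshold


# Illustrative data: a baseline sample and a shifted production sample
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]
print(needs_review(baseline, baseline), needs_review(baseline, shifted))
```

A check like this would typically run on a schedule for each important feature and for the model's scores, with a flagged result prompting a retraining or data quality review.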

Do not underestimate the need for a Scrum Master for a DS project:

This might seem like a very obvious project management requirement, but you might be surprised to find that a lot of projects make one of the leads double as the Scrum Master to cut costs. That is a mistake. Like any data engineering project, a DS project needs a dedicated Scrum Master. The Scrum Master needs to make sure that all team members work collaboratively to deliver their user stories by the end of each sprint, and that all impediments to the project are removed.

Above all, the Scrum Master needs to break down all silos within the team.

A Gartner survey reveals that 66% of organizations increased or did not change their AI investments since the onset of COVID-19. As companies continue to invest in AI projects to improve customer experience and retention, revenue growth, and cost optimization, they must focus on creating strong, collaborative teams in which the data engineering and DS teams work like Yin and Yang, focused on delivering business value. That is how data teams will have a bigger impact on innovation.

Steve Shea 5y

Mainak thanks for drafting that article. I'd be curious to hear where you think ML success rate will be in 2 years. Is a talent gap also part of the reason for the low success rate? Troy Hall, Zola Petkovic, Marc Lobree - you all are people I respect in the AI/ML space. Would be curious to hear your thoughts.

Satish Chandra Gupta 5y

Nice article Mainak Sarkar. It has been my learning too: - Consolidate ownership - Integrate early - Iterate often https://ml4devs.substack.com/p/003-why-machine-learning-projects-fail

Michelle Mindala-Freeman 5y

Enamored with the science, but neglectful of the #productmindset needed for success...agreed Mainak!
