Enterprise data science workflow
Artificial intelligence and machine learning are getting hotter these days. Tech pioneers such as Amazon, Google, Facebook, Uber, and Microsoft have shared success stories of applying machine learning to applications including image recognition, natural language processing, and autonomous driving. With open-source data processing and machine learning technologies now mature and easily accessible, a rapidly growing number of enterprises are looking to apply ML / AI to improve their business operations and gain an advantage over their competitors.
The majority of enterprises are not tech pioneers. While they have in-house IT staff looking after existing applications that support day-to-day business, their investment in research is relatively low compared to the tech giants mentioned above. Nevertheless, a large proportion of these enterprises have started to look into ML / AI, either because they see opportunities to increase business value or out of a general fear of lagging behind. They typically hire consultants to examine their business processes, identify niche areas of business operation to inject ML / AI elements, establish a small team of data engineers / data scientists to build the intelligent app, deploy it into production, and finally transform the existing business operation to use the new flow.
In this post, I will go over the key stages of conducting data science (the underlying process of applying ML / AI). I will describe the major components involved and the corresponding design considerations and challenges. Let me re-emphasize that this post draws from my past consulting experience building "enterprise-scale" data science architecture. I am not focusing on "web-scale" data science architecture, which you can easily find in publications on the web or at tech conferences.
Life cycle of intelligent app development
The end-to-end life cycle of building an intelligent app involves four major stages (a minimal pipeline sketch follows the list):
- Data collection
- Data curation (clean up the collected data and transform it into a structure more suitable for analytic processing)
- Model training (build prediction and optimization models)
- Model serving (deploy trained models to serve online requests for recommending business decisions)
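To make the flow concrete, here is a minimal sketch of the four stages wired together as a plain Python pipeline. The function names, the CSV sources, and the `target` column are my own illustrative assumptions, not part of any specific framework.

```python
import pandas as pd

def collect(sources):
    """Stage 1: gather raw data from dispersed sources into one place (the data lake)."""
    return pd.concat([pd.read_csv(s) for s in sources], ignore_index=True)

def curate(raw):
    """Stage 2: clean the raw data and shape it for analytic processing."""
    return raw.drop_duplicates().dropna()

def train(curated):
    """Stage 3: build a prediction model; a trivial mean predictor stands in here."""
    return {"prediction": curated["target"].mean()}  # 'target' column is an assumption

def serve(model, request):
    """Stage 4: answer an online request with a recommended decision."""
    return model["prediction"]

# Hypothetical usage (file names are assumptions):
# model = train(curate(collect(["orders.csv", "web_logs.csv"])))
# print(serve(model, request={}))
```

In practice each stage runs as its own scheduled job, handing data off through the data lake rather than through in-process function calls.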
Data Collection
Data collection is the starting point of the life cycle. Its purpose is to collect data from dispersed data sources into a centralized data lake, where all data (still in its raw, unstructured form) can be further processed:
- Extract data from diverse data sources such as ERP systems, public web services, website scraping, RDBMS, log files, etc. Engineering effort is involved in developing a new data adaptor library whenever a new type of data source is added (see the adapter sketch after this list).
- The data collection framework should support different modes of data transfer initiation, including data pushed (uploaded) by the data sources and data pulled (extracted) by the framework.
- The data collection framework focuses on just moving data without transformation; data is stored in raw form, using the same structure as the originating data sources.
- The data extraction process is fully automated, with sufficient logging to monitor the overall progress of data collection. Alerts are generated if any component fails or a data transfer is not proceeding normally. We also need to track per-source collection statistics such as the frequency, duration, and size of uploads.
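To illustrate the points above, here is a minimal sketch of a data-source adapter interface with a pull-mode collector that stores data raw and logs collection statistics. The names (`DataSourceAdapter`, `RdbmsAdapter`, `collect`) are assumptions of mine, not a real framework's API.

```python
import logging
import time
from abc import ABC, abstractmethod

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("collector")

class DataSourceAdapter(ABC):
    """One adapter per source type (ERP, RDBMS, log files, web scraping, ...)."""

    @abstractmethod
    def extract(self) -> bytes:
        """Pull mode: fetch raw data from the source, unmodified."""

class RdbmsAdapter(DataSourceAdapter):
    def __init__(self, conn_str, table):
        self.conn_str, self.table = conn_str, table

    def extract(self) -> bytes:
        # A real implementation would query the database; stubbed for the sketch.
        return b"raw rows from " + self.table.encode()

def collect(adapter, lake_path):
    """Move raw bytes into the data lake, tracking duration/size and alerting on failure."""
    start = time.time()
    try:
        payload = adapter.extract()
        with open(lake_path, "wb") as f:
            f.write(payload)  # stored raw, in the same structure as the source
        log.info("collected %d bytes in %.2fs -> %s",
                 len(payload), time.time() - start, lake_path)
    except Exception:
        log.exception("collection failed; raise an operations alert here")
        raise
```

Push mode would be the mirror image: the source calls an upload endpoint that writes to the lake and records the same statistics.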
Data Curation
Data curation is a major investment of engineering effort when enterprises move from a trial phase to a production phase in their data analytics journey. A big part of it is identifying data quality issues in the collected data, cleaning up inconsistencies, and transforming and organizing the data into a structure more suitable for downstream data science processing:
- Build a metadata dictionary to describe the semantics of the data, such as table and column descriptions and table relationships (primary key / foreign key)
- Apply data cleansing and transformation logic, including filtering out invalid data (duplicated records, inconsistent references), handling missing fields (discard, impute, etc.), and masking out user identity / privacy data (a cleansing sketch follows this list)
- Monitor ongoing data quality and validate data against user-defined assumptions
- Provide summaries of data distributions, outlier detection, and divergence from expected distributions
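To ground these steps, below is a minimal cleansing-and-validation sketch using pandas. The column names (`user_id`, `amount`), the median imputation, and the 3-sigma outlier rule are illustrative assumptions; a real pipeline would drive these choices from the metadata dictionary.

```python
import hashlib
import pandas as pd

def curate(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Filter out invalid data: duplicated records.
    df = df.drop_duplicates()

    # Handle missing fields: impute the numeric 'amount' with its median.
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Mask user identity: replace 'user_id' with a one-way hash.
    df["user_id"] = df["user_id"].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()[:12])

    # Validate a user-defined assumption: amounts must be non-negative.
    assert (df["amount"] >= 0).all(), "data quality alert: negative amounts found"

    return df

def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize data distributions and count points more than 3 standard deviations out."""
    z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    print("outlier count:", int((z.abs() > 3).sum()))
    return df.describe()
```

Divergence from an expected distribution can be checked the same way, for example by comparing today's summary statistics against a stored baseline.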
Throughout the process of building the data curation system, data quality issues in existing systems will be discovered and bugs will be filed. Rather than viewing data curation as an add-on investment of the data science effort, it should be regarded as one more testing/checking step for existing applications, making sure they produce meaningful and useful data. This is usually how enterprises justify this fairly significant investment of engineering effort. I have also observed that enterprises who don't invest enough in data curation usually end up with bad data that degrades the quality of downstream models. Garbage in, garbage out: a reliable source of data is fundamental to the quality of the final outcome of data science, and investing in data curation early will have a huge payoff.
In future posts, I will cover the remaining components in the life cycle: Model training and Model serving.