Enterprise data science workflow
Artificial intelligence and machine learning are getting hotter these days. Tech pioneers such as Amazon, Google, Facebook, Uber, and Microsoft have shared success stories of applying machine learning to applications including image recognition, natural language processing, and autonomous driving. With open-source data processing and machine learning technologies now mature and easily accessible, a rapidly growing number of enterprises are looking to apply ML / AI to improve their business operations and gain an advantage over their competitors.
The majority of enterprises are not tech pioneers. While they have in-house IT staff looking after existing applications that support day-to-day business, their investment in research is relatively low compared to the tech giants mentioned above. Nevertheless, a large proportion of these enterprises have started to look into ML / AI, either because they see opportunities to increase business value or out of a general fear of lagging behind. They typically hire consultants to examine their business processes, identify niche areas of business operation to inject ML / AI elements, establish a small team of data engineers / data scientists to build the intelligent app, deploy it into production, and finally transform the existing business operation to use the new flow.
In this post, I will go over the key stages of conducting data science (the underlying process of applying ML / AI). I will describe the major components involved and the corresponding design considerations and challenges. Let me re-emphasize that this post draws from my past consulting experience building "enterprise-scale" data science architecture. I am not focusing on "web-scale" data science architecture, which you can easily find in publications on the web or at tech conferences.
Life cycle of intelligent app development
The end-to-end life cycle of building an intelligent app involves four major stages (a minimal pipeline sketch follows the list):
- Data collection
- Data curation (clean up the collected data and transform it into a structure more suitable for analytic processing)
- Model training (build prediction and optimization models)
- Model serving (deploy trained models to serve online requests for recommending business decisions)
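To make the flow concrete, here is a minimal sketch of the four stages wired together as a plain Python pipeline. The function names, the CSV sources, and the `target` column are my own illustrative assumptions, not part of any specific framework.

```python
import pandas as pd

def collect(sources):
    """Stage 1: gather raw data from dispersed sources into one place (the data lake)."""
    return pd.concat([pd.read_csv(s) for s in sources], ignore_index=True)

def curate(raw):
    """Stage 2: clean the raw data and shape it for analytic processing."""
    return raw.drop_duplicates().dropna()

def train(curated):
    """Stage 3: build a prediction model; a trivial mean predictor stands in here."""
    return {"prediction": curated["target"].mean()}  # 'target' column is an assumption

def serve(model, request):
    """Stage 4: answer an online request with a recommended decision."""
    return model["prediction"]

# Hypothetical usage (file names are assumptions):
# model = train(curate(collect(["orders.csv", "web_logs.csv"])))
# print(serve(model, request={}))
```

In practice each stage runs as its own scheduled job, handing data off through the data lake rather than through in-process function calls.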
Data Collection
Data collection is the starting point of the life cycle. Its purpose is to collect data from dispersed data sources into a centralized data lake, where all data (still in its raw, unstructured form) can be further processed:
- Extract data from diverse data sources such as ERP systems, public web services, website scraping, RDBMS, log files, etc. Engineering effort is involved in developing a new data adaptor library whenever a new type of data source is added (see the adapter sketch after this list).
- The data collection framework should support different modes of data transfer initiation, including data pushed (uploaded) by the data sources and data pulled (extracted) by the framework.
- The data collection framework focuses on just moving data without transformation; data is stored in raw form, using the same structure as the originating data sources.
- The data extraction process is fully automated, with sufficient logging to monitor the overall progress of data collection. Alerts are generated if any component fails or a data transfer is not proceeding normally. We also need to track per-source collection statistics such as the frequency, duration, and size of uploads.
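To illustrate the points above, here is a minimal sketch of a data-source adapter interface with a pull-mode collector that stores data raw and logs collection statistics. The names (`DataSourceAdapter`, `RdbmsAdapter`, `collect`) are assumptions of mine, not a real framework's API.

```python
import logging
import time
from abc import ABC, abstractmethod

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("collector")

class DataSourceAdapter(ABC):
    """One adapter per source type (ERP, RDBMS, log files, web scraping, ...)."""

    @abstractmethod
    def extract(self) -> bytes:
        """Pull mode: fetch raw data from the source, unmodified."""

class RdbmsAdapter(DataSourceAdapter):
    def __init__(self, conn_str, table):
        self.conn_str, self.table = conn_str, table

    def extract(self) -> bytes:
        # A real implementation would query the database; stubbed for the sketch.
        return b"raw rows from " + self.table.encode()

def collect(adapter, lake_path):
    """Move raw bytes into the data lake, tracking duration/size and alerting on failure."""
    start = time.time()
    try:
        payload = adapter.extract()
        with open(lake_path, "wb") as f:
            f.write(payload)  # stored raw, in the same structure as the source
        log.info("collected %d bytes in %.2fs -> %s",
                 len(payload), time.time() - start, lake_path)
    except Exception:
        log.exception("collection failed; raise an operations alert here")
        raise
```

Push mode would be the mirror image: the source calls an upload endpoint that writes to the lake and records the same statistics.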
Data Curation
Data curation is a major investment of engineering effort when enterprises move from a trial phase to a production phase in their data analytics journey. A big part of it is identifying data quality issues in the collected data, cleaning up inconsistencies, and transforming and organizing the data into a structure more suitable for downstream data science processing:
- Build a metadata dictionary to describe the semantics of the data, such as table and column descriptions and table relationships (primary key / foreign key)
- Apply data cleansing and transformation logic, including filtering out invalid data (duplicated records, inconsistent references), handling missing fields (discard, impute, etc.), and masking out user identity / privacy data (a cleansing sketch follows this list)
- Monitor ongoing data quality and validate data against user-defined assumptions
- Provide summaries of data distributions, outlier detection, and divergence from expected distributions
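To ground these steps, below is a minimal cleansing-and-validation sketch using pandas. The column names (`user_id`, `amount`), the median imputation, and the 3-sigma outlier rule are illustrative assumptions; a real pipeline would drive these choices from the metadata dictionary.

```python
import hashlib
import pandas as pd

def curate(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Filter out invalid data: duplicated records.
    df = df.drop_duplicates()

    # Handle missing fields: impute the numeric 'amount' with its median.
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Mask user identity: replace 'user_id' with a one-way hash.
    df["user_id"] = df["user_id"].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()[:12])

    # Validate a user-defined assumption: amounts must be non-negative.
    assert (df["amount"] >= 0).all(), "data quality alert: negative amounts found"

    return df

def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize data distributions and count points more than 3 standard deviations out."""
    z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    print("outlier count:", int((z.abs() > 3).sum()))
    return df.describe()
```

Divergence from an expected distribution can be checked the same way, for example by comparing today's summary statistics against a stored baseline.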
Throughout the process of building the data curation system, data quality issues in existing systems will be discovered and bugs will be filed. Rather than viewing data curation as an add-on investment of the data science effort, it should be regarded as one more testing/checking step for existing applications, making sure they produce meaningful and useful data. This is usually how enterprises justify this fairly significant investment of engineering effort. I have also observed that enterprises who don't invest enough in data curation usually end up with bad data that degrades the quality of downstream models. Garbage in, garbage out: a reliable source of data is fundamental to the quality of the final outcome of data science, and investing in data curation early will have a huge payoff.
In future posts, I will cover the remaining components in the life cycle: Model training and Model serving.