Enterprise data science workflow

Artificial intelligence and machine learning are getting hotter every day. Tech pioneers such as Amazon, Google, Facebook, Uber, and Microsoft have shared success stories of applying machine learning to applications including image recognition, natural language processing, and autonomous driving. With mature open-source data processing and machine learning technologies now easily accessible, a rapidly growing number of enterprises are looking to apply ML / AI to improve their business operations and gain an advantage over their competitors.

The majority of enterprises are not tech pioneers. While they have in-house IT staff looking after existing applications that support day-to-day business, their investment in research is relatively low compared to the tech giants mentioned above. Nevertheless, a large proportion of these enterprises have started to look into ML / AI, either because they see opportunities to increase business value or out of a general fear of lagging behind. They typically hire consultants to examine their business processes, identify niche areas of business operation where ML / AI elements can be injected, establish a small team of data engineers and data scientists to build the intelligent app, deploy it into production, and finally transform the existing business operation to use the new flow.

In this post, I will go over the key stages of conducting data science (the underlying process of applying ML / AI). I will describe the major components involved along with the corresponding design considerations and challenges. Let me re-emphasize that this post is drawn from my past consulting experience in building "enterprise-scale" data science architecture. I am not focusing on "web-scale" data science architecture, which you can easily find in publications on the web or at tech conferences.

Life cycle of intelligent app development

The end-to-end life cycle of building an intelligent app involves 4 major stages:

  1. Data collection
  2. Data curation (clean up the collected data and transform them into a structure more suitable for analytic processing)
  3. Model training (build prediction and optimization models)
  4. Model serving (deploy trained models to serve online requests for recommending business decisions)
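The way these four stages chain together can be sketched as a minimal pipeline. Every function and field name below (`collect`, `curate`, `train`, `serve`, the `id` / `value` fields) is hypothetical, used only to illustrate how the output of one stage feeds the next, not a real framework.

```python
def collect(sources):
    """Stage 1: pull raw records from each source into one central list (the 'data lake')."""
    return [record for source in sources for record in source]

def curate(raw):
    """Stage 2: normalize field names and drop obviously invalid records."""
    normalized = [{k.lower(): v for k, v in r.items()} for r in raw]
    return [r for r in normalized if r.get("id") is not None]

def train(curated):
    """Stage 3: fit a trivial stand-in 'model' -- the mean of a numeric field."""
    values = [r["value"] for r in curated]
    return {"mean": sum(values) / len(values)}

def serve(model, request):
    """Stage 4: answer an online request with a recommendation from the trained model."""
    return {"recommendation": model["mean"], "request": request}

# Wire the stages end to end on toy data; the record with a missing id is
# filtered out during curation.
sources = [[{"ID": 1, "Value": 10}, {"ID": None, "Value": 99}],
           [{"ID": 2, "Value": 20}]]
model = train(curate(collect(sources)))
```

In a real deployment each stage is a separate system (as described in the sections below), but the data dependencies between them follow this same shape.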

Data Collection

Data collection is the starting point of the life cycle. Its purpose is to collect data from dispersed data sources into a centralized data lake, where all data (still in its raw, unstructured form) can be further processed:

  • Extract data from diverse data sources such as ERP systems, public web services, web site scraping, RDBMSs, log files, etc. Engineering effort is involved in developing a new data adaptor library whenever a new type of data source is added.
  • The data collection framework should support different modes of data transfer initiation, including data pushed by the data sources and data pulled from the data sources by the framework.
  • The data collection framework focuses on just moving data without transformation; data is stored in raw form using the same structure as the originating data sources.
  • The data extraction process is fully automated, with sufficient logging to monitor the overall progress of data collection. An alert is generated when any component fails or a data transfer is not proceeding normally. We also need to track per-source collection statistics such as frequency, duration, and size of uploads.
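The bullets above can be sketched as a small collection framework: one adaptor class per source type, a driver that lands raw records in the lake unchanged, logging as the alert hook, and per-source statistics. All names here (`CsvFileAdaptor`, `collect`, the `"erp"` source) are made up for illustration.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("collector")

class CsvFileAdaptor:
    """Hypothetical adaptor for one source type. Writing a new adaptor class
    like this is the engineering effort added per new kind of data source."""
    def __init__(self, lines):
        self.lines = lines  # stand-in for a real file handle or connection

    def extract(self):
        # Return rows in raw form -- no transformation at this stage.
        return list(self.lines)

def collect(adaptors, data_lake):
    """Run each adaptor, land the raw records in the data lake, and record
    duration / size statistics per source for monitoring."""
    stats = []
    for name, adaptor in adaptors.items():
        start = time.monotonic()
        try:
            records = adaptor.extract()
        except Exception:
            # Alert hook: a failed component is logged (and would page someone).
            log.exception("extraction failed for source %s", name)
            continue
        data_lake.setdefault(name, []).extend(records)
        stats.append({"source": name,
                      "size": len(records),
                      "duration_s": time.monotonic() - start})
        log.info("collected %d records from %s", len(records), name)
    return stats

# Usage: collect from one hypothetical ERP source into an in-memory 'lake'.
lake = {}
stats = collect({"erp": CsvFileAdaptor(["a,1", "b,2"])}, lake)
```

A production framework would persist the statistics and compare them against historical baselines to detect transfers that are not proceeding normally.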

Data Curation

Data curation is a major investment of engineering effort when enterprises move from a trial phase into a production phase of their data analytics journey. A big part of it is identifying data quality issues in the collected data, cleaning up inconsistencies, and transforming and organizing the data into a structure more suitable for downstream data science processing:

  • Build a metadata dictionary to describe the semantics of the data, such as table and column descriptions and table relationships (primary key / foreign key)
  • Apply data cleansing and transformation logic, including filtering out invalid data (duplicated records, inconsistent references), handling missing fields (discard, impute, etc.), and masking out user identity / privacy data
  • Monitor the ongoing data quality and validate data against user-defined assumptions
  • Provide summaries of data distributions, outlier detection, and divergence from expected distributions
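The cleansing steps in the second bullet can be sketched as follows: drop duplicated records, impute missing numeric fields, and mask out user identity. This is a minimal illustration, assuming hypothetical `user` / `amount` fields and mean imputation as the chosen strategy; real curation pipelines are rule-driven and far more elaborate.

```python
import statistics

def curate(records):
    """Cleanse a batch of records: de-duplicate, impute, mask identity."""
    # 1. Filter out duplicated records, keyed on the full record contents.
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(r))

    # 2. Handle missing fields: impute a missing 'amount' with the mean
    #    of the observed values (discarding the record is the alternative).
    observed = [r["amount"] for r in unique if r["amount"] is not None]
    mean = statistics.mean(observed)
    for r in unique:
        if r["amount"] is None:
            r["amount"] = mean

    # 3. Mask out user identity before the data reaches downstream consumers.
    for r in unique:
        r["user"] = "***"
    return unique

# Usage: one duplicate, one missing amount, identities masked throughout.
records = [{"user": "alice", "amount": 10},
           {"user": "alice", "amount": 10},
           {"user": "bob", "amount": None},
           {"user": "carol", "amount": 20}]
cleaned = curate(records)
```

The monitoring bullets (ongoing quality checks, distribution summaries) would run on top of output like this, comparing each batch against user-defined assumptions.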

Throughout the process of building the data curation system, data quality issues in existing systems will be discovered and bugs will be filed. Rather than viewing data curation as an add-on investment of the data science effort, it should be regarded as one more testing/checking step for existing applications, making sure they produce meaningful and useful data. This is usually how enterprises justify this fairly significant investment of engineering effort. I have also observed that enterprises who don't invest enough in data curation usually end up with bad data that hurts the quality of downstream models. Garbage in, garbage out: a reliable source of data is fundamental to the quality of the final outcome of data science, and investing in data curation early will have a huge payoff.

In future posts, I will cover the remaining components in the life cycle: Model training and Model serving.



