Data Engineering for a Pandemic

Introduction

I had the pleasure of diving into a project over the last few years that used modern data engineering and warehousing techniques. It gave me a front-row seat to one corporate response to the COVID-19 pandemic as 2020 rolled on. This article goes over what I saw happen from a data engineering perspective, and how some contemporary approaches helped curb the fallout for an organization directly impacted by the pandemic.

Distributed Computing Tools Helping a Company 

A cloud-based, distributed data platform coupled with a strong framework to effectively use data helped decision-makers at one large organization respond effectively to changes in revenue, supply chain, customer retention and employee productivity brought on by the new coronavirus (COVID-19).

This organization had many locations around the U.S., where most employees worked on-site, managing physical deliveries of goods and services. As COVID-19 spread, it became apparent it could not continue normal operations. It adapted by implementing remote work for most employees and adding protective measures for others involved in deliveries and on-site services that could not be done virtually.

Still, the pandemic provided a unique set of challenges to the organization.

Employees were not regularly working remotely prior to the pandemic.

  • How could they fluidly transition?
  • What impact would the transition have on productivity?
  • How could new talent be attracted as needed to support new projects?

The organization needed to sustain its cost of operations with incoming revenue in response to the pandemic.

  • How did the pandemic affect the organization’s supply chain?
  • What was the impact on supplies, goods and selling?
  • What was the impact on deliveries and services provided after a sale?

Because the organization had purchased distributed big data tools prior to the pandemic, looking ahead to scaling its data resources to a very large size, it was able to analyze data in a way that provided useful answers to these questions.

What Is Needed

An effective cloud-based, distributed data platform needs three things:

  • Conformed data
  • A strong framework
  • The right tools

The collection of pertinent data related to productivity, inventory, supply chains, sales and other key metrics made a more intelligent response to a global viral pandemic possible.

The organization put in place a framework to transform the data to be understandable and available to decision makers. This is the core of any business intelligence (BI) project and why data engineering is so important; in this case, a good foundational framework allowed organization leaders to pivot quickly in response to the pandemic.

Lastly, the proper tools to get the data into the right hands were necessary. The organization’s cloud-based and distributed computing systems allowed it to gather, transform and deliver analytics faster than the alternatives, providing a clearer picture of what was happening in a timely fashion.

The Framework

The framework, given that the right data is available, is key. Having a flashy tool is useless if it cannot support organizational needs and service-level agreements. In general, solid data engineering and an effective extract, transform and load (ETL) framework accomplish what is shown in the visual below for data warehousing.

[Figure: generalized framework for ingesting, transforming and surfacing data]

Why did this framework lead to a successful outcome for this organization?

  • Metadata was the key driver of data movement, making changes and additions fluid and fast (a minimal sketch of this pattern follows the list)
  • Orchestrating movement in parallel allowed for quick data movement when needed
  • Persisted staging kept historical data, even when company source systems did not retain it
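
To make those bullets concrete, below is a minimal sketch of metadata-driven, parallel ingestion into date-partitioned persisted staging. The feed entries, the extract_table helper and the lake paths are all hypothetical illustrations of the pattern, not the organization's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date
from pathlib import Path

# Hypothetical metadata: each entry describes one source-to-staging movement.
# In practice this lives in a metadata database, not in code.
FEEDS = [
    {"source": "erp.sales_orders", "entity": "sales_orders"},
    {"source": "erp.inventory", "entity": "inventory"},
    {"source": "hr.timesheets", "entity": "timesheets"},
]

LAKE_ROOT = Path("lake/persisted-staging")  # assumed lake mount point

def extract_table(source: str) -> bytes:
    """Placeholder extract; a real job would query the source system."""
    return f"extract of {source}".encode()

def land_feed(feed: dict) -> Path:
    # Date-partitioned folders keep every day's snapshot, so history
    # survives even if the source system overwrites its own records.
    target = LAKE_ROOT / feed["entity"] / f"load_date={date.today()}" / "data.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(extract_table(feed["source"]))
    return target

# Metadata drives the run and the executor provides the parallelism,
# so adding a feed is a metadata change rather than a code change.
with ThreadPoolExecutor(max_workers=4) as pool:
    for path in pool.map(land_feed, FEEDS):
        print("landed", path)
```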

Once data is available in an understandable form, a data analyst or scientist can take that data and make it meaningful using visuals for downstream audiences.

In this case, having data science personnel and business leaders working together helped identify exactly what needed to be added to the data engineering workflow so that decision-makers could see the company data that would answer questions about the impact of COVID-19.

Fulfilling Requests for Data

Here is an example of how this organization could harness the power of this architecture to fulfill a request quickly during the pandemic. The framework allowed it to do the following (a sketch of the first step appears after the list).

  • Add metadata for sourcing, transformation, and how the data will be surfaced
  • Run the orchestration tool for movement of data based on this new output
  • Schedule recurring data updates through the orchestration tool after testing and refining
  • Build dashboards and reports from this data for senior leadership to understand over time
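
As a loose illustration of the first step, in a metadata-driven framework a new request often begins as a single metadata entry rather than new pipeline code. The feed, procedure and view names below are hypothetical:

```python
# Hypothetical metadata entry for a new pandemic-related feed. Sourcing,
# transformation and surfacing are all described as data, so the
# orchestrator picks the feed up on its next run with no code changes.
new_feed = {
    "source": "crm.customer_payments",       # where the data comes from
    "entity": "customer_payments",           # persisted-staging target
    "transform_proc": "stg.load_payments",   # procedure the orchestrator calls
    "surface_view": "rpt.payment_trends",    # what dashboards read from
    "schedule": "daily",                     # cadence applied after testing
}

metadata_store = []  # stand-in for the framework's metadata database
metadata_store.append(new_feed)
print("registered feed:", new_feed["entity"])
```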

One major request was fulfilled in about a week and was up and running daily in production within a week and a half. The turnaround was fast because of the framework already in place.

Eventually, after a few weeks, the organization could better understand pandemic-related changes in the following.

  • Selling rates – Were they still meeting sales projections?
  • Extraneous costs – Was there a difference in overhead that should be noted?
  • Employee productivity – How were employees engaged while suddenly working remotely?
  • Credits and payments – Were there any slowdowns in customer payments?
  • Customer retention – Were customers still engaged in the business, or slowing down amidst the pandemic?

Understanding these changes allowed senior leadership to adjust company operations appropriately to balance out any losses or gains from such a quick change in the way business was being run. It also allowed them to see how healthy the company was as the pandemic continued and to make data-driven projections for the coming year.

Without the existing framework and data sourcing, an effort like this would have taken weeks to months (or never been completed) and would likely have been a manual, messy process for the organization.

The Tools

There is no silver bullet or magic tool that can accomplish the above. A response to this scenario required quite a bit of foresight, an already-established framework, the right questions from senior leadership, and tools that could deliver data in a timely fashion while handling its scale.

In this case, it was useful to have cloud-based and distributed computing systems for several reasons. Here are the general pieces used for the tool set, and some reasoning.

Data Lake

  • Used for persisted staging and transformed staging
  • A data lake is a useful file storage layer that can scale at a relatively low price. Several vendors offer a data lake solution, such as Amazon S3 on AWS or Azure Data Lake Storage on Microsoft Azure
  • The key with a data lake is to organize the files well and avoid creating a “data swamp.” A consistent layout lets metadata and other access mechanisms follow a pattern for ingestion and avoids confusion about where data lives in the lake (a path-convention sketch follows this list)
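
One common way to keep a lake organized is to have every path follow a single convention that metadata can generate mechanically. A small sketch, with hypothetical zone and entity names:

```python
from datetime import date

ZONES = {"raw", "persisted-staging", "transformed-staging", "curated"}

def lake_path(zone: str, source_system: str, entity: str, load_date: date) -> str:
    """Build a predictable lake path: zone/source/entity/date partition.

    A fixed convention lets metadata-driven jobs read and write files
    mechanically, which is what keeps a lake from turning into a
    "data swamp" of ad hoc folders.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source_system}/{entity}/load_date={load_date:%Y-%m-%d}/"

print(lake_path("persisted-staging", "erp", "sales_orders", date(2020, 4, 1)))
# persisted-staging/erp/sales_orders/load_date=2020-04-01/
```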

Massively Parallel Processing Database

  • Used for surfacing conformed, query-ready data to analysts and downstream reporting
  • A massively parallel processing (MPP) database spreads queries across many compute nodes, keeping large analytical workloads fast as data volumes grow
  • Systems such as Azure Synapse Analytics, Amazon Redshift and Snowflake fall into this category

Cloud ETL/ELT

  • Used for orchestration
  • Orchestrating and ordering how data is copied, moved and transformed is done through an ETL/ELT tool, which reads metadata to call procedures in a specific order without manual intervention (a small ordering sketch follows this list)
  • Most cloud-based data movement systems can run many operations in parallel, ingesting data and running transformations at high speed
  • While Azure Data Factory is a prevalent cloud tool, several others are available, such as AWS Glue and Matillion
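
As a rough sketch of what an orchestrator does with that metadata, dependency-ordered execution can be modeled as a graph. The step names below are hypothetical, and a tool such as Azure Data Factory manages this through its own pipeline definitions rather than code like this:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: each step maps to the steps it waits on.
pipeline = {
    "ingest_sales": set(),
    "ingest_inventory": set(),
    "transform_staging": {"ingest_sales", "ingest_inventory"},
    "surface_reports": {"transform_staging"},
}

# static_order() yields a valid execution order; steps with no dependency
# between them (the two ingests here) are candidates to run in parallel.
for step in TopologicalSorter(pipeline).static_order():
    print("run:", step)
```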

Cloud Data Transformation and Staging

  • Used for the data transformation layers of the process
  • While many tools try to be all-in-one platforms, it is best to find one that is particularly good at linking to the pieces of the framework, processing the data, and storing it for downstream use
  • Systems such as Databricks and Zepl can be useful here (a small transformation sketch follows)
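
For a flavor of this layer, here is a minimal PySpark sketch of the kind of job a Databricks notebook might run. The paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-sales-rollup").getOrCreate()

# Read raw extracts from persisted staging (hypothetical path and columns).
orders = spark.read.parquet("lake/persisted-staging/erp/sales_orders/")

# Roll daily order lines up to one row per day and region, the shape a
# downstream dashboard can consume directly.
daily_sales = (
    orders
    .where(F.col("order_status") == "complete")
    .groupBy("order_date", "region")
    .agg(
        F.sum("order_amount").alias("total_sales"),
        F.countDistinct("customer_id").alias("active_customers"),
    )
)

# Write to the curated zone for the surfacing layer to pick up.
daily_sales.write.mode("overwrite").parquet("lake/curated/daily_sales/")
```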

Live Data Visualization

  • Used for reports and dashboards
  • There are clear benefits to using visualization tools to show trends, daily snapshots, and transaction data to key decision-makers in the organization
  • A tool that plugs into the cloud platform and responds as data becomes available was key in this setup; having something like Microsoft’s Power BI or Tableau available is beneficial

Tool Usage Notes

Even the few tools noted above created an almost overwhelming slew of learning curves and connections to make things work. The organization in this instance studied what was available, projected the data volumes it would scale to over several years, spoke to vendors about pricing for the tools best suited to that scale, and kept in mind the skill set available within its workforce to implement the solution.

It is a best practice to keep cloud resources under “one roof,” such as Microsoft Azure or Amazon AWS, instead of mixing platforms. This allows for better ease of use, connectivity, and security.

Having all the proper tools in place for their day-to-day business and looking to the future positioned the organization to handle questions from senior leadership during the pandemic with a relatively quick turnaround.

Takeaways

There are several key takeaways in how this large corporation responded to a global pandemic using cloud-based and distributed computing tools to provide data to answer relevant questions for leadership.

  • If the framework and tools were not in place already, this would not have been nearly as successful
  • Before any data was provisioned, organization leaders had to ask the right questions
  • The tools used catered to this organization’s future success; a different type of organization may need different tools for success
  • The organization had enough resources to have the framework and tools in place before the pandemic hit

Because of the organization’s effective use of its data, it continued to meet expectations even after slowing down operations during the pandemic. Organization leaders also learned that their cloud framework can answer questions quickly and will be incredibly valuable as it matures, data grows, and the business evolves.

An operational data framework should handle not only what an organization has been doing and what it is slated to do, but also unexpected impacts; building with that in mind can be a powerful component of the design. That attitude assisted this organization immensely when a completely unforeseen, global threat came to its front door.


Thanks for reading.

Feel free to reach out to me at turner.kunkel@bakertilly.com or on LinkedIn with any comments or questions.
