Chop Chop #10
Another day, another layer
Welcome to ChopChop #10! We are now into the double digits! Let's pick up where we left off in ChopChop #9 and keep talking about this sweet data platform. Go back and check that one out to hear all about how the bronze layer works, because today we are looking at the silver layer. In our data lake, the silver layer is responsible for cleaning and processing data, ensuring that the raw data from the bronze layer comes out at a high quality.
For the silver layer, we leveraged the AWS Step Functions service, triggered by the EventBridge notification the bronze layer sends when the daily file has been successfully consolidated. Step Functions gives us a clear workflow to run, and new steps can easily be added without much change to the overall structure of the code. We can also do cool things like check whether steps succeeded or failed, or run steps in parallel.
We have kept the pipeline, as we are calling it, quite simple, but the complexity will come as the silver layer takes on more work. Currently, the pipeline has a wait step, an item-mapping step, a check-success step, and then either a success or a failure step. You can see how that looks from the nice diagram that the service provides below:
The wait step is essentially a 10-second buffer that ensures the pipeline doesn't run before things are ready to go. The item-mapping step is where the actual interesting stuff happens, which I will explain shortly. That step outputs either success or failure, and the check-success step looks at that output and sends us down the respective path. Currently nothing happens at either end, but examples of what could be added are a retry on failure, or notifications being sent for either outcome.
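The shape of the pipeline above can be sketched as an Amazon States Language definition. This is a minimal, illustrative version only: the state names, the Lambda ARN, and the `$.status` field are hypothetical stand-ins, not our real configuration.

```python
import json

# Illustrative ASL sketch of the silver pipeline: wait -> item-mapping ->
# check-success -> success/failure. ARN and field names are placeholders.
definition = {
    "Comment": "Silver layer pipeline (sketch)",
    "StartAt": "Wait",
    "States": {
        "Wait": {
            "Type": "Wait",
            "Seconds": 10,  # buffer so the pipeline doesn't start too soon
            "Next": "ItemMapping",
        },
        "ItemMapping": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:item-mapping",
            "Next": "CheckSuccess",
        },
        "CheckSuccess": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "success", "Next": "Success"}
            ],
            "Default": "Failure",
        },
        "Success": {"Type": "Succeed"},
        "Failure": {"Type": "Fail"},
    },
}

print(json.dumps(definition, indent=2))
```

Adding a new step later is just a matter of inserting another state and re-pointing the `Next` fields, which is a big part of why this structure is easy to grow.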
Let's get into the item-mapping, which is what the silver layer is really doing. This logic was originally in the app itself, but it was becoming a bit too complex to manage there, so now it lives on the data side of things. When a user creates an item, the app assigns it a unique id known as a UUID. A UUID is designed to be unique, of course, but that means when two users create essentially the same item, they end up with different UUIDs in their respective apps.
Our solution is a DynamoDB table that maps the name of an item, like Apples, to the first UUID that was assigned to it. If someone else then creates Apples, this step looks up the name, sees it is already in the table, and rewrites the UUID to the stored one. The output of the item-mapping step therefore guarantees that identical items share the same UUID. There are multiple ways to do this, but for now this is the simple one that works for us.
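The lookup logic can be boiled down to a few lines. This is a sketch only: a plain dict stands in for the DynamoDB table, and the function name is made up for illustration.

```python
import uuid

def map_item_uuid(name: str, item_uuid: str, table: dict) -> str:
    """Return the canonical UUID for an item name.

    `table` stands in for the DynamoDB mapping table (name -> first UUID seen).
    If the name is new, this UUID becomes the canonical one; otherwise the
    incoming UUID is replaced by the one already stored.
    """
    if name not in table:
        table[name] = item_uuid  # first writer wins
    return table[name]

# Two users create "Apples" independently, each with their own UUID.
mapping: dict = {}
first = map_item_uuid("Apples", str(uuid.uuid4()), mapping)
second = map_item_uuid("Apples", str(uuid.uuid4()), mapping)
assert first == second  # both now resolve to the same canonical UUID
```

In the real pipeline the dict lookup would be a DynamoDB read and conditional write, but the shape of the logic is the same.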
Some eagle-eyed readers may have noticed that because we match on name, Apple and Apples are treated as two different items, with different UUID values, even though we know they are not. That is the next version of this item-mapping for us to solve, so we can do our best to ensure high quality data is coming out of the silver layer and ready for the gold layer. It is an interesting problem, even more so when you add in that users can add items in any language they choose: not only are Apple and Apples the same item, but so is 苹果.
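To give a flavour of the singular/plural side of the problem, here is a deliberately naive normalisation sketch: lowercase the name and strip a trailing "s". This is not our planned approach, and it says nothing about cross-language matches like 苹果; it just shows the easy first step and hints at why the real solution is harder.

```python
def normalise(name: str) -> str:
    """Naive English-only normalisation: lowercase, trim, strip a plural 's'.

    Breaks quickly (e.g. "Glass" -> "glas") and does nothing for other
    languages, which is exactly why this is a hard problem to solve well.
    """
    n = name.strip().lower()
    if n.endswith("s") and len(n) > 3:
        n = n[:-1]
    return n
```

With this in place, `normalise("Apple")` and `normalise("Apples")` both come out as `"apple"`, so the two names would map to one UUID.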
We also want to expand the silver pipeline to ensure only the best data passes through. Think of things like checking for duplicated transactions, or verifying that the transactions we are seeing have been correctly processed into the inventory that the user sees. These steps can be added into the pipeline as we build that functionality out. All of this is to ensure that the gold layer is outputting the best results, so stay tuned to see what that layer is all about next time.
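As a rough sketch of what a duplicate-transaction check might look like, here is a simple pass that keeps the first occurrence of each transaction key. The field names (`user_id`, `item_uuid`, `timestamp`) are hypothetical, since this step hasn't been built yet.

```python
def dedupe_transactions(transactions: list) -> list:
    """Drop duplicate transactions, keyed on (user_id, item_uuid, timestamp).

    Field names are illustrative placeholders; the real schema may differ.
    Keeps the first occurrence of each key and preserves the input order.
    """
    seen = set()
    unique = []
    for tx in transactions:
        key = (tx["user_id"], tx["item_uuid"], tx["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(tx)
    return unique

txs = [
    {"user_id": "u1", "item_uuid": "a", "timestamp": "2024-01-01T10:00"},
    {"user_id": "u1", "item_uuid": "a", "timestamp": "2024-01-01T10:00"},  # dupe
    {"user_id": "u2", "item_uuid": "a", "timestamp": "2024-01-01T10:05"},
]
clean = dedupe_transactions(txs)
assert len(clean) == 2
```

A step like this drops straight into the Step Functions pipeline as another task between item-mapping and check-success.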
That's all for now, chop chop.