A LOOK INTO DATA ENGINEERING

A LOOK INTO DATA ENGINEERING

DATA CLEANING

Data Engineering can be summarized into three things despite the numerous software available.

The first is Extraction.

The second is Transformation.

The third is Loading.

The process is described as the E-T-L process. This is the basis on which all other data software is used.


DATA EXTRACTION

Data Extraction is the process of receiving data from one format into another more useful usually done for processing. For example, from binary codes to real numbers, which can be used for numerical calculations.

You may think, why the term extraction? If you are familiar with Excel spreadsheets, you might say why can't we format this to our desired format and go on. Truly, you might be right, but this is also a form of data extraction. Data extraction can be likened to squeezing an orange, as the orange changes form from solid to liquid (orange juice). Therefore, data extraction requires a sort of data conversion from one format to another.

For instance, at my former place of work, I received medical data from the CX department. I got different spreadsheets from the different shifts assigned over two weeks, and each shift used different dates and accounting formats. Some date formats were not date-like, but just strings. This required me to do some thorough data extraction to aid the automation process.

Note: You will always be faced with these challenges, which is why Data Structures and Algorithm will prepare you for these types of variant cases in your extraction

Now, I had to load all the spreadsheets into a function I built in python and had to perform data extraction. In reality, you might have multiple data sources especially if you source data via the web through web scraping, and data extraction will become more necessary for you to learn to be able to synthesize your data into one source.

From this analogy, it is clear that there are two types of data: Structured and Unstructured. In structured data, the data can be put directly in tools for necessary extraction, however, the work is more when dealing with unstructured data of the same magnitude. In unstructured, formatting has to be done first before using the tools for data extraction.

The use of databases has made data extraction easier.

Eventually, after the process of extraction is done, the data can either be stored in archives, transferred for transformation processes, or even analyzed directly.


DATA TRANSFORMATION

Data transformation serves as the middle phase of the ETL process, and it is the process of altering the format, structure, or even the values of data. Data transformation is a very common term in data warehousing. Think of a normal warehouse, where a particular commodity is stored, a data warehouse is similar to that. The only functional difference between a typical warehouse and a data warehouse is that in a data warehouse, data is the only commodity that is stored there. Truly, there are many data needs and it becomes burdensome, especially in a company where everyone has their specific data needs and as the data engineer you cannot attend to each of all these simultaneously. To put it in perspective, if you have to perform data extraction for 3 to 5 people in a day, it seems manageable; however, when this number increases to 200 per hour, we find out that a single data engineer can't solve all these needs. However, the use of data warehouses empowers the data engineer by allowing him to focus on just the transformation process, while the warehouse itself can simultaneously attend to the needs of each of the 200 users.

There are indeed many parts of data transformation, which will be discussed under the following subheadings:

Data Concatenation

Just like the name implies, data concatenation is the aggregation of different data points. Have you ever wondered how you get a username like Sydney729? Does that mean that there are 728 other Sydneys before you? Well, sometimes it does, but that is not our focus. The system scans your name and could count the position you are on the system's database based on the date you accessed their site. This number probably 729 will then be concatenated with your username. If this example was too complex to understand, let us look at a simpler case. Let us imagine that you were admitted to a degree program at a prestigious university. When filling in your details, you filled in your first name, your middle name, and your last name differently. The database will store the information like this:

No alt text provided for this image



Now, during graduation, I finished as the best student in my department. Now, the school's scholarship board wants to award me a gift. They have organized a ceremony and have printed my name and my picture on the front of the School's magazine according to their usual format. My name in the article is saved as ANUYAH, Sydney Oghenetega. Now, how is this possible? First, remember that the names were saved differently from the onset. This is where data concatenation comes in. In short, it is possible to join any data pair that you want to join.

Data Joining

This process is similar to concatenation, however, it is not the same. If you are familiar with SQL (Standard Query Language), which is a language for communicating with databases, the concept of joins will be familiar. So this class of joins refers to combining two or more tables to become one. Though this is very common with SQL, it can be achieved using Python, any high-level programming language, or even Power Query which is an add-on for Microsoft Excel.

Data joining is quite important, because we cannot always have all our data in one place. However, what can we do when we have bits and pieces scattered all over the place? We can easily join them together. So, next time you request data, remember that a lot of work went into it.

I recently opened a small store, and I have a Google Sheet File that saves my data automatically to the cloud. My small store is online! Yay! We have this table which I extracted from the file at random.

No alt text provided for this image


I also have another table that assists me in tracking the different orders a customer makes. I do this to ensure that I provide quality services within a 24-hour limit.


No alt text provided for this image


Since I wanted to be a little cryptic, I created codes for the status of my sale. Without knowing where the codes, are, you would not be able to understand what those figures mean. It is my little measure in adding data security.

No alt text provided for this image

Now, I want to export this into a business intelligence (BI) tool, however, I only need 1 table, but I also need all this information. This is where joining comes in.

This is rich, Sydney. I recently started my journey into Data Engineering and a lot of the things you mentioned here I have just been introduced to, but only theoretically. I can't wait to get to the practical aspects of it. Your article has helped me to further understand all I have learnt. By the way, don't you think there should be a part 2?

Beautiful and insightful write-up.

Very insightful writeup Sydney Anuyah. It was worthwhile to read

Great read, Learnt a bunch. Thanks for sharing.

This piece helped me understand more and easily about data processing

To view or add a comment, sign in

Others also viewed

Explore content categories