Data Ingestion Framework Simplified

In an enterprise setup, a data ingestion framework helps govern and manage ingestion efficiently. What is a data ingestion framework? Why do we need one? And what should you consider while building one? We will discuss these questions in this article.

1. What is a data ingestion framework?

A data ingestion framework is a software framework that supports various types of data ingestion through configuration settings alone, or with minimal code. Two terms, Feed and Run, appear throughout this article; understanding them will help you follow the rest of it.

A Feed is a setup in the framework that connects a source to a target. A Run is the activity performed when a Feed is triggered. Hence, a Feed can have many Runs, depending on how many times it was triggered.
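To make the Feed and Run relationship concrete, here is a minimal Python sketch. All class and field names are illustrative assumptions, not taken from any specific framework:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Run:
    """A single execution of a Feed."""
    run_id: int
    started_at: datetime
    status: str = "RUNNING"   # RUNNING -> SUCCESS / FAILED


@dataclass
class Feed:
    """Configuration that connects a source to a target."""
    name: str
    source: str
    target: str
    runs: list = field(default_factory=list)

    def trigger(self) -> Run:
        """Each trigger creates a new Run; one Feed can have many Runs."""
        run = Run(run_id=len(self.runs) + 1,
                  started_at=datetime.now(timezone.utc))
        self.runs.append(run)
        return run


feed = Feed(name="orders_daily", source="orders_db.orders", target="lake.orders")
feed.trigger()
feed.trigger()   # the same Feed, triggered twice, now has two Runs
```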

2. Why do we need a data ingestion framework?


Ingestion is a straightforward process, and you could simply write a piece of code to move data rather than build an expensive framework that calls the same code. Hence the question arises: why do we need a framework in the first place? Here are the four most important reasons.

2.1. Automation:

You will regularly have to fulfil ingestion requirements for the same source-to-target combination. An ingestion framework helps automate this.

2.2. Regulatory requirements:

Highly regulated industries, such as banking and pharma, must present audit logs of each activity in the data pipeline to regulators whenever requested. An ingestion framework will come to your rescue here.

2.3. Controls:

With an ingestion framework, you can control the activities performed on the production servers by granting the right access to the right roles.

2.4. Data integrity:

A framework enforces capturing the details required to maintain data integrity, thereby establishing the source-to-target data lineage for each Feed.

3. Recipe for a data ingestion framework:

Building an ingestion framework consists of many activities. Make sure you consider the following points while building one. The list is not exhaustive, but it will serve the purpose of building a basic framework.


3.1. GUI:

Develop a GUI so that you can set up a Feed, control activities, and democratise information about the Feed by providing the necessary access to users.

3.2. Know your Run:

Capture basic metadata, such as: How many Feeds are scheduled? How many Runs were successful? Where is the data coming from? Where is it stored on the target? What is the size of the data? Present it whenever needed.
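A sketch of the metadata record such a framework might capture for every Run is shown below; the field names are hypothetical and would vary per implementation:

```python
def capture_run_metadata(feed_name, source_path, target_path,
                         row_count, size_bytes, status):
    """Build the per-Run metadata record the framework keeps for reporting."""
    return {
        "feed": feed_name,
        "source": source_path,    # where the data came from
        "target": target_path,    # where it is stored on the target
        "rows": row_count,        # size of the data, in rows
        "size_mb": round(size_bytes / (1024 * 1024), 2),
        "status": status,         # SUCCESS / FAILED
    }


meta = capture_run_metadata("orders_daily", "orders_db.orders",
                            "s3://lake/orders/", 120_000, 52_428_800, "SUCCESS")
```

Persisting one such record per Run makes questions like "how many Runs succeeded?" a simple query over the metadata store.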

3.3. Ensure Quality:

Ensure that a Run is marked successful only after it passes the quality checks, e.g. the number of rows and columns is the same at both the source and the target.
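The row-and-column check from the example above can be sketched as a simple comparison that gates the Run status; this is a minimal illustration, with shapes given as (rows, columns) tuples:

```python
def passes_quality_check(source_shape, target_shape):
    """Pass only if row and column counts match end to end."""
    src_rows, src_cols = source_shape
    tgt_rows, tgt_cols = target_shape
    return src_rows == tgt_rows and src_cols == tgt_cols


# Mark the Run successful only after the check passes.
status = "SUCCESS" if passes_quality_check((1000, 12), (1000, 12)) else "FAILED"
```

Real frameworks typically layer further checks on top (checksums, null-rate thresholds), but the gating pattern is the same.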

3.4. Logging & Error Handling:

Capture all information related to a Run in a user-friendly manner so that the data engineer can troubleshoot a failed Run without having to make assumptions.
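One common way to achieve this is to wrap every Run in a logging-and-error-handling shell, using Python's standard `logging` module; the function names here are illustrative:

```python
import logging
import traceback

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s run=%(run_id)s %(message)s")


def execute_run(run_id, ingest_fn):
    """Run the ingestion callable, logging the outcome with full context."""
    log = logging.LoggerAdapter(logging.getLogger("ingestion"),
                                {"run_id": run_id})
    try:
        result = ingest_fn()
        log.info("run completed, rows=%s", result)
        return ("SUCCESS", result)
    except Exception as exc:
        # Record the full stack trace so the engineer need not guess the cause.
        log.error("run failed: %s\n%s", exc, traceback.format_exc())
        return ("FAILED", None)
```

Because the failure path captures the exception and stack trace alongside the Run id, the log alone is enough to start troubleshooting.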

3.5. Alerting mechanism:


Depending on the size of the data, ingestion can take several hours. Imagine how much better the teams feel if the framework monitors each Run and alerts them automatically when it fails, rather than someone monitoring manually. With alerts, teams can act as soon as they become aware of a failure.
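The monitoring loop can be as simple as scanning Run statuses and dispatching a message for each failure; `send_alert` here is a hypothetical callback that could wrap email, Slack, or a paging tool:

```python
def monitor_runs(runs, send_alert):
    """Scan Run statuses and fire one alert per failed Run."""
    failures = [r for r in runs if r["status"] == "FAILED"]
    for run in failures:
        send_alert(f"Feed {run['feed']} Run {run['id']} "
                   f"failed at {run['ended_at']}")
    return len(failures)


alerts = []
monitor_runs(
    [{"feed": "orders", "id": 7, "status": "FAILED", "ended_at": "02:15"},
     {"feed": "customers", "id": 3, "status": "SUCCESS", "ended_at": "02:20"}],
    send_alert=alerts.append,
)
```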

3.6. Versioning:

After you set up a Feed, you may have to amend it later based on business needs, for instance, adding or deleting a few tables. Hence, incorporate version control of Feeds in your framework so that you can trace changes and revert to older Feed settings when needed.
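Feed versioning can be sketched as an append-only history of configurations, where a revert simply promotes an old version to be the newest one; class and method names are assumptions for illustration:

```python
class VersionedFeed:
    """Keep every historical Feed configuration so changes can be traced."""

    def __init__(self, initial_config):
        self.versions = [initial_config]   # version 1 is the first entry

    @property
    def current(self):
        return self.versions[-1]

    def amend(self, **changes):
        """Record an amended configuration as a new version."""
        self.versions.append({**self.current, **changes})

    def revert(self, version_number):
        """Restore an earlier version by appending a copy as the newest one."""
        self.versions.append(dict(self.versions[version_number - 1]))


feed = VersionedFeed({"tables": ["orders"]})
feed.amend(tables=["orders", "refunds"])   # business asks for one more table
feed.revert(1)                             # and later wants the old settings back
```

Appending on revert (rather than deleting versions) preserves the full audit trail, which also helps with the regulatory requirements discussed earlier.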


3.7. Smart Rerun:

When a Run fails, your framework should be intelligent enough to rerun from the point of failure rather than starting all over again from the beginning.
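A common way to implement this is checkpointing: record each completed step, and on rerun skip everything already done. A minimal sketch, assuming the ingestion pipeline is an ordered list of named steps:

```python
def run_with_checkpoints(steps, checkpoint):
    """Execute steps in order, skipping any step already recorded as done,
    so a rerun resumes from the point of failure."""
    for name, fn in steps:
        if name in checkpoint:
            continue   # completed in an earlier attempt; do not redo
        fn()
        checkpoint.add(name)   # only reached if fn() did not raise


done = set()
log = []
steps = [("extract", lambda: log.append("extract")),
         ("load", lambda: log.append("load"))]
run_with_checkpoints(steps, done)
run_with_checkpoints(steps, done)   # a rerun re-executes nothing
```

In practice the checkpoint set would be persisted (e.g. to the framework's metadata store) so that it survives process restarts.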

3.8. Load Balance:

If you run your ingestion framework on on-premises servers, make sure you have a facility to distribute the processing across different compute resources. A framework without load balancing will pose a challenge as the number of Feeds grows over time.

3.9. Dashboards:

Last but not least, produce a dashboard that provides real-time statistics of the Runs in progress at any point in time.

------------------------------------------------------------------------------------------------------------

I have limited the scope of this article to structured data. I hope it provides a good understanding of data ingestion frameworks. Thanks for reading; please share your views and comments. The content of this article is purely my own view and in no way reflects the positions of my current or earlier organizations or vendor partners.

