Data Pipelines
A Data Pipeline is, at its simplest, the movement of data from one point to another. Its importance, however, is hard to overstate: without a reliable Data Pipeline, no decision can be taken on time, and we know ‘Time is Money’.
Data Processing Mechanisms
● Push: When source systems push data to target systems.
● Pull: When target systems pull data from source systems.
● Streaming: When data keeps flowing from source to target systems.
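The three mechanisms differ mainly in who initiates the transfer and how the data arrives. Here is a minimal sketch, assuming a hypothetical in-memory `Source` class and plain Python lists as targets (neither comes from any particular tool):

```python
from typing import Iterator, List


class Source:
    """Hypothetical in-memory source system holding a few records."""

    def __init__(self, rows: List[str]) -> None:
        self.rows = list(rows)

    def push_to(self, target: List[str]) -> None:
        # Push: the source initiates the transfer and writes into the target.
        target.extend(self.rows)

    def read_all(self) -> List[str]:
        # Pull: the source only answers; the target initiates the request.
        return list(self.rows)

    def stream(self) -> Iterator[str]:
        # Streaming: records flow one at a time instead of as one block.
        for row in self.rows:
            yield row


source = Source(["a", "b", "c"])

pushed: List[str] = []
source.push_to(pushed)                        # the source drives the transfer

pulled = source.read_all()                    # the target drives the transfer

streamed = [row for row in source.stream()]   # the target consumes a flow
```

In this toy example all three targets end up with the same records. What changes is which side is in control and whether the data arrives in one piece or record by record.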
Data Loading Types
● Bulk Loading: This means processing the whole set of data from the source every time. We also call it a ‘Data Dump’.
● CDC (Change Data Capture) Loading: This means processing only the data that has changed in the source since the last load.
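The difference between the two loading types can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the changes are detected by comparing two snapshots, which is a simplification (production CDC tools typically read the database's transaction log instead).

```python
from typing import Dict, List, Tuple

Table = Dict[int, str]  # hypothetical table: primary key -> value


def bulk_load(source: Table) -> Table:
    # Bulk loading: copy the whole source, regardless of what changed.
    return dict(source)


def capture_changes(previous: Table, current: Table) -> Tuple[Table, List[int]]:
    # CDC (simplified): keep only rows that are new or modified, plus deleted keys.
    upserts = {key: value for key, value in current.items()
               if previous.get(key) != value}
    deletes = [key for key in previous if key not in current]
    return upserts, deletes


def apply_changes(target: Table, upserts: Table, deletes: List[int]) -> None:
    # Apply only the captured changes to the target.
    target.update(upserts)
    for key in deletes:
        target.pop(key, None)


yesterday = {1: "alice", 2: "bob", 3: "carol"}
today = {1: "alice", 2: "bobby", 4: "dave"}

target = bulk_load(yesterday)                        # first load: everything
upserts, deletes = capture_changes(yesterday, today)
apply_changes(target, upserts, deletes)              # later loads: changes only
```

Here the bulk load moves three rows, while the CDC step moves only two upserts and one delete, yet the target still ends up identical to the source.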
Types of Data Pipelines
● Batch processing means processing data at fixed intervals, e.g., every 24 hours or maybe every 12 hours. The better the hardware in place, the shorter the batch interval can be.
● Real-Time processing (event-based) means that the moment there is a change in the source system, that piece of data is processed and sent to the target, e.g., the moment a post is liked on Facebook or LinkedIn, the event moves to downstream systems. To support real-time processing from OLTP systems, CDC plays a vital role, because in real time we expect only a very small amount of data to travel at once.
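The contrast between the two pipeline types can be shown with a toy example. The class names below are hypothetical, and the `run()` call stands in for whatever scheduler would trigger a batch every 12 or 24 hours:

```python
from typing import List


class BatchPipeline:
    """Collects changes and moves them only when the scheduled run fires."""

    def __init__(self) -> None:
        self.buffer: List[str] = []
        self.target: List[str] = []

    def on_change(self, event: str) -> None:
        # Changes wait in a buffer until the next run.
        self.buffer.append(event)

    def run(self) -> int:
        # Called by a (hypothetical) scheduler at the end of each interval.
        moved = len(self.buffer)
        self.target.extend(self.buffer)
        self.buffer = []
        return moved


class RealTimePipeline:
    """Moves each change to the target the moment it happens."""

    def __init__(self) -> None:
        self.target: List[str] = []

    def on_change(self, event: str) -> None:
        # One small event travels immediately.
        self.target.append(event)


batch = BatchPipeline()
real_time = RealTimePipeline()

for event in ["like:post-1", "like:post-2", "like:post-3"]:
    batch.on_change(event)
    real_time.on_change(event)

seen_before_run = list(batch.target)   # the batch target has nothing yet
moved = batch.run()                    # the scheduled run moves all three at once
```

Before the run, the real-time target already holds every event while the batch target is still empty. That delay is the price paid for moving the data in one larger, less frequent job.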
Let us not forget that businesses have been processing data verbally, on leaves, or on paper since the dawn of time. It was the KPI of getting data from one point to another for decision making within the shortest possible time that drove technology to the level we see today. The earlier the data reaches the decision-maker, the more timely the decision can be.
To serve that goal, databases, data warehouses, data lakes, BI tools, data science, etc. were introduced. In fact, whole data solutions came into being just to get the right information to the right place at the right time.
A few aspects to keep in mind while creating Data Pipelines: always make sure that Data Lineage, a Data Catalog, and Data Governance standards and policies are in place; otherwise, a Data Swamp is the future.
Cheers.