Data Processing Over the Decade
This article is about the learnings, challenges, and how we solved them over a decade of my journey in different stages of my career being a Developer, Lead, Manager, Java Developer, Database Developer, DataWarehouse/ETL Developer, BI/Report Developer, Data Modeler, Database Administrator, Database Architect, Data Analyst , Product Analyst, Data Engineer etc and the journey continues.
Use case 1 - OLTP vs OLAP
Early Days of my career as a software developer wanted to retrieve the data from the same database as we are recording on it. Although we are Able to do it in some cases but largely it made our application down during retrieval of various reports such as monthly, yearly report etc.
we need to copy all the data from source database i.e, OLTP (OnLine Transaction Processing) to a new Database i.e OnLine Analytical Processing (OLAP) Database. which later on we have used for all the reporting requirements.
OLTP is good for recording the transaction. It's not meant for retrieval or analysis of large amount of the data and can be considered as a write only Database.
OLAP can be used for Business Intelligence reporting, analysis purpose and can consider as a read only Database.
Recommended by LinkedIn
Use case 2 - ETL
We have started copying the data from OLTP to OLAP in every one hour. Extract the transactions happened during last hour from OLTP Schema and copied into OLAP Database schema based on the Modified Date column. But we have observed sometime the operation happened effortlessly and other times OLTP database got locked , retrieving the transactions took time, Application slowness and the operation was again a never ending task.
We have changed the Extract strategy from Hourly to Daily. Instead of retrieving the transactions on hourly basis we have retrieved it on daily basis as a batch during mid night which fetched all the transactions happened on previous day.
Data Extraction from OLTP should not be done during lot of writes happening on the database i.e termed as peak hours. Instead of it we can select the data in off peak hours when there are less write typically in our case i.e, MidNight. We need to make sure the column's used in filter criteria needs to be indexed properly to make the data retrieval faster. Although now there are read only replicas and other different strategies come up for data retrieval over the period of time.
Note :- The above is few of the basic problem statements will continue to share more use cases in a series of articles.
Nice article Dipti