Processing CDC and SCD Type-2 for sources Without Effective Date time- Hybrid approach

Processing CDC and SCD Type-2 for sources Without Effective Date time- Hybrid approach

INTRODUCTION

In a data warehousing system Change Data capture (CDC) plays an important part not just in making the data warehouse (DWH) aware of the change but also providing a means of flowing the change to the DWH marts and reporting tables so that we see the current and latest version of the truth. This together with Slowly Changing dimensions (SCD) create a cycle which runs the DWH and provides valuable insights in the history and decision making future. What if the source has no CDC? It would be an ETL nightmare to identify the exact change and report the absolute truth.

This paper focuses on such sources which DO NOT have CDC in their sources and need to do SCD Type 2 to process incremental change, maintain history of the records without worrying about data duplications and increased processing times.

Often it is seen that these two processes are separated and applied at different stages of the DWH, e.g. CDC is applied at the source area like the DBMS which will give you the changed records in the Staging area (often a stored process or a trigger event applied on the schema will do it.) whereas the SCD is done in the Transform area where we act on the changed records to process them as per the business rules.

If these two different processes can be combined in just a single process where just one single transform does both jobs of “identifying the change” and “applying the change to the DWH” then we can save significant processing times and value resources of the system doing all in in one pass of the run. Hence, I came up with a Hybrid SCD with CDC approach for this.

USE CASES

I will consider the below use cases which we will solve the CDC and applying SCD by the our hybrid approach,

  1. Sources WITH Effective Dates – Often CDC is dependent on the last updated date of the record in the source system. This together with Business Start and End dates can identify the changed records or just the update date will suffice.
  2. Sources WITHOUT Effective Dates – There is virtually no scope of identifying the changed records as the source system is constantly refreshing the changes instead of inserting a new record for the change. This would be like starting off with a clean slate before any change is made. The changed data is lost completely unless we have a DWH to look for a change here.
  3. Sources where records are soft deleted from the source.

 

THE DIFFERENTIATOR

                                 Figure 1: Hybrid Execution Process

For the runs where there is no CDC,

  • Identify the UPDATED records and NEW
  • For the UPDATED records change the Valid Start /End Dates and update Current Record Indicator to ‘1’
  • Close out the records in Dimension which have updates.
  • For the NEW records add Valid Start and End Dates with Current Record indicator set to ‘1’ indicating it is the latest record
  • Generate new Surrogate Keys (SK) for the records in 1 and 1.2.
  • Insert the NEW and UPDATED records in the Dimension table.

For the runs where there is a CDC – The CDC compare process,

I have created 2 tables called CDC_Updated and CDC_New, these tables hold daily CDC’s for new and updated records, I compare the daily CDC with these tables to continue processing or to exit if the CDC has already been processed (flowchart as below)

Compare Change Data Capture (CDC) 

                                  Figure 2: Compare CDC tables

PERFORMANCE COMPARISON

The below chart shows the performance comparison of the SAS® Data Integration Studio SCD Type 2 transform against the Hybrid. It is clear that there is a large improvement in performance with the hybrid transform as compared to the regular SCD Type 2, in addition to get better audit trail features.

                                        Figure 3: Duration vs CPU Time

                                         Figure 4: Current vs System I/O

NOTE: I am outlining an approaches which can be taken to identify CDC and apply it to the data warehouse for processing as a SCD Type-2 process. This process takes less than 2 minutes to process 2.8 million records

We can not only process CDC but also do an Audit trail and find missing records, which has huge applications in any industry viz. Anti-Money Laundering (AML), Banking applications, Insurance, Intelligence to name a few.

To view or add a comment, sign in

More articles by Vishant Bhat

Others also viewed

Explore content categories