Implementing SCD Type 2 in PySpark — A Practical Guide for Data Engineers
One of the key concepts every Data Engineer should get comfortable with is Slowly Changing Dimension (SCD) Type 2.
It’s essential when you're managing historical data — whether it's customer profiles, product details, or any dimensional data that evolves over time.
SCD Type 2 doesn’t just store the current state; it preserves a complete history of changes.
I’ll admit — I struggled with SCD Type 2 when I first encountered it.
So, if you're in the same boat, this breakdown might help you understand it better.
What is SCD Type 2?
Instead of updating or deleting records when changes occur, SCD Type 2 preserves history by:
- Inserting a new row for each change, with a fresh start_date
- Expiring the old row (setting end_date and is_current = False) instead of overwriting it
- Keeping exactly one current row per key (is_current = True)
My Step-by-Step Implementation in PySpark
Using PySpark, here’s how I tackled it:
1. Load the latest snapshot
This is your incoming source data — the current view of customers, products, etc.
2. Join with existing historical records
These records have columns like start_date, end_date, and is_current.
- C001 exists in both, but the address has changed (LA vs NY).
- C002 exists in both and is unchanged.
- C004 is new — it doesn’t exist in the historical target_df, so all tgt.* fields are null.
- C003 was deleted — it no longer exists in the source, so all src.* fields are null.
Compare and categorize changes:
3. New records
Customers/products not seen before → Insert with is_current = True.
4. Changed records
Existing IDs with different attributes (like an address change) →
→ Expire the old record (is_current = False, set end_date)
→ Insert the new version (is_current = True, new start_date)
5. Deleted records
IDs no longer in the latest snapshot → Expire them by setting is_current = False and updating end_date.
6. Unchanged records
Keep them as-is — no action needed.
Use unionByName() instead of union() when combining all your processed DataFrames. It matches columns by name (not by position), making your pipeline safer and schema-resilient.
Let me know if you'd like a deep dive into each step — happy to help!
#DataEngineering #PySpark #SCDType2 #ETL #Databricks #BigData #Azure #WomenInTech