Implementing SCD Type 2 in PySpark — A Practical Guide for Data Engineers

One of the key concepts every Data Engineer should get comfortable with is Slowly Changing Dimension (SCD) Type 2.

It’s essential when you're managing historical data — whether it's customer profiles, product details, or any dimensional data that evolves over time.

SCD Type 2 doesn’t just store the current state; it preserves a complete history of changes.


I’ll admit — I struggled with SCD Type 2 when I first encountered it.

So, if you're in the same boat, this breakdown might help you understand it better.

What is SCD Type 2?

Instead of updating or deleting records when changes occur, SCD Type 2 preserves history by:

  • Storing previous versions
  • Tracking the validity of each version using start_date and end_date
  • Marking the active record with a flag like is_current = True


My Step-by-Step Implementation in PySpark

Using PySpark, here’s how I tackled it:

1. Load the latest snapshot

This is your incoming source data — the current view of customers, products, etc.


2. Join with existing historical records

These records have columns like start_date, end_date, and is_current.



C001 exists in both, but the address has changed (LA vs NY).

C002 exists and is the same (no changes).

C004 is new — it doesn’t exist in the historical target_df, so all tgt.* fields are null.

C003 was deleted upstream: it exists only in target_df and no longer appears in the source.

Compare and categorize changes:

3. New records

Customers/products not seen before → Insert with is_current = True.


4. Changed records

Existing IDs but with different attributes (like address change) →

→ Expire the old record (is_current = False, set end_date)

→ Insert the new version (is_current = True, new start_date)


5. Deleted records

IDs no longer in the latest snapshot → Expire them by setting is_current = False and updating end_date.


6. Unchanged records

Keep them as-is; no action needed.


Use unionByName() instead of union() when combining all your processed DataFrames. It matches columns by name (not by position), making your pipeline safer and schema-resilient.


Let me know if you'd like a deep dive into each step — happy to help!

#DataEngineering #PySpark #SCDType2 #ETL #Databricks #BigData #Azure #WomenInTech


More articles by Surabhi Sinha
