Implementing SCD Type 2 in PySpark — A Practical Guide for Data Engineers

One of the key concepts every Data Engineer should get comfortable with is Slowly Changing Dimension (SCD) Type 2.

It’s essential when you're managing historical data — whether it's customer profiles, product details, or any dimensional data that evolves over time.

SCD Type 2 doesn’t just store the current state; it preserves a complete history of changes.


I’ll admit — I struggled with SCD Type 2 when I first encountered it.

So, if you're in the same boat, this breakdown might help you understand it better.

What is SCD Type 2?

Instead of updating or deleting records when changes occur, SCD Type 2 preserves history by:

  • Storing previous versions
  • Tracking the validity of each version using start_date and end_date
  • Marking the active record with a flag like is_current = True


My Step-by-Step Implementation in PySpark

Using PySpark, here’s how I tackled it:

1. Load the latest snapshot

This is your incoming source data — the current view of customers, products, etc.


2. Join with existing historical records

These records have columns like start_date, end_date, and is_current.



C001 exists in both, but the address has changed (LA vs NY).

C002 exists and is the same (no changes).

C004 is new — it doesn’t exist in the historical target_df, so all tgt.* fields are null.

C003 was deleted upstream: it exists only in target_df and no longer appears in the source.

Compare and categorize changes:

3. New records

Customers/products not seen before → Insert with is_current = True.


4. Changed records

Existing IDs but with different attributes (like address change) →

→ Expire the old record (is_current = False, set end_date)

→ Insert the new version (is_current = True, new start_date)


5. Deleted records

IDs no longer in the latest snapshot → Expire them by setting is_current = False and updating end_date.


6. Unchanged records

Keep them as-is; no action needed.


Use unionByName() instead of union() when combining all your processed DataFrames. It matches columns by name (not by position), making your pipeline safer and schema-resilient.


Let me know if you'd like a deep dive into each step — happy to help!

#DataEngineering #PySpark #SCDType2 #ETL #Databricks #BigData #Azure #WomenInTech


More articles by Surabhi Sinha
