How to Solve a Complex Data Engineering Merge Problem with #Snowflake “MERGE ALL BY NAME”
Snowflake recently enhanced the MERGE statement with an ALL BY NAME option. So, I decided to try it out and see how it really enables data engineers to leverage the new feature.
Maintaining MERGE logic in ELT pipelines has always been a slow, error-prone process, especially when schemas evolve frequently. With constant upstream changes, downstream analytics teams like ours are often left catching up, ensuring every transformation still aligns with the latest source structure.
How It Used to Be – The Old Way
Previously, each MERGE meant listing every column explicitly:
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET
t.col1 = s.col1,
t.col2 = s.col2
WHEN NOT MATCHED THEN
INSERT (col1, col2)
VALUES (s.col1, s.col2);
Any new column upstream meant editing MERGE scripts by hand, slowing down agile data operations.
Snowflake's New Way – MERGE ALL BY NAME
Snowflake’s most recent release introduced MERGE ALL BY NAME, which maps columns automatically based on their names:
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE
ALL BY NAME
WHEN NOT MATCHED THEN
INSERT
ALL BY NAME;
That's it! No mapping lists. No worries about column order. As long as the source and target share the same column names, the merge just works.
Cheers, happy ending... wait, not quite yet! 😄
Real-World Challenge – Duplicate Source Rows
Well, that's not the whole story in real-world scenarios. Every data engineer knows you can't fully trust source data that claims to have no duplicates; that's a nightmare waiting to break your pipelines if you are not prepared!
Most of the time, upstream sources send multiple rows for the same ID - especially when data is aggregated from various systems and combined together.
You end up with multiple source rows matching the same target row, and Snowflake rejects the statement as a nondeterministic merge (unless the ERROR_ON_NONDETERMINISTIC_MERGE parameter has been disabled).
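Before fixing anything, it helps to confirm the problem. Here is a minimal sketch, reusing the `source` table and `id` key from the examples above, that surfaces the offending keys:

```sql
-- List every ID that appears more than once in the source,
-- i.e. the rows that would make the MERGE nondeterministic
SELECT id, COUNT(*) AS row_count
FROM source
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY row_count DESC;
```

If this returns any rows, the plain MERGE will fail (or silently pick an arbitrary row, depending on session settings), so deduplication is required first.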
The solution was to deduplicate before merging:
WITH deduped AS (
SELECT *
FROM source
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) = 1
)
MERGE INTO target
USING deduped
ON target.id = deduped.id
WHEN MATCHED THEN UPDATE ALL BY NAME
WHEN NOT MATCHED THEN INSERT ALL BY NAME;
Additional Scenarios to Watch
Schema Drift
Matching by name won't work if the source has extra columns or the target is missing some. Always validate schemas before merging.
Deeper schema-drift handling is a whole different topic to discuss. Maybe another blog post on this specific topic later!
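As a lightweight pre-merge validation, you can diff column names via INFORMATION_SCHEMA. This is a sketch assuming both tables live in the current database and schema (table names are stored uppercase by default in Snowflake):

```sql
-- Columns present in SOURCE but missing from TARGET;
-- swap the two halves for the mirror check
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'SOURCE'
MINUS
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'TARGET';
```

An empty result from both directions means the name-based mapping is safe to run.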
CDC Pipelines
For change-data-capture tasks, handle deletes explicitly in the MERGE statement, based on each source record's delete status:
WHEN MATCHED AND s.operation = 'DELETE' THEN DELETE
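Putting it together, a CDC-style merge might look like the sketch below. Note that `operation` is a hypothetical flag column assumed to exist in the source; per the name-matching rule discussed earlier, ALL BY NAME would expect it in the target as well, otherwise you would fall back to explicit column lists for this branch:

```sql
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
-- Deletes first, so a matched delete record is not updated instead
WHEN MATCHED AND s.operation = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE ALL BY NAME
-- Don't resurrect rows that arrive already flagged as deleted
WHEN NOT MATCHED AND s.operation <> 'DELETE' THEN INSERT ALL BY NAME;
```

Clause order matters here: Snowflake evaluates the WHEN MATCHED branches in order, so the DELETE condition must come before the unconditional UPDATE.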
🔑 Key Lessons
Snowflake’s evolution is making data engineering more efficient. Still, the responsibility lies with us to design pipelines that handle edge cases and keep production reliable.
#Snowflake #DataEngineering #ELT #SQL #CloudDataWarehouse #SnowflakeRelease #DataPipeline #DataArchitecture #DataOps #SchemaDrift #ChangeDataCapture #CDC #MergeByName #ETL #AzureData #RetailAnalytics #dbt #DataEngineerLife