How to Solve a Complex Data Engineering Merge Problem with #Snowflake “MERGE ALL BY NAME”


Snowflake recently enhanced the MERGE statement with an ALL BY NAME option, so I decided to try it out and see how much it really helps data engineers in practice.

Maintaining MERGE logic in ELT pipelines has always been a slow, error-prone process, especially when schemas evolve frequently. With constant upstream changes, downstream analytics teams like ours are often left catching up, ensuring every transformation still aligns with the latest source structure.

The Old Way – Explicit Column Lists

Previously, each MERGE meant listing every column explicitly:

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN 
  UPDATE SET 
    t.col1 = s.col1,
    t.col2 = s.col2
WHEN NOT MATCHED THEN 
  INSERT (col1, col2)
  VALUES (s.col1, s.col2);        

Any new column upstream meant editing MERGE scripts by hand, slowing down agile data operations.
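
For example, if upstream added a (hypothetical) col3, the same statement had to be touched in three places:

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET
    t.col1 = s.col1,
    t.col2 = s.col2,
    t.col3 = s.col3                  -- edit 1: new assignment
WHEN NOT MATCHED THEN
  INSERT (col1, col2, col3)          -- edit 2: column list
  VALUES (s.col1, s.col2, s.col3);   -- edit 3: values list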

The New Way – MERGE ALL BY NAME

A recent Snowflake release introduced MERGE ALL BY NAME, which maps columns automatically by name:

MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE ALL BY NAME
WHEN NOT MATCHED THEN
  INSERT ALL BY NAME;


That's it! No mapping lists. No worrying about column order. As long as the source and target share the same column names, the merge just works.
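
Here's a minimal sketch you can run to try it (table and column names are illustrative, borrowed from the example above):

CREATE OR REPLACE TABLE target (id INT, col1 STRING, col2 STRING);
CREATE OR REPLACE TABLE source (id INT, col1 STRING, col2 STRING);

INSERT INTO target VALUES (1, 'a', 'b');
INSERT INTO source VALUES (1, 'a2', 'b2'), (2, 'c', 'd');

MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE ALL BY NAME
WHEN NOT MATCHED THEN INSERT ALL BY NAME;

-- id 1 is updated, id 2 is inserted, with no column list anywhere
SELECT * FROM target ORDER BY id;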

Cheers, happy ending... wait, not quite yet! 😄

Real-World Challenge – Duplicate Source Rows

Well, that's not the whole story in real-world scenarios. Every data engineer knows you can't fully trust source data that claims to have no duplicates; that's a nightmare waiting to break your pipelines if you're not prepared!

More often than not, upstream sources send multiple rows for the same ID, especially when data is aggregated from several systems.

You end up with multiple source rows matching the same target row, and by default Snowflake rejects the statement as a nondeterministic merge.
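
With the toy tables above, this is easy to reproduce (a sketch; exact error text may vary by version):

-- id 2 now appears twice in the source
INSERT INTO source VALUES (2, 'x', 'y');

MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE ALL BY NAME
WHEN NOT MATCHED THEN INSERT ALL BY NAME;
-- Fails as a nondeterministic merge while the session parameter
-- ERROR_ON_NONDETERMINISTIC_MERGE is TRUE (the default).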

The solution was to deduplicate before merging:

MERGE INTO target AS t
USING (
  -- keep only the most recent row per id
  SELECT *
  FROM source
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) = 1
) AS deduped
ON t.id = deduped.id
WHEN MATCHED THEN UPDATE ALL BY NAME
WHEN NOT MATCHED THEN INSERT ALL BY NAME;
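
A quick sanity check also helps; this plain aggregate flags any IDs that would collide in the merge:

SELECT id, COUNT(*) AS row_count
FROM source
GROUP BY id
HAVING COUNT(*) > 1;   -- every ID returned here needs deduplication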


Additional Scenarios to Watch

Schema Drift

Matching by name won't work if the source has extra columns or the target is missing some. Always validate schemas before merging.
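
One way to validate is to diff the column sets in INFORMATION_SCHEMA (a sketch; add database/schema filters for your environment, and note that unquoted Snowflake identifiers are stored upper-case):

-- Columns present in SOURCE but missing from TARGET
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'SOURCE'
MINUS
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'TARGET';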

That's a whole different topic in itself. Maybe another blog post on that later!

CDC Pipelines

For change-data-capture tasks, handle deletes explicitly in the MERGE statement, based on each source record's delete status:

WHEN MATCHED AND s.operation = 'DELETE' THEN DELETE        
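
Putting it together, a CDC-style merge might look like the sketch below. The operation column is an assumed CDC flag, and because ALL BY NAME expects matching column sets, this assumes the target carries that column too; otherwise, fall back to explicit column lists for the update/insert branches.

-- 'operation' is an assumed CDC flag ('INSERT' / 'UPDATE' / 'DELETE')
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED AND s.operation = 'DELETE' THEN
  DELETE                       -- tombstone rows remove the target row
WHEN MATCHED THEN
  UPDATE ALL BY NAME           -- everything else is an update
WHEN NOT MATCHED AND s.operation <> 'DELETE' THEN
  INSERT ALL BY NAME;          -- never insert a row we're told to delete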


🔑 Key Lessons

  1. MERGE ALL BY NAME eliminates tedious column-mapping, saving hours in maintenance.
  2. Ensure source and target schemas align - column names must match exactly for a seamless merge.
  3. Always deduplicate or aggregate source data to prevent merge conflicts.
  4. In production, plan ahead for schema drift and CDC logic - handle them explicitly to keep pipelines resilient.

Snowflake’s evolution is making data engineering more efficient. Still, the responsibility lies with us to design pipelines that handle edge cases and keep production reliable.

#Snowflake #DataEngineering #ELT #SQL #CloudDataWarehouse #SnowflakeRelease #DataPipeline #DataArchitecture #DataOps #SchemaDrift #ChangeDataCapture #CDC #MergeByName #ETL #AzureData #RetailAnalytics #dbt #DataEngineerLife
