I recently built a Contact trigger in Salesforce to prevent duplicate email entries and improve data quality. The initial implementation identified existing emails and blocked new records with the same value using Apex trigger logic, ensuring duplicate Contacts are not created, which is a common real-world problem in CRM systems.

After completing the implementation, I reviewed the solution and identified several improvements that make it more efficient, scalable, and aligned with best practices used in production environments:

- Instead of querying all Contact records, filter only the required records with a selective query. This reduces unnecessary data processing and helps avoid governor limit issues.
- Use a Set instead of a List for storing emails. Lookups against a Set are much faster, which matters when working with large datasets.
- Handle duplicates within the same transaction. In bulk operations, multiple records are inserted together, so duplicates must be validated not only against the database but also within the incoming batch itself.
- Normalize email values to lowercase so comparisons are case-insensitive, which is critical for accurate duplicate detection.
- Make the solution fully bulk-safe so it performs reliably whether processing a single record or hundreds of records at once.

This exercise helped me understand the difference between writing code that works and writing code that is optimized for real-world scalability and performance. It also reinforced the importance of thinking about governor limits, data structures, and efficient querying in Salesforce development. Moving forward, I am also exploring how this approach compares with Salesforce Duplicate Rules and when declarative features are a better fit than custom Apex logic.

#Salesforce #Apex #CRM #SoftwareDevelopment #Programming #Developer #Coding #Tech #Learning #DataQuality #CloudComputing
Deduplication logic for user emails
Explore top LinkedIn content from expert professionals.
Summary
Deduplication logic for user emails means identifying and removing repeated email entries in datasets or CRM systems, so each user is represented only once. This helps maintain clean data, avoids issues with automation and reporting, and ensures compliance with privacy rules.
- Normalize email format: Always convert email addresses to a consistent case, such as lowercase, to catch duplicates that differ only by letter case.
- Choose smart matching keys: Use email alongside other fields like name or phone number to flag duplicates more accurately, especially when handling complex datasets.
- Handle duplicates in bulk: Design your deduplication process to check for repeated emails both in the current batch and in existing records, so you don’t miss duplicates during large imports (see the SQL sketch after this list).
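
A minimal SQL sketch of the normalization and in-batch dedup tips above; the `incoming_contacts` table and `updated_at` column are placeholder names for illustration:

```sql
-- Normalize the email, then keep one row per address: the most recent one.
-- incoming_contacts, email, and updated_at are placeholder names.
WITH normalized AS (
    SELECT
        *,
        LOWER(TRIM(email)) AS email_key   -- case- and whitespace-insensitive key
    FROM incoming_contacts
),
ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY email_key        -- catches duplicates within the batch
            ORDER BY updated_at DESC      -- keep the newest record
        ) AS rn
    FROM normalized
)
SELECT *
FROM ranked
WHERE rn = 1;
```

Checking against records that are already loaded would typically add a join or anti-join against the existing contacts table on the same normalized key.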
**SQL interview question: How to identify and delete duplicates (with code)**

Handling duplicate records in SQL is a common task, especially when dealing with raw or legacy datasets, and interviewers love to ask about it. Here are 3 reliable methods to identify and delete duplicates using SQL:

**1. Using ROW_NUMBER() (best for complex duplicate conditions)**

```sql
WITH CTE AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM CTE WHERE rn > 1);
```

✅ Why this works: it keeps the first occurrence (based on id) and removes the rest. Super handy when deduplication depends on multiple columns.

**2. Using GROUP BY with MIN() or MAX()**

```sql
DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id)
    FROM users
    GROUP BY name, email
);
```

✅ Why this works: best for simple datasets with clear duplicate keys; it just keeps the record with the smallest id.

**3. Using a SELF JOIN**

```sql
DELETE u1
FROM users u1
JOIN users u2
  ON u1.name = u2.name
 AND u1.email = u2.email
WHERE u1.id > u2.id;
```

✅ Why this works: no CTE required; straightforward and readable.

How would you answer this? Comment below!

#SQL #interviewquestions

---

**Every advanced SQL interview question has the same answer: window functions.**

YoY growth, top-N per group, running totals, percent contribution: all of them come down to window functions. Here are the 5 patterns that cover 90% of real analytics work. No theory, just the queries you'll actually use.

**Pattern 1: Period-over-period comparison**
- `revenue - LAG(revenue) OVER (ORDER BY month)`
- Month-over-month, week-over-week, YoY in one line. No self-joins.

**Pattern 2: Top-N per group**
- `ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)`
- Filter `WHERE rn <= 3` in a CTE. Top 3 products per region, top 5 customers per segment: same pattern every time.

**Pattern 3: Running totals & moving averages**
- `SUM(revenue) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)`
- `AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`
- Cumulative revenue and a 7-day rolling average. No temp tables, no loops.

**Pattern 4: Percentage of total**
- `revenue * 100.0 / SUM(revenue) OVER ()`
- Each row's share of overall revenue. `OVER ()` with empty parentheses treats the entire result set as one partition.

**Pattern 5: Deduplication**
- `ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC)`
- Keep `rn = 1`. The cleanest way to deduplicate without DELETE or DISTINCT ON.

**Three things to remember:**
- Window functions run after WHERE and GROUP BY, so you can't filter on them directly; wrap the query in a CTE.
- LAST_VALUE with the default frame only sees up to the current row; always set ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
- Every major platform supports these: BigQuery, Snowflake, Postgres, Databricks, Redshift.

If you're writing self-joins or correlated subqueries for any of the above, you're writing 5x the SQL you need to. Learn these 5 patterns; they'll cover most of what analytics actually asks for.

#SQL #DataAnalytics #WindowFunctions #DataEngineering #Analytics
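
To make Pattern 5 concrete, here is a minimal sketch of the CTE wrapper described above, assuming a hypothetical `users` table with the `email` and `updated_at` columns the pattern mentions:

```sql
-- Keep only the most recently updated row per email address.
-- Window functions can't appear in WHERE, so rank in a CTE first.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY email           -- one group per email
            ORDER BY updated_at DESC     -- newest record gets rn = 1
        ) AS rn
    FROM users
)
SELECT *
FROM ranked
WHERE rn = 1;   -- drop every older duplicate
```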

---

**Duplicate Leads in Salesforce? It's not just messy, it's dangerous!**

As a Salesforce Architect, one of the most underestimated pain points I see across orgs is poor duplicate management. It silently:
- ⚠️ Breaks automation
- 📉 Skews reporting
- ❌ Slows down sales
- 🛑 Violates GDPR rules

Last week, I worked on optimizing a lead flow where 80,000+ leads were sitting unvalidated, many of them potential Contacts already in the system. Here's how we tackled it:

📌 Step 1: Smart duplicate check
→ We built a Flow that compares incoming Leads to existing Contacts and Leads using fuzzy logic (email, phone, name, etc.).

📌 Step 2: Decision branch
→ If a duplicate is found, we flag it or merge it automatically (using Apex plus native merge tools).
→ If not, we convert the Lead cleanly to a Contact, ensuring no clutter.

📌 Step 3: Automation with guardrails
→ All of this runs inside a scalable Salesforce Flow, enriched with Apex where needed, and leaves a full audit trail.

💡 Architecture isn't just about building; it's about protecting your data layer. If you're still relying on name-only matching or manual checks, you're setting your CRM up for failure.

Let's talk if you want a duplicate management framework that scales 👇

#Salesforce #CRMStrategy #DuplicateCheck #SalesforceFlow #Architect #RevOps #DataIntegrity

---

**Data Engineer Interview Scenario**

You are tasked with deduplicating a massive dataset where processing the entire dataset at once is not feasible due to memory constraints. You need to handle it in chunks, ensuring deduplication based on specific columns while keeping one record per duplicate set. How would you approach this in PySpark?

**Key interview flow**

**1. Initial question: partitioning logic**
- How would you process such a huge dataset without loading all the data into memory at once?
- Expected answer: repartition the dataset by a logical key to split it into manageable chunks.

**2. Code discussion: partition-based deduplication**

Present this code to the candidate and ask them to explain each step:

```python
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# df is assumed to be an existing DataFrame with "key" and "date" columns.
chunked_df = df.repartition("key")  # Step 1: Repartition by key

def process_partition(iterator):
    partition_data = list(iterator)  # Step 2: Collect the rows of the current chunk
    if not partition_data:
        return  # Skip empty chunks

    # Convert the chunk into a DataFrame.
    # Note: creating a DataFrame here relies on the SparkSession, which is not
    # available on executors in a real cluster; this is a good point to probe
    # with the candidate.
    partition_df = spark.createDataFrame(partition_data, schema=df.schema)

    # Deduplicate within the chunk using a window function
    window_spec = Window.partitionBy("key").orderBy(col("date").desc())
    deduplicated = (partition_df
                    .withColumn("row_number", row_number().over(window_spec))
                    .filter(col("row_number") == 1)
                    .drop("row_number"))

    # Simulate saving results
    for row in deduplicated.collect():
        print(row)  # Replace with save logic

chunked_df.foreachPartition(process_partition)  # Step 3: Apply the logic to each partition
```

**Follow-up questions**

**Partitioning challenges**
1. Why is repartitioning critical in this approach?
   - Expected answer: repartitioning ensures that all duplicate records for a given key land in the same partition, enabling efficient local deduplication.
2. What could go wrong if some keys have significantly more data than others?
   - Hint: lead the discussion toward data skew and its impact on memory.
3. How would you handle a situation where a single partition exceeds memory limits?
   - Expected answer: use salting techniques, further chunking, or write intermediate results to disk.

**Deduplication logic**
1. What is the role of the `Window` function here?
   - Expected answer: it assigns a `row_number` to each duplicate set based on the ordering criteria (`date` in this case).
2. Can `dropDuplicates()` replace the `Window` function here? Why or why not?
   - Expected answer: no, because `dropDuplicates()` does not allow control over which duplicate to keep.

**Performance/optimization**
1. How does `foreachPartition` improve performance compared to `mapPartitions`?
   - Expected answer: `foreachPartition` is designed for side effects (like saving to a database) and avoids unnecessary data collection.
2. What optimizations can you suggest for this approach?
   - Repartitioning based on expected skew.
   - Caching intermediate results.
   - Using efficient file formats like Parquet.

---

Recently, I shared the autonomous PR discovery engine I built in n8n. It got traction. That system finds and verifies media emails, but discovery is only half the battle. You still have to pitch them, track follow-ups, and ensure you never spam an editor twice.

So I built Part 2: the Sequential Story Pipeline. Instead of paying $/mo for a generic email sequencer, this workflow runs a personalized, week-by-week PR campaign on autopilot. Zero mail merges. Zero manual tracking.

The setup guide (save this for later):

1. The brain (state machine)
- The tool: Airtable.
- The logic: every night at midnight, the system queries your CRM. It isolates brand-new leads (Week 0) from active contacts who are scheduled for their next weekly pitch.

2. Dynamic content assembly
- The tool: Google Drive and n8n Code nodes.
- The logic: it does not use static text blocks. It downloads a live Google Doc for that specific week's story, strips the raw text, injects the target publication's name, grabs the correct Drive images, and builds a clean HTML email on the fly.

3. The "100x faster" safety tip
- The rule: don't rely on basic loops; build a strict deduplication engine.
- The logic: generate a unique Campaign ID for every action (e.g., ContactID-Week2). Before sending, the workflow queries Airtable; if that ID is already marked "Sent", it skips the contact entirely. Zero accidental double-sends.

4. Automated delivery and reporting
- The flow: it dispatches the emails via SMTP, then increments a daily tracking counter and delivers a summarized HTML "receipt" to the team's Gmail, showing exactly what went out.

✦ Where it falls short
- Text formatting: raw Google Docs contain hidden characters and bad line breaks. Fix: use a Code node with regex (replace(/\r\n/g, '\n')) to clean the text and wrap paragraphs in clean HTML.
- SMTP throttling: sending hundreds of emails instantly will destroy your domain reputation. Fix: insert a 5-second Wait node inside your batch loop to throttle the send rate.

---

**SQL Interview Series, Day 12**

**Task:** Write a SQL query to identify email addresses that appear more than once in the customers table.

This type of question is commonly asked in interviews at companies like PwC, KPMG, and Infosys, especially when the role involves data quality audits, reporting, or data migration tasks. The focus here is on identifying duplicates, an essential skill in data cleaning and preprocessing workflows.

**How to frame it:** Start by grouping the table by the email column. Then apply the COUNT(*) function to count how many times each email appears in the dataset. To find duplicates, use a HAVING clause to return only those email groups where the count is greater than one. This logic helps detect data integrity issues such as multiple records with the same email due to failed validations or duplicate imports.

**Concepts used:**
1. GROUP BY clause: groups records by email so that aggregate functions can count how many times each unique email appears.
2. COUNT(*) function: counts the total number of records for each grouped email value. If an email appears more than once, its count will be greater than one.
3. Aliasing aggregates: the result of COUNT(*) is aliased as occurrences for better readability and downstream usage in reporting or debugging queries.
4. HAVING clause to filter aggregates: since WHERE cannot be used with aggregated values, the HAVING clause is applied to filter the grouped data. It returns only those emails with more than one record.
5. Data quality relevance: identifying duplicates is critical in ETL pipelines, CRM data syncs, and compliance checks. Interviewers expect you to write efficient queries that surface these issues clearly.
6. Advanced follow-up strategies: once duplicates are identified, interviewers may ask how to remove or resolve them. You can suggest using ROW_NUMBER() to isolate the latest record, or DISTINCT to retain unique values based on business rules.
7. Practical application: this pattern is useful in fraud detection, contact deduplication, lead cleanup, or preparing customer data for machine learning models.

**Why this is asked:** This question tests your understanding of SQL's grouping and filtering logic, and how to detect and report anomalies in data. Clean data is the foundation of every analytics project, and the ability to identify duplicates is a must-have skill for any analyst or engineer. Interviewers look for candidates who can think practically and solve messy data problems with precision.

#SQL #SQLInterview #PwCInterview #DataCleaning #HAVINGClause #DataAnalytics #SQLQuery #LearnSQL #DataQuality #BusinessIntelligence #InterviewPreparation #DataEngineering #AnalyticsJobs #SQLTips
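
Putting the framing above into an actual query, a minimal version (using the customers table, email column, and occurrences alias described in the post) would look like this:

```sql
-- Emails that appear more than once in the customers table,
-- with how many times each one occurs.
SELECT
    email,
    COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1        -- keep only duplicated emails
ORDER BY occurrences DESC; -- most-duplicated first (optional)
```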

---

💡 **PySpark Interview Problem – Day 12**

🎯 Topic: Detecting near-duplicate records (fuzzy logic)

🚀 Question: You're given a PySpark DataFrame containing user registration data. Due to human entry or OCR errors, names and emails might have minor typos. Your task:
✅ Detect potential duplicate users based on similar name and email patterns (e.g., John Doe vs Jhon Doe, john@gmail.com vs john@gmal.com)
✅ Flag pairs where the Levenshtein distance between name or email is below a certain threshold.

📦 Sample input (user_data):

| user_id | full_name    | email              |
|---------|--------------|--------------------|
| U001    | John Doe     | john@gmail.com     |
| U002    | Jhon Doe     | john@gmal.com      |
| U003    | Alice Wonder | alice@xyz.com      |
| U004    | Ailce Wonder | alice@xyz.com      |
| U005    | Mark Smith   | mark.smith@abc.com |

🎯 Expected output:

| user_1 | user_2 | name_distance | email_distance |
|--------|--------|---------------|----------------|
| U001   | U002   | 2             | 1              |
| U003   | U004   | 2             | 0              |

✅ PySpark code solution:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, levenshtein

spark = SparkSession.builder.appName("FuzzyDuplicateDetection").getOrCreate()

data = [
    ("U001", "John Doe", "john@gmail.com"),
    ("U002", "Jhon Doe", "john@gmal.com"),
    ("U003", "Alice Wonder", "alice@xyz.com"),
    ("U004", "Ailce Wonder", "alice@xyz.com"),
    ("U005", "Mark Smith", "mark.smith@abc.com")
]
df = spark.createDataFrame(data, ["user_id", "full_name", "email"])

# Self-join to compare each pair of users
df1 = df.alias("a")
df2 = df.alias("b")

# Avoid self-comparisons and duplicate (A, B) / (B, A) pairs
joined = df1.join(df2, col("a.user_id") < col("b.user_id"))

# Calculate Levenshtein distances between names and between emails
result = joined.select(
    col("a.user_id").alias("user_1"),
    col("b.user_id").alias("user_2"),
    levenshtein(col("a.full_name"), col("b.full_name")).alias("name_distance"),
    levenshtein(col("a.email"), col("b.email")).alias("email_distance")
).filter((col("name_distance") <= 2) | (col("email_distance") <= 2))

result.show()
```

🧠 Why this is different and useful:
- Applies fuzzy matching with levenshtein(), which is rarely used in basic PySpark problems
- Used in data deduplication, identity matching, and master data cleanup
- Real-world use cases in CRM systems, KYC, and user onboarding platforms

🏢 Companies that have asked this: Razorpay, LinkedIn, Gojek

Socials:
🤝 Connect for 1:1 - https://lnkd.in/ggMJgs_k
📖 My Books - https://lnkd.in/ggMJgs_k
🎥 YouTube - https://lnkd.in/db-XNeP9
🎥 YouTube - https://lnkd.in/gc-8rdjM
📸 Instagram - https://lnkd.in/gwH84mRW
📸 Instagram - https://lnkd.in/gccKJZek
💼 LinkedIn - Gowtham SB

#PySpark #BigData #DataCleaning #FuzzyMatching #Levenshtein #DataDeduplication #ApacheSpark #DataEngineering #100DaysOfPySpark #RealWorldUseCase

---

**PySpark scenario-based interview question & answer**

**1) Deduplicate and normalize messy user data (Beginner)**

Scenario: You receive a user signup CSV with messy names, mixed-case emails, and multiple signups per email. Keep the most recent signup per email and normalize fields.

Purpose: Data hygiene. This prevents duplicate users and inconsistent keys that break joins and metrics.

Question & data (sample):
- Schema: user_id: int, full_name: string, email: string, signup_ts: string, country: string
- Sample rows:
  - (1, " john DOE ", "JOHN@EX.COM ", "2025-11-20 10:00", "US")
  - (2, "John Doe", "john@ex.com", "2025-11-21 09:00", "US")
  - (3, "alice", "alice@mail.com", "11/20/2025 12:00", "IN")

Approach:
1. Read the CSV with a header.
2. Trim and normalize (full_name to title case, email to lower case).
3. Parse multiple timestamp formats into a timestamp column.
4. Filter obviously invalid emails (basic regex).
5. Deduplicate by email, keeping the row with the latest signup_ts.

Explanation: Lowercasing emails and trimming prevents false-unique keys. Multiple to_timestamp attempts handle variable input formats. A window plus row_number() deterministically selects the most recent record per email. Caveat: a basic regex filters obvious invalid addresses but is not full RFC validation.

Karthik K.

#PySpark #DataCleaning #ETL #DataEngineering #ApacheSpark
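
A minimal sketch of the dedup step described above, expressed in Spark-compatible SQL rather than the DataFrame API; the `signups` view name is an assumption for illustration, and signup_ts is assumed to be parsed to a timestamp already:

```sql
-- Keep the most recent signup per normalized email.
-- Assumes the parsed data is registered as a temp view named signups
-- with columns user_id, full_name, email, signup_ts, country.
WITH cleaned AS (
    SELECT
        user_id,
        INITCAP(TRIM(full_name)) AS full_name,  -- title-case the name
        LOWER(TRIM(email))       AS email,      -- normalize the dedup key
        signup_ts,
        country
    FROM signups
    WHERE email LIKE '%@%.%'                    -- rough validity check, not full RFC
),
ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY email
               ORDER BY signup_ts DESC          -- newest signup wins
           ) AS rn
    FROM cleaned
)
SELECT user_id, full_name, email, signup_ts, country
FROM ranked
WHERE rn = 1;
```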

---

One of the most underrated features of Snowflake that can instantly enhance your data quality and pipeline efficiency is the use of HASH() or CHECKSUM() to detect duplicate rows.

Why duplicates become a headache:
- Multiple systems sending the same record
- Late-arriving data
- Ingestion retries
- Missing primary keys
- Manual file loads

Traditional duplicate detection methods often involve long JOIN conditions, row-by-row comparisons, and complex WHERE clauses, which become slow and costly as datasets grow. Snowflake offers a far simpler solution. Instead of manually comparing each column, you can generate a fingerprint of a row using:
- HASH(col1, col2, col3, ...)
- CHECKSUM(col1, col2, col3, ...)

This fingerprint condenses the entire row into a single numeric value. If two rows have the same hash value, they are almost certainly duplicates (collisions are possible in principle, but rare enough that the value works well as a row fingerprint).

For example, consider a table with the following data:

| NAME  | EMAIL   | CITY  |
|-------|---------|-------|
| John  | j@a.com | Pune  |
| John  | j@a.com | Pune  |
| Rahul | r@b.com | Delhi |

You can create a row signature with the following query:

```sql
SELECT *, HASH(name, email, city) AS row_hash
FROM customers;
```

The output will show:

| NAME  | EMAIL   | CITY  | ROW_HASH |
|-------|---------|-------|----------|
| John  | j@a.com | Pune  | 84739281 |
| John  | j@a.com | Pune  | 84739281 |
| Rahul | r@b.com | Delhi | 12849372 |

Now duplicates become obvious: same data, same hash. To find all duplicate groups in one query:

```sql
SELECT row_hash, COUNT(*)
FROM (
    SELECT HASH(name, email, city) AS row_hash
    FROM customers
)
GROUP BY row_hash
HAVING COUNT(*) > 1;
```

This query surfaces all duplicate groups instantly. Snowflake hides a lot of power in small functions, and HASH() is one of those features: easy to use, zero maintenance, and extremely effective. If you aren't using hash-based deduplication yet, it's one of the quickest ways to improve your pipelines.

Follow Sajal Agarwal for more such content.
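
If you also want to keep just one row per duplicate group rather than only report them, one possible follow-up, assuming Snowflake's QUALIFY clause and the same customers table, is:

```sql
-- Keep a single row per duplicate group, using the hash as the grouping key.
-- QUALIFY lets Snowflake filter on the window function directly.
SELECT name, email, city
FROM customers
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY HASH(name, email, city)  -- one group per row fingerprint
    ORDER BY name                         -- any deterministic order works here
) = 1;
```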