Avoid Blind DISTINCT in SQL for Data Control

STOP using DISTINCT blindly. You’re losing control over your data.

🔹 Duplicates sneak into your data more often than you think:
👉 joins gone wrong
👉 late-arriving data
👉 retries in pipelines
👉 messy ingestion

🔹 Most people fix it like this:

SELECT DISTINCT id, name, city FROM customers;

It works… But it’s also the least controlled approach.

✅ Here are 3 ways to remove duplicates in SQL:

1. ROW_NUMBER() (Best Way)
• full control over which row stays
• lets you keep latest / priority records
• production-safe, as long as the ORDER BY breaks ties deterministically

2. DISTINCT (Quick Fix)
• simple and fast
• removes exact duplicates
• ⚠️ no control over which row survives

3. GROUP BY (Same as DISTINCT)
• does the same job
• more verbose
• rarely needed for deduplication

🎯 REAL TAKEAWAY
All 3 give the same result when the duplicates are exact copies. But when duplicate rows differ in any selected column, DISTINCT and GROUP BY keep every variant, and only ROW_NUMBER() lets you choose which row survives. If you care about data correctness, use ROW_NUMBER(), not luck.

🔥 WHY THIS MATTERS
Choosing the wrong method can:
• hide data issues
• break downstream reports
• silently corrupt logic

And the worst part? You won’t notice until it’s too late.

Which one do you use most in your projects? 🤔

#SQL #DataEngineering #AnalyticsEngineering #DataEngineer #ETL #Databricks #BigData
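To make the difference between DISTINCT and ROW_NUMBER() concrete, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The sample rows and the updated_at column are assumptions added for illustration; the post does not say how "latest" is defined for the customers table.

```python
import sqlite3

# Hypothetical sample data. The updated_at column is an assumption,
# added so that "latest" has a concrete definition.
rows = [
    (1, "Asha", "Pune",   "2024-01-01"),
    (1, "Asha", "Mumbai", "2024-03-01"),  # same id, newer city
    (2, "Ravi", "Delhi",  "2024-02-01"),
    (2, "Ravi", "Delhi",  "2024-02-01"),  # exact duplicate
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, city TEXT, updated_at TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# DISTINCT only collapses rows that match in every selected column,
# so both versions of id 1 survive.
distinct_rows = conn.execute(
    "SELECT DISTINCT id, name, city FROM customers ORDER BY id, city"
).fetchall()

# ROW_NUMBER() ranks rows within each id, newest first. Keeping rn = 1
# is an explicit rule for which row survives.
latest_rows = conn.execute("""
    SELECT id, name, city
    FROM (
        SELECT id, name, city,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY updated_at DESC
               ) AS rn
        FROM customers
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(distinct_rows)  # three rows: id 1 appears twice
print(latest_rows)    # [(1, 'Asha', 'Mumbai'), (2, 'Ravi', 'Delhi')]
```

With this data DISTINCT removes only the exact copy of id 2, while the ROW_NUMBER() query returns one row per id and keeps the newer city for id 1.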

10 Comments

Asmita Kaushal 1w

DISTINCT hides the problem, ROW_NUMBER() solves it with control and intent 🎯 Batchu V Sarath Chandra

Anandnarayanan S 2w

Great share bro

Ritwik Mohapatra 2w

Apt finding on duplicate removal

Rajesh De 2w

#CFBR

Harshit Bhadiyadra 2w

https://www.udemy.com/course/become-sql-champion/?instructorPreviewMode=guest&couponCode=KEEPLEARNING

Vinod Devalraju 2w

How about deleting the duplicates which we found


More Relevant Posts

Loveleen Kaur
2w
#DataQuality issues don’t always break pipelines. They quietly break decisions. 📉

Everything can look fine: pipelines running, dashboards refreshing… But the numbers still don’t add up. That’s what makes data quality tricky.

From what I’ve seen:
• Missing values → misleading trends
• Inconsistent definitions → confusion across teams
• Late data → incorrect reporting
• Silent issues → hardest to detect

It made me realize: reliability isn’t just pipelines running. It’s data being trusted.

Still learning how to build systems where data is not just available, but reliable.

What’s the most common data quality issue you’ve faced?

#DataEngineering #DataQuality #DataAnalytics #DataPipelines #SQL #DataReliability #DataPlatform #LearningInPublic

Vishwanath T L
3w
🚀 Stop corrupting your data lake with duplicate records.

The pattern: Idempotent DELETE+INSERT (Upsert) ensures that re-running a pipeline produces the same state regardless of how many times it executes.

Here is how to implement it:

    # Assumes an active SparkSession named `spark` (predefined in Databricks notebooks)
    target_table = "analytics.sales"
    staging_table = "staging.sales_updates"

    # Define the partition range to be replaced
    update_date = "2023-10-27"

    # Clear the target partition so a re-run cannot append the same rows twice
    spark.sql(f"DELETE FROM {target_table} WHERE load_date = '{update_date}'")

    # Load only the matching partition from staging (assumes staging also has load_date)
    spark.table(staging_table) \
        .where(f"load_date = '{update_date}'") \
        .write \
        .format("delta") \
        .mode("append") \
        .saveAsTable(target_table)

Why every pipeline needs this:

In data engineering, failures are inevitable. Whether it's a network glitch or a bad upstream file, you will need to re-run your jobs. Without an idempotent design, a simple retry results in duplicate rows and skewed metrics.

By clearing the partition before loading, a re-run converges on the same output even if a previous attempt crashed halfway through. The DELETE and the append are separate steps, so the partition can be empty or incomplete until that re-run finishes. It is the single most important habit for maintaining "source of truth" reliability in our warehouse.

How do you handle idempotency when your source systems don't provide a clear partition key or timestamp?

#DataEngineering #DataPipeline #ApacheSpark #DataQuality #BigData
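The same delete-then-insert idea can be tried without a Spark cluster. The sketch below is not the author's pipeline: it swaps in Python's built-in sqlite3 module, uses hypothetical tables named sales and sales_updates, and wraps both statements in one transaction so a failure between them rolls back instead of leaving the partition empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, load_date TEXT)")
conn.execute("CREATE TABLE sales_updates (order_id INTEGER, amount REAL, load_date TEXT)")
conn.executemany(
    "INSERT INTO sales_updates VALUES (?, ?, ?)",
    [(1, 10.0, "2023-10-27"), (2, 25.0, "2023-10-27")],
)
conn.commit()


def load_partition(conn, update_date):
    """Replace one load_date partition of sales with the staged rows."""
    # `with conn` commits on success and rolls back on an exception, so the
    # DELETE and the INSERT become visible together or not at all.
    with conn:
        conn.execute("DELETE FROM sales WHERE load_date = ?", (update_date,))
        conn.execute(
            "INSERT INTO sales SELECT * FROM sales_updates WHERE load_date = ?",
            (update_date,),
        )


load_partition(conn, "2023-10-27")
load_partition(conn, "2023-10-27")  # simulated retry of the same job

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)  # 2, not 4: the retry replaced the partition
```

Running the load twice leaves two rows rather than four, which is the property the post describes. Whether a comparable single-transaction guarantee is available depends on the storage layer in use.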
Yogesh Aluri
1w
One of my data pipelines broke… and I had to figure out why.

At first, it was frustrating. Data wasn’t reaching the final table. Dashboards weren’t updating. No clear error message.

Instead of guessing, I followed a step-by-step approach:

👉 Step 1: Check the source
Is data coming in correctly?

👉 Step 2: Verify ingestion
Did the pipeline actually pick up the data?

👉 Step 3: Inspect transformations
Any logic breaking in processing?

👉 Step 4: Check the output layer
Is data written correctly to the destination?

👉 Step 5: Look at logs
Find the exact point of failure

That’s when I found it… A small issue in one transformation step was breaking the entire pipeline.

Fixing it was easy. Finding it was the hard part.

That experience taught me:
👉 Debugging is not about guessing… it’s about isolating.

Now whenever something breaks, I don’t panic. I just follow the flow.

How do you debug pipeline failures?

#DataEngineering #DataPipeline #Debugging #BigData #ETL #DataEngineer #TechLearning #ProblemSolving #CloudComputing #Analytics #DeveloperJourney #LearnInPublic #CareerGrowth
Mohamed Khasim
2w
Your data pipeline ran successfully. Your dashboard is still wrong.

Your pipeline ran. No errors. Green checkmarks everywhere. But the numbers don't match.

This is the silent killer in data engineering: pipelines that succeed without delivering truth.

Here's what usually goes wrong:
• Schema drift: upstream source changed a column name. No alert. Data loads, but maps to null.
• Late-arriving data: your daily job ran at 6 AM. 30% of records hadn't arrived yet.
• Deduplication gaps: the source sends duplicates. You count them twice.
• Silent type coercion: "1000" becomes 1 in one system, 1000 in another.
• Timezone mismatches: events stored in UTC, reported in local time, joined across both.

The pipeline isn't broken. The trust is.

Senior data engineers don't just build pipelines. They build pipelines that know when to fail loudly.

If your system can't tell you when something is wrong, that's the real bug.

What's the sneakiest data issue you've ever debugged? Drop it below 👇

#DataEngineering #DataQuality #DataPipelines #SQL #ETL
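One way to read "fail loudly" is a validation step that raises before a bad batch is loaded. The sketch below is only an illustration: DataQualityError, validate_batch, and the thresholds are hypothetical names and values, and it covers just three of the issues listed above (late data, columns mapping to null, duplicate keys).

```python
from collections import Counter


class DataQualityError(Exception):
    """Raised when a batch fails a check and the load should stop."""


def validate_batch(rows, required_columns, key, min_rows):
    """Return rows unchanged, or raise DataQualityError listing every failed check."""
    problems = []

    # Late-arriving data: far fewer rows than a normal run would deliver.
    if len(rows) < min_rows:
        problems.append(
            f"late or missing data: {len(rows)} rows, expected at least {min_rows}"
        )

    # Schema drift: a renamed upstream column usually shows up as nulls.
    for column in required_columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls:
            problems.append(
                f"possible schema drift: {column!r} is null or absent in {nulls} rows"
            )

    # Deduplication gaps: the same key delivered more than once.
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1 and k is not None)
    if duplicates:
        problems.append(f"duplicate keys: {duplicates}")

    if problems:
        raise DataQualityError("; ".join(problems))
    return rows


batch = [
    {"order_id": 1, "amount": 100},
    {"order_id": 1, "amount": 100},   # duplicate sent by the source
    {"order_id": 2, "amount": None},  # column renamed upstream, mapped to null
]

try:
    validate_batch(batch, ["order_id", "amount"], key="order_id", min_rows=3)
    error_message = ""
except DataQualityError as exc:
    error_message = str(exc)

print(error_message)
```

The batch above would load without any error in a pipeline that only checks for crashes; here it stops with a message naming both the null column and the duplicate key.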
Harish Chatla
1w
THIS QUERY LOOKS CORRECT. IT IS NOT.

It runs. It returns clean percentages. But the logic is WRONG.

Business problem: Calculate per team:
- Resolution rate (%)
- On-time resolution rate (%) (within 24h SLA)
- Quality score (%) based on feedback

Tables involved:

tickets
- ticket_id
- team_id
- created_time
- resolved_time

feedback
- feedback_id
- ticket_id
- rating
- feedback_time

At first glance, this looks simple. But real data breaks this logic:
- Unresolved tickets are included in calculations
- SLA is applied incorrectly
- Multiple feedback rows inflate quality score
- Latest feedback per ticket is not considered

So even if your query runs, your numbers are misleading.

Think before answering:
Are you using the right denominator?
Are you counting only resolved tickets?
Are duplicates affecting your results?

Fix the logic, not just the syntax.

Comment your answer. Repost if this made you think. Follow for real-world SQL problems. Subscribe to practice on our platform.

#DataRejected #SQL #DataEngineering #Analytics #DataScience #InterviewPrep #CodingPractice #BusinessLogic
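The query itself is not included in the post, so the following is only one possible answer, run through Python's built-in sqlite3 module. The metric definitions are assumptions, since the post leaves them open: resolution rate is resolved tickets over all tickets, on-time rate is tickets resolved within 24 hours over resolved tickets, and quality score is resolved tickets whose latest rating is 4 or higher over resolved tickets that have feedback. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (ticket_id INTEGER, team_id TEXT,
                          created_time TEXT, resolved_time TEXT);
    CREATE TABLE feedback (feedback_id INTEGER, ticket_id INTEGER,
                           rating INTEGER, feedback_time TEXT);
    INSERT INTO tickets VALUES
        (1, 'A', '2024-01-01 08:00:00', '2024-01-01 20:00:00'),
        (2, 'A', '2024-01-01 08:00:00', '2024-01-03 08:00:00'),
        (3, 'A', '2024-01-02 08:00:00', NULL),
        (4, 'B', '2024-01-01 08:00:00', '2024-01-01 09:00:00');
    INSERT INTO feedback VALUES
        (1, 1, 2, '2024-01-02 00:00:00'),
        (2, 1, 5, '2024-01-03 00:00:00'),
        (3, 2, 3, '2024-01-04 00:00:00'),
        (4, 4, 5, '2024-01-02 00:00:00');
""")

QUERY = """
WITH latest_feedback AS (
    -- One row per ticket, so the join below cannot multiply tickets.
    SELECT ticket_id, rating
    FROM (
        SELECT ticket_id, rating,
               ROW_NUMBER() OVER (
                   PARTITION BY ticket_id ORDER BY feedback_time DESC
               ) AS rn
        FROM feedback
    )
    WHERE rn = 1
)
SELECT
    t.team_id,
    -- COUNT(resolved_time) skips NULLs, so it counts resolved tickets only.
    ROUND(100.0 * COUNT(t.resolved_time) / COUNT(*), 1) AS resolution_rate,
    ROUND(100.0 * SUM(CASE WHEN strftime('%s', t.resolved_time)
                              - strftime('%s', t.created_time) <= 24 * 3600
                           THEN 1 ELSE 0 END)
          / NULLIF(COUNT(t.resolved_time), 0), 1) AS on_time_rate,
    ROUND(100.0 * SUM(CASE WHEN t.resolved_time IS NOT NULL AND f.rating >= 4
                           THEN 1 ELSE 0 END)
          / NULLIF(SUM(CASE WHEN t.resolved_time IS NOT NULL
                             AND f.rating IS NOT NULL
                            THEN 1 ELSE 0 END), 0), 1) AS quality_score
FROM tickets t
LEFT JOIN latest_feedback f ON f.ticket_id = t.ticket_id
GROUP BY t.team_id
ORDER BY t.team_id
"""

team_metrics = conn.execute(QUERY).fetchall()
for row in team_metrics:
    print(row)
```

Each denominator is stated explicitly, and feedback is reduced to its latest row per ticket before the join, which addresses the four pitfalls listed in the post under the assumed definitions.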
Harish Chatla
2w Edited
THIS QUERY LOOKS CORRECT. IT IS NOT.

Most people think this query is correct. It runs. It returns results. But the logic is completely broken.

Business problem: Find the latest product review for each customer based on their most recent completed order.

Tables involved:

orders
- order_id
- customer_id
- order_date

payments
- payment_id
- order_id
- status
- payment_date

reviews
- review_id
- order_id
- review_text
- review_date

At first glance, the logic seems simple: Join orders → payments → reviews and pick the latest order per customer.

But real data doesn’t behave like that.
- One order can have multiple payments
- One order can have multiple reviews (edits / updates)
- Joins create duplicate rows
- “Latest” becomes ambiguous if not handled carefully

So even if your query runs, you might be picking the wrong review.

Think before answering: Are you selecting the latest order? Or the latest review? Or a random row created by joins?

Fix the logic, not just the syntax.

Comment your answer. Repost if this made you think. Follow Harish Chatla for more real-world data problems. Subscribe to practice on our platform.

#DataRejected #SQL #DataEngineering #DataAnalytics #DataScience #LearnByDoing #TechCareers #Analytics #CodingPractice
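The post does not show the broken query, so this is a sketch of one way to avoid the fan-out, run with Python's built-in sqlite3 module. It rests on assumptions the post leaves open: an order counts as completed if it has at least one payment with status 'completed', "most recent" means the highest order_date, and a customer whose latest completed order has no review is returned with an empty review. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
    CREATE TABLE payments (payment_id INTEGER, order_id INTEGER,
                           status TEXT, payment_date TEXT);
    CREATE TABLE reviews (review_id INTEGER, order_id INTEGER,
                          review_text TEXT, review_date TEXT);
    INSERT INTO orders VALUES
        (10, 1, '2024-01-01'), (11, 1, '2024-02-01'),
        (12, 1, '2024-03-01'), (20, 2, '2024-01-15');
    INSERT INTO payments VALUES
        (1, 10, 'completed', '2024-01-01'),
        (2, 11, 'failed',    '2024-02-01'),
        (3, 11, 'completed', '2024-02-02'),
        (4, 12, 'failed',    '2024-03-01'),
        (5, 20, 'completed', '2024-01-15'),
        (6, 20, 'completed', '2024-01-16');
    INSERT INTO reviews VALUES
        (1, 10, 'old order review', '2024-01-05'),
        (2, 11, 'first draft',      '2024-02-03'),
        (3, 11, 'edited review',    '2024-02-10'),
        (4, 20, 'great',            '2024-01-20');
""")

QUERY = """
WITH completed_orders AS (
    -- EXISTS keeps one row per order even when it has several payments.
    SELECT o.order_id, o.customer_id, o.order_date
    FROM orders o
    WHERE EXISTS (
        SELECT 1 FROM payments p
        WHERE p.order_id = o.order_id AND p.status = 'completed'
    )
),
latest_order AS (
    SELECT order_id, customer_id
    FROM (
        SELECT order_id, customer_id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC, order_id DESC
               ) AS rn
        FROM completed_orders
    )
    WHERE rn = 1
),
latest_review AS (
    SELECT order_id, review_text
    FROM (
        SELECT order_id, review_text,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY review_date DESC, review_id DESC
               ) AS rn
        FROM reviews
    )
    WHERE rn = 1
)
SELECT lo.customer_id, lo.order_id, lr.review_text
FROM latest_order lo
LEFT JOIN latest_review lr ON lr.order_id = lo.order_id
ORDER BY lo.customer_id
"""

latest_reviews = conn.execute(QUERY).fetchall()
print(latest_reviews)  # [(1, 11, 'edited review'), (2, 20, 'great')]
```

Each "latest" is resolved in its own step with an explicit tie-breaker before anything is joined, so customer 1 gets order 11 (order 12 only has a failed payment) and the edited review rather than the first draft.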
Sam Miraki
3w
Data pipelines don’t fail loudly. They fail silently.

One of the biggest risks in data engineering isn’t crashes. It’s this: everything looks like it’s working.

Pipelines run. Dashboards load. Reports go out.

But underneath:
• data is delayed
• fields are partially populated
• logic changed without anyone noticing

And suddenly, decisions are based on wrong data.

Strong data systems don’t just move data. They validate it. That means:
✔ data quality checks at every stage
✔ monitoring for anomalies, not just failures
✔ clear ownership when things break

Because “no errors” doesn’t mean “no problems”.

#DataEngineering #DataQuality #Analytics #Backend #DataOps
Ahmed Shifa
3w
Salam everyone!

Data engineering is not about big data. It is about not losing data. Everything else is secondary.

Here is a simple checklist I use before any pipeline runs:
• If this runs twice, do I get the same result?
• If it crashes at row 10,000, do I restart from 0 or from 10,001?
• If the source has nulls, does my pipeline break or handle it?
• Can I explain what this pipeline does in two sentences to someone non-technical?

Four questions. No fancy tools required. Get these right first. Add complexity later.

#DataEngineering #DataFundamentals #SimpleOverComplex

Wasalam!
Ganesh R
1mo
Understanding Slowly Changing Dimensions (SCD) Types in Data Warehousing

If you're working in data engineering or analytics, mastering SCD is a must! Let’s simplify the key types 👇

🔹 Type 0 – Fixed
No changes allowed. Data remains constant.
👉 Use case: Immutable fields like Date of Birth

🔹 Type 1 – Overwrite
Old data is replaced with new data (no history).
👉 Use case: Correcting errors in data

🔹 Type 2 – Versioning
Maintains full history by adding new rows.
👉 Use case: Tracking customer address changes over time

🔹 Type 3 – Previous Value
Stores limited history (previous value only).
👉 Use case: Comparing current vs last value

🔹 Type 4 – History Table
Current data in main table, history stored separately.
👉 Use case: Reducing main table size while keeping history

🔹 Type 6 – Hybrid (1+2+3)
Combines overwrite, versioning, and previous value.
👉 Use case: Advanced analytics with complete flexibility

💡 Quick Tip:
• Use Type 1 for simplicity
• Use Type 2 when history matters
• Use Type 6 for complex business needs

📊 Choosing the right SCD type directly impacts performance, storage, and reporting accuracy.

#DataEngineering #DataWarehouse #SCD #SQL #BigData #Azure #Databricks #Analytics
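As an illustration of Type 2, the address-change use case can be sketched with Python's built-in sqlite3 module. The table layout (valid_from, valid_to, is_current) and the helper apply_type2_change are hypothetical; they follow one common convention, and real warehouses often add surrogate keys and handle changes in bulk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER, address TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    )
""")


def apply_type2_change(conn, customer_id, new_address, change_date):
    """Close the current version and add a new row when the address changed."""
    current = conn.execute(
        "SELECT address FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[0] == new_address:
        return  # nothing changed, so no new version is written

    # Both statements commit together, so there is never a moment with
    # zero or two current rows for the customer.
    with conn:
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (change_date, customer_id),
        )
        conn.execute(
            "INSERT INTO dim_customer VALUES (?, ?, ?, NULL, 1)",
            (customer_id, new_address, change_date),
        )


apply_type2_change(conn, 1, "Pune", "2024-01-01")
apply_type2_change(conn, 1, "Mumbai", "2024-06-01")
apply_type2_change(conn, 1, "Mumbai", "2024-07-01")  # same address, ignored

history = conn.execute(
    "SELECT address, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY valid_from"
).fetchall()
for row in history:
    print(row)
```

After the three calls the table holds two rows: the closed Pune version and the current Mumbai version. A Type 1 change would instead overwrite the address in place and keep a single row.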

