Prevent Non-Deterministic Ordering with ROW_NUMBER()

⚡ The Window Function Trap That's Costing You Accuracy

A subtle mistake many SQL writers make without realizing it.

When you use ROW_NUMBER() to deduplicate records, the order of ties matters, but most people never specify a deterministic tiebreaker. If two rows share the same PARTITION BY key and the same ORDER BY value, the database engine is free to assign their row numbers in either order. Your results can differ from one run to the next, and in production that's a silent data quality bug.

❌ WRONG: non-deterministic ordering

    SELECT *
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY created_at DESC  -- ⚠️ ties unresolved
               ) AS rn
        FROM events e
    ) ranked
    WHERE rn = 1;

✅ RIGHT: deterministic tiebreaker

    SELECT *
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY created_at DESC, event_id ASC  -- ✓ unique tiebreaker
               ) AS rn
        FROM events e
    ) ranked
    WHERE rn = 1;

"Non-deterministic ordering is the most common source of flaky deduplication logic in production pipelines."

If two rows have the same user_id and created_at, which one survives? Without a tiebreaker, the choice is arbitrary. Run the same query twice and you might get different rows.

🔑 Rule of thumb: Every ROW_NUMBER() OVER clause should end its ORDER BY with a column that is unique within each partition, typically a surrogate key, UUID, or row-level hash. Even if it feels redundant, add it anyway.

Reproducibility is a feature, not an accident. Always test your deduplication logic with intentional duplicate timestamps. If the result changes between runs, you're missing a tiebreaker. A result that happens to stay the same proves nothing, so also check directly for rows that share both the partition key and the ORDER BY value.

#SQL #DataEngineering #WindowFunctions #Analytics #DataQuality
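The test suggested above can be sketched with Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). This is only an illustration under assumptions: the events table, its sample rows, and the variable names are hypothetical, and SQLite stands in for whatever engine you actually run. It plants an intentional timestamp tie, detects it with a GROUP BY check, and confirms that the tiebreaker version picks the same row every time.

```python
import sqlite3

# Hypothetical events table; column names mirror the post's example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        event_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        created_at TEXT    NOT NULL
    );
    -- user 1 has two events with the SAME timestamp (an intentional tie)
    INSERT INTO events (event_id, user_id, created_at) VALUES
        (10, 1, '2024-01-01 09:00:00'),
        (11, 1, '2024-01-02 12:00:00'),
        (12, 1, '2024-01-02 12:00:00'),
        (20, 2, '2024-01-03 08:00:00');
""")

DEDUP_SQL = """
    SELECT user_id, event_id, created_at
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY created_at DESC, event_id ASC  -- unique tiebreaker
               ) AS rn
        FROM events e
    ) ranked
    WHERE rn = 1
    ORDER BY user_id
"""

# Check 1: find (partition key, ORDER BY value) pairs that occur more than
# once. Any row returned here is a tie that needs a tiebreaker column.
ties = conn.execute("""
    SELECT user_id, created_at, COUNT(*) AS n
    FROM events
    GROUP BY user_id, created_at
    HAVING COUNT(*) > 1
""").fetchall()

# Check 2: with the tiebreaker in place, repeated runs return the same rows.
first_run = conn.execute(DEDUP_SQL).fetchall()
second_run = conn.execute(DEDUP_SQL).fetchall()

print("ties:", ties)
print("survivors:", first_run)
```

The GROUP BY check is the more dependable of the two: an engine may return the same arbitrary row many times in a row on a small table, so identical results alone do not show that the ordering is deterministic.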


More Relevant Posts

Rajesh P
2w
Small SQL changes that made a noticeable difference

Over time, I've noticed that performance issues are not always about complex tuning. Sometimes, small changes in how we write SQL make a big impact. Here are a few simple ones I've come across 👇

🔹 1️⃣ Avoid functions on indexed columns

    WHERE TO_CHAR(order_date,'YYYY-MM-DD') = '2024-01-01'

👉 Prevents use of a regular index on order_date
✔ Better (a range also matches rows whose order_date carries a time component):

    WHERE order_date >= DATE '2024-01-01'
      AND order_date <  DATE '2024-01-02'

🔹 2️⃣ NVL can affect performance

    WHERE NVL(status,'X') = 'A'

👉 Index may not be used
✔ Better (NULL status rows never equal 'A', so the NVL adds nothing here):

    WHERE status = 'A'

🔹 3️⃣ Avoid SELECT *

    SELECT * FROM orders WHERE status = 'COMPLETE';

👉 Fetches unnecessary data
✔ Better:

    SELECT order_id, order_date, amount
    FROM orders
    WHERE status = 'COMPLETE';

🔹 4️⃣ NOT IN vs NOT EXISTS

    WHERE emp_id NOT IN (SELECT emp_id FROM terminated_employees)

👉 Returns no rows at all if the subquery produces even one NULL
✔ Better:

    WHERE NOT EXISTS (
        SELECT 1
        FROM terminated_employees t
        WHERE t.emp_id = e.emp_id
    )

💡 What I've learned
Many performance improvements come from writing SQL in a way the optimizer can understand better, not just from adding hints or indexes.

Have you seen similar small changes make a difference?

#OracleSQL #SQLTuning #Performance #DatabaseDevelopment #PLSQL
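The first tip can be observed in a query plan. The sketch below is an illustration under assumptions, not Oracle behavior: it uses SQLite through Python's sqlite3 module as a stand-in, with strftime() playing the role of TO_CHAR, and the orders table, index name, and plan() helper are hypothetical. It compares the plan for a function-wrapped predicate with the plan for a plain range predicate on the same indexed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT NOT NULL,   -- ISO-8601 text, e.g. '2024-01-01 13:45:00'
        amount     REAL
    );
    CREATE INDEX idx_orders_date ON orders (order_date);
""")


def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Function wrapped around the indexed column: the index cannot be used
# to seek, so every row has to be visited.
wrapped_plan = plan(
    "SELECT order_id FROM orders "
    "WHERE strftime('%Y-%m-%d', order_date) = '2024-01-01'"
)

# Bare column compared against a half-open range: the index can be searched.
range_plan = plan(
    "SELECT order_id FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2024-01-02'"
)

print("wrapped:", wrapped_plan)
print("range:  ", range_plan)
```

In SQLite's plan output, SCAN means every entry is visited and SEARCH means the index is used to jump to the matching range. Other engines word their plans differently, but the distinction they report is the same.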
Zain Ul Abideen
3w
💡 What really happens when you run an SQL query?

Let's break it down with a simple example:

    SELECT name, age FROM users WHERE city = 'New York';

Most developers stop at writing queries. But the real growth starts when you understand what happens under the hood 👇

⚙️ Step 1: Transport Subsystem
The moment you hit "Run", your query doesn't jump straight into the database. It first lands in the Transport Subsystem, the gatekeeper.
✅ Manages client connections
✅ Authenticates and authorizes requests
✅ Decides whether your query is allowed to proceed

🧠 Step 2: Query Processor
This is where your SQL gets understood. It has two key components:
🔹 Query Parser
Breaks your query into parts (SELECT, FROM, WHERE), checks syntax, and builds a parse tree
🔹 Query Optimizer
Validates tables and columns (semantic checks) and figures out the most efficient way to run your query
🎯 Output: an optimized execution plan

🚀 Step 3: Execution Engine
Now the plan turns into action. The Execution Engine:
✅ Follows the execution plan step by step
✅ Coordinates with lower layers
✅ Collects and merges results

💾 Step 4: Storage Engine
This is where the actual data work happens. Think of it as a team working behind the scenes:
👨💼 Transaction Manager → ensures consistency
🔒 Lock Manager → prevents conflicts
⚡ Buffer Manager → fetches data from memory/disk
🧾 Recovery Manager → logs for rollback and recovery

🔍 The key insight?
Your SQL query is not just a command. It's a journey through multiple layers of abstraction, optimization, and coordination. And understanding this is what separates:
👉 Query writers from system thinkers

💬 Curious: what else would you add to this journey?

#SQL #Databases #BackendEngineering #SystemDesign #SoftwareEngineering
ZAID MUSHTAQ
2w
🚀 Struggling with complex SQL queries that are hard to debug?

You don't always need one giant query…
👉 Sometimes you need Temporary Tables 👇

💡 What are Temporary Tables?
Temporary tables store intermediate results for a short time.
👉 Created in "tempdb"
👉 Automatically deleted after session ends

📌 Local Temp Table (#)
Visible only in your session. Example:

    SELECT customer_id, SUM(total) AS total_spent
    INTO #customer_spend
    FROM orders
    GROUP BY customer_id

📌 Use it later easily

    SELECT * FROM #customer_spend WHERE total_spent > 500

🌍 Global Temp Table (##)
Visible across sessions. Example:

    CREATE TABLE ##shared_data (id INT, value NVARCHAR(100))

⚖️ Temp Table vs CTE vs Subquery
🔹 Subquery
• Inline
• Not reusable
🔹 CTE
• More readable
• Still limited to one query
🔹 Temp Table ✅
• Reusable across multiple steps
• Can be indexed
• Great for debugging

🔥 When should you use Temp Tables?
✔ Complex multi-step transformations
✔ Reusing intermediate results
✔ Breaking large queries into smaller steps
✔ Improving performance with indexing

⚠️ Common Mistake
Using CTEs everywhere ❌
👉 If you're reusing the same data multiple times
👉 Temp tables are a better choice

🔥 Real Insight (Important):
Good SQL developers don't write long queries…
👉 They break problems into steps

🧠 One-Line Takeaway:
Temporary tables help you simplify, reuse, and optimize complex SQL workflows.

#SQL #DataEngineering #SQLServer #LearnSQL #DataAnalytics #ETL #TechLearning #Analytics
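The materialize-once, reuse-later pattern can be tried without a SQL Server instance. The sketch below is an assumption-laden stand-in: it uses SQLite via Python's sqlite3 module, where the syntax is CREATE TEMP TABLE rather than T-SQL's # prefix and SELECT ... INTO, and the sample orders rows and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        total       REAL
    );
    INSERT INTO orders (customer_id, total) VALUES
        (1, 300.0), (1, 350.0), (2, 120.0), (3, 900.0);

    -- Step 1: materialize the intermediate result once
    -- (SQLite spelling of SELECT ... INTO #customer_spend).
    CREATE TEMP TABLE customer_spend AS
        SELECT customer_id, SUM(total) AS total_spent
        FROM orders
        GROUP BY customer_id;

    -- Temp tables can be indexed like ordinary tables.
    CREATE INDEX temp.idx_spend ON customer_spend (total_spent);
""")

# Step 2: reuse the same intermediate result in separate statements.
big_spenders = conn.execute("""
    SELECT customer_id, total_spent
    FROM customer_spend
    WHERE total_spent > 500
    ORDER BY customer_id
""").fetchall()

customer_count = conn.execute(
    "SELECT COUNT(*) FROM customer_spend"
).fetchone()[0]

print(big_spenders, customer_count)
```

Each later statement reads the stored aggregate instead of recomputing it, which is the reuse a CTE cannot offer because a CTE lives only inside the single statement that defines it.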
Olusegun Oluyemi
1w
Most developers blame slow queries on missing indexes. The real culprit is usually hidden inside the execution plan.

After years of tuning SQL Server workloads, I have learned that reading execution plans is the single highest-leverage skill a data engineer can develop. It tells you exactly where SQL Server is spending its time, no guessing required.

Here is what I look for first when a query is underperforming:

1. Thick arrows between operators: wide data flows signal excessive row estimates and memory pressure
2. Key Lookups: these often mean a nonclustered index is missing one or two covering columns
3. Hash Matches on large tables: usually a sign of outdated statistics or a missing join index
4. Parallelism warnings: CXPACKET waits visible in the plan indicate skewed data distribution or MAXDOP misconfiguration
5. Estimated vs actual row counts: a significant gap almost always points to stale statistics or parameter sniffing

Once you identify the bottleneck operator, the fix is usually surgical. Update statistics, add a covering index, rewrite the predicate, or force a plan hint where justified. You rarely need to rewrite the entire query.

Execution plan analysis is not reserved for DBAs. Every engineer who writes T-SQL should be comfortable opening an actual execution plan before escalating a performance issue. Build that habit early and you will resolve most slow query tickets in under thirty minutes.

#SQLServer #QueryOptimization #DataEngineering #PerformanceTuning #DatabaseAdministration
Harshini Ravi
1w
Day 15/365 - SQL Tip: Mastering Conditional JOINs

A Conditional JOIN is a powerful SQL technique where you add extra conditions directly inside the `ON` clause. Instead of simply matching rows using a key, you can control exactly which records should be joined.

📌 Basic Example

    SELECT c.customer_name, o.order_id, o.order_status
    FROM customers c
    LEFT JOIN orders o
           ON c.customer_id = o.customer_id
          AND o.order_status = 'Completed';

In this query:
• All customers are returned
• Only completed orders are joined
• Customers without completed orders still appear, with NULL in the order columns

❓ Why This Matters
Placing conditions in the `ON` clause preserves the behavior of an OUTER JOIN. If you move the condition to the `WHERE` clause, it is evaluated after the join, so the NULL-filled rows for unmatched customers fail the test and your `LEFT JOIN` effectively turns into an `INNER JOIN`.

❌ Risky Approach: the query below removes customers who have no completed orders.

    SELECT *
    FROM customers c
    LEFT JOIN orders o
           ON c.customer_id = o.customer_id
    WHERE o.order_status = 'Completed';

✅ Best Practice: When you want to keep unmatched rows from the left table, place filter conditions on the joined table inside the `ON` clause. Put them in `WHERE` only when you intend to drop the unmatched rows.

Where is this applicable in real-world scenarios?
• Active customers only
• Recent transactions
• Date-range joins
• Soft-delete handling
• Category-specific matching

Master this concept, and your SQL skills will level up instantly.

#SQL #DataAnalytics #DataEngineering #LearnSQL #SQLTips #Database #Analytics #BusinessIntelligence #DataScience #ConditionalJoin
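The difference between the two placements is easy to see on three customers. The sketch below runs both queries against SQLite through Python's sqlite3 module. The sample rows are hypothetical: one customer has a completed order, one has only a pending order, and one has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER,
        order_status TEXT
    );
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (100, 1, 'Completed'), (101, 2, 'Pending');
""")

# Filter inside ON: every customer survives; unmatched ones get NULL columns.
filter_in_on = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    LEFT JOIN orders o
           ON c.customer_id = o.customer_id
          AND o.order_status = 'Completed'
    ORDER BY c.customer_id
""").fetchall()

# Filter inside WHERE: rows whose order columns are NULL (or not 'Completed')
# are discarded after the join, so the result matches an INNER JOIN.
filter_in_where = conn.execute("""
    SELECT c.customer_name, o.order_id
    FROM customers c
    LEFT JOIN orders o
           ON c.customer_id = o.customer_id
    WHERE o.order_status = 'Completed'
    ORDER BY c.customer_id
""").fetchall()

print(filter_in_on)     # three rows, two of them with order_id None
print(filter_in_where)  # only the customer with a completed order
```

Note that Ben disappears from the second result even though he has an order: the join matched his pending order, and the WHERE clause then rejected that row.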
Venkata Sai Charan Ravipati
1w
🚨 SQL Trap: NULL + IN operator = Silent Data Thief

This one has fooled senior developers in production. Let me save you the pain.

The IN operator in SQL does NOT match NULLs, even when NULL is in the list. Why? Because NULL means "unknown." And UNKNOWN IN (anything) = UNKNOWN, not TRUE. The row just vanishes. No error. No warning.

🚨 TRAP #1: NULL in IN list

    SELECT * FROM orders
    WHERE status IN ('pending', NULL, 'shipped');

You expect NULL rows to be included. Reality: they're silently ignored. 😱

🚨 TRAP #2: NOT IN with a subquery containing NULLs

    SELECT * FROM customers
    WHERE id NOT IN (
        SELECT customer_id FROM orders);

If even ONE customer_id is NULL in orders → this returns ZERO rows. 💀 Complete data wipeout. No error thrown.

✅ FIX #1: Handle NULL explicitly

    SELECT * FROM orders
    WHERE status IN ('pending', 'shipped')
       OR status IS NULL;

✅ FIX #2: Filter NULLs in subquery

    SELECT * FROM customers
    WHERE id NOT IN (
        SELECT customer_id FROM orders
        WHERE customer_id IS NOT NULL
    );

✅ FIX #3: Use NOT EXISTS (NULL-safe by design)

    SELECT * FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o
        WHERE o.customer_id = c.id
    );

The golden rule: IN and NOT IN are NOT null-safe. Always check your subqueries for NULLs. When in doubt, use EXISTS / NOT EXISTS.

#SQL #Database #DataEngineering #TechTips #DataAnalytics
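Both traps and all three fixes can be reproduced in a few lines. The sketch below uses SQLite through Python's sqlite3 module. The table contents and the ids() helper are hypothetical: three customers, plus one order row whose customer_id and status are both NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        status      TEXT
    );
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1, 1, 'pending'), (2, NULL, NULL);
""")


def ids(sql):
    """Run a single-column query and return its values as a list."""
    return [row[0] for row in conn.execute(sql)]


# Trap 1: NULL in the IN list does not match the NULL status row.
in_with_null = ids(
    "SELECT order_id FROM orders "
    "WHERE status IN ('pending', NULL, 'shipped') ORDER BY order_id"
)
# Fix 1: test for NULL explicitly.
in_fixed = ids(
    "SELECT order_id FROM orders "
    "WHERE status IN ('pending', 'shipped') OR status IS NULL ORDER BY order_id"
)

# Trap 2: one NULL in the subquery empties the NOT IN result.
not_in = ids(
    "SELECT id FROM customers "
    "WHERE id NOT IN (SELECT customer_id FROM orders) ORDER BY id"
)
# Fix 2: filter NULLs out of the subquery.
not_in_filtered = ids(
    "SELECT id FROM customers WHERE id NOT IN ("
    "  SELECT customer_id FROM orders WHERE customer_id IS NOT NULL"
    ") ORDER BY id"
)
# Fix 3: NOT EXISTS is unaffected by the NULL row.
not_exists = ids(
    "SELECT c.id FROM customers c WHERE NOT EXISTS ("
    "  SELECT 1 FROM orders o WHERE o.customer_id = c.id"
    ") ORDER BY c.id"
)

print(in_with_null, in_fixed, not_in, not_in_filtered, not_exists)
```

Customers 2 and 3 have no orders, yet the plain NOT IN query returns neither of them, while both fixes return exactly those two.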
Moe Ahad
5d
Working with NULLs 🚨

NULL values in SQL are one of the most misunderstood concepts, and they can silently break your queries. I recently came across a great deep-dive on working with NULLs in T-SQL, and it reinforced a few critical points every data professional should know.

First, the fundamentals:
✅ NULL ≠ empty. A missing integer is not 0. A NULL string is not ''. Treat them differently.
✅ SQL uses three-valued logic: TRUE, FALSE, and UNKNOWN. NULLs evaluate to UNKNOWN, and since WHERE clauses only keep TRUE, NULL rows are silently excluded unless handled explicitly.

Where things go wrong 👇
• column = NULL → returns UNKNOWN (not TRUE)
• NULL = NULL → also UNKNOWN (not TRUE)
• WHERE filters drop anything not TRUE → data disappears without warning

And the big one:
⚠️ A single NULL inside a NOT IN subquery can cause your query to return zero rows, even when valid matches exist.

Functions & patterns you should know:
• IS NULL / IS NOT NULL → always use these (never = or <>)
• ISNULL → simple, fast, SQL Server-specific (2 params)
• COALESCE → ANSI standard, multiple params, returns highest-precedence type
• NULLIF → useful for avoiding divide-by-zero
• TRY_CAST / TRY_CONVERT → return NULL instead of throwing errors

Other edge cases that trip people up:
• Joins on nullable columns → NULL = NULL is not TRUE, so expected matches can disappear
• NOT IN + NULLs → returns empty set → use NOT EXISTS instead
• Set operators (UNION, INTERSECT, EXCEPT) → treat NULLs as equal (unlike standard comparisons)

💡 Bottom line: Most people think SQL works in TRUE/FALSE. It doesn't. It works in TRUE / FALSE / UNKNOWN, and NULLs are what introduce that third state.

That's why:
👉 Queries can look right but return the wrong results
👉 Dashboards can quietly drift off from reality

Real talk: A lot of "data discrepancies" aren't complex… They're just NULLs being ignored.

If you build dashboards, pipelines, or anything stakeholders rely on:
👉 This is one of those small details that separates "it runs" from "it's trustworthy."
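Several of the edge cases listed above can be checked directly. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for T-SQL, so it sticks to features both share (comparisons, joins, COALESCE, INTERSECT); the tables a and b and the scalar() helper are hypothetical. Python's None represents SQL NULL in the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (k INTEGER);
    CREATE TABLE b (k INTEGER);
    INSERT INTO a VALUES (1), (NULL);
    INSERT INTO b VALUES (1), (NULL);
""")


def scalar(sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]


# Comparisons with NULL yield UNKNOWN (surfaced as None), never TRUE.
null_equals_null = scalar("SELECT NULL = NULL")
# IS NULL is the correct test and yields a real boolean (1 in SQLite).
null_is_null = scalar("SELECT NULL IS NULL")
# WHERE keeps only TRUE, so an UNKNOWN predicate returns no rows.
rows_kept = conn.execute("SELECT 'kept' WHERE NULL = NULL").fetchall()

# Join on a nullable column: the two NULL keys do not match each other.
joined = conn.execute("SELECT a.k FROM a JOIN b ON a.k = b.k").fetchall()

# Set operators treat NULLs as equal, unlike the comparisons above.
intersected = conn.execute("SELECT k FROM a INTERSECT SELECT k FROM b").fetchall()

# COALESCE returns the first non-NULL argument.
fallback = scalar("SELECT COALESCE(NULL, NULL, 'fallback')")

print(null_equals_null, null_is_null, rows_kept, joined, fallback)
print(sorted(intersected, key=lambda r: (r[0] is not None, r[0])))
```

The join returns one row while INTERSECT returns two, which is the contrast between comparison semantics and set-operator semantics described in the last bullet.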
Brinal Rodrigues
2w
🚀 SQL Tip: NULLIF() – Small Function, Big Use Case!

Have you ever faced a situation where you need to avoid division by zero or want to convert a specific value into NULL? That's where NULLIF() comes in handy.

✅ What NULLIF() does:
It compares two expressions.
• If both are equal → returns NULL
• If not equal → returns the first expression

📌 Syntax:

    NULLIF(expression1, expression2)

🔥 Common Example: Avoid Division by Zero

    SELECT SalesAmount / NULLIF(Quantity, 0) AS AvgPrice
    FROM SalesTable;

👉 If Quantity = 0, NULLIF() returns NULL, and the division safely returns NULL instead of throwing an error.

🎯 Another Example: Convert unwanted values into NULL

    SELECT NULLIF(CustomerName, '') AS CleanCustomerName
    FROM Customers;

This converts empty strings into NULL, which is useful in reporting and data cleanup.

💡 Why it's useful?
✔ Prevent runtime errors
✔ Cleaner data handling
✔ Helpful in reporting calculations
✔ Makes queries more readable

📌 Pro Tip: NULLIF() is often used with COALESCE() for better default values.

#SQL #SQLServer #Database #TSQL #DataEngineering #SQLTips
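The NULLIF() plus COALESCE() combination from the pro tip can be sketched as follows, using SQLite through Python's sqlite3 module. The SaleID and CustomerID columns, the sample rows, and the 'Unknown' default are assumptions made for the illustration. One caveat: SQLite already returns NULL for division by zero, so this run only shows the shape of the result; it is on engines such as SQL Server, where dividing by zero raises an error, that the NULLIF guard actually prevents a failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesTable (
        SaleID      INTEGER PRIMARY KEY,
        SalesAmount REAL,
        Quantity    INTEGER
    );
    CREATE TABLE Customers (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT
    );
    INSERT INTO SalesTable VALUES (1, 100.0, 4), (2, 50.0, 0);
    INSERT INTO Customers VALUES (1, 'Ana'), (2, ''), (3, NULL);
""")

# Guarded division: a zero Quantity becomes NULL, so AvgPrice is NULL.
avg_prices = conn.execute("""
    SELECT SaleID, SalesAmount / NULLIF(Quantity, 0) AS AvgPrice
    FROM SalesTable
    ORDER BY SaleID
""").fetchall()

# NULLIF turns '' into NULL; COALESCE then supplies one default for
# both the empty-string row and the row that was NULL to begin with.
clean_names = conn.execute("""
    SELECT CustomerID,
           COALESCE(NULLIF(CustomerName, ''), 'Unknown') AS CleanCustomerName
    FROM Customers
    ORDER BY CustomerID
""").fetchall()

print(avg_prices)
print(clean_names)
```

Wrapping NULLIF in COALESCE means empty strings and genuine NULLs end up with the same display value, which is usually what a report needs.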
Shahriya Naeem
2w
Day 13/30 of SQL Challenge

Today I learned about pattern matching in SQL using LIKE.

While working with text data, I realized that exact matching is often not enough. We sometimes need to search for patterns, partial matches, or specific formats. This is where the LIKE operator becomes useful.

Concept:
LIKE is used in the WHERE clause to search for a specified pattern in a column.

Basic syntax:

    SELECT column_name
    FROM table_name
    WHERE column_name LIKE pattern;

Common patterns:
* '%' -> represents zero, one, or multiple characters
* '_' -> represents exactly one character

Examples:

1. Find names starting with 'A'

    SELECT name FROM customers WHERE name LIKE 'A%';

2. Find names ending with 'n'

    SELECT name FROM customers WHERE name LIKE '%n';

3. Find names containing 'ar'

    SELECT name FROM customers WHERE name LIKE '%ar%';

4. Find names with exactly 5 characters

    SELECT name FROM customers WHERE name LIKE '_____';

Explanation:
* '%' gives flexibility for partial matching
* '_' helps match fixed-length patterns

Key understanding:
LIKE allows us to work with real-world messy text data where exact matches are not always possible.

Practical use cases:
* Searching users by partial name
* Filtering emails or domains
* Finding patterns in product names or codes

Important note:
LIKE is case-sensitive in some databases and case-insensitive in others, depending on the system being used.

Reflection:
This concept made me realize that querying text data requires flexibility, not just exact conditions.

#SQL #LearningInPublic #Data #BackendDevelopment #SQLPractice #BuildInPublic
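The four example patterns can be run against a handful of names. The sketch below uses SQLite through Python's sqlite3 module; the customers rows and the names_like() helper are hypothetical. It also illustrates the case-sensitivity note: SQLite's LIKE ignores case for ASCII letters by default, which is an engine-specific behavior rather than a rule of SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT);
    INSERT INTO customers VALUES
        ('Aaron'), ('Al'), ('Brian'), ('Karen'), ('Maria');
""")


def names_like(pattern):
    """Return the names matching a LIKE pattern, in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE name LIKE ? ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]


starts_with_a = names_like("A%")     # '%' matches any run of characters
ends_with_n = names_like("%n")
contains_ar = names_like("%ar%")
five_chars = names_like("_____")     # five '_' = exactly five characters

# Engine-specific: in SQLite a lowercase pattern still matches 'Aaron' and 'Al'.
lowercase_pattern = names_like("a%")

print(starts_with_a, ends_with_n, contains_ar, five_chars, lowercase_pattern)
```

Passing the pattern as a bound parameter rather than building the SQL string by hand keeps user-supplied search text from being interpreted as SQL.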
Sapna Nimbalkar
1w Edited
📌 This SQL query looks 100% correct… but returns ZERO rows

No error. No warning. Just empty results. At first glance, nothing looked wrong 👇

    SELECT CustomerID
    FROM Customers
    WHERE CustomerID NOT IN (
        SELECT CustomerID FROM Orders
    );

Everything checked out:
✔ Data
✔ Logic
✔ Joins

Still… EMPTY result.

🔍 The hidden culprit => NULL
Just one NULL in the subquery can break everything.

What SQL actually does internally
Subquery returns: (2, 3, NULL)
SQL interprets this as:

    CustomerID <> 2 AND CustomerID <> 3 AND CustomerID <> NULL

Now here's the catch 👇 For CustomerID = 1:

    1 <> NULL → UNKNOWN

So the condition becomes:

    TRUE AND TRUE AND UNKNOWN = UNKNOWN

And SQL behaves like this:
TRUE → keep
FALSE → remove
UNKNOWN → also remove

💥 Result → EMPTY DATASET, even when valid rows exist. And that's how one NULL can silently invalidate your entire dataset.

✅ The safer approach → NOT EXISTS

    SELECT c.CustomerID
    FROM Customers c
    WHERE NOT EXISTS (
        SELECT 1
        FROM Orders o
        WHERE o.CustomerID = c.CustomerID
    );

✔ Works row by row
✔ Stops early (better performance)
✔ NULL doesn't break logic

🔥 Learning
NOT IN doesn't fail loudly… it fails silently.

💡 Rule to follow:
- Default → NOT EXISTS
- Use NOT IN → only when you're 100% sure NO NULLs exist

#SQL #SQLServer #DataAnalytics
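The truth-table walk-through above can be evaluated expression by expression. The sketch below uses SQLite through Python's sqlite3 module, where UNKNOWN comes back as None, TRUE as 1, and FALSE as 0; the scalar() helper and variable names are hypothetical, and the values (2, 3, NULL) are the ones from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def scalar(sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]


# A single comparison with NULL is UNKNOWN.
one_vs_null = scalar("SELECT 1 <> NULL")

# CustomerID = 1: TRUE AND TRUE AND UNKNOWN evaluates to UNKNOWN.
expanded_for_1 = scalar("SELECT (1 <> 2) AND (1 <> 3) AND (1 <> NULL)")

# CustomerID = 2: the first term is FALSE, so the whole AND is FALSE.
expanded_for_2 = scalar("SELECT (2 <> 2) AND (2 <> 3) AND (2 <> NULL)")

# NOT IN gives the same answers as the expanded form.
not_in_for_1 = scalar("SELECT 1 NOT IN (2, 3, NULL)")
not_in_for_2 = scalar("SELECT 2 NOT IN (2, 3, NULL)")

# WHERE keeps only TRUE, so the UNKNOWN case returns no row.
kept = conn.execute("SELECT 'kept' WHERE 1 NOT IN (2, 3, NULL)").fetchall()

print(one_vs_null, expanded_for_1, expanded_for_2)
print(not_in_for_1, not_in_for_2, kept)
```

No value can make the predicate TRUE once a NULL is in the list: it is either FALSE (the value is present) or UNKNOWN (the value is absent), which is why the query returns nothing.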
