Prevent Non-Deterministic Ordering with ROW_NUMBER()

⚡ The Window Function Trap That's Costing You Accuracy A subtle mistake most SQL writers make without realizing it. When you use ROW_NUMBER() to deduplicate records, the order of ties matters — but most people never specify a deterministic tiebreaker. If two rows share the same PARTITION BY key and the same ORDER BY value, the database engine is free to assign row numbers arbitrarily. Your results become non-reproducible across runs — and in production, that's a silent data quality bug. ❌ WRONG — Non-deterministic ordering: SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY created_at DESC -- ⚠️ ties unresolved ) AS rn FROM events ) WHERE rn = 1; ✅ RIGHT — Deterministic tiebreaker: SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY created_at DESC, event_id ASC -- ✓ unique tiebreaker ) AS rn FROM events ) WHERE rn = 1; "Non-deterministic ordering is the most common source of flaky deduplication logic in production pipelines." If two rows have the same user_id and created_at, which one survives? Without a tiebreaker — it's random. Run the same query twice and you might get different rows. 🔑 Rule of thumb: Every ROW_NUMBER() OVER clause should end with a column that is globally unique — typically a surrogate key, UUID, or row-level hash. Even if it feels redundant, add it anyway. Reproducibility is a feature, not an accident. Always test your deduplication logic with intentional duplicate timestamps. If the result changes between runs, you're missing a tiebreaker. #SQL #DataEngineering #WindowFunctions #Analytics #DataQuality

To view or add a comment, sign in

Explore content categories