Deleting 125k Duplicates in Oracle Database with CTAS and ROW_NUMBER

Ever opened a database and found 125k duplicates of the exact same entity? 😅 That was my starting point in a recent data engineering challenge involving an Oracle database — and yes, it was as fun (and scary) as it sounds. At first glance, it looked like a simple cleanup. But reality kicked in quickly: this wasn’t just about deleting duplicates ❌ — it was about safely merging records while preserving relationships across multiple dependent tables. Think foreign keys everywhere, data inconsistencies, and a lot of “if I delete this… what breaks?” 🤯 The first 125k rows turned into almost 200k 👀 The real twist came with performance. Traditional DELETE operations? Painfully slow on large datasets (~30m) 🐢. So I switched gears and leaned on CTAS (Create Table As Select) + analytical functions like ROW_NUMBER — and boom 💥 massive performance gains. Sometimes the database just wants you to play smarter, not harder. I also tried python tools like pandas, polars, and parquet 🐍⚡. Super powerful for transformations, but it reinforced something important: the best solution isn’t always the fanciest stack — it’s the one that fits your context. All of this was done in a non-production environment (safety first 🛟), and now it’s ready to be aligned with the team for rollout. Great reminder that data engineering is part logic, part strategy… and part detective work 🕵️♂️. #DataEngineering #SQL #Oracle #DataQuality #ETL #Python #BigData

1 Comment

Evelym Rondon Pereira ☁ 1w

Congratulations 🎉🍾

1 Reaction

To view or add a comment, sign in

More Relevant Posts

Sudheer C
2w
Report this post
I stopped writing long SQL queries. And my work got better. Earlier, I thought complex problems needed complex queries. One giant script. Nested logic. Everything in one place. It looked impressive. It was also hard to debug, hard to explain, and easy to break. So I changed one habit. Now I write SQL like I’m telling a story. 🔹 Break it into steps 🔹 Use clear, meaningful names 🔹 Build logic layer by layer 🔹 Validate each step before moving on Most of my queries now are just a series of simple blocks stitched together. The result? Faster debugging. Cleaner logic. Easier handoffs. Here’s the truth: SQL isn’t about writing the smartest query. It’s about writing the clearest one. 🔍 If someone else reads your query tomorrow, will they understand it in 2 minutes? #SQL #DataAnalytics #DataEngineering #AnalyticsMindset #QueryOptimization #DataModeling #ETL #DataWorkflow #BigQuery #Snowflake #Database #DataProfessionals #TechCareers #CleanCode #DataBestPractices #AnalyticsCommunity #DataStorytelling #CodingTips
Like Comment
To view or add a comment, sign in
Chandu Deeti
2d
Report this post
😅 Things SQL Developers Say vs What Actually Happens We’ve all been there… --- 💬 “This query will run fast” ⏳ Runs forever 💬 “Just testing with SELECT *” 🚨 Ends up in production 💬 “Works perfectly in Dev” 🔥 Fails in Prod 💬 “Small change only” 💥 Breaks the entire pipeline --- SQL looks simple… Until it isn’t 😄 --- 💡 Real growth starts when you: ✔ Understand execution plans ✔ Optimize queries ✔ Think about performance early --- 🚀 Every SQL developer has faced this phase. The difference? Some learn from it. --- 💾 Save this if you’ve been here 🔁 Share with your SQL friends ➕ Follow for more real Data Engineering content #SQL #DataEngineering #DeveloperLife #Coding #Analytics #ETL #LearnSQL
Like Comment
To view or add a comment, sign in
Sumit Sharma
3w
Report this post
The difference between Junior and Senior SQL? It’s moving from 'What' to 'How'. When you start with SQL, you focus on what data to pull: SELECT -> FROM -> WHERE. When you master Advanced SQL, you focus on how that data is processed. You transition from querying data to architecting performance. ⬇️ This graphic maps out the four pillars that defined my own shift into senior data roles: 1️⃣ Window Functions: Moving beyond static analysis to dynamic, row-by-row calculations (RANK, ROW_NUMBER, PARTITION BY). 2️⃣ CTEs & Hierarchies: Turning unreadable, monolithic scripts into modular, maintainable, logical code (and handling recursion like a boss). 3️⃣ Indexing & Optimization: The core of Database Engineering. Understanding B-Trees and execution plans to turn a 10-minute query into a 10-second one. 4️⃣ Transactions & ACID: Ensuring data integrity and reliability, even when dealing with massive concurrency and high-stakes operations. Mastering the logic is one thing. Mastering the architecture is where the true value lies. Which of these four pillars are you currently focused on mastering? 🧠 #AdvancedSQL #DataArchitecture #DatabaseEngineering #DataOps #SQLMasterclass
Like Comment
To view or add a comment, sign in
Mohamed Khasim
2w
Report this post
Most SQL engineers are writing 10 lines when they need 3. QUALIFY is why. Here's what most engineers do to filter a window function: SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) AS rn FROM orders ) sub WHERE rn = 1 Here's what QUALIFY lets you do: SELECT * FROM orders QUALIFY ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) = 1 Same result. No subquery. No noise. Think of it this way: → HAVING filters aggregate results → QUALIFY filters window function results That's it. Works in: Snowflake · BigQuery · DuckDB · Teradata Not yet in: PostgreSQL · MySQL Know your warehouse before you ship it. The engineers writing cleaner SQL aren't writing more SQL. They're writing less of it — better. Which SQL clause do you wish you'd found sooner? ♻️ Repost to help a data engineer clean up their code today. #SQL #DataEngineering #Snowflake #SQLTips #DataEngineers Follow Mohamed Khasim for more on Data Engineering + AI
6 Comments
Like Comment
To view or add a comment, sign in
Shivani Yadav
5d
Report this post
📈 Database Design at Scale: Lessons Learned 🚨 Warning: A bad database schema is technical debt you pay for forever. 💡 One of my earliest challenges: designing core DB structures for SQL Server & PostgreSQL to serve millions of users. 🏗️ That experience permanently shaped how I approach data modeling. What scaling taught me about DB design: 👇 ✅ Normalize early, denormalize only with deliberate performance reason. 🧹🔍 Index for actual queries, not imagined ones. 🏎️💨 🧩 Relationship cost is massive – don't underestimate it. 💔 🌱 Design for data growth, not just today's data. 🚀 The biggest trap for backend engineers? 🧐 Designing for the "happy path." 😊 💥 Production data never stays on the happy path. 🌟 Good design is invisible. 👻 Bad design is an 11 PM Friday nightmare. 😭🕰️ #DatabaseDesign #SQLServer #PostgreSQL #BackendEngineering #SoftwareEngineering
1 Comment
Like Comment
To view or add a comment, sign in
MAYAVATHI AYYAPPAN
1w
Report this post
🚀Week 6 SQL journey :Scaling Up & Speeding Up with Oracle SQL My journey into Data Analytics is getting more technical and exciting! This week was all about moving from "writing queries" to "managing systems." Here are the four pillars of my learning this week: 🔗 Mastering Joins: Data is rarely in one place. I dove deep into Equi Joins for perfect matches, Outer Joins to catch missing data, and the Self Join—a powerful tool for uncovering hierarchical relationships within a single table. 🔢 The ROWNUM Logic: I finally cracked the "Subquery" secret for ROWNUM. Since it's a pseudo-column, you can't just pick "Row 5." I learned how to use nested queries to "freeze" row numbers, allowing for precise data pagination. 📥 SQL*Loader Automation: Why insert rows one by one when you can automate? I practiced building Control Files (.ctl) to map external CSV data directly into my schema. Understanding FIELDS TERMINATED BY and OPTIONALLY ENCLOSED BY is a game-changer for high-speed data loading. ⚡ Materialized Views (MV): The ultimate performance booster! I learned how to create Materialized Views that physically store query results. By using BUILD IMMEDIATE and REFRESH COMPLETE ON COMMIT, I can turn heavy, slow calculations into instant insights. It’s been a week of connecting the dots—literally and figuratively. Huge thanks to my mentor Praveen Kalimuthu and Tech Data Community for the guidance! #DataAnalysis #SQL #PLSQL #SQLLDR #Python #MONGODB #PowerBI #LearningJourney
Like Comment
To view or add a comment, sign in
Sourav Kumar
4w
Report this post
They say you haven't truly lived as a developer until you’ve had a minor heart attack before hitting "Execute" on a DELETE query. 😅 I officially started my SQL journey today with Chai Aur Code, and the biggest "Aha!" moment wasn't about syntax—it was about architecture. The big realisation: Data doesn't go magically inside the database. It lives on the hard disk or SSD. The database is actually just highly intelligent software that acts as a bridge, helping us read and write to that disk efficiently. Whether it's SQL or NoSQL, at the end of the day, they both store data on physical disks. The real magic lies in the architecture each software follows to handle different use cases. Here’s how I’m breaking down the learning: DDL (Data Definition Language): Designing the blueprint of how that data is structured on the disk. DQL (Data Query Language): The efficient way to ask the software to "fetch" specific bits of data. DML (Data Manipulation Language): The power to change, add, or—frighteningly—remove data. Always write a SELECT query with your WHERE clause first to verify the results before you swap it for a DELETE. Better safe than sorry! Huge thanks to Hitesh Choudhary and Piyush Garg for making these deep architecture concepts so simple and practical. Onward to Joins and more! 🚀 #SQL #NoSQL #DBMS #ChaiAndCode #BackendDevelopment #CodingJourney #DataEngineering #SoftwareArchitecture Hitesh Choudhary, Piyush Garg, Akash Kadlag, Jay Kadlag
Like Comment
To view or add a comment, sign in
Abhishek Sen
4w
Report this post
Most people think SQL is just about SELECT. That’s where most careers stay average. But top-tier Developers and Data Analysts know something different: SQL is the structural language of decision-making. It doesn’t just fetch data. It builds systems, controls access, protects transactions, and powers business intelligence. Here’s the backbone every professional should master: 🏗️ DDL — The Architects Build and reshape the database structure. CREATE | ALTER | DROP | TRUNCATE ✍️ DML — The Editors Modify the actual data inside the tables. INSERT | UPDATE | DELETE 🔍 DQL — The Investigators The world’s most-used command for extracting insights. SELECT ⚖️ DCL & TCL — The Guardians Manage permissions and ensure transaction safety. GRANT | REVOKE | COMMIT | ROLLBACK Here’s a question many beginners get wrong: DELETE vs TRUNCATE ❌ DELETE removes selected rows using WHERE → slower → can be rolled back → row-by-row deletion 🧹 TRUNCATE wipes the full table instantly → much faster → resets storage efficiently → no WHERE clause That one difference alone can save hours of execution time in large datasets. Introduced by IBM in the 1970s, SQL still remains the universal language across Oracle, MySQL, PostgreSQL, and SQL Server. Technology evolves. SQL remains. Because data is the new oil, and SQL is the refinery. 📊 Which SQL command do you use the most in your daily work? Mine is SELECT — because every business decision starts with the right question. #SQL #DataAnalytics #Programming #SoftwareEngineering #DataScience
Like Comment
To view or add a comment, sign in
Mohammad Tanvir Chowdhury
4w
Report this post
Ever wondered what actually happens inside a database when you write data? I wrote a breakdown of the core storage structures : B-Trees, LSM Trees, WAL, Checkpointing, and Compaction with real examples and how they show up in systems like PostgreSQL and Cassandra. Worth a read if you're into backend or system design. https://lnkd.in/g5A6W8Ye

How Databases Store Your Data: B-Trees, LSM Trees, WAL, and More tanvir.qom.bd
Like Comment
To view or add a comment, sign in

1,027 followers

5 Posts

View Profile Connect

Deleting 125k Duplicates in Oracle Database with CTAS and ROW_NUMBER

More Relevant Posts

Explore related topics

Explore content categories