CTEs over Subqueries for Clean SQL Code

Your SQL should read like a story, not a puzzle. In modern Data Engineering, "it works" isn't the only requirement. Readability is a production feature. Here is why the choice between Subqueries and CTEs defines your code quality:

🔹 Subqueries (The Quick Fix)
A query nested inside another query.
Pros:
• Great for simple, one-off filtering or scalar checks.
• Quick to write for a "throwaway" query.
Cons:
• Creates "spaghetti code" that grows inward.
• Hard to debug when a join fails 4 levels deep.
• Makes it nearly impossible for a teammate to follow your logic.

🚀 CTEs, Common Table Expressions (The Systematic Way)
Defined using the WITH clause at the top of your script.
Pros:
• Modular: you "name" your steps (e.g., cleaned_sales).
• Top-to-bottom: anyone can read your logic like a book.
• Reusable: reference the same logic multiple times.

Performance note: in modern engines like Spark or SQL Server, the optimizer is smart enough to handle both efficiently. But your coworkers aren't compilers; they need to be able to read your work.

💡 Simple way to think about it:
Subqueries = a long paragraph with 12 commas.
CTEs = a clean, organized list of bullet points.

The Verdict: if you're building a Medallion architecture, especially in the Silver layer, default to CTEs. Your future self (and your teammates) will thank you.

#DataEngineering #SQL #Databricks #Azure #CleanCode #MedallionArchitecture #Subquery #CTE
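To make the comparison concrete, here is a minimal sketch of the same aggregation written both ways; the sales table and its columns are hypothetical placeholders, not from the original post.

-- Subquery version: the logic grows inward and has to be read from the inside out.
SELECT region, SUM(net_amount) AS total_net
FROM (
    SELECT region, amount - discount AS net_amount
    FROM sales
    WHERE order_status = 'COMPLETED'
) AS cleaned_sales
GROUP BY region;

-- CTE version: the same logic, named once and read top to bottom.
WITH cleaned_sales AS (
    SELECT region, amount - discount AS net_amount
    FROM sales
    WHERE order_status = 'COMPLETED'
)
SELECT region, SUM(net_amount) AS total_net
FROM cleaned_sales
GROUP BY region;

On modern optimizers both forms should produce the same plan; the difference is that the named CTE step can be reused and reviewed in isolation.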
More Relevant Posts
They say you haven't truly lived as a developer until you've had a minor heart attack before hitting "Execute" on a DELETE query. 😅

I officially started my SQL journey today with Chai Aur Code, and the biggest "Aha!" moment wasn't about syntax; it was about architecture.

The big realisation: data doesn't magically live inside the database. It lives on the hard disk or SSD. The database is really just highly intelligent software that acts as a bridge, helping us read and write to that disk efficiently. Whether it's SQL or NoSQL, at the end of the day both store data on physical disks. The real magic lies in the architecture each system follows to handle different use cases.

Here's how I'm breaking down the learning:
• DDL (Data Definition Language): designing the blueprint of how the data is structured on disk.
• DQL (Data Query Language): the efficient way to ask the software to "fetch" specific bits of data.
• DML (Data Manipulation Language): the power to change, add, or, frighteningly, remove data.

Always write a SELECT query with your WHERE clause first to verify the results before you swap it for a DELETE. Better safe than sorry!

Huge thanks to Hitesh Choudhary and Piyush Garg for making these deep architecture concepts so simple and practical. Onward to Joins and more! 🚀

#SQL #NoSQL #DBMS #ChaiAndCode #BackendDevelopment #CodingJourney #DataEngineering #SoftwareArchitecture
Hitesh Choudhary, Piyush Garg, Akash Kadlag, Jay Kadlag
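The "SELECT first, DELETE second" habit from the post looks like this in practice; the orders table and the filter values are hypothetical placeholders.

-- Step 1: preview exactly which rows the filter matches.
SELECT *
FROM orders
WHERE order_status = 'CANCELLED' AND created_at < '2023-01-01';

-- Step 2: only after reviewing the preview, reuse the identical WHERE clause.
DELETE FROM orders
WHERE order_status = 'CANCELLED' AND created_at < '2023-01-01';

Most engines also let you wrap the DELETE in an explicit transaction (BEGIN ... ROLLBACK/COMMIT) so a bad filter can still be undone before it is committed.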
Why your Index is being IGNORED ❌

You added an index, but your query is still slow. You check the execution plan and see the dreaded "Index Scan." Why did the database ignore your shortcut and choose the long way around?

The 3 Reasons Your Index is Failing:

1️⃣ Non-SARGable Queries: if you wrap your column in a function like WHERE UPPER(user_name) = 'HARITHA', the engine can't use the index. It has to transform every single row first.
The Fix: keep your columns "naked." Use WHERE user_name = 'Haritha' (assuming case-insensitivity) or handle transformations in your ETL.

2️⃣ The "Selectivity" Tipping Point: if your query returns more than ~20% of the table, the optimizer decides it's actually faster to just read the whole thing (Scan) rather than jumping back and forth (Seek).
The Fix: be more specific with your filters. If you need 50% of the data, an index might not be the right tool; partitioning is.

3️⃣ Leading Wildcards: LIKE '%Gurram' forces a scan because the engine doesn't know where the string starts.
The Fix: use trailing wildcards like LIKE 'Gurram%' to allow the engine to "Seek" on the starting characters.

The Result: moving from an Index Scan to an Index Seek isn't just a small win; it's often a 100x speed improvement for your production workloads. 🚀

Are you checking your "Explain Plans" for Scans, or just hoping the Index works? Let's swap tuning tips in the comments! 👇

I'm Haritha Gurram, a Senior Data Engineer specializing in high-performance, cost-effective data engines. Let's optimize! 🤝

#SQL #DataEngineering #BigData #QueryOptimization #PerformanceTuning #SeniorDataEngineer #Walgreens #Costco #CloudComputing #DatabaseDesign #10YearsInTech #TechArchitecture #OpenToWork #Databricks #Pyspark #Snowflake #python
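A small sketch of the SARGability and wildcard patterns side by side; the users table, its index, and the values are hypothetical, and whether the plain equality match is case-insensitive depends on your collation.

-- Assume an index exists on user_name.

-- Non-SARGable: the function hides the column, so the index cannot be seeked.
SELECT * FROM users WHERE UPPER(user_name) = 'HARITHA';
-- SARGable rewrite: the naked column allows an Index Seek.
SELECT * FROM users WHERE user_name = 'Haritha';

-- Leading wildcard: the engine does not know the prefix, so it scans.
SELECT * FROM users WHERE user_name LIKE '%Gurram';
-- Trailing wildcard: the known prefix allows a Seek on the index.
SELECT * FROM users WHERE user_name LIKE 'Gurram%';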
I handwrote a complete SQL Roadmap so you don't have to google "where to start with SQL" ever again. 🗺️

Here's everything covered (Beginner → Advanced):

📌 SQL Commands
→ DDL: CREATE, ALTER, DROP, TRUNCATE
→ DML: INSERT, UPDATE, DELETE
→ TCL: COMMIT, ROLLBACK
→ DCL: GRANT, REVOKE
→ DQL: SELECT

📌 Clauses
→ WHERE, HAVING, GROUP BY, ORDER BY, FROM

📌 Joins
→ INNER, LEFT, RIGHT, FULL OUTER, CROSS, SELF

📌 Subqueries
→ Scalar, Inline, Correlated, CTE

📌 Indexes
→ Unique, Bitmap, B-Tree, Composite

📌 Functions
→ Aggregate, Arithmetic, Date, Char, Analytical, REGEXP

📌 Views, Constraints, Normalization, ACID Properties

📌 Optimization
→ Explain Plan, Cost, Cardinality, Logical & Relational Sets

This is everything you need to go from zero to SQL pro. 💪

💾 Save this post; you'll need it.
🔁 Repost to help someone learning SQL right now.
👉 Follow me for more handcrafted roadmaps like this!

Which topic from this roadmap do YOU find hardest? Comment below 👇

#SQL #DataAnalytics #DataEngineering #DataScience #Programming #OracleDatabase #Tech #CareerGrowth #Data #100DaysOfCode
Most analytical queries don't need a cluster. They need a fast engine on the data that's already there.

DuckDB is an in-process OLAP engine: no server, no cluster, no infrastructure. It runs inside your Python script, your notebook, or your CLI. Think SQLite, but built for analytics instead of transactions.

Why it's changing the game:
→ Reads Parquet natively: query Parquet files directly without loading them into a database. No ETL step. Just SQL on files.
→ Columnar engine: vectorized execution optimized for analytical workloads. Scans only the columns your query touches, the same physics we covered in Episode 21.
→ Zero infrastructure: no server to manage, no cluster to provision. pip install duckdb and you're running analytical SQL in seconds.
→ Embeds everywhere: Python, R, Node.js, Java, WASM. Runs in notebooks, CI/CD pipelines, dbt projects, and local development.

Where it fits:
→ Local development: test SQL transforms before deploying to Spark or Snowflake
→ Data quality checks: validate Bronze and Silver layers without spinning up cloud compute
→ Ad-hoc analysis: query Parquet on a laptop without waiting for infrastructure
→ CI/CD testing: validate pipeline logic in-process, no external dependencies

Where it doesn't:
→ Multi-terabyte distributed workloads: DuckDB is single-node
→ Concurrent multi-user serving: it's an embedded engine, not a warehouse
→ Production serving layers with enterprise SLAs

DuckDB isn't replacing Spark or your warehouse. It removes the friction before you need them.

Where in your workflow are you still paying a cluster tax for a single-node problem?

#DataEngineering #DuckDB #DataArchitecture
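A minimal sketch of "just SQL on files" in DuckDB; the Parquet path and column names are hypothetical.

-- Run in the DuckDB CLI or via duckdb.sql(...) in Python: query Parquet directly, no load step.
SELECT customer_region,
       COUNT(*)         AS orders,
       SUM(order_total) AS revenue
FROM read_parquet('data/orders_*.parquet')
GROUP BY customer_region
ORDER BY revenue DESC;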
"Normalization" is how you destroy performance, slowly and confidently.

Everyone learns it. Everyone repeats it. Almost no one questions it.

You take a schema. You split it into clean, beautiful tables. No duplication. Perfect structure. Feels right.

Until production happens.
💥 Queries get slower.
💥 Joins start stacking.
💥 Indexes multiply like weeds.
💥 The optimizer hesitates.

And suddenly… your "perfect" design becomes your bottleneck.

Here's the uncomfortable truth: normalization optimizes for data integrity, not for read performance. And in real systems? Reads dominate.

I've seen schemas where:
- A single request needed 6–8 joins
- Simple features required complex query plans
- Indexes became a full-time maintenance job

All in the name of "best practices."

⚠️ The mistake is not normalization. The mistake is blind normalization, applied without:
- understanding access patterns
- measuring real workloads
- thinking about latency

Real-world databases are not textbooks. Sometimes the right move is:
- denormalize
- duplicate data
- trade purity for speed

Yes, intentionally. If your query needs 7 joins… your normalization has already failed.

The real skill? Knowing when to break the rules.

#SQL #MySQL #DatabaseAdministration #DatabaseOptimization #DataEngineering
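One hedged illustration of the trade-off, using a hypothetical orders/customers/products schema (none of these names come from the post):

-- Fully normalized read path: even a simple listing pays for three joins.
SELECT o.order_id, c.customer_name, p.product_name, oi.quantity
FROM orders o
JOIN customers   c  ON c.customer_id = o.customer_id
JOIN order_items oi ON oi.order_id   = o.order_id
JOIN products    p  ON p.product_id  = oi.product_id
WHERE o.order_id = 42;

-- Intentional denormalization: duplicate the display columns into a read-optimized table
-- (kept fresh by your pipeline), so the hot path becomes a single-table lookup.
CREATE TABLE order_lines_wide AS
SELECT o.order_id, c.customer_name, p.product_name, oi.quantity
FROM orders o
JOIN customers   c  ON c.customer_id = o.customer_id
JOIN order_items oi ON oi.order_id   = o.order_id
JOIN products    p  ON p.product_id  = oi.product_id;

SELECT customer_name, product_name, quantity
FROM order_lines_wide
WHERE order_id = 42;

The cost, of course, is that duplicated columns can drift; this only pays off when you have measured the read workload and own the refresh logic.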
🚀 How SQL Queries Are Processed Internally: From Query to Execution

Ever wondered what happens behind the scenes when you run:
SELECT column FROM schema_name.table_name;

Here's the simplified journey 👇

Level 1: Parser
• Validates SQL syntax
• Builds the Syntax Tree / Parse Tree

Level 2: Compiler / Optimizer
• Creates the Logical Plan
• Performs Binding (resolves tables/columns)
• Applies Optimization + Planning
• Generates the Physical Execution Plan

Level 3: Executor
• Executes the physical plan
• Returns the final result set

📌 SQL Execution Flow:
SQL Query → Parser → Syntax Tree → Logical Plan → Binder → Optimizer → Physical Plan → Execution Engine → Results

Understanding this pipeline helps data engineers and developers write better, faster, and more optimized SQL.

#SQL #DataEngineering #DatabaseInternals #QueryOptimization #BigData #Analytics #TechLearning
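Most engines will show you the plan the optimizer chose. A hedged example reusing the post's placeholder query: the EXPLAIN keyword works in PostgreSQL, MySQL, and DuckDB, while SQL Server exposes the same information through SHOWPLAN / "Display Estimated Execution Plan".

-- Ask the engine for the plan it would execute, without running the query.
EXPLAIN
SELECT column FROM schema_name.table_name;

Reading the plan tells you whether the executor will scan, seek, hash-join, or sort, which is where most tuning work starts.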
You probably don't need a separate vector database.

If you're already running #PostgreSQL, you can add AI-powered semantic search with one extension: pgvector.

Here's how it works in 5 steps:
- Enable the extension → CREATE EXTENSION vector;
- Add a VECTOR column to your existing tables
- Store embeddings from any ML model (OpenAI, Hugging Face, etc.)
- Query by meaning using similarity operators in plain SQL
- Add an HNSW index for fast nearest-neighbor search at scale

No new infrastructure. No sync jobs. No new vendor. Just SQL and vectors.

The best part? You can combine vector search with everything PostgreSQL already does (joins, filters, transactions, full-text search) in a single query.

Use cases teams are building right now:
→ Smart product search that understands intent, not just keywords
→ FAQ chatbots that match questions by meaning
→ Content recommendation engines

Vibhor Kumar and Marc Linster wrote a great step-by-step walkthrough covering all of this, from setup to production use cases.

Full article here - https://lnkd.in/gBKF2APe

Follow our Substack page for more such how-to tutorials straight to your inbox - Data Engineering Byte
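A minimal sketch of those five steps in one script; the docs table and the toy 3-dimensional vectors are illustrative only (real embedding models emit hundreds or thousands of dimensions).

-- Enable the extension.
CREATE EXTENSION IF NOT EXISTS vector;

-- Add a vector column (the dimension must match your embedding model).
CREATE TABLE docs (id bigserial PRIMARY KEY, title text, embedding vector(3));

-- Store embeddings produced by your model (OpenAI, Hugging Face, etc.).
INSERT INTO docs (title, embedding) VALUES ('refund policy', '[0.9, 0.1, 0.0]');

-- HNSW index for fast approximate nearest-neighbor search at scale.
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

-- Query by meaning: order by cosine distance to the query embedding,
-- combined with an ordinary SQL filter in the same statement.
SELECT id, title
FROM docs
WHERE title IS NOT NULL
ORDER BY embedding <=> '[0.8, 0.2, 0.1]'
LIMIT 5;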
Your database is not what the SQL says. It's what the relationships do.

I met a backend lead with 47 tables. No diagram. New hires took three weeks to understand it.

I asked how they learn. "Read the CREATE TABLE files." Line by line. Then guess how things connect.

One guess went wrong. A new dev misunderstood a foreign key. Dropped a production table. Six hours to recover. That's not a junior mistake. That's missing visibility.

SQL shows structure. Columns. Types. Constraints. What it hides is everything that matters: which table depends on which, what breaks if you remove something, where data actually flows.

So people build a mental map. And every mental map is slightly different. That's where errors creep in.

We switched to one view. Entities. Attributes. Relationships. All visible at once. You don't read it. You scan it.

Users connect to orders. Orders connect to products. Products connect to inventory. You see the shape of the system, not just the syntax.

That's the shift. Documentation shouldn't require interpretation. It should remove it. Because when relationships are visible, mistakes get obvious. And obvious mistakes don't make it to production.

If someone new joined your team tomorrow, how long would it take them to truly understand your schema?

#ERDiagram #DatabaseDesign #SQL #EntityRelationship #DataArchitecture #BackendDevelopment #IndianAI #AILineStudio
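If you want to pull the relationship edges out of the catalog before drawing them, one starting point is the information_schema views. This sketch uses the PostgreSQL flavour (other engines expose foreign keys differently, e.g. sys.foreign_keys in SQL Server) and ignores multi-column and cross-schema keys for brevity.

-- List every foreign-key edge: child table/column and the parent table it points at.
SELECT tc.table_name   AS child_table,
       kcu.column_name AS fk_column,
       ccu.table_name  AS parent_table
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = tc.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
ORDER BY child_table;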
Finally! A text-to-SQL solution that actually works 🎯 (100% open-source)

The real reason your text-to-SQL fails in production: it's not the LLM. It's not your prompt engineering. It's schema retrieval 💥

Here's the enterprise trap:
Query: "Which publishers received royalty payments above $5,000?"
Vector search finds → publisher table ✅
Vector search finds → royalty_ledger table ✅
Vector search misses → vendor_agreement table ❌

That's your bridge table. No join path = zero rows returned. Your LLM wrote perfect SQL. Your database returned nothing. 😵‍💫

Why this happens: vector embeddings match semantics, not relationships. They can't see foreign keys or multi-hop join paths across 60+ tables.

The QueryWeaver approach: treat schemas as graphs 🕸️
Instead of embedding docs:
📍 Tables become nodes
🔗 Foreign keys become edges
🛤️ Join paths are discovered by walking the graph

When a query comes in, it traverses the structure and pulls EVERY bridge table needed, including 5-hop chains.

Proof it works: on the BIRD Benchmark's 60-table superhero database, it resolved a complex 5-hop query by automatically chaining through all intermediate tables 🏆

No more "valid SQL, zero results" in production. If you're building text-to-SQL for enterprise databases, schema-as-graph is the missing piece.

Who else has burned hours debugging "correct" SQL with empty results? 👇

check link in comment 🔗

#TextToSQL #OpenSource #DataEngineering #SQL #LLM #AIAgents #GraphDatabase #QueryWeaver #EnterpriseAI #BigData #Analytics #DeveloperTools #GenAI #Database #2026Tech
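For context, the SQL the model needs to produce only works if the retriever surfaces the bridge table. A hedged sketch built from the post's example table names; the join columns are invented for illustration.

-- Without vendor_agreement in the retrieved schema there is no join path from
-- publisher to royalty_ledger, so the generated query returns zero rows.
SELECT p.publisher_name,
       SUM(r.amount) AS total_royalties
FROM publisher p
JOIN vendor_agreement va ON va.publisher_id = p.publisher_id
JOIN royalty_ledger   r  ON r.agreement_id  = va.agreement_id
GROUP BY p.publisher_name
HAVING SUM(r.amount) > 5000;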
🚀 Built a high-performance CSV import pipeline handling millions of rows, with real-time progress updates and near-constant memory usage.

Instead of using common approaches like EF Core inserts or DataTable buffering, I designed a streaming-first architecture optimized for both speed and scalability.

🔹 Flow Overview
User uploads CSV → saved to disk → Hangfire enqueues job → background worker processes file → SqlBulkCopy streams directly into SQL Server → SignalR pushes real-time progress to UI

💡 Key design decisions:

1️⃣ Single SqlBulkCopy stream (no parallel workers)
Naive approach: spin up multiple threads to insert faster.
Reality: the SQL Server transaction log is a bottleneck (single writer). 👉 Parallel inserts cause lock contention + wasted threads.
✅ Solution: one connection + TableLock → maximum throughput, minimal overhead.

2️⃣ IDataReader instead of List or DataTable
Naive approach: load the entire file into memory. 👉 Millions of rows = hundreds of MB + GC pressure.
✅ Solution: a custom CsvDataReader (streaming). Reads line by line, no object allocation per row, memory stays ~64KB regardless of file size.

3️⃣ SqlBulkCopy with TableLock
Naive approach: row-level locks. 👉 Massive lock overhead, slow inserts.
✅ Solution: a single exclusive lock. Zero per-row lock cost. Combined with SIMPLE/BULK_LOGGED recovery → ~10x less logging.

4️⃣ Hangfire for background processing
Naive approach: process inside the HTTP request. 👉 Timeouts + poor UX.
✅ Solution: enqueue a job. Instant response to the user, retry + persistence, safe cancel support.

5️⃣ Fire-and-forget SignalR updates
Naive approach: await inside the SqlRowsCopied event. 👉 Blocks the bulk-insert thread.
✅ Solution: async push without blocking. UI updates every 10k rows, no impact on throughput.

6️⃣ Custom CsvDataReader over libraries
Naive approach: use CsvHelper → object mapping. 👉 Allocation per row → GC overhead.
✅ Solution: a lightweight parser. Only parse what SQL needs, zero intermediate object graph.

📊 Results:
Constant ~64KB RAM usage
Millions of rows imported in minutes
Near-maximum SQL Server throughput
Real-time progress without slowing down the insert

👉 Lesson learned: the fastest system is not the one that does more in parallel; it's the one that removes unnecessary work.

#DotNet #SQLServer #Performance #Backend #Architecture #Hangfire #SignalR #DataEngineering
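The post's pipeline drives SqlBulkCopy from C#, but the locking and logging behaviour it relies on can be reproduced in plain T-SQL. A hedged server-side sketch; the database name, target table, and file path are hypothetical, and switching the recovery model requires the right permissions and a backup strategy.

-- Reduce transaction-log overhead for the load window (BULK_LOGGED, or SIMPLE on non-production copies).
ALTER DATABASE ImportsDb SET RECOVERY BULK_LOGGED;

-- T-SQL analogue of SqlBulkCopy with the TableLock option: one exclusive table lock,
-- streamed from the file, committed in batches so progress is recoverable.
BULK INSERT dbo.StagingRows
FROM 'C:\imports\rows.csv'
WITH (
    FIRSTROW = 2,           -- skip the header line
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK,                -- table-level lock enables minimally logged inserts
    BATCHSIZE = 100000
);

ALTER DATABASE ImportsDb SET RECOVERY FULL;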