A query was taking 4 minutes in production.

I spent 2 days rewriting the SQL logic. Tried different joins. Tried different filters. Tried breaking it into CTEs. Nothing helped.

Then my senior said: "Did you look at the execution plan?"

I didn't even know what that was. I looked. Found the problem in 30 seconds. Fixed it in 5 minutes. The query now runs in 12 seconds. 👇

────────────────

What is an execution plan?

It is the database's internal roadmap. It shows you EXACTLY what happens when your query runs — step by step.

Which tables it scans. Which indexes it uses. Where it is spending the most time.

Stop guessing. Start reading the plan.

────────────────

How to read it — Snowflake and SQL Server:

----- SNOWFLAKE: use EXPLAIN

EXPLAIN
SELECT c.customer_name, SUM(o.revenue) AS total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_name;

----- SQL SERVER: use SET STATISTICS for reads and timings (for the plan itself, enable "Include Actual Execution Plan" — Ctrl+M in SSMS — before running)

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT c.customer_name, SUM(o.revenue) AS total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_name;

────────────────

3 things to look for first:

→ Full table scan — reading ALL rows when you need 100
Fix: add a clustering key or index on the filter columns

→ Wrong row estimates — the engine guesses row counts badly
Fix: update table statistics so the planner can choose correctly

→ Nested loops on large tables — wrong join algorithm
Fix: ensure join columns are indexed on both sides

────────────────

What I found in my 4-minute query:

A full table scan on a 200-million-row table. No index on order_date.

One clustering key later → 4 minutes became 12 seconds.

I had rewritten SQL for 2 days. The fix was one command. Found in 30 seconds.

────────────────

Read the execution plan before you rewrite the query. Always. Without exception.

What is the slowest query you have ever fixed? Drop the before and after below 👇

#SQL #DataEngineering #QueryOptimization #DataEngineer #SQLServer #Snowflake #DatabasePerformance #Analytics #DataWarehouse #TechTips #OpenToWork #DataCommunity #HiringDataEngineers #100DaysOfSQL
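For the curious: the post never shows its exact "one command," but on Snowflake a clustering-key fix for this pattern would look roughly like the sketch below (table and column names taken from the example query above, so treat them as assumptions):

-- Hedged sketch: cluster the table on the filter column (Snowflake)
ALTER TABLE orders CLUSTER BY (order_date);

-- The SQL Server equivalent one-liner would be an index on the same column
CREATE INDEX IX_orders_order_date ON orders (order_date);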
📊 SQL Concepts I Revisited Recently — and Why They Matter More Than You Think!

As someone working in the tech space, I always knew SQL was important. But recently, when I sat down to truly practice it, I realised how many small but critical mistakes we make without even noticing.

Here are 13 concepts that genuinely changed how I think about writing queries:

1️⃣ NULL is special
Never compare NULL with = — it evaluates to UNKNOWN, so the row silently drops out. Always use IS NULL or IS NOT NULL.

2️⃣ WHERE vs HAVING
WHERE filters rows before grouping. HAVING filters after grouping and can use aggregate functions. Mixing these up is one of the most common SQL mistakes!

3️⃣ GROUP BY logic
Group by the column you want to summarise across — not the unique ID. Grouping by a unique ID defeats the entire purpose!

4️⃣ Finding duplicates
GROUP BY + HAVING COUNT(*) > 1 is your best friend here. Simple but powerful.

5️⃣ Finding missing records
LEFT JOIN from the table you want all records from + IS NULL on the right table's column — this pattern will save you in real investigations!

6️⃣ Aggregate functions belong in HAVING
Never use SUM, COUNT or AVG in WHERE — they only make sense after grouping happens.

7️⃣ Subqueries
When you can't answer something in one step — break it into two. The inner query runs first; the outer query uses its result.

8️⃣ CASE WHEN
SQL's version of IF/ELSE. Cleaner than you think and incredibly useful for handling NULLs and conditional logic.

9️⃣ SELECT with GROUP BY
When using GROUP BY, your SELECT can only contain the grouped columns or aggregate functions. No other individual columns allowed!

🔟 COUNT vs COUNT(DISTINCT)
COUNT(column) counts every non-NULL value, duplicates included. COUNT(DISTINCT column) counts only unique values. Know when to use which!

1️⃣1️⃣ JOIN direction matters
Start your JOIN from the table you want all records from — the direction changes your entire result set!

1️⃣2️⃣ When to use which JOIN
Want all records from one side → LEFT JOIN. Want only matching records → INNER JOIN.

1️⃣3️⃣ Finding missing records using JOINs
LEFT JOIN on the main table + IS NULL on the right table's column = a powerful pattern to find records that don't exist in another table! (See the sketch below.)

SQL is one of those skills that seems simple on the surface but has so many nuances underneath. The best way to learn it? Practice with real-world scenarios, not just textbook examples.

What's the one SQL concept that took you the longest to truly understand? Drop it in the comments! 👇

#SQL #DataAnalysis #TechLearning #FinTech #CareerGrowth #LearningInPublic
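Concepts 4 and 5 in one hedged sketch — the tables customers(id, email) and orders(id, customer_id) are invented for illustration:

-- Find duplicate emails: GROUP BY + HAVING COUNT > 1
SELECT email, COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Find customers with no orders: LEFT JOIN + IS NULL on the right side
SELECT c.id, c.email
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;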
⚠️ DAY 14/15 — SQL TRAP: No Index vs Index

Same query. Same table. Same result.
One takes 120 seconds. The other takes 0.1 seconds.
The difference? One line of code. 👇

🎯 The Situation:
Your table has 1 million employees. You search for one employee by name.
No error. The query runs. The result comes. But it takes 2 whole minutes. 😵
Your dashboard is freezing. Users are complaining. Your manager is asking questions.
All because of one missing piece of setup.

🧠 Here's what's happening:

Without an index — SQL has no idea where your data lives. So it checks every single row. One by one. All 1 million of them.
This is called a FULL TABLE SCAN. Every. Single. Row. Checked. Even when only 1 row matches.

With an index — SQL knows exactly where Alice is. It jumps directly to her row. 1 row scanned. Done. ✅

The Numbers:
No index → rows scanned = 1,000,000, time ≈ 120 seconds ❌
With index → rows scanned = 1, time ≈ 0.1 seconds ✅
1200x faster. Just from one line:

CREATE INDEX idx_name ON employees(name);

💡 Real-Life Example:
Without an index = finding a word in a book by reading every single page from start to finish. 😵
With an index = opening the book's index at the back, finding the word, jumping straight to that page. ✅
Same book. Same word. One way is just exhausting.

📌 Save This Rule:
→ No index = full table scan = checks every row = very slow
→ Index = direct lookup = jumps to the result = very fast
→ Always index columns used in WHERE, JOIN, ORDER BY
→ Too many indexes slow down INSERT and UPDATE — balance is key
→ Use the EXPLAIN command to check if your query is using an index (sketch below)

🔑 One Line to Remember:
No index = reads the entire table. Index = jumps directly to the answer.
On small tables you won't notice. On production data with millions of rows — this is the difference between a fast app and a broken one.

💬 Real Talk:
Most slow dashboards and freezing reports in real companies are not caused by bad queries. They are caused by missing indexes.
This is one of the first things senior engineers check when a query is running slow. 😎

🙋 Have you ever had a query that worked fine in testing but became super slow in production? A missing index was probably the reason! Comment below 👇

Follow for Day 15 tomorrow — the FINAL day of the series! 🚀

#SQL #SQLForBeginners #DataAnalytics #LearnSQL #SQLTips #DataEngineering #InterviewPrep
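Taking the post's own advice, a hedged sketch of the before/after EXPLAIN check (MySQL-style output assumed in the comments; every engine formats its plan differently):

-- Before the index: the plan reports a full scan
EXPLAIN SELECT * FROM employees WHERE name = 'Alice';
-- e.g. MySQL shows type = ALL, rows ≈ 1,000,000 → full table scan

CREATE INDEX idx_name ON employees(name);

-- After the index: the plan shows a direct lookup
EXPLAIN SELECT * FROM employees WHERE name = 'Alice';
-- e.g. MySQL shows type = ref, key = idx_name, rows ≈ 1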
🚀 SQL Server Execution Plan — The Secret Behind Every Slow Query

Most developers try to optimize queries by guessing. But the real answer is always in one place:
👉 The Execution Plan

Let's break it down in a practical way 👇

🔍 Estimated vs Actual Execution Plan

Estimated Execution Plan
- Generated before query execution
- Based on statistics
- Shows what SQL Server thinks will happen

Actual Execution Plan
- Generated after execution
- Shows real data (rows processed, actual cost)
- Helps identify real performance issues

💡 Rule: If your query is slow → always check the Actual Execution Plan.

⚙️ Key Operators You MUST Understand

1. Nested Loops (🔁)
Best for small datasets. Works like: for each row in table A → look up matching rows in table B.
✅ Fast when one table is small and proper indexes exist.
❌ Bad when both inputs are large → becomes very slow.

2. Hash Match (🧠)
Used for large, unsorted data. Creates a hash table in memory.
✅ Good for large joins with no usable indexes.
❌ Downside: high memory usage.

3. Merge Join (🔀)
Requires sorted inputs. Very efficient for large datasets.
✅ Fast when both inputs are already sorted (via an index).
❌ Problem: the sort itself can be expensive.

🤔 Why does SQL choose one over another?
The SQL Server Query Optimizer decides based on: table size, available indexes, data distribution (statistics), and estimated cost.
👉 It picks the lowest estimated cost plan.
But here's the catch: ⚠️ if statistics are outdated, SQL can choose the wrong plan.

🐢 Real Example: Fixing a Slow Query

Problem: a query was taking 8–10 seconds.

SELECT o.OrderId, c.CustomerName
FROM Orders o
JOIN Customers c ON o.CustomerId = c.Id
WHERE c.City = 'Lahore'

🔍 Execution plan analysis:
- Operator used: Hash Match
- Reason: no index on Customers.City — SQL scanned the entire table 😬

✅ Fix — added an index:

CREATE INDEX IX_Customers_City ON Customers(City);

⚡ Result: the execution plan switched to Nested Loops, and query time dropped 👉 10 sec → 200 ms.

💡 Key Takeaways
✔ Don't guess — read the execution plan
✔ Indexes can completely change performance
✔ Hash Match ≠ bad — it depends on the scenario
✔ Always compare Estimated vs Actual

🔥 Pro Tip
Next time your API is slow:
👉 Open the Execution Plan
👉 Look for: table scans, high-cost operators, missing indexes
That's where the real problem is.

💬 If you found this useful, comment "PLAN" and I'll share more real-world SQL optimization tricks.

#SQLServer #DatabasePerformance #QueryOptimization #ExecutionPlan #SQLTips #BackendDevelopment #DotNet #SoftwareEngineering #TechCareers #Developers #Programming #CodeOptimization #APIPerformance #DataEngineering #SQLTuning #PerformanceTuning #LearnSQL #CodingTips #TechCommunity #LinkedInTech
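Since outdated statistics are called out above as the cause of wrong plan choices, a hedged T-SQL sketch of the usual remedies (table name reused from the example; run with care on large production tables):

-- Rebuild statistics for one table; FULLSCAN reads every row for maximum accuracy
UPDATE STATISTICS Customers WITH FULLSCAN;

-- Or refresh statistics across the whole database
EXEC sp_updatestats;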
SQL Mastery Roadmap — the complete landscape, from zero to advanced 👇

SQL isn't just a query language. It's the foundation of every data role. Whether you're just starting or levelling up, this is the full landscape you need to master.

1️⃣ SQL Foundations
Relational databases, tables, primary & foreign keys, constraints, SQL syntax.

2️⃣ Core SQL Operations
SELECT, WHERE, ORDER BY, LIMIT. Filtering with AND/OR/NOT, LIKE, BETWEEN. INSERT, UPDATE, DELETE.

3️⃣ Joins & Relationships
INNER, LEFT, RIGHT, FULL, CROSS, SELF JOIN. Aliases and cardinality.

4️⃣ Aggregations & Grouping
GROUP BY, HAVING, COUNT/SUM/AVG/MIN/MAX, DISTINCT, ROLLUP & CUBE.

5️⃣ SQL Functions
String, date & time, and number functions — CONCAT, DATE_ADD, ROUND and more.

6️⃣ Subqueries & Advanced Queries
Subqueries in SELECT/WHERE/FROM, correlated subqueries, EXISTS vs IN, CTEs, recursive CTEs.

7️⃣ Database Design
Normalization (1NF–3NF), ERDs, foreign keys, referential integrity, schema design patterns.

8️⃣ Indexing & Query Optimization
Clustered vs non-clustered indexes, EXPLAIN, reducing query cost, avoiding full table scans.

9️⃣ Transactions & Concurrency
ACID properties, COMMIT, ROLLBACK, SAVEPOINT, isolation levels, deadlocks.

🔟 Stored Procedures & Functions
CREATE PROCEDURE, triggers, parameters, automation and validation use cases.

1️⃣1️⃣ SQL for Analytics
Window functions, PARTITION BY, ROW_NUMBER/RANK/DENSE_RANK, LAG & LEAD, pivoting. (Sketch below.)

SQL mastery isn't a destination — it's a progression. Start at 1. Work to 11. Build systems that last.

Bookmark this roadmap 🔖

Which level are you at right now? Drop it below 👇

♻️ Repost to help others grow
🔔 Follow Abhisek Sahu for more

♻️ I share cloud, data analysis/data engineering tips, real-world project breakdowns, and interview insights through my free newsletter.
🤝 Subscribe for free here → https://lnkd.in/ebGPbru9
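Level 1️⃣1️⃣ is where most learners stall, so here is a hedged sketch of what window functions look like in practice (the sales table and its columns are invented for illustration):

-- Rank each sale within its region, and compare each sale to the previous one
SELECT
    region,
    sale_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region,
    LAG(amount)  OVER (PARTITION BY region ORDER BY sale_date)   AS previous_amount
FROM sales;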
This roadmap lays out a structured path that many learners, including myself, truly need — starting from fundamentals, moving through aggregations and joins, and gradually stepping into advanced concepts like window functions and optimization. It reinforces the idea that mastering SQL is less about memorizing syntax and more about developing the right way of thinking.

As someone working towards building a strong foundation in Data Analytics and BI, this serves as a great reference point to stay consistent and focused. Instead of jumping between topics, it reminds me to follow a step-by-step approach and build depth in each area.

A sincere thanks to Abhisek Sahu for putting together and sharing such a valuable guide. Content like this not only simplifies learning but also motivates learners to stay on track. Looking forward to applying these concepts in practice and continuously improving my SQL skills.

#SQL #DataAnalytics #BusinessIntelligence #LearningJourney #CareerGrowth #DataDriven
I ran into something interesting the other day that reminded me why understanding your database engine matters more than just knowing SQL. 🤔

I had a query running perfectly fine on MS SQL Server. The same query, moved to Databricks SQL — boom, error.

Here's a scenario many data professionals might have faced, and knowing this one difference might save you hours of debugging: you write a query with SELECT DISTINCT and ORDER BY a column that isn't in your SELECT list. It runs fine on MS SQL Server, but it's a big 'NO' for Databricks.

MS SQL Server code example:

SELECT DISTINCT AccountNo, CAST(MODIFY_TIME AS DATE) AS TransactionDate, Amount
FROM TransactionTable
ORDER BY MODIFY_TIME

Databricks code example (incorrect version):

SELECT DISTINCT AccountNo, DATE(MODIFY_TIME) AS TransactionDate, Amount
FROM TransactionTable
ORDER BY MODIFY_TIME

Why?

MS SQL Server → Permissive
MS SQL Server allows you to ORDER BY a column that is not in the SELECT list even when using SELECT DISTINCT, as long as it can resolve the column unambiguously. It internally handles the sorting before applying DISTINCT, so it doesn't complain. In other words, MS SQL Server processes the query like this internally:

1. FROM & JOINs
2. WHERE filtering
3. ORDER BY MODIFY_TIME ← applied before DISTINCT
4. SELECT DISTINCT columns

So it sorts first, then removes duplicates, and doesn't require the ORDER BY column to be in the SELECT list.

Databricks SQL → Strict
Databricks follows the stricter ANSI SQL standard. According to ANSI SQL rules: when SELECT DISTINCT is used, all columns in ORDER BY must appear in the SELECT list. This is because Databricks/Spark SQL processes things differently:

1. FROM & JOINs
2. WHERE filtering
3. SELECT DISTINCT columns ← applied first
4. ORDER BY ← must work with the already-distinct result set

Since MODIFY_TIME (the raw TIMESTAMP) was never selected, Databricks doesn't know how to sort by it after DISTINCT is applied — hence the error.

This isn't a bug, but a fundamental difference in how each engine interprets the SQL standard.

The safest fixes that work everywhere:
→ Use the column alias: ORDER BY TransactionDate
→ Use the expression: ORDER BY DATE(MODIFY_TIME) / ORDER BY CAST(MODIFY_TIME AS DATE)
→ Use the position: ORDER BY 2

Any of these alternatives works across BOTH platforms because they reference what's actually in the SELECT list. (Full rewrite below.)

The lesson? Write SQL that's portable, especially if you're working in multi-platform data environments.

Have you ever been caught off guard by a database engine behaving differently than expected? Drop it in the comments 👇

#Databricks #SQLServer #SparkSQL #DataEngineering #CloudMigration #SQLTips #Analytics #DataArchitecture
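Spelled out, the portable rewrite using the alias fix — this version should run unchanged on both engines because ORDER BY references only what the SELECT list produces:

SELECT DISTINCT AccountNo, CAST(MODIFY_TIME AS DATE) AS TransactionDate, Amount
FROM TransactionTable
ORDER BY TransactionDate;  -- alias from the SELECT list, valid under strict ANSI rules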
🧠 Ever wondered how SQL Server decides HOW to run your query?

You write:

SELECT * FROM Orders WHERE CustomerId = 10

But SQL Server has multiple ways to execute it:
👉 Index Seek
👉 Table Scan
👉 Different join strategies

So how does it choose? Welcome to the world of the SQL Server Query Optimizer 👇

⚙️ What is the Query Optimizer?
👉 It's the brain of SQL Server. Its job: find the most efficient way to execute your query.
💡 Not always the fastest plan — 👉 the lowest estimated cost plan.

🔍 What Happens Under the Hood?
When you run a query, SQL Server goes through these steps:

1. Parsing
👉 Checks syntax & converts the query into a logical tree.

2. Optimization (the magic happens here ✨)
👉 SQL generates multiple candidate execution plans and evaluates: table size, indexes, statistics, join methods.
👉 Then it assigns a cost to each plan.

3. Plan Selection
👉 Picks the plan with the lowest estimated cost.

4. Execution
👉 Runs the chosen plan.
💡 This is what you see in the Execution Plan.

🤔 Why does SQL sometimes make bad decisions?
Because it depends on statistics. If stats are outdated or inaccurate, SQL may choose:
❌ a Table Scan instead of an Index Seek
❌ the wrong join type

⚡ Real Example

Query:

SELECT * FROM Orders WHERE CustomerId = 10

🔍 Scenario 1: An index exists and few rows match.
👉 Plan: ✔ Index Seek (fast)

🔍 Scenario 2: No index, large table.
👉 Plan: ❌ Table Scan (slow)

🔥 Scenario 3 (interesting): An index exists, but SQL thinks many rows will match.
👉 Plan: ❌ Table Scan (even though the index exists 😬)

💡 Key Concepts You Should Know
👉 Cost-based optimization
👉 Cardinality estimation (row-count prediction)
👉 Plan caching (execution plan reuse)
💡 These directly affect performance.

🐢 Real-World Insight
Ever seen this? 👉 A query is fast sometimes, slow other times.
💡 Reason: a cached execution plan, different parameters, wrong estimates. (See the sketch below.)

🔥 Pro Tips
✔ Keep statistics updated
✔ Create proper indexes
✔ Always check the execution plan
✔ Don't blindly trust SQL — it can be wrong

💡 Final Thought
SQL Server is powerful… but it's not magic.
👉 It makes decisions based on the data it knows. If that data is wrong… 👉 your performance will be too.

💬 Comment "OPTIMIZER" and I'll break down execution plan operators next.

#SQLServer #QueryOptimizer #DatabasePerformance #ExecutionPlan #SQLTips #BackendDevelopment #DotNet #SoftwareEngineering #PerformanceTuning #DataEngineering #Developers #Programming #SystemDesign #CodingTips #TechCareers #TechCommunity #LearnSQL
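For the "fast sometimes, slow other times" case above, a hedged T-SQL sketch of two common remedies (shown as a fragment inside a parameterized batch or stored procedure; whether they help depends on your data):

-- Option 1: force a fresh plan per execution instead of reusing a cached one
SELECT * FROM Orders WHERE CustomerId = @CustomerId
OPTION (RECOMPILE);

-- Option 2: refresh statistics so cardinality estimates reflect current data
UPDATE STATISTICS Orders;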
I designed my DynamoDB table completely wrong.

The mistake: thinking like a SQL database.

What I did wrong — created multiple tables:
→ Users table
→ Orders table
→ Products table
→ Inventory table

To get a user's orders with product details:
→ Query the Users table
→ Query the Orders table
→ Query the Products table (for each order)
→ Query the Inventory table

Result: 4+ database calls, 300ms response time.

What I learned: single-table design.

The fix: one table. All entities. Different access patterns.

Structure:

PK (Partition Key) | SK (Sort Key) | Attributes
USER#123           | PROFILE       | {name, email, created}
USER#123           | ORDER#456     | {total, status, date}
USER#123           | ORDER#789     | {total, status, date}
ORDER#456          | METADATA      | {items, shipping, payment}
ORDER#456          | ITEM#001      | {product, quantity, price}
PRODUCT#001        | METADATA      | {name, description, stock}

Access patterns now:

Get user profile:
→ PK = USER#123, SK = PROFILE
→ 1 query, 15ms

Get user's orders:
→ PK = USER#123, SK begins_with ORDER#
→ 1 query, 22ms

Get order with items:
→ PK = ORDER#456
→ 1 query, 18ms (gets all items)

Total: 55ms vs 300ms.

How I think about it now:
Partition key = "What am I looking up?"
Sort key = "How do I organize related data?"

Example patterns:

E-commerce:
PK: USER#id, SK: ORDER#timestamp
PK: PRODUCT#id, SK: REVIEW#timestamp

Social media:
PK: USER#id, SK: POST#timestamp
PK: POST#id, SK: COMMENT#timestamp

Multi-tenancy:
PK: TENANT#id, SK: USER#id
PK: TENANT#id, SK: RESOURCE#type#id

GSI for different queries:
GSI1:
→ PK: STATUS (e.g., "PENDING")
→ SK: CREATED_AT
→ Query: all pending orders sorted by date

Performance comparison:

Before (multiple tables):
→ Queries per request: 4+
→ Response time: 300ms
→ Cost: 4 read units

After (single table):
→ Queries per request: 1
→ Response time: 50ms
→ Cost: 1 read unit

The hardest part: unlearning SQL thinking.

SQL: normalize everything, join at query time.
DynamoDB: denormalize, pre-join at write time.

SQL: design tables first, queries later.
DynamoDB: design queries first, table later.

It took me 3 failed attempts to get this right. But once it clicked:
→ Faster queries
→ Lower costs
→ Simpler code

DynamoDB veterans: what took you longest to learn?

#DynamoDB #AWS #Database #NoSQL #CloudEngineering #SingleTableDesign
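For readers who still think in SQL, DynamoDB's PartiQL support can express the same access patterns in a SQL-flavored syntax — a hedged sketch assuming a table named AppTable with generic PK/SK attributes (these stay efficient Query operations only because the partition key is constrained):

-- Get user profile
SELECT * FROM "AppTable" WHERE PK = 'USER#123' AND SK = 'PROFILE';

-- Get a user's orders (sort-key prefix match)
SELECT * FROM "AppTable" WHERE PK = 'USER#123' AND begins_with(SK, 'ORDER#');

-- Get an order with all its items (whole partition)
SELECT * FROM "AppTable" WHERE PK = 'ORDER#456';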
The DuckLake Spec: Metadata in SQL Tables

Other lakehouse formats put metadata in files like JSON or Avro, and you need special libraries to read it. DuckLake puts metadata in 28 SQL tables in your catalog database.

The core tables:

ducklake_snapshot and ducklake_snapshot_changes: every write creates a snapshot. These track the snapshot ID, timestamp, schema version, what changed, who changed it, and why.

ducklake_schema, ducklake_table, ducklake_column: your schema definitions. Column types, defaults, nullability, nesting.

ducklake_data_file and ducklake_delete_file: the file registry. Every Parquet file has an entry with its path, record count, file size, row ID range, and encryption key if you use one. Delete files reference a data file and mark which rows were removed.

ducklake_file_column_stats and ducklake_table_column_stats: per-column min, max, null count, and value count at file and table level — this powers file pruning during queries.

Other supporting tables handle partition info, sort orders, macros, tags, column mappings, variant stats, and inlined data staging.

Most tables use a temporal validity pattern with begin_snapshot and end_snapshot columns. A row is valid when the current snapshot falls between those values. This is how DuckLake tracks the full history of your data without rewriting anything. You can imagine how this can be used for something like time travel or schema evolution.

The practical upside of SQL metadata: you query it the same way you query your data. Want to know which files belong to a table? SELECT from ducklake_data_file. Want to see how a column's type changed over time? SELECT from ducklake_column with snapshot filters. Want to find files with no stats? Join the stats table. All plain SQL against your catalog database. (Sketch below.)

If you've ever tried to parse Iceberg manifest files or Delta Lake transaction logs by hand, you'll appreciate the simplicity.

The Data Types:

DuckLake supports primitive types (booleans, ints, floats, varchar, etc.), nested types (list, struct, map), variant and geospatial data (linestrings, polygons, collections). Column types are stored as strings in ducklake_column.column_type without the catalog DB needing to understand the types natively (except when inlining, which I'll get into later).

#ducklake #duckdb #dataengineering #lakehouse
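A hedged sketch of what "plain SQL against your catalog" looks like — the column names below (table_id, path, record_count, file_size_bytes, end_snapshot) are my reading of the spec's general shape, so treat them as assumptions and verify against the spec before relying on them:

-- Which Parquet files currently back table 42?
SELECT path, record_count, file_size_bytes
FROM ducklake_data_file
WHERE table_id = 42
  AND end_snapshot IS NULL;  -- NULL end_snapshot assumed to mean "still valid"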