SQL Server Notes by AB | Note #12 | Histogram In Action | Original Draft Date: 19 Jan, 2022 | Re-posted on 27 Apr, 2026 | #SQLServerWithAmitBansal

In one of my previous notes, I talked about the histogram, one of the most critical parts of the statistics object. The histogram, as the name suggests, is a bucketing technique describing how the column data is distributed: the highest value of each bucket (which defines a boundary), how many rows fall between two boundary values, how many rows equal a specific boundary value, how many distinct values exist within a range, and how many rows there are on average per distinct value. All of this helps the optimizer make the right estimates.

Here is a sample histogram on the TotalDue column of the SalesOrderHeader table (not all columns shown, for brevity):

```
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS
============  ==========  =======
26.2769       40          142
30.1444       17          202
...           (more rows)
```

Now, let's write a query:

```sql
SELECT * FROM Sales.SalesOrderHeader WHERE TotalDue = 30.1444;
```

The above query returns 202 rows. If you check the cardinality estimation in the execution plan, you will observe Estimated Number of Rows = 202 and Actual Number of Rows = 202. This is the best-case scenario for the optimizer: the estimate and the actual matched 100%.

How did this happen? This is called a Histogram Step Hit. The predicate value in the WHERE clause has a step representation in the histogram, so the optimizer makes a perfect estimate using EQ_ROWS (the number of rows equal to the step value), which is 202.

The above is just a quick look at one of the ways the optimizer leverages the histogram.

Demo URL: https://lnkd.in/g5-ef5gq. Want to read more SQL notes like this one? Here: https://bit.ly/ABSQLNotes. Looking for deep-dive content on SQL Server Performance Tuning? Get lifetime access to master class recordings. Check this: https://lnkd.in/d-JrAG78
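To inspect such a histogram yourself, you can use DBCC SHOW_STATISTICS or, on newer versions, the histogram DMF. A sketch (the statistics name below is an assumption; yours will depend on which index or auto-created statistics object covers TotalDue):

```sql
-- Show only the histogram steps of a statistics object.
-- IX_SalesOrderHeader_TotalDue is an illustrative name; auto-created
-- stats have generated names like _WA_Sys_...
DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader', IX_SalesOrderHeader_TotalDue)
WITH HISTOGRAM;

-- On SQL Server 2016 SP1 CU2 and later, the histogram is also queryable:
SELECT s.name AS stats_name, h.step_number, h.range_high_key,
       h.range_rows, h.equal_rows
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID('Sales.SalesOrderHeader');
```

The RANGE_HI_KEY, RANGE_ROWS, and EQ_ROWS columns in the output correspond directly to the sample table above.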
🧠 Ever wondered how SQL Server decides HOW to run your query?

You write:

```sql
SELECT * FROM Orders WHERE CustomerId = 10;
```

But SQL Server has multiple ways to execute it:
👉 Index Seek
👉 Table Scan
👉 Different join strategies

So how does it choose? Welcome to the world of the SQL Server Query Optimizer 👇

⚙️ What is the Query Optimizer?
👉 It's the brain of SQL Server. Its job: find the most efficient way to execute your query.
💡 Not always the fastest plan... 👉 the plan with the lowest estimated cost.

🔍 What Happens Under the Hood?
When you run a query, SQL Server goes through these steps:
1. Parsing 👉 checks syntax and converts the query into a logical tree.
2. Optimization (the magic happens here ✨) 👉 SQL Server generates multiple execution plans, evaluating table size, indexes, statistics, and join methods, then assigns a cost to each plan.
3. Plan Selection 👉 picks the plan with the lowest estimated cost.
4. Execution 👉 runs the chosen plan.
💡 This is what you see in the Execution Plan.

🤔 Why Does SQL Server Sometimes Make Bad Decisions?
Because it depends on statistics. If stats are outdated or inaccurate, SQL Server may choose:
❌ a Table Scan instead of an Index Seek
❌ the wrong join type

⚡ Real Example
Query:

```sql
SELECT * FROM Orders WHERE CustomerId = 10;
```

🔍 Scenario 1: Index exists, few matching rows 👉 Plan: ✔ Index Seek (fast)
🔍 Scenario 2: No index, large table 👉 Plan: ❌ Table Scan (slow)
🔥 Scenario 3 (interesting): Index exists, but SQL Server estimates many rows will match 👉 Plan: ❌ Table Scan (even though the index exists 😬)

💡 Key Concepts You Should Know
👉 Cost-based optimization
👉 Cardinality estimation (row prediction)
👉 Plan caching (reuse of execution plans)
💡 These directly affect performance.

🐢 Real-World Insight
Ever seen this? 👉 The same query is fast sometimes and slow other times.
💡 Reason: a cached execution plan, different parameter values, wrong estimation.

🔥 Pro Tips
✔ Keep statistics updated
✔ Create proper indexes
✔ Always check the execution plan
✔ Don't blindly trust SQL Server; it can be wrong

💡 Final Thought
SQL Server is powerful... but it's not magic.
👉 It makes decisions based on the data it knows. If that data is wrong... 👉 your performance will be too.

💬 Comment "OPTIMIZER" and I'll break down execution plan operators next.

#SQLServer #QueryOptimizer #DatabasePerformance #ExecutionPlan #SQLTips #BackendDevelopment #DotNet #SoftwareEngineering #PerformanceTuning #DataEngineering #Developers #Programming #SystemDesign #CodingTips #TechCareers #TechCommunity #LearnSQL
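As a concrete follow-up to the "keep statistics updated" tip, here is a sketch for checking how fresh the statistics on a table are (the table name dbo.Orders is illustrative):

```sql
-- When were statistics last updated, and how many rows have changed since?
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.Orders');

-- Refresh them if the modification counter looks high relative to row count:
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;
```

A high modification_counter relative to rows is a common sign that cardinality estimates are drifting from reality.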
🔥 𝗗𝗮𝘆 34 – 𝗦𝗤𝗟 𝗦𝗲𝗿𝘃𝗲𝗿 𝗝𝗼𝗶𝗻𝘀 (𝗣𝗮𝗿𝘁 2)

SQL Joins are one of the most important concepts in SQL Server. They are used to combine data from multiple tables using related columns. In real-world projects, data is usually stored in separate tables such as Customers, Products, Transactions, Accounts, Employees, and Sales.

📌 𝗧𝘆𝗽𝗲𝘀 𝗼𝗳 𝗦𝗤𝗟 𝗝𝗼𝗶𝗻𝘀

1️⃣ 𝗜𝗡𝗡𝗘𝗥 𝗝𝗢𝗜𝗡 – Returns only matching records from both tables.

```sql
SELECT * FROM AccountMaster A
INNER JOIN TransactionMaster T ON A.AccountID = T.AccountID;
```

2️⃣ 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 – Returns all rows from the left table and matching rows from the right table.

```sql
SELECT * FROM AccountMaster A
LEFT JOIN TransactionMaster T ON A.AccountID = T.AccountID;
```

3️⃣ 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 – Returns all rows from the right table and matching rows from the left table.

```sql
SELECT * FROM AccountMaster A
RIGHT JOIN TransactionMaster T ON A.AccountID = T.AccountID;
```

4️⃣ 𝗙𝗨𝗟𝗟 𝗝𝗢𝗜𝗡 – Returns matching and unmatched rows from both tables.

```sql
SELECT * FROM AccountMaster A
FULL JOIN TransactionMaster T ON A.AccountID = T.AccountID;
```

5️⃣ 𝗖𝗥𝗢𝗦𝗦 𝗝𝗢𝗜𝗡 – Returns every possible combination of rows.
Formula: Rows in Table A × Rows in Table B. Example: 6 rows × 6 rows = 36 rows.

```sql
SELECT * FROM AccountMaster A
CROSS JOIN TransactionMaster T;
```

💡 𝗜𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗣𝗼𝗶𝗻𝘁𝘀
✔ NULL does not match NULL in a join condition
✔ JOIN means INNER JOIN by default
✔ CROSS JOIN does not need a common column
✔ If joining N tables, we need N - 1 join conditions
✔ Use aliases for clean queries
✔ Qualify with the table alias when the same column exists in multiple tables

⚠️ 𝗔𝗺𝗯𝗶𝗴𝘂𝗼𝘂𝘀 𝗖𝗼𝗹𝘂𝗺𝗻 𝗘𝗿𝗿𝗼𝗿

❌ Wrong:

```sql
SELECT AccountID FROM AccountMaster A
JOIN TransactionMaster T ON A.AccountID = T.AccountID;
```

✅ Correct:

```sql
SELECT A.AccountID FROM AccountMaster A
JOIN TransactionMaster T ON A.AccountID = T.AccountID;
```

🚀 𝗥𝗲𝗮𝗹-𝗧𝗶𝗺𝗲 𝗘𝘅𝗮𝗺𝗽𝗹𝗲 – Customer-wise Transaction Summary

```sql
SELECT A.Name, T.TransactionType,
       COUNT(*) AS NumberOfTransactions,
       SUM(T.TransactionAmount) AS TotalAmount
FROM AccountMaster A
JOIN TransactionMaster T ON A.AccountID = T.AccountID
GROUP BY A.Name, T.TransactionType;
```

🎯 𝗙𝗶𝗻𝗮𝗹 𝗧𝗵𝗼𝘂𝗴𝗵𝘁
SQL Joins are the backbone of: ✔ Reporting ✔ Analytics ✔ ETL ✔ Dashboards ✔ Data Engineering
👉 Master joins and complex SQL becomes much easier.

#SQL #SQLServer #SQLQueries #SQLDeveloper #SQLLearning #LearnSQL #SQLInterview #SQLTips #SQLPractice #TSQL #MicrosoftSQLServer #Database #DatabaseDeveloper #DatabaseManagement #DataAnalytics #DataAnalysis #DataAnalyst #BusinessIntelligence #BI #PowerBI #PowerBIDeveloper #Dashboard #Reporting #ETL #DataEngineering #DataEngineer #DataWarehouse #DataModeling #Joins #InnerJoin #LeftJoin #RightJoin #FullJoin #CrossJoin #TechJobs #Analytics #Coding #Programming #InterviewPreparation #CareerGrowth #LinkedInLearning Bhaskar Jogi Go Online Trainings
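To illustrate the "N tables need N - 1 join conditions" point, here is a sketch joining three tables. CustomerMaster and its CustomerID/CustomerName columns are assumptions invented for the example:

```sql
-- Three tables, two join conditions (N - 1 = 2).
SELECT C.CustomerName, A.AccountID, T.TransactionAmount
FROM CustomerMaster C                                  -- hypothetical table
JOIN AccountMaster A  ON C.CustomerID = A.CustomerID   -- condition 1
JOIN TransactionMaster T ON A.AccountID = T.AccountID; -- condition 2
```

Dropping either ON clause would either raise a syntax error or, with CROSS JOIN semantics, silently multiply the row count.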
🚀 Boost SQL Query Performance with Partitioning

When your tables grow into millions (or billions) of rows, query performance starts to suffer. One powerful technique to address this is **Partitioning**.

🔹 SQL Server Example (Step-by-Step – Orders Table)

```sql
-- 1. Create Partition Function (by year)
CREATE PARTITION FUNCTION pf_orders (DATE)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2025-01-01', '2026-01-01');

-- 2. Create Partition Scheme
CREATE PARTITION SCHEME ps_orders
AS PARTITION pf_orders ALL TO ([PRIMARY]);

-- 3. Create Partitioned Table
CREATE TABLE orders (
    order_id INT IDENTITY(1,1),
    order_date DATE NOT NULL,
    amount DECIMAL(10,2)
) ON ps_orders(order_date);

-- 4. Insert Data
INSERT INTO orders (order_date, amount)
VALUES ('2023-12-15', 400), ('2024-06-10', 500), ('2025-03-15', 800);

-- 5. Query (Partition Elimination)
SELECT * FROM orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31';
```

🔹 Why it's powerful:
✅ Faster queries (partition elimination)
✅ Only relevant data is scanned
✅ Better performance for large tables

🔹 Pro Tip 💡 Always filter with direct date ranges on the partitioning column so the optimizer can eliminate partitions.

Partition smart → Query fast → Scale efficiently 🚀

#SQLServer #SQL #DataEngineering #PerformanceTuning
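To verify where each row landed, a sketch using the pf_orders function defined above:

```sql
-- Map each row to its partition number. With RANGE RIGHT boundaries at
-- 2024-01-01 / 2025-01-01 / 2026-01-01, a 2023 date falls in partition 1,
-- a 2024 date in partition 2, a 2025 date in partition 3.
SELECT order_id, order_date,
       $PARTITION.pf_orders(order_date) AS partition_number
FROM orders;

-- Row counts per partition from the catalog:
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.orders')
  AND index_id IN (0, 1);  -- heap or clustered index only
```

Checking `$PARTITION` output is the quickest way to confirm that the range query in step 5 really touches only one partition.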
This is a terrific technical piece on optimizing SQL Server when using Power BI. But even more importantly, it touches on an important requirements point:

'The honest conversation with stakeholders usually starts with the question: "What's your actual data freshness requirement?" Most teams discover that "real-time" means "within the hour" once they think it through. If that's the case, Import with scheduled refresh covers it, and you skip the entire DirectQuery tuning exercise.'

This is true (and often the requirement is even less frequent than that). But I'll add two other things:

1] Whenever possible, you should model your data. Even if you aren't using a dedicated data warehouse, you can create a SQL Server reporting database, or at least a reporting schema. Shifting your logic "left" will simplify your Power BI queries and help standardize data model logic.

2] Don't just stop with the honest conversation about data freshness. Make sure that you really understand how they are using the dashboard. The minute the phrase "export to Excel" comes up, consider whether you even need a dashboard, or if a different delivery approach makes more sense. (Power BI should not be another ETL tool.)

https://lnkd.in/ee8_rPY2
🧠 SQL Execution Plan — The Secret Behind Fast Queries

Writing a SQL query is easy. Writing a fast SQL query is what makes the real difference in interviews and production systems 👇

Whenever a query is slow, the first thing every developer should check is the Execution Plan.

🔷 What is an Execution Plan?
An Execution Plan shows how SQL Server decides to execute your query. 👉 It tells you:
• Which table SQL Server accesses first
• What type of joins are being used
• Whether it is performing a Scan or a Seek
• Which operation carries the highest cost
• Where the query is spending most of its time
💡 In simple words: it is the roadmap SQL Server follows to fetch your data.

🔷 Why is it Important?
Two queries may return the same result, but one may take ✅ 1 second and the other ❌ 30 seconds. The Execution Plan helps you understand why. It helps in:
• Query optimization
• Finding performance bottlenecks
• Reducing logical reads
• Improving production performance
Without checking the execution plan, optimization becomes guesswork.

🔷 Types of Execution Plans
✅ Estimated Execution Plan → shows what SQL Server plans to do, before execution. SSMS shortcut: Ctrl + L
✅ Actual Execution Plan → shows what SQL Server actually did, after execution. SSMS shortcut: Ctrl + M
💡 The Actual Execution Plan is more useful for performance tuning.

🔷 Common Operators You Should Know
🔸 Table Scan → reads the entire table ❌ slow for large tables
🔸 Index Scan → scans many rows from an index ⚠️ better than a Table Scan
🔸 Index Seek → jumps directly to the required rows ✅ fast and efficient
🔸 Key Lookup → fetches extra columns from the base table ⚠️ too many can slow performance
🔸 Nested Loops / Hash Match / Merge Join → join strategies chosen by SQL Server

🔷 Interview Question
Q: How do you identify why a query is slow?
👉 I first check the Actual Execution Plan, look for scans, key lookups, and expensive joins, then optimize the query accordingly. This shows practical knowledge, not just theory.

💡 Final Thought
Anyone can write SQL queries. But understanding the Execution Plan is what makes you a better developer 🚀

Stay tuned for my next post on how to use indexes according to the Execution Plan in SQL Server 😊

#sqlserver #sql #executionplan #database #performanceoptimization #backenddeveloper #interviewprep #sqldeveloper #queryoptimization #dotnetdeveloper
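Besides the graphical plan in SSMS, the same information is available in text form from a session. A sketch (the query itself is just an illustration):

```sql
-- Measure logical reads and CPU/elapsed time for a statement:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM Sales.SalesOrderHeader WHERE TotalDue > 1000;  -- example query

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- Capture the actual execution plan as XML for the same statement:
SET STATISTICS XML ON;
SELECT * FROM Sales.SalesOrderHeader WHERE TotalDue > 1000;
SET STATISTICS XML OFF;
```

The "logical reads" figure from STATISTICS IO is the number most tuning work revolves around, since it is stable across runs in a way elapsed time is not.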
Index in GROUP BY Clause in SQL Server

In this post, I'll explain how indexes improve performance in a GROUP BY query, and also touch on the concept of a covering query.

SQL Server uses two main algorithms for grouping data:
1. Hash Aggregate – creates a temporary hash table to store grouped results.
2. Sort + Group (Stream Aggregate) – sorts data by the grouping columns, then aggregates sequentially.
Both approaches require intermediate processing, but the Sort + Group algorithm can leverage an index to avoid the sort, improving performance.

🔹 Why an Index Matters in GROUP BY

Consider this query:

```sql
SELECT PS.ProductID, SUM(PS.QuantitySold) AS TotalQuantitySold
FROM ProductSales PS
GROUP BY PS.ProductID;
```

Without an index, SQL Server performs a Table Scan, which is expensive.

🔹 Step 1: Add an Index on the GROUP BY Column

```sql
CREATE NONCLUSTERED INDEX IX_ProductSales_ProductID
ON ProductSales(ProductID);
```

Now SQL Server uses an Index Scan, but it still performs a RID Lookup to fetch QuantitySold, which adds overhead.

🔹 Step 2: Create a Better (Composite) Index

```sql
CREATE NONCLUSTERED INDEX IX_ProductSales_ProductID_QuantitySold
ON ProductSales(ProductID, QuantitySold);
```

Now the query uses this optimized index, eliminating the extra lookups and improving performance.

🔹 What is a Covering Query?

A query is called a covering query when all of its required columns are available in the index itself. Example:

```sql
SELECT ProductID, QuantitySold FROM ProductSales;
```

Since both columns exist in the composite index, SQL Server does not need to access the base table, making the query faster.

✅ Key Takeaways
• An index on the GROUP BY columns improves performance
• Composite indexes reduce lookups
• Covering queries eliminate table access
• SQL Server automatically chooses the best execution plan

#SQLServer #Database #DataEngineering #DataAnalytics #SQL #TSQL #DatabasePerformance #QueryOptimization #Indexing #DataScience #SQLServerTips #SQLPerformance #QueryTuning #ExecutionPlan #DatabaseOptimization #IndexStrategy #CoveringIndex #GroupBy #TechLearning #LearnSQL #ArtificialIntelligence #BigData #CloudComputing #MicrosoftSQLServer #Azure #TechTrends #DigitalTransformation
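An alternative worth sketching: instead of making QuantitySold a key column, add it as an INCLUDE column, which still covers the query while keeping the index key small (the index name is illustrative):

```sql
-- QuantitySold is stored only at the leaf level; the key remains just ProductID.
CREATE NONCLUSTERED INDEX IX_ProductSales_ProductID_incl
ON ProductSales(ProductID)
INCLUDE (QuantitySold);
```

For a GROUP BY on ProductID this is typically just as covering, since QuantitySold is only aggregated, never used for seeking or ordering, and the narrower key reduces write and storage cost.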
Last time, I talked about SQL query clause order. Today, I will talk about the 𝗤𝘂𝗲𝗿𝘆 𝗘𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻 𝗢𝗿𝗱𝗲𝗿.

In SQL, queries are executed in a specific logical order, which can be quite different from the order in which the clauses are written. Here's the logical order of SQL query execution:

1. FROM – Specifies the tables from which to retrieve or manipulate data.
2. WHERE – Filters rows based on specified conditions. Only rows that meet the conditions proceed to the next stage.
3. GROUP BY – Groups rows into sets based on the specified column(s). Aggregate functions (like SUM, COUNT, etc.) now apply to each group.
4. HAVING – Applies filters to the groups created by GROUP BY. Only groups meeting these conditions move forward. (Because HAVING runs before SELECT, most databases, including SQL Server, require you to repeat the aggregate expression here rather than reference a SELECT alias.)
5. SELECT – Determines which columns and expressions to return, and evaluates any functions or expressions listed. Deduplication of rows (DISTINCT) happens here if specified.
6. ORDER BY – Sorts the result based on the specified column(s) and sort direction (ASC or DESC). It does not change which rows are selected, only the display order.
7. LIMIT – Restricts the number of rows returned. Useful for pagination or getting a specific subset of rows. (SQL Server expresses this with TOP or OFFSET ... FETCH instead of LIMIT.)

Let's take an example with this simple SQL query:

```sql
SELECT department, COUNT(employee_id) AS total_employees
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(employee_id) > 5
ORDER BY total_employees DESC
LIMIT 10;
```

This query executes in the following order:

1. FROM employees
2. WHERE status = 'active'
3. GROUP BY department
4. HAVING COUNT(employee_id) > 5
5. SELECT department, COUNT(employee_id) AS total_employees
6. ORDER BY total_employees DESC
7. LIMIT 10

#TipsOnSQL #DataAnalysis
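Since this page is about SQL Server, here is a sketch of the same query in T-SQL, where the LIMIT step is expressed with OFFSET ... FETCH (which requires an ORDER BY):

```sql
SELECT department, COUNT(employee_id) AS total_employees
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(employee_id) > 5
ORDER BY total_employees DESC
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;  -- T-SQL replacement for LIMIT 10
```

`SELECT TOP (10) ...` works too, but OFFSET ... FETCH generalizes to pagination (e.g., OFFSET 10 ROWS for page two).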
💬 SQL Challenge of the Day

Problem: You are given a table named "transactions" with the following columns: transaction_id, user_id, timestamp, and amount. Write a SQL query to calculate the running total of the amount for each user, ordered by the timestamp in ascending order.

Query:

```sql
SELECT transaction_id, user_id, timestamp, amount,
       SUM(amount) OVER (PARTITION BY user_id ORDER BY timestamp) AS running_total
FROM transactions;
```

Explanation: We use a window function with the PARTITION BY clause to compute a separate running total per user. SUM(amount) with ORDER BY timestamp inside the OVER clause accumulates the amount in timestamp order within each user's partition.

Example: Consider the "transactions" table:

| transaction_id | user_id | timestamp           | amount |
|----------------|---------|---------------------|--------|
| 1              | A       | 2022-01-01 08:00:00 | 100    |
| 2              | B       | 2022-01-01 09:00:00 | 150    |
| 3              | A       | 2022-01-01 10:00:00 | 50     |
| 4              | A       | 2022-01-01 11:00:00 | 200    |

The query will produce (shown here grouped by user):

| transaction_id | user_id | timestamp           | amount | running_total |
|----------------|---------|---------------------|--------|---------------|
| 1              | A       | 2022-01-01 08:00:00 | 100    | 100           |
| 3              | A       | 2022-01-01 10:00:00 | 50     | 150           |
| 4              | A       | 2022-01-01 11:00:00 | 200    | 350           |
| 2              | B       | 2022-01-01 09:00:00 | 150    | 150           |

#PowerBIChallenge #PowerInterview #LearnPowerBi #LearnSQL #TechJobs #DataAnalytics #DataScience #BigData #DataAnalyst #MachineLearning #Python #SQL #Tableau #DataVisualization #DataEngineering #ArtificialIntelligence #CloudComputing #BusinessIntelligence #Data
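One subtlety worth a sketch: with ORDER BY alone, the default window frame is RANGE UNBOUNDED PRECEDING, so two rows with the same user_id and an identical timestamp would both receive the combined total. Specifying a ROWS frame gives strict row-by-row accumulation:

```sql
SELECT transaction_id, user_id, timestamp, amount,
       SUM(amount) OVER (
           PARTITION BY user_id
           ORDER BY timestamp
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- row-by-row, even on tied timestamps
       ) AS running_total
FROM transactions;
```

On the sample data above the results are identical, since no timestamps tie; the difference only shows up with duplicate timestamps per user.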
Most people create indexes hoping queries get faster. Half the time they make things SLOWER. Here's what they missed:

---

📖 CLUSTERED vs NON-CLUSTERED INDEXING

A CLUSTERED index physically orders the table rows to match the index key. One per table. Think of it like a dictionary: the words themselves ARE sorted alphabetically.

A NON-CLUSTERED index is a separate structure pointing back to the actual rows. Like a book's index: it lists terms with page numbers, but the book's content stays untouched. You can have up to 999 per table in SQL Server.

---

CLUSTERED INDEX — create it on your most queried column:

```sql
CREATE CLUSTERED INDEX idx_order_date ON Orders(OrderDate);
```

Now a range query like:

```sql
SELECT * FROM Orders
WHERE OrderDate BETWEEN '2024-01-01' AND '2024-12-31';
```

...does a contiguous read. FAST.

---

NON-CLUSTERED INDEX — for selective lookups:

```sql
CREATE NONCLUSTERED INDEX idx_customer_email
ON Customers(Email)
INCLUDE (CustomerName, City);
```

The INCLUDE keyword is key: it adds columns to the leaf level of the index, avoiding a KEY LOOKUP back to the base table. Most people skip this and wonder why their query still scans.

---

GOTCHA — most people don't know this:

If your non-clustered index doesn't COVER the query, SQL Server does a KEY LOOKUP (also called a Bookmark Lookup) for every matching row. For large result sets, this can be WORSE than a full table scan. Check it with:

```sql
SET STATISTICS IO ON;
SELECT CustomerName, City
FROM Customers
WHERE Email = 'rushi@gmail.com';
```

Look for "logical reads" in the output. If it's high despite the index, your index isn't covering. Add the missing columns to INCLUDE.

Also: every index you create SLOWS DOWN INSERT, UPDATE, and DELETE, because SQL Server must maintain each index on every write. A table with 15 non-clustered indexes on a high-write OLTP system is a performance disaster.

---

COMPOSITE INDEX ORDER MATTERS:

```sql
CREATE INDEX idx_sales ON Sales(Region, ProductID, SaleDate);
```

This index helps:

```sql
WHERE Region = 'West' AND ProductID = 101
```

This index DOESN'T help:

```sql
WHERE SaleDate = '2024-06-01'  -- skips the leading columns
```

The leftmost prefix rule: design the key column order around the predicates your queries actually use, typically putting the most selective equality columns first.

---

WHY THIS MATTERS FOR YOUR CAREER:

Interviewers don't want definitions. They ask: "Your query returns 10M rows in 45 seconds. How do you fix it?" If your answer doesn't include execution plans, covering indexes, and write-cost tradeoffs, you haven't answered the question. Query optimization is where junior analysts become senior engineers.

---

#SQL #DataAnalytics #DatabaseOptimization #DataEngineering #SQLPerformance
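To act on the write-cost point, here is a sketch for spotting non-clustered indexes that are maintained on every write but rarely read (run it in the target database; counters reset on instance restart):

```sql
-- Indexes with many writes but few or no reads are drop candidates.
SELECT OBJECT_NAME(us.object_id) AS table_name,
       i.name AS index_name,
       us.user_seeks + us.user_scans + us.user_lookups AS reads,
       us.user_updates AS writes
FROM sys.dm_db_index_usage_stats AS us
JOIN sys.indexes AS i
  ON i.object_id = us.object_id AND i.index_id = us.index_id
WHERE us.database_id = DB_ID()
  AND i.type_desc = 'NONCLUSTERED'
ORDER BY writes DESC;
```

An index with thousands of writes and zero reads is pure overhead; verify over a full business cycle before dropping, since some indexes exist only for month-end or reporting workloads.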