🐢 Slow queries aren’t always about hardware. Sometimes it’s about understanding 🔥 𝙃𝙤𝙬 𝘿𝙖𝙩𝙖𝙗𝙧𝙞𝙘𝙠𝙨 𝙎𝙌𝙇 𝙒𝙖𝙧𝙚𝙝𝙤𝙪𝙨𝙚𝙨 '𝙨𝙘𝙖𝙡𝙚' 𝙪𝙣𝙙𝙚𝙧 𝙡𝙤𝙖𝙙.

Here’s another interesting practice exam question that dives into 𝗦𝗤𝗟 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗲 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹𝘀: from query concurrency and scaling ranges to how the 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹 𝗹𝗼𝗮𝗱 𝗯𝗮𝗹𝗮𝗻𝗰𝗲𝗿 𝗮𝗻𝗱 𝗮𝘂𝘁𝗼𝘀𝗰𝗮𝗹𝗲𝗿 handle queued workloads.

𝗤𝗨𝗘𝗦𝗧𝗜𝗢𝗡: You have noticed that Databricks SQL queries are running slow. You are asked to look into the reason why queries are running slow and to identify steps to improve performance. When you looked into the issue, you noticed that all the queries are running in parallel against a SQL endpoint [ SQL Warehouse ] with a single cluster. Which of the following steps can be taken to improve the performance/response times of the queries?

⚠️ [ Option 1 ] They can turn on the Serverless feature for the SQL endpoint [ SQL warehouse ].
✅ [ Option 2 ] They can increase the maximum bound of the SQL endpoint (SQL warehouse)’s scaling range.
⚠️ [ Option 3 ] They can increase the warehouse size of the SQL endpoint [ SQL warehouse ] from 2X-Small to 4X-Large.
❌ [ Option 4 ] They can turn on the Auto Stop feature for the SQL endpoint [ SQL warehouse ].
❌ [ Option 5 ] They can turn on the Serverless feature for the SQL endpoint [ SQL warehouse ] and change the Spot Instance Policy to “Reliability Optimized.”

This question is a perfect way to test:
1️⃣ How 𝗦𝗰𝗮𝗹𝗶𝗻𝗴-𝗨𝗽 𝘃𝘀 𝗦𝗰𝗮𝗹𝗶𝗻𝗴-𝗢𝘂𝘁 actually works inside Databricks SQL Warehouses.
2️⃣ The role of the 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹 𝗾𝘂𝗲𝘂𝗲, 𝗹𝗼𝗮𝗱 𝗯𝗮𝗹𝗮𝗻𝗰𝗲𝗿, 𝗮𝗻𝗱 𝗮𝘂𝘁𝗼𝘀𝗰𝗮𝗹𝗲𝗿 in distributing query load efficiently.

I’ve broken this down in an 11-minute explainer video, where I walk through the 𝘦𝘹𝘢𝘤𝘵 𝘧𝘭𝘰𝘸 𝘰𝘧 𝘩𝘰𝘸 𝘲𝘶𝘦𝘳𝘪𝘦𝘴 𝘨𝘦𝘵 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘦𝘥 𝘢𝘤𝘳𝘰𝘴𝘴 𝘤𝘭𝘶𝘴𝘵𝘦𝘳𝘴 and 𝘸𝘩𝘺 𝘩𝘰𝘳𝘪𝘻𝘰𝘯𝘵𝘢𝘭 𝘴𝘤𝘢𝘭𝘪𝘯𝘨 (𝘯𝘰𝘵 𝘷𝘦𝘳𝘵𝘪𝘤𝘢𝘭) 𝘧𝘪𝘹𝘦𝘴 𝘵𝘩𝘦 𝘣𝘰𝘵𝘵𝘭𝘦𝘯𝘦𝘤𝘬 𝘩𝘦𝘳𝘦.

💬 If you’re also fascinated by the architecture behind Databricks SQL Warehouses, I’d love to hear your take on this.

#Databricks #DataEngineering #SQLWarehouse #Scalability #PerformanceOptimization #Lakehouse #DataArchitecture #SystemDesign
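Before touching the configuration, it helps to confirm that queuing, not per-query compute, is the bottleneck. A minimal diagnostic sketch, assuming the system.query.history system table is enabled in the workspace and that the waiting_at_capacity_duration_ms and total_duration_ms columns exist in your Databricks release (the column names are an assumption, verify them first):

-- Queries from the last 24 hours that spent most of their time queued behind the single cluster
SELECT
  statement_id,
  total_duration_ms,
  waiting_at_capacity_duration_ms,
  ROUND(100.0 * waiting_at_capacity_duration_ms / NULLIF(total_duration_ms, 0), 1) AS pct_time_queued
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 24 HOURS
ORDER BY waiting_at_capacity_duration_ms DESC
LIMIT 20;

If queue time dominates, raising the maximum cluster count of the scaling range (Option 2) lets the load balancer spread concurrent queries across more clusters; a bigger T-shirt size (Option 3) mainly speeds up each individual query and leaves the queue in place.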
Excited to share my latest blog post on the Snowflake Builders Blog! 🚀

We’re introducing new capabilities for both Snowflake Managed and Externally Managed Iceberg tables: Target File Size and Partition Writes. These features are designed to boost cross-engine query performance and make your Iceberg tables more interoperable and efficient.

Target File Size: Configure Parquet file sizes for your Iceberg tables to match ecosystem best practices, improving performance across engines like Spark and Trino. Choose from AUTO or specific sizes (16MB, 32MB, 64MB, 128MB) to optimize for your workload.

Table Optimization: Automatically compact and cluster files for Snowflake Managed Iceberg tables, reducing manual maintenance and ensuring optimal file sizes.

Partition Writes: Define partitioning schemes using all Iceberg v2 transforms, enabling efficient partition pruning and faster queries, whether you’re using Snowflake or external engines.

These enhancements help ensure your Snowflake-written Iceberg tables are highly performant and compatible across the open data ecosystem.
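For anyone who wants to try the file-size knob, a minimal sketch, assuming the setting is exposed as a table parameter named TARGET_FILE_SIZE as described above (the table name is made up, and the exact parameter and partition-spec syntax should be checked against the Snowflake docs):

-- Ask Snowflake to write roughly 64MB Parquet files for this Iceberg table
-- (values per the post: AUTO, 16MB, 32MB, 64MB, 128MB)
ALTER ICEBERG TABLE analytics.web_events SET TARGET_FILE_SIZE = '64MB';

-- Revert to letting Snowflake choose the size automatically
ALTER ICEBERG TABLE analytics.web_events SET TARGET_FILE_SIZE = 'AUTO';

Larger files generally mean fewer file-open operations for engines like Spark and Trino, at the cost of coarser pruning, so match the size to how the other engines read the table.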
Lately, I’ve been focusing less on writing SQL and more on understanding it, treating each slow query like an investigation. The goal isn’t just to make it run, but to know exactly what the database is doing behind the scenes.

I usually begin with an EXPLAIN or query profile to visualize the execution plan. Seeing how the query scans, joins, and filters instantly reveals where things slow down. Most performance issues come from full table scans, inefficient joins, or using SELECT * when only a few columns are needed.

Small changes can create big improvements. I push filters earlier, rewrite subqueries as joins, and in Snowflake, cluster data on frequently filtered columns to reduce scans. Keeping table statistics fresh also helps the optimizer make smarter decisions.

Once I make changes, I measure the before-and-after results: runtime, rows scanned, and bytes read. Sometimes, caching an intermediate step or moving a transformation upstream improves performance more than rewriting the SQL itself.

Over time, I’ve learned that SQL optimization isn’t just about syntax. It’s about reducing the amount of work your database has to do and designing queries that align with how the engine actually thinks.

What’s been your most satisfying SQL optimization win?

#SQL #DataEngineering #Snowflake #Azure #QueryOptimization #PerformanceTuning #DataWarehouse
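To make the “rewrite subqueries as joins and only select what you need” habit concrete, a small before/after sketch with made-up table names; running EXPLAIN on the rewritten statement is how to confirm the plan actually changed:

-- Before: wide select plus an IN-subquery
SELECT *
FROM orders o
WHERE o.customer_id IN (SELECT customer_id FROM customers WHERE region = 'EMEA');

-- After: explicit join, only the needed columns, filter stated up front
SELECT o.order_id, o.order_date, o.amount
FROM orders o
JOIN customers c
  ON c.customer_id = o.customer_id
WHERE c.region = 'EMEA';

-- Inspect the new execution plan
EXPLAIN
SELECT o.order_id, o.order_date, o.amount
FROM orders o
JOIN customers c
  ON c.customer_id = o.customer_id
WHERE c.region = 'EMEA';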
Snowflake Performance Tuning That Actually Works:

One of our daily transformation pipelines in Snowflake suddenly went from 8 minutes to 40 with no code changes. Instead of scaling up, I dug into the Query Profile and fixed the real issues: excessive micro-partition scans, wide selects, and oversized transformations.

This post covers the practical, foundational tuning steps that improved performance 5x, all with real Snowflake SQL you can test right now. These are the basics. In the next post, I will dive into advanced optimization techniques such as Query Acceleration Service, Search Optimization, and Materialized Views for sub-second analytics.

#Snowflake #DataEngineering #PerformanceTuning #SQL #DataOps #CloudData #SnowflakeTips
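Measuring the before/after is what makes a win like this verifiable. A small sketch of pulling runtime and bytes scanned from Snowflake’s account usage view (the view lags by up to about 45 minutes, and the query-text filter below is just a hypothetical tag for the pipeline statement):

SELECT query_text,
       start_time,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_scanned / POWER(1024, 3) AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%daily_transform%'   -- hypothetical marker in the pipeline SQL
  AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY start_time;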
🚀 Exploring Query Federation with Databricks! 🔍

As data professionals, we often face the challenge of accessing and analyzing data spread across multiple systems. With Databricks Query Federation, that challenge becomes a lot easier to tackle.

Recently, I explored how Databricks enables remote queries across external data sources like MySQL, PostgreSQL, Snowflake, and more, without the need to move data around. This capability not only simplifies data access but also enhances performance and governance.

💡 Key benefits:
⏺️ Seamless integration with external databases
⏺️ Unified analytics across diverse data sources
⏺️ Reduced data movement and duplication
⏺️ Improved data governance and security

Whether you're building dashboards, running complex analytics, or powering ML models, Query Federation can be a game-changer.

🔗 Learn more: https://lnkd.in/gY2TzMY3

#Databricks #QueryFederation #DataEngineering #BusinessIntelligence #RemoteQueries #DataAnalytics
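For anyone curious what the setup looks like, a rough sketch of the Lakehouse Federation flow for a MySQL source, assuming Unity Catalog is enabled; host, credentials, and object names are placeholders, and the available OPTIONS keys vary by source type:

-- 1. Define a connection to the external database (use a secret for the password in real setups)
CREATE CONNECTION mysql_orders_conn TYPE mysql
OPTIONS (
  host 'mysql.example.com',
  port '3306',
  user 'readonly_user',
  password 'replace-with-secret'
);

-- 2. Expose it as a foreign catalog in Unity Catalog
CREATE FOREIGN CATALOG mysql_orders
USING CONNECTION mysql_orders_conn
OPTIONS (database 'ordersdb');

-- 3. Query the remote tables alongside Lakehouse tables, with no data copied
SELECT c.region, COUNT(*) AS order_count
FROM mysql_orders.ordersdb.orders AS o
JOIN main.sales.customers AS c
  ON c.customer_id = o.customer_id
GROUP BY c.region;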
🚀 Understanding SQL Warehouse Sizing & Scaling in Databricks!

When working with Databricks SQL, one of the key aspects that impacts performance, cost, and efficiency is how you size and scale your SQL Warehouse (formerly known as SQL Endpoint). Here’s a quick breakdown for anyone exploring or optimizing their Databricks workloads 👇

🔹 What is a SQL Warehouse?
A SQL Warehouse provides the compute resources required to execute SQL queries, dashboards, and BI workloads on the Databricks platform.

⚙️ 1️⃣ Sizing the SQL Warehouse
Databricks offers multiple T-shirt sizes (Small → Medium → Large → 4X-Large, etc.), each representing a different level of compute power (number of clusters, cores, memory).
🧩 Choose your size based on:
Query complexity (joins, aggregations, transformations)
Data volume (MBs vs TBs)
Concurrency (number of users or BI connections hitting at once)
SLAs and response time goals
💡 Tip: Start with a smaller size → Monitor query performance → Scale up as needed.

⚡ 2️⃣ Scaling Options
Databricks makes scaling simple and flexible:
✅ Auto-Stop: Saves cost when the warehouse is idle.
✅ Auto-Resume: Automatically spins up when a query is triggered.
✅ Scaling Clusters:
Min and Max Clusters: Helps handle peak loads dynamically.
Auto-scaling: Adds or removes clusters automatically based on workload.
💡 Tip: Enable auto-scaling to balance cost and performance. It’s perfect for fluctuating workloads.

📊 3️⃣ Monitoring & Optimization
Use the Query History and SQL Dashboard Performance Metrics to analyze:
Query execution time
Cluster utilization
Cost trends
Then fine-tune warehouse size or concurrency limits accordingly.

✨ Takeaway: Choosing the right warehouse size and enabling scaling features ensures you’re running your Databricks SQL workloads efficiently, cost-effectively, and at top performance.

#Databricks #SQLWarehouse #DataEngineering #AzureDatabricks #BigData #CloudComputing #PerformanceOptimization #Scalability #DataAnalytics
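For the monitoring step, a rough sketch, assuming the system.query.history system table is available in the workspace (verify the table and column names before relying on it); hourly query counts plus a high-percentile runtime are usually enough to decide whether to bump the warehouse size or the min/max cluster range:

SELECT
  DATE_TRUNC('HOUR', start_time)                AS query_hour,
  COUNT(*)                                      AS queries,
  PERCENTILE(total_duration_ms / 1000.0, 0.95)  AS p95_runtime_seconds
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 7 DAYS
GROUP BY DATE_TRUNC('HOUR', start_time)
ORDER BY query_hour;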
This evergreen post by Bradley Schacht and John Hoang clearly explains the equivalent SKUs for #MicrosoftFabric #DataWarehouse and #Synapse dedicated SQL pools. It’s an essential read if you’re comparing these two systems or planning a migration from #AzureSynapse to #MicrosoftFabric. https://lnkd.in/daQheNU2
Snowflake Series-Venky-Topic8: Types of Tables

✅ 1. Permanent Tables
Default table type. Data is stored permanently.
Fail-safe (7 days) + Time Travel (0–90 days) available.
Used for production data.
Example: CREATE TABLE SALES (ID INT, AMOUNT NUMBER);

✅ 2. Transient Tables
Cheaper than permanent. No Fail-safe, only Time Travel (0–1 day).
Used for staging or temporary analytics where long-term recovery is not required.
Example: CREATE TRANSIENT TABLE STG_CUSTOMER (ID INT, NAME STRING);

✅ 3. Temporary Tables
Live only during the session and are automatically dropped when the session ends.
No Fail-safe, no Time Travel.
Used for complex transformations inside a single session.
Example: CREATE TEMPORARY TABLE TEMP_DATA (ID INT);

✅ 4. Variant (Semi-Structured) Tables
Not a separate table type, but tables with VARIANT columns that hold semi-structured data such as JSON, XML, or AVRO.
Example: CREATE TABLE EVENTS (EVENT_ID INT, RAW_DATA VARIANT);

✅ 5. External Tables
Data stays in S3 / Azure / GCS; Snowflake stores only the metadata and reads the files in place at query time.
Good for Data Lake architecture.
Example: CREATE EXTERNAL TABLE EXT_EVENTS WITH LOCATION = @my_s3_stage FILE_FORMAT = (TYPE = JSON);

✅ 6. Iceberg Tables (Native / External Iceberg)
New in Snowflake: manages Apache Iceberg table metadata.
Allows cross-platform table sharing with Databricks, AWS Glue, EMR, etc.
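Since Time Travel and Fail-safe are the main things separating these table types, a quick illustration of the recovery features on the permanent SALES table from the example above:

-- Query the table as it looked one hour ago (Time Travel)
SELECT * FROM SALES AT (OFFSET => -3600);

-- Recover the table after an accidental drop, while still inside the retention window
DROP TABLE SALES;
UNDROP TABLE SALES;

-- Extend Time Travel retention (up to 90 days for permanent tables on Enterprise edition)
ALTER TABLE SALES SET DATA_RETENTION_TIME_IN_DAYS = 30;

On a transient table the same retention setting is capped at 1 day, with no 7-day Fail-safe behind it, which is exactly why those tables are cheaper to keep around.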
We often have this question: Should we use Snowflake or Databricks? The answer, as you guessed, is “It Depends”. The way I see it:

Snowflake is a good choice if your final consumer base is SQL. Irrespective of how many alternatives have come and gone, at the end of the day SQL has consistently beaten its competition. If you have a choice, stick with SQL databases like Snowflake.

On the other hand, if your consumer base is technical engineers, you may want to use Databricks. It is also a good option when your process involves heavy transformations and unstructured files.

And sometimes you may want to combine the two: use Databricks predominantly as the compute layer that builds all the transformations, then push the results to Snowflake, so you get the best of both worlds.

What is your experience?
🚀 The Hidden Cost of DataFrames (and Why Temp Views Could Save You Thousands)

Most Databricks teams are overspending without realizing it. And the culprit is surprisingly simple:
👉 Using DataFrames for SQL-heavy workloads.
If your pipelines are mostly SQL transformations wrapped in PySpark, you’re paying a DataFrame Tax every single day.

🔥 The DataFrame Tax Nobody Talks About
Every DataFrame transformation:
Creates a new logical plan
Consumes driver memory
Adds Catalyst optimization overhead
Mixes SQL + Python (harder to audit)
Slows down SQL-heavy ETL by 15–20%
Multiply that across 5–10 stages and the cost compounds.

⚡ Why Temp Views Win (Especially for SQL Workloads)
Temp Views turn your entire pipeline into a single SQL flow:
Catalyst sees the entire transformation chain
No DataFrame objects sitting in memory
SQL lineage stays readable
Legacy SQL (Teradata/Oracle) drops in cleanly
Governance & audits become painless
It’s like giving Spark the full map instead of step-by-step instructions.

💰 A Quick Cost Reality Check
A typical enterprise pipeline:
5 transformations
50M rows per stage
8-core cluster ($2.20/hour)
Runs daily
Runtime difference:
DataFrames → 18 min (~$0.66/run)
Temp Views → 14 min (~$0.50/run)
👉 Savings per pipeline: $58.40/year
Now scale that across:
100 pipelines → $5,840/year
500 pipelines → $25,000–$50,000/year
Zero business logic changes. Just better architecture.

🎯 When You Should Use Temp Views
Use Temp Views if your pipeline is:
Mostly SQL
A legacy Teradata/Oracle migration
Heavy on joins, window functions, aggregations
Used by SQL practitioners
Being audited / governed
Long-running and costly
Governance teams LOVE Temp Views. SQL humans LOVE Temp Views. Catalyst LOVES Temp Views.

🛠 When DataFrames Still Make Sense
Keep DataFrames for:
Complex programmatic logic
ML + feature engineering
UDF-heavy workloads
Dynamic schema manipulation
Choose the right tool, not the default one.

https://lnkd.in/gyRPJ8Qz

#Databricks #DataEngineering #SparkSQL #ETL #CostOptimization #CloudCosts #SQL #TeradataMigration
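For the SQL-heavy pipelines described above, a minimal sketch of the temp-view style in Spark SQL (table names are illustrative): each step is just a named view, nothing materializes until the final write, so Catalyst optimizes the whole chain as one plan:

-- Stage 1: clean and filter the raw data
CREATE OR REPLACE TEMPORARY VIEW stg_orders AS
SELECT order_id, customer_id, CAST(amount AS DECIMAL(18, 2)) AS amount, order_date
FROM bronze.raw_orders
WHERE order_date >= '2024-01-01';

-- Stage 2: aggregate on top of the first view
CREATE OR REPLACE TEMPORARY VIEW daily_revenue AS
SELECT order_date, customer_id, SUM(amount) AS revenue
FROM stg_orders
GROUP BY order_date, customer_id;

-- Final write: the only step that actually triggers execution
INSERT OVERWRITE TABLE gold.daily_revenue
SELECT * FROM daily_revenue;

The same statements can be run from a notebook via spark.sql(...), so migrated Teradata or Oracle SQL drops in with minimal rewriting.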