🐢 Slow queries aren’t always about hardware. Sometimes it’s about understanding 🔥 𝙃𝙤𝙬 𝘿𝙖𝙩𝙖𝙗𝙧𝙞𝙘𝙠𝙨 𝙎𝙌𝙇 𝙒𝙖𝙧𝙚𝙝𝙤𝙪𝙨𝙚𝙨 '𝙨𝙘𝙖𝙡𝙚' 𝙪𝙣𝙙𝙚𝙧 𝙡𝙤𝙖𝙙.

Here’s another interesting practice exam question that dives into 𝗦𝗤𝗟 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗲 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹𝘀: from query concurrency and scaling ranges to how the 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹 𝗹𝗼𝗮𝗱 𝗯𝗮𝗹𝗮𝗻𝗰𝗲𝗿 𝗮𝗻𝗱 𝗮𝘂𝘁𝗼𝘀𝗰𝗮𝗹𝗲𝗿 handle queued workloads.

𝗤𝗨𝗘𝗦𝗧𝗜𝗢𝗡: You have noticed that Databricks SQL queries are running slow. You are asked to look into the reason why queries are running slow and to identify steps to improve performance. When you looked into the issue, you noticed that all the queries are running in parallel against a SQL endpoint [ SQL Warehouse ] with a single cluster. Which of the following steps can be taken to improve the performance/response times of the queries?

⚠️ [ Option 1 ] They can turn on the Serverless feature for the SQL endpoint [ SQL warehouse ].
✅ [ Option 2 ] They can increase the maximum bound of the SQL endpoint (SQL warehouse)’s scaling range.
⚠️ [ Option 3 ] They can increase the warehouse size of the SQL endpoint [ SQL warehouse ] from 2X-Small to 4X-Large.
❌ [ Option 4 ] They can turn on the Auto Stop feature for the SQL endpoint [ SQL warehouse ].
❌ [ Option 5 ] They can turn on the Serverless feature for the SQL endpoint [ SQL warehouse ] and change the Spot Instance Policy to “Reliability Optimized.”

This question is a perfect way to test:
1️⃣ How 𝗦𝗰𝗮𝗹𝗶𝗻𝗴-𝗨𝗽 𝘃𝘀 𝗦𝗰𝗮𝗹𝗶𝗻𝗴-𝗢𝘂𝘁 actually works inside Databricks SQL Warehouses.
2️⃣ The role of the 𝗶𝗻𝘁𝗲𝗿𝗻𝗮𝗹 𝗾𝘂𝗲𝘂𝗲, 𝗹𝗼𝗮𝗱 𝗯𝗮𝗹𝗮𝗻𝗰𝗲𝗿, 𝗮𝗻𝗱 𝗮𝘂𝘁𝗼𝘀𝗰𝗮𝗹𝗲𝗿 in distributing query load efficiently.

I’ve broken this down in an 11-minute explainer video, where I walk through the 𝘦𝘹𝘢𝘤𝘵 𝘧𝘭𝘰𝘸 𝘰𝘧 𝘩𝘰𝘸 𝘲𝘶𝘦𝘳𝘪𝘦𝘴 𝘨𝘦𝘵 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘦𝘥 𝘢𝘤𝘳𝘰𝘴𝘴 𝘤𝘭𝘶𝘴𝘵𝘦𝘳𝘴 and 𝘸𝘩𝘺 𝘩𝘰𝘳𝘪𝘻𝘰𝘯𝘵𝘢𝘭 𝘴𝘤𝘢𝘭𝘪𝘯𝘨 (𝘯𝘰𝘵 𝘷𝘦𝘳𝘵𝘪𝘤𝘢𝘭) 𝘧𝘪𝘹𝘦𝘴 𝘵𝘩𝘦 𝘣𝘰𝘵𝘵𝘭𝘦𝘯𝘦𝘤𝘬 𝘩𝘦𝘳𝘦.

💬 If you’re also fascinated by the architecture behind Databricks SQL Warehouses, I’d love to hear your take on this.

#Databricks #DataEngineering #SQLWarehouse #Scalability #PerformanceOptimization #Lakehouse #DataArchitecture #SystemDesign
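Before touching the configuration, it helps to confirm that queuing, not per-query compute, is the bottleneck. A minimal diagnostic sketch, assuming the system.query.history system table is enabled in the workspace and that the waiting_at_capacity_duration_ms and total_duration_ms columns exist in your Databricks release (the column names are an assumption, verify them first):

-- Queries from the last 24 hours that spent most of their time queued behind the single cluster
SELECT
  statement_id,
  total_duration_ms,
  waiting_at_capacity_duration_ms,
  ROUND(100.0 * waiting_at_capacity_duration_ms / NULLIF(total_duration_ms, 0), 1) AS pct_time_queued
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 24 HOURS
ORDER BY waiting_at_capacity_duration_ms DESC
LIMIT 20;

If queue time dominates, raising the maximum cluster count of the scaling range (Option 2) lets the load balancer spread concurrent queries across more clusters; a bigger T-shirt size (Option 3) mainly speeds up each individual query and leaves the queue in place.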
Excited to share my latest blog post on the Snowflake Builders Blog! 🚀

We’re introducing new capabilities for both Snowflake Managed and Externally Managed Iceberg tables: Target File Size and Partition Writes. These features are designed to boost cross-engine query performance and make your Iceberg tables more interoperable and efficient.

Target File Size: Configure Parquet file sizes for your Iceberg tables to match ecosystem best practices, improving performance across engines like Spark and Trino. Choose from AUTO or specific sizes (16MB, 32MB, 64MB, 128MB) to optimize for your workload.

Table Optimization: Automatically compact and cluster files for Snowflake Managed Iceberg tables, reducing manual maintenance and ensuring optimal file sizes.

Partition Writes: Define partitioning schemes using all Iceberg v2 transforms, enabling efficient partition pruning and faster queries, whether you’re using Snowflake or external engines.

These enhancements help ensure your Snowflake-written Iceberg tables are highly performant and compatible across the open data ecosystem.
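For anyone who wants to try the file-size knob, a minimal sketch, assuming the setting is exposed as a table parameter named TARGET_FILE_SIZE as described above (the table name is made up, and the exact parameter and partition-spec syntax should be checked against the Snowflake docs):

-- Ask Snowflake to write roughly 64MB Parquet files for this Iceberg table
-- (values per the post: AUTO, 16MB, 32MB, 64MB, 128MB)
ALTER ICEBERG TABLE analytics.web_events SET TARGET_FILE_SIZE = '64MB';

-- Revert to letting Snowflake choose the size automatically
ALTER ICEBERG TABLE analytics.web_events SET TARGET_FILE_SIZE = 'AUTO';

Larger files generally mean fewer file-open operations for engines like Spark and Trino, at the cost of coarser pruning, so match the size to how the other engines read the table.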
Lately, I’ve been focusing less on writing SQL and more on understanding it, treating each slow query like an investigation. The goal isn’t just to make it run, but to know exactly what the database is doing behind the scenes.

I usually begin with an EXPLAIN or query profile to visualize the execution plan. Seeing how the query scans, joins, and filters instantly reveals where things slow down. Most performance issues come from full table scans, inefficient joins, or using SELECT * when only a few columns are needed.

Small changes can create big improvements. I push filters earlier, rewrite subqueries as joins, and in Snowflake, cluster data on frequently filtered columns to reduce scans. Keeping table statistics fresh also helps the optimizer make smarter decisions.

Once I make changes, I measure the before-and-after results: runtime, rows scanned, and bytes read. Sometimes, caching an intermediate step or moving a transformation upstream improves performance more than rewriting the SQL itself.

Over time, I’ve learned that SQL optimization isn’t just about syntax. It’s about reducing the amount of work your database has to do and designing queries that align with how the engine actually thinks.

What’s been your most satisfying SQL optimization win?

#SQL #DataEngineering #Snowflake #Azure #QueryOptimization #PerformanceTuning #DataWarehouse
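To make the “rewrite subqueries as joins and only select what you need” habit concrete, a small before/after sketch with made-up table names; running EXPLAIN on the rewritten statement is how to confirm the plan actually changed:

-- Before: wide select plus an IN-subquery
SELECT *
FROM orders o
WHERE o.customer_id IN (SELECT customer_id FROM customers WHERE region = 'EMEA');

-- After: explicit join, only the needed columns, filter stated up front
SELECT o.order_id, o.order_date, o.amount
FROM orders o
JOIN customers c
  ON c.customer_id = o.customer_id
WHERE c.region = 'EMEA';

-- Inspect the new execution plan
EXPLAIN
SELECT o.order_id, o.order_date, o.amount
FROM orders o
JOIN customers c
  ON c.customer_id = o.customer_id
WHERE c.region = 'EMEA';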
Snowflake Performance Tuning That Actually Works:

One of our daily transformation pipelines in Snowflake suddenly went from 8 minutes to 40 with no code changes. Instead of scaling up, I dug into the Query Profile and fixed the real issues: excessive micro-partition scans, wide selects, and oversized transformations.

This post covers the practical, foundational tuning steps that improved performance 5x, all with real Snowflake SQL you can test right now. These are the basics. In the next post, I will dive into advanced optimization techniques such as Query Acceleration Service, Search Optimization, and Materialized Views for sub-second analytics.

#Snowflake #DataEngineering #PerformanceTuning #SQL #DataOps #CloudData #SnowflakeTips
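Measuring the before/after is what makes a win like this verifiable. A small sketch of pulling runtime and bytes scanned from Snowflake’s account usage view (the view lags by up to about 45 minutes, and the query-text filter below is just a hypothetical tag for the pipeline statement):

SELECT query_text,
       start_time,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_scanned / POWER(1024, 3) AS gb_scanned
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%daily_transform%'   -- hypothetical marker in the pipeline SQL
  AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY start_time;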
🚀 Exploring Query Federation with Databricks! 🔍

As data professionals, we often face the challenge of accessing and analyzing data spread across multiple systems. With Databricks Query Federation, that challenge becomes a lot easier to tackle.

Recently, I explored how Databricks enables remote queries across external data sources like MySQL, PostgreSQL, Snowflake, and more, without the need to move data around. This capability not only simplifies data access but also enhances performance and governance.

💡 Key benefits:
⏺️ Seamless integration with external databases
⏺️ Unified analytics across diverse data sources
⏺️ Reduced data movement and duplication
⏺️ Improved data governance and security

Whether you're building dashboards, running complex analytics, or powering ML models, Query Federation can be a game-changer.

🔗 Learn more: https://lnkd.in/gY2TzMY3

#Databricks #QueryFederation #DataEngineering #BusinessIntelligence #RemoteQueries #DataAnalytics
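For anyone curious what the setup looks like, a rough sketch of the Lakehouse Federation flow for a MySQL source, assuming Unity Catalog is enabled; host, credentials, and object names are placeholders, and the available OPTIONS keys vary by source type:

-- 1. Define a connection to the external database (use a secret for the password in real setups)
CREATE CONNECTION mysql_orders_conn TYPE mysql
OPTIONS (
  host 'mysql.example.com',
  port '3306',
  user 'readonly_user',
  password 'replace-with-secret'
);

-- 2. Expose it as a foreign catalog in Unity Catalog
CREATE FOREIGN CATALOG mysql_orders
USING CONNECTION mysql_orders_conn
OPTIONS (database 'ordersdb');

-- 3. Query the remote tables alongside Lakehouse tables, with no data copied
SELECT c.region, COUNT(*) AS order_count
FROM mysql_orders.ordersdb.orders AS o
JOIN main.sales.customers AS c
  ON c.customer_id = o.customer_id
GROUP BY c.region;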
🚀 Understanding SQL Warehouse Sizing & Scaling in Databricks!

When working with Databricks SQL, one of the key aspects that impacts performance, cost, and efficiency is how you size and scale your SQL Warehouse (formerly known as SQL Endpoint). Here’s a quick breakdown for anyone exploring or optimizing their Databricks workloads 👇

🔹 What is a SQL Warehouse?
A SQL Warehouse provides the compute resources required to execute SQL queries, dashboards, and BI workloads on the Databricks platform.

⚙️ 1️⃣ Sizing the SQL Warehouse
Databricks offers multiple T-shirt sizes (Small → Medium → Large → 4X-Large, etc.), each representing a different level of compute power (number of clusters, cores, memory).
🧩 Choose your size based on:
Query complexity (joins, aggregations, transformations)
Data volume (MBs vs TBs)
Concurrency (number of users or BI connections hitting at once)
SLAs and response time goals
💡 Tip: Start with a smaller size → Monitor query performance → Scale up as needed.

⚡ 2️⃣ Scaling Options
Databricks makes scaling simple and flexible:
✅ Auto-Stop: Saves cost when the warehouse is idle.
✅ Auto-Resume: Automatically spins up when a query is triggered.
✅ Scaling Clusters:
Min and Max Clusters: Helps handle peak loads dynamically.
Auto-scaling: Adds or removes clusters automatically based on workload.
💡 Tip: Enable auto-scaling to balance cost and performance. It’s perfect for fluctuating workloads.

📊 3️⃣ Monitoring & Optimization
Use the Query History and SQL Dashboard Performance Metrics to analyze:
Query execution time
Cluster utilization
Cost trends
Then fine-tune warehouse size or concurrency limits accordingly.

✨ Takeaway: Choosing the right warehouse size and enabling scaling features ensures you’re running your Databricks SQL workloads efficiently, cost-effectively, and at top performance.

#Databricks #SQLWarehouse #DataEngineering #AzureDatabricks #BigData #CloudComputing #PerformanceOptimization #Scalability #DataAnalytics
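For the monitoring step, a rough sketch, assuming the system.query.history system table is available in the workspace (verify the table and column names before relying on it); hourly query counts plus a high-percentile runtime are usually enough to decide whether to bump the warehouse size or the min/max cluster range:

SELECT
  DATE_TRUNC('HOUR', start_time)                AS query_hour,
  COUNT(*)                                      AS queries,
  PERCENTILE(total_duration_ms / 1000.0, 0.95)  AS p95_runtime_seconds
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 7 DAYS
GROUP BY DATE_TRUNC('HOUR', start_time)
ORDER BY query_hour;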
This evergreen post by Bradley Schacht and John Hoang clearly explains the equivalent SKUs for #MicrosoftFabric #DataWarehouse and #Synapse dedicated SQL pools. It’s an essential read if you’re comparing these two systems or planning a migration from #AzureSynapse to #MicrosoftFabric. https://lnkd.in/daQheNU2
Snowflake Series-Venky-Topic8: Types of Tables

✅ 1. Permanent Tables
Default table type. Data is stored permanently.
Fail-safe (7 days) + Time Travel (0–90 days) available.
Used for production data.
Example: CREATE TABLE SALES (ID INT, AMOUNT NUMBER);

✅ 2. Transient Tables
Cheaper than permanent. No Fail-safe, only Time Travel (0–1 day).
Used for staging or temporary analytics where long-term recovery is not required.
Example: CREATE TRANSIENT TABLE STG_CUSTOMER (ID INT, NAME STRING);

✅ 3. Temporary Tables
Live only during the session and are automatically dropped when the session ends.
No Fail-safe, no Time Travel.
Used for complex transformations inside a single session.
Example: CREATE TEMPORARY TABLE TEMP_DATA (ID INT);

✅ 4. Variant (Semi-Structured) Tables
Not a separate table type, but tables with VARIANT columns that hold semi-structured data such as JSON, XML, or AVRO.
Example: CREATE TABLE EVENTS (EVENT_ID INT, RAW_DATA VARIANT);

✅ 5. External Tables
Data stays in S3 / Azure / GCS; Snowflake stores only the metadata and reads the files in place at query time.
Good for Data Lake architecture.
Example: CREATE EXTERNAL TABLE EXT_EVENTS WITH LOCATION = @my_s3_stage FILE_FORMAT = (TYPE = JSON);

✅ 6. Iceberg Tables (Native / External Iceberg)
New in Snowflake: manages Apache Iceberg table metadata.
Allows cross-platform table sharing with Databricks, AWS Glue, EMR, etc.
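Since Time Travel and Fail-safe are the main things separating these table types, a quick illustration of the recovery features on the permanent SALES table from the example above:

-- Query the table as it looked one hour ago (Time Travel)
SELECT * FROM SALES AT (OFFSET => -3600);

-- Recover the table after an accidental drop, while still inside the retention window
DROP TABLE SALES;
UNDROP TABLE SALES;

-- Extend Time Travel retention (up to 90 days for permanent tables on Enterprise edition)
ALTER TABLE SALES SET DATA_RETENTION_TIME_IN_DAYS = 30;

On a transient table the same retention setting is capped at 1 day, with no 7-day Fail-safe behind it, which is exactly why those tables are cheaper to keep around.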
We often have this question: Should we use Snowflake or Databricks? The answer, as you guessed, is “It Depends”. The way I see it:

Snowflake is a good choice if your final consumer base is SQL. Irrespective of how many alternatives have come and gone, at the end of the day SQL has consistently beaten its competition. If you have a choice, stick with SQL databases like Snowflake.

On the other hand, if your consumer base is technical engineers, you may want to use Databricks. It is also a good option when your process involves heavy transformations and unstructured files.

And sometimes you may want to combine the two: use Databricks predominantly as the compute layer that builds all the transformations, then push the results to Snowflake, so you get the best of both worlds.

What is your experience?
🚀 The Hidden Cost of DataFrames (and Why Temp Views Could Save You Thousands)

Most Databricks teams are overspending without realizing it. And the culprit is surprisingly simple:
👉 Using DataFrames for SQL-heavy workloads.
If your pipelines are mostly SQL transformations wrapped in PySpark, you’re paying a DataFrame Tax every single day.

🔥 The DataFrame Tax Nobody Talks About
Every DataFrame transformation:
Creates a new logical plan
Consumes driver memory
Adds Catalyst optimization overhead
Mixes SQL + Python (harder to audit)
Slows down SQL-heavy ETL by 15–20%
Multiply that across 5–10 stages and the cost compounds.

⚡ Why Temp Views Win (Especially for SQL Workloads)
Temp Views turn your entire pipeline into a single SQL flow:
Catalyst sees the entire transformation chain
No DataFrame objects sitting in memory
SQL lineage stays readable
Legacy SQL (Teradata/Oracle) drops in cleanly
Governance & audits become painless
It’s like giving Spark the full map instead of step-by-step instructions.

💰 A Quick Cost Reality Check
A typical enterprise pipeline:
5 transformations
50M rows per stage
8-core cluster ($2.20/hour)
Runs daily
Runtime difference:
DataFrames → 18 min (~$0.66/run)
Temp Views → 14 min (~$0.50/run)
👉 Savings per pipeline: $58.40/year
Now scale that across:
100 pipelines → $5,840/year
500 pipelines → $25,000–$50,000/year
Zero business logic changes. Just better architecture.

🎯 When You Should Use Temp Views
Use Temp Views if your pipeline is:
Mostly SQL
A legacy Teradata/Oracle migration
Heavy on joins, window functions, aggregations
Used by SQL practitioners
Being audited / governed
Long-running and costly
Governance teams LOVE Temp Views. SQL humans LOVE Temp Views. Catalyst LOVES Temp Views.

🛠 When DataFrames Still Make Sense
Keep DataFrames for:
Complex programmatic logic
ML + feature engineering
UDF-heavy workloads
Dynamic schema manipulation
Choose the right tool, not the default one.

https://lnkd.in/gyRPJ8Qz

#Databricks #DataEngineering #SparkSQL #ETL #CostOptimization #CloudCosts #SQL #TeradataMigration
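For the SQL-heavy pipelines described above, a minimal sketch of the temp-view style in Spark SQL (table names are illustrative): each step is just a named view, nothing materializes until the final write, so Catalyst optimizes the whole chain as one plan:

-- Stage 1: clean and filter the raw data
CREATE OR REPLACE TEMPORARY VIEW stg_orders AS
SELECT order_id, customer_id, CAST(amount AS DECIMAL(18, 2)) AS amount, order_date
FROM bronze.raw_orders
WHERE order_date >= '2024-01-01';

-- Stage 2: aggregate on top of the first view
CREATE OR REPLACE TEMPORARY VIEW daily_revenue AS
SELECT order_date, customer_id, SUM(amount) AS revenue
FROM stg_orders
GROUP BY order_date, customer_id;

-- Final write: the only step that actually triggers execution
INSERT OVERWRITE TABLE gold.daily_revenue
SELECT * FROM daily_revenue;

The same statements can be run from a notebook via spark.sql(...), so migrated Teradata or Oracle SQL drops in with minimal rewriting.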