"Elevate Your Analytics: Snowflake's Cutting-Edge Performance Optimization Methods"
Snowflake Performance Optimization
Five most common performance challenges
1. Large Table Scans
The Problem:
If Snowflake can't exclude micro-partitions based on metadata, it has to download and scan through them. For large tables, this can be very time-consuming.
Why is it bad?
If you are returning a small percentage of records but reading a large number of files, you are paying for a lot of time spent excluding data rather than processing it.
How common is it?
Very!
Large table scans in Snowflake can be a performance concern, especially when dealing with massive datasets. To mitigate this, consider the following strategies:
Clustering: Use clustering keys to co-locate similar rows in the same micro-partitions. This improves scan efficiency by skipping irrelevant data.
Partition Pruning: Ensure your data is well-clustered on the columns used in the WHERE clause to enable partition pruning.
Query Optimization: Refine your queries to select only the necessary columns, which can reduce the amount of data scanned.
Micro-Partitions: Understand how Snowflake stores and organizes data in micro-partitions to optimize storage for performance.
For specific cases where queries are taking longer than expected, it’s advisable to review the query’s execution plan and consider reaching out to Snowflake support for further analysis. Additionally, using the EXPLAIN statement can provide insights into query execution and help identify potential optimizations.
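As a sketch of the clustering and pruning advice above (the table and column names are hypothetical, not from a real schema):

-- Define a clustering key so rows with similar ORDER_DATE values
-- land in the same micro-partitions (hypothetical table/column).
ALTER TABLE sales.orders CLUSTER BY (order_date);

-- Check how well the table is clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales.orders', '(order_date)');

-- Inspect the plan: the partitionsTotal vs. partitionsAssigned figures
-- in the EXPLAIN output show how many micro-partitions pruning skipped.
EXPLAIN
SELECT order_id, order_date, amount
FROM sales.orders
WHERE order_date >= '2024-01-01';

If partitionsAssigned is close to partitionsTotal despite a selective filter, the table is likely not well-clustered on the filter column.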
2. Unnecessary Join Explosions
The Problem:
When joining two tables, the result is larger than either of the input sets
Why is it bad?
Join Explosions are usually not the desired outcome of a join - they are generally the result of data quality issues, logical errors or an incorrect join order. This means that the increased number of records is processed, and then discarded at some point (generally via a DISTINCT or some other kind of aggregate). This processing is wasteful.
How common is it?
Very common. More common in teams with lots of joins or aggregations in a single query.
Join explosions in Snowflake occur when a join operation produces a disproportionately large number of rows, often due to low cardinality join keys or incorrect join logic. This can lead to performance issues and increased query costs. To address this, consider the following strategies:
Review Join Conditions: Ensure that your join conditions are correct and that they don’t unintentionally match too many rows.
Increase Cardinality: If possible, use join keys with higher cardinality to reduce the number of matches.
Use Proper Filters: Apply filters before the join, if applicable, to reduce the dataset size.
Optimize Query Logic: Rewrite queries to avoid unnecessary join operations, especially those involving the LIKE operator which can be expensive.
Leverage Query Profile: Utilize Snowflake’s Query Profile to analyze and understand the join behavior and identify potential issues.
Append-Only Streams: For certain scenarios, using append-only streams can help avoid join explosions by altering the data ingestion pattern.
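Two of the checks above can be sketched in SQL (table and column names are hypothetical). First, profile the join-key cardinality; second, deduplicate one side before joining rather than joining first and cleaning up with DISTINCT afterwards:

-- How many rows share each join-key value? If a value repeats heavily
-- on both sides, the join multiplies rows (N x M per key value).
SELECT customer_id, COUNT(*) AS rows_per_key
FROM staging.payments
GROUP BY customer_id
ORDER BY rows_per_key DESC
LIMIT 10;

-- Deduplicate the dimension side before the join so each order
-- matches at most one customer row.
SELECT o.order_id, c.customer_name
FROM sales.orders o
JOIN (
    SELECT DISTINCT customer_id, customer_name
    FROM staging.customers
) c
  ON o.customer_id = c.customer_id;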
3. High Compile Time
The Problem:
Complex queries with a large number of joins, nested views, CTEs, etc.
Why is it bad?
Overly complex queries take longer to compile and to compute, resulting in higher expense than necessary. They also often result in unnecessary extra handling of the same data
How common is it?
Reasonably common. More common for customers who have moved to Snowflake from on-premises or in-memory systems.
Compile time and execution time refer to two different stages in the lifecycle of a query or program:
· Compile Time: This is the phase where the code you’ve written is being converted into a format that can be run by the machine. In databases like Snowflake, compile time involves the optimizer creating an optimal query plan. This plan determines the most efficient way to execute the query based on the current state of the database.
· Execution Time: After the query plan has been compiled, execution time is when this plan is actually carried out. The database performs the operations outlined in the plan, such as retrieving data, joining tables, filtering results, etc.
In summary, compile time is about preparing to run the query, and execution time is about actually running it.
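Both stages are broken out per query in the account usage views, so you can check whether compilation or execution dominates. A sketch (times are in milliseconds):

SELECT query_id,
       compilation_time,
       execution_time,
       total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY compilation_time DESC
LIMIT 10;

Queries where compilation_time rivals or exceeds execution_time are the candidates for the simplification strategies below.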
High compile times in Snowflake can be a concern, especially when they exceed the execution time. Here are some strategies to reduce compile time:
Simplify Query Complexity: The more complex the query, with numerous lines, functions, joins, filters, group by, order by, and union statements, the longer it takes for the optimizer to create the query plan.
Reduce the Number of Tables and Columns: The optimizer analyzes the schema and statistics of the tables to create an optimized query plan. Minimizing the number of tables and columns involved can reduce compile time.
Optimize Cloud Services Layer: Since query compilation occurs in the Cloud Services Layer, streamlining the work done there, such as metadata management and query optimization, can reduce compile time.
Warehouse Size: Note that increasing the virtual warehouse size does not improve compilation time as it only affects execution time.
Review SQL Structure: Reevaluate the structure of your SQL, looking for areas to simplify and optimize. Avoid unnecessary casting and complex subqueries.
Metadata Cache State: Compilation time can depend on the metadata cache state. If you infrequently access base tables, this can affect the cache and, consequently, the compile time.
To improve query execution time in Snowflake, consider the following tips:
1. Select Fewer Columns: Only select the columns you need. This reduces the amount of data processed and transferred.
2. Leverage Query Pruning: Use partitioning and clustering keys to prune unnecessary data from your queries.
3. Use Clustered Columns in Join Predicates: This can help Snowflake optimize join operations.
4. Use Pre-Aggregated Tables: If you frequently perform the same aggregations, consider storing the results in a separate table.
5. Simplify Your Queries: Break down complex queries into simpler components.
6. Reduce Data Volume: Filter out unnecessary data early in your queries to reduce the volume of data processed.
7. Reuse CTEs: Common Table Expressions (CTEs) can sometimes be faster than repeating the same subqueries or joins.
8. Remove Unnecessary Sorts: Sorting can be resource-intensive, so only sort data when necessary.
9. Adjust MAX_CONCURRENCY_LEVEL: This parameter controls resource allocation and can be adjusted to improve execution speed for complex queries.
10. Dedicate Warehouses or Use Multi-Clustering: For large complex queries, consider using a dedicated warehouse or Snowflake’s multi-clustering feature to allow more parallel execution.
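Tip 4 above (pre-aggregated tables) might look like this as a sketch, with hypothetical table names:

-- Materialize a frequently repeated aggregation once...
CREATE OR REPLACE TABLE analytics.daily_sales AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count
FROM sales.orders
GROUP BY order_date, region;

-- ...then report from the small pre-aggregated table instead of
-- re-scanning and re-aggregating the large fact table each time.
SELECT region, SUM(total_amount) AS total_amount
FROM analytics.daily_sales
WHERE order_date >= '2024-01-01'
GROUP BY region;

The trade-off is freshness: the summary table must be refreshed (e.g. on a schedule) as new data arrives.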
The Snowflake optimizer, often referred to as the query optimizer, is a critical component of the Snowflake data platform. It is responsible for taking a SQL query and determining the most efficient way to execute it. Here’s what it does:
1. Automated Optimization: The optimizer automatically tunes queries and maintains all statistics used for optimization upon data load and updates.
2. Cascades-Style Approach: It follows a top-down, cost-based optimization approach, considering various execution plans and choosing the one with the lowest estimated cost.
3. No Traditional Indices: Unlike some other database systems, Snowflake does not use indices (indexes), which means the plan search space is smaller, simplifying the optimization process.
4. Resource Management: It also plays a role in resource optimization by eliminating basic maintenance tasks such as vacuuming, partitioning, and indexing.
5. Continuous Improvement: Snowflake is designed to become faster and more efficient over time through seamless ongoing updates.
The optimizer works behind the scenes to ensure that your queries run as quickly and efficiently as possible, leveraging Snowflake’s unique architecture and capabilities.
4. Excessive Spilling
The Problem:
Queries have to store data to disk or remote storage while in flight, an operation called ‘spilling’
Why is it bad?
Storing and retrieving data while it’s still in use is an expensive operation from a time perspective
How common is it?
Very common; more common in teams with lots of joins or aggregations in a single query
Excessive spilling in Snowflake occurs when a query’s operations exceed the memory capacity of the warehouse and data is written to disk, which can degrade performance. To mitigate this, you can:
Review and Optimize Queries: Analyze the query for any possible optimizations, especially if it’s a new query.
Reduce Data Processed: Try to improve partition pruning or project only the necessary columns.
Manage Parallel Queries: Decrease the number of parallel queries running in the warehouse to reduce competition for resources.
Use Larger Warehouses: A larger warehouse provides more memory and local disk space, which can help accommodate larger operations.
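The last option above is a one-line change (the warehouse name here is hypothetical). Keep in mind that each size step roughly doubles the credit rate, so scale back down once the heavy workload has finished:

-- Move up one size to get more memory and local disk per node.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale back down when the memory-heavy workload is done.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';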
The following discusses how a warehouse owner or administrator can resolve memory spillage in order to improve the performance of a query.
Performance degrades drastically when a warehouse runs out of memory while executing a query because memory bytes must “spill” onto local disk storage. If the query requires even more memory, it spills onto remote cloud-provider storage, which results in even worse performance.
Finding queries that spill to storage
This query identifies the top 10 worst offending queries in terms of bytes spilled to local and remote storage.
SELECT query_id,
       SUBSTR(query_text, 1, 50) AS partial_query_text,
       user_name,
       warehouse_name,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE (bytes_spilled_to_local_storage > 0
   OR bytes_spilled_to_remote_storage > 0)
  AND start_time::date > DATEADD('days', -45, CURRENT_DATE)
ORDER BY bytes_spilled_to_remote_storage DESC,
         bytes_spilled_to_local_storage DESC
LIMIT 10;
Options for resolving memory spillage
When memory spillage is the issue, you can convert your existing warehouse to a Snowpark-optimized warehouse, which provides 16x more memory per node and 10x the local cache compared to a standard warehouse. Though a larger warehouse also has more memory available, a query might not require its expanded compute resources.
If you want to try resolving the spillage of a query without adjusting the warehouse that runs it, use the Query Profile to identify which operation nodes are causing the spillage.
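Creating a Snowpark-optimized warehouse and routing the spilling workload to it might look like this (the warehouse name is hypothetical):

-- Snowpark-optimized warehouses provide more memory per node
-- than a standard warehouse of the same size.
CREATE OR REPLACE WAREHOUSE memory_heavy_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';

-- Run the spilling query on the new warehouse for this session.
USE WAREHOUSE memory_heavy_wh;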
5. Queued Queries
The Problem:
When too many statements run at once on a single cluster, throughput is reduced.
Why is it bad?
While Snowflake can run multiple queries at once on a single cluster, and doing so to some degree can benefit total throughput, past a certain point this results in spilling and CPU contention, increasing wait times and decreasing throughput.
How common is it?
More common on BI warehouses where lots of people are accessing many reports.
Reducing queues:
If too many queries are sent to a warehouse at the same time, the warehouse’s compute resources become exhausted and subsequent queries are queued until resources become available. The time between submitting a query and getting its results is longer when the query must wait in a queue before starting.
This query lists the warehouses that had a queue in the last month, sorted by date.
SELECT TO_DATE(start_time) AS date,
       warehouse_name,
       SUM(avg_running) AS sum_running,
       SUM(avg_queued_load) AS sum_queued
FROM snowflake.account_usage.warehouse_load_history
WHERE TO_DATE(start_time) >= DATEADD(month, -1, CURRENT_TIMESTAMP())
GROUP BY 1, 2
HAVING SUM(avg_queued_load) > 0
ORDER BY date DESC;
Options for reducing queues
You have several options to stop warehouse queuing:
· For a regular warehouse (i.e. not a multi-cluster warehouse), consider creating additional warehouses, and then distribute the queries among them. If specific queries are causing usage spikes, focus on moving those queries.
· Consider converting a warehouse to a multi-cluster warehouse so the warehouse can elastically provision additional compute resources when demand spikes. Multi-cluster warehouses require the Enterprise Edition of Snowflake.
· If you are already using a multi-cluster warehouse, increase the maximum number of clusters.
Single-cluster or multi-cluster (in Maximized mode): Statements are queued until already-allocated resources are freed or additional resources are provisioned, which can be accomplished by increasing the size of the warehouse.
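Enabling multi-cluster scaling on an existing warehouse (Enterprise Edition or higher; the warehouse name is hypothetical) is a single statement:

-- In auto-scale mode (min < max), Snowflake starts extra clusters
-- as queries queue and shuts them down when load drops.
ALTER WAREHOUSE bi_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';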
Snowflake uses virtual warehouses for compute, which are simply clusters of compute nodes; each node has 8 cores. When you submit a query to a virtual warehouse, it is processed by one or more cores (depending on whether the query can be parallelized). If the warehouse has no free core to execute an incoming query, that query queues up. If you log on to the Snowflake UI and click on the warehouse tab, you will see this queuing as the yellow portion of the bars. You can also see it in the QUEUED_OVERLOAD_TIME column if you query the QUERY_HISTORY view.
Now, it is not a good thing for queries to queue up consistently. The best practice is a multi-warehouse strategy: give every distinct group of workloads a dedicated warehouse so that you can scale each one horizontally or vertically based on that workload's query load.
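Spotting which queries are queuing, using the QUEUED_OVERLOAD_TIME column mentioned above (times are in milliseconds):

SELECT query_id,
       warehouse_name,
       queued_overload_time,
       total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE queued_overload_time > 0
  AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY queued_overload_time DESC
LIMIT 20;

Grouping the results by warehouse_name shows which warehouses would benefit most from splitting workloads or enabling multi-cluster scaling.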
Exciting insights on performance optimization in data analytics using Snowflake!