Liquid Clustering: Optimizing Databricks Workloads for Performance and Cost
We are currently in an era where data is a goldmine. The volume of data in our data stores is growing rapidly, and the query patterns against each dataset are changing just as quickly, driven by a rising number of data users, automated systems, and AI tools accessing these datasets. As a result, it is increasingly challenging to keep data organized efficiently to ensure optimal query performance and support concurrent writes, all while minimizing the total cost of ownership (TCO).
When working with big data ecosystems like Databricks, data is stored in Delta Lake tables. Traditionally, optimization techniques such as partitioning and z-ordering have been used to improve data organization and query performance. However, these methods have their limitations. In response, Databricks introduced Liquid Clustering, a simplified and flexible approach to data layout and optimization that promises significant cost savings and performance gains.
The Challenges of Traditional Data Layouts
Traditional partitioning and z-ordering can be effective for specific query patterns, but they also lead to several challenges, including:

- Partition columns must be chosen up front, and changing them later requires rewriting the table.
- High-cardinality partition columns lead to over-partitioning (many small files and directories), while skewed columns produce unevenly sized partitions.
- Z-ordering is not incremental: each OPTIMIZE ZORDER BY pass rewrites data, and newly ingested data remains unordered until the next pass.
- Choosing and maintaining a good layout demands ongoing manual tuning as query patterns evolve.
Enter Liquid Clustering
Liquid Clustering addresses these challenges through several innovative approaches:

- Clustering keys are defined as table metadata and can be changed without rewriting existing data.
- Data is clustered incrementally as it is written, so new data lands in a good layout without reorganizing old files.
- It replaces rigid Hive-style partition directories with a flexible layout, avoiding small-file and skew problems.
- Combined with Predictive Optimization, key selection and layout maintenance can be fully automated.
Let's explore Liquid Clustering with an example table. Consider impressions data (TB scale) from the advertising campaigns run by a company. When this data is written each day without any data layout configuration, it is appended incrementally to files, resulting in the following data layout.
Assume the users' frequent query pattern against the dcm_impression table filters on the event date and campaign_name columns.
SELECT * FROM dcm_impression WHERE date = '2025-04-29' AND campaign_name = 'C'
The query engine leverages Delta data skipping statistics (min/max values, null counts, and total records per file) to identify the relevant files to scan. In this scenario, the engine needs to read 5 files out of 10, yielding a pruning rate of 50%.
The primary issue here is that the data for campaign C is not collocated in a single file. From the statistics alone, the query engine can only determine that campaign C may lie somewhere between the min and max values of each of those files for that date. Extracting all records for campaign C therefore also requires reading a significant number of records belonging to other campaigns from each of the 5 files.
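To make the data-skipping mechanics concrete, here is a small Python sketch of min/max file pruning. The file names and campaign ranges are invented to mirror the 10-file scenario above; this is an illustration of the idea, not Databricks' actual implementation.

```python
# Hypothetical per-file min/max statistics for campaign_name on a single
# date. Delta keeps statistics like these in the transaction log; the
# values below are made up for illustration.
file_stats = {
    "part-00.parquet": ("A", "D"),
    "part-01.parquet": ("B", "E"),
    "part-02.parquet": ("A", "C"),
    "part-03.parquet": ("C", "F"),
    "part-04.parquet": ("B", "D"),
    "part-05.parquet": ("D", "G"),
    "part-06.parquet": ("E", "H"),
    "part-07.parquet": ("F", "J"),
    "part-08.parquet": ("G", "K"),
    "part-09.parquet": ("H", "Z"),
}

def prune(stats, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]

to_scan = prune(file_stats, "C")
pruning_rate = 1 - len(to_scan) / len(file_stats)
print(f"Scanning {len(to_scan)} of {len(file_stats)} files "
      f"({pruning_rate:.0%} pruned)")
```

With these made-up statistics, 5 of the 10 files could contain campaign C, so the engine must scan half the table, matching the 50% pruning rate described above.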
To improve query performance, we can enable Liquid Clustering. There are two ways to enable Liquid Clustering:
Manually Using ALTER and OPTIMIZE commands
In this example, we use the following ALTER command to enable liquid clustering with the new clustering keys: 'date' and 'campaign_name'. The ALTER command registers the cluster keys in the transaction log and defines the new data layout, but the data is reorganized into that layout only when the OPTIMIZE command is executed. The data team may have valid reasons to defer OPTIMIZE, for example to schedule the rewrite during off-peak hours.
ALTER TABLE dcm_impression CLUSTER BY (date, campaign_name);
OPTIMIZE dcm_impression;
In the manual approach, it is necessary to monitor query patterns regularly. If frequent query patterns change, the data team must weigh the overhead cost of re-clustering against the potential performance gains before altering the clustering keys. Making these data layout decisions and implementing the changes can be time-consuming and complex for a data team. Let's explore what else can be done in the next section.
Evolve Clustering through Automatic Liquid Clustering
Note: At the time of writing this article, Automatic Liquid Clustering is in Public Preview.
Automatic Liquid Clustering simplifies data management by eliminating the need for manual tuning. Predictive Optimization harnesses the power of Unity Catalog to monitor and analyze data and query patterns.
For our example, the following command enables Automatic Liquid Clustering for the Unity Catalog managed table dcm_impression.
ALTER TABLE dcm_impression CLUSTER BY AUTO;
Once enabled, Predictive Optimization takes care of the harder data management problems, namely cluster key selection and data layout evolution, by continuously performing the following based on observed query patterns:
Telemetry: To determine whether dcm_impression would benefit from Liquid Clustering, Predictive Optimization analyzes query filters from the transaction logs (including the metadata within them). In our example, it determines that date and campaign_name are frequently queried.
Perform Model Evaluation: Predictive Optimization involves a comprehensive evaluation of query workloads to identify the most effective clustering keys for maximizing data-skipping efficiency.
This process leverages insights gained from historical query patterns to estimate the potential performance enhancements associated with various clustering schemes. By simulating past queries, it predicts how effectively each option can reduce the amount of data that needs to be scanned.
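As a rough illustration of this evaluation step, the sketch below replays a set of hypothetical historical query filters against three candidate clustering layouts and scores each layout by the total number of files scanned. All rows, queries, file counts, and candidates are invented for the toy example; the real system evaluates actual transaction-log statistics, not raw rows.

```python
# Toy "model evaluation": simulate candidate clustering layouts and
# replay past query filters against each. Everything here is invented.
dates = ["2025-04-26", "2025-04-27", "2025-04-28", "2025-04-29"]
rows = [(d, c) for d in dates for c in "CHAFIBGDJE"]  # scrambled ingest order

# Historical query filters as (date, campaign); None = column not filtered.
queries = [("2025-04-27", "C"), ("2025-04-28", "G"),
           ("2025-04-26", None), ("2025-04-29", "A")]

def build_stats(rows, key_cols, n_files=8):
    """Sort rows by candidate keys, chunk into files, record min/max per column."""
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in key_cols))
    size = len(ordered) // n_files
    chunks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [tuple((min(r[c] for r in ch), max(r[c] for r in ch)) for c in (0, 1))
            for ch in chunks]

def files_scanned(stats, query):
    """A file is scanned unless a filtered column falls outside its min/max."""
    return sum(1 for file_stats in stats
               if all(v is None or lo <= v <= hi
                      for v, (lo, hi) in zip(query, file_stats)))

candidates = {"date": (0,), "campaign_name": (1,), "date, campaign_name": (0, 1)}
scores = {name: sum(files_scanned(build_stats(rows, keys), q) for q in queries)
          for name, keys in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> best keys:", best)
```

In this toy run, clustering on both columns scans the fewest files across the replayed queries, which is the kind of signal that would justify choosing date and campaign_name as the keys.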
There are potentially three clustering strategies Predictive Optimization discovers for this table:

1. Cluster by date only.
2. Cluster by campaign_name only.
3. Cluster by both date and campaign_name.
In the illustration below, with liquid clustering on date and campaign_name, the query engine reads only the file that contains the records for campaign C; the rest of the files are skipped. The min and max campaign_name values for that file (used for data skipping) are both C.
Cost-Benefit Optimization: Clustering or re-clustering always incurs some overhead. Therefore, the next step in Predictive Optimization is a cost-benefit analysis for option 3 above. Predictive Optimization assesses whether the query performance benefit outweighs the associated cost. In our case, the pruning rate with the clustering keys date and campaign_name is 90%, and these columns are frequently queried. The benefits are therefore substantial and justify clustering on these columns, so Predictive Optimization updates this Unity Catalog managed table, adding them as the new cluster keys. This automatic decision-making helps minimize the total cost of ownership (TCO).
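A back-of-the-envelope version of such a cost-benefit check might look like the sketch below. Every number and the decision rule itself are illustrative assumptions, not Databricks internals.

```python
# Toy cost-benefit rule: approve re-clustering when projected scan
# savings over a horizon exceed the one-time rewrite cost. All costs
# and counts are hypothetical.
def clustering_is_worthwhile(files_total, files_scanned_now, files_scanned_after,
                             daily_queries, scan_cost_per_file,
                             recluster_cost_per_file, horizon_days=30):
    saved_per_query = (files_scanned_now - files_scanned_after) * scan_cost_per_file
    projected_savings = saved_per_query * daily_queries * horizon_days
    recluster_cost = files_total * recluster_cost_per_file
    return projected_savings > recluster_cost

# With 90% pruning (scan 1 file instead of 5 out of 10) and frequent queries,
# the savings dwarf the one-time re-clustering cost:
decision = clustering_is_worthwhile(
    files_total=10, files_scanned_now=5, files_scanned_after=1,
    daily_queries=200, scan_cost_per_file=0.01, recluster_cost_per_file=0.05)
```

The point of the sketch is the shape of the trade-off: a one-time rewrite cost amortized against recurring per-query scan savings.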
Liquid Clustering Optimizes Writes
Liquid Clustering in Databricks achieves the ability to cluster data without rewriting the entire dataset in the traditional sense by leveraging the Delta Lake transaction log and a process of incremental clustering.
Here's a breakdown of the technical details:
1. Delta Lake Transaction Log as the Source of Truth: Clustering keys and per-file statistics are recorded as metadata in the Delta transaction log. Changing the clustering keys is therefore a metadata operation and does not, by itself, force a rewrite of existing files.
2. Incremental Clustering on Write: Eligible write operations cluster newly ingested data according to the current keys as it lands, so fresh data arrives in a reasonable layout without touching previously written files.
3. OPTIMIZE Command for Reclustering (When Needed): When OPTIMIZE runs on a clustered table, it works incrementally, rewriting only the files that are unclustered or poorly clustered rather than the entire dataset.
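The incremental behavior described above can be sketched in Python as follows. The DataFile class and its "clustered" flag are invented stand-ins for the per-file metadata Delta actually tracks; the point is that an incremental pass touches only files that need work.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    """Invented stand-in for per-file metadata in the transaction log."""
    name: str
    clustered: bool

def incremental_optimize(files):
    """Rewrite (here: just relabel) only files not yet well clustered;
    already-clustered files are left untouched."""
    rewritten = [f.name for f in files if not f.clustered]
    for f in files:
        f.clustered = True
    return rewritten

# Two older files are already clustered; two new ones arrived since.
table = [DataFile("part-00", True), DataFile("part-01", True),
         DataFile("part-02", False), DataFile("part-03", False)]
touched = incremental_optimize(table)  # only the two new files are rewritten
```

Contrast this with z-ordering, where an OPTIMIZE ZORDER BY pass rewrites data wholesale regardless of how well-organized individual files already are.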
This incremental and metadata-driven approach makes Liquid Clustering more efficient and flexible than traditional data layout optimization techniques.
Data Volume Considerations
It's less about a strict "minimum volume" and more about the characteristics of your data, such as cardinality, skew, and query patterns. Liquid Clustering's advantages become increasingly apparent as data volume grows: its optimizations have a more significant impact on performance when dealing with terabytes or petabytes of data. Like any optimization technique, Liquid Clustering introduces some overhead. For very small datasets with extremely simple queries, the overhead might outweigh the benefits; in most real-world scenarios, however, the performance gains far exceed it.
Additionally, Databricks documentation provides thresholds relating to the number of clustering columns, and related data sizes. This is more of a guideline for how the system manages the data, and not a minimum data size for usage.
Clustering the Entire Data Lake
When you have several unrelated datasets in Delta tables serving distinct data products, using Liquid Clustering requires careful consideration of each dataset's unique characteristics and query patterns. If you are using the Medallion Architecture, start at the gold layer. Here's a breakdown of how to approach this scenario:

Key Considerations:

- Evaluate each table independently: clustering keys that help one data product may be irrelevant to another.
- Prioritize large, frequently queried tables, where improved data skipping yields the biggest savings.
- For small or rarely queried tables, the clustering overhead may not pay for itself.
- Revisit these decisions periodically as query patterns shift, or let Automatic Liquid Clustering handle this for Unity Catalog managed tables.
Guidance for Verifying Effectiveness
Once Automatic Liquid Clustering is enabled on your Delta table, verifying cost minimization and improved query efficiency requires a combination of monitoring query performance, analyzing cloud resource consumption, and understanding the characteristics of your data and workloads. Here's a breakdown of what to verify and how:
To verify Automatic Liquid Clustering's cost and efficiency benefits:

- Compare query profiles before and after: files and bytes pruned versus scanned, and end-to-end runtime for your most frequent queries.
- Track compute (DBU) and cloud storage costs for the workloads that read and write the table.
- Inspect the table's metadata (for example, DESCRIBE DETAIL shows the current clustering columns) and its history to confirm which keys were chosen and when clustering ran.
If you don't see improvements:

- Confirm that your frequent query filters actually match the chosen clustering keys.
- Allow time: benefits accrue as more data is ingested and background optimization gets more opportunities to run.
- Verify the table is a Unity Catalog managed table with Predictive Optimization enabled.
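One simple way to operationalize the before/after comparison is to capture a few metrics from query profiles and flag which ones improved beyond a noise threshold. The metric names and values below are hypothetical; in Databricks you would read the real numbers from the query profile UI or system tables.

```python
# Hypothetical before/after metrics captured from query profiles.
metrics_before = {"files_scanned": 5000, "bytes_scanned": 12e9, "runtime_s": 48.0}
metrics_after  = {"files_scanned":  600, "bytes_scanned": 1.5e9, "runtime_s": 9.0}

def improved(before, after, min_gain=0.10):
    """Count a metric as improved if it dropped by at least `min_gain` (10%)."""
    return (before - after) / before >= min_gain

report = {m: improved(metrics_before[m], metrics_after[m]) for m in metrics_before}
```

Averaging each metric over many runs of the same frequent queries, rather than comparing single runs, helps separate clustering gains from warm-cache effects and cluster-size changes.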
By systematically monitoring these aspects, you can gain confidence that Automatic Liquid Clustering is effectively minimizing costs and improving query efficiency for your Delta tables. Remember that the benefits might accrue over time as more data is ingested and the automatic clustering has more opportunities to optimize the layout.
Conclusion
Databricks' Liquid Clustering represents a significant advancement in data lake optimization. By dynamically adapting data layout to query patterns, Liquid Clustering delivers substantial cost savings and performance improvements. As data lakes grow and complexity increases, Liquid Clustering will become an increasingly essential tool for optimizing Databricks workloads. By embracing this technology, organizations can unlock the full potential of their data and drive data-driven innovation more efficiently. However, while opting for Automatic Liquid Clustering can be very beneficial, it's important to remain vigilant in reviewing AI-driven decisions and managing costs effectively.