When Partitioning and Clustering Go Wrong: Lessons from Optimizing Queries in BigQuery
Recently, I worked with a table in BigQuery called orders, sized at 2.98 GB. My goal was to optimize query performance and reduce the amount of data scanned for cost efficiency. Here’s how the journey unfolded:
The Initial Query
I ran the following query on the unpartitioned table:
SELECT *
FROM foodwagon.orders
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31'
AND restaurant_ratings >= 4 AND restaurant_ratings < 5;
Data scanned: 2.98 GB.
First Optimization: Partitioning
Since the query frequently filtered by order_date, I decided to partition the table on order_date. After partitioning: Data scanned: 1.24 GB.
This was a decent improvement, but I wanted to push the optimization further.
Next Step: Clustering
Given that most of my queries also filtered on restaurant_ratings, clustering seemed like the logical next step. However, restaurant_ratings is a float column with high cardinality, which isn't ideal for clustering—it results in minimal gains.
To address this, I created a derived column rounded_ratings by flooring the restaurant_ratings to the nearest integer:
CREATE TABLE foodwagon.orders_clustered
PARTITION BY DATE_TRUNC(order_date, MONTH)
CLUSTER BY rounded_ratings
AS
SELECT *, CAST(FLOOR(restaurant_ratings) AS INT64) AS rounded_ratings
FROM foodwagon.orders;
This approach reduced the scanned data for my query to 283.03 MB, which is a 77% reduction compared to partitioning alone and a 90% reduction from the original table scan. Huge success!
Recommended by LinkedIn
The Unexpected Outcome
However, things got interesting when I ran a slightly modified query:
SELECT *
FROM foodwagon.orders_clustered
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31';
This query scanned 1.4 GB, which is more than the partitioned table scan (1.24 GB). Upon inspecting the clustered table, I found its size had grown to 3.35 GB—larger than the original table.
Why Did This Happen?
Partitioning vs. Clustering: When to Use What?
Partitioning:
Clustering:
Key Takeaways
By understanding these strategies and tradeoffs, I was able to reduce costs and improve query performance significantly, despite some unexpected challenges.
Excellent write up once again! Quick and crisp learning