When Partitioning and Clustering Go Wrong: Lessons from Optimizing Queries in BigQuery

Ayushi Gupta

Published Jan 24, 2025

Recently, I worked with a table in BigQuery called orders, sized at 2.98 GB. My goal was to optimize query performance and reduce the amount of data scanned for cost efficiency. Here’s how the journey unfolded:

The Initial Query

I ran the following query on the unpartitioned table:

SELECT * 
FROM foodwagon.orders 
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31' 
AND restaurant_ratings >= 4 AND restaurant_ratings < 5;

Data scanned: 2.98 GB.

First Optimization: Partitioning

Since the query frequently filtered by order_date, I decided to partition the table on order_date. After partitioning: Data scanned: 1.24 GB.

This was a decent improvement, but I wanted to push the optimization further.

Next Step: Clustering

Given that most of my queries also filtered on restaurant_ratings, clustering seemed like the logical next step. However, restaurant_ratings is a float column with high cardinality, which isn't ideal for clustering—it results in minimal gains.

To address this, I created a derived column rounded_ratings by flooring the restaurant_ratings to the nearest integer:

CREATE TABLE foodwagon.orders_clustered 
PARTITION BY DATE_TRUNC(order_date, MONTH) 
CLUSTER BY rounded_ratings 
AS 
SELECT *, CAST(FLOOR(restaurant_ratings) AS INT64) AS rounded_ratings 
FROM foodwagon.orders;

This approach reduced the scanned data for my query to 283.03 MB, which is a 77% reduction compared to partitioning alone and a 90% reduction from the original table scan. Huge success!

Recommended by LinkedIn

Benefits of Clustering in Snowflake

Minzhen Yang 5 years ago

Data Partitioning and Clustering for Performance…

Kumar Preeti Lata 1 year ago

The Rise of Graph Databases in Big Data Analytics

Ali Hussain Awan 2 years ago

The Unexpected Outcome

However, things got interesting when I ran a slightly modified query:

SELECT * 
FROM foodwagon.orders_clustered 
WHERE order_date BETWEEN '2020-01-01' AND '2022-01-31';

This query scanned 1.4 GB, which is more than the partitioned table scan (1.24 GB). Upon inspecting the clustered table, I found its size had grown to 3.35 GB—larger than the original table.

Why Did This Happen?

Clustering Increases Storage Size: Clustering creates metadata to optimize how data is stored and queried. This metadata adds overhead, especially when combined with partitioning, which can lead to a larger table size.
Inefficient Clustering for Non-Filtered Queries: Clustering only helps queries that filter on clustered fields. In this case, rounded_ratings wasn’t part of the query, so BigQuery couldn’t leverage the clustering metadata, resulting in a larger data scan.
Cluster Size Impacts Query Efficiency: Clustering benefits depend on how well the clustered field aligns with query patterns. High cardinality or sparsely filtered clusters can offset performance gains, especially for large datasets.

Partitioning vs. Clustering: When to Use What?

Partitioning:

Use for low-cardinality fields like dates or categories.
Significant benefits when queries heavily filter on the partitioned column.
Adds minimal overhead to storage.

Clustering:

Works well for fields frequently used in filters or joins, especially with moderate cardinality.
Avoid clustering on high-cardinality fields (e.g., floats) without transforming the data.
Be cautious when combining clustering with partitioning; it can lead to larger table sizes if not used wisely.

Key Takeaways

Understand Query Patterns: Before deciding on partitioning or clustering, analyze query requirements. In my case, rounded_ratings was a good clustering field because most queries filtered on it.
Quantify Benefits: Partitioning reduced the scanned data by 58%, and clustering brought it down by an additional 77%. However, the tradeoff was increased table size and complexity.
Balance Partitioning and Clustering: Combining both can be powerful but may backfire if clusters don’t align with query patterns.

By understanding these strategies and tradeoffs, I was able to reduce costs and improve query performance significantly, despite some unexpected challenges.

Amit Choudhary 1y

Excellent write up once again! Quick and crisp learning

1 Reaction

To view or add a comment, sign in

When Partitioning and Clustering Go Wrong: Lessons from Optimizing Queries in BigQuery

Ayushi Gupta

Recommended by LinkedIn

The Unexpected Outcome

Why Did This Happen?

Partitioning vs. Clustering: When to Use What?

Partitioning:

Clustering:

Key Takeaways

More articles by Ayushi Gupta

Others also viewed

Google BigQuery: Beyond Data Warehousing Part 2 - AI-augmented Data Preparation

Veterans Are a Bit Trickier to Maintain. Even in Vertica. On History in Your Data and Slowly Changing Dimensions

Data Ecosystem for Real-Time Analytics - Part 2

Parquet- The preferred file format

Ensuring Data Consistency in BigQuery: How to Set Up a Dynamic and Automated Table Comparison

Top 5 Fastest Databases for OLAP in 2025 🚀

Delta Table Layout Options: Partitioning vs. Liquid Clustering

Data Clustering in Modern Data Platforms

Unlocking Snowflake Performance: A Deep Dive into Micro-Partitions, Clustering, and Search Optimisation

Optimizing BigQuery

Explore content categories

Recommended by LinkedIn

The Unexpected Outcome

Why Did This Happen?

Partitioning vs. Clustering: When to Use What?

Partitioning:

Clustering:

Key Takeaways

More articles by Ayushi Gupta

Lazy Evaluation in Spark: Lessons from a Delta Table Delete-Update Ordeal

Troubleshooting Pip Installation Issues on Dataproc with Internal IP Only

Optimizing Query Performance in BigQuery

Others also viewed

Google BigQuery: Beyond Data Warehousing Part 2 - AI-augmented Data Preparation

Veterans Are a Bit Trickier to Maintain. Even in Vertica. On History in Your Data and Slowly Changing Dimensions

Data Ecosystem for Real-Time Analytics - Part 2

Parquet- The preferred file format

Ensuring Data Consistency in BigQuery: How to Set Up a Dynamic and Automated Table Comparison

Top 5 Fastest Databases for OLAP in 2025 🚀

Delta Table Layout Options: Partitioning vs. Liquid Clustering

Data Clustering in Modern Data Platforms

Unlocking Snowflake Performance: A Deep Dive into Micro-Partitions, Clustering, and Search Optimisation

Optimizing BigQuery

Similar topics

How to Reduce Bigquery Costs

Explore content categories