The Problem of Granularity
Granularity refers to the level of detail or depth of data stored in a database. In SQL and relational data modeling, managing granularity is essential but also one of the trickiest design decisions developers and data engineers face. Granularity in SQL isn’t just a buzzword. It’s a foundational concept that quietly determines how well your data models scale, how performant your queries are, and whether your results are accurate or misleading.
At first glance, data seems simple (tables, columns, rows), but poor decisions about granularity can lead to confusing queries, bloated storage, inaccurate results, and performance bottlenecks. In this first part of the broader SQL Impedance Mismatch series, we’ll look at what granularity is, why it matters, and how to recognize and solve the problems associated with it.
What Is Granularity?
Granularity defines the smallest unit of data a table or record represents. The more granular your data, the finer the details; the less granular it is, the more aggregated it becomes. High granularity, for example, is a web analytics table that stores one row per page view; low granularity is a summary table that stores one row per day per website. Both have valid use cases, but mixing them up leads to trouble.
Why Granularity Matters
Let’s consider a concrete case.
Case: Sales Data
You’re tracking sales data. You could model it in different granularities:
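Here is a sketch of a transaction-level table; the table and column names are illustrative, not from any particular system:

CREATE TABLE sales (
    sale_id     BIGINT PRIMARY KEY,       -- one row per individual sale
    product_id  INT NOT NULL,
    customer_id INT NOT NULL,
    quantity    INT NOT NULL,
    unit_price  DECIMAL(10, 2) NOT NULL,
    date        DATE NOT NULL             -- the day of the transaction
);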
This is transaction-level granularity, where you can analyze individual purchases. Now imagine storing the same data in summarized form instead.
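A sketch of that summary table (again, names are illustrative, chosen to match the summary_sales query used below):

CREATE TABLE summary_sales (
    date           DATE NOT NULL,
    product_id     INT NOT NULL,
    total_quantity INT NOT NULL,
    total_revenue  DECIMAL(12, 2) NOT NULL,
    PRIMARY KEY (date, product_id)        -- grain: one row per product per day
);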
This is daily product summary granularity. It saves space but loses detail. Want to know who bought what? You can’t.
The Pitfalls of Granularity Mismatch
1. Inconsistent Aggregation
If you join a highly granular table to a less granular one without adjusting for the difference, rows get duplicated and aggregations mislead. The classic symptom is double counting caused by a mismatched join.
-- Cost out daily sales by joining the summary to the product catalog
SELECT
    s.product_id,
    SUM(s.total_quantity * p.unit_cost) AS total_cost
FROM summary_sales s
JOIN product_catalog p
    ON s.product_id = p.product_id
GROUP BY s.product_id;
If product_catalog holds one row per product variant while summary_sales holds one row per product per day, the join fans out: each summary row is matched by every variant, and the quantities are silently counted more than once. This happens whenever the join key is not unique on one side of the join.
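One way to guard against this, sketched below, is to collapse the catalog to a single row per product before joining. The use of AVG(unit_cost) here is just an assumption; how to pick one cost per product is a business decision:

SELECT
    s.product_id,
    SUM(s.total_quantity * p.unit_cost) AS total_cost
FROM summary_sales s
JOIN (
    -- collapse variants to one row per product before the join
    SELECT product_id, AVG(unit_cost) AS unit_cost
    FROM product_catalog
    GROUP BY product_id
) p ON s.product_id = p.product_id
GROUP BY s.product_id;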
2. Over- or Under-Normalization
If you break data down so far that a simple question requires a maze of joins and subqueries, you have outdone yourself and created a new problem: you have over-granularized your application. At the other extreme, if you summarize your tables too early, you lose traceability and the ability to drill down into the data later. That is the opposite problem: your data is too coarse.
When to Choose What Granularity
As a developer, you have to find the middle ground: not too granular, not too coarse. So how do you choose the right level of granularity for your application? Consider a few factors:
What’s the performance budget? Detailed data gives you flexibility, but at the cost of slower queries unless it is indexed and optimized. Aggregated data is fast but rigid; if you need a new slice, it’s back to the ETL (Extract, Transform, Load) pipeline.
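For instance, if you keep the detailed table, a composite index aligned with your most common slice can keep detail-level queries affordable. This is a sketch; the right index columns depend entirely on your actual access patterns:

CREATE INDEX idx_sales_product_date ON sales (product_id, date);

-- A rollup for one product over a date range (example values below)
-- can now seek the index instead of scanning the whole table.
SELECT date, SUM(quantity) AS total_quantity
FROM sales
WHERE product_id = 42
  AND date >= DATE '2024-01-01'
GROUP BY date;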
Solutions to Granularity Problems
One common approach to handling multiple data granularities is to employ a Layered Architecture. Imagine your data moving through a series of steps, like a well-organized assembly line. At the first step sit your raw, high-granularity tables. These are the unadulterated source, capturing every detail, however small: potentially every click on a site, every product scan, every sensor reading. This raw layer is the single source of truth; nothing is thrown away. On top of this foundation you build views or materialized views for reporting and analysis, rolling the raw data up to whatever level of aggregation each consumer needs.
For instance, while the raw table holds one row per product sale, a materialized view might roll sales up by day, by product category, or by month, with pre-computed sums that are faster to query for routine reporting. This split lets analysts work with pre-digested numbers while keeping the option to drop down to full detail when necessary, without hammering the massive raw tables.
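As a sketch (PostgreSQL syntax, reusing the illustrative sales table from earlier), a daily rollup layer might look like this:

CREATE MATERIALIZED VIEW daily_sales AS
SELECT
    product_id,
    date,
    SUM(quantity)              AS total_quantity,
    SUM(quantity * unit_price) AS total_revenue
FROM sales
GROUP BY product_id, date;

-- Refresh on the ETL schedule; routine reports read daily_sales,
-- while the raw sales table stays available for drill-down.
REFRESH MATERIALIZED VIEW daily_sales;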
Dimensional Modeling, the basis of data warehousing, is another paradigm that has strongly shaped how granularity problems are solved. It advocates star or snowflake schemas designed specifically for analytical querying. At the center of the model sits the fact table, which holds events or transactions at a single, fixed granularity. An example is a fact table with one row per sale transaction, where quantity, price, and so on come straight from that specific sale. Surrounding the fact table are dimension tables: lookup tables holding descriptive attributes related to the facts. These dimensions, such as "Product," "Time," "Customer," or "Location," provide context without changing the grain of the fact table itself. This structure means analysts always know what level of granularity the fact table holds, so they can write queries and join data more readily without running into a nasty aggregation surprise.
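A minimal sketch of that structure, with illustrative names and only one dimension shown ("Time," "Customer," and "Location" would follow the same pattern):

-- Dimension: descriptive attributes only, no measures
CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL,
    category     VARCHAR(100)
);

-- Fact table: grain = one row per sale transaction
CREATE TABLE fact_sales (
    sale_id     BIGINT PRIMARY KEY,
    product_key INT NOT NULL REFERENCES dim_product (product_key),
    sale_date   DATE NOT NULL,
    quantity    INT NOT NULL,
    sale_amount DECIMAL(10, 2) NOT NULL
);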
Beyond architectural and design choices, Clear Documentation is a requirement for keeping granularity headaches at bay. It's not enough to design a sound data system; you must also describe its organization. For every table in your schema, explicitly state its grain, that is, exactly what each row represents. Is it one row per customer? One row per transaction per day? One row per sensor reading per minute? Writing this down leaves far fewer openings for misunderstanding. When someone queries the data or merges tables, they can immediately determine the right level of aggregation and avoid errors such as double counting or joining unrelated datasets inappropriately. This kind of proactive communication builds data literacy and coherence across your organization.
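In engines that support it (PostgreSQL, for example), the grain can even live next to the table itself. A sketch, using the earlier illustrative tables:

COMMENT ON TABLE sales IS 'Grain: one row per individual sale transaction.';
COMMENT ON TABLE summary_sales IS 'Grain: one row per product per day.';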
For dynamic aggregation and analysis without permanently changing the underlying data structure, Window Functions in SQL are a very valuable tool. These powerful functions let you compute over a set of table rows related to the current row without losing the current row's context, so you can roll up and drill down within a single query. For example, you can compute a running total of sales, the average sale within a product category, or a region's percentage contribution to total sales, all in one statement. Window functions let analysts examine data at multiple levels of aggregation on demand, without having to build a pile of pre-aggregated tables. They make data exploration smoother and more interactive, bridging the gap between raw detail and higher-level insight.
-- Example: 7-day moving average of quantity per product
SELECT
    product_id,
    date,
    quantity,
    AVG(quantity) OVER (
        PARTITION BY product_id   -- restart the window for each product
        ORDER BY date             -- walk the rows in date order
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- this row plus the six before it
    ) AS avg_last_7_days
FROM sales;
Final Thoughts
Granularity is a foundational concept that quietly shapes every part of your database system, from design to performance to business value. Being deliberate about it can save countless hours of rework and bug hunting down the road. When in doubt, store the most granular form you can afford and aggregate later: it's easier to roll up than to drill down into a black box. Always know what each row represents. That's the key to writing better SQL.
Have you run into granularity issues in your SQL journey? Share your story in the comments below or reach out—I’d love to hear how you solved it.