The Datelist Int - An Efficient Data Structure for User Growth
Zach Wilson recently brought up the Datelist data structure in a list of mind-blowing technical concepts, and I totally agree. Datelists are a great addition to a Data Engineer’s toolkit, particularly those that work on user growth, but there is a dearth of online literature explaining what they are or how they are used. So allow me to nerd out for a bit and share why I find Datelists to be so useful.
First, some disambiguation. There are two “Datelist” concepts that are closely related but can be implemented independently.
Datelist Tables
A Datelist Table is a class of datasets, like a fact table, dimension table, or a cube. A Datelist table contains a cumulative list of users (one row per user who has ever been active), and for each user the Datelist stores a first active date, a last active date, and a history of past active dates. The active date history can be comprehensive or not (could be a rolling 90-day history, for example) and can be stored in a variety of formats, such as an array of dates, a map of dates-to-metric values (e.g. number of sessions), or as a Datelist Int. A basic schema for a Datelist to track active users might look like this:
Thus, a fact table that contains one row per active user each day like this:
could be transformed into active_user_datelist like this:
With a complete or near-complete history of user visits, a Datelist Table can be used to derive key growth metrics like daily/weekly/monthly active users, retention rates, and L-ness (L7/L28), all from a single date partition. Datelists are populated incrementally: the current date partition is generated by a full outer join of the current partition of a daily user activity table with the previous partition of the Datelist table. There are variations and extensions of this core design (additional granularity keys like product feature or interface, additional metrics like number of sessions or time spent) that can make a Datelist table even more powerful and flexible.
Datelist Ints
A Datelist Int is a data structure that encodes multiple days of user activity in a single integer value (usually a BIGINT). Each bit in the integer represents a calendar date for a rolling history, with the smallest bit representing the current date. If the user was active on a particular date, that bit is set to 1, otherwise it is set to 0.
For example, in our sample active_user_datelist data above, user 123 was active on the current date (2022-02-13), 1 day ago (2022-02-12), and 3 days ago (2022-02-10). Each of these days is translated into a flipped bit, and the bit values are summed to generate the datelist int.
Using this approach, we can update our active_user_datelist table to replace the active_date_list field (stored as an ARRAY<VARCHAR>) with an active_datelist_int (stored as a BIGINT). Because we are using a signed 64-bit integer we will be limited to 63 days of history, but the storage footprint will be much smaller and extracting useful information will be much more efficient. When we update the active_datelist_int we need to shift the previous value left one bit (adding extra handling to avoid signed integer overflow) and then add 1 if the user is active on the current date.
Recommended by LinkedIn
Why go through all this trouble? Once we have our user activity encoded in a Datelist Int, we can perform several useful transformations with extremely efficient bitwise operations. Here are some examples of queries against a datelist_int using Presto’s bitwise functions:
Users who were active 5 days ago:
L7 (number of active days out of the last 7 days) for each user:
Number of users who churned this week (were active last week, became inactive this week):
The logic for the above queries is quite susceptible to off-by-one errors and other snags. For that reason datelist_ints are much more practically useful when used in conjunction with frameworks or a set of custom UDF’s to simplify and standardize the transformation logic. For example, you might create a UDF that takes as inputs the first_active_date and the datelist_int for a user along with the current date to generate the Growth Accounting status for each user.
So that’s what Datelists are in a nutshell. The Datelist Table is an indispensable class of datasets that immediately add value to both Data Scientists (for efficient user analysis) and Data Engineers (for streamlined data models). The Datelist Int is an efficient, flexible data structure to track a rolling history of user activity, and is especially powerful when used in a Datelist Table and coupled with tooling to standardize common transformations. To summarize the tradeoffs for the Datelist Int:
Pros:
Cons:
This is my first stab at long-form writing on LinkedIn, if you found this content useful and would like to see more nerdy Data Engineering discussions please let me know :)
`Working example in PostgresSql` https://github.com/ankurbhambri/SQL-Learn/blob/main/datelist.sql
Hats off to Cristian Figueroa , who I understand had the original idea for the Datelist Int (at FB)
Thank you very much for the write-up Max; I can't wait to see what's next!
Thanks for this write-up Max Sung! I am quite captivated by this since Zach Wilson brought it up as well.