come get sum
Mathematics consists in proving the most obvious thing in the least obvious way.
— George Polya
Too Long, Didn't Read (TLDR)
a discovered checksum in the wild
More info
I was able to use Google BigQuery and the maths to find all the Mailchimp automation-email campaigns that have an invalid state, in a, if i may say so myself (and I do), "elegant" way.
The classic automated email campaign (automation) in Mailchimp can have many steps. Each step sends an email, and the steps happen sequentially with varying delays between each step.
The database records the steps of an automation as: campaign_id, position, otherstuff
In an excel sheet it would look like this with the | character as cell walls:
campaign_id | position | name
c1 | 1 | my first email
c1 | 2 | my dearest second
A database table is a lot like a spreadsheet, if you haven't coded against a DB before.
campaign_id represents the unique identifier for the overall group of steps.
position represents which step in the automation it is. The email for position 1 sends, then at some point later, the email for position 2 sends, etc.
name is an example of something else about the email at that position, like the subject, a memorable name for filing purposes, etc.
For instance, an automation with one step has a single record in the database table, and that record has position=1.
For more instance, an automation with two steps has two records in the db table. the first record has position = 1 to represent the first step, and the second record has position = 2 to represent the second step.
due to bugs and such, its possible for users to create duplicate invisible steps that cause strange behavior. for instance, they may have an automation they intend to have just one step, but the db table has two records (rows), both with position = 1. Something like this:
campaign_id | position | name
c1 | 1 | my first email
c1 | 1 | my blurst email
the fun Math part I realized is that a valid automation has the property
sum(position) = (max(position) * (max(position) + 1)) / 2 due to this most excellent property
that is to say, a valid automation with one step also has a sum of its position values of 1.
a valid automation with two steps has a sum of its position values equal to 3, because 1 + 2 = 3
Recommended by LinkedIn
Therefore, a checksum
*therefore* (by "modus tollens" I think) an automation that is in a buggy state due to duplicate or missing steps will not have this property; if the sum of the position values does not equal the maximum position value yadda yadda then the data is certainly invalid. If you noticed that the property can hold on invalid data, good work, I did, too. The property doesn't identify all invalid data, but it must be true for valid data.
Because of the nature of the bug, position values got scrambled and so the property was most excellent for finding scrambled data.
big query can expose this all very quickly and concisely:
SELECT
campaign_id,
sum(position) actual_sum,
(max(position) * (max(position) + 1)) / 2 expected_sum,
actual_sum = expected_sum valid,
-- show the position values all in one cell, like the valid data "1,2,3,4,5" or the invalid data "1,1,2,3"
string_agg(cast(position as string), ', ' order by position) step_list
FROM database_table
group by 1 -- cool group by trick to not have to type the SELECTed terms again, just reference them ordinally
the key is to group the data by the campaign id, then compare the sum of its position values to the expected sum, since the expected sum is simply 1+2+...position.
Lets do some examples of the invalid automations some users have, identified by this query.
Data Defenestration
SELECT string_agg(cast(position as string), ', ') step_list
returns the position values for a campaign, like:
"13, 14, 14, 15, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 15, 15" <-- nonsense
which made my tiny brain unable to easily see where the problems in the data were. Let's tidy it up more!
4tunately, the string_agg function supports the almighty analytic functions of SQL. In particular, we can do more than just gather a list of all the positions assigned to each sql group in the result set; we can apply more functions to the result for each group. In this case, I want the comma separated list of positions listed in order, via `order by position`:
SELECT string_agg(cast(position as string), ', ' order by position) step_list
The `string_agg` analytic function combined with a windowing function was tight here for showing the `position` values as comma separated list, for each invalid automation.
actual_sum | expected_sum | step_list
164 | 120 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, 15, 15, 15
^ now its easier to read that this campaign has multiple emails at position 15!
135 | 120.0 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15 <-- same
133,120.0,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15" <-- multiples at position 13
184 | 120.0 | 1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 10, 10, 10, 11, 12, 13, 14, 14, 15, 15 <-- multiple multiples
Its trivial from here, barring any skews in the data due to the normal variances in the world, such as deleted campaigns, to further query the database to find all automations with an invalid state, and compare that to the number of automations that have a valid state, to get a sense of how often this bug happens (the ratio of automations with invalid steps to total number of automations). Also easy to find each user affected by the bug, and whatever else.
if you'd like to hack on big query or talk about math, I do too.