come get sum

Mathematics consists in proving the most obvious thing in the least obvious way.
— George Polya


Too Long, Didn't Read (TLDR)

a discovered checksum in the wild

More info

I was able to use Google BigQuery and the maths to find all the Mailchimp automation-email campaigns that have an invalid state, in a, if i may say so myself (and I do), "elegant" way.

The classic automated email campaign (automation) in Mailchimp can have many steps. Each step sends an email, and the steps happen sequentially with varying delays between each step.

The database records the steps of an automation as: campaign_id, position, otherstuff

In an excel sheet it would look like this with the | character as cell walls:

campaign_id | position | name

c1 | 1 | my first email

c1 | 2 | my dearest second

A database table is a lot like a spreadsheet, if you haven't coded against a DB before.

campaign_id represents the unique identifier for the overall group of steps.

position represents which step in the automation it is. The email for position 1 sends, then at some point later, the email for position 2 sends, etc.

name is an example of something else about the email at that position, like the subject, a memorable name for filing purposes, etc.

For instance, an automation with one step has a single record in the database table, and that record has position=1.

For more instance, an automation with two steps has two records in the db table. the first record has position = 1 to represent the first step, and the second record has position = 2 to represent the second step.

due to bugs and such, its possible for users to create duplicate invisible steps that cause strange behavior. for instance, they may have an automation they intend to have just one step, but the db table has two records (rows), both with position = 1. Something like this:

campaign_id | position | name

c1 | 1 | my first email

c1 | 1 | my blurst email


the fun Math part I realized is that a valid automation has the property

sum(position) = (max(position) * (max(position) + 1)) / 2 due to this most excellent property


that is to say, a valid automation with one step also has a sum of its position values of 1.

a valid automation with two steps has a sum of its position values equal to 3, because 1 + 2 = 3


Therefore, a checksum

*therefore* (by "modus tollens" I think) an automation that is in a buggy state due to duplicate or missing steps will not have this property; if the sum of the position values does not equal the maximum position value yadda yadda then the data is certainly invalid. If you noticed that the property can hold on invalid data, good work, I did, too. The property doesn't identify all invalid data, but it must be true for valid data.

Because of the nature of the bug, position values got scrambled and so the property was most excellent for finding scrambled data.


big query can expose this all very quickly and concisely:

SELECT 

campaign_id,

sum(position) actual_sum, 

(max(position) * (max(position) + 1)) / 2 expected_sum, 

actual_sum = expected_sum valid,

-- show the position values all in one cell, like the valid data "1,2,3,4,5" or the invalid data "1,1,2,3"

string_agg(cast(position as string), ', ' order by position) step_list

FROM database_table

group by 1 -- cool group by trick to not have to type the SELECTed terms again, just reference them ordinally

the key is to group the data by the campaign id, then compare the sum of its position values to the expected sum, since the expected sum is simply 1+2+...position.

Lets do some examples of the invalid automations some users have, identified by this query. 

Data Defenestration

SELECT string_agg(cast(position as string), ', ') step_list

returns the position values for a campaign, like:

"13, 14, 14, 15, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 15, 15" <-- nonsense

 which made my tiny brain unable to easily see where the problems in the data were. Let's tidy it up more!

4tunately, the string_agg function supports the almighty analytic functions of SQL. In particular, we can do more than just gather a list of all the positions assigned to each sql group in the result set; we can apply more functions to the result for each group. In this case, I want the comma separated list of positions listed in order, via `order by position`:

SELECT string_agg(cast(position as string), ', ' order by position) step_list 

The `string_agg` analytic function combined with a windowing function was tight here for showing the `position` values as comma separated list, for each invalid automation.

actual_sum | expected_sum | step_list

164 | 120 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, 15, 15, 15

^ now its easier to read that this campaign has multiple emails at position 15!

135 | 120.0 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15 <-- same

133,120.0,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15" <-- multiples at position 13

184 | 120.0 | 1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 10, 10, 10, 11, 12, 13, 14, 14, 15, 15 <-- multiple multiples

Its trivial from here, barring any skews in the data due to the normal variances in the world, such as deleted campaigns, to further query the database to find all automations with an invalid state, and compare that to the number of automations that have a valid state, to get a sense of how often this bug happens (the ratio of automations with invalid steps to total number of automations). Also easy to find each user affected by the bug, and whatever else.

if you'd like to hack on big query or talk about math, I do too.

To view or add a comment, sign in

More articles by David Mann

  • MS in CS: amortization and off-by-ones

    Hi Remember games like hide-and-seek where the seeker counts out loud for some number of seconds? The way I learned to…

    2 Comments
  • The python, the egg, and the back-tick

    This seems like a fun place to start posting my musings on software. One thing I like to show people who are learning…

Others also viewed

Explore content categories