Dear developers, please stop using checksums. For anything. Ever.

In the primitive days of computing we transferred data by magnetic tape or modem. This data was usually not considered to be at risk of attack; the major concern was that the medium might fail and corrupt the data. To check for transmission errors, we used checksums, the best known of which is the cyclic redundancy check (CRC). A simple checksum adds up the bytes of a data stream to produce a sort of total value. A CRC instead computes the remainder of a polynomial division over the bits, which catches more kinds of errors but is just as predictable. When the checksum of the received data did not match the checksum of the original data, there had been an error. A matching checksum, however, does not guarantee there was no error. If enough bits in enough places changed, the resulting checksum could still come out the same, though this is statistically unlikely. A match indicated only that the data had most likely been transferred properly.
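To make the error-detection idea concrete, here is a minimal sketch using the CRC-32 function in Python's standard `zlib` module. The message and the helper `flip_bit` are made up for illustration; the helper simulates a single bit being damaged in transit.

```python
import zlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating line noise."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

original = b"The quick brown fox jumps over the lazy dog"
sent_crc = zlib.crc32(original)       # computed by the sender

received = flip_bit(original, 10)     # one bit damaged in transit
error_detected = zlib.crc32(received) != sent_crc

print(f"sent CRC-32: {sent_crc:08x}")
print(f"error detected: {error_detected}")
```

CRC-32 catches every single-bit error, so the flag is set here. What it was never designed to catch is a change made on purpose.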

Today the world is a different place and your data is almost always under threat of tampering. If someone can modify your data, it is trivial to do so in such a way that a checksum will still match the original value. Because a checksum is a simple, keyless calculation that anyone can reproduce, an attacker can adjust the modification so that the modified data yields the same checksum as the legitimate data. The checksum was devised to detect random errors, not intentional ones.
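A rough sketch of how little effort this takes, assuming the simplest possible checksum (an 8-bit sum of bytes). The functions `additive_checksum` and `forge` and both messages are hypothetical, written only for this illustration.

```python
def additive_checksum(data: bytes) -> int:
    """Sum of all bytes, truncated to 8 bits (a deliberately simple checksum)."""
    return sum(data) % 256

def forge(tampered: bytes, target: int) -> bytes:
    """Append one compensating byte so the checksum equals target."""
    fix = (target - additive_checksum(tampered)) % 256
    return tampered + bytes([fix])

legitimate = b"pay alice 100 dollars"
target = additive_checksum(legitimate)

# The attacker rewrites the message, then pads it to restore the checksum.
forged = forge(b"pay mallory 900 dollars", target)

print(additive_checksum(forged) == target)
```

Forging a CRC takes somewhat more arithmetic than a plain sum, but it likewise requires no secret, only knowledge of the algorithm.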

Because a checksum cannot guarantee data integrity, the cryptographic hash algorithm was invented. A hash works much like a checksum, but it is designed so that the result cannot feasibly be predicted without actually computing it. It is computationally infeasible to create an input data stream that hashes to a specific result. Changing even a single bit in the data stream results in a very different hash value. When the cryptographic hash of your data matches a trusted copy of the original's hash, you can be confident that the data was not modified, either by accident or intent.
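The single-bit sensitivity is easy to observe with SHA-256 from Python's standard `hashlib` module. This is only a sketch; the message and the helper `differing_bits` are invented for the example.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

message = b"pay alice 100 dollars"
altered = bytes([message[0] ^ 0x01]) + message[1:]   # flip one input bit

digest_a = hashlib.sha256(message).digest()
digest_b = hashlib.sha256(altered).digest()

changed = differing_bits(digest_a, digest_b)
print(f"{changed} of {len(digest_a) * 8} digest bits changed")
```

For a well-designed hash, roughly half of the 256 output bits should be expected to change, with no usable pattern connecting the two digests.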

The cryptographic hash algorithm requires more computation than the checksum. If your data is absolutely not under any threat whatsoever and you really only care about error detection, you may have an argument that a checksum is sufficient. Even then, when trying to convince a customer or a certification organization that your code is secure, using a checksum rather than a cryptographic hash raises a red flag. If there is any scenario where your data might be threatened (even a scenario not yet known to you!), using a checksum to verify its integrity will not be seen as credible.

Best practice is to always use a cryptographic hash instead of a checksum, even for simple error detection. It doesn’t matter how fast your code runs if your data can be manipulated. Using an accepted cryptographic hash algorithm (a list that no longer includes MD5 or SHA-1) is the only credible way to show that data has not been altered. It may give you new performance problems to solve, but bad performance has far fewer ramifications than unprotected data.
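In practice the check might look something like the sketch below, which hashes a stream in chunks with SHA-256 and compares the result to a published digest. The names `sha256_hex`, `is_unmodified`, and the sample payload are assumptions for illustration, and the approach assumes the expected digest reached you through a channel the attacker cannot alter.

```python
import hashlib
import hmac
import io
from typing import BinaryIO

def sha256_hex(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    hasher = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()

def is_unmodified(stream: BinaryIO, expected_hex: str) -> bool:
    """Compare the stream's digest to the expected one in constant time."""
    actual = sha256_hex(stream)
    return hmac.compare_digest(actual, expected_hex.lower())

# Stand-in data; in real use the stream would be an opened file and the
# expected digest would come from a trusted source.
payload = b"firmware image bytes"
published = hashlib.sha256(payload).hexdigest()

print(is_unmodified(io.BytesIO(payload), published))
print(is_unmodified(io.BytesIO(payload + b"!"), published))
```

For a real file, the same functions would be called with `open(path, "rb")` in place of `io.BytesIO`.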


