String validation in C++ is fast, until you have to do it millions of times per second across massive datasets.

While engineering a correctness fix for a silent-truncation bug in Apache Arrow’s base64_decode utility, an automated review flagged a bottleneck: the function was using a linear search (std::string::find) to validate every single byte, scanning up to 64 alphabet characters per input byte. For a 1 MB payload, that meant potentially tens of millions of redundant CPU operations.

Rather than bloating the initial correctness PR, I scoped the performance upgrade into a separate architectural follow-up: I replaced the linear scan with a static 256-entry lookup table (a direct-addressed array). This turned per-byte validation from a linear scan of the alphabet into a single O(1) constant-time array load.

Benchmarks on a 1 MB payload:
🔴 Before (unsafe): ~2832 ms
🟡 Intermediate (strict validation, but linear): ~4302 ms
🟢 Final (strict validation + O(1) lookup): ~1126 ms

Massive thanks to Kouhei Sutou and Dmitry C. for the feedback and for helping me get this optimization across the finish line.

PR link: https://lnkd.in/gjpM5ey9

#Cpp #Apache #ApacheArrow #SystemsEngineering #DataInfrastructure #OpenSource
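In sketch form, the idea looks like this (illustrative only, not the exact Arrow patch; names like kDecodeTable and IsBase64Char are my own):

```cpp
#include <array>
#include <cstdint>
#include <string>

// A 256-entry direct-addressed table maps each byte to its 6-bit base64
// value, with a sentinel (0xFF) marking bytes outside the alphabet.
static const std::array<std::uint8_t, 256> kDecodeTable = [] {
  std::array<std::uint8_t, 256> t{};
  t.fill(0xFF);  // every byte is invalid until proven otherwise
  const std::string alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  for (std::uint8_t i = 0; i < 64; ++i) {
    t[static_cast<unsigned char>(alphabet[i])] = i;
  }
  return t;
}();

// One array load per byte, instead of std::string::find scanning
// up to 64 alphabet characters for every input byte.
inline bool IsBase64Char(unsigned char c) { return kDecodeTable[c] != 0xFF; }
```

The decoder can use the same table for both validation and value extraction, so the validity check costs nothing extra on the happy path.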
1. How can find() be replaced with a table?
2. O(1) is nice, but on modern CPUs some O(1)s can be a whopping 100x+ larger than others. In particular, table-driven access can cause degradations that are difficult to see in naive micro-benchmarking. For example, if such a table is known at compile time, then compile-time codegen may be able to help produce significantly faster (and better-predictable) code.
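Something along these lines, a sketch of the compile-time variant (my own code, not from the PR; assumes C++17):

```cpp
#include <array>
#include <cstdint>

// Building the table in a constexpr function means it lands in the
// binary's read-only data with no runtime initialization, and the
// compiler can constant-fold lookups whose inputs it can see.
constexpr std::array<std::uint8_t, 256> MakeDecodeTable() {
  std::array<std::uint8_t, 256> t{};
  for (int i = 0; i < 256; ++i) t[i] = 0xFF;
  constexpr char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  for (std::uint8_t i = 0; i < 64; ++i) {
    t[static_cast<unsigned char>(kAlphabet[i])] = i;
  }
  return t;
}

inline constexpr auto kDecodeTable = MakeDecodeTable();

// Evaluated entirely at compile time:
static_assert(kDecodeTable['A'] == 0 && kDecodeTable['/'] == 63,
              "table is built at compile time");
```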
Fwiw, the original implementation is pretty bad performance-wise. It's totally fine if you do some base64-ing once in a while, but one should pay more attention than "here's some open-source implementation with a permissive license" if it feels like you'll be doing an awful lot of base64. A very similar LUT PR was submitted more than 5 years ago and didn't go anywhere: https://github.com/ReneNyffenegger/cpp-base64/pull/27
👏👏
I have a solution that would reduce that time further; it can be 2 to 4 times faster than the optimized scalar version. Going from processing ~1 MB in ~1.1 seconds (optimized scalar) to just ~300 milliseconds is not only possible, it's a 3-4x improvement that is completely achievable.