String validation in C++ is fast, until you have to do it millions of times per second across massive datasets.

While engineering a correctness fix for a silent-truncation bug in Apache Arrow’s base64_decode utility, an automated review flagged a bottleneck: the function was using a linear search (std::string::find) to validate every single byte, scanning up to 64 alphabet characters per input byte. For a 1 MB payload, that meant potentially tens of millions of redundant CPU operations.

Rather than bloating the initial correctness PR, I scoped the performance upgrade into a separate architectural follow-up: I replaced the linear scan with a static 256-entry lookup table (a direct-addressed array). This turned per-byte validation from a linear scan of the alphabet into a single O(1) constant-time array load.

Benchmarks on a 1 MB payload:
🔴 Before (unsafe): ~2832 ms
🟡 Intermediate (strict validation, but linear): ~4302 ms
🟢 Final (strict validation + O(1) lookup): ~1126 ms

Massive thanks to Kouhei Sutou and Dmitry C. for the feedback and for helping me get this optimization across the finish line.

PR link: https://lnkd.in/gjpM5ey9

#Cpp #Apache #ApacheArrow #SystemsEngineering #DataInfrastructure #OpenSource
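In sketch form, the idea looks like this (illustrative only, not the exact Arrow patch; names like kDecodeTable and IsBase64Char are my own):

```cpp
#include <array>
#include <cstdint>
#include <string>

// A 256-entry direct-addressed table maps each byte to its 6-bit base64
// value, with a sentinel (0xFF) marking bytes outside the alphabet.
static const std::array<std::uint8_t, 256> kDecodeTable = [] {
  std::array<std::uint8_t, 256> t{};
  t.fill(0xFF);  // every byte is invalid until proven otherwise
  const std::string alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  for (std::uint8_t i = 0; i < 64; ++i) {
    t[static_cast<unsigned char>(alphabet[i])] = i;
  }
  return t;
}();

// One array load per byte, instead of std::string::find scanning
// up to 64 alphabet characters for every input byte.
inline bool IsBase64Char(unsigned char c) { return kDecodeTable[c] != 0xFF; }
```

The decoder can use the same table for both validation and value extraction, so the validity check costs nothing extra on the happy path.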
1. How can find() be replaced with a table?
2. O(1) is nice, but on modern CPUs some O(1)s can be a whopping 100x+ larger than others. In particular, table-driven access can cause degradations that are difficult to see in naive micro-benchmarking. For example, if such a table is known at compile time, then compile-time codegen may be able to help produce significantly faster (and better-predictable) code.
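Something along these lines, a sketch of the compile-time variant (my own code, not from the PR; assumes C++17):

```cpp
#include <array>
#include <cstdint>

// Building the table in a constexpr function means it lands in the
// binary's read-only data with no runtime initialization, and the
// compiler can constant-fold lookups whose inputs it can see.
constexpr std::array<std::uint8_t, 256> MakeDecodeTable() {
  std::array<std::uint8_t, 256> t{};
  for (int i = 0; i < 256; ++i) t[i] = 0xFF;
  constexpr char kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  for (std::uint8_t i = 0; i < 64; ++i) {
    t[static_cast<unsigned char>(kAlphabet[i])] = i;
  }
  return t;
}

inline constexpr auto kDecodeTable = MakeDecodeTable();

// Evaluated entirely at compile time:
static_assert(kDecodeTable['A'] == 0 && kDecodeTable['/'] == 63,
              "table is built at compile time");
```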
Fwiw, the original implementation is pretty bad performance-wise. It's totally fine if you do some base64-ing once in a while, but one should pay more attention than "here's some open-source implementation with a permissive license" if it feels like you'll be doing an awful lot of base64. A very similar LUT PR was submitted more than 5 years ago and didn't go anywhere: https://github.com/ReneNyffenegger/cpp-base64/pull/27
👏👏
I have a solution that would reduce that time further; it can be 2 to 4 times faster than the optimized scalar version. Going from processing ~1 MB in ~1.1 seconds (optimized scalar) to just ~300 milliseconds is not only possible, it's a 3-4x improvement that is completely achievable.