Numbers don't lie. ReactiveChainDB V2 just bypassed the competition. 🚀

By moving our entire consensus and backpressure execution to the GPU and enforcing a strict zero-CPU-fallback policy, we've created a literal "CPU bypass." When you stop context-switching on the CPU and let the GPU batch-process the workloads natively, the performance gains are staggering.

p50: 2 ms (ReactiveChainDB V2) vs. 120 ms (ScyllaDB)
p99: 1,265 ms (ReactiveChainDB V2) vs. 28,873 ms (ScyllaDB)

Even at the 99th percentile, the system remains stable at roughly 1.2 seconds, well ahead of the competition in our test environment. Take a look at the log-scale chart below.

Full article coming soon.

#Coding #Benchmarks #DataEngineering #GPU #ReactiveChainDB #BuildInPublic #OpenSource #SoftwareArchitecture #Java21 #TornadoVM #Innovation #TechMilestone #DatabaseOptimization #Engineering #ScyllaDB #RocksDB #Backend #DatabaseEngineering #SystemDesign #HighPerformanceComputing #GPUComputing #CUDA #namanoncode
ReactiveChainDB V2 Outperforms ScyllaDB in GPU Benchmark
-
In C++, GPU computing evolved drastically in 2020, when NVIDIA announced that its compiler, nvc++, could compile C++17 parallel code to run directly on the GPU without the developer needing extra libraries such as SYCL, Kokkos, etc. Now, nvc++ can also deal...
Modern C++ GPU Computing with std:: Algorithms and CUDA
https://www.youtube.com/
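For readers who haven't watched yet, here is a minimal sketch of the pattern the talk describes (illustrative code, not taken from the video): a plain C++17 std::transform with a parallel execution policy, which nvc++ offloads to the GPU when built with -stdpar=gpu.

// SAXPY via C++17 parallel algorithms; build: nvc++ -stdpar=gpu saxpy.cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 3.0f;

    // With -stdpar=gpu, nvc++ maps this par_unseq transform onto the GPU;
    // no CUDA kernels, SYCL, or Kokkos required.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [=](float xi, float yi) { return a * xi + yi; });

    std::printf("y[0] = %f\n", y[0]);  // 3*1 + 2 = 5
}

The same source also compiles unchanged with g++ -std=c++17 -ltbb for CPU execution, which is exactly the portability the standard-parallelism approach buys.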
-
47% lower p999 latencies. 15% less CPU usage. No new hardware. No architecture changes. Just a smarter compiler. ⚡ This engineering blog breaks down how profile-guided optimization (#PGO) supercharges the Redpanda #Streaming binary in 26.1 — and why it matters for #CPU-intensive workloads. Read the full #benchmark breakdown below. 📊 https://lnkd.in/dAnJ9_hk
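If you haven't used PGO before, the mechanics are simple: build instrumented, run a representative workload, rebuild with the recorded profile. Here is a generic GCC/Clang miniature (an illustrative sketch, not Redpanda's actual build):

// 1) g++ -O2 -fprofile-generate hot.cpp -o hot   (instrumented build)
// 2) ./hot                                        (writes *.gcda profile data)
// 3) g++ -O2 -fprofile-use hot.cpp -o hot         (profile-guided rebuild)
#include <cstdio>

// A branch PGO can exploit: the profile shows `fast` is true ~99% of the
// time, so the compiler lays out the hot path as the fall-through and
// moves the cold multiply out of line.
long step(long acc, long v, bool fast) {
    if (fast) return acc + v;   // hot after -fprofile-use
    return acc * 31 + v;        // cold
}

int main() {
    long acc = 0;
    for (long i = 0; i < 10'000'000; ++i)
        acc = step(acc, i, i % 100 != 0);  // fast branch 99% of the time
    std::printf("%ld\n", acc);
}

The win comes from better code layout, inlining, and branch placement on the paths that actually dominate at runtime, which is why CPU-bound binaries like a streaming broker benefit most.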
-
I asked how threads actually work under the hood… and the answer was more interesting than I expected.

After my last post, I got some really insightful explanations about how threads are scheduled. Here's what finally started to make things click for me:

-> Threads are NOT directly tied to CPU cores. Instead:
• The OS scheduler decides which thread runs
• A thread can run on different cores at different times
• Threads don't "own" a core — they just get time on it

What actually happens: the CPU doesn't run multiple threads magically. It does one of two things:
1. Parallel execution (multiple cores): different threads can run at the same time on different cores.
2. Time slicing (single core or oversubscription): the CPU rapidly switches between threads.

The important part: context switching. When switching between threads, the CPU:
• Saves the current thread state (registers, instruction pointer, etc.)
• Loads another thread's state
• Continues execution

So it's not "running everything at once" — it's switching fast enough to look like it is.

What surprised me: a thread
• can move between cores
• doesn't execute continuously
• depends completely on the OS scheduler

So writing multithreaded code means:
👉 You cannot assume order
👉 You cannot assume timing
👉 You cannot assume which core runs your code

What I realized: multithreading is not just about creating threads. It's about writing code that works no matter how unpredictably those threads are scheduled. (You can watch the migration happen with the sketch below.)

Still exploring this space — next I'm trying to understand how synchronization primitives (mutexes, locks) actually work internally.

#cpp #multithreading #systems #concurrency #learning #os #softwareengineering
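To see the scheduler at work, here is a minimal Linux sketch (assumes glibc's sched_getcpu; illustrative code, not from the explanations I received): the same thread can report different core numbers across time slices.

// Build on Linux: g++ -O2 -pthread migrate.cpp
#include <sched.h>   // sched_getcpu (glibc)
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::thread t([] {
        for (int i = 0; i < 5; ++i) {
            // sched_getcpu() reports the core this thread is on *right now*;
            // successive samples can differ as the scheduler migrates it.
            std::printf("iteration %d on core %d\n", i, sched_getcpu());
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
    });
    t.join();  // wait for the worker; its core may have changed several times
}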
-
Why 2^64 is a Lie: The Physics of Paging

If you examine /proc/self/maps, you'll notice that 64-bit Linux pointers actually cap out at 0x00007fffffffffff. Modern x86_64 systems don't use all 64 bits for addressing: they use 48 bits (4-level paging) or the newer 57 bits (5-level paging) introduced with Intel's Ice Lake.

Why the cap? It's an optimization problem: translation latency vs. addressable space.

The walk cost: translating a virtual address to a physical address requires walking the page-table hierarchy (PML4/PML5 -> PDP -> PD -> PT). 4-level paging (48-bit) requires 4 memory references; 5-level paging (57-bit) requires 5. If we used all 64 bits, we'd need 6 or 7 levels, and every extra level is a potential cache miss that stalls the CPU pipeline.

TLB pressure: the Translation Lookaside Buffer (TLB) is tiny and expensive. Supporting a wider address space increases the complexity of TLB tag matching and power consumption. By limiting the bits, hardware designers keep the TLB fast and hit rates high.

Canonical addressing: to prevent "pointer tagging" that could break future compatibility, x86_64 enforces canonical forms. In a 48-bit space, bits 47 through 63 must be identical. This creates the "hole" in the middle of the address space, effectively splitting it into lower (user) and higher (kernel) canonical halves. (A quick check is sketched below.)

The 5-level jump: Intel introduced PML5 (57-bit) to raise the limit from 256 TB to 128 PB. While this adds a "cycle tax" for the extra table walk, it's a necessary evil for exascale computing.

#LinuxKernel #X86_64 #SystemsProgramming #ComputerArchitecture #MemoryManagement #LowLevel
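The canonical-form rule is easy to verify in code. An illustrative sketch for the 48-bit / 4-level case (user-space heap pointers should land in the lower canonical half):

#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Canonical under 4-level paging: bits 47..63 are all copies of bit 47,
// i.e. the value is the sign extension of a 48-bit address.
bool is_canonical_48(std::uintptr_t a) {
    auto s = static_cast<std::intptr_t>(a);
    return (s >> 47) == 0 || (s >> 47) == -1;  // lower or upper half
}

int main() {
    void* p = std::malloc(64);
    auto a = reinterpret_cast<std::uintptr_t>(p);
    std::printf("heap pointer %#018zx canonical: %s\n",
                static_cast<std::size_t>(a),
                is_canonical_48(a) ? "yes" : "no");  // expect "yes", below
    std::free(p);                                    // 0x00007fffffffffff
}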
-
A context switch does not preserve your program; it preserves the illusion that it never stopped.

A context switch saves a minimal execution snapshot: CPU registers, the stack pointer, and the instruction pointer. This is sufficient because the entire execution state is reducible to register contents and a stack-frame boundary. General-purpose registers capture intermediate computation, the instruction pointer defines the next operation, and the stack pointer anchors the active call chain in memory.

The constraint is brutal: anything not materialized into registers or the current stack frame is lost, which is why compiler optimizations and calling conventions are tightly coupled to context-switch correctness and performance. (A user-space illustration of the snapshot follows below.)

#OperatingSystems #ContextSwitching #CPUArchitecture #SystemsEngineering
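One way to make the snapshot tangible from user space is the POSIX ucontext API (an illustrative sketch of the idea, not how a kernel scheduler is implemented): swapcontext() saves precisely the registers, stack pointer, and instruction pointer, and restoring them later resumes execution as if it never stopped.

// Build on Linux: g++ -O2 switch.cpp (uses glibc's ucontext)
#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];  // the other context's stack

static void task() {
    std::printf("task: running on its own stack\n");
}   // returning follows uc_link back to main's saved context

int main() {
    getcontext(&task_ctx);                    // initialize the snapshot
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;             // where to resume afterwards
    makecontext(&task_ctx, task, 0);

    std::printf("main: switching out\n");
    swapcontext(&main_ctx, &task_ctx);        // save main's registers, load task's
    std::printf("main: resumed, illusion preserved\n");
}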
-
One codebase. Multiple GPU vendors. No rewrites.

That's what SCALE offers: a pure compiler toolchain that takes unmodified CUDA code and compiles it natively for both NVIDIA and AMD architectures. Your team keeps writing CUDA code. The underlying hardware becomes a build flag.
-
128 threads. Zero meaningful speedup. That's not a bug. That's the point. 🧵

This open-source benchmark in C pits serial execution against OpenMP parallelism on a linked list of up to 100 million nodes — and the results are a reality check on parallel computing. No matter the thread count, the speedup curve stays flat. The bottleneck isn't the CPU. It's the memory.

Linked-list nodes live at random heap addresses. Every traversal step is a potential cache miss — a 100–300 cycle stall while the CPU waits for RAM. Sixteen threads hitting that wall simultaneously doesn't halve the wait. It multiplies the traffic. (A miniature of the pattern is sketched after this post.)

What makes this project worth exploring:
→ Cache-unfriendly memory patterns and why they defeat parallelism
→ Amdahl's Law: why even 1% serial work can cap your max speedup
→ Task overhead vs. task granularity trade-offs in OpenMP
→ Memory bandwidth as a shared, exhaustible resource

Built with: C · OpenMP 4.5 · GCC · Linux · Task Reduction

Sometimes the most valuable benchmark is the one that shows you exactly where the ceiling is - and why.

🔗 https://lnkd.in/d4G5pDXu

#C #OpenMP #ParallelComputing #SystemsProgramming #HPC #PerformanceEngineering #OpenSource #SoftwareEngineering #ComputerArchitecture
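Here is a miniature of the access pattern the benchmark stresses (an illustrative sketch, not the linked repo's code): one thread must walk the list serially just to hand out tasks, every hop is a dependent load at a scattered heap address, and the per-node work is too small to amortize task overhead.

// Build: g++ -O2 -fopenmp list_bench.cpp
#include <omp.h>
#include <cstdio>

struct Node { long value; Node* next; };

int main() {
    const long n = 1'000'000;
    Node* head = nullptr;
    for (long i = 0; i < n; ++i)
        head = new Node{i, head};   // separate allocations: scattered addresses

    long sum = 0;
    double t0 = omp_get_wtime();
    #pragma omp parallel
    #pragma omp single
    for (Node* p = head; p; p = p->next) {   // serial pointer chase: each
        #pragma omp task firstprivate(p) shared(sum)   // hop waits on RAM
        {
            #pragma omp atomic
            sum += p->value;   // trivial work per task: overhead dominates
        }
    }
    // the parallel region's closing barrier completes all pending tasks
    std::printf("sum=%ld in %.3fs\n", sum, omp_get_wtime() - t0);
}

Raising OMP_NUM_THREADS barely moves the runtime here, which is the benchmark's whole point: the threads share one memory pipe, not the work.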