Kirill Valitskii’s Post

Most C++ performance advice sounds like this: “Use -O3.” “Avoid virtual calls.” “Use lock-free.” Cool. And also… mostly useless without context.

The biggest latency win I’ve seen in production came from replacing a mutex-protected queue with a lock-free ring buffer between two hot threads (market data ingest -> strategy engine). Not because lock-free is “faster” by default, but because tail latency stopped exploding under contention. With mutexes, the average looked fine, but p99 was chaos when both threads fought over the same lock.

With a bounded lock-free SPSC queue:
• No kernel blocking path
• No lock convoying
• Predictable memory footprint
• Stable p99/p999

That’s the part people miss: lock-free is often a latency-consistency tool, not just a throughput trick.

A few practical rules that actually matter:

🧠 Pick the right structure for the traffic pattern
• SPSC queue for one producer/one consumer
• MPMC only when you truly need it (it’s much harder to get right)

⚙️ Be explicit with memory ordering
• Start with seq_cst for correctness
• Relax to acquire/release only after measurement + proof
• If you can’t explain your ordering on a whiteboard, it’s not ready

🧱 Fight false sharing
• Pad hot atomics to cache line size (alignas(64))
• Keep producer and consumer counters on different lines

📏 Benchmark like an adult
• Track p50/p95/p99, not just the average
• Pin threads, warm caches, test under realistic contention
• Count retries/spins - “lock-free” can still burn CPU badly

And yes, sometimes a plain mutex is still the right answer. If contention is low and code clarity matters more, a mutex wins on engineering economics.

#cpp #performance #lowlatency #lockfree #systemsprogramming #hft
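The rules above (bounded SPSC, explicit ordering, alignas(64) padding) can be sketched in one small structure. This is a hypothetical minimal version, not the author’s production code: producer and consumer indices sit on separate cache lines, the producer publishes with a release store, and the consumer observes with an acquire load.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal bounded SPSC ring buffer sketch. Capacity must be a power of two.
// head_ is written only by the producer, tail_ only by the consumer; each
// lives on its own cache line (alignas(64)) to avoid false sharing.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
public:
    // Producer thread only.
    bool try_push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire); // see consumer's frees
        if (head - tail == Capacity) return false;                      // full
        buf_[head & (Capacity - 1)] = v;
        head_.store(head + 1, std::memory_order_release);               // publish the slot
        return true;
    }
    // Consumer thread only.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire); // see producer's writes
        if (head == tail) return std::nullopt;                          // empty
        T v = buf_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);               // free the slot
        return v;
    }
private:
    alignas(64) std::atomic<std::size_t> head_{0}; // producer-owned counter
    alignas(64) std::atomic<std::size_t> tail_{0}; // consumer-owned counter
    alignas(64) T buf_[Capacity];
};
```

Note the indices grow monotonically and are masked on access, so “full” is just `head - tail == Capacity` with well-defined unsigned wraparound - no separate occupancy counter, which is exactly what keeps the two threads from sharing a hot atomic.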


Kirill Valitskii: SPSC queues are textbook toys that fit a narrow set of use cases. Please develop a lock-free wait-free MPMC queue. 😊

I sincerely hope nobody takes that seriously.

"Avoid virtual calls" is only really good advice in the inner loop of a hot path, and/or when a profiler tells you that vtable-based dispatch is impacting performance. Avoiding blocking in a high-priority task (whether or not you have an RT scheduler) is always a good idea, whether that means taking a mutex (even one guarding a "short" critical section) or something as silly as a debug printf or log message in the hot path (any I/O can block!)
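For the inner-loop case the comment describes, the usual fix is to move the dispatch to compile time so the call can inline. A hedged sketch (the handler names here are invented for illustration) comparing a virtual interface with a CRTP equivalent:

```cpp
#include <cstdint>

// Runtime dispatch: each on_tick goes through the vtable and generally
// cannot be inlined when called via the base pointer.
struct ITickHandler {
    virtual ~ITickHandler() = default;
    virtual void on_tick(std::int64_t px) = 0;
};

// Compile-time (CRTP) dispatch: the call resolves statically, so the
// compiler is free to inline it into the loop body.
template <typename Derived>
struct TickHandlerCrtp {
    void on_tick(std::int64_t px) { static_cast<Derived*>(this)->on_tick_impl(px); }
};

struct SumHandler : TickHandlerCrtp<SumHandler> {
    std::int64_t sum = 0;
    void on_tick_impl(std::int64_t px) { sum += px; }
};

// Hot loop: H is known at compile time, so every on_tick is a direct,
// inlinable call - no vtable load per iteration.
template <typename H>
void run_hot_loop(H& h, int n) {
    for (int i = 0; i < n; ++i) h.on_tick(i);
}
```

Outside the hot loop, the virtual version is usually the better engineering choice - which is the commenter’s point: devirtualize only where a profiler says it matters.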

From my point of view, most performance issues come from poor application of OOP principles, which can cause a lot of cache misses. If you don't take into account the structure of the data you process and how you access it in memory, then everything else is of far less importance.
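The classic illustration of this data-layout point is summing one hot field over many records. A minimal sketch (the `OrderAoS` field names are made up for the example): the array-of-structs loop drags cold payload bytes through the cache on every record, while the struct-of-arrays loop streams only the field it needs.

```cpp
#include <vector>

// Array-of-structs: fields interleaved, so reading price alone still
// pulls the cold tag[] bytes into cache line by line.
struct OrderAoS {
    double price;
    double qty;
    char   tag[48]; // cold payload, 64 bytes per record total
};

double sum_price_aos(const std::vector<OrderAoS>& v) {
    double s = 0;
    for (const auto& o : v) s += o.price; // ~8 useful bytes per 64 fetched
    return s;
}

// Struct-of-arrays: the hot field is contiguous and prefetch-friendly.
struct OrdersSoA {
    std::vector<double> price;
    std::vector<double> qty;
};

double sum_price_soa(const OrdersSoA& v) {
    double s = 0;
    for (double p : v.price) s += p; // every fetched byte is used
    return s;
}
```

Both functions compute the same sum; the difference shows up only in cache traffic, which is exactly why it never appears in the code and only in the profiler.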

I feel like, in C++, the benchmarks here are all about ~1000x too slow. In my experience (I recently wrote one), lock-free SPSC ring buffer implementations should average on the order of a few nanoseconds per operation on a pretty average CPU. Unless I'm missing something and these numbers are per 1k read-writes?
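At the few-nanoseconds scale the commenter describes, clock overhead dominates any single-operation measurement, so the usual approach is to time a large batch and divide. A hedged harness sketch, where `work()` is a placeholder standing in for one queue push/pop pair:

```cpp
#include <chrono>

// Placeholder for the operation under test (e.g. one SPSC push/pop pair).
inline long long work(long long x) { return x * 2654435761LL + 1; }

// Time n back-to-back operations and report mean ns/op. A single
// steady_clock::now() costs tens of ns, so per-op timing would be noise.
double ns_per_op(int n) {
    volatile long long sink = 0;          // defeats dead-code elimination
    auto t0 = std::chrono::steady_clock::now();
    long long acc = 1;
    for (int i = 0; i < n; ++i) acc = work(acc);
    auto t1 = std::chrono::steady_clock::now();
    sink = acc;
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}
```

Note this yields only the mean; the p99/p999 numbers the post cares about need per-sample timestamps (or an HDR histogram), which is where batching and tail measurement pull in opposite directions.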


Where can I see the code for the benchmark?

A fair mutex… wouldn’t that also work? (By default they’re not fair)
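For context on what “fair” means here: std::mutex makes no fairness guarantee, whereas a ticket lock serves threads strictly in arrival order. A minimal sketch of the idea (not a production lock - real implementations spin with a pause instruction and bound the wait):

```cpp
#include <atomic>
#include <thread>

// FIFO "ticket" spinlock: each thread takes a ticket, then waits until
// the serving counter reaches it. Fairness caps worst-case wait, but
// under heavy contention everyone queues behind the slowest holder,
// which is why fair locks can hurt throughput and tail latency both.
class TicketLock {
public:
    void lock() {
        const unsigned my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my)
            std::this_thread::yield(); // real code might spin/pause instead
    }
    void unlock() {
        serving_.fetch_add(1, std::memory_order_release); // admit next ticket
    }
private:
    alignas(64) std::atomic<unsigned> next_{0};    // tickets handed out
    alignas(64) std::atomic<unsigned> serving_{0}; // ticket now being served
};
```

So fairness would fix the starvation aspect of the p99 chaos, but it still leaves the kernel-blocking and convoying costs the post attributes to mutexes in the first place.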


