Kirill Valitskii’s Post

Most C++ performance advice sounds like this: “Use -O3.” “Avoid virtual calls.” “Use lock-free.” Cool. And also… mostly useless without context.

The biggest latency win I’ve seen in production came from replacing a mutex-protected queue with a lock-free ring buffer between two hot threads (market data ingest -> strategy engine). Not because lock-free is “faster” by default, but because tail latency stopped exploding under contention. With mutexes, the average looked fine, but p99 was chaos when both threads fought over the same lock.

With a bounded lock-free SPSC queue:
• No kernel blocking path
• No lock convoying
• Predictable memory footprint
• Stable p99/p999

That’s the part people miss: lock-free is often a latency-consistency tool, not just a throughput trick.

A few practical rules that actually matter:

🧠 Pick the right structure for the traffic pattern
• SPSC queue for one producer/one consumer
• MPMC only when you truly need it (it’s much harder to get right)

⚙️ Be explicit with memory ordering
• Start with seq_cst for correctness
• Relax to acquire/release only after measurement + proof
• If you can’t explain your ordering on a whiteboard, it’s not ready

🧱 Fight false sharing
• Pad hot atomics to cache line size (alignas(64))
• Keep producer and consumer counters on different lines

📏 Benchmark like an adult
• Track p50/p95/p99, not just the average
• Pin threads, warm caches, test under realistic contention
• Count retries/spins - “lock-free” can still burn CPU badly

And yes, sometimes a plain mutex is still the right answer. If contention is low and code clarity matters more, a mutex wins on engineering economics.

#cpp #performance #lowlatency #lockfree #systemsprogramming #hft
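The rules above (bounded SPSC, explicit ordering, alignas(64) padding) can be sketched in one small structure. This is a hypothetical minimal version, not the author’s production code: producer and consumer indices sit on separate cache lines, the producer publishes with a release store, and the consumer observes with an acquire load.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal bounded SPSC ring buffer sketch. Capacity must be a power of two.
// head_ is written only by the producer, tail_ only by the consumer; each
// lives on its own cache line (alignas(64)) to avoid false sharing.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
public:
    // Producer thread only.
    bool try_push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire); // see consumer's frees
        if (head - tail == Capacity) return false;                      // full
        buf_[head & (Capacity - 1)] = v;
        head_.store(head + 1, std::memory_order_release);               // publish the slot
        return true;
    }
    // Consumer thread only.
    std::optional<T> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire); // see producer's writes
        if (head == tail) return std::nullopt;                          // empty
        T v = buf_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);               // free the slot
        return v;
    }
private:
    alignas(64) std::atomic<std::size_t> head_{0}; // producer-owned counter
    alignas(64) std::atomic<std::size_t> tail_{0}; // consumer-owned counter
    alignas(64) T buf_[Capacity];
};
```

Note the indices grow monotonically and are masked on access, so “full” is just `head - tail == Capacity` with well-defined unsigned wraparound - no separate occupancy counter, which is exactly what keeps the two threads from sharing a hot atomic.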


Kirill Valitskii: SPSC queues are textbook toys that fit a narrow set of use cases. Please develop a lock-free wait-free MPMC queue. 😊

I sincerely hope nobody takes that seriously.

"Avoid virtual calls" is only really good advice in the inner loop of a hot path, and/or when a profiler tells you that vtable-based dispatch is impacting performance. Avoiding blocking in a high-priority task (whether or not you have an RT scheduler) is always a good idea, whether that means taking a mutex (even one guarding a "short" critical section) or something as silly as a debug printf or log message in the hot path (any I/O can block!)
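For the inner-loop case the comment describes, the usual fix is to move the dispatch to compile time so the call can inline. A hedged sketch (the handler names here are invented for illustration) comparing a virtual interface with a CRTP equivalent:

```cpp
#include <cstdint>

// Runtime dispatch: each on_tick goes through the vtable and generally
// cannot be inlined when called via the base pointer.
struct ITickHandler {
    virtual ~ITickHandler() = default;
    virtual void on_tick(std::int64_t px) = 0;
};

// Compile-time (CRTP) dispatch: the call resolves statically, so the
// compiler is free to inline it into the loop body.
template <typename Derived>
struct TickHandlerCrtp {
    void on_tick(std::int64_t px) { static_cast<Derived*>(this)->on_tick_impl(px); }
};

struct SumHandler : TickHandlerCrtp<SumHandler> {
    std::int64_t sum = 0;
    void on_tick_impl(std::int64_t px) { sum += px; }
};

// Hot loop: H is known at compile time, so every on_tick is a direct,
// inlinable call - no vtable load per iteration.
template <typename H>
void run_hot_loop(H& h, int n) {
    for (int i = 0; i < n; ++i) h.on_tick(i);
}
```

Outside the hot loop, the virtual version is usually the better engineering choice - which is the commenter’s point: devirtualize only where a profiler says it matters.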

From my point of view, most performance issues come from poor application of OOP principles, which can cause a lot of cache misses. If you don't take into account the structure of the data you process and how you access it in memory, then everything else is of far less importance.
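The classic illustration of this data-layout point is summing one hot field over many records. A minimal sketch (the `OrderAoS` field names are made up for the example): the array-of-structs loop drags cold payload bytes through the cache on every record, while the struct-of-arrays loop streams only the field it needs.

```cpp
#include <vector>

// Array-of-structs: fields interleaved, so reading price alone still
// pulls the cold tag[] bytes into cache line by line.
struct OrderAoS {
    double price;
    double qty;
    char   tag[48]; // cold payload, 64 bytes per record total
};

double sum_price_aos(const std::vector<OrderAoS>& v) {
    double s = 0;
    for (const auto& o : v) s += o.price; // ~8 useful bytes per 64 fetched
    return s;
}

// Struct-of-arrays: the hot field is contiguous and prefetch-friendly.
struct OrdersSoA {
    std::vector<double> price;
    std::vector<double> qty;
};

double sum_price_soa(const OrdersSoA& v) {
    double s = 0;
    for (double p : v.price) s += p; // every fetched byte is used
    return s;
}
```

Both functions compute the same sum; the difference shows up only in cache traffic, which is exactly why it never appears in the code and only in the profiler.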

I feel like, in C++, the benchmarks here are all about ~1000x too slow. In my experience (I recently wrote one), lock-free SPSC ring buffer implementations should average on the order of a few nanoseconds per operation on a pretty average CPU. Unless I'm missing something and these numbers are per 1k read-writes?
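At the few-nanoseconds scale the commenter describes, clock overhead dominates any single-operation measurement, so the usual approach is to time a large batch and divide. A hedged harness sketch, where `work()` is a placeholder standing in for one queue push/pop pair:

```cpp
#include <chrono>

// Placeholder for the operation under test (e.g. one SPSC push/pop pair).
inline long long work(long long x) { return x * 2654435761LL + 1; }

// Time n back-to-back operations and report mean ns/op. A single
// steady_clock::now() costs tens of ns, so per-op timing would be noise.
double ns_per_op(int n) {
    volatile long long sink = 0;          // defeats dead-code elimination
    auto t0 = std::chrono::steady_clock::now();
    long long acc = 1;
    for (int i = 0; i < n; ++i) acc = work(acc);
    auto t1 = std::chrono::steady_clock::now();
    sink = acc;
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}
```

Note this yields only the mean; the p99/p999 numbers the post cares about need per-sample timestamps (or an HDR histogram), which is where batching and tail measurement pull in opposite directions.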


Where can I see the code for the benchmark?

A fair mutex… wouldn’t that also work? (By default they’re not fair)
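For context on what “fair” means here: std::mutex makes no fairness guarantee, whereas a ticket lock serves threads strictly in arrival order. A minimal sketch of the idea (not a production lock - real implementations spin with a pause instruction and bound the wait):

```cpp
#include <atomic>
#include <thread>

// FIFO "ticket" spinlock: each thread takes a ticket, then waits until
// the serving counter reaches it. Fairness caps worst-case wait, but
// under heavy contention everyone queues behind the slowest holder,
// which is why fair locks can hurt throughput and tail latency both.
class TicketLock {
public:
    void lock() {
        const unsigned my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my)
            std::this_thread::yield(); // real code might spin/pause instead
    }
    void unlock() {
        serving_.fetch_add(1, std::memory_order_release); // admit next ticket
    }
private:
    alignas(64) std::atomic<unsigned> next_{0};    // tickets handed out
    alignas(64) std::atomic<unsigned> serving_{0}; // ticket now being served
};
```

So fairness would fix the starvation aspect of the p99 chaos, but it still leaves the kernel-blocking and convoying costs the post attributes to mutexes in the first place.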


