C++ and Benchmarking
Benchmarks running on OrangePi5

It may sound extreme, but I believe benchmarking, like testing, deserves first-class treatment in a project. There are usually performance criteria around a C++ codebase, and as time passes and people come and go, one developer may prefer one approach while the next chooses something different. It is good to keep a history of these attempts somewhere so developers can refer back and compare results. At the very least, planning for benchmarking is a good idea. With that said, I would like to introduce the Google micro-benchmarking framework through a useful example.

A while ago, I started looking for the most performant approach to pass data between multiple producers and multiple consumers. The book C++ Concurrency in Action goes to great lengths discussing concurrent programming and memory ordering in C++, and it addresses the need for such a data structure in several evolving ways. When we consider the overhead of reference counting in a lock-free approach, or the additional locking introduced in the enhanced queue, one might wonder how these methods perform against each other. I have gathered all these approaches under one roof in a repository called FreshQueue, where I use the Google micro-benchmarking framework to compare their performance.

First, I put a bare std::queue to the test to establish a baseline. Then I use std::shared_ptr to store the data in the queue. In another benchmark, I add explicit synchronization. Finally, I wrap the whole thing in a concurrent data structure with an internal std::queue and a std::mutex (ThreadSafeFreshQueue).
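The mutex-wrapped variant can be sketched as follows. This is a minimal illustration of the idea, not the repository's actual code; the class and method names here are my own.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <queue>

// Minimal sketch of a mutex-wrapped queue in the spirit of
// ThreadSafeFreshQueue: every operation takes the same lock,
// so pushers and poppers always contend with each other.
template <typename T>
class ThreadSafeQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock{mutex_};
        queue_.push(std::move(value));
    }

    // Returns std::nullopt instead of blocking when the queue is empty.
    std::optional<T> tryPop() {
        std::lock_guard<std::mutex> lock{mutex_};
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::queue<T> queue_;
};
```

Even in a single thread, every push and pop pays for a lock acquisition plus a copy or move into the node, which is where the drop from the bare std::queue baseline comes from.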

Then, I switch to an enhanced singly linked list that separates the head and tail to reduce contention on the shared data (ConcurrentFreshQueue).
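The head/tail separation works by keeping a dummy node at the tail, so that push (which only touches the tail) and pop (which only touches the head) can hold different locks. The sketch below shows the idea with illustrative names; the repository's implementation differs in detail.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <optional>

// Sketch of a two-lock queue: a singly linked list with a dummy
// tail node, so push and pop never touch the same node and can
// hold separate locks.
template <typename T>
class TwoLockQueue {
    struct Node {
        std::optional<T> value;  // the dummy node holds no value
        std::unique_ptr<Node> next;
    };

public:
    TwoLockQueue() : head_{std::make_unique<Node>()}, tail_{head_.get()} {}

    void push(T value) {
        // Allocate the new dummy outside the lock to shorten the
        // critical section.
        auto dummy = std::make_unique<Node>();
        Node* newTail = dummy.get();
        std::lock_guard<std::mutex> lock{tailMutex_};
        tail_->value = std::move(value);  // fill the old dummy
        tail_->next = std::move(dummy);
        tail_ = newTail;
    }

    std::optional<T> tryPop() {
        std::lock_guard<std::mutex> lock{headMutex_};
        {
            // Take the tail lock only long enough to detect emptiness.
            std::lock_guard<std::mutex> tailLock{tailMutex_};
            if (head_.get() == tail_) return std::nullopt;
        }
        std::optional<T> value = std::move(head_->value);
        head_ = std::move(head_->next);
        return value;
    }

private:
    std::mutex headMutex_;
    std::mutex tailMutex_;
    std::unique_ptr<Node> head_;
    Node* tail_;  // protected by tailMutex_
};
```

In a single thread this does extra work (an allocation per push and two mutexes), which explains why it loses to ThreadSafeFreshQueue in the baseline run; the payoff only appears once producers and consumers run on different cores.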

Last but not least, I tested the open-source lock-free queue implementation available in the Boost library. Although the book goes to great trouble to come up with a lock-free implementation, it has some shortcomings regarding memory allocation and initialization that prevent the true potential of lock-free code from being unlocked!

Now to the results. I used an ARM RK3588S 8-core 64-bit processor with the main frequency locked at 2.4 GHz. The following is the single-threaded result:

-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...  
-----------------------------------------------------------------------------------------------------  
BM_Queue_PushAndPop<int>                         65.5 ns         65.5 ns     10686853 Pushes=15.2616M/s
BM_QueueOfSharedPointer_PushAndPop<int>           116 ns          116 ns      6040365 Pushes=8.63589M/s
BM_QueueOfSharedPointer_PushAndPopWithLock<int>   192 ns          192 ns      3648586 Pushes=5.22111M/s
BM_ThreadSafeFreshQueue_PushAndPop<int>           478 ns          478 ns      1460062 Pushes=2.09155M/s
BM_ConcurrentFreshQueue_PushAndPop<int>           813 ns          813 ns       860585 Pushes=1.22979M/s
BM_LockFreeFreshQueue_PushAndPop<int>             651 ns          651 ns      1072108 Pushes=1.53636M/s        

A bare std::queue can do 15.2 million pushes and pops per second; in a release build that can increase to 500 million. If we store shared pointers in our queue, we lose about 45% of the performance. Due to the nature of memory management in multithreaded environments, some form of reference counting is inevitable. If we add a pair of lock and unlock operations to the mix, we lose another 40%, and if we wrap the whole thing in another class, we lose more than half of the remaining performance. That is because of the overhead of copying on push and the fact that I am storing integers. Next, if we implement the queue ourselves and separate the locks on the head and tail, in a single thread we end up with another 40% drop. Interestingly, in a single thread the lock-free queue performs not much better than a locking solution.

Here are the multithreaded results. For each solution, I have increased the thread count.

--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                              Time             CPU   Iterations UserCounters...  
--------------------------------------------------------------------------------------------------------------------------------------------------------  
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2                          444 ns          735 ns      1472954 Pushes=1.12727M/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4                          745 ns         1813 ns      1053216 Pushes=670.992k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8                         1114 ns         4550 ns       673328 Pushes=448.789k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16                         832 ns         3474 ns       669264 Pushes=600.793k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32                        1172 ns         7159 ns       482304 Pushes=426.528k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64                        1192 ns         7105 ns       569344 Pushes=419.579k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128                       3374 ns         9037 ns       161536 Pushes=148.211k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256                       3003 ns         9825 ns       173312 Pushes=166.486k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512                       2324 ns        13776 ns       422912 Pushes=215.161k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024                      2434 ns        17422 ns       392192 Pushes=205.457k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2           506 ns          812 ns      1503706 Pushes=988.321k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4           702 ns         1724 ns       906120 Pushes=712.033k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8          1296 ns         4779 ns       646952 Pushes=385.835k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16         1171 ns         6673 ns       858544 Pushes=427.008k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32         1141 ns         7133 ns       595808 Pushes=438.356k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64         1154 ns         7179 ns       619584 Pushes=433.308k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128        1218 ns         6780 ns       614528 Pushes=410.601k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256        1351 ns         7238 ns       542720 Pushes=370.024k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512        1647 ns         7261 ns       414208 Pushes=303.532k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024       1561 ns         8169 ns       324608 Pushes=320.35k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2           446 ns          847 ns      1427228 Pushes=1.12125M/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4           443 ns         1003 ns      1524464 Pushes=1.12752M/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8           585 ns         2000 ns      1323832 Pushes=854.616k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16          706 ns         3607 ns      1015632 Pushes=707.991k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32          714 ns         4377 ns       967360 Pushes=700.708k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64          704 ns         4614 ns       995968 Pushes=710.077k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128         699 ns         4462 ns      1082112 Pushes=715.263k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256         769 ns         4418 ns       964864 Pushes=649.866k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512         806 ns         4851 ns       773632 Pushes=620.631k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024        839 ns         5571 ns       999424 Pushes=596.125k/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2             285 ns          570 ns      2448794 Pushes=1.75571M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4             188 ns          743 ns      3707932 Pushes=2.66646M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8             160 ns         1066 ns      4686752 Pushes=3.13447M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16            174 ns         1329 ns      4552320 Pushes=2.87749M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32            161 ns         1250 ns      3200000 Pushes=3.10694M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64            161 ns         1270 ns      8296960 Pushes=3.10445M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128           179 ns         1422 ns      7419136 Pushes=2.80032M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256           133 ns         1053 ns      6436608 Pushes=3.7628M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512           114 ns          910 ns      5120000 Pushes=4.37553M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024          141 ns         1121 ns     10240000 Pushes=3.54401M/s        

As we can see, for all but the lock-free solution, performance decreases as we add threads working on the queue. Comparing the 1024-thread results, the queue with separate head and tail locks outperforms the wrapped std::queue by almost 86%. The lock-free solution is almost six times as fast as the separated head-and-tail locking solution. With only two threads, however, there is no serious gap between them.

I have done the same tests on an Intel 11700K processor, and they show the same trends. The noticeable difference is in the lock-free results, where a $150 Arm SoC outperforms an Intel 11th-gen Core i7 desktop machine. That is because the Arm processor uses a more relaxed memory model. Although we use relaxed atomic operations on the Intel machine as well, there is no performance gain, since on that architecture they cost almost the same as sequentially consistent ones.

As we can see in the source code and the Google micro-benchmarking documentation, there is good support for templates, custom timers, and counters, as well as various reporting options. The last makes it possible to add these benchmarks to a CI/CD pipeline and keep good track of performance over time.

Sadly, benchmarking is not as easy as it may look, and several factors can skew the results. It is recommended to run benchmarks on a dedicated machine with CPU frequency scaling turned off, and to deal with other sources of inconsistency in the OS and memory configuration as well.


I hope this provides a good starting point for your benchmarking needs, and I would appreciate it if you test the code on your machine and share your results.
