C++ and Benchmarking
It may sound extreme, but I believe benchmarking, too, should get the same special treatment as testing in a project. There are usually performance criteria around a C++ codebase, and as time passes and people come and go, one developer may prefer one approach while the next chooses another. It's good to keep a history of these attempts somewhere so developers can refer to them and compare results. At the very least, planning for benchmarking is a good idea. With that said, I'd like to introduce the Google micro-benchmarking framework through a useful example.
A while ago, I started looking for the most performant approach to pass data between multiple producers and multiple consumers. There is a book called C++ Concurrency in Action, which goes to great lengths discussing concurrent programming and memory ordering in C++. That book addresses the need for such a data structure through several evolving designs. When we consider the overhead of reference counting in a lock-free approach, or the additional locking introduced in the enhanced queue, one might wonder how these methods perform against each other. I have gathered all these approaches under one roof in a repository called FreshQueue, where I use the Google micro-benchmarking framework to compare their performance.
First, I put a bare std::queue to the test to have a baseline. Then I use std::shared_ptrs to store the data in the queue. In another benchmark, I add explicit synchronization. Finally, I wrap the whole thing in a concurrent data structure with an internal std::queue and a std::mutex (ThreadSafeFreshQueue).
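The wrapper itself is small. The following is a minimal sketch of what a class like ThreadSafeFreshQueue might look like (this is my illustration of the idea, not the exact code from the repository): a std::queue guarded by a std::mutex, storing std::shared_ptrs so the copy that could throw happens on push rather than on pop.

```cpp
#include <memory>
#include <mutex>
#include <queue>

// Sketch of a mutex-wrapped queue in the spirit of ThreadSafeFreshQueue.
// Storing shared_ptrs means the allocation/copy that could throw happens
// in push(), so tryPop() can hand the element out without data loss.
template <typename T>
class ThreadSafeQueueSketch {
public:
    void push(T value) {
        auto data = std::make_shared<T>(std::move(value));
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(data));
    }

    // Returns an empty shared_ptr when the queue has nothing to offer.
    std::shared_ptr<T> tryPop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return nullptr;
        auto result = std::move(queue_.front());
        queue_.pop();
        return result;
    }

private:
    std::mutex mutex_;
    std::queue<std::shared_ptr<T>> queue_;
};
```

Note that every push and tryPop serializes on the single mutex; that contention is exactly what the next design tries to reduce.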
Then, I switch to an enhanced singly linked list with separation of head and tail to reduce contention on the shared data (ConcurrentFreshQueue).
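The idea, following the book's fine-grained design, is a singly linked list that always contains a dummy node, so a producer touching the tail and a consumer touching the head lock different mutexes. A rough sketch (again my own illustration, not the repository code):

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Sketch of a two-mutex queue in the spirit of ConcurrentFreshQueue.
// The dummy node guarantees head and tail never refer to the same
// "real" element, so push() and tryPop() can lock independently.
template <typename T>
class TwoMutexQueueSketch {
    struct Node {
        std::shared_ptr<T> data;
        std::unique_ptr<Node> next;
    };

public:
    TwoMutexQueueSketch() : head_(new Node), tail_(head_.get()) {}

    void push(T value) {
        auto data = std::make_shared<T>(std::move(value));
        auto dummy = std::make_unique<Node>();  // becomes the new dummy tail
        std::lock_guard<std::mutex> lock(tailMutex_);
        tail_->data = std::move(data);          // old dummy now holds the value
        Node* newTail = dummy.get();
        tail_->next = std::move(dummy);
        tail_ = newTail;
    }

    std::shared_ptr<T> tryPop() {
        std::lock_guard<std::mutex> lock(headMutex_);
        if (head_.get() == currentTail()) return nullptr;  // empty
        auto oldHead = std::move(head_);
        auto result = std::move(oldHead->data);
        head_ = std::move(oldHead->next);  // old head node freed here
        return result;
    }

private:
    Node* currentTail() {
        std::lock_guard<std::mutex> lock(tailMutex_);
        return tail_;
    }

    std::mutex headMutex_;
    std::mutex tailMutex_;
    std::unique_ptr<Node> head_;
    Node* tail_;
};
```

Because a consumer only needs the tail pointer for the emptiness check, and push only ever takes the tail mutex, the two ends mostly proceed in parallel.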
Last but not least, I test an open-source implementation of a lock-free queue available in the Boost library. Although the book goes to great trouble to come up with a lock-free implementation, it has some shortcomings regarding memory allocation and initialization that prevent the true potential of lock-free code from being unlocked!
Now to the results. I used an ARM RK3588S 8-core 64-bit processor with the main frequency locked at 2.4 GHz. The following is the single-thread result:
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_Queue_PushAndPop<int> 65.5 ns 65.5 ns 10686853 Pushes=15.2616M/s
BM_QueueOfSharedPointer_PushAndPop<int> 116 ns 116 ns 6040365 Pushes=8.63589M/s
BM_QueueOfSharedPointer_PushAndPopWithLock<int> 192 ns 192 ns 3648586 Pushes=5.22111M/s
BM_ThreadSafeFreshQueue_PushAndPop<int> 478 ns 478 ns 1460062 Pushes=2.09155M/s
BM_ConcurrentFreshQueue_PushAndPop<int> 813 ns 813 ns 860585 Pushes=1.22979M/s
BM_LockFreeFreshQueue_PushAndPop<int> 651 ns 651 ns 1072108 Pushes=1.53636M/s
A bare std::queue can do 15.2M pushes and pops per second. In a release build, that can increase to 500M. If we store shared pointers in our queue, we lose about 45% of the performance. Due to the nature of memory management in multithreaded environments, some sort of reference counting is inevitable. If we add a pair of locking and unlocking operations to the mix, we lose roughly 40% more, and if we wrap the whole thing in another class, we lose more than half of the remaining performance. That is because of the overhead of copying on the push and the fact that I'm storing integers. Next, if we implement the queue ourselves and separate the locking on the head and tail, in a single thread we end up with another 40% drop. Interestingly, the lock-free queue performs not much better than a locking solution.
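For reference, the Pushes column in these tables comes from Google Benchmark's user-counter facility. The baseline benchmark might be written roughly as follows (my sketch, assuming the benchmark library is installed; the actual code lives in the FreshQueue repository):

```cpp
#include <benchmark/benchmark.h>

#include <queue>

// Sketch of the baseline: push one element, pop it back, and report
// the push rate as a user counter named "Pushes".
template <typename T>
static void BM_Queue_PushAndPop(benchmark::State& state) {
    std::queue<T> queue;
    for (auto _ : state) {
        queue.push(T{});
        benchmark::DoNotOptimize(queue.front());
        queue.pop();
    }
    state.counters["Pushes"] =
        benchmark::Counter(static_cast<double>(state.iterations()),
                           benchmark::Counter::kIsRate);
}
BENCHMARK_TEMPLATE(BM_Queue_PushAndPop, int);

BENCHMARK_MAIN();
```

For the multithreaded runs below, the same pattern is registered with thread options such as ->ThreadRange(2, 1024)->MeasureProcessCPUTime()->UseRealTime(), which is what produces the process_time/real_time/threads:N labels in the tables.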
Here are the multithreaded results. For each solution, I have increased the thread count.
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2 444 ns 735 ns 1472954 Pushes=1.12727M/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4 745 ns 1813 ns 1053216 Pushes=670.992k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8 1114 ns 4550 ns 673328 Pushes=448.789k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16 832 ns 3474 ns 669264 Pushes=600.793k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32 1172 ns 7159 ns 482304 Pushes=426.528k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64 1192 ns 7105 ns 569344 Pushes=419.579k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128 3374 ns 9037 ns 161536 Pushes=148.211k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256 3003 ns 9825 ns 173312 Pushes=166.486k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512 2324 ns 13776 ns 422912 Pushes=215.161k/s
BM_QueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024 2434 ns 17422 ns 392192 Pushes=205.457k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2 506 ns 812 ns 1503706 Pushes=988.321k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4 702 ns 1724 ns 906120 Pushes=712.033k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8 1296 ns 4779 ns 646952 Pushes=385.835k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16 1171 ns 6673 ns 858544 Pushes=427.008k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32 1141 ns 7133 ns 595808 Pushes=438.356k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64 1154 ns 7179 ns 619584 Pushes=433.308k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128 1218 ns 6780 ns 614528 Pushes=410.601k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256 1351 ns 7238 ns 542720 Pushes=370.024k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512 1647 ns 7261 ns 414208 Pushes=303.532k/s
BM_ThreadSafeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024 1561 ns 8169 ns 324608 Pushes=320.35k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2 446 ns 847 ns 1427228 Pushes=1.12125M/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4 443 ns 1003 ns 1524464 Pushes=1.12752M/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8 585 ns 2000 ns 1323832 Pushes=854.616k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16 706 ns 3607 ns 1015632 Pushes=707.991k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32 714 ns 4377 ns 967360 Pushes=700.708k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64 704 ns 4614 ns 995968 Pushes=710.077k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128 699 ns 4462 ns 1082112 Pushes=715.263k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256 769 ns 4418 ns 964864 Pushes=649.866k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512 806 ns 4851 ns 773632 Pushes=620.631k/s
BM_ConcurrentFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024 839 ns 5571 ns 999424 Pushes=596.125k/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:2 285 ns 570 ns 2448794 Pushes=1.75571M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:4 188 ns 743 ns 3707932 Pushes=2.66646M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:8 160 ns 1066 ns 4686752 Pushes=3.13447M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:16 174 ns 1329 ns 4552320 Pushes=2.87749M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:32 161 ns 1250 ns 3200000 Pushes=3.10694M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:64 161 ns 1270 ns 8296960 Pushes=3.10445M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:128 179 ns 1422 ns 7419136 Pushes=2.80032M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:256 133 ns 1053 ns 6436608 Pushes=3.7628M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:512 114 ns 910 ns 5120000 Pushes=4.37553M/s
BM_LockFreeFreshQueueMultiThreadFixture<int>/PushAndPop/process_time/real_time/threads:1024 141 ns 1121 ns 10240000 Pushes=3.54401M/s
As we can see, for all but the lock-free solution, performance decreases as we add threads working on the queue. If we compare the 1024-thread results for the separated head-and-tail locking and the wrapped std::queue, we see an almost 86% increase in performance. Here the lock-free solution is almost six times more performant than the separated head-and-tail locking solution. With only two threads, however, there is no serious gap.
I have done the same test on an Intel 11700K processor, and it shows the same trends. The noticeable difference is in the lock-free results, where a $150 ARM SoC performs better than an Intel 11th-gen Core i7 desktop machine. That's because the ARM processor has a more relaxed memory model. Although we use relaxed atomic operations on the Intel machine as well, there is no gain in performance, since on that architecture they cost almost the same as sequentially consistent ones.
As we can see in the source code and the Google micro-benchmarking documentation, there is good support for templates, custom timers, and counters, as well as reporting options. The last one makes it possible to add these benchmarks to the CI/CD pipeline and keep good track of performance.
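For example, the reporting options can dump machine-readable results that a CI job can archive and diff between commits. The benchmark binary name below is hypothetical; the compare.py script ships in the benchmark repository's tools directory:

```shell
# Run the suite and write JSON alongside the console output.
./fresh_queue_benchmarks --benchmark_format=console \
    --benchmark_out=results.json --benchmark_out_format=json

# Compare two stored runs, e.g. before and after a commit.
python3 benchmark/tools/compare.py benchmarks baseline.json results.json
```

The comparison output flags each benchmark's time delta, so a pipeline can fail a build when a queue suddenly gets slower.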
Sadly, benchmarking is not as easy as it may look. Several factors can have a negative impact on our benchmarks. It is recommended to have a dedicated machine for running benchmarks and, on that machine, to have CPU frequency scaling turned off. Other sources of inconsistency, such as the OS scheduler and memory state, should be dealt with as well.
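On Linux, for instance, pinning the frequency governor before a run might look like this (requires root; governor names vary by driver, and schedutil is only an assumed default here):

```shell
# Force the performance governor so the clock does not scale mid-run,
# run the suite, then restore the previous governor.
sudo cpupower frequency-set --governor performance
./fresh_queue_benchmarks
sudo cpupower frequency-set --governor schedutil
```

Google Benchmark prints a warning when it detects scaling enabled, which is a useful sanity check that this step actually took effect.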
I hope this provides a good starting point for your benchmarking needs, and I would appreciate it if you tested the code on your machine and shared the results.