Parallel Algorithms (C++17)

C++17 added parallel versions of many existing Standard Library algorithms and introduced a few new ones designed for parallel execution. These algorithms can leverage multi-core processors to achieve substantial speedups over their sequential counterparts.


What Are C++17 Parallel Algorithms?

C++17 extended the existing STL algorithms (e.g., sort, transform, reduce) by adding support for parallel execution policies. These policies allow algorithms to automatically use:

  • Multi-threading (parallel execution across CPU cores).
  • SIMD vectorization (data-level parallelism within a single core).

Key Headers

#include <algorithm>   // For algorithms like sort, transform
#include <execution>   // For execution policies: par, seq, par_unseq        


Execution Policies

The key to using these parallel algorithms is specifying an execution policy as the first argument. This policy guides the algorithm's execution and allows it to utilize parallelism. C++17 defines three standard execution policies:

  1. std::execution::seq (Sequential Policy): This policy forces the algorithm to execute sequentially, just like the original versions. It's useful for ensuring code behaves identically to pre-C++17 or for benchmarking.
  2. std::execution::par (Parallel Policy): This policy allows the algorithm to execute in parallel, potentially using multiple threads, but no vectorization. The implementation is free to choose how to parallelize the execution.
  3. std::execution::par_unseq (Parallel Unsequenced Policy): This policy not only allows parallel execution but also permits vectorization (SIMD) within each thread. This can lead to further performance gains if the underlying hardware supports it and the operations are suitable for vectorization.


Compilation and Execution

On Linux (GCC):

g++ -std=c++17 -O2 -ltbb My_code.cpp -o My_program        

  • -ltbb: Links Intel’s Threading Building Blocks (TBB), which GCC’s libstdc++ (version 9 and later) uses as the backend for the parallel algorithms.
  • -O2: Enables optimizations (including auto-vectorization).


C++17 Execution Policy Defaults

In C++17, execution policies are optional. If you omit them:

  • The algorithm defaults to sequential execution (std::execution::seq).
  • You do not need to explicitly specify seq.

#include <vector>
#include <algorithm>

int main() {
    std::vector<int> data = {5, 3, 2, 6, 1, 4};
    
    // Defaults to sequential execution (no parallelism)
    std::sort(data.begin(), data.end()); // Same as std::sort(std::execution::seq, ...)
}        

  • No Overhead: Omitting the policy avoids potential parallelism overhead.
  • Explicit Policies: Use par or par_unseq only when parallelism is needed.
  • Header Requirement: To use policies like par, you must include <execution>.

Types of Parallelism

The C++17 parallel algorithms can leverage different types of parallelism:

  • Thread-level Parallelism (MIMD): This is achieved using the std::execution::par policy. The algorithm's work is divided among multiple threads, which can execute independently on different cores. This is a form of Multiple Instruction, Multiple Data (MIMD) parallelism, where each thread executes its own sequence of instructions on different parts of the data.
  • Data-level Parallelism (SIMD): This is possible with the std::execution::par_unseq policy. In addition to thread-level parallelism, this policy allows the implementation to use Single Instruction, Multiple Data (SIMD) instructions. SIMD enables a single instruction to operate on multiple data elements simultaneously, which can significantly speed up operations on arrays or vectors.

Here's an example demonstrating the use of std::execution::par to find the maximum element in a large dataset:

EX1:

#include <iostream>
#include <vector>
#include <algorithm>
#include <execution>

int main() {
    std::vector<int> data(1000000, 1);

    // Set one element, then double every element in parallel.
    // (Incrementing a shared counter inside the parallel lambda, as a naive
    // version might, is a data race -- see the Considerations section.)
    data[9998] = 100;
    std::for_each(std::execution::par, data.begin(), data.end(), [](int& n) {
            n *= 2;
    });

    // Using parallel execution policy to find the maximum element
    auto max_elem = std::max_element(std::execution::par, data.begin(), data.end());

    std::cout << "Max element: " << *max_elem << "\n";

    return 0;
}
/*
Max element: 200
*/        

EX2: Parallel Sort

#include <vector>
#include <algorithm>
#include <execution>

int main() {
    std::vector<int> data = {5, 3, 2, 6, 1, 4};
    
    // Parallel sort using multiple CPU threads
    std::sort(std::execution::par, data.begin(), data.end());
    
    // Result: {1, 2, 3, 4, 5, 6}
}        

  • Note: an execution policy is a permission, not a guarantee; the implementation may fall back to sequential execution. std::sort with par uses a parallel implementation when one is available.

Full Example: Parallel Word Count

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <numeric>    // std::reduce
#include <execution>

using namespace std;
int main() {
    vector<string> words = {"apple", "banana", "apple", "cherry"};
    vector<int> counts(words.size(), 0);
    
    // Count occurrences of "apple" in parallel
    transform(execution::par,
        words.begin(), words.end(), counts.begin(),
        [](const std::string& word) {
            return (word == "apple") ? 1 : 0;
        });
    
    int total = reduce(execution::par,
        counts.begin(), counts.end(), 0);
    
    cout << "Total count = " << total << "\n"; // prints: Total count = 2
}        



SIMD Vectorization with par_unseq

The par_unseq policy allows the compiler and runtime to generate SIMD instructions (e.g., AVX on x86 CPUs) for data-level parallelism. Whether vectorization actually happens depends on the compiler, the target hardware, and whether the operation is vectorizable; the callable must not acquire locks or otherwise rely on ordering, because iterations may be interleaved within a single thread.

Here's an example demonstrating the use of std::execution::par_unseq for a vector transformation:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>    // std::iota
#include <execution>

int main() {
  std::vector<int> data(1000000);
  std::iota(data.begin(), data.end(), 1); // Fill with values 1 to 1000000

  // Parallel transformation using par_unseq
  std::transform(std::execution::par_unseq, data.begin(), data.end(), data.begin(),
                 [](int x) { return x * x; });

  // Print the first few elements to verify
  for (int i = 0; i < 5; ++i) {
    std::cout << data[i] << " ";
  }
  std::cout << std::endl;

  return 0;
}
/*
1 4 9 16 25 
*/        

Supported Algorithms

Not all STL algorithms support parallelism. Common parallelizable algorithms include:

  • std::sort, std::transform, std::reduce, std::for_each, std::count, std::find, std::inclusive_scan.


Considerations

  • Thread Safety: Lambdas/functions must be thread-safe (no shared mutable state).
  • Data Races: When using parallel algorithms, it's crucial to ensure that the operations performed on the data are free from data races. This means that if multiple threads access the same memory location, at most one of them should be writing to it.

// UNSAFE (race condition):
int counter = 0;
std::for_each(std::execution::par, data.begin(), data.end(), [&](int) {
    counter++; // Race condition!
});        

  • Overhead: Parallelization introduces some overhead, such as thread creation and synchronization. For small input sizes or computationally simple operations, the overhead might outweigh the benefits of parallelism.
  • Hardware Support: The actual performance gains depend on the underlying hardware, such as the number of CPU cores and SIMD capabilities.


Good read: Parallel Algorithms of the Standard Template Library – MC++ BLOG
