Parallel Algorithms (C++17)

C++17 added parallel versions of many existing Standard Library algorithms and introduced a few new ones designed for parallel execution. These algorithms can leverage multi-core processors to achieve substantial speedups over their sequential counterparts.


What Are C++17 Parallel Algorithms?

C++17 extended the existing STL algorithms (e.g., sort, transform, reduce) by adding support for parallel execution policies. These policies allow algorithms to automatically use:

  • Multi-threading (parallel execution across CPU cores).
  • SIMD vectorization (data-level parallelism within a single core).

Key Headers

#include <algorithm>   // For algorithms like sort, transform
#include <execution>   // For execution policies: par, seq, par_unseq        


Execution Policies

The key to using these parallel algorithms is specifying an execution policy as the first argument. This policy guides the algorithm's execution and allows it to utilize parallelism. C++17 defines three standard execution policies:

  1. std::execution::seq (Sequential Policy): This policy forces the algorithm to execute sequentially, just like the original versions. It's useful for ensuring code behaves identically to pre-C++17 or for benchmarking.
  2. std::execution::par (Parallel Policy): This policy allows the algorithm to execute in parallel, potentially using multiple threads, but no vectorization. The implementation is free to choose how to parallelize the execution.
  3. std::execution::par_unseq (Parallel Unsequenced Policy): This policy not only allows parallel execution but also permits vectorization (SIMD) within each thread. This can lead to further performance gains if the underlying hardware supports it and the operations are suitable for vectorization.


Compilation and Execution

On Linux (GCC):

g++ -std=c++17 -O2 -ltbb My_code.cpp -o My_program        

  • -ltbb: Links Intel’s Threading Building Blocks (TBB), which GCC’s libstdc++ (version 9 and later) uses as the backend for the parallel algorithms.
  • -O2: Enables optimizations (including auto-vectorization).


C++17 Execution Policy Defaults

In C++17, execution policies are optional. If you omit them:

  • The algorithm defaults to sequential execution (std::execution::seq).
  • You do not need to explicitly specify seq.

#include <vector>
#include <algorithm>

int main() {
    std::vector<int> data = {5, 3, 2, 6, 1, 4};
    
    // Defaults to sequential execution (no parallelism)
    std::sort(data.begin(), data.end()); // Same as std::sort(std::execution::seq, ...)
}        

  • No Overhead: Omitting the policy avoids potential parallelism overhead.
  • Explicit Policies: Use par or par_unseq only when parallelism is needed.
  • Header Requirement: To use policies like par, you must include <execution>.

Types of Parallelism

The C++17 parallel algorithms can leverage different types of parallelism:

  • Thread-level Parallelism (MIMD): This is achieved using the std::execution::par policy. The algorithm's work is divided among multiple threads, which can execute independently on different cores. This is a form of Multiple Instruction, Multiple Data (MIMD) parallelism, where each thread executes its own sequence of instructions on different parts of the data.
  • Data-level Parallelism (SIMD): This is possible with the std::execution::par_unseq policy. In addition to thread-level parallelism, this policy allows the implementation to use Single Instruction, Multiple Data (SIMD) instructions. SIMD enables a single instruction to operate on multiple data elements simultaneously, which can significantly speed up operations on arrays or vectors.

Here's an example demonstrating the use of std::execution::par to find the maximum element in a large dataset:

EX1:

#include <iostream>
#include <vector>
#include <algorithm>
#include <execution>

int main() {
    std::vector<int> data(1000000, 1);

    // Set one element, then double every element in parallel.
    // (Incrementing a shared counter inside the parallel lambda, as a naive
    // version might, is a data race -- see the Considerations section.)
    data[9998] = 100;
    std::for_each(std::execution::par, data.begin(), data.end(), [](int& n) {
            n *= 2;
    });

    // Using parallel execution policy to find the maximum element
    auto max_elem = std::max_element(std::execution::par, data.begin(), data.end());

    std::cout << "Max element: " << *max_elem << "\n";

    return 0;
}
/*
Max element: 200
*/        

EX2: Parallel Sort

#include <vector>
#include <algorithm>
#include <execution>

int main() {
    std::vector<int> data = {5, 3, 2, 6, 1, 4};
    
    // Parallel sort using multiple CPU threads
    std::sort(std::execution::par, data.begin(), data.end());
    
    // Result: {1, 2, 3, 4, 5, 6}
}        

  • Note: an execution policy is a permission, not a guarantee; the implementation may fall back to sequential execution. std::sort with par uses a parallel implementation when one is available.

Full Example: Parallel Word Count

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <numeric>    // std::reduce
#include <execution>

using namespace std;
int main() {
    vector<string> words = {"apple", "banana", "apple", "cherry"};
    vector<int> counts(words.size(), 0);
    
    // Count occurrences of "apple" in parallel
    transform(execution::par,
        words.begin(), words.end(), counts.begin(),
        [](const std::string& word) {
            return (word == "apple") ? 1 : 0;
        });
    
    int total = reduce(execution::par,
        counts.begin(), counts.end(), 0);
    
    cout << "Total count = " << total << "\n"; // prints: Total count = 2
}        



SIMD Vectorization with par_unseq

The par_unseq policy allows the compiler and runtime to generate SIMD instructions (e.g., AVX on x86 CPUs) for data-level parallelism. Whether vectorization actually happens depends on the compiler, the target hardware, and whether the operation is vectorizable; the callable must not acquire locks or otherwise rely on ordering, because iterations may be interleaved within a single thread.

Here's an example demonstrating the use of std::execution::par_unseq for a vector transformation:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>    // std::iota
#include <execution>

int main() {
  std::vector<int> data(1000000);
  std::iota(data.begin(), data.end(), 1); // Fill with values 1 to 1000000

  // Parallel transformation using par_unseq
  std::transform(std::execution::par_unseq, data.begin(), data.end(), data.begin(),
                 [](int x) { return x * x; });

  // Print the first few elements to verify
  for (int i = 0; i < 5; ++i) {
    std::cout << data[i] << " ";
  }
  std::cout << std::endl;

  return 0;
}
/*
1 4 9 16 25 
*/        

Supported Algorithms

Not all STL algorithms support parallelism. Common parallelizable algorithms include:

  • std::sort, std::transform, std::reduce, std::for_each, std::count, std::find, std::inclusive_scan.


Considerations

  • Thread Safety: Lambdas/functions must be thread-safe (no shared mutable state).
  • Data Races: When using parallel algorithms, it's crucial to ensure that the operations performed on the data are free from data races. This means that if multiple threads access the same memory location, at most one of them should be writing to it.

// UNSAFE (race condition):
int counter = 0;
std::for_each(std::execution::par, data.begin(), data.end(), [&](int) {
    counter++; // Race condition!
});        

  • Overhead: Parallelization introduces some overhead, such as thread creation and synchronization. For small input sizes or computationally simple operations, the overhead might outweigh the benefits of parallelism.
  • Hardware Support: The actual performance gains depend on the underlying hardware, such as the number of CPU cores and SIMD capabilities.


Good read: Parallel Algorithms of the Standard Template Library – MC++ BLOG
