B3Search: A Bounded Branchless Binary Search algorithm optimized for GPU performance

In the realm of modern computing, especially with the rise of powerful graphics processing units (GPUs), the quest for efficient algorithms has never been more critical. GPUs have revolutionized computational tasks by enabling parallel processing on an unprecedented scale, making them indispensable in fields ranging from scientific simulations to artificial intelligence. However, despite their immense power, traditional search algorithms often fall short when it comes to fully leveraging GPU capabilities.

During my exploration of search algorithms tailored for GPUs, I encountered numerous approaches that either failed to deliver the desired efficiency or introduced unnecessary complexity. This frustration sparked a desire to design something new - a search algorithm specifically optimized for GPU environments.

Drawing inspiration from the innovative work of developers who have pioneered branchless binary search techniques, I embarked on creating an algorithm that would not only be efficient but also straightforward in its implementation. The result of this personal research is B3Search, a branchless binary search algorithm designed to maximize performance on GPUs by minimizing the overhead associated with conditional branches.

In the following sections, I will delve into the intricacies of B3Search, explaining how it achieves superior efficiency and why it stands out as an optimal choice for GPU-based applications.

Understanding Branchless Binary Search for GPUs

Avoiding conditional branches like the plague - why does it matter? Let's explore some reasons in this section. Those who follow best practices understand that performance is key. General-purpose GPU (GPGPU) programming requires rethinking certain concepts, and these ideas can feel counterintuitive to CPU developers.

In GPGPU programming, minimizing memory transfers between VRAM and SRAM is critical for efficiency, as moving data between these memories introduces latency. Conditional branches are particularly problematic. Additionally, expecting threads to synchronize should be avoided.

Why Avoid Branches on GPUs?

GPUs excel at handling multiple tasks simultaneously by processing threads in groups, often referred to as "warps". Each warp consists of many threads executing the same instruction at the same time. When conditional branches (like if-else statements) are encountered, these warps can split into different execution paths. This phenomenon, known as warp divergence, significantly slows down GPU performance because it disrupts the parallel processing efficiency.

To maintain high performance on GPUs, algorithms should avoid branching to keep all threads in a warp executing the same instructions. This is where branchless algorithms come into play, ensuring that operations remain efficient and parallelized.

Binary Search: A Classic Algorithm

Binary search is a fundamental algorithm used to find a target value within a sorted list by repeatedly dividing the search interval in half. The traditional implementation involves comparing the middle element of the current range with the target value and deciding whether to continue searching in the left or right half.

Here's a simplified version of how binary search works:

Initialize: Set begin to the start of the array (index 0) and end to the end of the array (length of the array).
Loop: While begin is less than end:Calculate the middle index, mid.If the value at array[mid] is less than the target, adjust begin to mid + 1.Otherwise, adjust end to mid.
Return: The index where the target value is found or the position where it should be inserted.

Transitioning to Branchless Operations

To adapt binary search for GPU execution without using branches, we employ arithmetic operations and bitwise manipulations to simulate the decision-making process traditionally handled by conditional statements.

1. IsLessThan Function

This function determines if one unsigned integer is less than another without using a branch. Here's how it works:

Subtraction: Subtract the two numbers.
Bitwise Shift: Divide the result by int.MaxValue (2^31 - 1) serves a specific purpose. If the first number is less than the second, the subtraction results in a negative value where the sign bit (bit 31) is set to 1. When this value is Divide the result by int.MaxValue, it becomes 1. On the other hand, if the result is non-negative, resulting in 0.

This function returns 1 if the first number is less than the second, otherwise 0.

C#

public static uint IsLessThan(uint element, uint compareTo)
=> (int)((@this - other) / int.MaxValue);

2. Min Function

Finding the minimum of two numbers without branching involves creating a mask based on their comparison:

Subtraction: Subtract the second number from the first.
Mask Creation: Use bitwise operations to create a mask where all bits are 1 if the first number is less than the second, otherwise 0.
Combination: Use the mask to combine the two numbers, effectively selecting the smaller one.

C#

public static int Min(int a, int b)
{
    // diff's MSB is 1 if a < b (underflow)
    int diff = a - b;

    // mask: 0xFFFFFFFF if a < b, else 0
    int mask = (diff >> 31);

    // returns a if a < b, else b
    return (a & mask) | (b & ~mask);
}

3. Search Function

The binary search function accept an iteration parameter, which specifies the number of iterations needed for the search. This value should be set as the smallest power of two that is greater than or equal to the size of the array being searched. For example, if the array contains 6 elements, the iteration count would be 3 (since 2^3=8). Similarly, for an array with 15000 elements, the iteration count would be 14 (as 2^14=16384).

The function performs the following steps in each iteration:

Mid Calculation: It calculates the middle index of the current search range and ensures it stays within valid array bounds.
Adjustment: Based on whether array[mid] is less than the target value, it adjusts the search range arithmetically without using branching statements.

C#

public static int Search(uint[] array, uint value, byte iteration)
{
    int begin = 0; // Start of the search range
    int end = array.Length; // End (exclusive) of the search range
    int lastIndex = end - 1; // Last valid index in the array

    for (byte i = 0; i < iteration; i++)
    {
        // Compute the middle index, clamped to lastIndex
        int mid = (begin + end) / 2;

        // Check if array[mid] < value (1 if true, 0 if false)
        int isLess = IsLessThan(array[mid], value);

        // If array[mid] < value, move begin to mid + 1; else, keep begin
        begin += isLess * (mid + 1 - begin);

        // If array[mid] >= value, move end to mid; else, keep end
        end -= (1 - isLess) * (end - mid);
    }

    // Return the first index where array[index] >= value, or lastIndex if not found
    return Min(begin, lastIndex);
}

Why This Matters for GPUs

By avoiding conditional branches, this algorithm prevents warp divergence on GPUs. All threads within a warp execute the same instructions uniformly, maintaining high parallel efficiency. This leads to significant performance improvements when processing large datasets or complex computations.

Conclusion

The B3Search algorithm presented here is a testament to the importance of understanding and optimizing for GPU architectures. By leveraging arithmetic operations and bitwise manipulations, we can achieve efficient, parallelized execution that avoids the pitfalls of traditional branching constructs. This approach not only enhances performance but also underscores the necessity of adapting algorithms to the unique capabilities of modern computing hardware.

Appendix

This version of the algorithm is specifically designed to work with unsigned integers (uint). However, with minimal adjustments, it can be adapted for other standard numeric types such as int or long. The key considerations when adapting the algorithm lie in ensuring that the helper functions Min and IsLessThan are correctly implemented for the target type.

For instance, when working with signed integers (int), the subtraction operation may yield different results due to the sign bit's influence. Therefore, the IsLessThan function would need to be adjusted to account for this difference in binary representation. Similarly, the Min function must correctly handle the comparison and selection of the smaller value without introducing any conditional branches.

To optimize performance further, especially when working with CUDA-enabled GPUs, one could utilize CUDA intrinsics such as __vmin or __vmax for vector operations. These functions are specifically designed to perform element-wise comparisons efficiently on NVIDIA GPUs, potentially offering better performance than custom implementations.

In summary, while the current implementation is tailored for uint, it provides a solid foundation that can be extended and optimized for other numeric types with careful consideration of type-specific characteristics and potential use of hardware-accelerated intrinsics.

You can see the repos GitHub here DevNullx64/B3Search: B3Search: A Bounded Branchless Binary Search algorithm optimized for GPU performance

- Written with the assistance of DeepSeekR1:32B and Ollama -

B3Search: A Bounded Branchless Binary Search algorithm optimized for GPU performance

Luc Bologna

Understanding Branchless Binary Search for GPUs

Binary Search: A Classic Algorithm

Transitioning to Branchless Operations

1. IsLessThan Function

Recommended by LinkedIn

2. Min Function

3. Search Function

Why This Matters for GPUs

Conclusion

Others also viewed

Introduction to the Three Key Processing Cores Inside NVIDIA GPUs

CPU vs GPU

CPUs vs. GPUs: Two Philosophies for the Same Computation Problem

Why No GPU Company Has Caught Up With NVIDIA:

Brain vs. Muscle: Differences Between CPU & GPU

3> Optimizing C++ - Offloading N-Body Problem to GPUs with OpenACC

GPUs: The Parallel Computing Engines Powering the Future of AI, Science, and Technology

From Sequential to Parallel: Understanding Prefix Sum and the Power of Kogge-Stone on GPUs

GPGPU SIMT CPU simulation with QEMU, Part 1

If GPUs Are Faster Than CPUs… Why Don’t We Build GPU-Only Computers?

Explore content categories