Splitting a long string in lines efficiently

Splitting a long string in lines efficiently

Suppose that you have a long string and you want to insert line breaks every 72 characters. You might need to do this if you need to write a public cryptographic key to a text file.

A simple C function ought to suffice. I use the letter K to indicate the length of the lines. I copy from an input buffer to an output buffer.

void insert_line_feed(const char *buffer, size_t length, 
        int K, char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  size_t next_line_feed = K;
  while (input_pos < length) {
    output[0] = buffer[input_pos];
    output++;
    input_pos++;
    next_line_feed--;
    if (next_line_feed == 0) {
      output[0] = '\n';
      output++;
      next_line_feed = K;
    }
  }
}
        

This character-by-character process might be inefficient. To go faster, we might call memcpy to copy blocks of data.

void insert_line_feed_memcpy(const char *buffer, size_t length, int K,
                             char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  while (input_pos + K < length) {
    std::memcpy(output, buffer + input_pos, K);
    output += K;
    input_pos += K;
    output[0] = '\n';
    output++;
  }
  std::memcpy(output, buffer + input_pos, length - input_pos);
}        

The memcpy function is likely to be turned into just a few instruction. For example, if you compile for a recent AMD processor (Zen 5), it might generate only two instructions (two vmovups) when the length of the lines (K) is 64.

Can we do better ?

In general, I expect that you cannot do much better than using the memcpyfunction. Compilers are simply great a optimizing it.

Yet it might be interesting to explore whether deliberate use of SIMD instructions could optimize this code. SIMD (Single Instruction, Multiple Data) instructions process multiple data elements simultaneously with a single instruction: the memcpy function automatically uses it. We can utilize SIMD instructions through intrinsic functions, which are compiler-supplied interfaces that enable direct access to processor-specific instructions, optimizing performance while preserving high-level code readability.

Let me focus on AVX2, the instruction set supported by effectively all x64 (Intel and AMD) processors. We can load 32-byte registers and write 32-byte registers. Thus we need a function that takes a 32-byte register and inserts a line-feed character at some location (N) in it. For cases where N is less than 16, the function shifts the input vector right by one byte to align the data correctly, using mm256alignr_epi8 and mm256blend_epi32, before applying the shuffle mask and inserting the newline. When N is 16 or greater, it directly uses a shuffle mask from the precomputed shuffle_masks array to reorder the input bytes and insert the newline, leveraging a comparison with 0x80 to identify the insertion point and blending the result with a vector of newline characters for efficient parallel processing.

inline __m256i insert_line_feed32(__m256i input, int N) {
  __m256i line_feed_vector = _mm256_set1_epi8('\n');
  __m128i identity =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 
         9, 10, 11, 12, 13, 14, 15);
  if (K >= 16) {
    __m128i maskhi = _mm_loadu_si128(shuffle_masks[N - 16]);
    __m256i mask = _mm256_set_m128i(maskhi, identity);
    __m256i lf_pos = _mm256_cmpeq_epi8(mask, 
       _mm256_set1_epi8(0x80));
    __m256i shuffled = _mm256_shuffle_epi8(input, mask);
    __m256i result = _mm256_blendv_epi8(shuffled, 
       line_feed_vector, lf_pos);
    return result;
  }
  // Shift input right by 1 byte
  __m256i shift = _mm256_alignr_epi8(
      input, _mm256_permute2x128_si256(input, input, 0x21), 
        15);
  input = _mm256_blend_epi32(input, shift, 0xF0);
  __m128i masklo = _mm_loadu_si128(shuffle_masks[N]);
  __m256i mask = _mm256_set_m128i(identity, masklo);
  __m256i lf_pos = _mm256_cmpeq_epi8(mask, 
      _mm256_set1_epi8(0x80));
  __m256i shuffled = _mm256_shuffle_epi8(input, mask);
  __m256i result = _mm256_blendv_epi8(shuffled, 
      line_feed_vector, lf_pos);
  return result;
}
        

Can we go faster by using such a fancy function ? Let us test it out. I wrote a benchmark. I use a large input string on an Intel Ice Lake processor with GCC 12.


Article content

The handcrafted AVX2 approach is faster in my tests than the memcpy approach despite using more instructions. However, the handcrafted AVX2 approach stores data to memory using fewer instructions.

To view or add a comment, sign in

More articles by Daniel Lemire

  • House prices and fertility

    No, rising house prices are not the driver of sharp fertility declines. The evidence shows only modest, mixed effects…

  • You can beat the binary search

    We sometimes have to look for a value in a sorted array. The simplest algorithm consists in just going through the…

    7 Comments
  • The fastest way to match characters on ARM processors?

    Consider the following problem. Given a string, you must match all of the ASCII white-space characters (\t, \n, \r, and…

    1 Comment
  • A brief history of C/C++ programming languages

    Initially, we had languages like Fortran (1957), Pascal (1970), and C (1972). Fortran was designed for number crunching…

    10 Comments
  • Can your AI rewrite your code in assembly?

    Suppose you have several strings and you want to count the number of instances of the character ! in your strings. In…

    3 Comments
  • A Fast Immutable Map in Go

    Consider the following problem. You have a large set of strings, maybe millions.

    4 Comments
  • Prefix sums at tens of gigabytes per second with ARM NEON

    Suppose that you have a record of your sales per day. You might want to get a running record where, for each day, you…

    2 Comments
  • You can use newline characters in URLs

    We locate web content using special addresses called URLs. We are all familiar with addresses like https://google.

  • How fast do browsers correct UTF-16 strings?

    JavaScript represents strings using Unicode, like most programming languages today. Each character in a JavaScript…

    1 Comment
  • How bad can Python stop-the-world pauses get?

    When programming, we need to allocate memory, and then deallocate it. If you program in C, you get used to malloc/free…

    3 Comments

Others also viewed

Explore content categories