Splitting a long string in lines efficiently

Daniel Lemire

Published Sep 7, 2025

Suppose that you have a long string and you want to insert line breaks every 72 characters. You might need to do this if you need to write a public cryptographic key to a text file.

A simple C function ought to suffice. I use the letter K to indicate the length of the lines. I copy from an input buffer to an output buffer.

void insert_line_feed(const char *buffer, size_t length, 
        int K, char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  size_t next_line_feed = K;
  while (input_pos < length) {
    output[0] = buffer[input_pos];
    output++;
    input_pos++;
    next_line_feed--;
    if (next_line_feed == 0) {
      output[0] = '\n';
      output++;
      next_line_feed = K;
    }
  }
}

This character-by-character process might be inefficient. To go faster, we might call memcpy to copy blocks of data.

void insert_line_feed_memcpy(const char *buffer, size_t length, int K,
                             char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  while (input_pos + K < length) {
    std::memcpy(output, buffer + input_pos, K);
    output += K;
    input_pos += K;
    output[0] = '\n';
    output++;
  }
  std::memcpy(output, buffer + input_pos, length - input_pos);
}

The memcpy function is likely to be turned into just a few instruction. For example, if you compile for a recent AMD processor (Zen 5), it might generate only two instructions (two vmovups) when the length of the lines (K) is 64.

Can we do better ?

In general, I expect that you cannot do much better than using the memcpyfunction. Compilers are simply great a optimizing it.

Recommended by LinkedIn

What is std::simd in C++26?

Ayman Alheraki 2 weeks ago

Overcoming IO overhead in micro-services

Kislay Verma 6 years ago

free trial: integrate NN processing in MCU & DSP with…

Weiming Li 1 year ago

Yet it might be interesting to explore whether deliberate use of SIMD instructions could optimize this code. SIMD (Single Instruction, Multiple Data) instructions process multiple data elements simultaneously with a single instruction: the memcpy function automatically uses it. We can utilize SIMD instructions through intrinsic functions, which are compiler-supplied interfaces that enable direct access to processor-specific instructions, optimizing performance while preserving high-level code readability.

Let me focus on AVX2, the instruction set supported by effectively all x64 (Intel and AMD) processors. We can load 32-byte registers and write 32-byte registers. Thus we need a function that takes a 32-byte register and inserts a line-feed character at some location (N) in it. For cases where N is less than 16, the function shifts the input vector right by one byte to align the data correctly, using mm256alignr_epi8 and mm256blend_epi32, before applying the shuffle mask and inserting the newline. When N is 16 or greater, it directly uses a shuffle mask from the precomputed shuffle_masks array to reorder the input bytes and insert the newline, leveraging a comparison with 0x80 to identify the insertion point and blending the result with a vector of newline characters for efficient parallel processing.

inline __m256i insert_line_feed32(__m256i input, int N) {
  __m256i line_feed_vector = _mm256_set1_epi8('\n');
  __m128i identity =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 
         9, 10, 11, 12, 13, 14, 15);
  if (K >= 16) {
    __m128i maskhi = _mm_loadu_si128(shuffle_masks[N - 16]);
    __m256i mask = _mm256_set_m128i(maskhi, identity);
    __m256i lf_pos = _mm256_cmpeq_epi8(mask, 
       _mm256_set1_epi8(0x80));
    __m256i shuffled = _mm256_shuffle_epi8(input, mask);
    __m256i result = _mm256_blendv_epi8(shuffled, 
       line_feed_vector, lf_pos);
    return result;
  }
  // Shift input right by 1 byte
  __m256i shift = _mm256_alignr_epi8(
      input, _mm256_permute2x128_si256(input, input, 0x21), 
        15);
  input = _mm256_blend_epi32(input, shift, 0xF0);
  __m128i masklo = _mm_loadu_si128(shuffle_masks[N]);
  __m256i mask = _mm256_set_m128i(identity, masklo);
  __m256i lf_pos = _mm256_cmpeq_epi8(mask, 
      _mm256_set1_epi8(0x80));
  __m256i shuffled = _mm256_shuffle_epi8(input, mask);
  __m256i result = _mm256_blendv_epi8(shuffled, 
      line_feed_vector, lf_pos);
  return result;
}

Can we go faster by using such a fancy function ? Let us test it out. I wrote a benchmark. I use a large input string on an Intel Ice Lake processor with GCC 12.

The handcrafted AVX2 approach is faster in my tests than the memcpy approach despite using more instructions. However, the handcrafted AVX2 approach stores data to memory using fewer instructions.

To view or add a comment, sign in

Splitting a long string in lines efficiently

Daniel Lemire

Recommended by LinkedIn

More articles by Daniel Lemire

Others also viewed

kernel log: when hardware interrupts things I’m learning while building a kernel from scratch

Computing Power

From Simulation to System: Verifying AMD Versal AI Engine Designs with QEMU + XSIM Co-simulation

A step-by-step guide to install Intel Advisor and analyze a sample application and find out where Vectorization matters the most

Vectorized Quick Sort In JDK21

Idealism meets reality

Use of Volatile Keyword in Embedded Systems.

Sensor Fusion and DMIPS

Don't write the fastest code

Explore content categories

Recommended by LinkedIn

More articles by Daniel Lemire

House prices and fertility

You can beat the binary search

The fastest way to match characters on ARM processors?

A brief history of C/C++ programming languages

Can your AI rewrite your code in assembly?

A Fast Immutable Map in Go

Prefix sums at tens of gigabytes per second with ARM NEON

You can use newline characters in URLs

How fast do browsers correct UTF-16 strings?

How bad can Python stop-the-world pauses get?

Others also viewed

kernel log: when hardware interrupts things I’m learning while building a kernel from scratch

Computing Power

From Simulation to System: Verifying AMD Versal AI Engine Designs with QEMU + XSIM Co-simulation

A step-by-step guide to install Intel Advisor and analyze a sample application and find out where Vectorization matters the most

Vectorized Quick Sort In JDK21

Idealism meets reality

Use of Volatile Keyword in Embedded Systems.

Sensor Fusion and DMIPS

Don't write the fastest code

Similar topics

Improving Code Readability in Large Projects

Explore content categories