Base64 for compression

C and C++ compilers like GCC first take your code and produce assembly, typically as pure ASCII text (so just basic English characters). This assembly is a low-level representation of the program, using mnemonic instructions specific to the target processor architecture. The compiler then passes the assembly to an assembler, which translates it into machine code: binary instructions that the processor can execute directly.

When compiling code, characters like ‘é’ in strings, such as unsigned char a[] = "é";, are typically stored as UTF-8. The UTF-8 encoding of ‘é’ is two bytes, written \303\251 as octal escapes. However, because the assembly is ASCII, expressing those two bytes as an assembly string takes 8 characters ("\303\251"). Thus, a single character in source code becomes eight characters in the compiler's assembly output.
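To see where the two bytes come from, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name utf8_encode is my own choice for illustration and is not part of GCC or of any standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name): encode one Unicode code point
   as UTF-8. Writes up to 4 bytes to out and returns the number of bytes
   written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    assert(out != NULL);
    if (cp < 0x80) {                       /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are invalid */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0xE9, buf) for ‘é’ (U+00E9) returns 2 and fills buf with 0xC3 and 0xA9, which are 303 and 251 in octal.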

As a related issue, new versions of C and C++ (C23 and C++26) have an ‘#embed’ directive that allows you to directly embed an arbitrary file in your code (e.g., an image). Such data might be encoded inefficiently as assembly.

What could you do?

Base64 is an encoding method that converts binary data into a string of printable ASCII characters, using a set of 64 characters (uppercase and lowercase letters, digits, and symbols like + and /). It is commonly used to represent binary data, such as images or files, in text-based formats like JSON, XML, or emails (MIME).
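As an illustration of the encoding itself, here is a minimal sketch of a standard (padded) base64 encoder. The names base64_encode and B64_ALPHABET are mine, and this is not the code GCC or the assembler uses; it only shows how each group of 3 bytes is split into four 6-bit values.

```c
#include <assert.h>
#include <stddef.h>

static const char B64_ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Illustrative helper (hypothetical name): encode n bytes as padded base64.
   out must have room for 4 * ((n + 2) / 3) + 1 characters.
   Returns the number of characters written, excluding the final NUL. */
size_t base64_encode(const unsigned char *in, size_t n, char *out) {
    assert(in != NULL && out != NULL);
    size_t i = 0, o = 0;
    /* Full groups: 3 bytes (24 bits) become four 6-bit indexes. */
    for (; i + 3 <= n; i += 3) {
        unsigned v = ((unsigned)in[i] << 16) | ((unsigned)in[i + 1] << 8) | in[i + 2];
        out[o++] = B64_ALPHABET[(v >> 18) & 63];
        out[o++] = B64_ALPHABET[(v >> 12) & 63];
        out[o++] = B64_ALPHABET[(v >> 6) & 63];
        out[o++] = B64_ALPHABET[v & 63];
    }
    /* Leftover 1 or 2 bytes: pad the output with '=' to a multiple of 4. */
    size_t rem = n - i;
    if (rem == 1) {
        unsigned v = (unsigned)in[i] << 16;
        out[o++] = B64_ALPHABET[(v >> 18) & 63];
        out[o++] = B64_ALPHABET[(v >> 12) & 63];
        out[o++] = '=';
        out[o++] = '=';
    } else if (rem == 2) {
        unsigned v = ((unsigned)in[i] << 16) | ((unsigned)in[i + 1] << 8);
        out[o++] = B64_ALPHABET[(v >> 18) & 63];
        out[o++] = B64_ALPHABET[(v >> 12) & 63];
        out[o++] = B64_ALPHABET[(v >> 6) & 63];
        out[o++] = '=';
    }
    out[o] = '\0';
    return o;
}
```

For instance, the three bytes of "Man" encode to the four characters "TWFu", while the two bytes of "Ma" encode to "TWE=" with one padding character.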

When starting from binary data, base64 expands the data by a third, turning every 3 input bytes into 4 ASCII characters. Interestingly, in some cases, base64 can be used for compression purposes. Older versions of GCC would compile

unsigned char a[] = "éééééééé";        

to

.string "\303\251\303\251\303\251\303\251\303\251\303\251\303\251\303\251"        

The sequences \303\251 are octal escape codes representing the bytes 0xC3 (\303 in octal) and 0xA9 (\251 in octal).
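The expansion is easy to reproduce. Below is a rough sketch of an escaper in the spirit of what the compiler emits: printable ASCII is copied as is, and every other byte becomes a backslash followed by three octal digits. The name escape_octal is hypothetical, and as a simplifying assumption I also write the quote and backslash characters as octal escapes, which is not necessarily what GCC does.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (hypothetical name): write an assembler-style
   escaped form of n bytes into out. out must have room for 4 * n + 1
   characters. Returns the number of characters written, excluding NUL. */
size_t escape_octal(const unsigned char *in, size_t n, char *out) {
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c >= 0x20 && c < 0x7F && c != '"' && c != '\\') {
            out[o++] = (char)c;            /* printable ASCII: 1 character */
        } else {
            /* Anything else: 4 characters, e.g. 0xC3 -> \303 */
            out[o++] = '\\';
            out[o++] = (char)('0' + ((c >> 6) & 7));
            out[o++] = (char)('0' + ((c >> 3) & 7));
            out[o++] = (char)('0' + (c & 7));
        }
    }
    out[o] = '\0';
    return o;
}
```

Applied to the 16 bytes of "éééééééé" (without the terminating null), it produces 64 characters, four per byte.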

GCC 15 now supports base64 encoding of data during compilation, with a new “.base64” pseudo-op. Our array now gets compiled to a much shorter string, 24 characters instead of 64:

.base64 "w6nDqcOpw6nDqcOpw6nDqQA="        
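You can check what this string contains by decoding it. The following is a minimal, lenient decoder sketch (the names b64_value and base64_decode are mine, and it simply stops at the first padding or invalid character rather than validating the input).

```c
#include <assert.h>
#include <stddef.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not in the
   standard alphabet (this includes the '=' padding character). */
static int b64_value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Illustrative helper (hypothetical name): decode a NUL-terminated base64
   string. out must have room for 3 bytes per 4 input characters.
   Returns the number of bytes written. */
size_t base64_decode(const char *in, unsigned char *out) {
    assert(in != NULL && out != NULL);
    size_t o = 0;
    unsigned acc = 0;  /* bit accumulator; only the low bits matter */
    int bits = 0;      /* number of not-yet-emitted bits in acc */
    for (; *in != '\0'; in++) {
        int v = b64_value(*in);
        if (v < 0) break;              /* padding or invalid: stop */
        acc = (acc << 6) | (unsigned)v;
        bits += 6;
        if (bits >= 8) {               /* a full byte is available */
            bits -= 8;
            out[o++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return o;
}
```

Decoding "w6nDqcOpw6nDqcOpw6nDqQA=" gives 17 bytes: the pair 0xC3, 0xA9 repeated eight times, followed by a zero byte. So the base64 form carries the string's terminating null explicitly, and it still needs only 24 characters.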
