Some Unicode Notes

Every now and again I find myself reading up on character sets, usually when I'm doing some kind of heavy text processing.

When I first started out programming I used to tell myself, "Just use ASCII, it's not like I write programs for people who speak other languages, right?" Well, as the Web has become more multilingual, that thinking has become increasingly hard to justify.

UTF-8 has some terms and notes I need to keep in mind:

  • UTF-8 can use 1, 2, 3, or 4 bytes to encode a character, that is 8, 16, 24, or 32 bits. Because of this it's called a variable-width character encoding.
  • A UTF-8 file that contains only characters in the ASCII range is identical to an ASCII file.
  • The Basic Multilingual Plane (BMP), or Plane 0, contains the most commonly used characters for nearly all modern scripts in the world.
  • It contains the code points 0000 - FFFF (hexadecimal).
  • A breakdown of the groups can be found here.
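To make the variable-width point concrete, here is a minimal sketch that counts the UTF-8 bytes needed for a few sample characters. It assumes a runtime with the standard `TextEncoder` global (modern browsers and Node.js); the helper name `utf8ByteLength` is my own and not part of any library.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: number of bytes UTF-8 needs for the given text.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// U+0041, U+00E9, U+20AC and U+1F600, written as escapes so the
// source file's own encoding cannot change the result.
const samples = ['\u0041', '\u00E9', '\u20AC', '\u{1F600}'];
for (const ch of samples) {
  console.log(ch, utf8ByteLength(ch)); // 1, 2, 3 and 4 bytes respectively
}

// ASCII-only text produces exactly the ASCII byte values.
const asciiBytes = Array.from(encoder.encode('Hi'));
console.log(asciiBytes); // [ 72, 105 ]
```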

Aside from UTF-8 there are also UTF-16 and UTF-32.

Neither is backwards compatible with ASCII. That is, if you convert an ASCII file to UTF-16 or UTF-32, its contents at the byte level are going to change (but not the actual data).

UTF-16 is variable width like UTF-8 but uses either one or two 16-bit values (code units) for each code point.

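ASCII, of course, only uses one byte for each of its characters (strictly, 7 bits stored in an 8-bit byte), while UTF-16 and UTF-32 need at least two and four bytes respectively, hence the backwards incompatibility.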

I should mention that a code point is the numeric value assigned to a character in a character set. Think of a character set as a huge array where the code point is an index.
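The "array index" idea can be tried out directly in JavaScript. This is just an illustrative sketch using the standard `codePointAt` and `String.fromCodePoint` methods; the variable names are my own.

```javascript
// Look up the code point ("index") of a character.
const euroSign = '\u20AC'; // the euro sign
const euroCodePoint = euroSign.codePointAt(0);
console.log(euroCodePoint.toString(16)); // 20ac

// Go the other way: from a code point back to the character.
const letterA = String.fromCodePoint(0x41);
console.log(letterA); // A
```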

UTF-32 is fixed width and each character is encoded using 32 bits (a whole 4 bytes!). Converting an ASCII file into UTF-16 or UTF-32 encoding is going to result in a larger size: roughly double for UTF-16 and four times for UTF-32. That goes for databases as well.
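A rough sketch of the size difference, assuming Node.js (it relies on the `Buffer` global). Node's `Buffer` has no UTF-32 encoding as far as I know, so the UTF-32 figure is simply computed as four bytes per code point. The function name `encodedSizes` is hypothetical.

```javascript
// Node.js only: compare the encoded size of the same text.
function encodedSizes(text) {
  // Array.from iterates a string by code point, not by 16-bit unit.
  const codePointCount = Array.from(text).length;
  return {
    utf8: Buffer.byteLength(text, 'utf8'),
    utf16: Buffer.byteLength(text, 'utf16le'),
    utf32: codePointCount * 4, // computed, not actually encoded
  };
}

console.log(encodedSizes('hello')); // { utf8: 5, utf16: 10, utf32: 20 }

// The byte-level change for a single ASCII letter in UTF-16 (little endian).
const utf16Bytes = Array.from(Buffer.from('A', 'utf16le'));
console.log(utf16Bytes); // [ 65, 0 ]
```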

I think what trips me up are statements like "JavaScript/ECMAScript supports UTF-8" that I saw online when I was first learning the language. 

It had me assuming that the escape sequences \u0000 - \uFFFF are how you use UTF-8 code points in strings. I even wrote that here in an earlier draft.

That's not really accurate. All strings in JavaScript are sequences of 16-bit values (code units) that are treated as UTF-16. The spec only talks about those 16-bit values and does not get into the details of how an implementation stores them.
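One place this shows up is `length`, which counts 16-bit units rather than characters. A small sketch using a character outside the BMP (U+1F600, picked only as an example):

```javascript
const smiley = '\u{1F600}'; // one code point outside the BMP

// length counts 16-bit code units, so this single character reports 2.
console.log(smiley.length); // 2

// Iterating the string (here via Array.from) goes by code point instead.
console.log(Array.from(smiley).length); // 1

// The two 16-bit values that are actually stored.
console.log(smiley.charCodeAt(0).toString(16)); // d83d
console.log(smiley.charCodeAt(1).toString(16)); // de00
```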

When a 16-bit value in the range 0xD800 - 0xDBFF (a high surrogate) is immediately followed by one in the range 0xDC00 - 0xDFFF (a low surrogate), the two are called a surrogate pair, and the spec describes an algorithm for calculating the resulting code point.
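The calculation for a surrogate pair is the standard UTF-16 one. Below is a minimal sketch of it; the function name `decodeSurrogatePair` is my own, and in real code the built-in `codePointAt` already does this for you.

```javascript
// Combine a high and a low surrogate into the code point they represent.
function decodeSurrogatePair(high, low) {
  if (high < 0xD800 || high > 0xDBFF) {
    throw new RangeError('not a high surrogate');
  }
  if (low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError('not a low surrogate');
  }
  // Each surrogate carries 10 bits; the result starts at 0x10000,
  // the first code point beyond the BMP.
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const combined = decodeSurrogatePair(0xD83D, 0xDE00);
console.log(combined.toString(16)); // 1f600

// Same answer as the built-in method.
console.log(combined === '\uD83D\uDE00'.codePointAt(0)); // true
```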

The `\uXXXX` syntax is actually for writing a single 16-bit value, typically a character of the BMP that may not be on your keyboard. At runtime, it's still a 16-bit value in memory.
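A short sketch of what the escapes produce. The `\u{...}` form is the ES2015 code point escape, which I'm adding here for comparison; the characters chosen are only examples.

```javascript
// A BMP character written as an escape: one 16-bit value.
const eAcute = '\u00E9';
console.log(eAcute.length); // 1
console.log(eAcute === String.fromCharCode(0xE9)); // true

// A character outside the BMP needs two \uXXXX escapes (a surrogate pair)...
const pairEscape = '\uD83D\uDE00';

// ...or a single ES2015 code point escape. Both give the same string.
const codePointEscape = '\u{1F600}';
console.log(pairEscape === codePointEscape); // true
console.log(codePointEscape.length); // 2
```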

Where does UTF-8 come in? It's your actual source code! The JS files that you serve or pass to node are generally expected to be UTF-8 encoded. This is consistent with the common recommendation that everything on the Web be UTF-8 encoded.
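To picture the hand-off from UTF-8 bytes to a runtime string, here is a small sketch that decodes raw UTF-8 bytes with the standard `TextDecoder`. It only illustrates the idea; it is not a description of how any particular engine loads source files.

```javascript
// The four UTF-8 bytes that encode U+1F600.
const utf8Bytes = new Uint8Array([0xF0, 0x9F, 0x98, 0x80]);

// Decode the bytes into a JavaScript string.
const decoded = new TextDecoder('utf-8').decode(utf8Bytes);

// Four bytes on disk become two 16-bit values in the string.
console.log(decoded.length); // 2
console.log(decoded.codePointAt(0).toString(16)); // 1f600
```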

There may be opportunities here for subtle bugs and even security issues but I have not taken the time to properly understand them yet.

I do know that mixing up the encoding of your data in a MariaDB/MySQL server can cause your queries to give the wrong results. I have also heard of XSS and SQL injection bypasses that work by exploiting misconfigured character encodings.
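Here is a minimal sketch of the general kind of mix-up I mean: text written as UTF-8 but read back as Latin-1. It assumes Node.js (for `Buffer`) and does not involve a database at all, so treat it as an illustration of the failure mode rather than of what MariaDB/MySQL actually does.

```javascript
// Node.js only. The text is stored as UTF-8 bytes...
const original = '\u00E9'; // e with an acute accent
const storedBytes = Buffer.from(original, 'utf8'); // bytes c3 a9

// ...but read back as if it were Latin-1 (one byte per character).
const misread = storedBytes.toString('latin1');

// One character has turned into two different ones.
console.log(misread.length); // 2
console.log(misread === '\u00C3\u00A9'); // true

// Any comparison against the original value now fails.
console.log(misread === original); // false
```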

I'm starting to get a better understanding as to how, but again I have not dug into it yet.

This post originally appeared on my personal blog.

