Formatting a string: translators’ hell
(Time to read: 15 minutes. Basic to intermediate programming knowledge required)
Before I go into the main content, a short introduction: what is the very first thing (almost) every C++ course starts with? Well, if we skip setting up the compiler and environment (which can easily eat up the entire first lesson!), the most typical code every future native programmer sees is a "Hello, world!" printed through std::cout.
So we see streams. Streams are a concept that grew up in the 1970s and 80s with the rise of Unix, when someone had the idea that the output of one application could be the input of another, without the original author having to anticipate it. The idea was brilliant at the time (I am serious) and it spawned an entire family of divide-and-conquer solutions and, to make things more interesting, added a few new words to the dictionary (for example, interoperability).
Unfortunately, even though applications were happily cooperating with each other and dividing up their responsibilities, it was still assumed that all of them were written in English. But at some point someone in the USA noticed that on the other side of the Great Ocean there is Europe and there (oh God) people who put inverted question marks before questions, plus a whole lot of other language deviations that were unimaginable to the programmers of the New World. It was obvious that storming this new, virgin land successfully would require translations.
And here lies the very first trap in cout… It is very hard to pull the language strings in from the outside! Let us try the following example:
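A minimal sketch of such code, with a made-up message (the function name copied_message and the wording are assumptions for illustration only). A std::ostringstream stands in for cout so the result can be returned, but the operator<< chain is exactly what you would write with cout. Notice how the sentence is chopped into three literals scattered between the arguments:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical message: the English text is baked into the code as three
// separate literals, interleaved with the two arguments.
std::string copied_message(int count, const std::string& folder) {
    assert(count >= 0);  // a negative file count would make no sense here
    std::ostringstream out;
    out << "Copied " << count << " files to " << folder << ".";
    return out.str();
}
```

Calling copied_message(3, "Documents") gives "Copied 3 files to Documents.", and there is no single place where a translator could hook in.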
How the hell could we replace the texts in those strings with another language? Where would we get them from? Of course, the universe abhors a vacuum, and brave programmers invented the idea of string tables!
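A string table could look roughly like the sketch below. The Fragments type, the english table and the message itself are hypothetical; the only point is that the literals moved out of the stream expression into data:

```cpp
#include <array>
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical string table: the three fragments that surround the two
// arguments of one message.
using Fragments = std::array<std::string, 3>;

const Fragments english = {"Copied ", " files to ", "."};

std::string copied_message(const Fragments& text, int count,
                           const std::string& folder) {
    assert(count >= 0);  // a negative file count would make no sense here
    std::ostringstream out;
    // The texts are data now, but the order "count, then folder" is still code.
    out << text[0] << count << text[1] << folder << text[2];
    return out.str();
}
```

With the english table, copied_message(english, 3, "Documents") still produces "Copied 3 files to Documents.".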
Of course, now the texts are stored in variables and not in literal constants, so we can swap them at runtime (for example, read them from a file). Looks reasonable enough? Okay, so let’s try it in German!
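Here is a sketch of what happens, assuming a hypothetical three-fragment table for a made-up "files copied" message. A natural German sentence wants the folder first and the count second, but the streaming code knows nothing about that:

```cpp
#include <array>
#include <cassert>
#include <sstream>
#include <string>

using Fragments = std::array<std::string, 3>;  // hypothetical string table

// Natural German: "In den Ordner <folder> wurden <count> Dateien kopiert."
const Fragments german = {"In den Ordner ", " wurden ", " Dateien kopiert."};

std::string copied_message(const Fragments& text, int count,
                           const std::string& folder) {
    assert(count >= 0);  // a negative file count would make no sense here
    std::ostringstream out;
    // The argument order is hard-coded: count first, folder second.
    out << text[0] << count << text[1] << folder << text[2];
    return out.str();
}

// copied_message(german, 3, "Documents") yields
// "In den Ordner 3 wurden Documents Dateien kopiert."
```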
Oh. The order of the arguments has changed! And this is unsolvable with streams, which cannot rearrange that order (well, we could code every possible permutation… good luck with 10 arguments and their 3,628,800 orderings!).
Ah, I can hear the more experienced developers telling me now: "That is the perfect place for printf with POSIX extensions!" With that we can have an entire language item in one string, with an arbitrary number of arguments, and we can rearrange them at will.
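As a rough sketch (copied_message and the message text are made up for illustration), the whole sentence becomes one format string that can be loaded at runtime. Keep in mind that a format string coming from a file must be trusted, because printf believes whatever conversions it finds there:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a hypothetical "files copied" message from ONE translatable string.
// The format is expected to consume an int and a C string.
std::string copied_message(const char* format, int count, const char* folder) {
    // The first call only measures: snprintf returns the length it would need.
    const int length = std::snprintf(nullptr, 0, format, count, folder);
    assert(length >= 0);  // a negative value signals a formatting error
    std::string text(static_cast<std::size_t>(length), '\0');
    // The second call writes the text (+1 leaves room for the final '\0').
    std::snprintf(&text[0], text.size() + 1, format, count, folder);
    return text;
}
```

For example, copied_message("Copied %d files to %s.", 3, "Documents") returns "Copied 3 files to Documents.".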
Good, it is one string and not three, as it was with cout. And we can even rearrange the order via POSIX positional tags (%1$d, %2$s and so on):
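A sketch of the positional variant, again with a made-up message and a hypothetical copied_message helper. The %n$ notation is a POSIX extension rather than ISO C, so this assumes a C library that supports it (glibc does); do not count on it everywhere:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical translations: argument 1 is the count, argument 2 the folder.
const char* const english = "Copied %1$d files to %2$s.";
const char* const german = "In den Ordner %2$s wurden %1$d Dateien kopiert.";

std::string copied_message(const char* format, int count, const char* folder) {
    // The call site never changes; only the format decides the order.
    const int length = std::snprintf(nullptr, 0, format, count, folder);
    assert(length >= 0);  // a negative value signals a formatting error
    std::string text(static_cast<std::size_t>(length), '\0');
    std::snprintf(&text[0], text.size() + 1, format, count, folder);
    return text;
}
```

On such a library, copied_message(german, 3, "Documents") returns "In den Ordner Documents wurden 3 Dateien kopiert.", with the arguments swapped by the translation alone.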
"We’ve got it!" shouted the happy programmers of Microsoft, and in C# they reduced the complexity of the notation to numbered placeholders such as {0} and {1}.
And they lived happily ever after until the end of their days…
Well… that lasted only until someone discovered an armed nuke: there are languages with multiple plural forms! The problem was how to tell apart (during the translation phase) cases like these in Polish: 2 kartki, 5 kartek, 12 kartek, 22 kartki.
The change of the word "kartek" into "kartki" depends on the entire number in front of it, not only on its last digit (12 kartek, but 22 kartki). Okaaay…
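To make the rule concrete, here is a small sketch of the Polish plural selection written out by hand. The function names are mine; the condition is the commonly used three-form rule for Polish (one form for 1, a second for numbers ending in 2 to 4 except 12 to 14, a third for everything else):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns 0 for "kartka", 1 for "kartki", 2 for "kartek".
std::size_t polish_plural_index(unsigned long n) {
    if (n == 1) return 0;
    const unsigned long last = n % 10;
    const unsigned long last_two = n % 100;
    // 2, 3, 4, 22, 23, 24... but not 12, 13, 14.
    if (last >= 2 && last <= 4 && (last_two < 10 || last_two >= 20)) return 1;
    return 2;
}

std::string sheets(unsigned long n) {
    static const char* const forms[] = {"kartka", "kartki", "kartek"};
    const std::size_t index = polish_plural_index(n);
    assert(index < 3);  // the rule above only ever produces three forms
    return std::to_string(n) + " " + forms[index];
}
```

So sheets(5) is "5 kartek" and sheets(22) is "22 kartki", while sheets(12) goes back to "12 kartek". A single "%d sheet(s)" style string has no way to express that.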
Of course, brave programmers saved the world via the open source movement and found an answer to that: the gettext library, which is dedicated to translations and knows how to pick the right form of a word for each plural category.
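A minimal sketch of the gettext way, using ngettext from libintl.h. The helper sheets_message and the English strings are made up, and no catalog is loaded here, so this assumes the documented fallback: without a translation, ngettext returns the first string for n == 1 and the second one otherwise. On glibc the functions are part of the C library; elsewhere you may have to link with -lintl:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <libintl.h>  // GNU gettext API (ngettext)
#include <string>

std::string sheets_message(unsigned long n) {
    // ngettext chooses the plural form; a translated catalog may offer
    // more than two forms, selected by its Plural-Forms rule.
    const char* format = ngettext("%lu sheet", "%lu sheets", n);
    const int length = std::snprintf(nullptr, 0, format, n);
    assert(length >= 0);  // a negative value signals a formatting error
    std::string text(static_cast<std::size_t>(length), '\0');
    std::snprintf(&text[0], text.size() + 1, format, n);
    return text;
}
```

For Polish, the catalog header would carry a rule along the lines of "Plural-Forms: nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2);" and the translator supplies three strings instead of two.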
We’ve got it!
…Dang.
The brave gettext developers skipped one very important aspect: a word may change its form for reasons that have nothing to do with plurals. Let us check the following example, also in Polish, with a brand name:
Of course, inserting this into a string will generate very rough Polish: "Przeczytaj informacje o Firefox", while it should have been "Przeczytaj informacje o Firefoksie".
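The problem can be sketched like this. Everything here is hypothetical (the {brand} placeholder, the substitute helper, the case names); it is not any library’s API, just an illustration that the brand name needs more than one form and that the translated sentence has to be able to pick one:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Replaces the first occurrence of `placeholder` in `pattern` with `value`.
std::string substitute(std::string pattern, const std::string& placeholder,
                       const std::string& value) {
    const std::size_t at = pattern.find(placeholder);
    assert(at != std::string::npos);  // the pattern must mention the placeholder
    return pattern.replace(at, placeholder.size(), value);
}

// Hypothetical brand entry with one form per grammatical case.
const std::map<std::string, std::string> firefox = {
    {"nominative", "Firefox"},
    {"locative", "Firefoksie"},
};

const std::string pattern = "Przeczytaj informacje o {brand}";

// Plain substitution: grammatically rough Polish.
std::string naive() {
    return substitute(pattern, "{brand}", firefox.at("nominative"));
}

// What the translator actually needs: the locative form after "o".
std::string inflected() {
    return substitute(pattern, "{brand}", firefox.at("locative"));
}
```

naive() gives "Przeczytaj informacje o Firefox" and inflected() gives "Przeczytaj informacje o Firefoksie". The catch is that the choice of case belongs to the translation, not to the C++ code, and that is exactly what printf-style and gettext-style strings cannot express.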
What can we do about it? Well, not much… there is no translation library on the market right now that could win this fight (!). We are talking about C++/C# of course; such a library does exist (the previous example was not a coincidence), written in JavaScript by the makers of Firefox. You can find it here: https://projectfluent.org/
When I have a bit of spare time, I plan to port it to C++ under a (really) permissive license, like MIT/X11.
So we can finally live happily ever after until the end of our days… right?
FUBAR. Not.
The problem is that our translators are human, and humans tend to make mistakes. The most common ones are translating too much of the text, or not knowing the basics of algorithms (well, that one is pretty common; even some programmers have problems there). Let us take an example directly from Fluent, where the translator takes:
And translates it into:
Everyone who has seen a translation of an IT book with translated source code (sometimes even the keywords!) will understand…
That is why I intend to modify the Fluent syntax a bit in the C++ port:
- Each markup token will be stored inside brackets; each translatable text will be stored outside. The rule has no exceptions.
- If you are a translator, the idea is simple: you only translate things outside the brackets.
- All markup tokens will contain formatting information (prefixes, currencies, etc.); I’m using ICU/Boost.Locale as the back-end for that.
- The items will be identified not by key, but by URI. My dream world is an engine that takes a URI and launches the app in preview mode, showing where the text behind the given URI is used and what its context looks like. That way a translator only has to click a link (without even closing Excel) to see the exact place where the item is used. See the following example:
Clicking on: pixie://message-box.dialog/client-not-found?client=Marek#button_ok
Will open the Pixie Engine app and automatically load MessageBoxView, with the OK button highlighted. The URI parameters (client=Marek) do not take part in the language lookup, but they are used to populate the View with sample data.
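The split between "lookup" and "sample data" could be sketched as below. This is not Pixie Engine code: TextUri and parse_text_uri are names invented for this example, and I am assuming that the lookup key is simply the URI with its "?..." part cut out (the "#fragment" stays, since it names the highlighted element). No percent-decoding is attempted:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct TextUri {
    std::string lookup_key;                      // identifies the text item
    std::map<std::string, std::string> samples;  // preview data only
};

TextUri parse_text_uri(const std::string& uri) {
    assert(!uri.empty());  // an empty URI cannot identify any text
    TextUri result;
    const std::size_t question = uri.find('?');
    if (question == std::string::npos) {
        result.lookup_key = uri;  // no parameters at all
        return result;
    }
    std::size_t hash = uri.find('#', question);
    if (hash == std::string::npos) hash = uri.size();

    // Assumption: the key is everything except the "?..." part.
    result.lookup_key = uri.substr(0, question) + uri.substr(hash);

    // Split "name=value&name=value" into the sample-data map.
    const std::string query = uri.substr(question + 1, hash - question - 1);
    std::size_t start = 0;
    while (start < query.size()) {
        std::size_t end = query.find('&', start);
        if (end == std::string::npos) end = query.size();
        const std::string pair = query.substr(start, end - start);
        const std::size_t equals = pair.find('=');
        if (equals != std::string::npos) {
            result.samples[pair.substr(0, equals)] = pair.substr(equals + 1);
        }
        start = end + 1;
    }
    return result;
}
```

For the URI above this yields the key pixie://message-box.dialog/client-not-found#button_ok and one sample value, client = Marek, which the preview would feed into the View.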
Sounds good to you? Okay, it is 40% done right now.
Stay tuned!
And if you have your own ideas on how to make Fluent even better, please share them in the comments!
For the sake of completeness, some examples: