Translating ICU4C with Globalization Pipeline


This is a work in progress. If you read to the end, you’ll see we almost reached our goal here.


I work on ICU4C (the premier C/C++ library for Unicode support). And I work on Globalization Pipeline. These two haven’t really crossed paths… until now.

What we’ll do

This post covers how to use Globalization Pipeline to translate uconv, one of ICU’s sample command-line apps. We'll be translating the resource files you can see in source/extra/uconv/resources.

Gathering the tools

  • First, download the ICU4C source code (as a tarball or from the SVN repository) and compile it. See its readme for more details.

  • Now, set up Globalization Pipeline. See our Quick Start Guide for getting your Globalization Pipeline instance created and set up.

    • In the GP dashboard, create a bundle named uconv. Select which languages you want to translate into, but don’t upload any strings. Click Save.

    • Also from the Bluemix dashboard, get the service credentials for your service. Save these in a file called mycreds.json that should look like the example in this document.

  • We’ll also need the gp-cli java tool, so download the latest jar from gp-java-tools.

Into the Pipeline

Now, let's get some translated content.

Hm. These files are in icu4c resource format, which isn't (yet?) supported by Globalization Pipeline… directly. Let's try an interim step.

  • genrb -x root root.txt
  • genrb -x fr fr.txt

Now we have root.xlf and fr.xlf (for good measure).

Here's a snippet of root.xlf:

<group id = "root" restype = "x-icu-table">
<source>Buffer overflow</source>

OK. The gp-cli tool says it handles XLIFF as a file format. Let's get that set up.

  • java -jar gp-cli-1.1.0.jar import -b uconv -f root.xlf -l en -t xliff -j mycreds.json

Note that we use the language tag en for English here, while the file was originally entitled root. This is because Globalization Pipeline works with the explicit source language, whereas for ICU, root is what will be consulted as a fallback if no other languages are available.

It says it uploaded… but let’s check in the Globalization Pipeline dashboard:

English content uploaded

OK! That’s great. Browsing over to the other language translations, we can see that the MT engines are hard at work. However, we happen to already have some French translations in the ICU source base. We'll upload these to overwrite some of the machine-translated entries for French:

  • java -jar gp-cli-1.1.0.jar import -b uconv -f fr.xlf -l fr -t xliff -j mycreds.json

Great. Now we have some human translated content as well. We can now correct, upload/download content in the dashboard until we are happy with the translations there.

Out of the Pipeline

OK, now for the next step: getting those translations back into ICU4C.

We can list the bundle status from the command line:

  • java -jar gp-cli-1.1.0.jar show-bundle -b uconv -j mycreds.json
"sourceLanguage": "en",
"targetLanguages": [
"readOnly": false,
"updatedBy": "…",
"updatedAt": "2016-07-14T15:22:40.233-07"

Now, we’ll download the files in XLIFF format again:

  • java -jar gp-cli-1.1.0.jar export -b uconv -f fr.xlf -l fr -t xliff -j mycreds.json
  • java -jar gp-cli-1.1.0.jar export -b uconv -f es.xlf -l es -t xliff -j mycreds.json
  • java -jar gp-cli-1.1.0.jar export -b uconv -f de.xlf -l de -t xliff -j mycreds.json
  • java -jar gp-cli-1.1.0.jar export -b uconv -f zh.xlf -l zh-Hans -t xliff -j mycreds.json

… and so on. Repeat for each language you wish to download. Note that for Chinese the output file is named zh.xlf, which is the locale name we want on the ICU side, while the language tag passed to -l is zh-Hans.
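Typing one export command per language gets tedious, so here is a minimal sketch that prints them instead. The language map and the helper name exportArgs are my own illustration, not part of gp-cli; the arguments simply mirror the commands shown above.

```javascript
// export-all.js: print one gp-cli export command per language (dry run only).
// ASSUMPTION: this map is illustrative; list the languages of your own bundle.
// Keys are Globalization Pipeline language tags, values are ICU file base names.
const gpToIcu = { fr: 'fr', es: 'es', de: 'de', 'zh-Hans': 'zh' };

// Build the argument list for one export, mirroring the commands above.
// exportArgs is a hypothetical helper name, not a gp-cli API.
function exportArgs(gpTag, icuName, type) {
  const ext = (type === 'xliff') ? 'xlf' : type; // xliff -> .xlf, json -> .json
  return ['-jar', 'gp-cli-1.1.0.jar', 'export', '-b', 'uconv',
          '-f', icuName + '.' + ext,
          '-l', gpTag, '-t', type, '-j', 'mycreds.json'];
}

for (const gpTag of Object.keys(gpToIcu)) {
  console.log('java ' + exportArgs(gpTag, gpToIcu[gpTag], 'xliff').join(' '));
}
```

To actually run the exports rather than print them, the same argument arrays could be handed to child_process.execFileSync('java', args, { stdio: 'inherit' }).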

OK, we have XLIFF format. How do we convert it to ICU format? genrb only writes XLIFF; it can’t read it.

And back again… almost.

We need the XLIFF2ICU Converter as is noted here.

To build it, at present, this worked for me:

  • download ICU4J source (yes, J)
  • run ant xliff from the top level
  • you will end up with an out/xliff/lib/xliff.jar

Still with me? Head back to the uconv/resources directory, and now run:

  • java -jar xliff.jar -s . -d . fr.xlf

And that brings us to…

Processing file: ./fr.xlf
The XLIFF document is invalid, please check it first:
Line 3, Column 81
Error: cvc-elt.1: No se ha encontrado la declaración del elemento 'xliff'.

Hrm. Seems like the XLIFF output isn't quite ready to be consumed. I filed a bug on this, of course.
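A cvc-elt.1 error means the schema validator could not find a declaration for the root element, which often points at a version or namespace difference in the xliff element itself. I have not confirmed that this is the cause here, but a quick way to look is to compare the root attributes of the file genrb wrote with those of the exported file. The sketch below is a rough regex-based check, not a real XML parser, and the sample header is hypothetical.

```javascript
// xliff-root.js: show the attributes of the <xliff ...> root element.
// A rough diagnostic only; it does not validate or fully parse the XML.
function xliffRootAttrs(xml) {
  const m = /<xliff\b([^>]*)>/.exec(xml);
  if (!m) return null; // no <xliff> element found
  const attrs = {};
  // Accept name="value" or name='value', with optional spaces around '='
  const re = /([\w:.-]+)\s*=\s*(["'])(.*?)\2/g;
  let a;
  while ((a = re.exec(m[1])) !== null) {
    attrs[a[1]] = a[3];
  }
  return attrs;
}

// HYPOTHETICAL sample header, for illustration only
const sample = '<?xml version="1.0"?>\n' +
  '<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">\n</xliff>';
console.log(xliffRootAttrs(sample));
```

In practice you would read root.xlf and the exported fr.xlf with fs.readFileSync(name, 'utf8') and compare the two results side by side.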

Plan B

We're so close… let's see what we can do. What if we fetch the data in JSON format, and then hack up something to convert it to ICU format? It might suffice for this blog post.

Let's run the fetches again, but get JSON this time:

  • java -jar gp-cli-1.1.0.jar export -b uconv -f fr.json -l fr -t json -j mycreds.json

Now, run the following Node.js script over the JSON files:

  • node js2icu.js fr.json es.json …
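// js2icu.js
// Convert Globalization Pipeline JSON exports into ICU resource bundle source (.txt).
const fs = require('fs');
const args = process.argv.slice(2);
for (const f of args) {
  console.log('# read ' + f);
  // The locale name is the file name up to the first dot, e.g. fr.json -> fr
  const loc = f.split('.')[0];
  const json = JSON.parse(fs.readFileSync(f, 'utf8'));
  // Start with a BOM and a coding comment so the file is read as UTF-8
  let s = '\ufeff// -*- Coding: utf-8; -*-\n//auto converted\n' + loc + '\n{\n';
  for (const k in json) {
    // One table entry per key; escape embedded double quotes
    s = s + ' ' + k + ' { "' + json[k].replace(/"/g, '\\"') + '" }\n';
  }
  s = s + '}\n';
  console.log('# wrote ' + loc + '.txt');
  fs.writeFileSync(loc + '.txt', s);
}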

You should be the proud owner of .txt files matching all of the languages you are using.
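One caveat with the quick script: it only escapes double quotes, so a translation containing a backslash or a line break would end up raw inside the quoted string. Here is a slightly more careful sketch of the quoting step. It assumes genrb accepts the usual backslash escapes inside quoted strings; icuQuote and icuEntry are names I made up for this post.

```javascript
// Escape a JavaScript string for use inside a double-quoted ICU resource string.
// ASSUMPTION: genrb accepts the usual backslash escapes (\\, \", \n, \r, \t).
function icuQuote(str) {
  return '"' + str
    .replace(/\\/g, '\\\\') // backslashes first, so later escapes are not doubled
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r')
    .replace(/\t/g, '\\t') + '"';
}

// Format one key/value pair as a line of an ICU table,
// in the same layout the conversion script writes.
function icuEntry(key, value) {
  return ' ' + key + ' { ' + icuQuote(value) + ' }\n';
}

console.log(icuEntry('greeting', 'He said "hi"\n'));
```

Swapping icuEntry(k, json[k]) into the inner loop of the conversion script would be the only change needed.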

We're almost there. Let's go up and build uconv:

  • cd ..

Now edit and change the RESSRC line to reference the new translations:


Build uconv

  • make


Let’s test it. I know uwmsg.o isn't really utf-8, that's why this is a test.

  • env LC_ALL=es ./../../bin/uconv -f utf-8 < uwmsg.o
La conversión a Unicode de página de códigos falló en posición de byte de entrada de 0. Bytes: Error de cf: El carácter ilegal encontró La conversión a Unicode de página de códigos falló en posición de byte de entrada de 1. ……

Looks like we have a (more) translated uconv now. Some of the messages don’t quite work correctly due to ICU4C message conventions. Perhaps we will investigate this in the future.

Perl on Bluemix

Quick Start

  1. Marcus DelGreco at #FluentConf said something about Perl support on platforms. I mentioned that Bluemix allows you to bring your own buildpack.

  2. Looking through the buildpack lists didn't turn up Perl per se but…

  3. … enter sourcey-buildpack. It's a generic buildpack! From its README I knew I was in the right spot:

    Isn't it simply amazing to see these demos, where they throw a bunch of php, ruby, Java or python code at a Cloud Foundry site and it gets magically turned into a running web applications. Alas for me, life is often a wee bit more complicated than that. My projects always seem to required a few extra libraries or they are even written in an dead scripting language like Perl.

  4. And now for that tl;dr-inspiring moment:

Let's see if the canned sample works. Hint: yes.

First, cf login into Bluemix, and then:

$ git clone
$ cd sourcey-buildpack/example
$ cf push MYAPPLICATION$$ -m 128M -b

The above builds perl (takes a while the first time) and deploys a little app that just dumps the deserialized JSON out.


But wait! It could be even simpler. So, I opened PR oetiker/sourcey-buildpack#2, which adds a manifest file to the example. With that, only cf push is needed; the -b … option is no longer necessary.