2017-02-02

Fallbacks in ICU4C Converters

Unicode’s ICU version 59 is well underway at this point. Ideally everything would use Unicode, but many systems, and much content, still use non-Unicode encodings. For this reason, ICU, in both its C/C++ and Java flavors, has rich support for codepage conversion.

One of many great features in ICU is the callback support. A lot can go wrong during codepage conversion, but in ICU, you can control what happens during exceptional situations.

Let’s try a simple sample. By the way, see the end of this post for hints on compiling the samples.

Substitute, Always

Our task is to convert black-bird (but with a U+00AD, “Soft Hyphen” in between the two words) to ASCII.

substituteTest-0.cpp

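#include <unicode/utypes.h>
#include <unicode/ustdio.h>
#include <unicode/ucnv.h>
#include <unicode/unistr.h> // UnicodeString

// Needed with ICU 61 and later, where the icu namespace is no longer imported by default.
using icu::LocalUConverterPointer;
using icu::UnicodeString;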

int main(int /*argc*/, const char * /*argv*/ []) {
    UErrorCode status=U_ZERO_ERROR;

    LocalUConverterPointer cnv(ucnv_open("us-ascii", &status));
    if(U_FAILURE(status)) {
        u_printf("Error opening: %s\n", u_errorName(status));
        return 1;
    }
    UnicodeString str("black-bird");
    str.setCharAt(5, 0x00AD); // soft hyphen
    const UChar *uch = str.getTerminatedBuffer();
    u_printf("Input String: %S length %d\n", uch, str.length());

    char bytes[1024];
    int32_t bytesWritten =
     ucnv_fromUChars(cnv.getAlias(), bytes, 1024, uch, -1, &status);

    if(U_FAILURE(status)) {
        u_printf("Error converting: %s\n", u_errorName(status));
        return 1;
    }

    u_printf("Converted %d bytes\n", bytesWritten);
    for(int32_t i=0; i<bytesWritten; i++) {
        u_printf("\\x%02X ", bytes[i]&0xFF);
    }
    u_printf("\n");
    // try to print it out on the console
    bytes[bytesWritten]=0; // terminate it first
    puts(bytes);

    return 0; // LocalUConverterPointer will cleanup cnv
}

Output:

Input String: blackbird length 10
Converted 9 bytes
\x62 \x6C \x61 \x63 \x6B \x62 \x69 \x72 \x64 
blackbird

Hm. Ten characters in, nine out. What happened? Well, U+00AD is not part of ASCII. ASCII is a seven-bit encoding, so it only maps code points U+0000 through U+007F, inclusive. Furthermore, U+00AD is Default Ignorable, and as of ICU 54.1 (2014), in #10551, the soft hyphen can just be dropped.
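To make the "ten in, nine out" arithmetic concrete, here is a minimal sketch of that default outcome in plain C++, with no ICU involved. The function name toAsciiDefaultSketch is hypothetical, and the sketch special-cases U+00AD only, whereas ICU consults the full Default Ignorable property. Treat it as a model of the behavior described above, not as ICU's implementation.

```cpp
#include <string>

// Hypothetical helper, not an ICU API: models the default outcome described
// above for a US-ASCII target. Assumption: only U+00AD is treated as
// Default Ignorable here; the real property covers many more code points.
std::string toAsciiDefaultSketch(const std::u32string &input) {
    std::string out;
    for (char32_t c : input) {
        if (c <= 0x7F) {
            out.push_back(static_cast<char>(c)); // in range: copy the byte
        } else if (c == 0x00AD) {
            continue;                            // ignorable: drop silently
        } else {
            out.push_back('\x1A');               // anything else: ASCII SUB
        }
    }
    return out;
}
```

Fed the ten code points of the sample (black, U+00AD, bird), this returns the nine bytes of blackbird, matching the output above.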

But what if, for some reason, you don’t want the soft hyphen dropped? The pre-ICU 54.1 behavior can easily be brought back with a custom callback. So, roll up your collective sleeves, and:

alwaysSubstitute.h

// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html

#include <unicode/ucnv.h>
#include <unicode/ucnv_err.h>
#include <unicode/ucnv_cb.h>

/**
 * This is a modified version of ICU’s UCNV_FROM_U_CALLBACK_SUBSTITUTE.
 * It substitutes unconditionally whenever the reason is unassigned,
 * illegal, or irregular input, so the soft hyphen is no longer dropped.
 *
 * Usage:
 *   ucnv_setFromUCallBack(c, UCNV_FROM_U_CALLBACK_SUBSTITUTE_ALWAYS, NULL, NULL, NULL, &status);
 */
U_CAPI void U_EXPORT2
UCNV_FROM_U_CALLBACK_SUBSTITUTE_ALWAYS (
                 const void *context,
                 UConverterFromUnicodeArgs *fromArgs,
                 const UChar* codeUnits,
                 int32_t length,
                 UChar32 codePoint,
                 UConverterCallbackReason reason,
                 UErrorCode * err)
{
    /* These parameters are unused; the casts silence compiler warnings. */
    (void)context;
    (void)codeUnits;
    (void)length;
    (void)codePoint;
    if (reason <= UCNV_IRREGULAR) {
        /* Unassigned, illegal, or irregular: clear the error and write the substitution character. */
        *err = U_ZERO_ERROR;
        ucnv_cbFromUWriteSub(fromArgs, 0, err);
    }
    /* else ignore the reset, close and clone calls. */
}

If we #include this little header, and set it on the converter before we convert…

LocalUConverterPointer cnv(ucnv_open("us-ascii", &status));
ucnv_setFromUCallBack(cnv.getAlias(), UCNV_FROM_U_CALLBACK_SUBSTITUTE_ALWAYS, NULL, NULL, NULL, &status);
…

… we get the following result:

Input String: blackbird length 10
Converted 10 bytes
\x62 \x6C \x61 \x63 \x6B \x1A \x62 \x69 \x72 \x64 
black?bird

Great! Now, we are getting \x1A (ASCII SUB). It works.
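The general idea, handing the converter a policy that decides what to do with input it cannot map, can be modelled in a few lines of plain C++. This is a toy sketch only: toAsciiWith, UnmappableHandler, skipUnmappable and substituteAlways are hypothetical names, not ICU APIs, and the real callback signature is the one shown in the header above.

```cpp
#include <functional>
#include <string>

// Hypothetical types and names throughout; a toy stand-in for the idea of a
// from-Unicode callback, not ICU's actual mechanism.
using UnmappableHandler = std::function<void(char32_t, std::string &)>;

// Copies ASCII code points through and defers everything else to the handler.
std::string toAsciiWith(const std::u32string &input,
                        const UnmappableHandler &onUnmappable) {
    std::string out;
    for (char32_t c : input) {
        if (c <= 0x7F) {
            out.push_back(static_cast<char>(c));
        } else {
            onUnmappable(c, out); // the policy decides what, if anything, to emit
        }
    }
    return out;
}

// Policy 1: emit nothing, similar in spirit to ICU's UCNV_FROM_U_CALLBACK_SKIP.
void skipUnmappable(char32_t, std::string &) {}

// Policy 2: always emit ASCII SUB, similar in spirit to the custom callback above.
void substituteAlways(char32_t, std::string &out) {
    out.push_back('\x1A');
}
```

With the soft-hyphen input, toAsciiWith(input, skipUnmappable) yields nine bytes and toAsciiWith(input, substituteAlways) yields ten, with \x1A in the sixth position.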

When missing goes missing

A related question to the above has to do with converting from codepage to Unicode. That’s a better direction anyway. Convert to Unicode and stay there! One can hope. In any event…

For this task, we will convert 0x61, 0x80, 0x94, 0x4c, 0xea, 0xe5 from Shift-JIS to Unicode.

substituteTest-2.cpp

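#include <unicode/utypes.h>
#include <unicode/ustdio.h>
#include <unicode/ucnv.h>

// Needed with ICU 61 and later, where the icu namespace is no longer imported by default.
using icu::LocalUConverterPointer;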

int main(int /*argc*/, const char * /*argv*/ []) {
    UErrorCode status=U_ZERO_ERROR;

    LocalUConverterPointer cnv(ucnv_open("shift-jis", &status));
    if(U_FAILURE(status)) {
        u_printf("Error opening: %s\n", u_errorName(status));
        return 1;
    }
    #define NRBYTES 6
    const uint8_t bytes[NRBYTES] = { 0x61, 0x80, 0x94, 0x4c, 0xea, 0xe5 };

    u_printf("Input Bytes: length %d\n", NRBYTES);

    #define NRUCHARS 50
    UChar uchars[NRUCHARS];

    int32_t ucharsRead =
     ucnv_toUChars(cnv.getAlias(), uchars, NRUCHARS, (const char*)bytes, NRBYTES, &status);

    if(U_FAILURE(status)) {
        u_printf("Error converting: %s\n", u_errorName(status));
        return 1;
    }

    u_printf("Converted %d uchars\n", ucharsRead);
    for(int32_t i=0; i<ucharsRead; i++) {
        u_printf("U+%04X ", uchars[i]);
    }
    u_printf("\n");
    // try to print it out on the console
    u_printf("Or string: '%S'\n", uchars);

    return 0; // LocalUConverterPointer will cleanup cnv
}

Output:

Input Bytes: length 6
Converted 4 uchars
U+0061 U+001A U+732B U+FFFD 
Or string: 'a猫�'

So, the letter "a" byte \x61 turned into U+0061, and then we have an illegal byte \x80 which turned into U+001A. Next, the valid sequence \x94 \x4c turns into U+732B which is 猫 (“cat”). Finally, the unmapped sequence \xea \xe5 turns into U+FFFD. Notice that the single-byte illegal sequence turned into SUB (U+001A), but the two-byte sequence turned into U+FFFD. This is discussed somewhat here.
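The walk-through above can be sketched as a toy decoder in plain C++. Everything here is an assumption made for illustration: decodeSketch and isLeadByte are hypothetical names, the lead-byte ranges are the commonly cited Shift-JIS ones, trail bytes are not validated, and the "mapping table" contains only the single pair from this post. The point is just to show one-byte failures becoming U+001A and two-byte failures becoming U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only; not ICU code. Assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
static bool isLeadByte(std::uint8_t b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

std::vector<char16_t> decodeSketch(const std::vector<std::uint8_t> &bytes) {
    std::vector<char16_t> out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const std::uint8_t b = bytes[i];
        if (b <= 0x7F) {
            out.push_back(static_cast<char16_t>(b));        // ASCII range, copied as-is
        } else if (b >= 0xA1 && b <= 0xDF) {
            // single-byte half-width katakana, U+FF61 onward
            out.push_back(static_cast<char16_t>(0xFF61 + (b - 0xA1)));
        } else if (isLeadByte(b) && i + 1 < bytes.size()) {
            const std::uint8_t trail = bytes[++i];          // consume the trail byte
            if (b == 0x94 && trail == 0x4C) {
                out.push_back(static_cast<char16_t>(0x732B)); // the one mapped pair
            } else {
                out.push_back(static_cast<char16_t>(0xFFFD)); // two-byte failure
            }
        } else {
            out.push_back(static_cast<char16_t>(0x001A));   // one-byte failure: SUB
        }
    }
    return out;
}
```

Given the six input bytes from the sample, the sketch produces the same four code units as the real converter: U+0061, U+001A, U+732B, U+FFFD.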

So far so good?

But what if you actually want U+FFFD as the substitution character for both sequences? This would be unexpected, but perhaps you have code that is particularly looking for U+FFFDs. We can write a similar callback:

alwaysFFFD.h

// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html

#include <unicode/ucnv.h>
#include <unicode/ucnv_err.h>
#include <unicode/ucnv_cb.h>

static const UChar kFFFD[] = { 0xFFFD };

/**
 * This is a modified version of ICU’s UCNV_TO_U_CALLBACK_SUBSTITUTE.
 * It unconditionally substitutes U+FFFD, whatever the length of the
 * offending byte sequence.
 *
 * Usage:
 *   ucnv_setToUCallBack(c, UCNV_TO_U_CALLBACK_SUBSTITUTE_FFFD, NULL, NULL, NULL, &status);
 */
U_CAPI void U_EXPORT2
UCNV_TO_U_CALLBACK_SUBSTITUTE_FFFD (
                 const void *context,
                 UConverterToUnicodeArgs *toArgs,
                 const char* codeUnits,
                 int32_t length,
                 UConverterCallbackReason reason,
                 UErrorCode * err)
{
    /* These parameters are unused; the casts silence compiler warnings. */
    (void)context;
    (void)codeUnits;
    (void)length;
    if (reason <= UCNV_IRREGULAR) {
        /* Unassigned, illegal, or irregular: clear the error and write U+FFFD. */
        *err = U_ZERO_ERROR;
        ucnv_cbToUWriteUChars(toArgs, kFFFD, 1, 0, err);
        /* see ucnv_cbToUWriteSub() */
    }
    /* else ignore the reset, close and clone calls. */
}

Let’s hook it up, as before:

LocalUConverterPointer cnv(ucnv_open("shift-jis", &status));
ucnv_setToUCallBack(cnv.getAlias(), UCNV_TO_U_CALLBACK_SUBSTITUTE_FFFD, NULL, NULL, NULL, &status);
…

And drumroll please…

Input Bytes: length 6
Converted 4 uchars
U+0061 U+FFFD U+732B U+FFFD 
Or string: 'a�猫�'

Garbage out never looked so good…

Building (or, nothing-up-my-sleeve)

To build these little snippets, I recommend the shell script icurun.

If ICU is already installed in the appropriate paths (visible to pkg-config or at least icu-config), you can simply run:

icurun some-great-app.cpp

… and icurun will compile and run a one-off.

If, however, you’ve built ICU yourself in some directory, you can instead use:

icurun -i path/to/your/icu some-great-app.cpp

… where path/to/your/icu is the full path to an ICU build or install directory.

If you are on Windows… well, there isn’t a PowerShell version yet. Contributions welcome!