I am interested in using the word iterator of the ICU 63 library in a JavaScript project (in a browser). After reading the docs, I believe that ICU uses UTF-16 by default, which is the same as JS, so it would save me from re-encoding JS strings into something else.
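To make the UTF-16 point concrete: ICU's UnicodeString and JavaScript strings both index by 16-bit code units, so a character outside the Basic Multilingual Plane occupies two units on both sides. The sketch below uses only the standard library; `utf16Units` and `codePoints` are hypothetical helper names, not ICU functions.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units: the same count JavaScript's String.length reports.
std::size_t utf16Units(const std::u16string &s) {
    return s.size();
}

// Number of Unicode code points: every unit counts except the low
// (trailing) half of a surrogate pair, which belongs to the unit before it.
std::size_t codePoints(const std::u16string &s) {
    std::size_t n = 0;
    for (char16_t c : s) {
        if (c < 0xDC00 || c > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For a string such as u"a\U0001F600b", the two counts differ because the emoji is stored as a surrogate pair. Boundary offsets expressed in code units can therefore be used directly as JavaScript string indices.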
The first step was to build a wrapper with the only function that I need (I don't know yet whether it works):
#include "emscripten.h"
#include <string.h>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <unicode/errorcode.h>
using namespace icu_63; 
// extern "C" keeps the exported symbol name unmangled so JavaScript can find it
extern "C" EMSCRIPTEN_KEEPALIVE
int splitWords(const char *locale, const uint16_t *text, uint16_t *splitted) {
    //Note that Javascript is working in UTF-16
    //icu::
    UnicodeString result = UnicodeString();
    UnicodeString visibleSpace = UnicodeString(" ");
    int32_t previousIdx = 0;
    int32_t idx = -1;
    //Create a Unicode String from input
    UnicodeString uTextArg = UnicodeString(text);
    if (uTextArg.isBogus()) {
        return -1; // input string is bogus
    }
    //Create and init the iterator
    UErrorCode err = U_ZERO_ERROR;
    BreakIterator *iter = BreakIterator::createWordInstance(locale, err);
    if (U_FAILURE(err)) {
        return -2; // cannot build iterator
    }
    iter->setText(uTextArg);
    //Iterate and store results
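    while ((idx = iter->next()) != BreakIterator::DONE) {
        //Each segment starts at the previous boundary and ends at the current one
        UnicodeString word = UnicodeString(uTextArg, previousIdx, idx - previousIdx);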
        result += word;
        result += visibleSpace;
        previousIdx = idx;
    }
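    result.trim();
    delete iter; //the iterator was heap-allocated by createWordInstance
    //The buffer holds UTF-16 code units (2 bytes each). Copy only the used
    //length, not the capacity, and null-terminate the output.
    //The caller must provide a buffer large enough for the result.
    memcpy(splitted, result.getBuffer(), result.length() * sizeof(char16_t));
    splitted[result.length()] = 0;
    return 0;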
}
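The two parts of the wrapper that do not depend on ICU, slicing the text by boundary offsets and copying the result into the caller's buffer, can be checked natively before any WebAssembly build. The following is a minimal sketch using only the standard library. `joinSegments`, `copyUtf16` and the `capacity` parameter are hypothetical names introduced here, and the boundary offsets are assumed to come from something like the break iterator.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: cuts `text` at the given ascending boundary offsets
// (in UTF-16 code units) and joins the pieces with a single space.
std::u16string joinSegments(const std::u16string &text,
                            const std::vector<std::size_t> &boundaries) {
    std::u16string out;
    std::size_t previous = 0;
    for (std::size_t end : boundaries) {
        if (end <= previous || end > text.size()) {
            continue;  // skip empty or out-of-range segments
        }
        if (!out.empty()) {
            out += u' ';
        }
        // A segment starts at the previous boundary, not at the current one.
        out.append(text, previous, end - previous);
        previous = end;
    }
    return out;
}

// Hypothetical helper: copies `src` into `dst`, whose size is `capacity`
// 16-bit units including room for the terminating 0. Returns the number of
// units written (without the terminator), or -1 if the buffer is too small.
int copyUtf16(const std::u16string &src, uint16_t *dst, std::size_t capacity) {
    if (dst == nullptr || capacity < src.size() + 1) {
        return -1;
    }
    std::memcpy(dst, src.data(), src.size() * sizeof(char16_t));
    dst[src.size()] = 0;
    return static_cast<int>(src.size());
}
```

Passing the buffer capacity explicitly is a design choice of this sketch. The wrapper above has no such parameter, so the JavaScript side has to allocate a buffer that is certainly large enough.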
It compiles, but linking fails with missing symbols, and I have no clue how to proceed.
libICU seems to need a lot of built-in data. In my case, the frequency tables are mandatory for using the word iterator.
Should I copy my wrapper into the ICU source folder and try to figure out how to use emconfigure? Or is it possible to link against libicu when I compile my wrapper? The second option looks like a waste of data, as I am not interested in the larger portion of the lib.
In my experience, the easiest way to deal with libraries is to build them with emconfigure/emmake first and then link them statically with your own code, like this:
$ emcc your_wrapper.cpp \
       your_compiled_libICU_static_lib.a \
       -o result.js
Compiling libraries with emconfigure/emmake is sometimes quite hard, because you may need to modify the source code to make it work in WebAssembly.
But... good news! Emscripten provides ports of some popular and complicated libraries, and ICU is one of them.
You can compile your code without building ICU yourself by using the -s USE_ICU=1 flag:
$ emcc your_wrapper.cpp \
       -s USE_ICU=1 \
       -s ERROR_ON_UNDEFINED_SYMBOLS=0 \
       -std=c++11
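The caveat is that the Emscripten ICU port is ICU 62, so you need to change using namespace icu_63; to using namespace icu_62; (or use the unversioned alias, using namespace icu;, which the ICU headers define for whichever version is installed).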
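If you are unsure which ICU version a given toolchain actually picks up, the headers can tell you at compile time. This is a small sketch: `icuMajorVersion` and `HAVE_ICU_HEADERS` are hypothetical names, while `U_ICU_VERSION_MAJOR_NUM` is the macro ICU defines in its version headers. The `__has_include` guard is only there so the snippet still compiles where the ICU headers are not on the include path.

```cpp
// Detect the ICU headers without failing the build when they are absent.
#if defined(__has_include)
#  if __has_include(<unicode/uversion.h>)
#    include <unicode/uversion.h>
#    define HAVE_ICU_HEADERS 1
#  endif
#endif

// Hypothetical helper: returns the major version of the ICU headers seen by
// the compiler, or 0 when no ICU headers were found.
int icuMajorVersion() {
#ifdef HAVE_ICU_HEADERS
    return U_ICU_VERSION_MAJOR_NUM;
#else
    return 0;
#endif
}
```

Calling this from a small test program built with the same flags as the wrapper shows which versioned namespace (icu_62, icu_63, ...) the headers will expose.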
While -s USE_ICU=1 is convenient when you can easily modify your build flags, I've found it more convenient to install ICU from source. I also had to build other libraries whose configure/make processes do not play nicely with -s USE_ICU=1 (at least not without plenty of modification) and instead expect a more traditional way to find and link to the ICU libs.
Unfortunately, building libicu does not seem to work with the usual configure && make install without some tweaking. First you have to do a "regular" native build (./configure && make) to create the necessary local files; this native build directory is what --with-cross-build points at in the commands that follow.
Then, if you do not need pthreads, you can build in a fairly straightforward manner as follows, assuming /opt/wasm is your PREFIX:
PKG_CONFIG_LIBDIR=/opt/wasm/lib/pkgconfig emconfigure ./configure \
       --prefix=/opt/wasm \
       --with-cross-build=`pwd` \
       --enable-static=yes \
       --enable-shared=no \
       --target=wasm32-unknown-emscripten \
       --with-data-packaging=static \
       --enable-icu-config \
       --enable-extras=no \
       --enable-tools=no \
       --enable-samples=no \
       --enable-tests=no
emmake make clean install
If you do need pthreads for some downstream consumer of the lib, you might have to rebuild the lib with that enabled from the start. This is trickier because configure scripts break when they run tests that build and run C snippets: the warnings about requiring additional node flags (see https://github.com/emscripten-core/emscripten/issues/15736) are treated as errors by the configure scripts. The easiest solution I found was to temporarily modify make_js_executable in emcc.py:
  ...
  with open(script, 'w') as f:
    # f.write('#!%s\n' % cmd); ## replaced with the below line
    f.write('#!%s --experimental-wasm-threads --experimental-wasm-bulk-memory\n' % cmd)
    f.write(src)
  ...
With that hack in place, you can proceed with something like the following (though possibly not all of those thread-related flags are strictly needed):
CXXFLAGS='-s PTHREAD_POOL_SIZE=8 -s USE_PTHREADS=1 -O3 -pthread' \
CFLAGS='-s PTHREAD_POOL_SIZE=8 -s USE_PTHREADS=1 -O3 -pthread' \
FORCE_LIBS='-s PTHREAD_POOL_SIZE=8 -s USE_PTHREADS=1 -pthread -lm' \
PKG_CONFIG_LIBDIR=/opt/wasm/lib/pkgconfig emconfigure ./configure \
       --prefix=/opt/wasm \
       --with-cross-build=`pwd` \
       --enable-static=yes \
       --enable-shared=no \
       --target=wasm32-unknown-emscripten \
       --with-data-packaging=static \
       --enable-icu-config \
       --enable-extras=no \
       --enable-tools=no \
       --enable-samples=no \
       --enable-tests=no
emmake make clean install
After that, set your emcc.py back to its original state. Note that if you try to build the tools, they will fail (I haven't yet found a solution to that), but the lib does install successfully with the above.