How exactly does the "locale" param affect the result of Intl.Segmenter in JavaScript?

I had a task to parse Chinese and Japanese text and count words as accurately as possible. I didn't know about Intl.Segmenter, and when I found it, I thought "ah, I can't use it, since my text can contain many different locales at the same time". But I tried the "en" locale on a single text source containing many languages, including Chinese and Japanese, and it seemed to work fine. I'm not a native speaker of Chinese or Japanese, but the output looked correct, or at least almost correct, and that's OK for me.

So what does the locale param really do? I also checked the spec, but couldn't find anything about exactly how different locales affect Intl.Segmenter's output.
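For what it's worth, here's the kind of quick check I ran (a sketch, assuming a runtime with full ICU data, which standard Node.js builds ship): for CJK text, word segmentation appears to be dictionary-driven under the hood, so the requested locale doesn't seem to change the result.

```javascript
// Compare word segmentation of Japanese text under two different locales.
// The dictionary-based CJK segmentation is driven by the code points,
// so "en" vs "ja" doesn't appear to make a difference here.
const text = "今日は良い天気です"; // "The weather is nice today"

const words = locale =>
  Array.from(
    new Intl.Segmenter(locale, { granularity: "word" }).segment(text),
    s => s.segment
  );

console.log(words("en"));
console.log(words("ja"));
```

On my machine both calls print the same array of segments.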

Dmitry asked Sep 02 '25 16:09

1 Answer

The Intl spec doesn't have much to say about this. The description of FindBoundary, the abstract operation for finding the next boundary in some text that ultimately gets used under the hood by Intl.Segmenter's segment method, specifies essentially nothing about how boundary finding should be done and contains the following note:

Boundary determination is implementation-dependent, but general default algorithms are specified in Unicode Standard Annex #29. It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at https://cldr.unicode.org).

That doesn't shed much light. But with a bit of research, I found some examples of reasons why the rules for grapheme, word, and sentence splitting might sensibly differ between locales, some of which are actually implemented in V8 or JavaScriptCore. I list them below.
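One practical note before the examples: you can ask a segmenter which locale it actually ended up using via resolvedOptions(). If the runtime has no data for the requested locale, it negotiates a fallback, so this is a worthwhile sanity check when experimenting (a small sketch):

```javascript
// Check which locale the runtime actually negotiated for a Segmenter.
// If the requested locale isn't supported, the runtime falls back,
// and resolvedOptions().locale reveals what it fell back to.
const seg = new Intl.Segmenter("sv", { granularity: "word" });
console.log(seg.resolvedOptions());
// On a runtime with Swedish data, expect something like:
// { locale: 'sv', granularity: 'word' }
```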

(Not currently implemented) Language-specific digraphs, such as "ch" in Czech and Slovak

Some European languages have digraphs in their alphabet. For instance, the Czech alphabet looks like this, with "ch" considered to be a letter in its own right:

a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř s š t ť u ú ů v w x y ý z ž

This is most consequential for alphabetical ordering, which gets handled by locale-specific collations, but also potentially affects how you might want to split a Czech or Slovak text into characters - namely that you might want to treat "ch" as a single grapheme cluster.

(UAX #29 is kinda cagey about whether this is a good idea: "ch" is given as an example of a locale-tailored grapheme cluster, named "Slovak ch digraph", in that spec, but the tailorings in CLDR don't actually include it, and neither V8 nor JavaScriptCore treats "ch" as a single grapheme cluster under a Czech or Slovak locale. But perhaps this might change in future!)
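You can confirm the current behaviour yourself; in today's engines, a Czech locale makes no difference to grapheme splitting (sketch):

```javascript
// "ch" comes out as two separate grapheme clusters even under a
// Czech locale in current engines — no Slovak/Czech digraph tailoring.
const graphemes = locale =>
  Array.from(
    new Intl.Segmenter(locale, { granularity: "grapheme" }).segment("chléb"), // "bread"
    s => s.segment
  );

console.log(graphemes("cs")); // 'c' and 'h' are separate clusters
console.log(graphemes("en")); // same result as "cs"
```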

(Currently implemented) Colons in words in Swedish, Finnish

For text in most languages, it is reasonable to treat a colon (:) as a word break, even if there isn't a space after (or before) it. However, in Swedish and Finnish, this isn't reasonable, because colons get used in the middle of words, either:

  • as a contraction in Swedish (e.g. k:a is short for kyrka, meaning "church"), or
  • in a manner similar to an apostrophe in English, attaching a case ending to an initialism in Finnish. For instance, Google Translate renders "the USA's current president" in Finnish as "USA:n nykyinen presidentti".

Therefore, CLDR has a special case for Swedish and Finnish that makes colons not count as word breaks. You can see this working in any current implementation of Intl.Segmenter:

$ node
Welcome to Node.js v18.16.0.
Type ".help" for more information.
> segmenterEn = new Intl.Segmenter("en", {granularity: "word"})
Segmenter [Intl.Segmenter] {}
> segmenterSv = new Intl.Segmenter("sv", {granularity: "word"})
Segmenter [Intl.Segmenter] {}
> Array.from(segmenterEn.segment("foo:bar baz:qux k:a")).map(x => x.segment)
[
  'foo', ':',   'bar',
  ' ',   'baz', ':',
  'qux', ' ',   'k',
  ':',   'a'
]
> Array.from(segmenterSv.segment("foo:bar baz:qux k:a")).map(x => x.segment)
[ 'foo:bar', ' ', 'baz:qux', ' ', 'k:a' ]

(Currently implemented) Semicolons as question marks in Greek

In almost every modern language, a semicolon does not terminate a sentence, but Greek is a strange exception: the semicolon is used as a question mark, and ends a sentence.

CLDR has Greek-specific tailorings that are aware of this, and this is respected in current Intl.Segmenter implementations:

> segmenterEn = new Intl.Segmenter("en", {granularity: "sentence"})
Segmenter [Intl.Segmenter] {}
> segmenterEl = new Intl.Segmenter("el", {granularity: "sentence"})
Segmenter [Intl.Segmenter] {}
> Array.from(segmenterEn.segment("гдѣ єсть рождeйсѧ царь їудeйскій; Τι είναι μια διασύνδεση;")).length
1
> Array.from(segmenterEl.segment("гдѣ єсть рождeйсѧ царь їудeйскій; Τι είναι μια διασύνδεση;")).length
2

(Church Slavonic, the language of the Eastern Orthodox Church, also uses semicolons as question marks, just like Greek, according to Wikipedia, but this isn't currently respected by Unicode CLDR or Intl.Segmenter for some reason. (CLDR bug report))

(Not currently implemented) Dotted abbreviations

Lots of languages, including English, use dots/periods (.) to indicate abbreviations, including sometimes by putting a dot at the end of a word. This is confusing to a sentence segmenter, because without a dictionary of such words, it can't tell the difference between a dot that is being used to terminate a sentence and one that is being used to indicate an abbreviation.

For instance, It's nice to see you, Mr. Smith. is a single sentence, but a naive sentence segmenter will think the dot in "Mr." is a sentence terminator.

To address this, CLDR contains per-language dictionaries of exceptions to the rule that a dot should terminate a sentence - see e.g. https://github.com/unicode-org/cldr/blob/main/common/segments/en.xml. (I note, though, that the choice of inclusions in at least the English dictionary seems pretty arbitrary, and it doesn't look like anyone has ever made even a modest effort to make the list not suck. For instance, some month abbreviations like "Sept." are included while others like "Oct." are not; a vast number of military titles like "Lt.Cdr." are included but more common personal titles like "Jr." and "Sr." are missing; and "Hon.B.A." is included while "Hon." on its own is omitted.) In principle, an Intl.Segmenter implementation could use these (or its own lists) to avoid inappropriately treating abbreviations as segment breaks.

In practice, though, they don't, at least today:

> segmenterEn = new Intl.Segmenter("en", {granularity: "sentence"})
Segmenter [Intl.Segmenter] {}
> Array.from(segmenterEn.segment("It's nice to see you, Mr. Smith.")).map(x => x.segment)
[ "It's nice to see you, Mr. ", 'Smith.' ]
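If this bites you in practice, one possible workaround (a rough sketch with a hypothetical, hand-rolled abbreviation list, not part of any standard API) is to post-process the segments, merging a sentence back into its predecessor when the predecessor ends with a known abbreviation:

```javascript
// Rough workaround sketch: merge sentence segments that were split
// right after a known abbreviation. The abbreviation list here is a
// hypothetical stand-in; a real one would need to be much larger.
const ABBREVIATIONS = new Set(["Mr.", "Mrs.", "Dr.", "e.g.", "i.e."]);

function splitSentences(text, locale = "en") {
  const seg = new Intl.Segmenter(locale, { granularity: "sentence" });
  const parts = Array.from(seg.segment(text), s => s.segment);
  const merged = [];
  for (const part of parts) {
    const prev = merged[merged.length - 1];
    // If the previous segment ends with a known abbreviation,
    // this "sentence" was a false split: glue it back on.
    if (prev !== undefined &&
        ABBREVIATIONS.has(prev.trimEnd().split(/\s+/).pop())) {
      merged[merged.length - 1] = prev + part;
    } else {
      merged.push(part);
    }
  }
  return merged;
}

console.log(splitSentences("It's nice to see you, Mr. Smith. How are you?"));
```

This obviously can't be perfect either (a sentence really can end with "Mr." in weird cases), but it fixes the common false splits.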

Some links for further research

In case you want to do further research yourself, here are some places to start looking that I referred to when constructing the list of examples above:

  • Unicode Standard Annex #29: UNICODE TEXT SEGMENTATION
  • The common/segments/ folder of CLDR
  • The two open source implementations of Intl.Segmenter
    • the one in V8 (used by Chrome, Edge, and Node.js)
    • the one in JavaScriptCore (used by Safari)
  • The brkitr data in ICU, which V8 turns out to use under the hood

To be honest, I found it frustratingly difficult to find good examples of locale-specific segmentation rules from the sources above. I found it infuriating, in particular, that the unit tests for Intl.Segmenter in V8 and JavaScriptCore do not include a single demonstration of how behaviour differs based on locale code, and I found that some of the relevant-looking discussions I read from Unicode people were red herrings. For instance, if you start researching this, you will doubtless see plenty of mention of Indic languages and how in some but not all of them, characters on either side of a virama should be considered a single joined-together character. And sure enough that is how viramas are handled in existing Intl.Segmenter implementations... but there doesn't seem to be any locale-specific behaviour, since every language's virama has its own distinct Unicode code point anyway, and so the segmenter can decide how to handle a virama based on code point without needing to consider the locale setting.
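For what it's worth, you can check the locale-independence of virama handling yourself. How many clusters you get depends on the Unicode/ICU version your runtime ships, but the locale shouldn't matter (sketch):

```javascript
// Devanagari क्ष is KA (U+0915) + VIRAMA (U+094D) + SSA (U+0937).
const conjunct = "\u0915\u094D\u0937";

const clusters = locale =>
  Array.from(
    new Intl.Segmenter(locale, { granularity: "grapheme" }).segment(conjunct),
    s => s.segment
  );

// Driven by the code points, not the locale: "en" and "hi" agree.
console.log(clusters("en"));
console.log(clusters("hi"));
```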

Nonetheless, there may be other cool examples I've missed of genuinely locale-dependent segmentation behaviour. Good luck finding them!

Mark Amery answered Sep 05 '25 04:09