I had a task to parse Chinese and Japanese text and count words as precisely as possible. I didn't know about Intl.Segmenter, and once I found it, I thought "ah, I can't use it, since my text can contain many different locales at the same time". But I tried the "en" locale on text mixing several languages, including Chinese and Japanese, in a single source, and it seems to work fine. I'm not a native speaker of Chinese or Japanese, but the result looked correct, or at least nearly correct, and that's OK for me.
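Roughly what I ended up doing, simplified (the sample string here is just something I made up for illustration):

// Count word-like segments in mixed-language text, using the "en" locale throughout.
const text = "Hello 世界 こんにちは world"; // English + Chinese + Japanese in one string
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
// With word granularity, each segment carries an isWordLike flag that is true for
// actual words and false for whitespace/punctuation, so filtering on it gives a word count.
const words = Array.from(segmenter.segment(text)).filter(s => s.isWordLike);
console.log(words.length, words.map(s => s.segment));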
So what does the locale parameter really do? I also checked the spec, but didn't find any information on exactly how different Intl.Segmenter locales affect the result.
The Intl spec doesn't have much to say about this. The description of FindBoundary, the abstract operation for finding the next boundary in some text that ultimately gets used under the hood by Intl.Segmenter's segment method, specifies essentially nothing about how boundary finding should be done, and contains the following note:
Boundary determination is implementation-dependent, but general default algorithms are specified in Unicode Standard Annex #29. It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at https://cldr.unicode.org).
That doesn't shed much light. But with a bit of research, I found some examples of reasons why grapheme, word, and sentence splitting rules might sensibly differ between locales, some of which are actually implemented in V8 or JavaScriptCore. I list them below.
Some European languages have digraphs in their alphabet. For instance, the Czech alphabet looks like this, with "ch" considered to be a letter in its own right:
a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř s š t ť u ú ů v w x y ý z ž
This is most consequential for alphabetical ordering, which gets handled by locale-specific collations, but also potentially affects how you might want to split a Czech or Slovak text into characters - namely that you might want to treat "ch" as a single grapheme cluster.
(UAX #29 is kinda cagey about whether this is a good idea; "ch" is suggested as an example of a locale-tailored grapheme cluster named "Slovak ch digraph" in that spec, but the tailorings in CLDR don't actually include it, and neither V8 nor JavaScriptCore treats "ch" as a single grapheme cluster when using a Czech or Slovak locale. But perhaps this might change in future!)
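If you want to check this yourself, here's roughly what I'd expect from a current Node/V8 REPL (the Czech words "chata" ("cottage") and "hora" ("mountain") are just my own examples; the collator comparison shows the locale mattering for ordering, while the grapheme segmenter still splits "ch" in both locales):

> new Intl.Collator("en").compare("chata", "hora")
-1
> new Intl.Collator("cs").compare("chata", "hora")  // "ch" sorts after "h" in Czech
1
> Array.from(new Intl.Segmenter("en", {granularity: "grapheme"}).segment("ch")).map(x => x.segment)
[ 'c', 'h' ]
> Array.from(new Intl.Segmenter("cs", {granularity: "grapheme"}).segment("ch")).map(x => x.segment)
[ 'c', 'h' ]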
For text in most languages, it is reasonable to treat a colon (:) as a word break, even if there isn't a space after (or before) it. However, in Swedish and Finnish, this isn't reasonable, because colons get used in the middle of words: in abbreviations like the Swedish "k:a" (short for "kyrka", "church"), and when attaching suffixes to initialisms, as in the Finnish "EU:n" ("the EU's").
Therefore, CLDR has a special case for Swedish and Finnish that makes colons not count as word breaks. You can see this working in any current implementation of Intl.Segmenter:
$ node
Welcome to Node.js v18.16.0.
Type ".help" for more information.
> segmenterEn = new Intl.Segmenter("en", {granularity: "word"})
Segmenter [Intl.Segmenter] {}
> segmenterSv = new Intl.Segmenter("sv", {granularity: "word"})
Segmenter [Intl.Segmenter] {}
> Array.from(segmenterEn.segment("foo:bar baz:qux k:a")).map(x => x.segment)
[
'foo', ':', 'bar',
' ', 'baz', ':',
'qux', ' ', 'k',
':', 'a'
]
> Array.from(segmenterSv.segment("foo:bar baz:qux k:a")).map(x => x.segment)
[ 'foo:bar', ' ', 'baz:qux', ' ', 'k:a' ]
In almost every modern language, a semicolon does not terminate a sentence, but Greek is a strange exception: the semicolon is used as a question mark, and ends a sentence.
CLDR has Greek-specific tailorings that are aware of this, and this is respected in current Intl.Segmenter implementations:
> segmenterEn = new Intl.Segmenter("en", {granularity: "sentence"})
Segmenter [Intl.Segmenter] {}
> segmenterEl = new Intl.Segmenter("el", {granularity: "sentence"})
Segmenter [Intl.Segmenter] {}
> Array.from(segmenterEn.segment("гдѣ єсть рождeйсѧ царь їудeйскій; Τι είναι μια διασύνδεση;")).length
1
> Array.from(segmenterEl.segment("гдѣ єсть рождeйсѧ царь їудeйскій; Τι είναι μια διασύνδεση;")).length
2
(Church Slavonic, the liturgical language of the Eastern Orthodox Church, also uses semicolons as question marks, just like Greek, according to Wikipedia, but this isn't currently respected by Unicode CLDR or Intl.Segmenter for some reason. (CLDR bug report))
Lots of languages, including English, use dots/periods (.) to indicate abbreviations, sometimes by putting a dot at the end of a word. This is confusing for a sentence segmenter, because without a dictionary of such words, it can't tell the difference between a dot that terminates a sentence and one that marks an abbreviation.
For instance, "It's nice to see you, Mr. Smith." is a single sentence, but a naive sentence segmenter will think the dot in "Mr." is a sentence terminator.
To address this, CLDR contains per-language dictionaries of exceptions to the rule that a dot should terminate a sentence - see e.g. https://github.com/unicode-org/cldr/blob/main/common/segments/en.xml. (I note, though, that the choice of inclusions in at least the English dictionary seems pretty arbitrary, and it doesn't look like anyone has ever made even a modest effort to make the list not suck. For instance, some month abbreviations like "Sept." are included while others like "Oct." are not; a vast number of military titles like "Lt.Cdr." are included but more common personal titles like "Jr." and "Sr." are missing; and "Hon.B.A." is included while "Hon." on its own is omitted.) In principle, an Intl.Segmenter implementation could use these (or its own lists) to avoid inappropriately treating abbreviations as segment breaks.
In practice, though, they don't, at least today:
> segmenterEn = new Intl.Segmenter("en", {granularity: "sentence"})
Segmenter [Intl.Segmenter] {}
> Array.from(segmenterEn.segment("It's nice to see you, Mr. Smith.")).map(x => x.segment)
[ "It's nice to see you, Mr. ", 'Smith.' ]
In case you want to do further research yourself, here are some places to start looking that I referred to when constructing the list of examples above:
- the common/segments/ folder of CLDR
- the Intl.Segmenter unit tests in V8 and JavaScriptCore
To be honest, I found it frustratingly difficult to find good examples of locale-specific segmentation rules from the sources above. I found it infuriating, in particular, that the unit tests for Intl.Segmenter in V8 and JavaScriptCore do not include a single demonstration of how behaviour differs based on locale code, and I found that some of the relevant-looking discussions I read from Unicode people were red herrings.

For instance, if you start researching this, you will doubtless see plenty of mention of Indic languages and how in some but not all of them, characters on either side of a virama should be considered a single joined-together character. And sure enough that is how viramas are handled in existing Intl.Segmenter implementations... but there doesn't seem to be any locale-specific behaviour, since every language's virama has its own distinct Unicode code point anyway, and so the segmenter can decide how to handle a virama based on code point without needing to consider the locale setting.
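If you want to poke at this yourself, here's a quick check (a sketch; क्ष is the Devanagari sequence ka + virama + ssa, and per the above I'd expect both calls to return the same clustering, whatever that clustering happens to be on your engine's ICU version):

const graphemes = (locale, text) =>
  Array.from(new Intl.Segmenter(locale, { granularity: "grapheme" }).segment(text)).map(s => s.segment);

// Same input, two locales: the results should match, because virama handling is
// driven by the code points themselves rather than by the locale setting.
console.log(graphemes("en", "क्ष"));
console.log(graphemes("hi", "क्ष"));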
Nonetheless, there may be other cool examples I've missed of genuinely locale-dependent segmentation behaviour. Good luck finding them!