
In which cases does the normalize('NFKC') method work?

I tried the normalize('NFKC') method with various characters, but it didn't do what I expected. NFC, on the other hand, behaves as expected: whenever possible, normalize('NFC') replaces multiple code points with a single one. For example:

let t1 = `\u00F4`; //ô
let t2 = `\u006F\u0302`; //ô
console.log(t2.normalize('NFC') == t1); //true

And here's an example with NFKC that never works:

let s1 = '\uFB00'; //"ﬀ" (single ligature character)
let s2 = '\u0066\u0066'; //"ff"
console.log(s2.normalize('NFKC') == s1); //false

I used to think that NFKC replaces multiple code points with a single compatibility-equivalent character. Put simply, I thought NFKC would replace \u0066\u0066 with \uFB00.

If NFKC doesn't work like that, then... how does it work?

Ivan asked Oct 20 '25 19:10


1 Answer

NFKC (like NFKD) applies both compatibility and canonical equivalence when it normalizes.

The Unicode Standard says:

The type of full decomposition chosen depends on which Unicode Normalization Form is involved. For NFC or NFD, one does a full canonical decomposition, which makes use of only canonical Decomposition_Mapping values. For NFKC or NFKD, one does a full compatibility decomposition, which makes use of canonical and compatibility Decomposition_Mapping values.
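To make the difference between the two decomposition types concrete, here is a small sketch that compares NFD and NFKD on one character with a canonical decomposition (U+00F4) and one with only a compatibility decomposition (U+FB00). The helper name codePoints is hypothetical and exists only to print the result readably.

```javascript
// Hypothetical helper: render a string as a space-separated list of U+XXXX code points.
const codePoints = (s) =>
  [...s]
    .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');

const oCircumflex = '\u00F4'; // "ô", canonically equivalent to U+006F U+0302
const ffLigatureChar = '\uFB00'; // "ﬀ", compatibility equivalent to U+0066 U+0066

// Canonical decomposition is applied by both NFD and NFKD.
console.log(codePoints(oCircumflex.normalize('NFD')));  // U+006F U+0302
console.log(codePoints(oCircumflex.normalize('NFKD'))); // U+006F U+0302

// Compatibility decomposition is applied only by NFKD.
console.log(codePoints(ffLigatureChar.normalize('NFD')));  // U+FB00
console.log(codePoints(ffLigatureChar.normalize('NFKD'))); // U+0066 U+0066
```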

That makes sense because, as MDN says:

All canonically equivalent sequences are also compatible, but not vice versa.

It's also worth noting that NFKC handles compatibility and canonical equivalence differently. For canonical equivalence, NFKC produces the same result as NFC. For example:

//"ô" (U+00F4) -> "o" (U+006F) + " ̂" (U+0302) -> "ô" (U+00F4)
let c1 = `\u006F\u0302`; //ô
console.log(c1.normalize('NFKC').length); //1

Compatibility normalization with this form works differently, though. The spec says:

Normalization Form KC does not attempt to map character sequences to compatibility composites. For example, a compatibility composition of “office” does not produce “o\uFB03ce”, even though “\uFB03” is a character that is the compatibility equivalent of the sequence of three characters “ffi”. In other words, the composition phase of NFC and NFKC are the same—only their decomposition phase differs, with NFKC applying compatibility decompositions.

For example:

//"ﬀ" (U+FB00) -> "f" (U+0066) + "f" (U+0066) -> "f" (U+0066) + "f" (U+0066)
let c2 = '\u0066\u0066'; //ff
console.log(c2.normalize('NFKC').length); //2
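Put differently, NFKC only goes one way for compatibility characters: from the ligature to the plain letters, never back. A minimal sketch of both directions, including the "office" case from the spec quote (the variable names here are just for illustration):

```javascript
const ffLigature = '\uFB00';      // "ﬀ" as a single compatibility character
const ffLetters = '\u0066\u0066'; // "f" + "f" as two plain letters

// Ligature -> letters: the compatibility decomposition is applied.
console.log(ffLigature.normalize('NFKC') === ffLetters); // true

// Letters -> ligature: never happens, the composition phase is canonical only.
console.log(ffLetters.normalize('NFKC') === ffLigature); // false

// The spec's example: U+FB03 (ffi ligature) is expanded, "office" is left alone.
console.log('o\uFB03ce'.normalize('NFKC') === 'office'); // true
console.log('office'.normalize('NFKC') === 'office');    // true
```

So the comparison in the question would succeed if it normalized the ligature and compared it with the two-letter string, rather than the other way around.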

Ivan answered Oct 23 '25 10:10


