For compatibility reasons I need to replicate the behaviour of another application. It is using Unicode strings as identifiers but ignoring case and performing some sort of normalisation. By intercepting API calls I have determined it is using CompareStringW(LOCALE_USER_DEFAULT, NORM_IGNORECASE, SORT_STRINGSORT, ...)
to do the comparison.
I can just call this function directly for every pair of strings in the set of strings I am considering but I would prefer a canonical form I can use in a hash table.
CompareStringW
uses with those flags set? Is it a standard Unicode algorithm?NormalizeString
and FoldString
to generate this canonical form? If i can, what arguments do I need to pass?edit: As David Heffernan pointed out I'll need to use FoldString
as well as NormalizeString
to do a proper caseless comparison.
The great Michael Kaplan (RIP) provided a lot of useful information about the NLS functions on his blog over the years and some posts have been archived before Microsoft made the blog internal only.
His A few of the gotchas of CompareString post provides descriptions of these flags:
NORM_IGNORECASE - Ignore case. A better name for this flag might have been IGNORE_TERTIARYWEIGHT since that is what it accomplishes (it masks the tertiary weight), although it is obviously too late to consider such a change. It can cause undesirable results when used in the comparison of strings containing characters that depend on the weight for vital information, which thankfully is a very small number of cases. But if you are not expecting "ʏ", "Y", and "y" (U+028f, U+0059, and U+0079, a.k.a. LATIN LETTER SMALL CAPITAL Y, LATIN LETTER CAPITAL Y, and LATIN LETTER SMALL Y) to all be equal, then you may want to think twice about throwing this flag into the mix. You will also lose the distinctions of the final forms for Hebrew (e.g. "מ" and "ם", U+05de U+05dd a.k.a. HEBREW LETTER MEM and HEBREW LETTER FINAL MEM), Arabic (e.g. "ش" U+0634 a.k.a. ARABIC LETTER SHEEN and its isolated, final, initial, and medial forms (ﺵ, ﺶ, ﺷ, and ﺸ) at U+feb5, U+feb6, U+feb7, and U+feb8, and other languages.
SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:
//
// Sorting Flags.
//
// WORD Sort: culturally correct sort
// hyphen and apostrophe are special cased
// example: "coop" and "co-op" will sort together in a list
//
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// co-op <------- hyphen (punctuation)
// cork
// went
// were
// we're <------- apostrophe (punctuation)
//
//
// STRING Sort: hyphen and apostrophe will sort with all other symbols
//
// co-op <------- hyphen (punctuation)
// co_op <------- underscore (symbol)
// coat
// comb
// coop
// cork
// we're <------- apostrophe (punctuation)
// went
// were
//
The results might also vary depending on the Windows version because newer versions supports later Unicode versions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With