NUnit

Question

I'm using NUnit v2.5 to compare strings that contain composite Unicode characters.
Although the comparison itself works fine, the caret indicating the first difference seems to be misplaced.

UPD: I've ended up with an overridden EqualConstraint that in turn invokes a custom TextMessageWriter, so I no longer need an answer. See below for the solution.

Here's the snippet:

string s1 = "ใช้งานง่าย";
string s2 = "ใช้งานงาย";
Assert.That(s1, Is.EqualTo(s2));

Here's the output:

Expected: "ใช้งานงาย"
But was:  "ใช้งานง่าย"
------------------^

The arrow indicating the first differing character seems to be off by 2 positions (as many as there are tone marks above the letters). For longer strings, it becomes a real pain.
I tried String.Normalize(), but it did not help either.
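
To illustrate where the offset may come from, here is a minimal sketch. It assumes the caret is positioned by UTF-16 index (the 18 dashes in the output above are consistent with index 7 after the 11-character `Expected: "` prefix), while the console draws each tone mark on top of the preceding consonant so that it takes no column of its own. `FirstMismatch` and `DisplayColumn` are hypothetical helper names, not part of NUnit; `StringInfo.ParseCombiningCharacters` is the standard .NET call.

```csharp
using System;
using System.Globalization;

static class CaretColumn
{
    // Hypothetical helper: index of the first differing UTF-16 code unit,
    // or -1 when the strings are identical.
    public static int FirstMismatch(string a, string b)
    {
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            if (a[i] != b[i]) return i;
        }
        return a.Length == b.Length ? -1 : n;
    }

    // Hypothetical helper: number of text elements (base character plus its
    // combining marks) that start before charIndex. Assumes the display
    // gives each text element exactly one column.
    public static int DisplayColumn(string s, int charIndex)
    {
        int column = 0;
        foreach (int start in StringInfo.ParseCombiningCharacters(s))
        {
            if (start >= charIndex) break;
            column++;
        }
        return column;
    }
}

static class CaretDemo
{
    static void Main()
    {
        string expected = "\u0E43\u0E0A\u0E49\u0E07\u0E32\u0E19\u0E07\u0E32\u0E22";       // s2 from the question
        string actual = "\u0E43\u0E0A\u0E49\u0E07\u0E32\u0E19\u0E07\u0E48\u0E32\u0E22";   // s1 from the question

        int index = CaretColumn.FirstMismatch(expected, actual);
        Console.WriteLine("UTF-16 index of first difference: " + index);
        Console.WriteLine("Display column in expected: " + CaretColumn.DisplayColumn(expected, index));
    }
}
```

The gap between the two numbers is the count of combining marks before the mismatch, which is why the caret drifts further right the more tone marks a string contains.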

~~How can I overcome this problem?~~ Thanks for your help. See my answer below.

tchrist · Accepted Answer

When you are comparing Unicode strings, you must always normalize both sides of the comparison, and in the same way. It is not good enough to do a binary comparison of s1 and s2, because canonically equivalent strings would not necessarily test as binary-equal.

Positing the existence of four trivial normalization functions, one for each of the four normalization forms, you would want to test NFD(s1) for binary equality to NFD(s2). It doesn't matter whether you use NFD or NFC there, but you must do the same thing to both strings.
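
In .NET the normalization functions posited here map onto `String.Normalize(NormalizationForm)`. A minimal sketch of the NFD-on-both-sides test, where `CanonicallyEqual` is a hypothetical helper name chosen for illustration:

```csharp
using System;
using System.Text;

static class CanonicalCompare
{
    // Hypothetical helper: normalize both sides to the same form (NFD here),
    // then compare the results code unit by code unit.
    public static bool CanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormD),
            b.Normalize(NormalizationForm.FormD),
            StringComparison.Ordinal);
    }
}

static class CanonicalCompareDemo
{
    static void Main()
    {
        string precomposed = "\u00E9";  // LATIN SMALL LETTER E WITH ACUTE
        string decomposed = "e\u0301";  // e followed by COMBINING ACUTE ACCENT

        // Binary comparison sees two different code unit sequences.
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False

        // After normalizing both sides the same way, they match.
        Console.WriteLine(CanonicalCompare.CanonicallyEqual(precomposed, decomposed));       // True
    }
}
```

In a test, the same idea would be to normalize both arguments before handing them to the assertion, for example `Assert.That(s1.Normalize(NormalizationForm.FormD), Is.EqualTo(s2.Normalize(NormalizationForm.FormD)))`.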

For the k-compat functions, NFKC and NFKD, those are useful when doing string searching, because they improve the recall at the cost of some precision. For example, NFKD("™") would be equal to NFKD("TM"). This is what Adobe Reader does, for example, when you run searches on documents: it always runs the search in k-compat mode, so that your searches have a better chance at finding things. However, unlike NFC and NFD, the k-compat functions NFKC and NFKD lose information and are not reversible. With simple NFD and NFC, though, you can always get back to the other one.
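
The difference between the canonical and the k-compat forms can be checked with a short sketch. The `Nfc`, `Nfd`, and `Nfkd` wrappers are hypothetical names for illustration; they only forward to `String.Normalize`.

```csharp
using System;
using System.Text;

static class CompatFold
{
    // Hypothetical thin wrappers around String.Normalize.
    public static string Nfc(string s) { return s.Normalize(NormalizationForm.FormC); }
    public static string Nfd(string s) { return s.Normalize(NormalizationForm.FormD); }
    public static string Nfkd(string s) { return s.Normalize(NormalizationForm.FormKD); }
}

static class CompatFoldDemo
{
    static void Main()
    {
        string tradeMark = "\u2122";  // TRADE MARK SIGN
        string eAcute = "\u00E9";     // LATIN SMALL LETTER E WITH ACUTE

        // Compatibility folding: the sign and the two letters become the same text.
        Console.WriteLine(CompatFold.Nfkd(tradeMark) == CompatFold.Nfkd("TM"));           // True

        // Canonical forms round-trip: NFD followed by NFC restores the original.
        Console.WriteLine(CompatFold.Nfc(CompatFold.Nfd(eAcute)) == eAcute);              // True

        // Compatibility forms do not: the trade mark sign cannot be recovered from "TM".
        Console.WriteLine(CompatFold.Nfc(CompatFold.Nfkd(tradeMark)) == tradeMark);       // False
    }
}
```

This is why the k-compat forms suit search indexes, where the folded text is a lookup key, but not stored data that has to be shown back to the user unchanged.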

NUnit - how to compare strings containing composite Unicode characters?

Tags:

string

unicode

localization

bytebuster
