
Unexpected behavior with StartsWith and unicode

While working on an issue, I ran into some surprising behavior. The StartsWith method (and possibly some others) returns true in both of the following cases:

"\u0000Example".StartsWith("Example");          // returns true
"\u0000Example".StartsWith("\u0000Example");    // returns true

Repository with this example and unit tests: https://github.com/DeanMilojevic/UnicodeInvestigation

I haven't had time to dig into the implementation of the method, so I was wondering: is this a "bug" or expected behavior?

If I find out more in my free time, I will update the question with additional information.

D34NM asked Oct 17 '25 02:10



1 Answer

One of the breaking changes introduced in .NET 5 is the transition from NLS to ICU globalization libraries on Windows. Your post doesn't mention which version you are on, but assuming you are using .NET 5 or later, this behaviour does not appear to be a bug.

To quote the docs:

If you use functions like string.IndexOf(string) without calling the overload that takes a StringComparison argument, you might intend to perform an ordinal search, but instead you inadvertently take a dependency on culture-specific behavior. Since NLS and ICU implement different logic in their linguistic comparers, the results of methods like string.IndexOf(string) can return unexpected values.
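As a minimal sketch of what the quoted passage describes, the snippet below searches for a newline inside a string that contains "\r\n". The class name IndexOfDemo and the helper FindOrdinal are hypothetical names made up for this illustration. The result of the culture-sensitive call depends on which globalization library is in use, so no output is claimed for it; the ordinal call always compares raw UTF-16 code units.

```csharp
using System;

public static class IndexOfDemo
{
    // Hypothetical helper for this sketch: always searches ordinally.
    public static int FindOrdinal(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        const string text = "Hello\r\nworld!";

        // Culture-sensitive search (default overload for a string argument).
        // The result can differ between NLS and ICU, so it is not asserted here.
        Console.WriteLine(text.IndexOf("\n"));

        // Ordinal search: '\n' sits at index 6 (after "Hello" and '\r').
        Console.WriteLine(FindOrdinal(text, "\n")); // 6
    }
}
```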

If you take a look at the table of default comparisons in the docs, you'll see that the string.StartsWith method uses a CurrentCulture comparison by default when passed a string parameter. Since you're doing a culture-specific comparison (by not specifying the StringComparison parameter), it seems that the ICU library, however it goes about its implementation, ignores the Unicode null character \u0000, whereas the NLS library seemingly doesn't.

Judging by your code, it looks like what you wanted was an ordinal comparison, which you can get with: "\u0000Example".StartsWith("Example", StringComparison.Ordinal). That call returns false, as you would expect.
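To make the difference concrete, here is a small sketch that runs both cases from the question through an ordinal comparison. OrdinalPrefixDemo and StartsWithOrdinal are hypothetical names invented for this example, not part of the question's repository. The default (culture-sensitive) call is included only for contrast; its result depends on whether NLS or ICU is in use, so no output is claimed for it.

```csharp
using System;

public static class OrdinalPrefixDemo
{
    // Hypothetical helper for this sketch: always compares ordinally.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    public static void Main()
    {
        const string text = "\u0000Example";

        // Culture-sensitive (default overload): depends on NLS vs ICU.
        Console.WriteLine(text.StartsWith("Example"));

        // Ordinal: compares UTF-16 code units, so the leading \u0000 counts.
        Console.WriteLine(StartsWithOrdinal(text, "Example"));       // False
        Console.WriteLine(StartsWithOrdinal(text, "\u0000Example")); // True
    }
}
```

Wrapping the call in a helper like this is only one option; passing StringComparison.Ordinal directly at each call site works just as well and is what the analyzers listed in this answer push you towards.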

It is recommended to enable code analyzers in your project to help you identify code that is unexpectedly using a linguistic comparer when an ordinal one was likely intended.

The recommended rules to enable are:

CA1307: Specify StringComparison for clarity

CA1309: Use ordinal StringComparison

CA1310: Specify StringComparison for correctness

To enable these code analysis rules and have them cause build errors, simply add the following to your project file:

<PropertyGroup>
  <AnalysisMode>AllEnabledByDefault</AnalysisMode>
  <WarningsAsErrors>$(WarningsAsErrors);CA1307;CA1309;CA1310</WarningsAsErrors>
</PropertyGroup>
Darren Ruane answered Oct 18 '25 16:10
