Converting JavaScript Regex to C# Regex

I have a JavaScript regex that tokenizes a sentence into words:

/\\[^]|\.+|\w+|[^\w\s]/g

For example, if the sentence Hello World. is entered, the regex tokenizes it into:

Hello, World, .

I am trying to convert this regex to C#, but it does not tokenize the input. I removed the leading / and the trailing /g to make it compatible with the .NET regex engine, but it still does not work.

Below is the C# code I am trying:

public static void Main()
{
        string pattern = @"\\[^]|\.+|\w+|[^\w\s]";
        string input = @"hello world.";

        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.ECMAScript))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
}

Can anyone help me convert the above regex to C#?

Kunal Mukherjee asked Oct 22 '25 16:10


1 Answer

Note that RegexOptions.ECMAScript mainly ensures that shorthand character classes (here, \w and \s) match only ASCII letters, digits, and whitespace. You can't expect this option to "convert" the whole pattern for use with the .NET regex library.
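
To illustrate the effect on \w, here is a minimal sketch. The class name EcmaScriptDemo, the helper IsAllWordChars, and the sample string are hypothetical and chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaScriptDemo
{
    // Returns true when the whole input consists of \w characters under the given options.
    public static bool IsAllWordChars(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"^\w+$", options);
    }

    public static void Main()
    {
        string sample = "h\u00e9llo"; // "héllo", contains a non-ASCII letter

        // Default options: \w is Unicode-aware, so 'é' counts as a word character.
        Console.WriteLine(IsAllWordChars(sample, RegexOptions.None));

        // ECMAScript: \w is limited to ASCII letters, digits, and underscore, so 'é' is rejected.
        Console.WriteLine(IsAllWordChars(sample, RegexOptions.ECMAScript));
    }
}
```

The first call should report a match and the second should not, which is the only difference the option makes for this pattern.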

Here, the [^] construct is used in the JS regex to match any character. In .NET, you can use . with the RegexOptions.Singleline option instead of [^] (in that case, remove RegexOptions.ECMAScript, because the two options cannot be combined), or just use [\s\S] to match any character:

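using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        // \\. matches a backslash followed by any character (Singleline lets . match \n too)
        string pattern = @"\\.|\.+|\w+|[^\w\s]";
        string input = @"hello world.";

        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}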

Output:

'hello' found at index 0.
'world' found at index 6.
'.' found at index 11.
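
The [\s\S] alternative mentioned earlier can be sketched as follows. This is only an illustration: the class name AnyCharDemo and the helpers Tokenize and OptionsRejected are hypothetical. It also checks that the Regex constructor refuses RegexOptions.ECMAScript combined with RegexOptions.Singleline.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnyCharDemo
{
    // [\s\S] matches any character, including a newline, without RegexOptions.Singleline.
    private const string Pattern = @"\\[\s\S]|\.+|\w+|[^\w\s]";

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    // Returns true if the Regex constructor rejects the given option combination.
    public static bool OptionsRejected(RegexOptions options)
    {
        try
        {
            new Regex(@"\\.", options);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // Input is: a, backslash, newline, b, dot.
        // The backslash and the newline stay together as one two-character token.
        foreach (string token in Tokenize("a\\\nb."))
        {
            Console.WriteLine(token.Length);
        }

        Console.WriteLine(OptionsRejected(RegexOptions.ECMAScript | RegexOptions.Singleline));
    }
}
```

For the sample input this yields four tokens, the second being the backslash plus the newline.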

NOTE: \w and \s are Unicode-aware in .NET regex; \w matches all Unicode letters and digits, plus some diacritics. If you only want to treat ASCII letters, digits, and the underscore as word characters, use

string pattern = @"\\.|\.+|[A-Za-z0-9_]+|[^A-Za-z0-9_\x20\f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]";
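
As a rough illustration of how the Unicode-aware and ASCII-oriented patterns differ, here is a small sketch. The class name AsciiTokenizerDemo and the sample input are hypothetical, and the ASCII pattern here excludes the space character explicitly, written as \x20.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AsciiTokenizerDemo
{
    // Unicode-aware pattern: \w and \s follow .NET's Unicode definitions.
    public const string UnicodePattern = @"\\.|\.+|\w+|[^\w\s]";

    // ASCII word characters only; the negated class excludes them plus whitespace (\x20 is a space).
    public const string AsciiPattern =
        @"\\.|\.+|[A-Za-z0-9_]+|[^A-Za-z0-9_\x20\f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]";

    public static List<string> Tokenize(string input, string pattern)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        string input = "h\u00e9llo world."; // "héllo world."

        Console.WriteLine(string.Join("|", Tokenize(input, UnicodePattern)));
        Console.WriteLine(string.Join("|", Tokenize(input, AsciiPattern)));
    }
}
```

With the Unicode-aware pattern the sample splits into three tokens (héllo, world, .). With the ASCII pattern, é is no longer a word character, so the first word breaks into h, é, and llo, giving five tokens.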

More details

  • Word Character: \w in .NET regex
  • White-Space Character: \s in .NET regex
Wiktor Stribiżew answered Oct 25 '25 05:10