Removing unrecognized ASCII characters from string

Tags:

string

c#

ascii

I'm parsing HTML using HTML Agility Pack, and from time to time I get weird-looking strings like "–". What is the simplest way to remove them? By the way, I'm using C#.

Dr_Freeman asked Feb 02 '26 08:02



1 Answer

You probably need to look into why you are getting those characters in the first place. Most likely something is wrong with the encoding.
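To illustrate the encoding point: a string such as "–" is typically what you see when the UTF-8 bytes of an en dash (E2 80 93) are decoded with a single-byte encoding like Windows-1252. Here is a minimal sketch of that effect. `MojibakeDemo` and its methods are names made up for this example, and ISO-8859-1 stands in for the wrong encoding only because it is available on every .NET runtime without extra setup.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDemo
{
    // Decodes the bytes as UTF-8 (the correct choice for UTF-8 input).
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decodes the same bytes with a single-byte encoding (the wrong choice
    // for UTF-8 input): every byte becomes its own character.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}
```

Given `byte[] bytes = Encoding.UTF8.GetBytes("\u2013")`, `DecodeAsUtf8(bytes)` returns the single en dash, while `DecodeAsLatin1(bytes)` returns three characters starting with "â". If this is what is happening in your case, telling HTML Agility Pack which encoding to use when loading, for example `doc.Load(path, Encoding.UTF8)`, should avoid the problem at the source. That assumes the page really is UTF-8.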

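But if you do need to remove all the non-ASCII characters from a string, the regex [^ -~] does the trick. It matches every character outside the range from space to tilde, which is the printable ASCII range, so it also removes control characters such as tabs and newlines.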

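        // Requires: using System; using System.Text.RegularExpressions;
        // [^ -~] matches any character outside the printable ASCII range (space through tilde).
        var stripped = Regex.Replace("străipped of baâ€d charaâ€cters", "[^ -~]", "");
        Console.WriteLine(stripped); // outputs "stripped of bad characters"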

See http://www.catonmat.net/blog/my-favorite-regex/ for an explanation of why that regex works.
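If you end up using the regex in several places, it can be wrapped in a small helper. The following is only a sketch: `AsciiCleaner` and its method names are made up for this example, and the second pattern is a variant of the one above (not from the linked article) that keeps tabs and line breaks, in case the text spans multiple lines.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AsciiCleaner
{
    // Matches anything outside printable ASCII: space (0x20) through tilde (0x7E).
    private static readonly Regex NonPrintableAscii = new Regex("[^ -~]");

    // Same range, but tab, carriage return and line feed are also allowed.
    private static readonly Regex NonAsciiExceptWhitespace = new Regex(@"[^\t\r\n -~]");

    // Removes every character that is not printable ASCII.
    public static string StripNonPrintableAscii(string input)
    {
        return NonPrintableAscii.Replace(input, "");
    }

    // Removes non-ASCII and control characters, but keeps tabs and line breaks.
    public static string StripKeepingLineBreaks(string input)
    {
        return NonAsciiExceptWhitespace.Replace(input, "");
    }
}
```

Keep in mind that both methods also drop legitimate non-ASCII characters such as "é", so they are only appropriate when the output really has to be plain ASCII.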

Alex answered Feb 03 '26 21:02



