
A script to strip ranges of UTF-8 Characters out of a file

My problem is that I have a data file containing UTF-8, most of which is valid and must be kept, but some of which is random "garbage" UTF-8, namely characters whose first byte is in the range 0xf0 - 0xff. An example of the hex for the bad data can be seen below:

 f4 80 80  ab f4 80 80 b6 f4 80 80 
 a5 f4 80 80 a6 f4 80 80  83 f4 80 80 b6 f4 80 81  
 84 f4 80 81 98 f4 80 81  87 f4 80 81 8c f4

I'm trying to write a Perl script that will search for and replace characters whose first byte is in the range 0xf0 - 0xff. On this website the codepage is listed as private use.

My existing attempts either do nothing or remove only the first byte of a multi-byte character, for example perl -CSD -pi.orig -e 's/[\x{f4}-\x{ff}]/?/g'. I'm running Perl v5.12.5.

I'm not much of a Perl expert, nor a UTF-8 expert. I'm also open to doing this in Ruby/Python/C++(98)/whatever, as long as it's relatively portable on a Linux box.

Here's a link to a snippet of the garbage data. http://pastebin.com/LR0StPHu

Christopher Wirt asked Dec 14 '25 21:12


1 Answer

OK, let's untangle a few things first.

UTF-8 characters whose first byte is in the range 0xf0 through 0xf4 are four bytes long, which is the most you ever need to encode a legal Unicode character (bytes 0xf5 through 0xff never appear in valid UTF-8). Since over 94% of the possible Unicode range requires that fourth byte, such a lead byte doesn't map to any single code page, and it doesn't by itself mark a character as private use.

Such characters are outside the Basic Multilingual Plane. But that's different from being invalid or private use; it just means their code points are greater than U+FFFF (decimal value 65,535).
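
To make the byte-versus-code-point distinction concrete, here is a small Python 3 sketch that decodes the first complete four-byte sequence from the hex dump in the question (f4 80 80 ab) and reports its code point. The helper name describe is made up for this illustration.

```python
def describe(hex_bytes):
    """Decode one UTF-8 sequence given as a hex string.

    Returns (code point label, True if the code point is outside the BMP).
    """
    raw = bytes.fromhex(hex_bytes)   # spaces between byte pairs are allowed
    ch = raw.decode("utf-8")         # raises UnicodeDecodeError on invalid UTF-8
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    return "U+{:04X}".format(cp), cp > 0xFFFF

print(describe("f4 80 80 ab"))   # ('U+10002B', True)
```

That particular sample is above U+FFFF, so it is outside the BMP, and it happens to fall in the range U+100000..U+10FFFF.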

If you want to exclude all characters outside the BMP, you should be searching for the ones matching this regex:

[\x{10000}-\x{10FFFF}]

That uses Perl's \x{...} escape syntax to include characters by their hexadecimal code point value. If you're actually using Perl, then for ease of use you might want to put the regex into a variable, using the quote-regex construction qr(...); a bare /.../ would immediately try to match the regex against $_ at assignment time:

my $not_bmp = qr([\x{10000}-\x{10FFFF}]);

But, again, removing characters matching that regex eliminates over 94% of possible Unicode characters, so be sure that's what you want.
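
The 94% figure follows from the shape of the code space: Unicode has 17 planes of 65,536 code points each, and only the first is the BMP. Here is a quick Python 3 check, plus an example of an ordinary non-BMP character that the non-BMP pattern would also remove. U+1F600, an emoji, is chosen purely for illustration.

```python
import re

TOTAL = 0x10FFFF + 1        # every code point, U+0000..U+10FFFF
NON_BMP = TOTAL - 0x10000   # everything above U+FFFF

fraction = NON_BMP / TOTAL  # 16 of the 17 planes
print("{:.1%}".format(fraction))   # 94.1%

not_bmp = re.compile("[\U00010000-\U0010ffff]")
sample = "ok \U0001F600 done"
# The emoji is removed along with any private use characters.
print(not_bmp.sub("", sample))
```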

If you really only want to eliminate private use characters - some of which are inside the BMP - just exclude those ranges specifically. With Perl or Python or any other UTF-8-aware language, you don't have to worry about bytes; just check the code points.
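
The difference between working on bytes and working on code points likely explains both symptoms in the question. The following Python 3 sketch contrasts the two approaches; the sample bytes come from the question's hex dump, padded with an ASCII letter on each side, and the variable names are illustrative.

```python
import re

raw = bytes.fromhex("41 f4 80 80 ab 42")   # "A", one four-byte character, "B"

# Byte level: only the lead byte matches, so the three continuation
# bytes (80 80 ab) are left behind and the result is not valid UTF-8.
byte_level = re.sub(rb"[\xf0-\xff]", b"?", raw)
print(byte_level)

# Code point level: decode first, then match whole characters.
text = raw.decode("utf-8")
char_level = re.sub("[\U00010000-\U0010ffff]", "?", text)
print(char_level)

# On decoded text, a class like [\xf4-\xff] means the characters
# U+00F4..U+00FF, not lead bytes, so here it matches nothing.
unchanged = re.sub("[\xf4-\xff]", "?", text)
print(unchanged == text)
```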

As Wikipedia will tell you, the three Private Use Areas are in these code point ranges:

  • U+E000..U+F8FF
  • U+F0000..U+FFFFF
  • U+100000..U+10FFFF

So the corresponding Perl regex looks like this:

my $pua = qr([\x{e000}-\x{f8ff}\x{f0000}-\x{fffff}\x{100000}-\x{10ffff}]);
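
As a cross-check on those ranges, Python's standard unicodedata module reports the general category Co for private use code points. This sketch is an alternative to hard-coding the ranges, not something the regex approach depends on; is_private_use is a hypothetical helper name.

```python
import unicodedata

def is_private_use(ch):
    """True if the single character ch has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

# One ordinary letter, then samples from each of the three ranges.
for cp in (0x41, 0xE000, 0xF8FF, 0xF0000, 0x10002B):
    print("U+{:04X}".format(cp), is_private_use(chr(cp)))
```

One small difference: the last two code points of planes 15 and 16 (for example U+FFFFE and U+FFFFF) are noncharacters rather than private use, so a category check skips them while the ranges above include them. For stripping garbage that distinction rarely matters.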

Many other languages have similar Unicode support (matching against UTF-8 characters, including characters in a string by code point, and so on). For example, here's Ruby, which mainly differs in using \u{...} instead of \x{...} for the escape:

not_bmp = %r([\u{10000}-\u{10FFFF}])
pua = %r([\u{e000}-\u{f8ff}\u{f0000}-\u{fffff}\u{100000}-\u{10ffff}])

Python's \u escape takes exactly four hex digits, but in Python 3 (3.3 or later; earlier versions and Python 2 need a wide build) you can use capital \U, which takes exactly eight. There's no variable-length {...} form like the one Perl and Ruby have:

import re
not_bmp = re.compile(u'[\U00010000-\U0010ffff]')
pua = re.compile(u'[\ue000-\uf8ff\U000f0000-\U000fffff\U00100000-\U0010ffff]')
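
Putting the pieces together, a minimal Python 3 sketch of the whole task might look like the following. The function names, the default of deleting rather than replacing, and the file-handling details are assumptions for illustration, not part of the original answer.

```python
import re

# The three private use ranges listed above.
PUA = re.compile("[\ue000-\uf8ff\U000f0000-\U000fffff\U00100000-\U0010ffff]")

def strip_private_use(text, replacement=""):
    """Return text with every private use character replaced."""
    return PUA.sub(replacement, text)

def strip_file(src_path, dst_path):
    """Read src_path as UTF-8, strip private use characters, write dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_private_use(line))

# Hypothetical usage:
#   strip_file("data.txt", "data.clean.txt")
print(strip_private_use("A\U0010002bB"))   # AB
```

If the file also contains byte sequences that are not valid UTF-8 at all, open() raises UnicodeDecodeError while reading; passing errors="replace" to open() is one way to get past that, at the cost of substituting U+FFFD for the bad bytes.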

Mark Reed answered Dec 16 '25 11:12


