
utf8 "\xFF" does not map to Unicode at tokenizer.perl line 44, <STDIN> line 1.

I am using a Perl tokenizer for German. It works fine for some files, but for one file I now get the following error:

perl tokenizer.perl -l de < ~/Desktop/me.txt > ~/Desktop/me.txt.tok 
Tokenizer v3
Language: de
utf8 "\xFF" does not map to Unicode at tokenizer.perl line 44, <STDIN> line 1.
Malformed UTF-8 character (byte 0xff) in pattern match (m//) at tokenizer.perl line 45, <STDIN> line 1.
Malformed UTF-8 character (byte 0xff) in pattern match (m//) at tokenizer.perl line 45, <STDIN> line 1.
Malformed UTF-8 character (fatal) at tokenizer.perl line 64, <STDIN> line 1.

Any thoughts?

Thanks in advance.

Neg.

asked Oct 28 '25 16:10 by user89423


1 Answer

The error message is misleading, but the information it is trying to convey is correct and useful: the byte FF (hexadecimal) was encountered in the data, and that byte can never appear in valid UTF-8. So “utf8 "\xFF"” is nonsense as such; read it as “byte FF encountered in data purported to be UTF-8 encoded”. Similarly, read “Malformed UTF-8 character (byte 0xff)” as “invalid data (byte FF) encountered in purported UTF-8 data”.

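To find out why your data contains the byte FF, you need to show more of it, for example a hex dump of the first few bytes. My guess is that it is actually part of a byte order mark in UTF-16 encoding (FF FE in little-endian, FE FF in big-endian), which would fit the error being reported on line 1 of the input, but this is just a guess.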
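One way to check the byte-order-mark guess is to look at the first bytes of the file. The following is only a sketch: `detect_bom` is a made-up helper name, not part of the tokenizer, and it assumes a POSIX-like shell with `head` and `od` available.

```shell
# Guess a file's encoding from its first bytes.
# detect_bom is a hypothetical helper name, not part of tokenizer.perl.
detect_bom() {
    # Read the first three bytes as lowercase hex with no separators,
    # e.g. "fffe41" for a file starting with the bytes FF FE 41.
    sig=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$sig" in
        efbbbf*) echo "UTF-8 with BOM" ;;
        # Note: a UTF-32LE BOM (FF FE 00 00) would also match this branch.
        fffe*)   echo "UTF-16LE (BOM FF FE)" ;;
        feff*)   echo "UTF-16BE (BOM FE FF)" ;;
        *)       echo "no BOM found" ;;
    esac
}
```

Running `detect_bom ~/Desktop/me.txt` would then show whether the file starts with a BOM. If it reports UTF-16, converting the file first, for instance with `iconv -f UTF-16 -t UTF-8 ~/Desktop/me.txt > ~/Desktop/me.utf8.txt` (the output file name here is just an example), and feeding the converted file to the tokenizer may be enough to make the error go away. If no BOM is found, the FF byte is somewhere else in the data, for example as ÿ in an ISO-8859-1 encoded file, and a fuller hex dump would be needed.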

answered Oct 31 '25 14:10 by Jukka K. Korpela