Guessing UTF-8 encoding

Tags: encoding, utf-8

I have a question that may be quite naive, but I feel the need to ask, because I don't really know what is going on. I'm on Ubuntu.

Suppose I do

echo "t" > test.txt

If I then run

file test.txt

I get

test.txt: ASCII text

If I then do

echo "å" > test.txt

and run file again, I get

test.txt: UTF-8 Unicode text

How does that happen? How does file "know" the encoding, or, alternatively, how does it guess it?

Thanks.

Dervin Thunk asked Oct 24 '25 04:10



2 Answers

There are certain byte sequences that suggest UTF-8 encoding may be in use (see Wikipedia). If file finds one or more of those and finds nothing that cannot occur in UTF-8, it is a fair guess that the file is encoded in UTF-8. In your example, å is stored as the two bytes 0xC3 0xA5, a lead byte followed by a continuation byte, which is exactly such a sequence. But again, it is just a guess.

For the basic ASCII character set (normal characters like 't'), the binary representation is the same in most common encodings (including UTF-8), so if a file contains only basic ASCII characters, file has no way to tell which of the many ASCII-compatible encodings was intended. It just goes with ASCII by default.
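The guess described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how file is actually implemented; the helper name guess_text_encoding and the label "unknown 8-bit" are made up for this example.

```python
def guess_text_encoding(data: bytes) -> str:
    """Roughly classify bytes as ASCII, UTF-8, or something else (hypothetical helper)."""
    # If every byte is below 0x80, the data looks the same in every
    # ASCII-compatible encoding, so "ASCII" is the most specific answer.
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        # Strict decoding fails on any sequence that cannot occur in UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown 8-bit"
    return "UTF-8"


print(guess_text_encoding("t\n".encode("utf-8")))
print(guess_text_encoding("å\n".encode("utf-8")))
print(guess_text_encoding(b"\xe5\n"))  # a lone 0xE5 is not valid UTF-8
```

A file that passes the UTF-8 check could still have been written in some other 8-bit encoding that happens to produce valid UTF-8 sequences, which is why the result is a guess rather than a certainty.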

The other thing to take note of is that your terminal and locale are set to use UTF-8, which is why the file gets written in UTF-8 in the first place. Conceivably, you could set them to use another encoding like ISO-8859-1, and then the command

echo "å" > test.txt

would write the file in that encoding instead (a single byte, 0xE5, for å), and file would no longer report UTF-8.
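To see how the bytes that end up on disk depend on the encoding in effect, here is a small Python sketch that encodes the same text with a few codecs. The helper name encoded_hex is hypothetical, and the choice of codecs is only for illustration.

```python
def encoded_hex(text: str, codec: str) -> str:
    """Return the bytes of `text` in `codec` as space-separated hex (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in text.encode(codec))


text = "å\n"  # what echo would write, including the trailing newline
for codec in ("utf-8", "iso-8859-1", "utf-16-le"):
    print(f"{codec:>10}: {encoded_hex(text, codec)}")
```

Only the UTF-8 form contains the 0xC3 0xA5 pair that makes a UTF-8 guess plausible; the other two produce different byte patterns for the same character.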

David Z answered Oct 27 '25 00:10



From the file manpage:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as "text" because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only "character data" because, while they contain text, it is text that will require translation before it can be read. In addition, file will attempt to determine other characteristics of text-type files. If the lines of a file are terminated by CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported. Files that contain embedded escape sequences or overstriking will also be identified.
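The line-terminator check mentioned at the end of that passage can be approximated as follows. This is a rough sketch, not the logic file uses: the helper name line_terminators is hypothetical, and NEL is left out because its byte form depends on the encoding.

```python
def line_terminators(data: bytes) -> set:
    """Report which of CRLF, CR, and LF occur in `data` (hypothetical helper)."""
    found = set()
    crlf = data.count(b"\r\n")
    if crlf:
        found.add("CRLF")
    # A CR or LF only counts on its own if it is not part of a CRLF pair.
    if data.count(b"\r") > crlf:
        found.add("CR")
    if data.count(b"\n") > crlf:
        found.add("LF")
    return found


print(line_terminators(b"one\r\ntwo\r\n"))
print(line_terminators(b"one\ntwo\n"))
```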

schnaader answered Oct 26 '25 22:10
