Guessing UTF-8 encoding

Question

I have a question that may be quite naive, but I feel the need to ask, because I don't really know what is going on. I'm on Ubuntu.

Suppose I do

echo "t" > test.txt

If I then run

file test.txt

I get test.txt: ASCII text

If I then do

echo "å" > test.txt

Then I get

test.txt: UTF-8 Unicode text

How does that happen? How does file "know" the encoding, or, alternatively, how does it guess it?

Thanks.

David Z · Accepted Answer

There are certain byte sequences that suggest that UTF-8 encoding may be in use (see Wikipedia). If file finds one or more of those and doesn't find anything that can't occur in UTF-8, it's a fair guess that the file is encoded in UTF-8. But again, just a guess. For the basic ASCII character set (normal characters like 't'), the binary representation is the same in most common encodings (including UTF-8), so if a file contains only basic ASCII characters, file has no way to tell which of the many ASCII-compatible encodings was intended. It just goes with ASCII by default.
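To make the guessing concrete, here is a minimal Python sketch of the reasoning described above. It is not file's actual implementation: the function name guess_text_encoding and the returned labels are made up for illustration, and it ignores things the real tool checks, such as control characters and other encodings.

```python
def guess_text_encoding(data: bytes) -> str:
    """Rough guess in the spirit of the answer above (hypothetical helper)."""
    # Pure 7-bit data looks the same in every ASCII-compatible encoding,
    # so there is nothing to tell them apart: report plain ASCII.
    if all(byte < 0x80 for byte in data):
        return "ASCII text"
    # At least one byte has the high bit set. If the whole buffer still
    # decodes under the strict UTF-8 rules, UTF-8 is a reasonable guess.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8 (some other 8-bit encoding or binary data)"
    return "UTF-8 Unicode text"


print(guess_text_encoding(b"t\n"))                    # only 7-bit bytes
print(guess_text_encoding("å\n".encode("utf-8")))     # valid multi-byte sequence
print(guess_text_encoding("å\n".encode("latin-1")))   # lone 0xE5 is invalid UTF-8
```

The last call shows why the guess is usually safe in one direction: text in a legacy 8-bit encoding rarely happens to form valid UTF-8 sequences, so invalid sequences rule UTF-8 out quickly.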

The other thing to note is that your terminal and locale are set to use UTF-8, which is why the file gets written in UTF-8 in the first place: the shell simply passes along the bytes it receives for "å". Conceivably, you could configure them to use another encoding like ISO-8859-1, and then the command

echo "å" > test.txt

would write a file using ISO-8859-1.
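To see what "written in a different encoding" means at the byte level, this small Python sketch prints the bytes that the same two characters ("å" plus a newline) become under a few encodings. The helper name encoded_bytes is just for illustration, and Python's codecs stand in here for whatever the terminal would actually send.

```python
def encoded_bytes(text: str, encoding: str) -> str:
    """Return the bytes of text in the given encoding as spaced hex."""
    return text.encode(encoding).hex(" ")


for encoding in ("utf-8", "iso-8859-1", "utf-16-le"):
    print(f"{encoding:>10}: {encoded_bytes('å' + chr(10), encoding)}")

# Output:
#      utf-8: c3 a5 0a
# iso-8859-1: e5 0a
#  utf-16-le: e5 00 0a 00
```

Only the first byte sequence is valid UTF-8, which is what lets file tell the cases apart.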

schnaader · Answer

From the file manpage:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as "text" because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only "character data" because, while they contain text, it is text that will require translation before it can be read. In addition, file will attempt to determine other characteristics of text-type files. If the lines of a file are terminated by CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported. Files that contain embedded escape sequences or overstriking will also be identified.
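The line-terminator part of that description can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the text has already been decoded; the function name line_terminators is hypothetical and the logic is not taken from file's source.

```python
def line_terminators(text: str) -> list:
    """List which line terminators occur in already-decoded text."""
    crlf = text.count("\r\n")
    found = []
    if crlf:
        found.append("CRLF")
    # Carriage returns that are not part of a CRLF pair.
    if text.count("\r") > crlf:
        found.append("CR")
    # Line feeds that are not part of a CRLF pair.
    if text.count("\n") > crlf:
        found.append("LF")
    # NEL is the "next line" control character, U+0085.
    if "\u0085" in text:
        found.append("NEL")
    return found


print(line_terminators("one\r\ntwo\r\n"))  # Windows-style endings
print(line_terminators("one\ntwo\n"))      # Unix-style endings
```

Working on decoded text rather than raw bytes matters for NEL: the byte 0x85 also appears inside ordinary UTF-8 sequences (for example "Å" is c3 85), so a raw byte search would give false positives.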

Tags: encoding, utf-8

Dervin Thunk
