How to find file encoding type or convert any encoding type to UTF-8 in shell?

I get text file of random encoding format: UCS-2LE, ANSI, UTF-8, UCS-2BE, etc. I have to convert these files to UTF-8.

For the conversion I am using the following command:

iconv options -f from-encoding -t utf-8 <inputfile > outputfile

But if an incorrect from-encoding is provided, then incorrect file is generated.

I want a way to find the input file encoding type.

Thanks in advance

dhpratik asked Oct 22 '25 06:10

1 Answer

On Linux you could try running file(1) on your unknown input file. Most of the time it will guess the encoding correctly. Otherwise, try several candidate encodings with iconv until you "feel" that the result is acceptable (for example, if you know that the file is some Russian poetry, you might try KOI8-R, UTF-8, etc. until you recognize a good Russian poem).
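As a minimal sketch of the file(1) approach, the helper below asks file for its guess and passes that guess to iconv. The name to_utf8 is made up for illustration. The sketch assumes a file implementation that supports --mime-encoding and an iconv that accepts the charset names file prints. The guess itself can still be wrong.

```shell
# Sketch only: to_utf8 is a hypothetical helper, not part of file or iconv.
# Usage: to_utf8 SOURCE DESTINATION
to_utf8() {
    src=$1
    dst=$2
    # -b omits the file name; --mime-encoding prints only the guessed charset
    enc=$(file -b --mime-encoding "$src") || return 1
    case $enc in
        binary|unknown-8bit)
            # file could not name an encoding, so do not convert blindly
            echo "cannot guess the encoding of $src (file says: $enc)" >&2
            return 1
            ;;
    esac
    iconv -f "$enc" -t UTF-8 < "$src" > "$dst"
}
```

It could be called as: to_utf8 inputfile outputfile. Checking the result by eye is still advisable, since file only guesses.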

But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding they used. There is no way to determine that encoding reliably in all cases: some byte sequences are valid in several encodings and are interpreted differently in each.
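A small illustration of that ambiguity, assuming an iconv that knows both ISO-8859-1 and KOI8-R: the single byte 0xE9 is valid in both encodings but stands for a different character in each. The helper name decode_e9_as is made up for this example.

```shell
# Sketch only: decode_e9_as is a hypothetical helper.
# It decodes the single byte 0xE9 (octal 351) using the given source encoding.
decode_e9_as() {
    printf '\351' | iconv -f "$1" -t UTF-8
}

decode_e9_as ISO-8859-1   # the byte is read as U+00E9, Latin small letter e with acute
echo
decode_e9_as KOI8-R       # the same byte is read as U+0418, Cyrillic capital letter I
echo
```

Both conversions succeed without any error, so iconv alone cannot tell you which reading is the intended one.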

(Note that the HTTP protocol states the encoding explicitly, in the charset parameter of the Content-Type header.)

In 2017, better use UTF-8 everywhere (see http://utf8everywhere.org/), so ask your human partners to send you UTF-8 (hopefully most of your files are already in UTF-8, since today they all should be).

(So encoding is more a social issue than a technical one.)

I get text file of random encoding format

Notice that a "random encoding" doesn't exist. You want and need to find out what character encoding (and file format) has been used by the provider of that file (so you mean an "unknown encoding", not a "random" one).

BTW, do you have a formal, unambiguous, sound and precise definition of a text file, beyond "a file without zero bytes" or "a file with few control characters"? LaTeX, C source, Markdown, SQL, UUencoding, shar, XPM, and HTML files are all text files, but very different ones!

You probably want to expect UTF-8, and you might use the file extension as some hint. Knowing the media-type could help.
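Expecting UTF-8 can at least be checked mechanically. The sketch below, with a made-up helper name is_utf8, succeeds only if the whole file decodes as UTF-8. It converts to UTF-16 and discards the output so that iconv really has to decode every byte. Note that plain ASCII passes too, since ASCII is a subset of UTF-8.

```shell
# Sketch only: is_utf8 is a hypothetical helper.
# Returns success (0) when the file is a valid UTF-8 byte sequence.
is_utf8() {
    # iconv exits non-zero on the first invalid input sequence;
    # the converted output itself is not needed, only the exit status
    iconv -f UTF-8 -t UTF-16 < "$1" > /dev/null 2>&1
}
```

A file that fails this check is certainly not UTF-8. A file that passes is very likely UTF-8, but passing says nothing about whether the content is what you expected.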

(So if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type header; read about HTTP headers.)
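If the Content-Type header line was saved alongside the file, its charset parameter can be pulled out with sed. This is a rough sketch: charset_of is a made-up helper name, and the pattern handles only the common forms charset=NAME and charset="NAME".

```shell
# Sketch only: charset_of is a hypothetical helper.
# Prints the charset parameter of a Content-Type header line,
# or nothing if the header carries no charset.
charset_of() {
    printf '%s\n' "$1" |
        sed -n 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]="\{0,1\}\([^";[:space:]]*\).*/\1/p'
}

charset_of 'Content-Type: text/plain; charset=ISO-8859-1'   # prints ISO-8859-1
```

The printed name could then be passed to iconv as the from-encoding, provided the server that sent the header was configured correctly.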

[...] then incorrect file is generated.

How do you know that the resulting file is incorrect? You can only know if you have some expectations about that result (e.g. that it contains Russian poetry, not junk characters; but perhaps these junk characters are bytecode for some secret interpreter, or music represented in some weird fashion, or encrypted data, etc.). Raw files are just sequences of bytes; you need some extra knowledge to use them (even if you know that they use UTF-8).

Basile Starynkevitch answered Oct 25 '25 14:10
