I am using a Python script to convert files from gb2312 to utf-8. This character messes everything up: ㎜ (it is a single symbol, not "mm").
text = '㎜'
text.encode(encoding='gb2312')
raises this error:
UnicodeEncodeError: 'gb2312' codec can't encode character '\u339c' in position 0: illegal multibyte sequence
I can work around it with text.replace('㎜', 'mm'). But what if there are other such characters? What is wrong with it? Why is it so special?
Is there a way to make Python treat it as any other character?
OK, so, I downloaded the file 1.php and ran your original script on it, and I get a different error message:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 99-100:
  illegal multibyte sequence
The bytes in the file at offsets 99 and 100 are A9 4C, in that order.  That is neither a valid GB2312 nor a valid UTF-8 encoding of anything.  I suspect you may have a whole bunch of files that are supposedly GB2312 but are actually in some other encoding.  If you need to just bull through all such problems, you can use errors='replace'.  (Python 3 text mode already handles your DOS newlines on its own; the old mode='rU' flag is deprecated and was removed in Python 3.11.)
file_old=open('1.php', mode='r', encoding='gb2312', errors='replace')
This will insert U+FFFD REPLACEMENT CHARACTER in place of anything it can't decode, and continue.  This destroys data; first try to figure out what the real encoding of the file is.
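One way to look for the real encoding before resorting to errors='replace' is to decode the raw bytes strictly with a few candidate codecs and see which ones succeed. This is only a minimal sketch: the helper name candidate_encodings and the default list of candidates are my own assumptions, and a successful decode only proves the bytes are valid in that encoding, not that the resulting text is what the author meant.

```python
def candidate_encodings(data, encodings=('utf-8', 'gb2312', 'gbk', 'gb18030', 'big5')):
    """Return the names from `encodings` that decode `data` without error.

    `candidate_encodings` is a hypothetical helper, not a standard-library function.
    """
    ok = []
    for name in encodings:
        try:
            data.decode(name, errors='strict')
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        ok.append(name)
    return ok


# In-memory sample: the problem character encoded as GBK.
sample = 'width: 5㎜'.encode('gbk')
print(candidate_encodings(sample))
```

For a real file, read it in binary mode first (open('1.php', 'rb').read()) and pass the bytes in. If several candidates succeed, you still have to look at the decoded text to decide which one makes sense.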
By the way, don't forget to fix up your HTML header when you're done; the preferred form nowadays is
<!doctype html>
<html><head>
  <meta charset="utf-8">
Concise, standards-compliant, and tested to work all the way back to IE6.
EDIT: On further investigation, GB2312 is a character set, not an encoding.  There are several possible encodings of it, but none of them allows the two-byte sequence A9 4C.  That sequence is valid in Big5 (a different, Traditional Chinese encoding), where it corresponds to the character 呶.  (I do not know any of the languages that use Chinese characters; does that make more sense in context than ㎜?)
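To make the character-set-versus-encoding distinction concrete, here is a small sketch that encodes one GB2312 character in two different byte formats that Python ships codecs for. The choice of 中 as the sample character is mine, purely for illustration.

```python
text = '中'  # a character that belongs to the GB2312 character set

euc_cn = text.encode('gb2312')  # Python's 'gb2312' codec produces the EUC-CN byte format
hz = text.encode('hz')          # HZ is a 7-bit encoding of the same character set

print(euc_cn.hex(), hz)

# Different bytes on disk, same character once decoded.
assert euc_cn.decode('gb2312') == hz.decode('hz') == text
```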
Python and iconv assume that GB2312 is encoded in a different format, EUC-CN, unless specifically told otherwise.  If I modify your script to read
file_old=open('1.php', mode='r', encoding='big5', errors='strict')
file_new=open('2.php', mode='w', encoding='utf-8')
file_new.write(file_old.read())
then it executes without error on the 1.php you provided.
EDIT 2: On further further investigation, what web browsers do with <meta charset="gb2312"> is pretend you wrote <meta charset="gbk">.  GBK is a superset of GB2312 that does include the ㎜ character.  Python, however, treats GB2312 per its original definition.  So what you really want in order for your conversion to match the original file is 
file_old=open('1.php', mode='r', encoding='gbk', errors='strict')
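Putting it together, the whole conversion might look like the sketch below. The function name convert_file and its parameters are hypothetical, and it assumes each file fits comfortably in memory. It uses with blocks so both files are closed even if decoding fails, and newline='' so the original line endings are carried over unchanged.

```python
def convert_file(src, dst, src_encoding='gbk', dst_encoding='utf-8'):
    """Re-encode the text file `src` into `dst`, failing loudly on bad bytes.

    Hypothetical helper for illustration; not part of the original script.
    """
    # newline='' disables newline translation, so DOS \r\n endings are preserved.
    with open(src, encoding=src_encoding, errors='strict', newline='') as file_old, \
         open(dst, 'w', encoding=dst_encoding, newline='') as file_new:
        file_new.write(file_old.read())
```

Calling convert_file('1.php', '2.php') would then do the same job as the three-line script, with GBK as the source encoding. Keeping errors='strict' means a file that is not really GBK stops the run instead of being silently damaged.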