
Create a folder with a Chinese name using Python

I have run into a problem with Chinese characters. I use BeautifulSoup to extract data, and I want to create a folder named after the extracted data. The data looks like this:

<A href="love">星座(1824)</A>

I want to extract '星座(1824)', so I do:

soup.find('a',href='love')

but in the console it comes out as:

ÐÇ×ù(1824)

I have put '# -*- coding: utf-8 -*-' at the top of my source file. It must be some encoding problem. Can anyone recommend good material on working with non-English text in Python?

I want to create a folder named '星座(1824)', so I do:

if not os.path.exists(dir_name):
        os.mkdir('./pic/'+dir_name)

I can see that a folder named 'ÐÇ×ù(1824)' already exists, so why does it still fail with:

OSError: [Errno 17] File exists: './vguagua_pic/\xc3\x90\xc3\x87\xc3\x97\xc3\xb9(1824)'

thx

kuafu asked Nov 15 '25 13:11


1 Answer

Even if your .py script is written in UTF-8, if the webpage is not, the parsed text may not be correct.

The webpage's encoding is actually GB2312 (or GB18030), but BeautifulSoup wrongly guessed it to be ISO-8859-1. Working from that incorrect assumption, it converted the text to UTF-8, which produced mojibake. We can verify this:

>>> b'\xc3\x90\xc3\x87\xc3\x97\xc3\xb9'.decode('utf8').encode('latin1').decode('gb2312')
'星座'
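
To make the round trip easier to follow, here is a minimal sketch (Python 3, standard library only) that runs the same chain in the forward direction and then undoes it. The variable names are illustrative, and the sketch assumes the link text on the page was '星座(1824)', which is what the bytes in the error message decode to.

```python
# The link text, assumed to be what the GB2312-encoded page contains.
original = "星座(1824)"

# What the server actually sends: GB2312 bytes.
raw = original.encode("gb2312")

# Wrong guess: the bytes are read as ISO-8859-1 (Latin-1). This never fails,
# because every byte value maps to some Latin-1 character.
garbled = raw.decode("latin1")

# Writing the garbled text out as UTF-8 gives the bytes seen in the OSError.
garbled_utf8 = garbled.encode("utf8")

# Undo the wrong guess: back to the raw bytes, then decode with the right codec.
repaired = garbled.encode("latin1").decode("gb2312")
```

Here `garbled_utf8` begins with the same `\xc3\x90\xc3\x87\xc3\x97\xc3\xb9` sequence that appears in the error message, and `repaired` equals `original`.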

You can pass from_encoding="gb2312" (in bs4) or fromEncoding="gb2312" (in Beautiful Soup 3.x) to the BeautifulSoup constructor to force the encoding, as described in the Beautiful Soup documentation (which is also available in Chinese).
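
As a rough illustration, the following sketch assumes Python 3 and bs4 with the built-in "html.parser". The `html_bytes` value stands in for the downloaded page, and `make_folder` is a hypothetical helper, not part of any library. Note also that the code in the question checks `dir_name` but creates `'./pic/' + dir_name`, so the check and the `mkdir` call refer to different paths; that may be why `mkdir` still raised "File exists". The helper builds the path once and uses it for both steps.

```python
import os
import tempfile

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Stand-in for the downloaded page: raw bytes in GB2312, as the server sent them.
html_bytes = '<a href="love">星座(1824)</a>'.encode("gb2312")

# Tell Beautiful Soup the encoding instead of letting it guess.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="gb2312")
dir_name = soup.find("a", href="love").get_text()


def make_folder(base_dir, name):
    """Create base_dir/name if it is missing and return the full path.

    Hypothetical helper for illustration only.
    """
    target = os.path.join(base_dir, name)
    # exist_ok=True makes the call a no-op when the folder is already there.
    os.makedirs(target, exist_ok=True)
    return target


# Demo in a throwaway directory (a stand-in for './pic').
with tempfile.TemporaryDirectory() as base:
    created = make_folder(base, dir_name)
    folder_exists = os.path.isdir(created)
```

`from_encoding` only takes effect when the markup is passed as bytes; if the page has already been decoded to a string, the wrong decoding has to be fixed before parsing.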

kennytm answered Nov 18 '25 05:11


