I have run into a problem with Chinese characters. I use BeautifulSoup to extract data, and I want to create a folder named after the extracted data. The data looks like:
<A href="love">星座(1824)</A>
I want to extract '星座(1824)', so I do:
soup.find('a',href='love')
but the console prints:
ÐÇ×ù(1824)
I have put '# -*- coding: utf-8 -*-' at the top of my source file. It must be some encoding problem. Can anyone recommend good material on working with non-English text in Python?
I want to create a folder named '星座(1824)', so I do:
if not os.path.exists(dir_name):
    os.mkdir('./pic/'+dir_name)
I can see that a folder named 'ÐÇ×ù(1824)' already exists, so why do I still get:
OSError: [Errno 17] File exists: './vguagua_pic/\xc3\x90\xc3\x87\xc3\x97\xc3\xb9(1824)'
thx
Even if your .py script is saved as UTF-8, the parsed text can still be wrong if the webpage itself uses a different encoding.
The webpage's encoding is actually GB-2312 (or GB-18030), but BeautifulSoup wrongly guessed it to be ISO-8859-1 and, working from that incorrect assumption, converted the text to UTF-8, which caused the mojibake. We can verify:
>>> b'\xc3\x90\xc3\x87\xc3\x97\xc3\xb9'.decode('utf8').encode('latin1').decode('gb2312')
'星座'
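To see where those bytes come from, the round trip can also be run in the forward direction. This is a minimal sketch that assumes the link text is the two characters 星座 (which is what the bytes in the error message decode to); the variable names are only illustrative.

```python
original = "星座"

# Bytes as a GB2312 page would serve them.
raw = original.encode("gb2312")        # b'\xd0\xc7\xd7\xf9'

# The wrong guess: the same bytes read as ISO-8859-1 (latin-1).
misread = raw.decode("latin-1")        # 'ÐÇ×ù'

# The misread text written out as UTF-8, as seen in the OSError path.
stored = misread.encode("utf-8")       # b'\xc3\x90\xc3\x87\xc3\x97\xc3\xb9'

# Undoing the two steps in reverse order recovers the original text.
repaired = stored.decode("utf-8").encode("latin-1").decode("gb2312")

print(repr(raw), misread, repr(stored), repaired)
```

Latin-1 maps every byte to a character, so the wrong decode never raises an error; that is why the problem only shows up as strange-looking text.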
You could add from_encoding="gb2312" (in bs4) or fromEncoding="gb2312" (in 3.x) to the BeautifulSoup constructor to force the encoding, as documented in the Beautiful Soup documentation (which is also available in Chinese).
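The same idea can be tried without Beautiful Soup by decoding the bytes yourself before parsing, so that nothing has to guess the encoding. The sketch below is only a stand-in using the standard library's html.parser, not the Beautiful Soup API; `raw_html`, `LinkText`, and the temporary base directory are hypothetical names made up for the illustration. It also checks and creates the same path, because the snippet in the question tests `dir_name` but creates './pic/' + dir_name, which may be one reason for the Errno 17 (an assumption based only on the code shown).

```python
import os
import tempfile
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Collect the text inside <a href="love">...</a>."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "a" and dict(attrs).get("href") == "love":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


# Hypothetical page bytes, encoded as GB2312 like the page in the question.
raw_html = '<A href="love">星座(1824)</A>'.encode("gb2312")

parser = LinkText()
parser.feed(raw_html.decode("gb18030"))  # decode explicitly, no guessing
dir_name = parser.text

# A temporary directory stands in for ./pic; test and create the SAME path.
base = tempfile.mkdtemp()
target = os.path.join(base, dir_name)
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # repeating the call does not raise

print(dir_name, os.path.isdir(target))
```

GB-18030 is used for decoding here because it is a superset of GB-2312, so it also copes with characters outside the older standard.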