
In Python, how to decode strings whose literal content is in UTF-8?

Tags:

python

I was trying to make an add-on for Anki that imports OPML notes from Mubu. The contents I needed were stored in a str object like the one below, and I was not able to decode them or convert them into bytes objects.

"\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"

Previously, I was able to decode strings like this using the following method, but it does not handle UTF-8:

text = text.encode().decode("unicode_escape")

I wonder if there is a way to turn str objects whose literal content is UTF-8 into bytes objects.

Sushi Bear asked Sep 01 '25 03:09



1 Answer

In Python 3, this can be decoded as follows:

# Prefix the literal with b so Python treats it as bytes instead of str
s = b"\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"
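import chardet

# chardet.detect() returns a dict with the guessed encoding and a confidence score
detected = chardet.detect(s)
# Fall back to UTF-8 if chardet cannot guess an encoding
content = s.decode(detected["encoding"] or "utf-8")
print(content)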

It decodes to

<span>我发现我居然没有测试过中文，这个就太离谱了。</span>
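
The b prefix only works when the string is a literal in the source code. If the text arrives at runtime as a str, as in the question, one possible approach is sketched below. It assumes the str contains only ASCII characters, with the UTF-8 bytes written as literal \xNN escapes; the function names decode_escaped_utf8 and latin1_str_to_utf8 are made up for illustration. The idea is to let unicode_escape interpret the escapes, convert the resulting code points U+0000 to U+00FF back into bytes with Latin-1, and only then decode as UTF-8.

```python
def decode_escaped_utf8(text: str) -> str:
    """Decode a str that holds literal \\xNN escapes of UTF-8 bytes.

    Assumption: `text` is pure ASCII (backslash escapes included).
    """
    # Step 1: interpret the escapes. Each \xNN becomes the code point U+00NN,
    # which is why this step alone produces mojibake for multi-byte characters.
    unescaped = text.encode("ascii").decode("unicode_escape")
    # Step 2: Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
    # so this recovers the original byte values.
    raw = unescaped.encode("latin-1")
    # Step 3: the bytes are UTF-8, so decode them as such.
    return raw.decode("utf-8")


def latin1_str_to_utf8(text: str) -> str:
    """Variant for a str whose escapes were already interpreted,
    i.e. every character is in the range U+0000..U+00FF."""
    return text.encode("latin-1").decode("utf-8")


# Raw string: the backslashes are literal characters, as when read from a file.
sample = r"\x3Cspan\x3E\xE4\xB8\xAD\xE6\x96\x87\x3C/span\x3E"
print(decode_escaped_utf8(sample))  # <span>中文</span>
```

If the input can contain non-ASCII characters alongside the escapes, the encode("ascii") call raises UnicodeEncodeError, and the text would need different handling.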
forgetso answered Sep 02 '25 16:09
