Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python 2.7, convert utf8 string to ascii

I am working with python 2.7.12 I have string which contains a unicode literal, which is not of type Unicode. I would like to convert this to text. This example explains what I am trying to do.

>>> s
'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
>>> print s
username
>>> type(s)
<type 'str'>
>>> s == "username"
False

How would I go about converting this string?

like image 729
bdclosne Avatar asked Jan 30 '26 01:01

bdclosne


1 Answers

That's not UTF-8, it's UTF-16, though it's unclear whether it's big endian or little endian (you have no BOM, and you have a leading and trailing NUL byte, making it an uneven length). For text in the ASCII range, UTF-8 is indistinguishable from ASCII, while UTF-16 alternates NUL bytes with the ASCII encoded bytes (as in your example).

In any event, converting to plain ASCII is fairly easy, you just need to deal with the uneven length one way or another:

s = 'u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00' # I removed \x00 from beginning manually
sascii = s.decode('utf-16-le').encode('ascii')

# Or without manually removing leading \x00
sascii = s.decode('utf-16-be', errors='ignore').encode('ascii')

Course, if your inputs are just NUL interspersed ASCII and you can't figure out the endianness or how to get an even number of bytes, you can just cheat:

sascii = s.replace('\x00', '')

But that won't raise exceptions in the case where the input is some completely different encoding, so it may hide errors that specifying what you expect would have caught.

like image 138
ShadowRanger Avatar answered Feb 02 '26 03:02

ShadowRanger



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!