I want to remove characters whose encodings are larger than 3 bytes, because the Amazon Mechanical Turk system requires this when I upload my CSV data:
Your CSV file needs to be UTF-8 encoded and cannot contain characters with encodings larger than 3 bytes. For example, some non-English characters are not allowed (learn more).
To work around this, I want to write a filter_max3bytes function in Python 3 that removes those characters:
x = 'below ð\x9f~\x83,'
y = filter_max3bytes(x)  # y == "below ~,"
I will then apply the function before saving the data to a UTF-8 encoded CSV file.
This post is related to my problem, but it uses Python 2 and the solution did not work for me.
Thank you!
None of the characters in your string seems to take more than 2 bytes in UTF-8, so none of them exceeds the 3-byte limit:
x = 'below ð\x9f~\x83,'
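Anyway, if there were any, the way to remove them would be to keep only the characters that encode to at most 3 bytes:
filtered_x = ''.join(char for char in x if len(char.encode('utf-8')) <= 3)
For example (with such a character):
>>> x = 'abcd漢字😃efg'
>>> ''.join(char for char in x if len(char.encode('utf-8')) <= 3)
'abcd漢字efg'
Here 漢 and 字 take 3 bytes each and are kept, while 😃 takes 4 bytes and is removed.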
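To tie this back to the question, here is a minimal sketch of a reusable filter applied just before writing a CSV. It assumes the goal stated in the question (keep everything that encodes to at most 3 bytes in UTF-8). The name filter_bmp_only, the max_bytes parameter, and the sample rows are hypothetical and used only for illustration. For well-formed strings (no lone surrogates), "more than 3 bytes in UTF-8" is the same as "code point above U+FFFF", so the second function should give the same result without encoding each character.

```python
import csv
import io


def filter_max3bytes(text: str, max_bytes: int = 3) -> str:
    """Drop every character whose UTF-8 encoding is longer than max_bytes."""
    return ''.join(ch for ch in text if len(ch.encode('utf-8')) <= max_bytes)


def filter_bmp_only(text: str) -> str:
    """Alternative for the 3-byte limit: keep code points up to U+FFFF."""
    return ''.join(ch for ch in text if ord(ch) <= 0xFFFF)


# Hypothetical rows; in practice these would come from your own data.
rows = [['id', 'text'], ['1', 'abcd漢字😃efg']]

# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open(path, 'w', encoding='utf-8', newline='') instead.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
for row in rows:
    writer.writerow([filter_max3bytes(cell) for cell in row])

csv_text = buffer.getvalue()
print(csv_text)
```

Filtering each cell right before writerow keeps the original data untouched in memory, which may be useful if you later need the unfiltered text elsewhere.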
BTW, you can verify that your original string does not have 3-byte encodings by doing the following:
>>> for char in 'below ð\x9f~\x83,':
... print(char, [hex(b) for b in char.encode('utf-8')])
...
b ['0x62']
e ['0x65']
l ['0x6c']
o ['0x6f']
w ['0x77']
['0x20']
ð ['0xc3', '0xb0']
['0xc2', '0x9f']
~ ['0x7e']
['0xc2', '0x83']
, ['0x2c']
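The same check can be automated. The sketch below uses a hypothetical helper, find_over_limit, that lists only the characters exceeding a byte limit together with their positions, which may be handier than reading the full dump for long strings.

```python
def find_over_limit(text: str, max_bytes: int = 3):
    """Return (index, character, byte_length) for each character over the limit."""
    hits = []
    for index, ch in enumerate(text):
        size = len(ch.encode('utf-8'))
        if size > max_bytes:
            hits.append((index, ch, size))
    return hits


# The string from the question: no character is longer than 2 bytes.
print(find_over_limit('below ð\x9f~\x83,'))

# A string with an emoji, which takes 4 bytes in UTF-8.
print(find_over_limit('ok 😃'))
```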
EDIT: A wild guess
I believe the OP is asking the wrong question, and the real question is whether a character is printable. I'll assume that anything Python displays as \x<number> is not printable, so this solution should work:
x = 'below ð\x9f~\x83,'
filtered_x = ''.join(char for char in x if not repr(char).startswith("'\\x"))
Result:
'below ð~,'
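If printability really is the criterion, a possible alternative to inspecting repr() is the built-in str.isprintable(). This is only a sketch and the helper name remove_nonprintable is hypothetical. Note that it is stricter than the \x check: it also drops tabs, newlines, and characters that repr() would show as \u or \U escapes.

```python
def remove_nonprintable(text: str) -> str:
    """Keep only characters that str.isprintable() reports as printable."""
    return ''.join(ch for ch in text if ch.isprintable())


x = 'below ð\x9f~\x83,'
print(remove_nonprintable(x))
```

For the string from the question this should give the same result as the repr-based filter.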