
How to clean \xc2\xa0 \xc2\xa0..... in text data

I am reading a text file with the following Python code:

     with open(file, 'r') as myfile:
          data = myfile.read()

The result contains weird sequences that start with \x. What do they stand for, and how do I get rid of them when reading a text file?

e.g.

...... \xc2\xa0 \xc2\xa0 chapter 1 tuesday 1984 \xe2\x80\x9chey , jake , your mom sent me to pick you up \xe2\x80\x9d jacob robbins knew better than to accept a ride from a stranger , but when his mom\xe2\x80\x99s friend ronny was waiting for him in front of school he reluctantly got in the car \xe2\x80\x9cmy name is jacob........

Paul asked Oct 26 '25 13:10



2 Answers

That's UTF-8 encoded text. Open the file as UTF-8 so the bytes are decoded into characters:

with open(file, 'r', encoding='utf-8') as myfile:
   ...

In Python 2.x, use codecs.open instead:

import codecs

with codecs.open(file, 'r', encoding='utf-8') as myfile:
   ...
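
To see the effect of the encoding argument end to end, here is a minimal Python 3 sketch. It assumes the file really is UTF-8; the temporary file and the sample bytes (copied from the question's excerpt) are stand-ins for the real file.

```python
import os
import tempfile

# Sample bytes from the question: UTF-8 for a no-break space,
# curly double quotes, and a curly apostrophe.
raw = b"\xc2\xa0 chapter 1 \xe2\x80\x9chey , jake \xe2\x80\x9d his mom\xe2\x80\x99s friend"

# Write the bytes to a temporary file that stands in for the real text file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Decoding as UTF-8 turns each multi-byte sequence into one character.
    with open(path, "r", encoding="utf-8") as myfile:
        data = myfile.read()
finally:
    os.remove(path)

print(repr(data))
```

After decoding, data holds real characters (U+00A0, U+201C, U+201D, U+2019) rather than the raw bytes, so they can be replaced or stripped like any other character.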

See also: Unicode In Python, Completely Demystified

Ignacio Vazquez-Abrams answered Oct 28 '25 02:10



Those are string escapes. They represent a character by its hexadecimal value. For example, \x24 is 0x24, which is the dollar sign.

>>> '\x24'
'$'
>>> chr(0x24)
'$'

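One such escape (from the ones you provided) is \xc2, which taken on its own is 0xC2: Â, a capital A with a circumflex. In your data, however, it is not a character by itself. It is the first byte of the two-byte UTF-8 sequence \xc2\xa0, which encodes a no-break space (U+00A0). Likewise, \xe2\x80\x9c, \xe2\x80\x9d and \xe2\x80\x99 are the UTF-8 encodings of the curly quotes “, ” and ’.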
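
As a rough illustration of how the sequences from the question decode, and one possible way to clean them up afterwards, here is a Python 3 sketch. The REPLACEMENTS table and the clean helper are hypothetical names introduced for this example; which ASCII stand-ins to use is an assumption that depends on your data.

```python
# Each multi-byte sequence from the question decodes to a single character.
nbsp = b"\xc2\xa0".decode("utf-8")             # U+00A0 NO-BREAK SPACE
left_quote = b"\xe2\x80\x9c".decode("utf-8")   # U+201C LEFT DOUBLE QUOTATION MARK
right_quote = b"\xe2\x80\x9d".decode("utf-8")  # U+201D RIGHT DOUBLE QUOTATION MARK
apostrophe = b"\xe2\x80\x99".decode("utf-8")   # U+2019 RIGHT SINGLE QUOTATION MARK

# Hypothetical cleanup table: map each character to an ASCII stand-in.
REPLACEMENTS = str.maketrans({
    "\u00a0": " ",
    "\u201c": '"',
    "\u201d": '"',
    "\u2019": "'",
})


def clean(text):
    """Swap in the ASCII stand-ins, then collapse runs of whitespace."""
    return " ".join(text.translate(REPLACEMENTS).split())


sample = (
    b"\xc2\xa0 \xc2\xa0 chapter 1 \xe2\x80\x9chey , jake \xe2\x80\x9d "
    b"his mom\xe2\x80\x99s friend"
).decode("utf-8")
cleaned = clean(sample)
print(cleaned)
```

Decoding first and cleaning second keeps the replacement step working on characters, so a no-break space is matched as one unit instead of as two unrelated bytes.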

Zach Gates answered Oct 28 '25 02:10



