Pandas to_csv prefixing 'b' when doing .astype('|S') on column

Tags:

python

pandas

Following the advice in this article on reducing Pandas DataFrame memory usage, I'm calling .astype('|S') on object columns like so:

data_frame['COLUMN1'] = data_frame['COLUMN1'].astype('|S')
data_frame['COLUMN2'] = data_frame['COLUMN2'].astype('|S')

Doing this cuts the DataFrame's memory usage by 20-40% with no negative impact on processing the columns. However, when I write the file with .to_csv():

data_frame.to_csv(filename, sep='\t', encoding='utf-8')

The columns converted with .astype('|S') are written with a b prefix and single quotes:

b'00001234'  b'Source'

Removing the .astype('|S') call and outputting to csv gives the expected behavior:

00001234  Source

Some googling turns up GitHub issues, but I don't think they are related (and they appear to have been fixed anyway): "to_csv and bytes on Python 3" and "BUG: Fix default encoding for CSVFormatter.save".

I'm on Python 3.6.4 and Pandas 0.22.0, and I've confirmed the behavior is the same on both macOS and Windows. Any advice on how to output the columns without the b prefix and single quotes?

Brett VanderHaar asked Dec 06 '25 02:12


1 Answer

The b prefix is how Python 3 displays a bytes object, as opposed to a unicode string (str). Casting with .astype('|S') turns the column's values into bytes, so that representation ends up in the CSV. To remove the prefix, decode each bytes object back to a string with its decode method before saving to a CSV file:

data_frame['COLUMN1'] = data_frame['COLUMN1'].apply(lambda s: s.decode('utf-8'))
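As a variation on the line above, the same decoding can be done with the vectorized .str.decode accessor, which may be a little tidier when several columns are involved. This is only a minimal sketch: the sample values and the in-memory buffer are made up for illustration, and the column names are simply the ones from the question.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame;
# the values are already bytes, as they would be after .astype('|S').
data_frame = pd.DataFrame({
    "COLUMN1": [b"00001234", b"00005678"],
    "COLUMN2": [b"Source", b"Target"],
})

# Decode each bytes column back to str just before writing,
# leaving the original (memory-reduced) DataFrame untouched.
decoded = data_frame.copy()
for name in ["COLUMN1", "COLUMN2"]:
    decoded[name] = decoded[name].str.decode("utf-8")

# Write to an in-memory buffer here; a filename works the same way.
clean = io.StringIO()
decoded.to_csv(clean, sep="\t", index=False)
print(clean.getvalue())
```

Decoding a copy right before the write keeps the bytes columns in the working DataFrame, so the memory savings are only given up temporarily for the export.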
Milton Arango G answered Dec 08 '25 15:12