This is somewhat the reverse of what most people want to do when converting between lists and dataframes.
I am looking to convert a large dataframe (10M+ rows, 20+ columns) to a list of strings, where each entry is a string representation of one row of the dataframe. I can do this using pandas' to_csv() method, but I'm wondering if there is a faster way, as this is proving to be a bottleneck in my code.
Minimum working example:
import numpy as np
import pandas as pd
# Create the initial dataframe.
size = 10000000
cols = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)
# Convert to the required list structure
ret_val = df.to_csv(index=False, header=False).split("\n")[:-1]
The conversion step of the above code takes around 90 seconds for a dataframe of 10,000,000 rows on a single thread of my Core i9, and is almost entirely CPU-bound. I'd love to reduce that by an order of magnitude if at all possible.
EDIT: I'm not looking to save the data to a .csv or to a file. I'm just looking to convert the dataframe to an array of strings.
EDIT: Input/Output example with only 5 columns:
In [1]: df.head(10)
Out[1]:
     a    b    c    d    e
0  a_0  b_0  c_0  d_0  e_0
1  a_1  b_1  c_1  d_1  e_1
2  a_2  b_2  c_2  d_2  e_2
3  a_3  b_3  c_3  d_3  e_3
4  a_4  b_4  c_4  d_4  e_4
5  a_5  b_5  c_5  d_5  e_5
6  a_6  b_6  c_6  d_6  e_6
7  a_7  b_7  c_7  d_7  e_7
8  a_8  b_8  c_8  d_8  e_8
9  a_9  b_9  c_9  d_9  e_9
In [2]: ret_val[:10]
Out[2]: ['a_0,b_0,c_0,d_0,e_0',
'a_1,b_1,c_1,d_1,e_1',
'a_2,b_2,c_2,d_2,e_2',
'a_3,b_3,c_3,d_3,e_3',
'a_4,b_4,c_4,d_4,e_4',
'a_5,b_5,c_5,d_5,e_5',
'a_6,b_6,c_6,d_6,e_6',
'a_7,b_7,c_7,d_7,e_7',
'a_8,b_8,c_8,d_8,e_8',
'a_9,b_9,c_9,d_9,e_9']
I get ~2.5 times speedup with multiprocessing...
import multiprocessing

# df from the OP's code above is available in global scope

def fn(i):
    return df[i:i+1000].to_csv(index=False, header=False).split('\n')[:-1]

with multiprocessing.Pool() as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), 1000)):
        result.extend(a)
This reduces the overall time for 1M rows from 6.8 seconds to 2.8 seconds on my laptop, so it should hopefully scale to the extra cores in an i9 CPU.
It depends on Unix fork semantics to share the dataframe with the child processes, and obviously does a bit more work overall, but it might help...
Using the numpy.savetxt suggestion from Massifox together with multiprocessing takes this down to 2.0 seconds; just map the following function instead:
from io import StringIO

def fn2(i):
    # N is the chunk size, e.g. the 1000 used above
    with StringIO() as fd:
        np.savetxt(fd, df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]
The result is otherwise basically the same.
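For completeness, a rough sketch of how fn2 might be driven; it assumes df is still in global scope and that N is the same chunk size used inside fn2 (the 1000 here is just a placeholder):

N = 1000  # chunk size; tune for your machine

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        result = []
        for a in pool.map(fn2, range(0, len(df), N)):
            result.extend(a)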
Your comment that "the dataframe is a variable within a class" can be handled in a variety of ways. A simple one is to pass the dataframe to the Pool initializer, at which point it won't be pickled (under Unix, anyway), and stash a reference to it in a global variable. That reference can then be used by each worker process, e.g.:
def stash_df(df):
    global the_df
    the_df = df

def fn(i):
    with StringIO() as fd:
        np.savetxt(fd, the_df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

with multiprocessing.Pool(initializer=stash_df, initargs=(df,)) as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), N)):
        result.extend(a)
This will be fine as long as each Pool is only ever used with a single dataframe.
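If you want to keep it inside a class, a minimal sketch could look like the following; the RowFormatter name is purely illustrative, and it reuses stash_df, fn and N from above:

class RowFormatter:
    def __init__(self, df):
        self.df = df

    def to_rows(self):
        # hand the frame to the workers once via the initializer,
        # then map over chunk offsets exactly as above
        with multiprocessing.Pool(initializer=stash_df, initargs=(self.df,)) as pool:
            result = []
            for a in pool.map(fn, range(0, len(self.df), N)):
                result.extend(a)
        return result

rows = RowFormatter(df).to_rows()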
You can try different ways to speed up writing the data to disk:
Writing a compressed file could speed up writing by as much as 10x:
df.to_csv('output.csv.gz'
, header=True
, index=False
, chunksize=100000
, compression='gzip'
, encoding='utf-8')
Choose the chunksize that works best for you.
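If you're unsure, one crude way to pick it is to time a few candidate values; the sizes below are only placeholders:

import time

for cs in (10_000, 100_000, 1_000_000):
    start = time.perf_counter()
    df.to_csv('output.csv.gz', header=True, index=False,
              chunksize=cs, compression='gzip', encoding='utf-8')
    print(f"chunksize={cs}: {time.perf_counter() - start:.1f}s")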
Switch to HDF format:
df.to_hdf('output.h5', key='df', mode='w')
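Reading it back is then a one-liner (this needs the tables package installed; the key is the one chosen above):

df2 = pd.read_hdf('output.h5', key='df')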
Following krassowski's answer, use numpy. For example, with the following df:
df=pd.DataFrame({'A':range(1000000)})
df['B'] = df.A + 1.0
df['C'] = df.A + 2.0
df['D'] = df.A + 3.0
Pandas to csv:
df.to_csv('pandas_to_csv', index=False)
On my computer, this takes 6.45 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numpy to csv:
np.savetxt(
    'numpy_savetxt', df.values, fmt='%d,%.1f,%.1f,%.1f',
    header=','.join(df.columns), comments='')
On my computer, takes 3.38 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
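Since the question actually asks for a list of strings rather than a file, the same savetxt call can be pointed at an in-memory buffer instead of disk; a sketch reusing the format string from above (with a multi-specifier fmt, the rows already come out comma-separated):

from io import StringIO

buf = StringIO()
np.savetxt(buf, df.values, fmt='%d,%.1f,%.1f,%.1f')
ret_val = buf.getvalue().split('\n')[:-1]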
Use Pandaral·lel.
It is a simple and efficient tool for parallelizing your pandas computations across all your CPUs (Linux & macOS only), and it can significantly speed things up with only one line of code. Cool!
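A minimal sketch of what that could look like here (the row-joining lambda is my own assumption, not part of pandarallel):

from pandarallel import pandarallel

pandarallel.initialize()

# one comma-joined string per row, computed across all cores
ret_val = df.parallel_apply(lambda row: ','.join(row), axis=1).tolist()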
You could also think about replacing the pandas dataframe with a Dask dataframe. The CSV API is very similar to pandas'.
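A rough sketch of the same conversion with Dask (the partition count and the processes scheduler are arbitrary choices you would want to tune):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)
rows = ddf.map_partitions(
    lambda part: pd.Series(part.to_csv(index=False, header=False).split('\n')[:-1]),
    meta=(None, 'object'),
)
ret_val = rows.compute(scheduler='processes').tolist()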