This is perfectly legal in Python:
In [1]: 'abc' + 'def'
Out[1]: 'abcdef'
If I have an all-text pandas DataFrame, like the example below:
In [2]: df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
   ...:                   columns=['C1', 'C2', 'C3', 'C4'])
   ...: df.loc[[0, 2], ['C2', 'C3']] = np.nan
   ...: df
Out[2]:
  C1   C2   C3 C4
0  a  NaN  NaN  d
1  e    f    g  h
2  i  NaN  NaN  l
Is it possible to do the same with columns of the above DataFrame? Something like:
In [3]: df.apply(+, axis=1)  # or
   ...: df.sum(axis=1)
Note that neither of the statements above works. Using .str.cat() in a loop is easy, but I am looking for something better.
Expected output is:
Out[3]:
      C
0    ad
1  efgh
2    il
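For reference, the .str.cat() loop the question alludes to might look like this (a sketch, using na_rep='' so missing values contribute nothing):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# Start from the first column and fold in the rest one at a time.
result = df.iloc[:, 0]
for col in df.columns[1:]:
    result = result.str.cat(df[col], na_rep='')
```

This works, but it is one Python-level pass per column, which is what the vectorized answers below avoid.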
You could do
df.fillna('').sum(axis=1)
Of course, this assumes that your dataframe is made up only of strings and NaNs.
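A quick check on the example frame (assuming, as noted, only strings and NaN): after fillna(''), row-wise sum is plain string concatenation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# NaN -> '' so that adding across a row concatenates the strings.
out = df.fillna('').sum(axis=1)
```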
Option 1
stack
I wanted to add this for demonstration. We don't have to accept the rectangular nature of the dataframe: we can use stack, which drops NaN by default, leaving us with a vector of strings indexed by a pd.MultiIndex. We can then group by the first level of that pd.MultiIndex (the original row labels) and sum:
df.stack().groupby(level=0).sum()
0 ad
1 efgh
2 il
dtype: object
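To see the intermediate step on the example frame: stack produces a Series of the eight non-NaN cells, keyed by (row label, column label), and the groupby collapses each row label back into one concatenated string.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

stacked = df.stack()                    # NaN cells dropped; MultiIndex of (row, column)
result = stacked.groupby(level=0).sum() # sum of strings per row label = concatenation
```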
Option 2
Use Masked Arrays np.ma.masked_array
I was motivated by @jezrael to post a faster solution (-:
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)
0 ad
1 efgh
2 il
dtype: object
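The mechanics, spelled out on the example frame: the boolean mask from df.isnull() hides the NaN cells, filled('') replaces the hidden cells with empty strings, and summing an object array along axis 1 concatenates row-wise (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

masked = np.ma.masked_array(df.values, df.isnull().values)
filled = masked.filled('')              # masked (NaN) cells become ''
out = pd.Series(filled.sum(1), df.index)
```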
Timing
df = pd.concat([df]*1000).reset_index(drop=True)
%%timeit
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
        fill_value=''
    ).filled('').sum(1),
    df.index
)
1000 loops, best of 3: 860 µs per loop
%timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))
1000 loops, best of 3: 1.33 ms per loop